Capturing the Timestamp values from resampled DataFrame - pandas

Using the .resample() method yields a DataFrame with a DatetimeIndex and a frequency.
Does anyone have an idea on how to iterate through the values of that DatetimeIndex ?
df = pd.DataFrame(
data=np.random.randint(0, 10, 100),
index=pd.date_range('20220101', periods=100),
columns=['a'],
)
df.resample('M').mean()
If you iterate, you get individual entries taking the Timestamp(‘2022-11-XX…’, freq=‘M’) form but I did not manage to get the date only.
g.resample('M').mean().index[0]
Timestamp('2022-01-31 00:00:00', freq='M')
I am aiming at feeding all the dates in a list for instance.
Thanks for your help !

You an convert each entry in the index into a Datetime object using .date and to a list using .tolist() as below
>>> df.resample('M').mean().index.date.tolist()
[datetime.date(2022, 1, 31), datetime.date(2022, 2, 28), datetime.date(2022, 3, 31), datetime.date(2022, 4, 30)]
You can also truncate the timestamp as follows (reference solution)
>>> df.resample('M').mean().index.values.astype('<M8[D]')
array(['2022-01-31', '2022-02-28', '2022-03-31', '2022-04-30'],
dtype='datetime64[D]')

This solution seems to work fine both for dates and periods:
I = [k.strftime('%Y-%m') for k in g.resample('M').groups]

Related

Fastest way to locate rows of a dataframe from two lists and concatenate them?

Apologies if this has already been asked, I haven't found anything specific enough although this does seem like a general question. Anyways, I have two lists of values which correspond to values in a dataframe, and I need to pull those rows which contain those values and make them into another dataframe. The code I have works, but it seems quite slow (14 seconds per 250 items). Is there a smart way to speed it up?
row_list = []
for i, x in enumerate(datetime_list):
row_list.append(df.loc[(df["datetimes"] == x) & (df.loc["b"] == b_list[i])])
data = pd.concat(row_list)
Edit: Sorry for the vagueness #anky, here's an example dataframe
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'datetimes' : [datetime(2020, 6, 14, 2), datetime(2020, 6, 14, 3), datetime(2020, 6, 14, 4)],
'b' : [0, 1, 2],
'c' : [500, 600, 700]})
IIUC, try this
dfi = df.set_index(['datetime', 'b'])
data = dfi.loc[list(enumerate(datetime_list)), :].reset_index()
Without test data in question it is hard to verify if this correct.

Efficient way to pad a dimension with values from another array

Say I want to pad a numpy array, to make room for an extra column of values in one of the dimensions:
>>> cells.shape
(2, 3, 12, 4)
>>> padded = np.pad(cells, ((0,0),(0,0),(0,0),(0,1)))
>>> padded.shape
(2, 3, 12, 5)
If I have the values for the new column in another 1D array, what is the most efficient way to insert them into cells?
The answer I found with help from #user3483203 in the comments...
If we start with:
>>> cells.shape
(2, 3, 12, 4)
And we pad that array in the last dimension to add another column:
>>> padded = np.pad(cells, ((0,0),(0,0),(0,0),(0,1)))
>>> padded.shape
(2, 3, 12, 5)
Nicest way I found to insert values in the new column is:
>>> padded[..., -1] = new_values

How to convert Multi-Index into a Heatmap

New to Pandas/Python, I have managed to make an index like below;
MultiIndex([( 1, 1, 4324),
( 1, 2, 8000),
( 1, 3, 8545),
( 1, 4, 8544),
( 1, 5, 7542),
(12, 30, 7854),
(12, 31, 7511)],
names=['month', 'day', 'count'], length=366)
I'm struggling to find out how I can store the first number into a list (the 1-12 one) the second number into another list (1-31 values) and the third number into another seperate list (scores 0-9000)
I am trying to build a heatmap that is Month x Day on the axis' and using count as the values and failing horribly! I am assuming I have to seperate Month, Day and Count into seperate lists to make the heat map?
data1 = pd.read_csv("a2data/Data1.csv")
data2 = pd.read_csv("a2data/Data2.csv")
merged_df = pd.concat([data1, data2])
merged_df.set_index(['month', 'day'], inplace=True)
merged_df.sort_index(inplace=True)
merged_df2=merged_df.groupby(['month', 'day']).count.mean().reset_index()
merged_df2.set_index(['month', 'day', 'count'], inplace=True)
#struggling here to seperate out month, day and count in order to make a heatmap
Are you looking for:
# let start here
merged_df2=merged_df.groupby(['month', 'day']).count.mean()
# use sns
import seaborn as sns
sns.heatmap(merged_df2.unstack('day'))
Output:
Or you can use plt:
merged_df2=merged_df.groupby(['month', 'day']).count.mean().unstack('day')
plt.imshow(merged_df2)
plt.xticks(np.arange(merged_df2.shape[1]), merged_df2.columns)
plt.yticks(np.arange(merged_df2.shape[0]), merged_df2.index)
plt.show()
which gives:

How to use reindex to fill in missing timesteps?

I'm trying to test the example given in the docs that fills in missing timesteps
date_index = pd.date_range('1/1/2010', periods=6, freq='D')
df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]}, index=date_index)
date_index2 = pd.date_range('12/29/2009', periods=10, freq='D')
#show how many rows are in the fragmented dataframe
print(df2.shape)
df2.reindex(date_index2)
#show how many rows after reindexing
print(df2.shape)
But running this code shows that no rows were added. What am i missing here?
reindex does not work inplace by default. You can do
print(df2.shape)
# assign back
df2 = df2.reindex(date_index2)
print(df2.shape)
Output:
(6, 1)
(10, 1)

Compare date difference with pandas timestamp values

I'm trying to perform specific operations based on the age of data in days within a dataframe. What I am looking for is something like as follows:
import pandas as pd
if 10days < (pd.Timestamp.now() - pd.Timestamp(2019, 3, 20)):
print 'The data is older than 10 days'
Is there something I can replace "10days" with or some other way I can perform operations based on the difference between two Timestamp values?
What you're looking for is pd.Timedelta('10D'), pd.Timedelta(10, unit='D') (or unit='days' or unit='day'), or pd.Timedelta(days=10). For example,
In [37]: pd.Timedelta(days=10) < pd.Timestamp.now() - pd.Timestamp(2019, 3, 20)
Out[37]: False
In [38]: pd.Timedelta(days=5) < pd.Timestamp.now() - pd.Timestamp(2019, 3, 20)
Out[38]: True