I'm trying to test the example given in the docs that fills in missing timesteps
import numpy as np
import pandas as pd

date_index = pd.date_range('1/1/2010', periods=6, freq='D')
df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]}, index=date_index)
date_index2 = pd.date_range('12/29/2009', periods=10, freq='D')
#show how many rows are in the fragmented dataframe
print(df2.shape)
df2.reindex(date_index2)
#show how many rows after reindexing
print(df2.shape)
But running this code shows that no rows were added. What am I missing here?
reindex does not modify the DataFrame in place; it returns a new object, so you need to assign the result back:
print(df2.shape)
# assign back
df2 = df2.reindex(date_index2)
print(df2.shape)
Output:
(6, 1)
(10, 1)
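Side note, since the docs example is about filling in the missing timesteps: reindex can also fill the rows it adds in the same call via its method parameter. A minimal self-contained sketch (method='bfill' only fills labels introduced by the reindex; the original NaN on 2010-01-03 is left alone):
import numpy as np
import pandas as pd

date_index = pd.date_range('1/1/2010', periods=6, freq='D')
df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]}, index=date_index)
date_index2 = pd.date_range('12/29/2009', periods=10, freq='D')

# back-fill the new leading dates from the first existing row;
# the trailing 2010-01-07 has nothing after it and stays NaN
print(df2.reindex(date_index2, method='bfill'))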
Using the .resample() method yields a DataFrame with a DatetimeIndex and a frequency.
Does anyone have an idea how to iterate through the values of that DatetimeIndex?
import numpy as np
import pandas as pd

df = pd.DataFrame(
data=np.random.randint(0, 10, 100),
index=pd.date_range('20220101', periods=100),
columns=['a'],
)
df.resample('M').mean()
If you iterate, you get individual entries of the form Timestamp('2022-11-XX…', freq='M'), but I did not manage to get the date only.
df.resample('M').mean().index[0]
Timestamp('2022-01-31 00:00:00', freq='M')
I am aiming at feeding all the dates in a list for instance.
Thanks for your help !
You can convert each entry in the index into a datetime.date object using .date, and to a list using .tolist(), as below:
>>> df.resample('M').mean().index.date.tolist()
[datetime.date(2022, 1, 31), datetime.date(2022, 2, 28), datetime.date(2022, 3, 31), datetime.date(2022, 4, 30)]
You can also truncate the timestamp as follows (reference solution)
>>> df.resample('M').mean().index.values.astype('<M8[D]')
array(['2022-01-31', '2022-02-28', '2022-03-31', '2022-04-30'],
dtype='datetime64[D]')
This solution seems to work fine both for dates and periods:
I = [k.strftime('%Y-%m') for k in df.resample('M').groups]
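One more option, offered as a suggestion rather than part of the answers above: a DatetimeIndex has its own strftime, so you can format the resampled index directly without going through groups:
# format the resampled index directly; strftime on a DatetimeIndex returns string labels
month_labels = df.resample('M').mean().index.strftime('%Y-%m').tolist()
# ['2022-01', '2022-02', '2022-03', '2022-04']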
Apologies if this has already been asked; I haven't found anything specific enough, although it does seem like a general question. Anyway, I have two lists of values which correspond to values in a DataFrame, and I need to pull the rows that contain those values into another DataFrame. The code I have works, but it seems quite slow (14 seconds per 250 items). Is there a smart way to speed it up?
row_list = []
for i, x in enumerate(datetime_list):
    row_list.append(df.loc[(df["datetimes"] == x) & (df["b"] == b_list[i])])
data = pd.concat(row_list)
Edit: Sorry for the vagueness @anky, here's an example DataFrame:
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'datetimes' : [datetime(2020, 6, 14, 2), datetime(2020, 6, 14, 3), datetime(2020, 6, 14, 4)],
'b' : [0, 1, 2],
'c' : [500, 600, 700]})
IIUC, try this
dfi = df.set_index(['datetimes', 'b'])
data = dfi.loc[list(zip(datetime_list, b_list)), :].reset_index()
Without test data in the question it is hard to verify whether this is correct.
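Another option that is usually fast for this kind of paired lookup (my suggestion, not part of the answer above) is an inner merge against a small lookup frame built from the two lists:
# build a lookup frame from the paired lists and keep only the matching rows
lookup = pd.DataFrame({'datetimes': datetime_list, 'b': b_list})
data = df.merge(lookup, on=['datetimes', 'b'], how='inner')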
I have a DataFrame with two pandas Series as follow:
value accepted_values
0 1 [1, 2, 3, 4]
1 2 [5, 6, 7, 8]
I would like to efficiently check if the value is in accepted_values using pandas methods.
I already know I can do something like the following, but I'm interested in a faster approach if there is one (it took around 27 seconds on a 1-million-row DataFrame):
import pandas as pd
df = pd.DataFrame({"value":[1, 2], "accepted_values": [[1,2,3,4], [5, 6, 7, 8]]})
def check_first_in_second(values: pd.Series):
    return values.iloc[0] in values.iloc[1]
are_in_accepted_values = df[["value", "accepted_values"]].apply(
check_first_in_second, axis=1
)
if not are_in_accepted_values.all():
    raise AssertionError("Not all values in accepted_values")
I think if you create a DataFrame from the list column, you can compare with DataFrame.eq and then test whether at least one value per row matches with DataFrame.any:
df1 = pd.DataFrame(df["accepted_values"].tolist(), index=df.index)
are_in_accepted_values = df1.eq(df["value"], axis=0).any(axis=1).all()
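To make the reshaping step concrete, this is what the intermediate wide frame and the row-wise comparison look like for the sample data above (illustration only):
print(df1)
#    0  1  2  3
# 0  1  2  3  4
# 1  5  6  7  8
print(df1.eq(df["value"], axis=0).any(axis=1))
# 0     True
# 1    False
# dtype: bool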
Another idea:
are_in_accepted_values = all(v in a for v, a in df[["value", "accepted_values"]].to_numpy())
I found a little optimisation to your second idea. Using a bit more numpy than pandas makes it faster (more than 3x, tested with time.perf_counter()).
values = df["value"].values
accepted_values = df["accepted_values"].values
are_in_accepted_values = all(s in e for s, e in np.column_stack([values, accepted_values]))
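For what it's worth, a plain zip over the two arrays avoids building the stacked object array altogether and should give the same result (a small sketch, not benchmarked here):
# iterate over the two columns in lockstep instead of stacking them first
are_in_accepted_values = all(v in a for v, a in zip(values, accepted_values))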
The pandas.Series groupby method makes it possible to group by another series, for example:
import pandas as pd

data = {'gender': ['Male', 'Male', 'Female', 'Male'], 'age': [20, 21, 20, 20]}
df = pd.DataFrame(data)
grade = pd.Series([5, 6, 7, 4])
grade.groupby(df['age']).mean()
However, this approach does not work for a groupby using two columns:
grade.groupby(df[['age','gender']])
ValueError: Grouper for class pandas.core.frame.DataFrame not 1-dimensional.
In the example, it is easy to add the column to the dataframe and get the desired result as follows:
df['grade'] = grade
y = df.groupby(['gender','age']).mean()
y.to_dict()
{'grade': {('Female', 20): 7.0, ('Male', 20): 4.5, ('Male', 21): 6.0}}
But that can get quite ugly in real life situations. Is there any way to do this groupby on multiple columns directly on the series?
Since I don't know of any direct way to solve the problem, I've made a function that creates a temporary table and performs the groupby on it.
def pd_groupby(series, group_obj):
    df = pd.DataFrame(group_obj).copy()
    groupby_columns = list(df.columns)
    df[series.name] = series
    return df.groupby(groupby_columns)[series.name]
Here, group_obj can be a pandas Series or a pandas DataFrame. Starting from the sample code, the desired result can be achieved with:
y = pd_groupby(grade,df[['gender','age']]).mean()
y.to_dict()
{('Female', 20): 7.0, ('Male', 20): 4.5, ('Male', 21): 6.0}
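As a side note (not part of the original answer, but it should work): Series.groupby also accepts a list of Series, which gives the multi-column grouping directly without a temporary table:
# group the grade Series by two columns of df directly
y = grade.groupby([df['gender'], df['age']]).mean()
y.to_dict()
{('Female', 20): 7.0, ('Male', 20): 4.5, ('Male', 21): 6.0}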
I am analyzing a DataFrame and getting timing counts which I want to put into specific buckets (0-10 seconds, 10-30 seconds, etc).
Here is a simplified example:
import pandas as pd
filter_values = [0, 10, 20, 30] # Bucket Values for pd.cut
#Sample Times
df1 = pd.DataFrame([1, 3, 8, 20], columns = ['filtercol'])
#Use cut to get counts for each bucket
out = pd.cut(df1.filtercol, bins = filter_values)
counts = pd.value_counts(out)
print counts
The above prints:
(0, 10] 3
(10, 20] 1
dtype: int64
You will notice it does not show a row for (20, 30]. This is a problem because I want that bucket to appear in my output as zero. I can handle it with the following code:
bucket1 = bucket2 = bucket3 = 0
if '(0, 10]' in counts:
    bucket1 = counts['(0, 10]']
if '(10, 20]' in counts:
    bucket2 = counts['(10, 20]']
if '(20, 30]' in counts:
    bucket3 = counts['(20, 30]']
print bucket1, bucket2, bucket3
But I want a simpler cleaner approach where I can use:
print counts['(0, 10]'], counts['(10, 20]'], counts['(20, 30]']
Ideally the print would be driven by the values in filter_values, so they only appear in one place in the code. Yes, I know I can change the print to use filter_values[0]...
Lastly, when using cut, is there a way to specify infinity so that the last bucket covers all values greater than, say, 60?
Cheers,
Stephen
You can reindex by the categorical's levels:
In [11]: pd.value_counts(out).reindex(out.levels, fill_value=0)
Out[11]:
(0, 10] 3
(10, 20] 1
(20, 30] 0
dtype: int64
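As for the last part of the question: pd.cut accepts an infinite right edge, so an open-ended last bucket can be written with np.inf. A small sketch against current pandas (where the answer's out.levels is spelled out.cat.categories, and value_counts on a categorical already reports empty buckets as 0):
import numpy as np

filter_values = [0, 10, 20, 30, np.inf]  # everything above 30 lands in the last bucket
out = pd.cut(df1.filtercol, bins=filter_values)
counts = out.value_counts().sort_index()  # empty buckets appear with a count of 0
print(counts)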