Count how many non-zero entries at each month in a dataframe column - pandas

I have a dataframe, df, with datetimeindex and a single column, like this:
I need to count how many non-zero entries i have at each month. For example, according to those images, in January i would have 2 entries, in February 1 entry and in March 2 entries. I have more months in the dataframe, but i guess that explains the problem.
I tried using pandas groupby:
df.groupby(df.index.month).count()
But that just gives me total days at each month and i don't saw any other parameter in count() that i could use here.
Any ideas?

Try index.to_period()
For example:
In [1]: import pandas as pd
import numpy as np
x_df = pd.DataFrame(
{
'values': np.random.randint(low=0, high=2, size=(120,))
} ,
index = pd.date_range("2022-01-01", periods=120, freq="D")
)
In [2]: x_df
Out[2]:
values
2022-01-01 0
2022-01-02 0
2022-01-03 1
2022-01-04 0
2022-01-05 0
...
2022-04-26 1
2022-04-27 0
2022-04-28 0
2022-04-29 1
2022-04-30 1
[120 rows x 1 columns]
In [3]: x_df[x_df['values'] != 0].groupby(lambda x: x.to_period("M")).count()
Out[3]:
values
2022-01 17
2022-02 15
2022-03 16
2022-04 17

can you try this:
#drop nans
import numpy as np
dfx['col1']=dfx['col1'].replace(0,np.nan)
dfx=dfx.dropna()
dfx=dfx.resample('1M').count()

Related

Selecting specific rows by date in a multi-dimensional df Python

image of df
I would like to select a specific date e.g 2020-07-07 and get the Adj Cls and ExMA for each of the symbols. I'm new in Python and I tried using df.loc['xy'], (xy being a specific date on the datetime) and keep getting a KeyError. Any insight is greatly appreciated.
Info on the df MultiIndex: 30 entries, (SNAP, 2020-07-06 00:00:00) to (YUM, 2020-07-10 00:00:00)
Data columns (total 2 columns):
dtypes: float64(2)
You can use pandas.DataFrame.xs for this.
import pandas as pd
import numpy as np
df = pd.DataFrame(
np.arange(8).reshape(4, 2), index=[[0, 0, 1, 1], [2, 3, 2, 3]], columns=list("ab")
)
print(df)
# a b
# 0 2 0 1
# 3 2 3
# 1 2 4 5
# 3 6 7
print(df.xs(3, level=1).filter(["a"]))
# a
# 0 2
# 1 6

Pandas groupby calculate difference

import pandas as pd
data = [['2017-09-30','A',123],['2017-12-31','A',23],['2017-09-30','B',74892],['2017-12-31','B',52222],['2018-09-30','A',37599],['2018-12-31','A',66226]]
df = pd.DataFrame.from_records(data,columns=["Date", "Company", "Revenue YTD"])
df['Date'] = pd.to_datetime(df['Date'])
df = df.groupby(['Company',df['Date'].dt.year]).diff()
print(df)
Date Revenue YTD
0 NaT NaN
1 92 days -100.0
2 NaT NaN
3 92 days -22670.0
4 NaT NaN
5 92 days 28627.0
I would like to calculate the company's revenue difference by September and December. I have tried with groupby company and year. But the result is not what I am expecting
Expecting result
Date Company Revenue YTD
0 2017 A -100
1 2018 A -22670
2 2017 B 28627
IIUC, this should work
(df.assign(Date=df['Date'].dt.year,
Revenue_Diff=df.groupby(['Company',df['Date'].dt.year])['Revenue YTD'].diff())
.drop('Revenue YTD', axis=1)
.dropna()
)
Output:
Date Company Revenue_Diff
1 2017 A -100.0
3 2017 B -22670.0
5 2018 A 28627.0
Try this:
Set it up:
import pandas as pd
import numpy as np
data = [['2017-09-30','A',123],['2017-12-31','A',23],['2017-09-30','B',74892],['2017-12-31','B',52222],['2018-09-30','A',37599],['2018-12-31','A',66226]]
df = pd.DataFrame.from_records(data,columns=["Date", "Company", "Revenue YTD"])
df['Date'] = pd.to_datetime(df['Date'])
Update with np.diff():
my_func = lambda x: np.diff(x)
df = (df.groupby([df.Date.dt.year, df.Company])
.agg({'Revenue YTD':my_func}))
print(df)
Revenue YTD
Date Company
2017 A -100
B -22670
2018 A 28627
Hope this helps.

How can a pandas dataframe with a TimedeltaIndex by grouped by nearest whole day?

I've got a pandas DataFrame with an index of pd.TimeDeltas some of which are fractions of days. I'd like to use df.groupby to group the rows by whole days (ignoring the fractions of days) so that I can calculate the mean.
Here's an example of what I'd like to do:
import pandas as pd
import numpy as np
data = [[1,2,3], [2,3,4], [3,4,5], [1,2,3], [2,3,4], [3,4,5]]
idx = [pd.Timedelta('1.2 days'), pd.Timedelta('1.2 days'), pd.Timedelta('3.8 days'), pd.Timedelta('3.8 days'), pd.Timedelta('4.2 days'), pd.Timedelta('4.2 days')]
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
df.index = idx
df
Out:
a b c
1 days 04:48:00 1 2 3
1 days 04:48:00 2 3 4
3 days 19:12:00 3 4 5
3 days 19:12:00 1 2 3
4 days 04:48:00 2 3 4
4 days 04:48:00 3 4 5
The line below produces the desired a result however it creates extra rows for each day so there are rows full of NaNs which I subsequently remove with df.dropna(). Is there a better approach to doing this?
df.groupby(pd.Grouper(freq='D')).aggregate(np.mean).dropna()
Your approach is fine, or you can just group by df.index.days as below:
In [196]: df.groupby(df.index.days).mean()
Out[196]:
a b c
1 1.5 2.5 3.5
3 2.0 3.0 4.0
4 2.5 3.5 4.5
The difference in the two methods is where things get grouped on the margins. In yours, something at 2 days, 02:00:00 would get grouped with the 1-day rows since pd.Grouper will start with the first example, whereas in mine, it will get a separate row since it treats midnight as the start of a new group.

How do I use grouped data to plot rainfall averages in specific hourly ranges

I extracted the following data from a dataframe .
https://i.imgur.com/rCLfV83.jpg
The question is, how do I plot a graph, probably a histogram type, where the horizontal axis are the hours as bins [16:00 17:00 18:00 ...24:00] and the bars are the average rainfall during each of those hours.
I just don't know enough pandas yet to get this off the ground so I need some help. Sample data below as requested.
Date Hours `Precip`
1996-07-30 21 1
1996-08-17 16 1
18 1
1996-08-30 16 1
17 1
19 5
22 1
1996-09-30 19 5
20 5
1996-10-06 20 1
21 1
1996-10-19 18 4
1996-10-30 19 1
1996-11-05 20 3
1996-11-16 16 1
19 1
1996-11-17 16 1
1996-11-29 16 1
1996-12-04 16 9
17 27
19 1
1996-12-12 19 1
1996-12-30 19 10
22 1
1997-01-18 20 1
It seems df is a multi-index DataFrame after a groupby.
Transform the index to a DatetimeIndex
date_hour_idx = df.reset_index()[['Date', 'Hours']] \
.apply(lambda x: '{} {}:00'.format(x['Date'], x['Hours']), axis=1)
precip_series = df.reset_index()['Precip']
precip_series.index = pd.to_datetime(date_hour_idx)
Resample to hours using 'H'
# This will show NaN for hours without an entry
resampled_nan = precip_series.resample('H').asfreq()
# This will fill NaN with 0s
resampled_fillna = precip_series.resample('H').asfreq().fillna(0)
If you want this to be the mean per hour, change your groupby(...).sum() to groupby(...).mean()
You can resample to other intervals too -> pandas resample documentation
More about resampling the DatetimeIndex -> https://pandas.pydata.org/pandas-docs/stable/reference/resampling.html
It seems to be easy when you have data.
I generate artificial data by Pandas for this example:
import pandas as pd
import radar
import random
'''>>> date'''
r2 =()
for a in range(1,51):
t= (str(radar.random_datetime(start='1985-05-01', stop='1985-05-04')),)
r2 = r2 + t
r3 =list(r2)
r3.sort()
#print(r3)
'''>>> variable'''
x = [random.randint(0,16) for x in range(50)]
df= pd.DataFrame({'date': r3, 'measurement': x})
print(df)
'''order'''
col1 = df.join(df['date'].str.partition(' ')[[0,2]]).rename({0: 'daty', 2: 'godziny'}, axis=1)
col2 = df['measurement'].rename('pomiary')
p3 = pd.concat([col1, col2], axis=1, sort=False)
p3 = p3.drop(['measurement'], axis=1)
p3 = p3.drop(['date'], axis=1)
Time for sum and plot:
dx = p3.groupby(['daty']).mean()
print(dx)
import matplotlib.pyplot as plt
dx.plot.bar()
plt.show()
Plot of the mean measurements

How to change datetime to numeric discarding 0s at end [duplicate]

I have a dataframe in pandas called 'munged_data' with two columns 'entry_date' and 'dob' which i have converted to Timestamps using pd.to_timestamp.I am trying to figure out how to calculate ages of people based on the time difference between 'entry_date' and 'dob' and to do this i need to get the difference in days between the two columns ( so that i can then do somehting like round(days/365.25). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date-munged_data.dob i get the following :
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However i do not seem to be able to extract the days as an integer so that i can continue with my calculation.
Any help appreciated.
Using the Pandas type Timedelta available since v0.15.0 you also can do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
You need 0.11 for this (0.11rc1 is out, final prob next week)
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns], coming in 0.12)
Not sure if you still need it, but in Pandas 0.14 i usually use .astype('timedelta64[X]') method
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Let's specify that you have a pandas series named time_difference which has type
numpy.timedelta64[ns]
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.tslib.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
To convert any type of data into days just use pd.Timedelta().days:
pd.Timedelta(1985, unit='Y').days
84494