Slicing in group by function - pandas

How do I group a DataFrame by the first part of a column after splitting its values on a colon?
In this example I need to split the last column (the time) and group by hour.
from io import StringIO  # on Python 2: from StringIO import StringIO
import pandas as pd

myst = """india, 905034 , 19:44
USA, 905094 , 19:33
Russia, 905154 , 21:56
"""
u_cols = ['country', 'index', 'current_tm']
df = pd.read_csv(StringIO(myst), sep=',', names=u_cols)
This query does not return the expected results:
df[df['index'] > 900000].groupby([df.current_tm]).size()
current_tm
21:56 1
19:33 1
19:44 1
dtype: int64
It should be:
21 1
19 2
The time is in hh:mm format, but pandas treats it as a string.
Is there any utility that will convert an SQL query to its pandas equivalent (something like querymongo.com, which helps MongoDB users)?

You can add the hour to your dataframe as follows and then use it for grouping:
df['hour'] = df.current_tm.str.strip().apply(
    lambda x: x.split(':')[0] if isinstance(x, str) else None)
>>> df[df['index'] > 900000].groupby('hour').size()
hour
19 2
21 1
dtype: int64

Create a new column:
df['hour'] = [current_time.strip().split(':')[0] for current_time in df['current_tm']]
Then apply your method:
df[df['index'] > 900000].groupby([df['hour']]).size()
hour
19 2
21 1
dtype: int64
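A related sketch, not from either answer above: parse the strings with pd.to_datetime so pandas stops treating the time as plain text, then group on the integer hour (hour_num is just an illustrative column name):
# hour_num is an illustrative name; format='%H:%M' matches the hh:mm strings
df['hour_num'] = pd.to_datetime(df['current_tm'].str.strip(), format='%H:%M').dt.hour
df[df['index'] > 900000].groupby('hour_num').size()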

Related

Count how many non-zero entries at each month in a dataframe column

I have a dataframe, df, with a DatetimeIndex and a single column.
I need to count how many non-zero entries I have in each month. For example, in January I would have 2 entries, in February 1 entry and in March 2 entries. I have more months in the dataframe, but I guess that explains the problem.
I tried using pandas groupby:
df.groupby(df.index.month).count()
But that just gives me the total number of days in each month, and I don't see any other parameter in count() that I could use here.
Any ideas?
Try index.to_period()
For example:
In [1]: import pandas as pd
        import numpy as np
        x_df = pd.DataFrame(
            {'values': np.random.randint(low=0, high=2, size=(120,))},
            index=pd.date_range("2022-01-01", periods=120, freq="D")
        )
In [2]: x_df
Out[2]:
values
2022-01-01 0
2022-01-02 0
2022-01-03 1
2022-01-04 0
2022-01-05 0
...
2022-04-26 1
2022-04-27 0
2022-04-28 0
2022-04-29 1
2022-04-30 1
[120 rows x 1 columns]
In [3]: x_df[x_df['values'] != 0].groupby(lambda x: x.to_period("M")).count()
Out[3]:
values
2022-01 17
2022-02 15
2022-03 16
2022-04 17
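A terser variant of the same idea (a sketch, not part of the answer): compare against zero and sum the booleans, so months that contain only zeros still show up with a count of 0.
# count non-zero entries per month by summing a boolean mask
(x_df['values'] != 0).groupby(x_df.index.to_period('M')).sum()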
You can also try this:
# drop the zero rows by converting them to NaN first
import numpy as np
dfx['col1'] = dfx['col1'].replace(0, np.nan)
dfx = dfx.dropna()
dfx = dfx.resample('1M').count()
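The answer's dfx and col1 are generic names; applied to the x_df example above, a sketch would be:
import numpy as np
dfx = x_df.rename(columns={'values': 'col1'})
dfx['col1'] = dfx['col1'].replace(0, np.nan)   # turn zeros into NaN
dfx = dfx.dropna()                             # drop them
monthly_counts = dfx.resample('1M').count()    # one row per month with the non-zero count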

window function for moving average

I am trying to replicate SQL's window function in pandas.
SELECT avg(totalprice) OVER (
PARTITION BY custkey
ORDER BY orderdate
RANGE BETWEEN interval '1' month PRECEDING AND CURRENT ROW)
FROM orders
I have this dataframe:
from io import StringIO
import pandas as pd
myst="""cust_1,2020-10-10,100
cust_2,2020-10-10,15
cust_1,2020-10-15,200
cust_1,2020-10-16,240
cust_2,2020-12-20,25
cust_1,2020-12-25,140
cust_2,2021-01-01,5
"""
u_cols=['customer_id', 'date', 'price']
myf = StringIO(myst)
import pandas as pd
df = pd.read_csv(StringIO(myst), sep=',', names = u_cols)
df=df.sort_values(list(df.columns))
And after calculating the moving average restricted to the last month, it should look like this:
from io import StringIO
import pandas as pd
myst="""cust_1,2020-10-10,100,100
cust_2,2020-10-10,15,15
cust_1,2020-10-15,200,150
cust_1,2020-10-16,240,180
cust_2,2020-12-20,25,25
cust_1,2020-12-25,140,140
cust_2,2021-01-01,5,15
"""
u_cols=['customer_id', 'date', 'price', 'my_average']
myf = StringIO(myst)
import pandas as pd
my_df = pd.read_csv(StringIO(myst), sep=',', names = u_cols)
my_df=my_df.sort_values(list(my_df.columns))
As shown in this image:
https://trino.io/assets/blog/window-features/running-average-range.svg
I tried to write a function like this...
import numpy as np

def mylogic(myro):
    mylist = list()
    mydate = myro['date'][0]
    for i in range(len(myro)):
        if myro['date'][i] > mydate:
            mylist.append(myro['price'][i])
            mydate = myro['date'][i]
    return np.mean(mylist)
But that returned a KeyError.
You can use the rolling function with a 30-day window:
df['date'] = pd.to_datetime(df['date'])
df['my_average'] = (df.groupby('customer_id')
.apply(lambda d: d.rolling('30D', on='date')['price'].mean())
.reset_index(level=0, drop=True)
.astype(int)
)
output:
customer_id date price my_average
0 cust_1 2020-10-10 100 100
2 cust_1 2020-10-15 200 150
3 cust_1 2020-10-16 240 180
5 cust_1 2020-12-25 140 140
1 cust_2 2020-10-10 15 15
4 cust_2 2020-12-20 25 25
6 cust_2 2021-01-01 5 15
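Note that the SQL window uses RANGE ... interval '1' month, which the pandas version approximates with a fixed 30-day window, since time-based rolling windows need a fixed length. An equivalent formulation (a sketch, not part of the answer) indexes by date first and then rolls per customer:
# assumes df['date'] has already been converted with pd.to_datetime, as above
out = (df.set_index('date')
         .groupby('customer_id')['price']
         .rolling('30D')
         .mean())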

Pandas: drop out of sequence row

My Pandas df:
import pandas as pd
import io
data = """date value
"2015-09-01" 71.925000
"2015-09-06" 71.625000
"2015-09-11" 71.333333
"2015-09-12" 64.571429
"2015-09-21" 72.285714
"""
df = pd.read_table(io.StringIO(data), delim_whitespace=True)
df.date = pd.to_datetime(df.date)
Given a user input date (01-09-2015),
I would like to keep only those dates where the difference between the date and the input date is a multiple of 5 days.
Expected output:
input = 01-09-2015
df:
date value
0 2015-09-01 71.925000
1 2015-09-06 71.625000
2 2015-09-11 71.333333
3 2015-09-21 72.285714
My approach so far:
I take the delta between input_date and date in pandas and save this delta in a separate column.
If delta % 5 == 0 I keep the row, otherwise I drop it. Is this the best that can be done?
Use boolean indexing to filter by a mask: convert the input value to a datetime, then convert the timedeltas to days with Series.dt.days:
input1 = '01-09-2015'
df = df[df.date.sub(pd.to_datetime(input1)).dt.days % 5 == 0]
print (df)
date value
0 2015-09-01 71.925000
1 2015-09-06 71.625000
2 2015-09-11 71.333333
4 2015-09-21 72.285714
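One caveat, as a hedged sketch rather than part of the answer: '01-09-2015' is ambiguous, and pd.to_datetime parses it month-first by default. If the input is meant day-first (1 September 2015), parse it explicitly:
input1 = '01-09-2015'
target = pd.to_datetime(input1, dayfirst=True)   # 2015-09-01 rather than 2015-01-09
df[df.date.sub(target).dt.days % 5 == 0]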

How to change datetime to numeric discarding 0s at end [duplicate]

I have a dataframe in pandas called munged_data with two columns, entry_date and dob, which I have converted to Timestamps. I am trying to figure out how to calculate ages based on the time difference between entry_date and dob, and to do this I need the difference in days between the two columns (so that I can then do something like round(days/365.25)). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date - munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However, I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the pandas Timedelta type, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
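Applied back to the columns named in the question (a sketch; munged_data, entry_date and dob are the asker's names, and 'age' is just an illustrative column):
# vectorized: timedelta Series -> integer days -> approximate years
days = (munged_data['entry_date'] - munged_data['dob']).dt.days
munged_data['age'] = (days / 365.25).round()   # 'age' is an illustrative name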
You need 0.11 for this (0.11rc1 is out, final prob next week)
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns]; coming in 0.12).
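In current pandas the apply is no longer needed; a sketch using the same df as above:
# timedelta Series expose .dt.days directly
df['years'] = (df['today'] - df['age']).dt.days / 365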
Not sure if you still need it, but since pandas 0.14 I usually use the .astype('timedelta64[X]') method
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
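A note for newer pandas versions (a sketch, not part of the answer): .ix has been removed and astype('timedelta64[Y]') is no longer supported, but dividing by a Timedelta gives the same kind of result:
diff = df.loc[0] - df.loc[1]                 # replaces df.ix[0] - df.ix[1]
years = diff / pd.Timedelta(days=365.25)     # approximate years as a float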
Suppose you have a pandas Series named time_difference whose dtype is timedelta64[ns].
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.tslib.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
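pd.tslib was removed in later pandas releases; a sketch of the same idea with the public API:
just_day = time_difference.apply(lambda x: pd.Timedelta(x).days)
# or, since a Series of timedelta64[ns] has the .dt accessor:
just_day = time_difference.dt.days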
To convert any type of data into days just use pd.Timedelta().days:
pd.Timedelta(1985, unit='Y').days
84494

Pandas dataframe apply function

I have a dataframe which looks like this.
df.head()
Ship Date Cost Amount
0 2010-08-01 4257.23300
1 2010-08-01 9846.94540
2 2010-08-01 35.77764
3 2010-08-01 420.82920
4 2010-08-01 129.49638
I had to group the data week-wise, for which I did:
df['week_num'] = pd.DatetimeIndex(df['Ship Date']).week
x = df.groupby('week_num').sum()
it produces a dataframe which looks like this:
Cost Amount
week_num
30 3.273473e+06
31 9.715421e+07
32 9.914568e+07
33 9.843721e+07
34 1.065546e+08
35 1.087598e+08
36 8.050456e+07
Now I wanted to add a column with week and year information. To do this I did:
def my_conc(row):
    return str(row['week_num']) + str('2011')
and
x['year_week'] = x.apply(my_conc, axis=1)
This gives me an error message:
KeyError: ('week_num', u'occurred at index 30')
Now my questions are:
1) Why does the groupby produce a dataframe that looks a little odd, without week_num as a column name?
2) Is there a better way of producing the dataframe with grouped data?
3) How do I use an apply function on the grouped dataframe above?
Here's one way to do it.
Use as_index=False in groupby so that week_num stays a column instead of becoming the index.
In [50]: df_grp = df.groupby('week_num', as_index=False).sum()
Then apply lambda function.
In [51]: df_grp['year_week'] = df_grp.apply(lambda x: str(x['week_num']) + '2011',
axis=1)
In [52]: df_grp
Out[52]:
week_num Cost year_week
0 30 3273473 302011
1 31 97154210 312011
2 32 99145680 322011
3 33 98437210 332011
4 34 106554600 342011
5 35 108759800 352011
6 36 80504560 362011
Or use df_grp.apply(lambda x: '%d2011' % x['week_num'], axis=1)
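For question 1: groupby('week_num') moves week_num into the result's index, so it is no longer a column when apply looks it up, which is what triggers the KeyError. A sketch of another fix is to bring it back with reset_index():
x = df.groupby('week_num').sum().reset_index()          # week_num becomes a column again
x['year_week'] = x['week_num'].astype(str) + '2011'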
On your first question, I have no idea; when I try to replicate it, I just get an error.
On the other questions, use the .dt accessor with groupby():
# get your data into a DataFrame
data = """Ship Date Cost Amount
0 2010-08-01 4257.23300
1 2010-08-01 9846.94540
2 2010-08-01 35.77764
3 2010-08-01 420.82920
4 2010-08-01 129.49638
"""
from StringIO import StringIO # import from io for Python 3
# explicit names because the headers contain spaces ('Ship Date', 'Cost Amount')
df = pd.read_csv(StringIO(data), sep=r'\s+', skiprows=1, index_col=0,
                 names=['idx', 'Ship Date', 'Cost Amount'])
# make the dtype for the column datetime64[ns]
df['Ship Date'] = pd.to_datetime(df['Ship Date'])
# then you can use the .dt accessor to group on
x = df.groupby(df['Ship Date'].dt.dayofyear).sum()
y = df.groupby(df['Ship Date'].dt.weekofyear).sum()
There are a host more of these .dt accessors in the pandas documentation.
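Note that in recent pandas versions Series.dt.week and Series.dt.weekofyear have been removed; a sketch of the modern equivalent uses isocalendar() (z is just an illustrative name):
# ISO week number as an integer Series aligned with df
w = df['Ship Date'].dt.isocalendar().week
z = df.groupby(w).sum()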