Pandas dataframe apply function

I have a dataframe which looks like this.
df.head()
Ship Date Cost Amount
0 2010-08-01 4257.23300
1 2010-08-01 9846.94540
2 2010-08-01 35.77764
3 2010-08-01 420.82920
4 2010-08-01 129.49638
I had to group the data by week, for which I did:
df['week_num'] = pd.DatetimeIndex(df['Ship Date']).week
x = df.groupby('week_num').sum()
it produces a dataframe which looks like this:
Cost Amount
week_num
30 3.273473e+06
31 9.715421e+07
32 9.914568e+07
33 9.843721e+07
34 1.065546e+08
35 1.087598e+08
36 8.050456e+07
Now I wanted to add a column with week and year information. To do this I did:
def my_conc(row):
    return str(row['week_num']) + str('2011')
and
x['year_week'] = x.apply(my_conc, axis=1)
This gives me an error message:
KeyError: ('week_num', u'occurred at index 30')
Now my questions are:
1) Why did groupby produce a dataframe that looks a little odd, without week_num as a column name?
2) Is there a better way of producing the dataframe with grouped data?
3) How do I use an apply function on the grouped dataframe x above?

Here's one way to do it.
Use as_index=False in groupby so week_num stays a regular column instead of becoming the index.
In [50]: df_grp = df.groupby('week_num', as_index=False).sum()
Then apply lambda function.
In [51]: df_grp['year_week'] = df_grp.apply(lambda x: str(x['week_num']) + '2011',
axis=1)
In [52]: df_grp
Out[52]:
week_num Cost Amount year_week
0 30 3273473 302011
1 31 97154210 312011
2 32 99145680 322011
3 33 98437210 332011
4 34 106554600 342011
5 35 108759800 352011
6 36 80504560 362011
Or use df_grp.apply(lambda x: '%d2011' % x['week_num'], axis=1)
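As an aside, the row-wise apply isn't strictly needed for this; vectorized string concatenation on the column does the same thing and is faster (a minimal sketch using df_grp from above):
df_grp['year_week'] = df_grp['week_num'].astype(str) + '2011'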

On your first question, I have no idea. When I try to replicate it, I just get an error.
On the other questions, use the .dt accessor with groupby() ...
# get your data into a DataFrame
data = """Ship Date Cost Amount
0 2010-08-01 4257.23300
1 2010-08-01 9846.94540
2 2010-08-01 35.77764
3 2010-08-01 420.82920
4 2010-08-01 129.49638
"""
from StringIO import StringIO # import from io for Python 3
# the column names contain spaces, so supply them explicitly instead of parsing the header
df = pd.read_csv(StringIO(data), sep=r'\s+', skiprows=1, header=None,
                 names=['Ship Date', 'Cost Amount'], index_col=0)
# make the dtype for the column datetime64[ns]
df['Ship Date'] = pd.to_datetime(df['Ship Date'])
# then you can use the .dt accessor to group on
x = df.groupby(df['Ship Date'].dt.dayofyear).sum()
y = df.groupby(df['Ship Date'].dt.weekofyear).sum()
There are a host more of these .dt accessors in the pandas documentation.
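As an aside, if the end goal is a combined year-week label, .dt.strftime can build it in one pass; a sketch using the df above (note %W week numbers are Monday-based and may differ from weekofyear):
df['year_week'] = df['Ship Date'].dt.strftime('%Y%W')
z = df.groupby('year_week')['Cost Amount'].sum()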

Related

Count how many non-zero entries at each month in a dataframe column

I have a dataframe, df, with a DatetimeIndex and a single column, like this:
I need to count how many non-zero entries I have in each month. For example, according to those images, in January I would have 2 entries, in February 1 entry and in March 2 entries. I have more months in the dataframe, but I guess that explains the problem.
I tried using pandas groupby:
df.groupby(df.index.month).count()
But that just gives me the total days in each month, and I didn't see any other parameter in count() that I could use here.
Any ideas?
Try index.to_period()
For example:
In [1]: import pandas as pd
import numpy as np
x_df = pd.DataFrame(
    {
        'values': np.random.randint(low=0, high=2, size=(120,))
    },
    index=pd.date_range("2022-01-01", periods=120, freq="D")
)
In [2]: x_df
Out[2]:
values
2022-01-01 0
2022-01-02 0
2022-01-03 1
2022-01-04 0
2022-01-05 0
...
2022-04-26 1
2022-04-27 0
2022-04-28 0
2022-04-29 1
2022-04-30 1
[120 rows x 1 columns]
In [3]: x_df[x_df['values'] != 0].groupby(lambda x: x.to_period("M")).count()
Out[3]:
values
2022-01 17
2022-02 15
2022-03 16
2022-04 17
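A vectorized alternative over the same x_df: compare against zero and sum the booleans per month, which gives the same counts:
x_df['values'].ne(0).groupby(x_df.index.to_period('M')).sum()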
You can also try this:
# replace zeros with NaN, drop those rows, then count the remaining entries per month
import numpy as np
dfx['col1'] = dfx['col1'].replace(0, np.nan)
dfx = dfx.dropna()
dfx = dfx.resample('1M').count()

Convert and replace a string value in a pandas df with its float type

I have a value in pandas df which is accidentally put as a string as follows:
df.iloc[5329]['values']
'72,5'
I want to convert this value to float and replace it in the df. I have tried the following ways:
df.iloc[5329]['values'] = float(72.5)
also,
df.iloc[5329]['values'] = 72.5
and,
df.iloc[5329]['values'] = df.iloc[5329]['values'].replace(',', '.')
It runs successfully with a warning, but when I check the df, it's still stored as '72,5'.
The entire df at that index is as follows:
df.iloc[5329]
value 36.25
values 72,5
values1 72.5
currency MYR
Receipt Kuching, Malaysia
Delivery Male, Maldives
How can I solve that?
iloc needs a specific (row, col) position.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'A': np.random.choice(100, 3),
        'B': [15.2, '72,5', 3.7]
    })
print(df)
df.info()
Output:
A B
0 84 15.2
1 92 72,5
2 56 3.7
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 3 non-null int64
1 B 3 non-null object
Update the value:
df.iloc[1,1] = 72.5
print(df)
Output:
A B
0 84 15.2
1 92 72.5
2 56 3.7
Make sure you don't use chained indexing (i.e. [][]) when assigning, since df.iloc[5329] makes a copy of the data and the further assignment happens on the copy, not the original df. Do the assignment with a single indexer instead; note that iloc takes integer positions only, so look up the column's position (or use loc with labels):
df.iloc[5329, df.columns.get_loc('values')] = 72.5
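If such comma-decimal strings occur throughout the column, it may be simpler to clean the whole column in one go; a minimal sketch, assuming the only problem is ',' used as the decimal separator:
# cast everything to str, swap the decimal comma, then parse back to float
df['values'] = pd.to_numeric(df['values'].astype(str).str.replace(',', '.'))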

Pandas Groupby -- efficient selection/filtering of groups based on multiple conditions?

I am trying to filter dataframe groups in Pandas, based on multiple (any) conditions, but I cannot seem to get to a fast Pandas-native one-liner.
Here I generate an example dataframe of 2*n*n rows and 4 columns:
import itertools
import random

import pandas as pd

n = 100
lst = range(0, n)
df = pd.DataFrame(
    {'A': list(itertools.chain.from_iterable(itertools.repeat(x, n*2) for x in lst)),
     'B': list(itertools.chain.from_iterable(itertools.repeat(x, 1*2) for x in lst)) * n,
     'C': random.choices(list(range(100)), k=2*n*n),
     'D': random.choices(list(range(100)), k=2*n*n)
    })
resulting in dataframes such as:
A B C D
0 0 0 26 49
1 0 0 29 80
2 0 1 70 92
3 0 1 7 2
4 1 0 90 11
5 1 0 19 4
6 1 1 29 4
7 1 1 31 95
I want to
group by A and B, and
filter those groups down to the ones where some value in column C is greater than 50 and some value in column D is greater than 50.
A "native" Pandas one-liner would be the following:
df.groupby([df.A, df.B]).filter(lambda x: ((x.C > 50).any() & (x.D > 50).any()))
which produces
A B C D
2 0 1 70 92
3 0 1 7 2
This is all fine for small dataframes (say n < 20).
But this solution takes quite long (for example, 4.58 s when n = 100) for large dataframes.
I have an alternative, step-by-step solution which achieves the same result, but runs much faster (28.1 ms when n = 100):
df_g = df.assign(key_C=df.C > 50, key_D=df.D > 50).groupby([df.A, df.B])
df_C_bool = df_g.key_C.transform('any')
df_D_bool = df_g.key_D.transform('any')
df[df_C_bool & df_D_bool]
but it is arguably a bit more ugly. My questions are:
Is there a better "native" Pandas solution for this task? And
is there a reason for the sub-optimal performance of my version of the "native" solution?
Bonus question:
In fact I only want to extract the groups and not together with their data. I.e., I only need
A B
0 1
in the above example. Is there a way to do this with Pandas without going through the intermediate step I did above?
This is similar to your second approach, but chained together. (On your performance question: filter evaluates the Python lambda once per group, while transform runs on a vectorized path, which is why your step-by-step version is so much faster.)
mask = (df[['C','D']].gt(50)           # pass a list, e.g. .gt([50, 60]), for different per-column thresholds
        .all(axis=1)                   # check for both True on the rows
        .groupby([df['A'], df['B']])   # normal groupby
        .transform('max')              # 'any' instead of 'max' also works
       )
df.loc[mask]
Note this checks both conditions on the same row; if C and D may exceed 50 on different rows of a group, keep two separate masks as in your second approach.
If you don't want the data, you can forgo the transform:
mask = df[['C','D']].min(axis=1).gt(50).groupby([df['A'],df['B']]).any()
mask[mask].index
# out
# MultiIndex([(0, 1)],
# names=['A', 'B'])
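And if a plain DataFrame of the group keys is more convenient than a MultiIndex, to_frame converts it (usage sketch):
mask[mask].index.to_frame(index=False)
#    A  B
# 0  0  1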

How do I use grouped data to plot rainfall averages in specific hourly ranges

I extracted the following data from a dataframe.
https://i.imgur.com/rCLfV83.jpg
The question is, how do I plot a graph, probably a histogram type, where the horizontal axis shows the hours as bins [16:00 17:00 18:00 ... 24:00] and the bars are the average rainfall during each of those hours.
I just don't know enough pandas yet to get this off the ground so I need some help. Sample data below as requested.
Date Hours Precip
1996-07-30 21 1
1996-08-17 16 1
18 1
1996-08-30 16 1
17 1
19 5
22 1
1996-09-30 19 5
20 5
1996-10-06 20 1
21 1
1996-10-19 18 4
1996-10-30 19 1
1996-11-05 20 3
1996-11-16 16 1
19 1
1996-11-17 16 1
1996-11-29 16 1
1996-12-04 16 9
17 27
19 1
1996-12-12 19 1
1996-12-30 19 10
22 1
1997-01-18 20 1
It seems df is a multi-index DataFrame after a groupby.
Transform the index to a DatetimeIndex:
date_hour_idx = df.reset_index()[['Date', 'Hours']] \
.apply(lambda x: '{} {}:00'.format(x['Date'], x['Hours']), axis=1)
precip_series = df.reset_index()['Precip']
precip_series.index = pd.to_datetime(date_hour_idx)
Resample to hours using 'H'
# This will show NaN for hours without an entry
resampled_nan = precip_series.resample('H').asfreq()
# This will fill NaN with 0s
resampled_fillna = precip_series.resample('H').asfreq().fillna(0)
If you want this to be the mean per hour, change your groupby(...).sum() to groupby(...).mean()
You can resample to other intervals too -> pandas resample documentation
More about resampling the DatetimeIndex -> https://pandas.pydata.org/pandas-docs/stable/reference/resampling.html
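From there, the bar chart of average rainfall per hour of day is a groupby on the hour of the resampled index; a minimal sketch (whether to use the NaN or zero-filled series depends on whether dry hours should count toward the average):
import matplotlib.pyplot as plt

hourly_mean = resampled_fillna.groupby(resampled_fillna.index.hour).mean()
hourly_mean.plot.bar()
plt.xlabel('Hour of day')
plt.ylabel('Average precipitation')
plt.show()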
It seems to be easy when you have data. I generated artificial data with Pandas for this example:
import pandas as pd
import radar
import random

'''>>> date'''
r2 = ()
for a in range(1, 51):
    t = (str(radar.random_datetime(start='1985-05-01', stop='1985-05-04')),)
    r2 = r2 + t
r3 = list(r2)
r3.sort()
#print(r3)

'''>>> variable'''
x = [random.randint(0, 16) for x in range(50)]
df = pd.DataFrame({'date': r3, 'measurement': x})
print(df)

'''order'''
# split the timestamp string into date and time parts
col1 = df.join(df['date'].str.partition(' ')[[0, 2]]).rename({0: 'daty', 2: 'godziny'}, axis=1)  # 'daty'/'godziny': dates/hours
col2 = df['measurement'].rename('pomiary')  # 'pomiary': measurements
p3 = pd.concat([col1, col2], axis=1, sort=False)
p3 = p3.drop(['measurement'], axis=1)
p3 = p3.drop(['date'], axis=1)
Time to take the mean and plot:
dx = p3.groupby(['daty']).mean()
print(dx)
import matplotlib.pyplot as plt
dx.plot.bar()
plt.show()
Plot of the mean measurements

sorting within the keys of group by

I have a groupby table as follows. I want to sort by index within the keys ['CPUCore', 'Offline_RetetionAge'] (I need to keep the structure of ['CPUCore', 'Offline_RetetionAge']). How should I do this?
I think the problem is that the dtype of your second level is object, which means strings, so sort_index sorts it alphanumerically:
import pandas as pd

df = pd.DataFrame({'CPUCore': [2, 2, 2, 3, 3],
                   'Offline_RetetionAge': ['100', '1', '12', '120', '15'],
                   'index': [11, 16, 5, 4, 3]}).set_index(['CPUCore', 'Offline_RetetionAge'])
print (df)
index
CPUCore Offline_RetetionAge
2 100 11
1 16
12 5
3 120 4
15 3
print (df.index.get_level_values('Offline_RetetionAge').dtype)
object
print (df.sort_index())
index
CPUCore Offline_RetetionAge
2 1 16
100 11
12 5
3 120 4
15 3
# change the multiindex - cast level Offline_RetetionAge to int
new_index = list(zip(df.index.get_level_values('CPUCore'),
                     df.index.get_level_values('Offline_RetetionAge').astype(int)))
df.index = pd.MultiIndex.from_tuples(new_index, names=df.index.names)
print (df.sort_index())
index
CPUCore Offline_RetetionAge
2 1 16
12 5
100 11
3 15 3
120 4
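An equivalent, more concise alternative to rebuilding the tuples is to cast just that level with MultiIndex.set_levels; a sketch, applied to the original string-indexed df:
df.index = df.index.set_levels(df.index.levels[1].astype(int), level=1)
print (df.sort_index())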
EDIT by comment:
print (df.reset_index()
.sort_values(['CPUCore','index'])
.set_index(['CPUCore','Offline_RetetionAge']))
index
CPUCore Offline_RetetionAge
2 12 5
100 11
1 16
3 15 3
120 4
I think what you mean is this:
import pandas as pd
from pandas import Series, DataFrame
# create what I believe you tried to ask
df = DataFrame(
    [[11, 'reproducible'], [16, 'example'], [5, 'a'], [4, 'create'], [9, '!']])
df.columns = ['index', 'bla']
df.index = pd.MultiIndex.from_arrays([[2]*4 + [3], [10, 100, 1000, 11, 512]],
                                     names=['CPUCore', 'Offline_RetentionAge'])
# sort by values and afterwards by index where sort_remaining=False preserves
# the order of index
df = df.sort_values('index').sort_index(level=0, sort_remaining=False)
print(df)
sort_values sorts the rows by the 'index' column, and sort_index then restores the grouping by the multiindex without changing the relative order of rows with the same CPUCore.
I don't know what a "group by table" is supposed to be. If you have a pd.GroupBy object, you won't be able to use sort_values() like that.
You might have to rethink what you group by or use functools.partial and DataFrame.apply
Output:
index bla
CPUCore Offline_RetentionAge
2 11 4 create
1000 5 a
10 11 reproducible
100 16 example
3 512 9 !