I have a Python DataFrame (df) which includes time series data (days of the year) for various countries. Timeseries is a column of the df, and is repeated by country. The file looks something like this:
Country
Date
Total Vaccinations
People Vaccinated Per Hundred
Israel
1/1/21
100
.001
Israel
1/2/21
104
.002
...
...
...
.004
Israel
2/1/21
150
.010
USA
1/1/21
0
.000
USA
1/2/21
50
.001
...
...
...
USA
2/1/21
500
.05
etc,etc. Assume this is the case for 200+ countries.
I wish to create a new df which will contain data for a single feature (say "Total Vaccinations") by countries, using Date as the index.
So the engineered df becomes (selecting "Total Vaccinations"):
Date
Israel
USA
U.K.
1/1/21
100
0
0
1/2/21
104
50
25
...
...
...
...
2/1/21
150
500
250
You can try pivot_table to achieve this -
>>> import pandas as pd
>>>
>>> d = ['1/1/21','1/2/21','1/3/21','1/4/21'] * 3
>>> v = [100,104,103,205] * 3
>>> c = ['USA','Isreal','UK'] * 4
>>> df = pd.DataFrame({'Country':c,'Total_Vaccinations':v,'Date':d})
>>> df
Country Total_Vaccinations Date
0 USA 100 1/1/21
1 Isreal 104 1/2/21
2 UK 103 1/3/21
3 USA 205 1/4/21
4 Isreal 100 1/1/21
5 UK 104 1/2/21
6 USA 103 1/3/21
7 Isreal 205 1/4/21
8 UK 100 1/1/21
9 USA 104 1/2/21
10 Isreal 103 1/3/21
11 UK 205 1/4/21
>>> pivot_df = pd.pivot_table(df,index=['Date'],values=['Total_Vaccinations'],columns=['Country'])
>>> pivot_df.columns = pivot_df.columns.droplevel()
>>> pivot_df = pivot_df.reset_index()
>>> pivot_df
Country Date Isreal UK USA
0 1/1/21 100 100 100
1 1/2/21 104 104 104
2 1/3/21 103 103 103
3 1/4/21 205 205 205
>>>
You can also drop the Country column further
Use:
#create DatetimeIndex
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
#create subset by list of countries
need = ['Israel','USA']
df = df[df['Country'].isin(need)]
Related
I am working on olympics dataset and want to create another dataframe that has total number of athletes and total number of medals won by type for each country.
Using following pivot_table gives me an error "ValueError: Grouper for 'ID' not 1-dimensional"
pd.pivot_table(olymp, index='NOC', columns=['ID','Medal'], values=['ID','Medal'], aggfunc={'ID':pd.Series.nunique,'Medal':'count'}).sort_values(by='Medal')
Result should have one row for each country with columns for totalAthletes, gold, silver, bronze. Not sure how to go about it using pivot_table. I can do this using merge of crosstab but would like to use just one pivottable statement.
Here is what original df looks like.
Update
I would like to get the medal breakdown as well e.g. gold, silver, bronze. Also I need unique count of athlete id's so I use nunique since one athlete may participate in multiple events. Same with medal, ignoring NA values
IIUC:
out = df.pivot_table('ID', 'NOC', 'Medal', aggfunc='count', fill_value=0)
out['ID'] = df[df['Medal'].notna()].groupby('NOC')['ID'].nunique()
Output:
>>> out
Medal Bronze Gold Silver ID
NOC
AFG 2 0 0 1
AHO 0 0 1 1
ALG 8 5 4 14
ANZ 5 20 4 25
ARG 91 91 92 231
.. ... ... ... ...
VIE 0 1 3 3
WIF 5 0 0 4
YUG 93 130 167 317
ZAM 1 0 1 2
ZIM 1 17 4 16
[149 rows x 4 columns]
Old answer
You can't have the same column for columns and values:
out = olymp.pivot_table(index='NOC', values=['ID','Medal'],
aggfunc={'ID':pd.Series.nunique, 'Medal':'count'}) \
.sort_values('Medal', ascending=False)
print(out)
# Output
ID Medal
NOC
USA 9653 5637
URS 2948 2503
GER 4872 2165
GBR 6281 2068
FRA 6170 1777
.. ... ...
GAM 33 0
GBS 15 0
GEQ 26 0
PNG 61 0
LBA 68 0
[230 rows x 2 columns]
Another way to get the result above:
out = olym.groupby('NOC').agg({'ID': pd.Series.nunique, 'Medal': 'count'}) \
.sort_values('Medal', ascending=False)
print(out)
# Output
ID Medal
NOC
USA 9653 5637
URS 2948 2503
GER 4872 2165
GBR 6281 2068
FRA 6170 1777
.. ... ...
GAM 33 0
GBS 15 0
GEQ 26 0
PNG 61 0
LBA 68 0
[230 rows x 2 columns]
I tried to translate the problem with my real data to example data presented in my question. Maybe I just have a simple technical problem. Or maybe my whole way and workflow is not the best?
The objectiv
There are persons (column name) who have eaten different fruit's at different day's. And there is some more data (column foo and bar) I do not want to lose.
I want to separate/split the original data, without loosing the additational data (in foo and bar).
The condition to separate is the number of unique fruits eaten at the specific days.
That is the initial data
>>> df
name day fruit foo bar
0 Tim 1 Apple 708 20
1 Tim 1 Apple 135 743
2 Tim 2 Apple 228 562
3 Anna 1 Banana 495 924
4 Anna 1 Strawberry 236 542
5 Bob 1 Strawberry 420 894
6 Bob 2 Apple 27 192
7 Bob 2 Kiwi 671 145
The separated interim result should look like this two DataFrame's:
>>> two
name day fruit foo bar
0 Anna 1 Banana 495 924
1 Anna 1 Strawberry 236 542
2 Bob 2 Apple 27 192
3 Bob 2 Kiwi 671 145
>>> non_two
name day fruit foo bar
0 Tim 1 Apple 708 20
1 Tim 1 Apple 135 743
2 Tim 2 Apple 228 562
3 Bob 1 Strawberry 420 894
Example explanation in words: Tim ate just Apple's at day 1 and 2. It does not matter how many apples. It just matters that it is one unique fruit.
What I have done so far
I did some groupby() magic to find out who and when have eaten two or less/more then two unique fruits.
import pandas as pd
import random as rd
data = {'name': ['Tim', 'Tim', 'Tim', 'Anna', 'Anna', 'Bob', 'Bob', 'Bob'],
'day': [1, 1, 2, 1, 1, 1, 2, 2],
'fruit': ['Apple', 'Apple', 'Apple', 'Banana', 'Strawberry',
'Strawberry', 'Apple', 'Kiwi'],
'foo': rd.sample(range(1000), 8),
'bar': rd.sample(range(1000), 8)
}
# That is the primary DataFrame
df = pd.DataFrame(data)
# Explore the data
a = df[['name', 'day', 'fruit']].groupby(['name', 'day', 'fruit']).count().reset_index()
b = a.groupby(['name', 'day']).count()
# People who ate 2 fruits on specific days
two = b[(b.fruit == 2)].reset_index()
print(two)
# People who ate less or more then 2 fruits on specific days
non_two = b[(b.fruit != 2)].reset_index()
print(non_two)
Here is my roadblocker
With the dataframes two and non_two I have the informations I want. Know I want to separate the initial dataframe based on that informations. I think name and day are the columns I should use to select and separate in the initial dataframe.
# filter mask
mymask = (df.name == two.name) & (df.day == two.day)
df_two = df[mymask]
df_non_two = df[~mymask]
But this does not work. The first line raise ValueError: Can only compare identically-labeled Series objects.
Use DataFrameGroupBy.nunique in GroupBy.transform, so possible filter original DataFrame:
mymask = df.groupby(['name', 'day'])['fruit'].transform('nunique').eq(2)
df_two = df[mymask]
df_non_two = df[~mymask]
print (df_two)
name day fruit foo bar
3 Anna 1 Banana 335 62
4 Anna 1 Strawberry 286 694
6 Bob 2 Apple 822 738
7 Bob 2 Kiwi 793 449
I have a dataframe as shown below
Player Goal Freekick
Messi 2 5
Ronaldo 1 4
Messi 1 4
Messi 0 5
Ronaldo 0 9
Ronaldo 1 8
Xavi 1 1
Xavi 0 7
From the above I would like do groupby sum of Goal and Freekick as shown below.
Expected Output:
Player toatal_goals total_freekicks
Messi 3 14
Ronaldo 2 21
Xavi 1 8
I tried below code:
df1 = df.groupby(['Player'])['Goal'].sum().reset_index().rename({'Goal':'toatal_goals'})
df1['total_freekicks'] = df.groupby(['Player'])['Freekick'].sum()
But above one does not work, please help me..
First aggregate sum by Player, then DataFrame.add_prefix and convert columns names to lowercase:
df = df.groupby('Player').sum().add_prefix('total_').rename(columns=str.lower)
print (df)
total_goal total_freekick
Player
Messi 3 14
Ronaldo 2 21
Xavi 1 8
You can use namedagg to create the aggregations with customized column names.
(
df.groupby(by='Player')
.agg(toatal_goals=('Goal', 'sum'),
total_freekicks=('Freekick', 'sum'))
.reset_index()
)
Player toatal_goals total_freekicks
Messi 3 14
Ronaldo 2 21
Xavi 1 8
I have a dataframe like this. i have regular fields till "state" then i will have trailers (3 columns tr1* represents 1 tailer) i want to convert those trailers to rows. I tried melt function but i am able to use only 1 trailer column. kindly look at below example you can understand
Name number city state tr1num tr1acct tr1ct tr2num tr2acct tr2ct tr3num tr3acct tr3ct
DJ 10 Edison nj 1001 20345 Dew 1002 20346 Newca. 1003. 20347. pen
ND 20 Newark DE 2001 1985 flor 2002 1986 rodge
I am expecting the output like this.
Name number city state trnum tracct trct
DJ 10 Edison nj 1001 20345 Dew
DJ 10 Edison nj 1002 20346 Newca
DJ 10 Edison nj 1003 20347 pen
ND 20 Newark DE 2001 1985 flor
ND 20 Newark DE 2002 1986 rodge
You need to look at using pd.wide_to_long. However, you will need to do some column renaming first.
df = df.set_index(['Name','number','city','state'])
df.columns = df.columns.str.replace('(\D+)(\d+)(\D+)',r'\1\3_\2')
df = df.reset_index()
pd.wide_to_long(df, ['trnum','trct','tracct'],
['Name','number','city','state'], 'Code',sep='_',suffix='\d+')\
.reset_index()\
.drop('Code',axis=1)
Output:
Name number city state trnum trct tracct
0 DJ 10 Edison nj 1001.0 Dew 20345.0
1 DJ 10 Edison nj 1002.0 Newca. 20346.0
2 DJ 10 Edison nj 1003.0 pen 20347.0
3 ND 20 Newark DE 2001.0 flor 1985.0
4 ND 20 Newark DE 2002.0 rodge 1986.0
5 ND 20 Newark DE NaN NaN NaN
you could achieve this by renaming your columns and bit and applying the pandas wide_to_long method. Below is the code which produces your desired output.
df = pd.DataFrame({"Name":["DJ", "ND"], "number":[10,20], "city":["Edison", "Newark"], "state":["nj","DE"],
"trnum_1":[1001,2001], "tracct_1":[20345,1985], "trct_1":["Dew", "flor"], "trnum_2":[1002,2002],
"trct_2":["Newca", "rodge"], "trnum_3":[1003,None], "tracct_3":[20347,None], "trct_3":["pen", None]})
pd.wide_to_long(df, stubnames=['trnum', 'tracct', 'trct'], i='Name', j='dropme', sep='_').reset_index().drop('dropme', axis=1)\
.sort_values('trnum')
outputs
Name state city number trnum tracct trct
0 DJ nj Edison 10 1001.0 20345.0 Dew
1 DJ nj Edison 10 1002.0 NaN Newca
2 DJ nj Edison 10 1003.0 20347.0 pen
3 ND DE Newark 20 2001.0 1985.0 flor
4 ND DE Newark 20 2002.0 NaN rodge
5 ND DE Newark 20 NaN NaN None
Another option:
df = pd.DataFrame({'col1': [1,2,3], 'col2':[3,4,5], 'col3':[5,6,7], 'tr1':[0,9,8], 'tr2':[0,9,8]})
The df:
col1 col2 col3 tr1 tr2
0 1 3 5 0 0
1 2 4 6 9 9
2 3 5 7 8 8
subsetting to create 2 df's:
tr1_df = df[['col1', 'col2', 'col3', 'tr1']].rename(index=str, columns={"tr1":"tr"})
tr2_df = df[['col1', 'col2', 'col3', 'tr2']].rename(index=str, columns={"tr2":"tr"})
res = pd.concat([tr1_df, tr2_df])
result:
col1 col2 col3 tr
0 1 3 5 0
1 2 4 6 9
2 3 5 7 8
0 1 3 5 0
1 2 4 6 9
2 3 5 7 8
One option is the pivot_longer function from pyjanitor, using the .value placeholder:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd
(df
.pivot_longer(
index=slice("Name", "state"),
names_to=(".value", ".value"),
names_pattern=r"(.+)\d(.+)",
sort_by_appearance=True)
.dropna()
)
Name number city state trnum tracct trct
0 DJ 10 Edison nj 1001.0 20345.0 Dew
1 DJ 10 Edison nj 1002.0 20346.0 Newca.
2 DJ 10 Edison nj 1003.0 20347.0 pen
3 ND 20 Newark DE 2001.0 1985.0 flor
4 ND 20 Newark DE 2002.0 1986.0 rodge
The .value keeps the part of the column associated with it as header, and since we have multiple .value, they are combined into a single word. The .value is determined by the groups in the names_pattern, which is a regular expression.
Note that currently the multiple .value option is available in dev.
I want to do aggregations on a panda dataframe by word.
Basically there are 3 columns with the click/impression count with the corresponding phrase. I would like to split the phrase into tokens and then sum up their clicks to tokens to decide which token is relatively good/bad.
Expected input: Panda dataframe as below
click_count impression_count text
1 10 100 pizza
2 20 200 pizza italian
3 1 1 italian cheese
Expected output:
click_count impression_count token
1 30 300 pizza // 30 = 20 + 10, 300 = 200+100
2 21 201 italian // 21 = 20 + 1
3 1 1 cheese // cheese only appeared once in italian cheese
tokens = df.text.str.split(expand=True)
token_cols = ['token_{}'.format(i) for i in range(tokens.shape[1])]
tokens.columns = token_cols
df1 = pd.concat([df.drop('text', axis=1), tokens], axis=1)
df1
df2 = pd.lreshape(df1, {'tokens': token_cols})
df2
df2.groupby('tokens').sum()
This creates a new DataFrame like piRSquared's but tokens are stacked and merged with the original:
(df['text'].str.split(expand=True).stack().reset_index(level=1, drop=True)
.to_frame('token').merge(df, left_index=True, right_index=True)
.groupby('token')['click_count', 'impression_count'].sum())
Out:
click_count impression_count
token
cheese 1 1
italian 21 201
pizza 30 300
If you break this down, it merges this:
df['text'].str.split(expand=True).stack().reset_index(level=1, drop=True).to_frame('token')
Out:
token
1 pizza
2 pizza
2 italian
3 italian
3 cheese
with the original DataFrame on their indices. The resulting df is:
(df['text'].str.split(expand=True).stack().reset_index(level=1, drop=True)
.to_frame('token').merge(df, left_index=True, right_index=True))
Out:
token click_count impression_count text
1 pizza 10 100 pizza
2 pizza 20 200 pizza italian
2 italian 20 200 pizza italian
3 italian 1 1 italian cheese
3 cheese 1 1 italian cheese
The rest is grouping by the token column.
You could do
In [3091]: s = df.text.str.split(expand=True).stack().reset_index(drop=True, level=-1)
In [3092]: df.loc[s.index].assign(token=s).groupby('token',sort=False,as_index=False).sum()
Out[3092]:
token click_count impression_count
0 pizza 30 300
1 italian 21 201
2 cheese 1 1
Details
In [3093]: df
Out[3093]:
click_count impression_count text
1 10 100 pizza
2 20 200 pizza italian
3 1 1 italian cheese
In [3094]: s
Out[3094]:
1 pizza
2 pizza
2 italian
3 italian
3 cheese
dtype: object