Split pandas dataframe index - pandas

I have a pretty big dataframe with categories as column names (foreign trade statistics), while the index is a string containing the country code AND the year: w2013 meaning World, year 2013; r2015 meaning Russian Federation, year 2015.
Index([u'w2011', u'c2011', u'g2011', u'i2011', u'r2011', u'w2012', u'c2012',
u'g2012', u'i2012', u'r2012', u'w2013', u'c2013', u'g2013', u'i2013',
u'r2013', u'w2014', u'c2014', u'g2014', u'i2014', u'r2014', u'w2015',
u'c2015', u'g2015', u'i2015', u'r2015'],
dtype='object')
What would be the easiest way to make a MultiIndex for plotting the various columns? I need a column plotted for each country and each year.

You can create a MultiIndex with from_tuples - to extract the letters and years, use string indexing via .str.
import pandas as pd
li = [u'w2011', u'c2011', u'g2011', u'i2011', u'r2011', u'w2012', u'c2012',
u'g2012', u'i2012', u'r2012', u'w2013', u'c2013', u'g2013', u'i2013',
u'r2013', u'w2014', u'c2014', u'g2014', u'i2014', u'r2014', u'w2015',
u'c2015', u'g2015', u'i2015', u'r2015']
df = pd.DataFrame(range(25), index=li, columns=['a'])
print(df)
a
w2011 0
c2011 1
g2011 2
i2011 3
r2011 4
w2012 5
c2012 6
g2012 7
i2012 8
r2012 9
w2013 10
c2013 11
g2013 12
i2013 13
r2013 14
w2014 15
c2014 16
g2014 17
i2014 18
r2014 19
w2015 20
c2015 21
g2015 22
i2015 23
r2015 24
print(df.index.str[0])
Index([u'w', u'c', u'g', u'i', u'r', u'w', u'c', u'g', u'i', u'r', u'w', u'c',
u'g', u'i', u'r', u'w', u'c', u'g', u'i', u'r', u'w', u'c', u'g', u'i',
u'r'],
dtype='object')
print(df.index.str[1:])
Index([u'2011', u'2011', u'2011', u'2011', u'2011', u'2012', u'2012', u'2012',
u'2012', u'2012', u'2013', u'2013', u'2013', u'2013', u'2013', u'2014',
u'2014', u'2014', u'2014', u'2014', u'2015', u'2015', u'2015', u'2015',
u'2015'],
dtype='object')
df.index = pd.MultiIndex.from_tuples(list(zip(df.index.str[0], df.index.str[1:])))
print(df)
a
w 2011 0
c 2011 1
g 2011 2
i 2011 3
r 2011 4
w 2012 5
c 2012 6
g 2012 7
i 2012 8
r 2012 9
w 2013 10
c 2013 11
g 2013 12
i 2013 13
r 2013 14
w 2014 15
c 2014 16
g 2014 17
i 2014 18
r 2014 19
w 2015 20
c 2015 21
g 2015 22
i 2015 23
r 2015 24
If you need to convert the years to int, use astype:
df.index = pd.MultiIndex.from_tuples(list(zip(df.index.str[0], df.index.str[1:].astype(int))))
print(df.index)
MultiIndex(levels=[[u'c', u'g', u'i', u'r', u'w'], [2011, 2012, 2013, 2014, 2015]],
labels=[[4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3], [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4]])
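For newer pandas, the same split can also be written with MultiIndex.from_arrays, which skips the tuple step entirely; a minimal sketch (assuming, as above, that the country code is always one character):

```python
import pandas as pd

# Shortened index for illustration; the full list works the same way.
li = ['w2011', 'c2011', 'g2011', 'i2011', 'r2011']
df = pd.DataFrame(range(5), index=li, columns=['a'])

# Split the string index into two aligned arrays: country letter and year.
df.index = pd.MultiIndex.from_arrays(
    [df.index.str[0], df.index.str[1:].astype(int)],
    names=['country', 'year'])
```

With named levels you can then select directly, e.g. df.xs('w', level='country').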

If I understood well, you can:
reset your index:
df.reset_index(inplace=True)
create two other columns, one for the year and one for the country:
df.loc[:, "year"] = df.foo.apply(lambda x: x[1:])
df.loc[:, "country"] = df.foo.apply(lambda x: x[0])
assuming that the column holding your former index is called foo and that the length of the country code is 1. You can adapt otherwise.
Set those two columns as the index:
df.set_index(["year", "country"], inplace=True)
HTH
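Put together, the steps above look like this - a sketch with a toy frame, where 'foo' is the hypothetical name of the former index column from the answer:

```python
import pandas as pd

# Toy frame whose index mimics the country-code-plus-year strings.
df = pd.DataFrame({'a': [0, 1]}, index=['w2011', 'c2011'])

# Step 1: reset the index; the old index becomes a column named 'index',
# renamed here to 'foo' to match the answer's assumption.
df = df.reset_index().rename(columns={'index': 'foo'})

# Step 2: derive year and country columns from the string.
df.loc[:, 'year'] = df.foo.apply(lambda x: x[1:])
df.loc[:, 'country'] = df.foo.apply(lambda x: x[0])

# Step 3: set both as the index.
df = df.set_index(['year', 'country'])
```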

Related

Calculate number of cases per year from pandas data frame

I have a data frame in the following format:
import pandas as pd
d = {'case_id': [1, 2, 3], 'begin': [2002, 1996, 2001], 'end': [2019, 2001, 2002]}
df = pd.DataFrame(data=d)
with about 1,000 cases.
I need to calculate how many cases are in force by year. This information can be derived from the 'begin' and 'end' columns.
For example, case 2 was in force between the years 1996 and 2001.
The resulting data frame should look as follows:
e = {'year': [1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
'cases': [1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
df_ = pd.DataFrame(data=e)
Any idea how I can do this in a few lines for 1,000 cases?
Assign a new column with range, then explode:
df['new'] = [range(x, y + 1) for x, y in zip(df.begin, df.end)]
df = df.explode('new')
Then do groupby + nunique:
out = df.groupby(['new']).case_id.nunique().reset_index()
Out[257]:
new case_id
0 1996 1
1 1997 1
2 1998 1
3 1999 1
4 2000 1
5 2001 2
6 2002 2
7 2003 1
8 2004 1
9 2005 1
10 2006 1
11 2007 1
12 2008 1
13 2009 1
14 2010 1
15 2011 1
16 2012 1
17 2013 1
18 2014 1
19 2015 1
20 2016 1
21 2017 1
22 2018 1
23 2019 1
Here is another way:
import numpy as np
df.assign(year=df.apply(lambda x: np.arange(x['begin'], x['end'] + 1), axis=1)).explode('year').groupby('year')['case_id'].count().reset_index()
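If you'd rather avoid explode altogether, here is a sketch that counts per year with a vectorized boolean comparison (assuming begin/end are inclusive, as in the question):

```python
import numpy as np
import pandas as pd

d = {'case_id': [1, 2, 3], 'begin': [2002, 1996, 2001], 'end': [2019, 2001, 2002]}
df = pd.DataFrame(data=d)

# One row per year in the overall range; a case is in force in year y
# when begin <= y <= end.
years = np.arange(df.begin.min(), df.end.max() + 1)
counts = [((df.begin <= y) & (df.end >= y)).sum() for y in years]
out = pd.DataFrame({'year': years, 'cases': counts})
```

For ~1,000 cases and a few decades of years this stays fast, since the loop runs over years, not cases.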

Getting the difference iterated over every row from one dataframe to another in pandas

I have two datasets: df1 and df2, each with a column named 'value' with 10 records. Currently I have:
df = df1.value - df2.value
but this code outputs 10 rows only (as expected). How would one iterate the difference for all rows instead of just the difference between corresponding row index (and get a table of 100 records instead)?
Thanks in advance!
You can use pandas.DataFrame.merge with how='cross' (cartesian product), then take the column difference with pandas.DataFrame.diff:
#setup
df1 = pd.DataFrame({"value":[7,5,4,8,9]})
df2 = pd.DataFrame({"value":[1,7,9,5,3]})
df2.merge(df1, how="cross", suffixes=['x', '']).diff(axis=1).dropna(axis=1)
Output
value
0 6
1 4
2 3
3 7
4 8
5 0
6 -2
7 -3
8 1
9 2
10 -2
11 -4
12 -5
13 -1
14 0
15 2
16 0
17 -1
18 3
19 4
20 4
21 2
22 1
23 5
24 6
Try this.
ndf = df1.assign(key=1).merge(df2.assign(key=1), on='key', suffixes=('_l', '_r')).drop('key', axis=1)
ndf['value_l'] - ndf['value_r']
Use an outer subtraction.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"value":[7,5,4,8,9]})
df2 = pd.DataFrame({"value":[1,7,9,5,3]})
np.subtract.outer(df1['value'].to_numpy(), df2['value'].to_numpy())
#array([[ 6, 0, -2, 2, 4],
# [ 4, -2, -4, 0, 2],
# [ 3, -3, -5, -1, 1],
# [ 7, 1, -1, 3, 5],
# [ 8, 2, 0, 4, 6]])
Add a .ravel() if you want the same order as a cross join.
np.subtract.outer(df1['value'].to_numpy(), df2['value'].to_numpy()).ravel('F')
#array([ 6, 4, 3, 7, 8, 0, -2, -3, 1, 2, -2, -4, -5, -1, 0, 2, 0,
# -1, 3, 4, 4, 2, 1, 5, 6])
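The same matrix can also be produced with plain NumPy broadcasting instead of ufunc.outer; a quick sketch - a column vector minus a row vector yields the full pairwise difference:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"value": [7, 5, 4, 8, 9]})
df2 = pd.DataFrame({"value": [1, 7, 9, 5, 3]})

# Shape (5, 1) minus shape (1, 5) broadcasts to shape (5, 5).
diff = df1['value'].to_numpy()[:, None] - df2['value'].to_numpy()[None, :]
```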

Pandas Vlookup 2 DF Columns Different Lengths & Perform Calculation

I need to execute a vlookup-like calculation considering two df's of different lengths with the same column name. Suppose I have a df called df1 such as:
Y M P D
2020 11 Red 10
2020 11 Blue 9
2020 11 Green 12
2020 11 Tan 7
2020 11 White 5
2020 11 Cyan 17
and a second df called df2 such as:
Y M P D
2020 11 Blue 4
2020 11 Red 12
2020 11 White 6
2020 11 Tan 7
2020 11 Green 20
2020 11 Violet 10
2020 11 Black 7
2020 11 BlackII 3
2020 11 Cyan 14
2020 11 Copper 6
I need a new df like df3['Res','P'] with 2 columns showing results from subtracting df1 from df2 such as:
Res P
Red -2
Blue 5
Green -8
Tan 0
White -1
Cyan 3
I have not been able to find anything with a lookup and then calculation on the web. I've tried merging df1 and df2 into one df but I do not see how to execute the calculation when the values in the "P" column match. I think that a merge of df1 and df2 is probably the first step though?
Based on the example, column 'Y' and 'M' do not matter for the merge. If these columns are relevant, then use a list with the on parameter (e.g. on=['Y', 'M', 'P']).
Currently, only columns [['P', 'D']] are being used from df1 and df2.
The following code produces the desired output for the example, but it's difficult to say what will happen with larger dataframes or with repeating values in 'P'.
import pandas as pd
# setup the dataframes
df1 = pd.DataFrame({'Y': [2020, 2020, 2020, 2020, 2020, 2020], 'M': [11, 11, 11, 11, 11, 11], 'P': ['Red', 'Blue', 'Green', 'Tan', 'White', 'Cyan'], 'D': [10, 9, 12, 7, 5, 17]})
df2 = pd.DataFrame({'Y': [2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020], 'M': [11, 11, 11, 11, 11, 11, 11, 11, 11, 11], 'P': ['Blue', 'Red', 'White', 'Tan', 'Green', 'Violet', 'Black', 'BlackII', 'Cyan', 'Copper'], 'D': [4, 12, 6, 7, 20, 10, 7, 3, 14, 6]})
# merge the dataframes
df = pd.merge(df1[['P', 'D']], df2[['P', 'D']], on='P', suffixes=('_1', '_2')).rename(columns={'P': 'Res'})
# subtract the values
df['P'] = (df.D_1 - df.D_2)
# drop the unneeded columns
df = df.drop(columns=['D_1', 'D_2'])
# display(df)
Res P
0 Red -2
1 Blue 5
2 Green -8
3 Tan 0
4 White -1
5 Cyan 3
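A sketch of an index-alignment alternative: set 'P' as the index on both frames and subtract directly, letting pandas align by label (products missing from df1 become NaN and are dropped). Note the result is keyed by product rather than ordered like df1:

```python
import pandas as pd

df1 = pd.DataFrame({'P': ['Red', 'Blue', 'Green', 'Tan', 'White', 'Cyan'],
                    'D': [10, 9, 12, 7, 5, 17]})
df2 = pd.DataFrame({'P': ['Blue', 'Red', 'White', 'Tan', 'Green', 'Violet',
                          'Black', 'BlackII', 'Cyan', 'Copper'],
                    'D': [4, 12, 6, 7, 20, 10, 7, 3, 14, 6]})

# Align on 'P' and subtract; products present in only one frame give NaN.
res = (df1.set_index('P')['D'] - df2.set_index('P')['D']).dropna()
```

Like the merge answer, this assumes 'P' values are unique within each frame.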

Select Rows Where MultiIndex Is In Another DataFrame

I have one DataFrame (DF1) with a MultiIndex and many additional columns. In another DataFrame (DF2) I have 2 columns containing a set of values from the MultiIndex. I would like to select the rows from DF1 where the MultiIndex matches the values in DF2.
df1 = pd.DataFrame({'month': [1, 3, 4, 7, 10],
'year': [2012, 2012, 2014, 2013, 2014],
'sale':[55, 17, 40, 84, 31]})
df1 = df1.set_index(['year','month'])
sale
year month
2012 1 55
2012 3 17
2014 4 40
2013 7 84
2014 10 31
df2 = pd.DataFrame({'year': [2012,2014],
'month': [1, 10]})
year month
0 2012 1
1 2014 10
I'd like to create a new DataFrame that would be:
sale
year month
2012 1 55
2014 10 31
I've tried many variations using .isin, .loc, slicing, but keep running into errors.
You could just set_index on df2 the same way and pass the index:
In[110]:
df1.loc[df2.set_index(['year','month']).index]
Out[110]:
sale
year month
2012 1 55
2014 10 31
more readable version:
In[111]:
idx = df2.set_index(['year','month']).index
df1.loc[idx]
Out[111]:
sale
year month
2012 1 55
2014 10 31
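An alternative sketch using Index.isin, which filters df1 instead of looking rows up with .loc (so pairs in df2 that don't exist in df1 are simply ignored rather than raising a KeyError); MultiIndex.from_frame needs pandas 0.24 or later:

```python
import pandas as pd

df1 = pd.DataFrame({'month': [1, 3, 4, 7, 10],
                    'year': [2012, 2012, 2014, 2013, 2014],
                    'sale': [55, 17, 40, 84, 31]}).set_index(['year', 'month'])
df2 = pd.DataFrame({'year': [2012, 2014], 'month': [1, 10]})

# Build a MultiIndex from df2's columns and keep only matching rows of df1.
wanted = pd.MultiIndex.from_frame(df2[['year', 'month']])
result = df1[df1.index.isin(wanted)]
```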

Pandas DataFrame.update with MultiIndex label

Given a DataFrame A with MultiIndex and a DataFrame B with one-dimensional index, how to update column values of A with new values from B where the index of B should be matched with the second index label of A.
Test data:
begin = [10, 10, 12, 12, 14, 14]
end = [10, 11, 12, 13, 14, 15]
values = [1, 2, 3, 4, 5, 6]
values_updated = [10, 20, 3, 4, 50, 60]
multiindexed = pd.DataFrame({'begin': begin,
'end': end,
'value': values})
multiindexed.set_index(['begin', 'end'], inplace=True)
singleindexed = pd.DataFrame.from_dict(dict(zip([10, 11, 14, 15],
[10, 20, 50, 60])),
orient='index')
singleindexed.columns = ['value']
And the desired result should be
value
begin end
10 10 10
11 20
12 12 3
13 4
14 14 50
15 60
Now I was thinking about a variant of
multiindexed.update(singleindexed)
I searched the docs of DataFrame.update, but could not find anything w.r.t. index handling.
Am I missing an easier way to accomplish this?
You can use loc to select the matching rows in multiindexed and then set the new values from singleindexed.values:
print(singleindexed.index)
Int64Index([10, 11, 14, 15], dtype='int64')
print(singleindexed.values)
[[10]
[20]
[50]
[60]]
idx = pd.IndexSlice
print(multiindexed.loc[idx[:, singleindexed.index], :])
value
begin end
10 10 1
11 2
14 14 5
15 6
multiindexed.loc[idx[:, singleindexed.index], :] = singleindexed.values
print(multiindexed)
value
begin end
10 10 10
11 20
12 12 3
13 4
14 14 50
15 60
See using slicers in the docs.
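If the 'end' labels are unique, as in the test data, here is a sketch that maps the new values through that index level and keeps the old value where no update exists - no slicers needed:

```python
import numpy as np
import pandas as pd

multiindexed = pd.DataFrame({'begin': [10, 10, 12, 12, 14, 14],
                             'end': [10, 11, 12, 13, 14, 15],
                             'value': [1, 2, 3, 4, 5, 6]}).set_index(['begin', 'end'])
new = pd.Series({10: 10, 11: 20, 14: 50, 15: 60})

# Map the replacement values via the 'end' level; missing keys give NaN.
mapped = multiindexed.index.get_level_values('end').map(new)

# Keep the original value wherever there is no replacement.
multiindexed['value'] = np.where(mapped.isna(), multiindexed['value'], mapped).astype(int)
```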