Pandas Vlookup 2 DF Columns Different Lengths & Perform Calculation - pandas

I need to perform a VLOOKUP-like calculation across two DataFrames of different lengths that share the same column names. Suppose I have a DataFrame called df1 such as:
Y M P D
2020 11 Red 10
2020 11 Blue 9
2020 11 Green 12
2020 11 Tan 7
2020 11 White 5
2020 11 Cyan 17
and a second df called df2 such as:
Y M P D
2020 11 Blue 4
2020 11 Red 12
2020 11 White 6
2020 11 Tan 7
2020 11 Green 20
2020 11 Violet 10
2020 11 Black 7
2020 11 BlackII 3
2020 11 Cyan 14
2020 11 Copper 6
I need a new DataFrame, df3, with 2 columns ('Res' and 'P') showing the result of subtracting the df2 value of D from the df1 value of D wherever the P values match, such as:
Res P
Red -2
Blue 5
Green -8
Tan 0
White -1
Cyan 3
I have not been able to find anything with a lookup and then calculation on the web. I've tried merging df1 and df2 into one df but I do not see how to execute the calculation when the values in the "P" column match. I think that a merge of df1 and df2 is probably the first step though?

Based on the example, column 'Y' and 'M' do not matter for the merge. If these columns are relevant, then use a list with the on parameter (e.g. on=['Y', 'M', 'P']).
Currently, only columns [['P', 'D']] are being used from df1 and df2.
The following code produces the desired output for the example. With larger dataframes, be aware that if values in 'P' repeat, the merge returns one row for every matching pair of rows, so the result can contain more rows than df1.
import pandas as pd
# setup the dataframes
df1 = pd.DataFrame({'Y': [2020, 2020, 2020, 2020, 2020, 2020], 'M': [11, 11, 11, 11, 11, 11], 'P': ['Red', 'Blue', 'Green', 'Tan', 'White', 'Cyan'], 'D': [10, 9, 12, 7, 5, 17]})
df2 = pd.DataFrame({'Y': [2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020], 'M': [11, 11, 11, 11, 11, 11, 11, 11, 11, 11], 'P': ['Blue', 'Red', 'White', 'Tan', 'Green', 'Violet', 'Black', 'BlackII', 'Cyan', 'Copper'], 'D': [4, 12, 6, 7, 20, 10, 7, 3, 14, 6]})
# merge the dataframes
df = pd.merge(df1[['P', 'D']], df2[['P', 'D']], on='P', suffixes=('_1', '_2')).rename(columns={'P': 'Res'})
# subtract the values
df['P'] = (df.D_1 - df.D_2)
# drop the unneeded columns
df = df.drop(columns=['D_1', 'D_2'])
# display(df)
Res P
0 Red -2
1 Blue 5
2 Green -8
3 Tan 0
4 White -1
5 Cyan 3
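If 'Y' and 'M' do matter, or if repeated keys are a concern, the same idea can be sketched as below. This is only an illustration under assumptions: the column name delta and the variable names keys, merged and result are made up here, and validate='one_to_one' is used so that pandas raises an error instead of silently multiplying rows when a key repeats.

```python
import pandas as pd

df1 = pd.DataFrame({'Y': [2020] * 6, 'M': [11] * 6,
                    'P': ['Red', 'Blue', 'Green', 'Tan', 'White', 'Cyan'],
                    'D': [10, 9, 12, 7, 5, 17]})
df2 = pd.DataFrame({'Y': [2020] * 10, 'M': [11] * 10,
                    'P': ['Blue', 'Red', 'White', 'Tan', 'Green', 'Violet',
                          'Black', 'BlackII', 'Cyan', 'Copper'],
                    'D': [4, 12, 6, 7, 20, 10, 7, 3, 14, 6]})

# match on year, month and product together
keys = ['Y', 'M', 'P']

# how='left' keeps every row of df1 in its original order;
# validate='one_to_one' raises a MergeError if a key combination repeats
merged = df1.merge(df2, on=keys, how='left', suffixes=('_1', '_2'),
                   validate='one_to_one')

# df1 value minus df2 value for each matched row
merged['delta'] = merged['D_1'] - merged['D_2']
result = merged[keys + ['delta']]
print(result)
```

Rows of df1 with no match in df2 would end up with NaN in delta rather than being dropped, which may or may not be what you want.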

Related

Iteration between rows and columns of a DataFrame to calculate the mean

I have a dataframe which reads:
A 2007/Ago 2007/Set 2007/Out ... 2020/Jan 2020/Fev
row1 x number number number ... number number
row2 y number number number ... number number
row3 w number number number ... number number
...
row27 z number number number ... number number
There is a number in each cell. I want to calculate, for each row, the mean of the cells whose column names start with 2007, then the mean of the cells whose column names start with 2008, then 2009, and so on up to 2020.
What I tried to sketch is something like:
x = []
for i in df.row(i): #that is, for each row of the dataframe
if column.startswith('j'): #which starts with j=2008, 2009, 2010 etc
x += df[i][j] #the variable x gets the number on that row i,column j and sum
What I want in the end are several new columns holding the mean for each year, that is, I want
result1       result2       result3       ...  resultn
mean of       mean of       mean of            mean of
columns       columns       columns            columns
starting      starting      starting           starting
with 2008     with 2009     with 2010          with 2020
That is, I want 13 new columns: one for each mean (years from 2008 to 2020).
I cannot get this loop to work, and I do not know how basic this is, but my question is:
1- Is there a better way of doing this? I mean, using pandas functions rather than loops?
In my dataframe, each cell corresponds to the total health expenditure in that month, and I want to take the mean cost over the entire year to compare it to the population of each city (the cities are the rows). I have been struggling with this for some time and have not been able to solve it. My level with pandas is very basic.
PS: sorry for the dataframe representation, I do not know how to properly write one in the stackoverflow's body question.
An option via melt + pivot_table with the aggfunc set to mean:
import pandas as pd
df = pd.DataFrame({
'A': {'row1': 'x', 'row2': 'y', 'row3': 'w', 'row27': 'z'},
'2007/Ago': {'row1': 1, 'row2': 2, 'row3': 3, 'row27': 4},
'2007/Set': {'row1': 5, 'row2': 6, 'row3': 7, 'row27': 8},
'2007/Out': {'row1': 9, 'row2': 10, 'row3': 11, 'row27': 12},
'2020/Jan': {'row1': 13, 'row2': 14, 'row3': 15, 'row27': 16},
'2020/Fev': {'row1': 17, 'row2': 18, 'row3': 19, 'row27': 20}
})
df = df.melt(id_vars='A', var_name='year')
# Rename month columns to their year value
df['year'] = df['year'].str.split('/').str[0]
# pivot to wide format based on the new year value
df = (
df.pivot_table(columns='year', index='A', aggfunc='mean')
.droplevel(0, 1)
.rename_axis(None)
.rename_axis(None, axis=1)
)
print(df)
df:
2007 2020
w 7 17
x 5 15
y 6 16
z 8 18
Suppose you have this dataframe:
A 2007/Ago 2007/Set 2007/Out 2020/Jan 2020/Fev
row1 x 1 5 9 13 17
row2 y 2 6 10 14 18
row3 w 3 7 11 15 19
row27 z 4 8 12 16 20
You can use .filter() and .mean(axis=1) to compute the values:
df["result"] = df.filter(regex=r"^\d{4}").mean(axis=1)
print(df)
Prints:
A 2007/Ago 2007/Set 2007/Out 2020/Jan 2020/Fev result
row1 x 1 5 9 13 17 9.0
row2 y 2 6 10 14 18 10.0
row3 w 3 7 11 15 19 11.0
row27 z 4 8 12 16 20 12.0
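The snippet above gives one mean per row over all months. To get one mean per row and per year, as the question asks, the columns can be grouped by their four-digit prefix. The following is a minimal sketch under assumptions: the names monthly, yearly and result and the mean_ prefix are invented for this illustration, and every month column is assumed to start with a four-digit year followed by a slash.

```python
import pandas as pd

df = pd.DataFrame({
    'A': {'row1': 'x', 'row2': 'y', 'row3': 'w', 'row27': 'z'},
    '2007/Ago': {'row1': 1, 'row2': 2, 'row3': 3, 'row27': 4},
    '2007/Set': {'row1': 5, 'row2': 6, 'row3': 7, 'row27': 8},
    '2007/Out': {'row1': 9, 'row2': 10, 'row3': 11, 'row27': 12},
    '2020/Jan': {'row1': 13, 'row2': 14, 'row3': 15, 'row27': 16},
    '2020/Fev': {'row1': 17, 'row2': 18, 'row3': 19, 'row27': 20}
})

# keep only the month columns, i.e. those starting with "YYYY/"
monthly = df.filter(regex=r"^\d{4}/")

# transpose so the month labels become the index, group them by year,
# average within each year, then transpose back to one row per city
yearly = monthly.T.groupby(lambda col: col[:4]).mean().T

# attach the yearly means to the identifier column
result = df[['A']].join(yearly.add_prefix('mean_'))
print(result)
```

For row1 this yields 5.0 for 2007 (the mean of 1, 5 and 9) and 15.0 for 2020 (the mean of 13 and 17).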
While re-working my other answer, I found this one-liner:
df.mean(numeric_only=True).groupby(lambda x: x[:4]).mean()
Explanation
Pandas' mean function calculates the mean per column:
# using the DataFrame from Henry's answer:
df = pd.DataFrame({
'A': {'row1': 'x', 'row2': 'y', 'row3': 'w', 'row27': 'z'},
'2007/Ago': {'row1': 1, 'row2': 2, 'row3': 3, 'row27': 4},
'2007/Set': {'row1': 5, 'row2': 6, 'row3': 7, 'row27': 8},
'2007/Out': {'row1': 9, 'row2': 10, 'row3': 11, 'row27': 12},
'2020/Jan': {'row1': 13, 'row2': 14, 'row3': 15, 'row27': 16},
'2020/Fev': {'row1': 17, 'row2': 18, 'row3': 19, 'row27': 20}
})
# calculate mean per column (numeric_only=True skips the string column 'A')
col_means = df.mean(numeric_only=True)
# 2007/Ago 2.5
# 2007/Set 6.5
# 2007/Out 10.5
# 2020/Jan 14.5
# 2020/Fev 18.5
# dtype: float64
# group above columns by first 4 characters, i.e., the year
year_groups = col_means.groupby(lambda x: x[:4])
# calculate the mean per year group
year_groups.mean()
# 2007 6.5
# 2020 16.5
# dtype: float64
You could iterate over the years, select the subset of columns and just use pandas' mean() function to get the mean of that year:
means = {}
for year in range(2007, 2021):
    # assuming df is your dataframe
    sub_df = df.loc[:, df.columns.str.startswith(str(year))]
    # first mean() aggregates per column, second mean() aggregates the whole year
    means[year] = sub_df.mean().mean()
This yields a dict with the years as key and the mean for that year as value. If there are no columns for one year, means[year] contains NaN.

Reshaping column values into rows with Identifier column at the end

I have measurements for Power related to different sensors i.e A1_Pin, A2_Pin and so on. These measurements are recorded in file as columns. The data is uniquely recorded with timestamps.
df1 = pd.DataFrame({'DateTime': ['12/12/2019', '12/13/2019', '12/14/2019',
'12/15/2019', '12/16/2019'],
'A1_Pin': [2, 8, 8, 3, 9],
'A2_Pin': [1, 2, 3, 4, 5],
'A3_Pin': [85, 36, 78, 32, 75]})
I want to reform the table so that each row corresponds to one sensor. The last column indicates the sensor ID to which the row data belongs to.
The final table should look like:
df2 = pd.DataFrame({'DateTime': ['12/12/2019', '12/12/2019', '12/12/2019',
'12/13/2019', '12/13/2019','12/13/2019', '12/14/2019', '12/14/2019',
'12/14/2019', '12/15/2019','12/15/2019', '12/15/2019', '12/16/2019',
'12/16/2019', '12/16/2019'],
'Power': [2, 1, 85,8, 2, 36, 8,3,78, 3, 4, 32, 9, 5, 75],
'ModID': ['A1_PiN','A2_PiN','A3_PiN','A1_PiN','A2_PiN','A3_PiN',
'A1_PiN','A2_PiN','A3_PiN','A1_PiN','A2_PiN','A3_PiN',
'A1_PiN','A2_PiN','A3_PiN']})
I have tried Groupby, Melt, Reshape, Stack and loops but could not do that. If anyone could help? Thanks
When you tried stack, you were on the right track. You need to call set_index first and reset_index afterwards, such as:
df2 = df1.set_index('DateTime').stack().reset_index(name='Power')\
.rename(columns={'level_1':'ModID'}) #to fit the names your expected output
And you get:
print (df2)
DateTime ModID Power
0 12/12/2019 A1_Pin 2
1 12/12/2019 A2_Pin 1
2 12/12/2019 A3_Pin 85
3 12/13/2019 A1_Pin 8
4 12/13/2019 A2_Pin 2
5 12/13/2019 A3_Pin 36
6 12/14/2019 A1_Pin 8
7 12/14/2019 A2_Pin 3
8 12/14/2019 A3_Pin 78
9 12/15/2019 A1_Pin 3
10 12/15/2019 A2_Pin 4
11 12/15/2019 A3_Pin 32
12 12/16/2019 A1_Pin 9
13 12/16/2019 A2_Pin 5
14 12/16/2019 A3_Pin 75
I'd try something like this:
df1.set_index('DateTime').unstack().reset_index()
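Since the question mentions trying melt, here is one way it could be made to work. This is a sketch rather than a definitive answer: the variable name long_df is made up, and the sort treats DateTime as plain strings, which happens to order this sample correctly but would generally call for pd.to_datetime on real data.

```python
import pandas as pd

df1 = pd.DataFrame({'DateTime': ['12/12/2019', '12/13/2019', '12/14/2019',
                                 '12/15/2019', '12/16/2019'],
                    'A1_Pin': [2, 8, 8, 3, 9],
                    'A2_Pin': [1, 2, 3, 4, 5],
                    'A3_Pin': [85, 36, 78, 32, 75]})

long_df = (
    # one row per (DateTime, sensor) pair; the sensor name goes into ModID
    df1.melt(id_vars='DateTime', var_name='ModID', value_name='Power')
       # a stable sort keeps the A1, A2, A3 order inside each date
       .sort_values('DateTime', kind='mergesort')
       .reset_index(drop=True)
       # put the identifier column last, as in the expected output
       .loc[:, ['DateTime', 'Power', 'ModID']]
)
print(long_df)
```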

Select Rows Where MultiIndex Is In Another DataFrame

I have one DataFrame (DF1) with a MultiIndex and many additional columns. In another DataFrame (DF2) I have 2 columns containing a set of values from the MultiIndex. I would like to select the rows from DF1 where the MultiIndex matches the values in DF2.
df1 = pd.DataFrame({'month': [1, 3, 4, 7, 10],
'year': [2012, 2012, 2014, 2013, 2014],
'sale':[55, 17, 40, 84, 31]})
df1 = df1.set_index(['year','month'])
sale
year month
2012 1 55
2012 3 17
2014 4 40
2013 7 84
2014 10 31
df2 = pd.DataFrame({'year': [2012,2014],
'month': [1, 10]})
year month
0 2012 1
1 2014 10
I'd like to create a new DataFrame that would be:
sale
year month
2012 1 55
2014 10 31
I've tried many variations using .isin, .loc, slicing, but keep running into errors.
You could just set_index on df2 the same way and pass the index:
In[110]:
df1.loc[df2.set_index(['year','month']).index]
Out[110]:
sale
year month
2012 1 55
2014 10 31
more readable version:
In[111]:
idx = df2.set_index(['year','month']).index
df1.loc[idx]
Out[111]:
sale
year month
2012 1 55
2014 10 31
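A possible variation, in case DF2 may contain pairs that do not exist in DF1: selecting with .loc may then raise a KeyError, whereas a boolean mask built with Index.isin simply skips them. The sketch below assumes this situation; the extra (2015, 5) row and the names wanted and selected are added purely for illustration and are not part of the question.

```python
import pandas as pd

df1 = pd.DataFrame({'month': [1, 3, 4, 7, 10],
                    'year': [2012, 2012, 2014, 2013, 2014],
                    'sale': [55, 17, 40, 84, 31]})
df1 = df1.set_index(['year', 'month'])

# (2015, 5) is a hypothetical pair that is absent from df1
df2 = pd.DataFrame({'year': [2012, 2014, 2015],
                    'month': [1, 10, 5]})

# build the same (year, month) MultiIndex from df2
wanted = df2.set_index(['year', 'month']).index

# True for every row of df1 whose (year, month) pair appears in df2
selected = df1[df1.index.isin(wanted)]
print(selected)
```

The result keeps the row order of df1, not the order of df2.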

Pandas DataFrame.update with MultiIndex label

Given a DataFrame A with MultiIndex and a DataFrame B with one-dimensional index, how to update column values of A with new values from B where the index of B should be matched with the second index label of A.
Test data:
begin = [10, 10, 12, 12, 14, 14]
end = [10, 11, 12, 13, 14, 15]
values = [1, 2, 3, 4, 5, 6]
values_updated = [10, 20, 3, 4, 50, 60]
multiindexed = pd.DataFrame({'begin': begin,
'end': end,
'value': values})
multiindexed.set_index(['begin', 'end'], inplace=True)
singleindexed = pd.DataFrame.from_dict(dict(zip([10, 11, 14, 15],
[10, 20, 50, 60])),
orient='index')
singleindexed.columns = ['value']
And the desired result should be
value
begin end
10 10 10
11 20
12 12 3
13 4
14 14 50
15 60
Now I was thinking about a variant of
multiindexed.update(singleindexed)
I searched the docs of DataFrame.update, but could not find anything w.r.t. index handling.
Am I missing an easier way to accomplish this?
You can use loc for selecting data in multiindexed and then set new values by values:
print(singleindexed.index)
Int64Index([10, 11, 14, 15], dtype='int64')
print(singleindexed.values)
[[10]
[20]
[50]
[60]]
idx = pd.IndexSlice
print(multiindexed.loc[idx[:, singleindexed.index], :])
value
begin end
10 10 1
11 2
14 14 5
15 6
multiindexed.loc[idx[:, singleindexed.index],:] = singleindexed.values
print(multiindexed)
value
begin end
10 10 10
11 20
12 12 3
13 4
14 14 50
15 60
Using slicers in docs.
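The assignment above relies on the selected rows and singleindexed being in the same order. An alternative that matches by label instead could look like the following sketch. It is only one possible approach, not something DataFrame.update does for you: the names end_labels and looked_up are invented here, and the final astype(int) assumes no missing values remain after the fill.

```python
import pandas as pd

multiindexed = pd.DataFrame({'begin': [10, 10, 12, 12, 14, 14],
                             'end': [10, 11, 12, 13, 14, 15],
                             'value': [1, 2, 3, 4, 5, 6]})
multiindexed = multiindexed.set_index(['begin', 'end'])
singleindexed = pd.DataFrame({'value': [10, 20, 50, 60]},
                             index=[10, 11, 14, 15])

# second index level of A, one label per row
end_labels = multiindexed.index.get_level_values('end')

# look each label up in B; labels missing from B become NaN
looked_up = pd.Series(end_labels.map(singleindexed['value']).to_numpy(),
                      index=multiindexed.index)

# keep the old value wherever B had nothing to offer
multiindexed['value'] = looked_up.fillna(multiindexed['value']).astype(int)
print(multiindexed)
```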

Split pandas dataframe index

I have a pretty big dataframe whose column names are categories (foreign trade statistics), while the index is a string containing the country code AND the year: w2013 means World, year 2013; r2015 means Russian Federation, year 2015.
Index([u'w2011', u'c2011', u'g2011', u'i2011', u'r2011', u'w2012', u'c2012',
u'g2012', u'i2012', u'r2012', u'w2013', u'c2013', u'g2013', u'i2013',
u'r2013', u'w2014', u'c2014', u'g2014', u'i2014', u'r2014', u'w2015',
u'c2015', u'g2015', u'i2015', u'r2015'],
dtype='object')
What would be the easiest way to make a multiple index for plotting the various columns - I need a column plotted for each country and each year?
You can create a MultiIndex with from_tuples. To extract the letters and the years, use indexing with str:
import pandas as pd
li =[u'w2011', u'c2011', u'g2011', u'i2011', u'r2011', u'w2012', u'c2012',
u'g2012', u'i2012', u'r2012', u'w2013', u'c2013', u'g2013', u'i2013',
u'r2013', u'w2014', u'c2014', u'g2014', u'i2014', u'r2014', u'w2015',
u'c2015', u'g2015', u'i2015', u'r2015']
df = pd.DataFrame(range(25), index = li, columns=['a'])
print(df)
a
w2011 0
c2011 1
g2011 2
i2011 3
r2011 4
w2012 5
c2012 6
g2012 7
i2012 8
r2012 9
w2013 10
c2013 11
g2013 12
i2013 13
r2013 14
w2014 15
c2014 16
g2014 17
i2014 18
r2014 19
w2015 20
c2015 21
g2015 22
i2015 23
r2015 24
print(df.index.str[0])
Index([u'w', u'c', u'g', u'i', u'r', u'w', u'c', u'g', u'i', u'r', u'w', u'c',
u'g', u'i', u'r', u'w', u'c', u'g', u'i', u'r', u'w', u'c', u'g', u'i',
u'r'],
dtype='object')
print(df.index.str[1:])
Index([u'2011', u'2011', u'2011', u'2011', u'2011', u'2012', u'2012', u'2012',
u'2012', u'2012', u'2013', u'2013', u'2013', u'2013', u'2013', u'2014',
u'2014', u'2014', u'2014', u'2014', u'2015', u'2015', u'2015', u'2015',
u'2015'],
dtype='object')
df.index = pd.MultiIndex.from_tuples(zip(df.index.str[0], df.index.str[1:]))
print(df)
a
w 2011 0
c 2011 1
g 2011 2
i 2011 3
r 2011 4
w 2012 5
c 2012 6
g 2012 7
i 2012 8
r 2012 9
w 2013 10
c 2013 11
g 2013 12
i 2013 13
r 2013 14
w 2014 15
c 2014 16
g 2014 17
i 2014 18
r 2014 19
w 2015 20
c 2015 21
g 2015 22
i 2015 23
r 2015 24
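If you need to convert the years to int, use astype:
# use this in place of the assignment above, while df.index still holds the original strings
df.index = pd.MultiIndex.from_tuples(zip(df.index.str[0], df.index.str[1:].astype(int)))
print(df.index)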
MultiIndex(levels=[[u'c', u'g', u'i', u'r', u'w'], [2011, 2012, 2013, 2014, 2015]],
labels=[[4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3], [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4]])
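For the plotting part of the question, a possible continuation is to name the two levels and unstack the country level, so that each country becomes its own column. This is a sketch under assumptions: the level names country and year and the variable wide are chosen here for illustration, and MultiIndex.from_arrays is swapped in for from_tuples only because it avoids the zip.

```python
import pandas as pd

# same labels as in the question: w2011, c2011, ..., r2015
labels = [c + str(y) for y in range(2011, 2016) for c in 'wcgir']
df = pd.DataFrame({'a': list(range(25))}, index=labels)

# first character is the country code, the rest is the year
df.index = pd.MultiIndex.from_arrays(
    [df.index.str[0], df.index.str[1:].astype(int)],
    names=['country', 'year'],
)

# rows are years, columns are countries
wide = df['a'].unstack('country')
print(wide)

# wide.plot() would then draw one line per country (requires matplotlib)
```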
If I understood correctly, you can:
Reset your index:
df.reset_index(inplace=True)
Create two other columns, one for the year and one for the country:
df["year"] = df.foo.apply(lambda x: x[1:])
df["country"] = df.foo.apply(lambda x: x[0])
This assumes that the column holding your former index is called foo and that the country code is one character long. Adapt otherwise.
Set those two columns as the index:
df.set_index(["year", "country"], inplace=True)
HTH