How to (correctly) merge 2 Pandas DataFrames and scatter-plot - pandas

Thank you in advance for your answers.
My end goal is to produce a scatter plot with corruption as the explanatory variable (x axis, from a DataFrame 'corr') and inequality as the dependent variable (y axis, from a DataFrame 'inq').
A hint on producing an informative table (DataFrame) by joining these two DataFrames would be much appreciated.
I have a DataFrame 'inq' with country inequality (GINI index) and another one, 'corr', with a country corruption index.
import pandas as pd
from numpy import nan

inq = pd.DataFrame(
{
"country": {0: "Angola", 1: "Albania", 2: "United Arab Emirates"},
"1975": {0: nan, 1: nan, 2: nan},
"1976": {0: nan, 1: nan, 2: nan},
"2017": {0: nan, 1: 33.2, 2: nan},
"2018": {0: 51.3, 1: nan, 2: nan},
}
)
corr = pd.DataFrame(
{
"country": {0: "Afghanistan", 1: "Angola", 2: "Albania"},
"1975": {0: 44.8, 1: 48.1, 2: 75.1},
"1976": {0: 44.8, 1: 48.1, 2: 75.1},
"2018": {0: 24.2, 1: 40.4, 2: 28.4},
"2019": {0: 40.5, 1: 37.6, 2: 35.9},
}
)
I concatenate and manipulate:
cm = pd.concat([inq, corr], axis=0, keys=["Inequality", "Corruption"]).reset_index(
    level=1, drop=True
)
and get a new DataFrame:
pd.DataFrame(
{
"indicator": {0: "Inequality", 1: "Inequality", 2: "Inequality"},
"country": {0: "Angola", 1: "Albania", 2: "United Arab Emirates"},
"1967": {0: nan, 1: nan, 2: nan},
"1969": {0: nan, 1: nan, 2: nan},
"2018": {0: 51.3, 1: nan, 2: nan},
"2019": {0: nan, 1: nan, 2: nan},
}
)

You should concatenate your dataframes in a different way:
df = (pd.concat([inq.set_index('country'),
                 corr.set_index('country')],
                axis=1,
                keys=["Inequality", "Corruption"])
      .stack(level=1)
     )
                  Inequality  Corruption
country
Angola      1975         NaN        48.1
            1976         NaN        48.1
            2018        51.3        40.4
            2019         NaN        37.6
Albania     1975         NaN        75.1
            1976         NaN        75.1
            2017        33.2         NaN
            2018         NaN        28.4
            2019         NaN        35.9
Afghanistan 1975         NaN        44.8
            1976         NaN        44.8
            2018         NaN        24.2
            2019         NaN        40.5
Then to plot:
df.plot.scatter(x='Corruption', y='Inequality')
NB: there is only one point, as most of your data is NaN.
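As a minimal sketch (using the stacked df from above), you can drop the incomplete rows first to see exactly which (country, year) observations the scatter will use:
# Keep only rows where both indicators are present before plotting
complete = df.dropna(subset=["Inequality", "Corruption"])
complete.plot.scatter(x="Corruption", y="Inequality")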

Related

Pandas Dataframe: change columns, index and plot

Hi, I generated the table above using Counter from collections to count the combinations of 3 values from a dataframe: Jessica, Mike, and Dog. I got the combinations and their counts.
Any help to make that table a bit prettier? I would like to rename the index as grp1, grp2, etc., and give the column a better name than 0.
Also, what would be the best plot for plotting the different groups?
Thanks for your help!!
I used this command to produce the table here:
import numpy as np
import pandas as pd
from collections import Counter

df = np.random.choice(["Mike", "Jessica", "Dog"], size=(20, 3))
Z = pd.DataFrame(df, columns=['a', 'b', 'c'])
LL = Z.apply(Counter, axis="columns").value_counts()
H = pd.DataFrame(LL)
print(H)
Quite an unusual technique... You can change the index to a MultiIndex, then plot() as barh so the labels make sense:
import numpy as np
import pandas as pd
from collections import Counter

df = np.random.choice(["Mike", "Jessica", "Dog"], size=(20, 3))
Z = pd.DataFrame(df, columns=['a', 'b', 'c'])
LL = Z.apply(Counter, axis="columns").value_counts()
H = pd.DataFrame(LL)
# Expand the index tuples into a DataFrame, then rebuild it as a MultiIndex
I = pd.Series(H.index).apply(pd.Series)
H = H.set_index(pd.MultiIndex.from_arrays(I.T.values, names=I.columns))
H.plot(kind="barh")
H after setting the MultiIndex:
                   0
Mike Dog Jessica
2.0  1.0 NaN       5
     NaN 1.0       4
NaN  1.0 2.0       3
1.0  NaN 2.0       3
     1.0 1.0       2
NaN  NaN 3.0       1
     2.0 1.0       1
3.0  NaN NaN       1
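For the cosmetic part of the question (grp labels and a column name other than 0), a small sketch, assuming H as built above:
# Rename the lone data column and replace the combination index with grp labels
H.columns = ["count"]
H.index = ["grp%d" % i for i in range(1, len(H) + 1)]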
Instead of using Counter, you can apply value_counts directly to each row:
import pandas as pd
from matplotlib import pyplot as plt
# Hard Coded For Reproducibility
df = pd.DataFrame({'a': {0: 'Dog', 1: 'Jessica', 2: 'Mike',
3: 'Dog', 4: 'Dog', 5: 'Dog',
6: 'Jessica', 7: 'Jessica',
8: 'Dog', 9: 'Dog', 10: 'Jessica',
11: 'Mike', 12: 'Dog',
13: 'Jessica', 14: 'Mike',
15: 'Mike',
16: 'Mike', 17: 'Dog',
18: 'Jessica', 19: 'Mike'},
'b': {0: 'Mike', 1: 'Mike', 2: 'Jessica',
3: 'Jessica', 4: 'Dog', 5: 'Jessica',
6: 'Mike', 7: 'Dog', 8: 'Mike',
9: 'Dog', 10: 'Dog', 11: 'Dog',
12: 'Dog', 13: 'Jessica',
14: 'Jessica', 15: 'Dog',
16: 'Dog', 17: 'Dog', 18: 'Jessica', 19: 'Jessica'},
'c': {0: 'Mike', 1: 'Dog', 2: 'Jessica',
3: 'Dog', 4: 'Dog', 5: 'Dog', 6: 'Dog',
7: 'Jessica', 8: 'Mike', 9: 'Dog',
10: 'Dog', 11: 'Mike', 12: 'Jessica',
13: 'Jessica', 14: 'Jessica',
15: 'Jessica', 16: 'Jessica',
17: 'Dog', 18: 'Mike', 19: 'Dog'}})
# Apply value_counts across each row
df = df.apply(pd.value_counts, axis=1).fillna(0)
# Group By All Columns and
# Get Duplicate Count From Group Size
df = pd.DataFrame(df.groupby(df.columns.values.tolist())
                    .size()
                    .sort_values())
# Plot
plt.figure()
df.plot(kind="barh")
plt.tight_layout()
plt.show()
df after groupby, size, and sort:
                  0
Dog Jessica Mike
0.0 3.0     0.0   1
1.0 2.0     0.0   1
0.0 2.0     1.0   3
1.0 0.0     2.0   3
3.0 0.0     0.0   3
2.0 1.0     0.0   4
1.0 1.0     1.0   5

pandas groupby + multiple aggregate/apply with multiple columns

I have this minimal sample data:
import numpy as np
import pandas as pd
from pandas import Timestamp
data = pd.DataFrame({'Client': {0: "Client_1", 1: "Client_2", 2: "Client_2", 3: "Client_3", 4: "Client_3", 5: "Client_3", 6: "Client_4", 7: "Client_4"},
'Id_Card': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8},
'Type': {0: 'A', 1: 'B', 2: 'C', 3: np.nan, 4: 'A', 5: 'B', 6: np.nan, 7: 'B'},
'Loc': {0: 'ADW', 1: 'ZCW', 2: 'EWC', 3: "VWQ", 4: "OKS", 5: 'EQW', 6: "PKA", 7: 'CSA'},
'Amount': {0: 10.0, 1: 15.0, 2: 17.0, 3: 32.0, 4: np.nan, 5: 51.0, 6: 38.0, 7: -20.0},
'Net': {0: 30.0, 1: 42.0, 2: -10.0, 3: 15.0, 4: 98, 5: np.nan, 6: 23.0, 7: -10.0},
'Date': {0: Timestamp('2018-09-29 00:00:00'), 1: Timestamp('1996-08-02 00:00:00'), 2: np.nan, 3: Timestamp('2020-11-02 00:00:00'), 4: Timestamp('2008-12-27 00:00:00'), 5: Timestamp('2004-12-21 00:00:00'), 6: np.nan, 7: Timestamp('2010-08-25 00:00:00')}})
data
I'm trying to aggregate this data grouping by the Client column: counting the Id_Card per client, concatenating Type and Loc separated by ';' (e.g. A;B and ZCW;EWC for Client_2, NOT A;ZCW B;EWC), summing Amount and Net per client, and getting the minimum Date per client. However, I'm facing some problems:
These functions work perfectly individually, but I can't find a way to mix the aggregate function and the apply function:
Code example:
data.groupby("Client").agg({"Id_Card": "count", "Amount":"sum", "Date": "min"})
data.groupby('Client')['Loc'].apply(';'.join).reset_index()
The apply function doesn't work for columns with missing values:
Code example:
data.groupby('Client')['Type'].apply(';'.join).reset_index()
TypeError: sequence item 0: expected str instance, float found
The aggregate and apply functions don't allow me to put multiple columns for one transformation:
Code example:
cols_to_sum = ["Amount", "Net"]
data.groupby("Client").agg({"Id_Card": "count", cols_to_sum:"sum", "Date": "min"})
cols_to_join = ["Type", "Loc"]
data.groupby('Client')[cols_to_join].apply(';'.join).reset_index()
In (3) I only put Amount and Net, and I could list them separately in the aggregate function, but I'm looking for a more efficient way, as I'm working with plenty of columns.
The expected output is the same dataframe, aggregated with the conditions outlined at the beginning.
To do the join you have to filter out the NaN values. Since the join is applied in two places, I have created a separate function:
def join_non_nan_values(elements):
    # NaN != NaN, so "elem == elem" keeps only the non-NaN values
    return ";".join([elem for elem in elements if elem == elem])
data.groupby("Client").agg({"Id_Card": "count", "Type": join_non_nan_values,
                            "Loc": join_non_nan_values, "Amount": "sum",
                            "Net": "sum", "Date": "min"})
Go step by step, and prepare three different dataframes to merge later.
The first dataframe is for simple functions like count, sum, and min:
df1 = data.groupby("Client").agg({"Id_Card": "count", "Amount": "sum", "Net": "sum", "Date": "min"}).reset_index()
Next you deal with the Type and Loc joins, using fillna to handle the NaN values:
df2 = data[['Client', 'Type']].fillna('').groupby("Client")['Type'].apply(';'.join).reset_index()
df3 = data[['Client', 'Loc']].fillna('').groupby("Client")['Loc'].apply(';'.join).reset_index()
And finally you merge the results together:
data_new = df1.merge(df2, on='Client').merge(df3, on='Client')
data_new then contains one aggregated row per client (output table omitted).
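Note that fillna('') leaves empty slots inside the joined strings (e.g. a leading ';' where a value was missing). A small variant, as a sketch, drops the NaNs inside the join instead:
# Join without the empty-string placeholders that fillna('') leaves behind
df2 = data.groupby("Client")["Type"].apply(lambda s: ";".join(s.dropna())).reset_index()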

Pandas pivot table / groupby to calculate weighted average

I'm using pandas version 0.25.0 to calculate weighted averages of priced contracts.
Data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Counterparty': {0: 'A',
1: 'B',
2: 'B',
3: 'A',
4: 'A',
5: 'C',
6: 'D',
7: 'E',
8: 'E',
9: 'C',
10: 'F',
11: 'C',
12: 'C',
13: 'G'},
'Contract': {0: 'A1',
1: 'B1',
2: 'B2',
3: 'A2',
4: 'A3',
5: 'C1',
6: 'D1',
7: 'E1',
8: 'E2',
9: 'C2',
10: 'F1',
11: 'C3',
12: 'C4',
13: 'G'},
'Delivery': {0: '1/8/2019',
1: '1/8/2019',
2: '1/8/2019',
3: '1/8/2019',
4: '1/8/2019',
5: '1/8/2019',
6: '1/8/2019',
7: '1/8/2019',
8: '1/8/2019',
9: '1/8/2019',
10: '1/8/2019',
11: '1/8/2019',
12: '1/8/2019',
13: '1/8/2019'},
'Price': {0: 134.0,
1: 151.0,
2: 149.0,
3: 134.0,
4: 132.14700000000002,
5: 150.0,
6: 134.566,
7: 153.0,
8: 151.0,
9: 135.0,
10: 149.0,
11: 135.0,
12: 147.0,
13: 151.0},
'Balance': {0: 200.0,
1: 54.87,
2: 200.0,
3: 133.44,
4: 500.0,
5: 500.0,
6: 1324.05,
7: 279.87,
8: 200.0,
9: 20.66,
10: 110.15,
11: 100.0,
12: 100.0,
13: 35.04}})
Method 1:
df.pivot_table(
    index=['Counterparty', 'Contract'],
    columns='Delivery',
    values=['Balance', 'Price'],
    aggfunc={
        'Balance': sum,
        'Price': np.mean
    },
    margins=True
).fillna('').swaplevel(0, 1, axis=1).sort_index(axis=1).round(3)
Result 1 (output table omitted).
Is there any way in which I can use np.average in pandas pivot table?
Thinking along the lines of
aggfunc = {
    'Balance': sum,
    'Price': lambda x: np.average(x, weights='Balance')
}
Current result: 143.265, which is computed by np.mean.
Desired result: 140.424, which is the weighted average of Price by Balance.
Method 2:
df_grouped = df.groupby(['Counterparty', 'Contract', 'Delivery']).apply(lambda x: pd.Series(
    {
        'Balance': x['Balance'].sum(),
        'Price': np.average(x['Price'], weights=x['Balance']),
    }
)).round(3).unstack().swaplevel(0, 1, axis=1).sort_index(axis=1)
Result 2 (output table omitted).
Using groupby, I would need to pd.concat and append sum by level to get grand totals with aggfunc = {Balance: sum, Price: np.average}.
The expected result is:
Balance: 3758.08 (using sum)
Price: 140.424 (using np.average)
Which is displayed in a Grand Total row beneath all the rows of data.
Just define a custom function to calculate the weighted average, and use it with aggfunc instead of np.mean in your code, as follows:
wa_func = lambda x: np.average(x, weights=df.loc[x.index, 'Balance'])
df1 = df.pivot_table(
    index=['Counterparty', 'Contract'],
    columns='Delivery',
    values=['Balance', 'Price'],
    aggfunc={
        'Balance': sum,
        'Price': wa_func
    },
    margins=True
).fillna('').swaplevel(0, 1, axis=1).sort_index(axis=1).round(3)
Out[35]:
Delivery              1/8/2019            All
                       Balance    Price  Balance    Price
Counterparty Contract
A            A1         200.00  134.000   200.00  134.000
             A2         133.44  134.000   133.44  134.000
             A3         500.00  132.147   500.00  132.147
B            B1          54.87  151.000    54.87  151.000
             B2         200.00  149.000   200.00  149.000
C            C1         500.00  150.000   500.00  150.000
             C2          20.66  135.000    20.66  135.000
             C3         100.00  135.000   100.00  135.000
             C4         100.00  147.000   100.00  147.000
D            D1        1324.05  134.566  1324.05  134.566
E            E1         279.87  153.000   279.87  153.000
             E2         200.00  151.000   200.00  151.000
F            F1         110.15  149.000   110.15  149.000
G            G           35.04  151.000    35.04  151.000
All                    3758.08  140.424  3758.08  140.424
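As a quick cross-check of the margins row (a sketch over the raw df), the grand totals can be computed directly:
# Grand-total Balance and Balance-weighted Price over the whole frame
df["Balance"].sum()                             # 3758.08
np.average(df["Price"], weights=df["Balance"])  # ~140.424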

pivoting pandas df - turn column values into column names

I have a df:
df = pd.DataFrame({'time_period': {0: pd.Timestamp('2017-04-01 00:00:00'),
1: pd.Timestamp('2017-04-01 00:00:00'),
2: pd.Timestamp('2017-03-01 00:00:00'),
3: pd.Timestamp('2017-03-01 00:00:00')},
'cost1': {0: 142.62999999999994,
1: 131.97000000000003,
2: 142.62999999999994,
3: 131.97000000000003},
'revenue1': {0: 56,
1: 113.14999999999998,
2: 177,
3: 99},
'cost2': {0: 309.85000000000002,
1: 258.25,
2: 309.85000000000002,
3: 258.25},
'revenue2': {0: 4.5,
1: 299.63,2: 309.85,
3: 258.25},
'City': {0: 'Boston',
1: 'New York',2: 'Boston',
3: 'New York'}})
I want to restructure this df so that, separately for revenue and cost, it looks like this:
pd.DataFrame({'City': {0: 'Boston', 1: 'New York'},
'Apr-17 revenue1': {0: 56.0, 1: 113.15000000000001},
'Apr-17 revenue2': {0: 4.5, 1: 299.63},
'Mar-17 revenue1': {0: 177, 1: 99},
'Mar-17 revenue2': {0: 309.85000000000002, 1: 258.25}})
And a similar df for costs.
Basically, turn the time_period column values into column names like Apr-17 and Mar-17, combined with the revenue/cost string as appropriate, holding the values of revenue1/revenue2 and cost1/cost2 respectively.
I've been playing around with pd.pivot_table with some success but I can't get exactly what I want.
Use set_index and unstack:
import datetime as dt
df['time_period'] = df['time_period'].apply(lambda x: dt.datetime.strftime(x, '%b-%y'))
df = df.set_index(['City', 'time_period'])[['revenue1', 'revenue2']].unstack().reset_index()
df.columns = df.columns.map(' '.join).str.strip()

       City  revenue1 Apr-17  revenue1 Mar-17  revenue2 Apr-17  revenue2 Mar-17
0    Boston            56.00            177.0             4.50           309.85
1  New York           113.15             99.0           299.63           258.25
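For completeness, the pivot_table route the asker tried can also get there. A sketch, assuming the strftime conversion above has already been applied to the original df (before the set_index line):
# Each (City, time_period) pair is unique, so the default mean aggregation
# simply passes the single value through
out = df.pivot_table(index='City', columns='time_period',
                     values=['revenue1', 'revenue2'])
out.columns = [f'{period} {name}' for name, period in out.columns]
out = out.reset_index()
This yields the 'Apr-17 revenue1' column naming from the desired output.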

DataFrame to dictionary

I can create a new dataframe from the list of dicts. But how do I get the same list back from the dataframe?
mylist=[{'points': 50, 'time': '5:00', 'year': 2010},
{'points': 25, 'time': '6:00', 'month': "february"},
{'points':90, 'time': '9:00', 'month': 'january'},
{'points_h1':20, 'month': 'june'}]
import pandas as pd
df = pd.DataFrame(mylist)
The following returns a dictionary keyed by column, not by row as in the example above:
In [18]: df.to_dict()
Out[18]:
{'month': {0: nan, 1: 'february', 2: 'january', 3: 'june'},
'points': {0: 50.0, 1: 25.0, 2: 90.0, 3: nan},
'points_h1': {0: nan, 1: nan, 2: nan, 3: 20.0},
'time': {0: '5:00', 1: '6:00', 2: '9:00', 3: nan},
'year': {0: 2010.0, 1: nan, 2: nan, 3: nan}}
df.to_dict(orient='records')
(The keyword is orient in current pandas; very old versions called it outtype.)
Answer is from: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_dict.html
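Note that to_dict(orient='records') keeps every column in every row, filling NaN where a row never had that key (and upcasting ints to floats in columns that contain NaN). A small sketch to strip the fill values back out:
# NaN != NaN, so this comprehension drops the NaN entries pandas added
records = [{k: v for k, v in row.items() if v == v}
           for row in df.to_dict(orient="records")]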