Plot multiple columns side by side - pandas

I have the dataframe below.
       111_a  111_b  222_a  222_b  333_a  333_b
row_1    1.0    2.0    1.5    2.5    1.0    2.5
row_2    1.0    2.0    1.5    2.5    1.0    2.5
row_3    1.0    2.0    1.5    2.5    1.0    2.5
I'm trying to plot a bar chart in which the *_a columns are grouped together and the *_b columns are grouped together. I would also be plotting each row (row_1, row_2, etc.) in a separate chart.
What I'm trying to get is this (a grouped bar chart), where in my case:
Asia.SUV = 111_a, Europe.SUV = 222_a, USA.SUV = 333_a
Asia.Sedan = 111_b, Europe.Sedan = 222_b, USA.Sedan = 333_b
I would rename the "Type" labels accordingly. How can I plot this? It would also be a bonus if I could plot each row in a separate chart with a single command, instead of plotting each row manually.

Assuming df is the dataframe, you can use:
ax = df.T.rename_axis(columns='Origin', index='Type').plot.bar()
ax.set_ylabel('Frequency')
output:
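The answer above doesn't cover the bonus part of the question. One way to get both the Origin/Type grouping and one chart per row (a sketch, assuming the column names always follow the number_letter pattern so they can be split on '_') is to turn the columns into a MultiIndex and loop over the rows:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    {'111_a': [1.0, 1.0, 1.0], '111_b': [2.0, 2.0, 2.0],
     '222_a': [1.5, 1.5, 1.5], '222_b': [2.5, 2.5, 2.5],
     '333_a': [1.0, 1.0, 1.0], '333_b': [2.5, 2.5, 2.5]},
    index=['row_1', 'row_2', 'row_3'])

# Split the flat names ('111_a') into an (Origin, Type) MultiIndex;
# rename the level values to Asia/Europe/USA and SUV/Sedan here if desired.
df.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split('_')) for c in df.columns], names=['Origin', 'Type'])

# One grouped bar chart per row: each row becomes an Origin x Type table.
for row_label, row in df.iterrows():
    ax = row.unstack('Type').plot.bar(title=row_label)
    ax.set_ylabel('Frequency')
plt.show()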

Related

Pandas - map one column to another and apply multiplication on row

I have this dataframe:
df = pd.DataFrame({'target': [0.0, 1.0, 2.0],
                   'points': [5.0, 2.5, 3.6]})
And I need to get the lowest 'target' value (other than 0) and multiply 'points' by 2 on the respective row.
Ending up with:
target  points
   0.0     5.0
   1.0     5.0
   2.0     3.6
How so?
Perhaps using idxmin and loc:
idx = df.loc[df['target'] != 0, 'target'].idxmin()
df.loc[idx, 'points'] = df.loc[idx, 'points'] * 2
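For completeness, a minimal runnable version with the question's data, showing the expected result:

import pandas as pd

df = pd.DataFrame({'target': [0.0, 1.0, 2.0],
                   'points': [5.0, 2.5, 3.6]})

# Index label of the smallest non-zero target (here the row with target == 1.0).
idx = df.loc[df['target'] != 0, 'target'].idxmin()
df.loc[idx, 'points'] *= 2

print(df)
#    target  points
# 0     0.0     5.0
# 1     1.0     5.0
# 2     2.0     3.6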

Substitute some values from a field into another

I'd like to substitute some values from one field into another. For instance:
Let's say I have a pandas.DataFrame object with the identifier df (yep, very original). It has several columns, but some of them are relevant and cannot be empty.
I noticed that some of the values were placed in another field. Let's say field1 is a relevant field and field2 is not. I have about a thousand records, and the count grows every week when I get new data. Since I like to automate things, I first check for these possible values:
idx = df[df.field1.isna() & df.field2.notna()].index
Then I tried to replace them:
df.loc[idx, ['field1']] = df.loc[idx, ['field2']]
But when I look at the result, nothing has changed... why? I can make substitutions this way with a single value, but not when the values differ.
df.loc[idx, ['field1']] = "Not empty any longer" # This will work
I can't figure out how to achieve this in a... good way? I mean, I don't want to check it manually; it doesn't matter that they're only 50, because I have to do the same with other fields and I will probably get more cases like this.
Thanks!
Try this: df.loc[idx, ['field1']] = df.loc[idx, ['field2']].values. Without .values, pandas aligns the right-hand side on column labels, and since field2 doesn't match field1, nothing is assigned.
Example:
# The None in 'field1' should be replaced by the 'field2' value
df = pd.DataFrame({'field1':[1,2,3,None,5], 'field2':[6,7,8,8,None]})
idx = df[df.field1.isna() & df.field2.notna()].index
df.loc[idx, ['field1']] = df.loc[idx, ['field2']].values
Original dataframe:
df
   field1  field2
0     1.0     6.0
1     2.0     7.0
2     3.0     8.0
3     NaN     8.0
4     5.0     NaN
Modified df:
df
   field1  field2
0     1.0     6.0
1     2.0     7.0
2     3.0     8.0
3     8.0     8.0
4     5.0     NaN
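Not part of the original answer, but worth noting: since the rows being fixed are exactly those where field1 is missing and field2 is not, Series.fillna gives the same result in one line (alignment is by index, so no .values is needed):

import pandas as pd

df = pd.DataFrame({'field1': [1, 2, 3, None, 5],
                   'field2': [6, 7, 8, 8, None]})

# Fill missing field1 values from field2, row by row.
df['field1'] = df['field1'].fillna(df['field2'])
print(df)
#    field1  field2
# 0     1.0     6.0
# 1     2.0     7.0
# 2     3.0     8.0
# 3     8.0     8.0
# 4     5.0     NaN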

Get a new df with the mean values of other dfs

I have a list composed of an unknown number of dfs.
The dfs have the same dimensions, with the same column names and the same index labels in the same order:
df1=pd.DataFrame(data=np.transpose([[1,2,3,4],[2,4,6,8]]),index=['A','B','C','D'],columns=['x','y'])
df2=pd.DataFrame(data=np.transpose([[3,3,3,3],[4,4,4,4]]),index=['A','B','C','D'],columns=['x','y'])
I would like to combine the values of the n dfs into a new df whose values are the mean of the values of the n dfs.
The expected output:
df_mean=pd.DataFrame(data=np.transpose([[2,2.5,3,3.5],[3,4,5,6]]),index=['A','B','C','D'],columns=['x','y'])
Use concat with mean per index values:
print (pd.concat([df1, df2]).mean(level=0))
     x    y
A  2.0  3.0
B  2.5  4.0
C  3.0  5.0
D  3.5  6.0
First concatenate the dataframes, reset the index to use it as the groupby key, and then calculate the mean over all columns.
pd.concat([df1, df2]).reset_index().groupby('index').mean()
Output
         x    y
index
A      2.0  3.0
B      2.5  4.0
C      3.0  5.0
D      3.5  6.0
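Since the question starts from a list holding an unknown number of dataframes, both answers generalize directly; a sketch (dfs is the hypothetical list). Note that mean(level=0) has been removed in recent pandas versions, so the groupby(level=0) spelling is the safer one:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=np.transpose([[1, 2, 3, 4], [2, 4, 6, 8]]),
                   index=['A', 'B', 'C', 'D'], columns=['x', 'y'])
df2 = pd.DataFrame(data=np.transpose([[3, 3, 3, 3], [4, 4, 4, 4]]),
                   index=['A', 'B', 'C', 'D'], columns=['x', 'y'])
dfs = [df1, df2]  # any number of same-shaped dataframes

# Stack all the dataframes and average per index label.
result = pd.concat(dfs).groupby(level=0).mean()
print(result)
#      x    y
# A  2.0  3.0
# B  2.5  4.0
# C  3.0  5.0
# D  3.5  6.0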

categorical variables to binary variables

I have a DataFrame that looks like this:
initial dataframe
I have different tags in the 'Concepts_clean' column, and I want to automatically fill the other columns like so: resulting dataframe
For example: in the fourth row, the 'Concepts_clean' column contains ['Accueil Amabilité', 'Tarifs'], so I want to fill the columns 'Accueil Amabilité' and 'Tarifs' with ones and all the others with zeros.
What is the most effective way to do it?
Thank you
It's more of an n-hot encoding problem:
>>> def change_df(x):
...     tags = x['Concepts_clean'].replace('[', '').replace(']', '').split(',')
...     for i in tags:
...         if i.strip():          # ignore the empty string produced by '[]'
...             x[i.strip()] = 1
...     return x
...
>>> df.apply(change_df, axis=1)
Example Output
Concepts_clean           Ecoute  Informations  Tarifs
[Tarifs]                    0.0           0.0     1.0
[]                          0.0           0.0     0.0
[Ecoute]                    1.0           0.0     0.0
[Tarifs, Informations]      0.0           1.0     1.0
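If the tags really are stored as strings of the form '[Tarifs, Informations]' (which is what the apply approach above assumes), str.get_dummies can build all the 0/1 columns in one pass; a sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'Concepts_clean': ['[Tarifs]', '[]', '[Ecoute]',
                                      '[Tarifs, Informations]']})

# Strip the brackets, then let get_dummies create one indicator column per tag.
dummies = df['Concepts_clean'].str.strip('[]').str.get_dummies(sep=', ')
print(df.join(dummies))
#            Concepts_clean  Ecoute  Informations  Tarifs
# 0                [Tarifs]       0             0       1
# 1                      []       0             0       0
# 2                [Ecoute]       1             0       0
# 3  [Tarifs, Informations]       0             1       1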

sum vs np.nansum weirdness while summing columns with same name on a pandas dataframe - python

Taking inspiration from this discussion here on SO (Merge Columns within a DataFrame that have the Same Name), I tried the method suggested, and while it works with the function sum(), it doesn't when I'm using np.nansum:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(100,4), columns=['a', 'a','b','b'], index=pd.date_range('2011-1-1', periods=100))
print(df.head(3))
sum() case:
print(df.groupby(df.columns, axis=1).apply(sum, axis=1).head(3))
                   a         b
2011-01-01  1.328933  1.678469
2011-01-02  1.878389  1.343327
2011-01-03  0.964278  1.302857
np.nansum() case:
print(df.groupby(df.columns, axis=1).apply(np.nansum, axis=1).head(3))
a [1.32893299939, 1.87838886222, 0.964278430632,...
b [1.67846885234, 1.34332662587, 1.30285727348, ...
dtype: object
any idea why?
The issue is that np.nansum converts its input to a numpy array, so it effectively loses the column information (sum doesn't do this). As a result, the groupby doesn't get back any column information when constructing the output, so the output is just a Series of numpy arrays.
Specifically, the source code for np.nansum calls the _replace_nan function. In turn, the source code for _replace_nan checks if the input is an array, and converts it to one if it's not.
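A quick way to see this (a minimal check, not from the original answer): calling np.nansum on a DataFrame returns a bare ndarray, while the DataFrame's own sum keeps the index:

import numpy as np
import pandas as pd

sub = pd.DataFrame({'a': [1.0, np.nan], 'b': [2.0, 3.0]})

print(type(np.nansum(sub, axis=1)))  # <class 'numpy.ndarray'> -- labels are lost
print(type(sub.sum(axis=1)))         # <class 'pandas.core.series.Series'> -- index kept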
All hope isn't lost though. You can easily replicate np.nansum with Pandas functions. Specifically use sum followed by fillna:
df.groupby(df.columns, axis=1).sum().fillna(0)
The sum should ignore NaN's and just sum the non-null values. The only case you'll get back a NaN is if all the values attempting to be summed are NaN, which is why fillna is required. Note that you could also do the fillna before the groupby, i.e. df.fillna(0).groupby....
If you really want to use np.nansum, you can recast as pd.Series. This will likely impact performance, as constructing a Series can be relatively expensive, and you'll be doing it multiple times:
df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))
Example Computations
For some example computations, I'll be using the following simple DataFrame, which includes NaN values (your example data doesn't):
df = pd.DataFrame([[1,2,2,np.nan,4],[np.nan,np.nan,np.nan,3,3],[np.nan,np.nan,-1,2,np.nan]], columns=list('aaabb'))
     a    a    a    b    b
0  1.0  2.0  2.0  NaN  4.0
1  NaN  NaN  NaN  3.0  3.0
2  NaN  NaN -1.0  2.0  NaN
Using sum without fillna:
df.groupby(df.columns, axis=1).sum()
     a    b
0  5.0  4.0
1  NaN  6.0
2 -1.0  2.0
Using sum and fillna:
df.groupby(df.columns, axis=1).sum().fillna(0)
     a    b
0  5.0  4.0
1  0.0  6.0
2 -1.0  2.0
Comparing to the fixed np.nansum method:
df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))
     a    b
0  5.0  4.0
1  0.0  6.0
2 -1.0  2.0