categorical variables to binary variables - pandas

I have a DataFrame that looks like this (see the "initial dataframe" image).
I have different tags in the 'Concepts_clean' column, and I want to automatically fill the other columns accordingly (see the "resulting dataframe" image).
For example: in the fourth row, column 'Concepts_clean' contains ['Accueil Amabilité', 'Tarifs'], so I want to fill the columns 'Accueil Amabilité' and 'Tarifs' with ones and all the others with zeros.
What is the most effective way to do it?
Thank you

It's more of an n-hot encoding problem:
>>> def change_df(x):
...     # assumes 'Concepts_clean' holds bracketed strings like "[Tarifs, Informations]"
...     # and that the tag columns already exist, pre-filled with 0
...     tags = x['Concepts_clean'].replace('[', '').replace(']', '')
...     for i in tags.split(','):
...         if i.strip():  # skip the empty string produced by "[]"
...             x[i.strip()] = 1
...     return x
...
>>> df.apply(change_df, axis=1)
Example Output
          Concepts_clean  Ecoute  Informations  Tarifs
                [Tarifs]     0.0           0.0     1.0
                      []     0.0           0.0     0.0
                [Ecoute]     1.0           0.0     0.0
  [Tarifs, Informations]     0.0           1.0     1.0
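As a sketch of an alternative (assuming, as above, that 'Concepts_clean' holds bracketed strings rather than real lists), pandas' str.get_dummies can build all the 0/1 columns in one pass, with no need to pre-create them:

import pandas as pd

df = pd.DataFrame({'Concepts_clean': ['[Tarifs]', '[]', '[Ecoute]', '[Tarifs, Informations]']})

# Strip the brackets, then let str.get_dummies split on ', ' and
# create one 0/1 indicator column per distinct tag.
dummies = df['Concepts_clean'].str.strip('[]').str.get_dummies(sep=', ')
print(df.join(dummies))

If the column holds actual Python lists instead of strings, sklearn's MultiLabelBinarizer does the same job.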

Plot multiple columns side by side

I have the dataframe below.
       111_a  111_b  222_a  222_b  333_a  333_b
row_1    1.0    2.0    1.5    2.5    1.0    2.5
row_2    1.0    2.0    1.5    2.5    1.0    2.5
row_3    1.0    2.0    1.5    2.5    1.0    2.5
I'm trying to plot a bar chart such that the *_a columns are grouped together and the *_b columns are grouped together. I would also like to plot each row (row_1, row_2, etc.) as a separate chart.
What I'm trying to get is this (see the desired chart image), where in my case:
Asia.SUV = 111_a, Europe.SUV = 222_a, USA.SUV = 333_a
Asia.Sedan = 111_b, Europe.Sedan = 222_b, USA.Sedan = 333_b
I would rename the "Type" labels accordingly. How can I plot this? It would also be a bonus if I could plot each row as a separate chart with a single command, instead of plotting each row manually.
Assuming df is the dataframe, you can use:
ax = df.T.rename_axis(columns='Origin', index='Type').plot.bar()
ax.set_ylabel('Frequency')
Output: (grouped bar chart image)
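For the bonus (one chart per row), a minimal sketch assuming matplotlib is available; passing subplots=True draws each column of df.T, i.e. each original row, on its own axes, and layout=(1, 3) assumes the three rows of the sample:

import matplotlib.pyplot as plt

# One subplot per original row (row_1, row_2, ...); sharey keeps the
# frequencies directly comparable across the charts.
axes = df.T.rename_axis(columns='Origin', index='Type').plot.bar(
    subplots=True, layout=(1, 3), figsize=(12, 4), sharey=True
)
for ax in axes.ravel():
    ax.set_ylabel('Frequency')
plt.tight_layout()
plt.show()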

Function to replace all NaN values with zero:

I am trying to clean and fill out around 300 columns. I have already replaced all the empty fields with NaN, and now I am trying to convert those values to 0 if certain checks pass:
NaN values must be present in the column.
No 0 values may already exist in the column.
If 0 values already exist, replace the NaNs with 0.1 instead.
(I am still trying to figure out what to replace them with, since 0 already carries relevant information for that particular column in the dataframe.)
Thus far I have implemented:
def convert(df, col):
    if (df[col].isnull().sum() > 0): #& (df[df[col] != '0'])
        #if (df[df[col] != '0']):
        df[col].replace(np.NaN, '0', inplace=True)

for col in df.columns:
    convert(df, col)
But checking the second condition (no zeroes may already exist in the column) is not working. I tried to implement it (the commented-out parts), but it returns the following error:
TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]
On another note, regarding the field of Data Science: I am not sure whether some of the columns should have their empty fields replaced by the column mean instead of 0. I have features describing weight, dimensions, prices, etc.
Use boolean masks.
Suppose the following dataframe:
>>> df
     A  B    C
0  0.0  1  2.0
1  NaN  4  5.0  # <- this NaN should be replaced by 0.1
2  6.0  7  NaN  # <- this NaN should be replaced by 0
m1 = df.isna().any()  # is there a NaN in the column? (not strictly necessary)
m2 = df.eq(0).any()   # is there a 0 in the column?
# Replace by 0
df.update(df.loc[:, m1 & ~m2].fillna(0))
# Replace by 0.1
df.update(df.loc[:, m1 & m2].fillna(0.1))
Strictly speaking, only the second mask is needed; m1 merely restricts the update to columns that actually contain NaNs.
Output result:
>>> df
     A  B    C
0  0.0  1  2.0
1  0.1  4  5.0
2  6.0  7  0.0
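On the mean-imputation aside from the question: for continuous features such as weight, dimensions, or prices, filling with the column mean is a common default. A minimal sketch with hypothetical column names, assuming purely numeric columns:

import numpy as np
import pandas as pd

df = pd.DataFrame({'weight': [1.5, np.nan, 3.0],
                   'price': [10.0, 20.0, np.nan]})

# Fill each column's NaNs with that column's own mean. This preserves
# the mean but shrinks the variance, so treat it as a pragmatic
# default rather than a universally correct choice.
print(df.fillna(df.mean()))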

Using pandas and scipy regression line slope to identify growth

My goal is to be able to identify price growth in a table of records.
I know this is probably far off from what is possible with data tools, so I appreciate any help or suggestions for improvement.
The immediate trouble I'm having is that scipy.stats.linregress does not return a result when some of the data in a row is missing. I think some kind of masking or filling will be necessary to get the slope measure for rows that contain nulls. An exception is thrown, although the rest still works.
Also, am I using the best solution to find the growth?
I've observed that if I filter for the records that have a positive slope, higher rvalue (correlation) and lower stderr (standard error) the trendline for these rows is upward and consistent.
The reason I tried quantifying the price growth with the slope and other numeric values is that if I plot the lines from all the data in an Excel chart, there is so much noise that it's overwhelming to pick out the lines showing consistent upward movement. Can it be done in a better way?
Here is the working sample:
# credit jezrael
import pandas as pd
import numpy as np
import scipy
from scipy import stats

def calc_slope(row):
    a = scipy.stats.linregress(row, y=axisvalues)
    return pd.Series(a._asdict())

table = pd.DataFrame({'Category': ['A','A','A','B','C','C','C','B','B','A','A','A','B','B','D','A','B','B'],
                      'Quarter': ['2016-Q1','2017-Q2','2017-Q3','2017-Q4','2017-Q2','2016-Q2','2017-Q2','2016-Q3','2016-Q4','2016-Q2','2016-Q3','2017-Q4','2016-Q1','2016-Q2','2016-Q4','2016-Q4','2017-Q2','2017-Q3'],
                      'Value': [100,200,500,800,700,900,300,400,600,200,300,400,200,300,100,300,500,600]})
db = (table.groupby(['Category','Quarter']).filter(lambda group: len(group) >= 1)).groupby(['Category','Quarter'])["Value"].mean()
db = db.unstack()
axisvalues = list(range(1, len(db.columns) + 1))  # used in the calc_slope function
db = db.join(db.apply(calc_slope, axis=1))
You can use:
# np.arange instead of range
axisvalues = np.arange(1, len(db.columns) + 1)

def calc_slope(row):
    # mask NaNs out
    mask = row.notnull()
    a = scipy.stats.linregress(row[mask.values], y=axisvalues[mask])
    return pd.Series(a._asdict())

db = db.join(db.apply(calc_slope, axis=1))
print (db)
          2016-Q1  2016-Q2  2016-Q3  2016-Q4  2017-Q2  2017-Q3  2017-Q4  \
Category
A           100.0    200.0    300.0    300.0    200.0    500.0    400.0
B           200.0    300.0    400.0    600.0    500.0    600.0    800.0
C             NaN    900.0      NaN      NaN    500.0      NaN      NaN
D             NaN      NaN      NaN    100.0      NaN      NaN      NaN

             slope  intercept    rvalue    pvalue    stderr
Category
A         0.012895   0.315789  0.802955  0.029677  0.004281
B         0.010057  -0.885057  0.947623  0.001172  0.001516
C        -0.007500   8.750000 -1.000000  0.000000  0.000000
D              NaN        NaN  0.000000       NaN       NaN
But the last row produces RuntimeWarnings, because category D has only a single value (in 2016-Q4).
To suppress the warnings you can use filterwarnings (thanks Kdog):
import warnings
warnings.filterwarnings("ignore")
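Note that this ignores every warning in the whole process. A narrower alternative (my suggestion, not part of the original answer) scopes the filter to just this computation:

import warnings

# Only this block ignores RuntimeWarning; warnings raised elsewhere
# in the program are unaffected.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    db = db.join(db.apply(calc_slope, axis=1))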

formatting numbers in pandas

For a pandas.DataFrame df:
        min           max          mean
a       0.0  2.300000e+04  6.450098e+02
b       0.0  1.370000e+05  1.651754e+03
c     218.0  1.221550e+10  3.975262e+07
d       1.0  5.060000e+03  2.727708e+02
e       0.0  6.400000e+05  6.560047e+03
I would like to format the display so that numbers show in the ":,.2f" format (that is, ##,###.##) without scientific notation.
I tried df.style.format("{:,.2f}"), which gives <pandas.io.formats.style.Styler object at 0x108b86f60>, and I have no idea what to do with that.
Any lead please?
try this young Pandas apprentice
pd.options.display.float_format = '{:,.2f}'.format
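With that option set, printing the frame shows, e.g., 23,000.00 instead of 2.300000e+04. (The Styler object you got back renders as HTML, e.g. in a Jupyter notebook, which is why printing it in a terminal is unhelpful.) If you only want formatted text, a sketch of an alternative that converts the cells themselves:

import pandas as pd

pd.options.display.float_format = '{:,.2f}'.format
print(df)  # display-wide effect; the values stay numeric

# Per-frame alternative: every cell becomes a formatted string, so
# the result is for display only, not for further computation.
print(df.applymap('{:,.2f}'.format))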

sum vs np.nansum weirdness while summing columns with same name on a pandas dataframe - python

Taking inspiration from this discussion here on SO (Merge Columns within a DataFrame that have the Same Name), I tried the suggested method. It works when using the function sum(), but it doesn't when using np.nansum:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(100,4), columns=['a', 'a','b','b'], index=pd.date_range('2011-1-1', periods=100))
print(df.head(3))
sum() case:
print(df.groupby(df.columns, axis=1).apply(sum, axis=1).head(3))
                   a         b
2011-01-01  1.328933  1.678469
2011-01-02  1.878389  1.343327
2011-01-03  0.964278  1.302857
np.nansum() case:
print(df.groupby(df.columns, axis=1).apply(np.nansum, axis=1).head(3))
a    [1.32893299939, 1.87838886222, 0.964278430632,...
b    [1.67846885234, 1.34332662587, 1.30285727348, ...
dtype: object
any idea why?
The issue is that np.nansum converts its input to a numpy array, so it effectively loses the column information (sum doesn't do this). As a result, the groupby doesn't get back any column information when constructing the output, so the output is just a Series of numpy arrays.
Specifically, the source code for np.nansum calls the _replace_nan function. In turn, the source code for _replace_nan checks if the input is an array, and converts it to one if it's not.
All hope isn't lost though. You can easily replicate np.nansum with pandas functions. Specifically, use sum followed by fillna:
df.groupby(df.columns, axis=1).sum().fillna(0)
sum ignores NaNs and just sums the non-null values. The only case where you'll get back a NaN is when all of the values being summed are NaN, which is why fillna is required. Note that you could also do the fillna before the groupby, i.e. df.fillna(0).groupby....
If you really want to use np.nansum, you can recast the result as a pd.Series. This will likely hurt performance, as constructing a Series is relatively expensive, and you'll be doing it multiple times:
df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))
Example Computations
For some example computations, I'll be using the following simple DataFrame, which includes NaN values (your example data doesn't):
df = pd.DataFrame([[1,2,2,np.nan,4],[np.nan,np.nan,np.nan,3,3],[np.nan,np.nan,-1,2,np.nan]], columns=list('aaabb'))
     a    a    a    b    b
0  1.0  2.0  2.0  NaN  4.0
1  NaN  NaN  NaN  3.0  3.0
2  NaN  NaN -1.0  2.0  NaN
Using sum without fillna:
df.groupby(df.columns, axis=1).sum()
     a    b
0  5.0  4.0
1  NaN  6.0
2 -1.0  2.0
Using sum and fillna:
df.groupby(df.columns, axis=1).sum().fillna(0)
     a    b
0  5.0  4.0
1  0.0  6.0
2 -1.0  2.0
Comparing to the fixed np.nansum method:
df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))
     a    b
0  5.0  4.0
1  0.0  6.0
2 -1.0  2.0
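A caveat from my side, not part of the original answer: groupby(..., axis=1) is deprecated in recent pandas releases. A sketch of the equivalent via a transpose:

# Group the transposed rows by their index (the duplicated column
# names), sum, and transpose back. Recent pandas returns 0.0 for
# all-NaN groups (min_count defaults to 0); keep fillna(0) if your
# version returns NaN instead.
result = df.T.groupby(level=0).sum().T.fillna(0)
print(result)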