Pivot to multi-index and combining columns into one level using pandas

I am trying to pivot a dataframe so that the unique values in an 'ID' column become the column labels and a multi-index organizes the data into grouped rows. The second level of the multi-index will be the unique values from the 'Date' column, and the first level will contain all other columns not involved in the pivoting operation.
Here's the dataframe sample:
df = pd.DataFrame(
    data=[['A', '10/19/2020', 33, 0.2],
          ['A', '10/6/2020', 17, 0.6],
          ['A', '11/8/2020', 7, 0.3],
          ['A', '11/14/2020', 19, 0.2],
          ['B', '10/28/2020', 26, 0.6],
          ['B', '11/6/2020', 19, 0.3],
          ['B', '11/10/2020', 29, 0.1]],
    columns=['ID', 'Date', 'Temp', 'PPM'])
original df
ID Date Temp PPM
0 A 10/19/2020 33 0.2
1 A 10/6/2020 17 0.6
2 A 11/8/2020 7 0.3
3 A 11/14/2020 19 0.2
4 B 10/28/2020 26 0.6
5 B 11/6/2020 19 0.3
6 B 11/10/2020 29 0.1
desired output
ID                  A    B
     Date
Temp 10/6/2020     17  NaN
     10/19/2020    33  NaN
     10/28/2020   NaN   26
     11/6/2020    NaN   19
     11/8/2020      7  NaN
     11/10/2020   NaN   29
     11/14/2020    19  NaN
PPM  10/6/2020    0.6  NaN
     10/19/2020   0.2  NaN
     10/28/2020   NaN  0.6
     11/6/2020    NaN  0.3
     11/8/2020    0.3  NaN
     11/10/2020   NaN  0.1
     11/14/2020   0.2  NaN
I took a look at this extensive answer on pivoting dataframes in pandas, but I can't see how it covers, or how to apply it to, the specific case I am trying to implement.
EDIT: While I've provided dates as strings in the sample, these are actually datetime64 objects in the full dataframe I'm dealing with.

Let us try set_index and unstack, then transpose:
out = df.set_index(['ID','Date']).unstack().T
Out[27]:
ID                  A     B
     Date
Temp 10/19/2020  33.0   NaN
     10/28/2020   NaN  26.0
     10/6/2020   17.0   NaN
     11/10/2020   NaN  29.0
     11/14/2020  19.0   NaN
     11/6/2020    NaN  19.0
     11/8/2020    7.0   NaN
PPM  10/19/2020   0.2   NaN
     10/28/2020   NaN   0.6
     10/6/2020    0.6   NaN
     11/10/2020   NaN   0.1
     11/14/2020   0.2   NaN
     11/6/2020    NaN   0.3
     11/8/2020    0.3   NaN
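Note that the dates above sort lexicographically because they are strings in the sample (so 10/6/2020 lands after 10/28/2020). Since the real column is datetime64 per the question's edit, the second index level will sort chronologically on its own; with the string sample, the same effect can be sketched by parsing the dates first (assuming pd.to_datetime handles the m/d/Y format, which it does by default):
out = (df.assign(Date=pd.to_datetime(df['Date']))
         .set_index(['ID', 'Date'])
         .unstack(level='Date')  # Date moves up into the columns
         .T)                     # rows become (measure, Date), columns become ID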

Related

Function to replace null values with mean

I have unemployment data for 30 countries with some missing values. In the Excel sheet all the numbers are strings, so I first convert them to floats, and then, if a cell is empty, I want to replace it with its column's mean value. The function runs without returning any error, but when I print the data I still have the null values.
data = pd.read_excel(r'C:\Users\OĞUZ\Desktop\employment.xlsx')
data = data.set_index('Unnamed: 0')
for column in data:
    for row in column:
        if len(row) > 5:
            row = float(row)
        if row.isnull():
            row = column.mean()
print(data['Argentina'].head())
This is what I get after printing:
Unnamed: 0
1990 NaN
1991 NaN
1992 NaN
1993 NaN
1994 NaN
Name: Argentina, dtype: float64
You can either iterate over the columns, or use DataFrame.transform or DataFrame.apply.
Whichever approach you use, you'll want to:
Convert column values from strings to floats
Calculate the mean of the column
Use Series.fillna to fill the NaN values with the previously calculated value
Create Data
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.integers(5, size=10),
    "b": rng.integers(5, 10, size=10),
    "c": rng.integers(10, 15, size=10)
}).astype(str)
df.loc[2:5, :] = np.nan
# note all the numbers you see are actually strings
print(df)
a b c
0 4 8 11
1 3 9 14
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 0 8 12
7 0 7 10
8 0 7 13
9 4 9 13
Solution - DataFrame.transform
def clean_column(series):
    series = pd.to_numeric(series, downcast="float")
    avg = series.mean()
    return series.fillna(avg)

new_df = df.transform(clean_column)
print(new_df)
          a    b          c
0  4.000000  8.0  11.000000
1  3.000000  9.0  14.000000
2  1.833333  8.0  12.166667
3  1.833333  8.0  12.166667
4  1.833333  8.0  12.166667
5  1.833333  8.0  12.166667
6  0.000000  8.0  12.000000
7  0.000000  7.0  10.000000
8  0.000000  7.0  13.000000
9  4.000000  9.0  13.000000
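For completeness, DataFrame.apply behaves identically here, since clean_column maps each column to a Series of the same length:
new_df = df.apply(clean_column)  # column-wise, same result as df.transform(clean_column)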
To fill NaNs use df.fillna(value). For the mean use df.mean(). If your column is named Argentina this could look like below:
df.Argentina.fillna(df.Argentina.mean(), inplace=True)
The inplace=True takes care of the reassignment. The line is equivalent to
df.Argentina = df.Argentina.fillna(df.Argentina.mean())
Example
df = pd.DataFrame({'Argentina':[1,np.nan,2,4]}, index=[1990, 1991, 1992, 1993])
>>> df
Argentina
1990 1.0
1991 NaN
1992 2.0
1993 4.0
df.Argentina.fillna(df.Argentina.mean(), inplace=True)
>>> df
Argentina
1990 1.000000
1991 2.333333
1992 2.000000
1993 4.000000
If you have many columns and you want to fill the NaNs with a value depending on the column, you can loop over the column names like below:
for name in df.columns:
    df[name].fillna(df[name].mean(), inplace=True)
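As a one-line alternative to the loop: DataFrame.fillna accepts a Series and aligns it on the columns, so the whole frame can be filled at once. A minimal sketch, assuming the columns are already numeric:
df = df.fillna(df.mean())  # each column's NaNs get that column's own mean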

Grouping by and applying lambda with condition for the first row - Pandas

I have a data frame with IDs, and choices that have been made by those IDs.
The alternatives (choices) set is a list of integers: [10, 20, 30, 40].
Note: it's important to use this list. Let's call it 'choice_list'.
This is the data frame:
ID Choice
1 10
1 30
1 10
2 40
2 40
2 40
3 20
3 40
3 10
I want to create a variable for each alternative: '10_Var', '20_Var', '30_Var', '40_Var'.
At the first row of each ID, if the first choice was '10', for example, then the variable '10_Var' will get the value 0.6 (some parameter), and each of the other variables ('20_Var', '30_Var', '40_Var') will get the value (1 - 0.6) / 4.
The number 4 stands for the number of alternatives.
Expected result:
ID Choice 10_Var 20_Var 30_Var 40_Var
1 10 0.6 0.1 0.1 0.1
1 30
1 10
2 40 0.1 0.1 0.1 0.6
2 40
2 40
3 20 0.1 0.6 0.1 0.1
3 40
3 10
You can use np.where to do this. It is more efficient than df.where.
df = pd.DataFrame([['1', 10], ['1', 30], ['1', 10], ['2', 40], ['2', 40], ['2', 40], ['3', 20], ['3', 40], ['3', 10]], columns=('ID', 'Choice'))
choices = np.unique(df.Choice)
for choice in choices:
    df[f"var_{choice}"] = np.where(df.Choice == choice, 0.6, (1 - 0.6) / 4)
df
Result
ID Choice var_10 var_20 var_30 var_40
0 1 10 0.6 0.1 0.1 0.1
1 1 30 0.1 0.1 0.6 0.1
2 1 10 0.6 0.1 0.1 0.1
3 2 40 0.1 0.1 0.1 0.6
4 2 40 0.1 0.1 0.1 0.6
5 2 40 0.1 0.1 0.1 0.6
6 3 20 0.1 0.6 0.1 0.1
7 3 40 0.1 0.1 0.1 0.6
8 3 10 0.6 0.1 0.1 0.1
Edit
To set values on the 1st row of each group only:
df = pd.DataFrame([['1', 10], ['1', 30], ['1', 10], ['2', 40], ['2', 40], ['2', 40], ['3', 20], ['3', 40], ['3', 10]], columns=('ID', 'Choice'))
df = df.set_index("ID")
# create a unique index for each row if not already present
df = df.reset_index()
choices = np.unique(df.Choice)
# get the positional label of the 1st row of each group
grouped = df.loc[df.reset_index().groupby("ID")["index"].first()]
# set the value for each new variable
for choice in choices:
    grouped[f"var_{choice}"] = np.where(grouped.Choice == choice, 0.6, (1 - 0.6) / 4)
pd.concat([df, grouped.iloc[:, -len(choices):]], axis=1)
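A more compact sketch of the same first-row-only idea, assuming the rows are already grouped by ID: DataFrame.duplicated flags every repeat of an ID, so its negation marks each group's first row, and rows outside the mask are left as NaN when the new columns are created:
first = ~df.duplicated("ID")  # True only on the 1st row of each ID
for choice in choices:
    df.loc[first, f"var_{choice}"] = np.where(
        df.loc[first, "Choice"] == choice, 0.6, (1 - 0.6) / 4)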
We can use DataFrame.insert to create the new columns, one per unique Choice value obtained through Series.unique, and a mask flagging the first row of each ID so that np.where fills only those rows.
At the beginning, sort_values is used to sort the values by ID. You can skip this step if your data frame is already sorted (like the one shown in the example):
df = df.sort_values('ID')
n = df['Choice'].nunique()
mask = df['ID'].ne(df['ID'].shift())
for choice in df['Choice'].sort_values(ascending=False).unique():
    df.insert(2, column=f'{choice}_Var', value=np.nan)
    df.loc[mask, f'{choice}_Var'] = np.where(df.loc[mask, 'Choice'].eq(choice), 0.6, 0.4/n)
print(df)
ID Choice 10_Var 20_Var 30_Var 40_Var
0 1 10 0.6 0.1 0.1 0.1
1 1 30 NaN NaN NaN NaN
2 1 10 NaN NaN NaN NaN
3 2 40 0.1 0.1 0.1 0.6
4 2 40 NaN NaN NaN NaN
5 2 40 NaN NaN NaN NaN
6 3 20 0.1 0.6 0.1 0.1
7 3 40 NaN NaN NaN NaN
8 3 10 NaN NaN NaN NaN
A mix of numpy and pandas solution:
rows = np.unique(df.ID.values, return_index=True)[1]
df1 = df.loc[rows].assign(val=0.6)
df2 = (pd.crosstab([df1.index, df1.ID, df1.Choice], df1.Choice, df1.val, aggfunc='first')
         .reindex(choice_list, axis=1)
         .fillna((1 - 0.6) / len(choice_list))
         .reset_index(level=[1, 2], drop=True))
pd.concat([df, df2], axis=1)
Out[217]:
ID Choice 10 20 30 40
0 1 10 0.6 0.1 0.1 0.1
1 1 30 NaN NaN NaN NaN
2 1 10 NaN NaN NaN NaN
3 2 40 0.1 0.1 0.1 0.6
4 2 40 NaN NaN NaN NaN
5 2 40 NaN NaN NaN NaN
6 3 20 0.1 0.6 0.1 0.1
7 3 40 NaN NaN NaN NaN
8 3 10 NaN NaN NaN NaN
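Roughly the same first-row table can be built with get_dummies instead of crosstab; a sketch reusing df1 and choice_list from above, where the 0/1 indicator matrix is rescaled arithmetically into the 0.6/0.1 values:
d = (pd.get_dummies(df1.Choice)
       .reindex(columns=choice_list, fill_value=0)
       .astype(float))
df2 = 0.6 * d + ((1 - 0.6) / len(choice_list)) * (1 - d)
pd.concat([df, df2], axis=1)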

Move strings within a mixed string and float column to new column in Pandas

Can't seem to find the answer anywhere. I have a column 'q' within my dataframe that has both strings and floats. I would like to remove the string values from 'q' and move them into an existing string column 'comments'. Any help is appreciated.
I have tried:
df['comments']=[isinstance(x, str) for x in df.q]
I have also tried some str methods on q, but to no avail. Any direction on this would be appreciated.
If the series is:
s=pd.Series([1.0,1.1,1.2,1.3,'this','is',1.4,'a',1.5,'comment'])
s
Out[24]:
0 1
1 1.1
2 1.2
3 1.3
4 this
5 is
6 1.4
7 a
8 1.5
9 comment
dtype: object
then only floats can be:
[e if type(e) is float else np.nan for e in s]
Out[25]: [1.0, 1.1, 1.2, 1.3, nan, nan, 1.4, nan, 1.5, nan]
And comments can be:
[e if type(e) is not float else '' for e in s]
Out[26]: ['', '', '', '', 'this', 'is', '', 'a', '', 'comment']
This is what you are trying to do.
But element-wise iteration with pandas does not scale well, so instead extract the floats using:
pd.to_numeric(s,errors='coerce')
Out[27]:
0 1.0
1 1.1
2 1.2
3 1.3
4 NaN
5 NaN
6 1.4
7 NaN
8 1.5
9 NaN
dtype: float64
and:
num = pd.to_numeric(s, errors='coerce')
num.to_frame('floats').merge(s.loc[num.isnull()].to_frame('comments'),
                             left_index=True, right_index=True, how='outer')
Out[71]:
floats comments
0 1.0 NaN
1 1.1 NaN
2 1.2 NaN
3 1.3 NaN
4 NaN this
5 NaN is
6 1.4 NaN
7 NaN a
8 1.5 NaN
9 NaN comment
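The same split can also be written with Series.where, which keeps values where the condition holds and inserts NaN elsewhere; a short sketch over the same s:
num = pd.to_numeric(s, errors='coerce')
pd.DataFrame({'floats': num, 'comments': s.where(num.isnull())})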
There is a side effect to pd.to_numeric(s, errors='coerce'): it will convert any string holding a float literal to a float instead of keeping it as a string.
pd.to_numeric(pd.Series([1.0,1.1,1.2,1.3,'this','is',1.4,'a',1.5,'comment','12.345']), errors='coerce')
Out[73]:
0 1.000
1 1.100
2 1.200
3 1.300
4 NaN
5 NaN
6 1.400
7 NaN
8 1.500
9 NaN
10 12.345 <--- this is now the float 12.345 not str
dtype: float64
If you don't want to convert strings with float literals into floats, you can also use the str.isnumeric() method:
df = pd.DataFrame({'q':[1.5,2.5,3.5,'a', 'b', 5.1,'3.55','1.44']})
df['comments'] = df.loc[df['q'].str.isnumeric()==False, 'q']
In [4]: df
Out[4]:
q comments
0 1.5 NaN
1 2.5 NaN
2 3.5 NaN
3 a a
4 b b
5 5.1 NaN
6 3.55 3.55 <-- strings are not converted into floats
7 1.44 1.44
Or something like this:
criterion = df.q.apply(lambda x: isinstance(x,str))
df['comments'] = df.loc[criterion, 'q']
Again, it won't convert strings into floats.
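Putting it together for the original frame, a sketch (assuming numpy is imported as np, and that 'comments' should only be overwritten where 'q' held a string) that moves the strings into 'comments' and leaves 'q' purely numeric:
is_str = df['q'].apply(lambda x: isinstance(x, str))
df.loc[is_str, 'comments'] = df.loc[is_str, 'q']  # move the strings over
df.loc[is_str, 'q'] = np.nan                      # blank them out in q
df['q'] = df['q'].astype(float)                   # q is now all floats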

Pandas dataframe creating multiple rows at once via .loc

I can create a new row in a dataframe using .loc:
>>> df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index='1 2'.split())
>>> df
a b
1 10 100
2 20 200
>>> df.loc[3, 'a'] = 30
>>> df
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
But how can I create more than one row using the same method?
>>> df.loc[[4, 5], 'a'] = [40, 50]
...
KeyError: '[4 5] not in index'
I'm familiar with .append(), but I am looking for a way that does NOT require constructing the new row as a Series before appending it to df.
Desired input:
>>> df.loc[[4, 5], 'a'] = [40, 50]
Desired output
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
Where last 2 rows are newly added.
Admittedly, this is a very late answer, but I have had to deal with a similar problem and think my solution might be helpful to others as well.
After recreating your data, it is basically a two-step approach:
Recreate data:
import pandas as pd
df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index='1 2'.split())
df.loc[3, 'a'] = 30
Extend the df.index using .reindex:
idx = list(df.index)
new_rows = list(map(str, range(4, 6))) # more easily extensible than new_rows = ["4", "5"]
idx.extend(new_rows)
df = df.reindex(index=idx)
Set the values using .loc:
df.loc[new_rows, "a"] = [40, 50]
giving you
>>> df
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
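Worth noting why the list form raises KeyError in the first place: .loc can enlarge a frame by one new label at a time, but not by a list containing missing labels, so a plain loop also works (a sketch, using the question's string index):
for label, value in zip(['4', '5'], [40, 50]):
    df.loc[label, 'a'] = value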
Example data
>>> data = pd.DataFrame({
...     'a': [10, 6, -3, -2, 4, 12, 3, 3],
...     'b': [6, -3, 6, 12, 8, 11, -5, -5],
...     'id': [1, 1, 1, 1, 6, 2, 2, 4]})
Case 1 - Note that the range can be altered to whatever you desire.
>>> for i in range(10):
... data.loc[i, 'a'] = 30
...
>>> data
a b id
0 30.0 6.0 1.0
1 30.0 -3.0 1.0
2 30.0 6.0 1.0
3 30.0 12.0 1.0
4 30.0 8.0 6.0
5 30.0 11.0 2.0
6 30.0 -5.0 2.0
7 30.0 -5.0 4.0
8 30.0 NaN NaN
9 30.0 NaN NaN
Case 2 - Here we are adding a new column to a data frame that had 8 rows to begin with. As we extend our new column c to be of length 10, the other columns are extended with NaN.
>>> for i in range(10):
... data.loc[i, 'c'] = 30
...
>>> data
a b id c
0 10.0 6.0 1.0 30.0
1 6.0 -3.0 1.0 30.0
2 -3.0 6.0 1.0 30.0
3 -2.0 12.0 1.0 30.0
4 4.0 8.0 6.0 30.0
5 12.0 11.0 2.0 30.0
6 3.0 -5.0 2.0 30.0
7 3.0 -5.0 4.0 30.0
8 NaN NaN NaN 30.0
9 NaN NaN NaN 30.0
Also somewhat late, but my solution was similar to the accepted one:
import pandas as pd
df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index=[1,2])
# single index assignment always works
df.loc[3, 'a'] = 30
# multiple indices
new_rows = [4,5]
# there should be a nicer way to add more than one index/row at once,
# but at least this is just one extra line:
df = df.reindex(index=df.index.append(pd.Index(new_rows))) # note: Index.append() doesn't accept non-Index iterables?
# multiple new rows now works:
df.loc[new_rows, "a"] = [40, 50]
print(df)
... which yields:
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 40.0 NaN
5 50.0 NaN
This also works now (useful when performance matters while building up dataframes):
# inserting whole rows:
df.loc[new_rows] = [[41, 51], [61,71]]
print(df)
a b
1 10.0 100.0
2 20.0 200.0
3 30.0 NaN
4 41.0 51.0
5 61.0 71.0

Strange pandas.DataFrame.sum(axis=1) behaviour

I have a pandas DataFrame compiled from some web data (for tennis games) that exhibits strange behaviour when summing selected columns across each row.
DataFrame:
In [178]: tdf.shape
Out[178]: (47028, 57)
In [201]: cols
Out[201]: ['L1', 'L2', 'L3', 'L4', 'L5', 'W1', 'W2', 'W3', 'W4', 'W5']
In [177]: tdf[cols].head()
Out[177]:
L1 L2 L3 L4 L5 W1 W2 W3 W4 W5
0 4.0 2 NaN NaN NaN 6.0 6 NaN NaN NaN
1 3.0 3 NaN NaN NaN 6.0 6 NaN NaN NaN
2 7.0 5 3 NaN NaN 6.0 7 6 NaN NaN
3 1.0 4 NaN NaN NaN 6.0 6 NaN NaN NaN
4 6.0 7 4 NaN NaN 7.0 5 6 NaN NaN
Then, when trying to compute the sum over the rows using tdf[cols].sum(axis=1), something goes wrong. From the above table, the sum for the 1st row should be 18.0, but it is reported as 10, as below:
In [180]: tdf[cols].sum(axis=1).head()
Out[180]:
0 10.0
1 9.0
2 13.0
3 7.0
4 13.0
dtype: float64
The problem seems to be caused by a specific record (row 13771), because when I exclude this row, the sum is calculated correctly:
In [182]: tdf.iloc[:13771][cols].sum(axis=1).head()
Out[182]:
0 18.0
1 18.0
2 34.0
3 17.0
4 35.0
dtype: float64
whereas, including it:
In [183]: tdf.iloc[:13772][cols].sum(axis=1).head()
Out[183]:
0 10.0
1 9.0
2 13.0
3 7.0
4 13.0
dtype: float64
Gives the wrong result for the entire column.
The offending record is as follows:
In [196]: tdf[cols].iloc[13771]
Out[196]:
L1 1
L2 1
L3 NaN
L4 NaN
L5 NaN
W1 6
W2 0
W3
W4 NaN
W5 NaN
Name: 13771, dtype: object
In [197]: tdf[cols].iloc[13771].W3
Out[197]: ' '
In [198]: type(tdf[cols].iloc[13771].W3)
Out[198]: str
I'm running the following versions:
In [192]: sys.version
Out[192]: '3.4.3 (default, Nov 17 2016, 01:08:31) \n[GCC 4.8.4]'
In [193]: pd.__version__
Out[193]: '0.19.2'
In [194]: np.__version__
Out[194]: '1.12.0'
Surely a single poorly formatted record should not influence the sum of other records? Is this a bug or am I doing something wrong?
Help much appreciated!
The problem is the blank string: it makes the dtype of column W3 object (it contains a str), and sum silently omits object columns from the row sums.
Solutions:
Try replacing the problematic blank-string value with NaN and then casting to float:
tdf.loc[13771, 'W3'] = np.nan
tdf.W3 = tdf.W3.astype(float)
Or replace all blank or whitespace-only strings with NaN in the subset cols (note the offending value here is ' ', a single space, so a plain '' replacement would miss it):
tdf[cols] = tdf[cols].replace(r'^\s*$', np.nan, regex=True)
# if necessary
tdf[cols] = tdf[cols].astype(float)
Another solution is to use to_numeric on the problematic column, replacing everything non-numeric with NaN:
tdf.W3 = pd.to_numeric(tdf.W3, errors='coerce')
Or apply it generally over the columns in cols:
tdf[cols] = tdf[cols].apply(lambda x: pd.to_numeric(x, errors='coerce'))
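Whichever variant you use, a quick sanity check (assuming cols still holds the ten score columns) confirms the fix before re-running the sum:
print(tdf[cols].dtypes)              # every column should now be float64
print(tdf[cols].sum(axis=1).head())  # row sums now include all ten columns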