I have a pandas DataFrame whose cells are lists of strings, and I wish to split it into two frames:
Input:
df = pd.DataFrame({
    'A': {'a': ['NaN'], 'b': ['1.11', '0.00']},
    'B': {'a': ['3.33', '0.22'], 'b': ['NaN']},
})
Desired output:
df1 = pd.DataFrame({
    'A': {'a': ['NaN'], 'b': ['1.11']},
    'B': {'a': ['3.33'], 'b': ['NaN']},
})
df2 = pd.DataFrame({
    'A': {'a': ['NaN'], 'b': ['0.00']},
    'B': {'a': ['0.22'], 'b': ['NaN']},
})
I tried to use the apply function, which works on a Series, and was wondering if there is an easy way to apply an operation that achieves this across the entire df.
You can stack and then apply(pd.Series):
s = df.stack().apply(pd.Series)
s[0].unstack()
Out[508]:
A B
a NaN 3.33
b 1.11 NaN
s[1].unstack()
Out[509]:
A B
a NaN 0.22
b 0.00 NaN
If you do need a list object in each cell:
s[0].unstack().applymap(lambda x: [x])
Out[512]:
A B
a [NaN] [3.33]
b [1.11] [NaN]
s[1].unstack().applymap(lambda x: [x])
Out[513]:
A B
a [nan] [0.22]
b [0.00] [nan]
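Alternatively, a minimal sketch that indexes the lists directly with applymap (assuming every cell is a list as in the question; single-element cells like ['NaN'] are carried into both frames, matching the desired output):

import pandas as pd

df = pd.DataFrame({
    'A': {'a': ['NaN'], 'b': ['1.11', '0.00']},
    'B': {'a': ['3.33', '0.22'], 'b': ['NaN']},
})

# First list element goes to df1; the second (when present) to df2.
df1 = df.applymap(lambda cell: [cell[0]])
df2 = df.applymap(lambda cell: [cell[1]] if len(cell) > 1 else [cell[0]])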
import numpy as np
import pandas as pd
data = {
    'ID': [112, 113],
    'empDetails': [
        [{'key': 'score', 'value': 2}, {'key': 'Name', 'value': 'Ajay'}, {'key': 'Department', 'value': 'HR'}],
        [{'key': 'salary', 'value': 7.5}, {'key': 'Name', 'value': 'Balu'}],
    ],
}
dataDF = pd.DataFrame(data)
# trials (earlier attempts):
# dataDF['newColumns'] = dataDF['empDetails'].apply(lambda x: x[0].get('key'))
# dataDF = dataDF['empDetails'].apply(pd.Series)
# create dataframe
# dataDF = pd.DataFrame(dataDF['empDetails'], columns=dataDF['empDetails'].keys())
# create the dataframe
# df = pd.concat([pd.DataFrame(v, columns=[k]) for k, v in dataDF['empDetails'].items()], axis=1)
# print(dataDF['empDetails'].items())
display(dataDF)
I am trying to iterate through the empDetails column and fetch the values of Name, salary and Department into 3 different columns.
Using pd.Series I am able to split the dictionaries into separate columns, but I am not able to rename the columns, as the column order may change.
What would be an effective way to do this?
Expected output
Use a lambda function to extract keys and values into new dictionaries, and pass them to the DataFrame constructor:
f = lambda x: {y['key']:y['value'] for y in x}
df = dataDF.join(pd.DataFrame(dataDF['empDetails'].apply(f).tolist(), index=dataDF.index))
print (df)
ID empDetails score Name \
0 112 [{'key': 'score', 'value': 2}, {'key': 'Name',... 2.0 Ajay
1 113 [{'key': 'salary', 'value': 7.5}, {'key': 'Nam... NaN Balu
Department salary
0 HR NaN
1 NaN 7.5
Alternative solution:
f = lambda x: pd.Series({y['key']:y['value'] for y in x})
df = dataDF.join(dataDF['empDetails'].apply(f))
print (df)
ID empDetails score Name \
0 112 [{'key': 'score', 'value': 2}, {'key': 'Name',... 2.0 Ajay
1 113 [{'key': 'salary', 'value': 7.5}, {'key': 'Nam... NaN Balu
Department salary
0 HR NaN
1 NaN 7.5
Or use a list comprehension (pandas-only solution):
df1 = pd.DataFrame([{y['key']: y['value'] for y in x} for x in dataDF['empDetails']],
                   index=dataDF.index)
df = dataDF.join(df1)
If you are using Python 3.5+, you can unpack the dict elements and append the "ID" column in one line:
df.apply(lambda row: pd.Series({**{"ID":row["ID"]}, **{ed["key"]:ed["value"] for ed in row["empDetails"]}}), axis=1)
Update: if you want all columns from the original df, use a dict comprehension:
df.apply(lambda row: pd.Series({**{col:row[col] for col in df.columns}, **{ed["key"]:ed["value"] for ed in row["empDetails"]}}), axis=1)
Full example:
data = {
    'ID': [112, 113],
    'empDetails': [
        [{'key': 'score', 'value': 2}, {'key': 'Name', 'value': 'Ajay'}, {'key': 'Department', 'value': 'HR'}],
        [{'key': 'salary', 'value': 7.5}, {'key': 'Name', 'value': 'Balu'}],
    ],
}
df = pd.DataFrame(data)
df = df.apply(lambda row: pd.Series({**{col:row[col] for col in df.columns}, **{ed["key"]:ed["value"] for ed in row["empDetails"]}}), axis=1)
[Out]:
Department ID Name salary score
0 HR 112 Ajay NaN 2.0
1 NaN 113 Balu 7.5 NaN
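Another pandas-only route, as a hedged sketch (assumes every empDetails entry is a list of {'key': ..., 'value': ...} dicts, as in the question): explode the lists, expand the dicts, then pivot the keys into columns.

# One row per key/value dict, keeping the original row label in 'index'.
tmp = dataDF.explode('empDetails').reset_index()
# Expand each {'key': ..., 'value': ...} dict into 'key'/'value' columns.
kv = pd.concat([tmp['index'], pd.DataFrame(tmp['empDetails'].tolist())], axis=1)
# Pivot the keys into columns, one row per original row label.
wide = kv.pivot(index='index', columns='key', values='value')
out = dataDF.join(wide)
print(out)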
I have a pandas dataframe containing IDs and Codes which are of type list:
df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3, 3, 4],
                   'Code': [['A', 'B'], ['A', 'B'], ['A', 'B', 'C'],
                            ['A'], ['A'], ['A', 'C'], ['D', 'C'], ['A', 'D']]})
I would like to groupby ID and get a list of all codes associated with each ID:
df_groupby = pd.DataFrame(df.groupby('ID')['Code'].apply(list))
After executing the above code I have a dataframe at the ID level with the 'Code' column transformed to a list of lists. How would I flatten each list of lists within the 'Code' column such that I have a list of all codes associated with each ID?
Try this. You can use np.hstack to stack arrays in sequence horizontally:
import numpy as np
df_groupby["Code"] = df_groupby["Code"].apply(lambda x: np.hstack(x))
or
df_groupby["Code"] = df_groupby["Code"].apply(np.hstack)
Use a list comprehension:
df = df.groupby('ID')['Code'].agg(lambda x: [z for y in x for z in y]).to_frame()
print(df)
Code
ID
1 [A, B, A, B, A, B, C]
2 [A, A]
3 [A, C, D, C]
4 [A, D]
Answering my own question. Applying numpy's hstack does the trick:
df_groupby['Code'] = df_groupby['Code'].apply(np.hstack)
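Note that np.hstack leaves a NumPy array in each cell rather than a Python list. A pandas-only sketch that avoids the intermediate list of lists entirely, assuming the original df from the question, is to explode before grouping:

# Explode the lists into one row per code, then collect a flat list per ID.
df_groupby = df.explode('Code').groupby('ID')['Code'].agg(list).to_frame()
print(df_groupby)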
I am trying to compare 2 pandas dataframes in terms of column names and datatypes. With assert_frame_equal, I get an error since the shapes are different. Is there a way to ignore that? I could not find one in the documentation.
With df1_dict == df2_dict, it just says whether they are equal or not; I am trying to print any differences in terms of feature names or datatypes.
df1_dict = dict(df1.dtypes)
df2_dict = dict(df2.dtypes)
# df1_dict = {'A': np.dtype('O'), 'B': np.dtype('O'), 'C': np.dtype('O')}
# df2_dict = {'A': np.dtype('int64'), 'B': np.dtype('O'), 'C': np.dtype('O')}
print(set(df1_dict) - set(df2_dict))
print(f'''Are two datasets similar: {df1_dict == df2_dict}''')
pd.testing.assert_frame_equal(df1, df2)
Any suggestions would be appreciated.
It seems to me that if the two dataframes' dtype descriptions are outer joined, you have all the information you want.
example:
df1 = pd.DataFrame({'a': [1,2,3], 'b': list('abc')})
df2 = pd.DataFrame({'a': [1.0,2.0,3.0], 'b': list('abc'), 'c': [10,20,30]})
diff = df1.dtypes.rename('df1').reset_index().merge(
    df2.dtypes.rename('df2').reset_index(), how='outer'
)
def check(x):
    if pd.isnull(x.df1):
        return 'df1-missing'
    if pd.isnull(x.df2):
        return 'df2-missing'
    if x.df1 != x.df2:
        return 'type-mismatch'
    return 'ok'

diff['diff_status'] = diff.apply(check, axis=1)
# diff prints:
index df1 df2 diff_status
0 a int64 float64 type-mismatch
1 b object object ok
2 c NaN int64 df1-missing
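A minimal sketch of a more direct report, assuming the same df1/df2 (index set operations for the missing columns plus a dtype comparison on the shared ones):

# Columns present in only one of the frames.
only_df1 = df1.columns.difference(df2.columns)
only_df2 = df2.columns.difference(df1.columns)

# dtype mismatches among the shared columns.
shared = df1.columns.intersection(df2.columns)
mismatch = {c: (df1[c].dtype, df2[c].dtype)
            for c in shared if df1[c].dtype != df2[c].dtype}

print('only in df1:', list(only_df1))
print('only in df2:', list(only_df2))
print('dtype mismatches:', mismatch)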
I have a dataframe that I split into two dataframes with the same number of columns and rows (df1 and df2). I want to write a function that goes through each row and feeds the values into the scipy.stats.pearsonr() function. How would I do this?
Something like:
for index, row in df1.iterrows():
    print(scipy.stats.pearsonr(df1.loc[index], df2.loc[index]))
If you just want the function, try this:
import pandas as pd
from scipy.stats import pearsonr
df1 = pd.DataFrame(
    {
        'A': [0, 2, 3, 4, 5],
        'B': [2, 3, 4, 5, 6],
        'C': [5, 6, 7, 8, 9],
    }
)
df2 = pd.DataFrame(
    {
        'A': [2, 1, 3, 4, 5],
        'B': [3, 2, 4, 5, 6],
        'C': [7, 7, 7, 3, 3],
    }
)
def pandas_pearsonr(df1, df2):
    assert len(df1) == len(df2)
    coefs = []
    for i in range(len(df1)):
        coefs.append(pearsonr(df1.iloc[i].values, df2.iloc[i].values))
    print(coefs)
    return pd.DataFrame(index=df1.index, data=coefs, columns=['coef', 'p-value'])

pandas_pearsonr(df1, df2)
Output looks like this:
coef p-value
0 0.976221 0.139109
1 0.996271 0.054996
2 1.000000 0.000000
3 -0.720577 0.487754
4 -0.838628 0.366717
But I think it can be more pythonic. And maybe you can use pandas.DataFrame.corrwith.
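For example, a vectorized sketch with DataFrame.corrwith (row-wise Pearson correlations; note it returns no p-values), assuming the df1/df2 defined above:

# Row-wise Pearson correlation between corresponding rows of df1 and df2.
row_corr = df1.corrwith(df2, axis=1)
print(row_corr)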
Assuming the following dataframe:
In [26]: x = { 'a': 9 }
In [27]: y = { 'b': 10 }
In [28]: 'a' in x
Out[28]: True
In [29]: 'a' in y
Out[29]: False
In [32]: df = DataFrame({'data': Series([x, y])})
In [34]: df
Out[34]:
data
0 {'a': 9}
1 {'b': 10}
How can I obtain a new dataframe that only contains rows where the dictionary in the data column has the key a?
df['a' in df['data']] results in an error.
This is still quite simple:
df[df['data'].apply(lambda x: 'a' in x)]
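For the frame above, the lambda yields the mask [True, False], so only the row whose dict contains the key 'a' is kept:

out = df[df['data'].apply(lambda x: 'a' in x)]
print(out)
#        data
# 0  {'a': 9}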