Assuming the following dataframe:
In [26]: x = { 'a': 9 }
In [27]: y = { 'b': 10 }
In [28]: 'a' in x
Out[28]: True
In [29]: 'a' in y
Out[29]: False
In [32]: df = DataFrame('data': Series([x, y]))
In [34]: df
Out[34]:
data
0 {'a': 9}
1 {'b': 10}
How can I obtain a new dataframe that only contains rows where the dictionary in the data column has the key a?
df['a' in df['data']] results in an error.
This is quite simple still:
df[df['data'].apply(lambda x: 'a' in x)]
Related
I am trying to compare 2 pandas dataframes in terms of column names and datatypes. With assert_frame_equal, I get an error since shapes are different. Is there a way to ignore it, as I could not find it in the documentation.
With df1_dict == df2_dict, it just says whether its similar or not, I am trying to print if there are any differences in terms of feature names or datatypes.
df1_dict = dict(df1.dtypes)
df2_dict = dict(df2.dtypes)
# df1_dict = {'A': np.dtype('O'), 'B': np.dtype('O'), 'C': np.dtype('O')}
# df2_dict = {'A': np.dtype('int64'), 'B': np.dtype('O'), 'C': np.dtype('O')}
print(set(df1_dict) - set(df2_dict))
print(f'''Are two datsets similar: {df1_dict == df2_dict}''')
pd.testing.assert_frame_equal(df1, df2)
Any suggestions would be appreciated.
It seems to me that if the two dataframe descriptions are outer joined, you would have all the information you want.
example:
df1 = pd.DataFrame({'a': [1,2,3], 'b': list('abc')})
df2 = pd.DataFrame({'a': [1.0,2.0,3.0], 'b': list('abc'), 'c': [10,20,30]})
diff = df1.dtypes.rename('df1').reset_index().merge(
df2.dtypes.rename('df2').reset_index(), how='outer'
)
def check(x):
if pd.isnull(x.df1):
return 'df1-missing'
if pd.isnull(x.df2):
return 'df2-missing'
if x.df1 != x.df2:
return 'type-mismatch'
return 'ok'
diff['diff_status'] = diff.apply(check, axis=1)
# diff prints:
index df1 df2 diff_status
0 a int64 float64 type-mismatch
1 b object object ok
2 c NaN int64 df1-missing
I've a Pandas DataFrame with 3 columns:
c={'a': [['US']],'b': [['US']], 'c': [['US','BE']]}
df = pd.DataFrame(c, columns = ['a','b','c'])
Now I need the max value of these 3 columns.
I've tried:
df['max_val'] = df[['a','b','c']].max(axis=1)
The result is Nan instead of the expected output: US.
How can I get the max value for these 3 columns? (and what if one of them contains Nan)
Use:
c={'a': [['US', 'BE'],['US']],'b': [['US'],['US']], 'c': [['US','BE'],['US','BE']]}
df = pd.DataFrame(c, columns = ['a','b','c'])
from collections import Counter
df = df[['a','b','c']].apply(lambda x: list(Counter(map(tuple, x)).most_common()[0][0]), 1)
print (df)
0 [US, BE]
1 [US]
dtype: object
if it as # Erfan stated, most common value in a row then .agg(), mode
df.agg('mode', axis=1)
0
0 [US, BE]
1 [US]
while your data are lists, you can't use pandas.mode(). because lists objects are unhashable and mode() function won't work.
a solution is converting the elements of your dataframe's row to strings and then use pandas.mode().
check this:
>>> import pandas as pd
>>> c = {'a': [['US','BE']],'b': [['US']], 'c': [['US','BE']]}
>>> df = pd.DataFrame(c, columns = ['a','b','c'])
>>> x = df.iloc[0].apply(lambda x: str(x))
>>> x.mode()
# Answer:
0 ['US', 'BE']
dtype: object
>>> d = {'a': [['US']],'b': [['US']], 'c': [['US','BE']]}
>>> df2 = pd.DataFrame(d, columns = ['a','b','c'])
>>> z = df.iloc[0].apply(lambda z: str(z))
>>> z.mode()
# Answer:
0 ['US']
dtype: object
As I can see you have some elements as a list type, So I think the below-mentioned code will work fine.
First, append all value into an array
Then, find the most occurring element from that array.
from scipy.stats import mode
arr = []
for i in df:
for j in range(len(df[i])):
for k in range(len(df[i][j])):
arr.append(df[i][j][k])
from collections import Counter
b = Counter(arr)
print(b.most_common())
this will give you an answer as you want.
I know that there are several ways to build up a dataframe in Pandas. My question is simply to understand why the method below doesn't work.
First, a working example. I can create an empty dataframe and then append a new one similar to the documenta
In [3]: df1 = pd.DataFrame([[1,2],], columns = ['a', 'b'])
...: df2 = pd.DataFrame()
...: df2.append(df1)
Out[3]: a b
0 1 2
However, if I do the following df2 becomes None:
In [10]: df1 = pd.DataFrame([[1,2],], columns = ['a', 'b'])
...: df2 = pd.DataFrame()
...: for i in range(10):
...: df2.append(df1)
In [11]: df2
Out[11]:
Empty DataFrame
Columns: []
Index: []
Can someone explain why it works this way? Thanks!
This happens because the .append() method returns a new df:
Pandas Docs (0.19.2):
pandas.DataFrame.append
Returns: appended: DataFrame
Here's a working example so you can see what's happening in each iteration of the loop:
df1 = pd.DataFrame([[1,2],], columns=['a','b'])
df2 = pd.DataFrame()
for i in range(0,2):
print(df2.append(df1))
> a b
> 0 1 2
> a b
> 0 1 2
If you assign the output of .append() to a df (even the same one) you'll get what you probably expected:
for i in range(0,2):
df2 = df2.append(df1)
print(df2)
> a b
> 0 1 2
> 0 1 2
I think what you are looking for is:
df1 = pd.DataFrame()
df2 = pd.DataFrame([[1,2,3],], columns=['a','b','c'])
for i in range(0,4):
df1 = df1.append(df2)
df1
df.append() returns a new object. df2 is a empty dataframe initially, and it will not change. if u do a df3=df2.append(df1), u will get what u want
I have this pandas DataFrame filled with lists filled with strings, that I wish to split into two frames:
Input:
df = pd.DataFrame({
'A': {'a': ['NaN'],'b': ['1.11', '0.00']},
'B': {'a': ['3.33', '0.22'],'b': ['NaN']},
})
Desired output:
df1 = pd.DataFrame({
'A': {'a': ['NaN'],'b': ['1.11']},
'B': {'a': ['3.33'],'b': ['NaN']},
})
df2 = pd.DataFrame({
'A': {'a': ['NaN'],'b': ['0.00']},
'B': {'a': ['0.22'],'b': ['NaN']},
})
I tried to use the apply function, which works for Series, and was wondering if there is an easy way to apply an operation that achieves this on the entire df.
You can stack and apply(pd.Series)
s=df.stack().apply(pd.Series)
s[0].unstack()
Out[508]:
A B
a NaN 3.33
b 1.11 NaN
s[1].unstack()
Out[509]:
A B
a NaN 0.22
b 0.00 NaN
If you do need object for single cell
s[0].unstack().applymap(lambda x : [x])
Out[512]:
A B
a [NaN] [3.33]
b [1.11] [NaN]
s[1].unstack().applymap(lambda x : [x])
Out[513]:
A B
a [nan] [0.22]
b [0.00] [nan]
I want to change the orders of data frames using for loop but it doesn't work. My code is as follows:
import pandas as pd
df1 = pd.DataFrame({'a':1, 'b':2}, index=1)
df2 = pd.DataFrame({'c':3, 'c':4}, index=1)
for df in [df1, df2]:
df = df.loc[:, df.columns.tolist()[::-1]]
Then the order of columns of df1 and df2 is not changed.
You can make use of chain assignment with list comprehension i.e
df1,df2 = [i.loc[:,i.columns[::-1]] for i in [df1,df2]]
print(df1)
b a
1 2 1
print(df2)
c
1 4
Note: In my answer I am trying to build up to show that using a dictionary to store the datafrmes is the best way for a general case. If you are looking to mutate the original dataframe variables, #Bharath answer is the way to go.
Answer:
The code doesn't work because you are not assigning back to the list of dataframes. Here's how to fix that:
import pandas as pd
df1 = pd.DataFrame({'a':1, 'b':2}, index=[1])
df2 = pd.DataFrame({'c':3, 'c':4}, index=[1])
l = [df1, df2]
for i, df in enumerate(l):
l[i] = df.loc[:, df.columns.tolist()[::-1]]
so the difference, is that I iterate with enumerate to get the dataframe and it's index in the list, then I assign the changed dataframe to the original position in the list.
execution details:
Before apply the change:
In [28]: for i in l:
...: print(i.head())
...:
a b
1 1 2
c
1 4
In [29]: for i, df in enumerate(l):
...: l[i] = df.loc[:, df.columns.tolist()[::-1]]
...:
After applying the change:
In [30]: for i in l:
...: print(i.head())
...:
b a
1 2 1
c
1 4
Improvement proposal:
It's better to use a dictionary as follows:
import pandas as pd
d= {}
d['df1'] = pd.DataFrame({'a':1, 'b':2}, index=[1])
d['df2'] = pd.DataFrame({'c':3, 'c':4}, index=[1])
for i,df in d.items():
d[i] = df.loc[:, df.columns.tolist()[::-1]]
Then you will be able to reference your dataframes from the dictionary. For instance d['df1']
You can reverse columns and values:
import pandas as pd
df1 = pd.DataFrame({'a':1, 'b': 2}, index=[1])
df2 = pd.DataFrame({'c':3, 'c': 4}, index=[1])
print('before')
print(df1)
for df in [df1, df2]:
df.values[:,:] = df.values[:, ::-1]
df.columns = df.columns[::-1]
print('after')
print(df1)
df1
Output:
before
a b
1 1 2
after
b a
1 2 1