I know that there are several ways to build up a dataframe in Pandas. My question is simply to understand why the method below doesn't work.
First, a working example. I can create an empty dataframe and then append a new one, similar to the documentation:
In [3]: df1 = pd.DataFrame([[1,2],], columns = ['a', 'b'])
...: df2 = pd.DataFrame()
...: df2.append(df1)
Out[3]: a b
0 1 2
However, if I do the following, df2 stays empty:
In [10]: df1 = pd.DataFrame([[1,2],], columns = ['a', 'b'])
...: df2 = pd.DataFrame()
...: for i in range(10):
...:     df2.append(df1)
In [11]: df2
Out[11]:
Empty DataFrame
Columns: []
Index: []
Can someone explain why it works this way? Thanks!
This happens because the .append() method returns a new df:
Pandas Docs (0.19.2):
pandas.DataFrame.append
Returns: appended: DataFrame
Here's a working example so you can see what's happening in each iteration of the loop:
df1 = pd.DataFrame([[1,2],], columns=['a','b'])
df2 = pd.DataFrame()
for i in range(0,2):
    print(df2.append(df1))
> a b
> 0 1 2
> a b
> 0 1 2
If you assign the output of .append() to a df (even the same one) you'll get what you probably expected:
for i in range(0,2):
    df2 = df2.append(df1)
print(df2)
> a b
> 0 1 2
> 0 1 2
I think what you are looking for is:
df1 = pd.DataFrame()
df2 = pd.DataFrame([[1,2,3],], columns=['a','b','c'])
for i in range(0,4):
    df1 = df1.append(df2)
df1
df.append() returns a new object. df2 is initially an empty dataframe, and calling append on it does not change it. If you assign the result, e.g. df3 = df2.append(df1), you will get what you want.
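Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same pattern can be written with pd.concat. Here is a minimal sketch, assuming the goal is simply to stack several copies of df1 (the name frames is only for illustration):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2]], columns=["a", "b"])

# Collect the pieces in a plain list first, then concatenate once.
frames = [df1 for _ in range(3)]

# pd.concat, like append, returns a new DataFrame, so the result must be assigned.
df2 = pd.concat(frames, ignore_index=True)
print(df2)
```

Building the list and concatenating once also avoids copying the growing dataframe on every iteration.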
I'm trying to replicate SQL UPDATE-type functionality in pandas. I've seen other solutions suggesting using pandas update method or merge and dropping columns.
Example dataframes:
df1 = pd.DataFrame([[1,False, None], [1,True, None], [1, False, 'UpdateMe'], [2,True, None]], columns=['id', 'value1', 'value2'])
df2 = pd.DataFrame([[1,True, 'Updated'], [2,True, 'Updated']], columns=['id', 'value1', 'value2'])
Here is the SQL I am trying to replicate:
UPDATE df1
SET value1 = df2.value1, value2 = df2.value2
FROM df1
JOIN df2 ON df1.id = df2.id
WHERE df1.value2 = 'UpdateMe';
I can get the update to work without any qualifier like so:
df1.set_index('id', inplace=True)
df2.set_index('id', inplace=True)
df1.update(df2, overwrite=True)
df1.reset_index(drop=False, inplace=True)
df1
id value1 value2
0 1 TRUE Updated
1 1 TRUE Updated
2 1 TRUE Updated
3 2 TRUE Updated
However, when I add a qualifier to which records in the dataframe to update, I get a warning and the target dataframe does not get updated.
df1.set_index('id', inplace=True)
df2.set_index('id', inplace=True)
df1.loc[
df1.value2 == 'UpdateMe'
].update(df2, overwrite=True)
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self[col] = expressions.where(mask, this, that)
Here is the expected output:
id value1 value2
0 1 FALSE
1 1 TRUE
2 1 TRUE Updated
3 2 TRUE
Any suggestion on how to update multiple columns with a .loc or type of where clause?
You can create temporary columns using merge. Then use np.where, which works much like the =IF() function in Excel. Finally, remove the temporary columns.
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[1,False, None], [1,True, None], [1, False, 'UpdateMe'], [2,True, None]], columns=['id', 'value1', 'value2'])
df2 = pd.DataFrame([[1,True, 'Updated'], [2,True, 'Updated']], columns=['id', 'value1', 'value2'])
df1.set_index('id', inplace=True)
df2.set_index('id', inplace=True)
#Answer
df1 = df1.merge(df2.rename(columns = {'value1':'value1_temp','value2':'value2_temp'}), how = 'left', right_index = True, left_index = True)
df1.value1 = np.where(df1.value2 == 'UpdateMe', df1.value1_temp, df1.value1)
df1.value2 = np.where(df1.value2 == 'UpdateMe', df1.value2_temp, df1.value2)
df1 = df1.drop(labels = ['value1_temp','value2_temp'], axis = 1)
df1
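An alternative that avoids the temporary columns is to build a boolean mask and assign through .loc on the original dataframe, which is what the SettingWithCopyWarning hints at. This is only a sketch, assuming the ids in df2 are unique; the names mask, lookup and new_values are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame(
    [[1, False, None], [1, True, None], [1, False, "UpdateMe"], [2, True, None]],
    columns=["id", "value1", "value2"],
)
df2 = pd.DataFrame(
    [[1, True, "Updated"], [2, True, "Updated"]],
    columns=["id", "value1", "value2"],
)

# The WHERE clause: rows of df1 that should be updated.
mask = (df1["value2"] == "UpdateMe").to_numpy()

# The JOIN: look up the matching df2 row for each selected id.
lookup = df2.set_index("id")
new_values = lookup.loc[df1.loc[mask, "id"], ["value1", "value2"]]

# The SET clause: assign column by column so each column keeps its dtype.
for col in ["value1", "value2"]:
    df1.loc[mask, col] = new_values[col].to_numpy()

print(df1)
```

Because the assignment goes through df1.loc directly, it modifies df1 itself rather than a temporary copy.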
I've a Pandas DataFrame with 3 columns:
c={'a': [['US']],'b': [['US']], 'c': [['US','BE']]}
df = pd.DataFrame(c, columns = ['a','b','c'])
Now I need the max value of these 3 columns.
I've tried:
df['max_val'] = df[['a','b','c']].max(axis=1)
The result is NaN instead of the expected output: US.
How can I get the max value for these 3 columns (and what if one of them contains NaN)?
Use:
c={'a': [['US', 'BE'],['US']],'b': [['US'],['US']], 'c': [['US','BE'],['US','BE']]}
df = pd.DataFrame(c, columns = ['a','b','c'])
from collections import Counter
df = df[['a','b','c']].apply(lambda x: list(Counter(map(tuple, x)).most_common()[0][0]), 1)
print (df)
0 [US, BE]
1 [US]
dtype: object
If, as @Erfan stated, you want the most common value in each row, use .agg() with mode:
df.agg('mode', axis=1)
0
0 [US, BE]
1 [US]
Because your data are lists, you can't use Series.mode() directly: list objects are unhashable, so mode() won't work.
A workaround is to convert the elements of each row to strings and then call mode().
For example:
>>> import pandas as pd
>>> c = {'a': [['US','BE']],'b': [['US']], 'c': [['US','BE']]}
>>> df = pd.DataFrame(c, columns = ['a','b','c'])
>>> x = df.iloc[0].apply(lambda x: str(x))
>>> x.mode()
# Answer:
0 ['US', 'BE']
dtype: object
>>> d = {'a': [['US']],'b': [['US']], 'c': [['US','BE']]}
>>> df2 = pd.DataFrame(d, columns = ['a','b','c'])
>>> z = df2.iloc[0].apply(lambda z: str(z))
>>> z.mode()
# Answer:
0 ['US']
dtype: object
As far as I can see, some of your elements are lists, so I think the code below will work.
First, append all the values to one array.
Then, find the most frequent element in that array.
from collections import Counter
arr = []
# Flatten every list in every column into one array
for i in df:
    for j in range(len(df[i])):
        for k in range(len(df[i][j])):
            arr.append(df[i][j][k])
# Count occurrences, most frequent first
b = Counter(arr)
print(b.most_common())
This will give you the answer you want.
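None of the approaches above covers the case where a cell is NaN. Here is a rough sketch of one way to handle it, assuming "max" means the most common list in each row and that missing cells should simply be skipped; the helper name most_common_list and the sample data are hypothetical:

```python
from collections import Counter

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [["US"], ["US", "BE"]],
    "b": [["US"], np.nan],
    "c": [["US", "BE"], ["US", "BE"]],
})

def most_common_list(row):
    # Lists are unhashable, so count them as tuples; skip anything that is not a list (e.g. NaN).
    keys = [tuple(v) for v in row if isinstance(v, list)]
    if not keys:
        return None
    return list(Counter(keys).most_common(1)[0][0])

df["max_val"] = df[["a", "b", "c"]].apply(most_common_list, axis=1)
print(df["max_val"].tolist())
```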
I have 2 dataframes, df1 and df2. I am trying to apply styling to df1, then drop a column from it, and finally concatenate it with df2. The styling on df1 should be retained, but it is being lost.
I am using the code listed below, but it doesn't seem to work:
df1 = pd.DataFrame([["A", 1],["B", 2]], columns=["Letter", "Number"])
df2 = pd.DataFrame([["A", 1],["B", 2]], columns=["Letter2", "Number2"])
def highlight(s):
    return ['background-color: red']*2
df1 = df1.style.apply(highlight)
df1.data = df1.data.drop('Letter', axis=1)
combined = pd.concat([df1.data, df2],sort=True)
with pd.ExcelWriter('testcolor.xlsx') as writer:
    combined.to_excel(writer,sheet_name = 'test')
I am expecting "Number" from df1 to be highlighted red and Letter2 and Number2 to be in original colour
I have the following dataframe, df:
A B
0 [ACL1, ACL2, ACL3] [ACL1, ACL4, ACL2]
I want to perform a symmetric_difference on the A and B list so that the output will be [ACL3,ACL4]
df1 = df['A'].symmetric_difference(df['B'])
print (df1)
AttributeError: 'Series' object has no attribute 'symmetric_difference'
But it gives the error above. What did I do wrong? How can I get the final output?
Thanks..
The problem is that symmetric_difference is a method of sets, not of a Series. Instead, you could do:
import pandas as pd
data = [[['ACL1', 'ACL2', 'ACL3'], ['ACL1', 'ACL4', 'ACL2']]]
df = pd.DataFrame(data=data, columns=['A', 'B'])
def symmetric_difference(x):
    return list(set(x.A).symmetric_difference(x.B))
result = df[['A', 'B']].apply(symmetric_difference, axis=1)
print(result)
Output
0 [ACL3, ACL4]
dtype: object
If you care about performance:
[list(set(x).symmetric_difference(set(y))) for x , y in zip (df.A,df.B)]
[['ACL3', 'ACL4']]
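One caveat: sets are unordered, so list(set(...)) may return the elements in a different order from run to run. If a stable order matters, a small variation is to sort each result. This is a sketch that assumes the list elements are sortable strings; the column name diff is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [["ACL1", "ACL2", "ACL3"]],
    "B": [["ACL1", "ACL4", "ACL2"]],
})

# set(a) ^ set(b) is the operator form of symmetric_difference;
# sorted() turns the unordered set into a list with a deterministic order.
df["diff"] = [sorted(set(a) ^ set(b)) for a, b in zip(df["A"], df["B"])]
print(df["diff"].tolist())  # [['ACL3', 'ACL4']]
```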
I want to change the column order of several data frames using a for loop, but it doesn't work. My code is as follows:
import pandas as pd
df1 = pd.DataFrame({'a':1, 'b':2}, index=[1])
df2 = pd.DataFrame({'c':3, 'c':4}, index=[1])
for df in [df1, df2]:
    df = df.loc[:, df.columns.tolist()[::-1]]
Then the order of columns of df1 and df2 is not changed.
You can use tuple unpacking with a list comprehension, i.e.
df1,df2 = [i.loc[:,i.columns[::-1]] for i in [df1,df2]]
print(df1)
b a
1 2 1
print(df2)
c
1 4
Note: In my answer I am trying to show that using a dictionary to store the dataframes is the best way in the general case. If you are looking to change the original dataframe variables, @Bharath's answer is the way to go.
Answer:
The code doesn't work because you are not assigning back to the list of dataframes. Here's how to fix that:
import pandas as pd
df1 = pd.DataFrame({'a':1, 'b':2}, index=[1])
df2 = pd.DataFrame({'c':3, 'c':4}, index=[1])
l = [df1, df2]
for i, df in enumerate(l):
    l[i] = df.loc[:, df.columns.tolist()[::-1]]
The difference is that I iterate with enumerate to get each dataframe together with its index in the list, and then assign the changed dataframe back to its original position in the list.
Execution details:
Before applying the change:
In [28]: for i in l:
...:     print(i.head())
...:
a b
1 1 2
c
1 4
In [29]: for i, df in enumerate(l):
...:     l[i] = df.loc[:, df.columns.tolist()[::-1]]
...:
After applying the change:
In [30]: for i in l:
...:     print(i.head())
...:
b a
1 2 1
c
1 4
Improvement proposal:
It's better to use a dictionary as follows:
import pandas as pd
d= {}
d['df1'] = pd.DataFrame({'a':1, 'b':2}, index=[1])
d['df2'] = pd.DataFrame({'c':3, 'c':4}, index=[1])
for i,df in d.items():
    d[i] = df.loc[:, df.columns.tolist()[::-1]]
Then you will be able to reference your dataframes from the dictionary. For instance d['df1']
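To see why the original loop has no effect, it helps to check which object each name points to after the loop. The following is a small illustrative sketch (the name original is made up for the demonstration): assigning to the loop variable only rebinds that variable and never touches df1.

```python
import pandas as pd

df1 = pd.DataFrame({"a": 1, "b": 2}, index=[1])
original = df1  # second name for the same object

for df in [df1]:
    # .loc returns a new DataFrame; this line rebinds the local name df only.
    df = df.loc[:, df.columns[::-1]]

print(df1.columns.tolist())  # ['a', 'b'] -> df1 is unchanged
print(df.columns.tolist())   # ['b', 'a'] -> only the loop variable sees the new frame
print(df1 is original)       # True
```

Writing the result back into a container (a list slot or a dictionary key) is what makes the change visible after the loop.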
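You can reverse the columns and the values in place. This relies on df.values being a writable view of the data, which is only the case for single-dtype frames and may not hold with copy-on-write in newer pandas versions: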
import pandas as pd
df1 = pd.DataFrame({'a':1, 'b': 2}, index=[1])
df2 = pd.DataFrame({'c':3, 'c': 4}, index=[1])
print('before')
print(df1)
for df in [df1, df2]:
    df.values[:,:] = df.values[:, ::-1]
    df.columns = df.columns[::-1]
print('after')
print(df1)
df1
Output:
before
a b
1 1 2
after
b a
1 2 1