Is there a way to use .loc on column names instead of the values inside the columns? - pandas

I am wondering if there is a way to use .loc to filter a df by comparing its column names against the column names of another df. I know you can usually use it to check whether the values are == to something, but what about the actual column names themselves?
ex.
df1.columns = [0, 1, 2, 3]
df2.columns = [2, 4, 6]
Is there a way to display only the df2 values whose column name also appears in df1, without hardcoding it and saying something like df2.loc[:, ==2]?

IIUC, you can use df2.columns.intersection to keep only the columns that are also present in df1:
>>> df1
A B D F
0 0.431332 0.663717 0.922112 0.562524
1 0.467159 0.549023 0.139306 0.168273
>>> df2
A B C D E F
0 0.451493 0.916861 0.257252 0.600656 0.354882 0.109236
1 0.676851 0.585368 0.467432 0.594848 0.962177 0.714365
>>> df2[df2.columns.intersection(df1.columns)]
A B D F
0 0.451493 0.916861 0.600656 0.109236
1 0.676851 0.585368 0.594848 0.714365

One solution:
df3 = df2[[c for c in df2.columns if c in df1]]
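Since the question asks specifically about .loc, the same idea works there directly: build a boolean mask over df2.columns with isin and pass it as the column selector. A minimal sketch, with made-up frames for illustration:
import pandas as pd

df1 = pd.DataFrame([[0, 1]], columns=['A', 'B'])
df2 = pd.DataFrame([[1, 2, 3]], columns=['A', 'B', 'C'])
# boolean mask over df2's column names, passed as the column selector to .loc
df3 = df2.loc[:, df2.columns.isin(df1.columns)]  # keeps columns A and B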

Related

How do I use df.add_suffix to add suffixes to duplicate column names in Pandas?

I have a large dataframe with 400 columns. 200 of the column names are duplicates of the first 200. How can I use df.add_suffix to add a suffix only to the duplicate column names?
Or is there a better way to do it automatically?
Here is my solution, starting with:
df=pd.DataFrame(np.arange(4).reshape(1,-1),columns=['a','b','a','b'])
Output
a b a b
0 0 1 2 3
Then I use a lambda function:
df.columns = df.columns + np.vectorize(lambda x: '_' if x else '')(df.columns.duplicated())
Output
a b a_ b_
0 0 1 2 3
If you have more than one duplicate, you can loop until none are left. This works for duplicated indices too, and it also keeps the index name.
If I understand your question correctly, you have each name exactly twice. If so, you can check for duplicated values using df.columns.duplicated(), then build a new list, modifying only the duplicated values by adding your self-defined suffix. This differs from the other posted solution, which modifies all entries.
df = pd.DataFrame(data=[[1, 2, 3, 4]], columns=list('aabb'))
my_suffix = 'T'
df.columns = [name + my_suffix if duplicated else name for name, duplicated in zip(df.columns, df.columns.duplicated())]
>>> df
a aT b bT
0 1 2 3 4
My answer has the disadvantage that the dataframe can have duplicated column names if one name is used three or more times.
You could do:
import pandas as pd
# setup dummy DataFrame with repeated columns
df = pd.DataFrame(data=[[1, 2, 3]], columns=list('aaa'))
# create unique identifier for each repeated column
identifier = df.columns.to_series().groupby(level=0).transform('cumcount')
# rename columns with the new identifiers
df.columns = df.columns.astype('string') + identifier.astype('string')
print(df)
Output
a0 a1 a2
0 1 2 3
If there is only one duplicate column, you could do:
# setup dummy DataFrame with repeated columns
df = pd.DataFrame(data=[[1, 2, 3, 4]], columns=list('aabb'))
# create unique identifier for each repeated column
identifier = df.columns.duplicated().astype(int)
# rename columns with the new identifiers
df.columns = df.columns.astype('string') + identifier.astype(str)
print(df)
Output (for only one duplicate)
a0 a1 b0 b1
0 1 2 3 4
Add a numbering suffix, starting with '_1', to every repeated occurrence of a column name that appears more than once.
E.g. the column name list [a, b, c, a, b, a] will become [a, b, c, a_1, b_1, a_2]
from collections import Counter

counter = Counter()
new_columns = []
for x in range(df.shape[1]):
    counter.update([df.columns[x]])
    if counter[df.columns[x]] == 1:
        new_columns.append(df.columns[x])
    else:
        tx = counter[df.columns[x]] - 1
        new_columns.append(df.columns[x] + '_' + str(tx))
df.columns = new_columns
df.columns
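The same logic reads a little cleaner wrapped in a helper; dedupe_columns is my name for it, not a pandas function. A minimal, self-contained sketch:
from collections import Counter
import pandas as pd

def dedupe_columns(df):
    # rename duplicates in order of appearance: a, b, c, a, b, a -> a, b, c, a_1, b_1, a_2
    counter = Counter()
    new_columns = []
    for name in df.columns:
        counter.update([name])
        n = counter[name]
        new_columns.append(name if n == 1 else f'{name}_{n - 1}')
    df.columns = new_columns
    return df

df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=list('abcaba'))
print(dedupe_columns(df).columns.tolist())  # ['a', 'b', 'c', 'a_1', 'b_1', 'a_2']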

change value in row of dataframe on condition

I have a dataframe with certain values and want to exchange values in one row on a condition: if the value is greater than x, I want it to change to zero. I tried with .loc, but somehow I get a KeyError every time I try. Does .loc work to select rows instead of columns? I have used it for columns before, but I can't get it to work for rows.
df = pd.DataFrame({'a': np.random.randn(4), 'b': np.random.randn(4), 'c': np.random.randn(4)})
print(df)
df.loc['Total'] = df.sum()
df.loc[(df['Total'] < x), ['Total']] = 0
I also tried using iloc, but got another error. I don't think it's a complex problem, but I'm kind of stuck, so help would be much appreciated!
You can assign values with loc: first select the row to replace by its label, here 'Total' (the row label set below), then compare the values of that row, again selected by loc; the comparison returns a boolean mask:
np.random.seed(2019)
df = pd.DataFrame({'a': np.random.randn(4), 'b': np.random.randn(4), 'c': np.random.randn(4)})
print(df)
a b c
0 -0.217679 -0.361865 -0.235634
1 0.821455 0.685609 0.953490
2 1.481278 0.573761 -1.689625
3 1.331864 0.287728 -0.344943
df.loc['Total'] = df.sum()
x = 1
df.loc['Total', df.loc['Total'] < x] = 0
print (df)
a b c
0 -0.217679 -0.361865 -0.235634
1 0.821455 0.685609 0.953490
2 1.481278 0.573761 -1.689625
3 1.331864 0.287728 -0.344943
Total 3.416918 1.185233 0.000000
Detail:
print (df.loc['Total'] < x)
a False
b False
c True
Name: Total, dtype: bool
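If you prefer not to assign through a mask, the same replacement can be written with Series.where, which keeps values where the condition holds and substitutes 0 elsewhere. A one-line alternative, reusing the df and x from above:
# keep values >= x in the Total row, replace the rest with 0
df.loc['Total'] = df.loc['Total'].where(df.loc['Total'] >= x, 0)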

Merging Pandas DataFrames with a rule

I am trying to merge two DataFrames with common row and column indexes; however, I am expecting entries with the same row and column indexes to exist in both DataFrames.
Is there a way to make a rule to keep the entries in df1 if they are present, but if there is no value, to use the value from df2?
So df3 = some operation on df1, df2.
Example:
df1 = [[[a],[b],[c]],
[[ ],[e],[ ]],
[[g],[h],[i]]]
df2 = [[[ ],[ ],[ ]],
[[d],[x],[f]],
[[y],[z],[z]]]
df3 = [[[a],[b],[c]],
[[d],[e],[f]],
[[g],[h],[i]]]
I think you're looking for pandas df.fillna():
d1 = [['a','b','c'],[None,'e',None],['g','h','i']]
d2 = [[None,None,None],['d','x','f'],['y','z','z']]
df1,df2 = pd.DataFrame(d1),pd.DataFrame(d2)
df1.fillna(df2)
0 1 2
0 a b c
1 d e f
2 g h i
You can also use this:
df1[df1.isnull()] = df2.values
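Note that df1[df1.isnull()] = df2.values only works when both frames have the same shape and row/column order. pandas also ships a method built for exactly this rule, which aligns on the indexes first; a one-line alternative:
# take df1's value where present, fall back to df2 where df1 is NaN
df3 = df1.combine_first(df2)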

Map values from a dataframe

I have a dataframe with the correspondence between two values, a key column and a value column.
And a list with only one of the variables:
l = ['a','b','c']
I want to make the mapping like:
df[l[0]]
and get 1
df[l[1]]
and get 2
As if it were a dictionary; how can I do that?
Are you looking for something like this?
df.loc[df['key']==l[0], 'value']
returns
0 1
1 1
Another way would be to set the index of the df to key:
df.set_index('key', inplace=True)
df.loc[l[0]].values[0]
Another way is to map via a Series or a dict, but the keys must be unique, so drop_duplicates helps:
df = pd.DataFrame({'key': list('aabcc'),
                   'value': [1, 1, 2, 3, 3]})
s = df.drop_duplicates('key').set_index('key')['value']
print (s)
key
a 1
b 2
c 3
Name: value, dtype: int64
d = df.drop_duplicates('key').set_index('key')['value'].to_dict()
print (d)
{'c': 3, 'b': 2, 'a': 1}
l = ['a','b','c']
print (s[l[0]])
1
print (d[l[1]])
2
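If you need to look up the whole list at once instead of one key at a time, both objects from above support that directly; a short sketch reusing s, d, and l:
print(s[l].tolist())      # [1, 2, 3] - label-based lookup on the Series
print([d[k] for k in l])  # [1, 2, 3] - plain dict lookups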

Pandas: Selecting rows by list

I tried the following code to select columns from a dataframe. My dataframe has about 50 columns. At the end, I want to compute the sum of the selected columns, store it in a new column, and then delete the selected columns.
I started with
columns_selected = ['A','B','C','D','E']
df = df[df.column.isin(columns_selected)]
but it said AttributeError: 'DataFrame' object has no attribute 'column'
Regarding the sum: as I don't want to write it all out as
df['sum_1'] = df['A']+df['B']+df['C']+df['D']+df['E']
I also thought that something like
df['sum_1'] = df[columns_selected].sum(axis=1)
would be more convenient.
You want df[columns_selected] to sub-select the df by a list of columns; you can then do df['sum_1'] = df[columns_selected].sum(axis=1).
To filter the df to just the columns of interest, pass a list of the columns: df = df[columns_selected]. Note that it's a common error to pass just a bare tuple of strings, df = df['a','b','c'], which will raise a KeyError.
Note that you had a typo in your original attempt:
df = df.loc[:,df.columns.isin(columns_selected)]
The above would have worked: firstly, you needed columns, not column; secondly, you can use the boolean mask against the columns by passing it to loc (or, in older pandas, the since-removed ix) as the column selection argument:
In [49]:
df = pd.DataFrame(np.random.randn(5,5), columns=list('abcde'))
df
Out[49]:
a b c d e
0 -0.778207 0.480142 0.537778 -1.889803 -0.851594
1 2.095032 1.121238 1.076626 -0.476918 -0.282883
2 0.974032 0.595543 -0.628023 0.491030 0.171819
3 0.983545 -0.870126 1.100803 0.139678 0.919193
4 -1.854717 -2.151808 1.124028 0.581945 -0.412732
In [50]:
cols = ['a','b','c']
df.loc[:, df.columns.isin(cols)]
Out[50]:
a b c
0 -0.778207 0.480142 0.537778
1 2.095032 1.121238 1.076626
2 0.974032 0.595543 -0.628023
3 0.983545 -0.870126 1.100803
4 -1.854717 -2.151808 1.124028
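Putting the pieces together for the original goal (sum the selected columns into a new column, then drop them); a minimal sketch with made-up data:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 5), columns=list('abcde'))
columns_selected = ['a', 'b', 'c']
# row-wise sum of the selected columns into a new column
df['sum_1'] = df[columns_selected].sum(axis=1)
# then drop the originals
df = df.drop(columns=columns_selected)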