How do I drop columns in a pandas dataframe that exist in another dataframe?

How do I drop columns in raw_clin if the same columns already exist in raw_clinical_sample? Using isin raised a "cannot compute isin with a duplicate axis" error.
Explanation of the code:
I want to merge the raw_clinical_patient and raw_clinical_sample dataframes. However, the SAMPLE_ID column in raw_clinical_sample should be relabeled as PATIENT_ID before the merge (because it was mislabeled). I want the new PATIENT_ID to be the index of raw_clin.
import pandas as pd
# Clinical patient info
raw_clinical_patient = pd.read_csv("./gbm_tcga/data_clinical_patient.txt", sep="\t", header=4)
raw_clinical_patient["PATIENT_ID"] = raw_clinical_patient["PATIENT_ID"].replace()
raw_clinical_patient.set_index("PATIENT_ID", inplace=True)
raw_clinical_patient.sort_index()
# Clinical sample info
raw_clinical_sample = pd.read_csv("./gbm_tcga/data_clinical_sample.txt", sep="\t", header=4)
raw_clinical_sample.set_index("PATIENT_ID", inplace=True)
raw_clinical_sample = raw_clinical_sample[raw_clinical_sample.index.isin(raw_clinical_patient.index)]
# Get the actual patient ID from the `raw_clinical_sample` dataframe
# Drop "PATIENT_ID" and rename "SAMPLE_ID" as "PATIENT_ID" and set as index
raw_clin = raw_clinical_patient.merge(raw_clinical_sample, on="PATIENT_ID", how="left").reset_index().drop(["PATIENT_ID"], axis=1)
raw_clin.rename(columns={'SAMPLE_ID':'PATIENT_ID'}, inplace=True)
raw_clin.set_index('PATIENT_ID', inplace=True)
Now, I want to drop from raw_clin all the columns that came from raw_clinical_sample, since the only columns needed from it were PATIENT_ID and SAMPLE_ID.
# Drop columns that exist in `raw_clinical_sample`
raw_clin = raw_clin[~raw_clin.isin(raw_clinical_sample)]
Traceback:
ValueError                                Traceback (most recent call last)
<ipython-input-60-45e2e83ddc00> in <module>()
     18
     19 # Drop columns that exist in `raw_clinical_sample`
---> 20 raw_clin = raw_clin[~raw_clin.isin(raw_clinical_sample)]

/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in isin(self, values)
  10514         elif isinstance(values, DataFrame):
  10515             if not (values.columns.is_unique and values.index.is_unique):
> 10516                 raise ValueError("cannot compute isin with a duplicate axis.")
  10517             return self.eq(values.reindex_like(self))
  10518         else:

ValueError: cannot compute isin with a duplicate axis.
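For context: isin with a DataFrame argument requires that the other frame's index and columns both be unique, and raw_clinical_sample is indexed by PATIENT_ID, which presumably repeats when a patient has several samples. A minimal sketch reproducing the error with made-up data:
import pandas as pd
df = pd.DataFrame({'A': [1, 2]}, index=['p1', 'p2'])
other = pd.DataFrame({'A': [1, 3]}, index=['p1', 'p1'])  # duplicated index, like raw_clinical_sample
df.isin(other)  # raises ValueError: cannot compute isin with a duplicate axis.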

There are many ways to do this.
For example, using isin:
new_df1 = df1.loc[:, ~df1.columns.isin(df2.columns)]
or with drop:
new_df1 = df1.drop(columns=df1.columns.intersection(df2.columns))
Example input:
df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(columns=['B', 'E'])
Output:
pd.DataFrame(columns=['A', 'C', 'D'])
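Applied to the question's frames, a sketch (assuming raw_clin and raw_clinical_sample as built above):
# drop from raw_clin every column that also appears in raw_clinical_sample
raw_clin = raw_clin.drop(columns=raw_clin.columns.intersection(raw_clinical_sample.columns))
Because this compares column labels rather than cell values, the duplicate-axis error from isin never comes up.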

You can use set operations for your application like this:
df1 = pd.DataFrame()
df1['string'] = ['Hello', 'Hi', 'Hola']
df1['number'] = [1, 2, 3]
df2 = pd.DataFrame()
df2['string'] = ['Hello', 'Hola']
df2['number'] = [1, 5]
ds1 = set(map(tuple, df1.values))
ds2 = set(map(tuple, df2.values))
df_out = pd.DataFrame(list(ds1.difference(ds2)))
df_out.columns = df1.columns
print(df_out)
Output (row order may vary, since sets are unordered):
  string  number
0   Hola       3
1     Hi       2
Inspired by: https://stackoverflow.com/a/18184990/7509907
Edit:
Sorry, I didn't notice you need to drop the columns. For that, you can use the following (using mozway's dummy example):
df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(columns=['B', 'E'])
ds1 = set(df1.columns)
ds2 = set(df2.columns)
cols = ds1.difference(ds2)
df = df1[list(cols)]  # indexing with a set is not supported, so convert to a list
print(df)
Output:
Empty DataFrame
Columns: [C, A, D]
Index: []
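Note that a set does not preserve the original column order, which is why the columns come out as [C, A, D]. If order matters, a comprehension over df1.columns keeps it:
cols = [c for c in df1.columns if c not in ds2]
df = df1[cols]  # columns stay in df1's original order: A, C, D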

Related

groupby with transform minmax

For every city, I want to create a new column which is a min-max scaling of another column (age).
I tried this and get: Input contains infinity or a value too large for dtype('float64').
cols = ['age']
def f(x):
    scaler1 = preprocessing.MinMaxScaler()
    x[['age_minmax']] = scaler1.fit_transform(x[cols])
    return x
df = df.groupby(['city']).apply(f)
From the comments:
df['age'].replace([np.inf, -np.inf], np.nan, inplace=True)
Or
df['age'] = df['age'].replace([np.inf, -np.inf], np.nan)
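Putting the two pieces together, a minimal sketch (the imports and the small dummy frame are assumptions, not the asker's data):
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'city': ['NY', 'NY', 'LA', 'LA'],
                   'age': [20.0, 40.0, 30.0, np.inf]})
df['age'] = df['age'].replace([np.inf, -np.inf], np.nan)  # clean infinities first

cols = ['age']
def f(x):
    scaler1 = preprocessing.MinMaxScaler()  # NaNs are ignored in fit and kept in transform
    x[['age_minmax']] = scaler1.fit_transform(x[cols])
    return x

df = df.groupby(['city']).apply(f)
print(df)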

Pandas dataframe - column with list of dictionaries, extract values and convert to comma separated values

I have the following dataframe, and I want to extract each numerical value from the list of dictionaries and keep it in the same column.
For instance, for the first row I would want to see in the data column: 179386782, 18017252, 123452
id       data
12345    [{'id': '179386782'}, {'id': 18017252}, {'id': 123452}]
Below is my code to create the dataframe above (I've hardcoded stories_data as an example):
for business_account in data:
    business_account_id = business_account[0]
    stories_data = {'data': [{'id': '179386782'}, {'id': '18017252'}, {'id': '123452'}]}
    df_stories = pd.DataFrame(stories_data.items())
    df_stories.set_index(0, inplace=True)
    df_stories = df_stories.transpose()
    df_stories['id'] = business_account_id
    col = df_stories.pop("id")
    df_stories.insert(0, col.name, col)
I've tried this: df_stories["data"].str[0], but this only returns the first element (dictionary) in the list.
Try:
df['data'] = df['data'].apply(lambda x: ', '.join([str(d['id']) for d in x]))
print(df)
# Output:
      id                          data
0  12345  179386782, 18017252, 123452
Another way:
df['data'] = df['data'].explode().str['id'].astype(str) \
    .groupby(level=0).agg(', '.join)
print(df)
# Output:
      id                          data
0  12345  179386782, 18017252, 123452
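Both versions assume every value in data is a list of dicts with an 'id' key; if some rows could be NaN or empty, add a guard first (for example, df = df.dropna(subset=['data'])).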

series.str.split(expand=True) returns error: Wrong number of items passed 2, placement implies 1

I have a series of web addresses, which I want to split at the first '.'. For example, return 'google' if the web address is 'google.co.uk'.
d1 = {'id':['1', '2', '3'], 'website':['google.co.uk', 'google.com.au', 'google.com']}
df1 = pd.DataFrame(data=d1)
d2 = {'id':['4', '5', '6'], 'website':['google.co.jp', 'google.com.tw', 'google.kr']}
df2 = pd.DataFrame(data=d2)
df_list = [df1, df2]
I use enumerate to iterate over the dataframe list:
for i, df in enumerate(df_list):
    df_list[i]['website_segments'] = df['website'].str.split('.', n=1, expand=True)
Received error: ValueError: Wrong number of items passed 2, placement implies 1
You are splitting the website, which gives you a list-like structure. Think ['google', 'co.uk']. You just want the first element of that list, so:
for i, df in enumerate(df_list):
    df_list[i]['website_segments'] = df['website'].str.split('.', n=1, expand=True)[0]
Another alternative is to use extract. It is also ~40% faster for your data:
for i, df in enumerate(df_list):
    df_list[i]['website_segments'] = df['website'].str.extract(r'(.*?)\.')
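A non-regex alternative (a sketch, equivalent for these inputs) is to split and keep the first piece:
for i, df in enumerate(df_list):
    df_list[i]['website_segments'] = df['website'].str.split('.').str[0]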

Pandas apply function on multiple columns

I am trying to apply a function to every column in a dataframe. When I do it on a single fixed column name it works, but when I try passing the column name as an argument to the function I get an error.
How do you properly pass arguments to apply a function on a data frame?
def result(row, c):
    if row[c] >= 0 and row[c] <= 1:
        return 'c'
    elif row[c] > 1 and row[c] <= 2:
        return 'b'
    else:
        return 'a'

cols = list(df.columns.values)
for c in cols:
    df[c] = df.apply(result, args = (c), axis=1)
TypeError: ('result() takes exactly 2 arguments (21 given)', u'occurred at index 0')
Input data frame format:
d = {'c1': [1, 2, 1, 0], 'c2': [3, 0, 1, 2]}
df = pd.DataFrame(data=d)
df
   c1  c2
0   1   3
1   2   0
2   1   1
3   0   2
You don't need to pass the column name to apply, since you only want to check whether the values fall in certain ranges and return 'a', 'b' or 'c'. You can make the following changes:
def result(val):
    if 0 <= val <= 1:
        return 'c'
    elif 1 < val <= 2:
        return 'b'
    return 'a'

cols = list(df.columns.values)
for c in cols:
    df[c] = df[c].apply(result)
Note that this will replace your column values.
A faster way is np.select. Note that a chained comparison like 0 <= df[col] <= 1 is ambiguous on a Series, so the conditions are combined with &:
import numpy as np
values = ['c', 'b']
for col in df.columns:
    conditions = [(df[col] >= 0) & (df[col] <= 1), (df[col] > 1) & (df[col] <= 2)]
    df[col] = np.select(conditions, values, default='a')
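For the sample input above, either approach should produce:
  c1 c2
0  c  a
1  b  c
2  c  c
3  c  b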

Find column whose name contains a specific string [duplicate]

I have a dataframe with column names, and I want to find the one that contains a certain string, but does not exactly match it. I'm searching for 'spike' in column names like 'spike-2', 'hey spike', 'spiked-in' (the 'spike' part is always continuous).
I want the column name to be returned as a string or a variable, so I access the column later with df['name'] or df[name] as normal. I've tried to find ways to do this, to no avail. Any tips?
Just iterate over DataFrame.columns; this example ends up with a list of the column names that match:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
spike_cols = [col for col in df.columns if 'spike' in col]
print(list(df.columns))
print(spike_cols)
Output:
['hey spke', 'no', 'spike-2', 'spiked-in']
['spike-2', 'spiked-in']
Explanation:
df.columns returns an Index of column names.
[col for col in df.columns if 'spike' in col] iterates over df.columns with the variable col and adds col to the resulting list if it contains 'spike'. This syntax is a list comprehension.
If you only want the resulting data set with the columns that match you can do this:
df2 = df.filter(regex='spike')
print(df2)
Output:
   spike-2  spiked-in
0        1          7
1        2          8
2        3          9
This answer uses the DataFrame.filter method to do this without list comprehension:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6]}
df = pd.DataFrame(data)
print(df.filter(like='spike').columns)
This will output just 'spike-2'. You can also use regex, as some people suggested in the comments above:
print(df.filter(regex='spike|spke').columns)
This will output both columns: ['spike-2', 'hey spke']
You can also use df.columns[df.columns.str.contains(pat = 'spike')]
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
colNames = df.columns[df.columns.str.contains(pat = 'spike')]
print(colNames)
This will output the column names: 'spike-2', 'spiked-in'
More about pandas.Series.str.contains.
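Note that str.contains treats pat as a regular expression by default; if your search string contains regex metacharacters, pass regex=False to match it literally:
df.columns[df.columns.str.contains('spike', regex=False)]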
# select columns containing 'spike'
df.filter(like='spike', axis=1)
You can also select by name or regular expression. Refer to: pandas.DataFrame.filter
df.loc[:,df.columns.str.contains("spike")]
Another solution that returns a subset of the df with the desired columns:
df[df.columns[df.columns.str.contains("spike|spke")]]
You can also use this code:
spike_cols = [x for x in df.columns[df.columns.str.contains('spike')]]
Getting name and subsetting based on Start, Contains, and Ends:
# from: https://stackoverflow.com/questions/21285380/find-column-whose-name-contains-a-specific-string
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
# from: https://cmdlinetips.com/2019/04/how-to-select-columns-using-prefix-suffix-of-column-names-in-pandas/
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html
import pandas as pd
data = {'spike_starts': [1,2,3], 'ends_spike_starts': [4,5,6], 'ends_spike': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)
print("\n")
print("----------------------------------------")
colNames_contains = df.columns[df.columns.str.contains(pat = 'spike')].tolist()
print("Contains")
print(colNames_contains)
print("\n")
print("----------------------------------------")
colNames_starts = df.columns[df.columns.str.contains(pat = '^spike')].tolist()
print("Starts")
print(colNames_starts)
print("\n")
print("----------------------------------------")
colNames_ends = df.columns[df.columns.str.contains(pat = 'spike$')].tolist()
print("Ends")
print(colNames_ends)
print("\n")
print("----------------------------------------")
df_subset_start = df.filter(regex='^spike',axis=1)
print("Starts")
print(df_subset_start)
print("\n")
print("----------------------------------------")
df_subset_contains = df.filter(regex='spike',axis=1)
print("Contains")
print(df_subset_contains)
print("\n")
print("----------------------------------------")
df_subset_ends = df.filter(regex='spike$',axis=1)
print("Ends")
print(df_subset_ends)
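For this sample data, the three name lookups should print ['spike_starts', 'ends_spike_starts', 'ends_spike'] (Contains), ['spike_starts'] (Starts), and ['ends_spike'] (Ends); the filter calls return the same matches as dataframe subsets.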