Pandas: How to get position of columns?

I need help getting the position of a column, or some other way to read the column two steps to the left of the column Spannung.
Exceldata = pd.read_excel(str(Dateien[0]), header=[2])
print(Dateien[0])
Spannung = Exceldata.columns[Exceldata.columns.str.contains('Spannung effektiv L1')]
print(Spannung)

IIUC you can use .get_loc
So:
pos = Exceldata.columns.get_loc(Spannung[0])
then you can index left:
other_col = Exceldata.columns[pos -2]
Example:
In [169]:
df = pd.DataFrame(columns=['hello','world','python','pandas','Spannung effektiv L1', 'asdas'])
spannung = df.columns[df.columns.str.contains('Spannung')]
spannung
Out[169]:
Index(['Spannung effektiv L1'], dtype='object')
In [178]:
pos = df.columns.get_loc(spannung[0])
df.columns[pos-2]
Out[178]:
'python'
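If you need the data of that column and not just its name, the same position can be passed to iloc. A minimal sketch, assuming the matching column name is unique; the one-row frame below is made up and only stands in for the Excel sheet:

```python
import pandas as pd

# Hypothetical stand-in for the Excel sheet; only the column order matters here.
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6]],
    columns=['hello', 'world', 'python', 'pandas', 'Spannung effektiv L1', 'asdas'],
)

# Position of the first column whose name contains the search text.
matches = df.columns[df.columns.str.contains('Spannung effektiv L1')]
pos = df.columns.get_loc(matches[0])

# Guard against wrapping around: a negative position would count from the end.
if pos < 2:
    raise IndexError("no column two positions to the left")

left_name = df.columns[pos - 2]    # column label
left_values = df.iloc[:, pos - 2]  # column data, selected by position
```

The guard matters because columns[pos - 2] silently wraps to the end of the frame when pos is 0 or 1.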

Related

How do I drop columns in a pandas dataframe that exist in another dataframe?

How do I drop columns in raw_clin if the same columns already exist in raw_clinical_sample? Using isin raised "ValueError: cannot compute isin with a duplicate axis."
Explanation of the code:
I want to merge raw_clinical_patient and raw_clinical_sample dataframes. However, the SAMPLE_ID column in raw_clinical_sample should be relabeled as PATIENT_ID before the merge (because it was wrongly labelled). I want the new PATIENT_ID to be the index of raw_clin.
import pandas as pd
# Clinical patient info
raw_clinical_patient = pd.read_csv("./gbm_tcga/data_clinical_patient.txt", sep="\t", header=4)
raw_clinical_patient["PATIENT_ID"] = raw_clinical_patient["PATIENT_ID"].replace()
raw_clinical_patient.set_index("PATIENT_ID", inplace=True)
raw_clinical_patient.sort_index()
# Clinical sample info
raw_clinical_sample = pd.read_csv("./gbm_tcga/data_clinical_sample.txt", sep="\t", header=4)
raw_clinical_sample.set_index("PATIENT_ID", inplace=True)
raw_clinical_sample = raw_clinical_sample[raw_clinical_sample.index.isin(raw_clinical_patient.index)]
# Get the actual patient ID from the `raw_clinical_sample` dataframe
# Drop "PATIENT_ID" and rename "SAMPLE_ID" as "PATIENT_ID" and set as index
raw_clin = raw_clinical_patient.merge(raw_clinical_sample, on="PATIENT_ID", how="left").reset_index().drop(["PATIENT_ID"], axis=1)
raw_clin.rename(columns={'SAMPLE_ID':'PATIENT_ID'}, inplace=True)
raw_clin.set_index('PATIENT_ID', inplace=True)
Now I want to drop from raw_clin all the columns that came from raw_clinical_sample, since the only columns needed from it were PATIENT_ID and SAMPLE_ID.
# Drop columns that exist in `raw_clinical_sample`
raw_clin = raw_clin[~raw_clin.isin(raw_clinical_sample)]
Traceback:
ValueError Traceback (most recent call last)
<ipython-input-60-45e2e83ddc00> in <module>()
18
19 # Drop columns that exist in `raw_clinical_sample`
---> 20 raw_clin = raw_clin[~raw_clin.isin(raw_clinical_sample)]
/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in isin(self, values)
10514 elif isinstance(values, DataFrame):
10515 if not (values.columns.is_unique and values.index.is_unique):
> 10516 raise ValueError("cannot compute isin with a duplicate axis.")
10517 return self.eq(values.reindex_like(self))
10518 else:
ValueError: cannot compute isin with a duplicate axis.
You have many ways to do this.
For example using isin:
new_df1 = df1.loc[:, ~df1.columns.isin(df2.columns)]
or with drop:
new_df1 = df1.drop(columns=df1.columns.intersection(df2.columns))
example input:
df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(columns=['B', 'E'])
output:
pd.DataFrame(columns=['A', 'C', 'D'])
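If some of the shared columns have to survive (the question wants to keep the ID column), you can subtract them from the intersection before dropping. A rough sketch; the two small frames and the keep list are assumptions for illustration, not the real TCGA data:

```python
import pandas as pd

# Hypothetical miniature frames standing in for raw_clin and raw_clinical_sample.
merged = pd.DataFrame({'AGE': [61, 48], 'SEX': ['F', 'M'],
                       'SAMPLE_ID': ['s1', 's2'], 'TUMOR_TYPE': ['x', 'y']})
sample = pd.DataFrame({'SAMPLE_ID': ['s1', 's2'], 'TUMOR_TYPE': ['x', 'y']})

# Columns present in both frames, minus the ones to keep (assumed: SAMPLE_ID).
keep = ['SAMPLE_ID']
to_drop = merged.columns.intersection(sample.columns).difference(keep)

result = merged.drop(columns=to_drop)
```

This only compares column labels, so it does not hit the duplicate-axis check that DataFrame.isin performs on a whole frame.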
You can use set operations for your application like this:
df1 = pd.DataFrame()
df1['string'] = ['Hello', 'Hi', 'Hola']
df1['number'] = [1, 2, 3]
df2 = pd.DataFrame()
df2['string'] = ['Hello', 'Hola']
df2['number'] = [1, 5]
ds1 = set(map(tuple, df1.values))
ds2 = set(map(tuple, df2.values))
df_out = pd.DataFrame(list(ds1.difference(ds2)))
df_out.columns = df1.columns
print(df_out)
Output:
  string  number
0   Hola       3
1     Hi       2
Inspired by: https://stackoverflow.com/a/18184990/7509907
Edit:
Sorry I didn't notice you need to drop the columns. For that, you can use the following: (using mozway's dummy example)
df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(columns=['B', 'E'])
ds1 = set(df1.columns)
ds2 = set(df2.columns)
cols = ds1.difference(ds2)
df = df1[list(cols)]  # recent pandas rejects a set as an indexer; column order is arbitrary
print(df)
Output:
Empty DataFrame
Columns: [C, A, D]
Index: []

groupby with transform minmax

For every city, I want to create a new column that is the min-max scaling of another column (age).
I tried this and got "Input contains infinity or a value too large for dtype('float64')".
from sklearn import preprocessing

cols = ['age']

def f(x):
    scaler1 = preprocessing.MinMaxScaler()
    x[['age_minmax']] = scaler1.fit_transform(x[cols])
    return x

df = df.groupby(['city']).apply(f)
From the comments: the error means age contains infinite values. Replace them with NaN before scaling:
df['age'].replace([np.inf, -np.inf], np.nan, inplace=True)
Or
df['age'] = df['age'].replace([np.inf, -np.inf], np.nan)
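Once the infinities are gone, the scaling itself can also be done with groupby + transform instead of sklearn. This is only a sketch of plain min-max scaling on made-up data, not a drop-in replacement for MinMaxScaler:

```python
import numpy as np
import pandas as pd

# Hypothetical data; 'city' and 'age' are the column names from the question.
df = pd.DataFrame({
    'city': ['A', 'A', 'A', 'B', 'B'],
    'age': [20.0, 30.0, np.inf, 40.0, 60.0],
})

# Replace infinities first so they do not poison the per-group min and max.
df['age'] = df['age'].replace([np.inf, -np.inf], np.nan)

def minmax(s):
    # Scale one group's values to [0, 1]; NaN stays NaN.
    # A constant group gives 0/0 = NaN here.
    return (s - s.min()) / (s.max() - s.min())

df['age_minmax'] = df.groupby('city')['age'].transform(minmax)
```

transform returns a result aligned with the original index, so the new column can be assigned directly without reassigning df.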

Quantile across rows and down columns using selected columns only [duplicate]

I have a dataframe with column names, and I want to find the one that contains a certain string, but does not exactly match it. I'm searching for 'spike' in column names like 'spike-2', 'hey spike', 'spiked-in' (the 'spike' part is always continuous).
I want the column name to be returned as a string or a variable, so that I can access the column later with df['name'] or df[name] as normal. I've tried to find ways to do this, to no avail. Any tips?
Just iterate over DataFrame.columns. In this example you end up with a list of the column names that match:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
spike_cols = [col for col in df.columns if 'spike' in col]
print(list(df.columns))
print(spike_cols)
Output:
['spike-2', 'hey spke', 'spiked-in', 'no']
['spike-2', 'spiked-in']
Explanation:
df.columns returns an Index of the column names, which you can iterate over like a list.
[col for col in df.columns if 'spike' in col] iterates over df.columns with the variable col and adds col to the resulting list if it contains 'spike'. This syntax is called a list comprehension.
If you only want the resulting data set with the columns that match you can do this:
df2 = df.filter(regex='spike')
print(df2)
Output:
   spike-2  spiked-in
0        1          7
1        2          8
2        3          9
This answer uses the DataFrame.filter method to do this without list comprehension:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6]}
df = pd.DataFrame(data)
print(df.filter(like='spike').columns)
Will output just 'spike-2'. You can also use regex, as some people suggested in comments above:
print(df.filter(regex='spike|spke').columns)
Will output both columns: ['spike-2', 'hey spke']
You can also use df.columns[df.columns.str.contains(pat = 'spike')]
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
colNames = df.columns[df.columns.str.contains(pat = 'spike')]
print(colNames)
This will output the column names: 'spike-2', 'spiked-in'
More about pandas.Series.str.contains.
# select columns containing 'spike'
df.filter(like='spike', axis=1)
You can also select by exact name or by regular expression; see pandas.DataFrame.filter.
df.loc[:,df.columns.str.contains("spike")]
Another solution that returns a subset of the df with the desired columns:
df[df.columns[df.columns.str.contains("spike|spke")]]
You also can use this code:
spike_cols =[x for x in df.columns[df.columns.str.contains('spike')]]
Getting name and subsetting based on Start, Contains, and Ends:
# from: https://stackoverflow.com/questions/21285380/find-column-whose-name-contains-a-specific-string
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
# from: https://cmdlinetips.com/2019/04/how-to-select-columns-using-prefix-suffix-of-column-names-in-pandas/
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html
import pandas as pd
data = {'spike_starts': [1,2,3], 'ends_spike_starts': [4,5,6], 'ends_spike': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)
print("\n")
print("----------------------------------------")
colNames_contains = df.columns[df.columns.str.contains(pat = 'spike')].tolist()
print("Contains")
print(colNames_contains)
print("\n")
print("----------------------------------------")
colNames_starts = df.columns[df.columns.str.contains(pat = '^spike')].tolist()
print("Starts")
print(colNames_starts)
print("\n")
print("----------------------------------------")
colNames_ends = df.columns[df.columns.str.contains(pat = 'spike$')].tolist()
print("Ends")
print(colNames_ends)
print("\n")
print("----------------------------------------")
df_subset_start = df.filter(regex='^spike',axis=1)
print("Starts")
print(df_subset_start)
print("\n")
print("----------------------------------------")
df_subset_contains = df.filter(regex='spike',axis=1)
print("Contains")
print(df_subset_contains)
print("\n")
print("----------------------------------------")
df_subset_ends = df.filter(regex='spike$',axis=1)
print("Ends")
print(df_subset_ends)
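The question asks for a single name as a plain string. A small sketch, assuming the first match is the one you want; the frame is made up:

```python
import pandas as pd

df = pd.DataFrame({'spike-2': [1, 2, 3], 'hey spke': [4, 5, 6], 'no': [7, 8, 9]})

# regex=False treats 'spike' as a literal substring rather than a pattern.
matches = df.columns[df.columns.str.contains('spike', regex=False)]

# Take the first match as a plain str, or None when nothing matches.
name = matches[0] if len(matches) else None

values = df[name] if name is not None else None
```

Checking len(matches) first avoids an IndexError when no column name contains the substring.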

pandas: map color argument by multidict

I would like to map a color to each row in the dataframe as a function of two columns. It would be much easier with just one column as argument. But how can I achieve this with two columns ?
What I have done so far:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

a = np.random.rand(3,10)
i = [[30,10], [10, 30], [60, 60]]
names = ['a', 'b']
index = pd.MultiIndex.from_tuples(i, names = names)
df = pd.DataFrame(a, index=index).reset_index()
c1 = plt.cm.Greens(np.linspace(0.2,0.8,3))
c2 = plt.cm.Blues(np.linspace(0.2,0.8,3))
#c3 = plt.cm.Reds(np.linspace(0.2,0.8,3))
color = np.vstack((c1,c2))
a = df.a.sort_values().values
b = df.b.sort_values().values
# nested dict: mapping[a_value][b_value] -> RGBA color
mapping = dict()
for i in range(len(a)):
    mapping[a[i]] = {}
    for ii in range(len(b)):
        mapping[a[i]][b[ii]] = color[i+ii]
Maybe something similar to df['color'] = df.apply(lambda x: mapping[x.a][x.b]) ?
Looks like you answered your own question. apply works across the rows when you change the axis argument to 1:
df['color'] = df.apply(lambda x: mapping[x.a][x.b], axis=1)
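A reduced version of the lookup, in case it helps to see it run. The frame and the color names here are made up (plain strings instead of RGBA rows), and the zip variant is just an alternative that avoids apply:

```python
import pandas as pd

# Hypothetical frame and colors; plain strings stand in for the RGBA rows.
df = pd.DataFrame({'a': [30, 10, 60], 'b': [10, 30, 60]})
mapping = {
    10: {30: 'green'},
    30: {10: 'blue'},
    60: {60: 'red'},
}

# Row-wise apply with axis=1: each row is passed to the lambda as a Series.
df['color'] = df.apply(lambda row: mapping[row['a']][row['b']], axis=1)

# Equivalent lookup without apply: iterate over the two columns in parallel.
df['color_zip'] = [mapping[a][b] for a, b in zip(df['a'], df['b'])]
```

Both columns end up identical; the list comprehension is usually faster on large frames because it skips building a Series per row.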

Why am I returned an object when using std() in Pandas?

The average of the spreads prints correctly, grouped by ticker. Why do I get the following for the std_deviation column instead of the standard deviation of the spread grouped by ticker?
pandas.core.groupby.SeriesGroupBy object at 0x000000000484A588
df = pd.read_csv('C:\\Users\\William\\Desktop\\tickdata.csv',
                 dtype={'ticker': str, 'bidPrice': np.float64, 'askPrice': np.float64, 'afterHours': str},
                 usecols=['ticker', 'bidPrice', 'askPrice', 'afterHours'],
                 nrows=3000000
                 )
df = df[df.afterHours == "False"]
df = df[df.bidPrice != 0]
df = df[df.askPrice != 0]
df['spread'] = (df.askPrice - df.bidPrice)
df['std_deviation'] = df['spread'].std(ddof=0)
df = df.groupby(['ticker'])
print(df['std_deviation'])
print(df['spread'].mean())
UPDATE: I no longer get an object back, but now I am trying to figure out how to have the standard deviation displayed by ticker
df['spread'] = (df.askPrice - df.bidPrice)
df2 = df.groupby(['ticker'])
print(df2['spread'].mean())
df = df.set_index('ticker')
print(df['spread'].std(ddof=0))
UPDATE2: got the dataset I needed using
df = df[df.afterHours == "False"]
df = df[df.bidPrice != 0]
df = df[df.askPrice != 0]
df['spread'] = (df.askPrice - df.bidPrice)
print(df.groupby(['ticker'])['spread'].mean())
print(df.groupby(['ticker'])['spread'].std(ddof=0))
This line:
df = df.groupby(['ticker'])
assigns df to a DataFrameGroupBy object, and
df['std_deviation']
is a SeriesGroupBy object (of the column).
It's a good idea not to "shadow" / re-assign one variable to a completely different datatype. Try to use a different variable name for the groupby!
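For example, with the groupby kept under its own name, df stays a DataFrame and both statistics come out per ticker. A minimal sketch on made-up tick data; spread_by_ticker and summary are names chosen here for illustration:

```python
import pandas as pd

# Hypothetical tick data with the 'ticker' and 'spread' columns from the question.
df = pd.DataFrame({
    'ticker': ['AAA', 'AAA', 'BBB', 'BBB'],
    'spread': [0.1, 0.3, 0.2, 0.2],
})

# Keep the DataFrame and the groupby object under separate names.
spread_by_ticker = df.groupby('ticker')['spread']

summary = pd.DataFrame({
    'mean': spread_by_ticker.mean(),
    'std': spread_by_ticker.std(ddof=0),  # population standard deviation
})

# Broadcast each ticker's std back onto its rows; df is still a DataFrame.
df['std_deviation'] = spread_by_ticker.transform(lambda s: s.std(ddof=0))
```

summary has one row per ticker, while df['std_deviation'] repeats each ticker's value on every row of that ticker.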