My Dataframe contains 500 columns, but I only want to pick out 27 columns in a new Dataframe. How do I do that? - pandas

My Dataframe contains 500 columns, but I only want to pick out 27 columns in a new Dataframe.
How do I do that?
I tried query(), but it raised:
TypeError: query() takes from 2 to 3 positional arguments but 27 were given

query() filters rows with a boolean expression; it does not select columns. To select columns by name, index the DataFrame with a list of names:
df_new = df[["colA", "colB", "colC", ...]]
or use the filter method:
df_new = df.filter(["colA", "colB", "colC", ..])
If you are selecting columns by position instead:
df_new = df.iloc[:, 0:27] # if columns are consecutive
df_new = df.iloc[:, [0,2,10,..]] # if columns are not consecutive (the numbers refer to the column indices)
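As a self-contained sketch of the options above, the following uses a small hypothetical frame. The column names col0 to col5 and the wanted list are assumptions for illustration, standing in for the 500 columns and the 27 names:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 3 rows, 6 columns named col0..col5
df = pd.DataFrame(np.arange(18).reshape(3, 6),
                  columns=[f"col{i}" for i in range(6)])

# Hypothetical list of the column names to keep
wanted = ["col0", "col2", "col5"]

by_name = df[wanted]                  # select by label; raises KeyError on a missing name
by_filter = df.filter(items=wanted)   # same result here; skips names that do not exist
by_position = df.iloc[:, [0, 2, 5]]   # select by integer position

print(by_name.columns.tolist())
```

Keeping the names in one list also avoids the TypeError from the question, since the whole selection is passed as a single argument.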

Related

new_df = df1[df2['pin'].isin(df1['vpin'])] UserWarning: Boolean Series key will be reindexed to match DataFrame index

I'm getting the following warning while executing this line
new_df = df1[df2['pin'].isin(df1['vpin'])]
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
df1 and df2 have only one similar column and do not have the same number of rows.
I want to filter df1 based on the column in df2: if df2.pin is in df1.vpin, I want those rows.
There are multiple rows in df1 for the same df2.pin and I want to retrieve them all.
df2:
pin  count
1    10
2    20
df1:
vpin  Column B
1     Cell 2
1     Cell 4
The command works; I am just trying to get rid of the warning.
It doesn't make sense to use df2['pin'].isin(df1['vpin']) as a boolean mask to index df1: that mask carries the index of df2, so pandas has to reindex it to match df1, hence the warning.
Use instead:
new_df = df1[df1['vpin'].isin(df2['pin'])]
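A minimal runnable sketch of the corrected filter, on data modelled after the tables in the question (the third df1 row, with vpin 3, is an assumption added to show a row being dropped):

```python
import pandas as pd

# Hypothetical data modelled on the question; the vpin == 3 row is made up
df1 = pd.DataFrame({"vpin": [1, 1, 3],
                    "Column B": ["Cell 2", "Cell 4", "Cell 6"]})
df2 = pd.DataFrame({"pin": [1, 2], "count": [10, 20]})

# The mask is built from df1, so it shares df1's index and needs no reindexing
mask = df1["vpin"].isin(df2["pin"])
new_df = df1[mask]

print(new_df)
```

Both rows with vpin 1 are kept, which matches the requirement of retrieving every df1 row for a given pin.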

Create new column based on two columns

I have two columns in a dataframe. I want to create a third column that is 1 if the first column is greater than the second, and 0 otherwise, as below:
Df
Value1  value 2  Newcolumn
100     101      0
101     97       1
To compare two columns of a pandas DataFrame and write the result of the comparison to a third column, you can use np.select with this pattern:
conditions=[(condition1),(condition2)]
choices=["choice1","choice2"]
df["new_column_name"]=np.select(conditions, choices, default)
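conditions is the list of conditions to check between the two columns
choices is the list of results to return, one per condition
np.select returns, for each row, the choice belonging to the first condition that holds (or default if none does); the result is assigned to the new column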
The dataframe is:
import numpy as np
import pandas as pd
# create DataFrame
df = pd.DataFrame({'Value1': [100, 101],
                   'value 2': [101, 97]})
# define conditions
conditions = [df['Value1'] < df['value 2'],
              df['Value1'] > df['value 2']]
# define choices
choices = ['0', '1']
# create new column in DataFrame that displays results of comparisons
df['Newcolumn'] = np.select(conditions, choices, default='Tie')
Final dataframe
print(df)
Output:
   Value1  value 2  Newcolumn
0     100      101          0
1     101       97          1
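When there are only two outcomes, as in the question (1 if the first column is greater, otherwise 0), np.where may be a shorter alternative. This is a sketch and not part of the answer above; unlike the np.select version it yields integers rather than strings, and ties fall into the 0 branch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Value1": [100, 101], "value 2": [101, 97]})

# 1 where Value1 is strictly greater than value 2, otherwise 0 (ties give 0)
df["Newcolumn"] = np.where(df["Value1"] > df["value 2"], 1, 0)

print(df["Newcolumn"].tolist())
```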

pandas: appending a row to a dataframe with values derived using a user defined formula applied on selected columns

I have a dataframe as
df = pd.DataFrame(np.random.randn(5,4),columns=list('ABCD'))
I can use the following for standard calculations like mean(), sum(), etc.:
df.loc['calc'] = df[['A','D']].iloc[2:4].mean(axis=0)
Now I have two questions:
How can I apply a formula (like exp(mean()) or 2.5*mean()/sqrt(max())) to columns 'A' and 'D' for rows 2 to 4?
How can I append a row to the existing df where two values are the mean() of A and D and the other two are the result of a specific formula applied to C and B?
Q1:
You can use .apply() and lambda functions.
df.iloc[2:4,[0,3]].apply(lambda x: np.exp(np.mean(x)))
df.iloc[2:4,[0,3]].apply(lambda x: 2.5*np.mean(x)/np.sqrt(max(x)))
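To make the two calls above checkable by hand, here is a sketch on fixed numbers instead of np.random.randn (the values are hypothetical). Note that the slice 2:4 covers the rows at positions 2 and 3 only, because the end of an iloc slice is exclusive; use 2:5 if row 4 should be included.

```python
import numpy as np
import pandas as pd

# Hypothetical deterministic data in place of np.random.randn(5, 4)
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "B": [5.0, 4.0, 3.0, 2.0, 1.0],
                   "C": [0.5, 1.5, 2.5, 3.5, 4.5],
                   "D": [2.0, 4.0, 6.0, 8.0, 10.0]})

# Rows at positions 2 and 3, columns A and D
subset = df.iloc[2:4][["A", "D"]]

# Each lambda receives one column (a Series) and returns a scalar
exp_mean = subset.apply(lambda col: np.exp(col.mean()))
scaled = subset.apply(lambda col: 2.5 * col.mean() / np.sqrt(col.max()))

print(exp_mean)
print(scaled)
```

With random normal data the column maximum can be negative, in which case np.sqrt returns NaN.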
Q2:
You can build one dictionary per group of columns, merge them, and add the result as a row.
The first one holds the mean, the second the result of some custom function.
ad = dict(df[['A', 'D']].mean())
bc = dict(df[['B', 'C']].apply(lambda x: x.sum()*45))
Combine them:
ad.update(bc)
df = pd.concat([df, pd.DataFrame([ad])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
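Since DataFrame.append no longer exists in pandas 2.x, the following sketch runs the same steps end to end with pd.concat. The data is hypothetical, and the x.sum()*45 formula is simply the placeholder used in the answer:

```python
import pandas as pd

# Hypothetical data chosen so the results are easy to verify
df = pd.DataFrame({"A": [1.0, 2.0, 3.0],
                   "B": [1.0, 1.0, 1.0],
                   "C": [2.0, 2.0, 2.0],
                   "D": [4.0, 5.0, 6.0]})

ad = df[["A", "D"]].mean().to_dict()                          # column means
bc = df[["B", "C"]].apply(lambda x: x.sum() * 45).to_dict()   # custom formula

# Merge both dictionaries into a single row keyed by column name
row = {**ad, **bc}

# Wrap the row in a one-row DataFrame and concatenate; columns align by name
result = pd.concat([df, pd.DataFrame([row])], ignore_index=True)

print(result.tail(1))
```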

find max in each cell of multiple DataFrame Pandas

I have about 2000 similar DataFrames (DF1, DF2, ..., DF2000) with the same shape (same column names and index).
I want to get the max and min values in each cell (same position) across all of them.
I could iterate over the column names and index, but that would be very slow. What's the best way to do this?
Example:
columns = ['A','B','C','D']
for i in range(4):
    pd.DataFrame(np.random.randint(100, size=(4, 4)),columns=columns)
I need a DataFrame of max values with
DF_max[0,'A'] = 78
and a DataFrame of min values with
DF_min[0,'A'] = 10
Assuming you have all the DataFrames in a list:
l=[df1,df2,df3.....]
DF=pd.concat(l,keys=range(len(l))).groupby(level=1)
maxdf=DF.max()
mindf=DF.min()
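A small sketch of the concat/groupby approach above, using three hypothetical 2x2 frames so the result can be checked by eye. pd.concat with keys builds a two-level index (frame number, original row label), and grouping on level 1 collects the same row from every frame:

```python
import pandas as pd

# Three hypothetical frames with identical columns and index
dfs = [pd.DataFrame([[1, 5], [3, 4]], columns=["A", "B"]),
       pd.DataFrame([[2, 0], [9, 4]], columns=["A", "B"]),
       pd.DataFrame([[7, 2], [3, 8]], columns=["A", "B"])]

# Outer index level = position in the list, inner level = original row label
stacked = pd.concat(dfs, keys=range(len(dfs)))

# Group by the original row label and reduce each column across frames
grouped = stacked.groupby(level=1)
maxdf = grouped.max()
mindf = grouped.min()

print(maxdf)
print(mindf)
```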

Pandas: Selecting rows by list

I tried the following code to select columns from a dataframe. My dataframe has about 50 values. In the end, I want to sum the selected columns, store the sums in a new column, and then delete the selected columns.
I started with
columns_selected = ['A','B','C','D','E']
df = df[df.column.isin(columns_selected)]
but it raised AttributeError: 'DataFrame' object has no attribute 'column'
As for the sum, I don't want to write it out as
df['sum_1'] = df['A']+df['B']+df['C']+df['D']+df['E']
I also thought that something like
df['sum_1'] = df[columns_selected].sum(axis=1)
would be more convenient.
You want df[columns_selected] to sub-select the df by a list of columns;
you can then do df['sum_1'] = df[columns_selected].sum(axis=1)
To filter the df to just the columns of interest, pass a list of the columns: df = df[columns_selected]. Note that a common error is to pass the strings without the list brackets: df = df['a','b','c'], which raises a KeyError.
Note that you had a typo in your original attempt; this would have worked:
df = df.loc[:,df.columns.isin(columns_selected)]
First, you needed columns, not column. Second, you can use the boolean mask against the columns by passing it to loc as the column selection argument (ix, which older pandas versions also accepted, has since been removed):
In [49]:
df = pd.DataFrame(np.random.randn(5,5), columns=list('abcde'))
df
Out[49]:
          a         b         c         d         e
0 -0.778207  0.480142  0.537778 -1.889803 -0.851594
1  2.095032  1.121238  1.076626 -0.476918 -0.282883
2  0.974032  0.595543 -0.628023  0.491030  0.171819
3  0.983545 -0.870126  1.100803  0.139678  0.919193
4 -1.854717 -2.151808  1.124028  0.581945 -0.412732
In [50]:
cols = ['a','b','c']
df.loc[:, df.columns.isin(cols)]
Out[50]:
          a         b         c
0 -0.778207  0.480142  0.537778
1  2.095032  1.121238  1.076626
2  0.974032  0.595543 -0.628023
3  0.983545 -0.870126  1.100803
4 -1.854717 -2.151808  1.124028
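Putting the pieces together for the original goal (sum the selected columns, store the sum, then delete the selected columns), a minimal sketch on a hypothetical frame; the column named keep is an assumption standing in for the columns that are not part of the sum:

```python
import pandas as pd

# Hypothetical frame; 'keep' is a column that is not part of the sum
df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6], "keep": ["x", "y"]})
columns_selected = ["A", "B", "C"]

# Row-wise sum of the selected columns
df["sum_1"] = df[columns_selected].sum(axis=1)

# Drop the columns that went into the sum
df = df.drop(columns=columns_selected)

print(df)
```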