pandas data frame columns - how to select a subset of columns based on multiple criteria

Let us say I have the following columns in a data frame:
title
year
actor1
actor2
cast_count
actor1_fb_likes
actor2_fb_likes
movie_fb_likes
I want to select the following columns from the data frame and ignore the rest:
the first 2 columns (title and year)
some columns based on name - cast_count
some columns which contain the string "actor1" - actor1 and actor1_fb_likes
I am new to pandas. For each of the above operations, I know which method to use. But I want to do all three operations together, since all I want is a dataframe containing the above columns for further analysis. How do I do this?
Here is example code that I have written:
import pandas as pd

data = {
    "title": ['Hamlet', 'Avatar', 'Spectre'],
    "year": ['1979', '1985', '2007'],
    "actor1": ['Christoph Waltz', 'Tom Hardy', 'Doug Walker'],
    "actor2": ['Rob Walker', 'Christian Bale', 'Tom Hardy'],
    "cast_count": ['15', '24', '37'],
    "actor1_fb_likes": [545, 782, 100],
    "actor2_fb_likes": [50, 78, 35],
    "movie_fb_likes": [1200, 750, 475],
}
df_input = pd.DataFrame(data)
print(df_input)
df1 = df_input.iloc[:, 0:2]            # select the first 2 columns
df2 = df_input[['cast_count']]         # select some columns by name - cast_count
df3 = df_input.filter(like='actor1')   # select columns which contain the string "actor1" - actor1 and actor1_fb_likes
df_output = pd.concat(df1, df2, df3)   # this throws an error whose cause I don't understand
print(df_output)

Question 1:
df_1 = df[['title', 'year']]
Question 2:
# This is an example but you can put whatever criteria you'd like
df_2 = df[df['cast_count'] > 10]
Question 3:
# This is an example but you can put whatever criteria you'd like this way
df_3 = df[(df['actor1_fb_likes'] > 1000) & (df['actor1'] == 'Tom Hardy')]
Make sure each filter is contained within its own set of parentheses () before using the & or | operators. & acts as an AND operator; | acts as an OR operator.
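As for the pd.concat error in the posted snippet: pd.concat expects the frames as a single list (or other iterable), not as separate positional arguments, so pd.concat(df1, df2, df3) ends up passing df2 to the axis parameter. A minimal fix for the original three-step selection (axis=1 places the frames side by side as columns):
df_output = pd.concat([df1, df2, df3], axis=1)  # frames go in a list; axis=1 joins on columns
Alternatively, a single regex filter can cover all three criteria at once, since the pattern 'actor1' also matches 'actor1_fb_likes':
df_output = df_input.filter(regex='^(title|year|cast_count|actor1)')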

Related

Pyspark dynamic column selection from dataframe

I have a dataframe with multiple columns such as t_orno, t_pono, t_sqnb, t_pric, and so on (it's a table with multiple columns).
The 2nd dataframe contains the names of certain columns from the 1st dataframe. E.g.
columnname
t_pono
t_pric
:
:
I need to select only those columns from the 1st dataframe whose names are present in the 2nd. In the above example: t_pono and t_pric.
How can this be done?
Let's say you have the following columns (which can be obtained using df.columns, which returns a list):
df1_cols = ["t_orno", "t_pono", "t_sqnb", "t_pric"]
df2_cols = ["columnname", "t_pono", "t_pric"]
To get only those columns from the first dataframe that are present in the second one, you can do set intersection (and I cast it to a list, so it can be used to select data):
list(set(df1_cols).intersection(df2_cols))
And we get the result:
["t_pono", "t_pric"]
To put it all together and select only those columns:
select_columns = list(set(df1_cols).intersection(df2_cols))
new_df = df1.select(*select_columns)
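Note that a set does not preserve order; if the column order of the first dataframe matters, a list comprehension such as [c for c in df1_cols if c in set(df2_cols)] keeps it. Also, if the second dataframe stores the names as rows under a columnname column (as the question describes) rather than as its own column names, the list has to be collected first. A sketch, assuming an active SparkSession and a columnname column in df2:
# pull the wanted names out of df2's rows, then select them from df1 in order
wanted = set(row['columnname'] for row in df2.select('columnname').collect())
select_columns = [c for c in df1.columns if c in wanted]  # preserves df1's column order
new_df = df1.select(*select_columns)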

How to select only rows containing specific values across multiple data frames in a for loop?

I'm new to Python. I have multiple data frames, and I want to filter each of them down to the rows where one column contains the value xxx.
Below is my code:
MasterFiles = [Master_Jun22, Master_May22, Master_Apr22, Master_Mar22, Master_Feb22, Master_Jan22,
               Master_Dec21, Master_Nov21, Master_Oct21, Master_Sep21, Master_Aug21, Master_Jul21,
               Master_Jun21, Master_May21, Master_Apr21]
ColumName = ['product_category']
for d in MasterFiles:
    for c in ColumName:
        d = d.loc[d[c] == 'XXX']
It is not working; please help with this.
You need to gather the filtered pieces and concatenate them into a new DataFrame:
import pandas as pd

MasterFiles = [Master_Jun22, Master_May22, Master_Apr22, Master_Mar22, Master_Feb22, Master_Jan22,
               Master_Dec21, Master_Nov21, Master_Oct21, Master_Sep21, Master_Aug21, Master_Jul21,
               Master_Jun21, Master_May21, Master_Apr21]
ColumName = ['product_category']

filtered = []
for d in MasterFiles:
    for c in ColumName:
        filtered.append(d.loc[d[c] == 'XXX'])
res_df = pd.concat(filtered, ignore_index=True)

# the results
res_df.head()
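Note that the original snippet fails for two reasons: res_df.append[...] uses square brackets on a method, and DataFrame.append returned a new frame rather than modifying res_df in place (it was deprecated in pandas 1.4 and removed in 2.0). Collecting the pieces in a list and calling pd.concat once is the idiomatic replacement.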
I am not sure if I am understanding your question correctly. So, please let me rephrase your question here.
You have 3 tasks,
first is to loop through each pandas data frame,
second is to loop through each column in your ColumName list, and
third is to return the data frame rows whose value is Surabhi - DCL - Unsecured in the columns named in the ColumName list.
If I am interpreting this correctly, this is how I would work on your issue.
MasterFiles = [Master_Jun22, Master_May22, Master_Apr22, Master_Mar22, Master_Feb22, Master_Jan22,
               Master_Dec21, Master_Nov21, Master_Oct21, Master_Sep21, Master_Aug21, Master_Jul21,
               Master_Jun21, Master_May21, Master_Apr21]
ColumName = ['product_category']

## list to store the data frames filtered by rows
df_temp = []
for d in MasterFiles:
    for c in ColumName:
        df_temp.append(d.loc[d[c] == 'Surabhi - DCL - Unsecured'])

## assuming row-wise concatenation,
## i.e., using the same column names to join data
df = pd.concat(df_temp, axis=0, ignore_index=True)
## df is the data frame you need

Select column names in pandas based on multiple prefixes

I have a large dataframe, from which I want to select the specific columns that start with several different prefixes. My current solution is shown below:
df = pd.DataFrame(columns=['flg_1', 'flg_2', 'ab_1', 'ab_2', 'aaa', 'bbb'], data=np.array([1,2,3,4,5,6]).reshape(1,-1))
flg_vars = df.filter(regex='^flg_')
ab_vars = df.filter(regex='^ab_')
result = pd.concat([flg_vars, ab_vars], axis=1)
Is there a more efficient way of doing this? I need to filter my original data based on 8 prefixes, which leads to excessive lines of code.
Use | for regex OR:
result = df.filter(regex='^flg_|^ab_')
print (result)
   flg_1  flg_2  ab_1  ab_2
0      1      2     3     4
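With eight prefixes, the regex can be built from a list instead of being written out by hand. A small sketch (the prefix list here is made up; substitute your own):
import re

prefixes = ['flg_', 'ab_']  # extend to all 8 prefixes
pattern = '^(?:' + '|'.join(re.escape(p) for p in prefixes) + ')'
result = df.filter(regex=pattern)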

replace values for specific rows more efficiently in pandas / Python

I have two data frames. Based on a condition that I get from a list (whose length is 2 million), I find the rows that match the condition; for those rows, I replace the values in columns x and y of the first data frame with the values of x and y from the second data frame. Here is my code, but it is very slow and makes my computer freeze. Any idea how I can do this more efficiently?
for ids in List_id:
    a = df1.index[df1['id'] == ids].values[0]
    b = df2.index[df2['id'] == ids].values[0]
    df1['x'][a] = df2['x'][b]
    df1['y'][a] = df2['y'][b]
Thank you.
--
Example:
List_id = [1, 11, 12, 13]

ids = 1
a = df1.index[df1['id'] == 1].values[0]   # print(a) gives 234
b = df2.index[df2['id'] == 1].values[0]   # print(b) gives 789
df1['x'][a] = 0
df2['x'][b] = 15
So at the end I want in my data frame 1:
df1['x'][a] = df2['x'][b]
Assuming you don't have repeated ids in either dataframe, you can try something like the below:
step 1: filter df2
step 2: join df1 with the filtered frame
step 3: replace the values in the joined frame and drop the extra columns
df2_filtered = df2[df2['id'].isin(List_id)]
join_df = df1.set_index('id').join(df2_filtered.set_index('id'), rsuffix='_ignore', how='left')
# other columns from df2 will be null; you can use that to find the rows that need to be updated
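A possible completion of step 3 (a sketch, assuming df2 carries only the columns id, x, and y, so the join adds x_ignore and y_ignore):
mask = join_df['x_ignore'].notna()                      # rows whose id matched List_id
join_df.loc[mask, 'x'] = join_df.loc[mask, 'x_ignore']
join_df.loc[mask, 'y'] = join_df.loc[mask, 'y_ignore']
df1 = join_df.drop(columns=['x_ignore', 'y_ignore']).reset_index()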

How to concat 3 dataframes with each occupying sequential columns

I'm trying to understand how to concat three individual dataframes (i.e. df1, df2, df3) into a new dataframe, say df4, where each dataframe occupies its own column(s) in left-to-right order.
I've tried using concat with axis=1 to do this, but it appears impossible to automate with a single action.
Table1_updated = pd.DataFrame(columns=['3P','2PG-3Io','3Io'])
Table1_updated=pd.concat([get_table1_3P,get_table1_2P_max_3Io,get_table1_3Io])
Note that, with the exception of get_table1_2P_max_3Io, which has two columns, all the other dataframes have one column.
For example, get_table1_3P, get_table1_2P_max_3Io, and get_table1_3Io, along with the desired output, were shown as tables in the original post (images not reproduced here).
I believe you need to concat first and then change the order using a list of column names:
Table1_updated=pd.concat([get_table1_3P,get_table1_2P_max_3Io,get_table1_3Io], axis=1)
Table1_updated = Table1_updated[['3P','2PG-3Io','3Io']]
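A minimal runnable sketch of the idea, with hypothetical one-column frames standing in for the posted tables:
import pandas as pd

# hypothetical stand-ins for the three posted tables
get_table1_3P = pd.DataFrame({'3P': [1, 2]})
get_table1_2P_max_3Io = pd.DataFrame({'2PG-3Io': [3, 4]})
get_table1_3Io = pd.DataFrame({'3Io': [5, 6]})

Table1_updated = pd.concat([get_table1_3P, get_table1_2P_max_3Io, get_table1_3Io], axis=1)
Table1_updated = Table1_updated[['3P', '2PG-3Io', '3Io']]  # enforce the left-to-right order
print(Table1_updated)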