Pyspark dynamic column selection from dataframe

I have a dataframe with multiple columns such as t_orno, t_pono, t_sqnb, t_pric, and so on (it's a table with many columns).
The 2nd dataframe contains the names of certain columns from the 1st dataframe, e.g.
columnname
t_pono
t_pric
:
:
I need to select only those columns from the 1st dataframe whose names are present in the 2nd. In the above example: t_pono, t_pric.
How can this be done?

Let's say you have the following columns (which can be obtained using df.columns, which returns a list):
df1_cols = ["t_orno", "t_pono", "t_sqnb", "t_pric"]
df2_cols = ["columnname", "t_pono", "t_pric"]
To get only those columns from the first dataframe that are present in the second one, you can do set intersection (and I cast it to a list, so it can be used to select data):
list(set(df1_cols).intersection(df2_cols))
And we get the result:
["t_pono", "t_pric"]
To put it all together and select only those columns:
select_columns = list(set(df1_cols).intersection(df2_cols))
new_df = df1.select(*select_columns)
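For completeness, here is a self-contained sketch of the same approach with the requested names read out of the second dataframe at runtime; the column name columnname comes from the question, and the sample data is made up purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for the two dataframes described in the question.
df1 = spark.createDataFrame(
    [(1, 100, 1, 9.99), (2, 200, 2, 5.50)],
    ["t_orno", "t_pono", "t_sqnb", "t_pric"],
)
df2 = spark.createDataFrame([("t_pono",), ("t_pric",)], ["columnname"])

# Collect the wanted names from df2 into a Python list, then keep only
# those that actually exist in df1 before selecting.
wanted = [row["columnname"] for row in df2.select("columnname").collect()]
select_columns = [c for c in df1.columns if c in wanted]

new_df = df1.select(*select_columns)
new_df.show()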

Related

How to create an array of dataframe by filtering a data frame with a Seq of lists in Scala

One dataframe, Df1, holds all the data; one of its columns contains values vl1, vl3, vl4, vl2, vl3, vl4, etc. Another dataframe, valueDf, has a single column with the unique values vl1, vl2, vl3, etc.
I want to split valueDf into an array of 4 dataframes like
Df(0)=(vl1,vl4,vl5)
Df(1)=(vl3,vl2,vl6)
Df(2)=(vl7,vl8,vl9)... etc
So that I can do a leftsemi join on Df1 and write to 4 different locations:
/folder_Df(0)/
/folder_Df(1)/ ... etc.
I tried doing a randomSplit as below
val Dfs = valueDf.randomSplit(Array.range(1,5).map(_.toDouble), 1)
But the values need to go into Df(0), Df(1), Df(2) based on these fixed lists: (vl1,vl4,vl5), (vl3,vl2,vl6), (vl7,vl8,vl9), so a random split does not do what I need.
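For illustration, here is a rough sketch of the grouped leftsemi-join flow described above, written in PySpark for consistency with the rest of this page (the Scala DataFrame API is analogous); the value groups and output paths are only placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for Df1: a payload column plus the value column to filter on.
df1 = spark.createDataFrame(
    [("a", "vl1"), ("b", "vl3"), ("c", "vl7"), ("d", "vl4")],
    ["payload", "value"],
)

# Fixed groupings of values, one list per output location (assumed, not random).
groups = [["vl1", "vl4", "vl5"], ["vl3", "vl2", "vl6"], ["vl7", "vl8", "vl9"]]

for i, vals in enumerate(groups):
    value_df = spark.createDataFrame([(v,) for v in vals], ["value"])
    part = df1.join(value_df, on="value", how="leftsemi")
    # part.write.parquet(f"/folder_Df({i})/")  # one folder per group, as in the question
    part.show()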

Joining or merging a column to a dataframe [duplicate]

This question already has answers here: Pandas Merging 101.
We have two dataframes exported from Excel. Both have a column called "PN", which was set during the export. "First" and "Second" are the variables holding those dataframes. "Third" stores a list of coincidences between the two "PN" columns. The pandas merge method used to work without such a list, but since it is now not working, I added the list as well:
gnida = []
for h in first['PN']:
    for u in zip(second['PN'], second['P']):
        if h == u[0]:
            gnida.append(u)
third = pd.DataFrame(gnida)
I need the values from the second dataframe to be placed on the rows where a coincidence occurs. If I simply merge:
fourth = first.merge(second)
columns whose names are not in the first df are added, but the output is just one row of headings with no rows of values.
If I merge
fourth = first.merge(third)
I get:
No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False.
If I additionally specify left_on="PN", I get:
object of type 'NoneType' has no len().
Thus, how can I merge or join the two dataframes so that one column of the second dataframe is used as a key, placing its values in a new column where a coincidence occurs? Thank you.
If you wish to merge by the index, just use fourth = first.join(third).
Otherwise, you need to create a dataframe from third, add the column that you want to merge by, and use:
fourth = first.merge(third, on='name_of_the_column')
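Putting it together on made-up data (assuming the column names PN and P from the question), a plain left merge on PN does what the manual loop above was approximating:

import pandas as pd

first = pd.DataFrame({"PN": ["A1", "A2", "A3"], "qty": [5, 2, 7]})
second = pd.DataFrame({"PN": ["A1", "A3"], "P": [1.5, 9.0]})

# Keep every row of `first`; fill P where PN matches, NaN where it doesn't.
fourth = first.merge(second, on="PN", how="left")
print(fourth)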

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at an hourly level with several columns. I want to extract the entire rows (containing all columns) of the top 10 values of a specific column for every year in my dataframe.
So far I ran the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem here is that I only get the top 10 values of that specific column for each year, and I lose the other columns. How can I do this operation and keep the values of the other columns that correspond to the top 10 values per year of my 'totaldemand' column?
We usually do head after sort_values:
df = df.sort_values('totaldemand', ascending=False)
df = df.groupby(df.index.year).head(10)
nlargest can be applied to each group, passing the column in which to look for the largest values. So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
Get the index of your query and use it as a mask on your original df:
idx = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.get_level_values(1)
df.loc[idx]
(or something to that extent, I can't test now without any test data)
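Here is a small, self-contained sketch of the per-group nlargest approach on made-up hourly data (only the column name totaldemand is taken from the question):

import numpy as np
import pandas as pd

idx = pd.date_range("2019-01-01", periods=24 * 400, freq="H")
df = pd.DataFrame(
    {"totaldemand": np.random.rand(len(idx)), "other": np.arange(len(idx))},
    index=idx,
)

# nlargest on each yearly group returns whole rows, so all columns survive.
top10 = df.groupby(df.index.year, group_keys=False).apply(
    lambda grp: grp.nlargest(10, "totaldemand")
)
print(top10)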

Convert Series to Dataframe where series index is Dataframe column names

I am selecting row by row as follows:
for i in range(num_rows):
    row = df.iloc[i]
As a result I am getting a Series object where row.index.values contains the names of the df columns.
But I wanted instead a dataframe with only one row, keeping the dataframe columns in place.
When I do row.to_frame(), instead of a 1x85 dataframe (1 row, 85 cols) I get an 85x1 dataframe whose index contains the names of the columns, and whose .columns attribute outputs
Int64Index([0], dtype='int64').
But all I want is just the original dataframe columns with only one row. How do I do it?
Or, how do I convert the row.index values into column names and change the 85x1 shape to 1x85?
You just need to add .T:
row.to_frame().T
Alternatively, change your for loop by adding [] so each row comes back as a one-row dataframe:
for i in range(num_rows):
    row = df.iloc[[i]]
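A quick sketch on a made-up 3-column frame, contrasting the two options:

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

row_series = df.iloc[0]                # a Series indexed by the column names
as_frame_t = row_series.to_frame().T   # 1x3 dataframe via transpose
as_frame = df.iloc[[0]]                # 1x3 dataframe directly; per-column dtypes preserved

print(as_frame_t.shape, as_frame.shape)  # (1, 3) (1, 3)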

Replace a subset of pandas data frame with another data frame

I have a data frame (DF1) with 100 columns (one of the columns is ID).
I have one more data frame (DF2) with 30 columns (one column is ID).
I have to update the first 30 columns of DF1 with the values from DF2, keeping the values in the remaining columns of DF1 intact.
In other words, update the first 30 of the 100 columns in DF1 whenever the ID from DF2 is present in DF1.
I tested this on Python 3.7 but I see no reason for it not to work on 2.7:
joined = df1.reset_index()[['index', 'ID']].merge(df2, on='ID')
df1.loc[joined['index'], df1.columns[:30]] = joined.drop(columns=['index', 'ID'])
This assumes that df2 doesn't have a column called index; otherwise the merge will suffix the duplicated column names and the code will break.
Here is a slow-motion view of its inner workings:
1. df1.reset_index() returns a dataframe the same as df1 but with an additional column: index.
2. [['index', 'ID']] extracts a dataframe containing just these 2 columns from the dataframe in #1.
3. .merge(...) merges with df2, matching on ID. The result (joined) is a dataframe with 32 columns: index, ID and the original 30 columns of df2.
4. df1.loc[<row_indexes>, <column_names>] = <another_dataframe> means you want to replace those particular cells with data from another_dataframe. Since joined has 32 columns, we need to drop the extra 2 (index and ID).
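As a scaled-down, self-contained illustration of the same idea (3 data columns in df1 and 2 in df2, all values made up; .values is used here so the assignment is positional rather than label-aligned):

import pandas as pd

df1 = pd.DataFrame({
    "c1": [1, 2, 3], "c2": [4, 5, 6], "c3": [7, 8, 9], "ID": ["x", "y", "z"],
})
df2 = pd.DataFrame({"c1": [10, 30], "c2": [40, 60], "ID": ["x", "z"]})

# Map df2's rows back to df1's row positions via the shared ID column.
joined = df1.reset_index()[["index", "ID"]].merge(df2, on="ID")

# Overwrite only the matching rows and only the first two columns of df1.
df1.loc[joined["index"], df1.columns[:2]] = joined.drop(columns=["index", "ID"]).values

print(df1)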