Add a new column in a dataframe from another dataframe using category column and then finding max for each category - pandas

I have a dataframe (DF2, derived from DF1) with a category column and the count in each category. It was created from an original dataframe (DF1), which has a date column and 7 float columns. The category column was created from one of those float columns (continuous to categorical data). Now, for each of these categories, I need to find the max value of each of the remaining 6 columns and add these as new columns in DF2.
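One possible approach, sketched under assumptions: group DF1 by the category column, take the max of the remaining float columns, and merge the result into DF2. The column names (temp, x, y), the bin edges, and the "_max" suffix are all hypothetical, and only two "remaining" columns are used instead of six to keep the example short.

```python
import pandas as pd

# Hypothetical stand-in for DF1: a date column plus float columns.
df1 = pd.DataFrame(
    {
        "date": pd.date_range("2021-01-01", periods=6, freq="D"),
        "temp": [1.0, 4.0, 6.0, 9.0, 2.0, 7.0],
        "x": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
        "y": [0.5, 0.1, 0.9, 0.3, 0.7, 0.2],
    }
)

# Bin the continuous column into categories (bin edges are an assumption).
df1["category"] = pd.cut(df1["temp"], bins=[0, 5, 10], labels=["low", "high"])

# Stand-in for DF2: one row per category with its count.
df2 = df1.groupby("category", observed=True).size().reset_index(name="count")

# Max of each remaining float column per category, with suffixed names.
maxes = (
    df1.groupby("category", observed=True)[["x", "y"]]
    .max()
    .add_suffix("_max")
    .reset_index()
)

# Add the per-category maxima to DF2 as new columns.
df2 = df2.merge(maxes, on="category", how="left")
print(df2)
```

The merge on the category column keeps one row per category in DF2, so the counts stay next to the new max columns.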

Related

Pyspark dynamic column selection from dataframe

I have a dataframe with multiple columns such as t_orno, t_pono, t_sqnb, t_pric, and so on (it's a table with many columns).
The second dataframe contains the names of certain columns from the first dataframe, e.g.:
columnname
t_pono
t_pric
:
:
I need to select only those columns from the first dataframe whose names are present in the second. In the example above, that is t_pono and t_pric.
How can this be done?
Let's say the first dataframe has the following columns (obtained with df1.columns, which returns a list), and that the names stored in the columnname column of the second dataframe have been collected into a Python list:
df1_cols = ["t_orno", "t_pono", "t_sqnb", "t_pric"]
df2_cols = ["t_pono", "t_pric"]  # e.g. [row["columnname"] for row in df2.collect()]
To get only those columns from the first dataframe that are listed in the second one, you can use a set intersection (cast to a list so it can be used to select data):
list(set(df1_cols).intersection(df2_cols))
This gives the following result (a set is unordered, so the order of the names may vary):
["t_pono", "t_pric"]
To put it all together and select only those columns:
select_columns = list(set(df1_cols).intersection(df2_cols))
new_df = df1.select(*select_columns)
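Since a set does not preserve order, a list comprehension is one way to keep the columns in the order they appear in the first dataframe. This is a minimal sketch on plain Python lists; the names wanted and wanted_set are hypothetical, and the select call is left as a comment because it needs a live Spark session.

```python
# Column names of the first dataframe, as returned by df1.columns.
df1_cols = ["t_orno", "t_pono", "t_sqnb", "t_pric"]

# Names collected from the second dataframe (hypothetical values,
# including one that does not exist in the first dataframe).
wanted = ["t_pric", "t_pono", "not_in_df1"]

# Use a set for fast membership tests, but iterate over df1_cols
# so that the result keeps the first dataframe's column order.
wanted_set = set(wanted)
select_columns = [c for c in df1_cols if c in wanted_set]
print(select_columns)

# With a Spark session available, the selection would then be:
# new_df = df1.select(*select_columns)
```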

How to create an array of dataframe by filtering a data frame with a Seq of lists in Scala

I have one dataframe, Df1, with all the data; one of its columns holds values such as vl1, vl3, vl4, vl2, vl3, vl4, etc. Another dataframe, valueDf, has a single column containing the unique values vl1, vl2, vl3, etc.
I want to split valueDf into an array of 4 dataframes like
Df(0)=(vl1,vl4,vl5)
Df(1)=(vl3,vl2,vl6)
Df(2)=(vl7,vl8,vl9)... etc
so that I can do a leftsemi join on Df1 and write the results to 4 different locations
/folder_Df(0)/
/folder_Df(1)/ ... etc.
I tried doing a randomSplit as below
val Dfs = valueDf.randomSplit(Array.range(1,5).map(_.toDouble), 1)
But now I have specific sets of values that need to go into Df(0), Df(1), Df(2) based on this list: (vl1,vl4,vl5), (vl3,vl2,vl6), (vl7,vl8,vl9).

Convert Series to Dataframe where series index is Dataframe column names

I am selecting row by row as follows:
for i in range(num_rows):
    row = df.iloc[i]
As a result I get a Series object where row.index.values contains the names of the df columns.
What I want instead is a dataframe with only one row that keeps the dataframe columns in place.
When I do row.to_frame(), instead of a 1x85 dataframe (1 row, 85 cols) I get an 85x1 dataframe whose index contains the column names, and row.to_frame().columns outputs Int64Index([0], dtype='int64').
All I want is the original dataframe columns with only one row. How do I do it?
Or how do I convert the row.index values into column names and change the 85x1 shape to 1x85?
You just need to add .T to transpose the result:
row.to_frame().T
Alternatively, change your for loop to index with a list ([[i]]), so that iloc returns a one-row dataframe directly:
for i in range(num_rows):
    row = df.iloc[[i]]
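A small sketch comparing the two options on a toy dataframe (the column names a, b, c are made up for illustration). Both produce a 1xN dataframe, but with mixed column types the transpose route ends up with object columns, while the list-indexing route keeps the original dtypes.

```python
import pandas as pd

# Toy dataframe with mixed column types (hypothetical data).
df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5], "c": ["x", "y"]})

# Option 1: take the row as a Series, then convert and transpose.
row_series = df.iloc[0]
via_transpose = row_series.to_frame().T

# Option 2: index with a list so iloc returns a one-row dataframe.
via_list = df.iloc[[0]]

print(via_transpose.shape, via_list.shape)
print(via_transpose.dtypes)  # mixed-type row Series was object, so columns are object
print(via_list.dtypes)       # original column dtypes are preserved
```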

Replace a subset of pandas data frame with another data frame

I have a data frame (DF1) with 100 columns (one of the columns is ID).
I have one more data frame (DF2) with 30 columns (one column is ID).
I have to update the first 30 columns of DF1 with the values from DF2, keeping the values in the remaining columns of DF1 intact.
That is, update the first 30 of the 100 columns in DF1 whenever the ID in DF2 is present in DF1.
I tested this on Python 3.7 but I see no reason for it not to work on 2.7:
joined = df1.reset_index() \
[['index', 'ID']] \
.merge(df2, on='ID')
df1.loc[joined['index'], df1.columns[:30]] = joined.drop(columns=['index', 'ID'])
This assumes that df2 doesn't have a column called index or the merge will fail saying duplicate key with suffix.
Here is a slow-motion walkthrough of its inner workings:
1. df1.reset_index() returns a dataframe identical to df1 but with an additional column: index
2. [['index', 'ID']] extracts a dataframe containing just these 2 columns from the dataframe in #1
3. .merge(...) merges with df2, matching on ID. The result (joined) is a dataframe containing index, ID and the remaining columns of df2.
4. df1.loc[<row_indexes>, <column_names>] = <another_dataframe> means you want to replace those particular cells with data from another_dataframe. Since joined also contains index and ID, we need to drop those 2 extra columns.
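One thing to watch with this pattern: when the right-hand side of a .loc assignment is a dataframe, pandas aligns it on index and column labels rather than by position. The sketch below sidesteps alignment by assigning a plain NumPy array. It assumes that the non-ID columns of df2 carry the same names as the columns to overwrite in df1; the data, the column names, and the helper name value_cols are hypothetical.

```python
import pandas as pd

# Toy stand-ins: df1 has extra columns that must stay intact.
df1 = pd.DataFrame(
    {
        "ID": [10, 20, 30],
        "a": [1.0, 2.0, 3.0],
        "b": [4.0, 5.0, 6.0],
        "keep": ["x", "y", "z"],
    },
    index=["r0", "r1", "r2"],
)
df2 = pd.DataFrame({"ID": [30, 10], "a": [300.0, 100.0], "b": [600.0, 400.0]})

# Columns to overwrite: everything in df2 except the key.
value_cols = [c for c in df2.columns if c != "ID"]

# Map each matching ID to the row label it has in df1.
joined = df1.reset_index()[["index", "ID"]].merge(df2, on="ID")

# Assign by position with to_numpy() so that joined's own 0..n-1 index
# is not aligned against df1's row labels.
df1.loc[joined["index"].tolist(), value_cols] = joined[value_cols].to_numpy()
print(df1)
```

Rows of df1 whose ID does not appear in df2 (ID 20 here) are left untouched, as is the keep column.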

Pandas dataframe getting average of two categories and placing in to existing column

I have a dataframe df which is indexed by date (many rows share the same date). It has a column called name with the company name for each date, a rating column (A to Z), a category column (health, utilities, etc.) and finally a column called price.
The price column has many blank values along with some populated ones. I want to fill each blank with the average of the populated prices for companies that have the same rating and the same category as the row being filled.
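Try this (transform returns a result indexed like the original rows, so it can be assigned straight back):
df['adjPrice'] = df.groupby(['rating', 'category']).price.transform(lambda s: s.fillna(s.mean()))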
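A worked sketch on made-up data, assuming the column names from the question (rating, category, price) and a date index with repeated dates. It uses groupby(...).transform so that the filled values come back in the same row order as the original dataframe.

```python
import pandas as pd

# Hypothetical data: date index with duplicates, some missing prices.
df = pd.DataFrame(
    {
        "name": ["A Corp", "B Corp", "C Corp", "D Corp", "E Corp"],
        "rating": ["A", "A", "A", "B", "B"],
        "category": ["health", "health", "health", "utilities", "utilities"],
        "price": [10.0, None, 20.0, 5.0, None],
    },
    index=pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01", "2020-01-02"]
    ),
)

# Within each (rating, category) group, fill missing prices with the
# mean of the populated prices of that group.
df["adjPrice"] = df.groupby(["rating", "category"])["price"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```

A group with no populated prices at all has a mean of NaN, so its blanks stay blank.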