DataFrame Groupby apply on second dataframe? - pandas

I have 2 dataframes df1, df2. Both have id as a column. I want to compute a new column, weighted_average, in df1 that is a function of the values in df2 with the same id.
First, I think I should do df1.groupby("id"). Is it possible to use GroupBy.apply(...) and have it use values from df2? In the examples I've seen, it usually just operates on df1 values.

If they have the same id values in the same positions (and the same length), you can do something like:
df1["new column name"] = df2["column name"].apply(...)

Related

pandas - lookup a value in another DF without merging data from the other dataframe

I have 2 DataFrames. DF1 and DF2.
(Please note that DF2 has more entries than DF1)
What I want to do is add the nationality column to DF1 (The script should look up the name in DF1 and find the corresponding nationality in DF2).
I am currently using the code below:
final_DF = df1.merge(df2[['PPS', 'Nationality']], on=['PPS'], how='left')
Although the nationality column is being added to DF1, the code is duplicating entries and also pulling in additional data from DF2 that I do not want.
Is there a method to get the nationality from DF2 while only keeping the DF1 data?
Thanks
(Example DF1, DF2, and expected OUTPUT tables omitted.)
There are two points you need to address.
If there are any duplicated keys in DF2, drop them before merging, otherwise the join will multiply the DF1 rows.
You can define 'how' in the merge statement, so it will look like:
final_DF = DF1.merge(DF2, on=['Name'], how='left')
Since you want to keep only the DF1 rows, 'left' is the ideal option for you.
For more info, refer to the documentation for merge.
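Putting both points together, a minimal sketch (assuming Name is the shared key and Nationality is the column to bring over):

# Drop duplicate keys in DF2 first so the left merge does not multiply DF1's rows.
final_DF = DF1.merge(
    DF2[['Name', 'Nationality']].drop_duplicates(subset='Name'),
    on='Name',
    how='left'
)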

How to change values of a column in a Pyspark dataframe from a map of two columns of the same df

So I have a dataframe, say df, with multiple columns. I now create a dataframe from df, say map, passing only columns A and B and keeping only the unique rows. Now I want to modify df in such a way that, if for a row in df I find df['B'] in the map, then df['A'] should be the key value from the map; otherwise df['A'] remains the same.
Any useful suggestions would be appreciated.
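One possible approach in PySpark, sketched here under the assumption that the columns are literally named A and B and that the mapping from B to A is one-to-one, is to build the distinct (A, B) pairs and left-join them back on B, keeping the original A where there is no match:

from pyspark.sql import functions as F

# Lookup built from the unique (A, B) pairs; A is renamed so it does not clash on join.
mapping = df.select("A", "B").distinct().withColumnRenamed("A", "A_mapped")

# Left-join on B; take the mapped A where it exists, otherwise keep the original A.
df = (df.join(mapping, on="B", how="left")
        .withColumn("A", F.coalesce("A_mapped", "A"))
        .drop("A_mapped"))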

PySpark Aggregation on Comma Separated Column

I have a huge DataFrame with two of many columns: "NAME", "VALUE". One of the row values in the "NAME" column is "X,Y,V,A".
I want to transpose my DataFrame so the "NAME" values are columns and the average of the "VALUE" are the row values.
I used the pivot function:
df1 = df.groupby('DEVICE', 'DATE').pivot('NAME').avg('VALUE')
All NAME values except for "X,Y,V,A" work well with the above. I am not sure how to split the four values in "X,Y,V,A" and aggregate on each individual value.
IIUC, you need to split and explode the string first:
from pyspark.sql.functions import split, explode

# Split the comma-separated NAME string into an array, then explode it so each element gets its own row.
df = df.withColumn("NAME", explode(split("NAME", ",")))
Now you can group and pivot:
df1 = df.groupby('DEVICE', 'DATE').pivot('NAME').avg('VALUE')

concat series onto dataframe with column name

I want to add a Series (s) to a Pandas DataFrame (df) as a new column. The series has more values than there are rows in the dataframe, so I am using the concat method along axis 1.
df = pd.concat((df, s), axis=1)
This works, but the new column of the dataframe representing the series is given an arbitrary numerical column name, and I would like this column to have a specific name instead.
Is there a way to add a series to a dataframe, when the series is longer than the rows of the dataframe, and with a specified column name in the resulting dataframe?
You can try Series.rename:
df = pd.concat((df, s.rename('col')), axis=1)
One option is simply to specify the name when creating the series:
example_scores = pd.Series([1,2,3,4], index=['t1', 't2', 't3', 't4'], name='example_scores')
Using the name attribute when creating the series is all I needed.
Try:
df = pd.concat((df, s.rename('CoolColumnName')), axis=1)
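For what it's worth, pd.concat along axis=1 aligns on the index (an outer join), so the extra entries of the longer Series simply get NaN in the DataFrame's columns. A small sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'a': [1, 2]}, index=[0, 1])
s = pd.Series([10, 20, 30], index=[0, 1, 2])

# Row 2 exists only in the Series, so column 'a' is NaN there;
# rename() controls the name of the new column.
out = pd.concat((df, s.rename('col')), axis=1)
print(out)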

How to access (multi)index of a Data Frame?

I have a data frame and use some of its columns to group by:
grouped = df.groupby(['col1', 'col2'])
Now I use the mean function to get a new data frame from the groupby object created above:
df_new = grouped.mean()
Now I have two data frames (df and df_new) and I would like to merge them using col1 and col2. The problem I have now is that df_new does not have these columns: after the groupby operation, col1 and col2 are "shifted" into the index. So, to resolve this, I try to create these columns:
df_new['col1'] = df_new['index'][0]
df_new['col2'] = df_new['index'][1]
But it does not work because 'index' is not recognized as a column of the data frame.
As an alternative to Andy Hayden's method, you could use as_index=False to keep col1 and col2 as columns rather than moving them into the index:
df_new = df.groupby(['col1', 'col2'], as_index=False).mean()
You can use the left_index (or right_index) argument of merge:
left_index : boolean, default False
    Use the index from the left DataFrame as the join key(s). If it is a MultiIndex,
    the number of keys in the other DataFrame (either the index or a number of columns)
    must match the number of levels.
and pair it with right_on (or left_on) to say which columns of the other frame the index should be matched against. Here df_new carries the MultiIndex, so use right_index=True together with left_on. So it'll be something like:
pd.merge(df, df_new, left_on=['col1', 'col2'], right_index=True)
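A minimal end-to-end sketch of that merge, with made-up column names, where df_new keeps col1 and col2 in its MultiIndex:

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b'],
                   'col2': [1, 1, 2],
                   'val': [10, 20, 30]})

# The grouped means keep col1/col2 as a MultiIndex.
df_new = df.groupby(['col1', 'col2']).mean()

# Match df's col1/col2 columns against df_new's MultiIndex.
merged = pd.merge(df, df_new, left_on=['col1', 'col2'], right_index=True,
                  suffixes=('', '_mean'))
print(merged)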