From Pandas groupBy to PySpark groupBy

Consider a Spark DataFrame with a few columns. The goal is to perform a groupBy operation on it without converting it to a Pandas DataFrame. An equivalent Pandas groupBy looks something like this:
def compute_metrics(x):
    return pd.Series({
        'a': x['a'].values[0],
        'new_b': np.sum(x['b']),
        'c': np.mean(x['c']),
        'cnt': len(x)
    })

data.groupby([
    'col_1',
    'col_2'
]).apply(compute_metrics).reset_index()
I intend to write this in PySpark. So far I have come up with something like this:
gdf = df.groupBy([
    'col_1',
    'col_2'
]).agg({
    'c': 'avg',
    'b': 'sum'
}).withColumnRenamed('sum(b)', 'new_b')
However, I am not sure how to handle 'a': x['a'].values[0] and 'cnt': len(x). I thought about using collect_list from pyspark.sql.functions, but that slaps my face with "'Column' object is not callable". Any idea how to accomplish the aforementioned conversion? Thanks!
[UPDATE] Would it make sense to perform a count operation on any column in order to get cnt? Say I do this:
gdf = df.groupBy([
    'col_1',
    'col_2'
]).agg({
    'c': 'avg',
    'b': 'sum',
    'some_column': 'count'
}).withColumnRenamed('sum(b)', 'new_b') \
  .withColumnRenamed('count(some_column)', 'cnt')

I have this toy solution using the PySpark functions sum, avg, count and first. Note that I use Spark 2.1 in this solution. Hope this helps a bit!
from pyspark.sql.functions import sum, avg, count, first

# create a toy example dataframe with columns 'A', 'B' and 'C'
ls = [['a', 'b', 3], ['a', 'b', 4], ['a', 'c', 3], ['b', 'b', 5]]
df = spark.createDataFrame(ls, schema=['A', 'B', 'C'])

# group by columns 'A' and 'B', then apply the aggregations
group_df = df.groupby(['A', 'B'])
df_grouped = group_df.agg(sum("C").alias("sumC"),
                          avg("C").alias("avgC"),
                          count("C").alias("countC"),
                          first("C").alias("firstC"))
df_grouped.show()  # print out the spark dataframe
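For reference, the same pattern mapped onto the column names from the question might look roughly like this (a sketch; it assumes the frame has columns col_1, col_2, a, b and c as in the pandas snippet). Importing the module as F also keeps PySpark's sum and first from shadowing Python's built-ins.
from pyspark.sql import functions as F

# sketch: reproduce compute_metrics with built-in aggregations
gdf = df.groupBy(['col_1', 'col_2']).agg(
    F.first('a').alias('a'),        # 'a': x['a'].values[0]
    F.sum('b').alias('new_b'),      # 'new_b': np.sum(x['b'])
    F.avg('c').alias('c'),          # 'c': np.mean(x['c'])
    F.count(F.lit(1)).alias('cnt')  # 'cnt': len(x), i.e. rows per group
)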

Related

How to calculate mean in a dataframe for a single year [duplicate]

I'm trying to use the groupby function. Is there a way to use a specific element rather than the column name? Example of code:
df.groupby(['Month', 'Place'])['Number'].sum()
This is what I want to do.
df.groupby(['April', 'Place'])['Number'].sum()
You need to filter your DataFrame first:
df.loc[df['Month'].eq('April')].groupby('Place')['Number'].sum()
#df[df['Month'].eq('April')].groupby('Place')['Number'].sum()
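For a self-contained illustration of the filter-then-groupby pattern, here is a tiny sketch with made-up data (the column names match the question, the values are invented):
import pandas as pd

df = pd.DataFrame({
    'Month': ['April', 'April', 'May'],
    'Place': ['X', 'X', 'Y'],
    'Number': [1, 2, 3],
})

# keep only April rows, then sum 'Number' per 'Place'
print(df.loc[df['Month'].eq('April')].groupby('Place')['Number'].sum())
# Place
# X    3
# Name: Number, dtype: int64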
Yes, you can pass as many columns as you need to groupby:
df = pd.DataFrame([['a', 'x', 1], ['a', 'x', 2], ['b', 'y',3], ['b', 'z',4]])
df.columns = ['c1', 'c2', 'c3']
df.groupby(['c1', 'c2'])['c3'].mean()
results in
c1  c2
a   x     1.5
b   y     3.0
    z     4.0
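If you would rather end up with a flat DataFrame than a MultiIndex Series, a common follow-up is to reset the index after aggregating:
df.groupby(['c1', 'c2'])['c3'].mean().reset_index()
#   c1 c2   c3
# 0  a  x  1.5
# 1  b  y  3.0
# 2  b  z  4.0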

How to concatenate values from multiple rows using Pandas?

In the screenshot, the 'Ctrl' column contains a key value. I have two duplicate rows for OTC-07 which I need to consolidate. I would like to concatenate the rest of the column values for OTC-07, i.e. OTC-07 should have Type A,B and Assertion a,b,c,d after consolidation. Can anyone help me with this? :o
First, define a dataframe with the given structure:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Ctrl': ['OTC-05', 'OTC-06', 'OTC-07', 'OTC-07', 'OTC-08'],
    'Type': ['A', 'A', 'A', 'B', np.nan],
    'Assertion': ['a,b,c', 'c,b', 'a,c', 'b,c,d', 'a,b,c']
})
df
Output:
Then replace NaN values with empty strings:
df = df.replace(np.nan, '', regex=True)
Then group by column 'Ctrl' and aggregate columns 'Type' and 'Assertion'. Please note that the assertion aggregation is a bit tricky, as you need not a simple concatenation but a sorted list of unique letters:
df.groupby(['Ctrl']).agg({
    'Type': ','.join,
    'Assertion': lambda x: ','.join(list(sorted(set(','.join(x).split(',')))))
})
Output:
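Putting the steps together, the consolidation can also be written as one short pipeline (a sketch built on the toy frame above; it uses fillna('') instead of replace for the missing Type, and reset_index to turn 'Ctrl' back into a regular column):
result = (
    df.fillna('')          # blank out the missing Type for OTC-08
      .groupby('Ctrl')
      .agg({
          'Type': ','.join,
          'Assertion': lambda x: ','.join(sorted(set(','.join(x).split(','))))
      })
      .reset_index()
)
print(result)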

Pandas DataFrame Combine Unique Values in Two Columns for OrdinalEncoder Fit

I have the Titanic dataset, and the columns in the dataframe I would like to use are 'Embarked' and 'Sex'.
df['Embarked'] and df['Sex'] have the unique values Embarked ['C','Q','S'] and Sex ['male','female'].
What I would like to do is create a list like below:
[['S','female'],['S','male'],['C','female'],['C','male'],['Q','female'],['Q','male']]
I need the unique value combinations in list format so that I can pass them to OrdinalEncoder to fit.
Scikit Learn OrdinalEncoder example:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)
enc.categories_
enc.transform([['Female', 3], ['Male', 1],['Female',2],['Male',3]])
The encoder's transform only takes a list.
If what you'd like is to find the product of the unique values of two columns in a dataframe and then turn them into a list, then this will do that!
import pandas as pd
from itertools import product

data = pd.DataFrame([['Q', 'male'], ['Q', 'male'], ['S', 'female'],
                     ['S', 'female'], ['S', 'male'], ['C', 'female'],
                     ['C', 'female'], ['C', 'male'], ['C', 'male']],
                    columns=['Embarked', 'Sex'])

print([list(x) for x in product(data['Embarked'].unique(), data['Sex'].unique())])
itertools.product gives you the Cartesian Product of a sequence of iterables. Our iterables here are lists created by calling Series.unique() on each of the DataFrame's columns to get their unique values.
Finally, the list comprehension turns itertools.product's typical return of a list of tuples into a list of lists.
A way of doing it is:
list_1 = ['C','Q','S']
list_2 = ['male','female']
X = [[x, y] for x in list_1 for y in list_2]
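Either way, the resulting list of pairs can be handed straight to the encoder. A quick check of what that gives (output shown roughly, based on the lists above):
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
enc.fit(X)              # X is the list of [Embarked, Sex] pairs built above
print(enc.categories_)  # roughly: [array(['C', 'Q', 'S'], ...), array(['female', 'male'], ...)]
print(enc.transform([['S', 'male'], ['C', 'female']]))
# [[2. 1.]
#  [0. 0.]]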

Apply Series function to the whole dataframe

Well, I know that a function on each "cell" can be applied to the whole dataframe using applymap().
However, is there any way to apply a Series function, e.g. str.upper(), to the whole dataframe?
Yes, it can be passed directly to the applymap method of the dataframe.
Demo:
df = pd.DataFrame([['a', 'b'], ['c', 'd'], ['e', 'f']])
df
Various possibilities:
1) applymap dataframe:
df.applymap(str.upper)
2) stack + unstack combo:
df.stack().str.upper().unstack()
3) apply series:
df.apply(lambda x: x.str.upper())
All produce the same result: the frame with every string uppercased.
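A minimal end-to-end run for reference, using the frame defined above:
import pandas as pd

df = pd.DataFrame([['a', 'b'], ['c', 'd'], ['e', 'f']])
print(df.applymap(str.upper))
#    0  1
# 0  A  B
# 1  C  D
# 2  E  F
Note that recent pandas versions (2.1+) deprecate applymap in favour of DataFrame.map, so df.map(str.upper) is the forward-compatible spelling of option 1.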

Pandas - understanding output of pivot table

Here is my example:
import pandas as pd

df = pd.DataFrame({
    'Student': ['A', 'B', 'B'],
    'Assessor': ['C', 'D', 'D'],
    'Score': [72, 19, 92]})

df = df.pivot_table(
    index='Student',
    columns='Assessor',
    values='Score',
    aggfunc=lambda x: x)

print(df)
The output looks like:
Assessor    C       D
Student
A          72     NaN
B         NaN  [1, 2]
I am not sure why I get '[1,2]' as output. I would expect something like:
Assessor    C    D
Student
A          72  NaN
B         NaN   19
B         NaN   92
Here is a related question:
If I replace my dataframe with
df = pd.DataFrame({
    'Student': ['A', 'B', 'B'],
    'Assessor': ['C', 'D', 'D'],
    'Score': ['foo', 'bar', 'foo']})
the output of the same pivot is going to be:
Process finished with exit code 255
Any thoughts?
pivot_table finds the unique values of the index/columns and aggregates if there are multiple rows in the original DataFrame in a particular cell.
Indexes/columns are generally meant to be unique, so if you want to get the data in that form, you have to do something a little ugly like this, although you probably don't want to:
In [21]: pivoted = pd.DataFrame(columns=df['Assessor'], index=df['Student'])

In [22]: for row in df.itertuples(index=False):
    ...:     pivoted.loc[row.Student, row.Assessor] = row.Score
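If the goal is just to see every score that lands in a cell, passing an explicit aggregation such as list is usually simpler than the loop above (a sketch, not from the original answer; output shown roughly):
df.pivot_table(index='Student', columns='Assessor',
               values='Score', aggfunc=list)
# Assessor    C         D
# Student
# A        [72]       NaN
# B         NaN  [19, 92]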
For your second question, the reason that groupby generally fails is that there are no numeric columns to aggregate, although it seems to be a bug that it completely crashes like that. I added a note to the issue here.
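For the string example, supplying a string-friendly aggregation avoids the failure entirely, e.g. (again a sketch; ', '.join concatenates the scores that share a cell):
df = pd.DataFrame({
    'Student': ['A', 'B', 'B'],
    'Assessor': ['C', 'D', 'D'],
    'Score': ['foo', 'bar', 'foo']})

df.pivot_table(index='Student', columns='Assessor',
               values='Score', aggfunc=', '.join)
# Assessor    C         D
# Student
# A         foo       NaN
# B         NaN  bar, foo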