how to custom sort in spark SQL? - apache-spark-sql

For a column like

col
a
b
c

how do I get the output

col
c
a
b

Is there an equivalent of ORDER BY FIELD() (as in MySQL) in Spark SQL?
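One common way to get a custom order is to sort by a CASE expression (or when/otherwise in the DataFrame API). A minimal PySpark sketch, not from the original post, assuming a DataFrame df with a single column col and the desired order c, a, b:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["col"])

# DataFrame API: map each value to a rank and sort by it
order = F.when(F.col("col") == "c", 0).when(F.col("col") == "a", 1).otherwise(2)
df.orderBy(order).show()

# The same idea in Spark SQL, mirroring ORDER BY FIELD()
df.createOrReplaceTempView("t")
spark.sql("SELECT col FROM t ORDER BY CASE col WHEN 'c' THEN 0 WHEN 'a' THEN 1 ELSE 2 END").show()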

Related

Extracting the group that has a specific groupby aggregate value in Pandas

I'm doing a pandas groupby on a DataFrame which looks like this:

Group Value
A 1
A 2
B 2
B 3
And when I apply df.groupby('Group')['Value'].mean() I get
Group
A 1.5
B 2.5
Name: Value, dtype: float64
My end goal is to find the group that has the max aggregated value (i.e. group B).
I understand that groups.keys() can list the keys, but is there a way to directly get the group name for a specific aggregation?
Thanks!
By default, groupby sets your grouping column as the index of the aggregation result. Use idxmax:
df.groupby('Group')['Value'].mean().idxmax()
Output: 'B'
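For completeness, a minimal self-contained sketch of the same idea, with the example frame rebuilt from the question:

import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'B', 'B'], 'Value': [1, 2, 2, 3]})

means = df.groupby('Group')['Value'].mean()  # 'Group' becomes the index
print(means.idxmax())  # 'B'  - index label of the largest mean
print(means.max())     # 2.5  - the largest mean itself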

Copy the value to rows for distinct values - Spark dataframe

I have a Spark DataFrame. I want to copy a value across rows based on a specific column:
ColumnA columnB columnC
a Null Null
a 1 1
b 2 2
c Null Null
c 3 3
In this DataFrame, ColumnC is a copy of ColumnB, but I want a single value for each value of ColumnA: if a ColumnA value has both Null and a proper value, I want the proper value in all places, otherwise Null.
Required dataframe:
ColumnA columnB columnC
a Null 1
a 1 1
b 2 2
c Null 3
c 3 3
I tried partitionBy(ColumnA), but it doesn't seem to work.
Can I get Scala code for this?
If I understand correctly, you are looking to choose a B value for every A value.
This is a groupBy clause.
In order to correctly choose the value you need to find the proper aggregation function. In this case, the first function can be used:
import org.apache.spark.sql.{functions => F}
// the $"..." column syntax also requires: import spark.implicits._

df
  .groupBy($"columnA")
  .agg(F.first($"columnB", true) as "columnC")  // first non-null columnB per group
This will give you a single value of columnB for every columnA value:
columnA columnC
a 1
b 2
c 3
You can then proceed to join it with the original DF by A.
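A sketch of that join step, written in PySpark for illustration (the question asked for Scala, but the pattern is identical); agg is an assumed name for the grouped frame built above:

from pyspark.sql import functions as F

# one columnC value per columnA; ignorenulls skips the Null rows
agg = df.groupBy("columnA").agg(F.first("columnB", ignorenulls=True).alias("columnC"))

# drop the original columnC, then join the chosen value back onto every row
result = df.drop("columnC").join(agg, on="columnA", how="left")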

How can I select one column where another column has a specific value?

I have a PySpark DataFrame. How can I select values from one column where another column has a specific value? Suppose I have n columns; for 2 columns I have
A. B.
a b
a c
d f
I want all of column B where column A is a, so
A. B.
a b
a c
It's just a simple filter:
df2 = df.filter("A = 'a'")
which comes in many flavours, such as
df2 = df.filter(df.A == 'a')
df2 = df.filter(df['A'] == 'a')
or
import pyspark.sql.functions as F
df2 = df.filter(F.col('A') == 'a')

Pandas finding average in a comma-separated column

I want to group by one column, which is comma-separated, and take the mean of the other column.
My file looks like this:
ColumnA ColumnB
A, B, C 2.9
A, C 9.087
D 6.78
B, D, C 5.49
My output should look like this:
A 7.4435
B 5.645
C 5.83
D 6.135
My code is this:
df = pd.DataFrame(data.ColumnA.str.split(',', expand=True).stack(), columns= ['ColumnA'])
df = df.reset_index(drop = True)
df_avg = pd.DataFrame(df.groupby(by = ['ColumnA'])['ColumnB'].mean())
df_avg = df_avg.reset_index()
It should be along the same lines, but I can't figure it out.
In your solution, ColumnB is set as the index first to avoid losing its values after stack and Series.reset_index; finally, as_index=False is added to keep ColumnA as a column after the aggregation:
df = (df.set_index('ColumnB')['ColumnA']
        .str.split(',', expand=True)
        .stack()
        .str.strip()
        .reset_index(name='ColumnA')
        .groupby('ColumnA', as_index=False)['ColumnB']
        .mean())
print (df)
ColumnA ColumnB
0 A 5.993500
1 B 4.195000
2 C 5.825667
3 D 6.135000
Or an alternative solution with DataFrame.explode:
df = (df.assign(ColumnA=df['ColumnA'].str.split(','))
        .explode('ColumnA')
        .assign(ColumnA=lambda x: x['ColumnA'].str.strip())
        .groupby('ColumnA', as_index=False)['ColumnB']
        .mean())
print (df)
ColumnA ColumnB
0 A 5.993500
1 B 4.195000
2 C 5.825667
3 D 6.135000

pandas isin based on a single row

Now I have:
ss dd list
A B [B,E,F]
C E [C,H,E]
A C [A,D,E]
I want to rule out rows where both ss and dd are in list, so we rule out row 2. isin() checks whether ss and dd are in all rows of list each time, which does not give me the result I want.
Please do not use a loop, because my dataset is too large.
Output should be:
ss dd list
A B [B,E,F]
A C [A,D,E]
First we flatten your list column into a DataFrame and use isin (the index does matter here, which is why cdf is created with the original DataFrame's index):
cdf = pd.DataFrame(df['list'].tolist(), index=df.index)  # one column per list element
mask = cdf.isin(df.ss).any(axis=1) & cdf.isin(df.dd).any(axis=1)  # both ss and dd found in the row
df[~mask]
Out[589]:
ss dd list
0 A B [B, E, F]
2 A C [A, D, E]
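For reference, a self-contained version of the same approach, with the frame rebuilt from the question's example data:

import pandas as pd

df = pd.DataFrame({
    'ss': ['A', 'C', 'A'],
    'dd': ['B', 'E', 'C'],
    'list': [['B', 'E', 'F'], ['C', 'H', 'E'], ['A', 'D', 'E']],
})

cdf = pd.DataFrame(df['list'].tolist(), index=df.index)
mask = cdf.isin(df.ss).any(axis=1) & cdf.isin(df.dd).any(axis=1)
print(df[~mask])  # rows 0 and 2 survive; row 1 is ruled out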