Convert PySpark GroupedData object to Spark DataFrame - apache-spark-sql

I have to do a two-level grouping on a PySpark dataframe.
My attempt:
grouped_df=df.groupby(["A","B","C"])
grouped_df.groupby(["C"]).count()
But I get the following error:
'GroupedData' object has no attribute 'groupby'
I guess I should first convert the grouped object into a PySpark DataFrame, but I cannot figure out how to do that.
Any suggestion?

I had the same issue. The way I got around it was by first doing a "count()" after the first groupby, because that returns a Spark DataFrame, rather than the GroupedData object. Then you can do another groupby on that returned DataFrame.
So try:
grouped_df=df.groupby(["A","B","C"]).count()
grouped_df.groupby(["C"]).count()
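For example, with some made-up data (column names taken from the question), the first count() collapses the dataframe to one row per distinct (A, B, C) combination, and the second groupby then counts those combinations per C:
df = spark.createDataFrame(
    [('a1', 'b1', 'c1'), ('a1', 'b2', 'c1'), ('a2', 'b1', 'c2')], ['A', 'B', 'C'])
grouped_df = df.groupby(["A", "B", "C"]).count()   # a DataFrame again, not GroupedData
grouped_df.groupby(["C"]).count().show()           # two rows: (c1, 2) and (c2, 1)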

The function DataFrame.groupBy(cols) returns a GroupedData object. To convert a GroupedData object back to a DataFrame, you need to use one of the GroupedData aggregation functions such as mean(cols), avg(cols), or count(). An example using your data:
df = sqlContext.createDataFrame([['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']], schema=['A', 'B', 'C'])
df.show()
+---+---+---+
| A| B| C|
+---+---+---+
| a| b| c|
| a| b| c|
| a| b| c|
+---+---+---+
gdf = df.groupBy('C').count()
gdf.show()
+---+-----+
| C|count|
+---+-----+
| c| 3|
+---+-----+

https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.GroupedData
pyspark.sql.GroupedData: a set of methods for aggregations on a DataFrame, returned by DataFrame.groupBy().
You may use an aggregation function such as agg, avg, count, max, mean, min, pivot, sum, collect_list, collect_set, first, grouping, etc.
Be careful with first: it can make your script slower if you misuse it.
If you have a numeric column you can use aggregation functions such as min, max, mean, etc., but if you have a string column you may want to use one of:
df.groupBy("ID").pivot("VAR").agg(concat_ws('', collect_list(col("VAL"))))
or
df.groupBy("ID").pivot("VAR").agg(collect_list(collect_list("VAL")[0]))
or
df.groupBy("ID").pivot("VAR").agg(first("VAL"))

Related

How to transform a column with an array of json to separate columns based on the json keys in Spark?

I have a column like this:
column
[{"key":1,"value":"aaaaa"},{"key":2,"value":"bbbbb"},{"key":3,"value":"ccccc"}]
[{"key":1,"value":"abcde"},{"key":2,"value":"bcdef"}]
[{"key":1,"value":"edcba"},{"key":3,"value":"zxcvb"},{"key":4,"value":"qwert"}]
I want to separate such column base on the keys, with each key having their column.
I've tried something like this but it didn't work:
test_schema = ArrayType(StructType([StructField("key", IntegerType()), StructField("value", StringType())]))
teste = (hits_raw
.withColumn("keys", get_json_object("hitsCustomDimensions", "$[*].key"))
.withColumn("teste", explode(from_json("hitsCustomDimensions", test_schema)))
).display()
The output I want is something like this:
column_1
column_2
column_3
column_4
aaaaa
bbbbb
ccccc
null
abcde
bcdef
null
null
edcba
null
zxcvb
qwert
Parse the JSON into an array of structs, then pivot the key column. You'll need some ID column to group by for the pivot; here I used the monotonically_increasing_id function to add an id before inlining the array of structs.
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType, StringType

# same schema as in the question
test_schema = ArrayType(StructType([StructField("key", IntegerType()), StructField("value", StringType())]))
df = spark.createDataFrame([
('[{"key":1,"value":"aaaaa"},{"key":2,"value":"bbbbb"},{"key":3,"value":"ccccc"}]',),
('[{"key":1,"value":"abcde"},{"key":2,"value":"bcdef"}]',),
('[{"key":1,"value":"edcba"},{"key":3,"value":"zxcvb"},{"key":4,"value":"qwert"}]',)
], ["column"])
test = (df.withColumn("column", F.from_json("column", test_schema))
.withColumn("id", F.monotonically_increasing_id())
.selectExpr("id", "inline(column)")
.groupBy("id").pivot("key").agg(F.first("value"))
.drop("id")
)
test.show()
#+-----+-----+-----+-----+
#| 1| 2| 3| 4|
#+-----+-----+-----+-----+
#|aaaaa|bbbbb|ccccc| null|
#|abcde|bcdef| null| null|
#|edcba| null|zxcvb|qwert|
#+-----+-----+-----+-----+
You can then rename the columns to add the prefix column_* if you want:
test = test.select(*[F.col(c).alias(f"column_{c}") for c in test.columns])
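If you already know the possible keys, you can also pass them to pivot explicitly; this skips the extra job Spark otherwise runs to discover the distinct keys and fixes the column order (a sketch, with the key values 1-4 taken from the sample data above):
test = (df.withColumn("column", F.from_json("column", test_schema))
          .withColumn("id", F.monotonically_increasing_id())
          .selectExpr("id", "inline(column)")
          .groupBy("id").pivot("key", [1, 2, 3, 4]).agg(F.first("value"))
          .drop("id"))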

Spark: need confirmation on approach for capturing first and last date on a dataset

I have a data frame :
A, B, C, D, 201701, 2020001
A, B, C, D, 201801, 2020002
A, B, C, D, 201901, 2020003
expected output :
col_A, col_B, col_C ,col_D, min_week ,max_week, min_month, max_month
A, B, C, D, 201701, 201901, 2020001, 2020003
What I tried in PySpark:
from pyspark.sql import Window
import pyspark.sql.functions as psf
w1 = Window.partitionBy('A','B', 'C', 'D')\
.orderBy('WEEK','MONTH')
df_new = df_source\
.withColumn("min_week", psf.first("WEEK").over(w1))\
.withColumn("max_week", psf.last("WEEK").over(w1))\
.withColumn("min_month", psf.first("MONTH").over(w1))\
.withColumn("max_month", psf.last("MONTH").over(w1))
What I also tried:
sql_1 = """
select A, B , C, D, first(WEEK) as min_week,
last(WEEK) as max_week , first(MONTH) as min_month,
last(MONTH) as max_month from df_source
group by A, B , C, D
order by A, B , C, D
"""
df_new = spark.sql(sql_1)
Using the first and second approaches I got inconsistent results.
Will the approach below fix the issue encountered above?
sql_1 = """
select A, B , C, D, min(WEEK) as min_week,
max(WEEK) as max_week , min(MONTH) as min_month,
max(MONTH) as max_month from df_source
group by A, B , C, D
order by A, B , C, D
"""
df_new = spark.sql(sql_1)
Which approach works correctly in PySpark every time?
Is there an alternate way, or is the third option the best way to handle this requirement?
Any pointers will be helpful.
The third approach you propose will work every time. You could also write it like this:
import pyspark.sql.functions as F

(df
    .groupBy('A', 'B', 'C', 'D')
    .agg(F.min('WEEK').alias('min_week'), F.max('WEEK').alias('max_week'),
         F.min('MONTH').alias('min_month'), F.max('MONTH').alias('max_month'))
    .show())
which yields:
+---+---+---+---+--------+--------+---------+---------+
| A| B| C| D|min_week|max_week|min_month|max_month|
+---+---+---+---+--------+--------+---------+---------+
| A| B| C| D| 201701| 201901| 2020001| 2020003|
+---+---+---+---+--------+--------+---------+---------+
It is interesting to understand why the first two approaches produce unpredictable results while the third always works.
The second approach is unpredictable because Spark is a parallel computation engine. When it aggregates a value, it starts by aggregating within each partition, and then the partial results are combined two by two. The order of these combinations is not deterministic; it depends, among other things, on the order in which tasks complete, which can change at every attempt, in particular if there is a lot of data. Since first and last depend on row order, their results change with it.
The first approach is not exactly what you want to do. Window functions do not aggregate the dataframe into one single row; they compute the aggregation and add it to every row. You are also making a mistake: when you order the window, by default Spark uses a frame ranging from the start of the partition to the current row, so last('WEEK') just returns the current row's week. In fact, to compute the min and the max you do not need to order the window at all. You can just do it like this:
w = Window.partitionBy('A', 'B', 'C', 'D')
df.select('A', 'B', 'C', 'D',
          F.min('WEEK').over(w).alias('min_week'),
          F.max('WEEK').over(w).alias('max_week'),
          F.min('MONTH').over(w).alias('min_month'),
          F.max('MONTH').over(w).alias('max_month')
).show()
which yields the correct values, but repeated on every row rather than aggregated into a single row, which was not what you were expecting. At least it shows the difference between window aggregations and regular aggregations:
+---+---+---+---+--------+--------+---------+---------+
| A| B| C| D|min_week|max_week|min_month|max_month|
+---+---+---+---+--------+--------+---------+---------+
| A| B| C| D| 201701| 201901| 2020001| 2020003|
| A| B| C| D| 201701| 201901| 2020001| 2020003|
| A| B| C| D| 201701| 201901| 2020001| 2020003|
+---+---+---+---+--------+--------+---------+---------+
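As a side note, if you really wanted to keep the first/last window approach from the question, you would have to widen the frame explicitly, since the default frame of an ordered window stops at the current row. A sketch for the week columns only (the plain min/max aggregation above remains the simpler and safer option):
w1 = (Window.partitionBy('A', 'B', 'C', 'D')
            .orderBy('WEEK', 'MONTH')
            .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df_source.withColumn("min_week", F.first("WEEK").over(w1)) \
         .withColumn("max_week", F.last("WEEK").over(w1)) \
         .show()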

How to transform two arrays of each column into a pair for a Spark DataFrame?

I have a DataFrame which has two columns of array values like below
var ds = Seq((Array("a","b"),Array("1","2")),(Array("p","q"),Array("3","4")))
var df = ds.toDF("col1", "col2")
+------+------+
| col1| col2|
+------+------+
|[a, b]|[1, 2]|
|[p, q]|[3, 4]|
+------+------+
I want to transform this into an array of pairs like below
+------+------+---------------+
| col1| col2| col3|
+------+------+---------------+
|[a, b]|[1, 2]|[[a, 1],[b, 2]]|
|[p, q]|[3, 4]|[[p, 3],[q, 4]]|
+------+------+---------------+
I guess I can use struct and then some UDF, but I wanted to know if there is any built-in higher-order function to do this efficiently.
From Spark 2.4 onward, use the arrays_zip function.
Example:
df.show()
#+------+------+
#| col1| col2|
#+------+------+
#|[a, b]|[1, 2]|
#|[p, q]|[3, 4]|
#+------+------+
from pyspark.sql.functions import *
df.withColumn("col3",arrays_zip(col("col1"),col("col2"))).show()
#+------+------+----------------+
#| col1| col2| col3|
#+------+------+----------------+
#|[a, b]|[1, 2]|[[a, 1], [b, 2]]|
#|[p, q]|[3, 4]|[[p, 3], [q, 4]]|
#+------+------+----------------+
For Spark 2.3 or below, I found the zip method on Scala collections really handy for this use case (which I was unaware of while posting the question). I can define a small UDF:
val zip = udf((xs: Seq[String], ys: Seq[String]) => xs.zip(ys))
and use it as:
var out = df.withColumn("col3", zip(df("col1"), df("col2")))
This gives me the desired result.
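If you need the same thing from PySpark on Spark 2.3 or below, a roughly equivalent Python UDF would be (a sketch; zip_udf and the struct field names are my own choices):
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

pair_type = StructType([StructField("left", StringType()), StructField("right", StringType())])
zip_udf = udf(lambda xs, ys: list(zip(xs, ys)), ArrayType(pair_type))

df.withColumn("col3", zip_udf("col1", "col2")).show()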

Statistics of columns computed in parallel

Best way to get the max value in a Spark dataframe column
This post shows how to run an aggregation (distinct, min, max) on a table, something like:
for colName in df.columns:
    dt = df[[colName]].distinct().count()
    mx = df.agg({colName: "max"}).collect()[0][0]
    mn = df.agg({colName: "min"}).collect()[0][0]
    print(colName, dt, mx, mn)
This can also be done easily with compute statistics. The stats from Hive and Spark are different:
Hive gives: distinct, max, min, nulls, length, version
Spark gives: count, mean, stddev, min, max
It looks like quite a few statistics are calculated. How do I get all of them for all columns using one command?
However, I have thousands of columns and doing this serially is very slow. Suppose I want to compute some other function, say the standard deviation, on each of the columns - how can that be done in parallel?
You can use pyspark.sql.DataFrame.describe() to get aggregate statistics like count, mean, min, max, and standard deviation for all columns where such statistics are applicable. (If you don't pass in any arguments, stats for all columns are returned by default)
df = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "a"), (4, None), (None, "c")], ["id", "name"]
)
df.describe().show()
#+-------+------------------+----+
#|summary| id|name|
#+-------+------------------+----+
#| count| 4| 4|
#| mean| 2.5|null|
#| stddev|1.2909944487358056|null|
#| min| 1| a|
#| max| 4| c|
#+-------+------------------+----+
As you can see, these statistics ignore any null values.
If you're using Spark version 2.3 or later, there is also pyspark.sql.DataFrame.summary(), which supports the following aggregates:
count, mean, stddev, min, max, and arbitrary approximate percentiles specified as a percentage (e.g. 75%)
df.summary("count", "min", "max").show()
#+-------+------------------+----+
#|summary| id|name|
#+-------+------------------+----+
#| count| 4| 4|
#| min| 1| a|
#| max| 4| c|
#+-------+------------------+----+
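The approximate percentiles are requested the same way, as percentage strings:
df.summary("25%", "50%", "75%").show()
# percentiles only make sense for the numeric id column here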
If you wanted some other aggregate statistic for all columns, you could also use a list comprehension with pyspark.sql.DataFrame.agg(). For example, if you wanted to replicate what you say Hive gives (distinct, max, min and nulls - I'm not sure what length and version mean):
import pyspark.sql.functions as f
from itertools import chain
agg_distinct = [f.countDistinct(c).alias("distinct_"+c) for c in df.columns]
agg_max = [f.max(c).alias("max_"+c) for c in df.columns]
agg_min = [f.min(c).alias("min_"+c) for c in df.columns]
agg_nulls = [f.count(f.when(f.isnull(c), c)).alias("nulls_"+c) for c in df.columns]
df.agg(
    *(chain.from_iterable([agg_distinct, agg_max, agg_min, agg_nulls]))
).show()
#+-----------+-------------+------+--------+------+--------+--------+----------+
#|distinct_id|distinct_name|max_id|max_name|min_id|min_name|nulls_id|nulls_name|
#+-----------+-------------+------+--------+------+--------+--------+----------+
#| 4| 3| 4| c| 1| a| 1| 1|
#+-----------+-------------+------+--------+------+--------+--------+----------+
Note that this method returns a single row, rather than one row per statistic as describe() and summary() do.
You can put as many expressions into an agg as you want; when you collect, they all get computed at once. The result is a single row with all the values. Here's an example:
from pyspark.sql.functions import min, max, countDistinct
r = df.agg(
    min(df.col1).alias("minCol1"),
    max(df.col1).alias("maxCol1"),
    (max(df.col1) - min(df.col1)).alias("diffMinMax"),
    countDistinct(df.col2).alias("distinctItemsInCol2"))
r.printSchema()
# root
# |-- minCol1: long (nullable = true)
# |-- maxCol1: long (nullable = true)
# |-- diffMinMax: long (nullable = true)
# |-- distinctItemsInCol2: long (nullable = false)
row = r.collect()[0]
print(row.distinctItemsInCol2, row.diffMinMax)
# (10, 9)
You can also use the dictionary syntax here, but it's harder to manage for more complex things.
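For completeness, the dictionary syntax looks like this; you only get one function per column, and the output column names (e.g. max(col1)) are generated for you:
df.agg({"col1": "max", "col2": "min"}).show()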

Creating multiple columns for a grouped pyspark dataframe

I'm trying to add several new columns to my dataframe (preferably in a for loop), with each new column being the count of certain instances of col B, after grouping by column A.
What doesn't work:
import pyspark.sql.functions as f
# the first one will be fine
df_grouped=df.select('A','B').filter(df.B=='a').groupBy('A').count()
df_grouped.show()
+---+-----+
| A |count|
+---+-----+
|859| 4|
|947| 2|
|282| 6|
|699| 24|
|153| 12|
# create the second column:
df_g2=df.select('A','B').filter(df.B=='b').groupBy('A').count()
df_g2.show()
+---+-----+
| A |count|
+---+-----+
|174| 18|
|153| 20|
|630| 6|
|147| 16|
#I get an error on adding the new column:
df_grouped=df_grouped.withColumn('2nd_count',f.col(df_g2.select('count')))
The error:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
I also tried it without using f.col, and with just df_g2.count, but I get an error saying "col should be column".
Something that DOES work:
df_g1=df.select('A','B').filter(df.B=='a').groupBy('A').count()
df_g2=df.select('A','B').filter(df.B=='b').groupBy('A').count()
df_grouped=df_g1.join(df_g2,['A'])
However, I'm going to add up to around 1000 new columns, and that many joins seems costly. I wonder whether joins are inevitable, given that every time I group by column A its row order changes in the result (e.g. compare the order of column A in df_grouped with its order in df_g2 above), or whether there is a better way to do this.
What you probably need is groupby and pivot.
Try this:
df.groupby('A').pivot('B').agg(F.count('B')).show()
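A minimal runnable sketch of that suggestion, with made-up data and the import it assumes:
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [(859, 'a'), (859, 'a'), (947, 'a'), (947, 'b')], ['A', 'B'])

df.groupby('A').pivot('B').agg(F.count('B')).show()
# one row per A and one count column per distinct value of B;
# cells where a value of B never occurs for that A come out as null (use fillna(0) if you prefer zeros)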