Pyspark Multiple JOINS Column <> Row values: Reducing Actions - apache-spark-sql

I have a master table 'Table 1' with 3 columns (shown below). Tables 2.1, 3.1 & 4.1 correspond to the 3 unique dates present in Table 1 and need to be used to populate column 'Points 1'. Similarly, Tables 2.2, 3.2 & 4.2 correspond to the same 3 unique dates and need to be used to populate column 'Points 2'.
Current Approach:
from pyspark.sql.functions import col, lit, when

df1 = spark.table("Table1")
df2_1 = spark.table("table2.1")
df2_1 = df2_1.withColumn("Date", lit(3312019))
# Each comparison needs its own parentheses because & binds tighter than ==
df3 = df1.join(df2_1, (df1.ID == df2_1.ID) & (df1.Date == df2_1.Date), 'left')
df4 = df3.withColumn('Points', when(df3.Category == 'A', col('A'))
    .when(df3.Category == 'B', col('B'))
    .when(df3.Category == 'C', col('C'))
    .when(df3.Category == 'D', col('D'))
    .otherwise(lit(None)))
The current approach makes my code lengthy if implemented for all 6 tables. Any suggestions to shorten it and reduce the number of actions?

I don't know if this is much shorter or "cleaner" than your version, but since you asked for help on this, I will post this as an answer. Please note that my answer is in regular spark (scala) - not pyspark, but it shouldn't be too difficult to port it to pyspark, if you find the answer useful :)
So here goes:
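First, a little helper function:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.lit
import spark.implicits._ // assumes a SparkSession named spark (automatic in spark-shell)

// Turn one wide row (ID, Date, A, B, C, D) into four rows (ID, Category, Date, Points); nulls become 0
def columns2rows(row: Row) = {
  val id = row.getInt(0)
  val date = row.getInt(1)
  val cols = Seq("A", "B", "C", "D")
  cols.indices.map(index => (id, cols(index), date, if (row.isNullAt(index+2)) 0 else row.getInt(index+2)))
}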
Then union together the tables needed to populate "Points1"
val df1 = table21.withColumn("Date", lit(3312019))
.unionByName(table31.withColumn("Date", lit(12312019)))
.unionByName(table41.withColumn("Date", lit(5302020)))
.select($"ID", $"Date", $"A", $"B", $"C", $"D")
.flatMap(row => columns2rows(row))
.toDF("ID", "Category", "Date", "Points1")
Then union together the tables needed to populate "Points2"
val df2 = table22.withColumn("Date", lit(3312019))
.unionByName(table32.withColumn("Date", lit(12312019)))
.unionByName(table42.withColumn("Date", lit(5302020)))
.select($"ID", $"Date", $"A", $"B", $"C", $"D")
.flatMap(row => columns2rows(row))
.toDF("ID", "Category", "Date", "Points2")
Join them together and finally with the original table:
val joiningTable = df1.join(df2, Seq("ID", "Category", "Date"))
val res = table1.join(joiningTable, Seq("ID", "Category", "Date"))
...and voila - printing the final result:
res.show()
+---+--------+--------+-------+-------+
| ID|Category|    Date|Points1|Points2|
+---+--------+--------+-------+-------+
|123|       A| 3312019|     40|     20|
|123|       B| 5302020|     10|     90|
|123|       D| 5302020|      0|     80|
|123|       A|12312019|     20|     10|
|123|       B|12312019|      0|     10|
|123|       B| 3312019|     60|     60|
+---+--------+--------+-------+-------+
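Since the question was about PySpark, here is a rough sketch of how the same unpivot-and-union idea might be ported. It is not the answer author's code: it swaps the row-level flatMap for Spark SQL's stack() generator, and the names stack_expr, unpivot and CATEGORIES are hypothetical. It assumes each table has columns ID, A, B, C, D and that the date literals match the type of Table 1's Date column.

```python
from functools import reduce

CATEGORIES = ["A", "B", "C", "D"]  # assumed category columns, as in the Scala helper


def stack_expr(categories, value_name):
    """Build a stack() expression: one output row per category column.

    Nulls are replaced with 0, mirroring the isNullAt check in the Scala helper.
    """
    pairs = ", ".join(f"'{c}', coalesce({c}, 0)" for c in categories)
    return f"stack({len(categories)}, {pairs}) as (Category, {value_name})"


def unpivot(tables_by_date, value_name):
    """tables_by_date: dict mapping a date literal to a DataFrame (ID, A, B, C, D)."""
    parts = [
        df.selectExpr("ID", f"{date} as Date", stack_expr(CATEGORIES, value_name))
        for date, df in tables_by_date.items()
    ]
    # Union all per-date pieces into one long table
    return reduce(lambda left, right: left.unionByName(right), parts)


# Hypothetical usage, with table21 ... table42 and table1 already loaded as DataFrames:
# points1 = unpivot({3312019: table21, 12312019: table31, 5302020: table41}, "Points1")
# points2 = unpivot({3312019: table22, 12312019: table32, 5302020: table42}, "Points2")
# keys = ["ID", "Category", "Date"]
# res = table1.join(points1.join(points2, keys), keys)
```

Everything stays lazy until an action such as res.show() is called, so the six tables are read in a single job rather than one per join.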

Related

How to transform a column with an array of json to separate columns based on the json keys in Spark?

I have a column like this:
column
[{"key":1,"value":"aaaaa"},{"key":2,"value":"bbbbb"},{"key":3,"value":"ccccc"}]
[{"key":1,"value":"abcde"},{"key":2,"value":"bcdef"}]
[{"key":1,"value":"edcba"},{"key":3,"value":"zxcvb"},{"key":4,"value":"qwert"}]
I want to split this column based on the keys, with each key getting its own column.
I've tried something like this but it didn't work:
test_schema = ArrayType(StructType([StructField("key", IntegerType()), StructField("value", StringType())]))
teste = (hits_raw
.withColumn("keys", get_json_object("hitsCustomDimensions", "$[*].key"))
.withColumn("teste", explode(from_json("hitsCustomDimensions", test_schema)))
).display()
The output I want is something like this:
column_1  column_2  column_3  column_4
aaaaa     bbbbb     ccccc     null
abcde     bcdef     null      null
edcba     null      zxcvb     qwert
Parse the column into an array of structs, then pivot the key column. You'll need some ID column to group by for the pivot. Here I used the monotonically_increasing_id function to add an id before inlining the array of structs.
from pyspark.sql import functions as F
# test_schema is the ArrayType(StructType([...])) schema defined in the question
df = spark.createDataFrame([
('[{"key":1,"value":"aaaaa"},{"key":2,"value":"bbbbb"},{"key":3,"value":"ccccc"}]',),
('[{"key":1,"value":"abcde"},{"key":2,"value":"bcdef"}]',),
('[{"key":1,"value":"edcba"},{"key":3,"value":"zxcvb"},{"key":4,"value":"qwert"}]',)
], ["column"])
test = (df.withColumn("column", F.from_json("column", test_schema))
.withColumn("id", F.monotonically_increasing_id())
.selectExpr("id", "inline(column)")
.groupBy("id").pivot("key").agg(F.first("value"))
.drop("id")
)
test.show()
#+-----+-----+-----+-----+
#|    1|    2|    3|    4|
#+-----+-----+-----+-----+
#|aaaaa|bbbbb|ccccc| null|
#|abcde|bcdef| null| null|
#|edcba| null|zxcvb|qwert|
#+-----+-----+-----+-----+
You can then rename the columns to add the prefix column_* if you want:
test = test.select(*[F.col(c).alias(f"column_{c}") for c in test.columns])
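To sanity-check the Spark result on a small sample, the same parse-then-pivot logic can be sketched in plain Python. This is only an illustration of what the pivot computes, not a replacement for the Spark code; pivot_keys is a hypothetical helper name.

```python
import json


def pivot_keys(json_rows):
    """Parse each JSON array of {"key", "value"} objects and pivot keys into columns."""
    parsed = [{item["key"]: item["value"] for item in json.loads(raw)} for raw in json_rows]
    # Every key seen in any row becomes a column, in sorted order
    keys = sorted({k for row in parsed for k in row})
    header = [f"column_{k}" for k in keys]
    # A key missing from a row yields None, like null in the pivot output
    table = [[row.get(k) for k in keys] for row in parsed]
    return header, table


rows = [
    '[{"key":1,"value":"aaaaa"},{"key":2,"value":"bbbbb"},{"key":3,"value":"ccccc"}]',
    '[{"key":1,"value":"abcde"},{"key":2,"value":"bcdef"}]',
    '[{"key":1,"value":"edcba"},{"key":3,"value":"zxcvb"},{"key":4,"value":"qwert"}]',
]
header, table = pivot_keys(rows)
print(header)
for line in table:
    print(line)
```

One difference to keep in mind: the plain-Python version keeps the input order, whereas the rows coming out of groupBy in Spark are not guaranteed to be in any particular order.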

Merging two dataframes using Pyspark

I have 2 DF to merge:
DF1 --> contains Stocks
Plant Art_nr Tot
A X 5
B Y 4
DF2 --> contains open deliveries
Plant Art_nr Tot
A X 1
C Z 3
I would like to obtain a DF3 where for each combination of Plant and Art_nr:
- if there is a match between DF1.Plant&Art_nr and DF2.Plant&Art_nr I get the difference between DF1 and DF2
- if there is no match between DF1.Plant&Art_nr and DF2.Plant&Art_nr I keep the original values from DF1 and DF2
DF3 -->
Plant Art_nr Total
A X 4
B Y 4
C Z 3
I created a "Concat" field in DF1 and DF2 to concatenate Plant and Art_nr and I tried with a full join + when + otherwise but I can't find the correct syntax
DF1.join(DF2, ["Concat"],"full").withColumn("Total",when(DF1.Concat.isin(DF2.Concat)), DF1.Tot - DF2.Tot).otherwise(when(not(DF1.Concat.isin(DF2.Concat)), DF1.Tot)).show()
Any suggestions about alternative functions I could use, or how to correctly use those?
You have to join both dataframes and then apply either a case (if-else) expression or the coalesce function.
This can be done in multiple ways; here are a few examples.
Option1: Use the coalesce function as an alternative to CASE WHEN ... IS NULL
from pyspark.sql.functions import coalesce, lit,abs
cond = [df1.Plant == df2.Plant, df1.Art_nr == df2.Art_nr]
df1.join(df2,cond,'full') \
.select(coalesce(df1.Plant,df2.Plant).alias('Plant')
,coalesce(df1.Art_nr,df2.Art_nr).alias('Art_nr')
,abs(coalesce(df1.Tot,lit(0)) - coalesce(df2.Tot,lit(0))).alias('Tot')
).show()
Option2: Use case expression within selectExpr()
cond = [df1.Plant == df2.Plant, df1.Art_nr == df2.Art_nr]
df1.alias('a').join(df2.alias('b'),cond,'full') \
.selectExpr("CASE WHEN a.Plant IS NULL THEN b.Plant ELSE a.Plant END AS Plant",
"CASE WHEN a.Art_nr IS NULL THEN b.Art_nr ELSE a.Art_nr END AS Art_nr",
"abs(coalesce(a.Tot,0) - coalesce(b.Tot,0)) AS Tot") \
.show()
#+-----+------+---+
#|Plant|Art_nr|Tot|
#+-----+------+---+
#|    A|     X|  4|
#|    B|     Y|  4|
#|    C|     Z|  3|
#+-----+------+---+
Option3: Use when().otherwise()
from pyspark.sql.functions import when,coalesce, lit,abs
cond = [df1.Plant == df2.Plant, df1.Art_nr == df2.Art_nr]
df1.join(df2,cond,'full') \
.select(when(df1.Plant.isNull(),df2.Plant).otherwise(df1.Plant).alias('Plant')
,when(df1.Art_nr.isNull(),df2.Art_nr).otherwise(df1.Art_nr).alias('Art_nr')
,abs(coalesce(df1.Tot,lit(0)) - coalesce(df2.Tot,lit(0))).alias('Tot')
).show()
Output:
#+-----+------+---+
#|Plant|Art_nr|Tot|
#+-----+------+---+
#|    A|     X|  4|
#|    B|     Y|  4|
#|    C|     Z|  3|
#+-----+------+---+
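Use a UDF; it seems verbose but gives more clarity:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf, array

def score(arr):
    # arr holds [DF1.Tot, DF2.Tot]; one side is None when the full join found no match
    if arr[0] is None:
        return int(arr[1])
    elif arr[1] is None:
        return int(arr[0])
    return int(arr[0]) - int(arr[1])

udf_final = udf(score, IntegerType())
cond = [DF1.Plant == DF2.Plant, DF1.Art_nr == DF2.Art_nr]
# Reference Tot through each DataFrame, since both sides have a column with that name
DF1.join(DF2, cond, "full").withColumn("final_score", udf_final(array(DF1.Tot, DF2.Tot)))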
I would probably do a union with a groupBy and some reformatting to avoid using UDFs and without large blocks of code.
from pyspark.sql.functions import *
DF3 = DF1.union(DF2.withColumn("Tot", col("Tot") * (-1)))
DF3 = DF3.groupBy("Plant", "Art_nr").agg(sum("Tot").alias("Tot"))
DF3 = DF3.withColumn("Tot", abs(col("Tot")))
I'm not 100% sure if there are no side effects I wasn't considering and if it fits your needs.
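To make the possible side effects easier to see, here is a plain-Python sketch of what the union + groupBy + abs steps compute. It is an illustration under the assumption that the rows are (Plant, Art_nr, Tot) tuples; merge_totals is a hypothetical name.

```python
from collections import defaultdict


def merge_totals(stocks, deliveries):
    """Subtract deliveries from stocks per (Plant, Art_nr) key.

    Keys present on only one side keep their own value.
    """
    totals = defaultdict(int)
    for plant, art_nr, tot in stocks:
        totals[(plant, art_nr)] += tot
    for plant, art_nr, tot in deliveries:
        totals[(plant, art_nr)] -= tot  # same as multiplying Tot by -1 before the union
    # abs() mirrors the final withColumn step
    return {key: abs(value) for key, value in totals.items()}


df1_rows = [("A", "X", 5), ("B", "Y", 4)]
df2_rows = [("A", "X", 1), ("C", "Z", 3)]
result = merge_totals(df1_rows, df2_rows)
print(result)
```

Two things follow from this sketch: duplicate (Plant, Art_nr) rows within one input are summed before the subtraction, and abs() hides the sign, so a stock of 1 with an open delivery of 5 yields 4, the same as the reverse case.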

Statistics of Columns computed parallely

Best way to get the max value in a Spark dataframe column
This post shows how to run an aggregation (distinct, min, max) on a table something like:
for colName in df.columns:
    dt = df[[colName]].distinct().count()
    mx = df.agg({colName: "max"}).collect()[0][0]
    mn = df.agg({colName: "min"}).collect()[0][0]
    print(colName, dt, mx, mn)
This can easily be done with compute statistics. The stats from Hive and Spark are different:
Hive gives - distinct, max, min, nulls, length, version
Spark gives - count, mean, stddev, min, max
It looks like quite a few statistics are calculated. How can I get all of them for all columns using one command?
However, I have 1000s of columns and doing this serially is very slow. Suppose I want to compute some other function, say standard deviation, on each of the columns. How can that be done in parallel?
You can use pyspark.sql.DataFrame.describe() to get aggregate statistics like count, mean, min, max, and standard deviation for all columns where such statistics are applicable. (If you don't pass in any arguments, stats for all columns are returned by default)
df = spark.createDataFrame(
[(1, "a"),(2, "b"), (3, "a"), (4, None), (None, "c")],["id", "name"]
)
df.describe().show()
#+-------+------------------+----+
#|summary|                id|name|
#+-------+------------------+----+
#|  count|                 4|   4|
#|   mean|               2.5|null|
#| stddev|1.2909944487358056|null|
#|    min|                 1|   a|
#|    max|                 4|   c|
#+-------+------------------+----+
As you can see, these statistics ignore any null values.
If you're using Spark 2.3 or later, there is also pyspark.sql.DataFrame.summary(), which supports the following aggregates:
- count
- mean
- stddev
- min
- max
- arbitrary approximate percentiles specified as a percentage (e.g., 75%)
df.summary("count", "min", "max").show()
#+-------+------------------+----+
#|summary|                id|name|
#+-------+------------------+----+
#|  count|                 4|   4|
#|    min|                 1|   a|
#|    max|                 4|   c|
#+-------+------------------+----+
If you wanted some other aggregate statistic for all columns, you could also use a list comprehension with pyspark.sql.DataFrame.agg(). For example, if you wanted to replicate what you say Hive gives (distinct, max, min and nulls - I'm not sure what length and version mean):
import pyspark.sql.functions as f
from itertools import chain
agg_distinct = [f.countDistinct(c).alias("distinct_"+c) for c in df.columns]
agg_max = [f.max(c).alias("max_"+c) for c in df.columns]
agg_min = [f.min(c).alias("min_"+c) for c in df.columns]
agg_nulls = [f.count(f.when(f.isnull(c), c)).alias("nulls_"+c) for c in df.columns]
df.agg(
*(chain.from_iterable([agg_distinct, agg_max, agg_min, agg_nulls]))
).show()
#+-----------+-------------+------+--------+------+--------+--------+----------+
#|distinct_id|distinct_name|max_id|max_name|min_id|min_name|nulls_id|nulls_name|
#+-----------+-------------+------+--------+------+--------+--------+----------+
#|          4|            3|     4|       c|     1|       a|       1|         1|
#+-----------+-------------+------+--------+------+--------+--------+----------+
Though this method will return one row, rather than one row per statistic as describe() and summary() do.
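If one row per column is easier to read, especially with thousands of columns, the single wide row can be reshaped on the driver after collecting it. The following is a minimal sketch, assuming the row has been turned into a dict (for example with df.agg(...).first().asDict()) and that the keys follow the "<stat>_<column>" aliases used above; reshape_stats is a hypothetical helper.

```python
def reshape_stats(wide_row, columns, stats=("distinct", "max", "min", "nulls")):
    """Turn {"max_id": 4, ...} into {"id": {"max": 4, ...}, ...}.

    wide_row: dict keyed by "<stat>_<column>".
    columns: the original column names, e.g. df.columns.
    """
    return {c: {s: wide_row[f"{s}_{c}"] for s in stats} for c in columns}


# The single result row from the example output, written out as a dict
wide_row = {
    "distinct_id": 4, "distinct_name": 3,
    "max_id": 4, "max_name": "c",
    "min_id": 1, "min_name": "a",
    "nulls_id": 1, "nulls_name": 1,
}
per_column = reshape_stats(wide_row, ["id", "name"])
for name, column_stats in per_column.items():
    print(name, column_stats)
```

Looking keys up by the known column names, rather than splitting on the underscore, keeps this working when column names themselves contain underscores.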
You can put as many expressions into an agg as you want; when you collect, they all get computed at once. The result is a single row with all the values. Here's an example:
from pyspark.sql.functions import min, max, countDistinct
r = df.agg(
min(df.col1).alias("minCol1"),
max(df.col1).alias("maxCol1"),
(max(df.col1) - min(df.col1)).alias("diffMinMax"),
countDistinct(df.col2).alias("distinctItemsInCol2"))
r.printSchema()
# root
# |-- minCol1: long (nullable = true)
# |-- maxCol1: long (nullable = true)
# |-- diffMinMax: long (nullable = true)
# |-- distinctItemsInCol2: long (nullable = false)
row = r.collect()[0]
print(row.distinctItemsInCol2, row.diffMinMax)
# (10, 9)
You can also use the dictionary syntax here, but it's harder to manage for more complex things.

Creating multiple columns for a grouped pyspark dataframe

I'm trying to add several new columns to my dataframe (preferably in a for loop), with each new column being the count of certain instances of col B, after grouping by column A.
What doesn't work:
from pyspark.sql import functions as f
#the first one will be fine
df_grouped=df.select('A','B').filter(df.B=='a').groupBy('A').count()
df_grouped.show()
+---+-----+
| A |count|
+---+-----+
|859|    4|
|947|    2|
|282|    6|
|699|   24|
|153|   12|
# create the second column:
df_g2=df.select('A','B').filter(df.B=='b').groupBy('A').count()
df_g2.show()
+---+-----+
| A |count|
+---+-----+
|174|   18|
|153|   20|
|630|    6|
|147|   16|
#I get an error on adding the new column:
df_grouped=df_grouped.withColumn('2nd_count',f.col(df_g2.select('count')))
The error:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
I also tried it without using f.col, and with just df_g2.count, but I get an error saying "col should be column".
Something that DOES work:
df_g1=df.select('A','B').filter(df.B=='a').groupBy('A').count()
df_g2=df.select('A','B').filter(df.B=='b').groupBy('A').count()
df_grouped=df_g1.join(df_g2,['A'])
However, I'm going to add up to around 1000 new columns, and having that many joins seems costly. I wonder if doing joins is inevitable, given that every time I group by col A, its order changes in the grouped object (e.g. compare the order of column A in df_grouped with its order in df_g2 above), or whether there is a better way to do this.
What you probably need is groupBy and pivot.
Try this:
from pyspark.sql import functions as F
df.groupby('A').pivot('B').agg(F.count('B')).show()
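To show what shape this produces, here is a plain-Python sketch of the same group-then-pivot count on a handful of (A, B) pairs. The sample pairs are made up for illustration, and pivot_counts is a hypothetical name.

```python
from collections import Counter, defaultdict


def pivot_counts(pairs):
    """Count occurrences of each B value per A value: one row per A, one column per B."""
    pairs = list(pairs)
    counts = defaultdict(Counter)
    for a, b in pairs:
        counts[a][b] += 1
    b_values = sorted({b for _, b in pairs})
    # None marks a combination that never occurs (Spark's pivot typically shows null there)
    return {
        a: {b: (per_a[b] if b in per_a else None) for b in b_values}
        for a, per_a in counts.items()
    }


sample = [(859, "a"), (859, "a"), (153, "a"), (153, "b")]
print(pivot_counts(sample))
```

All counts come out of a single pass over the data, which is why the pivot avoids the chain of joins described in the question.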

convert pyspark groupedData object to spark Dataframe

I have to do a 2 levels grouping on a pyspark dataframe.
My attempt:
grouped_df=df.groupby(["A","B","C"])
grouped_df.groupby(["C"]).count()
But I get the following error:
'GroupedData' object has no attribute 'groupby'
I guess I should first convert the grouped object into a pySpark DF. But I cannot do that.
Any suggestion?
I had the same issue. The way I got around it was by first doing a "count()" after the first groupby, because that returns a Spark DataFrame, rather than the GroupedData object. Then you can do another groupby on that returned DataFrame.
So try:
grouped_df=df.groupby(["A","B","C"]).count()
grouped_df.groupby(["C"]).count()
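It may help to see what the second count actually counts. The sketch below mirrors the two steps in plain Python, assuming the rows are (A, B, C) tuples; two_level_count is a hypothetical name.

```python
from collections import Counter


def two_level_count(rows):
    """rows: iterable of (A, B, C) tuples."""
    # Like df.groupby(["A", "B", "C"]).count(): one entry per distinct (A, B, C)
    first = Counter(rows)
    # Like .groupby(["C"]).count() on that result: counts groups per C, not original rows
    second = Counter(c for (_, _, c) in first)
    return first, second


rows = [("a", "b", "c")] * 3 + [("x", "y", "c"), ("x", "y", "d")]
first, second = two_level_count(rows)
print(first)
print(second)
```

Here C = "c" appears in four input rows but gets a count of 2 at the second level, because only two distinct (A, B, C) groups contain it.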
The function DataFrame.groupBy(cols) returns a GroupedData object. To convert a GroupedData object back to a DataFrame, you need to call one of the GroupedData functions, such as mean(cols), avg(cols) or count(). Here is an example based on yours:
df = sqlContext.createDataFrame([['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']], schema=['A', 'B', 'C'])
df.show()
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  a|  b|  c|
|  a|  b|  c|
|  a|  b|  c|
+---+---+---+
gdf = df.groupBy('C').count()
gdf.show()
+---+-----+
|  C|count|
+---+-----+
|  c|    3|
+---+-----+
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.GroupedData
pyspark.sql.GroupedData Aggregation methods, returned by
DataFrame.groupBy().
A set of methods for aggregations on a DataFrame, created by
DataFrame.groupBy().
You can use an aggregation function such as agg, avg, count, max, mean, min, pivot, sum, collect_list, collect_set, first, grouping, etc.
Pay attention to first: DataFrame.first() is an action, and it can make your script slower if you misuse it (the aggregate function first() used inside agg() is not an action).
If you have a numeric column you can use aggregation functions such as min, max, mean, etc., but if you have a string column you may want to use:
df.groupBy("ID").pivot("VAR").agg(concat_ws('', collect_list(col("VAL"))))
or
df.groupBy("ID").pivot("VAR").agg(collect_list("VAL")[0])
or
df.groupBy("ID").pivot("VAR").agg(first("VAL"))