Merging two dataframes using Pyspark - dataframe

I have 2 DF to merge:
DF1 --> contains Stocks
Plant Art_nr Tot
A X 5
B Y 4
DF2 --> contains open delivery
Plant Art_nr Tot
A X 1
C Z 3
I would like to obtain a DF3 where for each combination of Plant and Art_nr:
- if there is a match between DF1.Plant&Art_nr and DF2.Plant&Art_nr I get the difference between DF1 and DF2
- if there is no match between DF1.Plant&Art_nr and DF2.Plant&Art_nr I keep the original values from DF1 and DF2
DF3 -->
Plant Art_nr Total
A X 4
B Y 4
C Z 3
I created a "Concat" field in DF1 and DF2 to concatenate Plant and Art_nr, and I tried a full join + when + otherwise, but I can't find the correct syntax:
DF1.join(DF2, ["Concat"],"full").withColumn("Total",when(DF1.Concat.isin(DF2.Concat)), DF1.Tot - DF2.Tot).otherwise(when(not(DF1.Concat.isin(DF2.Concat)), DF1.Tot)).show()
Any suggestions about alternative functions I could use, or how to correctly use those?

You have to join both dataframes and then apply either a CASE (if-else) expression or the coalesce function.
This can be done in several ways; here are a few examples.
Option1: Use coalesce function as alternative of CASE-WHEN-NULL
from pyspark.sql.functions import coalesce, lit,abs
cond = [df1.Plant == df2.Plant, df1.Art_nr == df2.Art_nr]
df1.join(df2,cond,'full') \
.select(coalesce(df1.Plant,df2.Plant).alias('Plant')
,coalesce(df1.Art_nr,df2.Art_nr).alias('Art_nr')
,abs(coalesce(df1.Tot,lit(0)) - coalesce(df2.Tot,lit(0))).alias('Tot')
).show()
Option2: Use case expression within selectExpr()
cond = [df1.Plant == df2.Plant, df1.Art_nr == df2.Art_nr]
df1.alias('a').join(df2.alias('b'),cond,'full') \
.selectExpr("CASE WHEN a.Plant IS NULL THEN b.Plant ELSE a.Plant END AS Plant",
"CASE WHEN a.Art_nr IS NULL THEN b.Art_nr ELSE a.Art_nr END AS Art_nr",
"abs(coalesce(a.Tot,0) - coalesce(b.Tot,0)) AS Tot") \
.show()
#+-----+------+---+
#|Plant|Art_nr|Tot|
#+-----+------+---+
#| A| X| 4|
#| B| Y| 4|
#| C| Z| 3|
#+-----+------+---+
Option3: Use when().otherwise()
from pyspark.sql.functions import when,coalesce, lit,abs
cond = [df1.Plant == df2.Plant, df1.Art_nr == df2.Art_nr]
df1.join(df2,cond,'full') \
.select(when(df1.Plant.isNull(),df2.Plant).otherwise(df1.Plant).alias('Plant')
,when(df1.Art_nr.isNull(),df2.Art_nr).otherwise(df1.Art_nr).alias('Art_nr')
,abs(coalesce(df1.Tot,lit(0)) - coalesce(df2.Tot,lit(0))).alias('Tot')
).show()
Output:
#+-----+------+---+
#|Plant|Art_nr|Tot|
#+-----+------+---+
#| A| X| 4|
#| B| Y| 4|
#| C| Z| 3|
#+-----+------+---+

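Using a UDF is more verbose but makes the logic explicit:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf, array
def score(arr):
    # arr holds [DF1.Tot, DF2.Tot]; after a full join either side may be null
    if arr[0] is None:
        return int(arr[1])
    elif arr[1] is None:
        return int(arr[0])
    return int(arr[0]) - int(arr[1])
udf_final = udf(score, IntegerType())
# join on both key columns, then pass the two Tot columns to the UDF
cond = [DF1.Plant == DF2.Plant, DF1.Art_nr == DF2.Art_nr]
DF1.join(DF2, cond, "full").withColumn("final_score", udf_final(array(DF1.Tot, DF2.Tot)))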

I would probably do a union with a groupBy and some reformatting to avoid using UDFs and without large blocks of code.
from pyspark.sql.functions import *
DF3 = DF1.union(DF2.withColumn("Tot", col("Tot") * (-1)))
DF3 = DF3.groupBy("Plant", "Art_nr").agg(sum("Tot").alias("Tot"))
DF3 = DF3.withColumn("Tot", abs(col("Tot")))
I'm not 100% sure if there are no side effects I wasn't considering and if it fits your needs.
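To make the possible side effects easier to reason about, here is a minimal plain-Python sketch of what the union + groupBy + sum + abs steps compute. It does not use Spark; the function name merge_totals and the list-of-tuples representation are assumptions made only for illustration, and the rows are the ones from the question.

```python
from collections import defaultdict

# Rows are (Plant, Art_nr, Tot), copied from the question.
stock = [("A", "X", 5), ("B", "Y", 4)]
open_delivery = [("A", "X", 1), ("C", "Z", 3)]

def merge_totals(left, right):
    """Model of union + groupBy + sum + abs: right-hand totals count as negative."""
    signed = defaultdict(int)
    for plant, art_nr, tot in left:
        signed[(plant, art_nr)] += tot
    for plant, art_nr, tot in right:
        signed[(plant, art_nr)] -= tot
    # abs() mirrors the last step; it also hides the sign when right > left.
    return sorted((p, a, abs(t)) for (p, a), t in signed.items())

df3 = merge_totals(stock, open_delivery)
print(df3)
```

One side effect this makes visible: when an open delivery is larger than the stock for a matching key, abs() returns the positive difference, so a shortfall is indistinguishable from a surplus.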

Related

optimize pyspark code to find a keyword and its count in a dataframe

We have a lot of files in our s3 bucket. The current pyspark code I have reads each file, takes one column from that file and looks for the keyword and returns a dataframe with count of keyword in the column and the file.
Here is the code in pyspark. (we are using databricks to write code if that helps)
import s3fs
fs = s3fs.S3FileSystem()
from pyspark.sql.functions import lower, col
keywords = ['%keyword1%','%keyword2%']
prefix = ''
deployment_id = ''
pull_id = ''
paths = fs.ls(prefix+'/'+deployment_id+'/'+pull_id)
result = []
errors = []
try:
    for path in paths:
        df = spark.read.parquet('s3://'+path)
        print(path)
        for keyword in keywords:
            for col in df.columns:
                filtered_df = df.filter(lower(df[col]).like(keyword))
                filtered_count = filtered_df.count()
                if filtered_count > 0 :
                    #print(col +' has '+ str(filtered_count) +' appearences')
                    result.append({'keyword': keyword, 'column': col, 'count': filtered_count,'table':path.split('/')[-1]})
except Exception as e:
    errors.append({'error_msg':e})
try:
    errors = spark.createDataFrame(errors)
except Exception as e:
    print('no errors')
try:
    result = spark.createDataFrame(result)
    result.display()
except Exception as e:
    print('problem with results. May be no results')
I am new to PySpark, Databricks and Spark. The code here runs very slowly. I know that because we have local Python code that is faster than this one. We wanted to use PySpark on Databricks because we thought it would be faster; with the local code we need to enter AWS access keys every day, and sometimes a huge file causes a memory error.
NOTE - The above code reads data faster, but the search seems slower than in the local Python code.
Here is the Python code on our local system:
def search_df(self,keyword,df,regex=False):
    start=time.time()
    if regex:
        mask = df.applymap(lambda x: re.search(keyword,x) is not None if isinstance(x,str) else False).to_numpy()
    else:
        mask = df.applymap(lambda x: keyword.lower() in x.lower() if isinstance(x,str) else False).to_numpy()
I was hoping for some code changes to the PySpark version to make it faster.
Thanks.
I tried changing .like(keyword) to .contains(keyword) to see if that is faster, but it doesn't seem to help.
Check out the code below. It defines a function that uses a list comprehension to count, for each column in the df, the rows that match a keyword. The function is then called once per keyword. Each call returns a new df, and these are unioned with reduce.
import pyspark.sql.functions as F
from pyspark.sql import DataFrame
from functools import reduce
sampleData = [["Hello s1","Y","Hi s1"],["What is your name?","What is s1","what is s2?"] ]
df = spark.createDataFrame(sampleData,["col1","col2","col3"])
df.show()
# Sample input dataframe
+------------------+----------+-----------+
| col1| col2| col3|
+------------------+----------+-----------+
| Hello s1| Y| Hi s1|
|What is your name?|What is s1|what is s2?|
+------------------+----------+-----------+
keywords=["s1","s2"]
def calc(k) -> DataFrame:
    return df.select([F.count(F.when(F.col(c).rlike(k),c)).alias(c) for c in df.columns] ).withColumn("keyword",F.lit(k))
lst=[calc(k) for k in keywords]
fDf=reduce(DataFrame.unionByName, [y for y in lst])
stExpr="stack(3,'col1',col1,'col2',col2,'col3',col3) as (ColName,Count)"
fDf.select("keyword",F.expr(stExpr)).show()
# Output
+-------+-------+-----+
|keyword|ColName|Count|
+-------+-------+-----+
| s1| col1| 1|
| s1| col2| 1|
| s1| col3| 1|
| s2| col1| 0|
| s2| col2| 0|
| s2| col3| 1|
+-------+-------+-----+
You can add a where clause at the end to keep only the rows with a count greater than 0:
where("Count > 0")
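The counting logic can be checked without a cluster. The following is a plain-Python sketch of what count(when(col.rlike(k))) tallies for each keyword and column; the helper name count_matches is hypothetical, and the rows are the sample data from this answer. Like rlike, re.search is unanchored and case-sensitive, so lowercase the cell first if you need the case-insensitive behaviour of the original lower(...).like(...) filter.

```python
import re

columns = ["col1", "col2", "col3"]
rows = [
    ("Hello s1", "Y", "Hi s1"),
    ("What is your name?", "What is s1", "what is s2?"),
]
keywords = ["s1", "s2"]

def count_matches(rows, columns, keywords):
    """Return {(keyword, column): number of rows whose cell matches the keyword}."""
    counts = {(k, c): 0 for k in keywords for c in columns}
    for row in rows:  # single pass over the data
        for c, cell in zip(columns, row):
            if not isinstance(cell, str):
                continue  # non-string cells never match
            for k in keywords:
                if re.search(k, cell):  # unanchored, case-sensitive regex match
                    counts[(k, c)] += 1
    return counts

counts = count_matches(rows, columns, keywords)
# Keep only non-zero counts, the equivalent of where("Count > 0").
non_zero = {key: n for key, n in counts.items() if n > 0}
print(non_zero)
```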

Pyspark Multiple JOINS Column <> Row values: Reducing Actions

I have a master table 'Table 1' with 3 columns(Shown below). Tables 2.1, 3.1 & 4.1 are for 3 unique dates present in Table 1 and need to be populated in column 'Points 1'. Similarly, Tables 2.2, 3.2 & 4.2 are for same 3 unique dates present in Table 1 and need to be populated in column 'Points 2'.
Current Approach:
from pyspark.sql.functions import col, lit, when
df1 = spark.table("Table1")
df2_1 = spark.table("table2.1")
df2_1 = df2_1.withColumn("Date", lit(3312019))
df3 = df1.join(df2_1, (df1.ID == df2_1.ID) & (df1.Date == df2_1.Date), 'left')
# pick the column whose name matches the row's Category
df4 = df3.withColumn('Points', when(df3.Category == 'A', col('A'))
    .when(df3.Category == 'B', col('B'))
    .when(df3.Category == 'C', col('C'))
    .when(df3.Category == 'D', col('D'))
    .otherwise(lit(None)))
Current Approach makes my code lengthy if implemented for all 6 tables, any suggestions to shorten it and reduce multiple actions?
I don't know if this is much shorter or "cleaner" than your version, but since you asked for help on this, I will post this as an answer. Please note that my answer is in regular spark (scala) - not pyspark, but it shouldn't be too difficult to port it to pyspark, if you find the answer useful :)
So here goes:
First a little helper function
def columns2rows(row: Row) = {
val id = row.getInt(0)
val date = row.getInt(1)
val cols = Seq("A", "B", "C", "D")
cols.indices.map(index => (id, cols(index), date, if (row.isNullAt(index+2)) 0 else row.getInt(index+2)))
}
Then union together the tables needed to populate "Points1"
val df1 = table21.withColumn("Date", lit(3312019))
.unionByName(table31.withColumn("Date", lit(12312019)))
.unionByName(table41.withColumn("Date", lit(5302020)))
.select($"ID", $"Date", $"A", $"B", $"C", $"D")
.flatMap(row => columns2rows(row))
.toDF("ID", "Category", "Date", "Points1")
Then union together the tables needed to populate "Points2"
val df2 = table22.withColumn("Date", lit(3312019))
.unionByName(table32.withColumn("Date", lit(12312019)))
.unionByName(table42.withColumn("Date", lit(5302020)))
.select($"ID", $"Date", $"A", $"B", $"C", $"D")
.flatMap(row => columns2rows(row))
.toDF("ID", "Category", "Date", "Points2")
Join them together and finally with the original table:
val joiningTable = df1.join(df2, Seq("ID", "Category", "Date"))
val res = table1.join(joiningTable, Seq("ID", "Category", "Date"))
...and voila - printing the final result:
res.show()
+---+--------+--------+-------+-------+
| ID|Category| Date|Points1|Points2|
+---+--------+--------+-------+-------+
|123| A| 3312019| 40| 20|
|123| B| 5302020| 10| 90|
|123| D| 5302020| 0| 80|
|123| A|12312019| 20| 10|
|123| B|12312019| 0| 10|
|123| B| 3312019| 60| 60|
+---+--------+--------+-------+-------+
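As a starting point for the PySpark port mentioned above, here is a plain-Python sketch of the columns2rows helper. The name columns_to_rows and the sample row are assumptions for illustration only (the values are invented, not taken from the real tables). In PySpark it could presumably be passed to df.rdd.flatMap(...), since a Row can be indexed by position, but that wiring is not tested here.

```python
CATEGORIES = ["A", "B", "C", "D"]

def columns_to_rows(row):
    """row is (ID, Date, A, B, C, D); returns one (ID, Category, Date, Points) per category."""
    row_id, date = row[0], row[1]
    return [
        # a missing value becomes 0, as in the Scala helper
        (row_id, cat, date, 0 if row[i + 2] is None else row[i + 2])
        for i, cat in enumerate(CATEGORIES)
    ]

# Hypothetical input row, values invented for illustration only.
sample = (123, 3312019, 40, 60, None, 5)
unpivoted = columns_to_rows(sample)
for record in unpivoted:
    print(record)
```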

Add single quotes to the dataFrame column values

DataFrame is holding a column QUALIFY with values like below.
QUALIFY
=================
ColA|ColB|ColC
ColA
ColZ|ColP
The values in this column are split by "|". I want values in this column to be like 'ColA','ColB','ColC' ...
With the below code I am able to replace | with ,',. How can I add a single quote at the start and end of value?
newDf = df_qualify.withColumn('QUALIFY2', regexp_replace('QUALIFY', "\\|", "\\','"))
Your solution is almost there - you just need to add a single quote to the start and end. You can achieve this using pyspark.sql.functions.concat:
from pyspark.sql.functions import col, concat, lit, regexp_replace
df.withColumn(
"QUALIFY2",
concat(lit("'"), regexp_replace(col('QUALIFY'), r"\|", r"','"), lit("'"))
).show()
#+--------------+--------------------+
#| QUALIFY| QUALIFY2|
#+--------------+--------------------+
#|ColA|ColB|ColC|'ColA','ColB','ColC'|
#| ColA| 'ColA'|
#| ColZ|ColP| 'ColZ','ColP'|
#+--------------+--------------------+
Alternatively, you can avoid regular expressions and achieve the same using split and concat_ws:
from pyspark.sql.functions import concat, concat_ws, lit, split
df.withColumn(
    "QUALIFY2",
    concat(lit("'"), concat_ws("','", split("QUALIFY", r"\|")), lit("'"))
).show()
#+--------------+--------------------+
#| QUALIFY| QUALIFY2|
#+--------------+--------------------+
#|ColA|ColB|ColC|'ColA','ColB','ColC'|
#| ColA| 'ColA'|
#| ColZ|ColP| 'ColZ','ColP'|
#+--------------+--------------------+
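The split-and-join idea is easy to verify on plain strings before running it on a DataFrame. This is a small sketch of the same transformation in pure Python; quote_items is a hypothetical helper name, and the inputs are the values from the question.

```python
def quote_items(value, sep="|"):
    """Wrap each sep-delimited item in single quotes and join the items with commas."""
    return "'" + "','".join(value.split(sep)) + "'"

for raw in ["ColA|ColB|ColC", "ColA", "ColZ|ColP"]:
    print(raw, "->", quote_items(raw))
```

Note that str.split takes a literal separator, so the pipe needs no escaping here, unlike the regex-based split in Spark.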
Split the column on | and then join the resulting array back into a string:
import pyspark.sql.functions as F
import pyspark.sql.types as T
def str_list(x):
    # str() of a list of strings already quotes each item; only the brackets are removed
    return str(x).replace("[", "").replace("]", "")
str_udf = F.udf(str_list, T.StringType())
df = df.withColumn("arr_split", F.split(F.col("QUALIFY"), r"\|")) # escape character
df = df.withColumn("QUALIFY2", str_udf(F.col("arr_split")))
My sample output frame:
df.drop("arr_split").show() # Please ignore a and b columns
+---+---+--------------+--------------------+
| a| b| abc| QUALIFY2|
+---+---+--------------+--------------------+
| 1| 1|col1|col2|col3|'col1', 'col2', '...|
| 2| 2|col1|col2|col3|'col1', 'col2', '...|
| 3| 3|col1|col2|col3|'col1', 'col2', '...|
| 4| 4|col1|col2|col3|'col1', 'col2', '...|
| 5| 5|col1|col2|col3|'col1', 'col2', '...|
+---+---+--------------+--------------------+
The code below worked for me; I added the square brackets back to make it an array:
import pyspark.sql.functions as F
import pyspark.sql.types as T
def str_list(x):
    return str(x).replace("[", "").replace("]", "")
str_udf = F.udf(str_list, T.StringType())
df = df.withColumn(column_name,str_udf(F.col(column_name)))
df = df.withColumn(column_name, F.expr("concat('[', " + column_name +", ']')"))

Statistics of Columns computed parallely

Best way to get the max value in a Spark dataframe column
This post shows how to run an aggregation (distinct, min, max) on a table something like:
for colName in cd.columns:
    dt = cd[[colName]].distinct().count()
    mx = cd.agg({colName: "max"}).collect()[0][0]
    mn = cd.agg({colName: "min"}).collect()[0][0]
    print(colName, dt, mx, mn)
This can easily be done with compute statistics. The stats from Hive and Spark are different:
Hive gives - distinct, max, min, nulls, length, version
Spark gives - count, mean, stddev, min, max
It looks like quite a few statistics are calculated. How can I get all of them for all columns using one command?
However, I have thousands of columns and doing this serially is very slow. Suppose I want to compute some other function, say the standard deviation, on each of the columns: how can that be done in parallel?
You can use pyspark.sql.DataFrame.describe() to get aggregate statistics like count, mean, min, max, and standard deviation for all columns where such statistics are applicable. (If you don't pass in any arguments, stats for all columns are returned by default)
df = spark.createDataFrame(
[(1, "a"),(2, "b"), (3, "a"), (4, None), (None, "c")],["id", "name"]
)
df.describe().show()
#+-------+------------------+----+
#|summary| id|name|
#+-------+------------------+----+
#| count| 4| 4|
#| mean| 2.5|null|
#| stddev|1.2909944487358056|null|
#| min| 1| a|
#| max| 4| c|
#+-------+------------------+----+
As you can see, these statistics ignore any null values.
If you're using Spark version 2.3 or later, there is also pyspark.sql.DataFrame.summary(), which supports the following aggregates:
count, mean, stddev, min, max, and arbitrary approximate percentiles specified as a percentage (e.g. 75%)
df.summary("count", "min", "max").show()
#+-------+------------------+----+
#|summary| id|name|
#+-------+------------------+----+
#| count| 4| 4|
#| min| 1| a|
#| max| 4| c|
#+-------+------------------+----+
If you wanted some other aggregate statistic for all columns, you could also use a list comprehension with pyspark.sql.DataFrame.agg(). For example, if you wanted to replicate what you say Hive gives (distinct, max, min and nulls - I'm not sure what length and version mean):
import pyspark.sql.functions as f
from itertools import chain
agg_distinct = [f.countDistinct(c).alias("distinct_"+c) for c in df.columns]
agg_max = [f.max(c).alias("max_"+c) for c in df.columns]
agg_min = [f.min(c).alias("min_"+c) for c in df.columns]
# count() skips nulls, so emit a non-null marker (1) for every null value of c
agg_nulls = [f.count(f.when(f.isnull(c), 1)).alias("nulls_"+c) for c in df.columns]
df.agg(
*(chain.from_iterable([agg_distinct, agg_max, agg_min, agg_nulls]))
).show()
#+-----------+-------------+------+--------+------+--------+--------+----------+
#|distinct_id|distinct_name|max_id|max_name|min_id|min_name|nulls_id|nulls_name|
#+-----------+-------------+------+--------+------+--------+--------+----------+
#| 4| 3| 4| c| 1| a| 1| 1|
#+-----------+-------------+------+--------+------+--------+--------+----------+
Though this method will return one row, rather than one row per statistic as describe() and summary() do.
You can put as many expressions into an agg as you want, when you collect they all get computed at once. The result is a single row with all the values. Here's an example:
from pyspark.sql.functions import min, max, countDistinct
r = df.agg(
min(df.col1).alias("minCol1"),
max(df.col1).alias("maxCol1"),
(max(df.col1) - min(df.col1)).alias("diffMinMax"),
countDistinct(df.col2).alias("distinctItemsInCol2"))
r.printSchema()
# root
# |-- minCol1: long (nullable = true)
# |-- maxCol1: long (nullable = true)
# |-- diffMinMax: long (nullable = true)
# |-- distinctItemsInCol2: long (nullable = false)
row = r.collect()[0]
print(row.distinctItemsInCol2, row.diffMinMax)
# (10, 9)
You can also use the dictionary syntax here, but it's harder to manage for more complex things.
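To illustrate why packing every expression into one agg helps, here is a plain-Python sketch that computes distinct, max, min and null counts for all columns in a single pass over the rows, instead of one pass per column and statistic. It is only a model of the idea, not how Spark executes the query; column_stats is a hypothetical name, and the rows are the (id, name) sample used with describe() earlier in this thread.

```python
def column_stats(rows, columns):
    """Compute distinct, max, min and null counts for every column in one pass."""
    seen = {c: set() for c in columns}
    nulls = {c: 0 for c in columns}
    for row in rows:
        for c, value in zip(columns, row):
            if value is None:
                nulls[c] += 1  # nulls are counted separately and excluded from the rest
            else:
                seen[c].add(value)
    return {
        c: {
            "distinct": len(seen[c]),
            "max": max(seen[c], default=None),  # None if the column is entirely null
            "min": min(seen[c], default=None),
            "nulls": nulls[c],
        }
        for c in columns
    }

rows = [(1, "a"), (2, "b"), (3, "a"), (4, None), (None, "c")]
stats = column_stats(rows, ["id", "name"])
print(stats)
```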

convert pyspark groupedData object to spark Dataframe

I have to do a 2 levels grouping on a pyspark dataframe.
My attempt:
grouped_df=df.groupby(["A","B","C"])
grouped_df.groupby(["C"]).count()
But I get the following error:
'GroupedData' object has no attribute 'groupby'
I guess I should first convert the grouped object into a pySpark DF. But I cannot do that.
Any suggestion?
I had the same issue. The way I got around it was by first doing a "count()" after the first groupby, because that returns a Spark DataFrame, rather than the GroupedData object. Then you can do another groupby on that returned DataFrame.
So try:
grouped_df=df.groupby(["A","B","C"]).count()
grouped_df.groupby(["C"]).count()
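What the two chained counts produce can be modelled in plain Python. In this sketch the rows are hypothetical values invented for illustration: the first Counter plays the role of groupby(["A","B","C"]).count(), and the second counts how many distinct (A, B, C) groups share each value of C, which is what the second groupby(["C"]).count() returns.

```python
from collections import Counter

# Hypothetical (A, B, C) rows, invented for illustration only.
rows = [
    ("a1", "b1", "c1"),
    ("a1", "b1", "c1"),
    ("a2", "b1", "c1"),
    ("a3", "b2", "c2"),
]

# First level: one entry per distinct (A, B, C) combination with its row count.
level1 = Counter(rows)

# Second level: number of first-level groups per value of C.
level2 = Counter(c for (_a, _b, c) in level1)

print(dict(level1))
print(dict(level2))
```

Note that the second count is a count of groups, not of original rows; to get row totals per C you would sum the first-level counts instead.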
The function DataFrame.groupBy(cols) returns a GroupedData object. To convert a GroupedData object back to a DataFrame, you need to call one of the GroupedData functions such as mean(cols), avg(cols) or count(). An example based on your data:
df = sqlContext.createDataFrame([['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']], schema=['A', 'B', 'C'])
df.show()
+---+---+---+
| A| B| C|
+---+---+---+
| a| b| c|
| a| b| c|
| a| b| c|
+---+---+---+
gdf = df.groupBy('C').count()
gdf.show()
+---+-----+
| C|count|
+---+-----+
| c| 3|
+---+-----+
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.GroupedData
pyspark.sql.GroupedData Aggregation methods, returned by
DataFrame.groupBy().
A set of methods for aggregations on a DataFrame, created by
DataFrame.groupBy().
You may use aggregation functions such as agg, avg, count, max, mean, min, pivot, sum, collect_list, collect_set, first, grouping, etc.
Attention to first: used as DataFrame.first() it is an action, so it can make your script slower if you misuse it.
If you have a numeric column you can use an aggregation function such as min, max or mean, but if you have a string column you may want to use:
df.groupBy("ID").pivot("VAR").agg(concat_ws('', collect_list(col("VAL"))))
or
df.groupBy("ID").pivot("VAR").agg(collect_list("VAL")[0])
or
df.groupBy("ID").pivot("VAR").agg(first("VAL"))