PySpark - change dataframe column value based on its existence in other dataframe

I am relatively new to Spark and haven't found a solution for this. I found several similar questions, but couldn't work out how to apply them to my use case.
I have two dataframes. The first is based on a CSV which looks like this (displayed as a table):
id   | license_no
2005 | 1011
2006 | 1022
2007 | 3911
The second dataframe is based on a CSV which looks like this:
license_no | active
1011       | y
1022       | y
3911       | n
I need to check whether the license_no value exists in the second dataframe; if it exists and active=y, then I need to add the prefix 99 to both id and license_no in the first dataframe. For example, license_no 1011 exists in the second dataframe and is active, so its id in the first dataframe should be changed to 992005 and its license_no to 991011. If it doesn't exist, the prefix 88 should be added instead.
The first dataframe should look like this after the transformation:
id     | license_no
992005 | 991011
992006 | 991022
882007 | 883911
I am not sure what the best solution for this is. Can I do this transformation directly in one Spark command?

from pyspark.sql.functions import col, concat, lit, when

# Join on license_no
s = df.join(df1, how='left', on='license_no')
# Conditionally concatenate the prefix using a list comprehension
s.select(*[when(col('active') == 'y', concat(lit('99'), col(x)))
           .otherwise(concat(lit('88'), col(x))).alias(x)
           for x in s.columns if x != 'active']).show()
+----------+------+
|license_no| id|
+----------+------+
| 991011|992005|
| 991022|992006|
| 883911|882007|
+----------+------+
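If you want to reproduce this end to end, here is a self-contained sketch with the sample data from the question; as an assumption, the CSV loading step is replaced by createDataFrame for brevity and the SparkSession is created inline:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat, lit, when

spark = SparkSession.builder.getOrCreate()

# Sample data from the question instead of reading the CSVs
df = spark.createDataFrame([(2005, 1011), (2006, 1022), (2007, 3911)],
                           ["id", "license_no"])
df1 = spark.createDataFrame([(1011, "y"), (1022, "y"), (3911, "n")],
                            ["license_no", "active"])

# Left join, then prefix 99 for active licenses and 88 otherwise
s = df.join(df1, how="left", on="license_no")
s.select(*[when(col("active") == "y", concat(lit("99"), col(x)))
           .otherwise(concat(lit("88"), col(x))).alias(x)
           for x in s.columns if x != "active"]).show()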

Related

PySpark Grouping and Aggregating based on A Different Column?

I'm working on a problem where I have a dataset in the following format (replaced real data for example purposes):
session | activity       | timestamp
1       | enter_store    | 2022-03-01 23:25:11
1       | pay_at_cashier | 2022-03-01 23:31:10
1       | exit_store     | 2022-03-01 23:55:01
2       | enter_store    | 2022-03-02 07:15:00
2       | pay_at_cashier | 2022-03-02 07:24:00
2       | exit_store     | 2022-03-02 07:35:55
3       | enter_store    | 2022-03-05 11:07:01
3       | exit_store     | 2022-03-05 11:22:51
I would like to be able to compute counting statistics for these events based on the pattern observed within each session. For example, based on the table above, the count of each pattern observed would be as follows:
{
'enter_store -> pay_at_cashier -> exit_store': 2,
'enter_store -> exit_store': 1
}
I'm trying to do this in PySpark, but I'm having some trouble figuring out the most efficient way to do this kind of pattern matching where some steps are missing. The real problem involves a much larger dataset of ~15M+ events like this.
I've tried logic in the form of filtering the entire DF for unique sessions where 'enter_store' is observed, and then filtering that DF for unique sessions where 'pay_at_cashier' is observed. That works fine; the only issue is I'm having trouble thinking of ways to count sessions like session 3, where there is only a starting step and a final step but no middle step.
Obviously one way to do this brute-force would be to iterate over each session and assign it a pattern and increment a counter, but I'm looking for more efficient and scalable ways to do this.
Would appreciate any suggestions or insights.
For Spark 2.4+, you could do
import pyspark.sql.functions as F

df = (df
    .withColumn("flow", F.expr("sort_array(collect_list(struct(timestamp, activity)) over (partition by session))"))
    .withColumn("flow", F.expr("concat_ws(' -> ', transform(flow, v -> v.activity))"))
    .groupBy("flow").agg(F.countDistinct("session").alias("total_session"))
)
df.show(truncate=False)
# +-------------------------------------------+-------------+
# |flow |total_session|
# +-------------------------------------------+-------------+
# |enter_store -> pay_at_cashier -> exit_store|2 |
# |enter_store -> exit_store |1 |
# +-------------------------------------------+-------------+
The first withColumn collects the (timestamp, activity) pairs of each session into an array ordered by timestamp (be sure timestamp is an actual timestamp type). After that, keep only the activity values from the array using the transform function, combine them into a single string with concat_ws, and group by that string to count the distinct sessions per flow.
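For what it's worth, the same result can also be produced without a window by aggregating per session first. This is only a sketch under the same assumptions (Spark 2.4+ and the session/activity/timestamp columns shown above); array_sort on an array of structs orders by the first struct field, so the sort by timestamp happens implicitly:
import pyspark.sql.functions as F

# Build one ordered "flow" string per session, then count sessions per flow
flows = (df
    .groupBy("session")
    .agg(F.expr(
        "concat_ws(' -> ', transform(array_sort(collect_list(struct(timestamp, activity))), v -> v.activity))"
    ).alias("flow")))

flows.groupBy("flow").count().withColumnRenamed("count", "total_session").show(truncate=False)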

Spark Dataframe : Group by custom Range

I have a Spark dataframe which I aggregated based on a column called "Rank", with "Sum" being the sum of all values with that Rank.
df.groupBy("Rank").agg(sum("Col1")).orderBy("Rank").show()
Rank | Sum(Col1)
1    | 1523
2    | 785
3    | 232
4    | 69
5    | 126
...  | ....
430  | 7
Instead of having the Sum for every single value of "Rank", I would like to group my data into rank "buckets" to get a more compact output. For example:
Rank Range | Sum(Col1)
1          | 1523
2-5        | 1212
5-10       | ...
...        | ...
100+       | ....
Instead of having 4 different rows for Ranks 2, 3, 4 and 5, I would like to have one row "2-5" showing the sum for all these ranks.
What would be the best way of doing that? I am quite new to Spark dataframes and thankful for any help, especially examples of how to achieve this.
Thank you!
A few options:
Histogram - build a histogram. See the following post:
Making histogram with Spark DataFrame column
Add another column for the bucket values (See Apache spark dealing with case statements):
df.select(when(df("Rank") === "1", "1")
  .when(..., "2-5")
  .otherwise("100+")
  .alias("Rank Range"))
Now you can run your group by query on the new Rank Range column.
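In PySpark, the same idea could look like the sketch below, applied to the original df (before the groupBy); the bucket edges are illustrative assumptions and should be adjusted to the ranges you actually want:
from pyspark.sql.functions import col, when, sum as sum_

# Map each Rank to a bucket label, then aggregate per bucket
bucketed = df.withColumn(
    "Rank Range",
    when(col("Rank") == 1, "1")
    .when(col("Rank").between(2, 5), "2-5")
    .when(col("Rank").between(6, 10), "6-10")
    .when(col("Rank") <= 100, "11-100")
    .otherwise("100+"))

bucketed.groupBy("Rank Range").agg(sum_("Col1").alias("Sum(Col1)")).show()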

groupby 2 columns and count into separate columns based on one column's cases

I'm trying to group by 2 columns, of which the first has 5 different values and the second has 2.
My data looks like this (shown as a screenshot in the original post), and using
df_counted = (df_analysis
    .groupby(['TYPE', 'RESULT'])
    .size()
    .sort_values(ascending=False)
    .reset_index(name='COUNT'))
I was able to transform it into the cases I want (again shown as a screenshot). However, I don't want a column for RESULT, just the counts. It's supposed to look like this:
COUNT_TRUE COUNT_FALSE
FORWARD 21 182
BACKWARD 34 170
RIGHT 24 298
LEFT 20 242
NEUTRAL 16 82
The best I could do there was this. How do I get there?
Pandas has a pivot table feature for dataframes. Your task can also be done by making a pivot table:
df_counted.pivot_table(index="TYPE", columns="RESULT", values="COUNT")
Result: a table indexed by TYPE with one count column per RESULT value (shown as a screenshot in the original post).
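To get column names like COUNT_TRUE/COUNT_FALSE from the question, the pivot can be followed by a rename. A small sketch, assuming RESULT holds boolean True/False values (adjust the rename keys if they are strings):
# Pivot RESULT into columns, fill missing combinations with 0, then rename
wide = (df_counted
        .pivot_table(index="TYPE", columns="RESULT", values="COUNT", fill_value=0)
        .rename(columns={True: "COUNT_TRUE", False: "COUNT_FALSE"}))
print(wide)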
Solved it and went kind of full SQL there. It's not elegant, but it works:
df_counted is the last df from the question with the NaN values.
# drop duplicates, keeping the first row per TYPE
df_pos = df_counted.drop_duplicates(subset=['TYPE'], keep='first').drop(columns=['COUNT_POS'])
# drop duplicates, keeping the last row per TYPE
df_neg = df_counted.drop_duplicates(subset=['TYPE'], keep='last').drop(columns=['COUNT_NEG'])
# join on TYPE
df = df_pos.set_index('TYPE').join(df_neg.set_index('TYPE'))
If someone has a more elegant way of doing this, I'd be super interested to see it.

Call function in pyspark with values from dataframe as strings

I have to call a function func_test(spark, a, b) which accepts two string values and creates a df out of them. spark is a SparkSession variable.
These two string values are two columns of another dataframe and would be different for different rows of that dataframe.
I am unable to achieve this.
Things tried so far:
1.
ctry_df = func_test(spark, df.select("CTRY").first()["CTRY"],df.select("CITY").first()["CITY"])
Gives CTRY and CITY of only the first record of the df.
2.
ctry_df = func_test(spark, df['CTRY'],df['CITY'])
Gives Column<b'CTRY'> and Column<b'CITY'> as values.
Example:
df is:
+----------+----------+-----------+
| CTRY | CITY | XYZ |
+----------+----------+-----------+
| US | LA | HELLO|
| UK | LN | WORLD|
| SN | SN | SPARK|
+----------+----------+-----------+
So, I want the first call to be func_test(spark, US, LA); the second call to be func_test(spark, UK, LN); the third call to be func_test(spark, SN, SN); and so on.
Pyspark - 3.7
Spark - 2.2
Edit 1:
Issue in detail:
func_test(spark, string1, string2) is a function which accepts two string values. Inside this function, a series of dataframe operations is performed. For example, the first Spark SQL in func_test is a normal select, and the two variables string1 and string2 are used in the where clause. The result of this Spark SQL, which generates a df, becomes a temp table for the next Spark SQL, and so on. Finally, it creates a df which func_test(spark, string1, string2) returns.
Now, in the main class, I have to call func_test, and the two parameters string1 and string2 will be fetched from the records of a dataframe. So the first func_test call generates the query select * from dummy where CTRY='US' and CITY='LA', and the subsequent operations happen, resulting in a df. The second call to func_test becomes select * from dummy where CTRY='UK' and CITY='LN'. The third call becomes select * from dummy where CTRY='SN' and CITY='SN', and so on.
Instead of first(), use collect() and iterate over the rows in a loop:
collect_vals = df.select('CTRY', 'CITY').distinct().collect()
for row_col in collect_vals:
    func_test(spark, row_col['CTRY'], row_col['CITY'])
hope this helps !!
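If each func_test call returns a dataframe and you want a single result at the end, the per-row dataframes can be combined afterwards. A sketch, assuming all the returned dataframes share the same schema:
from functools import reduce
from pyspark.sql import DataFrame

# Call func_test once per distinct (CTRY, CITY) pair and collect the results
results = [func_test(spark, row['CTRY'], row['CITY'])
           for row in df.select('CTRY', 'CITY').distinct().collect()]
# Union all per-call results into one dataframe
combined = reduce(DataFrame.union, results)
combined.show()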

iteration in spark sql dataframe, getting 1st row value in first iteration and second row value in next iteration and so on

Below is the query that will give the date and distance where the distance is <= 10 km:
var s=spark.sql("select date,distance from table_new where distance <=10km")
s.show()
this will give the output like
12/05/2018 | 5
13/05/2018 | 8
14/05/2018 | 18
15/05/2018 | 15
16/05/2018 | 23
---------- | --
I want to use the first row of the dataframe s and store the date value in a variable v in the first iteration. In the next iteration it should pick the second row, and the corresponding date value should replace the old value of the variable v, and so on.
I think you should look at Spark "Window Functions". You may find what you need there.
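Just to illustrate what a window function could look like here (a sketch, assuming s is the dataframe shown above and the dates use the dd/MM/yyyy format from the sample output): number the rows by date and work with that ordering instead of iterating manually.
from pyspark.sql import Window
from pyspark.sql.functions import row_number, to_date

# Order all rows by date; without partitionBy this pulls the data into one
# partition, which is acceptable for a small result like this one.
w = Window.orderBy(to_date("date", "dd/MM/yyyy"))
s.withColumn("row_id", row_number().over(w)).show()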
The "bad" way to do this would be to collect the dataframe using df.collect() which would return a list of Rows which you can manually iterate over each using a loop.This is bad cause it brings all the data in your driver.
The better way would be to use foreach() :
df.foreach(lambda x: <<your code here>>)
foreach() takes a lambda function as argument which iterates over each row of the dataframe without bringing all the data in the driver.But you cant use a simple local variable v inside a lambda fuction when there is overwriting involved.you can use spark accumulators for such a case.
e.g. if I want to sum all the values in the 2nd column:
counter = sc.accumulator(0)
df.foreach(lambda row: counter.add(row[1]))
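For completeness, a minimal end-to-end sketch of that accumulator pattern (an assumption: the second column is numeric); the total is read back on the driver after the action has run:
counter = sc.accumulator(0)                  # numeric accumulator created on the driver
df.foreach(lambda row: counter.add(row[1]))  # executors add each row's 2nd column
print(counter.value)                         # total, only reliable on the driver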