Function to filter values in PySpark - dataframe

I'm trying to run a for loop in PySpark that needs to filter a variable for an algorithm.
Here's an example of my dataframe df_prods:
+------+-------------+----------+
| ID   | NAME        | TYPE     |
+------+-------------+----------+
| 7983 | SNEAKERS 01 | Sneakers |
| 7034 | SHIRT 13    | Shirt    |
| 3360 | SHORTS 15   | Short    |
+------+-------------+----------+
I want to iterate over a list of IDs, get the match from the algorithm, and then filter by the product's type.
I created a function that gets the type:
def get_type(ID_PROD):
    return [row[0] for row in df_prods.filter(df_prods.ID == ID_PROD).select("TYPE").collect()]
And wanted it to return:
print(get_type(7983))
Sneakers
But I ran into two issues:
1- It takes a long time (longer than a similar operation took in plain Python).
2- It returns a list of strings, ['Sneakers'], and when I try to filter the products, this happens:
type = get_type(7983)
df_prods.filter(df_prods.type == type)
java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [Sneakers]
Does anyone know a better way to approach this on PySpark?
Thank you very much in advance. I'm having a very hard time learning PySpark.

A small adjustment to your function: this returns the actual string in the target column from the first record found after filtering.
from pyspark.sql.functions import col
def get_type(ID_PROD):
    # collect() returns a list of Rows; take the first one and read its TYPE field
    return df_prods.filter(col("ID") == ID_PROD).select("TYPE").collect()[0]["TYPE"]
type = get_type(7983)
df_prods.filter(col("TYPE") == type) # works
I find using col("colname") to be much more readable.
About the performance issue you've mentioned, I really cannot say without more details (e.g. inspecting the data and the rest of your application). Try this syntax and tell me if the performance improves.
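On the performance point, one option that may be worth trying is to collect the ID/TYPE pairs once and answer each lookup from a plain dict, instead of launching a Spark job per ID. This is a minimal sketch in plain Python, not a confirmed fix for the asker's workload: `collected_rows` is a hypothetical stand-in for what `df_prods.select("ID", "TYPE").collect()` would return, and `type_by_id` is a name introduced here for illustration.

```python
# Hypothetical stand-in for df_prods.select("ID", "TYPE").collect();
# real Row objects support the same row["ID"] access used here.
collected_rows = [
    {"ID": 7983, "TYPE": "Sneakers"},
    {"ID": 7034, "TYPE": "Shirt"},
    {"ID": 3360, "TYPE": "Short"},
]

# Build the lookup once; each later call is a dict access, not a Spark job.
type_by_id = {row["ID"]: row["TYPE"] for row in collected_rows}


def get_type(id_prod):
    """Return the TYPE string for a product ID, or None if the ID is unknown."""
    return type_by_id.get(id_prod)


print(get_type(7983))  # Sneakers
```

This only makes sense when the ID/TYPE pairs fit comfortably in driver memory.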

Related

PySpark Grouping and Aggregating based on A Different Column?

I'm working on a problem where I have a dataset in the following format (replaced real data for example purposes):
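+---------+----------------+---------------------+
| session | activity       | timestamp           |
+---------+----------------+---------------------+
| 1       | enter_store    | 2022-03-01 23:25:11 |
| 1       | pay_at_cashier | 2022-03-01 23:31:10 |
| 1       | exit_store     | 2022-03-01 23:55:01 |
| 2       | enter_store    | 2022-03-02 07:15:00 |
| 2       | pay_at_cashier | 2022-03-02 07:24:00 |
| 2       | exit_store     | 2022-03-02 07:35:55 |
| 3       | enter_store    | 2022-03-05 11:07:01 |
| 3       | exit_store     | 2022-03-05 11:22:51 |
+---------+----------------+---------------------+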
I would like to be able to compute counting statistics for these events based on the pattern observed within each session. For example, based on the table above, the count of each pattern observed would be as follows:
{
'enter_store -> pay_at_cashier -> exit_store': 2,
'enter_store -> exit_store': 1
}
I'm trying to do this in PySpark, but I'm having some trouble figuring out the most efficient way to do this kind of pattern matching where some steps are missing. The real problem involves a much larger dataset of ~15M+ events like this.
I've tried filtering the entire DF for unique sessions where 'enter_store' is observed, and then filtering that DF for unique sessions where 'pay_at_cashier' is observed. That works fine, but I'm having trouble thinking of a way to count sessions like session 3, which have only a starting step and a final step but no middle step.
Obviously one way to do this brute-force would be to iterate over each session and assign it a pattern and increment a counter, but I'm looking for more efficient and scalable ways to do this.
Would appreciate any suggestions or insights.
For Spark 2.4+, you could do
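from pyspark.sql import functions as F
df = (df
    # per session: collect (timestamp, activity) pairs and sort them by timestamp
    .withColumn("flow", F.expr("sort_array(collect_list(struct(timestamp, activity)) over (partition by session))"))
    # keep only the activity names and join them into one pattern string
    .withColumn("flow", F.expr("concat_ws(' -> ', transform(flow, v -> v.activity))"))
    .groupBy("flow").agg(F.countDistinct("session").alias("total_session"))
)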
df.show(truncate=False)
# +-------------------------------------------+-------------+
# |flow |total_session|
# +-------------------------------------------+-------------+
# |enter_store -> pay_at_cashier -> exit_store|2 |
# |enter_store -> exit_store |1 |
# +-------------------------------------------+-------------+
The first withColumn collects each session's (timestamp, activity) pairs into an array and sorts it by timestamp (make sure the timestamp column really is in timestamp format). The second keeps only the activity values from the array using the transform function and, if needed, joins them into a string with concat_ws. Grouping by that activity order and counting distinct sessions then gives the number of sessions per pattern.
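To make the logic easier to check on a small sample, here is the same idea as a plain-Python sketch (sort each session's events by timestamp, join the activities, count the resulting strings). It is not a replacement for the Spark version on 15M+ events; `count_flows` and the `events` list are names introduced here for illustration, using the rows from the question.

```python
from collections import Counter, defaultdict

# Sample rows from the question: (session, activity, timestamp)
events = [
    (1, "enter_store", "2022-03-01 23:25:11"),
    (1, "pay_at_cashier", "2022-03-01 23:31:10"),
    (1, "exit_store", "2022-03-01 23:55:01"),
    (2, "enter_store", "2022-03-02 07:15:00"),
    (2, "pay_at_cashier", "2022-03-02 07:24:00"),
    (2, "exit_store", "2022-03-02 07:35:55"),
    (3, "enter_store", "2022-03-05 11:07:01"),
    (3, "exit_store", "2022-03-05 11:22:51"),
]


def count_flows(rows):
    """Count sessions per ordered activity pattern."""
    by_session = defaultdict(list)
    for session, activity, timestamp in rows:
        by_session[session].append((timestamp, activity))

    flows = Counter()
    for steps in by_session.values():
        # 'yyyy-MM-dd HH:mm:ss' strings sort chronologically as plain text
        steps.sort()
        flows[" -> ".join(activity for _, activity in steps)] += 1
    return dict(flows)


flow_counts = count_flows(events)
print(flow_counts)
```

Sessions with a missing middle step need no special handling here: they simply produce a shorter pattern string.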

Karate - How to construct two tables, using lines from each to validate against the other [duplicate]

I want to use a single row under Examples in Cucumber, like below:
Examples:
| data1 | data2 | paymentOp  |
| MySql | uk1   | ?????????? |
Here paymentOp is a number that I get from a Java method which takes a List as an argument. The method returns several numbers, and I want to pass each of them as paymentOp.
One way to iterate is to copy the row and paste it again in the table, but I don't want that because the method's result is dynamic and may return 2 or 5 numbers.
Is it possible to achieve this using Karate?
How should I proceed? Any lead here would be much appreciated!
You can combine Examples: with dynamic behavior. Please read this example (especially the second one): https://github.com/intuit/karate/blob/master/karate-demo/src/test/java/demo/outline/examples.feature
Since you have difficulties reading the docs and examples (:P) here is a simple example. Take some time to understand it carefully.
Background:
  * def data = { one: 1, two: 2, three: 3 }
Scenario Outline:
  * match data.<key> == <value>
  Examples:
    | key   | value |
    | one   | 1     |
    | two   | 2     |
    | three | 3     |

Call function in pyspark with values from dataframe as strings

I have to call a function func_test(spark,a,b) which accepts two string values and creates a df from them. spark is a SparkSession variable.
These two string values come from two columns of another dataframe and differ from row to row of that dataframe.
I am unable to achieve this.
Things tried so far:
1.
ctry_df = func_test(spark, df.select("CTRY").first()["CTRY"],df.select("CITY").first()["CITY"])
Gives CTRY and CITY of only the first record of the df.
2.
ctry_df = func_test(spark, df['CTRY'],df['CITY'])
Gives Column<b'CTRY'> and Column<b'CITY'> as values.
Example:
df is:
+----------+----------+-----------+
| CTRY | CITY | XYZ |
+----------+----------+-----------+
| US | LA | HELLO|
| UK | LN | WORLD|
| SN | SN | SPARK|
+----------+----------+-----------+
So I want the first call to be func_test(spark,US,LA), the second call to be func_test(spark,UK,LN), the third call to be func_test(spark,SN,SN), and so on.
Pyspark - 3.7
Spark - 2.2
Edit 1:
Issue in detail:
func_test(spark,string1,string2) is a function which accepts two string values. Inside it, a series of dataframe operations is performed. For example, the first Spark SQL query in func_test is a normal select, and the two variables string1 and string2 are used in its where clause. The df produced by this query becomes a temp table for the next Spark SQL query, and so on. Finally, the function creates a df and returns it.
Now, in the main class, I have to call func_test with the two parameters string1 and string2 fetched from the records of the dataframe, so that the first func_test call generates the query select * from dummy where CTRY='US' and CITY='LA', and the subsequent operations produce a df. The second call to func_test becomes select * from dummy where CTRY='UK' and CITY='LN', the third becomes select * from dummy where CTRY='SN' and CITY='SN', and so on.
Instead of first(), use collect() and iterate over the result in a loop:
collect_vals = df.select('CTRY','CITY').distinct().collect()
for row_col in collect_vals:
    func_test(spark, row_col['CTRY'], row_col['CITY'])
Hope this helps!
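Since func_test returns a df, it may be useful to keep each result rather than discard it inside the loop. A minimal sketch of that pattern in plain Python, under stated assumptions: `rows` stands in for the collected distinct (CTRY, CITY) rows, and `func_test_stub` is a hypothetical placeholder that only builds the first query text instead of running the real function.

```python
# Stand-in for df.select('CTRY', 'CITY').distinct().collect()
rows = [
    {"CTRY": "US", "CITY": "LA"},
    {"CTRY": "UK", "CITY": "LN"},
    {"CTRY": "SN", "CITY": "SN"},
]


def func_test_stub(ctry, city):
    """Hypothetical placeholder for func_test: returns the first query text only."""
    return "select * from dummy where CTRY='{}' and CITY='{}'".format(ctry, city)


# One call per row, keyed by the (CTRY, CITY) pair so each result can be found later.
results = {
    (row["CTRY"], row["CITY"]): func_test_stub(row["CTRY"], row["CITY"])
    for row in rows
}

for key, query in results.items():
    print(key, query)
```

With the real func_test, the dict values would be the returned dataframes.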

iteration in spark sql dataframe , getting 1st row value in first iteration and second row value in next iteration and so on

Below is the query that returns the date and distance where the distance is <= 10 km:
var s=spark.sql("select date,distance from table_new where distance <=10km")
s.show()
This will give output like:
date       | distance
---------- | --------
12/05/2018 | 5
13/05/2018 | 8
14/05/2018 | 18
15/05/2018 | 15
16/05/2018 | 23
I want to use the first row of the dataframe s and store its date value in a variable v in the first iteration.
In the next iteration it should pick the second row, and the corresponding date value should replace the old value of the variable,
and so on.
I think you should look at Spark window functions; you may find what you need there.
The "bad" way to do this would be to collect the dataframe using df.collect(), which returns a list of Rows that you can iterate over in a loop. This is bad because it brings all the data to the driver.
A better way would be to use foreach():
df.foreach(lambda x: <<your code here>>)
foreach() takes a function as its argument and applies it to each row of the dataframe without bringing all the data to the driver. But you can't use a plain local variable v inside the lambda when it has to be overwritten, because the function runs on the executors. You can use Spark accumulators for such a case.
E.g., to sum all the values in the 2nd column:
# PySpark form; sc.longAccumulator(...) and row.get(1) are the Scala equivalents
counter = sc.accumulator(0)
df.foreach(lambda row: counter.add(row[1]))

Hive UDF 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence' generating same value for first two records

I am trying to generate auto-increment values using the Hive UDF UDFRowSequence, but it's generating the same id for the first two records.
+-------+----------+---+-------------------+
|rank_id| state| id| datetime|
+-------+----------+---+-------------------+
| 1|New Jersey| 10|2018-03-27 10:00:00|
| 1| Tamil| 25|2018-03-27 11:05:00|
| 2| TamilNa| 25|2018-03-27 11:15:00|
| 3| TamilNadu| 25|2018-03-27 11:25:00|
| 4| Gujarat| 30|2018-03-27 11:00:00|
+-------+----------+---+-------------------+
Here is the code that I am using for auto-increment.
package org.apache.hadoop.hive.contrib.udf;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.LongWritable;
/**
 * UDFRowSequence.
 */
@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
@UDFType(deterministic = false, stateful = true)
public class UDFRowSequence extends UDF
{
  private LongWritable result = new LongWritable();
  public UDFRowSequence() {
    result.set(0);
  }
  public LongWritable evaluate() {
    result.set(result.get() + 1);
    return result;
  }
}
Can anyone please tell me what I am doing wrong that generates the same id for the first two records?
Apparently, you are doing nothing wrong.
But it seems no such solution exists.
The reason you are getting repeated numbers is most likely that your evaluation happens in 2 mappers (or 2 executors, if you are using the Spark engine), and in each of them the UDF starts the sequence from 1.
So the same value for the first 2 records is just an accident. Results may vary depending on how many mappers are used to run the query.
You can achieve what you want by restricting the number of executors to 1. From a Spark perspective, I think you can use a repartition(1) operation.
Also have a look at this thread, which has some useful points.
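The effect described above can be reproduced with a small plain-Python simulation. This is only an illustration of the explanation, not Hive or Spark code: `RowSequence` mimics the stateful UDF, `number_rows` is a hypothetical helper that gives every partition its own counter, and the split of the five rows into two partitions is an assumption chosen to match the output in the question.

```python
class RowSequence:
    """Plain-Python analogue of the stateful UDF: one counter per instance."""

    def __init__(self):
        self.value = 0

    def evaluate(self):
        self.value += 1
        return self.value


def number_rows(partitions):
    """Number rows the way independent tasks would: a fresh counter per partition."""
    numbered = []
    for rows in partitions:
        seq = RowSequence()  # each task starts again from 1
        numbered.extend((seq.evaluate(), row) for row in rows)
    return numbered


states = ["New Jersey", "Tamil", "TamilNa", "TamilNadu", "Gujarat"]

# Assumed split: the first row in one task, the remaining four in another.
two_tasks = number_rows([states[:1], states[1:]])
# Everything in a single task, as after restricting to one partition.
one_task = number_rows([states])

print([n for n, _ in two_tasks])  # [1, 1, 2, 3, 4]
print([n for n, _ in one_task])   # [1, 2, 3, 4, 5]
```

With two counters the ids collide exactly as in the question; with a single counter they are unique, at the cost of losing parallelism.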