Call function in pyspark with values from dataframe as strings - apache-spark-sql

I have to call a function func_test(spark,a,b) which accepts two string values and create a df out of it. spark is a SparkSession variable
These two string values are two columns of another dataframe and would be different for different rows of that dataframe.
I am unable to achieve this.
Things tried so far:
1.
ctry_df = func_test(spark, df.select("CTRY").first()["CTRY"],df.select("CITY").first()["CITY"])
Gives CTRY and CITY of only the first record of the df.
2.
ctry_df = func_test(spark, df['CTRY'],df['CITY'])
Gives Column<b'CTRY'> and Column<b'CITY'> as values.
Example:
df is:
+----------+----------+-----------+
| CTRY | CITY | XYZ |
+----------+----------+-----------+
| US | LA | HELLO|
| UK | LN | WORLD|
| SN | SN | SPARK|
+----------+----------+-----------+
So, I want first call to fetch func_test(spark,US,LA); second call to go func_test(spark,UK,LN); third call to be func_test(spark,SN,SN) and so on.
Pyspark - 3.7
Spark - 2.2
Edit 1:
Issue in detail:
func_test(spark,string1,string2) is a function which accepts two string values. Inside this function is a set of various dataframe operations done. For example:- First spark sql in the func_test is a normal select and these two variables string1 and string2 are used in the where clause. The result of this spark sql which generates a df is a temp table of next spark sql and so on. Finally, it creates a df which this function func_test(spark,string1,string2) returns.
Now, In the main class, I have to call this func_test and the two parameters string1 and string2 will be fetched from records of dataframe. So that, first func_test call generates query as select * from dummy where CTRY='US' and CITY='LA'. And the subsequent operations happen which results in df. Second call to func_test becomes select * from dummy where CTRY='UK' and CITY='LN'. Third call becomes select * from dummy where CTRY='SN' and CITY='SN' and so on.

instead of first() use collect() and iterate through the loop
collect_vals = df.select('CTRY','CITY').distinct().collect()
for row_col in collect_vals:
func_test(spark, row_col['CTRY'],row_col['CITY'])
hope this helps !!

Related

PySpark - change dataframe column value based on its existence in other dataframe

I am relatively new in using Spark and didn't find out solution for this. I found several similar questions related to this but didn't find the way how to bundle this in my use-case.
I have two dataframes, first is based on CSV which looks like this (displayed as table):
id
license_no
2005
1011
2006
1022
2007
3911
Second dataframe is based on CSV which looks like this:
license_no
active
1011
y
1022
y
3911
n
I need to check does license_no value exists in second dataframe, and if it exists and active=y, then I need to add some prefix (99 in the beginning) to its id and license_no in first dataframe - for example, license_id exists in second dataframe and it is active, so its ID in first dataframe should be changed to 992005 and license_no do 991011. If doesn't exits 88 should be added.
Dataframe should look like this after transformations:
id
license_no
992005
991011
992006
991022
882007
883911
I am not sure what is best solution for this, can I directly do this transformation in one spark command?
#Join
s = df.join(df1,how='left',on='license_no')
#Contitionally concat prefix using list squares
s.select(*[when(col('active')=='y',concat(lit('99'),str(x))).otherwise(concat(lit('88'),str(x))).alias(x) for x in s.columns if x!='active']).show()
+----------+------+
|license_no| id|
+----------+------+
| 991011|992005|
| 991022|992006|
| 883911|882007|
+----------+------+

how to reduce rows to 1 row by concatenate in Azure Log Analytics

string row1
string row2
Is it possible to reduce rows to 1 row?
Rows should be joined with a comma.
As a result I expect
string row1, string row2
One of the workaround could able to solve the above issue,
To concatenate we can use this for e.g | extend New_Column = strcat(tagname,",", tagvalue) with comma between two string.
For example we have tested in our environment with tag name and tag value
resourcecontainers
| where type =~ 'microsoft.resources/subscriptions'
| extend tagname = tostring(bag_keys(tags)[0])
| extend tagvalue = tostring(tags[tagname])
| extend New_Column = strcat(tagname,",", tagvalue) // for concate two rows into one with comma between two string
Here is the sample output for reference:
For more information please refer this SO THREAD
UPDATE: To cancat the rows we tried with the example of code as stated in the given SO THREAD which is suggested by #Yoni L.
| summarize result = strcat_array(make_list(word), ",")
Sample output for reference:
thanks for the tips. in your links I found:
| summarize result = strcat_array(make_list(name_s), ",")

Function to filter values in PySpark

I'm trying to run a for loop in PySpark that needs a to filter a variable for an algorithm.
Here's an example of my dataframe df_prods:
+----------+--------------------+--------------------+
|ID | NAME | TYPE |
+----------+--------------------+--------------------+
| 7983 |SNEAKERS 01 | Sneakers|
| 7034 |SHIRT 13 | Shirt|
| 3360 |SHORTS 15 | Short|
I want to iterate over a list of ID's, get the match from the algorithm and then filter the product's type.
I created a function that gets the type:
def get_type(ID_PROD):
return [row[0] for row in df_prods.filter(df_prods.ID == ID_PROD).select("TYPE").collect()]
And wanted it to return:
print(get_type(7983))
Sneakers
But I find two issues:
1- it takes a long time to do that (longer than I got doing a similar thing on Python)
2- It returns an string array type: ['Sneakers'] and when I try to filter the products, this happens:
type = get_type(7983)
df_prods.filter(df_prods.type == type)
java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [Sneakers]
Does anyone know a better way to approach this on PySpark?
Thank you very much in advance. I'm having a very hard time learning PySpark.
A little adjustment on your function. This returns the actual string of the target column from the first record found after filtering.
from pyspark.sql.functions import col
def get_type(ID_PROD):
return df.filter(col("ID") == ID_PROD).select("TYPE").collect()[0]["TYPE"]
type = get_type(7983)
df_prods.filter(col("TYPE") == type) # works
I find using col("colname") to be much more readable.
About the performance issue you've mentioned, I really cannot say without more details (e.g. inspecting the data and the rest of your application). Try this syntax and tell me if the performance improves.

iteration in spark sql dataframe , getting 1st row value in first iteration and second row value in next iteration and so on

Below is the query that will give the data and distance where distance is <=10km
var s=spark.sql("select date,distance from table_new where distance <=10km")
s.show()
this will give the output like
12/05/2018 | 5
13/05/2018 | 8
14/05/2018 | 18
15/05/2018 | 15
16/05/2018 | 23
---------- | --
i want to use first row of the dataframe s , store the date value in a variable v , in first iteration.
In next iteration it should pick the second row , and corresponding data value to be replaced the old variable b .
like wise so on .
I think you should look at Spark "Window Functions". You may find here what you need.
The "bad" way to do this would be to collect the dataframe using df.collect() which would return a list of Rows which you can manually iterate over each using a loop.This is bad cause it brings all the data in your driver.
The better way would be to use foreach() :
df.foreach(lambda x: <<your code here>>)
foreach() takes a lambda function as argument which iterates over each row of the dataframe without bringing all the data in the driver.But you cant use a simple local variable v inside a lambda fuction when there is overwriting involved.you can use spark accumulators for such a case.
eg: if i want to sum all the values in 2nd column
counter = sc.longAccumulator("counter")
df.foreach(lambda row: counter.add(row.get(1)))

How to apply a custom filtering function on a Spark DataFrame

I have a DataFrame of the form:
A_DF = |id_A: Int|concatCSV: String|
and another one:
B_DF = |id_B: Int|triplet: List[String]|
Examples of concatCSV could look like:
"StringD, StringB, StringF, StringE, StringZ"
"StringA, StringB, StringX, StringY, StringZ"
...
while a triplet is something like:
("StringA", "StringF", "StringZ")
("StringB", "StringU", "StringR")
...
I want to produce the cartesian set of A_DF and B_DF, e.g.;
| id_A: Int | concatCSV: String | id_B: Int | triplet: List[String] |
| 14 | "StringD, StringB, StringF, StringE, StringZ" | 21 | ("StringA", "StringF", "StringZ")|
| 14 | "StringD, StringB, StringF, StringE, StringZ" | 45 | ("StringB", "StringU", "StringR")|
| 18 | "StringA, StringB, StringX, StringY, StringG" | 21 | ("StringA", "StringF", "StringZ")|
| 18 | "StringA, StringB, StringX, StringY, StringG" | 45 | ("StringB", "StringU", "StringR")|
| ... | | | |
Then keep just the records that have at least two substrings (e.g StringA, StringB) from A_DF("concatCSV") that appear in B_DF("triplet"), i.e. use filter to exclude those that don't satisfy this condition.
First question is: can I do this without converting the DFs into RDDs?
Second question is: can I ideally do the whole thing in the join step--as a where condition?
I have tried experimenting with something like:
val cartesianRDD = A_DF
.join(B_DF,"right")
.where($"triplet".exists($"concatCSV".contains(_)))
but where cannot be resolved. I tried it with filter instead of where but still no luck. Also, for some strange reason, type annotation for cartesianRDD is SchemaRDD and not DataFrame. How did I end up with that? Finally, what I am trying above (the short code I wrote) is incomplete as it would keep records with just one substring from concatCSV found in triplet.
So, third question is: Should I just change to RDDs and solve it with a custom filtering function?
Finally, last question: Can I use a custom filtering function with DataFrames?
Thanks for the help.
The function CROSS JOIN is implemented in Hive, so you could first do the cross-join using Hive SQL:
A_DF.registerTempTable("a")
B_DF.registerTempTable("b")
// sqlContext should be really a HiveContext
val result = sqlContext.sql("SELECT * FROM a CROSS JOIN b")
Then you can filter down to your expected output using two udf's. One that converts your string to an array of words, and a second one that gives us the length of the intersection of the resulting array column and the existing column "triplet":
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.col
val splitArr = udf { (s: String) => s.split(",").map(_.trim) }
val commonLen = udf { (a: WrappedArray[String],
b: WrappedArray[String]) => a.intersect(b).length }
val temp = (result.withColumn("concatArr",
splitArr(col("concatCSV"))).select(col("*"),
commonLen(col("triplet"), col("concatArr")).alias("comm"))
.filter(col("comm") >= 2)
.drop("comm")
.drop("concatArr"))
temp.show
+----+--------------------+----+--------------------+
|id_A| concatCSV|id_B| triplet|
+----+--------------------+----+--------------------+
| 14|StringD, StringB,...| 21|[StringA, StringF...|
| 18|StringA, StringB,...| 21|[StringA, StringF...|
+----+--------------------+----+--------------------+