How to apply a custom filtering function on a Spark DataFrame - sql

I have a DataFrame of the form:
A_DF = |id_A: Int|concatCSV: String|
and another one:
B_DF = |id_B: Int|triplet: List[String]|
Examples of concatCSV could look like:
"StringD, StringB, StringF, StringE, StringZ"
"StringA, StringB, StringX, StringY, StringZ"
...
while a triplet is something like:
("StringA", "StringF", "StringZ")
("StringB", "StringU", "StringR")
...
I want to produce the Cartesian product of A_DF and B_DF, e.g.:
| id_A: Int | concatCSV: String | id_B: Int | triplet: List[String] |
| 14 | "StringD, StringB, StringF, StringE, StringZ" | 21 | ("StringA", "StringF", "StringZ")|
| 14 | "StringD, StringB, StringF, StringE, StringZ" | 45 | ("StringB", "StringU", "StringR")|
| 18 | "StringA, StringB, StringX, StringY, StringG" | 21 | ("StringA", "StringF", "StringZ")|
| 18 | "StringA, StringB, StringX, StringY, StringG" | 45 | ("StringB", "StringU", "StringR")|
| ... | | | |
Then keep just the records that have at least two substrings (e.g. StringA, StringB) from A_DF("concatCSV") that appear in B_DF("triplet"), i.e. use filter to exclude those that don't satisfy this condition.
First question is: can I do this without converting the DFs into RDDs?
Second question is: can I ideally do the whole thing in the join step, as a where condition?
I have tried experimenting with something like:
val cartesianRDD = A_DF
  .join(B_DF, "right")
  .where($"triplet".exists($"concatCSV".contains(_)))
but where cannot be resolved. I tried filter instead of where, but still no luck. Also, for some strange reason, the type of cartesianRDD comes out as SchemaRDD and not DataFrame. How did I end up with that? Finally, what I am trying above (the short code I wrote) is incomplete, as it would keep records with just one substring from concatCSV found in triplet.
So, third question is: Should I just change to RDDs and solve it with a custom filtering function?
Finally, last question: Can I use a custom filtering function with DataFrames?
Thanks for the help.

CROSS JOIN is supported in Hive, so you could first do the cross join using Hive SQL:
A_DF.registerTempTable("a")
B_DF.registerTempTable("b")
// sqlContext should really be a HiveContext
val result = sqlContext.sql("SELECT * FROM a CROSS JOIN b")
Then you can filter down to your expected output using two UDFs: one that converts the CSV string to an array of words, and a second one that computes the length of the intersection between that array column and the existing column "triplet":
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.{col, udf}

val splitArr = udf { (s: String) => s.split(",").map(_.trim) }
val commonLen = udf { (a: WrappedArray[String], b: WrappedArray[String]) => a.intersect(b).length }

val temp = result
  .withColumn("concatArr", splitArr(col("concatCSV")))
  .select(col("*"), commonLen(col("triplet"), col("concatArr")).alias("comm"))
  .filter(col("comm") >= 2)
  .drop("comm")
  .drop("concatArr")

temp.show
+----+--------------------+----+--------------------+
|id_A| concatCSV|id_B| triplet|
+----+--------------------+----+--------------------+
| 14|StringD, StringB,...| 21|[StringA, StringF...|
| 18|StringA, StringB,...| 21|[StringA, StringF...|
+----+--------------------+----+--------------------+

Related

fbprophet at scale with applyInPandas resulting in unexpected count values [PySpark]

I am using applyInPandas to implement a forecast function over sampled data, using groupBy on ID. The end goal is to calculate MAPE for each ID.
import pandas as pd
from fbprophet import Prophet

def forecast_balance(history_pd: pd.DataFrame) -> pd.DataFrame:
    anonym_cis = history_pd.at[0, 'ID']
    # instantiate the model, configure the parameters
    model = Prophet(
        interval_width=0.95,
        growth='linear',
        daily_seasonality=True,
        weekly_seasonality=True,
        yearly_seasonality=False,
        seasonality_mode='multiplicative'
    )
    # fit the model
    model.fit(history_pd)
    # configure predictions
    future_pd = model.make_future_dataframe(
        periods=30,
        freq='d',
        include_history=True
    )
    # make predictions
    results_pd = model.predict(future_pd)
    results_pd.loc[:, 'ID'] = anonym_cis
    # . . .
    # return predictions
    return results_pd[['ds', 'ID', 'yhat', 'yhat_upper', 'yhat_lower']]

results = (
    fr_sample
    .groupBy('ID')
    .applyInPandas(forecast_balance, schema=result_schema)
)
I am getting the expected prediction results. However, when I count the number of rows for each ID in the input data and the output data, they don't match. I would like to know where/how these extra 30 (292 - 262) rows per ID are getting created in the process.
Input data counts:
+----------+-----+
| ID|count|
+----------+-----+
| 482726| 262|
| 482769| 262|
| 483946| 262|
| 484124| 262|
| 484364| 262|
| 485103| 262|
+----------+-----+
Output data counts:
+----------+-----+
| ID|count|
+----------+-----+
| 482726| 292|
| 482769| 292|
| 483946| 292|
| 484124| 292|
| 484364| 292|
| 485103| 292|
+----------+-----+
Note:
This is how I am calculating MAPE as of now; it is not per ID but over all the data, hence it results in a single value (e.g. 1.4382).
from datetime import date
import pandas as pd
# mean_absolute_percentage_error is assumed to come from e.g. sklearn.metrics

def gr_mape_val(pd_sample_df, result_df):
    result_df = result_df.toPandas()
    actuals_pd = pd_sample_df[pd_sample_df['ds'] < date(2022, 3, 19)]['y']
    predicted_pd = result_df[result_df['ds'] < pd.to_datetime('2022-03-19')]['yhat']
    mape = mean_absolute_percentage_error(actuals_pd, predicted_pd)
    return mape
To use it in a groupBy fashion, per ID, I need the two counts mentioned above to match, but I am not able to figure out how.
I just found what was going on there:
Basically, with make_future_dataframe I am creating 30 extra data points per ID, which was changing the total count of predicted_pd.
This can be simply solved by using df.na.drop():
pd_sample_df.join(result_df, on=['ID', 'ds'], how='outer').na.drop()
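For the per-ID MAPE itself, here is a minimal sketch of how it could be computed with Spark's built-in aggregations instead of collecting to pandas. It assumes the sampled input (fr_sample) is a Spark DataFrame with columns ID, ds and y, and that results is the applyInPandas output from above; an inner join on ID and ds keeps only the historical rows, so the 30 future rows never enter the metric:
from pyspark.sql import functions as F

# assumption: fr_sample has columns ID, ds, y; results is the forecast DataFrame
joined = fr_sample.select('ID', 'ds', 'y').join(results, on=['ID', 'ds'], how='inner')

mape_per_id = (
    joined
    .withColumn('ape', F.abs((F.col('y') - F.col('yhat')) / F.col('y')))
    .groupBy('ID')
    .agg(F.avg('ape').alias('mape'))
)
mape_per_id.show()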

Spark - scan data frame based on value

I'm trying to find a column based on a value. For example, in the dataframe below, I'd like to know which column contains yellow for Category = A. The thing is, I don't know the column name (colour) in advance, so I couldn't do select * where Category = 'A' and colour = 'yellow'. How can I scan the columns and achieve this? Many thanks for your help.
+--------+------+----+
|Category|colour|name|
+--------+------+----+
|       A|  blue|Elmo|
|       A|yellow|Alex|
|       B|  desc|Erin|
+--------+------+----+
You can loop that check through the list of column names; you can also wrap the loop in a function for readability, as shown in the sketch after the code below. Please note that the per-column checks run in sequence.
from pyspark.sql import functions as F

cols = df.columns
for c in cols:
    cnt = df.where((F.col('Category') == 'A') & (F.col(c) == 'yellow')).count()
    if cnt > 0:
        print(c)
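Here is a minimal sketch of such a wrapper; find_matching_columns is a hypothetical name, and the logic is just the loop above parameterised by category and value:
from pyspark.sql import functions as F

def find_matching_columns(df, category, value):
    """Return the names of all columns that contain `value`
    on rows where Category == `category` (checked column by column)."""
    matches = []
    for c in df.columns:
        cnt = df.where((F.col('Category') == category) & (F.col(c) == value)).count()
        if cnt > 0:
            matches.append(c)
    return matches

# usage with the example above: find_matching_columns(df, 'A', 'yellow') -> ['colour']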

Karate - How to construct two tables, using lines from each to validate against the other [duplicate]

I want to use a single row under Examples: in Cucumber, like below:
Examples:
| data1 | data2|paymentOp|
| MySql | uk1 |??????????|
where paymentOp is a number which I get from a Java method that takes a List as an argument. The method returns each of the numbers that I want to pass under paymentOp.
There is of course a way to iterate by copying the row and pasting it again in the table, but I don't want that because the method's result is dynamic and may return 2 or 5 sets of numbers.
Is it possible to achieve it using Karate?
How do I proceed further? Any lead here would be much appreciated!
You can combine Examples: with dynamic behavior. Please read this example (especially the second one): https://github.com/intuit/karate/blob/master/karate-demo/src/test/java/demo/outline/examples.feature
Since you have difficulties reading the docs and examples (:P), here is a simple example. Take some time to understand it carefully.
Background:
* def data = { one: 1, two: 2, three: 3 }
Scenario Outline:
* match data.<key> == <value>
Examples:
| key | value |
| one | 1 |
| two | 2 |
| three | 3 |

Cypher: How to create a recursive cost query with alternates?

I have the following structure:
(:pattern)-[:contains]->(:pattern)
...basically a hierarchy of patterns that use other patterns as content. These constitute trees.
Certain patterns are generated by certain generators:
(:generator)-[:canProduce]->(:pattern)
The canProduce relationship has a cost value associated with it as a property. Multiple generators can create the same pattern.
I would like to figure out, with a query, what patterns I need to generate to produce a particular output - and which generators to choose to have the lowest cost. I started like this:
MATCH (p:pattern {name: 'preciousPattern'})-[:contains *]->(ps:pattern) RETURN ps
so far so good. The results don't contain the starting pattern, so I made this:
MATCH (p:pattern {name: 'preciousPattern'})-[:contains *]->(ps:pattern)
WITH p+collect(ps) as list
UNWIND list as patterns
RETURN patterns
That does not feel elegant, and it also does not preserve the hierarchy.
I can of course do a path query (MATCH path = ...), but the results don't seem very useful.
Also, now I need to connect the cost from the generator relationship.
I tried this:
MATCH (p:pattern {name: 'awesome'})-[:contains *]->(ps:pattern)
WITH p + collect(ps) AS list
UNWIND list AS rec
CALL {
    WITH rec
    MATCH (rec)-[r:canGenerate]-(g:generator)
    RETURN r.GenCost AS GenCost, g.name AS GenName
}
RETURN rec.name, GenCost, GenName
The problem I have now is that if any of the patterns that are part of another pattern can be generated by multiple generators, I just get duplicate entries in the list. What I want instead is a separate list for each alternative possibility, so that I can calculate the cost of each.
This is my pattern tree:
Awesome
    input1
    input2
    input3
Input 3 can be generated by 2 different generators. I now get:
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input3 | 1.25 | TestGen3
input4 | 1.4 | TestGen4
What I want is this: two lists (or n in the general case, where I might have n possible paths), one:
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input3 | 1.25 | TestGen3
and one:
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input4 | 1.4 | TestGen4
each set representing one alternative, so that I can calculate the costs and compare them.
I have no idea how to do something like that. Any suggestions?

Call function in pyspark with values from dataframe as strings

I have to call a function func_test(spark, a, b) which accepts two string values and creates a df out of them. spark is a SparkSession variable.
These two string values are two columns of another dataframe and would be different for different rows of that dataframe.
I am unable to achieve this.
Things tried so far:
1.
ctry_df = func_test(spark, df.select("CTRY").first()["CTRY"],df.select("CITY").first()["CITY"])
Gives CTRY and CITY of only the first record of the df.
2.
ctry_df = func_test(spark, df['CTRY'],df['CITY'])
Gives Column<b'CTRY'> and Column<b'CITY'> as values.
Example:
df is:
+----------+----------+-----------+
| CTRY | CITY | XYZ |
+----------+----------+-----------+
| US | LA | HELLO|
| UK | LN | WORLD|
| SN | SN | SPARK|
+----------+----------+-----------+
So, I want the first call to be func_test(spark, US, LA); the second call func_test(spark, UK, LN); the third call func_test(spark, SN, SN); and so on.
Pyspark - 3.7
Spark - 2.2
Edit 1:
Issue in detail:
func_test(spark, string1, string2) is a function which accepts two string values, and inside it a set of various dataframe operations is performed. For example, the first Spark SQL in func_test is a normal select, and the two variables string1 and string2 are used in the where clause. The df generated by this Spark SQL becomes a temp table for the next Spark SQL, and so on. Finally, the function creates a df which func_test(spark, string1, string2) returns.
Now, in the main class, I have to call func_test, and the two parameters string1 and string2 will be fetched from the records of a dataframe. So the first func_test call generates the query select * from dummy where CTRY='US' and CITY='LA', and the subsequent operations happen, resulting in a df. The second call to func_test becomes select * from dummy where CTRY='UK' and CITY='LN'. The third call becomes select * from dummy where CTRY='SN' and CITY='SN', and so on.
Instead of first(), use collect() and iterate through the rows in a loop:
collect_vals = df.select('CTRY', 'CITY').distinct().collect()
for row_col in collect_vals:
    func_test(spark, row_col['CTRY'], row_col['CITY'])
hope this helps !!
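If you also need to keep the DataFrame that each call returns, one possible pattern (a sketch, assuming every func_test result has the same schema) is to collect the per-pair results and union them:
from functools import reduce

result_dfs = []
for row_col in df.select('CTRY', 'CITY').distinct().collect():
    # each call builds the df for one (CTRY, CITY) pair, e.g. ('US', 'LA')
    result_dfs.append(func_test(spark, row_col['CTRY'], row_col['CITY']))

# combine all per-pair results into a single DataFrame (schemas must match)
combined_df = reduce(lambda a, b: a.union(b), result_dfs)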