Add column in Spark dataframe referring another dataframe using UDF

I have a dataframe "Forecast" with columns Store, Item, FC_startdate, FC_enddate, FC_qty.
I have another dataframe "Actual" with columns Store, Item, Saledate, Sales_qty.
I want to create a UDF that takes the parameters p_store, p_item, p_startdate, p_enddate, returns the sum of Sales_qty between these dates, and adds it as a new column (Act_qty) to the "Forecast" dataframe.
However, Spark does not allow passing a dataframe to a UDF along with the fields of Forecast.
What could be a solution that does not use merge?

After defining and registering your UDF, you can use it in your transformation code like any other function of the spark-sql library.
As with the spark-sql library functions, you can only pass columns of your dataframe and return the processed value. Dataframes cannot be passed to UDFs.
So in your case you can transform your current dataframe into another dataframe by using the UDF as a function, and then proceed from there.
https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html

A golden rule is that anything that can be done without UDFs should be done without UDFs. They are better suited to a very specific transformation on a single row than to the big aggregation-type operation you describe.
In this case it seems like you could just use SparkSQL: select the rows of Actual where Saledate is between the dates you would like (Spark understands dates natively, refer to the documentation), sum Sales_qty per Store or Item, or both (I am not sure what you intend to do), rename the sum column, and join this new dataframe to Forecast using Store or Item, or both again.
If you nevertheless insist on using UDFs, you will have to pass columns rather than dataframes as arguments, but I can't think of a straightforward way to achieve what you describe with UDFs without sacrificing a lot of performance.
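To make the join-then-aggregate idea concrete, here is a minimal sketch of the logic in pandas rather than Spark, only because it is easy to run locally. The sample rows and the helper name add_actual_qty are made up for illustration. In Spark the same shape would be a join on Store and Item, a between condition on Saledate, and a grouped sum.

```python
import pandas as pd

# Made-up sample data using the column names from the question.
forecast = pd.DataFrame({
    "Store": ["S1", "S1", "S2"],
    "Item": ["A", "B", "A"],
    "FC_startdate": pd.to_datetime(["2020-01-01"] * 3),
    "FC_enddate": pd.to_datetime(["2020-01-07"] * 3),
    "FC_qty": [10, 20, 30],
})
actual = pd.DataFrame({
    "Store": ["S1", "S1", "S1", "S2"],
    "Item": ["A", "A", "B", "A"],
    "Saledate": pd.to_datetime(
        ["2020-01-02", "2020-01-09", "2020-01-03", "2020-01-05"]
    ),
    "Sales_qty": [3, 4, 5, 6],
})


def add_actual_qty(forecast, actual):
    """Hypothetical helper: add Act_qty to every forecast row."""
    # Left join so that forecast rows without any sales are kept.
    joined = forecast.merge(actual, on=["Store", "Item"], how="left")
    # Sales outside the forecast window (or missing) count as zero.
    in_window = joined["Saledate"].between(
        joined["FC_startdate"], joined["FC_enddate"]
    )
    joined["Sales_qty"] = joined["Sales_qty"].where(in_window, 0)
    keys = ["Store", "Item", "FC_startdate", "FC_enddate", "FC_qty"]
    summed = joined.groupby(keys, as_index=False)["Sales_qty"].sum()
    return summed.rename(columns={"Sales_qty": "Act_qty"})


result = add_actual_qty(forecast, actual)
print(result)
```

The sale on 2020-01-09 falls outside the forecast window for store S1, item A, so it does not contribute to Act_qty for that row.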

Related

Qlik sense: How to aggregate strings into single row in script

I am trying to aggregate strings that belong to the same product code in one row. Which Qlik sense aggregation function should I use?
I am able to aggregate integers in such example, but failed for string aggregation.
Have you tried MaxString()? It is a string aggregation function.
As x3ja mentioned, you can use an aggregation function in charts that will work for strings, including:
MaxString()
Only()
Concat()
These can result in the type of thing you're looking for:
It's worth noting, though, that this sort of problem is almost always an issue with the underlying data model. Depending on what your source data looks like, you should consider investigating your use of Join and/or Concatenate. You can see more info on how to use those functions on this Qlik Help page.
Here's a very basic example of using a Join to properly combine the data in a way that results in all data showing up a single record without needing any aggregations in the table chart:

pandas - running into problems setting multiple columns using results from pd.apply()

I have a function that returns tuples. When I apply this to my pandas dataframe using pd.apply() function, the results look this way.
The Date here is an index and I am not interested in it.
I want to create two new columns in a dataframe and set their values to the values you see in these tuples.
How do I do this?
I tried the following:
This errors out, citing a mismatch between expected and available values. It sees each tuple as a single entity, so the two columns I specified on the left-hand side are a problem. It's expecting only one.
What I need is to break each tuple into two parts that can be used to set two different columns.
What's the correct way to achieve this?
Make your function return a pd.Series, this will be expanded into a frame.
orders.apply(lambda x: pd.Series(myFunc(x)), axis=1)
Use zip to unpack the column of tuples:
orders['a'], orders['b'] = zip(*orders['your_column'])
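For reference, a small self-contained sketch of both suggestions. The orders frame, its price and qty columns, and my_func are invented for illustration and are not the asker's actual data or function.

```python
import pandas as pd

# Made-up frame standing in for the asker's data.
orders = pd.DataFrame({"price": [10.0, 20.0], "qty": [2, 3]})


def my_func(row):
    """Hypothetical function that returns a tuple per row."""
    return row["price"] * row["qty"], row["price"] + row["qty"]


# Option 1: wrap the tuple in a pd.Series so apply expands it into columns.
expanded = orders.apply(
    lambda r: pd.Series(my_func(r), index=["total", "sum"]), axis=1
)

# Option 2: keep the tuples and unpack them column-wise with zip.
orders["total"], orders["sum"] = zip(*orders.apply(my_func, axis=1))

print(expanded)
print(orders)
```

Both options produce the same two columns. The zip version avoids building a Series per row, which tends to matter on larger frames.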

Spark SQL UDF to run same logic on different tables

I want to run the same SQL logic twice, on two different underlying tables. Is there a way to do this in spark that doesn't involve writing the exact same logic twice with just the table name being different?
You can use spark.sql(s"query logic from ${tablename}").
Another way is to use unbound columns via col("column_name") instead of referencing them through a dataframe reference, and then wrap the logic in a function:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

def processDf(df: DataFrame): DataFrame = {
  // Dummy logic for illustration: col("input_col") resolves against whichever df is passed in
  df.withColumn("some_col", col("input_col") + lit(5))
}
Now you can pass any data frame to this function, as long as it has a numeric input_col in its schema, and it will work irrespective of the data frame reference. For incompatible schemas and advanced use cases I would advise looking into Transformers from spark ml.
A common pattern in spark ml is a transform method that takes a Dataset[_] and outputs a DataFrame. In case of an incompatible schema you can pass these as parameters.

List of aggregation functions in Spark SQL

I'm looking for a list of pre-defined aggregation functions in Spark SQL. I have in mind something analogous to Presto Aggregate Functions.
I Ctrl+F'd around a little in the SQL API docs to no avail... it's also hard to tell at a glance which functions are for aggregation vs. not. For example, if I didn't know avg is an aggregation function I'd be hard pressed to tell it is one (in a way that's actually scalable to the full set of functions):
avg - avg(expr) - Returns the mean calculated from values of a group.
If such a list doesn't exist, can someone at least confirm to me that there's no pre-defined function like any/bool_or or all/bool_and to determine if any or all of a boolean column in a group are true (or false)?
For now, my workaround is
select grp_col, count(if(bool_col, true, NULL)) > 0 any_agg
Just take a look at Spark Docs on Aggregate functions section
The list of functions is here, under RelationalGroupedDataset, specifically the APIs that return a DataFrame (not a RelationalGroupedDataset):
https://spark.apache.org/docs/latest/api/scala/index.html?org/apache/spark/sql/RelationalGroupedDataset.html#org.apache.spark.sql.RelationalGroupedDataset

openrefine, cluster and edit two datasets

I have two datasets. In dataset one, column A holds the ids and column B holds the data I need to cluster and edit using the various available algorithms. Dataset two again has the ids in the first column and the data in the next column. I need to reconcile only the data from dataset one against the data from dataset two. What I have done so far is merge the two into one dataset, but then OpenRefine gives me mixed results, i.e. messy data that exists only in dataset two, which is not what I want in the current phase.
I have also investigated Reconcile-csv, but without success in achieving the desired result. Any ideas?
An alternative approach to using the reconciliation approach described by Ettore is to use algorithms similar to the 'key collision' clustering algorithms to create shared keys between the two data sets and then use this to do lookups between the data sets using the 'cross' function.
As an example for Column B in each data set you could 'Add column based on this column' using the GREL:
value.fingerprint()
This creates the same key as is used by the "Fingerprint" clustering method. Let's call the new column 'Column C'.
You can then look up between the two projects using the following GREL in Dataset 2:
cells["Column C"].cross("Dataset 1","Column C")
If the values in Dataset 1 and Dataset 2 would have clustered together under the fingerprint method, then the lookup between the projects will work.
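To show why matching on a fingerprint key works, here is a rough Python approximation of the idea. It is not OpenRefine's exact implementation, and the sample ids and names are invented. The dictionary lookup plays the role of the cross() call.

```python
import string
import unicodedata


def fingerprint(value):
    """Approximate fingerprint key: normalize, then sort the unique tokens."""
    text = value.strip().lower()
    # Fold accented characters to plain ASCII where possible.
    text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Drop punctuation, then split on whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


# Made-up rows: id -> value to be matched.
dataset_1 = {"id1": "Smith, John", "id2": "ACME Corp."}
dataset_2 = {"x9": "john smith", "x7": "Acme corp", "x5": "Unrelated Ltd"}

# Index Dataset 1 by key (the equivalent of 'Column C' in that project).
index_1 = {}
for row_id, name in dataset_1.items():
    index_1.setdefault(fingerprint(name), []).append(row_id)

# For each Dataset 2 row, look up Dataset 1 rows sharing the same key.
matches = {
    row_id: index_1.get(fingerprint(name), [])
    for row_id, name in dataset_2.items()
}
print(matches)
```

Rows whose values differ only in case, punctuation, or word order end up with the same key and therefore match. Anything further apart does not, which is the limitation described next.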
You can also use the phonetic keying algorithms to create match keys in Column C if that works better. What you can't do using this method (as far as I know) is the equivalent of the Nearest Neighbour matching - you'd have to have a reconciliation service with fuzzy matching of some kind, or merge the two data sets, to achieve this.
Owen
Reconcile-CSV is a very good tool, but not very user friendly. You can use as an alternative the free Excel plugin Fuzzy Lookup Add-In for Excel. It's very easy to use, as evidenced by this screencast. One constraint: the two tables to be reconciled must be in Excel table format (select and CTRL + L).
And here is the same procedure with reconcile-csv (the GREL formula used is cell.recon.best.name and comes from here)