I want to run the same SQL logic twice, on two different underlying tables. Is there a way to do this in Spark that doesn't involve writing the exact same logic twice with just the table name being different?
One way is to interpolate the table name into the query string: spark.sql(s"query logic from ${tablename}").
Another way is to use unbound columns via col("column_name") instead of referencing them through a specific DataFrame, and then wrap the logic in a function:
def processDf(df: DataFrame): DataFrame = {
  df.withColumn("some_col", col("input_col") + lit(5))
  // this is just an illustration with dummy code
}
Now you can pass any DataFrame that has a numeric input_col in its schema to this function, and it will work irrespective of the DataFrame reference. For incompatible schemas and more advanced use cases I would advise looking into Transformers from Spark ML.
This is a common pattern in Spark ML: the transform method takes a Dataset[_] and outputs a DataFrame. In case of incompatible schemas you can pass the differing parts (for example the column names) as parameters.
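For example, a minimal sketch combining both approaches (the table names sales_2021 and sales_2022, the column names, and the query itself are made up for illustration):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

val spark = SparkSession.builder().appName("reuse-logic").getOrCreate()

// Approach 1: interpolate the table name into the SQL string
def loadTable(tablename: String): DataFrame =
  spark.sql(s"SELECT input_col FROM $tablename WHERE input_col IS NOT NULL")

// Approach 2: the same column-based logic applied to any DataFrame with input_col
def processDf(df: DataFrame): DataFrame =
  df.withColumn("some_col", col("input_col") + lit(5))

// The logic is written once and reused on both tables
val resultA = processDf(loadTable("sales_2021"))
val resultB = processDf(loadTable("sales_2022"))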
So I am uploading this table to BigQuery with
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.PARQUET
job_config.write_disposition = "WRITE_TRUNCATE"
pq_opt = ParquetOptions()
pq_opt.enable_list_inference = True
job_config.parquet_options = pq_opt
job = self.client.load_table_from_file(source_file, table_ref, job_config=job_config)
where, say, I have a pa.schema with entries of the type:
("image_id", pa.list_(pa.string())),
And as suggested in this question I use enable_list_inference. Before using enable_list_inference the schema in BQ looks like:
Schema without list inference
And after I use it I get:
Schema with list inference
So what's the reason I lose the list part but not the item part? I am passing a normal data frame with lists in some rows. How can I ditch the .item part and just have a column of REPEATED entries?
This became too long for a comment but there are two potential things going on here (not sure if fixing either would help but it is worth trying).
Pyarrow by default assumes columns are nullable. You can try making the top-level list non-nullable (and/or the inner string element non-nullable) if they in fact cannot contain nulls; this should change the schema inference. BQ needs to maintain intermediate records (similar to Parquet) to handle repeated elements that might also be null.
The second thing to try is to set use_compliant_nested_types=True when writing the pyarrow table. This will change the inner element name from "item" to "element", which is the correct name according to the Parquet specification, and might also affect things. (BigQuery should support either, so I think the first option is the more likely one to work.)
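A minimal sketch of both suggestions (the data and the output file name are made up; the parts that matter are the nullable flags and the write option):

import pyarrow as pa
import pyarrow.parquet as pq

# Non-nullable list column with a non-nullable string element
schema = pa.schema([
    pa.field(
        "image_id",
        pa.list_(pa.field("element", pa.string(), nullable=False)),
        nullable=False,
    ),
])

table = pa.Table.from_pydict({"image_id": [["a", "b"], ["c"]]}, schema=schema)

# use_compliant_nested_types names the inner element "element" instead of "item"
pq.write_table(table, "images.parquet", use_compliant_nested_types=True)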
I have a dataframe "Forecast" with columns: Store, Item, FC_startdate, FC_enddate, FC_qty.
Another dataframe "Actual" with columns: Store, Item, Saledate, Sales_qty.
I want to create a UDF with parameters passed - p_store, p_item, p_startdate, p_enddate and get the sum of Sales_qty in between these dates and add this as a new column (Act_qty) to "Forecast" dataframe.
But Spark does not allow passing a dataframe to a UDF along with fields of Forecast.
Instead of using a merge, what can the solution be?
After defining and registering your udf, you can use the udf function in your transformation code like any other function of the spark-sql library.
Similar to the spark-sql library functions, you can only pass columns of your dataframe and return the processed value. Dataframes cannot be passed to UDFs.
So in your case you can transform your current dataframe into another dataframe by using the udf as a function and then proceed ahead.
https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
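For example, a minimal sketch (the DataFrame df, the column some_col, and the function itself are placeholders):

import org.apache.spark.sql.functions.{col, udf}

// A UDF operates on column values, not on whole DataFrames
val addFive = udf((x: Int) => x + 5)

// Register it only if you also want to call it from SQL
spark.udf.register("add_five", (x: Int) => x + 5)

val transformed = df.withColumn("some_col_plus_five", addFive(col("some_col")))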
A golden rule is that anything that can be done without UDFs should be done without UDFs; they are better suited to a very specific transformation on a single row than to the big aggregation-type operation you describe.
In this case it seems like you could just use Spark SQL: select the rows of Actual where the Saledate is between the dates you want (Spark understands dates natively, refer to the documentation), sum Sales_qty per Store or Item, or both (I am not sure what you intend to do), rename the sum column, and join this new dataframe onto Forecast using Store or Item, or both, again, as sketched below.
If you, however, insist on using UDFs, you will have to pass columns rather than dataframes as arguments, and I can't think of a straightforward way to achieve what you describe with UDFs without sacrificing a lot of performance.
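A minimal sketch of that approach (the data is dummy; only the join-and-aggregate pattern matters). Each Forecast row gets the sum of Sales_qty whose Saledate falls inside its forecast window:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("forecast-actuals").getOrCreate()
import spark.implicits._

val forecast = Seq(("S1", "I1", "2021-01-01", "2021-01-31", 100))
  .toDF("Store", "Item", "FC_startdate", "FC_enddate", "FC_qty")
val actual = Seq(("S1", "I1", "2021-01-10", 40), ("S1", "I1", "2021-02-05", 25))
  .toDF("Store", "Item", "Saledate", "Sales_qty")

// Keep only the sales that fall inside each forecast window
val joined = forecast.join(
  actual,
  forecast("Store") === actual("Store") &&
    forecast("Item") === actual("Item") &&
    actual("Saledate").between(forecast("FC_startdate"), forecast("FC_enddate")),
  "left")

// Sum the matching sales per forecast row and attach the result as Act_qty
val withActQty = joined
  .groupBy(forecast("Store"), forecast("Item"),
    col("FC_startdate"), col("FC_enddate"), col("FC_qty"))
  .agg(coalesce(sum(actual("Sales_qty")), lit(0)).as("Act_qty"))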
I am trying to transpose/unpivot a dataframe in SparkR. I don't find any direct method available in the SparkR package to accomplish unpivoting a dataframe. Nor am I able to use an R package on a SparkR dataframe, even after using the includePackage method. It would be helpful if someone could let me know if there are direct ways to unpivot using SparkR, or other alternatives such as Hive.
Nor am I able to use an R package on a SparkR dataframe
Native R commands don't run on Spark DataFrames. Only Spark commands run on Spark DataFrames. If you want to run an R command on a Spark DataFrame you can collect() to convert it to an R data.frame, but you lose the benefits of distributed processing.
The Spark DataFrame is a similar construct to a table in a relational database. By working with Spark commands on a Spark DataFrame you will retain the benefits of distributed processing across the cluster.
It's difficult to answer such a general question - normally on this forum people expect specific examples with data and code. In general, if I wanted to un-pivot a relational table then the most basic way would be to create a set of queries, each query containing the row key plus one column, filtered for non-null in the column. I would then union together the multiple results into a new DataFrame.
If your preference is for R language syntax, that union can be done using the unionAll(x,y) command in SparkR, which will be processed across the cluster (unlike an R command on an R data.frame).
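A minimal SparkR sketch of that approach (the column names id, q1 and q2 are made up, df is assumed to be an existing SparkR DataFrame, and q1 and q2 are assumed to share a type):

library(SparkR)

# One selection per value column: keep the row key, tag the source column,
# and drop rows where that column is null
part1 <- select(filter(df, isNotNull(df$q1)),
                df$id, alias(lit("q1"), "variable"), alias(df$q1, "value"))
part2 <- select(filter(df, isNotNull(df$q2)),
                df$id, alias(lit("q2"), "variable"), alias(df$q2, "value"))

# Union the per-column results into one unpivoted DataFrame
unpivoted <- unionAll(part1, part2)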
How can I use the new UDF functionality to create a "dynamic SQL statement"?
Is there a way to use a UDF to construct a SQL statement based on a template and input variables, and later run this query?
The documentation https://cloud.google.com/bigquery/user-defined-functions?hl=en says:
A UDF is similar to the "Map" function in a MapReduce: it takes a
single row as input and produces zero or more rows as output. The
output can potentially have a different schema than the input.
So your UDF receives just a single row.
Therefore, no, a UDF is not for the purpose you described in your question.
You might take a look at views - maybe that will suit you better:
https://cloud.google.com/bigquery/querying-data#views
I am using Spark with Scala and trying to get data from a database using JdbcRDD.
val rdd = new JdbcRDD(sparkContext,
  driverFactory,
  testQuery,
  rangeMinValue.get,
  rangeMaxValue.get,
  partitionCount,
  rowMapper)
  .persist(StorageLevel.MEMORY_AND_DISK)
Within the query there are no ? placeholders to set (since the query is quite long I am not putting it here). So I get an error saying:
java.sql.SQLException: Parameter index out of range (1 > number of parameters, which is 0).
I have no idea what the problem is. Can someone suggest any kind of solution?
Got the same problem.
Used this:
SELECT * FROM tbl WHERE ... AND ? = ?
And then call it with lower bound 1, upper bound 1, and partition count 1.
It will always run only one partition.
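A sketch of that workaround (the connection string and WHERE clause are placeholders): the dummy ? = ? gives JdbcRDD something to bind its bounds to, and with lower bound 1, upper bound 1 and one partition the predicate becomes 1 = 1, which is always true:

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.storage.StorageLevel

val rdd = new JdbcRDD(
  sparkContext,
  () => DriverManager.getConnection("jdbc:..."),      // placeholder connection factory
  "SELECT * FROM tbl WHERE some_col > 10 AND ? = ?",   // dummy placeholders appended
  1, 1, 1,                                             // lower bound, upper bound, partition count
  rs => JdbcRDD.resultSetToObjectArray(rs))
  .persist(StorageLevel.MEMORY_AND_DISK)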
Your problem is that Spark expects your query string to have two ? parameters, one for the lower bound and one for the upper bound.
From Spark user list:
In order for Spark to split the JDBC query in parallel, it expects an
upper and lower bound for your input data, as well as a number of
partitions so that it can split the query across multiple tasks.
For example, depending on your data distribution, you could set an
upper and lower bound on your timestamp range, and spark should be
able to create new sub-queries to split up the data.
Another option is to load up the whole table using the HadoopInputFormat
class of your database as a NewHadoopRDD.
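For example, a minimal sketch of a properly parameterized query (the table name, column, bounds and connection string are made up), where JdbcRDD fills the two ? placeholders with the lower and upper bound of each partition's sub-query:

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

val rdd = new JdbcRDD(
  sparkContext,
  () => DriverManager.getConnection("jdbc:..."),   // placeholder connection factory
  "SELECT * FROM tbl WHERE id >= ? AND id <= ?",    // bound placeholders, one per ?
  1L, 1000000L,                                     // overall lower and upper bound of id
  10,                                               // number of partitions / sub-queries
  rs => JdbcRDD.resultSetToObjectArray(rs))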