I am using Spark with Scala and trying to get data from a database using JdbcRDD.
val rdd = new JdbcRDD(
  sparkContext,
  driverFactory,
  testQuery,
  rangeMinValue.get,
  rangeMaxValue.get,
  partitionCount,
  rowMapper
).persist(StorageLevel.MEMORY_AND_DISK)
There are no ? values to set within the query (the query is quite long, so I am not including it here), and I get an error saying:
java.sql.SQLException: Parameter index out of range (1 > number of parameters, which is 0).
I have no idea what the problem is. Can someone suggest a solution?
I ran into the same problem. I used this:
SELECT * FROM tbl WHERE ... AND ? = ?
Then call it with a lower bound of 1, an upper bound of 1 and a partition count of 1. It will always run only one partition.
Your problem is that Spark expects your query string to contain two ? parameters.
From Spark user list:
In order for Spark to split the JDBC query in parallel, it expects an
upper and lower bound for your input data, as well as a number of
partitions so that it can split the query across multiple tasks.
For example, depending on your data distribution, you could set an
upper and lower bound on your timestamp range, and spark should be
able to create new sub-queries to split up the data.
Another option is to load up the whole table using the HadoopInputFormat
class of your database as a NewHadoopRDD.
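For comparison only, the DataFrame JDBC reader (a different API from JdbcRDD) takes the split column, bounds and partition count as explicit arguments rather than ? placeholders in the query. Here is a minimal PySpark sketch in which the connection URL, table, column and bounds are purely illustrative assumptions:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-jdbc-read").getOrCreate()

df = spark.read.jdbc(
    url="jdbc:postgresql://db-host:5432/mydb",  # hypothetical connection string
    table="my_table",                           # hypothetical table name
    column="id",                                # numeric column used to split the query
    lowerBound=1,                               # plays the role of rangeMinValue
    upperBound=1000000,                         # plays the role of rangeMaxValue
    numPartitions=8,                            # plays the role of partitionCount
    properties={"user": "user", "password": "password"},
)
df.persist()  # roughly equivalent to the persist() call in the original snippet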
I'm new to Airflow and I'm currently stuck on an issue with the BigQuery operator.
I'm trying to execute a simple query on a table from a given dataset and copy the result to a new table in the same dataset. I'm using the BigQuery operator to do so, since according to the docs the destination_dataset_table parameter is supposed to do exactly what I'm looking for (source: https://airflow.apache.org/docs/stable/_api/airflow/contrib/operators/bigquery_operator/index.html#airflow.contrib.operators.bigquery_operator.BigQueryOperator).
But instead of copying the data, all I get is a new empty table with the schema of the one I'm querying from.
Here's my code:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2019, 1, 1),
    'end_date': datetime(2019, 1, 3),
    'retries': 10,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG(
    dag_id='my_dag',
    default_args=default_args,
    schedule_interval=timedelta(days=1)
)

copyData = BigQueryOperator(
    task_id='copyData',
    dag=dag,
    sql="SELECT some_columns, x, y, z FROM dataset_d.table_t WHERE some_columns=some_value",
    destination_dataset_table='dataset_d.table_u',
    bigquery_conn_id='something',
)
I don't get any warnings or errors; the code runs and the tasks are marked as successful. It does create the table I wanted, with the columns I specified, but it is completely empty.
Any idea what I'm doing wrong?
EDIT: I tried the same code on a much smaller table (from 10 GB down to a few KB), with a query producing a much smaller result (from 500 MB down to a few KB), and it did work this time. Do you think the size of the table or of the query result matters? Is it limited? Or does running too large a query cause a lag of some sort?
EDIT2: After a few more tests I can confirm that this issue is not related to the size of the query or the table. It seems to have something to do with the date format. In my code the WHERE condition is actually checking whether date_column = 'YYYY-MM-DD'. When I replace this condition with an int or string comparison it works perfectly. Do you know if BigQuery uses a particular date format or requires a particular syntax?
EDIT3: Finally getting somewhere: when I cast my date_column as a date (CAST(date_column AS DATE)) to force its type to DATE, I get an error saying that my field is actually an INT32 (argument type mismatch). But I'm SURE that this field is a date, which implies that either BigQuery stores it as an int while displaying it as a date, or the BigQuery operator does some kind of hidden type conversion while loading tables. Any idea how to fix this?
I had a similar issue when transferring data from data sources other than BigQuery.
I suggest casting the date_column to a string, as follows: to_char(date_column, 'YYYY-MM-DD') as date
In general, I have seen that BigQuery's schema auto-detection is often problematic. The safest approach is to always specify the schema before executing the corresponding query, or to use operators that support schema definition.
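For what it's worth, here is a hedged sketch of how the task might be written with standard SQL enabled and the date compared as an explicit DATE literal; the column name, the literal date, and the use_legacy_sql and write_disposition settings are my assumptions, not something from the original question or answer:
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

copyData = BigQueryOperator(
    task_id='copyData',
    dag=dag,
    sql=(
        "SELECT some_columns, x, y, z "
        "FROM `dataset_d.table_t` "
        "WHERE date_column = DATE '2019-01-01'"  # assumes the column really is a DATE
    ),
    use_legacy_sql=False,  # standard SQL understands DATE literals like DATE 'YYYY-MM-DD'
    destination_dataset_table='dataset_d.table_u',
    write_disposition='WRITE_TRUNCATE',  # overwrite the destination table on each run
    bigquery_conn_id='something',
)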
I am writing a query where batch_name is a parameter. Sometimes I get only one batch name and sometimes I get two or more batch names. How can I handle this in an Oracle BI Publisher query?
Here is my query:
SELECT * FROM pay_batch_headers pbh WHERE UPPER(pbh.batch_name) = UPPER(:p_batch_name)
Right now this query handles only one batch name; I want it to handle multiple batch names, something like:
WHERE UPPER(pbh.batch_name) IN ('Batch1','Batch2','Batch3')
But the problem with using an IN clause is that I can't predict the number of batches I will have to query. Can anyone help me with this, please?
You have two choices. One is to munge the variables together into a string and use some method, such as regexp_like():
where regexp_like(upper(pbh.batch_name), ??)
The parameter string should look like '^(abc|def|ghi|jkl)$' (the parentheses ensure the ^ and $ anchors apply to every alternative). You can make it as long as you like.
Another method is to use execute immediate: dump the values into a SQL query as a string, using IN. The advantage of this method is that it can more easily use indexes.
I would like to create a UDF named maxDate in BigQuery that does the following:
maxDate('table_name') returns the result from running the query below:
select max(table_id) from fact.__TABLES__ where table_id < 'table_name';
I'm quite new to JS and not too sure how to start. This looks like a simple thing to write. Could anyone point me in the right direction? I've read the documentation and am unsure how to write this.
Scalar UDFs do not exist yet in BigQuery.
See BigQuery User-Defined Functions to understand what they are today.
To simplify: think of today's UDF as a virtual table that you can query. That virtual table is in turn powered by a real table whose rows are processed one by one; JavaScript code is applied to each input row and generates, in place of that row, zero, one or many output rows (depending on the logic implemented in the JavaScript).
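As a client-side alternative (my own suggestion, not part of the answer above), the same lookup can be run with the google-cloud-bigquery Python library instead of a scalar UDF; a minimal sketch, assuming the dataset is named fact as in the question:
from google.cloud import bigquery

def max_date(table_name, dataset="fact"):
    # Run the metadata query client-side instead of as a scalar UDF.
    client = bigquery.Client()
    sql = (
        "SELECT MAX(table_id) AS max_id "
        "FROM `{}.__TABLES__` "
        "WHERE table_id < @table_name"
    ).format(dataset)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("table_name", "STRING", table_name)
        ]
    )
    rows = list(client.query(sql, job_config=job_config).result())
    return rows[0].max_id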
I am fully familiar with the method in the link below for performing a dynamic pivot query. Is there an alternative way to perform a dynamic pivot without storing the query as a string and inserting a column list into it?
http://www.simple-talk.com/community/blogs/andras/archive/2007/09/14/37265.aspx
Short answer: no.
Long answer:
Well, it's still no, but I will try to explain why. As of today, when you run a query, the DB engine demands to know the structure of the result set (number of columns, column names, data types, etc.) that the query will return. Therefore, you have to define the structure of the result set when you ask the DB for data. Think about it: have you ever run a query where you did not know the result set structure beforehand?
That applies even when you do select *, which is just syntactic sugar. In the end, the returned structure is "all columns in the table(s) involved".
By assembling a string, you dynamically generate the structure that you desire, before asking for the result set. That's why it works.
Finally, you should be aware that assembling the string dynamically can, in theory, give you a result set with an arbitrarily large number of columns. Of course, at some point that will fail, but I'm sure you understand the implications.
Update
I found this, which reinforces the reasons why it does not work.
Here:
SSIS relies on knowing the metadata of the dataflow in advance and a
dynamic pivot (which is what you are after) is not compatible with
that.
I'll keep looking and adding here.
I am encountering an issue in Redshift where calling a UDF more than once in a select statement (on different columns) returns the same result as the first call to that UDF.
Bit of Background
I have a very simple Python UDF that calculates an md5 hash. The reason for this function is to be able to handle UTF-16/UTF-8 conversion before doing the hash so it is consistent with SQL server. Now the syntax or logic inside the function does not seem to be the issue as we have tried creating even simpler functions that produce the same behavior.
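For reference, a Redshift Python UDF along these lines might look roughly like the sketch below; the function signature, the UTF-16LE choice and the IMMUTABLE marker are assumptions based on the description above, not the actual code:
CREATE OR REPLACE FUNCTION md5_utf16(value VARCHAR(MAX))
RETURNS VARCHAR(32)
IMMUTABLE
AS $$
    import hashlib
    if value is None:
        return None
    # Encode as UTF-16LE so the digest matches hashes computed over NVARCHAR
    # data in SQL Server, mirroring the conversion described above.
    return hashlib.md5(value.encode('utf-16-le')).hexdigest().upper()
$$ LANGUAGE plpythonu;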
The Problem
My function is named MD5_UTF16, is called as MD5_UTF16(yourvalue), and returns a hash string (hex digest) of the value you pass in.
In my query I need to be able to do this (PostgreSQL syntax):
SELECT MD5_UTF16(column1) || MD5_UTF16(column2) || MD5_UTF16(column3) AS concatenatedhash
FROM MyTable
i.e. I need to calculate each hash and concatenate the results into a single column. If I calculate each of those hashes separately in its own column, the function generates the correct hashes. However, in my example above I have called the function several times and concatenated the results of those calls. In this scenario, all the calls to the function return the hash from the first call, i.e. MD5_UTF16(column1).
To clarify a bit further using example hash values. Let's pretend these are the hashes for each of the columns above:
Column 1: 275AB169CBEE4550F752C634B9335AE0
Column 2: B2214041A94F50B027FE1DEEC4C8474C
Column 3: 91050DAEFFEE20CDA2FC9914B6E4EBE9
My expected result for the concatenatedhash column would be a simple concatenation of the strings above (275AB169CBEE4550F752C634B9335AE0B2214041A94F50B027FE1DEEC4C8474C91050DAEFFEE20CDA2FC9914B6E4EBE9)
Instead, what I am getting is a concatenation of column 1's hash 3 times:
(275AB169CBEE4550F752C634B9335AE0275AB169CBEE4550F752C634B9335AE0275AB169CBEE4550F752C634B9335AE0)
In my SELECT statement if I had called the function on column 2 (instead of column 1) first, then it would be the hash for column 2 that is repeated.
Has anyone encountered this before?
NOTE: You can only replicate this behavior if you are selecting data out of a table. So doing a:
SELECT MD5_UTF16('hard-coded value 1') || MD5_UTF16('hard-coded value 2')
with no table source will not replicate this behavior.
Workarounds I am aware of
I do know of a possible workaround, but I still would have expected my method above to work, so this question is not about applying the following workaround; it is about understanding why the method above does not work.
- Workaround: calculate each hash in a separate column first, then concatenate them. This has potential performance implications on our end, among other things.
EDIT 1
I have found that the issue I've described only happens when there is a join in my query, even if none of the column data from the joined table is used in my UDF calls, i.e.
SELECT ...concatenated hashes..
FROM table1
JOIN table2 ...
Removing the join seems to result in the hashes being calculated correctly. I will attempt a workaround using this new knowledge. I am not sure whether it has anything to do with the execution plan running the UDFs differently when a join is involved, even though none of the column data from the joined table is used in the UDF calls.