As part of my project requirements I have come up with some complex logic, a small section of which is the following:
'regexp_extract(split_part(vw_cart.order_detail,':',2), '[0-9]+', 0)'
It works like a charm in Impala but fails in Hive.
I am trying to find something similar to SPLIT_PART so that my code can run in Hive.
Any guidance will be helpful.
In Hive, the split() function returns an array, and array elements are numbered starting from 0.
In Impala, split_part(vw_cart.order_detail, ':', 2) returns the second element of the delimited string, with elements numbered from 1.
So, in Hive it will be:
regexp_extract(split(vw_cart.order_detail,':')[1], '[0-9]+', 0)
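For example, assuming vw_cart.order_detail holds a value like 'SKU:42A:rest' (a hypothetical sample, not from the original question), the Hive expression returns '42':
-- hypothetical sample: order_detail = 'SKU:42A:rest'
-- split(order_detail, ':')[1] = '42A', and regexp_extract('42A', '[0-9]+', 0) = '42'
SELECT regexp_extract(split(vw_cart.order_detail, ':')[1], '[0-9]+', 0) AS extracted_number
FROM vw_cart;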
I am using AWS Athena to run a query against my data set to combine values from different columns found in various data sets (e.g. there is a Parquet file for each of clients 1-4). However, the output is simply empty for "all_clients_total_clicks". The strange thing is that similar code on another table is working - just not for the one I'm currently working on.
Can someone please help me confirm whether my syntax is acceptable? Or point me in the right direction/documentation for review? SQL below:
SELECT "columnA",
sum("columnX") AS "TotalImpressions",
cast(sum("client1_column_total_clicks") AS double)
+ cast(sum("client2_column_total_clicks") AS double)
+ cast(sum("client3_column_total_clicks") AS double)
+ cast(sum("client4_column_total_clickss") AS double) AS "all_clients_total_clicks"
FROM "db_name"."db_table"
Group by "columnA"
The issue stems from trying to add NULL values. Using TRY + COALESCE resolved it for me.
Presto DB Documentation for Conditionals
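Something along these lines should work (a sketch based on the query above; in Presto/Athena, TRY returns NULL instead of raising an error, and COALESCE substitutes 0 for NULL, so the addition no longer yields NULL):
SELECT "columnA",
sum("columnX") AS "TotalImpressions",
coalesce(try(cast(sum("client1_column_total_clicks") AS double)), 0)
+ coalesce(try(cast(sum("client2_column_total_clicks") AS double)), 0)
+ coalesce(try(cast(sum("client3_column_total_clicks") AS double)), 0)
+ coalesce(try(cast(sum("client4_column_total_clicks") AS double)), 0) AS "all_clients_total_clicks"
FROM "db_name"."db_table"
GROUP BY "columnA"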
I would like to create a UDF named maxDate in BigQuery that does the following:
maxDate('table_name') returns the result from running the query below:
select max(table_id) from fact.__TABLES__ where table_id < 'table_name';
I'm quite new to JS and not too sure how to start. This looks like a simple thing to write. Could anyone point me in the right direction? I've read the documentation and am unsure of how to write this.
Scalar UDFs do not exist yet in BigQuery.
See more about BigQuery User-Defined Functions to understand what they are today.
To simplify: think of today's UDF as a virtual table that you can query. That table is in turn powered by a real table whose rows are processed one by one; the JavaScript code is applied to each input row and generates (in place of that row) zero, one, or many rows, depending on the logic implemented in JS.
Let's say I need to define a function with a behavior like UPPER(string); we can call it FIRSTCHAR(string), and it gets the first character of a string.
So I would like to write SQL like:
SELECT FIRSTCHAR(middle_name) AS middle_name_first_char,
FIRSTCHAR(last_name) AS last_name_first_char FROM clients
From reading the BigQuery UDF documentation it is not clear how to make such a function that works over a string, across any table or column. It looks like defining a function with bigquery.defineFunction() requires an input column names argument.
As far as I know, scalar-type UDFs are not available yet in BigQuery. Current UDFs are table-wise only: you supply a table to the UDF and the UDF processes it row by row, outputting 0, 1, or many rows (depending on your implemented function) for each input row.
I remember one of the Google team members mentioning that they are working on making scalar UDFs available at some point.
I assume the simplified example in your question is just there to demonstrate the point, so I am not providing an actual solution for it (it is a super simple use of string functions).
2016-08-11 UPDATE
Scalar UDFs are now supported in BigQuery Standard SQL.
See the examples below.
JS UDF
CREATE TEMPORARY FUNCTION FIRSTCHAR(word STRING)
RETURNS STRING
LANGUAGE js
AS "return word.substring(0, 1);";
SELECT
FIRSTCHAR(middle_name) AS middle_name_first_char,
FIRSTCHAR(last_name) AS last_name_first_char
FROM clients
SQL UDF
CREATE TEMPORARY FUNCTION FIRSTCHAR(word STRING)
RETURNS STRING
AS (SUBSTR(word, 0, 1));
SELECT
FIRSTCHAR(middle_name) AS middle_name_first_char,
FIRSTCHAR(last_name) AS last_name_first_char
FROM clients
I'm using BigQuery with a dataset called '87891428' containing daily tables. I am trying to query a date range using the TABLE_DATE_RANGE function:
SELECT avg(foo)
FROM (
TABLE_DATE_RANGE(87891428.a_abc_,
TIMESTAMP('2014-09-30'),
TIMESTAMP('2014-10-19'))
)
But this leads to a very explicit error message:
Error: Encountered "" at line 3, column 21. Was expecting one of:
I have the feeling that TABLE_DATE_RANGE doesn't like a dataset name starting with a number, because when I copy a few tables into a new dataset called 'test' the query runs properly. Has anyone already encountered this issue, and if so, what is the best workaround (as far as I know you can't rename a dataset)?
The fix for this is to use brackets around the dataset name and table prefix:
SELECT avg(foo)
FROM (
TABLE_DATE_RANGE([87891428.a_abc_],
TIMESTAMP('2014-09-30'),
TIMESTAMP('2014-10-19'))
)
I am using Spark with Scala and trying to get data from a database using JdbcRDD.
val rdd = new JdbcRDD(sparkContext,
driverFactory,
testQuery,
rangeMinValue.get,
rangeMaxValue.get,
partitionCount,
rowMapper)
.persist(StorageLevel.MEMORY_AND_DISK)
Within the query there are no ? values to set (since the query is quite long, I am not putting it here). So I get an error saying:
java.sql.SQLException: Parameter index out of range (1 > number of parameters, which is 0).
I have no idea what the problem is. Can someone suggest any kind of solution?
Got the same problem.
Used this:
SELECT * FROM tbl WHERE ... AND ? = ?
And then call it with a lower bound of 1, an upper bound of 1, and 1 partition.
It will always run only one partition.
Your problem is that Spark expects your query string to have a couple of ? parameters.
From Spark user list:
In order for Spark to split the JDBC query in parallel, it expects an
upper and lower bound for your input data, as well as a number of
partitions so that it can split the query across multiple tasks.
For example, depending on your data distribution, you could set an
upper and lower bound on your timestamp range, and spark should be
able to create new sub-queries to split up the data.
Another option is to load up the whole table using the HadoopInputFormat
class of your database as a NewHadoopRDD.
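To illustrate, a query string shaped roughly like this (hypothetical table and column names) gives JdbcRDD the two ? placeholders it expects; each partition is then executed with its own lower and upper bound substituted in:
-- JdbcRDD binds the partition's lower bound to the first ? and its upper bound to the second ?
SELECT * FROM orders WHERE order_id >= ? AND order_id <= ?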