Discrepancy in String matching between Teradata and HIVE

I am new to Hive and still learning it. I have a customer table in Teradata and used Sqoop to import the complete table into Hive, which worked fine.
The customer table looks like this in both Teradata and Hive.
In Teradata :
select TOP 4 id,name,'"'||status||'"' from customer;
3172460 Customer#003172460 "BUILDING "
3017726 Customer#003017726 "BUILDING "
2817987 Customer#002817987 "COMPLETE "
2817984 Customer#002817984 "BUILDING "
In HIVE :
select id,name,CONCAT ('"' , status , '"') from customer LIMIT 4;
3172460 Customer#003172460 "BUILDING "
3017726 Customer#003017726 "BUILDING "
2817987 Customer#002817987 "COMPLETE "
2817984 Customer#002817984 "BUILDING "
When I fetch records from the customer table by filtering on a column of String type, I get different results for the same query in the two environments.
See the query results below.
In Teradata :
select TOP 2 id,name,'"'||status||'"' from customer WHERE status = 'BUILDING';
3172460 Customer#003172460 "BUILDING "
3017726 Customer#003017726 "BUILDING "
In HIVE :
select id,name,CONCAT ('"' , status , '"') from customer WHERE status = 'BUILDING' LIMIT 2;
**<<No Result>>**
It seems that Teradata does some kind of trimming before comparing string values, whereas Hive compares the strings exactly as they are.
I am not sure whether this is expected behaviour, a bug, or something that could be raised as an enhancement.
I see one possible solution:
* Convert the condition into a LIKE expression with a wildcard character before and after the value
How can this be handled in Hive? Looking forward to your response.

You could use the rtrim function, i.e.:
select id,name,CONCAT ('"' , status , '"') from customer WHERE rtrim(status) = 'BUILDING' LIMIT 2;
But this raises the question: which standard does Hive follow for string comparison? According to ANSI/ISO SQL-92, 'BUILDING' = 'BUILDING '. Here is a link to an article about it.

Answering my own question: I got this from the hive-user mailing list.
Hive is not fully SQL compliant.
In Hive, string comparisons work just like they would in Java,
so in Hive:
"BUILDING" = "BUILDING"
"BUILDING " != "BUILDING" (extra space added)

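The two comparison rules discussed above can be sketched in a few lines of Python. This is only an illustration, not Hive or Teradata code: `pad_space_equal` is a hypothetical helper that mimics the SQL-92 style rule of padding the shorter operand with spaces, and `exact_equal` mimics the Java-like exact comparison.

```python
def pad_space_equal(left: str, right: str) -> bool:
    """Compare after padding the shorter string with trailing spaces
    (the SQL-92 style rule the first answer refers to)."""
    width = max(len(left), len(right))
    return left.ljust(width) == right.ljust(width)


def exact_equal(left: str, right: str) -> bool:
    """Compare character by character, as Java's String.equals does."""
    return left == right


stored = "BUILDING "   # value as stored in the status column (trailing space)
literal = "BUILDING"   # literal used in the WHERE clause

print(pad_space_equal(stored, literal))       # padded comparison
print(exact_equal(stored, literal))           # exact comparison
print(exact_equal(stored.rstrip(), literal))  # exact comparison after rtrim
```

Under the exact rule the trailing space makes the values differ, which is why wrapping the column in rtrim restores the match.
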
Related

Equivalent function in HANA DB for json_object - case of Subqueries

Following on from Raghav's question, "Equivalent function in HANA DB for json_object", and the great answers from vivekanandasr and astentx, my follow-up question is:
How do I use "FOR JSON" with SAP HANA 2.0 when there is a header table and a line table?
Problem: when a nested query or a scalar subquery is used with "FOR JSON" in SAP HANA, the query engine treats the result of the subquery as text.
For example, with a table F for invoice headers and a table F1 for invoice lines:
SELECT F."DocNum" , F."DocDate",
(SELECT "ItemCode", "Quantity"
FROM INV1 F1 WHERE F."DocEntry" = F1."DocEntry" FOR JSON )
AS "DocumentLines"
FROM OINV F
WHERE F."DocEntry" = 17216
FOR JSON ( 'arraywrap'='no');
is supposed to return something like :
{
"DocDate":"2022-04-30"
,"DocNum":117121
,"DocumentLines":
[{"ItemCode":"ITEM1", "Quantity":10}
,{"ItemCode":"ITEM2","Quantity":20}]
}
SAP Hana returns :
{
"DocDate":"2022-04-30"
,"DocNum":117121
,"DocumentLines":"[{\"ItemCode\":\"ITEM1\",\"Quantity\":10}, {\"ItemCode\":\"ITEM2\",\"Quantity\":20}]"
}
There are two consequences:
the " sign is escaped as \"
the array is treated as text, so the brackets are wrapped in "
... and the resulting JSON is not correct.
The equivalent on MSSQL seems to work with "FOR JSON AUTO", but I don't know how to make it work with SAP HANA.
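
The symptom described above is a double encoding: the inner array is serialized to text first and then embedded as a JSON string. The Python sketch below reproduces that effect and shows one possible client-side workaround, assuming the result can be post-processed outside the database. The helper name `unwrap_nested_json` is hypothetical, and this is not a HANA-side fix.

```python
import json

lines = [
    {"ItemCode": "ITEM1", "Quantity": 10},
    {"ItemCode": "ITEM2", "Quantity": 20},
]

# Reproduce the reported output: the nested array is embedded as a string.
double_encoded = json.dumps(
    {"DocDate": "2022-04-30", "DocNum": 117121, "DocumentLines": json.dumps(lines)}
)


def unwrap_nested_json(text: str, keys: list) -> dict:
    """Parse the outer document and decode the named string fields as JSON."""
    doc = json.loads(text)
    for key in keys:
        if isinstance(doc.get(key), str):
            doc[key] = json.loads(doc[key])
    return doc


fixed = unwrap_nested_json(double_encoded, ["DocumentLines"])
print(json.dumps(fixed))
```
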

How to use DB2 version 10 SQL TO generate a query into JSON output?

I'm using DB2 version 10.
I'm trying to write a SQL query that pulls certain data values from tables.
Within the SQL, I concatenate JSON text before and after the data values so that the final output is JSON.
SCRIPT/QUERY:
SELECT
'","truck": {"number": "' || LS.LS_POWER_UNIT ||
'","type": "TR","vinNumber": "' || P.VIN ||
'","licensePlates": [{"number": "' || P.LIC_1 ||
'","stateProvince": "' || P.LIC_1_PRST
TRUCK
RESULT:
","truck": {"number": "1234","type": "TR","vinNumber":
"123456VINNUMBER",""licensePlates": [{"number":
"ON1234","stateProvince": "ON"}]
Please note that this is just a sample from my full code; some syntax is probably missing here, but it is complete in the rest of my code.
While researching, I found that other DB2 versions have a JSON_OBJECT function, but mine does not. Could someone fluent in DB2 10 help me achieve something similar to JSON_OBJECT, as in the following example from other DB2 versions?
select json_object ('id' value id,
'name' value last_name,
'office' value office_number)
from empdata;
RESULT:
{"id":901,"name":"Doe","office":"E-334"}
{"id":902,"name":"Pan","office":"E-216"}
{"id":903,"name":"Jones","office":"E-739"}
{"id":904,"name":"Smith","office":null}

JSON_OBJECT only exists on Db2 for i. By "DB2 10" I assume you mean DB2 10 for z/OS, rather than, say, DB2 10.1 or 10.5 for LUW.
So for DB2 for z/OS, maybe start here: https://www.ibm.com/developerworks/data/library/techarticle/dm-1403xmljson/index.html
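
If the JSON can be assembled in the application rather than in SQL, a serializer avoids the quoting problems of hand concatenation. The sketch below is an assumption about such a setup, not a DB2 feature: `row_to_json` is a hypothetical helper, and the rows are the sample values from the question, standing in for a result set fetched with plain SELECT id, last_name, office_number.

```python
import json


def row_to_json(row: tuple) -> str:
    """Serialize one (id, last_name, office_number) row as a compact JSON object."""
    record = {"id": row[0], "name": row[1], "office": row[2]}
    # separators=(",", ":") drops the spaces json.dumps inserts by default
    return json.dumps(record, separators=(",", ":"))


rows = [
    (901, "Doe", "E-334"),
    (904, "Smith", None),       # SQL NULL arrives as None and is written as null
    (905, 'O"Neil', "E-100"),   # hypothetical value with an embedded quote
]

for row in rows:
    print(row_to_json(row))
```

The serializer escapes embedded quotes and writes null for missing values, two cases that string concatenation with || gets wrong unless they are handled explicitly.
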

Google BigQuery: TABLE_QUERY AND TABLE_DATE_RANGE

I have BigQuery tables like the ones below, and I would like to query the tables marked <=.
prefix_AAAAAAA_20170320
prefix_AAAAAAA_20170321
prefix_AAAAAAA_20170322 <=
prefix_AAAAAAA_20170323 <=
prefix_AAAAAAA_20170324 <=
prefix_AAAAAAA_20170325
prefix_BBBBBBB_20170320
prefix_BBBBBBB_20170321
prefix_BBBBBBB_20170322 <=
prefix_BBBBBBB_20170323 <=
prefix_BBBBBBB_20170324 <=
prefix_BBBBBBB_20170325
prefix_CCCCCCC_20170320
prefix_CCCCCCC_20170321
prefix_CCCCCCC_20170322
prefix_CCCCCCC_20170323
prefix_CCCCCCC_20170324
prefix_CCCCCCC_20170325
I wrote this query:
SELECT * FROM
(TABLE_QUERY(mydataset,
'table_id CONTAINS "prefix" AND
(table_id CONTAINS "AAAAAA" OR table_id CONTAINS "BBBBBB")' )
AND
TABLE_DATE_RANGE(mydataset.prefix, TIMESTAMP('2017-03-22'), TIMESTAMP('2017-03-24')))
I got this error:
Error: Encountered " "AND" "AND "" at line 5, column 4. Was expecting: ")" ...
Does anybody have any ideas?

You cannot mix TABLE_QUERY and TABLE_DATE_RANGE in the very same FROM clause!
Try something like the query below:
#legacySQL
SELECT *
FROM (TABLE_QUERY(mydataset, 'REGEXP_MATCH(table_id, "prefix_[AB]{7}_2017032[234]")'))
Consider migrating to BigQuery Standard SQL.
In that case you can query multiple tables using a wildcard table.
See "How to Migrate from TABLE_QUERY() to _TABLE_SUFFIX".
I think in this case your query could look like this:
#standardSQL
SELECT *
FROM `mydataset.prefix_*`
WHERE REGEXP_CONTAINS(_TABLE_SUFFIX, '[AB]{7}_2017032[234]')
I cannot migrate to Standard SQL because ...
If I want to search, for example, between 2017-03-29 and 2017-04-02, do you have a smart SQL approach?
Try the version below:
#legacySQL
SELECT *
FROM (TABLE_QUERY(mydataset,
'REGEXP_MATCH(table_id, r"prefix_[AB]{7}_(\d){8}") AND
RIGHT(table_id, 8) BETWEEN "20170329" AND "20170402"'))
Of course, you can adjust the above to apply whatever exact logic you need!
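
The selection logic of the last query (a regular expression on the table name plus a string range check on the last 8 characters) can be tried locally before running it. This Python sketch only mirrors that filter and is not BigQuery code. `select_tables` is a hypothetical helper, and `re.search` is used on the assumption that REGEXP_MATCH does partial matching.

```python
import re

# Same pattern as in the legacy SQL query above
PATTERN = re.compile(r"prefix_[AB]{7}_(\d){8}")


def select_tables(table_ids, start, end):
    """Keep tables whose name matches the pattern and whose date suffix is in range.

    Dates in YYYYMMDD form sort correctly as strings, so plain string
    comparison plays the role of BETWEEN.
    """
    return [t for t in table_ids if PATTERN.search(t) and start <= t[-8:] <= end]


# The table names from the question: three groups, six days each
table_ids = [
    f"prefix_{letter * 7}_201703{day}"
    for letter in "ABC"
    for day in range(20, 26)
]

for name in select_tables(table_ids, "20170322", "20170324"):
    print(name)
```
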

Query Rewrite Template in Oracle SQL

I have an Oracle SQL query as below:
select id from docs where CONTAINS (text,
'<query>
<textquery lang="ENGLISH" grammar="CONTEXT"> Informix
<progression>
<seq><rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite></seq>
</progression>
</textquery>
</query>')>0;
The query works as expected. But I also want to search for the words Inform / Infor / Info, so I altered the query as below:
select id from docs where CONTAINS (text,
'<query>
<textquery lang="ENGLISH" grammar="CONTEXT"> Informix
<progression>
<seq><rewrite>transform((TOKENS, "?{", "}", "AND"))</rewrite></seq>
</progression>
</textquery>
</query>')>0;
That is, I added an extra "?" in the transform function. But this looks for informix / informi / inform / infor / info / inf / in. I want to restrict the search to a minimum of 4 characters, i.e. down to info. How can this be achieved?
Thanks.

To find all documents that contain at least one occurrence of any of the terms between informix and info, use the OR operator
and list all your allowed terms in the template:
<query>
<textquery lang="ENGLISH" grammar="CONTEXT"> informix informi inform infor info
<progression>
<seq><rewrite>transform((TOKENS, "{", "}", "OR"))</rewrite></seq>
</progression>
</textquery>
</query>
But using a template is not really meaningful here.
You get the same result with a direct query:
select score(1), id from docs
where contains(text,'informix OR informi OR inform OR infor OR info',1) > 0
order by 1 desc;
The advantage of this approach is that you can control the score, preferring documents that contain the longer strings by giving those terms higher weights:
select score(1), id from docs
where contains(text,'informix*5 OR informi*4 OR inform*3 OR infor*2 OR info',1) > 0
order by 1 desc;
Btw, the ? (fuzzy) operator is, IMO, meant for finding misspelled words, not the exact prefixes of a term.
UPDATE
You can assemble the concatenation of the prefixes in PL/SQL or, if necessary, in SQL as follows:
with txt as (
select 'informix' text from dual),
txt2 as (
select
substr(text,1,length(text) -rownum+1) text
from txt connect by level <= length(text) -3
)
select
LISTAGG( text, ', ') WITHIN GROUP (ORDER BY text desc)
from txt2
.
informix, informi, inform, infor, info
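
If the query string is built in application code instead, the same prefix list and the weighted OR expression can be generated as sketched below. This is a hypothetical client-side helper, not Oracle Text functionality. The weighting (longest term gets the highest weight, the shortest gets none) simply follows the example query above.

```python
def prefixes(term: str, min_len: int) -> list:
    """Return the term and its prefixes, longest first, down to min_len characters."""
    return [term[:n] for n in range(len(term), min_len - 1, -1)]


def weighted_or_query(term: str, min_len: int) -> str:
    """Join the prefixes with OR, giving longer prefixes higher weights."""
    parts = prefixes(term, min_len)
    weighted = []
    for position, part in enumerate(parts):
        weight = len(parts) - position
        # The shortest prefix keeps the default weight, as in the query above
        weighted.append(f"{part}*{weight}" if weight > 1 else part)
    return " OR ".join(weighted)


print(", ".join(prefixes("informix", 4)))
print(weighted_or_query("informix", 4))
```
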

Google BigQuery wouldn't run when I include 'WHERE' & 'GROUP BY' clause to my query

Whenever I run a plain (SELECT, FROM, LIMIT) query without aggregation on Google BigQuery, a result table is displayed (it works). But whenever I add any other clause, such as WHERE or GROUP BY, an error is displayed.
For example:
SELECT
cigarette_use,
AVG(weight_pounds) baby_weight,
AVG(mother_age) mother_age,
STDDEV( weight_pounds) baby_weight_stdev,
FROM
[publicdata:samples.natality]
LIMIT
1000
WHERE
year=2003
AND state='OH'
GROUP BY
cigarette_use;
For the query above, this error was displayed:
"Error: Encountered " "WHERE" "WHERE "" at line 10, column 1. Was expecting: <EOF>
Job ID: decent-courage-101120:job_Ts2AJAeI8SijokiKCnV5joh5VQg"
And when I removed the WHERE clause from the query, i.e.
WHERE
year=2003
AND state='OH'
this error was displayed:
"Error: Encountered " "GROUP" "GROUP "" at line 10, column 1. Was expecting: <EOF>
Job ID: decent-courage-101120:job_Hq_Ux9x-pBGwcwaG7wJ8KlthUys"
Can someone please tell me what I'm doing wrong, and what I can do to run simple queries like the one above on Google BigQuery without errors?
Thank you.

You need to put LIMIT at the very end of your query:
SELECT cigarette_use,
AVG(weight_pounds) baby_weight,
AVG(mother_age) mother_age,
STDDEV(weight_pounds) baby_weight_stdev,
FROM [publicdata:samples.natality]
WHERE YEAR=2003
AND STATE='OH'
GROUP BY cigarette_use
LIMIT 1000;
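
The clause-order rule can be reproduced locally with Python's built-in sqlite3 module. SQLite is only a stand-in here, not BigQuery, and the small natality table with its rows is made up for the demonstration. The point carries over: the parser rejects LIMIT before WHERE and accepts it at the end.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE natality (cigarette_use INTEGER, weight_pounds REAL, year INTEGER, state TEXT)"
)
conn.executemany(
    "INSERT INTO natality VALUES (?, ?, ?, ?)",
    [(1, 7.0, 2003, "OH"), (1, 8.0, 2003, "OH"), (0, 7.5, 2003, "OH"), (0, 6.0, 2002, "OH")],
)


def run(sql: str):
    """Return the rows, or None if the statement is rejected as a syntax error."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError:
        return None


# LIMIT placed before WHERE, as in the question
bad = (
    "SELECT cigarette_use, AVG(weight_pounds) FROM natality "
    "LIMIT 1000 WHERE year = 2003 AND state = 'OH' GROUP BY cigarette_use"
)
# LIMIT moved to the very end, as in the answer
good = (
    "SELECT cigarette_use, AVG(weight_pounds) FROM natality "
    "WHERE year = 2003 AND state = 'OH' GROUP BY cigarette_use LIMIT 1000"
)

print(run(bad))
print(run(good))
```
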
