Do a startsWith in Hive HQL (with substring?)

I'm trying to do a kind of Spark "startsWith" in Hive. From what I've read, the way to do this is with a substring.
I have a string; if it starts with UTC8, I have to add the prefix UTC8-Min8 to an existing column, and if it starts with PMM1, I have to add the prefix NTC2-Min8 instead.

I think if you remove the double quotes it should work. You have a few syntax errors though. Could you please try the code below?
SELECT
    id, sum, address,
    CASE
        WHEN substring(trim(prd_ex), 1, 4) = 'UTC8' THEN CONCAT('UTC8-Min8', column_exe)
        WHEN substring(trim(prd_ex), 1, 4) = 'PMM1' THEN CONCAT('NTC2-Min8', column_exe)
    END AS col_type
FROM Table1;
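Note that without an ELSE branch the CASE returns NULL for rows that match neither prefix. As a sketch, the same starts-with test can also be written with LIKE, with an ELSE that falls back to the unmodified column (assuming that is the desired behaviour):
SELECT
    id, sum, address,
    CASE
        WHEN trim(prd_ex) LIKE 'UTC8%' THEN CONCAT('UTC8-Min8', column_exe)
        WHEN trim(prd_ex) LIKE 'PMM1%' THEN CONCAT('NTC2-Min8', column_exe)
        ELSE column_exe -- rows matching neither prefix keep the original value
    END AS col_type
FROM Table1;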

Related

Invalid digits on Redshift

I'm trying to load some data from a staging environment to a relational environment, and something is happening that I can't figure out.
I'm trying to run the following query:
SELECT
CAST(SPLIT_PART(some_field,'_',2) AS BIGINT) cmt_par
FROM
public.some_table;
some_field is a column whose data is two numbers joined by an underscore, like this:
some_field -> 38972691802309_48937927428392
And I'm trying to get the second part.
That said, here is the error I'm getting:
[Amazon](500310) Invalid operation: Invalid digit, Value '1', Pos 0,
Type: Long
Details:
-----------------------------------------------
error: Invalid digit, Value '1', Pos 0, Type: Long
code: 1207
context:
query: 1097254
location: :0
process: query0_99 [pid=0]
-----------------------------------------------;
Execution time: 2.61s
Statement 1 of 1 finished
1 statement failed.
It's literally saying some numbers are not valid digits. I've already inspected the exact data that is throwing the error, and it appears to be a normal field, just as I was expecting. It happens even if I throw out NULL fields.
I thought it might be an encoding error, but I haven't found any references to support that.
Does anyone have any idea?
Thanks, everybody.
I just ran into this problem and did some digging. Seems like the error Value '1' is the misleading part, and the problem is actually that these fields are just not valid as numeric.
In my case they were empty strings. I found the solution to my problem in this blog post, which is essentially to find any fields that aren't numeric and fill them with null before casting.
select cast(colname as integer)
from (
    select
        case
            when colname ~ '^[0-9]+$' then colname
            else null
        end as colname
    from tablename
) t; -- Redshift requires an alias on the derived table
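Applied to the query from the question, the same guard would look something like this (a sketch: validate the split-out value with the regex, then cast):
SELECT
    CAST(
        CASE
            WHEN SPLIT_PART(some_field, '_', 2) ~ '^[0-9]+$'
                THEN SPLIT_PART(some_field, '_', 2)
        END AS BIGINT
    ) AS cmt_par -- non-numeric values become NULL instead of erroring
FROM public.some_table;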
Bottom line: this Redshift error is completely confusing and really needs to be fixed.
When you are using a Glue job to upsert data from any data source to Redshift:
Glue will rearrange the data and then copy it, which can cause this issue. This happened to me even after using ApplyMapping.
In my case, the data types were not an issue at all; in the source they were typecast to exactly match the fields in Redshift. Glue was rearranging the columns into alphabetical order of the column names and then copying the data into the Redshift table (which will obviously throw an error, because my first column is an ID key, not a string column like the others).
To fix the issue, I used a SQL query within Glue to run a SELECT with the correct order of the columns in the table.
It's weird why Glue did that even after using ApplyMapping, but the workaround I used helped.
For example: the source table has fields ID|EMAIL|NAME with values 1|abcd@gmail.com|abcd, and the target table also has fields ID|EMAIL|NAME. But when Glue upserts the data, it rearranges the columns by name before writing, so it tries to write abcd@gmail.com|1|abcd into ID|EMAIL|NAME. This throws an error because ID expects an int value and EMAIL expects a string. I used a SQL query transform with the query "SELECT ID, EMAIL, NAME FROM data" to rearrange the columns before writing the data.
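In Glue terms the transform body is just plain SQL over the incoming node; a minimal sketch, assuming the incoming data is exposed under the alias data as in the answer above:
-- re-order the columns explicitly to match the target Redshift table
-- before the write step, instead of relying on Glue's column order
SELECT ID, EMAIL, NAME
FROM data;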
Hmmm. I would start by investigating the problem. Are there any non-digit characters?
SELECT some_field
FROM public.some_table
WHERE SPLIT_PART(some_field, '_', 2) ~ '[^0-9]';
Is the value too long for a bigint?
SELECT some_field
FROM public.some_table
WHERE LEN(SPLIT_PART(some_field, '_', 2)) > 18;
A bigint holds at most 19 digits (its maximum is 9223372036854775807), so if you need more digits of precision, consider a decimal rather than a bigint.
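For example, a sketch of the original cast widened to a decimal (Redshift's DECIMAL type supports up to 38 digits of precision):
SELECT
    CAST(SPLIT_PART(some_field, '_', 2) AS DECIMAL(38, 0)) AS cmt_par
FROM public.some_table;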
If you get an error message like "Invalid digit, Value 'O', Pos 0, Type: Integer", try executing your COPY command without the header row. Use the IGNOREHEADER parameter in your COPY command to skip the first line of the data file.
The COPY command will then look like this:
COPY orders FROM 's3://sourcedatainorig/order.txt' credentials 'aws_access_key_id=<your access key id>;aws_secret_access_key=<your secret key>' delimiter '\t' IGNOREHEADER 1;
For my Redshift SQL, I had to wrap my columns with Cast(col As Datatype) to make this error go away.
For example, casting both columns to a Char of a specific length worked:
Cast(COLUMN1 As Char(xx)) = Cast(COLUMN2 As Char(xxx))

Why does my update query to replace string not work?

I have an Access table where I have transaction IDs in the below format:
Transaction_ID
39296165-1
39296165-2
39296165-3
39284029-1
39284029-2
I am trying to write a query that finds the dash and removes the -1, -2, -3, etc., so I can then de-duplicate based on the string before the dash.
I've written the below:
UPDATE mytable
SET Transaction_ID = Left(Transaction_ID, InStr(1, Transaction_ID, "-") - 1)
This works fine; however, when it comes across a Transaction_ID that doesn't have a dash in the string, it gives me a type conversion error and replaces the string with a blank value.
Any advice on error-trapping this?
Add a WHERE clause so you only update rows where InStr actually finds a dash (note that InStr returns 0, not -1, when the substring is missing):
WHERE InStr(1,Transaction_ID,"-") > 0
This would also work and would be more efficient.
WHERE Transaction_ID LIKE "*-*"
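Either way, the complete statement would look something like this (a sketch combining the question's UPDATE with the first WHERE clause):
UPDATE mytable
SET Transaction_ID = Left(Transaction_ID, InStr(1, Transaction_ID, "-") - 1)
WHERE InStr(1, Transaction_ID, "-") > 0;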

SQLite text data type comparison not working

I have a table in an SQLite database with a text column, but when I do a comparison it's not working if I do it this way:
select * from scanned_dbs where db = 'cdd_db';
But if I change the query to:
select * from scanned_dbs where db like 'cdd_db';
It works. But as valex pointed out, it will also match cddAdb, cddBdb, and so on, so this is not the right way.
One more method I found that works is this:
select * from scanned_dbs where cast(db as varchar) = 'cdd_db';
So can anyone tell me why this works and not the first one, which is a direct comparison?
Because an underscore ("_") in a LIKE pattern matches any single character in the string.
So when you use db = 'cdd_db', the only db value that matches is exactly 'cdd_db'. But when you use the LIKE operator, db like 'cdd_db', the "_" symbol is a wildcard, so db values like cddAdb, cddBdb, cddcdb, cddddb, cdd1db, ... all match.
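A quick illustration, assuming the scanned_dbs table from the question; note that SQLite's ESCAPE clause is the way to make LIKE match a literal underscore:
-- '=' compares the whole string exactly: only 'cdd_db' matches
select * from scanned_dbs where db = 'cdd_db';
-- with LIKE, '_' matches any single character, so 'cddAdb' etc. also match
select * from scanned_dbs where db like 'cdd_db';
-- escape the underscore so LIKE treats it literally
select * from scanned_dbs where db like 'cdd\_db' escape '\';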

BigQuery Wildcard using TABLE_DATE_RANGE()

Great news about the new table wildcard functions this morning! Is there a way to use TABLE_DATE_RANGE() on tables whose names include a date but no prefix?
I have a dataset that contains tables named YYYYMMDD (no prefix). Normally I would query like so:
SELECT foo
FROM [mydata.20140319],[mydata.20140320],[mydata.20140321]
LIMIT 100
I tried the following but I'm getting an error:
SELECT foo
FROM
(TABLE_DATE_RANGE(mydata.,
TIMESTAMP('2014-03-19'),
TIMESTAMP('2015-03-21')))
LIMIT 100
as well as:
SELECT foo
FROM
(TABLE_DATE_RANGE(mydata,
TIMESTAMP('2014-03-19'),
TIMESTAMP('2015-03-21')))
LIMIT 100
The underlying bug here has been fixed as of 2015-05-14. You should be able to use TABLE_DATE_RANGE with a purely numeric table name. You'll need to end the dataset in a '.' and enclose the name in brackets, so that the parser doesn't complain. This should work:
SELECT foo
FROM
(TABLE_DATE_RANGE([mydata.],
TIMESTAMP('2014-03-19'),
TIMESTAMP('2015-03-21')))
LIMIT 100
Note: The underlying bug has been fixed, please see my other answer.
Original response left for posterity (since the workaround should still work, in case you need it for some reason)
Great question. That should work, but it doesn't currently. I've filed an internal bug. In the meantime, a workaround is to use the TABLE_QUERY function, as in:
SELECT foo
FROM (
TABLE_QUERY(mydata,
"TIMESTAMP(table_id) BETWEEN "
+ "TIMESTAMP('2014-03-19') "
+ "AND TIMESTAMP('2015-03-21')"))
Note that with standard SQL support in BigQuery, you can use _TABLE_SUFFIX instead of TABLE_QUERY. For example:
SELECT foo
FROM `mydata_*`
WHERE _TABLE_SUFFIX BETWEEN '20140319' AND '20150321'
Also check this question for more about BigQuery standard SQL.

How to use substr in SQL Server?

I have the following extract of code used in SAS and wanted to write it in SQL Server to extract data.
substr(zipname,1,4) in("2000","9000","3000","1000");run;
How do I write this in SQL Server?
I tried and got this error:
An expression of non-boolean type specified in a context where a
condition is expected
In SQL Server, there's no SUBSTR function (it's SUBSTRING).
By the way, you need a complete query...
select blabla
from blibli
where substring(zipname, 1, 4) in ('2000', '9000', '3000', '1000')
assuming zipname is a varchar or something like that...
You need a table that you are getting the records from, and zipname would be a column in the table. The statement would be something like this:
select * from tablename where substring(zipname,1,4) in ('2000','9000','3000','1000')
Since you want the first x characters, you can also use the left() function.
where left(zipname, 4) in (values go here)
Bear in mind that your values have to be single-quoted; your question has double quotes.
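Putting that together with the values from the question (a sketch; tablename is a placeholder as above):
select *
from tablename
where left(zipname, 4) in ('2000', '9000', '3000', '1000')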