Why does concatenation in Hive return an error?

I use the concat function in Hive (Amazon Elastic MapReduce) to specify the path to an S3 bucket (dval is a date value, which will be changed automatically):
add jar s3://mySerdeBucket/hive-json-serde.jar;
set dval='03';
set pthstring=concat('s3://mybucket/',${hiveconf:dval},'/');
CREATE EXTERNAL TABLE table1 (uuid string, tm string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
WITH SERDEPROPERTIES ('rename_columns'='device_uuid>uuid,at>tmstmp')
LOCATION ${hiveconf:pthstring};
The server returns the following error:
FAILED: ParseException line 4:9 mismatched input 'concat' expecting StringLiteral near 'LOCATION' in table location specification
As far as I understand, Hive reads the pthstring variable as a string containing the function call, not as the result of the concat function. How can I fix it?

Note that concat() is a built-in Hive string function that can be applied to column fields within a query, not to Hive variables.
Use it like this:
hive (default)> set dval=03;
hive (default)> set pthstring=s3://mybucket/${hiveconf:dval}/;
hive (default)> set dval; ---------> gives : dval=03
hive (default)> set pthstring; ----> gives : pthstring=s3://mybucket/03/
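Putting that together with the original DDL, a minimal sketch (the single quotes around the variable matter, so that LOCATION sees a string literal after substitution):
add jar s3://mySerdeBucket/hive-json-serde.jar;
set dval=03;
set pthstring=s3://mybucket/${hiveconf:dval}/;
CREATE EXTERNAL TABLE table1 (uuid string, tm string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
WITH SERDEPROPERTIES ('rename_columns'='device_uuid>uuid,at>tmstmp')
LOCATION '${hiveconf:pthstring}';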

Related

Import Logs from S3 to Athena using Regex

I have exported CloudWatch logs to S3 and now want to import those logs to Athena. The format of the logs is as follows (only one log pasted for reference):
2021-07-30T14:30:22.937Z RequestId INFO {"_logLevel":"debug","msg":"Start: Calling All the Data Associates Function","timestamp":1627655422937,"EventSubCategory":"AppSyncService","API":"AppSyncService","function":"XXXXXXXXXXXXXXXXX","Correlation_Id":"XXXXXXXXXXXXXXXXX"}
I am using a regular expression to import the log and using the following query to create the table.
CREATE EXTERNAL TABLE IF NOT EXISTS test1 (
`time` string,
`requestid` string,
`loglevel` string,
`message` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '^(.*?)\t(.*?)\t(.*?)\t([\s\S]*?)\n'
)
LOCATION 's3://logs/test/'
TBLPROPERTIES ('has_encrypted_data'='false');
Regular Expression:
^(.*?)\t(.*?)\t(.*?)\t([\s\S]*?)\n
There are four columns in the table, and the regular expression also creates four groups and works as per my expectation. However, I still get an empty table as the result.
Can anyone please help to resolve this issue?
I think your problem is that you need to double-escape things in the regex, and you should also match $ at the end rather than a newline. Try this pattern:
'input.regex' = '^(.*?)\\t(.*?)\\t(.*?)\\t([\\s\\S]*?)$'
You can see an example in the official docs.
Also, the pattern [\s\S] could be replaced by . (\S means everything not matched by \s, so together they match anything).
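Putting both suggestions together, the pattern collapses to something like this (assuming the fields really are tab-separated):
'input.regex' = '^(.*?)\\t(.*?)\\t(.*?)\\t(.*?)$'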
An alternative to the regex serde is Grok, which is less error-prone to write. Using the Grok SerDe, I think this table would work for you:
CREATE EXTERNAL TABLE IF NOT EXISTS test1 (
`time` string,
`requestid` string,
`loglevel` string,
`message` string
)
ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
'input.format' = '%{TIMESTAMP_ISO8601:time} %{NOTSPACE:requestid} %{NOTSPACE:loglevel} %{NOTSPACE:message}'
)
LOCATION 's3://logs/test/'
Grok patterns are much easier to read. Check out the documentation and the built-in patterns for more info.
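One thing worth checking: the JSON payload in the sample log contains spaces (e.g. inside "msg"), so %{NOTSPACE:message} would stop at the first space. A variant worth trying uses the built-in GREEDYDATA pattern for the trailing free-text field:
'input.format' = '%{TIMESTAMP_ISO8601:time} %{NOTSPACE:requestid} %{NOTSPACE:loglevel} %{GREEDYDATA:message}'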

Redshift duplicate month specification in date/time format

I'm trying to copy a CSV file from S3 into Redshift and I'm hitting this error:
org.postgresql.util.PSQLException: ERROR: duplicate month specification in date/time format
redshiftETL sql="COPY tablename FROM 's3://bucket/keyname.csv' IAM_ROLE 'arn:aws:iam::ACCOUNTID:role/redshift-role' REGION 'us-east-1' CSV TIMEFORMAT AS 'YYYY-MM-DDThh:mm:ss.sTZD';"
I guess it's thinking mm is a duplicate specification of month! Why is that?
I ended up using auto detection of the time format; the copy statement now looks like:
COPY tablename FROM 's3://bucket/keyname.csv'
IAM_ROLE 'arn:aws:iam::ACCOUNTID:role/redshift-role'
REGION 'us-east-1' CSV
TIMEFORMAT AS 'auto';
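As for the original error: Redshift datetime format strings are not case sensitive, so MM and mm are both read as the month specifier, which is why 'YYYY-MM-DDThh:mm:ss.sTZD' triggers the duplicate month complaint. Redshift's own tokens for the time part are HH (or HH24), MI and SS, so an explicit format would look roughly like the line below, although for ISO 8601 timestamps with the 'T' separator and timezone suffix, 'auto' (as used above) is the safer choice:
TIMEFORMAT AS 'YYYY-MM-DD HH:MI:SS'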

How to store the output of a query in a variable in HIVE

I want to store current_day - 1 in a variable in Hive. I know there are already previous threads on this topic, but the solutions provided there recommend first defining the variable outside Hive in a shell environment and then using it inside Hive.
Storing result of query in hive variable
I first got current_date - 1 using
select date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),1);
Then I tried two approaches:
1. set date1 = ( select date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),1);
and
2. set hivevar:date1 = ( select date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),1);
Both approaches throw an error:
"ParseException line 1:82 cannot recognize input near 'select' 'date_sub' '(' in expression specification"
When I print the variable from approach (1), the select query itself is saved in the variable instead of yesterday's date. Approach (2) throws "{hivevar:dt_chk} is undefined".
I am new to Hive and would appreciate any help. Thanks.
Hive doesn't support a straightforward way to store a query result in a variable. You have to use the shell option along with hiveconf.
date1=$(hive -e "set hive.cli.print.header=false; select date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'),1);")
hive -hiveconf "date1"="$date1" -f hive_script.hql
Then in your script you can reference the newly created variable date1:
select '${hiveconf:date1}'
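For instance, hive_script.hql could use the variable in a filter like this (the table and column names here are hypothetical):
-- hive_script.hql (hypothetical table/column names)
select * from my_events where event_date = '${hiveconf:date1}';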
After lots of research, this is probably the best way to set a variable from the output of a SQL query:
INSERT OVERWRITE LOCAL DIRECTORY '<home path>/config/date1'
select CONCAT('set hivevar:date1=',date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),1)) from <some table> limit 1;
source <home path>/config/date1/000000_0;
You will then be able to use ${date1} in your subsequent SQLs.
Here we had to use <some table> limit 1 because Hive has a bug with INSERT OVERWRITE when no source table is specified.

How do I add columns to a Hive table that was created using a SerDe?

table desc info
hive> desc log23;
OK
col_name data_type comment
17/05/25 10:49:12 INFO mapred.FileInputFormat: Total input files to process : 1
host string from deserializer
remote_host string from deserializer
remote_logname string from deserializer
remote_user string from deserializer
request_time string from deserializer
request_method string from deserializer
request_url string from deserializer
first_line string from deserializer
http_status string from deserializer
bytes string from deserializer
referer string from deserializer
agent string from deserializer
Time taken: 0.049 seconds, Fetched: 12 row(s)
Apache log format SerDe settings:
serializationLib:org.apache.hadoop.hive.contrib.serde2.RegexSerDe, parameters:{output.format.string=%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s, serialization.format=1, input.regex=([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) "(.[A-Z]*) (.*) (.*)" (-|[0-9]*) (-|[0-9]*) "(.*)" "(.*)"})
Add a column using an ALTER query:
hive> alter table log23 add columns (code string);
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
Error: type expected at the position 0 of derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:<derived from deserializer:derived from deserializer:string but>'<' is found.
I get an error like the above. How do I add a column?
Unfortunately you cannot add columns if you've used a SerDe. It is a known issue:
https://issues.apache.org/jira/browse/HIVE-17713
ADD COLUMNS lets you add new columns to the end of the existing columns but before the partition columns. This is supported for Avro backed tables as well, for Hive 0.14 and later.
REPLACE COLUMNS removes all existing columns and adds the new set of columns. This can be done only for tables with a native SerDe (DynamicSerDe, MetadataTypedColumnsetSerDe, LazySimpleSerDe and ColumnarSerDe). Refer to Hive SerDe for more information. REPLACE COLUMNS can also be used to drop columns. For example, "ALTER TABLE test_change REPLACE COLUMNS (a int, b int);" will remove column 'c' from test_change's schema.
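If log23 is an EXTERNAL table over the raw log files (so that dropping it does not delete any data), one common workaround is to drop and recreate the table with the extra column and a regex that captures one more group. This is a rough sketch only; the input.regex and LOCATION below are placeholders, not the exact values for your logs:
DROP TABLE log23;
CREATE EXTERNAL TABLE log23 (
host string, remote_host string, remote_logname string, remote_user string,
request_time string, request_method string, request_url string, first_line string,
http_status string, bytes string, referer string, agent string,
code string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '<regex with 13 capture groups matching your log layout>')
LOCATION '<path to the log files>';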
I tried the same and was able to create a table and add the column at the end:
create table log23 (host String, remote_host String);
alter table log23 add columns(code String);
This works with the TEXTFILE format. Please let me know if you are using a different file format so that I can try to replicate the issue.

Pig - reading Hive table stored as Avro

I have created a Hive table stored in the Avro file format. I am trying to load the same Hive table using the Pig commands below:
pig -useHCatalog;
hive_avro = LOAD 'hive_avro_table' using org.apache.hive.hcatalog.pig.HCatLoader();
I am getting a "failed to read from hive_avro_table" error when I try to display "hive_avro" using the DUMP command.
Please help me resolve this issue. Thanks in advance.
create table hivecomplex
(name string,
phones array<INT>,
deductions map<string,float>,
address struct<street:string,zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '$'
MAP KEYS TERMINATED BY '#'
STORED AS AVRO
;
hive> select * from hivecomplex;
OK
John [650,999,9999] {"pf":500.0} {"street":"pleasantville","zip":88888}
Time taken: 0.078 seconds, Fetched: 1 row(s)
Now for the Pig side:
pig -useHCatalog;
a = LOAD 'hivecomplex' USING org.apache.hive.hcatalog.pig.HCatLoader();
dump a;
ne.util.MapRedUtil - Total input paths to process : 1
(John,{(650),(999),(9999)},[pf#500.0],(pleasantville,88888))