How do I add columns to a table created using serde when creating a Hive table?

table desc info
hive> desc log23;
OK
col_name data_type comment
17/05/25 10:49:12 INFO mapred.FileInputFormat: Total input files to process : 1
host string from deserializer
remote_host string from deserializer
remote_logname string from deserializer
remote_user string from deserializer
request_time string from deserializer
request_method string from deserializer
request_url string from deserializer
first_line string from deserializer
http_status string from deserializer
bytes string from deserializer
referer string from deserializer
agent string from deserializer
Time taken: 0.049 seconds, Fetched: 12 row(s)
Apache log format SerDe info:
serializationLib:org.apache.hadoop.hive.contrib.serde2.RegexSerDe, parameters:{output.format.string=%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s, serialization.format=1, input.regex=([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) "(.[A-Z]*) (.*) (.*)" (-|[0-9]*) (-|[0-9]*) "(.*)" "(.*)"})
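For context, the table definition implied by the DESCRIBE output and the SerDe parameters above would look roughly like this (a reconstruction, not the exact original DDL; the regex escaping in particular may differ):
-- Hypothetical reconstruction of the log23 definition based on the
-- DESCRIBE output and the SerDe parameters shown above.
CREATE TABLE log23 (
  host string, remote_host string, remote_logname string, remote_user string,
  request_time string, request_method string, request_url string, first_line string,
  http_status string, bytes string, referer string, agent string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) "(.[A-Z]*) (.*) (.*)" (-|[0-9]*) (-|[0-9]*) "(.*)" "(.*)"',
  'output.format.string' = '%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s'
);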
Adding a column with an ALTER query:
hive> alter table log23 add columns (code string);
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
Error: type expected at the position 0 of derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:<derived from deserializer:derived from deserializer:string but>'<' is found.
I get an error like the one above and the statement fails. How do I add a column?

Unfortunately, you cannot add columns to a table backed by a non-native SerDe such as RegexSerDe. It is a known issue:
https://issues.apache.org/jira/browse/HIVE-17713

ADD COLUMNS lets you add new columns to the end of the existing columns but before the partition columns. This is supported for Avro backed tables as well, for Hive 0.14 and later.
REPLACE COLUMNS removes all existing columns and adds the new set of columns. This can be done only for tables with a native SerDe (DynamicSerDe, MetadataTypedColumnsetSerDe, LazySimpleSerDe and ColumnarSerDe). Refer to Hive SerDe for more information. REPLACE COLUMNS can also be used to drop columns. For example, "ALTER TABLE test_change REPLACE COLUMNS (a int, b int);" will remove column 'c' from test_change's schema.
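If the extra column is really needed, one possible workaround is to drop the table definition and recreate it with the new column and an extended regex. This is only a sketch: it assumes the data can stay in place (i.e. the table is EXTERNAL, or you are prepared to reload it), and the LOCATION and the extra capture group are placeholders.
-- Sketch of a workaround: recreate the definition with the extra column.
-- Dropping a managed table deletes its data, so only do this on an
-- EXTERNAL table or after backing the data up.
DROP TABLE log23;
CREATE EXTERNAL TABLE log23 (
  host string, remote_host string, remote_logname string, remote_user string,
  request_time string, request_method string, request_url string, first_line string,
  http_status string, bytes string, referer string, agent string,
  code string   -- the new column
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '...same regex as before, plus one more capture group for code...'
)
LOCATION '/path/to/log/data';   -- hypothetical location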

I tried the same thing and was able to create a table and add a column at the end:
create table log23 (host String, remote_host String);
alter table log23 add columns(code String);
This works with the TEXTFILE format. Please let me know if you are using a different file format so that I can try to replicate the issue.

Related

Import Logs from S3 to Athena using Regex

I have exported CloudWatch logs to S3 and now want to import those logs into Athena. The format of the logs is as follows (only one log line is pasted for reference):
2021-07-30T14:30:22.937Z RequestId INFO {"_logLevel":"debug","msg":"Start: Calling All the Data Associates Function","timestamp":1627655422937,"EventSubCategory":"AppSyncService","API":"AppSyncService","function":"XXXXXXXXXXXXXXXXX","Correlation_Id":"XXXXXXXXXXXXXXXXX"}
I am using a regular expression to import the logs and the following query to create the table.
CREATE EXTERNAL TABLE IF NOT EXISTS test1 (
`time` string,
`requestid` string,
`loglevel` string,
`message` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '^(.*?)\t(.*?)\t(.*?)\t([\s\S]*?)\n'
)
LOCATION 's3://logs/test/'
TBLPROPERTIES ('has_encrypted_data'='false');
Regular Expression:
^(.*?)\t(.*?)\t(.*?)\t([\s\S]*?)\n
There are four columns in the table, and the regular expression also has four capture groups that work as I expect. However, I still get an empty table as a result.
Can anyone please help to resolve this issue?
I think your problem is that you need to double-escape things in the regex, and you should also not match a newline at the end, but $ instead. Try this pattern:
'input.regex' = '^(.*?)\\t(.*?)\\t(.*?)\\t([\\s\\S]*?)$'
You can see an example in the official docs.
Also, the pattern [\s\S] could be replaced by . (\S means everything not matched by \s, so together they match anything).
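Putting both suggestions together, the full statement might look like this (a sketch based on the DDL in the question, with the same columns and location):
-- Sketch: the question's DDL with the double-escaped, $-terminated pattern
-- and the [\s\S] group simplified to . as suggested above.
CREATE EXTERNAL TABLE IF NOT EXISTS test1 (
  `time`      string,
  `requestid` string,
  `loglevel`  string,
  `message`   string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = '^(.*?)\\t(.*?)\\t(.*?)\\t(.*?)$'
)
LOCATION 's3://logs/test/'
TBLPROPERTIES ('has_encrypted_data' = 'false');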
An alternative to the regex serde is Grok, which is less error-prone to write. Using the Grok serde, I think this table would work for you:
CREATE EXTERNAL TABLE IF NOT EXISTS test1 (
`time` string,
`requestid` string,
`loglevel` string,
`message` string
)
ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
'input.format' = '%{TIMESTAMP_ISO8601:time} %{NOTSPACE:requestid} %{NOTSPACE:loglevel} %{NOTSPACE:message}'
)
LOCATION 's3://logs/test/'
Grok patterns are much easier to read. Check out the documentation and the built-in patterns for more info.
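Once either table exists, a quick sanity check might look like this (a hypothetical query; "time" is double-quoted in case Athena treats it as a reserved word):
-- Quick check that rows parse into the expected columns.
SELECT "time", requestid, loglevel, message
FROM test1
LIMIT 10;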

Write a csv to a partitioned Hive table using Spark: org.apache.spark.SparkException: Requested partitioning does not match the table

I have an existing Hive table:
CREATE TABLE form_submit (form_id String,
submitter_name String)
PARTITIONED BY
(submission_date String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS ORC;
I have a csv of raw data, which I read using
val session = SparkSession.builder()
.enableHiveSupport()
.config("spark.hadoop.hive.exec.dynamic.partition", "true")
.config("spark.hadoop.hive.exec.dynamic.partition.mode", "nonstrict")
.getOrCreate()
val dataframe = session
.read
.option("header", "true")
.csv(hdfsPath)
I then perform some manipulations on this data, using a series of withColumn and drop statements, to make sure that the format matches the table format.
I then try to write it like so:
formattedDataframe.write
.mode(SaveMode.Append)
.format("hive")
.partitionBy("submission_date")
.saveAsTable(tableName)
I'm not using insertInto, because the columns in the dataframe end up in a bad order, and I wouldn't want to rely on column order anyway.
I run it as a Spark job and get an exception:
Exception in thread "main" org.apache.spark.SparkException: Requested partitioning does not match the form_submit table:
Requested partitions:
Table partitions: "submission_date"
What am I doing wrong? Didn't I choose the partitioning by calling partitionBy?

Hive simple Regular expression

I am trying to check whether all the data in a column is a valid date.
create table dates (tm string, dt string) row format delimited fields terminated by '\t';
dates.txt (sample data):
20181205 15
20171023 23
20170516 16
load data local inpath 'dates.txt' overwrite into table dates;
create temporary macro isitDate(s string)
case when regexp_extract(s,'((0[1-9]|[12][0-9]|3[01])',0) = ''
then false
else true
end;
select * from dates where isitDate(dt);
But the select statement gives the error below:
Failed with exception
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
Unable to execute method public java.lang.String
org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(java.lang.String,java.lang.String,java.lang.Integer)
on object org.apache.hadoop.hive.ql.udf.UDFRegExpExtract#66b45e1e of
class org.apache.hadoop.hive.ql.udf.UDFRegExpExtract with arguments
{15:java.lang.String, ((0[1-9]|[12][0-9]|3[01]):java.lang.String,
0:java.lang.Integer} of size 3
Is there something wrong with my regular expression?
I made a stupid mistake; there is one extra opening bracket in the macro.
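For completeness, the corrected macro with the stray opening bracket removed would be (a sketch of the fix described above):
-- Redefine the macro without the extra "(" before 0[1-9].
drop temporary macro if exists isitDate;
create temporary macro isitDate(s string)
case when regexp_extract(s, '(0[1-9]|[12][0-9]|3[01])', 0) = ''
     then false
     else true
end;
select * from dates where isitDate(dt);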

Apache Hive loads NULL values instead of integers

I am new to Apache Hive and was running queries on sample data saved in a CSV file, as below:
0195153448;"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"//images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg";"http://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg";"images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"
and the table which I created is of this form:
hive> describe book;
OK
isbn bigint
title string
author string
year string
publ string
img1 string
img2 string
img3 string
Time taken: 0.085 seconds, Fetched: 8 row(s)
and the script which I used to create the table is:
create table book(isbn int,title string,author string, year string,publ string,img1 string,img2 string,img3 string) row format delimited fields terminated by '\;' lines terminated by '\n' location 'path';
When I try to retrieve the data from the table by using the following query:
select * from book limit 1;
I get the following result:
NULL "Classical Mythology" "Mark P. O. Morford" "2002" "Oxford University Press" "http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg" "images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg" "images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg"
Even though I specify the first column type as int or bigint, the data is loaded into the table as NULL.
I searched the internet and figured out that I have to specify the row delimiter. I used that too, but the data in the table did not change.
Is there anything I am doing wrong? Please help.

Why does concatenation in Hive return an error?

I use the concat function in Hive (Amazon Elastic MapReduce) to specify the path to an S3 bucket (dval is a date value, which will be changed automatically):
add jar s3://mySerdeBucket/hive-json-serde.jar;
set dval='03';
set pthstring=concat('s3://mybucket/',${hiveconf:dval},'/');
CREATE EXTERNAL TABLE table1 (uuid string, tm string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
WITH SERDEPROPERTIES ('rename_columns'='device_uuid>uuid,at>tmstmp')
LOCATION ${hiveconf:pthstring};
The server returns the following error:
FAILED: ParseException line 4:9 mismatched input 'concat' expecting StringLiteral near 'LOCATION' in table location specification
As far as I understand, Hive reads the pthstring variable as a string containing the function call, not the result of the concat function. How can I fix it?
Note that concat() is a built-in Hive string function that can be applied to column fields in a query, not to Hive variables.
Use it like this:
hive (default)> set dval=03;
hive (default)> set pthstring=s3://mybucket/${hiveconf:dval}/;
hive (default)> set dval; ---------> gives : dval=03
hive (default)> set pthstring; ----> gives : pthstring=s3://mybucket/03/
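With pthstring defined as a plain string, the variable can then be referenced in the DDL, since Hive substitutes it before parsing. Quoting the substitution is an assumption here, because LOCATION expects a string literal:
-- Sketch: the original DDL with the substituted variable quoted as a string literal.
CREATE EXTERNAL TABLE table1 (uuid string, tm string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
WITH SERDEPROPERTIES ('rename_columns'='device_uuid>uuid,at>tmstmp')
LOCATION '${hiveconf:pthstring}';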