How to specify SERDEPROPERTIES and TBLPROPERTIES when creating Hive table via prestosql - hive

I'm trying to follow the examples in the Hive connector documentation to create a Hive table. I can write HQL to create a table via beeline, but I wonder how to do it via prestosql.
Given the table
CREATE TABLE hive.web.request_logs (
request_time varchar,
url varchar,
ip varchar,
user_agent varchar,
dt varchar
)
WITH (
format = 'CSV',
partitioned_by = ARRAY['dt'],
external_location = 's3://my-bucket/data/logs/'
)
How to specify SERDEPROPERTIES like separatorChar and quoteChar?
How to specify TBLPROPERTIES like skip.header.line.count?

In Presto you specify these as table properties in the WITH clause, for example:
CREATE TABLE table_name( ... columns ... )
WITH (format='CSV', csv_separator='|', skip_header_line_count=1);
You can list all supported table properties in Presto with
SELECT * FROM system.metadata.table_properties;
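Applied to the table from the question, the mapping looks roughly like this (a sketch: csv_separator/csv_quote are the Hive connector counterparts of separatorChar/quoteChar, and skip_header_line_count of skip.header.line.count; confirm the exact names for your version with the system.metadata.table_properties query above):
CREATE TABLE hive.web.request_logs (
  request_time varchar,
  url varchar,
  ip varchar,
  user_agent varchar,
  dt varchar
)
WITH (
  format = 'CSV',
  csv_separator = ',',
  csv_quote = '"',
  skip_header_line_count = 1,
  partitioned_by = ARRAY['dt'],
  external_location = 's3://my-bucket/data/logs/'
);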

Related

How can I create a TIMESTAMP column in Hive with a string-based timestamp?

I am attempting to create a table in Hive so that it can be queried via Trino, but I'm getting an error. My guess is that I need to transform or somehow modify the string, or do something with the formatting. Do I do that at the CREATE TABLE step? No idea.
use hive.MYSCHEMA;
USE
trino:MYSCHEMA> CREATE TABLE IF NOT EXISTS hive.MYSCHEMA.MYTABLE (
-> column_1 VARCHAR,
-> column_2 VARCHAR,
-> column_3 VARCHAR,
-> column_4 BIGINT,
-> column_5 VARCHAR,
-> column_6 VARCHAR,
-> query_start_time TIMESTAMP)
-> WITH (
-> external_location = 's3a://MYS3BUCKET/dir1/dir2/',
-> format = 'PARQUET');
CREATE TABLE
trino:MYSCHEMA> SELECT * FROM MYTABLE;
Query 20220926_131538_00008_dbc39, FAILED, 1 node
Splits: 1 total, 0 done (0.00%)
1.72 [0 rows, 0B] [0 rows/s, 0B/s]
Query 20220926_131538_00008_dbc39 failed: Failed to read Parquet file: s3a://MYS3BUCKET/dir1/dir2/20220918_194105-135895.snappy.parquet
the full stacktrace is as follows
io.trino.spi.TrinoException: Failed to read Parquet file: s3a://MYS3BUCKET/dir1/dir2/20220918_194105-135895.snappy.parquet
at io.trino.plugin.hive.parquet.ParquetPageSource.handleException(ParquetPageSource.java:169)
at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.lambda$createPageSource$6(ParquetPageSourceFactory.java:271)
at io.trino.parquet.reader.ParquetBlockFactory$ParquetBlockLoader.load(ParquetBlockFactory.java:75)
at io.trino.spi.block.LazyBlock$LazyData.load(LazyBlock.java:406)
at io.trino.spi.block.LazyBlock$LazyData.getFullyLoadedBlock(LazyBlock.java:385)
at io.trino.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:292)
at io.trino.spi.Page.getLoadedPage(Page.java:229)
at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:314)
at io.trino.operator.Driver.processInternal(Driver.java:411)
at io.trino.operator.Driver.lambda$process$10(Driver.java:314)
at io.trino.operator.Driver.tryWithLock(Driver.java:706)
at io.trino.operator.Driver.process(Driver.java:306)
at io.trino.operator.Driver.processForDuration(Driver.java:277)
at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:736)
at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:164)
at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:515)
at io.trino.$gen.Trino_397____20220926_094436_2.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.UnsupportedOperationException: io.trino.spi.type.ShortTimestampType
at io.trino.spi.type.AbstractType.writeSlice(AbstractType.java:115)
at io.trino.parquet.reader.BinaryColumnReader.readValue(BinaryColumnReader.java:54)
at io.trino.parquet.reader.PrimitiveColumnReader.lambda$readValues$2(PrimitiveColumnReader.java:248)
at io.trino.parquet.reader.PrimitiveColumnReader.processValues(PrimitiveColumnReader.java:304)
at io.trino.parquet.reader.PrimitiveColumnReader.readValues(PrimitiveColumnReader.java:246)
at io.trino.parquet.reader.PrimitiveColumnReader.readPrimitive(PrimitiveColumnReader.java:235)
at io.trino.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:441)
at io.trino.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:540)
at io.trino.parquet.reader.ParquetReader.readBlock(ParquetReader.java:523)
at io.trino.parquet.reader.ParquetReader.lambda$nextPage$3(ParquetReader.java:272)
at io.trino.parquet.reader.ParquetBlockFactory$ParquetBlockLoader.load(ParquetBlockFactory.java:72)
... 17 more
Hive does not have a feature to transform a string into a timestamp in DDL, so we can achieve the desired result by splitting the task into two steps and creating two tables.
First we create the original table over the existing data:
CREATE TABLE IF NOT EXISTS
hive.MYSCHEMA.MYTABLE (
column_1 VARCHAR,
column_2 VARCHAR,
column_3 VARCHAR,
column_4 BIGINT,
column_5 VARCHAR,
column_6 VARCHAR,
query_start_time VARCHAR)
WITH (
external_location = 's3a://MYS3BUCKET/dir1/dir2/',
format = 'PARQUET');
Next, the new table with the correct timestamp data type:
CREATE TABLE IF NOT EXISTS
hive.MYSCHEMA.NEWTABLE (
column_1 VARCHAR,
column_2 VARCHAR,
column_3 VARCHAR,
column_4 BIGINT,
column_5 VARCHAR,
column_6 VARCHAR,
query_start_time TIMESTAMP)
WITH (
external_location = 's3a://MYS3BUCKET/newlocation/',
format = 'PARQUET');
Now we move the data from MYTABLE to NEWTABLE, converting as we go:
INSERT OVERWRITE TABLE NEWTABLE
SELECT column_1, column_2, column_3, ...., column_6,
       unix_timestamp(query_start_time, 'yyyy-MM-ddTHH:mm:ss.SSSSSSZ') as query_start_time
FROM MYTABLE;
You will have to test for the correct format pattern in the unix_timestamp function; it follows Java's SimpleDateFormat rules, so literal characters such as the T need to be quoted, and because unix_timestamp returns epoch seconds as a BIGINT you may need to wrap it in from_unixtime() and CAST(... AS TIMESTAMP) to land it in the TIMESTAMP column.
This will first convert the string column to timestamp and then store it in the new table. This means that all the old data will be read and stored in the new location.
You can think of it as an ETL job in Hive.
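If you would rather run that ETL step from Trino itself, here is a hedged sketch (it assumes the strings really are ISO-8601 like 2022-09-18T19:41:05.123456Z, uses Trino's from_iso8601_timestamp, and writing into the external NEWTABLE may additionally require hive.non-managed-table-writes-enabled=true in the catalog properties):
INSERT INTO hive.MYSCHEMA.NEWTABLE
SELECT
  column_1, column_2, column_3, column_4, column_5, column_6,
  -- parse the ISO-8601 string, then drop the zone to fit the TIMESTAMP column
  CAST(from_iso8601_timestamp(query_start_time) AS TIMESTAMP) AS query_start_time
FROM hive.MYSCHEMA.MYTABLE;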
Additional information on why this conversion needs ETL even though we have schema-on-read:
Schema-on-read is powerful for big data: it lets you change the data type of a stored column at read time. For example, if the ID column is an INT in your file, you can read it as STRING/VARCHAR by declaring the column as a string in your DDL; similarly you can read TIMESTAMP data as DATETIME. This is useful for schema evolution or for reading from multiple sources with different data types.
So why couldn't we use that power in this scenario? The same limitation applies to any case where you actually need to process the column, e.g. splitting one string column into two. The reason we have to perform ETL here is that in Parquet/Avro the timestamp data type is not a primitive type: it is stored as a long with an additional logical_type property of datetime/timestamp. See the Parquet and Avro documentation on logical types for further clarification.
Hive will take the format below natively, so if you can remove the T and Z I think you should be good to go.
Please give the CREATE TABLE statement below a try. It may not be a Parquet table, but it should work if your timestamp is in the correct string format.
CREATE TABLE mytable (
  id int,
  ts timestamp
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ',',
  'timestamp.formats' = 'yyyy-MM-dd HH:mm:ss.SSSSSS'
)
LOCATION 's3://user/';
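And as a sketch of the "remove T and Z" idea in plain Hive SQL (assuming the source strings look like 2022-09-18T19:41:05.123456Z; regexp_replace and CAST are standard Hive functions):
SELECT CAST(regexp_replace(regexp_replace(query_start_time, 'T', ' '), 'Z$', '') AS TIMESTAMP) AS query_start_time
FROM MYTABLE
LIMIT 10;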

impala CREATE EXTERNAL TABLE and remove double quotes

I've got data in a CSV, for example:
"Female","44","0","0","Yes","Govt_job","Urban","103.59","32.7","formerly smoked"
I put it on HDFS with hdfs dfs -put, and now I want to create an external table from it in Impala (not in Hive).
Is there an option to do this without the double quotes?
This is what I run via impala-shell:
CREATE EXTERNAL TABLE IF NOT EXISTS test_test.test1_ext
( `gender` STRING,`age` STRING,`hypertension` STRING,`heart_disease` STRING,`ever_married` STRING,`work_type` STRING,`Residence_type` STRING,`avg_glucose_level` STRING,`bmi` STRING,`smoking_status` STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION "/user/test/tmp/test1"
Update 28.11: I managed to do it by creating the external table and then creating a VIEW as a SELECT with CASE WHEN/concat() over each column.
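A hedged sketch of that view-based workaround (the view name test1_clean is made up; regexp_replace simply strips the quote characters, which is a bit shorter than CASE WHEN/concat()):
CREATE VIEW test_test.test1_clean AS
SELECT
  regexp_replace(gender, '"', '')            AS gender,
  regexp_replace(age, '"', '')               AS age,
  regexp_replace(hypertension, '"', '')      AS hypertension,
  regexp_replace(heart_disease, '"', '')     AS heart_disease,
  regexp_replace(ever_married, '"', '')      AS ever_married,
  regexp_replace(work_type, '"', '')         AS work_type,
  regexp_replace(Residence_type, '"', '')    AS Residence_type,
  regexp_replace(avg_glucose_level, '"', '') AS avg_glucose_level,
  regexp_replace(bmi, '"', '')               AS bmi,
  regexp_replace(smoking_status, '"', '')    AS smoking_status
FROM test_test.test1_ext;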
Impala uses the Hive metastore, so anything created in Hive is available from Impala after issuing INVALIDATE METADATA dbname.tablename. HOWEVER, to remove the quotes you need to use the Hive SerDe 'org.apache.hadoop.hive.serde2.OpenCSVSerde', and this is not accessible from Impala. My suggestion would be to do the following:
Create the external table in Hive
CREATE EXTERNAL TABLE IF NOT EXISTS test_test.test1_ext
( gender STRING, age STRING, hypertension STRING, heart_disease STRING, ever_married STRING, work_type STRING, Residence_type STRING, avg_glucose_level STRING, bmi STRING, smoking_status STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
  "separatorChar" = ",",
  "quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION '/user/test/tmp/test1';
Create a managed table in Hive using CTAS
CREATE TABLE mytable AS SELECT * FROM test_test.test1_ext;
Make it available in Impala
INVALIDATE METADATA db.mytable;
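After the INVALIDATE METADATA, a quick sanity check from impala-shell (column names as in the question, db as the placeholder above):
SELECT gender, age, smoking_status
FROM db.mytable
LIMIT 5;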

How to output a table as a parquet file in spark-sql, not spark-shell?

It is easy to read a table from a CSV file using spark-sql:
CREATE TABLE MyTable (
X STRING,
Y STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\,",
"quoteChar" = "\""
)
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'input.csv' INTO TABLE MyTable;
But how can I output this result as Parquet file?
PS: I know how to do that in spark-shell, but it is not what I'm looking for.
You have to create a table in Hive with the schema of your results, stored as Parquet. After getting the results you can export them into that Parquet table like this:
set hive.insert.into.external.tables = true
create external table mytable_parq ( use your source table DDL) stored as parquet location '/hadoop/mytable';
insert into mytable_parq select * from mytable ;
or
insert overwrite directory '/hadoop/mytable' STORED AS PARQUET select * from MyTable ;
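Alternatively, since spark-sql accepts Hive-style CTAS, a sketch that produces the Parquet table in one statement (the table name MyTableParquet is made up; the files land in the default warehouse location unless you add a LOCATION clause):
CREATE TABLE MyTableParquet
STORED AS PARQUET
AS SELECT * FROM MyTable;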

How do you add Data to an Existing Hive Metastore?

I have multiple subdirectories in S3 that contain .orc files. I'm trying to create a Hive metastore so I can query the data with Presto / Hive, etc. The data is poorly structured (no consistent delimiter, ugly characters, etc.). Here's a scrubbed sample:
1488736466 199.199.199.199 0_b.www.sphericalcow.com.f9b1.qk-g6m6z24tdr.v4.url.name.com TXT IN: NXDOMAIN/0/143
1488736466 6.6.5.4 0.3399.186472.4306.6668.638.cb5a.names-things.update.url.name.com TXT IN: NOERROR/3/306 0\009253\009http://az.blargi.ng/%D3%AB%EF%BF%BD%EF%BF%BD/\009 0\009253\009http://casinoroyal.online/\009 0\009253\009http://d2njbfxlilvpsq.cloudfront.net/b_zq_ym_bangvideo/bangvideo0826.apk\009
I was able to create a table pointing to one of the subdirectories using a serde regex and the fields are parsing properly, but as far as I can tell I can only load one subfolder at a time.
How does one add more data to an existing hive metastore?
Here's an example of my hive metastore create statement with the regex serde bit:
DROP TABLE IF EXISTS test;
CREATE EXTERNAL TABLE test (field1 string, field2 string, field3 string, field4 string)
COMMENT 'fill all the tables with the datas.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]{10}) ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) (\\S*) (.*)",
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
STORED AS ORC
LOCATION 's3://path/to/one/of/10/folders/'
tblproperties ("orc.compress" = "SNAPPY", "skip.header.line.count"="2");
select * from test limit 10;
I realize there is probably a very simple solution. I tried INSERT INTO in place of CREATE EXTERNAL TABLE, but it understandably complains about the input, and I looked in both the Hive and SerDe documentation but was unable to find a reference to adding data to an existing store.
Possible solution using partitions.
CREATE EXTERNAL TABLE test (field1 string, field2 string, field3 string, field4 string)
partitioned by (mypartcol string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]{10}) ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) (\\S*) (.*)"
)
LOCATION 's3://whatever/as/long/as/it/is/empty'
tblproperties ("skip.header.line.count"="2");
alter table test add partition (mypartcol='folder 1') location 's3://path/to/1st/of/10/folders/';
alter table test add partition (mypartcol='folder 2') location 's3://path/to/2nd/of/10/folders/';
...
alter table test add partition (mypartcol='folder 10') location 's3://path/to/10th/of/10/folders/';
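Once the partitions are registered, a single query spans all of the folders; a quick sanity check:
SELECT mypartcol, count(*) AS cnt
FROM test
GROUP BY mypartcol;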
For #TheProletariat (the OP)
It seems there is no need for RegexSerDe since the columns are delimited by space (' ').
Note the use of tblproperties ("serialization.last.column.takes.rest"="true")
create external table test
(
field1 bigint
,field2 string
,field3 string
,field4 string
)
row format delimited
fields terminated by ' '
tblproperties ("serialization.last.column.takes.rest"="true")
;
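If you take the delimited route, the partition trick above still applies; a sketch combining the two (locations reuse the placeholders from the earlier snippets):
CREATE EXTERNAL TABLE test (
  field1 bigint,
  field2 string,
  field3 string,
  field4 string
)
PARTITIONED BY (mypartcol string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LOCATION 's3://whatever/as/long/as/it/is/empty'
TBLPROPERTIES ("serialization.last.column.takes.rest"="true", "skip.header.line.count"="2");
ALTER TABLE test ADD PARTITION (mypartcol='folder 1') LOCATION 's3://path/to/1st/of/10/folders/';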

query hbase like normal sql

I know HBase is not like normal SQL, but is it possible to query HBase with something like this?
select row-key from Table
where cf:first="ram" and cf:middle="leela" and cf:last="ban";
// ram(first name) leela(middle name) ban(last name)
There are two ways of doing it:
Use Apache Phoenix (Recommended). It's a powerful SQL wrapper for HBase.
Use Apache Hive. Hive can create an 'external table' using HiveQL:
CREATE EXTERNAL TABLE employees (
empid int,
ename String
)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '#'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:ename")
TBLPROPERTIES ("hbase.table.name" = "employees");