How can I create a TIMESTAMP column in Hive with a string-based timestamp?

I am attempting to create a table in Hive so that it can be queried via Trino, but I am getting an error. My guess is that I need to transform the string or somehow change its formatting. Do I do that at the CREATE TABLE step? I have no idea.
use hive.MYSCHEMA;
USE
trino:MYSCHEMA> CREATE TABLE IF NOT EXISTS hive.MYSCHEMA.MYTABLE (
-> column_1 VARCHAR,
-> column_2 VARCHAR,
-> column_3 VARCHAR,
-> column_4 BIGINT,
-> column_5 VARCHAR,
-> column_6 VARCHAR,
-> query_start_time TIMESTAMP)
-> WITH (
-> external_location = 's3a://MYS3BUCKET/dir1/dir2/',
-> format = 'PARQUET');
CREATE TABLE
trino:MYSCHEMA> SELECT * FROM MYTABLE;
Query 20220926_131538_00008_dbc39, FAILED, 1 node
Splits: 1 total, 0 done (0.00%)
1.72 [0 rows, 0B] [0 rows/s, 0B/s]
Query 20220926_131538_00008_dbc39 failed: Failed to read Parquet file: s3a://MYS3BUCKET/dir1/dir2/20220918_194105-135895.snappy.parquet
The full stack trace is as follows:
io.trino.spi.TrinoException: Failed to read Parquet file: s3a://MYS3BUCKET/dir1/dir2/20220918_194105-135895.snappy.parquet
at io.trino.plugin.hive.parquet.ParquetPageSource.handleException(ParquetPageSource.java:169)
at io.trino.plugin.hive.parquet.ParquetPageSourceFactory.lambda$createPageSource$6(ParquetPageSourceFactory.java:271)
at io.trino.parquet.reader.ParquetBlockFactory$ParquetBlockLoader.load(ParquetBlockFactory.java:75)
at io.trino.spi.block.LazyBlock$LazyData.load(LazyBlock.java:406)
at io.trino.spi.block.LazyBlock$LazyData.getFullyLoadedBlock(LazyBlock.java:385)
at io.trino.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:292)
at io.trino.spi.Page.getLoadedPage(Page.java:229)
at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:314)
at io.trino.operator.Driver.processInternal(Driver.java:411)
at io.trino.operator.Driver.lambda$process$10(Driver.java:314)
at io.trino.operator.Driver.tryWithLock(Driver.java:706)
at io.trino.operator.Driver.process(Driver.java:306)
at io.trino.operator.Driver.processForDuration(Driver.java:277)
at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:736)
at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:164)
at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:515)
at io.trino.$gen.Trino_397____20220926_094436_2.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.UnsupportedOperationException: io.trino.spi.type.ShortTimestampType
at io.trino.spi.type.AbstractType.writeSlice(AbstractType.java:115)
at io.trino.parquet.reader.BinaryColumnReader.readValue(BinaryColumnReader.java:54)
at io.trino.parquet.reader.PrimitiveColumnReader.lambda$readValues$2(PrimitiveColumnReader.java:248)
at io.trino.parquet.reader.PrimitiveColumnReader.processValues(PrimitiveColumnReader.java:304)
at io.trino.parquet.reader.PrimitiveColumnReader.readValues(PrimitiveColumnReader.java:246)
at io.trino.parquet.reader.PrimitiveColumnReader.readPrimitive(PrimitiveColumnReader.java:235)
at io.trino.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:441)
at io.trino.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:540)
at io.trino.parquet.reader.ParquetReader.readBlock(ParquetReader.java:523)
at io.trino.parquet.reader.ParquetReader.lambda$nextPage$3(ParquetReader.java:272)
at io.trino.parquet.reader.ParquetBlockFactory$ParquetBlockLoader.load(ParquetBlockFactory.java:72)
... 17 more

We can achieve the desired result by splitting the task into two steps, because Hive has no way to transform a string into a timestamp in the DDL itself.
So we first create two tables.
First we create the original table over the existing data, keeping query_start_time as a VARCHAR:
CREATE TABLE IF NOT EXISTS
hive.MYSCHEMA.MYTABLE (
column_1 VARCHAR,
column_2 VARCHAR,
column_3 VARCHAR,
column_4 BIGINT,
column_5 VARCHAR,
column_6 VARCHAR,
query_start_time VARCHAR)
WITH (
external_location = 's3a://MYS3BUCKET/dir1/dir2/',
format = 'PARQUET');
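At this point a quick sanity check that the files read cleanly with the timestamp column kept as a string is worthwhile (a minimal sketch; the LIMIT is arbitrary):
SELECT query_start_time FROM hive.MYSCHEMA.MYTABLE LIMIT 5;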
Next, the new table with the correct timestamp data type:
CREATE TABLE IF NOT EXISTS
hive.MYSCHEMA.NEWTABLE (
column_1 VARCHAR,
column_2 VARCHAR,
column_3 VARCHAR,
column_4 BIGINT,
column_5 VARCHAR,
column_6 VARCHAR,
query_start_time TIMESTAMP)
WITH (
external_location = 's3a://MYS3BUCKET/newlocation/',
format = 'PARQUET');
Now we move the data from MYTABLE to NEWTABLE, converting as we go:
INSERT OVERWRITE TABLE NEWTABLE SELECT column_1, column_2, column_3, ...., column_6,
cast(from_unixtime(unix_timestamp(query_start_time, "yyyy-MM-dd'T'HH:mm:ss.SSSSSS'Z'")) as timestamp) as query_start_time FROM MYTABLE;
Note that unix_timestamp returns epoch seconds as a BIGINT, hence the from_unixtime/cast wrapping above, and that sub-second precision is lost in that conversion. You will have to test for the correct format string for the unix_timestamp function; see the Hive date functions documentation.
This will convert the string column to a timestamp and store it in the new table, which means all the old data is read and rewritten to the new location.
You can think of it as an ETL job in Hive.
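Since the table is being queried through Trino anyway, an alternative is to run the conversion from the Trino side instead of Hive. This is only a sketch, assuming the strings really are ISO-8601 formatted (with the literal T and Z), using Trino's from_iso8601_timestamp, which also keeps the fractional seconds:
INSERT INTO hive.MYSCHEMA.NEWTABLE
SELECT column_1, column_2, column_3, column_4, column_5, column_6,
       CAST(from_iso8601_timestamp(query_start_time) AS timestamp) AS query_start_time -- drop the time zone
FROM hive.MYSCHEMA.MYTABLE;
Either way the data is rewritten once into the new location, while the table over the old location keeps serving the raw strings.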
Additional information on why this conversion needs ETL even though we have schema-on-read
Schema-on-read is powerful for big data: it lets you change the data type of a column at read time rather than at write time.
For example, the ID column may be an INT in your file, but you can read it as STRING/VARCHAR if you declare the column as a string in your DDL. Similarly, TIMESTAMP data can be read as DATETIME. This is useful for schema evolution or for reading from multiple sources with different data types.
So why couldn't we use this power in the scenario above?
The same limitation applies to any scenario where you want to process the column, e.g. splitting one string column into two. The reason we have to perform ETL in this case is that in Parquet/Avro a timestamp is not a primitive type: it is stored as a long integer carrying an additional logical_type annotation of timestamp/datetime. Your files store query_start_time as a plain string, which is exactly why Trino's BinaryColumnReader fails with UnsupportedOperationException on ShortTimestampType in the stack trace above.
See the Parquet and Avro documentation on logical types for further clarification.

Hive will take the format below natively, so if you can remove the T and Z I think you should be good to go.
Please give the CREATE TABLE below a try. It is not a Parquet table, but it should work if your timestamp is in the correct string format.
CREATE TABLE mytable (
  id int,
  ts timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ',',
  'timestamp.formats' = 'yyyy-MM-dd HH:mm:ss.SSSSSS'
)
LOCATION 's3://user/'
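If removing the T and Z once at query time is easier than rewriting the files, a hedged sketch with Hive's regexp_replace, reusing the column and table names from the question (the exact pattern is an assumption about your string format):
SELECT CAST(regexp_replace(regexp_replace(query_start_time, 'T', ' '), 'Z$', '') AS timestamp) AS query_start_time
FROM MYTABLE;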

Related

Change Datatype of Values in a Pandas SQL Insert Statement

I am using the Pandas read_sql() function to INSERT data into a SQL table called table_1, pulling data from a primary database and writing it to table_1.
# Creates the table
read_sql(
    f"""
    CREATE TABLE IF NOT EXISTS {table_1} (id varchar, centroid varchar, date int, thresh bigint)
    """
)
I use a loop to process multiple days, calling function_1.
date_format = "%Y%m%d"
dates_to_compute = pd.date_range(start='2022-09-01', end='2022-09-10', freq='D').strftime(date_format)
for date in dates_to_compute:
    print(f"Executing date {date}")
    query = f"""
        INSERT INTO {table_1}
        {function_1(id, centroid, date, thresh)}
    """
    read_sql(query)
Here is the error statement:
DatabaseError:
Insert query has mismatched column types:
Table: [varchar, varchar, integer, bigint],
Query: [bigint, varchar, double, double, array(varchar(9))]
My question is: can I modify the INSERT statement to change the datatypes so that they match those of table_1 created earlier?
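A hedged sketch of what explicit casting could look like in the SELECT that feeds the INSERT. The column names and CAST targets here are illustrative, and note that the error reports five projected columns against a four-column table, so the projection needs trimming as well as casting:
INSERT INTO table_1
SELECT
    CAST(id AS varchar),
    CAST(centroid AS varchar),
    CAST(date AS integer),
    CAST(thresh AS bigint)
FROM source_query;  -- source_query is a placeholder for whatever function_1 selects from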

Spectrum Scan Error while reading from external table (S3 to RS)

I created an external table in Redshift from JSON files stored in S3 buckets.
All the columns are defined as varchar (the source data contains both numbers and strings, but I import everything as varchar to avoid errors).
After creating the table and trying to query it, I got this error:
SQL Error [XX000]: ERROR: Spectrum Scan Error
Detail:
-----------------------------------------------
error: Spectrum Scan Error
code: 15001
context: Error while reading Ion/JSON int value: Numeric overflow.
What am I doing wrong? Why do I get a 'numeric overflow' error if I defined the columns as varchar?
I'm using the following command in order to create the table:
CREATE EXTERNAL TABLE spectrum_schema.example_table(
column_1 varchar,
column_2 varchar,
column_3 varchar,
column_4 varchar
)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://************/files/'
;

How to specify SERDEPROPERTIES and TBLPROPERTIES when creating Hive table via prestosql

I'm trying to follow the Hive connector examples to create a Hive table. I can write HQL to create a table via Beeline, but I wonder how to do it via prestosql.
Given the table
CREATE TABLE hive.web.request_logs (
request_time varchar,
url varchar,
ip varchar,
user_agent varchar,
dt varchar
)
WITH (
format = 'CSV',
partitioned_by = ARRAY['dt'],
external_location = 's3://my-bucket/data/logs/'
)
How to specify SERDEPROPERTIES like separatorChar and quoteChar?
How to specify TBLPROPERTIES like skip.header.line.count?
In Presto you do it like this:
CREATE TABLE table_name( ... columns ... )
WITH (format='CSV', csv_separator='|', skip_header_line_count=1);
You can list all supported table properties in Presto with
SELECT * FROM system.metadata.table_properties;
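For the quoteChar part of the question, the Hive connector also exposes csv_quote and csv_escape table properties (property names as of recent Presto/Trino releases; confirm them with the system.metadata query above). A sketch:
CREATE TABLE table_name( ... columns ... )
WITH (
    format = 'CSV',
    csv_separator = '|',
    csv_quote = '"',
    csv_escape = '\',
    skip_header_line_count = 1
);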

How to load data in partitioned table automatically

I created an external, partitioned table as below:
CREATE EXTERNAL TABLE IF NOT EXISTS dividends (
  ymd STRING,
  dividend FLOAT )
PARTITIONED BY (exchange STRING, symbol STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
I want to load the data in such a way that for each unique partition value a new partition is formed automatically and the data goes into it. Is there any way to do this?
Sample data below
NASDAQ,AMTD,2006-01-25,6.0
NASDAQ,AHGP,2009-11-09,0.44
NASDAQ,AHGP,2009-08-10,0.428
NASDAQ,AHGP,2009-05-11,0.415
NASDAQ,AHGP,2009-02-10,0.403
NASDAQ,AHGP,2008-11-07,0.39
NASDAQ,AHGP,2008-08-08,0.353
NASDAQ,AHGP,2008-05-09,0.288
NASDAQ,AHGP,2008-02-08,0.288
NASDAQ,AHGP,2007-11-07,0.265
NASDAQ,AHGP,2007-08-08,0.265
NASDAQ,AHGP,2007-05-09,0.25
NASDAQ,AHGP,2007-02-07,0.25
NASDAQ,AHGP,2006-11-07,0.215
NASDAQ,AHGP,2006-08-09,0.215
NASDAQ,ALEX,2009-11-03,0.315
NASDAQ,ALEX,2009-08-04,0.315
NASDAQ,ALEX,2009-05-12,0.315
NASDAQ,ALEX,2009-02-11,0.315
NASDAQ,ALEX,2008-11-04,0.315
NASDAQ,AFCE,2005-06-06,12.0
NASDAQ,ASRVP,2009-12-28,0.528
NASDAQ,ASRVP,2009-09-25,0.528
NASDAQ,ASRVP,2009-06-25,0.528
NASDAQ,ASRVP,2009-03-26,0.528
NASDAQ,ASRVP,2008-12-26,0.528
NASDAQ,ASRVP,2008-09-25,0.528
NASDAQ,ASRVP,2008-06-25,0.528
I was searching for this as well. These were my steps: I created a staging table and loaded the CSV file into it, then created and loaded the final table using dynamic partitioning.
CREATE EXTERNAL TABLE stocks ( exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
LOCATION '/user/hduser/stocks';
CREATE EXTERNAL TABLE IF NOT EXISTS dividends_stage (
exchange STRING,
symbol STRING,
ymd STRING,
dividend FLOAT )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hduser/div_stage';
hadoop fs -mv /user/hduser/dividends.csv /user/hduser/div_stage
CREATE EXTERNAL TABLE IF NOT EXISTS dividends (
ymd STRING,
dividend FLOAT )
PARTITIONED BY (exchange STRING, symbol STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
INSERT OVERWRITE TABLE dividends PARTITION (exchange, symbol)
SELECT ymd,dividend, exchange, symbol from dividends_stage;
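Note that a fully dynamic partition insert like the one above normally requires dynamic partitioning to be enabled in nonstrict mode first:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;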
SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE from dividends ;
Hope this helps and it's not too late.

Unable to create table in hive

I am creating a table in Hive like this:
CREATE TABLE SEQUENCE_TABLE(
SEQUENCE_NAME VARCHAR2(225) NOT NULL,
NEXT_VAL NUMBER NOT NULL
);
But the result is a parse exception: it is unable to read VARCHAR2(225) NOT NULL.
Can anyone guide me on how to create a table like the one above, or on another way to achieve it?
There's no VARCHAR2 type or NOT NULL clause in Hive (and older versions have no length-constrained VARCHAR at all). The closest equivalent is:
CREATE TABLE SEQUENCE_TABLE( SEQUENCE_NAME string, NEXT_VAL bigint);
Please read this for CREATE TABLE syntax:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
Anyway, Hive is "SQL-like" but it is not SQL. I wouldn't use it for things such as a sequence table, since you don't have support for transactions, locking, keys and everything else you are familiar with from Oracle (though newer versions do add basic support for transactions, updates, deletes, etc.).
I would consider using a normal OLTP database for whatever you are trying to achieve.
Your only option here is something like:
CREATE TABLE SEQUENCE_TABLE(SEQUENCE_NAME String,NEXT_VAL bigint) row format delimited fields terminated by ',' stored as textfile;
PS: Again, this depends on the type of data you are going to load into Hive.
Use the following syntax:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
And an example of a Hive CREATE TABLE:
CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
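To then load data into it, a typical statement looks like this (the local path is only an illustration):
LOAD DATA LOCAL INPATH '/tmp/employee.tsv' INTO TABLE employee;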