timestamp field in presto parquet table showing bad data - hive

I have an external table created in Hive 0.13.1 using Parquet:
CREATE external table if not exists tbl (
occurred_at timestamp,
user_phone string,
usergk BIGINT,
enabled boolean,
classgk int,
title string,
eta int,
latitude decimal(9,6),
longitude decimal(9,6),
device_type string,
device_os_version string,
event_at_utc timestamp
)
PARTITIONED BY (country string, occured_date date)
STORED AS parquet
LOCATION 's3://XXX'
When I query this table in Hive 0.13 everything looks fine, but when I try a simple query on it in Presto (e.g. select * from tbl limit 10) I get an error:
Can not read Parquet column: [HiveColumnHandle{clientId=hive,
name=occured_date, hiveType=date, hiveColumnIndex=-1, partitionKey=true}]
java.lang.RuntimeException: java.lang.IllegalArgumentException:
Can not read Parquet column: [HiveColumnHandle{clientId=hive, name=occured_date, hiveType=date, hiveColumnIndex=-1, partitionKey=true}]
When I select a specific column (e.g. select occurred_at from tbl limit 10) I get odd results like '14173-10-07 02:42:56', while Hive shows meaningful values.
Could this be related to the Parquet format Hive 0.13 uses?
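For what it's worth, the symptom can be narrowed down with a few targeted Presto queries (these are only illustrative probes against the tbl definition above, not a fix):
select usergk, title from tbl limit 10;              -- plain columns: should read fine
select occurred_at, event_at_utc from tbl limit 10;  -- timestamp columns: show the shifted values like '14173-10-07 02:42:56'
select occured_date from tbl limit 10;               -- the date partition column: should reproduce the "Can not read Parquet column" error
If the timestamp columns alone come back wrong, that points at how Hive 0.13 encodes timestamps in Parquet (historically INT96) rather than at the partition metadata.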

Related

Create partitions using athena alter table statement

This "create table" statement is working correctly.
CREATE EXTERNAL TABLE default.no_details_2018_csv (
`id` string,
`client_id` string,
`client_id2` string,
`id_1` string,
`id_2` string,
`client_id3` string,
`code_1` string,
`code_2` string,
`code_3` string
)
STORED AS PARQUET
LOCATION 's3://some_bucket/athena-parquet/no_details/2018/'
tblproperties ("parquet.compress"="SNAPPY");
The data for the year 2018, in Parquet format, is available in that bucket/folder.
1) How do I add partitions to this table? I need to add the year 2019 data to the same table by pointing to the new location s3://some_bucket/athena-parquet/no_details/2019/. The data for both years is available in Parquet (Snappy) format.
2) Is it possible to partition by month instead of year? In other words, is it OK to have 24 partitions instead of 2? Will the new target table also be in Parquet format, just like the source data? The code_2 column mentioned above looks like "20181013133839"; I need to use the first 4 characters for yearly (or the first 6 for monthly) partitions.
The table first needs to be created as a partitioned EXTERNAL TABLE.
Sample -
CREATE EXTERNAL TABLE default.no_details_table (
`id` string,
`client_id` string,
`client_id2` string,
`id_1` string,
`id_2` string,
`client_id3` string,
`code_1` string,
`code_2` string,
`code_3` string
)
PARTITIONED BY (year string)
STORED AS PARQUET
LOCATION 's3://some_bucket/athena-parquet/no_details/'
tblproperties ("parquet.compress"="SNAPPY");
You can then add a partition with:
ALTER TABLE default.no_details_table ADD PARTITION (year='2018') LOCATION 's3://some_bucket/athena-parquet/no_details/2018/';
If you want more granular partitions, for each month or day, create the table with
PARTITIONED BY (day string)
but then each day's data needs to live under a path like
s3://some_bucket/athena-parquet/no_details/20181013/
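For day- or month-level partitions the ALTER TABLE pattern is the same; here is a hedged sketch (the table names and the monthly folder layout below are hypothetical, assuming the data is organised per day or per month):
ALTER TABLE default.no_details_table_daily ADD PARTITION (day='20181013')
LOCATION 's3://some_bucket/athena-parquet/no_details/20181013/';
-- for 24 monthly partitions across 2018-2019, a month-partitioned variant could use:
ALTER TABLE default.no_details_table_monthly ADD PARTITION (month='201810')
LOCATION 's3://some_bucket/athena-parquet/no_details/201810/';
The partitioned table stays Parquet, since STORED AS PARQUET in the DDL determines the format; partitioning only changes which files Athena reads, not how they are stored.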

Bigquery invalid timestamp error when appending data from one table to another

I am trying to copy data from one table to another that has different partitioning and clustering fields, but I keep getting an invalid-timestamp error. Data in my source table was always written using standard SQL, and I don't run into issues when querying the source table. Did anyone else run into a similar issue?
This is what my tables look like:
Project: sample
Dataset: test
Table Name: table_a
event_id integer,
event_name string,
event_category string,
service_name string,
service_timestamp timestamp,
event_timestamp timestamp
Partitioned by event_timestamp, Clustered By: event_category
Project: sample
Dataset: test
Table Name: table_b
event_id integer,
event_name string,
event_category string,
service_name string,
service_timestamp timestamp,
event_timestamp timestamp
Partitioned by event_timestamp, Clustered By: service_name
I am trying to copy data from table_a to table_b using following command:
bq query --allow_large_results --append_table --use_legacy_sql=false --destination_table 'sample.test.table_b' "select * from \`sample.test.table_a\` where event_timestamp>='2018-01-01'";
Cannot return an invalid timestamp value of 632691030736614000 microseconds relative to the Unix epoch. The range of valid timestamp values is [0001-01-01 00:00:00, 9999-12-31 23:59:59.999999]; error in writing field service_timestamp
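Not part of the original post, but one hedged way to locate the offending rows before the copy, assuming BigQuery can still compare (even if not return) the corrupted values, is to filter on the valid range quoted in the error message:
SELECT COUNT(*) AS out_of_range_rows
FROM `sample.test.table_a`
WHERE event_timestamp >= '2018-01-01'
  AND service_timestamp NOT BETWEEN TIMESTAMP '0001-01-01 00:00:00' AND TIMESTAMP '9999-12-31 23:59:59.999999';
If that isolates the bad rows, the same predicate can be added to the copy query to exclude them.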

Read multiple files in Hive table by date range

Let's imagine I store one file per day in a format:
/path/to/files/2016/07/31.csv
/path/to/files/2016/08/01.csv
/path/to/files/2016/08/02.csv
How can I read the files in a single Hive table for a given date range (for example from 2016-06-04 to 2016-08-03)?
Assuming all files follow the same schema, I would suggest storing them with the following naming convention:
/path/to/files/dt=2016-07-31/data.csv
/path/to/files/dt=2016-08-01/data.csv
/path/to/files/dt=2016-08-02/data.csv
You could then create an external table partitioned by dt and pointing to the location /path/to/files/
CREATE EXTERNAL TABLE yourtable(id int, value int)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/path/to/files/'
If you have several partitions and don't want to write an alter table yourtable add partition ... query for each one, you can simply use the repair command, which will automatically add the partitions:
msck repair table yourtable
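As a quick sanity check (not in the original answer), you can list what the repair discovered:
SHOW PARTITIONS yourtable;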
You can then simply select data within a date range by specifying the partition range
SELECT * FROM yourtable WHERE dt BETWEEN '2016-06-04' and '2016-08-03'
Without moving your files:
Design your table schema. In the Hive shell, create the table (partitioned by date).
Load the files into the table (Hive's "load data" statement).
Query with HiveQL: select * from yourtable where dt between '2016-06-04' and '2016-08-03'
Moving your files:
Design your table schema. In the Hive shell, create the table (partitioned by date).
Move /path/to/files/2016/07/31.csv under /dbname.db/tableName/dt=2016-07-31, and so on, so that you have:
/dbname.db/tableName/dt=2016-07-31/file1.csv
/dbname.db/tableName/dt=2016-08-01/file1.csv
/dbname.db/tableName/dt=2016-08-02/file1.csv
Load the partition with
alter table tableName add partition (dt='2016-07-31');
See the Hive documentation on adding partitions.
In spark-shell, read the Hive table. Suppose the data is stored at
/path/to/data/user_info/dt=2016-07-31/0000-0
1. Create the DDL:
val sql = "CREATE EXTERNAL TABLE `user_info`( `userid` string, `name` string) PARTITIONED BY ( `dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://.../data/user_info'"
2. Run it:
spark.sql(sql)
3. Load a partition:
val rlt = spark.sql("alter table user_info add partition (dt='2016-09-21')")
4. Now you can select data from the table:
val df = spark.sql("select * from user_info")
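As an illustrative check that the new partition is picked up, you can run a query restricted to the dt value just added (via spark.sql or the spark-sql shell):
select userid, name from user_info where dt = '2016-09-21' limit 10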

How to not have partition name when creating partitions via Hive

We have a table (table1) which is partitioned on year, month and day.
I created a table similar to table1, with the same partitions, but stored as ORC. I am trying to insert
data into its partitions using the following statement, but the data ends up dumped in folders with partition names.
How can I make sure the folders don't have the partition names (e.g. year=2015) in them?
create external table table1_orc(
col1 string,
col2 string,
col3 int)
PARTITIONED BY (
`year` string,
`month` string,
`day` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION '/base_path_orc/';
set hive.exec.dynamic.partition=true;
insert overwrite table table1_orc partition(year,month,day) select * from table1 where year = '2015' and month = '10' and day = '01';
Path to table1 in hdfs - /base_path/2015/10/01/data.csv
Path to orc table in hdfs (current output) -/base_path_orc/year=2015/month=10/day=01/000000_0
Desired output - /base_path_orc/2015/10/01/000000_0
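One possible approach (a sketch, not taken from the thread) is to pre-register each partition with an explicit LOCATION and then use a static-partition insert, since Hive's dynamic partition insert always creates key=value directories:
ALTER TABLE table1_orc ADD PARTITION (year='2015', month='10', day='01')
LOCATION '/base_path_orc/2015/10/01/';

INSERT OVERWRITE TABLE table1_orc PARTITION (year='2015', month='10', day='01')
SELECT col1, col2, col3 FROM table1
WHERE year='2015' AND month='10' AND day='01';
Hive then writes the ORC files into the registered location, so the output lands under /base_path_orc/2015/10/01/ instead of /base_path_orc/year=2015/month=10/day=01/, at the cost of one ALTER TABLE and one INSERT per partition.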

Create External Hive Table Pointing to HBase Table

I have a table named "HISTORY" in HBase with the column family "VDS" and the columns ROWKEY, ID, START_TIME, END_TIME, VALUE. I am using the Cloudera Hadoop distribution. I want to provide a SQL interface to the HBase table using Impala. To do this, do I have to create a corresponding external table in Hive? If so, how do I create an external Hive table pointing to this HBase table?
Run the following code in Hive Query Editor:
CREATE EXTERNAL TABLE IF NOT EXISTS HISTORY
(
ROWKEY STRING,
ID STRING,
START_TIME STRING,
END_TIME STRING,
VALUE DOUBLE
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
(
"hbase.columns.mapping" = ":key,VDS:ID,VDS:START_TIME,VDS:END_TIME,VDS:VALUE"
)
TBLPROPERTIES("hbase.table.name" = "HISTORY");
Don't forget to refresh the Impala metadata after creating the external table, using the following bash command:
echo "INVALIDATE METADATA" | impala-shell;