Getting error while retrieving columns on HIVE "TIMESTAMP" column - sql

In Hive, I am trying to create a table on a log file. I have data in the following format:
1000000000012311 1373346000 21.4 XX
1000000020017331 1358488800 16.9 YY
The second field is a Unix timestamp. I am writing the following Hive query:
CREATE EXTERNAL TABLE log(user STRING, tdate TIMESTAMP, spend DOUBLE, state STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n' LOCATION '/user/XXX/YYY/ZZZ';
The table is created, but when I try to get the data from the table with Select * from log limit 10; I get the following error:
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating tdate
I have checked the Hive manual and also googled it, but didn't find a solution.

For an epoch value, you can define the column as BIGINT and then use the built-in UDF from_unixtime() to convert it to a string representing the date, something like select from_unixtime(tdate) from log.
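A minimal sketch of that approach, reusing the table definition from the question but declaring tdate as BIGINT (the path, delimiters and column names are carried over from the question, so treat this as an illustration rather than a tested fix):
CREATE EXTERNAL TABLE log(user STRING, tdate BIGINT, spend DOUBLE, state STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n'
LOCATION '/user/XXX/YYY/ZZZ';

-- convert the epoch seconds to a human-readable timestamp string at query time
SELECT user, from_unixtime(tdate) AS tdate, spend, state FROM log LIMIT 10;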
A similar post at this link: How to create an external Hive table with column typed Timestamp

Hive supports the timestamp datatype, but earlier versions could not accept TIMESTAMP as a datatype when used with JDBC. From Hive 0.8.0 this problem is fixed; you can check out the JIRA ticket that was raised:
https://issues.apache.org/jira/browse/HIVE-2957

Related

Importing CSV file but getting timestamp error

I'm trying to import CSV files into BigQuery, and for any of the hourly reports I attempt to upload it gives the following error:
Error while reading data, error message: Could not parse 4/12/2016 12:00:00 AM as TIMESTAMP for field SleepDay (position 1) starting at location 65 with message Invalid time zone: AM
I get that the format is trying to use AM as a timezone and causing an error, but I'm not sure how best to work around it. All of the hourly entries will have AM or PM after the date-time, and that will be thousands of entries.
I'm using schema autodetect and I believe that's where the issue is coming from, but I'm not sure what to put in the "edit as text" schema option to fix it.
To successfully parse an imported string as a timestamp in BigQuery, the string must be in the ISO 8601 format:
YYYY-MM-DDThh:mm:ss.sss
If your source data is not available in this format, then try the approach below.
1. Import the CSV into a temporary table, providing an explicit schema in which the timestamp fields are strings (a sketch of such a table follows the query below).
2. Select the data from the temporary table, use the BigQuery PARSE_TIMESTAMP function as shown below, and write the result to the permanent table.
INSERT INTO `example_project.example_dataset.permanent_table`
SELECT
  PARSE_TIMESTAMP('%m/%d/%Y %I:%M:%S %p', time_stamp) AS time_stamp,
  value
FROM `example_project.example_dataset.temporary_table`;
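For step 1, a minimal sketch of the temporary table, with the timestamp column kept as a plain STRING so the raw CSV value loads without any parsing (the project, dataset and table names follow the example above, and the value column is an assumed placeholder; load the CSV into it with the console, bq load, or your usual tooling):
CREATE TABLE `example_project.example_dataset.temporary_table` (
  time_stamp STRING,  -- raw text such as "4/12/2016 12:00:00 AM"
  value INT64         -- stand-in for whatever measure columns the report contains
);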

Presto fails to import PARQUET files from S3

I have a Presto table that imports Parquet files from S3 based on partitions, created as follows:
create table hive.data.datadump
(
tUnixEpoch varchar,
tDateTime varchar,
temperature varchar,
series varchar,
sno varchar,
date date
)
WITH (
format = 'PARQUET',
partitioned_by = ARRAY['series','sno','date'],
external_location = 's3a://dev/files');
The S3 folder structure where the parquet files are stored looks like:
s3a://dev/files/series=S5/sno=242=/date=2020-1-23
and the partition starts from series.
The original PySpark code that produces the Parquet files writes the whole schema as string type, and I am trying to import the columns as strings; but when I run my create script in Presto, it successfully creates the table yet fails to read the data.
On Running,
select * from hive.data.datadump;
I get the following error:
[Code: 16777224, SQL State: ] Query failed (#20200123_191741_00077_tpmd5): The column tunixepoch is declared as type string, but the Parquet file declares the column as type DOUBLE
Can you guys help to resolve this issue?
Thank You in advance!
I ran into the same issue and found out that it was caused by one of the records in my source not having a matching datatype for the column it was complaining about. I am sure this is just data; you need to track down the exact record that doesn't have the right type.
This might have been solved already, but just for info: this could be due to a column declaration mismatch between Hive and the Parquet file. To match on column names instead of column order, use the property
hive.parquet.use-column-names=true
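Alternatively, since the error reports that the Parquet files record tunixepoch as DOUBLE, a sketch based on the DDL in the question (not a verified fix) is to declare that column to match the files and cast back to a string at query time if downstream code needs one:
create table hive.data.datadump
(
  tUnixEpoch double,   -- matches the type recorded in the Parquet files
  tDateTime varchar,
  temperature varchar,
  series varchar,
  sno varchar,
  date date
)
WITH (
  format = 'PARQUET',
  partitioned_by = ARRAY['series','sno','date'],
  external_location = 's3a://dev/files');

-- cast on read if a varchar is really required downstream
select cast(tUnixEpoch as varchar) as tUnixEpoch, tDateTime, temperature, series, sno, date
from hive.data.datadump;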

Getting null in some columns in hive due to next line "\n" within records

I have a table with newline characters ("\n") inside my records. When I do a select * on the table, I get null values in the columns that come after the field containing "\n", or sometimes I get multiple rows for a single record.
I get the above problem everywhere: in the terminal, DbVisualizer and Tableau.
The data is stored correctly; the error occurs because Hive is not returning the query result in a proper format. We need to change Hive's query output format by setting the property below:
set hive.query.result.fileformat=SequenceFile;
Its default value was TextFile, which was causing the error.
Default Value:
Hive 0.x, 1.x, and 2.0: TextFile
Hive 2.1 onward: SequenceFile
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
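For example (the table name below is hypothetical; run the set statement in the same session, before the query whose output was being split on the embedded newlines):
-- switch the result file format so embedded "\n" characters in string fields
-- no longer break rows apart in the fetched output
set hive.query.result.fileformat=SequenceFile;
select * from my_table limit 10;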

How to load data to Hive table and make it also accessible in Impala

I have a table in Hive:
CREATE EXTERNAL TABLE sr2015(
creation_date STRING,
status STRING,
first_3_chars_of_postal_code STRING,
intersection_street_1 STRING,
intersection_street_2 STRING,
ward STRING,
service_request_type STRING,
division STRING,
section STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES (
'collection.delim'='\u0002',
'field.delim'=',',
'mapkey.delim'='\u0003',
'serialization.format'=',', 'skip.header.line.count'='1',
'quoteChar'= "\"")
Data is loaded into the table this way:
LOAD DATA INPATH "hdfs:///user/rxie/SR2015.csv" INTO TABLE sr2015;
Why is the table only accessible in Hive? When I attempt to access it in the HUE/Impala editor I get the following error:
AnalysisException: Could not resolve table reference: 'sr2015'
which seems to say there is no such table, but the table does show up in the left panel.
In impala-shell, the error is different, as below:
ERROR: AnalysisException: Failed to load metadata for table: 'sr2015'
CAUSED BY: TableLoadingException: Failed to load metadata for table:
sr2015 CAUSED BY: InvalidStorageDescriptorException: Impala does not
support tables of this type. REASON: SerDe library
'org.apache.hadoop.hive.serde2.OpenCSVSerde' is not supported.
I had always thought that a Hive table and an Impala table are essentially the same thing, the difference being that Impala is a more efficient query engine.
Can anyone help sort it out? Thank you very much.
Assuming that sr2015 is located in a DB called db, in order to make the table visible in Impala you need to issue either
invalidate metadata db;
or
invalidate metadata db.sr2015;
in the Impala shell.
However, in your case the reason is probably the version of Impala you're using, since it doesn't support this table format at all.
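If the SerDe really is the blocker, one common workaround (a sketch with assumed names: db, sr2015 and the sr2015_parquet copy are placeholders) is to copy the data from Hive into a table stored in a format Impala can read, then refresh Impala's metadata:
-- in Hive: materialise the CSV-backed table into Parquet
CREATE TABLE db.sr2015_parquet STORED AS PARQUET AS
SELECT * FROM db.sr2015;

-- in impala-shell: make the new table visible
INVALIDATE METADATA db.sr2015_parquet;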

In PostgreSQL, what data type do you pass to a create table call when dealing with timestamp values?

When creating a table, how do you deal with a timestamp in a CSV file that has the following syntax: MM/DD/YY HH:MI? Here's an example: 1/1/16 19:00
I have tried the following script in PostgreSQL:
create table timetable (
time timestamp
);
copy timetable from '<path>' delimiter ',' CSV;
But, I receive an error message saying:
ERROR: invalid input syntax for type timestamp: "visit_datetime"
Where: COPY air_reserve, line 16, column visit_datetime: "visit_datetime"
One solution I have considered is first creating the timestamp column as char and then running a separate query that converts it to the appropriate timestamp datatype using a function call like to_timestamp(time, 'MM/DD/YY HH24:MI'). But I'm looking for a solution that would load the data with the correct datatype in a single query.
You may find a datestyle that enables you to load the data you have, but sooner or later someone will deliver to you something that doesn't fit.
The solution you have considered is probably the best.
We use this as a standard pattern for loading data warehouses. We take today's data and load it into a staging table, using varchar columns for any data that will not load directly into its target data type. We then run whatever scripts we need to get the data into a good state, raising warnings for anything that is broken in a way we haven't seen before. Then we add the cleaned version of today's data to the table containing cleaned data for all previous days.
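As a concrete illustration of that staging pattern (a minimal sketch: the staging table and column names are assumptions, HH24 is inferred from the 24-hour sample value 1/1/16 19:00, and header is used because the reported error shows a header row being read as data):
create table staging_timetable (
    time_raw varchar
);

-- load the raw text first; nothing is parsed at this stage
copy staging_timetable from '<path>' delimiter ',' csv header;

-- convert while writing into the target table; bad rows surface here
insert into timetable (time)
select to_timestamp(time_raw, 'MM/DD/YY HH24:MI')
from staging_timetable;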
We don't mind if this takes several steps; we put them all in a script and run it as an automated job.
I'm working on documenting the techniques we use. You can see the beginnings of this at http://www.thedatastudio.net.