How i can create a timestamp field in pig from a string that hive accepts as timestamp?
I have formatted the string in pig to match timestamp format in hive, but after loading it is null instead of showing the date.
2014-04-10 09:45:56 this is how the format looks like in pig, and this is matching the format with hive timestamp, but cannot load. (only if i load into string field)
any ideas why?
quick update: no hcatalog is available
problem is some case the timestamp fields contains null values and all the filed become null when using timestamp data type. When putting timestamp to a column where all the row is in the above format it works fine. So the real question is how null values can be handle
I suspect you have written your data to HDFS using PigStorage and you want to load it into a Hive table. The problem is that a missing tuple field will be written by Pig as null which will be treated by Hive 0.11 as null. So far so good.
But then all the subsequent fields will be treated as null, however they can have different values. Hive 0.12 doesn't have this issue.
Depending on the SerDe type, Hive can interpret different strings as null. In case of LazySimpleSerDe it is \N.
You have two option:
set the table's null format property to the empty string which is produced by Pig
or store \N in Pig for null fields
E.g:
Given the following data in Pig 0.11 :
A = load 'data' as (txt:chararray, ts:chararray);
dump A;
(a,2014-04-10 09:45:56)
(b,2014-04-11 10:45:56)
(,)
(e,2014-04-12 11:45:56)
Option 1:
store A into '/user/data';
Hive 0.11 :
CREATE EXTERNAL TABLE test (txt string, tms TimeStamp)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/data';
alter table test SET SERDEPROPERTIES('serialization.null.format' = '');
Option 2:
...
B = foreach A generate txt, (ts is null?'\\N':ts);
store B into '/user/data';
Then create the table in Hive without setting the serde property.
Related
I'm using python pandas to write a DataFrame to parquet in GCS, then using Bigquery Transfer Service to transfer the GCS parquet file to a Bigquery table. Sometimes when the DataFrame is small, an entire column might have NULL values. When this occurs, Bigquery treats that null value column as an INTEGER type instead of what the parquet claims it to be.
When trying to append it to an existing table that expects that column to be NULLABLE STRING, Big Query Transfer Service will fail with INVALID_ARGUMENT: Provided Schema does not match Table project.dataset.dataset_health_reports. Field asin has changed type from STRING to INTEGER; JobID: xxx
When I use BQDTS to write the parquet to a new table, it can create the table, but the null column becomes an Integer type.
Any idea how to make BQDTS respect the original type or to manually specify types?
to remedy this issue you can pre-define the schema for columns which can be ambigous. For example I want the street_address_two column to be string then I can define the schema argument in LoadJobConfig as:
[bigquery.SchemaField("street_address_two", "STRING")].
The code will look like:
job_config = bigquery.LoadJobConfig(
schema=[
bigquery.SchemaField("street_address_two", "STRING")
],
source_format=bigquery.SourceFormat.PARQUET,
)
I have a DataStage parallel job that writes to Hive as the final stage in a long job. I can view the data that is about to be written and there are many NULL strings that I want to see in the Hive table.
However, when I view the table that is created, there are no NULL strings, they all get converted into empty strings '' instead. I can see other datatypes, like DECIMAL(5,0) have NULL values and I can select these, e.g.
SELECT * FROM mytable WHERE decimal_column IS NULL;
The process for writing to Hive is to store the data in a staging table in a delimited text format. This is then pushed through a generic CDC process and results in data being written to a new partition in an ORC format table.
The only option I can see for handling NULL values is "Null Value" in the HDFS File Connector Stage. If I leave this blank then I get empty strings and if I type in 'NULL' then 'NULL' is what I get, i.e. not a NULL, but the string 'NULL'.
I can't change the process as it's in place for literally thousands of jobs already. Is there any way to get my string values to be NULL or am I stuck with empty strings?
According to the IBM documentation, an empty String in double-quotation "" should help.
Null value
Specify the character or string that represents null values in the data. For a source stage, input data that has the value
that you specify is set to null on the output link. For a target
stage, in the output file that is written to the file system, null
values are represented by the value that is specified for this
property. To specify that an empty string represents a null value,
specify "" (two double quotation marks).
Source: https://www.ibm.com/docs/en/iis/11.7?topic=reference-properties-file-connector
I've have a CSV files which contain date and timestamp values in the below formats. Eg:
Col1|col2
01JAN2019|01JAN2019:17:34:41
But when I define Col1 as Date and Col2 as Timestamp in my create statement, the Hive tables simply returns NULL when I query.
CREATE EXTERNAL TABLE IF NOT EXISTS my_schema.my_table
(Col1 date,
Col2 timestamp)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘|’
STORED AS TEXTFILE
LOCATION 'my_path';
Instead, if I define the data types as simply string then it works. But that's not how I want my tables to be.
I want the table to be able to read the incoming data in correct type. How can I achieve this? Is it possible to define the expected data format of the incoming data with the CREATE statement itself?
Can someone please help?
As of Hive 1.2.0 it is possible to provide additional SerDe property "timestamp.formats". See this Jira for more details: HIVE-9298
ALTER TABLE timestamp_formats SET SERDEPROPERTIES ("timestamp.formats"="ddMMMyyyy:HH:mm:ss");
I create and import hive table with sqoop and use pyspark to get data. The table is composed by one string field, one int field and several float field. I can get the whole data by hue hive sql query. But while I program with pyspark sql the non-float field can be displayed and the float fields always show null value.
HUE hive sql results:
zeppelin pyspark output:
The details of hive table:
I finally found the cause. since I import these tables from mysql via sqoop. the original table columns are uppercase and in hive they were converted to all lowercase automatically. it caused all converted fields value can not be retrieved by sparksql. (but HUE hive queries these data normally, It might be a bug of spark.) I have to convert uppercase field names to lower case by specify the option --query in sqoop command. i.e. --query 'select MMM as mmm from table...'
I need to create external table for a hdfs location. The data is having null instead of empty space for few fields. If the field length is less than 4 for such fields, it is throwing error when selecting data. Is there a way to define replacement of all such nulls with empty space while creating table it self.?
I am trying it in greenplum, just tagged hive to see what can be done for such cases in hive.
You could use the serialization property for mapping NULL string to empty string.
CREATE TABLE IF NOT EXISTS abc ( ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE TBLPROPERTIES ("serialization.null.format"="")
In this case when you query it from hive you would get empty value for that field and hdfs would have "\N".
Or
If you want to represented empty string instead of '\N', you can using COALESCE function:
INSERT OVERWRITE tabname SELECT NULL, COALESCE(NULL,"") FROM data_table;
the answer to the problem is using NULL as 'null' statement in create table syntax for greenplum. As i have mentioned, i wanted to get few inputs from people who faced such issues in hive. so i have tagged hive as well. But, greenplum external table syntax supports NULL AS phrase in which we can specify the form of NULL that you want to keep.