I have an inconsistent log file which I would like to partition with Hive using dynamic partitioning. File example:
20/06/13 20:21:42.637 FLW CPTView::OnInitialUpdate nRemoveAppShareQSize0=50000\n
20/06/13 20:21:42.638 FLW \n
BandwidthGlobalSettings:Old Bandwidth common defines\n
Sometimes the log file contains a line that starts with some word other than a date. Each line is delimited with \n.
I am running these commands:
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages_temp (date STRING, time STRING, severity STRING, message STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\040'
LOCATION '/examples/hive/tmp';
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages_partitioned (time STRING, severity STRING, message STRING)
PARTITIONED BY (date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\040'
LOCATION '/examples/hive/partitions';
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
FROM log_messages_temp pvs INSERT OVERWRITE TABLE log_messages_partitioned PARTITION(date) SELECT pvs.time, pvs.severity, pvs.message, pvs.date;
As a result, two dynamic partitions were created: date=20/06/13 and date=BandwidthGlobalSettings:Old
I would like to tell Hive to ignore lines that do not start with a date string.
How can I do this? Or is there another solution?
Thanks.
I think you can write a UDF that uses a regular expression to accept only the date format (e.g. 20/06/13) and discard everything else, like "BandwidthGlobalSettings:Old". You can use this UDF in your last query while inserting into the final table.
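If you would rather not write a custom UDF, Hive's built-in RLIKE operator can apply the same regular-expression filter directly in the insert query. A minimal sketch, reusing the tables from the question:

FROM log_messages_temp pvs
INSERT OVERWRITE TABLE log_messages_partitioned PARTITION(date)
SELECT pvs.time, pvs.severity, pvs.message, pvs.date
-- keep only rows whose first field matches the yy/MM/dd date pattern
WHERE pvs.date RLIKE '^\\d{2}/\\d{2}/\\d{2}$';

Rows such as the BandwidthGlobalSettings line fail the pattern and are dropped, so only the date=20/06/13-style partitions get created.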
I hope this explanation helps.
I have created a table employee_orc in ORC format with Snappy compression.
create table employee_orc(emp_id string, name string)
row format delimited fields terminated by '\t' stored as orc tblproperties("orc.compress"="SNAPPY");
I have uploaded data into the table using an INSERT statement.
The employee_orc table has 1000 records.
When I run the query below, it shows all the records:
select * from employee_orc;
But when I run the query below, it shows zero results even though the records exist:
select * from employee_orc where emp_id = "EMP456";
Why am I unable to retrieve a single record from the employee_orc table?
The record does not exist. You may think they are the same because they look the same, but there is some difference. One possibility is spaces at the beginning or end of the string. For this, you can use like:
where emp_id like '%EMP456%'
This might help you.
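If whitespace does turn out to be the culprit, a couple of built-in Hive functions can confirm it. A minimal sketch against the table from the question:

-- compare after stripping leading/trailing spaces
SELECT * FROM employee_orc WHERE trim(emp_id) = 'EMP456';

-- wrap values in brackets so hidden whitespace becomes visible
SELECT concat('[', emp_id, ']') FROM employee_orc LIMIT 10;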
On my part, I don't understand why you want to specify a delimiter in ORC. Are you confusing CSV with ORC, or external with managed tables?
I advise you to create your table differently:
create table employee_orc(emp_id string, name string)
stored as ORC
TBLPROPERTIES (
"orc.compress"="ZLIB");
I have CSV files which contain date and timestamp values in the formats below. E.g.:
Col1|col2
01JAN2019|01JAN2019:17:34:41
But when I define Col1 as date and Col2 as timestamp in my CREATE statement, the Hive table simply returns NULL when I query it.
CREATE EXTERNAL TABLE IF NOT EXISTS my_schema.my_table
(Col1 date,
Col2 timestamp)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'my_path';
Instead, if I define the data types simply as string, then it works. But that's not how I want my tables to be.
I want the table to be able to read the incoming data in correct type. How can I achieve this? Is it possible to define the expected data format of the incoming data with the CREATE statement itself?
Can someone please help?
As of Hive 1.2.0 it is possible to provide additional SerDe property "timestamp.formats". See this Jira for more details: HIVE-9298
ALTER TABLE timestamp_formats SET SERDEPROPERTIES ("timestamp.formats"="ddMMMyyyy:HH:mm:ss");
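For a new table, the same property can be set at creation time by naming the SerDe explicitly. A sketch based on the question's table; note that timestamp.formats only helps the timestamp column, so Col1 may still need to be read as a string and converted with to_date() in queries:

CREATE EXTERNAL TABLE IF NOT EXISTS my_schema.my_table
(Col1 string,    -- '01JAN2019'; convert on read
 Col2 timestamp) -- parsed via timestamp.formats below
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ("field.delim"="|", "timestamp.formats"="ddMMMyyyy:HH:mm:ss")
STORED AS TEXTFILE
LOCATION 'my_path';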
I've exported a CSV file from Excel which has dates in the format ddmmyyyy hmm. I'm using the COPY command to import it into a table in PostgreSQL.
Since I only want to retain the date part I tried the date data type:
CREATE TABLE Public."ride_details"(ride_id int, created_at date);
COPY Public."ride_details" FROM '/tmp/ride_details.csv' DELIMITER ',' csv HEADER;
But that resulted in:
ERROR: date/time field value out of range: "26/07/19 5:48"
HINT: Perhaps you need a different "datestyle" setting.
CONTEXT: COPY ride_details, line 2, column created_at: "26/07/19 5:48"
SQL state: 22008
Do I need to specify a different data type or how to make this work?
COPY is rather unforgiving with invalid input. (This way it can be fast and reliable.)
It may be enough to set a matching datestyle setting:
SET datestyle = 'ISO,DMY'; -- DMY being the relevant part
... and retry. (Sets the setting for your session only.) Related:
Importing .csv with timestamp column (dd.mm.yyyy hh.mm.ss) using psql \copy
The info in your question is not entirely clear; you may have to do more:
Copy to a temporary "staging" table with a text column, and INSERT into the actual target table from there using to_date() - with a custom pattern specifying your non-standard date format:
CREATE TABLE public.ride_details(ride_id int,created_at date); -- target table
CREATE TABLE pg_temp.step1(ride_id int, created_at text); -- temporary staging table
COPY pg_temp.step1 FROM ...;
INSERT INTO public.ride_details(ride_id, created_at)
SELECT ride_id, to_date(created_at, 'DD/MM/YY') -- or whatever
FROM pg_temp.step1;
to_date() ignores dangling characters after the given pattern, so we do not have to deal with your odd hmm specification (hh?).
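A quick check in psql with the value from the error message:

SELECT to_date('26/07/19 5:48', 'DD/MM/YY');  -- 2019-07-26, trailing time ignored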
I went with the YY format displayed in the error message, not the yyyy you claim at the top. Either way, the input must be in a consistent format, or you need to do more, yet ...
All in a single DB session, since that's the scope of temp tables. The temp table is not persisted and dies automatically at the end of the session. I use it for performance reasons.
Else you need a plain table as a stepping stone, which is persisted across sessions and can be deleted after having served its purpose.
Related:
How to ignore errors with psql \copy meta-command
How to update selected rows with values from a CSV file in Postgres?
I'm using Athena to query data from multiple files partitioned on S3. I create a table like this:
CREATE EXTERNAL TABLE IF NOT EXISTS testing_table (
EventTime string,
IpAddress string,
Publisher string,
Segmentname string,
PlayDuration double,
cost double )
PARTITIONED BY (
year string,
month string,
day string )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n'
LOCATION 's3://campaigns/testing/';
My location may contain multiple files with different filenames, such as "campaign_au_click.csv" and "campaign_au_impression.csv". These files may have different structures.
Is there any way for the above table to read data only from the click files?
Thanks
Your best bet is to partition them into different folders. Athena, like Hive, works at the folder level - any and all files in a folder will be taken in as having the same schema.
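A sketch of that layout (the clicks/ and impressions/ prefixes are hypothetical): keep each file type under its own prefix and point the table at the click prefix only:

s3://campaigns/testing/clicks/campaign_au_click.csv
s3://campaigns/testing/impressions/campaign_au_impression.csv

CREATE EXTERNAL TABLE IF NOT EXISTS testing_table ( ... )
...
LOCATION 's3://campaigns/testing/clicks/';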
The very first option should be to have those files in different folders. But given that we have this situation right now and want to query the table for specific files, there is a workaround.
You create your table over the root folder only, but while querying you add a WHERE clause on the filename. The filename is exposed through the pseudo-column "$path" (including the quotes).
For example, your query can be:
SELECT .....
FROM .....
WHERE .....
AND "$path" LIKE '%_click.csv'
Note: the WHERE clause provided is just an example. You can explore regexp_like instead of LIKE.
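A minimal sketch with regexp_like, anchoring the pattern on the file suffix (the columns are from the question's table):

SELECT EventTime, Publisher, cost
FROM testing_table
WHERE regexp_like("$path", '_click\.csv$');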
I need to create an external table for an HDFS location. The data has null instead of an empty space for a few fields. If the field length is less than 4 for such fields, it throws an error when selecting data. Is there a way to define a replacement of all such nulls with an empty space while creating the table itself?
I am trying this in Greenplum; I just tagged hive to see what can be done for such cases in Hive.
You could use the serialization property for mapping NULL string to empty string.
CREATE TABLE IF NOT EXISTS abc ( ... )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
TBLPROPERTIES ("serialization.null.format"="");
In this case, when you query it from Hive you would get an empty value for that field, and HDFS would have "\N".
Or, if you want an empty string represented instead of '\N', you can use the COALESCE function:
INSERT OVERWRITE TABLE tabname SELECT NULL, COALESCE(NULL,"") FROM data_table;
The answer to the problem is using the NULL AS 'null' clause in the CREATE TABLE syntax for Greenplum. As I mentioned, I wanted to get a few inputs from people who have faced such issues in Hive, so I tagged hive as well. Greenplum's external table syntax supports the NULL AS phrase, with which you can specify the form of NULL that you want to keep.
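A minimal sketch of that clause in Greenplum (the host, file, and columns are hypothetical placeholders):

CREATE EXTERNAL TABLE abc_ext (emp_id text, name text)
LOCATION ('gpfdist://filehost:8081/data.txt')
FORMAT 'TEXT' (DELIMITER '|' NULL AS 'null');
-- the literal string null in the data file is now read in as SQL NULL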