Insert overwrite doesn't delete all the old data files - Hive

We are trying to INSERT OVERWRITE a Hive table. Most of the time it overwrites as expected, i.e. it deletes the old files and writes new ones. But we are seeing intermittent inconsistencies: once in a while some of the old files are not deleted even though the new files are created, which leaves the table in an inconsistent state.
I am not able to reproduce this behavior on demand. Has anyone faced a similar issue, or does anyone have any pointers on what might be happening?
We are using Hive version 2.1.1.
Below are the ORC table structure and the INSERT OVERWRITE command. fileid is the unique key column in the table, and the table is around 500 GB.
Hive table structure:
CREATE EXTERNAL TABLE `tier0.file`(
`filegroup` struct<collection:struct<name:string,code:string,royaltystate:string,enterprisecollectionid:bigint,isactive:boolean,active:boolean,filefamily:string,contentfamily:string,cfwcollectionname:string,droplocation:string,applyembeddestinationsite:boolean,associatedsource:string,excluderestriction:boolean,ownershiptype:string,collectionid:bigint,notes:string,bundlerestrictions:array<struct<bundleid:bigint,bundletype:string>>,pricecodes:array<struct<collectioncode:string,pricecode:string,iptccategory:string>>>,istockcollection:string,events:array<string>,paidassignmentids:array<string>,sisterfiles:array<string>,clonedfiles:array<string>,vcd:array<string>,source:struct<parentsource:string,parentsourceid:bigint,childsource:string,childsourceid:bigint>>,
`filemanagement` struct<filemanagement:string,destinationsites:array<string>,readyforsale:boolean,readyforpublish:boolean,reviewstatus:string,excludedestinationsites:array<string>,displaystatus:string,inactivedate:string,pulledreason:string,pulledreasonaudit:string,approvaldate:string,futurepulledreason:string,futureinactivedate:string,futureactivedate:string>,
`primarylanguage` string,
`audithistory` struct<note:string,notecategory:string>,
`contents` array<struct<deliverylocation:string,contenttype:string,submission:array<struct<data:struct<mimetype:string,fileinfo:struct<filelocation:string,filesize:bigint,filename:string,checksum:string,checksumtype:string>,submitdate:string,createdate:string,mediaformat:string,offlinehd:boolean,postertime:double,shoottype:string,stripaudio:boolean,timein:string,timeout:string,videoencoding:struct<compression:string,bitdepth:string,bitrate:double,definition:string,framerate:string,framesize:string,scantype:string,wrapper:string,height:int,width:int,interlaced:boolean>,rotation:string,anamorphic:boolean,pixelwidth:int,pixelheight:int,colorprofile:string,samplesperpixel:string,resolution:string,resolutionunit:string,colormode:string,animated:boolean,imageorientation:string,filmformat:string,duration:string,artistname:string,directlicense:boolean,lyrichook:string,albumtitle:string,parenttrackid:string,key:string,timesignature:string,publicdomain:string,lyrics:string,tracktitle:string,tracktype:string,speed:string,genre:string,mood:string,lyricpov:string,instrument:string,vocal:string,transformedmetadata:map<string,string>,iptc:map<string,string>,exif:map<string,string>,xmp:map<string,string>,xmpraw:map<string,string>>,sizeid:int,sizename:string,keyname:string,schemauri:string,extension:string,fileindex:int,suffix:string,readonly:boolean,ismaster:boolean>>,filepack:array<struct<data:struct<mimetype:string,fileinfo:struct<filelocation:string,filesize:bigint,filename:string,checksum:string,checksumtype:string>,submitdate:string,createdate:string,mediaformat:string,offlinehd:boolean,postertime:double,shoottype:string,stripaudio:boolean,timein:string,timeout:string,videoencoding:struct<compression:string,bitdepth:string,bitrate:double,definition:string,framerate:string,framesize:string,scantype:string,wrapper:string,height:int,width:int,interlaced:boolean>,rotation:string,anamorphic:boolean,pixelwidth:int,pixelheight:int,colorprofile:string,samplesperpixel:string,resolution:string,resolutionunit:string,colormode:string,animated:boolean,imageorientation:string,filmformat:string,duration:string,artistname:string,directlicense:boolean,lyrichook:string,albumtitle:string,parenttrackid:string,key:string,timesignature:string,publicdomain:string,lyrics:string,tracktitle:string,tracktype:string,speed:string,genre:string,mood:string,lyricpov:string,instrument:string,vocal:string,transformedmetadata:map<string,string>,iptc:map<string,string>,exif:map<string,string>,xmp:map<string,string>,xmpraw:map<string,string>>,sizeid:int,sizename:string,keyname:string,schemauri:string,extension:string,fileindex:int,suffix:string,readonly:boolean,ismaster:boolean>>,createdate:string,camerashotdate:string,updatedate:string,audithistory:array<struct<note:string,notecategory:string>>,contract:struct<parentsource:string,contractid:bigint,contentprovidername:string,contentprovidertitle:string,vendornumber:bigint,childsource:string,parentsourceid:bigint,childsourceid:bigint,istockusername:string,istockuserid:bigint,iptccredit:string,signatorycontentprovidername:string,signatoryguid:string,startdate:string,enddate:string>,release:struct<releaseid:string,releaseinformation:string,releasemetadata:array<struct<releasemetadataid:string,aliasid:string,releasetype:string,filelocation:string,name:string,agerange:string,age:string,birthdate:string,gender:string,ethnicity:string,ethnicities:array<string>,talentid:array<string>,usage:array<string>,teamsreleaseid:string>>>,contentmanagement:struct<state:string,notes:string,messages:array<string>>,contentsource:struct<clientsystemid:string,submittedby:string,ingestionproviderid:int,submissionnotes:string,clientlastmodifieddate:string>,alternateids:array<struct<alternateid:string,alternateidtype:string>>,homeproperty:string,mediatype:int,colorpalettes:struct<rgbmodel:array<struct<red:int,green:int,blue:int,presence:string,x:string,y:string,density:string>>>,transcript:string,hasaudio:boolean,visualcolor:string,era:string,cliptype:string,productiontitle:string,footagespeed:string>>,
`submitdate` string,
`licensecharacteristics` struct<filefamily:string,restrictioninstructions:string,riskcategory:string,advancedroyaltybearing:boolean,pricingcode:string,callforimage:boolean,exclusivecontent:boolean,subscriptioneligible:boolean,publicistapprovalrequired:boolean,whollyowned:boolean,royaltybearing:string,bundletags:array<string>,paidassignment:boolean,preferredlicensemodel:string,exclusivity:string,parentbundlecollection:string,restrictions:array<struct<id:string,beginningdate:string,enddate:string,controlledrestrictions:array<string>>>>,
`fileid` string,
`updatedate` string,
`version` int,
`exclusionrouting` array<string>,
`inclusionrouting` array<string>,
`errors` map<string,array<struct<errorcode:string,message:string>>>,
`dp_schema` string,
`dp_source` string,
`dp_source_type` string,
`dp_proc_time` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3a://bucket/tier0/file/'
Insert overwrite command:
insert overwrite table stg.tier0_file
SELECT
filegroup,
filemanagement,
primarylanguage,
audithistory,
contents,
submitdate,
licensecharacteristics,
fileid,
updatedate,
version,
errors,
dp_schema,
dp_source,
dp_source_type,
dp_proc_time
FROM (
SELECT
filegroup,
filemanagement,
primarylanguage,
audithistory,
contents,
submitdate,
licensecharacteristics,
fileid,
updatedate,
version,
errors,
dp_schema,
dp_source,
dp_source_type,
dp_proc_time,
ROW_NUMBER() OVER(PARTITION BY fileid ORDER BY version DESC, dp_proc_time DESC) AS rownum
FROM
( SELECT
filegroup,filemanagement,primarylanguage,audithistory,contents,submitdate,licensecharacteristics,fileid,updatedate,version,errors,dp_schema,dp_source,dp_source_type,dp_proc_time
FROM tier0.file
UNION ALL
SELECT
filegroup,filemanagement,primarylanguage,audithistory,contents,submitdate,licensecharacteristics,fileid,updatedate,version,errors,dp_schema,dp_source,dp_source_type,dp_proc_time
FROM stg.file
) base ) rnk
where rnk.rownum = 1;
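One defensive workaround worth trying (an assumption, not a confirmed diagnosis: the overwrite's delete phase depends on listing the target directory, and a missed listing on s3a is a known suspect) is to clear the target location explicitly before the insert and validate afterwards. A minimal sketch; the path below is a placeholder, substitute the actual LOCATION of stg.tier0_file:
-- Run in the Hive CLI / Beeline before the INSERT OVERWRITE.
-- Placeholder path: substitute the real LOCATION of stg.tier0_file.
dfs -rm -r -f s3a://bucket/stg/tier0_file/*;
-- ... run the INSERT OVERWRITE above ...
-- Validation: fileid is the unique key, so these two counts should match
-- if no stale files survived the overwrite.
SELECT count(DISTINCT fileid) AS unique_ids, count(*) AS total_rows
FROM stg.tier0_file;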

Related

How to partition by a transformed column in Hive?

I want to use a year-month value as the partition in my table, but there is no such column in the table.
Is it possible to partition by a computed field? For example, I tried the following:
INSERT OVERWRITE table zhihu_answer partition (ym)
SELECT
answer_id,
answer_updated,
author_headline,
author_id,
author_name,
question_created,
question_id,
question_title,
question_type,
voteup_count,
date_format(insert_time,'yyyyMM') as ym
FROM zhihu_answer;
But it failed with:
Error while compiling statement: FAILED: ValidationFailureSemanticException table is not partitioned but partition spec exists: {ym=null}
DDL:
CREATE TABLE `zhihu_answer`(
`answer_id` string,
`answer_updated` string,
`author_headline` string,
`author_id` string,
`author_name` string,
`insert_time` string,
`question_created` string,
`question_id` string,
`question_title` string,
`question_type` string,
`voteup_count` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://device1:8020/user/hive/warehouse/zhihu.db/zhihu_answer'
TBLPROPERTIES (
'transient_lastDdlTime'='1569629962')
Thanks for your help.
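Note: the error means exactly what it says, the target table has no partition columns, so a partition spec cannot be used. A sketch of one fix, assuming the table is recreated with ym declared as a partition column and dynamic partitioning enabled (zhihu_answer_part is a hypothetical name for the new table):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- hypothetical repartitioned copy of the table
CREATE TABLE zhihu_answer_part(
`answer_id` string,
`answer_updated` string,
`author_headline` string,
`author_id` string,
`author_name` string,
`insert_time` string,
`question_created` string,
`question_id` string,
`question_title` string,
`question_type` string,
`voteup_count` int)
PARTITIONED BY (`ym` string)
STORED AS PARQUET;
-- the dynamic partition column must come last in the SELECT,
-- exactly as in the query above
INSERT OVERWRITE TABLE zhihu_answer_part PARTITION (ym)
SELECT answer_id, answer_updated, author_headline, author_id, author_name,
insert_time, question_created, question_id, question_title, question_type,
voteup_count, date_format(insert_time,'yyyyMM') AS ym
FROM zhihu_answer;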

HIVE: insert overwrite Parquet table error

I am just running a simple query like this, but an exception appears:
insert overwrite table stage_dfqp.user_currency partition (dt='2018-05-16')
select fuid,
fbpid,
fgamefsk
from stage_dfqp.pb_gamecoins
(The exception was shown in a screenshot, which is not reproduced here.)
But when I change the query like this (just adding limit XXX), the exception disappears:
insert overwrite table stage_dfqp.user_currency partition (dt='2018-05-16')
select fuid,
fbpid,
fgamefsk
from stage_dfqp.pb_gamecoins limit 100
Hive table info:
CREATE TABLE `stage_dfqp.user_currency`(
`fuid` bigint,
`coin_type` string,
`coin_num` bigint
)
PARTITIONED BY (
`dt` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
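One thing worth ruling out, since the exception text is not shown (this is an assumption, not a confirmed diagnosis): Hive maps INSERT ... SELECT columns to the target table by position, not by name, so fbpid and fgamefsk land in coin_type (string) and coin_num (bigint). Explicit casts make that mapping unambiguous; a sketch:
-- sketch: align the selected columns with the target schema by position
insert overwrite table stage_dfqp.user_currency partition (dt='2018-05-16')
select fuid,
cast(fbpid as string) as coin_type,
cast(fgamefsk as bigint) as coin_num
from stage_dfqp.pb_gamecoins;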

Impala - Handle special characters on partition column

I am currently working on a job that copies data from a staging table to the final table. The column in the staging table that is used as the partition column on the final table has multiple records containing single quotes (e.g. supplies'A, demand'A). Because of this, the Impala INSERT OVERWRITE statement fails with the following message:
Query: insert OVERWRITE rec_details (rec_id, rec_name, rec_value) PARTITION (rec_part)
SELECT rec_id, rec_name, rec_value, rec_name FROM staging_rec_details
Query submitted at: 2017-06-12 03:23:22 (Coordinator: http://hostname:port)
Query progress can be monitored at: http://hostname:port/query_plan?query_id=ea4e14229d1c0119:a839f51500000000
WARNINGS: TableLoadingException: Failed to load metadata for table: rec_details
CAUSED BY: IllegalStateException: Invalid partition name: rec_part=-supplies'A
DDL Statements are as follows:
--DDL 1 - Staging Table
CREATE EXTERNAL TABLE staging_rec_details(
rec_id STRING,
rec_name STRING,
rec_value STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\007'
LINES TERMINATED BY '\001'
--WITH SERDEPROPERTIES ('serialization.format'='\t', 'field.delim'='\t')
STORED AS TEXTFILE
LOCATION '/staging/staging_rec_details'
--DDL 2 - Final Table
CREATE EXTERNAL TABLE rec_details(
rec_id STRING,
rec_name STRING,
rec_value STRING
)
PARTITIONED BY (rec_part STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\007'
LINES TERMINATED BY '\001'
--WITH SERDEPROPERTIES ('serialization.format'='\t', 'field.delim'='\t')
STORED AS PARQUET
LOCATION '/data/rec_details'
Following is the Impala statement used for inserting records:
--Impala SQL
INSERT OVERWRITE rec_details
(
rec_id, rec_name, rec_value
)
PARTITION (rec_part)
SELECT
rec_id, rec_name, rec_value, rec_name
FROM staging_rec_details
How can I insert data into the final table when the partition column has a special character like a single quote?
The issue was resolved by stripping the special character (partition values end up in directory and metadata names, which is why a character like a single quote makes the partition name invalid):
-- Modified Impala SQL
INSERT OVERWRITE rec_details
(
rec_id, rec_name, rec_value
) PARTITION (rec_part)
SELECT
rec_id, rec_name, rec_value,
regexp_replace(rec_name,'\'','')
FROM staging_rec_details

How do you add Data to an Existing Hive Metastore?

I have multiple subdirectories in S3 that contain .orc files. I'm trying to create a Hive metastore table so I can query the data with Presto / Hive, etc. The data is poorly structured (no consistent delimiter, ugly characters, etc.). Here's a scrubbed sample:
1488736466 199.199.199.199 0_b.www.sphericalcow.com.f9b1.qk-g6m6z24tdr.v4.url.name.com TXT IN: NXDOMAIN/0/143
1488736466 6.6.5.4 0.3399.186472.4306.6668.638.cb5a.names-things.update.url.name.com TXT IN: NOERROR/3/306 0\009253\009http://az.blargi.ng/%D3%AB%EF%BF%BD%EF%BF%BD/\009 0\009253\009http://casinoroyal.online/\009 0\009253\009http://d2njbfxlilvpsq.cloudfront.net/b_zq_ym_bangvideo/bangvideo0826.apk\009
I was able to create a table pointing to one of the subdirectories using a regex SerDe, and the fields are parsing properly, but as far as I can tell I can only load one subfolder at a time.
How does one add more data to an existing Hive metastore table?
Here's an example of my create statement with the regex SerDe bit:
DROP TABLE IF EXISTS test;
CREATE EXTERNAL TABLE test (field1 string, field2 string, field3 string, field4 string)
COMMENT 'fill all the tables with the datas.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]{10}) ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) (\\S*) (.*)",
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
STORED AS ORC
LOCATION 's3://path/to/one/of/10/folders/'
tblproperties ("orc.compress" = "SNAPPY", "skip.header.line.count"="2");
select * from test limit 10;
I realize there is probably a very simple solution, but I tried INSERT INTO in place of CREATE EXTERNAL TABLE and it understandably complains about the input, and I looked in both the Hive and SerDe documentation for help but was unable to find a reference to adding to an existing store.
A possible solution using partitions:
CREATE EXTERNAL TABLE test (field1 string, field2 string, field3 string, field4 string)
partitioned by (mypartcol string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]{10}) ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) (\\S*) (.*)"
)
LOCATION 's3://whatever/as/long/as/it/is/empty'
tblproperties ("skip.header.line.count"="2");
alter table test add partition (mypartcol='folder 1') location 's3://path/to/1st/of/10/folders/';
alter table test add partition (mypartcol='folder 2') location 's3://path/to/2nd/of/10/folders/';
.
.
.
alter table test add partition (mypartcol='folder 10') location 's3://path/to/10th/of/10/folders/';
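If moving the data is an option, a sketch of an alternative: lay the folders out under the table LOCATION using Hive's key=value naming convention, and the metastore can discover every partition in one step instead of one ALTER TABLE per folder.
-- assumes the folders have been rearranged under the table LOCATION as
-- .../mypartcol=folder1/ .../mypartcol=folder2/ ... .../mypartcol=folder10/
MSCK REPAIR TABLE test;
SHOW PARTITIONS test; -- verify that all ten partitions were registered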
For #TheProletariat (the OP):
It seems there is no need for the RegexSerDe, since the columns are delimited by a space (' ').
Note the use of tblproperties ("serialization.last.column.takes.rest"="true"), which makes the last column absorb the rest of each line, spaces included.
create external table test
(
field1 bigint
,field2 string
,field3 string
,field4 string
)
row format delimited
fields terminated by ' '
tblproperties ("serialization.last.column.takes.rest"="true")
;

Writing columns having NULL as some string using OpenCSVSerde - HIVE

I'm using 'org.apache.hadoop.hive.serde2.OpenCSVSerde' to write Hive table data.
CREATE TABLE testtable ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "'"
)
STORED AS TEXTFILE LOCATION '<location>' AS
select * from foo;
So if the foo table has empty strings in it, e.g. '1','2','', the empty strings are written as-is to the text file: the data in the text file reads '1','2',''.
But if foo contains null values, e.g. '1','2',null, the null value is not written to the text file at all:
the data in the text file reads '1','2',
How do I make sure that nulls are properly written to the text file using the CSV SerDe, either as empty strings or as any other string, say "nullstring"?
I also tried this:
CREATE TABLE testtable ROW FORMAT SERDE
....
....
STORED AS TEXTFILE LOCATION '<location>'
TBLPROPERTIES ('serialization.null.format'='')
AS select * from foo;
This should presumably write the nulls as empty strings, but it doesn't even do that.
Please guide me on how to write nulls to CSV files.
Will I have to check for null values in the SELECT query itself (with NVL or something) and replace them with something?
OpenCSVSerde ignores the 'serialization.null.format' property. You can handle null values using the steps below:
1. CREATE TABLE testtable
(
name string,
title string,
birth_year string
)ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ","
,"quoteChar" = "'"
)
STORED AS TEXTFILE;
2. load data into testtable
3. CREATE TABLE testtable1
(
name string,
title string,
birth_year string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES('serialization.null.format'='');
4. INSERT OVERWRITE TABLE testtable1 SELECT * FROM testtable
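Alternatively, as the question itself suggests, the substitution can be done in the SELECT with NVL/COALESCE, which avoids the second table entirely. A sketch under the same schema (testtable2 is a hypothetical name):
CREATE TABLE testtable2
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "'"
)
STORED AS TEXTFILE AS
SELECT
COALESCE(name, 'nullstring') AS name,
COALESCE(title, 'nullstring') AS title,
COALESCE(birth_year, 'nullstring') AS birth_year
FROM testtable;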