Hive Insert overwrite table partition is successful, but table stays empty

Hive Insert overwrite table partition is successful, but table stays empty - hive

I'm trying to insert overwrite an external hive table into a partitioned internal table, using the code below. The code runs 'successfully', but then when I run 'select * from videotracking_playevent limit 10', it never returns any results.
The external table was generated from a directory of recursive folders containing Parquet files, and can be queried. I've tested the regular expressions in this sample, which work fine as well. The Hive log doesn't show any errors.
I have a feeling having partitions is what's messing it up somehow. I don't see why though, any ideas?
set hive.mapred.supports.subdirectories=true;
set hive.input.dir.recursive=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
set hive.execution.engine=spark;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT overwrite TABLE videotracking_playevent PARTITION (source, createyear, createmonth, createday)
SELECT
id_gigya,
created,
uid,
category,
action,
video_id,
program,
device,
url,
video_cms,
duration,
position,
version,
slot_type,
slot_position,
ad_position,
ad_duration,
player_type,
is_embed,
ad_max_ads,
ad_max_duration,
brand,
casting,
ip,
platform,
subprofile_id,
channel,
episode_id,
regexp_replace(regexp_extract(INPUT__FILE__NAME, 'source=[a-z]*', 0),'source=','') AS source,
regexp_replace(regexp_extract(INPUT__FILE__NAME, 'createyear=[0-9]*', 0),'createyear=','') AS createyear,
regexp_replace(regexp_extract(INPUT__FILE__NAME, 'createmonth=[0-9]*', 0),'createmonth=','') AS createmonth,
regexp_replace(regexp_extract(INPUT__FILE__NAME, 'createday=[0-9]*', 0),'createday=','') AS createday
FROM
videotracking_playevent_ext;

Related

Hive Table name starts with underscore select statement issue

In the process of executing my hql script, i have to store data into a temporary table before inserting to the main table.
In that scenario, I have tried to create a temporary table with an underscore at the starting.
Note: with quotes the table name with underscore is not working.
Working Create Statement:
create table
dbo.`_temp_table` (
emp_id int,
emp_name string)
stored as ORC
tblproperties ('ORC.compress' = 'ZLIB')';
Working Insert Statement:
insert into table dbo.`_temp_table` values (123, 'ABC');
But, the select statement on the temp table is not working and it is showing null records even though we have inserted the record as per insert statement.
select * from dbo.`_temp_table`;
Everything is working fine, but select statement to view the rows is not working.
I still not sure, that we can create a temp table in the above way???

Hadoop uses such filenames started with underscore for hidden files and ignores them when reading. For example "_$folder$" file which is created when you execute mkdir to create empty folder in S3 bucket.
See HIVE-6431 - Hive table name start with underscore
By default, FileInputFormat(which is the super class of various
formats) in hadoop ignores file name starts with "_" or ".", and hard
to walk around this in hive codebase.
You can try to create external table and specify table location without underscore and still having underscore in table name. Also consider using TEMPORARY tables.

SQL importing Extended Events file using sys.fn_xe_file_target_read_file how to only get values since last import

I am using SQL Server 2012
I have a long running extended event (runs for days to capture events) that saves to a .xel file.
I have a job that runs periodically to import the data into a staging table.
I am only importing the XML event_data column from the file so I can parse out the XML fields I need and save to a table for reporting.
I know when the last time I ran the import was so I want to see if I can only select records from the file that were added since the import process last ran.
I have it working now but it imports ALL the records from the files into staging tables, parses out the fields I need (including timestamp), then only imports the records that have a timestamp since the job last ran.
My process only inserts new ones since the last time the job ran so this all works fine but it does a lot of work importing and parsing out the XML for ALL records in the file, including the ones I already imported the last times the job ran.
So I want to find a way to not import from the file at all if it was already imported, or at least not have to parse the XML for the records that were already imported (though I have to parse it now to get the timestamp to exclude the ones already processed).
Below is what I have, and as I said, it works, but is doing a lot of extra work if I can find a way to skip the ones I already imported.
I only included the steps for my process that I need the help on:
-- pull data from file path and insert into staging table
INSERT INTO #CaptureObjectUsageFileData (event_data)
SELECT cast(event_data as XML) as event_data
FROM sys.fn_xe_file_target_read_file(#FilePathNameToImport, null, null, null)
-- parse out the data needed (only columns using) and insert into temp table for parsed data
INSERT INTO #CaptureObjectUsageEventData (EventTime, EventObjectType, EventObjectName)
SELECT n.value('(#timestamp)[1]', 'datetime') AS [utc_timestamp],
n.value('(data[#name="object_type"]/text)[1]', 'varchar(500)') AS ObjectType,
n.value('(data[#name="object_name"]/value)[1]', 'varchar(500)') as ObjectName
from (
SELECT event_data
FROM #CaptureObjectUsageFileData (NOLOCK)
) ed
CROSS apply ed.event_data.nodes('event') as q(n)
-- select from temp table as another step for speed/conversion
-- converting the timestamp to smalldatetime so it doesnt get miliseconds so when we select distinct it wont have lots of dupes
INSERT INTO DBALocal.dbo.DBObjectUsageTracking(DatabaseID, ObjectType, ObjectName, ObjectUsageDateTime)
SELECT DISTINCT #DBID, EventObjectType, EventObjectName, CAST(EventTime AS SMALLDATETIME)
FROM #CaptureObjectUsageEventData
WHERE EventTime > #LastRunDateTime

Okay, I've place a comment already, but - after thinking a bit deeper and looking into your code - this might be rather simple:
You can store the time of your last import and use a predicate in .nodes() (like you do this in .value() to get the correct <data>-element).
Try something like this:
DECLARE #LastImport DATETIME=GETDATE(); --put the last import's time here
and then
CROSS apply ed.event_data.nodes('event[#timestamp cast as xs:dateTime? > sql:variable("#LastImport")]') as q(n)
Doing so, .nodes() should return only <event>-elements, where the condition is fullfilled. If this does not help, please show some reduced example of the XML and what you want to get.

Accepted answer above but posting the code for the section I had questions on in full with updates from comments/fixes I made (again not entire code) but important parts. Using #Shnugo help I was able to completely remove a temp table from my process that I needed for doing the date filtering on before inserting into my permanent table, with his answer I can just insert directly into the permanent table. In my testing small data sets the update and the removal of the extra code reduced the running time by 1/3. With the more data I get the bigger impact this improvement will give.
This is designed to run an Extended Event session over a long period of time.
It will tell me what Objects are being used (to later query up against the system tables) to tell me what ones are NOT being used.
See Extended Event generation code below:
I am grabbing info on: sp_statement_starting and only grabbing SP and function events and only saving the object name, type, and timestamp
I am NOT saving the SQL Text because it is not needed for my purpose.
The sp_statement_starting pulls every statement inside a Stored Procedure so when an SP runs it could have 1-100 statements starting events,
and insert that many records into the file (which is way more data than needed for my purposes).
In my code after I import the file into the staging table I am shortning the timestamp to shortdatetime and selecting distinct values from all the records in the file
I am doing this because it inserts a record for every statement inside an SP, shortining the data to shortdatetime and selecting distinct greatly reduces the humber of recrods inserted.
I know I could just keep the object name and only insert unique values and ignore the time completely, but I want to see approximatly how often they are called.
CREATE EVENT SESSION [CaptureObjectUsage_SubmissionEngine] ON SERVER
ADD EVENT sqlserver.sp_statement_starting(
-- collect object name but NOT statement, thats not needed
SET collect_object_name=(1),
collect_statement=(0)
WHERE (
-- this is for functions or SP's
(
-- functions
[object_type]=(8272)
-- SProcs
OR [object_type]=(20038)
)
AND [sqlserver].[database_name]=N'DBNAMEHERE'
AND [sqlserver].[is_system]=(0))
)
ADD TARGET package0.event_file(
SET filename=N'c:\Path\CaptureObjectUsage.xel' -- mine that was default UI gave me
)
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=OFF)
GO
-- ***************************************************************************
-- code for importing
-- ***************************************************************************
-- pull data from file path and insert into staging table
INSERT INTO #CaptureObjectUsageFileData (event_data)
SELECT cast(event_data as XML) as event_data
FROM sys.fn_xe_file_target_read_file(#FilePathNameToImport, null, null, null)
-- with the XML.nodes parsing I can insert directly into my final table because it does the logic here
INSERT INTO DBALocal.dbo.DBObjectUsageTracking(DatabaseID, ObjectType, ObjectName, ObjectUsageDateTime)
SELECT DISTINCT #DBID, -- #DBID is variable I set above so I dont need to use DBNAME and take up a ton more space
n.value('(data[#name="object_type"]/text)[1]', 'varchar(500)') AS ObjectType,
n.value('(data[#name="object_name"]/value)[1]', 'varchar(500)') as ObjectName,
CAST(n.value('(#timestamp)[1]', 'datetime') AS SMALLDATETIME) AS [utc_timestamp]
from (
SELECT event_data
FROM #CaptureObjectUsageFileData (NOLOCK)
) ed
-- original before adding the .node logic
--CROSS apply ed.event_data.nodes('event') as q(n)
-- updated to reduce amount of data to import
CROSS apply ed.event_data.nodes('event[#timestamp cast as xs:dateTime? > sql:variable("#LastRunDateTime")]') as q(n)

old question, but since no one offered a solution using the initial_offset parameter for sys.fn_xe_file_target_read_file, I'll drop some code about how I used it a few years ago. It's not a working solution I think, because I cut and pasted it from a larger code base, but it shows everything that is needed to get it working.
-- table to hold the config, i.e. the last file read and the offset.
IF OBJECT_ID('session_data_reader_config', 'U') IS NULL
CREATE TABLE session_data_reader_config
(
lock bit PRIMARY KEY
DEFAULT 1
CHECK(lock=1) -- to allow only one record in the table
, file_target_path nvarchar(260)
, last_file_read nvarchar(260)
, last_file_read_offset bigint
, file_exists AS dbo.fn_file_exists(last_file_read)
)
-- Insert the default value to start reading the log files, if no values are already present.
IF NOT EXISTS(SELECT 1 FROM session_data_reader_config )
INSERT INTO session_data_reader_config (file_target_path,last_file_read,last_file_read_offset)
VALUES ('PathToYourFiles*.xel',NULL,NULL)
-- import the EE data into the staging table
IF EXISTS(SELECT 1 FROM [session_data_reader_config] WHERE file_exists = 1 )
BEGIN
INSERT INTO [staging_table] ([file_name], [file_offset], [data])
SELECT t2.file_name, t2.file_offset, t2.event_data --, CAST(t2.event_data as XML)
FROM [session_data_reader_config]
CROSS APPLY sys.fn_xe_file_target_read_file(file_target_path,NULL, last_file_read, last_file_read_offset) t2
END
ELSE
BEGIN
INSERT INTO [staging_table] ([file_name], [file_offset], [data])
SELECT t2.file_name, t2.file_offset, t2.event_data
FROM [session_data_reader_config]
CROSS APPLY sys.fn_xe_file_target_read_file(file_target_path,NULL, NULL, NULL) t2
END
-- update the config table with the last file and offset
UPDATE [session_data_reader_config]
SET [last_file_read] = T.[file_name]
, [last_file_read_offset] = T.[file_offset]
FROM (
SELECT TOP (1)
[file_name]
, [file_offset]
FROM [staging_table]
ORDER BY [id] DESC
) AS T ([file_name], [file_offset])

How do I partition a table by all values?

I have an external table, now I want to add partitions to it. I have 224 unique city id's and I want to just write alter table my_table add partition (cityid) location /path; but hive complains, saying that I don't provide anything for the city id value, it should be e.g. alter table my_table add partition (cityid=VALUE) location /path;, but I don't want to run alter table commands for every value of city id, how can I do it for all id's in one go?
This is what hive command line looks like:
hive> alter table pavel.browserdata add partition (cityid) location '/user/maria_dev/data/cityidPartition';
FAILED: ValidationFailureSemanticException table is not partitioned but partition spec exists: {cityid=null}

Partition on physical level is a location (separate location for each value, usually looks like key=value) with data files. If you already have partitions directory structure with files, all you need is to create partitions in Hive metastore, then you can point your table to the root directory using ALTER TABLE SET LOCATION, then use MSCK REPAIR TABLE command. The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is: ALTER TABLE table_name RECOVER PARTITIONS. This will add Hive partitions metadata. See manual here: RECOVER PARTITIONS
If you have only not-partitioned table with data in it's location, then adding partitions will not work because the data needs to be reloaded, you need to:
Create another partitioned table and use insert overwrite to load partition data using dynamic partition load:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table2 partition(cityid)
select col1, ... colN,
cityid
from table1; --partitions columns should be last in the select
This is quite efficient way to reorganize your data.
After this you can delete source table and rename your target table.

Does u-sql script executes in sequence?

I am supposed to do incremental load and using below structure.
Do the statements execute in sequence i.e. TRUNCATE is never executed before first two statements which are getting data:
#newData = Extract ... (FROM FILE STREAM)
#existingData = SELECT * FROM dbo.TableA //this is ADLA table
#allData = SELECT * FROM #newData UNION ALL SELECT * FROM #existingData
TRUNCATE TABLE dbo.TableA;
INSERT INTO dbo.TableA SELECT * FROM #allData

To be very clear: U-SQL scripts are not executed statement by statement. Instead it groups the DDL/DML/OUTPUT statements in order and the query expressions are just subtrees to the inserts and outputs. But first it binds the data during compilation to their names, so your SELECT from TableA will be bound to the data (kind of like a light-weight snapshot), so even if the truncate is executed before the select, you should still be able to read the data from table A (note that permission changes may impact that).
Also, if your script fails during the execution phase, you should have an atomic execution. That means if your INSERT fails, the TRUNCATE should be undone at the end.
Having said that, why don't you use INSERT incrementally and use ALTER TABLE REBUILD periodically instead of doing the above pattern that reads the full table on every insertion?

Hive creating extra subfolders under partitioned directories on INSERT OVERWRITE

I have a table partitioned on year,month,day and hour. If I use the following INSERT OVERWRITE to a specific partition it places a file under appropriate directory structure. This file contains the string abc:-
INSERT OVERWRITE TABLE testtable PARTITION(year = 2017, month = 7, day=29, hour=18)
SELECT tbl.c1 FROM
(
select 'abc' as c1
) as tbl;
But if I use the following statement, Hive surprisingly creates three new folders under the folder "hour=18".
And there is a file inside each of these three subfolders.
INSERT OVERWRITE TABLE testtable PARTITION(year = 2017, month = 7, day=29, hour=18)
SELECT tbl.c1 FROM
(
select 'abc' as c1
union ALL
select 'xyz' as c1
union ALL
select 'mno' as c1
) as tbl;
When I query the data, it shows the data as expected. But why did it create these 3 new folders? Since the partitioning scheme is only for year,month,day and hour I wouldn't expect Hive to create folders for anything other than these.

Actually it has nothing to do with INSERT OVERWRITE or partitioning.
It's UNION ALL statement that adds additional directories.
Why it bothers you?
You can do some DISTRIBUTE BY shenanigans or set number of reducers to 1 to put this into one file.

Hi guys I had the same issue and thought of sharing.
Union all adds extra subfolder in the table.
The count(*) on the table will give 0 records and the msck repair will error out with the default properties.
After using set hive.msck.path.validator=ignore; MSCK will not error out but will message "Partitions not in metastore"
Only after setting the properties as mentioned above by DogBoneBlues
(SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;) The table is returning values.(count(*))

You can use just "union" instead of "union all" if you dont care about duplicates. "union" should not create sub-folders.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Hive Insert overwrite table partition is successful, but table stays empty - hive

Related

Hive Table name starts with underscore select statement issue

SQL importing Extended Events file using sys.fn_xe_file_target_read_file how to only get values since last import

How do I partition a table by all values?

Does u-sql script executes in sequence?

Hive creating extra subfolders under partitioned directories on INSERT OVERWRITE

Categories

Resources