Using Hive to distribute over Reducers? - hive

The most frustrating part about this problem is that the obvious answer is "fix the source table!" - which unfortunately I cannot do (the table is managed and maintained by another team at work that refuses to help).
So I'm looking for a technical solution to doing this without changing the source table.
The situation is this: I have a source table, and I'm trying to write a Hive query that creates a new table from it. The query ends up taking many hours to complete, and the reason is that the work gets bottlenecked into a single reducer.
When I follow the source table to its location on HDFS, I notice there are 1009 part files. 1008 of them are 0 bytes, and 1 of them is 400 GB.
This explains why one reducer takes so long: all of the data is contained in a single file.
I have tried to add the following settings in an attempt to split the work over many reducers.
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=134217728;
set hive.merge.size.per.task=134217728;
set mapred.max.split.size=134217728;
set mapred.min.split.size=134217728;
set hive.exec.reducers.bytes.per.reducer=134217728;
Every attempt ends with my new table looking exactly like the source table: tons of 0-byte files, and a single file with all of the data. I'm able to control the number of reducers, which controls the total number of files... but I cannot get the data evenly distributed across them.
Any ideas on how I can "fix" my resulting table so its files are evenly sized? Bonus points if I can fix this during the query itself, which would even out the load on my reducers and make the query faster.
The source table looks like this:
CREATE TABLE `source_tbl`(
`col1` varchar(16)
, `col2` smallint
, `col3` varchar(5),
... many more cols ...
`col20000` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://cluster/user/hive/warehouse/schema.db/source_tbl'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1009',
'numRows'='19187489',
'rawDataSize'='2972053294998',
'totalSize'='50796390931',
'transient_lastDdlTime'='1501859524')
My query is this:
create table schema.dest_tbl as select * from schema.source_tbl;
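For reference, a sketch of the distribute by technique from the answers below applied to this query (using col1 as the distribution key is only an assumption; any reasonably high-cardinality column would do):
set hive.exec.reducers.bytes.per.reducer=134217728; -- target ~128 MB of input per reducer
create table schema.dest_tbl as
select * from schema.source_tbl
distribute by col1; -- forces a shuffle so rows are spread over many reducers instead of one file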

Related

How do I force hive to always create a consistent filename like 000000_0?

I am doing an INSERT OVERWRITE operation through a Hive external table onto AWS S3. Hive creates an output file 000000_0 on S3. However, at times I notice that it creates files with other names, like 0000003_0. I always need to overwrite the existing file, but with inconsistent file names I am unable to do so. How do I force Hive to always create a consistent filename like 000000_0? Below is an example of what my code looks like, where tab_content is a Hive external table.
INSERT OVERWRITE TABLE tab_content
PARTITION(datekey)
select * from source
Better not to do this; modify your program to accept any number of files in the directory instead.
Each reducer (or mapper, if the job is map-only) creates its own file. The reducers know nothing about each other; the files are named during creation, e.g. 000001_0, 000002_0. But a file can also be 000001_1 if attempt number 0 failed and attempt number 1 succeeded. Also, if the table is partitioned and there is no distribute by partition key at the end of the query, each reducer will create its own file in each partition.
You can force the job onto a single final reducer (for example by adding an order by clause, or by setting set mapred.reduce.tasks=1;). But bear in mind that this solution is not scalable, because too much data will cause performance problems on a single reducer. Also, what happens if attempt 0 fails, the task is restarted, and attempt 1 succeeds? It will create 000001_1 instead of 000001_0.
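For illustration, a minimal sketch of the single-reducer approach applied to the query above (the distribute by is added so the statement actually runs a reduce stage; a retried task attempt can still change the suffix):
set mapred.reduce.tasks=1; -- force one final reducer, so each partition gets a single 000000_N file
INSERT OVERWRITE TABLE tab_content
PARTITION(datekey)
select * from source
distribute by datekey;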

Writing large scripts in Impala

I have to translate long Teradata scripts (10,000 lines long) into Impala. I have never done this before with Impala.
The tools I’ve got to work with are impala shell or hue.
I’ve not seen an example of Impala code that’s more than 50 lines long either in impala shell or hue. Can someone point me to an example of impala script in either impala shell or hue that's at least 500 lines long?
I can handle the syntax change, I don't need advice on that. I'm looking for gotchas or traps in writing long code in these tools.
You need to create an external table pointing at your source data file (as shown in the Impala tutorial).
-- The EXTERNAL clause means the data is located outside the central location
-- for Impala data files and is preserved when the associated Impala table is dropped.
-- We expect the data to already exist in the directory specified by the LOCATION clause.
CREATE EXTERNAL TABLE tab1
(
id INT,
col_1 BOOLEAN,
col_2 DOUBLE,
col_3 TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/sample_data/tab1';
Then you can easily move your data whenever you want using an INSERT statement.
INSERT INTO table2
SELECT * FROM tab1;

Create Table in Hive with one file

I'm creating a new table in Hive using:
CREATE TABLE new_table AS select * from old_table;
My problem is that after the table is created, it generates multiple files for each partition, while I want only one file per partition.
How can I define it in the table?
Thank you!
There are many possible solutions:
1) Add distribute by partition key at the end of your query. Possibly there are many partitions per reducer and each reducer creates files for every partition it receives; distributing by the partition key may reduce the number of files and memory consumption as well. The hive.exec.reducers.bytes.per.reducer setting defines how much data each reducer will process (see the sketch after this list).
2) Simple, and quite good if there is not too much data: add order by to force a single reducer, or increase hive.exec.reducers.bytes.per.reducer=500000000; --500M files. This single-reducer solution only suits small data volumes; it will run slowly if there is a lot of data.
If your task is map-only, then consider options 3-5 instead:
3) If running on MapReduce, switch on merge:
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=500000000; --Size of merged files at the end of the job
set hive.merge.smallfiles.avgsize=500000000; --When the average output file size of a job is less than this number,
--Hive will start an additional map-reduce job to merge the output files into bigger files
4) When running on Tez
set hive.merge.tezfiles=true;
set hive.merge.size.per.task=500000000;
set hive.merge.smallfiles.avgsize=500000000;
5) For ORC files you can merge files efficiently using this command:
ALTER TABLE T [PARTITION partition_spec] CONCATENATE; - for ORC
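A minimal sketch of option 1 applied to the query from the question (part_col is a placeholder for whatever the table's partition key actually is):
set hive.exec.reducers.bytes.per.reducer=500000000; --each reducer handles roughly 500M of data
CREATE TABLE new_table AS
select * from old_table
distribute by part_col; --all rows for a given partition key land on one reducer, so each partition gets a single file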

Hive insert vs Hive Load: What are the trade offs?

I'm learning Hadoop/big data technologies. I would like to ingest data in bulk into Hive. I started working with a simple CSV file; when I tried to use the INSERT command to load it record by record, a single record insertion took around 1 minute. When I put the file into HDFS and then used the LOAD command, it was instantaneous, since it just copies the file into Hive's warehouse. I just want to know what trade-offs one has to face when opting for LOAD instead of INSERT.
LOAD - Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move data files into the locations corresponding to Hive tables.
INSERT - Query results can be inserted into tables by using the insert clause, which in turn runs MapReduce jobs, so it takes some time to execute.
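A quick sketch of the two paths (the file path and table names here are examples, not from the question):
-- LOAD: a pure file move into the table's location, effectively instantaneous
LOAD DATA INPATH '/user/hduser/input/data.csv' INTO TABLE my_table;
-- INSERT ... SELECT: launches a MapReduce/Tez job over the source data
INSERT INTO TABLE my_table SELECT * FROM staging_table;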
In case you want to optimize/tune the insert statements, below are some techniques:
1. Set the execution engine in hive-site.xml to Tez (if it is already installed):
set hive.execution.engine=tez;
2. Use the ORC file format:
CREATE TABLE A_ORC (
customerID int, name string, age int, address string
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");
INSERT INTO TABLE A_ORC SELECT * FROM A;
3. Running jobs concurrently in Hive can save overall job running time. To achieve that, the below config needs to be changed in hive-default.xml:
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=<your value>;
For more info, you can visit http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
Hope this helps.

hive RegexSerDe null

How should I work with NULL values in RegexSerDe?
I have file with data:
cat MOS/ex1.txt
123,dwdjwhdjwh,456
543,\N,956
I have the table:
CREATE TABLE mos.stations (usaf string, wban STRING, name string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*),(.*),(.*)"
);
I successfully loaded the data from file to table:
LOAD DATA LOCAL INPATH '/home/hduser/MOS/ex1.txt' OVERWRITE INTO TABLE mos.stations;
Simple select works fine:
hive> select * from mos.stations;
123dwdjwhdjwh456
543\N956
And the next one ends with an error:
select * from mos.stations where wban is null;
[Hive Error]: Query returned non-zero code: 9, cause: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
What is wrong?
I see a couple of possible issues:
1) It may not have anything to do with null handling at all. The first query doesn't actually spawn an M/R job while the second one does, so it might be a simple classpath issue where RegexSerDe is not being seen by the M/R tasks because its jar is not in the classpath of the tasktracker. You'll need to find where the hive-contrib jar lives on your system and then make Hive aware of it via something like:
add jar /usr/lib/hive/lib/hive-contrib-0.7.1-cdh3u2.jar
Note that your path and jar name may be different. You can run the above through Hive right before your query.
2) Another issue might be that RegexSerDe doesn't deal with "\N" the same way as the default LazySimpleSerDe. Judging by the output you are getting in the first query (where it returns a literal "\N"), that could be the case. What happens if you query where wban='\\N'? Or where wban='\N' (I forget whether you need to double escape)?
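That is, as a quick check (which of these matches depends on how the literal was stored and on escaping rules in your Hive version):
select * from mos.stations where wban = '\\N';
select * from mos.stations where wban = '\N';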
Finally, one word of caution about RegexSerDe. While it's really handy, it's slow as molasses going uphill in January compared to the default SerDe. If the dataset is large and you plan to run a lot of queries against it, it's best to pre-process so that you don't need the RegexSerDe; otherwise you're going to pay a penalty for every query. The same dataset above looks like it would be fine with the default SerDe.
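As a sketch of that last point, the same file could be loaded through the default delimited format, which also treats \N as NULL by default (the stations_delim table name is just for illustration):
CREATE TABLE mos.stations_delim (usaf string, wban string, name string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/home/hduser/MOS/ex1.txt' OVERWRITE INTO TABLE mos.stations_delim;
select * from mos.stations_delim where wban is null; -- the \N row now comes back as NULL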