Can you control HDFS file size for a Hortonworks HDP 3.4.1 managed table? - hive

Currently testing a cluster: when using "CREATE TABLE AS", the resulting managed table ends up as one file of ~1.2 GB, while the base table the query is created from consists of many small files. The SELECT portion runs fast, but then two reducers run to produce that single file, which takes 75% of the run time.
Additional testing:
1) If "CREATE EXTERNAL TABLE AS" is used instead, the query runs very fast and there is no file-merge step involved.
2) The merging also does not appear to occur on HDP 3.0.1.

You can set hive.exec.reducers.bytes.per.reducer=<number> to let Hive decide the number of reducers based on the reducer input size (the default is 1 GB, i.e. 1000000000 bytes). [You can refer to the links provided by @leftjoin for more details about this property and how to tune it for your needs.]
Another option you can try is to change the following properties:
set mapreduce.job.reduces=<number>
set hive.exec.reducers.max=<number>
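For example, a minimal sketch of how these settings could be applied just before the CTAS (the table names and the 256 MB / 32 values are hypothetical, purely for illustration):
-- ask Hive for roughly one reducer per 256 MB of reducer input (default is 1 GB)
set hive.exec.reducers.bytes.per.reducer=256000000;
-- cap the total number of reducers
set hive.exec.reducers.max=32;
CREATE TABLE managed_copy AS SELECT * FROM source_with_small_files;
With a smaller bytes-per-reducer value, more reducers run in parallel and the output is spread across more files instead of being merged into one.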

Related

How to force MR execution when running simple Hive query?

There is Hive 2.1.1 over MR, table test_table stored as sequencefile and the following ad-hoc query:
select t.*
from test_table t
where t.test_column = 100
Although this query can be executed without starting MR (as a fetch task), sometimes it takes longer to scan the HDFS files than it would to run a single map job.
When I want to force MR execution, I make the query more complex, e.g. by adding distinct. The significant drawbacks of this approach are:
Query results may differ from the original query's
It puts meaningless computation load on the cluster
Is there a recommended way to force MR execution when using Hive-on-MR?
Hive decides whether to execute a map task or a fetch task depending on the following settings (defaults in parentheses):
hive.fetch.task.conversion ("more") — the strategy for converting MR tasks into fetch tasks
hive.fetch.task.conversion.threshold (1 GB) — max size of input data that can be fed to a fetch task
hive.fetch.task.aggr (false) — when set to true, queries like select count(*) from src also can be executed in a fetch task
This suggests the following two options:
set hive.fetch.task.conversion.threshold to a lower value, e.g. 512 MB
set hive.fetch.task.conversion to "none"
For some reason lowering the threshold did not change anything in my case, so I went with the second option; it seems fine for ad-hoc queries.
More details regarding these settings can be found in the Cloudera forum and the Hive wiki.
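For an ad-hoc query like the one in the question, a minimal sketch of the second option would be:
-- disable fetch-task conversion so Hive plans a map job instead of a fetch task
set hive.fetch.task.conversion=none;
select t.*
from test_table t
where t.test_column = 100;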
Just add set hive.execution.engine=mr; before your query and it will force Hive to use MR.

How to decrease the number of mappers in Hive when the files are bigger than the block size?

I have a table in Hive with more than 720 partitions; each partition contains more than 400 files, and the average file size is 1 GB.
Now I execute the following SQL:
insert overwrite table test_abc select * from DEFAULT.abc A WHERE A.P_HOUR ='2017042400' ;
This partition (P_HOUR ='2017042400') has 409 files. When I submit this SQL, I get the following output:
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:409
INFO : Submitting tokens for job: job_1482996444961_9384015
I have googled many documents on how to decrease the number of mappers; most of them solve the problem only when the files are small.
I have tried the following settings in beeline, but they did not work:
-- first attempt
set mapred.min.split.size=5000000000;
set mapred.max.split.size=10000000000;
set mapred.min.split.size.per.node=5000000000;
set mapred.min.split.size.per.rack=5000000000;
-- second attempt
set mapreduce.input.fileinputformat.split.minsize=5000000000;
set mapreduce.input.fileinputformat.split.maxsize=10000000000;
set mapreduce.input.fileinputformat.split.minsize.per.rack=5000000000;
set mapreduce.input.fileinputformat.split.minsize.per.node=5000000000;
My Hadoop version is:
Hadoop 2.7.2
Compiled by root on 11 Jul 2016 10:58:45
My Hive version is:
Connected to: Apache Hive (version 1.3.0)
Driver: Hive JDBC (version 1.3.0)
In addition to the settings in your post, set:
set hive.hadoop.supports.splittable.combineinputformat=true;
hive.hadoop.supports.splittable.combineinputformat
- Default Value: false
- Added In: Hive 0.6.0
Whether to combine small input files so that fewer mappers are spawned.
MRv2 uses CombineInputFormat, while Tez uses grouped splits to determine the Mapper. If your execution engine is mr and you would like to reduce Mappers use:
mapreduce.input.fileinputformat.split.maxsize=xxxxx
If maxSplitSize is specified, then blocks on the same node are combined to a single split. Blocks that are left over are then combined with other blocks in the same rack. If maxSplitSize is not specified, then blocks from the same rack are combined in a single split; no attempt is made to create node-local splits. If the maxSplitSize is equal to the block size, then this class is similar to the default splitting behavior in Hadoop
This link can be helpful for controlling mappers in Hive if your execution engine is mr.
If your execution engine is tez and you would like to control mappers, then use:
set tez.grouping.max-size=XXXXXX;
Here is a good reference on parallelism in Hive for the Tez execution engine.
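Putting that together, a hedged sketch for the query from the question (the ~2 GB split/group sizes are illustrative and should be tuned to your file and block sizes):
-- MR engine: combine input files into larger splits
set hive.hadoop.supports.splittable.combineinputformat=true;
set mapreduce.input.fileinputformat.split.maxsize=2000000000;
-- Tez engine: the equivalent knob is grouped splits
set tez.grouping.max-size=2000000000;
insert overwrite table test_abc select * from DEFAULT.abc A WHERE A.P_HOUR ='2017042400';
With roughly 2 GB per split, the 409 one-GB files should be grouped into noticeably fewer mappers.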

Hive query - INSERT OVERWRITE LOCAL DIRECTORY creates multiple files for a single table

I run the following on a Hive table myTable:
INSERT OVERWRITE LOCAL DIRECTORY '/myDir/out' SELECT concat_ws('',NAME,PRODUCT,PRC,field1,field2,field3,field4,field5) FROM myTable;
This command generates 2 files, 000000_0 and 000001_0, inside the folder out/.
But I need the contents as a single file. What should I do?
There are multiple files in the directory because every reducer writes one file. If you really need the contents as a single file, run your MapReduce job with only 1 reducer, which will write a single file.
However, depending on your data size, running a single reducer might not be a good approach.
Edit: Instead of forcing Hive to run 1 reduce task and output a single reduce file, it would be better to use hadoop fs operations to merge the outputs into a single file.
For example:
hadoop fs -text /myDir/out/* | hadoop fs -put - /myDir/out.txt
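If the result directory is on HDFS, hadoop fs -getmerge should achieve the same thing in one step (the local target path here is just an example):
hadoop fs -getmerge /myDir/out /tmp/out.txt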
A bit late to the game, but I found that using LIMIT large_number, where large_number is bigger than the number of rows in your query, forces Hive to use at least one reducer. For example:
set mapred.reduce.tasks=1; INSERT OVERWRITE LOCAL DIRECTORY '/myDir/out' SELECT * FROM table_name LIMIT 1000000000
Worked flawlessly.
CLUSTER BY will also do the job, since it forces a reduce phase.
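For instance, a hypothetical variant of the query from the question, clustering on one of its columns and pinning the reducer count to 1 so that a single output file is produced:
set mapred.reduce.tasks=1;
INSERT OVERWRITE LOCAL DIRECTORY '/myDir/out'
SELECT concat_ws('',NAME,PRODUCT,PRC,field1,field2,field3,field4,field5)
FROM myTable
CLUSTER BY NAME;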

hbase export row size limit

I want to create a backup of my hbase table using hbase export.
The problem is that my rows are very big and I get a java heap space error. Is there any parameter I can set to limit the amount of data copied in each step?
I use the following command:
hadoop jar /usr/lib/hbase/hbase-0.90.3-cdh3u1.jar export tableName backupPathOnHdfs numberOfColumnFamiliesVersions
or
hbase org.apache.hadoop.hbase.mapreduce.Export tableName backupPathOnHdfs numberOfColumnFamiliesVersions
You can use the export tool (hadoop jar hbase-*-SNAPSHOT.jar export) with a final argument that is treated as a regexp if it starts with ^, or otherwise as a row-key prefix to filter on. For details see the source, as this does not seem to be documented yet (it should work from 0.91.0 on).
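A hedged sketch of how that might look, assuming the usual Export positional arguments <tablename> <outputdir> [<versions> [<starttime> [<endtime> [filter]]]] (run the tool with no arguments to confirm the exact usage for your HBase build; the prefix value is hypothetical):
hbase org.apache.hadoop.hbase.mapreduce.Export tableName backupPathOnHdfs numberOfColumnFamiliesVersions 0 9223372036854775807 myRowPrefix
Exporting one key range at a time this way keeps each run's scan smaller, which may help with the heap space error.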

How can I store the result of a SQL query in a CSV file using Squirrel?

Version 3.0.3. It's a fairly large result-set, around 3 million rows.
Martin pretty much has this right.
The TL;DR version is that you need the "SQLScripts" plugin (which is one of the "standard" plugins), and then you can select these menu options: Session > Scripts > Store Result of SQL in File
I'm looking at version 3.4. I don't know when this feature was introduced, but you may need to upgrade if you don't have, and cannot install, the SQLScripts plugin.
Instructions for installing a new plugin can be found at: http://squirrel-sql.sourceforge.net/user-manual/quick_start.html#plugins
But if you're performing a fresh install of Squirrel you can simply select the "SQLScripts" plugin during the installation.
Here's the long version:
Run the query
Connect to the database. Click on the SQL tab. Enter your query. Hit the run button (or Ctrl-Enter).
You should see the first 100 rows or so in the results area in the bottom half of the pane (depending upon how you've configured the Limit Rows option).
Export the full results
Open the Session menu. Select the Scripts item (nearly at the bottom of this long menu). Select Store Result of SQL in File.
This opens a dialog box where you can configure your export. Make sure you check Export the complete result set to get everything.
I haven't tried this with a 3 million row result set, but I have noticed that Squirrel seems to stream the data to disk (rather than reading it all into memory before writing), so I don't see any reason why it wouldn't work with an arbitrarily large file.
Note that you can export directly to a file by using Ctrl-T to invoke the tools popup and selecting sql2file.
I have found a way to do this; there is nice support for it in Squirrel. Run the SQL SELECT (the 100-row limit will be ignored by the exporter, don't worry). Then, in the main menu, choose Session > Scripts > Store Result of SQL in File. This functionality may not be present by default; it may come from a standard plugin that is not installed by default, though I don't know which plugin.
I also wanted to export the results of an SQL query to a CSV file using SquirrelSQL. However, according to the changes file, it seems that this functionality is not supported even in SquirrelSQL 3.3.0.
So far I have only been able to export the data shown in the 'result table' of the SQL query, by right-clicking on the table > Export to CSV. The table size is 100 rows by default, and so is the CSV export. You may change the table size in Session Properties > SQL > Limit rows; e.g. change it to 10000 and your export will also contain 10000 rows. The question is how SquirrelSql will deal with really big result sets (millions of rows)...
Run from your GUI (note that COPY ... TO ... WITH CSV is PostgreSQL syntax, so this only applies if your database is PostgreSQL):
COPY (SELECT * FROM some_table) TO '/some/path/some_table.csv' WITH CSV HEADER
Using Squirrel 3.5.0:
The "Store Result of SQL in File" option is great if you only have a simple SELECT query. A more complex one with parameters won't work.
Even exporting a result of 600,000+ rows to a CSV file can fail.