Error in creating Druid datasources from Hive

I have followed the Druid integration documentation at https://cwiki.apache.org/confluence/display/Hive/Druid+Integration.
The error I am facing is:
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
java.io.FileNotFoundException: File does not exist:
/usr/lib/hive/lib/hive-druid-handler-2.3.0.jar
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1530)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1523)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1
The error says it cannot find "/usr/lib/hive/lib/hive-druid-handler-2.3.0.jar", although I am using Hive 2.3.2.
To work around this I downloaded the jar and restarted Hadoop, but the problem is still not solved.

Looks like you are using Hive 1. All the Druid integration work is done with Hive 2.
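If you are in fact on Hive 2.x and the handler jar is simply not being shipped with the job, one thing to try (a sketch only, assuming the jar exists on the local filesystem at a path matching your install, which you should verify) is to add it to the session explicitly before creating the Druid datasource:
-- hypothetical path; point this at the hive-druid-handler jar that matches your Hive version
ADD JAR /usr/lib/hive/lib/hive-druid-handler-2.3.2.jar;
-- or make it visible to every session by listing it in hive.aux.jars.path in hive-site.xml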

Related

How can I change physical memory in a mapreduce/hive job?

I'm trying to run a Hive INSERT OVERWRITE query on an EMR cluster with 40 worker nodes and a single master node.
However, while running the INSERT OVERWRITE query, as soon as the job reaches
Stage-1 map = 100%, reduce = 100%, Cumulative CPU 180529.86 sec
I get the following error:
Ended Job = job_1599289114675_0001 with errors
Diagnostic Messages for this Task:
Container [pid=9944,containerID=container_1599289114675_0001_01_041995] is running beyond physical memory limits. Current usage: 1.5 GB of 1.5 GB physical memory used; 3.2 GB of 7.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1599289114675_0001_01_041995 :
I'm not sure how I can change the 1.5 GB physical memory limit. In my configuration I don't see such a number, and I don't understand how that 1.5 GB is being calculated.
I even tried setting "yarn.nodemanager.vmem-pmem-ratio" to 5 as suggested in some forums, but irrespective of this change I still get the error.
This is how the job starts:
Number of reduce tasks not specified. Estimated from input data size: 942
Hadoop job information for Stage-1: number of mappers: 910; number of reducers: 942
This is what my configuration file for the cluster looks like. I'm unable to understand which settings I have to change to avoid this issue. Could it also be due to Tez settings, even though I'm not using Tez as the execution engine?
Any suggestions will be greatly appreciated, thanks.
When opening the Hive console, append the following to the command:
--hiveconf mapreduce.map.memory.mb=8192 --hiveconf mapreduce.reduce.memory.mb=8192 --hiveconf mapreduce.map.java.opts=-Xmx7600M
In case you still get a Java heap error, try increasing these to higher values, but make sure that mapreduce.map.java.opts doesn't exceed mapreduce.map.memory.mb.
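For reference, the same properties can also be set inside an already-open Hive session instead of on the command line. This is only a sketch with illustrative values; the reduce-side java.opts line is added by analogy, and the -Xmx value should stay roughly 15-20% below the container size:
set mapreduce.map.memory.mb=8192;
set mapreduce.reduce.memory.mb=8192;
set mapreduce.map.java.opts=-Xmx7600M;
set mapreduce.reduce.java.opts=-Xmx7600M;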

Why does a Java OutOfMemoryError occur when selecting fewer columns in a Hive query?

I have two hive select statements:
select * from ode limit 5;
This successfully pulls out 5 records from the table 'ode', with all columns included in the result. However, the following query causes an error:
select content from ode limit 5;
where 'content' is one column in the table. The error is:
hive> select content from ode limit 5;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
The second query should be a lot cheaper, so why does it cause a memory issue? How can I fix this?
When you select the whole table, Hive triggers a fetch task instead of an MR job, which involves no parsing (it is like calling hdfs dfs -cat ... | head -5).
As far as I can see, in your case the Hive client tries to run the map phase locally, and that is where it runs out of heap.
You can choose one of two ways:
Force remote execution with hive.fetch.task.conversion
Increase the Hive client heap size using the HADOOP_CLIENT_OPTS environment variable.
You can find more details regarding fetch tasks here.
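As a rough sketch of those two options (the property name is a standard Hive setting; the heap value is an assumption you should size to your data):
-- Option 1: disable the local fetch conversion so the query runs as a normal job
set hive.fetch.task.conversion=none;
-- Option 2: give the local Hive client JVM more heap before starting hive, e.g.
-- export HADOOP_CLIENT_OPTS="-Xmx2g"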

Can you control hdfs file size for a HortonWorks HDP 3.4.1 Managed Table?

I am currently testing a cluster, and when using "CREATE TABLE AS" the resulting managed table ends up being one file of ~1.2 GB, while the base table the query reads from has many small files. The SELECT portion runs fast, but then two reducers run to create the single file, which takes 75% of the run time.
Additional testing:
1) If "CREATE EXTERNAL TABLE AS" is used, the query runs very fast and there is no file-merge step involved.
2) The merging also doesn't appear to occur with HDP 3.0.1.
You can change set hive.exec.reducers.bytes.per.reducer=<number> to let Hive decide the number of reducers based on reducer input size (the default value is 1 GB, i.e. 1000000000 bytes). [You can refer to the links provided by #leftjoin for more details about this property and how to tune it for your needs.]
Another option you can try is to change the following properties:
set mapreduce.job.reduces=<number>
set hive.exec.reducers.max=<number>
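For example, to aim for several smaller output files instead of one ~1.2 GB file, you could target roughly 256 MB of input per reducer (the values below are illustrative, not a recommendation):
set hive.exec.reducers.bytes.per.reducer=268435456;
-- or pin the reducer count directly
set mapreduce.job.reduces=8;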

Size of data block in avro is larger than the maximum allowed value 16777216

I am trying to load Avro data into BigQuery, so I converted ORC data into Avro by running an INSERT OVERWRITE command in Hive. When I try to load the data into BigQuery using the bq command-line tool, I get this error:
"message": "Error while reading data, error message: Avro parsing error in position 397707. Size of data block 17378680 is larger than the maximum allowed value 16777216."
Is there any way I can increase this data block size? I couldn't find anything related to this.
Below is the command I am using to load the data:
bq load --source_format=AVRO dataset.table gs://********/gold/offers/hive/gold_hctc_ofr_txt/ingestion_time=20180305/000000_0
It seems you are actually hitting BigQuery's block size limit as defined in its documentation. You can check the "Row and cell size limits" section, where it is mentioned that Avro's block size limit is 16 MB.
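There is no BigQuery-side switch for this limit, so the block size has to be reduced where the Avro files are written. As a heavily hedged idea: if Hive's Avro writer honours Avro's standard MapReduce sync-interval property (an assumption worth verifying on your Hive version), lowering it before the INSERT OVERWRITE should keep each block well under 16 MB:
-- assumption: the Avro output format used by Hive reads avro.mapred.sync.interval
set avro.mapred.sync.interval=1048576;
-- then re-run the INSERT OVERWRITE that produces the Avro files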

How to decrease the number of mappers in Hive when the files are bigger than the block size?

I have a table in Hive which has more than 720 partitions; in each partition there are more than 400 files, with an average file size of 1 GB.
Now I execute the following SQL:
insert overwrite table test_abc select * from DEFAULT.abc A WHERE A.P_HOUR ='2017042400' ;
This partition (P_HOUR='2017042400') has 409 files. When I submit this SQL, I get the following output:
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:409
INFO : Submitting tokens for job: job_1482996444961_9384015
I have googled many documents on how to decrease the number of mappers; most of them solve the problem when the files are small.
I have tried the following settings in Beeline, but they do not work:
---------------first time
set mapred.min.split.size =5000000000;
set mapred.max.split.size =10000000000;
set mapred.min.split.size.per.node=5000000000;
set mapred.min.split.size.per.rack=5000000000;
-----------------second time
set mapreduce.input.fileinputformat.split.minsize =5000000000;
set mapreduce.input.fileinputformat.split.maxsize=10000000000;
set mapreduce.input.fileinputformat.split.minsize.per.rack=5000000000;
set mapreduce.input.fileinputformat.split.minsize.per.node=5000000000;
my hadoop version is
Hadoop 2.7.2
Compiled by root on 11 Jul 2016 10:58:45
hive version is
Connected to: Apache Hive (version 1.3.0)
Driver: Hive JDBC (version 1.3.0)
In addition to the settings in your post, also set:
set hive.hadoop.supports.splittable.combineinputformat=true;
hive.hadoop.supports.splittable.combineinputformat
- Default Value: false
- Added In: Hive 0.6.0
Whether to combine small input files so that fewer mappers are spawned.
MRv2 uses CombineInputFormat, while Tez uses grouped splits, to determine the number of mappers. If your execution engine is mr and you would like to reduce the number of mappers, use:
mapreduce.input.fileinputformat.split.maxsize=xxxxx
If maxSplitSize is specified, then blocks on the same node are combined into a single split. Blocks that are left over are then combined with other blocks in the same rack. If maxSplitSize is not specified, then blocks from the same rack are combined in a single split; no attempt is made to create node-local splits. If maxSplitSize is equal to the block size, then this class is similar to the default splitting behavior in Hadoop.
This link can be helpful for controlling mappers in Hive if your execution engine is mr.
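A minimal sketch for the mr engine, assuming you want each mapper to read on the order of 5 GB (the split size is illustrative):
set hive.hadoop.supports.splittable.combineinputformat=true;
-- CombineHiveInputFormat is usually already the default input format
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapreduce.input.fileinputformat.split.maxsize=5368709120;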
If your execution engine is tez and you would like to control the number of mappers, then use:
set tez.grouping.max-size = XXXXXX;
Here is a good reference on parallelism in Hive for the Tez execution engine.
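And the equivalent sketch for the tez engine (again, the grouping sizes are illustrative):
set hive.execution.engine=tez;
set tez.grouping.min-size=1073741824;
set tez.grouping.max-size=5368709120;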