Why does a Java OutOfMemoryError occur when selecting fewer columns in a Hive query?

I have two hive select statements:
select * from ode limit 5;
This successfully pulls out 5 records from the table 'ode', with all columns included in the result. However, the following query causes an error:
select content from ode limit 5;
Where 'content' is one column in the table. The error is:
hive> select content from ode limit 5;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
The second query should be much cheaper, so why does it cause a memory issue? How can I fix this?

When you select the whole table, Hive triggers a fetch task instead of an MR job, which involves no parsing (it is like calling hdfs dfs -cat ... | head -5).
As far as I can see, in your case the Hive client tries to run the map task locally.
You can choose one of the two ways:
Force remote execution with hive.fetch.task.conversion
Increase the Hive client heap size using the HADOOP_CLIENT_OPTS environment variable.
You can find more details regarding fetch tasks here.
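For example, the first option is a single session-level setting (a sketch; the heap value for the second option is illustrative):

```sql
-- Option 1: disable fetch-task conversion so the query runs as a real MR job
set hive.fetch.task.conversion=none;
select content from ode limit 5;
```

For the second option, set the environment variable in the shell before starting the Hive CLI, e.g. export HADOOP_CLIENT_OPTS="-Xmx2g" (2g is just an example value).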

Related

How to force MR execution when running simple Hive query?

I have Hive 2.1.1 over MR, a table test_table stored as sequencefile, and the following ad-hoc query:
select t.*
from test_table t
where t.test_column = 100
Although this query can be executed without starting MR (fetch task), sometimes it takes longer to scan HDFS files rather than triggering a single map job.
When I want to force MR execution, I make the query more complex, e.g. by adding distinct. The significant drawbacks of this approach are:
Query results may differ from the original query's
It puts meaningless computational load on the cluster
Is there a recommended way to force MR execution when using Hive-on-MR?
Hive decides whether to execute a map task or a fetch task depending on the following settings (defaults in parentheses):
hive.fetch.task.conversion ("more") — the strategy for converting MR tasks into fetch tasks
hive.fetch.task.conversion.threshold (1 GB) — max size of input data that can be fed to a fetch task
hive.fetch.task.aggr (false) — when set to true, queries like select count(*) from src also can be executed in a fetch task
This suggests the following two options:
set hive.fetch.task.conversion.threshold to a lower value, e.g. 512 MB
set hive.fetch.task.conversion to "none"
For some reason lowering the threshold did not change anything in my case, so I stuck with the second option, which seems fine for ad-hoc queries.
More details regarding these settings can be found in Cloudera forum and Hive wiki.
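Both options are plain session-level settings (a sketch; the threshold value shown is illustrative):

```sql
-- Option 1: lower the threshold so larger inputs fall back to MR (example: 512 MB)
set hive.fetch.task.conversion.threshold=536870912;
-- Option 2: disable fetch-task conversion entirely for the session
set hive.fetch.task.conversion=none;
select t.* from test_table t where t.test_column = 100;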
Just add set hive.execution.engine=mr; before your query and it will force Hive to use MR.

Exceeded the memory limit of 20 MB per session for prepared statements. Reduce the number or size of the prepared statements

I am trying to insert records into Azure SQL Data Warehouse using Oracle ODI, but I am getting an error after some records are inserted.
NOTE: I am trying to insert 1000 records, but the error comes after about 800.
Error Message: Caused By: java.sql.BatchUpdateException: 112007;Exceeded the memory limit of 20 MB per session for prepared statements. Reduce the number or size of the prepared statements.
While Abhijith's answer is technically correct, I'd like to suggest an alternative that will give you far better performance.
The root of your problem is that you've chosen the worst possible way to load a large volume of data into Azure SQL Data Warehouse. A long list of INSERT statements is going to perform very badly, no matter how many DWUs you throw at it, because it will always be a single-node operation.
My recommendation is to adapt your ODI process in the following way, assuming that your Oracle database is on-premises:
Write your extract to a file
Invoke AZCOPY to move the file to Azure blob storage
CREATE EXTERNAL TABLE to map a view over the file in storage
CREATE TABLE AS or INSERT INTO to read from that view into your target table
This will be orders of magnitude faster than your current approach.
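Steps 3 and 4 might look roughly like this (a sketch; the schema, table, column, data source, and file format names are all illustrative, and the EXTERNAL DATA SOURCE / EXTERNAL FILE FORMAT objects must already exist):

```sql
-- Map a view over the extract file that AZCOPY placed in blob storage
CREATE EXTERNAL TABLE ext.stg_records (
    record_id INT,
    payload   NVARCHAR(4000)
)
WITH (
    LOCATION    = '/extracts/records/',
    DATA_SOURCE = AzureBlobStore,    -- assumed pre-created external data source
    FILE_FORMAT = CsvFileFormat      -- assumed pre-created external file format
);

-- Load the target table in a single parallel operation
CREATE TABLE dbo.records
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM ext.stg_records;
```

Because the load runs through PolyBase, it is distributed across the compute nodes rather than funneled through a single connection.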
20 MB is the defined limit, and it is a hard limit for now. Reducing the batch size will certainly help you work around it.
Link to capacity limits:
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-service-capacity-limits

Can I filter data returned by the BigQuery connector for Spark?

I have adapted the instructions at Use the BigQuery connector with Spark to extract data from a private BigQuery object using PySpark. I am running the code on Dataproc. The object in question is a view with a cardinality of >500 million rows. When I issue this statement:
table_data = spark.sparkContext.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
in the job output I see:
Bigquery connector version 0.10.7-hadoop2
Creating BigQuery from default credential.
Creating BigQuery from given credential.
Using working path: 'gs://dataproc-9e5dc592-1a35-42e6-9dd6-5f9dd9c8df87-europe-west1/hadoop/tmp/bigquery/pyspark_input20181108091647'
Estimated number of shards < desired num maps (0 < 93); clipping to 0.
Computed '2' shards for sharded BigQuery export.
Table 'projectname:datasetname.viewname' to be exported has 0 rows and 0 bytes
Estimated number of shards < desired num maps (0 < 93); clipping to 0.
Computed '2' shards for sharded BigQuery export.
Table 'projectname:datasetname.viewname' to be exported has 0 rows and 0 bytes
(timestamp/message-level/namespace removed for readability)
That was over 2 hours ago and the job is still running; there has been no more output in that time. I have looked in the mentioned gs location and can see that a directory called shard-0 has been created, but it is empty. Essentially there has been no visible activity for the past 2 hours.
I'm worried that the bq connector is trying to extract the entirety of that view. Is there a way that I can issue a query to define what data to extract as opposed to extracting the entire view?
UPDATE
I was intrigued by this message in the output:
Estimated number of shards < desired num maps (0 < 93); clipping to 0
It seems strange to me that the estimated number of shards would be 0. I've taken a look at some of the code (ShardedExportToCloudStorage.java) that is executed here, and the above message is logged from computeNumShards(). Given numShards=0, I'm assuming that numTableBytes=0, which means the function call:
tableToExport.getNumBytes();
(ShardedExportToCloudStorage.java#L97)
is returning 0 and I assume that the reason for that is that the object I am accessing is a view, not a table. Am I onto something here or am I on a wild goose chase?
UPDATE 2...
To test out my theory (above) that the source object being a view is causing a problem I have done the following:
Created a table in the same project as my dataproc cluster
create table jt_test.jttable1 (col1 string)
Inserted data into it
insert into jt_test.jttable1 (col1) values ('foo')
Submitted a dataproc job to read the table and output the number of rows
Here's the code:
conf = {
    # Input Parameters.
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'myproject',
    'mapred.bq.input.dataset.id': 'jt_test',
    'mapred.bq.input.table.id': 'jttable1',
}
table_data = spark.sparkContext.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
print ('got table_data')
print (table_data.toDF().head(10))
print ('row tally={}'.format(table_data.toDF().count()))
When I run the dataproc pyspark job, here's the output:
18/11/08 14:59:26 INFO <cut> Table 'myproject:jt_test.jttable1' to be exported has 1 rows and 5 bytes
got table_data
[Row(_1=0, _2=u'{"col1":"foo"}')]
row tally=1
Create a view over the table
create view jt_test.v_jtview1 as select col1 from `myproject.jt_test.jttable1`
Run the same job but this time consume the view instead of the table
conf = {
    # Input Parameters.
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'myproject',
    'mapred.bq.input.dataset.id': 'jt_test',
    'mapred.bq.input.table.id': 'v_jtview1',
}
When I run the dataproc pyspark job, here's the output:
Table 'dh-data-dev-53702:jt_test.v_jtview1' to be exported has 0 rows and 0 bytes
and that's it! There is no more output and the job is still running, exactly the same as I explained above. It's effectively hung.
Seems to be a limitation of the BigQuery connector: I can't use it to consume from views.
To close the loop here, jamiet confirmed in the comments that the root cause is that BigQuery does not support export from views; it supports export only from tables.
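One workaround (a sketch, not from the original thread; the snapshot table name is illustrative) is to materialize the view into a real table first and point the connector at that, since only tables can be exported:

```sql
-- Run in BigQuery (standard SQL DDL) before submitting the Spark job
CREATE TABLE jt_test.jttable1_snapshot AS
SELECT col1 FROM jt_test.v_jtview1;
```

Then set 'mapred.bq.input.table.id' to the snapshot table. The same effect can be achieved with bq query and its --destination_table flag, which also lets you push a filter into the query so only the needed rows are materialized.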

Mapreduce job not launching when running hive query with where clause

I am using apache-hive-1.2.2 on Hadoop 2.6.0. When I run a hive query with a where clause, it gives results immediately without launching any MapReduce job. I'm not sure what is happening. The table has over 100k records.
I am quoting this from Hive Documentation
hive.fetch.task.conversion
Some select queries can be converted to a single FETCH task,
minimizing latency. Currently the query should be single sourced not
having any subquery and should not have any aggregations or distincts
(which incur RS – ReduceSinkOperator, requiring a MapReduce task),
lateral views and joins.
Any kind of aggregation, such as max, min, or count, is going to require a MapReduce job. So it depends on the query you run.
select * from tablename;
It just reads raw data from the files in HDFS, so it doesn't need MapReduce and is much faster.
This is due to the property "hive.fetch.task.conversion". The default value is "more" (Hive 2.1.0), which results in Hive trying to go straight at the data by launching a single fetch task instead of a MapReduce job wherever possible.
This behaviour, however, might not be desirable if you have a huge table (say 500 GB+), as it would cause a single thread to be launched instead of the multiple threads used by a MapReduce job.
You can set this property to "minimal" or "none" in hive-site.xml to bypass the behaviour.
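In hive-site.xml that would look like the following (a sketch; "none" is shown, "minimal" works the same way):

```xml
<property>
  <name>hive.fetch.task.conversion</name>
  <value>none</value>
</property>
```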

weird issue with Hive 0.12 in BigInsights 3.0

I have this simple query, which works fine in Hive 0.8 in IBM BigInsights 2.0:
SELECT * FROM patient WHERE hr > 50 LIMIT 5
However, when I run this query using Hive 0.12 in BigInsights 3.0, it runs forever and returns no results.
The scenario is actually the same for the following query and many others:
INSERT OVERWRITE DIRECTORY '/Hospitals/dir' SELECT p.patient_id FROM
patient1 p WHERE p.readingdate='2014-07-17'
If I exclude the WHERE part then it would be all fine in both versions.
Any idea what might be wrong with hive 0.12 or BigInsights3.0 when including WHERE clause in the query?
When you use a WHERE clause in a Hive query, Hive runs a map-reduce job to return the results. That's why the query usually takes longer: without the WHERE clause, Hive can simply return the content of the file that represents the table in HDFS.
You should check the status of the map-reduce job that is triggered by your query to find out if an error happened. You can do that by going to the Application Status tab in the BigInsights web console and clicking on Jobs, or by going to the job tracker web interface. If you see any failed tasks for that job, check the logs of the particular task to find out what error occurred. After fixing the problem, run the query again.