Error in Apache Drill when doing JOIN between two large tables - sql

I'm trying to do a JOIN between two tables, one with 1,250,910,444 records and the other with 385,377,113 records, using Apache Drill.
However, after 2 minutes of execution, it gives the following error:
org.apache.drill.common.exceptions.UserRemoteException: RESOURCE ERROR: One or more nodes ran out of memory while executing the query. Failure allocating buffer.
Fragment 1:2 [Error Id: 51b70ce1-29d5-459b-b974-8682cec41961 on sbsb35.ipea.gov.br:31010]
(org.apache.drill.exec.exception.OutOfMemoryException) Failure allocating buffer. io.netty.buffer.PooledByteBufAllocatorL.allocate():64 org.apache.drill.exec.memory.AllocationManager.<init>():80
org.apache.drill.exec.memory.BaseAllocator.bufferWithoutReservation():243
org.apache.drill.exec.memory.BaseAllocator.buffer():225 org.apache.drill.exec.memory.BaseAllocator.buffer():195
org.apache.drill.exec.vector.VarCharVector.allocateNew():394 org.apache.drill.exec.vector.NullableVarCharVector.allocateNew():239
org.apache.drill.exec.test.generated.HashTableGen1800$BatchHolder.<init>():137 org.apache.drill.exec.test.generated.HashTableGen1800.newBatchHolder():697
org.apache.drill.exec.test.generated.HashTableGen1800.addBatchHolder():690 org.apache.drill.exec.test.generated.HashTableGen1800.addBatchIfNeeded():679
org.apache.drill.exec.test.generated.HashTableGen1800.put():610 org.apache.drill.exec.test.generated.HashTableGen1800.put():549
org.apache.drill.exec.physical.impl.join.HashJoinBatch.executeBuildPhase():366
org.apache.drill.exec.physical.impl.join.HashJoinBatch.innerNext():222
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
...
java.lang.Thread.run():748
Drill configuration information I'm using: planner.memory_limit = 268435456
The server I'm using has 512GB of memory.
Could someone suggest how to solve this problem? Would creating an index on each table be a solution? If so, how do I do that in Drill?

Currently Apache Drill does not support indexing.
Your query fails during the execution stage, so planner.memory_limit won't have any effect.
Currently all you can do is allocate more memory:
make sure you have enough direct memory allocated in drill-env.sh;
use the planner.memory.max_query_memory_per_node option (a sketch of both follows).
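A minimal sketch of both knobs, assuming the 512 GB node mentioned in the question; the values are placeholders, not recommendations:
-- in drill-env.sh (shell): export DRILL_MAX_DIRECT_MEMORY="64G"
-- then, per session, raise the per-node query memory (value in bytes, here 20 GB):
ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 21474836480;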
There is ongoing work in the community to allow spill to disk for the Hash Join
but it's still in progress (https://issues.apache.org/jira/browse/DRILL-6027).

Try setting planner.memory.max_query_memory_per_node to the maximum possible value:
ALTER SESSION SET `planner.memory.max_query_memory_per_node` = (some value in bytes)
Make sure the parameter is set for this session.
Check DRILL_HEAP and DRILL_MAX_DIRECT_MEMORY too.
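To confirm what the session actually picked up, Drill's sys.options system table can be queried (the LIKE filter is just illustrative):
SELECT * FROM sys.options WHERE name LIKE '%memory%';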

Related

DataStage 11.5 CFF stage throwing exception : APT_BabAlloc : Heap allocation failed

Can you please help me solve this? I am using DataStage 11.5, and in the CFF stage of one of my jobs I am getting an allocation failed error, which aborts the job whenever a large CFF file comes in.
My job simply converts a CFF file into a text file.
Errors in job log show:
Message: main_program: Current heap size: 2,072,104,336 bytes in 4,525,666 blocks
Message: main_program: Fatal Error: Throwing exception: APT_BadAlloc: Heap allocation failed. [error_handling/exception.C:132]
According to https://www.ibm.com/support/pages/datastage-cff-stage-job-fails-message-aptbadalloc-heap-allocation-failed, the Complex Flat File (CFF) stage is a composite operator and inserts a Promote Sub-Record operator for every subrecord. Too many of them can exhaust the available heap. To further diagnose the problem, set the environment variable APT_DUMP_SCORE=True and check the score dump in the log to see whether the job is creating too many Promote Sub-Record operators, which could be exhausting the available heap. To improve performance and reduce memory usage, the table definition should be optimized further.
Resolving the problem
Here is what you can do to reduce the number of Promote Sub-Record operators:
Save the table definition from the CFF stage in the job.
Clear all the columns in the CFF stage.
Reload the table definition from the saved table definition in step 1, checking the check box 'Remove group columns'. This step removes the additional group columns.
Check the layout; it should have the same record length as the original job. After reloading, the table structure will be flat (no more hierarchy).
After the above steps, the OSH script generated by the Complex Flat File stage will no longer contain Promote Sub-Record operators; performance will improve and memory usage will be reduced to a minimum.

ERROR : FAILED: Error in acquiring locks: Error communicating with the metastore org.apache.hadoop.hive.ql.lockmgr.LockException

Getting the "Error in acquiring locks" when trying to run count(*) on partitioned tables.
The table has 365 partitions; when filtered on <= 350 partitions, the queries work fine.
When I try to include more partitions in the query, it fails with the error.
I am working on Hive-managed ACID tables, with the following default values:
hive.support.concurrency=true // cannot set it to false; it throws "<table> is missing from the ValidWriteIdList config: null", and it should be true for ACID read and write.
hive.lock.manager=org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.txn.strict.locking.mode=false
hive.exec.dynamic.partition.mode=nonstrict
Tried increasing/decreasing values for the following in a Beeline session:
hive.lock.numretries
hive.unlock.numretries
hive.lock.sleep.between.retries
hive.metastore.batch.retrieve.max={default 300} //changed to 10000
hive.metastore.server.max.message.size={default 104857600} // changed to 10485760000
hive.metastore.limit.partition.request={default -1} //did not change as -1 is unlimited
hive.lock.query.string.max.length={default 10000} //changed to higher value
Using the HDI-4.0 interactive-query-llap cluster; the metastore is backed by the default SQL Server provided with it.
The problem is NOT due to the service tier of the Hive metastore database.
Based on the symptom, it is most probably due to too many partitions in one query.
I have met the same issue several times.
In hivemetastore.log, you should be able to see an error like this:
metastore.RetryingHMSHandler: MetaException(message:Unable to update transaction database com.microsoft.sqlserver.jdbc.SQLServerException: The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request.
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:578)
This is because, in the Hive metastore, each partition involved in the Hive query requires up to 8 parameters to acquire a lock.
Some possible workarounds:
Decompose the query into multiple sub-queries so that each reads from fewer partitions (see the sketch after this list).
Reduce the number of partitions by setting different partition keys.
Remove partitioning if partition keys don't have any filters.
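A minimal sketch of the decomposition idea, assuming a hypothetical table and date partition column (my_acid_table, part_date); each statement touches only part of the 365 partitions and the counts are summed afterwards:
SELECT COUNT(*) FROM my_acid_table WHERE part_date BETWEEN '2021-01-01' AND '2021-06-30';
SELECT COUNT(*) FROM my_acid_table WHERE part_date BETWEEN '2021-07-01' AND '2021-12-31';
-- each query now acquires locks on far fewer partitions, keeping the metastore parameter count under 2100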
The following parameters manage the batch size for the INSERT queries generated by direct SQL. Their default value is 1000. Set both of them to 100 (as a good starting point) in the Custom hive-site section of Hive configs via Ambari and restart ALL Hive-related components (including the Hive metastore).
hive.direct.sql.max.elements.values.clause=100
hive.direct.sql.max.elements.in.clause=100
We also faced the same error in HDInsight, and after making many configuration changes similar to what you have done, the only thing that worked was scaling our Hive metastore SQL DB server.
We had to scale it all the way to a P2 tier with 250 DTUs for our workloads to run without these lock exceptions. As you may know, as the tier and DTU count increase, the SQL server's IOPS and response time improve, so we suspected that metastore performance was the root cause of these lock exceptions as workloads increased.
The following link provides information about DTU-based performance variation of SQL servers in Azure.
https://learn.microsoft.com/en-us/azure/sql-database/sql-database-service-tiers-dtu
Additionally, as far as I know, the default Hive metastore that gets provisioned when you opt not to provide an external DB at cluster creation is just an S1 tier DB, which is not suitable for any high-capacity workloads. As a best practice, always provision your metastore external to the cluster and attach it at cluster provisioning time. This gives you the flexibility to connect the same metastore to multiple clusters (so that your Hive layer schema can be shared across multiple clusters, e.g. Hadoop for ETL and Spark for processing / machine learning), and full control to scale your metastore up or down as needed at any time.
The only way to scale the default metastore is by engaging the Microsoft support.
We faced the same issue in HDINSIGHT. We solved it by upgrading the metastore.
The default metastore had only 5 DTUs, which is not recommended for production environments. So we migrated to a custom metastore, spun up an Azure SQL Server (P2, 250 DTUs or above), and set the properties below:
hive.direct.sql.max.elements.values.clause=200
hive.direct.sql.max.elements.in.clause=200
The above values are set because SQL Server cannot process more than 2100 parameters. When you have more than roughly 348 partitions you hit this issue, since each partition creates 8 parameters for the metastore (8 x 348 = 2784, which exceeds the limit).

Exceeded the memory limit of 20 MB per session for prepared statements. Reduce the number or size of the prepared statements

I am trying to insert records into Azure SQL Data Warehouse using Oracle ODI, but I am getting an error after some records have been inserted.
NOTE: I am trying to insert 1,000 records, but the error comes after about 800.
Error Message: Caused By: java.sql.BatchUpdateException: 112007;Exceeded the memory limit of 20 MB per session for prepared statements. Reduce the number or size of the prepared statements.
While Abhijith's answer is technically correct, I'd like to suggest an alternative that will give you far better performance.
The root of your problem is that you've chosen the worst-possible way to load a large volume of data into Azure SQL Data Warehouse. A long list of INSERT statements is going to perform very badly, no matter how many DWUs you throw at it, because it is always going to be a single-node operation.
My recommendation is to adapt your ODI process in the following way, assuming that your Oracle database is on-premises.
Write your extract to a file
Invoke AZCOPY to move the file to Azure blob storage
CREATE EXTERNAL TABLE to map a view over the file in storage
CREATE TABLE AS or INSERT INTO to read from that view into your target table
This will be orders of magnitude faster than your current approach.
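A rough T-SQL sketch of steps 3 and 4, with placeholder names for the file format, external table, and target table; the external data source (MyBlobStorage) and its credential are assumed to have been created for your storage account already:
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

CREATE EXTERNAL TABLE dbo.StagingExtract (
    id INT,
    payload NVARCHAR(4000)
)
WITH (LOCATION = '/extracts/', DATA_SOURCE = MyBlobStorage, FILE_FORMAT = CsvFormat);

-- CTAS reads the file in parallel across the DW nodes instead of issuing row-by-row INSERTs
CREATE TABLE dbo.TargetTable
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM dbo.StagingExtract;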
20 MB is the defined limit, and it is a hard limit for now. Reducing the batch size will certainly help you work around it.
Link to capacity limits.
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-service-capacity-limits

Error message: "PostgreSQL said: could not write block 119518 of temporary file: No space left on device" PostgreSQL

I have a query that, intuitively, should work just fine. But, almost immediately after executing, I am served with this error message:
ERROR: could not write block 119518 of temporary file: No space left on device
Query failed
PostgreSQL said: could not write block 119518 of temporary file: No space left on device
I have approximately 4.5GB in free storage space.
I've tried whittling down the memory usage by replacing each of the CTEs with materialized views in the hope that would reduce the need for processing power.
Additionally, I've taken these steps:
--I boosted our AWS instance to the memory-optimized db.r4.16xlarge.
--I ran analyze verbose and vacuum full analyze
--I've stopped all other processes
The query does some small processing in two CTEs and then joins a table (roughly 20M rows) with a smaller lookup table (roughly 500K rows).
Just in case someone else runs into this, I found out the answer. And, it was really simple: just go into the RDS admin panel and increase allocated storage.
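If you want to confirm that temporary files (rather than table data) are what is filling the disk, the standard pg_stat_database view records how much temp-file space each database has written since the last stats reset (a diagnostic sketch, nothing RDS-specific):
SELECT datname,
       temp_files,
       pg_size_pretty(temp_bytes) AS temp_written
FROM pg_stat_database
ORDER BY temp_bytes DESC;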

Why will my SQL Transaction log file not auto-grow?

The Issue
I've been running a particularly large query, generating millions of records to be inserted into a table. Each time I run the query I get an error reporting that the transaction log file is full.
I've managed to get a test query to run with a reduced set of results by using SELECT INTO instead of INSERT INTO a pre-built table. This reduced set of results generated a 20 GB table with 838,978,560 rows.
When trying to INSERT into the pre-built table, I've also tried it with and without a clustered index. Both failed.
Server Settings
The server is running SQL Server 2005 (Full not Express).
The database being used is set to SIMPLE recovery and there is space available (around 100 GB) on the drive that the file sits on.
The transaction log file is set to grow by 250 MB, up to a maximum of 2,097,152 MB.
The log file appears to grow as expected until it reaches 4729 MB.
When the issue first appeared the file grew to a lower value; however, I've since reduced the size of other log files on the same server, and this appears to let this transaction log file grow further by the same amount as the reduction in the other files.
I've now run out of ideas of how to solve this. If anyone has any suggestion or insight into what to do it would be much appreciated.
First, you want to avoid auto-growth whenever possible; auto-growth events are huge performance killers. If you have 100 GB available, why not change the log file size to something like 20 GB (just temporarily, while you troubleshoot this)? My policy has always been to use 90%+ of the disk space allocated for a specific MDF/NDF/LDF file. There's no reason not to.
If you are using SIMPLE recovery, SQL Server is supposed to manage the task of returning unused space, but sometimes it does not do a great job. Before running your query, check the available free log space. You can do this by:
right-click the DB > go to Tasks > Shrink > Files.
change the type to "Log"
This will help you understand how much unused space you have. You can set "Reorganize pages before releasing unused space > Shrink File" to 0. Moving forward you can also release unused space using CHECKPOINT; this may be something to include as a first step before your query runs.
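A minimal T-SQL sketch of the same ideas; the database and logical log file names are placeholders:
-- see how full each transaction log currently is
DBCC SQLPERF(LOGSPACE);

-- pre-size the log instead of relying on 250 MB auto-growth steps (size illustrative)
ALTER DATABASE MyDb
MODIFY FILE (NAME = MyDb_log, SIZE = 20GB);

-- under SIMPLE recovery, a checkpoint lets committed log space be reused before the big insert
CHECKPOINT;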