dbt tests store_failures behaviour when limit is specified - why is limit ignored when inserting data into the failure table? - dbt

I am asking this question in the context of BigQuery as the target database.
The dbt documentation for the limit config says ...
Limit the number of failures that will be returned by a test query. We recommend using this config when working with large datasets and storing failures in the database.
I am not sure I understand the recommendation. When I use the limit config along with store_failures, all failed rows are inserted into the failure table; the limit is only applied when selecting from the failure table. What is the benefit of limit with large datasets when storing failures in a BigQuery table?
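For reference, a minimal sketch of the kind of test configuration I mean (the test file, model, and column names are just placeholders):
-- tests/assert_no_negative_amounts.sql (hypothetical singular test)
{{ config(store_failures = true, limit = 500) }}

select *
from {{ ref('orders') }}
where amount < 0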

Related

Exceeded the memory limit of 20 MB per session for prepared statements. Reduce the number or size of the prepared statements

I am trying to insert records into Azure SQL Data Warehouse using Oracle ODI, but I am getting an error after some of the records have been inserted.
NOTE: I am trying to insert 1000 records, but the error comes after 800.
Error Message: Caused By: java.sql.BatchUpdateException: 112007;Exceeded the memory limit of 20 MB per session for prepared statements. Reduce the number or size of the prepared statements.
While Abhijith's answer is technically correct, I'd like to suggest an alternative that will give you far better performance.
The root of your problem is that you've chosen the worst possible way to load a large volume of data into Azure SQL Data Warehouse. A long list of INSERT statements is going to perform very badly, no matter how many DWUs you throw at it, because it is always going to be a single-node operation.
My recommendation is to adapt your ODI process in the following way, assuming that your Oracle database is on-premises:
Write your extract to a file
Invoke AZCOPY to move the file to Azure blob storage
CREATE EXTERNAL TABLE to map a view over the file in storage
CREATE TABLE AS or INSERT INTO to read from that view into your target table
This will be orders of magnitude faster than your current approach.
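As a rough illustration of steps 3 and 4, the PolyBase side might look something like this (a sketch only; the data source, credential, file format, object names, and column list are all hypothetical, and the extract is assumed to already be in blob storage):
-- Assumes a database scoped credential for the storage account already exists.
CREATE EXTERNAL DATA SOURCE AzureBlobStage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://staging@mystorageaccount.blob.core.windows.net',
    CREDENTIAL = BlobStorageCredential
);

CREATE EXTERNAL FILE FORMAT PipeDelimitedText
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- Maps a "view" over the extract file(s) sitting in blob storage.
CREATE EXTERNAL TABLE stg.MyExtract_ext (
    Id INT,
    SomeColumn NVARCHAR(100)
)
WITH (
    LOCATION = '/extracts/my_extract/',
    DATA_SOURCE = AzureBlobStage,
    FILE_FORMAT = PipeDelimitedText
);

-- CTAS reads from the external table in parallel across the distributions.
CREATE TABLE dbo.MyTargetTable
WITH (DISTRIBUTION = HASH(Id))
AS
SELECT Id, SomeColumn
FROM stg.MyExtract_ext;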
20 MB is the defined limit, and it is a hard limit for now. Reducing the batch size will certainly help you work around it.
Link to capacity limits.
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-service-capacity-limits

SparkSQL JDBC writer fails with "Cannot acquire locks error"

I'm trying to insert 50 million rows from a Hive table into a SQL Server table using the SparkSQL JDBC writer. Below is the line of code that I'm using to insert the data:
mdf1.coalesce(4).write.mode(SaveMode.Append).jdbc(connectionString, "dbo.TEST_TABLE", connectionProperties)
The Spark job is failing after processing 10 million rows with the error below:
java.sql.BatchUpdateException: The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.
But the same job succeeds if I use the below line of code.
mdf1.coalesce(1).write.mode(SaveMode.Append).jdbc(connectionString, "dbo.TEST_TABLE", connectionProperties)
I'm trying to open 4 parallel connections to SQL Server to optimize performance, but the job keeps failing with the "cannot acquire locks" error after processing 10 million rows. Also, if I limit the dataframe to just a few million rows (less than 10 million), the job succeeds even with four parallel connections.
Can anybody tell me whether SparkSQL can be used to export huge volumes of data into an RDBMS, and whether I need to make any configuration changes on the SQL Server table?
Thanks in Advance.

Google BigQuery large table (105M records) with 'Order Each by' clause produces "Resources Exceeded Query Execution" error

I am running into a serious "Resources Exceeded Query Execution" error when querying a large Google BigQuery table (105M records) with an 'Order Each By' clause.
Here is the sample query (which uses the public dataset: Wikipedia):
SELECT Id,Title,Count(*) FROM [publicdata:samples.wikipedia] Group EACH by Id, title Order by Id, Title Desc
How can I solve this without adding the LIMIT keyword?
Using ORDER BY on big data databases is not an ordinary operation, and at some point it exceeds the limits of big data resources. You should consider sharding your query or running the ORDER BY over your exported data.
As I explained to you today in your other question, adding allowLargeResults will allow you to return a large response, but you can't specify a top-level ORDER BY, TOP or LIMIT clause. Doing so negates the benefit of using allowLargeResults, because the query output can no longer be computed in parallel.
One option here that you may try is sharding your query.
where ABS(HASH(Id) % 4) = 0
You can play with the above parameters to achieve smaller result sets and then combine them.
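For example, one shard of the question's query might look like this in legacy SQL (run it once for each remainder value 0 through 3 and combine the ordered outputs):
SELECT Id, Title, COUNT(*) AS cnt
FROM [publicdata:samples.wikipedia]
WHERE ABS(HASH(Id) % 4) = 0
GROUP EACH BY Id, Title
ORDER BY Id, Title DESC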
Also read Chapter 9 - Understanding Query Execution; it explains how sharding works internally.
You should also read Launch Checklist for BigQuery
I've run into the same problem and fixed it with the following steps:
Run the query without ORDER BY and save in a dataset table.
Export the content from that table to a bucket in GCS using wildcard (BUCKETNAME/FILENAME*.csv)
Download the files to a folder in your machine.
Install XAMPP (you may get a UAC warning) and change some settings afterwards.
Start Apache and MySQL in your XAMPP control panel.
Install HeidiSQL and establish the connection with your MySQL server (installed with XAMPP).
Create a database and a table with its fields.
Go to Tools > Import CSV file, configure accordingly and import.
Once all data is imported, do the ORDER BY and export the table.
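A sketch of the table-creation and final ordering steps in MySQL (the table and column names are hypothetical; the CSV import itself is done through HeidiSQL's dialog):
CREATE TABLE wikipedia_counts (
  Id BIGINT,
  Title VARCHAR(255),
  cnt BIGINT
);

-- After importing the CSV files:
SELECT Id, Title, cnt
FROM wikipedia_counts
ORDER BY Id, Title DESC;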

Hive Update and Delete

I am using Hive version 1.0.0, Hadoop 2.6.0, and the Cloudera ODBC driver. When I try to update and delete data in the Hive database through the Cloudera Hive ODBC driver, it throws an error. Here are my errors.
What I have done:
CREATE:
create database geometry;
create table odbctest (EmployeeID Int,FirstName String,Designation String, Salary Int,Department String)
clustered by (department)
into 3 buckets
stored as orcfile
TBLPROPERTIES ('transactional'='true');
Table created.
INSERT:
insert into table geometry.odbctest values(10,'Hive','Hive',0,'B');
With the above query, the data is inserted into the database.
UPDATE:
When I try to update, I get the following error:
update geometry.odbctest set salary = 50000 where employeeid = 10;
SQL> update geometry.odbctest set salary = 50000 where employeeid = 10;
[S1000][Cloudera][HiveODBC] (55) Insert operation is not support for
table: HIVE.geometry.odbctest
[ISQL]ERROR: Could not SQLPrepare
DELETE:
When I try to delete, I get the following error:
delete from geometry.odbctest where employeeid=10;
SQL> delete from geometry.odbctest where employeeid=10;
[S1000][Cloudera][HiveODBC] (55) Insert operation is not support for table: HIVE.geometry.odbctest
[ISQL]ERROR: Could not SQLPrepare
Can anyone help me out?
You have done a couple of required steps properly:
ORC format
Bucketed table
A likely cause is that one or more of the following Hive settings were not included:
These configuration parameters must be set appropriately to turn on
transaction support in Hive:
hive.support.concurrency – true
hive.enforce.bucketing – true
hive.exec.dynamic.partition.mode – nonstrict
hive.txn.manager – org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.compactor.initiator.on – true (for exactly one instance of the Thrift metastore service)
hive.compactor.worker.threads – a positive number on at least one instance of the Thrift metastore service
The full requirements for transaction support are here: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
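For example, the client-side parameters can be enabled per session as shown below (a sketch; the compactor properties belong in the metastore service configuration rather than a client session):
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- On the metastore service (e.g. in hive-site.xml):
-- hive.compactor.initiator.on=true
-- hive.compactor.worker.threads=1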
If you have verified the above settings are in place, then run
describe extended odbctest;
to evaluate its transaction-related characteristics.
I stumbled across the same issue when connecting to Hive 1.2 using the Simba ODBC driver distributed by Cloudera (v 2.5.12.1005 64-bit). After verifying everything in javadba's post, I did some additional digging and found the problem to be a bug in the ODBC driver.
I was able to resolve the issue by using the Progress DataDirect driver, and it looks like the version of the driver distributed by hortonworks works as well (links to both solutions below).
https://www.progress.com/data-sources/apache-hive-drivers-for-odbc-and-jdbc
http://hortonworks.com/hdp/addons/
Hope that helps anyone who may still be struggling!
You should not think about Hive as a regular RDBMS, Hive is better suited for batch processing over very large sets of immutable data.
Here is what you can find in the Hive documentation:
Hadoop is a batch processing system and Hadoop jobs tend to have high
latency and incur substantial overheads in job submission and
scheduling. As a result - latency for Hive queries is generally very
high (minutes) even when data sets involved are very small (say a few
hundred megabytes). As a result it cannot be compared with systems
such as Oracle where analyses are conducted on a significantly smaller
amount of data but the analyses proceed much more iteratively with the
response times between iterations being less than a few minutes. Hive
aims to provide acceptable (but not optimal) latency for interactive
data browsing, queries over small data sets or test queries.
Hive is not designed for online transaction processing and does not
offer real-time queries and row level updates. It is best used for
batch jobs over large sets of immutable data (like web logs).
As of now Hive does not support Update and Delete Operations on the data in HDFS.
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

Copy failed records to DynamoDB

I am copying 50 million records to Amazon DynamoDB using a Hive script. The script failed after running for 2 days with an item size exceeded exception.
Now if I restart the script, it will start the insertions again from the first record. Is there a way I can say something like "insert only those records which are not already in DynamoDB"?
You can use conditional writes to only write the item if the specified attributes are not equal to the values you provide. This is done by using a ConditionExpression in a PutItem request. However, it still uses write capacity even if the write fails (emphasis mine), so this may not even be the best option for you:
If a ConditionExpression fails during a conditional write, DynamoDB
will still consume one write capacity unit from the table. A failed
conditional write will return a ConditionalCheckFailedException
instead of the expected response from the write operation. For this
reason, you will not receive any information about the write capacity
unit that was consumed. However, you can view the
ConsumedWriteCapacityUnits metric for the table in Amazon CloudWatch
to determine the provisioned write capacity that was consumed from the
table.