ADFv2 Lookup Activity timeout after 2 hours even after increasing query timeout value - azure-sql-database

I have a Lookup activity that times out after 2 hours (120 minutes), which is the default (could just be a coincidence), even after increasing the query timeout to 720 minutes.
The Lookup activity executes a stored procedure based on an expression.
This is the error:
Failure happened on 'Source' side.
ErrorCode=UserErrorSourceQueryTimeout,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Query source database timeout after '7200' seconds.,Source=Microsoft.DataTransfer.DataContracts,''Type=System.TimeoutException,Message=,Source=Microsoft.DataTransfer.DataContracts,'
Is there a step that I've missed out somewhere?

There are actually two timeouts on the Lookup activity: the Lookup activity timeout itself and queryTimeout. Make sure to set the queryTimeout value lower than the Lookup activity timeout value; queryTimeout is not effective if it is greater than the Lookup activity timeout (24 hrs).
Note: When you use a query or stored procedure to look up data, make sure it returns exactly one result set; otherwise, the Lookup activity fails.
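For example, a stored procedure used as the Lookup source could look roughly like this (a minimal sketch; the procedure, table, and column names are placeholders):
CREATE OR ALTER PROCEDURE dbo.usp_GetLookupValues
AS
BEGIN
    -- Suppress row-count messages so only the result set comes back.
    SET NOCOUNT ON;
    -- Exactly one SELECT: the Lookup activity expects a single result set
    -- (and reads at most 5,000 rows from it).
    SELECT TOP (5000) SomeKey, SomeValue
    FROM dbo.SomeConfigTable;
END;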
Refer: Lookup activity in Azure Data Factory and Azure Synapse Analytics

Related

Azure Data Factory - Rerun Failed Pipeline Against Azure SQL Table With Differential Date Filter

I am using ADF to keep an Azure SQL DB in sync with an on-prem DB. The on-prem DB is read only and the direction is one-way, from the Azure SQL DB to the on-prem DB.
My source table in the Azure SQL cloud DB is quite large (tens of millions of rows), so I have the pipeline set to use an UPSERT (a merge, trying to create a differential merge). I am using a filter on the source table, and the filter query has a WHERE condition that looks like this:
[HistoryDate] >= '#{formatDateTime(pipeline().parameters.windowStart, 'yyyy-MM-dd HH:mm' )}'
AND [HistoryDate] < '#{formatDateTime(pipeline().parameters.windowEnd, 'yyyy-MM-dd HH:mm' )}'
The HistoryDate column is auto-maintained in the source table with a getUTCDate() type approach. New records will always get a higher value and be included in the WHERE condition.
This works well, but here is my question: I am testing on my local machine before deploying to the client. When I am not working, my laptop hibernates and the pipeline rightfully fails because my local SQL instance is "offline" during that run. When I move this to production this should not be an issue (a computer hibernating), but what happens if the client's connection is temporarily lost (i.e., the client loses internet for a time)? Because my pipeline has a WHERE condition on the source to keep the upsert to a practical number of rows, any failure would result in the loss of any data created during that 5-minute window.
A failed pipeline can be rerun, but the rerun happens at a different moment in time, so I would effectively miss the block of records that would have been picked up if the pipeline had run on schedule: pipeline().parameters.windowStart and pipeline().parameters.windowEnd will now be different.
As an FYI, I have this running every 5 minutes to keep the local copy in sync as close to real-time as possible.
Am I approaching this correctly? I'm sure others have this scenario and it's likely I am missing something obvious. :-)
Thanks...
Sorry to answer my own question, but to potentially help others in the future, it turns out there is a better way to deal with this.
ADF offers a "Metadata-driven Copy Task" utility/wizard on the home screen that creates a pipeline. When I used it, it offered a "Delta Load" option for tables which takes a "watermark". The watermark is a column such as an incrementing IDENTITY column, an increasing date, or a timestamp. At the end of the wizard, it lets you download a script that builds a table and a corresponding stored procedure that maintains the watermark value after each run. For example, if I base my delta load on an IDENTITY column, it stores the maximum value of that column for a particular pipeline run. The next time a run happens (trigger), it uses that stored value (minus 1) as the MIN and the current MAX value of the IDENTITY column to pick up the records added since the last run.
I was going to approach things this way, but it seems like ADF already does this heavy lifting for us. :-)
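For reference, here is a minimal sketch of the watermark pattern in plain SQL (table and column names are hypothetical, not what the wizard actually generates; HistoryDate is the audit column from the question):
-- One row per source table being loaded incrementally.
CREATE TABLE dbo.WatermarkTable
(
    TableName      sysname      NOT NULL PRIMARY KEY,
    WatermarkValue datetime2(3) NOT NULL
);

-- Delta query used by the copy: only rows changed since the last successful run.
SELECT s.*
FROM dbo.SourceTable AS s
WHERE s.HistoryDate >  (SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = 'SourceTable')
  AND s.HistoryDate <= (SELECT MAX(HistoryDate) FROM dbo.SourceTable);

-- After a successful copy, advance the watermark to the newest value just loaded.
UPDATE dbo.WatermarkTable
SET WatermarkValue = (SELECT MAX(HistoryDate) FROM dbo.SourceTable)
WHERE TableName = 'SourceTable';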

SQL timeout when inserting data into temp table in custom application

I have a console .NET application that reads a single value from a one-line file.
The application had SQL timeout issues on a few days last month, and I am working to find the root cause.
The logic in the app uses that single value to pull data from base tables where a column value is higher than the value read from the file.
The data pulled from joins on the base tables is dumped into two temporary tables, which appear in the attached script.
The two temp tables are joined with base tables, and the joined data is dumped into one final temp table (AccMatters), from which we update the base/permanent tables after checking certain business logic for charge code validation (time charged by employees/users working on certain matters for the company carries a charge code used for charging time).
The attached SQL code is what gave the timeout issue; the insertion into the temporary table AccMatters is where it occurs. Comments are included in the SQL code to explain it.
The script contains the code up to the dump into the last temp table, since the logs of the .NET console application (which has the SQL statements embedded in it) show the timeout occurred at that point.
The issue occurred on three days last month, and the volume of records inserted into the last temporary table was 800+ rows on the days the timeout occurred.
When executed in the production environment, the script takes a few minutes, which is far less than the 20-minute timeout set in the application.
Finally, the custom app updates the file with the new, higher value from the base table, and that file value is used again in the next run of the application.
Any help identifying possible SQL Server code issues in the attached script would be appreciated, to help find the root cause for the days the customer reported the problem.
If that is the case, you need to run a few diagnostic scripts to find out what's happening on the server.
1) Check for reader/writer conflicts and open transactions:
DBCC OPENTRAN ('dbname');
2) Check tempdb latency and the log file growth of tempdb (see the sketch after this list).
3) Check for any blocked sessions/processes:
SELECT * FROM sys.sysprocesses WHERE blocked <> 0;
SELECT * FROM sys.sysprocesses WHERE spid IN (SELECT blocked FROM sys.sysprocesses WHERE blocked <> 0);
4) Check whether the proc shows up among the high-impact queries (elapsed time, worker time, logical reads):
-- Top recently compiled queries with their average cost per execution.
SELECT TOP 10
       t.TEXT AS 'SQL Text'
     , st.execution_count
     , ISNULL(st.total_elapsed_time / st.execution_count, 0) AS 'AVG Execution Time'
     , st.total_worker_time / st.execution_count AS 'AVG Worker Time'
     , st.total_worker_time
     , st.max_logical_reads
     , st.max_logical_writes
     , st.creation_time
     , ISNULL(st.execution_count / NULLIF(DATEDIFF(second, st.creation_time, GETDATE()), 0), 0) AS 'Calls Per Second'
FROM sys.dm_exec_query_stats st
CROSS APPLY sys.dm_exec_sql_text(st.sql_handle) t
ORDER BY st.creation_time DESC;
5) Use Activity Monitor to check whether the response time of tempdb is elevated.
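For step 2, here is a minimal sketch of a tempdb latency check using sys.dm_io_virtual_file_stats (what counts as "high" latency is up to you; this just reports per-file averages):
-- Average read/write latency per tempdb file, plus current file size.
SELECT mf.name       AS logical_file_name,
       mf.type_desc  AS file_type,
       vfs.num_of_reads,
       CASE WHEN vfs.num_of_reads  = 0 THEN 0 ELSE vfs.io_stall_read_ms  / vfs.num_of_reads  END AS avg_read_latency_ms,
       vfs.num_of_writes,
       CASE WHEN vfs.num_of_writes = 0 THEN 0 ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_latency_ms,
       vfs.size_on_disk_bytes / 1024 / 1024 AS size_on_disk_mb
FROM sys.dm_io_virtual_file_stats(DB_ID('tempdb'), NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id;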
To start with, I would really look at perfmon counters and check for abnormal growth of the tempdb log file. I would also suggest creating another, similar proc with a different name that uses global temp tables; debugging that would give you a good idea of what's happening on the server.

Azure DataFactory Pipeline Timeout

Currently we have a table with more than 200k records, so when we move the data from the source Azure SQL database to another SQL database it takes a long time (more than 3 hours), resulting in a timeout error. Initially we set the timeout to 1 hour; because of the timeout error we increased the interval to 3 hours, but it is still not working.
This is how we have defined the process.
Two datasets -> input and output
One pipeline
Inside the pipeline we have a query like select * from table;
and we have a stored procedure whose script essentially does:
Delete all records from the table.
Insert statements to insert all records.
This is time consuming, so we have decided to update and insert only the data that was modified or inserted in the last 24 hours, based on a date column.
So, is there any functionality in an Azure pipeline that picks up the records inserted or updated in the source Azure SQL DB in the last 24 hours, or do we need to do this in the destination SQL stored procedure?
In Azure Data Factory, the Copy activity sink has an option called writeBatchSize. We can set this value to flush the data in batches instead of flushing each record.
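For the incremental part of the question, one common pattern is to filter the source query on an audit column and then upsert on the destination instead of delete-and-reload. A rough sketch, assuming a ModifiedDate audit column and a key column Id (both hypothetical), with the changed rows handed to the destination proc via a table-valued parameter:
-- Source dataset query: only rows touched in the last 24 hours.
SELECT *
FROM dbo.SourceTable
WHERE ModifiedDate >= DATEADD(HOUR, -24, GETUTCDATE());

-- Destination stored procedure body: upsert the changed rows.
MERGE dbo.TargetTable AS t
USING @changedRows AS s        -- table-valued parameter holding the copied rows
    ON t.Id = s.Id
WHEN MATCHED THEN
    UPDATE SET t.SomeColumn   = s.SomeColumn,
               t.ModifiedDate = s.ModifiedDate
WHEN NOT MATCHED THEN
    INSERT (Id, SomeColumn, ModifiedDate)
    VALUES (s.Id, s.SomeColumn, s.ModifiedDate);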

Copy failed records to dynamo db

I am copying 50 million records to Amazon DynamoDB using a Hive script. The script failed after running for 2 days with an item-size-exceeded exception.
Now, if I restart the script, it will start the insertions again from the first record. Is there a way to say something like "insert only those records which are not already in DynamoDB"?
You can use conditional writes to write an item only if specified attributes are not equal to the values you provide. This is done by using a ConditionExpression on a PutItem request. However, it still uses write capacity even if the write fails (emphasis mine), so this may not even be the best option for you:
If a ConditionExpression fails during a conditional write, DynamoDB will still consume one write capacity unit from the table. A failed conditional write will return a ConditionalCheckFailedException instead of the expected response from the write operation. For this reason, you will not receive any information about the write capacity unit that was consumed. However, you can view the ConsumedWriteCapacityUnits metric for the table in Amazon CloudWatch to determine the provisioned write capacity that was consumed from the table.
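As a side note, if you end up driving the writes yourself rather than through Hive, the same "insert only if absent" behaviour can also be expressed in DynamoDB's SQL-compatible PartiQL, where a plain INSERT is rejected with a DuplicateItemException when an item with the same primary key already exists, so rows written before the failure are skipped on a re-run (the table and attribute names below are placeholders):
INSERT INTO "MyTargetTable"
VALUE {'id': '12345', 'payload': 'example data'}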

Bigquery load job said successful but data did not get loaded into table

I submitted a BigQuery load job; it ran and returned with a successful status, but the data didn't make it into the destination table.
Here was the command that was run:
/usr/local/bin/bq load --nosynchronous_mode --project_id=ardent-course-601 --job_id=logsToBq_load_impressions_20140816_1674a956_6c39_4859_bc45_eb09db7ef99a --source_format=NEWLINE_DELIMITED_JSON dw_logs_impressions.impressions_20140816 gs://sm-uk-hadoop/queries/logsToBq_transformLogs/impressions/20140816/9307f6e3-0b3a-44ca-8571-7107c399998c/part* /opt/sm-analytics/projects/logsTobqMR/jsonschema/impressionsSchema.txt
I checked the job status of the job logsToBq_load_impressions_20140816_1674a956_6c39_4859_bc45_eb09db7ef99a. The input file count and size showed the correct number of input files and total size.
Does anyone know why the data didn't make it into the table even though the job is reported as successful?
To rule out a mistake on our side, I ran the load job again, but to a different destination table, and this time the data made it into the destination table fine.
Thank you.
I experienced this recently with BigQuery in sandbox mode, without a billing account.
In this mode the partition expiration is automatically set to 60 days, so if you load data where the partitioning column (e.g. a date) is older than 60 days, it won't show up in the table. The load job still succeeds and reports the correct number of output rows.
This is very surprising, but I've confirmed via the logs that this is indeed the case.
Unfortunately, the detailed logs for this job, which ran on August 16, are no longer available. We're investigating whether this may have affected other jobs more recently. Please ping this thread if you see this issue again.
We had this issue in our system as well. The reason was that the table had a partition expiry of 30 days and was partitioned on a timestamp column. When someone ingested data older than the partition expiry date, the BigQuery load jobs completed successfully in Spark, but we saw no data in the ingestion tables, since it was deleted moments after it was ingested because of the partition expiry setting.
Please check your BigQuery table's partition expiry parameters and look at the partition column values of the incoming data. If those values fall outside the partition expiry window, you won't see the data in the BigQuery tables; it will be deleted right after ingestion.
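A quick way to check the setting (and, where your billing setup allows, change it) is with BigQuery SQL; the dataset and table names below are placeholders:
-- Check whether a partition expiration is set on the table.
SELECT option_name, option_value
FROM mydataset.INFORMATION_SCHEMA.TABLE_OPTIONS
WHERE table_name = 'mytable'
  AND option_name = 'partition_expiration_days';

-- Raise the expiration so older partitions are kept longer.
ALTER TABLE mydataset.mytable
SET OPTIONS (partition_expiration_days = 365);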