How do I find out what is inserting data in my Azure Data Warehouse - azure-log-analytics

I am using an Azure 'Synapse SQL Pool' (aka Data Warehouse) containing a table named 'DimClient'. I see in my database that new records are being added every day at a specific time. I've reviewed all the ADF pipelines and triggers but none of them are set to run at that time. I don't see any stored procedures that insert or update records in this table either. I can only conclude there is another process running that is adding those records.
I turned on 'Send to Log Analytics' to forward to a workspace and included the SqlRequests and ExecRequests categories. I waited a day and reviewed the logs using the following query:
AzureDiagnostics
| where Category == "SqlRequests" or Category == "ExecRequests"
| where Command_s contains "DimClient" ;
I get 'No Results Found' but when I query the table in SSMS, it contains new records that were added within the last 24 hours. How do I determine what is inserting these records?

You should get results; it takes some time for diagnostic data to sync into Log Analytics. Also check the diagnostic settings on the Synapse pool to confirm the SqlRequests and ExecRequests categories are enabled and pointed at the right workspace.
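While waiting for the diagnostics to sync, you can also query the dedicated SQL pool's own DMVs directly. A minimal sketch, assuming sys.dm_pdw_exec_requests and sys.dm_pdw_exec_sessions are available (they keep only roughly the most recent 10,000 requests, so run this soon after the daily load window):

-- Recent requests that touched DimClient, joined to session info
SELECT r.request_id,
       r.submit_time,
       r.command,
       s.login_name,
       s.app_name,
       s.client_id
FROM sys.dm_pdw_exec_requests r
JOIN sys.dm_pdw_exec_sessions s
  ON r.session_id = s.session_id
WHERE r.command LIKE '%DimClient%'
  AND r.submit_time > DATEADD(day, -1, GETDATE())
ORDER BY r.submit_time DESC;

The login_name, app_name, and client_id columns usually identify which process or service issued the insert.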

Related

Azure Synapse pipeline: How to move incremental updates from SQL Server into synapse for crunching numbers

We are working on building a new data pipeline for our project, and we have to move incremental updates that happen throughout the day on our SQL Servers into Azure Synapse for some number crunching.
We have to get updates that occur across 60+ tables (1-2 million updates a day) into Synapse to compute aggregates and statistics as they happen throughout the day.
One of the requirements is being near real time, so a bulk import into Synapse is not ideal because a full compute over all the data takes more than 10 minutes.
I have been reading about CDC feed into synapse https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-change-data-capture-feature-portal and it is one possible solution.
Wondering if there are other alternatives to this or suggestions for achieving the end goal of data crunching near real time for DB updates.
Change Data Capture (CDC) is a well-suited way to capture the changes and load them into the destination (storage/database).
Apart from that, you can also use a watermark column to capture changes across multiple tables in SQL Server.
Select one column for each table in the source data store, which you can use to identify the new or updated records for every run. Normally, the data in this selected column (for example, last_modify_time or ID) keeps increasing when rows are created or updated. The maximum value in this column is used as a watermark.
A high-level solution diagram and a step-by-step approach for this pattern are given in this official document: Incrementally load data from multiple tables in SQL Server to Azure SQL Database using PowerShell.
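A minimal sketch of the watermark pattern, assuming a hypothetical source table dbo.Orders with a LastModifiedTime column and a control table that stores the last watermark per table (the names here are illustrative, not taken from the linked tutorial):

-- Control table holding the last successfully loaded watermark per source table
CREATE TABLE dbo.WatermarkTable (
    TableName      sysname      NOT NULL PRIMARY KEY,
    WatermarkValue datetime2(3) NOT NULL
);

-- Each incremental run: pull only rows changed since the stored watermark,
-- then advance the watermark to the maximum value just copied.
DECLARE @old datetime2(3) =
    (SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = 'dbo.Orders');
DECLARE @new datetime2(3) =
    (SELECT MAX(LastModifiedTime) FROM dbo.Orders);

SELECT *
FROM dbo.Orders
WHERE LastModifiedTime > @old
  AND LastModifiedTime <= @new;   -- this result set is what gets copied to Synapse

UPDATE dbo.WatermarkTable
SET WatermarkValue = @new
WHERE TableName = 'dbo.Orders';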

SQL timeout when inserting data into temp table in custom application

I have a console .NET application that reads a single value from a one-line file.
The application had SQL timeout issues on a few days last month, and I am working to find the root cause.
The application uses that single value to pull data from base tables, selecting rows whose column values are higher than the value read from the file.
The data pulled from the base-table joins is dumped into two temporary tables (present in the attached script).
The two temp tables are then joined back to the base tables, and the result of those joins is dumped into one final temp table (AccMatters), from which we update the base/permanent tables after applying business logic for charge-code validation (time charged by employees/users working on certain matters for the company must carry the charge code used for charging time).
The attached SQL code is what produced the timeout; the insertion into the temporary table AccMatters is where the problem occurs. Comments in the SQL code describe what it does.
The script contains the code up to the dump into the last temp table, since that is where the timeout occurred according to the logs of the .NET console application, which has the SQL statements embedded in it.
The issue occurred on three days last month, and the volume of records inserted into the last temporary table was 800+ rows on the days the timeout occurred.
When executed in the production environment, the script takes a few minutes, which is far below the 20-minute timeout set in the application.
Finally, the application updates the file with a new value from the base table that is greater than the previous one, and that value is used in the next run of the application.
Any help identifying possible SQL Server code issues in the attached script would be appreciated, to find the root cause for the days when the issue was reported by the customer.
If that is the case, you need to run a few diagnostic scripts to find out what is happening on the server.
1) Reader/writer conflict
DBCC OPENTRAN ('dbname');
2) Check the tempdb latency and the log file growth of tempdb (see the tempdb I/O sketch at the end of this answer)
3) Any blocked sessions/processes
SELECT * FROM sys.sysprocesses WHERE blocked <> 0;
SELECT * FROM sys.sysprocesses WHERE spid IN (SELECT blocked FROM sys.sysprocesses WHERE blocked <> 0);
4) Check whether that proc falls among the high-impact queries on disk/latency
SELECT TOP 10 t.TEXT AS 'SQL Text'
    ,st.execution_count
    ,ISNULL(st.total_elapsed_time / st.execution_count, 0) AS 'AVG Execution Time'
    ,st.total_worker_time / st.execution_count AS 'AVG Worker Time'
    ,st.total_worker_time
    ,st.max_logical_reads
    ,st.max_logical_writes
    ,st.creation_time
    ,ISNULL(st.execution_count / NULLIF(DATEDIFF(second, st.creation_time, GETDATE()), 0), 0) AS 'Calls Per Second'
FROM sys.dm_exec_query_stats st
CROSS APPLY sys.dm_exec_sql_text(st.sql_handle) t
ORDER BY st.creation_time DESC;
5) Use Activity Monitor to check whether the response time of tempdb is high
To start with, I would look at perfmon counters and check for abnormal growth of the tempdb log file. I would also create another, similar proc under a different name that uses global temp tables, so you can debug it; that should give you a good idea of what is happening on the server.
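For item 2, one way to look at tempdb file latency is sys.dm_io_virtual_file_stats. A minimal sketch (a standard DMV query, not part of the original answer):

-- Per-file I/O stall statistics for tempdb (database_id = 2)
SELECT f.name                                               AS file_name,
       f.type_desc,
       vfs.num_of_reads,
       vfs.num_of_writes,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_stall_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_stall_ms,
       vfs.size_on_disk_bytes / 1048576                     AS size_mb
FROM sys.dm_io_virtual_file_stats(2, NULL) AS vfs
JOIN tempdb.sys.database_files AS f
  ON vfs.file_id = f.file_id;

High average stall times on the tempdb data or log files would point at the tempdb contention suggested above.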

Azure DataFactory Pipeline Timeout

Currently we have a table with more than 200k records. When we move the data from the source Azure SQL database to another SQL database, it takes a lot of time (more than 3 hours), resulting in a timeout error. Initially we set the timeout to 1 hour; because of the timeout error we increased the interval to 3 hours, but it is still not working.
This is how we have defined the process.
Two datasets -> input and output
One pipeline
Inside the pipeline we have a query like select * from table;
and we have a stored procedure whose script is roughly:
Delete all records from the table.
Insert statements to insert all records.
This is time consuming, so we have decided to update and insert only the data that was modified or inserted in the last 24 hours, based on a date column.
So, is there any functionality in an Azure pipeline that picks up the records inserted or updated in the source Azure SQL DB in the last 24 hours, or do we need to handle that in the destination SQL stored procedure?
In Azure Data Factory, the copy activity sink has an option called writeBatchSize. We can set this value to flush the data in batches instead of flushing each record.
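For the incremental approach described in the question (only rows modified in the last 24 hours), one option is to land the changed rows in a staging table and upsert them in the destination stored procedure instead of delete-all/insert-all. A minimal sketch, assuming hypothetical dbo.StagingTable and dbo.TargetTable with an Id key and a ModifiedDate column (the names are illustrative):

-- Upsert rows copied into the staging table during the last load window
MERGE dbo.TargetTable AS tgt
USING dbo.StagingTable AS src
    ON tgt.Id = src.Id
WHEN MATCHED THEN
    UPDATE SET tgt.Col1         = src.Col1,
               tgt.Col2         = src.Col2,
               tgt.ModifiedDate = src.ModifiedDate
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Col1, Col2, ModifiedDate)
    VALUES (src.Id, src.Col1, src.Col2, src.ModifiedDate);

-- The copy activity source query would then select only recent changes, e.g.:
-- SELECT * FROM dbo.SourceTable WHERE ModifiedDate >= DATEADD(hour, -24, GETUTCDATE());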

Bigquery load job said successful but data did not get loaded into table

I submitted a BigQuery load job; it ran and returned with the status successful, but the data didn't make it into the destination table.
Here was the command that was run:
/usr/local/bin/bq load --nosynchronous_mode --project_id=ardent-course-601 --job_id=logsToBq_load_impressions_20140816_1674a956_6c39_4859_bc45_eb09db7ef99a --source_format=NEWLINE_DELIMITED_JSON dw_logs_impressions.impressions_20140816 gs://sm-uk-hadoop/queries/logsToBq_transformLogs/impressions/20140816/9307f6e3-0b3a-44ca-8571-7107c399998c/part* /opt/sm-analytics/projects/logsTobqMR/jsonschema/impressionsSchema.txt
I checked the job status of the job logsToBq_load_impressions_20140816_1674a956_6c39_4859_bc45_eb09db7ef99a. The input file count and size showed the correct number of input files and total size.
Does anyone know why the data didn't make it into the table even though the job was reported as successful?
Just in case this was not a mistake on our side, I ran the load job again, but to a different destination table, and this time the data made it into the destination table fine.
Thank you.
I experienced this recently with BigQuery in sandbox mode without a billing account.
In this mode the partition expiration is automatically set to 60 days. If you load data into a table where the value of the partitioning column (e.g. a date) is older than 60 days, it won't show up in the table. The load job still succeeds with the correct number of output rows.
This is very surprising, but I've confirmed via the logs that this is indeed the case.
Unfortunately, the detailed logs for this job, which ran on August 16, are no longer available. We're investigating whether this may have affected other jobs more recently. Please ping this thread if you see this issue again.
We had this issue in our system, and the reason was that the table had a partition expiry of 30 days and was partitioned on a timestamp column. When someone ingested data older than the partition expiry date, the BigQuery load jobs completed successfully in Spark, but we saw no data in the ingestion tables, since it was deleted moments after it was ingested because of the partition expiry.
Please check your BigQuery table's partition expiry parameter and the partition column values of the incoming data. If those values are older than the partition expiry, you won't see the data in the BigQuery tables; it will be deleted just after ingestion.
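To check whether partition expiration is the cause, the table options can be inspected (and changed) in standard SQL. A minimal sketch, with project, dataset, and table names as placeholders:

-- Inspect the partition expiration set on tables in a dataset
SELECT table_name, option_name, option_value
FROM `my_project.my_dataset.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE option_name = 'partition_expiration_days';

-- Remove (or raise) the expiration so older partitions are kept
ALTER TABLE `my_project.my_dataset.my_table`
SET OPTIONS (partition_expiration_days = NULL);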

SQL: Tracking changes to a table that gets truncated every day (and repulled from a different server)

I have a table that is a replicate of a table from a different server.
Unfortunately I don't have access to the transaction information; all I have is the table that shows the "as is" information, plus an SSIS package that replicates the table on my server every day (the table gets truncated, and new information is pulled every night).
Everything has been fine and good, but I want to start tracking what has changed. i.e. I want to know if a new row has been inserted or a value of a column has changed.
Is this something that could be done easily?
I would appreciate any help..
The SQL version is SQL Server 2012 SP1 | Enterprise
If you want to do this for a particular table, you can use an SCD (slowly changing dimension) transform in an SSIS data flow, which will keep the history records in a separate table,
or
you can enable CDC (change data capture) on that table. CDC lets you monitor every DML operation on the table; the modified rows are written to a system change table.
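Because the replica is truncated and fully reloaded every night, another simple option (a sketch, not part of the answer above, with illustrative table names) is to keep yesterday's snapshot before the reload and diff the two copies with EXCEPT:

-- Before the nightly reload: keep yesterday's data for comparison
IF OBJECT_ID('dbo.MyTable_Previous') IS NOT NULL
    DROP TABLE dbo.MyTable_Previous;
SELECT * INTO dbo.MyTable_Previous FROM dbo.MyTable;

-- After the reload: rows that are new or changed since yesterday
SELECT * FROM dbo.MyTable
EXCEPT
SELECT * FROM dbo.MyTable_Previous;

-- Rows that were removed on the source
SELECT * FROM dbo.MyTable_Previous
EXCEPT
SELECT * FROM dbo.MyTable;

Both steps can be added as Execute SQL tasks around the existing SSIS load.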