How to increase performance of Azure Data Factory Pipeline with Integration Runtime - azure-data-factory-2

I would like to increase the performance of our pipelines.
The pipelines currently run from an integration runtime.
I am running a single copy activity on tables held in our source, which is a SQL database. The tables contain just under a million rows, with about 15 columns.
Currently, the time it takes to copy a table from the source to the sink (ADLS) is approximately 20 minutes.
Is there a way to increase the DIU to increase performance?
My current copy settings are as follows:
I'm thinking that if I made some changes to the settings (see below) I would improve performance, but I have never played around with these settings before, so any suggestions are most welcome.
The activity details for a pipeline run are as follows:
My linked service is an Azure Synapse linked service, see below:

From the output window, we can see that almost all the wait time was "Time to first byte", which means your SQL server is slow to reply. It takes ~22 minutes for less than 90K rows. So changes on the ADF side will not help.
If your query is a simple "select * from table", then maybe your SQL server is low on resources. You can check that in your database portal in Azure. Try to add more resources and see if copy times improve.
If this is a query from a view or another complicated query, maybe it needs some improvement (indexes, better code). You can test that by writing the query result to a table in your SQL database, using that table as the Data Factory source, and seeing if this improves the copy time.
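As a rough illustration of that test, a minimal T-SQL sketch (the view, table and column names here are hypothetical) could look like this:

    -- Materialize the view's result into a plain table once...
    SELECT *
    INTO dbo.SourceView_Staging          -- hypothetical staging table
    FROM dbo.SourceView;                 -- hypothetical view used by the copy activity

    -- ...then point the copy activity's source at dbo.SourceView_Staging.
    -- If the copy still takes ~20 minutes, the bottleneck is not the view logic.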

Quick check: are the Azure SQL database and the storage account in the same region? Also, I see that your copy activity has parallelism set to 1; you can play with that number and see if it helps.
To learn how to set up parallelism, please read here: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance-features#parallel-copy
Please see the snapshot below.

Related

Singlestore (MemSQL)

I have a Singlestore (previously MemSQL) cloud database set up.
My software is running in the background, constantly writing to a table.
When I try to query this table, it takes 10+ seconds. When the software is shut off, the query takes milliseconds.
What would be the reason for this? And is there anything that can be done to mitigate against this?
From a high level, cluster resources are much more heavily utilized while the background software is constantly writing to the table. The same resources that handle the constant writes are concurrently trying to serve the query, so it makes sense that it's faster when there is no writing.
A "knob to turn" with respect to database ingest performance is the partition count: you can try creating a test DB with more partitions than the current DB (say 2x more). Then try querying the test DB, both while the background software is running and while it is not, and compare this to the DB with fewer partitions.
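As a rough sketch of that experiment (the database and table names are hypothetical, and the partition count is just an example), the SingleStore side could look like:

    -- Check how many partitions the current database has
    SHOW PARTITIONS ON current_db;

    -- Create a test database with roughly 2x the partitions, then copy the table into it
    CREATE DATABASE test_db PARTITIONS 16;
    CREATE TABLE test_db.events LIKE current_db.events;
    INSERT INTO test_db.events SELECT * FROM current_db.events;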
For general guidance on troubleshooting query performance, see this section of the docs: https://docs.singlestore.com/managed-service/en/query-data/query-procedures/troubleshooting-poorly-performing-queries.html
If you're an active customer, you can file a support ticket for the issue to get some additional analysis of the backend workings.

How to troubleshoot suspended queries in Azure Synapse?

Currently, I am encountering an issue with suspended queries in Azure Synapse when executing stored procedure calls from ADF.
Also, I followed the suggestion in the link below for troubleshooting the issue:
(Link deleted due to sensitive information.)
The troubleshooting queries returned the results below:
I checked whether a transaction lock was the issue: I killed a few suspended or running queries that had been running for more than 15 hours. I also checked the rest of the running queries, but there is nothing that would cause a transaction lock. I tried to run the stored procedure manually from Azure Data Studio (the one that is blocked as mentioned above), and it took 40 seconds to complete.
When run from ADF, the suspended query took nearly an hour to finish.
Any suggestion to troubleshoot this issue is much appreciated.
Thanks
There are a number of factors you must always consider when tuning queries in Azure Synapse Analytics dedicated SQL pools:
DWU - what DWU is your pool at? Lower DWUs mean fewer concurrent queries and lower performance, and should not be used for any kind of performance tuning. Crank it up temporarily to rule this out as a problem, bearing in mind that changing this disconnects any active queries. Also bear in mind that not all queries respond to a higher DWU.
Resource class - what resource class is associated with the user executing these queries? Remember the default is smallrc, and the admin user always has smallrc. Understand static and dynamic resource classes. The DMV sys.dm_pdw_exec_requests will give you useful information on this (see the sketch after this list). Experiment with your workload to find the sweet spot between performance and concurrency versus resource class. Encourage your dev team to use labels in their queries: OPTION ( LABEL = 'some informative label' )
Table geometry - this is the distribution (ROUND_ROBIN|HASH|REPLICATE) of your table and the indexing choice (CLUSTERED COLUMNSTORE|CLUSTERED INDEX|HEAP). Clustered columnstore and round robin are the defaults but they are not always appropriate. Consider what is appropriate for your tables.
If you work through those and still have an issue you can start to look at statistics and workload classification for starters, but gathering information on the points above should give you a good idea.
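As a rough sketch of the DMV, label and table-geometry points above (the table, column and label names are hypothetical):

    -- Resource class, status and label of recent requests
    SELECT request_id, [status], resource_class, total_elapsed_time, command, [label]
    FROM sys.dm_pdw_exec_requests
    ORDER BY submit_time DESC;

    -- Labelling a query so it is easy to find in that DMV
    SELECT COUNT(*) FROM dbo.FactSales
    OPTION ( LABEL = 'adf: stored proc investigation' );

    -- Choosing table geometry explicitly instead of relying on the defaults (CTAS)
    CREATE TABLE dbo.FactSales_New
    WITH
    (
        DISTRIBUTION = HASH(SaleKey),      -- spread rows on a high-cardinality key
        CLUSTERED COLUMNSTORE INDEX        -- good for large scans and aggregations
    )
    AS
    SELECT * FROM dbo.FactSales;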
If you are just doing single-value INSERTs, then don't. Dedicated SQL pools are terrible at these. Convert them to a load from a file in a single INSERT / COPY INTO.
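For example, a hedged COPY INTO sketch (the storage path, file layout and table name are hypothetical, and authentication options are omitted):

    COPY INTO dbo.FactSales
    FROM 'https://mystorageaccount.blob.core.windows.net/staging/factsales/*.csv'
    WITH (
        FILE_TYPE = 'CSV',
        FIELDTERMINATOR = ',',
        FIRSTROW = 2          -- skip the header row
    );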

How to reduce BigQuery data usage by changing Data Portal settings (BigQuery/Data Portal/Data Studio/Google)

I want to know how to reduce the amount of BigQuery data processed by changing settings in Data Portal (Google's BI tool, formerly named Data Studio).
The reason is that unless I reduce my BigQuery data usage, I cannot run the SQL without incurring too much cost.
I am looking for a way that changes Data Portal settings, not BigQuery settings (including changing the SQL code).
Because the dashboard in Data Portal keeps consuming BigQuery data, I cannot solve my problem even if I change the SQL code.
My situation is as follows:
1. I made a view in my BigQuery environment.
I tried to write the query so that it does not process a lot of BigQuery data.
For example, I didn't use "SELECT * FROM ...".
2. I set the view as a data source in Data Portal.
3. I made the dashboard using that data source.
4. When someone opens the dashboard, the view I made is executed.
So BigQuery data is consumed every time someone opens the dashboard.
If I'm understanding correctly, you want to reduce the amount of data processed in BigQuery by your Data Studio (or in Japan, Data Portal) reports.
There are a few ways to do this:
Make sure that the "Enable Cache" option is checked in the report settings.
Avoid using BigQuery views as a query source, as these aren't cached at the BigQuery level (the view query is run every time, and likely many times per report for various charts). Instead, use a Custom Query connection or pull the table data directly to allow caching. Another option (which we use heavily) is to run a scheduled query that saves the output of a view as a table and replaces it regularly (or is triggered when the underlying data is refreshed). This way your queries can be cached, but the business logic can still exist within the view. (See the sketch at the end of this answer.)
Create a BI Engine reservation in BigQuery. This adds another level of caching to Data Studio reports, and may give you better results for things that can't be query-cached or cached in Data Studio. (While there will be a cost to the service in the future based on the size of instance you reserve, it's free during their beta period.)
Don't base your queries on tables that have a streaming buffer attached (even if it hasn't received rows recently), use wildcard tables in the query, or query an external dataset (e.g. a file in Cloud Storage or Bigtable). See Caching Exceptions for details.
Pull as little data as possible by using the new Data Source Parameters. This means you can pass the values of your date range or other filters directly to BigQuery and filter the data before it reaches your report. This is especially helpful if you have a date-partitioned table, as you can scan only the needed partitions (which greatly reduces processing and the amount of data returned).
Also, sometimes it seems like you're moving a lot of data but that doesn't always relate to a high cost. Check your cost breakdowns or look at the logging filtered to the user your data source authenticates as, then see how much cost that's incurred. Certain operations fall under a free tier, and others don't result in cost for non-egress use cases like Data Studio. All that to say that you may want to make sure there's a cost problem at the BigQuery level in the first place before killing yourself trying to optimize the usage.
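To illustrate the scheduled-query and Data Source Parameters points above, a hedged Standard SQL sketch (the dataset, table, view and column names are hypothetical, and the second query assumes the data source's date parameters are enabled):

    -- Scheduled query that materializes the view into a cacheable table.
    -- Run it on a schedule, or trigger it when the underlying data refreshes,
    -- and point the Data Studio data source at reporting.sales_dashboard.
    CREATE OR REPLACE TABLE reporting.sales_dashboard AS
    SELECT *
    FROM reporting.sales_dashboard_view;

    -- Custom-query data source that uses Data Studio's date parameters so only
    -- the selected partitions of a date-partitioned table are scanned.
    SELECT order_date, SUM(amount) AS total_amount
    FROM reporting.orders
    WHERE order_date BETWEEN PARSE_DATE('%Y%m%d', @DS_START_DATE)
                         AND PARSE_DATE('%Y%m%d', @DS_END_DATE)
    GROUP BY order_date;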

Synchronize SQL Server databases

I have a new idea and a question about it that I would like to ask you.
We have a CRM application on-premises / in house. We use that application more or less 24x7. We also do billing and payroll on the same CRM database, which is OLTP, and the same goes for SSRS reports.
It looks like whenever we perform an operation in the front end that inserts and updates a couple of entities at the same time, our application freezes until that process finishes, e.g. extracting payroll for 500 employees for their activities during the last 2 weeks. Basically it summarizes total working hours, pulls those numbers from the database, and writes/updates the record that says the extract has been accomplished. So for 500 employees we are looking at around 40K-50K rows for Insert/Select/Update statements together.
Nobody can do anything while this process runs! We are considering the following options to take care of this issue.
Running this process in off-hours
OR make a copy of the Dynamics CRM DB and do these operations (extracting thousands of records and running multiple reports) on the copy.
My questions are:
How do we create the copy in the first place, and where should we create it (best practices)?
How do we keep it synchronized in real time?
If we only run SELECT statements on the copy DB, that's fine, but if we do any insert/update on the copy, how do we reflect that on the actual live DB? In short, how do we make sure the original and the copy DB stay synchronized with each other in real time?
I know I asked a lot of questions, but being a SQL person stepping into the CRM team and providing suggestions, you know what I am trying to say.
Thanks in advance for any suggestions.
Well, to answer your question regarding a live "copy" of a database, a good solution is an AlwaysOn availability group.
https://blogs.technet.microsoft.com/canitpro/2013/08/19/step-by-step-creating-a-sql-server-2012-alwayson-availability-group/
Though I don't think that is what you are going to want in this situation. AlwaysOn availability groups are typically for database instances that require very short failover windows. For example: if the primary DB server in the cluster goes down, it fails over to a secondary in a second or two at most, and the end users only notice a slight hiccup for a second.
What I think you would find better is to look at those insert statements that are hitting your database server and see why they are preventing you from pulling data. If they are truly locking the table, changing a large number of your reads to "nolock" reads might help remedy your situation.
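As a rough T-SQL sketch of the "nolock" idea (the database, table and column names are hypothetical), plus a row-versioning alternative worth knowing about:

    -- Option 1: per-query dirty reads, so the report is not blocked by the payroll writes
    SELECT EmployeeId, SUM(HoursWorked) AS TotalHours
    FROM dbo.ActivityLog WITH (NOLOCK)   -- may read uncommitted rows
    GROUP BY EmployeeId;

    -- Option 2: row versioning at the database level, so readers no longer block on writers
    -- (requires exclusive access to the database while switching it on)
    ALTER DATABASE CrmDb SET READ_COMMITTED_SNAPSHOT ON;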
It would also be helpful to know what kind of resources you have allocated and whether you have proper indexing on the core tables of your DB. If you don't have proper indexing, then a lot of the queries can take longer than normal, causing the locking you're seeing.
Finally, I would recommend table partitioning if the tables you are pulling against are too large. This can potentially help with a lot of disk speed issues and also help optimize your queries if you partition by time segment (i.e. make a new partition every X months, so a query that pulls from one time segment only pulls from that one data file), as sketched below.
https://msdn.microsoft.com/en-us/library/ms190787.aspx
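A hedged T-SQL sketch of time-segment partitioning (all object names, columns and boundary dates are hypothetical):

    -- One partition per quarter; queries filtered to a quarter touch only that partition
    CREATE PARTITION FUNCTION pf_ActivityByQuarter (date)
    AS RANGE RIGHT FOR VALUES ('2016-01-01', '2016-04-01', '2016-07-01', '2016-10-01');

    CREATE PARTITION SCHEME ps_ActivityByQuarter
    AS PARTITION pf_ActivityByQuarter ALL TO ([PRIMARY]);  -- or map to separate filegroups

    CREATE TABLE dbo.ActivityLog
    (
        ActivityId   bigint        NOT NULL,
        EmployeeId   int           NOT NULL,
        ActivityDate date          NOT NULL,
        HoursWorked  decimal(5,2)  NOT NULL
    )
    ON ps_ActivityByQuarter (ActivityDate);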
I would say you need to focus on efficiency more than on a "copy database", as your volumes aren't high enough to need anything like that, from the sounds of it. I currently have a SQL Server transaction database running with 10 million+ inserts a day, and I still have live reports hit against it. You just need the resources and proper indexing to accommodate it.

Tableau 8.1 taking long time to display report

I have a stored procedure as a source connection in Tableau 8.1. It takes a long time (about 1 minute) to fetch and display 40,000 records (there are no bar charts, pie charts, etc.).
What the stored procedure does is select the 40,000 records with some 6-7 table joins.
However, the same stored procedure executes and displays the records in SQL Server Management Studio within 3 seconds.
SQL Server Profiler shows that some 45,000 inserts into a Tableau temp table occur, which takes a long time. The log file also shows that a high percentage of the time is spent on those inserts, while the execution of the stored procedure itself takes only about 4-5 seconds. Is this the problem? Any suggestions on how to overcome this issue?
Regards
Gautam S
A few places to start:
First check out the Tableau log file in your Tableau repository directory after trying to access your data. There will be a lot of information in there, but you should be able to see the actual SQL that Tableau sends to your database -- and that may give you some clues about what it is doing that is taking so long. If nothing else, you can cut and paste the SQL into your database tools and try to isolate the problem SQL without Tableau in the mix.
If the log file doesn't give you an idea about how to restructure your system to avoid the long query, then send it along with info about your schema to Tableau support. They may be able to help.
Simplify whatever you can to reduce the problem to its core, get rid of everything in your visualization but a total, and then slowly build it back up to see what causes the behavior. For example, make a test version and remove one table at a time from your query to see what causes the problem.
Avoid using quick filters if you see performance problems (or at least minimize them). They're a nice feature, but they come with a performance cost.
Try the Tableau performance monitoring (record and analysis) features
Work with a smaller data set during testing so you can more quickly experiment with different approaches
Try replacing your stored procedure with a view; that's usually better if at all possible (see the sketch at the end of this list).
Add indices to speed up the joins.
If there is no way around the long operation and if updates are infrequent, make a Tableau extract so that you only pay that cost periodically
If none of these things help, cut the problem down to its simplest version and post a schema and the problem SQL. Otherwise, people can only give you generic advice.
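On the stored-procedure-versus-view suggestion, a hedged T-SQL sketch (the object and column names are hypothetical); connecting to a view lets Tableau generate its own SQL with filters and aggregations instead of materializing the full stored procedure result into a temp table first:

    -- Same joins the stored procedure performs, exposed as a view for Tableau to query
    CREATE VIEW dbo.vw_ReportData
    AS
    SELECT o.OrderId, o.OrderDate, c.CustomerName, p.ProductName, o.Amount
    FROM dbo.Orders AS o
    JOIN dbo.Customers AS c ON c.CustomerId = o.CustomerId
    JOIN dbo.Products  AS p ON p.ProductId  = o.ProductId;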