ADF Integration Runtime - Limit Concurrent Jobs

We developed a pipeline with a Get Metadata activity and a ForEach activity. Inside the ForEach there are a few activities such as Lookup, Stored Procedure, and Delete.
The source is a file share. We tested it with 5,000 files of 3 KB each, using a self-hosted IR with the concurrent jobs limit set to 12. The run took 3 hours 30 minutes.
Will there be any improvement if we increase the concurrent jobs limit, or will it just cap the number of concurrent jobs? Also, please let me know what the maximum limit of concurrent jobs is.

The default value of the concurrent jobs limit is set based on the machine size. The factors used to calculate this value depend on the amount of RAM and the number of CPU cores of the machine. So the more cores and the more memory, the higher the default limit of concurrent jobs.
You scale out by adding nodes; when you increase the number of nodes, the concurrent jobs limit is the sum of the concurrent job limit values of all the available nodes. For example, if one node lets you run a maximum of 12 concurrent jobs, then adding three more similar nodes lets you run a maximum of 48 concurrent jobs (that is, 4 x 12).

Related

How many parallel workers remain in the worker pool across sessions for PostgreSQL parallel queries?

Let's say I have a pool of 4 parallel workers in my PostgreSQL configuration, and I have 2 sessions.
In session #1, a query is currently executing, and the planner happened to launch 2 workers for it.
So, in session #2, how can I know that my pool of workers has decreased by 2?
You can count the parallel worker backends:
-- Compare the configured ceiling with the number of parallel worker backends currently active.
SELECT current_setting('max_parallel_workers')::integer AS max_workers,
       count(*) AS active_workers
FROM pg_stat_activity
WHERE backend_type = 'parallel worker';
Internally, Postgres keeps track of how many parallel workers are active with two counters named parallel_register_count and parallel_terminate_count. The difference between the two is the number of active parallel workers. See the comment in the source code.
Before registering a new parallel worker, this number is checked against the max_parallel_workers setting in the source code.
Unfortunately, I don't know of any direct way to expose this information to the user.
You'll see the effects of an exhausted limit in query plans. You might try EXPLAIN ANALYZE with a SELECT query on a big table that's normally parallelized. You would see fewer workers used than workers planned. The manual:
The total number of background workers that can exist at any one time is limited by both max_worker_processes and max_parallel_workers. Therefore, it is possible for a parallel query to run with fewer workers than planned, or even with no workers at all.
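For example, a minimal check along those lines (big_table stands in for any large, normally-parallelized table):
-- Run a normally-parallel scan and compare the Gather node's counters in the output.
EXPLAIN (ANALYZE)
SELECT count(*) FROM big_table;
-- "Workers Launched" lower than "Workers Planned" means the max_parallel_workers pool was exhausted at that moment.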

Impala concurrent query delay

My cluster configuration is as follows:
3-node cluster
128 GB RAM per cluster node.
Processor: 16 cores, hyper-threaded, per cluster node.
All 3 nodes run the Kudu master, Kudu T-Server, and the Impala server; one of the nodes also hosts the Impala Catalog and the Impala StateStore.
My issues are as follows:
1) I'm having a hard time figuring out dynamic resource pooling in Impala while running concurrent queries. I've tried setting mem_limit, still no luck (see the sketch after this question). I've also tried static service pools, but with those I couldn't achieve the required concurrency either. Even with admission control, the required concurrency was not achieved.
I) The time taken for 1 query: 500-800 ms.
II) With 10 concurrent queries, the time per query grows to 3-6 s.
III) With more than 20 concurrent queries, the time per query exceeds 10 s.
2) One of my cluster nodes is not taking any load after a query is submitted; I checked this in the query summary. I've tried setting NUM_NODES to 0 and 1 on the node that is not taking the load; still, the summary shows that the node is not taking the load.
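For reference, the per-query memory cap mentioned in point 1 is set as an Impala query option; a minimal sketch (the 2g value and table name are only illustrative):
-- Cap the memory a single query may use on each node, then run the query in the same session.
SET MEM_LIMIT=2g;
SELECT count(*) FROM my_table;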
What is the table size? How many rows are there in the tables? Are the tables partitioned? It would be nice if you could compare your configuration with the Impala Benchmarks.
As mentioned above, Impala is designed to run on a massively parallel processing (MPP) infrastructure. When we had a setup of 10 nodes with 80 cores and 160 virtual cores, with 12 TB of SAN storage, we could get a computation time of 60 seconds with 5 concurrent users.

Is there a limit to queries using Bigquery's library and api?

I want to know whether there is any limit when querying my data already loaded into BigQuery.
For example, if I want to query BigQuery from a web application or a web service, what is my limit on selects, updates, and deletes?
The documentation tells me this:
Concurrent rate limit for interactive queries under on-demand pricing: 50 concurrent queries. Queries that return cached results, or queries configured using the dryRun property, do not count against this limit.
Daily query size limit: unlimited by default, but you may specify limits using custom quotas.
But I cannot tell whether I have a limit on the number of queries per day, and if so, what that limit is.
There is a limit to the number of slots you can allocate for queries at a particular time.
Some nuggets:
Slot: represents one unit of computational capacity.
Query: Uses as many slots as required so the query runs optimally (Currently: MAX 50 slots for On Demand Price) [A]
Project: The number of slots used per project is based on the number of queries that run at the same time (Currently: MAX 2000 slots for On Demand Price)
[A] This is all under the hood without user intervention. BigQuery makes an assessment of the query to calculate the number of slots required.
So if you do the math, worst case, if all your queries use 50 slots each, you will not see any side effects until you have more than 40 queries running concurrently (2000 / 50 = 40). Even in that situation, the extra queries just sit in the queue for a while and start running after some of the earlier queries finish executing.
Slots become more of a concern when you are sensitive to getting your data on time and the queries run in an environment where:
A lot of queries are running at the same time.
Most of those queries that are running at the same time usually take a long time to execute on an empty load.
The best way to understand whether these limits will impact you is to monitor the current activity within your project. BigQuery advises you to monitor your slots with Stackdriver.
Update: BigQuery addresses the problem of query prioritization in one of their blog posts - Truth 5: BigQuery supports query prioritization
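If the INFORMATION_SCHEMA job views are available in your project, you can also approximate per-query slot usage yourself; a rough sketch (the region qualifier and the one-day window are assumptions):
-- Approximate average slot usage per finished query over the last day.
SELECT job_id,
       total_slot_ms / NULLIF(TIMESTAMP_DIFF(end_time, start_time, MILLISECOND), 0) AS avg_slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE';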

Azure SQL DW Concurrent queries by DWU and Resource class

In "Concurrency slot consumption" section of https://learn.microsoft.com/en-us/azure/sql-data-WArehouse/sql-data-warehouse-develop-concurrency , It describes how the number of concurrent queries & concurrency slots varies by DWU and Resource Class.
For "smallrc" resource class user, the number of concurrent queries and concurrency slots are 4 for 100 DWU and this number linearly scales upto 24 for 600 DWU (for 600 DWU, smallrc resource class user can run 24 concurrent queries on 24 concurrency slots).
My questions are:
1) How many concurrent queries can be run with all users in the "smallrc" resource class on 1000 DWU? Since 1000 DWU provides 40 concurrency slots and each "smallrc" user takes 1 slot per running query, does that mean 40 concurrent queries can run on 1000 DWU? Per the documentation, it looks like the maximum is 32 concurrent queries. Could someone please provide some details on this?
2) Also, per the documentation, it looks like the maximum number of concurrent queries that can run on SQL DW is 32, regardless of whether I use 1000 DWU or 6000 DWU. Could someone please explain why this limitation exists? If I use a "smallrc" resource class user to submit queries on DW2000, are concurrent queries still limited to 32?
1) From the documentation: "SQL Data Warehouse supports up to 32 concurrent queries on the larger DWU sizes." Having more concurrency slots only means that you can have more mediumrc and largerc queries running concurrently. For example, on a DW1000 the maximum is 40 slots, smallrc takes 1 slot and mediumrc takes 8 slots, so you can have 4 mediumrc queries and 8 smallrc queries concurrently (4 x 8 + 8 x 1 = 40 slots, 12 queries), or 1 mediumrc and 31 smallrc (8 + 31 = 39 slots, 32 queries), but the maximum number of total queries is still 32.
You can still run some types of queries without using slots: https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-develop-concurrency#query-exceptions-to-concurrency-limits
2) "The environment is intended to host DW workloads, which by their nature tend to be focused on running fewer, often more expensive, queries to return results that typically aggregate values over large datasets, vs an OLTP environment which is focused on running many concurrent, ideally very cheap, queries. By setting the limit to 32, the environment is designed to ensure that there are sufficient resources, especially memory, available for the expensive DW queries." - SimonFacer

max memory per query

How can I configure the maximum memory that a query (a SELECT) can use in SQL Server 2008?
I know there is a way to set the minimum value, but what about the maximum? I would like to use this because I have many processes running in parallel. I know about the MAXDOP option, but that is for processors.
Update:
What I am actually trying to do is run a data load continuously. This data load is in ETL form (extract, transform, and load). While the data is being loaded I want to run some SELECT queries. All of them are expensive queries (containing GROUP BY). The most important process for me is the data load. I get an average speed of 10,000 rows/sec, and when I run the queries in parallel it drops to 4,000 rows/sec or even lower. I know a few more details should be provided, but this is a more complex product that I work on and I cannot go into more detail. Another thing I can guarantee is that my load speed does not drop because of locking problems, because I monitored for those and removed them.
There isn't any way of setting a maximum amount of memory at the per-query level that I can think of.
If you are on Enterprise Edition you can use Resource Governor to set a maximum amount of memory that a particular workload group can consume, which might help.
In SQL Server 2008 you can use Resource Governor to achieve this. There you can set request_max_memory_grant_percent to limit the memory (this is a percentage relative to the pool size specified by the pool's max_memory_percent value). This setting is not query specific; it is session specific.
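A minimal sketch of that Resource Governor setup (the pool and group names, the percentages, and the classifier function that is not shown are all illustrative):
-- Illustrative pool and workload group; a classifier function (not shown) routes sessions into the group.
CREATE RESOURCE POOL ReportPool WITH (MAX_MEMORY_PERCENT = 50);
CREATE WORKLOAD GROUP ReportGroup
    WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 25)
    USING ReportPool;
ALTER RESOURCE GOVERNOR RECONFIGURE;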
In addition to Martin's answer
If your queries are all the same or similar, working on the same data, then they will be sharing memory anyway.
Example:
A busy web site with 100 concurrent connections running 6 different parametrised queries between them on broadly the same range of data.
6 execution plans
100 user contexts
one buffer pool with assorted flags and counters to show usage of each data page
If you have 100 different queries or they are not parametrised then fix the code.
Memory per query is something I've never thought about or cared about since last millennium.