In "Concurrency slot consumption" section of https://learn.microsoft.com/en-us/azure/sql-data-WArehouse/sql-data-warehouse-develop-concurrency , It describes how the number of concurrent queries & concurrency slots varies by DWU and Resource Class.
For "smallrc" resource class user, the number of concurrent queries and concurrency slots are 4 for 100 DWU and this number linearly scales upto 24 for 600 DWU (for 600 DWU, smallrc resource class user can run 24 concurrent queries on 24 concurrency slots).
My questions are:
1) How many concurrent queries can be run with all users on the "smallrc" resource class at DW1000? Since DW1000 provides 40 concurrency slots and each "smallrc" query takes 1 slot, does that mean 40 concurrent queries can run at DW1000? As per the documentation, it looks like the maximum number of concurrent queries is 32. Could someone please provide some details on this?
2) Also, as per the documentation, it looks like the maximum number of concurrent queries that can run on SQL DW is 32, irrespective of whether I use DW1000 or DW6000. Could someone please explain why this limitation exists? If I use a "smallrc" resource class user to submit queries at DW2000, are concurrent queries still limited to 32?
1) From the documentation: "SQL Data Warehouse supports up to 32 concurrent queries on the larger DWU sizes." Having more concurrency slots only means that you can have more mediumrc and largerc queries running concurrently. For example, on a DW1000, where the maximum is 40 slots, smallrc takes 1 slot and mediumrc takes 8 slots, so you can have 4 mediumrc and 8 smallrc queries running concurrently, or 31 smallrc and 1 mediumrc, but the maximum number of total queries is still 32.
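A minimal sketch of that slot arithmetic, assuming the DW1000 figures above (40 slots, a 32-query cap, smallrc = 1 slot, mediumrc = 8 slots); the query mixes are only illustrations:

    # Illustration of the DW1000 slot arithmetic described above:
    # 40 concurrency slots total, a hard cap of 32 concurrent queries,
    # smallrc = 1 slot per query, mediumrc = 8 slots per query.
    MAX_SLOTS = 40
    MAX_CONCURRENT_QUERIES = 32
    SLOT_COST = {"smallrc": 1, "mediumrc": 8}

    def fits(mix):
        """Return True if a mix like {'smallrc': 31, 'mediumrc': 1} can run concurrently."""
        queries = sum(mix.values())
        slots = sum(SLOT_COST[rc] * n for rc, n in mix.items())
        return queries <= MAX_CONCURRENT_QUERIES and slots <= MAX_SLOTS

    print(fits({"smallrc": 8, "mediumrc": 4}))   # True: 12 queries, 40 slots
    print(fits({"smallrc": 31, "mediumrc": 1}))  # True: 32 queries, 39 slots
    print(fits({"smallrc": 40}))                 # False: only 32 queries may run at once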
You can still run some types of queries without using slots: https://learn.microsoft.com/en-us/azure/sql-data-WArehouse/sql-data-warehouse-develop-concurrency#query-exceptions-to-concurrency-limits
2) "The environment is intended to host DW workloads, which by their nature tend to be focused on running fewer, often more expensive, queries to return results that typically aggregate values over large datasets, vs an OLTP environment which is focused on running many concurrent, ideally very cheap, queries. By setting the limit to 32, the environment is designed to ensure that there are sufficient resources, especially memory, available for the expensive DW queries." - SimonFacer
We developed a pipeline with a Get Metadata activity and a ForEach activity. Inside the ForEach there are a few activities such as Lookup, Stored Procedure, and Delete.
The source is a file share. We tried it with 5,000 files, each 3 KB in size, using a self-hosted IR with concurrent jobs set to 12. It took 3 hours 30 minutes.
Will there be any improvement if we increase the concurrent jobs setting, or will it just cap the number of concurrent jobs? Also, please let me know what the maximum limit of concurrent jobs is.
The default value of the concurrent jobs limit is set based on the machine size. The factors used to calculate this value depend on the amount of RAM and the number of CPU cores of the machine. So the more cores and the more memory, the higher the default limit of concurrent jobs.
You scale out by increasing the number of nodes. When you increase the number of nodes, the concurrent jobs limit is the sum of the concurrent job limit values of all the available nodes. For example, if one node lets you run a maximum of twelve concurrent jobs, then adding three more similar nodes lets you run a maximum of 48 concurrent jobs (that is, 4 x 12).
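A tiny sketch of that arithmetic; the per-node limits are assumed inputs, since the actual per-node default depends on the machine as described above:

    # Aggregate concurrent-jobs limit for a self-hosted IR: the sum of the
    # per-node limits of all available nodes.
    def total_concurrent_jobs(per_node_limits):
        return sum(per_node_limits)

    # One node with a limit of 12, plus three more similar nodes:
    print(total_concurrent_jobs([12, 12, 12, 12]))  # 48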
My cluster configuration is as follows:
3-node cluster
128 GB RAM per cluster node.
Processor: 16-core, hyper-threaded, per cluster node.
All 3 nodes run a Kudu master, a Kudu tablet server (T-Server), and an Impala daemon; one of the nodes also runs the Impala Catalog Server and StateStore.
My issues are as follows:
1) I'm having a hard time figuring out dynamic resource pooling in Impala while running concurrent queries. I've tried setting mem_limit, still no luck. I've also tried static service pools, but with those I couldn't achieve the required concurrency either. Even with admission control, the required concurrency was not achieved.
I) Time taken for 1 query: 500-800 ms.
II) If 10 concurrent queries are submitted, the time taken grows to 3-6 s per query.
III) If more than 20 concurrent queries are submitted, the time taken exceeds 10 s per query (a timing harness sketch follows this list).
2) One of my cluster nodes is not taking any load after a query is submitted; I checked this in the query summary. I've tried setting NUM_NODES to 0 and 1 on the node that is not taking the load, but the summary still shows it taking no load.
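For the latency-vs-concurrency numbers in issue 1, a rough timing harness can make the measurements repeatable. This is only a sketch, assuming the impyla Python client; the host and the query are placeholders:

    # Rough harness to reproduce the latency-vs-concurrency measurements above.
    # Uses the impyla client (pip install impyla), one connection per worker.
    import time
    from concurrent.futures import ThreadPoolExecutor
    from impala.dbapi import connect

    IMPALAD_HOST = "impalad.example.com"                         # placeholder
    QUERY = "SELECT COUNT(*) FROM my_table WHERE some_col = 42"  # placeholder

    def run_once(_):
        conn = connect(host=IMPALAD_HOST, port=21050)
        cur = conn.cursor()
        start = time.perf_counter()
        cur.execute(QUERY)
        cur.fetchall()
        elapsed = time.perf_counter() - start
        conn.close()
        return elapsed

    for concurrency in (1, 10, 20):
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            timings = list(pool.map(run_once, range(concurrency)))
        print(f"{concurrency:>2} concurrent: "
              f"avg {sum(timings) / len(timings):.2f}s, max {max(timings):.2f}s")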
What is the table size? How many rows are there in the tables? Are the tables partitioned? It would be nice if you could compare your configuration with the Impala benchmarks.
As mentioned above, Impala is designed to run on a massively parallel processing (MPP) infrastructure. When we had a setup of 10 nodes with 80 cores (160 virtual cores) and 12 TB of SAN storage, we got a computation time of 60 seconds with 5 concurrent users.
I have seen the warnings about not using Google Cloud Bigtable for small datasets.
Does this mean that a workload of 100 QPS could run slower (total time; not per query) than a workload of 8000 QPS?
I understand that 100 QPS is going to be incredibly inefficient on Bigtable, but could it be as drastic as 100 inserts taking 15 seconds to complete, whereas 8,000 inserts could run in 1 second?
Just looking for an "in theory; from time to time; yes" vs. "probably relatively unlikely" type of answer, as a rough guide for how I structure my performance test cycles.
Thanks
There's a flat start-up cost to running any Cloud Bigtable operations. That start-up cost is generally less than 1 second. I would expect 100 operations to take less time than 8,000 operations. When I see extreme slowness, I usually suspect network latency or some other unique condition.
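To amortize that flat start-up cost over many inserts, the usual approach is to batch mutations rather than write rows one at a time. A rough sketch with the google-cloud-bigtable Python client; the project, instance, table, and column family names are placeholders:

    # Batch many inserts into one mutate_rows call so the per-request
    # overhead is paid once instead of per row. All names are placeholders.
    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("my-table")

    rows = []
    for i in range(100):
        row = table.direct_row(f"user#{i:04d}".encode())
        row.set_cell("cf1", b"col", f"value-{i}".encode())
        rows.append(row)

    statuses = table.mutate_rows(rows)          # one round trip for 100 rows
    failures = [s for s in statuses if s.code != 0]
    print(f"{len(rows) - len(failures)} rows written, {len(failures)} failed")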
We're having issues running small workloads on our development Bigtable instance (2.5 TB), one node instead of 3.
We have a row key based on user ID, with around 100 rows per user ID key. The total number of records in the database is a few million. We query Bigtable and see 1.4 seconds of latency when fetching the rows associated with a single user ID key. Fewer than 100 records are returned, yet we're seeing well over a second of latency. It seems to me that giant workloads are the only way to use this data store. We're looking at other NoSQL alternatives like Redis.
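A quick way to see where that 1.4 seconds goes is to time the single-key read directly with the client library, outside the application stack. A rough sketch with the google-cloud-bigtable Python client; the project, instance, and table names and the row-key layout (user ID as a key prefix) are assumptions:

    # Time a scan over all rows for one user ID, assuming keys of the form
    # "<user_id>#<suffix>". Names and key layout are assumptions.
    import time
    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("my-table")

    user_id = "user123"  # placeholder
    prefix = (user_id + "#").encode()

    start = time.perf_counter()
    rows = table.read_rows(start_key=prefix, end_key=prefix + b"\xff")
    count = sum(1 for _ in rows)
    elapsed = time.perf_counter() - start
    # Note: the first call on a fresh client also pays gRPC channel setup,
    # so repeat the measurement on a warm client for a fair number.
    print(f"fetched {count} rows in {elapsed * 1000:.0f} ms")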
I want to know whether there is any limit when making queries against my data already loaded in BigQuery.
For example, if I want to query BigQuery from a web application or a web service, what is my limit on SELECTs, UPDATEs, and DELETEs?
The documentation tells me this:
Concurrent rate limit for interactive queries under on-demand pricing: 50 concurrent queries. Queries that return cached results, or queries configured using the dryRun property, do not count against this limit.
Daily query size limit: unlimited by default, but you may specify limits using custom quotas.
But I cannot tell whether I have a limit on the number of queries per day, and if so, what that limit is.
There is a limit to the number of slots you can allocate for queries at a particular time.
Some nuggets:
Slot: represents one unit of computational capacity.
Query: Uses as many slots as required so the query runs optimally (Currently: MAX 50 slots for On Demand Price) [A]
Project: The number of slots used per project is based on the number of queries running at the same time (Currently: MAX 2000 slots for On Demand Price)
[A] This is all under the hood without user intervention. BigQuery makes an assessment of the query to calculate the number of slots required.
So if you do the math, worst case, if all your queries use 50 slots, you will not see any side effects until you have more than 40 queries running concurrently. Even then, the extra queries will just sit in the queue for a while and start running once some of the earlier queries finish executing.
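Spelled out, the worst-case arithmetic from the figures quoted above:

    # Worst-case arithmetic using the figures quoted above.
    MAX_SLOTS_PER_QUERY = 50      # quoted per-query maximum under on-demand pricing
    MAX_SLOTS_PER_PROJECT = 2000  # quoted per-project maximum under on-demand pricing

    concurrent_queries_before_queueing = MAX_SLOTS_PER_PROJECT // MAX_SLOTS_PER_QUERY
    print(concurrent_queries_before_queueing)  # 40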
Slots become more of a concern when you are sensitive to getting your data on time and your queries run in an environment where:
A lot of queries are running at the same time.
Most of those concurrently running queries usually take a long time to execute even on an empty load.
The best way to understand whether these limits will impact you is to monitor the current activity within your project. BigQuery advises you to monitor your slots with Stackdriver.
Update: BigQuery addresses the problem of query prioritization in one of their blog posts - Truth 5: BigQuery supports query prioritization
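On the quoted point that dry-run queries do not count against the concurrent-query limit: a sketch using the google-cloud-bigquery Python client to estimate a query's bytes processed without actually running it (the project, dataset, and table names are placeholders):

    # Estimate a query's cost without executing it; dry-run queries do not
    # count against the 50-concurrent-query limit quoted above.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    job = client.query(
        "SELECT user_id, COUNT(*) FROM `my-project.my_dataset.events` GROUP BY user_id",
        job_config=job_config,
    )
    print(f"would process {job.total_bytes_processed / 1e9:.2f} GB")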
I would like to use HSQLDB + Hibernate in a server with 5 to 30 clients that will write to the DB fairly intensively.
Each client will persist about twelve thousand rows into a single table every 30 seconds (24/7, that's roughly 1 billion rows/day), and the clients will also query the database for a few thousand rows at more or less random times, at an average frequency of a couple of requests every 5 to 10 seconds.
Can HSQLDB handle such a use case or should I switch to MySQL/PostgreSQL ?
You are looking at a total of 2,000-12,000 writes and 5,000-30,000 reads per second.
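A back-of-the-envelope derivation of those rates from the numbers in the question; the rows-per-read and request-rate values are assumptions filling in "a few thousand" and "a couple of requests every 5 to 10 seconds":

    # Derive the quoted rates from the question's numbers.
    ROWS_PER_WRITE_BATCH = 12_000   # ~a dozen thousand rows per client
    WRITE_INTERVAL_S = 30           # persisted every 30 seconds
    ROWS_PER_READ = 5_000           # "a few thousand rows" -- assumed value
    READS_PER_S = 0.2               # ~a couple of requests every 10 s -- assumed value

    for clients in (5, 30):
        writes_per_s = clients * ROWS_PER_WRITE_BATCH / WRITE_INTERVAL_S
        reads_per_s = clients * ROWS_PER_READ * READS_PER_S
        print(f"{clients} clients: ~{writes_per_s:,.0f} writes/s, ~{reads_per_s:,.0f} rows read/s")
    # 5 clients:  ~2,000 writes/s,  ~5,000 rows read/s
    # 30 clients: ~12,000 writes/s, ~30,000 rows read/s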
With fast hardware, HSQLDB can probably handle this with persistent memory tables. With CACHED tables, it may be able to handle the lower range with solid state disks (disk seek time is the main parameter).
See this test. You can run it with MySQL and PostgreSQL for comparison.
http://hsqldb.org/web/hsqlPerformanceTests.html
You should switch. HSQLDB is not for critical apps. Be prepared for data corruption and decreasing startup performance over time.
The main negative hype comes from JBoss: https://community.jboss.org/wiki/HypersonicProduction
See also http://www.coderanch.com/t/89950/JBoss/HSQLDB-production
Also see similar question: Is is safe to use HSQLDB for production? (JBoss AS5.1)