Impala concurrent query delay

My cluster configuration is as follows:
3 Node cluster
128GB RAM per cluster node.
Processor: 16 core HyperThreaded per cluster node.
All 3 nodes run a Kudu master, a Kudu tablet server, and the Impala daemon; one of the nodes also hosts the Impala Catalog Server and StateStore.
My issues are as follows:
1) I'm having a hard time configuring dynamic resource pools in Impala for concurrent queries. I've tried setting mem_limit with no luck, and I've also tried static service pools, but I couldn't achieve the required concurrency with those either. Even with admission control, the required concurrency was not achieved.
I) Time taken for a single query: 500-800 ms.
II) With 10 concurrent queries, the time grows to 3-6 s per query.
III) With more than 20 concurrent queries, the time exceeds 10 s per query.
2) One of my cluster nodes is not taking any load after a query is submitted; I verified this from the query summary. I've tried setting NUM_NODES to 0 and to 1 on the node that is not taking the load (a sketch of setting such options per session follows below), but the summary still shows that node taking no load.
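For reference, query options like MEM_LIMIT and NUM_NODES can also be set per session from a client. Below is a minimal sketch using the impyla Python client; the hostname, port, memory value, and table name are placeholder assumptions, not values from my setup.

```python
# Minimal sketch: setting Impala query options per session via impyla.
# Host, port, memory limit, and table name are placeholders.
from impala.dbapi import connect

conn = connect(host='impala-node-1', port=21050)  # HiveServer2 port of any impalad
cur = conn.cursor()

cur.execute("SET MEM_LIMIT=2g")   # per-query memory cap for this session
cur.execute("SET NUM_NODES=0")    # 0 = let the planner schedule across all nodes
cur.execute("SELECT count(*) FROM some_kudu_table")
print(cur.fetchall())
```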

What is the table size? How many rows are there in the tables? Are the tables partitioned? It would also help if you could compare your configuration with the published Impala benchmarks.
As mentioned above, Impala is designed to run on massively parallel processing (MPP) infrastructure. When we had a setup of 10 nodes with 80 cores (160 virtual cores) and 12 TB of SAN storage, we got a computation time of about 60 seconds with 5 concurrent users.

Related

ADF Integration Runtime - Limit Concurrent Jobs

We developed a pipeline with a Get Metadata activity and a ForEach activity. Inside the ForEach there are a few activities such as Lookup, Stored Procedure, and Delete.
The source is a file share. We tried it with 5,000 files of 3 KB each, using a self-hosted IR with the concurrent jobs limit set to 12. It took 3 hours 30 minutes.
Will there be any improvement if we increase the concurrent jobs limit, or does that setting only cap the number of concurrent jobs? Also, please let me know the maximum limit of concurrent jobs.
The default value of the concurrent jobs limit is set based on the machine size. The factors used to calculate this value depend on the amount of RAM and the number of CPU cores of the machine. So the more cores and the more memory, the higher the default limit of concurrent jobs.
You scale out by increasing the number of nodes. When you increase the number of nodes, the concurrent jobs limit is the sum of the concurrent job limit values of all the available nodes. For example, if one node lets you run a maximum of twelve concurrent jobs, then adding three more similar nodes lets you run a maximum of 48 concurrent jobs (that is, 4 x 12).
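As a quick sketch of that scale-out rule (the per-node limit of 12 is just the example figure above):

```python
# Aggregate concurrent-jobs limit for a multi-node self-hosted IR:
# it is simply the sum of the per-node limits.
per_node_limit = 12             # example per-node default from above
nodes = 4                       # 1 original node + 3 added nodes
print(per_node_limit * nodes)   # 48 concurrent jobs across the IR
```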

Is there a limit to queries using BigQuery's library and API?

I want to know whether there is any limit when querying data already loaded into BigQuery.
For example, if I want to query BigQuery from a web application or a "web service", what is my limit on SELECTs, UPDATEs, and DELETEs?
The documentation tells me this:
Concurrent rate limit for interactive queries under on-demand pricing: 50 concurrent queries. Queries that return cached results, or queries configured using the dryRun property, do not count against this limit.
Daily query size limit: unlimited by default, but you may specify limits using custom quotas.
But I cannot tell whether I have a limit on the number of queries per day, and if so, what that limit is.
There is a limit to the number of slots you can allocate for queries at a particular time.
Some nuggets:
Slot: represents one unit of computational capacity.
Query: Uses as many slots as required so the query runs optimally (Currently: MAX 50 slots for On Demand Price) [A]
Project: The slots used per project is based on the number of queries that run at the same time (Currently: MAX 2000 slots for On Demand Price)
[A] This is all under the hood without user intervention. BigQuery makes an assessment of the query to calculate the number of slots required.
So if you do the math, in the worst case where every query uses 50 slots, you will not see any side effects until you have more than 40 queries running concurrently. Even then, the extra queries simply sit in a queue for a while and start running as earlier queries finish.
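In code form, that worst-case arithmetic is simply (a sketch using the on-demand figures quoted above, not values fetched from any API):

```python
# Worst case: every query uses the per-query maximum of 50 slots.
slots_per_project = 2000   # on-demand per-project slot cap
slots_per_query = 50       # on-demand per-query maximum
print(slots_per_project // slots_per_query)   # 40 queries before any queueing
```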
Slot limits become more of a concern when you need your data on time and your queries run in an environment where:
A lot of queries are running at the same time.
Most of those concurrent queries take a long time to execute even on an otherwise idle system.
The best way to understand whether these limits will impact you is to monitor the current activity within your project. BigQuery advises monitoring your slot usage with Stackdriver.
Update: BigQuery addresses query prioritization in one of its blog posts - Truth 5: BigQuery supports query prioritization
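As a side note on the dryRun carve-out quoted in the question: a dry-run job only validates the query and estimates the bytes it would scan, so it consumes no slots and does not count against the 50-query concurrency limit. A minimal sketch with the google-cloud-bigquery Python client (project, dataset, and SQL are placeholders):

```python
# Dry-run a query: validate it and estimate cost without running it.
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT name FROM `my_project.my_dataset.my_table`",  # placeholder query
    job_config=config,
)
print(f"Would process {job.total_bytes_processed} bytes; no slots consumed")
```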

Azure SQL DW Concurrent queries by DWU and Resource class

In "Concurrency slot consumption" section of https://learn.microsoft.com/en-us/azure/sql-data-WArehouse/sql-data-warehouse-develop-concurrency , It describes how the number of concurrent queries & concurrency slots varies by DWU and Resource Class.
For "smallrc" resource class user, the number of concurrent queries and concurrency slots are 4 for 100 DWU and this number linearly scales upto 24 for 600 DWU (for 600 DWU, smallrc resource class user can run 24 concurrent queries on 24 concurrency slots).
My question is
1) How many concurrent queries can be run when all users use the "smallrc" resource class on 1000 DWU? Since 1000 DWU provides 40 concurrency slots and each "smallrc" query takes 1 slot, does that mean 40 concurrent queries can run on 1000 DWU? According to the documentation, the maximum number of concurrent queries appears to be 32. Could someone please provide some details on this?
2) Also, per the documentation, the maximum number of concurrent queries that can run on SQL DW appears to be 32, regardless of whether I use 1000 DWU or 6000 DWU. Could someone please explain the reason for this limitation? If I use a "smallrc" resource class user to submit queries on DW2000, are concurrent queries still limited to 32?
1) From the documentation: "SQL Data Warehouse supports up to 32 concurrent queries on the larger DWU sizes." Having more concurrency slots only means that you can have more mediumrc and largerc queries running concurrently. For example on a DW1000 where the maximum slots are 40, smallrc takes 1 slot and mediumrc takes 8 slots, so you can have 4 mediumrc queries and 8 smallrc queries concurrently, or 31 smallrc and 1 mediumrc, but the max for total queries is still 32.
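To make the interplay of the two limits concrete, here is a small sketch of that arithmetic (the slot costs are the DW1000 values quoted above):

```python
# DW1000: 40 concurrency slots, but never more than 32 concurrent queries.
MAX_SLOTS = 40
MAX_QUERIES = 32
SLOT_COST = {"smallrc": 1, "mediumrc": 8}   # DW1000 values from the answer above

def fits(smallrc, mediumrc):
    slots = smallrc * SLOT_COST["smallrc"] + mediumrc * SLOT_COST["mediumrc"]
    return slots <= MAX_SLOTS and (smallrc + mediumrc) <= MAX_QUERIES

print(fits(8, 4))    # True:  8 + 32 = 40 slots, 12 queries
print(fits(31, 1))   # True:  31 + 8 = 39 slots, 32 queries
print(fits(40, 0))   # False: 40 slots fit, but 40 queries exceed the 32-query cap
```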
You can still run some types of queries without using slots: https://learn.microsoft.com/en-us/azure/sql-data-WArehouse/sql-data-warehouse-develop-concurrency#query-exceptions-to-concurrency-limits
2) "The environment is intended to host DW workloads, which by their nature tend to be focused on running fewer, often more expensive, queries to return results that typically aggregate values over large datasets, vs an OLTP environment which is focused on running many concurrent, ideally very cheap, queries. By setting the limit to 32, the environment is designed to ensure that there are sufficient resources, especially memory, available for the expensive DW queries." - SimonFacer

What are the best practices for Hadoop benchmarking?

I am using TestDFSIO to benchmark Hadoop I/O performance.
The test rig I am using is a small virtual cluster of 3 data nodes and one name node.
Each VM has 6-8 GB of RAM and a 100-250 GB HDD.
I want to know about two things:
What values should I use for the number-of-files (nrFiles) and per-file size (fileSize) parameters for my setup, so that the results from my small cluster can be related to clusters of standard sizes (for example, 8-12 x 2 TB hard disks, 64 GB of RAM, and faster processors)? Is it even correct to do so?
In general, what are the best practices for benchmarking Hadoop? For example:
what is the recommended cluster specification (specs of datanodes and namenodes), the recommended test data size, and what configuration/specs should the test bed have in order to produce results that reflect real-life Hadoop applications?
Simply put, I want to know the correct Hadoop test-rig setup and the correct test methods so that my results are relatable to production clusters.
It would be helpful to have references to proven work.
Another question:
Suppose I run with -nrFiles 15 -fileSize 1GB.
I found that the number of map tasks equals the value given for nrFiles.
But how are they distributed among the 3 data nodes? The figure of 15 map tasks is not clear to me. Does each of the 15 files get one mapper working on it?
I have not found any document or description of how exactly TestDFSIO works.
You cannot directly compare results between two clusters; the results vary with the number of mappers per node, the replication factor, the network, etc.
The cluster specification depends on what you are trying to use it for.
If you provide -nrFiles 15 -fileSize 1000, 15 files of 1 GB each are created. Each mapper works on a single file, so there are 15 map tasks. For your 3-node cluster, assuming just 1 mapper per node, there would be 5 waves to write the full data.
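A small sketch of that wave arithmetic (the one-mapper-per-node figure is the assumption made above, not a TestDFSIO default):

```python
# Number of "waves" needed to run all TestDFSIO map tasks.
import math

nr_files = 15            # -nrFiles: one map task per file
nodes = 3
map_slots_per_node = 1   # assumption from the answer above
waves = math.ceil(nr_files / (nodes * map_slots_per_node))
print(waves)             # 5
```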
Refer to the link below for TestDFSIO and other benchmarking tools: http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/

What is the performance of HSQLDB with several clients

I would like to use HSQLDB + Hibernate in a server with 5 to 30 clients that will write to the DB fairly intensively.
Each client will persist about twelve thousand rows into a single table every 30 seconds (24/7; with 30 clients that is roughly 1 billion rows/day), and the clients will also query the database for a few thousand rows at more or less random times, at an average frequency of a couple of requests every 5 to 10 seconds.
Can HSQLDB handle such a use case, or should I switch to MySQL/PostgreSQL?
You are looking at a total of 2000 - 12000 writes and 5000 - 30000 reads per second.
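A back-of-the-envelope for those figures (the 12,000 rows per 30 s is from the question; the rows-per-read-request value is a rough assumption consistent with "a few thousand rows a couple of times every 5-10 seconds"):

```python
# Rough write/read row rates for 5 and 30 clients.
for clients in (5, 30):
    writes_per_sec = clients * 12_000 / 30     # ~400 rows/s written per client
    reads_per_sec = clients * 2 * 2_500 / 5    # ~1,000 rows/s read per client
    print(f"{clients} clients: ~{writes_per_sec:,.0f} writes/s, ~{reads_per_sec:,.0f} reads/s")
# 5 clients:  ~2,000 writes/s,  ~5,000 reads/s
# 30 clients: ~12,000 writes/s, ~30,000 reads/s
```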
With fast hardware, HSQLDB can probably handle this with persistent memory tables. With CACHED tables, it may be able to handle the lower range with solid state disks (disk seek time is the main parameter).
See this test. You can run it with MySQL and PostgreSQL for comparison.
http://hsqldb.org/web/hsqlPerformanceTests.html
You should switch. HSQLDB is not for critical apps. Be prepared for data corruption and decreasing startup performance over time.
The main negative hype comes from JBoss: https://community.jboss.org/wiki/HypersonicProduction
See also http://www.coderanch.com/t/89950/JBoss/HSQLDB-production
Also see this similar question: Is it safe to use HSQLDB for production? (JBoss AS5.1)