How can I tell how many parallel workers remain in the worker pool, across sessions, for PostgreSQL parallel queries?

Let's say I have a pool of 4 parallel workers in my PostgreSQL database configuration. I also have 2 sessions.
In session #1, a query is currently executing, and the planner happened to choose 2 parallel workers for it.
So, from session #2, how can I tell that the pool of workers has decreased by 2?

You can count the parallel worker backends:
SELECT current_setting('max_parallel_workers')::integer AS max_workers,
       count(*) AS active_workers
FROM   pg_stat_activity
WHERE  backend_type = 'parallel worker';

Internally, Postgres keeps track of how many parallel workers are active with two variables named parallel_register_count and parallel_terminate_count. The difference between the two is the number of active parallel workers. See the comment in the source code.
Before registering a new parallel worker, this number is checked against the max_parallel_workers setting in the source code here.
Unfortunately, I don't know of any direct way to expose this information to the user.
You'll see the effects of an exhausted limit in query plans. You might try EXPLAIN ANALYZE with a SELECT query on a big table that's normally parallelized; you would see fewer workers launched than planned (a sketch follows the quote below). The manual:
The total number of background workers that can exist at any one time
is limited by both max_worker_processes and max_parallel_workers.
Therefore, it is possible for a parallel query to run with fewer
workers than planned, or even with no workers at all.
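For instance, a minimal check along those lines, assuming a hypothetical big_table that is large enough for the planner to pick a parallel plan while session #1 is holding 2 workers:
EXPLAIN ANALYZE
SELECT count(*) FROM big_table;
-- On the Gather node of the output, compare the two counters, e.g.:
--   Workers Planned: 2
--   Workers Launched: 0    <- the pool was exhausted by the other session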

Can this SQL operation be done without doing it row by row (RBAR)?

I have a set of tasks, with some tasks being more important than others.
Each task does some work on one or more databases.
These tasks are assigned to workers that will perform the task (threads in an application poll a table).
When the worker has done the task, it sets the value back to null to signal it can accept work again.
When assigning the tasks to the workers, I would like to impose an upper limit on the number of database connections that can be used at any one time - so a task that uses a database that is currently at its limit will not be assigned to a worker.
I can get the number of database connections available by subtracting the connections used by tasks currently assigned to workers from each database's limit.
My problem is this, how do I select tasks that can run, in order of importance, based on the number of database connections available, without doing it row by row?
I'm hoping the example below illustrates my problem:
On the right are the available database connections, decreasing as we go down the list of tasks in order of importance.
If I'm selecting them in order of task importance, then the connections available to the next task depend on whether the previous one was selected, which in turn depends on whether there was room for all of its database connections.
In the case above, task 7 can run only because task 6 couldn't.
Also, task 8 can't run because task 5 took the last connection to database C, as it's a more important task.
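(The original example table did not survive the copy, so here is a purely hypothetical sketch of the kind of structure involved - table and column names are mine, not the asker's:)
-- Each task needs one connection to each database it touches,
-- and each database has a hard connection limit.
CREATE TABLE db_limits (
    db_name         varchar(10) PRIMARY KEY,
    max_connections int NOT NULL
);
CREATE TABLE tasks (
    task_id    int PRIMARY KEY,
    importance int NOT NULL               -- higher = more important
);
CREATE TABLE task_databases (
    task_id int REFERENCES tasks (task_id),
    db_name varchar(10) REFERENCES db_limits (db_name),
    PRIMARY KEY (task_id, db_name)
);
-- Walking tasks in order of importance, a task is runnable only if every database
-- it touches still has a spare connection after the more important runnable tasks
-- have taken theirs - which is what makes a single set-based query awkward.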
Question:
Is there a way to work this out without using while loops and doing it row by row?

Is there a limit to queries using BigQuery's library and API?

I want to know whether there is any limit when querying data already loaded into BigQuery.
For example, if I want to query BigQuery data from a web application or a "web service", what is my limit on selects, updates and deletes?
The documentation tells me this:
Concurrent rate limit for interactive queries under on-demand pricing: 50 concurrent queries. Queries that return cached results, or queries configured using the dryRun property, do not count against this limit.
Daily query size limit: unlimited by default, but you may specify limits using custom quotas.
But I cannot tell whether I have a limit on the number of queries per day, and if so, what that limit is.
There is a limit to the number of slots you can allocate for queries at a particular time.
Some nuggets:
Slot: represents one unit of computational capacity.
Query: Uses as many slots as required so the query runs optimally (Currently: MAX 50 slots for On Demand Price) [A]
Project: The slots used per project is based on the number of queries that run at the same time (Currently: MAX 2000 slots for On Demand Price)
[A] This is all under the hood without user intervention. BigQuery makes an assessment of the query to calculate the number of slots required.
So if you do the math, worst case, if all your queries use 50 slots, you will not find any side effect until you have more than 40 queries running concurrently. Even in those situations, the queries will just be in the queue for a while and will start running after some queries are done executing.
Slots become more of a concern when you are sensitive to getting your data on time and the queries run in an environment where:
A lot of queries are running at the same time.
Most of those concurrent queries usually take a long time to execute even on an otherwise idle system.
The best way to understand whether these limits will impact you is to monitor the current activity within your project. BigQuery advises you to monitor your slots with Stackdriver.
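As one illustration - this assumes the newer INFORMATION_SCHEMA.JOBS_BY_PROJECT view rather than Stackdriver, and a project in the US multi-region - you can get a rough hourly picture of concurrency and slot consumption:
SELECT
  TIMESTAMP_TRUNC(creation_time, HOUR) AS hour,
  COUNT(*) AS queries_started,
  ROUND(SUM(total_slot_ms) / (1000 * 3600), 1) AS avg_slots_in_use  -- slot-milliseconds spread over the hour
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY hour
ORDER BY hour;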
Update: BigQuery addresses the problem of query prioritization in one of their blog posts - Truth 5: BigQuery supports query prioritization

Rails/SQL How do I get different workers to work on different records?

I have (for argument's sake) 1000 records and 10 Heroku workers running.
I want to have each worker work on a different set of records.
What I've got right now is quite good, but not quite complete.
sql = 'update products set status = 2 where id in
       (select id from products where status = 1 limit 100) returning *'
records = connection.execute(sql)
This works rather well. I get 100 records and, at the same time, I make sure my other workers don't get the same 100.
If I throw it in a while loop then even if I have 20000 records and 2 workers, eventually they will all get processed.
My issue is that if there's a crash or exception, the 100 records look like they're being processed by another worker, but they aren't.
I can't use a transaction, because the other selects will pick up the same records.
My question
What strategies do others use to have many workers working on the same dataset, but on different records?
I know this is a conversational question... I'd put it as community wiki, but I don't see that ability any more.
Building a task queue in an RDBMS is annoyingly hard. I recommend using a queueing system that's designed for the job instead.
Check out PGQ, Celery, etc.
I have used queue_classic by Heroku to schedule jobs stored in a Postgres database.
If I were to do this, it would be with something other than a db-side queue. It sounds like standard client processing, but what you really want is parallel processing of the result set.
The simplest solution might be to do what you are doing but lock the rows on the client side, and divide them between workers there (spinlocks, etc.). You can then commit the transaction and re-run after these have finished processing.
The difficulty is that if the records you are processing trigger things that happen outside the server and there is a crash, you never really know which records were actually processed. Rolling back is probably safer, but keep that in mind.
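A minimal sketch of that claim-inside-a-transaction idea in Postgres might look like the following; it uses SELECT ... FOR UPDATE so a crash simply releases the locks (SKIP LOCKED, available from PostgreSQL 9.5, lets concurrent workers pass over each other's rows instead of blocking). Table and status values follow the question:
begin;
-- claim a batch: rows keep status = 1 until the work is committed,
-- so nothing is left half-claimed after a crash
select id
from   products
where  status = 1
order  by id
limit  100
for update skip locked;
-- ... process the claimed ids in the worker ...
update products set status = 2 where id in (/* claimed ids */);
commit;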

max memory per query

How can I configure the maximum memory that a query (a SELECT query) can use in SQL Server 2008?
I know there is a way to set the minimum value, but what about the maximum? I would like to use this because I have many processes running in parallel. I know about the MAXDOP option, but that is for processors.
Update:
What I am actually trying to do is run a data load continuously. This data load is an ETL process (extract, transform and load). While the data is being loaded I want to run some SELECT queries. All of them are expensive queries (containing GROUP BY). The most important process for me is the data load. I get an average speed of 10000 rows/sec, and when I run the queries in parallel it drops to 4000 rows/sec and even lower. I know a few more details should be provided, but this is part of a more complex product that I work on and I cannot go into more detail. One thing I can guarantee is that my load speed does not drop due to locking problems, because I have monitored for those and removed them.
There isn't any way of setting a maximum amount of memory at the per-query level that I can think of.
If you are on Enterprise Edition you can use Resource Governor to set a maximum amount of memory that a particular workload group can consume, which might help.
In SQL Server 2008 you can use Resource Governor to achieve this. There you can set REQUEST_MAX_MEMORY_GRANT_PERCENT to cap the memory grant (this is a percentage relative to the pool size specified by the pool's MAX_MEMORY_PERCENT value). This setting is not query specific; it is session specific.
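A minimal sketch of that setup, assuming a dedicated login (report_user) for the expensive reporting queries - the pool, group, and login names are illustrative, not prescribed:
USE master;
GO
CREATE RESOURCE POOL ReportPool
    WITH (MAX_MEMORY_PERCENT = 30);                  -- cap the pool itself
GO
CREATE WORKLOAD GROUP ReportGroup
    WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 25)     -- cap a single request's grant, relative to the pool
    USING ReportPool;
GO
CREATE FUNCTION dbo.fn_rg_classifier() RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @grp sysname = N'default';
    IF SUSER_SNAME() = N'report_user'                -- hypothetical reporting login
        SET @grp = N'ReportGroup';
    RETURN @grp;
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fn_rg_classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;
The ETL session stays in the default group, so its memory grants are not squeezed by the reporting queries.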
In addition to Martin's answer
If your queries are all the same or similar, working on the same data, then they will be sharing memory anyway.
Example:
A busy web site with 100 concurrent connections running 6 different parametrised queries between them on broadly the same range of data.
6 execution plans
100 user contexts
one buffer pool with assorted flags and counters to show usage of each data page
If you have 100 different queries or they are not parametrised then fix the code.
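As a rough illustration of what parametrised means here (table and parameter names are made up), the difference is between baking literals into the SQL text, which can compile a new plan per value, and passing them as parameters so one plan is reused:
-- ad hoc: potentially a new plan for every literal value
SELECT * FROM Orders WHERE CustomerId = 42;
-- parametrised: one plan, reused for every value
EXEC sp_executesql
     N'SELECT * FROM Orders WHERE CustomerId = @cust',
     N'@cust int',
     @cust = 42;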
Memory per query is something I've never thought or cared about since the last millennium.

How to split up a massive data query into multiple queries

I have to select all rows from a table with millions of rows (to preload a Coherence data grid). How do I split up this query into multiple queries that can be concurrently executed by multiple threads?
I first thought of getting a count of all records and doing:
SELECT ...
WHERE ROWNUM BETWEEN (packetNo * packetSize) AND ((packetNo + 1) * packetSize)
but that didn't work. Now I'm stuck.
Any help will be very appreciated.
If you have the Enterprise Edition license, the easiest way of achieving this objective is parallel query.
For one-off or ad hoc queries use the PARALLEL hint:
select /*+ parallel(your_table, 4) */ *
from your_table
/
The number in the hint is the number of slave queries you want to execute; in this case the database will run four threads.
If you want every query issued on the table to be parallelizable then permanently alter the table definition:
alter table your_table parallel (degree 4)
/
Note that the database won't always use parallel query; the optimizer will decide whether it's appropriate. Parallel query only works with full table scans or index range scans which cross multiple partitions.
There are a number of caveats. Parallel query requires us to have sufficient cores to satisfy the proposed number of threads; if we only have a single dual-core CPU setting a parallel degree of 16 isn't going to magically speed up the query. Also, we need spare CPU capacity; if the server is already CPU bound then parallel execution is only going to make things worse. Finally, the I/O and storage subsystems need to be capable of satisfying the concurrent demand; SANs can be remarkably unhelpful here.
As always in matters of performance, it is crucial to undertake some benchmarking against realistic volumes of data in a representative environment before going into production.
What if you don't have Enterprise Edition? Well, it is possible to mimic parallel execution by hand. Tom Kyte calls it "Do-It-Yourself Parallelism". I have used this technique myself, and it works well.
The key thing is to work out the total range ROWIDs which apply to the table, and split them across multiple jobs. Unlike some of the other solutions proposed in this thread, each job only selects the rows it needs. Mr Kyte summarized the technique in an old AskTom thread, including the vital split script: find it here.
Splitting the table and starting off threads is a manual task: fine as a one-off but rather tiresome to undertake frequently. So if you are running 11g release 2 you ought to know that there is a new PL/SQL package DBMS_PARALLEL_EXECUTE which automates this for us.
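A rough sketch of how that package is typically used - the task name, target table staging_copy, and chunk size are made up, and the statement must reference the :start_id and :end_id binds that the package supplies. (Since the goal here is to feed an external data grid, you could equally read the chunk boundaries from USER_PARALLEL_EXECUTE_CHUNKS and issue the SELECTs from application threads.)
BEGIN
  DBMS_PARALLEL_EXECUTE.CREATE_TASK('preload_task');
  -- carve the table into rowid ranges of roughly 10,000 rows each
  DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_ROWID(
      task_name   => 'preload_task',
      table_owner => USER,
      table_name  => 'YOUR_TABLE',
      by_row      => TRUE,
      chunk_size  => 10000);
  -- run the per-chunk statement with 4 concurrent job slaves
  DBMS_PARALLEL_EXECUTE.RUN_TASK(
      task_name      => 'preload_task',
      sql_stmt       => 'INSERT INTO staging_copy
                         SELECT * FROM your_table
                         WHERE rowid BETWEEN :start_id AND :end_id',
      language_flag  => DBMS_SQL.NATIVE,
      parallel_level => 4);
  DBMS_PARALLEL_EXECUTE.DROP_TASK('preload_task');
END;
/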
Are you sure a parallel execution of the query will be faster? This will only be the case if the huge table is stored on a disk array with many disks, or if it is partitioned over several disks. In all other cases, sequential access of the table will be many times faster.
If you really have to split the query, you have to split it in a way that sequential access for each part is still possible. Please post the DDL of the table so we can give a specific answer.
If the processing of the data or the loading into the data grid is the bottleneck, then you are better off reading the data with a single process and then splitting it before further processing.
Assuming that reading is fast and further data processing is the bottleneck, you could, for example, read the data and write it into very simple text files (such as fixed-length or CSV). After every 10,000 rows you start a new file and spawn a thread or process to handle the file you just finished.
Try something like this:
select * from
( select a.*, ROWNUM rnum from
( <your_query_goes_here, with order by> ) a
where ROWNUM <= :MAX_ROW_TO_FETCH )
where rnum >= :MIN_ROW_TO_FETCH;
Have you considered using MOD 10 on ROWNUM to pull the data one tenth at a time?
SELECT *
FROM   ( SELECT t.*, ROWNUM rnum FROM your_table t )
WHERE  MOD(rnum, 10) = 0;   -- materialise ROWNUM in an inline view first; each worker filters on a different remainder (0-9)