Why can't a user run two or more select queries concurrently on the same table?

While practicing DBMS and SQL using Oracle Database, I tried to fire two select queries on a table, and the database apparently always waits for the first query to finish executing and keeps the other one in a pipeline.
Consider a table MY_TABLE having 1 million records with a column 'id' that holds the serial number of records.
Now my queries are:
Query #1 - select * from MY_TABLE where id<500001; --I am fetching first 500,000 records here
Query #2 - select * from MY_TABLE where id>500000; --I am fetching next 500,000 records here
Since these are select queries, they must be acquiring a read lock on the table, which is a shared lock. So why does this happen? Note that, to the best of my knowledge, the row sets of the two queries are mutually exclusive because of the filters I applied via the where clauses, which further aggravates my confusion.
Also, the way I visualize it, there must be some process that evaluates my query and then does a handshake with the memory (i.e. the resource) to fetch the result. So any resource held under a shared lock should be accessible to all processes that hold that lock.
Secondly, is there any way to override this behavior and execute multiple select queries concurrently?
Note: I want to chunk a particular task (i.e. the data of a table) into pieces and speed up my script.

The database doesn't keep queries in a pipeline; it's simply that your client is only sending one query at a time. The database will quite happily run multiple queries against the same data at the same time, e.g. from separate sessions.

Related

How to establish read-only-once implement within SAP HANA?

Context: I am a long-time MSSQL developer... What I would like to know is how to implement a read-only-once select from SAP HANA.
High-level pseudo-code:
Collect request via db proc (query)
Call API with request
Store results of the request (response)
I have a table (A) that is the source of inputs to a process. Once a process has completed it will write results to another table (B).
Perhaps this is all solved if I just add a column to table A to prevent concurrent processors from selecting the same records from A?
I am wondering how to do this without adding the column to source table A.
What I have tried is a left outer join between tables A and B to get rows from A that have no corresponding rows (yet) in B. This doesn't work, or at least I haven't implemented it such that rows are processed only once by any of the processors.
I have a stored proc to handle batch selection:
/*
* getBatch.sql
*
* SYNOPSIS: Retrieve the next set of criteria to be used in a search
* request. Use left outer join between input source table
* and results table to determine the next set of inputs, and
* provide support so that concurrent processes may call this
* proc and get their inputs exclusively.
*/
alter procedure "ACOX"."getBatch" (
    in in_limit int
    ,in in_run_group_id varchar(36)
    ,out ot_result table (
        id bigint
        ,runGroupId varchar(36)
        ,sourceTableRefId integer
        ,name nvarchar(22)
        ,location nvarchar(13)
        ,regionCode nvarchar(3)
        ,countryCode nvarchar(3)
    )
) language sqlscript sql security definer as
begin
    -- insert new records:
    insert into "ACOX"."search_result_v4" (
        "RUN_GROUP_ID"
        ,"BEGIN_DATE_TS"
        ,"SOURCE_TABLE"
        ,"SOURCE_TABLE_REFID"
    )
    select
        in_run_group_id as "RUN_GROUP_ID"
        ,CURRENT_TIMESTAMP as "BEGIN_DATE_TS"
        ,'acox.searchCriteria' as "SOURCE_TABLE"
        ,fp.descriptor_id as "SOURCE_TABLE_REFID"
    from
        acox.searchCriteria fp
        left join "ACOX"."us_state_codes" st
            on trim(fp.region) = trim(st.usps)
        left outer join "ACOX"."search_result_v4" r
            on fp.descriptor_id = r.source_table_refid
    where
        st.usps is not null
        and r.BEGIN_DATE_TS is null
    limit :in_limit;

    -- select records inserted for return:
    ot_result =
        select
            r.ID id
            ,r.RUN_GROUP_ID runGroupId
            ,fp.descriptor_id sourceTableRefId
            ,fp.merch_name name
            ,fp.Location location
            ,st.usps regionCode
            ,'USA' countryCode
        from
            acox.searchCriteria fp
            left join "ACOX"."us_state_codes" st
                on trim(fp.region) = trim(st.usps)
            inner join "ACOX"."search_result_v4" r
                on fp.descriptor_id = r.source_table_refid
                and r.COMPLETE_DATE_TS is null
                and r.RUN_GROUP_ID = in_run_group_id
        where
            st.usps is not null
        limit :in_limit;
end;
When running 7 concurrent processors, I get a 35% overlap. That is to say that out of 5,000 input rows, the resulting row count is 6,755. Running time is about 7 mins.
Currently my solution includes adding a column to the source table. I wanted to avoid that, but it seems to make for a simpler implementation. I will update the code shortly, but it includes an update statement prior to the insert.
Useful references:
SAP HANA Concurrency Control
Exactly-Once Semantics Are Possible: Here’s How Kafka Does It
First off: there is no "read-only-once" in any RDBMS, including MS SQL.
Literally, this would mean that a given record can only be read once and would then "disappear" for all subsequent reads. (that's effectively what a queue does, or the well-known special-case of a queue: the pipe)
I assume that that is not what you are looking for.
Instead, I believe you want to implement a processing-semantic analogous to "once-and-only-once" aka "exactly-once" message delivery. While this is impossible to achieve in potentially partitioned networks it is possible within the transaction context of databases.
This is a common requirement, e.g. with batch data loading jobs that should only load data that has not been loaded so far (i.e. the new data that was created after the last batch load job began).
Sorry for the long pre-text, but any solution for this will depend on being clear on what we want to actually achieve. I will get to an approach for that now.
The major RDBMS have long figured out that blocking readers is generally a terrible idea if the goal is to enable high transaction throughput. Consequently, HANA does not block readers - ever (ok, not ever-ever, but in the normal operation setup).
The main issue with the "exactly-once" processing requirement really is not the reading of the records, but the possibility of processing more than once or not at all.
Both of these potential issues can be addressed with the following approach:
SELECT ... FOR UPDATE ... the records that should be processed (based on e.g. unprocessed records, up to N records, even-odd-IDs, zip-code, ...). With this, the current session has an UPDATE TRANSACTION context and exclusive locks on the selected records. Other transactions can still read those records, but no other transaction can lock those records - neither for UPDATE, DELETE, nor for SELECT ... FOR UPDATE ... .
Now you do your processing - whatever this involves: merging, inserting, updating other tables, writing log-entries...
As the final step of the processing, you want to "mark" the records as processed. How exactly this is implemented does not really matter.
One could create a processed-column in the table and set it to TRUE when records have been processed. Or one could have a separate table that contains the primary keys of the processed records (and maybe a load-job-id to keep track of multiple load jobs).
In whatever way this is implemented, this is the point in time where the processed status needs to be captured.
COMMIT or ROLLBACK (in case something went wrong). This will COMMIT the records written to the target table, the processed-status information, and it will release the exclusive locks from the source table.
As you see, Step 1 takes care of the issue that records may be missed by selecting all wanted records that can be processed (i.e. they are not exclusively locked by any other process).
Step 3 takes care of the issue of records potentially being processed more than once by keeping track of the processed records. Obviously, this tracking has to be checked in Step 1; the two steps are interconnected, which is why I point them out explicitly. Finally, all the processing occurs within the same DB transaction context, allowing for a guaranteed COMMIT or ROLLBACK across the whole transaction. That means that no "record marker" will ever be lost when the processing of the records was committed.
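A minimal sketch of this flow (table and column names are illustrative, not from the question; a boolean processed flag on the source table is assumed):

-- step 1: lock a batch of unprocessed records; other sessions can still read them
select id
from source_table
where processed = false
limit 1000
for update;

-- step 2: do the actual processing and write results to the target table(s)

-- step 3: mark the locked records as processed
update source_table
set processed = true
where id in (<ids locked in step 1>);

-- step 4: COMMIT makes the results and the processed flags atomic and releases the locks
commit;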
Now, why is this approach preferable to making records "un-readable"?
Because of the other processes in the system.
Maybe the source records are still read by the transaction system but never updated. This transaction system should not have to wait for the data load to finish.
Or maybe, somebody wants to do some analytics on the source data and also needs to read those records.
Or maybe you want to parallelise the data loading: it's easily possible to skip locked records and only work on the ones that are "available for update" right now. See e.g. Load balancing SQL reads while batch-processing? for that.
Ok, I guess you were hoping for something easier to consume; alas, that's my approach to this sort of requirement as I understood it.

locking logic for frequently queried view on frequently updated tables - please advise?

We face the following situation (Teradata):
Business layer frequently executes long-running queries on Table X_Past UNION ALL Table X_Today.
Table X_Today gets updated frequently, say once every 10 minutes. X_Past only once after midnight (per full-load).
Writing process should not block reading process.
Writing should happen as soon as new data is available.
Proposed approach:
2 "Today" and a "past" table, plus a UNION ALL view that selects from one of them based on the value in a load status table.
X_Today_1
X_Today_0
X_Past
the loading process will load into X_Today_1 and set the active_table value in the load status table to "X_Today_1"
the next time it will load into X_Today_0 and set the active_table value to "X_Today_0"
etc.
The view that is used to select on the table will be built as follows:
select *
from X_PAST
UNION ALL
select td1.*
from X_Today_1 td1
, ( select active_table from LOAD_STATUS ) active_tab1
where active_tab1.active_table = 'X_Today_1'
UNION ALL
select td0.*
from X_Today_0 td0
, ( select active_table from LOAD_STATUS ) active_tab0
where active_tab0.active_table = 'X_Today_0'
my main questions:
when executing the select, will there be a lock on ALL tables, or only on those that are actually accessed for data? Because of the where clause, data from one of the Today_1/0 tables will always be ignored, and that table should be available for loading;
do we need any form of locking or is the default locking mechanism that what we want (which I suspect it is)?
will this work, or am I overlooking something?
It is important that the loading process will wait in case the reading process takes longer than 20 minutes and the loader is about to refresh the second table again. The reading process should never really be blocked, except maybe by itself.
Any input is much appreciated...
thank you for your help.
A few comments on your questions:
Depending on the query structure, the Optimizer will try to get the default locks (in this case a READ lock) at different levels -- most likely table or row-hash locks. For example, if you do a SELECT * FROM my_table WHERE PI_column = 'value', you should get a row-hash lock and not a table lock.
Try running an EXPLAIN on your SELECT and see if it gives you any locking info. The Optimizer might be smart enough to determine there are 0 rows in one of the joined tables and reduce the lock requests. If it still locks both tables, see the end of this post for an alternative approach.
Your query written as-is will result in READ locks, which would block any WRITE requests on the tables. If you are worried about locking issues / concurrency, have you thought about using an explicit ACCESS lock? This would allow your SELECT to run without ever having to wait for your write queries to complete. This is called a "dirty read", since there could be other requests still modifying the tables while they are being read, so it may or may not be appropriate depending on your requirements.
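For example, the access-lock modifier can be requested per query or baked into the view definition (a sketch; X_Combined is an illustrative name, and the exact syntax should be verified against your Teradata release):

-- per request:
LOCKING TABLE X_Today_1 FOR ACCESS
SELECT * FROM X_Today_1;

-- or inside the view definition, so every reader gets the dirty read:
REPLACE VIEW X_Combined AS
LOCKING ROW FOR ACCESS
SELECT * FROM X_PAST
UNION ALL
SELECT * FROM X_Today_1
UNION ALL
SELECT * FROM X_Today_0;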
Your approach seems feasible. You could also do something similar, but instead of having two UNIONs, have a single "X_Today" view that points to the "active" table. After your load process completes, you could re-point the view to the appropriate table as needed via a MACRO call:
-- macros (switch between active / loading)
REPLACE MACRO switch_to_today_table_0 AS (
    REPLACE VIEW X_Today AS SELECT * FROM X_Today_0;
);
REPLACE MACRO switch_to_today_table_1 AS (
    REPLACE VIEW X_Today AS SELECT * FROM X_Today_1;
);
-- SELECT query
SELECT * FROM X_PAST UNION ALL SELECT * FROM X_Today;
-- Write request
MERGE INTO x_today_0...;
-- Switch active "today" table to the most recently loaded one
EXEC switch_to_today_table_0;
You'd have to manage which table to write to (or possibly do that using a view too) and which "switch" macro to call within your application.
One thing to think about is that having two physical tables that logically represent the same table (i.e. should have the same data) may potentially allow for situations where one table is missing data and needs to be manually synced.
Also, if you haven't looked at them already, a few ideas to optimize your SELECT queries to run faster: row partitioning, indexes, compression, statistics, primary index selection.

Simple select query running forever

A simple SQL select query is running forever for a particular ID in SQL Server 2012.
This query is running forever; it should return 10000 rows:
select *
from employees
where company_id = 34
If I change the query to
select *
from employees
where company_id = 12
it returns 7000 rows very quickly.
Employees is a view created by joining different tables.
Could there be a problem in the view?
One possibility is that you have a very large table. Such a query is probably scanning the entire tables and returning rows that match as they are encountered.
My guess is that rows for company 12 are encountered before rows for company 34.
If this is the case, then an index on (company_id) should help.
There may be other causes as well. Here are two other possibilities:
Contention for rows with company_id 34 that is causing delays in reading the data (this would depend on the isolation level you are using and the nature of the concurrent updates).
An unlimited-size column that is populated with very big values for company_id 34 and is empty or very small for company_id 12.
There may be other possibilities as well.
One of the things you can do to speed up the process is to index the company_id column, as a b-tree index would speed up the search.
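For example, assuming company_id lives in an underlying table such as employee_details (an illustrative name, since the view's definition isn't shown), something like:

CREATE INDEX IX_employee_details_company_id
    ON dbo.employee_details (company_id);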
Without looking at the structure of the table and execution plan, here are a few things that can be suggested apart from what Gordon has already covered:
Could you create indexes on the underlying tables which cover this query? That would mean an index on the 'searched' and 'sorted' columns (joins, where clause, order by, group by, distinct), with the SELECTED columns in the INCLUDE part of the index (in the case of a nonclustered rowstore index). The aim is to see an 'index seek' in the execution plan.
Update statistics on the underlying tables. (As a side note, I would suggest keeping 'AUTO CREATE' and 'AUTO UPDATE' statistics ON unless you have a reason not to do that automatically in your application.)
I would also like to know when index defragmentation was last performed on the server. Long-overdue defragmentation could very well explain why you see this kind of issue for certain values, especially on a table with a lot of write activity.
Execute the query again. Even if you do not have the information asked for in #3 above, you can try executing the query and skip step #3.
While running the query, check the wait stats on the server by querying the DMVs sys.dm_os_wait_stats and sys.dm_tran_locks. Check whether the wait is due to CXPACKET (waits caused by other parallel processes), PAGEIOLATCH (reading from disk rather than RAM), or locks. This is the starting point of the investigation; it will point you to the root cause so you can take the appropriate measures.
An additional quick check: look at the available RAM in the server's task manager and make sure that your SQL Server's RAM is not being used up by other unnecessary applications or sessions.
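As a rough starting point (a sketch only; filter and interpret the results for your own workload), the waits and locks can be inspected like this while the query is running:

-- top waits on the instance
SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

-- locks currently held or requested
SELECT request_session_id, resource_type, request_mode, request_status
FROM sys.dm_tran_locks;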

Breaking down a large number of rows into smaller queries? Parallelism

I want to create an external application which will query one table from a large Oracle database.
The query will run daily and I am expecting to handle 30,000+ rows.
To break this work down, I would like to create a new thread/process for every 10,000 rows. Going by the figure above, that would be 3 threads to process all those rows.
I don't want the threads' row sets to overlap, so I know I will need to add a column to the table to act as a range marker, a row_position.
Logic
Get row_count of data set in query parameters
Get first_row_pos
While (row_count > 10,000)
{
Create thread with 10,000 rows starting from first_row_pos
row_count = row_count - 10,000
first_row_pos = first_row_pos + 10,000
}
create thread for remaining rows
all threads run their queries concurrently.
This is basic logic at the moment; however, I do not know how feasible this is.
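For illustration, assuming the row_position column described above is a dense sequence, each thread would then run its own range query, something like (my_table stands in for the actual table name):

-- thread 1
select * from my_table where row_position between 1 and 10000;
-- thread 2
select * from my_table where row_position between 10001 and 20000;
-- thread 3 (remainder)
select * from my_table where row_position > 20000;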
Is this a good way or is there a better way?
Can this be done through one database connection, with each thread sharing it, or is it better to have a separate db connection for each thread?
Any other advice is welcome.
Note: I just realised a do-while loop would be better if there are fewer than 10,000 rows in this case.
Thanks
Oracle provides a parallel hint for situations such as this, where you have a full table scan or similar problem and want to make use of multiple cores to divide the workload. Further details here.
The syntax is very simple: you specify the table (or alias) and the number of cores (I usually leave it as the default), e.g.:
select /*+ parallel(a, default) */ *
from table_a a
You can also use this with multiple tables e.g.
select /*+ parallel(a, default) parallel(b,default) */ *
from table_a a, table_b b
where a.some_id = b.some_id
A database connection is not thread-safe, so if you are going to query the database from several threads, you would have to have a separate connection for each of them. You can either create a connection or get them from a pool.
Before you implement your approach, take some time to analyze where the time is spent. Oracle overall is pretty good at utilizing multiple cores, and the database interaction is usually the most time-consuming part. By splitting the query in three you might actually slow things down.
If indeed your application is spending most of the time performing calculations on that data, your best approach might be loading all data in a single thread and then splitting processing into multiple threads.

Postgres: How to fire multiple queries in same time?

I have one procedure which updates record values, and I want to fire it against all records in a table (over 30k records); procedure execution time is from 2 up to 10 seconds, because it depends on network load.
Right now I'm doing UPDATE table SET field = procedure_name(paramns); but with that number of records it takes up to 40 minutes to process the whole table.
Now I'm using 4 different connections which fork to the background and fire the query with a WHERE clause set to iterate over the modulo of the row ids to speed this up ( WHERE id_field % 4 = ), and this works well and cuts the time to populate the table down to ~10 mins.
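For reference, the four background statements presumably look something like this (using the placeholder names from above, one statement per connection):

UPDATE table SET field = procedure_name(paramns) WHERE id_field % 4 = 0;
UPDATE table SET field = procedure_name(paramns) WHERE id_field % 4 = 1;
UPDATE table SET field = procedure_name(paramns) WHERE id_field % 4 = 2;
UPDATE table SET field = procedure_name(paramns) WHERE id_field % 4 = 3;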
But I want to avoid using cron, shell jobs, and multiple connections for this. I know it can be done with libpq, but is there a way to fire a query (4 different non-blocking queries) and not wait for it to finish executing, within a single connection?
Or can anyone point me to some clues on how to write such a function using Postgres internals, or simply in C, and bind it as a stored procedure?
Cheers Darius
I've got a sure answer for this question - IF you will share with us what your ab workout is!!! I'm getting fat by the minute and I need answers myself...
OK I'll answer anyway.
If you are updating one table, on one database server, in 40 minutes 'single threaded' and in 10 minutes with 4 threads, the bottleneck is not the database server; otherwise, it would get bogged down in I/O. If you are executing a bunch of UPDATES, one call per record, the network round-trip time is killing you.
I'm pretty sure this is the case, and not that it's an I/O bottleneck on the DB or that procedure_name(paramns) is taking a long time. (If the procedure itself took 2-10 seconds, it would take something like 2,500 minutes to do 30K records.) The reason I am sure is that starting 4 concurrent processes cuts the time to a quarter, so in particular it is not an I/O issue on the DB server.
This might be the one excuse for putting business logic in an SP on the server. Optimization unfortunately means breaking the rules. The consequence is difficult maintenance. but, duh!!
However, the best solution would be to get this set up to use 'bulk update' queries. That might mean you have to take several strange and unintuitive steps such as this:
This will require a lot of modification if multiple users can run it concurrently.
refactor the system so procedure_name(paramns) can get all the data it needs to process all records via a select statement. May need to use creative joins. If it's an SP of course now you are moving the logic to the client.
Have the program create an XML or other importable flat-file format with the PK of the record to update and the new field value or values. Write all the updates to this file instead of executing them on the DB.
have a temp table on the database that matches the layout of this flat file
run an import on the database - clear the temp table and import the file
do an update of a join of the temp table and the table to be updated, e.g., UPDATE mytbl, mytemp WHERE myPK=mytempPK SET myval=mytempnewval (use the right join syntax of course).
You can try some of these things 'by hand' first before you bother coding, to see if it's worth the speed increase.
If possible, you can still put this all in an SP!
I'm not making any guarantees, especially as I look down at my ever-fattening belly, but, this has the potential to melt your update job down to under a minute.
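A rough sketch of the temp-table import and join-update steps in Postgres (the table and column names are illustrative, and the flat file is assumed to be CSV):

-- temp table matching the flat-file layout
create temp table mytemp (pk integer, newval text);

-- load the file produced by the program (server-side COPY; use psql's \copy for a client-side file)
copy mytemp from '/tmp/updates.csv' with (format csv);

-- one set-based update joining the temp table to the target
update mytbl
set myval = mytemp.newval
from mytemp
where mytbl.pk = mytemp.pk;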
It is possible to update multiple rows at once. Below is an example in Postgres:
UPDATE
table_name
SET
column_name = temp.column_name
FROM
(VALUES
(<id1>, <value1>),
(<id2>, <value2>),
(<id3>, <value3>)
) AS temp("id", "column_name")
WHERE
table_name.id = temp.id
PHP has some functions for asynchronous queries:
pg_send_execute()
pg_send_prepare()
pg_send_query()
pg_send_query_params()
No idea about other programming languages; you have to dig into the manuals.
I think you can't. A single connection can handle only a single query at a time. It's described in the libpq documentation, in the chapter "Asynchronous Command Processing":
"After successfully calling PQsendQuery, call PQgetResult one or more times to obtain the results. PQsendQuery cannot be called again (on the same connection) until PQgetResult has returned a null pointer, indicating that the command is done."