I want to understand how transactions work in SQL, specifically in PostgreSQL
Imagine I have a very large table (first_table) and the query below takes about 2 seconds to run. I execute it via psql:
sudo -u postgres psql -f database/query.sql
This is the query:
TRUNCATE TABLE second_table;
INSERT INTO second_table (
foo1
,foo2
)
SELECT foo1
, foo2
FROM first_table;
What can happen if I execute another query selecting from second_table at the same time the previous query is executing? Notice the TRUNCATE TABLE at the start of the previous query.
For example:
SELECT * FROM second_table;
EDIT: What I mean is: would I get zero or non-zero records in the second query?
Under reasonable transaction isolation levels, the database does not allow dirty reads, meaning no transaction can see changes from other transactions that have not yet been committed. (In PostgreSQL, it is not even an option to turn that off, which is a very sensible choice in my book.)
That means that the second query will either see the contents of the table before the TRUNCATE, or it will see the new records added after the TRUNCATE. But it will not see something in between; i.e., it will not get an empty table (assuming there were records in the table before the TRUNCATE), and it will not see an incomplete half of the new records (or even a weird mix).
If you say that the second query returns before the first query has committed, then it will have seen the state of the table before any changes from the first query have been applied.
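Note, though, that this all-or-nothing visibility assumes the TRUNCATE and INSERT run in one transaction. By default, psql -f runs each statement in its own transaction (autocommit), so a concurrent SELECT could catch second_table empty between the two statements. Wrapping the script in an explicit transaction, or invoking psql with -1/--single-transaction, closes that window. A minimal sketch of the wrapped script:
BEGIN;
TRUNCATE TABLE second_table;
INSERT INTO second_table (foo1, foo2)
SELECT foo1, foo2
FROM first_table;
COMMIT;
-- alternatively, leave query.sql unchanged and run:
-- sudo -u postgres psql -1 -f database/query.sql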
Related
Let's say we have one table bookings which contains a billion records. We wrote a simple SELECT query to select some records from this table with some WHERE clauses (it doesn't matter what is in the WHERE clause; the query takes several seconds). While this SELECT query is executing (and before it finishes), we then inserted a record into our bookings table (this record satisfies the WHERE clauses of the first SELECT query).
The question: "Will this new record be selected when first SELECT query finishes its work?"
Preferably I want answer about PostgreSQL case, but would be glad to hear about how MySQL, SQL Server and others would behave in such a situation.
Thanks.
Will this new record be selected when the first SELECT query finishes its work?
No.
Every statement is an atomic operation, and sees a consistent state (=snapshot) of the database as it was at the moment when the statement started.
The above applies to Postgres and Oracle (maybe to other DBMS as well, but I can't say for sure; some support dirty reads, where this wouldn't be guaranteed).
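To make this concrete, here is a minimal two-session sketch for PostgreSQL (the bookings columns are assumptions, not from the original question):
-- Session 1: assume this scan takes several seconds on the billion-row table
SELECT count(*) FROM bookings WHERE booking_date >= DATE '2020-01-01';

-- Session 2: committed while session 1 is still scanning
INSERT INTO bookings (booking_date) VALUES (CURRENT_DATE);

-- Session 1 returns the count as of the moment its statement started;
-- the newly inserted (and committed) row is not included.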
I have a select query that streams data back. Suppose I run the query and the data count is 100; while the data is being retrieved, a few more rows are inserted, for example 10 more. Now my question is: will the select return 100 or 110 rows?
This gets into Isolation in RDBMS environments. For example, in SQL Server, if I run a query that selects all COMMITTED data from a table, and at the time it has 100 rows, I will return 100 rows. If this table is currently being inserted into and those new rows are not yet committed, I will still return 100 rows (assuming the table is not locked). I have to rerun the query each time. The result set will not just magically get bigger. You have to issue a select each time you want to return data.
Now, if I am selecting UNCOMMITTED data and using something like NOLOCK, each time I run my select, I will return records that have not been committed yet. This means that each time I run my select, while the table is receiving new records, I will see those new records each time my data set is returned. This is helpful to see the newest records as they are coming in, but this can lead to dirty reads if for any reason that transaction fails or gets rolled back.
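A minimal T-SQL sketch of the difference (dbo.Orders is a placeholder table name):
-- default READ COMMITTED: returns only rows committed when the scan reaches them
SELECT COUNT(*) FROM dbo.Orders;

-- READ UNCOMMITTED via the NOLOCK hint: can also see in-flight, uncommitted rows
-- (dirty reads), so repeated runs may grow while another session is inserting
SELECT COUNT(*) FROM dbo.Orders WITH (NOLOCK);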
I have a Day-Partitioned Table on BigQuery. When I try to delete some rows from the table using a query like:
DELETE FROM `MY_DATASET.partitioned_table` WHERE id = 2374180
I get the following error:
Error: DML statements are not yet supported over partitioned tables.
A quick Google search leads me to: https://cloud.google.com/bigquery/docs/loading-data-sql-dml where it also says: "DML statements that modify partitioned tables are not yet supported."
So for now, is there a workaround that we can use in deleting rows from a partitioned table?
DML has some known issues/limitations at this stage, such as:
DML statements cannot be used to modify tables with REQUIRED fields in their schema.
Each DML statement initiates an implicit transaction, which means that changes made by the statement are automatically committed at the end of each successful DML statement. There is no support for multi-statement transactions.
The following combinations of DML statements are allowed to run concurrently on a table:
UPDATE and INSERT
DELETE and INSERT
INSERT and INSERT
Otherwise one of the DML statements will be aborted. For example, if two UPDATE statements execute simultaneously against the table then only one of them will succeed.
Tables that have been written to recently via BigQuery Streaming (tabledata.insertall) cannot be modified using UPDATE or DELETE statements. To check if the table has a streaming buffer, check the tables.get response for a section named streamingBuffer. If it is absent, the table can be modified using UPDATE or DELETE statements.
DML statements that modify partitioned tables are not yet supported.
Also be aware of the quota limits:
Maximum UPDATE/DELETE statements per day per table: 48
Maximum UPDATE/DELETE statements per day per project: 500
Maximum INSERT statements per day per table: 1,000
Maximum INSERT statements per day per project: 10,000
What you can do is copy the entire partition to a non-partitioned table, execute the DML statement there, and then write the temp table back to the partition. Also, if you run into the DML UPDATE statement limit per day per table, you need to create a copy of the table and run the DML on the new table to avoid the limit.
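A hedged sketch of that copy-out / modify / copy-back workaround with the bq CLI (dataset, table, and column names are placeholders; bq cp understands partition decorators):
# 1) copy the partition out to a scratch (non-partitioned) table
bq cp 'my_dataset.partitioned_table$20171207' my_dataset.scratch
# 2) run the DML against the scratch table, where it is supported
bq query --use_legacy_sql=false 'DELETE FROM my_dataset.scratch WHERE id = 2374180'
# 3) overwrite the original partition with the modified copy
bq cp -f my_dataset.scratch 'my_dataset.partitioned_table$20171207'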
You could delete partitions in partitioned tables using the command-line bq rm, like this:
bq rm 'mydataset.mytable$20160301'
I've already done it without a temporary table. Steps:
1) Prepare a query which selects all the rows from the particular partition that should be kept:
SELECT * FROM `your_data_set.tablename` WHERE
_PARTITIONTIME = timestamp('2017-12-07')
AND condition_to_keep_rows_which_shouldn't_be_deleted = 'condition'
If necessary, run this for other partitions.
2) Choose a destination table for the result of your query, where you point TO THE PARTICULAR PARTITION; you need to provide the table name like this:
tablename$20171207
3) Check the option "Overwrite table" -> it will overwrite only the particular partition.
4) Run the query; as a result, the redundant rows will be deleted from the pointed partition!
Remember that you may need to run this for other partitions, if the rows to be deleted are spread across more than one partition.
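The same steps can be scripted with the bq CLI instead of the UI; a sketch under the same assumed names (--replace corresponds to the "Overwrite table" option):
bq query --use_legacy_sql=false --replace \
  --destination_table 'your_data_set.tablename$20171207' \
  'SELECT * FROM `your_data_set.tablename`
   WHERE _PARTITIONTIME = TIMESTAMP("2017-12-07")
     AND condition_to_keep_rows = "condition"'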
It looks like, as of this writing, this is no longer a BigQuery limitation!
In standard SQL, a DELETE statement like the one above, over a partitioned table, will succeed, assuming the rows being deleted weren't recently (within the last 30 minutes) inserted via a streaming insert.
Current docs on DML: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-manipulation-language
Example Query that worked for me in the BQ UI:
DELETE
FROM dataset_name.partitioned_table_on_timestamp_column
WHERE
timestamp >= '2020-02-01' AND timestamp < '2020-06-01'
After the hamsters are done spinning, we get the BQ response:
This statement removed 101 rows from partitioned_table_on_timestamp_column
I have a temp table created in DB2 using Java, with some 2000 rows then inserted into it.
Now I am using this table in a select query with a few joins. The query involves 3 other tables, say A, B and C. Somehow this select query returns results very slowly; it takes almost 20 seconds to provide results.
Below are the details of this query,
Table A has 200000 records; B and C have only 100-200 records each.
All these 3 tables have enough indexes defined on the columns involved in the joins and the WHERE clause. The explain plan tool etc. did not show any new indexes needed.
When I run the query with the session table (and its use in the WHERE clause) removed, it returns results in milliseconds. And as I mentioned, this session table has only around 2000 records.
I have also declared indexes on each column of this session table.
I am not really sure about the terminology here, but when I say session table I mean a temporary table created using the DB connection; the table gets dropped when the DB connection is closed. Also, when the program runs with 15 threads, no thread is capable of looking at a table created by another thread.
Where could the issue be? Please let me know some suggestions here.
Some suggestions (I assume LUW, since you don't mention the platform):
a) You say that you have indexes on each column of your session table; I assume this means a set of single-column indexes. This is in most cases not optimal; you can probably replace those with one composite index. Check what db2advis suggests by creating a real table like:
create table temp.t ( ... )
insert into temp.t (...) values (...)
runstats on table temp.t with distribution
then run: db2advis -d <dbname> -m I -s "your query, but with temp.t instead of the session table"
b) After data is loaded and - eventually - new indexes are created, do a runstats on the session table
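A minimal sketch of suggestion (a) applied to the session table itself, assuming a declared global temporary table with two placeholder columns c1 and c2 used in the join/WHERE (recent DB2 LUW versions allow indexes and runstats on declared temporary tables):
DECLARE GLOBAL TEMPORARY TABLE session.t (c1 INT, c2 INT)
  ON COMMIT PRESERVE ROWS NOT LOGGED
-- one composite index instead of several single-column ones
CREATE INDEX session.t_ix ON session.t (c1, c2)
-- refresh statistics after the ~2000 rows are loaded
RUNSTATS ON TABLE session.t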
I'm currently using Oracle 11g, and let's say I have a table with the following columns (more or less):
Table1
ID varchar(64)
Status int(1)
Transaction_date date
tons of other columns
And this table has about 1 billion rows. I would want to update the status column with a specific where clause, let's say:
where transaction_date = somedatehere
What other alternatives can I use rather than just the normal UPDATE statement?
Currently what I'm trying to do is use CTAS or INSERT INTO ... SELECT to take the rows that I want to update and put them in another table, using a literal AS COLUMN_NAME so the values are already updated in the new/temporary table. It looks something like this:
INSERT INTO TABLE1_TEMPORARY (
ID,
STATUS,
TRANSACTION_DATE,
TONS_OF_OTHER_COLUMNS)
SELECT
ID,
3 AS STATUS,
TRANSACTION_DATE,
TONS_OF_OTHER_COLUMNS
FROM TABLE1
WHERE
TRANSACTION_DATE = SOMEDATE
So far everything seems to work faster than the normal update statement. The problem now is that I want to get the remaining data from the original table: the rows I do not need to update, but which do need to be included in my updated table/list.
What I tried at first was to DELETE from the original table using the same where clause, so that in theory everything left in that table would be the data that I do not need to update, leaving me with the two tables:
TABLE1 --which now contains the rows that i did not need to update
TABLE1_TEMPORARY --which contains the data I updated
But the delete statement is itself also too slow, or as slow as the original UPDATE statement, so dropping the delete statement brings me to this point:
TABLE1 --which contains BOTH the data that I want to update and do not want to update
TABLE1_TEMPORARY --which contains the data I updated
What other alternatives can I use to get the data that's the opposite of my WHERE clause? (Take note that the where clause in this example has been simplified, so I'm not looking for an answer of NOT EXISTS/NOT IN/NOT EQUALS; those clauses are also slower than positive clauses.)
I have ruled out deletion by partition since the data I need to update and not update can exist in different partitions, as well as TRUNCATE since I'm not updating all of the data, just part of it.
Is there some kind of JOIN statement I can use with my TABLE1 and TABLE1_TEMPORARY in order to filter out the data that does not need to be updated?
I would also like to achieve this with as little REDO/UNDO/logging as possible.
Thanks in advance.
I'm assuming this is not a one-time operation, but you are trying to design for a repeatable procedure.
Partition/subpartition the table in such a way that the rows touched are not spread over all partitions but are confined to a few partitions.
Ensure your transactions wouldn't use these partitions for now.
For each partition/subpartition you would normally UPDATE, perform a CTAS of all the rows (I mean even the rows which stay the same go to TABLE1_TEMPORARY). Then EXCHANGE PARTITION and rebuild the index partitions (see the sketch below).
At the end rebuild global indexes.
If you don't have Oracle Enterprise Edition, you would need either to CTAS the entire billion rows (followed by ALTER TABLE RENAME instead of ALTER TABLE EXCHANGE PARTITION) or to prepare some kind of "poor man's partitioning" using a view (SELECT UNION ALL SELECT UNION ALL SELECT, etc.) and a bunch of tables.
There is some chance that this mess would actually be faster than UPDATE.
I'm not saying that this is elegant or optimal, I'm saying that this is the canonical way of speeding up large UPDATE operations in Oracle.
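A hedged sketch of one round of that loop, assuming a partition named p_2014_01 and the simplified columns from the question (this is the shape of the technique, not an exact procedure):
CREATE TABLE table1_temporary NOLOGGING AS
SELECT id,
       CASE WHEN transaction_date = DATE '2014-01-15' THEN 3
            ELSE status END AS status,
       transaction_date
       -- plus the tons of other columns, unchanged
FROM   table1 PARTITION (p_2014_01);

ALTER TABLE table1
  EXCHANGE PARTITION p_2014_01 WITH TABLE table1_temporary
  WITHOUT VALIDATION;
-- then rebuild the affected index partitions and, at the end, global indexes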
How about keeping the UPDATE on the same table, but breaking it into multiple small chunks?
UPDATE .. WHERE transaction_date = somedatehere AND id BETWEEN 0000000 and 0999999
COMMIT
UPDATE .. WHERE transaction_date = somedatehere AND id BETWEEN 1000000 and 1999999
COMMIT
UPDATE .. WHERE transaction_date = somedatehere AND id BETWEEN 2000000 and 2999999
COMMIT
This could help if the total workload is potentially manageable but doing it all in one chunk is the problem; this approach breaks it into modest-sized pieces.
Doing it this way could, for example, enable other apps to keep running and give other workloads a look-in, and it would avoid needing a single humongous transaction in the log file.
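A PL/SQL sketch of the same idea, assuming roughly numeric, evenly spread IDs and a chunk width of one million (both assumptions; adjust the bounds to the real key distribution):
DECLARE
  somedatehere DATE := DATE '2014-01-15';  -- placeholder date from the question
BEGIN
  FOR i IN 0 .. 999 LOOP
    UPDATE table1
       SET status = 3
     WHERE transaction_date = somedatehere
       AND id BETWEEN i * 1000000 AND i * 1000000 + 999999;
    COMMIT;  -- release locks and undo after each chunk
  END LOOP;
END;
/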