Batch processing versus Single row transactions for atomicity - sql

I have two tables: one holds records of the reports generated, and the other holds a flag to be updated once the reports have been generated. The script will be scheduled, and the SQL statements have already been written. However, there are two possible implementations of the script:
Case 1:
- Insert all the records, then
- Update all the flags,
- Commit if all is well
Case 2:
While (there are records)
- Insert a record,
- Update the flag
- Commit if all is well
Which should be preferred and why?
A transaction for Case 1 covers all the inserts and then all the updates; it's all or nothing. I'm led to believe this is faster, though perhaps not if the connection to the remote database keeps getting interrupted. It requires very little client-side processing, but if the inserts fail midway, we'll have to rerun from the top.
A transaction for Case 2 is one insert plus one update. This requires keeping track of each inserted record and updating that specific record, so I'll have to use placeholders (a sketch of one such iteration follows the Case 1 snippet below). Granted, the database may cache the SQL and reuse the execution plan, but I suspect this would still be slower than Case 1 because of the additional client-side processing. However, on an unreliable connection, which we have to assume, this looks like the better choice.
EDIT 5/11/2015 11:31AM
CASE 1 snippet:
my $sql = "INSERT INTO eval_rep_track_dup\#prod \
select ert.* \
from eval_rep_track ert \
inner join \
(
select erd.evaluation_fk, erd.report_type, LTRIM(erd.assign_group_id, '/site/') course_name \
from eval_report_dup\#prod erd \
inner join eval_report er \
on er.id = erd.id \
where erd.status='queue' \
and er.status='done' \
) cat \
on ert.eval_id = cat.evaluation_fk \
and ert.report_type = cat.report_type \
and ert.course_name = cat.course_name";
my $sth = $dbh->prepare($sql) or die "Error with sql statement : $DBI::errstr\n";
my $noterror = $sth->execute() or die "Error in sql statement : " . $sth->errstr . "\n";
...
# update the status from queue to done
$sql = "UPDATE eval_report_dup\#prod \
SET status='done' \
WHERE id IN \
( \
select erd.id \
from eval_report_dup\#prod erd \
inner join eval_report er \
on er.id = erd.id \
where erd.status='queue' \
and er.status='done' \
)";
$sth = $dbh->prepare($sql);
$sth->execute();
eval_rep_track_dup has 3 NUMBER, 8 VARCHAR2 and 1 TIMESTAMP column
eval_report_dup has 10 NUMBER, 8 VARCHAR2 and 3 TIMESTAMP columns
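For comparison, here is a minimal sketch of one iteration of the Case 2 loop, using bind placeholders as described above (written as :name; with DBI they could equally be ?). The column list is illustrative rather than the full table definition, and the bind values would be supplied per record by the script:
-- one Case 2 iteration: insert one tracking record, flag the matching report, commit
INSERT INTO eval_rep_track_dup#prod (eval_id, report_type, course_name)
VALUES (:eval_id, :report_type, :course_name);

UPDATE eval_report_dup#prod
SET status = 'done'
WHERE id = :report_id;

COMMIT;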

Hi
Well, if it were up to me I would do the latter method. The principal reason is that if the server/program went down in the middle of processing, you could easily restart the job.
Good luck
pj

Related

BigQuery MERGE unexpected row duplication

I am using a standard SQL MERGE to update a regular target table based on a source external table that is a set of CSV files in a bucket. Here is a simplified input file:
$ gsutil cat gs://dolphin-dev-raw/demo/input/demo_20191125_20200505050505.tsv
"id" "PortfolioCode" "ValuationDate" "load_checksum"
"1" "CIMDI000TT" "2020-03-28" "checksum1"
The MERGE statement is:
MERGE xx_producer_conformed.demo T
USING xx_producer_raw.demo_raw S
ON
S.id = T.id
WHEN NOT MATCHED THEN
INSERT (id, PortfolioCode, ValuationDate, load_checksum, insert_time, file_name, extract_timestamp, wf_id)
VALUES (id, PortfolioCode, ValuationDate, load_checksum, CURRENT_TIMESTAMP(), _FILE_NAME, REGEXP_EXTRACT(_FILE_NAME, '.*_[0-9]{8}_([0-9]{14}).tsv'),'scheduled__2020-08-19T16:24:00+00:00')
WHEN MATCHED AND S.load_checksum != T.load_checksum THEN UPDATE SET
T.id = S.id,
T.PortfolioCode = S.PortfolioCode,
T.ValuationDate = S.ValuationDate,
T.load_checksum = S.load_checksum,
T.file_name = S._FILE_NAME,
T.extract_timestamp = REGEXP_EXTRACT(_FILE_NAME, '.*_[0-9]{8}_([0-9]{14}).tsv'),
T.wf_id = 'scheduled__2020-08-19T16:24:00+00:00'
If I wipe the target table and rerun the MERGE I get a row modified count of 1:
bq query --use_legacy_sql=false --location=asia-east2 "$(cat merge.sql | awk 'ORS=" "')"
Waiting on bqjob_r288f8d33_000001740b413532_1 ... (0s) Current status: DONE
Number of affected rows: 1
This successfully results in the target table updating:
$ bq query --format=csv --max_rows=10 --use_legacy_sql=false "select * from ta_producer_conformed.demo"
Waiting on bqjob_r7f6b6a46_000001740b5057a3_1 ... (0s) Current status: DONE
id,PortfolioCode,ValuationDate,load_checksum,insert_time,file_name,extract_timestamp,wf_id
1,CIMDI000TT,2020-03-28,checksum1,2020-08-20 09:44:20,gs://dolphin-dev-raw/demo/input/demo_20191125_20200505050505.tsv,20200505050505,scheduled__2020-08-19T16:24:00+00:00
If I rerun the MERGE I get a row modified count of 0:
$ bq query --use_legacy_sql=false --location=asia-east2 "$(cat merge.sql | awk 'ORS=" "')"
Waiting on bqjob_r3de2f833_000001740b4161b3_1 ... (0s) Current status: DONE
Number of affected rows: 0
That results in no changes to the target table, so everything is working as expected.
The problem is that when I run the code on a more complex example, with many input files to insert into an empty target table, I end up with rows that have the same id, where count(id) is not equal to count(distinct id):
$ bq query --use_legacy_sql=false --max_rows=999999 --location=asia-east2 "select count(id) as total_records from xx_producer_conformed.xxx; select count(distinct id) as unique_records from xx_producer_conformed.xxx; "
Waiting on bqjob_r5df5bec8_000001740b7dfa50_1 ... (1s) Current status: DONE
select count(id) as total_records from xx_producer_conformed.xxx; -- at [1:1]
+---------------+
| total_records |
+---------------+
| 11582 |
+---------------+
select count(distinct id) as unique_records from xx_producer_conformed.xxx; -- at [1:78]
+----------------+
| unique_records |
+----------------+
| 5722 |
+----------------+
This surprises me, as my expectation was that the underlying logic would step through each line in each underlying file, insert on the first occurrence of an id, and update on any subsequent occurrence. So my expectation is that you cannot have more rows than unique ids in the input bucket.
If I then try to run the MERGE again it fails telling me that there is more than one row in the target table with the same id:
$ bq query --use_legacy_sql=false --location=asia-east2 "$(cat merge.sql | awk 'ORS=" "')"
Waiting on bqjob_r2fe783fc_000001740b8271aa_1 ... (0s) Current status: DONE
Error in query string: Error processing job 'xxxx-10843454-datamesh-
dev:bqjob_r2fe783fc_000001740b8271aa_1': UPDATE/MERGE must match at most one
source row for each target row
I was expecting that there would be no two rows with the same "id" after the MERGE statement does its inserts.
All the tables and queries used are generated from a file that lists the "business columns", so the simple demo example above is identical to the full-scale queries in terms of the logic and joins in the MERGE statement.
Why would the MERGE query above result in rows with duplicated "id" and how do I fix this?
The problem is very easily reproduced by wiping the target table and using a duplicated copy of a relatively large input file as the input:
AAAA_20200805_20200814200000.tsv
AAAA_clone_20200805_20200814200000.tsv
I believe that what is at the heart of this is parallelism. A single large MERGE of many files can spawn many worker threads in parallel. It would be very slow for any two worker threads running in parallel and loading different files to immediately "see" each other's inserts. Rather, I expect that they run independently and write into separate buffers without seeing each other's work. When the buffers are finally combined, this leads to multiple inserts with the same id.
To fix this I am using some CTEs to pick the latest record for any id based on extract_timestamp, using ROW_NUMBER() OVER (PARTITION BY id ORDER BY extract_timestamp DESC). We can then keep only row_num = 1 to pick the latest version of each record. The full query is:
MERGE xx_producer_conformed.demo T
USING (
WITH cteExtractTimestamp AS (
SELECT
id, PortfolioCode, ValuationDate, load_checksum
, _FILE_NAME
, REGEXP_EXTRACT(_FILE_NAME, '.*_[0-9]{8}_([0-9]{14}).tsv') AS extract_timestamp
FROM
xx_producer_raw.demo_raw
),
cteRanked AS (
SELECT
id, PortfolioCode, ValuationDate, load_checksum
, _FILE_NAME
, extract_timestamp
, ROW_NUMBER() OVER (PARTITION BY id ORDER BY extract_timestamp DESC) AS row_num
FROM
cteExtractTimestamp
)
SELECT
id, PortfolioCode, ValuationDate, load_checksum
, _FILE_NAME
, extract_timestamp
, row_num
, "{{ task_instance.xcom_pull(task_ids='get_run_id') }}" AS wf_id
FROM cteRanked
WHERE row_num = 1
) S
ON
S.id = T.id
WHEN NOT MATCHED THEN
INSERT (id, PortfolioCode, ValuationDate, load_checksum, insert_time, file_name, extract_timestamp, wf_id)
VALUES (id, PortfolioCode, ValuationDate, load_checksum, CURRENT_TIMESTAMP(), _FILE_NAME, extract_timestamp, wf_id)
WHEN MATCHED AND S.load_checksum != T.load_checksum THEN UPDATE SET
T.id = S.id,
T.PortfolioCode = S.PortfolioCode,
T.ValuationDate = S.ValuationDate,
T.load_checksum = S.load_checksum,
T.file_name = S._FILE_NAME,
T.extract_timestamp = S.extract_timestamp,
T.wf_id = S.wf_id
This means that cloning a file without changing the extract_timestamp in the filename will pick one of the two rows at random. In normal running we would expect a subsequent extract with updated data to arrive as a source file with a new extract_timestamp, and the above query will then pick the newest record to merge into the target table.
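If a deterministic choice were preferred in that edge case, a tie-breaker could be added to the window ordering (this is an assumption on my part, not part of the fix above), e.g.:
-- break extract_timestamp ties on the file name so reruns pick the same row
ROW_NUMBER() OVER (PARTITION BY id ORDER BY extract_timestamp DESC, _FILE_NAME DESC) AS row_num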

Node reuse, instead of creating new ones?

I'm trying to create an (action)->(state) pair in such a way that:
if the action exists, use it instead of creating a new one
if the state exists, use it instead of creating a new one
and do it in a single query.
The query I have creates a new action node if the state differs from previous calls, so I end up with multiple action nodes that are the same.
query = "merge (:state {id:%s})-[:q {q:%s}]->(:action {id:%s})" % (state, 0, action)
I use RedisGraph.
The only way I have found is to use 3 queries instead of 1 to achieve this:
graph.query mem 'merge (:state {id:9})'
graph.query mem 'merge (:action {id:9})'
graph.query mem 'match (s:state), (a:action) where s.id = 9 and a.id = 9 create (s)-[:q {q:0.3}]->(a)'
At the moment RedisGraph doesn't support mixing the MATCH and MERGE clauses, so you don't have many options besides splitting the query as you did.
One suggestion would be to wrap those three queries within a MULTI/EXEC block:
MULTI
graph.query mem 'merge (:state {id:9})'
graph.query mem 'merge (:action {id:9})'
graph.query mem 'match (s:state), (a:action) where s.id = 9 and a.id = 9 create (s)-[:q {q:0.3}]->(a)'
EXEC
This should speed things up.
We'll update here once MATCH and MERGE can be mixed.

Oracle Query takes ages to execute

I have this below Oracle query. It takes ages to execute.
Select Distinct Z.WH_Source,
substr(Z.L_Y_Month,0,4) || '-' || substr(Z.L_Y_Month,5) Ld_Yr_Mth,
m.model_Name, p.SR, p.PLATE_NO, pp.value, z.CNT_number, z.platform_SR_number,
z.account_name, z.owner_name, z.operator_name, z.jetcare_expiry_date, z.wave,
z.address, z.country, substr(z.CNT_status, 10) ctstatus,
ALLOEM.GET_CNT_TYRE_SR#TNS_GG(z.CNT_number, Z.WH_Source, Z.L_Y_Month,
z.platform_SR_number, '¿')
product_SR_number
From MST.ROLE p
inner join MST.model m on m.model_id = p.model_id
left join MST.ROLEproperty pp on pp.ROLE_id = p.ROLE_id
and pp.property_lookup = 'SSG-WH-ENROLL'
left join alloem.Z_SSG_HM_LOG#TNS_GG z on z.camp_ac_ROLE_id = p.ROLE_id
Where
1 = 1 or z.L_Y_Month = 1
Order By 1, 2 desc, 3,4
If I remove this line,
ALLOEM.GET_CNT_TYRE_SR#TNS_GG(z.CNT_number, Z.WH_Source, Z.L_Y_Month,
z.platform_SR_number, '¿')
it executes very fast. But I can't remove the line. Is there any way to make this query execute faster?
Query tuning is a complex thing. Without table structures, indexes, execution plan or statistics it is very hard to provide one universal answer.
Anyway, I would try scalar subquery caching (if applicable):
ALLOEM.GET_CNT_TYRE_SR#TNS_GG(z.CNT_number, Z.WH_Source, Z.L_Y_Month,
z.platform_SR_number, '¿')
=>
(SELECT ALLOEM.GET_CNT_TYRE_SR#TNS_GG(z.CNT_number, Z.WH_Source,Z.L_Y_Month,
z.platform_SR_number, '¿') FROM dual)
Also, usage of DISTINCT may indicate some problems with normalization. If possible, fix the underlying problem and remove it.
Finally, you should avoid using positional ORDER BY (it is a common anti-pattern).
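For the query above that would mean naming the select-list expressions (or their aliases) explicitly, e.g.:
Order By Z.WH_Source, Ld_Yr_Mth desc, m.model_Name, p.SR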
This:
alloem.Z_SSG_HM_LOG#TNS_GG
suggests that you fetch data over a database link, which is usually slower than fetching data locally. So, if you can afford it, if your query manipulates "static" data (i.e. nothing in the Z_SSG_HM_LOG table changes frequently), and, even if it does, the amount of data isn't very high, consider creating a materialized view (MV) in the schema you're connected to while running that query. You can even create index(es) on the MV, so hopefully everything will run faster without too much effort.
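A rough sketch of that approach (the MV name, refresh options and indexed column are illustrative, not taken from the question):
-- local copy of the remote log table, refreshed on demand
CREATE MATERIALIZED VIEW z_ssg_hm_log_mv
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS SELECT * FROM alloem.Z_SSG_HM_LOG#TNS_GG;

-- index the column used in the join to ROLE
CREATE INDEX z_ssg_hm_log_mv_ix ON z_ssg_hm_log_mv (camp_ac_ROLE_id);
The query would then join z_ssg_hm_log_mv instead of the remote object, and the MV would be refreshed on whatever schedule suits the data.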

sqoop import using free form query

sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--target-dir /axl172930/loudacre/pset1 \
--split-by acct_num \
--query 'SELECT first_name,last_name,acct_num,city,state from accounts a
JOIN (SELECT account_id, count(device_id) as num_of_devices
FROM accountdevice group by account_id
HAVING count(device_id) = 1)d ON a.acct_num = d.account_id
WHERE $CONDITIONS'
The question is as follows: import the first name, last name, account number, city and state of the accounts having exactly 1 device.
accounts and accountdevice are tables. When I used the DISTINCT keyword inside the count function I was getting a different number of records. Which approach is correct for the above question? Please suggest if you can get the answer without using a subquery.
I think the below query should satisfy your requirement (joining accountdevice directly, so no subquery is needed):
SELECT a.first_name, a.last_name, a.acct_num, a.city, a.state, count(d.device_id)
FROM accounts a JOIN accountdevice d ON a.acct_num = d.account_id
GROUP BY a.first_name, a.last_name, a.acct_num, a.city, a.state
HAVING count(d.device_id) = 1;
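Plugged back into the free-form import, the query (keeping sqoop's $CONDITIONS token in the WHERE clause, as in the original command) could look like this:
SELECT a.first_name, a.last_name, a.acct_num, a.city, a.state
FROM accounts a
JOIN accountdevice d ON a.acct_num = d.account_id
WHERE $CONDITIONS
GROUP BY a.first_name, a.last_name, a.acct_num, a.city, a.state
HAVING count(d.device_id) = 1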

Optimize this query that exceeds the resource limit

SELECT DISTINCT
A.IDPRE
,A.IDARTB
,A.TIREGDAT
,B.IDDATE
,B.IDINFO
,C.TIINTRO
FROM
GLHAZQ A
,PRTINFO B
,PRTCON C
WHERE
B.IDARTB = A.IDARTB
AND B.IDPRE = A.IDPRE
AND C.IDPRE = A.IDPRE
AND C.IDARTB = A.IDARTB
AND C.TIINTRO = (
SELECT MIN(TIINTRO)
FROM
PRTCON D
WHERE D.IDPRE = A.IDPRE
AND D.IDARTB = A.IDARTB)
ORDER BY C.TIINTRO
I get the below error when I run this query (DB2):
SQL0495N Estimated processor cost of "000000012093" processor seconds
("000575872000" service units) in cost category "A" exceeds a resource limit error
threshold of "000007000005" service units. SQLSTATE=57051
Please help me to fix this problem
Apparently, the workload manager is doing its job in preventing you from using too many resources. You'll need to tune your query so that its estimated cost is lower than the threshold set by your DBA. You would start by examining the query explain plan as produced by db2exfmt. If you want help, publish the plan here, along with the table and index definitions.
To produce the explain plan, perform the following 3 steps:
Create explain tables by executing db2 -tf $INSTANCE_HOME/sqllib/misc/EXPLAIN.DDL
Generate the plan by executing the explain statement: db2 explain plan for select ...<the rest of your query>
Format the plan: db2exfmt -d <your db name> -1 (note the second parameter is the digit "1", not the letter "l").
To generate the table DDL statements use the db2look utility:
db2look -d <your db name> -o tables.sql -e -t GLHAZQ PRTINFO PRTCON
Although I'm not a DB2 person, I would suspect the query syntax is the same. In your query, you are doing a sub-select based on C.TIINTRO, which can kill performance. You are also querying for all records.
I would start the query by pre-querying the MIN() value, and since you are not even using any other field from the "C" alias, leave it out.
SELECT DISTINCT
A.IDPRE,
A.IDARTB,
A.TIREGDAT,
B.IDDATE,
B.IDINFO,
PreQuery.TIINTRO
FROM
( SELECT D.IDPRE,
D.IDARTB,
MIN(D.TIINTRO) TIINTRO
from
PRTCON D
group by
D.IDPRE,
D.IDARTB ) PreQuery
JOIN GLHAZQ A
ON PreQuery.IDPre = A.IDPRE
AND PreQuery.IDArtB = A.IDArtB
JOIN PRTINFO B
ON PreQuery.IDPre = B.IDPRE
AND PreQuery.IDArtB = B.IDArtB
ORDER BY
PreQuery.TIINTRO
I would ensure you have indexes on:
table      index keys
PRTCON     (IDPRE, IDARTB, TIINTRO)
GLHAZQ     (IDPRE, IDARTB)
PRTINFO    (IDPRE, IDARTB)
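As a sketch, those could be created as follows (index names are illustrative):
CREATE INDEX PRTCON_IX1 ON PRTCON (IDPRE, IDARTB, TIINTRO);
CREATE INDEX GLHAZQ_IX1 ON GLHAZQ (IDPRE, IDARTB);
CREATE INDEX PRTINFO_IX1 ON PRTINFO (IDPRE, IDARTB);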
If you really DO need your "C" table, you could just add it as another JOIN, matching on the same keys as the original correlated subquery:
JOIN PRTCON C
ON PreQuery.IDPre = C.IDPre
AND PreQuery.IDArtB = C.IDArtB
AND PreQuery.TIIntro = C.TIIntro
Given the time involved, you might be better off having a "covering index" with
GLHAZQ table key ( IDPRE, IDARTB, TIREGDAT )
PRTINFO (IDPRE, IDARTB, IDDATE, IDINFO)
This way, the index has all the elements you are returning in the query, rather than having to go back to the actual data pages; it can get the values from the index directly.