Oracle sampling of data in parallel - sql

I have X million records in a table TABLE_A and want to process these records one by one.
How can I divide the population equally among 10 instances of the same PL/SQL script so they can process in parallel?
See the query below:
SELECT CBR.CUSTOMER_ID, CBR.NAME, CBR.DEPT_NAME
FROM
(
    SELECT CUSTOMER_ID, NAME, HOME_TELNO, DEPT_NAME, ROWNUM AS RNUM
    FROM TABLE_A
    ORDER BY CUSTOMER_ID ASC
) CBR
WHERE CBR.RNUM < :sqli_end_rownum AND CBR.RNUM >= :sqli_start_rownum;
The values are incremented in each iteration of a loop: in the next iteration, sqli_start_rownum becomes the previous sqli_end_rownum.
This query is taking a long time. Does anyone have a better way to do this?

You could look into DBMS_PARALLEL_EXECUTE:
http://docs.oracle.com/cd/E11882_01/appdev.112/e40758/d_parallel_ex.htm#ARPLS67331
For example:
https://oracle-base.com/articles/11g/dbms_parallel_execute_11gR2
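As a minimal sketch of how it could fit this case (assuming 11gR2 or newer, and a worker procedure process_customer_range that does the actual per-range work; the task name and chunk size are placeholders to adapt):
DECLARE
  l_task VARCHAR2(30) := 'process_table_a';
BEGIN
  DBMS_PARALLEL_EXECUTE.CREATE_TASK(task_name => l_task);

  -- Split TABLE_A into rowid chunks of roughly 10000 rows each.
  DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_ROWID(
    task_name   => l_task,
    table_owner => USER,
    table_name  => 'TABLE_A',
    by_row      => TRUE,
    chunk_size  => 10000);

  -- Run 10 parallel jobs; the package binds each chunk's rowid
  -- range to :start_id and :end_id.
  DBMS_PARALLEL_EXECUTE.RUN_TASK(
    task_name      => l_task,
    sql_stmt       => 'BEGIN process_customer_range(:start_id, :end_id); END;',
    language_flag  => DBMS_SQL.NATIVE,
    parallel_level => 10);

  DBMS_PARALLEL_EXECUTE.DROP_TASK(l_task);
END;
/
The package takes care of chunking, restarting failed chunks, and tracking status, which is exactly the bookkeeping the manual approach below forces you to do yourself.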
The poor man's version of this is basically to run a query to generate ranges of rowids. You can then access the rows in the table within a given range.
Step 1: create the number of "buckets" you want to divide the table into and get a range of rowids for each bucket. Here's an 8-bucket example:
select bucket_num, min(rid) as start_rowid, max(rid) as end_rowid, count(*)
from (select rowid as rid
           , ntile(8) over (order by rowid) as bucket_num
      from table_a
     )
group by bucket_num
order by bucket_num;
You'd get an output that looks like this (I'm using 12c - rowids may look different in 11g):
BUCKET_NUM  START_ROWID         END_ROWID           COUNT(*)
         1  AABetTAAIAAB8GCAAA  AABetTAAIAAB8u5AAl     82792
         2  AABetTAAIAAB8u5AAm  AABetTAAIAAB9RrABi     82792
         3  AABetTAAIAAB9RrABj  AABetTAAIAAB96vAAU     82792
         4  AABetTAAIAAB96vAAV  AABetTAAIAAB+gKAAs     82792
         5  AABetTAAIAAB+gKAAt  AABetTAAIAAB+/vABv     82792
         6  AABetTAAIAAB+/vABw  AABetTAAIAAB/hbAB1     82791
         7  AABetTAAIAAB/hbAB2  AABetTAAIAACARDABf     82791
         8  AABetTAAIAACARDABg  AABetTAAIAACBGnABq     82791
(The sum of the counts will be the total number of rows in the table at the time of the query.)
Step 2: grab a set of rows from the table for a given range:
SELECT <whatever you need>
FROM <table>
WHERE rowid BETWEEN 'AABetTAAIAAB8GCAAA' and 'AABetTAAIAAB8u5AAl'
...
Step 3: repeat Step 2 for each of the ranges.
So instead of this:
SELECT CBR.CUSTOMER_ID, CBR.NAME, CBR.DEPT_NAME
FROM
(
    SELECT CUSTOMER_ID, NAME, HOME_TELNO, DEPT_NAME, ROWNUM AS RNUM
    FROM TABLE_A
    ORDER BY CUSTOMER_ID ASC
) CBR
WHERE CBR.RNUM < :sqli_end_rownum AND CBR.RNUM >= :sqli_start_rownum;
you'll just have this:
SELECT CBR.CUSTOMER_ID, CBR.NAME, CBR.DEPT_NAME
FROM table_a
WHERE rowid BETWEEN :start_rowid and :end_rowid
You can use this to run the same job in parallel, but you'll need a separate session for each run (e.g. multiple SQL*Plus sessions). You can also use something like DBMS_JOB/DBMS_SCHEDULER to launch background jobs.
(Note: always be aware of whether your table is being updated between the time the buckets are calculated and the time you access the rows, as you can miss rows.)
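To launch each bucket's run as a background job rather than a manual SQL*Plus session, a minimal DBMS_SCHEDULER sketch (the worker procedure process_rowid_range is hypothetical; the rowid literals are bucket 1 from the sample output above):
BEGIN
  -- One job per bucket; the rowid bounds come from the Step 1 query.
  DBMS_SCHEDULER.CREATE_JOB(
    job_name   => 'process_bucket_1',
    job_type   => 'PLSQL_BLOCK',
    job_action => 'BEGIN process_rowid_range(''AABetTAAIAAB8GCAAA'',
                                             ''AABetTAAIAAB8u5AAl''); END;',
    enabled    => TRUE);  -- starts immediately in a background session
END;
/
Repeating this for the other buckets (e.g. in a loop over the Step 1 results) gives you the 8 parallel workers without opening sessions by hand.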

Related

Athena pagination and performance issue

I have a huge data set in S3 and I am trying to query it using AWS Athena. The 3 parameters below are the input for my query:
marketplaceId
startIndex
endIndex
But it took 16 seconds to query just 50 records. (I am using Python to query the data from Athena --> S3.)
What am I doing wrong here? And is the way I implemented pagination right or not?
The SQL query I am executing:
SELECT
    dataset_date,
    marketplace_id,
    gl_product_group,
    gl_product_group_desc,
    browse_node_id,
    browse_node_name,
    root_browse_node_id,
    browse_root_name,
    wt_xref_id,
    gt_xref_id,
    node_path,
    total_count_of_asins,
    buyable_asin_count,
    glance_view_count_t12m,
    ord_cnt,
    price_p50,
    price_p90,
    price_p100,
    row_num
FROM
(
    SELECT
        dataset_date,
        marketplace_id,
        gl_product_group,
        gl_product_group_desc,
        browse_node_id,
        browse_node_name,
        root_browse_node_id,
        browse_root_name,
        wt_xref_id,
        gt_xref_id,
        node_path,
        total_count_of_asins,
        buyable_asin_count,
        glance_view_count_t12m,
        ord_cnt,
        price_p50,
        price_p90,
        price_p100,
        row_number() over (
            order by
                browse_node_id,
                gl_product_group,
                glance_view_count_t12m desc
        ) as row_num
    from
    (
        select *
        from category_info
        where marketplace_id = '<marketplaceId>'
    )
)
WHERE
    row_num between '<startIndex>' and '<endIndex>';
Update
After debugging the issue with timestamps, I found that one query takes 6 seconds, and I am running two queries:
1st - to get the data (the query mentioned above).
2nd - to get the count of the total number of rows in my table.
That's why it's taking 12-16 seconds overall.
So is there any way to get the total number of rows without the second query (select count(*) from category_info)?
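One option, assuming the Presto engine behind Athena (which supports standard window aggregates), is to compute the total in the same pass with an unpartitioned count(*) window, so the second query can be dropped; the total_rows alias here is made up:
SELECT browse_node_id,
       row_number() OVER (
           ORDER BY browse_node_id, gl_product_group, glance_view_count_t12m DESC
       ) AS row_num,
       count(*) OVER () AS total_rows   -- same value repeated on every row
FROM category_info
WHERE marketplace_id = '<marketplaceId>';
Every returned row then carries both its position and the total, at the cost of the window being evaluated over the full filtered set.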

How to find neighboring records in the SQL table in terms of month and year?

Please help me to optimize my SQL query.
I have a table with the fields date, commodity_id, exp_month_id, exp_year, price, where the first 4 fields form the primary key. The months are designated with alphabet-ordered letters: e.g. F (for Jan), G (for Feb), H (for March), etc. Thus the letter of a month more distant from Jan is larger than the letter of a less distant month (F < G < H < ...). Some commodity_ids have all 12 months in the table, some only 5 or 3; the set of months is constant across years.
I need to calculate the difference between prices (the gradient) of neighboring records in terms of (exp_month_id, exp_year). As a first step, I want to define for every pair (exp_month_id, exp_year) the valid pair (next_month_id, next_year). The main problem here is that if the current exp_month_id is the last in the year, then next_year = exp_year + 1 and next_month_id should be the first one in the year.
I have written the following query to do the job:
WITH trading_months AS (
    SELECT DISTINCT commodity_id,
                    exp_month_id
    FROM futures
    ORDER BY exp_month_id
)
SELECT DISTINCT f.commodity_id,
       f.exp_month_id,
       f.exp_year,
       (
           WITH [temp] AS (
               SELECT exp_month_id
               FROM trading_months
               WHERE commodity_id = f.commodity_id
           )
           SELECT exp_month_id
           FROM [temp]
           WHERE exp_month_id > f.exp_month_id
           UNION ALL
           SELECT exp_month_id
           FROM [temp]
           LIMIT 1
       ) AS next_month_id,
       (
           SELECT CASE WHEN EXISTS (
                    SELECT commodity_id,
                           exp_month_id
                    FROM trading_months
                    WHERE commodity_id = f.commodity_id AND
                          exp_month_id > f.exp_month_id
                    LIMIT 1
                  )
                  THEN f.exp_year ELSE f.exp_year + 1 END
       ) AS next_year
FROM futures AS f
This query serves as a base for a dynamic table (view) which is subsequently used for calculating the gradient. However, the execution of this query takes more than one second, and thus the whole process takes minutes. I wonder if you could help me optimize the query.
Note: the following requires SQLite 3.25 or newer for window function support.
Lack of sample data (preferably as CREATE TABLE and INSERT statements, for easy importing) and expected results makes this hard to test, but if your end goal is computing the difference in prices between expiration dates (making your question a bit of an XY problem), maybe something like:
SELECT date, commodity_id, price, exp_year, exp_month_id
, price - lag(price, 1) OVER (PARTITION BY commodity_id ORDER BY exp_year, exp_month_id) AS "change from last price"
FROM futures;
Thanks to @Shawn's hint to use window functions, I could rewrite the query in a much shorter form:
CREATE VIEW "futures_nextmonths_win" AS
WITH trading_months AS (
SELECT DISTINCT commodity_id,
exp_month_id,
exp_year
FROM futures)
SELECT commodity_id,
exp_month_id,
exp_year,
lead(exp_month_id) OVER w AS next_month_id,
lead(exp_year) OVER w AS next_year
FROM trading_months
WINDOW w AS (PARTITION BY commodity_id ORDER BY exp_year, exp_month_id);
which is also slightly faster than the original one.
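With the same window machinery, the gradient itself can be sketched directly (assuming the gradient is the price difference between neighboring expirations of the same commodity quoted on the same date):
SELECT date, commodity_id, exp_year, exp_month_id,
       lead(price) OVER w - price AS gradient
FROM futures
WINDOW w AS (PARTITION BY commodity_id, date
             ORDER BY exp_year, exp_month_id);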

Splitting table PK values into roughly same-size ranges

I have a table in Postgres with about half a million rows and an integer primary key.
I'd like to split its entire PK space into N ranges of approximately same size for independent processing. How do I best do it?
I could do it by fetching all PK values to the client and remembering every N-th value, but that does a full scan and a fetch of all the values, while I want no more than N+1 of them.
I could also select the min and max values and cut that range into equal parts, but if the PKs are not distributed evenly, this may give me ranges of seriously different sizes.
I want the ranges for index-based access later on, so modulo-based tricks do not apply.
Is there any nice SQL-based solution that does not involve fetching all the keys to the client? Writing an N-specific query, e.g. with N clauses, is fine.
An example:
IDs in a range, say, from 1234 to 567890, N = 4.
I'd like to get 3 boundary numbers, say 127123, 254789, 379860, so that there are approximately 125k records in each of the ID ranges [1234, 127123], [127123, 254789], [254789, 379860], [379860, 567890].
Update:
I've come up with a solution like this:
select
percentile_disc(0.25) within group (order by c.id) over() as pct_25
,percentile_disc(0.50) within group (order by c.id) over() as pct_50
,percentile_disc(0.75) within group (order by c.id) over() as pct_75
from customer c
limit 1
;
It does a decent job of giving me the exact range boundaries, and runs in only a few seconds, which is fine for my purposes.
What bothers me is that I have to add the limit 1 clause to get just one row. Without it, I receive identical rows, one per record in the table. Is there a better way to get just one row of the percentiles?
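One way, assuming PostgreSQL 9.4 or newer, is to drop the OVER clause and use percentile_disc as a plain ordered-set aggregate: an aggregate query without GROUP BY collapses to exactly one row, so no LIMIT is needed:
select percentile_disc(0.25) within group (order by c.id) as pct_25,
       percentile_disc(0.50) within group (order by c.id) as pct_50,
       percentile_disc(0.75) within group (order by c.id) as pct_75
from customer c;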
I think you can use row_number() for this purpose. Something like this:
select t.*,
floor((seqnum * N) / cnt) as range
from (select t.*,
row_number() over (order by pk) - 1 as seqnum,
count(*) over () as cnt
from t
) t;
This assumes that by range you mean ranges of pk values. You can also move the range expression to a where clause to select just one particular range.
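For example, a single worker responsible for one of N = 4 buckets might (as a sketch on the same idea, with t and pk standing in for your table and key) select just its share:
select s.*
from (select t.*,
             floor((row_number() over (order by pk) - 1) * 4.0
                   / count(*) over ()) as bucket
      from t
     ) s
where s.bucket = 2;   -- buckets are numbered 0 .. 3
Note this still ranks the whole table server-side per worker; it avoids the client-side fetch, not the scan.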

SQL - after GROUP BY I need to take rows with the newest date

I need to write a query in SQL and I can't get it right. I have a table with 7 columns: 1st_num, 2nd_num, 3rd_num, opening_Date, Amount, code, cancel_Flag.
For every (1st_num, 2nd_num, 3rd_num) I want to take only the record with the minimum cancel_flag, and if there is more than one such row, the one with the newest opening_Date.
But when I do a GROUP BY and choose MIN and MAX for the relevant fields, I get a mix of the rows. For example, from
1. 12,130,45678,2015-01-01,2005,333,0
2. 12,130,45678,2015-01-09,105,313,0
the result will be
12,130,45678,2015-01-09,2005,333,0
which mixes the two rows into one.
Microsoft SQL Server 2008, using SSIS in Visual Studio 2008.
My code is:
SELECT
    "1st_num",
    "2nd_num",
    "3rd_num",
    MAX(opening_date),
    MAX(Amount),
    code,
    MIN(cancel_flag)
FROM dbo.tablename
GROUP BY
    "1st_num",
    "2nd_num",
    "3rd_num",
    code
HAVING COUNT(*) > 1
How do I take the row with the max date or min cancel flag as-is, without mixing values?
I can't really post my actual code for security reasons, but I'm sure you can help.
Thank you,
Oren
It is very difficult to answer like this, because every DBMS has a different syntax.
Anyway, for most DBMSs this should work: use the row_number() function to rank the rows and take only the first one per group by your definition (all your conditions). Note that identifiers starting with a digit, like 1st_num, have to be quoted:
SELECT * FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY t."1st_num", t."2nd_num", t."3rd_num"
                              ORDER BY t.cancel_flag ASC, t.opening_date DESC) AS row_num
    FROM YourTable t
) AS tableTempName
WHERE row_num = 1
Use NOT EXISTS to return a row as long as no other row with the same 1st_num, 2nd_num, 3rd_num has a lower cancel_flag value, or the same cancel_flag but a later opening_Date.
select *
from tablename t1
where not exists (select 1 from tablename t2
                  where t2."1st_num" = t1."1st_num"
                    and t2."2nd_num" = t1."2nd_num"
                    and t2."3rd_num" = t1."3rd_num"
                    and (t2.cancel_flag < t1.cancel_flag
                         or (t2.cancel_flag = t1.cancel_flag and
                             t2.opening_Date > t1.opening_Date)))
Core ANSI SQL-99, expected to work with (almost) any dbms.
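On SQL Server specifically, another alternative worth sketching is TOP (1) WITH TIES ordered by the same ROW_NUMBER ranking used above, which returns the per-group winners in a single statement:
SELECT TOP (1) WITH TIES *
FROM tablename
ORDER BY ROW_NUMBER() OVER (
             PARTITION BY "1st_num", "2nd_num", "3rd_num"
             ORDER BY cancel_flag ASC, opening_Date DESC);
WITH TIES keeps every row whose ranking value equals the lowest one, i.e. the number-1 row of each (1st_num, 2nd_num, 3rd_num) group.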

Access 2013 - Query not returning correct Number of Results

I am trying to get the query below to return the TWO lowest PlayedTo results for each PlayerID.
select
x1.PlayerID, x1.RoundID, x1.PlayedTo
from P_7to8Calcs as x1
where
(
select count(*)
from P_7to8Calcs as x2
where x2.PlayerID = x1.PlayerID
and x2.PlayedTo <= x1.PlayedTo
) <3
order by PlayerID, PlayedTo, RoundID;
Unfortunately at the moment it doesn't return a result when there is a tie for one of the lowest scores. A copy of the dataset and code is here http://sqlfiddle.com/#!3/4a9fc/13.
PlayerID 47 has only one result returned, as there are two different RoundIDs tied for the second lowest PlayedTo. For what I am trying to calculate it doesn't matter which of the two it returns, as I just need to know the number, but for reporting I ideally need the one with the newest date.
One other slight problem with the query is the time it takes to run. It takes about 2 minutes in Access to run through the 83 records but it will need to run on about 1000 records when the database is fully up and running.
Any help will be much appreciated.
Resolve the tie by adding DatePlayed to your internal sorting (you wanted the one with the newest date anyway):
select x1.PlayerID, x1.RoundID, x1.PlayedTo
from P_7to8Calcs as x1
where
(
    select count(*)
    from P_7to8Calcs as x2
    where x2.PlayerID = x1.PlayerID
      and (x2.PlayedTo < x1.PlayedTo
           or x2.PlayedTo = x1.PlayedTo
              and x2.DatePlayed >= x1.DatePlayed
          )
) < 3
order by PlayerID, PlayedTo, RoundID;
For performance, create an index supporting the correlated subquery. Something like:
create index P_7to8Calcs__PlayerID_PlayedTo on P_7to8Calcs(PlayerID, PlayedTo);
Note: I used your SQLFiddle for testing, as I do not have Access available here.
Edit: in case the index does not improve performance enough, you might want to try the following query, which uses window functions and avoids the nested sub-query. It works in your SQLFiddle, but I am not sure whether this is supported by Access.
select x1.PlayerID, x1.RoundID, x1.PlayedTo
from (
select PlayerID, RoundID, PlayedTo
, RANK() OVER (PARTITION BY PlayerId ORDER BY PlayedTo, DatePlayed DESC) AS Rank
from P_7to8Calcs
) as x1
where x1.RANK < 3
order by PlayerID, PlayedTo, RoundID;
See OVER clause and Ranking Functions for documentation.