I have a daily scheduled job on BigQuery, but it crashes because it runs out of memory. The job collects the most recent record from each of 5 tables, which means I use ROW_NUMBER() OVER (... ORDER BY ...) five times, once per table, and that consumes a lot of memory. Is there an efficient way to fix the error by refactoring the query?
Here's the brief code structure:
CREATE TEMP TABLE main_info AS
WITH orders_1 AS(
select
* except(rnk)
from(
select
*,
ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY update_time DESC) AS rnk
from order_1
)
where rnk = 1
),
orders_2 AS(
select
* except(rnk)
from(
select
*,
ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY update_time DESC) AS rnk
from order_2
)
where rnk = 1
),
orders_3 AS(
select
* except(rnk)
from(
select
*,
ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY update_time DESC) AS rnk
from order_3
)
where rnk = 1
)
SELECT
*
FROM orders_1 o1
LEFT JOIN orders_2 o2
ON o1.order_id = o2.order_id
LEFT JOIN orders_3 o3
ON o1.order_id = o3.order_id
I was hoping to bring memory usage under the limit. From some research, I found two options: replace ROW_NUMBER() OVER (... ORDER BY ...) with ARRAY_AGG() to improve performance, or create a temp table for each source table and then combine them. Is there any better advice?
I'm not sure whether this will solve your problem, but we could definitely use QUALIFY to simplify your CTEs. For example:
SELECT *
FROM order_1
QUALIFY ROW_NUMBER() OVER(order_window) = 1
WINDOW order_window AS (
PARTITION BY order_id
ORDER BY update_time DESC
)
(This also uses a named WINDOW clause for readability.)
It's possible that this will help by eliminating subqueries, but that depends on whether it's already optimised to the same thing behind the scenes.
Other ideas:
Do the left joins give you very different results from inner joins? If so, you could pre-empt this by prefiltering your second and third CTEs to exclude order IDs that are just going to be dropped.
Does it have to be a temporary table you create? Or could you create full tables for each of the CTEs instead and build this in stages?
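The prefiltering idea can be sketched as follows. This is a minimal sketch rather than BigQuery code: it runs the SQL through Python's built-in sqlite3 module (SQLite 3.25 or later is assumed, for window functions), and the two-table schema, the status and carrier columns, and the sample rows are all made up for illustration. The point is only that rows of the second table whose order_id never appears in the first are filtered out before the window function has to sort them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_1 (order_id INTEGER, update_time TEXT, status TEXT);
    CREATE TABLE order_2 (order_id INTEGER, update_time TEXT, carrier TEXT);
    INSERT INTO order_1 VALUES (1, '2024-01-01', 'new'), (1, '2024-01-02', 'paid'),
                               (2, '2024-01-01', 'new');
    -- order 9 never appears in order_1, so the LEFT JOIN would drop it anyway
    INSERT INTO order_2 VALUES (1, '2024-01-01', 'ups'), (1, '2024-01-03', 'dhl'),
                               (9, '2024-01-01', 'ups');
""")

rows = conn.execute("""
    WITH orders_1 AS (
        SELECT order_id, status FROM (
            SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY update_time DESC) AS rnk
            FROM order_1)
        WHERE rnk = 1
    ),
    orders_2 AS (
        SELECT order_id, carrier FROM (
            SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY update_time DESC) AS rnk
            FROM order_2
            -- prefilter: only rank rows that can survive the join
            WHERE order_id IN (SELECT order_id FROM order_1))
        WHERE rnk = 1
    )
    SELECT o1.order_id, o1.status, o2.carrier
    FROM orders_1 o1
    LEFT JOIN orders_2 o2 ON o1.order_id = o2.order_id
    ORDER BY o1.order_id
""").fetchall()

print(rows)  # [(1, 'paid', 'dhl'), (2, 'new', None)]
```

Whether this helps in practice depends on how many rows in the later tables have no match in the first one; if almost every order_id matches, the extra filter adds work without saving any.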
Create separate temp tables and combine those. That should reduce memory utilization compared to the initial query. To release memory promptly, you can drop the temp tables at the appropriate steps. See the split below:
CREATE TEMP TABLE orders_1 AS(
select
* except(rnk)
from(
select
*,
ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY update_time DESC) AS rnk
from order_1
)
where rnk = 1 );
CREATE TEMP TABLE orders_2 AS(
select
* except(rnk)
from(
select
*,
ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY update_time DESC) AS rnk
from order_2
)
where rnk = 1 );
CREATE TEMP TABLE orders_3 AS(
select
* except(rnk)
from(
select
*,
ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY update_time DESC) AS rnk
from order_3
)
where rnk = 1);
CREATE TEMP TABLE main_info AS
SELECT *
FROM orders_1 o1
LEFT JOIN orders_2 o2
ON o1.order_id = o2.order_id
LEFT JOIN orders_3 o3
ON o1.order_id = o3.order_id;
DROP TABLE orders_1;
DROP TABLE orders_2;
DROP TABLE orders_3;
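The staged approach above can be tried end to end with a small harness. This is only a sketch under assumptions: it uses Python's sqlite3 module instead of BigQuery (SQLite 3.25 or later for window functions), two source tables instead of three, and a made-up payload column. Because the deduplication statements differ only in table names, the sketch generates them from one template, which is a convenience of the harness and not something the original answer does.

```python
import sqlite3

# One deduplication statement, parameterised by table name (names are trusted constants).
DEDUP_SQL = """
    CREATE TEMP TABLE {target} AS
    SELECT order_id, update_time, payload FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY update_time DESC) AS rnk
        FROM {source})
    WHERE rnk = 1
"""

conn = sqlite3.connect(":memory:")
for name in ("order_1", "order_2"):
    conn.execute(f"CREATE TABLE {name} (order_id INTEGER, update_time TEXT, payload TEXT)")
conn.executemany("INSERT INTO order_1 VALUES (?, ?, ?)",
                 [(1, "2024-01-01", "a-old"), (1, "2024-01-02", "a-new"), (2, "2024-01-01", "b")])
conn.executemany("INSERT INTO order_2 VALUES (?, ?, ?)", [(1, "2024-01-05", "x")])

# Stage 1: one small statement per source table.
conn.execute(DEDUP_SQL.format(target="orders_1", source="order_1"))
conn.execute(DEDUP_SQL.format(target="orders_2", source="order_2"))

# Stage 2: join the already-deduplicated stages.
conn.execute("""
    CREATE TEMP TABLE main_info AS
    SELECT o1.order_id, o1.payload AS payload_1, o2.payload AS payload_2
    FROM orders_1 o1
    LEFT JOIN orders_2 o2 ON o1.order_id = o2.order_id
""")

# Stage 3: the intermediate stages are no longer needed.
conn.execute("DROP TABLE orders_1")
conn.execute("DROP TABLE orders_2")

main_info = conn.execute("SELECT * FROM main_info ORDER BY order_id").fetchall()
print(main_info)  # [(1, 'a-new', 'x'), (2, 'b', None)]
```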
Related
I have a table called 'sales' in postgres which has a column called 'region'. I am trying to find out a way to delete 90% of records from each 'region' of the same table.
I am using the query below, but it does not work in Postgres. Also, the table does not have a primary/unique key column.
delete from table
( select row_number() over (partition by region) as PAR
from sales
)b
where PAR >=
( select S*0.1 as ninety
from
( select region, count(*) as S
from sales
group by region
)a
and b.region = a.region
Can anyone provide a better solution?
If you have a unique id in the table, you can do:
delete
from t
using (select t.*,
row_number() over (partition by region order by region) as seqnum, -- I always include order by
count(*) over (partition by region) as cnt
from t
) tt
where t.id = tt.id and
tt.seqnum <= 0.9 * tt.cnt; -- <= so that a region of 10 rows loses 9, not 8
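Since the question says the table has no unique key, here is a minimal sketch of the same seqnum/cnt idea using an implicit row identifier instead of an id column. It is written against SQLite through Python's sqlite3 module (SQLite 3.25 or later assumed), where every ordinary table has a rowid; in PostgreSQL the system column ctid could play a similar role, but that variant is not shown or tested here. The amount column and the sample data are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")  # no primary key
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", i) for i in range(10)] + [("west", i) for i in range(20)])

conn.execute("""
    DELETE FROM sales
    WHERE rowid IN (
        SELECT rid FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (PARTITION BY region ORDER BY rowid) AS seqnum,
                   COUNT(*)     OVER (PARTITION BY region)                AS cnt
            FROM sales)
        -- keep the last 10% of each region, delete the rest
        WHERE seqnum <= 0.9 * cnt)
""")

remaining = dict(conn.execute("SELECT region, COUNT(*) FROM sales GROUP BY region"))
print(remaining)  # {'east': 1, 'west': 2}
```

Which 90% gets deleted here is decided by rowid order; if the choice should be random or based on a business column, the ORDER BY inside the window is the place to change it.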
I have the following table:
I want to get the most recent status for each dept_code that a CL_ID has. So the desired output would be this:
I have tried the following, but this gives me just the most recent status for each client and not for each of their dept_codes.
SELECT *
FROM [CIMSHR6_MERGED].[dbo].[C3CLSTAT] C
INNER JOIN
(SELECT CLIENT_NUMBER, MAX(STATUS_DATE) AS SDATE
FROM [CIMSHR6_MERGED].[dbo].[C3CLSTAT]
GROUP BY CLIENT_NUMBER) X
ON X.CLIENT_NUMBER = C.CLIENT_NUMBER
AND X.SDATE = C.STATUS_DATE
ORDER BY C.CLIENT_NUMBER
Any help would be much appreciated. Thanks.
A convenient method that works in SQL Server is:
select top (1) with ties cl.* -- WITH TIES returns every row whose row_number() is 1
from [CIMSHR6_MERGED].[dbo].[C3CLSTAT] cl
order by row_number() over (partition by cl_id, dept_code order by status_date desc);
A method that is efficient with the right indexes in almost any database is:
select cl.*
from [CIMSHR6_MERGED].[dbo].[C3CLSTAT] cl
where cl.status_date = (select max(cl2.status_date)
from [CIMSHR6_MERGED].[dbo].[C3CLSTAT] cl2
where cl2.cl_id = cl.cl_id and cl2.dept_code = cl.dept_code
);
The right index is on (cl_id, dept_code, status_date).
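A small runnable sketch of the correlated-subquery method together with the suggested index. It uses SQLite via Python's sqlite3 module as a stand-in for SQL Server, with a lower-case c3clstat table, an index name (idx_latest), and sample rows that are all invented for illustration; it shows the shape of the query and its result, not any performance measurement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE c3clstat (cl_id INTEGER, status_date TEXT, status TEXT, dept_code TEXT)")
conn.executemany("INSERT INTO c3clstat VALUES (?, ?, ?, ?)", [
    (1, "2020-01-01", "open",   "A"),
    (1, "2020-03-01", "closed", "A"),
    (1, "2020-02-01", "open",   "B"),
    (2, "2020-01-15", "open",   "A"),
])

# Composite index matching the subquery: equality columns first, then the date.
conn.execute("CREATE INDEX idx_latest ON c3clstat (cl_id, dept_code, status_date)")

latest = conn.execute("""
    SELECT cl.cl_id, cl.dept_code, cl.status_date, cl.status
    FROM c3clstat cl
    WHERE cl.status_date = (SELECT MAX(cl2.status_date)
                            FROM c3clstat cl2
                            WHERE cl2.cl_id = cl.cl_id
                              AND cl2.dept_code = cl.dept_code)
    ORDER BY cl.cl_id, cl.dept_code
""").fetchall()

for row in latest:
    print(row)
# (1, 'A', '2020-03-01', 'closed')
# (1, 'B', '2020-02-01', 'open')
# (2, 'A', '2020-01-15', 'open')
```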
I would also use ROW_NUMBER, but with a subquery:
SELECT CL_ID, Status_date, Status, Dept_code
FROM
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY CL_ID, Dept_code ORDER BY Status_date DESC) rn
FROM [CIMSHR6_MERGED].[dbo].[C3CLSTAT]
) t
WHERE rn = 1;
1) First, partition the rows by Dept_Code and CL_ID, and rank the rows within each partition by Status_Date in descending order.
2) Select the rows with rnk = 1, which gives your desired result.
SELECT Z.CL_ID,
Z.Status_Date,
Z.Status,
Z.Dept_Code
FROM
(
SELECT *,
RANK() OVER( PARTITION BY Dept_Code, CL_ID ORDER BY Status_Date DESC ) AS rnk
FROM [CIMSHR6_MERGED].[dbo].[C3CLSTAT]
) Z
WHERE Z.rnk = 1;
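One difference between this RANK() version and the ROW_NUMBER() version is how ties are handled: if two rows in the same partition share the latest Status_Date, RANK() gives both of them rank 1, while ROW_NUMBER() picks exactly one. A minimal sketch of that difference, using SQLite through Python's sqlite3 module (SQLite 3.25 or later assumed) with invented sample rows and a hypothetical helper count_top_rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE c3clstat (cl_id INTEGER, status_date TEXT, status TEXT, dept_code TEXT)")
conn.executemany("INSERT INTO c3clstat VALUES (?, ?, ?, ?)", [
    (1, "2020-03-01", "closed",   "A"),
    (1, "2020-03-01", "reopened", "A"),  # same date as the row above: a tie
    (1, "2020-01-01", "open",     "A"),
])

def count_top_rows(func):
    """Count rows numbered 1 by the given window function (RANK or ROW_NUMBER)."""
    sql = f"""
        SELECT COUNT(*) FROM (
            SELECT {func}() OVER (PARTITION BY dept_code, cl_id
                                  ORDER BY status_date DESC) AS rnk
            FROM c3clstat)
        WHERE rnk = 1"""
    return conn.execute(sql).fetchone()[0]

rank_rows = count_top_rows("RANK")
row_number_rows = count_top_rows("ROW_NUMBER")
print(rank_rows, row_number_rows)  # 2 1
```

If duplicate dates cannot occur in the data, the two functions return the same rows.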
This should work in almost all databases:
select * from c3clstat c
where exists
(select 1 from c3clstat c1
where c1.cl_id=c.cl_id
and c1.dept_code=c.dept_code
group by cl_id,dept_code
having c.status_date=max(c1.status_date)
)
I want to create a temporary table, derived from a query, to use in another subquery so as to simplify the row_number() and partition by condition. The query I have entered is below, but it returns an "invalid identifier" error for t.trlr_num.
with t as
(select distinct
ym.trlr_num,
ym.arrdte,
ri.invnum,
ri.supnum
from rcvinv ri, yms_ymr ym
where ym.trlr_cod='RCV'
and ri.trknum = ym.trlr_num
and ym.wh_id <=50
and ym.trlr_stat in ('C','CI','R','OR')
and ym.arrdte is not null
order by ym.arrdte desc
)
select trlr_number, invnum, supnum
from
(
select
t.trlr_num, t.invnum, t.supnum,
row_number() over (partition by t.trlr_number,t.invnum order by t.arrdte) as rn
from t
)
where rn = 1;
Above, I define t in a WITH clause as a temporary table to be used in the select statement below it, but it seems to error out with an invalid identifier.
This looks like a typo: replace trlr_number with trlr_num and it works.
with t as
(select distinct
ym.trlr_num,
ym.arrdte,
ri.invnum,
ri.supnum
from rcvinv ri, yms_ymr ym
where ym.trlr_cod='RCV'
and ri.trknum = ym.trlr_num
and ym.wh_id <=50
and ym.trlr_stat in ('C','CI','R','OR')
and ym.arrdte is not null
order by ym.arrdte desc
)
select trlr_num, invnum, supnum
from
(
select
t.trlr_num, t.invnum, t.supnum,
row_number() over (partition by t.trlr_num,t.invnum order by t.arrdte) as rn
from t
)
where rn = 1;
You could use multiple subqueries in the WITH clause as separate temporary tables. It would be nice and easy to understand:
WITH t AS
(SELECT DISTINCT ym.trlr_num trlr_num,
ym.arrdte arrdte,
ri.invnum invnum,
ri.supnum supnum
FROM rcvinv ri,
yms_ymr ym
WHERE ym.trlr_cod ='RCV'
AND ri.trknum = ym.trlr_num
AND ym.wh_id <=50
AND ym.trlr_stat IN ('C','CI','R','OR')
AND ym.arrdte IS NOT NULL
),
t1 AS (
SELECT t.trlr_num,
t.arrdte,
t.invnum,
t.supnum,
row_number() OVER (PARTITION BY t.trlr_num, t.invnum ORDER BY t.arrdte) rn -- order by arrdte, as in the question; ordering by the partition columns would make rn arbitrary
FROM t
)
SELECT trlr_num, arrdte, invnum, supnum
FROM t1
WHERE rn = 1;
I think it's easier to show you an image:
So, for each fld_call_id, go to the next value, if it's identical. When we get to the last value, I need the value in column fld_menu_id.
Or, to put it in another way, eliminate fld_call_id duplicates and save only the last one.
You can use ROW_NUMBER:
WITH CTE AS(
SELECT RN = ROW_NUMBER() OVER (PARTITION BY fld_call_id ORDER BY fld_id DESC),
fld_menu_id
FROM dbo.TableName
)
SELECT fld_menu_id FROM CTE WHERE RN = 1
You can create a Rank column and only select that row, something along the lines of the following:
;WITH cte AS
(
SELECT
*
,RANK() OVER (PARTITION BY fld_call_id ORDER BY fld_id DESC) Rnk
FROM YourTable
)
SELECT
*
FROM cte
WHERE Rnk=1
So you PARTITION BY fld_call_id and ORDER BY fld_id in descending order, so that the last value comes first. These are the rows where Rnk=1.
Edit after comments of OP.
SELECT Table.*
FROM Table
INNER JOIN
(
SELECT MAX(fldMenuID) AS fldMenuID,
fldCallID
FROM Table
GROUP BY fldCallID
) maxValues
ON (maxValues.fldMenuID = Table.fldMenuID
AND maxValues.fldCallID= Table.fldCallID)
Hope this works:
SELECT A.*
FROM table A
JOIN (SELECT fld_id,
ROW_NUMBER() OVER (PARTITION BY Fld_call_id ORDER BY fld_id DESC) [Row]
FROM table) LU ON A.fld_id = LU.fld_id
WHERE LU.[Row] = 1
I want to optimize this query:
WITH a as
(SELECT *
,ROW_NUMBER() OVER (PARTITION BY applicationid ORDER BY AgreementStartDate desc) rn
,(select count(*) from RM_TbPackages where d.ApplicationID=ApplicationID) as PackageCount
FROM CM_VwSupplierApplications d)
select * from a
where rn=1
order by a.ApplicationID
As per the comment, there is nothing wrong with the partition. One possible inefficiency is the correlated subquery (select count(*) from RM_TbPackages where d.ApplicationID=ApplicationID). A set-based approach, computing all the counts per application once and then joining to them, should improve performance:
WITH a as
(
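SELECT d.*, -- d.* rather than *, so ApplicationID and PackageCount are not duplicated in the CTE
ROW_NUMBER() OVER (PARTITION BY d.applicationid ORDER BY AgreementStartDate desc) rn,
COALESCE(x.PackageCount, 0) AS PackageCount
FROM CM_VwSupplierApplications d
LEFT JOIN -- keeps applications with no packages, as the original subquery did
(select ApplicationID, count(*) as PackageCount
from RM_TbPackages
group by ApplicationID )x
on x.ApplicationID = d.ApplicationID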
)
select * from a
where rn=1
order by a.ApplicationID;
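One detail worth checking when replacing the correlated count with a join is how applications without any packages are treated: the correlated subquery reports 0 for them, whereas an inner join to the aggregated counts drops them from the result. A minimal sketch of the difference, run in SQLite through Python's sqlite3 module (SQLite 3.25 or later assumed); the applications and packages tables are simplified, made-up stand-ins for CM_VwSupplierApplications and RM_TbPackages:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE applications (ApplicationID INTEGER, AgreementStartDate TEXT);
    CREATE TABLE packages (ApplicationID INTEGER);
    INSERT INTO applications VALUES (1, '2020-01-01'), (1, '2021-01-01'), (2, '2020-06-01');
    INSERT INTO packages VALUES (1), (1), (1);  -- application 2 has no packages
""")

# Same query twice; only the join type changes.
TEMPLATE = """
    WITH a AS (
        SELECT d.ApplicationID, d.AgreementStartDate,
               ROW_NUMBER() OVER (PARTITION BY d.ApplicationID
                                  ORDER BY d.AgreementStartDate DESC) AS rn,
               COALESCE(x.PackageCount, 0) AS PackageCount
        FROM applications d
        {join_type} JOIN (SELECT ApplicationID, COUNT(*) AS PackageCount
                          FROM packages
                          GROUP BY ApplicationID) x
            ON x.ApplicationID = d.ApplicationID
    )
    SELECT ApplicationID, AgreementStartDate, PackageCount
    FROM a
    WHERE rn = 1
    ORDER BY ApplicationID
"""

inner_rows = conn.execute(TEMPLATE.format(join_type="INNER")).fetchall()
left_rows = conn.execute(TEMPLATE.format(join_type="LEFT")).fetchall()

print(inner_rows)  # [(1, '2021-01-01', 3)]
print(left_rows)   # [(1, '2021-01-01', 3), (2, '2020-06-01', 0)]
```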
This query should run faster, since it does not make a subselect for every row in CM_VwSupplierApplications:
;WITH a AS
(
SELECT * ,ROW_NUMBER() OVER (PARTITION BY applicationid ORDER BY AgreementStartDate desc) rn
FROM CM_VwSupplierApplications d
)
SELECT a.*, b.PackageCount
FROM a
OUTER APPLY
( SELECT count(*) PackageCount
FROM RM_TbPackages
WHERE a.ApplicationID=ApplicationID) b -- a, not d: the alias d is not visible outside the CTE
WHERE a.rn=1
ORDER BY a.ApplicationID
To improve it even more, you could consider an index on CM_VwSupplierApplications covering the columns applicationid and AgreementStartDate.