Join and aggregate two huge tables efficiently - sql

I have a huge table with over 1 million transaction records. I need to join this table to itself, pull, for each transaction, all similar transactions within the prior 52 weeks, and aggregate them for later use in an ML model.
select distinct a.transref,
       a.transdate, a.transamount,
       a.transtype,
       avg(b.transamount)
         over (partition by a.transref, a.transdate, a.transamount, a.transtype) as avg_trans_amount
from trans_table a
inner join trans_table b
   on a.transtype = b.transtype
  and b.transdate >= dateadd(week, -52, a.transdate)
  and b.transdate <= a.transdate
  and a.transdate between '2021-11-16' and '2022-11-16'
The transaction table looks like this:
+--------+----------+-----------+---------+
|transref|transdate |transamount|transtype|
+--------+----------+-----------+---------+
|xh123rdk|2022-11-16|112.48     |food & Re|
|g8jegf90|2022-11-04|23.79      |Misc     |
|ulpef32p|2022-10-23|83.15      |gasoline |
+--------+----------+-----------+---------+
and the expected output should look like this:
+--------+----------+-----------+---------+----------------+
|transref|transdate |transamount|transtype|avg_trans_amount|
+--------+----------+-----------+---------+----------------+
|xh123rdk|2022-11-16|112.48     |food & Re|180.11          |
|g8jegf90|2022-11-04|23.79      |Misc     |43.03           |
|ulpef32p|2022-10-23|83.15      |gasoline |112.62          |
+--------+----------+-----------+---------+----------------+
Since each transaction may pull over 10,000 records of the same type, the query is very slow and expensive to run, and SQL Server failed to create the output table.
How can I optimize this query to run efficiently within a reasonable time?
Note: after the query failed, I ended up writing a stored procedure that splits table a into smaller chunks, joins each chunk to the big table, aggregates the results, appends them to an output table, and repeats until all of table a is covered. That got the job done, but it was still slow. I expect there are better ways to do this in SQL without all the manual work.
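For reference, here is a minimal sketch of the chunked workaround described in the note above, assuming a simple WHILE loop over one week of driving rows at a time; the trans_output table and the @batch_days size are illustrative names, not part of the original setup:

DECLARE @start date = '2021-11-16',
        @stop  date = '2022-11-16',
        @batch_days int = 7;   -- process one week of table "a" per iteration

WHILE @start <= @stop
BEGIN
    -- same join/aggregation as above, restricted to one chunk of driving rows
    INSERT INTO trans_output (transref, transdate, transamount, transtype, avg_trans_amount)
    SELECT DISTINCT a.transref, a.transdate, a.transamount, a.transtype,
           AVG(b.transamount) OVER (PARTITION BY a.transref, a.transdate,
                                                 a.transamount, a.transtype) AS avg_trans_amount
    FROM trans_table a
    INNER JOIN trans_table b
           ON  b.transtype  = a.transtype
          AND b.transdate >= DATEADD(week, -52, a.transdate)
          AND b.transdate <= a.transdate
    WHERE a.transdate >= @start
      AND a.transdate <  DATEADD(day, @batch_days, @start);

    SET @start = DATEADD(day, @batch_days, @start);
END

Each iteration keeps the intermediate result small, but trans_table is still scanned once per chunk, which is why the overall run stays slow.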

OK, I think I figured out what was causing the query to run so slowly: the trick is to avoid repetitive and unnecessary calculations by doing a GROUP BY first, before doing the join.
with merch as (
    select transtype,
           dateadd(week, -52, transdate) as startdate,
           transdate as enddate
    from trans_table
    group by transtype, transdate),
summary as (
    select distinct m.transtype,
           m.startdate, m.enddate,
           avg(t.transamount) over (partition by
               m.transtype, m.startdate, m.enddate) as avg_amt,
           percentile_cont(0.5) within group (order by t.transamount) over (partition by
               m.transtype, m.startdate, m.enddate) as median_amt
    from merch as m
    inner join trans_table as t
        on m.transtype = t.transtype
       and t.transdate between m.startdate and m.enddate)
select t.*, s.avg_amt, s.median_amt
from trans_table t
inner join summary s
    on t.transtype = s.transtype
   and t.transdate = s.enddate
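Beyond restructuring the query, the self-join's cost also depends on how trans_table is indexed. As a hedged suggestion (the index name, and the assumption that no similar index already exists, are mine), a covering index on the join columns lets the b side seek on transtype plus the 52-week date range instead of scanning:

-- Hypothetical supporting index; adjust to the real table definition.
CREATE NONCLUSTERED INDEX IX_trans_table_type_date
    ON trans_table (transtype, transdate)
    INCLUDE (transamount);

Whether it actually helps should be confirmed against the execution plan.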

Related

Two Table Join Into One Result

I have two tables whose results I am attempting to join into one. I am trying to get INV_QPC (the case pack size) into the results; SEIITN and SKU are the same product numbers.
The code below gives two separate result sets, but the goal is to fold the bottom result into the main output, using the join as the lookup that shows the case pack size for each SKU.
INV_QPC = case pack size
SKU = SKU/Product Number
SEIITN = SKU/Product Number
Thanks for looking.
SELECT
ORDER_QTY, SKU, INVOICE_NUMBER, CUSTOMER_NUMBER, ROUTE,
ALLOCATED_QTY, SHORTED_QTY, PRODUCTION_DATE,
DATEPART(wk, PRODUCTION_DATE) AS FISCAL_WEEK,
YEAR(PRODUCTION_DATE) AS FISCAL_YEAR,
CONCAT(SKU, CUSTOMER_NUMBER) AS SKU_STORE_WEEK
FROM
[database].[dbo].[ORDERS]
WHERE
[PRODUCTION_DATE] >= DATEADD(day, -3, GETDATE())
AND [PRODUCTION_DATE] <= GETDATE()
SELECT INV_QPC
FROM [database].[dbo].[PRODUCT_MASTER]
JOIN [database].[dbo].[ORDERS] ON ORDERS.SKU = PRODUCT_MASTER.SEIITN;
It looks like you are on the right track, but your second SQL statement is only returning the INV_QPC column, so it is not being joined to the first query. Here is an updated SQL statement that should give you the result you are looking for:
SELECT
ORD.ORDER_QTY, ORD.SKU, ORD.INVOICE_NUMBER, ORD.CUSTOMER_NUMBER, ORD.ROUTE,
ORD.ALLOCATED_QTY, ORD.SHORTED_QTY, ORD.PRODUCTION_DATE,
DATEPART(wk, ORD.PRODUCTION_DATE) AS FISCAL_WEEK,
YEAR(ORD.PRODUCTION_DATE) AS FISCAL_YEAR,
CONCAT(ORD.SKU, ORD.CUSTOMER_NUMBER) AS SKU_STORE_WEEK,
PROD.INV_QPC
FROM
[database].[dbo].[ORDERS] ORD
JOIN [database].[dbo].[PRODUCT_MASTER] PROD ON ORD.SKU = PROD.SEIITN
WHERE
ORD.PRODUCTION_DATE >= DATEADD(day, -3, GETDATE())
AND ORD.PRODUCTION_DATE <= GETDATE()
In this query, I have added the INV_QPC column to the SELECT statement, and also included the join condition in the JOIN clause. Additionally, I have given aliases to the tables in the FROM and JOIN clauses to make the query easier to read. Finally, I have updated the WHERE clause to reference the ORD alias instead of the table name directly.
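One follow-up design note, based on an assumption about the data rather than anything stated in the question: if some SKUs in ORDERS have no matching row in PRODUCT_MASTER, the inner JOIN above drops those order rows. A LEFT JOIN keeps them and returns NULL for INV_QPC:

-- Same query shape, but ORDERS rows survive even without a PRODUCT_MASTER match
-- (remaining SELECT columns omitted for brevity).
SELECT
    ORD.ORDER_QTY, ORD.SKU, ORD.CUSTOMER_NUMBER,
    PROD.INV_QPC    -- NULL when the SKU has no PRODUCT_MASTER row
FROM
    [database].[dbo].[ORDERS] ORD
    LEFT JOIN [database].[dbo].[PRODUCT_MASTER] PROD ON ORD.SKU = PROD.SEIITN
WHERE
    ORD.PRODUCTION_DATE >= DATEADD(day, -3, GETDATE())
    AND ORD.PRODUCTION_DATE <= GETDATE()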

Slow Aggregates using as-of date

I have a query that's intended as the base dataset for an AR Aging report in a BI tool. The report has to be able to show AR as of a given date across a several-month range. I have the logic working, but I'm seeing pretty slow performance. Code below:
WITH
DAT AS (
SELECT
MY_DATE AS_OF_DATE
FROM
NS_REPORTS."PUBLIC".NETSUITE_DATE_TABLE
WHERE
CAST(CAST(MY_DATE AS TIMESTAMP) AS DATE) BETWEEN '2020-01-01' AND CAST(CAST(CURRENT_DATE() AS TIMESTAMP) AS DATE)
), INV AS
(
WITH BASE AS
(
SELECT
BAS1.TRANSACTION_ID
, DAT.AS_OF_DATE
, SUM(BAS1.AMOUNT) ORIG_AMOUNT_BASE
FROM
"PUBLIC".BILL_TRANS_LINES_BASE BAS1
CROSS JOIN DAT
WHERE
BAS1.TRANSACTION_TYPE = 'Invoice'
AND BAS1.TRANSACTION_DATE <= DAT.AS_OF_DATE
--AND BAS1.TRANSACTION_ID = 6114380
GROUP BY
BAS1.TRANSACTION_ID
, DAT.AS_OF_DATE
)
, TAX AS
(
SELECT
TRL1.TRANSACTION_ID
, SUM(TRL1.AMOUNT_TAXED * - 1) ORIG_AMOUNT_TAX
FROM
CONNECTORS.NETSUITE.TRANSACTION_LINES TRL1
WHERE
TRL1.AMOUNT_TAXED IS NOT NULL
AND TRL1.TRANSACTION_ID IN (SELECT TRANSACTION_ID FROM BASE)
GROUP BY
TRL1.TRANSACTION_ID
)
SELECT
BASE.TRANSACTION_ID
, BASE.AS_OF_DATE
, BASE.ORIG_AMOUNT_BASE
, COALESCE(TAX.ORIG_AMOUNT_TAX, 0) ORIG_AMOUNT_TAX
FROM
BASE
LEFT JOIN TAX ON TAX.TRANSACTION_ID = BASE.TRANSACTION_ID
)
SELECT
AR.*
, CASE
WHEN AR.DAYS_OUTSTANDING < 0
THEN 'Current'
WHEN AR.DAYS_OUTSTANDING BETWEEN 0 AND 30
THEN '0 - 30'
WHEN AR.DAYS_OUTSTANDING BETWEEN 31 AND 60
THEN '31 - 60'
WHEN AR.DAYS_OUTSTANDING BETWEEN 61 AND 90
THEN '61 - 90'
WHEN AR.DAYS_OUTSTANDING > 90
THEN '91+'
ELSE NULL
END DO_BUCKET
FROM
(
SELECT
AR1.*
, TRA1.TRANSACTION_TYPE
, DATEDIFF('day', AR1.AS_OF_DATE, CAST(CAST(TRA1.DUE_DATE AS TIMESTAMP) AS DATE)) DAYS_OUTSTANDING
, AR1.ORIG_AMOUNT_BASE + AR1.ORIG_AMOUNT_TAX + AR1.PMT_AMOUNT AMOUNT_OUTSTANDING
FROM
(
SELECT
INV.TRANSACTION_ID
, INV.AS_OF_DATE
, INV.ORIG_AMOUNT_BASE
, INV.ORIG_AMOUNT_TAX
, COALESCE(PMT.PMT_AMOUNT, 0) PMT_AMOUNT
FROM
INV
LEFT JOIN (
SELECT
TLK.ORIGINAL_TRANSACTION_ID
, DAT.AS_OF_DATE
, SUM(TLK.AMOUNT_LINKED * - 1) PMT_AMOUNT
FROM
CONNECTORS.NETSUITE."TRANSACTION_LINKS" AS TLK
CROSS JOIN DAT
WHERE
TLK.LINK_TYPE = 'Payment'
AND CAST(CAST(TLK.ORIGINAL_DATE_POSTED AS TIMESTAMP) AS DATE) <= DAT.AS_OF_DATE
GROUP BY
TLK.ORIGINAL_TRANSACTION_ID
, DAT.AS_OF_DATE
) PMT ON PMT.ORIGINAL_TRANSACTION_ID = INV.TRANSACTION_ID
AND PMT.AS_OF_DATE = INV.AS_OF_DATE
) AR1
JOIN CONNECTORS.NETSUITE."TRANSACTIONS" TRA1 ON TRA1.TRANSACTION_ID = AR1.TRANSACTION_ID
)
AR
WHERE
1 = 1
--AND CAST(AMOUNT_OUTSTANDING AS NUMERIC(15, 2)) > 0
AND AS_OF_DATE >= '2020-04-22'
As you can see, I'm using a date table for the as-of date logic. I think this is the best way to do it, but I welcome any suggestions for better practice.
If I run the query with a single as-of date, it takes 1 min 6 sec, and the two main aggregates, on TRANSACTION_LINKS and BILL_TRANS_LINES_BASE, each take about 25% of processing time; I'm not sure why. If I run it with the filter shown (>= '2020-04-22'), it takes 3 min 33 sec and the aggregates each take about 10% of processing time; their share is lower because the ResultWorker takes 63% of processing time writing the results, since there are so many rows.
I'm new to Snowflake but not to SQL. My understanding is that Snowflake does not allow manual creation of indexes, but again, I'm happy to be wrong. Please let me know if you have any ideas for improving the performance of this query.
Thanks in advance.
EDIT 1:
Screenshot of the most expensive node in the query profile
Without seeing the full explain plan and having some sample data to play with, it is difficult to give any definitive answers, but here are a few thoughts, for what they are worth.
The first two are more about readability and may not help performance much:
- Don't embed CTEs within each other; just define them in the order that they are needed. There is no need to define BASE and TAX within INV.
- Use CTEs as much as possible. Your main SELECT statement has two other SELECT statements embedded within it; it would be much more readable if these were defined as CTEs.
Specific performance issues:
- Keep data volumes as low as possible for as long as possible. Your CROSS JOINs create cartesian products that massively increase the volume of data, so apply them as late in your SQL as possible rather than right at the start as you have done.
- While it may make your SQL less readable, use as few SQL statements as possible. For example, you should be able to build your INV CTE with a single SELECT statement rather than the three statements/CTEs you are using (a sketch follows below).
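To make that last point concrete, here is a minimal sketch of the flattened layout, using the same tables as the question; the tax aggregation is done once without the as-of-date dimension, so only the invoice totals get multiplied by the date table. It is an outline under those assumptions, not a drop-in replacement:

-- Sketch: every CTE defined once at the top level, in dependency order.
WITH DAT AS (
    SELECT MY_DATE AS AS_OF_DATE
    FROM NS_REPORTS."PUBLIC".NETSUITE_DATE_TABLE
    WHERE CAST(MY_DATE AS DATE) BETWEEN '2020-01-01' AND CURRENT_DATE()
), TAX AS (
    -- tax totals per transaction, aggregated once, with no as-of-date dimension
    SELECT TRANSACTION_ID, SUM(AMOUNT_TAXED * -1) AS ORIG_AMOUNT_TAX
    FROM CONNECTORS.NETSUITE.TRANSACTION_LINES
    WHERE AMOUNT_TAXED IS NOT NULL
    GROUP BY TRANSACTION_ID
), INV AS (
    -- invoice base amounts per transaction and as-of date in a single SELECT
    SELECT BAS1.TRANSACTION_ID,
           DAT.AS_OF_DATE,
           SUM(BAS1.AMOUNT) AS ORIG_AMOUNT_BASE
    FROM "PUBLIC".BILL_TRANS_LINES_BASE BAS1
    JOIN DAT ON BAS1.TRANSACTION_DATE <= DAT.AS_OF_DATE
    WHERE BAS1.TRANSACTION_TYPE = 'Invoice'
    GROUP BY BAS1.TRANSACTION_ID, DAT.AS_OF_DATE
)
SELECT INV.TRANSACTION_ID,
       INV.AS_OF_DATE,
       INV.ORIG_AMOUNT_BASE,
       COALESCE(TAX.ORIG_AMOUNT_TAX, 0) AS ORIG_AMOUNT_TAX
FROM INV
LEFT JOIN TAX ON TAX.TRANSACTION_ID = INV.TRANSACTION_ID;

The payment aggregate and the aging buckets would hang off this the same way, as further top-level CTEs rather than nested subqueries.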

Other alternatives to achieve LIMIT in SQL

I have created an SQL query to get certain data with a LIMIT so I can use it in a DataTable. The data has 76,288 rows in total.
SELECT TransDate, AgentName, OfficeCode, year, ControlNumber,
ContainerNumber, BookingNumber, SealNumber, VesselName, ShippingLine, ShippingDate
FROM (
SELECT a.TransDate, a.AgentName, a.OfficeCode, DATEPART(YEAR, a.TransDate) AS year,
a.ControlNumber, b.ContainerNumber, b.BookingNumber,
b.SealNumber, b.VesselName, b.ShippingLine, b.ShippingDate,
ROW_NUMBER() OVER (ORDER BY a.TransDate) R
FROM Cargo_Transactions a
JOIN Cargo_Vessels b ON a.ControlNumber = b.ControlNumber
LEFT OUTER JOIN [Routes] c ON a.RouteID = c.RouteID
WHERE
a.TransDate IS NOT NULL
AND a.TransDate <= GETDATE()
AND DATEPART(YEAR, a.TransDate) = '2018'
) as f WHERE R BETWEEN 0 and 100
ORDER BY TransDate ASC;
The 0 and 100 bounds are held in variables that change when a pagination link is clicked.
The first hundred or so pages load okay, but when I click the last page it breaks with a timeout exceeded error. Also, the DataTable's search function is not working the way it should.
Example: I searched for dino in the DataTable; it says there are 95 matching records but shows only 1, since the query only returns rows between 0 and 10.
SELECT TransDate, AgentName, OfficeCode, year, ControlNumber, ContainerNumber,
BookingNumber, SealNumber, VesselName, ShippingLine, ShippingDate
FROM (
SELECT a.TransDate, a.AgentName, a.OfficeCode, DATEPART(YEAR, a.TransDate) AS year,
a.ControlNumber, b.ContainerNumber, b.BookingNumber, b.SealNumber,
b.VesselName, b.ShippingLine, b.ShippingDate,
ROW_NUMBER() OVER (ORDER BY a.TransDate) R
FROM Cargo_Transactions a
JOIN Cargo_Vessels b ON a.ControlNumber = b.ControlNumber
LEFT OUTER JOIN [Routes] c ON a.RouteID = c.RouteID
WHERE
a.TransDate IS NOT NULL
AND a.TransDate <= GETDATE()
AND DATEPART(YEAR, a.TransDate) = '2018'
) as f WHERE R BETWEEN 0 and 10 AND AgentName LIKE '%dino%'
ORDER BY TransDate ASC;
I also tried TOP with EXCEPT, as in SELECT TOP 0... EXCEPT SELECT TOP 100..., but it only shows 9 rows.
UPDATE:
I was able to make the search work by including the WHERE clause in the subquery. My only problem now is the ORDER BY: it only applies within the current page (rows 1 - 10), not across all the data.
Any alternatives? Your help is highly appreciated. Thanks!
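For reference, here is a sketch of the pattern the update describes, written with OFFSET/FETCH (SQL Server 2012+) instead of a ROW_NUMBER subquery; @Offset, @PageSize, and @Search are illustrative stand-ins for the values the DataTable passes in. Because the search filter sits next to the date filters, paging, ordering, and the total count all see the same filtered set:

DECLARE @Offset int = 0, @PageSize int = 10, @Search nvarchar(100) = N'dino';

SELECT a.TransDate, a.AgentName, a.OfficeCode, DATEPART(YEAR, a.TransDate) AS [year],
       a.ControlNumber, b.ContainerNumber, b.BookingNumber,
       b.SealNumber, b.VesselName, b.ShippingLine, b.ShippingDate
FROM Cargo_Transactions a
JOIN Cargo_Vessels b ON a.ControlNumber = b.ControlNumber
LEFT OUTER JOIN [Routes] c ON a.RouteID = c.RouteID
WHERE a.TransDate IS NOT NULL
  AND a.TransDate <= GETDATE()
  AND DATEPART(YEAR, a.TransDate) = '2018'
  AND (@Search = N'' OR a.AgentName LIKE '%' + @Search + '%')
ORDER BY a.TransDate ASC
OFFSET @Offset ROWS FETCH NEXT @PageSize ROWS ONLY;

A separate SELECT COUNT(*) with the same WHERE clause gives the DataTable the total it needs for its pager.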

Teradata spool space issue on running a sub query with Count

I am using the query below to calculate business days between two dates for all the order numbers. Business days are already available in the Teradata table Common_WorkingCalendar, but I'm hitting a spool space issue when I execute the query, even though I have ample space available in my data lab. I need to optimize the query; I'd appreciate any input.
SELECT
tx."OrderNumber",
(SELECT COUNT(1) FROM Common_WorkingCalendar
WHERE CalDate between Cast(tx."TimeStamp" as date) and Cast(mf.ShipDate as date)) as BusDays
from StoreFulfillment ff
inner join StoreTransmission tx
on tx.OrderNumber = ff.OrderNumber
inner join StoreMerchandiseFulfillment mf
on mf.OrderNumber = ff.OrderNumber
This is a very inefficient way to get this count; the correlated subquery results in a product join.
The recommended approach is to add a sequential number to your calendar that increases only on business days (calculated with SUM(CASE WHEN businessDay THEN 1 ELSE 0 END) OVER (ORDER BY CalDate ROWS UNBOUNDED PRECEDING)); then it is just two joins, one for the start date and one for the end date.
If this calculation is needed a lot, you are better off adding a new column; otherwise you can do it on the fly:
WITH cte AS
(
SELECT CalDate,
-- as this table only contains business days you can use this instead
ROW_NUMBER() OVER (ORDER BY CalDate) AS DayNo
FROM Common_WorkingCalendar
)
SELECT
tx."OrderNumber",
to_dt.DayNo - from_dt.DayNo AS BusDays
FROM StoreFulfillment ff
INNER JOIN StoreTransmission tx
ON tx.OrderNumber = ff.OrderNumber
INNER JOIN StoreMerchandiseFulfillment mf
ON mf.OrderNumber = ff.OrderNumber
JOIN cte AS from_dt
ON from_dt.CalDate = Cast(tx."TimeStamp" AS DATE)
JOIN cte AS to_dt
ON to_dt.CalDate = Cast(mf.ShipDate AS DATE)
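If the business-day count is needed often, the "add a new column" option mentioned above can be materialized once instead of recomputed on every run. A minimal sketch, assuming a new helper table is acceptable (the BusDayCalendar name is made up):

-- One-time build of a calendar with a persisted business-day sequence number.
CREATE TABLE BusDayCalendar AS
(
    SELECT CalDate,
           ROW_NUMBER() OVER (ORDER BY CalDate) AS DayNo
    FROM Common_WorkingCalendar
) WITH DATA;

The main query then joins to BusDayCalendar twice, exactly as in the CTE version, without paying for the window function each time.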

Using a date field for matching SQL Query

I'm having a bit of an issue wrapping my head around the logic of this changing dimension. I would like to associate the two tables below: I need to match the Cost - Period fact table to the cost dimension based on the Id and the effective date.
As you can see, if a period's month and year are later than the effective date of its associated Cost dimension row, the period should adopt that row's value. Once a new effective date is entered into the dimension, that value should be used for any period after that date going forward.
EDIT: I apologize for the lack of detail, but the Cost dimension will actually have a unique index value, and the changing fields to reference for the matching are Resource, Project, and Cost. I tried to adapt the query you provided to my fields, but I'm getting incorrect output.
FYI: Naming convention change: EngagementId is Id, Resource is ConsultantId, and Project is ProjectId
I've changed the images below, and here is my query:
,_cte(HoursWorked, HoursBilled, Month, Year, EngagementId, ConsultantId, ConsultantName, ProjectId, ProjectName, ProjectRetainer, RoleId, Role, Rate, ConsultantRetainer, Salary, amount, EffectiveDate)
as
(
select sum(t.Duration), 0, Month(t.StartDate), Year(t.StartDate), t.EngagementId, c.ConsultantId, c.ConsultantName, c.ProjectId, c.ProjectName, c.ProjectRetainer, c.RoleId, c.Role, c.Rate, c.ConsultantRetainer,
c.Salary, 0, c.EffectiveDate
from timesheet t
left join Engagement c on t.EngagementId = c.EngagementId and Month(c.EffectiveDate) = Month(t.EndDate) and Year(c.EffectiveDate) = Year(t.EndDate)
group by Month(t.StartDate), Year(t.StartDate), t.EngagementId, c.ConsultantName, c.ConsultantId, c.ProjectId, c.ProjectName, c.ProjectRetainer, c.RoleId, c.Role, c.Rate, c.ConsultantRetainer,
c.Salary, c.EffectiveDate
)
select * from _cte where EffectiveDate is not null
union
select _cte.HoursWorked, _cte.HoursBilled, _cte.Month, _cte.Year, _cte.EngagementId, _cte.ConsultantId, _cte.ConsultantName, _cte.ProjectId, _Cte.ProjectName, _cte.ProjectRetainer, _cte.RoleId, _cte.Role, sub.Rate, _cte.ConsultantRetainer,_cte.Salary, _cte.amount, sub.EffectiveDate
from _cte
outer apply (
select top 1 EffectiveDate, Rate
from Engagement e
where e.ConsultantId = _cte.ConsultantId and e.ProjectId = _cte.ProjectId and e.RoleId = _cte.RoleId
and Month(e.EffectiveDate) < _cte.Month and Year(e.EffectiveDate) < _cte.Year
order by EffectiveDate desc
) sub
where _cte.EffectiveDate is null
Example: (screenshot of the desired matching omitted)
I'm struggling with writing the query that goes along with this. At first I attempted to partition by greatest date. However, when I executed the join I got the highest effective date for every single period (even those prior to the effective date).
Is this something that can be accomplished in a query or should I be focusing on incremental updates of the destination table so that any effective date / time period in the past is left alone?
Any tips would be great!
Thanks,
Channing
Try this one:
; with _CTE as(
select p.* , c.EffectiveDate, c.Cost
from period p
left join CostDimension c on p.id = c.id and p.Month = DATEPART(month, c.EffectiveDate) and p.Year = DATEPART(year, c.EffectiveDate)
)
select * from _CTE Where EffectiveDate is not null
Union
select _CTE.id, _CTE.Month, _CTE.Year, sub.EffectiveDate, sub.Cost
from _CTE
outer apply (select top 1 EffectiveDate, Cost
from CostDimension as cd
where cd.Id = _CTE.id and cd.EffectiveDate < DATETIMEFROMPARTS(_CTE.Year, _CTE.Month, 1, 0, 0, 0, 0)
order by EffectiveDate desc
) sub
where _CTE.EffectiveDate is null