Slow aggregates using as-of date - SQL

I have a query that's intended as the base dataset for an AR Aging report in a BI tool. The report has to be able to show AR as of a given date across a several-month range. I have the logic working, but I'm seeing pretty slow performance. Code below:
WITH
DAT AS (
SELECT
MY_DATE AS_OF_DATE
FROM
NS_REPORTS."PUBLIC".NETSUITE_DATE_TABLE
WHERE
CAST(CAST(MY_DATE AS TIMESTAMP) AS DATE) BETWEEN '2020-01-01' AND CAST(CAST(CURRENT_DATE() AS TIMESTAMP) AS DATE)
), INV AS
(
WITH BASE AS
(
SELECT
BAS1.TRANSACTION_ID
, DAT.AS_OF_DATE
, SUM(BAS1.AMOUNT) ORIG_AMOUNT_BASE
FROM
"PUBLIC".BILL_TRANS_LINES_BASE BAS1
CROSS JOIN DAT
WHERE
BAS1.TRANSACTION_TYPE = 'Invoice'
AND BAS1.TRANSACTION_DATE <= DAT.AS_OF_DATE
--AND BAS1.TRANSACTION_ID = 6114380
GROUP BY
BAS1.TRANSACTION_ID
, DAT.AS_OF_DATE
)
, TAX AS
(
SELECT
TRL1.TRANSACTION_ID
, SUM(TRL1.AMOUNT_TAXED * - 1) ORIG_AMOUNT_TAX
FROM
CONNECTORS.NETSUITE.TRANSACTION_LINES TRL1
WHERE
TRL1.AMOUNT_TAXED IS NOT NULL
AND TRL1.TRANSACTION_ID IN (SELECT TRANSACTION_ID FROM BASE)
GROUP BY
TRL1.TRANSACTION_ID
)
SELECT
BASE.TRANSACTION_ID
, BASE.AS_OF_DATE
, BASE.ORIG_AMOUNT_BASE
, COALESCE(TAX.ORIG_AMOUNT_TAX, 0) ORIG_AMOUNT_TAX
FROM
BASE
LEFT JOIN TAX ON TAX.TRANSACTION_ID = BASE.TRANSACTION_ID
)
SELECT
AR.*
, CASE
WHEN AR.DAYS_OUTSTANDING < 0
THEN 'Current'
WHEN AR.DAYS_OUTSTANDING BETWEEN 0 AND 30
THEN '0 - 30'
WHEN AR.DAYS_OUTSTANDING BETWEEN 31 AND 60
THEN '31 - 60'
WHEN AR.DAYS_OUTSTANDING BETWEEN 61 AND 90
THEN '61 - 90'
WHEN AR.DAYS_OUTSTANDING > 90
THEN '91+'
ELSE NULL
END DO_BUCKET
FROM
(
SELECT
AR1.*
, TRA1.TRANSACTION_TYPE
, DATEDIFF('day', AR1.AS_OF_DATE, CAST(CAST(TRA1.DUE_DATE AS TIMESTAMP) AS DATE)) DAYS_OUTSTANDING
, AR1.ORIG_AMOUNT_BASE + AR1.ORIG_AMOUNT_TAX + AR1.PMT_AMOUNT AMOUNT_OUTSTANDING
FROM
(
SELECT
INV.TRANSACTION_ID
, INV.AS_OF_DATE
, INV.ORIG_AMOUNT_BASE
, INV.ORIG_AMOUNT_TAX
, COALESCE(PMT.PMT_AMOUNT, 0) PMT_AMOUNT
FROM
INV
LEFT JOIN (
SELECT
TLK.ORIGINAL_TRANSACTION_ID
, DAT.AS_OF_DATE
, SUM(TLK.AMOUNT_LINKED * - 1) PMT_AMOUNT
FROM
CONNECTORS.NETSUITE."TRANSACTION_LINKS" AS TLK
CROSS JOIN DAT
WHERE
TLK.LINK_TYPE = 'Payment'
AND CAST(CAST(TLK.ORIGINAL_DATE_POSTED AS TIMESTAMP) AS DATE) <= DAT.AS_OF_DATE
GROUP BY
TLK.ORIGINAL_TRANSACTION_ID
, DAT.AS_OF_DATE
) PMT ON PMT.ORIGINAL_TRANSACTION_ID = INV.TRANSACTION_ID
AND PMT.AS_OF_DATE = INV.AS_OF_DATE
) AR1
JOIN CONNECTORS.NETSUITE."TRANSACTIONS" TRA1 ON TRA1.TRANSACTION_ID = AR1.TRANSACTION_ID
)
AR
WHERE
1 = 1
--AND CAST(AMOUNT_OUTSTANDING AS NUMERIC(15, 2)) > 0
AND AS_OF_DATE >= '2020-04-22'
As you can see, I'm using a date table for the as-of date logic. I think this is the best way to do it, but I welcome any suggestions for better practice.
If I run the query for a single as-of date, it takes 1 min 6 sec, and the two main aggregates, on TRANSACTION_LINKS and BILL_TRANS_LINES_BASE, each take about 25% of the processing time. I'm not sure why. If I run it with the filter shown (>= '2020-04-22'), it takes 3 min 33 sec and the aggregates each take about 10% of the processing time; their share is lower only because the ResultWorker spends 63% of the time writing out the much larger result set.
I'm new to Snowflake but not to SQL. My understanding is that Snowflake does not allow manual creation of indexes, but again, I'm happy to be wrong. Please let me know if you have any ideas for improving the performance of this query.
Thanks in advance.
EDIT 1:
Screenshot of most expensive node in query profile

Without seeing the full explain plan and having some sample data to play with, it is difficult to give any definitive answers, but here are a few thoughts, for what they are worth...
The first are more about readability and may not help performance much:
Don't embed CTEs within each other; just define them in the order that they are needed. There is no need to define BASE and TAX within INV
Use CTEs as much as possible. Your main SELECT statement has 2 other SELECT statements embedded within it. It would be much more readable if these were defined using CTEs
Specific performance issues:
Keep data volumes as low as possible for as long as possible. Your CROSS JOINs obviously create cartesian products that massively increase the volume of data - therefore implement them as late in your SQL as possible rather than right at the start as you have done
While it may make your SQL less readable, use as few SQL statements as possible. For example, you should be able to create your INV CTE with a single SELECT statement rather than the 3 statements/CTEs that you are using - see the sketch below
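For example, the whole INV block might be collapsed into something like this - a rough, untested sketch reusing the table and column names from your query, and assuming every line of an invoice carries the same TRANSACTION_DATE so the invoice total can be built before the date expansion:
WITH DAT AS (
    SELECT MY_DATE AS_OF_DATE
    FROM NS_REPORTS."PUBLIC".NETSUITE_DATE_TABLE
    WHERE CAST(MY_DATE AS DATE) BETWEEN '2020-01-01' AND CURRENT_DATE()
), INV AS (
    -- INV built in one statement: invoices and tax are first aggregated to
    -- one row per transaction, and only that much smaller set is expanded
    -- against the date table
    SELECT
        BAS.TRANSACTION_ID
        , DAT.AS_OF_DATE
        , BAS.ORIG_AMOUNT_BASE
        , COALESCE(TAX.ORIG_AMOUNT_TAX, 0) ORIG_AMOUNT_TAX
    FROM (
        SELECT
            TRANSACTION_ID
            , MIN(TRANSACTION_DATE) TRANSACTION_DATE  -- assumes all lines of an invoice share a date
            , SUM(AMOUNT) ORIG_AMOUNT_BASE
        FROM "PUBLIC".BILL_TRANS_LINES_BASE
        WHERE TRANSACTION_TYPE = 'Invoice'
        GROUP BY TRANSACTION_ID
    ) BAS
    JOIN DAT ON BAS.TRANSACTION_DATE <= DAT.AS_OF_DATE
    LEFT JOIN (
        SELECT TRANSACTION_ID, SUM(AMOUNT_TAXED * -1) ORIG_AMOUNT_TAX
        FROM CONNECTORS.NETSUITE.TRANSACTION_LINES
        WHERE AMOUNT_TAXED IS NOT NULL
        GROUP BY TRANSACTION_ID
    ) TAX ON TAX.TRANSACTION_ID = BAS.TRANSACTION_ID
)
SELECT * FROM INV  -- the PMT join and bucketing logic then hang off INV as before
Whether this actually helps will show up in the query profile - it is worth comparing both versions there rather than taking the rewrite on faith.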

Related

SQL for looping the date and summarize data

I need to summarize, by customer, who has a specific product on each date (cdate) from 2018-01-01 until today, where cdate falls between activationdt and next_activationdt. When next_activationdt is empty, that means the product is still in use and has not been cancelled or changed.
For example 2018-1-1, Mary and Alan are using product A.
I could not figure out a good way of running all the dates together.
I need to run the code on SQL workbench pulling data from AWS Redshift database.
You need a way to generate the dates. Perhaps you have a calendar table, or you can use something like this:
with dates as (
      select dateadd(day, row_number() over () - 1, '2018-01-01'::date) as dte
      from t
      limit 365 -- assumes table t has at least 365 rows
     )
select d.dte,
       sum(case when t.product = 'A' then 1 else 0 end) as num_a,
       sum(case when t.product = 'B' then 1 else 0 end) as num_b,
       sum(case when t.product = 'C' then 1 else 0 end) as num_c,
       sum(case when t.product = 'D' then 1 else 0 end) as num_d
from dates d left join
     t
     on d.dte >= t.activationdt and
        (d.dte < t.next_activationdt or t.next_activationdt is null)
group by d.dte
order by d.dte;

Joining Tables on Time, IF NULL edit time by 1 minute

I have two tables.
Table 1 = My Trades
Table 2 = Market Trades
I want to query the market trade 1 minute prior to my trade. If there is no market trade in Table 2 that is 1 minute apart from mine, then I want to look back 2 minutes, and so on, until I have a match.
Right now my query gets me 1 minute apart, but I can't figure out how to get 2 minutes apart if that is NULL, or 3 minutes apart if that is NULL (up to 30 minutes). I think it would be best to use a variable, but I'm not sure of the best way to approach this.
Select
A.Ticker
,a.date_time
,CONVERT(CHAR(16),a.date_time - '00:01',120) AS '1MINCHANGE'
,A.Price
,B.Date_time
,B.Price
FROM
Trade..MyTrade as A
LEFT JOIN Trade..Market as B
on (a.ticker = b.ticker)
and (CONVERT(CHAR(16),a.date_time - '00:01',120) = b.Date_time)
There is no great way to do this in MySQL. But, because your code looks like SQL Server, I'll show that solution here, using APPLY:
select t.Ticker,
       convert(CHAR(16), t.date_time - '00:01', 120) AS '1MINCHANGE',
       t.Price,
       m.Date_time,
       m.Price
from Trade..MyTrade as t outer apply
     (select top 1 m.*
      from Trade..Market m
      where m.ticker = t.ticker and
            convert(CHAR(16), t.date_time - '00:01', 120) >= m.Date_time
      order by m.Date_time desc
     ) m;
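If you really do want to give up after 30 minutes rather than take the most recent earlier trade however old it is, one option is to add a lower bound inside the APPLY - a sketch, untested:
     (select top 1 m.*
      from Trade..Market m
      where m.ticker = t.ticker and
            convert(CHAR(16), t.date_time - '00:01', 120) >= m.Date_time and
            m.Date_time >= dateadd(minute, -30, t.date_time)  -- stop looking back after 30 minutes
      order by m.Date_time desc
     ) m;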

Teradata spool space issue on running a sub query with Count

I am using the query below to calculate business days between two dates for all the order numbers. Business days are already available in the Teradata table Common_WorkingCalendar. But I'm also facing a spool space issue when I execute the query. I have ample space available in my data lab. Need to optimize the query. Appreciate any inputs.
SELECT
tx."OrderNumber",
(SELECT COUNT(1) FROM Common_WorkingCalendar
WHERE CalDate between Cast(tx."TimeStamp" as date) and Cast(mf.ShipDate as date)) as BusDays
from StoreFulfillment ff
inner join StoreTransmission tx
on tx.OrderNumber = ff.OrderNumber
inner join StoreMerchandiseFulfillment mf
on mf.OrderNumber = ff.OrderNumber
This is a very inefficient way to get this count which results in a product join.
The recommended approach is adding a sequential number to your calendar which increases only on business days (calculated using SUM(CASE WHEN businessDay THEN 1 ELSE 0 END) OVER (ORDER BY CalDate ROWS UNBOUNDED PRECEDING)), then it's two joins, for the start date and the end date.
If this calculation is needed a lot, you'd be better off adding a new column; otherwise you can do it on the fly:
WITH cte AS
(
SELECT CalDate,
-- as this table only contains business days you can use this instead
ROW_NUMBER() OVER (ORDER BY CalDate) AS DayNo
FROM Common_WorkingCalendar
)
SELECT
tx."OrderNumber",
to_dt.DayNo - from_dt.DayNo AS BusDays
FROM StoreFulfillment ff
INNER JOIN StoreTransmission tx
ON tx.OrderNumber = ff.OrderNumber
INNER JOIN StoreMerchandiseFulfillment mf
ON mf.OrderNumber = ff.OrderNumber
JOIN cte AS from_dt
ON from_dt.CalDate = Cast(tx."TimeStamp" AS DATE)
JOIN cte AS to_dt
ON to_dt.CalDate = Cast(mf.ShipDate AS DATE)
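If your calendar table contains every date rather than only business days, the running business-day number mentioned above can be built with the SUM(CASE ...) OVER version instead - a sketch, assuming a flag column named IsBusinessDay that is not in the original question:
WITH cte AS
 (
   SELECT CalDate,
          -- running count that only increases on business days
          SUM(CASE WHEN IsBusinessDay = 1 THEN 1 ELSE 0 END)
              OVER (ORDER BY CalDate ROWS UNBOUNDED PRECEDING) AS DayNo
   FROM Common_WorkingCalendar
 )
-- the rest of the query stays the same as above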

very slow oracle select statement

I have a select statement that returns hundreds of thousands of rows, but the execution time is very slow, taking longer than 15 minutes. Is there any way I can improve the execution time of this select statement?
select a.levelP,
a.code,
a.descP,
(select nvl(SUM(amount),0) from ca_glopen where code = a.code and acc_mth = '2016' ) ocf,
(select nvl(SUM(amount),0) from ca_glmaintrx where code = a.code and to_char(doc_date,'yyyy') = '2016' and to_char(doc_date,'yyyymm') < '201601') bcf,
(select nvl(SUM(amount),0) from ca_glmaintrx where jum_amaun > 0 and code = a.code and to_char(doc_date,'yyyymm') = '201601' ) debit,
(select nvl(SUM(amount),0) from ca_glmaintrx where jum_amaun < 0 and code = a.code and to_char(doc_date,'yyyymm') = '201601' ) credit
from ca_chartAcc a
where a.code is not null
order by to_number(a.code), to_number(levelP)
Please help me find a way to speed up my query. TQ
Your primary problem is that most of your subqueries use functions on your search criteria, including some awkward ones on your dates. It's much better to flip that around and explicitly qualify the expected range, by supplying actual dates (a one month range is usually a small percentage of total rows, so this is very likely to hit an index).
SELECT Chart.levelP, Chart.code, Chart.descP,
       COALESCE(GL_Sum.ocf, 0) AS ocf,
       COALESCE(Transactions.bcf, 0) AS bcf,
       COALESCE(Transactions.debit, 0) AS debit,
       COALESCE(Transactions.credit, 0) AS credit
  FROM ca_ChartAcc Chart
  LEFT JOIN (SELECT code, SUM(amount) AS ocf
               FROM ca_GLOpen
              WHERE acc_mth = '2016'
              GROUP BY code) GL_Sum
    ON GL_Sum.code = Chart.code
  LEFT JOIN (SELECT code,
                    SUM(amount) AS bcf,
                    SUM(CASE WHEN amount > 0 THEN amount END) AS debit,
                    SUM(CASE WHEN amount < 0 THEN amount END) AS credit
               FROM ca_GLMainTrx
              WHERE doc_date >= TO_DATE('2016-01-01', 'YYYY-MM-DD')
                AND doc_date <  TO_DATE('2016-02-01', 'YYYY-MM-DD')
              GROUP BY code) Transactions
    ON Transactions.code = Chart.code
 WHERE Chart.code IS NOT NULL
 ORDER BY TO_NUMBER(Chart.code), TO_NUMBER(Chart.levelP)
If you only need a few codes, it may yield better results to push those values into the subqueries as well (although note that the optimizer is likely to perform this for you).
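For instance, if the report only ever covers a handful of codes, the same filter can be repeated inside each derived table - a sketch, where :codes stands for whatever literal list or bind variable you actually use:
  LEFT JOIN (SELECT code, SUM(amount) AS ocf
               FROM ca_GLOpen
              WHERE acc_mth = '2016'
                AND code IN (:codes)   -- same list as the outer filter
              GROUP BY code) GL_Sum
    ON GL_Sum.code = Chart.code
  ...
  WHERE Chart.code IN (:codes)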
It may be possible to remove the calls to TO_NUMBER(...) from the ORDER BY clause; however, this depends on the format of the values, since how they were encoded may change the ordering of results.

Optimizing SQL Server Query (CTE)

This query takes forever to run. Does anybody have any good tips on how I can optimize it?
WITH CTE (Lockindate, Before5, After5) AS (SELECT nl.Lockindate,
(CASE WHEN CAST(RIGHT(FirstLockActivity,8) AS time(1)) <= '17:00' THEN
'Before 5 PM' END) AS before5,
(CASE WHEN CAST(RIGHT(FirstLockActivity,8) AS time(1)) >= '17:00' THEN
'After 5 PM' END) AS after5
FROM netlock nl WITH(NOLOCK)
JOIN rate rs WITH(NOLOCK)
ON nl.id=rs.id
WHERE nl.lockindate BETWEEN '2016-08-01' AND '2016-08-31')
SELECT lockindate, COUNT(After5), COUNT(Before5)
FROM CTE
GROUP BY lockindate
Although it won't speed up things all that much IN THIS CASE (*), the conversions from one datatype to another and then another again are 'not optimal'.
(CASE WHEN CAST(RIGHT(FirstLockActivity,8) AS time(1)) <= '17:00' THEN 'Before 5 PM' END) AS before5,
First of all you do an implicit conversion from a datetime [FirstLockActivity] to a string; the reason is that the Right() function expects a string, hence the conversion.
Doing implicit conversions can be dangerous. Depending on the configuration of your server (and even of your connection which by itself might be influenced by the regional settings of your operating system) this can result in surprising results, not all of which will have the expected last 8 characters!
FYI: have a look here: https://msdn.microsoft.com/en-us/library/ms187928.aspx?f=255&MSPPError=-2147217396 . As you can't explicitly pass the 'style' with CAST I've always suggested to people to rather use Convert(), ESPECIALLY when converting from datetimes to string, and vice versa.
After that you take the right-most 8 characters and then convert these into a time(1). I'm not sure why you want to use a time(1) over a time(0) since you're only interested in the hour part, but in the end it won't make much difference I guess.
Anyway, doing all these conversions takes CPU and thus time.
Assuming this thing isn't too dumbed down from what you really want to do, the target of the query is to return an indication of how many entries there were before and after 5 PM for each lockindate. As such, all you need to do is calculate the hour of each FirstLockActivity and decide from there.
=> DATEPART(hour, xx) will return the required information for a fraction of the CPU cost of those conversions.
Also, you can easily get the before/after in one CASE construction.
WITH CTE (LockinDate, Before5PM)
AS (SELECT Lockindate,
(CASE WHEN DATEPART(hour, FirstLockActivity) < 17 THEN 1 ELSE 0 END) AS Before5PM
FROM netlock nl WITH (NOLOCK)
JOIN rate rs WITH (NOLOCK)
ON nl.id=rs.id
WHERE nl.lockindate BETWEEN Convert(datetime, '2016-08-01', 120) AND Convert(datetime, '2016-08-31', 120))
SELECT LockinDate,
After5 = SUM(1 - Before5PM),
Before5 = SUM(Before5PM)
FROM CTE
GROUP BY LockinDate
Assuming the amount of (relevant) records in the tables is huge, then this will have some effect on the duration, but given the speed of modern processors it probably won't be shocking. Of course, when you're on a busy server and there isn't all that much (free) CPU going around, the effect will be much more noticeable.
That said, as for performance I'd suggest checking the indexes on the netlock and rate tables. Ideally netlock has a clustered index (or PK) on the id field and a non-clustered index on lockindate, and the rate table has a clustered index on the id field (possibly with some additional fields I'm not aware of). If the latter isn't the case, having a non-clustered index on the id field with the FirstLockActivity field as an included column would be great too.
If you want to have this query without the CTE, you can simply copy paste the CTE into a subquery/derived table like this:
SELECT LockinDate,
After5 = SUM(1 - Before5PM),
Before5 = SUM(Before5PM)
FROM (SELECT Lockindate,
(CASE WHEN DATEPART(hour, FirstLockActivity) < 17 THEN 1 ELSE 0 END) AS Before5PM
FROM netlock nl WITH (NOLOCK)
JOIN rate rs WITH (NOLOCK)
ON nl.id=rs.id
WHERE nl.lockindate BETWEEN Convert(datetime, '2016-08-01', 120) AND Convert(datetime, '2016-08-31', 120)) A
GROUP BY LockinDate
Or, worked out a little more you'd get
SELECT LockinDate,
After5 = SUM(1 - (CASE WHEN DATEPART(hour, FirstLockActivity) < 17 THEN 1 ELSE 0 END)),
Before5 = SUM( (CASE WHEN DATEPART(hour, FirstLockActivity) < 17 THEN 1 ELSE 0 END))
FROM netlock nl WITH (NOLOCK)
JOIN rate rs WITH (NOLOCK)
ON nl.id=rs.id
WHERE nl.lockindate BETWEEN Convert(datetime, '2016-08-01', 120) AND Convert(datetime, '2016-08-31', 120)
GROUP BY LockinDate
PS: I wrote all this in the browser; there might be some typos and none of the code is tested, so prepare to fiddle around a bit to get it working =)
PS: if you can't get the query plan, you might still be able to use SET STATISTICS TIME to more easily compare one version of the query to another.
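For example - a minimal sketch:
SET STATISTICS TIME ON;
-- run version 1 of the query here
-- run version 2 of the query here
SET STATISTICS TIME OFF;
-- then compare the 'CPU time' and 'elapsed time' lines in the Messages tab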
(*: In case you do this kind of conversion inside the WHERE clause or a JOIN clause, it will confuse the optimizer and the results could be devastating for the performance of your query)
You don't need a CTE:
SELECT nl.Lockindate,
SUM(CASE WHEN CAST(RIGHT(FirstLockActivity,8) AS time(1)) <= '17:00' THEN 1 ELSE 0 END) AS before5,
SUM(CASE WHEN CAST(RIGHT(FirstLockActivity,8) AS time(1)) > '17:00' THEN 1 ELSE 0 END) AS after5
FROM netlock nl WITH(NOLOCK)
JOIN rate rs WITH(NOLOCK)
ON nl.id=rs.id
WHERE nl.lockindate BETWEEN '2016-08-01' AND '2016-08-31'
GROUP BY nl.lockindate
But this might result in exactly the same plan. Then you need to check why it's slow (missing indexes on the join/group by columns?)
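For example, if those supporting indexes are missing, something along these lines could be tried (illustrative names only; see also the suggestions about clustered indexes and included columns in the other answer):
-- support the date filter / GROUP BY on netlock and cover the columns the query reads
CREATE NONCLUSTERED INDEX IX_netlock_lockindate
    ON netlock (lockindate)
    INCLUDE (id, FirstLockActivity);
-- support the join from rate to netlock
CREATE NONCLUSTERED INDEX IX_rate_id
    ON rate (id);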