Where do the aggregations in my SQL query's execution plan come from?

I have a simple SQL query with some aggregations. There is no problem with the query itself, but when I look at its execution plan I can't tell where some of the aggregations in the plan come from in the query:
Table:
Query (it contains a string operation, GROUP BY, ORDER BY, and a join; the purpose is to find the reporting periods where the total amount increased by a certain target over the years):
WITH cte
AS (SELECT Year(orderdate) AS yr,
           Month(orderdate) AS mon,
           Ltrim(Rtrim(Str(Year(orderdate)))) + '-'
               + Ltrim(Rtrim(Str(Month(orderdate)))) AS theMonth,
           Sum(totalamount) AS theAmount
    FROM [order]
    GROUP BY Year(orderdate),
             Month(orderdate),
             Ltrim(Rtrim(Str(Year(orderdate)))) + '-'
                 + Ltrim(Rtrim(Str(Month(orderdate)))))
SELECT TOP 3 cte.themonth,
             cte_prev.themonth AS thePrevMonth,
             cte.theamount,
             cte_prev.theamount AS thePrevAmount,
             (cte.theamount - cte_prev.theamount) AS diff
FROM cte
JOIN cte cte_prev
  ON cte.yr = cte_prev.yr + 1
     AND cte.mon = cte_prev.mon
WHERE (cte.theamount - cte_prev.theamount) / cte_prev.theamount > 0.8
ORDER BY (cte.theamount - cte_prev.theamount) / cte_prev.theamount DESC
Execution plan:
I wonder how I can write a better/simpler query to calculate the difference between two reporting periods. Also, the string trimming is really annoying here: why is there no simple single trim, so that I have to combine LTRIM and RTRIM?
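If it helps, here is an untested sketch of a simpler version using LAG instead of the self-join. Note the assumption: LAG picks the previous row within the same month that exists in the data, which matches your join on yr = prev.yr + 1 only when every year is present. CONCAT converts the numbers to strings implicitly, so the Str/Ltrim/Rtrim combination goes away. As for trimming: SQL Server 2017 added a single TRIM() function; on older versions you do need LTRIM(RTRIM(...)).
WITH monthly AS (
    -- one row per (year, month) with the total amount
    SELECT Year(orderdate)  AS yr,
           Month(orderdate) AS mon,
           Sum(totalamount) AS theAmount
    FROM [order]
    GROUP BY Year(orderdate), Month(orderdate)
),
diffs AS (
    SELECT yr, mon, theAmount,
           -- amount for the same month in the previous year present in the data
           Lag(theAmount) OVER (PARTITION BY mon ORDER BY yr) AS prevAmount
    FROM monthly
)
SELECT TOP 3
       Concat(yr, '-', mon)     AS theMonth,
       Concat(yr - 1, '-', mon) AS thePrevMonth,
       theAmount,
       prevAmount               AS thePrevAmount,
       theAmount - prevAmount   AS diff
FROM diffs
WHERE prevAmount IS NOT NULL
  AND (theAmount - prevAmount) / prevAmount > 0.8
ORDER BY (theAmount - prevAmount) / prevAmount DESC;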

Related

Why is the ORDER BY in BigQuery not working?

I am trying to use ORDER BY in BigQuery to sort my results. I want to order the results based on the week number of the year, but it doesn't seem to work, and it doesn't show any kind of syntax error either.
SELECT *
FROM (
  SELECT
    concat(cast(EXTRACT(week FROM elt.event_datetime) as string), ', ', extract(year from elt.event_datetime)) WEEK,
    elt.msg_source SOURCE,
    (elt.source_timedelta_s_ + elt.pipeline_timedelta_s_) Latency
  FROM <table> elt
  JOIN <table1> ai ON elt.msg_id = ai.msg_id
  WHERE ai.report_type <> 'PFR'
    AND EXTRACT(date FROM elt.event_datetime) > extract(date from (date_sub(current_timestamp(), INTERVAL 30 day)))
  ORDER BY WEEK desc
) PIVOT ( AVG(Latency) FOR SOURCE IN ('FLYHT', 'SMTP')) t
Basically, I want my results as they are numbered in green in the image below.
Can someone check what is the issue?
The ORDER BY has to move out of the subquery: an ORDER BY inside a subquery does not guarantee the order of the outer result, so the sort has to happen after the PIVOT:
SELECT *
FROM (
  SELECT
    concat(cast(EXTRACT(week FROM elt.event_datetime) as string), ', ', extract(year from elt.event_datetime)) WEEK,
    elt.msg_source SOURCE,
    (elt.source_timedelta_s_ + elt.pipeline_timedelta_s_) Latency
  FROM <table> elt
  JOIN <table1> ai ON elt.msg_id = ai.msg_id
  WHERE ai.report_type <> 'PFR'
    AND EXTRACT(date FROM elt.event_datetime) > extract(date from (date_sub(current_timestamp(), INTERVAL 30 day)))
)
PIVOT ( AVG(Latency) FOR SOURCE IN ('FLYHT', 'SMTP')) t
ORDER BY (select RIGHT(t.WEEK,4)) desc, (select regexp_substr(t.WEEK,'[^,]+')) desc
as suggested by @Shipra Sarkar in the comments.
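One extra caveat (my own observation, not part of the suggested fix): RIGHT(t.WEEK, 4) and regexp_substr(t.WEEK, '[^,]+') are strings, so they sort lexicographically, and a single-digit week like '9' will sort above '12'. Casting the parsed parts to INT64 gives a true numeric ordering, and the scalar (select ...) wrappers are unnecessary:
ORDER BY CAST(RIGHT(t.WEEK, 4) AS INT64) DESC,
         CAST(regexp_substr(t.WEEK, '[^,]+') AS INT64) DESC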

SQL aggregated subquery - Athena

Using AWS Athena, I want to get the total recovered per day by calculating total recovered amount / total advances.
Here is the code:
SELECT a.advance_date
,sum(a.advance_amount) as "advance_amount"
,sum(a.advance_fee) as "advance_fee"
,(SELECT
sum(credit_recovered+fee_recovered) / (a.advance_amount+a.advance_fee)
FROM ncmxmy.ageing_recovery_raw_parquet
WHERE advance_date = a.advance_date
AND date(recovery_date) <= DATE_ADD('day', 0, a.advance_date)
) as "day_0"
FROM ageing_summary_advance_parquet a
GROUP BY a.advance_date
ORDER BY a.advance_date
I am getting an error
"("sum"((credit_recovered + fee_recovered)) / (a.advance_amount + a.advance_fee))' must be an aggregate expression or appear in GROUP BY clause"
Your division gives the error because the denominator uses individual columns from the ageing_summary_advance_parquet table, which must either be aggregated or appear in the GROUP BY. As I read the query, you need to divide by the grouped sums of the advance_amount and advance_fee columns. In that case, we can compute the two sets grouped by advance_date and join them for the division. Please let me know if this query helps:
WITH cte1 (sum_adv_date, advance_date) as
(SELECT
sum(credit_recovered+fee_recovered) as sum_adv_date, advance_date
FROM ncmxmy.ageing_recovery_raw_parquet
WHERE date(recovery_date) <= DATE_ADD('day', 0, advance_date)
GROUP BY advance_date
),
cte2 (advance_date, advance_amount, advance_fee) as
(SELECT
a.advance_date
,sum(a.advance_amount) as "advance_amount"
,sum(a.advance_fee) as "advance_fee"
FROM ageing_summary_advance_parquet a
GROUP BY a.advance_date
)
SELECT cte2.advance_amount, cte2.advance_fee,
(cte1.sum_adv_date/(cte2.advance_amount+cte2.advance_fee)) as "day_0"
FROM cte1 inner join cte2 on cte1.advance_date = cte2.advance_date
ORDER BY cte1.advance_date

Attempting to calculate absolute change and % change in 1 query

I'm having trouble with the SELECT portion of this query. I can calculate the absolute change just fine, but when I want to also find out the percent change I get lost in all the subqueries. Using BigQuery. Thank you!
SELECT
station_name,
ridership_2013,
ridership_2014,
absolute_change_2014 / ridership_2013 * 100 AS percent_change,
(ridership_2014 - ridership_2013) AS absolute_change_2014,
It will probably be beneficial to organize your query with CTEs and descriptive aliases to make things a bit easier. For example...
with
data as (select * from project.dataset.table),
ridership_by_year as (
select
extract(year from ride_date) as yr,
count(*) as rides
from data
group by 1
),
ridership_by_year_and_station as (
select
extract(year from ride_date) as yr,
station_name,
count(*) as rides
from data
group by 1,2
),
yearly_changes as (
select
this_year.yr,
this_year.rides,
prev_year.rides as prev_year_rides,
this_year.rides - coalesce(prev_year.rides,0) as absolute_change_in_rides,
safe_divide( this_year.rides - coalesce(prev_year.rides), prev_year.rides) as relative_change_in_rides
from ridership_by_year this_year
left join ridership_by_year prev_year on this_year.yr = prev_year.yr + 1
),
yearly_station_changes as (
select
this_year.yr,
this_year.station_name,
this_year.rides,
prev_year.rides as prev_year_rides,
this_year.rides - coalesce(prev_year.rides,0) as absolute_change_in_rides,
safe_divide( this_year.rides - coalesce(prev_year.rides), prev_year.rides) as relative_change_in_rides
from ridership_by_year_and_station this_year
left join ridership_by_year_and_station prev_year
  on this_year.yr = prev_year.yr + 1
  and this_year.station_name = prev_year.station_name
)
select * from yearly_changes
--select * from yearly_station_changes
Yes this is a bit longer, but IMO it is much easier to understand.
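As a side note (my own untested sketch, not part of the answer above): LAG can compute the prior year's rides without the self-join, which would shorten the two change CTEs. This assumes the same ridership_by_year_and_station shape as above, and note that LAG picks the previous year present in the data, which matches the yr = prev_year.yr + 1 join only when years are contiguous:
select
    yr,
    station_name,
    rides,
    -- prior year's rides at the same station; null for the first year
    lag(rides) over (partition by station_name order by yr) as prev_year_rides,
    rides - coalesce(lag(rides) over (partition by station_name order by yr), 0) as absolute_change_in_rides,
    safe_divide(
        rides - lag(rides) over (partition by station_name order by yr),
        lag(rides) over (partition by station_name order by yr)
    ) as relative_change_in_rides
from ridership_by_year_and_station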

Slow Aggregates using as-of date

I have a query that's intended as the base dataset for an AR Aging report in a BI tool. The report has to be able to show AR as of a given date across a several-month range. I have the logic working, but I'm seeing pretty slow performance. Code below:
WITH
DAT AS (
SELECT
MY_DATE AS_OF_DATE
FROM
NS_REPORTS."PUBLIC".NETSUITE_DATE_TABLE
WHERE
CAST(CAST(MY_DATE AS TIMESTAMP) AS DATE) BETWEEN '2020-01-01' AND CAST(CAST(CURRENT_DATE() AS TIMESTAMP) AS DATE)
), INV AS
(
WITH BASE AS
(
SELECT
BAS1.TRANSACTION_ID
, DAT.AS_OF_DATE
, SUM(BAS1.AMOUNT) ORIG_AMOUNT_BASE
FROM
"PUBLIC".BILL_TRANS_LINES_BASE BAS1
CROSS JOIN DAT
WHERE
BAS1.TRANSACTION_TYPE = 'Invoice'
AND BAS1.TRANSACTION_DATE <= DAT.AS_OF_DATE
--AND BAS1.TRANSACTION_ID = 6114380
GROUP BY
BAS1.TRANSACTION_ID
, DAT.AS_OF_DATE
)
, TAX AS
(
SELECT
TRL1.TRANSACTION_ID
, SUM(TRL1.AMOUNT_TAXED * - 1) ORIG_AMOUNT_TAX
FROM
CONNECTORS.NETSUITE.TRANSACTION_LINES TRL1
WHERE
TRL1.AMOUNT_TAXED IS NOT NULL
AND TRL1.TRANSACTION_ID IN (SELECT TRANSACTION_ID FROM BASE)
GROUP BY
TRL1.TRANSACTION_ID
)
SELECT
BASE.TRANSACTION_ID
, BASE.AS_OF_DATE
, BASE.ORIG_AMOUNT_BASE
, COALESCE(TAX.ORIG_AMOUNT_TAX, 0) ORIG_AMOUNT_TAX
FROM
BASE
LEFT JOIN TAX ON TAX.TRANSACTION_ID = BASE.TRANSACTION_ID
)
SELECT
AR.*
, CASE
WHEN AR.DAYS_OUTSTANDING < 0
THEN 'Current'
WHEN AR.DAYS_OUTSTANDING BETWEEN 0 AND 30
THEN '0 - 30'
WHEN AR.DAYS_OUTSTANDING BETWEEN 31 AND 60
THEN '31 - 60'
WHEN AR.DAYS_OUTSTANDING BETWEEN 61 AND 90
THEN '61 - 90'
WHEN AR.DAYS_OUTSTANDING > 90
THEN '91+'
ELSE NULL
END DO_BUCKET
FROM
(
SELECT
AR1.*
, TRA1.TRANSACTION_TYPE
, DATEDIFF('day', AR1.AS_OF_DATE, CAST(CAST(TRA1.DUE_DATE AS TIMESTAMP) AS DATE)) DAYS_OUTSTANDING
, AR1.ORIG_AMOUNT_BASE + AR1.ORIG_AMOUNT_TAX + AR1.PMT_AMOUNT AMOUNT_OUTSTANDING
FROM
(
SELECT
INV.TRANSACTION_ID
, INV.AS_OF_DATE
, INV.ORIG_AMOUNT_BASE
, INV.ORIG_AMOUNT_TAX
, COALESCE(PMT.PMT_AMOUNT, 0) PMT_AMOUNT
FROM
INV
LEFT JOIN (
SELECT
TLK.ORIGINAL_TRANSACTION_ID
, DAT.AS_OF_DATE
, SUM(TLK.AMOUNT_LINKED * - 1) PMT_AMOUNT
FROM
CONNECTORS.NETSUITE."TRANSACTION_LINKS" AS TLK
CROSS JOIN DAT
WHERE
TLK.LINK_TYPE = 'Payment'
AND CAST(CAST(TLK.ORIGINAL_DATE_POSTED AS TIMESTAMP) AS DATE) <= DAT.AS_OF_DATE
GROUP BY
TLK.ORIGINAL_TRANSACTION_ID
, DAT.AS_OF_DATE
) PMT ON PMT.ORIGINAL_TRANSACTION_ID = INV.TRANSACTION_ID
AND PMT.AS_OF_DATE = INV.AS_OF_DATE
) AR1
JOIN CONNECTORS.NETSUITE."TRANSACTIONS" TRA1 ON TRA1.TRANSACTION_ID = AR1.TRANSACTION_ID
)
AR
WHERE
1 = 1
--AND CAST(AMOUNT_OUTSTANDING AS NUMERIC(15, 2)) > 0
AND AS_OF_DATE >= '2020-04-22'
As you can see, I'm using a date table for the as-of date logic. I think this is the best way to do it, but I welcome any suggestions for better practice.
If I run the query with a single as-of date, it takes 1 min 6 sec, and the two main aggregates, on TRANSACTION_LINKS and BILL_TRANS_LINES_BASE, each take about 25% of processing time; I'm not sure why. If I run with the filter shown, >= '2020-04-22', it takes 3 min 33 sec and the aggregates each take about 10% of processing time; their share is lower because the ResultWorker takes 63% of processing time writing out the results, since there are so many rows.
I'm new to Snowflake but not to SQL. My understanding is that Snowflake does not allow manual creation of indexes, but again, I'm happy to be wrong. Please let me know if you have any ideas for improving the performance of this query.
Thanks in advance.
EDIT 1:
Screenshot of most expensive node in query profile
Without seeing the full explain plan and having some sample data to play with, it is difficult to give any definitive answers, but here are a few thoughts, for what they are worth...
The first ones are more about readability and may not help performance much:
Don't embed CTEs within each other; just define them in the order they are needed. There is no need to define BASE and TAX within INV.
Use CTEs as much as possible. Your main SELECT statement has 2 other SELECT statements embedded within it; it would be much more readable if these were defined as CTEs.
Specific performance issues:
Keep data volumes as low as possible for as long as possible. Your CROSS JOINs create cartesian products that massively increase the volume of data, so implement them as late in your SQL as possible rather than right at the start as you have done.
While it may make your SQL less readable, use as few SELECT statements as possible. For example, you should be able to create your INV CTE with a single SELECT statement rather than the 3 statements/CTEs you are using; see the sketch below.
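To make that concrete, here is an untested sketch of the same query flattened into top-level CTEs (table and column names taken from the question; the final bucketing CASE is omitted for brevity). The CROSS JOIN against DAT is inherent to the as-of requirement, but nothing is nested and each aggregate is defined once:
WITH DAT AS (
    SELECT MY_DATE AS_OF_DATE
    FROM NS_REPORTS."PUBLIC".NETSUITE_DATE_TABLE
    WHERE CAST(CAST(MY_DATE AS TIMESTAMP) AS DATE) BETWEEN '2020-01-01' AND CURRENT_DATE()
), BASE AS (
    SELECT BAS1.TRANSACTION_ID, DAT.AS_OF_DATE, SUM(BAS1.AMOUNT) ORIG_AMOUNT_BASE
    FROM "PUBLIC".BILL_TRANS_LINES_BASE BAS1
    CROSS JOIN DAT
    WHERE BAS1.TRANSACTION_TYPE = 'Invoice'
      AND BAS1.TRANSACTION_DATE <= DAT.AS_OF_DATE
    GROUP BY BAS1.TRANSACTION_ID, DAT.AS_OF_DATE
), TAX AS (
    -- the IN (SELECT ... FROM BASE) filter is dropped here; the LEFT JOIN
    -- below already restricts which rows survive
    SELECT TRL1.TRANSACTION_ID, SUM(TRL1.AMOUNT_TAXED * -1) ORIG_AMOUNT_TAX
    FROM CONNECTORS.NETSUITE.TRANSACTION_LINES TRL1
    WHERE TRL1.AMOUNT_TAXED IS NOT NULL
    GROUP BY TRL1.TRANSACTION_ID
), PMT AS (
    SELECT TLK.ORIGINAL_TRANSACTION_ID, DAT.AS_OF_DATE, SUM(TLK.AMOUNT_LINKED * -1) PMT_AMOUNT
    FROM CONNECTORS.NETSUITE."TRANSACTION_LINKS" TLK
    CROSS JOIN DAT
    WHERE TLK.LINK_TYPE = 'Payment'
      AND CAST(CAST(TLK.ORIGINAL_DATE_POSTED AS TIMESTAMP) AS DATE) <= DAT.AS_OF_DATE
    GROUP BY TLK.ORIGINAL_TRANSACTION_ID, DAT.AS_OF_DATE
), AR AS (
    SELECT BASE.TRANSACTION_ID, BASE.AS_OF_DATE, BASE.ORIG_AMOUNT_BASE,
           COALESCE(TAX.ORIG_AMOUNT_TAX, 0) ORIG_AMOUNT_TAX,
           COALESCE(PMT.PMT_AMOUNT, 0) PMT_AMOUNT
    FROM BASE
    LEFT JOIN TAX ON TAX.TRANSACTION_ID = BASE.TRANSACTION_ID
    LEFT JOIN PMT ON PMT.ORIGINAL_TRANSACTION_ID = BASE.TRANSACTION_ID
                 AND PMT.AS_OF_DATE = BASE.AS_OF_DATE
)
SELECT AR.*,
       TRA1.TRANSACTION_TYPE,
       DATEDIFF('day', AR.AS_OF_DATE, CAST(CAST(TRA1.DUE_DATE AS TIMESTAMP) AS DATE)) DAYS_OUTSTANDING,
       AR.ORIG_AMOUNT_BASE + AR.ORIG_AMOUNT_TAX + AR.PMT_AMOUNT AMOUNT_OUTSTANDING
FROM AR
JOIN CONNECTORS.NETSUITE."TRANSACTIONS" TRA1 ON TRA1.TRANSACTION_ID = AR.TRANSACTION_ID
WHERE AR.AS_OF_DATE >= '2020-04-22';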

Filling in missing dates DB2 SQL

My initial query looks like this:
select process_date, count(*) batchCount
from T1.log_comments
group by process_date
order by process_date asc;
I need to be able to do some quick analysis of the weekends that are missing, but I wanted to know if there is a quick way to fill in the dates not present in process_date.
I've seen the solution here but am curious if there's any magic hidden in db2 that could do this with only a minor modification to my original query.
Note: not tested originally; I framed it based on my exposure to SQL Server/Oracle, but I guess it gives you the idea.
*Now amended and tested on DB2.*
WITH MaxDateQry(MaxDate) AS
(
SELECT MAX(process_date) FROM T1.log_comments
),
MinDateQry(MinDate) AS
(
SELECT MIN(process_date) FROM T1.log_comments
),
DatesData(ProcessDate) AS
(
SELECT MinDate from MinDateQry
UNION ALL
SELECT (ProcessDate + 1 DAY) FROM DatesData WHERE ProcessDate < (SELECT MaxDate FROM MaxDateQry)
)
SELECT a.ProcessDate, b.batchCount
FROM DatesData a LEFT JOIN
(
SELECT process_date, COUNT(*) batchCount
FROM T1.log_comments
GROUP BY process_date
) b
ON a.ProcessDate = b.process_date
ORDER BY a.ProcessDate ASC;
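Since the join to b is a LEFT JOIN, dates that have no rows in T1.log_comments come back with a NULL batchCount. If you'd rather see an explicit zero for the filled-in dates, wrap the count in COALESCE:
SELECT a.ProcessDate, COALESCE(b.batchCount, 0) AS batchCount
FROM DatesData a LEFT JOIN
(
SELECT process_date, COUNT(*) batchCount
FROM T1.log_comments
GROUP BY process_date
) b
ON a.ProcessDate = b.process_date
ORDER BY a.ProcessDate ASC;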