Synapse serverless TPC-H Query15 wrong syntax - azure-synapse

I am trying the TPC-H queries on Synapse serverless. They all worked fine except number 15, where supplier_no was not recognized. Do you know how to rewrite it? The only change I made across all the queries was to replace LIMIT with TOP.
SELECT
--Query15
s_suppkey,
s_name,
s_address,
s_phone,
total_revenue
FROM
supplier,
(
SELECT
l_suppkey AS supplier_no,
SUM(l_extendedprice * (1 - l_discount)) AS total_revenue
FROM
lineitem
WHERE
l_shipdate >= CAST('1996-01-01' AS date)
AND l_shipdate < CAST('1996-04-01' AS date)
GROUP BY
supplier_no
) revenue0
WHERE
s_suppkey = supplier_no
AND total_revenue = (
SELECT
MAX(total_revenue)
FROM
(
SELECT
l_suppkey AS supplier_no,
SUM(l_extendedprice * (1 - l_discount)) AS total_revenue
FROM
lineitem
WHERE
l_shipdate >= CAST('1996-01-01' AS date)
AND l_shipdate < CAST('1996-04-01' AS date)
GROUP BY
supplier_no
) revenue1
)
ORDER BY
s_suppkey;

If you are getting the following errors, make sure the GROUP BY refers to the source column name (l_suppkey), not the alias (supplier_no):
Msg 207, Level 16, State 1, Line 1 Invalid column name 'supplier_no'.
Msg 164, Level 15, State 1, Line 1 Each GROUP BY expression must contain at least one column that is not an outer reference.
A full statement which has been tested against a dedicated SQL pool in Azure Synapse Analytics:
SELECT
--Query15
s_suppkey,
s_name,
s_address,
s_phone,
total_revenue
FROM
supplier,
(
SELECT
l_suppkey AS supplier_no,
SUM(l_extendedprice * (1 - l_discount)) AS total_revenue
FROM
lineitem
WHERE
l_shipdate >= CAST('1996-01-01' AS date)
AND l_shipdate < CAST('1996-04-01' AS date)
GROUP BY
l_suppkey
) revenue0
WHERE
s_suppkey = supplier_no
AND total_revenue = (
SELECT
MAX(total_revenue)
FROM
(
SELECT
l_suppkey AS supplier_no,
SUM(l_extendedprice * (1 - l_discount)) AS total_revenue
FROM
lineitem
WHERE
l_shipdate >= CAST('1996-01-01' AS date)
AND l_shipdate < CAST('1996-04-01' AS date)
GROUP BY
l_suppkey
) revenue1
)
ORDER BY
s_suppkey;
NB: SQL Server lets you refer to a column alias in the ORDER BY clause but not in GROUP BY.
Regarding the related discussion on performance in Azure Synapse serverless SQL pools:
Just for fun, I repartitioned my TPC-H SF10 dbo.lineitem table by l_shipdate and added the filepath() metadata function to filter on. That got the warm query down to 1 second (7 seconds on the first run), so some caching did seem to be in play.
I realise you have not had to do these very query-specific optimisations for the other platforms but I wanted to see if it was possible to improve the performance.
I suppose Q14 is to test specific transformation rules in the respective db engines:
The query:
;WITH cte AS
(
SELECT
l_suppkey,
SUM(l_extendedprice * (1 - l_discount)) AS total_revenue
FROM OPENROWSET(
BULK 'enriched/tpch/tpch10/lineitem_partitioned/*/*.parquet',
DATA_SOURCE = 'MyDataSource',
FORMAT = 'PARQUET'
) x
WHERE x.filepath(1) = 1996
AND l_shipdate >= CAST('1996-01-01' AS DATE)
AND l_shipdate < CAST('1996-04-01' AS DATE) -- half-open range as in the original query; BETWEEN would also include 1996-04-01
GROUP BY l_suppkey
)
SELECT
s.s_suppkey,
s.s_name,
s.s_address,
s.s_phone,
c.total_revenue
FROM ext.supplier s
INNER JOIN cte c ON s.s_suppkey = c.l_suppkey
WHERE total_revenue = ( SELECT MAX(total_revenue) FROM cte );

Related

How to get Postgres to return 0 for empty rows

I have a query which gets data summarised between two dates, like so:
SELECT date(created_at),
COUNT(COALESCE(id, 0)) AS total_orders,
SUM(COALESCE(total_price, 0)) AS total_price,
SUM(COALESCE(taxes, 0)) AS taxes,
SUM(COALESCE(shipping, 0)) AS shipping,
AVG(COALESCE(total_price, 0)) AS average_order_value,
SUM(COALESCE(total_discount, 0)) AS total_discount,
SUM(total_price - COALESCE(taxes, 0) - COALESCE(shipping, 0) - COALESCE(total_discount, 0)) as net_sales
FROM orders
WHERE shop_id = 43
AND orders.active = true
AND orders.created_at >= '2022-07-20'
AND orders.created_at <= '2022-07-26'
GROUP BY date (created_at)
order by created_at::date desc
However, for dates that do not have any orders the query returns nothing, and I'd like it to return 0.
I have tried with COALESCE but that doesn't seem to do the trick?
Any suggestions?
This should be substantially faster - and correct:
SELECT *
, total_price - taxes - shipping - total_discount AS net_sales -- ⑤
FROM (
SELECT created_at
, COALESCE(total_orders , 0) AS total_orders
, COALESCE(total_price , 0) AS total_price
, COALESCE(taxes , 0) AS taxes
, COALESCE(shipping , 0) AS shipping
, COALESCE(average_order_value , 0) AS average_order_value
, COALESCE(total_discount , 0) AS total_discount
FROM generate_series(timestamp '2022-07-20' -- ①
, timestamp '2022-07-26'
, interval '1 day') AS g(created_at)
LEFT JOIN ( -- ③
SELECT created_at::date
, count(*) AS total_orders -- ⑥
, sum(total_price) AS total_price
, sum(taxes) AS taxes
, sum(shipping) AS shipping
, avg(total_price) AS average_order_value
, sum(total_discount) AS total_discount
FROM orders
WHERE shop_id = 43
AND active -- simpler
AND created_at >= '2022-07-20'
AND created_at < '2022-07-27' -- ② !
GROUP BY 1
) o USING (created_at) -- ④
) sub
ORDER BY created_at DESC;
db<>fiddle here
I copied, simplified, and extended Xu's fiddle for comparison.
① Why this particular form for generate_series()? See:
Generating time series between two dates in PostgreSQL
② Assuming created_at is data type timestamp your original formulation is most probably incorrect. created_at <= '2022-07-26' would include the first instant of '2022-07-26' and exclude the rest. To include all of '2022-07-26', use created_at < '2022-07-27'. See:
How do I write a function in plpgsql that compares a date with a timestamp without time zone?
③ The LEFT JOIN is the core feature of this answer. Generate all days with generate_series(), independently aggregate days from table orders, then LEFT JOIN to retain one row per day like you requested.
④ I made the column name match created_at, so we can conveniently shorten the join syntax with the USING clause.
⑤ Compute net_sales in an outer SELECT after replacing NULL values, so we need COALESCE() only once.
⑥ count(*) is equivalent to COUNT(COALESCE(id, 0)) in any case, but cheaper. See:
Optimizing GROUP BY + COUNT DISTINCT on unnested jsonb column
PostgreSQL: running count of rows for a query 'by minute'
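To see the zero-fill pattern from notes ① to ③ run end to end, here is a minimal sketch using Python's built-in sqlite3 module rather than Postgres. SQLite has no generate_series() by default, so a recursive CTE stands in for it. The two-column orders table and the daily_totals helper are made up for illustration and are not part of the original schema.

```python
import sqlite3

def daily_totals(rows, first_day, last_day):
    """Return one (day, total_orders, total_price) row per day, zero-filled.

    rows: iterable of (created_at 'YYYY-MM-DD HH:MM:SS', total_price).
    The table layout is a simplified, hypothetical version of "orders".
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (created_at TEXT, total_price REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    sql = """
    WITH RECURSIVE days(day) AS (        -- stand-in for generate_series()
        SELECT :first
        UNION ALL
        SELECT date(day, '+1 day') FROM days WHERE day < :last
    ),
    o AS (                               -- aggregate orders per day first
        SELECT date(created_at) AS day,
               count(*)         AS total_orders,
               sum(total_price) AS total_price
        FROM   orders
        WHERE  created_at >= :first
        AND    created_at <  date(:last, '+1 day')  -- half-open upper bound
        GROUP  BY date(created_at)
    )
    SELECT day,
           COALESCE(o.total_orders, 0),
           COALESCE(o.total_price, 0)
    FROM   days
    LEFT   JOIN o USING (day)            -- keeps days without orders
    ORDER  BY day
    """
    result = con.execute(sql, {"first": first_day, "last": last_day}).fetchall()
    con.close()
    return result
```

Note that an order placed at 08:30 on the last day is still counted, which a `<=` comparison against the bare date would have dropped.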
Please refer to the below script.
SELECT *
FROM
(SELECT date(created_at) AS created_at,
COUNT(id) AS total_orders,
SUM(total_price) AS total_price,
SUM(taxes) AS taxes,
SUM(shipping) AS shipping,
AVG(total_price) AS average_order_value,
SUM(total_discount) AS total_discount,
SUM(total_price - taxes - shipping - total_discount) AS net_sales
FROM orders
WHERE shop_id = 43
AND orders.active = true
AND orders.created_at >= '2022-07-20'
AND orders.created_at <= '2022-07-26'
GROUP BY date (created_at)
UNION
SELECT dates AS created_at,
0 AS total_orders,
0 AS total_price,
0 AS taxes,
0 AS shipping,
0 AS average_order_value,
0 AS total_discount,
0 AS net_sales
FROM generate_series('2022-07-20', '2022-07-26', interval '1 day') AS dates
WHERE dates NOT IN
(SELECT created_at
FROM orders
WHERE shop_id = 43
AND orders.active = true
AND orders.created_at >= '2022-07-20'
AND orders.created_at <= '2022-07-26' ) ) a
ORDER BY created_at::date desc;
Here is a sample for your reference: Sample
I reproduced the duplicate rows from your test case. The root cause is that created_at is a timestamp, so the NOT IN comparison against the generated dates rarely matches and the zero rows are added on top of the real ones.
The script below corrects this by comparing on date(created_at):
SELECT *
FROM
(SELECT date(created_at) AS created_at,
COUNT(id) AS total_orders,
SUM(total_price) AS total_price,
SUM(taxes) AS taxes,
SUM(shipping) AS shipping,
AVG(total_price) AS average_order_value,
SUM(total_discount) AS total_discount,
SUM(total_price - taxes - shipping - total_discount) AS net_sales
FROM orders
WHERE shop_id = 43
AND orders.active = true
AND orders.created_at >= '2022-07-20'
AND orders.created_at <= '2022-07-26'
GROUP BY date (created_at)
UNION
SELECT dates AS created_at,
0 AS total_orders,
0 AS total_price,
0 AS taxes,
0 AS shipping,
0 AS average_order_value,
0 AS total_discount,
0 AS net_sales
FROM generate_series('2022-07-20', '2022-07-26', interval '1 day') AS dates
WHERE dates NOT IN
(SELECT date (created_at)
FROM orders
WHERE shop_id = 43
AND orders.active = true
AND orders.created_at >= '2022-07-20'
AND orders.created_at <= '2022-07-26' ) ) a
ORDER BY created_at::date desc;
Here is a sample that matches your setup: Link
You can use WITH RECURSIVE to build a table of dates and then select the dates that are not in your table:
WITH RECURSIVE t(d) AS (
(SELECT '2015-01-01'::date)
UNION ALL
(SELECT d + 1 FROM t WHERE d + 1 <= '2015-01-10')
) SELECT d FROM t WHERE d NOT IN (SELECT d_date FROM tbl);
See this post: https://stackoverflow.com/questions/28583379/find-missing-dates-postgresql
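The recursive CTE approach can be tried out with a small sketch like the one below. It uses Python's built-in sqlite3 module instead of Postgres, so date arithmetic is written as date(d, '+1 day') rather than d + 1. The tbl table and its created_at column are placeholders. The subquery compares date to date, which matters when the stored column is a timestamp.

```python
import sqlite3

def missing_dates(timestamps, first_day, last_day):
    """List the days in [first_day, last_day] that have no row in tbl."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE tbl (created_at TEXT)")  # hypothetical table
    con.executemany("INSERT INTO tbl VALUES (?)", [(t,) for t in timestamps])
    sql = """
    WITH RECURSIVE t(d) AS (
        SELECT :first
        UNION ALL
        SELECT date(d, '+1 day') FROM t WHERE d < :last
    )
    SELECT d
    FROM   t
    WHERE  d NOT IN (SELECT date(created_at) FROM tbl)  -- date vs. date
    ORDER  BY d
    """
    rows = con.execute(sql, {"first": first_day, "last": last_day}).fetchall()
    con.close()
    return [r[0] for r in rows]
```

If created_at can be NULL, NOT IN returns no rows at all, so NOT EXISTS is the safer form in that case.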

Add running or cumulative total

I have the query below, which gives me the expected results:
SELECT
total_orders,
quantity,
available_store_credits
FROM
(
SELECT
COUNT(orders.id) as total_orders,
date_trunc('year', confirmed_at) as year,
date_trunc('month', confirmed_at) as month,
SUM( quantity ) as quantity
FROM
orders
INNER JOIN (
SELECT
orders.id,
sum(quantity) as quantity
FROM
orders
INNER JOIN line_items ON line_items.order_id = orders.id
WHERE
orders.deleted_at IS NULL
AND orders.status IN (
'paid', 'packed', 'in_transit', 'delivered'
)
GROUP BY
orders.id
) as order_quantity
ON order_quantity.id = orders.id
GROUP BY month, year) as orders_transactions
FULL OUTER JOIN
(
SELECT
date_trunc('year', created_at) as year,
date_trunc('month', created_at) as month,
SUM( ROUND( ( CASE WHEN amount_in_cents > 0 THEN amount_in_cents end) / 100, 2 )) AS store_credit_given,
SUM( ROUND( amount_in_cents / 100, 2 )) AS available_store_credits
FROM
store_credit_transactions
GROUP BY month, year
) as store_credit_results
ON orders_transactions.month = store_credit_results.month
I want to add one more column next to available_store_credits that holds the running total of available_store_credits.
These are my attempts, but none of them work:
Attempt #1
SELECT
total_orders,
quantity,
available_store_credits,
cum_amt
FROM
(
SELECT
COUNT(orders.id) as total_orders,
date_trunc('year', confirmed_at) as year,
date_trunc('month', confirmed_at) as month,
SUM( quantity ) as quantity,
FROM
orders
INNER JOIN (
SELECT
orders.id,
sum(quantity) as quantity
FROM
orders
INNER JOIN line_items ON line_items.order_id = orders.id
WHERE
orders.deleted_at IS NULL
AND orders.status IN (
'paid', 'packed', 'in_transit', 'delivered'
)
GROUP BY
orders.id
) as order_quantity
ON order_quantity.id = orders.id
GROUP BY month, year) as orders_transactions
FULL OUTER JOIN
(
SELECT
date_trunc('year', created_at) as year,
date_trunc('month', created_at) as month,
SUM( ROUND( ( CASE WHEN amount_in_cents > 0 THEN amount_in_cents end) / 100, 2 )) AS store_credit_given,
SUM( ROUND( amount_in_cents / 100, 2 )) AS available_store_credits
SUM( amount_in_cents ) OVER (ORDER BY date_trunc('month', created_at), date_trunc('year', created_at)) AS cum_amt
FROM
store_credit_transactions
GROUP BY month, year
) as store_credit_results
ON orders_transactions.month = store_credit_results.month
Attempt #2
SELECT
total_orders,
quantity,
available_store_credits,
running_tot
FROM
(
SELECT
COUNT(orders.id) as total_orders,
date_trunc('year', confirmed_at) as year,
date_trunc('month', confirmed_at) as month,
FROM
orders
INNER JOIN (
SELECT
orders.id,
sum(quantity) as quantity
FROM
orders
INNER JOIN line_items ON line_items.order_id = orders.id
WHERE
orders.deleted_at IS NULL
AND orders.status IN (
'paid', 'packed', 'in_transit', 'delivered'
)
GROUP BY
orders.id
) as order_quantity
ON order_quantity.id = orders.id
GROUP BY month, year) as orders_transactions
FULL OUTER JOIN
(
SELECT
date_trunc('year', created_at) as year,
date_trunc('month', created_at) as month,
SUM( ROUND( amount_in_cents / 100, 2 )) AS available_store_credits,
SUM (available_store_creds) as running_tot
FROM
store_credit_transactions
INNER JOIN (
SELECT t0.id,
(
SELECT SUM( ROUND( amount_in_cents / 100, 2 )) as running_total
FROM store_credit_transactions as t1
WHERE date_trunc('month', t1.created_at) <= date_trunc('month', t0.created_at)
) AS available_store_creds
FROM store_credit_transactions AS t0
) as results
ON results.id = store_credit_transactions.id
GROUP BY month, year
) as store_credit_results
ON orders_transactions.month = store_credit_results.month
Making some assumptions about the undisclosed table definition and Postgres version (assuming current Postgres 14), this should do it:
SELECT total_orders, quantity, available_store_credits
, sum(available_store_credits) OVER (ORDER BY month) AS cum_amt -- HERE!!
FROM (
SELECT date_trunc('month', confirmed_at) AS month
, count(*) AS total_orders
, sum(quantity) AS quantity
FROM (
SELECT o.id, o.confirmed_at, sum(quantity) AS quantity
FROM orders o
JOIN line_items l ON l.order_id = o.id
WHERE o.deleted_at IS NULL
AND o.status IN ('paid', 'packed', 'in_transit', 'delivered')
GROUP BY 1
) o
GROUP BY 1
) orders_transactions
FULL JOIN (
SELECT date_trunc('month', created_at) AS month
, round(sum(amount_in_cents) FILTER (WHERE amount_in_cents > 0) / 100, 2) AS store_credit_given
, round(sum(amount_in_cents) / 100, 2) AS available_store_credits
FROM store_credit_transactions
GROUP BY 1
) store_credit_results USING (month)
This assumes you want the running sum to show up in every row, accumulated in date order.
First, I simplified and removed some cruft:
date_trunc('year', confirmed_at) as year, was 100 % redundant noise in your query. I removed it.
As was another join to orders. Removed that, too. Assuming orders.id is defined as PK, we can further simplify. See:
PostgreSQL - GROUP BY clause
Use the superior aggregate FILTER. See:
Aggregate columns with additional (distinct) filters
Simplified a couple of other minor bits.
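The core of the running total, aggregating per month first and then applying a window sum over the aggregated rows, can be sketched in isolation. This is a rough illustration using Python's built-in sqlite3 module (it needs SQLite 3.25 or later for window functions), not the Postgres query itself. strftime('%Y-%m', ...) stands in for date_trunc('month', ...), and the division uses 100.0 on the assumption that amount_in_cents is an integer column, which would otherwise truncate.

```python
import sqlite3

def monthly_running_total(transactions):
    """transactions: iterable of (created_at 'YYYY-MM-DD', amount_in_cents)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE store_credit_transactions "
        "(created_at TEXT, amount_in_cents INTEGER)"
    )
    con.executemany(
        "INSERT INTO store_credit_transactions VALUES (?, ?)", transactions
    )
    sql = """
    SELECT month,
           available_store_credits,
           -- window function runs over the already aggregated monthly rows
           sum(available_store_credits) OVER (ORDER BY month) AS cum_amt
    FROM (
        SELECT strftime('%Y-%m', created_at)          AS month,
               round(sum(amount_in_cents) / 100.0, 2) AS available_store_credits
        FROM   store_credit_transactions
        GROUP  BY strftime('%Y-%m', created_at)
    ) AS store_credit_results
    ORDER  BY month
    """
    rows = con.execute(sql).fetchall()
    con.close()
    return rows
```

Keeping the window function in the outer query is what Attempt #1 was missing: there the window sum was mixed into the same SELECT as the GROUP BY aggregates.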

Attempting to calculate absolute change and % change in 1 query

I'm having trouble with the SELECT portion of this query. I can calculate the absolute change just fine, but when I want to also find out the percent change I get lost in all the subqueries. Using BigQuery. Thank you!
SELECT
station_name,
ridership_2013,
ridership_2014,
absolute_change_2014 / ridership_2013 * 100 AS percent_change,
(ridership_2014 - ridership_2013) AS absolute_change_2014,
It will probably be beneficial to organize your query with CTEs and descriptive aliases to make things a bit easier. For example...
with
data as (select * from project.dataset.table),
ridership_by_year as (
select
extract(year from ride_date) as yr,
count(*) as rides
from data
group by 1
),
ridership_by_year_and_station as (
select
extract(year from ride_date) as yr,
station_name,
count(*) as rides
from data
group by 1,2
),
yearly_changes as (
select
this_year.yr,
this_year.rides,
prev_year.rides as prev_year_rides,
this_year.rides - coalesce(prev_year.rides,0) as absolute_change_in_rides,
safe_divide( this_year.rides - coalesce(prev_year.rides), prev_year.rides) as relative_change_in_rides
from ridership_by_year this_year
left join ridership_by_year prev_year on this_year.yr = prev_year.yr + 1
),
yearly_station_changes as (
select
this_year.yr,
this_year.station_name,
this_year.rides,
prev_year.rides as prev_year_rides,
this_year.rides - coalesce(prev_year.rides,0) as absolute_change_in_rides,
safe_divide( this_year.rides - coalesce(prev_year.rides), prev_year.rides) as relative_change_in_rides
from ridership_by_year_and_station this_year
left join ridership_by_year_and_station prev_year
on this_year.station_name = prev_year.station_name
and this_year.yr = prev_year.yr + 1
)
select * from yearly_changes
--select * from yearly_station_changes
Yes this is a bit longer, but IMO it is much easier to understand.
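For the column layout in the question (one row per station with ridership_2013 and ridership_2014), the same idea fits in a much smaller query: compute the absolute change in a CTE, then reuse that alias in the outer SELECT for the percentage. Below is a rough sketch using Python's built-in sqlite3 module instead of BigQuery. The ridership table name is an assumption, and NULLIF plays the role of safe_divide by turning a zero denominator into NULL.

```python
import sqlite3

def ridership_changes(rows):
    """rows: iterable of (station_name, ridership_2013, ridership_2014)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE ridership "  # hypothetical table name
        "(station_name TEXT, ridership_2013 INTEGER, ridership_2014 INTEGER)"
    )
    con.executemany("INSERT INTO ridership VALUES (?, ?, ?)", rows)
    sql = """
    WITH changes AS (
        SELECT station_name,
               ridership_2013,
               ridership_2014 - ridership_2013 AS absolute_change_2014
        FROM   ridership
    )
    SELECT station_name,
           absolute_change_2014,
           -- the alias is usable here because it was defined one level down
           absolute_change_2014 * 100.0 / NULLIF(ridership_2013, 0) AS percent_change
    FROM   changes
    ORDER  BY station_name
    """
    result = con.execute(sql).fetchall()
    con.close()
    return result
```

A station with no riders in 2013 gets a NULL percent change rather than a division error.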

CREATE VIEW must be the only statement in the batch MS SQL Server

Microsoft SQL Server Management Studio 18 shows an error:
CREATE VIEW must be the only statement in the batch
After executing the query, the following error appears:
Incorrect syntax near the keyword 'select'.
create view revenue0 (supplier_no, total_revenue) as
select
l_suppkey,
sum(l_extendedprice * (1 - l_discount))
from
lineitem
where
l_shipdate >= '1996-05-01'
and l_shipdate < dateadd(mm,3,cast('1996-05-01' as datetime))
group by
l_suppkey;
select
s_suppkey,
s_name,
s_address,
s_phone,
total_revenue
from
supplier,
revenue0
where
s_suppkey = supplier_no
and total_revenue = (
select
max(total_revenue)
from
revenue0
)
order by
s_suppkey
option (maxdop 2)
drop view revenue0
UPD. I tried to run with this method:
create view revenue0 (supplier_no, total_revenue) as
select
l_suppkey,
sum(l_extendedprice * (1 - l_discount))
from
lineitem
where
l_shipdate >= cast('1996-05-01' as datetime)
and l_shipdate < dateadd(mm, 3, cast('1996-05-01' as datetime))
group by
l_suppkey;
go
select
s_suppkey,
s_name,
s_address,
s_phone,
total_revenue
from
supplier,
revenue0
where
s_suppkey = supplier_no
and total_revenue = (
select
max(total_revenue)
from
revenue0
)
order by
s_suppkey;
drop view revenue0;
But when I execute the script, this error is displayed:
Invalid object name 'revenue0'.
No matter how I change the name, SQL Server still complains about it.
UPD2. The question was solved independently. The topic is closed! Thank you all for your efforts!
The error tells you that CREATE VIEW must be the only statement in a batch. In SSMS and sqlcmd a batch is ended with the GO separator, as Gordon stated, so your code should look like this:
create view revenue0 (supplier_no, total_revenue) as
select
l_suppkey,
sum(l_extendedprice * (1 - l_discount))
from
lineitem
where
l_shipdate >= '1996-05-01'
and l_shipdate < dateadd(mm,3,cast('1996-05-01' as datetime))
group by
l_suppkey;
GO -- right here. This ends a batch. Must be on a new line, with no semicolon, or SQL gets pissy.
select
s_suppkey,
s_name,
s_address,
s_phone,
total_revenue
from
supplier,
revenue0
where
s_suppkey = supplier_no
and total_revenue = (
select
max(total_revenue)
from
revenue0
)
order by
s_suppkey
option (maxdop 2);
drop view revenue0;
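It may help to picture what the client does with GO: it is not sent to the server, the tool cuts the script at each GO line and submits the pieces one by one. The sketch below imitates that splitting in Python. It is only an illustration of the idea, not how SSMS or sqlcmd are actually implemented, and it ignores details such as the optional repeat count (GO 5). The split_batches name is made up.

```python
import re

# A separator line: GO on its own, optionally followed by a "--" comment.
_GO_LINE = re.compile(r"^\s*GO\s*(?:--.*)?$", re.IGNORECASE)

def split_batches(script):
    """Split a T-SQL script into batches at GO separator lines."""
    batches, current = [], []
    for line in script.splitlines():
        if _GO_LINE.match(line):
            # Close the current batch, skipping empty ones.
            if any(part.strip() for part in current):
                batches.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    if any(part.strip() for part in current):
        batches.append("\n".join(current).strip())
    return batches
```

With the script above, the CREATE VIEW statement ends up alone in the first batch, and the SELECT and DROP VIEW form the second. A line such as `GO;` does not match the pattern, which mirrors the "no semicolon" rule.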

how to prevent duplicate columns in inner join with multiple select clauses

select * from
(select date, gen_city_id, min(temp) as min_temp, max(temp) as max_temp from current_weather group by date, gen_city_id order by date) cw
inner join
(select gen_city_id, forecast_date, array_agg(temp) from forecast where forecast_date < current_date group by gen_city_id, forecast_date) f
on cw.gen_city_id = f.gen_city_id and cw.date = f.forecast_date;
The above query works, but the gen_city_id and date/forecast_date columns are selected from both tables. How do I prevent the duplicate columns in my result set?
If I try removing the columns from the select clause of either subquery, the query errors out.
Change the query in this way. You can specify which fields you want to obtain in the resultset:
select cw.*,f.temp from
(select date, gen_city_id, min(temp) as min_temp, max(temp) as max_temp from current_weather group by date, gen_city_id order by date) cw
inner join
(select gen_city_id, forecast_date, array_agg(temp) temp from forecast where forecast_date < current_date group by gen_city_id, forecast_date) f
on cw.gen_city_id = f.gen_city_id and cw.date = f.forecast_date;
You can use the using clause:
select *
from (select date, gen_city_id, min(temp) as min_temp, max(temp) as max_temp
from current_weather
group by date, gen_city_id order by date
) cw join
(select gen_city_id, forecast_date as date, array_agg(temp)
from forecast
where forecast_date < current_date
group by gen_city_id, forecast_date
) f
using (gen_city_id, date) ;
This removes the duplicate columns included in the using clause.
In general, though, I recommend listing out all the columns separately.
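The difference between ON and USING for `select *` is easy to check by looking at the result's column names. Here is a small sketch with Python's built-in sqlite3 module, which treats USING the same way in this respect. The cw and f tables are simplified stand-ins for the two subqueries, and joined_columns is a made-up helper.

```python
import sqlite3

def joined_columns(join_clause):
    """Return the column names of SELECT * for cw JOIN f <join_clause>."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE cw (date TEXT, gen_city_id INTEGER,
                         min_temp REAL, max_temp REAL);
        CREATE TABLE f  (date TEXT, gen_city_id INTEGER, temps TEXT);
    """)
    # join_clause is trusted demo input; do not build SQL like this from user data
    cur = con.execute(f"SELECT * FROM cw JOIN f {join_clause}")
    names = [col[0] for col in cur.description]
    con.close()
    return names

using_cols = joined_columns("USING (gen_city_id, date)")
on_cols = joined_columns(
    "ON cw.gen_city_id = f.gen_city_id AND cw.date = f.date"
)
print(len(using_cols), len(on_cols))
```

The USING version yields five columns, with each join column appearing once, while the ON version yields seven.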