Presto Weighted Moving Average Syntax Error - sql

I'm trying to run the weighted moving average Silota query with similar data in a Presto database but am encountering an error. The same query in the Redshift database has no issues, however in Presto I receive a syntax error:
Query failed (#20220505_230258_04927_5xpwi):
line 14:14: Column 't2.row_number' cannot be resolved io.prestosql.spi.PrestoException:
line 14:14: Column 't2.row_number' cannot be resolved.
The data is the same in both databases, why does the query run in Redshift while Presto throws the error?
WITH t AS
(select date_trunc('month',mql_date) date, avg(mqls) mqls, row_number() over ()
from marketing.campaign
WHERE date_trunc('month',mql_date) > date('2021-12-31')
GROUP BY 1)
select t.date, avg(t.mqls),
sum(case
when t.row_number - t2.row_number = 0 then 0.4 * t2.mqls
when t.row_number - t2.row_number = 1 then 0.3 * t2.mqls
when t.row_number - t2.row_number = 2 then 0.2 * t2.mqls
when t.row_number - t2.row_number = 3 then 0.1 * t2.mqls
end) weighted_avg
from t
join t t2 on t2.row_number between t.row_number - 3 and t.row_number
group by 1
order by 1

I suspect it is because your SQL assumes that the column produced by the row_number() window function will be named "row_number". This is true in Redshift, but other databases may assign a different name to it. You should alias it to an explicit name such as "rn".
Also, there is no "order by" clause in your row_number() window, so the row numbers are unpredictable and may vary between invocations.
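As a minimal sketch of both suggestions (an explicit "rn" alias and an ORDER BY inside the window), the snippet below runs the same weighted-average pattern against an in-memory SQLite database from Python. The table name monthly and its sample values are made up for illustration, and SQLite stands in for Presto here, so treat it as a demonstration of the pattern rather than a tested Presto query.

```python
import sqlite3

# Hypothetical sample data: one pre-aggregated row per month.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly (month TEXT, mqls REAL)")
conn.executemany(
    "INSERT INTO monthly VALUES (?, ?)",
    [
        ("2022-01-01", 10.0),
        ("2022-02-01", 20.0),
        ("2022-03-01", 30.0),
        ("2022-04-01", 40.0),
        ("2022-05-01", 50.0),
    ],
)

query = """
WITH t AS (
    SELECT month,
           mqls,
           -- explicit alias plus ORDER BY: stable, portable row numbers
           row_number() OVER (ORDER BY month) AS rn
    FROM monthly
)
SELECT t.month,
       SUM(CASE t.rn - t2.rn
               WHEN 0 THEN 0.4 * t2.mqls
               WHEN 1 THEN 0.3 * t2.mqls
               WHEN 2 THEN 0.2 * t2.mqls
               WHEN 3 THEN 0.1 * t2.mqls
           END) AS weighted_avg
FROM t
JOIN t AS t2 ON t2.rn BETWEEN t.rn - 3 AND t.rn
GROUP BY t.month
ORDER BY t.month
"""

rows = conn.execute(query).fetchall()
for month, weighted_avg in rows:
    print(month, round(weighted_avg, 2))
```

The query refers only to the alias rn, so it no longer depends on whatever default name a given engine gives the window column.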

Related

Postgres "Cannot take logarithm of zero" Error

I use
ln(session_length) - avg(ln(session_length)) OVER (PARTITION BY device_platform) / nullif(stddev(ln(session_length)) OVER (PARTITION BY device_platform), 0) AS ln_std
for removing outliers with SQL. I have used the function with Redshift before and I did not get any error but when I use this with Postgres I get
[2201E] ERROR: cannot take logarithm of zero
The error appears only after I added the where clause with ln_std <= 1.67; without it there is no error.
Can someone point out what I am missing?
My code is:
SELECT
user_id
, event_date
, device_platform
, marketing_user
, session_length
FROM
(
SELECT
user_id
, date(event_time) AS event_date
, device_platform
, marketing_user AS marketing_user
, session_length
--! Normalisation: Using a logarithmic scale (ln())
--! Create the Z score for removing the outliers
, ln(session_length) - avg(ln(session_length)) OVER (PARTITION BY device_platform) /
nullif(stddev(ln(session_length)) OVER (PARTITION BY device_platform),
0) AS ln_std
FROM
session_start
WHERE
date(install_time) >= '2020-01-01'
) filter
WHERE
ln_std <= 1.67
There is a value less than or equal to zero in your session_length column; the error describes it pretty well. Do some analysis on why this is happening and treat those rows accordingly.
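One way to "treat them accordingly" is to leave non-positive values out of the logarithm, which in SQL could be done with a CASE or a filter on session_length > 0. The sketch below shows the same idea in plain Python rather than Postgres; the function name log_z_scores and the sample values are assumptions for illustration. It also parenthesises the subtraction: in the query as written, division binds tighter than subtraction, so the expression is ln(x) - (avg / stddev) rather than (ln(x) - avg) / stddev.

```python
import math
from statistics import mean, stdev


def log_z_scores(values):
    """Return the z-score of ln(value) for each input, or None where it is undefined."""
    # Take the log only of strictly positive values; zero, negatives and None become None.
    logs = [math.log(v) if v is not None and v > 0 else None for v in values]
    valid = [x for x in logs if x is not None]
    if len(valid) < 2:
        return [None] * len(values)
    mu = mean(valid)
    sigma = stdev(valid)  # sample standard deviation
    if sigma == 0:
        # Mirrors nullif(stddev(...), 0): no division by zero.
        return [None] * len(values)
    return [None if x is None else (x - mu) / sigma for x in logs]


# Hypothetical session lengths, including a zero that would break ln().
scores = log_z_scores([1.0, math.e, math.e ** 2, 0.0])
print(scores)
```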

TPC-DS Query 6: Why do we need 'where j.i_category = i.i_category' condition?

I'm going through TPC-DS for Amazon Athena.
It was fine until query 5.
I ran into a problem with query 6 (shown below).
select a.ca_state state, count(*) cnt
from customer_address a
,customer c
,store_sales s
,date_dim d
,item i
where a.ca_address_sk = c.c_current_addr_sk
and c.c_customer_sk = s.ss_customer_sk
and s.ss_sold_date_sk = d.d_date_sk
and s.ss_item_sk = i.i_item_sk
and d.d_month_seq =
(select distinct (d_month_seq)
from date_dim
where d_year = 2002
and d_moy = 3 )
and i.i_current_price > 1.2 *
(select avg(j.i_current_price)
from item j
where j.i_category = i.i_category)
group by a.ca_state
having count(*) >= 10
order by cnt, a.ca_state
limit 100;
It ran for more than 30 minutes and then failed with a timeout.
I tried to find which part causes the problem, so I checked the where conditions and found where j.i_category = i.i_category in the last part of the where clause.
I don't know why this condition is needed, so I deleted it and the query ran OK.
Can you tell me why this part is needed?
The j.i_category = i.i_category predicate is the subquery's correlation condition.
If you remove it from the subquery
select avg(j.i_current_price)
from item j
where j.i_category = i.i_category
the subquery becomes uncorrelated: a global aggregation over the item table, which is easy to calculate and which the query engine needs to do only once. It also changes the result, because each item's price is then compared with the average over all items instead of the average for its own category.
If you want a fast, performant query engine on AWS, I can recommend Starburst Presto (disclaimer: I am from Starburst). See https://www.concurrencylabs.com/blog/starburst-presto-vs-aws-redshift/ for a related comparison (note: this is not a comparison with Athena).
If it doesn't have to be that fast, you can use PrestoSQL on EMR (note that "PrestoSQL" and "Presto" components on EMR are not the same thing).
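To make the effect of the correlation condition concrete, here is a small sketch that runs both forms of the price filter against an in-memory SQLite database from Python. The four item rows are made up, and the table is reduced to the three columns the subquery touches; it is only meant to show that the two predicates select different rows, not to reproduce TPC-DS.

```python
import sqlite3

# Hypothetical, tiny stand-in for the TPC-DS item table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item (i_item_sk INTEGER, i_category TEXT, i_current_price REAL)"
)
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?)",
    [(1, "A", 10.0), (2, "A", 20.0), (3, "B", 100.0), (4, "B", 200.0)],
)

# Correlated: each item is compared with the average of its own category.
correlated = [
    row[0]
    for row in conn.execute(
        """
        SELECT i.i_item_sk
        FROM item i
        WHERE i.i_current_price > 1.2 * (SELECT AVG(j.i_current_price)
                                         FROM item j
                                         WHERE j.i_category = i.i_category)
        ORDER BY i.i_item_sk
        """
    )
]

# Uncorrelated: every item is compared with one global average.
uncorrelated = [
    row[0]
    for row in conn.execute(
        """
        SELECT i.i_item_sk
        FROM item i
        WHERE i.i_current_price > 1.2 * (SELECT AVG(j.i_current_price)
                                         FROM item j)
        ORDER BY i.i_item_sk
        """
    )
]

print("correlated:  ", correlated)
print("uncorrelated:", uncorrelated)
```

With this data the category averages are 15 and 150 while the global average is 82.5, so the cheaper uncorrelated form keeps a different set of items.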

Hive summary function inside case statement

I am trying to write a simple Hive query:
select sum(case when pot_sls_q > 2* avg(pit_sls_q) then 1 else 0)/count(*) from prd_inv_fnd.item_pot_sls where dept_i=43 and class_i=3 where p_wk_end_d = 2014-06-28;
Here pit_sls_q and pot_sls_q both are columns in the Hive table and I want proportion of records which have pot_sls_q more than 2 times average of pit_sls_q. However I get error:
FAILED: SemanticException [Error 10128]: Line 1:95 Not yet supported place for UDAF 'avg'
To fool around I even tried using some window function:
select sum(case when pot_sls_q > 2* avg(pit_sls_q) over (partition by dept_i,class_i) then 1 else 0 end)/count(*) from prd_inv_fnd.item_pot_sls where dept_i=43 and class_i=3 and p_wk_end_d = '2014-06-28';
This should be equivalent, since filtering and partitioning the data on the same condition give essentially the same data, but even with this I get an error:
FAILED: SemanticException [Error 10002]: Line 1:36 Invalid column reference 'avg': (possible column names are: p_wk_end_d, dept_i, class_i, item_i, pit_sls_q, pot_sls_q)
Please suggest the right way of doing this.
You are using AVG inside SUM, which won't work (and there are other syntax errors as well).
Try the analytic AVG ... OVER () in a subquery, like this:
select sum(case when pot_sls_q > 2 * avg_pit_sls_q then 1 else 0 end) / count(*)
from (
select t.*,
avg(pit_sls_q) over () avg_pit_sls_q
from prd_inv_fnd.item_pot_sls t
where dept_i = 43
and class_i = 3
and p_wk_end_d = '2014-06-28'
) t;
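The same shape can be tried out quickly on sample data. The sketch below runs it against an in-memory SQLite database from Python instead of Hive; the table is cut down to the two quantity columns, the four rows are invented, and the * 1.0 is there only because SQLite would otherwise do integer division.

```python
import sqlite3

# Hypothetical rows: (pit_sls_q, pot_sls_q). The average of pit_sls_q is 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item_pot_sls (pit_sls_q REAL, pot_sls_q REAL)")
conn.executemany(
    "INSERT INTO item_pot_sls VALUES (?, ?)",
    [(1.0, 5.0), (2.0, 3.0), (3.0, 10.0), (2.0, 4.0)],
)

query = """
SELECT SUM(CASE WHEN pot_sls_q > 2 * avg_pit_sls_q THEN 1 ELSE 0 END) * 1.0
       / COUNT(*) AS proportion
FROM (
    -- the window function attaches the overall average to every row first
    SELECT t.*,
           AVG(pit_sls_q) OVER () AS avg_pit_sls_q
    FROM item_pot_sls AS t
) AS t
"""

proportion = conn.execute(query).fetchone()[0]
print(proportion)
```

Because the average is already an ordinary column by the time the outer query runs, the outer SUM no longer contains a nested aggregate.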

SQL - How to eliminate duplicates from the below query in POSTGRES

I have been working on the query below. Basically there are two tables, Realtime_Input and Realtime_Output. I join the two tables, take the necessary columns, and made this a view; when I query the view I get duplicates.
What am I doing wrong? When I tested using the distinct keyword, I get 60 unique rows, but intermittently I still get duplicates. My db is on Cloud Foundry (Postgres). Is it because of that? Please help!
select i2.key_ts_long,
case
when i2.revenue_activepower = 'NA'
then (-1 * CAST(io.min5_forecast as real))
else (CAST(i2.revenue_activepower AS real) - CAST(io.min5_forecast as real))
end as diff
from realtime_analytic_input i2,
(select i.farm_id,
i.key_ts_long,
o.min5_forecast,
o.min5_timestamp_seconds
from realtime_analytic_input i,
realtime_analytic_output o
where i.farm_id = o.farm_id
and i.key_ts_long = o.key_ts_long
and o.farm_id = 'MW1'
) io
where i2.key_ts_long = CAST(io.min5_timestamp_seconds AS bigint)
and i2.farm_id = io.farm_id
and i2.farm_id = 'MW1'
and io.key_ts_long between 1464738953169 and 1466457841
order by io.key_ts_long desc

SQL - CountIf on a column

Trying to do some calculations via SQL on my iSeries and have the following conundrum: I need to count the number of times a certain value appears in a column. My select statement is as follows:
Select
MOTRAN.ORDNO, MOTRAN.OPSEQ, MOROUT.WKCTR, MOTRAN.TDATE,
MOTRAN.LBTIM, MOROUT.SRLHU, MOROUT.RLHTD, MOROUT.ACODT,
MOROUT.SCODT, MOROUT.ASTDT, MOMAST.SSTDT, MOMAST.FITWH,
MOMAST.FITEM,
CONCAT(MOTRAN.ORDNO, MOTRAN.OPSEQ) As CON,
count (Concat(MOTRAN.ORDNO, MOTRAN.OPSEQ) )As CountIF,
MOROUT.SRLHU / (count (Concat(MOTRAN.ORDNO, MOTRAN.OPSEQ))) as calc
*(snip)*
With this information, I'm trying to count the number of times a value in CON appears. I will need this to do some math with so it's kinda important. My count statement doesn't work properly as it reports a certain value as occurring once when I see it appears 8 times.
Try putting a CASE statement inside a SUM().
SUM(CASE WHEN value = 'something' THEN 1 ELSE 0 END)
This will count the number of rows where value = 'something'.
Similarly...
SUM(CASE WHEN t1.val = CONCAT(t2.val, t3.val) THEN 1 ELSE 0 END)
If you're on a supported version of the OS, i.e. 6.1 or higher...
You might be able to make use of "grouping set" functionality. Particularly the ROLLUP clause.
I can't say for sure without more understanding of your data.
Otherwise, you're going to need to do something like
with Cnt as (select ORDNO, OPSEQ, count(*) as NbrOccur
from MOTRAN
group by ORDNO, OPSEQ
)
Select
MOTRAN.ORDNO, MOTRAN.OPSEQ, MOROUT.WKCTR, MOTRAN.TDATE,
MOTRAN.LBTIM, MOROUT.SRLHU, MOROUT.RLHTD, MOROUT.ACODT,
MOROUT.SCODT, MOROUT.ASTDT, MOMAST.SSTDT, MOMAST.FITWH,
MOMAST.FITEM,
CONCAT(MOTRAN.ORDNO, MOTRAN.OPSEQ) As CON,
Cnt.NbrOccur,
MOROUT.SRLHU / Cnt.NbrOccur as calc
from
motran join Cnt on motran.ordno = cnt.ordno and motran.opseq = cnt.opseq
*(snip)*
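A reduced version of the CTE approach can be checked on a few rows. The sketch below uses an in-memory SQLite database from Python rather than DB2 for i, keeps only ORDNO, OPSEQ and LBTIM from MOTRAN, and uses made-up values, so it illustrates the count-then-join-back pattern and nothing more.

```python
import sqlite3

# Hypothetical transactions: order M001 / op 0010 appears three times.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE motran (ordno TEXT, opseq TEXT, lbtim REAL)")
conn.executemany(
    "INSERT INTO motran VALUES (?, ?, ?)",
    [
        ("M001", "0010", 1.5),
        ("M001", "0010", 2.0),
        ("M001", "0010", 0.5),
        ("M002", "0020", 3.0),
    ],
)

query = """
WITH cnt AS (
    -- one row per (ordno, opseq) with its number of occurrences
    SELECT ordno, opseq, COUNT(*) AS nbr_occur
    FROM motran
    GROUP BY ordno, opseq
)
SELECT m.ordno || m.opseq AS con,
       m.lbtim,
       cnt.nbr_occur
FROM motran AS m
JOIN cnt ON m.ordno = cnt.ordno AND m.opseq = cnt.opseq
ORDER BY con, m.lbtim
"""

rows = conn.execute(query).fetchall()
for con, lbtim, nbr_occur in rows:
    print(con, lbtim, nbr_occur)
```

Each detail row keeps its own values and carries the count for its key, which can then be used in further arithmetic such as the SRLHU division in the question.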