I'm trying to run the weighted moving average Silota query with similar data in a Presto database but am encountering an error. The same query has no issues in the Redshift database; in Presto, however, I receive a syntax error:
Query failed (#20220505_230258_04927_5xpwi):
line 14:14: Column 't2.row_number' cannot be resolved io.prestosql.spi.PrestoException:
line 14:14: Column 't2.row_number' cannot be resolved.
The data is the same in both databases, so why does the query run in Redshift while Presto throws this error?
WITH t AS
(select date_trunc('month',mql_date) date, avg(mqls) mqls, row_number() over ()
from marketing.campaign
WHERE date_trunc('month',mql_date) > date('2021-12-31')
GROUP BY 1)
select t.date, avg(t.mqls),
sum(case
when t.row_number - t2.row_number = 0 then 0.4 * t2.mqls
when t.row_number - t2.row_number = 1 then 0.3 * t2.mqls
when t.row_number - t2.row_number = 2 then 0.2 * t2.mqls
when t.row_number - t2.row_number = 3 then 0.1 * t2.mqls
end) weighted_avg
from t
join t t2 on t2.row_number between t.row_number - 3 and t.row_number
group by 1
order by 1
I suspect it is because your SQL assumes that the result of the row_number() window function will be called "row_number". This is true in Redshift, but other databases may assign a different name to it. You should alias it to some defined name such as "rn".
Also, your row_number() call has no ORDER BY clause, which makes the row numbers unpredictable and possibly different between invocations.
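A sketch of the corrected query along those lines, with the window function aliased (the name rn is an assumed choice) and given an explicit ORDER BY so the months are numbered deterministically:

```sql
WITH t AS
(select date_trunc('month', mql_date) as date,
        avg(mqls) as mqls,
        -- alias the window function and order it by month
        row_number() over (order by date_trunc('month', mql_date)) as rn
 from marketing.campaign
 WHERE date_trunc('month', mql_date) > date('2021-12-31')
 GROUP BY 1)
select t.date, avg(t.mqls),
       sum(case
             when t.rn - t2.rn = 0 then 0.4 * t2.mqls
             when t.rn - t2.rn = 1 then 0.3 * t2.mqls
             when t.rn - t2.rn = 2 then 0.2 * t2.mqls
             when t.rn - t2.rn = 3 then 0.1 * t2.mqls
           end) weighted_avg
from t
join t t2 on t2.rn between t.rn - 3 and t.rn
group by 1
order by 1
```

Because the alias is explicit, the query no longer depends on what name each engine gives to an unaliased window function, so it should resolve in both Presto and Redshift.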
I use
ln(session_length) - avg(ln(session_length)) OVER (PARTITION BY device_platform) / nullif(stddev(ln(session_length)) OVER (PARTITION BY device_platform), 0) AS ln_std
for removing outliers with SQL. I have used this expression with Redshift before without any error, but when I use it with Postgres I get
[2201E] ERROR: cannot take logarithm of zero
The error appears when I add a WHERE clause with ln_std <= 1.67; otherwise there is no error.
Can someone point out if I am missing something?
My code is:
SELECT
user_id
, event_date
, device_platform
, marketing_user
, session_length
FROM
(
SELECT
user_id
, date(event_time) AS event_date
, device_platform
, marketing_user AS marketing_user
, session_length
--! Normalisation: Using a logarithmic scale (ln())
--! Create the Z score for removing the outliers
, ln(session_length) - avg(ln(session_length)) OVER (PARTITION BY device_platform) /
nullif(stddev(ln(session_length)) OVER (PARTITION BY device_platform),
0) AS ln_std
FROM
session_start
WHERE
date(install_time) >= '2020-01-01'
) filter
WHERE
ln_std <= 1.67
There is a value less than or equal to zero in your session_length column; the error describes it pretty well. Do some analysis on why this is happening and treat those rows accordingly.
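A quick diagnostic sketch, using the table and column names from the question, to find the offending rows before deciding how to treat them:

```sql
-- How many sessions have a length that ln() cannot accept?
select count(*) as bad_rows
from session_start
where date(install_time) >= '2020-01-01'
  and session_length <= 0;
```

One option is then to exclude them with an extra predicate (`AND session_length > 0`) in the inner query; whether zero-length sessions should instead be mapped to NULL or to some minimum positive length is a modeling decision for your data.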
I'm going through TPC-DS for Amazon Athena.
Everything was fine up to query 5, but I ran into a problem with query 6 (shown below).
select a.ca_state state, count(*) cnt
from customer_address a
,customer c
,store_sales s
,date_dim d
,item i
where a.ca_address_sk = c.c_current_addr_sk
and c.c_customer_sk = s.ss_customer_sk
and s.ss_sold_date_sk = d.d_date_sk
and s.ss_item_sk = i.i_item_sk
and d.d_month_seq =
(select distinct (d_month_seq)
from date_dim
where d_year = 2002
and d_moy = 3 )
and i.i_current_price > 1.2 *
(select avg(j.i_current_price)
from item j
where j.i_category = i.i_category)
group by a.ca_state
having count(*) >= 10
order by cnt, a.ca_state
limit 100;
It took more than 30 minutes, so it failed with a timeout.
I tried to find which part causes the problem, so I checked the WHERE conditions, and I narrowed it down to j.i_category = i.i_category in the last part of the WHERE clause.
I don't know why this condition is needed, so I deleted it and the query ran OK.
Can you guys tell me why this part is needed?
The j.i_category = i.i_category predicate is the subquery's correlation condition.
If you remove it, the subquery
select avg(j.i_current_price)
from item j
becomes uncorrelated: it turns into a global aggregation over the item table, which is cheap to compute and which the query engine only needs to do once.
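If the engine struggles to decorrelate this automatically, one common workaround is to decorrelate it by hand: precompute the per-category averages once in a derived table and join on the category. A sketch of query 6 rewritten that way (same tables and predicates, explicit join syntax):

```sql
select a.ca_state state, count(*) cnt
from customer_address a
join customer c on a.ca_address_sk = c.c_current_addr_sk
join store_sales s on c.c_customer_sk = s.ss_customer_sk
join date_dim d on s.ss_sold_date_sk = d.d_date_sk
join item i on s.ss_item_sk = i.i_item_sk
-- per-category averages computed once, instead of per outer row
join (select i_category, avg(i_current_price) as avg_price
      from item
      group by i_category) j
  on j.i_category = i.i_category
where d.d_month_seq =
      (select distinct d_month_seq
       from date_dim
       where d_year = 2002
         and d_moy = 3)
  and i.i_current_price > 1.2 * j.avg_price
group by a.ca_state
having count(*) >= 10
order by cnt, a.ca_state
limit 100;
```

This keeps the per-category comparison the correlation condition was expressing, while avoiding the repeated correlated evaluation.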
If you want a fast, performant query engine on AWS, I can recommend Starburst Presto (disclaimer: I am from Starburst). See https://www.concurrencylabs.com/blog/starburst-presto-vs-aws-redshift/ for a related comparison (note: this is not a comparison with Athena).
If it doesn't have to be that fast, you can use PrestoSQL on EMR (note that "PrestoSQL" and "Presto" components on EMR are not the same thing).
I am trying to write a simple Hive query:
select sum(case when pot_sls_q > 2* avg(pit_sls_q) then 1 else 0)/count(*) from prd_inv_fnd.item_pot_sls where dept_i=43 and class_i=3 where p_wk_end_d = 2014-06-28;
Here pit_sls_q and pot_sls_q are both columns in the Hive table, and I want the proportion of records that have pot_sls_q more than 2 times the average of pit_sls_q. However, I get an error:
FAILED: SemanticException [Error 10128]: Line 1:95 Not yet supported place for UDAF 'avg'
To fool around, I even tried using a window function:
select sum(case when pot_sls_q > 2* avg(pit_sls_q) over (partition by dept_i,class_i) then 1 else 0 end)/count(*) from prd_inv_fnd.item_pot_sls where dept_i=43 and class_i=3 and p_wk_end_d = '2014-06-28';
which should be fine, considering that filtering and partitioning the data on the same condition give essentially the "same" data, but even with this I get an error:
FAILED: SemanticException [Error 10002]: Line 1:36 Invalid column reference 'avg': (possible column names are: p_wk_end_d, dept_i, class_i, item_i, pit_sls_q, pot_sls_q)
Please suggest the right way of doing this.
You are using AVG inside SUM, which won't work (along with other syntax errors). Try the analytic AVG() OVER () in a subquery, like this:
select sum(case when pot_sls_q > 2 * avg_pit_sls_q then 1 else 0 end) / count(*)
from (
select t.*,
avg(pit_sls_q) over () avg_pit_sls_q
from prd_inv_fnd.item_pot_sls t
where dept_i = 43
and class_i = 3
and p_wk_end_d = '2014-06-28'
) t;
I have been working on the query below. Basically there are two tables, Realtime_Input and Realtime_Output. I joined the two tables, took the necessary columns, and made this a view, but when I query against the view I get duplicates.
What am I doing wrong? When I tested using the DISTINCT keyword I get 60 unique rows, but intermittently I get duplicates. My DB is a Postgres instance on Cloud Foundry. Is it because of that? Please help!
select i2.key_ts_long,
case
when i2.revenue_activepower = 'NA'
then (-1 * CAST(io.min5_forecast as real))
else (CAST(i2.revenue_activepower AS real) - CAST(io.min5_forecast as real))
end as diff
from realtime_analytic_input i2,
(select i.farm_id,
i.key_ts_long,
o.min5_forecast,
o.min5_timestamp_seconds
from realtime_analytic_input i,
realtime_analytic_output o
where i.farm_id = o.farm_id
and i.key_ts_long = o.key_ts_long
and o.farm_id = 'MW1'
) io
where i2.key_ts_long = CAST(io.min5_timestamp_seconds AS bigint)
and i2.farm_id = io.farm_id
and i2.farm_id = 'MW1'
and io.key_ts_long between 1464738953169 and 1466457841
order by io.key_ts_long desc
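A diagnostic sketch (table names taken from the question) that may help narrow this down: if a join key matches more than one row on the other side, the join multiplies rows, which would explain intermittent duplicates in the view.

```sql
-- Keys that appear more than once on the output side of the join
select farm_id, key_ts_long, count(*) as n
from realtime_analytic_output
where farm_id = 'MW1'
group by farm_id, key_ts_long
having count(*) > 1;
```

The same check against realtime_analytic_input (which the query joins twice) would show whether the input side is the source of the fan-out.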
Trying to do some calculations via SQL on my iSeries and have the following conundrum: I need to count the number of times a certain value appears in a column. My select statement is as follows:
Select
MOTRAN.ORDNO, MOTRAN.OPSEQ, MOROUT.WKCTR, MOTRAN.TDATE,
MOTRAN.LBTIM, MOROUT.SRLHU, MOROUT.RLHTD, MOROUT.ACODT,
MOROUT.SCODT, MOROUT.ASTDT, MOMAST.SSTDT, MOMAST.FITWH,
MOMAST.FITEM,
CONCAT(MOTRAN.ORDNO, MOTRAN.OPSEQ) As CON,
count (Concat(MOTRAN.ORDNO, MOTRAN.OPSEQ) )As CountIF,
MOROUT.SRLHU / (count (Concat(MOTRAN.ORDNO, MOTRAN.OPSEQ))) as calc
*(snip)*
With this information, I'm trying to count the number of times a value in CON appears. I'll need this for some math, so it's kinda important. My count statement doesn't work properly: it reports a certain value as occurring once when I can see it appears 8 times.
Try putting a CASE statement inside a SUM().
SUM(CASE WHEN value = 'something' THEN 1 ELSE 0 END)
This will count the number of rows where value = 'something'.
Similarly...
SUM(CASE WHEN t1.val = CONCAT(t2.val, t3.val) THEN 1 ELSE 0 END)
If you're on a supported version of the OS, i.e. 6.1 or higher...
You might be able to make use of "grouping set" functionality. Particularly the ROLLUP clause.
I can't say for sure without more understanding of your data.
Otherwise, you're going to need to do something like
with Cnt as (select ORDNO, OPSEQ, count(*) as NbrOccur
from MOTRAN
group by ORDNO, OPSEQ
)
Select
MOTRAN.ORDNO, MOTRAN.OPSEQ, MOROUT.WKCTR, MOTRAN.TDATE,
MOTRAN.LBTIM, MOROUT.SRLHU, MOROUT.RLHTD, MOROUT.ACODT,
MOROUT.SCODT, MOROUT.ASTDT, MOMAST.SSTDT, MOMAST.FITWH,
MOMAST.FITEM,
CONCAT(MOTRAN.ORDNO, MOTRAN.OPSEQ) As CON,
Cnt.NbrOccur,
MOROUT.SRLHU / Cnt.NbrOccur as calc
from
motran join Cnt on motran.ordno = cnt.ordno and motran.opseq = cnt.opseq
*(snip)*