Using a subquery in the SEQUENCE function in PrestoSQL

Use case -
I am trying to find the weekly frequency of a customer from a dataset. Now, not all customers have "events" happening in all of the weeks, and I would need to fill those weeks in with zero values for the "count" column.
I was trying to do this using the SEQUENCE function of PrestoSQL. However, this needs me to get the value of the max week from the customer's orders itself (I don't want to hardcode this, since the result will feed a BI tool and I don't want to update it manually every week):
with all_orders_2020 as (
    select customer, cast(date_parse(orderdate, '%Y-%m-%d') as date) as order_date
    from orders
    where orderdate > '2020-01-01' and customer in (select customer from some_customers)
),
orders_with_week_number as (
    select *, week(order_date) as week_number from all_orders_2020
),
weekly_count as (
    select customer, week_number, count(*) as ride_count
    from orders_with_week_number
    where customer = {{some_customer}}
    group by customer, week_number
)
SELECT
week_number
FROM
(VALUES
(SEQUENCE(1,(select max(week_number) from weekly_count)))
) AS t1(week_array)
CROSS JOIN
UNNEST(week_array) AS t2(week_number)
Presto complains about this, saying:
Unexpected subquery expression in logical plan: (SELECT "max"(week_number)
FROM
weekly_count
)
Any clues how this can be done?

Had a similar use case and followed the example from here: https://docs.aws.amazon.com/athena/latest/ug/flattening-arrays.html
Bring the SEQUENCE out and define the subquery using a WITH clause:
WITH dataset AS (
SELECT SEQUENCE(1, (SELECT MAX(week_number) FROM weekly_count)) AS week_array
)
SELECT week_number FROM dataset
CROSS JOIN UNNEST(week_array) as t(week_number)
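To get back to the original goal of zero-filling missing weeks, the generated week numbers can then be LEFT JOINed to the counts. A minimal, untested sketch reusing the weekly_count CTE from the question, with COALESCE supplying the zeros:
WITH weekly_count AS (
    -- ... as defined in the question ...
), dataset AS (
    SELECT SEQUENCE(1, (SELECT MAX(week_number) FROM weekly_count)) AS week_array
), all_weeks AS (
    SELECT week_number
    FROM dataset
    CROSS JOIN UNNEST(week_array) AS t(week_number)
)
SELECT a.week_number, COALESCE(c.ride_count, 0) AS ride_count
FROM all_weeks a
LEFT JOIN weekly_count c ON c.week_number = a.week_number
ORDER BY a.week_number;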

Related

How to reference fields from tables created in sub-queries of a large JOIN

I am writing a large query with many JOINs (shortened in the example here) and I am trying to reference values from other sub-queries, but I can't figure out how.
This is my example query:
DROP TABLE IF EXISTS breakdown;
CREATE TEMP TABLE breakdown AS
SELECT * FROM
(
SELECT COUNT(DISTINCT s_id) AS before, date_trunc('day', time) AS day FROM table_a
WHERE date_trunc('sec',earliest) < date_trunc('sec',time) GROUP BY day
)
JOIN
(
SELECT ROUND(before * 100.0 / total, 1) AS Percent_1, day
FROM breakdown
GROUP BY day
) USING (day)
JOIN
(
SELECT COUNT(DISTINCT s_id) AS equal, date_trunc('day', time) AS day FROM table_a
WHERE date_trunc('sec',earliest) = date_trunc('sec',time) GROUP BY day
) USING (day)
JOIN
(
SELECT COUNT(DISTINCT s_id) AS after, date_trunc('day', time) AS day FROM table_a
WHERE date_trunc('sec',earliest) > date_trunc('sec',time) GROUP BY day
) USING (day)
JOIN
(
SELECT COUNT(DISTINCT s_id) AS total, date_trunc('day', earliest) AS day
FROM first
GROUP BY 2
) USING (day)
ORDER BY day;
SELECT * FROM breakdown ORDER BY day;
The last query gives me the total, and for each of the previous subqueries I want to get the percentages as well.
I found the code for getting the percentage (second JOIN) but I don't know how to reference the values from the other tables.
E.g. for the percentage from the first query I want to use the COUNT of the first query, which I renamed before, and divide that by the COUNT of the last query, which I renamed total (if there is an easier solution, i.e. a way to get the percentage for each of the sub-queries, please let me know). But I can't seem to find how to reference them. I tried adding AS x to the end of each subquery and referencing by that (x.total), as well as referencing via the parent table (breakdown.total), but neither worked.
How can I do this without changing my table too much, as it is a long table with a lot of sub-queries?
This is what my table looks like; I would like to add a percentage for each column.
Using Redshift, BTW.
Thanks
I'm a little confused by all that is going on, as you drop table breakdown and then, in the second subquery of the create table, you reference breakdown. I suspect there are some issues in the provided sample of SQL. Please update if so.
For a number of these subqueries it looks like you are using a subquery where a case statement will do. In Redshift you don't want to scan the same table over and over if you can prevent it. For example, if we look at the 3rd and 4th subqueries, you can replace them with one query. Also, in these cases I like to use the DECODE() statement rather than CASE, since it is more readable in such simple cases.
(
SELECT COUNT(DISTINCT s_id) AS equal, date_trunc('day', time) AS day
FROM table_a
WHERE date_trunc('sec',earliest) = date_trunc('sec',time)
GROUP BY day
) USING (day)
JOIN
(
SELECT COUNT(DISTINCT s_id) AS after, date_trunc('day', time) AS day
FROM table_a
WHERE date_trunc('sec',earliest) > date_trunc('sec',time)
GROUP BY day
)
Becomes:
(
SELECT COUNT(DISTINCT DECODE(date_trunc('sec',earliest) = date_trunc('sec',time), true, s_id, NULL)) AS equal,
COUNT(DISTINCT DECODE(date_trunc('sec',earliest) > date_trunc('sec',time), true, s_id, NULL)) AS after,
date_trunc('day', time) AS day
FROM table_a
GROUP BY day
)
Read each table once (if at all possible) and calculate the desired results. Then you will have all your values in one layer of the query and can reference these new values. This will be faster (especially on Redshift).
=============================
Expanding based on comment made by poster.
It appears that using DECODE() and referencing derived columns in a single query can produce what you want. I don't have your data, so I cannot test this, but here is what I'd want to move to:
SELECT
    date_trunc('day', time) AS day,
    COUNT(DISTINCT DECODE(date_trunc('sec',earliest) < date_trunc('sec',time), true, s_id)) AS before,
    COUNT(DISTINCT DECODE(date_trunc('sec',earliest) = date_trunc('sec',time), true, s_id)) AS equal,
    COUNT(DISTINCT DECODE(date_trunc('sec',earliest) > date_trunc('sec',time), true, s_id)) AS after,
    COUNT(DISTINCT s_id) AS total,
    -- lateral alias references: before and total must be defined earlier in the list
    ROUND(before * 100.0 / total, 1) AS Percent_1
FROM table_a
GROUP BY date_trunc('day', time);
This should be a complete replacement for the SELECT currently inside your CREATE TEMP TABLE. However, I don't have sample data, so this is untested.
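Since the stated goal was a percentage for every count, the same lateral-alias pattern extends to the remaining columns. Another untested sketch along the same lines (column names assumed from the question):
SELECT
    date_trunc('day', time) AS day,
    COUNT(DISTINCT DECODE(date_trunc('sec',earliest) < date_trunc('sec',time), true, s_id)) AS before,
    COUNT(DISTINCT DECODE(date_trunc('sec',earliest) = date_trunc('sec',time), true, s_id)) AS equal,
    COUNT(DISTINCT DECODE(date_trunc('sec',earliest) > date_trunc('sec',time), true, s_id)) AS after,
    COUNT(DISTINCT s_id) AS total,
    -- each percentage references only aliases defined above it
    ROUND(before * 100.0 / total, 1) AS before_pct,
    ROUND(equal * 100.0 / total, 1) AS equal_pct,
    ROUND(after * 100.0 / total, 1) AS after_pct
FROM table_a
GROUP BY date_trunc('day', time);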

I am trying to use the AVG() function in a subquery after using a COUNT in the inner query, but I cannot seem to get it to work in SQL

My table name is CustomerDetails and it has the following columns:
customer_id, login_id, session_id, login_date
I am trying to write a query that calculates the average number of customers logging in per day.
I tried this:
select avg(session_id)
from CustomerDetails
where exists (select count(session_id) from CustomerDetails as 'no_of_entries')
But then I realized it was going straight to the column and just calculating the average of that column, and that's not what I want to do. Can someone help me?
Thanks
The first thing you need to do is get logins per day:
SELECT login_date, COUNT(*) AS loginsPerDay
FROM CustomerDetails
GROUP BY login_date
Then you can use that to get average logins per day:
SELECT AVG(loginsPerDay)
FROM (
SELECT login_date, COUNT(*) AS loginsPerDay
FROM CustomerDetails
GROUP BY login_date
) t  -- most engines require an alias on a derived table
If your login_date is a DATE type, you're all set. If it has a time component, then you'll need to truncate it to the date only (keeping the loginsPerDay alias so the outer AVG() can reference it):
SELECT AVG(loginsPerDay)
FROM (
SELECT CAST(login_date AS DATE) AS login_day, COUNT(*) AS loginsPerDay
FROM CustomerDetails
GROUP BY CAST(login_date AS DATE)
) t
I am trying to write a query that calculates the average number of customers logging in per day.
Count the number of customers. Divide by the number of days. I think that is:
select count(*) * 1.0 / count(distinct cast(login_date as date))
from customerdetails;
I understand that you want to count the number of visitors per day, not the number of visits. So if a customer logged in twice on the same day, you want to count him only once.
If so, you can use distinct and two levels of aggregation, like so:
select avg(cnt_visitors) avg_cnt_vistors_per_day
from (
select count(distinct customer_id) cnt_visitors
from customer_details
group by cast(login_date as date)
) t
The inner query computes the count of distinct customers for each day; the outer query gives you the overall average.
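A tiny hypothetical example makes the difference between these answers concrete. Suppose customer 1 logs in twice and customer 2 once on day 1, and customer 1 logs in once on day 2. Averaging COUNT(*) per day gives (3 + 1) / 2 = 2.0 logins per day, while averaging COUNT(DISTINCT customer_id) per day gives (2 + 1) / 2 = 1.5 customers per day, so pick the aggregate that matches the question you are actually asking.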

SQL count new values only with partition by - running count with no duplicates

Based on the table below, in Presto I need a column for all new 'rid' values. What I managed to do is the same as what I can achieve with partition by, but it's not exactly what I'm looking for (db<>fiddle demo).
The goal is to have many grouping counts, but I think this should describe the problem sufficiently.
I need the data truncated by days and a column counting new users every day, as shown in the example below. In simple words: if a value repeats, don't count it. I've tried to find a correlation between this and the relational division problem, but I'm just stuck.
You could use row_number() to rank the records of each rid by time; then you can aggregate and count only the top record per group.
select
    date_trunc('day', t.time) dy,
    count(*) rid_count,
    sum(case when t.rn = 1 then 1 else 0 end) new_rid_count
from (
    select
        t.*,
        row_number() over(partition by t.rid order by t.time) rn
    from mytable t
) t
group by date_trunc('day', t.time)
I think of this as two levels of aggregation. The inner one to get the earliest date. The outer to aggregate:
select first_day, count(*)
from (select rid, cast(date_trunc('day', min(time)) as date) as first_day
      from orders o
      group by rid
     ) r
group by 1

BigQuery Cross Join Failing

I'm trying to pull user activity by date. I am trying to build a table of every day since a user account was created, using a cross join and a where clause. In my case, the cross join cannot be avoided. The calendar table is just a list of all dates for the last 365 days (365 rows). The user table has ~1b rows.
Here is the query that fails with insufficient resources:
SELECT
u.user_id as user_id,
date(u.created) as signup_date,
cal.date as date,
from (select date(dt) as date from [dw.calendar] where date(dt) < CURRENT_DATE() ) cal
cross join each dw.user u
where
date(u.created) <= cal.date
Based on https://cloud.google.com/bigquery/query-reference, cross joins do not even support the "each" clause. How do I perform the above operation to successfully create a table?
You do not need to fill in "empty" days just to calculate a daily count and run a window function for the aggregated sum, so you don't even need the calendar table for this. To make this work you need to use RANGE rather than ROWS in your window frame. See the example below (for BigQuery Standard SQL):
#standardSQL
SELECT
user_id, created, daily_count,
SUM(daily_count) OVER(
PARTITION BY user_id ORDER BY created_unix_date DESC
RANGE BETWEEN CURRENT ROW AND 6 FOLLOWING
) weekly_avg
FROM `dw.user`, UNNEST([UNIX_DATE(created)]) AS created_unix_date
ORDER BY user_id, created DESC
I am not sure about the exact schema/types of your table, so the above might need adjusting accordingly, but in the meantime you can test/play with the dummy data below:
#standardSQL
WITH `dw.user` AS (
SELECT
day AS created,
CAST(1 + 10 * RAND() AS INT64) AS user_id,
CAST(100 * RAND() AS INT64) AS daily_count
FROM UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2017-04-26')) AS day
)
SELECT
user_id, created, daily_count,
SUM(daily_count) OVER(
PARTITION BY user_id ORDER BY created_unix_date DESC
RANGE BETWEEN CURRENT ROW AND 6 FOLLOWING
) weekly_avg
FROM `dw.user`, UNNEST([UNIX_DATE(created)]) AS created_unix_date
ORDER BY user_id, created DESC
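If the exploded user/day table itself is what you need, BigQuery Standard SQL can also build it without a calendar table at all, using GENERATE_DATE_ARRAY. A rough, untested sketch (it assumes created casts to DATE; with ~1b users the output can still be enormous, so resource limits may simply be inherent to the result size):
#standardSQL
SELECT
    u.user_id,
    DATE(u.created) AS signup_date,
    day AS date
FROM `dw.user` u
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(DATE(u.created), CURRENT_DATE())) AS day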

Combine two queries with monthly average

I need to put together the results of these two queries into a single return with the following structure:
"date", avg(selic."Taxa"), avg(titulos."puVenda")
Partial structure of tables:
selic
"dtFechamento" date,
"pTaxa" real
titulos
"dtTitulo" date,
"puVenda" real,
"nomeTitulo" character(30)
Query table selic:
select to_char("dtFechamento", 'YYYY-MM') as data, avg("pTaxa")
from "selic"
group by data
order by data
Query table titulos:
select to_char("dtTitulo", 'YYYY-MM') as data, avg("puVenda")
from "titulos"
where "nomeTitulo" = 'LFT010321'
group by data
order by data
I tried a subquery, but it returned the fields next to each other and I could not combine them.
select *
from (select to_char("dtFechamento", 'YYYY-MM') as data, avg("pTaxa")
from "selic"
group by data
order by data) as selic,
(select to_char("dtTitulo", 'YYYY-MM') as data, avg("puVenda")
from "titulos"
where "nomeTitulo" = 'LFT010321'
group by data
order by data) as LFT010321;
Assuming you want to return one row per month where either of your two queries returns a row, padding missing values from the other query with NULL:
Use a FULL [OUTER] JOIN:
SELECT to_char(mon, 'YYYY-MM') AS data, s.avg_taxa, t.avg_venda
FROM (
SELECT date_trunc('month', "dtFechamento") AS mon, avg("pTaxa") AS avg_taxa
FROM selic
GROUP BY 1
) s
FULL JOIN (
SELECT date_trunc('month', "dtTitulo") AS mon, avg("puVenda") AS avg_venda
FROM titulos
WHERE "nomeTitulo" = 'LFT010321'
GROUP BY 1
) t USING (mon)
ORDER BY mon;
It is substantially faster to join after aggregating than before (fewer join operations).
It is also faster to GROUP BY, JOIN and ORDER on timestamp values than on a text rendition. Typically also cleaner and less error prone (although text is unambiguous in this particular case). That's why I use date_trunc() instead of to_char() on lower levels.
If the format for the month is not important, you can just return the timestamp value. Else you can format any way you like after you are done processing.
Similar case with more explanation:
PostgreSQL merge two queries with COUNT and GROUP BY in each
This should get what you need. The inner "PQ" (PreQuery) does a union all between each possible date, but also adds a flag column to identify which average it was associated with. Each part is grouped by date, so the outer query will have AT MOST 2 records for a given date: one for Taxa, the other for Venda. So you don't need any full outer join, nor do you need to build some dynamic calendar basis to get the details for all possible dates.
So a given month can have only a Taxa average, OR a Venda average, OR BOTH.
SELECT
    PQ.data,
    SUM( CASE WHEN PQ.sumtype = 'T' THEN PQ.typeavg ELSE 0 END ) AS AvgTax,
    SUM( CASE WHEN PQ.sumtype = 'V' THEN PQ.typeavg ELSE 0 END ) AS AvgVenda
FROM
    ( SELECT
          to_char( "dtFechamento", 'YYYY-MM' ) AS data,
          'T' AS sumtype,
          avg( "pTaxa" ) AS typeavg
      FROM
          selic
      GROUP BY
          to_char( "dtFechamento", 'YYYY-MM' )
      UNION ALL
      SELECT
          to_char( "dtTitulo", 'YYYY-MM' ) AS data,
          'V' AS sumtype,
          avg( "puVenda" ) AS typeavg
      FROM
          titulos
      WHERE
          "nomeTitulo" = 'LFT010321'
      GROUP BY
          to_char( "dtTitulo", 'YYYY-MM' ) ) PQ
GROUP BY
    PQ.data
ORDER BY
    PQ.data;