How to join 2 subqueries - sql

I'm trying to join 2 subqueries within the same table with this query:
SELECT COUNT(phone) as users, DATE_TRUNC('month', somedate) as date_month from
(SELECT phone, MIN (created_at) as somedate
FROM analytics.orders
where status = 'done'
GROUP BY phone) as s1
GROUP BY date_month
INNER JOIN
(SELECT value, cohort FROM
(SELECT SUM (amount) as value, DATE_TRUNC('month', created_at) as cohort
FROM analytics.orders
where status = 'done'
GROUP BY cohort, (SELECT SUM (amount) from analytics.orders )
ORDER BY cohort) as s2) as s3
ON s1.date_month=s3.cohort
But I am getting this error:
syntax error at or near "INNER" LINE 7: INNER JOIN ^
I guess that something is wrong with the inner naming, but I can't understand what exactly is wrong.

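I don't think you actually need to join or union them.
The syntax error comes from placing INNER JOIN after GROUP BY: a join is part of the FROM clause, so it cannot follow GROUP BY.
Both subqueries read the same table with the same filter, so you can combine them into one query instead.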
SELECT
DATE_TRUNC('month', created_at) as date_month,
COUNT(DISTINCT phone) as unique_phones,
SUM(amount) as total_amount
FROM analytics.orders
WHERE status = 'done'
GROUP BY DATE_TRUNC('month', created_at)
ORDER BY 1;
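If the first-order month matters (users counted in the month of their first 'done' order, next to that month's revenue), the two subqueries can still be joined, as long as each one is a complete derived table inside FROM. Here is a minimal sketch run through Python's sqlite3 so it is self-contained. The sample rows are made up, the table is called orders rather than analytics.orders, and strftime('%Y-%m-01', ...) stands in for PostgreSQL's DATE_TRUNC('month', ...).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (phone TEXT, amount REAL, status TEXT, created_at TEXT);
INSERT INTO orders VALUES
  ('111', 10.0, 'done',   '2021-01-05'),
  ('111', 20.0, 'done',   '2021-01-20'),
  ('222', 30.0, 'done',   '2021-01-25'),
  ('222', 40.0, 'done',   '2021-02-03'),
  ('444',  5.0, 'done',   '2021-02-15'),
  ('333', 99.0, 'failed', '2021-02-10');
""")

# Each side of the join is a finished subquery (with its own GROUP BY);
# the INNER JOIN sits in the outer FROM clause, not after a GROUP BY.
rows = conn.execute("""
SELECT s3.cohort, s1.users, s3.value
FROM (
    SELECT first_month AS date_month, COUNT(*) AS users
    FROM (
        SELECT phone, strftime('%Y-%m-01', MIN(created_at)) AS first_month
        FROM orders
        WHERE status = 'done'
        GROUP BY phone
    ) AS firsts
    GROUP BY first_month
) AS s1
INNER JOIN (
    SELECT strftime('%Y-%m-01', created_at) AS cohort, SUM(amount) AS value
    FROM orders
    WHERE status = 'done'
    GROUP BY cohort
) AS s3 ON s1.date_month = s3.cohort
ORDER BY s3.cohort
""").fetchall()

for row in rows:
    print(row)
```

Note that this counts each phone once, in its first month, which is a different number from COUNT(DISTINCT phone) per month in the combined query.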

Related

Month over Month percent change in user registrations

I am trying to write a query to find month over month percent change in user registration.
Users table has the logs for user registrations
user_id - pk, integer
created_at - account created date, varchar
activated_at - account activated date, varchar
state - active or pending, varchar
I found the number of users for each year and month. How do I find month over month percent change in user registration? I think I need a window function?
SELECT
EXTRACT(month from created_at::timestamp) as created_month
,EXTRACT(year from created_at::timestamp) as created_year
,count(distinct user_id) as number_of_registration
FROM users
GROUP BY 1,2
ORDER BY 1,2
This is the output of above query:
Then I wrote this to find the difference in user registration in the previous year.
SELECT
*
,number_of_registration - lag(number_of_registration) over (partition by created_month) as difference_in_previous_year
FROM (
SELECT
EXTRACT(month from created_at::timestamp) as created_month
,EXTRACT(year from created_at::timestamp) as created_year
,count( user_id) as number_of_registration
FROM users as u
GROUP BY 1,2
ORDER BY 1,2) as temp
The output is this:
You want an order by clause that contains created_year.
number_of_registration
- lag(number_of_registration) over (partition by created_month order by created_year) as difference_in_previous_year
Note that you don't actually need a subquery for this. You can do:
select
extract(year from created_at) as created_year,
extract(month from created_at) as created_month,
count(*) as number_of_registration,
count(*) - lag(count(*)) over(partition by extract(month from created_at) order by extract(year from created_at)) as difference_in_previous_year
from users as u
group by created_year, created_month
order by created_year, created_month
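I used count(*) instead of count(user_id), because I assume that user_id is not nullable (in which case count(*) is equivalent, and more efficient). Casting to a timestamp is superfluous only if created_at is already stored as a date or timestamp; with the varchar column described in the question, keep the ::timestamp cast.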
These queries work as long as you have data for every month. If you have gaps, then the problem should be addressed differently - but this is not the question you asked here.
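To see what the order by inside the window changes, here is a small sketch using Python's sqlite3 (window functions need SQLite 3.25 or later). The rows are invented, created_at is stored as ISO text, and strftime replaces extract, so treat it as an illustration of the partition/order idea rather than the exact PostgreSQL query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, created_at TEXT);
INSERT INTO users (created_at) VALUES
  ('2013-01-10'), ('2013-01-20'),
  ('2013-02-05'),
  ('2014-01-03'), ('2014-01-15'), ('2014-01-30'),
  ('2014-02-11'), ('2014-02-12'), ('2014-02-13');
""")

# Partition by month, order by year: LAG then returns the same month
# of the previous year instead of an arbitrary row of the partition.
rows = conn.execute("""
WITH monthly AS (
    SELECT CAST(strftime('%Y', created_at) AS INTEGER) AS created_year,
           CAST(strftime('%m', created_at) AS INTEGER) AS created_month,
           COUNT(*) AS number_of_registration
    FROM users
    GROUP BY created_year, created_month
)
SELECT created_year,
       created_month,
       number_of_registration,
       number_of_registration - LAG(number_of_registration) OVER (
           PARTITION BY created_month ORDER BY created_year
       ) AS difference_in_previous_year
FROM monthly
ORDER BY created_year, created_month
""").fetchall()

for row in rows:
    print(row)
```

The 2013 rows have no earlier year in their partition, so their difference is NULL (None in Python).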
I can get the registrations from each year as two tables and join them. But it is not that effective
SELECT
t1.created_year as year_2013
,t2.created_year as year_2014
,t1.created_month as month_of_year
,t1.number_of_registration_2013
,t2.number_of_registration_2014
,100.0 * (t2.number_of_registration_2014 - t1.number_of_registration_2013) / t1.number_of_registration_2013 as percent_change_in_previous_year_month
FROM
(select
extract(year from created_at) as created_year
,extract(month from created_at) as created_month
,count(*) as number_of_registration_2013
from users
where extract(year from created_at) = '2013'
group by 1,2) t1
inner join
(select
extract(year from created_at) as created_year
,extract(month from created_at) as created_month
,count(*) as number_of_registration_2014
from users
where extract(year from created_at) = '2014'
group by 1,2) t2
on t1.created_month = t2.created_month
First off, why are you using strings to hold date/time values? Your first step should be to define created_at and activated_at as proper timestamps. The query below assumes this correction. If you do not correct it, cast the string to a timestamp in the CTE that generates the date range. But keep in mind that if you leave it as text you will at some point get a conversion exception.
To calculate the month-over-month change, use the formula "100*(Nt - Nl)/Nl", where Nt is the number of users this month and Nl is the number of users last month. There are 2 potential issues:
There are gaps in the data.
Nl is 0 (which would raise a divide-by-zero exception).
The following handles both by first generating every month from the earliest date to the latest date, then outer joining the monthly counts to the generated months. When Nl = 0 the query returns NULL, indicating that the percent change could not be calculated.
with full_range(the_month) as
(select generate_series(low_month, high_month, interval '1 month')
from (select min(date_trunc('month',created_at)) low_month
, max(date_trunc('month',created_at)) high_month
from users
) m
)
select to_char(the_month,'yyyy-mm')
, users_this_month
, case when users_last_month = 0
then null::float
else round((100.00*(users_this_month-users_last_month)/users_last_month),2)
end percent_change
from (
select the_month, users_this_month , lag(users_this_month) over(order by the_month) users_last_month
from ( select f.the_month, count(u.created_at) users_this_month
from full_range f
left join users u on date_trunc('month',u.created_at) = f.the_month
group by f.the_month
) mc
) pc
order by the_month;
NOTE: There are several places where the above can be shortened. But the longer form is intentional, to show how the final values are derived.
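For anyone who wants to try the gap-filling idea without a PostgreSQL instance, here is a rough equivalent in Python's sqlite3 (SQLite 3.25 or later for LAG). SQLite has no generate_series by default, so a recursive CTE builds the month list instead; the sample data, with an empty March, is made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, created_at TEXT);
INSERT INTO users (created_at) VALUES
  ('2020-01-05'), ('2020-01-20'),
  ('2020-02-10'), ('2020-02-11'), ('2020-02-12'),
  ('2020-04-01');
""")

rows = conn.execute("""
WITH RECURSIVE full_range(the_month) AS (
    -- first month with data, then one month at a time up to the last one
    SELECT MIN(date(created_at, 'start of month')) FROM users
    UNION ALL
    SELECT date(the_month, '+1 month')
    FROM full_range
    WHERE the_month < (SELECT MAX(date(created_at, 'start of month')) FROM users)
),
mc AS (
    -- outer join so that months without registrations get a count of 0
    SELECT f.the_month, COUNT(u.created_at) AS users_this_month
    FROM full_range f
    LEFT JOIN users u ON date(u.created_at, 'start of month') = f.the_month
    GROUP BY f.the_month
),
pc AS (
    SELECT the_month,
           users_this_month,
           LAG(users_this_month) OVER (ORDER BY the_month) AS users_last_month
    FROM mc
)
SELECT strftime('%Y-%m', the_month) AS month,
       users_this_month,
       CASE WHEN users_last_month = 0 THEN NULL
            ELSE ROUND(100.0 * (users_this_month - users_last_month) / users_last_month, 2)
       END AS percent_change
FROM pc
ORDER BY the_month
""").fetchall()

for row in rows:
    print(row)
```

January has no previous month and April follows an empty March, so both come back with a NULL percent change.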

How to pull a list of all visitor_ids that generated more than $500 combined in their first two sessions in the month of January 2020?

Tables:
Sessions
session_ts
visitor_id
vertical
session_id
Transactions
session_ts
session_id
rev_bucket
revenue
Currently have the following query (using SQLite):
SELECT
s.visitor_id,
sub.session_id,
month,
year,
total_rev,
CASE
WHEN (row_num IN (1,2) >= total_rev >= 500) THEN 'Yes'
ELSE 'No' END AS High_Value_Transactions,
sub.row_num
FROM
sessions s
JOIN
(
SELECT
s.visitor_id,
t.session_id,
strftime('%m',t.session_ts) as month,
strftime('%Y',t.session_ts) as year,
SUM(t.revenue) as total_rev,
row_number() OVER(PARTITION BY s.visitor_id ORDER BY s.session_ts) as row_num
FROM
Transactions t
JOIN
sessions s
ON
s.session_id = t.session_id
WHERE strftime('%m',t.session_ts) = '01'
AND strftime('%Y',t.session_ts) = '2020'
GROUP BY 1,2
) sub
ON
s.session_id = sub.session_id
WHERE sub.row_num IN (1,2)
ORDER BY 1
I'm having trouble identifying the first two sessions that combine for $500.
Open to any feedback and simplifying of query. Thanks!
You can use window functions and aggregation:
select visitor_id, sum(t.revenue) total_revenue
from (
select
s.visitor_id,
t.revenue,
row_number() over(partition by s.visitor_id order by t.session_ts) rn
from transactions t
inner join sessions s on s.session_id = t.session_id
where t.session_ts >= '2020-01-01' and t.session_ts < '2020-02-01'
) t
where rn <= 2
group by visitor_id
having sum(t.revenue) >= 500
The subquery joins the two tables, filters on the target month (note that using half-open interval predicates is more efficient than applying date functions on the date column), and ranks each row within groups of visits of the same customer.
Then, the outer query filters on the first two visits per visitor, aggregates by visitor, computes the corresponding revenue, and filters it with a having clause.
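One thing to check against your data: if a session can contain more than one transaction, ranking transaction rows is not the same as ranking sessions. A possible variant, sketched here with Python's sqlite3 (SQLite 3.25 or later) and invented rows, sums revenue per session first and only then numbers the sessions. It assumes "first two sessions" means the first two January sessions that have at least one transaction, since the inner join drops sessions without revenue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (session_ts TEXT, visitor_id INTEGER, vertical TEXT, session_id INTEGER);
CREATE TABLE transactions (session_ts TEXT, session_id INTEGER, rev_bucket TEXT, revenue REAL);
INSERT INTO sessions VALUES
  ('2020-01-02 10:00:00', 1, 'a', 101),
  ('2020-01-05 10:00:00', 1, 'a', 102),
  ('2020-01-09 10:00:00', 1, 'a', 103),
  ('2020-01-03 09:00:00', 2, 'b', 201),
  ('2020-01-04 09:00:00', 2, 'b', 202);
INSERT INTO transactions VALUES
  ('2020-01-02 10:00:00', 101, 'x', 200.0),
  ('2020-01-02 10:00:00', 101, 'x', 100.0),
  ('2020-01-05 10:00:00', 102, 'x', 250.0),
  ('2020-01-09 10:00:00', 103, 'x', 900.0),
  ('2020-01-03 09:00:00', 201, 'x', 100.0),
  ('2020-01-04 09:00:00', 202, 'x', 300.0);
""")

rows = conn.execute("""
WITH per_session AS (
    -- one row per session with its total revenue, January 2020 only
    SELECT s.visitor_id, s.session_id, s.session_ts, SUM(t.revenue) AS session_rev
    FROM sessions s
    JOIN transactions t ON t.session_id = s.session_id
    WHERE s.session_ts >= '2020-01-01' AND s.session_ts < '2020-02-01'
    GROUP BY s.visitor_id, s.session_id, s.session_ts
),
ranked AS (
    SELECT visitor_id,
           session_rev,
           ROW_NUMBER() OVER (PARTITION BY visitor_id ORDER BY session_ts) AS rn
    FROM per_session
)
SELECT visitor_id, SUM(session_rev) AS total_revenue
FROM ranked
WHERE rn <= 2
GROUP BY visitor_id
HAVING SUM(session_rev) >= 500
ORDER BY visitor_id
""").fetchall()

print(rows)
```

Visitor 1 qualifies with 300 + 250 from the first two sessions; the 900 in the third session is ignored, and visitor 2 stays below the threshold.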

CASE AND WHEN SQL

I have transactional data on customers' purchases. I tried to select customer_id values from the last month and calculate recency as the average number of days between a customer's purchases (AVG(gap)).
SELECT
customer_id,
(
CASE WHEN day::DATE<= '2015-05-01'::DATE AND day::DATE > '2015-05-01'::DATE - INTERVAL '1 month'
THEN
(
SELECT
AVG(gap)
FROM
(
SELECT
customer_id,
( day- LAG(day) OVER ( PARTITION BY customer_id ORDER BY day ) ) AS gap
FROM
baskets
JOIN
basket_lines
USING
( basket_id )
GROUP BY 1
) a
) b
ELSE 0
) AS A
FROM
baskets
JOIN
basket_lines
USING
(basket_id)
GROUP BY
1;
However, I get an error like this:
ERROR: syntax error at or near "b"
LINE 45: GROUP BY 1)a)b ELSE 0) AS A
^
Does it mean I can not use subquery after THEN statement?
A subquery in the THEN clause does not take an alias. Also, you must end your CASE expression with END:
SELECT
customer_id,
(CASE WHEN day::DATE<= '2015-05-01'::DATE AND
day::DATE > '2015-05-01'::DATE - INTERVAL '1 month'
THEN
(SELECT AVG(gap) FROM (
SELECT customer_id,
(day- LAG(day) OVER (PARTITION BY customer_id ORDER BY day)) as gap
FROM baskets
JOIN basket_lines
USING (basket_id)
GROUP BY 1) a) ELSE 0 END) AS A
FROM baskets
JOIN basket_lines
USING (basket_id)
GROUP BY 1;
Note that the subquery in your SELECT list does not reference the outer query's customer_id, so it returns the same overall average for every customer. This is probably not what you want, and we can likely rewrite your query without it.
I propose the following refactor:
WITH cte AS (
SELECT
customer_id,
(day- LAG(day) OVER (PARTITION BY customer_id ORDER BY day)) as gap
FROM baskets
INNER JOIN basket_lines
USING (basket_id)
WHERE day::DATE<= '2015-05-01'::DATE AND
day::DATE > '2015-05-01'::DATE - INTERVAL '1 month'
)
SELECT
customer_id,
AVG(gap) AS cust_avg
FROM cte
GROUP BY
customer_id;
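A quick way to sanity-check the refactor is to run the same shape of query on a few rows. The sketch below uses Python's sqlite3 (SQLite 3.25 or later), with julianday differences in place of PostgreSQL date subtraction. It assumes one row per basket and leaves out the join to basket_lines, because several lines per basket would repeat the same day and add zero-length gaps; the data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE baskets (basket_id INTEGER PRIMARY KEY, customer_id INTEGER, day TEXT);
INSERT INTO baskets VALUES
  (1, 1, '2015-04-02'), (2, 1, '2015-04-06'), (3, 1, '2015-04-12'),
  (4, 2, '2015-04-10'), (5, 2, '2015-04-20'),
  (6, 3, '2015-03-01');
""")

rows = conn.execute("""
WITH cte AS (
    -- gap in days between a purchase and the customer's previous purchase
    SELECT customer_id,
           julianday(day) - julianday(
               LAG(day) OVER (PARTITION BY customer_id ORDER BY day)
           ) AS gap
    FROM baskets
    WHERE day <= '2015-05-01'
      AND day > date('2015-05-01', '-1 month')
)
SELECT customer_id, AVG(gap) AS cust_avg
FROM cte
GROUP BY customer_id
ORDER BY customer_id
""").fetchall()

print(rows)
```

Each customer's first purchase in the window has a NULL gap, which AVG ignores; customer 3 only bought outside the window and does not appear.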

Postgresql - Window Function Aggregate

I'm attempting to find the number of new users per month by product type. However, I continue to receive an error requesting cnt to be used in an aggregate function.
SELECT EXTRACT(MONTH FROM date) AS month
FROM (SELECT users.date,
COUNT(*) OVER(PARTITION BY product_type) AS cnt FROM users) AS u
GROUP BY month
ORDER BY cnt DESC;
That seems like a very strange construct. Here is a method that doesn't use window functions:
select date_trunc('month', date) as yyyymm, product_type, count(*)
from (select distinct on (u.userid) u.*
from users u
order by u.userid, u.date
) u
group by date_trunc('month', date), product_type
order by yyyymm, product_type;
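DISTINCT ON is PostgreSQL-specific. As a rough illustration of what the subquery does (keep only each user's earliest row, then count those rows per month and product type), here is a sketch in Python's sqlite3 that uses ROW_NUMBER instead (SQLite 3.25 or later). The column names userid, date and product_type follow the question, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (userid INTEGER, date TEXT, product_type TEXT);
INSERT INTO users VALUES
  (1, '2021-01-05', 'basic'),
  (1, '2021-02-07', 'pro'),
  (2, '2021-01-09', 'basic'),
  (3, '2021-02-01', 'pro');
""")

rows = conn.execute("""
SELECT strftime('%Y-%m', date) AS yyyymm,
       product_type,
       COUNT(*) AS new_users
FROM (
    -- rn = 1 marks the earliest row of each user
    SELECT u.*,
           ROW_NUMBER() OVER (PARTITION BY userid ORDER BY date) AS rn
    FROM users u
) AS firsts
WHERE rn = 1
GROUP BY yyyymm, product_type
ORDER BY yyyymm, product_type
""").fetchall()

print(rows)
```

User 1 shows up only under January and 'basic', because the later 'pro' row is not that user's first one.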

Netezza not supporting sub query and similar... any workaround?

I'm sure this will be a very simple question for most of you, but it is driving me crazy...
I have a table like this (simplifying):
| customer_id | date | purchase amount |
I need to extract, for each day, the number of customers that made a purchase that day, and the number of customers that made at least a purchase in the 30 days previous to the current one.
I tried using a subquery like this:
select purch_date as date, count (distinct customer_id) as DAU,
count(distinct (select customer_id from table where purch_date<= date and purch_date>date-30)) as MAU
from table
group by purch_date
Netezza returns an error saying that subqueries are not supported, and that I should think to rewrite the query. But how?!?!?
I tried using case when statement, but did not work. In fact, the following:
select purch_date as date, count (distinct customer_id) as DAU,
count(distinct case when (purch_date<= date and purch_date>date-30) then player_id else null end) as MAU
from table
group by purch_date
returned no errors, but the MAU and DAU columns are the same (which is wrong).
Can anybody help me, please? thanks a lot
I don't believe Netezza supports subqueries in the SELECT list. Move them to the FROM clause:
select purch_date as date, count(distinct customer_id) as DAU
from table
group by purch_date
select purch_date as date, count (distinct customer_ID) as MAU
from table
where purch_date<= date and purch_date>date-30
group by purch_date
I hope that's right for MAU and DAU. Join them to get the combined results:
select a.date, a.dau, b.mau
from
(select purch_date as date, count(distinct customer_id) as DAU
from table
group by purch_date) a
left join
(select purch_date as date, count (distinct customer_ID) as MAU
from table
where purch_date<= date and purch_date>date-30
group by purch_date) b
on b.date = a.date
I got it finally :) For all interested, here is the way I solved it:
select a.date_dt, max(a.dau), count(distinct b.player_id)
from (select dt.cal_day_dt as date_dt,
count(distinct s.player_id) as dau
FROM IA_PLAYER_SALES_HOURLY s
join IA_DATES dt on dt.date_key = s.date_key
group by dt.cal_day_dt
order by dt.cal_day_dt
) a
join (
select dt.cal_day_dt as date_dt,
s.player_id as player_id
FROM IA_PLAYER_SALES_HOURLY s
join IA_DATES dt on dt.date_key = s.date_key
order by dt.cal_day_dt
) b on b.date_dt <= a.date_dt and b.date_dt > a.date_dt - 30
group by a.date_dt
order by a.date_dt;
Hope this is helpful.
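The key step is the range join: every day is paired with all purchases from the preceding 30 days, and the distinct customers are counted per day. Below is a stripped-down sketch of that idea on the simplified table from the question, run through Python's sqlite3 rather than Netezza; the table name purchases and the rows are assumptions for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (customer_id INTEGER, purch_date TEXT, amount REAL);
INSERT INTO purchases VALUES
  (1, '2020-01-01', 5.0),
  (2, '2020-01-10', 5.0),
  (1, '2020-01-10', 5.0),
  (3, '2020-02-05', 5.0);
""")

rows = conn.execute("""
SELECT a.purch_date,
       a.dau,
       COUNT(DISTINCT b.customer_id) AS mau
FROM (
    SELECT purch_date, COUNT(DISTINCT customer_id) AS dau
    FROM purchases
    GROUP BY purch_date
) a
JOIN purchases b
  -- b covers the 30 days ending on a's date
  ON b.purch_date <= a.purch_date
 AND b.purch_date > date(a.purch_date, '-30 days')
GROUP BY a.purch_date, a.dau
ORDER BY a.purch_date
""").fetchall()

for row in rows:
    print(row)
```

Only days with at least one purchase appear, because the day list itself comes from the purchases table; a calendar table like IA_DATES above would be needed to report days without any sales.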