How to join two tables based on a calculated field? - sql

I have two SQL queries that produce the same kind of output, with the same grouping and ordering:
select date_trunc('month', inserted_at)::date as date, count(id) from payment_logs where payment_logs.event_name = 'subscription_created' group by date order by date desc;
select date_trunc('month', inserted_at)::date as date, count(id) from users group by date order by date desc;
I would like to join those two results on the calculated date field (which is the month), and get a result with 3 columns: date, count_users and count_payment_logs.
How can I achieve that? Thanks.

Something like this:
select plog.date as odate, usr.cntusr, plog.cntlog
from (
    select date_trunc('month', inserted_at)::date as date, count(id) cntlog
    from payment_logs
    where payment_logs.event_name = 'subscription_created'
    group by date order by date desc
) plog
join (
    select date_trunc('month', inserted_at)::date as date, count(id) cntusr
    from users
    group by date
) usr on plog.date = usr.date
order by odate desc

Nothing wrong with the accepted answer, but I wanted to show an alternative and add some color. Instead of subqueries, you can use common table expressions (CTEs), which improve readability and offer some other benefits as well. Here is an example using CTEs:
with payments as (
    select
        date_trunc('month', inserted_at)::date as date,
        count(id) as payment_count
    from payment_logs
    where event_name = 'subscription_created'
    group by date
),
users as (
    select
        date_trunc('month', inserted_at)::date as date,
        count(id) as user_count
    from users
    group by date
)
select
    p.date, p.payment_count, u.user_count
from payments p
join users u on p.date = u.date
order by p.date desc
In my opinion the abstraction is neater and makes the code much easier to follow (and thus maintain).
Other notes:
The inner order by is expensive and unnecessary: the sort in each subquery/CTE is clobbered by whatever ordering you apply in the main query, so just omit it there. The results will not differ, and the query will be more efficient.
In this example, you probably don't have any missing months, but it's possible, especially if you extend this pattern to future queries. In that case, consider a full outer join instead of an inner join, since months that appear in users may be missing from payments, or vice versa:
select
    coalesce(p.date, u.date) as date,
    p.payment_count, u.user_count
from payments p
full outer join users u on p.date = u.date
order by 1 desc
Another benefit of CTEs over subqueries is that you can reuse them. In this example, I want the same full outer join behavior, but with one additional twist: I also have monthly data from another table that I want in the query. The "payments" and "users" CTEs can be referenced as many times as needed; here I use them in the all_dates CTE and again in the main query. By creating "all_dates" I can use left joins and avoid the coalescing in the select list (not wrong, just ugly).
with payments as (
    -- same as above
),
users as (
    -- same as above
),
all_dates as (
    select date from payments -- payments CTE referenced here
    union
    select date from users
)
select
    a.date, ac.days_in_month, p.payment_count, u.user_count
from all_dates a
join accounting_calendar ac on a.date = ac.accounting_month
left join payments p on a.date = p.date -- and referenced here again, same CTE
left join users u on a.date = u.date
order by a.date desc
The point is you can reuse the CTEs.
A final advantage is that you can declare a CTE as materialized or not materialized (the default). A materialized CTE is essentially pre-computed and its results stored, which in certain cases gives better performance. A non-materialized one, on the other hand, behaves like a standard subquery, which is nice because the planner can push where clause conditions from the outer query down into it.
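As a sketch of that syntax (PostgreSQL 12+; the cutoff date is just an arbitrary example):
with payments as materialized (
    -- forces this CTE to be computed once and its result stored
    select date_trunc('month', inserted_at)::date as date,
           count(id) as payment_count
    from payment_logs
    where event_name = 'subscription_created'
    group by date
)
select * from payments
where date >= date '2020-01-01';
-- With "as not materialized" (or by default), the planner may inline
-- the CTE and push the outer date filter down into it.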

Users that played in X different dates - SQL Standard + BigQuery

I have the following schema of a data model (I only have the schema, not the tables) on BigQuery, using standard SQL.
I have created this query to select the top 10 users that generated the most revenue in the last three months on the Love game:
SELECT
users.user_id,
SUM(pay.amount) AS total_rev
FROM
`my-database.User` AS users
INNER JOIN
`my-database.IAP_events` AS pay
ON
users.User_id = pay.User_id
INNER JOIN
`my-database.Games` AS games
ON
users.Game_id = games.Game_id
WHERE
games.game_name = "Love"
GROUP BY
users.user_id
ORDER BY
total_rev DESC
LIMIT
10
But then, the exercise says to only consider users that played during 10 different days in the last 3 months. I understand I would use a subquery with a count in the dates but I am a little lost on how to do it...
Thanks a lot!
EDIT: You need to count distinct dates, not transactions, so in the qualify clause you'll need to state COUNT(DISTINCT date_) OVER ... instead of COUNT(transaction_id) OVER .... Fixed the code already.
As far as I understood, you need to count the distinct transaction_id values inside IAP_events over a window of the previous 3 months, check that the count is greater than 10, and then sum the amounts for all the users that satisfy that constraint.
To do so, you can use BigQuery's analytic functions, aka window functions:
with window_counting as (
    select
        user_id,
        amount
    from iap_events
    where
        date_ >= date_sub(current_date(), interval 3 month)
    qualify
        count(distinct date_) over (partition by user_id) > 10
),
final as (
    select
        user_id,
        sum(amount)
    from window_counting
    group by 1
    order by 2 desc
    limit 10
)
select * from final
You will just need to add the needed joins inside the first CTE in order to filter by game_name :)
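For example, the joins folded into the first CTE might look something like this sketch (table and column names taken from the question):
with window_counting as (
    select
        pay.User_id as user_id,
        pay.amount
    from `my-database.IAP_events` as pay
    inner join `my-database.User` as users on pay.User_id = users.User_id
    inner join `my-database.Games` as games on users.Game_id = games.Game_id
    where
        games.game_name = "Love"
        and pay.date_ >= date_sub(current_date(), interval 3 month)
    qualify
        count(distinct pay.date_) over (partition by pay.User_id) > 10
)
-- the "final" CTE and outer select stay the same as above
select user_id, sum(amount) as total_rev
from window_counting
group by 1
order by 2 desc
limit 10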

What is the most efficient way to find the first and last entry of an entity in SQL?

I was asked this question in an interview. A table, trips, contains the following columns: customer_id, start_from, end_at, start_at_time, end_at_time. Each trip is stored as a separate row. How would you find the list of all the customers who started yesterday from point A and ended yesterday at point P?
I provided a solution using window functions that identified the list of all customers who started their day at A, then inner joined it with the list of customers who ended their day at P (using the same window functions).
The solution I gave was this:
SELECT a.customer_id
FROM
(SELECT a.customer_id
FROM
(SELECT customer_id,
start_from,
row_number() OVER (PARTITION BY customer_id
ORDER BY start_at_time ASC) AS rnk
FROM trips
WHERE to_date(start_at_time)= date_sub(CURRENT_DATE, 1) ) AS a
WHERE a.rnk=1
AND a.start_from='A' ) AS a
INNER JOIN
(SELECT a.customer_id
FROM
(SELECT customer_id,
end_at,
row_number() OVER (PARTITION BY customer_id
ORDER BY end_at_time DESC) AS rnk
FROM trips
WHERE to_date(end_at_time)= date_sub(CURRENT_DATE, 1) ) AS a
WHERE a.rnk=1
AND a.end_at='P' ) AS b ON a.customer_id=b.customer_id
My interviewer said my solution was correct but that there is a more efficient way to solve this problem. I've been searching for a more efficient way but have not found one so far. Can you suggest one?
I might use first_value() for this:
select t.customer_id
from (select t.*,
first_value(start_from) over (partition by customer_id order by start_at_time) as first_start,
first_value(end_at) over (partition by customer_id order by start_at_time desc) as last_end
from t
where start_at_time >= date_sub(CURRENT_DATE, 1) and
start_at_time < CURRENT_DATE
) t
where first_start = start_from and -- just some filtering so select distinct is not needed
first_start = 'A' and
last_end = 'P';
I should add that many databases support an equivalent function for aggregation, and I would use that instead.
This assumes that starts are not repeated. To be safe, you can add select distinct, but there is a performance hit for that.
A generalized version of what I would probably have done:
SELECT fandl.a
FROM (
SELECT a, MIN(start) AS t0, MAX(start) AS tN
FROM someTable
WHERE start >= DATE_SUB(CURRENT_DATE, 1) AND start < CURRENT_DATE
GROUP BY a
) AS fandl
INNER JOIN someTable AS st0 ON fandl.a = st0.a AND fandl.t0 = st0.start
INNER JOIN someTable AS stN ON fandl.a = stN.a AND fandl.tN = stN.start
WHERE st0.b1 = 'A' AND stN.b2 = 'P'
;
I used the same date function you did, since you did not specify the SQL dialect.
Note that, in many RDBMS, if there is an (a, start) index, the subquery and joins can be done with the index alone; actual table access would only be required for the final WHERE evaluation.
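For instance, with the hypothetical table and column names above, such an index could be created like this:
-- Composite index covering both the grouping column and the timestamp,
-- so MIN(start)/MAX(start) per "a" can be answered from the index alone.
CREATE INDEX idx_sometable_a_start ON someTable (a, start);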

Get records with the newest date in Oracle

I need to find the email address of the last person that performed an action on a post. The database structure is a bit complicated for several reasons that are not important here.
SELECT u.address
FROM text t
JOIN post p ON (p.pid=t.pid)
JOIN node n ON (n.nid=p.nid)
JOIN user u ON (t.login=u.login)
WHERE n.nid='123456'
AND p.created IN (
SELECT max(p.created)
FROM text t
JOIN post p ON (p.pid=t.pid)
JOIN node n ON (n.nid=p.nid)
WHERE n.nid='123456');
I would like to know if there is a way to use the max function, or some other way to get the latest date, without having to make a subquery (which is almost the same as the main query).
Thank you very much
You can use a window function (aka "analytical" function) to calculate the max date.
Then you can select all rows where the created date equals the max. date.
select address
from (
SELECT u.address,
p.created,
max(p.created) over () as max_date
FROM text t
JOIN post p ON (p.pid=t.pid)
JOIN node n ON (n.nid=p.nid)
JOIN user u ON (t.login=u.login)
WHERE n.nid='123456'
) t
where created = max_date;
The over() clause is empty because you didn't use a GROUP BY in your question. But if you need e.g. the max date per address, you could use
max(p.created) over (partition by u.address) as max_date
The partition by works like a group by.
You can also extend that query to work for more than one n.nid. In that case you have to include it in the partition:
max(p.created) over (partition by n.nid, ....) as max_date
Btw: if n.nid is a numeric column you should not compare it to a string literal. '123456' is a string, 123456 is a number.
SELECT address
FROM (
SELECT u.address,
row_number() OVER (PARTITION BY n.nid ORDER BY p.created DESC) AS rn
FROM text t JOIN post p ON (p.pid=t.pid)
JOIN node n ON (n.nid=p.nid)
JOIN user u ON (t.login=u.login)
WHERE n.nid='123456'
)
WHERE rn = 1;
The ROW_NUMBER function numbers the rows in descending order of p.created with PARTITION BY n.nid making separate partitions for row numbers of separate n.nids.

max records with dense rank

Is there a better alternative to using max to get the max records?
I have been playing with dense_rank and partition by in the query below,
but I am getting undesired results and poor performance.
select Tdate = (Select max(Date)
from Industries
where Industries.id = i.id
and Industries.Date <= '22 June 2011')
from #ii_t i
Many Thanks.
The supplied query doesn't use the DENSE_RANK windowing function. Not being familiar with your data structure, I believe your query is attempting to find the largest value of Date for each Industry id, yes? Rewriting the above query to use a ranking function, I would write it as a common table expression:
;
WITH RANKED AS
(
SELECT
II.*
-- RANK would serve just as well in this scenario
, DENSE_RANK() OVER (PARTITION BY II.id ORDER BY II.Date desc) AS most_recent
FROM Industries II
WHERE
II.Date <= '22 June 2011'
)
, MOST_RECENT AS
(
-- This query restricts it to the most recent row by id
SELECT
R.*
FROM
RANKED R
WHERE
R.most_recent = 1
)
SELECT
*
FROM
MOST_RECENT MR
INNER JOIN
#ii_t i
ON i.id = MR.id
Also, to address the question of performance, you might need to look at how Industries is structured. There may not be an index on that table, and if there is, it might not cover the Date (descending) and id fields. To improve the efficiency of the above query, don't pull back everything in the RANKED section; I did that because I was not sure which fields you need, but obviously the less you pull back, the more efficiently the engine can retrieve the data.
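As a sketch, a covering index along these lines (the name is illustrative) would support both the PARTITION BY id / ORDER BY Date DESC ranking and the date filter:
CREATE INDEX IX_Industries_id_Date
    ON Industries (id, Date DESC);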
Try this (untested) code and see if it does what you want. By the looks of it, it should return the same results, and hopefully a bit faster.
select Tdate = max(Industries.Date)
from #ii_t i
left outer join Industries
on Industries.id = i.id and
Industries.Date <= '22 June 2011'
group by i.id

How to make a SQL query for last transaction of every account?

Say I have a table "transactions" that has columns "acct_id" "trans_date" and "trans_type" and I want to filter this table so that I have just the last transaction for each account. Clearly I could do something like
SELECT acct_id, max(trans_date) as trans_date
FROM transactions GROUP BY acct_id;
but then I lose my trans_type. I could make a second SQL call with my list of dates and account ids to get my trans_type back, but that feels very kludgy, since it means either sending data back and forth to the SQL server or creating a temporary table.
Is there a way to do this with a single query, ideally a generic method that would work with MySQL, Postgres, SQL Server, and Oracle?
This is an example of a greatest-n-per-group query. This question comes up several times per week on StackOverflow. In addition to the subquery solutions given by other folks, here's my preferred solution, which uses no subquery, GROUP BY, or CTE:
SELECT t1.*
FROM transactions t1
LEFT OUTER JOIN transactions t2
ON (t1.acct_id = t2.acct_id AND t1.trans_date < t2.trans_date)
WHERE t2.acct_id IS NULL;
In other words, return a row such that no other row exists with the same acct_id and a greater trans_date.
This solution assumes that trans_date is unique for a given account, otherwise ties may occur and the query will return all tied rows. But this is true for all the solutions given by other folks too.
I prefer this solution because I most often work on MySQL, which doesn't optimize GROUP BY very well. So this outer join solution usually proves to be better for performance.
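The same "no later row exists" condition can equivalently be written with NOT EXISTS, which is also portable across MySQL, Postgres, SQL Server, and Oracle:
SELECT t1.*
FROM transactions t1
WHERE NOT EXISTS (
    -- no other row for this account has a later date
    SELECT 1
    FROM transactions t2
    WHERE t2.acct_id = t1.acct_id
      AND t2.trans_date > t1.trans_date
);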
This works on SQL Server...
SELECT acct_id, trans_date, trans_type
FROM transactions a
WHERE trans_date = (
SELECT MAX( trans_date )
FROM transactions b
WHERE a.acct_id = b.acct_id
)
Try this
WITH
LastTransaction AS
(
SELECT acct_id, max(trans_date) as trans_date
FROM transactions
GROUP BY acct_id
),
AllTransactions AS
(
SELECT acct_id, trans_date, trans_type
FROM transactions
)
SELECT *
FROM AllTransactions
INNER JOIN LastTransaction
ON AllTransactions.acct_id = LastTransaction.acct_id
AND AllTransactions.trans_date = LastTransaction.trans_date
select t.acct_id, t.trans_type, tm.trans_date
from transactions t
inner join (
    select acct_id, max(trans_date) as trans_date
    from transactions
    group by acct_id
) tm on t.acct_id = tm.acct_id and t.trans_date = tm.trans_date