SQL query to find most recent date using two tables

I have two tables in a database. One called Users and one called Days.
The Users table has a column called ID, which contains some numbers, and an ActiveUser column, which contains either a 0 (not active) or a 1 (active). The Days table has a User_ID column containing the same numbers as the ID column of the Users table, and a Day column containing values like "2015-01-01 00:00:00.000".
Users Table:
ID | ActiveUser
-------------------
10 | 0
11 | 1
12 | 1
13 | 0
Days Table:
User_ID | Day
------------------
10 | 2010-06-24 00:00:00.000
11 | 2011-07-05 00:00:00.000
12 | 2008-06-19 00:00:00.000
13 | 2010-06-20 00:00:00.000
10 | 2009-09-02 00:00:00.000
12 | 2010-08-15 00:00:00.000
11 | 2011-05-06 00:00:00.000
13 | 2012-04-25 00:00:00.000
I'm trying to create a query that finds the most recent Day listed for each Active user. So using the tables above, the query I'm trying to make should give me the following:
Day
------
2011-07-05 00:00:00.000
2010-08-15 00:00:00.000
Which corresponds to the two Active users with user ID's 11 and 12, both of which have two entries for Day, but the query picks the most recent date.
I'm new to SQL, and the closest I've got is below, which doesn't take into account the Users table (also, used https://stackoverflow.com/a/2411703/2480598 as a template):
select Days.Day
from Days
inner join (select User_ID, max(day) as MaxDay
from Days
group by User_ID
) tm on Days.User_ID = tm.User_ID and Days.Day = tm.MaxDay
This gives me the list of dates for both active and non-active users.

Select a.User_ID, max(a.day) as LatestDay from Days a
Inner Join Users b
on a.User_ID = b.ID and b.ActiveUser = 1
group by a.User_ID
The inner join limits it to active users only. You can remove a.User_ID from the select if you only want a list of days.
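As a quick sanity check, here is that approach run against the question's sample data, using an in-memory SQLite database as a stand-in (the SQL itself is standard):

```python
import sqlite3

# Sample data from the question, loaded into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (ID INTEGER, ActiveUser INTEGER);
    CREATE TABLE Days  (User_ID INTEGER, Day TEXT);
    INSERT INTO Users VALUES (10, 0), (11, 1), (12, 1), (13, 0);
    INSERT INTO Days VALUES
        (10, '2010-06-24'), (11, '2011-07-05'), (12, '2008-06-19'),
        (13, '2010-06-20'), (10, '2009-09-02'), (12, '2010-08-15'),
        (11, '2011-05-06'), (13, '2012-04-25');
""")

# The inner join restricts rows to active users; MAX picks each user's
# most recent day (ISO date strings sort chronologically).
rows = conn.execute("""
    SELECT a.User_ID, MAX(a.Day) AS LatestDay
    FROM Days a
    INNER JOIN Users b ON a.User_ID = b.ID AND b.ActiveUser = 1
    GROUP BY a.User_ID
    ORDER BY a.User_ID
""").fetchall()
print(rows)  # [(11, '2011-07-05'), (12, '2010-08-15')]
```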

This is a classic example of where APPLY can be used:
select * from users u
outer apply(select top 1 day from days where userid = u.id order by day desc)a
where u.ActiveUser = 1
If you want only the day column (though I think that makes little sense on its own):
select max(d.day) as day
from users u
join days d on d.userid = u.id
where u.ActiveUser = 1
group by u.id
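APPLY is SQL Server syntax; on engines without it, the same per-user top-1 can be written as a correlated subquery. A sketch against the sample data, again using SQLite as a stand-in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, ActiveUser INTEGER);
    CREATE TABLE days  (userid INTEGER, day TEXT);
    INSERT INTO users VALUES (10, 0), (11, 1), (12, 1), (13, 0);
    INSERT INTO days VALUES
        (10, '2010-06-24'), (11, '2011-07-05'), (12, '2008-06-19'),
        (13, '2010-06-20'), (10, '2009-09-02'), (12, '2010-08-15'),
        (11, '2011-05-06'), (13, '2012-04-25');
""")

# Correlated subquery: for each active user, pick their latest day,
# mirroring what OUTER APPLY (... TOP 1 ... ORDER BY day DESC) does.
rows = conn.execute("""
    SELECT u.id,
           (SELECT d.day FROM days d
            WHERE d.userid = u.id
            ORDER BY d.day DESC LIMIT 1) AS day
    FROM users u
    WHERE u.ActiveUser = 1
    ORDER BY u.id
""").fetchall()
print(rows)  # [(11, '2011-07-05'), (12, '2010-08-15')]
```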

You need to join to the users table to get the active users. Here is a simple way:
select u.id, max(d.day) as MaxDay
from Days d join
Users u
on d.user_id = u.id
where u.ActiveUser = 1
group by u.id;

PostgreSQL - How to get month/year even if there are no records within that date?

What I'm trying to do in this case is get the "most future" (latest) record from a Bills table, and then all the records from the 13 months prior to that last record. What I've tried is something like this:
SELECT
users.name,
EXTRACT(month from priority_date) as month,
EXTRACT(year from priority_date) as year,
SUM("money_balance") as "money_balance"
FROM bills
JOIN users on users.id = bills.user_id
WHERE priority_date >= ( SELECT
DATE_TRUNC('month', MAX(bills.priority_date))
FROM bills
INNER JOIN users ON bills.property_id = users.id
WHERE users.company_id = 15
AND users.active = true
AND bills.paid = false ) - interval '13 month'
AND priority_date <= ( SELECT
MAX(bills.priority_date)
FROM bills
INNER JOIN users ON bills.property_id = users.id
WHERE users.company_id = 15
AND users.active = true
AND bills.paid = false )
AND users.company_id = 15
AND bills.paid = false
AND users.active = true
GROUP BY 1,2,3
ORDER BY year, month
So, for instance, let's say the most future date for a created bill is December 2022; this query will give me the info from November 2021 to December 2022.
The data will give me something like:
name     | month | year | money_balance
---------|-------|------|--------------
Joshua.. |  11   | 2021 | 300
Joshua.. |   1   | 2022 | 111
Mark..   |   1   | 2022 | 200
...      |  ...  | ...  | ...
John     |  12   | 2022 | 399
In the case of Joshua, because he had no bills to pay in December 2021, it doesn't return anything for that month/year.
Is it possible to return the months/year where there are no records for that month, for each user?
Something like:
name          | month | year | money_balance
--------------|-------|------|--------------
Joshua..      |  11   | 2021 | 300
Joshua..      |  12   | 2021 | 0
Joshua..      |   1   | 2022 | 111
(other users) |  ...  | ...  | ...
Thank you so much!
We can use a CTE to create the list of months, using the maximum and minimum dates from bill, and then cross join it onto users to get a line for all users for all months. We then left join onto bills to populate the last column.
The problem with this approach is that we can end up with a lot of rows with no value.
create table bills(user_id int,priority_date date, money_balance int);
create table users(id int, name varchar(25));
insert into users values(1,'Joshua'),(2,'Mark'),(3,'John');
insert into bills values(1,'2021-11-01',300),(1,'2022-01-01',111),(2,'2022-01-01',200),(3,'2021-12-01',399);
;with months as
(SELECT to_char(generate_series(min(priority_date), max(priority_date), '1 month'), 'Mon-YY') AS "Mon-YY"
from bills)
SELECT
u.name,
"Mon-YY",
--EXTRACT(month from "Mon-YY") as month,
--EXTRACT(year from "Mon-YY") as year,
SUM("money_balance") as "money_balance"
FROM months m
CROSS JOIN users u
LEFT JOIN bills b
ON u.id = b.user_id
AND to_char(priority_date,'Mon-YY') = m."Mon-YY"
GROUP BY
u.name,
"Mon-YY"
ORDER BY "Mon-YY", u.name
name | Mon-YY | money_balance
:----- | :----- | ------------:
John | Dec-21 | 399
Joshua | Dec-21 | null
Mark | Dec-21 | null
John | Jan-22 | null
Joshua | Jan-22 | 111
Mark | Jan-22 | 200
John | Nov-21 | null
Joshua | Nov-21 | 300
Mark | Nov-21 | null
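generate_series() and to_char() are PostgreSQL-specific, but the month-spine idea ports elsewhere. Here is a rough SQLite sketch of the same shape, using a recursive CTE for the months and COALESCE to show 0 instead of null (table and sample data from the answer's setup):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT);
    CREATE TABLE bills (user_id INTEGER, priority_date TEXT, money_balance INTEGER);
    INSERT INTO users VALUES (1, 'Joshua'), (2, 'Mark'), (3, 'John');
    INSERT INTO bills VALUES
        (1, '2021-11-01', 300), (1, '2022-01-01', 111),
        (2, '2022-01-01', 200), (3, '2021-12-01', 399);
""")

# Recursive CTE plays the role of generate_series(): one row per month
# between the earliest and latest bill dates.
rows = conn.execute("""
    WITH RECURSIVE months(mon) AS (
        SELECT (SELECT MIN(priority_date) FROM bills)
        UNION ALL
        SELECT date(mon, '+1 month') FROM months
        WHERE mon < (SELECT MAX(priority_date) FROM bills)
    )
    SELECT u.name,
           strftime('%Y-%m', m.mon) AS ym,
           COALESCE(SUM(b.money_balance), 0) AS money_balance
    FROM months m
    CROSS JOIN users u
    LEFT JOIN bills b
           ON b.user_id = u.id
          AND strftime('%Y-%m', b.priority_date) = strftime('%Y-%m', m.mon)
    GROUP BY u.name, ym
    ORDER BY ym, u.name
""").fetchall()

print(rows)
# 3 users x 3 months = 9 rows; ('Joshua', '2021-12', 0) appears because
# Joshua had no bill in December 2021.
```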

How to obtain information from 10 dates without using 10+ left joins

I have some information as shown in the simplified table below.
login_date | userid
-------------------------
2020-12-01 | 123
2020-12-01 | 456
2020-12-02 | 123
2020-12-02 | 456
2020-12-02 | 789
2020-12-03 | 123
2020-12-03 | 789
The range of dates found in login_date span from 2020-12-01 to 2020-12-12 and the userid for each day is unique.
What I wish to obtain comes in 2 folds:
The number of users who first logged in on a certain date, excluding users who logged in on preceding day(s).
For users who first logged in on a certain date (e.g. 2020-12-01), how many of them logged in on subsequent days as well? (i.e. of the batch who first logged in on 2020-12-01, how many were found to log in on 2020-12-02, 2020-12-03.. and so on)
For the above table, an example of the desired result may be as follows:
                                 | 2020-12-01 | 2020-12-02 | 2020-12-03 | ...   (users' first login date)
-----------------------------------------------------------------------------
users who continued | 2020-12-01 |     2      |     x      |     x
to log in on these  | 2020-12-02 |     2      |     1      |     x
dates               | 2020-12-03 |     1      |     1      |     0
                    | ...        |
Reasoning:
On the first day, two new users logged in, 123 and 456.
On the second day, the same old users, 123 and 456, logged in as well. In addition, a new user (logging in for the first time), 789, was added.
On the third day, only one of the original old users, 123 logged in. (count of 1). The new user (from the second day), 789, logged in as well. (count of 1)
My attempt
I actually managed to obtain a (rough) solution in two parts. For the first day, 2020-12-01, I simply filtered users who logged in on the first day and performed left joins for all the remaining dates:
select count(d1.userid) as d1_users, count(d2.userid) as d2_users, ... (repeated for all joined tables)
from table1 d1
left join (
select userid
from table1
where login_date = date('2020-12-02')
) d2
on d1.userid = d2.userid
... -- (10 more left joins, with each filtering by an incremented date value)
where d1.login_date = date('2020-12-01')
For dates following the second day onwards, I did a bit of preprocessing to exclude users who had logged in on preceding day(s):
with d2_users as (
select userid
from table1 a
left join (
select userid
from table1
where login_date = date('2020-12-01')
) b
on a.userid = b.userid
where b.userid is null -- filtering out users who logged in on preceding day(s)
and a.login_date = date('2020-12-02')
)
select count(d2.userid) as d2_users, ... -- (repeated for all joined tables)
from d2_users d2
left join (
select userid
from table1
where login_date = date('2020-12-03')
) d3
on d2.userid = d3.userid
... -- (similar to the query for the 2020-12-01)
In the process of writing and executing this query it took a lot of manual editing (deleting of unnecessary left joins for later dates and count), and ultimately the entire query for just two days takes up 300+ lines of SQL code. I am not sure whether there is a more efficient process for this.
Any advice would be greatly appreciated! I would be happy to provide further clarification if needed as well since the optimization of the solution to this problem has been bugging me for some time.
I apologize for the poor formatting of the desired result, as I currently only have a representation of it in a spreadsheet and not an idea of how it may look like as a SQL output.
Edit:
I realized I may not have communicated the ideal outcomes properly. For each min_login_date identified, what I wish to obtain is the number of users who continue to log in from a preceding date. An example would be:
10 users log in on 2020-12-01. Hence, the count for 2020-12-01 = 10.
Of the 10 previous users, 8 users log in on 2020-12-02. Hence the count for 2020-12-02 = 8.
Of the 8 users (from the previous day), 6 users log in on 2020-12-03. Hence the count for 2020-12-03 = 6.
As such for each min_login_date, the user count for subsequent dates should be <= that of the user count for previous dates. Hope this helps! I apologize for any miscommunication.
You can use window functions to get each user's earliest login date, and then aggregate:
select min_login_date, count(*) as num_on_day,
sum(case when login_date = '2020-12-01' then 1 else 0 end) as login_20201201,
sum(case when login_date = '2020-12-02' then 1 else 0 end) as login_20201202,
. . .
from (select t.*,
min(login_date) over (partition by user_id) as min_login_date
from t
) t
group by min_login_date
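Here is that approach run against the question's sample rows in SQLite (3.25+ for window functions), with one conditional sum per calendar day and a hypothetical table name `logins`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logins (login_date TEXT, userid INTEGER);
    INSERT INTO logins VALUES
        ('2020-12-01', 123), ('2020-12-01', 456),
        ('2020-12-02', 123), ('2020-12-02', 456), ('2020-12-02', 789),
        ('2020-12-03', 123), ('2020-12-03', 789);
""")

# MIN(...) OVER (PARTITION BY userid) tags every row with that user's
# first login date; grouping by it yields one row per cohort.
rows = conn.execute("""
    SELECT min_login_date,
           SUM(CASE WHEN login_date = '2020-12-01' THEN 1 ELSE 0 END) AS d01,
           SUM(CASE WHEN login_date = '2020-12-02' THEN 1 ELSE 0 END) AS d02,
           SUM(CASE WHEN login_date = '2020-12-03' THEN 1 ELSE 0 END) AS d03
    FROM (SELECT t.*,
                 MIN(login_date) OVER (PARTITION BY userid) AS min_login_date
          FROM logins t) t
    GROUP BY min_login_date
    ORDER BY min_login_date
""").fetchall()
print(rows)  # [('2020-12-01', 2, 2, 1), ('2020-12-02', 0, 1, 1)]
```

Each output row is one cohort (users sharing a first login date), matching the columns of the desired matrix in the question.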
I think you need a tweak using an analytic function and an aggregate function, as follows:
select login_date,
Count(case when min_login_date = '2020-12-01' then 1 end) as login_20201201,
Count(case when min_login_date = '2020-12-02' then 1 end) as login_20201202,
......
from (select t.*,
min(login_date) over (partition by user_id) as min_login_date,
lag(login_date) over (partition by user_id order by login_date) as lag_login_date
from your_table t
where t.login_date between '2020-12-01' and '2020-12-12'
) t
where (lag_login_date = login_date - interval '1 day' or lag_login_date is null)
group by login_date

SQL Joining One Table to a Selection of Rows from Second Table that Contains a Max Value per Group

I have a table of Cases with info like the following -
ID | CaseName | Date       | Occupation
---|----------|------------|-----------
11 | John     | 2020-01-01 | Joiner
12 | Mark     | 2019-10-10 | Mechanic
And a table of Financial information like the following -
ID | CaseID | Date       | Value
---|--------|------------|------
1  | 11     | 2020-01-01 | 1,000
2  | 11     | 2020-02-03 | 2,000
3  | 12     | 2019-10-10 | 3,000
4  | 12     | 2019-12-25 | 4,000
What I need to produce is a list of Cases including details of the most recent Financial value, for example -
ID | CaseName | Occupation | Latest Value
---|----------|------------|-------------
11 | John     | Joiner     | 2,000
12 | Mark     | Mechanic   | 4,000
Now I can join my tables easy enough with -
SELECT *
FROM Cases AS c
LEFT JOIN Financial AS f ON f.CaseID = c.ID
And I can find the most recent date per case from the financial table with -
SELECT CaseID, MAX(Date) AS LastDate
FROM Financial
GROUP BY CaseID
But I am struggling to find a way to bring these two together to produce the required results as per the table set out above.
A simple method is window functions:
SELECT *
FROM Cases c LEFT JOIN
(SELECT f.*, MAX(date) OVER (PARTITION BY CaseId) as max_date
FROM Financial f
) f
ON f.CaseID = c.ID AND f.max_date = f.date;
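Here is that query run against the sample tables in SQLite (3.25+ for window functions), selecting just the columns the desired output asks for:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Cases (ID INTEGER, CaseName TEXT, Date TEXT, Occupation TEXT);
    CREATE TABLE Financial (ID INTEGER, CaseID INTEGER, Date TEXT, Value INTEGER);
    INSERT INTO Cases VALUES
        (11, 'John', '2020-01-01', 'Joiner'),
        (12, 'Mark', '2019-10-10', 'Mechanic');
    INSERT INTO Financial VALUES
        (1, 11, '2020-01-01', 1000), (2, 11, '2020-02-03', 2000),
        (3, 12, '2019-10-10', 3000), (4, 12, '2019-12-25', 4000);
""")

# The window function tags every Financial row with its case's latest
# date; the join condition then keeps only the row matching that date.
rows = conn.execute("""
    SELECT c.ID, c.CaseName, c.Occupation, f.Value AS LatestValue
    FROM Cases c
    LEFT JOIN (SELECT f.*, MAX(Date) OVER (PARTITION BY CaseID) AS max_date
               FROM Financial f) f
      ON f.CaseID = c.ID AND f.max_date = f.Date
    ORDER BY c.ID
""").fetchall()
print(rows)  # [(11, 'John', 'Joiner', 2000), (12, 'Mark', 'Mechanic', 4000)]
```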

How to get the count of distinct values until a time period Impala/SQL?

I have a raw table recording customer ids coming to a store over a particular time period. Using Impala, I would like to calculate the number of distinct customer IDs coming to the store until each day. (e.g., on day 3, 5 distinct customers visited so far)
Here is a simple example of the raw table I have:
Day ID
1 1234
1 5631
1 1234
2 1234
2 4456
2 5631
3 3482
3 3452
3 1234
3 5631
3 1234
Here is what I would like to get:
Day Count(distinct ID) until that day
1 2
2 3
3 5
Is there way to easily do this in a single query?
I'm not 100% sure whether this will work on Impala, but it does if you have a days table, or some way to create a derived table on the fly in Impala.
CREATE TABLE days ("DayC" int);
INSERT INTO days
("DayC")
VALUES (1), (2), (3);
OR
CREATE TABLE days AS
SELECT DISTINCT "Day"
FROM sales
You can use this query (demo tested in PostgreSQL):
SELECT "DayC", COUNT(DISTINCT "ID")
FROM sales
cross JOIN days
WHERE "Day" <= "DayC"
GROUP BY "DayC"
OUTPUT
| DayC | count |
|------|-------|
| 1 | 2 |
| 2 | 3 |
| 3 | 5 |
UPDATE VERSION
SELECT T."DayC", COUNT(DISTINCT "ID")
FROM sales
cross JOIN (SELECT DISTINCT "Day" as "DayC" FROM sales) T
WHERE "Day" <= T."DayC"
GROUP BY T."DayC"
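The cross-join version can be checked against the sample data; in SQLite it looks like this (the derived table replaces the separate days table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (Day INTEGER, ID INTEGER);
    INSERT INTO sales VALUES
        (1, 1234), (1, 5631), (1, 1234),
        (2, 1234), (2, 4456), (2, 5631),
        (3, 3482), (3, 3452), (3, 1234), (3, 5631), (3, 1234);
""")

# For each candidate day DayC, count distinct IDs over all rows with
# Day <= DayC -- i.e. a running distinct count.
rows = conn.execute("""
    SELECT T.DayC, COUNT(DISTINCT s.ID)
    FROM sales s
    CROSS JOIN (SELECT DISTINCT Day AS DayC FROM sales) T
    WHERE s.Day <= T.DayC
    GROUP BY T.DayC
    ORDER BY T.DayC
""").fetchall()
print(rows)  # [(1, 2), (2, 3), (3, 5)]
```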
You could also try a plain group by, but note that this counts distinct IDs within each day, not cumulatively up to that day:
select day, count(distinct id) from yourtable group by day

How to create a pivot table by product by month in SQL

I have 3 tables:
users (id, account_balance)
grocery (user_id, date, amount_paid)
fishmarket (user_id, date, amount_paid)
Both fishmarket and grocery tables may have multiple occurrences for the same user_id with different dates and amounts paid or have nothing at all for any given user. I am trying to develop a pivot table of the following structure:
id | grocery_amount_paid_January | fishmarket_amount_paid_January
1 10 NULL
2 40 71
The only idea I can come up with is to create multiple left joins, but this seems wrong, since there would be 24 joins (one per month) for each product. Is there a better way?
I have provided a lot of answers on crosstab queries in PostgreSQL lately. Sometimes a "plain" query like the following does the job:
WITH x AS (SELECT '2012-01-01'::date AS _from
,'2012-12-01'::date As _to) -- provide date range once in CTE
SELECT u.id
,to_char(m.mon, 'MM.YYYY') AS month_year
,g.amount_paid AS grocery_amount_paid
,f.amount_paid AS fishmarket_amount_paid
FROM users u
CROSS JOIN (SELECT generate_series(_from, _to, '1 month') AS mon FROM x) m
LEFT JOIN (
SELECT user_id
,date_trunc('month', date) AS mon
,sum(amount_paid) AS amount_paid
FROM x, grocery -- CROSS JOIN with a single row
WHERE date >= _from
AND date < (_to + interval '1 month')
GROUP BY 1,2
) g ON g.user_id = u.id AND m.mon = g.mon
LEFT JOIN (
SELECT user_id
,date_trunc('month', date) AS mon
,sum(amount_paid) AS amount_paid
FROM x, fishmarket
WHERE date >= _from
AND date < (_to + interval '1 month')
GROUP BY 1,2
) f ON f.user_id = u.id AND m.mon = f.mon
ORDER BY u.id, m.mon;
produces this output:
id | month_year | grocery_amount_paid | fishmarket_amount_paid
---+------------+---------------------+------------------------
1 | 01.2012 | 10 | NULL
1 | 02.2012 | NULL | 65
1 | 03.2012 | 98 | 13
...
2 | 01.2012 | 40 | 71
2 | 02.2012 | NULL | NULL
Major points
The first CTE is for convenience only. So you have to type your date range once only. You can use any date range - as long as it's dates with the first of the month (rest of the month will be included!). You could add date_trunc() to it, but I guess you can keep the urge to use invalid dates in check.
First CROSS JOIN users to the result of generate_series() (m) which provides one row per month in your date range. You have learned in your last question how that results in multiple rows per user.
The two subqueries are identical twins. Use WHERE clauses that operate on the base column, so it can utilize an index - which you should have if your table runs over many years (no use for only one or two years, a sequential scan will be faster):
CREATE INDEX grocery_date ON grocery (date);
Then reduce all dates to the first of the month with date_trunc() and sum amount_paid per user_id and the resulting mon.
LEFT JOIN the result to the base table, again by user_id and the resulting mon. This way, rows are neither multiplied nor dropped. You get one row per user_id and month. Voilà.
BTW, I'd never use a column name id. Call it user_id in the table users as well.
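The same shape (a month spine cross-joined to users, with two pre-aggregated subqueries LEFT JOINed back on) can be sketched in SQLite with made-up sample rows; a recursive CTE stands in for generate_series() and strftime() for to_char():

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER);
    CREATE TABLE grocery (user_id INTEGER, date TEXT, amount_paid INTEGER);
    CREATE TABLE fishmarket (user_id INTEGER, date TEXT, amount_paid INTEGER);
    INSERT INTO users VALUES (1), (2);
    INSERT INTO grocery VALUES (1, '2012-01-05', 10), (2, '2012-01-10', 40);
    INSERT INTO fishmarket VALUES (2, '2012-01-12', 71);
""")

rows = conn.execute("""
    WITH RECURSIVE m(mon) AS (          -- month spine: Jan and Feb 2012
        SELECT '2012-01-01'
        UNION ALL
        SELECT date(mon, '+1 month') FROM m WHERE mon < '2012-02-01'
    )
    SELECT u.id, strftime('%m.%Y', m.mon) AS month_year,
           g.amount_paid AS grocery_amount_paid,
           f.amount_paid AS fishmarket_amount_paid
    FROM users u
    CROSS JOIN m
    LEFT JOIN (SELECT user_id, strftime('%Y-%m', date) AS ym,
                      SUM(amount_paid) AS amount_paid
               FROM grocery GROUP BY 1, 2) g
      ON g.user_id = u.id AND g.ym = strftime('%Y-%m', m.mon)
    LEFT JOIN (SELECT user_id, strftime('%Y-%m', date) AS ym,
                      SUM(amount_paid) AS amount_paid
               FROM fishmarket GROUP BY 1, 2) f
      ON f.user_id = u.id AND f.ym = strftime('%Y-%m', m.mon)
    ORDER BY u.id, m.mon
""").fetchall()
print(rows)
# [(1, '01.2012', 10, None), (1, '02.2012', None, None),
#  (2, '01.2012', 40, 71), (2, '02.2012', None, None)]
```

One row per user per month, with NULL (None) where a user had no purchases, exactly as in the answer's output.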