JOIN by closer value to key - sql

With the following sample data:
WITH values AS (
SELECT
1 AS shard,
2008 AS year,
1 AS value
UNION ALL
SELECT
1 AS shard,
20012 AS year,
2 AS value
UNION ALL
SELECT
2 AS shard,
2011 AS year,
3 AS value
UNION ALL
SELECT
2 AS shard,
1998 AS year,
4 AS value
UNION ALL
SELECT
2 AS shard,
2001 AS year,
5 AS value
UNION ALL
SELECT
4 AS shard,
1990 AS year,
6 AS value
ORDER BY year
),
data AS (
SELECT
1 AS id,
1 AS shard,
2010 AS year
UNION ALL
SELECT
1 AS id,
2 AS shard,
2000 AS year
UNION ALL
SELECT
1 AS id,
3 AS shard,
1990 AS year
UNION ALL
SELECT
2 AS id,
1 AS shard,
2010 AS year
UNION ALL
SELECT
2 AS id,
2 AS shard,
2000 AS year
UNION ALL
SELECT
2 AS id,
3 AS shard,
1990 AS year
)
I want to join my data collection with the values stored in the values collection. Each row in data has an id that identifies a process, so I want to perform the JOIN for each id. The JOIN key is composite: the shard and year fields. For each entry in data, I want to retrieve the value from the values row whose shard matches and whose year is closest.
I have come up with the following piece of code, but it does not work as expected: it ignores the values.shard field and matches every year regardless of the shard it belongs to.
SELECT *
FROM (
SELECT
data.id,
data.year,
values.year AS closer_year,
ABS(data.year - values.year) AS diff,
values.value,
ROW_NUMBER() OVER (PARTITION BY data.id, data.shard ORDER BY ABS(data.year - values.year)) AS rn
FROM data, values
)
WHERE rn = 1
For the sample data, the expected output should be:
id year closer_year diff value rn
1 2010 2008 2 1 1
1 2000 2001 1 5 1
1 1990 null null null 1
2 2010 2008 2 1 1
2 2000 2001 1 5 1
2 1990 null null null 1
What am I missing?

I found what I was missing just after posting the question. I will answer it in case anyone has a similar use case.
When rereading the text, I noticed that the "match the shard" condition I was missing is simply a LEFT JOIN on shard, so rewriting the query like this solved the problem:
SELECT *
FROM (
SELECT
data.id,
data.year,
values.year AS closer_year,
ABS(data.year - values.year) AS diff,
values.value,
ROW_NUMBER() OVER (PARTITION BY data.id, data.shard ORDER BY ABS(data.year - values.year)) AS rn
FROM data
LEFT JOIN values
ON data.shard = values.shard
)
WHERE rn = 1
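One caveat with the query above: when two candidate years are equally close, ROW_NUMBER() is free to pick either one. The following is a minimal sketch of a deterministic tie-break, not part of the original answer. It assumes a dialect that accepts CTE column lists with VALUES (PostgreSQL and SQLite do). The CTE name vals and the 2012 row are illustrative assumptions, chosen so that a tie actually occurs.

```sql
-- Hypothetical data: 2008 and 2012 are both 2 years away from 2010.
WITH vals (shard, year, value) AS (
  VALUES (1, 2008, 1), (1, 2012, 2)
),
data (id, shard, year) AS (
  VALUES (1, 1, 2010), (1, 3, 1990)
)
SELECT id, shard, year, closer_year, diff, value
FROM (
  SELECT
    data.id,
    data.shard,
    data.year,
    vals.year AS closer_year,
    ABS(data.year - vals.year) AS diff,
    vals.value,
    ROW_NUMBER() OVER (
      PARTITION BY data.id, data.shard
      -- Second sort key breaks ties: prefer the earlier year.
      ORDER BY ABS(data.year - vals.year), vals.year
    ) AS rn
  FROM data
  LEFT JOIN vals ON data.shard = vals.shard
) ranked
WHERE rn = 1
ORDER BY shard;
-- shard 1 -> closer_year 2008, diff 2, value 1
-- shard 3 -> closer_year, diff and value are all NULL (no matching shard)
```

Preferring the earlier year is an arbitrary choice; ordering by vals.year DESC would prefer the later one instead.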

Related

How do I select all columns, plus the result of the sum

I have this select:
"Select * from table", which returns:
Id Value
1 1
1 1
2 10
2 10
My goal is to add a column with the sum of Value for each Id, like this:
Id Value Sum
1 1 2
1 1 2
2 10 20
2 10 20
I have tried approaches like this:
SELECT Id,Value, (SELECT SUM(Value) FROM Table V2 WHERE V2.Id= V.Id GROUP BY IDRNC ) FROM Table v;
But this is not grouping by Id:
Id Value Sum
1 1 1
1 1 1
2 10 10
2 10 10
Aggregation collapses rows, reducing the number of records in the output. Here you want to attach the result of a computation to each record instead, which is what a window function does.
SELECT table.*, SUM(Value) OVER(PARTITION BY Id) AS sum_
FROM table
Check the demo here.
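For a version that runs without an existing table, here is a small self-contained sketch of the same window-function approach using the question's four rows. The CTE name t and the alias id_sum are illustrative, and the CTE column list with VALUES is assumed to be supported by your dialect (it is in PostgreSQL and SQLite).

```sql
-- Sample rows from the question, inlined as a CTE.
WITH t (id, value) AS (
  VALUES (1, 1), (1, 1), (2, 10), (2, 10)
)
SELECT
  id,
  value,
  -- Sum over all rows sharing the same id; no rows are collapsed.
  SUM(value) OVER (PARTITION BY id) AS id_sum
FROM t
ORDER BY id;
-- id | value | id_sum
--  1 |     1 |      2
--  1 |     1 |      2
--  2 |    10 |     20
--  2 |    10 |     20
```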
Your attempt looks correct.
Can you try the query below? It works for me:
SELECT Id, Value,
(SELECT SUM(Value) FROM Table V2 WHERE V2.Id= V.Id GROUP BY ID) as sum
FROM Table v;
You can do it with an inner join against a subquery grouped by id:
select t.*, sum
from _table t
inner join (
select id, sum(Value) as sum
from _table
group by id
) as s on s.id = t.id
You can check it here
Your select is ok if you adjust it just a little:
SELECT Id,Value, (SELECT SUM(Value) FROM Table V2 WHERE V2.Id= V.Id GROUP BY IDRNC ) FROM Table v;
GROUP BY IDRNC is a mistake and should be GROUP BY ID.
You should give the sum column an alias.
The subquery that computes the sum does not need its own table alias to be compared with the aliased outer query (this is not a mistake; it works either way).
Test:
WITH
a_table (ID, VALUE) AS
(
Select 1, 1 From Dual Union All
Select 1, 1 From Dual Union All
Select 2, 10 From Dual Union All
Select 2, 10 From Dual
)
SELECT ID, VALUE, (SELECT SUM(VALUE) FROM a_table WHERE ID = v.ID GROUP BY ID) "ID_SUM" FROM a_table v;
ID VALUE ID_SUM
---------- ---------- ----------
1 1 2
1 1 2
2 10 20
2 10 20

multiple top n aggregates query defined as a view (or function)?

I couldn't find a past question exactly like this problem. I have an orders table containing a customer id, an order date, and several numeric columns (how many of a particular item were ordered on that date). With some of the numeric columns removed, it looks like this:
customer_id date a b c d
0001 07/01/22 0 3 3 5
0001 07/12/22 12 0 50 0
0002 06/30/22 5 65 0 30
0002 07/20/22 1 0 19 2
0003 08/01/22 0 0 99 0
I need to sum each numeric column by customer_id, then return the top n customers for each column. Obviously that means a single customer may appear multiple times, once for each column. Assuming top 2, the desired output would look something like this:
column_ranked customer_id sum rank
'a' 001 12 1
'a' 002 6 2
'b' 002 65 1
'b' 001 3 2
'c' 003 99 1
'c' 001 53 2
'd' 002 30 1
'd' 001 5 2
(this assumes no date range filter)
My first thought was a CTE to collapse the table into its per-customer sums, then a union from the CTE, with a limit n clause, once for each summed column. That works if the date range is hard-coded into the CTE .... but I want to define this as a view, so it can be called by users something like this:
SELECT * from top_customers_view WHERE date_range BETWEEN ( date1 and date2 )
How can I pass the date restriction down to the CTE? Or am I taking the wrong approach entirely? If a view isn't possible, can it be done as a function? (without using a costly cursor, that is.)
Since the date ranges clearly produce a massive number of combinations you cannot generate a view with them. You can write a query, however, as shown below:
with
p as (select cast ('2022-01-01' as date) as ds, cast ('2022-12-31' as date) as de),
a as (
select top 10 customer_id, 'a' as col, sum(a) as s
from t cross join p where date between ds and de
group by customer_id order by s desc
),
b as (
select top 10 customer_id, 'b' as col, sum(b) as s
from t cross join p where date between ds and de
group by customer_id order by s desc
),
c as (
select top 10 customer_id, 'c' as col, sum(c) as s
from t cross join p where date between ds and de
group by customer_id order by s desc
),
d as (
select top 10 customer_id, 'd' as col, sum(d) as s
from t cross join p where date between ds and de
group by customer_id order by s desc
)
select * from a
union all select * from b
union all select * from c
union all select * from d
order by customer_id, col, s desc
The date range is in the second line.
See db<>fiddle.
Alternatively, you could create a data warehousing solution, but it would require much more effort to make it work.
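As a possible alternative to one CTE per column, the ranking can be done in a single pass with a window function. This is a sketch under stated assumptions, not the answerer's method: it assumes a dialect with window functions and CTE column lists (PostgreSQL or SQLite, for example), reads the question's dates as MM/DD/YY, and stores them as ISO-8601 strings only to keep the example portable. The names orders, order_date, qty, total and rnk are illustrative.

```sql
WITH orders (customer_id, order_date, a, b, c, d) AS (
  VALUES
    ('0001', '2022-07-01',  0,  3,  3,  5),
    ('0001', '2022-07-12', 12,  0, 50,  0),
    ('0002', '2022-06-30',  5, 65,  0, 30),
    ('0002', '2022-07-20',  1,  0, 19,  2),
    ('0003', '2022-08-01',  0,  0, 99,  0)
),
filtered AS (
  -- The date restriction lives in exactly one place.
  SELECT * FROM orders
  WHERE order_date BETWEEN '2022-01-01' AND '2022-12-31'
),
unpivoted AS (
  -- Turn the four numeric columns into (column name, quantity) rows.
  SELECT customer_id, 'a' AS column_ranked, a AS qty FROM filtered
  UNION ALL SELECT customer_id, 'b', b FROM filtered
  UNION ALL SELECT customer_id, 'c', c FROM filtered
  UNION ALL SELECT customer_id, 'd', d FROM filtered
),
summed AS (
  SELECT column_ranked, customer_id, SUM(qty) AS total
  FROM unpivoted
  GROUP BY column_ranked, customer_id
),
ranked AS (
  SELECT column_ranked, customer_id, total,
         ROW_NUMBER() OVER (
           PARTITION BY column_ranked
           ORDER BY total DESC, customer_id  -- customer_id breaks ties
         ) AS rnk
  FROM summed
)
SELECT column_ranked, customer_id, total, rnk
FROM ranked
WHERE rnk <= 2          -- top n per column, here n = 2
ORDER BY column_ranked, rnk;
```

Because the two date bounds and n each appear once, they are the natural candidates for parameters if the query is later wrapped in a table-valued function or a prepared statement.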

How to get min & max date across the database for each subject

I have a PostgreSQL db which has 4 tables, and each table has a date column.
Table 1
person_id meas_date
1 2007/02/11
2 2008/05/13
3 2008/07/29
5 2006/03/21
Table 2
person_id visit_date
1 2003/06/21
2 2005/02/23
3 2006/04/19
5 2004/06/11
Table 3
person_id condition_date
1 2008/06/21
2 2009/02/23
3 2005/04/19
5 2002/06/11
Table 4
person_id d_date
1 2018/06/21
2 2005/02/23
3 2004/04/19
5 2009/06/11
Currently I do something like the query below to find the min and max date from one table, but how do I find them across all the tables in my db? In this case, there are 4 tables.
select
person_id,
min(condition_start_date) as min_date,
max(condition_start_date) as max_date
from Table_3
group by person_id
But can you please help me find across the all tables for a subject/person_id?
I expect my output to be like below
person_id max_date min_date
1 2018/06/21 2003/06/21
2 2009/02/23 2005/02/23
3 2006/04/19 2004/04/19
5 2009/06/11 2002/06/11
Use union all and aggregation, aliasing each table's date column to a common name (dt here):
select person_id, min(dt) as min_date, max(dt) as max_date
from ((select person_id, meas_date as dt from table1) union all
(select person_id, visit_date as dt from table2) union all
(select person_id, condition_date as dt from table3) union all
(select person_id, d_date as dt from table4)
) pd
group by person_id;

Is there a way to find active users in SQL?

I'm trying to find the total count of active users in a database. "Active" users here are defined as those who have registered an event on the selected day or later than the selected day. So if a user registered an event on days 1, 2 and 5, they are counted as "active" throughout days 1, 2, 3, 4 and 5.
My original dataset looks like this (note that this is a sample - the real dataset will run to up to 365 days, and has around 1000 users).
Day ID
0 1
0 2
0 3
0 4
0 5
1 1
1 2
2 1
3 1
4 1
4 2
As you can see, all 5 IDs are active on Day 0, and 2 IDs (1 and 2) are active until Day 4, so I'd like the finished table to look like this:
Day Count
0 5
1 2
2 2
3 2
4 2
I've tried using the following query:
select Day as days, sum(case when Day <= days then 1 else 0 end)
from df
But it gives incorrect output (it only counts users who were active on each specific day).
I'm at a loss as to what I could try next. Does anyone have any ideas? Many thanks in advance!
I think I would just use generate_series():
select gs.d, count(*)
from (select id, min(day) as min_day, max(day) as max_day
from t
group by id
) t cross join lateral
generate_series(t.min_day, t.max_day, 1) gs(d)
group by gs.d
order by gs.d;
If you want to count everyone as active from day 1 -- but not all have a value on day 1 -- then use 1 instead of min_day.
Here is a db<>fiddle.
A bit verbose, but this should do:
with dt as (
select 0 d, 1 id
union all
select 0 d, 2 id
union all
select 0 d, 3 id
union all
select 0 d, 4 id
union all
select 0 d, 5 id
union all
select 1 d, 1 id
union all
select 1 d, 2 id
union all
select 2 d, 1 id
union all
select 3 d, 1 id
union all
select 4 d, 1 id
union all
select 4 d, 2 id
)
, active_periods as (
select id
, min(d) min_d
, max(d) max_d
from dt
group by id
)
, days as (
select distinct d
from dt
)
select d.d
, count(ap.id)
from days d
join active_periods ap on d.d between ap.min_d and ap.max_d
group by 1
order by 1 asc
You need count by day.
select
id,
count(*)
from df
GROUP BY
id

SQL union same number of columns, same data types, different data

I have two result sets that look approximately like this:
Id Name Count
1 Asd 1
2 Sdf 4
3 Dfg 567
4 Fgh 23
But the Count column data is different for the second one and I would like both to be displayed, about like this:
Id Name Count from set 1 Count from set two
1 Asd 1 15
2 Sdf 4 840
3 Dfg 567 81
4 Fgh 23 9
How can I do this in SQL (with union if possible)?
My current SQL, hope this will better explain what I want to do:
(SELECT Id, Name, COUNT(*) FROM Customers where X)
union
(SELECT Id, Name, COUNT(*) FROM Customers where Y)
select *
from
(
SELECT 'S1' as dataset, Id, Name, COUNT(*) as resultcount FROM Customers where X
union
SELECT 'S2',Id, Name, COUNT(*) FROM Customers where Y
) s
pivot
(sum(resultcount) for dataset in (s1,s2)) p
You can do something like:
;WITH Unioned
AS
(
SELECT 'Set1' FromWhat, Id, Name FROM Table1
UNION ALL
SELECT 'Set2', Id, Name FROM Table2
)
SELECT
Id,
Name,
SUM(CASE FromWhat WHEN 'Set1' THEN 1 ELSE 0 END) 'Count from set 1',
SUM(CASE FromWhat WHEN 'Set2' THEN 1 ELSE 0 END) 'Count from set 2'
FROM Unioned
GROUP BY Id, Name;
SQL Fiddle Demo
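When both counts come from the same table and differ only in their WHERE condition, as in the question's two SELECTs against Customers, the union can possibly be skipped altogether by putting each condition inside a CASE expression. This is a minimal sketch: the region column and its values are hypothetical stand-ins for the question's unspecified conditions X and Y, and the CTE column list with VALUES assumes a dialect such as PostgreSQL or SQLite.

```sql
-- Hypothetical rows; region = 'north' plays the role of condition X,
-- region = 'south' the role of condition Y.
WITH customers (id, name, region) AS (
  VALUES
    (1, 'Asd', 'north'),
    (1, 'Asd', 'south'),
    (1, 'Asd', 'south'),
    (2, 'Sdf', 'north')
)
SELECT
  id,
  name,
  SUM(CASE WHEN region = 'north' THEN 1 ELSE 0 END) AS count_set_1,
  SUM(CASE WHEN region = 'south' THEN 1 ELSE 0 END) AS count_set_2
FROM customers
GROUP BY id, name
ORDER BY id;
-- id | name | count_set_1 | count_set_2
--  1 | Asd  |           1 |           2
--  2 | Sdf  |           1 |           0
```

Unlike the union-based versions, this reads the table once, and a row that satisfies both conditions is counted in both columns.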