SQL: join, but with R-style vector recycling

There are two tables. They're roughly like this:
create temp table l (k int, v int);
create temp table r (k int, v int);
insert into l values
(1, 11),
(2, 21), (2, 22),
(3, 31),
(4, 41), (4, 42), (4, 43), (4, 44),
(5, 51), (5, 52),
(6, 61), (6, 62), (6, 63);
insert into r values
(1, 101),
(2, 201),
(3, 301), (3, 302),
(4, 401), (4, 402), (4, 403),
(5, 501), (5, 502), (5, 503),
(6, 601), (6, 602), (6, 603);
If I do a simple inner join of these tables on the k column, I get the Cartesian product for row groups 4 through 6. Is there any way to get, instead, behavior not entirely unlike vector recycling in R? Concretely, the desired joined table is something like
=> select l.k, l.v as lv, r.v as rv from l, r
-> where l.k = r.k and /* additional condition that does what I want */;
k | lv | rv
---+----+-----
1 | 11 | 101
2 | 21 | 201
2 | 22 | 201
3 | 31 | 301
3 | 31 | 302
4 | 41 | 401
4 | 42 | 402
4 | 43 | 403
4 | 44 | 401
5 | 51 | 501
5 | 52 | 502
5 | 51 | 503
6 | 61 | 601
6 | 62 | 602
6 | 63 | 603
And the desired behavior in English is: For each group of rows defined by l.k = r.k, arbitrarily associate each value from the left side with a single value from the right side. If the sides are not the same size, repeat just enough values from the smaller side to pair each value from the larger side with one value from the smaller. Either side may be the larger one.
(In case it matters: The real join will produce order of ten million row groups, the largest row group has order of 10 values on the larger side, and roughly 80% of all row groups are either 1:N or M:1.)
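For reference, the recycling semantics described above can be sketched outside SQL; here is a minimal Python version, using in-memory copies of l and r, that pairs values by index modulo the shorter side's length:

```python
from collections import defaultdict

# In-memory copies of the l and r tables from the question, as (k, v) pairs.
l_rows = [(1, 11), (2, 21), (2, 22), (3, 31),
          (4, 41), (4, 42), (4, 43), (4, 44),
          (5, 51), (5, 52), (6, 61), (6, 62), (6, 63)]
r_rows = [(1, 101), (2, 201), (3, 301), (3, 302),
          (4, 401), (4, 402), (4, 403),
          (5, 501), (5, 502), (5, 503),
          (6, 601), (6, 602), (6, 603)]

def recycling_join(left, right):
    lv, rv = defaultdict(list), defaultdict(list)
    for k, v in left:
        lv[k].append(v)
    for k, v in right:
        rv[k].append(v)
    out = []
    for k in sorted(lv.keys() & rv.keys()):
        n = max(len(lv[k]), len(rv[k]))
        # Recycle the shorter side with modulo indexing, like R vectors.
        for i in range(n):
            out.append((k, lv[k][i % len(lv[k])], rv[k][i % len(rv[k])]))
    return out

rows = recycling_join(l_rows, r_rows)
```

This produces exactly the 15-row table shown in the question (modulo the "arbitrarily associate" freedom, which here is insertion order).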

Here is an approach:
Enumerate the rows for each k in each table (row_number over a random order).
Count the rows for each k in each table.
Full-join on (k, seqnum); where one side comes up NULL because it ran out of rows, fill it in by recycling: take the other side's seqnum modulo this side's count to pick which value to repeat.
As SQL, this looks like:
select k,
       coalesce(l.v, max(l.v) over (partition by k, r.seqnum % l_cnt.l_cnt)),
       coalesce(r.v, max(r.v) over (partition by k, l.seqnum % r_cnt.r_cnt))
from (select l.*,
             row_number() over (partition by k order by random()) as seqnum
      from l
     ) l full join
     (select r.*,
             row_number() over (partition by k order by random()) as seqnum
      from r
     ) r
     using (k, seqnum) full join
     (select k, count(*) as l_cnt
      from l
      group by k
     ) l_cnt
     using (k) full join
     (select k, count(*) as r_cnt
      from r
      group by k
     ) r_cnt
     using (k)
order by k;

Related

SQL: detecting consecutive blocks of sequential rows with same key

My problem boils down to the following. I have a table with some natural sequencing, and in it I have a key value which may repeat over time. I want to find the blocks where the key is the same, then changes, and then comes back to being the same. Example:
A
A
B
B
B
C
C
A
A
C
C
Here I want the result to be
A, 1-2
B, 3-5
C, 6-7
A, 8-9
C, 10-11
so I can't use the key value (A, B, C) to group by, because the same key can appear multiple times; I just want to squeeze out uninterrupted runs of repeated occurrences.
Needless to say, I want the simplest SQL one can come up with. It would use OLAP window functions.
I am usually pretty good with complicated SQL, but with sequences I am not so good. I will work on this a little bit myself, of course, and annex some ideas below this question in a subsequent edit.
Let's begin by defining the table for our discussion:
CREATE TABLE Seq (
num integer,
key char
);
UPDATE 1: Doing some research I found a similar question here: How to find consecutive rows based on the value of a column? But both the question and the answers are wrapped up in a lot of extra detail and are confusing.
UPDATE 2: I already got one answer, thanks. Inspecting it now. Here is my test I am typing into PostgreSQL even as we speak:
CREATE TABLE Seq ( num int, key char );
INSERT INTO Seq VALUES
(1, 'A'), (2, 'A'),
(2, 'B'), (3, 'B'), (5, 'B'),
(6, 'C'), (7, 'C'),
(8, 'A'), (9, 'A'),
(10, 'C'), (11, 'C');
UPDATE 3: The first contender for a solution is this:
SELECT key, min(num), max(num)
FROM (
SELECT seq.*,
row_number() over (partition by key order by num) as seqnum
FROM Seq
) s
GROUP BY key, (num - seqnum)
ORDER BY min;
yields:
key | min | max
-----+-----+-----
A | 1 | 2
B | 2 | 3
B | 5 | 5
C | 6 | 7
A | 8 | 9
C | 10 | 11
(6 rows)
For some reason B appears twice. I see why: I made a "mistake" in my test data, skipping sequence num 4 and going straight from 3 to 5.
This mistake is fortunate, because it allows me to point out that while in this example the sequence number is discrete, I am intending the sequence to arise from some continuous domain (e.g., time).
There is another "mistake" I made, in that I have num 2 repeated. Is that allowable? Probably not. So, cleaning up the example, removing the duplicate but leaving the gap:
DROP TABLE Seq;
CREATE TABLE Seq ( num int, key char );
INSERT INTO Seq VALUES
(1, 'A'), (2, 'A'),
(3, 'B'), (4, 'B'), (6, 'B'),
(7, 'C'), (8, 'C'),
(9, 'A'), (10, 'A'),
(11, 'C'), (12, 'C');
this still leaves us with the duplicate B block:
key | min | max
-----+-----+-----
A | 1 | 2
B | 3 | 4
B | 6 | 6
C | 7 | 8
A | 9 | 10
C | 11 | 12
(6 rows)
Now going with that first intuition by Gordon Linoff and trying to understand it and add to it:
SELECT s.*, num - seqnum AS diff
FROM (
SELECT seq.*,
row_number() over (partition by key order by num) as seqnum
FROM Seq
) s
ORDER BY num;
here is the num - seqnum trick before grouping:
num | key | seqnum | diff
-----+-----+--------+------
1 | A | 1 | 0
2 | A | 2 | 0
3 | B | 1 | 2
4 | B | 2 | 2
6 | B | 3 | 3
7 | C | 1 | 6
8 | C | 2 | 6
9 | A | 3 | 6
10 | A | 4 | 6
11 | C | 3 | 8
12 | C | 4 | 8
(11 rows)
I doubt that this is the answer quite yet.
Because of the gaps you can't use num directly, as Gordon's solution suggested; apply row_number() to it, too.
select key, min(num), max(num)
from (select seq.*,
row_number() over (order by num) as rn,
row_number() over (partition by key order by num) as seqnum
from seq
) s
group by key, (rn - seqnum)
order by min(num);
This answers the original problem.
You can enumerate the rows for each key and subtract that from num. Voila! This number is constant when the key is constant on adjacent rows:
select key, min(num), max(num)
from (select seq.*,
row_number() over (partition by key order by num) as seqnum
from seq
) s
group by key, (num - seqnum);
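The gap-tolerant rn - seqnum variant above can be exercised outside Postgres, too; here is a quick check with Python's bundled sqlite3 (window functions need SQLite 3.25+), using the cleaned-up test data. Note that it bridges the gap at num 5, so B comes out as a single 3-6 block:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Seq (num int, key char);
INSERT INTO Seq VALUES
(1,'A'),(2,'A'),(3,'B'),(4,'B'),(6,'B'),
(7,'C'),(8,'C'),(9,'A'),(10,'A'),(11,'C'),(12,'C');
""")
# rn numbers the whole sequence; seqnum numbers rows per key.
# rn - seqnum is constant within each uninterrupted block, even across gaps in num.
rows = con.execute("""
SELECT key, min(num), max(num)
FROM (SELECT Seq.*,
             row_number() OVER (ORDER BY num) AS rn,
             row_number() OVER (PARTITION BY key ORDER BY num) AS seqnum
      FROM Seq) s
GROUP BY key, rn - seqnum
ORDER BY min(num)
""").fetchall()
```

Whether merging across the gap is right depends on the intent; since the sequence is meant to arise from a continuous domain like time, this is the desired behavior here.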

Creating column for every group in group by

Suppose I have a table T which has entries as follows:
id | type | value |
-------------------------
1 | A | 7
1 | B | 8
2 | A | 9
2 | B | 10
3 | A | 11
3 | B | 12
1 | C | 13
2 | C | 14
For each type, I want a different column. Since the set of types is known and fixed, I would like each distinct type enumerated, with a corresponding column for each. I wanted to make id a primary key for the table.
So, the desired output is something like:
id | A's value | B's value | C's value
------------------------------------------
1 | 7 | 8 | 13
2 | 9 | 10 | 14
3 | 11 | 12 | NULL
Please note that this is a simplified version. The actual table T is derived from a much bigger table using group by. And for each group, I would like a separate column. Is that even possible?
Use conditional aggregation:
select id,
max(case when type = 'A' then value end) as a_value,
max(case when type = 'B' then value end) as b_value,
max(case when type = 'C' then value end) as c_value
from t
group by id;
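The conditional aggregation is portable; a quick sanity check with Python's sqlite3 (a stand-in for whatever database the asker uses) reproduces the desired output, including the NULL for the missing (3, C) pair:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (id int, type text, value int);
INSERT INTO t VALUES
(1,'A',7),(1,'B',8),(2,'A',9),(2,'B',10),
(3,'A',11),(3,'B',12),(1,'C',13),(2,'C',14);
""")
# One CASE per type: max() picks the single matching value per id,
# and leaves NULL where the (id, type) pair is absent.
rows = con.execute("""
SELECT id,
       max(CASE WHEN type = 'A' THEN value END) AS a_value,
       max(CASE WHEN type = 'B' THEN value END) AS b_value,
       max(CASE WHEN type = 'C' THEN value END) AS c_value
FROM t
GROUP BY id
ORDER BY id
""").fetchall()
```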
I'd recommend looking into the PIVOT function:
https://docs.snowflake.com/en/sql-reference/constructs/pivot.html
The main blocker with this function, though, is that the list of values for the pivot_column needs to be pre-determined. To do this, I normally use the LISTAGG function:
https://docs.snowflake.com/en/sql-reference/functions/listagg.html
I've included a query below to show you how to build that string; putting it all together in a script like Python, or even in a stored procedure, should be fairly straightforward (build the pivot_column list, build the aggregate/pivot command, execute the aggregate/pivot command).
I hope this helps...Rich
CREATE OR REPLACE TABLE monthly_sales(
empid INT,
amount INT,
month TEXT)
AS SELECT * FROM VALUES
(1, 10000, 'JAN'),
(1, 400, 'JAN'),
(2, 4500, 'JAN'),
(2, 35000, 'JAN'),
(1, 5000, 'FEB'),
(1, 3000, 'FEB'),
(2, 200, 'FEB'),
(2, 90500, 'FEB'),
(1, 6000, 'MAR'),
(1, 5000, 'MAR'),
(2, 2500, 'MAR'),
(2, 9500, 'MAR'),
(1, 8000, 'APR'),
(1, 10000, 'APR'),
(2, 800, 'APR'),
(2, 4500, 'APR');
SELECT *
FROM monthly_sales
PIVOT(SUM(amount)
FOR month IN ('JAN', 'FEB', 'MAR', 'APR'))
AS p
ORDER BY empid;
SELECT LISTAGG( DISTINCT ''''||month||'''', ', ' )
FROM monthly_sales;
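The build-then-execute step can be sketched in Python. This is only string assembly; the months list stands in for what the LISTAGG query would return, and actually running pivot_sql through a Snowflake connector is left out:

```python
# Hypothetical sketch: assemble the Snowflake PIVOT statement dynamically.
# `months` stands in for the distinct values the LISTAGG query above returns.
months = ["JAN", "FEB", "MAR", "APR"]
in_list = ", ".join(f"'{m}'" for m in months)
pivot_sql = (
    "SELECT * FROM monthly_sales "
    f"PIVOT(SUM(amount) FOR month IN ({in_list})) AS p "
    "ORDER BY empid"
)
# pivot_sql can now be handed to cursor.execute() on a live connection.
```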

How can I duplicate records with T-SQL and keep track of the progressive number?

How can I duplicate the records of table1 and store them in table2 along with the progressive number calculated from startnum and endnum?
Thanks
(The first row must be duplicated into 4 records, i.e. num: 80, 81, 82, 83.)
Startnum | Endnum | Data
---------+-------------+----------
80 | 83 | A
10 | 11 | C
14 | 16 | D
Result:
StartEndNum | Data
------------+-----------
80 | A
81 | A
82 | A
83 | A
10 | C
11 | C
14 | D
15 | D
16 | D
A simple method uses a recursive CTE:
with cte as (
    select startnum, endnum, data
    from t
    union all
    select startnum + 1, endnum, data
    from cte
    where startnum < endnum
)
select startnum, data
from cte;
If you have ranges longer than 100 values, you need option (maxrecursion 0), because the default recursion limit is 100.
Note: There are other solutions as well, using numbers tables (either built-in or generated). I like this solution as a gentle introduction to recursive CTEs.
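The recursive CTE is standard enough to try outside SQL Server; here is the same anchor-plus-step expansion run through Python's sqlite3 (which supports WITH RECURSIVE and has no maxrecursion cap to worry about):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (startnum int, endnum int, data text);
INSERT INTO t VALUES (80,83,'A'),(10,11,'C'),(14,16,'D');
""")
# Anchor: every original row. Recursive step: bump startnum by 1
# until it reaches endnum, emitting one row per number in the range.
rows = con.execute("""
WITH RECURSIVE cte AS (
  SELECT startnum, endnum, data FROM t
  UNION ALL
  SELECT startnum + 1, endnum, data FROM cte WHERE startnum < endnum
)
SELECT startnum, data FROM cte ORDER BY data, startnum
""").fetchall()
```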
Without recursion:
declare @t table (Startnum int, Endnum int, Data varchar(20));
insert into @t values
(80, 83, 'A'),
(10, 11, 'C'),
(14, 16, 'D');

select a.StartEndNum, t.Data
from @t t cross apply
     (select top (t.Endnum - t.Startnum + 1)
             t.Startnum + row_number() over (order by getdate()) - 1 as StartEndNum
      from sys.all_columns) a;
You can use any other table with enough rows instead of sys.all_columns.

SQL aggregates over 3 tables

Well, this is annoying the hell out of me. Any help would be much appreciated.
I'm trying to get a count of how many project Ids and Steps there are. The relationships are:
Projects (n-1) Pages
Pages (n-1) Status Steps
Sample Project Data
id name
1 est et
2 quia nihil
Sample Pages Data
id project_id workflow_step_id
1 1 1
2 1 1
3 1 2
4 1 1
5 2 3
6 2 3
7 2 4
Sample Steps Data
id name
1 a
2 b
3 c
4 d
Expected Output
project_id name count_steps
1 a 3
1 b 1
2 c 2
2 d 1
Thanks!
An approach to meet the expected result:
CREATE TABLE Pages
("id" int, "project_id" int, "workflow_step_id" int)
;
INSERT INTO Pages
("id", "project_id", "workflow_step_id")
VALUES
(1, 1, 1),
(2, 1, 1),
(3, 1, 2),
(4, 1, 1),
(5, 2, 3),
(6, 2, 3),
(7, 2, 4)
;
CREATE TABLE workflow_steps
("id" int, "name" varchar(1))
;
INSERT INTO workflow_steps
("id", "name")
VALUES
(1, 'a'),
(2, 'b'),
(3, 'c'),
(4, 'd')
;
CREATE TABLE Projects
("id" int, "name" varchar(10))
;
INSERT INTO Projects
("id", "name")
VALUES
(1, 'est et'),
(2, 'quia nihil')
;
Query 1:
select pg.project_id, s.name, pg.workflow_step_id, ws.count_steps
from (
select distinct project_id, workflow_step_id
from pages ) pg
inner join (
select workflow_step_id, count(*) count_steps
from pages
group by workflow_step_id
) ws on pg.workflow_step_id = ws.workflow_step_id
inner join workflow_steps s on pg.workflow_step_id = s.id
order by project_id, name, workflow_step_id
Results:
| project_id | name | workflow_step_id | count_steps |
|------------|------|------------------|-------------|
| 1 | a | 1 | 3 |
| 1 | b | 2 | 1 |
| 2 | c | 3 | 2 |
| 2 | d | 4 | 1 |
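For this sample data, a plain join plus a two-column group by gives the same counts; here is a quick check with Python's sqlite3 (a stand-in for the asker's database). Note this groups by (project, step) directly, which is equivalent to the subquery approach above as long as each step id belongs to a single project:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE pages (id int, project_id int, workflow_step_id int);
INSERT INTO pages VALUES (1,1,1),(2,1,1),(3,1,2),(4,1,1),(5,2,3),(6,2,3),(7,2,4);
CREATE TABLE workflow_steps (id int, name text);
INSERT INTO workflow_steps VALUES (1,'a'),(2,'b'),(3,'c'),(4,'d');
""")
# Count pages per (project, step name).
rows = con.execute("""
SELECT pg.project_id, ws.name, count(*) AS count_steps
FROM pages pg JOIN workflow_steps ws ON pg.workflow_step_id = ws.id
GROUP BY pg.project_id, ws.name
ORDER BY pg.project_id, ws.name
""").fetchall()
```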

Postgres group by

When I run a query, these are the results presented to me:
id account_id score active item_id
5 78 9 true 4
6 78 1 true 4
7 78 9 true 6
8 78 5 true 7
9 78 5 true 8
10 78 5 true 8
I'd like the output to look like this by combining item_id's based on score:
id account_id score active item_id
* 78 10 true 4
7 78 9 true 6
8 78 5 true 7
* 78 10 true 8
My query that returns that info looks like this:
SELECT item.id, item.account_id, itemaudit.score, itemrevision.active, itemaudit.item_id
from item
left join itemrevision on item.id = itemrevision.id
join itemaudit on item.id = itemaudit.id
where itemrevision.active = true
;
The bit I'm missing: when item_id is not distinct, I want to combine/sum the value of score. I'm not sure how to do this step.
The schema looks like this:
CREATE TABLE item
(id integer, account_id integer);
CREATE TABLE itemaudit
(id integer, item_id integer, score integer);
CREATE TABLE itemrevision
(id int, active boolean, item_id int);
INSERT INTO item
(id, account_id)
VALUES
(5, 78),
(6, 78),
(7, 78),
(8, 78),
(9, 78),
(10, 78)
;
INSERT INTO itemaudit
(id, item_id, score)
VALUES
(5, 4, 5),
(6, 4, 1),
(7, 6, 9),
(8, 7, 10),
(9, 8, 1),
(10, 8, 9)
;
INSERT INTO itemrevision
(id, active, item_id)
VALUES
(5, true, 4),
(6, true, 4),
(7, true, 6),
(8, true, 7),
(9, true, 7),
(10, true, 8)
;
If I understand correctly, you just want an aggregation query:
select ia.item_id, sum(ia.score) as score
from item i join -- the `where` clause turns this into an inner join
itemrevision ir
on i.id = ir.id join
itemaudit ia
on i.id = ia.id
where ir.active = true
group by ia.item_id;
Notes:
I changed the left join to an inner join, because the where clause has this effect anyway.
Table aliases make the query easier to write and to read.
In an aggregation query, the other columns are not appropriate.
I think you want something like this:
SELECT
CASE
WHEN array_length(array_agg(id),1) = 1
THEN (array_agg(id))[1]::text
ELSE '*'
END AS id,
account_id,
sum(score) AS score,
item_id
FROM item
GROUP BY account_id, item_id
ORDER BY account_id, item_id;
id | account_id | score | item_id
----+------------+-------+---------
* | 78 | 10 | 4
7 | 78 | 9 | 6
8 | 78 | 5 | 7
* | 78 | 10 | 8
(4 rows)
While this is what you asked for, the simpler version below is more informative and better:
SELECT
array_agg(id) AS id,
account_id,
sum(score) AS score,
item_id
FROM item
GROUP BY account_id, item_id
ORDER BY account_id, item_id;
id | account_id | score | item_id
--------+------------+-------+---------
{5,6} | 78 | 10 | 4
{7} | 78 | 9 | 6
{8} | 78 | 5 | 7
{9,10} | 78 | 10 | 8
(4 rows)