SQL: Get row number which increases every time a value changes - sql

I have the following table in Vertica:
+----------+----------+----------+
| column_1 | column_2 | column_3 |
+----------+----------+----------+
| a | 1 | 1 |
| a | 2 | 1 |
| a | 3 | 1 |
| b | 1 | 1 |
| b | 2 | 1 |
| b | 3 | 1 |
| c | 1 | 1 |
| c | 2 | 1 |
| c | 3 | 1 |
| c | 1 | 2 |
| c | 2 | 2 |
| c | 3 | 2 |
+----------+----------+----------+
The table is ordered by column_1 and column_3.
I would like to add a row number, which increases every time when column_1 or column_3 change their value. It would look something like this:
+----------+----------+----------+------------+
| column_1 | column_2 | column_3 | row_number |
+----------+----------+----------+------------+
| a | 1 | 1 | 1 |
| a | 2 | 1 | 1 |
| a | 3 | 1 | 1 |
| b | 1 | 1 | 2 |
| b | 2 | 1 | 2 |
| b | 3 | 1 | 2 |
| c | 1 | 1 | 3 |
| c | 2 | 1 | 3 |
| c | 3 | 1 | 3 |
| c | 1 | 2 | 4 |
| c | 2 | 2 | 4 |
| c | 3 | 2 | 4 |
+----------+----------+----------+------------+
I tried using partition over but I can't find the right syntax.

Vertica has the CONDITIONAL_CHANGE_EVENT() analytic functions.
It starts at 0, and increments by 1 every time the expression that makes the first argument undergoes a change.
Like so:
WITH
indata(column_1,column_2,column_3,rn) AS (
SELECT 'a',1,1,1
UNION ALL SELECT 'a',2,1,1
UNION ALL SELECT 'a',3,1,1
UNION ALL SELECT 'b',1,1,2
UNION ALL SELECT 'b',2,1,2
UNION ALL SELECT 'b',3,1,2
UNION ALL SELECT 'c',1,1,3
UNION ALL SELECT 'c',2,1,3
UNION ALL SELECT 'c',3,1,3
UNION ALL SELECT 'c',1,2,4
UNION ALL SELECT 'c',2,2,4
UNION ALL SELECT 'c',3,2,4
)
SELECT
*
, CONDITIONAL_CHANGE_EVENT(
column_1||column_3::VARCHAR
) OVER w + 1 AS rownum
FROM indata
WINDOW w AS (ORDER BY column_1,column_3,column_2)
;
-- out column_1 | column_2 | column_3 | rn | rownum
-- out ----------+----------+----------+----+--------
-- out a | 1 | 1 | 1 | 1
-- out a | 2 | 1 | 1 | 1
-- out a | 3 | 1 | 1 | 1
-- out b | 1 | 1 | 2 | 2
-- out b | 2 | 1 | 2 | 2
-- out b | 3 | 1 | 2 | 2
-- out c | 1 | 1 | 3 | 3
-- out c | 2 | 1 | 3 | 3
-- out c | 3 | 1 | 3 | 3
-- out c | 1 | 2 | 4 | 4
-- out c | 2 | 2 | 4 | 4
-- out c | 3 | 2 | 4 | 4

In the absence of an ORDER BY, SQL data sets are unordered. To establish the order in your example therefore, I've assumed the dataset can be sorted with ORDER BY column_1, column_3, column_2
If that assumption doesn't work, you MUST add additional columns that the data can be deterministically sorted by.
That gives the following query...
SELECT
yourTable.*,
DENSE_RANK() OVER (ORDER BY column_1, column_3) AS row_number
FROM
yourTable
ORDER BY
column_1, column_3, column_2

This would also work and doesn't require table sorting
Find distinct value from column_1 and column_3 and give new index for them
Merge the previous with origin table on column_1 and column_3
select t1.*, t2.row_number
from
your_table t1
join
(select column_1, column_2, row_number() over (partition by temp) as row_number from (select distinct column_1, column_2, 1 as temp from your_table) foo) t2
on
t1.column_1=t2.column_1 and t1.column_2=t2.column_2;

Related

SQL Select random rows partitioned by a column

I have a dataset looks like this
| Country | id |
-------------------
| a | 5 |
| a | 1 |
| a | 2 |
| b | 1 |
| b | 5 |
| b | 4 |
| b | 7 |
| c | 5 |
| c | 1 |
| c | 2 |
and i need a query which returns 2 random values from where country in ('a', 'c'):
| Country | id |
------------------
| a | 2 | -- Two random rows from Country = 'a'
| a | 1 |
| c | 1 |
| c | 5 | --Two random rows from Country = 'c'
This should work:
select Country, id from
(select Country,
id,
row_number() over(partition by Country order by rand()) as rn
from table_name
) t
where Country in ('a', 'c') and rn <= 2
Replace rand() with random() if you're using Postgres or newid() in SQL Server.

Count rows in table that are the same in a sequence

I have a table that looks like this
+----+------------+------+
| ID | Session_ID | Type |
+----+------------+------+
| 1 | 1 | 2 |
| 2 | 1 | 4 |
| 3 | 1 | 2 |
| 4 | 2 | 2 |
| 5 | 2 | 2 |
| 6 | 3 | 2 |
| 7 | 3 | 1 |
+----+------------+------+
And I would like to count all occurences of a type that are in a sequence.
Output look some how like this:
+------------+------+-----+
| Session_ID | Type | cnt |
+------------+------+-----+
| 1 | 2 | 1 |
| 1 | 4 | 1 |
| 1 | 2 | 1 |
| 2 | 2 | 2 |
| 3 | 2 | 1 |
| 3 | 1 | 1 |
+------------+------+-----+
A simple group by like
SELECT session_id, type, COUNT(type)
FROM table
GROUP BY session_id, type
doesn't work, since I need to group only rows that are "touching".
Is this possible with a merge sql-select or will I need some sort of coding. Stored Procedure or Application side coding?
UPDATE Sequence:
If the following row has the same type, it should be counted (ordered by ID).
to determine the sequence the ID is the key with the session_ID, since I just want to group rows with the same session_ID.
So if there are 3 rows is in one session
row with the ID 1 has type 1,
and the second row has type 1
and row 3 has type 2
Input:
+----+------------+------+
| ID | Session_ID | Type |
+----+------------+------+
| 1 | 1 | 1 |
| 2 | 1 | 1 |
| 3 | 1 | 2 |
+----+------------+------+
The squence is Row 1 to Row 2. This three row should output
Output:
+------------+------+-------+
| Session_ID | Type | count |
+------------+------+-------+
| 1 | 1 | 2 |
| 3 | 2 | 1 |
+------------+------+-------+
You can use a difference of id and row_number() to identify the gaps and then perform your count
;with cte as
(
Select *, id - row_number() over (partition by session_id,type order by id) as grp
from table
)
select session_id,type,count(*) as cnt
from cte
group by session_id,type,grp
order by max(id)

Semi-transposing a table in Oracle

I am having trouble semi-transposing the table below based on the 'LENGTH' column. I am using an Oracle database, sample data:
+-----------+-----------+--------+------+
| PERSON_ID | PERIOD_ID | LENGTH | FLAG |
+-----------+-----------+--------+------+
| 1 | 1 | 4 | 1 |
| 1 | 2 | 3 | 0 |
| 2 | 1 | 4 | 1 |
+-----------+-----------+--------+------+
I would like to lengthen this table based on the LENGTH row; basically duplicating the row for each value in the LENGTH column.
See the desired output table below:
+-----------+-----------+--------+------+
| PERSON_ID | PERIOD_ID | NUMBER | FLAG |
+-----------+-----------+--------+------+
| 1 | 1 | 1 | 1 |
| 1 | 1 | 2 | 1 |
| 1 | 1 | 3 | 1 |
| 1 | 1 | 4 | 1 |
| 1 | 2 | 1 | 0 |
| 1 | 2 | 2 | 0 |
| 1 | 2 | 3 | 0 |
| 2 | 1 | 1 | 1 |
| 2 | 1 | 2 | 1 |
| 2 | 1 | 3 | 1 |
| 2 | 1 | 4 | 1 |
+-----------+-----------+--------+------+
I typically work in Posgres so Oracle is new to me.
I've found some solutions using the connect by statement but they seem overly complicated, particularly when compared to the simple generate_series() command from Posgres.
A recursive CTE subtracting 1 from length until 1 is reached should work. (In Postgres too, BTW, should you need something working cross platform.)
WITH cte (person_id,
period_id,
number_,
flag)
AS
(
SELECT person_id,
period_id,
length number_,
flag
FROM elbat
UNION ALL
SELECT person_id,
period_id,
number_ - 1 number_,
flag
FROM cte
WHERE number_ > 1
)
SELECT *
FROM cte
ORDER BY person_id,
period_id,
number_;
db<>fiddle

Hive - over (partition by ...) with a column not in group by

Is it possible to do something like:
select
avg(count(distinct user_id))
over (partition by some_date) as average_users_per_day
from user_activity
group by user_type
(notably, the partition by column, some_date, is not in the group by columns)
The idea I'm going for is something like: the average users per day by user type.
I know how to do it using subqueries (see below), but I'd like to know if there is a nice way using only over (partition by ...) and group by.
Notes:
From reading this answer, my understanding (correct me if I'm wrong) is that the following query:
select
avg(count(distinct a)) over (partition by b)
from foo
group by b
can be expanded equivalently to:
select
avg(count_distinct_a)
from (
select
b,
count(distinct a) as count_distinct_a
from foo
group by b
)
group by b
And from that, I can tweak it a bit to achieve what I want:
select
avg(count_distinct_user_id) as average_users_per_day
from (
select
user_type,
count(distinct user_id) as count_distinct_user_id
from user_activity
group by user_type, some_date
)
group by user_type
(notably, the inner group by user_type, some_date differs from the outer group by user_type)
I'd like to be able to tell the partition by-group by interaction to use a "sub-group-by" for the windowing part. Please let me know if my understanding of partition by/group by is completely off.
EDIT: Some sample data and desired output.
Source table:
+---------+-----------+-----------+
| user_id | user_type | some_date |
+---------+-----------+-----------+
| 1 | a | 1 |
| 1 | a | 2 |
| 2 | a | 1 |
| 3 | a | 2 |
| 3 | a | 2 |
| 4 | b | 2 |
| 5 | b | 1 |
| 5 | b | 3 |
| 5 | b | 3 |
| 6 | c | 1 |
| 7 | c | 1 |
| 8 | c | 4 |
| 9 | c | 2 |
| 9 | c | 3 |
| 9 | c | 4 |
+---------+-----------+-----------+
Sample intermediate table (for reasoning with):
+-----------+-----------+---------------------+
| user_type | some_date | distinct_user_count |
+-----------+-----------+---------------------+
| a | 1 | 2 |
| a | 2 | 2 |
| b | 1 | 1 |
| b | 2 | 1 |
| b | 3 | 1 |
| c | 1 | 2 |
| c | 2 | 1 |
| c | 3 | 1 |
| c | 4 | 2 |
+-----------+-----------+---------------------+
SQL is: select user_type, some_date, count(distinct user_id) from user_activity group by user_type, some_date.
Desired result:
+-----------+---------------------+
| user_type | average_daily_users |
+-----------+---------------------+
| a | 2 |
| b | 1 |
| c | 1.5 |
+-----------+---------------------+

SQL::Self join a table to satisfy a particular condition?

I have the following table:
mysql> SELECT * FROM temp;
+----+------+
| id | a |
+----+------+
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
+----+------+
I am trying to get the following output:
+----+------+------+
| id | a | a |
+----+------+------+
| 1 | 1 | 2 |
| 2 | 2 | 3 |
| 3 | 3 | 4 |
+----+------+------+
but I am having a small problem. I wrote the following query:
mysql> SELECT A.id, A.a, B.a FROM temp A, temp B WHERE B.a>A.a;
but my output is the following:
+----+------+------+
| id | a | a |
+----+------+------+
| 1 | 1 | 2 |
| 1 | 1 | 3 |
| 2 | 2 | 3 |
| 1 | 1 | 4 |
| 2 | 2 | 4 |
| 3 | 3 | 4 |
+----+------+------+
Can someone tell me how to convert this into the desired output? I am trying to get a form where only the consecutive values are produced. I mean, if 2 is greater than 1 and 3 is greater than 2, I do not want 3 is greater than 1.
Option 1: "Triangular Join" - Quadratic Complexity
SELECT A.id, A.a, MIN(B.a) AS a
FROM temp A
JOIN temp B ON B.a>A.a
GROUP BY A.id, A.a;`
Option 2: "Pseudo Row_Number()" - Linear Complexity
select a_numbered.id, a_numbered.a, b_numbered.a
from
(
select id,
a,
#rownum := #rownum + 1 as rn
from temp
join (select #rownum := 0) r
order by id
) a_numbered join (
select id,
a,
#rownum2 := #rownum2 + 1 as rn
from temp
join (select #rownum2 := 0) r
order by id
) b_numbered
on b_numbered.rn = a_numbered.rn+1