How to increment the count for each non-consecutive value? - SQL

Below is a simple representation of my table:
ID | GA
----------
1 | 1.5
2 | 1.5
3 | 1.2
4 | 1.5
5 | 1.3
I would like to count the occurrences of each value in the GA column, BUT the count should not increment while the value stays the same across consecutive rows.
What I expect is this:
ID | GA | COUNT
-------------------
1 | 1.5 | 1
2 | 1.5 | 1
3 | 1.2 | 1
4 | 1.5 | 2
5 | 1.3 | 1
Notice that the count for GA = 1.5 reaches 2. This is because the row between IDs 2 and 4 breaks the run of 1.5 values.
NOTE: The ordering by ID also matters.
Here's what I've done so far:
SELECT ID, GA, COUNT(*) OVER (
         PARTITION BY GA
         ORDER BY ID
       ) COUNT
FROM (
  SELECT 1 AS ID, '1.5' AS GA FROM DUAL
  UNION
  SELECT 2, '1.5' FROM DUAL
  UNION
  SELECT 3, '1.2' FROM DUAL
  UNION
  SELECT 4, '1.5' FROM DUAL
  UNION
  SELECT 5, '1.3' FROM DUAL
) FOO
ORDER BY ID;
But the result is far from what I expect:
ID | GA | COUNT
-------------------
1 | 1.5 | 1
2 | 1.5 | 2
3 | 1.2 | 1
4 | 1.5 | 3
5 | 1.3 | 1
Notice that the count keeps incrementing even for consecutive values.

It seems that you are asking for a kind of running total, not just a global count.
Assuming that the input data is in a table named input_data, this should do the trick:
WITH
with_previous AS (
  SELECT id, ga, LAG(ga) OVER (ORDER BY id) AS previous_ga
  FROM input_data
),
just_new AS (
  SELECT id,
         ga,
         CASE
           WHEN previous_ga IS NULL
             OR previous_ga <> ga
           THEN ga
         END AS new_ga
  FROM with_previous
)
SELECT id,
       ga,
       COUNT(new_ga) OVER (PARTITION BY ga ORDER BY id) AS ga_count
FROM just_new
ORDER BY 1
See sqlfiddle: http://sqlfiddle.com/#!4/187e13/1
Result:
ID | GA | GA_COUNT
----+-----+----------
1 | 1.5 | 1
2 | 1.5 | 1
3 | 1.2 | 1
4 | 1.5 | 2
5 | 1.3 | 1
6 | 1.5 | 3
7 | 1.5 | 3
8 | 1.3 | 2
I took the sample data from @D-Shih's sqlfiddle.
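To see why this works, it can help to materialize the intermediate columns for the question's five rows. A minimal, self-contained sketch (assuming Oracle, with the sample data inlined as the input_data CTE):
WITH input_data AS (
  SELECT 1 AS id, 1.5 AS ga FROM dual UNION ALL
  SELECT 2, 1.5 FROM dual UNION ALL
  SELECT 3, 1.2 FROM dual UNION ALL
  SELECT 4, 1.5 FROM dual UNION ALL
  SELECT 5, 1.3 FROM dual
)
SELECT id,
       ga,
       LAG(ga) OVER (ORDER BY id) AS previous_ga, -- value on the previous row by id
       CASE
         WHEN LAG(ga) OVER (ORDER BY id) IS NULL
           OR LAG(ga) OVER (ORDER BY id) <> ga
         THEN ga
       END AS new_ga                              -- non-NULL only where a new run starts
FROM input_data
ORDER BY id;
Only rows 1, 3, 4, and 5 get a non-NULL new_ga, so COUNT(new_ga) OVER (PARTITION BY ga ORDER BY id) counts 1 for IDs 1 and 2, and 2 for ID 4: exactly the run counter we want.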

As I understand the problem, this is a variation of a gaps-and-islands problem. You want to enumerate the groups for each ga value independently.
If this interpretation is correct, then I would go for dense_rank() and the difference of row numbers:
select t.*,
       dense_rank() over (partition by ga order by seqnum_1 - seqnum_2)
from (select t.*,
             row_number() over (order by id) as seqnum_1,
             row_number() over (partition by ga order by id) as seqnum_2
      from t
     ) t
order by id;
Here is a rextester.
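In case the link goes stale, here is a self-contained sketch of the same query with the question's sample data inlined as t (Oracle syntax; the table name t is an assumption):
with t as (
  select 1 as id, 1.5 as ga from dual union all
  select 2, 1.5 from dual union all
  select 3, 1.2 from dual union all
  select 4, 1.5 from dual union all
  select 5, 1.3 from dual
)
select t.*,
       dense_rank() over (partition by ga order by seqnum_1 - seqnum_2) as ga_count
from (select t.*,
             row_number() over (order by id) as seqnum_1,                -- position overall
             row_number() over (partition by ga order by id) as seqnum_2 -- position within this ga
      from t
     ) t
order by id;
For consecutive rows with the same ga, both sequence numbers advance in lockstep, so seqnum_1 - seqnum_2 is constant within an island and jumps when the run is broken; dense_rank() then numbers the islands 1, 2, ... per ga.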

Use a subquery with the LAG and SUM analytic functions:
SELECT id, ga,
       SUM(cnt) OVER (PARTITION BY ga ORDER BY id) AS cnt
FROM (
  SELECT t.*,
         CASE LAG(ga) OVER (ORDER BY id)
           WHEN ga THEN 0 ELSE 1
         END AS cnt
  FROM Tab t
)
ORDER BY id
| ID | GA | CNT |
|----|-----|-----|
| 1 | 1.5 | 1 |
| 2 | 1.5 | 1 |
| 3 | 1.2 | 1 |
| 4 | 1.5 | 2 |
| 5 | 1.3 | 1 |
Demo: http://sqlfiddle.com/#!4/5ddd1/5

Related

Get some values from the table by selecting

I have a table:
| id | Number | Address |
| -- | ------ | ------- |
| 1  | 0      | NULL    |
| 1  | 1      | NULL    |
| 1  | 2      | 50      |
| 1  | 3      | NULL    |
| 2  | 0      | 10      |
| 3  | 1      | 30      |
| 3  | 2      | 20      |
| 3  | 3      | 20      |
| 4  | 0      | 75      |
| 4  | 1      | 22      |
| 4  | 2      | 30      |
| 5  | 0      | NULL    |
I need to get: the NUMBER of the last ADDRESS change for each ID.
I wrote this select:
select dh.id, dh.number from table dh where dh.number =
  (select max(min(t.history)) from table t where t.id = dh.id group by t.address)
But this select does not correctly handle the case where the address first changes and then changes back to a previous value. For example, for id = 1 the GROUP BY returns:
| Number |
| -------- |
| NULL |
| 50 |
I have been thinking about this select for several days, and I will be happy to receive any help.
You can do this using row_number() -- twice:
select t.id, min(number)
from (select t.*,
             row_number() over (partition by id order by number desc) as seqnum1,
             row_number() over (partition by id, address order by number desc) as seqnum2
      from t
     ) t
where seqnum1 = seqnum2
group by id;
What this does is enumerate the rows by number in descending order:
Once per id.
Once per id and address.
The two enumerations coincide exactly on the trailing streak of rows that share the most recent address. Aggregation then pulls back the earliest row of that streak, which is where the last change happened.
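To make this concrete, trace id = 1 with number descending: the rows (number, address) = (3, NULL), (2, 50), (1, NULL), (0, NULL) get seqnum1 = 1, 2, 3, 4, while seqnum2 restarts per address and gives 1, 1, 2, 3. They match only at number = 3, the start of the trailing streak, so min(number) returns 3: the point of the last address change (50 back to NULL).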
I answered my own question; in case anyone needs it, here is my solution:
select *
from table dh1
where dh1.number = (
  select max(x.number)
  from (
    select dh2.id, dh2.number, dh2.address,
           lag(dh2.address) over (order by dh2.number asc) as prev
    from table dh2
    where dh1.id = dh2.id
  ) x
  where nvl(x.address, 0) <> nvl(x.prev, 0)
);

How to fill irregularly missing values with linear interpolation in BigQuery?

I have data with irregularly missing values, and I'd like to convert it to a regular interval using linear interpolation in BigQuery Standard SQL.
Specifically, I have data like this:
# data is missing irregularly
+------+-------+
| time | value |
+------+-------+
| 1 | 3.0 |
| 5 | 5.0 |
| 7 | 1.0 |
| 9 | 8.0 |
| 10 | 4.0 |
+------+-------+
and I'd like to convert this table as follows:
# interpolated with interval of 1
+------+--------------------+
| time | value_interpolated |
+------+--------------------+
| 1 | 3.0 |
| 2 | 3.5 |
| 3 | 4.0 |
| 4 | 4.5 |
| 5 | 5.0 |
| 6 | 3.0 |
| 7 | 1.0 |
| 8 | 4.5 |
| 9 | 8.0 |
| 10 | 4.0 |
+------+--------------------+
Any smart solution for this?
Supplement: this question is similar to this question on Stack Overflow, but differs in that the data is missing irregularly.
Thank you.
Below is for BigQuery Standard SQL
#standardSQL
select time,
       ifnull(value, start_value + (end_value - start_value) / (end_tick - start_tick) * (time - start_tick)) as value_interpolated
from (
  select time, value,
         first_value(tick ignore nulls) over win1 as start_tick,
         first_value(value ignore nulls) over win1 as start_value,
         first_value(tick ignore nulls) over win2 as end_tick,
         first_value(value ignore nulls) over win2 as end_value
  from (
    select time, t.time as tick, value
    from (
      select generate_array(min(time), max(time)) times
      from `project.dataset.table`
    ), unnest(times) time
    left join `project.dataset.table` t
    using(time)
  )
  window win1 as (order by time desc rows between current row and unbounded following),
         win2 as (order by time rows between current row and unbounded following)
)
If applied to the sample data from your question, the output matches the expected table above.
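As a sanity check on the formula: for time 6, the surrounding known points are (5, 5.0) and (7, 1.0), so value_interpolated = 5.0 + (1.0 - 5.0) / (7 - 5) * (6 - 5) = 3.0, which is exactly the value in the expected table.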
Here is an example of how to solve this in PostgreSQL.
https://dbfiddle.uk/?rdbms=postgres_9.5&fiddle=c560dd9a8db095920d0a15834b6768f1
with data as (
  select time,
         lead(time) over (order by time) as next_time,
         value,
         lead(value) over (order by time) as next_value,
         (lead(value) over (order by time) - value) as val_diff,
         (lead(time) over (order by time) - time) as time_diff
  from t
)
select *,
       generate_series - time as grp,
       case when generate_series - time = 0
            then value
            else value + (val_diff * 1.0 / time_diff) * (generate_series - time) * 1.0
       end as val_grp
from data
cross join generate_series(time, coalesce(next_time - 1, time))
+------+-----------------+-----+-------------------------+
| time | generate_series | grp | val_grp |
+------+-----------------+-----+-------------------------+
| 1 | 1 | 0 | 3.0 |
| 1 | 2 | 1 | 3.500000000000000000000 |
| 1 | 3 | 2 | 4.000000000000000000000 |
| 1 | 4 | 3 | 4.500000000000000000000 |
| 5 | 5 | 0 | 5.0 |
| 5 | 6 | 1 | 3.00000000000000000 |
| 7 | 7 | 0 | 1.0 |
| 7 | 8 | 1 | 4.50000000000000000 |
| 9 | 9 | 0 | 8.0 |
| 10 | 10 | 0 | 4.0 |
+------+-----------------+-----+-------------------------+
I believe the syntax would be different in BigQuery using UNNEST and GENERATE_ARRAY as follows. You could give it a try.
with data as (
  select time,
         lead(time) over (order by time) as next_time,
         value,
         lead(value) over (order by time) as next_value,
         (lead(value) over (order by time) - value) as val_diff,
         (lead(time) over (order by time) - time) as time_diff
  from t
)
select *,
       generate_series - time as grp,
       case when generate_series - time = 0
            then value
            else value + (val_diff * 1.0 / time_diff) * (generate_series - time) * 1.0
       end as val_grp
from data
cross join unnest(generate_array(time, coalesce(next_time - 1, time))) as generate_series
In BigQuery you can generate the extra rows for each row using generate_array(). Then you can use lead() to get information from the next row and some arithmetic for interpolation:
with t as (
      select 1 as time, 3.0 as value union all
      select 5, 5.0 union all
      select 7, 1.0 union all
      select 9, 8.0 union all
      select 10, 4.0
     ),
     tt as (
      select t.*,
             lead(time) over (order by time) as next_time,
             lead(value) over (order by time) as next_value
      from t
     )
select coalesce(n, tt.time) as time,
       (case when n = tt.time or n is null then value
             else tt.value + (tt.next_value - tt.value) * (n - tt.time) / (tt.next_time - tt.time)
        end) as value
from tt left join
     unnest(generate_array(tt.time, tt.next_time - 1, 1)) n
     on true
order by 1;
Note: You have a column called time that contains an integer. If this is really a date/time data type of some type, I would suggest that you ask a new question with more appropriate sample data and desired results -- if you don't see how to adapt this answer.

Get N results for each group without using join

Can I solve this without using a join? There is a lot of data in this table, and I want to do this more efficiently.
One of my ideas is to get the ID list with a GROUP_CONCAT subquery, but it doesn't work well with the IN clause:
SELECT * FROM table WHERE id IN (group_concat subquery)
May I get your advice?
data
ID SERVER_ID ...
--------------------
1 1 ...
2 1
3 1
4 2
5 2
6 2
7 3
8 3
9 3
10 3
...
expected result with a limit of 2 per group:
ID SERVER_ID ...
--------------------
1 1 ...
2 1
4 2
5 2
7 3
8 3
You can try the following using row_number; this solution works for PostgreSQL, MySQL 8.0, Oracle, and SQL Server.
select id,
       server_id
from (
  select id,
         server_id,
         row_number() over (partition by server_id order by id) as rnk
  from yourTable
) val
where rnk <= 2
Here is the demo.
| id | server_id |
| --- | --------- |
| 1 | 1 |
| 2 | 1 |
| 4 | 2 |
| 5 | 2 |
| 7 | 3 |
| 8 | 3 |
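As an aside on why the GROUP_CONCAT idea fails: the subquery yields a single comma-separated string such as '1,2,4,5', and IN then performs one comparison against that whole string rather than testing membership in a list, so it cannot match the individual IDs.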

SUM(), GROUP BY, and WHERE

I have a table like this:
Title | Version | Condition | Count
-----------------------------------------------
Title1 | 1.0 | 1 | 10
Title1 | 1.1 | 2 | 5
Title1 | 1.1 | 2 | 10
Title1 | 1.1 | 1 | 10
Title2 | 1.0 | 2 | 10
Title2 | 1.5 | 1 | 5
Title2 | 1.5 | 2 | 5
Title3 | 1.5 | 2 | 10
Title3 | 1.5 | 1 | 10
And I would like to sum the value of "Count" for each line that has the MAX() "Version", and "Condition" = 2. I'd like this to be the resulting data set:
Title | Version | Condition | Count
-----------------------------------------------
Title1 | 1.1 | 2 | 15
Title2 | 1.5 | 2 | 5
Title3 | 1.5 | 2 | 10
I am able to get the list of "Title" and MAX("Version") with "Condition" = 2 using:
SELECT DISTINCT Title, MAX(Version) AS MaxVer FROM TABLE
WHERE Condition = 2
GROUP BY Title
But I'm not sure how to add up all the "Count"s.
Try this:
SELECT t1.Title, t1.Version, t1.Condition, SUM([Count]) AS Count
FROM mytable AS t1
JOIN (
SELECT Title, MAX(Version) AS max_version
FROM mytable
WHERE Condition = 2
GROUP BY Title
) AS t2 ON t1.Title = t2.Title AND t1.Version = t2.max_version
WHERE t1.Condition = 2
GROUP BY t1.Title, t1.Version, t1.Condition
Demo here
This needs only a single access to the table:
SELECT Title, Version, Condition, Count
FROM
( -- calculate all SUMs
SELECT Title, Version, Condition, SUM(Count) AS Count,
ROW_NUMBER() OVER (PARTITION BY Title ORDER BY Version DESC) AS rn
FROM TABLE
WHERE Condition = 2
GROUP BY Title, Version, Condition
) AS dt
-- and return only the row with the highest Version
WHERE rn = 1
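For reference, a self-contained check of this query with the question's rows inlined (a sketch in the same T-SQL flavor as the first answer; mytable is an assumed name):
WITH mytable AS (
  SELECT 'Title1' AS Title, '1.0' AS Version, 1 AS Condition, 10 AS [Count]
  UNION ALL SELECT 'Title1', '1.1', 2, 5
  UNION ALL SELECT 'Title1', '1.1', 2, 10
  UNION ALL SELECT 'Title1', '1.1', 1, 10
  UNION ALL SELECT 'Title2', '1.0', 2, 10
  UNION ALL SELECT 'Title2', '1.5', 1, 5
  UNION ALL SELECT 'Title2', '1.5', 2, 5
  UNION ALL SELECT 'Title3', '1.5', 2, 10
  UNION ALL SELECT 'Title3', '1.5', 1, 10
)
SELECT Title, Version, Condition, [Count]
FROM (
  SELECT Title, Version, Condition, SUM([Count]) AS [Count],
         ROW_NUMBER() OVER (PARTITION BY Title ORDER BY Version DESC) AS rn
  FROM mytable
  WHERE Condition = 2
  GROUP BY Title, Version, Condition
) AS dt
WHERE rn = 1;
-- returns Title1 / 1.1 / 2 / 15, Title2 / 1.5 / 2 / 5, Title3 / 1.5 / 2 / 10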

Partitions with number of rows oracle

I have a view in an Oracle DB; it looks as follows:
id | type | numrows
----|--------|----------
1 | S | 2
2 | L | 3
3 | S | 2
4 | S | 2
5 | L | 3
6 | S | 2
7 | L | 3
8 | S | 2
9 | L | 3
10 | L | 3
The idea is: if TYPE is 'S' then return 2 rows (randomly), and if TYPE is 'L' then return 3 rows (randomly).
Example:
id | type | numrows
----|--------|----------
1 | S | 2
3 | S | 2
2 | L | 3
5 | L | 3
7 | L | 3
You should tell Oracle how to get 3 rows or 2 rows. One idea is to fabricate a ranking column:
select id, type, numrows
from (
  select id,
         type,
         numrows,
         row_number() over (partition by type order by type) rnk --fabricated
  from table
)
where (type = 'S' and rnk <= 2)
   or (type = 'L' and rnk <= 3);
You can order by anything you want in that analytic function. For example, you can order by dbms_random.random() for random choices.
If your numrows column is correct, i.e. it already holds the number of rows you want per type, then the where clause is simpler:
select id, type, numrows
from (
  select id,
         type,
         numrows,
         row_number() over (partition by type order by dbms_random.random()) rnk --fabricated
  from table
)
where rnk <= numrows;
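The pattern is easy to experiment with by inlining a few of the view's rows. A minimal sketch (Oracle; dbms_random.value works for the random ordering as well):
with v as (
  select 1 as id, 'S' as type, 2 as numrows from dual union all
  select 2, 'L', 3 from dual union all
  select 3, 'S', 2 from dual union all
  select 4, 'S', 2 from dual union all
  select 5, 'L', 3 from dual union all
  select 7, 'L', 3 from dual
)
select id, type, numrows
from (
  select id,
         type,
         numrows,
         row_number() over (partition by type order by dbms_random.value) rnk -- random rank per type
  from v
)
where rnk <= numrows;
Each run returns two random 'S' rows and three random 'L' rows.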