Count how many times a value appears continuously in Hive/SQL


I have a table with three columns. For each userid, with rows ordered by time, I'd like to count the longest run of consecutive rows whose value equals B, something like the longest sublist of identical values. For example, the data below
time userid value
2016-01-01 1 A
2016-01-02 1 B
2016-01-03 1 B
2016-01-04 2 C
2016-01-05 2 B
2016-01-06 2 B
2016-01-07 2 B
2016-01-08 2 C
2016-01-09 2 B
would return
userid times
1 2
2 3
Is this even possible in Hive without a user-defined function? I've dug into LAG and LEAD a bit, but couldn't find a way. :(

select  value
       ,userid
       ,max (times) as times
from   (select  value
               ,userid
               ,count (*) as times
        from   (select  value
                       ,userid
                       ,row_number () over
                        (
                            partition by userid
                            order by time
                        ) as rn
                       ,row_number () over
                        (
                            partition by userid,value
                            order by time
                        ) as rn_val
                from    t
                ) t
        -- where value = 'B'   -- optional filter; keep it outside the subquery that computes rn
        -- rn - rn_val is constant within each run of consecutive equal values
        group by value
                ,userid
                ,rn - rn_val
        ) t
group by value
        ,userid
order by value
        ,userid
;
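The query above gives no explanation, so here is a minimal sketch of why it works, run against the question's sample data. It assumes an in-memory SQLite database as a stand-in for Hive (SQLite 3.25 or newer is needed for window functions); the aliases numbered and runs are illustrative names, not from the original. Within one userid, rn counts all rows and rn_val counts only rows with the same value, so rn - rn_val stays constant while the value repeats and changes when the run is interrupted. The filter on 'B' is applied after the row numbers are computed.

```python
import sqlite3

# In-memory SQLite stands in for Hive here (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (time TEXT, userid INTEGER, value TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("2016-01-01", 1, "A"), ("2016-01-02", 1, "B"), ("2016-01-03", 1, "B"),
        ("2016-01-04", 2, "C"), ("2016-01-05", 2, "B"), ("2016-01-06", 2, "B"),
        ("2016-01-07", 2, "B"), ("2016-01-08", 2, "C"), ("2016-01-09", 2, "B"),
    ],
)

query = """
SELECT userid, MAX(times) AS times
FROM (
    SELECT userid, value, COUNT(*) AS times
    FROM (
        SELECT userid, value,
               ROW_NUMBER() OVER (PARTITION BY userid ORDER BY time) AS rn,
               ROW_NUMBER() OVER (PARTITION BY userid, value ORDER BY time) AS rn_val
        FROM t
    ) AS numbered
    WHERE value = 'B'                    -- filter only after rn has been computed
    GROUP BY userid, value, rn - rn_val  -- one group per run of consecutive B rows
) AS runs
GROUP BY userid
ORDER BY userid
"""
longest_runs = conn.execute(query).fetchall()
print(longest_runs)  # [(1, 2), (2, 3)]
```

Filtering inside the innermost subquery instead would make rn and rn_val identical for every remaining row, and the query would then count all B rows per user rather than the longest run.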

Related

Getting count of last records of 2 columns SQL

I am looking for a solution to the following scenario.
My table, energy_readings, looks like this:
equipment_id meter_id readings reading_date
1 1 100 01/01/2022
1 1 200 02/01/2022
1 1 null 03/01/2022
1 2 100 01/01/2022
1 2 null 04/01/2022
2 1 null 04/01/2022
2 1 399 05/01/2022
2 2 null 02/01/2022
From this, I want to count the nulls among the last records of each equipment_id and meter_id combination. Only the last record of each combination should be considered.
Example: the last reading for equipment 1, meter 1 is null, so it counts. The last reading (latest date) for equipment 1, meter 2 is also null, so it counts too. Equipment 2, meter 1 has a null, but it is not the last record (latest date), so it does not count.
Thus, this should be the result:
equipment_id Count
1 2
2 1
I hope the question is clear.
Thank you!

You can use a CTE like the one below. The CTE LatestRecord gets the latest reading_date for each equipment_id and meter_id. Join it back to the table, use WHERE to keep only the latest records whose readings value is null, and count them per equipment_id.
;WITH LatestRecord AS (
    SELECT equipment_id, meter_id, MAX(reading_date) AS reading_date
    FROM energy_readings
    GROUP BY equipment_id, meter_id
)
SELECT er.equipment_id, COUNT(1) AS [Count]
FROM energy_readings er
JOIN LatestRecord lr
    ON lr.equipment_id = er.equipment_id
    AND lr.meter_id = er.meter_id
    AND lr.reading_date = er.reading_date
WHERE er.readings IS NULL
GROUP BY er.equipment_id
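As a quick check of the latest-record join, here is a minimal sketch that runs it against the question's sample data and groups by equipment_id, which is what the expected result asks for. It assumes SQLite in place of SQL Server, so the leading semicolon and the [Count] bracket alias are dropped and the alias null_count is used instead. The dates are rewritten in ISO format, reading the originals as DD/MM/YYYY (also an assumption), so that MAX() on text sorts chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE energy_readings ("
    "equipment_id INTEGER, meter_id INTEGER, readings INTEGER, reading_date TEXT)"
)
conn.executemany(
    "INSERT INTO energy_readings VALUES (?, ?, ?, ?)",
    [
        (1, 1, 100, "2022-01-01"), (1, 1, 200, "2022-01-02"), (1, 1, None, "2022-01-03"),
        (1, 2, 100, "2022-01-01"), (1, 2, None, "2022-01-04"),
        (2, 1, None, "2022-01-04"), (2, 1, 399, "2022-01-05"),
        (2, 2, None, "2022-01-02"),
    ],
)

query = """
WITH LatestRecord AS (
    -- latest reading_date per (equipment_id, meter_id) pair
    SELECT equipment_id, meter_id, MAX(reading_date) AS reading_date
    FROM energy_readings
    GROUP BY equipment_id, meter_id
)
SELECT er.equipment_id, COUNT(*) AS null_count
FROM energy_readings er
JOIN LatestRecord lr
  ON lr.equipment_id = er.equipment_id
 AND lr.meter_id = er.meter_id
 AND lr.reading_date = er.reading_date
WHERE er.readings IS NULL          -- keep only latest rows that are null
GROUP BY er.equipment_id
ORDER BY er.equipment_id
"""
null_counts = conn.execute(query).fetchall()
print(null_counts)  # [(1, 2), (2, 1)]
```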

with records as (
    select equipment_id, meter_id, reading_date, readings,
           rank() over (partition by equipment_id, meter_id
                        order by reading_date desc) as rnk
    from energy_readings
)
select equipment_id, count(*) as null_count
from records
where rnk = 1
  and readings is null
group by equipment_id
order by equipment_id
Explanation:
records ranks the rows of each equipment_id and meter_id pair by reading_date, newest first, so the latest reading gets rank 1.
The outer query keeps only the rank 1 rows whose readings value is null.
It then counts those rows per equipment_id.

Select the first row in the last group of consecutive rows

How would I select the row that is the first occurrence in the last 'grouping' of consecutive rows, where a grouping is defined by the consecutive appearance of a particular column value (state in the example below)?
For example, given the following table:
id datetime state value_needed
1 2021-04-01 09:42:41.319000 incomplete A
2 2021-04-04 09:42:41.319000 done B
3 2021-04-05 09:42:41.319000 incomplete C
4 2021-04-05 10:42:41.319000 incomplete C
5 2021-04-07 09:42:41.319000 done D
6 2021-04-12 09:42:41.319000 done E
I would want the row with id=5, as it is the first occurrence of state=done in the last (i.e. most recent) grouping of state=done.

Assuming all columns are NOT NULL:
SELECT *
FROM   tbl t1
WHERE  NOT EXISTS (
   SELECT FROM tbl t2
   WHERE  t2.state <> t1.state
   AND    t2.datetime > t1.datetime
   )
ORDER  BY datetime
LIMIT  1;
NOT EXISTS is true only for the last group of peers (there is no later row with a different state).
Then ORDER BY datetime and take the first row. Voilà.

Here's a window function solution that accesses your table only once (which may or may not perform better for large data sets):
SELECT *
FROM (
    SELECT *,
           LEAD (state) OVER (ORDER BY datetime DESC)
               IS DISTINCT FROM state AS first_in_group
    FROM tbl
) t
WHERE first_in_group
ORDER BY datetime DESC
LIMIT 1
To illustrate, here's the value of first_in_group for each row:
id datetime state value_needed first_in_group
---------------------------------------------------------------------
6 2021-04-12 09:42:41.319 done E f
5 2021-04-07 09:42:41.319 done D t
4 2021-04-05 10:42:41.319 incomplete C f
3 2021-04-05 09:42:41.319 incomplete C t
2 2021-04-04 09:42:41.319 done B t
1 2021-04-01 09:42:41.319 incomplete A t
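Both approaches above can be compared on the sample data. The following is a minimal sketch that assumes SQLite rather than PostgreSQL, so two details differ from the answers: the empty select list becomes SELECT 1, and IS DISTINCT FROM is written with SQLite's null-safe IS NOT operator. The alias flagged is an illustrative name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (id INTEGER, datetime TEXT, state TEXT, value_needed TEXT)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [
        (1, "2021-04-01 09:42:41.319", "incomplete", "A"),
        (2, "2021-04-04 09:42:41.319", "done", "B"),
        (3, "2021-04-05 09:42:41.319", "incomplete", "C"),
        (4, "2021-04-05 10:42:41.319", "incomplete", "C"),
        (5, "2021-04-07 09:42:41.319", "done", "D"),
        (6, "2021-04-12 09:42:41.319", "done", "E"),
    ],
)

# Approach 1: rows with no later row in a different state form the last group;
# the earliest of them is the answer.
not_exists_query = """
SELECT id FROM tbl t1
WHERE NOT EXISTS (
    SELECT 1 FROM tbl t2
    WHERE t2.state <> t1.state AND t2.datetime > t1.datetime
)
ORDER BY datetime
LIMIT 1
"""

# Approach 2: walking backwards in time, a row starts a group when the next
# (chronologically previous) row has a different state or does not exist.
lead_query = """
SELECT id FROM (
    SELECT id, datetime,
           (LEAD(state) OVER (ORDER BY datetime DESC)) IS NOT state AS first_in_group
    FROM tbl
) AS flagged
WHERE first_in_group
ORDER BY datetime DESC
LIMIT 1
"""

first_by_not_exists = conn.execute(not_exists_query).fetchall()
first_by_lead = conn.execute(lead_query).fetchall()
print(first_by_not_exists, first_by_lead)  # [(5,)] [(5,)]
```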

How to "filter" records in Hive table?

Imagine a table with id, status and modified_date. One id can have more than one record in the table. For each id, I need only the row that has the current status, together with the modified_date on which the status changed from the previous one to the current one.
id status modified_date
--------------------------------------------
1 T 1-Jan
1 T 2-Jan
1 F 3-Jan
1 F 4-Jan
1 T 5-Jan
1 T 6-Jan
2 F 18-Feb
2 F 20-Feb
2 T 21-Feb
3 F 1-Mar
3 F 1-Mar
3 F 2-Mar
With everything I have tried so far, I cannot capture the second change for id 1, from F to T on 5-Jan.
So I expect these results:
id status modified_date
--------------------------------------------
1 T 5-Jan
2 T 21-Feb
3 F 1-Mar

Using the lag() analytic function you can read the previous row's status and calculate a status_changed_flag. Then use row_number() to mark the most recent status-changed row of each id with 1, and filter on it. See the comments in the code:
with your_data as ( --replace with your table
    select stack(12,
        1,'T','1-Jan',
        1,'T','2-Jan',
        1,'F','3-Jan',
        1,'F','4-Jan',
        1,'T','5-Jan',
        1,'T','6-Jan',
        2,'F','18-Feb',
        2,'F','20-Feb',
        2,'T','21-Feb',
        3,'F','1-Mar',
        3,'F','1-Mar',
        3,'F','2-Mar') as (id,status,modified_date)
)
--note: modified_date is a string in this demo and sorts as text; order by a real date column in your table
select id,status,modified_date
from
(
    select id,status,modified_date,status_changed_flag,
           row_number() over(partition by id, status_changed_flag order by modified_date desc) rn
    from
    (
        select t.*,
               --lag(status) over(partition by id order by modified_date) prev_status,
               NVL((lag(status) over(partition by id order by modified_date)!=status), true) status_changed_flag
        from your_data t
    )s
)s where status_changed_flag and rn=1
order by id --remove ordering if not necessary
;
Result:
OK
id status modified_date
1 T 5-Jan
2 T 21-Feb
3 F 1-Mar
Time taken: 178.643 seconds, Fetched: 3 row(s)
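For readers without a Hive cluster at hand, the same two-step idea (flag the rows where the status changed, then keep the latest flagged row per id) can be tried locally. This is only a sketch under assumptions: it uses SQLite instead of Hive, replaces NVL with COALESCE, and stores the dates as ISO strings with a hypothetical year 2021, since the question gives no year. The aliases flagged and ranked are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_data (id INTEGER, status TEXT, modified_date TEXT)")
conn.executemany(
    "INSERT INTO your_data VALUES (?, ?, ?)",
    [
        (1, "T", "2021-01-01"), (1, "T", "2021-01-02"), (1, "F", "2021-01-03"),
        (1, "F", "2021-01-04"), (1, "T", "2021-01-05"), (1, "T", "2021-01-06"),
        (2, "F", "2021-02-18"), (2, "F", "2021-02-20"), (2, "T", "2021-02-21"),
        (3, "F", "2021-03-01"), (3, "F", "2021-03-01"), (3, "F", "2021-03-02"),
    ],
)

query = """
SELECT id, status, modified_date
FROM (
    SELECT id, status, modified_date, status_changed,
           ROW_NUMBER() OVER (PARTITION BY id, status_changed
                              ORDER BY modified_date DESC) AS rn
    FROM (
        SELECT id, status, modified_date,
               -- 1 when the status differs from the previous row;
               -- the first row of an id has no previous row and counts as a change
               COALESCE((LAG(status) OVER (PARTITION BY id ORDER BY modified_date)) <> status, 1)
                   AS status_changed
        FROM your_data
    ) AS flagged
) AS ranked
WHERE status_changed = 1 AND rn = 1
ORDER BY id
"""
current_status_rows = conn.execute(query).fetchall()
print(current_status_rows)
# [(1, 'T', '2021-01-05'), (2, 'T', '2021-02-21'), (3, 'F', '2021-03-01')]
```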

Resetting row number according to column value T-SQL

I have the following data, with a column indicating the first record within what we'll call an episode, though there is no episode ID. The ID column indicates an individual person.
ID StartDate EndDate First_Record
1 2013-11-30 2013-12-08 0
1 2013-12-08 2013-12-14 NULL
1 2013-12-14 2013-12-16 NULL
1 2013-12-16 2013-12-24 NULL
2 2001-02-02 2001-02-02 0
2 2001-02-03 2001-02-05 NULL
2 2010-03-11 2010-03-15 0
2 2010-03-15 2010-03-23 NULL
2 2010-03-24 2010-03-26 NULL
I am trying to get a column with the row number (starting at 0), partitioned by ID and ordered by start date, but the row number needs to reset whenever the First_Record column is not null. Hence the desired output column Depth.
ID StartDate EndDate First_Record Depth
1 2013-11-30 2013-12-08 0 0
1 2013-12-08 2013-12-14 NULL 1
1 2013-12-14 2013-12-16 NULL 2
1 2013-12-16 2013-12-24 NULL 3
2 2001-02-02 2001-02-02 0 0
2 2001-02-03 2001-02-05 NULL 1
2 2010-03-11 2010-03-15 0 0
2 2010-03-15 2010-03-23 NULL 1
2 2010-03-24 2010-03-26 NULL 2
I can't think of a solution. I found a similar thread, but I need help translating it to what I'm trying to do. It has to use the First_Record column, as that has been set from specific conditions. Any help is appreciated.

If each person can have only one episode, you can just use row_number():
select t.*, row_number() over (partition by id order by startDate) - 1 as depth
from t;
Otherwise, you can calculate the episode grouping using a cumulative sum and then use that:
select t.*,
       row_number() over (partition by id, grp order by startDate) - 1 as depth
from (select t.*,
             count(first_record) over (partition by id order by startdate) as grp
      from t
     ) t;
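The cumulative-count trick works because COUNT(first_record) ignores NULLs: as a running count ordered by start date, it increases by one at every episode start and stays flat in between, so it can serve as an episode number. A minimal sketch on the question's data, assuming SQLite in place of SQL Server (the alias grouped is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER, startDate TEXT, endDate TEXT, first_record INTEGER)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, "2013-11-30", "2013-12-08", 0), (1, "2013-12-08", "2013-12-14", None),
        (1, "2013-12-14", "2013-12-16", None), (1, "2013-12-16", "2013-12-24", None),
        (2, "2001-02-02", "2001-02-02", 0), (2, "2001-02-03", "2001-02-05", None),
        (2, "2010-03-11", "2010-03-15", 0), (2, "2010-03-15", "2010-03-23", None),
        (2, "2010-03-24", "2010-03-26", None),
    ],
)

query = """
SELECT id, startDate,
       ROW_NUMBER() OVER (PARTITION BY id, grp ORDER BY startDate) - 1 AS depth
FROM (
    SELECT t.*,
           -- running count of non-NULL first_record values = episode number
           COUNT(first_record) OVER (PARTITION BY id ORDER BY startDate) AS grp
    FROM t
) AS grouped
ORDER BY id, startDate
"""
depths = [row[2] for row in conn.execute(query)]
print(depths)  # [0, 1, 2, 3, 0, 1, 0, 1, 2]
```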

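With this query the depth starts from 0. Note that it numbers all rows of an ID in one sequence, so it does not reset at a later non-null First_Record.
SELECT t.*
    ,convert(INT, (
        row_number() OVER (
            PARTITION BY id ORDER BY startDate
            )
        )) - 1 AS Depth
FROM t;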

Get time stamp of change in column value

I have a table that tracks a certain status using a bit column. I want to get the first timestamp of the latest status change. I get the desired output using a temp table, but is there a better way to do this?
I get the max timestamp for status 1, then the min timestamp for status 0; if the min timestamp for status 0 is greater than the max timestamp for status 1, I include it in the result set.
Sample data
123 0 2016-12-21 20:04:56.217
123 0 2016-12-21 19:00:28.980
123 0 2016-12-21 17:00:10.207 <-- Get this record because this is the latest status change from 1 to 0
123 1 2016-12-20 16:15:58.787
123 1 2016-12-20 16:11:36.523
123 1 2016-12-20 14:20:02.467
123 1 2016-12-20 13:57:57.623
123 0 2016-12-20 13:55:31.421 <-- This should not be included in the result: it is a status change, but not the latest one
123 1 2016-12-20 13:54:57.307
123 0 2016-12-19 12:23:46.103
123 0 2016-12-18 11:47:21.267
SQL
CREATE TABLE #temp_status_changed
(
    id VARCHAR(22) NOT NULL,
    enabled BIT NOT NULL,
    dt_create DATETIME NOT NULL
)
INSERT INTO #temp_status_changed
SELECT id,enabled,MAX(dt_create) FROM mytable WHERE enabled=1
GROUP BY id,enabled
SELECT a.id,a.enabled,MIN(a.dt_create) FROM mytable a
JOIN #temp_status_changed b ON a.id=b.id
WHERE a.enabled=0
GROUP BY a.id,a.enabled
HAVING MIN(a.dt_create) > (SELECT dt_create FROM #temp_status_changed WHERE id=a.id)
DROP TABLE #temp_status_changed

There are several ways to achieve that.
For example, using the LAG() function you can always get the previous value and compare it:
SELECT * FROM
(
    SELECT *, LAG(Enabled) OVER (PARTITION BY id ORDER BY dt_create) PrevEnabled
    FROM YourTable
) x
WHERE Enabled = 0 AND PrevEnabled = 1
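As written, the LAG() query returns every transition from 1 to 0; on the question's sample data that includes the 13:55:31.421 row as well as the wanted 17:00:10.207 row. One possible way to keep only the latest transition per id is to rank the transitions by dt_create, newest first. The sketch below assumes SQLite in place of SQL Server, and the aliases with_prev and changes are illustrative. It returns the latest transition even if the status has since gone back to 1, which may or may not be what is wanted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id TEXT, enabled INTEGER, dt_create TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        ("123", 0, "2016-12-21 20:04:56.217"), ("123", 0, "2016-12-21 19:00:28.980"),
        ("123", 0, "2016-12-21 17:00:10.207"), ("123", 1, "2016-12-20 16:15:58.787"),
        ("123", 1, "2016-12-20 16:11:36.523"), ("123", 1, "2016-12-20 14:20:02.467"),
        ("123", 1, "2016-12-20 13:57:57.623"), ("123", 0, "2016-12-20 13:55:31.421"),
        ("123", 1, "2016-12-20 13:54:57.307"), ("123", 0, "2016-12-19 12:23:46.103"),
        ("123", 0, "2016-12-18 11:47:21.267"),
    ],
)

query = """
SELECT id, enabled, dt_create
FROM (
    SELECT id, enabled, dt_create,
           -- rank only the transition rows, newest first
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY dt_create DESC) AS rn
    FROM (
        SELECT id, enabled, dt_create,
               LAG(enabled) OVER (PARTITION BY id ORDER BY dt_create) AS prev_enabled
        FROM mytable
    ) AS with_prev
    WHERE enabled = 0 AND prev_enabled = 1   -- rows where the status went from 1 to 0
) AS changes
WHERE rn = 1
"""
latest_change = conn.execute(query).fetchall()
print(latest_change)  # [('123', 0, '2016-12-21 17:00:10.207')]
```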

Another approach without window functions would be:
SELECT
    sc.id,
    sc.enabled,
    dt_create = MIN(sc.dt_create)
FROM
    YourTable AS sc
    JOIN (
        SELECT
            id,
            max_dt_create = MAX(dt_create)
        FROM
            YourTable
        WHERE
            enabled = 1
        GROUP BY
            id
    ) AS MaxStatusChanges
        ON sc.id = MaxStatusChanges.id AND
           sc.dt_create > MaxStatusChanges.max_dt_create
GROUP BY
    sc.id,
    sc.enabled
The query returns no rows for an id if there are no rows with status 1 for that id, or if the most recent status for the id is 1. A nonclustered index on the enabled column that includes the id and dt_create columns could improve query performance.
