How to "filter" records in Hive table?

How to "filter" records in Hive table? - hive

Imagine table with id, status and modified_date. One id can have more than one record in table. I need to get out only that row for each id that has current status together with the modified_date when this status has changed from older one to current.
id status modified_date,
--------------------------------------------
1 T 1-Jan,
1 T 2-Jan,
1 F 3-Jan,
1 F 4-Jan,
1 T 5-Jan,
1 T 6-Jan,
2 F 18-Feb,
2 F 20-Feb,
2 T 21-Feb,
3 F 1-Mar,
3 F 1-Mar,
3 F 2-Mar,
With everything I already did I can not capture the second change for person 1 from F to T on 5-Jan.
So I expect results :
id status modified_date,
--------------------------------------------
1 T 5-Jan,
2 T 21-Feb,
3 F 1-Mar,

Using lag() analytic function you can address previous row to calculate status_changed flag. Then use row_number to mark last status changed rows with 1 and filter them. See comments in the code:
with your_data as (--replace with your table
select stack(12,
1,'T','1-Jan',
1,'T','2-Jan',
1,'F','3-Jan',
1,'F','4-Jan',
1,'T','5-Jan',
1,'T','6-Jan',
2,'F','18-Feb',
2,'F','20-Feb',
2,'T','21-Feb',
3,'F','1-Mar',
3,'F','1-Mar',
3,'F','2-Mar') as (id,status,modified_date)
)
select id,status,modified_date
from
(
select id,status,modified_date,status_changed_flag,
row_number() over(partition by id, status_changed_flag order by modified_date desc) rn
from
(
select t.*,
--lag(status) over(partition by id order by modified_date) prev_status,
NVL((lag(status) over(partition by id order by modified_date)!=status), true) status_changed_flag
from your_data t
)s
)s where status_changed_flag and rn=1
order by id --remove ordering if not necessary
;
Result:
OK
id status modified_date
1 T 5-Jan
2 T 21-Feb
3 F 1-Mar
Time taken: 178.643 seconds, Fetched: 3 row(s)

Related

Getting count of last records of 2 columns SQL

I was looking for a solution for the below mentioned scenario.
So my table structure is like this ; Table name : energy_readings
equipment_id
meter_id
readings
reading_date
1
1
100
01/01/2022
1
1
200
02/01/2022
1
1
null
03/01/2022
1
2
100
01/01/2022
1
2
null
04/01/2022
2
1
null
04/01/2022
2
1
399
05/01/2022
2
2
null
02/01/2022
So from this , I want to get the number of nulls for the last record of same equipment_id and meter_id. (Should only consider the nulls of the last record of same equipment_id and meter_id)
EX : Here , the last reading for equipment 1 and meter 1 is a null , therefore it should be considered for the count. Also the last reading(Latest Date) for equipment 1 and meter 2 is a null , should be considered for count. But even though equipment 2 and meter 1 has a null , it is not the last record (Latest Date) , therefore should not be considered for the count.
Thus , this should be the result ;
equipment_id
Count
1
2
2
1
Hope I was clear with the question.
Thank you!

You can use CTE like below. CTE LatestRecord will get latest record for equipment_id & meter_id. Later you can join it with your current table and use WHERE to filter out record with null values only.
;WITH LatestRecord AS (
SELECT equipment_id, meter_id, MAX(reading_date) AS reading_date
FROM energy_readings
GROUP BY equipment_id, meter_id
)
SELECT er.meter_id, COUNT(1) AS [Count]
FROM energy_readings er
JOIN LatestRecord lr
ON lr.equipment_id = er.equipment_id
AND lr.meter_id = er.meter_id
AND lr.reading_date = er.reading_date
WHERE er.readings IS NULL
GROUP BY er.meter_id

with records as(
select equ_id,meter_id,reading_date,readings,
RANK() OVER(PARTITION BY meter_id,equ_id
order by reading_date) Count
from equipment order by equ_id
)
select equ_id,count(counter)
from
(
select equ_id,meter_id,reading_date,readings,MAX(Count) as counter
from records
group by meter_id,equ_id
order by equ_id
) where readings IS NULL group by equ_id
Explanation:-
records will order data by reading_date and will give counting as 1,2,3..
select max of count from records
select count of counter where reading is null
Partition by will give counting as shown in image
Result

How to select the last value which is not null?

I have the following table:
id a b
1 1 kate
1 4 null
1 3 paul
1 3 paul
1 2 lola
2 1 kim
2 9 null
2 2 null
In result it should be this:
1 3 paul
2 1 kim
I want to get the last a where b is not null. Something like:
select b
from (select,b
row_num() over (partition by id order by a desc) as num) as f
where num = 1
But this way I get a null value, because to the last a = 4 corresponds to b IS NULL. Maybe there is a way to rewrite ffill method from pandas?

Assuming:
a is defined NOT NULL.
You want the row with the greatest a where b IS NOT NULL - per id.
SELECT DISTINCT ON (id) *
FROM tbl
WHERE b IS NOT NULL
ORDER BY id, a DESC;
db<>fiddle here
Detailed explanation:
Select first row in each GROUP BY group?

Try:
select id, a, b
from (select id, a, b,
row_num() over (partition by id order by a desc nulls last) as num
from unnamedTable) t
where num = 1
Or, if that isn't right, try it with nulls first. I can never remember which way it works with desc.

If you aren't guaranteed to have at least one non-null per id then you'll want to move nulls to the bottom of the list rather than filtering those rows out entirely.
select id, a, b
from (
select id, a, b,
row_number() over (
partition by id
order by case when b is not null then 0 else 1 end, a desc
) as num
) as f
where num = 1

You can wrap this around a cte and join it back to the main table if you wish to keep the original columns as is, but looking at your expected output and logic, this should do it. Having said that, row_number() based approach might be a tad faster.
select distinct
id,
max(a) over (partition by id) as a,
first_value(b) over (partition by id order by a desc) as b
from tbl
where b is not null;

Postgresql query to filter latest data based on 2 columns

Table Structure First
users table
id
1
2
3
sites table
id
1
2
site_memberships table
site_id
user_id
created_on
1
1
1
1
1
2
1
1
3
2
1
1
2
1
2
1
2
2
1
2
3
Assuming higher the created_on number, latest the record
Expected Output
site_id
user_id
created_on
1
1
3
2
1
2
1
2
3
Expected output: I need latest record for each user for each site membership.
Tried the following query, but this does not seem to work.
select * from users inner join
(
SELECT ROW_NUMBER () OVER (
PARTITION BY sm.user_id,
sm.created_on
), sm.*
from site_memberships sm
inner join sites s on sm.site_id=s.id
) site_memberships
ON site_memberships.user_id = users.user_id where row_number=1```

I think you have overcomplicated the problem you want to solve.
You seem to want aggregation:
select site_id, user_id, max(created_on)
from site_memberships sm
group by site_id, user_id;
If you had additional columns that you wanted, you could use distinct on instead:
select distinct on (site_id, user_id) sm.*
from site_memberships sm
order by site_id, user_id, created_on desc;

sql - select single ID for each group with the lowest value

Consider the following table:
ID GroupId Rank
1 1 1
2 1 2
3 1 1
4 2 10
5 2 1
6 3 1
7 4 5
I need an sql (for MS-SQL) select query selecting a single Id for each group with the lowest rank. Each group needs to only return a single ID, even if there are two with the same rank (as 1 and 2 do in the above table). I've tried to select the min value, but the requirement that only one be returned, and the value to be returned is the ID column, is throwing me.
Does anyone know how to do this?

Use row_number():
select t.*
from (select t.*,
row_number() over (partition by groupid order by rank) as seqnum
from t
) t
where seqnum = 1;

SQL MAX(column) With Additional Criteria

I have a single table, where I want to return a list of the MAX(id) GROUPed by another identifier. However I have a third column that, when it meets a certain criteria, "trumps" rows that don't meet that criteria.
Probably easier to explain with an example. Sample table has:
UniqueId (int)
GroupId (int)
IsPriority (bit)
Raw data:
UniqueId GroupId IsPriority
-----------------------------------
1 1 F
2 1 F
3 1 F
4 1 F
5 1 F
6 2 T
7 2 T
8 2 F
9 2 F
10 2 F
So, because no row in groupId 1 has IsPriority set, we return the highest UniqueId (5). Since groupId 2 has rows with IsPriority set, we return the highest UniqueId with that value (7).
So output would be:
5
7
I can think of ways to brute force this, but I am looking to see if I can do this in a single query.

SQL Fiddle Demo
WITH T
AS (SELECT *,
ROW_NUMBER() OVER (PARTITION BY GroupId
ORDER BY IsPriority DESC, UniqueId DESC ) AS RN
FROM YourTable)
SELECT UniqueId,
GroupId,
IsPriority
FROM T
WHERE RN = 1

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How to "filter" records in Hive table? - hive

Related

Getting count of last records of 2 columns SQL

How to select the last value which is not null?

Postgresql query to filter latest data based on 2 columns

sql - select single ID for each group with the lowest value

SQL MAX(column) With Additional Criteria

Categories

Resources