Best way to group records with MAX revision - sql

I have a source table like this:
table_a:
id | revision | status
1 | 0 | APPROVED
1 | 1 | PENDING
I am trying to get distinct records from table_a having the latest revision and show the latest approved revision for each one of them.
result table_b:
id | latest_rev | latest_approved_rev
1 | 1 | 0
I have written the following query :
SELECT a.id,
a.revision AS latest_rev,
b.latest_approved_rev
FROM table_a a
LEFT JOIN (SELECT id,
MAX(revision) AS latest_approved_rev
FROM table_a
WHERE status = 'APPROVED'
GROUP BY id) b ON b.id = a.id
WHERE a.revision = (SELECT MAX(revision)
FROM table_a
WHERE id = a.id)
My query seems to work fine, but I am wondering if I was missing something and/or if there is another way to make the query better/faster.

Seems you could achieve this with some (conditional) aggregation:
SELECT id,
MAX(revision) AS latest_rev,
MAX(CASE status WHEN 'APPROVED' THEN revision END) AS latest_approved_rev
FROM (VALUES(1,0,'APPROVED'),
(1,1,'PENDING'))V(id,revision,status)
GROUP BY id;
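To try the conditional aggregation against an actual table, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory database and the extra row for id 2 (which has no approved revision) are assumptions added for illustration; the CASE is written in its searched form, which should behave the same as the simple form above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (id INTEGER, revision INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?, ?)",
    [
        (1, 0, "APPROVED"),
        (1, 1, "PENDING"),
        (2, 0, "PENDING"),  # hypothetical id with no approved revision
    ],
)

query = """
SELECT id,
       MAX(revision) AS latest_rev,
       -- CASE yields NULL for non-approved rows, and MAX ignores NULLs
       MAX(CASE WHEN status = 'APPROVED' THEN revision END) AS latest_approved_rev
FROM table_a
GROUP BY id
ORDER BY id
"""
rows = conn.execute(query).fetchall()
print(rows)
```

An id that was never approved comes back with NULL (None in Python) in latest_approved_rev, so it is not dropped from the result.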

You have a correct query (if I understand your requirement). You are very close to an ideal query. But your outer WHERE clause contains a correlated (dependent) subquery, and those don't always perform well.
You can think of this as the JOIN of two subqueries. The one you have.
SELECT id,
MAX(revision) AS revision
FROM table_a
WHERE status = 'APPROVED'
GROUP BY id
and this one.
SELECT id,
MAX(revision) AS revision
FROM table_a
GROUP BY id
You FULL JOIN them together. Like this.
SELECT latest.id,
latest.revision as latest_rev,
approved.revision as approved_rev
FROM (
SELECT id,
MAX(revision) AS revision
FROM table_a
GROUP BY id
) latest
FULL JOIN (
SELECT id,
MAX(revision) AS revision
FROM table_a
WHERE status = 'APPROVED'
GROUP BY id
) approved on latest.id = approved.id
In this case you can actually use LEFT JOIN because you know every id in the approved subquery is also present in the latest subquery.
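A minimal sketch of the LEFT JOIN variant, run through Python's sqlite3 module so it can be checked locally. The in-memory table and the row for id 2 (never approved) are assumptions for illustration, not part of the question's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (id INTEGER, revision INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?, ?)",
    [(1, 0, "APPROVED"), (1, 1, "PENDING"), (2, 0, "PENDING")],  # id 2 is made up
)

query = """
SELECT latest.id,
       latest.revision   AS latest_rev,
       approved.revision AS latest_approved_rev
FROM (SELECT id, MAX(revision) AS revision
      FROM table_a
      GROUP BY id) latest
LEFT JOIN (SELECT id, MAX(revision) AS revision
           FROM table_a
           WHERE status = 'APPROVED'
           GROUP BY id) approved
       ON approved.id = latest.id
ORDER BY latest.id
"""
joined = conn.execute(query).fetchall()
print(joined)
```

Because the latest subquery contains every id, the LEFT JOIN keeps ids without an approved revision and leaves their approved column NULL.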

Related

Select value in a SQL query based on count() being non-zero

I am trying to write a SQL query that would "calculate" status column of the row based on JOIN with statuses table.
My record is basically: id | name | statusId, which is foreign key to statuses table. That table has: id | statusName
I collect count() for each DISTINCT statusId. Now I need to return the id of a status based on the following idea: if count(status0) > 0, return status0; otherwise check status1, then status2, and so on.
Can I write a SQL query that returns this status using JOIN, WHERE, HAVING, etc., without if/else logic?
If you need the statuses that are being used, then use exists:
select s.*
from statuses s
where exists (select 1 from records r where r.statusid = s.id);
Actually, if you just want the ids, you can use:
select distinct r.statusid
from records r;
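If the goal is "the first status, in priority order, that has at least one record", the EXISTS idea can be combined with ORDER BY and LIMIT. This is only a sketch: it assumes priority follows statuses.id, and the table contents are made up. LIMIT is SQLite/MySQL/PostgreSQL syntax; other engines use TOP or FETCH FIRST.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE statuses (id INTEGER PRIMARY KEY, statusName TEXT);
CREATE TABLE records  (id INTEGER PRIMARY KEY, name TEXT, statusId INTEGER);
INSERT INTO statuses VALUES (0, 'status0'), (1, 'status1'), (2, 'status2');
-- hypothetical data: nothing uses status0
INSERT INTO records VALUES (10, 'a', 1), (11, 'b', 2), (12, 'c', 2);
""")

query = """
SELECT s.id, s.statusName
FROM statuses s
WHERE EXISTS (SELECT 1 FROM records r WHERE r.statusId = s.id)
ORDER BY s.id   -- assumption: a lower id means a higher priority
LIMIT 1
"""
first_used = conn.execute(query).fetchone()
print(first_used)
```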
Rephrasing the question, do you want to just get the statuses that are present in the bigger table? How about?
SELECT s.*
FROM statuses s
WHERE s.id IN (SELECT DISTINCT statusid FROM records);
Start with this; it should get you on the right track:
select a.*, b.*, c.cnt from table1 a
join table2 b on a.statusid=b.id
join (
select statusname, count(*) cnt
from table1 a join table2 b on a.statusid=b.id
group by statusname)c
on b.statusname=c.statusname

How to compare two tables in Hive based on counts

I have below hive tables
Table_1
ID
1
1
2
Table_2
ID
1
2
2
I am comparing the two tables based on the count of each ID in both tables. I need output like this:
ID
1 - 2 records in Table_1 and 1 record in Table_2
2 - 1 record in Table_1 and 2 records in Table_2
Table_1 is the parent table.
I am using the queries below:
select count(*),ID from Table_1 group by ID;
select count(*),ID from Table_2 group by ID;
Just do a full outer join of your two queries on X.id = Y.id, then select from the result, handling the NULLs that appear when an id exists on only one side.
select coalesce(X.id, Y.id) as id, concat(coalesce(X.cnt1, 0), ' entries in table 1, ', coalesce(Y.cnt2, 0), ' entries in table 2')
from (select count(*) as cnt1, id from Table_1 group by id) X
full outer join (select count(*) as cnt2, id from Table_2 group by id) Y
on X.id = Y.id;
Try this. You can use a CASE expression to choose between "record" and "records".
SELECT m.id,
CONCAT (COALESCE(a.ct, 0), ' record in table 1, ', COALESCE(b.ct, 0),
' record in table 2')
FROM (SELECT id
FROM table_1
UNION
SELECT id
FROM table_2) m
LEFT JOIN (SELECT Count(*) AS ct,
id
FROM table_1
GROUP BY id) a
ON m.id = a.id
LEFT JOIN (SELECT Count(*) AS ct,
id
FROM table_2
GROUP BY id) b
ON m.id = b.id;
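A sketch of the CASE idea for the record/records wording, run on SQLite through Python rather than on Hive. That swap is an assumption made so the example is runnable: SQLite concatenates with || where Hive would use CONCAT, but the UNION plus LEFT JOIN structure is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_1 (id INTEGER);
CREATE TABLE table_2 (id INTEGER);
INSERT INTO table_1 VALUES (1), (1), (2);
INSERT INTO table_2 VALUES (1), (2), (2);
""")

query = """
SELECT m.id,
       COALESCE(a.ct, 0)
         || CASE WHEN COALESCE(a.ct, 0) = 1 THEN ' record' ELSE ' records' END
         || ' in table 1 and '
         || COALESCE(b.ct, 0)
         || CASE WHEN COALESCE(b.ct, 0) = 1 THEN ' record' ELSE ' records' END
         || ' in table 2' AS summary
FROM (SELECT id FROM table_1 UNION SELECT id FROM table_2) m
LEFT JOIN (SELECT id, COUNT(*) AS ct FROM table_1 GROUP BY id) a ON m.id = a.id
LEFT JOIN (SELECT id, COUNT(*) AS ct FROM table_2 GROUP BY id) b ON m.id = b.id
ORDER BY m.id
"""
summary = conn.execute(query).fetchall()
for row in summary:
    print(row)
```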
You could use this Python program to do a full comparison of 2 Hive tables:
https://github.com/bolcom/hive_compared_bq
If you want a quick comparison just based on counts, then pass the "--just-count" option (you can also specify the group by column with "--group-by-column").
The script also allows you to visually see all the differences on all rows and all columns if you want a complete validation.

SQL - How to "filter out" people who has more than 1 status

I tried to find this question here but I probably didn't know the exact term to search for.
Here is the problem:
I have this set of customers (see image). I need to keep only those with status "user_paused" or "interval_paused". The same customer_id may have more than one status, and sometimes one of them is "active". If so, that customer should not appear in my final result.
See customer 809: he shouldn't appear in my final result since he has an "active" status. All the others are fine, because they only have paused statuses.
I still couldn't figure out how to proceed from here.
Thank you so much.
SELECT DISTINCT customer_id FROM TABLE
WHERE status IN ( 'user_paused','interval_paused')
EXCEPT
SELECT DISTINCT customer_id FROM TABLE
WHERE status = 'active'
One method uses group by and having:
select customer_id
from t
group by customer_id
having sum(case when status not in ('user_paused', 'interval_paused') then 1 else 0 end) = 0;
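Here is a small sketch of the GROUP BY / HAVING approach on made-up data, using Python's sqlite3 module. The table name t comes from the query above; the rows are assumptions, since the original image is not available. Customer 809 gets both a paused and an active row to mirror the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (customer_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        (101, "user_paused"),
        (101, "interval_paused"),
        (202, "interval_paused"),
        (809, "user_paused"),
        (809, "active"),  # this row should disqualify customer 809
    ],
)

query = """
SELECT customer_id
FROM t
GROUP BY customer_id
-- count the rows that are NOT one of the paused statuses; keep groups with none
HAVING SUM(CASE WHEN status NOT IN ('user_paused', 'interval_paused')
                THEN 1 ELSE 0 END) = 0
ORDER BY customer_id
"""
paused_only = conn.execute(query).fetchall()
print(paused_only)
```

Note that this version excludes a customer for any non-paused status, not only 'active', which may or may not be what you want.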
select * from table
where customer_id in
(select customer_id from table
where status in ('interval_paused','user_paused') )
You can find all customers with a status of 'active' quite easily:
SELECT customer_id FROM table WHERE status = 'active'
If you want to exclude any customer from your results if they have an active row, you can do this in a subquery:
SELECT * FROM table WHERE /* your other query restrictions */
AND customer_id NOT IN
(
SELECT customer_id FROM table WHERE status = 'active'
)
This will let you eliminate any row with a customer_id that has any 'active' row.
Please note that subqueries are not always the most efficient solution - there could be cases where a subquery would make your query very slow.
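One more thing worth checking before relying on NOT IN: if the subquery can return a NULL customer_id, the NOT IN predicate is never true and the query returns no rows, whereas NOT EXISTS is unaffected. The sketch below shows this on SQLite via Python; the table name t and the rows (including the NULL one) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (customer_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(101, "user_paused"), (809, "active"), (None, "active")],  # NULL id on purpose
)

not_in = conn.execute("""
    SELECT customer_id FROM t
    WHERE status IN ('user_paused', 'interval_paused')
      AND customer_id NOT IN (SELECT customer_id FROM t WHERE status = 'active')
""").fetchall()

not_exists = conn.execute("""
    SELECT t1.customer_id FROM t t1
    WHERE t1.status IN ('user_paused', 'interval_paused')
      AND NOT EXISTS (SELECT 1 FROM t t2
                      WHERE t2.customer_id = t1.customer_id
                        AND t2.status = 'active')
""").fetchall()

print(not_in, not_exists)
```

If customer_id is declared NOT NULL, or the subquery filters NULLs out, the two forms return the same rows.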
To exclude any customer with 'active' in either column use the following:
select * from customers
where paused_statuses != 'active'
and status != 'active';
Not sure if you need distinct or not, but here are 2 approaches. I think both will work in Impala, but I am giving two in case you need an option. The first uses a "left excluding join" (make the join, then exclude the matched rows), which enables us to ignore the customers with an active status. The second uses an even more traditional "not exists" approach to remove customer_ids that have an active status.
select /* distinct */ t1.customer_id
from table t1
left join table t2 on t1.customer_id = t2.customer_id and t2.status = 'active'
where t2.customer_id IS NULL
and t1.status in ('interval_paused','user_paused')
;
select /* distinct */ t1.customer_id
from table t1
where t1.status in ('interval_paused','user_paused')
and NOT EXISTS (
select null
from table t2
where t1.customer_id = t2.customer_id
and t2.status = 'active'
)
;
If your existing query is complex, you can simplify these additions with a WITH clause like this:
WITH MyCTE AS (
-- place the whole existing query here
)
select /* distinct */ t1.customer_id
from MyCTE t1
left join MyCTE t2 on t1.customer_id = t2.customer_id and t2.status = 'active'
where t2.customer_id IS NULL
and t1.status in ('interval_paused','user_paused')
;
Notice that the name you give it ("MyCTE") can be reused in the subsequent query, which is a very useful feature.
In general, the structures created by WITH are called common table expressions (CTEs), in case you are wondering why I use "MyCTE" as a name.
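To see the CTE being referenced twice, here is a runnable sketch on SQLite via Python. The subscriptions table, its region column, and the region filter standing in for "the whole existing query" are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subscriptions (customer_id INTEGER, status TEXT, region TEXT)"
)
conn.executemany(
    "INSERT INTO subscriptions VALUES (?, ?, ?)",
    [
        (101, "user_paused", "EU"),
        (101, "interval_paused", "EU"),
        (809, "user_paused", "EU"),
        (809, "active", "EU"),
        (202, "interval_paused", "US"),  # removed by the stand-in query below
    ],
)

query = """
WITH MyCTE AS (
    -- stand-in for "the whole existing query"
    SELECT customer_id, status
    FROM subscriptions
    WHERE region = 'EU'
)
SELECT DISTINCT t1.customer_id
FROM MyCTE t1
LEFT JOIN MyCTE t2
       ON t1.customer_id = t2.customer_id AND t2.status = 'active'
WHERE t2.customer_id IS NULL
  AND t1.status IN ('interval_paused', 'user_paused')
ORDER BY t1.customer_id
"""
cte_result = conn.execute(query).fetchall()
print(cte_result)
```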
SELECT customer_id
FROM Customer
WHERE status IN ('user_paused', 'interval_paused')
AND customer_id NOT IN (SELECT customer_id
FROM Customer
WHERE status = 'active')
GROUP BY customer_id
OR
SELECT customer_id
FROM Customer
GROUP BY customer_id
HAVING SUM(CASE WHEN status IN ('user_paused', 'interval_paused') THEN 1 ELSE 0 END) > 0
AND SUM(CASE WHEN status = 'active' THEN 1 ELSE 0 END) = 0

SQL need help building a query

I need your help building a query.
I have two tables:
The first table (table1) holds the status history, that is, every status my product has passed through, and the second table (table2) holds the product's current status.
The id and status columns are the same in both tables.
I want to build a query that tells me how many products currently have status D, E, or F in table2 but never passed through status C in table1, for example going from status B straight to D, E, or F without passing through C.
I tried running this query:
select count(id), status
from table1 e
where status not in (C) EXISTS (SELECT *
FROM table2 c
WHERE e.id = c.id
AND status IN (D,E,F))
group by status
The query didn't return the expected results. Can you help?
As the other responders noted, you have some syntax errors. Basically, you're just missing a few words.
select count(id)
, status
from table1 t1
where status not in ('C')
and
EXISTS (
SELECT *
FROM table2 t2
WHERE t2.id = t1.id
and status in ('D','E','F')
)
group by status
;
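Note that filtering rows with status not in ('C') only skips the C rows themselves; a product that passed through C still has other history rows that match. If the requirement is "currently D, E, or F and never had C in its history", one possible reading can be sketched with NOT EXISTS. This is an interpretation of the question, not a confirmed requirement, and the sample rows are made up; it runs on SQLite via Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER, status TEXT);  -- status history
CREATE TABLE table2 (id INTEGER, status TEXT);  -- current status
INSERT INTO table1 VALUES
    (1, 'A'), (1, 'B'), (1, 'D'),   -- skipped C
    (2, 'B'), (2, 'C'), (2, 'E'),   -- passed through C
    (3, 'B'), (3, 'F'),             -- skipped C
    (4, 'B');                       -- not yet at D, E, or F
INSERT INTO table2 VALUES (1, 'D'), (2, 'E'), (3, 'F'), (4, 'B');
""")

query = """
SELECT t2.status, COUNT(*) AS products
FROM table2 t2
WHERE t2.status IN ('D', 'E', 'F')
  AND NOT EXISTS (SELECT 1
                  FROM table1 t1
                  WHERE t1.id = t2.id
                    AND t1.status = 'C')
GROUP BY t2.status
ORDER BY t2.status
"""
skipped_c = conn.execute(query).fetchall()
print(skipped_c)
```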
Alternatively, you could try solving it with IN instead. Full disclosure - this is probably not as efficient (see In vs Exists).
select count(id)
, status
from table1
where status not in ('C')
and id in
(
select id
from table2
where status in ('D','E','F')
)
group by status
;
select count(a.id), a.status
from table1 a, table2 b
where a.id = b.id
and a.status not in ('C')
group by a.status

How to compare tables and find duplicates and also find columns with different value

I have the following tables in Oracle 10g:
Table1
Name Status
a closed
b live
c live
Table2
Name Status
a final
b live
c live
There are no primary keys in either table, and I am trying to write a query that returns the matching rows without looping over both tables and comparing rows/columns. If the status column differs, the row in Table2 takes precedence.
So in the above example my query should return this:
Name Status
a final
b live
c live
Since you mentioned that neither table has a primary key, I'm assuming a row may exist in Table1, Table2, or both. The query below uses common table expressions and a window function to get that result.
WITH unionTable
AS
(
SELECT Name, Status, 1 AS ordr FROM Table1
UNION
SELECT Name, Status, 2 AS ordr FROM Table2
),
ranks
AS
(
SELECT Name, Status,
ROW_NUMBER() OVER (PARTITION BY NAME ORDER BY ordr DESC) rn
FROM unionTable
)
SELECT Name, Status
FROM ranks
WHERE rn = 1
Something like this?
SELECT table1.Name, table2.Status
FROM table1
INNER JOIN table2 ON table1.Name = table2.Name
By always returning table2.Status you've covered both the case when they're the same and when they're different (essentially it doesn't matter what the value of table1.Status is).
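If a name can exist in only one of the tables, an inner join drops it. A possible alternative that keeps the "Table2 takes precedence" rule without window functions is to take every Table2 row and add only the Table1 rows whose name is missing from Table2. The sketch below runs on SQLite via Python rather than Oracle 10g, and the extra row d in Table1 is made up to show the fallback.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (name TEXT, status TEXT);
CREATE TABLE table2 (name TEXT, status TEXT);
INSERT INTO table1 VALUES ('a', 'closed'), ('b', 'live'), ('c', 'live'),
                          ('d', 'draft');   -- hypothetical: only in table1
INSERT INTO table2 VALUES ('a', 'final'), ('b', 'live'), ('c', 'live');
""")

query = """
SELECT name, status
FROM table2
UNION ALL
SELECT t1.name, t1.status
FROM table1 t1
WHERE NOT EXISTS (SELECT 1 FROM table2 t2 WHERE t2.name = t1.name)
ORDER BY name
"""
merged = conn.execute(query).fetchall()
print(merged)
```

Since there are no keys, duplicate names within a single table would pass straight through here; deduplicate first if that can happen.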