Slightly different greatest-n-per-group - sql

I have read this comment, which explains the greatest-n-per-group problem and its solution. Unfortunately, I am facing a slightly different variant of the problem, and I am failing to find a solution for it.
Let's suppose I have a table with some basic info regarding users. Due to implementation, this info may or may not repeat itself:
+----+-------------------+----------------+---------------+
| id | user_name | user_name_hash | address |
+----+-------------------+----------------+---------------+
| 1 | peter_jhones | 0xFF321345 | Some Av |
| 2 | sally_whiterspoon | 0x98AB5454 | Certain St |
| 3 | mark_jackobson | 0x0102AB32 | Some Av |
| 4 | mark_jackobson | 0x0102AB32 | Particular St |
+----+-------------------+----------------+---------------+
As you can see, mark_jackobson appears twice, although its address is different in each appearance.
Every now and then, an ETL process queries new user_names and fetches the most recent record of each. Afterwards, it stores the user_name_hash in a table to mark that it has already imported that user_name:
+----------------+
| user_name_hash |
+----------------+
| 0xFF321345 |
| 0x98AB5454 |
+----------------+
Everything begins with the following query:
SELECT DISTINCT user_name_hash
FROM my_table
EXCEPT
SELECT user_name_hash
FROM my_hash_table
This way, I am able to select the new hashes from my table. Since I need to query the most recent occurrence of a hash, I wrap it as a sub-query:
SELECT MAX(id)
FROM my_table
WHERE user_name_hash IN (
SELECT DISTINCT user_name_hash
FROM my_table
EXCEPT
SELECT user_name_hash
FROM my_hash_table)
GROUP BY user_name_hash
Perfect! With the ids of my new users, I can query the addresses as follows:
SELECT
address,
user_name_hash
FROM my_table
WHERE Id IN (
SELECT MAX(id)
FROM my_table
WHERE user_name_hash IN (
SELECT DISTINCT user_name_hash
FROM my_table
EXCEPT
SELECT user_name_hash
FROM my_hash_table)
GROUP BY user_name_hash)
From my perspective, the above query works, but it does not seem optimal. Reading this comment, I noticed I could query the same data using joins. Since I am failing to write the desired query, could anyone help me out and point me in the right direction?
This is the query I have attempted, without success.
SELECT
tb1.address,
tb1.user_name_hash
FROM my_table tb1
INNER JOIN my_table tb2
ON tb1.user_name_hash = tb2.user_name_hash
LEFT JOIN my_hash_table ht
ON tb1.user_name_hash = ht.user_name_hash AND tb1.id > tb2.id
WHERE ht.user_name_hash IS NULL;
Thanks in advance.
EDIT > I am working with PostgreSQL

I believe you are looking for something like this:
SELECT
address,
user_name_hash
FROM my_table t1
JOIN (
SELECT MAX(id) maxid
FROM my_table t2
WHERE NOT EXISTS (
SELECT 1
FROM my_hash_table t3
WHERE t2.user_name_hash = t3.user_name_hash
)
GROUP BY user_name_hash
) t ON t1.ID = t.maxid
I'm using NOT EXISTS instead of EXCEPT since it is more clear to the optimizer.
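To check the query above against the sample data from the question, here is a minimal sketch. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for PostgreSQL, and it stores the hashes as text; both are assumptions made only for this illustration.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE my_table (id INTEGER PRIMARY KEY, user_name TEXT,
                       user_name_hash TEXT, address TEXT);
CREATE TABLE my_hash_table (user_name_hash TEXT);
INSERT INTO my_table VALUES
  (1, 'peter_jhones',      '0xFF321345', 'Some Av'),
  (2, 'sally_whiterspoon', '0x98AB5454', 'Certain St'),
  (3, 'mark_jackobson',    '0x0102AB32', 'Some Av'),
  (4, 'mark_jackobson',    '0x0102AB32', 'Particular St');
INSERT INTO my_hash_table VALUES ('0xFF321345'), ('0x98AB5454');
""")

# Same shape as the query above: keep only hashes that were not imported yet,
# take the highest id per hash, then join back to fetch the address.
new_users = conn.execute("""
SELECT t1.address, t1.user_name_hash
FROM my_table t1
JOIN (
    SELECT MAX(t2.id) AS maxid
    FROM my_table t2
    WHERE NOT EXISTS (
        SELECT 1
        FROM my_hash_table t3
        WHERE t3.user_name_hash = t2.user_name_hash
    )
    GROUP BY t2.user_name_hash
) t ON t1.id = t.maxid
""").fetchall()

print(new_users)
```

Only mark_jackobson's hash is missing from my_hash_table, so the single row returned is the one with the highest id for that hash (id 4, 'Particular St').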

You may get better performance using a left outer join (to get the newest records not already imported) and then computing the max id for these records (subquery in the HAVING clause):
SELECT t1.address,
t1.user_name_hash,
MAX(id) AS maxid
FROM my_table t1
LEFT JOIN my_hash_table th ON t1.user_name_hash = th.user_name_hash
WHERE th.user_name_hash IS NULL
GROUP BY t1.address,
t1.user_name_hash
HAVING MAX(t1.id) = (SELECT MAX(t2.id)
FROM my_table t2
WHERE t2.user_name_hash = t1.user_name_hash)


Use recursion to get all children of node in Bigquery SQL table

I'm working with a dataset in bigquery that has parent-child relationships, but doesn't indicate final_parent...
My data looks something like this:
| id | parent |
| -----| --------|
| AA | AB |
| AB | AC |
| .. | .. |
The rows are either questions or answers. All answers roll up to a single question, but you can also answer an answer, so there is a recursive graph structure. What I want is to get all the answers to a single question, starting with the row id of that question.
I generated the following query - I think it is logically correct for the task:
WITH RECURSIVE tbl_1 AS(
(SELECT *
FROM source_table
WHERE (id = xxxxxxxxxxx) OR (parent = xxxxxxxxxxx))
UNION ALL
(SELECT *
FROM source_table
WHERE (parent IN (SELECT DISTINCT id FROM tbl_1)
AND (id NOT IN (SELECT DISTINCT id FROM tbl_1))))
)
SELECT *
FROM tbl_1
However I get the following error...
ERROR:
400 A recursive reference from inside an expression subquery is not allowed at [9:49]
I think this is just something that hasn't been implemented yet in bigquery? Any advice on how to do it despite this? Thanks so much!!
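Try the following (it assumes your_table has columns question and answer, which play the roles of parent and id in your data):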
with recursive tbl as (
select *, 1 pos from your_table
where question not in (select answer from your_table)
union all
select t1.question, t2.answer, pos + 1
from tbl t1
join your_table t2
on t2.question = t1.answer
)
select question, string_agg(answer order by pos) answers
from tbl
group by question
(The example's dummy data and its output are not reproduced here.)
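As a self-contained sketch of the same recursive pattern applied to the id/parent layout from the question: the recursive term joins the CTE to the source table instead of referencing it inside an IN subquery, which is the construct the error message complains about. The sketch runs on Python's built-in sqlite3 rather than BigQuery, the sample rows beyond AA and AB are made up, and the function name answers_for is hypothetical. It also assumes the data contains no cycles, since UNION ALL would otherwise recurse forever.

```python
import sqlite3

# SQLite stands in for BigQuery here (assumption); both accept WITH RECURSIVE.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_table (id TEXT PRIMARY KEY, parent TEXT);
INSERT INTO source_table VALUES
  ('AC', NULL),   -- a question (no parent)
  ('AB', 'AC'),   -- answer to AC
  ('AA', 'AB'),   -- answer to an answer
  ('AD', 'AC'),   -- another answer to AC
  ('ZZ', NULL),   -- an unrelated question
  ('ZY', 'ZZ');
""")

def answers_for(root_id):
    """Return the ids of all direct and indirect answers below root_id."""
    sql = """
    WITH RECURSIVE thread(id, parent) AS (
        SELECT id, parent FROM source_table WHERE id = :root
        UNION ALL
        -- The recursive reference appears as a join, not inside a subquery.
        SELECT s.id, s.parent
        FROM source_table s
        JOIN thread t ON s.parent = t.id
    )
    SELECT id FROM thread WHERE id <> :root ORDER BY id
    """
    return [row[0] for row in conn.execute(sql, {"root": root_id})]

print(answers_for("AC"))
```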

How to filter out conditions based on a group by in JPA?

I have a table like
| customer | profile | status | date |
| 1 | 1 | DONE | mmddyy |
| 1 | 1 | DONE | mmddyy |
In this case, I want to group by on the profile ID having max date. Profiles can be repeated. I've ruled out Java 8 streams as I have many conditions here.
I want to convert the following SQL into JPQL:
select customer, profile, status, max(date)
from tbl
group by profile, customer,status, date, column-k
having count(profile)>0 and status='DONE';
Can someone tell me how to write this query in JPQL, assuming it is correct in SQL? If I declare columns in the select, they are needed in the group by as well, and the query results are different.
I am guessing that you want the most recent customer/profile combination that is done.
If so, the correct SQL is:
select t.*
from t
where t.date = (select max(t2.date)
from t t2
where t2.customer = t.customer and t2.profile = t.profile
) and
t.status = 'DONE';
I don't know how to convert this to JPQL, but you might as well start with working SQL code.
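To see what that SQL returns, here is a small sketch run through Python's built-in sqlite3. The rows and ISO-formatted dates are invented for illustration; only the column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (customer INTEGER, profile INTEGER, status TEXT, date TEXT);
-- Hypothetical sample rows; ISO dates sort correctly as text.
INSERT INTO t VALUES
  (1, 1, 'DONE',    '2020-01-01'),
  (1, 1, 'DONE',    '2020-02-01'),
  (1, 2, 'DONE',    '2020-01-15'),
  (1, 2, 'PENDING', '2020-03-01'),
  (2, 1, 'DONE',    '2020-01-10');
""")

# Latest row per customer/profile, kept only if that latest row is DONE.
latest_done = conn.execute("""
SELECT t.customer, t.profile, t.status, t.date
FROM t
WHERE t.date = (SELECT MAX(t2.date)
                FROM t t2
                WHERE t2.customer = t.customer AND t2.profile = t.profile)
  AND t.status = 'DONE'
ORDER BY t.customer, t.profile
""").fetchall()

print(latest_done)
```

Note that customer 1 / profile 2 is dropped because its most recent row is PENDING. If the intent is instead "the latest DONE row, regardless of later rows in another status", the status filter would also have to go inside the subquery.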
In your query, the date column is not needed in the group by, and status='DONE' should be moved to the where clause:
select customer, profile, status, max(date)
from tbl
where status='DONE'
group by profile, customer, status
having count(profile)>0

Get specific row from each group

My question is very similar to this, except I want to be able to filter by some criteria.
I have a table "DOCUMENT" which looks something like this:
|ID|CONFIG_ID|STATE |MAJOR_REV|MODIFIED_ON|ELEMENT_ID|
+--+---------+----------+---------+-----------+----------+
| 1|1234 |Published | 2 |2019-04-03 | 98762 |
| 2|1234 |Draft | 1 |2019-01-02 | 98762 |
| 3|5678 |Draft | 3 |2019-01-02 | 24244 |
| 4|5678 |Published | 2 |2017-10-04 | 24244 |
| 5|5678 |Draft | 1 |2015-05-04 | 24244 |
It's actually a few more columns, but I'm trying to keep this simple.
For each CONFIG_ID, I would like to select the latest (MAX(MAJOR_REV) or MAX(MODIFIED_ON)) - but I might want to filter by additional criteria, such as state (e.g., the latest published revision of a document) and/or date (the latest revision, published or not, as of a specific date; or: all documents where a revision was published/modified within a specific date interval).
To make things more interesting, there are some other tables I want to join in.
Here's what I have so far:
SELECT
allDocs.ID,
d.CONFIG_ID,
d.[STATE],
d.MAJOR_REV,
d.MODIFIED_ON,
d.ELEMENT_ID,
f.ID FILE_ID,
f.[FILENAME],
et.COLUMN1,
e.COLUMN2
FROM DOCUMENT allDocs -- Get all document revisions
CROSS APPLY ( -- Then for each config ID, only look at the latest revision
SELECT TOP 1
ID,
MODIFIED_ON,
CONFIG_ID,
MAJOR_REV,
ELEMENT_ID,
[STATE]
FROM DOCUMENT
WHERE CONFIG_ID=allDocs.CONFIG_ID
ORDER BY MAJOR_REV desc
) as d
LEFT OUTER JOIN ELEMENT e ON e.ID = d.ELEMENT_ID
LEFT OUTER JOIN ELEMENT_TYPE et ON e.ELEMENT_TYPE_ID=et.ID
LEFT OUTER JOIN TREE t ON t.NODE_ID = d.ELEMENT_ID
OUTER APPLY ( -- This is another optional 1:1 relation, but it's wrongfully implemented as m:n
SELECT TOP 1
FILE_ID
FROM DOCUMENT_FILE_RELATION
WHERE DOCUMENT_ID=d.ID
ORDER BY MODIFIED_ON DESC
) as df -- There should never be more than 1, but we're using TOP 1 just in case, to avoid duplicates
LEFT OUTER JOIN [FILE] f on f.ID=df.FILE_ID
WHERE
allDocs.CONFIG_ID = '5678' -- Just for testing purposes
and d.state ='Released' -- One possible filter criterion, there may be others
It looks like the results are correct, but multiple identical rows are returned.
My guess is that for documents with 4 revisions, the same values are found 4 times and returned.
A simple SELECT DISTINCT would solve this, but I'd prefer to fix my query.
This would be a classic row_number & partition by question I think.
;with rows as
(
select <your-columns>,
row_number() over (partition by config_id order by <whatever you want>) as rn
from document
join <anything else>
where <whatever>
)
select * from rows where rn=1
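Filled in with the sample DOCUMENT rows from the question, the template above could look like the following sketch. It runs on Python's built-in sqlite3 instead of SQL Server (window functions need SQLite 3.25 or newer), ranks by MAJOR_REV, and leaves out the joined tables. The function name latest_ids and the optional state filter are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DOCUMENT (ID INTEGER PRIMARY KEY, CONFIG_ID TEXT, STATE TEXT,
                       MAJOR_REV INTEGER, MODIFIED_ON TEXT, ELEMENT_ID INTEGER);
INSERT INTO DOCUMENT VALUES
  (1, '1234', 'Published', 2, '2019-04-03', 98762),
  (2, '1234', 'Draft',     1, '2019-01-02', 98762),
  (3, '5678', 'Draft',     3, '2019-01-02', 24244),
  (4, '5678', 'Published', 2, '2017-10-04', 24244),
  (5, '5678', 'Draft',     1, '2015-05-04', 24244);
""")

def latest_ids(state=None):
    """IDs of the highest MAJOR_REV per CONFIG_ID, optionally within one state."""
    sql = """
    WITH ranked AS (
        SELECT ID, CONFIG_ID,
               ROW_NUMBER() OVER (PARTITION BY CONFIG_ID
                                  ORDER BY MAJOR_REV DESC) AS rn
        FROM DOCUMENT
        -- Filter before ranking so rn = 1 is the latest row that matches.
        WHERE :state IS NULL OR STATE = :state
    )
    SELECT ID FROM ranked WHERE rn = 1 ORDER BY CONFIG_ID
    """
    return [row[0] for row in conn.execute(sql, {"state": state})]

print(latest_ids())             # latest revision of each document
print(latest_ids("Published"))  # latest published revision of each document
```

Because the filter sits inside the CTE, each CONFIG_ID yields at most one row, so no DISTINCT is needed.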

`INTERSECT` does not return anything from two tables, separately values are returned fine

I'm not sure what I am doing wrong here: I haven't touched SQL queries for several years, and the MSSQL query language is a bit strange to me, but after 30 minutes of googling I still cannot find the answer.
Problem
I have two queries that work perfectly fine:
SELECT COUNT(*) AS 'NumberOfAccounts' FROM Accounts
SELECT COUNT(*) AS 'NumberOfUsers' FROM Users
I need to get this information in one go in my API response since I don't want to execute two statements. How can I combine them into one query so it will return table as follows:
+------------------+---------------+
| NumberOfAccounts | NumberOfUsers |
+------------------+---------------+
| 10 | 16 |
+------------------+---------------+
What I have tried
UNION
SELECT COUNT(*) AS 'NumberOfAccounts' FROM Accounts UNION SELECT COUNT(*) AS 'NumberOfUsers' FROM Users
This gives me the counts of both tables, but it puts both of them into the NumberOfAccounts column, so I cannot parse the result reliably.
+------------------+
| NumberOfAccounts |
+------------------+
| 10 |
| 16 |
+------------------+
INTERSECT
SELECT COUNT(*) AS 'NumberOfAccounts' FROM Accounts INTERSECT SELECT COUNT(*) AS 'NumberOfUsers' FROM Users
This just gives me an empty result with only the NumberOfAccounts column in it.
You can just put these as subqueries in a select:
SELECT (SELECT COUNT(*) FROM Accounts) as NumberOfAccounts,
(SELECT COUNT(*) FROM Users) as NumberOfUsers
In SQL Server, no FROM clause is needed.
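A quick way to try this shape of query is the sketch below, which uses Python's built-in sqlite3 in place of SQL Server (SQLite also accepts a SELECT without FROM). The table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by name
conn.executescript("""
CREATE TABLE Accounts (id INTEGER PRIMARY KEY);
CREATE TABLE Users (id INTEGER PRIMARY KEY);
-- Hypothetical data: three accounts, two users.
INSERT INTO Accounts (id) VALUES (1), (2), (3);
INSERT INTO Users (id) VALUES (1), (2);
""")

# Two scalar subqueries side by side give one row with two columns.
row = conn.execute("""
SELECT (SELECT COUNT(*) FROM Accounts) AS NumberOfAccounts,
       (SELECT COUNT(*) FROM Users)    AS NumberOfUsers
""").fetchone()

counts = dict(row)
print(counts)
```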
UNION is the wrong usage here. Union will "merge" rows of identical tables (or identical selects) and not columns.
One solution might be:
SELECT AccountCount, UserCount FROM
(SELECT COUNT(*) AS AccountCount, 1 AS Id FROM Accounts) AS a
JOIN
(SELECT COUNT(*) AS UserCount, 1 as Id FROM Users) AS u ON (a.Id = u.Id)
Be aware of the artificial surrogate key 1 you need to insert to join both sub-selects together.
For completeness' sake, with UNION ALL you would do:
SELECT 'NumberOfAccounts' AS what, COUNT(*) AS howmany FROM accounts
UNION ALL
SELECT 'NumberOfUsers' AS what, COUNT(*) AS howmany FROM users;
which results in
+------------------+---------+
| what | howmany |
+------------------+---------+
| NumberOfAccounts | 10 |
| NumberOfUsers | 16 |
+------------------+---------+
And another variation:
WITH cte AS
(
SELECT COUNT(*) AS cntAccounts, 0 AS cntUsers FROM accounts
UNION ALL
SELECT 0 AS cntAccounts, COUNT(*) AS cntUsers FROM users
)
SELECT
SUM(cntAccounts) AS NumberOfAccounts
,SUM(cntUsers ) AS NumberOfUsers
FROM cte
If you need better performance, you can read the row counts from sys.dm_db_partition_stats instead of counting the rows in each table. Keep in mind that the documentation describes row_count there as an approximate number:
SELECT (
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('Accounts')
AND (index_id=0 or index_id=1)) NumberOfAccounts,
(
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('Users')
AND (index_id=0 or index_id=1)) NumberOfUsers

PostgreSQL if query?

Is there a way to select records based on an if statement?
My table looks like this:
id | num | dis
1 | 4 | 0.5234333
2 | 4 | 8.2234
3 | 8 | 2.3325
4 | 8 | 1.4553
5 | 4 | 3.43324
And I want to select the num and dis where dis is the lowest number... So, a query that will produce the following results:
id | num | dis
1 | 4 | 0.5234333
4 | 8 | 1.4553
If you want all the rows with the minimum value within the group:
SELECT id, num, dis
FROM table1 T1
WHERE dis = (SELECT MIN(dis) FROM table1 T2 WHERE T1.num = T2.num)
Or you could use a join to get the same result:
SELECT T1.id, T1.num, T1.dis
FROM table1 T1
JOIN (
SELECT num, MIN(dis) AS dis
FROM table1
GROUP BY num
) T2
ON T1.num = T2.num AND T1.dis = T2.dis
If you only want a single row from each group, even if there are ties then you can use this:
SELECT id, dis, num FROM (
SELECT id, dis, num, ROW_NUMBER() OVER (PARTITION BY num ORDER BY dis) rn
FROM table1
) T1
WHERE rn = 1
Unfortunately this won't be very efficient. If you need something more efficient then please see Quassnoi's page on selecting rows with a groupwise maximum for PostgreSQL. Here he suggests several ways to perform this query and explains the performance of each. The summary from the article is as follows:
Unlike MySQL, PostgreSQL implements several clean and documented ways to select the records holding group-wise maximums, including window functions and DISTINCT ON.
However to the lack of the loose index scan support by the PostgreSQL’s optimizer and the less efficient usage of indexes in PostgreSQL, the queries using these function take too long.
To work around these problems and improve the queries against the low cardinality grouping conditions, a certain solution described in the article should be used.
This solution uses recursive CTE’s to emulate loose index scan and is very efficient if the grouping columns have low cardinality.
Use this:
SELECT DISTINCT ON (num) id, num, dis
FROM tbl
ORDER BY num, dis
Or if you intend to use other RDBMS in future, use this:
select * from tbl a where dis =
(select min(dis) from tbl b where b.num = a.num)
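The practical difference between the portable correlated-subquery form and a ROW_NUMBER() query shows up when a group has ties. The sketch below uses Python's built-in sqlite3 (DISTINCT ON is PostgreSQL-specific, so it is not shown) with the rows from the question plus one hypothetical row, id 6, that ties with id 4. The extra id tiebreaker in the window's ORDER BY is an addition for this sketch so the chosen row is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl (id INTEGER PRIMARY KEY, num INTEGER, dis REAL);
INSERT INTO tbl VALUES
  (1, 4, 0.5234333),
  (2, 4, 8.2234),
  (3, 8, 2.3325),
  (4, 8, 1.4553),
  (5, 4, 3.43324),
  (6, 8, 1.4553);  -- hypothetical extra row that ties with id 4
""")

# Correlated subquery: returns every row that holds the group minimum.
all_minimums = [r[0] for r in conn.execute("""
SELECT id FROM tbl a
WHERE dis = (SELECT MIN(dis) FROM tbl b WHERE b.num = a.num)
ORDER BY id
""")]

# ROW_NUMBER: returns exactly one row per group, ties broken by id.
one_per_group = [r[0] for r in conn.execute("""
SELECT id FROM (
    SELECT id, ROW_NUMBER() OVER (PARTITION BY num ORDER BY dis, id) AS rn
    FROM tbl
) ranked
WHERE rn = 1
ORDER BY id
""")]

print(all_minimums, one_per_group)
```

With the tie in place, the first query returns ids 1, 4 and 6, while the second returns only ids 1 and 4.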
If you need to have IF logic you can use PL/pgSQL.
http://www.postgresql.org/docs/8.4/interactive/plpgsql-control-structures.html
But try to solve your issue with plain SQL first if possible, since it will be faster; use PL/pgSQL only when SQL can't solve your problem.