How to include a row when using HAVING COUNT(*) = 1 in BigQuery - google-bigquery

I have a BigQuery table with 30+ columns and I want to SELECT * where session is unique.
I have this query:
SELECT *
FROM `table.id`
WHERE session IN (
SELECT session
FROM `table.id`
GROUP BY session
HAVING COUNT(*) = 1
)
And it works, but I just learned from another question that HAVING COUNT(*) = 1 excludes duplicated rows entirely:
Note that DISTINCT is used to show distinct records including 1 record from duplicate too. On the other hand HAVING COUNT(*) = 1 is checking only records which are not duplicate.
For a simple example, if session has: 1, 1, 2, 3
DISTINCT will result in: 1, 2, 3
HAVING COUNT(*) = 1 will result in: 2, 3
I need the DISTINCT result, the one that includes one entry of the duplicate.
Can anyone help me? Thanks in advance, kind regards

Maybe ROW_NUMBER?
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY session) as row_num
FROM `table.id`
)
WHERE row_num = 1
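To see the difference between the two approaches on the 1, 1, 2, 3 example, here is a minimal sketch that runs both queries against an in-memory SQLite database through Python's sqlite3 module. It is not BigQuery: the table name t, the page column and its values are made up for illustration, and ORDER BY page is added inside the window only so that the kept row is deterministic.

```python
import sqlite3

# Hypothetical sample data: session 1 appears twice, sessions 2 and 3 once.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (session INTEGER, page TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "home"), (1, "cart"), (2, "home"), (3, "about")],
)

# HAVING COUNT(*) = 1 keeps only sessions that occur exactly once.
only_singletons = conn.execute("""
    SELECT session
    FROM t
    WHERE session IN (
        SELECT session FROM t GROUP BY session HAVING COUNT(*) = 1
    )
    ORDER BY session
""").fetchall()

# ROW_NUMBER() keeps one full row per session, duplicated sessions included.
one_per_session = conn.execute("""
    SELECT session, page
    FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY session ORDER BY page) AS row_num
        FROM t
    )
    WHERE row_num = 1
    ORDER BY session
""").fetchall()

print(only_singletons)  # [(2,), (3,)]
print(one_per_session)  # [(1, 'cart'), (2, 'home'), (3, 'about')]
```

Without an ORDER BY inside the window, which of the duplicated rows survives is arbitrary, so add one if it matters which row you keep.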

Related

Select specific data from data only if certain fields can be grouped by

I have the following data:
ID Date Num ClientID Dest
--------------------------------------------------------
123 04/29/2021 -2222 H1234 -1
123 04/29/2021 1 H1234 3
345 04/29/2021 -2222 H3456 -1
345 04/29/2021 1 H8888 .1
BTW: this does not include all the fields, just what I'm currently using for my query.
For every ID in the above table I'll always have 2 records. There are 2 scenarios that can take place:
As for ID = 123, the ClientID is the same
As for ID = 345, the ClientID is different
I'm trying to return the following data:
ID Date Num ClientID Dest
123 04/29/2021 1 H1234 3
The reason I'm returning only this row is because:
I only want 1 row per ID, where the ClientID is the same for both rows
Only need the record that does not have -2222, where the ClientID is the same for both rows
If the ClientID is different for the same ID (ex: 345), then completely skip these records.
Now the numbers for DEST can vary, so we can't rely on one being -1 and the other being positive. However, the NUM field will always be -2222 for one row and 1 for the 2nd row (which is the row that I want returned).
I'm not sure how best to do this. I thought about the alternative of creating a CTE, counting the ClientID, and selecting the data if the count = 2. The problem I find is with the DEST field: I know that I can do Max(NUM), but since the DEST field can vary I wouldn't know how to select it.
Here is what I tried:
WITH Ranking AS (
SELECT Rank() OVER (PARTITION BY c.ID, c.date1, c.ClientID ORDER BY num asc) x, c.*
FROM cte c
)
SELECT * FROM Ranking WHERE x = 2
I'm not sure if this is a good approach. I guess it does the job, but any thoughts?
One solution is to use the ever-useful window functions: row_number to select which row of the pair to use, and lead to check whether both ClientIds are the same. This would work even if the specific values you have should change, and it is more performant than hitting the table twice:
select id, date, num, clientid, Dest from (
select *,
Row_Number() over(partition by id order by num desc) rn, -- num desc puts the row with num = 1 first, ahead of -2222
case when Lead(clientid) over(partition by id order by num desc)=clientid then 1 else 0 end same
from t
)t
where rn=1 and same=1
Just select all rows where Num isn't -2222 and the ID is in a subquery grouped by id and having only one distinct client id.
SELECT *
FROM tbl
WHERE ID IN (SELECT ID FROM tbl GROUP BY ID HAVING COUNT(DISTINCT ClientId) = 1)
AND Num != -2222
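As a quick check of both ideas against the sample data from the question, here is a minimal sketch using an in-memory SQLite database from Python. The table name tbl follows the second answer. The window-function variant sorts Num descending so that the row kept as rn = 1 is the one that is not -2222; that sort direction is my assumption about what the question wants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (ID INTEGER, Date TEXT, Num INTEGER, ClientID TEXT, Dest REAL)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?, ?)",
    [
        (123, "04/29/2021", -2222, "H1234", -1),
        (123, "04/29/2021", 1, "H1234", 3),
        (345, "04/29/2021", -2222, "H3456", -1),
        (345, "04/29/2021", 1, "H8888", 0.1),
    ],
)

# Subquery approach: IDs with a single distinct ClientID, minus the -2222 row.
via_subquery = conn.execute("""
    SELECT ID, Date, Num, ClientID, Dest
    FROM tbl
    WHERE ID IN (SELECT ID FROM tbl GROUP BY ID HAVING COUNT(DISTINCT ClientID) = 1)
      AND Num != -2222
""").fetchall()

# Window approach: rn = 1 is the Num = 1 row; LEAD looks at the -2222 row's ClientID.
via_window = conn.execute("""
    SELECT ID, Date, Num, ClientID, Dest
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Num DESC) AS rn,
               CASE WHEN LEAD(ClientID) OVER (PARTITION BY ID ORDER BY Num DESC) = ClientID
                    THEN 1 ELSE 0 END AS same
        FROM tbl
    )
    WHERE rn = 1 AND same = 1
""").fetchall()

print(via_subquery)  # [(123, '04/29/2021', 1, 'H1234', 3.0)]
print(via_window)    # [(123, '04/29/2021', 1, 'H1234', 3.0)]
```

Both return only the ID = 123 row with Num = 1, and ID = 345 is skipped because its two ClientIDs differ.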

SQL query looping for each value in a list

New to SQL here - I am trying to get 1 row from a table matching a particular criterion
Typically this would look like
SELECT TOP 1 *
FROM myTable
WHERE id = 'abc'
The output may look like
value id
--------------
1 abc
The table has many entries for an 'id', and I am trying to get one entry per 'id'. Now I have list of 'id's. How would I execute something like
SELECT TOP 1 *
FROM myTable
FOR EACH id
WHERE id IN ('abc', 'edf', 'fgh')
Expecting result like
value id
--------------
1 abc
10 edf
12 fgh
I do not know if it is some sort of union or concat operation, but I would like to learn. I am working on Azure SQL Server
The table has many entries for an 'id', and I am trying to get one entry per 'id'. Now I have list of 'id's.
A typical method is row_number():
select t.*
from (select t.*,
row_number() over (partition by id order by id) as seqnum
from mytable t
) t
where seqnum = 1;
Note: you can filter on particular ids, if you want. It is unclear if that is really required for your question.
If you happen to be using SQL Server (as select top suggests), you can use the more concise, but somewhat less performant:
select top (1) with ties t.*
from mytable t
order by row_number() over (partition by id order by (select null));
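For anyone who wants to try the row_number() approach with the list of ids from the question, here is a minimal sketch against an in-memory SQLite database from Python (SQLite has no TOP, so only the first query is shown). The sample rows are invented, and ordering by value inside the window is an assumption made so that the chosen row per id is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (value INTEGER, id TEXT)")
# Hypothetical data: several entries per id, plus one id that is not in the list.
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?)",
    [(1, "abc"), (2, "abc"), (10, "edf"), (11, "edf"), (12, "fgh"), (99, "xyz")],
)

ids = ["abc", "edf", "fgh"]
placeholders = ", ".join("?" for _ in ids)  # one bound parameter per id

query = f"""
    SELECT value, id
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY value) AS seqnum
        FROM myTable t
        WHERE id IN ({placeholders})
    )
    WHERE seqnum = 1
    ORDER BY id
"""
result = conn.execute(query, ids).fetchall()
print(result)  # [(1, 'abc'), (10, 'edf'), (12, 'fgh')]
```

No loop or union is needed: the partition does the "for each id" part, and the IN filter restricts it to the ids you care about.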

Bigquery SELECT * WHEN COUNT(DISTINCT value) does not work

I have a BigQuery table with 30+ columns and I want to SELECT * where session is unique.
I've been through almost all the questions on this subject on StackOverflow, but none helped me achieve the expected result.
I've tried SELECT COUNT(DISTINCT session) FROM table.id, but the problem is that it returns only the session column and I need the whole row.
Then I tried:
SELECT *
FROM `table.id`
WHERE session IN (
SELECT session
FROM `table.id`
GROUP BY session
HAVING COUNT(*) = 1
)
But it returns far fewer rows than SELECT COUNT(DISTINCT sessions)
So by logic I tried:
SELECT *, COUNT(DISTINCT sessions) and SELECT * WHERE COUNT(DISTINCT sessions)
Neither works.
Can anyone help? Thanks in advance and kind regards,
I want to SELECT * where session is unique ...
Use below instead - note use of = in COUNT(*) = 1
SELECT *
FROM `table.id`
WHERE session IN (
SELECT session
FROM `table.id`
GROUP BY session
HAVING COUNT(*) = 1
)
Your query seems alright with HAVING COUNT(*) = 1 as suggested by @Mikhail.
What is wrong is that you are trying to match this result with SELECT COUNT(DISTINCT sessions).
Note that DISTINCT is used to show distinct records including 1 record from duplicate too. On the other hand HAVING COUNT(*) = 1 is checking only records which are not duplicate.
For a simple example, if session has : 1, 1, 2, 3
DISTINCT will result in: 1, 2, 3
HAVING COUNT(*) = 1 will result in: 2, 3
Hence the difference you see between the two results.

How to display in Big Query ONLY duplicated records?

To view records without duplicated ones, I use this SQL
SELECT * EXCEPT(row_number)
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY orderid) row_number
FROM `TABLE`)
WHERE row_number = 1
What is the best practice to display only duplicated records from a single table?
Below is for BigQuery Standard SQL
Personally, I prefer not to rely on ROW_NUMBER() whenever possible, because with a big volume of data it tends to lead to a Resource Exceeded error
So, from my experience I would recommend below options:
To view records for those orderid with only one entry:
#standardSQL
SELECT AS VALUE ANY_VALUE(t)
FROM `project.dataset.table` t
GROUP BY orderid
HAVING COUNT(1) = 1
To view records for those orderid with more than one entry:
#standardSQL
SELECT * EXCEPT(flag) FROM (
SELECT *, COUNT(1) OVER(PARTITION BY orderid) > 1 flag
FROM `project.dataset.table`
)
WHERE flag
Note: under the hood, COUNT(1) OVER() can be calculated using as many workers as are available, while ROW_NUMBER() OVER() requires all respective data to be moved to one worker (thus the Resource related issue)
OR
#standardSQL
SELECT *
FROM `project.dataset.table`
WHERE orderid IN (
SELECT orderid FROM `project.dataset.table`
GROUP BY orderid HAVING COUNT(1) > 1
)
Why not just change the row_number? You have partitioned by order id, creating partitions of duplicates, ranked the records, and taken only the first element to remove the duplicates. But if you take only row_number = 2, you'll have only elements from partitions with at least 2 elements, i.e. only duplicates.
SELECT * EXCEPT(row_number)
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY orderid) row_number
FROM `TABLE`)
WHERE row_number = 2
Note: Using row_number = 2 will give you only 1 element from each set of duplicates. If you go with row_number > 1, the result may contain duplicates again (for example if you had 3 identical elements in the first table).
You can display the duplicated rows by showing only rows with row_number greater than 1.
select
* except(row_number)
from (
select
*, row_number() over (partition by ) as row_number
from `TABLE`)
where row_number > 1
If your table has no primary key column, you are obliged to define one. Assuming my table contains 12 columns in BigQuery, I have not found anything shorter than:
SELECT *, sum(1) as rowcount
FROM `TABLE`
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
HAVING rowcount>1;
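To compare what these variants return, here is a minimal sketch run against an in-memory SQLite database from Python rather than BigQuery. The orders table and its rows are hypothetical, and ORDER BY item is added inside ROW_NUMBER() only to make the output repeatable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (orderid INTEGER, item TEXT)")
# orderid 1 occurs twice, 2 once, 3 three times.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e"), (3, "f")],
)

# Every row whose orderid occurs more than once (window COUNT).
all_dupes = conn.execute("""
    SELECT orderid, item
    FROM (SELECT *, COUNT(*) OVER (PARTITION BY orderid) AS cnt FROM orders)
    WHERE cnt > 1
    ORDER BY orderid, item
""").fetchall()

# Same rows via GROUP BY / HAVING in a subquery.
all_dupes_in = conn.execute("""
    SELECT orderid, item
    FROM orders
    WHERE orderid IN (SELECT orderid FROM orders GROUP BY orderid HAVING COUNT(*) > 1)
    ORDER BY orderid, item
""").fetchall()

# row_number = 2 yields exactly one row per duplicated orderid.
one_per_dupe = conn.execute("""
    SELECT orderid, item
    FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY orderid ORDER BY item) AS rn
        FROM orders
    )
    WHERE rn = 2
    ORDER BY orderid
""").fetchall()

print(all_dupes)     # [(1, 'a'), (1, 'b'), (3, 'd'), (3, 'e'), (3, 'f')]
print(one_per_dupe)  # [(1, 'b'), (3, 'e')]
```

So the COUNT-based queries show every member of each duplicated group, while row_number = 2 shows a single representative per group.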

Row with the highest ID

You have three fields: ID, Date and Total. Your table contains multiple rows for the same day, which is valid data; however, for reporting purposes you need to show only one row per day. The row with the highest ID per day should be returned, and the rest should be hidden from users (not returned).
To better picture the question below is sample data and sample output:
ID, Date, Total
1, 2011-12-22, 50
2, 2011-12-22, 150
The correct result is:
2, 2011-12-22, 150
The correct output is single row for 2011-12-22 date and this row was chosen because it has the highest ID (2>1)
Assuming that you have a database that supports window functions, and that the date column is indeed just date (and not datetime), then something like:
SELECT
* --TODO - Pick columns
FROM
(
SELECT ID,[Date],Total,ROW_NUMBER() OVER (PARTITION BY [Date] ORDER BY ID desc) rn
FROM [Table]
) t
WHERE
rn = 1
Should produce one row per day - and the selected row for any given day is that with the highest ID value.
SELECT *
FROM table
WHERE ID IN ( SELECT MAX(ID)
FROM table
GROUP BY Date )
This will work.
SELECT *
FROM tableName a
INNER JOIN
(
SELECT `DATE`, MAX(ID) maxID
FROM tableName
GROUP BY `DATE`
) b ON a.id = b.MaxID AND
a.`date` = b.`date`
Probably
SELECT * FROM your_table ORDER BY ID DESC LIMIT 1
Select MAX(ID),Date,Total from foo
for MySQL
Another simple way is
SELECT TOP 1 * FROM YourTable ORDER BY ID DESC
And I think this is the simplest way!
SELECT * FROM TABLE_SUM S WHERE S.ID =
(
SELECT MAX(ID) FROM TABLE_SUM
WHERE CDATE = S.CDATE
GROUP BY CDATE
)
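To check the per-day answers against the sample data, here is a minimal sketch using an in-memory SQLite database from Python. The table name daily and the extra row for 2011-12-23 are my additions, included so that there is more than one day to group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (ID INTEGER, Date TEXT, Total INTEGER)")
conn.executemany(
    "INSERT INTO daily VALUES (?, ?, ?)",
    [
        (1, "2011-12-22", 50),
        (2, "2011-12-22", 150),
        (3, "2011-12-23", 70),  # hypothetical second day
    ],
)

# Window-function version: rank rows within each day by ID, highest first.
via_window = conn.execute("""
    SELECT ID, Date, Total
    FROM (
        SELECT ID, Date, Total,
               ROW_NUMBER() OVER (PARTITION BY Date ORDER BY ID DESC) AS rn
        FROM daily
    )
    WHERE rn = 1
    ORDER BY Date
""").fetchall()

# MAX(ID) per day version; relies on ID being unique across the table.
via_max = conn.execute("""
    SELECT ID, Date, Total
    FROM daily
    WHERE ID IN (SELECT MAX(ID) FROM daily GROUP BY Date)
    ORDER BY Date
""").fetchall()

print(via_window)  # [(2, '2011-12-22', 150), (3, '2011-12-23', 70)]
print(via_max)     # [(2, '2011-12-22', 150), (3, '2011-12-23', 70)]
```

Note that the ORDER BY ID DESC with LIMIT 1 or TOP 1 suggestions would return only the single highest ID in the whole table, not one row per day.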