How to display in Big Query ONLY duplicated records? - google-bigquery

To view records without duplicated ones, I use this SQL
SELECT * EXCEPT(row_number)
FROM (SELECT*,ROW_NUMBER() OVER (PARTITION BY orderid) row_number
FROM `TABLE`)
WHERE row_number = 1
What is the best practice to display only duplicated records from a single table?

Below is for BigQuery Standard SQL
Me personally, I prefer not to rely on ROW_NUMBER() whenever it is possible because with big volume of data it tends to lead to Resource Exceeded error
So, from my experience I would recommend below options:
To view records for those orderid with only one entry:
#standardSQL
SELECT AS VALUE ANY_VALUE(t)
FROM `project.dataset.table` t
GROUP BY orderid
HAVING COUNT(1) = 1
to view records for those orderid with more than one entry:
#standardSQL
SELECT * EXCEPT(flag) FROM (
SELECT *, COUNT(1) OVER(PARTITION BY orderid) > 1 flag
FROM `project.dataset.table`
)
WHERE flag
note: behind the hood - COUNT(1) OVER() can be calculated using as many workers as available while ROW_NUMBER() OVER() requires all respective data to be moved to one worker (thus Resource related issue)
OR
#standardSQL
SELECT *
FROM `project.dataset.table`
WHERE orderid IN (
SELECT orderid FROM `project.dataset.table`
GROUP BY orderid HAVING COUNT(1) > 1
)

Why not just change the row_number ? You have partitionned by order id, creating partitions of duplicates, ranked the records and take only the first element to remove the duplicates. But if you take only the row_number = 2, you'll have only elements from partitions with at least 2 elements, i.e only duplicates.
SELECT * EXCEPT(row_number)
FROM (SELECT*,ROW_NUMBER() OVER (PARTITION BY orderid) row_number
FROM `TABLE`)
WHERE row_number = 2
Note :Use row_number = 2 will give you only 1 element of duplicates. If you go with row_number > 1, the result may contain duplicates again (for example if you had 3 identical elements in the first table).

You can display the duplicated row by showing only raw with row_number greater than 1.
select
* except(row_number)
from (
select
*, row_number() over (partition by ) as row_number
from `TABLE`)
where row_number > 1

If your table has not primary key column, you are obliged to define it. Asuming my table contains 12 columns in BigQuery, I do not find shorter than:
SELECT *, sum(1) as rowcount
FROM `TABLE`
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
HAVING rowcount>1;

Related

Query to return rows with duplicate values based on other column

I have a table in SQL Server, the format is as follows:
I would want to get rows according to the following conditions:
If the rows have multiple but similar Customer_ID and Order_Number then return only those rows with maximum date
Otherwise, return the rest of the rows
So the result in this case will be row 3, 4 and 5.
Any idea on how to achieve this using SQL query? The table has no primary or unique key.
use window function row_number()
select *
from
(
select *,
row_number() over(partition by Customer_ID,Order_Number order by date desc) as rn
from your_table
) t where rn=1
or use co-relates subquery
select *
from t
where date in (
select max(date)
from t t1
where t1.Customer_ID=t.Customer_ID and t1.Order_Number=t.Order_Number
)

Get two most frequent data from SQL tbl?

i have a tbl call it tbl_test in which continously data are inserting and it has approx 10^6 records at a time it has colums
Acquire_Id (Value between 1 to 20 ),
Status_Msg(value between 'A' to 'Z'),
Status_Code(value between 1 to 26)
There is one to one mapping b/w Status_Msg and Status_Code
Now i want to get two most frequent status_msg and Staus_Code Count for each acquirer if they are present in table
Query should be Cost Saving
Most databases support the ANSI standard window functions. You can get what you want using row_number() (or rank() or dense_rank(), depending on how ties are returned) after aggregating the values.
The following returns exactly two rows for each acquirer (even if there are ties).
select t.*
from (select t.acquire_id, t.status_msg, t.status_code, count(*) as cnt,
row_number() over (partition by t.acquire_id order by count(*) desc) as seqnum
from tbl_test t
group by t.acquire_id, t.status_msg, t.status_code
) t
where seqnum <= 2;

Group by every N records in T-SQL

I have some performance test results on the database, and what I want to do is to group every 1000 records (previously sorted in ascending order by date) and then aggregate results with AVG.
I'm actually looking for a standard SQL solution, however any T-SQL specific results are also appreciated.
The query looks like this:
SELECT TestId,Throughput FROM dbo.Results ORDER BY id
WITH T AS (
SELECT RANK() OVER (ORDER BY ID) Rank,
P.Field1, P.Field2, P.Value1, ...
FROM P
)
SELECT (Rank - 1) / 1000 GroupID, AVG(...)
FROM T
GROUP BY ((Rank - 1) / 1000)
;
Something like that should get you started. If you can provide your actual schema I can update as appropriate.
Give the answer to Yuck. I only post as an answer so I could include a code block. I did a count test to see if it was grouping by 1000 and the first set was 999. This produced set sizes of 1,000. Great query Yuck.
WITH T AS (
SELECT RANK() OVER (ORDER BY sID) Rank, sID
FROM docSVsys
)
SELECT (Rank-1) / 1000 GroupID, count(sID)
FROM T
GROUP BY ((Rank-1) / 1000)
order by GroupID
I +1'd #Yuck, because I think that is a good answer. But it's worth mentioning NTILE().
Reason being, if you have 10,010 records (for example), then you'll have 11 groupings -- the first 10 with 1000 in them, and the last with just 10.
If you're comparing averages between each group of 1000, then you should either discard the last group as it's not a representative group, or...you could make all the groups the same size.
NTILE() would make all groups the same size; the only caveat is that you'd need to know how many groups you wanted.
So if your table had 25,250 records, you'd use NTILE(25), and your groupings would be approximately 1000 in size -- they'd actually be 1010 in size; the benefit being, they'd all be the same size, which might make them more relevant to each other in terms of whatever comparison analysis you're doing.
You could get your group-size simply by
DECLARE #ntile int
SET #ntile = (SELECT count(1) from myTable) / 1000
And then modifying #Yuck's approach with the NTILE() substitution:
;WITH myCTE AS (
SELECT NTILE(#ntile) OVER (ORDER BY id) myGroup,
col1, col2, ...
FROM dbo.myTable
)
SELECT myGroup, col1, col2...
FROM myCTE
GROUP BY (myGroup), col1, col2...
;
Answer above does not actually assign a unique group id to each 1000 records. Adding Floor() is needed. The following will return all records from your table, with a unique GroupID for each 1000 rows:
WITH T AS (
SELECT RANK() OVER (ORDER BY your_field) Rank,
your_field
FROM your_table
WHERE your_field = 'your_criteria'
)
SELECT Floor((Rank-1) / 1000) GroupID, your_field
FROM T
And for my needs, I wanted my GroupID to be a random set of characters, so I changed the Floor(...) GroupID to:
TO_HEX(SHA256(CONCAT(CAST(Floor((Rank-1) / 10) AS STRING),'seed1'))) GroupID
without the seed value, you and I would get the exact same output because we're just doing a SHA256 on the number 1, 2, 3 etc. But adding the seed makes the output unique, but still repeatable.
This is BigQuery syntax. T-SQL might be slightly different.
Lastly, if you want to leave off the last chunk that is not a full 1000, you can find it by doing:
WITH T AS (
SELECT RANK() OVER (ORDER BY your_field) Rank,
your_field
FROM your_table
WHERE your_field = 'your_criteria'
)
SELECT Floor((Rank-1) / 1000) GroupID, your_field
, COUNT(*) OVER(PARTITION BY TO_HEX(SHA256(CONCAT(CAST(Floor((Rank-1) / 1000) AS STRING),'seed1')))) AS CountInGroup
FROM T
ORDER BY CountInGroup
You can also use Row_Number() instead of rank. No Floor required.
declare #groupsize int = 50
;with ct1 as ( select YourColumn, RowID = Row_Number() over(order by YourColumn)
from YourTable
)
select YourColumn, RowID, GroupID = (RowID-1)/#GroupSize + 1
from ct1
I read more about NTILE after reading #user15481328 answer
(resource: https://www.sqlservertutorial.net/sql-server-window-functions/sql-server-ntile-function/ )
and this solution allowed me to find the max date within each of the 25 groups of my data set:
with cte as (
select date,
NTILE(25) OVER ( order by date ) bucket_num
from mybigdataset
)
select max(date), bucket_num
from cte
group by bucket_num
order by bucket_num

Find the top 5 MAX() values from an SQL table and then performi an AVG() on that table without them

I want to be able to perform an avg() on a column after removing the 5 highest values in it and see that the stddev is not above a certain number. This has to be done entirely as a PL/SQL query.
EDIT:
To clarify, I have a data set that contains values in a certain range and tracks latency. I want to know whether the AVG() of those values is due to a general rise in latency, or due to a few values with a very high stddev. I.e - (1, 2, 1, 3, 12311) as opposed to (122, 124, 111, 212). I also need to achieve this via an SQL query due to our monitoring software's limitations.
You can use row_number to find the top 5 values, and filter them out in a where clause:
select avg(col1)
from (
select row_number() over (order by col1 desc) as rn
, *
from YourTable
) as SubQueryAlias
where rn > 5
select column_name1 from
(
select column_name1 from table_name order by nvl(column_name,0) desc
)a
where rownum<6
(the nvl is done to omit the null value if there is/are any in the column column_name)
Well, the most efficient way to do it would be to calculate (sum(all values) - sum(top 5 values)) / (row_count - 5)
SELECT SUM(val) AS top5sum FROM table ORDER BY val DESC LIMIT 5
SELECT SUM(val) AS allsum FROM table
SELECT (COUNT(*) - 5) AS bottomCount FROM table
The average is then (allsum - top5sum) / bottomCount
First, get the MAX 5 values:
SELECT TOP 5 RowId FROM Table ORDER BY Column
Now use this in your main statement:
SELECT AVG(Column) FROM Table WHERE RowId NOT IN (SELECT TOP 5 RowId FROM Table ORDER BY Column)

How do I use ROW_NUMBER()?

I want to use the ROW_NUMBER() to get...
To get the max(ROW_NUMBER()) --> Or i guess this would also be the count of all rows
I tried doing:
SELECT max(ROW_NUMBER() OVER(ORDER BY UserId)) FROM Users
but it didn't seem to work...
To get ROW_NUMBER() using a given piece of information, ie. if I have a name and I want to know what row the name came from.
I assume it would be something similar to what I tried for #1
SELECT ROW_NUMBER() OVER(ORDER BY UserId) From Users WHERE UserName='Joe'
but this didn't work either...
Any Ideas?
For the first question, why not just use?
SELECT COUNT(*) FROM myTable
to get the count.
And for the second question, the primary key of the row is what should be used to identify a particular row. Don't try and use the row number for that.
If you returned Row_Number() in your main query,
SELECT ROW_NUMBER() OVER (Order by Id) AS RowNumber, Field1, Field2, Field3
FROM User
Then when you want to go 5 rows back then you can take the current row number and use the following query to determine the row with currentrow -5
SELECT us.Id
FROM (SELECT ROW_NUMBER() OVER (ORDER BY id) AS Row, Id
FROM User ) us
WHERE Row = CurrentRow - 5
Though I agree with others that you could use count() to get the total number of rows, here is how you can use the row_count():
To get the total no of rows:
with temp as (
select row_number() over (order by id) as rownum
from table_name
)
select max(rownum) from temp
To get the row numbers where name is Matt:
with temp as (
select name, row_number() over (order by id) as rownum
from table_name
)
select rownum from temp where name like 'Matt'
You can further use min(rownum) or max(rownum) to get the first or last row for Matt respectively.
These were very simple implementations of row_number(). You can use it for more complex grouping. Check out my response on Advanced grouping without using a sub query
If you need to return the table's total row count, you can use an alternative way to the SELECT COUNT(*) statement.
Because SELECT COUNT(*) makes a full table scan to return the row count, it can take very long time for a large table. You can use the sysindexes system table instead in this case. There is a ROWS column that contains the total row count for each table in your database. You can use the following select statement:
SELECT rows FROM sysindexes WHERE id = OBJECT_ID('table_name') AND indid < 2
This will drastically reduce the time your query takes.
You can use this for get first record where has clause
SELECT TOP(1) * , ROW_NUMBER() OVER(ORDER BY UserId) AS rownum
FROM Users
WHERE UserName = 'Joe'
ORDER BY rownum ASC
ROW_NUMBER() returns a unique number for each row starting with 1. You can easily use this by simply writing:
ROW_NUMBER() OVER (ORDER BY 'Column_Name' DESC) as ROW_NUMBER
May not be related to the question here. But I found it could be useful when using ROW_NUMBER -
SELECT *,
ROW_NUMBER() OVER (ORDER BY (SELECT 100)) AS Any_ID
FROM #Any_Table
select
Ml.Hid,
ml.blockid,
row_number() over (partition by ml.blockid order by Ml.Hid desc) as rownumber,
H.HNAME
from MIT_LeadBechmarkHamletwise ML
join [MT.HAMLE] h on ML.Hid=h.HID
SELECT num, UserName FROM
(SELECT UserName, ROW_NUMBER() OVER(ORDER BY UserId) AS num
From Users) AS numbered
WHERE UserName='Joe'
You can use Row_Number for limit query result.
Example:
SELECT * FROM (
select row_number() OVER (order by createtime desc) AS ROWINDEX,*
from TABLENAME ) TB
WHERE TB.ROWINDEX between 0 and 10
--
With above query, I will get PAGE 1 of results from TABLENAME.
If you absolutely want to use ROW_NUMBER for this (instead of count(*)) you can always use:
SELECT TOP 1 ROW_NUMBER() OVER (ORDER BY Id)
FROM USERS
ORDER BY ROW_NUMBER() OVER (ORDER BY Id) DESC
Need to create virtual table by using WITH table AS, which is mention in given Query.
By using this virtual table, you can perform CRUD operation w.r.t row_number.
QUERY:
WITH table AS
-
(SELECT row_number() OVER(ORDER BY UserId) rn, * FROM Users)
-
SELECT * FROM table WHERE UserName='Joe'
-
You can use INSERT, UPDATE or DELETE in last sentence by in spite of SELECT.
SQL Row_Number() function is to sort and assign an order number to data rows in related record set. So it is used to number rows, for example to identify the top 10 rows which have the highest order amount or identify the order of each customer which is the highest amount, etc.
If you want to sort the dataset and number each row by seperating them into categories we use Row_Number() with Partition By clause. For example, sorting orders of each customer within itself where the dataset contains all orders, etc.
SELECT
SalesOrderNumber,
CustomerId,
SubTotal,
ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY SubTotal DESC) rn
FROM Sales.SalesOrderHeader
But as I understand you want to calculate the number of rows of grouped by a column. To visualize the requirement, if you want to see the count of all orders of the related customer as a seperate column besides order info, you can use COUNT() aggregation function with Partition By clause
For example,
SELECT
SalesOrderNumber,
CustomerId,
COUNT(*) OVER (PARTITION BY CustomerId) CustomerOrderCount
FROM Sales.SalesOrderHeader
This query:
SELECT ROW_NUMBER() OVER(ORDER BY UserId) From Users WHERE UserName='Joe'
will return all rows where the UserName is 'Joe' UNLESS you have no UserName='Joe'
They will be listed in order of UserID and the row_number field will start with 1 and increment however many rows contain UserName='Joe'
If it does not work for you then your WHERE command has an issue OR there is no UserID in the table. Check spelling for both fields UserID and UserName.