Number of rows increased when I used UNNEST in BigQuery - google-bigquery

I have an orders table which contains 134,537 rows. I want to fetch data from this table and insert it into a testing table. To do so I wrote a query that uses the UNNEST function, which increased the row count from 134,537 to 234,832.
I found some duplicate rows of users' orders reflected in the final result. How do I handle that?

It makes sense to have more rows after unnesting your data: you are actually "flattening" it.
There are different approaches to removing the duplication. You can do it at the moment you unnest your data, or afterwards. For the second scenario, there is already an answer from Jordan Tigani that should help you:
SELECT *
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY <COLUMN_NAME>) AS row_number
  FROM <TABLE>
)
WHERE row_number = 1
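For the first scenario, here is a minimal sketch of deduplicating at unnest time. The table names, the order_id column, and the items repeated field are hypothetical placeholders for your schema, and SELECT DISTINCT assumes the unnested element is a scalar value (BigQuery cannot apply DISTINCT to STRUCT columns):

INSERT INTO `project.dataset.testing`
SELECT DISTINCT
  o.order_id,
  item  -- one row per distinct (order_id, item) pair
FROM `project.dataset.orders` AS o,
  UNNEST(o.items) AS item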

Related

SQL Query for multiple columns with one column distinct

I've spent an inordinate amount of time this morning trying to Google what I thought would be a simple thing. I need to set up an SQL query that selects multiple columns, but only returns one instance if one of the columns (let's call it case_number) returns duplicate rows.
select case_number, name, date_entered from ticket order by date_entered
There are rows in the ticket table that have duplicate case_number, so I want to eliminate those duplicate rows from the results and only show one instance of them. If I use "select distinct case_number, name, date_entered" it applies the distinct operator to all three fields, instead of just the case_number field. I need that logic to apply to only the case_number field and not all three. If I use "group by case_number having count (*)>1" then it returns only the duplicates, which I don't want.
Any ideas on what to do here are appreciated, thank you so much!
You can use ROW_NUMBER(). For example:
select *
from (
  select *,
    row_number() over (partition by case_number) as rn
  from ticket
) x
where rn = 1
The query above will pseudo-randomly pick one row for each case_number. If you want better selection criteria, you can add an ORDER BY or a window frame to the OVER clause.
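For instance, a sketch that keeps the earliest-entered row per case_number, assuming date_entered is the tiebreaker you want:

select case_number, name, date_entered
from (
  select *,
    row_number() over (partition by case_number order by date_entered) as rn
  from ticket
) x
where rn = 1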

Deduplicate rows in a complex schema in a BigQuery partition

I have read some threads, but I know too little SQL to solve my problem.
I have a table with a complex schema with records and nested fields.
Below you see a query which finds the exact row that I need to deduplicate.
SELECT *
FROM `my-data-project-214805.rfid_data.rfid_data_table`
WHERE DATE(_PARTITIONTIME) = "2020-02-07"
  AND DetectorDataMessage.Header.MessageID = '478993053'
DetectorDataMessage.Header.MessageID is supposed to be unique.
How can I delete one of these rows? (there are two)
If possible I would like to deduplicate the whole table, but it's partitioned and I can't get it right. I tried the suggestions in the threads below, but I get this error: Column DetectorDataMessage of type STRUCT cannot be used in...
Threads of interest:
Deduplicate rows in a BigQuery partition
Delete duplicate rows from a BigQuery table
Any suggestions? Can you guide me in the right direction?
Try using a MERGE to remove the existing duplicate rows while inserting a single identical one back. In this case I'm going for a specific date and id, as in the question:
MERGE `temp.many_random` t
USING (
  # choose a single row to replace the duplicates
  SELECT a.*
  FROM (
    SELECT ANY_VALUE(a) a
    FROM `temp.many_random` a
    WHERE DATE(_PARTITIONTIME) = '2018-10-01'
      AND DetectorDataMessage.Header.MessageID = '478993053'
    GROUP BY _PARTITIONTIME, DetectorDataMessage.Header.MessageID
  )
)
# ON FALSE: no row ever "matches", so the NOT MATCHED branches do all the work
ON FALSE
WHEN NOT MATCHED BY SOURCE
  # delete the duplicates
  AND DATE(_PARTITIONTIME) = '2018-10-01'
  AND DetectorDataMessage.Header.MessageID = '478993053'
  THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
Based on this answer:
Deduplicate rows in a BigQuery partition
If all of the values in the duplicate rows are the same, just use SELECT DISTINCT.
If not, I would use the ROW_NUMBER() function to create a rank for each unique key, and then just choose the first rank.
I don't know what your columns are, but here's an example:
WITH subquery AS (
  SELECT MessageID,
    ROW_NUMBER() OVER (PARTITION BY MessageID ORDER BY MessageID ASC) AS rank
  FROM <your_table>
)
SELECT *
FROM subquery
WHERE rank = 1
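Applied to the table from the question, a sketch might look like this (BigQuery lets you partition the window by a nested scalar field, and SELECT * EXCEPT(rank) drops the helper column):

SELECT * EXCEPT(rank)
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY DetectorDataMessage.Header.MessageID) AS rank
  FROM `my-data-project-214805.rfid_data.rfid_data_table`
  WHERE DATE(_PARTITIONTIME) = "2020-02-07"
)
WHERE rank = 1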

Is there any way to calculate the total number of rows returned by a dynamic query in a Common Table Expression (CTE) or subquery

We are in the process of optimizing our database. Most of our stored procedures use CTEs because they give us high performance with our table structure. Many of our queries are dynamic and return different results under different conditions. We hold all the data in a CTE and check conditions; that was not the problem, but we also need the total number of rows returned by each query, and calculating it takes a lot of time. A temporary table or table variable is not suitable in our case, as inserting the data into it takes too long. We have a structure like the following:
WITH t(fields) AS (
  SELECT field1, field2, ...,
    ROW_NUMBER() OVER (ORDER BY some_column) AS row
  FROM some_table
  -- plus lots of INNER and LEFT JOINs
  WHERE some_condition
),
rowTotal(RowTotal) AS (
  SELECT MAX(row) FROM t
)
SELECT * FROM t, rowTotal
WHERE -- condition for paging
But MAX(row) takes a lot of time; if I remove it, the query returns data within 100 ms. I tried COUNT(*), COUNT(SomeField), and many others; they work, but they take a lot of time. How can I get the total number of rows from the CTE within a few milliseconds? Aggregate functions will not work for me. Is there any other way to calculate the row total, like @@ROWCOUNT? Thanks in advance for any help.
If you are after the total number of rows from the inner query, you can add it as a column to your select using COUNT(*) with an OVER (PARTITION BY ...) clause.
WITH t(fields) AS (
  SELECT COUNT(*) OVER (PARTITION BY 1) AS TotalRows,
    field1, field2, ...,
    ROW_NUMBER() OVER (ORDER BY some_column) AS row
  FROM some_table ...
This should give you a count of the total rows in t as the first column of t.
I don't know whether this is the fastest way to get the result you want, but it works for me on thousands of returned records and avoids extra SELECT queries to find the count separately.
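A minimal, self-contained sketch of the pattern (T-SQL; the dbo.Orders table and its columns are hypothetical). TotalRows repeats the full row count on every row, so each returned page still carries the total:

WITH t AS (
  SELECT COUNT(*) OVER (PARTITION BY 1) AS TotalRows,
    OrderID,
    ROW_NUMBER() OVER (ORDER BY OrderDate) AS row
  FROM dbo.Orders
)
SELECT * FROM t
WHERE row BETWEEN 1 AND 20;  -- the paging condition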

PostgreSQL 8.3 duplicate rows

I'm having a little problem with a query that I'm writing. I have a lot of joins and a lot of columns that I extract, and in the WHERE clause I compare a date column with the minimum value from the same table. But when I get the same date for two rows, I need to return only one row. The WHERE clause looks like this:
bt.da2 = (select min(btreg.da2) from bt btreg.....
The query returns a lot of customers, and every customer has that bt.da2 date. When one customer has two rows with the same value of bt.da2, I need to take only one of the two rows, not both.
I may not have explained myself clearly. If anyone has a clue what I'm asking but something is unclear, please ask me.
I'm using PostgreSQL 8.3.
Regards,
Julian
It's hard to tell with so little information, but I would try something like this:
select *
from (
  select product_id, -- assumed to be the primary key
    ...
    row_number() over (partition by product_id order by bt.da2) as rn
  from products pr
    left join bt on bt.da2 = pr.some_col
) t
where rn = 1
The row_number() function is used to create consecutive numbers for each product; the outer WHERE clause then picks the first one. You can change the ORDER BY in the window definition to influence which row you pick. (Note that window functions such as row_number() require PostgreSQL 8.4 or later, so you would need to upgrade from 8.3 to use this.)
You should be able to sort out duplicate values of da2 using:
bt.da2 = (select distinct min(btreg.da2) from bt btreg.....
I tried this out using PostgreSQL 9.1, but I am sure the distinct keyword will work as expected in 8.4 as well.

Remove duplicate rows - Impossible to find a decisive answer

You'd think I came straight here to ask my question, but I googled an awful lot and could not find a decisive answer.
Facts: I have a table with 3.3 million rows and 20 columns.
The first column is the primary key and is thus unique.
I have to remove all rows where columns 2 through 11 are duplicated. In essence a basic question, but there are so many different approaches, even though everyone seeks the same solution in the end: removing the duplicates.
I was personally thinking about GROUP BY HAVING COUNT(*) > 1
Is that the way to go or what do you suggest?
Thanks a lot in advance!
L
As a generic answer:
WITH cte AS (
  SELECT ROW_NUMBER() OVER (
    PARTITION BY <groupbyfield> ORDER BY <tiebreaker>) AS rn
  FROM Table
)
DELETE FROM cte
WHERE rn > 1;
I find this more powerful and flexible than GROUP BY ... HAVING. In fact, GROUP BY ... HAVING only gives you the duplicates; you're still left with the 'trivial' task of choosing a 'keeper' amongst them.
ROW_NUMBER() OVER (...) gives more control over how to distinguish among duplicates (the tiebreaker) and allows for behavior like 'keep the first 3 of the duplicates', not only 'keep just 1', which is really hard to do with GROUP BY ... HAVING.
The other part of your question is how to approach this for 3.3M rows. Well, 3.3M is not really that big, but I would still recommend doing the delete in batches of, say, TOP 10000 at a time; otherwise you'll push one huge transaction into the log and might overwhelm your log drives.
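A minimal batching sketch of that idea (T-SQL; the table, key column, and duplicate-defining columns are placeholders to substitute with your own):

WHILE 1 = 1
BEGIN
  WITH cte AS (
    SELECT ROW_NUMBER() OVER (
      PARTITION BY col2, col3 ORDER BY pk_col) AS rn
    FROM dbo.MyTable
  )
  DELETE TOP (10000) FROM cte
  WHERE rn > 1;

  IF @@ROWCOUNT = 0 BREAK;  -- stop once no duplicates remain
END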
And the final question is whether this will perform acceptably. It depends on your schema. If the ROW_NUMBER() has to scan the entire table and spool to count, and you have to repeat this in batches N times, then it won't perform. An appropriate index will help. But I can't say anything more without knowing the exact schema involved (structure of the clustered index/heap, all non-clustered indexes, etc.).
Group by the fields you want to be unique, and get an aggregate value (like min) for your pk field. Then insert those results into a new table.
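A sketch of that group-and-copy approach (T-SQL; the names are hypothetical, and in practice you would list all of columns 2 through 11 in both the SELECT and the GROUP BY):

SELECT MIN(pk_col) AS pk_col, col2, col3
INTO dbo.MyTableDeduped  -- SELECT ... INTO creates the new table
FROM dbo.MyTable
GROUP BY col2, col3;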
If you have SQL Server 2005 or newer, then the easiest way would be to use a CTE (Common Table Expression).
You need to know what criteria you want to "partition" your data by - e.g. create partitions of data that is considered identical/duplicate - and then you need to order those partitions by something - e.g. a sequence ID, a date/time or something.
You didn't provide many details about your tables - so let me just give you a sample:
;WITH Duplicates AS
(
  SELECT
    OrderID,
    ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate DESC) AS RowN
  FROM
    dbo.Orders
)
DELETE FROM Duplicates
WHERE RowN > 1
The CTE (the WITH ... AS (...) part) gives you an "inline view" for the next SQL statement - it's not persisted or anything - it just lives for that next statement and then it's gone.
Basically, I'm "grouping" (partitioning) my Orders by CustomerID, and ordering by OrderDate. So for each CustomerID, I get a new "group" of data, which gets a row number starting with 1. The ORDER BY OrderDate DESC gives the newest order for each customer the RowN = 1 value - this is the one order I keep.
All other orders for each customer are deleted based on the CTE (the WITH ... AS expression).
You'll need to adapt this for your own situation, obviously - but the CTE with the PARTITION BY and ROW_NUMBER() are a very reliable and easy technique to get rid of duplicates.
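Before running the DELETE, it can help to preview which rows the CTE marks as duplicates; a sketch against the same sample schema:

;WITH Duplicates AS
(
  SELECT
    OrderID,
    ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate DESC) AS RowN
  FROM
    dbo.Orders
)
SELECT * FROM Duplicates
WHERE RowN > 1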
If you don't want to deal with creating a new table, then just use DELETE TOP (1). Use a subquery to get the values that identify the duplicated rows, and then use DELETE TOP to delete where there are multiple rows. You might have to run it more than once if there is more than one duplicate, but you get the point.
DELETE TOP (1) FROM Table
WHERE Field IN (SELECT Field FROM Table GROUP BY Field HAVING COUNT(*) > 1)
You get the idea hopefully. This is just some pseudo code to help demonstrate.
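To avoid re-running it by hand, the same idea can be wrapped in a loop (still a sketch, and a slow one for 3.3 million rows, since it removes a single row per pass):

WHILE EXISTS (SELECT Field FROM Table GROUP BY Field HAVING COUNT(*) > 1)
  DELETE TOP (1) FROM Table
  WHERE Field IN (SELECT Field FROM Table GROUP BY Field HAVING COUNT(*) > 1);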