How to Avoid adding duplicates to U-SQL Tables - azure-data-lake

Is there a way to avoid adding duplicate records to U-SQL tables other than adding/truncating partitions? Let's assume only the unique identifiers differ between the two records.

You can also use ROW_NUMBER() in the U-SQL query:
@transactions =
SELECT *,
ROW_NUMBER() OVER(PARTITION BY Id ORDER BY Id DESC) AS RowNumber
FROM @searchlog;
@result =
SELECT Id, Name, Description, Age
FROM @transactions
WHERE RowNumber == 1;
This keeps exactly one record per Id from the input rowset.

Since U-SQL tables do not provide UNIQUE constraints due to the limited scalability, you will have to make sure that you use ANTI SEMIJOIN on the unique columns to filter out the possible duplicates on the new data before you insert it.
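A minimal sketch of the same filter-before-insert idea, written in Python with SQLite rather than U-SQL so that it can be run locally. NOT EXISTS plays the role of ANTI SEMIJOIN here; the table and column names (people, staging, Id, Name) are hypothetical and not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people  (Id INTEGER, Name TEXT);   -- existing table, no UNIQUE constraint
    CREATE TABLE staging (Id INTEGER, Name TEXT);   -- new batch waiting to be loaded
    INSERT INTO people  VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO staging VALUES (2, 'Bob'), (3, 'Cy');
""")

# Insert only the staging rows whose Id is not already in the target (anti-join).
conn.execute("""
    INSERT INTO people (Id, Name)
    SELECT s.Id, s.Name
    FROM staging AS s
    WHERE NOT EXISTS (SELECT 1 FROM people AS p WHERE p.Id = s.Id)
""")

ids = [row[0] for row in conn.execute("SELECT Id FROM people ORDER BY Id")]
print(ids)
```

Row 2 is already present, so only row 3 is added. If the incoming batch can itself contain duplicates, it would need to be deduplicated first (for example with ROW_NUMBER()).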

Related

Deduplicate rows in complex schema in a bigquery partition

I have read some threads but I know too little sql to solve my problem.
I have a table with a complex schema with records and nested fields.
Below you see a query which finds the exact row that I need to deduplicate.
SELECT *
FROM my-data-project-214805.rfid_data.rfid_data_table
WHERE DATE(_PARTITIONTIME) = "2020-02-07"
AND DetectorDataMessage.Header.MessageID ='478993053'
DetectorDataMessage.Header.MessageID is supposed to be unique.
How can I delete one of these rows? (there are two)
If possible I would like to deduplicate the whole table, but it's partitioned and I can't get it right. I tried the suggestions in the threads below, but I get this error: Column DetectorDataMessage of type STRUCT cannot be used in...
Threads of interest:
Deduplicate rows in a BigQuery partition
Delete duplicate rows from a BigQuery table
Any suggestions? Can you guide me in the right direction?
Try using a MERGE to delete the existing duplicate rows and insert a single identical one. In this case I'm going for a specific date and id, as in the question:
MERGE `temp.many_random` t
USING (
# choose a single row to replace the duplicates
SELECT a.*
FROM (
SELECT ANY_VALUE(a) a
FROM `temp.many_random` a
WHERE DATE(_PARTITIONTIME)='2018-10-01'
AND DetectorDataMessage.Header.MessageID ='478993053'
GROUP BY _PARTITIONTIME, DetectorDataMessage.Header.MessageID
)
)
ON FALSE
WHEN NOT MATCHED BY SOURCE
# delete the duplicates
AND DATE(_PARTITIONTIME)='2018-10-01'
AND DetectorDataMessage.Header.MessageID ='478993053'
THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
Based on this answer:
Deduplicate rows in a BigQuery partition
If all of the values in the duplicate rows are the same, just use 'SELECT distinct'.
If not, I would use the ROW_NUMBER() function to create a rank for each unique index, and then just choose the first rank.
I don't know what your columns are, but here's an example:
-- your_table is a placeholder for the actual table name
WITH subquery AS
(SELECT MessageId,
ROW_NUMBER() OVER(PARTITION BY MessageId ORDER BY MessageId ASC) AS row_num
FROM your_table
)
SELECT *
FROM subquery
WHERE row_num = 1

How to retrieve the last 2 records from table?

I have a table with n number of records
How can I retrieve the nth record and the (n-1)th record from my table in SQL without using a derived table?
I have tried using ROWID as
select * from table where rowid in (select max(rowid) from table);
It gives me the nth record, but I want the (n-1)th record as well.
Is there any other method that avoids MAX, derived tables, and pseudocolumns?
Thanks
You cannot depend on rowid to get you to the last row in the table. You need an auto-incrementing id or creation time to have the proper ordering.
You can use, for instance:
select *
from (select t.*, row_number() over (order by <id> desc) as seqnum
from t
) t
where seqnum <= 2
Although allowed in the syntax, the order by clause in a subquery is ignored (for instance http://docs.oracle.com/javadb/10.8.2.2/ref/rrefsqlj13658.html).
Just to be clear, rowids have nothing to do with the ordering of rows in a table. The Oracle documentation is quite clear that they specify a physical access path for the data (http://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#i6732). It is true that in an empty database, inserting records into a new table will probably create a monotonically increasing sequence of row ids. But you cannot depend on this. The only guarantees with rowids are that they are unique within a table and are the fastest way to access a particular row.
I have to admit that I cannot find good documentation on Oracle handling or not handling order by's in subqueries in its most recent versions. ANSI SQL does not require compliant databases to support order by in subqueries. Oracle syntax allows it, and it seems to work in some cases, at least. My best guess is that it would probably work on a single processor, single threaded instance of Oracle, or if the data access is through an index. Once parallelism is introduced, the results would probably not be ordered. Since I started using Oracle (in the mid-1990s), I have been under the impression that order bys in subqueries are generally ignored. My advice would be to not depend on the functionality, until Oracle clearly states that it is supported.
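A minimal sketch of the row_number() approach above, run against SQLite from Python instead of Oracle. It assumes the table has an auto-incrementing id column, as the answer recommends; the table t and the payload column are hypothetical. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)")
conn.executemany(
    "INSERT INTO t (payload) VALUES (?)",
    [("first",), ("second",), ("third",)],
)

# Number the rows from newest to oldest by id and keep the first two.
last_two = conn.execute("""
    SELECT id, payload
    FROM (SELECT t.*, ROW_NUMBER() OVER (ORDER BY id DESC) AS seqnum FROM t) AS ranked
    WHERE seqnum <= 2
    ORDER BY seqnum
""").fetchall()
print(last_two)
```

The ordering comes from an explicit column inside the window function, so the result does not depend on physical row placement.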
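select * from (select * from my_table order by rowid desc) where rownum <= 2
and for rows between N and M (rownum has to be captured under an alias in the inner query, because an outer filter of rownum >= N with N > 1 never matches any row):
select * from (
select t.*, rownum as rn from (
select * from my_table order by rowid
) t where rownum <= M
) where rn >= N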
Try this
select top 2 * from table order by rowid desc
Assuming rowid as column in your table:
SELECT * FROM table ORDER BY rowid DESC LIMIT 2

Deleting non distinct rows

I have a table that has a unique non-clustered index and 4 of the columns are listed in this index. I want to update a large number of rows in the table. If I do so, they will no longer be distinct, therefore the update fails because of the index.
I am wanting to disable the index and then delete the oldest duplicate rows. Here's my query so far:
SELECT t.itemid, t.fieldid, t.version, updated
FROM dbo.VersionedFields w
inner JOIN
(
SELECT itemid, fieldid, version, COUNT(*) AS QTY
FROM dbo.VersionedFields
GROUP BY itemid, fieldid, version
HAVING COUNT(*) > 1
) t
on w.itemid = t.itemid and w.fieldid = t.fieldid and w.version = t.version
The select inside the inner join returns the right number of records that we want to delete, but groups them so there is actually twice the amount.
After the join it shows all the records but all I want to delete is the oldest ones?
How can this be done?
If you say SQL (Structured Query Language) but really mean SQL Server (the Microsoft relational database system), and if you're using SQL Server 2005 or newer, you can use a CTE (Common Table Expression) for this purpose.
With this CTE, you can partition your data by some criteria, e.g. your ItemId (or a combination of columns), and have SQL Server number all your rows starting at 1 for each of those partitions, ordered by some other criteria, e.g. version (or some other column).
So try something like this:
;WITH PartitionedData AS
(
SELECT
itemid, fieldid, version,
ROW_NUMBER() OVER(PARTITION BY ItemId ORDER BY version DESC) AS 'RowNum'
FROM dbo.VersionedFields
)
DELETE FROM PartitionedData
WHERE RowNum > 1
Basically, you're partitioning your data by some criteria and numbering each partition, starting at 1 for each new partition, ordered by some other criteria (e.g. Date or Version).
So for each "partition" of data, the "newest" entry has RowNum = 1, and any others that belong to the same partition (by having the same partition values) are numbered sequentially from 2 up to however many rows there are in that partition.
If you want to keep only the newest entry - delete anything with a RowNum larger than 1 and you're done!
In SQL Server 2005 and above:
WITH q AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY itemid, fieldid, version ORDER BY updated DESC) AS rn
FROM versionedFields
)
DELETE
FROM q
WHERE rn > 1
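A runnable sketch of this pattern, using Python with SQLite as a stand-in for SQL Server. SQLite cannot delete through a CTE, so the sketch deletes by rowid instead; that substitution is an assumption of this example, not part of the answers above. The sample rows are made up, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE VersionedFields (itemid INTEGER, fieldid INTEGER, version INTEGER, updated TEXT);
    INSERT INTO VersionedFields VALUES
        (1, 10, 1, '2020-01-01'),   -- older duplicate, should be deleted
        (1, 10, 1, '2020-02-01'),   -- newest of its group, should be kept
        (2, 20, 1, '2020-01-15');   -- no duplicate
""")

# Number rows inside each (itemid, fieldid, version) group, newest first,
# then delete every row that is not number 1 in its group.
conn.execute("""
    DELETE FROM VersionedFields
    WHERE rowid IN (
        SELECT rid FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (
                       PARTITION BY itemid, fieldid, version
                       ORDER BY updated DESC
                   ) AS rn
            FROM VersionedFields
        )
        WHERE rn > 1
    )
""")

remaining = conn.execute(
    "SELECT itemid, fieldid, version, updated FROM VersionedFields ORDER BY itemid"
).fetchall()
print(remaining)
```

Partitioning by all of the indexed columns (rather than by itemid alone) ensures that only rows violating the unique index are treated as duplicates.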
Try something like:
DELETE w FROM dbo.VersionedFields AS w WHERE w.version < (SELECT MAX(version) FROM dbo.VersionedFields)
Of course, you'd want to limit the MAX(version) to only the versions of the field you want to delete.
You probably need to look at this Stack Overflow answer (delete earlier of duplicate rows).
Essentially the technique uses grouping (or optionally, windowing) to find the minimum id value of a group in order to delete it. It may be more accurate to delete rows where the value <> max(row identifier).
So:
Drop unique index
Load data
Delete data using the grouping mechanism (ideally in a transaction, so that you can rollback if there is a mistake), then commit
Recreate the index.
Note that recreating an index on a big table can take a long time.
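The four steps above can be sketched as follows, using Python with SQLite in place of SQL Server. The table t, its columns, and the index name ux_t_k are hypothetical; the delete keeps the row with the highest id per key, following the "value <> max(row identifier)" suggestion.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so the transaction boundaries below are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, k TEXT, updated TEXT)")
conn.execute("CREATE UNIQUE INDEX ux_t_k ON t (k)")
conn.execute("INSERT INTO t (k, updated) VALUES ('a', '2020-01-01')")

# 1. Drop the unique index so the load may introduce duplicates.
conn.execute("DROP INDEX ux_t_k")

# 2. Load data (here: a newer row for the same key).
conn.execute("INSERT INTO t (k, updated) VALUES ('a', '2020-03-01')")

# 3. Delete all but the highest id per key, inside a transaction.
conn.execute("BEGIN")
try:
    conn.execute("DELETE FROM t WHERE id <> (SELECT MAX(id) FROM t AS m WHERE m.k = t.k)")
    conn.execute("COMMIT")
except sqlite3.Error:
    conn.execute("ROLLBACK")
    raise

# 4. Recreate the index; this fails if any duplicates remain.
conn.execute("CREATE UNIQUE INDEX ux_t_k ON t (k)")

rows = conn.execute("SELECT k, updated FROM t").fetchall()
print(rows)
```

Recreating the unique index at the end doubles as a check: it raises an error if the delete step missed a duplicate.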

Deleting Duplicate Records from a Table

I have a table called Table1 which has 48 records, of which only 24 should be there. For some reason duplicate records got inserted into it. How do I delete the duplicate records from that table?
Here's something you might try if SQL Server version is 2005 or later.
WITH cte AS
(
SELECT {list-of-columns-in-table},
row_number() over (PARTITION BY {list-of-key-columns} ORDER BY {rule-to-determine-row-to-keep}) as sequence
FROM myTable
)
DELETE FROM cte
WHERE sequence > 1
This uses a common table expression (CTE) and adds a sequence column. {list-of-columns-in-table} is just as it states. Not all columns are needed, but I won't explain here.
{list-of-key-columns} is the set of columns that you use to define what a duplicate is.
{rule-to-determine-row-to-keep} is a sequence so that the first row is the row to keep. For example, if you want to keep the oldest row, you would use a date column for sequence.
Here's an example of the query with real columns.
WITH cte AS
(
SELECT ID, CourseName, DateAdded,
row_number() over (PARTITION BY CourseName ORDER BY DateAdded) as sequence
FROM Courses
)
DELETE FROM cte
WHERE sequence > 1
This example removes duplicate rows based on the CourseName value and keeps the oldest based on the DateAdded value.
http://support.microsoft.com/kb/139444
This section is the key. The primary point you should take away. ;)
"This article discusses how to locate and remove duplicate primary keys from a table. However, you should closely examine the process which allowed the duplicates to happen in order to prevent a recurrence."
Identify your records by grouping data by your logical keys, since you obviously haven't defined them, and applying a HAVING COUNT(*) > 1 statement at the end. The article goes into this in depth.
This is an easier way
Select * Into #TempTable FROM YourTable
Truncate Table YourTable
Insert into YourTable Select Distinct * from #TempTable
Drop Table #TempTable

Remove duplicate rows - Impossible to find a decisive answer

You'd immediately think I went straight to here to ask my question but I googled an awful lot to not find a decisive answer.
Facts: I have a table with 3.3 million rows, 20 columns.
The first column is the primary key and thus unique.
I have to remove all rows in which columns 2 through 11 are duplicated. It's a basic question, but there are so many different approaches, even though everyone is after the same result in the end: removing the duplicates.
I was personally thinking about GROUP BY HAVING COUNT(*) > 1
Is that the way to go or what do you suggest?
Thanks a lot in advance!
As a generic answer:
WITH cte AS (
SELECT ROW_NUMBER() OVER (
PARTITION BY <groupbyfield> ORDER BY <tiebreaker>) as rn
FROM Table)
DELETE FROM cte
WHERE rn > 1;
I find this more powerful and flexible than the GROUP BY ... HAVING. In fact, GROUP BY ... HAVING only gives you the duplicates, you're still left with the 'trivial' task of choosing a 'keeper' amongst the duplicates.
ROW_NUMBER OVER (...) gives more control over how to distinguish among duplicates (the tiebreaker) and allows for behavior like 'keep first 3 of the duplicates', not only 'keep just 1', which is a behavior really hard to do with GROUP BY ... HAVING.
The other part of your question is how to approach this for 3.3M rows. Well, 3.3M is not really that big, but I would still recommend doing this in batches. Delete TOP 10000 at a time, otherwise you'll push a huge transaction into the log and might overwhelm your log drives.
The final question is whether this will perform acceptably. It depends on your schema. If the ROW_NUMBER() has to scan the entire table and spool to count, and you have to repeat this in batches N times, then it won't perform well. An appropriate index will help. But I can't say anything more without knowing the exact schema involved (structure of the clustered index/heap, all non-clustered indexes, etc.).
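The batching advice above can be sketched like this, in Python with SQLite standing in for SQL Server. SQLite has no DELETE TOP, so each batch is limited through a subquery with LIMIT; the table t, the grp column, and the tiny batch size are assumptions made for the demo. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

BATCH_SIZE = 2  # tiny for the demo; the answer above suggests 10000

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, grp TEXT)")
conn.executemany("INSERT INTO t (grp) VALUES (?)", [("a",)] * 4 + [("b",)] * 3)

deleted_total = 0
while True:
    # Delete at most BATCH_SIZE duplicate rows; the lowest id per group is kept.
    cur = conn.execute(
        """
        DELETE FROM t
        WHERE id IN (
            SELECT id FROM (
                SELECT id, ROW_NUMBER() OVER (PARTITION BY grp ORDER BY id) AS rn
                FROM t
            )
            WHERE rn > 1
            LIMIT ?
        )
        """,
        (BATCH_SIZE,),
    )
    conn.commit()  # one short transaction per batch
    if cur.rowcount == 0:
        break
    deleted_total += cur.rowcount

counts = conn.execute("SELECT grp, COUNT(*) FROM t GROUP BY grp ORDER BY grp").fetchall()
print(deleted_total, counts)
```

Each iteration recomputes the row numbers, which is the repeated scan cost the answer warns about; an index on the partitioning columns would reduce it.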
Group by the fields you want to be unique, and get an aggregate value (like min) for your pk field. Then insert those results into a new table.
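A minimal sketch of the group-and-copy approach, in Python with SQLite. The tables src and deduped and the columns pk, c2, c3 are hypothetical stand-ins for the real 20-column table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (pk INTEGER PRIMARY KEY, c2 TEXT, c3 TEXT);
    INSERT INTO src (c2, c3) VALUES ('x', 'y'), ('x', 'y'), ('p', 'q');
    CREATE TABLE deduped (pk INTEGER PRIMARY KEY, c2 TEXT, c3 TEXT);
""")

# One row per distinct (c2, c3) combination, keeping the smallest pk of each group.
conn.execute("""
    INSERT INTO deduped (pk, c2, c3)
    SELECT MIN(pk), c2, c3
    FROM src
    GROUP BY c2, c3
""")

kept = conn.execute("SELECT pk, c2, c3 FROM deduped ORDER BY pk").fetchall()
print(kept)
```

This only works directly when every non-key column is part of the GROUP BY; columns outside the grouping would have to be fetched by joining back to the source table on the kept pk.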
If you have SQL Server 2005 or newer, then the easiest way would be to use a CTE (Common Table Expression).
You need to know what criteria you want to "partition" your data by - e.g. create partitions of data that is considered identical/duplicate - and then you need to order those partitions by something - e.g. a sequence ID, a date/time or something.
You didn't provide much details about your tables - so let me just give you a sample:
;WITH Duplicates AS
(
SELECT
OrderID,
ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate DESC) AS RowN
FROM
dbo.Orders
)
DELETE FROM Duplicates
WHERE RowN > 1
The CTE (WITH ... AS (...)) gives you an "inline view" for the next SQL statement. It's not persisted or anything; it just lives for that next statement and then it's gone. Note that the DELETE has to target the CTE, because RowN only exists there and not in dbo.Orders.
Basically, I'm "grouping" (partitioning) my Orders by CustomerID, and ordering by OrderDate. So for each CustomerID, I get a new "group" of data, which gets a row number starting with 1. The ORDER BY OrderDate DESC gives the newest order for each customer the RowN = 1 value - this is the one order I keep.
All other orders for each customer are deleted based on the CTE (the WITH..... expression).
You'll need to adapt this for your own situation, obviously - but the CTE with the PARTITION BY and ROW_NUMBER() are a very reliable and easy technique to get rid of duplicates.
If you don't want to deal with a new table, just use DELETE TOP(1). Use a subquery to find the values that occur more than once, then use DELETE TOP(1) to remove one of the rows with such a value. Each run deletes a single row, so you have to rerun it until no rows are affected.
DELETE TOP (1) FROM YourTable
WHERE Field IN (SELECT Field FROM YourTable GROUP BY Field HAVING COUNT(*) > 1)
You get the idea hopefully. This is just some pseudo code to help demonstrate.