I am designing a table and during testing it was found that one of the fields causes duplicate rows (which it shouldn't).
As a precaution, I would like to rule out possible duplicates in every other field. How would I go about checking which of my columns causes duplicate PKs?
Intuitive method:
select
    count(*),
    pk_field,
    other_field1
from
    table
group by
    pk_field,
    other_field1
having
    count(*) > 1
    and count(distinct other_field1) > 1;
I want to make sure that if I run this query it will establish with 100% certainty that there are no duplicates caused by other_field1 (i.e. that there is only one value of other_field1 for each value of the PK).
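Or would it be more direct to group by the PK alone and count distinct values? A sketch, using my_table as a stand-in for the real table name:

-- returns any pk_field that has more than one distinct other_field1
select
    pk_field,
    count(distinct other_field1)
from my_table
group by
    pk_field
having
    count(distinct other_field1) > 1;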
Extra bonus: is there a query that would show me directly which fields cause duplicate rows, without having to write one query per field in the table?
Thanks a bunch!
EDIT: for clarity, the PK will not be enforced and the table is actually a view in a third party system
From my point of view, the primary key should be enforced, and there should be a unique index on (pk_field, other_field). Additionally, other_field should be NOT NULL (so that you wouldn't have "duplicates" with the same pk_field but an empty other_field).
Doing so, the database would handle your problem itself.
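For illustration, the constraints described above might look like this (a sketch; my_table, pk_field and other_field stand in for your real names, and the NOT NULL line uses PostgreSQL syntax):

-- enforce the key and the one-value-per-key rule at the database level
ALTER TABLE my_table ADD PRIMARY KEY (pk_field);
ALTER TABLE my_table ALTER COLUMN other_field SET NOT NULL;
CREATE UNIQUE INDEX ux_my_table_pk_other ON my_table (pk_field, other_field);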
If you want to do it yourself, well, what can you do? A view? A third-party system? What kind of control do you have over the whole process? If all you CAN do is find "duplicates", that's kind of too late.
While creating a tkinter application to store book information, I realized that simply deleting a row of information from the SQL database does not update the indexes. It's kind of hard to explain, but here is a picture of what I mean:
link to picture. (still young on this account, so pictures can't be embedded, sorry for the inconvenience)
As you can see, the first column represents the index and index 3 is missing because I deleted it. Is there a way such that upon deleting a row, anything below it just shifts up to cover for the empty spot?
Your use of the word "index" must be based on the application language, not the database language. In databases, indexes are additional data structures that speed certain operations on tables.
You are referring to an "id" column, presumably one that is defined automatically as identity, auto_increment, serial, or whatever the underlying database uses.
A very important point is that deleting a row from a table does not affect the other rows in the table (unless you have gone through the work of writing triggers to make that happen). It just deletes the row.
The second, more important point is that you do not want to change the "identity" of rows -- and that is what the column you are calling an "index" does. It identifies the row. It not only identifies the row today, but it will identify the same row tomorrow. And, if it existed, yesterday. That is, you don't want to change the identity.
This is even more important when you have foreign key relationships -- that is, other tables that refer to this row. Those relationships could get all messed up if the ids start changing.
SQL does offer a simple way to get a number with no gaps:
-- assign gapless sequence numbers at query time, ordered by the "index" column
select row_number() over (order by "index") as seqnum
from t;
I am building a community site where logon will be by email and members will be able to change their name/nick name.
Do you think I should keep the member name/nickname in my members table with the other properties of the member, or create another table, write the member name/nickname to that table, and associate it with the member's id?
I am in favour of the second option because I think it would be faster to pull the member's name from it.
Is that the right/better way?
Update: the reason for the other table is that I need to pull the username in different sections, for example forums. Wouldn't it be faster to query a small table for each username, for each post in a forum topic?
I would keep it as one table and set a unique constraint on Email in that table.
I can't see a single advantage in adding another table.
Why do you think the second option would be faster?
If the nickname is a required one-to-one attribute of the member ID, the appropriate place to store it is in the same table. This is still an indexed single-record search, so it should be more or less as fast as your other option.
In fact, this solution would probably be faster, since you could get the nickname in the same SELECT as the other information.
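For illustration, one indexed lookup returns everything at once (the column names here are made up):

-- single-table lookup: the nickname comes back with the rest of the row
SELECT email, nickname, joined_on
FROM members
WHERE member_id = 42;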
Update to answer the update to the question:
The second table isn't any smaller in terms of the number of rows. The main factors in a SQL search are 1) number of records in the table and 2) number of possible matches from the indexed part of the search.
In this case, the number of records in your smaller table would be exactly the same as the larger table. And the number of possible matching records returned by the index will always be 1 because the member ID is unique.
The number of columns in the table you're searching is generally irrelevant to the time taken to return the data (the number of columns you actually list in the SELECT statement can have an effect, but that's the same no matter which table you're searching).
SQL databases are very, very good at finding data. Structure your data correctly and let the database worry about getting it back to you. Premature optimization is, as they say, the root of all evil.
Go with the first option: keep the name/nick name in the members table. There's no need to introduce an additional table, and the overhead of a join that goes with it, in this case.
Yes, associating the member's ID with the other properties is the right way to go.
You can simply create an index on name to speed up your queries.
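For example (assuming the table is called members):

-- speeds up lookups and sorts on name
CREATE INDEX ix_members_name ON members (name);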
In our database schema, we like to use delete flags. When a record is deleted, we update that field rather than run a DELETE statement. The rest of our queries then check the delete flag when returning data.
Here is the problem:
The delete flag is a date, with a default value of NULL. This is convenient because when a record is deleted we can easily see the date that it was deleted on.
However, to enforce unique constraints properly, we need to include the delete flag in the unique constraint. The problem is that MS SQL Server behaves the way we want (for this design), but PostgreSQL allows the row if any field in a multi-column unique constraint is NULL. That behavior fits the SQL standard, but it breaks our design.
The options we are considering are:
make the default value for the deleted field some hardcoded date
add a bit flag for deleted, so each table would have 2 delete-related fields - date_deleted and is_deleted (for example)
change date_deleted to is_deleted (a bit field)
I suspect option 1 is a performance hit: each query would have to check for the hardcoded date rather than just checking IS NULL. Plus, it feels wrong.
Option 2 also feels wrong - two fields for "deleted" is not DRY.
Option 3, we lose the "date" information. There is a modified field, which would, in theory, reflect the date deleted, but only assuming the last update to the row was the update to the delete bit.
So, any suggestions? What have you done in the past to deal with "delete flags"?
Update
Thanks to everyone for the super quick, and thoughtful responses.
We ended up going with a simple boolean field and a modified date field (with a trigger). I just noticed the partial index suggestion, and that looks like the perfect solution for this problem (but I haven't actually tried it).
If retaining the deleted records is important to you, have you considered simply moving them to a history table?
This could easily be achieved with a trigger.
Application logic doesn't need to account for this deleted flag.
Your tables would stay lean and mean when selecting from them.
It would solve your problem with unique indexes.
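A minimal sketch of such a trigger in PostgreSQL, assuming a live table customers(id, name) - all names here are hypothetical, and SQL Server would use a similar AFTER DELETE trigger with its deleted pseudo-table:

-- history table mirrors the live table, plus a deletion timestamp
CREATE TABLE customers_history (
    id         integer,
    name       text,
    deleted_at timestamptz NOT NULL DEFAULT now()
);

CREATE FUNCTION archive_customer() RETURNS trigger AS $$
BEGIN
    INSERT INTO customers_history (id, name) VALUES (OLD.id, OLD.name);
    RETURN OLD;  -- returning OLD lets the DELETE proceed on the live table
END;
$$ LANGUAGE plpgsql;

-- on PostgreSQL < 11, write EXECUTE PROCEDURE instead of EXECUTE FUNCTION
CREATE TRIGGER customers_archive
BEFORE DELETE ON customers
FOR EACH ROW EXECUTE FUNCTION archive_customer();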
Option 3, we lose the "date" information. There is a modified field, which would, in theory, reflect the date deleted, but only assuming the last update to the row was the update to the delete bit.
Is there a business reason that the record would be modified after it was deleted? If not, are you worrying about something that's not actually an issue? =)
In the system I currently work on we have the following "metadata" columns: _Deleted, _CreatedStamp, _UpdatedStamp, _UpdatedUserId, _CreatedUserId ... quite a bit, but it's important for this system to carry that much data. I'd suggest going down the road of having a separate Deleted flag alongside a Modified Date / Deleted Date. "Disk space is cheap", and having two fields to represent a deleted record isn't world-ending, if that's what you have to do for the RDBMS you're using.
What about triggers? When a record is deleted, an after-delete trigger copies the row into an archive table that has the same structure plus an additional column for the date/time, and perhaps the user that deleted it.
That way your "live" table only contains records that are actually live, so it's better performance-wise, and your application doesn't have to worry about whether a record has been deleted or not.
One of my favourite solutions is an is_deleted bit flag, and a last_modified date field.
The last_modified field is updated automatically every time the row is modified (using any technique supported by your DBMS.) If the is_deleted bit flag is TRUE, then the last_modified value implies the time when the row was deleted.
You will then be able to set the default value of last_modified to GETDATE(). No more NULL values, and this should work with your unique constraints.
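A sketch of those two columns in T-SQL (the table name is hypothetical; keeping last_modified current on every update is left to a trigger or application code, as noted above):

-- no NULLs: new rows start live, stamped with the current time
ALTER TABLE my_table ADD
    is_deleted    BIT      NOT NULL DEFAULT 0,
    last_modified DATETIME NOT NULL DEFAULT GETDATE();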
Just create a conditional (partial) unique index:
CREATE UNIQUE INDEX i_bla ON yourtable (colname) WHERE date_deleted IS NULL;
Would creating a multi-column unique index that includes the deleted date achieve the constraint you need?
http://www.postgresql.org/docs/current/interactive/indexes-unique.html
Alternatively, could you store a non-NULL sentinel for undeleted records instead of NULL, and check for that - e.g. 0 or the minimum SQL date ("1/1/1753" in SQL Server)?
Is it possible to exclude the deleted date field from your unique index? In what way does this field contribute to the uniqueness of each record, especially if the field is usually null?
I was just reading How to avoid a database race condition when manually incrementing PK of new row.
There were a lot of good suggestions, like having a separate table to get the PK values.
So I wonder if a query like this:
INSERT INTO Party VALUES(
    (SELECT MAX(id) + 1 FROM (SELECT id FROM Party) AS x),
    'A-XXXXXXXX-X',
    'Joseph')
could avoid race conditions?
Is the whole statement guaranteed to be atomic? Is it in MySQL? In PostgreSQL?
The best way to avoid race conditions while creating primary keys in a relational database is to allow the database to generate the primary keys.
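A sketch of letting the database do it, in MySQL syntax (the code and name columns are made up to match the question's values; PostgreSQL would use SERIAL or an identity column instead):

-- the database assigns id atomically; no race between concurrent inserts
CREATE TABLE Party (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    code VARCHAR(20),
    name VARCHAR(100)
);

INSERT INTO Party (code, name) VALUES ('A-XXXXXXXX-X', 'Joseph');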
It would work on tables that use table-level locking (MyISAM), but on InnoDB etc. it could deadlock or produce duplicate keys, I think, depending on the isolation level in use.
In any case doing this is an extremely bad idea as it won't work well in the general case, but might appear to work during low-concurrency testing. It's a recipe for trouble.
You'd be better off using another table and incrementing a value in there; that's more likely to be race-free / deadlock-free.
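One well-known way to do that in MySQL is a one-row sequence table claimed via LAST_INSERT_ID(expr) - a sketch, with hypothetical names:

-- set up the sequence table once
CREATE TABLE party_seq (next_id INT NOT NULL);
INSERT INTO party_seq VALUES (0);

-- each client claims the next value atomically
UPDATE party_seq SET next_id = LAST_INSERT_ID(next_id + 1);
SELECT LAST_INSERT_ID();  -- this connection's claimed id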
No, you still have a problem: if two queries try to increment at the same time, one connection can finish the inner SELECT, and then the other query is processed before the first INSERT commits, so both compute the same MAX(id) + 1.
Your best bet, if you want a guarantee and don't want the database generating the key, is to put a unique key on the column.
In the event that the insert fails, try your query again; once the computed primary key is unique, it will work.
In this case, your best bet is to first insert only the id and any other non-null columns, and then do an update to set the nullable columns to whatever is correct.
I'm implementing CRUD in my Silverlight application; however, I don't want to implement the Delete functionality in the traditional way. Instead, I'd like to mark the data as hidden inside the database.
Does anyone know of a way of doing this with an SQL Server Database?
Help greatly appreciated.
You can add another column "deleted" to the table, which has value 0 or 1, and display only those records with deleted = 0.
ALTER TABLE TheTable ADD deleted BIT NOT NULL DEFAULT 0
You can also create a view which returns only the undeleted rows.
CREATE VIEW undeleted AS SELECT * FROM TheTable WHERE deleted = 0
And your delete command would look like this:
UPDATE TheTable SET deleted = 1 WHERE id = ...
Extending Lukasz' idea, a datetime column is useful too.
NULL = current
Value = when soft deleted
This adds simple versioning that a bit column cannot, which may work better.
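A sketch extending the earlier example (SQL Server syntax; the id placeholder follows the style above):

ALTER TABLE TheTable ADD deleted_on DATETIME NULL

-- soft delete stamps the time; current rows keep NULL
UPDATE TheTable SET deleted_on = GETDATE() WHERE id = ...

-- current (undeleted) rows
SELECT * FROM TheTable WHERE deleted_on IS NULL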
In most situations I would rather archive the deleted rows to an archive table with a delete trigger. This way I can also capture who deleted each row and the deleted rows don't impact my performance. You can then create a view that unions both tables together when you want to include the deleted ones.
You could do as Lukasz Lysik suggests, and have a field that serves as a flag for "deleted" rows, filtering them out when you don't want them showing up. I've used that in a number of applications.
An alternate suggestion would be to add an extra status assignment if there's a pre-existing status code. For example, in a class attendance app we use internally, an attendance record could be "Imported", "Registered", "Completed", "Incomplete", etc.* - we added a "Deleted" option for times when there are unintentional duplicates. That way we have a record, and we're not just throwing a new column at the problem.
*That is the display name for a numeric code used behind the scenes. Just clarifying. :)
Solution with triggers
If you are friends with DB triggers, then you might consider:
add DeletedAt and DeletedBy columns to your tables
create a view for each table (e.g. for table Customer, a CustomerView view) that filters out rows whose DeletedAt is not null (gbn's idea with date columns)
perform all your CRUD operations as usual, but against CustomerView rather than the Customer table
add an INSTEAD OF DELETE trigger that marks the row as deleted instead of physically deleting it (see the sketch below)
you may want to do more complex stuff there, like ensuring that all FK references to this row are also "logically" deleted, in order to keep logical referential integrity
If you choose to use this pattern, I would probably name the tables differently, like TCustomer, and the views just Customer, for clarity of the client code.
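A minimal sketch of the view-plus-trigger pair in T-SQL (TCustomer, Customer, Id, Name and DeletedAt are the hypothetical names from above):

-- the view hides logically deleted rows
CREATE VIEW Customer AS
    SELECT Id, Name, DeletedAt
    FROM TCustomer
    WHERE DeletedAt IS NULL;
GO

-- DELETE against the view becomes an UPDATE on the base table
CREATE TRIGGER Customer_SoftDelete ON Customer
INSTEAD OF DELETE
AS
BEGIN
    UPDATE TCustomer
    SET DeletedAt = GETDATE()
    WHERE Id IN (SELECT Id FROM deleted);
END;
GO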
Be careful with this kind of implementation because soft deletes break referential integrity and you have to enforce integrity in your entities using custom logic.