Fixing holes/gaps in numbers generated by Postgres sequence - sql

I have a postgres database that uses sequences extensively to generate primary keys of tables.
After a lot of usage of this database i.e. Adding/Update/Delete operation the columns that uses sequences for primary keys now have a lot holes/gaps in them and the sequence value itself is very high.
My question is: Are there any ways in which we can fix these gaps in Primary Keys? which should inturn bring down the max value of the number in that columns and then reset the sequence?
Note: A lot of these columns are also referenced by other columns as ForeignKeys.

If you feel the need to fill gaps in auto-generated Posgresql sequence numbers, I have the feeling you need another field in your table, like some kind of "number" you increment programmatically, either in your code, or in a trigger.

It is possible to solve this problem, but is expensive for the database to do (especially IO) and is guaranteed to reoccur. I would not worry about this problem. If you get close to 4B, upgrade your primary and foreign keys to BIGSERIAL and BIGINT. If you're getting close to 2^64... well... I'd be interested in hearing more about your application. :~]

Postgres allows you to update PKs, although a lot of people think it's bad practice. So you could lock the table, and UPDATE. (You can make an oldkey, newkey table all sorts of ways, e.g., window function.) All the FK relationships have to be marked to cascade. Then you can reset the currval of the id sequence.
Personally, I would just use a BIGSERIAL. If you have so many updates and deletes that you may run out even so, maybe there is some composite PK based on (say) a timestamp and id that would help you.

Related

SQL - Must there always be a primary key?

There are a couple of similar questions already out there and the consensus seemed to be that a primary key should always be created.
But what if you have a single row table for storing settings (and let's not turn this into a discussion about why it might be good/bad to create a single row table please)?
Surely having a primary key on a single row table becomes completely useless?
It may seem completely useless, but it's also completely harmless, and I'd vote for harmless with good design principles vs. useless with no design principles every time.
Other people have commented, rightly, that you don't know how you're going to use the table in a year or five years... what if someone comes along and decides they want to duplicate the configuration -- move it to a distributed environment or add a test environment by using a duplicate configuration string or whatever. Having a field that acts like a primary key means that whenever you query the table, if you use the key, you'll be certain no matter what anyone else may do to your table, that you're getting the correct record.
You're right there are a million other aspects -- surrogate keys vs. intelligent keys, indexing, partitioning (silly on a single row table, I know), whatever... but without getting into that I'd vote add the key rather than not add it. You could have done it by the time you read this thread.
Short answer, no key, duplicate records possible. Your planning a single row now, but what about six months in the future when you single row multiplies. Put a primary key on the table, even for single row.
You could always base your primary key on the name of the setting. Then your table would become a key-value store.
But no, in many RDBMS you are not REQUIRED to have a primary key per table.
Having declared a primary key on a single row table in SQL will ensure that there will be no duplicates. Whether it is useless depends on your requirements. Usually it is a good idea to avoid duplicates.

Do I need a primary key if something will NOT be changed?

If I had a site where a user can flag another user post and it cannot be undone or changed, do I need to have a primary key? All my selects would be on the post_id and with a where clause to see if the user already flagged it.
It seems to me from some of your other posts that the reason you are trying to avoid adding a primary key to your table is to save space.
Stop thinking like that.
It's a bad idea to make non-standard optimizations like this without having tested them first to see if they actually work. Have you run some tests that shows that you save a significant amount of space in your database by omitting the primary key on this table? Or are you just guessing?
Using a primary key doesn't necessarily mean that you will use more space. Depending on the database, if you omit the primary key it might add a hidden field for you anyway (for example if you don't have a PK in MySQL/InnoDB it adds a hidden clustered index on a synthetic column containing 6 byte row ID values (source)). If you do use a primary key, rather than adding a new column you can just choose some existing columns that you know should be unique anyway. It won't take up any more space, it will just mean that the data will be stored in a different order to make it easier to search.
When you add an index, that index is going to take up extra space, as an index is basically just a copy of a few columns of the table, plus a link back to the row in the original table. Remember that hidden column the database uses when you don't have a PK? Well now it has to use that to find your rows, so you'll get a copy of it in your index too. If you use a primary key then you probably don't need one of your indexes that you would have added, so you're actually saving space here.
Besides all this, some useful database tools just won't work well if you don't have a primary key on your table. You will annoy everyone that has to maintain your database after you are gone.
So tell me, why do you think it's a good idea to NOT have one?
A primary key has nothing to do with whether data can be changed - it's a single point of reference for an entire row, which can make looking up and/or changing data faster.
All my selects would be on the post_id and with a where clause to see if the user already flagged it.
You need to provide more information about business rules. For example, should the system support more than one user flagging the same post?
If the answer is "no", then I would model a POST_STATUS_CODE table and have a foreign key to the table in your POSTS table.
If the answer is "yes", then I would still have a POST_STATUS_CODE table but also a table linking the POSTS and POST_STATUS_CODE tables - say POSTS_STATUS_XREF.
I have a post_flag table with post_id, user_id (who flagged it) and flag_type (ATM as a byte). I don't see how PK will make it faster in this case but I imagine it will take up 4 or 8 bytes per row. I was thinking about indexing post_id. If I do should I still create a PK?
At a minimum, I would make the primary key to be a combination of:
post_id
user_id
The reason being that a primary key ensures that there can't be duplicates.
A primary key can be made up of more than one column - this is called a compound key. It means that the pair of values is unique. IE: You can't have more than one combination of 1, 1 values, but you could have 1,2, 1,3, etc (and vice versa). Attempts to add duplicates will result in duplicate primary key errors.
Primary keys help speed up lookups and joins, so it's always nice to have if you can.
You don't need a primary key, not even if users are going to modify rows. A primary key optimizes the performance every time you query that table though. If you think your table will grow larger than about a thousand rows or so, then setting a primary key will give a noticeable performance boost.
The only advantage in not creating a primary key really is that it means you don't have to create one, which is fair enough I suppose :-P
You could just not bother creating one for now. You can always add one later. Not a big deal. Don't let anyone bully you into thinking you absolutely must create a primary key right now! You'll see it being horribly slow soon enough :-P and then you can just add the primary key at that point. If you don't have too many duplicates by then :-P
Best have one, if just because you may have to delete the occasional record manually (e.g. duplicates) and one should have a unique identifier for that.
The simple answer is yes. every table should have a primary key (made of at least one column). what benefit do you get for not having one?
In such a situation, you might be able to get away without one, but I'd be inclined to throw a primary key on there anyway, simply because it's relatively simple to do and will save rework if the requirements change.
The software requirements may change rapidly. The customer may introduce new requirements. So having a primary key may be useful because you can eliminate totally unnecessary data migrations in such a situations.
Read this: "Is it OK not to use a Primary Key When I don’t Need one?"
Yes, you do need a primary key.
You may as well use text files for storage if you don't think you do because it means you don't understand them...

Is it OK not to use a Primary Key When I don't Need one

If I don't need a primary key should I not add one to the database?
You do need a primary key. You just don't know that yet.
A primary key uniquely identifies a row in your table.
The fact it's indexed and/or clustered is a physical implementation issue and unrelated to the logical design.
You need one for the table to make sense.
If you don't need a primary key then don't use one. I usually have the need for primary keys, so I usually use them. If you have related tables you probably want primary and foreign keys.
Yes, but only in the same sense that it's okay not to use a seatbelt if you're not planning to be in an accident. That is, it's a small price to pay for a big benefit when you need it, and even if you think you don't need it odds are you will in the future. The difference is you're a lot more likely to need a primary key than to get in a car accident.
You should also know that some database systems create a primary key for you if you don't, so you're not saving that much in terms of what's going on in the engine.
No, unless you can find an example of, "This database would work so much better if table_x didn't have a primary key."
You can make an arguement to never use a primary key, if performance, data integrity, and normalization are not required. Security and backup/restore capabilities may not be needed, but eventually, you put on your big-boy pants and join the real world of database implementation.
Yes, a table should ALWAYS have a primary key... unless you don't need to uniquely identify the records in it. (I like to make absolute statements and immediately contradict them)
When would you not need to uniquely identify the records in a table? Almost never. I have done this before though for things like audit log tables. Data that won't be updated or deleted, and wont be constrained in any way. Essentially structured logging.
A primary key will always help with query performance. So if you ever need to query using the "key" to a "foreign key", or used as lookup then yes, craete a foreign key.
I don't know. I have used a couple tables where there is just a single row and a single column. Will always only be a single row and a single column. There is no foreign key relationships.
Why would I put a primary key on that?
A primary key is mainly formally defined to aid referencial Integrity, however if the table is very small, or is unlikely to contain unique data then it's an un-necessary overhead.
Defining indexes on the table can normally be used to imply a primary key without formally declaring one.
However you should consider that defining the Primary key can be useful for Developers and Schema generation or SQL Dev tools, as having the meta data helps understanding, and some tools rely on this to correctly define the Primary/foreign key relationships in the model.
Well...
Each table in a relational DB needs a primary key. As already noted, a primary key is data that identies a record uniquely...
You might get away with not having an "ID" field, if you have a N-M table that joins 2 different tables, but you can uniquely identifiy the record by the values from both columns you join. (Composite primary key)
Having a table without an primary key is against the first normal form, and has nothing to do in a relational DB
You should always have a primary key, even if it's just on ID. Maybe NoSQL is what you're after instead (just asking)?
That depends very much on how sure you can be that you don't need one. If you have just the slightest bit of doubt, add one - you'll thank yourself later. An indicator being if the data you store could be related to other data in your DB at one point.
One use case I can think of is a logging kind-of table, in which you simply dump one entry after the other (to properly process them later). You probably won't need a primary key there, if you're storing enough data to filter out the relevant messages (like a date). Of course, it's questionable to use a RDBMS for this.

Should i have a primary ID? i am indexing another field

Using sqlite i need a table to hold a blob to store a md5 hash and a 4byte int. I plan to index the int but this value will not be unique.
Do i need a primary key for this table? and is there an issue with indexing a non unique value? (I assume there is not issue or reason for any).
Personally, I like to have a unique primary id on all tables. It makes finding unique records for updating/deleting easier.
How are you going to reference on a SELECT * FROM Table WHERE or an UPDATE ... WHERE? Are you sure you want each one?
You already have one.
SQLite automatically creates an integer ROWID column for every row of every table. This can function as a primary key if you don't declare your own.
In general it's a good idea to declare your own primary key column. In the particular instance you mentioned, ROWID will probably be fine for you.
My advice is to go with primary key if you want to have referential integrity. However there is no issue with indexing a non unique value. The only thing is that your performance will downgrade a little.
What are the consequences of letting two identical rows somehow get into this table?
One consequence is, of course, wasted space. But I'm talking about something more fundamental, here. There are times when duplicate rows in data give you wrong results. For example, if you grouped by the int column (field), and listed the count of rows in each group, a duplicate row (record) might throw you off, depending on what you are really looking for.
Relational databases work better if they are based on relations. Relations are always in first normal form. The primary reason for declaring a primary key is to prevent the table from getting out of first normal form, and thus not representing a relation.

Moving from ints to GUIDs as primary keys

I use several referenced tables with integer primary keys. Now I want to change ints to GUIDs leaving all references intact. What is the easiest way to do it?
Thank you!
Addition
I do understand the process in general, so I need more detailed advices, for example, how to fill new GUID column. Using default value newid() is correct, but what for already existing rows?
Create a new column for the guid
value in the master table. Use the
uniqueidentifier data type, make it
not null with a newid() default so
all existing rows will be populated.
Create new uniqueidentifier columns
in the child tables.
Run update statements to build the guild relationships using the exisitng int relationships to reference the entities.
Drop the original int columns.
In addition, leave some space in your data/index pages (specify fillfactor < 100) as guids are not sequential like int identity columns are. This means inserts can be anywhere in the data range and will cause page splits if your pages are 100% full.
Firstly: Dear God why?!?!?
Secondly, you're going to have to add the GUID column to all your tables first, then populate them based on the int value. Once done you can set the GUIDs to primary/foreign keys then drop the int columns.
To update the value you'd do something like
Set the new GUIDs in the primary key table
Run this:
.
UPDATE foreignTable f
SET f.guidCol = p.guidCol
FROM primaryTable p
WHERE p.intCol = f.intCol
This is relevent in a system that implements the distributed computing model. If the system is required to know the primary key at the time when you persist information in the system, the use of a auto-incrementing primary key maintained by ONE handler will slow down the system. Instead, you need a mechanism like a GUID generator to create primary key (keep in mind that the true feature of a primary key is its uniqueness). So, I can scale up with multiple services, each creating its primary key, independently of each other.
I had dubious privilege of doing this before and basically what I had to do was to export the whole damned database into XML. Next, I had a Java application that uses the java.util.Random's nextLong() function to replace the primary key with their new guid keys. After that I imported the whole thing back in to the database.
Of course, the first time I tried to import the XML files back, I forgot to turn off the auto-number feature of the primary key field, so do learn from my mistakes. I'm sure that there're better ways of doing it, but this was a fast and dirty way of doing it ... and it worked. In case you wondering, the project was to make the application scale.
Yeah, I'm with Glenn... I was actually hesitating on posting the same thing before he posted it....
Why would you not want an auto increment int primary key separate from your GUID? it's a lot more flexible, and you can just have the GUID column indexed so you have good performance on your queries...
As for the flexibility, I like to keep my id's as autoincrement ints because then the other seemingly unique and primary-key worthy item can change.
A great case of the flexibility is if you use usernames as a primary key. Even if they are unique, it is nice to be able to change them. What if users use an email address as their username? Being able to change the username and have it not affect all your queries is a big plus, and I suspect the same could be true with your GUIDs....
I think, you must do it manualy. Or you can write some utility for it. The scenario should be:
Duplicate the "int" PK/FK columns with new "guid" columns.
Generates new values for "guid" PK columns.
Update values in "guid" FK columns with specified values ( you find the records via "int" PK ).
Remove references ( relations ) with "int" PK/FK columns.
Create similar references ( relations ) with "guid" PK/FK columns.
Remove "int" PK/FK columns.
It's a very good choice. I switched from longs to UUID for one of my applications and I don't regret it. If you use MS SQL Server it is included in standard (I use postgresql and it's only included in standard from 8.3 on).
Like mentioned by Glenn Slaven, you can recreate UUIDs from the keys you have in your current records. Be aware that they will not be unique though but that way it's easy to keep the relationships intact. New records you create after the move will be unique.
DON'T DO IT! We started out using GUIDs, and now we've almost finished moving to INTs as PKs; we're retaining the GUID for logging purposes (and for some tables of, er, "negotiable relational integrity" ;) ), but the speed increase of using ints has been phenomenal.
This only really became apparent when the table rowcounts crossed into millions, mind you.
Our biggest folly by far was using a NEWID() as the PK of our (sequential) log table - there was much head-smacking when we realised our error.