I'm new to DB/Hibernate and found this code:
@SequenceGenerator(name = "entSeq", allocationSize = 5, sequenceName = "CODE_SEQ")
...
@Id
@GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "entSeq")
which sets up a sequence for the primary key.
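For context, here is a minimal sketch of what the database-side object behind that mapping might look like (Oracle syntax assumed; only the name CODE_SEQ comes from the mapping, the rest is illustrative):
-- Hypothetical DDL; INCREMENT BY 5 matches allocationSize = 5,
-- so Hibernate can hand out a block of 5 ids per database round trip.
CREATE SEQUENCE CODE_SEQ START WITH 1 INCREMENT BY 5;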
Why were sequences used for the primary key values? Which goals were being addressed:
increasing performance?
adding constraints or checks?
limiting the possible value range of integer IDs (and why do that)?
Why start counting from 1?
I read about the syntax and usage in:
http://msdn.microsoft.com/en-us/library/ff878091.aspx
http://www.techonthenet.com/oracle/sequences.php
but didn't find an answer to my question.
UPDATE:
I enjoyed reading:
http://www.oracle.com/technetwork/products/rdb/0307-sequences-130053.pdf
Guide to Using SQL: Sequence Number Generator
which shows that generating unique IDs for primary keys is a known problem in database theory. That means I can insert into a table without having to come up with a primary key value myself; the sequence supplies it:
INSERT INTO suppliers
(supplier_id, supplier_name)
VALUES
(supplier_seq.nextval, 'Kraft Foods');
But I'd expect this feature to be present in every database without forcing me to supply primary key values...
Am I thinking about this correctly?
UPDATE2:
An answer for why START WITH is used:
This clause can be useful when adding sequences to existing databases. When an older scheme was in
use by the application and has already consumed some values from the legal range this clause can be
used to skip those consumed values. MINVALUE and MAXVALUE are used to specify the legal range but
START WITH would initiate the sequence usage within that range so that previously generated values
would not reappear.
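As a hedged illustration (Oracle syntax; the names and the consumed range are invented), skipping values already used by a legacy scheme might look like this:
-- Hypothetical example: values 1..5000 were consumed by the old scheme,
-- so the sequence starts within the legal range but past those values.
CREATE SEQUENCE supplier_seq
MINVALUE 1
MAXVALUE 999999999
START WITH 5001
INCREMENT BY 1;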
UPDATE3: *sequences* provide a surrogate key: http://en.wikipedia.org/wiki/Surrogate_key
Historically, there were two main reasons.
Avoid performance problems with ON UPDATE CASCADE in big tables.
Avoid performance problems with joins on wide, natural keys.
Oracle doesn't even support ON UPDATE CASCADE, so updating a value that's used in foreign key references is more troublesome than on other platforms.
Those performance "problems" are much less severe nowadays than they were 20 years ago, given tables of the same size. (Hardware's a lot faster now.) But we seem to deal with much bigger tables now than we did 20 years ago.
There are some undesirable side effects of this kind of performance tuning.
You typically need many more joins than you would with carefully chosen natural keys and ON UPDATE CASCADE. You might need so many that the joins become more costly than the disk reads.
It's easier to get lost in the joins when you have 20 or 30 of them.
Rows are harder to quickly understand. (A row that reads {1, 7, 13, 255, 438} is harder to understand than a row that reads {1, library, checkout, 255, 'A book is your friend'}.)
Often, a database designer assigns an ID number as the primary key, but doesn't set any other UNIQUE constraints. That makes the ID number a row identifier, not an identifier for the real-world thing the row represents. That can be a big problem.
You cannot rely on auto-generated keys being available in all databases. Unlike most other databases, Oracle does not provide an auto-incrementing datatype that can be used to generate a sequential primary key value.
However, the same effect can be achieved with a sequence and a trigger.
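A minimal sketch of that sequence-plus-trigger pattern (Oracle syntax; object names are illustrative, reusing the suppliers example from above):
-- Hypothetical sequence and trigger emulating an auto-increment column.
CREATE SEQUENCE supplier_seq START WITH 1 INCREMENT BY 1;

CREATE OR REPLACE TRIGGER supplier_bi
BEFORE INSERT ON suppliers
FOR EACH ROW
BEGIN
  -- Only fill in the key when the caller didn't supply one.
  IF :NEW.supplier_id IS NULL THEN
    SELECT supplier_seq.NEXTVAL INTO :NEW.supplier_id FROM dual;
  END IF;
END;
/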
Related
Suppose I insert, let's say, 10 entries into a SQL Server table.
If I then delete one of them, will the ids of the remaining rows change correspondingly?
Example:
1 | Simon Cowell | 56 years
2 | Frank Lampard| 24 years
3 | Harry Bennet | 12 years
If I delete #2, will Harry Bennet's index change to 2?
Thanks :)
EDIT:
Sorry for my outrage, had a bad day. And yes, I should have researched it myself, I deserve to be downvoted.
I don't ask for anything, I just want to say that I'm sorry :|
Since you seem to be conflating "id" and "index", let's talk a little bit about primary keys and indexes in the context of a relational database.
The "id" or primary key assigned to a row in a SQL database is the unique identifier for that row. It can consist of one or more columns. (When more than one column is involved it is known as a "composite" or "multi-part" key.) The primary key should really do nothing more than be a unique handle for addressing a row: the primary key should not contain any information about the entity represented by the row, especially if that info has the potential to be mutable; an example would be a part number that has a suffix that stands for the type of metal the part is made from; if that metal can possibly change from titanium to unobtainium, say, that part number would make a bad choice as a primary key; it would be better to have another column to store the type of metal than to make the metal-type suffix part of the primary key. "Meaningful" primary keys might have made some sense in legacy non-relational databases but in a relational database they are to be avoided.
When seeking to enforce the uniqueness of a primary key, a database engine can make use of an index so it can rapidly test whether a key value exists. It can use a binary search against the index, avoiding the need to scan the actual data "brute force", row by row, looking for the value. But the index that is used behind the scenes by the engine to assist with primary key housekeeping is not the same thing as the primary key itself.
If you have a simple sequential integer as your primary key, there is an effectively unlimited supply of them, so there is no need to reuse an integer that becomes available when the row it was assigned to is deleted. The relational database engine won't automatically attempt to reuse it, and it won't by any means change the primary key values assigned to all the other rows in the table when a deletion creates "gaps" in the number sequence. Many other rows in other tables could be referencing those values, and having them be mutable would create either chaos or a huge inefficiency.
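A quick demo of this (SQL Server 2008+ syntax, using the example table from the question; names are illustrative):
-- Create the table, insert three rows, delete the second one.
CREATE TABLE People (
    ID   INT IDENTITY(1,1) PRIMARY KEY,
    Name VARCHAR(50) NOT NULL
);
INSERT INTO People (Name) VALUES ('Simon Cowell'), ('Frank Lampard'), ('Harry Bennet');
DELETE FROM People WHERE ID = 2;
SELECT ID, Name FROM People;
-- Returns ids 1 and 3: Harry Bennet keeps id 3, and the gap at 2 remains.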
Hashing algorithms are another very efficient way a database engine can quickly test for the existence of a key value. It computes the location in the hashed-file where the key would be if it did exist, and then looks there for it. The rows are stored in no particular order, so such schemes are optimized for instant finding of records in a large table, not for culling records that have a common characteristic, such as all customers in zipcode 10023.
No. You can set up triggers or logic to do it if you want; however, it will not automatically do this.
No, it will not change automatically.
No, it won't. And hopefully, that's the answer you're hoping for. For any auto-generated identifiers (such as IDENTITY columns), you should, as far as possible, ignore the data type and treat it as an opaque "blob" of identity information.
It gets assigned during insert, and you can use it for cross-referencing purposes, but the fact that it's numeric is not something you should use or rely upon. It's just a stable identifier for the row.
Q:
If I have a composite key made up of 4 fields, for example, can I update one of them?
I mean can I execute a statement like this:
UPDATE tb
SET firstCol = '15', secondCol = 'test2'
WHERE firstCol = '1' AND serial = '2';
Given:
my table name is: tb
my fields are: firstCol, secondCol, serial
my keys are: firstCol , serial
Any suggestions? Did I miss some concept?
Thanks.
Of course you can do that, why?
Do you have a problem doing that?
You may run into problems updating if you try to give a row the same key values as an existing row. No matter what you do in an update, the unique constraint will still apply.
If you have related tables and have cascade update turned on, you may have locking issues if many records need to be locked. If you do not have cascade update turned on, you may have issues where a PK cannot be changed until you break those relationships and then put them back after manually changing all the related tables to the new value. This task, either way, should only be done in single user mode during non-peak hours.
Personally, if you need to change the PK, the design of your database is fragile and may cause problems in the future, especially with a multicolumn key. If this is a one-time, rare change, go ahead and work through the issues. Otherwise, it might be time to decide whether a surrogate key as the PK plus a unique index on the multiple columns is a better choice. Multicolumn PKs create much larger indexes, not only on the main table but on the child tables as well; they can create difficult issues when you need to update one of the columns; and they have performance implications for joins. In general I'm not a fan of them, and definitely not if some of those columns will need updating with any frequency (and by that I mean any large update more than once a year; one or two records is OK, but if you are running an update as described more often than that, you need to revisit the design in my opinion).
Yes you can. It may be part of the key but it's still a column.
Note: if you have FKs relying on this key, then you'll need to consider CASCADE updates. Also, a key update (assuming it's clustered) means more work than a "normal" update because of how non-clustered indexes refer to the clustered key.
In our DB (on SQL Server 2005) we have a "Customers" table, whose primary key is the Client Code, a surrogate bigint IDENTITY(1,1) key; the table is referenced by a number of other tables in our DB through foreign keys.
A new CR implementation we are estimating would require us to change the ID column type to varchar, with the Client Code generation algorithm shifting from a simple numeric progression to a strict 2-char representation, with codes ranging from 01 to 99 and then progressing like this:
1A -> 2A -> ... -> 9A -> 1B -> ... 9Z
I'm fairly new to database design, but I smell some serious problems here. First of all, what about this client code generation algorithm? What if I need a Client Code to go beyond the 9Z limit?
Then I have some questions: would this change be feasible, with the table already filled with a fair amount of data and referenced by multiple entities? If so, how would you approach this problem, and how would you implement Client Code generation?
I would leave the primary key as it is and create another (unique) key on the generated client code.
I would do that anyway. It's always better to have a short numeric primary key instead of long char keys.
In some situations you might prefer a GUID (for replication purposes), but a numeric int/bigint key is always preferable.
My biggest concern with what you are proposing is that you will be limited to 333 primary records (99 numeric codes plus 9 × 26 letter-based codes). That seems like a small number.
Performing the change is a multi-step operation. You need to create the new field in the core table and all its related tables.
To do an in-place update, you need to generate the code in the core table. Then you need to update all the related tables to have the code based on the old id. Then you need to add the foreign key constraint to all the related tables. Then you need to remove the old key field from all the related tables.
We only did that in our development server. When we upgraded the live databases, we created a new database for each and copied the data over using a python script that queried the old database and inserted into the new database. I now update that script for every software upgrade so the core engine stays the same, but I can specify different tables or data modifications. I get the bonus of having a complete backup of the original database if something unexpected happens when upgrading production.
One strong argument in favor of a non-identity/guid code is that you want a human readable/memorable code and you need to be able to move records between two systems.
Performance is not necessarily a concern in SQL Server 2005 and 2008. We recently went through a change where we moved from int ids everywhere to 7 or 8 character "friendly" record codes. We expected to see some kind of performance hit, but we in fact saw a performance improvement.
We also found that we needed a way to quickly generate a code. Our codes have two parts, a 3-character alpha prefix and a 4- or 5-digit suffix. Once we had a large number of codes (15,000-20,000), we found it too slow to parse the code into prefix and suffix and find the lowest unused code (it took several seconds). Because of this, we also store the prefix and the suffix separately (in the primary key table) so that we can quickly find the next available lowest code with a particular prefix. The cached prefix and suffix made the search almost free.
We allow changing of the codes, and the changed values propagate via cascade-update rules on the foreign key relationships. We keep an identity key on the core code table to simplify updating the code.
We don't use an ORM, so I don't know what specific things to be aware of with that. We also have on the order of 60,000 primary keys in our biggest instance, but we have hundreds of related tables, some with millions of rows referencing the code table.
One big advantage that we got was, in many cases, we did not need to do a join to perform operations. Everywhere in the software the user references things by friendly code. We don't have to do a lookup of the int ID (or a join) to perform certain operations.
The new code generation algorithm isn't worth thinking about. You can write a program to generate all possible codes in just a few lines of code. Put them in a table, and you're practically done. You just need to write a function to return the smallest one not yet used. Here's a Ruby program that will give you all the possible codes.
# test.rb -- generate a peculiar sequence of two-character codes.
i = 1
('A'..'Z').each do |c|
  (1..9).each do |n|
    printf("'%d%s', %d\n", n, c, i)
    i += 1
  end
end
The program writes CSV output that you should be able to import easily into a table. You need two columns to control the sort order, because the new values don't naturally sort the way your requirements specify.
I'd be more concerned about the range than the algorithm. If you're right about the requirement, you're limited to 234 client codes. If you're wrong, and the range extends from "1A" to "ZZ", you're limited to less than a thousand.
To implement this requirement in an existing table, you need to follow a careful procedure. I'd try it several times in a test environment before trying it on a production table. (This is just a sketch. There are a lot of details.)
Create and populate a two-column table to map existing bigints to the new CHAR(2).
Create new CHAR(2) columns in all the tables that need them.
Update all the new CHAR(2) columns.
Create new NOT NULL UNIQUE or PRIMARY KEY constraints and new FOREIGN KEY constraints on the new CHAR(2) columns.
Rewrite user interface code (?) to target the new columns. (Might not be necessary if you rename the new CHAR(2) and old BIGINT columns.)
Set a target date to drop the old BIGINT columns and constraints.
And so on.
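A rough sketch of the first three steps (SQL Server syntax; every name here is invented for illustration, not taken from the original schema):
-- 1. Mapping table from the old bigint ids to the new CHAR(2) codes.
CREATE TABLE ClientCodeMap (
    old_id   BIGINT  NOT NULL PRIMARY KEY,
    new_code CHAR(2) NOT NULL UNIQUE
);
-- 2. Add the new column to a referencing table.
ALTER TABLE Orders ADD ClientCode CHAR(2) NULL;
-- 3. Populate it from the mapping.
UPDATE o
SET o.ClientCode = m.new_code
FROM Orders o
JOIN ClientCodeMap m ON m.old_id = o.ClientID;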
Not really addressing whether this is a good idea or not, but you can change your foreign keys to cascade the updates. What will happen once you're done doing that is that when you update the primary key in the parent table, the corresponding key in the child table will be updated accordingly.
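As a hedged example (SQL Server syntax; the table and column names are hypothetical), such a foreign key could be declared like this:
-- Updates to Customers.ClientCode now propagate to Orders automatically.
ALTER TABLE Orders
ADD CONSTRAINT FK_Orders_Customers
FOREIGN KEY (ClientCode) REFERENCES Customers (ClientCode)
ON UPDATE CASCADE;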
I have read a lot of articles about whether we should have primary keys that are identity columns, but I'm still confused.
There are advantages to making columns identity columns, as it gives better performance in joins and provides data consistency. But there is a major drawback associated with identity: when an INSERT statement fails, the IDENTITY value still increases, and if a transaction is rolled back, the new IDENTITY value isn't rolled back either, so we end up with gaps in the sequence. I could use GUIDs (via NEWSEQUENTIALID) but that reduces performance.
Gaps should not matter: the identity column is internal and not for end user usage or recognition.
GUIDs will kill performance, even sequential ones, because of the 16 byte width.
An identity column should be chosen to respect the physical implementation after modelling your data and working out what your natural keys are. That is, the chosen natural key is the logical key but you choose a surrogate key (identity) because you know how the engine works.
Or you use an ORM and let the client tail wag the database dog...
For all practical purposes, integers are ideal for primary keys, and auto-increment is a perfect way to generate them. As long as your PK is meaningless (surrogate), it will be protected from the creativity of your customers and serve its main purpose (to identify a row in a table) just fine. Indexes are packed, joins are as fast as it gets, and it is easy to partition tables.
If you happen to need a GUID, that's fine too; however, think auto-increment integer first.
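A minimal sketch of that default choice (SQL Server syntax; the table is hypothetical):
CREATE TABLE Customers (
    CustomerID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    Name       NVARCHAR(100) NOT NULL
);
-- The key is generated automatically; inserts never mention it.
INSERT INTO Customers (Name) VALUES ('Kraft Foods');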
I would say it depends on your needs. We use only GUIDs as primary keys (with the default set to NEWID) because we develop a distributed system with many SQL Server instances, so we have to be sure that every SQL Server generates unique primary key values.
But when using a Guid column as PK, be sure not to use it as your clustered index (thanks to marc_s for the link)
Advantage of the Guid type:
You can create unique values on different locations without synchronization
Disadvantages:
It's a large datatype (16 bytes) and needs significantly more space
It creates index fragmentation (at least when using the NEWID() function)
Data consistency is not an issue with primary keys regardless of the datatype, because a primary key has to be unique by definition!
I don't believe that an identity column has better join performance. Overall, performance is a matter of having the right indexes. A primary key is a constraint, not an index.
Do you need a primary key of type int with no gaps? That shouldn't normally be a problem.
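One hedged way to soften the fragmentation disadvantage noted above (SQL Server syntax; names are invented): generate sequential GUIDs via a column default, since NEWSEQUENTIALID() may only be used that way:
CREATE TABLE Nodes (
    NodeGuid UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_Nodes_Guid DEFAULT NEWSEQUENTIALID(),
    Name     NVARCHAR(100) NOT NULL,
    CONSTRAINT PK_Nodes PRIMARY KEY (NodeGuid)
);
-- Sequential values mostly append at the end of the index instead of
-- splitting pages at random positions, unlike NEWID().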
"yes, it KILLS performance - totally. I went from a legacy system with GUID as PK/CK and 99.5% index fragmentation on a daily basis to using INT IDENTITY - HUGE difference. Hardly any index fragmentation anymore, performance is significantly better. GUIDs as Clustering Index on your SQL Server table are BAD BAD BAD - period."
Might be true, but I see no logical reasoning according to which this leads me to conclude that GUIDs PER SE are also BAD BAD BAD.
Maybe you should consider using other types of indexes on such data. And if your dbms does not offer you a choice between several types of index, then perhaps you should consider getting yourself a better dbms.
I know this is subjective, but I'd like to know people's opinions and hopefully some best practices that I can apply when designing SQL Server table structures.
I personally feel that keying a table on a fixed (max) length varchar is a no-no, because it means having to propagate the same fixed length across any other tables that use it as a foreign key. Using an int would avoid having to apply the same length across the board, which is bound to lead to human error, i.e. one table has varchar(10) and another varchar(20).
This sounds like a nightmare to initially set up, and it means future maintenance of the tables is cumbersome too. For example, say the keyed varchar column suddenly becomes 12 chars instead of 10. You now have to go and update all the other tables, which could be a huge task years down the line.
Am I wrong? Have I missed something here? I'd like to know what others think of this and whether sticking with int for primary keys is the best way to avoid maintenance nightmares.
When choosing the primary key you usually also choose the clustered key. The two are often confused, but you have to understand the difference.
Primary keys are logical business elements. The primary key is used by your application to identify an entity, and the discussion about primary keys is largely about whether to use natural keys or surrogate keys. The links go into much more detail, but the basic idea is that natural keys are derived from an existing entity property like SSN or phone number, while surrogate keys have no meaning whatsoever with regard to the business entity, like id or rowid, and they are usually of type IDENTITY or some sort of uuid. My personal opinion is that surrogate keys are superior to natural keys, and the choice should always be identity values for local-only applications and guids for any sort of distributed data. A primary key never changes during the lifetime of the entity.
The clustered key is the key that defines the physical storage of rows in the table. Most times it overlaps with the primary key (the logical entity identifier), but that is not actually enforced nor required. When the two are different, it means there is a non-clustered unique index on the table that implements the primary key. Clustered key values can actually change during the lifetime of the row, resulting in the row being physically moved in the table to a new location. If you have to separate the primary key from the clustered key (and sometimes you do), choosing a good clustered key is significantly harder than choosing a primary key. There are two primary factors that drive your clustered key design:
The prevalent data access pattern.
The storage considerations.
Data access pattern. By this I mean the way the table is queried and updated. Remember that clustered keys determine the actual order of the rows in the table. For certain access patterns, some layouts make all the difference in the world with regard to query speed or update concurrency:
Current vs. archive data. In many applications the data belonging to the current month is frequently accessed, while older data is seldom accessed. In such cases the table design uses table partitioning by transaction date, often using a sliding-window algorithm. The current month's partition is kept on a filegroup located on hot, fast disks; the archived old data is moved to filegroups hosted on cheaper but slower storage. Obviously in this case the clustered key (date) is not the primary key (transaction id). The separation of the two is driven by the scale requirements, as the query optimizer will be able to detect that the queries are only interested in the current partition and not even look at the historic ones.
FIFO queue-style processing. In this case the table has two hot spots: the tail where inserts occur (enqueue), and the head where deletes occur (dequeue). The clustered key has to take this into account and organize the table so as to physically separate the tail and head locations on disk, in order to allow concurrency between enqueue and dequeue, e.g. by using an enqueue-order key. In pure queues this clustered key is the only key, since there is no primary key on the table (it contains messages, not entities). But most times the queue is not pure; it also acts as the storage for the entities, and the line between the queue and the table is blurred. In this case there is also a primary key, which cannot be the clustered key: entities may be re-enqueued, thus changing the enqueue-order clustered key value, but they cannot change the primary key value. Failure to see the separation is the primary reason why user-table-backed queues are so notoriously hard to get right and riddled with deadlocks: the enqueues and dequeues occur interleaved throughout the table, instead of being localized at the tail and the head of the queue.
Correlated processing. When the application is well designed, it partitions the processing of correlated items between its worker threads. For instance, a processor is designed to have 8 worker threads (say, to match the 8 CPUs on the server), so the workers partition the data amongst themselves, e.g. worker 1 picks up only accounts named A to E, worker 2 F to J, etc. In such cases the table should actually be clustered by the account name (or by a composite key that has the first letter of the account name in the leftmost position), so that workers localize their queries and updates in the table. Such a table would have 8 distinct hot spots, around the areas where each worker concentrates at the moment, but the important thing is that they don't overlap (no blocking). This kind of design is prevalent in high-throughput OLTP designs and in TPC-C benchmark loads, where this kind of partitioning also reflects in the memory location of the pages loaded in the buffer pool (NUMA locality), but I digress.
Storage considerations. The clustered key width has huge repercussions for the storage of the table. For one, the key occupies space in every non-leaf page of the b-tree, so a large key will occupy more space. Second, and often more important, the clustered key is used as the lookup key by every non-clustered index, so every non-clustered index will have to store the full width of the clustered key for each row. This is what makes large clustered keys like varchar(256) and guids poor choices for clustered index keys.
The choice of key also has an impact on clustered index fragmentation, sometimes drastically affecting performance.
These two forces can sometimes be antagonistic: the data access pattern may require a certain large clustered key, which will cause storage problems. In such cases, of course, a balance is needed, but there is no magic formula. You measure and you test to get to the sweet spot.
So what do we make of all this? Always start by considering a clustered key that is also the primary key, of the form entity_id IDENTITY(1,1) NOT NULL. Separate the two and organize the table accordingly (e.g. partition by date) when appropriate.
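A minimal sketch of separating the two (SQL Server syntax; names are hypothetical), in the spirit of the current-vs-archive example above:
CREATE TABLE Transactions (
    TransactionID   BIGINT IDENTITY(1,1) NOT NULL,
    TransactionDate DATETIME NOT NULL,
    Amount          MONEY NOT NULL,
    -- Logical identifier: a nonclustered primary key.
    CONSTRAINT PK_Transactions PRIMARY KEY NONCLUSTERED (TransactionID)
);
-- Physical order: cluster by date so current-period queries stay localized.
CREATE CLUSTERED INDEX CIX_Transactions_Date
ON Transactions (TransactionDate, TransactionID);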
I would definitely recommend using an INT NOT NULL IDENTITY(1,1) field in each table as the primary key.
With an IDENTITY field, you can let the database handle all the details of making sure it's really unique and all, and the INT datatype is just 4 bytes, and fixed, so it's easier and more suited to be used for the primary (and clustering) key in your table.
And you're right: an INT is an INT is an INT. It will not change its size or anything, so you won't ever have to go recreate and/or update your foreign key relations.
Using a VARCHAR(10) or (20) just uses up too much space: 10 or 20 bytes instead of 4. And what a lot of folks don't know: the clustering key value will be repeated in every single index entry of every single non-clustered index on the table, so potentially you're wasting a lot of space (not just on disk, which is cheap, but also in SQL Server's main memory). Also, since it's variable (might be 4, might be 20 chars), it's harder for SQL Server to properly maintain a good index structure.
Marc
I'd agree that in general an INT (or identity) field type is the best choice in most "normal" database designs:
it requires no "algorithm" to generate the id/key/value
you have fast(er) joins and the optimizer can work a lot harder over ranges and such under the hood
you're following a de facto standard
That said, you also need to know your data. If you're going to blow through a signed 32-bit int, you need to think about unsigned. If you're going to blow through that, maybe 64-bit ints are what you want. Or maybe you need a UUID/hash to make syncing between database instances/shards easier.
Unfortunately, it depends and YMMV but I'd definitely use an int/identity unless you have a good reason not to.
Like you said, consistency is key. I personally use unsigned ints. You're not going to run out of them unless you are working with ludicrous amounts of data, and you can always know any key column needs to be that type and you never have to go looking for the right value for individual columns.
Based on going through this exercise countless times and then supporting the system with the results, there are some caveats to the blanket statement that INT is always better. In general, unless there is a reason, I would go along with that. However, in the trenches, here are some pros and cons.
INT
Use unless good reason not to do so.
GUID
Uniqueness - One example is the case where there is one way communication between remote pieces of the program and the side that needs to initiate is not the side with the database. In that case, setting a Guid on the remote side is safe where selecting an INT is not.
Uniqueness Again - A more far-fetched scenario is a system where multiple customers coexist in separate databases and there is migration between customers, like similar users using a suite of programs. If such a user signs up for another program, their user record can be used there without conflict. Another scenario is if customers acquire entities from each other. If both are on the same system, they will often expect that migration to be easier. Essentially, any frequent migration between customers.
Hard to Use - Even an experienced programmer cannot remember a guid. When troubleshooting, it is often frustrating to have to copy and paste identifiers for queries, especially if the support is being done with a remote access tool. It is much easier to constantly refer to SELECT * FROM Xxx WHERE ID = 7 than SELECT * FROM Xxx WHERE ID = 'DF63F4BD-7DC1-4DEB-959B-4D19012A6306'
Indexing - using a clustered index for a guid field requires constant rearrangement of the data pages and is not as efficient to index as INTs or even short strings. It can kill performance - don't do it.
CHAR
Readability - Although conventional wisdom is that nobody should be in the database, the reality of systems is that people will have access - hopefully personnel from your organization. When those people are not savvy with join syntax, a normalized table with ints or guids is not clear without many other queries. The same normalized table with SOME string keys can be much more usable for troubleshooting. I tend to use this for the type of table where I supply the records at installation time so they do not vary. Things like StatusID on a major table is much more usable for support when the key is 'Closed' or 'Pending' than a digit. Using traditional keys in these areas can turn an easily resolved issue to something that requires developer assistance. Bottlenecks like that are bad even when caused by letting questionable personnel access to the database.
Constrain - Even if you use strings, keep them fixed length, which speeds indexing, and add a constraint or foreign key to keep garbage out. Sometimes using this string can allow you to remove the lookup table and maintain the selection as a simple enum in the code; it is still important to constrain the data going into this field.
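A small sketch of that constrained-string approach (SQL Server syntax; all names and codes are invented):
-- Lookup table of the allowed, fixed-length status codes.
CREATE TABLE OrderStatus (
    StatusCode CHAR(7) NOT NULL PRIMARY KEY
);
INSERT INTO OrderStatus (StatusCode) VALUES ('OPEN'), ('PENDING'), ('CLOSED');

CREATE TABLE Orders (
    OrderID    INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    -- Readable for troubleshooting, but still constrained by the FK.
    StatusCode CHAR(7) NOT NULL REFERENCES OrderStatus (StatusCode)
);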
For best performance, 99.999% of the time the primary key should be a single integer field.
Unless you require the primary key to be unique across multiple tables in a database or across multiple databases. I am assuming that you are asking about MS SQL Server because that is how your question was tagged. In that case, consider using a GUID field instead. Though better than a varchar, GUID field performance is not as good as an integer's.
Use INT. Your points are all valid; I would prioritize as:
Ease of using SQL auto-increment capability - why reinvent the wheel?
Manageability - you don't want to have to change the key field.
Performance
Disk Space
1 & 2 require the developer's time/energy/effort. 3 & 4 you can throw hardware at.
If Joe Celko was on here, he would have some harsh words... ;-)
I want to point out that INTs as a hard-and-fast rule aren't always appropriate. Say you have a vehicle table with all types of cars, trucks, etc. Now say you had a VehicleType table. If you wanted to get all trucks you might do this (with an INT identity seed):
SELECT V.Make, V.Model
FROM Vehicle as V
INNER JOIN VehicleType as VT
ON V.VehicleTypeID = VT.VehicleTypeID
WHERE VT.VehicleTypeName = 'Truck'
Now, with a Varchar PK on VehicleType:
SELECT Make, Model
FROM Vehicle
WHERE VehicleTypeName = 'Truck'
The code is a little cleaner and you avoid a join. Perhaps the join isn't the end of the world, but if you only have one tool in your toolbox, you're missing some opportunities for performance gains and cleaner schemas.
Just a thought. :-)
While INT is generally recommended, it really depends on your situation.
If you're concerned with maintainability, then other types are just as feasible. For example, you could use a Guid very effectively as a primary key. There are reasons for not doing this, but consistency is not one of them.
But yes, unless you have a good reason not to, an int is the simplest to use, and the least likely to cause you any problems.
With PostgreSQL I generally use the "Serial" or "BigSerial" 'data type' for generating primary keys. The values are auto incremented and I always find integers to be easy to work with. They are essentially equivalent to a MySQL integer field that is set to "auto_increment".
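A minimal example (PostgreSQL syntax; the table is hypothetical):
-- BIGSERIAL creates and wires up a sequence behind the scenes.
CREATE TABLE accounts (
    account_id BIGSERIAL PRIMARY KEY,
    email      TEXT NOT NULL UNIQUE
);
-- account_id is assigned automatically on insert.
INSERT INTO accounts (email) VALUES ('someone@example.com');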
One should think hard about whether the 32-bit range is enough for what you're doing. Twitter's status IDs were 32-bit INTs, and they had trouble when they ran out.
Whether to use a BIGINT or a UUID/GUID in that situation is debatable and I'm not a hardcore database guy, but UUIDs can be stored in a fixed-length VARCHAR without worrying that you'll need to change the field size.
We have to keep in mind that the primary key of a table should not have "business logic"; it should be only an identity for the record it belongs to. Following this simple rule, an int, and especially an identity int, is a very good solution. By asking about varchar I guess you mean using, for example, the "Full Name" as a key to the "people" table. But what if we want to change the name from "George Something" to "George A. Something"? And what size should the field be? If we change the size, we have to change the size in all foreign tables too. So we should avoid logic in keys. Sometimes we can use the social ID (an integer value) as a key, but I avoid that too. Now if a project has prospects of scaling up, you should consider using GUIDs too (the uniqueidentifier SQL type).
Keeping in mind that this is quite an old question, I still want to make the case for varchar surrogate keys for future readers:
An environment with several replicated machines
Scenarios where the ID of a row to be inserted must be known before it is actually inserted (i.e., the client assigns the ID, not the database); see the sketch below
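A sketch of that second scenario (generic SQL; the names and the UUID value are made up): the client generates the key itself, so it is known before the INSERT runs.
CREATE TABLE documents (
    document_id VARCHAR(36) NOT NULL PRIMARY KEY,  -- client-generated UUID
    title       VARCHAR(200) NOT NULL
);
INSERT INTO documents (document_id, title)
VALUES ('6f1c2a0e-8a4b-4c6e-9d2f-0b1a2c3d4e5f', 'Example document');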