find_in_batches works only with integer-based primary keys? - ruby-on-rails-3

My ActiveRecord models use uuid-based primary keys, and I want to use find_in_batches to load 1000 records at a time. However, the documentation says it only works with integer-based primary keys. I went through the code, and I see it just orders records by "primary_key ASC". Why doesn't it work with non-integer primary keys? Just because of this ordering? I tried this method with my model, and it works fine.
Could anyone explain this to me?

I'd guess the documentation is not 100% correct. It works correctly with any monotonically increasing primary key. If you can guarantee that the uuid of any new record will be greater than the key of any existing record in the table, it will work correctly. Otherwise, you have a chance of missing new records added after you start batch processing.
Internally, on each step it takes the id of the last record in the current batch (last_id) and, on the next step, selects the next 1000 records with an id greater than last_id. So if the application creates a new record with a unique id < last_id during a processing step, that record will be excluded from processing.
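Roughly, the queries it issues look like this (a sketch only, assuming a users table and the default batch size of 1000):
SELECT * FROM users ORDER BY id ASC LIMIT 1000;
-- remember the id of the last row returned (last_id), then repeat:
SELECT * FROM users WHERE id > 'last_id' ORDER BY id ASC LIMIT 1000;
-- ...until a batch comes back with fewer than 1000 rows
With uuid keys the > comparison is lexicographic, which is exactly why the keys must grow monotonically for no rows to be missed.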

Related

Is there a way to have primary key ids without any gaps

I'm creating an application with Java Spring and Oracle DB.
In the app, I want to generate a primary key value that is unique as well as ordered and without gaps: 1,2,3,4,5 instead of 1,2,5,7,8,9.
At one point I used max(id) + 1 to get the maximum value of the id and use it as the id of the next/current transaction. However, I know it isn't safe in the case of concurrency, with multiple users or multiple sessions.
I've tried using sequences, but even with the ORDER option a sequence can still create gaps when a transaction fails.
CREATE SEQUENCE num_seq
START WITH 1
INCREMENT BY 1
ORDER NOCACHE NOCYCLE;
Gapless values are a hard requirement, but I'm unsure how that's possible with multiple users/multiple sessions.
Don't do it.
The goal of primary keys is not to be displayed on the UI or to be exposed to the external world, but only to provide a unique identifier of the row.
In simple words, a primary key doesn't need to be sexy or good looking. It's an internal identifier.
If you are considering the idea of having a serial identifier, that probably means you want to display it somewhere or expose it to the external world. If that's the case, then create a secondary column (also unique) that serves this "public relations" goal. It can be automatically generated, or updated at leisure, without affecting the integrity of the database.
It can also be generated by a secondary process that runs in a deferred way (e.g. every 10 minutes), finds all the "unassigned" new rows, and gives them the next numbers. This has the advantage of not being vulnerable to concurrency issues.
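In Oracle, that deferred job could be a single MERGE statement along these lines (a sketch only; the invoices table and display_no column are hypothetical, and the job must run serialized so two instances never number rows concurrently):
MERGE INTO invoices t
USING (
  SELECT id,
         (SELECT NVL(MAX(display_no), 0) FROM invoices)
           + ROW_NUMBER() OVER (ORDER BY id) AS new_no
  FROM invoices
  WHERE display_no IS NULL
) s
ON (t.id = s.id)
WHEN MATCHED THEN
  -- assigns consecutive numbers immediately after the current maximum
  UPDATE SET t.display_no = s.new_no;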

Is it a performance problem to start adding sequential GUIDs to a table with non-sequential GUIDs

I have tables with a primary key of type uniqueidentifier, filled with non-sequential ids, and now I want to start adding only sequential ids to these tables. The GUID generation is done in the code. Could this create problems in the indexes over the previous data? Logically I don't see any problem, but I can't find any information about cases like this.
P.S. This is a legacy project. I can't update all the previous primary keys in the table to be sequential, because there are no foreign key constraints, so mismatches would start to occur in other tables.
First of all, I can give you an educated answer:
If the first sequential guid that you generate is greater than the last guid inserted in your table, you can be sure that further sequential guids you insert will not cause any indexing problems, since they are sequential! To achieve this, you can use a workaround: create the first SequentialGuid for each table so that it is greater than the last inserted non-sequential guid in that table, by passing the last inserted non-sequential guid to your SequentialGuid creator method.
You can find a library in this link that allows you to create a SequentialGuid with a lastId as the base value.
Also, be advised that the ordering of sequential guids can change if the server reboots. So SequentialGuids are not reliable from this perspective.
Finally, if it is applicable to your case, I would suggest an even better solution:
You can modify your tables by adding an identity integer field as the clustered index, and keep your Guid field as is, BUT as a nonclustered primary key.
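In T-SQL, that restructuring might look roughly like this (a sketch; the Documents table, Id column, and constraint names are hypothetical):
-- drop the existing clustered primary key on the GUID column
ALTER TABLE dbo.Documents DROP CONSTRAINT PK_Documents;
-- add an auto-incrementing integer and make it the clustered index
ALTER TABLE dbo.Documents ADD RowId INT IDENTITY(1,1) NOT NULL;
CREATE UNIQUE CLUSTERED INDEX CX_Documents_RowId ON dbo.Documents (RowId);
-- re-create the GUID primary key as a nonclustered index
ALTER TABLE dbo.Documents ADD CONSTRAINT PK_Documents PRIMARY KEY NONCLUSTERED (Id);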

Why do SQL id sequences go out of sync (specifically using Postgres)?

I've seen solutions for updating a sequence when it goes out of sync with the primary key it's generating, but I don't understand how this problem occurs in the first place.
Does anyone have insight into how a primary key column, with its default defined as the nextval of a sequence and never set explicitly anywhere, can go out of sync with that sequence? I'm using Postgres, and we see this happen now and then. It eventually results in a duplicate key violation when the sequence produces an id for an existing row.
Your application is probably occasionally setting the value of the primary key for a new row explicitly. In that case PostgreSQL never calls nextval, so the sequence doesn't advance.
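You can reproduce the effect like this (a sketch; the widgets table is hypothetical):
CREATE TABLE widgets (id serial PRIMARY KEY, name text);
INSERT INTO widgets (name) VALUES ('a');         -- nextval used: id = 1, sequence at 1
INSERT INTO widgets (id, name) VALUES (2, 'b');  -- explicit id: sequence still at 1
INSERT INTO widgets (name) VALUES ('c');         -- nextval returns 2: duplicate key error
-- re-sync the sequence with the table's current maximum:
SELECT setval(pg_get_serial_sequence('widgets', 'id'), (SELECT max(id) FROM widgets));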
When a sequence number is allocated, it stays allocated, even if the transaction that requested it is rolled back. So a number can be allocated that never appears in the table. Of course, rows can also be deleted after they are created, so the maximum number found in the table need not be the maximum number ever allocated. This applies to any auto-incrementing type.
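A quick demonstration (Postgres syntax; the orders table is made up):
CREATE TABLE orders (id serial PRIMARY KEY, note text);
BEGIN;
INSERT INTO orders (note) VALUES ('doomed');  -- consumes id 1
ROLLBACK;                                     -- the row is gone, but the sequence is not rewound
INSERT INTO orders (note) VALUES ('kept');    -- gets id 2, leaving a permanent gap at 1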
Also, depending on the technology used, a single sequence can be shared by multiple tables, so a value missing from TableA might show up in TableB. That could be due to a mistake in the use of sequence names, or it might be intentional.

sql primary key auto increment

Is having a primary key that auto-increments on each new row necessary? For me this number is getting quite long, and I'm not even using it for anything.
I can imagine that with gradual user activity on my site, new rows will be added (I'm only testing at the moment with just 2 alpha test users, and already the number has auto-incremented to over 100). Eventually this number could reach silly proportions (example: 10029379000577352881086), and not only slow the site down (affecting user experience) but also inevitably push my site over its quota (exceeding its allowed size, in layman's terms).
Really, is this needed?
If you have some field/column (or combination of columns) that can be a primary key, use that; why use auto-increment? There are schools of thought that believe in using a mix of both. You could search for surrogate keys, and you may find this answer interesting: Surrogate vs. natural/business keys
As for the size quota problem: practically, I don't think the maximum auto-increment value would cause your site to go over its data limit. If the column is of int type, it takes 4 bytes regardless of the value inside. For SQL Server, the int type can hold values ranging from -2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647).
Here is the link for that
You need a way to uniquely identify each record in your table.
If you have that already -- say a user-ID or email-address -- then you don't necessarily need that auto-incrementing field.
Note: If you don't already have a unique constraint on that field, you should add one so that duplicate data cannot be entered into the table.
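For instance (generic SQL; the users table and email column are hypothetical):
ALTER TABLE users ADD CONSTRAINT uq_users_email UNIQUE (email);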
Warning: If you decide to get rid of it, be sure that no other tables are using it.
Can't you use multiple columns to get a composite key instead of that?
Just a hint.
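For example (a sketch; the table and columns are made up):
CREATE TABLE enrollments (
  student_id INT NOT NULL,
  course_id  INT NOT NULL,
  PRIMARY KEY (student_id, course_id)  -- the combination identifies each row
);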
You do need a key that identifies every row. But a key doesn't have to be a number that "auto-increments" for every row. The fact that a few people seem to think incrementing numbers are always a good idea for keys is probably a consequence either of carelessness or a lack of appreciation of database fundamentals, sound design and data integrity.
A primary key is not always necessary for a table. For your question, check my answer:
When and when not a primary key should be used

How can I get the Primary Key id of a file I just INSERTED?

Earlier today I asked this question which arose from A- My poor planning and B- My complete disregard for the practice of normalizing databases. I spent the last 8 hours reading about normalizing databases and the finer points of JOIN and worked my way through the SQLZoo.com tutorials.
I am enlightened. I understand the purpose of database normalization and how it can suit me. Except that I'm not entirely sure how to execute that vision from a procedural standpoint.
Here's my old vision: 1 table called "files" that held, let's say, a file id, a file url, and the appropriate grade levels for that file.
New vision!: 1 table for "files", 1 table for "grades", and a junction table to mediate.
But that's not my problem. This is a really basic Q that I'm sure has an obvious answer- When I create a record in "files", it gets assigned the incremented primary key automatically (file_id). However, from now on I'm going to need to write that file_id to the other tables as well. Because I don't assign that id manually, how do I know what it is?
If I upload text.doc and it gets file_id 123, how do I know it got 123 in order to write it to "grades" and the junction table? I can't do a max(file_id) because if you have concurrent users, you might nab a different id. I just don't know how to get the file_id value without having manually assigned it.
You may want to use LAST_INSERT_ID() as in the following example:
START TRANSACTION;
-- passing NULL for the AUTO_INCREMENT column lets MySQL assign the next id
INSERT INTO files (file_id, url) VALUES (NULL, 'text.doc');
-- LAST_INSERT_ID() returns the id generated by this connection's last INSERT
INSERT INTO grades (file_id, grade) VALUES (LAST_INSERT_ID(), 'some-grade');
COMMIT;
The transaction ensures that the operation remains atomic: either both inserts complete successfully or neither does. It is optional, but recommended in order to maintain the integrity of the data.
For LAST_INSERT_ID(), the most recently generated ID is maintained in the server on a per-connection basis. It is not changed by another client. It is not even changed if you update another AUTO_INCREMENT column with a nonmagic value (that is, a value that is not NULL and not 0). Using LAST_INSERT_ID() and AUTO_INCREMENT columns simultaneously from multiple clients is perfectly valid. Each client will receive the last inserted ID for the last statement that client executed.
Source and further reading:
MySQL Reference: How to Get the Unique ID for the Last Inserted Row
MySQL Reference: START TRANSACTION, COMMIT, and ROLLBACK Syntax
In PHP, to get the automatically generated ID of a MySQL record, use the insert_id property of your mysqli object.
How are you going to find the entry tomorrow, after your program has forgotten the value of last_insert_id()?
Using a surrogate key is fine, but your table still represents an entity, and you should be able to answer the question: what measurable properties define this particular entity? The set of these properties is the natural key of your table, and even if you use surrogate keys, such a natural key should always exist, and you should use it to retrieve information from the table. Use the surrogate key to enforce referential integrity, for indexing purposes, and to make joins easier on the eye. But don't let it escape from the database.