My organisation has hundreds of DB2 tables, each with a randomly generated unique integer index. The random values are generated either by COBOL CICS mainframe programs or by Java distributed applications. The usual approach is to randomly generate a positive integer value, attempt to insert the data row, and retry whenever that index value has already been persisted. I would like to improve the performance of this approach, and I'm considering trying to identify integer values that have not yet been generated and persisted to each table; that way we would never need to retry, because we would know the insert will succeed. Does DB2 have a function that can return unused index values?
The short answer is no.
The slightly longer answer is to point out that, if such a function existed, then on the first insert into one of your tables it would have to return a result set of 2,147,483,647 (positive) integers. At 4 bytes each, that is 8,589,934,588 bytes.
Given the constraints of your existing system, what you're doing is probably the best that can be done. If the performance of retrying is unacceptable, I'm afraid redesigning your key scheme is the next step.
I think that is the question to ask: is this scheme of using random numbers for unique keys actually causing a performance problem? As the tables fill up the key space you will see more and more retries, but the key space is relatively large. If you're seeing large numbers of retries, maybe your random numbers are less random than you'd like.
Just a thought, but you could use one sequence for a group of tables. The value would still be effectively random from any one table's point of view (because you wouldn't know which table the next insert goes to), but because it comes from a single ascending sequence you would almost never get a retry. That same sequence can cycle after a few hundred million inserts and start to "fill in the blanks".
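For example, a minimal DB2 sketch of that idea (the sequence name, cache size and table names here are illustrative, not something your system already has):

-- One sequence shared by a group of tables; CYCLE lets it wrap around
-- and start "filling in the blanks" once the positive range is exhausted.
CREATE SEQUENCE SHARED_KEY_SEQ
    AS INTEGER
    START WITH 1
    INCREMENT BY 1
    MINVALUE 1
    MAXVALUE 2147483647
    CYCLE
    CACHE 50;

-- Every insert, whichever table it targets, draws from the same sequence
-- instead of generating a random value and retrying.
INSERT INTO SOME_TABLE (ID, PAYLOAD)
    VALUES (NEXT VALUE FOR SHARED_KEY_SEQ, 'example row');

Bear in mind that once the sequence cycles, duplicates become possible again, so the existing retry logic would still be wanted as a fallback at that point.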
As far as other key ideas are concerned, you could also try a different key, perhaps one based on a timestamp or a ROWID. That would still be effectively random but not repetitive.
I'm using sequences in a PostgreSQL database to insert rows into tables.
When creating the sequences I have never used the CYCLE option on them. They can generate pretty big numbers (on the order of 2^63, as far as I remember), and I don't really see why I would want a sequence to go back to zero. So my question is:
When should I use CYCLE while creating a sequence?
Do you have an example where it makes sense?
It seems a sequence can use CYCLE for purposes other than primary key generation.
That is, in scenarios where the uniqueness of its values is not required; in fact quite the opposite, where the values are expected to cycle back and repeat themselves after some time.
For example:
When generating numbers that must return to the initial value and repeat themselves at some point, for any reason (e.g. implementing a "Bingo" game).
When the sequence is a temporary identifier that will last for a short period of time and will be unique during its life.
When the field is small -- or can accept a limited number of values -- and it doesn't matter if they repeat themselves.
When there is another field in the entity that will identify it, and the sequence value is used for something else.
When an entity has a composite unique key and the sequence value is only a part of it.
When using the sequence value to generate a uniform distribution of values over a big set, though this is hardly a random assignment of values.
Any other cyclic number generation.
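For instance, a minimal PostgreSQL sketch of the "Bingo"-style case (the name and bounds are illustrative):

-- A cycling sequence that walks through 1..75 and then starts over.
CREATE SEQUENCE bingo_number
    MINVALUE 1
    MAXVALUE 75
    CYCLE;

-- After returning 75, the next call wraps back to 1.
SELECT nextval('bingo_number');

nextval() does not shuffle anything, of course; the point is only that the values are allowed (and expected) to repeat rather than being unique forever.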
The requirement is to return a unique value for each processed row from a stored procedure, which will be used as a dummy primary key. One solution seems to be using the ROW_NUMBER() function. Another one is given here. Perhaps there are solutions involving a GUID. Can someone recommend a solution which is performant and reliable?
A random number would not be an option (as you specify unique).
ROW_NUMBER is a bigint taking up 8 bytes of storage per row.
uniqueidentifier is a 16-byte structure and more costly to obtain than a ROW_NUMBER, which is simply incremented (SQL Server's unique identifiers are GUIDs, with NEWID() being slower than NEWSEQUENTIALID(), because NEWSEQUENTIALID() increments from a seed GUID).
In scenarios where you INSERT into a table variable or a temporary table, you can use IDENTITY. Storage size is that of the column data type, which can be any integer type (no bit, no decimal). It will increment from a configurable offset (default 1), in configurable step (default 1).
This seems to make ROW_NUMBER your best fit; it is both fast and reliable.
I would recommend basing your design choice on more than just performance, though. On any reasonably configured SQL Server installation you will barely notice a difference in speed, and unless you have very constrained resources, storage should not be the bottleneck either. Some rather old benchmarks here.
Be aware that neither will help you to maintain or guarantee a particular order of the returned rows - you still need an ORDER BY on your outermost SELECT if you need predictable results.
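As a minimal T-SQL sketch of the ROW_NUMBER() approach (table and column names are illustrative, not from the question):

-- Attach a dummy key to each returned row; ordering by (SELECT NULL)
-- says we don't care which row gets which number, only that each is unique.
SELECT
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS dummy_key,
    t.SomeColumn,
    t.OtherColumn
FROM dbo.SomeTable AS t;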
I have N client machines. I want to load each machine with a distinct partition of a BRIN index.
That requires:
creating a BRIN index with a predefined number of partitions, equal to the number of client machines
sending queries from the clients that filter on a BRIN partition identifier instead of on the indexed column
The main goal is a performance improvement when loading a single table from Postgres into distributed client machines, keeping the number of rows equal between the clients - or close to equal if the row count does not divide evenly by the machine count.
I can currently achieve this by maintaining a new column which chunks the table into a number of buckets equal to the number of client machines (or by computing row_number() over (order by datetime) % N on the fly). However, this is not efficient in time or memory, and the BRIN index looks like a nice feature that could speed up such use cases.
Minimal reproducible example for 3 client machines:
CREATE TABLE bigtable (datetime TIMESTAMPTZ, value TEXT);
INSERT INTO bigtable VALUES ('2015-12-01 00:00:00+00'::TIMESTAMPTZ, 'txt1');
INSERT INTO bigtable VALUES ('2015-12-01 05:00:00+00'::TIMESTAMPTZ, 'txt2');
INSERT INTO bigtable VALUES ('2015-12-02 02:00:00+00'::TIMESTAMPTZ, 'txt3');
INSERT INTO bigtable VALUES ('2015-12-02 03:00:00+00'::TIMESTAMPTZ, 'txt4');
INSERT INTO bigtable VALUES ('2015-12-02 05:00:00+00'::TIMESTAMPTZ, 'txt5');
INSERT INTO bigtable VALUES ('2015-12-02 16:00:00+00'::TIMESTAMPTZ, 'txt6');
INSERT INTO bigtable VALUES ('2015-12-02 23:00:00+00'::TIMESTAMPTZ, 'txt7');
Expected output:
client 1
2015-12-01 00:00:00+00, 'txt1'
2015-12-01 05:00:00+00, 'txt2'
2015-12-02 02:00:00+00, 'txt3'
client 2
2015-12-02 03:00:00+00, 'txt4'
2015-12-02 05:00:00+00, 'txt5'
client 3
2015-12-02 16:00:00+00, 'txt6'
2015-12-02 23:00:00+00, 'txt7'
The question:
How can I create a BRIN index with a predefined number of partitions and run queries that filter on partition identifiers instead of filtering on the indexed column?
Optionally, is there any other way that BRIN (or other Postgres goodies) can speed up the task of loading multiple client machines in parallel from a single table?
It sounds like you want to shard a table over many machines, and have each local table (one shard of the global table) have a BRIN index with exactly one bucket. But that does not make any sense. If the single BRIN index range covers the entire (local) table, then it can never be very helpful.
It sounds like what you are looking for is partitioning with CHECK constraints that can be used for partition exclusion. PostgreSQL has supported that for a long time with table inheritance (although not with each partition on a separate machine). Using this method, the range covered by the CHECK constraint has to be set explicitly for each partition. This ability to explicitly specify the bounds sounds like exactly what you are looking for, just using a different technology.
But the partition-exclusion constraint code doesn't work well with modulus. The code is smart enough to know that WHERE id=5 only needs to check the CHECK (id BETWEEN 1 AND 10) partition, because it knows that id=5 implies that id is between 1 and 10. More accurately, it knows the contrapositive of that.
But the code was never written to know that WHERE id=5 implies that id%10 = 5%10, even though humans know that. So if you build your partitions on modulus operators, like CHECK (id%10=5), rather than on ranges, you would have to sprinkle all your queries with WHERE id = $1 AND id % 10 = $1 % 10 if you wanted them to take advantage of the constraints.
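Coming back to the range-based approach, a minimal sketch against the question's table (the partition bounds and child table names are illustrative, and rows still have to be routed into the children by a trigger or by inserting into them directly):

-- The parent stays empty; each child carries an explicit range CHECK constraint.
CREATE TABLE bigtable_p1
    (CHECK (datetime >= '2015-12-01' AND datetime < '2015-12-02'))
    INHERITS (bigtable);
CREATE TABLE bigtable_p2
    (CHECK (datetime >= '2015-12-02' AND datetime < '2015-12-03'))
    INHERITS (bigtable);

-- With constraint exclusion enabled, a range filter on datetime only
-- scans the children whose CHECK constraints can possibly match.
SET constraint_exclusion = partition;
SELECT * FROM bigtable
WHERE datetime >= '2015-12-01' AND datetime < '2015-12-02';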
Going by your description and comments, I'd say you're looking in the wrong direction. You want to split the table upfront so access will be fast and simple, but without actually having to split it upfront, because that would require knowing the number of nodes in advance, which is variable if I understand correctly. And regardless, it takes quite a bit of processing to split things, too.
To be honest, I'd go about your problem differently. Instead of assigning every record to a bucket, I'd suggest assigning every record a pseudo-random value in a given range. I don't know about Postgres, but in MSSQL I'd use BINARY_CHECKSUM(NEWID()) rather than RAND(), the main reason being that the random function is harder to use set-based there. Alternatively, you could use some hashing code that yields a reasonable working space. Anyway, in my MSSQL situation the resulting value would be a signed integer somewhere in the range -2^31 to +2^31 (give or take; check the documentation for the exact boundaries!). When the master machine decides to assign n client machines, each machine can then be given an exact sub-range that - given the properties of the randomizer/hashing algorithm - will cover a reasonably close approximation of the workload divided by n.
Assuming you have an index on the selection field, this should be reasonably fast, regardless of whether you decide to split the table into a thousand or a million chunks.
PS: mind that this approach will only work 'properly' if the number of rows to process (greatly) outnumbers the number of machines doing the processing. With small numbers you might see several machines getting nothing while others do all the work.
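A rough Postgres translation of the bucket-key idea, since the question's example is Postgres (the column name and range boundaries are illustrative; the MSSQL version would use BINARY_CHECKSUM(NEWID()) as described above):

-- Give every row a pseudo-random key once, and index it.
ALTER TABLE bigtable ADD COLUMN bucket_key integer;
UPDATE bigtable SET bucket_key = floor(random() * 2147483647)::integer;
CREATE INDEX ON bigtable (bucket_key);

-- Each of the n clients then pulls its own slice of the key range,
-- e.g. client 1 of 3 takes roughly the first third:
SELECT datetime, value FROM bigtable
WHERE bucket_key >= 0 AND bucket_key < 715827882;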
Basically, all you need to know is the size of the relation after loading, and then the pages_per_range storage parameter should be set to the divisor that gives you the desired number of partitions.
No need to introduce an artificial partition ID, because there is support for sufficient types and operators. Physical table layout is important here, so if you insist on the partition ID being the key, and end up introducing an out-of-order mapping between the natural load order and the artificial partition ID, make sure you cluster the table on that column's sort order before creating BRIN.
However, at the same time, remember that more discrete values have a better chance of hitting the index than fewer, so higher cardinality is better - an artificial partition identifier will have 1/n the cardinality of the natural key, where n is the number of distinct values per partition.
More here and here.
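A minimal sketch of the pages_per_range idea on the question's table (the value 32 is illustrative; in practice you would derive it from the size of the loaded table):

-- A BRIN index whose range size is chosen so that the number of ranges
-- roughly matches the number of client machines.
CREATE INDEX bigtable_datetime_brin
    ON bigtable USING brin (datetime)
    WITH (pages_per_range = 32);

-- Clients keep filtering on the natural column; block ranges whose
-- min/max summaries cannot match are skipped.
SELECT datetime, value FROM bigtable
WHERE datetime >= '2015-12-01' AND datetime < '2015-12-02';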
I have a problem on PostgreSQL where either there is a bug in PostgreSQL or I have implemented something wrongly.
There is a table including colmn1 (primary key), colmn2 (unique), colmn3, ...
After inserting a row, if I try another insertion with an existing colmn2 value I get a duplicate value error, as I expected. But after this unsuccessful try, colmn1's next value is incremented by 1 even though there was no insertion, so I am getting rows with id sequences like 1, 2, 4, 6, 9 (3, 5, 6, 7, 8 go to the unsuccessful trials).
I need help from anyone who can explain this weird behaviour.
This information may be useful: I used the query "create unique index on tableName (lower(column1))" to set the unique constraint.
See the PostgreSQL sequence FAQ:
Sequences are intended for generating unique identifiers — not
necessarily identifiers that are strictly sequential. If two
concurrent database clients both attempt to get a value from a
sequence (using nextval()), each client will get a different sequence
value. If one of those clients subsequently aborts their transaction,
the sequence value that was generated for that client will be unused,
creating a gap in the sequence.
This can't easily be fixed without incurring a significant performance
penalty. For more information, see Elein Mustain's "Gapless Sequences for Primary Keys" in the General Bits Newsletter.
From the manual:
Important: Because sequences are non-transactional, changes made by
setval are not undone if the transaction rolls back.
In other words, it's normal to have gaps. If you don't want gaps, don't use a sequence.
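A minimal sketch that reproduces the behaviour (the table mirrors the question's layout with illustrative names):

-- A serial primary key is backed by a sequence behind the scenes.
CREATE TABLE demo (
    colmn1 serial PRIMARY KEY,
    colmn2 text UNIQUE
);

INSERT INTO demo (colmn2) VALUES ('a');  -- gets colmn1 = 1
INSERT INTO demo (colmn2) VALUES ('a');  -- fails with a duplicate key error,
                                         -- but nextval() was already consumed
INSERT INTO demo (colmn2) VALUES ('b');  -- gets colmn1 = 3, leaving a gap at 2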
SQL Server 2012 introduced Sequence as a new feature, the same as in Oracle and Postgres. Where are sequences preferred over identities? And why do we need sequences?
I think you will find your answer here
Using the identity attribute for a column, you can easily generate
auto-incrementing numbers (which are often used as a primary key). With
Sequence, it will be a different object which you can attach to a
table column while inserting. Unlike identity, the next number for the
column value will be retrieved from memory rather than from the disk –
this makes Sequence significantly faster than Identity. We will see
this in coming examples.
And here:
Sequences: Sequences have been requested by the SQL Server community
for years, and they are included in this release. Sequence is a user
defined object that generates a sequence of numbers. Here is an
example using Sequence.
and here as well:
A SQL Server sequence object generates a sequence of numbers, just like
an identity column in SQL tables. But the advantage of sequence
numbers is that the sequence object is not limited to a single SQL
table.
and on msdn you can also read more about usage and why we need it (here):
A sequence is a user-defined schema-bound object that generates a
sequence of numeric values according to the specification with which
the sequence was created. The sequence of numeric values is generated
in an ascending or descending order at a defined interval and may
cycle (repeat) as requested. Sequences, unlike identity columns, are
not associated with tables. An application refers to a sequence object
to receive its next value. The relationship between sequences and
tables is controlled by the application. User applications can
reference a sequence object and coordinate the values across
multiple rows and tables.
A sequence is created independently of the tables by using the CREATE
SEQUENCE statement. Options enable you to control the increment,
maximum and minimum values, starting point, automatic restarting
capability, and caching to improve performance. For information about
the options, see CREATE SEQUENCE.
Unlike identity column values, which are generated when rows are
inserted, an application can obtain the next sequence number before
inserting the row by calling the NEXT VALUE FOR function. The sequence
number is allocated when NEXT VALUE FOR is called even if the number
is never inserted into a table. The NEXT VALUE FOR function can be
used as the default value for a column in a table definition. Use
sp_sequence_get_range to get a range of multiple sequence numbers at
once.
A sequence can be defined as any integer data type. If the data type
is not specified, a sequence defaults to bigint.
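To make the quoted description concrete, here is a minimal T-SQL sketch (the object names and numbers are illustrative):

-- A standalone sequence, created independently of any table.
CREATE SEQUENCE dbo.OrderNumbers
    AS int
    START WITH 1000
    INCREMENT BY 1
    CACHE 50;

-- Obtain a value before inserting, ...
SELECT NEXT VALUE FOR dbo.OrderNumbers;

-- ... or let the sequence feed a column default.
CREATE TABLE dbo.Orders
(
    OrderID int NOT NULL
        CONSTRAINT DF_Orders_OrderID DEFAULT (NEXT VALUE FOR dbo.OrderNumbers),
    CustomerName nvarchar(100) NOT NULL
);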
Sequence and identity are both used to generate auto numbers, but the major difference is that Identity is table-dependent while Sequence is independent of any table.
If you have a scenario where you need to maintain an auto number globally (across multiple tables), need to restart your interval after a particular number, and also need caching for performance, that is where we need a sequence and not an identity.
Although sequences provide more flexibility than identity columns, I didn't find they had any performance benefits.
I found performance using identity was consistently 3x faster than using sequence for batch inserts.
I inserted approx 1.5M rows and performance was:
14 seconds for identity
45 seconds for sequence
I inserted the rows into a table which used sequence object via a table default:
ALTER TABLE <table> ADD DEFAULT (NEXT VALUE FOR <seq>) FOR <col_name>
and also tried specifying sequence value in select statement:
SELECT NEXT VALUE FOR <seq>, <other columns> FROM <table>
Both were the same factor slower than the identity method. I used the default cache option for the sequence.
The article referenced in Arion's first link shows performance for row-by-row inserts; the difference between identity and sequence was 16.6 seconds versus 14.3 seconds for 10,000 inserts.
The caching option has a big impact on performance, but identity is faster for higher volumes (1M+ rows).
See this link for an indepth analysis as per utly4life's comment.
I know this is a little old, but I wanted to add an observation that bit me.
I switched from identity to sequence to have my indexes in order. I later found out that sequences don't transfer with replication. I started getting key violations after I set up replication between two databases, since the sequences were not in sync. Just something to watch out for before you make a decision.
I find the best use of sequences is not to replace an identity column but to create an "Order Number" type of field.
In other words, an Order Number is exposed to the end user and may have business rules along with it. You want it to be unique, but just using an identity column isn't really correct either.
For example, different order types might require different sequences, so you might have a sequence for Internet orders as opposed to in-house orders.
Put differently, don't think of a sequence as simply a replacement for identity; think of it as being useful in cases where an identity does not fit the business requirements.
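A minimal sketch of that pattern (the sequence names and starting values are illustrative):

-- Separate sequences per order type, each with its own numbering rules.
CREATE SEQUENCE dbo.InternetOrderNumbers AS int START WITH 100000 INCREMENT BY 1;
CREATE SEQUENCE dbo.InHouseOrderNumbers  AS int START WITH 500000 INCREMENT BY 1;

-- The application picks the sequence that matches the order type.
SELECT NEXT VALUE FOR dbo.InternetOrderNumbers AS InternetOrderNo;
SELECT NEXT VALUE FOR dbo.InHouseOrderNumbers  AS InHouseOrderNo;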
I was recently bitten by something to consider for identity vs. sequence. It seems Microsoft now suggests a sequence if you want to keep the numbering without gaps. We had an issue where there were huge gaps in the identity values, and the statement highlighted below would explain it: SQL Server cached the identity values, and after a reboot we lost those numbers.
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-table-transact-sql-identity-property?view=sql-server-2017
Consecutive values after server restart or other failures – SQL Server might cache identity values for performance reasons and some of the assigned values can be lost during a database failure or server restart. This can result in gaps in the identity value upon insert. If gaps are not acceptable then the application should use its own mechanism to generate key values. Using a sequence generator with the NOCACHE option can limit the gaps to transactions that are never committed.