HiLo: how to control Low values - NHibernate

I'm using the HiLo generator in my S#arpArchitecture/NHibernate project and I'm performing a large import batch.
I've read somewhere that it's possible to predict the Low values of any new records, because they are generated on the client. I figure this means I can control the Low values myself, or at least fetch the next Low value from somewhere.
The reason I want this is that I want to set relations to other entities I'm about to insert. They do not exist yet, but they will be inserted before the batch transaction completes.
However, I cannot find information about how to set the Low values or how to find out which Low value is up next.
Any ideas?

You don't need to predict anything to set your relationships. They are set based on the domain model, not the IDs.
The benefit of using HiLo is that the IDs are generated client-side (transparently to you), so the Unit of Work is preserved (no DB writes are done until flush/commit), unlike identity, where inserts are immediate.
Recommended read: http://fabiomaulo.blogspot.com/2009/02/nh210-generators-behavior-explained.html
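For intuition about why the IDs are knowable client-side at all, here is a rough sketch of the bookkeeping a hi/lo generator keeps; the table name, column name and formula are illustrative assumptions, not necessarily NHibernate's defaults (those depend on your mapping):
-- Illustrative hi/lo store (names are assumptions; check your mapping).
CREATE TABLE hilo_store (next_hi INT NOT NULL);

-- The generator checks out a "hi" block in its own short round trip:
DECLARE @hi INT;
UPDATE hilo_store
SET @hi = next_hi,            -- @hi receives the pre-update value
    next_hi = next_hi + 1;

-- It then hands out IDs locally, roughly id = @hi * (max_lo + 1) + lo
-- for lo = 0 .. max_lo, without touching the database again until the
-- block is exhausted. That is why inserts can stay buffered until flush/commit.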

Related

How to implement check and set or Read - Update ACID transaction in Aerospike

I have a use case where I have data in Aerospike, and that data now requires frequent updates, but under a transaction, following ACID. The documentation doesn't clearly show how to achieve it: https://www.aerospike.com/docs/client/go/usage/kvs/write.html#read-modify-write
It should simply require you to set 'EXPECT_GEN_EQUAL' in the write policy, so that the generation of the record is checked prior to applying the write transaction. If the generation doesn't match, you will get an error back and the server will tick the fail_generation stat. The generation is a simple internal counter kept as per-record metadata, incremented every time the record is updated.
You would, of course, need to first read the record in order to get its current generation.
You aren't describing what your operations are. Are you aware that in Aerospike you can do multiple operations on the same record in a single transaction, all under the same record lock? This is the operate() method in all the language clients. For the Go client, it's documented at https://godoc.org/github.com/aerospike/aerospike-client-go#Client.Operate

Best implementation of a "counter" table in SQL Server

I'm working with a large SQL Server database that is built around the idea of a counter table for primary key values. Each table has a row in this counter table with the PK name and the next value to be used as a primary key (for that table). Our current method of getting a counter value is something like this:
BEGIN TRAN
UPDATE ... SET CounterValue = CounterValue + 1
SELECT CounterValue ...
COMMIT TRAN
That works mostly well since the process of starting a transaction, then updating the row, locks the row/page/table (the level of locking isn't too important for this topic) until the transaction is committed.
The problem here is that if a transaction is held open for a long period of time, access to that table/page/row is locked for too long. We have situations where hundreds of inserts may occur in a single transaction (which needs access to this counter table).
One attempt to address this problem would be to always use a separate connection from your application that would never hold a transaction open. Access to the table and hence the transaction would be quick, so access to the table is generally available. The problem here is that the use of triggers that may also need access to these counter values makes that a fairly unreasonable rule to have. In other words, we have triggers that also need counter values and those triggers sometimes run in the context of a larger parent transaction.
Another attempt to solve the problem is using a SQL Server app lock to serialize access to the table/row. That's Ok most of the time too, but has downsides. One of the biggest downsides here also involves triggers. Since triggers run in the context of the triggering query, the app lock would be locked until any parent transactions are completed.
So what I'm trying to figure out is a way to serialize access to a row/table that could be run from an application or from an SP/trigger, and that would never run in the context of a parent transaction. If a parent transaction rolls back, I don't need the counter value to roll back. Having always-available, fast access to a counter value is much more important than losing a few counter values should a parent transaction be rolled back.
I should point out that I completely realize that using GUID values or an identity column would solve a lot of my problems, but as I mentioned, we're talking about a massive system, with massive amounts of data that can't be changed in a reasonable time frame without a lot of pain for our clients (we're talking hundreds of tables with hundreds of millions of rows).
Any thoughts about the best way to implement such a counter table would be appreciated. Remember - access should be always available from many apps, services, triggers and other SPs, with very little blocking.
EDIT - we can assume SQL Server 2005+
The way the system currently works is unscalable. You have noticed that yourself. Here are some solutions, in rough order of preference:
Use an IDENTITY column (You can set the IDENTITY property without rebuilding the table. Search the web to see how.)
Use a sequence (see the sketch after this list)
Use Hi-Lo ID generation (What's the Hi/Lo algorithm?). In short, consumers of IDs (application instances) check out big ranges of IDs (like 100) in a separate transaction. The overhead of that scheme is very low.
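For the sequence option, a minimal sketch, assuming you can move to SQL Server 2012 or later (sequences don't exist in 2005/2008); the object names are made up. Note that NEXT VALUE FOR does not roll back with a calling transaction, which fits your requirement that counter values need not roll back:
-- Hypothetical sequence (SQL Server 2012+ only).
CREATE SEQUENCE dbo.OrderIdSeq AS bigint START WITH 30000 INCREMENT BY 1;

-- Any app, trigger or stored procedure can grab an ID without touching a counter table:
SELECT NEXT VALUE FOR dbo.OrderIdSeq;

-- Or bind it as a column default:
-- ALTER TABLE dbo.Orders
--     ADD CONSTRAINT DF_Orders_Id DEFAULT (NEXT VALUE FOR dbo.OrderIdSeq) FOR OrderId;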
Working with the constraints from your comment below: You can achieve scalable counter generation even with a single transaction and no application-level changes. This is kind of a last resort measure.
Stripe the counter. For each table, you have 100 counters. The counter N tracks IDs that conform to ID % 100 = N. So each counter tracks 1/100th of all IDs.
When you want to take an ID, you take it from a randomly chosen counter. The chance is good that this counter is not in use by a concurrent transaction. You will have little blocking due to row-level locking in SQL Server.
You initialize counter N to N and increment it by 100. This ensures that all counters generate distinct ID ranges.
Counter 0 generates 0, 100, 200, .... Counter 1 generates 1, 101, 201, .... And so on.
A disadvantage of this is that your IDs now are not sequential. In my opinion, an application should not rely on this anyway because it is not a reliable property.
You can abstract all of this into a single procedure call; the code complexity will actually not be much bigger. You basically just generate an additional random number and change the increment logic.
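A minimal T-SQL sketch of the striping idea, with made-up table and procedure names:
-- Hypothetical striped counter table: 100 rows per logical counter.
CREATE TABLE StripedCounter (
    TableName sysname NOT NULL,
    Stripe    int     NOT NULL,  -- 0..99
    NextValue bigint  NOT NULL,  -- initialized to Stripe, stepped by 100
    CONSTRAINT PK_StripedCounter PRIMARY KEY (TableName, Stripe)
);

CREATE PROCEDURE GetNextId @TableName sysname, @NextId bigint OUTPUT
AS
BEGIN
    -- Pick a random stripe; contention on any single row becomes ~1/100th.
    DECLARE @Stripe int;
    SET @Stripe = ABS(CHECKSUM(NEWID())) % 100;

    -- Grab and advance the counter in one statement; only this row is locked.
    UPDATE StripedCounter
    SET @NextId = NextValue,       -- @NextId receives the pre-update value
        NextValue = NextValue + 100
    WHERE TableName = @TableName AND Stripe = @Stripe;
END
Because each caller locks only one randomly chosen row out of 100, even triggers running inside long parent transactions rarely collide on the same row.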
One way is to get and increment the counter value in one statement:
DECLARE @NextKey int

UPDATE Counter
SET @NextKey = NextKey = NextKey + 1
-- (add a WHERE clause to pick the counter row for the table in question)

How to insert/update multiple records in single call to create/update_attributes in Rhomobile

As per the performance tip in the Rhom API of Rhomobile,
we should prepare the whole data set first and then call create/update_attributes once, for better performance than preparing a single record and calling create inside a loop.
As far as I know, the create method takes the attributes of a single record, like this:
@account = Account.create(
{"name" => "some new record", "industry" => "electronics"}
)
So I wonder: how do I create/update multiple records in a single call?
Thanks in advance.
First, I have no idea how much this will actually affect performance, whether positively or negatively, and have never measured it.
That said, you can wrap all the CRUD calls in a transaction, to minimise the DB connections opened and closed. This can also help you with maintaining referential integrity, by rolling back changes if some record is causing a problem with your new dataset.
# Load all DB Models, to ensure they are available before first time import
Rho::RHO.load_all_sources();
# Get instance of DB to work transactions with
db = ::Rho::RHO.get_db_partitions()['local'] # Get reference to model db
db.start_transaction() # BEGIN transaction
# ... do all your create/update/deletes here
if (was_import_successful)
db.commit # COMMIT transaction
else
db.rollback() # ROLLBACK transaction
end
Using Rhom, you can still write SQL queries for the underlying SQLite engine. But you need to understand which table format you're using.
The default PropertyBag data models are all stored as a key-value store in a single table. If you're looking for maximum performance, you'd better switch to FixedSchema data models. In that case you lose some flexibility, but you gain some performance and save some space.
My suggestion is to use transactions, as you're already doing, switch to FixedSchema data models, and see if that is enough. If you really need to increase the speed, maybe you can achieve what you want in a different way, for example by importing a SQLite database created on the server side.
This is the method that RhoConnect uses for the bulk synchronization.

Stream with a lot of UPDATEs and PostgreSQL

I'm quite a newbie with PostgreSQL optimization and with deciding which jobs are appropriate for it and which are not. So, I want to know whether I'm trying to use PostgreSQL for an inappropriate job, or whether it is suitable for it and I just need to set everything up properly.
Anyway, I have a need for a database with a lot of data that changes frequently.
For example, imagine an ISP, having a lot of clients, each having a session (PPP/VPN/whatever), with two self-describing frequently updated properties bytes_received and bytes_sent. There is a table with them, where each session is represented by a row with unique ID:
CREATE TABLE sessions(
id BIGSERIAL NOT NULL,
username CHARACTER VARYING(32) NOT NULL,
some_connection_data BYTEA NOT NULL,
bytes_received BIGINT NOT NULL,
bytes_sent BIGINT NOT NULL,
CONSTRAINT sessions_pkey PRIMARY KEY (id)
)
And as accounting data flows, this table receives a lot of UPDATEs like those:
-- There are *lots* of such queries!
UPDATE sessions SET bytes_received = bytes_received + 53554,
bytes_sent = bytes_sent + 30676
WHERE id = 42
When we receive a never-ending stream with quite a lot (like 1-2 per second) of updates for a table with a lot (like several thousand) of sessions, probably thanks to MVCC, this makes PostgreSQL very busy. Are there any ways to speed everything up, or is Postgres just not suitable for this task, so I'd be better off putting those counters into another storage like memcachedb and using Postgres only for fairly static data? But then I'd lose the ability to occasionally query this data, for example to find the TOP10 downloaders, which is not really good.
Unfortunately, the amount of data cannot be lowered much. The ISP accounting example is all made up to simplify the explanation. The real problem is with another system whose structure is somewhat harder to explain.
Thanks for suggestions!
The database really isn't the best tool for collecting lots of small updates, but as I don't know your queryability and ACID requirements I can't really recommend something else. If it's an acceptable approach, the application-side update aggregation suggested by zzzeek can lower the update load significantly.
There is a similar approach that can give you durability and the ability to query the fresher data, at some performance cost. Create a buffer table that collects the changes to the values that need to be updated, and insert the changes there. At regular intervals, in a transaction, rename the table to something else and create a new table in its place. Then, in a transaction, aggregate all the changes, do the corresponding updates to the main table, and truncate the buffer table. This way, if you need a consistent and fresh snapshot of any data, you can select from the main table and join in all the changes from the active and renamed buffer tables.
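A rough PostgreSQL sketch of that buffer-table rotation, with made-up names; the swap-and-fold step would run from a scheduled job:
-- Buffer table: the hot path only appends deltas, it never updates.
CREATE TABLE session_deltas (
    session_id     bigint NOT NULL,
    bytes_received bigint NOT NULL,
    bytes_sent     bigint NOT NULL
);

-- Hot path: cheap inserts instead of contended updates.
INSERT INTO session_deltas VALUES (42, 53554, 30676);

-- Periodic job: swap the buffer out, fold it into the main table, drop it.
BEGIN;
ALTER TABLE session_deltas RENAME TO session_deltas_flush;
CREATE TABLE session_deltas (LIKE session_deltas_flush);
COMMIT;

BEGIN;
UPDATE sessions s
SET bytes_received = s.bytes_received + d.bytes_received,
    bytes_sent     = s.bytes_sent     + d.bytes_sent
FROM (SELECT session_id,
             sum(bytes_received) AS bytes_received,
             sum(bytes_sent)     AS bytes_sent
      FROM session_deltas_flush
      GROUP BY session_id) d
WHERE s.id = d.session_id;
DROP TABLE session_deltas_flush;
COMMIT;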
However if neither is acceptable you can also tune the database to deal better with heavy update loads.
To optimize the updates, make sure that PostgreSQL can use heap-only tuples (HOT) to store the updated versions of the rows. To do this, make sure that there are no indexes on the frequently updated columns, and change the fillfactor to something lower than the default 100%. You'll need to figure out a suitable fillfactor on your own, as it depends heavily on the details of the workload and the machine it is running on. The fillfactor needs to be low enough that almost all of the updates fit on the same database page before autovacuum has the chance to clean up the old non-visible versions. You can tune autovacuum settings to trade off between the density of the database and vacuum overhead. Also, take into account that any long transactions, including statistical queries, will hold onto tuples that have changed after the transaction started. See the pg_stat_user_tables view to see what to tune, especially the relationship of n_tup_hot_upd to n_tup_upd and n_live_tup to n_dead_tup.
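For example (the 70% figure below is only a starting guess to be validated against your workload):
-- Leave free space in each page so updated row versions can stay on the same
-- page and take the HOT path.
ALTER TABLE sessions SET (fillfactor = 70);
-- Fillfactor only affects newly written pages; rewrite the table to apply it
-- (takes an exclusive lock, so schedule it).
VACUUM FULL sessions;

-- Check whether updates are actually HOT and how much dead space accumulates.
SELECT n_tup_upd, n_tup_hot_upd, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'sessions';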
Heavy updating will also create a heavy write ahead log (WAL) load. Tuning the WAL behavior (docs for the settings) will help lower that. In particular, a higher checkpoint_segments number and higher checkpoint_timeout can lower your IO load significantly by allowing more updates to happen in memory. See the relationship of checkpoints_timed vs. checkpoints_req in pg_stat_bgwriter to see how many checkpoints happen because either limit is reached. Raising your shared_buffers so that the working set fits in memory will also help. Check buffers_checkpoint vs. buffers_clean + buffers_backend to see how many were written to satisfy checkpoint requirements vs. just running out of memory.
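For instance, to see how checkpoints are being triggered and who is writing buffers (note that checkpoint_segments was replaced by max_wal_size in PostgreSQL 9.5, so adjust the setting names to your version; the values below are only illustrative):
SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, buffers_backend
FROM pg_stat_bgwriter;

-- postgresql.conf (illustrative values, not recommendations):
-- checkpoint_segments = 32      -- pre-9.5; use max_wal_size on newer versions
-- checkpoint_timeout  = 15min
-- shared_buffers      = 2GB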
You want to assemble statistical updates as they happen into an in-memory queue of some kind, or alternatively onto a message bus if you're more ambitious. A receiving process then aggregates these statistical updates on a periodic basis - which can be anywhere from every 5 seconds to every hour, depending on what you want. The counts of bytes_received and bytes_sent are then updated, with counts that may represent many individual "update" messages summed together. Additionally, you should batch the update statements for multiple ids into a single transaction, ensuring that the update statements are issued in the same relative order with regard to primary key, to prevent deadlocks against other transactions that might be doing the same thing.
In this way you "batch" activities into bigger chunks to control how much load is on the PG database, and also serialize many concurrent activities into a single stream (or multiple, depending on how many threads/processes are issuing updates). The tradeoff which you tune based on the "period" is, how much freshness vs. how much update load.

SQL Identity Column out of step

We have a set of databases that have a table defined with an identity column as the primary key. As a subset of these are replicated to other servers, a seeding system was created so that they could never clash. That system works by giving each database a different starting seed and an increment of 50.
In this way, the table on database 1 would generate 30001, 30051, etc., while database 2 would generate 30002, 30052, and so on.
I am looking at adding another database into this system (it is split for scaling/loading purposes) and have discovered that the identities have got out of sync on one or two of the databases - i.e. database 3, which should have numbers ending in 3, doesn't anymore. The seed and increment are still correct according to the table design.
I am obviously going to have to work around this problem somehow (probably by setting a high initial value), but can anyone tell me what would cause them to get out of sync like this? From a query on the DB I can see the sequence went as follows: 32403, 32453, 32456, 32474, 32524, 32574, and it has continued in increments of 50 ever since it went wrong.
As far as I am aware no bulk-inserts or DTS or anything like that has put new data into these tables.
Second (bonus) question - how do I reset the identity so that it goes back to what I want it to actually be?
EDIT:
I know the design is in principle a bit ropey - I didn't ask for criticism of it, I just wondered how it could have got out of sync. I inherited this system and changing the column to a GUID - whilst undoubtedly the best theoretical solution - is probably not going to happen. The system evolved from a single DB to multiple DBs when the load got too large (a few hundred GBs currently). Each ID in this table will be referenced in many other places - sometimes a few hundred thousand times each (multiplied by about 40,000 for each item). Updating all those will not be happening ;-)
Replication = GUID column.
To set the value of the next ID to be 1000:
DBCC CHECKIDENT (orders, RESEED, 999)
If you want to actually use primary keys for some meaningful purpose other than uniquely identifying a row in a table, then it shouldn't be an identity column, and you need to assign the values some other, explicit way.
If you want to merge rows from multiple tables, then you are violating the intent of identity, which is for one table. (A GUID column will use values that are unique enough to solve this problem, but you still can't impute a meaningful purpose to them.)
Perhaps somebody used:
SET IDENTITY_INSERT {tablename} ON
INSERT INTO {tablename} (ID, ...)
VALUES(32456, ....)
SET IDENTITY_INSERT {tablename} OFF
Or perhaps they used DBCC CHECKIDENT to change the identity. In any case, you can use the same to set it back.
It's too risky to rely on this kind of identity strategy, since it's (obviously) possible for it to get out of sync and wreck everything.
With replication, you really need to identify your data with GUIDs. It will probably be easier for you to migrate your data to a schema that uses GUIDs for PKs than to try and hack your way around IDENTITY issues.
To address your question directly,
Why it got out of sync may be interesting to discuss, but the only result you could draw from the answer would be how to prevent it in the future, which is a bad course of action. You will continue to have these and bigger problems unless you deal with the design, which has a fatal flaw.
How to set the existing values right is also (IMHO) an invalid question, because you need to do something other than set the values right - it won't solve your problem.
This isn't to disparage you; it's to help you in the best way I can think of. Changing the design is less work both short term and long term. Not changing the design is the pathway to FAIL.
This doesn't really answer your core question, but one possibility for addressing the design would be to switch to a hi/lo algorithm. It wouldn't require changing the column away from an int, so it shouldn't be nearly as much work as changing to a GUID.
Hi/lo is used by the NHibernate ORM, but I couldn't find much documentation on it.
Basically, the way hi/lo works is that you have one central place where you keep track of your hi value: one table in one of the databases that every instance of your insert application can see. Then you need some kind of service (object, web service, whatever) that lives somewhat longer than a single entity insert. When this service starts up, it goes to the hi table, grabs the current value, then increments the value in that table. Use a read committed lock to do this so that you won't get any concurrency issues with other instances of the service. Now you use the new service to get your next ID value. It internally starts at the number it got from the DB and, when it passes that value out, increments by 1, keeping track of the current value and the "range" it's allowed to pass out. A simplistic example would be this:
Service 1 gets 100 from the "hi_value" table in the DB and increments the DB value to 200.
Service 1 gets a request for a new ID. It passes out 100.
Another instance of the service, service 2 (another thread, another middle-tier worker machine, etc.), spins up, gets 200 from the DB, and increments the DB value to 300.
Service 2 gets a request for a new ID. It passes out 200.
Service 1 gets a request for a new ID. It passes out 101.
If any of these ever passes out more than 100 IDs before dying, it will go back to the DB, get the current value, increment it, and start over. Obviously there's some art to this - how big should your range be, etc.
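A minimal T-SQL sketch of the central "hi" table and the block check-out step described above (names are made up):
-- One row, visible to every database/instance that hands out IDs.
CREATE TABLE HiValue (NextHi bigint NOT NULL);
INSERT INTO HiValue VALUES (100);

-- Each service instance reserves a block of 100 IDs in one short transaction.
DECLARE @Hi bigint;
UPDATE HiValue
SET @Hi = NextHi,             -- @Hi receives the pre-update value
    NextHi = NextHi + 100;
-- The instance now hands out @Hi, @Hi + 1, ... @Hi + 99 from memory and only
-- comes back here when the block is exhausted.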
A very simple variation on this is a single table in one of your DBs that just contains the "nextId" value, basically manually reproducing Oracle's sequence concept.