CockroachDB oddly auto-incrementing PK id on the server but not locally (knex.js seeding a CockroachDB public cloud hosted DB) - sql

I can't for the life of me figure out why this is happening. Each time I run my table delete/create migration and then the seeding logic against the server DB, it gives me bizarre 18-digit ids instead of incrementing 1, 2, 3. Locally everything works fine; my DB is a free-tier hosted CockroachDB database. I am seeding 3 records; here are example ids that are generated: (725368665646530561, 725368665646694401, 725368665646727169).
EDIT:
Based on a comment and some extra research I found that CockroachDB, although Postgres compatible, is not truly a Postgres DB. I also didn't realize how disliked and non-performant the AUTO_INCREMENT approach is. I ended up using an extension to generate UUIDs as the PK, and then I query the newly created data and grab those ids if I need some FK relationships in another seed.
t.uuid('id').primary().notNullable().defaultTo(knex.raw('uuid_generate_v4()'));

The answer you linked to is a little old (from 2017). CockroachDB can generate auto-incrementing IDs, but there is a performance cost. (1) Having a primary index on sequential data is worse for performance, and (2) extra coordination between nodes is required to generate incrementing values.
If those performance tradeoffs are fine for you, then you can use the serial_normalization=sql_sequence_cached setting to get what you want. See https://www.cockroachlabs.com/docs/stable/serial.html
Also, v21.2 supports identity columns, which provide a different syntax for something similar: https://www.cockroachlabs.com/docs/stable/create-table.html#identity-columns
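For reference, a rough sketch of both options (table and column names here are invented for illustration; the session setting only affects SERIAL columns created while it is active):

-- Option 1: make SERIAL columns draw from a cached sequence (1, 2, 3, ...)
SET serial_normalization = sql_sequence_cached;
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    name STRING NOT NULL
);

-- Option 2 (v21.2+): an identity column
CREATE TABLE orders (
    id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    total DECIMAL NOT NULL
);

Note that with sql_sequence_cached each node caches a block of sequence values, so ids stay small and roughly increasing but can still have gaps.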

Related

Merge identical databases into one

We have 15 databases of 75 tables, each with an average of a million rows, all with the same schema but different data. We have now been given the requirement by the client to bring all 15 into one database, with each set of data filtered by the user's login.
The changes to the application have been completed to do the filtering. We are now left with the task of merging all databases into one.
The issue is conflicting PKs and FKs: since the PKs and FKs are of type int, we will have 15 PK ids of 1.
One idea is to use .NET and the DBML to insert the records as new records into the new database, letting LINQ deal with the PKs and FKs and using code to deal with duplicate data.
What other ways are there to do this?
It's never a trivial job to integrate databases when the records don't have unique primary keys in all databases. A few weeks ago I built a similar integration script for which I decided to use Entity Framework.
First the good news. With EF's DbContext API it's ridiculously easy to insert a complete object graph and make EF take care of all newly generated primary keys and foreign keys. The reason why this is so easy is that when an object's state is changed to Added, all of the objects attached to it become Added as well, and EF figures out the right order of inserts. This is truly great! It let me build the core of the copy routine in a few hours, which would have taken many days had I done it in T-SQL, for example; the latter is also much more error prone.
Of course life isn't that easy. Now the bad news:
1. This takes tons of machine resources. Of course I used a new context instance for each copy step, but still I had to execute the program on a machine with a decent processor and a fair amount of internal memory. The exact specifications don't matter; the message is: test with the largest databases and see what kind of beast you need. If the memory consumption can't be managed by any machine at your disposal, you have to split up the routine into smaller chunks, but that will take more programming.
2. The object graph that's changed to Added must be divergent. By this I mean that there should only be 1-n associations starting from the root. The reason is that EF will really mark all objects as Added. So if somewhere in the graph a few branches refer back to the same object (because there is an n-1 association), these "new" objects will be multiplied, because EF doesn't know their identity. An example of this could be Company -< Customer -< Order >- OrderType: when there are only 2 order types, inserting one root company with 10 customers with 10 orders each will create 100 order type records instead of 2.
So the hard part is to find paths through your class structure that are as divergent as possible. This won't always be possible. If so, you'll have to add the leaves of the converging paths first. In the example: first insert order types. When a new company is inserted you first load the existing order types into the context and then add the company. Now link the new orders to the existing order types. This can only be done if you can match objects by natural keys (in this example: the order type names), but usually this is possible.
3. You must take care not to insert multiple copies of master data. Suppose the order types in the previous example are the same in all databases (although their primary keys may differ!). The order types from the source database should not be reinserted into the target database. Moreover, you must fix the references in the source data to point to the correct records in the target database (again by matching on natural keys).
So although it wasn't trivial, it was doable and the job was done in a relatively short time. I'm sure that other alternatives (T-SQL, Integration Services, BIDS, if doable at all) would have taken more time or would have been more buggy. And the problem with bugs in this area is that they may become apparent much later.
I later found out that the issues I describe under 2) are related to fetching the source objects with AsNoTracking. See this interesting post: Entity Framework 6 - use my getHashCode(). I used AsNoTracking because it performs better and it reduces memory consumption.

Azure SQL Server identity key values jumping up by 1,000

I am using the SQL Server 2012 that Microsoft provides for Azure. My identity columns have a habit of sometimes jumping up by 1,000, even though they are "INT IDENTITY (1, 1) NOT NULL" in my table.
Is there anything I can do to stop this happening? And what if I remove all rows from a table? It seems that even after I delete every row, when I add a new row its ID starts off at more than 1,000.
Refer to this post and this answer. Basically, this is by design, and the argument as to why this is not too much of an issue is that Azure database limits will be exceeded before hitting the identity limit. There is also the option of using a bigint.
There is no explanation as to why the seed jumps when the database is bounced, but I am guessing it has something to do with avoiding concurrency problems that might result in the same identity being used for two records at the boundary between shutting down and restarting (for reasons I can't quite think of).
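On the second part of the question: the gap persists after a plain DELETE because deleting rows never resets the identity counter. If you just want the numbering to restart after clearing a table, a hedged sketch (MyTable is a placeholder; this does nothing to prevent the jump-by-1,000 itself):

-- TRUNCATE resets the identity seed as a side effect (not allowed if the table is referenced by foreign keys).
TRUNCATE TABLE MyTable;

-- Or, after a DELETE, reseed explicitly; the next insert then gets identity value 1.
DELETE FROM MyTable;
DBCC CHECKIDENT ('MyTable', RESEED, 0);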

SQL Server Auditing Data in the Same Table

A project I'm working on requires that a record be digitally "signed", and after that any modifications create a new "version" of the row. The "signed" record can't be modified for regulatory reasons, and new versions shouldn't be modified very often. In the past, I've done this by creating a separate logging table with the same schema as the main table, with some extra columns for tracking who modified it and when.
However, after doing some work with SharePoint where ALL data (including different versions) is put into the same table I thought of a different approach which I can't find any examples of people doing: I could put the new version of the row right in the same table and increment the version number. Then add the version number to the PK.
PROS:
Implementation is easy: just create an "instead of update" trigger which performs an insert instead of an update if the row is "signed". I could easily add an IsCurrentVersion column to be updated in the trigger (see the sketch at the end of this question).
Querying for older versions is easy: just get all the records with the ID I want and let the user choose from the list.
A trigger is nice because it guarantees that a row CAN'T be updated if signed (for regulatory and audit purposes).
Schema changes to the table don't have to be replicated to the mirror "logging" table.
CONS:
The table could get a bit larger but most of the time the record won't be changed after "signing" it. The client estimated around 100,000 rows/year max at current usage levels. SQL Server can handle hundreds of millions of rows so this doesn't seem too bad.
Indexing and performance could be an issue. SharePoint adds a tp_CalculatedVersion int to the PK where the calculated number is always 0 for the latest version. I could do the same and calculate it based off the Version number. Would that help performance?
There is an extra step in querying the data to make sure you get the latest version but that could be handled in a SP.
What other cons are there in this scenario? Am I missing anything?
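To make the single-table idea concrete, here is a rough sketch of the trigger approach described above, assuming SQL Server and an invented Document table (and assuming updates touch one row at a time):

CREATE TABLE dbo.Document (
    ID int NOT NULL,
    Version int NOT NULL,
    IsCurrentVersion bit NOT NULL DEFAULT 1,
    IsSigned bit NOT NULL DEFAULT 0,
    Body nvarchar(max) NULL,
    CONSTRAINT PK_Document PRIMARY KEY (ID, Version)
);
GO

CREATE TRIGGER dbo.trg_Document_Update ON dbo.Document
INSTEAD OF UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Unsigned rows may be updated in place.
    UPDATE d
    SET Body = i.Body
    FROM dbo.Document d
    JOIN inserted i ON i.ID = d.ID AND i.Version = d.Version
    WHERE d.IsSigned = 0;

    -- Signed rows are left untouched except for clearing the current-version flag...
    UPDATE d
    SET IsCurrentVersion = 0
    FROM dbo.Document d
    JOIN inserted i ON i.ID = d.ID AND i.Version = d.Version
    WHERE d.IsSigned = 1;

    -- ...and the attempted changes go in as a fresh, unsigned version instead.
    INSERT INTO dbo.Document (ID, Version, IsCurrentVersion, IsSigned, Body)
    SELECT i.ID,
           (SELECT MAX(d2.Version) + 1 FROM dbo.Document d2 WHERE d2.ID = i.ID),
           1, 0, i.Body
    FROM inserted i
    JOIN dbo.Document d ON d.ID = i.ID AND d.Version = i.Version
    WHERE d.IsSigned = 1;
END;

This is only meant to show the shape of the approach; a real implementation would also need to handle multi-row updates and attempts to change the flag columns themselves.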
I've seen this pattern used in an enterprise system before, and IMO it wasn't successful.
You are mixing two different concerns here, viz. the storage of live and audit data. Queries against this table will always need to keep in mind whether they are seeking live (leaf) or audit data (e.g. for reports); new team members may find this non-intuitive. You would likely need to encapsulate this complexity with views etc.
As you mentioned, performance will always be a concern. Inserting a new record will also need to update the previous record to mark it as inactive. You may also need to consider changing your clustered index to keep all versions on the same page.
Foreign keys to this table are going to be problematic. Do you reference an exact version record? Do you then fix up the foreign keys to point to the new live leaf record?
The one benefit I can think of doing this is that the audit table DDL will always be in sync with the live table; with the two-table strategy, changes are often made to the live table while the audit table and trigger DDL aren't updated accordingly.
Overall, I would still recommend keeping your audit table separate.
If the requirement is that the signed data not be changed, then you should move it to another table. In fact, I might suggest moving it to another database/schema, where the only operation allowed on the table is inserting and reading records. You can use both permissions and triggers, if you really want to prevent updates.
You don't want to mess around with regulatory requirements. A complex schema that uses a combination of primary key with version, along with triggers, is a sign that there might be a simpler way.
The historical records can affect performance of the current records. If you end up in a situation where every record has changed 100 times, then keeping them in the same table is just going to slow down queries. Of course, you can embark on more complexity, in the form of partitioning the data. In the end, the solution is simpler: keep the data that cannot be changed in another table where it cannot be changed. You don't want to have to upgrade the hardware just because lots of history has accumulated.
I would also suggest including an effective and end date in the history records. This will allow you to reconstruct all the data as of a particular date, something that users might find useful in the future.
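A rough sketch of that separate-table layout, assuming SQL Server (table, column and variable names are invented): history rows carry effective/end dates, and updates and deletes are denied so the table is effectively insert-and-read-only (highly privileged roles can still bypass permissions, which is where the triggers mentioned above come in):

CREATE TABLE dbo.DocumentHistory (
    ID int NOT NULL,
    Version int NOT NULL,
    Body nvarchar(max) NULL,
    SignedBy nvarchar(100) NOT NULL,
    EffectiveDate datetime2 NOT NULL,
    EndDate datetime2 NULL,      -- NULL while this version is the latest
    CONSTRAINT PK_DocumentHistory PRIMARY KEY (ID, Version)
);

-- Only inserts and reads are allowed on the history table.
DENY UPDATE, DELETE ON dbo.DocumentHistory TO PUBLIC;

-- Reconstruct the data as of a particular date.
DECLARE @AsOf datetime2 = '2015-06-30';
SELECT ID, Version, Body, SignedBy
FROM dbo.DocumentHistory
WHERE EffectiveDate <= @AsOf
  AND (EndDate IS NULL OR EndDate > @AsOf);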
That's right. Audit trails can stay in an application for internal reporting/audit, but infosec best practice mandates getting audit logs off the system where they are generated and into your log management / SIEM solution.

memcache as NHibernate second-level cache

I have a question about second level caching with NHibernate and memcache. Suppose the following configuration:
Website A uses DB_A. Data from table X is being cached.
Website B uses DB_B. Data from table X is being cached.
Both web apps share a single memcache server.
Now, table X in DB_A and DB_B, while having the same schema, holds different data, so the row with PK = 1 in DB_A will NOT be the same data as the row with PK = 1 in DB_B.
My question is: will each application clobber the other's data, or is the second-level caching smart enough to create cache keys which don't overlap across databases?
I'm not sure whether you'll end up with overlapping, overwritten data; you'll need to check which cache keys are being used. However, here are some interesting reads that you might find helpful:
http://ayende.com/blog/3976/nhibernate-2nd-level-cache
http://ayende.com/blog/3112/nhibernate-and-the-second-level-cache-tips
http://ayende.com/blog/1708/nhibernate-caching-the-secong-level-cache-space-is-shared
The last one is probably of most use to you. The author did something similar to what you are attempting, except you're making your life easier by (somehow) having no primary key conflicts.

Indexing a 'non guessable' key for quick retrieval?

I'm not fully getting all I want from Google Analytics, so I'm making my own simple tracking system to fill in some of the gaps.
I have a session key that I send to the client as a cookie. This is a GUID.
I also have a surrogate IDENTITY int column.
I will frequently have to access the session row to make updates to it during the life of the client. Finding this session row to make updates is where my concern lies.
I only send the GUID to the client browser:
a) I don't want my technical 'hacker' users being able to gauge what 'user id' they are, i.e. know how many visitors we have had to the site in total.
b) I want to make sure no one messes with data maliciously; nobody can guess a GUID.
I know GUID indexes are inefficient, but I'm not sure exactly how inefficient. I'm also not clear how to maximize the efficiency of multiple updates to the same row.
I don't know which of the following I should do:
Index the GUID column and always use that to find the row
Do a table scan to find the row based on the GUID (assuming recent sessions are easy to find). Do this in reverse date order (if that's even possible!)
Avoid a GUID index and keep a hashtable of active sessions in my application tier: IDictionary<GUID, int>, to allow the 'secret' IDENTITY surrogate key to be found from the 'non-secret' GUID key.
There may be several thousand sessions a day.
PS. I am just trying to better understand the SQL aspects of this. I know I can do other clever things like only writing to the table on session expiration etc., but please keep answers SQL/index related.
In this case, I'd just create an index on the GUID. Thousands of sessions a day is a completely trivial load for a modern database.
Some notes:
If you create the GUID index as non-clustered, the index will be small and will probably be cached in memory. By default most databases cluster on the primary key (see the sketch after these notes).
A GUID column is larger than an integer. But this is hardly a big issue nowadays. And you need a GUID for the application.
An index on a GUID is just like an index on a string, for example Last Name. That works efficiently.
The B-tree of an index on a GUID is harder to balance than an index on an identity column. (But not harder than an index on Last Name.) This effect can be countered by starting with a low fill factor and reorganizing the index in a weekly job. This is a micro-optimization for databases that handle a million inserts an hour or more.
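A quick sketch of the layout those notes describe, assuming SQL Server (all names are invented): cluster on the int surrogate key and put a narrow non-clustered unique index on the GUID used by the cookie:

CREATE TABLE dbo.Session (
    SessionID int IDENTITY(1,1) NOT NULL,
    SessionGuid uniqueidentifier NOT NULL DEFAULT NEWID(),
    LastSeen datetime2 NOT NULL DEFAULT SYSUTCDATETIME(),
    CONSTRAINT PK_Session PRIMARY KEY CLUSTERED (SessionID)
);

-- Non-clustered and narrow, so it tends to stay cached for cookie lookups.
CREATE UNIQUE NONCLUSTERED INDEX IX_Session_Guid ON dbo.Session (SessionGuid);

-- A typical update during the life of the client (@SessionGuid stands in for the cookie value).
DECLARE @SessionGuid uniqueidentifier = NEWID();
UPDATE dbo.Session
SET LastSeen = SYSUTCDATETIME()
WHERE SessionGuid = @SessionGuid;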
Assuming you are using SQL Server 2005 or above, your scenario might benefit from NEWSEQUENTIALID(), the function that gives you ordered GUIDs.
Consider this quote from the article Performance Comparison - Identity() x NewId() x NewSequentialId
"The NEWSEQUENTIALID system function is an addition to SQL Server 2005. It seeks to bring together, what used to be, conflicting requirements in SQL Server 2000; namely identity-level insert performance, and globally unique values."
Declare your table as
create table MyTable(
id uniqueidentifier default newsequentialid() not null primary key clustered
);
However, keep in mind, as Andomar noted, that the sequentiality of the GUIDs produced also makes them easy to predict. There are ways to make this harder, but none that would make this better than applying the same techniques to sequential integer keys.
Like the other authors I seriously doubt that the overhead of using straight newid() GUIDs would be significant enough for your application to notice. You would be better off focusing on minimizing roundtrips to your DB than on implementing custom caching scenarios such as the dictionary you propose.
If I understand what you're asking, you're worried that indexing and looking up your users by their hashed GUID might slow your application down? I'm with Andomar, this is unlikely to matter unless you're inserting rows so fast that updating the index slows things down. Only on something like a logging table might that happen, and then only for complicated indexes.
More importantly, did you profile it first? You don't have to guess why your program is slow, you can find out which bits are slow with a profiler. Otherwise you'll waste hours optimizing bits of code that are either A) never used or B) already fast enough.