BigQuery - create surrogate keys on migrated data - google-bigquery

We are doing a migration from AWS Redshift to GCP BigQuery.
Problem statement:
We have a Redshift table that uses the IDENTITY column functionality to issue an internal EDW surrogate key (PK) for natural/business keys. These natural keys are from at least 20 different source systems for customers. We need a method to identify them in case natural keys are somehow duplicated (because we have so many source systems). In BigQuery, the functionality of the Redshift IDENTITY column does not exist. How can I replicate this in BQ?
We can't use GENERATE_UUID() because all our downstream clients have been using a BIGINT for the last 4 years. All history is based on BIGINT and too much would need to change for a VARCHAR.
Does anyone have any ideas, recommendations or suggestions?
Some considerations I have made:
1. Load the data into Spark, keep it in memory, and use Scala or Python functions to issue the surrogate key.
2. Use a NoSQL data store (but this does not seem like a good fit for this use case).
Any ideas are welcome!

In these cases, the idea is generally to identify an injective/bijective function that maps each natural key into some unique space.
How about trying something like: SELECT UNIX_MICROS(CURRENT_TIMESTAMP()) + x AS identity, where x is a number that you can manage somehow (using CASE statements or IF conditions) based on the business name or something similar?
You can also eliminate x from this formula if you intend to process things linearly in some order, like one business entity at a time.
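To make this concrete, here's a minimal sketch in BigQuery Standard SQL. The table and column names (staging.customers, natural_key, source_system) are made up for illustration, and note that CURRENT_TIMESTAMP() is fixed for the whole statement, so this assumes you key one business entity (or one small batch with distinct offsets) at a time, as described above:

SELECT
  -- microseconds since the epoch, plus a small offset managed per source system
  UNIX_MICROS(CURRENT_TIMESTAMP())
    + CASE source_system
        WHEN 'crm' THEN 1
        WHEN 'web' THEN 2
        ELSE 0
      END AS customer_sk,   -- stays a 64-bit integer, so BIGINT clients keep working
  natural_key,
  source_system
FROM staging.customers;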
Hope it helps.

Related

Why manage your own SQL Server ID column?

I recently started a new job and I am perplexed as to why the tables (in many databases) were designed this way. Can someone give me a logical explanation?
Each table has a primary key/Id field. Example: EmployeeId (Integer)
Then to get the next id we actually need to query and update a table that manages all the keys for every table.
SELECT NextId
FROM dbo.NextID
Where TableName = 'Employees'
This makes life difficult, as you can imagine. The person who designed this mess has left, and the others just accept that this is the way you do things.
Is there some design flaw in MS SQL identity columns? I don't get it. Any ideas?
Thanks for your input
The features/limitations of IDENTITY columns make them useful for generating surrogate keys in many scenarios, but they are not ideal for all purposes - such as creating "meaningful", managed and/or potentially updateable identifiers usable by the business, or for data integration or replication. Microsoft introduced the SEQUENCE feature as a more flexible alternative to IDENTITY in SQL Server 2012. In code written for earlier versions where sequences weren't available, it isn't unusual to see the kind of scheme that you have described.
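For comparison, here is a minimal sketch of the SEQUENCE alternative (the sequence, table and column names are hypothetical):

-- create a sequence once (SQL Server 2012+)
CREATE SEQUENCE dbo.EmployeeIdSeq
    START WITH 1
    INCREMENT BY 1;

-- then use it at insert time, instead of reading and updating a NextID table
INSERT INTO dbo.Employees (EmployeeId, Name)
VALUES (NEXT VALUE FOR dbo.EmployeeIdSeq, 'Alice');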
My guess is the person wanted no gaps in the ID column, so he/she implemented this unnecessary process of getting the next available id.
Maybe your application depends on sequential IDs; either way, it is not the way to go. Your application should not be dependent on sequential values, and an identity column is no doubt the way to go for this kind of requirement.
Issue with identity columns
Yes, there was a known issue with identity columns in SQL Server 2012: identity values can take big jumps when new identity values are created (for example after a service restart). Still, it should not matter.

Reverse Engineering of a DB without foreign keys

I'm looking for a solution for reverse engineering a DB without foreign keys (really! a 20-year-old DB...). The intention is to do this completely without additional application or persistence logic, just by analyzing the data.
I know this would be somewhat difficult, but it should be possible if the data itself, especially the PKs, is analyzed as well.
I don't think there is a universal solution to your problem. Hopefully there is some sort of a naming convention for the tables/columns that can lead you. You can query the system tables to try to figure out what's going on (Oracle: user_tab_columns, SQL Server: INFORMATION_SCHEMA.COLUMNS, etc.). Good luck!
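For example, on SQL Server a first metadata pass could look something like this (the LIKE filter is only a guess at a naming convention; adjust it to whatever convention your DB actually uses):

SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME LIKE '%id%'
ORDER BY TABLE_NAME, ORDINAL_POSITION;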
I also don't think you'll find a universal solution to this, but I'd like to suggest an approach, especially if you don't have any source code that could be read to guide the mapping:
First, scan all tables in your database. By scan, I mean store table names and columns.
You can infer column types by trying to convert data to a specific format (start by trying to convert to dates, numbers, booleans and so on). You can also try to discover data types by analysing the contents (whether it has only numbers without floating points, numbers with slashes, long or short texts, only a single digit, etc.).
Once you have mapped all tables, start by comparing the contents of all columns that have a numeric type. (Why? If the database was designed by a human, then he/she/they probably used numbers as primary/foreign keys.) A sketch of such a comparison follows below.
Every time you find more than X successful correspondences between the contents of 2 columns from 2 different tables, log this connection. (The X factor depends on the number of records you have.)
This analysis must run for each table, comparing all other tables column by column, so it will take some time.
Of course, this is just an overview of what needs to be done, but it is not complex code to write.
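As a rough sketch of that correspondence counting (the table and column names here are made up; you would generate one such query per candidate pair of numeric columns):

SELECT COUNT(*) AS matches
FROM (SELECT DISTINCT customer_id AS val FROM orders) o
JOIN (SELECT DISTINCT id AS val FROM customers) c
  ON o.val = c.val;
-- if "matches" exceeds your X threshold, log orders.customer_id -> customers.id
-- as a candidate foreign key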
Good luck and let me know if you find any sort of tool to do this! :-)
No offense but you can't have been in databases very long if this surprises you.
I am going to assume that by "reverse engineering" you are just looking to fill in the foreign keys, not moving to NoSQL or something. It could be an interesting project. Here is how I would go about it:
Look at all the SELECT statements and see how joins are made to a table. 20 years ago this would be in a WHERE clause, but it gets more complicated than that, of course: correlated subqueries, UPDATE statements with FROM clauses, and whatever else implies a join of some sort. You have to be able to figure all that out. If you want to do it formally (you can probably suss out all this stuff intuitively), you could list the number of times combinations are used in joins between tables. List them by pairs of tables, not the set of all the tables in the join. Those would be the candidate foreign keys if one side is a primary key; the other side gets the foreign key. There are multi-column PKs, but you can figure that out (so if the other side of the primary key is in two tables, that's not a foreign key). If one column ends up pointing to two different tables' PKs, that's not a proper foreign key either, but it might be appropriate to pick a table and use it as the target.
If you don't already have primary keys, you should do that first. Indexes, perhaps even clustered indexes (in Sybase/MSSQL), aren't always the correct primary keys. In any case you may have to change the primary keys accordingly.
Collecting all the statements might be challenging in itself. You could use perl/awk to parse them out of their C/Java/PHP/Basic/COBOL programs, or you could collect them by monitoring input to the server. You would want to look for WHERE/JOIN/APPLY etc. rather than SELECT. There are lots of other ways.

Is there implementation-agnostic way of having SQL DB provide UUIDs?

For development I'm using the H2 database; in prod it will most likely be Postgres. Is there a way to instruct, in an implementation-agnostic fashion, the database to automatically provide UUIDs for a table's rows?
A user defined function could be used.
Related (I know this isn't your question): please note that if you have a lot of rows in the table (millions of rows), and if you have an index on this UUID, you should avoid randomly distributed UUIDs for performance reasons. This applies to all databases, except if the index easily fits completely in memory. Because of that, I personally would avoid UUIDs and use sequences instead whenever possible.
Well apparently, it's as simple as that:
CREATE TABLE items (
  uuid SERIAL,
  PRIMARY KEY (uuid)
)
I didn't find SERIAL documented for H2; here's the doc for PostgreSQL. I don't know to what extent this is db-agnostic, but it works on both H2 and Postgres, so it's good enough at the moment.
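If you do want real UUIDs rather than SERIAL integers, one option is a column default per database; it isn't fully db-agnostic because the function names differ (this assumes PostgreSQL 13+ for gen_random_uuid(), or the pgcrypto extension on older versions):

-- PostgreSQL
CREATE TABLE items (
  uuid UUID DEFAULT gen_random_uuid(),
  PRIMARY KEY (uuid)
);

-- H2
CREATE TABLE items (
  uuid UUID DEFAULT RANDOM_UUID(),
  PRIMARY KEY (uuid)
);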

How does row design influence MySQL performance?

I've got a users table and a forum where users can write. Every action on the forum uses the users table. A user can have a profile, which can be quite big (50KB). If I have such big data in each row, wouldn't it be faster to have a separate table with users' profiles and other data that isn't accessed very often?
In an online RPG game each character has a long list of abilities, for example: pistols experience, machine guns experience, throwing grenades experience, and 15 more. Is it better to store them in a string as numbers separated by semicolons - which would take more space than integers - or should I make an individual field for each ability? Or maybe binary? (I use C++.)
If you don't need the data from specific columns, don't fetch it. Don't do SELECT * but SELECT a, b, ...
If you need to run SQL queries over certain columns, e.g. ORDER BY pistols_experience, you should leave them in separate columns. If you just display it all at once, you could serialize the different key-value pairs into a text field via YAML, JSON etc.
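For instance, a query like the following (table and column names are hypothetical) only works cheaply if the ability is a real column rather than part of a serialized string:

SELECT character_id, pistols_experience
FROM character_abilities
ORDER BY pistols_experience DESC
LIMIT 10;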
(1) Not in itself, no. As stefan says, you should be selecting only what you want, so having stuff you don't want in the table is no issue. A 50K TEXT blob is only a pointer in the row.
However, there can be an issue if you are using MyISAM tables. In MyISAM there is only table-level locking, so when one user updates their row (e.g. last visit time), it blocks all other users from accessing the table. In this case you might see some improvement by breaking out heavily-updated columns into a separate table from the relatively static but heavily-selected ones.
But you don't want to be using MyISAM anyway: it's a bit crap. Use InnoDB, get row-level locking (and transactions, and foreign key constraints), and don't worry about it. The only reason to use MyISAM tables today is for fulltext search, which InnoDB doesn't support.
(2) You would normally separate every independent value into its own field. If you hit a real performance issue and you don't need to do database-level manipulation of the values on their own, you could consider denormalising it, but you'd be losing the power of the database.
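A minimal sketch of the split suggested for the first question, with hypothetical names and assuming InnoDB as recommended above:

CREATE TABLE users (
  user_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  username VARCHAR(64) NOT NULL,
  last_visit DATETIME,
  PRIMARY KEY (user_id)
) ENGINE=InnoDB;

CREATE TABLE user_profiles (
  user_id INT UNSIGNED NOT NULL,
  profile MEDIUMTEXT,  -- the big (~50KB) profile lives here, off the hot path
  PRIMARY KEY (user_id),
  FOREIGN KEY (user_id) REFERENCES users (user_id)
) ENGINE=InnoDB;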

Database design: why use an autoincremental field as primary key?

Here is my question: why should I use auto-incrementing fields as primary keys on my tables instead of something like UUID values?
What are the main advantages of one over another? What are the problems and strengths of them?
Simple numbers consume less space. UUID values consume 128 bits each. Working with numbers is also simpler. For most practical purposes a 32-bit or 64-bit integer can serve well as the primary key; 2^64 is a very large number.
Consuming less space doesn't just save hard disk space. It means faster backups, better performance in joins, and more real stuff cached in the database server's memory.
You don't have to use auto-incrementing primary keys, but I do. Here's why.
First, if you're using ints, they're smaller than UUIDs.
Second, it's much easier to query using ints than UUIDs, especially if your primary keys turn up as foreign keys in other tables.
Also, consider the code you'll write in any data access layer. A lot of my constructors take a single id as an int. It's clean, and in a type-safe language like C# - any problems are caught at compile time.
Drawbacks of autoincrementers? Potentially running out of space. I have a table which is at 200M on its id field at the moment. It'll bust the 2 billion limit in a year if I leave it as is.
You could also argue that an autoincrementing id has no intrinsic meaning, but then the same is true of a UUID.
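As an illustration of the headroom point (MySQL-style syntax, hypothetical table): a signed 32-bit INT id tops out around 2.1 billion, so widening the type avoids the ceiling described above:

CREATE TABLE events (
  event_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- far beyond the ~2 billion INT limit
  created_at DATETIME NOT NULL,
  PRIMARY KEY (event_id)
);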
I guess by UUID you mean something like a GUID? GUIDs are better when you will later have to merge tables. For example, if you have local databases spread around the world, they can each generate unique GUIDs for row identifiers. Later the data can be combined into a single database and the IDs shouldn't conflict. With an autoincrement, in that case, you would have to have a composite key where the other half of the key identifies the originating location, or you would have to modify the IDs as you imported data into the master database.