Schema migration in Redis - redis

I have an application using redis. I used a key name user:<id> to store user info. Then locally I changed my app code to use key name user:<id>:data for that purpose.
I am scared by the fact that if I git push this new code to my production server things will break. And the reason for this is that since my production redis server would already have the keys will older key names.
So the only way I think is to stop my app, change all the older key names to new ones & then restart it. Do you have a better alternative? Thanks for help :)

Pushing new code to your production environment is always a scary business (that's why only the toughest survive in this profession ;)). I strongly recommend that before you change your production code and database, make sure that you test the workflow and its results locally.
Almost any update to the application requires its stoppage - even if only to replace the relevant files. This is even truer for any changes that involve a database exactly because of the reason you had mentioned.
Even if you can deploy your code changes without stopping the application per se (e.g. a PHP page), you will still want the database change to be done "atomically" - i.e. without any application requests intervening and possibly breaking. While some database can be taken offline for maintenance, even then you usually stop the app or else errors will be generated all over the place.
If that is indeed the case, you'll be stopping the app (or putting it into maintenance mode) regardless of the database change, so we take your question to actually mean: what's the fastest way to rename all/some keys in my database?
To answer that question, similarly to the pseudo-code suggested above, I suggest you use a Lua script such as the following and EVAL it once you stop the app:
for _,k in ipairs(redis.call('keys', 'user:*')) do
if k:sub(-5) ~= ':data' then
redis.call('rename', k, k .. ':data')
end
end
A few notes about this script that you should keep in mind:
Although the KEYS command is not safe to use in production, since you are doing maintenance it can be used safely. For all other use cases where you need to scan you keys, Redis' SCAN is much more advisable.
Since Lua scripts are "atomic", you can in theory run this script without stopping the app - as long as the script runs (which depends on the size of your dataset) the app's requests will be blocked. Put differently, this approach solves the concern of getting mixed key names (old & new). This, however, is probably not what you'd want to do in any case because a) your app may still error/timeout during that time but mainly because b) it will need to be able to handle both types of key names (i.e. running with old keys -> short/long pause -> running with new keys) making your code much more complex.
The if condition is not required if you're going to run the script only once and succeed.
Depending on the actual contents of your database, you may want to further filter out keys that should not be renamed.
To ensure compatibility, refrain from hardcoding / computationally generating key names - instead, they should be passed as arguments to the script.

You can run a migration script in your redis client language, using RENAME.
If you don't have any other control over the total of keys, you first issue a KEYS user:* to list all keys, then substring for getting the numeric id, then renaming.
You can issue all of this in a transaction.
So a little pseudocode and redis commands:
MULTI
KEYS user:*
For each key {
id = <Get id from key>
RENAME user:id user:id:data
}
EXEC
Got it ?

Related

CloudTrail RunInstances event, who actually provisioned EC2 instance when STS AssumeRole used?

My client is in need of an AWS spring cleaning!
Before we can terminate EC2 instances, we need to find out who provisioned them and ask if they are still using the instance before we delete them. AWS doesn't seem to provide out-of-the-box features for reporting who the 'owner'/'provisioner' of an EC2 instance is, as I understand, I need to parse through gobs of archived zipped log files residing in S3.
Problem is, their automation is making use of STS AssumeRole to provision instances. This means the RunInstances event in the logs doesn't trace back to an actual user (correct me if I'm wrong, please please I hope I am wrong).
AWS blog provides a story of a fictional character, Alice, and her steps tracing a TerminateInstance event back to a user which involves 2 log events: The TerminateInstance event and an event "somewhere around the time" of an AssumeRole event containing the actual user details. Is there a pragmatic approach one can take to correlate these 2 events?
Here's my POC that's parsing through a cloudtrail log from s3:
import boto3
import gzip
import json
boto3.setup_default_session(profile_name=<your_profile_name>)
s3 = boto3.resource('s3')
s3.Bucket(<your_bucket_name>).download_file(<S3_path>, "test.json.gz")
with gzip.open('test.json.gz','r') as fin:
file_contents = fin.read().replace('\n', '')
json_data = json.loads(file_contents)
for record in json_data['Records']:
if record['eventName'] == "RunInstances":
user = record['userIdentity']['userName']
principalid = record['userIdentity']['principalId']
for index, instance in enumerate(record['responseElements']['instancesSet']['items']):
print "instance id: " + instance['instanceId']
print "user name: " + user
print "principalid " + principalid
However, the details are generic since these roles are shared by many groups. How can I find details of the user before they Assumed Role in a script?
UPDATE: Did some research and it looks like I can correlate the Runinstances event to an AssumeRole event by a shared 'accessKeyId' and that should show me the account name before it assumed a role. Tricky though. Not all RunInstances events contain this accessKeyId, for example, if 'invokedby' was an autoscaling event.
Direct answer:
For the solution you are proposing, you are unfortunately out of luck. You can take a look at http://docs.aws.amazon.com/IAM/latest/UserGuide/cloudtrail-integration.html#w28aac22b9b4b7b3b1. On the 4th row, it says that the Assume Role will save the Role identity only for all subsequent calls.
I'd contact aws support to make sure of this as I might very well be mistaken.
What I would do in your case:
First, wait a couple of days in case someone had a better idea or I was mistaken and aws support answers with an out-of-the-box solution
Create an aws config rule that would delete all instances that have a certain tag. Then tell your developers to tag all instances that they are sure that should be deleted, then these will get deleted
Tag all the production instances and still needed development instances with a tag of their own
Run a script that would tag all of the untagged instances with a separate tag. Douple and triple check these instances.
Back up and turn off the instances tagged in step 3 (without
deleting the instances).
If someone complained about something not being on, that means they
missed an instance in step 1 or 2. Tag this instance correctly and
turn it on again.
After a while (a week or so), delete the instances that are still
stopped (keep the backups)
After a couple months, delete the backups that were not restored
Note that this isn't foolproof as it has the possibility of human error and possible downtime, so double and triple check, make a clone of the same environment and test on that (if you have a development environment that already has such a configuration, that would be the best scenario), take it slow to be able to monitor everything, and be sure to keep backups of everything.
Good luck and plzz tell me what your solution ended up being.
General guidelines for the future:
Note: The following points are very opiniated, and are general rules that I abide by as I find them saving me a load of trouble from time to time. Read them, dismiss what you find as unfit for you and take the things that you find reasonable.
Don't use assume role that often as it obfuscates user access. In case it was a script run on the developer's pc, let it run with their own username. If it's running on a server, keep it with the role it was created in. The amount of management will be less that way as you just cut the middle-man (the assume-role) and don't need to create roles anymore, just assign the permissions to the correct group/user. Take a look below for when I'd consider using the assume-role as a necessity.
Automate deletions. The first things you should create is automating the task of keeping the aws account as clean as possible as this would save both $$$ and debugging pain. Tags and scripts to act on these tags are very powerful tools. So if a developer needs an instance for a day to try out something new, he can create a tag that times the instance out, then there is a script that cleans it up when the time comes. These are project-specific, and not everyone needs all of these, so see and assess what you need for your project and act on them.
What I'd recommend is giving the permissions to the users themselves in the development environment as it would make tracking things to their root and finding the most knowledgeable person to solve things easier. As of the production environment, everything should be automated anyway (creation when needed and deletion when no longer needed) and no one should have any write access to that account, ever.
As for the assume-role, I only use it in case I want to give access to read-only production logs on another account. Another case would be something that really shouldn't be happening that often, if at all, but still need to give some users access to it. So, as an extra layer of protection against the 'I did it by mistake', I make them switch role to do it, and never have a script that automatically switches roles and do the action in an attempt to make it as deliberate as possible (think deleting a database and such). Another thing would be accessing sensitive information (credit-card database, etc.). Many more scenarios can occur, and here it comes to your judgement.
Again, Good Luck.

SQL (or any relational db) engine with SCM-friendly backing store [duplicate]

I'm doing a web app, and I need to make a branch for some major changes, the thing is, these changes require changes to the database schema, so I'd like to put the entire database under git as well.
How do I do that? is there a specific folder that I can keep under a git repository? How do I know which one? How can I be sure that I'm putting the right folder?
I need to be sure, because these changes are not backward compatible; I can't afford to screw up.
The database in my case is PostgreSQL
Edit:
Someone suggested taking backups and putting the backup file under version control instead of the database. To be honest, I find that really hard to swallow.
There has to be a better way.
Update:
OK, so there' no better way, but I'm still not quite convinced, so I will change the question a bit:
I'd like to put the entire database under version control, what database engine can I use so that I can put the actual database under version control instead of its dump?
Would sqlite be git-friendly?
Since this is only the development environment, I can choose whatever database I want.
Edit2:
What I really want is not to track my development history, but to be able to switch from my "new radical changes" branch to the "current stable branch" and be able for instance to fix some bugs/issues, etc, with the current stable branch. Such that when I switch branches, the database auto-magically becomes compatible with the branch I'm currently on.
I don't really care much about the actual data.
Take a database dump, and version control that instead. This way it is a flat text file.
Personally I suggest that you keep both a data dump, and a schema dump. This way using diff it becomes fairly easy to see what changed in the schema from revision to revision.
If you are making big changes, you should have a secondary database that you make the new schema changes to and not touch the old one since as you said you are making a branch.
I'm starting to think of a really simple solution, don't know why I didn't think of it before!!
Duplicate the database, (both the schema and the data).
In the branch for the new-major-changes, simply change the project configuration to use the new duplicate database.
This way I can switch branches without worrying about database schema changes.
EDIT:
By duplicate, I mean create another database with a different name (like my_db_2); not doing a dump or anything like that.
Use something like LiquiBase this lets you keep revision control of your Liquibase files. you can tag changes for production only, and have lb keep your DB up to date for either production or development, (or whatever scheme you want).
Irmin (branching + time travel)
Flur.ee (immutable + time travel + graph query)
XTDB (formerly called 'CruxDB') (time travel + query)
TerminusDB (immutable + branching + time travel + Graph Query!)
DoltDB (branching + time-travel + SQL query)
Quadrable (branching + remote state verification)
EdgeDB (no real time travel, but migrations derived by the compiler after schema changes)
Migra (diffing for Postgres schemas/data. Auto-generate migration scripts, auto-sync db state)
ImmuDB (immutable + time-travel)
I've come across this question, as I've got a similar problem, where something approximating a DB based Directory structure, stores 'files', and I need git to manage it. It's distributed, across a cloud, using replication, hence it's access point will be via MySQL.
The gist of the above answers, seem to similarly suggest an alternative solution to the problem asked, which kind of misses the point, of using Git to manage something in a Database, so I'll attempt to answer that question.
Git is a system, which in essence stores a database of deltas (differences), which can be reassembled, in order, to reproduce a context. The normal usage of git assumes that context is a filesystem, and those deltas are diff's in that file system, but really all git is, is a hierarchical database of deltas (hierarchical, because in most cases each delta is a commit with at least 1 parents, arranged in a tree).
As long as you can generate a delta, in theory, git can store it. The problem is normally git expects the context, on which it's generating delta's to be a file system, and similarly, when you checkout a point in the git hierarchy, it expects to generate a filesystem.
If you want to manage change, in a database, you have 2 discrete problems, and I would address them separately (if I were you). The first is schema, the second is data (although in your question, you state data isn't something you're concerned about). A problem I had in the past, was a Dev and Prod database, where Dev could take incremental changes to the schema, and those changes had to be documented in CVS, and propogated to live, along with additions to one of several 'static' tables. We did that by having a 3rd database, called Cruise, which contained only the static data. At any point the schema from Dev and Cruise could be compared, and we had a script to take the diff of those 2 files and produce an SQL file containing ALTER statements, to apply it. Similarly any new data, could be distilled to an SQL file containing INSERT commands. As long as fields and tables are only added, and never deleted, the process could automate generating the SQL statements to apply the delta.
The mechanism by which git generates deltas is diff and the mechanism by which it combines 1 or more deltas with a file, is called merge. If you can come up with a method for diffing and merging from a different context, git should work, but as has been discussed you may prefer a tool that does that for you. My first thought towards solving that is this https://git-scm.com/book/en/v2/Customizing-Git-Git-Configuration#External-Merge-and-Diff-Tools which details how to replace git's internal diff and merge tool. I'll update this answer, as I come up with a better solution to the problem, but in my case I expect to only have to manage data changes, in-so-far-as a DB based filestore may change, so my solution may not be exactly what you need.
There is a great project called Migrations under Doctrine that built just for this purpose.
Its still in alpha state and built for php.
http://docs.doctrine-project.org/projects/doctrine-migrations/en/latest/index.html
Take a look at RedGate SQL Source Control.
http://www.red-gate.com/products/sql-development/sql-source-control/
This tool is a SQL Server Management Studio snap-in which will allow you to place your database under Source Control with Git.
It's a bit pricey at $495 per user, but there is a 28 day free trial available.
NOTE
I am not affiliated with RedGate in any way whatsoever.
I've released a tool for sqlite that does what you're asking for. It uses a custom diff driver leveraging the sqlite projects tool 'sqldiff', UUIDs as primary keys, and leaves off the sqlite rowid. It is still in alpha so feedback is appreciated.
Postgres and mysql are trickier, as the binary data is kept in multiple files and may not even be valid if you were able to snapshot it.
https://github.com/cannadayr/git-sqlite
I want to make something similar, add my database changes to my version control system.
I am going to follow the ideas in this post from Vladimir Khorikov "Database versioning best practices". In summary i will
store both its schema and the reference data in a source control system.
for every modification we will create a separate SQL script with the changes
In case it helps!
You can't do it without atomicity, and you can't get atomicity without either using pg_dump or a snapshotting filesystem.
My postgres instance is on zfs, which I snapshot occasionally. It's approximately instant and consistent.
I think X-Istence is on the right track, but there are a few more improvements you can make to this strategy. First, use:
$pg_dump --schema ...
to dump the tables, sequences, etc and place this file under version control. You'll use this to separate the compatibility changes between your branches.
Next, perform a data dump for the set of tables that contain configuration required for your application to operate (should probably skip user data, etc), like form defaults and other data non-user modifiable data. You can do this selectively by using:
$pg_dump --table=.. <or> --exclude-table=..
This is a good idea because the repo can get really clunky when your database gets to 100Mb+ when doing a full data dump. A better idea is to back up a more minimal set of data that you require to test your app. If your default data is very large though, this may still cause problems though.
If you absolutely need to place full backups in the repo, consider doing it in a branch outside of your source tree. An external backup system with some reference to the matching svn rev is likely best for this though.
Also, I suggest using text format dumps over binary for revision purposes (for the schema at least) since these are easier to diff. You can always compress these to save space prior to checking in.
Finally, have a look at the postgres backup documentation if you haven't already. The way you're commenting on backing up 'the database' rather than a dump makes me wonder if you're thinking of file system based backups (see section 23.2 for caveats).
What you want, in spirit, is perhaps something like Post Facto, which stores versions of a database in a database. Check this presentation.
The project apparently never really went anywhere, so it probably won't help you immediately, but it's an interesting concept. I fear that doing this properly would be very difficult, because even version 1 would have to get all the details right in order to have people trust their work to it.
This question is pretty much answered but I would like to complement X-Istence's and Dana the Sane's answer with a small suggestion.
If you need revision control with some degree of granularity, say daily, you could couple the text dump of both the tables and the schema with a tool like rdiff-backup which does incremental backups. The advantage is that instead of storing snapshots of daily backups, you simply store the differences from the previous day.
With this you have both the advantage of revision control and you don't waste too much space.
In any case, using git directly on big flat files which change very frequently is not a good solution. If your database becomes too big, git will start to have some problems managing the files.
Here is what i am trying to do in my projects:
separate data and schema and default data.
The database configuration is stored in configuration file that is not under version control (.gitignore)
The database defaults (for setting up new Projects) is a simple SQL file under version control.
For the database schema create a database schema dump under the version control.
The most common way is to have update scripts that contains SQL Statements, (ALTER Table.. or UPDATE). You also need to have a place in your database where you save the current version of you schema)
Take a look at other big open source database projects (piwik,or your favorite cms system), they all use updatescripts (1.sql,2.sql,3.sh,4.php.5.sql)
But this a very time intensive job, you have to create, and test the updatescripts and you need to run a common updatescript that compares the version and run all necessary update scripts.
So theoretically (and thats what i am looking for) you could
dumped the the database schema after each change (manually, conjob, git hooks (maybe before commit))
(and only in some very special cases create updatescripts)
After that in your common updatescript (run the normal updatescripts, for the special cases) and then compare the schemas (the dump and current database) and then automatically generate the nessesary ALTER Statements. There some tools that can do this already, but haven't found yet a good one.
What I do in my personal projects is, I store my whole database to dropbox and then point MAMP, WAMP workflow to use it right from there.. That way database is always up-to-date where ever I need to do some developing. But that's just for dev! Live sites is using own server for that off course! :)
Storing each level of database changes under git versioning control is like pushing your entire database with each commit and restoring your entire database with each pull.
If your database is so prone to crucial changes and you cannot afford to loose them, you can just update your pre_commit and post_merge hooks.
I did the same with one of my projects and you can find the directions here.
That's how I do it:
Since your have free choise about DB type use a filebased DB like e.g. firebird.
Create a template DB which has the schema that fits your actual branch and store it in your repository.
When executing your application programmatically create a copy of your template DB, store it somewhere else and just work with that copy.
This way you can put your DB schema under version control without the data. And if you change your schema you just have to change the template DB
We used to run a social website, on a standard LAMP configuration. We had a Live server, Test server, and Development server, as well as the local developers machines. All were managed using GIT.
On each machine, we had the PHP files, but also the MySQL service, and a folder with Images that users would upload. The Live server grew to have some 100K (!) recurrent users, the dump was about 2GB (!), the Image folder was some 50GB (!). By the time that I left, our server was reaching the limit of its CPU, Ram, and most of all, the concurrent net connection limits (We even compiled our own version of network card driver to max out the server 'lol'). We could not (nor should you assume with your website) put 2GB of data and 50GB of images in GIT.
To manage all this under GIT easily, we would ignore the binary folders (the folders containing the Images) by inserting these folder paths into .gitignore. We also had a folder called SQL outside the Apache documentroot path. In that SQL folder, we would put our SQL files from the developers in incremental numberings (001.florianm.sql, 001.johns.sql, 002.florianm.sql, etc). These SQL files were managed by GIT as well. The first sql file would indeed contain a large set of DB schema. We don't add user-data in GIT (eg the records of the users table, or the comments table), but data like configs or topology or other site specific data, was maintained in the sql files (and hence by GIT). Mostly its the developers (who know the code best) that determine what and what is not maintained by GIT with regards to SQL schema and data.
When it got to a release, the administrator logs in onto the dev server, merges the live branch with all developers and needed branches on the dev machine to an update branch, and pushed it to the test server. On the test server, he checks if the updating process for the Live server is still valid, and in quick succession, points all traffic in Apache to a placeholder site, creates a DB dump, points the working directory from 'live' to 'update', executes all new sql files into mysql, and repoints the traffic back to the correct site. When all stakeholders agreed after reviewing the test server, the Administrator did the same thing from Test server to Live server. Afterwards, he merges the live branch on the production server, to the master branch accross all servers, and rebased all live branches. The developers were responsible themselves to rebase their branches, but they generally know what they are doing.
If there were problems on the test server, eg. the merges had too many conflicts, then the code was reverted (pointing the working branch back to 'live') and the sql files were never executed. The moment that the sql files were executed, this was considered as a non-reversible action at the time. If the SQL files were not working properly, then the DB was restored using the Dump (and the developers told off, for providing ill-tested SQL files).
Today, we maintain both a sql-up and sql-down folder, with equivalent filenames, where the developers have to test that both the upgrading sql files, can be equally downgraded. This could ultimately be executed with a bash script, but its a good idea if human eyes kept monitoring the upgrade process.
It's not great, but its manageable. Hope this gives an insight into a real-life, practical, relatively high-availability site. Be it a bit outdated, but still followed.
Update Aug 26, 2019:
Netlify CMS is doing it with GitHub, an example implementation can be found here with all information on how they implemented it netlify-cms-backend-github
I say don't. Data can change at any given time. Instead you should only commit data models in your code, schema and table definitions (create database and create table statements) and sample data for unit tests. This is kinda the way that Laravel does it, committing database migrations and seeds.
I would recommend neXtep (Link removed - Domain was taken over by a NSFW-Website) for version controlling the database it has got a good set of documentation and forums that explains how to install and the errors encountered. I have tested it for postgreSQL 9.1 and 9.3, i was able to get it working for 9.1 but for 9.3 it doesn't seems to work.
Use a tool like iBatis Migrations (manual, short tutorial video) which allows you to version control the changes you make to a database throughout the lifecycle of a project, rather than the database itself.
This allows you to selectively apply individual changes to different environments, keep a changelog of which changes are in which environments, create scripts to apply changes A through N, rollback changes, etc.
I'd like to put the entire database under version control, what
database engine can I use so that I can put the actual database under
version control instead of its dump?
This is not database engine dependent. By Microsoft SQL Server there are lots of version controlling programs. I don't think that problem can be solved with git, you have to use a pgsql specific schema version control system. I don't know whether such a thing exists or not...
Use a version-controlled database, of which there are now several.
https://www.dolthub.com/blog/2021-09-17-database-version-control/
These products don't apply version control on top of another type of database -- they are their own database engines that support version control operations. So you need to migrate to them or start building on them in the first place.
I write one of them, DoltDB, which combines the interfaces of MySQL and Git. Check it out here:
https://github.com/dolthub/dolt
I wish it were simpler. Checking in the schema as a text file is a good start to capture the structure of the DB. For the content, however, I have not found a cleaner, better method for git than CSV files. One per table. The DB can then be edited on multiple branches and merges extremely well.

Replace/Rename the Online Database

I have got a database of ms-sql server 2005 named mydb, which is being accessed by 7 applications from different location.
i have created its copy named mydbNew and tuned it by applying primary keys, indexes and changing queries in stored procedure.
now i wants to replace old db "mydb" from new db "mydbnew"
please tell me what is the best approach to do it. i though to do changes in web.config but all those application accessing it are not accessible to me, cant go for it.
please provide me experts opinion, so that i can do replace database in minimum time without affecting other db and all its application.
my meaning of saying replace old db by new db is that i wants to rename old db "mydb" to "mydbold" and then wants to remname my new db "mydbnew" to "mydb"
thanks
Your plan will work but it does carry a high risk, especially since I'm assuming this is a system that has users actively changing data, which means your copy won't have the same level of updated content in it unless you do a cut right before go-live. Your best bet is to migrate your changes carefully into the live system during a low traffic / maintenance period and extensively test it once your done. Prior to doing this, or the method you mentioned previously, backup everything.
All of the changes you described above can be made to an online database without the need to actually bring it down. However, some of those activities will change the way in which the data is affected by certain actions (changes to stored procs), that means that during the transition the behaviour of the system or systems may be unpredicatable and therefore you should either complete this update at a low point in day to day operations or take it down for a maintenance window.
Sql Server comes with a function to make a script file out of you database, you can also do this manually but clicking on the object you want to script and selecting the Script -> CREATE option. Depending on the amount of changes you have to make it may be worthwhile to script your whole new database (By clicking on the new database and selecting Tasks -> Generate Scripts... and selecting the items needed).
If you want to just script out the new things you need to add individually then you simply click on the object you want to script, select the Script <object> as -> then select DROP and CREATE to if you want to kill the original version (like replacing a stored proc) or select CREATE to if your adding new stuff.
Once you have all the things you want to add/update as a script your then ready to execute that against the new database. This would be the part where you backup everything. Once your happy everything is backed up and the system is in maintenance or a low traqffic period, you execute the script. There may be a few problems when you do this, you will need to fix these as quickly as possible (usually mostly just 'already exisits' errors, thats why drop and create scripts are good) and if anything goes really wrong restore your backups and try again (after figuring out what happened and how to fix it).
Make no mistake if you have a lot of changes to make this could be a long process, or it could take mere minutes, you just need to adapt if things go wrong and be sure to cover yourself with backups/extensive prayer. Good Luck!

Do you put your database static data into source-control ? How?

I'm using SQL-Server 2008 with Visual Studio Database Edition.
With this setup, keeping your schema in sync is very easy. Basically, there's a 'compare schema' tool that allow me to sync the schema of two databases and/or a database schema with a source-controlled creation script folder.
However, the situation is less clear when it comes to data, which can be of three different kind :
static data referenced in the code. typical example : my users can change their setting, and their configuration is stored on the server. However, there's a system-wide default value for each setting that is used in case the user didn't override it. The table containing those default settings grows as more options are added to the program. This means that when a new feature/option is checked in, the system-wide default setting is usually created in the database as well.
static data. eg. a product list populating a dropdown list. The program doesn't rely on the existence of a specific product in the list to work. This can be for example a list of unicode-encoded products that should be deployed in production when the new "unicode version" of the program is deployed.
other data, ie everything else (logs, user accounts, user data, etc.)
It seems obvious to me that my third item shouldn't be source-controlled (of course, it should be backuped on a regular basis)
But regarding the static data, I'm wondering what to do.
Should I append the insert scripts to the creation scripts? or maybe use separate scripts?
How do I (as a developer) warn the people doing the deployment that they should execute an insert statement ?
Should I differentiate my two kind of data? (the first one being usually created by a dev, while the second one is usually created by a non-dev)
How do you manage your DB static data ?
I have explained the technique I used in my blog Version Control and Your Database. I use database metadata (in this case SQL Server extended properties) to store the deployed application version. I only have scripts that upgrade from version to version. At startup the application reads the deployed version from the database metadata (lack of metadata is interpreted as version 0, ie. nothing is yet deployed). For each version there is an application function that upgrades to the next version. Usually this function runs an internal resource T-SQL script that does the upgrade, but it can be something else, like deploying a CLR assembly in the database.
There is no script to deploy the 'current' database schema. New installments iterate trough all intermediate versions, from version 1 to current version.
There are several advantages I enjoy by this technique:
Is easy for me to test a new version. I have a backup of the previous version, I apply the upgrade script, then I can revert to the previous version, change the script, try again, until I'm happy with the result.
My application can be deployed on top of any previous version. Various clients have various deployed version. When they upgrade, my application supports upgrade from any previous version.
There is no difference between a fresh install and an upgrade, it runs the same code, so I have fewer code paths to maintain and test.
There is no difference between DML and DDL changes (your original question). they all treated the same way, as script run to change from one version to next. When I need to make a change like you describe (change a default), I actually increase the schema version even if no other DDL change occurs. So at version 5.1 the default was 'foo', in 5.2 the default is 'bar' and that is the only difference between the two versions, and the 'upgrade' step is simply an UPDATE statement (followed of course by the version metadata change, ie. sp_updateextendedproperty).
All changes are in source control, part of the application sources (T-SQL scripts mostly).
I can easily get to any previous schema version, eg. to repro a customer complaint, simply by running the upgrade sequence and stopping at the version I'm interested in.
This approach saved my skin a number of times and I'm a true believer now. There is only one disadvantage: there is no obvious place to look in source to find 'what is the current form of procedure foo?'. Because the latest version of foo might have been upgraded 2 or 3 versions ago and it wasn't changed since, I need to look at the upgrade script for that version. I usually resort to just looking into the database and see what's in there, rather than searching through the upgrade scripts.
One final note: this is actually not my invention. This is modeled exactly after how SQL Server itself upgrades the database metadata (mssqlsystemresource).
If you are changing the static data (adding a new item to the table that is used to generate a drop-down list) then the insert should be in source control and deployed with the rest of the code. This is especially true if the insert is needed for the rest of the code to work. Otherwise, this step may be forgotten when the code is deployed and not so nice things happen.
If static data comes from another source (such as an import of the current airport codes in the US), then you may simply need to run an already documented import process. The import process itself should be in source control (we do this with all our SSIS packages), but the data need not be.
Here at Red Gate we recently added a feature to SQL Data Compare allowing static data to be stored as DML (one .sql file for each table) alongside the schema DDL that is currently supported by SQL Compare.
To understand how this works, here is a diagram that explains how it works.
The idea is that when you want to push changes to your target server, you do a comparison using the scripts as the source data source, which generates the necessary DML synchronization script to update the target. This means you don't have to assume that the target is being recreated from scratch each time. In time we hope to support static data in our upcoming SQL Source Control tool.
David Atkinson, Product Manager, Red Gate Software
I have come across this when developing CMS systems.
I went with appending the static data (the stuff referenced in the code) to the database creation scripts, then a separate script to add in any 'initialisation data' (like countries, initial product population etc).
For the first two steps, you could consider using an intermediate format (ie XML) for the data, then using a home grown tool, or something like CodeSmith to generate the SQL, and possible source files as well, if (for example) you have lookup tables which relate to enumerations used in the code - this helps enforce consistency.
This has another benefit that if the schema changes, in many cases you don't have to regenerate all your INSERT statements - you just change the tool.
I really like your distinction of the three types of data.
I agree for the third.
In our application, we try to avoid putting in the database the first, because it is duplicated (as it has to be in the code, the database is a duplicate). A secondary benefice is that we need no join or query to get access to that value from the code, so this speed things up.
If there is additional information that we would like to have in the database, for example if it can be changed per customer site, we separate the two. Other tables can still reference that data (either by index ex: 0, 1, 2, 3 or by code ex: EMPTY, SIMPLE, DOUBLE, ALL).
For the second, the scripts should be in source-control. We separate them from the structure (I think they typically are replaced as time goes, while the structures keeps adding deltas).
How do I (as a developer) warn the people doing the deployment that they should execute an insert statement ?
We have a complete procedure for that, and a readme coming with each release, with scripts and so on...
First off, I have never used Visual Studio Database Edition. You are blessed (or cursed) with whatever tools this utility gives you. Hopefully that includes a lot of flexibility.
I don't know that I'd make that big a difference between your type 1 and type 2 static data. Both are sets of data that are defined once and then never updated, barring subsequent releases and updates, right? In which case the main difference is in how or why the data is as it is, and not so much in how it is stored or initialized. (Unless the data is environment-specific, as in "A" for development, "B" for Production. This would be "type 4" data, and I shall cheerfully ignore it in this post, because I've solved it useing SQLCMD variables and they give me a headache.)
First, I would make a script to create all the tables in the database--preferably only one script, otherwise you can have a LOT of scripts lying about (and find-and-replace when renaming columns becomes very awkward). Then, I would make a script to populate the static data in these tables. This script could be appended to the end of the table script, or made it's own script, or even made one script per table, a good idea if you have hundreds or thousands of rows to load. (Some folks make a csv file and then issue a BULK INSERT on it, but I'd avoid that is it just gives you two files and a complex process [configuring drive mappings on deployment] to manage.)
The key thing to remember is that data (as stored in databases) can and will change over time. Rarely (if ever!) will you have the luxury of deleting your Production database and replacing it with a fresh, shiny, new one devoid of all that crufty data from the past umpteen years. Databases are all about changes over time, and that's where scripts come into their own. You start with the scripts to create the database, and then over time you add scripts that modify the database as changes come along -- and this applies to your static data (of any type) as well.
(Ultimately, my methodology is analogous to accounting: you have accounts, and as changes come in you adjust the accounts with journal entries. If you find you made a mistake, you never go back and modify your entries, you just make a subsequent entries to reverse and fix them. It's only an analogy, but the logic is sound.)
The solution I use is to have create and change scripts in source control, coupled with version information stored in the database.
Then, I have an install wizard that can detect whether it needs to create or update the db - the update process is managed by picking appropriate scripts based on the stored version information in the database.
See this thread's answer. Static data from your first two points should be in source control, IMHO.
Edit: *new
all-in-one or a separate script? it does not really matter as long as you (dev team) agree with your deployment team. I prefer to separate files, but I still can always create all-in-one.sql from those in the proper order [Logins, Roles, Users; Tables; Views; Stored Procedures; UDFs; Static Data; (Audit Tables, Audit Triggers)]
how do you make sure they execute it: well, make it another step in your application/database deployment documentation. If you roll out application which really needs specific (new) static data in the database, then you might want to perform a DB version check in your application. and you update the DB_VERSION to your new release number as part of that script. Then your application on a start-up should check it and report an error if the new DB version is required.
dev and non-dev static data: I have never seen this case actually. More often there is real static data, which you might call "dev", which is major configuration, ISO static data etc. The other type is default lookup data, which is there for users to start with, but they might add more. The mechanism to INSERT these data might be different, because you need to ensure you do not destoy (power-)user-created data.

How to deploy complex SQL solutions through an installer?

Part of the setup routine for the product I'm working on installs a database update utility. The utility checks the current version of the users database and (if necessary) executes a series of SQL statements that upgrade the database to the current version.
Two key features of this routine:
Once initiated, it runs without user interaction
SQL operations preserve the integrity of the users data
The goal is to keep the setup/database routine as simple as possible for the end user (the target audience is non-technical). However, I find that in some cases, these two features are at odds. For example, I want to add a unique index to one of my tables - yet it's possible that existing data already breaks this rule. I could:
Silently choose what's "right" for the user and discard (or archive) data; or
Ask the user to understand what a unique index is and get them to choose what data goes where
Neither option sounds appealing to me. I could compromise and not create a unique index at all, but that would suck. I wonder what others do in this situation?
Check out SQL Packager from Red-Gate. I have not personally used it, but these guys make good tools overall and this seems to do what you're looking for. It let's you modify the script to customize the install:
http://www.red-gate.com/products/SQL_Packager/index.htm
You never throw a users data out. One possible option is to try and create the unique index. If the index creation fails, let them know it failed, tell them what they need to research, and provide them a script they can run if they find they have a data error that they choose to fix up.