packaging large frequently changing data files with maven project

packaging large frequently changing data files with maven project - maven-2

I am in the process of converting a legacy Ant project into a Maven project. Part of the project is a very large (~1.6GB) set of data files in a compressed binary format which are accessed in a random-seek fashion via index tables. The data files are like logarithmic function tables, rainbow tables or similar data tables for massively abbreviating complex computations.
We publish new data tables on a weekly basis, and I want to be able to exploit Maven's dependency management system to help the developers get the latest tables.
The main problem I am having is that I cannot figure out how to bundle the tables up in a way that isn't just a JAR, ZIP or RAR of the whole set of them. Is there a way to write a pom that will result in a directory of data files? Or am I just thinking about the problem in a non-Maven way?
Thanks for any suggestions.

This depends on what the consumer can deal with. Maven dependencies don't deal with directories of files, so you'd need the whole artifact. You probably want to deal with ZIPs, as JAR has an overloaded meaning (put on the classpath) and other compression need custom plugins.
However, if you can break it up into long-lived and short-lived data you may get better behaviour (e.g. a quarterly full release, and a set of changes to apply to that that is re-released weekly). This depends whether the data can easily be split in this fashion, or overlaid, or patched in some way. This might be difficult in a compressed binary artifact.
The other alternative is to continuously build the large artifact, and discard old ones. This relies on good bandwidth between builds and repository, and enough disk to hold as many builds as you need (repository managers like Archiva can help purge old builds on a regular schedule if that's appropriate).
One final note - if you are dealing with ZIPs over 2G (which you are close to approaching), you will need to use a different ZIP such as the truezip-maven-plugin.

Related

How do I put all my bigquery queries into a repository or is that even a thing

I have a project that is a bunch of BigQuery tables and queries and scheduled queries etc. Since it's my only copy in the whole wide world of this code and it took me a long time to stumble across it all (thank you StackOverflow), how do I put all those SQLs into a repository and save it, or do people only do that with real code like programs and such.
I did this so far:
https://cloud.google.com/source-repositories/docs/quickstart#before-you-begin
https://cloud.google.com/source-repositories/docs/authentication
https://chrome.google.com/webstore/detail/save-bigquery-to-github/heeofgmoliomldhglhlgeikfklacpnad
I made a local repository with git for windows. I made a repository in google and connected my project to it. The repository is currently empty and there isn't a big button anywhere that I can click to just copy the SQLs into the repository. There are quickstarts to make code locally and flow it all the way up to google, but I want the code going in the opposite direction. You'd think they'd make it this easy, but I think they want it to be inaccessible to noobs. It seems like it's just easier to copy and paste the SQLs one at a time to notepad++ as I would do for Postgres.

The best solution for that is to use a managing tool like DBT or Dataform.
In this kind of tool, you store your infrastructure information in YAML files (dataset organisation, what tables you have, their schema, their materialization -view, table, materialized view, etc), and your queries, UDFs, functions, etc in files with a .sql extension.
(Note you can use templating which is awesome).
Once your queries are in place, you can build your project with simple commands (for dbt for instance: dbt run).
There is a lot of additional functionalities that are extremely useful (ex: data lineage, documentation website).
And since you are manipulating files, you then only need to git add git commit git push to your repository to safely store your whole project.
Maybe you can have a look first at Dataform: it has been acquired by Google earlier this year, and it will probably be integrated in BigQuery in the future (becoming the de facto recommended way to manage your data pipelines).

SQL files management

Most of my day is spent on writing SQL queries to perform small tasks, mainly to get information from the database and manipulate it somehow for data visualization building reports for others.
At the end of the day i try to have a nice folder scheme to help me reusing code and so on, but it's becoming harder to handle so many files and keep
track of everything I've done so far.
Don't want to have huge SQL files because I might want to
the end It's hard to avoid a war zone in my desktop and on this folders. It's also a mess to handle so many folders/codes.
For version control we're using a GIT server, but there is plenty of code that is not in production that we would like to keep track and reuse somehow.
We're using iPython notebook, R studio and SSMS to build our codes, I'm wonder if there is some efficient ways to work.
There must be an efficient way to work out there. What do you use to keep track of your (SQL) codes? and more importantly reuse it.
Thanks in advance,
Rafael

I just use a folder system. And I keep the shell-scripts so to speak as the first file (like the generic code to do X). Whereas the specific codes where I take X and apply dates and other conditions in the bottom half of the folder.

What do you use to keep track of your (SQL) codes? and more importantly reuse it.
For ease of reuse, I have all my running SQL code backed up on an SQL server through routine INFORMATION SCHEMA dumps. For all development code that I need to reuse with others, I have a GIT server that gets automatic updates throughout the day. For reuse on my laptop itself, I have a local backup through time machine.
As for directory or folder structure, all code starts as project based and eventually I migrate the best and most useful code to a personal folder structure that is topic based (date arithmetic, indexing, etc.). No matter how they are stored, all these folders are indexed using local and remote indexing features so I can search and retrieve them with just a few keystrokes when needed. Ultimately what's needed for optimum reuse is ease of retrieval. The quicker I can retrieve, the more reuse I get.
Lastly, it's not just SQL code, but all the supporting documents that led to that code solution. Sometimes this collection may include code from other languages, code from other servers, emails, text documents, images, workflows, etc. Keeping them all together enhances the value of reuse.

Database objects version control (not schema)

I'm trying to figure out how to implement version control in an environment where we have two DBs: one Testing and one Production.
In Testing. there are an arbitrary number of tasks being tested. These have no constraints in number of objects manipulated and complexity, meaning we can have a 3-day task that changes 2 package bodys and one trigger, and we can have a 3 month task that changes 100 different objects, including С source files and binary objects.
My main concern are the text-based objects of the DB. We need to version the Test and Production code, but any task can go from Testing to Production with no defined order whatsoever.
This means right now we have to manually track the changes in the files, selecting inside each file which lines in the code go from Testing to Production. We use a very rudimentary solution, writing in the header a sequence of comments with a file-based version number and adding in the code tags with that sequence to delimit the change.
I'm struggling to implement SVN because I wanted to create Testing as a branch of Production, having branches in Testing to limit each task, but I find that it can lead to many Testing tasks being ported to Production during merges.
This said, my questions are:
Is there a way to resolve this automatically?
Are there any database-specific version control solutions?
How can I "link" both environments if the code base is so different?

I used SVN for source control on DB scripts.
I dont have a technological solution to your problem but i can explain the methodology we used.
We had two sets of scripts - one for incremental changes and another for the complete declaration of database objects and procedures.
During development we updated only the incremental changes in a script that that was eventually used during deployment. during test rounds we updated the script.
Finally, After running the script on production we updated the second set of scripts containing the full declarations. The full scripts were used as reference and to create a db from scratch.

How to properly manage database deployment with SSDT and Visual Studio 2012 Database Projects?

I'm in the research phase trying to adopt 2012 Database Projects on an existing small project. I'm a C# developer, not a DBA, so I'm not particularly fluent with best practices. I've been searching google and stackoverflow for a few hours now but I still don't know how to handle some key deployment scenarios properly.
1) Over the course of several development cycles, how do I manage multiple versions of my database? If I have a client on v3 of my database and I want to upgrade them to v8, how do I manage this? We currently manage hand-crafted schema and data migration scripts for every version of our product. Do we still need to do this separately or is there something in the new paradigm that supports or replaces this?
2) If the schema changes in such a way that requires data to be moved around, what is the best way to handle this? I assume some work goes in the Pre-Deployment script to preserve the data and then the Post-Deploy script puts it back in the right place. Is that the way of it or is there something better?
3) Any other advice or guidance on how best to work with these new technologies is also greately appreciated!
UPDATE: My understanding of the problem has grown a little since I originally asked this question and while I came up with a workable solution, it wasn't quite the solution I was hoping for. Here's a rewording of my problem:
The problem I'm having is purely data related. If I have a client on version 1 of my application and I want to upgrade them to version 5 of my application, I would have no problems doing so if their database had no data. I'd simply let SSDT intelligently compare schemas and migrate the database in one shot. Unfortunately clients have data so it's not that simple. Schema changes from version 1 of my application to version 2 to version 3 (etc) all impact data. My current strategy for managing data requires I maintain a script for each version upgrade (1 to 2, 2 to 3, etc). This prevents me from going straight from version 1 of my application to version 5 because I have no data migration script to go straight there. The prospect creating custom upgrade scripts for every client or managing upgrade scripts to go from every version to every greater version is exponentially unmanageable. What I was hoping was that there was some sort of strategy SSDT enables that makes managing the data side of things easier, maybe even as easy as the schema side of things. My recent experience with SSDT has not given me any hope of such a strategy existing but I would love to find out differently.

I've been working on this myself, and I can tell you it's not easy.
First, to address the reply by JT - you cannot dismiss "versions", even with declarative updating mechanics that SSDT has. SSDT does a "pretty decent" job (provided you know all the switches and gotchas) of moving any source schema to any target schema, and it's true that this doesn't require verioning per se, but it has no idea how to manage "data motion" (at least not that i can see!). So, just like DBProj, you left to your own devices in Pre/Post scripts. Because the data motion scripts depend on a known start and end schema state, you cannot avoid versioning the DB. The "data motion" scripts, therefore, must be applied to a versioned snapshot of the schema, which means you cannot arbitrarily update a DB from v1 to v8 and expect the data motion scripts v2 to v8 to work (presumably, you wouldn't need a v1 data motion script).
Sadly, I can't see any mechanism in SSDT publishing that allows me to handle this scenario in an integrated way. That means you'll have to add your own scafolding.
The first trick is to track versions within the database (and SSDT project). I started using a trick in DBProj, and brought it over to SSDT, and after doing some research, it turns out that others are using this too. You can apply a DB Extended Property to the database itself (call it "BuildVersion" or "AppVersion" or something like that), and apply the version value to it. You can then capture this extended property in the SSDT project itself, and SSDT will add it as a script (you can then check the publish option that includes extended properties). I then use SQLCMD variables to identify the source and target versions being applied in the current pass. Once you identify the delta of versions between the source (project snapshot) and target (target db about to be updated), you can find all the snapshots that need to be applied. Sadly, this is tricky to do from inside the SSDT deployment, and you'll probably have to move it to the build or deployment pipeline (we use TFS automated deployments and have custom actions to do this).
The next hurdle is to keep snapshots of the schema with their associated data motion scripts. In this case, it helps to make the scripts as idempotent as possible (meaning, you can rerun the scripts without any ill side-effects). It helps to split scripts that can safely be rerun from scripts that must be executed one time only. We're doing the same thing with static reference data (dictionary or lookup tables) - in other words, we have a library of MERGE scripts (one per table) that keep the reference data in sync, and these scripts are included in the post-deployment scripts (via the SQLCMD :r command). The important thing to note here is that you must execute them in the correct order in case any of these reference tables have FK references to each other. We include them in the main post-deploy script in order, and it helps that we created a tool that generates these scripts for us - it also resolves dependency order. We run this generation tool at the close of a "version" to capture the current state of the static reference data. All your other data motion scripts are basically going to be special-case and most likely will be single-use only. In that case, you can do one of two things: you can use an IF statement against the db build/app version, or you can wipe out the 1 time scripts after creating each snapshot package.
It helps to remember that SSDT will disable FK check constraints and only re-enable them after the post-deployment scripts run. This gives you a chance to populate new non-null fields, for example (by the way, you have to enable the option to generate temporary "smart" defaults for non-null columns to make this work). However, FK check constraints are only disabled for tables that SSDT is recreating because of a schema change. For other cases, you are responsible for ensuring that data motion scripts run in the proper order to avoid check constraints complaints (or you manually have disable/re-enable them in your scripts).
DACPAC can help you because DACPAC is essentially a snapshot. It will contain several XML files describing the schema (similar to the build output of the project), but frozen in time at the moment you create it. You can then use SQLPACKAGE.EXE or the deploy provider to publish that package snapshot. I haven't quite figured out how to use the DACPAC versioning, because it's more tied to "registered" data apps, so we're stuck with our own versioning scheme, but we do put our own version info into the DACPAC filename.
I wish I had a more conclusive and exhasutive example to provide, but we're still working out the issues here too.
One thing that really sucks about SSDT is that unlike DBProj, it's currently not extensible. Although it does a much better job than DBProj at a lot of different things, you can't override its default behavior unless you can find some method inside of pre/post scripts of getting around a problem. One of the issues we're trying to resolve right now is that the default method of recreating a table for updates (CCDR) really stinks when you have tens of millions of records.
-UPDATE: I haven't seen this post in some time, but apparently it's been active lately, so I thought I'd add a couple of important notes: if you are using VS2012, the June 2013 release of SSDT now has a Data Comparison tool built-in, and also provides extensibility points - that is to say, you can now include Build Contributors and Deployment Plan Modifiers for the project.

I haven't really found any more useful information on the subject but I've spent some time getting to know the tools, tinkering and playing, and I think I've come up with some acceptable answers to my question. These aren't necessarily the best answers. I still don't know if there are other mechanisms or best practices to better support these scenarios, but here's what I've come up with:
The Pre- and Post-Deploy scripts for a given version of the database are only used migrate data from the previous version. At the start of every development cycle, the scripts are cleaned out and as development proceeds they get fleshed out with whatever sql is needed to safely migrate data from the previous version to the new one. The one exception here is static data in the database. This data is known at design time and maintains a permanent presence in the Post-Deploy scripts in the form of T-SQL MERGE statements. This helps make it possible to deploy any version of the database to a new environment with just the latest publish script. At the end of every development cycle, a publish script is generated from the previous version to the new one. This script will include generated sql to migrate the schema and the hand crafted deploy scripts. Yes, I know the Publish tool can be used directly against a database but that's not a good option for our clients. I am also aware of dacpac files but I'm not really sure how to use them. The generated publish script seems to be the best option I know for production upgrades.
So to answer my scenarios:
1) To upgrade a database from v3 to v8, I would have to execute the generated publish script for v4, then for v5, then for v6, etc. This is very similar to how we do it now. It's well understood and Database Projects seem to make creating/maintaining these scripts much easier.
2) When the schema changes from underneath data, the Pre- and Post-Deploy scripts are used to migrate the data to where it needs to go for the new version. Affected data is essentially backed-up in the Pre-Deploy script and put back into place in the Post-Deploy script.
3) I'm still looking for advice on how best to work with these tools in these scenarios and others. If I got anything wrong here, or if there are any other gotchas I should be aware of, please let me know! Thanks!

In my experience of using SSDT the notion of version numbers (i.e. v1, v2...vX etc...) for databases kinda goes away. This is because SSDT offers a development paradigm known as declarative database development which loosely means that you tell SSDT what state you want your schema to be in and then let SSDT take responsibility for getting it into that state by comparing against what you already have. In this paradigm the notion of deploying v4 then v5 etc.... goes away.
Your pre and post deployment scripts, as you correctly state, exist for the purposes of managing data.
Hope that helps.
JT

I just wanted to say that this thread so far has been excellent.
I have been wrestling with the exact same concerns and am attempting to tackle this problem in our organization, on a fairly large legacy application. We've begun the process of moving toward SSDT (on a TFS branch) but are at the point where we really need to understand the deployment process, and managing custom migrations, and reference/lookup data, along the way.
To complicate things further, our application is one code-base but can be customized per 'customer', so we have about 190 databases we are dealing with, for this one project, not just 3 or so as is probably normal. We do deployments all the time and even setup new customers fairly often. We rely heavily on PowerShell now with old-school incremental release scripts (and associated scripts to create a new customer at that version). I plan to contribute once we figure this all out but please share whatever else you've learned. I do believe we will end up maintaining custom release scripts per version, but we'll see. The idea about maintaining each script within the project, and including a From and To SqlCmd variable is very interesting. If we did that, we would probably prune along the way, physically deleting the really old upgrade scripts once everybody was past that version.
BTW - Side note - On the topic of minimizing waste, we also just spent a bunch of time figuring out how to automate the enforcement of proper naming/data type conventions for columns, as well as automatic generation for all primary and foreign keys, based on naming conventions, as well as index and check constraints etc. The hardest part was dealing with the 'deviants' that didn't follow the rules. Maybe I'll share that too one day if anyone is interested, but for now, I need to pursue this deployment, migration, and reference data story heavily. Thanks again. It's like you guys were speaking exactly what was in my head and looking for this morning.

Migrate clearcase to perforce

I have a large quantity of clearcase data which needs to be migrated into perforce. The revisions span the better part of a decade and I need to preserve as much branch and tag information as possible. Additionally we make extensive use of symbolic links, supported in clearcase but not in perforce. What advice or tools can you suggest which might make this easier?

The first step is to decide if you need to migrate everything, or just certain key versions. If you only migrate the important versions (releases and major milestones) you'll end up with a much simpler history in Perforce, without losing anything important. Then ClearCase can be keep as a historical archive in case it is ever needed. (Unless IBM has changed things ClearCase licenses do not expire when maintainance runs out, you just lose the right to new upgrades and patches and acces to support)
Keep in mind that Perforce does not version control directories and does not keep a full per-element version tree - this means a 1:1 with exact results is going to be impossible. Recreating the important snapshots is a much more achievable goal; keeping everything may be impossible, as Perforce lacks features ClearCase relies upon.
To see what Perforce says about the miration, check out
http://perforce.com/perforce/ccaseconv.html
This explains the key differences and covers a few approaches you can take.

Start by doing a Google search on "clearcase to perforce conversion".
Then read the ClearCase to Perforce Conversion Guide.
Once you're done crying, you're going to have to decide (1) how much effort you can afford, and (2) what you really need to capture as part of the conversion. You're not going to get it all, so you might as well just focus on getting the important branches.
Another consideration would be to just capture the current state of each supported branch as a snapshot, import that into Perforce, and then turn off the old ClearCase server, saving it in a known good state for that day when you need to access something from the deep, dark, pre-Perforce days...

The other answers are outdated. Now you can import CC->Perforce with many options also preserving history.
http://www.perforce.com/sites/default/files/pdf/migration-planning-guide-clearcase-to-perforce.pdf

What you also have to keep in mind is the fact, that your importerscript may slightly commit in another sequence than the clearcase commits(maybe you are traversing dir, may be histories of files, etc.)
So, unless you gather all version information into a (large) database and sort them afterwards, you will end up with commits which are not very useful to look into(except of course history of single files). As you (hopefully) change your commit-policy to commit atomic changes into perforce, it will be visible when development started: The commits before just do not make any sense on a project scope.
So you really should think of leaving clearcase history behind. Tags/Branches creation is also a different problem, as you need your old configspecs for your old branches.
At the end you will get wrong filenames in old tags(as perforce do not support dir-vers.) so you will use clearcase for this(and it is very tricky to get the correct filename for each version of a file!).
The last problem you will encounter: importer run time:
if you have large VOBs(eg. 10 years, 50 GB size), you will wait days for the importer to gather all information and convert it to a nice shiny perforce repo. All this day your devteam will stop working.

Just a quick note on the one import I saw from ClearCase to Perforce.
As noted in the ClearCase to Perforce Conversion Guide:
Perforce supports atomic change transactions; ClearCase doesn't.
Note that labels are often used to simply denote a snapshot in time for a particular easily-specified set of files; this is inherently easy to do in Perforce without using a label, due to Perforce's use of atomic change transactions and file naming syntax.
For example, the state of all the files in //depot/projecta as of change 42 can be obtained with
p4 sync //depot/projecta/...#42
That means the ClearCase project that got imported was an UCM one, since the concept of baseline closely follows the one of global revision.
Only files with a baseline on them were imported, the other versions were discarded.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas