Quick backup system for large projects - backup

I've always backed up all my source code into .zip files, put them on my USB drive, and uploaded them to a server of mine somewhere else in the world. However, I only do this once every two weeks, because my project is fairly big.
Right now my project directories (I have a few of them) contain a hierarchy of C++ files, and interspersed with them are .o files, which would make backing up take a while if not ignored.
What tools exist out there that will let me back things up efficiently and conveniently, and let me specify which file types to back up (lots of .png, .jpg and some text types in there) and which directories to ignore (esp. the build dirs)?
Or are there any ingenious methods out there that people use?

Though not a backup solution as such, a version control system hosted on a remote server meets most of your needs:
only changes are saved, not the whole project
you can filter out what you don't want to save
Moreover, you can create archives of your repository for true backup purposes.
If you want to learn about version control, take a look at Eric Sink's weblog, in particular:
Source Control HOWTO, for the basics of source control
Mercurial, Subversion, and Wesley Snipes for the links to articles on distributed version control systems

I use Dropbox. I'm a single developer developing software. In some projects I work directly out of my Dropbox folder, which means they synchronize every time I build. For other projects I copy the source code there myself. But most important is that I can work on all my computers that have Dropbox installed... works for my simple needs.

Agree with mouviciel. If you do not want that, consider rsync or unison to efficiently keep an up-to-date copy, be it on the same or a different machine.
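If you'd rather script the zip step yourself, here is a minimal sketch of the include/exclude filtering the question asks for, assuming Python is available. The extension list, the ignored directory names and the function name `backup` are my own placeholders, not anything from the posts above.

```python
import os
import zipfile

# Hypothetical defaults; adjust to the project at hand.
INCLUDE_EXTS = {".cpp", ".h", ".png", ".jpg", ".txt"}
IGNORE_DIRS = {"build", ".git"}


def backup(src_root, zip_path, include_exts=INCLUDE_EXTS, ignore_dirs=IGNORE_DIRS):
    """Zip files under src_root that have an allowed extension; return their relative paths."""
    archived = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, dirnames, filenames in os.walk(src_root):
            # Prune ignored directories in place so os.walk never descends into them.
            dirnames[:] = [d for d in dirnames if d not in ignore_dirs]
            for name in sorted(filenames):
                # .o files (and anything else not whitelisted) are skipped here.
                if os.path.splitext(name)[1].lower() in include_exts:
                    full = os.path.join(dirpath, name)
                    rel = os.path.relpath(full, src_root)
                    zf.write(full, rel)
                    archived.append(rel)
    return archived
```

Unlike rsync or a version control system, this rewrites the whole archive on every run, so it only addresses the filtering part of the problem, not the "only changes are saved" part.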

Attaching a specific piece of non-intrusive info to a file or folder to keep a connection to a program

This is going to be a question with a lot of hypotheticals, but it's been on my mind for a while now and I finally want to get some perspectives on how to tackle this "issue". For the sake of the question, I'll make up an example requirement of how the program I want to make would work on a conceptual level without too many specifics.
The Problem
I want to create a program to keep track of miscellaneous info for files and folders. This miscellaneous info can be anything from comments, authors, to more specific info like the original source of the file (a URL for example), categories, tags, and more. All this info is kept track of in an SQLite database.
Now... how would you create a connection from the file (or folder) to the database? Whatever file is added to the program, the file should continue to operate independently of the program, meaning you should be able to edit, copy, move, rename or do anything else with the file that you would usually do with your OS of choice - even delete it.
You should even be able to archive it, zip it, upload it somewhere or do other things that temporarily or permanently remove the file from your system, without losing the connection to the database. The program itself never actually touches the files, except to generate a new entry in the database, but obviously there should be some kind of reference in the file to a database entry in the program.
Yes, I know that if you delete the file, you would have a dead entry in the database. For now, just treat this as an unfortunate reality that can't be solved unless you incorporate the file more closely into the program.
Possible solutions and why I decided against them
Reference inside Filename
Probably the most obvious choice, you could just have a reference inside the filename to point to a database entry, for example by including the id at the start of the filename:
#1 my-example-file.txt
#12814 this-is-one-of-many-files.txt
Obviously, that goes against what I established earlier, as you would be restricted from freely renaming the file. You would always have to keep in mind to not mess with the id inside the filename, or else the connection to your program is broken. Unfortunately, that is the best bet I currently have, but I would like to avoid using that approach if possible.
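To make the fragility concrete, here is a minimal sketch of how the program might read the id back out of such a filename. The regular expression and the name `split_id` are assumptions for illustration; the only thing taken from the examples above is the "#<id> <name>" shape.

```python
import re

# Pattern for the hypothetical "#<id> <name>" convention shown above.
ID_PREFIX = re.compile(r"^#(\d+) (.+)$")


def split_id(filename):
    """Return (id, rest) if filename carries an id prefix, else (None, filename)."""
    match = ID_PREFIX.match(filename)
    if match is None:
        # The prefix was removed or altered, so the database link is lost.
        return None, filename
    return int(match.group(1)), match.group(2)
```

Any rename that drops or mangles the prefix makes `split_id` return `None` for the id, which is exactly the broken connection described above.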
Alternate Data Streams (ADS)
A pretty cool feature I recently discovered that's available on NTFS file systems, ADS allows you to store different streams of data for your files, to grossly simplify it. You could attach a data stream to your file that saves the id for the database entry in the program, and a regular user would never be able to mess directly with that.
However, since this is a feature reserved for specific file systems, there are some ugly side effects to ADS, as you can easily lose that part of the file by:
moving/copying it to a file system that doesn't support ADS, such as the file systems most often used in removable drives
uploading it to a cloud then later downloading it
moving it to another OS that might not support ADS or treats it in an unexpected way
zipping it
Thus I can't really rely on ADS either.

Is Dropbox considered a Distributed File System?

I was just reading this https://en.wikipedia.org/wiki/Clustered_file_system#Distributed_file_systems
The definition of a DFS seems to describe Dropbox exactly, but Dropbox isn't in the list of examples, which I think it would be if it were one.
So what is different about Dropbox which makes it not fall into this category?
Usually, when talking about distributed file systems, you expect properties that Dropbox doesn't support. For example, if you and I share a folder, I can create a file called "work.txt" in it and you can create a file called "work.txt" in it, and if we do it fast enough (or while we're not syncing with Dropbox) we'll have conflicting copies of the same file.
A similar example would be if we both edit the same file concurrently - we'll have conflicting copies, which is something a distributed file system should prevent. In the link you refer to, this is called "Concurrency transparency; all clients have the same view of the state of the file system".
Another example of a property Dropbox doesn't support: if my computer fails (e.g., my hard drive is corrupted) I might lose data that wasn't uploaded to Dropbox. There is a small window in which I think my data was written to the local disk, but if my computer fails, I lose that data.
Lastly, I'm not sure how Dropbox operates with file locks. For example, MS Office takes locks on .doc files to ensure no one else is working on them at the same time. I don't think Dropbox supports this feature.
I've written a blog post about some of the complexities of implementing a distributed file system; you might find it helpful as well.

Source control in SSIS and Concurrent work on dtsx file

I am working on building a new SSIS project from scratch, and I want to work on it with a couple of my teammates. I was hoping to get a suggestion on how we can set up some source control so that a few of us can work concurrently on the same SSIS project (same dtsx file, building new packages).
Version:
SQL Server Integration Service v11
Microsoft Visual Studio 2010
It is my experience that there are two opportunities for any source control system and SSIS projects to get out of whack: adding new items to the project and concurrent changes to an existing package.
Adding new items
An SSIS project has the .dtproj extension. Inside there, it's "just" XML defining what all belongs to the project. At least for 2005/2008 and 2012+ on the package deployment model. The 2012+ project deployment model carries a good bit more information about the state of the packages in the project.
When you add new packages (or project-level connection managers or .biml files), the internal structure of the .dtproj file is going to change. Diff tools generally don't handle merging XML well. Or at all, really. So, to prevent the need for merging the project definition, you need to find a strategy that works for your team.
I've seen two approaches work well. The first is to define up front all the packages you think you'll need: DimFoo, DimDate, DimBar, FactBlee. Check that project and the associated empty packages in, and everyone works on what is out there. When the initial cut of packages is complete, you ensure everyone is synced up and then add more empty packages to the project. The idea here is that there is one person, usually the lead, who is responsible for changing the "master" project definition, and everyone consumes from their change.
The other approach requires communication between team members. If you discover a package needs to be added, ask your mates: "I need to add a new package - has anyone modified the project?" The answer should be No. Once you've given notice that a change to the project definition is coming, make it and immediately commit it. The idea here is that people commit and sync (or check in, whatever the terminology) with great frequency. If you as a developer don't keep your local repository up to date, you're going to be in for a bad time.
Concurrent edits
Don't. Really, that's about it. The general problem with concurrent changes to an SSIS package is that, in addition to the XML diff issue above, SSIS also stores layout data alongside the tasks. I can invert the layout and make things flow from bottom to top or right to left, and there's no material change to the SSIS package, yet the file differs. As Siyual notes, "Merging changes in SSIS is nightmare fuel."
If you find your packages are so large that developers need to make concurrent edits, I would propose that you are doing too much in there. Decompose your packages into smaller, more tightly focused units of work and then control their execution through a parent package. That would allow a better level of granularity in your development and debugging process, in addition to avoiding the concurrent edit issue.
A dtsx file is basically just an XML file. Compare it to a bunch of people trying to write the same book. The solution I suggest is to use Team Foundation Server for source control. That way everyone can check packages in and out and merge them. If you really don't have that option, try to split your ETL process into logical parts and, at the end, create a master package that calls each sub-package in the right order.
An example: let's say you need to import stock data from one source, branches and other company information from an internal server, and sale amounts from different external sources. After you have gathered all the information, you want to connect those and run some analyses.
You first design the target database entities that you need and their relations. One of your team members creates a package that does all the importing to staging tables. Another maybe handles external sources and parallelizes/optimizes the loading. You would build a package that merges your staging and production tables, maybe historicizing and so on.
At the end you have a master package that calls each of the mentioned packages and maybe some additional logging or such.
In our multi-developer operation, we follow this rough plan:
Each dev has their own branch, separate from master branch
Once a week, devs push all their changes to remote
One of us pulls all changes, and merges all branches into master, manually resolving .dtproj conflicts as we go
Merge master in all dev branches - now all branches agree
Test in VS
Push all branches to remote, other devs can now pull and keep working
It's not a perfect solution, but it helps quarantine the amount of merge pain we have to experience.
We have large SSIS solutions with 20+ packages in one solution, with TFS Git. One project required adding a bunch of new packages to the existing solution. We thought we were smart and knew to assign only one person to work on each new package; 2 people working on the same package would be suicide. It wasn't good enough. When 2 people tried to add a new, differently named package at the same time, each showed dtproj as a file that had changed and needed to be checked in, and suddenly I found myself looking at the XML for dtproj and trying to figure out which lines to keep (Microsoft should never ask end users to manually edit their internal files, which only they wrote and understand). Billinkc's solutions here are very good and the problem is very real. You may think that Microsoft is the great Wise One, and that your team can always add new packages to an existing solution without conflicts, but you'd be wrong. It also doesn't work to put dtproj in .gitignore. If you do that, you won't see other people's new packages (actually the .dtsx file will come down in git, but you won't see that package in Solution Explorer, because dtproj is what feeds Solution Explorer). This is a current problem (2021) and we are using Visual Studio 2017 Enterprise with SSDT.
To explain this problem to people: git obviously can handle a group of independent, individual files in a directory (like, say, .bat files) and can add, change, and delete those files easily. The problem comes in when you have a file that is naming, describing, and counting all the files in a directory (which is what dtproj does). When you have a file like dtproj, you are creating a conflict on dtproj itself when 2 people try to add a new package at the same time. Your dtproj file has a line that shows the package you added, and my dtproj file shows the package I added, and TFS/git sees that as a conflict.
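The conflict is at the line level, even though what both developers want is simply the union of their additions. Here is a minimal sketch of that intent as a three-way merge over plain lists of package names. This is a deliberately simplified, hypothetical stand-in for the manifest, not the real .dtproj XML, and `merge_package_lists` is a made-up name.

```python
def merge_package_lists(base, mine, theirs):
    """Three-way merge of package name lists: keep every addition, honor every removal."""
    base_set, mine_set, theirs_set = set(base), set(mine), set(theirs)
    # Anything either side has that the common ancestor lacked is an addition.
    added = (mine_set | theirs_set) - base_set
    # Anything from the ancestor that either side dropped is a removal.
    removed = base_set - (mine_set & theirs_set)
    return sorted((base_set - removed) | added)
```

For example, with a base of `["DimFoo.dtsx"]`, one developer adding `DimBar.dtsx` and another adding `FactBlee.dtsx`, the merged list contains all three. A line-based merge tool has no way of knowing that the manifest behaves like a set, which is why it reports a conflict instead.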
Some are suggesting ways to deal with this if you have to add a lot of new packages; my idea is a little different. For the people who have to add new packages: don't work in the primary solution where this problem is, work somewhere else. It is probably best to work in the "Projects" directory you get when you install Visual Studio, outside of TFS/Git. Obviously follow all the standards, variable naming, and package configuration conventions of the target solution. Then, when the new packages are ready, give the .dtsx files to your Solution Gatekeeper to check in. Only the Gatekeeper can check in new packages, using Add From Existing, which avoids conflicts. Once the packages are checked in, developers can work on them in the main solution.

Software configuration management tool for hundreds of binary files, many are large

Note: I've tried searching; Stack Overflow's near useless. I am not sure what kind of tool I need.
At my organization we need to keep track of the software configuration for many types of computers, including the binary installers and automation scripts. Change is infrequent, but the size of the latest version of the configuration is several gigs.
We are trying to use Mercurial to store changes but it is just too slow, even without many revisions at all. I did an hg status but killed it after it took 10 minutes without finishing.
We are looking for a way to store the current configuration as well as keeping the old configurations around just in case. I have never done anything like this before and do not know what tools are available or even suitable for such tasks. Can someone point me in the right direction or tell me how they are solving this problem? Thanks
Since hard disk space is cheap and being able to view binary differences isn't very helpful, perhaps the best option you have is to store each configuration in a new directory that is indexed somehow. Example below:
/software/configs/2009-03-15
/software/configs/2009-09-28
/software/configs/2009-09-30
Given the size of your files and the infrequent number of changes, this would allow you to pick a configuration from a given 'tag' without the overhead of revision control.
If you pack your files into a single tar file and generate a SHA-512 hash, then you can be reasonably sure that no one has tampered with your files since they were archived.
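A minimal sketch of that tar-and-hash step, assuming Python's standard `tarfile` and `hashlib` modules; the function names are my own, and the archive should be written somewhere outside the directory being packed.

```python
import hashlib
import tarfile


def archive_and_hash(config_dir, tar_path):
    """Pack config_dir into tar_path and return the archive's SHA-512 hex digest."""
    with tarfile.open(tar_path, "w") as tar:
        tar.add(config_dir, arcname=".")
    return sha512_of(tar_path)


def sha512_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte archives need not fit in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

To check an archive later, recompute `sha512_of(tar_path)` and compare it with the recorded digest. The digest is only useful for this if it is kept somewhere separate from the archive itself.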
While I don't know the specific details of how to implement this strategy in Mercurial, I have been working with git and git-fat. It sets up a general procedure that is likely to be feasible in Mercurial as well. Basically the idea is that whenever you add a binary file to the repository, under the hood, the repo creates a symlink to the file, which is actually stored in another location as a checksummed object.
This allows large files to be tracked by the repo without storing the actual data inside it. It requires the data to be stored in some other location (perhaps in a binary management system).
It might take some configuration to do it in Mercurial, but I think it's an elegantly simple solution.
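To illustrate the general idea (this is not git-fat's actual implementation, and it uses a small pointer file rather than a symlink), here is a minimal sketch of a checksummed object store. The names `stash` and `restore` and the store layout are assumptions of mine.

```python
import hashlib
import os
import shutil


def stash(path, store_dir):
    """Move path into store_dir under its SHA-1 and leave a small pointer file behind."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    key = digest.hexdigest()
    os.makedirs(store_dir, exist_ok=True)
    target = os.path.join(store_dir, key)
    if os.path.exists(target):
        os.remove(path)  # identical content is already in the store
    else:
        shutil.move(path, target)
    # The pointer file is what the repository would track instead of the binary.
    with open(path, "w") as pointer:
        pointer.write(key + "\n")
    return key


def restore(path, store_dir):
    """Replace the pointer file at path with the stored content it names."""
    with open(path) as pointer:
        key = pointer.read().strip()
    shutil.copyfile(os.path.join(store_dir, key), path)
```

The repository then only ever sees a 41-byte text file per binary, so operations that scan the working copy stay cheap, while the store directory can live on whatever storage suits multi-gigabyte files.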

Best approach to perform a CMMI Physical Configuration Audit?

The organization I currently work for is moving into the whole CMMI world of documenting everything. I was assigned (along with one other individual) the title of Configuration Manager. Congratulations to me, right?
Part of the duties is to perform a physical configuration audit on a regular basis (they are still defining "regular basis"; it will be either quarterly or monthly). This is basically a check of the source code versions deployed in production against what we believe to be the source code versions in production.
Our project is a relatively small web application written in Java. The file types we work with are java, jsp, xml, property files, and sql packages.
The problem I have (and have expressed, but it seems to be ignored) is this: how am I supposed to physically log on to the production server and verify file versions? Even if I could, it would take a ridiculous amount of time.
The file versions are not even currently in the files (i.e. in a comment or something). It was also suggested that we place visible version numbers on each screen that is visible to the users. I thought this ridiculous as well, since the screens themselves represent only a small fraction of the code we maintain.
The tools we currently use are Netbeans for our IDE and Serena Dimensions as our versioning tool.
I am specifically looking for ideas on how to perform this audit in a hopefully more automated way, that will be both accurate and not time consuming.
My idea is currently to add a comment to the top of each file that contains the version number of that file, plus a script that runs when a production build is created and writes an XML file (or something similar) containing the file name and version of each file in the build. Then, when I need to do an audit, I go to the production server, grab the XML file with the info, compare it programmatically to what we believe to be in production, and output a report.
Any better ideas? I know this has to have been done already, and it seems crazy to me that I have not found any other resources.
You could compute a SHA1 hash of the source files on the production server, and compare that hash value to the versions stored in source control. If you can find the same hash in source control, then you know what version is in production. If you can't find the same hash in source control, then there are untracked modifications in production and your new job title is justified. :)
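A minimal sketch of that comparison, assuming you can produce two directory trees to hash: an export of the tagged baseline from source control and a copy of what is deployed. The function names and the three-way report are my own choices, not part of any tool mentioned here.

```python
import hashlib
import os


def sha1_manifest(root):
    """Map each file's path relative to root to the SHA-1 of its contents."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as f:
                manifest[os.path.relpath(full, root)] = hashlib.sha1(f.read()).hexdigest()
    return manifest


def audit(expected, deployed):
    """Return sorted lists of (modified, missing, untracked) relative paths."""
    modified = sorted(p for p in expected if p in deployed and expected[p] != deployed[p])
    missing = sorted(p for p in expected if p not in deployed)
    untracked = sorted(p for p in deployed if p not in expected)
    return modified, missing, untracked
```

Running `audit(sha1_manifest(baseline_dir), sha1_manifest(production_dir))` gives the audit report directly; three empty lists mean production matches the baseline. This works as-is for files deployed verbatim (jsp, xml, property files, sql); for compiled Java you would have to hash the build output on both sides instead.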
The typical trap organizations fall into with the CMMI is trying to overdo everything. If I could suggest anything, it'd be to start small and only do what you need. So consider any problems that you may have had in the CM area previously.
The CMMI describes WHAT an organisation should do, but leaves the HOW up to you. The CMMI specification, chapter 2, is well worth a read - it describes the required, expected, and informative components of the specification. Basically, the goals are required, the practices are expected, and everything else is informative. This means there is only a small part of the specification which a CMMI appraiser can directly demand: the goals. At the practice level, it is permissible to have either the practices as described, or acceptable alternatives to them.
In the case of configuration audits, goal SG3 is "Integrity of baselines is established and maintained". SP3.2 says "Perform configuration audits to maintain integrity of the configuration baselines." There is nothing stated here about how often these are done, or how long they may take.
In my previous organisation, FCA/PCA was usually only done as part of the product release process, and we used ClearCase as the versioning tool, with labels applied across the codebase to define baselines. We didn't have version numbers in all the source files, nor did we have version numbers on all the product's screens - the CM activity was doing the right thing and was backed up by audits, and this was never an issue in any CMMI appraisal.
We could use the deltas between labels to look at which files had changed, and perform diffs to see the actual code changes. An important part of the process is being able to link those changes back to a requirement, bug report, or whatever other reason initiated the change.
Our auditing did use scripts to automate the process, but these were in-house scripts that are specific to ClearCase - basically they would list all the files, their versions in the CM system, and the baseline/config item to which they belonged.
Can't you use your source control for this? If you deploy a version and tag your source control with that deployment, you can then verify against the source control system.