What's a best approach to create a filestore - sql

This is an open ended question. I have noob understanding of databases but willing to learn whatever is required. Though I believe my problem could be done without learning a lot.
So, here goes the question:
I have large amount of files getting generated in mt projects(depending on the builds) and I need to archive them and also need to reproduce them according to buildNumber if requested by users. I don't expect these requests to be a lot. May be 1-2 requests a day.
For eg: 16GB data per build every week. Most of the files in weekly builds are duplicate. And I don't want to archive them again and again. I prefer to store them only once. There is one caveat that it can happen that the files relative location can change, even though content hasn't changed.
My approach is as follow: Create a hash from each file. Create the key-value pair as fileHash-actual file and store it. Store this information in some kind of manifest file for each build. So, I should be able to create the builds back with correct files/paths etc.
Can it ever happen that 2 different files will ever have the same hash? Can some database help to do it efficiently? I am currently thinking of dumping all files in one folder.
Thanks

Related

SQL Database in GitHub

I am building a Java app that uses an SQLite database to hold most of its data. For the end-user, the database would be almost entirely read-only, with very occasional edits. I'll (theoretically) be displaying/distributing it through my GitHub page, so my question is:
What's the best way to load the database into GitHub? (I'm using IntelliJ with DataGrip.)
I'd prefer to be able to update the database when I commit/push, instead of having to overwrite the whole file. The closest question I can find is How to include MySQL database schema on GitHub? but there could potentially be hundreds or thousands of entries, so I can't just rebuild the tables when the user installs the app.
I'm applying for entry-level developer jobs, and this project is going to be my main portfolio piece during job-hunting. I'm trying to make sure it is not only functional but also makes a good impression. Any help is (very) greatly appreciated.
EDIT:
After moving my .db file into the folder connected to GitHub (same level as my src folder) apparently I can now commit/push it with the rest of my files. How do I make sure that the connection from my Java code to the database stays valid once it is loaded onto another user's system? Can I just stick with
connection = DriverManager.getConnection("jdbc:sqlite:mydatabase.db");
or do I need to rework the path?
Upon starting, if your application can't find a corresponding sqlite database file, have it create one. Then do initial load of your tables from either CSV, JSON or XML files.
You can upload these files to Git, as they are text formats.

What is the best solution to save big amount files receiving everyday?

in my application, i receive at least 1000 files per days from different email.
Pentaho i store them in a folder, but what is the best solution to save these files:
storing in folder in my hard disk or saving in a table (sql)?
Thank you
Unless you plan to take advantage of the additional features offered by a sql database, I would advise you continue to store them to the hard drive, and DO NOT forget the importance of backing up your files.

Get a list of files on S3 that have changed since X

I have about 40,000 images up on S3 and I've downloaded into my application/database then sent them out to another site (like ebay or magento)
This is to support a client that sells his products on a few sites. Sites which really you'd rather keep a copy of the product image on their site. (so they can resize it and such)
My issue right now is that I want to poke S3 every once and a while looking for new files, or modified files.
I don't much like the idea of targeting each file one at a time. Nor do I like the idea of bringing down all the file names and dates and then comparing them with dates I've stored. Both seem to be quite wasteful especially if I want to run this every day (or every hour).
What I had hoped for, and what I'm looking for is a way to say "give me the names of all the files that have changed since 2013-10-14 13:10:30. This would let me store just one value, and if nothings changed, then I'd get back nothing (or something that indicated nothing).
Is there a way to get a list of changed files since X date?
I'm language agnostic.. though Ruby/Rails would be cool.
Note: I've tried to figure it out with the WSDL, but it doesn't quite seem to help as much as I'd hope.
Unfortunately S3 does not offer any support for this.
Currently your only options are to either list all the objects in the S3 bucket and check for changes or to keep track of the changed objects separate from S3 (record the last changed timestamp in some data store when you change the objects).

Uploads by Date

I need to manage files uploaded by people according to date, and make these files available for download. I need to keep only files for upto one month as there will be heavy files clean up is necessary. Can someone provide me a code snippet to guide me in the correct direction. I have not decided on a back end language yet so ASP or JSP will do.
Here's a pseudo:
for each file
if file is more than one month old
delete file
Run this as hourly background job or something. Since you're not language-specific, I can't give a language-specific answer. I will however note that such a job doesn't belong in the view side (ASP/JSP), but rather in the backend (C#/Java).

Asset Management: which is the better way to organise user generated files on a web server?

We are in the process of building a system which allows users to upload multiple images and videos to our servers.
The team I'm working with have decided to save all the assets belonging to a user in a folder named using the user's unique identifier. This folder in turn will be a sub-folder of our main assets folder on the file server.
The file structure they have proposed is as follows:
[asset_root]/userid1/assets1
[asset_root]/userid1/assets2
[asset_root]/userid2/assets1
[asset_root]/userid2/assets2
etc.
We are expecting to have thousands or possibly a million+ users in the life time of this system.
I always thought that it wasn't a good idea to have many sub-folders in a single location and suggested a year/month/day approach as follows:
[asset_root]/2010/11/04/userid1/assets1
[asset_root]/2010/11/04/userid1/assets2
[asset_root]/2010/11/04/userid2/assets1
[asset_root]/2010/11/04/userid2/assets2
etc.
Does anyone know which of the above approaches would be better suited for this many assets? Is there a better method to organize images/videos on a server?
The system in question will be an Windows IIS 7.5 with a SAN.
Many thanks in advance.
In general you are correct, in that many file systems impose a limit on the number of files and folders which may be in one folder. If you hit that limit with the number of users you have, your in trouble.
In general, I would simply use a uuid for each image, with some dimension of partitioning. e.g. A hash of ABCDEFGH would end up as [asset_root]/ABC/DEFGH. Using a hash gives you a greater degree of assurance about the number of files which will end up in each folder and prevents you from having to worry about, for example, not knowing which month an image you need was stored in.
I'm presuming your file system is NTFS? IF so, you've got a limit of 4,294,967,295 files on the disk - the limit of files in a folder is the same. If you have on the order of millions of users you should be fine, though you might want to consider having only one folder per user instead of several as your example indicates.