Bulk update of URLs - SQL

I am trying to build a JSON file that is future-proof, meaning the media URLs in the JSON file will need to be changed from time to time (for example, if I change hosts).
So if my URL is
http://mediaserver.com/media/video.mp4
and later I need to change all of them to
http://new-mediaserver.com/media/video.mp4
or
rtmp://s2ziuhvw5dd27c.cloudfront.net/cfx/
is it possible to avoid entering hundreds of URLs by hand every time the sources change?
Maybe I have to do it using a database? Any tips?

Related

Automatically add database entry after ftp upload

Sorry if this seems stupid, but I wonder if it's possible to add a database entry after an FTP upload.
To be clearer: thanks to WinSCP, I have several folders that automatically send everything I put in them to my server.
However, I would like to create a MySQL entry for each uploaded file, again automatically. Is it possible to do that? How?
For the full details of what I need to do, read on.
I have several folders with pictures, and each folder is uploaded automatically.
Each of those folders belongs to one user, and the goal is to give each user an account and let them see and download those files through a web interface. Since one account = one folder, that's fairly easy.
And I think a simple .htaccess can secure things so that a user can only see and download the files in his own folder, no?
However, if I want them to be able to see what's new (= something they haven't downloaded or simply marked as read), I think I need a table to manage those files.
Something like id | file (string) | read (bool).
If you think this is a bad way to proceed, then I'm open to changing how I do things, but to be clear, uploading the files needs to work this way, not through any kind of form.
Thanks for reading, and sorry for my English.
Your problem consists of three steps:
Folders and files are automatically uploaded to your server directory. As you say, this is already handled efficiently by WinSCP.
You need to update your database with all the files and folders present in your server directory.
You need to record whether or not each file has been read/downloaded by the user.
Since your first step is in place, nothing more is needed there. For the second step, you should write a script and schedule it to run at a fixed interval, using cron on Linux or Unix, or Task Scheduler on Windows. The script would be responsible for building a list of the files present in the directory and inserting the information for any files that are not yet in your database.
EDIT:
This edit describes how your script file should work. As I explained, the cron job simply runs your script at a fixed interval (every minute, every hour, every day, and so on). Let's say your database table has the following columns:
fileid (varchar[20])
filepath (varchar[20])
status (boolean)
Your script file should do the following:
Create a list of the existing file paths in your server directory (list1).
Run a select query to create a list of the file paths already in the database table (list2).
Compare list1 with list2 and find the paths that don't exist in list2. (This gives you the list of file paths that need to be inserted into the table.)
Insert the file paths you got above and set their status to false (meaning the file has not been read/downloaded yet).
NOTE: Please keep in mind that I am not prescribing what your database table should look like. It can be what you proposed, or it can differ depending on your preferences and requirements.
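The four script steps above could be sketched roughly as follows. This is only a minimal illustration under stated assumptions: it uses Python's built-in sqlite3 module instead of MySQL so that it is self-contained, and the table name `files`, the helper names `create_schema` and `sync_new_files`, and the column types are hypothetical choices, not something the setup above requires.

```python
import os
import sqlite3


def create_schema(conn):
    # Hypothetical table: one row per uploaded file, status 0 = unread.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "fileid INTEGER PRIMARY KEY AUTOINCREMENT, "
        "filepath TEXT UNIQUE NOT NULL, "
        "status INTEGER NOT NULL DEFAULT 0)"
    )


def sync_new_files(conn, root):
    """Insert every file under `root` that is not yet in the table."""
    # Step 1: list the file paths that exist on disk.
    on_disk = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            on_disk.add(os.path.join(dirpath, name))

    # Step 2: list the file paths already recorded in the database.
    in_db = {row[0] for row in conn.execute("SELECT filepath FROM files")}

    # Step 3: paths on disk that the table does not know about yet.
    missing = sorted(on_disk - in_db)

    # Step 4: insert them with status 0 (not read/downloaded yet).
    conn.executemany(
        "INSERT INTO files (filepath, status) VALUES (?, 0)",
        [(path,) for path in missing],
    )
    conn.commit()
    return missing
```

Running `sync_new_files` from the scheduled job is safe to repeat: a second run over an unchanged directory finds nothing missing and inserts nothing.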
For the third step, leave each file's status as unread when the second step creates its entry. Then, when the user clicks the file link in your application to view or download it, send a POST request to your server that updates the file's status to read.
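On the server side, the handler for that POST request might boil down to a single UPDATE. The sketch below again assumes sqlite3 and a hypothetical `files` table with `filepath` and `status` columns; the function name `mark_as_read` is made up for illustration, and the web framework that receives the request is left out.

```python
import sqlite3


def mark_as_read(conn, filepath):
    """Set status to 1 (read) for one unread file; return True if a row changed."""
    cur = conn.execute(
        # Parameterized query, so the path from the request is never
        # spliced into the SQL text.
        "UPDATE files SET status = 1 WHERE filepath = ? AND status = 0",
        (filepath,),
    )
    conn.commit()
    return cur.rowcount == 1
```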
Let me know if this helps!

Why is matillion not loading data from S3?

I have a simple S3 load with all the correct information. There are no validation errors and the package executes without a problem. It's just that there is no data in the table. Any tips from someone who is knowledgeable about Matillion?
There are a number of reasons why Matillion might not appear to load any data in an S3 Load.
Firstly, I'd check that the pattern matches the file names in the S3 location, which is a regular expression match.
I believe the match also covers the path, which you may have included in the location parameter, so it may be worth modifying your pattern to look something like .*\/FilePrefix.* or even just .* and then selecting the actual file in the location parameter.
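To see why the leading .*\/ can matter, here is a small illustration using Python's re module. It is not Matillion's or Snowflake's own matching code; it assumes the pattern has to match the whole key, path included, and the object keys and the helper `matching_keys` are invented for the example.

```python
import re

# Hypothetical object keys as they might appear under the S3 location.
KEYS = [
    "exports/2020/FilePrefix_001.csv",
    "exports/2020/OtherFile.csv",
    "FilePrefix_002.csv",
]


def matching_keys(pattern, keys):
    """Return the keys that the pattern matches in full (path included)."""
    regex = re.compile(pattern)
    return [key for key in keys if regex.fullmatch(key)]


# Without the leading ".*\/", a key that carries a path is not matched.
print(matching_keys(r"FilePrefix.*", KEYS))
# With it, the key under exports/2020/ is matched.
print(matching_keys(r".*\/FilePrefix.*", KEYS))
# ".*" matches every key.
print(matching_keys(r".*", KEYS))
```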
Secondly, if the files were last modified more than 64 days ago, or they have already been loaded into the table previously, Snowflake won't load them by default. You can get around this by turning the Force Load parameter On.

What's a best approach to create a filestore

This is an open-ended question. I have only a beginner's understanding of databases, but I am willing to learn whatever is required, though I believe my problem could be solved without learning a lot.
So, here goes the question:
I have a large number of files being generated in my projects (depending on the builds). I need to archive them and also need to reproduce them by buildNumber if users request it. I don't expect many of these requests, maybe 1-2 a day.
For example: 16GB of data per build every week. Most of the files in the weekly builds are duplicates, and I don't want to archive them again and again. I would prefer to store each one only once. There is one caveat: a file's relative location can change even though its content hasn't.
My approach is as follows: create a hash from each file, then store each file as a key-value pair of fileHash and actual file. For each build, record this information in some kind of manifest file, so that I can recreate the build with the correct files, paths, etc.
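The approach described above could be sketched roughly like this. Everything here is an assumption made for illustration: SHA-256 as the hash, a flat store folder whose file names are the hashes, a JSON manifest mapping relative path to hash, and the function names `file_hash`, `archive_build`, and `restore_build`.

```python
import hashlib
import json
import os
import shutil


def file_hash(path, chunk_size=1 << 20):
    """Hash a file's content in chunks so large files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_build(build_dir, store_dir, manifest_path):
    """Copy each distinct file content into the store once; write a manifest."""
    os.makedirs(store_dir, exist_ok=True)
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(build_dir):
        for name in filenames:
            source = os.path.join(dirpath, name)
            key = file_hash(source)
            stored = os.path.join(store_dir, key)
            if not os.path.exists(stored):  # content seen before is skipped
                shutil.copyfile(source, stored)
            # The manifest keeps the relative path, so a file that moves
            # between builds still points at the same stored content.
            manifest[os.path.relpath(source, build_dir)] = key
    with open(manifest_path, "w") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
    return manifest


def restore_build(manifest_path, store_dir, target_dir):
    """Recreate a build tree from its manifest and the shared store."""
    with open(manifest_path) as handle:
        manifest = json.load(handle)
    for relative_path, key in manifest.items():
        destination = os.path.join(target_dir, relative_path)
        os.makedirs(os.path.dirname(destination) or ".", exist_ok=True)
        shutil.copyfile(os.path.join(store_dir, key), destination)
```

One manifest per buildNumber (for instance a file named after the build) would then be enough to rebuild any archived build on request.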
Can it ever happen that 2 different files will ever have the same hash? Can some database help to do it efficiently? I am currently thinking of dumping all files in one folder.
Thanks

Prevent URL skipping when Bulk extracting with import.io

So, I've been extracting a lot of data with the import.io desktop app for quite some time, but what has always bugged me is that when you try to bulk extract multiple URLs, it always skips around half of them.
It's not a URL problem. If you take the same, let's say, 15 URLs, it will return for example 8 the first time, 7 the second time, and 9 the third time; some links will be extracted the first time but skipped the second time, and so on.
Is there a way to make it process all the URLs I feed it?
I have encountered this issue a few times when I am extracting data. This typically is due to the speed of the Bulk Extract requesting URLs from the site's servers.
A workaround is to use a Crawler like an Extractor. You can paste the URLs that you created/collected into the Where to Start, Where to Crawl, and Where to Get Data From sections (you need to click on the advanced settings button in the Crawler).
Make sure to turn on 0 depth Crawl. (This turns the Crawler into an Extractor; i.e. no discovery of additional URLs)
Increase the Pause Between Pages.
Here is a screenshot of one I built some time ago.
http://i.gyazo.com/92de3b7c7fbca2bc4830c27aefd7cba4.png

How do I force a file to be deleted? Windows Server 2008

On my site a user may upload a file (pic, zip, audio, video, whatever). He may then decide to replace it with a newer revision. For example, the user may upload a file, make a post, and then decide to put up a new revision replacing the old one (let's say it's a large zip or tar.gz file). There's a good chance people will be downloading it if he sent out an email, or even an IM in the case of a home user.
The problem: I need to replace the file while people may be downloading it, and it may be some minutes before it can be deleted. I don't want my code to stall until I can delete it, or to check every second to see whether it's unused (especially bad if another user can start a download that also takes a long time, creating a cycle).
How do I delete the file while users are downloading it? I don't care if their downloads stop. I just care that the file can be replaced and that new downloads get the new revision.
What about referencing the files indirectly?
A mapping script maps a virtual file entry from your site to a real file. If the user wants to upload a new revision of his file, you just update the mapping, not the real file.
You can install a daily task that scans all files and deletes all files without a mapping and without open connections.
lajuette's answer is right. The easiest solution is to work around the file locking altogether:
When a user uploads file foo.zip, internally store it as foo-v1.zip.
Create a mapping file somewhere (database, code, whatever) that maps foo.zip to foo-v1.zip.
Rather than exposing a direct link to the file, expose a link to a service that gets the file: mysite.com/Download?foo.zip or something. This service uses the mapping to determine which version of the file to send to the client.
When a new version is uploaded, create foo-v2.zip and update the mapping file.
It wouldn't be that hard to write a scheduled task that cleans up old, un-mapped files.
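The steps above might look something like this in outline. This is a sketch under assumptions, not a drop-in implementation: the mapping is kept in a JSON file, and the class `FileMapping` and its methods `publish`, `resolve`, and `stale_files` are hypothetical names. A real download service would call `resolve` and stream the file it returns.

```python
import json
import os


class FileMapping:
    """Maps a public name such as foo.zip to the stored revision on disk."""

    def __init__(self, mapping_path):
        self.mapping_path = mapping_path
        self.mapping = {}
        if os.path.exists(mapping_path):
            with open(mapping_path) as handle:
                self.mapping = json.load(handle)

    def publish(self, public_name, data, storage_dir):
        """Store a new revision under a fresh name, then repoint the mapping."""
        stem, ext = os.path.splitext(public_name)
        version = self.mapping.get(public_name, {}).get("version", 0) + 1
        stored_name = f"{stem}-v{version}{ext}"
        with open(os.path.join(storage_dir, stored_name), "wb") as handle:
            handle.write(data)
        self.mapping[public_name] = {"version": version, "file": stored_name}
        # Write to a temporary file and swap it in, so readers never see
        # a half-written mapping. The old revision is never touched.
        tmp_path = self.mapping_path + ".tmp"
        with open(tmp_path, "w") as handle:
            json.dump(self.mapping, handle)
        os.replace(tmp_path, self.mapping_path)
        return stored_name

    def resolve(self, public_name):
        """Return the stored file name that new downloads should receive."""
        return self.mapping[public_name]["file"]

    def stale_files(self, storage_dir):
        """List stored files no mapping points to (cleanup candidates)."""
        live = {entry["file"] for entry in self.mapping.values()}
        return sorted(set(os.listdir(storage_dir)) - live)
```

The scheduled cleanup task would still have to skip any stale file that is currently open for download; `stale_files` only reports the candidates.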
If you are opposed to a database and the filenames are in a fixed format (such as user/id.ext), you could append a revision number to the id, enumerate the folder using a pattern (user/id-*), and use the latest revision.
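A minimal sketch of that lookup, assuming files are named `<id>-<revision>.<ext>` with a numeric revision; the function name `latest_revision` is made up for the example.

```python
import glob
import os
import re


def latest_revision(folder, file_id):
    """Return the path of the highest-numbered <file_id>-<n>.<ext> file, or None."""
    name_pattern = re.compile(re.escape(file_id) + r"-(\d+)\.[^.]+$")
    search = os.path.join(glob.escape(folder), glob.escape(file_id) + "-*")
    best_path, best_rev = None, -1
    for path in glob.glob(search):
        match = name_pattern.match(os.path.basename(path))
        # Compare revisions as integers so that 10 sorts after 9.
        if match and int(match.group(1)) > best_rev:
            best_path, best_rev = path, int(match.group(1))
    return best_path
```

Uploading a new revision then only means writing a file with the next number; older revisions can be removed later, once nobody is downloading them.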