Cloud file storage with file tagging and search by tags/filename - amazon-s3

My project needs to meet the following requirements:
store a large amount of files for a reasonable price
tag individual files with custom tags
have an API method to search files by name (contains) and by tags (exact match)
do it all via the JS SDK (to keep the project serverless)
I did some work with Amazon S3, and it turned out that:
there is no search method in the JS SDK: http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#listObjectsV2-property
listObjectsV2 accepts a Prefix param (i.e. filename starts with), so there is no way to search by "contains"
there is no param to search by tag at all; I can only get the tags of an individual file with getObjectTagging (see the sketch below)
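For reference, this is roughly what the limitation looks like in code (a minimal sketch with the v2 JS SDK; the bucket name, prefix and tag values are made up):

```js
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function findByNameAndTag() {
  // listObjectsV2 only filters by "starts with" (Prefix), so any "contains"
  // matching has to happen client-side after listing everything under the prefix.
  const listed = await s3.listObjectsV2({
    Bucket: 'my-bucket',       // made-up bucket name
    Prefix: 'invoices/2023-',  // "starts with" only -- there is no "contains" param
  }).promise();

  const matches = listed.Contents.filter(o => o.Key.includes('draft'));
  if (matches.length === 0) return;

  // Tags can only be read per object, one extra request for each key:
  const tags = await s3.getObjectTagging({
    Bucket: 'my-bucket',
    Key: matches[0].Key,
  }).promise();
  console.log(tags.TagSet); // e.g. [{ Key: 'status', Value: 'draft' }]
}
```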
So the question is: what stable service can I use for file storage WITH the functionality described above?
Azure? Google Cloud? Backblaze B2? Something else?
Thanks!

If you use Azure Blob storage, you can use the Azure Search blob indexer to index both the metadata and the textual content of your blobs. For a walkthrough of setting this up, see "Build and query your first Azure Search index in the portal".
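Once the index exists, you can query it from JS over the REST API. Here is a minimal sketch (the service name, index name, API key and the "tags" field are placeholders; the actual field names depend on how your index is defined):

```js
const fetch = require('node-fetch');

async function searchBlobs(term, tag) {
  const url =
    'https://my-search-service.search.windows.net/indexes/blob-index/docs' +
    '?api-version=2020-06-30' +
    `&search=${encodeURIComponent(term)}` +                          // full-text search
    `&$filter=${encodeURIComponent(`tags/any(t: t eq '${tag}')`)}`;  // exact tag match

  const res = await fetch(url, {
    headers: { 'api-key': process.env.AZURE_SEARCH_KEY },
  });
  const body = await res.json();
  return body.value; // matching docs, e.g. metadata_storage_name holds the blob name
}
```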

Related

Is it possible to read a Google Drive folder (all files) as a BigQuery external data source?

I am using Google Drive as an external data source in BigQuery. I am able to access a single file, but unable to read a folder with multiple files.
Note:
I picked up the shareable link for the folder from Google Drive and used the "bq mk .." command referencing the link ID. It creates the table, but I am unable to pull data.
I've not tried it with Drive so I have no sense of how performant it is, but when defining an external table (or load job), you can specify the source data as a list of URIs. My suspicion is that it's not particularly scalable and may run into limits in Drive, as that's not a typical access pattern. Google Cloud Storage is a much more suitable data source for this kind of thing.
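For what it's worth, here is a rough sketch of defining such an external table with the Node.js client (the dataset/table names, source format and Drive URIs are placeholders; a Drive folder link can't be used directly, so each file has to be listed):

```js
const { BigQuery } = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function createExternalTable() {
  // One external table backed by an explicit list of file URIs.
  const [table] = await bigquery.dataset('my_dataset').createTable('drive_files', {
    externalDataConfiguration: {
      sourceFormat: 'CSV',
      autodetect: true,
      sourceUris: [
        'https://drive.google.com/open?id=FILE_ID_1',
        'https://drive.google.com/open?id=FILE_ID_2',
      ],
    },
  });
  console.log(`Created external table ${table.id}`);
}
```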

Have multiple names for same blob content

Let's say I have a file called foo.txt in my Azure Storage as a blob. Is it possible to create a link of sorts, or a redirect URL, so that I can access foo.txt's content even when I visit bar.txt?
Ideally I do not want to upload the same file content again for bar.txt, to avoid wasting space.
No, you can't. Azure Blob Storage is just simple object storage, not a full file system with soft links or hard links.
BTW, you may consider simulating the link feature following the answers here: Is there a way to do symbolic links to the blob data when using Azure Storage to avoid duplicate blobs? A rough sketch of that idea is below.
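The usual simulation (a minimal sketch, assuming the @azure/storage-blob Node SDK; the connection string, container name and "linkto" metadata key are made up) is to store bar.txt as a zero-byte blob whose metadata points at foo.txt, and have the client follow that pointer:

```js
const { BlobServiceClient } = require('@azure/storage-blob');

const service = BlobServiceClient.fromConnectionString(
  process.env.AZURE_STORAGE_CONNECTION_STRING
);
const container = service.getContainerClient('files');

// Create a zero-byte "link" blob that points at the real blob via metadata.
async function createLink(linkName, targetName) {
  await container
    .getBlockBlobClient(linkName)
    .upload('', 0, { metadata: { linkto: targetName } });
}

// Download a blob, transparently following the "linkto" pointer if present.
async function download(name) {
  const blob = container.getBlockBlobClient(name);
  const props = await blob.getProperties();
  const target = props.metadata && props.metadata.linkto;
  if (target) {
    return container.getBlockBlobClient(target).downloadToBuffer();
  }
  return blob.downloadToBuffer();
}
```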

How to search a string in Amazon S3 files?

I have all types of files and documents stored in Amazon S3.
How do I perform a search on those documents using a search keyword or string (full-text search, if possible)?
Is there anything like Documentum built on top of it?
The list of matching documents that contain the search string will be displayed to the user for download.
Any help please?
Searching documents in S3 is not possible.
S3 is not a document database. It is an object store, designed for storing data but inferring no "meaning" from the data -- essentially a key/value store supporting very large values. It has no sense of context. It doesn't index the content of the objects, or even the object metadata. The only way to "find" an object in S3 is to already know its key.
It is excellent for highly available and highly reliable storage, but searching is not part of its design.
The solution depends on how structured your S3 file data is.
If it is structured or semi-structured, such as CSV or JSON in a columnar-like format, AWS Athena will be the best choice. With just a few clicks, you're ready to query your S3 files (see the sketch below).
Otherwise, if the data is totally unstructured, you may want to use Elasticsearch or something similar.
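A rough sketch of the Athena route from the JS SDK (the database, table and results bucket are placeholders, and a table over your S3 prefix has to be defined first, e.g. with a CREATE EXTERNAL TABLE statement):

```js
const AWS = require('aws-sdk');
const athena = new AWS.Athena();

async function searchDocuments(keyword) {
  // Kick off a query that scans the external table defined over the S3 data.
  // The keyword is interpolated directly here for brevity; escape it in real code.
  const { QueryExecutionId } = await athena.startQueryExecution({
    QueryString: `SELECT * FROM documents WHERE body LIKE '%${keyword}%'`,
    QueryExecutionContext: { Database: 'my_db' },
    ResultConfiguration: { OutputLocation: 's3://my-athena-results/' },
  }).promise();

  // Poll until the query finishes (simplified; real code should back off
  // and handle FAILED/CANCELLED states).
  let state = 'QUEUED';
  while (state === 'QUEUED' || state === 'RUNNING') {
    await new Promise(resolve => setTimeout(resolve, 1000));
    const { QueryExecution } = await athena
      .getQueryExecution({ QueryExecutionId })
      .promise();
    state = QueryExecution.Status.State;
  }

  const results = await athena.getQueryResults({ QueryExecutionId }).promise();
  return results.ResultSet.Rows;
}
```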
You can't search the way you want in Amazon S3 itself,
but there is an alternative: I use the S3 Browser software for this.
Here is the link to download it: http://s3browser.com/
Download it and you will have the same access as in the Amazon S3 console; you can also perform search and other operations.

jclouds / amazon s3 / rackspace cloudfiles / windows azure storage - appending content to an existing blob

I have a scenario in which multiple lines of text are to be appended to an existing text file... Is it possible to do this using jclouds? (That would be ideal for me, as jclouds supports a lot of cloud providers.)
Even if this is not doable using jclouds, does the native API of Amazon S3 / Rackspace Cloud Files / Azure Storage support appending content to existing blobs?
If this is doable, then kindly point me to good working examples which show the same...
This is not possible in the underlying blob stores I know of.

How do services like Dropbox implement delta encoding if their files are stored in the cloud?

Dropbox claims that during syncing only the portion of a file that changed is transmitted back to the main server, which is obviously great functionality. But how do they apply those changes to files stored in the Amazon S3 cloud? For example, let's say a 30-page document on a user's desktop contains changes only to page 4. Dropbox syncs the blocks representing the changes, but what happens on the backend when the files they store are in the cloud? Does that mean they have to download the 30-page document from S3 to their server, replace the blocks representing page 4, and then upload it back to the cloud? I doubt this is the case because it would be rather inefficient. The other option I can think of is that Amazon S3 might support updating a stored file by byte range, so that, for example, a PUT request to file X for bytes 100-200 replaces those bytes with the body of the PUT request. So I am curious how companies that use cloud services such as Amazon implement this type of syncing.
Thanks
As S3 and similar storages don't offer filesystem capabilities, anything that pretends to store files and directories needs to emulate a file system. When doing this, files are often split into pages of a certain size, where each page is stored as a separate object in the storage. This way a changed block requires uploading only one page (for example) and not the whole file. I should note that with files like office documents this approach can break down if the file size changes - for example, if you insert a page at the beginning or delete a page, then the whole file changes and the complete file would need to be re-uploaded. We didn't analyze how Dropbox in particular does its job; I have just described the common scenario. There also exist different "patch algorithms", where a patch can be created locally (if Dropbox has an older local copy in the cache) and then applied to one or more blocks on the server. A minimal sketch of the paging idea is below.
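This is just an illustration of that page/block scheme (the block size, manifest handling and the uploadBlock callback are all hypothetical, not how Dropbox actually does it):

```js
const crypto = require('crypto');
const fs = require('fs');

const BLOCK_SIZE = 4 * 1024 * 1024; // hypothetical 4 MiB pages

// Split a file into fixed-size blocks and hash each one.
function hashBlocks(filePath) {
  const data = fs.readFileSync(filePath);
  const hashes = [];
  for (let offset = 0; offset < data.length; offset += BLOCK_SIZE) {
    const block = data.slice(offset, offset + BLOCK_SIZE);
    hashes.push(crypto.createHash('sha256').update(block).digest('hex'));
  }
  return { data, hashes };
}

// Compare against the manifest from the previous sync and upload only the pages
// whose hashes changed; uploadBlock(index, buffer) is supplied by the caller
// (e.g. a PUT of the object "<file>/<index>" to the blob store).
async function syncFile(filePath, previousManifest, uploadBlock) {
  const { data, hashes } = hashBlocks(filePath);
  for (let i = 0; i < hashes.length; i++) {
    if (previousManifest[i] !== hashes[i]) {
      await uploadBlock(i, data.slice(i * BLOCK_SIZE, (i + 1) * BLOCK_SIZE));
    }
  }
  return hashes; // the new manifest, stored alongside the file for the next sync
}
```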
There are several synchronizing tools which transfer deltas over the wire, like rsync, rdiff, rdiff-backup, etc. For bi-directional synchronizing with S3 there are paid services, like s3rsync, for example. For pure client-side synchronizing, tools like zsync can be considered (which is what many people employ to roll out app updates).
An alternative approach would be to tar-ball a directory, generate a delta file (using rdiff or xdelta3), and upload the delta file using a timestamp as part of the key. In order to sync, all you need to do is perform these 2 checks client-side:
You have all the delta files from S3. If not, pull them and apply them to generate the latest backup state.
Your last backup state corresponds to your current directory. If not, generate a new delta file and push it to S3.
The concern here would be the at-least-100% additional space utilization on the client side. But this approach will help you revert changes if needed. A rough sketch of these two checks is below.
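A sketch of that loop, with every storage- and diff-related helper passed in as a hypothetical callback (e.g. thin wrappers around the S3 SDK and rdiff/xdelta3):

```js
// deps: { listRemoteDeltas, isAppliedLocally, applyDelta, lastBackupState,
//         createDelta, uploadDelta } -- all hypothetical callbacks supplied by the caller.
async function sync(localDir, deps) {
  // Check 1: make sure every delta stored in S3 has been applied locally.
  const remoteDeltas = await deps.listRemoteDeltas(); // keys sorted by timestamp
  for (const key of remoteDeltas) {
    if (!(await deps.isAppliedLocally(key))) {
      await deps.applyDelta(localDir, key);
    }
  }

  // Check 2: if the directory changed since the last backup state, push a new delta.
  const delta = await deps.createDelta(localDir, await deps.lastBackupState(localDir));
  if (delta) {
    await deps.uploadDelta(delta, `deltas/${Date.now()}.xdelta`); // timestamp in the key
  }
}
```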