I have all types of files or documents stored in Amazon S3.
How can I perform a search on those documents using a keyword or string (full-text search, if possible)?
Is there any document management tool (like Documentum) built on top of it?
The list of matching documents that contain the search string should be displayed to the user for download.
Any help, please?
Searching documents in S3 is not possible.
S3 is not a document database. It is an object store, designed for storing data but inferring no "meaning" from the data -- essentially a key/value store supporting very large values. It has no sense of context. It doesn't index the content of the objects, or even the object metadata. The only way to "find" an object in S3 is to already know its key.
It is excellent for highly available and highly reliable storage, but searching is not part of its design.
The solution depends on how structured your S3 data is.
If it is structured or semi-structured, such as CSV or JSON in a columnar-like format, AWS Athena is the best choice. With just a few clicks, you're ready to query your S3 files.
Otherwise, if the data is completely unstructured, you may want to use Elasticsearch or a similar search engine.
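If Athena fits, a minimal sketch of running such a query from the AWS JavaScript SDK could look like the following; the database, table, column and results bucket names are assumptions, and the table is assumed to have already been defined over the S3 data (e.g. via CREATE EXTERNAL TABLE or a Glue crawler).

```typescript
// Hypothetical sketch: keyword search over S3 data with Athena.
// "my_database", "documents", "body" and the results bucket are placeholders.
import AWS from "aws-sdk";

const athena = new AWS.Athena({ region: "us-east-1" });

async function searchDocuments(keyword: string): Promise<string> {
  const result = await athena
    .startQueryExecution({
      // Sketch only: escape or parameterize the keyword in real code.
      QueryString: `SELECT * FROM documents WHERE body LIKE '%${keyword}%'`,
      QueryExecutionContext: { Database: "my_database" },
      ResultConfiguration: { OutputLocation: "s3://my-athena-results/" },
    })
    .promise();
  // Poll getQueryExecution / getQueryResults with this id to read the rows back.
  return result.QueryExecutionId!;
}
```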
You can't search the way you want in Amazon S3.
But there is an alternative solution: I use the S3 Browser software for this.
Here is the link to download it: http://s3browser.com/
Download it and you will have the same access as in the Amazon S3 console; you can also perform searches and other operations.
My project needs to meet the following requirements:
store large amount of files for reasonable price
tag individual files with custom tags
have API method to search files by name (contains) and tags (exact)
do it all via JS SDK (keep project serverless)
I did some work with Amazon S3 and it turned out that:
there is no search method in the JS SDK: http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#listObjectsV2-property
listObjectsV2 only accepts a Prefix parameter (i.e. the key starts with), so there is no way to find by "contains"
there is no parameter to search by tag at all; I can only get tags for an individual file with getObjectTagging (a rough sketch of what that leaves you with follows below)
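In other words, with the JS SDK alone you can only list every object, filter the names client-side, and fetch tags object by object; a hypothetical sketch of that (bucket, tag names and values are made up):

```typescript
// Rough sketch of the client-side workaround: list everything, filter "contains"
// in code, then fetch tags per object. Bucket and tag names are placeholders.
import AWS from "aws-sdk";

const s3 = new AWS.S3();
const Bucket = "my-files-bucket";

async function findFiles(nameContains: string, tagKey: string, tagValue: string) {
  const matches: string[] = [];
  let ContinuationToken: string | undefined;

  do {
    const page = await s3.listObjectsV2({ Bucket, ContinuationToken }).promise();

    for (const obj of page.Contents ?? []) {
      const key = obj.Key!;
      if (!key.includes(nameContains)) continue; // "contains" filter, client-side

      const tagging = await s3.getObjectTagging({ Bucket, Key: key }).promise();
      const hit = tagging.TagSet.some(
        (t) => t.Key === tagKey && t.Value === tagValue // exact tag match
      );
      if (hit) matches.push(key);
    }

    ContinuationToken = page.NextContinuationToken;
  } while (ContinuationToken);

  return matches; // every object gets listed and tag-checked: slow and costly at scale
}
```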
So the question is: what stable service can I use for file storage WITH the functionality described above?
Azure? Google Cloud? Backblaze B2? something else?
thanks!
If you use Azure blob storage, you can use Azure Search blob indexer to index both the metadata and textual content of your blobs. For a walkthrough of setting this up, see Build and query your first Azure Search index in the portal.
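Once the indexer has populated the index, a minimal query sketch from JavaScript/TypeScript could look like the following; the service endpoint, index name and key are placeholders, and @azure/search-documents is assumed as the client library.

```typescript
// Hypothetical sketch: querying an Azure Search index filled by the blob indexer.
// Endpoint, API key and index name are placeholders.
import { SearchClient, AzureKeyCredential } from "@azure/search-documents";

const client = new SearchClient<Record<string, unknown>>(
  "https://my-service.search.windows.net",
  "blob-index",
  new AzureKeyCredential("<query-key>")
);

async function search(keyword: string) {
  // Full-text search over the indexed blob content and metadata.
  const results = await client.search(keyword);
  for await (const result of results.results) {
    console.log(result.document); // e.g. content, metadata_storage_path, ...
  }
}
```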
I would like to ask for your help, since I read the Amazon guide about using query string parameters in URLs to request content stored in buckets, and it was not clear to me.
I am planning to use a bucket to store media content, and to use a query string parameter for the different versions of a media file. So, if I have an image, I can create the original version, the small version, the large version, and so on. Then I can request the different versions on my website, based on my needs.
But I did not understand how this is all managed. Do all the versions of the file have the same file name? And when using a PowerShell script to upload the files to the bucket, how do I specify which version I am uploading?
Thank you.
I'm using MarkLogic 8.0.6 and we also have JSON documents in it. I need to extract a lot of data from MarkLogic and store it in AWS S3. We tried to run mlcp locally and then upload the data to AWS S3, but it's very slow because it generates a lot of files.
Our MarkLogic platform is already connected to S3 to perform backups. Is there a way to extract a specific database to AWS S3?
It would be OK for me to have one big file with one JSON document per line.
Thanks,
Romain.
I don't know about getting it to s3, but you can use CORB2 to extract MarkLogic documents to one big file with one JSON document per line.
s3:// is a natively supported file path in MarkLogic, so you can also iterate through all your documents and export them with xdmp:save("s3://...").
If you want to make aggregates, then you may want to marry this idea with Sam's suggestion of CORB2 to control the process and assist in grouping your whole database into multiple manageable aggregate documents, then use a post-batch task to run xdmp:save.
Thanks guys for your answers. I did not know about CORB2; this is a great solution! But unfortunately, due to bad I/O, I would prefer a solution that writes directly to S3.
I can use a basic ML query and dump to s3:// with the native connector, but I always hit a memory error, even when launching with the "spawn" function to generate a background process.
Do you have any XQuery example that extracts each document to S3 one by one without memory errors?
Thanks
I'll have to store millions of files (many TB in the future) in S3.
Are there any limitations? (Not price :) ) I'm asking about architectural limitations (like: don't store it this way, the other way will be better/faster).
My files are in a hierarchy
/{country}/{number}/{code}/docs
and I checked that I can keep them that way (to access them easily through REST)
(of course I know S3 keeps them internally in another way; that is not important to me).
So, are there any limitations/pitfalls ?
S3 has no limits that you would hit. The files are not really in folders; the keys are just strings used as locations. Make the folder structure something that is easy for you to keep track of and organize.
You do NOT want to be listing the "folder" contents in S3 to find things.
S3 is slow at giving directory listings, because it's not really directories.
You should either store the whole path /{country}/{number}/{code}/docs in a database, or the key-building logic should be so repeatable that you can be confident the file will be at that location.
James Brady gave an excellent and very detailed answer on how S3 treats file storage in a question here: https://stackoverflow.com/a/394505/4179009
AWS S3 definitely does have access limits, around 100 requests/sec in the case of a similar path prefix; see the official docs: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
On the other hand, a hierarchical approach makes the logic more complicated. The trade-off depends on your requirements; one good option is to put a key of at least 4 characters (a primary id or hash key) at the front of the URL. If you have a limited number of countries, try using multiple buckets with the country code as the bucket name; that also helps to define a specific physical location if required.
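As an illustration of the short-prefix idea, a hypothetical key scheme could hash the document's path and put the first few characters in front of the hierarchical part (all names below are made up):

```typescript
// Hypothetical sketch of the short hash-prefix key scheme suggested above.
// The readable hierarchy is kept after the prefix; all names are placeholders.
import { createHash } from "crypto";

function objectKey(country: string, num: string, code: string, doc: string): string {
  const path = `${country}/${num}/${code}/docs/${doc}`;
  // First 4 hex chars of the hash spread keys evenly across key-name prefixes.
  const prefix = createHash("md5").update(path).digest("hex").slice(0, 4);
  return `${prefix}/${path}`; // e.g. "3f9a/pl/12345/ABC/docs/contract.pdf"
}
```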
I have a scenario in which multiple lines of text are to be appended to an existing text file... Is it possible to do this using Jclouds? (That would be ideal for me as jclouds supports a lot of cloud providers)...
Even if this is not doable using jclouds, does the native API of Amazon S3/Rackspace Cloudfiles/Azure storage support appending content to existing blobs?
If this is doable, then kindly point me to good working examples that demonstrate it...
This is not possible in the underlying blob stores I know of: objects/blobs are immutable once written, so there is no append operation.
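The usual workaround is a read-modify-write cycle; a rough S3 sketch of it (the whole object is re-uploaded each time, so it does not scale to large files) might look like this:

```typescript
// Rough sketch of the read-modify-write workaround for "appending" to an S3 object.
// There is no atomic append: the existing object is downloaded and re-uploaded.
import AWS from "aws-sdk";

const s3 = new AWS.S3();

async function appendLines(Bucket: string, Key: string, lines: string[]) {
  let existing = "";
  try {
    const current = await s3.getObject({ Bucket, Key }).promise();
    existing = (current.Body as Buffer).toString("utf-8"); // Body is a Buffer in Node.js
  } catch (err: any) {
    if (err.code !== "NoSuchKey") throw err; // start from empty if the object is missing
  }

  const updated = existing + lines.join("\n") + "\n";
  await s3.putObject({ Bucket, Key, Body: updated }).promise();
}
```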