How to save millions of files in S3 so that arbitrary future searches on key/path values are fast - amazon-s3

My company has millions of files in an S3 bucket, and every so often I have to search for files whose keys/paths contain some text. This is an extremely slow process because I have to iterate through all files.
I can't use prefix because the text of interest is not always at the beginning. I see other posts (here and here) that say this is a known limitation in S3's API. These posts are from over 3 years ago, so my first question is: does this limitation still exist?
Assuming the answer is yes, my next question is, given that I anticipate arbitrary regex-like searches over millions of S3 files, are there established best practices for workarounds? I've seen some people say that you can store the key names in a relational database, Elasticsearch, or a flat file. Are any of these approaches more common place than others?
Also, out of curiosity, why hasn't S3 supported such a basic use case in a service (S3) that is such an established core product of the overall AWS platform? I've noticed that GCS on Google Cloud has a similar limitation. Is it just really hard to do searches on key name strings well at scale?

S3 is an object store, conceptually similar to a file system. I'd never try to make a database-like environment based on file names in a file system nor would I in S3.
Nevertheless, if this is what you have then I would start by running code to get all of the current file names into a database of some sort. DynamoDB cannot query by regular expression but any of PostgreSQL, MySQL, Aurora, and ElasticSearch can. So start with listing every file and put the file name and S3 location into a database-like structure. Then, create a Lambda that is notified of any changes (see this link for more info) that will do the appropriate thing with your backing store when a file is added or deleted.
Depending on your needs ElasticSearch is super flexible with queries and possibly better suited for these types of queries. But traditional relational database can be made to work too.
Lastly, you'll need an interface to the backing store to query. That will likely require some sort of server. That could be a simple as API gateway to a Lambda or something far more complex.

You might consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file containing a list of all objects in the bucket.
You could then load this file into a database, or even write a script to parse it. Or possibly even just play with it in Excel.

Related

DynamoDb + S3 + CloudSearch + Redis

I'm currently creating a scheme for my application and I'm wondering if my thinking is right
Example : Ecommerce site
In DynamoDb, I would put products ( product_id, meta-data link to S3)
S3, i would use it for storing Search Data Format (SDF/JSON)
(Product name, product description, price, ...etc )
Amazon CloudSearch would be used to index documents in S3, and to be able to search them.
Redis would be used to cache results
Is my scheme right? Can s3 be a good "database" ?
Is DynamoDb here even needed?
If you are thinking that S3 would just be the source of record for your products and you are not expecting heavy reads/writes, then it COULD work, but you have to recognize that it will be far far slower than using a real database. Not just 1-2x slower but MANY magnitudes slower. We use S3 for storing audit data for realtime data stored in Postgres - works a charm, but this is data that is written once and read rarely. Retrieval times when it does have to retrieve audit records is > 50ms. This type of speed is usually not acceptable when you need to manipulate multiple records at one time.
If you are going to be using dynamoDB anyway, why not just use that to store what you'd be storing on s3? Trying to adhere to the concept of keep it simple, I would use the following stack:
dynamoDB to be the system of record and to do some searches
Cloudsearch for more flexible searching than what dynamodb can
provide
S3 for static files (product images, etc.)
And again, to keep things simple, Skip Redis for caching if you are already using dynamoDB and don't plan on using any of Redis' specialized dastatypes - ie: your caching will be nothing more than keys to strings, etc. Use Redis if you plan on taking advantage of its other datatypes or if you want to have a cache closer to your app - ie: you plan on using Redis on the webserver.
Dynamo is used for storing write-extensive data. If your application does not require extensive writes over product_id and meta-data, I think RDS/MySQL is better.
When designing an application, you really should keep things as simple as possible from the beginning. It will always get worse with time :)
S3 is not a good DB. It has not been designed for this and is too slow. It's for file storage only. If you want to stick with DynamoDB, you should put all your products info in it, including the metadata.
CloudSearch may be a good option. You can also build you own "indexes" on top of DynamoDB. It requires more design and programming but might be worth considering. Here is a link to an excellent blog-post on this matter: http://blog.coredumped.org/2012/01/amazon-dynamodb.html.
So,
Is DynamoDB even needed: Yes, or RDS, Mongo,... any real DB depending on you needs.
Is S3 a good DB: I don't think so.

How can I organize a million+ files now that I'm moving to Amazon S3?

Well I'm getting booted from my shared host and I'm switching over to a combination of a VPS from Linode and Amazon S3 to host a few million jpegs.
My big worry is keeping some kind of sanity with all these images. Is there any hope of that? My understanding is you're only allowed 100 "buckets" and "buckets" are the only type of structure within S3.
Is putting a few million files in a bucket something you'd advise against?
You may notice in Bucket Restrictions and Limitations, it is stated:
There is no limit to the number of objects that can be stored in a bucket
My experience is that a very large number of objects in a single bucket will not affect the performance of getting a single object by its key (that is, get appears to be of constant complexity).
Having a very large number of object also does not affect the speed of listing a given number of objects:
List performance is not substantially affected by the total number of keys in your bucket
However, I must warn you, that most S3 management tools I've used (like S3Fox) will choke and die a horrible slow death when attempting to access a bucket with a very large number of objects. One tool that seems to cope well with very large numbers of objects is S3 Browser (they have a free version and a Pro version, I am not affiliated with them in any way).
Using "folders" or prefixes, does not change any of these points (get and listing a given number of objects are still constant, most tools still fall over themselves and hang).

BLOB's in SQL that stores a Video file

I am hoping someone can explain how to use BLOBs. I see that BLOBs can be used to store video files. My question is why would a person store a video file in a BLOB in a SQL database? What are the advantages and disadvantages compared to storing pointers to the location of the video file?
A few different reasons.
If you store a pointer to a file on disk (presumably using the BFILE data type), you have to ensure that your database is updated whenever files are moved, renamed, or deleted on disk. It's relatively common when you store data like this that over time your database gets out of sync with the file system and you end up with broken links and orphaned content.
If you store a pointer to a file on disk, you cannot use transactional semantics when you're dealing with multimedia. Since you can't do something like issue a rollback against a file system, you either have to deal with the fact that you're going to have situations where the data on the file system doesn't match the data in the database (i.e. someone uploaded a video to the file system but the transaction that created the author and title in the database failed or vice versa) or you have to add additional steps to the file upload to simulate transactional semantics (i.e. upload a second <>_done.txt file that just contains the number of bytes in the actual file that was uploaded. That's cumbersome and error-prone and may create usability issues.
For many applications, having the database serve up data is the easiest way to provide it to a user. If you want to avoid giving a user a direct FTP URL to your files because they could use that to bypass some application-level security, the easiest option is to have a database-backed application where to retrieve the data, the database reads it from the file system and then returns it to the middle tier which then sends the data to the client. If you're going to have to read the data into the database every time the data is retrieved, it often makes more sense to just store the data directly in the database and to let the database read it from its data files when the user asks for it.
Finally, databases like Oracle provide additional utilities for working with multimedia data in the database. Oracle interMedia, for example, provides a rich set of objects to interact with video data stored in the database-- you can easily tag where scenes begin or end, tag where various subjects are discussed, when the video was recorded, who recorded it, etc. And you can integrate that search functionality with searches against all your relational data. Of course, you could write an application on top of the database that did all those things as well but then you're either writing a lot of code or using another framework in your app. It's often much easier to leverage the database functionality.
Take a read of this : http://www.oracle.com/us/products/database/options/spatial/039950.pdf
(obviously a biased view, but does have a few cons (that have now been fixed by the advent of 11g)

riak backup solution for a single bucket

What are your recommendations for solutions that allow backing up [either by streaming or snapshot] a single riak bucket to a file?
Backing up just a single bucket is going to be a difficult operation in Riak.
All of the solutions will boil down to the following two steps:
List all of the objects in the bucket. This is the tricky part, since there is no "manifest" or a list of contents of any bucket, anywhere in the Riak cluster.
Issue a GET to each one of those objects from the list above, and write it to a backup file. This part is generally easy, though for maximum performance you want to make sure you're issuing those GETs in parallel, in a multithreaded fashion, and using some sort of connection pooling.
As far as listing all of the objects, you have one of three choices.
One is to do a Streaming List Keys operation on the bucket via HTTP (e.g. /buckets/bucket/keys?keys=stream) or Protocol Buffers -- see http://docs.basho.com/riak/latest/dev/references/http/list-keys/ and http://docs.basho.com/riak/latest/dev/references/protocol-buffers/list-keys/ for details. Under no circumstances should you do a non-streaming regular List Keys operation. (It will hang your whole cluster, and will eventually either time out or crash once the number of keys grows large enough).
Two is to issue a Secondary Index (2i) query to get that object list. See http://docs.basho.com/riak/latest/dev/using/2i/ for discussion and caveats.
And three would be if you're using Riak Search and can retrieve all of the objects via a single paginated search query. (However, Riak Search has a query result limit of 10,000 results, so, this approach is far from ideal).
For an example of a standalone app that can backup a single bucket, take a look at Riak Data Migrator, an experimental Java app that uses the Streaming List Keys approach combined with efficient parallel GETs.
The Basho function contrib has an erlang solution for backing up a single bucket. It is a custom function but it should do the trick.
http://contrib.basho.com/bucket_exporter.html
As far as I know, there's no automated solution to backup a single bucket in Riak. You'd have to use the riak-admin command line tool to take care of backing up a single physical node. You could write something to retrieve all keys in a single bucket and using low r values if you want it to be fast but not secure (r = 1).
Buckets are a logical namespace, all of the keys are stored in the same bitcask structure. That's why the only way to get just a single node is to write a tool to stream them yourself.

Max files per directory on Amazon S3

I'm currently storing ~3 million images files in a single directory on my server, which is causing serious performance issues. I'd like to move them to Amazon S3 and I'm wondering whether I'd need to use a hierarchical folder structure or whether I can store them in a single folder on S3.
I get a large percentage of my traffic from google image search and I don't want to hurt my SEO by changing the image path, so a single folder on S3 would be ideal if there aren't any performance issues. I imagine LIST operations would be slow, but I'm okay with that.
S3 has no limit on the number of items stored in a bucket. In fact, using a 'directory' separator in key names is completely optional.
There is a practical use for using a separator in your key: as you correctly guessed listing the keys will be more difficult as you'll have to page though many list results.
However, as the S3 documentation points out you can use any character as a separator.