GemFire for storing BLOBs

We have a requirement where we want to store BLOB files in GemFire. The estimated size of this region would be in TBs because of the BLOB files.
We are planning to analyze two approaches. Please suggest:
1) Create the GemFire region with an overflow configuration. This should keep only the keys in memory, with the actual data overflowed to a GemFire disk store.
This will also help to control the GemFire region size.
The size of the GemFire disk store, however, would be huge. Is this option feasible?
2) Store the files on the server disks, with the file paths stored in a GemFire region.
Use the file paths to access/update the files directly from the client.
Any other suggested approaches for such a requirement?

To me, option 2 seems brittle. You would have to do a lot of work to provide HA and to handle disk failures, I/O exceptions, etc. GemFire takes care of all of these things if you use option 1.
What is the size of the blobs you are looking to store? GemFire cannot store a single value larger than 2 GB. Handling terabytes of data in a region should not be a problem.
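For reference, option 1 can be sketched in `cache.xml` roughly like this (the region name, disk-store name, directory, and entry limit are placeholder assumptions, not values from the question):

```xml
<cache>
  <!-- Disk store backing the overflowed BLOB values (path is a placeholder) -->
  <disk-store name="blobDiskStore">
    <disk-dirs>
      <disk-dir>/data/gemfire/blobs</disk-dir>
    </disk-dirs>
  </disk-store>
  <region name="blobRegion">
    <region-attributes data-policy="partition" disk-store-name="blobDiskStore">
      <!-- Keep at most 1000 values in memory; older values overflow to disk.
           Keys stay in memory either way. -->
      <eviction-attributes>
        <lru-entry-count maximum="1000" action="overflow-to-disk"/>
      </eviction-attributes>
    </region-attributes>
  </region>
</cache>
```

With `action="overflow-to-disk"`, only the values are evicted; reads of overflowed entries transparently fault them back in from the disk store.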

Related

For Ignite Persistence, how to control maximum data size on disk?

How can I limit maximum size on disk when using Ignite Persistence? For example, my data set in a database is 5TB. I want to cache maximum of 50GB of data in memory with no more than 500GB on disk. Any reasonable eviction policy like LRU for on-disk data will work for me. Parameter maxSize controls in-memory size and I will set it to 50GB. What should I do to limit my disk storage to 500GB then? Looking for something like maxPersistentSize and not finding it there.
Thank you
There is no direct parameter to limit the total disk usage occupied by the data itself. As you mentioned in the question, you can control in-memory region allocation, but when a data region is full, data pages are flushed to and loaded from disk on demand; this process is called page replacement.
On the other hand, page eviction works only for a non-persistent cluster, preventing it from running out of memory. Personally, I can't see how or why that kind of eviction would be implemented for data stored on disk. I'm almost sure that other "normal" DBs like Postgres or MySQL do not have this option either.
I suppose you might check the following options:
You can limit the WAL and WAL archive max sizes. Though these are rather utility items, they can still occupy a lot of space [1].
Check if it's possible to use expiry policies on your data items; in that case, data will be cleared from disk as well [2].
Use monitoring metrics and configure alerting to be aware if you are close to the disk limits.
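For the WAL-related knobs from the first bullet, a Spring XML sketch might look like this (the 10 GB and 50 GB figures are placeholders; `maxWalArchiveSize` is the property that caps the WAL archive in recent Ignite versions):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
  <property name="dataStorageConfiguration">
    <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
      <!-- Cap the WAL archive (placeholder: 10 GB) -->
      <property name="maxWalArchiveSize" value="#{10L * 1024 * 1024 * 1024}"/>
      <property name="defaultDataRegionConfiguration">
        <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
          <property name="persistenceEnabled" value="true"/>
          <!-- The 50 GB in-memory cap from the question -->
          <property name="maxSize" value="#{50L * 1024 * 1024 * 1024}"/>
        </bean>
      </property>
    </bean>
  </property>
</bean>
```

Note this caps the WAL archive and the in-memory region, not the total size of the persisted data itself, which (as the answer says) has no direct limit.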

Is there (still) an advantage to staging data on Google Cloud Storage before loading into BigQuery?

I have a data set stored as a local file (~100 GB uncompressed JSON, could still be compressed) that I would like to ingest into BigQuery (i.e. store it there).
Certain guides (for example, https://www.oreilly.com/library/view/google-bigquery-the/9781492044451/ch04.html) suggest to first upload this data to Google Cloud Storage before loading it from there into BigQuery.
Is there an advantage in doing this, over just loading it directly from the local source into BigQuery (using bq load on a local file)? It's been suggested in a few places that this might speed up loading or make it more reliable (Google Bigquery load data with local file size limit, most reliable format for large bigquery load jobs), but I'm unsure whether that's still the case today. For example, according to its documentation, BigQuery supports resumable uploads to improve reliability (https://cloud.google.com/bigquery/docs/loading-data-local#resumable), although I don't know if those are used when using bq load. The only limitation I could find that still holds true is that the size of a compressed JSON file is limited to 4 GB (https://cloud.google.com/bigquery/quotas#load_jobs).
Yes, having the data in Cloud Storage is a big advantage during development. In my case, I often create a BigQuery table from data in Cloud Storage multiple times until I have tuned everything: schema, model, partitioning, resolving errors, etc. It would be really time-consuming to upload the data every time.
Cloud Storage to BigQuery
Pros
loading data is incredibly fast
possible to delete the BQ table when not in use and import it again when needed (a BQ table is much bigger than the plain, possibly compressed, data in Cloud Storage)
you save your local storage
less likely to fail during table creation (loading from local storage can run into networking issues, computer issues, etc.)
Cons
you pay some additional cost for storage (if you do not plan to touch your data often, e.g. once per month, you can reduce the price by using Nearline storage)
So I would go for storing the data in Cloud Storage first, but of course it depends on your use case.
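The two-step flow recommended above looks roughly like this on the command line (bucket, dataset, table, and file names are placeholders; newline-delimited JSON and a schema file are assumed):

```sh
# Stage the file in Cloud Storage once...
gsutil cp data.json.gz gs://my-staging-bucket/data.json.gz

# ...then (re)create the table from it as many times as needed
bq load --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.mytable gs://my-staging-bucket/data.json.gz ./schema.json
```

Re-running only the second command while iterating on the schema is what makes the staged copy so convenient.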

Which storage is good for read performance

I have a custom data file. I read this file at high speed on my local computer; the average read takes 0.5 ms in my tests (simple read operations with seeking). I want to do the same operations on Azure. I tried to use Blob Storage with the following steps:
Create cloud storage account
Create blob client
Get container
Get blob reference
OpenRead stream
These steps take approximately 10-15 seconds. It's a read-only file. What can I do to increase reading performance? What is the best storage for a large number of read operations? Reading speed is the most important factor for me. I do not want to keep the data file with a web/worker role; it must be in cloud storage.
You would have to analyze your access patterns to debug this issue further. For example, OpenRead gives you a stream that is easy to work with, but its read-ahead buffering strategy might not be optimal if you are seeking within the file. By default, the stream will buffer 4MB at a time, but it has to discard this buffer if the caller seeks beyond that 4MB range. Depending on how much you read after each seek, you might want to reduce the read-ahead buffer size or use DownloadRangeToStream API directly. Or, if your blob is small enough, you can download it in one shot using DownloadToStream API and then handle it in memory.
I would recommend using Fiddler to watch what requests your application makes to Azure Storage and see whether that is the best approach for your scenario. If you see that each individual request is taking a long time, you can enable Azure Storage Analytics to analyze the E2E latency and Server latency for those requests. Please refer to the Monitor, diagnose, and troubleshoot Microsoft Azure Storage article for more information on how to interpret Analytics data.
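To make the buffering trade-off concrete, here is a rough worst-case model of bytes pulled over the wire (the 4 MB default read-ahead comes from the answer above; the seek count and read size are made-up illustrative numbers):

```python
MiB = 1024 * 1024

def bytes_transferred(num_seeks: int, bytes_per_read: int, buffer_size: int) -> int:
    """Worst case: every seek lands outside the current buffer, so the
    stream discards it and refills buffer_size bytes over the wire."""
    return num_seeks * max(buffer_size, bytes_per_read)

# 100 seeks, each actually needing only 64 KiB of data:
default_buffer = bytes_transferred(100, 64 * 1024, 4 * MiB)    # 400 MiB downloaded
ranged_reads = bytes_transferred(100, 64 * 1024, 64 * 1024)    # 6.25 MiB downloaded
```

Under these assumptions the default read-ahead pulls about 64x more data than ranged reads sized to what is actually consumed, which is why shrinking the buffer or using DownloadRangeToStream can help seek-heavy workloads.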

S3 or EBS for storing data in flat files

I have flat files in which I store and retrieve data instead of using a database. This is temporary and may last for a couple of months. I was wondering if I should be using EBS or S3. EBS is mainly used for I/O and S3 for content delivery, but S3 is a pay-as-you-go model while with EBS you pay for the volume you provision?
Please advise: which one is better?
S3 sounds like it's more appropriate for your use case.
S3 is object storage. Think of it as an Amazon-run file server. (Objects are not exactly equal to files, but it's close enough here.) You tell S3 to put a file, it'll store it. You tell S3 to get a file, it'll return it. You tell S3 to delete it, it's gone. This is easy to work with and very scalable.
EBS is block storage. Think of it as an Amazon-run external hard drive. You can plug an EBS volume into an EC2 virtual machine, or access it over the Internet via AWS Storage Gateway. Like an external hard drive, you can only plug it into one computer at a time. The size is set up front, and while there are ways to grow and shrink it, you're paying for all the bits all the time. It's also much more complex than S3, since it has to provide strong consistency guarantees for the entire volume, instead of just on a file-by-file basis.
To build on the good answer from willglynn: if you are interacting with the data regularly, or need more file-system-like access, you might consider EBS more strongly.
If the amount of data is relatively small and you read and write to the data store regularly, you might consider something like ElastiCache for in-memory storage, which would likely be superior performance-wise to using S3 or EBS.
Similarly, you might look at DynamoDb for document type storage, especially if you need to be able to search/filter across your data objects.
Point 1) You can use either S3 or EBS for this. If you want reduced latency and the file sizes are bigger, then EBS is the better option.
Point 2) If you want lower costs, then S3 is a better option.
From what you describe, S3 will be the most cost-effective and likely easiest solution.
Pros to S3:
1. You can access the data from anywhere. You don't need to spin up an EC2 instance.
2. Crazy data durability numbers.
3. Nice versioning story around buckets.
4. Cheaper than EBS
Pros to EBS
1. Handy to have the data on a file system in EC2. That let you do normal processing with the Unix pipeline.
2. Random Access patterns work as you would expect.
3. It's a drive. Everyone knows how to deal with files on drives.
If you want to get away from a flat file, DynamoDB provides a nice set of interfaces for putting lots and lots of rows into a table, then running operations against those rows.
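To put a rough number on "Cheaper than EBS" (point 4 in the S3 pros above), here is an illustrative monthly-cost calculation; the per-GiB prices and sizes are assumptions for illustration only, not current AWS quotes:

```python
def monthly_storage_cost(gib: float, price_per_gib: float) -> float:
    """Flat per-GiB-month storage cost; ignores request and transfer fees."""
    return gib * price_per_gib

# Illustrative list prices (assumptions; check the AWS pricing pages):
S3_STANDARD_PER_GIB = 0.023
EBS_GP2_PER_GIB = 0.10

# S3 bills for the 500 GiB actually stored; EBS bills for the whole
# 1 TiB volume you provisioned, used or not.
s3_cost = monthly_storage_cost(500, S3_STANDARD_PER_GIB)
ebs_cost = monthly_storage_cost(1024, EBS_GP2_PER_GIB)
```

The pay-for-provisioned-size model is what makes EBS disproportionately expensive when your data set is small or shrinks over time.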

S3 to EC2 Performance for fetching large numbers of small files

I have a large collection of data chunks sized 1 kB (on the order of several hundred million), and need a way to store and query these data chunks. The data chunks are added, but never deleted or updated. Our service is deployed on AWS (S3 and EC2).
I know Amazon SimpleDB exists, but I want a solution that is platform agnostic (in case we need to move out of AWS for example).
So my question is: what are the pros and cons of these two options for storing and retrieving data chunks? How would the performance compare?
Store the data chunks as files on S3 and GET them when needed
Store the data chunks on a MySQL Server cluster
Would there be that much of a performance difference?
I tried using S3 as a sort of "database" using tiny XML files to hold my structured data objects, and relying on the S3 "keys" to look up these objects.
The performance was unacceptable, even from EC2 - the latency to S3 is just too high.
Running MySQL on an EBS device will be an order of magnitude faster, even with so many records.
Do you need to provide access to these data chunks directly to the users of your application? If not, then S3 and HTTP GET requests are overkill. Bearing in mind also that S3 is a secured service, the overhead for every GET request (for just 1 kB of data) will be considerable.
A MySQL server cluster would be a better idea, but to run it in EC2 you need to employ Elastic Block Storage. Finally, do not rule out SimpleDB. It is perhaps the best solution for your problem. Design your system carefully and you will be able to easily migrate to other database systems (distributed or relational) in the future.
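A back-of-the-envelope illustration of why per-request overhead dominates with many tiny objects; the latency figures below are illustrative assumptions, not measurements of S3 or MySQL:

```python
import math

def total_fetch_time_s(num_chunks: int, latency_ms: float, concurrency: int = 1) -> float:
    """Time to fetch num_chunks objects when each request costs latency_ms
    and up to `concurrency` requests run in parallel."""
    rounds = math.ceil(num_chunks / concurrency)
    return rounds * latency_ms / 1000.0

# Fetching 10,000 1 kB chunks:
s3_sequential = total_fetch_time_s(10_000, 20.0)      # 200 s at an assumed 20 ms/GET
mysql_sequential = total_fetch_time_s(10_000, 1.0)    # 10 s at an assumed 1 ms/query
s3_parallel = total_fetch_time_s(10_000, 20.0, 50)    # 4 s with 50 parallel workers
```

With per-request latency this dominant, batching, parallelism, or a lower-latency store (local MySQL on EBS, SimpleDB) matters far more than raw bandwidth for 1 kB objects.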