My Redis is filling up rather quickly and I am trying to find a way to visualise what is taking up the space.
Is there a tool or command that will help me visualise what's taking up the storage? Ideally, I would like to know:
Key prefix by storage size
Key prefix key count
(An option that will work only on Azure will also be suitable as we are using Azure as our Cache storage)
I am trying to copy 25 TB of data to Azure. Do we have any option to move the data?
I tried to copy it, but it took 1 hour for 1 GB of data. Is there a better solution so that I can do it more quickly?
The problem statement is very general. I would start with asking, how are you transferring the data?
The speed is dependent on so many factors, a few being:
1. Location of the data.
2. Location of the storage account you're writing to.
3. Network speed and bandwidth on the client side.
4. Network speed and bandwidth on the Azure Storage side (expected to be good).
If you're writing the data to an Azure Storage account in a region closer to you, you can expect better speed.
As for the options to write the data:
1. Look at AzCopy.
https://azure.microsoft.com/en-us/documentation/articles/storage-use-azcopy/
2. Use the Import/Export service.
https://azure.microsoft.com/en-us/pricing/details/storage-import-export/
The best way to upload large datasets into the cloud is still the sneakernet.
Azure offers the Azure Import/Export Service. Basically, you buy a SATA hard drive, encrypt it with a numerical BitLocker key, copy data to it, create an Azure import job, then ship the hard drive to them.
This ends up being considerably quicker than trying to upload.
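To put the upload rate from the question in perspective, here is a rough back-of-the-envelope sketch. It assumes the observed 1 GB per hour stays constant and uses decimal units (1 TB = 1000 GB); the function name is only illustrative.

```python
def transfer_days(total_gb: float, gb_per_hour: float) -> float:
    """Days needed to move total_gb at a constant rate of gb_per_hour."""
    if gb_per_hour <= 0:
        raise ValueError("gb_per_hour must be positive")
    return total_gb / gb_per_hour / 24


# Figures from the question: 25 TB at roughly 1 GB per hour.
# Assumption: decimal units, 1 TB = 1000 GB, and a constant rate.
days = transfer_days(25 * 1000, 1.0)
print(f"{days:.0f} days (~{days / 365:.1f} years)")
```

At that rate the upload would take close to three years, which is why shipping drives comes out ahead unless the network path can be made much faster.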
An alternative you might want to look into is AWS Import/Export Snowball: they ship you an appliance to copy the data to, which you ship back to them when complete. It might be worth copying the data into AWS via Snowball and then copying it across their much faster internet pipes into Azure, instead of buying the hardware required to transfer that much data.
If you open the target Storage account in the Azure Portal, there's now a calculator that will accept basic details (how much data, etc.) and then recommend the best options to you. It's under the heading "Data transfer".
Users of our platform will have large amounts of stored data on our system. Through an application, once connected, that data will be transferred to them and no longer need to remain on our servers. There could potentially be hundreds or thousands of users connected at any given time, performing their downloads.
Here's the proposed architecture:
User management, configuration, and data download statistics will be maintained in a SQL Server database, while using either Redis or DynamoDB for the large data sets.
The reason for choosing either Redis or DynamoDB is based on cost - cheaper than running another SQL Server instance, and performance. The data format will be similar to a datamart - flat table with no joins.
Initially the queries would be simple - get all data for user X between a date range, and optionally delete.
Since we may want to add free-text searching for certain fields of that data, using Elasticsearch may be a better option from the get-go.
I want this to be auto-scaling, but I am not sure which database would be best for this scenario.
Here's a great discussion of the database + search tier from AWS re:Invent:
https://youtu.be/K7o5OlRLtvU?t=1574
I would not use Elasticsearch alone because it does not provide auto-scaling for write capacity. In fact, it is not trivial to increase the number of shards of an index. Secondly, it can only handle the JSON format, which could be an issue for you.
Redis could be a good idea because it is really fast (everything is done in RAM) and it provides keys with a limited time-to-live, which could be useful for you. Unfortunately, if your data size exceeds the RAM of your Amazon instance, you will have to shard your Redis database. Redis does not support this itself, so you will have to handle it in your application code. Moreover, as far as I know, Redis does not handle complex queries. You will also need to fit your data into a Redis data structure, which could be an issue for you.
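As a rough illustration of what sharding in application code could look like, here is a minimal sketch that routes each key to one of several instances with a stable hash. The shard addresses and function names are hypothetical, and no Redis client is involved; it only shows the routing step.

```python
import zlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (CRC32)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # zlib.crc32 gives the same value in every process, unlike built-in hash().
    return zlib.crc32(key.encode("utf-8")) % num_shards


# Hypothetical shard addresses; each would be a separate Redis instance.
SHARDS = ["redis-0:6379", "redis-1:6379", "redis-2:6379"]


def node_for(key: str) -> str:
    """Return the address of the instance that should hold this key."""
    return SHARDS[shard_for(key, len(SHARDS))]
```

Note that with plain modulo routing, changing the number of shards remaps most keys; consistent hashing is the usual way to reduce that.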
DynamoDB handles auto-scaling really well, but it is a key/value database, so it cannot run arbitrary queries: a request like "get all data for user X between a date range" only works if the table's key schema was designed for it (for example, user as the partition key and date as the sort key). DynamoDB also allows you to save your data in any format.
The solution would be to use either DynamoDB or Redis, depending on the size of your data, and to use Elasticsearch to index your keys with only the metadata (user and dates). That way your index stays small, and if you lose the ability to index because Elasticsearch gets too busy, you keep the ability to save users' data.
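The split described above (payloads in a key/value store, only user and date in a small index) might be sketched as follows. Everything here is an in-memory stand-in: the dict plays the role of DynamoDB or Redis, the `MetadataIndex` class plays the role of the search index, and the key scheme (one record per user per day) is an assumption made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IndexEntry:
    """Metadata only: who owns the record, when, and where the payload lives."""
    user: str
    day: date
    key: str


class MetadataIndex:
    """Stand-in for the search index; holds small entries, not payloads."""

    def __init__(self) -> None:
        self._entries: list[IndexEntry] = []

    def add(self, entry: IndexEntry) -> None:
        self._entries.append(entry)

    def keys_for(self, user: str, start: date, end: date) -> list[str]:
        """Keys of one user's records with start <= day <= end."""
        return [e.key for e in self._entries
                if e.user == user and start <= e.day <= end]


# Stand-in for the key/value store (DynamoDB or Redis in the answer).
payload_store: dict[str, bytes] = {}
index = MetadataIndex()


def save(user: str, day: date, payload: bytes) -> str:
    # Hypothetical key scheme: one record per user per day.
    key = f"{user}:{day.isoformat()}"
    payload_store[key] = payload           # write the data first
    index.add(IndexEntry(user, day, key))  # then index only the metadata
    return key


def fetch(user: str, start: date, end: date) -> list[bytes]:
    """Look up keys in the index, then read payloads from the store."""
    return [payload_store[k] for k in index.keys_for(user, start, end)]
```

Because the payload is written before the index entry, a failure in the indexing step leaves the data saved and only delays its searchability.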
I have flat files in which I store and retrieve data instead of using a database. This is temporary and may last for a couple of months. I was wondering if I should be using EBS or S3. EBS is mainly used for I/O and S3 for content delivery, but S3 is a pay-as-you-go model, whereas with EBS you have to pay for the volume you purchased?
Please advise: which one is better?
S3 sounds like it's more appropriate for your use case.
S3 is object storage. Think of it as an Amazon-run file server. (Objects are not exactly equal to files, but it's close enough here.) You tell S3 to put a file, it'll store it. You tell S3 to get a file, it'll return it. You tell S3 to delete it, it's gone. This is easy to work with and very scalable.
EBS is block storage. Think of it as an Amazon-run external hard drive. You can plug an EBS volume into an EC2 virtual machine, or you access it over the Internet via AWS Storage Gateway. Like an external hard drive, you can only plug it into one computer at a time. The size is set up front, and while there are ways to grow and shrink it, you're paying for all the bits all the time. It's also much more complex than S3, since it has to provide strong consistency guarantees for the entire volume, instead of just on a file-by-file basis.
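The billing difference can be made concrete with a small sketch. The prices and sizes below are made-up placeholders, not real AWS prices, and the model ignores request and data-transfer charges; it only contrasts paying for what is stored with paying for what is provisioned.

```python
def s3_monthly_cost(stored_gb: float, price_per_gb: float) -> float:
    """Pay-as-you-go: billed on what is actually stored."""
    return stored_gb * price_per_gb


def ebs_monthly_cost(provisioned_gb: float, price_per_gb: float) -> float:
    """Provisioned: billed on volume size, whether or not it is full."""
    return provisioned_gb * price_per_gb


# All numbers below are hypothetical placeholders, not real AWS prices.
HYPOTHETICAL_S3_PRICE = 0.02   # per GB-month
HYPOTHETICAL_EBS_PRICE = 0.10  # per GB-month

stored = 50        # GB of flat files actually written
provisioned = 200  # GB volume sized up front to leave room for growth

print(s3_monthly_cost(stored, HYPOTHETICAL_S3_PRICE))
print(ebs_monthly_cost(provisioned, HYPOTHETICAL_EBS_PRICE))
```

Plugging in current prices for your region and your own data size gives a quick first comparison.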
To build on the good answer from willglynn: if you are interacting with the data regularly, or need more file-system-like access, you might consider EBS more strongly.
If the amount of data is relatively small and you read and write to the data store regularly, you might consider something like ElastiCache for in-memory storage, which would likely perform better than using S3 or EBS.
Similarly, you might look at DynamoDB for document-type storage, especially if you need to be able to search/filter across your data objects.
Point 1) You can use either S3 or EBS for this. If you want reduced latency and the file sizes are bigger, then EBS is the better option.
Point 2) If you want lower costs, then S3 is a better option.
From what you describe, S3 will be the most cost-effective and likely easiest solution.
Pros to S3:
1. You can access the data from anywhere. You don't need to spin up an EC2 instance.
2. Crazy data durability numbers.
3. Nice versioning story around buckets.
4. Cheaper than EBS.
Pros to EBS:
1. Handy to have the data on a file system in EC2. That lets you do normal processing with the Unix pipeline.
2. Random Access patterns work as you would expect.
3. It's a drive. Everyone knows how to deal with files on drives.
If you want to get away from a flat file, DynamoDB provides a nice set of interfaces for putting lots and lots of rows into a table, then running operations against those rows.
I'm currently creating a scheme for my application and I'm wondering if my thinking is right.
Example : Ecommerce site
In DynamoDB, I would put products (product_id, meta-data link to S3)
S3 I would use for storing Search Data Format (SDF/JSON)
(Product name, product description, price, ...etc )
Amazon CloudSearch would be used to index documents in S3, and to be able to search them.
Redis would be used to cache results
Is my scheme right? Can S3 be a good "database"?
Is DynamoDB even needed here?
If you are thinking that S3 would just be the source of record for your products and you are not expecting heavy reads/writes, then it COULD work, but you have to recognize that it will be far, far slower than using a real database. Not just 1-2x slower but MANY orders of magnitude slower. We use S3 for storing audit data for realtime data stored in Postgres. It works a charm, but this is data that is written once and read rarely. When it does have to retrieve audit records, the retrieval time is > 50 ms. This type of speed is usually not acceptable when you need to manipulate multiple records at one time.
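To see why that latency matters once several records are involved, here is a small arithmetic sketch. The 50 ms figure is the one quoted above; the batch size of 1000 records and the one-after-another access pattern are assumptions for illustration.

```python
def sequential_fetch_seconds(num_records: int, latency_ms: float) -> float:
    """Total time when records are fetched one after another."""
    return num_records * latency_ms / 1000


# 50 ms is the retrieval time quoted above; 1000 records is an
# assumed batch size for illustration.
print(sequential_fetch_seconds(1000, 50))
```

Issuing requests in parallel would cut the wall-clock time, but the per-record latency itself stays the same.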
If you are going to be using DynamoDB anyway, why not just use that to store what you'd be storing on S3? Trying to adhere to the concept of keeping it simple, I would use the following stack:
DynamoDB to be the system of record and to do some searches
CloudSearch for more flexible searching than what DynamoDB can provide
S3 for static files (product images, etc.)
And again, to keep things simple, skip Redis for caching if you are already using DynamoDB and don't plan on using any of Redis' specialized datatypes, i.e. your caching will be nothing more than keys to strings, etc. Use Redis if you plan on taking advantage of its other datatypes or if you want to have a cache closer to your app, i.e. you plan on using Redis on the webserver.
Dynamo is used for storing write-intensive data. If your application does not require intensive writes over product_id and meta-data, I think RDS/MySQL is better.
When designing an application, you really should keep things as simple as possible from the beginning. It will always get worse with time :)
S3 is not a good DB. It has not been designed for this and is too slow. It's for file storage only. If you want to stick with DynamoDB, you should put all your product info in it, including the metadata.
CloudSearch may be a good option. You can also build your own "indexes" on top of DynamoDB. It requires more design and programming but might be worth considering. Here is a link to an excellent blog post on this matter: http://blog.coredumped.org/2012/01/amazon-dynamodb.html.
So,
Is DynamoDB even needed: Yes, or RDS, Mongo,... any real DB depending on your needs.
Is S3 a good DB: I don't think so.
We run Solr on an Amazon Web Services EC2 instance with a 1TB EBS volume to store the index so that we can easily launch additional servers with the same (read-only) index. However, our index is soon going to exceed 1TB, and I don't really want to deal with striping multiple EBS volumes to hold the index. Also, regenerating the index is very slow. I would like to move the index generation--and maybe hosting--to Hadoop, and preferably to Amazon's Elastic MapReduce, although I can set up separate Hadoop servers if need be. We use RightScale, so their library of ServerTemplates is available to us.
What would be the best place to get started using Lucene/Solr on Hadoop?
Take a look at ElasticSearch. You can index to ElasticSearch from Hadoop for bulk loading. Infochimps has open sourced an ElasticSearch bulk indexer called Wonderdog that you can look at for a proof of concept.
https://github.com/infochimps/wonderdog
http://www.elasticsearch.com
It's cloud friendly (See cloud-aws plugin for discovery), and can scale up / down by adding nodes to hold the index.
Is your index sharded? You could shard the index and distribute shards across several instances.