Maintain data in Google Bigtable for longer periods

We have use-cases where we would like to store a large volume of data in Google Bigtable for long periods: during product development, for performance tuning, and for demos.
We need to store the data, but we don't really need it to be "online" all the time. The main cost driver seems to be the nodes, which in these cases sit idle for long periods.
How is Google Bigtable being used during product development? I am aware of the development mode (and the emulator) and they are fine for some use-cases but we still need the production environment for other use-cases.
Really, what would be ideal is the ability to switch "off" Bigtable (while still paying for data stored but not for nodes) and bring up the nodes when needed. I don't believe this feature exists. In its absence are there other possible workarounds/alternatives?

It's an interesting question. I've done it with smaller projects using Datastore, with much smaller sizes (~2 GB), that have hung around for years after disabling billing. Given how much it costs to do backups/restores on those projects, I could imagine this would be a cost-prohibitive solution in the Bigtable world. It is disappointing that Google hasn't provided a better solution for this. They do talk about different storage classes, so I'd imagine disabling a project would move its assets to Coldline - but that is just rank speculation on my part.
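In the absence of an official on/off switch, one workaround I can imagine (a rough sketch only, not a supported feature) is to dump the table to Cloud Storage, delete the instance so no nodes are billed, and re-import when you need the data again; I believe Dataflow export/import templates also exist for doing this at scale. The project, instance, table, and bucket names below are placeholders, and holding the whole table in memory is only sensible for modest amounts of data:

```python
# Rough sketch of a "cold storage" workaround (not a supported feature):
# dump a Bigtable table to Cloud Storage as JSON lines, delete the instance
# so no nodes are billed, and re-import when the data is needed again.
import base64
import json

from google.cloud import bigtable, storage

PROJECT, INSTANCE, TABLE = "my-project", "my-instance", "my-table"  # placeholders
BUCKET = "my-archive-bucket"

bt_client = bigtable.Client(project=PROJECT, admin=True)
table = bt_client.instance(INSTANCE).table(TABLE)

lines = []
for row in table.read_rows():
    record = {"key": base64.b64encode(row.row_key).decode()}
    for family, columns in row.cells.items():
        for qualifier, cells in columns.items():
            # Keep only the latest cell version of each column for brevity.
            column = f"{family}:{qualifier.decode()}"
            record[column] = base64.b64encode(cells[0].value).decode()
    lines.append(json.dumps(record))

blob = storage.Client(project=PROJECT).bucket(BUCKET).blob(f"bigtable-archive/{TABLE}.jsonl")
blob.upload_from_string("\n".join(lines))
```

You would still pay for the Cloud Storage bytes, but not for idle nodes.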


In real terms, how much faster is a production tier Heroku database than the starter database?

A basic production-level database on Heroku provides a 400 MB cache. I have a production site running 2 dynos and a worker, and it is pretty heavy on reads and writes. The database is the bottleneck in my app.
A write to the database will invalidate many queries, as searches are performed across the database.
My question is, given the large jump in price between the $9 starter and $50 first level production database, would migrating be likely to give a significant performance improvement?
"Faster" is an odd metric here. This implies something like CPU, but CPU isn't always a huge factor in databases, especially if you're not doing heavy writes. You Basic database has 0mb of cache – every query hits disk. Even a 400mb cache will seem amazing compared to this. Examine your approximate dataset size; a general rule of thumb is for your dataset to fit into cache. Postgres will manage this cache itself, and optimize for the most referenced data.
Ultimately, Heroku Postgres doesn't sell raw performance. The benefits of the Production-tier are multiple, but to name a few: In-memory Cache, Fork/Follow support, 500 available connections, 99.95% expected uptime.
You will definitely see a performance boost in upgrading to a Production-tier plan; however, it's near impossible to claim it will be "3x faster" or similar, as this depends on how you're using the database.
It sure is a steep step, so the question really is how bad the bottleneck is. It will cost you $40 extra, but once your app runs smoothly again it could also mean more revenue. Of course you could also consider other hosting services, but personally I like Heroku the best (although cheaper options are available). Besides, you are already familiar with Heroku. There is some more information on the Heroku Dev Center regarding their different plans:
https://devcenter.heroku.com/articles/heroku-postgres-plans:
Production plans
Non-production applications, or applications with minimal data storage, performance or availability requirements can choose between one of the two starter tier plans, dev and basic, depending on row requirements. However, production applications, or apps that require the features of a production tier database plan, have a variety of plans to choose from. These plans vary primarily by the size of their in-memory data cache.
Cache size
Each production tier plan’s cache size constitutes the total amount of RAM given to Postgres. While a small amount of RAM is used for managing each connection and other tasks, Postgres will take advantage of almost all this RAM for its cache.
Postgres constantly manages the cache of your data: rows you’ve written, indexes you’ve made, and metadata Postgres keeps. When the data needed for a query is entirely in that cache, performance is very fast. Queries made from cached data are often 100-1000x faster than from the full data set.
Well engineered, high performance web applications will have 99% or more of their queries be served from cache.
Conversely, having to fall back to disk is at least an order of magnitude slower. Additionally, columns with large data types (e.g. large text columns) are stored out-of-line via TOAST, and accessing large amounts of TOASTed data can be slow.

Stress testing web applications on less capable hardware

My organization is having an interesting internal debate right now that raises a question that I would like to open to the community at large.
The issue at hand is our environment in which we do stress-testing, capacity-testing, performance-regression-testing, and the like.
On one side of the debate are some software engineers who would like this environment to mirror the production environment as much as possible, in the interest of making the results as meaningful as possible. While we currently do have an environment for such testing, it is far less capable than the production system, and these software engineers feel that they are reaching the limits of what they can learn from it.
On the other side of the debate are some network engineers who both administer the environments and control the purse-strings. While they concede that capacity-testing would be better in an environment that is a better replica of the production environment, they argue that – for the purposes of stress testing – a more modest environment would have the effect of magnifying performance bottlenecks, making them easier to discover and fix.
This finally brings us to the part that piqued my interest: one software engineer suggests that while a more modest stress environment will increase the likelihood that you will encounter some bottleneck, it does not necessarily follow that it would help you find the next bottleneck you may encounter in production. The scaling effect, he argues, may not be linear.
Is there merit to that point of view? If yes, then why? What are the sources of that nonlinearity?
There are a lot of moving parts involved here: a cluster of java application servers, a cluster of database servers, lots of dynamic content being generated for each HTTP hit.
Edit: I appreciate everybody's thoughts so far, but I was really hoping that someone would do more than re-affirm one side or the other and actually tackle the question of "why". If there is such nonlinearity, what gives rise to it? Better yet, it would be great if the reasons were expressed in terms of the CPU, memory, bandwidth, latency, interactions between subsystems, what have you... TerryE, you have come the closest. You should re-post your comment as an answer for the bounty if no one else steps up.
Your software developer is right and I will take the point even further.
When you test an application's components, like a web service, to see their behaviour under load, it is understandable to use a less capable environment. You can find the bottlenecks around memory, I/O, etc. And you will most probably find bugs and oversights like out-of-memory errors and log files growing huge.
But when your application components are running as intended and you need to test the whole shebang, you need to test the real environment.
When you run stress tests in an environment, you measure that environment's behaviour under load and its bottlenecks. While these tests may provide valuable information, this information will not be about your production system. The bottlenecks you find might not be relevant to your real system, and you may spend precious development time fixing problems that do not exist there. To know about the bottlenecks you might really face, you should run your stress tests on your real production system (preferably before the grand opening).
The assumption of the network engineers is that the modest system is basically a scale model of the production system. They are also assuming that the various characteristics of the production environment which affect the software's performance are mirrored in the more modest system, just at lower levels but in the same ratios. For instance, the CPU is not as fast, there is not quite as much memory, the storage is a bit slower, etc., and all of these differences are in similar ratios, such that if everything were magically multiplied by some factor, say 1.77, the scaled-up modest system would be exactly like the production system.
However, I find it difficult to believe that the modest system is an exact scale model of the production system in all particulars.
Here is a specific example. Let's say that measurements on the production system indicate that CPU utilization, the percentage of time the CPU is not idle, is too high. So you put the software on the modest system, do measurements, and discover that on the modest system the CPU utilization is lower. An investigation reveals that the modest system has slower storage, so the CPU spends more time idle waiting for data transfers from storage to complete: the application is I/O bound on the modest system, whereas on the production system it is not. This difference is due to the modest machine not being an exact scale model of the production machine, because the CPU ratio is different from the I/O transfer ratio.
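To make that concrete, here is a toy model with made-up service times: if the CPU is 2x slower on the modest box but the storage is 6x slower, the resource that saturates first flips from CPU to I/O, so the "first bottleneck" you observe is not the one production would hit.

```python
# Toy model (made-up service times): which resource saturates first when CPU
# and storage do not scale by the same factor between two environments?

def first_bottleneck(cpu_ms_per_req: float, io_ms_per_req: float):
    """Max request rate each resource sustains (one CPU, one disk, no overlap)."""
    cpu_limit = 1000.0 / cpu_ms_per_req   # requests/sec before the CPU saturates
    io_limit = 1000.0 / io_ms_per_req     # requests/sec before the disk saturates
    return ("CPU" if cpu_limit < io_limit else "I/O", min(cpu_limit, io_limit))

# Production: 8 ms CPU and 5 ms disk per request -> CPU saturates first, ~125 req/s.
print(first_bottleneck(8.0, 5.0))

# "Scale model": CPU is 2x slower but storage is 6x slower -> I/O saturates
# first at ~33 req/s, a bottleneck the production system would never hit first.
print(first_bottleneck(16.0, 30.0))
```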
Another example would be having more memory, allowing fewer page faults in the production environment. When the software is loaded onto the more modest machine, there are more page faults due to there being less physical memory. With the various applications paging in and out, they begin to affect each other as pages of other applications are swapped out and then swapped back in again. On the production machine with larger memory, this cascading page-fault behavior is not seen because there is sufficient memory to hold more applications simultaneously.
The point that I am really trying to make here is that a computer with all its various parts and applications is a complex, dynamic system. The idea that one computing environment is just a scale model of another is too simplistic of an assumption. Using a modest system can certainly provide valuable data. However once the gross adjustments have been made to the software and you are beginning to get into more subtle detailed adjustments, the differences in the environment can have a large impact on the results of the testing.
Some citations.
Computer systems are dynamical systems by Todd Mytkowicz, Amer Diwan, and Elizabeth Bradley.
Bayesian fault detection and diagnosis in dynamic systems by Uri Lerner, Ronald Parr, Daphne Koller, and Gautam Biswas.
I have encountered a similar situation in my production environment. We use the modest system just for initial, basic-level testing and findings. It is true that you can never find the real bottlenecks and other performance issues in your testing environment. So to find the real performance issues and bottlenecks, you must do it in the production environment; there's no other way.
We host over 2.5 million websites. It might not be the case with yours, but let me tell you that in our case we have faced horrible situations of bottlenecks appearing linearly, one after another. We first faced memory issues as our traffic increased, and we resolved that by adding more memory. Until then we didn't even notice that having only 256 httpd threads was our next bottleneck, because the limited memory was hiding it. Once we resolved the memory issue, it quickly came down to the question of why our web servers were slow again after just a few weeks. We found out that 256 httpd threads were just not enough to serve that much traffic. We not only increased the threads but also installed HA parallel load balancers in front of our web servers to mitigate the issue.
Fortunately, that solved our slow page-loading problems. But after a few months, as traffic continued to grow, we hit the next bottleneck in the storage system; this time disk I/O was the issue. To make the story short, we put in parallel NFS-based physical storage systems. Each NFS machine now serves files with over 2000 threads running.
I forgot to mention that the database was also a big culprit of slowness; we resolved that issue by installing a master-slave model in a cluster. We had to do a lot of performance tweaks in our application code as well, and we had to physically distribute our application into different modules over different servers.
I'm mentioning all this to make the point that performance issues very likely come in a linear way, one after the other; at least that's what we have faced in our web-based model. Even if you have tested a lot on your modest systems, there is still a chance of hidden bottlenecks which you can't find in testing environments.
What I have learned from my last 6 years of experience is to DISTRIBUTE your model AS MUCH AS POSSIBLE if you think you are going to have a lot of traffic or hits/sec. A centralized model can hold your traffic for some time with a lot of tweaking, but in the end your system gets busted.
I'm not saying you will face these bottlenecks or issues in your situation, but I just wanted to warn you that these cases happen sometimes, so you are better aware.
Sorry for my English.
Good question. Learning and optimization are best done on modest hardware, but testing is safer on a mirror of production (or at least something from the same hardware generation).
It seems like you are trying to predict the first bottleneck that will appear and when it will happen. I'm not sure that's the correct objective or the correct approach. I assume we're not talking about a typical CRUD application where the client says "it should work as fast as every other web application". If you want to do the tests correctly, then before you start you should know the expected load: expected number of users, expected number of events, response times, etc. It's part of your product specification. If you don't have the numbers, that means your analysts didn't do their job.
If you have the numbers, then you don't need exact test results; you just need to know the order of magnitude. You should also check how your software/hardware scales: how many instances do you need to handle x users/requests/whatever, and how many to handle y?
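As an illustration of that kind of estimate (every number below is a placeholder):

```python
# Back-of-the-envelope capacity estimate; every number here is a placeholder.
import math

peak_requests_per_sec = 400       # expected peak load, from the product spec
per_instance_capacity = 35        # req/s one instance sustained in the test env
headroom = 0.7                    # don't plan to run instances at 100%

instances = math.ceil(peak_requests_per_sec / (per_instance_capacity * headroom))
print(f"instances needed at peak: {instances}")   # 17
```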
We load test systems for our customers every day -- and we see a wide range of problems. Certain classes of problems can be found on down-sized systems. Others cannot. Some can ONLY be found in production... because no matter how closely you mirror the two systems, they can never be identical. You can get REALLY close, if you work hard enough.
So, simple fact of testing: the closer your system is to the production system, the more accurate your tests will be.
IMO, this is one of the best reasons for moving to the cloud: you can spin up a system that is very close to your production system (about as identical as you could ever get) and run your load tests on that.
It is probably worth mentioning that we've occasionally seen customers waste a lot of hours chasing problems in their test environments that never would have occurred in production. The more different the environments are, the more likely this is to happen :(
I think you have partially answered your own question - you already have a test environment and are already finding that it is not at the same level as / not as capable as the production environment. The bottom line is that with all the money in the world you will never be able to replicate the exact functioning of the production website: timings of events, volumes, CPU utilisation, memory utilisation, DB I/O - when it's all working in anger, the behaviour can be non-deterministic to a certain extent. My point is you can never make it exactly the same. On the other side of the coin, a production environment by its nature is going to be an expensive environment with a lot of kit in order to make it perform and handle your production volume of data/transactions. This is a big expense/overhead to the business, and in these times of frugality should we not be looking to avoid additional cost to the business?
Maybe a different tactic should be taken - learn the performance profile of your production software: how does it scale with volume, do running times increase linearly, exponentially, or logarithmically? Can you model this? Firstly, you can verify that the test environment is behaving in a similar way - this is key to having a valid test. The other important part is taking relative measurements rather than absolutes. You aren't going to get absolute running times that are the same as production, but you can run your performance tests before deploying the code changes to give you a baseline, then deploy your code changes and re-run the performance tests. This gives you the relative change to expect in production (e.g. will performance degrade with this code release?), and, based on your models of performance, you will be able to verify that the software scales in the same way with extra volume.
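As a sketch of what that modelling might look like (the data points and the power-law fit are illustrative assumptions, not a prescribed method):

```python
# Illustrative sketch: fit runtime vs. data volume from the test environment
# to estimate how the software scales. The data points are invented.
import numpy as np

volumes = np.array([1e4, 5e4, 1e5, 5e5, 1e6])        # e.g. rows processed
runtimes = np.array([1.2, 6.5, 13.8, 75.0, 160.0])   # seconds

# Fit runtime ~ a * volume**b: b close to 1 is linear, b > 1 is superlinear.
b, log_a = np.polyfit(np.log(volumes), np.log(runtimes), 1)
print(f"scaling exponent b = {b:.2f}")

# Relative regression check: compare a new build against the baseline at the
# same volume, rather than trusting absolute timings from smaller hardware.
baseline_s, candidate_s = 13.8, 16.9
print(f"runtime relative to baseline: {candidate_s / baseline_s:.0%}")
```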
So my viewpoint is that there is a great deal you can learn about your software and hardware performance in the lower environment. Doing this on smaller/less capable infrastructure saves your company money and, if used right, can give you most of the answers to the performance-testing questions you are asking.

TPC or other DB benchmarks for SSD drives

I have been interested in SSD drives for quite some time. I do a lot of work with databases, and I've been quite interested to find benchmarks such as TPC-H performed with and without SSD drives.
On the surface it sounds like there would be one, but unfortunately I have not been able to find one. The closest I've found to an answer was the first comment on this blog post.
http://dcsblog.burtongroup.com/data_center_strategies/2008/11/intels-enterprise-ssd-performance.html
The fellow who wrote it seemed to be a pretty big naysayer when it came to SSD technology in the enterprise, due to a claim of lack of performance with mixed read/write workloads.
There have been other benchmarks, such as this and this, that show absolutely ridiculous numbers. While I don't doubt them, I am curious whether what the commenter in the first link said was in fact true.
Anyway, if anybody can find benchmarks done with DBs on SSDs, that would be excellent.
I've been testing and using them for a while, and whilst I have my own opinions (which are very positive), I think that AnandTech's testing document is far better than anything I could have written; see what you think:
http://www.anandtech.com/show/2739
Regards,
Phil.
The issue with SSDs is that they make real sense only when the schema is normalized to 3NF or 5NF, thus removing "all" redundant data. Moving a "denormalized for speed" mess to SSD will not be fruitful; the mass of redundant data will make SSD too cost-prohibitive.
Doing that for some existing application means redefining the existing table (references) as views, encapsulating the normalized tables behind the curtain. There is a time penalty on the engine's CPU to synthesize rows. The more denormalized the original schema, the greater the benefit of refactoring before moving to SSD. Even on SSD, these denormalized schemas will likely run slower, due to the mass of data which must be retrieved and written.
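A minimal sketch of that idea, using sqlite3 as a stand-in engine and made-up table names: the legacy denormalized table lives on as a view over the normalized tables, so existing queries keep working while the stored data shrinks:

```python
# Minimal sketch of the "views in front of normalized tables" idea: the old
# denormalized table name survives as a view, so existing queries keep working
# while the stored data is normalized (and therefore smaller). Table and column
# names are made up; sqlite3 merely stands in for the real engine.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders    (order_id INTEGER PRIMARY KEY,
                            customer_id INTEGER REFERENCES customers,
                            amount REAL);

    -- The legacy, denormalized "orders_wide" table becomes a view.
    CREATE VIEW orders_wide AS
        SELECT o.order_id, o.amount, c.name AS customer_name, c.city AS customer_city
        FROM orders AS o JOIN customers AS c USING (customer_id);
""")

db.execute("INSERT INTO customers VALUES (1, 'Acme', 'Springfield')")
db.execute("INSERT INTO orders VALUES (10, 1, 42.0)")
print(db.execute("SELECT * FROM orders_wide").fetchall())
# [(10, 42.0, 'Acme', 'Springfield')]
```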
Putting logs on SSD is not indicated; logging is a sequential, write-mostly (write-only under normal circumstances) operation, and the physics of flash SSDs makes this a poor fit (a company named Texas Memory Systems has been building RAM-based sub-systems for a long time). Conventional spinning-rust drives, duly buffered, will do fine.
Note the AnandTech articles; the Intel drive was the only one which worked right. That will likely change by the end of 2009, but as of now only the Intel drives qualify for serious use.
I've been running a fairly large SQL Server 2008 database on SSDs for 9 months now (600 GB, over 1 billion rows, 500 transactions per second). I would say that most SSD drives I tested are too slow for this kind of use. But if you go with the upper-end Intels and carefully pick your RAID configuration, the results will be awesome. We're talking 20,000+ random reads/writes per second. In my experience, you get the best results if you stick with RAID 1.
I can't wait for Intel to ship the 320GB SSDs! They are expected to hit the market in September 2009...
The formal TPC benchmarks will probably take a while to appear using SSD because there are two parts to a TPC result: the speed (transactions per unit time) and the cost per transaction-per-unit-time. With the high speed of SSD, you have to scale the size of the DB even larger, thus using more SSD, and thus costing more. So even though you might get superb speed, the cost is still prohibitive for a fully-scaled (auditable, publishable) TPC benchmark. This will remain true for a while yet (as in, a few years), while SSD remains more expensive than the corresponding quantity of spinning disk.
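A toy illustration of why the price/performance part hurts (the numbers are invented, not from any published result): the metric is roughly the total audited system price divided by queries-per-hour, so pricier storage inflates it directly, even at identical speed.

```python
# Toy illustration (numbers invented, not from any published result): the TPC
# price/performance metric is roughly total audited system price divided by
# queries-per-hour, so pricier storage inflates it even at identical speed.
qph = 20_000                  # QphH achieved by the system under test
disk_system_price = 90_000    # total system price with spinning disks ($)
ssd_system_price = 260_000    # same performance, storage priced at SSD rates ($)

print(f"disk: ${disk_system_price / qph:.2f} per QphH")   # $4.50
print(f"ssd:  ${ssd_system_price / qph:.2f} per QphH")    # $13.00
```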
Commenting on:
"...quite interested to find benchmarks such as TPC-H performed with and without SSD drives."
(FYI and full disclosure, I am pseudonymously "J Scouter", the "pretty big naysayer when it came to SSD technology in the enterprise" referred to and linked above.)
So... here's the first clue to emerge.
Dell and Fusion-IO have published the first EVER audited benchmark using a Flash-memory device for storage.
The benchmark is the TPC-H, which is a "decision support" benchmark. This is important because TPC-H entails an almost exclusively "read-only" workload pattern -- perfect context for SSD as it completely avoids the write performance problem.
In the scenarios painted for us by the Flash SSD hypesters, this application represents a soft-pitch, a gentle lob right over the plate and an easy "home run" for a Flash-SSD database application.
The results? The very first audited benchmark for a flash-SSD-based database application, and a READ-ONLY one at that, resulted in (drum roll here)... a fifth-place finish among comparable (100 GB) systems tested.
This Flash SSD system produced about 30% as many Queries-per-hour as a disk-based system result published by Sun...in 2007.
Surely though it will be in price/performance that this Flash-based system will win, right?
At $1.46 per Query-per-hour, the Dell/Fusion-IO system finishes in third place. More than twice the cost-per-query-per-hour of the best cost/performance disk-based system.
And again, remember this is for TPC-H, a virtually "read-only" application.
This is pretty much exactly in line with what the MS Cambridge Research team discovered over a year ago -- that there are no enterprise workloads where flash makes ROI sense from an economic or energy standpoint.
I can't wait to see TPC-C, TPC-E, or SPC-1, but according to the research paper linked above, SSDs will need to become orders of magnitude cheaper for them to ever make sense in enterprise apps.

Web Application Infrastructure

I have custom coded several enterprise applications for mid to large organizations to use internally (some with a minimal external footprint). I now have plans for a web project that may (hopefully) see a large userbase with more daily traffic than my previous projects have ever attained. Obviously I want my design to be scalable and maintainable. The problem is that from a physical layout perspective (servers/VMs) I do not know what to expect.
The question: What are some good resources for this? Books? Websites? I have found plenty on scalable application design, but nothing on scalable physical design.
It's hard to give an exact answer without knowing something about the technologies you plan to use. The approach to the application can't be completely unaware of the planned physical infrastructure if scaling is a major driver.
Caching would have to be a big concern, as would ways to expand the hardware where your data lives.
A very interesting and instructive read is the real-world bio of LiveJournal: a history of scaling and how they grew their physical presence alongside massive growth in their website. One major offshoot of their work was a new caching technology, memcached, which is now used by Facebook among others. It is surprisingly honest.
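For what it's worth, the caching pattern memcached enables is simple to sketch. Here is a minimal cache-aside example using the pymemcache client, where the server address, key scheme, and the stubbed load_user_from_db function are all placeholders:

```python
# Minimal cache-aside sketch with memcached via the pymemcache client.
# The server address, key scheme, and the stubbed DB call are all placeholders.
import json
from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))

def load_user_from_db(user_id: int) -> dict:
    """Stand-in for the real database query."""
    return {"id": user_id, "name": "example"}

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit: skip the database
    user = load_user_from_db(user_id)      # cache miss: fall back to the DB
    cache.set(key, json.dumps(user), expire=300)  # keep it warm for 5 minutes
    return user

print(get_user(42))
```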
The High Scalability blog is good. You can look at some of their examples that go over the physical parts of large sites. I would say the most common first-level physical scaling technique is a load balancer. That is pretty easy, but even at its simplest you still have a database that is a potential bottleneck. Most of the physical parts of scaling just require you to add more; the real issues come in where you are forced to use just one of something.

Commercial uses for grid computing?

I keep hearing from associates about grid computing which, from what I can gather, is highly distributed stuff along the lines of SETI@home.
Is anyone working on these sort of systems for business use? My interest is in figuring out if there's a commercial reason for starting software development in this field.
Rendering farms, such as Pixar's
Model evaluation, e.g. weather, financials, military
Architectural engineering, e.g. earthquakes
To list a few.
Grid computing is really only needed if you have a lot of WORK that needs to be done, like folding proteins; otherwise a simple server farm will likely be plenty.
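To show the shape of workload that does justify it, here is a toy, single-machine stand-in for the pattern a grid distributes across many machines: lots of independent work units whose results are combined at the end (the score_candidate function is a made-up placeholder for real work like protein folding):

```python
# Toy, single-machine stand-in for the pattern a grid distributes across many
# machines: lots of independent work units, results combined at the end.
# score_candidate is a made-up placeholder for real work (e.g. protein folding).
from concurrent.futures import ProcessPoolExecutor

def score_candidate(seed: int) -> float:
    """An expensive, independent unit of work."""
    total = 0.0
    for i in range(1, 200_000):
        total += (seed * i) % 7
    return total

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:          # a grid would use remote nodes
        results = list(pool.map(score_candidate, range(32)))
    print(f"best score: {max(results)}")
```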
Obviously Google is a major user of grid computing; its entire search service relies on it, along with many others.
Engines such as Bigtable are based on using lots of nodes for storage and computation. These are commercially very useful because they're a good alternative to a small number of big servers, providing better redundancy and cost-effective scaling.
The downside is that the software is fiendishly difficult to write, but Google seem to manage that one ok :)
So anything which requires big storage and/or lots of computation.
I used to work for these guys. Grid computing is used all over. Anyone who makes computer chips uses them to test designs before getting physical silicon cut. Financial websites use grids to calculate if you qualify for that loan. These days they are starting to replace big iron in a lot of places, as they tend to be cheaper to maintain over the long term.