I hope I am asking in the right community; if not, any suggestion will be appreciated.
I am writing a survey paper comparing erasure coding and replication techniques. At this stage, I am comparing them on the specific parameters below.
The table I am trying to construct deals with the parameters that determine which technique is better in: storage efficiency, availability, durability, encoding time, latency on failure, and cost of reconstruction.
Since replication has faster reads when a failure occurs, is it correct to say that the replication technique has higher latency on failure? The same for encoding time: is it correct to say replication has a high encoding time, given that it has better write performance?
Is the reconstruction cost on failure higher in an erasure-coded system than in a replicated one? Does it involve more disk I/O? Will it be different if the failure is transient rather than permanent?
Would it be more informative if I compared all of the above parameters separately for transient and permanent failures?
Is it correct if I compare them as below?
Erasure coding: higher (durability, storage efficiency, availability) and lower (encoding time, latency on failure, cost to reconstruct)
Replication: higher (encoding time, latency on failure, cost to reconstruct) and lower (durability, storage efficiency, availability)
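The storage-efficiency comparison at least can be made concrete with arithmetic. Here is a back-of-the-envelope sketch; the 3-way replication and RS(10, 4) parameters are my own illustrative choices, not from the question:

```python
# Storage overhead: 3-way replication vs. a Reed-Solomon RS(k=10, m=4)
# erasure code. Overhead = raw bytes stored per byte of user data.

def replication_overhead(replicas: int) -> float:
    """Each byte of user data is stored once per replica."""
    return float(replicas)

def erasure_overhead(k: int, m: int) -> float:
    """RS(k, m) splits data into k fragments and adds m parity fragments."""
    return (k + m) / k

print(replication_overhead(3))   # 3.0 -> 200% extra storage
print(erasure_overhead(10, 4))   # 1.4 -> 40% extra storage
```

This is why erasure coding wins on storage efficiency: RS(10, 4) tolerates the loss of any 4 fragments at 1.4x raw storage, while 3-way replication tolerates 2 lost copies at 3x.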
A CRDT, or Conflict-free Replicated Data Type, follows a strong eventual consistency guarantee, essentially meaning that all replicas are guaranteed to converge to the same state at some point in the future.
My question is: is the Consistency part of the CAP theorem sacrificed, or if not, which one is?
CRDTs sacrifice consistency to achieve availability, at least in the most straightforward use of them, which does nothing to check that you have received inputs from all potential clients (nodes in the network).
However, a CRDT is a kind of data structure, not a distributed algorithm, so its behavior in a distributed environment will depend on the full distributed algorithm it participates in.
Some similar ideas are discussed in https://blog.acolyer.org/2017/08/17/on-the-design-of-distributed-programming-models/:
Lasp is an example of an AP model that sacrifices consistency for availability. In Lasp, all data structures are CRDTs...
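To make the "available but eventually consistent" trade-off concrete, here is a minimal sketch of a G-Counter (grow-only counter) CRDT; the class and node IDs are illustrative, not from any particular library:

```python
# Minimal G-Counter CRDT sketch: each node tracks its own count, and
# merge takes the element-wise max. Because max is commutative,
# associative, and idempotent, replicas converge regardless of the
# order in which they exchange state.

class GCounter:
    def __init__(self):
        self.counts = {}  # node_id -> that node's local count

    def increment(self, node_id, amount=1):
        self.counts[node_id] = self.counts.get(node_id, 0) + amount

    def merge(self, other):
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

# Two replicas accept writes independently (available under partition)...
a, b = GCounter(), GCounter()
a.increment("node-a", 2)
b.increment("node-b", 5)
# ...and converge once they merge, in either direction.
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 7
```

Note what the sketch does not give you: between the increments and the merges, the two replicas report different values, which is exactly the CAP consistency being sacrificed.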
After going through a couple of resources on Google and Stack Overflow (mentioned below), I have a high-level understanding of when to use what, but I have a couple of questions too.
My understanding: when used as pure in-memory databases, both have comparable performance. But for big data, where the complete dataset cannot fit in memory (or can fit only at increased cost), Aerospike (AS) can be a good fit, as it provides a mode where indexes are kept in memory and data on SSD. I believe performance will be somewhat degraded compared to a completely in-memory database (though the way AS handles reads and writes from SSD makes it faster than traditional disk I/O), but it saves cost and performs better than keeping the complete dataset on disk. So when the complete dataset fits in memory, both can be equally good, but when memory is constrained, AS can be the better choice. Is that right?
Also, it is said that AS provides rich and easy-to-set-up clustering, whereas some clustering features in Redis need to be handled at the application level. Does that still hold, or was it only true until a couple of years ago (I believe so, as I see Redis now also provides a clustering feature)?
How is aerospike different from other key-value nosql databases?
What are the use cases where Redis is preferred to Aerospike?
Your assumption in (1) is off, because it applies to (mostly) synthetic situations where all the data fits in memory. What happens when you have a system that grows to many terabytes or even petabytes of data? Would you want to try to fit that data into a very expensive, hard-to-manage, fully in-memory system spanning many nodes? A modern machine can hold far more SSD/NVMe storage than memory. If you look at the new i3en instance family from Amazon EC2, the i3en.24xlarge has 768 GiB of RAM and 60 TB of NVMe storage (8 x 7.5 TB). That kind of machine works very well with Aerospike, as it stores only the indexes in memory. A very large amount of data can be stored on a small cluster of such dense nodes and still perform exceptionally well.
Aerospike is used in the real world in clusters that have grown to hundreds of terabytes or even petabytes of data (tens to hundreds of billions of objects), serving millions of operations per-second, and still hitting sub-millisecond to single digit millisecond latencies. See https://www.aerospike.com/summit/ for several talks on that topic.
Another aspect affecting (1) is that the performance of a single Redis instance is misleading if in reality you'll deploy on multiple servers, each running multiple Redis instances. Redis isn't a distributed database the way Aerospike is: it requires application-side sharding (which becomes a bit of a clustering and horizontal-scaling nightmare) or a separate proxy, which often ends up being the bottleneck. It's great that a single shard can do a million operations per second, but if the proxy can't handle the combined throughput, and competes with the shards for CPU and memory, there's more to the performance-at-scale picture than just in-memory versus data-on-SSD.
Unless you're looking at a tiny number of objects or a small amount of data that isn't likely to grow, you should probably compare the two for yourself with a proof-of-concept test.
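A rough sizing sketch shows why "index in memory, data on SSD" changes the economics. The 64-bytes-per-record figure is Aerospike's documented primary-index entry size; the record count and average record size are my own illustrative numbers:

```python
# RAM needed to hold only the primary index vs. the whole dataset.
# 64 bytes/record is Aerospike's documented primary-index entry size;
# the record count and 1 KiB average record size are illustrative.

INDEX_BYTES_PER_RECORD = 64

def index_ram_gib(records: int) -> float:
    return records * INDEX_BYTES_PER_RECORD / 2**30

def data_ram_gib(records: int, avg_record_bytes: int) -> float:
    return records * avg_record_bytes / 2**30

records = 10_000_000_000  # 10 billion objects
print(f"index only: {index_ram_gib(records):,.0f} GiB")       # ~596 GiB
print(f"all in RAM: {data_ram_gib(records, 1024):,.0f} GiB")  # ~9,537 GiB
```

At this (assumed) scale, the index alone fits in one dense node's 768 GiB of RAM, while holding all the data in memory would require a double-digit node count just for capacity.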
We are using a SQL database for single-node storage of roughly 1 hour of high-frequency metrics (several thousand inserts per second). We quickly ran into I/O issues which proper buffering alone would not solve, and we are willing to put time into solving the performance issue.
I suggested switching to a specialised database for handling time series, but my colleague remains pretty skeptical. His argument is that the gain "out of the box" is not guaranteed: he knows SQL well and has already spent time optimizing the storage, and in comparison we don't have any TSDB experience with which to properly optimize one.
My intuition is that a TSDB would be much more efficient even with an out-of-the-box configuration, but I don't have any data to measure this, and internet benchmarks such as InfluxDB's are not trustworthy. We should run our own, except we can't afford to lose time on a dead end or a mediocre improvement.
Very roughly, in my use case, what would the performance gap be between relational storage and a TSDB in terms of single-node throughput?
This question may be bordering on a software recommendation. I just want to point one important thing out: You have an existing code base so switching to another data store is expensive in terms of development costs and time. If you have someone experienced with the current technology, you are probably better off with a good-faith effort to make that technology work.
Whether you switch or not depends on the actual requirements of your application. For instance, if you don't need the data immediately, perhaps writing batches to a file is the most efficient mechanism.
Your infrastructure has ample opportunity for in-place growth -- more memory, more processors, solid-state disk (for example). These might meet your performance needs with a minimal amount of effort.
If you cannot make the current solution work (and 10k inserts per second should be quite feasible), then there are numerous alternatives. Some NoSQL databases relax some of the strict ACID guarantees of traditional RDBMSs, providing faster throughput.
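Before switching engines, it's worth confirming that inserts are actually batched, since one commit per row is the most common cause of insert-rate I/O trouble. A sketch using SQLite as a stand-in (the table and column names are illustrative, not from the question):

```python
# Batching inserts: one transaction for the whole batch means a single
# fsync/commit instead of one per row, which is usually the first fix
# for single-node insert-rate I/O problems. Schema is illustrative.
import random
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (ts REAL, name TEXT, value REAL)")

rows = [(time.time(), "cpu", random.random()) for _ in range(10_000)]

# The context manager wraps the whole batch in one transaction.
with conn:
    conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]
print(count)  # 10000
```

If the workload is already batched like this and still I/O-bound, that strengthens the case for a proof-of-concept on a TSDB, whose append-optimized storage layout targets exactly this write pattern.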
Can you give some explanation of how to interpret deviation vs. throughput?
Does a deviation result of 1000+ mean poor performance of the web application under test?
And how can you tell that the web application under test performs well? Is it based on the throughput result? How?
Also, which listener is best for tracing the load/performance of a thousand users?
Lastly, is it possible to check the CPU/RAM usage of the server while performing the test?
Standard deviation quantifies how much response time varies around its mean (average). It is not advisable to judge system performance on standard deviation alone; what it really tells you is how much the system is fluctuating. Deviation should be minimal, i.e., less than about 5% of the mean.
Throughput is defined as the number of requests processed per second.
It is better to use throughput as a factor for judging system/application performance. Higher throughput generally means better system performance, but this depends on your requirements: for some critical systems, low response time matters more than high throughput. Throughput simply states how many concurrent transactions your system can process per second, which might come with high response times; if response time increases beyond a certain limit, the system becomes a candidate for performance tuning.
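The "deviation should be under about 5% of the mean" rule of thumb is just the coefficient of variation, which you can compute from raw listener data. A sketch with made-up sample timings:

```python
# Judging response-time stability via the coefficient of variation
# (std dev / mean). The sample timings (ms) are made up.
import statistics

response_ms = [210, 195, 205, 220, 198, 202, 215, 207]

mean = statistics.mean(response_ms)
stdev = statistics.pstdev(response_ms)  # population std dev
cov = stdev / mean                      # as a fraction of the mean

print(f"mean={mean:.1f} ms, stdev={stdev:.1f} ms, CoV={cov:.1%}")
```

A large CoV (well above ~5%) signals unstable response times even when the mean looks acceptable; this is also why a raw deviation number like "1000" is meaningless without the mean it is measured against.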
You can use either the Summary Report or the Aggregate Report listener.
For CPU/RAM usage, you can use the "jp@gc - PerfMon Metrics Collector" from the JMeter Plugins project.
Hope this helps.
Well, not much to ask apart from the question: what do you mean when you say an OLTP DB must have high throughput?
Going to the wiki.
"In communication networks, such as Ethernet or packet radio, throughput or network throughput is the average rate of successful message delivery over a communication channel. This data may be delivered over a physical or logical link, or pass through a certain network node. The throughput is usually measured in bits per second (bit/s or bps), and sometimes in data packets per second or data packets per time slot."
So does this mean OLTP databases need to have a high/quick insertion rate (i.e., avoiding deadlocks, etc.)?
I was always under the impression that a database for, say, the airline industry must have quick insertion, but at the same time quick response times, since both are critical to its operation. And in many ways, shouldn't this be limited by the protocol involved in delivering the message/data to the database?
I am not trying to single out the "only" characteristic of OLTP systems. In general, I would like to understand what characteristics are inherent to an OLTP system.
Cheers!
In general, when you're talking about the "throughput" of an OLTP database, you're talking about the number of transactions per second. How many orders can the system take a second, how many web page requests can it service, how many customer inquiries can it handle. That tends to go hand-in-hand with discussions about how the OLTP system scales-- if you double the number of customers hitting your site every month because the business is taking off, for example, will the OLTP systems be able to handle the increased throughput.
That is in contrast to OLAP/ DSS systems which are designed to run a relatively small number of transactions over much larger data volumes. There, you're worried far less about the number of transactions you can do than about how those transactions slow down as you add more data. If you're that wildly successful company, you probably want the same number and frequency of product sales by region reports out of your OLAP system as you generate exponentially more sales. But you now have exponentially more data to crunch which requires that you tune the database just to keep report performance constant.
Throughput doesn't have a single, fixed meaning in this context. Loosely, it means the number of transactions per second, but "write" transactions are different than "read" transactions, and sustained rates are different than peak rates. (And, of course, a 10-byte row is different than a 1000-byte row.)
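The transactions-per-second definition is easy to pin down empirically: commit one small write per "request" and time a sustained run. A sketch against SQLite (the schema and workload are illustrative; real OLTP benchmarks like TPC-C use far richer transactions):

```python
# Measuring sustained write throughput (transactions/second).
# One commit per insert models one OLTP "transaction" per request;
# the orders table and row contents are illustrative.
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

n = 5_000
start = time.perf_counter()
for i in range(n):
    with conn:  # commit each insert as its own transaction
        conn.execute("INSERT INTO orders VALUES (?, ?)", (i, 9.99))
elapsed = time.perf_counter() - start

print(f"{n / elapsed:,.0f} transactions/second over {elapsed:.2f}s")
```

Swapping the per-row commit for one batch commit illustrates the write-vs-read and sustained-vs-peak distinctions above: the same hardware reports wildly different "throughput" depending on what you count as a transaction.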
I stumbled on Performance Metrics & Benchmarks: Berkeley DB the other day when I was looking for something else. It's not a bad introduction to the different ways of measuring "how fast". Also, this article on database benchmarks is an entertaining read.