InfluxDB max available expiration and performance concerns - backup

I am building my metrics on InfluxDB. I want to keep the data forever, so my retention policy is set to inf and my shard retention policy is set to 100 years (the max I could set).
My main concern is performance degrading as this data accumulates. I will not have more than 100,000 series (as advised for low server specs).
Will there be an impact on the memory used for indexing? More specifically, on the memory InfluxDB uses regardless of any actions being issued, such as queries or continuous queries?
Also, if there is a performance problem, is it possible to back up only the data that is about to be deleted?

According to the InfluxDB hardware sizing guidelines, a single-node InfluxDB under moderate load, deployed on a server with 6 CPU cores and 8-32 GB of RAM, can handle 250k writes per second and about 25 queries per second. These numbers will definitely meet your requirements, and adding CPU and RAM will give you better performance.
Note: if your workload grows in the future, you can also use a continuous query to down-sample old data, or export part of the data to a backup file.
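As a rough illustration of the down-sampling idea, here is a minimal Python sketch that only builds the InfluxQL (1.x) statement text for a continuous query. The database, measurement and field names ("metrics", "cpu_load", "value") and the helper itself are hypothetical, and "autogen" is assumed to be the target retention policy; adjust them to your own schema before running the statement against InfluxDB.

```python
def build_downsampling_cq(database, measurement, field,
                          interval="1h", target_rp="autogen"):
    """Build an InfluxQL 1.x CREATE CONTINUOUS QUERY statement as a string.

    All names passed in are placeholders; nothing is sent to a server here.
    """
    cq_name = f"cq_{measurement}_{interval}"
    target_measurement = f"{measurement}_{interval}"
    return (
        f'CREATE CONTINUOUS QUERY "{cq_name}" ON "{database}" '
        f'BEGIN SELECT mean("{field}") AS "{field}" '
        f'INTO "{target_rp}"."{target_measurement}" '
        f'FROM "{measurement}" GROUP BY time({interval}), * END'
    )


# Hypothetical names, used only to show the shape of the statement.
statement = build_downsampling_cq("metrics", "cpu_load", "value")
print(statement)
```

The "GROUP BY time(...), *" part keeps the tags of the original series, so the down-sampled measurement can still be filtered the same way.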

Related

How to estimate the maximum number of reads and writes per second a RDBMS server can handle?

Before spinning up an actual (MySQL, Postgres, etc) database, are there ways to estimate how many reads & writes per second the database can handle?
I'm assuming this is dependent on the CPU and memory (+ network if we're sharding), but is there a good best practice on how to put these variables together?
This is useful for estimating cost and understanding how much of a traffic spike the db can handle.
You can learn from others to gauge transactions per second you'll get from certain instances. For example, https://aiven.io/blog/postgresql-12-gcp-aws-performance gives you a good idea of how PostgreSQL 12 performs.
Percona has blogged about performance benchmarks also: https://www.percona.com/blog/2017/01/06/millions-queries-per-second-postgresql-and-mysql-peaceful-battle-at-modern-demanding-workloads/
Here's another benchmark with useful information: http://dimitrik.free.fr/blog/posts/mysql-performance-80-and-sysbench-oltp_rw-updatenokey.html about MySQL 8.0 and links to 5.7 performance.
There are several blogs about SQL Server performance such as https://storagehub.vmware.com/t/microsoft-sql-server-2017-database-on-vmware-vsan-tm-6-7-using-vmware-cloud-foundation-tm/performance-test-results/ that can also help you recognize the workloads these databases can handle.
Under 10K tps shouldn't be much of a problem with modern hardware. You can start with one of the most common configurations on the cloud or a standard-sized server in your own environment. Use SSDs. Optimize your server settings to gain more speed and be ready to add more resources gradually. As Gordon mentions, benchmark your database after you have installed it. As a rule of thumb, I'd start with 32G memory, 8 cores and SSDs to pull 10K tps and adjust from there.
As you assumed, a lot depends on the number and type of CPUs/memory/SSDs, your workload, how you structure data, latency between your app and database, reporting happening against the database, master/slave configuration, types of transactions, storage engines etc.
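For a first back-of-envelope number before any benchmark, one option is Little's law (throughput = concurrency / latency). The sketch below is only that: the connection count and average latency are made-up inputs, the function names are hypothetical, and the result is an upper bound that assumes the server is not already saturated. It does not replace benchmarking the installed database.

```python
import math


def estimate_max_tps(concurrent_connections, avg_latency_ms):
    """Little's law: throughput = concurrency / latency (latency in ms)."""
    if avg_latency_ms <= 0:
        raise ValueError("avg_latency_ms must be positive")
    return concurrent_connections * 1000.0 / avg_latency_ms


def connections_needed(target_tps, avg_latency_ms):
    """Inverse view: busy connections required to sustain a target rate."""
    return math.ceil(target_tps * avg_latency_ms / 1000.0)


# Assumed inputs: 50 busy connections, 5 ms average transaction latency.
max_tps = estimate_max_tps(50, 5)
needed = connections_needed(10_000, 5)
print(max_tps, needed)
```

If the measured latency rises under load, plug the new value back in; the estimate drops accordingly, which is a quick way to see how much of a spike a given setup could absorb.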

Are there any in-memory (persistent) solutions faster than Aerospike for a single-node?

I am working on a cloud application that requires low latency and very high read/writes per second. I will only have around 1 million records stored persistently but this may fluctuate largely as the application runs.
After benchmarking Aerospike and Redis with YCSB, I found that Aerospike outperforms both Redis and MongoDB on a single node for a 60/40 read/write mix.
Some points to note:
Fetching all my data using a single 32-bit integer key (no advanced queries)
Running on a single machine with 8 GB RAM and an SSD (small number of records)
Multiple clients need access to the database at once (via LAN)
I'm also assuming that key-value stores will outperform document stores and are the best fit considering I do not need advanced queries.
Before committing myself to Aerospike, are there any other solutions that may be a better fit for my scenario, considering that I am only running a single node with a small-ish number of records?
Not that I'm aware of. I think Aerospike is the fastest.
However, for some use cases you can consider Tarantool.
Here's one of the benchmarks: https://medium.com/@rvncerr/tarantool-vs-competitors-racing-in-microsoft-azure-ebde9c5d619

Optimizing write performance of a 3 Node 8 Core/16G Cassandra cluster

We have set up a 3-node performance cluster with 16G RAM and 8 cores per node. Our use case is to write 1 million rows to a single table with 101 columns, and the write operation currently takes 57-58 minutes. What should be our first steps towards optimizing the write performance on our cluster?
The first thing I would do is look at the application that is performing the writes:
What language is the application written in and what driver is it using? Some drivers can offer better inherent performance than others. For example, the Python, Ruby, and Node.js drivers may only make use of one thread, so running multiple instances of your application (1 per core) may be something to consider. Your question is tagged 'spark-cassandra-connector', so that possibly indicates you are using that, which uses the DataStax Java driver, which should perform well as a single instance.
Are your writes asynchronous or are you writing data one at a time? How many writes does it execute concurrently? Too many concurrent writes could cause pressure in Cassandra, but too few could reduce throughput. If you are using the Spark connector, are you using saveToCassandra/saveAsCassandraTable or something else?
Are you using batching? If you are, how many rows are you inserting/updating per batch? Too many rows could put a lot of pressure on Cassandra. Additionally, are all of the inserts/updates within a batch going to the same partition? If they aren't, you should consider grouping them so that each batch targets a single partition.
Spark Connector Specific: You can tune the write settings, like batch size, batch level (i.e. by partition or by replica set), write throughput in mb per core, etc. You can see all these settings here.
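To make the batching point concrete, here is a small pure-Python sketch that groups rows by partition key and caps the batch size, so each batch would touch a single partition. It does not call any Cassandra driver; the row layout, the "sensor_id" key and the batch size of 2 are all hypothetical and only there to show the grouping.

```python
from collections import defaultdict


def batches_by_partition(rows, partition_key, max_batch_size=50):
    """Yield (key, batch) pairs where every row in a batch shares one key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[partition_key(row)].append(row)
    for key, group in grouped.items():
        # Split large groups so no single batch grows beyond the cap.
        for start in range(0, len(group), max_batch_size):
            yield key, group[start:start + max_batch_size]


# Hypothetical rows: 10 readings spread over 3 partition keys.
rows = [{"sensor_id": i % 3, "seq": i} for i in range(10)]
batches = list(
    batches_by_partition(rows, lambda r: r["sensor_id"], max_batch_size=2)
)
for key, batch in batches:
    print(key, [r["seq"] for r in batch])
```

Each resulting batch could then be handed to whatever write path you use; the right batch size is something to find by measuring, not a fixed number.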
The second thing I would look at is the metrics on the Cassandra side, on each individual node.
What do the garbage collection metrics look like? You can enable GC logs by uncommenting lines in conf/cassandra-env.sh (as shown here); see also "Are Your Garbage Collection Logs Speaking to You?". You may need to tune your GC settings, although if you are using an 8GB heap the defaults are usually pretty good.
Do your CPU and disk utilization indicate that your systems are under heavy load? Your hardware or configuration could be constraining your capability (see "Selecting hardware for enterprise implementations").
Commands like nodetool cfhistograms and nodetool proxyhistograms will help you understand how long your requests are taking (proxyhistograms), and cfhistograms (latencies in particular) could give you insight into any other possible disparities between how long it takes to process the request vs. perform mutation operations.

What is the best way to store highly parametrized entities?

Ok, let me try to explain this in more detail.
I am developing a diagnostic system for airplanes. Let's imagine that each airplane has 6 to 8 on-board computers, and each computer has more than 200 different parameters. The diagnostic system receives all these parameters in a binary-formatted package, then I convert the data according to the formulas (to km, km/h, rpm, min, sec, pascals and so on) and must store it somehow in a database. New data must be handled every 10 - 20 seconds and written to persistent storage again.
We store the data for further analytic processing.
Requirements of storage:
support sharding and replication
fast read: support btree-indexing
NOSQL
fast write
I calculated the average disk or RAM usage per plane per day: about 10 - 20 MB of data. So the estimated load is 100 airplanes per day, or up to 2 GB of data per day.
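The arithmetic behind that estimate can be written out as a short Python sketch. The 100 planes and 10 - 20 MB per plane come from the figures above; the one-year horizon, the replication factor of 3 and the use of 1 GB = 1000 MB are assumptions added only for illustration.

```python
def daily_storage_gb(planes, mb_per_plane_per_day):
    """Raw data volume per day in GB, using 1 GB = 1000 MB."""
    return planes * mb_per_plane_per_day / 1000.0


low_gb = daily_storage_gb(100, 10)    # lower bound per day
high_gb = daily_storage_gb(100, 20)   # upper bound per day

# Assumed horizon and replication factor (not from the original estimate).
yearly_high_gb = high_gb * 365
replicated_yearly_gb = yearly_high_gb * 3

print(low_gb, high_gb, yearly_high_gb, replicated_yearly_gb)
```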
It seems that storing all the data in RAM (memcached-like stores: redis, membase) is not suitable (too expensive). So I am now looking at MongoDB: since it can use both RAM and disk, it supports all of the requirements above.
Please share your experience and advice.
There is a helpful article on NOSQL DBMS Comparison.
Also you may find information about the ranking and popularity of them, by category.
Given your requirements, Apache Cassandra seems to be a candidate due to its linear scalability, column indexes, map/reduce, materialized views and powerful built-in caching.

In real terms, how much faster is a production tier Heroku database than the starter database?

A basic production-level database in Heroku comes with a 400 MB cache. I have a production site running 2 dynos and a worker which is pretty heavy on reads and writes. The database is the bottleneck in my app.
A write to the database will invalidate many queries, as searches are performed across the database.
My question is, given the large jump in price between the $9 starter and the $50 first-level production database, would migrating be likely to give a significant performance improvement?
"Faster" is an odd metric here. This implies something like CPU, but CPU isn't always a huge factor in databases, especially if you're not doing heavy writes. Your Basic database has 0 MB of cache, so every query hits disk. Even a 400 MB cache will seem amazing compared to this. Examine your approximate dataset size; a general rule of thumb is for your dataset to fit into cache. Postgres will manage this cache itself, and optimize for the most referenced data.
Ultimately, Heroku Postgres doesn't sell raw performance. The benefits of the Production-tier are multiple, but to name a few: In-memory Cache, Fork/Follow support, 500 available connections, 99.95% expected uptime.
You will definitely see a performance boost from upgrading to a Production-tier plan; however, it's nearly impossible to claim it will be "3x faster" or similar, as this depends on how you're using the database.
It sure is a steep step, so the question really is: how bad is the bottleneck? It will cost you about 40 dollars extra, but once your app runs smoothly again it could also mean more revenue. Of course you could also consider other hosting services, but personally I like Heroku the best (although cheaper options are available). Besides, you are already familiar with Heroku. There is some more information on the Heroku devcenter regarding their different plans:
https://devcenter.heroku.com/articles/heroku-postgres-plans:
Production plans
Non-production applications, or applications with minimal data storage, performance or availability requirements can choose between one of the two starter tier plans, dev and basic, depending on row requirements. However, production applications, or apps that require the features of a production tier database plan, have a variety of plans to choose from. These plans vary primarily by the size of their in-memory data cache.
Cache size
Each production tier plan’s cache size constitutes the total amount of RAM given to Postgres. While a small amount of RAM is used for managing each connection and other tasks, Postgres will take advantage of almost all this RAM for its cache.
Postgres constantly manages the cache of your data: rows you’ve written, indexes you’ve made, and metadata Postgres keeps. When the data needed for a query is entirely in that cache, performance is very fast. Queries made from cached data are often 100-1000x faster than from the full data set.
Well engineered, high performance web applications will have 99% or more of their queries be served from cache.
Conversely, having to fall back to disk is at least an order of magnitude slower. Additionally, columns with large data types (e.g. large text columns) are stored out-of-line via TOAST, and accessing large amounts of TOASTed data can be slow.
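To see why the cache matters so much, here is a small Python sketch of the usual hit-ratio calculation and the average query latency that follows from it. The block counters would typically come from PostgreSQL statistics such as heap_blks_hit and heap_blks_read in pg_statio_user_tables; the numbers used here and the cache/disk latencies (0.1 ms and 10 ms) are invented for illustration, and the function names are hypothetical.

```python
def cache_hit_ratio(blocks_hit, blocks_read):
    """Share of block requests served from cache (0.0 when no traffic)."""
    total = blocks_hit + blocks_read
    return blocks_hit / total if total else 0.0


def average_latency_ms(hit_ratio, cache_ms, disk_ms):
    """Weighted average of cached and on-disk access times."""
    return hit_ratio * cache_ms + (1.0 - hit_ratio) * disk_ms


# Assumed counters and latencies, for illustration only.
ratio = cache_hit_ratio(blocks_hit=990, blocks_read=10)
with_cache = average_latency_ms(ratio, cache_ms=0.1, disk_ms=10.0)
no_cache = average_latency_ms(0.0, cache_ms=0.1, disk_ms=10.0)
print(ratio, with_cache, no_cache)
```

Under these assumed numbers even a 1% miss rate accounts for about half of the average latency, which is why the size of the cache relative to the working set is the main thing to check before choosing a plan.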