RRDtool what use are multiple RRAs? - graphing

I'm trying to implement rrdtool. I've read the various tutorials and got my first database up and running. However, there is something that I don't understand.
What eludes me is why so many of the examples I come across instruct me to create multiple RRAs?
Allow me to explain: Let's say I have a sensor that I wish to monitor. I will want to ultimately see graphs of the sensor data on an hourly, daily, weekly and monthly basis and one that spans (I'm still on the fence on this one) about 1.5 yrs (for visualising seasonal influences).
Now, why would I want to create an RRA for each of these views? Why not just create a database like this (stepsize=300 seconds):
DS:sensor:GAUGE:600:U:U \
RRA:AVERAGE:0.5:1:160000
If I understand correctly, I can then create any graph I desire, for any given period with whatever resolution I need.
What would be the use of all the other RRAs people tell me I need to define?
BTW: I can imagine that in the past this would have been helpful when computing power was more rare. Nowadays, with fast disks, high-speed interfaces and powerful CPUs I guess you don't need the kind of pre-processing that RRAs seem to be designed for.
EDIT:
I'm aware of this page. Although it explains about consolidation very clearly, it is my understanding that rrdtool graph can do this consolidation aswell at the moment the data is graphed. There still appears (to me) no added value in "harvest-time consolidation".

Each RRA is a pre-consolidated set of data points at a specific resolution. This performs two important functions.
Firstly, it saves on disk space. So, if you are interested in high-detail graphs for the last 24h, but only low-detail graphs for the last year, then you do not need to keep the high-detail data for a whole year -- consolidated data will be sufficient. In this way, you can minimise the amount of storage required to hold the data for graph generation (although of course you lose the detail so cant access it if you should want to). Yes, disk is cheap, but if you have a lot of metrics and are keeping low-resolution data for a long time, this can be a surprisingly large amount of space (in our case, it would be in the hundreds of GB)
Secondly, it means that the consolidation work is moved from graphing time to update time. RRDTool generates graphs very quickly, because most of the calculation work is already done in the RRAs at update time, if there is an RRA of the required configuration. If there is no RRA available at the correct resolution, then RRDtool will perform the consolidation on the fly from a high-granularity RRA, but this takes time and CPU. RRDTool graphs are usually generated on the fly by CGI scripts, so this is important, particularly if you expect to have a large number of queries coming in. In your example, using a single 5min RRA to make a 1.5yr graph (where 1pixel would be about 1 day) you would need to read and process 288 times more data in order to generate the graph than if you had a 1-day granularity RRA available!
In short, yes, you could have a single RRA and let the graphing work harder. If your particular implementation needs faster updates and doesnt care about slower graph generation, and you need to keep the detailed data for the entire time, then maybe this is a solution for you, and RRDTool can be used in this way. However, usually, people will optimise for graph generation and disk space, meaning using tiered sets of RRAs with decreasing granularity.

Related

Calculating Production chain using a database (factorio)

I'm playing the game Factorio, where you build a factory.
For the time being, I made a kind-of flowchart using libreoffice calc to calculate how many machines I need to produce a certain material.
Example image from the spreadsheet
Each block has a recipe saved (blue). This recipe includes what and how much it produces and needs and how much time it takes.
It takes the demand from the previous Block (yellow) and, using the recipe, calculates how many machines (green) it needs to fulfill this demand.
Based on the amount of machines it calculates its own demands (orange).
Then the following blocks do the same, until it has reached the last block.
Doing this in a spreadsheet does work, but it is quite a tedious task.
I showed this to my dad, as I'm quite proud of what I made, and he said that maybe a database would be more suitable.
I definitely see its advantages. For example I could easily summarize the final demands of raw resources, or the total power consumption, etc.
So I got myself Microsoft Access, and I'm pretty lost now. I know the basics of Databases and some SQL-Coding, but I'm not quite sure how I would make this.
My first attempt was:
one table for machines. It includes the machines production speed and other relevant stats.
one table for recipes. Each recipe clearly states what it produces, what it needs, the amount of each, and whether or not it is a basic. Basic means that it is a raw resources, i.e. the production chain would end with this.
one table for units. Each unit has a machine, a recipe and an amount. For example I would have one unit using basic assemblers to produce iron gears. This unit also says how many machines there are, so it needs more and produces more.
I did manage to make a query that calculates the total in and outputs of all units based on their machine and recipe, as well as a total energy consumption.
However, that is nowhere near the spreadsheet I made.
For now we can probably set the Graphical overlay aside, that would probably be quite a bit overkill. However what I do want to be able to make:
enter how much I want of a certain resources
based on that entry the database would create a new table. The first entry would be the unit that produces the requested resources. The second would fulfill the firsts demand, the third fulfills the seconds demand, and so on.
So in the end I would end up with a list of units that will produce my requested resource.
I hope someone can help me. There are programs out there that already do this kind of stuff, but I want to do this myself. If this is a problem that a database isn't suited for, then please tell me so.
Thanks for any help!

Rapidly changing large data processing advise

My team has the following dilemma that we need some architectural/resources advise:
Note: Our data is semi-structured
Over-all Task:
We have a semi-large data that we process during the day
each day this "process" get executed 1-5 times a day
each "process" takes anywhere from 30 minutes to 5 hours
semi-large data = ~1 million rows
each row gets updated anywhere from 1-10 times during the process
during this update ALL other rows may change, as we aggregate these rows for UI
What we are doing currently:
our current system is functional, yet expensive and inconsistent
we use SQL db to store all the data and we retrieve/update as process requires
Unsolved problems and desired goals:
since this processes are user triggered we never know when to scale up/down, which causes high spikes and Azure doesnt make it easy to do autoscale based on demand without data warehouse which we are wanting to stay away from because of lack of aggregates and other various "buggy" issues
because of constant IO to the db we hit 100% of DTU when 1 process begins (we are using Azure P1 DB) which of course will force us to grow even larger if multiple processes start at the same time (which is very likely)
yet we understand the cost comes with high compute tasks, we think there is better way to go about this (SQL is about 99% optimized, so much left to do there)
We are looking for some tool that can:
Process large amount of transactions QUICKLY
Can handle constant updates of this large amount of data
supports all major aggregations
is "reasonably" priced (i know this is an arguable keyword, just take it lightly..)
Considered:
Apache Spark
we don't have ton of experience with HDP so any pros/cons here will certainly be useful (does the use case fit the tool??)
ArangoDB
seems promising.. Seems fast and has all aggregations we need..
Azure Data Warehouse
too many various issues we ran into, just didn't work for us.
Any GPU-accelerated compute or some other high-end ideas are also welcome.
Its hard to try them all and compare which one fits the best, as we have a fully functional system and are required to make adjustments to whichever way we go.
Hence, any before hand opinions are welcome, before we pull the trigger.

solutions for cleaning/manipulating big data (currently using Stata)

I'm currently using a 10% sample of a very large dataset (10 vars, over 300m rows) which amounts to over 200 GB of data when stored in .dta format for the full dataset. Stata is able to handle operations like egen, collapse, merging, etc in a reasonable amount of time for the 10% sample when using Stata-MP on a UNIX server with ~50G of RAM and multiple cores.
However, now I want to move on to analyzing the whole sample. Even if I use a machine that has enough RAM to hold the dataset, simply generating a variable takes ages. (I think perhaps the background operations are causing Stata to run into virtual mem)
The problem is also very amenable to parallelization, i.e., the rows in the dataset are independent of each other, so I can just as easily think about the one large dataset as 100 smaller datasets.
Does anybody have any suggestions for how to process/analyze this data or can give me feedback on some suggestions I currently have? I mostly use Stata/SAS/MATLAB so perhaps there are other approaches that I am simply unaware of.
Here are some of my current ideas:
Split the dataset up into smaller datasets and utilize informal parallel processing in Stata. I can run my cleaning/processing/analysis on each partition and then merge the results after without having the store all the intermediate parts.
Use SQL to store the data and also perform some of the data manipulation such as aggregating over certain values. One concern here is that some tasks that Stata can handle fairly easily such as comparing values across time won't work so well in SQL. Also, I'm already running into performance issues when running some queries in SQL on a 30% sample of the data. But perhaps I'm not optimizing by indexing correctly, etc. Also, Shard-Query seems like it could help with this but I have not researched it too thoroughly yet.
R also looks promising, but I'm not sure if it would solve the problem of working with this enormous amount of data.
Thanks to those who have commented and replied. I realized that my problem is similar to this thread. I have re-written some of my data manipulation code in Stata into SQL and the response time is much quicker. I believe I can make large optimization gains by correctly utilizing indexes and using parallel processing via partitions/shards if necessary. After all the data manipulation has been done, I can import that data via ODBC in Stata.
Since you are familiar with Stata there is a well documented FAQ about large data sets in Stata Dealing with Large Datasets: you might find this helpful.
I would clean via columns, splitting those up, running any specific cleaning routines and merge back in later.
Depending on your machine resources, you should be able to hold the individual columns in multiple temporary files using tempfile. Taking care to select only the variables or columns most relevant to your analysis should reduce the size of your set quite a lot.

Where does min/max/average calculation take place?

I have a data logging application. I record 10,000 temperatures every 30 seconds. I need to be able to calculate the min/max/average temperature of each of the 10,000 items over a hourly/daily/weekly basis. Can the min/max/Average calculation be performed on the server or does each document need to be downloaded to the client for the calculation to be performed?
Andrew
Either calculate or store a summary in the DB/ on the server. Keep the original data as well, if this is important.
Calculating a summary early & sending that to the client/ human level, is far more efficient than trucking around 10,000 samples that nobody usually wants to drill into.
A really good summary having average, min, max & standard deviation would be statistically comprehensive for almost all purposes.
When the client really wants, then you can bring down the big dataset (10k samples) and display it.
Definitely you want to calculate it on the server, but there are multiple approaches you might consider:
You could store these in a specific documents that you manually update with each sample. This could work, but you would be putting a lot of stress on a single document, and it could lead to concurrency issues.
You can write a Map/Reduce index to calculate the totals. Every time you write a new document, RavenDB will update your index with the new totals. You can divide total value by total count to get an average, and you can easily use min and max functions. Since you want to view these results by different time intervals, you'll need multiple indexes.
I actually wrote a small demo program that does exactly that. Instead of temperature, it's recording PSI values from simulated pressure gauges. But the concepts are identical. There are a few shortcuts in there that you can probably pick up on if you read the comments closely.
Project Site: Raven Sensors
I wrote this when the current version of RavenDB was 2.0.2261. I haven't updated it in a while, but it should still work and be relevant.
I haven't done much with it yet, but RavenDB 2.5 added a feature called Dynamic Aggregation. It is also exposed through the studio as Dynamic Reporting. Essentially, this does the aggregation at query time. You may find it much easier to express the aggregations you are interested in, but it could be considerably slower than a map-reduce approach. You may want to experiment. The performance difference may come down to how many items are in the set being aggregated.

Algorithmically suggest best node to perform demanding computation

At work we perform demanding numerical computations.
We have a network of several Linux boxes with different processing capabilities. At any given time, there can be anywhere from zero to dozens of people connected to a given box.
I created a script to measure the MFLOPS (Million of Floating Point Operations per Second) using the Linpack Benchmark; it also provides number of cores and memory.
I would like to use this information together with the load average (obtained using the uptime command) to suggest the best computer for performing a demanding computation. In other words, its 3:00pm; I have a meeting in two hours; I need to run a demanding process: what node will get me the answer fastest?
I envision a script which will output a suggestion along the lines of:
SUGGESTED HOSTS (IN ORDER OF PREFERENCE)
HOST1.MYNETWORK
HOST2.MYNETWORK
HOST3.MYNETWORK
Such suggestion should favor fast computers (high MFLOPS) if the load average is low and, as load average increases for a given node, it should favor available nodes instead (i.e., I'd rather run in a slower computer with no users than in an eight-core with forty dudes logged in).
How should I prioritize? What algorithm (rationale) would you use? Again, what I have is:
Load Average (1min, 5min, 15min)
MFLOPS measure
Number of users logged in
RAM (installed and available)
Number of cores (important to normalize the load average)
Any thoughts? Thanks!
You don't have enough data to make an well-informed decision. It sounds as though the scheduling is very volatile: "At any given time, there can be anywhere from zero to dozens of people connected to a given box." So the current load does not necessarily reflect the future load of the machines.
To properly asses what hosts someone should use to minimize computation time would require knowing when the current jobs will terminate. If a powerful machine is about to be done doing most of its jobs, it would be a good candidate even though it currently has a high load.
If you want to guess purely on the current situation, you can do a weighed calculation to find out which hosts have the most MFLOPS available.
MFLOPS available = host's MFLOPS + (number of logical processors - load average)
Sort the hosts by MFLOPS available and suggest them in a descending order.
This formula assumes that the MFLOPS of a host is linearly related to its load average. This might not be exactly true, but it's probably fairly close.
I would favor the most recent load average since it's closer to the current/future situation, whereas, jobs from 15 minutes ago might have completed by now.
Have you considered a distributed approach to computation? Not all computations can be broken up such that more than one cpu can work on them. But perhaps your problem space can benefit from some parallelization. Have a look at Hadoop.
You don't need to know FLOPS. beowulf modules paralell computing center has I go to has the script for sure
PDC operates leading-edge, high-performance computers on a national level. PDC offers easily accessible computational resources that primarily cater to the ...