Grafana 2, collectd - issues with graphs

So I have collectd running on some servers, and they are sending their data back to InfluxDB. InfluxDB is storing the data, and Grafana 2 is configured with InfluxDB as its data backend. Some graphs work fine, such as load average, but others don't graph properly, such as interface statistics (see picture):
http://i.imgur.com/YgIxBE1.png
I'm guessing this is because load average is stored like so:
timestamp1: $current_load_average (ex. 1.2)
timestamp2: $current_load_average (ex. 1.1)
And interface statistics are stored like so:
timestamp1: $bytes_transfered_so_far (ex. 1002)
timestamp2: $bytes_transfered_so_far (ex. 1034)
So Grafana just graphs the total bytes that have been transferred over that interface but not the bytes/second that I need. With the same setup - when collectd was writing to RRD files and they were being graphed by several interfaces - it all worked as expected.
Can you advise what I should look into or change?

Grafana query might look like this:
For counters that constantly increase, you're interested in their derivative over a time window. Depending on your graph resolution (whether you're looking at the last day or the last hour), you should choose an appropriate window, one in which all possible peaks remain visible.
You can use either:
Transformation > derivative()
Transformation > non_negative_derivative()
The latter is useful in cases where you want to omit negative values from your chart, for example the drop that appears when a counter resets.
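As a rough illustration of what a derivative transformation computes, here is a minimal Python sketch. It is not Grafana or InfluxDB code: the function name non_negative_rate and the sample values are hypothetical. It turns cumulative counter samples into per-second rates and skips intervals where the counter went backwards.

```python
from typing import List, Tuple


def non_negative_rate(samples: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Convert (timestamp_seconds, counter_value) samples into (timestamp, rate_per_second).

    Intervals with a negative difference (e.g. a counter reset) are skipped,
    which mirrors the idea behind a non-negative derivative.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # ignore duplicate or out-of-order timestamps
        rate = (v1 - v0) / dt
        if rate >= 0:
            rates.append((t1, rate))
    return rates


# Hypothetical interface byte counter sampled every 10 seconds, with a reset at t=30.
samples = [(0, 1002), (10, 1034), (20, 1134), (30, 5), (40, 25)]
rates = non_negative_rate(samples)
```

With a plain derivative the interval ending at t=30 would show up as a large negative spike; here it is simply left out.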

Sum over unix timestamps in grafana prometheus

Let's say I have my_val for a value in Prometheus that is recorded when I use a GPU instance. I want to sum up how many hours of GPU usage I had in the past week. I can call timestamp(my_val{instance="$instance"}), which returns a vector of timestamps, but I cannot call sum(idelta) over them because it is an instant vector for some reason.
Grafana also changes the amount of data requested based on how far I am zoomed in.
How do I create a reliable call for every data point?
Try the following query:
count_over_time(my_val[7d]) * <scrape_interval> / 3600
where <scrape_interval> is the scrape interval in seconds set in the Prometheus config file for the target that exports the my_val metric.
This query assumes that my_val doesn't contain data points for periods when the GPU instance wasn't used.
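The arithmetic behind that query can be checked by hand. The sketch below is plain Python, not PromQL; the function name usage_hours and the numbers (2400 samples, 15-second scrape interval) are made up for illustration.

```python
def usage_hours(sample_count: int, scrape_interval_seconds: float) -> float:
    """Each stored sample stands for one scrape interval of usage."""
    return sample_count * scrape_interval_seconds / 3600


# Hypothetical: count_over_time(my_val[7d]) returned 2400 with a 15 s scrape interval.
hours = usage_hours(2400, 15)
```

Here 2400 samples times 15 seconds is 36000 seconds, i.e. 10 hours of usage over the week.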

Best practice to store List<T> in StackExchange.Redis

I am trying to find the best-practice (most efficient) way of storing a set of List objects against a ReportingDate key.
The list could be serialised as XML/DataContract or ProtoBuf.
Given that some of the data could be big (for that slice of the key):
Is there any way of getting data from the Redis cache in an IEnumerable/streamed fashion? At the moment we are using ProtoBuf.NET for a file-based cache, and we retrieve data into memory in a streamed fashion (we also have the option of selecting which properties/fields we want in that T object, as ProtoBuf allows us to do that).
Is there any way to force (after some inactivity) a certain part of the data to be offloaded from memory back into a file if it is not being used, and to load it up again when it is called?
Thanks
It sounds like you want a sorted set - see https://redis.io/topics/data-types#sorted-sets. You would use the date as the score, perhaps in epoch time (since it needs to be a number). SE.Redis supports all the operations you would expect to get ranges of values (either positional ranges - the first 20 records, etc.; or absolute ranges based on the score - all items between two dates expressed in the same unit). Look at the methods starting "SortedSet...".
The value can be binary, so protobuf-net is fine (you would serialize the value for each date separately). Just pass a byte[] as the value. You need to handle serialization separately from the Redis library.
As for swapping data out: no. Redis has date-based expiration, but doesn't have hot and cold storage. It is either there, or it isn't. You could use scheduled tasks to purge or move data based on date ranges, again using any of the Z* (redis) or SortedSet* (SE.Redis) methods.
For the complete list of Z* operations, see: https://redis.io/commands#sorted_set. They should all be available in SE.Redis.
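To make the sorted-set idea concrete, here is a small sketch in Python rather than C#/SE.Redis. It assumes a client object with zadd(name, mapping) and zrangebyscore(name, min, max) methods, which is how the redis-py client exposes ZADD and ZRANGEBYSCORE. So that the example runs without a server, it uses InMemorySortedSets, a hypothetical stand-in written for this sketch; the helper names, the key "reports" and the payloads are likewise made up.

```python
import bisect
from datetime import datetime, timezone


class InMemorySortedSets:
    """Hypothetical stand-in mimicking the two sorted-set calls used below."""

    def __init__(self):
        self._data = {}  # key -> list of (score, member), kept sorted by score

    def zadd(self, name, mapping):
        entries = self._data.setdefault(name, [])
        for member, score in mapping.items():
            # Members are unique within a sorted set: re-adding one moves its score.
            entries[:] = [e for e in entries if e[1] != member]
            bisect.insort(entries, (score, member))

    def zrangebyscore(self, name, min, max):
        return [m for s, m in self._data.get(name, []) if min <= s <= max]


def epoch_seconds(day: datetime) -> float:
    """Use the reporting date, as UTC epoch seconds, for the score."""
    return day.replace(tzinfo=timezone.utc).timestamp()


def store_report(client, key: str, reporting_date: datetime, payload: bytes) -> None:
    # payload is the already-serialized list (e.g. protobuf bytes).
    client.zadd(key, {payload: epoch_seconds(reporting_date)})


def load_reports(client, key: str, start: datetime, end: datetime):
    return client.zrangebyscore(key, epoch_seconds(start), epoch_seconds(end))


client = InMemorySortedSets()
store_report(client, "reports", datetime(2015, 1, 1), b"serialized-list-A")
store_report(client, "reports", datetime(2015, 1, 2), b"serialized-list-B")
store_report(client, "reports", datetime(2015, 2, 1), b"serialized-list-C")
january = load_reports(client, "reports", datetime(2015, 1, 1), datetime(2015, 1, 31))
```

One caveat of this layout: because sorted-set members are unique, two dates whose serialized payloads are byte-for-byte identical would collapse into a single entry, so in practice the payload may need to include the date itself.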

How to create inhouse funnel analytics?

I want to create in-house funnel analysis infrastructure.
All the user activity feed information would be written to a database / DW of choice and then, when I dynamically define a funnel I want to be able to select the count of sessions for each stage in the funnel.
I can't find an example of creating such a thing anywhere. Some people say I should use Hadoop and MapReduce for this but I couldn't find any examples online.
Your MapReduce is pretty simple:
The Mapper reads a session's row from the log file; its output is (stage-id, 1).
Set the number of Reducers equal to the number of stages.
The Reducer sums the values for each stage, as in the WordCount example (the "Hello World" of Hadoop - https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0).
You will have to set up a Hadoop cluster (or use Elastic Map Reduce on Amazon).
To define the funnel dynamically you can use the DistributedCache feature of Hadoop. To see results you will have to wait for the MapReduce job to finish (at least dozens of seconds, or minutes in the case of Amazon's Elastic MapReduce; the time depends on the amount of data and the size of your cluster).
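The mapper/reducer logic described above can be sketched in a few lines of plain Python. This only simulates the data flow in a single process and is not Hadoop code; the row layout (session_id, stage_id) and the stage names are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def mapper(log_rows: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, int]]:
    """Emit (stage_id, 1) for every (session_id, stage_id) log row."""
    for _session_id, stage_id in log_rows:
        yield stage_id, 1


def reducer(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum the emitted values per stage, as in WordCount."""
    totals: Dict[str, int] = defaultdict(int)
    for stage_id, count in pairs:
        totals[stage_id] += count
    return dict(totals)


# Hypothetical activity log: one row per (session, stage reached).
rows = [
    ("s1", "landing"),
    ("s2", "landing"),
    ("s1", "signup"),
    ("s3", "landing"),
    ("s1", "purchase"),
]
stage_counts = reducer(mapper(rows))
```

Note that this counts rows per stage. If a session can log the same stage more than once, the rows would need to be de-duplicated first to get a count of sessions.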
Another solution that may give you results faster is to use a database: select stage, count(distinct session_id) from mylogs group by stage;
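The database approach can be tried out locally with Python's built-in sqlite3 module. This is only a sketch: the table and column names (mylogs, session_id, stage) follow the query in the answer, the rows are invented, and the query is written in standard clause order (SELECT ... FROM ... GROUP BY).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mylogs (session_id TEXT, stage TEXT)")
conn.executemany(
    "INSERT INTO mylogs VALUES (?, ?)",
    [
        ("s1", "landing"),
        ("s1", "landing"),  # repeated visit, counted once by DISTINCT
        ("s2", "landing"),
        ("s1", "signup"),
    ],
)

# Count distinct sessions per funnel stage.
rows = conn.execute(
    "SELECT stage, COUNT(DISTINCT session_id) "
    "FROM mylogs GROUP BY stage ORDER BY stage"
).fetchall()
conn.close()
```

The ORDER BY is only there to make the result order predictable in the example.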
If you have too much data to quickly execute that query (it does a full table scan; HDD transfer rate is about 50-150MB/sec - the math is simple) - then you can use a distributed analytic database that runs over HDFS (distributed file system of Hadoop).
In this case your options are (I list here open-source projects only):
Apache Hive (based on MapReduce of Hadoop, but if you convert your data to Hive's ORC format - you will get results much faster).
Cloudera's Impala - not based on MapReduce, can return your results in seconds. For fastest results convert your data to Parquet format.
Shark/Spark - in-memory distributed database.

Classify data using mahout

I'm new to Apache Mahout and working on a classification problem.
The Problem states:
There exists a set of data in a text file and I need to fetch some or all of the data from the file depending upon the given span of time.
Span of time : Each record would have a Date of transaction.
So, time span would be calculated using the logic (Sys_Date - Transaction_Date).
Thus, output would vary depending upon whether data is required for last month / week / specific number of days.
How can this filtering be achieved using Apache Mahout?
This by itself does not sound like a machine learning problem at all. You want to put your data in a database of some kind and query for records in a date range. Then, you want to do something with that data. This is not something ML tools do.
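For a small text file, the date-range filter does not even need a database. Below is a minimal Python sketch of the (Sys_Date - Transaction_Date) logic from the question; the record layout, the function name within_span and the dates are assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import Dict, List


def within_span(records: List[Dict], today: date, days: int) -> List[Dict]:
    """Keep records whose transaction date falls within the last `days` days."""
    cutoff = today - timedelta(days=days)
    return [r for r in records if cutoff <= r["transaction_date"] <= today]


# Hypothetical records parsed from the text file.
records = [
    {"id": 1, "transaction_date": date(2013, 5, 1)},
    {"id": 2, "transaction_date": date(2013, 5, 25)},
    {"id": 3, "transaction_date": date(2013, 5, 30)},
]
recent = within_span(records, today=date(2013, 5, 31), days=7)
```

The filtered records can then be handed to whatever does the actual classification.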
I haven't been working properly with hadoop yet. But it seems to me that this video should help:
http://www.youtube.com/watch?v=KwW7bQRykHI&feature=player_embedded
After the filtering, you can use the result in Mahout (to solve the classification problem).

Caching of Map applications in Hadoop MapReduce?

Looking at the combination of MapReduce and HBase from a data-flow perspective, my problem seems to fit. I have a large set of documents which I want to Map, Combine and Reduce. My previous SQL implementation was to split the task into batch operations, cumulatively storing what would be the result of the Map into a table and then performing the equivalent of a Reduce. This had the benefit that at any point during execution (or between executions), I had the results of the Map at that point in time.
As I understand it, running this job as a MapReduce would require all of the Map functions to run each time.
My Map functions (and indeed any function) always give the same output for a given input. There is simply no point in recalculating output if I don't have to. My input (a set of documents) will be continually growing, and I will run my MapReduce operation periodically over the data. Between executions I should only really have to calculate the Map functions for newly added documents.
My data will probably be HBase -> MapReduce -> HBase. Given that Hadoop is a whole ecosystem, it may be able to know that a given function has been applied to a row with a given identity. I'm assuming immutable entries in the HBase table. Does / can Hadoop take account of this?
I'm aware from the documentation (especially the Cloudera videos) that recalculation (of potentially redundant data) can be quicker than persisting and retrieving for the class of problem that Hadoop is being used for.
Any comments / answers?
If you're looking to avoid running the Map step each time, break it out as its own step (either by using the IdentityReducer or setting the number of reducers for the job to 0) and run later steps using the output of your map step.
Whether this is actually faster than recomputing from the raw data each time depends on the volume and shape of the input data vs. the output data, how complicated your map step is, etc.
Note that running your mapper on new data sets won't append to previous runs - but you can get around this by using a dated output folder. This is to say that you could store the output of mapping your first batch of files in my_mapper_output/20091101, and the next week's batch in my_mapper_output/20091108, etc. If you want to reduce over the whole set, you should be able to pass in my_mapper_output as the input folder, and catch all of the output sets.
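The dated-folder layout can be sketched with the local filesystem standing in for HDFS. This is only an illustration in Python: the helper names are hypothetical, and the part-00000 files are placeholders rather than real mapper output.

```python
import tempfile
from datetime import date
from pathlib import Path
from typing import List


def dated_output_dir(base: Path, run_date: date) -> Path:
    """Build e.g. my_mapper_output/20091101 for a given batch date."""
    return base / run_date.strftime("%Y%m%d")


def all_map_outputs(base: Path) -> List[Path]:
    """Collect every output file from every dated run under the base folder."""
    return sorted(p for p in base.glob("*/*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "my_mapper_output"
    for run_date in (date(2009, 11, 1), date(2009, 11, 8)):
        out_dir = dated_output_dir(base, run_date)
        out_dir.mkdir(parents=True)
        # Placeholder standing in for one batch of mapper output.
        (out_dir / "part-00000").write_text("placeholder map output\n")
    found = [p.relative_to(base).as_posix() for p in all_map_outputs(base)]
```

Each weekly run writes to its own folder, so earlier map output is never overwritten, and a later reduce step can read the whole set by starting from the base folder.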
Why not apply your SQL workflow in a different environment? Meaning, add a "processed" column to your input table. When time comes to run a summary, run a pipeline that goes something like:
map (map_function) on (input table filtered by !processed); store into map_outputs either in hbase or simply hdfs.
map (reduce function) on (map_outputs); store into hbase.
You can make life a little easier, assuming you are storing your data in HBase sorted by insertion date, if you record the timestamps of successful summary runs somewhere and restrict the filter to inputs dated later than the last successful summary - you'll save significant scanning time.
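The "only map what is new since the last successful run" idea can be sketched in plain Python, with a list of dicts standing in for the input table. The field names (inserted_at, doc), the function names and the word-count map function are all assumptions for the example, not HBase or Hadoop APIs.

```python
from typing import Callable, Dict, List, Tuple


def new_rows_since(rows: List[Dict], last_success_ts: int) -> List[Dict]:
    """Select only rows inserted after the last successful summary run."""
    return [r for r in rows if r["inserted_at"] > last_success_ts]


def run_summary(
    rows: List[Dict], last_success_ts: int, map_function: Callable[[str], int]
) -> Tuple[List[int], int]:
    """Map the new rows and return (map outputs, new timestamp to record)."""
    pending = new_rows_since(rows, last_success_ts)
    outputs = [map_function(r["doc"]) for r in pending]
    # If nothing was new, keep the previous timestamp.
    new_ts = max((r["inserted_at"] for r in pending), default=last_success_ts)
    return outputs, new_ts


# Hypothetical input table; the row with inserted_at=100 was handled by an earlier run.
rows = [
    {"inserted_at": 100, "doc": "a b"},
    {"inserted_at": 200, "doc": "c"},
    {"inserted_at": 300, "doc": "d e f"},
]
outputs, watermark = run_summary(
    rows, last_success_ts=100, map_function=lambda doc: len(doc.split())
)
```

The returned timestamp would be stored only after the run succeeds, so a failed run is simply retried from the previous one.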
Here's an interesting presentation that shows how one company architected their workflow (although they do not use Hbase):
http://www.scribd.com/doc/20971412/Hadoop-World-Production-Deep-Dive-with-High-Availability