Best data structure to store temperature readings over time - redis

I'm used to working with SQL databases such as MySQL, Postgres, or MSSQL.
Now I want to play with Redis. I'm working on a small home project that I think is a good fit for getting started with Redis.
I have a machine that reads temperature (indoor and outdoor) and humidity. I need to store the readings in Redis. Can you help me understand the best data structure for this?
Besides these values, I need to store the time (e.g. a Unix timestamp) of each reading so I can plot a graph.
I have installed Redis and read the documentation, so I understand the commands and data types.

Since this is your first Redis project and it's a home project, I'd be careful about being too careful. Here are a couple of ways to consider designing it (NOTE: I only dug deep into Redis this past weekend, so hopefully others will weigh in).
IDEA 1:
Four ordered sets
KEY for sets are "indoor_temps", "outdoor_temps", "indoor_humidity", "outdoor_humidity"
VALUES are the temperatures / humidities
SCORE is the date stored as EPOCH
IDEA 2:
Four types of keys (best shown by example)
datetime_key = /year:2014/month:07/day:12/hour:07/minute:32/second:54
type_keys = [indoor_temps, outdoor_temps, indoor_humidity, outdoor_humidity]
keys are of form type + "/" + datetime_key
values are the temp and humidity itself
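As a minimal sketch of IDEA 2, the key could be built from a Unix timestamp roughly like this. The function name `reading_key` is hypothetical, keying in UTC is an assumption, and the leading "/" of datetime_key is dropped here so that type + "/" + datetime_key does not produce a doubled slash.

```python
from datetime import datetime, timezone

def reading_key(kind: str, ts: int) -> str:
    """Build an IDEA 2 style key such as 'indoor_temps/year:2014/...'."""
    # Assumption: readings are keyed in UTC.
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    datetime_key = dt.strftime(
        "year:%Y/month:%m/day:%d/hour:%H/minute:%M/second:%S"
    )
    return kind + "/" + datetime_key

# Example: a reading taken on 2014-07-12 07:32:54 UTC.
example_ts = int(datetime(2014, 7, 12, 7, 32, 54, tzinfo=timezone.utc).timestamp())
example_key = reading_key("indoor_temps", example_ts)
print(example_key)
```

The resulting string would be the key passed to SET, with the temperature or humidity as the value.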
You probably want to implement some initial design and then work with the data immediately - graph it, do stats, etc. Whatever you plan to do with it. That will expose flaws and if they are major, flush the database and try again. These designs should really only take ~1 hour to implement since the only thing you're really changing is a few Redis commands and some string manipulation to convert the data to keys.

I like Tony's suggestions, but I'll also throw out another possibility.
4 lists
keys are "indoor_temps", "outdoor_temps", "indoor_humidity", "outdoor_humidity"
values are of the form <timestamp>_<reading>, e.g. "1403197981_27.2"
Push items onto the front of the list using LPUSH. Get a set of readings using LRANGE. The list will always be ordered by the time of the reading. Obviously split the value on "_" to get your time and reading...
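A rough sketch of the encode/split step, with a plain Python list standing in for the Redis list so it runs without a server. The helper names `encode_reading` and `decode_reading` are made up for illustration; with a real client the `insert(0, ...)` and slice would be LPUSH and LRANGE calls.

```python
def encode_reading(ts: int, reading: float) -> str:
    """Join timestamp and reading into the '<timestamp>_<reading>' form."""
    return f"{ts}_{reading}"

def decode_reading(item: str) -> tuple[int, float]:
    """Split on the first '_' only, so negative readings like '-5.0' survive."""
    ts, reading = item.split("_", 1)
    return int(ts), float(reading)

readings = []  # stand-in for the Redis list "indoor_temps"
for ts, temp in [(1403197981, 27.2), (1403198041, 27.4)]:
    # LPUSH puts the newest item at index 0.
    readings.insert(0, encode_reading(ts, temp))

# Equivalent in spirit to LRANGE indoor_temps 0 0 (the most recent reading).
latest = [decode_reading(item) for item in readings[:1]]
print(latest)
```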
In all honesty, this will give the same properties as Tony's first example, with slightly worse lookup performance, but better memory usage. I'm guessing for this project you'll be neither memory, nor CPU constrained, so the choice is probably not an issue. That said, if you expect to be saving 100's of thousands or more readings, I would suggest the list unless you want to consume a large portion of your system's memory.
Also, it's a good idea to call EXPIRE on your entries with some reasonable TTL that encompasses the length of time you want to save the readings for. If your plan is to have them live in perpetuity then you may want to look at backing them up to a disk DB over time, and just use Redis as a quick lookup cache for recent readings.

Thanks to all for the answers. I chose this structure:
4 lists: tempIN, tempOut, humidIN and humidOUT
values are: [value]:[timestamp]. For example: "25.4:1403615247"
As wallacer suggested, I want to back up old entries out of Redis.
For the main frontend I only need the last two days of samples.
For example, I can create a Redis RDB snapshot file and "trim" the live lists. This solution is not convenient if, in the future, I want to recover old values.
Do you have any tips on what kind of procedure to adopt to store the data? Maybe use an SQLite DB?
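One possible shape for the archiving step described above, sketched with Python's built-in sqlite3 module and a plain list standing in for a Redis list. The table name `readings`, the function names, and the two-day window constant are all assumptions for illustration, not an established procedure.

```python
import sqlite3

TWO_DAYS = 2 * 24 * 3600  # seconds kept "live" for the frontend

def parse_entry(entry: str) -> tuple[float, int]:
    """Parse the '[value]:[timestamp]' form, e.g. '25.4:1403615247'."""
    value, ts = entry.rsplit(":", 1)
    return float(value), int(ts)

def archive_and_trim(entries, now, conn, series):
    """Copy entries older than two days into SQLite; return the ones to keep."""
    keep = []
    for entry in entries:
        value, ts = parse_entry(entry)
        if now - ts > TWO_DAYS:
            conn.execute(
                "INSERT INTO readings (series, ts, value) VALUES (?, ?, ?)",
                (series, ts, value),
            )
        else:
            keep.append(entry)
    conn.commit()
    return keep

conn = sqlite3.connect(":memory:")  # a file path would be used in practice
conn.execute("CREATE TABLE readings (series TEXT, ts INTEGER, value REAL)")

live = ["25.4:1403615247", "24.9:1403400000"]
live = archive_and_trim(live, now=1403615247, conn=conn, series="tempIN")
archived = conn.execute("SELECT series, ts, value FROM readings").fetchall()
```

Unlike an RDB snapshot, the archived rows stay queryable by time range with ordinary SQL.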

Limitations in using all string columns in BigQuery

I have an input table in BigQuery that has all fields stored as strings. For example, the table looks like this:
name dob age info
"tom" "11/27/2000" "45" "['one', 'two']"
And in the query, I'm currently doing the following
WITH
  table AS (
    SELECT
      "tom" AS name,
      "11/27/2000" AS dob,
      "45" AS age,
      "['one', 'two']" AS info )
SELECT
  EXTRACT(YEAR FROM PARSE_DATE('%m/%d/%Y', dob)) birth_year,
  ANY_VALUE(PARSE_DATE('%m/%d/%Y', dob)) bod,
  ANY_VALUE(name) example_name,
  ANY_VALUE(SAFE_CAST(age AS INT64)) AS age
FROM
  table
GROUP BY
  EXTRACT(YEAR FROM PARSE_DATE('%m/%d/%Y', dob))
Additionally, I tried doing a very basic GROUP BY operation, casting an item to a string vs. not, and I didn't see any performance degradation on a data set of ~1M rows (actually, in this particular case, casting to a string was faster).
Other than it being bad practice to "keep" this all-string table and not convert it into its proper type, what are some of the limitations (either functional or performance-wise) that I would encounter by keeping a table all-string instead of storing it as their proper type. I know there would be a slight increase in size due to storing strings instead of number/date/bool/etc., but what would be the major limitations or performance hits I'd run into if I kept it this way?
Off the top of my head, the only limitations I see are:
Queries would become more complex (though wouldn't really matter if using a query-builder).
A bit more difficult to extract non-string items from array fields.
Inserting data becomes a bit trickier (for example, need to keep track of what the date format is).
But these all seem like very small items that can be worked around. Are there other, "bigger" reasons why using all string fields would be a huge limitation, either in limiting query-ability or in causing a huge performance hit in various cases?
First of all, I don't really see any bigger show-stoppers than those you already know and listed.
Meanwhile, regarding this excerpt:
"though wouldn't really matter if using a query-builder ..."
I wanted to touch on one aspect of this approach (storing everything as strings).
While we are usually concerned about CASTing from string to a native type in order to apply the relevant functions, I realized that building a complex, generic query with some sort of query builder in some cases requires the opposite: casting a native type to string in order to apply a function like STRING_AGG, just as a quick example.
So, my thoughts are:
When a table is designed for direct user access with trivial or even complex queries, having native types is beneficial both performance-wise and in being friendlier for users to understand.
Meanwhile, if you are developing your own query builder and you design the table so that users query it through that builder with some generic logic implemented, having all fields as strings can help in building the query builder itself.
So it is a balance: you can lose a little in performance but win in being able to implement a generic query builder more easily. That balance depends on the nature of your business, both from a data perspective and in terms of what kind of queries you envision supporting.
Note: your question is quite broad and opinion-based (which, by the way, is not well received on SO), so obviously my answer is entirely my opinion, though it is based on quite a lot of experience with BigQuery.
Are you OK to store string "33/02/2000" as a date in one row and "21st of December 2012" in another row and "22ое октября 2013" in another row?
Are you OK to store string "45" as age in one row and "young" in another row?
Are you OK when age "10" is less than age "9"?
Data types provide some basic data validation mechanism at the database level.
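The "10" versus "9" point can be seen with a tiny experiment. This uses Python strings as a stand-in, on the assumption that the database compares strings character by character in the same way; the variable names are just for illustration.

```python
ages_as_strings = ["9", "10", "100", "25"]

# Strings compare character by character, so "10" sorts before "9".
lexicographic = sorted(ages_as_strings)

# Converting to integers first gives the order a human expects.
numeric = sorted(ages_as_strings, key=int)

ten_less_than_nine = "10" < "9"

print(lexicographic)
print(numeric)
print(ten_less_than_nine)
```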
Do BigQuery databases have a notion of indexes?
If so, those indexes most likely become useless as soon as you start casting your strings to proper types, such as
SELECT
...
WHERE
age > 10 AND age < 30
vs
SELECT
...
WHERE
SAFE_CAST(age AS INT64) > 10
AND SAFE_CAST(age AS INT64) < 30
It is normal that with fewer columns/rows you don't feel the problems. You start to feel them when your data gets huge.
Major concerns:
Maintenance of the code: Think of future requirements that you may receive. Every conversion for data manipulation will add extra complexity to your code. For example, if your customer asks to retrieve teenagers in the future, you'll need to convert the string to a date to get the age before you can do the manipulation.
Data size: The data size has broader impacts that cannot be seen at the start. For example, if you have N parallel test teams that each require their own test system, you'll need to allocate more disk space.
Read performance: When you have more bytes to read in huge tables, it will cost you considerable time. For example, telco operators typically have a couple of billion rows of data per month.
If your code complexity increases, you'll need to replicate conversions in multiple places.
Even a single one of the above items should push you away from using strings for everything.
I would think the biggest issue with this would be if there are other users of this table/data, for instance if someone is trying to write reports with it and do calculations or charts or date ranges it could be a big headache having to always cast or convert the data with whatever tool they are using. You or someone would likely get a lot of complaints about it.
And if someone decided to build a layer between this data and the reporting tool which converted all of the data, then you may as well just do it one time to the table/data and be done with it.
With the all-string approach, you might face some storage and performance problems; the official documentation offers some guidance on this.
The main performance problem will come from the CAST operation: remember that the BigQuery engine will have to perform a CAST for every value in every row.
In order to test the compute cost of these operations, I used the following query:
SELECT
street_number
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
Inspecting the stages executed in the execution details we are able to see the following:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
WRITE
$1
TO __stage00_output
Only the READ, LIMIT and WRITE operations are required. However, if we execute the same query with the CAST operator added:
SELECT
CAST(street_number AS int64)
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
We see that a compute operation is also required in order to perform the cast operation:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
COMPUTE
$10 := CAST($1 AS INT64)
WRITE
$10
TO __stage00_output
Those compute operations consume some time, which might cause problems as the operation scales up.
Also, remember that each time you want to use the properties of a value's real data type, you will have to cast it and pay for the compute time that requires.
Finally, regarding storage: as you mentioned, strings do not have a fixed size, and that might increase the size of the table.

Storing large amount of data in Redis / NoSQL or Relational db?

I need to store and access financial market candle stick information.
The number of candlesticks that I will need to store is beginning to look staggering (huge). There are 1000s of markets, each one has many trading pairs, each pair has many time frames, and each time frame is an array of candles like the one below. The array below could be for hourly price data or daily price data, for example.
I need to make this information available to multiple users at any given time, so need to store it and make it available somehow.
The data looks something like this:
[
{
time: 1528761600,
openPrice: 100,
closePrice: 20,
highestPrice: 120,
lowesetPrice:10
},
{
time: 1528761610,
openPrice: 100,
closePrice: 20,
highestPrice: 120,
lowesetPrice:10
},
{
time: 1528761630,
openPrice: 100,
closePrice: 20,
highestPrice: 120,
lowesetPrice:10
}
]
Consumers of the data will mostly be a complex Javascript based charting app, but other consumers will be node code, and perhaps other backend code.
My current best idea is to save the candlesticks in Redis, though I have also considered a NoSQL database. I'm not super experienced in either, so I'm not 100% sure Redis is the right choice. It seems to be the most performant option, but perhaps harder to work with, since I am having to learn a lot, and I'm not convinced that the way Redis saves and retrieves data is going to make this very easy, since I will need to continually add candles to each array.
I'm currently thinking something like:
Do an initial fetch from the candle stick api and either:
Create a Redis hash with a suitable label and stringify the whole array of candles into the hash, so that it can be parsed back by Javascript etc.
Drawbacks of this approach:
Every time a new candle is created, I have to parse the json, add any new candles sticks and stringify and save it.
Pros of this approach:
I can use Javascript to manage the array and make sure it's sorted etc
Create a Redis list of time stamps, which allows me to just push new candles onto the list and trust it to be in the right order. I can then do a Redis SCAN? to return time stamps between the specific dates and then use the time stamps to pull the data out of a Redis hash. After retrieving all of this, I would then build a json object similar to the one above to pass to Javascript.
I have to say that both of these approaches feel way more painful to me than putting the data in a relational database. I imagine that a NoSQL database could also be way easier, but I'm not experienced with them, so I can't say for sure.
I'm a bit lost and out of my experience here, as you can tell, and would love any advice anyone can give me.
Thanks :)
Your data is very regular - each candlestick has essentially 1 64 bit long for timestamp, and 4 32 bit numbers for the prices. This makes it very amenable to bitfield.
Storing the data
Here is how I would store it -
stock-symbol:daily_prices = bitfield with 30 * 5 records, assuming you are storing data for past 30 days
stock-symbol:hourly_prices = bitfield with 24 * 5 records
This way, your memory is (30*5 + 24*5) * 16 bytes = 4320 bytes per symbol + constant overhead per key.
You don't need to store the timestamp (see below). Also, I have assumed 4 bytes to store the price. You can store it as a whole number by eliminating the decimal.
Writing the data
To insert hourly prices, find the current hour (say 07:00 hours). If you treat the bitfield as an array of 4 byte integers, you will have to skip 7 * 4 = 28 integers. You then insert the prices at position 28, 29, 30, 31 (0 based indexes).
So, to store price for AAPL at 07:00 hours, you would run the command
bitfield AAPL:hourly_prices set i32 #28 <open price> set i32 #29 <close price> set i32 #30 <highest price> set i32 #31 <lowest price>
You would do something similar for daily prices as well.
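The slot arithmetic above can be sketched as a small helper that builds the BITFIELD arguments. The function names are hypothetical, and prices are assumed to be whole numbers already (decimal removed, as suggested earlier). The "#" prefix on each offset asks Redis to multiply the index by the integer width, so "#28" means the 29th i32 rather than bit 28.

```python
FIELDS = ("open", "close", "high", "low")  # four i32 values per candle

def slot_indexes(hour: int) -> list[int]:
    """Return the four 0-based i32 positions used by the given hour."""
    base = hour * len(FIELDS)
    return [base + i for i in range(len(FIELDS))]

def bitfield_set_args(key: str, hour: int, prices: tuple[int, int, int, int]) -> list[str]:
    """Build the argument list for one BITFIELD call (one SET per price)."""
    args = ["BITFIELD", key]
    for index, price in zip(slot_indexes(hour), prices):
        args += ["SET", "i32", f"#{index}", str(price)]
    return args

# Open, close, highest, lowest for AAPL at 07:00.
command = bitfield_set_args("AAPL:hourly_prices", 7, (100, 20, 120, 10))
print(" ".join(command))
```

The resulting list could be handed to a client's generic command-execution call; which call that is depends on the client library.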
Reading Data
If you are building a charting library, most likely you would want to return data for multiple symbols for a given time range. Let's say you want to pull out daily prices for past 7 days, your logic will be -
For each symbol:
Get start and end range within the array
Invoke the Get Range command.
If you run this in a pipeline, it will be very fast.
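A sketch of the read side, under the assumptions that each candle occupies 16 bytes (four i32 prices, no timestamp) and that BITFIELD stores integers big-endian, which is my understanding of the Redis layout. `byte_range` gives the inclusive start and end offsets to pass to GETRANGE, and `decode_candles` unpacks the returned bytes; both names are made up for illustration.

```python
import struct

RECORD_BYTES = 16  # four 4-byte prices per candle

def byte_range(first_slot: int, last_slot: int) -> tuple[int, int]:
    """Inclusive byte offsets covering candles first_slot..last_slot."""
    return first_slot * RECORD_BYTES, (last_slot + 1) * RECORD_BYTES - 1

def decode_candles(raw: bytes) -> list[tuple[int, int, int, int]]:
    """Unpack (open, close, high, low) tuples from a GETRANGE reply."""
    return [
        struct.unpack_from(">4i", raw, offset)
        for offset in range(0, len(raw) - RECORD_BYTES + 1, RECORD_BYTES)
    ]

# Seven daily candles (slots 0..6) would be fetched with these offsets.
start, end = byte_range(0, 6)

# Simulated reply holding two candles, packed the way the sketch assumes.
reply = struct.pack(">8i", 100, 20, 120, 10, 101, 21, 121, 11)
candles = decode_candles(reply)
```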
Other tips
Usually, you would want to filter by some property of the symbol. For example, "show me graphs of top 10 tech companies for the last 5 days".
A symbol itself is relational data. I would recommend storing that in a relational database. Just get the symbol names as a list from the relational database, and then fetch the stock prices from redis.
Redis has its limits, like anything, but they're pretty high, and if you're clever about it, you can get amazing performance out of redis. If you outgrow one instance you can start thinking about clustering, which should scale relatively linearly to a level where budget is a bigger concern than performance.
Without having a really great grasp of the data you're describing and its relations, sounds like what you're looking for is a sorted set, perhaps sorted by date. You can ZSCAN a sorted set to move through it sequentially, or you can do lots of other great things against one as well. You might have data that requires a few different things - eg a hash for some data and an entry into an index for the hash itself, or even in a few different indexes. A simple redis list might also do the job for you, since it's inherently ordered by insertion order ( this may or may not work for your cases of course; it may depend on whether your input is inherently temporally ordered).
At the end of the day, redis performance is generally dictated by how "well" the data is stored in redis - in other words, how well the native redis capabilities have been mapped into your problem domain. It's pretty easy to use and to program against. I'd highly recommend you look into it.

Alphabetical index with millions of rows in redis

For my application, I need an alphabetical index on a set with millions of rows.
When I use a sorted set, and give all members the same score, the result looks perfect.
Performance is also great: with a test set of 2 million rows, the last third does not perform noticeably worse than the first third of the set.
However, I need to query those results. For example, get the first (max) 100 items that start with "goo". I played around with zscan and sort, but it does not give me a working and performant result.
Since redis is very fast when inserting a new member to the sorted set, it must be technically possible to immediately (well, very quickly) go to the right memory location. I suppose redis uses some kind of quicksort mechanism to accomplish this.
But.. I don't seem to get the result when I just want to query the data, and not write to it.
We use replicated slaves for read actions, and we prefer the (default) read-only config switch. So creating a dummy key and deleting it afterward (however inelegant) is not really an option.
I'm stuck a bit, and I'm thinking about writing a ZLEX command in redis-server itself. Which I could use like this:
HELP "ZLEX" -> (ZLEX set score startswith)
-- Query the lexicographical index of a sorted set, supplying a 'startswith' string.
127.0.0.1:12345> ZLEX myset 0 goo LIMIT 0 100
1) goo
2) goof
3) goons
4) goozer
What are your thoughts? Am I missing something in the standard redis commands?
We're using Redis 2.8.4 x64 on Debian.
Kind regards, TW
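To make the desired "startswith" lookup concrete, here is a pure-Python sketch of the idea: binary-search a sorted list for the first member at or after the prefix, then walk forward while the prefix still matches. It is only an illustration of the technique, not how redis-server implements sorted sets, and `prefix_range` is a hypothetical name. As far as I know, later Redis releases added a ZRANGEBYLEX command that covers this use case for members with equal scores.

```python
import bisect

def prefix_range(sorted_members: list[str], prefix: str, limit: int = 100) -> list[str]:
    """Return up to `limit` members that start with `prefix`."""
    # O(log n) jump to the first candidate.
    start = bisect.bisect_left(sorted_members, prefix)
    matches = []
    for member in sorted_members[start:start + limit]:
        if not member.startswith(prefix):
            break  # members are sorted, so no later match is possible
        matches.append(member)
    return matches

members = sorted(["apple", "go", "goo", "goof", "goons", "goozer", "gop"])
result = prefix_range(members, "goo")
print(result)
```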
Edits:
Note:
Related issue: indexing-using-redis-sorted-sets -> At least the name I gave to ZLEX seems to conform with Antirez' (Salvatore's) standards. As of 24-1-2014, I'm working on implementing ZLEX. It seems to be the easiest and most straight-forward solution for this use case, and Antirez could merge it into the main branch for everyone's benefit.
I've implemented ZLEX.
Here are the full specs.
You can grab the new functionality from here: github tw-bert
I also posted a pull request to Antirez here.
Kind regards, TW
Have you had a look at this?
It can be useful depending on the length of the field by which you sort, this method requires b*(a^2) keys, where a is the length of the field , and b is amount of rows for this field.

Redis Sorted Set ... store data in "member"?

I am learning Redis and using an existing app (e.g. converting pieces of it) for practice.
I'm really struggling to understand first IF and then (if applicable) HOW to use Redis in one particular use-case ... apologies if this is super basic, but I'm so new that I'm not even sure if I'm asking correctly :/
Scenario:
Images are received by a server and info like time_taken and resolution is saved in a database entry. Images are then associated (e.g. "belong_to") with one Event ... all very straightforward for an RDBMS.
I'd like to use Redis to maintain a list of the 50 most-recently-uploaded image objects for each Event, to be delivered to the client when requested. I'm thinking that a Sorted Set might be appropriate, but here are my concerns:
First, I'm not sure if a Sorted Set can/should be used in this associative manner? Can it reference other objects in Redis? Or is there just a better way to do this altogether?
Secondly, I need the ability to delete elements that are greater than X minutes old. I know about the EXPIRE command for keys, but I can't use this because not all images need to expire at the same periodicity, etc.
This second part seems more like a query on a field, which makes me think that Redis cannot be used ... but then I've read that I could maybe use the Sorted Set score to store a timestamp and find "older than X" in that way.
Can someone provide some clarity on these two issues? Thank you very much!
UPDATE
Knowing that the amount of data I need to store for each image is small and will be delivered to the client's browser, is there anything wrong with storing it in the member "field" of a sorted set?
For example Sorted Set => event:14:pictures <time_taken> "{id:3,url:/images/3.png,lat:22.8573}"
This saves the data I need and creates a rapidly-updatable list of the last X pictures for a given event with the ability to, if needed, identify pictures that are greater than X minutes old ...
"First, I'm not sure if a Sorted Set can/should be used in this associative manner? Can it reference other objects in Redis?"
Why do you need to reference other objects? An event may have n image objects, each with a time_taken and image data; a sorted set is perfect for this. The image_id is the key, the score is time_taken, and the member is the image data as json/xml, whatever; you're good to go there.
"Secondly, I need the ability to delete elements that are greater than X minutes old"
If you want to delete elements greater than X minutes old, use ZREMRANGEBYSCORE:
ZREMRANGEBYSCORE event:14:pictures -inf (currentTime - X minutes)
-inf is just another way of saying "the oldest member" without knowing the oldest member's time, but for the upper bound you need to calculate the value from the current time before using this command (the above is just an example).

Best way to store real world "events" in a DB?

I'm building a system which will collect data about an industrial process, which is externally controlled. This data will be used to build usage statistics for various components of the system.
Simplified example: there's a heater that is turned on and off, and I get notified when it happens. I need to log this, and based on this data be able to answer questions like "How long has the heater been on last month?"
What I came up with is to create a table in which I insert a row each time a state change happens, including a timestamp.
However, it seems to me that it will require quite a lot of post-processing, e.g. to answer the example question above. I see no way to extract this kind of answer with just SQL.
Question: is there a better suited, more effective "storage pattern" than what I describe here?
Thanks.
You could store the time the heater was on, rather than the discrete on/off events. Use time_on and time_off columns to track when the heater was turned on and off respectively, and then subtract time_on from time_off to get the duration.
When the heater is turned on:
insert into heater_usage (time_on, time_off) values (now(), null);
When the heater is turned off:
update heater_usage set time_off = now() where time_off is null;
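Use a unique constraint to ensure no two rows can have null for time_off, as a basic check to make sure you don't leave "dangling" records with no time_off if your script isn't invoked properly. Be aware that many databases allow multiple NULLs in a unique column, so this may need a partial (filtered) unique index instead. You could also check for dangling rows when the heater is turned on, and remove them.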
To sum the total time on:
select sum(time_off - time_on) from heater_usage;
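The three statements above can be tried end to end with Python's built-in sqlite3 module. This sketch makes two assumptions: timestamps are stored as integer Unix seconds (so the subtraction yields seconds), and they are passed in explicitly instead of calling now(), so the result is reproducible. The function names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE heater_usage (time_on INTEGER NOT NULL, time_off INTEGER)"
)

def heater_on(ts: int) -> None:
    """Open a new usage row when the heater is switched on."""
    conn.execute(
        "INSERT INTO heater_usage (time_on, time_off) VALUES (?, NULL)", (ts,)
    )

def heater_off(ts: int) -> None:
    """Close the currently open row when the heater is switched off."""
    conn.execute(
        "UPDATE heater_usage SET time_off = ? WHERE time_off IS NULL", (ts,)
    )

def total_seconds_on() -> int:
    """Sum the durations of all completed on/off periods."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(time_off - time_on), 0) "
        "FROM heater_usage WHERE time_off IS NOT NULL"
    ).fetchone()
    return total

heater_on(1000)
heater_off(1600)   # on for 600 s
heater_on(5000)
heater_off(5900)   # on for 900 s
total = total_seconds_on()
```

Restricting the sum to a month would only need an extra condition on time_on in the WHERE clause.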
I don't think you have provided enough information to be able to propose a design.
I am sure that you are storing more than just one event type; is it a few, or is it a very large number?
How different is the data that needs to be stored for each event type?
How often will this system need to be changed? Will you have to edit or add event types regularly or rarely?
Is this a system that has to be flexible about the type of data that an event produces?
That said, you effectively have two main types of design possibilities:
Create a unique table for every event type that explicitly captures data for that event type, OR create a limited number of tables that can store data for many event types, with a column containing XML or serialised data of some form.
The first is less flexible; the second requires more post-processing.