Deleting redundant values in timeseries data - sql

Consider a database schema like this:
CREATE TABLE log (
    observation_point_id INTEGER NOT NULL,
    datetime TEXT NOT NULL,
    value REAL NOT NULL
)
which contains 'observations' of some value; say for example a temperature measurement. The observation device (i.e., thermometer :) ) samples the temperature every 5 seconds and this gets logged to the database.
There are multiple thermometers, each of which is identified (for the purposes of this simplified example) by an 'observation_point'.
Now, let's assume that the precision of my thermometer is one degree; then I will have many observations that are redundant. Let's say I log x degrees at 9h00m00s, then it's quite likely it will still be x degrees at 9h00m05s, 9h00m10s etc. So I only need to store the value and time at which I first measured this temperature, and at which I last measured it.
I can check on every insert if the value immediately preceding it is redundant, and then delete that. But that's quite expensive, especially considering that there are many loggers to write to my database, and the frequency of logging is higher than 5 seconds in my real use case.
So my idea is to run a 'cleanup' every, say 1 minute, that will delete all values between extremes e1 and e2 where the interval [e1,e2] is defined as each series of subsequent values v1, v2, ..., vn where v1 = v2 = ... = vn. 'Subsequent' here meaning when ordered by 'datetime'.
My question: is there a way to express this in an SQL query? Is there another way to approach this?
(my baseline is to do a 'select order by', then loop over all results). I can't do anything 'before' my values hit the database (i.e., cache values until I get the next measurement and only write value if that measurement is different), because I might also get observations at a much lower frequency than once every few seconds, and I cannot afford to lose observations. (now that I'm typing this, maybe I could 'cache' values in a separate database table, but I think I'm straying too far from my real question now).
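One way to express the cleanup in a single SQL statement (a sketch only, assuming a database with window functions such as PostgreSQL, and assuming a row is identified by its (observation_point_id, datetime) pair) is to compare every row with its neighbours via LAG/LEAD and delete the rows whose value matches both neighbours:

-- Sketch: delete every observation whose value equals both the previous and
-- the next observation for the same observation point (ordered by datetime).
DELETE FROM log
WHERE (observation_point_id, datetime) IN (
    SELECT observation_point_id, datetime
    FROM (
        SELECT observation_point_id,
               datetime,
               value,
               LAG(value)  OVER w AS prev_value,
               LEAD(value) OVER w AS next_value
        FROM log
        WINDOW w AS (PARTITION BY observation_point_id ORDER BY datetime)
    ) AS neighbours
    WHERE value = prev_value
      AND value = next_value
);

Run periodically, this keeps only the first and last row of each run of equal values, which is exactly the [e1, e2] interval described above.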

Related

Infinite scroll algorithm for random items with different weight (probability to show to the user)

I have a web/mobile application that should display an infinite scroll view (the continuation of the list of items is loaded periodically and dynamically), where each item has a weight: the bigger an item's weight compared to the weights of the other items, the higher the probability that the item is loaded and displayed in the list. The items should still be loaded randomly; only their chances of appearing in the list should differ.
I am searching for an efficient algorithm / solution or at least hints that would help me achieve that.
Some points worth mentioning:
the weight has these bounds: 0 <= w < infinity.
the weight is not a static value; it can change over time based on some item properties.
every item with a weight higher than 0 should have a chance to be displayed to the user even if the weight is significantly lower than the weight of other items.
when the user scrolls and performs multiple requests to the API, they should not see duplicate items, or at least the chance of that should be low.
I use an SQL database (PostgreSQL) for storing the items, so the solution should be efficient for this type of database. (It doesn't have to be a purely SQL solution.)
Hope I didn't miss anything important. Let me know if I did.
The following are some ideas to implement the solution:
The database table should have a column where each entry is a number generated as follows:
log(R) / W,
where—
W is the record's weight, greater than 0 (stored in its own column), and
R is a per-record uniform random number in (0, 1)
(see also Arratia, R., "On the amount of dependence in the prime factorization of a uniform random integer", 2002). Then take the records with the highest values of that column as the need arises.
However, note that SQL has no standard way to generate random numbers; DBMSs that implement SQL have their own ways to do so (such as RANDOM() for PostgreSQL), but how they work depends on the DBMS (for example, compare MySQL's RAND() with T-SQL's NEWID()).
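For example, in PostgreSQL this idea might be sketched as follows (the table and column names are assumptions, not part of the question):

-- Sketch: refresh the per-record keys, then read the feed in key order.
-- items(id, weight, sort_key) is an assumed table; weight > 0 for eligible rows.
UPDATE items
SET sort_key = ln(random()) / weight    -- log(R) / W; guard against random() = 0 if needed
WHERE weight > 0;

SELECT id, weight
FROM items
WHERE weight > 0
ORDER BY sort_key DESC                  -- highest keys first
LIMIT 20;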
Peter O had a good idea, but it has some issues. I would expand it a bit to allow shuffling that is a little better and more user-specific, at a higher database-space cost:
1. Use a single column, but store multiple fields in it. I recommend the Postgres JSONB type (which stores the data as JSON that can be indexed and queried). Use several fields, each holding its own log(R) / W value; roughly log(U) + log(P) of them, where U is the number of users and P is the number of items, with a minimum of probably 5 fields. Add an index over the fields within the JSONB. Add more fields as the number of users/items gets high enough.
2. Have a background process that regularly rotates the numbers in #1. This can cause duplication, but if you are only rotating a small subset of the items at a time (such as O(sqrt(P)) of them), the odds of the user noticing are low. Especially if you are querying for data backwards and forwards and stitching/deduplicating the results before displaying the next row(s). Careful use of manual pagination adjustments helps a lot here if it becomes an issue.
3. Before displaying items, randomly pick one of the index fields and sort the data on that. This means you have a 1 in (log(P) + log(U)) chance of displaying the same data to the user. Ideally the user would pick a random subset of those index fields (to avoid seeing the same order twice) and use that as the order, but I can't think of a way to make that work and be practical. Though a random shuffle of the index fields and sorting by that might be practical if the randomized weights are normalized, such that the sort order matters.
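A rough PostgreSQL sketch of steps 1 and 3 (the jsonb column, key names, and table are illustrative assumptions):

-- Sketch: store several independent keys per item and index each of them.
-- sort_keys might look like {"k0": -0.12, "k1": -0.87, "k2": -1.44, ...}.
CREATE INDEX items_sort_k0 ON items (((sort_keys ->> 'k0')::float8) DESC);

-- At query time the application picks one key at random (here k0) and sorts by it.
SELECT id, weight
FROM items
ORDER BY (sort_keys ->> 'k0')::float8 DESC
LIMIT 20;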

When organising an InfluxDB database, which of these two approaches would be most preferred?

I am trying to decide how measurements should be organised in an InfluxDB database (which I believe they call schema design and data layout), but I think this may be a more general database-type question.
Let's say as a simple example that I am measuring two quantities, temperature and humidity (imaginative, I know!), in two locations, living room and outside.
InfluxDB has the following syntax for inserting data points:
measurement,tag_key=tag_value field_key=field_value
and so there are two obvious (at least to me) options. Briefly, the first option would insert a datapoint like this:
INSERT temperature,location=outside value=15
INSERT humidity,location=outside value=50
whereas the second option would do it this way:
INSERT sensor_measurements,location=outside temperature=15,humidity=50
My questions are more high level:
Is there a preferred/accepted way to go about this?
Will I run into problems with either of these if I try to scale it up to more quantities/locations/data types?
Does either of the methods offer an advantage if I later on try to graph these things in Grafana, for example, or if I try to implement later some of the many InfluxQL functions?
Does anyone have any general advice about this to offer?
My own thoughts:
Option 1 seems to me to be more like what is implied by the InfluxDB description "measurement". Both temperature and humidity are separate quantities. But it seems a little clunky to just call it "value".
Option 2 appears to have the advantage that both the humidity and the temperature share exactly the same timestamp. This would come in useful, for example, if I wanted to import the data into some other software and do a correlation between the two quantities, and would mean I wouldn't have to do any interpolation or binning to get them to match up.
I am not sure whether it is a bad idea with Option 2 to just have a general measurement called sensor_measurements, and whether it will be hard to maintain later.
In detail:
Option 1
Have a separate "measurement" for each of temperature and humidity, use the location as a "tag", and just name the "field" as value:
At time t1, insert the data:
INSERT humidity,location=outside value=50
INSERT temperature,location=outside value=15
INSERT humidity,location=living_room value=65
INSERT temperature,location=living_room value=28
At time t2, insert some different data:
INSERT humidity,location=outside value=56
INSERT temperature,location=outside value=14
INSERT humidity,location=living_room value=63
INSERT temperature,location=living_room value=29
I can then get access to the living room temperature by querying the following:
> SELECT value FROM temperature WHERE location='living_room'
name: temperature
time value
---- -----
1590416682017481091 28
1590416723963187592 29
I can also use the group by function to do something like this:
SELECT value FROM temperature GROUP BY "location"
Option 2
Have a combined "measurement" called sensor_measurements, for example, use a "tag" for location, and then have separate "fields" for each of temperature and humidity:
At time t1, insert the data:
INSERT sensor_measurements,location=outside temperature=15,humidity=50
INSERT sensor_measurements,location=living_room temperature=28,humidity=65
At time t2, insert some different data:
INSERT sensor_measurements,location=outside temperature=14,humidity=56
INSERT sensor_measurements,location=living_room temperature=29,humidity=63
I can now get access to the living room temperature by querying the following:
> SELECT temperature FROM sensor_measurements WHERE location='living_room'
name: sensor_measurements
time temperature
---- -----------
1590416731530452068 28
1590416757055629103 29
I can now use the group by function to do something like this:
SELECT temperature FROM sensor_measurements GROUP BY "location"
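For example, because both fields share a single timestamp under Option 2, they could also be read side by side in one query, returning something like:
> SELECT temperature, humidity FROM sensor_measurements WHERE location='living_room'
name: sensor_measurements
time                temperature humidity
----                ----------- --------
1590416731530452068 28          65
1590416757055629103 29          63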
I would use Option 2 of the offered options, because fewer records = fewer resources = better query response time (in theory). Generally, both approaches look good.
But in the real world I would use a more generic third option: a single generic metrics measurement with tags metric and location and a field value:
INSERT metrics,metric=temperature,location=outside value=15
INSERT metrics,metric=humidity,location=outside value=50
INSERT metrics,metric=temperature,location=living_room value=28
INSERT metrics,metric=humidity,location=living_room value=65
That gives me the opportunity to create a single generic Grafana dashboard, where the user can select the visualized metric/location via dashboard variables (generated directly from InfluxDB, e.g. SHOW TAG VALUES WITH KEY = "metric"). Any newly inserted metric (e.g. illuminance, pressure, wind-speed, wind-direction, ...) or location can be visualized immediately in this generic dashboard. Eventually, some metrics may also have additional tags. That is fine: I will be able to use ad-hoc Grafana variables, so users will be able to specify any number of key/value filters on the fly. See the Grafana docs.
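A Grafana panel on top of that schema could then use a single templated InfluxQL query, for example (a sketch; $metric and $location are dashboard variables, $timeFilter and $__interval are standard Grafana macros):

SELECT mean("value")
FROM "metrics"
WHERE "metric" = '$metric' AND "location" = '$location' AND $timeFilter
GROUP BY time($__interval) fill(null)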

How is insertion for a Singly Linked List and Doubly Linked List constant time?

Thinking about it, I thought the time complexity for insertion and search should be the same for any data structure, because to insert you first have to search for the location where you want to insert, and then you have to insert.
According to here: http://bigocheatsheet.com/, for a linked list, search is linear time but insertion is constant time. I understand how searching is linear (start from the front, then keep going through the nodes on the linked list one after another until you find what you are searching for), but how is insertion constant time?
Suppose I have this linked list:
1 -> 5 -> 8 -> 10 -> 8
and I want to insert the number 2 after the number 8, then would I have to first search for the number 8 (search is linear time), and then take an extra 2 steps to insert it (so, insertion is still linear time?)?
# insert y after x in python
def insert_after(x, y):
    search_for(x)      # find node x in the list -- this step is O(n)
    y.next = x.next    # point y at whatever followed x
    x.next = y         # point x at y -- these two link updates are O(1)
Edit: Even for a doubly linked list, shouldn't it still have to search for the node first (which is linear time), and then insert?
So if you already have a reference to the node you want to insert after, then it is O(1). Otherwise, it is search_time + O(1). It is a bit misleading, but on Wikipedia there is a chart that explains it a bit better.
Contrast this with a dynamic array, where inserting at the beginning is Θ(n).
Just for emphasis: The website you reference is referring to the actual act of inserting given we already know where we want to insert.
Time to insert = Time to set three pointers = O(3) = constant time.
The time to insert the data is not the same as the time to insert the data at a particular location; the quoted complexity refers to the insertion step only.

How to calculate blocks of free time using start and end time?

I have a Ruby on Rails application that uses MySQL and I need to calculate blocks of free (available) time given a table that has rows of start and end datetimes. This needs to be done for a range of dates, so for example, I would need to look for which times are free between May 1 and May 7. I can query the table with the times that are NOT available and use that to remove periods of time between May 1 and May 7. Times in the database are stored at a fidelity of 15 minutes on the quarter hour, meaning all times end at 00, 15, 30 or 45 minutes. There is never a time like 11:16 or 10:01, so no rounding is necessary.
I've thought about creating a hash that has time represented in 15 minute increments and defaulting all of the values to "available" (1), then iterating over an ordered resultset of rows and flipping the values in the hash to 0 for the times that come back from the database. I'm not sure if this is the most efficient way of doing this, and I'm a little concerned about the memory utilization and computational intensity of that approach. This calculation won't happen all the time, but it needs to scale to happening at least a couple hundred times a day. It seems like I would also need to reprocess the entire hash to find the blocks of time that are free after this which seems pretty inefficient.
Any ideas on a better way to do this?
Thanks.
I've done this a couple of ways. First, my assumption is that your table shows appointments, and now you want to get a list of un-booked time, right?
So, the first way I did this was like yours, just a hash of unused times. It's slow and limited and a little wasteful, since I have to re-calculate the hash every time someone needs to know the times that are available.
The next way I did this was to borrow an idea from the data warehouse people. I build an attribute table of all time slots that I'm interested in. If you build this kind of table, you may want to put more information in there besides the slot times. You may also include things like whether it's a weekend, which hour of the day it's in, whether it's during regular business hours, whether it's on a holiday, that sort of thing. Then, I join all slots between my start and end times against the appointments and keep the rows where the appointment is null. So, this is a LEFT JOIN, something like:
SELECT slots.*
FROM slots
LEFT JOIN appointments
       ON appointments.slot_id = slots.id   -- join column is illustrative
WHERE ...                                    -- restrict slots to the requested range
  AND appointments.id IS NULL
That keeps me from having to re-create the hash every time, and it's using the database to do the set operations, something the database is optimized to do.
Also, if you make your slots table a little rich, you can start doing all sorts of queries about not only the available slots you may be after, but also on the kinds of times that tend to get booked, or the kinds of times that tend to always be available, or other interesting questions you might want to answer some day. At the very least, you should keep track of the fields that tell you whether a slot should be one that is being filled or not (like for business hours).
Why not have a flag in the row that indicates availability? As time is allocated, flip the flag for every date/time in the appropriate range. For example, May 2, 12pm to 1pm would be marked as not available.
Then it's a simple matter of querying the date range for every row that has the availability flag set to true.
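A sketch of that query (the table and column names are only illustrative; :range_start and :range_end would be the requested dates):

SELECT slot_time
FROM slots
WHERE available = TRUE
  AND slot_time >= :range_start
  AND slot_time <  :range_end
ORDER BY slot_time;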

Trending 100 million+ rows

I have a system which records some measured values every second. What is the best way to store trend data which are values corresponding to a specific second?
1 day = 86,400 seconds
1 month = 2,592,000 seconds
Around 1000 values to keep track of every second.
Currently there are 50 tables grouping the trend data for 20 columns each. These tables contain more than 100 million rows.
TREND_TIME datetime (clustered_index)
TREND_DATA1 real
TREND_DATA2 real
...
TREND_DATA20 real
Have you considered RRDTool? It provides a round-robin database, or circular buffer, for time-series data. You can store data at whatever interval you like, then define consolidation points and a consolidation function (for example sum, min, max, avg) for a given period: 1 second, 5 seconds, 2 days, etc. Because it knows what consolidation points you want, it doesn't need to store all the data points once they've been aggregated.
Ganglia and Cacti use this under the covers and it's quite easy to use from many languages.
If you do need all the datapoints, consider using it just for the aggregation.
I would change the data-saving approach: instead of saving 'raw' data as individual values, I would collect 5-20 minutes of data in an array (in memory, on the business-logic side), compress that array using an LZ-based algorithm, and then store it in the database as binary data. It would also be nice to save max/min/avg/etc. info for each binary chunk.
When you want to process the data, you can do so chunk by chunk, which keeps your application's memory profile low. This approach is a little more complex, but very scalable in terms of memory/processing.
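A sketch of what such a chunk table might look like (the names and the SQL Server-style types are assumptions based on the columns shown above):

-- One row per compressed chunk of 5-20 minutes of raw samples for one trend.
CREATE TABLE TREND_CHUNK (
    TREND_ID     smallint NOT NULL,       -- which of the ~1000 tracked values
    CHUNK_START  datetime NOT NULL,       -- timestamp of the first sample
    CHUNK_END    datetime NOT NULL,       -- timestamp of the last sample
    MIN_VALUE    real NOT NULL,
    MAX_VALUE    real NOT NULL,
    AVG_VALUE    real NOT NULL,
    RAW_SAMPLES  varbinary(max) NOT NULL, -- LZ-compressed array of raw values
    PRIMARY KEY (TREND_ID, CHUNK_START)
);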
Hope this helps.
Is the problem the database schema?
One second mapping to many trend values suggests, first, a separate table with a foreign key to a seconds table. Alternatively, if the "many trend values" are represented by columns rather than rows, you can always append the columns to the seconds table and accept NULL values.
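The first suggestion might be sketched like this (names are illustrative):

-- One row per (second, trend) pair instead of 20 trend columns per row.
CREATE TABLE TREND_VALUE (
    TREND_TIME  datetime NOT NULL,
    TREND_ID    smallint NOT NULL,   -- identifies one of the ~1000 tracked values
    VALUE       real NOT NULL,
    PRIMARY KEY (TREND_TIME, TREND_ID)
);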
Have you tried that? Was performance poor?