save data to redis and support quick query

I have a table like below:
sec_name date1 date2 date3
IBM 10 11 12
GOO 11 8 7
AWS 7 14 12
ALI 12 6 6
I want to store them in redis and support queries that fetch data by a list of sec_names and a list of dates, e.g. [IBM, AWS] & [date1, date3].
Is there a better data structure for storing this in redis?
Assume the table is on the order of 10000 x 10000 cells.

In terms of performance, every sec_name should be the key of a Hash data structure, in which each field is a date.
To perform your query you'll need to break it down into multiple calls, one per security name, fetching the requested date fields from that security's Hash (HMGET, or one HGET per field).
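For illustration, here is a minimal redis-py sketch of that layout; the sec:<name> key prefix and the client setup are my own assumptions, not part of the question:

import redis

r = redis.Redis(decode_responses=True)

# One Hash per security; each field is a date column from the table.
rows = {
    "IBM": {"date1": 10, "date2": 11, "date3": 12},
    "AWS": {"date1": 7, "date2": 14, "date3": 12},
}
for name, values in rows.items():
    r.hset(f"sec:{name}", mapping=values)

# Query [IBM, AWS] & [date1, date3]: one HMGET per security.
result = {name: r.hmget(f"sec:{name}", ["date1", "date3"])
          for name in ["IBM", "AWS"]}
# result == {'IBM': ['10', '12'], 'AWS': ['7', '12']}

At 10000 x 10000 cells you would probably want to send the HMGET calls through a pipeline so you don't pay one round trip per security.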

Related

Modeling data in Redis. Which is better: Sorted set & strings or hash table?

I want to use redis to store data that is sourced from a sql db. In the db, each row has an ID, date, and value, where the ID and date make up a composite key (there can only be one value for a particular ID and date). An example is below:
ID Date Value
1 01/01/2001 1.2
1 02/01/2001 1.5
1 04/23/2002 1.5
2 05/05/2009 0.4
Users should be able to query this data in redis given a particular ID and date range. For example, they might want all values for 2019 with ID 45. If the user does not specify a start or end time, we use the system's Date.Min or Date.Max respectively. We also want to support refreshing redis data from the database using the same parameters (ID and date range).
Initially, I used a zset:
zset key zset member score
1 01/01/2001_1.2 20010101
1 02/01/2001_1.5 20010201
1 04/23/2002_1.5 20020423
2 05/05/2009_0.4 20090505
The problem: what happens if the value field changes in the db? For instance, the value for ID 1 and date 01/01/2001 might change to 1.3 later on. I would want the original value to be updated, but instead a new member would be inserted. So I would need to first check whether a member for that score already exists, and delete it if it does, before inserting the new member. I imagine this could get expensive when refreshing, for example, 10 years' worth of data.
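The duplicate-member behaviour is easy to reproduce with redis-py (a sketch, using the key and member format from the table above):

import redis

r = redis.Redis(decode_responses=True)

r.zadd("1", {"01/01/2001_1.2": 20010101})
# The db value later changes to 1.3; the member string differs, so ZADD adds a second member:
r.zadd("1", {"01/01/2001_1.3": 20010101})
print(r.zrangebyscore("1", 20010101, 20010101))  # ['01/01/2001_1.2', '01/01/2001_1.3']

# The workaround described above: clear whatever sits at that score before re-inserting.
r.zremrangebyscore("1", 20010101, 20010101)
r.zadd("1", {"01/01/2001_1.3": 20010101})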
I thought of two possible fixes to this:
1.) Use a zset and string key-value:
zset key zset value score
1 1_01/01/2001 20010101
1 1_02/01/2001 20010201
1 1_04/23/2002 20020423
2 2_05/05/2009 20090505
string key string value
1_01/01/2001 1.2
1_02/01/2001 1.5
1_04/23/2002 1.5
2_05/05/2009 0.4
This allows me to easily update the value and query for a date range, but it adds some complexity, since now I need to use two redis data structures instead of one.
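A rough redis-py sketch of option 1, just to make the two-structure flow concrete (the dates:<id> key prefix and the helper names are mine):

import redis
from datetime import datetime

r = redis.Redis(decode_responses=True)

def put(item_id, date_str, value):
    # Score is the date as yyyymmdd; the member doubles as the string key.
    score = int(datetime.strptime(date_str, "%m/%d/%Y").strftime("%Y%m%d"))
    member = f"{item_id}_{date_str}"
    r.zadd(f"dates:{item_id}", {member: score})  # zset: date index
    r.set(member, value)                         # string: the actual value

def query(item_id, start=0, end=99999999):
    members = r.zrangebyscore(f"dates:{item_id}", start, end)
    return dict(zip(members, r.mget(members))) if members else {}

put(1, "01/01/2001", 1.2)
put(1, "01/01/2001", 1.3)   # a refresh just overwrites the string key; the zset member is unchanged
print(query(1, 20010101, 20011231))   # {'1_01/01/2001': '1.3'}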
2.) Use a hash table:
hash key sub-key value
1 01/01/2001 1.2
1 02/01/2001 1.5
1 04/23/2002 1.5
2 05/05/2009 0.4
This is nice because I only have to use one data structure. Although it would be O(N) to get all values for a particular hash key, solution 1 has the same drawback when fetching the values for all string keys returned from the zset.
However, with this solution, I now need to generate all sub-keys between a given start and end date in my calling code, and not every date may have a value. There are also some edge cases that I now need to handle (what if the user wants all values up until today? Do I use HGETALL and just remove the ones in the future I don't care about? At what date range size should I use HGETALL rather than HMGET?)
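And a comparable sketch of option 2, with one hash per ID and the sub-key generation done in the calling code (the values:<id> prefix is my own naming):

import redis
from datetime import date, timedelta

r = redis.Redis(decode_responses=True)

r.hset("values:1", mapping={"01/01/2001": 1.2, "02/01/2001": 1.5})
r.hset("values:1", "01/01/2001", 1.3)   # a refresh is just an HSET on the same field

def query(item_id, start, end):
    # Generate every candidate sub-key in [start, end], HMGET them, drop the misses.
    days = [(start + timedelta(n)).strftime("%m/%d/%Y")
            for n in range((end - start).days + 1)]
    values = r.hmget(f"values:{item_id}", days)
    return {d: v for d, v in zip(days, values) if v is not None}

print(query(1, date(2001, 1, 1), date(2001, 2, 1)))
# {'01/01/2001': '1.3', '02/01/2001': '1.5'}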
In my view, there are pros and cons to each solution, and I'm not sure which one will be easier to maintain in the long term. Does anyone have thoughts as to which structure they would choose in this situation?

django database design when you will have too many rows

I have a Django web app with a Postgres db; the general operation is that every day I have an array of values that need to be stored in one of the tables.
There is no foreseeable need to query the individual values of the array, but I need to be able to plot the values for a specific day.
The problem is that this array is pretty big: if I store every value as its own row, I'd have 60 million rows per year, but if I collapse each array into a blob object, I'd have 60 thousand rows per year.
Is it a good decision to use a blob object to reduce the table size when you do not need to query the individual values?
Here are the two options:
option1: keeping all
group(foreignkey)| parent(foreignkey) | pos(int) | length(int)
A | B | 232 | 45
A | B | 233 | 45
A | B | 234 | 45
A | B | 233 | 46
...
option2: collapsing the array into a blob:
group(fk)| parent(fk) | mean_len(float)| values(blob)
A | B | 45 |[(pos=232, len=45),...]
...
so I do NOT want to query pos or length but I want to query group or parent.
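For what it's worth, option 2 might look roughly like this as a Django model; the model and field names are hypothetical, and the blob is just whatever serialization you choose (pickle, msgpack, ...):

from django.db import models

class DailyArray(models.Model):
    # Option 2: one row per array, with the (pos, length) pairs collapsed into a blob.
    group = models.ForeignKey("Group", on_delete=models.CASCADE)
    parent = models.ForeignKey("Parent", on_delete=models.CASCADE)
    day = models.DateField(db_index=True)   # the day the array belongs to
    mean_len = models.FloatField()          # kept queryable outside the blob
    values = models.BinaryField()           # serialized [(pos, len), ...]

Only group, parent, day and mean_len stay queryable; pos and length can no longer appear in a WHERE or ORDER BY, which matches the stated requirement.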
An example of read query that I'm talking about is:
SELECT * FROM "mytable"
LEFT OUTER JOIN "group"
ON ( "group"."id" = "grouptable"."id" )
ORDER BY "pos" DESC LIMIT 100
which is a typical django admin list_view page main query.
I tried loading the data and displaying the table in the Django admin page without doing any complex query (just a read query).
Once I get past 1.5 million rows, the admin page freezes. All it takes is a count query on that table to crash the app, so I should definitely either keep the data as a blob or keep it out of the db entirely and use the filesystem instead.
I want to emphasize that I used Django 1.8 as my test bench, so this is not a Postgres evaluation but rather an evaluation of the system as a whole: Django admin plus Postgres.

Cumulative average number of records created for specific day of week or date range

Yeah, so I'm filling out a requirements document for a new client project and they're asking for growth trends and performance expectations calculated from existing data within our database.
The best source of data for something like this would be our logs table as we pretty much log every single transaction that occurs within our application.
Now, here's the issue: I don't have a whole lot of experience with MySQL when it comes to calculating cumulative sums and running averages. I've thrown together the following query, which kind of makes sense to me, but it just keeps locking up the command console. The thing takes forever to execute and there are only 80k records in the test sample.
So, given the following basic table structure:
id | action | date_created
1 | 'merp' | 2007-06-20 17:17:00
2 | 'foo' | 2007-06-21 09:54:48
3 | 'bar' | 2007-06-21 12:47:30
... thousands of records ...
3545 | 'stab' | 2007-07-05 11:28:36
How would I go about calculating the average number of records created for each given day of the week?
day_of_week | average_records_created
1 | 234
2 | 23
3 | 5
4 | 67
5 | 234
6 | 12
7 | 36
I have the following query which makes me want to murderdeathkill myself by casting my body down an elevator shaft... and onto some bullets:
SELECT
DISTINCT(DAYOFWEEK(DATE(t1.datetime_entry))) AS day_of_week,
AVG((SELECT COUNT(*) FROM VMS_LOGS t2 WHERE DAYOFWEEK(DATE(t2.datetime_entry)) = day_of_week)) AS average_records_created
FROM VMS_LOGS t1
GROUP BY day_of_week;
Halps? Please, don't make me cut myself again. :'(
How far back do you need to go when sampling this information? This solution works as long as it's less than a year.
Because day of week and week number are constant for a record, create a companion table that has the ID, WeekNumber, and DayOfWeek. Whenever you want to run this statistic, just generate the "missing" records from your master table.
Then, your report can be something along the lines of:
select
DayOfWeek
, count(*)/count(distinct(WeekNumber)) as Average
from
MyCompanionTable
group by
DayOfWeek
Of course if the table is too large, then you can instead pre-summarize the data on a daily basis and just use that, and add in "today's" data from your master table when running the report.
I rewrote your query as:
SELECT x.day_of_week,
AVG(x.count) 'average_records_created'
FROM (SELECT DAYOFWEEK(t.datetime_entry) 'day_of_week',
COUNT(*) 'count'
FROM VMS_LOGS t
GROUP BY DAYOFWEEK(t.datetime_entry)) x
GROUP BY x.day_of_week
The reason why your query takes so long is your inner select: you are essentially running 6,400,000,000 queries. With a query like this, your best solution may be to develop a timed reporting system, where the user receives an email when the query is done and the report is constructed, or the user logs in and checks the report afterwards.
Even with the optimization written by OMG Ponies (above), you are still looking at around the same number of queries.

custom sorting or ordering a table without resorting the whole shebang

For ten years we've been using the same custom sorting on our tables, and I'm wondering whether there is another solution that involves fewer updates, especially since today we'd like to have a replication/publication date and wouldn't like our replication to replicate unnecessary entries. I had a look into nested sets, but they don't seem to do the job for us.
Base table:
id | a_sort
---+-------
1 10
2 20
3 30
After inserting an entry meant to land at the second position:
insert into table (a_sort) values (15)
the table looks like this:
id | a_sort
---+-------
1 10
2 20
3 30
4 15
Ordering the table with:
select * from table order by a_sort
and then renumbering all the a_sort entries, which means updating at least ids 2, 3 and 4,
will of course produce the desired output:
id | a_sort
---+-------
1 10
4 20
2 30
3 40
The column names, the column count, the datatypes, a possible join, possible triggers and the way the resorting is done are all irrelevant to the problem. We've also found some pretty neat ways to do this task fast.
The only question is: how the heck can we reduce the updates in the db to 1 or 2 max?
Seems like an awfully common problem.
The captain obvious in me once thought: "use a_sort float(53) and insert using a fixed value of ordervaluefirstentry + abs(ordervaluefirstentry - ordervaluenextentry)/2".
But this would only allow around 1040 "in between" entries - so never resorting seems a bit problematic ;)
You really didn't describe what you're doing with this data, so forgive me if this is a crazy idea for your situation:
You could make a sort of 'linked list' where instead of a column of values, you have a column for the 'next highest valued' id. This would decrease the number of updates to a maximum of 2.
You can make it doubly linked and also have a column for next lowest, which would bring the maximum number of updates to 3.
See:
http://en.wikipedia.org/wiki/Linked_list
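A tiny in-memory sketch of the idea (plain Python dicts standing in for table rows), just to show that an insert touches only the new row plus one neighbour:

# id -> row; next_id points at the row that sorts directly after, None = last.
rows = {1: {"next_id": 2}, 2: {"next_id": 3}, 3: {"next_id": None}}

def insert_after(rows, prev_id, new_id):
    # Two writes total: the new row, and the predecessor's pointer.
    rows[new_id] = {"next_id": rows[prev_id]["next_id"]}
    rows[prev_id]["next_id"] = new_id

def ordered_ids(rows, head_id):
    # Reading the order back means walking the chain instead of ORDER BY a_sort.
    out, cur = [], head_id
    while cur is not None:
        out.append(cur)
        cur = rows[cur]["next_id"]
    return out

insert_after(rows, 1, 4)       # put id 4 at the second position
print(ordered_ids(rows, 1))    # [1, 4, 2, 3]

The trade-off is on the read side: in SQL, walking the chain typically means a recursive query or application code, which is the price you pay for the cheap writes.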

Sparse data: efficient storage and retrieval in an RDBMS

I have a table representing values of source file metrics across project revisions, like the following:
Revision FileA FileB FileC FileD FileE ...
1 45 3 12 123 124
2 45 3 12 123 124
3 45 3 12 123 124
4 48 3 12 123 124
5 48 3 12 123 124
6 48 3 12 123 124
7 48 15 12 123 124
(The relational view of the above data is different. Each row contains the following columns: Revision, FileId, Value. The files and their revisions from which the data is calculated are stored in Subversion repositories, so we're trying to represent the repository's structure in a relational schema.)
There can be up to 23750 files in 10000 revisions (this is the case for the ImageMagick drawing program). As you can see, most values are the same between successive revisions, so the table's useful data is quite sparse. I am looking for a way to store the data that
avoids replication and uses space efficiently (currently the non-sparse representation requires 260 GB (data+index) for less than 10% of the data I want to store)
allows me to retrieve efficiently the values for a specific revision using an SQL query (without explicitly looping through revisions or files)
allows me to retrieve efficiently the revision for a specific metric value.
Ideally, the solution should not depend on a particular RDBMS and should be compatible with Hibernate. If this is not possible, I can live with using Hibernate, MySQL or PostgreSQL-specific features.
This is how I might model it. I've left out the Revisions table and Files table as those should be pretty self-explanatory.
CREATE TABLE Revision_Files
(
start_revision_number INT NOT NULL,
end_revision_number INT NOT NULL,
file_number INT NOT NULL,
value INT NOT NULL,
CONSTRAINT PK_Revision_Files PRIMARY KEY CLUSTERED (start_revision_number, file_number),
CONSTRAINT CHK_Revision_Files_start_before_end CHECK (start_revision_number <= end_revision_number)
)
GO
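To see how the example data maps onto this schema, here's a small Python sketch (illustration only, not part of the schema) that collapses one file's per-revision values into (start, end, value) ranges; FileA above becomes two rows instead of seven:

def collapse(per_revision_values):
    # per_revision_values: [(revision_number, value), ...] ordered by revision.
    rows = []
    for rev, value in per_revision_values:
        if rows and rows[-1]["value"] == value and rows[-1]["end"] == rev - 1:
            rows[-1]["end"] = rev   # value unchanged: extend the open range
        else:
            rows.append({"start": rev, "end": rev, "value": value})
    return rows

file_a = [(1, 45), (2, 45), (3, 45), (4, 48), (5, 48), (6, 48), (7, 48)]
print(collapse(file_a))
# [{'start': 1, 'end': 3, 'value': 45}, {'start': 4, 'end': 7, 'value': 48}]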
To get all of the values for files of a particular revision you could use the following query. Joining to the files table with an outer join would let you get those that have no defined value for that revision.
SELECT
REV.revision_number,
RF.file_number,
RF.value
FROM
Revisions REV
INNER JOIN Revision_Files RF ON
RF.start_revision_number <= REV.revision_number AND
RF.end_revision_number >= REV.revision_number
WHERE
REV.revision_number = #revision_number
GO
Assuming that I understand correctly what you want in your third point, this will let you get all of the revisions for which a particular file has a certain value:
SELECT
REV.revision_number
FROM
Revision_Files RF
INNER JOIN Revisions REV ON
REV.revision_number BETWEEN RF.start_revision_number AND RF.end_revision_number
WHERE
RF.file_number = #file_number AND
RF.value = #value
GO