Redis - best way to cache the data

I have 3 types of values.
Type1 - In millions
Type2 - In hundreds
Type3 - In tens
For each Type1 value, all Type2 values should be processed by Type3.
I can use a set, keyed either way:
Type1#Type3 = {Type2 values}
Or
Type1#Type2 = {Type3 values}
If I keep Type3 in the key, the sets occupy more memory.
Or is there a better way to store this kind of data?
I evaluated a hash, which occupies more memory than a set, even with ziplist compression enabled for the hash.
Any suggestions?

If you can tolerate splitting the data, here is another way:
First hash: the key is a Type1 value and the value is a colon-separated list of the Type2 values linked to that Type1 value:
type1_a => type2_b:type2_c..
type1_b => type2_k:type2_d..
Second hash: the key is a Type2 value and the value is the corresponding Type3 value:
type2_a => type3a
type2_b => type3b
for type2_key in firstHashMap.get(type1_key).split(':')
    secondHashMap.get(type2_key)
Also, you can look into a Lua script to avoid doing the above on the client side.
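For illustration, here is a minimal client-side sketch of that two-hash lookup, assuming redis-py; the key names type1_to_type2 and type2_to_type3 are hypothetical, not from the question:

import redis

r = redis.Redis()  # hypothetical connection; adjust host/port as needed

def type3_values_for(type1_key):
    # First hash: Type1 -> colon-separated list of Type2 values.
    packed = r.hget('type1_to_type2', type1_key)
    if packed is None:
        return []
    type2_keys = packed.decode().split(':')
    # Second hash: fetch all Type3 values in one HMGET round trip.
    return r.hmget('type2_to_type3', type2_keys)

The same HGET / split / HMGET sequence could be moved into a Lua script (EVAL) to keep it to a single round trip.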

Related

Modeling data in Redis. Which is better: Sorted set & strings or hash table?

I want to use redis to store data that is sourced from a sql db. In the db, each row has an ID, date, and value, where the ID and date make up a composite key (there can only be one value for a particular ID and date). An example is below:
ID | Date       | Value
---+------------+------
 1 | 01/01/2001 | 1.2
 1 | 02/01/2001 | 1.5
 1 | 04/23/2002 | 1.5
 2 | 05/05/2009 | 0.4
Users should be able to query this data in redis given a particular ID and date range. For example, they might want all values for 2019 with ID 45. If the user does not specify a start or end time, we use the system's Date.Min or Date.Max respectively. We also want to support refreshing redis data from the database using the same parameters (ID and date range).
Initially, I used a zset:
zset key | zset member    | score
---------+----------------+---------
1        | 01/01/2001_1.2 | 20010101
1        | 02/01/2001_1.5 | 20010201
1        | 04/23/2002_1.5 | 20020423
2        | 05/05/2009_0.4 | 20090505
The problem: what happens if the value field changes in the DB? For instance, the value for ID 1 and date 01/01/2001 might change to 1.3 later on. I would want the original value to be updated, but instead a new member would be inserted. To handle this, I would need to first check whether a member for a particular score exists, and delete it if it does, before inserting the new member. I imagine this could get expensive when refreshing, for example, 10 years' worth of data.
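For reference, a minimal sketch of that delete-then-insert workaround, assuming redis-py and the key layout shown above (function and variable names are mine):

import redis

r = redis.Redis()  # hypothetical connection

def upsert(item_id, date_str, score, value):
    # Drop any existing member at this score first; otherwise a changed
    # value would leave two members for the same date.
    r.zremrangebyscore(str(item_id), score, score)
    r.zadd(str(item_id), {f"{date_str}_{value}": score})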
I thought of two possible fixes to this:
1.) Use a zset and string key-value:
zset key | zset value   | score
---------+--------------+---------
1        | 1_01/01/2001 | 20010101
1        | 1_02/01/2001 | 20010201
1        | 1_04/23/2002 | 20020423
2        | 2_05/05/2009 | 20090505

string key   | string value
-------------+-------------
1_01/01/2001 | 1.2
1_02/01/2001 | 1.5
1_04/23/2002 | 1.5
2_05/05/2009 | 0.4
This allows me to easily update a value and query for a date range, but it adds some complexity, as I now need to use two Redis data structures instead of one.
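A rough redis-py sketch of option 1, with the zset indexing composite keys and plain string keys holding the values (key and function names are illustrative only):

import redis

r = redis.Redis()  # hypothetical connection

def set_value(item_id, date_str, score, value):
    # Index the composite key by score, store the value under a string key.
    r.zadd(str(item_id), {f"{item_id}_{date_str}": score})
    r.set(f"{item_id}_{date_str}", value)

def values_for_range(item_id, min_score, max_score):
    members = r.zrangebyscore(str(item_id), min_score, max_score)
    return r.mget(members) if members else []  # one MGET for all string keys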
2.) Use a hash table:
hash key | sub-key    | value
---------+------------+------
1        | 01/01/2001 | 1.2
1        | 02/01/2001 | 1.5
1        | 04/23/2002 | 1.5
2        | 05/05/2009 | 0.4
This is nice because I only have to use one data structure, and although it would be O(N) to get all values for a particular hash key, solution 1 would have the same drawback when getting the values for all string keys returned from the zset.
However, with this solution, I now need to generate all sub-keys between a given start and end date in my calling code, and not every date may have a value. There are also some edge cases that I now need to handle (what if the user wants all values up until today? Do I use HGETALL and just remove the ones in the future I don't care about? At what date range size should I use HGETALL rather than HMGET?)
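To illustrate the sub-key generation the hash option requires, a small sketch assuming redis-py (names are mine, not from the question):

from datetime import timedelta
import redis

r = redis.Redis()  # hypothetical connection

def values_for_range(item_id, start_date, end_date):
    # Generate every candidate date sub-key in the range, then HMGET them all.
    days = (end_date - start_date).days + 1
    fields = [(start_date + timedelta(d)).strftime("%m/%d/%Y") for d in range(days)]
    values = r.hmget(str(item_id), fields)
    # Dates with no stored value come back as None; drop them.
    return {f: v for f, v in zip(fields, values) if v is not None}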
In my view, there are pros and cons to each solution, and I'm not sure which one will be easier to maintain in the long term. Does anyone have thoughts as to which structure they would choose in this situation?

Amazon Redshift - Pivot Large JSON Arrays

I have an optimisation problem.
I have a table containing about 15MB of JSON stored as rows of VARCHAR(65535). Each JSON string is an array of arbitrary size.
95% contains 16 or fewer elements
the longest (to date) contains 67 elements
the hard limit is 512 elements (before 64kB isn't big enough)
The task is simple: pivot each array so that each element has its own row.
id | json
----+---------------------------------------------
01 | [{"something":"here"}, {"fu":"bar"}]
=>
id | element_id | json
----+------------+---------------------------------
01 | 1 | {"something":"here"}
01 | 2 | {"fu":"bar"}
Without having any kind of table-valued functions (user-defined or otherwise), I've resorted to pivoting via a join against a numbers table.
SELECT
    src.id,
    pvt.element_id,
    json_extract_array_element_text(
        src.json,
        pvt.element_id
    ) AS json
FROM
    source_table AS src
INNER JOIN
    numbers_table AS pvt (element_id)
    ON pvt.element_id < json_array_length(src.json)
The numbers table has 512 rows in it (0..511), and the results are correct.
The elapsed time is horrendous. And it's not to do with distribution, sort order, or encoding. It's to do with (I believe) Redshift's materialisation.
The working memory needed to process 15MB of JSON text is 7.5GB.
15MB * 512 rows in numbers = 7.5GB
If I put just 128 rows in the numbers table, the working memory needed reduces by 4x and the elapsed time reduces similarly (not by 4x; the real query does other work, and it's still writing the same amount of result data).
So, I wonder, what about adding this?
WHERE
pvt.element_id < (SELECT MAX(json_array_length(src.json)) FROM source_table)
There is no change to the working memory needed, and the elapsed time goes up slightly (effectively a WHERE clause that has a cost but no benefit).
I've tried making a CTE to create the list of 512 numbers, that didn't help. I've tried making a CTE to create the list of numbers, with a WHERE clause to limit the size, that didn't help (effectively Redshift appears to have materialised using the 512 rows and THEN applied the WHERE clause).
My current effort is to create a temporary table for the numbers, limited by the WHERE clause. In my sample set this means that I get a table with 67 rows to join on, instead of 512 rows.
That's still not great, as that ONE row with 67 elements dominates the elapsed time (every row, no matter how many elements, gets duplicated 67 times before the ON pvt.element_id < json_array_length(src.json) gets applied).
My next effort will be to work on it in two steps.
As above, but with a table of only 16 rows, and only for rows with 16 or fewer elements
As above, with the dynamically sized numbers table, and only for rows with more than 16 elements
Question: Does anyone have any better ideas?
Please consider declaring the JSON as an external table. You can then use Redshift Spectrum's nested data syntax to access these values as if they were rows.
There is a quick tutorial here: "Tutorial: Querying Nested Data with Amazon Redshift Spectrum"
Simple example:
{ "id": 1
,"name": { "given":"John", "family":"Smith" }
,"orders": [ {"price": 100.50, "quantity": 9 }
,{"price": 99.12, "quantity": 2 }
]
}
CREATE EXTERNAL TABLE spectrum.nested_tutorial
(id int
,name struct<given:varchar(20), family:varchar(20)>
,orders array<struct<price:double precision, quantity:double precision>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-files/temp/nested_data/nested_tutorial/'
;
SELECT c.id
,c.name.given
,c.name.family
,o.price
,o.quantity
FROM spectrum.nested_tutorial c
LEFT JOIN c.orders o ON true
;
id | given | family | price | quantity
----+-------+--------+-------+----------
1 | John | Smith | 100.5 | 9
1 | John | Smith | 99.12 | 2
Neither the data format, nor the task you wish to do, is ideal for Amazon Redshift.
Amazon Redshift is excellent as a data warehouse, with the ability to do queries against billions of rows. However, storing data as JSON is sub-optimal because Redshift cannot use all of its abilities (eg Distribution Keys, Sort Keys, Zone Maps, Parallel processing) while processing fields stored in JSON.
The efficiency of your Redshift cluster would be much higher if the data were stored as:
id | element_id | key       | value
---+------------+-----------+------
01 |          1 | something | here
01 |          2 | fu        | bar
As to how best to convert the existing JSON data into separate rows, I would frankly recommend that this be done outside of Redshift and then loaded into tables via the COPY command. A small Python script would be more efficient at converting the data than trying strange JOINs on a numbers table in Redshift.
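A minimal sketch of such a script, assuming the source rows have been exported to a pipe-delimited file with two columns (file names, delimiter, and the target table are assumptions, not taken from the question):

import csv
import json

with open("source_rows.psv") as src, open("elements.psv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="|")
    for row_id, json_text in csv.reader(src, delimiter="|"):
        # One output row per array element, mirroring the desired pivot.
        for element_id, element in enumerate(json.loads(json_text), start=1):
            writer.writerow([row_id, element_id, json.dumps(element)])

# The result can be uploaded to S3 and loaded with something like:
#   COPY target_table FROM 's3://my-bucket/elements.psv' ... DELIMITER '|';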
Maybe if you avoid parsing and interpreting the JSON as JSON, and instead work with it as plain text, it can be faster. If you're sure about the structure of your JSON values (which I guess you are, since the original query does not produce a JSON parsing error), you might try to just use the split_part function instead of json_extract_array_element_text.
If your elements don't contain commas you can use:
split_part(src.json,',',pvt.element_id)
if your elements contain commas you might use
split_part(src.json,'},{',pvt.element_id)
Also, the ON pvt.element_id < json_array_length(src.json) part of the join condition is still there, so to avoid JSON parsing completely you might try a CROSS JOIN and then filter out the empty results.

What is the shortest notation for updating a column in an internal table?

I have an ABAP internal table. Structured, with several columns (e.g. 25). Names and types are irrelevant. The table can get pretty large (e.g. 5,000 records).
| A | B | ... |
| --- | --- | --- |
| 7 | X | ... |
| 2 | CCC | ... |
| 42 | DD | ... |
Now I'd like to set one of the columns (e.g. B) to a specific constant value (e.g. 'Z').
What is the shortest, fastest, and most memory-efficient way to do this?
My best guess is a LOOP REFERENCE INTO. This is pretty efficient as it changes the table in-place, without wasting new memory. But it takes up three statements, which makes me wonder whether it's possible to get shorter:
LOOP AT lt_table REFERENCE INTO DATA(ls_row).
ls_row->b = 'Z'.
ENDLOOP.
Then there is the VALUE operator which reduces this to one statement but is not very efficient because it creates new memory areas. It also gets longish for a large number of columns, because they have to be listed one by one:
lt_table = VALUE #( FOR ls_row in lt_table ( a = ls_row-a
b = 'Z' ) ).
Are there better ways?
The following code sets PRICE = 0 in all lines at a time. Theoretically, it should be the fastest way to update all the lines of one column, because it's a single statement. Note that it's impossible to omit the WHERE, so I use a simple trick to update all lines.
DATA flights TYPE TABLE OF sflight.
DATA flight TYPE sflight.
SELECT * FROM sflight INTO TABLE flights.
flight-price = 0.
MODIFY flights FROM flight TRANSPORTING price WHERE price <> flight-price.
Reference: MODIFY itab - itab_lines
If you have a work area declared...
workarea-field = 'Z'.
MODIFY itab FROM workarea TRANSPORTING field WHERE field <> workarea-field.
Edit:
After being able to check that syntax in my current system, I could prove (to myself) that the WHERE clause must be added or a dump is raised.
Thanks Sandra Rossi.

Is condensing the number of columns in a database beneficial?

Say you want to record three numbers for every Movie record...let's say, :release_year, :box_office, and :budget.
Conventionally, using Rails, you would just add those three attributes to the Movie model and call @movie.release_year, @movie.box_office, and @movie.budget.
Would it save any database space or provide any other benefits to condense all three numbers into one umbrella column?
So when adding the three numbers, it would go something like:
def update
  ...
  @movie.umbrella = params[:movie_release_year] + "," +
    params[:movie_box_office] + "," + params[:movie_budget]
end
So the final @movie.umbrella value would be along the lines of "2015,617293,748273".
And then in the controller, to access the three values, it would be something like
@umbrella_array = @movie.umbrella.strip.split(',').map(&:strip)
@release_year = @umbrella_array.first
@box_office = @umbrella_array.second
@budget = @umbrella_array.third
This way, it would be the same amount of data (actually a little more, with the extra commas) but stored only in one column. Would this be better in any way than three columns?
There is no benefit in squeezing such attributes in a single column. In fact, following that path will increase the complexity of your code and will limit your capabilities.
Here's some of the possible issues you'll face:
You will not be able to add indexes to increase the performance of looking up or filtering records by a specific attribute value
You will not be able to query a specific attribute value
You will not be able to sort by a specific column value
The values will be stored and represented as Strings, rather than Integers
... and I can continue. There are no advantages, only disadvantages.
I agree with the points above; as an example, try pg_column_size() to compare the results:
WITH test(data_txt,data_int,data_date) AS ( VALUES
('9999'::TEXT,9999::INTEGER,'2015-01-01'::DATE),
('99999999'::TEXT,99999999::INTEGER,'2015-02-02'::DATE),
('2015-02-02'::TEXT,99999999::INTEGER,'2015-02-02'::DATE)
)
SELECT pg_column_size(data_txt) AS txt_size,
pg_column_size(data_int) AS int_size,
pg_column_size(data_date) AS date_size
FROM test;
The result is:
txt_size | int_size | date_size
----------+----------+-----------
5 | 4 | 4
9 | 4 | 4
11 | 4 | 4
(3 rows)

Convert any string to an integer

Simply put, I'd like to be able to convert any string to an integer, preferably being able to restrict the size of the integer and ensure that the result is always identical. In other words, is there a hashing function, supported by Oracle, that returns a numeric value, and can that value have a maximum?
To provide some context if needed, I have two tables that have the following, simplified, format:
Table 1                 Table 2
id | sequence_number    id | sequence_number
---+-----------------   ---+----------------
 1 | 1                   1 | 2QD44561
 1 | 2                   1 | 6HH00244
 2 | 1                   2 | 5DH08133
 3 | 1                   3 | 7RD03098
 4 | 2                   4 | 8BF02466
The column sequence_number is number(3) in Table 1 and varchar2(11) in Table 2; it is part of the primary key in both tables.
The data is externally provided and cannot be changed; in Table 1 it is, I believe, created by a simple sequence but in Table 2 has a meaning. The data is made up but representative.
Someone has promised that we would output a number(3) field. While this is fine for the column in the first table, it causes problems for the second.
I would like to convert sequence_number to an integer (easy) that is less than 1000 (harder) and, if at all possible, is constant (seemingly impossible). This means that I would like '2QD44561' to always return 586. It does not matter much if two strings return the same number.
For simply converting to a number, I can use utl_raw.cast_to_number():
select utl_raw.cast_to_number((utl_raw.cast_to_raw('2QD44561'))) from dual;
UTL_RAW.CAST_TO_NUMBER((UTL_RAW.CAST_TO_RAW('2QD44561')))
---------------------------------------------------------
-2.033E+25
But as you can see, this isn't less than 1000.
I've also been playing around with dbms_crypto and utl_encode to see if I could come up with something, but I've not managed to get a small integer. Is there a way?
How about ora_hash?
select ora_hash(sequence_number, 999) from table_2;
... will produce a maximum of 3 digits. You could also seed it with the id I suppose, but not sure that adds much with so few values, and I'm not sure you'd want that anyway.
You are talking about using a hash function. There are lots of solutions out there; SHA-1 is very common.
But just FYI, when you say "restrict the size of the integer", understand that you will then be mapping an infinite set of strings or numbers onto a limited set of values. So while a given string will always map to the same value, it will not be the only string that maps to that value.