I have a table representing values of source file metrics across project revisions, like the following:
Revision  FileA  FileB  FileC  FileD  FileE  ...
       1     45      3     12    123    124
       2     45      3     12    123    124
       3     45      3     12    123    124
       4     48      3     12    123    124
       5     48      3     12    123    124
       6     48      3     12    123    124
       7     48     15     12    123    124
(The relational view of the above data is different. Each row contains the following columns: Revision, FileId, Value. The files and their revisions from which the data is calculated are stored in Subversion repositories, so we're trying to represent the repository's structure in a relational schema.)
There can be up to 23,750 files across 10,000 revisions (this is the case for the ImageMagick drawing program). As you can see, most values are the same between successive revisions, so the table's useful data is quite sparse. I am looking for a way to store the data that:
avoids duplication and uses space efficiently (currently the non-sparse representation requires 260 GB, data plus index, for less than 10% of the data I want to store)
allows me to efficiently retrieve the values for a specific revision with an SQL query (without explicitly looping through revisions or files)
allows me to efficiently retrieve the revisions in which a file has a specific metric value.
Ideally, the solution should not depend on a particular RDBMS and should be compatible with Hibernate. If this is not possible, I can live with using Hibernate, MySQL or PostgreSQL-specific features.
This is how I might model it. I've left out the Revisions table and Files table as those should be pretty self-explanatory.
CREATE TABLE Revision_Files
(
    start_revision_number INT NOT NULL,
    end_revision_number   INT NOT NULL,
    file_number           INT NOT NULL,
    value                 INT NOT NULL,
    CONSTRAINT PK_Revision_Files PRIMARY KEY CLUSTERED (start_revision_number, file_number),
    CONSTRAINT CHK_Revision_Files_start_before_end CHECK (start_revision_number <= end_revision_number)
)
GO
To get all of the values for files of a particular revision you could use the following query (add a WHERE filter on REV.revision_number to restrict it to a single revision). Joining to the Files table with an outer join would also let you get the files that have no defined value for that revision; a sketch of both follows the query.
SELECT
    REV.revision_number,
    RF.file_number,
    RF.value
FROM
    Revisions REV
    INNER JOIN Revision_Files RF ON
        RF.start_revision_number <= REV.revision_number AND
        RF.end_revision_number >= REV.revision_number
GO
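A sketch of those two variations, assuming a Files table keyed by file_number and a @revision_number parameter (both are assumptions; they are not defined above):
SELECT
    REV.revision_number,
    F.file_number,
    RF.value                      -- NULL when the file has no value at that revision
FROM
    Revisions REV
    CROSS JOIN Files F
    LEFT OUTER JOIN Revision_Files RF ON
        RF.file_number = F.file_number AND
        RF.start_revision_number <= REV.revision_number AND
        RF.end_revision_number >= REV.revision_number
WHERE
    REV.revision_number = @revision_number
GO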
Assuming that I understand correctly what you want in your third point, this will let you get all of the revisions for which a particular file has a certain value:
SELECT
    REV.revision_number
FROM
    Revision_Files RF
    INNER JOIN Revisions REV ON
        REV.revision_number BETWEEN RF.start_revision_number AND RF.end_revision_number
WHERE
    RF.file_number = #file_number AND
    RF.value = #value
GO
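If the value lookup in that last query turns out to be slow, one option is a supporting index on (file_number, value). This is only a sketch; the index name is arbitrary:
CREATE NONCLUSTERED INDEX IX_Revision_Files_file_value
    ON Revision_Files (file_number, value)
    INCLUDE (start_revision_number, end_revision_number)
GO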
I have an optimisation problem.
I have a table containing about 15MB of JSON stored as rows of VARCHAR(65535). Each JSON string is an array of arbitrary size.
95% contain 16 or fewer elements
the longest (to date) contains 67 elements
the hard limit is 512 elements (beyond that, 64 kB isn't big enough)
The task is simple: pivot each array so that each element has its own row.
id | json
----+---------------------------------------------
01 | [{"something":"here"}, {"fu":"bar"}]
=>
id | element_id | json
----+------------+---------------------------------
01 | 1 | {"something":"here"}
01 | 2 | {"fu":"bar"}
Without having any kind of table-valued functions (user-defined or otherwise), I've resorted to pivoting by joining against a numbers table.
SELECT
    src.id,
    pvt.element_id,
    json_extract_array_element_text(
        src.json,
        pvt.element_id
    ) AS json
FROM
    source_table AS src
INNER JOIN
    numbers_table AS pvt(element_id)
    ON pvt.element_id < json_array_length(src.json)
The numbers table has 512 rows in it (0..511), and the results are correct.
The elapsed time is horrendous, and it's not down to distribution or sort order or encoding. It's down to (I believe) Redshift's materialisation.
The working memory needed to process 15MB of JSON text is 7.5GB.
15MB * 512 rows in numbers = 7.5GB
If I put just 128 rows in numbers then the working memory needed reduces by 4x and the elapsed time reduces similarly (not by 4x; the real query does other work, it's still writing the same amount of results data, etc, etc).
So, I wonder, what about adding this?
WHERE
pvt.element_id < (SELECT MAX(json_array_length(src.json)) FROM source_table)
No change to the working memory needed; the elapsed time goes up slightly (effectively a WHERE clause that has a cost but no benefit).
I've tried making a CTE to create the list of 512 numbers; that didn't help. I've tried making a CTE to create the list of numbers with a WHERE clause to limit the size; that didn't help (effectively Redshift appears to have materialised using the 512 rows and THEN applied the WHERE clause).
My current effort is to create a temporary table for the numbers, limited by the WHERE clause. In my sample set this means that I get a table with 67 rows to join on, instead of 512 rows.
That's still not great, as that ONE row with 67 elements dominates the elapsed time (every row, no matter how many elements, gets duplicated 67 times before the ON pvt.element_id < json_array_length(src.json) gets applied).
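For reference, that temporary-table step looks something like this (a sketch; it assumes the numbers table's underlying column is called n, which isn't shown above):
CREATE TEMP TABLE pivot_numbers AS
SELECT n AS element_id
FROM numbers_table
WHERE n < (SELECT MAX(json_array_length(json)) FROM source_table);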
My next effort will be to work on it in two steps.
As above, but with a table of only 16 rows, and only for rows with 16 or fewer elements
As above, with the dynamically sized numbers table, and only for rows with more than 16 elements
Question: Does anyone have any better ideas?
Please consider declaring the JSON as an external table. You can then use Redshift Spectrum's nested data syntax to access these values as if they were rows.
There is a quick tutorial here: "Tutorial: Querying Nested Data with Amazon Redshift Spectrum"
Simple example:
{ "id": 1
,"name": { "given":"John", "family":"Smith" }
,"orders": [ {"price": 100.50, "quantity": 9 }
,{"price": 99.12, "quantity": 2 }
]
}
CREATE EXTERNAL TABLE spectrum.nested_tutorial
(id int
,name struct<given:varchar(20), family:varchar(20)>
,orders array<struct<price:double precision, quantity:double precision>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-files/temp/nested_data/nested_tutorial/'
;
SELECT c.id
,c.name.given
,c.name.family
,o.price
,o.quantity
FROM spectrum.nested_tutorial c
LEFT JOIN c.orders o ON true
;
id | given | family | price | quantity
----+-------+--------+-------+----------
1 | John | Smith | 100.5 | 9
1 | John | Smith | 99.12 | 2
Neither the data format nor the task you wish to do is ideal for Amazon Redshift.
Amazon Redshift is excellent as a data warehouse, with the ability to run queries against billions of rows. However, storing data as JSON is sub-optimal because Redshift cannot use all of its abilities (e.g. Distribution Keys, Sort Keys, Zone Maps, parallel processing) while processing fields stored in JSON.
The efficiency of your Redshift cluster would be much higher if the data were stored as:
 id | element_id | key       | value
----+------------+-----------+-------
 01 |          1 | something | here
 01 |          2 | fu        | bar
As for how best to convert the existing JSON data into separate rows, I would frankly recommend doing it outside of Redshift, then loading the results into tables via the COPY command. A small Python script would be more efficient at converting the data than trying strange JOINs on a numbers table in Redshift.
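The load itself would then be a plain COPY into a table shaped like the layout above. This is only a sketch; the table name, column names, S3 path, and IAM role are placeholders:
CREATE TABLE element_rows
(
    id            INT          NOT NULL,
    element_id    INT          NOT NULL,
    element_key   VARCHAR(256),         -- the "key" column above
    element_value VARCHAR(256)          -- the "value" column above
);

COPY element_rows
FROM 's3://my-bucket/exploded-json/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
FORMAT AS CSV;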
Maybe if you avoid parsing and interpreting the JSON as JSON and instead work with it as plain text, it can work faster. If you're sure about the structure of your JSON values (which I guess you are, since the original query does not produce a JSON parsing error) you might try just using the split_part function instead of json_extract_array_element_text.
If your elements don't contain commas you can use:
split_part(src.json,',',pvt.element_id)
if your elements contain commas you might use
split_part(src.json,'},{',pvt.element_id)
Also, the ON pvt.element_id < json_array_length(src.json) part of the join condition still parses the JSON, so to avoid JSON parsing completely you might try a cross join and then keep only the non-empty results; see the sketch below.
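A sketch of that idea (split_part positions are 1-based while element_id is 0-based, the delimiter assumes elements are packed as },{ with no spaces, and the first/last parts keep their surrounding brackets, so treat this as an approximation only):
SELECT
    src.id,
    pvt.element_id,
    split_part(src.json, '},{', pvt.element_id + 1) AS element_text
FROM
    source_table AS src
CROSS JOIN
    numbers_table AS pvt(element_id)
WHERE
    split_part(src.json, '},{', pvt.element_id + 1) <> '';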
I have seen a similar question asked How to get second highest value among multiple columns in SQL ... however the solution won't work for Microsoft Access (Row_Number/Over Partition isn't valid in Access).
My Access query includes dozens of fields. I would like to create a new field/column that would return the second highest value of 10 specific columns that are included in the query, I will call this field "Cover". Something like this:
Product  Bid1  Bid2  Bid3  Bid4  Cover
Watch     104   120   115   108    115
Shoe       65    78    79    76     78
Hat        20    22    19    20     20
I can write a really long Switch() formula, along the lines of this Excel-style pseudocode:
Switch( AND(Bid1 > Bid2, Bid1 > Bid3, Bid1 > Bid4), Bid1,
        AND(Bid2 > Bid1, Bid2 > Bid3, Bid2 > Bid4), Bid2,
        ..... )
But there must be a more efficient solution. A MAXIF equivalent would work perfectly if MS-Access Query had such a function.
Any ideas? Thank you in advance.
This would be easier if the data were laid out in a more normalized way. The clue is the numbered field names.
Your data is currently organized as a Pivot (known in Access as crosstab), but can easily be Unpivoted.
This data is much easier to work with if laid out in a more normalized fashion, which in this case would be the following (a union query that produces this layout is sketched after the table):
Product Bid Amount
--------- ----- --------
Watch 1 104
Watch 2 120
Watch 3 115
Watch 4 108
Shoe 1 65
Shoe 2 78
Shoe 3 79
Shoe 4 76
Hat 1 20
Hat 2 22
Hat 3 19
Hat 4 20
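In Access, a union query can produce this layout on the fly from the wide table. A sketch, assuming the wide table is called BidsWide (a placeholder) with the Product and Bid1..Bid4 columns from the question; save it as a query (e.g. myTable) and the queries below can run against it:
SELECT Product, 1 AS Bid, Bid1 AS Amount FROM BidsWide
UNION ALL
SELECT Product, 2, Bid2 FROM BidsWide
UNION ALL
SELECT Product, 3, Bid3 FROM BidsWide
UNION ALL
SELECT Product, 4, Bid4 FROM BidsWide;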
This way querying becomes simpler.
It looks like you want the second-highest of the bids per Product (your Cover column), which a correlated subquery handles:
select t.Product, max(t.Amount) as Cover
from myTable as t
where t.Amount < (select max(m.Amount) from myTable as m where m.Product = t.Product)
group by t.Product
Really, we shouldn't be storing repeated text values at all, so Product should be an ID number, with the associated Product Names stored once in a separate table instead of several times in this one, like:
ProdID ProdName
-------- ----------
1 Watch
2 Shoe
3 Hat
... but that's another lesson.
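For what it's worth, a minimal sketch of that lookup table in Access DDL (names are illustrative):
CREATE TABLE Products
(
    ProdID   COUNTER PRIMARY KEY,
    ProdName TEXT(50) NOT NULL
);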
Generally speaking, repetition of anything should be avoided... that's pretty much the purpose of a database... but the links below will explain it better than I can. :)
Quackit : Microsoft Access Tutorial
YouTube : DB Planning
Microsoft : Database Design Basics
Microsoft : Database Normalization Basics
Wikipedia : Database Normalization
I'm wondering if this is one of those situations where I'm forced to use a cursor or if I can use a set based approach. I've searched for several hours and also tried to come up with a solution myself to no avail.
I've got a table, SuperSupplierCodes, that contains two columns: SuperSupplierID INT and SupplierName NVARCHAR(50).
SuperSupplierID SupplierName
1 21ST CENTURY GRAPHIC TECHNOLOGIES LLC
2 3D SYSTEMS
3 3G
4 A A ABRASIVOS ARGENTINOS SAIC
5 A AND F DRUCKLUFTTECHNIK GMBH
6 A BAY STATIONERS
7 A C T TOOL AND ENGINEERING LLC
8 A HERZOG AG
9 A LI T DI MONTANARI MARCO AND CO SAS
11 A RAYMOND GMBH AND CO KG
I've got a second table with millions of rows in it containing financial data as well as a supplier name column (LocalSupplierName).
LocalSupplierName
23 JAN HOFMEYER ROAD
303 TAXICAB, LLC
3D MECA SARL
3D SYSTEMS
3E CO ENVIRONMENTAL, ECO. & EN
3E COMPANY
What I need to do is insert into the SuperSupplierCodes table such that each new row takes the current MAX(SuperSupplierID), increments it by one, and stores that in the SuperSupplierID column along with the SupplierName from the second table.
I've tried the following as a test of something I might be able to use for the insert, but of course it only does the increment once and tries to use that same value of SuperSupplierID for every row:
SELECT s.SuperSupplierID,
s.SupplierName,
s.SupplierAddress,
s.DateCreated,
s.DateModified,
s.SupplierCode,
s.PlantName,
s.id,
x.MaxSSC
FROM SuperSupplierCodes AS s
CROSS APPLY (SELECT MAX(SuperSupplierID)+1 AS MaxSSC FROM dbo.SuperSupplierCodes) x;
I don't like using cursors unless I absolutely have to. Is there a way to do this with T-SQL in a set based manner versus using a cursor?
Create the column as an IDENTITY column and insert the existing records once using the SET IDENTITY_INSERT ON option. Then switch it off; new rows will get automatically incremented IDs.
https://learn.microsoft.com/en-us/sql/t-sql/statements/set-identity-insert-transact-sql?view=sql-server-2017
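A sketch of that flow, assuming the table is rebuilt with an IDENTITY column (the _New table name is a placeholder):
CREATE TABLE dbo.SuperSupplierCodes_New
(
    SuperSupplierID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    SupplierName    NVARCHAR(50) NOT NULL
);

SET IDENTITY_INSERT dbo.SuperSupplierCodes_New ON;

INSERT INTO dbo.SuperSupplierCodes_New (SuperSupplierID, SupplierName)
SELECT SuperSupplierID, SupplierName
FROM dbo.SuperSupplierCodes;            -- keep the existing IDs as-is

SET IDENTITY_INSERT dbo.SuperSupplierCodes_New OFF;

-- From here on, omit SuperSupplierID and let IDENTITY assign the next value:
INSERT INTO dbo.SuperSupplierCodes_New (SupplierName)
VALUES (N'SOME NEW SUPPLIER NAME');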
Why not something like this?
SELECT (SELECT MAX(SuperSupplierID) FROM dbo.SuperSupplierCodes) + ROW_NUMBER() OVER (ORDER BY s.DateCreated) AS SuperSupplierID,
s.SupplierName,
s.SupplierAddress,
s.DateCreated,
s.DateModified,
s.SupplierCode,
s.PlantName,
s.id
FROM SuperSupplierCodes AS s;
We use the above technique at my work all the time when inserting rows. If some have existing values, you can insert them all into the table and then change the above to only update values that are currently null.
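To actually load new suppliers from the second table, the same expression can drive an INSERT ... SELECT. A sketch, assuming the second table is named dbo.FinancialData (a placeholder) and that only names not yet coded should be added:
INSERT INTO dbo.SuperSupplierCodes (SuperSupplierID, SupplierName)
SELECT (SELECT MAX(SuperSupplierID) FROM dbo.SuperSupplierCodes)
           + ROW_NUMBER() OVER (ORDER BY f.LocalSupplierName) AS SuperSupplierID,
       f.LocalSupplierName
FROM (SELECT DISTINCT LocalSupplierName FROM dbo.FinancialData) AS f
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.SuperSupplierCodes AS s
                  WHERE s.SupplierName = f.LocalSupplierName);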
I have a Django web app with a Postgres DB; the general operation is that every day I have an array of values that needs to be stored in one of the tables.
There is no foreseeable need to query the values inside the array, but I do need to be able to plot the values for a specific day.
The problem is that this array is pretty big: if I were to store it row by row in the db, I'd have 60 million rows per year, but if I store each array as a blob object, I'd have 60 thousand rows per year.
Is it a good decision to use a blob object to reduce table size when you do not want to query the individual values?
Here are the two options:
option1: keeping all
group(foreignkey)| parent(foreignkey) | pos(int) | length(int)
A | B | 232 | 45
A | B | 233 | 45
A | B | 234 | 45
A | B | 233 | 46
...
option2: collapsing the array into a blob:
group(fk)| parent(fk) | mean_len(float)| values(blob)
A | B | 45 |[(pos=232, len=45),...]
...
So I do NOT want to query pos or length, but I do want to query group or parent.
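A Postgres sketch of option 2 under those constraints (table and column names are illustrative, the parent table is assumed to exist, and jsonb could be swapped for bytea if the payload is fully opaque):
CREATE TABLE measurements_collapsed (
    id        bigserial PRIMARY KEY,
    group_id  integer NOT NULL REFERENCES "group"(id),
    parent_id integer NOT NULL REFERENCES parent(id),
    mean_len  double precision,
    vals      jsonb            -- the whole [(pos, len), ...] array, never filtered on
);

-- Only the columns used for filtering need indexes:
CREATE INDEX ON measurements_collapsed (group_id);
CREATE INDEX ON measurements_collapsed (parent_id);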
An example of read query that I'm talking about is:
SELECT * FROM "mytable"
LEFT OUTER JOIN "group"
ON ( "group"."id" = "grouptable"."id" )
ORDER BY "pos" DESC LIMIT 100
which is a typical django admin list_view page main query.
I tried loading the data and tried displaying the table in the django admin page without doing any complex query (just a read query).
When I get past 1.5 million rows, the admin page freezes. All it takes is a count query on that table to make the app crash, so I should definitely either keep the data as a blob or keep it out of the db entirely and use the filesystem instead.
I want to emphasize that I've used Django 1.8 as my test bench, so this is not a Postgres evaluation but rather a system evaluation of Django admin with Postgres.
I am designing a system containing logical steps with some actions associated; the actions are not part of the question, but they are crucial for each step in the list.
The thing is that I need to create a way to define all the logical steps in an ordered way, so that I can get the list by query and also make modifications later on.
Anyone with some experience in this kind of database design?
I have been thinking of having a table named wizard_steps (or something similar) and then using a priority column to define the order, but for some reason I feel that this design will fail at some point (items with the same priority, new items forcing the rest to be renumbered, and so forth).
Another design I have been thinking about is a "next item" column in the wizard_steps table, but I don't feel this is the correct approach either.
So to summarize: I am trying to make a list (and the design should be open enough to support multiple lists) of elements where the order is crucial.
Any ideas on how the database should look like?
Thanks!
EDIT: I found this yii component I will check out: http://www.yiiframework.com/extension/simpleworkflow/
Might be a good solution!
If I understand you correctly, your main concern is to create a schema that supports ordered lists and allows easy insertion/reordering of items.
The following table design:
id_list item_priority foreign_itemdef_id
1 1 245
1 2 32
1 3 45
2 1 156
2 2 248
2 3 127
coupled with an item-definition table, will be easy to query but difficult to maintain, especially for insertions.
That one:
id_list first_item_id
1 45
2 38
coupled to the linked list:
item_id next_item foreign_itemdef_id
45 381 56
381 NULL 59
38 39 89
39 42 78
42 NULL 45
will be both difficult to query and to update (you should update the linked list inside a transaction, otherwise it can get corrupted).
I would prefer the first solution for simplicity.
Depending on your update frequency, you may consider using large increments between item_priority to help insertion:
id_list item_priority foreign_itemdef_id
1 1000 245
1 2000 32
1 3000 45
2 1000 156
2 2000 248
2 3000 127
1 2500 46 -- late insertion
1 2750 47 -- late insertion
EDIT:
Here's a query that will hopefully make room for an insertion: it increments the priority of all rows above the given position.
$query_make_room_for_new_item = "UPDATE item_priority_table SET item_priority = item_priority + 1 WHERE item_priority > ". $new_item_position_priority ." AND id_list = ".$id_list;
Then insert your item with priority $new_item_position_priority
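In plain SQL, the two steps look roughly like this (a sketch; :id_list, :new_priority, and :new_itemdef_id stand in for the PHP variables above, and it assumes no existing row sits exactly at :new_priority):
UPDATE item_priority_table
SET item_priority = item_priority + 1
WHERE id_list = :id_list
  AND item_priority > :new_priority;

INSERT INTO item_priority_table (id_list, item_priority, foreign_itemdef_id)
VALUES (:id_list, :new_priority, :new_itemdef_id);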