Amazon Redshift - Pivot Large JSON Arrays - sql

I have an optimisation problem.
I have a table containing about 15MB of JSON stored as rows of VARCHAR(65535). Each JSON string is an array of arbitrary size.
95% of the arrays contain 16 or fewer elements
the longest (to date) contains 67 elements
the hard limit is 512 elements (beyond that, 64kB isn't big enough)
The task is simple: pivot each array so that each element has its own row.
 id | json
----+---------------------------------------------
 01 | [{"something":"here"}, {"fu":"bar"}]
=>
 id | element_id | json
----+------------+---------------------------------
 01 |          1 | {"something":"here"}
 01 |          2 | {"fu":"bar"}
Without any kind of table-valued functions (user-defined or otherwise), I've resorted to pivoting via a join against a numbers table.
SELECT
    src.id,
    pvt.element_id,
    json_extract_array_element_text(
        src.json,
        pvt.element_id
    )
        AS json
FROM
    source_table AS src
INNER JOIN
    numbers_table AS pvt(element_id)
        ON pvt.element_id < json_array_length(src.json)
The numbers table has 512 rows in it (0..511), and the results are correct.
The elapsed time is horrendous, and it's not down to distribution, sort order, or encoding. It's down to (I believe) Redshift's materialisation.
The working memory needed to process 15MB of JSON text is 7.5GB.
15MB * 512 rows in numbers = 7.5GB
If I put just 128 rows in numbers then the working memory needed reduces by 4x and the elapsed time similarly reduces (not 4x, the real query does other work, it's still writing the same amount of results data, etc, etc).
So, I wonder, what about adding this?
WHERE
    pvt.element_id < (SELECT MAX(json_array_length(json)) FROM source_table)
No change to the working memory needed; the elapsed time goes up slightly (effectively a WHERE clause that has a cost but no benefit).
I've tried making a CTE to create the list of 512 numbers; that didn't help. I've tried making a CTE to create the list of numbers with a WHERE clause to limit the size; that didn't help either (Redshift appears to have materialised using the 512 rows and THEN applied the WHERE clause).
My current effort is to create a temporary table for the numbers, limited by the WHERE clause. In my sample set this means that I get a table with 67 rows to join on, instead of 512 rows.
That's still not great, as that ONE row with 67 elements dominates the elapsed time (every row, no matter how many elements, gets duplicated 67 times before the ON pvt.element_id < json_array_length(src.json) gets applied).
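For reference, a minimal sketch of that temporary-table step (assuming the full 0..511 numbers table is called numbers_table and its single column is called n; adjust the names to suit):
CREATE TEMP TABLE numbers_tmp AS
SELECT n AS element_id
FROM numbers_table
WHERE n < (SELECT MAX(json_array_length(json)) FROM source_table);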
My next effort will be to work on it in two steps (a sketch follows below).
As above, but with a numbers table of only 16 rows, and only for rows with 16 or fewer elements
As above, but with the dynamically sized numbers table, and only for rows with more than 16 elements
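A rough sketch of what that two-step query might look like (assuming a 16-row table called small_numbers and the dynamically sized table called big_numbers, both single-column and aliased the same way as before; an illustration of the idea, not a tested query):
SELECT
    src.id,
    pvt.element_id,
    json_extract_array_element_text(src.json, pvt.element_id) AS json
FROM source_table AS src
INNER JOIN small_numbers AS pvt(element_id)
    ON pvt.element_id < json_array_length(src.json)
WHERE json_array_length(src.json) <= 16

UNION ALL

SELECT
    src.id,
    pvt.element_id,
    json_extract_array_element_text(src.json, pvt.element_id) AS json
FROM source_table AS src
INNER JOIN big_numbers AS pvt(element_id)
    ON pvt.element_id < json_array_length(src.json)
WHERE json_array_length(src.json) > 16;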
Question: Does anyone have any better ideas?

Please consider declaring the JSON as an external table. You can then use Redshift Spectrum's nested data syntax to access these values as if they were rows.
There is a quick tutorial here: "Tutorial: Querying Nested Data with Amazon Redshift Spectrum"
Simple example:
{ "id": 1
,"name": { "given":"John", "family":"Smith" }
,"orders": [ {"price": 100.50, "quantity": 9 }
,{"price": 99.12, "quantity": 2 }
]
}
CREATE EXTERNAL TABLE spectrum.nested_tutorial
(id int
,name struct<given:varchar(20), family:varchar(20)>
,orders array<struct<price:double precision, quantity:double precision>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-files/temp/nested_data/nested_tutorial/'
;
SELECT c.id
      ,c.name.given
      ,c.name.family
      ,o.price
      ,o.quantity
FROM spectrum.nested_tutorial c
LEFT JOIN c.orders o ON true
;
 id | given | family | price | quantity
----+-------+--------+-------+----------
  1 | John  | Smith  | 100.5 |        9
  1 | John  | Smith  | 99.12 |        2

Neither the data format, nor the task you wish to do, is ideal for Amazon Redshift.
Amazon Redshift is excellent as a data warehouse, with the ability to do queries against billions of rows. However, storing data as JSON is sub-optimal because Redshift cannot use all of its abilities (eg Distribution Keys, Sort Keys, Zone Maps, Parallel processing) while processing fields stored in JSON.
The efficiency of your Redshift cluster would be much higher if the data were stored as:
 id | element_id | key       | value
----+------------+-----------+-------
 01 |          1 | something | here
 01 |          2 | fu        | bar
As to how best to convert the existing JSON data into separate rows, I would frankly recommend doing this outside of Redshift, then loading the result into tables via the COPY command. A small Python script would be more efficient at converting the data than trying strange JOINs on a numbers table in Redshift.
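For example, a hedged sketch of the Redshift side of that load (the table definition, S3 path, and IAM role below are hypothetical placeholders):
-- Target table for the flattened elements
CREATE TABLE elements
(
    id          VARCHAR(16),
    element_id  INTEGER,
    "key"       VARCHAR(256),
    "value"     VARCHAR(65535)
)
DISTKEY (id)
SORTKEY (id, element_id);

-- Load the rows produced by the external script (e.g. gzipped CSV staged in S3)
COPY elements
FROM 's3://my-bucket/flattened/'
IAM_ROLE 'arn:aws:iam::111122223333:role/my-redshift-load-role'
FORMAT AS CSV
GZIP;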

It might be faster to avoid parsing and interpreting the JSON as JSON and instead work with it as plain text. If you're sure about the structure of your JSON values (which I guess you are, since the original query doesn't produce a JSON parsing error), you might try using the split_part function instead of json_extract_array_element_text.
If your elements don't contain commas you can use:
split_part(src.json, ',', pvt.element_id + 1)
If your elements contain commas you might use:
split_part(src.json, '},{', pvt.element_id + 1)
(The + 1 is needed because split_part is 1-based, while element_id runs from 0.)
Also, the ON pvt.element_id < json_array_length(src.json) join condition still parses JSON, so to avoid JSON parsing completely you might try a cross join and then keep only the rows where the split actually produced an element.
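For example, something along these lines (a sketch only, assuming the '},{' delimiter is safe for your data; note that in Redshift split_part returns an empty string, not NULL, when the requested part doesn't exist):
SELECT
    src.id,
    pvt.element_id,
    split_part(src.json, '},{', pvt.element_id + 1) AS element
FROM source_table AS src
CROSS JOIN numbers_table AS pvt(element_id)
WHERE split_part(src.json, '},{', pvt.element_id + 1) <> '';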

Related

Optimal SQL query for joining two tables ordered with regards to join logic

I am looking for help with optimizing my BigQuery query.
I have 2 tables:
1) ips with one column: ip, which is an IP-like string, e.g. 192.168.4.6. It has m rows.
----------------
| ip           |
----------------
| 73.14.170.37 |
----------------
| 5.14.121.34  |
----------------
| 61.22.122.67 |
----------------
2) ranges with one column: range, which is a CIDR-like string, e.g. 192.168.128.0/28. It has n rows. Ranges never overlap.
-------------------
| range           |
-------------------
| 42.126.124.0/24 |
-------------------
| 2.36.127.0/24   |
-------------------
| 12.59.78.0/23   |
-------------------
m = around 100M
n = around 65K
So both are large and m >> n
My goal is to find all IPs from the ips table that belong to any range in the ranges table.
My approach:
I created 2 intermediary tables:
1) ip_num with one distinct numeric (INT64) ip column computed from the ips table, ordered by ip.
------------
| ip_num   |
------------
| 16753467 |
------------
| 16753469 |
------------
| 16753474 |
------------
2) ranges_num with 2 columns: start_ip and end_ip (both INT64), calculated from the CIDR ranges. This table is ordered by start_ip.
-----------------------
| start_ip | end_ip   |
-----------------------
| 16753312 | 16753316 |
-----------------------
| 16753569 | 16753678 |
-----------------------
| 16763674 | 16763688 |
-----------------------
Both tables use a numeric format because I hoped that comparing numbers would perform better. The conversion was done with NET.IPV4_TO_INT64 and NET.IP_FROM_STRING.
Producing these 2 tables is reasonably fast.
Now my last step is join on these 2 tables:
select ip from ip_num JOIN ranges_num ON ip BETWEEN start_ip AND end_ip;
This last query takes a lot of time (around 30 minutes) and then produces no results; instead I get a "Query exceeded resource limits" error, probably because it takes too long.
So my questions here are:
Can I make it faster?
Is my intuition correct that the last query generates n * m join candidates and the query optimizer fails to leverage the ordering of ranges_num, effectively giving O(n*m) complexity?
If I had these 2 structures in the memory of a program rather than in a relational DB, it would be relatively simple to write an O(m+n) algorithm with 2 iterators, one on each table.
But I don't know how to express that in standard SQL (if it's possible at all), and maybe the built-in query optimizer should be able to derive such an algorithm automatically.
Are there any tools for BigQuery (or any other DBMS) that would help me understand the query optimizer?
I am not an SQL expert and I am new to BigQuery, so I'll be grateful for any tips.
Thanks to @rtenha, I followed the link to an almost identical problem with a solution: Efficiently joining IP ranges in BigQuery
I am attaching a more complete query, derived from that one, which ran for me in under 10 seconds:
(SELECT ip FROM `ips` i
 JOIN `ranges` a
   ON NET.IP_TRUNC(a.start_ip, 16) = NET.IP_TRUNC(NET.SAFE_IP_FROM_STRING(i.ip), 16)
 WHERE NET.SAFE_IP_FROM_STRING(i.ip) BETWEEN a.start_ip AND a.end_ip
   AND mask >= 16)
UNION ALL
(SELECT ip FROM `ips` i
 JOIN `ranges` a
   ON NET.IP_TRUNC(a.start_ip, 8) = NET.IP_TRUNC(NET.SAFE_IP_FROM_STRING(i.ip), 8)
 WHERE NET.SAFE_IP_FROM_STRING(i.ip) BETWEEN a.start_ip AND a.end_ip
   AND mask BETWEEN 8 AND 15)
UNION ALL
(SELECT ip FROM `ips` i
 JOIN `ranges` a
   ON NET.SAFE_IP_FROM_STRING(i.ip) BETWEEN a.start_ip AND a.end_ip
  AND mask < 8)
Here I split the ranges into 3 sections: those with network prefixes of 16 bits or longer, those between 8 and 15 bits, and those below 8 bits.
For each section I then applied an extra prefix comparison, which performs much better; it filters the data effectively, so the second comparison (BETWEEN) is executed on much smaller sets, which is quick.
The last section does not use prefix matching because it targets the shortest network prefixes.
After this, I combine all the sections with UNION ALL.
One of the reasons this works is that 99% of my network prefixes were 16 bits or longer. The shorter network prefixes took longer to process individually, but that was compensated by the fact that there were very few of them, and further mitigated by breaking the short-network-prefix section (shorter than 16) into two smaller sections. It might be possible to optimize further by breaking the data into even more fine-grained sections (e.g. 32 sections, one for each possible mask length); however, I was satisfied with my results anyway.
I did not analyze whether INT or BYTES was the more optimal data type for processing, but having intermediate tables did not bring a noticeable performance improvement.
According to this article, you have ordered the tables right.
However, I've sometimes experienced that BigQuery underestimates the needed computation capacity (slots), which makes the query slow or throws a quota-exceeded error, as in your case.
For me it helped to switch the order of the tables and put the smaller one on the left side of the join.
E.g.:
select ip from ranges_num JOIN ip_num ON ip BETWEEN start_ip AND end_ip;

django database design when you will have too many rows

I have a Django web app with a Postgres DB; the general operation is that every day I have an array of values that needs to be stored in one of the tables.
There is no foreseeable need to query the individual values of the array, but I need to be able to plot the values for a specific day.
The problem is that this array is pretty big: if I were to store each value as its own row, I'd have 60 million rows per year, but if I store each array as a blob object, I'd have 60 thousand rows per year.
Is it a good decision to use a blob object to reduce the table size when you do not want to query within the row of values?
Here are the two options:
option 1: keeping all
group(foreignkey) | parent(foreignkey) | pos(int) | length(int)
A                 | B                  | 232      | 45
A                 | B                  | 233      | 45
A                 | B                  | 234      | 45
A                 | B                  | 233      | 46
...
option 2: collapsing the array into a blob:
group(fk) | parent(fk) | mean_len(float) | values(blob)
A         | B          | 45              | [(pos=232, len=45), ...]
...
So I do NOT want to query pos or length, but I do want to query group or parent.
An example of read query that I'm talking about is:
SELECT * FROM "mytable"
LEFT OUTER JOIN "group"
ON ( "group"."id" = "grouptable"."id" )
ORDER BY "pos" DESC LIMIT 100
which is a typical django admin list_view page main query.
I tried loading the data and displaying the table in the Django admin page without doing any complex query (just a read query).
When I get past 1.5 million rows, the admin page freezes. All it takes is a count query on that table to cause the app to crash, so I should definitely either keep the data as a blob or not keep it in the db at all and use the filesystem instead.
I want to emphasize that I used Django 1.8 as my test bench, so this is not a Postgres evaluation but rather a system evaluation of the Django admin together with Postgres.

How to sort string data that represents numbers

My client has a set of numeric data stored in a string field in a database. So of course it doesn't sort correctly. These rows sort like this:
105
3
44
When they should sort like this:
3
44
105
This is very much a legacy database and I can't change it at all. I also can't change the software that uses the database. The client doesn't own it or have the source code. It has never worked the way they want. However, there is an unused string field that I could use to sort on (only a small number of fields can be sorted on.)
What I would like to do is take the input data, derive a string from it, and store the new string in the unused field, such that when the data is sorted on this new data, the original data sorts correctly, i.e., numerically.
So, for an overly simplistic example, if the algorithm produced the following new data:
105 -> c
3 -> a
44 -> b
Then when the second column was sorted, the first column would look 'correct'.
The tricky bit is that when new rows are added to the database, they must also sort correctly, without having to regenerate the sort data for all rows. This is the part of the problem that has my brain in a twist. I'm not sure it's actually possible.
You can assume that the number will never be more than 5 'digits'.
I realize this is a total kludge, but since I can't change the system, I have to find a work around, rather than a quality solution. Welcome to the real world.
~~~~~~~~~~~~~~~~~~~~~~ S O L U T I O N ~~~~~~~~~~~~~~~~~~
I don't think this is an uncommon problem, so here are the results of Gordon's solution:
mysql> select * from t order by new;
+------+------------+
| orig | new        |
+------+------------+
| 3    | 0000000003 |
| 44   | 0000000044 |
| 105  | 0000000105 |
+------+------------+
In most databases, you can just do:
order by cast(col as int)
This will convert the string representation to a number and use that for ordering. There is no need for an additional column. If you add one, I would recommend adding a numeric column to contain the actual value.
If you really want to store something in the unused field, then you can left pad the number. How to do this depends on the database, but here is one typical method:
update t
set unused = right(concat('0000000000', col), 10);
Not all databases support these two specific functions, but all offer this basic functionality in some method.
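For instance, in dialects that provide LPAD (MySQL, Postgres, Oracle, and others), the same left-padding can be written as:
update t
set unused = lpad(col, 10, '0');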
Try something like
SELECT column1 FROM table1 ORDER BY LENGTH(column1) ASC, column1 ASC
(Adjust the column and table name for your environment.)
This is a bit of a hack but works as long as the "numbers" in your string column are natural, non-negative numbers only.
If you are looking for a more sophisticated approach or algorithm, try searching for natural sort together with your DBMS.

Poor performance on Amazon Redshift queries based on VARCHAR size

I'm building an Amazon Redshift data warehouse, and experiencing unexpected performance impacts based on the defined size of the VARCHAR column. Details are as follows. Three of my columns are shown from pg_table_def:
 schemaname | tablename | column        | type                        | encoding | distkey | sortkey | notnull
------------+-----------+---------------+-----------------------------+----------+---------+---------+---------
 public     | logs      | log_timestamp | timestamp without time zone | delta32k | f       | 1       | t
 public     | logs      | event         | character varying(256)      | lzo      | f       | 0       | f
 public     | logs      | message       | character varying(65535)    | lzo      | f       | 0       | f
I've recently run Vacuum and Analyze, I have about 100 million rows in the database, and I'm seeing very different performance depending on which columns I include.
Query 1:
For instance, the following query takes about 3 seconds:
select log_timestamp from logs order by log_timestamp desc limit 5;
Query 2:
A similar query asking for more data runs in 8 seconds:
select log_timestamp, event from logs order by log_timestamp desc limit 5;
Query 3:
However, this query, very similar to the previous, takes 8 minutes to run!
select log_timestamp, message from logs order by log_timestamp desc limit 5;
Query 4:
Finally, this query, identical to the slow one but with explicit range limits, is very fast (~3s):
select log_timestamp, message from logs where log_timestamp > '2014-06-18' order by log_timestamp desc limit 5;
The message column is defined to be able to hold larger messages, but in practice it doesn't hold much data: the average length of the message field is 16 characters (std_dev 10). The average length of the event field is 5 characters (std_dev 2). The only distinction I can really see is the max length of the VARCHAR field, but I wouldn't think that should have an order-of-magnitude effect on the time a simple query takes to return!
Any insight would be appreciated. While this isn't the typical use case for this tool (we'll be aggregating far more than we'll be inspecting individual logs), I'd like to understand any subtle or not-so-subtle effects of my table design.
Thanks!
Dave
Redshift is a "true columnar" database and only reads the columns that are specified in your query. So, when you specify the 2 small columns, only those 2 columns have to be read at all. However, when you add in the 3rd, large column, the work that Redshift has to do increases dramatically.
This is very different from a "row store" database (SQL Server, MySQL, Postgres, etc.), where the entire row is stored together. In a row store, adding or removing query columns does not make much difference in response time, because the database has to read the whole row anyway.
Finally, the reason your last query is very fast is that you've told Redshift it can skip a large portion of the data. Redshift stores each column in "blocks", and these blocks are sorted according to the sort key you specified. Redshift keeps a record of the min/max of each block and can skip over any blocks that could not contain data to be returned.
The LIMIT clause doesn't reduce the work that has to be done, because you've told Redshift that it must first order everything by log_timestamp descending. The problem is that the ORDER BY … DESC has to be executed over the entire potential result set before any data can be returned or discarded. When the columns are small that's fast; when they're big it's slow.
Out of curiosity, how long does this take?
select log_timestamp, message
from logs l join
     (select min(log_timestamp) as log_timestamp
      from (select log_timestamp
            from logs
            order by log_timestamp desc
            limit 5
           ) lt
     ) lt
     on l.log_timestamp >= lt.log_timestamp;

Pulling items out of a DB with weighted chance

Let's say I had a table full of records that I wanted to pull random records from. However, I want certain rows in that table to appear more often than others (and which ones vary by user). What's the best way to go about this, using SQL?
The only way I can think of is to create a temporary table, fill it with the rows I want to be more common, and then pad it with other randomly selected rows from the table. Is there a better way?
One way I can think of is to add another column to the table that holds a rolling sum of your weights. Then pull records by generating a random number between 0 and the total of all your weights, and returning the first row whose rolling sum is greater than that random number.
For example, if you had four rows with the following weights:
+-----+--------+------------+
| row | weight | rollingsum |
+-----+--------+------------+
| a   |      3 |          3 |
| b   |      3 |          6 |
| c   |      4 |         10 |
| d   |      1 |         11 |
+-----+--------+------------+
Then choose a random number n with 0 <= n < 11, and return row a if 0 <= n < 3, row b if 3 <= n < 6, and so on.
Here are some links on generating rolling sums:
http://dev.mysql.com/tech-resources/articles/rolling_sums_in_mysql.html
http://dev.mysql.com/tech-resources/articles/rolling_sums_in_mysql_followup.html
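As a rough illustration of the selection step (a sketch only; items(row_id, weight) is a hypothetical table, window functions require MySQL 8+ or Postgres, and :threshold is a bind parameter holding a random number in [0, total_weight) generated by the application):
SELECT row_id
FROM (
    SELECT row_id,
           SUM(weight) OVER (ORDER BY row_id) AS rolling_sum   -- rolling sum of weights
    FROM items
) w
WHERE w.rolling_sum > :threshold
ORDER BY w.rolling_sum
LIMIT 1;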
I don't know that it can be done very easily with SQL alone. With T-SQL or similar, you could write a loop to duplicate rows, or you can use the SQL to generate the instructions for doing the row duplication instead.
I don't know your probability model, but you could use an approach like this to achieve the latter. Given these table definitions:
RowSource
---------
RowID
UserRowProbability
------------------
UserId
RowId
FrequencyMultiplier
You could write a query like this (SQL Server specific):
SELECT TOP 100 rs.RowId, urp.FrequencyMultiplier
FROM RowSource rs
LEFT JOIN UserRowProbability urp ON rs.RowId = urp.RowId
ORDER BY ISNULL(urp.FrequencyMultiplier, 1) DESC, NEWID()
This would take care of selecting a random set of rows as well as how many should be repeated. Then, in your application logic, you could do the row duplication and shuffle the results.
Start with 3 tables: users, data, and user-data. The user-data table records which rows should be preferred for each user.
Then create one view based on the data rows that are preferred by the user.
Create a second view that has the non-preferred data.
Create a third view which is a union of the first 2. The union should select more rows from the preferred data.
Then finally select random rows from the third view.
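A hedged sketch of that idea (the table and view names are hypothetical, "user-data" is written here as user_data, and the per-user filtering is only indicated in a comment):
CREATE VIEW preferred_data AS
    SELECT d.*
    FROM data d
    JOIN user_data ud ON ud.data_id = d.id;      -- add a filter on ud.user_id as needed

CREATE VIEW other_data AS
    SELECT d.*
    FROM data d
    WHERE NOT EXISTS (SELECT 1 FROM user_data ud WHERE ud.data_id = d.id);

-- Repeat the preferred rows in the union so they are drawn more often
CREATE VIEW weighted_data AS
    SELECT * FROM preferred_data
    UNION ALL
    SELECT * FROM preferred_data
    UNION ALL
    SELECT * FROM other_data;

-- Then pick random rows (RAND() in MySQL, RANDOM() in Postgres):
-- SELECT * FROM weighted_data ORDER BY RAND() LIMIT 10;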