Unnesting a json in Redshift causing nested loop in the query plan - sql

I have a column in my tables called 'data' with JSONs in it like below:
{"tt":"452.95","records":[{"r":"IN184366","t":"812812819910","s":"129.37","d":"982.7","c":"83"},{"r":"IN183714","t":"8028028029093","s":"33.9","d":"892","c":"38"}]}
I have written a code to unnest it into separate columns like tr,r,s.
Below is the code
with raw as (
SELECT json_extract_path_text(B.Data, 'records', true) as items
FROM tableB as B where B.date::timestamp between
to_timestamp('2019-01-01 00:00:00','YYYY-MM-DD HH24:MA:SS') AND
to_timestamp('2022-12-31 23:59:59','YYYY-MM-DD HH24:MA:SS')
UNION ALL
SELECT json_extract_path_text(C.Data, 'records', true) as items
FROM tableC as C where C.date-5 between
to_timestamp('2019-01-01 00:00:00','YYYY-MM-DD HH24:MA:SS') AND
to_timestamp('2022-12-31 23:59:59','YYYY-MM-DD HH24:MA:SS')
),
numbers as (
SELECT ROW_NUMBER() OVER (ORDER BY TRUE)::integer- 1 as ordinal
FROM <any_random_table> limit 1000
),
joined as (
select raw.*,
json_array_length(orders.items, true) as number_of_items,
json_extract_array_element_text(
raw.items,
numbers.ordinal::int,
true
) as item
from raw
cross join numbers
where numbers.ordinal <
json_array_length(raw.items, true)
),
parsed as (
SELECT J.*,
json_extract_path_text(J.item, 'tr',true) as tr,
json_extract_path_text(J.item, 'r',true) as r,
json_extract_path_text(J.item, 's',true)::float8 as s
from joined J
)
select * from parsed
The above code is working when there are small number of records but this taking more than a day to run and CPU utilization (in redshift) is reaching 100 % and even the disk space used also reaching 100% if I am putting date between last two years etc.. or if the number of records is large.
Can anyone please suggest any alternative way to unnnest JSON objects like above in redshift.
My query plan is saying:
Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products
Goal: To Unnest without using any cross joins
Input: data column having JSON
"tt":"452.95","records":[{"r":"IN184366","t":"812812819910","s":"129.37","d":"982.7","c":"83"},{"r":"IN183714","t":"8028028029093","s":"33.9","d":"892","c":"38"}]}
Output should be for example
tr,r,s columns from the above json

You want to unnest json records of up to 1000 stored in a json array but nested loop join is taking too long.
The root issues is likely your data model. You have stored structured records (called "records"), inside a semi-structure text element (json), within a column of a structured columnar database. You want to perform some operation on these buried records that you haven't described but here's the problem. Columnar databases are optimized for performing read-centric analytic queries but you need to expand these json internal records into Redshift rows (records) which is fundamentally a write operation. This is working against the optimizations of the database.
The size of this expanding data is also large as compared to your disk storage on your cluster which is why the disks are filling up. You CPUs are likely spinning unpacking the jsons and managing overloaded disk and memory capacity. At the edge of filling up disks Redshift shifts to a mode that optimizes disk space utilization at the expense of execution speed. A larger cluster may give you a significantly faster execution if you can avoid this effect but that will cost money you may not have budgeted. Not an ideal solution.
One area that would improve speed of your query is not carrying all the data along. You keep raw.* and J.* all through the query but it is not clear you need these. Since part of the issue is data size during execution and that this execution includes loop joining, you are making the execution much harder that it needs to be by carrying all this data (including the original jsons).
The best way out of this situation is to change your data model and expand these json internal records into Redshift records on ingestion. Json data is fine for seldom used information or information that is only needed at the end of a query where the data is small. Needing the expanded json at the input end of the query for such a large amount of data is not good use case for json in Redshift. Each of these "records" inside of the json are records and need to be stored as such if you need to work across them as query input.
Now you want to know if there is some slick way to get around this issue in your case and the answer is "unlikely but maybe". Can you describe how you are using the final values in your query (t, r, and s)? If you are just using some aspect of this data (max value or sum or ...) then there may be a way to get to the answer without the large nested loop join. But if you need all the values then there is no other way to get these AFAIK. A description of what comes next in the data process could open up such an opportunity.

Related

GCP: select query with unnest from array has very big process data to run compared to hardcoded values

In bigQuery GCP, I am trying to grab some data in a table where the date is the same as a date in a list of values I have got. If I hardcode the list of values in the select it is vastly cheaper in process to run than if I use a temp structure like an array...
Is there a way to use the temp structure but avoid the enormous processing cost ?
Why is it so expensive for something small simple like this.
please see below examples:
**-----1/ array structure example: this query process's 144.8 GB----------**
WITH
get_a as (
SELECT
GENERATE_DATE_ARRAY('2000-01-01','2000-01-02') as array_of_dates
)
SELECT
a.heading as title
a.ingest_time as proc_date
FROM
'veiw_a.events' as a
get_a as b
UNNEST(b.array_of_dates) as c
WHERE
c in (CAST(a.ingest_time AS DATE)
)
**------2/ hardcoded example: this query processes 936.5 MB over 154 X's less ? --------**
SELECT
a.heading as title
a.ingest_time as proc_date
FROM
'veiw_a.events' as a
WHERE
(CAST(a.ingest_time as DATE)) IN ('2000-01-01','2000-01-02')
Presumably, your view_a.events table is partitioned by the ingest_time.
The issue is that partition pruning is very conservative (buggy?). With the direct comparisons, BigQuery is smart enough to recognize exactly which partitions are used for the query. But with the generated version, BigQuery is not able to figure this out, so the entire table needs to be read.

Query times out after 6 hours, how to optimize it?

I have two tables, shapes and squares, that I'm joining based on intersections of GEOGRAHPY columns.
The shapes table contains travel routes for vehicles:
shape_key STRING identifier for the shape
shape_lines ARRAY<GEOGRAPHY> consecutive line segments making up the shape
shape_geography GEOGRAPHY the union of all shape_lines
shape_length_km FLOAT64 length of the shape in kilometers
Rows: 65k
Size: 718 MB
We keep shape_lines separated out in an ARRAY because shapes sometimes double back on themselves, and we want to keep those line segments separate instead of deduplicating them.
The squares table contains a grid of 1×1 km squares:
square_key INT64 identifier of the grid square
square_geography GEOGRAPHY four-cornered polygon describing the grid square
Rows: 102k
Size: 15 MB
The shapes represent travel routes for vehicles. For each shape, we have computed emissions of harmful substances in a separate table. The aim is to calculate the emissions per grid square, assuming that they are evenly distributed along the route. To that end, we need to know what portion of the route shape intersects with each grid cell.
Here's the query to compute that:
SELECT
shape_key,
square_key,
SAFE_DIVIDE(
(
SELECT SUM(ST_LENGTH(ST_INTERSECTION(line, square_geography))) / 1000
FROM UNNEST(shape_lines) AS line
),
shape_length_km)
AS square_portion
FROM
shapes,
squares
WHERE
ST_INTERSECTS(shape_geography, square_geography)
Sadly, this query times out after 6 hours instead of producing a useful result.
In the worst case, the query can produce 6.6 billion rows, but that will not happen in practice. I estimate that each shape typically intersects maybe 50 grid squares, so the output should be around 65k * 50 = 3.3M rows; nothing that BigQuery shouldn't be able to handle.
I have considered the geographic join optimizations performed by BigQuery:
Spatial JOINs are joins of two tables with a predicate geographic function in the WHERE clause.
Check. I even rewrote my INNER JOIN to the equivalent "comma" join shown above.
Spatial joins perform better when your geography data is persisted.
Check. Both shape_geography and square_geography come straight from existing tables.
BigQuery implements optimized spatial JOINs for INNER JOIN and CROSS JOIN operators with the following standard SQL predicate functions: [...] ST_Intersects
Check. Just a single ST_Intersect call, no other conditions.
Spatial joins are not optimized: for LEFT, RIGHT or FULL OUTER joins; in cases involving ANTI joins; when the spatial predicate is negated.
Check. None of these cases apply.
So I think BigQuery should be able to optimize this join using whatever spatial indexing data structures it uses.
I have also considered the advice about cross joins:
Avoid joins that generate more outputs than inputs.
This query definitely generates more outputs than inputs; that's in its nature and cannot be avoided.
When a CROSS JOIN is required, pre-aggregate your data.
To avoid performance issues associated with joins that generate more outputs than inputs:
Use a GROUP BY clause to pre-aggregate the data.
Check. I already pre-aggregated the emissions data grouped by shapes, so that each shape in the shapes table is unique and distinct.
Use a window function. Window functions are often more efficient than using a cross join. For more information, see analytic functions.
I don't think it's possible to use a window function for this query.
I suspect that BigQuery allocates resources based on the number of input rows, not on the size of the intermediate tables or output. That would explain the pathological behaviour I'm seeing.
How can I make this query run in reasonable time?
I think the squares got inverted, resulting in almost-full Earth polygons:
select st_area(square_geography), * from `open-transport-data.public.squares`
Prints results like 5.1E14 - which is full globe area. So any line intersects almost all the squares. See BigQuery doc for details : https://cloud.google.com/bigquery/docs/gis-data#polygon_orientation
You can invert them by running ST_GeogFromText(wkt, FALSE) - which chooses smaller polygon, ignoring polygon orientation, this works reasonably fast:
SELECT
shape_key,
square_key,
SAFE_DIVIDE(
(
SELECT SUM(ST_LENGTH(ST_INTERSECTION(line, square_geography))) / 1000
FROM UNNEST(shape_lines) AS line
),
shape_length_km)
AS square_portion
FROM
`open-transport-data.public.shapes`,
(select
square_key,
st_geogfromtext(st_astext(square_geography), FALSE) as square_geography,
from `open-transport-data.public.squares`) squares
WHERE
ST_INTERSECTS(shape_geography, square_geography)
Below would definitely not fit the comments format so I have to post this as an answer ...
I did three adjustment to your query
using JOIN ... ON instead of CROSS JOIN ... WHERE
commenting out square_portion calculation
using destination table with Allow Large Results option
Even though you expected just 3.3 M rows in output - in reality it is about 6.6 B ( 6,591,549,944) rows - you can see result of my experiment below
Note warning about Billing Tier - so you better use Reservations if available
Obviously, un-commenting square_portion calculation will increase Slots usage - so, you might potentially need to revisit your requirements/expectations

Why does select result fields double data scanned in BigQuery

I have a table with 2 integer fields x,y and few millions of rows.
The fields are created with the following code:
Field.newBuilder("x", LegacySQLTypeName.INTEGER).setMode(Field.Mode.NULLABLE).build();
If I run the following from the web:
SELECT x,y FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: "Valid: This query will process 64.9 MB when run."
compared to:
SELECT x FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: " Valid: This query will process 32.4 MB when run."
It scans more than double of the original data scanned.
I would expect it will first find the relevant rows based on where clause and then bring the extra field without scanning the entire second field.
Any inputs on why it doubles the data scanned and how to avoid it will be appreciated.
In my application I have hundred of possible fields which I need to fetch for a very small number of rows (50) which answer the query.
Does this means I will need to processed all fields data?
* I'm aware how columnar database works, but wasn't aware for the huge price when you want to brings lots of fields based on a very specific where clause.
The following link provide very clear answer:
best-practices-performance-input
BigQuery does not have a concept of index or something like that. When you query a field column, BigQuery will scan through all the values of that column and then make the operations you want (for a deeper deep understanding they have some pretty cool posts about the inner workings of BQ).
That means that when you select x and y where x = 1, BQ will read through all values of x and y and then find where x = 1.
This ends up being an amazing feature of BQ, you just load your data there and it just works. It does force you to be aware on how much data you retrieve from each query. Queries of the type select * from table should be used only if you really need all columns.

Access - Match Column Range / Complex Join

How do we match columns based on condition of closeness to value.
This requires Complex Query / Range Comparison / Multiple Joins conditions.
Getting Query size exceeded 2GB error.
Tables :
InvDetails1 / InvDetails2 / INVDL / ExpectedResult
Field Relation :
InvDetails1.F1 = InDetails2.F3
InvDetails2.F5 = INVDL.F1
INVDL.DLID = ExpectedResult.DLID
ExpectedResult.Total - 1 < InvDetails1.F6 < ExpectedResult.Total + 1
left(InvDetails1.F21,10) = '2013-03-07'
Return Results where Number of records from ExpectedResult is only 1.
Group by InvDetails1.F1 , count(ExpectedResult.DLID) works.
From this result.
Final Result :
InvDetails1.F1 , InvDetails1.F16 , ExpectedResult.DLID , ExpectedResult.NMR
ExpectedResult - has millions of rows.
InvDetails - few hundred thousands
If I was in that situation and was finding that my query was "hitting a wall" at 2GB then one thing I would try would be to create a separate saved Select Query to isolate the InvDetails1 records just for the specific date in question. Then I would use that query instead of the full InvDetails1 table when joining to the other tables.
My reasoning is that the query optimizer may not be able to use your left(InvDetails1.F21,10) = '2013-03-07' condition to exclude InvDetails1 records early in the execution plan, possibly causing the query to grow much larger than it really needs to (internally, while it is being processed). Forcing the date selection to the beginning of the process by putting it in a separate (prerequisite) query may keep the size of the "main" query down to a more feasible size.
Also, if I found myself in the situation where my queries were getting that big I would also keep a watchful eye on the size of my .accdb (or .mdb) file to ensure that it does not get too close to 2GB. I've never had it happen myself, but I've heard that database files that hit the 2GB barrier can result in some nasty errors and be rather "challenging" to recover.

Poor DB Performance when using ORDER BY

I'm working with a non-profit that is mapping out solar potential in the US. Needless to say, we have a ridiculously large PostgreSQL 9 database. Running a query like the one shown below is speedy until the order by line is uncommented, in which case the same query takes forever to run (185 ms without sorting compared to 25 minutes with). What steps should be taken to ensure this and other queries run in a more manageable and reasonable amount of time?
select A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
from global_site A cross join na_utility_line B
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
--order by A.area_acre
offset 0 limit 11;
The sort is not the problem - in fact the CPU and memory cost of the sort is close to zero since Postgres has Top-N sort where the result set is scanned while keeping up to date a small sort buffer holding only the Top-N rows.
select count(*) from (1 million row table) -- 0.17 s
select * from (1 million row table) order by x limit 10; -- 0.18 s
select * from (1 million row table) order by x; -- 1.80 s
So you see the Top-10 sorting only adds 10 ms to a dumb fast count(*) versus a lot longer for a real sort. That's a very neat feature, I use it a lot.
OK now without EXPLAIN ANALYZE it's impossible to be sure, but my feeling is that the real problem is the cross join. Basically you're filtering the rows in both tables using :
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
OK. I don't know how many rows are selected in both tables (only EXPLAIN ANALYZE would tell), but it's probably significant. Knowing those numbers would help.
Then we got the worst case CROSS JOIN condition ever :
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
This means all rows of A are matched against all rows of B (so, this expression is going to be evaluated a large number of times), using a bunch of pretty complex, slow, and cpu-intensive functions.
Of course it's horribly slow !
When you remove the ORDER BY, postgres just comes up (by chance ?) with a bunch of matching rows right at the start, outputs those, and stops since the LIMIT is reached.
Here's a little example :
Tables a and b are identical and contain 1000 rows, and a column of type BOX.
select * from a cross join b where (a.b && b.b) --- 0.28 s
Here 1000000 box overlap (operator &&) tests are completed in 0.28s. The test data set is generated so that the result set contains only 1000 rows.
create index a_b on a using gist(b);
create index b_b on a using gist(b);
select * from a cross join b where (a.b && b.b) --- 0.01 s
Here the index is used to optimize the cross join, and speed is ridiculous.
You need to optimize that geometry matching.
add columns which will cache :
ST_Centroid(A.wkb_geometry)
ST_Buffer((B.wkb_geometry), 1000)
There is NO POINT in recomputing those slow functions a million times during your CROSS JOIN, so store the results in a column. Use a trigger to keep them up to date.
add columns of type BOX which will cache :
Bounding Box of ST_Centroid(A.wkb_geometry)
Bounding Box of ST_Buffer((B.wkb_geometry), 1000)
add gist indexes on the BOXes
add a Box overlap test (using the && operator) which will use the index
keep your ST_Within which will act as a final filter on the rows that pass
Maybe you can just index the ST_Centroid and ST_Buffer columns... and use an (indexed) "contains" operator, see here :
http://www.postgresql.org/docs/8.2/static/functions-geometry.html
I would suggest creating an index on area_acre. You may want to take a look at the following: http://www.postgresql.org/docs/9.0/static/sql-createindex.html
I would recommend doing this sort of thing off of peak hours though because this can be somewhat intensive with a large amount of data. One thing you will have to look at as well with indexes is rebuilding them on a schedule to ensure performance over time. Again this schedule should be outside of peak hours.
You may want to take a look at this article from a fellow SO'er and his experience with database slowdowns over time with indexes: Why does PostgresQL query performance drop over time, but restored when rebuilding index
If the A.area_acre field is not indexed that may slow it down. You can run the query with EXPLAIN to see what it is doing during execution.
First off I would look at creating indexes , ensure your db is being vacuumed, increase the shared buffers for your db install, work_mem settings.
First thing to look at is whether you have an index on the field you're ordering by. If not, adding one will dramatically improve performance. I don't know postgresql that well but something similar to:
CREATE INDEX area_acre ON global_site(area_acre)
As noted in other replies, the indexing process is intensive when working with a large data set, so do this during off-peak.
I am not familiar with the PostgreSQL optimizations, but it sounds like what is happening when the query is run with the ORDER BY clause is that the entire result set is created, then it is sorted, and then the top 11 rows are taken from that sorted result. Without the ORDER BY, the query engine can just generate the first 11 rows in whatever order it pleases and then it's done.
Having an index on the area_acre field very possibly may not help for the sorting (ORDER BY) depending on how the result set is built. It could, in theory, be used to generate the result set by traversing the global_site table using an index on area_acre; in that case, the results would be generated in the desired order (and it could stop after generating 11 rows in the result). If it does not generate the results in that order (and it seems like it may not be), then that index will not help in sorting the results.
One thing you might try is to remove the "CROSS JOIN" from the query. I doubt that this will make a difference, but it's worth a test. Because a WHERE clause is involved joining the two tables (via ST_WITHIN), I believe the result is the same as an inner join. It is possible that the use of the CROSS JOIN syntax is causing the optimizer to make an undesirable choice.
Otherwise (aside from making sure indexes exist for fields that are being filtered), you could play a bit of a guessing game with the query. One condition that stands out is the area_acre >= 500. This means that the query engine is considering all rows that meet that condition. But then only the first 11 rows are taken. You could try changing it to area_acre >= 500 and area_acre <= somevalue. The somevalue is the guessing part that would need adjustment to make sure you get at least 11 rows. This, however, seems like a pretty cheesy thing to do, so I mention it with some reticence.
Have you considered creating Expression based indexes for the benefit of the hairier joins and where conditions?