Query times out after 6 hours, how to optimize it? - google-bigquery

I have two tables, shapes and squares, that I'm joining based on intersections of GEOGRAPHY columns.
The shapes table contains travel routes for vehicles:
shape_key STRING: identifier for the shape
shape_lines ARRAY<GEOGRAPHY>: consecutive line segments making up the shape
shape_geography GEOGRAPHY: the union of all shape_lines
shape_length_km FLOAT64: length of the shape in kilometers
Rows: 65k
Size: 718 MB
We keep shape_lines separated out in an ARRAY because shapes sometimes double back on themselves, and we want to keep those line segments separate instead of deduplicating them.
The squares table contains a grid of 1×1 km squares:
square_key INT64: identifier of the grid square
square_geography GEOGRAPHY: four-cornered polygon describing the grid square
Rows: 102k
Size: 15 MB
For each shape, we have computed emissions of harmful substances in a separate table. The aim is to calculate the emissions per grid square, assuming that they are evenly distributed along the route. To that end, we need to know what portion of each route shape intersects each grid cell.
Here's the query to compute that:
SELECT
  shape_key,
  square_key,
  SAFE_DIVIDE(
    (
      SELECT SUM(ST_LENGTH(ST_INTERSECTION(line, square_geography))) / 1000
      FROM UNNEST(shape_lines) AS line
    ),
    shape_length_km
  ) AS square_portion
FROM
  shapes,
  squares
WHERE
  ST_INTERSECTS(shape_geography, square_geography)
Sadly, this query times out after 6 hours instead of producing a useful result.
In the worst case, the query can produce 6.6 billion rows, but that will not happen in practice. I estimate that each shape typically intersects maybe 50 grid squares, so the output should be around 65k * 50 = 3.3M rows; nothing that BigQuery shouldn't be able to handle.
I have considered the geographic join optimizations performed by BigQuery:
Spatial JOINs are joins of two tables with a predicate geographic function in the WHERE clause.
Check. I even rewrote my INNER JOIN to the equivalent "comma" join shown above.
Spatial joins perform better when your geography data is persisted.
Check. Both shape_geography and square_geography come straight from existing tables.
BigQuery implements optimized spatial JOINs for INNER JOIN and CROSS JOIN operators with the following standard SQL predicate functions: [...] ST_Intersects
Check. Just a single ST_INTERSECTS call, no other conditions.
Spatial joins are not optimized: for LEFT, RIGHT or FULL OUTER joins; in cases involving ANTI joins; when the spatial predicate is negated.
Check. None of these cases apply.
So I think BigQuery should be able to optimize this join using whatever spatial indexing data structures it uses.
I have also considered the advice about cross joins:
Avoid joins that generate more outputs than inputs.
This query definitely generates more outputs than inputs; that's in its nature and cannot be avoided.
When a CROSS JOIN is required, pre-aggregate your data.
To avoid performance issues associated with joins that generate more outputs than inputs:
Use a GROUP BY clause to pre-aggregate the data.
Check. I already pre-aggregated the emissions data grouped by shapes, so that each shape in the shapes table is unique and distinct.
Use a window function. Window functions are often more efficient than using a cross join. For more information, see analytic functions.
I don't think it's possible to use a window function for this query.
I suspect that BigQuery allocates resources based on the number of input rows, not on the size of the intermediate tables or output. That would explain the pathological behaviour I'm seeing.
How can I make this query run in reasonable time?

I think the squares got inverted, resulting in almost-full Earth polygons:
select st_area(square_geography), * from `open-transport-data.public.squares`
This prints results like 5.1E14, which is roughly the area of the full globe, so any line intersects almost all the squares. See the BigQuery docs for details: https://cloud.google.com/bigquery/docs/gis-data#polygon_orientation
You can fix the squares by running ST_GeogFromText(wkt, FALSE), which ignores polygon orientation and chooses the smaller polygon. This works reasonably fast:
SELECT
  shape_key,
  square_key,
  SAFE_DIVIDE(
    (
      SELECT SUM(ST_LENGTH(ST_INTERSECTION(line, square_geography))) / 1000
      FROM UNNEST(shape_lines) AS line
    ),
    shape_length_km
  ) AS square_portion
FROM
  `open-transport-data.public.shapes`,
  (
    select
      square_key,
      st_geogfromtext(st_astext(square_geography), FALSE) as square_geography
    from `open-transport-data.public.squares`
  ) squares
WHERE
  ST_INTERSECTS(shape_geography, square_geography)

This would definitely not fit the comment format, so I have to post it as an answer ...
I made three adjustments to your query (sketched below):
using JOIN ... ON instead of CROSS JOIN ... WHERE
commenting out the square_portion calculation
using a destination table with the Allow Large Results option
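A rough sketch of what the first two adjustments might look like (using the same tables as in the question; the square_portion subquery is commented out):

SELECT
  shape_key,
  square_key
  -- , SAFE_DIVIDE(
  --     (
  --       SELECT SUM(ST_LENGTH(ST_INTERSECTION(line, square_geography))) / 1000
  --       FROM UNNEST(shape_lines) AS line
  --     ),
  --     shape_length_km
  --   ) AS square_portion
FROM
  `open-transport-data.public.shapes` shapes
JOIN
  `open-transport-data.public.squares` squares
ON
  ST_INTERSECTS(shapes.shape_geography, squares.square_geography)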
Even though you expected just 3.3M rows in the output, in reality it is about 6.6B (6,591,549,944) rows, based on the result of my experiment.
Note the warning about the Billing Tier, so you are better off using Reservations if they are available.
Obviously, un-commenting the square_portion calculation will increase slot usage, so you might need to revisit your requirements/expectations.

Related

Creating a view that contains all records from one table that match the comma-separated field content in another table

I have two tables au_postcodes and groups.
Table groups contains a field called PostCodeFootPrint
that contains the postcode set making up the footprint.
Table au_postcodes contains a field called poa_code that
contains a single postcode.
The records in groups.PostCodeFootPrint look like:
PostCodeFootPrint
2529,2530,2533,2534,2535,2536,2537,2538,2539,2540,2541,2575,2576,2577,2580
2640
3844
2063, 2064, 2065, 2066, 2067, 2068, 2069, 2070, 2071, 2072, 2073, 2074, 2075, 2076, 2077, 2079, 2080, 2081, 2082, 2083, 2119, 2120, 2126, 2158, 2159
2848, 2849, 2850, 2852
Some records have only one postcode, some have multiple separated by a "," or ", " (comma and space).
The records in au_postcode.poa_code look like:
poa_code
2090
2092
2093
829
830
836
2080
2081
Single postcode (always).
The objective is to:
Get all records from au_postcode, where the poa_code appears in groups.PostCodeFootPrint, into a view.
I tried:
SELECT
au_postcodes.poa_code,
groups."NameOfGroup"
FROM
groups,
au_postcodes
WHERE
groups."PostcodeFootprint" LIKE '%au_postcodes.poa_code%'
But no luck
You can use regex for this. Take a look at this fiddle:
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=739592ef262231722d783670b46bd7fa
There, I form a regex from the poa_code and a word boundary (to avoid partial matches) and compare that to the PostCodeFootPrint.
select p.poa_code, g.PostCodeFootPrint
from groups g
join au_postcode p
on g.PostCodeFootPrint ~ concat('\y', p.poa_code, '\y')
Depending on your data, this may be performant enough. I also believe that in Postgres you have access to the array data type, so it might be better to store the post code lists as arrays.
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=ae24683952cb2b0f3832113375fbb55b
Here I stored the post code lists as arrays, then used ANY to join with.
select p.poa_code, g.PostCodeFootPrint
from groups g
join au_postcode p
on p.poa_code = any(g.PostCodeFootPrint);
In these two fiddles I use explain to show the cost of the queries, and while the array solution is more expensive, I imagine it might be easier to maintain.
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=7f16676825e10625b90eb62e8018d78e
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=e96e0fc463f46a7c467421b47683f42f
I changed the underlying data type to integer in this fiddle, expecting it to reduce the cost, but it didn't, which seems strange to me.
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=521d6a7d0eb4c45471263214186e537e
It is possible to reduce the query cost with the # operator (see the last query here: https://dbfiddle.uk/?rdbms=postgres_14&fiddle=edc9b07e9b22ee72f856e9234dbec4ba):
select p.poa_code, g.PostCodeFootPrint
from groups g
join au_postcode p
on (g.PostCodeFootPrint # p.poa_code) > 0;
but it is still more expensive than the regex. However, I think you can probably rearrange the way the tables are set up and radically change performance. See the first and second queries in the fiddle, where I take each post code in the footprint and insert it as a row in a table, along with an identifier for the group it was in:
select p.poa_code, g.which
from groups2 g
join au_postcode p
on g.footprint = p.poa_code;
The explain plan for this indicates that query cost drops significantly (from 60752.50 to 517.20, or two orders of magnitude) and the execution times go from 0.487 to 0.070. So it might be worth looking into changing the table structure.
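A minimal sketch of that restructuring, assuming the groups2(which, footprint) layout used in the query above and that NameOfGroup identifies each group (names are taken from this thread; the exact fiddle schema may differ):

-- one-time restructuring: one row per (group, postcode)
CREATE TABLE groups2 AS
SELECT g.NameOfGroup AS which,
       trim(unnest(string_to_array(g.PostCodeFootPrint, ','))) AS footprint
FROM   groups g;

After that, the join above is a plain equijoin that can use ordinary btree indexes on footprint and poa_code.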
Since the values of PostCodeFootPrint are separated by a common character, you can easily create an array out of it. From there, use unnest to convert the array elements to records, and then join them with au_postcode:
SELECT * FROM au_postcode au
JOIN (SELECT trim(unnest(string_to_array(PostCodeFootPrint,',')))
FROM groups) fp (PostCodeFootPrint) ON fp.PostCodeFootPrint = au.poa_code;
Demo: db<>fiddle

Unnesting a json in Redshift causing nested loop in the query plan

I have a column in my tables called 'data' with JSONs in it like below:
{"tt":"452.95","records":[{"r":"IN184366","t":"812812819910","s":"129.37","d":"982.7","c":"83"},{"r":"IN183714","t":"8028028029093","s":"33.9","d":"892","c":"38"}]}
I have written code to unnest it into separate columns like tr, r, s.
Below is the code
with raw as (
SELECT json_extract_path_text(B.Data, 'records', true) as items
FROM tableB as B where B.date::timestamp between
to_timestamp('2019-01-01 00:00:00','YYYY-MM-DD HH24:MI:SS') AND
to_timestamp('2022-12-31 23:59:59','YYYY-MM-DD HH24:MI:SS')
UNION ALL
SELECT json_extract_path_text(C.Data, 'records', true) as items
FROM tableC as C where C.date-5 between
to_timestamp('2019-01-01 00:00:00','YYYY-MM-DD HH24:MI:SS') AND
to_timestamp('2022-12-31 23:59:59','YYYY-MM-DD HH24:MI:SS')
),
numbers as (
SELECT ROW_NUMBER() OVER (ORDER BY TRUE)::integer - 1 as ordinal
FROM <any_random_table> limit 1000
),
joined as (
select raw.*,
json_array_length(raw.items, true) as number_of_items,
json_extract_array_element_text(
raw.items,
numbers.ordinal::int,
true
) as item
from raw
cross join numbers
where numbers.ordinal <
json_array_length(raw.items, true)
),
parsed as (
SELECT J.*,
json_extract_path_text(J.item, 'tr',true) as tr,
json_extract_path_text(J.item, 'r',true) as r,
json_extract_path_text(J.item, 's',true)::float8 as s
from joined J
)
select * from parsed
The above code works when there is a small number of records, but it takes more than a day to run, CPU utilization (in Redshift) reaches 100%, and even the disk space used reaches 100% if I set the date range to the last two years or the number of records is otherwise large.
Can anyone please suggest an alternative way to unnest JSON objects like this in Redshift?
My query plan is saying:
Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products
Goal: To Unnest without using any cross joins
Input: data column having JSON
"tt":"452.95","records":[{"r":"IN184366","t":"812812819910","s":"129.37","d":"982.7","c":"83"},{"r":"IN183714","t":"8028028029093","s":"33.9","d":"892","c":"38"}]}
Output should be for example
tr,r,s columns from the above json
You want to unnest up to 1000 JSON records stored in a JSON array, but the nested loop join is taking too long.
The root issue is likely your data model. You have stored structured records (called "records") inside a semi-structured text element (JSON), within a column of a structured columnar database. You want to perform some operation on these buried records that you haven't described, but here's the problem: columnar databases are optimized for read-centric analytic queries, yet you need to expand these internal JSON records into Redshift rows (records), which is fundamentally a write operation. This works against the optimizations of the database.
The expanded data is also large compared to the disk storage on your cluster, which is why the disks are filling up. Your CPUs are likely spinning unpacking the JSONs and managing overloaded disk and memory capacity. When the disks are close to full, Redshift shifts to a mode that optimizes disk space utilization at the expense of execution speed. A larger cluster may give you significantly faster execution if you can avoid this effect, but that will cost money you may not have budgeted. Not an ideal solution.
One area that would improve the speed of your query is not carrying all the data along. You keep raw.* and J.* all through the query, but it is not clear you need them. Since part of the issue is data size during execution, and that execution includes loop joining, you are making the execution much harder than it needs to be by carrying all this data (including the original JSONs).
The best way out of this situation is to change your data model and expand these internal JSON records into Redshift records on ingestion. JSON data is fine for seldom-used information or information that is only needed at the end of a query where the data is small. Needing the expanded JSON at the input end of the query for such a large amount of data is not a good use case for JSON in Redshift. Each of these "records" inside the JSON is a record and needs to be stored as such if you need to work across them as query input.
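As a sketch only (the table and column names below are illustrative, not from your schema), the target of that remodel could be a plain table with one row per inner record, populated by the same ETL/COPY process that loads the JSON today:

-- hypothetical expanded table, filled at ingestion time
CREATE TABLE record_items (
    source_id  BIGINT,        -- key of the source row the JSON came from
    r          VARCHAR(32),
    t          VARCHAR(32),
    s          FLOAT8,
    d          FLOAT8,
    c          INT
);

With the records already expanded, the query above reduces to ordinary filters and joins, with no cross join against a numbers table.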
Now you want to know if there is some slick way to get around this issue in your case and the answer is "unlikely but maybe". Can you describe how you are using the final values in your query (t, r, and s)? If you are just using some aspect of this data (max value or sum or ...) then there may be a way to get to the answer without the large nested loop join. But if you need all the values then there is no other way to get these AFAIK. A description of what comes next in the data process could open up such an opportunity.

geolocating self join too slow

I am trying to get the count of all records within 50 miles of each record in a huge table (1m+ records), using a self join as shown below:
proc sql;
create table lab as
select distinct a.id, sum(case when b.value="New York" then 1 else 0 end)
from latlon a, latlon b
where a.id <> b.id
and geodist(a.lat,a.lon,b.lat,b.lon,"M") <= 50
and a.state = b.state;
This ran for 6 hours and was still running when I last checked.
Is there a way to do this more efficiently?
UPDATE: My intention is to get the number of New Yorkers within a 50-mile radius of every record in table latlon, which has name, location, and latitude/longitude; lat/lon could be anywhere in the world, but location will be a person's hometown. I have to do this for close to a dozen towns. Looks like this is the best it can get. I may have to write C code for this one, I guess.
The geodist() function you're using has no chance of exploiting any index. So, you have an algorithm that's O(n**2) at best. That's gonna be slow.
You can take advantage of a simple fact of spherical geometry, though, to get access to an indexable query. A degree of latitude (north-south) is equivalent to sixty nautical miles, about 69 statute miles, or 111.111 km. The British nautical mile was originally defined as one minute of latitude. The original Napoleonic meter was defined as one ten-millionth of the distance from the equator to the pole, which spans 90 degrees of latitude.
(These definitions depend on the assumption that the earth is spherical. It isn't, quite. If you're a civil engineer these definitions break down. If you use them to design a parking lot, it will have some nasty puddles in it when it rains, and will encroach on the neighbors' property.)
So, what you want is to use a bounding range. Assuming your latitude values a.lat and b.lat are in degrees, two of them are certainly more than fifty statute miles apart unless
a.lat BETWEEN b.lat - 50.0/69.0 AND b.lat + 50.0/69.0
Let's refactor your query. (I don't understand the case stuff about New York so I'm ignoring it. You can add it back.) This will give the IDs of all pairs of places lying within 50 miles of each other. (I'm using the 21st century JOIN syntax here).
select distinct a.id, b.id
from latlon a
JOIN latlon b ON a.id<>b.id
AND a.lat BETWEEN b.lat - 50.0/69.0 AND b.lat + 50.0/69.0
AND a.state = b.state
AND geodist(a.lat,a.lon,b.lat,b.lon,"M") <= 50
Try creating an index on the table on the lat column. That should help performance a LOT.
Then try creating a compound index on (state, lat, id, lon, value). Try those columns in the compound index in different orders if you don't get satisfactory performance acceleration. It's called a covering index, because some of its columns (the first two in this case) are used for quick lookups and the rest are used to provide values that would otherwise have to be fetched from the main table.
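In SAS, those indexes could be created with PROC SQL along these lines (a sketch; it assumes the latlon table from the question, and recall that a simple index must share its column's name while a composite index may take any other name):

proc sql;
  /* simple index on latitude */
  create index lat on latlon(lat);
  /* compound covering index */
  create index state_lat on latlon(state, lat, id, lon, value);
quit;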
Your question is phrased ambiguously - I'm interpreting it as "give me all (A, B) city pairs within 50 miles of each other." The NYC special case seems to be for a one-off test - the problem is not to (trivially, in O(n) time) find all cities within 50 miles of NYC.
Rather than computing Great Circle distances up front, use cheap Manhattan-style bounds: simple additions and subtractions that give bounding boxes. Given the (A, B) city tuples that pass that coarse filter, it is straightforward to prune out the few (near the diagonals) whose Great Circle (or Euclidean) distance still exceeds 50 miles.
You didn't show us EXPLAIN output describing the backend optimizer's plan.
You didn't tell us about indexes on the latlon table.
I'm not familiar with the SAS RDBMS. Oracle, MySQL, and others have geospatial extensions to support multi-dimensional indexing. Essentially, they merge high-order coordinate bits, down to low-order coordinate bits, to construct a quadtree index. The technique could prove beneficial to your query.
Your DISTINCT keyword will make a big difference for the query plan. Often it will force a tablescan and a filesort. Consider deleting it.
The equijoin on state seems wrong, but maybe you don't care about the tri-state metropolitan area and similar densely populated regions near state borders.
You definitely want the WHERE clause to prune out b rows that are more than 50 miles from the current a row:
too far north, OR
too far south, OR
too far west, OR
too far east
Each of those conditionals boils down to a simple range query that the RDBMS backend can evaluate and optimize against an index. Unfortunately, if it chooses the latitude index, any longitude index that's on disk will be ignored, and vice versa. Which motivates using your vendor's geospatial support.
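A rough, untested sketch of those bounds in the original PROC SQL query (69 miles per degree of latitude is approximate, and the longitude window has to be widened by the cosine of the latitude, which breaks down near the poles):

where a.id <> b.id
  and a.state = b.state
  and b.lat between a.lat - (50/69) and a.lat + (50/69)
  and b.lon between a.lon - (50/(69*cos(a.lat*constant('pi')/180)))
              and a.lon + (50/(69*cos(a.lat*constant('pi')/180)))
  and geodist(a.lat, a.lon, b.lat, b.lon, "M") <= 50

The latitude and longitude ranges do the cheap pruning; geodist() then gives the exact answer on the small remainder.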

Spatial SQL query showing parcels containing centroid of building

I am trying to write a query that selects parcels that contain the centroid of a certain building code (bldg_code = 3).
The parcels are listed in the table "city.zoning", which contains columns for the PIN, geometry, and area of each parcel. The table "buildings" contains columns for bldg_type and bldg_code, indicating the building type and its corresponding code. The building type of interest for this query has a bldg_code of 3.
So far I've developed a query that shows parcels that interact with the building type of interest:
select a.*
from city.zoning a, username.buildings b
where b.bldg_code = 3 and sdo_anyinteract(a.geom,b.geom) = 'TRUE';
Any ideas?
You can use SDO_GEOM.SDO_CENTROID (documentation) to find the centroid of a geometry.
Note that the centroid provided by this function is the mathematical centroid only and may not always lie inside the geometry, for example, if your polygon is L-shaped. SpatialDB Adviser has a good article on this.
If this isn't a problem for you and you don't need that level of accuracy, just use the built-in function, but if you do consider this to be a problem (as I did in the past), then SpatialDB Adviser has a standalone PL/SQL package that correctly calculates centroids.
Depending on your performance needs, you could calculate the centroids on-the-fly and just use them in your query directly, or alternatively, add a centroid column to the table and compute and cache the values with application code (best case) or trigger (worst case).
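A rough sketch of the cached-column approach (the column name and the 0.005 tolerance are illustrative choices, not from your schema):

-- add and populate a cached centroid column
ALTER TABLE username.buildings ADD (centroid SDO_GEOMETRY);

UPDATE username.buildings b
   SET b.centroid = SDO_GEOM.SDO_CENTROID(b.geom, 0.005);

COMMIT;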
Your query would look something like this:
SELECT a.*
FROM city.zoning a
JOIN username.buildings b ON sdo_contains(a.geom, b.centroid) = 'TRUE'
WHERE b.bldg_code = 3
Note that this is using SDO_CONTAINS on the basis of the a.geom column being spatially indexed and a new column b.centroid that has been added and populated (note - query not tested). If the zoning geometry is not spatially indexed, then you would need to use SDO_GEOM.RELATE, or index the centroid column and invert the logic to use SDO_INSIDE.

SQL Cross Apply Performance Issues

My database has a directory of about 2,000 locations scattered throughout the United States with zipcode information (which I have tied to lon/lat coordinates).
I also have a table function which takes two parameters (ZipCode & Miles) to return a list of neighboring zip codes (excluding the same zip code searched).
For each location I am trying to get the neighboring location ids. So if location #4 has three nearby locations, the output should look like:
4 5
4 24
4 137
That is, locations 5, 24, and 137 are within X miles of location 4.
I originally tried to use a cross apply with my function as follows:
SELECT A.SL_STORENUM,A.Sl_Zip,Q.SL_STORENUM FROM tbl_store_locations AS A
CROSS APPLY (SELECT SL_StoreNum FROM tbl_store_locations WHERE SL_Zip in (select zipnum from udf_GetLongLatDist(A.Sl_Zip,7))) AS Q
WHERE A.SL_StoreNum='04'
However, that ran for over 20 minutes with no results, so I canceled it. I did try hardcoding in the zipcode, and it immediately returned a list:
SELECT A.SL_STORENUM,A.Sl_Zip,Q.SL_STORENUM FROM tbl_store_locations AS A
CROSS APPLY (SELECT SL_StoreNum FROM tbl_store_locations WHERE SL_Zip in (select zipnum from udf_GetLongLatDist('12345',7))) AS Q
WHERE A.SL_StoreNum='04'
What is the most efficient way of accomplishing this listing of nearby locations? Keeping in mind while I used "04" as an example here, I want to run the analysis for 2,000 locations.
The "udf_GetLongLatDist" is a function which uses some math to calculate distance between two geographic coordinates and returns a list of zipcodes with a distance of > 0. Nothing fancy within it.
When you use the function, you probably have to calculate every single possible distance for each row. That is why it takes so long. Since the actual physical locations don't generally move, what we always did was precalculate the distance from each zipcode to every other zipcode (and update only once a month or so, when we added new possible zipcodes). Once the distances are precalculated, all you have to do is run a query like:
select zip2 from zipprecalc where zip1 = '12345' and distance <=10
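Building that table once could look something like this sketch, where zipcoords and dbo.udf_Distance are hypothetical stand-ins for your zip/lat/lon source table and your existing distance calculation:

-- one-time build of the pre-calculated pair table
SELECT  a.zipnum AS zip1,
        b.zipnum AS zip2,
        dbo.udf_Distance(a.lat, a.lon, b.lat, b.lon) AS distance
INTO    zipprecalc
FROM    zipcoords AS a
JOIN    zipcoords AS b
  ON    a.zipnum <> b.zipnum;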
We have something similar and optimized it by only calculating the distance for other zipcodes whose latitude is within a bounded range. So if you want other zips within #miles, you use a predicate like:
where latitude >= #targetLat - (#miles/69.2) and latitude <= #targetLat + (#miles/69.2)
Then you are only calculating the great circle distance of a much smaller subset of other zip code rows. We found this fast enough in our use to not require precalculating.
The same thing can't be done for longitude because of the variation between equator and pole of what distance a degree of longitude represents.
Other answers here involve re-working the algorithm. I personally advise the pre-calculated map of all zipcodes against each other. It should be possible to embed such optimisations in your existing udf, to minimise code-changes.
A refactoring of the query, however, could be as follows...
SELECT
A.SL_STORENUM, A.Sl_Zip, C.SL_STORENUM
FROM
tbl_store_locations AS A
CROSS APPLY
dbo.udf_GetLongLatDist(A.Sl_Zip,7) AS B
INNER JOIN
tbl_store_locations AS C
ON C.SL_Zip = B.zipnum
WHERE
A.SL_StoreNum='04'
Also, the performance of the CROSS APPLY will benefit greatly if you can ensure that the udf is INLINE rather than MULTI-STATEMENT. This allows the udf to be expanded inline (macro like) for a much cleaner execution plan.
Doing so would also allow you to return additional fields from the udf. The optimiser can then include or exclude those fields from the plan depending on whether you actually use them. Such an example would be to include the SL_StoreNum if it's easily accessible from the query in the udf, and so remove the need for the last join...
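For reference, an inline table-valued function is one whose body is a single RETURN (SELECT ...) rather than BEGIN ... END filling a table variable. A skeleton of what an inline version of the udf might look like (dbo.zipcoords and the placeholder predicate are illustrative; keep your actual distance math):

-- hypothetical inline rewrite of the udf
CREATE FUNCTION dbo.udf_GetLongLatDist_Inline (@Zip varchar(10), @Miles float)
RETURNS TABLE
AS
RETURN
(
    SELECT  z.zipnum
    FROM    dbo.zipcoords AS z              -- placeholder source of zip/lat/lon rows
    WHERE   z.zipnum <> @Zip
      AND   1 = 1                           -- replace with your distance-within-@Miles check
);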