Presto SQL left joining using ST_intersects, ST_crosses yield unexpected results - sql

Sorry if the title is not informative. I'm using AWS Athena and have two tables:
1. transaction_table
location time status type .... deleted_at
BLOB 2020-09-01
BLOB 2020-09-02
BLOB 2020-09-03
2. area_table
boundary created_at deleted_at
POLYGON((...)) 2020-09-01 null
POLYGON((...)) 2020-09-01 null
POLYGON((...)) 2020-09-01 2020-10-01
For each row in transaction_table I want to add the appropriate boundary:
select date(time) as dt
, count(time) As cnt
from transaction_table t
left join area_table a
on ST_intersects(a.boundary, ST_Point(ST_X(t.location), ST_Y(t.location)))
where t.status = 'complete'
and t.deleted_at is null
and t.time >= date('2020-09-01')
and a.deleted_at is null
group by date(t.time);
The problem is that when I use ST_intersects or ST_contains, the daily cnt decreases compared to the query without the left join, which does not make sense to me, since a left join should always output at least as many rows as the left table.
Both the left and right tables have no null values, and there are no one-to-many joins that would multiply rows (if there were, the query with the left join would return more rows than the one without).
Right now, using ST_Crosses fixes the problem: it outputs the same result with and without the left join. But I am not sure why the number of rows decreases in my query above.
EDIT: ST_Crosses doesn't seem to join any rows, hence the same value as querying without the left join. So my question is: why does the daily cnt decrease when using a left join with ST_intersects or ST_contains? The same query in MySQL (ST_Point -> Point) runs perfectly fine.
From https://prestodb.io/docs/current/functions/geospatial.html and https://dev.mysql.com/doc/refman/5.7/en/gis-class-point.html:
Point(lat, lng) gives a point object, which is zero-dimensional.
ST_Point(lat, lng) is a geometry and is two-dimensional.
So I guess ST_intersects(Geom, Geom) and ST_intersects(Geom, Point) work differently, but this still does not explain the reduced daily cnt on the left join.

Athena is based on Presto 0.172 - according to the release notes, there were no geospatial functions available in that version of Presto itself:
Presto Functions in Athena
Presto 0.172 Documentation
Athena's geospatial functions are implemented as a Presto plugin, and the full reference is available here: List of Supported Geospatial Functions.
One thing to consider is the order of arguments of ST_Point, which is ST_Point(longitude, latitude): longitude is the first argument and latitude the second.
You are also referring to both the left and the right table in the WHERE condition; filtering on the right table's columns there definitely can result in fewer rows.
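If the goal is to keep every transaction even when no (non-deleted) area matches, one option is to move the right-table filter into the join condition. This is only a sketch built from the column names in the question (untested, and it assumes that ST_X/ST_Y of the stored location really do yield longitude and latitude in that order):
select date(t.time) as dt
, count(t.time) as cnt
from transaction_table t
left join area_table a
on ST_intersects(a.boundary, ST_Point(ST_X(t.location), ST_Y(t.location)))
and a.deleted_at is null -- right-table filter moved into the ON clause so unmatched transactions survive
where t.status = 'complete'
and t.deleted_at is null
and t.time >= date('2020-09-01')
group by date(t.time);
In the original query, a transaction whose only matching area row had a non-null deleted_at was removed entirely by the WHERE clause, which is one way the count can end up lower than without the join.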

Related

Athena/Presto | Can't match ID row on self join

I'm trying to get the bi-grams on a string column.
I've followed the approach here but Athena/Presto is giving me errors at the final steps.
Source code so far
with word_list as (
SELECT
transaction_id,
words,
n,
regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)') as f70,
f70_remittance_info
FROM exploration_transaction
cross join unnest(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) with ordinality AS t (words, n)
where cardinality((regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)'))) > 1
and f70_remittance_info is not null
limit 50 )
select wl1.f70, wl1.n, wl1.words, wl2.f70, wl2.n, wl2.words
from word_list wl1
join word_list wl2
on wl1.transaction_id = wl2.transaction_id
The specific issue I'm having is on the very last line: when I try to self join on the transaction ids, it always returns zero rows. It does work if I join only by wl1.n = wl2.n - 1 (the position in the array), which is useless if I can't constrain it to the same id.
Athena doesn't support Presto's ngrams function, so I'm left with this approach.
Any clues why this isn't working?
Thanks!
This is speculation. But I note that your CTE is using limit with no order by. That means that an arbitrary set of rows is being returned.
Although some databases materialize CTEs, many do not. They run the code independently each time it is referenced. My guess is that the code is run independently and the arbitrary set of 50 rows has no transaction ids in common.
One solution would be to add order by transaction_id in the subquery.
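A sketch of that change (the same query as above with only the ordering added and the select list trimmed; untested):
with word_list as (
SELECT
transaction_id,
words,
n,
f70_remittance_info
FROM exploration_transaction
cross join unnest(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) with ordinality AS t (words, n)
where cardinality(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) > 1
and f70_remittance_info is not null
order by transaction_id, n -- makes the 50 limited rows deterministic across both references to the CTE
limit 50 )
select wl1.transaction_id, wl1.words, wl2.words
from word_list wl1
join word_list wl2
on wl1.transaction_id = wl2.transaction_id
and wl1.n = wl2.n - 1 -- consecutive words within the same transaction form the bi-gram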

Missing rows that are only on one table on FULL OUTER JOIN in PostgreSQL

I have 2 tables remittance and draft where I'm trying to find the difference between what was remitted and drafted.
This is what I have so far, but it doesn't give me the records that are only on one of the tables
SELECT remittance.loan_no, remittance.inv_loan_no, remittance.ll_remittance,
draft.fm_loan_number, draft.draft_amount,
(remittance.ll_remittance) - (draft.draft_amount) AS difference
FROM remittance FULL JOIN draft
ON remittance.inv_loan_no = draft.fm_loan_number
WHERE (remittance.ll_remittance) - (draft.draft_amount)<> 0.00;
Could it be that, when I do the difference, the missing remittance or draft amounts have null values in them, and that's why I'm not getting any results for the difference?
I thought FULL JOIN would give me the loans that are only on one of the tables, with NULL in the other table's columns.
Thank you,
Here is the sample data:
Remittance Table, Draft Table, Query Results
I have highlighted the loans in red that are not showing up in the Query Results
Could it be that, when I do the difference, the missing remittance or draft amounts have null values in them, and that's why I'm not getting any results for the difference?
Yes! Adding a where clause to a full join is always tricky because the filtering takes place on the joined data set.
However, I don't think there should be any problem in placing that difference check in the ON clause instead:
SELECT remittance.loan_no, remittance.inv_loan_no, remittance.ll_remittance,
draft.fm_loan_number, draft.draft_amount,
(remittance.ll_remittance) - (draft.draft_amount) AS difference
FROM remittance FULL JOIN draft
ON remittance.inv_loan_no = draft.fm_loan_number
AND (remittance.ll_remittance) - (draft.draft_amount)<> 0.00;
If you have only one row in each table for each loan, then coalesce() suffices for your purposes:
SELECT r.loan_no, r.inv_loan_no, r.ll_remittance,
d.fm_loan_number, d.draft_amount,
coalesce(r.ll_remittance, 0) - coalesce(d.draft_amount, 0) AS difference
FROM remittance r FULL JOIN
draft d
ON r.inv_loan_no = d.fm_loan_number
WHERE coalesce(r.ll_remittance, 0) <> coalesce(d.draft_amount, 0);
If there are multiple rows, then you will need to pre-aggregate.
Here is a SQL Fiddle.
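If a loan can appear on multiple rows in either table, a sketch of the pre-aggregation (sum per loan first, then full join the two aggregates; column names taken from the question, untested):
SELECT coalesce(r.inv_loan_no, d.fm_loan_number) AS loan,
r.ll_remittance, d.draft_amount,
coalesce(r.ll_remittance, 0) - coalesce(d.draft_amount, 0) AS difference
FROM (SELECT inv_loan_no, sum(ll_remittance) AS ll_remittance
FROM remittance
GROUP BY inv_loan_no) r
FULL JOIN
(SELECT fm_loan_number, sum(draft_amount) AS draft_amount
FROM draft
GROUP BY fm_loan_number) d
ON r.inv_loan_no = d.fm_loan_number
WHERE coalesce(r.ll_remittance, 0) <> coalesce(d.draft_amount, 0);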

Query using COUNT returns records where the count is positive only

Good day everyone.
Consider this portion of a relational SQLite database:
floors(number) - rooms(number, #floorNumber)
I aim to query for the number of rooms per floor. This is my attempt:
select floors.number, count(rooms.floornumber)
from floors, rooms where floors.number=rooms.floornumber
group by floors.number, rooms.floornumber;
Example:
1|5
2|7
3|5
4|3
The issue is that I also would like the query to return records where the floor contains 0 rooms (for example floor number 5 exists in the "floors" table but isn't shown in the query result).
Your assistance is appreciated. Thank you.
Never use commas in the FROM clause. Always use proper, explicit JOIN syntax.
You need a LEFT JOIN, but you cannot even see what you need because of the way that your query is written.
select f.number, count(r.floornumber)
from floors f left join
rooms r
on f.number = r.floornumber
group by f.number;
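With the left join, a floor that has no rooms still produces one row, and count(r.floornumber) ignores the NULL that the unmatched row brings in, so that floor gets a count of 0. With the example data above, floor 5 would now appear as:
5|0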

sum 'distinct' rows with same values

I have a database which has a feeder that may have several distributors, each of which may have several transformers, each of which may have several clients and a certain kVA (the power that gets to the clients).
And I have the following code:
SELECT f.feeder,
d.distributor,
count(DISTINCT t.transformer) AS total_transformers,
sum(t.Kvan) AS Total_KVA,
count(c.client) AS Clients
FROM feeders f
LEFT JOIN distributors d
ON (d.feeder = f.feeder)
LEFT JOIN transformers t
ON (t.transformer = d.transformer)
LEFT JOIN clients c
ON (c.transformer = t.transformer)
WHERE d.transformer IS NOT NULL
GROUP BY f.feeder,
d.distributor
ORDER BY f.feeder,
d.distributor
The sum is supposed to give the sum of the different kVA values the transformers have. Each transformer has a certain kVA. The problem is that one transformer has a single kVA value for all the clients it has connected, but the query sums it as if it were that kVA once per client.
I need to group it on the feeder and distributor (I want to see how much kVA the distributor has and how many clients total).
So what should be "feeder1|dist1|2|600|374" comes back as "feeder1|dist1|2|130000|374" (one transformer has 200 kVA and the other one 400, but the query sums these two values 374 times instead of 400 + 200).
Your data model seems a little messy, in that you've specified a distributor can have many transformers (and logic suggests that a transformer is only on a single distributor) yet your query implies that the transformer ID is on the distributor record, which normally implies the opposite relationship ...
So if that's right, it must mean that you have multiple records in the distributors table for the same distributor - i.e. distributor can't then be a unique key in distributors table, which makes the query quite hard to reason accurately about. (e.g. What happens if the records for a distributor don't all have the same feeder ID on them? I'm guessing you wouldn't like the answer so much... Presumably you mean for that to be impossible, but if the model is as described it's not impossible. And worse I'm now second-guessing whether the apparent keys on the other tables are in fact unique... But I digress...)
Or maybe something else is broken. Point is the info you've given may be inconsistent or incomplete. Since I'm inferring an abnormal data model I can't guarantee the following is bug-free (though if you provide more detail so I can make fewer guesses, I may be able to refine the answer)...
So you know the trouble is that by the time you're ready to do the aggregation, the transformer data is embedded in a larger row that isn't based just on the identity of the transformer. There are a couple ways you could fix it, basically all centered on changing how you look at the aggregation of values. Here's one option:
select f.feeder
, dtc.distributor
-- next values work because transformer is already unique per group
, count(dtc.transformer) total_transformers
, sum(dtc.kvan) total_kvan
, sum(dtc.clients) clients
from feeders f
join (select d.distributor
, d.feeder
, t.transformer
, max(t.kvan) as kvan -- or min, doesn't matter
, count(distinct c.client) clients
from distributors d
left join transformers t
on d.transformer = t.transformer
left join clients c
on c.transformer = t.transformer
where d.transformer is not null
group by d.distributor, d.feeder, t.transformer
) dtc
on dtc.feeder = f.feeder
group by f.feeder, dtc.distributor
A few notes:
I changed the outer query join to an inner join, because any null rows from the original left join from feeder would be eliminated by the original where clause.
I kept the where clause anyway; having it alongside the distributor-to-transformer left join is a little weird, but it is different from either an inner join or an outer join without the where clause (since the where clause acts on the left table's value). I'm avoiding changing the semantics of your original query, but it's weird enough that this is something you might want to take another look at.
What using the subquery does here is, the inner query returns one row per feeder/distributor/transformer - i.e. for each feeder/distributor it returns one row per transformer. That row is itself an aggregate so that we can count clients, but since all rows in that aggregation come from the same transformer record we can use max() to get that single record's kvan value onto the aggregation.

Why does changing the where clause on this criteria reduce the execution time so drastically?

I ran across a problem with a SQL statement today that I was able to fix by adding additional criteria; however, I really want to know why my change fixed the problem.
The problem query:
SELECT *
FROM
(SELECT ah.*,
com.location,
ha.customer_number,
d.name applicance_NAME,
house.name house_NAME,
tr.name RULE_NAME
FROM actionhistory ah
INNER JOIN community com
ON (ah.city_id = com.city_id)
INNER JOIN house_address ha
ON (ah.applicance_id = ha.applicance_id
AND ha.status_cd = 'ACTIVE')
INNER JOIN applicance d
ON (ah.applicance_id = d.applicance_id)
INNER JOIN house house
ON (house.house_id = ah.house_id)
LEFT JOIN the_rule tr
ON (tr.the_rule_id = ah.the_rule_id)
WHERE actionhistory_id >= 'ACT100010000'
ORDER BY actionhistory_id
)
WHERE rownum <= 30000;
The "fix"
SELECT *
FROM
(SELECT ah.*,
com.location,
ha.customer_number,
d.name applicance_NAME,
house.name house_NAME,
tr.name RULE_NAME
FROM actionhistory ah
INNER JOIN community com
ON (ah.city_id = com.city_id)
INNER JOIN house_address ha
ON (ah.applicance_id = ha.applicance_id
AND ha.status_cd = 'ACTIVE')
INNER JOIN applicance d
ON (ah.applicance_id = d.applicance_id)
INNER JOIN house house
ON (house.house_id = ah.house_id)
LEFT JOIN the_rule tr
ON (tr.the_rule_id = ah.the_rule_id)
WHERE actionhistory_id >= 'ACT100010000' and actionhistory_id <= 'ACT100030000'
ORDER BY actionhistory_id
)
All of the _id columns are indexed sequences.
The first query's explain plan had a cost of 372 and the second was 14. This is running on an Oracle 11g database.
Additionally, if actionhistory_id in the where clause is anything less than ACT100000000, the original query returns instantly.
This is because of the index on the actionhistory_id column.
During the first query, Oracle has to return all the index blocks containing entries for records that come after 'ACT100010000', then it has to match the index entries to the table to get all the records, and then it pulls 29999 records from the result set.
During the second query, Oracle only has to return the index blocks containing entries between 'ACT100010000' and 'ACT100030000', then grab from the table the records represented in those index blocks. That is a lot less work in the step of fetching the records after the index lookup than in the first query.
Noticing your last line about the id being less than ACT100000000: it sounds to me as if those records may all be in the same memory block (or in a contiguous set of blocks).
EDIT: Please also consider what is said by Justin - I was talking about actual performance, but he is pointing out that the id being a varchar greatly increases the potential values (as opposed to a number) and that the estimated plan may reflect a greater time than reality because the optimizer doesn't know the full range until execution. To further optimize, taking his point into consideration, you could put a function-based index on the id column, or you could make it a combination key, with the varchar portion in one column and the numeric portion in another.
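A rough sketch of those two ideas (all object names here are invented for illustration, and the substring offset assumes the constant prefix is just 'ACT'; adjust to the real schema):
-- function-based index exposing the numeric tail of the id to the optimizer
CREATE INDEX actionhistory_id_num_ix
ON actionhistory (TO_NUMBER(SUBSTR(actionhistory_id, 4)));
-- or split the key into a constant varchar prefix plus a numeric portion
ALTER TABLE actionhistory ADD (
actionhistory_prefix VARCHAR2(8),
actionhistory_seq NUMBER
);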
What are the plans for both queries?
Are the statistics on your tables up to date?
Do the two queries return the same set of rows? It's not obvious that they do but perhaps ACT100030000 is the largest actionhistory_id in the system. It's also a bit confusing because the first query has a predicate on actionhistory_id with a value of TRA100010000 which is very different than the ACT value in the second query. I'm guessing that is a typo?
Are you measuring the time required to fetch the first row? Or the time required to fetch the last row? What are those elapsed times?
My guess without that information is that the fact that you appear to be using the wrong data type for your actionhistory_id column is affecting the Oracle optimizer's ability to generate appropriate cardinality estimates, which is likely causing the optimizer to underestimate the selectivity of your predicates and to generate poorly performing plans. A human may be able to guess that actionhistory_id is a string that starts with ACT10000 and then has 30,000 sequential numeric values from 00001 to 30000, but the optimizer is not that smart. It sees a 13-character string and isn't able to figure out that the last 10 characters are always going to be digits, so there are only 10 possible values for each of those characters rather than 256 (assuming 8-bit characters), and that the first 8 characters are always going to be the same constant value. If, on the other hand, actionhistory_id was defined as a NUMBER and had values between 1 and 30000, it would be dramatically easier for the optimizer to make reasonable estimates about the selectivity of various predicates.
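For the first of the questions above, one standard way to look at the estimated plans (plain Oracle tooling, shown only as a pointer) is:
EXPLAIN PLAN FOR
SELECT * FROM ( ... ); -- the query under test goes here
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);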