Athena/Presto | Can't match ID row on self join - sql

I'm trying to get bi-grams from a string column.
I've followed the approach here, but Athena/Presto is giving me errors at the final steps.
Source code so far:
with word_list as (
    SELECT
        transaction_id,
        words,
        n,
        regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)') as f70,
        f70_remittance_info
    FROM exploration_transaction
    cross join unnest(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) with ordinality AS t (words, n)
    where cardinality(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) > 1
      and f70_remittance_info is not null
    limit 50
)
select wl1.f70, wl1.n, wl1.words, wl2.f70, wl2.n, wl2.words
from word_list wl1
join word_list wl2
  on wl1.transaction_id = wl2.transaction_id
The specific issue I'm having is on the very last line, when I try to self-join on the transaction ids - it always returns zero rows. It does work if I join only on wl1.n = wl2.n-1 (the position in the array), which is useless if I can't constrain it to the same id.
Athena doesn't support Presto's ngrams function, so I'm left with this approach.
Any clues why this isn't working?
Thanks!

This is speculation. But I note that your CTE is using limit with no order by. That means that an arbitrary set of rows is being returned.
Although some databases materialize CTEs, many do not. They run the code independently each time it is referenced. My guess is that the code is run independently and the arbitrary set of 50 rows has no transaction ids in common.
One solution would be to add order by transaction_id in the subquery.
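A minimal, untested sketch of that fix, assuming the same table and columns as in the question: ordering by transaction_id and n makes the limited row set deterministic, and the positional condition mentioned in the question pairs each word with its neighbour to form the bi-grams.
with word_list as (
    select
        transaction_id,
        words,
        n
    from exploration_transaction
    cross join unnest(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) with ordinality as t (words, n)
    where f70_remittance_info is not null
      and cardinality(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) > 1
    order by transaction_id, n   -- deterministic set even if the CTE is evaluated twice
    limit 50
)
select
    wl1.transaction_id,
    wl1.words as word_1,
    wl2.words as word_2
from word_list wl1
join word_list wl2
  on wl1.transaction_id = wl2.transaction_id
 and wl2.n = wl1.n + 1           -- adjacent positions within the same string => bi-gram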

Related

Adding LIMIT to pg sql query returns just one row duplicated as many times as the limit says

select qt.title, qia.text_data
from qt,
qia,
qi,
q,
spam_flag
where qt.quiz_id = qi.quiz_id
and qia.quiz_item_id = qi.id
and q.id = qi.quiz_id
and q.course_id = 'course-id'
and qi.type = 'essay'
and qia.quiz_answer_id in (select spam_flag.quiz_answer_id from spam_flag group by spam_flag.quiz_answer_id having count(*)>=0)
and LENGTH(qia.text_data) > 20
limit 100;
Without the last clause
limit 100;
the query returns a list of unique rows. But when I add it, it returns a single row duplicated as many times as the limit specifies. What could cause this? This is PostgreSQL.
Thank you.
You forgot the join condition for spam_flag, so you are getting each result row as often as there are rows in the spam_flag table (thus making it live up to its name).
Instead of commas in the FROM clause, you should use the explicit JOIN syntax. Then that could not happen.
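For illustration only, a hedged, untested sketch of the explicit-JOIN rewrite, keeping the original filters; since count(*) >= 0 is always true for a group that exists, the spam_flag subquery reduces to a plain IN check:
select qt.title, qia.text_data
from q
join qi  on qi.quiz_id = q.id
join qt  on qt.quiz_id = qi.quiz_id
join qia on qia.quiz_item_id = qi.id
where q.course_id = 'course-id'
  and qi.type = 'essay'
  and length(qia.text_data) > 20
  and qia.quiz_answer_id in (select quiz_answer_id from spam_flag)   -- semi-join instead of a dangling table
limit 100;   -- add an order by if the first 100 rows need to be deterministic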

UPDATED: Using COALESCE / IFNULL to filter recursive cte query

I have an issue with filtering / adding a condition to a recursive CTE query to avoid a NULL result. The recursive part of the query stops looping once it comes across a gap in the descending series of gids taken from the first, non-recursive query (essentially when the x.t_gid = s.g2_t_gid filter in the WHERE is NULL).
x.t_gid and s.g2_t_gid are descending series of integers, and x.t_gid(n) = s.g2_t_gid(n+1). There are meant to be gaps in the series of gids, but I want the recursive part to just continue on to the next row if it returns a NULL result. See code below.
WITH RECURSIVE snapped_points(t_gid, r_rdname, r_gid, r_ufi, snapped_geom, snapped_distance, g2_t_gid, g2_r_rdname, s_g2_snapped_geom, route_distance) AS (
(SELECT t_gid,
r_rdname,
r_gid,
r_ufi,
snapped_geom AS snapped_geom,
snapped_distance,
g2_t_gid,
g2_r_rdname,
g2_snapped_geom AS s_g2_snapped_geom,
route_distance
FROM x_joined_snapped x
LIMIT 1)
UNION ALL
(SELECT x.t_gid,
x.r_rdname,
x.r_gid,
x.r_ufi,
x.snapped_geom,
x.snapped_distance,
x.g2_t_gid,
x.g2_r_rdname,
x.g2_snapped_geom,
x.route_distance
FROM snapped_points s
INNER JOIN x_joined_snapped x
ON x.t_gid <> s.t_gid
AND (x.t_gid = s.g2_t_gid AND x.snapped_geom = s.s_g2_snapped_geom)
--OR (x.t_gid < s.g2_t_gid), <-- difference between 1s and 24s**
LIMIT 1
)
)
SELECT t_gid, r_gid, r_ufi, r_rdname, snapped_distance, snapped_geom FROM snapped_points
;
What I am aiming to achieve:
From the array of potential snapped points for t_gid(n), choose the row where the distance between snapped_geom(n) and g2_snapped_geom(n) is the shortest. If there is only one result, choose that.
From the array of potential snapped points for t_gid(n-1) (which equals g2_t_gid(n)), select the subset containing only the g2_snapped_geom(n) chosen in the previous step.
From this subset, choose the row where the distance between snapped_geom(n-1) and g2_snapped_geom(n-1) is shortest. If there is only one, choose that. Append to the previous result. Loop until you run out of rows.
Once the recursive part hits a gap in the table where x.t_gid <> s.g2_t_gid, it just stops looping. I have been able to fix this by adding an OR x.t_gid < s.g2_t_gid to the WHERE clause, but this increases the compute time from 1,000 ms to 24,000 ms.
I have tried using COALESCE but can't get it to work, and recursive CTE queries don't allow the recursive part to be repeated in the query.
I am a complete noob so I'm sure this code could be made prettier and more efficient. Any help would be greatly appreciated.
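For reference only (this does not address the gap handling itself): the "choose the row with the shortest distance" step described above can be written as an ORDER BY on the distance followed by LIMIT 1. A minimal, untested sketch, assuming the geometry columns are PostGIS types so ST_Distance applies, and using a placeholder t_gid value:
-- for one t_gid (placeholder value 42), pick the candidate whose snapped point
-- is closest to its g2 counterpart
select *
from x_joined_snapped
where t_gid = 42
order by ST_Distance(snapped_geom, g2_snapped_geom)
limit 1;
In principle the same ORDER BY could sit just before the LIMIT 1 inside each parenthesised branch of the recursive CTE above.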

Maximum number of expressions in a list is 1000

I'm trying to select 170k records from an Oracle database. Is there some way to avoid this error, or any way to improve this query?
Thanks.
select sr.RELATED_PON, srsi.VALID_VALUE
from SERV_REQ sr
inner join SERV_REQ_SI_VALUE srsi
on sr.DOCUMENT_NUMBER = srsi.DOCUMENT_NUMBER
inner join SERV_ITEM si
on si.SERV_ITEM_ID = srsi.SERV_ITEM_ID
and si.STATUS = '6'
where srsi.VALUE_LABEL = 'unitAddress'
and srsi.VALID_VALUE in ('1682511819',
'1682575135',
'1682580326'
... more than 150k here!
)
Lamak is correct: This really looks like a list that belongs in a table.
However, if this is not convenient for whatever reason, you must break the IN clause into chunks of no more than 1000 elements each. Happily, this is pretty trivial: you insert ) or srsi.VALID_VALUE in ( after every 1000 items, and wrap the whole chain of ORs in an outer pair of parentheses so it still combines correctly with the other AND conditions.
Unfortunately, you're soon going to bump into the max size of a query string. (For Oracle, I think that's 32K)... but seriously consider a temp table or something.
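A hedged sketch of what the chunked version of the original query would look like (the value lists stay elided here; note the outer parentheses around the OR chain):
select sr.RELATED_PON, srsi.VALID_VALUE
from SERV_REQ sr
inner join SERV_REQ_SI_VALUE srsi
        on sr.DOCUMENT_NUMBER = srsi.DOCUMENT_NUMBER
inner join SERV_ITEM si
        on si.SERV_ITEM_ID = srsi.SERV_ITEM_ID
       and si.STATUS = '6'
where srsi.VALUE_LABEL = 'unitAddress'
  and (   srsi.VALID_VALUE in ('1682511819', '1682575135', '1682580326' /* ... values 1 to 1000 ... */)
       or srsi.VALID_VALUE in (/* ... values 1001 to 2000 ... */)
       -- one additional OR branch per further chunk of at most 1000 values
      )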

CASE expression in SELECT clause throwing 'SQLCODE=-811, SQLSTATE=21000' error

This query works fine in Oracle, but it is not working in DB2. It throws a
DB2 SQL Error: SQLCODE=-811, SQLSTATE=21000, SQLERRMC=null, DRIVER=3.61.65
error when the subquery under the THEN clause returns 2 rows.
However, my question is why it would execute in the first place, as my WHEN clause always evaluates to false.
SELECT
CASE
WHEN (SELECT COUNT(1)
FROM STOP ST,
FACILITY FAC
WHERE ST.FACILITY_ID = FAC.FACILITY_ID
AND FAC.IS_DOCK_SCHED_FAC=1
AND ST.SHIPMENT_ID = 2779) = 1
THEN
(SELECT ST.FACILITY_ALIAS_ID
FROM STOP ST,
FACILITY FAC
WHERE ST.FACILITY_ID = FAC.FACILITY_ID
AND FAC.IS_DOCK_SCHED_FAC=1
AND ST.SHIPMENT_ID = 2779
)
ELSE NULL
END STAPPFAC
FROM SHIPMENT SHIPMENT
WHERE SHIPMENT.SHIPMENT_ID IN (2779);
The SQL standard does not require short-circuit evaluation (i.e., a guaranteed evaluation order for the parts of a CASE expression). Oracle chooses to specify short-circuit evaluation; DB2, however, seems not to.
Rewriting your query a little for DB2 (8.1+ only, because of FETCH in subqueries) should allow it to run (unsure if you need the added ORDER BY, and I don't have DB2 to test on at the moment):
SELECT
CASE
WHEN (SELECT COUNT(1)
FROM STOP ST,
FACILITY FAC
WHERE ST.FACILITY_ID = FAC.FACILITY_ID
AND FAC.IS_DOCK_SCHED_FAC=1
AND ST.SHIPMENT_ID = 2779) = 1
THEN
(SELECT ST.FACILITY_ALIAS_ID
FROM STOP ST,
FACILITY FAC
WHERE ST.FACILITY_ID = FAC.FACILITY_ID
AND FAC.IS_DOCK_SCHED_FAC=1
AND ST.SHIPMENT_ID = 2779
ORDER BY ST.SHIPMENT_ID
FETCH FIRST 1 ROWS ONLY
)
ELSE NULL
END STAPPFAC
FROM SHIPMENT SHIPMENT
WHERE SHIPMENT.SHIPMENT_ID IN (2779);
Hmm... you're running the same query twice. I get the feeling you're not thinking in sets (how SQL operates), but in a more procedural form (ie, how most common programming languages work). You probably want to rewrite this to take advantage of how RDBMSs are supposed to work:
SELECT Current_Stop.facility_alias_id
FROM SYSIBM/SYSDUMMY1
LEFT JOIN (SELECT MAX(Stop.facility_alias_id) AS facility_alias_id
FROM Stop
JOIN Facility
ON Facility.facility_id = Stop.facility_id
AND Facility.is_dock_sched_fac = 1
WHERE Stop.shipment_id = 2779
HAVING COUNT(*) = 1) Current_Stop
ON 1 = 1
(no sample data, so not tested. There's a couple of other ways to write this based on other needs)
This pattern should work on all RDBMSs (substituting the local one-row dummy table, such as DUAL on Oracle).
So what's going on here, why does this work? (And why did I remove the reference to Shipment?)
First, let's look at your query again:
CASE WHEN (SELECT COUNT(1)
FROM STOP ST, FACILITY FAC
WHERE ST.FACILITY_ID = FAC.FACILITY_ID
AND FAC.IS_DOCK_SCHED_FAC = 1
AND ST.SHIPMENT_ID = 2779) = 1
THEN (SELECT ST.FACILITY_ALIAS_ID
FROM STOP ST, FACILITY FAC
WHERE ST.FACILITY_ID = FAC.FACILITY_ID
AND FAC.IS_DOCK_SCHED_FAC = 1
AND ST.SHIPMENT_ID = 2779)
ELSE NULL END
(First off, stop using the implicit-join syntax - that is, comma-separated FROM clauses - always explicitly qualify your joins. For one thing, it's way too easy to miss a condition you should be joining on)
...from this it's obvious that the subquery is the 'same' in both branches, and it shows what you're attempting - if the dataset has one row, return it, otherwise the result should be null.
Enter the HAVING clause:
HAVING COUNT(*) = 1
This is essentially a WHERE clause for aggregates (functions like MAX(...), or here, COUNT(...)). This is useful when you want to make sure some aspect of the entire set matches a given criteria. Here, we want to make sure there's just one row, so using COUNT(*) = 1 as the condition is appropriate; if there's more (or less! could be 0 rows!) the set will be discarded/ignored.
Of course, using HAVING means we're using an aggregate, the usual rules apply: all columns must either be in a GROUP BY (which is actually an option in this case), or an aggregate function. Because we only want/expect one row, we can cheat a little, and just specify a simple MAX(...) to satisfy the parser.
At this point, the new subquery returns one row (containing one column) if there was only one row in the initial data, and no rows otherwise (this part is important). However, we actually need to return a row regardless.
FROM SYSIBM/SYSDUMMY1
This is a handy dummy table present on all DB2 installations. It has one row, with a single character column (IBMREQD, containing 'Y'). We're actually interested in the fact that it has only one row...
LEFT JOIN (SELECT ... )
ON 1 = 1
A LEFT JOIN takes every row in the preceding set (all joined rows from the preceding tables) and multiplies it by every row in the next table reference, multiplying by 1 when the set on the right (the new reference, our subquery) has no rows. This is different from how a regular (INNER) JOIN works, which multiplies by 0 when there is no matching row. Of course, we only maybe have one row, so there's only going to be at most one result row. We're required to have an ON ... clause, but there's no data to actually correlate between the references, so a simple always-true condition is used.
To get our data, we just need to get the relevant column:
SELECT Current_Stop.facility_alias_id
... if there's the one row of data, it's returned. In the case that there is some other count of rows, the HAVING clause throws out the set, and the LEFT JOIN causes the column to be filled in with a null (no data) value.
So why did I remove the reference to Shipment? First off, you weren't using any data from the table - the only column in the result set was from the subquery. I also have good reason to believe that there would only be one row returned in this case - you're specifying a single shipment_id value (which implies you know it exists). If we don't need anything from the table (including the number of rows in that table), it's usually best to remove it from the statement: doing so can simplify the work the db needs to do.

Use of the HAVING clause when using multiple sums

I was having a problem getting multiple sums from multiple tables. Short story: my answer was solved in the "sql sum data from multiple tables" thread on this site. But where it came up short is that now I'd like to show only sums that are greater than a certain amount. So, while I have sub-selects in my select, I think I need to use a HAVING clause to filter out the summed amounts that are too low.
For example, using the code specified in the link above (more specifically, the answer that the asker accepted as correct), I would only like to see a query result if SUM(AP2.Value) > 1500. Any thoughts?
If you need to filter on the results of ANY aggregate function, you MUST use a HAVING clause. WHERE is applied at the row level as the DB scans the tables for matching rows. HAVING is applied basically immediately before the result set is sent out to the client. At the time WHERE operates, the aggregate function results are not (and cannot be) available, so you have to use a HAVING clause, which is applied after the main query is complete and all aggregate results are available.
So... long story short, yes, you'll need to do
SELECT ...
FROM ...
WHERE ...
HAVING (SUM_AP > 1500)
Note that some databases (MySQL, for example) let you use column aliases in the HAVING clause. In technical terms, HAVING on a query like the one above works basically the same as wrapping the initial query in another query and applying another WHERE clause to the wrapper:
SELECT *
FROM (
SELECT ...
) AS child
WHERE (SUM_AP > 1500)
You could wrap that query as a subselect and then specify your criteria in the WHERE clause:
SELECT
PROJECT,
SUM_AP,
SUM_INV
FROM (
SELECT
AP1.[PROJECT],
(SELECT SUM(AP2.Value) FROM AP AS AP2 WHERE AP2.PROJECT = AP1.PROJECT) AS SUM_AP,
(SELECT SUM(INV2.Value) FROM INV AS INV2 WHERE INV2.PROJECT = AP1.PROJECT) AS SUM_INV
FROM AP AS AP1
INNER JOIN INV AS INV1 ON
AP1.[PROJECT] = INV1.[PROJECT]
WHERE
AP1.[PROJECT] = 'XXXXX'
GROUP BY
AP1.[PROJECT]
) SQ
WHERE
SQ.SUM_AP > 1500