Workaround for a correlated subquery - hive

I need to run the following join without using a correlated subquery, as I am restricted to either using Hive or Presto, both of which fail due to my using a correlated subquery.
I have worked this down to an MWE. I have a table of each user and their 18th birthday. I have another table of each time each user visited a movie theatre. I want to merge in only the last time a user visited my movie theatre. The code that would work on native SQL is below.
What is the most efficient workaround that does not require me to join every instance of the user visiting the movie theatre (that table is far too large)?
SELECT
people.*,
tickets.uid,
tickets.date
FROM all_customers as people
JOIN tkting as tickets
on people.uid = tickets.uid
and tickets.date = (select
lastvisit.date
from tkting as lastvisit
where
lastvisit.uid = people.uid
and lastvisit.date < people.birthday_18
order by lastvisit.date asc
limit 1)

Instead of this inner query:
SELECT lastvisit.date
...
ORDER BY lastvisit.date ASC
LIMIT 1
you can try with:
SELECT min(lastvisit.date)
...
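If the engine rejects the correlated form altogether, the same aggregate can be moved into a derived table and joined back. Below is a minimal sketch of that idea, run against SQLite through Python only so that it is self-contained. The sample rows are made up, and whether Hive or Presto accept the statement unchanged is an assumption that was not tested here. MIN mirrors the ORDER BY ... ASC LIMIT 1 in the question; swap in MAX if the latest visit before the birthday is what you want.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE all_customers (uid INTEGER PRIMARY KEY, birthday_18 TEXT);
CREATE TABLE tkting (uid INTEGER, date TEXT);
INSERT INTO all_customers VALUES (1, '2015-06-01'), (2, '2016-01-15');
INSERT INTO tkting VALUES
  (1, '2014-03-10'), (1, '2015-01-20'), (1, '2015-09-05'),
  (2, '2015-12-31'), (2, '2016-02-01');
""")

# Aggregate once per uid in a derived table, then join that single row
# per user back to the customer table. No correlated subquery is involved.
query = """
SELECT people.uid, people.birthday_18, v.visit_date
FROM all_customers AS people
JOIN (
    SELECT t.uid, MIN(t.date) AS visit_date
    FROM tkting AS t
    JOIN all_customers AS c ON c.uid = t.uid
    WHERE t.date < c.birthday_18
    GROUP BY t.uid
) AS v ON v.uid = people.uid
ORDER BY people.uid
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The derived table still reads tkting once, but it hands only one row per user to the outer join.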

Related

Fastest way to count from a subquery

I have the following query to return a list of current employees and the number of 'corrections' they have. This is working correctly but is very slow.
I was previously not using a subquery, instead opting for a count (from...) as an aggregate subselect, but I have read that a subquery should be much faster. Changing the code to the below did improve performance, but not anywhere near what I was expecting.
SELECT DISTINCT
tblStaff.StaffID, CorrectionsOut.Count AS CorrectionsAssigned
FROM tblStaff
LEFT JOIN tblMeetings ON tblMeetings.StaffID = tblStaff.StaffID
JOIN tblTasks ON tblTasks.TaskID = tblMeetings.TaskID
--Get Corrections Issued
LEFT JOIN(
SELECT
COUNT(DISTINCT tblMeetings.TaskID) AS Count, tblMeetings.StaffID
FROM tblRegister
JOIN tblMeetings ON tblRegister.MeetingID = tblMeetings.MeetingID
WHERE tblRegister.FDescription IS NOT NULL
AND tblRegister.CorrectionOutDate IS NULL
GROUP BY tblMeetings.StaffID
) AS CorrectionsOut ON CorrectionsOut.StaffID = tblStaff.StaffID
WHERE tblStaff.CurrentEmployee = 1
I need a vendor-neutral solution as we are transitioning from SQL Server to Postgres. Note this is a simplified example of the query; the real one has quite a few counts. My current query time without the counts is less than half a second, but with the counts it is approx 20 seconds, if it runs at all without locking or otherwise failing.
I would get rid of the joins that you are not using which probably makes the SELECT DISTINCT unnecessary as well:
SELECT s.StaffID, co.Count AS CorrectionsAssigned
FROM tblStaff s LEFT JOIN
(SELECT COUNT(DISTINCT m.TaskID) AS Count, m.StaffID
FROM tblRegister r JOIN
tblMeetings m
ON r.MeetingID = m.MeetingID
WHERE r.FDescription IS NOT NULL AND
r.CorrectionOutDate IS NULL
GROUP BY m.StaffID
) co
ON co.StaffID = s.StaffID
WHERE s.CurrentEmployee = 1;
Getting rid of the SELECT DISTINCT and the duplicate rows added by the tasks should help performance.
For additional benefit, you would want to be sure you have indexes on the JOIN keys, and perhaps on the filtering criteria.
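Here is a small sketch of that shape (pre-aggregated counts left-joined to the staff table, plus indexes on the join keys), run on SQLite through Python so it can be tried as-is. The sample rows and index names are invented for illustration, and the count column is called Cnt here just to avoid naming it after the COUNT function.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblStaff (StaffID INTEGER PRIMARY KEY, CurrentEmployee INTEGER);
CREATE TABLE tblMeetings (MeetingID INTEGER PRIMARY KEY, StaffID INTEGER, TaskID INTEGER);
CREATE TABLE tblRegister (MeetingID INTEGER, FDescription TEXT, CorrectionOutDate TEXT);
INSERT INTO tblStaff VALUES (1, 1), (2, 1), (3, 0);
INSERT INTO tblMeetings VALUES (10, 1, 100), (11, 1, 100), (12, 1, 101), (13, 2, 102);
INSERT INTO tblRegister VALUES
  (10, 'late', NULL), (11, 'late', NULL), (12, 'typo', NULL), (13, NULL, NULL);
-- Indexes on the join keys (index names are made up).
CREATE INDEX ix_meetings_staff ON tblMeetings (StaffID);
CREATE INDEX ix_register_meeting ON tblRegister (MeetingID);
""")

# One grouped pass over the register, then a LEFT JOIN so that staff
# without open corrections still appear (with NULL as their count).
query = """
SELECT s.StaffID, co.Cnt AS CorrectionsAssigned
FROM tblStaff s
LEFT JOIN (
    SELECT m.StaffID, COUNT(DISTINCT m.TaskID) AS Cnt
    FROM tblRegister r
    JOIN tblMeetings m ON r.MeetingID = m.MeetingID
    WHERE r.FDescription IS NOT NULL
      AND r.CorrectionOutDate IS NULL
    GROUP BY m.StaffID
) co ON co.StaffID = s.StaffID
WHERE s.CurrentEmployee = 1
ORDER BY s.StaffID
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Because each staff member matches at most one row of the grouped subquery, no DISTINCT is needed on the outer SELECT.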

Need to make SQL subquery more efficient

I have a table that contains all the pupils.
I need to look through my registered table and find all students and see what their current status is.
If reg = 'Y' then the student should be included in the search. However, a student may change from Y to N, so I need the most recent reg status, using start_date to determine which record is the latest.
The next step is that if the latest status is N, the student is not passed through. If the latest reg is Y, then search the pupils table using pupilnumber; if that pupil number is in the pupils table, add it to the count.
Select Count(*)
From Pupils Partition(Pupils_01)
Where Pupilnumber in (Select t1.pupilnumber
From registered t1
Where T1.Start_Date = (Select Max(T2.Start_Date)
From registered T2
Where T2.Pupilnumber = T1.Pupilnumber)
And T1.reg = 'Y');
This query works, but it is very slow as there are several records in the pupils table.
Is there any way of making it more efficient?
Worrying about query performance but not indexing your tables is, well, looking for a kind word here... ummm... daft. That's the whole point of indexes. Any variation on the query is going to be much slower than it needs to be.
I'd guess that using analytic functions would be the most efficient approach since it avoids the need to hit the table twice.
SELECT COUNT(*)
FROM( SELECT pupilnumber,
start_date,
reg,
rank() over (partition by pupilnumber order by start_date desc) rnk
FROM registered )
WHERE rnk = 1
AND reg = 'Y'
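To see what the ranking does, here is a rough sketch on SQLite through Python (window functions need SQLite 3.25 or newer). The rows are invented, and it lists the matching pupil numbers instead of counting them so the effect is visible.

```python
import sqlite3

# Hypothetical sample data; column names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE registered (pupilnumber INTEGER, start_date TEXT, reg TEXT);
INSERT INTO registered VALUES
  (1, '2020-01-01', 'Y'), (1, '2021-01-01', 'N'),
  (2, '2020-01-01', 'N'), (2, '2021-06-01', 'Y'),
  (3, '2019-09-01', 'Y');
""")

# Rank each pupil's rows from newest to oldest in a single pass over the
# table, then keep only the newest row and test its status.
query = """
SELECT pupilnumber
FROM (
    SELECT pupilnumber, reg,
           RANK() OVER (PARTITION BY pupilnumber
                        ORDER BY start_date DESC) AS rnk
    FROM registered
)
WHERE rnk = 1 AND reg = 'Y'
ORDER BY pupilnumber
"""
latest_yes = [row[0] for row in conn.execute(query)]
print(latest_yes)
```

Pupil 1 drops out because the newest row is 'N', even though an older row says 'Y'.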
You can look at the execution plan for this query. It will show you the high-cost operations. If you see a table scan in the execution plan, you should index the columns involved. You can also try "exists" instead of "in".
This query MIGHT be more efficient for you and hope at a minimum you have indexes per "pupilnumber" in the respective tables.
To clarify what I am doing, the first inner query is a join between the registered table and the pupil which pre-qualifies that they DO Exist in the pupil table... You can always re-add the "partition" reference if that helps. From that, it is grabbing both the pupil AND their max date so it is not doing a correlated subquery for every student... get all students and their max date first...
THEN, join that result to the registration table... again by the pupil AND the max date being the same and qualify the final registration status as YES. This should give you the count you need.
select
count(*) as RegisteredPupils
from
( select
t2.pupilnumber,
max( t2.Start_Date ) as MostRecentReg
from
registered t2
join Pupils p
on t2.pupilnumber = p.pupilnumber
group by
t2.pupilnumber ) as MaxPerPupil
JOIN registered t1
on MaxPerPupil.pupilNumber = t1.pupilNumber
AND MaxPerPupil.MostRecentReg = t1.Start_Date
AND t1.Reg = 'Y'
Note: If you have multiple records in the registration table, such as a person taking multiple classes registered on the same date, then you COULD get a false count. If that might be the case, you could change from
COUNT(*)
to
COUNT( DISTINCT T1.PupilNumber )
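A quick sketch of the difference between the two counts, on SQLite through Python with made-up rows. Pupil 2 has two registration rows on the same latest date, and pupil 4 is registered but missing from the pupils table.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pupils (pupilnumber INTEGER PRIMARY KEY);
CREATE TABLE registered (pupilnumber INTEGER, start_date TEXT, reg TEXT);
INSERT INTO pupils VALUES (1), (2), (3);
INSERT INTO registered VALUES
  (1, '2020-01-01', 'Y'), (1, '2021-01-01', 'N'),
  (2, '2021-06-01', 'Y'), (2, '2021-06-01', 'Y'),
  (3, '2019-09-01', 'Y'),
  (4, '2022-01-01', 'Y');
""")

# Latest date per pupil (restricted to known pupils), joined back to the
# registration rows on pupil and date, keeping only 'Y' rows.
query = """
SELECT COUNT(*), COUNT(DISTINCT t1.pupilnumber)
FROM (
    SELECT t2.pupilnumber, MAX(t2.start_date) AS most_recent_reg
    FROM registered t2
    JOIN pupils p ON p.pupilnumber = t2.pupilnumber
    GROUP BY t2.pupilnumber
) AS max_per_pupil
JOIN registered t1
  ON t1.pupilnumber = max_per_pupil.pupilnumber
 AND t1.start_date = max_per_pupil.most_recent_reg
 AND t1.reg = 'Y'
"""
row_count, pupil_count = conn.execute(query).fetchone()
print(row_count, pupil_count)
```

With this data COUNT(*) sees pupil 2 twice, while COUNT(DISTINCT ...) counts each pupil once.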

DB2 alias in WHERE clause

I have a couple DB2 tables, one for users and one for newsletters and I want to select using an alias in the WHERE clause.
SELECT a.*, b.tech_id as user FROM users a
JOIN newsletter b ON b.tech_id = a.newsletter_id
WHERE timestamp(user) < current_timestamp
This is radically simplified so I can see what's going on, but I am getting an error that makes me think that the user alias isn't getting passed correctly:
ERROR: An invalid datetime format was detected; that is, an
invalid string representation or value was specified.
The user.tech_id is a string built from the datetime when the record was created, so it looks something like 20150210175040951186000000. I've verified that I can execute a timestamp(tech_id) successfully-- so it can't be the format of the field causing the problem.
Any ideas?
More information:
There's multiple newsletters per user. I need to get the most recent newsletter (by the tech_id) and check if that was created in the past week. So the more complex version would be something like:
SELECT a.*, b.tech_id as user FROM users a
JOIN newsletter b ON b.tech_id = a.newsletter_id
WHERE timestamp(max(user)) < current_timestamp
Is there a way to JOIN only on the most recent record?
The order of execution is different to the order of writing. The FROM & WHERE clauses are executed before the SELECT clause hence the alias does not exist when you are trying to use it.
You would have to "nest" part of the query so that the alias is defined before the where clause. Can be easier in many cases to not use the alias.
try
WHERE timestamp(b.tech_id) < current_timestamp
The generic "order of execution" of SQL clauses is
FROM
JOINs (as part of the from clause)
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
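The "nest part of the query" option could look roughly like the sketch below: the alias is defined in an inner query, so by the time the outer WHERE runs it is an ordinary column. This runs on SQLite through Python, not DB2. The sample rows, the shortened tech_id values, and the fixed cutoff string standing in for current_timestamp are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, newsletter_id TEXT);
CREATE TABLE newsletter (tech_id TEXT PRIMARY KEY);
INSERT INTO newsletter VALUES ('20150210175040'), ('20990101000000');
INSERT INTO users VALUES (1, '20150210175040'), (2, '20990101000000');
""")

# The inner query names the column; the outer query filters on that name.
# The fixed-width digit strings compare correctly as plain text, which
# stands in for the timestamp() conversion used in the question.
query = """
SELECT id, techuser
FROM (
    SELECT a.id, b.tech_id AS techuser
    FROM users a
    JOIN newsletter b ON b.tech_id = a.newsletter_id
) AS named
WHERE techuser < '20200101000000'
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The alias is called techuser here rather than user, which keeps it clear of any built-in meaning the word user may have in a given database.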
Is there a way to JOIN only on the most recent record?
A useful technique for this is using ROW_NUMBER() assuming your DB2 supports it, and would look something like this:
SELECT
a.*
, b.tech_id AS techuser
FROM users a
JOIN (
SELECT
*
, ROW_NUMBER() OVER (ORDER BY timestamp(tech_id) DESC) AS RN
FROM newsletter
) b
ON b.tech_id = a.newsletter_id
AND b.rn = 1
this would give you just one row from newsletter, and using the DESCending order gives you the "most recent" assuming timestamp(tech_id) works as described.
To get the most recent newsletter, consider ordering the join query in descending order and then selecting the top record (in DB2 you would use FETCH FIRST 1 ROW ONLY):
SELECT a.*, b.tech_id as user
FROM users a
INNER JOIN newsletter b ON b.tech_id = a.newsletter_id
ORDER BY b.tech_id DESC
FETCH FIRST 1 ROW ONLY;
Alternatively, you can use a subquery in WHERE clause that aggregates the max user:
SELECT a.*, b.tech_id as user
FROM users a
INNER JOIN newsletter b ON b.tech_id = a.newsletter_id
WHERE b.tech_id IN (
SELECT Max(n.tech_id) as maxUser
FROM users u
INNER JOIN newsletter n ON n.tech_id = u.newsletter_id)
I left out the condition of timestamp(user) < current_timestamp as data stored in database will always be less than current time (i.e., now).

SQL - unknown column in derived table

This is the problematic part of my query:
SELECT
(SELECT id FROM users WHERE name = 'John') as competitor_id,
(SELECT MIN(duration)
FROM
(SELECT duration FROM attempts
WHERE userid=competitor_id ORDER BY created DESC LIMIT 1,1
) x
) as best_time
On execution, it throws this error:
#1054 - Unknown column 'competitor_id' in 'where clause'
It looks like the derived table 'x' can't see the parent's query alias competitor_id. Is there any way how to create some kind of global alias, which will be usable by all derived tables?
I know I can just use the competitor_id query as a subquery directly in a WHERE clause and avoid using alias at all, but my real query is much bigger and I need to use competitor_id in more subqueries and derived tables, so it would be inefficient if I would used the same subquery more times.
You may not need derived tables within the select statement at all. Wouldn't the following accomplish the same thing?
SELECT
users.id as competitor_id,
MIN(duration) as best_time
FROM users
inner join attempts on users.id = attempts.userid
WHERE name = 'John'
group by users.id
The error is caused because an identifier introduced in a select output-clause cannot be referenced from anywhere else in that clause - basically, with SQL, identifiers/columns are pushed out and not down (or across).
But, even if it were possible, it's not good to write a query this way anyway. Use a JOIN between the users and attempts (on user id), then filter based on the name. The SQL query planner will then take the high-level relational algebra and write an efficient plan for it :) Note that there is no need for either a manual ordering or limit here as the aggregate (MIN) over a group handles that.
SELECT u.id, u.name, MIN(a.duration) as duration
FROM users u
-- match up each attempt per user
JOIN attempts a
ON a.userid = u.id
-- only show users with this name
WHERE u.name = 'John'
-- group so we get the min duration *per user*
-- (name is included so it can be in the output clause)
GROUP BY u.id, u.name
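For anyone who wants to try the join-and-aggregate form quickly, here is a minimal sketch on SQLite through Python. The rows are invented; the table and column names are the ones from the question.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE attempts (userid INTEGER, duration REAL, created TEXT);
INSERT INTO users VALUES (1, 'John'), (2, 'Mary');
INSERT INTO attempts VALUES
  (1, 12.5, '2020-01-01'), (1, 9.8, '2020-01-02'), (1, 11.0, '2020-01-03'),
  (2, 7.0, '2020-01-01');
""")

# The user is looked up once by the join; MIN() then works per group,
# so no alias has to be passed down into a derived table.
query = """
SELECT u.id AS competitor_id, MIN(a.duration) AS best_time
FROM users u
JOIN attempts a ON a.userid = u.id
WHERE u.name = 'John'
GROUP BY u.id
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Mary's faster attempt is ignored because the WHERE clause filters on the name before grouping.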
Something about your query seems rather strange. The innermost subquery is selecting one row and then you are taking the min(duration). The min is unnecessary, because there is only one row. You can phrase the query as:
SELECT u.id as competitor_id, a.duration as best_time
from users u left outer join
attempts a
on u.id = a.userid
where u.name = 'John'
order by a.created desc
limit 1, 1;
This seems to be what your query is attempting to do. However, this might not be your intention. It is probably giving the most recent time. (If you are using MySQL, then limit 1, 1 is actually taking the second most recent record). To get the smallest duration (presumably the "best"), you would do:
SELECT u.id as competitor_id, min(a.duration) as best_time
from users u left outer join
attempts a
on u.id = a.userid
where u.name = 'John'
Adding a group by u.id would ensure that this returns exactly one row.
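The three readings (most recent, second most recent, smallest) are easy to mix up, so here is a small sketch that runs all of them side by side. It uses SQLite through Python, which also accepts the MySQL-style LIMIT offset, count form. The rows are made up so that each query returns a different value.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE attempts (userid INTEGER, duration REAL, created TEXT);
INSERT INTO users VALUES (1, 'John');
INSERT INTO attempts VALUES
  (1, 12.5, '2020-01-01'), (1, 9.8, '2020-01-02'),
  (1, 11.0, '2020-01-03'), (1, 10.5, '2020-01-04');
""")

base = """
SELECT a.duration
FROM users u
LEFT JOIN attempts a ON u.id = a.userid
WHERE u.name = 'John'
ORDER BY a.created DESC
"""
# LIMIT 1 keeps the first row of the ordering: the most recent attempt.
most_recent = conn.execute(base + " LIMIT 1").fetchone()[0]
# LIMIT 1, 1 skips one row and returns one: the second most recent.
second_most_recent = conn.execute(base + " LIMIT 1, 1").fetchone()[0]

# MIN() ignores the ordering entirely and returns the smallest duration.
best = conn.execute("""
SELECT MIN(a.duration)
FROM users u
LEFT JOIN attempts a ON u.id = a.userid
WHERE u.name = 'John'
GROUP BY u.id
""").fetchone()[0]
print(most_recent, second_most_recent, best)
```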

SQL GROUP BY/COUNT even if no results

I am attempting to get the information from one table (games) and count the entries in another table (tickets) that correspond to each entry in the first. I want each entry in the first table to be returned even if there aren't any entries in the second. My query is as follows:
SELECT g.*, count(*)
FROM games g, tickets t
WHERE (t.game_number = g.game_number
OR NOT EXISTS (SELECT * FROM tickets t2 WHERE t2.game_number=g.game_number))
GROUP BY t.game_number;
What am I doing wrong?
You need to do a left-join:
SELECT g.Game_Number, g.PutColumnsHere, count(t.Game_Number)
FROM games g
LEFT JOIN tickets t ON g.Game_Number = t.Game_Number
GROUP BY g.Game_Number, g.PutColumnsHere
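Note that the count is taken over a column from tickets, not over *. A short sketch on SQLite through Python shows why that matters; the rows and the title column are invented for illustration.

```python
import sqlite3

# Hypothetical sample data; game 2 has no tickets at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE games (game_number INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE tickets (ticket_id INTEGER PRIMARY KEY, game_number INTEGER);
INSERT INTO games VALUES (1, 'Opener'), (2, 'Derby'), (3, 'Final');
INSERT INTO tickets (game_number) VALUES (1), (1), (3);
""")

# COUNT(t.game_number) skips the NULLs that the LEFT JOIN produces for
# games without tickets; COUNT(*) counts the padded row itself.
query = """
SELECT g.game_number, g.title,
       COUNT(t.game_number) AS tickets_sold,
       COUNT(*) AS joined_rows
FROM games g
LEFT JOIN tickets t ON t.game_number = g.game_number
GROUP BY g.game_number, g.title
ORDER BY g.game_number
"""
rows = conn.execute(query).fetchall()
print(rows)
```

For the game with no tickets, tickets_sold is 0 while joined_rows is 1.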
Alternatively, I think this is a little clearer with a correlated subquery:
SELECT g.Game_Number, G.PutColumnsHere,
(SELECT COUNT(*) FROM Tickets T WHERE t.Game_Number = g.Game_Number) Tickets_Count
FROM Games g
Just make sure you check the query plan to confirm that the optimizer interprets this well.
You need to learn more about how to use joins in SQL:
SELECT g.*, count(t.game_number)
FROM games g
LEFT OUTER JOIN tickets t
USING (game_number)
GROUP BY g.game_number;
Note that unlike some database brands, MySQL permits you to list many columns in the select-list even if you only GROUP BY their primary key. As long as the columns in your select-list are functionally dependent on the GROUP BY column, the result is unambiguous.
Other brands of database (Microsoft, Firebird, etc.) give you an error if you list any columns in the select-list without including them in GROUP BY or in an aggregate function.
"FROM games g, tickets t" is the problem line. This performs an inner join. Any where clause can't add on to this. I think you want a LEFT OUTER JOIN.