SQL to HiveQL conversion - sql

I have this SQL query and I am trying to convert it so that it can be run on HiveQL 2.1.1.
SELECT p.id FROM page p, comments c, users u
WHERE c.commentid= p.id
AND u.id = p.creatorid
AND u.upvotes IN (
SELECT MAX(upvotes)
FROM users u WHERE u.date > p.date
)
AND EXISTS (
SELECT 1 FROM links l WHERE l.relid > p.id
)
This does not work on Hive QL, as it has more than 1 SubQuery (which is not supported)
EXISTS or IN replacements from SQL to Hive SQL are done like this:
WHERE A.aid IN (SELECT bid FROM B...)
can be replaced by:
A LEFT SEMI JOIN B ON aid=bid
But I can't come up with a way to do this with the additional MAX() function.

Use standard join syntax instead of a comma-separated table list:
SELECT p.id
FROM page p INNER JOIN
comments c
ON c.commentid= p.id INNER JOIN
users u
ON u.id = p.creatorid INNER JOIN
links l
ON l.relid > p.id
WHERE u.upvotes IN (SELECT MAX(upvotes)
FROM users u
WHERE u.date > p.date
);

I am not sure what the upvotes logic is supposed to be doing. The links logic is easy to handle. Hive may handle this:
SELECT p.id
FROM page p JOIN
comments c
ON c.commentid = p.id JOIN
users u
ON u.id = p.creatorid CROSS JOIN
(SELECT MAX(l.relid) as max_relid
FROM links l
) l
WHERE l.max_relid > p.id AND
u.upvotes IN (SELECT MAX(upvotes)
FROM users u
WHERE u.date > p.date
);
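The links rewrite above relies on the fact that some link with relid > p.id exists exactly when the largest relid is greater than p.id. Below is a minimal sketch that checks this on made-up rows. It runs in SQLite through Python's sqlite3 module rather than in Hive, and the sample data is purely hypothetical. It also shows that a plain range join on links repeats a page once per matching link, which the single-row MAX subquery avoids.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for Hive here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (id INTEGER);
    CREATE TABLE links (relid INTEGER);
    INSERT INTO page VALUES (1), (5), (9);
    INSERT INTO links VALUES (2), (6), (6);
""")

# Original form: correlated EXISTS.
exists_rows = conn.execute("""
    SELECT p.id
    FROM page p
    WHERE EXISTS (SELECT 1 FROM links l WHERE l.relid > p.id)
    ORDER BY p.id
""").fetchall()

# Rewrite: cross join against a one-row aggregate.
max_rows = conn.execute("""
    SELECT p.id
    FROM page p CROSS JOIN
         (SELECT MAX(relid) AS max_relid FROM links) l
    WHERE l.max_relid > p.id
    ORDER BY p.id
""").fetchall()

# Plain range join: one output row per matching link.
join_rows = conn.execute("""
    SELECT p.id
    FROM page p JOIN links l ON l.relid > p.id
    ORDER BY p.id
""").fetchall()

print(exists_rows, max_rows, join_rows)
```

If links is empty, MAX returns NULL and the comparison filters out every page, which matches what EXISTS would do.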

Related

How to replace exist in Hive with two correlated subqueries

I have a query that looks like this
SELECT u.id, COUNT(*)
FROM users u, posts p
WHERE u.id = p.owneruserid
AND EXISTS (SELECT COUNT(*) as num
FROM postlinks pl
WHERE pl.postid = p.id
GROUP BY pl.id
HAVING num > 1) --correlated subquery 1
AND EXISTS (SELECT *
FROM comments c
WHERE c.postid = p.id) --correlated subquery 2
GROUP BY u.id
I researched and read that IN and EXISTS are not supported in Hive. I read that a workaround for this would be to use a LEFT JOIN. I have tried this, but I am having trouble with the GROUP BY u.id. I read that GROUP BY always needs to be paired with an aggregate function like COUNT(), but I'm not sure how to rewrite this query to get it to work. None of the other examples I have seen online seem to be as complicated as this one.
As you said, you can convert them to left joins, or possibly inner joins, since both subqueries use EXISTS. Simply convert your subqueries to inline views and join them with the original tables.
SELECT u.id, COUNT(*)
FROM users u
inner join posts p on u.id = p.owneruserid
left outer join (SELECT COUNT(*) as num, pl.postid postid
FROM postlinks pl
GROUP BY pl.postid
HAVING num > 1) pl ON pl.postid = p.id --correlated subquery 1 with left join
left outer join (SELECT postid FROM comments c GROUP BY postid)c ON c.postid = p.id --correlated subquery 2 with left join
WHERE ( c.postid is not null AND pl.postid is not null) -- this ensure data exists in both subquery
GROUP BY u.id
With a left join there is a chance of duplicate rows; the GROUP BY in the second subquery avoids that.
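To make the duplicate risk concrete, here is a small sketch on made-up data, run in SQLite through Python's sqlite3 module rather than in Hive. The tables are cut down to the columns that matter and the rows are hypothetical. Joining directly to comments counts a post once per comment, while joining to a derived table grouped by postid gives the same counts as the EXISTS form.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for Hive here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER, owneruserid INTEGER);
    CREATE TABLE comments (postid INTEGER);
    INSERT INTO posts VALUES (10, 1), (11, 1), (12, 2);
    INSERT INTO comments VALUES (10), (10), (10), (12);
""")

# Reference result: correlated EXISTS, one row per qualifying post.
exists_counts = conn.execute("""
    SELECT p.owneruserid, COUNT(*)
    FROM posts p
    WHERE EXISTS (SELECT 1 FROM comments c WHERE c.postid = p.id)
    GROUP BY p.owneruserid
    ORDER BY p.owneruserid
""").fetchall()

# Left join to the raw table: post 10 is counted three times.
raw_counts = conn.execute("""
    SELECT p.owneruserid, COUNT(*)
    FROM posts p
    LEFT OUTER JOIN comments c ON c.postid = p.id
    WHERE c.postid IS NOT NULL
    GROUP BY p.owneruserid
    ORDER BY p.owneruserid
""").fetchall()

# Left join to a de-duplicated derived table: counts match EXISTS.
dedup_counts = conn.execute("""
    SELECT p.owneruserid, COUNT(*)
    FROM posts p
    LEFT OUTER JOIN (SELECT postid FROM comments GROUP BY postid) c
        ON c.postid = p.id
    WHERE c.postid IS NOT NULL
    GROUP BY p.owneruserid
    ORDER BY p.owneruserid
""").fetchall()

print(exists_counts, raw_counts, dedup_counts)
```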

Get null records in SQL

I have the following query:
SELECT c.name as clientName, p.id as projectId, p.name as projectName, p.rate, u.name as userName, sum(w.duration) as workedHours
FROM Project p, User u, Worklog w, Client c
WHERE w.user_id = u.id AND w.project_id = p.id AND p.client_id = c.id
GROUP BY p.id, u.id
that returns the projects, clients, hourly rate and worked hours.
How should it be changed to also return the projects where workedHours is equal to 0?
Because this query returns just the records where workedHours is not 0.
Thank you for your time.
The problem is that those projects have no row in Worklog to join to, and the conditions in your WHERE clause remove any row without an associated worklog.
Solution 1 : Using a LEFT JOIN
Using a left join instead would solve your problem.
SELECT c.name as clientName, p.id as projectId, p.name as projectName, p.rate, u.name as userName, coalesce(sum(w.duration), 0) as workedHours
FROM (Project p, User u, Client c)
LEFT JOIN Worklog w ON w.project_id = p.id AND w.user_id = u.id
WHERE p.client_id = c.id
GROUP BY p.id, u.id
By the way, your query is suspicious in other respects. For example, c.name is in the SELECT clause but not in the GROUP BY clause. I take it that you use MySQL, which is the only RDBMS I'm aware of that allows such queries. You should consider adding the retrieved columns to the GROUP BY clause.
Solution 2 : Using only ANSI JOINs
As underscore_d points out, you may want to avoid old-style joins completely, and preferably use the following query:
SELECT
c.name as clientName,
p.id as projectId,
p.name as projectName,
p.rate,
u.name as userName,
coalesce(sum(w.duration), 0) as workedHours
FROM Project p
CROSS JOIN User u
INNER JOIN Client c ON p.client_id = c.id
LEFT JOIN Worklog w ON w.project_id = p.id AND w.user_id = u.id
GROUP BY c.name, p.id, p.name, p.rate, u.id, u.name
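As a quick check of the LEFT JOIN plus COALESCE idea, here is a minimal sketch run in SQLite through Python's sqlite3 module. The schema is a cut-down, hypothetical version of the one in the question (no Client table, made-up rows). The old-style inner join drops the project with no worklog, while the CROSS JOIN / LEFT JOIN form keeps it with 0 hours.

```python
import sqlite3

# Hypothetical, simplified schema and rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Project (id INTEGER, name TEXT);
    CREATE TABLE User (id INTEGER, name TEXT);
    CREATE TABLE Worklog (project_id INTEGER, user_id INTEGER, duration INTEGER);
    INSERT INTO Project VALUES (1, 'alpha'), (2, 'beta');
    INSERT INTO User VALUES (1, 'ann');
    INSERT INTO Worklog VALUES (1, 1, 3), (1, 1, 4);
""")

# Old-style join: project 'beta' has no worklog rows, so it disappears.
inner_rows = conn.execute("""
    SELECT p.name, u.name, SUM(w.duration)
    FROM Project p, User u, Worklog w
    WHERE w.user_id = u.id AND w.project_id = p.id
    GROUP BY p.id, p.name, u.id, u.name
    ORDER BY p.id
""").fetchall()

# LEFT JOIN keeps every project/user pair; COALESCE turns NULL into 0.
left_rows = conn.execute("""
    SELECT p.name, u.name, COALESCE(SUM(w.duration), 0)
    FROM Project p
    CROSS JOIN User u
    LEFT JOIN Worklog w ON w.project_id = p.id AND w.user_id = u.id
    GROUP BY p.id, p.name, u.id, u.name
    ORDER BY p.id
""").fetchall()

print(inner_rows)
print(left_rows)
```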
Solution 3 - Using a subquery
Another solution is to use a subquery, which would allow you to remove the GROUP BY clause completely and get a more manageable query if you ever need to retrieve more information. I personally don't like long lists of columns in a GROUP BY clause.
SELECT
c.name as clientName,
p.id as projectId,
p.name as projectName,
p.rate,
u.name as userName,
(SELECT SUM(duration) FROM Worklog WHERE project_id = p.id AND user_id = u.id) as workedHours
FROM Project p
CROSS JOIN User u
INNER JOIN Client c ON p.client_id = c.id
You should use standard ANSI joins and use LEFT JOIN on worklog table and ultimately you have to use LEFT JOIN on the user table as follows:
SELECT C.NAME AS CLIENTNAME,
P.ID AS PROJECTID,
P.NAME AS PROJECTNAME,
P.RATE,
U.NAME AS USERNAME,
SUM(W.DURATION) AS WORKEDHOURS
FROM PROJECT P
JOIN CLIENT C
ON P.CLIENT_ID = C.ID
LEFT JOIN WORKLOG W
ON W.PROJECT_ID = P.ID
LEFT JOIN USER U
ON W.USER_ID = U.ID
GROUP BY P.ID,
U.ID;

Re-writing query from in() to joins

Can you assist in re-writing this into joins?
select * from users where users.advised_by in (
select p.id
from advisors p
join advisor_members m on p.id = m.advisor_id
join representatives r on m.user_id=r.user_id
where m.memeber_type='Advisor'
)
This is part of 200+ row query and that in() statement is hard to maintain when there are changes.
You should use a proper ON clause:
select *
from users
inner join
(
select p.id
from advisors p
join advisor_members m on p.id = m.advisor_id
join representatives r on m.user_id=r.user_id
where m.memeber_type='Advisor'
) t on users.advised_by = t.id
/*Option 1 */
SELECT *
FROM users usr
INNER JOIN
(
SELECT p.id AS advisor_id
FROM advisors p
JOIN advisor_members m
ON p.id = m.advisor_id
JOIN representatives r
ON m.user_id=r.user_id
WHERE m.memeber_type='Advisor' ) T2 ON usr.advised_by = T2.advisor_id
/*Option2 -- */
SELECT *
FROM users usr
INNER JOIN advisors p
ON usr.advised_by=p.id
JOIN
(
SELECT *
FROM advisor_members
WHERE memeber_type='Advisor') m
ON p.id = m.advisor_id
JOIN representatives r
ON m.user_id=r.user_id
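One thing to watch when replacing IN with a join: IN returns each users row at most once, while a join returns it once per matching row of the subquery. The sketch below illustrates this in SQLite through Python's sqlite3 module, using a cut-down, hypothetical version of the tables (only users and advisor_members, made-up rows). Adding DISTINCT to the derived table is one way to keep the join equivalent to IN.

```python
import sqlite3

# Hypothetical, simplified schema and rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, advised_by INTEGER);
    CREATE TABLE advisor_members (advisor_id INTEGER, user_id INTEGER, memeber_type TEXT);
    INSERT INTO users VALUES (1, 100), (2, 200), (3, 300);
    INSERT INTO advisor_members VALUES
        (100, 7, 'Advisor'), (100, 8, 'Advisor'), (200, 9, 'Assistant');
""")

# IN: user 1 appears once even though advisor 100 has two members.
in_ids = conn.execute("""
    SELECT id FROM users
    WHERE advised_by IN (SELECT advisor_id FROM advisor_members
                         WHERE memeber_type = 'Advisor')
    ORDER BY id
""").fetchall()

# Join to the raw subquery: user 1 appears once per matching member.
join_ids = conn.execute("""
    SELECT u.id
    FROM users u
    INNER JOIN (SELECT advisor_id FROM advisor_members
                WHERE memeber_type = 'Advisor') t
        ON u.advised_by = t.advisor_id
    ORDER BY u.id
""").fetchall()

# Join to a DISTINCT subquery: same rows as IN.
distinct_ids = conn.execute("""
    SELECT u.id
    FROM users u
    INNER JOIN (SELECT DISTINCT advisor_id FROM advisor_members
                WHERE memeber_type = 'Advisor') t
        ON u.advised_by = t.advisor_id
    ORDER BY u.id
""").fetchall()

print(in_ids, join_ids, distinct_ids)
```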

SQL Get Only Most Recent Record for A User [duplicate]

I have a situation where I need to write a sql query to get all of the most recent responses for a student in a classroom. I basically want to show just their most recent response, not all of their responses. I have the query to get all of the responses and order them, however I can't figure out the part where it only grabs that user's most recent record.
Below is the query I have to this point. You can see from the sample data in the image it is pulling back all responses. What I basically want is either just the most recent for a particular student OR possibly just showing the max attempt for a particular Lesson/Page number combo. I have tried playing around with partition and group bys but I haven't found the right combination yet.
SELECT U.UserName, C.Name AS 'ClassroomName', U.FirstName, U.LastName, L.Name AS 'LessonName', P.PageNumber, R.Attempt, R.Created
FROM Responses R
INNER JOIN ClassroomUsers CU ON CU.UserId = R.UserId
INNER JOIN Classrooms C ON C.Id = CU.ClassroomId
INNER JOIN Questions Q ON Q.Id = R.QuestionId
INNER JOIN Pages P ON P.Id = Q.PageId
INNER JOIN Lessons L ON L.Id = P.LessonId
INNER JOIN AspNetUsers U ON U.Id = CU.UserId
WHERE CU.ClassroomId IN (
SELECT CU.ClassroomId
FROM ClassroomUsers CU
WHERE CU.UserId = @UserId
)
ORDER BY R.Created DESC
My favorite way of doing this is using Row_Number() which will number each row based upon the criteria you set - In your case, you'd partition by U.UserName since you want one row returned for each user and order by R.Created DESC to get the latest one.
That being the case, you'd only want to get back the rows that have RN=1, so you query that out as follows:
WITH cte AS
(
SELECT
ROW_NUMBER() OVER(PARTITION BY U.UserName ORDER BY R.Created DESC) AS RN
,U.UserName, C.Name AS 'ClassroomName', U.FirstName, U.LastName, L.Name AS 'LessonName', P.PageNumber, R.Attempt, R.Created
FROM Responses R
INNER JOIN ClassroomUsers CU ON CU.UserId = R.UserId
INNER JOIN Classrooms C ON C.Id = CU.ClassroomId
INNER JOIN Questions Q ON Q.Id = R.QuestionId
INNER JOIN Pages P ON P.Id = Q.PageId
INNER JOIN Lessons L ON L.Id = P.LessonId
INNER JOIN AspNetUsers U ON U.Id = CU.UserId
WHERE CU.ClassroomId IN (
SELECT CU.ClassroomId
FROM ClassroomUsers CU
WHERE CU.UserId = @UserId
)
)
SELECT * FROM cte WHERE RN = 1
Hope that makes sense / helps!!
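For a feel of how the RN = 1 filter behaves, here is a minimal sketch on a flattened, hypothetical Responses table with made-up rows. It runs in SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later), not SQL Server, so the joins and the parameter from the real query are left out.

```python
import sqlite3

# Hypothetical flattened table; the real query joins several tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Responses (UserName TEXT, Attempt INTEGER, Created TEXT);
    INSERT INTO Responses VALUES
        ('ann', 1, '2020-01-01'),
        ('ann', 2, '2020-01-03'),
        ('bob', 1, '2020-01-02');
""")

# Number each user's rows from newest to oldest, then keep row 1.
latest = conn.execute("""
    WITH cte AS (
        SELECT ROW_NUMBER() OVER (PARTITION BY UserName
                                  ORDER BY Created DESC) AS RN,
               UserName, Attempt, Created
        FROM Responses
    )
    SELECT UserName, Attempt, Created
    FROM cte
    WHERE RN = 1
    ORDER BY UserName
""").fetchall()

print(latest)
```

Partitioning by lesson and page number instead of user name would give the latest attempt per Lesson/Page combination.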
Another option is to use the WITH TIES clause.
Example
SELECT top 1 with ties
U.UserName
, C.Name AS 'ClassroomName'
, U.FirstName
, U.LastName
, L.Name AS 'LessonName'
, P.PageNumber
, R.Attempt
, R.Created
FROM Responses R
INNER JOIN ClassroomUsers CU ON CU.UserId = R.UserId
INNER JOIN Classrooms C ON C.Id = CU.ClassroomId
INNER JOIN Questions Q ON Q.Id = R.QuestionId
INNER JOIN Pages P ON P.Id = Q.PageId
INNER JOIN Lessons L ON L.Id = P.LessonId
INNER JOIN AspNetUsers U ON U.Id = CU.UserId
WHERE CU.ClassroomId IN (
SELECT CU.ClassroomId
FROM ClassroomUsers CU
WHERE CU.UserId = @UserId
)
Order By ROW_NUMBER() OVER(PARTITION BY U.UserName ORDER BY R.Created DESC)

SQL: get attributes from nested query

How can I return d.title or u.name in the outer SELECT clause?
SELECT c.id, c.name
FROM components c
INNER JOIN publications p
ON c.id = p.component_id
AND p.document_id IN
(SELECT d.id FROM documents d WHERE user_id IN
(SELECT u.id FROM users u WHERE u.brand_id IN (39, 41)
)
)
I get this error when I throw d.title in the top line:
missing FROM-clause entry for table "d" LINE 1
The package I'm using needs these values returned on the top line to make any use out of them in the result.
Structure
A User has many Documents, and Publications is the join table between Documents and Components.
Use the query below:
SELECT c.id, c.name, d.title, u.name
FROM components c
INNER JOIN publications p ON c.id = p.component_id
INNER JOIN documents d ON d.id = p.document_id
INNER JOIN users u ON d.user_id = u.id
AND u.brand_id IN (39, 41)
Hope this helps.
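A small sketch of why the join form works: once documents and users are joined rather than hidden inside IN subqueries, their columns are in scope for the outer SELECT. It runs in SQLite through Python's sqlite3 module with made-up rows, and the column aliases (component_id, component_name, user_name) are my own additions so the two name columns stay distinguishable in the result.

```python
import sqlite3

# Hypothetical rows; only the columns used by the query are modelled.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT, brand_id INTEGER);
    CREATE TABLE documents (id INTEGER, title TEXT, user_id INTEGER);
    CREATE TABLE components (id INTEGER, name TEXT);
    CREATE TABLE publications (component_id INTEGER, document_id INTEGER);
    INSERT INTO users VALUES (1, 'ann', 39), (2, 'bob', 40);
    INSERT INTO documents VALUES (10, 'spec', 1), (11, 'memo', 2);
    INSERT INTO components VALUES (100, 'header'), (101, 'footer');
    INSERT INTO publications VALUES (100, 10), (101, 11);
""")

cur = conn.execute("""
    SELECT c.id AS component_id, c.name AS component_name,
           d.title AS title, u.name AS user_name
    FROM components c
    INNER JOIN publications p ON c.id = p.component_id
    INNER JOIN documents d ON d.id = p.document_id
    INNER JOIN users u ON d.user_id = u.id
        AND u.brand_id IN (39, 41)
    ORDER BY c.id
""")
column_names = [col[0] for col in cur.description]
rows = cur.fetchall()

print(column_names)
print(rows)
```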