Performance of Join vs Subquery in Where Clause (HIVE)

Can someone please help me understand which approach would be the most efficient?
The first table, users_of_interest_table, has one column, users, with ~1,000 unique user IDs.
The second table, app_logs_table, has a users column as well as an app_log column. The table has more than 1 billion rows and over 10 million unique users.
What is the most efficient way to get all the app log data for the users in users_of_interest_table? Here is what I have come up with so far.
Option 1: Use Inner Join
SELECT u.users, a.app_logs
FROM users_of_interest_table u
INNER JOIN app_logs_table a
    ON u.users = a.users
Option 2: Subquery in Where Clause
SELECT a.users, a.app_logs
FROM app_logs_table a
WHERE a.users IN (SELECT u.users FROM users_of_interest_table u)

The community generally advises using the JOIN, but in some tests I have done, the IN clause has been more efficient.
You should run the test yourself; in Hive you can compare the two execution plans with EXPLAIN.
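A minimal sketch of that comparison (the MAPJOIN hint is an assumption that your Hive version honors it; with a ~1,000-row driver table, a map-side join is usually the plan you want):

-- Compare this plan against the EXPLAIN output of Option 2.
EXPLAIN
SELECT /*+ MAPJOIN(u) */ u.users, a.app_logs
FROM users_of_interest_table u
JOIN app_logs_table a
    ON u.users = a.users;

Hive often rewrites the IN subquery as a semi join internally, so the two plans may turn out nearly identical.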

Related

SQL subselect statement very slow on certain machines

I've got an SQL statement where I get a list of all IDs from a table (Machines).
I then need the latest instance of another row in (Events) where the IDs match, so I have been doing a subselect.
I need the latest instance of quite a few fields that match the ID, so I have these subselects one after another within this single statement and end up with results similar to this...
This works and the results are spot on; it's just becoming very slow, as the Events table has millions of records. The Machines table would have on average 100 records.
Is there a better solution than subselects? Maybe doing inner joins or a stored procedure?
Help appreciated :)
You can use APPLY. You don't specify how "latest instance" is defined; let me assume it is based on the time column:
SELECT m.id, e.*
FROM Machines m OUTER APPLY
     (SELECT TOP (1) e.Name, e.time, e.weight
      FROM Events e
      WHERE e.id = m.id
      ORDER BY e.time DESC
     ) e;
Both APPLY and a correlated subquery need an ORDER BY to return the row you intend.
APPLY is much like a correlated subquery in the FROM clause, with two convenient enhancements: a lateral join (which is technically what APPLY does) can return multiple rows and multiple columns.
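To make each APPLY probe a seek rather than a scan, an index along these lines can help (a sketch; the name is hypothetical and the column list assumes the Events columns used above):

-- Lets SQL Server jump straight to the newest Events row per machine id.
CREATE INDEX IX_Events_id_time
    ON Events (id, time DESC)
    INCLUDE (Name, weight);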

Tuning Oracle Query for slow select

I'm working on an Oracle query that does a select on a huge table; however, the joins with other tables seem to be very costly in processing time.
I'm looking for tips on how to improve this query.
I'm attaching a version of the query and its explain plan.
Query
SELECT
l.gl_date,
l.REST_OF_TABLES,
(
SELECT
MAX(tt.task_id)
FROM
bbb.jeg_pa_tasks tt
WHERE
l.project_id = tt.project_id
AND l.task_number = tt.task_number
) task_id
FROM
aaa.jeg_labor_history l,
bbb.jeg_pa_projects_all p
WHERE
p.org_id = 2165
AND l.project_id = p.project_id
AND p.project_status_code = '1000'
Something to mention:
This query pulls data from Oracle to send to a SQL Server database, so I need it to be this big; I can't narrow the scope of the query.
The purpose is to set it up as a SQL Server job with SSIS so it runs periodically.
One obvious suggestion is not to use a subquery in the SELECT clause.
Instead, you can try joining the tables:
SELECT
    l.gl_date,
    l.REST_OF_TABLES,
    t.task_id
FROM
    aaa.jeg_labor_history l
JOIN bbb.jeg_pa_projects_all p
    ON (l.project_id = p.project_id)
LEFT JOIN (SELECT
               tt.project_id,
               tt.task_number,
               MAX(tt.task_id) task_id
           FROM
               bbb.jeg_pa_tasks tt
           GROUP BY tt.project_id, tt.task_number) t
    ON (l.project_id = t.project_id
        AND l.task_number = t.task_number)
WHERE
    p.org_id = 2165
    AND p.project_status_code = '1000';
Cheers!!
As I don't know exactly how many rows this query returns or how many rows the table/view has, I can only offer a few simple tips that may help query performance:
Check indexes. There should be indexes on all fields used in the WHERE and JOIN portions of the SQL statement (see the sketch after this list).
Limit the size of your working data set.
Only select columns you need.
Remove unnecessary tables.
Remove calculated columns in JOIN and WHERE clauses.
Use inner join, instead of outer join if possible.
Your view contains a lot of data, so you can also break it down and select only the information you need from it.
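On the index point above, something like the following might suit this particular query (a sketch only; the index names and column choices are assumptions to verify against the actual explain plan):

-- Hypothetical covering indexes for the join and filter columns used above.
CREATE INDEX bbb.jeg_pa_projects_n1 ON bbb.jeg_pa_projects_all (org_id, project_status_code, project_id);
CREATE INDEX bbb.jeg_pa_tasks_n1 ON bbb.jeg_pa_tasks (project_id, task_number, task_id);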

Is there any way to simplify these two subqueries into one?

I have the call
SELECT *,
(SELECT first_name||' '||last_name FROM users WHERE user_id=U.invited_by) AS inviter,
(SELECT first_name FROM users WHERE user_id=U.invited_by) AS inviter_first
FROM users AS U
and that works. But as you can see, the two subqueries are both retrieving data from the same row. Is there any way to combine the two SELECT calls into one and still get the same results?
You have to do a join. Since you are joining back to the same table, use an alias.
SELECT U.*,
i_table.first_name||' '||i_table.last_name AS inviter,
i_table.first_name as inviter_first
FROM users as U
LEFT JOIN users as i_table on i_table.user_id=U.invited_by
Note that this changes your query from performing two correlated subqueries per row (2n lookups, each of which may scan the table, so O(n^2) in the worst case) to performing one joined query.
If you have an index on user_id you should see a dramatic increase in performance.
Even without one, it should still be a lot faster, since a single join pass over the table is closer to O(n).
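If user_id is not already the primary key of users, the index is a one-liner (the index name here is hypothetical):

CREATE INDEX users_user_id_idx ON users (user_id);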

Multiple Joins in Teradata SQL - Faster to Use Subqueries or Temp Tables?

I am writing SQL for Teradata. I need to use joins to connect data from multiple tables. Is it typically faster to use subqueries or create temporary tables and append columns one join at a time? I'm trying to test it myself but network traffic makes it hard for me to tell which is faster.
Example A:
SELECT a.ID, a.Date, b.Gender, c.Age
FROM mainTable AS a
LEFT JOIN (subquery 1) AS b ON b.ID = a.ID
LEFT JOIN (subquery 2) AS c ON c.ID = a.ID
Or I could...
Example B:
CREATE TABLE a AS (
    SELECT mainTable.ID, mainTable.Date, sq.Gender
    FROM mainTable
    LEFT JOIN (subquery 1) AS sq ON sq.ID = mainTable.ID
) WITH DATA;
CREATE TABLE b AS (
    SELECT a.ID, a.Date, a.Gender, sq.Age
    FROM a
    LEFT JOIN (subquery 2) AS sq ON sq.ID = a.ID
) WITH DATA;
Assuming I clean everything up afterward, is one approach preferable to another? Again, I would like to just test this myself but the network traffic is kind of messing me up.
EDIT: The main table has anywhere from 100k to 5 million rows. The subqueries return a 1:1 relationship to the main table's IDs, but require WHERE clauses to filter dates. The subquery SQL isn't trivial, I guess is what I'm trying to convey.
Of course it's recommended to write joins; that's why there's an optimizer :-)
If you create temporary tables you force a specific order of processing instead of letting the optimizer decide on the best plan.
Creating temporary tables might be useful in some rare cases: when you have a really complex query with dozens of joins and need to break it into more easily maintainable parts, or when you want a specific PI (Primary Index) for further processing.
Regarding testing different approaches:
Runtime should never be used for that; it can vary greatly based on the load on the server. You need to access Teradata's Query Log (DBQL: dbc.QryLogV, etc.) to get details about actual CPU/IO/spool usage. If you don't have access to it, ask your DBA to grant it to you.
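A sketch of that DBQL lookup (assuming you have SELECT rights on dbc.QryLogV; these are the standard DBQL columns):

SELECT StartTime, AMPCPUTime, TotalIOCount, SpoolUsage, QueryText
FROM dbc.QryLogV
WHERE UserName = USER
ORDER BY StartTime DESC;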
By the way, instead of real tables you should create VOLATILE TABLEs, which are automatically dropped when you log off.
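For example, the first step of Example B as a volatile table (a sketch; the PRIMARY INDEX choice is an assumption based on the join column):

CREATE VOLATILE TABLE a AS (
    SELECT mainTable.ID, mainTable.Date, sq.Gender
    FROM mainTable
    LEFT JOIN (subquery 1) AS sq ON sq.ID = mainTable.ID
) WITH DATA
PRIMARY INDEX (ID)
ON COMMIT PRESERVE ROWS;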

SQL COUNT(col) vs extra logging column... efficiency?

I can't seem to find much information about this.
I have a table to log users comments. I have another table to log likes / dislikes from other users for each comment.
Therefore, when selecting this data to be displayed on a web page, there is a complex query requiring joins and subqueries to count all likes / dislikes.
My example is a query someone kindly helped me with on here to achieve the required results:
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
(SELECT COUNT(*) FROM comment_likers WHERE comment_id=comments.comment_id AND liker=1)likes,
(SELECT COUNT(*) FROM comment_likers WHERE comment_id=comments.comment_id AND liker=0)dislikes,
comment_likers.liker
FROM comments
INNER JOIN usrs ON ( comments.usr_id = usrs.usr_id )
LEFT JOIN comment_likers ON ( comments.comment_id = comment_likers.comment_id
AND comment_likers.usr_id = $usrID )
WHERE comments.topic_id=$tpcID
ORDER BY comments.created DESC;
However, if I added likes and dislikes columns to the COMMENTS table and created a trigger to automatically increment / decrement them as rows are inserted / deleted / updated in the LIKER table, then the SELECT statement would be simpler and more efficient than it is now. So I am asking: is it more efficient to have this complex query with the COUNTs, or to have the extra columns and triggers?
And to generalise: is it more efficient to COUNT, or to keep an extra counter column, when querying on a regular basis?
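For illustration, the insert side of the trigger I have in mind would be something like this (assuming MySQL, with likes / dislikes as hypothetical INT columns added to COMMENTS):

-- Keep the counter columns in step with each new like / dislike row.
CREATE TRIGGER comment_likers_ai AFTER INSERT ON comment_likers
FOR EACH ROW
UPDATE comments
SET likes = likes + IF(NEW.liker = 1, 1, 0),
    dislikes = dislikes + IF(NEW.liker = 0, 1, 0)
WHERE comment_id = NEW.comment_id;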
Your query is inefficient as written; you can eliminate those correlated subqueries, which should dramatically increase performance.
The two subqueries can be replaced by plain aggregates:
SUM(al.liker) AS likes,
SUM(ABS(al.liker - 1)) AS dislikes,
One caveat: your existing LEFT JOIN on comment_likers is filtered to $usrID, so aggregating over it alone would only count the current user's row. Join comment_likers a second time without that filter for the counts, and add a GROUP BY. Making the whole query this:
SELECT comments.comment_id, comments.descr, comments.created, usrs.usr_name,
       COALESCE(SUM(al.liker), 0) AS likes,
       COALESCE(SUM(ABS(al.liker - 1)), 0) AS dislikes,
       MAX(ml.liker) AS liker
FROM comments
INNER JOIN usrs ON comments.usr_id = usrs.usr_id
LEFT JOIN comment_likers al ON comments.comment_id = al.comment_id
LEFT JOIN comment_likers ml ON comments.comment_id = ml.comment_id
                           AND ml.usr_id = $usrID
WHERE comments.topic_id = $tpcID
GROUP BY comments.comment_id, comments.descr, comments.created, usrs.usr_name
ORDER BY comments.created DESC;