Joining a small table to a large table on LIKE, using Impala

The Impala query below runs very slowly.
SELECT pattern, MAX(time)
FROM (SELECT t.time, p.pattern
FROM t
JOIN p
ON (t.name LIKE p.pattern)) AS tmp
GROUP BY pattern
p is large, with 1 billion records, and t is small, with only 1 record.
How can I optimize this? This is effectively a nested loop join, but why did it take about half an hour to complete?
What's more, when I use the following query
SELECT time
FROM p
WHERE name LIKE 'one_pattern'
ORDER BY time DESC LIMIT 1
It only takes 3s. I am really confused.

Related

SQL Query - Joining and Aggregating

I need to run a query every hour against a table that joins and aggregates data from another table with millions of rows.
select f.master_con,
s.containers
from
(
select master_con
from shipped
where start_time >= a and start_time <= a+1
) f,
(
select master_con,
count(distinct container) as containers
from picked
) s
where f.master_con = s.master_con
The query above sorta works; the exact syntax may not be correct because I wrote it from memory.
In the subquery 's' I only want to count container for each master_con in the 'f' query, and I think my query runs for a long time because I'm counting container for all master_con but then joining only to master_con from 'f'.
Is there a better, more efficient way to write this type of query?
(In the end, I'll sum(containers) from this query above to get total containers shipped during that hour)

Most likely, there is. Can you provide some simplified sample table structures? Additionally, the join method being used has been moving towards deprecation for some time. You should declare your joins explicitly. The query below should be an improvement. A left outer join is used so that you get all of the shipped records that meet your criteria and keep them even if they aren't in the picked table. Change that to an inner join if you want them gone.
SELECT shipped.master_con,
COUNT(DISTINCT picked.container) AS containers
FROM shipped LEFT OUTER JOIN
picked ON picked.master_con = shipped.master_con
WHERE shipped.start_time BETWEEN a AND a+1
GROUP BY shipped.master_con
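
As a rough check of the join-then-aggregate shape above, here is a minimal sketch that runs it against an in-memory SQLite database from Python. SQLite is only a stand-in (the question does not say which database is used), and the rows, the integer start_time values, and the helper name containers_per_master_con are all made up for illustration.

```python
import sqlite3

# Hypothetical sample data; column types and values are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shipped (master_con TEXT, start_time INTEGER);
    CREATE TABLE picked  (master_con TEXT, container TEXT);
    INSERT INTO shipped VALUES ('M1', 10), ('M2', 10), ('M3', 99);
    INSERT INTO picked  VALUES ('M1', 'c1'), ('M1', 'c2'), ('M1', 'c2'),
                               ('M3', 'c9');
""")

def containers_per_master_con(conn, a):
    """Count distinct containers only for master_con values shipped in [a, a+1]."""
    sql = """
        SELECT shipped.master_con,
               COUNT(DISTINCT picked.container) AS containers
        FROM shipped
        LEFT OUTER JOIN picked
               ON picked.master_con = shipped.master_con
        WHERE shipped.start_time BETWEEN ? AND ? + 1
        GROUP BY shipped.master_con
        ORDER BY shipped.master_con
    """
    return conn.execute(sql, (a, a)).fetchall()

rows = containers_per_master_con(conn, 10)
print(rows)  # M2 has no picked rows, so the left join keeps it with a count of 0
```

With an inner join instead, M2 would drop out of the result entirely, which is the trade-off the answer describes.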

Speeding up a query with INNER JOIN

I have a query that takes a long time to execute. I've waited for about 10 mins and it's still not finished executing.
The query looks something like this:
SELECT
one.ID,
two.NAME,
two.STATUS,
four.KEY,
four.VALUE,
count(one.ID) as num
FROM TABLE_ONE one, TABLE_TWO two, TABLE_THREE three, TABLE_FOUR four
WHERE one.STATE='RED'
AND (two.STATUS='ON' OR two.STATUS='OFF')
AND (
four.KEY='FINAL'
OR four.KEY='LIMIT'
OR (
four.KEY='MODE'
AND (
four.VALUE='T'
OR four.VALUE='R')))
GROUP BY one.ID, two.NAME, two.STATUS, four.KEY, four.VALUE
ORDER BY group_name ASC;
I have another query which is equivalent but executes very fast (about 1 second to execute).
Here is that query:
SELECT
one.ID,
two.NAME,
two.STATUS,
four.KEY,
four.VALUE,
count(one.ID) as num
FROM TABLE_ONE one
INNER JOIN TABLE_TWO two
ON one.ID=two.ID
INNER JOIN TABLE_THREE three
ON two.ID=three.GROUP_ID
INNER JOIN TABLE_FOUR four
ON three.ID=four.ID
WHERE one.STATE='RED'
AND (two.STATUS='ON' OR two.STATUS='OFF')
AND (
four.KEY='FINAL'
OR four.KEY='LIMIT'
OR (
four.KEY='MODE'
AND (
four.VALUE='T'
OR four.VALUE='R')))
GROUP BY one.ID, two.NAME, two.STATUS, four.KEY, four.VALUE
ORDER BY group_name ASC;
I'm kind of confused why the query with INNER JOIN executes really fast (about 1 second) and the one without takes a long time (I waited about 10 minutes and it still had not finished executing).
Is there anything I can do to the query without the INNER JOIN to speed up the execution time?
I am using ORACLE.

In the first query, the tables are not joined on any columns. The result is called a cross join. A cross join between two tables returns a number of rows equal to the number of rows in the first table times the number of rows in the second table.
An inner join matches rows based on the given set of columns.

Your long running query has no join conditions to relate one table to the other. Therefore it is creating a cartesian product of all the records in each table. So if each table has 10 rows, it would generate 10*10*10*10=10,000 result rows before performing the aggregate functions. Larger tables just get worse. If each table had 1,000 rows you'd end up generating 1,000,000,000,000 rows.
Your faster query has join criteria which significantly reduces the number of rows in the result set, which is why it is more performant.
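
The row counts described above can be reproduced on a small scale. This is a minimal sketch using Python's sqlite3 module with four hypothetical 10-row tables (t1 to t4, not the tables from the question); SQLite stands in for Oracle here, but the row arithmetic is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
n = 10  # rows per table (illustrative)

# Four hypothetical tables, each holding ids 0..n-1.
for name in ("t1", "t2", "t3", "t4"):
    conn.execute(f"CREATE TABLE {name} (id INTEGER)")
    conn.executemany(f"INSERT INTO {name} VALUES (?)", [(i,) for i in range(n)])

# Comma-separated tables with no join condition: a cartesian product.
cross_rows = conn.execute(
    "SELECT COUNT(*) FROM t1, t2, t3, t4"
).fetchone()[0]

# The same tables related by explicit join conditions.
joined_rows = conn.execute("""
    SELECT COUNT(*)
    FROM t1
    INNER JOIN t2 ON t1.id = t2.id
    INNER JOIN t3 ON t2.id = t3.id
    INNER JOIN t4 ON t3.id = t4.id
""").fetchone()[0]

print(cross_rows, joined_rows)  # n**4 rows versus n rows
```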

Let's say you have N values for ID. In the first query you will create N * N * N * N (or N ^ 4) rows.
In the second you will create N rows.
In big O notation:
O(N^4)
vs
O(N)
Now you have a real world example of the impact.

SELECT FROM inner query slowdown

We have two very similar queries, one takes 22 seconds the other takes 6 seconds. Both use an inner select, have the exact same outer columns and outer joins. The only difference is the inner select that the outer query is using to join in on.
The inner query when run alone executes in 100ms or less in both cases and returns the EXACT SAME data.
Both queries as a whole have a lot of room for improvement, but this particular oddity is really puzzling to us and we just want to understand why. To me it would seem the inner query should be executed once in 100ms then the outer stuff happens. I have a feeling the inner select may be executed multiple times.
Query that takes 6 seconds:
SELECT {whole bunch of column names}
FROM (
SELECT projectItems.* FROM projectItems
WHERE projectItems.isActive = 1
ORDER BY projectItemsID ASC
OFFSET 0 ROWS FETCH NEXT 1 ROWS ONLY
) projectItems
LEFT JOIN categories
ON projectItems.fk_category = categories.categoryID
...{more joins}
Query that takes 22 seconds:
SELECT {whole bunch of column names}
FROM (
SELECT projectItems.* FROM projectItems
WHERE projectItems.isActive = 1
AND projectItemsID = 6539
) projectItems
LEFT JOIN categories
ON projectItems.fk_category = categories.categoryID
...{more joins}

For every row in your projectItems table, the second query checks two columns instead of one. If projectItemsID isn't the primary key or isn't indexed, it takes longer to check the extra column.
If you look at the sizes of the tables and the number of rows each query returns, you can calculate how many comparisons need to be made for each of the queries.

I believe that you're right that the inner query is being run for every single row that is being left joined with categories.
I can't find a proper source on it right now, but you can easily test this by doing something like this and comparing the run times. Here, we can at least be sure that the inner query is only running one time. (sorry if any syntax is incorrect, but you'll get the general idea):
-- Table variables are declared with @ (a # prefix denotes a temp table)
DECLARE @innerQuery TABLE ( [all inner query columns here] )
INSERT INTO @innerQuery
SELECT projectItems.* FROM projectItems
WHERE projectItems.isActive = 1
AND projectItemsID = 6539
SELECT {whole bunch of field names}
FROM @innerQuery as IQ
LEFT JOIN categories
ON IQ.fk_category = categories.categoryID
...{more joins}

SQL Performance: SELECT DISTINCT versus GROUP BY

I have been trying to improve query times for an existing Oracle database-driven application that has been running a little sluggish. The application executes several large queries, such as the one below, which can take over an hour to run. Replacing the DISTINCT with a GROUP BY clause in the query below shrank execution time from 100 minutes to 10 seconds. My understanding was that SELECT DISTINCT and GROUP BY operated in pretty much the same way. Why such a huge disparity between execution times? What is the difference in how the query is executed on the back-end? Is there ever a situation where SELECT DISTINCT runs faster?
Note: In the following query, WHERE TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A' represents just one of a number of ways that results can be filtered. This example was provided to show the reasoning for joining all of the tables that do not have columns included in the SELECT and would result in about a tenth of all available data
SQL using DISTINCT:
SELECT DISTINCT
ITEMS.ITEM_ID,
ITEMS.ITEM_CODE,
ITEMS.ITEMTYPE,
ITEM_TRANSACTIONS.STATUS,
(SELECT COUNT(PKID)
FROM ITEM_PARENTS
WHERE PARENT_ITEM_ID = ITEMS.ITEM_ID
) AS CHILD_COUNT
FROM
ITEMS
INNER JOIN ITEM_TRANSACTIONS
ON ITEMS.ITEM_ID = ITEM_TRANSACTIONS.ITEM_ID
AND ITEM_TRANSACTIONS.FLAG = 1
LEFT OUTER JOIN ITEM_METADATA
ON ITEMS.ITEM_ID = ITEM_METADATA.ITEM_ID
LEFT OUTER JOIN JOB_INVENTORY
ON ITEMS.ITEM_ID = JOB_INVENTORY.ITEM_ID
LEFT OUTER JOIN JOB_TASK_INVENTORY
ON JOB_INVENTORY.JOB_ITEM_ID = JOB_TASK_INVENTORY.JOB_ITEM_ID
LEFT OUTER JOIN JOB_TASKS
ON JOB_TASK_INVENTORY.TASKID = JOB_TASKS.TASKID
LEFT OUTER JOIN JOBS
ON JOB_TASKS.JOB_ID = JOBS.JOB_ID
LEFT OUTER JOIN TASK_INVENTORY_STEP
ON JOB_INVENTORY.JOB_ITEM_ID = TASK_INVENTORY_STEP.JOB_ITEM_ID
LEFT OUTER JOIN TASK_STEP_INFORMATION
ON TASK_INVENTORY_STEP.JOB_ITEM_ID = TASK_STEP_INFORMATION.JOB_ITEM_ID
WHERE
TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A'
ORDER BY
ITEMS.ITEM_CODE
SQL using GROUP BY:
SELECT
ITEMS.ITEM_ID,
ITEMS.ITEM_CODE,
ITEMS.ITEMTYPE,
ITEM_TRANSACTIONS.STATUS,
(SELECT COUNT(PKID)
FROM ITEM_PARENTS
WHERE PARENT_ITEM_ID = ITEMS.ITEM_ID
) AS CHILD_COUNT
FROM
ITEMS
INNER JOIN ITEM_TRANSACTIONS
ON ITEMS.ITEM_ID = ITEM_TRANSACTIONS.ITEM_ID
AND ITEM_TRANSACTIONS.FLAG = 1
LEFT OUTER JOIN ITEM_METADATA
ON ITEMS.ITEM_ID = ITEM_METADATA.ITEM_ID
LEFT OUTER JOIN JOB_INVENTORY
ON ITEMS.ITEM_ID = JOB_INVENTORY.ITEM_ID
LEFT OUTER JOIN JOB_TASK_INVENTORY
ON JOB_INVENTORY.JOB_ITEM_ID = JOB_TASK_INVENTORY.JOB_ITEM_ID
LEFT OUTER JOIN JOB_TASKS
ON JOB_TASK_INVENTORY.TASKID = JOB_TASKS.TASKID
LEFT OUTER JOIN JOBS
ON JOB_TASKS.JOB_ID = JOBS.JOB_ID
LEFT OUTER JOIN TASK_INVENTORY_STEP
ON JOB_INVENTORY.JOB_ITEM_ID = TASK_INVENTORY_STEP.JOB_ITEM_ID
LEFT OUTER JOIN TASK_STEP_INFORMATION
ON TASK_INVENTORY_STEP.JOB_ITEM_ID = TASK_STEP_INFORMATION.JOB_ITEM_ID
WHERE
TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A'
GROUP BY
ITEMS.ITEM_ID,
ITEMS.ITEM_CODE,
ITEMS.ITEMTYPE,
ITEM_TRANSACTIONS.STATUS
ORDER BY
ITEMS.ITEM_CODE
Here is the Oracle query plan for the query using DISTINCT:
Here is the Oracle query plan for the query using GROUP BY:

The performance difference is probably due to the execution of the subquery in the SELECT clause. I am guessing that it is re-executing this query for every row before the distinct. For the group by, it would execute once after the group by.
Try replacing it with a join, instead:
select . . .,
parentcnt
from . . . left outer join
(SELECT PARENT_ITEM_ID, COUNT(PKID) as parentcnt
FROM ITEM_PARENTS
GROUP BY PARENT_ITEM_ID
) p
on items.item_id = p.parent_item_id
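
To illustrate the join-based rewrite, here is a minimal sketch comparing the correlated subquery with a pre-aggregated derived table. It runs on SQLite through Python rather than Oracle, and the three-row tables are invented for the example. Two details are worth noting: the derived table needs a GROUP BY on PARENT_ITEM_ID, and an item with no children comes back from the left join as NULL, so COALESCE is used to match the 0 that COUNT returns in the subquery version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, cut-down versions of ITEMS and ITEM_PARENTS.
conn.executescript("""
    CREATE TABLE items (item_id INTEGER, item_code TEXT);
    CREATE TABLE item_parents (pkid INTEGER, parent_item_id INTEGER);
    INSERT INTO items VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO item_parents VALUES (10, 1), (11, 1), (12, 2);
""")

# Original shape: scalar subquery evaluated per output row.
correlated = conn.execute("""
    SELECT items.item_id,
           (SELECT COUNT(pkid) FROM item_parents
            WHERE parent_item_id = items.item_id) AS child_count
    FROM items
    ORDER BY items.item_id
""").fetchall()

# Rewrite: aggregate ITEM_PARENTS once, then join to it.
joined = conn.execute("""
    SELECT items.item_id,
           COALESCE(p.parentcnt, 0) AS child_count
    FROM items
    LEFT OUTER JOIN (SELECT parent_item_id, COUNT(pkid) AS parentcnt
                     FROM item_parents
                     GROUP BY parent_item_id) p
           ON items.item_id = p.parent_item_id
    ORDER BY items.item_id
""").fetchall()

print(correlated)
print(joined)
```

Whether the join form is actually faster depends on the optimizer and the data; the sketch only shows that the two forms return the same counts.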

I'm fairly sure that GROUP BY and DISTINCT have roughly the same execution plan.
Since we don't have the explain plans we have to guess, but IMO the difference is that the inline subquery gets executed AFTER the GROUP BY but BEFORE the DISTINCT.
So if your query returns 1M rows and gets aggregated to 1k rows:
The GROUP BY query would have run the subquery 1000 times,
Whereas the DISTINCT query would have run the subquery 1000000 times.
The tkprof explain plan would help demonstrate this hypothesis.
While we're discussing this, I think it's important to note that the way the query is written is misleading both to the reader and to the optimizer: you obviously want to find all rows from item/item_transactions that have a TASK_INVENTORY_STEP.STEP_TYPE with a value of "TYPE A".
IMO your query would have a better plan and would be more easily readable if written like this:
SELECT ITEMS.ITEM_ID,
ITEMS.ITEM_CODE,
ITEMS.ITEMTYPE,
ITEM_TRANSACTIONS.STATUS,
(SELECT COUNT(PKID)
FROM ITEM_PARENTS
WHERE PARENT_ITEM_ID = ITEMS.ITEM_ID) AS CHILD_COUNT
FROM ITEMS
JOIN ITEM_TRANSACTIONS
ON ITEMS.ITEM_ID = ITEM_TRANSACTIONS.ITEM_ID
AND ITEM_TRANSACTIONS.FLAG = 1
WHERE EXISTS (SELECT NULL
FROM JOB_INVENTORY
JOIN TASK_INVENTORY_STEP
ON JOB_INVENTORY.JOB_ITEM_ID=TASK_INVENTORY_STEP.JOB_ITEM_ID
WHERE TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A'
AND ITEMS.ITEM_ID = JOB_INVENTORY.ITEM_ID)
In many cases, a DISTINCT can be a sign that the query is not written properly (because a good query shouldn't return duplicates).
Note also that 4 tables are not used in your original select.

The first thing that should be noted is the use of Distinct indicates a code smell, aka anti-pattern. It generally means that there is a missing join or an extra join that is generating duplicate data. Looking at your query above, I am guessing that the reason why group by is faster (without seeing the query), is that the location of the group by reduces the number of records that end up being returned. Whereas distinct is blowing out the result set and doing row by row comparisons.
Update to approach
Sorry, I should have been more clear. Records are generated when
users perform certain tasks in the system, so there is no schedule. A
user could generate a single record in a day or hundreds per-hour. The
important thing is that each time a user runs a search, up-to-date
records must be returned, which makes me doubtful that a materialized
view would work here, especially if the query populating it would take
long to run.
I do believe this is the exact reason to use a materialized view. The process would work this way: the long-running query becomes the piece that builds your materialized view, since we know the user only cares about "new" data after they perform some arbitrary task in the system. You then query against this base materialized view, which can be refreshed constantly on the back-end; the persistence strategy involved should not choke the materialized view (persisting a few hundred records at a time won't crush anything). This allows Oracle to grab a read lock (note that we don't care how many sources read our data, we only care about writers). In the worst case a user will see "stale" data for microseconds, so unless this is a financial trading system on Wall Street or a system for a nuclear reactor, these "blips" should go unnoticed by even the most eagle-eyed users.
Code example of how to do this:
create materialized view dept_mv FOR UPDATE as select * from dept;
Now the key to this is that as long as you don't invoke a refresh, you won't lose any of the persisted data. It will be up to you to determine when you want to "baseline" your materialized view again (midnight, perhaps?).

You should use GROUP BY to apply aggregate operators to each group and DISTINCT if you only need to remove duplicates.
I think the performance is the same.
In your case I think you should use GROUP BY.

Limit the number of rows being processed in this query

I cannot post the actual query here, so I am posting the basic outline of the query, which should suffice. The query is used to page and return a set of users ranked according to the output of a function, say F. F takes parameters from the User table and other tables which are joined. The query is something like the following:
Select TOP (20)
from (select row_number OVER (Order By F desc) as rownum,
user.*, ..
from user
inner join X on user.blah = X.blah
left outer join Y on user.foo = Y.foo
where DATEDIFF(dd, LastLogin, GetDate()) > 200 and Y.bar > FUBAR) as temp
where rownum > 0
According to the execution plan 91% of the cost is in the Sort. Since the sort is based on F, I cannot add an index to speed the sort. The inner query queries all the records, filters then sorts. Now most of the time the users just look at results in the 1 - 5 pages (1 page has 20 records hence the Top(20)) so I was thinking if there was any way I could limit the rows being processed and sorted and make the query faster and less CPU intensive most of the time.
EDIT: When I say to Calculate F tables are joined, what I mean is this. F takes in parameters such as X.blah and Y.foo and Y.bar. That's it. All these parameters also need to be returned as part of the resultset. e.g. The Latitude and Longitude of the User's Last location is stored in X.

At least you could try not to call DATEDIFF on every row:
-- Compute the cutoff once; local variables are declared with @
declare @target_date datetime
set @target_date = DATEADD(dd, -200, GetDate())
Select TOP (20)
from (select row_number() OVER (Order By F desc) as rownum,
user.*, ..
from user
inner join X on user.blah = X.blah
left outer join Y on user.foo = Y.foo
where LastLogin < @target_date and Y.bar > FUBAR) as temp
where rownum > 0
Perhaps do the same thing with FUBAR and F?
The example above won't gain you much performance on its own, but it gives a general idea of how to reduce function calls.

Not sure if and how much it'll help, but two things:
Can you make sure all the foreign key columns and columns in the WHERE clause (user.blah, X.blah, user.foo, Y.foo, Y.bar) are indeed indexed? This will significantly help JOIN performance.
If those columns are not indexed, there also might be a sort operation in the execution plan that SQL Server uses so it can then use a Merge Join for the data. So your sort might not even come from the OVER (ORDER BY F DESC) that you think causes the sort.
You're combining TOP (20) with row numbers, but you're not defining any real ORDER BY for the complete result set, so your results will be random at best. Also, if you already define the rownum, couldn't you just use:
SELECT (columns)
FROM (.......) as temp
WHERE rownum BETWEEN 0 AND 20

Some thoughts:
What kind of function is F? Can it be rewritten as an inline table-valued function? That would give the optimizer an opportunity to expand the function into a reusable execution plan.
You're doing a LEFT OUTER JOIN on Y, but then include a column from Y in your WHERE clause, effectively rendering it as an INNER JOIN. Although the optimizer probably renders the execution plan in the same way, I would clean that up so that it's easier to troubleshoot in the future.
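
The point about the LEFT OUTER JOIN can be seen in a minimal sketch like the one below. It uses Python's sqlite3 module as a stand-in for SQL Server, a hypothetical users table (renamed from user to avoid a reserved word), made-up rows, and a literal 10 in place of FUBAR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: user 3 has no matching row in y.
conn.executescript("""
    CREATE TABLE users (id INTEGER, foo INTEGER);
    CREATE TABLE y (foo INTEGER, bar INTEGER);
    INSERT INTO users VALUES (1, 100), (2, 200), (3, 300);
    INSERT INTO y VALUES (100, 5), (200, 50);
""")

# LEFT JOIN, but the WHERE clause tests a column from the outer-joined table.
left_with_where = conn.execute("""
    SELECT users.id FROM users
    LEFT OUTER JOIN y ON users.foo = y.foo
    WHERE y.bar > 10
    ORDER BY users.id
""").fetchall()

# Plain INNER JOIN with the same filter.
inner = conn.execute("""
    SELECT users.id FROM users
    INNER JOIN y ON users.foo = y.foo
    WHERE y.bar > 10
    ORDER BY users.id
""").fetchall()

# Filter moved into the ON clause: unmatched users are kept with NULL bar.
filter_in_on = conn.execute("""
    SELECT users.id, y.bar FROM users
    LEFT OUTER JOIN y ON users.foo = y.foo AND y.bar > 10
    ORDER BY users.id
""").fetchall()

print(left_with_where, inner, filter_in_on)
```

Rows that the left join pads with NULL fail the y.bar comparison, so the first two queries return the same rows. If unmatched users are actually wanted, the filter belongs in the ON clause, as in the third query.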
