Row sieving statistics during WHERE clauses combined by AND - sql

The website stores the full specifications of a large number of items and lets users search the data by adding filter conditions at the front end. At the back end, all the conditions are translated into WHERE clauses and joined by the AND operator.
My aim is to give the user an idea of how many goods are discarded or left after each filter. Exact numbers aren't very important for the initial sieving (fuzzy values or approximations are fine, because the whole amount is quite large), but at later stages, when there are ten or so items left, the user should get the exact count.
The obvious, straightforward way is to run as many SELECT COUNT queries as the user has filters, but I feel there might be some technique to achieve this more elegantly and without abusing the DB much.

There are many ways to achieve this with varying levels of difficulty and performance.
The first and most obvious way to me is to simply do a count on the filters which performs fairly well and is not that difficult to implement. An alternative but similar approach would be to group by the values and do a count.
Here's a fiddle as an example of both methods: http://sqlfiddle.com/#!15/0cdcb/26
select
count(product.id) total,
sum((v0.value = 'spam')::int) v0_is_spam,
sum((v0.value != 'spam')::int) v0_not_spam,
sum((v1.value = 'spam')::int) v1_is_spam,
sum((v1.value != 'spam')::int) v1_not_spam
from product
left join specification_value v0 on v0.product_id = product.id and v0.specification_id = 1
left join specification_value v1 on v1.product_id = product.id and v1.specification_id = 2;
select specification.id, value, count(*)
from specification
left join specification_value on specification.id = specification_value.specification_id
group by specification.id, value;
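For the sieve itself, where each filter is ANDed onto the previous ones, the same conditional-sum idea can report how many rows survive each successive stage in a single pass. Here is a minimal sketch, run through Python's built-in sqlite3 rather than PostgreSQL; the flat `product` table, its columns and the three filters are made up for illustration and are not the schema from the fiddle:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (id INTEGER PRIMARY KEY, color TEXT, weight INTEGER, in_stock INTEGER)"
)
conn.executemany(
    "INSERT INTO product (color, weight, in_stock) VALUES (?, ?, ?)",
    [("red", 5, 1), ("red", 12, 1), ("blue", 3, 1),
     ("red", 7, 0), ("red", 2, 1), ("green", 9, 0)],
)

# Filters in the order the user applied them. The SQL fragments are assumed to
# come from a trusted whitelist on the server, never from raw user input.
filters = [
    ("color is red", "color = 'red'"),
    ("weight under 10", "weight < 10"),
    ("in stock", "in_stock = 1"),
]

# One SUM(CASE ...) per stage; stage k ANDs together filters 1..k.
columns = ["COUNT(*)"]
for k in range(1, len(filters) + 1):
    cond = " AND ".join(f"({sql})" for _, sql in filters[:k])
    columns.append(f"SUM(CASE WHEN {cond} THEN 1 ELSE 0 END)")

# remaining[0] is the total, remaining[k] the rows left after filter k.
remaining = list(conn.execute(f"SELECT {', '.join(columns)} FROM product").fetchone())

for (label, _), before, after in zip(filters, remaining, remaining[1:]):
    print(f"{label}: {after} left, {before - after} dropped")
```

The table is scanned once however many filters there are, so this is still a full count; it only saves the repeated round trips of one SELECT COUNT per filter.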
A slightly more difficult way to do something like that is using window functions, a lot more flexible but not as easy to grasp. Docs are here: http://www.postgresql.org/docs/9.3/static/tutorial-window.html
Example query and results:
SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;
depname | empno | salary | avg
-----------+-------+--------+-----------------------
develop | 11 | 5200 | 5020.0000000000000000
develop | 7 | 4200 | 5020.0000000000000000
develop | 9 | 4500 | 5020.0000000000000000
develop | 8 | 6000 | 5020.0000000000000000
develop | 10 | 5200 | 5020.0000000000000000
personnel | 5 | 3500 | 3700.0000000000000000
personnel | 2 | 3900 | 3700.0000000000000000
sales | 3 | 4800 | 4866.6666666666666667
sales | 1 | 5000 | 4866.6666666666666667
sales | 4 | 4800 | 4866.6666666666666667
(10 rows)
And lastly, by far the fastest but also the most inaccurate and difficult to implement: using the database statistics to estimate the number of rows. I would not opt for this unless you have millions of rows within a filter set and have no way of reducing it further. Also, don't do this unless the performance is really so bad that it's needed.

Related

Access SQL - divide value in each row by total (using self join)

I have a table (tblGoals) which shows how many goals each player has scored, e.g:
| Player | Goals |
--------------------
| John | 6 |
| Chris | 10 |
| Ben | 4 |
I am trying to write a query that will output each player along with the percentage of the team's total goals that they have scored:
| Player | PercentageGoals |
------------------------------
| John | 0.3 |
| Chris | 0.5 |
| Ben | 0.2 |
I have already figured out how to do this with a sub query as shown
SELECT
Player,
Goals/ (SELECT SUM(Goals)FROM tblGoals) AS PercentageGoals
FROM tblGoals
The example table I have shown is just to demonstrate what I am trying to do. The actual table I am using is a much larger dataset and trying to use a subquery to get the percentage in this way is running quite slowly.
I have noticed before in Access that self-joins are usually optimised more efficiently than sub queries, and so I am trying to figure out if the above query can be rewritten using a self join.
I have tried something along the lines of below but obviously this is incorrect as t2 is grouped by Player which means I am not getting the true total, but if I leave the Player name out then I can't join on it?
SELECT
t1.Player,
t1.Goals/t2.sumGoals AS PercentageGoals
FROM
tbl1 t1
INNER JOIN
(SELECT Player, SUM(Goals) as sumGoals
FROM tbl1
GROUP BY Player) t2
ON t1.Player= t2.Player
Is there any way to do this?
I'll try this as an answer, although I'm not sure whether it will be faster, as I'm only testing with three records.
Your second table in the join should just have a total of all the goals to join to each record in the first table - a cartesian product.
You can then divide one by the other:
SELECT Player
, Goals / TotalGoals AS Total
FROM tblGoals, (SELECT SUM(Goals) AS TotalGoals FROM tblGoals)
Is it any faster on a big table? The idea being that in your SQL it was calculating the total for each record, while this creates the total as a table and joins to it so should only be calculated once.
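To check that the cross join really returns the expected shares, here is a small sketch using Python's built-in sqlite3 instead of Access (an assumption made purely so the example can be run anywhere). The aliases `g` and `t` are mine, and the `1.0 *` factor is only needed because SQLite divides integers as integers, whereas the `/` operator in Access already returns a fraction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblGoals (Player TEXT, Goals INTEGER)")
conn.executemany(
    "INSERT INTO tblGoals VALUES (?, ?)",
    [("John", 6), ("Chris", 10), ("Ben", 4)],
)

# The derived table t has exactly one row, so the comma (cross) join attaches
# the same total to every player without multiplying the number of rows.
rows = conn.execute(
    """
    SELECT g.Player, 1.0 * g.Goals / t.TotalGoals AS PercentageGoals
    FROM tblGoals AS g,
         (SELECT SUM(Goals) AS TotalGoals FROM tblGoals) AS t
    ORDER BY g.Player
    """
).fetchall()

shares = {player: round(share, 2) for player, share in rows}
print(shares)
```

Whether this beats the scalar subquery on a large Access table still has to be measured there; the sketch only confirms that the two forms return the same numbers.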

Calculating an average of a DISTINCTCOUNT efficiently in Dax?

I'm trying to calculate a business-logic in DAX which has turned out to be quite resource-heavy and complex. I have a very large PowerPivot model (call it "sales") with numerous dimensions and measures. A simplified view of the sales model:
+-------+--------+---------+------+---------+-------+
| State | City | Store | Week | Product | Sales |
+-------+--------+---------+------+---------+-------+
| NY | NYC | Charlie | 1 | A | $5 |
| MA | Boston | Bravo | 2 | B | $10 |
| - | D.C. | Delta | 1 | A | $20 |
+-------+--------+---------+------+---------+-------+
Essentially what I'm trying to do is calculate a DISTINCTCOUNT of product by store and week:
SUMMARIZE(Sales,[Store],[Week],"Distinct Products",DISTINCTCOUNT([Product]))
+---------+------+-------------------+
| Store | Week | Distinct Products |
+---------+------+-------------------+
| Charlie | 1 | 15 |
| Charlie | 2 | 7 |
| Charlie | 3 | 12 |
| Bravo | 1 | 20 |
| Bravo | 2 | 14 |
| Bravo | 3 | 22 |
+---------+------+-------------------+
I then want to calculate the AVERAGE of these Distinct Products at the store level. The way I approached this was by taking the previous calculation, and running a SUMX on top of it and dividing it by distinct weeks:
SUMX(
SUMMARIZE(Sales,[Store],[Week],"Distinct Products",DISTINCTCOUNT([Product]))
,[Distinct Products]
) / DISTINCTCOUNT([Week])
+---------+------------------+
| Store | Average Products |
+---------+------------------+
| Charlie | 11.3 |
| Bravo | 18.7 |
+---------+------------------+
I stored this calculation in a measure and it worked well when the dataset was smaller. But now the dataset is so huge that when I try to use the measure, it hangs until I have to cancel the process.
Is there a more efficient way to do this?
SUMX is appropriate in this case since you want the distinct product count calculated independently for each store & for each week, then summed together by store, and then divided by the number of weeks by store. There's no way around that. (If there was, I'd recommend it.)
However, SUMX is an iterator, and so is the likely cause of the slowdown. Since we can't eliminate the SUMX entirely, the biggest factor here is the number of combinations of stores/weeks that you have.
To confirm if the number of combinations of stores/weeks is the source of the slowdown, try filtering or removing 50% from a copy of your data model and see if that speeds things up. If that doesn't time out, add more back in to get a sense of how many combinations are the failing point.
To make things faster with the full dataset:
You may be able to filter to a subset of stores/weeks in your pivot table, before dragging on the measure. This will typically get faster results than dragging on the measure first, then adding filters. (This isn't really a change to your measure, but more of a behaviour change for users of your model).
You might want to consider grouping at a higher level than week (e.g. month), to reduce the number of combinations it has to iterate over
If you're running Excel 32-bit, or only have 4GB of RAM, consider 64-bit Excel and/or a more powerful machine (I doubt this is the case, but am including for comprehensiveness - Power Pivot can be a resource hog)
If you can move your model to Power BI Desktop (I don't believe Calculated Tables are supported in Power Pivot), you could extract out the SUMMARIZE into a calculated table, and then re-write your measure to reference that calculated table instead. This reduces the number of calculations the measure has to perform at run-time, as all the combinations of store/week plus the distinct count of products will be pre-calculated (leaving only the summing & division for your measure to do - a lot less work).
Calculated Table =
SUMMARIZE (
Sales,
[Store],
[Week],
"Distinct Products", DISTINCTCOUNT ( Sales[Product] )
)
Note: The calculated table code above is rudimentary and is mostly designed as a proof of concept. If this is the path you take, you'll want to make sure you have a separate store dimension to join the calculated table to, as this won't join to the source table directly
Measure Using Calc Table =
SUMX (
'Calculated Table',
[Distinct Products] / DISTINCTCOUNT ( 'Calculated Table'[Week] )
)
Jason Thomas has a great post on calculated tables and when they can come in useful here: http://sqljason.com/2015/09/my-thoughts-on-calculated-tables-in.html.
If you can't use calculated tables, but your data is coming from a database of some form, then you could do the same logic in SQL and then import a pre-prepared separate table of unique store/months and their distinct counts.
I hope some of this proves useful (or you've solved the problem another way).
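The SQL pre-aggregation mentioned above could look roughly like this. It is a sketch only, run through Python's built-in sqlite3 with a tiny made-up `sales` table (the lowercase column names and the `store_week` table are hypothetical); in practice the first query would run in your source database and only its result would be imported into the model:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, week INTEGER, product TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Charlie", 1, "A"), ("Charlie", 1, "B"), ("Charlie", 1, "A"),
        ("Charlie", 2, "A"),
        ("Bravo", 1, "A"), ("Bravo", 1, "B"), ("Bravo", 1, "C"),
        ("Bravo", 2, "B"), ("Bravo", 2, "C"), ("Bravo", 2, "D"),
    ],
)

# Step 1: one row per store/week with its distinct product count.
# This is the expensive part, done once at load time instead of in the measure.
conn.execute(
    """
    CREATE TABLE store_week AS
    SELECT store, week, COUNT(DISTINCT product) AS distinct_products
    FROM sales
    GROUP BY store, week
    """
)

# Step 2: the cheap part left for the measure: sum per store, divide by weeks.
averages = dict(
    conn.execute(
        """
        SELECT store, 1.0 * SUM(distinct_products) / COUNT(DISTINCT week)
        FROM store_week
        GROUP BY store
        """
    ).fetchall()
)
print(averages)
```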

Time Series in Postgres

I have a huge database of eCommerce transactions on Redshift, running into about 900 million rows, with the headers being somewhat similar to this.
id | date_stamp | location | item | amount
001 | 2009-12-28 | A1 | Apples | 2
002 | 2009-12-28 | A2 | Juice | 2
003 | 2009-12-28 | A1 | Apples | 1
004 | 2009-12-28 | A4 | Apples | 2
005 | 2009-12-29 | A1 | Juice | 6
006 | 2009-12-29 | A4 | Apples | 2
007 | 2009-12-29 | A1 | Water | 7
008 | 2009-12-28 | B7 | Juice | 14
Is it possible to find trends within items? For example, if I wanted to see how "Apples" performed in terms of sales, between 2009-12-28 and 2011-12-28, at location A4, how would I go about it? Ideally I would like to generate a table with positive/negative trending, somewhat similar to the post here -
Aggregate function to detect trend in PostgreSQL
I have performed similar analysis on small data sets in R, and even visualizing it using ggplot isn't a big challenge, but the sheer size of the database is causing me some troubles, and extremely long querying times as well.
For example,
select *
from fruitstore.sales
where item = 'Apple' and location = 'A1'
order by date_stamp
limit 1000000;
takes about 2500 seconds to execute, and times out often.
I appreciate any help on this.
900M rows is quite a bit for stock Postgres to handle. One of the MPP variants (like Citus) would be able to handle it better.
Another option is to change how you're storing the data. A far more efficient structure would be to have 1 row for each month/item/location, and store an int array of amounts. That would cut things down to ~300M rows, which is much more manageable. I suspect most of your analysis tools will want to see the data as an array anyway.
Take a look at window functions. They're great for this type of use case. They were a bit tough for me to get my head around but can save you some serious contortions with SQL.
This will show you how many apples were sold per day for the period you're interested in (summing amount, since a single row can cover more than one unit):
select date_trunc('day', date_stamp) as day, sum(amount) as sold
from fruitstore.sales
where item = 'Apple' and location = 'A4'
and date_stamp::date >= '2009-12-28'::date and date_stamp::date <= '2011-12-28'::date
group by 1 order by 1 asc
Regarding performance, avoid using select * in Redshift. It's a columnar store where data for different columns is spread across nodes. Being explicit about the columns and only referencing the ones you use will save Redshift from moving a lot of unneeded data over the network.
Make sure you're picking good distkey and sortkeys for your tables. In a time series table the timestamp should definitely be one of the sortkeys. Enabling compression on your tables can help too.
Schedule regular VACUUM and ANALYZE runs on your tables.
Also if there's any way to restrict the range of data you're looking at by filtering possible records out in the where clause, it can help a lot. For example, if you know you only care about the trend for the last few days it can make a huge difference to limit on time like:
where date_stamp >= sysdate::date - '5 day'::interval
Here's a good article with performance tips.
To filter results in your SQL query, you can use a WHERE clause:
SELECT *
FROM myTable
WHERE
item='Apple' AND
date_stamp BETWEEN '2009-12-28' AND '2011-12-28' AND
location = 'A4'
Using Aggregate functions, you can summarize fruit sales between two dates at a location, for instance:
SELECT item as "fruit", sum(amount) as "total"
FROM myTable
WHERE
date_stamp BETWEEN '2009-12-28' AND '2011-12-28' AND
location = 'A4'
GROUP BY item
Your question asking how apples "Fared" isn't terribly descriptive, but using a WHERE clause and aggregate functions (don't forget your GROUP BY) is probably where you need to aim.
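To get from per-day totals to a positive/negative trend, one option is to aggregate in SQL and then compare each day with the previous one on the client. A minimal sketch, assuming Python's built-in sqlite3 as a stand-in for Redshift and a handful of invented rows; the `up`/`down`/`flat` labels are my own convention, not something from the linked post:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (date_stamp TEXT, location TEXT, item TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("2009-12-28", "A4", "Apples", 2),
        ("2009-12-28", "A1", "Apples", 1),
        ("2009-12-29", "A4", "Apples", 2),
        ("2009-12-29", "A4", "Apples", 3),
        ("2009-12-30", "A4", "Apples", 1),
        ("2009-12-30", "A4", "Juice", 9),
    ],
)

# Let the database shrink the data to one row per day; only the columns that
# are actually needed are referenced.
daily = conn.execute(
    """
    SELECT date_stamp, SUM(amount) AS total
    FROM sales
    WHERE item = 'Apples' AND location = 'A4'
      AND date_stamp BETWEEN '2009-12-28' AND '2011-12-28'
    GROUP BY date_stamp
    ORDER BY date_stamp
    """
).fetchall()

# Compare each day with the one before it and label the direction.
trend = []
for (_, prev_total), (day, total) in zip(daily, daily[1:]):
    change = total - prev_total
    label = "up" if change > 0 else "down" if change < 0 else "flat"
    trend.append((day, change, label))
print(trend)
```

On the real table the day-over-day comparison could also be done in SQL with the lag() window function; it is done in Python here only to keep the sketch independent of the SQLite version.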

Selecting Recent Rows, Optimization (Oracle SQL)

I would appreciate some guidance on the following query. We have a list of experiments and their current progress state (for simplicity, I've reduced the statuses to 4 types, but we have 10 different statuses in our data). I need to eventually return a list of the current status of all non-finished experiments.
Given a table exp_status,
Experiment | ID | Status
----------------------------
A | 1 | Starting
A | 2 | Working On It
B | 3 | Starting
B | 4 | Working On It
B | 5 | Finished Type I
C | 6 | Starting
D | 7 | Starting
D | 8 | Working On It
D | 9 | Finished Type II
E | 10 | Starting
E | 11 | Working On It
F | 12 | Starting
G | 13 | Starting
H | 14 | Starting
H | 15 | Working On It
H | 16 | Finished Type II
Desired Result Set:
Experiment | ID | Status
----------------------------
A | 2 | Working On It
C | 6 | Starting
E | 11 | Working On It
F | 12 | Starting
G | 13 | Starting
The most recent ID number will correspond to the most recent status.
Now, the current code I have executes in 150 seconds.
SELECT *
FROM
(SELECT Experiment, ID, Status,
row_number () over (partition by Experiment
order by ID desc) as rn
FROM exp_status)
WHERE rn = 1
AND status NOT LIKE ('Finished%')
The thing is, this code wastes its time. The result set is 45 thousand rows pulled from a table of 3.9 million. This is because most experiments are in the finished status. The code goes through and orders all of them then only filters out the finished at the end. About 95% of the experiments in the table are in the finished phase. I could not figure out how to make the query first pick out all the experiments and statuses where there isn't a 'Finished' for that experiment. I tried the following but had very slow performance.
SELECT *
FROM exp_status
WHERE experiment NOT IN
(
SELECT experiment
FROM exp_status
WHERE status LIKE ('Finished%')
)
Any help would be appreciated!
Given your requirement, I think your current query with row_number() is one of the most efficient possible. This query takes time not because it has to sort the data, but because there is so much data to read in the first place (the extra CPU time is negligible compared to the fetch time). Furthermore, the first query performs a full scan, which is really the best way to read lots of data.
You need to find a way to read far fewer rows if you want to improve performance. The second query doesn't go in the right direction:
the inner query will likely be a full scan since the 'finished' rows will be spread across the whole table and likely represent a big percentage of all rows.
the outer query will also likely be a full scan and a nice ANTI-HASH JOIN which should be quicker than 45k * (number of status changes per experiment) non-unique index scans.
So the second query seems to have at least twice the number of reads (plus a join).
If you want to really improve performance, I think you will need a change of design.
You could for instance build a table of active experiments and join to this table. You would maintain this table either as a materialized view or with a modification to the code that inserts experiment statuses. You could go further and store the last status in this table. Maintaining this "last status" will likely be an extra burden but this could be justified by the improved performance.
Consider partitioning your table by status
www.orafaq.com/wiki/Partitioning_FAQ
You could also create materialized views to avoid having to recalculate your aggregations if these types of queries are frequent.
Could you provide the execution plans of your queries? Without those it is difficult to know the exact reason it is taking so long.
You can improve your first query slightly by using this variant:
select experiment
, max(id) id
, max(status) keep (dense_rank last order by id) status
from exp_status
group by experiment
having max(status) keep (dense_rank last order by id) not like 'Finished%'
If you compare the plans, you'll notice one step less
Regards,
Rob.
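For readers who want to try the "latest status per experiment" logic outside Oracle, here is a portable sketch that joins each experiment's MAX(id) back to the table. It runs on Python's built-in sqlite3 with the sample data from the question. It is not the KEEP (DENSE_RANK LAST ...) query above and makes no claim about Oracle performance; the alias names `s`, `latest` and `last_id` are mine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE exp_status (experiment TEXT, id INTEGER PRIMARY KEY, status TEXT)"
)
conn.executemany(
    "INSERT INTO exp_status VALUES (?, ?, ?)",
    [
        ("A", 1, "Starting"), ("A", 2, "Working On It"),
        ("B", 3, "Starting"), ("B", 4, "Working On It"), ("B", 5, "Finished Type I"),
        ("C", 6, "Starting"),
        ("D", 7, "Starting"), ("D", 8, "Working On It"), ("D", 9, "Finished Type II"),
        ("E", 10, "Starting"), ("E", 11, "Working On It"),
        ("F", 12, "Starting"),
        ("G", 13, "Starting"),
        ("H", 14, "Starting"), ("H", 15, "Working On It"), ("H", 16, "Finished Type II"),
    ],
)

# The derived table holds one row per experiment (its highest id); joining it
# back on id picks the latest status row, which is then filtered.
rows = conn.execute(
    """
    SELECT s.experiment, s.id, s.status
    FROM exp_status AS s
    JOIN (SELECT experiment, MAX(id) AS last_id
          FROM exp_status
          GROUP BY experiment) AS latest
      ON latest.last_id = s.id
    WHERE s.status NOT LIKE 'Finished%'
    ORDER BY s.experiment
    """
).fetchall()

for row in rows:
    print(row)
```

The result matches the desired result set from the question, so it can serve as a reference when checking any faster variant.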

Join vs. sub-query

I am an old-school MySQL user and have always preferred JOIN over sub-query. But nowadays everyone uses sub-query, and I hate it; I don't know why.
I lack the theoretical knowledge to judge for myself if there is any difference. Is a sub-query as good as a JOIN and therefore is there nothing to worry about?
Sub-queries are the logically correct way to solve problems of the form, "Get facts from A, conditional on facts from B". In such instances, it makes more logical sense to stick B in a sub-query than to do a join. It is also safer, in a practical sense, since you don't have to be cautious about getting duplicated facts from A due to multiple matches against B.
Practically speaking, however, the answer usually comes down to performance. Some optimisers suck lemons when given a join vs a sub-query, and some suck lemons the other way, and this is optimiser-specific, DBMS-version-specific and query-specific.
Historically, explicit joins usually win, hence the established wisdom that joins are better, but optimisers are getting better all the time, and so I prefer to write queries first in a logically coherent way, and then restructure if performance constraints warrant this.
In most cases JOINs are faster than sub-queries and it is very rare for a sub-query to be faster.
In JOINs RDBMS can create an execution plan that is better for your query and can predict what data should be loaded to be processed and save time, unlike the sub-query where it will run all the queries and load all their data to do the processing.
The good thing in sub-queries is that they are more readable than JOINs: that's why most new SQL people prefer them; it is the easy way; but when it comes to performance, JOINs are better in most cases, even though they are not hard to read either.
Taken from the MySQL manual (13.2.10.11 Rewriting Subqueries as Joins):
A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.
So subqueries can be slower than LEFT [OUTER] JOIN, but in my opinion their strength is slightly higher readability.
In the year 2010 I would have joined the author of this question and would have strongly voted for JOIN, but with much more experience (especially in MySQL) I can state: Yes, subqueries can be better. I've read multiple answers here; some stated subqueries are faster, but they lacked a good explanation. I hope I can provide one with this (very) late answer:
First of all, let me say the most important: There are different forms of sub-queries
And the second important statement: Size matters
If you use sub-queries, you should be aware of how the DB-Server executes the sub-query. Especially if the sub-query is evaluated once or for every row!
On the other side, a modern DB-Server is able to optimize a lot. In some cases a subquery helps optimizing a query, but a newer version of the DB-Server might make the optimization obsolete.
Sub-queries in Select-Fields
SELECT moo, (SELECT roger FROM wilco WHERE moo = me) AS bar FROM foo
Be aware that a sub-query is executed for every resulting row from foo.
Avoid this if possible; it may drastically slow down your query on huge datasets. However, if the sub-query has no reference to foo it can be optimized by the DB-server as static content and could be evaluated only once.
Sub-queries in the Where-statement
SELECT moo FROM foo WHERE bar = (SELECT roger FROM wilco WHERE moo = me)
If you are lucky, the DB optimizes this internally into a JOIN. If not, your query will become very, very slow on huge datasets because it will execute the sub-query for every row in foo, not just the results like in the select-type.
Sub-queries in the Join-statement
SELECT moo, bar
FROM foo
LEFT JOIN (
SELECT MIN(bar) AS bar, me FROM wilco GROUP BY me
) AS w ON moo = me
This is interesting. We combine JOIN with a sub-query. And here we get the real strength of sub-queries. Imagine a dataset with millions of rows in wilco but only a few distinct me. Instead of joining against a huge table, we have now a smaller temporary table to join against. This can result in much faster queries depending on database size. You can have the same effect with CREATE TEMPORARY TABLE ... and INSERT INTO ... SELECT ..., which might provide better readability on very complex queries (but can lock datasets in a repeatable read isolation level).
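A small sketch of that effect, using Python's built-in sqlite3 instead of MySQL and a few invented rows (the column layout of `foo` and `wilco` is assumed, as the answer never defines it). Joining the raw table returns one row per match, while joining the pre-aggregated derived table returns one row per `moo`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (moo INTEGER)")
conn.execute("CREATE TABLE wilco (me INTEGER, bar INTEGER)")
conn.executemany("INSERT INTO foo VALUES (?)", [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO wilco VALUES (?, ?)",
    [(1, 30), (1, 10), (1, 20), (2, 7), (2, 5)],
)

# Joining wilco directly: moo = 1 matches three rows, moo = 2 matches two.
direct = conn.execute(
    "SELECT moo, bar FROM foo LEFT JOIN wilco ON moo = me ORDER BY moo, bar"
).fetchall()

# Aggregating first: the derived table has one row per distinct me, so the
# join has far less to match against.
reduced = conn.execute(
    """
    SELECT moo, w.bar
    FROM foo
    LEFT JOIN (SELECT me, MIN(bar) AS bar FROM wilco GROUP BY me) AS w
      ON moo = w.me
    ORDER BY moo
    """
).fetchall()

print(len(direct), "rows from the direct join")
print(reduced)
```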
Nested sub-queries
SELECT VARIANCE(moo)
FROM (
SELECT moo, CONCAT(roger, wilco) AS bar
FROM foo
HAVING bar LIKE 'SpaceQ%'
) AS temp_foo
GROUP BY moo
You can nest sub-queries in multiple levels. This can help on huge datasets if you have to group or change the results. Usually the DB-Server creates a temporary table for this, but sometimes you do not need some operations on the whole table, only on the resultset. This might provide a much better performance depending on the size of the table.
Conclusion
Sub-queries are no replacement for a JOIN and you should not use them like this (although possible). In my humble opinion, the correct use of a sub-query is the use as a quick replacement of CREATE TEMPORARY TABLE .... A good sub-query reduces a dataset in a way you cannot accomplish in an ON statement of a JOIN. If a sub-query has one of the keywords GROUP BY or DISTINCT and is preferably not situated in the select fields or the where statement, then it might improve performance a lot.
Use EXPLAIN to see how your database executes the query on your data. There is a huge "it depends" in this answer...
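As a rough illustration of comparing plans, here is a sketch that uses SQLite's EXPLAIN QUERY PLAN through Python's built-in sqlite3. MySQL and PostgreSQL print differently shaped output and may choose different plans, so treat this only as a template for the habit of checking; the columns of `foo` and `wilco` and the `plan` helper are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (moo INTEGER PRIMARY KEY, bar INTEGER)")
conn.execute("CREATE TABLE wilco (me INTEGER PRIMARY KEY, roger INTEGER)")


def plan(sql):
    """Return the textual plan steps SQLite reports for a query."""
    # The last column of every EXPLAIN QUERY PLAN row is the description.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


subquery_plan = plan(
    "SELECT moo FROM foo WHERE bar IN (SELECT roger FROM wilco WHERE me < 10)"
)
join_plan = plan(
    "SELECT moo FROM foo JOIN wilco ON wilco.roger = foo.bar WHERE wilco.me < 10"
)

print("subquery variant:")
for step in subquery_plan:
    print("  ", step)
print("join variant:")
for step in join_plan:
    print("  ", step)
```

The two variants are not equivalent when `roger` has duplicates (the join can repeat rows), which is one more reason to compare results as well as plans.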
PostgreSQL can rewrite a subquery to a join or a join to a subquery when it thinks one is faster than the other. It all depends on the data, indexes, correlation, amount of data, query, etc.
First of all, to compare the two you should distinguish between two classes of queries with subqueries:
a class of subqueries that always have corresponding equivalent query written with joins
a class of subqueries that can not be rewritten using joins
For the first class of queries a good RDBMS will see joins and subqueries as equivalent and will produce same query plans.
These days even mysql does that.
Still, sometimes it does not, but this does not mean that joins will always win - I had cases when using subqueries in mysql improved performance. (For example if there is something preventing mysql planner to correctly estimate the cost and if the planner doesn't see the join-variant and subquery-variant as same then subqueries can outperform the joins by forcing a certain path).
Conclusion is that you should test your queries for both join and subquery variants if you want to be sure which one will perform better.
For the second class the comparison makes no sense as those queries can not be rewritten using joins and in these cases subqueries are natural way to do the required tasks and you should not discriminate against them.
I think what has been under-emphasized in the cited answers is the issue of duplicates and problematic results that may arise from specific (use) cases.
(although Marcelo Cantos does mention it)
I will cite the example from Stanford's Lagunita courses on SQL.
Student Table
+------+--------+------+--------+
| sID | sName | GPA | sizeHS |
+------+--------+------+--------+
| 123 | Amy | 3.9 | 1000 |
| 234 | Bob | 3.6 | 1500 |
| 345 | Craig | 3.5 | 500 |
| 456 | Doris | 3.9 | 1000 |
| 567 | Edward | 2.9 | 2000 |
| 678 | Fay | 3.8 | 200 |
| 789 | Gary | 3.4 | 800 |
| 987 | Helen | 3.7 | 800 |
| 876 | Irene | 3.9 | 400 |
| 765 | Jay | 2.9 | 1500 |
| 654 | Amy | 3.9 | 1000 |
| 543 | Craig | 3.4 | 2000 |
+------+--------+------+--------+
Apply Table
(applications made to specific universities and majors)
+------+----------+----------------+----------+
| sID | cName | major | decision |
+------+----------+----------------+----------+
| 123 | Stanford | CS | Y |
| 123 | Stanford | EE | N |
| 123 | Berkeley | CS | Y |
| 123 | Cornell | EE | Y |
| 234 | Berkeley | biology | N |
| 345 | MIT | bioengineering | Y |
| 345 | Cornell | bioengineering | N |
| 345 | Cornell | CS | Y |
| 345 | Cornell | EE | N |
| 678 | Stanford | history | Y |
| 987 | Stanford | CS | Y |
| 987 | Berkeley | CS | Y |
| 876 | Stanford | CS | N |
| 876 | MIT | biology | Y |
| 876 | MIT | marine biology | N |
| 765 | Stanford | history | Y |
| 765 | Cornell | history | N |
| 765 | Cornell | psychology | Y |
| 543 | MIT | CS | N |
+------+----------+----------------+----------+
Let's try to find the GPA scores for students that have applied to CS major (regardless of the university)
Using a subquery:
select GPA from Student where sID in (select sID from Apply where major = 'CS');
+------+
| GPA |
+------+
| 3.9 |
| 3.5 |
| 3.7 |
| 3.9 |
| 3.4 |
+------+
The average value for this resultset is:
select avg(GPA) from Student where sID in (select sID from Apply where major = 'CS');
+--------------------+
| avg(GPA) |
+--------------------+
| 3.6800000000000006 |
+--------------------+
Using a join:
select GPA from Student, Apply where Student.sID = Apply.sID and Apply.major = 'CS';
+------+
| GPA |
+------+
| 3.9 |
| 3.9 |
| 3.5 |
| 3.7 |
| 3.7 |
| 3.9 |
| 3.4 |
+------+
average value for this resultset:
select avg(GPA) from Student, Apply where Student.sID = Apply.sID and Apply.major = 'CS';
+-------------------+
| avg(GPA) |
+-------------------+
| 3.714285714285714 |
+-------------------+
It is obvious that the second attempt yields misleading results in our use case, given that it counts duplicates for the computation of the average value.
It is also evident that using distinct with the join-based statement will not eliminate the problem, given that it will erroneously keep one out of three occurrences of the 3.9 score. The correct result accounts for TWO (2) occurrences of the 3.9 score, given that we actually have TWO (2) students with that score who meet our query criteria.
It seems that in some cases a sub-query is the safest way to go, besides any performance issues.
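The effect is easy to reproduce. The sketch below loads the CS-relevant rows from the tables above (plus two non-CS applications) into Python's built-in sqlite3 and compares the averages. The third query, which joins against a de-duplicated list of sIDs, is my own addition to show one way a join can be made safe; the `scalar` helper and the alias `cs` are hypothetical names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (sID INTEGER PRIMARY KEY, sName TEXT, GPA REAL)")
conn.execute("CREATE TABLE Apply (sID INTEGER, cName TEXT, major TEXT)")
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?)",
    [(123, "Amy", 3.9), (234, "Bob", 3.6), (345, "Craig", 3.5),
     (987, "Helen", 3.7), (876, "Irene", 3.9), (543, "Craig", 3.4)],
)
conn.executemany(
    "INSERT INTO Apply VALUES (?, ?, ?)",
    [(123, "Stanford", "CS"), (123, "Stanford", "EE"), (123, "Berkeley", "CS"),
     (234, "Berkeley", "biology"), (345, "Cornell", "CS"),
     (987, "Stanford", "CS"), (987, "Berkeley", "CS"),
     (876, "Stanford", "CS"), (543, "MIT", "CS")],
)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# Subquery: each student is counted once, however many CS applications exist.
sub_avg = scalar(
    "SELECT AVG(GPA) FROM Student "
    "WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS')"
)

# Plain join: students 123 and 987 applied to CS twice and are counted twice.
join_avg = scalar(
    "SELECT AVG(GPA) FROM Student, Apply "
    "WHERE Student.sID = Apply.sID AND Apply.major = 'CS'"
)

# Join against de-duplicated applicants: one row per student again.
dedup_avg = scalar(
    "SELECT AVG(GPA) FROM Student "
    "JOIN (SELECT DISTINCT sID FROM Apply WHERE major = 'CS') AS cs "
    "ON cs.sID = Student.sID"
)

print(round(sub_avg, 3), round(join_avg, 3), round(dedup_avg, 3))
```

De-duplicating on sID before the join keeps both 3.9 students, which is exactly what DISTINCT on the GPA column cannot do.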
MSDN Documentation for SQL Server says
Many Transact-SQL statements that include subqueries can be alternatively formulated as joins. Other questions can be posed only with subqueries. In Transact-SQL, there is usually no performance difference between a statement that includes a subquery and a semantically equivalent version that does not. However, in some cases where existence must be checked, a join yields better performance. Otherwise, the nested query must be processed for each result of the outer query to ensure elimination of duplicates. In such cases, a join approach would yield better results.
so if you need something like
select * from t1 where exists (select * from t2 where t2.parent = t1.id)
try to use join instead. In other cases, it makes no difference.
I say: creating functions for subqueries eliminates the problem of clutter and allows you to implement additional logic in subqueries. So I recommend creating functions for subqueries whenever possible.
Clutter in code is a big problem and the industry has been working on avoiding it for decades.
From my observation there are two cases: if a table has fewer than 100,000 records, the join will be fast.
But if a table has more than 100,000 records, a subquery gives the best result.
I have one table with 500,000 records, on which I ran the queries below, with the following timings:
SELECT *
FROM crv.workorder_details wd
inner join crv.workorder wr on wr.workorder_id = wd.workorder_id;
Result : 13.3 Seconds
select *
from crv.workorder_details
where workorder_id in (select workorder_id from crv.workorder)
Result : 1.65 Seconds
Run on a very large database from an old Mambo CMS:
SELECT id, alias
FROM
mos_categories
WHERE
id IN (
SELECT
DISTINCT catid
FROM mos_content
);
0 seconds
SELECT
DISTINCT mos_content.catid,
mos_categories.alias
FROM
mos_content, mos_categories
WHERE
mos_content.catid = mos_categories.id;
~3 seconds
An EXPLAIN shows that they examine the exact same number of rows, but one takes 3 seconds and one is near instant. Moral of the story? If performance is important (when isn't it?), try it multiple ways and see which one is fastest.
And...
SELECT
DISTINCT mos_categories.id,
mos_categories.alias
FROM
mos_content, mos_categories
WHERE
mos_content.catid = mos_categories.id;
0 seconds
Again, same results, same number of rows examined. My guess is that DISTINCT mos_content.catid takes far longer to figure out than DISTINCT mos_categories.id does.
A general rule is that joins are faster in most cases (99%).
The more data the tables have, the slower the subqueries are.
The less data the tables have, the closer subqueries come to the speed of joins.
The subqueries are simpler, easier to understand, and easier to read.
Most of the web and app frameworks and their "ORM"s and "Active record"s generate queries with subqueries, because subqueries make it easier to split responsibility, maintain code, etc.
For smaller web sites or apps subqueries are OK, but for larger web sites and apps you will often have to rewrite generated queries as join queries, especially if a query uses many subqueries.
Some people say "some RDBMS can rewrite a subquery to a join or a join to a subquery when it thinks one is faster than the other.", but this statement applies to simple cases, surely not to complicated queries with subqueries, which are the ones that actually cause performance problems.
Subqueries are generally used to return a single row as an atomic value, though they may be used to compare values against multiple rows with the IN keyword. They are allowed at nearly any meaningful point in a SQL statement, including the target list, the WHERE clause, and so on. A simple sub-query could be used as a search condition. For example, between a pair of tables:
SELECT title
FROM books
WHERE author_id = (
SELECT id
FROM authors
WHERE last_name = 'Bar' AND first_name = 'Foo'
);
Note that using a normal value operator on the results of a sub-query requires that only one field be returned. If you're interested in checking for the existence of a single value within a set of other values, use IN:
SELECT title
FROM books
WHERE author_id IN (
SELECT id FROM authors WHERE last_name ~ '^[A-E]'
);
This is obviously different from say a LEFT-JOIN where you just want to join stuff from table A and B even if the join-condition doesn't find any matching record in table B, etc.
If you're just worried about speed you'll have to check with your database and write a good query and see if there's any significant difference in performance.
MySQL version: 5.5.28-0ubuntu0.12.04.2-log
I was also under the impression that JOIN is always better than a sub-query in MySQL, but EXPLAIN is a better way to make a judgment. Here is an example where sub queries work better than JOINs.
Here is my query with 3 sub-queries:
EXPLAIN SELECT vrl.list_id,vrl.ontology_id,vrl.position,l.name AS list_name, vrlih.position AS previous_position, vrl.moved_date
FROM `vote-ranked-listory` vrl
INNER JOIN lists l ON l.list_id = vrl.list_id
INNER JOIN `vote-ranked-list-item-history` vrlih ON vrl.list_id = vrlih.list_id AND vrl.ontology_id=vrlih.ontology_id AND vrlih.type='PREVIOUS_POSITION'
INNER JOIN list_burial_state lbs ON lbs.list_id = vrl.list_id AND lbs.burial_score < 0.5
WHERE vrl.position <= 15 AND l.status='ACTIVE' AND l.is_public=1 AND vrl.ontology_id < 1000000000
AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=43) IS NULL
AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=55) IS NULL
AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=246403) IS NOT NULL
ORDER BY vrl.moved_date DESC LIMIT 200;
EXPLAIN shows:
+----+--------------------+----------+--------+-----------------------------------------------------+--------------+---------+-------------------------------------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+----------+--------+-----------------------------------------------------+--------------+---------+-------------------------------------------------+------+--------------------------+
| 1 | PRIMARY | vrl | index | PRIMARY | moved_date | 8 | NULL | 200 | Using where |
| 1 | PRIMARY | l | eq_ref | PRIMARY,status,ispublic,idx_lookup,is_public_status | PRIMARY | 4 | ranker.vrl.list_id | 1 | Using where |
| 1 | PRIMARY | vrlih | eq_ref | PRIMARY | PRIMARY | 9 | ranker.vrl.list_id,ranker.vrl.ontology_id,const | 1 | Using where |
| 1 | PRIMARY | lbs | eq_ref | PRIMARY,idx_list_burial_state,burial_score | PRIMARY | 4 | ranker.vrl.list_id | 1 | Using where |
| 4 | DEPENDENT SUBQUERY | list_tag | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.l.list_id,const | 1 | Using where; Using index |
| 3 | DEPENDENT SUBQUERY | list_tag | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.l.list_id,const | 1 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | list_tag | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.l.list_id,const | 1 | Using where; Using index |
+----+--------------------+----------+--------+-----------------------------------------------------+--------------+---------+-------------------------------------------------+------+--------------------------+
The same query with JOINs is:
EXPLAIN SELECT vrl.list_id,vrl.ontology_id,vrl.position,l.name AS list_name, vrlih.position AS previous_position, vrl.moved_date
FROM `vote-ranked-listory` vrl
INNER JOIN lists l ON l.list_id = vrl.list_id
INNER JOIN `vote-ranked-list-item-history` vrlih ON vrl.list_id = vrlih.list_id AND vrl.ontology_id=vrlih.ontology_id AND vrlih.type='PREVIOUS_POSITION'
INNER JOIN list_burial_state lbs ON lbs.list_id = vrl.list_id AND lbs.burial_score < 0.5
LEFT JOIN list_tag lt1 ON lt1.list_id = vrl.list_id AND lt1.tag_id = 43
LEFT JOIN list_tag lt2 ON lt2.list_id = vrl.list_id AND lt2.tag_id = 55
INNER JOIN list_tag lt3 ON lt3.list_id = vrl.list_id AND lt3.tag_id = 246403
WHERE vrl.position <= 15 AND l.status='ACTIVE' AND l.is_public=1 AND vrl.ontology_id < 1000000000
AND lt1.list_id IS NULL AND lt2.tag_id IS NULL
ORDER BY vrl.moved_date DESC LIMIT 200;
and the output is:
+----+-------------+-------+--------+-----------------------------------------------------+--------------+---------+---------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------------------------------------+--------------+---------+---------------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | lt3 | ref | list_tag_key,list_id,tag_id | tag_id | 5 | const | 2386 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | l | eq_ref | PRIMARY,status,ispublic,idx_lookup,is_public_status | PRIMARY | 4 | ranker.lt3.list_id | 1 | Using where |
| 1 | SIMPLE | vrlih | ref | PRIMARY | PRIMARY | 4 | ranker.lt3.list_id | 103 | Using where |
| 1 | SIMPLE | vrl | ref | PRIMARY | PRIMARY | 8 | ranker.lt3.list_id,ranker.vrlih.ontology_id | 65 | Using where |
| 1 | SIMPLE | lt1 | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.lt3.list_id,const | 1 | Using where; Using index; Not exists |
| 1 | SIMPLE | lbs | eq_ref | PRIMARY,idx_list_burial_state,burial_score | PRIMARY | 4 | ranker.vrl.list_id | 1 | Using where |
| 1 | SIMPLE | lt2 | ref | list_tag_key,list_id,tag_id | list_tag_key | 9 | ranker.lt3.list_id,const | 1 | Using where; Using index |
+----+-------------+-------+--------+-----------------------------------------------------+--------------+---------+---------------------------------------------+------+----------------------------------------------+
A comparison of the rows column shows the difference, and the query with JOINs also reports Using temporary; Using filesort.
When I run both queries, the first one finishes in 0.02 seconds while the second one does not complete even after 1 minute, so EXPLAIN explained these queries properly.
If I do not have the INNER JOIN on the list_tag table i.e. if I remove
AND (SELECT list_id FROM list_tag WHERE list_id=l.list_id AND tag_id=246403) IS NOT NULL
from the first query and correspondingly:
INNER JOIN list_tag lt3 ON lt3.list_id = vrl.list_id AND lt3.tag_id = 246403
from the second query, then EXPLAIN returns the same number of rows for both queries and both these queries run equally fast.
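The tag filters in the two queries above are three spellings of the same idea: "has tag 246403, but neither tag 43 nor tag 55". The sketch below, using Python's sqlite3 module with a cut-down, hypothetical version of the lists and list_tag tables, checks that the scalar sub-query form, the LEFT JOIN ... IS NULL form, and an EXISTS / NOT EXISTS form select the same rows. It says nothing about speed, which depends on the engine and the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lists (list_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE list_tag (list_id INTEGER, tag_id INTEGER, PRIMARY KEY (list_id, tag_id));
    -- Hypothetical sample data for the sketch.
    INSERT INTO lists VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
    INSERT INTO list_tag VALUES (1, 246403), (2, 246403), (2, 43), (3, 55), (4, 246403), (4, 55);
""")

# Form 1: correlated scalar sub-queries, as in the first query above.
subquery_form = """
    SELECT l.list_id FROM lists l
    WHERE (SELECT list_id FROM list_tag WHERE list_id = l.list_id AND tag_id = 43) IS NULL
      AND (SELECT list_id FROM list_tag WHERE list_id = l.list_id AND tag_id = 55) IS NULL
      AND (SELECT list_id FROM list_tag WHERE list_id = l.list_id AND tag_id = 246403) IS NOT NULL
    ORDER BY l.list_id
"""

# Form 2: LEFT JOIN + IS NULL for the exclusions, INNER JOIN for the required tag.
join_form = """
    SELECT l.list_id FROM lists l
    LEFT JOIN list_tag lt1 ON lt1.list_id = l.list_id AND lt1.tag_id = 43
    LEFT JOIN list_tag lt2 ON lt2.list_id = l.list_id AND lt2.tag_id = 55
    INNER JOIN list_tag lt3 ON lt3.list_id = l.list_id AND lt3.tag_id = 246403
    WHERE lt1.list_id IS NULL AND lt2.list_id IS NULL
    ORDER BY l.list_id
"""

# Form 3: EXISTS / NOT EXISTS.
exists_form = """
    SELECT l.list_id FROM lists l
    WHERE NOT EXISTS (SELECT 1 FROM list_tag t WHERE t.list_id = l.list_id AND t.tag_id = 43)
      AND NOT EXISTS (SELECT 1 FROM list_tag t WHERE t.list_id = l.list_id AND t.tag_id = 55)
      AND EXISTS (SELECT 1 FROM list_tag t WHERE t.list_id = l.list_id AND t.tag_id = 246403)
    ORDER BY l.list_id
"""

results = {
    name: [row[0] for row in conn.execute(sql)]
    for name, sql in [("subquery", subquery_form), ("join", join_form), ("exists", exists_form)]
}
print(results)
```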
Subqueries can calculate aggregate functions on the fly.
E.g., find the minimal price of a book and get all books that are sold at that price.
1) Using subqueries:
SELECT titles, price
FROM Books, Orders
WHERE price =
(SELECT MIN(price)
FROM Orders) AND (Books.ID=Orders.ID);
2) Using JOINs (two statements: first find the minimum price, then join on that value):
SELECT MIN(price)
FROM Orders;
-----------------
2.99
SELECT titles, price
FROM Books b
INNER JOIN Orders o
ON b.ID = o.ID
WHERE o.price = 2.99;
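For comparison, the join variant can also be written as a single statement by joining against a derived table that holds the minimum. The sketch below uses Python's sqlite3 module with made-up rows for Books and Orders. The derived-table alias `m` and the column `min_price` are names introduced only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Books (ID INTEGER PRIMARY KEY, titles TEXT);
    CREATE TABLE Orders (ID INTEGER PRIMARY KEY, price REAL);
    -- Hypothetical sample data for the sketch.
    INSERT INTO Books VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO Orders VALUES (1, 2.99), (2, 5.50), (3, 2.99);
""")

# Scalar sub-query computes the minimum on the fly.
subquery_sql = """
    SELECT b.titles, o.price
    FROM Books b INNER JOIN Orders o ON b.ID = o.ID
    WHERE o.price = (SELECT MIN(price) FROM Orders)
    ORDER BY b.titles
"""

# Join against a one-row derived table instead of hard-coding 2.99.
join_sql = """
    SELECT b.titles, o.price
    FROM Books b
    INNER JOIN Orders o ON b.ID = o.ID
    INNER JOIN (SELECT MIN(price) AS min_price FROM Orders) m ON o.price = m.min_price
    ORDER BY b.titles
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()
print(subquery_rows)
print(join_rows)
```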
The difference is only seen when the second (joined) table has significantly more data than the primary table. I had an experience like this:
We had a users table with one hundred thousand entries and their membership (friendship) data with about three hundred thousand entries. A join statement fetched friends and their data, but with a great delay, although it had worked fine while the membership table held only a small amount of data. Once we changed it to use a sub-query, it worked fine.
Meanwhile, the join queries work fine against other tables that have fewer entries than the primary table.
So I think both join and sub-query statements work fine; it depends on the data and the situation.
These days, many databases can optimize both subqueries and joins, so you just have to examine your query using EXPLAIN and see which one is faster. If there is not much difference in performance, I prefer the subquery, as subqueries are simple and easier to understand.
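A minimal sketch of that workflow, assuming SQLite as a stand-in engine: its `EXPLAIN QUERY PLAN` plays the role that `EXPLAIN` plays in MySQL, though the output format differs. The tables, index name, and the helper `query_plan` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, last_name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
    CREATE INDEX idx_books_author ON books (author_id);
""")

subquery_sql = (
    "SELECT title FROM books "
    "WHERE author_id IN (SELECT id FROM authors WHERE last_name = 'Bar')"
)
join_sql = (
    "SELECT b.title FROM books b "
    "JOIN authors a ON a.id = b.author_id WHERE a.last_name = 'Bar'"
)

def query_plan(sql):
    """Return the human-readable plan steps (last column of each plan row)."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

subquery_plan = query_plan(subquery_sql)
join_plan = query_plan(join_sql)

for label, plan in (("subquery", subquery_plan), ("join", join_plan)):
    print(label)
    for step in plan:
        print("  ", step)
```

The exact plan text varies between SQLite versions, so compare the two plans on your own engine and data rather than relying on any particular wording.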
I am not a relational database expert, so take this with a grain of salt.
The general idea about sub-queries vs. joins is the path that the evaluation of the larger query takes.
In order to perform the larger query, every individual subquery has to be executed first, and then its result set is stored as a temporary table that the larger query interacts with.
This temporary table is unindexed, so any comparison requires scanning the whole result set.
In contrast, when you use a join, all indexes are in use, so a comparison only requires traversing index trees (or hash tables), which is far less expensive in terms of speed.
What I don't know is whether newer versions of the most popular relational engines run the evaluation in reverse and load only the necessary elements into the temporary table, as an optimization.
I was just thinking about the same problem, but I am using a subquery in the FROM part.
I need to join and query large tables: the "slave" table has 28 million records, but the result is only 128 rows, so a small result from big data! I am using the MAX() function on it.
First I used a LEFT JOIN, because I thought that was the correct way, that MySQL could optimize it, etc.
The second time, just for testing, I rewrote it as a sub-select instead of the JOIN.
LEFT JOIN runtime: 1.12s
SUB-SELECT runtime: 0.06s
The sub-select is 18 times faster than the join! Just like in the Chokito ad: the sub-select looks terrible, but the result ...
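The original queries are not shown, so the following is only a guess at the shape of the two variants: a LEFT JOIN that aggregates after joining, versus a sub-select in FROM that aggregates the large table first. It uses Python's sqlite3 module, and the table names `main_table` and `detail` and their rows are hypothetical. It checks that both forms return the same rows and does not reproduce the timings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_table (id INTEGER PRIMARY KEY);
    CREATE TABLE detail (main_id INTEGER, value INTEGER);
    -- Hypothetical sample data for the sketch.
    INSERT INTO main_table VALUES (1), (2), (3);
    INSERT INTO detail VALUES (1, 10), (1, 30), (2, 5);
""")

# Variant 1: join every detail row first, then aggregate.
left_join_sql = """
    SELECT m.id, MAX(d.value)
    FROM main_table m
    LEFT JOIN detail d ON d.main_id = m.id
    GROUP BY m.id
    ORDER BY m.id
"""

# Variant 2: aggregate inside a sub-select in FROM, then join the small result.
subselect_sql = """
    SELECT m.id, agg.max_value
    FROM main_table m
    LEFT JOIN (
        SELECT main_id, MAX(value) AS max_value
        FROM detail
        GROUP BY main_id
    ) agg ON agg.main_id = m.id
    ORDER BY m.id
"""

left_join_rows = conn.execute(left_join_sql).fetchall()
subselect_rows = conn.execute(subselect_sql).fetchall()
print(left_join_rows)
print(subselect_rows)
```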
It depends on several factors, including the specific query you're running and the amount of data in your database. A subquery runs the inner query first and then filters the actual results out of that result set, whereas a join produces the result in one go.
The best strategy is to test both the join solution and the subquery solution and keep whichever performs better.
If you want to speed up your query using a join:
For "inner join/join",
don't put the filter in the WHERE condition; put it in the "ON" condition instead.
E.g., instead of:
select a.id, a.name from table1 a
join table2 b on a.name=b.name
where a.id='123'
Try:
select a.id, a.name from table1 a
join table2 b on a.name=b.name and a.id='123'
For "Left/Right Join",
don't put the filter in the "ON" condition, because a left/right join still returns all rows of one of the tables, so filtering in "ON" does not remove them. Use the "WHERE" condition instead.
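To make the difference concrete, here is a small sketch with Python's sqlite3 module and made-up rows for table1 and table2 (the `info` column is an assumption). It only demonstrates which rows come back: for an inner join the two placements return the same rows, while for a left join they do not. Whether one placement is faster is something to check with EXPLAIN on your own engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id TEXT, name TEXT);
    CREATE TABLE table2 (name TEXT, info TEXT);
    -- Hypothetical sample data for the sketch.
    INSERT INTO table1 VALUES ('123', 'x'), ('456', 'y');
    INSERT INTO table2 VALUES ('x', 'X1'), ('y', 'Y1');
""")

def run(sql):
    return conn.execute(sql).fetchall()

# Inner join: filter in WHERE vs. filter in ON.
inner_where = run("""
    SELECT a.id, a.name FROM table1 a
    JOIN table2 b ON a.name = b.name
    WHERE a.id = '123'
""")
inner_on = run("""
    SELECT a.id, a.name FROM table1 a
    JOIN table2 b ON a.name = b.name AND a.id = '123'
""")

# Left join: the same move changes the result.
left_where = run("""
    SELECT a.id, b.info FROM table1 a
    LEFT JOIN table2 b ON a.name = b.name
    WHERE a.id = '123'
    ORDER BY a.id
""")
left_on = run("""
    SELECT a.id, b.info FROM table1 a
    LEFT JOIN table2 b ON a.name = b.name AND a.id = '123'
    ORDER BY a.id
""")

print(inner_where, inner_on)
print(left_where, left_on)
```

With the left join, the filter in ON keeps row '456' from table1 and merely blanks out its table2 columns, which is why the filter belongs in WHERE there.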