Which MySQL select method?

I have 3 tables:
1st -> products
2nd -> labels
3rd -> connectionTable
I want to select all products with their labels. I have two methods, and I want to ask which one is more efficient.
1st way -> two queries, using a while loop
$query = "SELECT * FROM products";
$result = mysql_query($query);
while ($row = mysql_fetch_array($result, MYSQL_ASSOC))
{
    // one extra query per product row
    $query = "SELECT *
              FROM connectionTable
              INNER JOIN labels ON labels.labelID = connectionTable.labelID
              WHERE connectionTable.productID = " . $row['productID'];
    ..
    ..
}
###################
2nd way -> using GROUP_CONCAT()
something like this:
$query = "SELECT GROUP_CONCAT(labelName)
          FROM connectionTable
          INNER JOIN labels ON labels.labelID = connectionTable.labelID
          INNER JOIN products ON products.productID = connectionTable.productID
          WHERE connectionTable.productID = " . $row['productID'] . "
          GROUP BY connectionTable.productID";
$result = mysql_query($query);

Neither approach is good: in both cases you have a query inside a loop. That is not "two serial SQL queries"; it is one query plus a second query that runs once for every row returned by the first.
What you should really be doing is joining the labels and connectionTable tables into the query outside of the loop.
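For illustration, a single query along these lines (column names assumed from the snippets above) returns every product together with its labels in one round trip; the LEFT JOINs keep products that have no labels:
SELECT p.productID,
       GROUP_CONCAT(l.labelName) AS labels   -- one comma-separated list per product
FROM products p
LEFT JOIN connectionTable c ON c.productID = p.productID
LEFT JOIN labels l ON l.labelID = c.labelID
GROUP BY p.productID;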

Dump your query in phpMyAdmin and use EXPLAIN?
Other than that, a JOIN will almost always be faster than running a nested query per row.

You should be looking to do the JOIN rather than the 2 separate queries, and EXPLAIN-planning the 2 queries vs. the JOINed one won't tell you the whole story.
You have to remember that when you execute a query, there are things going on outside of the actual query that take time and resources: validating and parsing the SQL statement (which could be somewhat mitigated by bind variables, if your version of MySQL supports them), determining the plan for retrieving the results, and network time/traffic, especially if you're accessing a DB on another host. If your first query returns a million rows, you're going to execute the second query a million times, incurring the network overhead of sending it across and returning a result set each time. That is far less efficient than sending a JOIN query once, returning the dataset as a whole, and processing it. Not to mention what you are doing to the DB's SQL cache without the use of bind variables.
Note that efficiency and response time aren't the same. What is more efficient may end up feeling slower from a user's perspective. If a user hits the page with the 2 separate queries, he/she will most likely see results quite quickly, as the individual queries in the loop execute and return small result sets that can be output to the page; returning all the rows, though, could take much longer than the single JOIN. In the single JOIN case the user may wait longer before any data is returned, but they will see the entirety of that data sooner.
I would go with the join and make sure you have indexes on the columns you are joining on (namely productID). A concatenated index on labelID, labelName may help too depending on the table size etc., but this is something you'll need to verify with EXPLAIN. See how the response time is for your user(s) and work from there.
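For example, the indexes mentioned above could look roughly like this (index names are just illustrative):
ALTER TABLE connectionTable ADD INDEX idx_conn_product (productID);
ALTER TABLE connectionTable ADD INDEX idx_conn_label (labelID);
-- concatenated index so the label name can be read from the index alone
ALTER TABLE labels ADD INDEX idx_label_id_name (labelID, labelName);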

The first version is not running 2 queries, it's running 1 + number_of_products queries. If you are loading all products, it's easy:
Run SELECT * FROM products and create a map of products, so that you can access them by ID.
Run SELECT * FROM connectionTable JOIN labels ON labels.labelID = connectionTable.labelID, iterate over the results, look up the product in the map from the previous step and add the row to the product entry.
If you want to do this only for a limited set of products, select them, collect the product IDs and use the same query as before, but with WHERE productID IN (?, ?, ?, ...).
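For instance, the limited-set variant might look like this (the ID list is just a placeholder for the productIDs collected in the first step):
SELECT connectionTable.productID, labels.*
FROM connectionTable
INNER JOIN labels ON labels.labelID = connectionTable.labelID
WHERE connectionTable.productID IN (1, 2, 3);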

Best way to deal with huge postgres database

I've created a scraper that collects huge amounts of data into a Postgres database. One of the tables has more than 120 million records and is still growing.
This creates obvious problems with even simple selects, but when I run aggregate functions like COUNT(), it takes ages to get a result. I want to expose this data through a web service, but it is definitely too slow to query directly. I thought about materialized views, but even there, if I run a more advanced query (a query with subqueries to show a trend), it fails with an out-of-memory error, and if the query is simple, it takes about an hour to complete. I am asking about general rules (I haven't managed to find any) for dealing with such huge databases.
The example queries which I use:
The simple query below takes about an hour to complete (the Items table has 120 million records; ItemTypes has about 30k and holds the names and all other information for the Items):
SELECT
IT."name",
COUNT("Items".id) AS item_count,
(CAST(COUNT("Items".id) AS DECIMAL(10,1))/(SELECT COUNT(id) FROM "Items"))*100 as percentage_of_all
FROM "Items" JOIN "ItemTypes" IT on "Items"."itemTypeId" = IT.id
GROUP BY IT."name"
ORDER BY item_count DESC;
When I run the above query with an added subquery that computes COUNT("Items".id) AS item_count from a week ago, to show the trend compared with the current count, it throws an error that memory was exceeded.
As I wrote above, I am looking for tips on how to optimize it. The first thing I plan to optimize in the above query is to move the names from ItemTypes into Items, so joining ItemTypes won't be required anymore, but I already tried mocking that up and the results aren't much better.
You don't need a subquery, so an equivalent version is:
SELECT IT."name",
COUNT(*) AS item_count,
COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () as percentage_of_all
FROM "Items" JOIN
"ItemTypes" IT
ON "Items"."itemTypeId" = IT.id
GROUP BY IT."name"
ORDER BY item_count DESC;
I'm not sure if this will fix your resource problem. In addition, this assumes that all items have a valid ItemType. If that is not the case, use a LEFT JOIN instead of JOIN.
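Since you mentioned materialized views: one option, sketched here with the query above, is to precompute the aggregation on a schedule so the web service never runs it per request:
CREATE MATERIALIZED VIEW item_type_counts AS
SELECT IT."name",
       COUNT(*) AS item_count,
       COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS percentage_of_all
FROM "Items" JOIN
     "ItemTypes" IT
     ON "Items"."itemTypeId" = IT.id
GROUP BY IT."name";
-- refresh in the background (e.g. from cron), not from the web request path
REFRESH MATERIALIZED VIEW item_type_counts;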

How to improve the performance of a 10 min running query?

I have a query which is taking approximately 10 mins to execute and produce the results. When I try to break it into parts and run it, it seems to run fine, within seconds.
I tried modifying the subselects in the top and bottom portions of the query to determine whether they were causing the issue, but they were not; each returned results within 3 seconds.
I am trying to learn to read the estimated execution plan, but it is confusing and I find it hard to trace the issue through it.
Can anyone point out any mistakes I've made that are making the query run so long?
Select Distinct
PostExtended.BatchNum,
post.ControlNumStatus,
post.AccountSeg,
Post.PostDat
From
Post
Post Records
join (Select Post, MAX(Dist) as Dist, COUNT(fkglDist) as RecordCount From PostExtend WITH (NOLOCK) Group By flPost) as PostExtender on Post.PK = PostExtender.flPost
join glPostExtended WITH (NOLOCK) on glPostExtendedLimiter.Post = glPostExtended.Post and (PostExtendedLimiter.fkglDist = PostExtend.Dist or PostExtend.Dist is null)
join (select lP.fkosControlNumberStatus, lP.SourceJENumber, AccountSegment,
sum(case
............
from Post WITH (NOLOCK)
join AccountingPeriod WITH (NOLOCK) on AccountingPeriod.pk = lP.fkglAccountingPeriod
join FiscalYear WITH (NOLOCK) on FiscalYear.pk = AccountingPeriod.FiscalYear
join Account WITH (NOLOCK) on Account.pk = FiscalYear.Account
where FiscalYear.Period = #Date
and glP.fkMLSosCodeEntryType = 2202
group by glP.fkosControlNumberStatus, glP.SourceNumber, AccountSeg) post on post.ControlNumStatus = Post.fkControlNumberStatus and postdata.SourceJENumber = glPost.SourceJENumber
where post.AmountT <> 0)......
Group by
Subqueries are very often the source of problems.
I would try to:
separate the postdata subquery from the main query,
save its result into a temporary table, or even a table variable,
put a clustered index on the fkosControlNumberStatus and SourceJENumber fields,
join this temporary table back to the main query.
Sometimes the results of these simple steps are pleasantly surprising; a rough sketch follows.
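Something like this, where the real column list and SUM(CASE ...) logic replace the simplified aggregate, and the Amount column is an assumption:
SELECT fkosControlNumberStatus,
       SourceJENumber,
       SUM(Amount) AS AmountT          -- stands in for the original SUM(CASE ...) expressions
INTO #postdata                         -- materialize the aggregate once
FROM Post WITH (NOLOCK)
GROUP BY fkosControlNumberStatus, SourceJENumber;
-- clustered index on the fields used to join back to the main query
CREATE CLUSTERED INDEX IX_postdata ON #postdata (fkosControlNumberStatus, SourceJENumber);
-- then join #postdata in the main query in place of the inline aggregate subquery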
This is a fairly complex query. You are joining on Aggregate Queries (with GROUP BY).
The first thing I would do is see how long it takes to run each of the join queries. One of these may run very fast, while another may run very long. So, you may not really need to optimize the entire query--just one of the joined queries.
Another way to do it is to just start eliminating joins one by one, then run the entire query and see how fast it goes. When you see a really significant decrease in time, you've found the culprit.
Typically, one thing that can add a lot of CPU is comparisons, so the SUMs with CASE statements might be the biggest suspect.
Have you used the Database Engine Tuning Advisor? If all else fails, go with that and see what it tells you.
So, maybe try this approach:
Take away the CASE statements inside the SUM expressions in that last join.
Remove the last JOIN with all the SUMs.
Remove the first join with the GROUP BY and the MAX expression.
That would be my strategy.

How to select related objects in a query in rails?

Coming from Django, we have something called select_related that does a join when executing a query, so that related objects' data is also fetched.
e.g.
# rails + select_related
p = Person.where(job: 1).select_related("job_name")
# so the return query list has objects that
# can call person.job.job_name without another query
# because selected_related did a join on jobs table
How do you do this in rails/activerecord?
In Rails, it's more common to use includes to handle join tables. It will either do a LEFT OUTER JOIN (when a where condition needs to reference the joined table) or issue one more query such as select * from jobs where id IN (1,3,4,5), which solves the N+1 query problem.
In your case I would:
p = Person.where(job: 1).includes(:job).first
job_name = p.job.job_name
This does still use two queries, but this isn't the use case includes is optimized for (and this case doesn't really need optimizing); with a more complicated case it gets better:
people = Person.where(status: 'active').includes(:job)
people.each { |p| puts p.job.job_name }
In this case, it will still only execute 2 queries.
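Roughly, those 2 queries look like the following (table and column names assumed from the models above; the ID list comes from the first result set):
SELECT * FROM people WHERE status = 'active';
SELECT * FROM jobs WHERE id IN (1, 3, 4, 5);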

How to model and query objects in relational databases?

I have a complex database schema for a dictionary. Each object (essentially a translation) is similar to this:
Entry {
keyword;
examples;
tags;
Translations;
}
with
Translation {
text;
tags;
examples;
}
and
Example {
text;
translation;
phonetic_script;
}
i.e. tags (i.e. grammar) can belong either to the keyword itself or to the translation (grammar of the foreign language), and similarly examples can belong either to the translation itself (i.e. explaining the foreign word) or to the text in the entry. I ended up with this kind of relational design:
entries(id,keyword,)
tags(tag)
examples(id,text,...)
entrytags(entry_id,tag)
entryexamples(entry_id,example_id)
translations(id,belongs_to_entry,...)
translationtags(transl_id, tag)
translationexamples(transl_id,example_id)
My main task is querying this database. Say I search for "foo", my current way of handling is:
query all entries with foo, get ids A
foreach id in A
query all examples belonging to id
query all tags belonging to id
query all translations belonging to A, store their ids in B
foreach tr_id in B
query all tags belonging to tr_id
query all examples belonging to tr_id
to rebuild my objects. This looks cumbersome to me, and it is slow. I do not see how I could significantly improve this by using joins or otherwise. I have a hard time modeling these objects as relations in the database. Is this a proper design?
How can I make this more efficient to improve query time?
Each query called in the loop takes, at a minimum, a certain base duration to execute, even for trivial queries. Many environmental factors contribute to what this duration is, but for now let's assume it's 10 milliseconds. If the first query matches 100 entries, then at a minimum 301 queries are called in total, each taking 10 ms, for a total of 3 seconds. The number of loop iterations varies, which can contribute to a substantial variation in performance.
Restructuring the queries with joins creates more complex queries, but the total number of queries called can be reduced to a fixed number, 4 in the queries below. Suppose each query now takes 50 ms to execute because it is more complex; the total duration becomes 200 ms, a substantial decrease from 3000 ms.
The 4 queries shown below should come close to achieving the desired result. There are other ways to write them, such as using subqueries or listing the tables in the FROM clause, but these show how to do it with JOINs. The condition entries.keyword = 'foo' stands in for whatever condition the original query uses to select the entries.
It is worth noting that if the foo condition on entries is very expensive to compute, other optimizations may be needed to further improve performance. In these examples the condition is a simple comparison, which is quick to look up in an index; a LIKE condition, which may require a full table scan, would not work as well with these queries.
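For instance, an index like this (the name is illustrative) keeps that equality lookup cheap:
CREATE INDEX idx_entries_keyword ON entries (keyword);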
The following query selects all examples matching the original query. The condition from the original query is expressed as a WHERE clause on the entries.keyword column.
SELECT entries.id, examples.text
FROM entries
INNER JOIN entryexamples
ON (entries.id = entryexamples.entry_id)
INNER JOIN examples
ON (entryexamples.example_id = examples.id)
WHERE entries.keyword = 'foo';
This query selects tags matching the original query. Only two joins are used in this case because the entrytags.tag column is what is needed and joining with tags would only provide the same value.
SELECT entries.id, entrytags.tag
FROM entries
INNER JOIN entrytags
ON (entries.id = entrytags.entry_id)
WHERE entries.keyword = 'foo';
This query selects the translation tags for the original query. This is similar to the previous query to select the entrytags but another layer of joins is used here for the translations.
SELECT entries.id, translationtags.tag
FROM entries
INNER JOIN translations
ON (entries.id = translations.belongs_to_entry)
INNER JOIN translationtags
ON (translations.id = translationtags.transl_id)
WHERE entries.keyword = 'foo';
The final query does the same as the first query for the examples but also includes the additional joins. It's getting to be a lot of joins but in general should perform significantly better than looping through and executing individual queries.
SELECT entries.id, examples.text
FROM entries
INNER JOIN translations
ON (entries.id = translations.belongs_to_entry)
INNER JOIN translationexamples
ON (translations.id = translationexamples.transl_id)
INNER JOIN examples
ON (translationexamples.example_id = examples.id)
WHERE entries.keyword = 'foo';

OR query performance and strategies with Postgresql

In my application I have a table of application events that are used to generate a user-specific feed of application events. Because it is generated using an OR query, I'm concerned about performance of this heavily used query and am wondering if I'm approaching this wrong.
In the application, users can follow both other users and groups. When an action is performed (e.g., a new post is created), a feed_item record is created with the actor_id set to the user's id and the subject_id set to the id of the group in which the action was performed; actor_type and subject_type are set to the class names of the models. Since users can follow both groups and users, I need to generate a query that checks both the actor_id and the subject_id, and it needs to select distinct records to avoid duplicates. Since it's an OR query, I can't use a normal index. And since a record is created every time an action is performed, I expect this table to accumulate a lot of records rather quickly.
Here's the current query (the followings table joins users to feeders, i.e., users and groups):
SELECT DISTINCT feed_items.* FROM "feed_items"
INNER JOIN "followings"
ON (
(followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type)
OR
(followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type)
)
WHERE (followings.follower_id = 42) ORDER BY feed_items.created_at DESC LIMIT 30 OFFSET 0
So my questions:
Since this is a heavily used query, is there a performance problem here?
Is there any obvious way to simplify or optimize this that I'm missing?
What you have is called an exclusive arc and you're seeing exactly why it's a bad idea. The best approach for this kind of problem is to make the feed item type dynamic:
Feed Items: id, type (A or S for Actor or Subject), subtype (replaces actor_type and subject_type)
and then your query becomes
SELECT DISTINCT fi.*
FROM feed_items fi
JOIN followings f ON f.feeder_id = fi.id AND f.feeder_type = fi.type AND f.feeder_subtype = fi.subtype
or similar.
This may not completely or exactly represent what you need to do, but the principle is sound: you need to eliminate the reason for the OR condition by changing your data model so that it lends itself to having performant queries written against it.
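As a rough illustration, assuming the remodeled columns described above, an index along these lines would let the planner resolve the follower filter and the join keys directly from the index:
CREATE INDEX idx_followings_feeder
    ON followings (follower_id, feeder_type, feeder_subtype, feeder_id);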
EXPLAIN ANALYZE and time the query to see if there is a problem.
Also, you could try expressing the query as a UNION:
SELECT x.* FROM
(
SELECT feed_items.* FROM feed_items
INNER JOIN followings
ON followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type
WHERE (followings.follower_id = 42)
UNION
SELECT feed_items.* FROM feed_items
INNER JOIN followings
ON followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type
WHERE (followings.follower_id = 42)
) AS x
ORDER BY x.created_at DESC
LIMIT 30
But again explain analyze and benchmark.
To find out if there is a performance problem, measure it; PostgreSQL can EXPLAIN it for you.
I don't think the query needs simplifying. If you identify a performance problem, you may need to revise your indexes.
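For example, running the current query under EXPLAIN shows the actual plan and timings:
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT feed_items.*
FROM feed_items
INNER JOIN followings
ON (followings.feeder_id = feed_items.subject_id
    AND followings.feeder_type = feed_items.subject_type)
OR (followings.feeder_id = feed_items.actor_id
    AND followings.feeder_type = feed_items.actor_type)
WHERE followings.follower_id = 42
ORDER BY feed_items.created_at DESC
LIMIT 30 OFFSET 0;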