Is it possible there is a faster way to perform this SELECT query? - sql

UPDATE (based on everyone's responses):
I'm thinking of changing my structure so that I have a new table called prx_tags_sportsitems. I will be removing prx_lists entirely. prx_tags_sportsitems will act as a reference (junction) table of IDs, replacing prx_lists.ListString, which used to store the IDs of the tags belonging to each prx_sportsitem.
The new relation will be like so:
prx_tags_sportsitems.TagID <--> prx_tags.ID
prx_sportsitems.ID <--> prx_tags_sportsitems.OwnerID
prx_tags will contain the TagName. This is so I can still maintain each "tag" as a separate unique entity.
My new query for finding all sportsitems with the tag "aerobic" will become something similar to as follows:
SELECT prx_sportsitems.* FROM prx_sportsitems, prx_tags_sportsitems
WHERE prx_tags_sportsitems.OwnerID = prx_sportsitems.ID
AND prx_tags_sportsitems.TagID = (SELECT ID FROM prx_tags WHERE TagName = 'aerobic')
ORDER BY prx_sportsitems.DateAdded DESC LIMIT 0,30;
Or perhaps I can do something with the "IN" clause, but I'm unsure about that just yet.
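Something like this, perhaps (just a rough sketch, untested; it assumes TagName is unique in prx_tags):
SELECT prx_sportsitems.*
FROM prx_sportsitems
WHERE prx_sportsitems.ID IN (
    SELECT prx_tags_sportsitems.OwnerID
    FROM prx_tags_sportsitems
    WHERE prx_tags_sportsitems.TagID = (SELECT ID FROM prx_tags WHERE TagName = 'aerobic')
)
ORDER BY prx_sportsitems.DateAdded DESC
LIMIT 0,30;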
Before I go ahead with this huge modification to my scripts, does everyone approve? comments? Many thanks!
ORIGINAL POST:
When it comes to MySQL queries, I'm rather a novice. When I originally designed my database I did something rather silly, because it was the only solution I could find. Now it appears to be putting too much load on my MySQL server, since each of these queries takes 0.2 seconds where I believe it could be more like 0.02 seconds with a better query (or table design, if it comes to that!). I want to avoid rebuilding my entire site structure, since it's deeply tied to the current design, so I'm hoping there's a faster MySQL query possible.
I have three tables in my database:
Sports Items Table
Tags Table
Lists Table
Each sports item has multiple tag names (categories) assigned to it. Each "tag" is stored as a separate row in prx_tags. I create a "list" in prx_lists for the sports item in prx_sportsitems and link them through prx_lists.OwnerID, which links to prx_sportsitems.ID.
This is my current query (which finds all sports items which have the tag 'aerobic'):
SELECT prx_sportsitems.*
FROM prx_sportsitems, prx_lists
WHERE prx_lists.ListString LIKE (CONCAT('%',(SELECT prx_tags.ID
FROM prx_tags
WHERE prx_tags.TagName = 'aerobic'
limit 0,1),'#%'))
AND prx_lists.ListType = 'Tags-SportsItems'
AND prx_lists.OwnerID = prx_sportsitems.ID
ORDER BY prx_sportsitems.DateAdded
DESC LIMIT 0,30
To help clarify more, the list that contains all of the tag ids is inside a single field called ListString, and I structure it like so: " #1 #2 #3 #4 #5" ...and from that, the above query "concats" the prx_tags.ID whose TagName is 'aerobic'.
My thoughts are that there probably isn't a faster query, and that I need to simply accept I should do something simpler, such as putting all the tags in a list directly inside prx_sportsitems in a new field called "TagsList", and then I can simply run a query which does SELECT * FROM prx_sportsitems WHERE TagsList LIKE '%aerobic%' - however, I want to avoid needing to redesign my entire site. I'm really regretting not looking into optimization beforehand :(

Whenever I am writing a query, and think I need to use LIKE, an alarm goes off in my head that maybe there is a better design. This is certainly the case here.
You need to redesign the prx_lists table. From what you've said, it's hard to say what the exact schema should be, but here is my best guess:
prx_lists should have three columns: OwnerID, ListType, and TagName. Then you would have one row for each tag an OwnerID has. Your above query would now look something like this:
SELECT prx_sportsitems.*
FROM prx_sportsitems, prx_lists
WHERE prx_lists.TagName = 'aerobic'
AND prx_lists.OwnerID = prx_sportsitems.ID
This is a MUCH more efficient query. Maybe ListType doesn't belong in that table either, but it's hard to say without more info about what that column is used for.
And don't forget to create the appropriate indexes! This will improve performance.
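For example, something along these lines (index names are just placeholders, and the exact columns depend on how you query):
CREATE INDEX idx_prx_lists_tagname ON prx_lists (TagName, OwnerID);
CREATE INDEX idx_prx_sportsitems_dateadded ON prx_sportsitems (DateAdded);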
Refactoring your database schema might be painful, but it seems to me the only way to fix your long-term problem.

To help clarify more, the list that contains all of the tag ids is inside a single field called ListString and I structure it like so: " #1 #2 #3 #4 #5" ...and from that, the above query "concats" the prx_tags.ID whose TagName is 'aerobic'.
There's your problem right there. Don't store delimited data in a DB field (ListString). Modeling data this way is going to make it extremely difficult/impossible to write performant queries against it.
Suggestion: Break the contents of ListString out into a related table with one row for each item.

the list that contains all of the tag ids is inside a single field called ListString and I structure it like so: " #1 #2 #3 #4 #5" ...and from that, the above query "concats" the prx_tags.ID whose TagName is 'aerobic'.
Not only is that bad (you're storing denormalized data), but the separator character is also an uncommon one.
Interim Improvement
The quickest way to improve things is to change the separator sequence you're currently using (" #") to a comma:
UPDATE PRX_LISTS
SET liststring = REPLACE(liststring, ' #', ',')
Then, you can use MySQL's FIND_IN_SET function:
SELECT si.*
FROM PRX_SPORTSITEMS si
JOIN PRX_LISTS l ON l.ownerid = si.id
JOIN PRX_TAGS t ON FIND_IN_SET(t.id, l.liststring) > 0
WHERE t.tagname = 'aerobic'
AND l.listtype = 'Tags-SportsItems'
ORDER BY si.DateAdded DESC
LIMIT 0, 30
Long Term Solution
As you've experienced, searching for specifics in denormalized data does not perform well and makes queries overly complicated. You need to change the PRX_LISTS table so that one row contains a unique combination of a PRX_SPORTSITEMS id and a PRX_TAGS id, plus whatever other columns you might need. I'd recommend renaming it as well - lists of what, exactly? The name is too generic:
CREATE TABLE SPORTSITEM_TAGS_XREF (
  sportsitem_ownerid INT,
  tag_id INT,
  PRIMARY KEY (sportsitem_ownerid, tag_id)
)
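With that table in place, the search would look something like this (a sketch only; it assumes the columns above plus the existing PRX_TAGS and PRX_SPORTSITEMS tables):
SELECT si.*
FROM PRX_SPORTSITEMS si
JOIN SPORTSITEM_TAGS_XREF x ON x.sportsitem_ownerid = si.id
JOIN PRX_TAGS t ON t.id = x.tag_id
WHERE t.tagname = 'aerobic'
ORDER BY si.DateAdded DESC
LIMIT 0, 30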

Don't make any changes without looking at the execution plan. (And post that here, too, by editing your original question.)
The way your LIKE clause is constructed, MySQL can't use an index.
The LIKE clause is a symptom. Your table structure is more likely the problem.
You'll probably get at least one order of magnitude improvement by building sane tables.
I'm really regretting not looking into optimization beforehand
That's not what caused your problem. Being ignorant of the fundamentals of database design caused your problem. (That's an observation, not a criticism. You can fix ignorance. You can't fix stupid.)
Later:
Post your existing table structure and your proposed changes. You'll be a lot happier with our ability to predict what your code will do than with our ability to predict what your description of a piece of your code will do.


SQL Server - Select * vs Select Column in a Stored Procedure

In an ad-hoc query using Select ColumnName is better, but does it matter in a Stored Procedure after it's saved in the plan guide?
Always explicitly state the columns, even in a stored procedure. SELECT * is considered bad practice.
For instance you don't know the column order that will be returned, some applications may be relying on a specific column order.
I.e. the application code may look something like:
Id = Column[0]; // bad design
If you've used SELECT *, Id may no longer be the first column, and the application may crash. Also, if the database is modified and an additional 5 fields have been added, you are returning additional fields that may not be relevant.
These topics always elicit blanket statements like ALWAYS do this or NEVER do that, but the reality is, like with most things it depends on the situation. I'll concede that it's typically good practice to list out columns, but whether or not it's bad practice to use SELECT * depends on the situation.
Consider a variety of tables that all have a common field or two. For example, we have a number of tables with different layouts, but they all have 'access_dt' and 'host_ip'. These tables aren't typically used together, but there are instances when suspicious activity prompts a full report of all activity. These reports aren't common and are manually reviewed, so they are well served by a stored procedure that generates a report by looping through every log table and using SELECT *, leveraging the common fields between all tables.
It would be a waste of time to list out fields in this situation.
Again, I agree that it's typically good practice to list out fields, but it's not always bad practice to use SELECT *.
Edit: Tried to clarify example a bit.
It's a best practice in general, but if you actually do need all the columns, you're better off with the more quickly read "SELECT *".
The important thing is to avoid retrieving data you don't need.
It is considered bad practice in situations like stored procedures when you are querying large datasets with table scans. You want to avoid using table scans because it causes a hit to the performance of the query. It's also a matter of readability.
Some other food for thought: if your query has any joins at all, you are returning data you don't need, because the data in the join columns is duplicated. Further, if the table is later changed to add things you don't need (such as columns for audit purposes), you may be returning data to the user that they should not be seeing.
Nobody has mentioned the case when you need ALL columns from a table, even if the columns change, e.g. when archiving table rows as XML. I agree one should not use "SELECT *" as a replacement for "I need all the columns that currently exist in the table," just out of laziness or for readability. There needs to be a valid reason. It could be essential when one needs "all the columns that could exist in the table."
Also, how about when creating "wrapper" views for tables?

How can I improve this endless query?

I've got a table with close to 5 million rows. Each one of them has one text column where I store my XML logs.
I am trying to find out if there's some log having
<node>value</node>
I've tried with
SELECT top 1 id_log FROM Table_Log WHERE log_text LIKE '%<node>value</node>%'
but it never finishes.
Is there any way to improve this search?
PS: I can't drop any log
A wildcarded query such as '%<node>value</node>%' will result in a full table scan (ignoring indexes) as it can't determine where within the field it'll find the match. The only real way I know of to improve this query as it stands (without things like partitioning the table etc which should be considered if the table is logging constantly) would be to add a Full-Text catalog & index to the table in order to provide a more efficient search over that field.
Here is a good reference that should walk you through it. Once this has been completed you can use things like the CONTAINS and FREETEXT operators that are optimised for this type of retrieval.
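Roughly, the setup might look like this (only a sketch; it assumes a unique index named PK_Table_Log already exists on id_log, and note that full-text search tokenizes words, so this matches the terms "node" and "value" rather than the literal markup):
CREATE FULLTEXT CATALOG LogCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON Table_Log (log_text) KEY INDEX PK_Table_Log ON LogCatalog;

-- then query with CONTAINS instead of a leading-wildcard LIKE
SELECT TOP 1 id_log
FROM Table_Log
WHERE CONTAINS(log_text, '"node value"');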
Apart from implementing full-text search on that column and indexing the table, maybe you can narrow the results by another parameters (date, etc).
Also, you could add a table field (varchar type) called "Tags" which you can populate when inserting a row. This field would register "keywords, tags" for this log. This way, you could change your query with this field as condition.
Unfortunately, about the only way I can see to optimize that is to implement full-text search on that column, but even that will be hard to construct to where it only returns a particular value within a particular element.
I'm currently doing some work where I'm also storing XML within one of the columns. But I'm assuming any queries needed on that data will take a long time, which is okay for our needs.
Another option has to do with storing the data in a binary column, and then SQL Server has options for specifying what type of document is stored in that field. This allows you to, for example, implement more meaningful full-text searching on that field. But it's hard for me to imagine this will efficiently do what you are asking for.
You are using a LIKE query with a leading wildcard.
No index involved = no good.
Unfortunately, there is nothing you can do with what you currently have to speed this up.
I don't think it will help but try using the FAST x query hint like so:
SELECT id_log
FROM Table_Log
WHERE log_text LIKE '%<node>value</node>%'
OPTION(FAST 1)
This should optimise the query to return the first row.

Combine contents of multiple rows in 3NF mysql tables

Having dutifully normalised all my data, I'm having a problem combining 3NF rows into a single row for output.
Up until now I've been doing this with server-side coding, but for various reasons I now need to select all rows related to another row, and combine them in a single row, all in MySQL...
So to try and explain:
I have three tables.
Categories
Articles
CategoryArticles_3NF
A category contains CategoryID + titles, descriptions etc. It can contain any number of articles in the Articles table, consisting of ArticleID + a text field to house the content.
The CategoryArticles table is used to link the two, so contains both the CategoryID and the ArticleID.
Now, if I select a Category record, and I JOIN the Articles table via the linking CategoryArticles_3NF table, the result is a separate row for each article contained within that category.
The issue is that I want to output one single row for each category, containing content from all articles within.
If that sounds like a ridiculous request, it's because it is. I'm just using articles as a good way to describe the problem. My data is actually somewhat different.
Anyway - the only way I can see to achieve this is to use a 'GROUP_CONCAT' statement to group the content fields together - the problem with this is that there is a limit to how much data this can return, and I need it to be able to handle significantly more.
Can anyone tell me how to do this?
Thanks.
Without more information, this sounds like something that should be done in the front end.
If you need to, you can increase the size limit of GROUP_CONCAT by setting the system variable group_concat_max_len. It has a limit based on max_allowed_packet, which you can also increase. I think that the max size for a packet is 1GB. If you need to go higher than that then there are some serious flaws in your design.
EDIT: So that this is in the answer and not just buried in the comments...
If you don't want to change the group_concat_max_len globally then you can change it for just your session with:
SET SESSION group_concat_max_len = <your value here>
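Put together, it would look something like this (the column names are my guesses based on the description above):
SET SESSION group_concat_max_len = 1000000;

SELECT c.CategoryID,
       GROUP_CONCAT(a.Content ORDER BY a.ArticleID SEPARATOR '\n') AS AllContent
FROM Categories c
JOIN CategoryArticles_3NF ca ON ca.CategoryID = c.CategoryID
JOIN Articles a ON a.ArticleID = ca.ArticleID
GROUP BY c.CategoryID;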

'SELECT *' from inner joined tables

How do you select all fields of two joined tables, without having conflicts with the common field?
Suppose I have two tables, Products and Services. I would like to make a query like this:
SELECT Products.*, Services.*
FROM Products
INNER JOIN Services ON Products.IdService = Services.IdService
The problem with this query is that IdService will appear twice and lead to a bunch of problems.
The alternative I found so far is to discriminate every field from Products except the IdService one. But this way I'll have to update the query every time I add a new field to Products.
Is there a better way to do this?
What are the most common SQL anti-patterns?
You've hit anti-pattern #1.
The better way is to provide a field list. One way to get a quick field list is to run
sp_help tablename
And if you want to create a view from this query - using select * gets you in more trouble. SQL Server captures the column list at the time the view is created. If you edit the underlying tables and don't recreate the view - you're signing up for trouble (I had a production fire of this nature - view was against tables in a different database though).
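(If you do end up with such a view and later change the underlying tables, SQL Server's sp_refreshview can refresh the view's metadata without recreating it; the view name below is just a placeholder.)
EXEC sp_refreshview 'dbo.MyView';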
You should NEVER have SELECT * in production code (well, almost never, but the times where it is justified can be easily counted).
As far as I am aware you'll have to avoid SELECT *, but this isn't really a problem.
SELECT * is usually regarded as a problem waiting to happen, for the very reason you quote as an advantage! Extra result columns appearing in queries after the database has been modified will usually cause problems.
Does your dialect of SQL support COMPOSE? COMPOSE gets rid of the extra copy of the column that's used on an equijoin, like the one in your example.
As others have said the Select * is bad news especially if other fields are added to the tables in which you are querying. You should select out the exact fields you want from the tables and can use an alias for fields with the same names or just use table.columnName.
Do not use *. Use something like this:
SELECT P.field1 AS 'Field from P'
, P.field2
, S.field1 AS 'Field from S'
, S.field4
FROM Products P
INNER JOIN
Services S
ON P.IdService = S.IdService
That is correct: list the fields you want (in SQL Server you can drag them over from the object browser, so you don't have to type them all). Incidentally, if there are fields your specific query does not need, do not list them. Returning them creates extra work for the server and uses up extra network resources, and can be one of the causes of poor performance when it is done throughout your system and such wasteful queries are run thousands of times a day.
As to it being a maintenance problem, you only need to add the fields if the part of the application that uses your query would be affected by them. If you don't know what effect the new field would have or where you need to add it, you shouldn't be adding it. Also, adding new fields unexpectedly through the use of SELECT * can cause maintenance problems as well. Creating performance problems to avoid doing maintenance (maintenance you may never even need to do, as column changes should be rare; if they aren't, you need to look at your design) is pretty short-sighted.
The best way is to specify the exact fields that you want from the query. You shouldn't use * anyway.
It is convenient to use * to get all fields, but it doesn't produce robust code. Any change in the table will change the result that is returned from the query, and that is not always desirable.
You should return only the data that you really want from the query, specified in the exact order you want it. That way the result looks exactly the same even if you add fields to the table or change the order of the fields in the table.
It's a little more work to specify the exact output, but in the long run it usually pays off. When you make a change, only what you actually change is affected; you don't get cascading effects that break code you didn't even know was affected.

Database Design for Tagging [closed]

How would you design a database to support the following tagging features:
items can have a large number of tags
searches for all items that are tagged with a given set of tags must be quick (the items must have ALL tags, so it's an AND-search, not an OR-search)
creating/writing items may be slower to enable quick lookup/reading
Ideally, the lookup of all items that are tagged with (at least) a set of n given tags should be done using a single SQL statement. Since the number of tags to search for as well as the number of tags on any item are unknown and may be high, using JOINs is impractical.
Any ideas?
Thanks for all the answers so far.
If I'm not mistaken, however, the given answers show how to do an OR-search on tags. (Select all items that have one or more of n tags). I am looking for an efficient AND-search. (Select all items that have ALL n tags - and possibly more.)
Here's a good article on tagging Database schemas:
http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/
along with performance tests:
http://howto.philippkeller.com/2005/06/19/Tagsystems-performance-tests/
Note that the conclusions there are very specific to MySQL, which (at least in 2005 at the time that was written) had very poor full text indexing characteristics.
About ANDing: It sounds like you are looking for the "relational division" operation. This article covers relational division in a concise and yet comprehensible way.
About performance: A bitmap-based approach intuitively sounds like it will suit the situation well. However, I'm not convinced it's a good idea to implement bitmap indexing "manually", as digiguru suggests: it sounds like a complicated situation whenever new tags are added(?). But some DBMSes (including Oracle) offer bitmap indexes which may be of use, because a built-in indexing system does away with the potential complexity of index maintenance; additionally, a DBMS offering bitmap indexes should be able to consider them properly when building the query plan.
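For reference, a minimal relational-division sketch (using the items / item_tag naming that appears elsewhere in this thread; the tag ids are placeholders). It selects the items for which none of the searched tags is missing:
SELECT i.*
FROM items i
WHERE NOT EXISTS (
    SELECT 1
    FROM tags t
    WHERE t.id IN (1, 2, 3)        -- the searched tag set
      AND NOT EXISTS (
            SELECT 1
            FROM item_tag it
            WHERE it.item_id = i.id
              AND it.tag_id = t.id))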
I just wanted to highlight that the article that @Jeff Atwood links to (http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/) is very thorough (it discusses the merits of 3 different schema approaches) and has a good solution for the AND queries that will usually perform better than what has been mentioned here so far (i.e. it doesn't use a correlated subquery for each term). Also lots of good stuff in the comments.
ps - The approach that everyone is talking about here is referred to as the "Toxi" solution in the article.
I don't see a problem with a straightforward solution: Table for items, table for tags, crosstable for "tagging"
Indices on cross table should be enough optimisation. Selecting appropriate items would be
SELECT * FROM items WHERE id IN
(SELECT DISTINCT item_id FROM item_tag WHERE
tag_id = tag1 OR tag_id = tag2 OR ...)
and an AND search would be
SELECT * FROM items WHERE
EXISTS (SELECT 1 FROM item_tag WHERE id = item_id AND tag_id = tag1)
AND EXISTS (SELECT 1 FROM item_tag WHERE id = item_id AND tag_id = tag2)
AND ...
which is, admittedly, not so efficient for comparing a large number of tags. If you maintain tag counts in memory, you could make the query start with the least common tags, so the AND sequence would be evaluated more quickly. Depending on the expected number of tags to be matched and the likelihood of matching any single one of them, this could be an OK solution; if you are to match 20 tags and expect that some random item will match 15 of them, then this would still be heavy on the database.
You might want to experiment with a not-strictly-database solution like a Java Content Repository implementation (e.g. Apache Jackrabbit) and use a search engine built on top of that like Apache Lucene.
This solution with the appropriate caching mechanisms would possibly yield better performance than a home-grown solution.
However, I don't really think that in a small or medium-sized application you would require a more sophisticated implementation than the normalized database mentioned in earlier posts.
EDIT: with your clarification it seems more compelling to use a JCR-like solution with a search engine. That would greatly simplify your programs in the long run.
The easiest method is to create a tags table.
Target_Type -- in case you are tagging multiple tables
Target -- The key to the record being tagged
Tag -- The text of a tag
Querying the data would be something like:
Select distinct target from tags
where tag in ([your list of tags to search for here])
and target_type = [the table you're searching]
UPDATE
Based on your requirement to AND the conditions, the query above would turn into something like this
select target
from (
    select target, count(*) cnt
    from tags
    where tag in ([your list of tags to search for here])
    and target_type = [the table you're searching]
    group by target
) t
where cnt = [number of tags being searched]
I'd second @Zizzencs' suggestion that you might want something that's not totally (R)DB-centric.
Somehow, I believe that using plain nvarchar fields to store the tags, with some proper caching/indexing, might yield faster results. But that's just me.
I've implemented tagging systems using 3 tables to represent a many-to-many relationship before (Item, Tags, ItemTags), but since I suppose you will be dealing with tags in a lot of places, I can tell you that having 3 tables that must be manipulated/queried simultaneously all the time will definitely make your code more complex.
You might want to consider if the added complexity is worth it.
You won't be able to avoid joins and still be somewhat normalized.
My approach is to have a Tag Table.
TagId (PK)| TagName (Indexed)
Then, you have a TagXREFID column in your items table.
This TagXREFID column is a FK to a 3rd table, I'll call it TagXREF:
TagXrefID | ItemID | TagId
So, to get all tags for an item would be something like:
SELECT Tags.TagId,Tags.TagName
FROM Tags,TagXref
WHERE TagXref.TagId = Tags.TagId
AND TagXref.ItemID = #ItemID
And to get all items for a tag, I'd use something like this:
SELECT * FROM Items, TagXref
WHERE TagXref.TagId IN
( SELECT Tags.TagId FROM Tags
WHERE Tags.TagName = #TagName )
AND Items.ItemId = TagXref.ItemId;
To AND a bunch of tags together, you would have to modify the above statement slightly, adding AND Tags.TagName = #TagName1 AND Tags.TagName = #TagName2 etc., and dynamically build the query.
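Note that repeated equality tests on the same column won't match any single row, so the AND combination is usually expressed with IN plus a GROUP BY/HAVING count instead; a sketch (it assumes ItemId is the primary key of Items, and #TagName1/#TagName2 follow the placeholder style above):
SELECT Items.ItemId
FROM Items
JOIN TagXref ON TagXref.ItemID = Items.ItemId
JOIN Tags ON Tags.TagId = TagXref.TagId
WHERE Tags.TagName IN (#TagName1, #TagName2)
GROUP BY Items.ItemId
HAVING COUNT(DISTINCT Tags.TagId) = 2   -- the number of tags being ANDed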
What I like to do is have a number of tables that represent the raw data, so in this case you'd have
Items (ID pk, Name, <properties>)
Tags (ID pk, Name)
TagItems (TagID fk, ItemID fk)
This works fast for writes and keeps everything normalized, but you may also note that you'll need to join the tables twice for every further tag you want to AND, so reads get slow.
A solution to improve reads is to create a caching table on command, by setting up a stored procedure that essentially creates a new table representing the data in a flattened format...
CachedTagItems(ID, Name, <properties>, tag1, tag2, ... tagN)
Then you can consider how often the cached tag-item table needs to be kept up to date: if it's on every insert, call the stored procedure from the insert event; if it's an hourly task, set up an hourly job to run it.
Now to get really clever in data retrieval, you'll want to create a stored procedure to get data from the tags. Rather than using nested queries in a massive case statement, you want to pass in a single parameter containing a list of tags you want to select from the database, and return a record set of Items. This would be best in binary format, using bitwise operators.
In binary format, it is easy to explain. Let's say there are four tags that can be assigned to an item; in binary we could represent that as
0000
If all four tags are assigned to an object, the object would look like this...
1111
If just the first two...
1100
Then it's just a case of finding the binary values with the 1s and zeros in the column you want. Using SQL Server's Bitwise operators, you can check that there is a 1 in the first of the columns using very simple queries.
Check this link to find out more.
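As a rough illustration of the bitwise check described above: one way to store the binary representation is a single integer mask column (the TagMask column and values below are hypothetical, since the table above uses one column per tag):
-- tag1 = bit 1, tag2 = bit 2, tag3 = bit 4, tag4 = bit 8
-- items that have BOTH tag1 AND tag2 (mask 1 + 2 = 3):
SELECT *
FROM CachedTagItems
WHERE (TagMask & 3) = 3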
To paraphrase what others have said: the trick isn't in the schema, it's in the query.
The naive schema of Entities/Labels/Tags is the right way to go. But as you've seen, it's not immediately clear how to perform an AND query with a lot of tags.
The best way to optimize that query will be platform-dependent, so I would recommend re-tagging your question with your RDBS and changing the title to something like "Optimal way to perform AND query on a tagging database".
I have a few suggestions for MS SQL, but will refrain in case that's not the platform you're using.
A variation on the above answer is to take the tag ids, sort them, combine them into a ^-separated string, and hash that.
Then simply associate the hash with the item. Each combination of tags produces a new key. To do an AND search, simply re-create the hash from the given tag ids and search for it.
Changing tags on an item will cause the hash to be recreated. Items with the same set of tags share the same hash key.
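A sketch of how that might look in MySQL (the table and column names are placeholders, and MD5 stands in for whatever hash function is used):
-- build the per-item hash from its sorted tag ids
SELECT item_id,
       MD5(GROUP_CONCAT(tag_id ORDER BY tag_id SEPARATOR '^')) AS tag_hash
FROM item_tag
GROUP BY item_id;

-- AND search: hash the searched tag set the same way and compare,
-- e.g. for the exact tag set 3, 7 and 12:
SELECT item_id
FROM item_tag_hash
WHERE tag_hash = MD5('3^7^12');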