Having dutifully normalised all my data, I'm having a problem combining 3NF rows into a single row for output.
Up until now I've been doing this with server-side code, but for various reasons I now need to select all rows related to another row and combine them into a single row, entirely in MySQL...
So to try and explain:
I have three tables.
Categories
Articles
CategoryArticles_3NF
A category contains CategoryID + titles, descriptions etc. It can contain any number of articles in the Articles table, consisting of ArticleID + a text field to house the content.
The CategoryArticles table is used to link the two, so contains both the CategoryID and the ArticleID.
Now, if I select a Category record, and I JOIN the Articles table via the linking CategoryArticles_3NF table, the result is a separate row for each article contained within that category.
The issue is that I want to output one single row for each category, containing content from all articles within.
If that sounds like a ridiculous request, it's because it is. I'm just using articles as a good way to describe the problem. My data is actually somewhat different.
Anyway, the only way I can see to achieve this is to use GROUP_CONCAT to combine the content fields. The problem is that GROUP_CONCAT has a limit on how much data it can return, and I need it to handle significantly more.
Can anyone tell me how to do this?
Thanks.
Without more information, this sounds like something that should be done in the front end.
If you need to, you can increase the size limit of GROUP_CONCAT by setting the system variable group_concat_max_len. It has a limit based on max_allowed_packet, which you can also increase. I think that the max size for a packet is 1GB. If you need to go higher than that then there are some serious flaws in your design.
EDIT: So that this is in the answer and not just buried in the comments...
If you don't want to change the group_concat_max_len globally then you can change it for just your session with:
SET SESSION group_concat_max_len = <your value here>
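For example, a minimal sketch against the tables described in the question (the Content column name is an assumption):

-- raise the limit for this session only (value in bytes)
SET SESSION group_concat_max_len = 1000000;

-- one row per category, with all article content concatenated
SELECT c.CategoryID,
       GROUP_CONCAT(a.Content ORDER BY a.ArticleID SEPARATOR '\n') AS AllContent
FROM Categories c
JOIN CategoryArticles_3NF ca ON ca.CategoryID = c.CategoryID
JOIN Articles a ON a.ArticleID = ca.ArticleID
GROUP BY c.CategoryID;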
I am setting up a fairly large dataset (a catalogue) in a SQL database (I'd guesstimate ~100k records) to store information about products. Each product is characterized by about 20-30 properties, which would basically mean 20-30 columns. The system is set up so that each of these properties is linked to a code, and each product is therefore characterized by a unique string made by concatenating all these properties (the string has to be unique; if two product codes are the same then the two products are actually the same product). What I am trying to figure out is whether, SQL-wise, there is any difference between storing the catalogue as a table of 20-30 columns, or just having one column with the code and decoding the properties from the code. The difference being that in one case I would do
SELECT * FROM Catalogue WHERE Color='RED'
versus
SELECT * FROM Catalogue WHERE Code LIKE '____R____________'
The code would also make it easier to check whether a product already exists, as I would only be comparing a single column instead of 20-30. I could also add an extra column to the complete table to store the code, and use one method for some operations and the other method for the rest.
I have almost no knowledge of how the SQL engine works so I might be completely off with my reasoning here.
The code approach seems silly. Why do I phrase it this way?
You have a few dozen columns with attributes and you know what they are. Why would you NOT include that information in the data model?
I am also amused by how you are going to distinguish these comparisons:
WHERE Code LIKE '____R____________'
WHERE Code LIKE '___R_____________'
WHERE Code LIKE '_____R___________'
WHERE Code LIKE '____R___________'
That just seems like a recipe for spending half of your future life on debugging: if not your code, then someone else's.
And, with separate columns, you can create indexes for commonly used combinations.
If not all rows have all attributes -- or if the attributes can be expanded in the future -- you might want a structure with a separate line for each attribute:
entityId    code     value
1           Color    Red
This is called an entity-attribute-value (EAV) model and is appropriate under some circumstances.
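A minimal sketch of such a structure (names are illustrative, taken from the example row above):

CREATE TABLE ProductAttributes (
    entityId INT NOT NULL,          -- the product this attribute belongs to
    code     VARCHAR(30) NOT NULL,  -- attribute name, e.g. 'Color'
    value    VARCHAR(100),          -- attribute value, e.g. 'Red'
    PRIMARY KEY (entityId, code)
);

-- "all red products" then becomes:
SELECT entityId FROM ProductAttributes WHERE code = 'Color' AND value = 'Red';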
(NB. The question is not a duplicate of this, since I am dealing with an ORM system.)
I have a table in my database to store all Contacts information. Some of the columns for each contact are fixed (e.g. Id, InsertDate and UpdateDate). In my program I would like to give the user the option to add or remove properties for each contact.
Now there are of course two alternatives here:
The first is to save it all in one table and add and remove entire columns when the user needs to;
The second is to create a key-value table to save each property alongside its type, and connect the record to the user's id.
Both alternatives are doable, but I am wondering which one is better in terms of speed. In the program it will be very common for the user to view the entire contact list to check for updates. Also, I am using an ORM framework (Microsoft's Entity Framework) for database queries, so if the user adds and removes columns from a table all the time, it will be difficult to map them in my program. But again, if alternative (1) is significantly better than (2), I can reconsider the key-value option.
I have actually done both of these.
Example #1
Large, wide table with columns of data holding names, phone, address and lots of small integer values of information that tracked details of the clients.
Example #2
Many different tables separating out all of the Character Varying data fields, the small integer values etc.
Example #1 was a lot faster to code for, but in terms of performance it got pretty slow once the table filled with records. 5,000 wasn't a problem; when it reached 50,000 there was noticeable performance degradation.
Example #2 was built later in my coding experience, specifically to resolve the issues found in Example #1. While it took more work to get the records I was after (LEFT JOIN this and UNION that), it was MUCH faster, as you could pick exactly what the client was after without having to search a massive wide table full of data that was not all being requested.
I would recommend Example #2 to fit your #2 in the question.
And the USER-specified columns for their data set could be stored in a table of their own (depending on how many you have, I suppose), which would let you draw on the table specific to that USER and give you unlimited ability to add and remove columns to suit that particular setup.
You could then also have another table which keeps track of the custom columns in the custom-column table, giving you the ability to "recover" columns later, as in: "Do you want to add this to your current column choices, or restore one of the columns you deleted in the past?"
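A rough sketch of that layout (all table and column names here are assumptions, not from the question):

CREATE TABLE CustomColumns (
    ColumnId  INT PRIMARY KEY,
    UserId    INT NOT NULL,                   -- the user who owns this custom column
    Name      VARCHAR(50) NOT NULL,
    IsDeleted BOOLEAN NOT NULL DEFAULT FALSE  -- kept so a column can be "recovered" later
);

CREATE TABLE ContactPropertyValues (
    ContactId INT NOT NULL,
    ColumnId  INT NOT NULL,                   -- FK to CustomColumns
    Value     VARCHAR(255),
    PRIMARY KEY (ContactId, ColumnId)
);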
UPDATE (based on everyone's responses):
I'm thinking of changing my structure so that I have a new table called prx_tags_sportsitems and removing prx_lists entirely. prx_tags_sportsitems will act as a junction table of IDs, replacing prx_lists.ListString, which used to store the IDs of the tags belonging to each prx_sportsitem.
The new relation will be like so:
prx_tags_sportsitems.TagID <--> prx_tags.ID
prx_sportsitems.ID <--> prx_tags_sportsitems.OwnerID
prx_tags will contain the TagName. This is so I can still maintain each "tag" as a separate unique entity.
My new query for finding all sports items with the tag "aerobic" will become something like the following:
SELECT prx_sportsitems.* FROM prx_sportsitems, prx_tags_sportsitems
WHERE prx_tags_sportsitems.OwnerID = prx_sportsitems.ID
AND prx_tags_sportsitems.TagID = (SELECT ID FROM prx_tags WHERE TagName = 'aerobic')
ORDER BY prx_sportsitems.DateAdded DESC LIMIT 0,30;
Or perhaps I can do something with the "IN" clause, but I'm unsure about that just yet.
Before I go ahead with this huge modification to my scripts, does everyone approve? comments? Many thanks!
ORIGINAL POST:
When it comes to MySQL queries I'm rather a novice. When I originally designed my database I did something rather silly, because it was the only solution I could find. Now I'm finding it appears to be putting too much stress on my MySQL server, since each of these queries takes 0.2 seconds where I believe it could be more like 0.02 seconds with a better query (or table design, if it comes to that!). I want to avoid rebuilding my entire site structure, since it's deeply tied to the current design, so I'm hoping there's a faster MySQL query possible.
I have three tables in my database:
Sports Items Table
Tags Table
Lists Table
Each sports item has multiple tag names (categories) assigned to it. Each "tag" is stored as a separate record in prx_tags. I create a "list" in prx_lists for the sports item in prx_sportsitems and link them through prx_lists.OwnerID, which points to prx_sportsitems.ID.
This is my current query (which finds all sports items which have the tag 'aerobic'):
SELECT prx_sportsitems.*
FROM prx_sportsitems, prx_lists
WHERE prx_lists.ListString LIKE (CONCAT('%',(SELECT prx_tags.ID
FROM prx_tags
WHERE prx_tags.TagName = 'aerobic'
limit 0,1),'#%'))
AND prx_lists.ListType = 'Tags-SportsItems'
AND prx_lists.OwnerID = prx_sportsitems.ID
ORDER BY prx_sportsitems.DateAdded DESC
LIMIT 0, 30
To help clarify more, the list that contains all of the tag ids is inside a single field called ListString and I structure it like so: " #1 #2 #3 #4 #5" ...and from that, the above query "concats" the prx_tags.ID which tagname is 'aerobic'.
My thoughts are that there probably isn't a faster query, and that I simply need to accept doing something simpler, such as putting all the tags in a list directly inside prx_sportsitems, in a new field called "TagsList", and then running a query like Select * from prx_sportsitems Where TagsList LIKE '%aerobic%'. However, I want to avoid redesigning my entire site. I'm really regretting not looking into optimization beforehand :(
Whenever I am writing a query, and think I need to use LIKE, an alarm goes off in my head that maybe there is a better design. This is certainly the case here.
You need to redesign the prx_lists table. From what you've said, it's hard to say what the exact schema should be, but here is my best guess:
prx_lists should have three columns: OwnerID, ListType, and TagName. Then you would have one row for each tag an OwnerID has. Your above query would now look something like this:
SELECT prx_sportsitems.*
FROM prx_sportsitems, prx_lists
where prx_lists.TagName = 'aerobic'
AND prx_lists.OwnerID = prx_sportsitems.ID
This is a MUCH more efficient query. Maybe ListType doesn't belong in that table either, but it's hard to say without more info about what that column is used for.
And don't forget to create the appropriate indexes! This will improve performance.
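For instance, a sketch assuming the redesigned columns above:

-- lets MySQL find matching rows by tag without a full scan
CREATE INDEX idx_prx_lists_tag ON prx_lists (TagName, OwnerID);
-- helps the ORDER BY ... DESC LIMIT 30 on the items table
CREATE INDEX idx_sportsitems_date ON prx_sportsitems (DateAdded);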
Refactoring your database schema might be painful, but it seems to me the only way to fix your long-term problem.
To help clarify more, the list that contains all of the tag ids is inside a single field called ListString and I structure it like so: " #1 #2 #3 #4 #5" ...and from that, the above query "concats" the prx_tags.ID which tagname is 'aerobic'.
There's your problem right there. Don't store delimited data in a DB field (ListString). Modeling data this way is going to make it extremely difficult/impossible to write performant queries against it.
Suggestion: Break the contents of ListString out into a related table with one row for each item.
the list that contains all of the tag ids is inside a single field called ListString and I structure it like so: " #1 #2 #3 #4 #5" ...and from that, the above query "concats" the prx_tags.ID which tagname is 'aerobic'.
Not only is that bad (storing delimited, denormalized data), but the separator character is also unusual.
Interim Improvement
The quickest way to improve things is to change the separator you're currently using (" #") to a plain comma, stripping the spaces as well so that FIND_IN_SET's exact element matching works:
UPDATE PRX_LISTS
SET liststring = REPLACE(REPLACE(liststring, ' ', ''), '#', ',')
Then, you can use MySQL's FIND_IN_SET function:
SELECT si.*
FROM PRX_SPORTSITEMS si
JOIN PRX_LISTS l ON l.ownerid = si.id
JOIN PRX_TAGS t ON FIND_IN_SET(t.id, l.liststring) > 0
WHERE t.tagname = 'aerobic'
AND l.listtype = 'Tags-SportsItems'
ORDER BY si.DateAdded DESC
LIMIT 0, 30
Long Term Solution
As you've experienced, searching for specifics in denormalized data does not perform well and makes queries overly complicated. You need to change the PRX_LISTS table so that one row contains a unique combination of the sports item id and PRX_TAGS.id, plus whatever other columns you might need. I'd recommend renaming it as well: lists of what, exactly? The name is too generic:
CREATE TABLE SPORTSITEM_TAGS_XREF (
    sportsitem_ownerid INT,
    tag_id INT,
    PRIMARY KEY (sportsitem_ownerid, tag_id)
);
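The earlier query then becomes a plain pair of joins (a sketch against the table above):

SELECT si.*
FROM PRX_SPORTSITEMS si
JOIN SPORTSITEM_TAGS_XREF x ON x.sportsitem_ownerid = si.id
JOIN PRX_TAGS t ON t.id = x.tag_id
WHERE t.tagname = 'aerobic'
ORDER BY si.DateAdded DESC
LIMIT 0, 30;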
Don't make any changes without looking at the execution plan. (And post that here, too, by editing your original question.)
The way your LIKE clause is constructed, MySQL can't use an index.
The LIKE clause is a symptom. Your table structure is more likely the problem.
You'll probably get at least one order of magnitude improvement by building sane tables.
I'm really regretting not looking into optimization beforehand
That's not what caused your problem. Being ignorant of the fundamentals of database design caused your problem. (That's an observation, not a criticism. You can fix ignorance. You can't fix stupid.)
Later:
Post your existing table structure and your proposed changes. You'll be a lot happier with our ability to predict what your code will do than with our ability to predict what your description of a piece of your code will do.
I have a user table, and then a number of dependent tables with a one-to-many relationship,
e.g. an email table, an address table and a groups table (i.e. one user can have multiple email addresses and physical addresses, and can be a member of many groups).
Is it better to:
Join all these tables, and process the heap of data in code,
Use something like GROUP_CONCAT and return one row, and split apart the fields in code,
Or query each table independently?
Thanks.
It really depends on how much data you have in the related tables and on how many users you're querying at a time.
Option 1 tends to be messy to deal with in code.
Option 2 tends to be messy to deal with as well, and on top of that, grouping tends to be slow, especially on large datasets.
Option 3 is the easiest to deal with but generates more queries overall. If your dataset is small and you're not planning to scale much beyond your current needs, it's probably the best option. It's definitely the best option if you're only trying to display one record.
There is a fourth option, however: a middle-of-the-road approach I use at my job, where we deal with a very similar situation. Instead of getting the related records for each row one at a time, use IN() to get all of the related records for your result set, then loop in your code to match them to the appropriate record for display. If you cache search queries, you can cache that second query as well. It's only two queries and only one loop in the code (no parsing; use hashes to relate things by their key).
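A minimal sketch of that approach (the users/emails tables and column names are assumptions):

-- 1) fetch the page of users
SELECT id, name FROM users ORDER BY name LIMIT 0, 30;

-- 2) fetch all related rows for that page in one shot, then match them
--    to the right user in application code by user_id
SELECT user_id, address FROM emails WHERE user_id IN (1, 2, 3); -- ids from step 1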
Personally, assuming my table indexes were up to scratch, I'd go with a table join, get all the data out in one go, and then process that into a nested data structure. This way you're playing to each system's strengths.
Generally speaking, do the most efficient query for the situation you're in. So don't create a mega query that you use in all cases. Create case specific queries that return just the information you need.
In terms of processing the results, if you use GROUP_CONCAT you have to split all the resulting values during processing. If there are extra delimiter characters in your GROUP_CONCAT'd values, this can be problematic. My preferred method is to store the GROUPed BY field in a $holder variable during the output loop, compare that field against $holder on each pass, and change the output accordingly.
How would you design a database to support the following tagging features:
items can have a large number of tags
searches for all items that are tagged with a given set of tags must be quick (the items must have ALL tags, so it's an AND-search, not an OR-search)
creating/writing items may be slower to enable quick lookup/reading
Ideally, the lookup of all items that are tagged with (at least) a set of n given tags should be done using a single SQL statement. Since the number of tags to search for as well as the number of tags on any item are unknown and may be high, using JOINs is impractical.
Any ideas?
Thanks for all the answers so far.
If I'm not mistaken, however, the given answers show how to do an OR-search on tags. (Select all items that have one or more of n tags). I am looking for an efficient AND-search. (Select all items that have ALL n tags - and possibly more.)
Here's a good article on tagging Database schemas:
http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/
along with performance tests:
http://howto.philippkeller.com/2005/06/19/Tagsystems-performance-tests/
Note that the conclusions there are very specific to MySQL, which (at least in 2005 at the time that was written) had very poor full text indexing characteristics.
About ANDing: it sounds like you are looking for the "relational division" operation. This article covers relational division in a concise and yet comprehensible way.
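In plain SQL, relational division for the AND-search is commonly expressed with a GROUP BY/HAVING count trick; a sketch, assuming an item_tag junction table:

SELECT item_id
FROM item_tag
WHERE tag_id IN (1, 2, 3)           -- the n tags being searched for
GROUP BY item_id
HAVING COUNT(DISTINCT tag_id) = 3;  -- the item must have all n of them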
About performance: a bitmap-based approach intuitively sounds like it would suit the situation well. However, I'm not convinced it's a good idea to implement bitmap indexing "manually", as digiguru suggests: it sounds like a complicated situation whenever new tags are added(?). But some DBMSes (including Oracle) offer bitmap indexes, which may be of use here, because a built-in indexing system does away with the potential complexity of index maintenance; additionally, a DBMS offering bitmap indexes should be able to consider them properly when building the query plan.
I just wanted to highlight that the article that @Jeff Atwood links to (http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/) is very thorough (it discusses the merits of three different schema approaches) and has a good solution for the AND queries that will usually perform better than what has been mentioned here so far (i.e. it doesn't use a correlated subquery for each term). There is also lots of good stuff in the comments.
ps - The approach that everyone is talking about here is referred to as the "Toxi" solution in the article.
I don't see a problem with a straightforward solution: a table for items, a table for tags, and a cross table for "tagging".
Indices on the cross table should be enough optimisation. Selecting the appropriate items would be
SELECT * FROM items WHERE id IN
(SELECT DISTINCT item_id FROM item_tag WHERE
tag_id = tag1 OR tag_id = tag2 OR ...)
and the AND-search would be
SELECT * FROM items WHERE
EXISTS (SELECT 1 FROM item_tag WHERE id = item_id AND tag_id = tag1)
AND EXISTS (SELECT 1 FROM item_tag WHERE id = item_id AND tag_id = tag2)
AND ...
which is, admittedly, not so efficient for a large number of tags to compare. If you maintain tag counts in memory, you could make the query start with the rarest tags, so the AND sequence would be evaluated more quickly. Depending on the expected number of tags to be matched against, and the likelihood of matching any single one, this could be an OK solution; if you are matching 20 tags and expect some random item to match 15 of them, this would still be heavy on the database.
You might want to experiment with a not-strictly-database solution like a Java Content Repository implementation (e.g. Apache Jackrabbit) and use a search engine built on top of that like Apache Lucene.
This solution with the appropriate caching mechanisms would possibly yield better performance than a home-grown solution.
However, I don't really think that in a small or medium-sized application you would require a more sophisticated implementation than the normalized database mentioned in earlier posts.
EDIT: with your clarification it seems more compelling to use a JCR-like solution with a search engine. That would greatly simplify your programs in the long run.
The easiest method is to create a tags table.
Target_Type -- in case you are tagging multiple tables
Target -- The key to the record being tagged
Tag -- The text of a tag
Querying the data would be something like:
Select distinct target from tags
where tag in ([your list of tags to search for here])
and target_type = [the table you're searching]
UPDATE
Based on your requirement to AND the conditions, the query above would turn into something like this:
select target
from (
    select target, count(*) cnt
    from tags
    where tag in ([your list of tags to search for here])
    and target_type = [the table you're searching]
    group by target
) t
where cnt = [number of tags being searched]
I'd second @Zizzencs' suggestion that you might want something that's not totally (R)DB-centric.
Somehow, I believe that using plain nvarchar fields to store the tags, with some proper caching/indexing, might yield faster results. But that's just me.
I've implemented tagging systems using three tables to represent a many-to-many relationship before (Item, Tags, ItemTags), but if you will be dealing with tags in a lot of places, I can tell you that having three tables to manipulate/query simultaneously all the time will definitely make your code more complex.
You might want to consider if the added complexity is worth it.
You won't be able to avoid joins and still be somewhat normalized.
My approach is to have a Tag Table.
TagId (PK)| TagName (Indexed)
Then, you have a TagXREFID column in your items table.
This TagXREFID column is a FK to a 3rd table, I'll call it TagXREF:
TagXrefID | ItemID | TagId
So, to get all tags for an item would be something like:
SELECT Tags.TagId,Tags.TagName
FROM Tags,TagXref
WHERE TagXref.TagId = Tags.TagId
AND TagXref.ItemID = #ItemID
And to get all items for a tag, I'd use something like this:
SELECT * FROM Items, TagXref
WHERE TagXref.TagId IN
( SELECT Tags.TagId FROM Tags
WHERE Tags.TagName = #TagName; )
AND Items.ItemId = TagXref.ItemId;
To AND a bunch of tags together, you would have to modify the above statement slightly, adding AND Tags.TagName = #TagName1 AND Tags.TagName = #TagName2 etc., and dynamically build the query.
What I like to do is have a number of tables that represent the raw data, so in this case you'd have
Items (ID pk, Name, <properties>)
Tags (ID pk, Name)
TagItems (TagID fk, ItemID fk)
This is fast at write time and keeps everything normalized, but you may also note that for each further tag you want to AND, you'll need to join the tables again, so reads get slow.
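To illustrate the read cost, a sketch with two ANDed tags (one extra pair of joins per additional tag):

SELECT i.*
FROM Items i
JOIN TagItems ti1 ON ti1.ItemID = i.ID
JOIN Tags t1 ON t1.ID = ti1.TagID AND t1.Name = 'tag1'
JOIN TagItems ti2 ON ti2.ItemID = i.ID
JOIN Tags t2 ON t2.ID = ti2.TagID AND t2.Name = 'tag2';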
A solution to improve reads is to create a caching table on demand, by setting up a stored procedure that essentially creates a new table representing the data in a flattened format...
CachedTagItems(ID, Name, <properties>, tag1, tag2, ... tagN)
Then you can consider how often the cached tag-item table needs to be kept up to date: if it's on every insert, call the stored procedure whenever an insert occurs; if it's an hourly task, set up an hourly job to run it.
Now, to get really clever with data retrieval, you'll want to create a stored procedure to fetch data by tags. Rather than using nested queries in a massive case statement, pass in a single parameter containing the list of tags you want to select, and return a record set of items. This works best in binary format, using bitwise operators.
In binary format, this is easy to explain. Let's say there are four tags that can be assigned to an item; in binary we could represent that as
0000
If all four tags are assigned to an object, the object would look like this...
1111
If just the first two...
1100
Then it's just a case of finding the binary values with 1s in the columns you want. Using SQL Server's bitwise operators, you can check that there is a 1 in the first of the columns using very simple queries.
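For example, a sketch assuming the flattened table stores the tag bits in a single integer column (TagBits is an assumption, not from the original schema):

-- items carrying both of the first two tags: mask 1100 in binary = 12
SELECT *
FROM CachedTagItems
WHERE (TagBits & 12) = 12;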
Check this link to find out more.
To paraphrase what others have said: the trick isn't in the schema, it's in the query.
The naive schema of Entities/Labels/Tags is the right way to go. But as you've seen, it's not immediately clear how to perform an AND query with a lot of tags.
The best way to optimize that query will be platform-dependent, so I would recommend re-tagging your question with your RDBMS and changing the title to something like "Optimal way to perform AND query on a tagging database".
I have a few suggestions for MS SQL, but will refrain in case that's not the platform you're using.
A variation on the above answer is to take the tag IDs, sort them, combine them into a ^-separated string, and hash that.
Then simply associate the hash with the item. Each combination of tags produces a new key. To do an AND-search, simply re-create the hash from the given tag IDs and search for it.
Changing the tags on an item causes the hash to be recreated. Items with the same set of tags share the same hash key.
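A sketch of the idea in MySQL (MD5 and GROUP_CONCAT are built-ins; the table and column names are assumptions):

-- build an item's key from its sorted tag ids, e.g. '3^17^42'
SELECT MD5(GROUP_CONCAT(tag_id ORDER BY tag_id SEPARATOR '^')) AS tag_hash
FROM item_tag
WHERE item_id = 1;

-- store tag_hash on the item; searching for that exact tag set is then:
SELECT * FROM items WHERE tag_hash = MD5('3^17^42');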