Database Design for Tagging [closed] - sql

Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
How would you design a database to support the following tagging features:
items can have a large number of tags
searches for all items that are tagged with a given set of tags must be quick (the items must have ALL tags, so it's an AND-search, not an OR-search)
creating/writing items may be slower to enable quick lookup/reading
Ideally, the lookup of all items that are tagged with (at least) a set of n given tags should be done using a single SQL statement. Since the number of tags to search for as well as the number of tags on any item are unknown and may be high, using JOINs is impractical.
Any ideas?
Thanks for all the answers so far.
If I'm not mistaken, however, the given answers show how to do an OR-search on tags. (Select all items that have one or more of n tags). I am looking for an efficient AND-search. (Select all items that have ALL n tags - and possibly more.)

Here's a good article on tagging Database schemas:
http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/
along with performance tests:
http://howto.philippkeller.com/2005/06/19/Tagsystems-performance-tests/
Note that the conclusions there are very specific to MySQL, which (at least in 2005 at the time that was written) had very poor full text indexing characteristics.

About ANDing: It sounds like you are looking for the "relational division" operation. This article covers relational division in a concise and yet comprehensible way.
About performance: A bitmap-based approach intuitively sounds like it will suit the situation well. However, I'm not convinced it's a good idea to implement bitmap indexing "manually", as digiguru suggests: it sounds like a complicated situation whenever new tags are added. But some DBMSes (including Oracle) offer bitmap indexes, which may be of use, because a built-in indexing system does away with the potential complexity of index maintenance; additionally, a DBMS offering bitmap indexes should be able to consider them properly when building the query plan.
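To make the "relational division" idea concrete, here is a minimal runnable sketch (Python with SQLite; the items/tags/item_tags schema and all sample data are hypothetical). It expresses division as GROUP BY/HAVING counting, which is one standard way to phrase an AND-search over tags:

```python
import sqlite3

# Hypothetical schema: items, tags, and an item_tags junction table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE item_tags (item_id INTEGER, tag_id INTEGER);
INSERT INTO items VALUES (1, 'article'), (2, 'photo');
INSERT INTO tags VALUES (1, 'red'), (2, 'urgent');
INSERT INTO item_tags VALUES (1, 1), (1, 2), (2, 1);
""")

# Relational division as GROUP BY / HAVING: an item qualifies only if it
# carries every tag in the search set (AND semantics, extra tags allowed).
wanted = ['red', 'urgent']
sql = """
SELECT it.item_id
FROM item_tags it JOIN tags t ON t.id = it.tag_id
WHERE t.name IN ({})
GROUP BY it.item_id
HAVING COUNT(DISTINCT t.id) = ?
""".format(",".join("?" * len(wanted)))
rows = conn.execute(sql, [*wanted, len(wanted)]).fetchall()
print(rows)  # only item 1 has both tags
```

With an index on item_tags(tag_id, item_id), the IN filter and the grouping can both be served from the index; the HAVING count must equal the number of distinct tags searched for.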

I just wanted to highlight that the article that @Jeff Atwood links to (http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/) is very thorough (it discusses the merits of 3 different schema approaches) and has a good solution for the AND queries that will usually perform better than what has been mentioned here so far (i.e. it doesn't use a correlated subquery for each term). Also lots of good stuff in the comments.
ps - The approach that everyone is talking about here is referred to as the "Toxi" solution in the article.

I don't see a problem with a straightforward solution: a table for items, a table for tags, and a cross table for "tagging".
Indices on the cross table should be enough optimisation. Selecting the appropriate items would be
SELECT * FROM items WHERE id IN
(SELECT DISTINCT item_id FROM item_tag WHERE
tag_id = tag1 OR tag_id = tag2 OR ...)
AND tagging would be
SELECT * FROM items WHERE
EXISTS (SELECT 1 FROM item_tag WHERE id = item_id AND tag_id = tag1)
AND EXISTS (SELECT 1 FROM item_tag WHERE id = item_id AND tag_id = tag2)
AND ...
which is, admittedly, not so efficient for a large number of tags to compare. If you maintain tag counts in memory, you could have the query start with the rarest tags, so the AND sequence is evaluated more quickly. Depending on the expected number of tags to match and the likelihood of matching any single one of them, this could be an OK solution; but if you are matching 20 tags and expect that some random item will match 15 of them, then this would still be heavy on the database.
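The EXISTS-based AND-search above can be sketched end to end like this (Python with SQLite; the table names and sample data are hypothetical, and the list of tag IDs stands in for tag1, tag2, ...):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE item_tag (item_id INTEGER, tag_id INTEGER);
INSERT INTO items VALUES (1, 'a'), (2, 'b');
INSERT INTO item_tag VALUES (1, 10), (1, 20), (2, 10);
""")

# One EXISTS clause per required tag; the query is built dynamically
# from the (hypothetical) list of tag ids to AND together.
tag_ids = [10, 20]
clauses = " AND ".join(
    "EXISTS (SELECT 1 FROM item_tag WHERE item_id = items.id AND tag_id = ?)"
    for _ in tag_ids
)
rows = conn.execute(f"SELECT id FROM items WHERE {clauses}", tag_ids).fetchall()
print(rows)  # only item 1 carries both tags
```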

You might want to experiment with a not-strictly-database solution like a Java Content Repository implementation (e.g. Apache Jackrabbit) and use a search engine built on top of that like Apache Lucene.
This solution with the appropriate caching mechanisms would possibly yield better performance than a home-grown solution.
However, I don't really think that in a small or medium-sized application you would require a more sophisticated implementation than the normalized database mentioned in earlier posts.
EDIT: with your clarification it seems more compelling to use a JCR-like solution with a search engine. That would greatly simplify your programs in the long run.

The easiest method is to create a tags table.
Target_Type -- in case you are tagging multiple tables
Target -- The key to the record being tagged
Tag -- The text of a tag
Querying the data would be something like:
Select distinct target from tags
where tag in ([your list of tags to search for here])
and target_type = [the table you're searching]
UPDATE
Based on your requirement to AND the conditions, the query above turns into something like this:
select target
from (
select target, count(*) cnt
from tags
where tag in ([your list of tags to search for here])
and target_type = [the table you're searching]
group by target
) t
where cnt = [number of tags being searched]
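A runnable version of this counting approach might look as follows (Python with SQLite; the tags table layout and sample rows are hypothetical). Note that the inner query needs a GROUP BY on target for the count to be per-item, and the technique assumes no duplicate (target, tag) rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags (target_type TEXT, target INTEGER, tag TEXT);
INSERT INTO tags VALUES
  ('items', 1, 'red'), ('items', 1, 'urgent'),
  ('items', 2, 'red');
""")

# AND semantics via counting: a target matching N distinct searched tags
# must carry all of them (assumes no duplicate (target, tag) rows).
wanted = ['red', 'urgent']
sql = """
SELECT target FROM (
  SELECT target, COUNT(*) AS cnt
  FROM tags
  WHERE tag IN ({}) AND target_type = ?
  GROUP BY target
) t
WHERE cnt = ?
""".format(",".join("?" * len(wanted)))
rows = conn.execute(sql, [*wanted, 'items', len(wanted)]).fetchall()
print(rows)  # only target 1 carries all the searched tags
```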

I'd second @Zizzencs' suggestion that you might want something that's not totally (R)DB-centric.
Somehow, I believe that using plain nvarchar fields to store the tags with some proper caching/indexing might yield faster results. But that's just me.
I've implemented tagging systems using 3 tables to represent a many-to-many relationship before (Item, Tag, ItemTag). Since you will be dealing with tags in a lot of places, I can tell you that having 3 tables to manipulate/query simultaneously all the time will definitely make your code more complex.
You might want to consider if the added complexity is worth it.

You won't be able to avoid joins and still be somewhat normalized.
My approach is to have a Tag Table.
TagId (PK)| TagName (Indexed)
Then, you have a TagXREFID column in your items table.
This TagXREFID column is a FK to a 3rd table, I'll call it TagXREF:
TagXrefID | ItemID | TagId
So, to get all tags for an item would be something like:
SELECT Tags.TagId,Tags.TagName
FROM Tags,TagXref
WHERE TagXref.TagId = Tags.TagId
AND TagXref.ItemID = #ItemID
And to get all items for a tag, I'd use something like this:
SELECT * FROM Items, TagXref
WHERE TagXref.TagId IN
( SELECT Tags.TagId FROM Tags
WHERE Tags.TagName = #TagName; )
AND Items.ItemId = TagXref.ItemId;
To AND a bunch of tags together, you would have to modify the above statement slightly, adding AND Tags.TagName = #TagName1 AND Tags.TagName = #TagName2 etc., and dynamically build the query.

What I like to do is have a number of tables that represent the raw data, so in this case you'd have
Items (ID pk, Name, <properties>)
Tags (ID pk, Name)
TagItems (TagID fk, ItemID fk)
This is fast at write time and keeps everything normalized, but note that for each further tag you want to AND, you'll need to join the tables twice more, so reads get slow.
A solution to improve reads is to create a caching table on demand, by setting up a stored procedure that essentially creates a new table representing the data in a flattened format...
CachedTagItems(ID, Name, <properties>, tag1, tag2, ... tagN)
Then you can consider how often the cached tagged-item table needs to be brought up to date: if it's on every insert, call the stored procedure from an insert trigger; if it's an hourly task, set up an hourly job to run it.
Now to get really clever in data retrieval, you'll want to create a stored procedure to get data from the tags. Rather than using nested queries in a massive case statement, you want to pass in a single parameter containing a list of tags you want to select from the database, and return a record set of Items. This would be best in binary format, using bitwise operators.
In binary format, it is easy to explain. Let's say there are four tags that can be assigned to an item; in binary we could represent that as
0000
If all four tags are assigned to an object, the object would look like this...
1111
If just the first two...
1100
Then it's just a case of finding the binary values with 1s in the columns you want. Using SQL Server's bitwise operators, you can check that there is a 1 in the first of the columns using very simple queries.
Check this link to find out more.
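The bitwise matching itself can be illustrated without any database (plain Python; the tag names and bit assignments are made up for the example). An item matches an AND-search when every searched bit is set in its mask:

```python
# Each tag gets one bit; an item's tag set is an integer bitmask.
# (Practical only while the tag vocabulary is small and stable.)
TAGS = {'red': 0b0001, 'urgent': 0b0010, 'video': 0b0100, 'draft': 0b1000}

def mask(tag_names):
    m = 0
    for name in tag_names:
        m |= TAGS[name]
    return m

items = {
    'item1': mask(['red', 'urgent', 'video']),
    'item2': mask(['red']),
}

# AND-search: an item matches if every searched bit is set in its mask.
search = mask(['red', 'urgent'])
matches = [name for name, m in items.items() if m & search == search]
print(matches)  # ['item1']
```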

To paraphrase what others have said: the trick isn't in the schema, it's in the query.
The naive schema of Entities/Labels/Tags is the right way to go. But as you've seen, it's not immediately clear how to perform an AND query with a lot of tags.
The best way to optimize that query will be platform-dependent, so I would recommend re-tagging your question with your RDBS and changing the title to something like "Optimal way to perform AND query on a tagging database".
I have a few suggestions for MS SQL, but will refrain in case that's not the platform you're using.

A variation on the above answer is to take the tag IDs, sort them, combine them into a ^-separated string, and hash that.
Then simply associate the hash with the item. Each combination of tags produces a new key. To do an AND search, simply re-create the hash from the given tag IDs and search for it.
Changing tags on an item will cause the hash to be recreated. Items with the same set of tags share the same hash key.
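As a sketch of this scheme (Python; the separator follows the description above, while the choice of SHA-256 and everything else is an assumption): the key depends only on the set of tag IDs, not their order. One caveat worth noting is that a single hash per item matches items whose tag set is exactly the searched combination, not items carrying additional tags.

```python
import hashlib

def tag_set_key(tag_ids):
    """Sort the tag ids, join with '^', and hash; any item whose full
    tag set is exactly this combination shares the same key."""
    joined = "^".join(str(t) for t in sorted(tag_ids))
    return hashlib.sha256(joined.encode()).hexdigest()

# Order of assignment does not matter; the key depends only on the set.
key_a = tag_set_key([3, 1, 2])
key_b = tag_set_key([2, 3, 1])

# Caveat: an item tagged {1, 2, 3, 4} produces a different key than
# {1, 2, 3}, so "all these tags and possibly more" is not covered.
key_c = tag_set_key([1, 2, 3, 4])
print(key_a == key_b, key_a == key_c)
```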

Related

How does a full text search server like Sphinx work?

Can anyone explain in simple words how a full text server like Sphinx works? In plain SQL, one would use SQL queries like this to search for certain keywords in texts:
select * from items where name like '%keyword%';
But in the configuration files generated by various Sphinx plugins I can not see any queries like this at all. They contain instead SQL statements like the following, which seem to divide the search into distinct ID groups:
SELECT (items.id * 5 + 1) AS id, ...
WHERE items.id >= $start AND items.id <= $end
GROUP BY items.id
..
SELECT * FROM items WHERE items.id = (($id - 1) / 5)
Is it possible to explain in simple words how these queries work and how they are generated?
Inverted Index is the answer to your question: http://en.wikipedia.org/wiki/Inverted_index
Now when you run an SQL query through Sphinx, it fetches the data from the database and constructs the inverted index, which in Sphinx is like a hashtable where the key is a 32-bit integer calculated using crc32(word) and the value is the list of document IDs containing that word.
This makes it super fast.
Now you can argue that even a database can create a similar structure to make searches super fast. However, the biggest difference is that a Sphinx/Lucene/Solr index is like a single-table database without any support for relational queries (JOINs) [from the MySQL Performance Blog]. Remember that an index is usually only there to support search, not to be the primary source of the data. So your database may be in third normal form, but the index will be completely denormalized and contain mostly just the data needed for searching.
Another possible reason is that databases generally suffer from internal fragmentation: they need to perform too many semi-random I/O tasks on huge requests.
What that means is that, considering the index architecture of a database, the query leads to the indexes, which in turn lead to the data. If the data to recover is widely spread, the result takes long, and that seems to be what happens in databases.
EDIT: Also please see the source code in cpp files like searchd.cpp etc for the real internal implementation, I think you are just seeing the PHP wrappers.
Those queries you are looking at are the queries Sphinx uses to extract a copy of the data from the database to put into its own index.
Sphinx needs a copy of the data to build its index (other answers have mentioned how that index works). You then ask the searchd daemon for results matching a specific query; it consults the index and returns the matching documents.
The particular example you have chosen looks quite complicated because it only extracts a part of the data, probably for sharding, i.e. splitting the index into parts for performance reasons. It also uses range queries, so it can access big datasets piecemeal.
An index could be built with a much simpler query, like
sql_query = select id,name,description from items
which would create a sphinx index, with two fields - name and description that could be searched/queried.
When searching, you would get back the unique id. http://sphinxsearch.com/info/faq/#row-storage
Full-text search usually uses some implementation of an inverted index. In simple words, it breaks the content of an indexed field into tokens (words) and saves a reference to that row, indexed by each token. For example, with a field containing The yellow dog for row #1 and The brown fox for row #2, the index will be populated like:
brown -> row#2
dog -> row#1
fox -> row#2
The -> row#1
The -> row#2
yellow -> row#1
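That toy index can be built in a few lines (Python; the two example rows come from the text above, while tokenizing by lowercased whitespace split is an assumption for the sketch):

```python
from collections import defaultdict

docs = {1: "The yellow dog", 2: "The brown fox"}

# Build the inverted index: token -> set of row ids containing it.
index = defaultdict(set)
for row_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(row_id)

print(sorted(index["the"]))   # [1, 2]
print(sorted(index["fox"]))   # [2]

# A multi-word query is then a set intersection over posting lists.
query = ["the", "yellow"]
result = set.intersection(*(index[t] for t in query))
print(sorted(result))         # [1]
```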
A short answer to the question is that databases such as MySQL are specifically designed for storing and indexing records and supporting SQL clauses (SELECT, PROJECT, JOIN, etc). Even though they can be used to do keyword search queries, they cannot give the best performance and features. Search engines such as Sphinx are designed specifically for keyword search queries, thus can provide much better support.

Is it possible there is a faster way to perform this SELECT query?

UPDATE (based on everyone's responses):
I'm thinking of changing my structure so that I have a new table called prx_tags_sportsitems, and removing prx_lists entirely. prx_tags_sportsitems will act as a reference table of IDs, replacing prx_lists.ListString, which used to store the IDs of the tags belonging to each prx_sportsitem.
The new relation will be like so:
prx_tags_sportsitems.TagID <--> prx_tags.ID
prx_sportsitems.ID <--> prx_tags_sportsitems.OwnerID
prx_tags will contain the TagName. This is so I can still maintain each "tag" as a separate unique entity.
My new query for finding all sportsitems with the tag "aerobic" will become something similar to as follows:
SELECT prx_sportsitems.* FROM prx_sportsitems, prx_tags_sportsitems
WHERE prx_tags_sportsitems.OwnerID = prx_sportsitems.ID
AND prx_tags_sportsitems.TagID = (SELECT ID FROM prx_tags WHERE TagName = 'aerobic')
ORDER BY prx_sportsitems.DateAdded DESC LIMIT 0,30;
Or perhaps I can do something with the "IN" clause, but I'm unsure about that just yet.
Before I go ahead with this huge modification to my scripts, does everyone approve? comments? Many thanks!
ORIGINAL POST:
When it comes to MySQL queries, I'm rather a novice. When I originally designed my database I did something rather silly, because it was the only solution I could find. Now it appears to be putting too much stress on my MySQL server, since each of these queries takes 0.2 seconds where I believe it could be more like 0.02 seconds with a better query (or table design, if it comes to that!). I want to avoid rebuilding my entire site structure, since it's deeply designed the way it currently is, so I'm hoping a faster MySQL query is possible.
I have three tables in my database:
Sports Items Table
Tags Table
Lists Table
Each sports item has multiple tag names (categories) assigned to it. Each "tag" is stored as a separate row in prx_tags. I create a "list" in prx_lists for the sports item in prx_sportsitems and link them through prx_lists.OwnerID, which links to prx_sportsitems.ID.
This is my current query (which finds all sports items which have the tag 'aerobic'):
SELECT prx_sportsitems.*
FROM prx_sportsitems, prx_lists
WHERE prx_lists.ListString LIKE (CONCAT('%',(SELECT prx_tags.ID
FROM prx_tags
WHERE prx_tags.TagName = 'aerobic'
limit 0,1),'#%'))
AND prx_lists.ListType = 'Tags-SportsItems'
AND prx_lists.OwnerID = prx_sportsitems.ID
ORDER BY prx_sportsitems.DateAdded
DESC LIMIT 0,30
To help clarify more, the list that contains all of the tag ids is inside a single field called ListString and I structure it like so: " #1 #2 #3 #4 #5" ...and from that, the above query "concats" the prx_tags.ID which tagname is 'aerobic'.
My thoughts are that there probably isn't a faster query, and that I simply need to accept doing something simpler, such as putting all the tags in a list directly inside prx_sportsitems in a new field called "TagsList", and then running a query like Select * from prx_sportsitems Where TagsList LIKE '%aerobic%'. However, I want to avoid redesigning my entire site. I'm really regretting not looking into optimization beforehand :(
Whenever I am writing a query, and think I need to use LIKE, an alarm goes off in my head that maybe there is a better design. This is certainly the case here.
You need to redesign the prx_lists table. From what you've said, it's hard to say what the exact schema should be, but here is my best guess:
prx_lists should have three columns: OwnerID, ListType, and TagName. Then you would have one row for each tag an OwnerID has. Your above query would now look something like this:
SELECT prx_sportsitems.*
FROM prx_sportsitems, prx_lists
where prx_lists.TagName = 'aerobic'
AND prx_lists.OwnerID = prx_sportsitems.ID
This is a MUCH more efficient query. Maybe ListType doesn't belong in that table either, but it's hard to say without more info about what that column is used for.
Don't forget to create the appropriate indexes, too! This will improve performance.
Refactoring your database schema might be painful, but it seems to me the only way to fix your long-term problem.
To help clarify more, the list that contains all of the tag ids is inside a single field called ListString and I structure it like so: " #1 #2 #3 #4 #5" ...and from that, the above query "concats" the prx_tags.ID which tagname is 'aerobic'.
There's your problem right there. Don't store delimited data in a DB field (ListString). Modeling data this way is going to make it extremely difficult/impossible to write performant queries against it.
Suggestion: Break the contents of ListString out into a related table with one row for each item.
the list that contains all of the tag ids is inside a single field called ListString and I structure it like so: " #1 #2 #3 #4 #5" ...and from that, the above query "concats" the prx_tags.ID which tagname is 'aerobic'.
Not only is that bad (storing denormalized data), but the separator character is an uncommon one.
Interim Improvement
The quickest way to improve things is to change the separator character you're currently using ("#") to a comma:
UPDATE PRX_LISTS
SET liststring = REPLACE(liststring, '#', ',')
Then, you can use MySQL's FIND_IN_SET function:
SELECT si.*
FROM PRX_SPORTSITEMS si
JOIN PRX_LISTS l ON l.ownerid = si.id
JOIN PRX_TAGS t ON FIND_IN_SET(t.id, l.liststring) > 0
WHERE t.tagname = 'aerobic'
AND l.listtype = 'Tags-SportsItems'
ORDER BY si.DateAdded DESC
LIMIT 0, 30
Long Term Solution
As you've experienced, searching for specifics in denormalized data does not perform well, and makes queries overly complicated. You need to change the PRX_LISTS table so one row contains a unique combination of the SPORTSITEM.ownerid and PRX_TAGS.id, and whatever other columns you might need. I'd recommend renaming as well - lists of what, exactly? The name is too generic:
CREATE TABLE SPORTSITEM_TAGS_XREF (
sportsitem_ownerid INT,
tag_id INT,
PRIMARY KEY (sportsitem_ownerid, tag_id)
)
Don't make any changes without looking at the execution plan. (And post that here, too, by editing your original question.)
The way your LIKE clause is constructed, MySQL can't use an index.
The LIKE clause is a symptom. Your table structure is more likely the problem.
You'll probably get at least one order of magnitude improvement by building sane tables.
I'm really regretting not looking into optimization beforehand
That's not what caused your problem. Being ignorant of the fundamentals of database design caused your problem. (That's an observation, not a criticism. You can fix ignorance. You can't fix stupid.)
Later:
Post your existing table structure and your proposed changes. You'll be a lot happier with our ability to predict what your code will do than with our ability to predict what your description of a piece of your code will do.

How to efficiently build and store semantic graph?

Surfing the net I ran into Aquabrowser (no need to click, I'll post a pic of the relevant part).
It has a nice way of presenting search results and discovering semantically linked entities.
Here is a screenshot taken from one of the demos.
On the left side you have the word you typed and related words.
Clicking them refines your results.
Now as an example project I have a data set of film entities and subjects (like world-war-2 or prison-escape) and their relations.
Now I imagine several use cases, first where a user starts with a keyword.
For example "world war 2".
Then I would somehow like to calculate related keywords and rank them.
I think about some sql query like this:
Lets assume "world war 2" has id 3.
select keywordId, count(keywordId) as total from keywordRelations
WHERE movieId IN (select movieId from keywordRelations
join movies using (movieId)
where keywordId=3)
group by keywordId order by total desc
which basically should select all movies which also have the keyword world-war-2, and then look up the keywords those films have as well, selecting the ones that occur most often.
I think with these keywords I can select the movies which match best and build a nice tag cloud containing similar movies and related keywords.
I think this should work, but it's very, very, very inefficient.
And it's also only one level of relation.
There must be a better way to do this, but how??
I basically have a collection of entities. They could be different entities (movies, actors, subjects, plot-keywords) etc.
I also have relations between them.
It must somehow be possible to efficiently calculate "semantic distance" for entities.
I also would like to implement more levels of relation.
But I am totally stuck. Well, I have tried different approaches, but everything ends up in algorithms that take ages to compute, with runtime growing exponentially.
Are there any database systems available optimized for that?
Can someone point me in the right direction?
You probably want an RDF triplestore. Redland is a pretty commonly used one, but it really depends on your needs. Queries are done in SPARQL, not SQL. Also... you have to drink the semantic web koolaid.
From your tags I see you're more familiar with SQL, and I think it's still possible to use it effectively for your task.
I have an application with a custom-made full-text search implemented using SQLite as the database. In the search field I can enter terms, and a popup list shows suggestions for the current word; for any next word, only those words are suggested that appear in the articles where the previously entered words appeared. So it's similar to the task you described.
To make things more simple let's assume we have only three tables. I suppose you have a different schema and even details can be different but my explanation is just to give an idea.
Words
[Id, Word] The table contains words (keywords)
Index
[Id, WordId, ArticleId]
This table (indexed also by WordId) lists articles where this term appeared
ArticleRanges
[ArticleId, IndexIdFrom, IndexIdTo]
This table lists ranges of Index.Id for any given article (obviously also indexed by ArticleId). It requires that for any new or updated article, the Index table contains entries with a known from-to range. I suppose this can be achieved in any RDBMS with a little help from the autoincrement feature.
So for any given string of words you:
1. Intersect all articles where all the previous words appeared. This narrows the search: SELECT ArticleId FROM Index WHERE WordId=... INTERSECT ...
2. For that list of articles, get the ranges of records from the ArticleRanges table.
3. For those ranges, query the WordId lists from Index, grouping the results to get a count, and finally sorting by it.
Although I listed them as separate actions, the final query can be just one big SQL statement built from the parsed query string.
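The intersection step of this scheme can be sketched like this (Python with SQLite; the Words and Index tables follow the description in this answer, while the sample data is made up). Note that Index is a reserved word, hence the quoting:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Words (Id INTEGER PRIMARY KEY, Word TEXT);
CREATE TABLE "Index" (Id INTEGER PRIMARY KEY, WordId INTEGER, ArticleId INTEGER);
INSERT INTO Words VALUES (1, 'war'), (2, 'prison');
INSERT INTO "Index" VALUES (1, 1, 100), (2, 1, 200), (3, 2, 100);
""")

# INTERSECT the article lists of each word, narrowing to articles
# where every previously entered word appeared.
sql = """
SELECT ArticleId FROM "Index" WHERE WordId = ?
INTERSECT
SELECT ArticleId FROM "Index" WHERE WordId = ?
"""
rows = conn.execute(sql, (1, 2)).fetchall()
print(rows)  # only article 100 contains both words
```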

What is the correct strategy to normalize a database with articles and tags for the articles?

I am building a system that stores articles and tags that categorize the articles. Standard stuff, similar to how this website does it. Now my question is whether I should store the tags in a separate table that just contains tags and article IDs, or store the tags in an extra column in the articles table. My first instinct would be to normalize the database and have two tables. The problem is that the interface with which the user administers the tags is a simple text box with all tags separated by commas. So when the user commits his changes, in order to find out which tags were added, changed or removed, I would need to first query the database, compare the results with the new data on a tag-by-tag basis, and then process the changes accordingly: a process with huge overhead compared with simply updating the one field in the one row of the articles table. How would you do it, or is there a third option I haven't considered?
PS. I am stuck with a relational database for this project.
If you are using a separate table, rather than trying to figure out which tags have changed each time, simply delete all for the given article ID, and then insert all of the supplied tags - this should present very little overhead.
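A minimal sketch of that delete-then-reinsert strategy (Python with SQLite; the article_tags table and sample data are hypothetical). Wrapping both statements in one transaction keeps the tag set consistent even if the insert fails:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article_tags (article_id INTEGER, tag TEXT)")
conn.execute("INSERT INTO article_tags VALUES (1, 'old'), (1, 'keep')")

def save_tags(article_id, tags):
    # Delete everything for the article, then reinsert the submitted
    # list; diffing against the old tag set is never needed.
    with conn:  # one transaction for delete + insert
        conn.execute("DELETE FROM article_tags WHERE article_id = ?",
                     (article_id,))
        conn.executemany("INSERT INTO article_tags VALUES (?, ?)",
                         [(article_id, t) for t in tags])

save_tags(1, ["keep", "new"])
rows = conn.execute(
    "SELECT tag FROM article_tags WHERE article_id = 1 ORDER BY tag"
).fetchall()
print(rows)  # [('keep',), ('new',)]
```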
In a tagged system the performance that would normally be most important is the retrieval of tags and / or the retrieval of the related content. Using a separate table with an indexed tag column should provide very fast lookup in a situation where an item can have any number of tags.
You need to normalize the database in order to run queries such as 'find all articles with tag T'.
I don't think that there will really be that much overhead in grabbing all of the tags to compare them with the new tags, assuming that you've applied correct indexes.
Personally I wouldn't delete all the tags then insert all the new ones, because I might want to do things like audit when individual tags are entered.
If you're using SQL Server 2008 then I suggest that you look at the MERGE command.

Combine contents of multiple rows in 3NF mysql tables

Having dutifully normalised all my data, I'm having a problem combining 3NF rows into a single row for output.
Up until now I've been doing this with server-side coding, but for various reasons I now need to select all rows related to another row, and combine them in a single row, all in MySQL...
So to try and explain:
I have three tables.
Categories
Articles
CategoryArticles_3NF
A category contains CategoryID + titles, descriptions etc. It can contain any number of articles in the Articles table, consisting of ArticleID + a text field to house the content.
The CategoryArticles table is used to link the two, so contains both the CategoryID and the ArticleID.
Now, if I select a Category record, and I JOIN the Articles table via the linking CategoryArticles_3NF table, the result is a separate row for each article contained within that category.
The issue is that I want to output one single row for each category, containing content from all articles within.
If that sounds like a ridiculous request, it's because it is. I'm just using articles as a good way to describe the problem. My data is actually somewhat different.
Anyway - the only way I can see to achieve this is to use a 'GROUP_CONCAT' statement to group the content fields together - the problem with this is that there is a limit to how much data this can return, and I need it to be able to handle significantly more.
Can anyone tell me how to do this?
Thanks.
Without more information, this sounds like something that should be done in the front end.
If you need to, you can increase the size limit of GROUP_CONCAT by setting the system variable group_concat_max_len. It has a limit based on max_allowed_packet, which you can also increase. I think that the max size for a packet is 1GB. If you need to go higher than that then there are some serious flaws in your design.
EDIT: So that this is in the answer and not just buried in the comments...
If you don't want to change the group_concat_max_len globally then you can change it for just your session with:
SET SESSION group_concat_max_len = <your value here>
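For reference, the aggregation being discussed looks like this as a runnable sketch (Python with SQLite, whose GROUP_CONCAT takes the separator as a second argument; MySQL uses SEPARATOR syntax instead and is subject to the group_concat_max_len limit discussed above; the table contents are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Categories (CategoryID INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE Articles (ArticleID INTEGER PRIMARY KEY, Content TEXT);
CREATE TABLE CategoryArticles_3NF (CategoryID INTEGER, ArticleID INTEGER);
INSERT INTO Categories VALUES (1, 'News');
INSERT INTO Articles VALUES (10, 'first'), (11, 'second');
INSERT INTO CategoryArticles_3NF VALUES (1, 10), (1, 11);
""")

# One output row per category, with all related article content
# concatenated into a single column.
rows = conn.execute("""
SELECT c.Title, GROUP_CONCAT(a.Content, ' | ') AS all_content
FROM Categories c
JOIN CategoryArticles_3NF ca ON ca.CategoryID = c.CategoryID
JOIN Articles a ON a.ArticleID = ca.ArticleID
GROUP BY c.CategoryID
""").fetchall()
print(rows)
```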