SQL - How to tag data?

I need to store short tags (A01, B34, etc.) in a SQL table and make sure they're indexed. Creating an INT column for each letter of the alphabet is not possible, because an entry can have multiple 'A' tags, for example.
First I stored them as one long string, separated by spaces (for example "A01 B34"). But this requires a LIKE '%…%' query, which does a full table scan and ignores any indexes, so I'm looking for alternatives.
I now use SQLite FTS (full-text search) to search for these tags, but it requires a special table to store the tags in, fetching results with JOIN queries, and all kinds of other stuff I'd rather avoid.
My requirements are pretty simple: I need to store millions of short strings, each with their own tags, and do simple searches for these tags.
Is my current approach (doing FTS on the tags) the fastest? Or is it better to use a NoSQL database for this kind of data?

I will share how I did this at my previous startup, on the Pageflakes Community site. At Pageflakes, user-created content is tagged. You can see an example here:
http://www.pageflakes.com/Community/Content/Flakes.aspx?moduleKey=4796
Each widget and pagecast has a collection of tags. When someone searches, we give the tags the highest priority, then the title, then the description of the item.
Assuming you have a Content table like this:
Content (id, title, description)
First of all, you need to create a table for all unique tags.
Tag (id, name (unique), countOfItems)
Then you need to map the tag to content rows.
TagContentMap (tagid, contentid)
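As a rough sketch, those three tables might look like this in SQL (the column types, the named primary key and the extra index are my assumptions, not part of the original answer):
CREATE TABLE Content (
    id          INT IDENTITY CONSTRAINT PK_Content PRIMARY KEY,
    title       NVARCHAR(200) NOT NULL,
    description NVARCHAR(MAX) NULL
);
CREATE TABLE Tag (
    id           INT IDENTITY PRIMARY KEY,
    name         NVARCHAR(50) NOT NULL UNIQUE,
    countOfItems INT NOT NULL DEFAULT 0
);
CREATE TABLE TagContentMap (
    tagid     INT NOT NULL REFERENCES Tag (id),
    contentid INT NOT NULL REFERENCES Content (id),
    PRIMARY KEY (tagid, contentid)
);
-- Supports the reverse lookup ("all tags for a given content row") as an index seek.
CREATE INDEX IX_TagContentMap_Content ON TagContentMap (contentid);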
You will now ask: for each new piece of content, do I have to insert into three tables? Not always. You insert into the Tag table only when you have a new tag. Most of the time, people choose existing tags. After a couple of months of tagging, users will have exhausted the unique tags; from then on, 99% of the time they pick an existing tag. That removes one insert for you, so you only have one additional insert.
Moreover, insert volume is always significantly lower than select volume. Most likely you will have 99% reads, 1% writes.
Unless you introduce these two tables, you can never have a UI where users can click on a tag and see all the content tagged with that particular tag. If you have no need for such a feature, then of course you can just add a "tags" column to the Content table itself and store the tags in comma-delimited format.
Now the most important point - how to produce the best search result. On the Content table, we have a varchar field called "SearchData". This field is populated first with the tag names, then the title, then the description. So,
SearchData = tag names comma delimited + newline + title + newline + description.
Then you use SQL Server's full-text indexing to index the SearchData column only, not any other field in the Content table.
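A minimal sketch of that last step, assuming SQL Server with the full-text feature installed and the Content table from the sketch above (the catalog name is made up):
ALTER TABLE dbo.Content ADD SearchData NVARCHAR(MAX);

CREATE FULLTEXT CATALOG ContentCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON dbo.Content (SearchData) KEY INDEX PK_Content;

-- A tag search then touches only the full-text index on SearchData:
SELECT id, title
FROM dbo.Content
WHERE CONTAINS(SearchData, 'A01');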
Does this work for you?

You do not give us a lot of details to go on, but your design seems to be all wrong. It is not in third normal form.

#Joshua, please look up the term "normalization". Currently your data is denormalized. Denormalization is a legitimate technique, but only after normalization, as a deliberate performance hack. As it stands, your design seems to be wrong.
As an example, instead of 1 table you should have 3 tables:
some_records (id, column1, ..., columnN)
tags (id, title)
some_records_tags (some_record_id, tag_id)
It's a classic design pattern in relational databases, and NoSQL is not needed here.
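To illustrate what this buys you, finding every record that carries a given tag becomes an indexable join (a sketch using the table names above; the tag value is just an example):
SELECT r.*
FROM some_records r
JOIN some_records_tags srt ON srt.some_record_id = r.id
JOIN tags t ON t.id = srt.tag_id
WHERE t.title = 'A01';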

As other users have pointed out, the data is not well normalized. I'll assume that this is intentional and that there is some very large size requirement (hundreds of GB or TB) or huge throughput requirement that you haven't mentioned. But before you start down any path, you should understand exactly what your requirements are: how often you write versus read, what the latency requirements are for writes and reads, and you have to include index maintenance in your calculations.
If you have a significant perf requirement, you might try building a near-line index system on top of what you currently have. I've used this technique in the past for large throughput requirement systems. The idea is basically that for writes, you make them as small and quick as possible, and create a batch process to come back and add the data into a secondary search table that will get it into a form that is capable of being searched. The benefit is your writes can be done quickly, and if you choose your clustered index well the reads for the batch processing can be done very efficiently. In addition, you can segment the data into different servers as necessary to support higher search throughput. The major drawback is that updates are not instantaneously reflected in search results.
If you write into a table like:
table data (id binary(6), ..., timestamp datetime, tags varchar(256))
and have a secondary table:
table search (tag char(3), dataId binary(6))
You can create a batch process that comes around, takes the last BATCH_SIZE (1000, maybe) records, splits the tags column on spaces, and inserts/deletes the tags into/from the search table. You keep a variable/row somewhere with the last timestamp value you've processed and start from there at the next batch interval. Finally, if deletes are important, each batch interval will need to find the set of records no longer in the data table. Alternatively, you could use a tombstone table if your data table is too large, or you can perform deletes against data and search concurrently if they happen infrequently enough.
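A minimal sketch of one batch pass, assuming SQL Server 2016+ for STRING_SPLIT and a single-row state table for the bookmark (both assumptions of mine; the BATCH_SIZE cap and delete handling are omitted for brevity):
DECLARE @last datetime = (SELECT lastTimestamp FROM batch_state);

-- Split the space-delimited tags of newly written rows into the search table.
INSERT INTO search (tag, dataId)
SELECT CAST(s.value AS char(3)), d.id
FROM data d
CROSS APPLY STRING_SPLIT(d.tags, ' ') AS s
WHERE d.timestamp > @last;

-- Remember where this batch stopped (unchanged if nothing new arrived).
UPDATE batch_state
SET lastTimestamp = COALESCE((SELECT MAX(timestamp) FROM data), lastTimestamp);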
Things to watch out for with batch processing are making the batch size too big and taking table locks when updating the search table. Also, watch out for duplicate timestamps. And of course, when writing to or updating the data table, you must always update the timestamp.

Is it a good idea to index every column if the users can filter by any column?

In my application, users can create custom tables with three column types: Text, Numeric and Date. They can have up to 20 columns. I create a SQL table based on their schema using nvarchar(430) for text, decimal(38,6) for numeric and datetime for dates, along with an identity Id column.
There is the potential for many of these tables to be created by different users, and the data might be updated frequently by users uploading new CSV files. To get the best performance during the upload of the user data, we truncate the table to get rid of existing data, and then do batches of BULK INSERT.
The user can make a selection based on a filter they build up, which can include any number of columns. My issue is that some tables with a lot of rows will have poor performance during this selection. To combat this I thought about adding indexes, but as we don't know what columns will be included in the WHERE condition we would have to index every column.
For example, on a local SQL Server one table with just over a million rows and a WHERE condition on 6 of its columns will take around 8 seconds the first time it runs, then under one second for subsequent runs. With indexes on every column it runs in under one second the first time the query is run. This performance issue is amplified when we test on a SQL Azure database, where the same query takes over a minute the first time it's run and does not improve on subsequent runs, but with the indexes it takes 1 second.
So, would it be a suitable solution to add an index on every column when a user creates a column, or is there a better solution?
Yes, it's a good idea given your model. There will, of course, be more overhead maintaining the indexes on the insert, but if there is no predictable standard set of columns in the queries, you don't have a lot of choices.
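A rough sketch of what "an index per user column" could look like, run whenever a user adds a column (the table and column names are made up for illustration):
DECLARE @table sysname = N'UserTable_123';   -- hypothetical per-user table
DECLARE @col   sysname = N'Col7';            -- the column the user just added
DECLARE @sql   nvarchar(max) =
    N'CREATE INDEX ' + QUOTENAME(N'IX_' + @table + N'_' + @col) +
    N' ON dbo.' + QUOTENAME(@table) + N' (' + QUOTENAME(@col) + N');';
EXEC sp_executesql @sql;
One refinement worth considering (my suggestion, not part of the answer above): disable or drop these indexes before the truncate/BULK INSERT cycle and rebuild them afterwards, so index maintenance doesn't slow down the upload itself.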
If by 'updated frequently' you mean data is added frequently via uploads rather than existing records being modified, you might consider one of the various non-SQL data stores (like Apache Lucene or its variants) which allow efficient querying on any combination of data. For reading massive 'flat' data sets, they are astonishingly fast.

One large table with 100 columns vs. a lot of little tables

I created a website that contains users, comments, videos, photos, messages and more. All of the data is in one table with 100 columns. I thought one table was better than several because a user only needs to connect to one table, but I have heard that some programmers don't like this method. Can someone tell me which is better: one very large table or a lot of little tables?
Why would I need to use a lot of tables? Why is it useful? Which one is faster for the user?
What are the advantages and disadvantages of a large table versus a lot of little tables?
100 columns in a single table is bad design in most situations.
Read this page: http://www.tutorialspoint.com/sql/sql-rdbms-concepts.htm
Break your data up into related chunks and give each of them their own table.
You said you have this information (users, comments, videos, photos, messages), so you should have something like the tables below (a sketch of two of them follows the list).
Users which contains (User ID, Name, Email etc)
Comments which contains (Comment ID, User ID, Comment Text etc)
Videos which contains (Video ID, User ID, Comment ID, Video Data etc)
Photos which contains (Photo ID, User ID, Comment ID, Photo Data etc)
Messages which contains (Message ID, User ID, Message Text etc)
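As a sketch of two of those tables (the column choices are purely illustrative), with a foreign key tying messages back to users:
CREATE TABLE Users (
    UserID INT IDENTITY PRIMARY KEY,
    Name   NVARCHAR(100) NOT NULL,
    Email  NVARCHAR(256) NOT NULL
);
CREATE TABLE Messages (
    MessageID   INT IDENTITY PRIMARY KEY,
    UserID      INT NOT NULL REFERENCES Users (UserID),
    MessageText NVARCHAR(MAX) NOT NULL
);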
Then when you're writing your SQL you can write proper SQL to query based on exactly the information you need.
SELECT USR.UserID, MSG.MessageID, MSG.MessageText
FROM Users as USR
JOIN Messages as MSG
on USR.UserID = MSG.UserID
WHERE USR.UserID = 1234567
With your current design you're having to deal with rows containing data that you don't need or care about.
EDIT
Just to give some further information to the OP as to why this is better design.
Let's take "Users" as a starting example.
In a proper database design you would have a table called Users which has all the columns required for a user to exist: username, email, id number, etc.
Now we want to create a new user, so we insert a username, email and id number. But wait: in the wide-table design I still have to populate 97 other columns with information totally unrelated to creating a new user! Even if you store NULL in all those columns, it is going to use some space in the database.
Also imagine you have hundreds of users all trying to select, update and delete from a single database table. There is a high chance of the table being locked. But if you had one user updating the Users table and another user inserting into the Messages table, then the work is spread out.
And, as other users have said, pure performance. The database needs to read the whole row and filter out what you want; if you have a lot of columns this is unnecessary work.
Performance Example.
Let's say your database has been running for years. You have 5,000 users, 2,000,000 comments, 300,000 pictures and 1,000,000 messages. Your single table now contains 3,305,000 records.
Now you want to find a User with the ID of 12345 who has more than 20 pictures. You need to search through all 3,305,000 records to get this result.
If you had a split table design then you would only need to search through 305,000 records.
Obvious performance gain!!
EDIT 2
Performance TEST.
I created a dummy table containing 2 million rows and 1 column. I ran the below query which took 120ms on average over 10 executions.
SELECT MyDate1 from dbo.DummyTable where MyDate1 BETWEEN '2015-02-15 16:59:00.000' and '2015-02-15 16:59:59.000'
I then truncated the table and created 6 more columns and populated them with 2 million rows of test data and ran the same query. It took 210ms on average over 10 executions.
So adding more columns decreases performance even though you're not viewing the extra data.
Wide tables can cause performance problems if rows are wider than the database can store in one place.
You need to read about normalization, as this type of structure is very bad and not what the database is optimized for. In your case you will have many repeated records, and you will have to use DISTINCT (which is a performance killer) to get rid of them when you want to show only the user name or the comments.
Additionally, you may have some fields that are repeats like comment1, comment2, etc. Those are very hard to query over time and if you need another one, then you have to change the table structure and potentially change the queries. That is a bad way to do business.
Further when you only have one table, it becomes a hot spot in your database and you will have more locking and blocking.
Now suppose one of those pieces of information is updated; you have to make sure to update all the records, not just one. This can also be a performance killer, and if you don't do it you will have data integrity problems that make the data in your database essentially useless. Denormalizing is almost always a bad idea, and always a bad idea when done by someone who is not an expert in database design. There are many ramifications of denormalization that you probably haven't thought of.
Overall, your strategy is a sure loser over time and needs to be fixed ASAP, because the more records you have in the database, the harder it is to refactor.
For your situation it is better to have multiple tables. The reason is that if you put all your data into one table, you will have update anomalies. For example, if a user decides to update his username, you will have to update every single row in your big table that has that user's username. But if you split it into multiple tables, then you only need to update one row in your User table, and all the rows in your other tables will reference that updated row.
As far as speed, having one table will be faster than multiple tables with SELECT statements because joining tables is slow. INSERT statements will be about the same speed in either situation because you will be inserting one row. However, updating someone's username with an UPDATE statement will be very slow with one table if they have a lot of data about them because it has to go through each row and update every one of them as opposed to only having to update one row in the User table.
So, you should create tables for everything you mentioned in your first sentence (users, comments, videos, photos, and messages) and connect them using Ids like this:
User
-Id
-Username
Video
-Id
-UploaderId references User.Id
-VideoUrl
Photo
-Id
-UploaderId references User.Id
-PhotoUrl
VideoComment
-CommenterId references User.Id
-VideoId references Video.Id
-CommentText
PhotoComment
-CommenterId references User.Id
-PhotoId references Photo.Id
-CommentText
Message
-SenderId references User.Id
-ReceiverId references User.Id
-MessageText
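To make the update-anomaly point concrete (a sketch; BigTable stands in for the hypothetical 100-column table):
-- Normalized: renaming a user touches exactly one row.
UPDATE [User] SET Username = 'new_name' WHERE Id = 42;

-- Single wide table: every row that repeats the username must be rewritten.
UPDATE BigTable SET Username = 'new_name' WHERE UserId = 42;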

Why does Wordpress have separate 'usersmeta' and 'users' SQL tables? Why not combine them?

Alongside the users table, Wordpress has a usersmeta table with the following columns
meta_id
user_id
meta_key (e.g. first_name)
meta_value (e.g. Tom)
Each user has 20 rows in the usersmeta table, regardless of whether or not the rows have a filled-in meta_value. That said, would it not be more efficient to add the always-present meta rows to the users table?
I'm guessing that the information in the users table is more frequently queried (e.g. user_id, username, pass), so it is more efficient to keep those rows smaller. Is this true? And are there other reasons for this separation of tables?
Entity Attribute Value
It's known as the Entity Attribute Value (EAV) data model, and allows an arbitrary number of attributes to be assigned to a given entity. That means any number of meta-data entries per user.
Why use it
By default there are a few keys that WordPress sets (20 are stated in the question), but there can be any number. If all users have one thousand meta-data entries, there are simply one thousand entries in the usermeta table for each user; the database structure places no limit on the number of meta-data entries a user can have. It also permits one user to have one thousand meta-data entries while all others have 20, and still store the data efficiently - or any permutation thereof.
In addition to flexibility, using this kind of structure permits the main users table to remain small - which means more efficient queries.
Alternatives
The alternatives to using EAV include:
Modify the schema whenever the number of attributes changes
Store all attributes in a serialized string (on the user object)
Use a schemaless db
Permissions are the biggest problem with the first point: it is not a good idea to grant blanket access to alter the schema of your database tables, and that is a (sane) roadblock for many if not most WordPress installs (hosted on wordpress.com, or on a shared host where the db user has no ALTER permissions). MySQL also has a hard limit of 4096 columns and 65,535 bytes per row, so attempting to store a large number of columns in a single table will eventually fail, along the way creating a table that is inefficient to query.
Storing all attributes in a serialized string would make it difficult and slow to query by a meta-data value.
WordPress is quite tied to MySQL, so changing the datastore isn't a realistic option.
Further WP info
If you aren't using any/many plugins it's possible you will have a constant number of rows in the usermeta table for each user, but typically each plugin you add may need to add meta-data for users; the number added may not be trivial and this data is stored in the usermeta table.
The docs for add_user_meta may add some clarity as to why the database is structured that way. If you put code like this somewhere:
add_user_meta($user_id, "favorite_color", "blue");
It will create a row in the usermeta table for the given user_id, without the need to add a column (favorite_color) to the main users table. That makes it easy-ish to find users by favorite color without the need to modify the schema of the users table.
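For example, finding those users is then a plain join against the meta table, with no schema change required (a sketch against the standard wp_users/wp_usermeta columns):
SELECT u.ID, u.user_login
FROM wp_users u
JOIN wp_usermeta m ON m.user_id = u.ID
WHERE m.meta_key = 'favorite_color'
  AND m.meta_value = 'blue';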
This is really a question about database normalization. You can look for information on that topic in many places.
Basic answer: Since there is a huge literature about this and a lot of trade-offs involved, I will just give some examples of why this might happen. It boils down to trade-offs: speed versus storage requirements, or ease of use versus data duplication. Efficiency is multidimensional, and since WordPress does a lot of different things, it may have various reasons to keep the tables separate: space could be an issue, query speed may depend on it, and it may be easier to look at just the meta table instead of the full table for some purposes, or vice versa.
Further reading: This is a deep topic and you may want to learn more; there are hundreds of books and thousands of scholarly papers on these issues. For instance, look at these previous SO questions about designing a database: "Database design: one huge table or separate tables?" and "First-time database design: am I overengineering?", or "Database Normalization Basics" on About.com.

LONG TEXT or thousands of rows, or maybe something else?

I'm designing a program where the user makes a single choice on thousands (or potentially millions) of people. I've thought of 2 ways of storing this in a database:
1) a separate row for each entry
2) a single long text that just appends a choice for a new person or modifies a choice for an existing person.
I'd imagine separate rows for each entry should be more efficient, but if we're talking about, let's say, hundreds of thousands of entries, then what is the network overhead I'm looking at for queries on that versus just returning a single long text and using the user's cpu to parse the text?
As an example, a single long text might be something like:
Data
[Person A:Choice A][Person B: Choice A][Person C: Choice C]...[Person n:Choice n]
Whereas multiple rows obviously would be:
Person Choice
A A
B A
C C
....
n n
Maybe I'm not thinking of this in the right way in the first place. Is there a more efficient way of doing something like this?
Thanks for your input.
I'll put my comments in an answer and expand in places.
Regarding your decision of string vs. table: table, every time.
A design based on a table Person (Id, Name), a table Choice (Id, Value) and a table PersonChoice (Id, PersonId, ChoiceId) will give you an indexable, searchable and flexible solution.
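A sketch of that design (the index at the end is my addition, not part of the answer):
CREATE TABLE Person (Id INT PRIMARY KEY, Name NVARCHAR(100) NOT NULL);
CREATE TABLE Choice (Id INT PRIMARY KEY, Value NVARCHAR(100) NOT NULL);
CREATE TABLE PersonChoice (
    Id       INT IDENTITY PRIMARY KEY,
    PersonId INT NOT NULL REFERENCES Person (Id),
    ChoiceId INT NOT NULL REFERENCES Choice (Id)
);
-- Answers "who picked choice X?" without scanning every row.
CREATE INDEX IX_PersonChoice_Choice ON PersonChoice (ChoiceId, PersonId);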
Hiding data in text columns in SQL is a very bad idea - obviously ignoring XML data and its datatype. But that doesn't apply here.
One solution for adding statistics at a later date could be to have scheduled SQL Agent jobs running off the data, parsing what changes were made and when, and storing that data in separate "reporting" tables.
Something to consider in your design - to save yourself having to store and manipulate thousands of rows - is the idea of grouping choices together. It could save you a great deal of work (both for yourself and the server).
Welcome to the world of database design!
I'd imagine separate rows for each entry should be more efficient, but if we're talking about, let's say, hundreds of thousands of entries, then what is the network overhead I'm looking at for queries on that versus just returning a single long text and using the user's cpu to parse the text?
Network overhead (as a difference between the two designs) is negligible. Essentially all you're sending to the server is your query, and all the server returns is the result set. If the result of your query is one row, the server returns only one row. If the result of your query is 10,000 rows, the server sends back 10,000 rows.
The real overhead is in execution speed on the server and in maintenance. The server will find indexed rows in a table quickly if you use an index. But finding the 17 in a single value like "1, 2, 3, 5, 6, 7, 8, 10, 13, 17, 18, 30, 27" probably won't use an index.
Values like that also lose type safety and the ability to use foreign key references and cascades.
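To illustrate the difference (a sketch; PersonChoicesCsv is a made-up table holding the choices as one string, PersonChoice is the normalized design sketched earlier):
-- CSV-in-a-string: forces a scan, and '%17%' also matches 117, 170, ...
SELECT PersonId FROM PersonChoicesCsv WHERE ChoiceList LIKE '%17%';

-- Normalized rows: an index seek, with type safety and FK integrity.
SELECT PersonId FROM PersonChoice WHERE ChoiceId = 17;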

Lucene Indexing

I would like to use Lucene to index a table in an existing database. I have been thinking the process would be something like this:
Create a 'Field' for every column in the table
Store all the Fields
'ANALYZE' all the Fields except for the Field with the primary key
Store each row in the table as a Lucene Document.
While most of the columns in this table are small in size, one is huge. This column is also the one containing the bulk of the data on which searches will be performed.
I know Lucene provides an option to not store a Field. I was thinking of two solutions:
Store the field regardless of the size and if a hit is found for a search, fetch the appropriate Field from Document
Don't store the Field and, if a hit is found for a search, query the database to get the relevant information out
I realize there may not be a one size fits all answer ...
For sure, your system will be more responsive if you store everything in Lucene. Stored fields do not affect query time; they only make your index bigger, and probably not that much bigger if only a small portion of the rows have a lot of data. So if index size is not an issue for your system, I would go with that.
I strongly disagree with Pascal's answer. Index size can have a major impact on search performance. The main reasons are:
stored fields increase index size, which can be a problem with a relatively slow I/O system;
stored fields are all loaded when you load a Document into memory, which can put significant stress on the GC;
stored fields are likely to impact reader reopen time.
The final answer, of course, is: it depends. If the original data is already stored somewhere else, it's good practice to retrieve it from the original data store.
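In that setup the Lucene documents would carry only the primary key, and after a search you pull the full rows back from the database in one round trip (a sketch; the table and column names are made up):
-- The ids below stand in for the primary keys returned by the Lucene search.
SELECT id, title, big_text_column
FROM source_table
WHERE id IN (101, 205, 309);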
When adding a row from the database to Lucene, you can judge whether a column actually needs to be written to the inverted index. If not, you can use Field.Index.NO to avoid writing too much data to the inverted index.
Similarly, you can judge whether a column's value will need to be retrieved by key after a search. If not, you needn't use Field.Store.YES to store the data.