Why does WordPress have separate 'usermeta' and 'users' SQL tables. Why not combine them? - sql

Alongside the users table, WordPress has a usermeta table with the following columns:
meta_id
user_id
meta_key (e.g. first_name)
meta_value (e.g. Tom)
Each user has 20 rows in the usermeta table, regardless of whether or not those rows have a filled-in meta_value. That being the case, would it not be more efficient to add the always-present meta rows to the users table as columns?
I'm guessing that the information in the users table is more frequently queried (e.g. user_id, username, pass), so it is more efficient to keep those rows smaller. Is this true? And are there other reasons for this separation of tables?

Entity Attribute Value
It's known as the Entity Attribute Value (EAV) data model, and allows an arbitrary number of attributes to be assigned to a given entity. That means any number of meta-data entries per user.
Why use it
By default there are a few keys that WordPress sets (the 20 stated in the question), but there can be any number. If all users have one thousand meta-data entries, there are simply one thousand rows in the usermeta table for each user; the database structure places no limit on the number of meta-data entries a user can have. It also permits one user to have one thousand meta-data entries while all others have 20, and still stores the data efficiently - or any permutation thereof.
In addition to flexibility, using this kind of structure permits the main users table to remain small - which means more efficient queries.
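For illustration, here is a minimal EAV sketch modeled on WordPress's wp_usermeta; the exact key and index definitions are an approximation, not the official WordPress DDL:

CREATE TABLE usermeta (
    meta_id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id    BIGINT UNSIGNED NOT NULL,
    meta_key   VARCHAR(255),
    meta_value LONGTEXT,
    PRIMARY KEY (meta_id),
    KEY user_id (user_id),
    KEY meta_key (meta_key(191))
);

-- A brand-new attribute is just an INSERT; no ALTER TABLE is needed:
INSERT INTO usermeta (user_id, meta_key, meta_value)
VALUES (42, 'first_name', 'Tom');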
Alternatives
The alternatives to using EAV include:
Modify the schema whenever the number of attributes changes
Store all attributes in a serialized string (on the user object)
Use a schemaless db
Permissions are the biggest problem with the first point: it is not a good idea to grant blanket access to alter the schema of your database tables, and that is a (sane) roadblock for many if not most WordPress installs (hosted on wordpress.com, or on a shared host where the db user has no ALTER permissions). MySQL also has hard limits of 4,096 columns and 65,535 bytes per row. Attempting to store a large number of columns in a single table will eventually fail, and along the way creates a table that is inefficient to query.
Storing all attributes in a serialized string would make it difficult and slow to query by a meta-data value.
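To see why, consider a hypothetical design where all user meta-data lives in one serialized TEXT column (meta_blob is an invented name): the only way to filter on a single attribute is fragile string matching over the whole table, which no index can help with:

-- Hypothetical serialized storage; matching a PHP-serialized fragment forces a full scan:
SELECT ID
FROM users
WHERE meta_blob LIKE '%"favorite_color";s:4:"blue"%';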
WordPress is quite tied to MySQL, and therefore changing the datastore isn't a realistic option.
Further WP info
If you aren't using any (or many) plugins, it's possible you will have a constant number of rows in the usermeta table for each user; but typically each plugin you add may need to add meta-data for users. The number added may not be trivial, and this data is stored in the usermeta table.
The docs for add_user_meta may add some clarity as to why the database is structured that way. If you put code like this somewhere:
add_user_meta($user_id, "favorite_color", "blue");
It will create a row in the usermeta table for the given user_id, without the need to add a column (favorite_color) to the main users table. That makes it easy-ish to find users by favorite color without modifying the schema of the users table.
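Roughly, the lookup then becomes a join against the meta table; a sketch using the standard wp_ table prefix:

SELECT u.ID, u.user_login
FROM wp_users u
JOIN wp_usermeta m ON m.user_id = u.ID
WHERE m.meta_key = 'favorite_color'
  AND m.meta_value = 'blue';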

This is really a question about database normalization. You can look for information on that topic in many places.
Basic answer: Since there is a huge literature about this, and there are a lot of differences, I will just give some examples of why this might happen. It boils down to trade-offs: speed versus storage requirements, or ease of use versus data duplication. Efficiency is multidimensional, and since WordPress does a lot of different things, it may have various reasons to keep them separate - space could be an issue, speed of queries may depend on it, or it may be easier to look at just the meta table instead of the full table for some purposes, or vice versa.
Further reading: This is a deep topic and you may want to learn more - there are hundreds of books and thousands of scholarly papers on these issues. For instance, look at these previous SO questions about designing a database:
Database design: one huge table or separate tables?, or this one: First-time database design: am I overengineering?
or Database Normalization Basics on About.com.

Related

Best practice for potentially large amount of data in single table SQL

I have a table with three columns: user_id, post_id and voting.
Each post in the application can be voted on by multiple users, but with only one vote per user.
Imagine we end up with 1 billion posts and an average of 1,000 votes on each. Would it still be good practice to store all this information in a single table, performance-wise?
Or is it better to split the information across different tables, perhaps per user? (There are potentially several thousand users, so that would mean several thousand tables.)
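For concreteness, a minimal sketch of such a table (column types are assumptions); a composite primary key is one way to enforce the one-vote-per-user-per-post rule:

CREATE TABLE votes (
    user_id BIGINT   NOT NULL,
    post_id BIGINT   NOT NULL,
    voting  SMALLINT NOT NULL,
    PRIMARY KEY (post_id, user_id)  -- one vote per user per post
);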
As a best practice, create a non-clustered index on user_id. The reason I say this is that it will help you run operations like getting all posts of a particular user.
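A sketch of that index; NONCLUSTERED is SQL Server syntax, and on MySQL or PostgreSQL a plain CREATE INDEX does the same job:

CREATE NONCLUSTERED INDEX ix_votes_user_id ON votes (user_id);

-- Per-user lookups can now be served from the index:
SELECT post_id, voting FROM votes WHERE user_id = 1234567;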
If in the future the data volume gets very big, even half of what you mentioned, you can do DB sharding, which will improve overall performance. The idea is to distribute data that can't fit on a single node onto a cluster of database nodes. You can refer to your vendor's documentation to achieve this.

Is there a term to describe tables where only the last value should be used?

In my database there are tables where each row represents an entity, and there are other tables where the same entity can appear multiple times, but only the latest entry is the valid one.
For example, I have a table customer where each row represents a customer, and another table customer_membership_status where the same customer can be referenced multiple times, but only the last record for each customer is supposed to be used. Data is never updated in customer_membership_status, only inserted.
Is there a term to describe this pattern? I'm asking because I would like to quickly and easily explain the intended use of the table to others.
Probably the best terms for this would be CQRS and event sourcing:
Using the stream of events as the write store, rather than the actual data at a point in time, avoids update conflicts on a single aggregate and maximizes performance and scalability
Event Sourcing pattern
Instead of storing just the current state of the data in a domain, use an append-only store to record the full series of actions taken on that data. The store acts as the system of record and can be used to materialize the domain objects. This can simplify tasks in complex domains, by avoiding the need to synchronize the data model and the business domain, while improving performance, scalability, and responsiveness. It can also provide consistency for transactional data, and maintain full audit trails and history that can enable compensating actions.
EDIT:
After a closer look, you may also want to read about SCD (Slowly Changing Dimension) Type 2.
This method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. Unlimited history is preserved for each insert.
Temporal table. It's a table where a timestamp / version attribute is part of a key. The temporal / version attribute allows you to identify which is the latest row for each customer.
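Whatever you call the pattern, reading "only the latest row per customer" is a standard window-function query. A sketch, assuming an inserted_at timestamp and a membership_status column (neither is named in the question):

SELECT customer_id, membership_status
FROM (
    SELECT customer_id, membership_status,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY inserted_at DESC) AS rn
    FROM customer_membership_status
) latest
WHERE rn = 1;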

Normalization of SQL Database with similar data managed by different tools

I'm designing a database for storing a bunch of product data that is both pulled via an API and scraped off the web. The scraper will pull some data that is static and some data that varies with time, so there will be one table for each type of data (static/variable). I'm trying to decide if there should be a separate table for variable data that is scraped compared to variable data that is pulled through an API.
At first, I thought they should be stored in separate tables because they are managed by separate tools. However, data will be pulled through the API and scraped on the same schedule (daily), so both will be mapped to the same ProductID and date. It seems like I could just combine the schemas of both tables to save on join time when querying the data later. The obvious downside is managing whether rows need to be created or updated whenever one of the processes runs (i.e. which of the scraper vs. API tools creates or updates rows).
For what it's worth, these scripts will be pulling data for millions (maybe tens of millions) of rows per day, and storing it for quite a while. So, the tables are going to get quite huge, and that's why I'm concerned with join times later on.
Here's an example in case this is all a little cloudy. There are multiple industries for this, but I'll just use real estate:
Scraped Static Data: ProductID, Address, City, State, Zip, SquareFeet, etc.
Scraped Variable Data: ProductID, Price, PricePerSqFt, etc.
API Variable Data: ProductID, PageHits, UniqueVisitors, etc.
Mainly just the variable data is the concern here. So, to summarize: separate tables for the sake of general design principles, or one table for the sake of speed on joins?
Thanks in advance for the input
The example you give indicates that, apart from having 2 or 3 tables, you should also consider having just one table for both static and variable data. As long as the key of everything is just the product id, you can keep all information describing a particular id value in one record. Or do you intend to have a timestamp as part of the key of your variable data?
Once this has been decided, I can't see any advantage in having more tables than necessary.
The joins you mention won't be particularly complicated, as they basically mean reading a single record from each of your tables, each time using a primary key, which is fast. But still, reading 3 records means more effort than reading 2, or only one.
There is no general design principle saying you should have a separate table for each way to collect data. On the contrary, it's the purpose of a database to contain data according to their logical structure without (too much) regard of the technical means of collecting or accessing them.
The logic to decide whether to insert or update a row isn't complicated. Also, if you want to verify your data, you might need some logic anyway, e.g. making sure that variable data only gets inserted for an object that already has static data.
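As a sketch of that insert-or-update logic in MySQL syntax (the table name, SnapshotDate column, and primary key on (ProductID, SnapshotDate) are all assumptions): each tool touches only its own columns, so the scraper and the API job can share one row per product per day:

-- Run by the scraper; the API job does the same with its own columns (e.g. PageHits):
INSERT INTO product_variable (ProductID, SnapshotDate, Price, PricePerSqFt)
VALUES (?, CURRENT_DATE, ?, ?)
ON DUPLICATE KEY UPDATE
    Price        = VALUES(Price),
    PricePerSqFt = VALUES(PricePerSqFt);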

A large table with 100 columns vs. a lot of little tables

I created a website which contains users, comments, videos, photos, messages and more. All of the data is in one table with 100 columns. I thought one table was better than more because the user only needs to connect to one table, but I heard that some programmers don't like this method. Can someone tell me which one is better: one very large table, or a lot of little tables?
And why do I need to use a lot of tables? Why is it useful? Which one is faster for the user?
What are the advantages and disadvantages of a large table versus a lot of little tables?
100 columns in a single table is bad design in most situations.
Read this page: http://www.tutorialspoint.com/sql/sql-rdbms-concepts.htm
Break your data up into related chunks and give each of them their own table.
You said you have this information (users, comments, videos, photos, messages), so you should have something like these tables:
Users which contains (User ID, Name, Email etc)
Comments which contains (Comment ID, User ID, Comment Text etc)
Videos which contains (Video ID, User ID, Comment ID, Video Data etc)
Photos which contains (Photo ID, User ID, Comment ID, Photo Data etc)
Messages which contains (Message ID, User ID, Message Text etc)
Then when you're writing your SQL, you can write proper SQL to query based on exactly what information you need.
SELECT USR.UserID, MSG.MessageID, MSG.MessageText
FROM Users AS USR
JOIN Messages AS MSG
  ON USR.UserID = MSG.UserID
WHERE USR.UserID = 1234567
With your current query you're having to deal with rows containing data that you don't need or care about.
EDIT
Just to give some further information to the OP as to why this is better design.
Lets take the "Users" as a starting example.
In a proper database design you would have a table called Users which has all the required columns that are required for a user to exist. Username, email, id number etc.
Now we want to create a new user, so we want to insert Username, email and id number. But wait, I still have to populate 97 other columns with information totally unrelated to our process of creating a new user! Even if you store NULL in all columns, it's going to use some space in the database.
Also imagine you have hundreds of users all trying to select, update and delete from a single database table. There is a high chance of the table being locked. But if you have one user updating the Users table and another user inserting into the Messages table, the work is spread out.
And as other users have said, it's purely performance. The database needs to get all the information and filter out what you want. If you have a lot of columns, this is unnecessary work.
Performance Example.
Lets say your database has been running for years. You have 5000 users, 2,000,000 comments, 300,000 pictures, 1,000,000 messages. Your single table now contains 3,305,000 records.
Now you want to find a User with the ID of 12345 who has more than 20 pictures. You need to search through all 3,305,000 records to get this result.
If you had a split table design then you would only need to search through 305,000 records.
Obvious performance gain!!
EDIT 2
Performance TEST.
I created a dummy table containing 2 million rows and 1 column. I ran the query below, which took 120ms on average over 10 executions.
SELECT MyDate1 from dbo.DummyTable where MyDate1 BETWEEN '2015-02-15 16:59:00.000' and '2015-02-15 16:59:59.000'
I then truncated the table, created 6 more columns, populated them with 2 million rows of test data, and ran the same query. It took 210ms on average over 10 executions.
So adding more columns decreases performance, even though you're not viewing the extra data.
Wide tables can cause performance problems if they are wider than the database can store in one place.
You need to read about normalization, as this type of structure is very bad and is not what the database is optimized for. In your case you will have many repeated records, and you will have to use DISTINCT (which is a performance killer) to get rid of them when you want to show only the user name or the comments.
Additionally, you may have some fields that are repeated, like comment1, comment2, etc. Those are very hard to query over time, and if you need another one, you have to change the table structure and potentially change the queries. That is a bad way to do business.
Further when you only have one table, it becomes a hot spot in your database and you will have more locking and blocking.
Now also suppose that one of those pieces of information is updated: you have to make sure to update all the records, not just one. This can also be a performance killer, and if you don't do it, you will have data integrity problems which will make the data in your database essentially useless. Denormalizing is almost always a bad idea, and always a bad idea when done by someone who is not an expert in database design. There are many ramifications of denormalization that you probably haven't thought of.
Overall, your strategy is a sure loser over time and needs to be fixed ASAP, because the more records you have in a database, the harder it is to refactor.
For your situation it is better to have multiple tables. The reason is that if you put all your data into one table, you will have update anomalies. For example, if a user decides to update his username, you will have to update every single row in your big table that has that user's username. But if you split it into multiple tables, you will only need to update one row in your User table, and all the rows in your other tables will reference that updated row.
As far as speed, having one table will be faster than multiple tables for SELECT statements because joining tables is slow. INSERT statements will be about the same speed in either situation because you will be inserting one row. However, updating someone's username with an UPDATE statement will be very slow with one table if they have a lot of data, because it has to go through each row and update every one of them, as opposed to updating a single row in the User table.
So, you should create tables for everything you mentioned in your first sentence (users, comments, videos, photos, and messages) and connect them using Ids like this:
User
-Id
-Username
Video
-Id
-UploaderId references User.Id
-VideoUrl
Photo
-Id
-UploaderId references User.Id
-PhotoUrl
VideoComment
-CommenterId references User.Id
-VideoId references Video.Id
-CommentText
PhotoComment
-CommenterId references User.Id
-PhotoId references Photo.Id
-CommentText
Message
-SenderId references User.Id
-ReceiverId references User.Id
-MessageText
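To make the update-anomaly point above concrete: with this split design a username change is a single-row operation, instead of touching every wide row that embeds the name:

UPDATE User SET Username = 'new_name' WHERE Id = 42;
-- Comments, videos, photos and messages reference User.Id, so none of them change.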

SQL Structure, Dynamic Two Columns or Unique Columns

I'm not sure which is faster. I need to store lists of possible data.
Currently I have an SQL table with the following structure being accessed with php.
boxID
place
name -- (serialNum, itemNum, idlock, etc, etc)
data
--(Note: The primary key here would be boxID, place, name, and data, to prevent duplicate data.)
The reason I set it up like this was to avoid creating a column per named data item. It's possible that in the future there will be 5-10 different named data items or more. It's also possible to store 1,000 - 10,000 entries of data in one week for just one named data item. It will be searched as well, e.g. getting place from a specific serialNum, then getting all data related to that place (a specific serialNum, itemNum, idLock, etc.).
But my concern is that my structure could be slower than just creating a named column for each kind of named data. For example:
boxID
place
serialNum
itemNum
idLock
etc
etc
--(Note: Not even sure how to add keys to this if I were to do it this way)
To sum it up: which is faster and better practice? (Keep in mind I'm still a novice with SQL.)
The best practice is to model your data as entities with specific attributes. Typically an entity has at most a few dozen attributes. The entities typically turn into tables, and the attributes typically turn into columns. That is, the physical model and the logical model are often very similar.
There may be other considerations. For instance, there is a limit on the number of columns a row can have -- and if you have more columns, you need another solution. Similarly, if the data is sparse (that is, most values are NULL), then having lots of unused columns may be a waste of space. That is, it is more efficient to store it in another format. SQL Server offers sparse columns for this reason.
My suggestion is that you design your table in an intuitive way with named columns. A volume of 1,000 - 10,000 rows per week is not that much data; that turns into 50,000 - 500,000 rows per year, which SQL Server should easily be able to handle. You don't say how many named entities you have, but tables with millions or tens of millions of rows are quite reasonable for modern databases.
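A sketch of that intuitive named-column design; the types, key choice, and index are assumptions based on the question:

CREATE TABLE box_data (
    boxID     INT         NOT NULL,
    place     VARCHAR(64) NOT NULL,
    serialNum VARCHAR(64) NULL,
    itemNum   VARCHAR(64) NULL,
    idLock    VARCHAR(64) NULL,
    PRIMARY KEY (boxID, place)
);

-- Supports the lookup described in the question (place from a specific serialNum):
CREATE INDEX ix_box_data_serialNum ON box_data (serialNum);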