One large table with 100 columns vs. many small tables - SQL

I created a website that contains users, comments, videos, photos, messages and more. All of the data is in one table with 100 columns. I thought one table was better than many because a user only needs to query one table, but I heard that some programmers don't like this method. Can someone tell me which one is better: one very large table or a lot of little tables?
Why would I need to use many tables? Why is it useful? Which one is faster for the user?
What are the advantages and disadvantages of a large table versus a lot of little tables?

100 columns in a single table is bad design in most situations.
Read this page: http://www.tutorialspoint.com/sql/sql-rdbms-concepts.htm
Break your data up into related chunks and give each of them their own table.
You said you have this information (users, comments, videos, photos, messages), so you should have something like these tables.
Users which contains (User ID, Name, Email etc)
Comments which contains (Comment ID, User ID, Comment Text etc)
Videos which contains (Video ID, User ID, Comment ID, Video Data etc)
Photos which contains (Photo ID, User ID, Comment ID, Photo Data etc)
Messages which contains (Message ID, User ID, Message Text etc)
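As a rough sketch, the first and last of those tables might be created like this (the column types are assumptions, not prescriptions):
CREATE TABLE Users (
    UserID INT PRIMARY KEY,      -- unique identifier for each user
    Name VARCHAR(100) NOT NULL,
    Email VARCHAR(255) NOT NULL
);
CREATE TABLE Messages (
    MessageID INT PRIMARY KEY,   -- unique identifier for each message
    UserID INT NOT NULL,         -- the sending user
    MessageText TEXT,
    FOREIGN KEY (UserID) REFERENCES Users (UserID)
);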
Then when you're writing your SQL you can write proper queries based on exactly what information you need.
SELECT USR.UserID, MSG.MessageID, MSG.MessageText
FROM Users AS USR
JOIN Messages AS MSG
    ON USR.UserID = MSG.UserID
WHERE USR.UserID = 1234567
With your current design you're having to deal with rows containing data that you don't need or care about.
EDIT
Just to give the OP some further information as to why this is better design.
Let's take "Users" as a starting example.
In a proper database design you would have a table called Users with all the columns required for a user to exist: username, email, ID number, etc.
Now we want to create a new user, so we want to insert a username, email and ID number. But wait: I still have to populate 97 other columns with information totally unrelated to the process of creating a new user! Even if you store NULL in all of those columns, it is still going to use some space in the database.
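To illustrate (a sketch reusing the Users table above; the values are made up), the normalized insert only touches the columns that matter:
-- Normalized design: only the relevant columns are involved.
INSERT INTO Users (UserID, Name, Email)
VALUES (1234567, 'Jane Doe', 'jane@example.com');
-- In the one-big-table design, the same operation drags along
-- 97 other columns that end up as NULLs or defaults.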
Also imagine you have hundreds of users all trying to select, update and delete from a single table. There is a high chance of the table being locked. But if one user is updating the Users table while another is inserting into the Messages table, the work is spread out.
And as others have said, there is pure performance. The database needs to read all the information and filter out what you want. If you have a lot of columns this is unnecessary work.
Performance Example.
Let's say your database has been running for years. You have 5,000 users, 2,000,000 comments, 300,000 pictures and 1,000,000 messages. Your single table now contains 3,305,000 records.
Now you want to find the user with ID 12345 who has more than 20 pictures. You need to search through all 3,305,000 records to get this result.
If you had a split table design, you would only need to search through 305,000 records (the Users and Photos tables).
Obvious performance gain!!
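For instance, assuming the Photos table described above, the lookup only ever has to touch the photo records:
SELECT UserID, COUNT(*) AS PictureCount
FROM Photos
WHERE UserID = 12345
GROUP BY UserID
HAVING COUNT(*) > 20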
EDIT 2
Performance TEST.
I created a dummy table containing 2 million rows and 1 column. I ran the below query which took 120ms on average over 10 executions.
SELECT MyDate1 from dbo.DummyTable where MyDate1 BETWEEN '2015-02-15 16:59:00.000' and '2015-02-15 16:59:59.000'
I then truncated the table and created 6 more columns and populated them with 2 million rows of test data and ran the same query. It took 210ms on average over 10 executions.
So adding more columns decreases performance, even though you're not viewing the extra data.

Wide tables can cause performance problems if they are wider than the database can store in one place.
You need to read about normalization, because this type of structure is very bad and is not what the database is optimized for. In your case you will have many repeated records, and you will have to use DISTINCT (which is a performance killer) to get rid of them when you want to show only the user name or the comments.
Additionally, you may have some fields that are repeats, like comment1, comment2, etc. Those are very hard to query over time, and if you need another one you have to change the table structure and potentially the queries. That is a bad way to do business.
Further, when you only have one table it becomes a hot spot in your database and you will have more locking and blocking.
Now also suppose that one of those pieces of information is updated: you have to make sure to update all the records, not just one. This can also be a performance killer, and if you don't do it you will have data integrity problems that make the data in your database essentially useless. Denormalizing is almost always a bad idea, and always a bad idea when done by someone who is not an expert in database design. There are many ramifications of denormalization that you probably haven't thought of.
Overall, your strategy is a sure loser over time and needs to be fixed ASAP, because the more records you have in a database, the harder it is to refactor.

For your situation it is better to have multiple tables. The reason is that if you put all your data into one table, you will have update anomalies. For example, if a user decides to update his username, you will have to update every single row in your big table that has that user's username. But if you split it into multiple tables, you only need to update one row in your User table, and all the rows in your other tables will reference that updated row.
As far as speed goes, having one table will be faster for SELECT statements because joining tables is slow. INSERT statements will be about the same speed in either situation because you insert one row either way. However, updating someone's username with an UPDATE statement will be very slow with one table if there is a lot of data about that user, because every matching row has to be updated, as opposed to updating a single row in the User table.
So, you should create tables for everything you mentioned in your first sentence (users, comments, videos, photos and messages) and connect them using IDs like this:
User
-Id
-Username
Video
-Id
-UploaderId references User.Id
-VideoUrl
Photo
-Id
-UploaderId references User.Id
-PhotoUrl
VideoComment
-CommenterId references User.Id
-VideoId references Video.Id
-CommentText
PhotoComment
-CommenterId references User.Id
-PhotoId references Photo.Id
-CommentText
Message
-SenderId references User.Id
-ReceiverId references User.Id
-MessageText
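To show how these tables connect (a sketch using the names above, with a made-up uploader ID), fetching all comments on one user's videos looks something like:
-- Note: User is a reserved word in some databases and may need quoting.
SELECT u.Username, vc.CommentText
FROM VideoComment AS vc
JOIN Video AS v ON vc.VideoId = v.Id
JOIN User AS u ON vc.CommenterId = u.Id
WHERE v.UploaderId = 42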

Related

Track database changes or differentiate records with timestamp?

Keeping track of changes to a database must be a big concern for lots of people, but it seems that the big names have software for that.
My question is about a small SQL database with 10 tables, <10 columns each, that uses joins to create a "master" junction table: is there a downside to updating a few times per year by adding rows (with a lot of duplicate information) and then taking the MAX id (PK) to generate and post on a website the most recent data in tabular form (excerpted from the "master")? This versus updating the records in place, in which case I'd lose information about the values at a particular moment.
A typical row for teacher contact information would have fName, lName, schoolName, [address & phone info]; for repertoire or audition information: year, instrument, piece, composer, publisher/edition.
Others have asked about tracking db changes, but only one recently, and not with a lot of votes/details:
How to track data changes in a database table
Keeping history of data revisions - best practice?
This lightweight solution seems promising, but I don't know if it didn't get votes because it's not helpful, or because folks just weren't interested.
How to keep track of changes to data in a table?
More background, if needed:
I'm a music teacher (i.e. amateur programmer) maintaining a Joomla website for our organization. I'm using a Joomla plugin called Sourcerer to create dynamic content (PHP/SQL to the Joomla database) to make it easier to communicate changes (dates, personnel, rules, repertoire, etc.). For years, this was done with static pages (and paper handbooks) that took days to update.
I also, however, want to be able to look back and see the database state at a particular time: who taught where, what audition piece was listed, etc., as we could with paper versions. NOTE: I'm not tracking HTML changes, only that information fed from the database.
Thanks for any help! (I've followed SO for years, but this is my first question.)
Here is the code I'm using now to generate the "master" junction table. I would modify this to INSERT INTO for my new rows and query from it via Sourcerer to post the information online.
CREATE TABLE 011people_to_schools_junction AS (
    SELECT *
    FROM (
        SELECT a.peopleID, a.districtID, a.firstName, a.lastName, a.statusID, c.schoolName
        FROM 01People a
        INNER JOIN (
            SELECT districtID, MAX(peopleID) peopleID
            FROM 01People
            GROUP BY districtID
        ) b
            ON a.districtID = b.districtID
            AND a.peopleID = b.peopleID
        INNER JOIN (
            SELECT schoolID, MAX(peopleID) peopleID
            FROM 01people_to_schools_junction ab
            GROUP BY schoolID
        ) z
            ON z.peopleID = a.peopleID
        LEFT JOIN 01Schools c
            ON c.schoolID = z.schoolID
        WHERE z.schoolID IS NOT NULL
            OR z.peopleID IS NOT NULL
        ORDER BY c.schoolName
    ) t1
);

# Add a primary key as the first column
ALTER TABLE 011people_to_schools_junction
    ADD COLUMN 011people_to_schoolsID INT NOT NULL AUTO_INCREMENT FIRST,
    ADD PRIMARY KEY (011people_to_schoolsID);
To answer your questions in order:
Is there a downside?
Of course, and it's performance-related: if you add a million records each year, it will hurt performance and occupy space on disk.
Were the suggestions in the linked questions bad, or just not popular?
The questions and answers are good, but the right answer depends on your specific use case: are you doing it for legal reasons, how fast do you need to access the data, how much data and how many updates do you have, how long should the history functionality last without changes... people only vote when an answer fits their own use case.
As a rule of thumb, history should go into a different table. This provides several advantages:
your current tables don't change, so your code needs no changes except for also storing the current version in the history;
your application doesn't slow down;
if your history tables grow you can move them easily to a different server;
Whether to have a single history table or several (one per backed-up table) depends on how you plan to retrieve the data and what you want to do with it:
if you mirror each of your tables, adding a timestamp and the user id, your code needs only small modifications; but you end up with twice as many tables, and any structure change then has to be replicated on the history table as well;
if you build a single history table with the timestamp, the user id, the table name and a JSON representation of the record, it is easier to build; for retrieval you would access the data using an object per row, i.e. using Joomla's dbo getObjectList(), so the objects will be in the same format you store in the history table and applying changes will be fairly easy. But querying for changes across specific tables/fields will be much harder. A sketch of this follows below.
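A minimal sketch of that single-history-table approach (the table and column names are assumptions; the JSON column type needs MySQL 5.7+, so use TEXT on older versions):
CREATE TABLE change_history (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    table_name VARCHAR(64) NOT NULL,   -- which source table the row came from
    record_id INT NOT NULL,            -- primary key of the source row
    changed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    changed_by INT,                    -- user id, if available
    payload JSON                       -- full snapshot of the row at this moment
);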
Keep in mind that having data is useless if you can't retrieve it properly.
Since you mention pushing to the website a few times a year, the overhead of the queries should not be an issue (if you update monthly, waiting 5 minutes may not be a problem).
You should pick the best solution based on the other uses of this data: for it to be useful to anyone, you will have to implement a system to retrieve the historical data. If phpMyAdmin is enough, look no further.
I hope this scared you. Either way it's a lot of hard work.
If you just want to be able to look up old data, you may instead store a copy of the markup/output you generate from time to time, and save it to different folders on the webserver. This will take minutes to set up, and be extremely reliable.
Sure, it's more fun to code it. But are you really sure you need it? And you can keep the database dumps just in case one day you change your mind.

Best practice for tables with varying content

Currently I am working on a problem where I have to log data in an Oracle 10g database. I want to store data from up to 40 devices (but not necessarily always 40) as one data point; these share a bit of information, and the rest is device-specific.
So I could either create arrays for every device-specific column, and whenever a device is in use the corresponding array field gets populated:
ID TIMESTAMP BOARD DEVICE_ID[40] ERROR_CNT[40] TEMP[40] MORE_DATA[40]...
But I think I would be wasting a lot of database space that way, because the arrays would be sparsely populated.
The other method I can think of would be to use the same ID for a multi-line entry and put as many rows into the table as there are devices in use:
ID  TIMESTAMP  BOARD  DEVICE_ID  ERROR_CNT  TEMP  MORE_DATA
1   437892     1      1          100        25    xxx
1   437892     1      2          50         28    yyy
Now the shared information appears multiple times in the database and the data is scattered across multiple lines.
Another issue is that there might be columns used by only some of the devices while the others do not carry that information, so there might be even more unused fields. So maybe it would be best to create multiple tables, split the devices into groups according to the information they have, and log their data in the corresponding tables.
I appreciate any help. Maybe I am just being paranoid about wasted db space and should not worry about it, and simply follow the 'easiest' approach, which I think would be the one with arrays.
Never store arrays in a database. Violating first normal form is a big mistake.
Worry more about how the data is queried than how it is stored. Keep the data model "dumb" and there are literally millions of people who can understand how to use it. There are probably only a few hundred people who understand Oracle object types.
For example, using object types, here is the simplest code to create a table, insert data, and query it:
drop table device;

-- A collection type: a nested table of numbers.
create or replace type error_count_type is table of number;

-- The collection column needs its own backing table.
create table device(id number, error_count error_count_type)
    nested table error_count store as error_count_table;

-- Instantiate the type to insert a row with two counts.
insert into device values(1, error_count_type(10, 20));
commit;

-- Unnest the collection with TABLE(); each element appears as COLUMN_VALUE.
select sum(column_value) error_count
from device
cross join table(error_count);
Not many people or tools understand creating types, STORE AS, instantiating types, COLUMN_VALUE, or TABLE(...). Internally, Oracle stores arrays as tables anyway, so there's no performance benefit.
Do it the simple way, with multiple tables. As Gordon pointed out, it's a small database anyway. Keep it simple.
I think this is too long for a comment:
1000 hours * 12/hour * 40 devices = 480,000 rows.
This is not a lot of data, so I wouldn't worry about duplication of values. You might want to go with the "other method" because it provides a lot of flexibility.
You can store all the data in columns, but if you get the columns wrong, you have to start messing around with ALTER TABLE statements, and that might affect queries you have already written.
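A sketch of the multi-row design in Oracle syntax (the table and column names are illustrative): the shared fields live in one table, and one row per active device goes in a second:
-- Shared information, stored once per data point.
CREATE TABLE measurement (
    measurement_id NUMBER PRIMARY KEY,
    measured_at    TIMESTAMP NOT NULL,
    board          NUMBER NOT NULL
);

-- Device-specific values: one row per device that was active.
CREATE TABLE device_reading (
    measurement_id NUMBER NOT NULL REFERENCES measurement (measurement_id),
    device_id      NUMBER NOT NULL,
    error_cnt      NUMBER,
    temp           NUMBER,
    more_data      VARCHAR2(100),
    PRIMARY KEY (measurement_id, device_id)
);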

DB schema for updating downstream sources?

I want a table to be sync-able by a web API.
For example,
GET /projects?sequence_latest=2113&limit=10
[{"state":"updated", "id":12,"sequence":2116},
{"state":"deleted" "id":511,"sequence":2115}
{"state":"created", "id":601,"sequence":2114}]
What is a good schema to achieve this?
I intend this for PostgreSQL with the Django ORM, which uses surrogate keys. The presence of an ORM may rule out answers that rely on UNIONs.
I can come up with only half-solutions.
1. I could have a modified_time column, but that cannot convey deletions.
2. I could have a table for storing deleted IDs; when returning 10 new/updated rows, I could return all the deleted rows between them. But this works only when the latest change is an insert/update and there is a moderate number of deleted rows.
3. I could set a deleted flag on the row and NULL out the rest, but it's kind of bad schema design to make all columns nullable.
4. I could have another table that stores the ID, a modification sequence number and a state (new, updated, deleted), but it's another table to maintain, and assigning sequence numbers causes contention; imagine n concurrent requests querying for the latest ID.
If you're using an ORM you want simple(ish) and if you're serving the data via an API you want quick.
To go through your suggested options:
1. Correct, so this doesn't help you. You could have a deleted flag in your main table, though.
2. This seems quite a random way of doing it, and it breaks your insistence that there be no UNION queries.
3. Not sure why you would need to NULL the rest of the columns here. What benefit does that bring?
4. I would strongly advise against having a table with a modification sequence number. Either you end up performing a lot of analytic queries to find the most recent state, or you're updating the same rows multiple times and maintaining a table with the same PK as your normal one. At that point you might as well have a deleted flag in your main table.
Essentially the design of your API gives you one easy option: you should have everything in the same table, because all data is being returned through the same method. I would follow your point 2 and Wolph's suggestion and have a deleted_on column in your table, making it look like:
create table my_table (
id ... primary key
, <other_columns>
, created_on date
, modified_on date
, deleted_on date
);
I wouldn't even bother updating all the other columns to be NULL. If you want to ensure that you return no data, create a view on top of your table that nulls out data where the deleted_on column is populated. Then your API only accesses the table through the view.
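A rough sketch of such a view (assuming the my_table definition above, with one hypothetical data column standing in for <other_columns>):
-- Blank out the payload of deleted rows; the audit columns stay visible.
CREATE VIEW my_table_api AS
SELECT id,
       CASE WHEN deleted_on IS NULL THEN some_column END AS some_column,
       created_on,
       modified_on,
       deleted_on
FROM my_table;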
If you are really, really worried about space and the volume of records, and you will perform regular database maintenance to ensure that both are controlled, then maybe go with option 4: create a second table that has the state of each ID in your main table and actually delete the data from your main table. You can then do a LEFT OUTER JOIN to the main table to get the data; when there is no data, that ID has been deleted. Honestly, this is overkill until you know whether you will definitely require it.
You don't mention why you're using a web API for data transfers; but if you're going to be transferring a lot of data, or using this for internal systems only, it might be worth using a lower-level transfer mechanism.

How to structure a database of API usage history

I have a database of users for a web API, but I also want to store usage history for each user, i.e. page request count, data volumes, etc. What is the best way to implement this in terms of database structure? My initial thought was to retain the main table but then create a history table for each user. That seems horribly impractical, however. My gut feeling is that I probably need one separate table for usage history, but I am unclear as to how to structure it.
I am using SQLite.
For an event logging model (which is what you want), I can recommend two options.
One table; let's call it activity_log:
CREATE TABLE activity_log (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    event_type VARCHAR(10),
    event_time TIMESTAMP
);
For each event in your system affecting a user, you insert a record into this table (I believe the column names are self-explanatory). I believe SQLite doesn't provide a native TIMESTAMP type, so you'll have to handle the storage format in your application code. This design will leave you with a table that has the potential to grow very large, but it will give you fine-grained statistics. SQLite doesn't support clustered indexes, but there are some options here that will help you out with performance tuning.
The same table as above, only instead of inserting a new row for every event, you perform a conditional insert ("upsert"): update the existing row for users already in the table, insert a row for new users. This option will keep your table several times smaller than the one above, but you'll only have access to the most recent use of your API.
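A sketch of that conditional insert in SQLite (this assumes a UNIQUE(user_id, event_type) constraint on the table, and the ON CONFLICT upsert syntax needs SQLite 3.24+):
-- Keep one row per user and event type, refreshing the timestamp on repeats.
INSERT INTO activity_log (user_id, event_type, event_time)
VALUES (42, 'page_view', CURRENT_TIMESTAMP)
ON CONFLICT (user_id, event_type)
DO UPDATE SET event_time = excluded.event_time;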
If you can afford it, I'd say go with number 1.
In one of my programs, I maintain a table of module usage per user. The structure of the table is
table id
user id
prog id
date/time
history flag (0=current, 1=history)
runs (number of time user has run program on date)
About once a week, I aggregate the data in the table: if user 1 has run program 1 twice on a given date, then initially there will be two entries in the table:
1;1;1;04/10/12 08:56;0;1
2;1;1;04/10/12 09:33;0;1
After aggregation, the table becomes
3;1;1;04/10/12 00:00;1;2
Whilst the aggregation loses the time part, no other data is lost, and queries against the table will be quicker.
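A sketch of that weekly aggregation in generic SQL (the table and column names are assumptions based on the list above):
-- Collapse the current rows into one history row per user, program and date.
INSERT INTO module_usage (user_id, prog_id, run_date, history_flag, runs)
SELECT user_id, prog_id, DATE(run_date), 1, SUM(runs)
FROM module_usage
WHERE history_flag = 0
GROUP BY user_id, prog_id, DATE(run_date);

-- Then drop the per-run rows that were just aggregated.
DELETE FROM module_usage WHERE history_flag = 0;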

Why does WordPress have separate 'usersmeta' and 'users' SQL tables? Why not combine them?

Alongside the users table, WordPress has a usersmeta table with the following columns:
meta_id
user_id
meta_key (e.g. first_name)
meta_value (e.g. Tom)
Each user has 20 rows in the usersmeta table, regardless of whether or not the rows have a filled-in meta_value. That said, would it not be more efficient to add the always-present meta rows to the users table?
I'm guessing that the information in the users table is more frequently queried (e.g. user_id, username, pass), so it is more efficient to keep those rows smaller. Is this true? And are there other reasons for this separation of tables?
Entity Attribute Value
It's known as the Entity Attribute Value (EAV) data model, and allows an arbitrary number of attributes to be assigned to a given entity. That means any number of meta-data entries per user.
Why use it
By default there are a few keys that WordPress sets (20 stated in the question), but there can be any number. If all users have one thousand meta data entries, there are simply one thousand entries in the usermeta table for each user; the database structure places no limit on the number of meta data entries a user can have. It also permits one user to have one thousand meta data entries whilst all others have 20, and still stores the data efficiently, or any permutation thereof.
In addition to flexibility, using this kind of structure permits the main users table to remain small - which means more efficient queries.
Alternatives
The alternatives to using EAV include:
Modify the schema whenever the number of attributes changes
Store all attributes in a serialized string (on the user object)
Use a schemaless db
Permissions are the biggest problem with the first point: it is not a good idea to grant blanket access to alter the schema of your database tables, and this is a (sane) roadblock for many if not most WordPress installs (hosted on wordpress.com, or on a shared host where the db user has no ALTER permissions). MySQL also has a hard limit of 4,096 columns and 65,535 bytes per row; attempting to store a large number of columns in a single table will eventually fail, along the way creating a table that is inefficient to query.
Storing all attributes in a serialized string would make it difficult and slow to query by a meta-data value.
WordPress is quite tied to MySQL, and therefore changing the datastore isn't a realistic option.
Further WP info
If you aren't using any/many plugins, it's possible you will have a constant number of rows in the usermeta table for each user, but typically each plugin you add may need to add meta-data for users; the number added may not be trivial, and this data is stored in the usermeta table.
The docs for add_user_meta may add some clarity as to why the database is structured that way. If you put code like this somewhere:
add_user_meta($user_id, "favorite_color", "blue");
It will create a row in the usermeta table for the given user_id, without the need to add a column (favorite_color) to the main users table. That makes it easy-ish to find users by favorite color without the need to modify the schema of the users table.
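For example, a query along these lines (using WordPress's standard wp_users and wp_usermeta table names) would find those users:
-- Find every user whose favorite_color meta entry is 'blue'.
SELECT u.ID, u.user_login
FROM wp_users AS u
JOIN wp_usermeta AS m ON m.user_id = u.ID
WHERE m.meta_key = 'favorite_color'
  AND m.meta_value = 'blue';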
This is really a question about database normalization. You can look for information on that topic in many places.
Basic answer: since there is a huge literature about this, and there are a lot of differences, I will just give some examples of why this might happen. It boils down to trade-offs: speed versus storage requirements, or ease of use versus data duplication. Efficiency is multidimensional, and since WordPress does a lot of different things, it may have various reasons to keep them separate: space could be an issue, speed of queries may depend on it, and it may be easier to look at just the meta table instead of the full table for some purposes, or vice versa.
Further reading: this is a deep topic and you may want to learn more; there are hundreds of books and thousands of scholarly papers on these issues. For instance, look at these previous SO questions about designing a database:
Database design: one huge table or separate tables?
First-time database design: am I overengineering?
Database Normalization Basics (on About.com)