Store array of items in SQL table - sql

I know this has probably been asked a million times, but I can't find anything definitive. I'm making a website where users can build a list of items. What would be the best way to store their items in an SQL table?
I'm thinking I will need to make a separate table for each user, since I can't see any way to store an array. I think this would be inefficient, however.

Depending on what an "item" is, there seem to be two possible solutions:
1) a one-to-many relationship between users and items
2) a many-to-many relationship between users and items
If a single item (such as a "book") can be "assigned" to more than one user, it's 2). If each item is unique and can only belong to a single user it's 1).
one-to-many relationship:
create table users
(
user_id integer primary key not null,
username varchar(100) not null
);
create table items
(
item_id integer primary key not null,
user_id integer not null references users(user_id),
item_name varchar(100) not null
);
many-to-many relationship:
create table users
(
user_id integer primary key not null,
username varchar(100) not null
);
create table items
(
item_id integer primary key not null,
item_name varchar(100) not null
);
create table user_items
(
user_id integer not null references users(user_id),
item_id integer not null references items(item_id),
primary key (user_id, item_id)
);
Because of your extremely vague description, this is the best I can think of.
There is no need to use an array or something similar. It seems you are new to database modelling, so you should read up about normalisation. Each time you think about "arrays" you are probably thinking about "tables" (or relations).
Edit (just saw you mentioned MySQL): the above SQL will not create a foreign key constraint in MySQL (even though it will run without an error) due to MySQL's stupid "I'm not telling you if I can't do something" attitude. You need to define the foreign keys separately.
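To make the "tables instead of arrays" point concrete, here is a minimal sketch of the many-to-many layout above, run against an in-memory SQLite database from Python. SQLite, the sample rows, the composite primary key on user_items, and the helper name items_for_user are assumptions made for illustration and are not part of the original answer. A user's "array" is simply the result of a join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
create table users (
    user_id  integer primary key not null,
    username varchar(100) not null
);
create table items (
    item_id   integer primary key not null,
    item_name varchar(100) not null
);
-- Assumption: a composite primary key so one item cannot be listed twice for a user.
create table user_items (
    user_id integer not null references users(user_id),
    item_id integer not null references items(item_id),
    primary key (user_id, item_id)
);
""")
conn.executemany("insert into users values (?, ?)", [(1, "alice"), (2, "bob")])
conn.executemany("insert into items values (?, ?)", [(10, "book"), (11, "lamp")])
conn.executemany("insert into user_items values (?, ?)", [(1, 10), (1, 11), (2, 10)])

def items_for_user(user_id):
    """Return the user's list of item names, ordered by item_id."""
    rows = conn.execute(
        """select i.item_name
           from user_items ui
           join items i on i.item_id = ui.item_id
           where ui.user_id = ?
           order by i.item_id""",
        (user_id,),
    ).fetchall()
    return [name for (name,) in rows]

alice_items = items_for_user(1)  # ['book', 'lamp']
bob_items = items_for_user(2)    # ['book']
```

Both users share the same "book" row, which is exactly what the many-to-many variant allows; no per-user table is needed.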

A separate table for each user/account would be best. This will limit the size of the necessary tables and allow for faster searching. When you present data, you are usually displaying data for the current user/account. When you have to search through the table to find the relevant information, the application will start to slow down as the dependent table grows. Write the application as if it will be used to the fullest extent of SQL. This will limit the need for redesign in the future if the website becomes popular.

Related

Can I use identity for primary key in more than one table in the same ER model

As the title says, my question is: can I use int identity(1,1) for the primary key in more than one table in the same ER model? I found on the Internet that a primary key needs to have a unique value in every row. For example, if I set int identity(1,1) for this table:
CREATE TABLE dbo.Persons
(
Personid int IDENTITY(1,1) PRIMARY KEY,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Age int
);
GO
and the other table
CREATE TABLE dbo.Job
(
jobID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
nameJob NVARCHAR(25) NOT NULL,
Personid int FOREIGN KEY REFERENCES dbo.Persons(Personid)
);
Wouldn't Personid and jobID have the same value and because of that cause an error?
Constraints in general are defined and have a scope of one table (object) in the database. The only exception is the FOREIGN KEY which usually has a REFERENCE to another table.
The PRIMARY KEY (or any UNIQUE key) sets a constraint only on the table it is defined on and is not affecting or is not affected by other constraints on other tables.
The PRIMARY KEY defines a column or a set of columns which can be used to uniquely identify one record in one table (and none of the columns can hold NULL, UNIQUE on the other hand allows NULLs and how it is treated might differ in different database engines).
So yes, you might have the same value for PersonID and JobID, but their meaning is different. (And to select the one unique record, you will need to tell SQL Server in which table and in which column of that table you are looking for it, this is the table list and the WHERE or JOIN conditions in the query).
The queries SELECT * FROM dbo.Job WHERE JobID = 1; and SELECT * FROM dbo.Persons WHERE PersonID = 1; have different meanings even when the value you are searching for is the same.
You will define the IDENTITY on the table (the table can have only one IDENTITY column). You don't need to have an IDENTITY definition on a column to have the value 1 in it, the IDENTITY just gives you an easy way to generate unique values per table.
You can share generated values across tables by using a SEQUENCE, but that will not prevent you from manually inserting the same values into multiple tables.
In short, the value stored in the column is just a value, the table name, the column name and the business rules and roles will give it a meaning.
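As a small illustration of the point that key uniqueness is checked per table, the sketch below recreates trimmed-down versions of the two tables in an in-memory SQLite database from Python. SQLite's AUTOINCREMENT stands in for SQL Server's IDENTITY(1,1), and the sample names are made up; both are assumptions for the sake of a runnable example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Persons (
    Personid integer primary key autoincrement,
    LastName varchar(255) not null
);
create table Job (
    jobID    integer primary key autoincrement,
    nameJob  varchar(25) not null,
    Personid integer references Persons(Personid)
);
""")
person_id = conn.execute("insert into Persons (LastName) values ('Smith')").lastrowid
job_id = conn.execute(
    "insert into Job (nameJob, Personid) values ('Welder', ?)", (person_id,)
).lastrowid

# Both generated keys are 1; each primary key is only unique within its own table.
same_value = (person_id, job_id)

# Reusing a key value inside the same table is what actually fails.
try:
    conn.execute("insert into Persons (Personid, LastName) values (1, 'Jones')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The first two inserts succeed with the same generated value, and only the third, which repeats a key within Persons, is rejected.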
To the notion "every table needs to have a PRIMARY KEY and IDENTITY", I would like to add that in most cases there are multiple (independent) keys in the table. Usually every entity has something you can call a business key, which is, in loose terms, the key that the business (humans) uses to identify something. This key has very similar, and often the same, characteristics as a PRIMARY KEY with IDENTITY.
This can be a product's barcode, or the employee's ID card number, or something what is generated in another system (say HR) or a code which is assigned to a customer or partner.
These business keys are useful for humans, but not always useful for computers, but they could serve as PRIMARY KEY.
In databases we (the developers, architects) like simplicity, and a business key can be very complex (in computer terms), can consist of multiple columns, and can also cause performance issues (comparing strings is not the same as comparing numbers, and comparing multiple columns is less efficient than comparing one column), but worst of all, it might change over time. To resolve this, we tend to create our own technical key, which can be used by computers more easily and over which we have more control, so we use things like IDENTITYs and GUIDs and whatnot.

What table structure to go for if there are two objects of same type but of different nature?

Given that there are two kinds of products, X and Y. X has A, B and C as its primary key, whereas Y has A and D as its primary key. Should I put them in the same table? Why or why not?
I have currently put them in two separate tables but some colleagues are suggesting that they belong in the same table. My question is should I consider putting them in the same table or continue with different tables?
Below I have given example tables for the above case.
CREATE TABLE `product_type_b` (
`PRODUCT_CODE` VARCHAR(50) NOT NULL,
`COMPONENT_CODE` VARCHAR(50) NOT NULL,
`GROUP_INDICATOR` VARCHAR(50) NULL DEFAULT NULL,
`RECORD_TIMESTAMP` DATE NULL DEFAULT NULL,
PRIMARY KEY (`PRODUCT_CODE`, `COMPONENT_CODE`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB
;
CREATE TABLE `product_type_a` (
`PRODUCT_CODE` VARCHAR(50) NOT NULL,
`CHOICE_OF_COVER` VARCHAR(50) NOT NULL,
`PLAN_TYPE` VARCHAR(50) NOT NULL,
`RECORD_TIMESTAMP` DATE NULL DEFAULT NULL,
`PRODUCT_TENURE` INT(11) NULL DEFAULT NULL,
PRIMARY KEY (`PRODUCT_CODE`, `CHOICE_OF_COVER`, `PLAN_TYPE`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB
;
As you can see there are certain fields that are not common to both tables but are part of the primary key. There are also some other fields which are not common to both tables.
Here is the bigger picture of the system in consideration.
Each product type has a different source from where it is sent to the system.
We need to store these products in the database.
I would like to have a balance between normalization and performance so that my read-write speeds aren't compromised much due to over normalization.
There is also a web-app which will have a page where these products are searchable by the user.
User will populate certain column fields as filters based on which we need to fetch the products and show on the UI.
Variations in subtypes currently number 2 and are not expected to increase beyond 4-5, which again is going to be over a decade from now, maybe. This again is an approximation.
I hope this presents a bigger picture of the system.
I want to have good read and write speeds without compromising much. So should I go ahead with this design ? If not, what design should be implemented ?
For a trading system, and taking into account a maximum of 5 product types and a very limited number of attributes, I'd prefer a single table for all products with a surrogate PK. Think about references to products from trading transactions; this is the biggest part of the total DB content in the long run.
A metadata table describing every product-specific attribute and its mapping to the general table column would help to build UI and backend/frontend communications.
Search indexes would reflect the most popular user searches, depending on product type.
This is typical Category/SubCategory model issue. There are a few options:
(1) Put everything in one table, which will have some NULLable columns because different subtypes do not have the same attributes;
(2) One parent table for all the common attributes, plus a type indicator column. Then each subtype has its own table just for the columns specific to that subtype;
(3) Each subtype has its own table, including all the common columns shared by the subtypes.
(1) is good if the sub type is very limited;
(3) is suitable if the variations of the sub types are very limited.
The advantage of (2) is that it is easy to return all the records with the common columns. And if an artificial key (like an auto-increment id) is used, it ensures all records, regardless of subtype, have a unique id.
In your case, no artificial PK is used, I think your choice is not bad.
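For comparison, here is a hedged sketch of the parent-plus-subtype layout (the second option in the list above), run in an in-memory SQLite database from Python. The table names product, product_type_a_detail and product_type_b_detail, the surrogate product_id, and the sample rows are all hypothetical; the natural-key uniqueness constraints from the question are left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Parent table: common columns plus a type indicator (hypothetical names).
create table product (
    product_id       integer primary key,
    product_type     text not null check (product_type in ('A', 'B')),
    product_code     varchar(50) not null,
    record_timestamp date
);
create table product_type_a_detail (
    product_id      integer primary key references product(product_id),
    choice_of_cover varchar(50) not null,
    plan_type       varchar(50) not null,
    product_tenure  integer
);
create table product_type_b_detail (
    product_id      integer primary key references product(product_id),
    component_code  varchar(50) not null,
    group_indicator varchar(50)
);
""")
conn.execute("insert into product values (1, 'A', 'P100', '2020-01-01')")
conn.execute("insert into product_type_a_detail values (1, 'FULL', 'GOLD', 12)")
conn.execute("insert into product values (2, 'B', 'P100', '2020-01-02')")
conn.execute("insert into product_type_b_detail values (2, 'C7', null)")

# A search on common columns touches only the parent table.
common = conn.execute(
    "select product_id, product_type from product where product_code = ? order by product_id",
    ("P100",),
).fetchall()

# Subtype-specific columns need one join.
type_a = conn.execute(
    """select p.product_code, a.plan_type
       from product p join product_type_a_detail a using (product_id)"""
).fetchall()
```

The trade-off is visible in the two queries: filters on shared columns stay simple, while anything subtype-specific costs a join.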

Storing single form table questions in 1 or multiple tables

I have been coding ASP.NET forms inside web applications for a long time now. Generally most web apps have a user that logs in, picks a form to fill out and answers questions so your table looks like this
Table: tblInspectionForm
Fields:
inspectionformid (either autoint or guid)
userid (user ID who entered it)
datestamp (added, modified, whatever)
Question1Answer: boolean (maybe a yes/no)
Question2Answer: int (maybe foreign key for sub table 1 with dropdown values)
Question3Answer: int (foreign key for sub table 2 with dropdown values)
If I'm not mistaken it meets both 2nd and 3rd normal forms. You're not storing user names in the tables, just the ID's. You aren't storing the dropdown or "yes/no" values in Q-3, just ID's of other tables.
However, IF all the questions are exactly the same data type (assume there's no Q1 or Q1 is also an int), and link to the exact same foreign key (e.g. a form that has 20 questions, all on a 1-10 scale or with the same answers to choose from), would it be better to do something like this?
so .. Table: tblInspectionForm
userid (user ID who entered it)
datestamp (added, modified, whatever)
... and that's it for table 1 .. then
Table2: tblInspectionAnswers
inspectionformid (composite key that links back to table1 record)
userid (composite key that links back to table1 record)
datestamp (composite key that links back to table1 record)
QuestionIDNumber: int (question 1, question 2, question3)
QuestionAnswer: int (foreign key)
This wouldn't just apply to forms that only have the same types of answers for a single form. Maybe your form has 10 of these 1-10 ratings (int), 10 boolean-valued questions, and then 10 freeform.. You could break it into three tables.
The disadvantage would be that when you save the form, you're making 1 call for every question on your form. The upside is, if you have a lot of nightly integrations or replications that pull your data, if you decide to add a new question, you don't have to manually modify any replications to reporting data sources or anything else that's been designed to read/query your form data. If you originally had 20 questions and you deploy a change to your app that adds a 21st, it will automatically get pulled into any outside replications, data sources, reporting that queries this data. Another advantage is that if you have a REALLY LONG (this happens a lot maybe in the real estate industry when you have inspection forms with 100's of questions that go beyond the 8k limit for a table row) you won't end up running into problems.
Would this kind of scenario ever be the preferred way of saving form data?
As a rule of thumb, whenever you see a set of columns with numbers in their names, you know the database is poorly designed.
What you want to do in most cases is have a table for the form / questionnaire, a table for the questions, a table for the potential answers (for multiple-choice questions), and a table for answers that the user chooses.
You might also need a table for question type (i.e free-text, multiple-choice, yes/no).
Basically, the schema should look like this:
create table Forms
(
id int identity(1,1) not null primary key,
name varchar(100) not null, -- with a unique index
-- other form related fields here
)
create table QuestionTypes
(
id int identity(1,1) not null primary key,
name varchar(100) not null -- with a unique index
)
create table Questions
(
id int identity(1,1) not null primary key,
form_id int not null foreign key references Forms(id),
type_id int not null foreign key references QuestionTypes(id),
content varchar(1000)
)
create table Answers
(
id int identity(1,1) not null primary key,
question_id int not null foreign key references Questions(id),
content varchar(1000)
-- For quizzes, uncomment the next row:
-- isCorrect bit not null
)
create table Results
(
id int identity(1,1) not null primary key,
form_id int not null foreign key references Forms(id)
-- in case only registered users can fill the form, uncomment the next row
--user_id int not null foreign key references Users(id),
)
create table UserAnswers
(
result_id int not null foreign key references Results(id),
question_id int not null foreign key references Questions(id),
answer_id int not null foreign key references Answers(id),
content varchar(1000) null -- for free text questions
)
This design will require a few joins when generating the forms (and if you have multiple forms per application, you just add an application table that the form can reference), and a few joins to get the results, but it's the best dynamic forms database design I know.
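To show what "a few joins" looks like in practice, here is a minimal sketch using a simplified subset of the schema above (QuestionTypes, Answers and the answer_id column are omitted), run in an in-memory SQLite database from Python. The sample form, the composite primary key on UserAnswers, and the helper name answers_for are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Forms     (id integer primary key, name varchar(100) not null unique);
create table Questions (id integer primary key,
                        form_id integer not null references Forms(id),
                        content varchar(1000));
create table Results   (id integer primary key,
                        form_id integer not null references Forms(id));
create table UserAnswers (
    result_id   integer not null references Results(id),
    question_id integer not null references Questions(id),
    content     varchar(1000),
    primary key (result_id, question_id)
);
""")
conn.execute("insert into Forms values (1, 'Inspection')")
conn.executemany("insert into Questions values (?, 1, ?)",
                 [(1, "Roof intact?"), (2, "Rating 1-10")])
conn.execute("insert into Results values (1, 1)")
conn.executemany("insert into UserAnswers values (1, ?, ?)", [(1, "yes"), (2, "7")])

def answers_for(result_id):
    """Return (question text, answer text) pairs for one submitted form."""
    return conn.execute(
        """select q.content, ua.content
           from UserAnswers ua
           join Questions q on q.id = ua.question_id
           where ua.result_id = ?
           order by q.id""",
        (result_id,),
    ).fetchall()

before = answers_for(1)

# Adding a new question is a plain insert; no table definition changes.
conn.execute("insert into Questions values (3, 1, 'Comments')")
conn.execute("insert into UserAnswers values (1, 3, 'none')")
after = answers_for(1)
```

The same query returns two rows before and three rows after the new question is added, which is the property the asker wanted for downstream reports and replications.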
I'm not sure whether it's "preferred" but I have certainly seen that format used commercially.
You could potentially make the secondary table more flexible with multiple answer columns (answer_int, answer_varchar, answer_datetime), and assign a question value that you can relate to get the answer from the right column.
So if q_value = 2 you know to look in answer_varchar, whereas q_value = 1 you know is an int and requires a lookup (the name of which could also be specified with the question and stored in a column).
I use an application at the moment which splits answers into combobox, textfield, numeric, date etc in this fashion. The application actually uses a JSON form which splits out the data as it saves into the separate columns. It's a bit flawed as it saves JSON into these columns but the principle can work.
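A rough sketch of the typed-answer-column idea, run in an in-memory SQLite database from Python. The table names, the code values (1 for int, 2 for varchar) and the helper read_answer are assumptions; the answer above only names the columns answer_int and answer_varchar and the q_value discriminator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- q_value says which answer column holds the data: 1 = int, 2 = varchar (assumed codes).
create table questions (question_id integer primary key, q_value integer not null);
create table answers (
    question_id    integer not null references questions(question_id),
    answer_int     integer,
    answer_varchar varchar(1000)
);
""")
conn.executemany("insert into questions values (?, ?)", [(1, 1), (2, 2)])
conn.execute("insert into answers values (1, 7, null)")
conn.execute("insert into answers values (2, null, 'leaky tap')")

def read_answer(question_id):
    """Pick the answer from the column that q_value points at."""
    row = conn.execute(
        """select case q.q_value
                    when 1 then a.answer_int
                    when 2 then a.answer_varchar
                  end
           from answers a
           join questions q on q.question_id = a.question_id
           where a.question_id = ?""",
        (question_id,),
    ).fetchone()
    return row[0]

rating = read_answer(1)  # 7
note = read_answer(2)    # 'leaky tap'
```

Each answer keeps its native type, at the cost of one mostly-NULL column per supported type.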
You could go with a single identity field for the parent table key that the child table would reference.

Generic Database table design

Just trying to figure out the best way to design my table for the following scenario:
I have several areas in my system (documents, projects, groups and clients) and each of these can have comments logged against them.
My question is should I have one table like this:
CommentID
DocumentID
ProjectID
GroupID
ClientID
etc
Where only one of the ids will have data and the rest will be NULL or should I have a separate CommentType table and have my comments table like this:
CommentID
CommentTypeID
ResourceID (this being the id of the project/doc/client)
etc
My thoughts are that option 2 would be more efficient from an indexing point of view. Is this correct?
Option 2 is not a good solution for a relational database. It's called polymorphic associations (as mentioned by @Daniel Vassallo) and it breaks the fundamental definition of a relation.
For example, suppose you have a ResourceId of 1234 on two different rows. Do these represent the same resource? It depends on whether the CommentTypeId is the same on these two rows. This violates the concept of a type in a relation. See SQL and Relational Theory by C. J. Date for more details.
Another clue that it's a broken design is that you can't declare a foreign key constraint for ResourceId, because it could point to any of several tables. If you try to enforce referential integrity using triggers or something, you find yourself rewriting the trigger every time you add a new type of commentable resource.
I would solve this with the solution that @mdma briefly mentions (but then ignores):
CREATE TABLE Commentable (
ResourceId INT NOT NULL IDENTITY,
ResourceType INT NOT NULL,
PRIMARY KEY (ResourceId, ResourceType)
);
CREATE TABLE Documents (
ResourceId INT NOT NULL,
ResourceType INT NOT NULL CHECK (ResourceType = 1),
FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
);
CREATE TABLE Projects (
ResourceId INT NOT NULL,
ResourceType INT NOT NULL CHECK (ResourceType = 2),
FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
);
Now each resource type has its own table, but the serial primary key is allocated uniquely by Commentable. A given primary key value can be used only by one resource type.
CREATE TABLE Comments (
CommentId INT IDENTITY PRIMARY KEY,
ResourceId INT NOT NULL,
ResourceType INT NOT NULL,
FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
);
Now Comments reference Commentable resources, with referential integrity enforced. A given comment can reference only one resource type. There's no possibility of anomalies or conflicting resource ids.
I cover more about polymorphic associations in my presentation Practical Object-Oriented Models in SQL and my book SQL Antipatterns.
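The effect of the composite foreign keys can be checked with a small sketch in an in-memory SQLite database from Python. Explicit ids replace the IDENTITY column, only the Documents subtype is shown, and the helper name rejected is made up; these are assumptions for illustration, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
create table Commentable (
    ResourceId   integer not null,
    ResourceType integer not null,
    primary key (ResourceId, ResourceType)
);
create table Documents (
    ResourceId   integer not null,
    ResourceType integer not null check (ResourceType = 1),
    foreign key (ResourceId, ResourceType) references Commentable (ResourceId, ResourceType)
);
create table Comments (
    CommentId    integer primary key,
    ResourceId   integer not null,
    ResourceType integer not null,
    foreign key (ResourceId, ResourceType) references Commentable (ResourceId, ResourceType)
);
""")
conn.execute("insert into Commentable values (1, 1)")
conn.execute("insert into Documents values (1, 1)")
conn.execute("insert into Comments values (100, 1, 1)")  # valid: resource (1, 1) exists

def rejected(sql):
    """Return True if the statement violates a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# A comment on a resource that does not exist is refused by the foreign key.
orphan_comment_rejected = rejected("insert into Comments values (101, 999, 1)")
# A document row claiming another resource type is refused by the CHECK constraint.
wrong_type_rejected = rejected("insert into Documents values (1, 2)")
```

Both bad inserts fail at the database level, without triggers or application code.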
Read up on database normalization.
Nulls in the way you describe would be a big indication that the database isn't designed properly.
You need to split up all your tables so that the data held in them is fully normalized, this will save you a lot of time further down the line guaranteed, and it's a lot better practice to get into the habit of.
From a foreign key perspective, the first example is better because you can have multiple foreign key constraints on a column but the data has to exist in all those references. It's also more flexible if the business rules change.
To continue from @OMG Ponies' answer, what you describe in the second example is called a Polymorphic Association, where the foreign key ResourceID may reference rows in more than one table. However, in SQL databases, a foreign key constraint can only reference exactly one table. The database cannot enforce the foreign key according to the value in CommentTypeID.
You may be interested in checking out the following Stack Overflow post for one solution to tackle this problem:
MySQL - Conditional Foreign Key Constraints
The first approach is not great, since it is quite denormalized. Each time you add a new entity type, you need to update the table. You may be better off making this an attribute of document - I.e. store the comment inline in the document table.
For the ResourceID approach to work with referential integrity, you will need to have a Resource table, and a ResourceID foreign key in all of your Document, Project etc.. entities (or use a mapping table.) Making "ResourceID" a jack-of-all-trades, that can be a documentID, projectID etc.. is not a good solution since it cannot be used for sensible indexing or foreign key constraint.
To normalize, you need to split the comment table into one table per resource type.
Comment
-------
CommentID
CommentText
...etc
DocumentComment
---------------
DocumentID
CommentID
ProjectComment
--------------
ProjectID
CommentID
If only one comment is allowed, then you add a unique constraint on the foreign key for the entity (DocumentID, ProjectID etc.) This ensures that there can only be one row for the given item and so only one comment. You can also ensure that comments are not shared by using a unique constraint on CommentID.
EDIT: Interestingly, this is almost parallel to the normalized implementation of ResourceID - replace "Comment" in the table name, with "Resource" and change "CommentID" to "ResourceID" and you have the structure needed to associate a ResourceID with each resource. You can then use a single table "ResourceComment".
If there are going to be other entities that are associated with any type of resource (e.g. audit details, access rights, etc..), then using the resource mapping tables is the way to go, since it will allow you to add normalized comments and any other resource related entities.
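A minimal sketch of one such link table, DocumentComment, with the "comments are not shared" rule expressed as a unique constraint on CommentID. It runs in an in-memory SQLite database from Python; the Document table, the composite primary key and the sample rows are assumptions added so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
create table Document (DocumentID integer primary key);
create table Comment  (CommentID integer primary key, CommentText text not null);
create table DocumentComment (
    DocumentID integer not null references Document(DocumentID),
    CommentID  integer not null unique references Comment(CommentID),  -- a comment is not shared
    primary key (DocumentID, CommentID)
);
""")
conn.executemany("insert into Document values (?)", [(1,), (2,)])
conn.executemany("insert into Comment values (?, ?)", [(10, "first"), (11, "second")])
conn.executemany("insert into DocumentComment values (?, ?)", [(1, 10), (1, 11)])

comments_on_doc_1 = [text for (text,) in conn.execute(
    """select c.CommentText
       from DocumentComment dc
       join Comment c on c.CommentID = dc.CommentID
       where dc.DocumentID = 1
       order by c.CommentID""")]

try:
    # Comment 10 already belongs to document 1, so attaching it to document 2 fails.
    conn.execute("insert into DocumentComment values (2, 10)")
    shared_comment_rejected = False
except sqlite3.IntegrityError:
    shared_comment_rejected = True
```

A ProjectComment table would follow the same pattern, and both foreign keys in each link table can be declared and enforced normally.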
I wouldn't go with either of those solutions. Depending on some of the specifics of your requirements you could go with a super-type table:
CREATE TABLE Commentable_Items (
commentable_item_id INT NOT NULL,
CONSTRAINT PK_Commentable_Items PRIMARY KEY CLUSTERED (commentable_item_id))
GO
CREATE TABLE Projects (
commentable_item_id INT NOT NULL,
... (other project columns)
CONSTRAINT PK_Projects PRIMARY KEY CLUSTERED (commentable_item_id))
GO
CREATE TABLE Documents (
commentable_item_id INT NOT NULL,
... (other document columns)
CONSTRAINT PK_Documents PRIMARY KEY CLUSTERED (commentable_item_id))
GO
If each item can only have one comment and comments are not shared (i.e. a comment can only belong to one entity) then you could just put the comments in the Commentable_Items table. Otherwise you could link the comments off of that table with a foreign key.
I don't like this approach very much in your specific case though, because "having comments" isn't enough to put items together like that in my mind.
I would probably go with separate Comments tables (assuming that you can have multiple comments per item - otherwise just put them in your base tables). If a comment can be shared between multiple entity types (i.e., a document and a project can share the same comment) then have a central Comments table and multiple entity-comment relationship tables:
CREATE TABLE Comments (
comment_id INT NOT NULL,
comment_text NVARCHAR(MAX) NOT NULL,
CONSTRAINT PK_Comments PRIMARY KEY CLUSTERED (comment_id))
GO
CREATE TABLE Document_Comments (
document_id INT NOT NULL,
comment_id INT NOT NULL,
CONSTRAINT PK_Document_Comments PRIMARY KEY CLUSTERED (document_id, comment_id))
GO
CREATE TABLE Project_Comments (
project_id INT NOT NULL,
comment_id INT NOT NULL,
CONSTRAINT PK_Project_Comments PRIMARY KEY CLUSTERED (project_id, comment_id))
GO
If you want to constrain comments to a single document (for example) then you could add a unique index (or change the primary key) on the comment_id within that linking table.
It's all of these "little" decisions that will affect the specific PKs and FKs. I like this approach because each table is clear on what it is. In databases that's usually better than having "generic" tables/solutions.
Of the options you give, I would go for number 2.
Option 2 is a good way to go. The issue that I see with it is that you are putting the resource key on that table. The IDs from the different resources could be duplicated. When you join resources to the comments, you will more than likely come up with comments that do not belong to that particular resource. This would be considered a many-to-many join. I would think a better option would be to have your resource tables, the comments table, and then tables that cross-reference the resource type and the comments table.
If you carry the same sort of data about all comments regardless of what they are comments about, I'd vote against creating multiple comment tables. Maybe a comment is just "thing it's about" and text, but if you don't have other data now, it's likely you will: date the comment was entered, user id of person who made it, etc. With multiple tables, you have to repeat all these column definitions for each table.
As noted, using a single reference field means that you cannot put a foreign key constraint on it. This is too bad, but it doesn't break anything; it just means you have to do the validation with a trigger or in code. More seriously, joins get difficult. You can't just say "from comment join document using (documentid)". You need a complex join based on the value of the type field.
So while the multiple pointer fields is ugly, I tend to think that's the right way to go. I know some db people say there should never be a null field in a table, that you should always break it off into another table to prevent that from happening, but I fail to see any real advantage to following this rule.
Personally I'd be open to hearing further discussion on pros and cons.
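To feed that discussion, here is a hedged sketch of the "complex join based on the value of the type field" that the single reference column requires, run in an in-memory SQLite database from Python. The table and column names, the type codes (1 for document, 2 for project) and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table document (documentid integer primary key, title text);
create table project  (projectid  integer primary key, name  text);
-- Assumed type codes: 1 = document, 2 = project.
create table comment (
    commentid     integer primary key,
    commenttypeid integer not null,
    resourceid    integer not null,
    body          text not null
);
""")
conn.execute("insert into document values (1, 'Spec')")
conn.execute("insert into project values (1, 'Website')")  # same id as the document
conn.executemany("insert into comment values (?, ?, ?, ?)",
                 [(1, 1, 1, "on the spec"), (2, 2, 1, "on the website")])

# Every join has to repeat the type test, or ids from different tables get mixed up.
labelled = conn.execute(
    """select c.body, coalesce(d.title, p.name) as target
       from comment c
       left join document d on c.commenttypeid = 1 and d.documentid = c.resourceid
       left join project  p on c.commenttypeid = 2 and p.projectid  = c.resourceid
       order by c.commentid"""
).fetchall()

# Forgetting the type test silently attaches the project comment to the document.
wrong = conn.execute(
    "select count(*) from comment c join document d on d.documentid = c.resourceid"
).fetchone()[0]
```

The careful join labels each comment correctly, while the naive join reports two comments on the document even though only one belongs to it.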
Pawnshop Application:
I have separate tables for Loan, Purchase, Inventory & Sales transactions.
Each table's rows are joined to their respective customer rows by:
customer.pk [serial] = loan.fk [integer];
= purchase.fk [integer];
= inventory.fk [integer];
= sale.fk [integer];
I have consolidated the four tables into one table called "transaction", where a column:
transaction.trx_type char(1) {L=Loan, P=Purchase, I=Inventory, S=Sale}
Scenario:
A customer initially pawns merchandise, makes a couple of interest payments, then decides he wants to sell the merchandise to the pawnshop, who then places merchandise in Inventory and eventually sells it to another customer.
I designed a generic transaction table where for example:
transaction.main_amount DECIMAL(7,2)
in a loan transaction holds the pawn amount,
in a purchase holds the purchase price,
in inventory and sale holds sale price.
This is clearly a denormalized design, but it has made programming a lot easier and improved performance. Any type of transaction can now be performed from within one screen, without the need to change to different tables.

T-SQL Tag Database Architecture Design?

Scenario
I am building a database that contains a series of different tables. These consist of a COMMENTS table, a BLOGS table & an ARTICLES table. I want to be able to add new items to each table, and tag them with between 0 and 5 tags to help the user search for particular information that is relevant more easily.
Initial thoughts for architecture
My first thoughts were to have a centralised table of TAGS. This table would list all of the available tags using a TagID field & a TagName field. Since each item can have many tags and each tag can have many items, I would need a MANY-TO-MANY relationship between each item table and the TAGS table.
For Example:
Many COMMENTS can have many TAGS.
Many TAGS can have many COMMENTS.
Many ARTICLES can have many TAGS.
Many TAGS can have many ARTICLES.
etc.....
Current Understanding
From previous experience I understand that a way of implementing this structure in T-SQL is to have an adjoining table between the COMMENTS table and the TAGS table. This adjoining table would contain the CommentID & the TagID, as well as its own unique CommentTagID. This structure would also apply to all other items.
Questions
Firstly is this the right way to go about implementing such a database architecture? If not, what other methods would be feasible? Since the database will eventually contain a lot of information, I need to ensure that it is scalable. Is this a scalable implementation?
If I had lots of these tables would this architecture make CRUD operations very slow?
Should I use GUIDs or Incrementing INTs for the ID fields?
Help & suggestions would be appreciated greatly.
Thank you.
You may also want to look at WordPress schema and database description to see how others are solving a similar problem.
Keeping a centralized table of tags is a good idea if you will ever need to do one of the following:
Build a complete list of all tags (that is mixing blog tags, comment tags and article tags)
Update the tags so that they get updated everywhere: when you change sqlserver to sql-server, it gets changed everywhere: in blogs, articles and comments.
Option 1 is very useful to build the tag clouds so I'd recommend to build a table of tags and reference it from your tables.
If you won't ever need to update the tags as described in the option 2, you don't ever need surrogate key for them.
You will most probably need a UNIQUE constraint on them anyway and there is no point not to make it a PRIMARY KEY, if you are not going to update them.
This will also save you lots of joins: you don't need to join with the tags table to show the tags.
GUIDs are simpler to manage, but they make the indexes and link tables quite large in size.
You can assign a numerical identifier to each table and link like this:
tTag (tag VARCHAR(30) NOT NULL PRIMARY KEY)
tTaggable (type INT NOT NULL, id INT NOT NULL, PRIMARY KEY (type, id))
tTagLink (
tag VARCHAR(30) NOT NULL FOREIGN KEY REFERENCES tTag,
type INT NOT NULL, id INT NOT NULL,
PRIMARY KEY (tag, type, id),
FOREIGN KEY (type, id) REFERENCES tTaggable
)
tBlog (
id INT NOT NULL PRIMARY KEY,
type INT NOT NULL, CHECK(type = 1),
FOREIGN KEY (type, id) REFERENCES tTaggable,
…)
tArticle (
id INT NOT NULL PRIMARY KEY,
blog INT NOT NULL FOREIGN KEY REFERENCES tBlog,
type INT NOT NULL, CHECK(type = 2),
FOREIGN KEY (type, id) REFERENCES tTaggable,
…)
tComment (
id INT NOT NULL PRIMARY KEY,
article INT NOT NULL FOREIGN KEY REFERENCES tArticle,
type INT NOT NULL, CHECK(type = 3),
FOREIGN KEY (type, id) REFERENCES tTaggable,
…)
Note that if you want to delete a blog, an article or a comment, you should delete from tTaggable as well.
This way, tTaggable is only used to ensure the referential integrity. To query all tags for an article, you just issue this query:
SELECT tag
FROM tTagLink
WHERE type = 2
AND id = 1234567
This way you get all tags by querying a single table, without any joins.
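The single-table lookup can be tried out with a short sketch in an in-memory SQLite database from Python. Only tTag, tTaggable and tTagLink are created, foreign-key targets are spelled out explicitly, and the sample tags are made up; these are assumptions for a self-contained example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
create table tTag (tag varchar(30) not null primary key);
create table tTaggable (type integer not null, id integer not null, primary key (type, id));
create table tTagLink (
    tag  varchar(30) not null references tTag (tag),
    type integer not null,
    id   integer not null,
    primary key (tag, type, id),
    foreign key (type, id) references tTaggable (type, id)
);
""")
conn.executemany("insert into tTag values (?)", [("sql",), ("design",)])
conn.execute("insert into tTaggable values (2, 1234567)")  # type 2 = article
conn.executemany("insert into tTagLink values (?, 2, 1234567)", [("sql",), ("design",)])

# The tag text is the key, so listing an article's tags needs no join.
article_tags = [tag for (tag,) in conn.execute(
    "select tag from tTagLink where type = 2 and id = 1234567 order by tag")]

try:
    conn.execute("insert into tTagLink values ('typo', 2, 1234567)")  # tag not in tTag
    unknown_tag_rejected = False
except sqlite3.IntegrityError:
    unknown_tag_rejected = True
```

Because the link table stores the tag text itself, the central tTag table still guards against unknown tags while reads stay join-free.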
A many-to-many relationship is usually implemented exactly as you describe it.
Auto-incrementing IDs are a good idea since they are guaranteed to be unique.
And you can use GUIDs if you want to tag comments and articles with the same tag (instead of 6 tables you need just 5). But searching with GUIDs may be slower.