schema for big database - sql

I was trying to solve this issue recently, but I don't really know how.
I have an application that allows users to register and create profiles (as many as they want).
For every profile they can create campaigns (as many as they want per profile), and for each campaign they can add links (the number of links is limited, but the limit is large). Each link can have its own keywords (more than one keyword).
The obvious thing that came to my mind was to have a table for users, one for profiles, one for campaigns, one for links and one for keywords. But consider this: some users may use the same keywords, and I don't want to repeat that information across the database n times. I don't know if this is possible in MySQL, but I would like to have a field in the links table that refers to the ids of the keywords in the keywords table, something like an array of ids. I would like this implementation to be flexible, allowing me to easily retrieve the keywords, update the "array of keywords" and perform certain computations (count the number of keywords, for example). Can you recommend a possible solution on how to implement this?
Just to state again: I'm using MySQL and PHP.
Thank you.

From that description I thought of these tables:
user (id, ...)
campaigns (id, user_id, ...)
links (id, campaign_id, link)
keywords (link_id, keyword)

You should create a table to store the keywords i.e.
id (int)
keyword (varchar)
And store an association table for links -> keywords i.e.
link_id (int)
keyword_id (int)
Hope this helps!
Christian
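As a rough illustration of the keywords table plus association table described above, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name link_keywords, the helper names, and the sample rows are assumptions made for the example, not part of the question.

```python
import sqlite3

# SQLite stands in for MySQL here; table and column names are illustrative.
SCHEMA = """
CREATE TABLE links (
    id          INTEGER PRIMARY KEY,
    campaign_id INTEGER NOT NULL,
    link        TEXT NOT NULL
);
CREATE TABLE keywords (
    id      INTEGER PRIMARY KEY,
    keyword TEXT NOT NULL UNIQUE
);
-- One row per (link, keyword) pair replaces the "array of ids".
CREATE TABLE link_keywords (
    link_id    INTEGER NOT NULL REFERENCES links(id),
    keyword_id INTEGER NOT NULL REFERENCES keywords(id),
    PRIMARY KEY (link_id, keyword_id)
);
"""


def make_db():
    """Create an in-memory database with the three tables."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def keywords_for_link(db, link_id):
    """Return the keywords attached to one link, sorted alphabetically."""
    rows = db.execute(
        "SELECT k.keyword FROM keywords k "
        "JOIN link_keywords lk ON lk.keyword_id = k.id "
        "WHERE lk.link_id = ? ORDER BY k.keyword",
        (link_id,),
    )
    return [r[0] for r in rows]


def keyword_count(db, link_id):
    """Count the keywords of one link without loading them."""
    return db.execute(
        "SELECT COUNT(*) FROM link_keywords WHERE link_id = ?", (link_id,)
    ).fetchone()[0]


conn = make_db()
conn.executemany(
    "INSERT INTO links (id, campaign_id, link) VALUES (?, ?, ?)",
    [(1, 1, "http://example.com/a"), (2, 1, "http://example.com/b")],
)
conn.executemany(
    "INSERT INTO keywords (id, keyword) VALUES (?, ?)",
    [(1, "shells"), (2, "driftwood")],
)
# The keyword "shells" is stored once and shared by both links.
conn.executemany(
    "INSERT INTO link_keywords (link_id, keyword_id) VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 1)],
)
print(keywords_for_link(conn, 1), keyword_count(conn, 2))
```

Updating the "array" then becomes inserting or deleting rows in link_keywords, and counting is a plain COUNT(*).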

I would argue that your initial design, with a table for each of the entities, is correct. If you store keywords in a separate table and associate them with a link_id or something like that, then you can look up links with common keywords much faster than with an array containing all the keywords for each link.

I would argue that although it's possible that a user may choose the same keyword for different links, this does not make them semantically the same.
If I have a campaign for driftwood and flotsam and use "shells" as a keyword on a link, this is not the same "shells" that I would use as a keyword on the Unix utilities campaign.
Stick with your original clean and logical schema and don't complicate it by solving imaginary problems.

You need to have a many to many table which stores the ID of the link, the id of the user and the id of the keyword (I'm assuming all links have keywords)
Then you can accomplish what you're talking about just through normal database operations.

There are two ideas:
1) Have each user have their own set of keywords.
Have a user table and keep keywords in another table with userID as a FK.
When a user needs to add a link, no matter which profile/campaign they are in, you display the keywords for that user.
The link would still link to a keywordID via a join table that holds keywordID and LinkID.
2) Global keywords.
Have keywords hold just keywordID and keyword.
There would be a join table to hold keywordID and LinkID, allowing a link to have multiple keywords.
The front end would then have to make sure users search existing keywords before adding new ones, which helps prevent duplicates. The process that adds a keyword should also check for an existing value before adding.
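The "check for an existing value before adding" step for global keywords could be sketched roughly as follows. This again uses Python's sqlite3 as a stand-in for MySQL; the helper name get_or_create_keyword and the choice to lower-case and trim keywords are assumptions for illustration only.

```python
import sqlite3


def get_or_create_keyword(db, keyword):
    """Return the id of `keyword`, inserting it only if it is not stored yet."""
    # Assumption: keywords are compared case-insensitively and trimmed.
    normalized = keyword.strip().lower()
    # The UNIQUE constraint does the duplicate check. INSERT OR IGNORE is
    # SQLite syntax; MySQL spells the same idea INSERT IGNORE.
    db.execute(
        "INSERT OR IGNORE INTO keywords (keyword) VALUES (?)", (normalized,)
    )
    row = db.execute(
        "SELECT id FROM keywords WHERE keyword = ?", (normalized,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE keywords (id INTEGER PRIMARY KEY, keyword TEXT NOT NULL UNIQUE)"
)
first = get_or_create_keyword(conn, "Shells")
second = get_or_create_keyword(conn, " shells ")  # same keyword, same id
total = conn.execute("SELECT COUNT(*) FROM keywords").fetchone()[0]
print(first, second, total)
```

Letting the unique constraint reject duplicates keeps the check correct even if two requests try to add the same keyword at nearly the same time.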

Related

Is a two table solution a performant and scalable solution to implement tagging in Postgres 9.5?

Background
I work for a real estate technology company. An upcoming project involves building out functionality to allow users to affix tags/labels (plural) to a MLS listing (real estate property). The second requirement is to allow a user to search by one or more tags. We won't be dealing with keeping track of counts or building word clouds or anything like that.
Solutions Researched
I found this SO Q&A and think the solution is pretty straightforward and have attempted to adapt some ideas from it below. Also, I understand that JSONB support is much better in 9.5 and it may be a possibility. If you have any insight here I'd love to hear your thoughts as well in an answer.
Attempted Solution
Table: Tags
Columns: ID, OwnerID, TagName, CreatedDate
Table: TaggedItems
Columns: ID, TagID (references above), PropertyID, CreatedDate, (Possibly some denormalized data to assist with presenting search results; property name, original listor, etc.)
Inserting new tags should be straightforward. Searching tags should also be straightforward since the user will select one or multiple tags from a searchable dropdown, thus affording me access to the actual TagID which I can use to query the TaggedItems table. When showing the full profile view for a listing, I can use its PropertyID and the UserID to query my tables for the existence of one or more Tags to display in the view.
Edit: It's probably worth noting that we don't keep an entire database of properties, we access them via an API partner; hence the two table solution and not 3.
If you want to Nth normalize you would actually use 3 tables.
1 Property/Listing
2 Tags
3 CrossReferenceBetween the Two
The 3rd table creates a many to many relationship between the other 2 tables.
In this case only the 3rd table would carry both the tag ID and the property ID.
Going with 2 tables is fine too, depending on how heavily it is used, as a small string won't bloat your database too much.
I would say that it is strongly preferable to move the tags into a separate table when you need to do lookups and more on them. Otherwise you have to keep a delimited list, and then what happens if a user injects a delimiter into their tag value? Also, how do you plan on searching the delimited list? You will constantly have to expand it into a table or use regex, and the regex might give you false positives, since "some" will match both "some" and "something" depending on how you write your code.
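To make the "search by one or more tags" requirement concrete, here is a small sketch against the two-table layout from the question. It runs on Python's sqlite3 rather than Postgres 9.5, so placeholder syntax and types would differ there; the helper name, the unique constraints, and the sample MLS identifiers are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Tags (
    ID          INTEGER PRIMARY KEY,
    OwnerID     INTEGER NOT NULL,
    TagName     TEXT NOT NULL,
    CreatedDate TEXT,
    UNIQUE (OwnerID, TagName)
);
-- PropertyID comes from the API partner, so there is no local property table.
CREATE TABLE TaggedItems (
    ID          INTEGER PRIMARY KEY,
    TagID       INTEGER NOT NULL REFERENCES Tags(ID),
    PropertyID  TEXT NOT NULL,
    CreatedDate TEXT,
    UNIQUE (TagID, PropertyID)
);
INSERT INTO Tags (ID, OwnerID, TagName) VALUES
    (1, 7, 'waterfront'), (2, 7, 'fixer-upper');
INSERT INTO TaggedItems (TagID, PropertyID) VALUES
    (1, 'MLS-100'), (2, 'MLS-100'), (1, 'MLS-200');
""")


def properties_with_all_tags(db, tag_ids):
    """Return the PropertyIDs that carry every one of the given tag ids."""
    unique_ids = sorted(set(tag_ids))
    if not unique_ids:
        return []
    placeholders = ",".join("?" * len(unique_ids))
    sql = (
        "SELECT PropertyID FROM TaggedItems "
        f"WHERE TagID IN ({placeholders}) "
        "GROUP BY PropertyID "
        # A property qualifies only if it matched each requested tag.
        "HAVING COUNT(DISTINCT TagID) = ? "
        "ORDER BY PropertyID"
    )
    return [r[0] for r in db.execute(sql, (*unique_ids, len(unique_ids)))]


print(properties_with_all_tags(conn, [1, 2]))
print(properties_with_all_tags(conn, [1]))
```

Dropping the HAVING clause turns this into an "any of these tags" search instead of "all of these tags".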

How to get random unread article?

The database table articles contains 10,000 rows (articles).
I want to get a random article, mark it as read and never get it again,
so my next get request should return random article except that one.
There are thousands of users like me and all of them are using this table.
How do I implement this? How do I mark those articles, and how do I search only unread articles?
I tried creating a relatedUsers column in the Articles table, filled with relations to the user objects who have read the article. I was using the search query relatedUsers.objectId!=currentUserId. But when this column gets 2+ relatedUsers, this solution stops working, because the database returns articles that have at least one relatedUser whose objectId isn't equal to currentUserId (meaning all users).
I'm using Backendless.com right now, but I guess the solution should be applicable to any database, including Backendless. If not, then what kind of database should I use for this?
What you can do now is:
Create a table named UserArticle with fields articleId and userId
When you add a user or an article, you should also add to this table relations to all articles or users respectively; this way your UserArticle table will contain the relations between articles and users who haven't yet read those articles
And in order to get a random not read article for user, you:
Retrieve items from UserArticle where userId = yourCurrentUserId
Randomly select one of them and retrieve an article by its ID
Remove the selected article from UserArticle by articleId and userId, meaning that the user has read the article
This approach is the most suitable in your case. It involves a little more than one request, but for now your requirements cannot be fit better.
Also we (the Backendless team) are working on a keyword like contains, using which you would be able to have only a relation to Users table and get not read articles in one request, so the suggested approach is pretty temporary.
Getting a random number depends on the database. But the basic idea is to have a table called UserArticles with one row per user and per article already read (and perhaps other information such as the time).
Then, you can do:
select a.*
from articles a
where not exists (select 1
                  from userarticles ua
                  where a.article_id = ua.article_id and ua.user_id = $user_id
                 )
order by rand()
limit 1;
The order by rand() limit 1 is definitely database-dependent. But, it gives you the idea of how to approach the problem.
Once you have selected the article, then you insert this information into UserArticles:
insert into UserArticles(user_id, article_id)
    values ($user_id, $article_id);
where $article_id refers to the article retrieved in the previous step.
Note: this should be fine with respect to race conditions, unless a single user can have multiple simultaneous connections to the database asking for the same information. Handling that case requires more knowledge about the database being used.
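Put together, the select-then-insert flow above might look like the following sketch. It uses Python's sqlite3, where the random function is spelled RANDOM() rather than rand(); the function name next_unread_article, the composite primary key, and the three sample articles are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (article_id INTEGER PRIMARY KEY, title TEXT);
-- One row per (user, article) pair the user has already read.
CREATE TABLE userarticles (
    user_id    INTEGER NOT NULL,
    article_id INTEGER NOT NULL REFERENCES articles(article_id),
    PRIMARY KEY (user_id, article_id)
);
INSERT INTO articles (article_id, title) VALUES (1, 'a'), (2, 'b'), (3, 'c');
""")


def next_unread_article(db, user_id):
    """Pick a random unread article for the user and mark it as read.

    Returns the article id, or None when the user has read everything.
    """
    row = db.execute(
        """
        SELECT a.article_id
        FROM articles a
        WHERE NOT EXISTS (
            SELECT 1 FROM userarticles ua
            WHERE ua.article_id = a.article_id AND ua.user_id = ?
        )
        ORDER BY RANDOM()  -- SQLite spelling; MySQL uses RAND()
        LIMIT 1
        """,
        (user_id,),
    ).fetchone()
    if row is None:
        return None
    db.execute(
        "INSERT INTO userarticles (user_id, article_id) VALUES (?, ?)",
        (user_id, row[0]),
    )
    db.commit()
    return row[0]


# Four requests against three articles: the fourth finds nothing unread.
seen = [next_unread_article(conn, 42) for _ in range(4)]
print(seen)
```

The primary key on (user_id, article_id) also means a duplicate "read" row is rejected rather than silently stored, which helps with the race condition mentioned above.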

Saving Statistics to sqlite database

I have created an IRC bot for Twitch that comes with a couple of features (song requests, a queue to join games). Currently I have a table in my database looking like this:
CREATE TABLE users(id,points,timespent,follower,followed,wMessage);
It is a very simple table: id is the name of the user; points are a virtual currency you get for watching the stream and spend on some of the features I mentioned above; timespent is the time spent in the channel; follower is whether you are a follower; followed is whether you have followed once before; and wMessage is a welcome message.
I would like to see some statistics on the bot, such as how many people joined the channel in a given year/month/day/hour, how many used the queue feature, and how many used feature y at time x. I can only come up with one way to do this, but I am not sure if it is the best way to do it:
CREATE TABLE queueStats(usedDate DATETIME,timeUsed int);
I guess you could even remove the timeUsed and just make a new row each time the feature is used and then count the rows with a "SELECT - WHERE" query. Is this a smart way to do this? The reason I ask is, I am very new with sql databases so I am not really sure of the standard way to do things(if there is such a thing)
I'd recommend creating a table to record events of interest. You could have a foreign key referencing the user table. Getting summary statistics could then be done using an aggregation query (example).
BTW, I'd recommend explicitly specifying your user id column to be an "integer primary key". See here for why/how. Basically if you don't you could end up with duplicate rows for user IDs, and also if you don't explicitly specify a primary key field, sqlite creates an extra "row id" column for you.
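One possible shape for such an events table and its aggregation query is sketched below with Python's sqlite3. The events table, its kind and occurred_at columns, the separate name column on users, and the helper names are all assumptions for the example; the original users table keeps the user name in id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumption: an integer key plus a separate name column.
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- One row per event of interest; nothing is pre-aggregated.
CREATE TABLE events (
    id          INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL REFERENCES users(id),
    kind        TEXT NOT NULL,   -- e.g. 'join', 'queue', 'songrequest'
    occurred_at TEXT NOT NULL    -- 'YYYY-MM-DD HH:MM:SS'
);
""")


def record_event(db, user_id, kind, occurred_at):
    """Store a single event row."""
    db.execute(
        "INSERT INTO events (user_id, kind, occurred_at) VALUES (?, ?, ?)",
        (user_id, kind, occurred_at),
    )


def counts_per_hour(db, kind):
    """Return (hour, count) pairs for one kind of event."""
    return db.execute(
        "SELECT strftime('%Y-%m-%d %H', occurred_at) AS hour, COUNT(*) "
        "FROM events WHERE kind = ? GROUP BY hour ORDER BY hour",
        (kind,),
    ).fetchall()


conn.execute("INSERT INTO users (id, name) VALUES (1, 'viewer1')")
record_event(conn, 1, "join", "2016-03-01 20:00:00")
record_event(conn, 1, "queue", "2016-03-01 20:05:00")
record_event(conn, 1, "queue", "2016-03-01 20:40:00")
record_event(conn, 1, "queue", "2016-03-01 21:10:00")
print(counts_per_hour(conn, "queue"))
```

Changing the strftime format string (for example to '%Y-%m' or '%Y-%m-%d') gives per-month or per-day counts from the same rows.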

SQL Select help - removing white space

I'm running into a problem when trying to select records from my 2005 MS-SQL database (I'm still very new to SQL, but I learned and use the basic commands from w3schools already). In theory, all my manufacturer records should be unique. At least that is how I intended it to be when I did my first massive data dump into it. Unfortunately, that is not the case and now I need to fix it! Here is my scenario:
Table name = ItemCatalog
Relevant columns = Partnumber,Manufacturer,Category
When I did a SELECT DISTINCT Manufacturer FROM ItemCatalog this little problem is what turned up:
Cables2Go
CablesToGo
Cables To Go
CableToGo Inc
CablesToGo Inc
All 5 of those showed up as distinct, which they are. Can't fault my SELECT statement for returning it, but from my human perspective they are all the same manufacturer! One method I see working is doing an UPDATE command and fixing all the permutations that show up, but I have a LOT of manufacturers and this would be very time consuming.
Is there a way when I punch in a SELECT statement, that I can find all the likely permutations of a manufacturer name (or any field really)? I attempted the LIKE operator, so my statement would read
SELECT Manufacturer FROM ItemCatalog WHERE Manufacturer LIKE '%CablesToGo%'
but that didn't turn out as well as I had hoped. Here's the nasty bit, my other program that I'm putting together absolutely requires that I only ask for a single manufacturer name, not all 5 variations. Maybe I'm talking in circles here, but is there is a simple way in one statement for me to find a similar string?
If you are doing some data mining, you could also try the SOUNDEX and DIFFERENCE functions in SQL Server.
While they are both outdated (they don't handle foreign characters very well), they could yield some interesting results for you:
SELECT * FROM ItemCatalog WHERE SOUNDEX(Manufacturer) = SOUNDEX('Cables To Go');
and
SELECT * FROM ItemCatalog WHERE DIFFERENCE(Manufacturer, 'Cables To Go') >= 3;
The number 3 means likely similar (0 means not similar and 4 means very similar).
There are a number of better SOUNDEX functions available on the internet. See Tek-Tips for an example.
Here is another example at SQL Team.
Standard SQL has a SIMILAR TO predicate, which is a bit more powerful than LIKE.
However, you could use LIKE to good effect with:
Manufacturer LIKE 'Cable%Go%'
This would work in this specific case, finding all the variants listed. However, it would also find 'Cable TV Gorgons' and you probably don't need them included. Your version would also find 'We Hate CablesToGo With Ferocity Inc', which you probably didn't want either.
However, data cleansing is a major problem, and there are companies that make a living out of providing data cleansing. You often end up making a dictionary or thesaurus of terms (company names here) mapping all the variants encountered to the canonical form. The problem is that sometimes you find the same variant spelling is used for two separate canonical forms. For example, a pair of bright sparks might both decide to use 'C2G' as an abbreviation, but one uses it for 'Cables To Go Inc' and the other uses it for 'Computers To Gamers Inc'. You have to use some other information to determine whether a particular instance of 'C2G' means 'Cables' or 'Computers'.
'Cable%Go%' might work for that one case, but if you have other variations for other strings, you'll probably have to do a lot of manual data cleanup.
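The over-matching and under-matching described in these answers can be reproduced with a small sketch. It runs on Python's sqlite3 rather than SQL Server, which is enough to show the wildcard behaviour; the helper name matching and the two extra company names taken from the answer above are used purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ItemCatalog (Partnumber TEXT, Manufacturer TEXT, Category TEXT)"
)
names = [
    "Cables2Go",
    "CablesToGo",
    "Cables To Go",
    "CableToGo Inc",
    "CablesToGo Inc",
    "Cable TV Gorgons",                      # unrelated company
    "We Hate CablesToGo With Ferocity Inc",  # unrelated company
]
conn.executemany(
    "INSERT INTO ItemCatalog (Manufacturer) VALUES (?)", [(n,) for n in names]
)


def matching(db, pattern):
    """Return the distinct manufacturer names that match a LIKE pattern."""
    rows = db.execute(
        "SELECT DISTINCT Manufacturer FROM ItemCatalog "
        "WHERE Manufacturer LIKE ? ORDER BY Manufacturer",
        (pattern,),
    )
    return [r[0] for r in rows]


loose = matching(conn, "Cable%Go%")        # all five variants plus one stranger
contains = matching(conn, "%CablesToGo%")  # misses variants, adds a stranger
print(loose)
print(contains)
```

Neither pattern returns exactly the five intended variants, which is why the answers lean towards cleaning the data instead of tuning the pattern.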
I suggest you use an object-relational mapping tool to map your table into objects and add the filtering logic there.
One option you have is to loosen your wildcard search to something like 'Cables%Go%'. This might be good in the short term, but with this approach you run the risk of matching more manufacturers than you want (i.e., Cables on the Go, etc.).
You could also put together a mapping table, which would put all of the variants of Cables To Go into a single group, which your app can query and normalize for your ItemCatalog query.
Another option you have is to introduce a Manufacturers table. Your ItemCatalog table would then have a foreign key to this table and only allow manufacturers that are in the Manufacturers table. This would require some cleanup of your ItemCatalog table to get it working, assuming that you want all of the variants of Cables to Go to be the same.
I know others are suggesting query fixes - I thought I'd elaborate on my long-term fix for kicks.
You could create another table relating each of the variations to a single manufacturer entity. If I encountered this situation at work (and I have), I would be enticed to fix it.
Create a manufacturers table with a primary key, name, etc.
Create a table with aliases - these will only be needed when you are presented with data that doesn't have the manufacturer's ID (like an import file).
Modify ItemCatalog such that it references the primary key from the manufacturer table (i.e. a ManufacturerID foreign key).
When importing data to ItemCatalog, assign the ManufacturerID foreign key based on matches to the alias table. If you have a name that matches 2+ records then you flag them for manual review or you try to match on more than manufacturer name.
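The alias-table import step could be sketched along these lines. This uses Python's sqlite3 in place of SQL Server; the table names Manufacturer and ManufacturerAlias, the function resolve_manufacturer, and the sample aliases (including the ambiguous 'C2G' from an earlier answer) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Manufacturer (
    ManufacturerID INTEGER PRIMARY KEY,
    Name           TEXT NOT NULL UNIQUE
);
-- Every spelling seen in import files maps to a canonical manufacturer.
CREATE TABLE ManufacturerAlias (
    Alias          TEXT NOT NULL,
    ManufacturerID INTEGER NOT NULL REFERENCES Manufacturer(ManufacturerID),
    PRIMARY KEY (Alias, ManufacturerID)
);
INSERT INTO Manufacturer (ManufacturerID, Name) VALUES
    (1, 'Cables To Go Inc'), (2, 'Computers To Gamers Inc');
INSERT INTO ManufacturerAlias (Alias, ManufacturerID) VALUES
    ('Cables2Go', 1), ('CablesToGo', 1), ('Cables To Go', 1),
    ('CableToGo Inc', 1), ('CablesToGo Inc', 1),
    ('C2G', 1), ('C2G', 2);
""")


def resolve_manufacturer(db, raw_name):
    """Map an imported name to a ManufacturerID.

    Returns (manufacturer_id, needs_review). Unknown names and names that
    match two or more manufacturers are flagged for manual review.
    """
    ids = [
        r[0]
        for r in db.execute(
            "SELECT ManufacturerID FROM ManufacturerAlias WHERE Alias = ?",
            (raw_name.strip(),),
        )
    ]
    if len(ids) == 1:
        return ids[0], False
    return None, True


print(resolve_manufacturer(conn, "Cables2Go"))
print(resolve_manufacturer(conn, "C2G"))
```

Once every ItemCatalog row carries a ManufacturerID, the other program can ask for a single manufacturer by key instead of by spelling.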

What is the best way to add users to multiple groups in a database?

In an application where users can belong to multiple groups, I'm currently storing their groups in a column called groups as a binary. Every four bytes is a 32-bit integer which is the GroupID. However, this means that to enumerate all the users in a group I have to programmatically select all users, and manually find out if they contain that group.
Another method was to use a unicode string, where each character is the integer denoting a group, and this makes searching easy, but is a bit of a fudge.
Another method is to create a separate table, linking users to groups. One column called UserID and another called GroupID.
Which of these ways would be the best to do it? Or is there a better way?
You have a many-to-many relationship between users and groups. This calls for a separate table to combine users with groups:
User: (UserId[PrimaryKey], UserName etc.)
Group: (GroupId[PrimaryKey], GroupName etc.)
UserInGroup: (UserId[ForeignKey], GroupId[ForeignKey])
To find all users in a given group, you just say:
select * from User join UserInGroup on User.UserId = UserInGroup.UserId where UserInGroup.GroupId = <the GroupId you want>
Rule of thumb: If you feel like you need to encode multiple values in the same field, you probably need a foreign key to a separate table. Your tricks with byte-blocks or Unicode chars are just clever tricks to encode multiple values in one field. Database design should not use clever tricks - save that for application code ;-)
I'd definitely go for the separate table - certainly the best relational view of data. If you have indexes on both UserID and GroupID you have a quick way of getting users per group and groups per user.
The more standard, usable and comprehensible way is the join table. It's easily supported by many ORMs, in addition to being reasonably performant for most cases. Only resort to "clever" ways if you have a reason to, say a million users and having to answer that question every half a second.
I would make 3 tables: users, groups, and usersgroups, which is used as a cross-reference table to link users and groups. In the usersgroups table I would add userId and groupId columns and make them the primary key. BTW, what naming conventions are there for those xref tables?
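A minimal sketch of the three-table layout, with a composite primary key and an extra index so that both lookup directions are indexed, might look like this. It uses Python's sqlite3; the index name, helper functions, and sample users and groups are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (UserId INTEGER PRIMARY KEY, UserName TEXT NOT NULL);
-- "groups" is quoted because GROUP/GROUPS are keywords in several SQL dialects.
CREATE TABLE "groups" (GroupId INTEGER PRIMARY KEY, GroupName TEXT NOT NULL);
CREATE TABLE usersgroups (
    UserId  INTEGER NOT NULL REFERENCES users(UserId),
    GroupId INTEGER NOT NULL REFERENCES "groups"(GroupId),
    PRIMARY KEY (UserId, GroupId)      -- serves "groups per user"
);
-- Second index serves "users per group".
CREATE INDEX usersgroups_by_group ON usersgroups (GroupId, UserId);

INSERT INTO users (UserId, UserName) VALUES (1, 'alice'), (2, 'bob');
INSERT INTO "groups" (GroupId, GroupName) VALUES (10, 'admins'), (20, 'editors');
INSERT INTO usersgroups (UserId, GroupId) VALUES (1, 10), (1, 20), (2, 20);
""")


def users_in_group(db, group_id):
    """Return the names of all users belonging to one group."""
    rows = db.execute(
        "SELECT u.UserName FROM users u "
        "JOIN usersgroups ug ON ug.UserId = u.UserId "
        "WHERE ug.GroupId = ? ORDER BY u.UserName",
        (group_id,),
    )
    return [r[0] for r in rows]


def groups_for_user(db, user_id):
    """Return the names of all groups one user belongs to."""
    rows = db.execute(
        'SELECT g.GroupName FROM "groups" g '
        "JOIN usersgroups ug ON ug.GroupId = g.GroupId "
        "WHERE ug.UserId = ? ORDER BY g.GroupName",
        (user_id,),
    )
    return [r[0] for r in rows]


print(users_in_group(conn, 20), groups_for_user(conn, 1))
```

The composite primary key also stops the same user from being added to the same group twice.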
It depends what you're trying to do, but if your database supports it, you might consider using roles. The advantage of this is that the database provides security around roles, and you don't have to create any tables.