Should I create multiple tables, or even databases for multiple users of a CRM - sql

I'm working on creating an application best described as a CRM. There is a relatively complex table structure, and I'm thinking about allowing users to do a fair bit of customization (adding fields and the like). One concern is that I will be reaching a certain level of scale almost immediately. We have about 50,000 individual users who will be coming online within about nine months of launch. So I want to build to last.
I'm thinking about two and maybe even three options.
One table set with a userID column on everything, plus custom attributes handled by one table that defines each custom attribute and another table that holds their values, which can then be joined to the user's existing contact records. -- From what I've read, this seems like the right option, but I keep feeling like it's not. It seems like once these tables start reaching millions of records, searching for just one user's records in every query is going to become a database hog.
For each user account, recreate the table set, prefixed with a unique identifier (the userID, for example). Then rather than using WHERE userID=? everywhere, I can use FROM ?_contacts. For attributes I could then have a custom attributes table where users could add additional columns for custom attributes. -- This feels like the simplest way to go, though of course when I decide to change the database structure there would be a migration from hell.
The third option, which I'm pretty confident is wrong, but which for that reason alone I cannot rule out, is to create a new database for each user with all the requisite tables.
Am I crazy? Is option one really the best?

The first method is the best. Create individual userIds and then you can assign specific roles to them. Database retrieval time does depend on the number of records, but there is a trade-off: you can write efficient SQL queries to fetch the data. According to this site, you probably won't run out of memory or run into concurrency issues, because with a good server the performance ought to be good, provided you write your queries efficiently.
If you recreate the table set for each user, you will just end up creating lots of tables, which can slow down indexing and is bad practice. You are better off sticking with a single relational schema and normalizing the database and its tables to improve efficiency.
Creating a new database for each and every user combines the complexity of both of the above, resulting in shabby, disorganized database access. If you run individual database instances for every single user, you will just end up consuming the server's physical resources (RAM and CPU), which will affect the quality of service for all the other users.
Take up option 1. Assign separate userIds and assign them roles and privileges where needed. That is more efficient than the other two methods.
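As a rough illustration, here is a minimal sketch of what option 1 could look like; all table and column names are hypothetical, not taken from the question.

    -- Shared tables keyed by userID, plus an attribute-definition table and a
    -- value table for per-user customization (hypothetical schema).
    CREATE TABLE contacts (
        contactID INT PRIMARY KEY,
        userID    INT NOT NULL,            -- owning CRM account
        name      VARCHAR(255) NOT NULL
    );

    CREATE TABLE custom_attributes (
        attributeID INT PRIMARY KEY,
        userID      INT NOT NULL,          -- each account defines its own attributes
        label       VARCHAR(100) NOT NULL
    );

    CREATE TABLE custom_attribute_values (
        contactID   INT NOT NULL REFERENCES contacts(contactID),
        attributeID INT NOT NULL REFERENCES custom_attributes(attributeID),
        value       VARCHAR(255),
        PRIMARY KEY (contactID, attributeID)
    );

    -- A composite index keeps per-user lookups cheap even at millions of rows.
    CREATE INDEX ix_contacts_user ON contacts (userID, contactID);

    -- Fetch one user's contacts together with their custom attributes.
    SELECT c.contactID, c.name, a.label, v.value
    FROM contacts c
    LEFT JOIN custom_attribute_values v ON v.contactID = c.contactID
    LEFT JOIN custom_attributes a ON a.attributeID = v.attributeID
    WHERE c.userID = 42;

With the index leading on userID, the WHERE userID = ? filter the question worries about becomes an index seek rather than a scan of the whole table.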

Related

Extending a set of existing tables into a dynamic client defined structure?

We have an old repair database that has a lot of relational tables and it works as it should, but I need to update it to be able to handle different clients (areas) - currently it handles a single client only.
So I need to extend the tables and the SQL statements so that, for example, I can log in as user A and see only my own system, while user B has his own system too.
Is it correctly understood that you wouldn't create new tables for each client, but just add a clientID to every record in every (base) table and then filter by clientID in all SQL statements to support multiple clients?
Is this also something that would work (and how is it done) on hosted solutions? I'm worried about performance if that's an issue - let's say I had 500 clients (I won't, but from a theoretical viewpoint)?
The normal situation is to add a client key to each table where appropriate. Many tables don't need them -- such as reference tables.
This is preferred for many reasons:
You have the data for all clients in one place, so you can readily answer a question such as "what is the average X for each client".
If you change the data structure, then it affects all clients at the same time.
Your backup and restore strategy is only implemented once.
Your optimization is only implemented once.
This is not always the best solution. You might have requirements that specify that data must be separated -- in which case, each client should be in a separate database. However, indexes on the additional keys are probably a minor consideration and you shouldn't worry about it.
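For concreteness, here is a hedged sketch of the client-key approach described above; the table and column names are made up, and the parameter syntax varies by engine.

    -- Add a client key to an existing base table; existing rows get a default client.
    ALTER TABLE repairs ADD clientID INT NOT NULL DEFAULT 1;

    -- A composite index with clientID as the leading column keeps per-client
    -- queries fast even with hundreds of clients.
    CREATE INDEX ix_repairs_client ON repairs (clientID, repairDate);

    -- Every statement then filters on the logged-in client.
    SELECT *
    FROM repairs
    WHERE clientID = @clientID;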
This question has been asked before. The problem with adding the key to every table is that you say you have a working system, and this means every query needs to be updated.
Probably the easiest is to create a new database for each client, so that the only thing you need to change is the connection string. This also means you can get automated query tools, for example, to work without worrying about cross-client data leakage.
And it also allows you to back up, transfer, or delete a single client easily.
There are of course pros and cons to this approach, but it will simplify the development effort. Also remember that if you plan to spin it up in a cloud environment then spinning up databases like this is also very easy.
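A hedged sketch of what the database-per-client route looks like, assuming SQL Server and a made-up naming convention:

    -- One database per client; per-client maintenance stays trivial.
    CREATE DATABASE client_0042;

    -- Back up, move or remove a single client without touching anyone else.
    BACKUP DATABASE client_0042 TO DISK = 'D:\backups\client_0042.bak';
    -- DROP DATABASE client_0042;   -- deleting one client is a single statement

The application then only swaps the connection string per client, as described above.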

Should I create separate SQL Server database for each user?

I am working on Asp.Net MVC web application, back-end is SQL Server 2012.
This application will provide billing, accounting, and inventory management. Users will create an account by signing up, just like http://www.quickbooks.in. Each user will create some masters and various transactions. There is no limit; a user can create unlimited records in the database.
I want database performance to remain stable under heavy data load. I am maintaining proper indexing and primary keys, but there will be a heavy load on the database per user.
So, should I create a separate database for each user, or should I maintain one database with a UserID column in each table and partition based on UserID?
I am not an expert in SQL Server, so please provide suggestions with concrete specifics.
Please let me know if any information is missing.
A DB per user is what happens when customers need to be able to pack up and leave, taking the actual database with them. Think of a self-hosted WordPress website. Or when there are incredible risks to one user accidentally seeing another user's data, so it's safer to rely on the server's security model than to rely on remembering to add the UserId filter to all your queries. I can't imagine a scenario like that, but who knows -- maybe if the privacy laws allowed for jail time, I would rather have the data partitioned by security rules than rely on carefully written WHERE clauses.
If you did do user-per-database, creating a new user will be 10x more effort. While INSERT, UPDATE and so on stay the same from version to version, the syntax for database creation, user creation, permission granting and so on evolves enough to break those scripts with each SQL Server version upgrade.
Also, this will multiply your migration headaches by the number of users. Let's say you have 5000 users and you need to add some new columns, change a column's data type, update a trigger, and so on. Instead of running that change script once, you need to run it 5000 times.
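To make that concrete, here is a hedged T-SQL sketch of replaying one schema change across per-user databases versus a single shared database; the 'user_%' naming convention and the invoices table are hypothetical.

    -- Per-user databases: the same change script has to run once per database.
    DECLARE @db sysname, @sql nvarchar(max);
    DECLARE dbs CURSOR FOR
        SELECT name FROM sys.databases WHERE name LIKE 'user_%';
    OPEN dbs;
    FETCH NEXT FROM dbs INTO @db;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @sql = N'ALTER TABLE ' + QUOTENAME(@db) + N'.dbo.invoices ADD notes nvarchar(500) NULL;';
        EXEC sp_executesql @sql;
        FETCH NEXT FROM dbs INTO @db;
    END
    CLOSE dbs;
    DEALLOCATE dbs;

    -- Single shared database: the same change is one statement.
    ALTER TABLE dbo.invoices ADD notes nvarchar(500) NULL;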
Per-user DBs also probably waste disk space. Each of those databases is going to have its own transaction log, sitting idle and taking up at least the minimum log space.
As for load, if collectively your 5000 users are doing 1 billion inserts, updates and so on per day, my intuition tells me that it's going to be faster on one database, unless there is some sort of contention issue (everyone reading and writing to the same pages of the same table at the same time). Each database consumes machine resources (probably threads and memory) for its own housekeeping, so these extra DBs can't be free.
Anyhow, the best thing to do is to simulate the two architectures and use a random data generator to simulate load and see how they perform.
It's not an easy answer to give.
First, there is logical design to be considered. Then you have integrity, security, management and performance (in this very order).
A database is a logical unit of data, self-contained. Ideally, you should be able to take a database, move it to another instance, probably change the connection strings, and be running again.
All the constraints are database-level. No foreign keys can exist referencing some object outside the database.
So, try thinking in these terms first.
How would you reliably prevent one user from messing up another user's data? Keep in mind that it's just a matter of time before someone opens an Excel sheet and fires up queries against the database, bypassing your application. Row-level security in SQL Server is something you don't want to deal with.
Multiple databases mean that all management tasks should be scripted out and executed on all databases. Yes, there is some overhead to it, but once you set it up it's just a matter of monitoring. If a database goes suspect, it's a single customer down, not all of them. You can even have different versions for different customers if each customer has its own database. Additionally, if you roll out an upgrade, you can do it per customer, so the impact will be much less.
Performance is the least relevant factor here. Of course, it really depends on how many customers and how much data, but proper indexing will solve these issues. Scale-out is much easier with multiple databases.
BTW, partitioning, as you mentioned it, is never a performance booster; it's simply a management feature, allowing for faster loading and evicting of data from a table.
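For reference, a hedged sketch of what that management-style partitioning looks like in SQL Server; all object names here are hypothetical.

    -- Partition by date range so old data can be loaded and evicted quickly.
    CREATE PARTITION FUNCTION pf_by_year (date)
        AS RANGE RIGHT FOR VALUES ('2013-01-01', '2014-01-01', '2015-01-01');

    CREATE PARTITION SCHEME ps_by_year
        AS PARTITION pf_by_year ALL TO ([PRIMARY]);

    CREATE TABLE dbo.transactions (
        transactionID bigint        NOT NULL,
        userID        int           NOT NULL,
        postedOn      date          NOT NULL,
        amount        decimal(18,2) NOT NULL,
        CONSTRAINT pk_transactions PRIMARY KEY (postedOn, transactionID)
    ) ON ps_by_year (postedOn);

    -- Evicting an old range is then a fast metadata operation, e.g.:
    -- ALTER TABLE dbo.transactions SWITCH PARTITION 1 TO dbo.transactions_archive;

Note that this speeds up loading and purging of ranges, not ordinary per-customer queries, which is exactly the point made above.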
I'd probably put each customer in a separate database, but eventually it's up to you to make the decision for yourself. Hope I've helped some with this.

How to organize primary keys for good locality?

I have a table for users and a table for documents. Documents have exactly one user as an owner, and for the application I'm building, I know that I will typically be accessing a group of documents associated with a single given user.
Let's say the average user has K documents, and certain common queries fetch all of the documents for a given user. I don't want my database (PostgreSQL) to have to do K disk seeks (on average) to fetch all the documents for a user. Ideally, the documents would be stored in contiguous blocks so that fetches would only require a few seeks.
Is it possible (and reasonable) to organize the document table schema to create such locality? I know that NoSQL implementations do this all the time. E.g. the BigTable paper talks about how row keys for web tables are assigned by URL, except that the URL is reversed, e.g. com.cnn.www, so that all the pages for CNN are located near each other in the data store. It doesn't appear possible to do something similar in Postgres because tables cannot be index-organized, although it might be possible in MySQL with InnoDB. This post comes to a similar conclusion.
The command you're looking for is CLUSTER, but it has drawbacks. It completely rewrites the table when you run it, which requires a lock on it, so you may only want to do this when traffic is low. Also, Postgres will do nothing to keep rows in that order during INSERTs and UPDATEs, so your data will tend to fragment as the table is written to and you may have to recluster it regularly.
What you can also do is set a low fillfactor on the table, so that UPDATEs are more likely to keep a given row on the same page. This should prevent some fragmentation, which just leaves INSERTs, but with a low fillfactor INSERTs will tend to be placed on newer pages, and these will probably be commonly accessed enough to be kept in RAM. I'm making assumptions about your usage patterns which may be wrong, but regardless, your best course of action is probably to just recluster whenever you see I/O start to become a problem.
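Putting those two pieces together for PostgreSQL, assuming a documents table with a user_id owner column (names are illustrative):

    -- Index on the owning user, used both for lookups and as the cluster order.
    CREATE INDEX documents_owner_idx ON documents (user_id);

    -- Leave free space on each page so UPDATEs tend to stay on the same page.
    ALTER TABLE documents SET (fillfactor = 70);

    -- Physically rewrite the table in index order so one user's documents end up
    -- on neighbouring pages. This takes an exclusive lock, so run it off-peak and
    -- repeat it when the table fragments and I/O becomes a problem again.
    CLUSTER documents USING documents_owner_idx;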
Finally, there's also a tool called pg_repack that can cluster a table without taking such a heavy lock, in a similar manner to how CREATE INDEX CONCURRENTLY works, but it's a third-party tool, so you'll want to experiment with it before running in production.

SQL: Joins vs Denormalization (lots of data)

I know, variations of this question have been asked before. But my case may be a little different :-)
So, I am building a site that tracks events. Each event has an id and a value. It is also performed by a user, who has an id, age, gender, city, country and rank. (These attributes are all integers, if it matters.)
I need to be able to quickly get answers to two queries:
get the number of events from users with a certain profile (for example, males aged 18-25 from Moscow, Russia)
get the sum (and maybe also the avg) of the values of events from users with a certain profile
Also, data is generated by multiple customers, which, in turn, can have multiple source_ids.
Access pattern: data will be mostly written by collector processes, but when queried (infrequently, by web ui) it has to respond quickly.
I expect LOTS of data, certainly more than one table or a single server can handle.
I am thinking about grouping events into separate tables per day (that is, 'events_20111011'). I also want to prefix the table name with customer id and source id, so that data is isolated and can be trivially discarded (purging old data) and relatively easily moved around (distributing load to other machines).
This way, every such table will have a limited number of rows, let's say 10M tops.
So, the question is: what to do with user's attributes?
Option 1, normalized: store them in separate table and reference from event tables.
(pro) No repetition of data.
(con) Joins, which are expensive (or so I heard).
(con) This requires the user table and event tables to be on the same server.
Option 2, redundant: store user attributes in event tables and index them.
(pro) easier load balancing (self-contained tables can be moved around)
(pro) simpler (faster?) queries
(con) lots of disk space and memory used for repeating user attributes and corresponding indexes
Your design should be normalized; your physical schema may end up denormalized for performance reasons.
Is it possible to do both? There is a reason why SQL Server ships with Analysis Services. Even if you are not in the Microsoft realm, it is a common design to have a transactional system for data entry and day-to-day processing, while a reporting system is available for the kinds of queries that would put heavy loads on the transactional system.
Doing this means you get the best of both worlds: a normalized system for daily operations and a denormalized system for rollup queries.
In most cases nightly updates are fine for reporting systems, but it depends on your hours of operation and other factors what works best. I find most 8-5 businesses have more than enough time in the evening to update a reporting system.
Use an OLAP/data warehousing approach. That is, store your data in the standard normalized way, but also store aggregated versions of the frequently queried data in separate fact tables. The user queries won't be on real-time data, but it is usually worth it for the performance trade-off.
Also, if you are using SQL Server enterprise I wouldn't roll your own horizontal partitioning scheme (breaking the data into days). There are tools built into SQL server to automatically do that for you.
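A hedged sketch of that idea in generic SQL (made-up names; date functions and syntax vary by engine): keep the normalized events and users tables for writes, and maintain a small aggregate fact table for the reporting queries.

    -- Aggregate fact table grained by day and user profile.
    CREATE TABLE fact_events_daily (
        event_date  DATE   NOT NULL,
        country     INT    NOT NULL,
        city        INT    NOT NULL,
        gender      INT    NOT NULL,
        age         INT    NOT NULL,
        event_count BIGINT NOT NULL,
        value_sum   BIGINT NOT NULL,
        PRIMARY KEY (event_date, country, city, gender, age)
    );

    -- Periodic (e.g. nightly) rollup from the normalized tables.
    INSERT INTO fact_events_daily
    SELECT e.event_date, u.country, u.city, u.gender, u.age,
           COUNT(*), SUM(e.value)
    FROM events e
    JOIN users u ON u.id = e.user_id
    WHERE e.event_date = CURRENT_DATE
    GROUP BY e.event_date, u.country, u.city, u.gender, u.age;

    -- The UI query then hits the small aggregate instead of the raw event rows.
    SELECT SUM(event_count) AS events, SUM(value_sum) AS total_value
    FROM fact_events_daily
    WHERE country = 7 AND city = 495 AND gender = 1 AND age BETWEEN 18 AND 25;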
Please normalize.
Use partitions and indexing to balance the load.

Database for microblogging startup

I am going to build a microblogging web service (for school, so don't blast me for the lack of a new idea) and I worry that the DB could often be overloaded (a user can follow other users or even tags, so I suppose the SELECT will be heavy - check the 20 latest messages which contain all the followed tags and users).
My idea is to create another table and store in it only statusID and userID (who should pick up the message). The danger with that is that if some tag or user has many followers, there will be a lot of records with that status ID. So, is it a good idea? Or maybe it's better to use an M2M relation? (one status -> many receivers)
I think most databases can easily handle large record sets. The responsibility for making them perform lies in your design, with properly set up indexes. If you create the right indexes, the SELECT queries should perform really well.
I'd go with a users table, a messages table, and a table for the m2m relationship between users and messages.
You can then do one select to find all of the users a user is following, and then a second select to get all of the messages of interest (sorting and limiting the results as appropriate). Extending this to tagging should be pretty simple.
This design should be fine for large numbers of users and messages as long as you index the right columns. If you get massive, you could also move the users and messages tables to different servers or have read-only replicas. I wouldn't even worry about that for the moment - you'd need to be huge.
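A minimal sketch of one way to read that design, with the m2m relationship shown as a follows table between users (hypothetical names; LIMIT syntax varies by engine, e.g. TOP in SQL Server):

    CREATE TABLE users (
        user_id  INT PRIMARY KEY,
        username VARCHAR(50) NOT NULL
    );

    CREATE TABLE follows (                   -- m2m: who follows whom
        follower_id INT NOT NULL REFERENCES users(user_id),
        followee_id INT NOT NULL REFERENCES users(user_id),
        PRIMARY KEY (follower_id, followee_id)
    );

    CREATE TABLE messages (
        message_id INT PRIMARY KEY,
        user_id    INT NOT NULL REFERENCES users(user_id),
        body       VARCHAR(500) NOT NULL,
        created_at TIMESTAMP NOT NULL
    );

    -- Index to serve "latest messages by author" efficiently.
    CREATE INDEX ix_messages_user_time ON messages (user_id, created_at);

    -- The two selects described above, folded into one join:
    -- the 20 latest messages from everyone user 123 follows.
    SELECT m.message_id, m.user_id, m.body, m.created_at
    FROM messages m
    JOIN follows f ON f.followee_id = m.user_id
    WHERE f.follower_id = 123
    ORDER BY m.created_at DESC
    LIMIT 20;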
When implementing Collabinate (http://www.collabinate.com), a service-based engine for microblogging and shared activity streams, I used a graph database. The fact that people create posts and follow other people lends itself to a graph structure. With the right relationships and algorithms, this can be a very efficient and performant solution.