SQL: Joins vs Denormalization (lots of data)

I know, variations of this question have been asked before. But my case may be a little different :-)
So, I am building a site that tracks events. Each event has an id and a value. It is also performed by a user, who has an id, age, gender, city, country and rank. (These attributes are all integers, if it matters.)
I need to be able to quickly get answers to two queries:
get the number of events from users with a certain profile (for example, males aged 18-25 from Moscow, Russia)
get the sum (and maybe also the average) of the values of events from users with a certain profile
Also, data is generated by multiple customers, which, in turn, can have multiple source_ids.
Access pattern: data will be mostly written by collector processes, but when queried (infrequently, by web ui) it has to respond quickly.
I expect LOTS of data, certainly more than one table or single server can handle.
I am thinking about grouping events into separate tables per day (that is, 'events_20111011'). Also, I want to prefix the table name with the customer id and source id, so that data is isolated and can be trivially discarded (to purge old data) and relatively easily moved around (to distribute load to other machines).
This way, every such table will have a limited number of rows, let's say 10M at most.
So, the question is: what to do with user's attributes?
Option 1, normalized: store them in a separate table and reference them from the event tables.
(pro) No repetition of data.
(con) Joins, which are expensive (or so I've heard).
(con) This requires the user table and the event tables to be on the same server.
Option 2, redundant: store user attributes in the event tables and index them.
(pro) Easier load balancing (self-contained tables can be moved around).
(pro) Simpler (faster?) queries.
(con) Lots of disk space and memory used for the repeated user attributes and the corresponding indexes.
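To make the two options concrete, here is a rough sketch of how they might look for the profile queries. All names and filter values are illustrative, and I've renamed rank to user_rank since RANK is a reserved word in some dialects:

```sql
-- Option 1 (normalized): user attributes live in one table,
-- events reference it by user_id.
CREATE TABLE users (
    user_id   INT PRIMARY KEY,
    age       INT,
    gender    INT,
    city      INT,
    country   INT,
    user_rank INT
);

CREATE TABLE events_20111011 (
    event_id INT PRIMARY KEY,
    user_id  INT,            -- references users(user_id)
    value    INT
);

-- Count and sum of event values for a profile (requires a join):
SELECT COUNT(*), SUM(e.value)
FROM events_20111011 e
JOIN users u ON u.user_id = e.user_id
WHERE u.gender = 1 AND u.age BETWEEN 18 AND 25
  AND u.city = 495 AND u.country = 7;

-- Option 2 (denormalized): attributes are copied into every event row.
CREATE TABLE events_20111011_denorm (
    event_id  INT PRIMARY KEY,
    user_id   INT,
    value     INT,
    age       INT,
    gender    INT,
    city      INT,
    country   INT,
    user_rank INT
);

-- Same question, no join, but the filter columns need indexes:
SELECT COUNT(*), SUM(value)
FROM events_20111011_denorm
WHERE gender = 1 AND age BETWEEN 18 AND 25
  AND city = 495 AND country = 7;
```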

Your design should be normalized; your physical schema may end up denormalized for performance reasons.
Is it possible to do both? There is a reason why SQL Server ships with Analysis Services. Even if you are not in the Microsoft realm, it is a common design to have a transactional system for data entry and day-to-day processing, while a reporting system handles the kinds of queries that would place heavy loads on the transactional system.
Doing this means you get the best of both worlds: a normalized system for daily operations and a denormalized system for rollup queries.
In most cases nightly updates are fine for reporting systems, but what works best depends on your hours of operation and other factors. I find most 8-to-5 businesses have more than enough time in the evening to update a reporting system.
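As a rough illustration, the nightly job could roll the raw events up into a summary table keyed by the profile attributes, and the web UI would query only the summary. This is a minimal sketch reusing the illustrative tables from the question's sketch, so every name here is an assumption:

```sql
-- Reporting table, pre-aggregated by the attributes the UI filters on.
CREATE TABLE daily_event_summary (
    event_date  DATE,
    gender      INT,
    age         INT,
    city        INT,
    country     INT,
    event_count BIGINT,
    value_sum   BIGINT,
    PRIMARY KEY (event_date, gender, age, city, country)
);

-- Run once per night from a scheduled job, for the previous day's table:
INSERT INTO daily_event_summary
SELECT CAST('2011-10-11' AS DATE),
       u.gender, u.age, u.city, u.country,
       COUNT(*), SUM(e.value)
FROM events_20111011 e
JOIN users u ON u.user_id = e.user_id
GROUP BY u.gender, u.age, u.city, u.country;

-- The web UI then sums the pre-aggregated rows instead of scanning raw events:
SELECT SUM(event_count), SUM(value_sum)
FROM daily_event_summary
WHERE gender = 1 AND age BETWEEN 18 AND 25
  AND city = 495 AND country = 7;
```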

Use an OLAP/data warehousing approach. That is, store your data in the standard normalized way, but also store aggregated versions of the frequently queried data in separate fact tables. The user queries won't be on real-time data, but it is usually worth it for the performance trade-off.
Also, if you are using SQL Server Enterprise, I wouldn't roll your own horizontal partitioning scheme (breaking the data into days). There are tools built into SQL Server to do that for you automatically.
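For example, a minimal sketch of SQL Server's built-in table partitioning, with made-up names and daily boundaries:

```sql
-- Define daily boundaries and map every partition to a filegroup
-- ([PRIMARY] here just to keep the sketch simple).
CREATE PARTITION FUNCTION pf_event_date (DATE)
AS RANGE RIGHT FOR VALUES ('2011-10-10', '2011-10-11', '2011-10-12');

CREATE PARTITION SCHEME ps_event_date
AS PARTITION pf_event_date ALL TO ([PRIMARY]);

-- One events table, physically split by day; the partitioning column
-- is part of the key so the clustered index is aligned.
CREATE TABLE events (
    event_id   BIGINT NOT NULL,
    event_date DATE   NOT NULL,
    user_id    INT    NOT NULL,
    value      INT    NOT NULL,
    CONSTRAINT pk_events PRIMARY KEY (event_date, event_id)
) ON ps_event_date (event_date);
```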

Please normalize.
Use partitions and indexing to balance the load.

BigQuery with a large number of tables

I know there has already been a question regarding the table number limits, but it was vague...
In a dataset I want to create about 1-2 million tables. This is because I want to split my users' activity table into smaller tables, one table per user. And in time this number will keep growing.
As I understand it, there will be no problem from BigQuery's perspective... but I'm concerned that I will not be able to access (list) those tables from the browser (https://bigquery.cloud.google.com/queries/appname), because the tables are not grouped by time (as in the case of tables with a time range) and they all get listed in an endless scroll (possibly blocking the browser).
Thank you for any suggestions
… the problem is that the browser will get blocked while listing all tables in the dataset
You can use the "?minimal" parameter to limit the load operation to 30,000 tables per project, so the browser will not be blocked. For example:
https://bigquery.cloud.google.com/queries/<your_project_name>?minimal.
see more about Display limits
I can't easily explore my dataset because of this (and query them)
If you are planning to have 2+ million tables in the same dataset, then even if the Web UI were able to show them to you without getting blocked, I really doubt you would be able to explore them visually in any reasonable way. There are just too many objects to take in.
By the way, this is not only a human issue: even querying such a "long" list of tables programmatically can be problematic. See more about Using meta-tables.
because the tables are not grouped by time (like in the case of tables with timerange) and they get all listed in an endless scroll (possibly blocking the browser)
That's right: in the BigQuery Web UI, tables are grouped only if they follow the table_prefixYYYYMMDD pattern. Even if you mapped your userID namespace to YYYYMMDD values, you would still be out of luck, as the group would still consist of those millions of tables.
Thank you for any suggestions
BigQuery supports Partitioned Tables, which allow multiple partitions in the same table. Unfortunately, as of today, only date-partitioned tables are supported, but from what I've heard the BigQuery team plans to add partitioning by an arbitrary column.
This would probably fit your desired design, unless there turns out to be a limit on column cardinality.
In the meantime, if you want, you can experiment with applying your design using the date-partitioned tables feature by mapping userid to YYYYMMDD (roughly 9999 × 12 × 30, i.e. more than 3 million distinct values).
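A minimal sketch of such an experiment, assuming an ingestion-time partitioned table and a made-up userid-to-YYYYMMDD mapping (the project, dataset, table and column names are all invented):

```sql
-- mydataset.user_activity is assumed to be an ingestion-time partitioned
-- table; rows for one user would be loaded into "their" partition via the
-- partition decorator, e.g. mydataset.user_activity$20170615.
-- Querying one mapped user then touches only one partition:
SELECT event_type, COUNT(*) AS events
FROM `myproject.mydataset.user_activity`
WHERE _PARTITIONTIME = TIMESTAMP('2017-06-15')  -- the partition standing in for one userid
GROUP BY event_type;
```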
My recommendation:
Play/experiment with partitioned tables as I suggested in the previous (above) section.
Sharding (splitting) tables in BigQuery into millions of tables sounds extremely impractical to me. You should revisit your design. What is it that you are trying to address with such sharding? Try to focus on that and, if needed, post a specific question here on SO!
As an alternative solution you can use the Google Cloud SDK client.
You can read the documentation for the bq command-line tool here.
e.g. bq ls [project_id:][dataset_id] to list all tables.
NOTE: The maximum number of tables per query is limited to 1,000 (see the BigQuery documentation).

Handling relational model in Cassandra

Background
We chose Cassandra as our storage engine because we have an application that must handle async messaging between many users on the website as well as event storing (various types of analytics: what happens on the site and when, etc.). We also have a voting platform, so we store votes per user per day, and Cassandra is good for those use cases.
Recently we got new requirements to build a relational model on top of our existing system (at least we think it is relational): political candidates with lists of jobs, education, historical voting, endorsements, etc.
Problem
We have relations that can be edited on both ends (i.e. a candidate is supported by companies, but in our admin panel that company can be edited without the candidate). A candidate is one row in our Cassandra DB, identified by a UUID. On the front end, we need full information about candidates (political party, schools, jobs, voting history, supporting companies). We want to place the majority of candidate info in a single row so we can read the data with a single read. However, when we store the list of supporting companies as a UDT, we have problems editing it (we need to change it in both the company_by_id and candidate_by_id tables).
Question
How to solve the editing problem and relational model issues in our situation?
We came up with a couple of solutions:
Track relations in Cassandra with additional index-like tables, e.g. candidates_by_supporting_company. When updating a company, we also update the candidates who have that company.
Similar to 1, but using a secondary index if the relation is low cardinality, and updating based on the secondary index (we have 10 political parties, so we can place an index on political party in the candidates table, and when a political party changes we can update the candidates for that party via the index).
Use a relational database for the relational type of data and leave Cassandra to handle only the use cases it suits, like time-series data, messaging and event storing (this adds the maintenance cost of one more database, deployment costs, and, since our system is distributed, the problem of how to replicate the data).
Use Spark to do joins (this would not be the sole purpose of adding Spark to the system; we are thinking of adding it for importing huge data sets in CSV and doing transformations, so having Spark would be an added bonus and we could use Spark SQL in the places where we need joins).
We are leaning towards option 3, since we will add Spark anyway, we stay with only the Cassandra database (which avoids complicating maintenance and deployment with one more database), and we get reasonably efficient JOINs and GROUP BYs at the application level with it.
What do you think?
If you want to use only Cassandra, the right way to proceed is number 1: denormalization. But if you have a lot of relationships, it will take a lot of effort at the application level.
If adding another DBMS is not a problem in your environment, using the right tool for the right job is the best choice: number 3 for me.
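For reference, a minimal sketch of what option 1's index-like table could look like (all names are illustrative; the same definition reads as CQL, where company_id becomes the partition key and candidate_id the clustering column):

```sql
CREATE TABLE candidates_by_supporting_company (
    company_id     uuid,
    candidate_id   uuid,
    candidate_name text,
    PRIMARY KEY (company_id, candidate_id)
);

-- When a company is edited, first look up the affected candidates...
SELECT candidate_id
FROM candidates_by_supporting_company
WHERE company_id = ?;

-- ...then rewrite the embedded company data in candidate_by_id for each of
-- those candidates; multiple writes per relation is the usual price of
-- denormalization.
```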

What is a good way to manage large ever growing tables in a database?

I am building a web application for medical record keeping. A requirement for this application is logging all changes (view, create, update, delete) to a patient's data and pretty much any other useful info in the system (logins, cron runs, data exports, etc.).
I am currently storing the data in a database table, which is working fine. However, it is likely this table will grow unwieldy very quickly and bloat the database. I am not allowed to delete log entries.
My current plan is to choose an arbitrary size (such as 1 million entries: large but still manageable). When the table hits 1 million entries, I move the 100,000 oldest entries into a file and store it on our file server.
Does anyone have any experience with this issue that has other/better ideas on how to handle it?
Additional info:
My primary concern is that nothing will ever be deleted from this data. However, the data does not necessarily need to be accessed after several months. Since this data could realistically hit 1 billion entries in a matter of a couple of years (and I have 300 copies of this db that all include this table), what is a good way to manage the size and performance? This table needs to be displayed with a pager, which is obviously going to be an issue when it breaks 1 million entries, let alone 1 billion.
Cases like this are tailor-made for partitioning. Using a partitioning strategy, you split your data across multiple partitions. This helps balance I/O, speeds up access for partition-specific queries, and so on. This is a discipline in and of itself, and the choice of partitioning key is crucial. For log data like this, people often partition on a datetime value.
Partitioned Tables and Indexes (SQL Server)
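For illustration, a minimal sketch of a datetime-partitioned audit log in SQL Server, including switching an old partition out to an archive table so nothing is ever deleted row by row (all names, columns and boundary dates are made up, not taken from the question):

```sql
-- Monthly boundaries; everything on [PRIMARY] just to keep the sketch simple.
CREATE PARTITION FUNCTION pf_log_month (DATETIME2)
AS RANGE RIGHT FOR VALUES ('2015-01-01', '2015-02-01', '2015-03-01');

CREATE PARTITION SCHEME ps_log_month
AS PARTITION pf_log_month ALL TO ([PRIMARY]);

CREATE TABLE audit_log (
    log_id     BIGINT        NOT NULL,
    logged_at  DATETIME2     NOT NULL,
    patient_id INT           NULL,
    log_action NVARCHAR(50)  NOT NULL,
    detail     NVARCHAR(MAX) NULL
) ON ps_log_month (logged_at);

-- Empty archive table with the same structure, on the same filegroup:
CREATE TABLE audit_log_archive (
    log_id     BIGINT        NOT NULL,
    logged_at  DATETIME2     NOT NULL,
    patient_id INT           NULL,
    log_action NVARCHAR(50)  NOT NULL,
    detail     NVARCHAR(MAX) NULL
);

-- Move the oldest month into the archive table; this is a metadata-only
-- operation, so no rows are deleted or copied.
ALTER TABLE audit_log SWITCH PARTITION 1 TO audit_log_archive;
```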

Should I create separate SQL Server database for each user?

I am working on an ASP.NET MVC web application; the back-end is SQL Server 2012.
This application will provide billing, accounting, and inventory management. The user will create an account by signing up, just like on http://www.quickbooks.in. Each user will create some masters and various transactions. There is no limit; a user can create unlimited records in the database.
I want to keep database performance stable under heavy data load. I am maintaining proper indexing and primary keys, but there will be a heavy per-user load on the database.
So, should I create a separate database for each user, or should I maintain one database with a UserID, adding UserID to each table and partitioning based on UserID?
I am not an expert in SQL Server, so please provide suggestions with clear specifications.
Please inform me if there is any lack of information.
A DB per user is what happens when customers need to be able to pack up and leave, taking the actual database with them. Think of a self-hosted WordPress website. Or when there are incredible risks of one user accidentally seeing another user's data, so it's safer to rely on the server's security model than on remembering to add the UserID filter to all your queries. I can't imagine a scenario like that, but who knows: maybe if the privacy laws allowed for jail time, I would rather have data partitioned by security rules than rely on carefully written WHERE clauses.
If you did go user-per-database, creating a new user would be 10x more effort. While INSERT, UPDATE and so on stay the same from version to version, the syntax for database creation, user creation, permission granting and so on evolves enough to break those scripts with each SQL Server version upgrade.
Also, this multiplies your migration headaches by the number of users. Let's say you have 5000 users and you need to add some new columns, change a column's data type, update a trigger, and so on. Instead of running that change script once, you need to run it 5000 times.
Per-user DBs also probably waste disk space. Each of those databases is going to have its own transaction log sitting idle, taking up at least the minimum log space.
As for load, if collectively your 5000 users are doing 1 billion inserts, updates and so on per day, my intuition tells me that it's going to be faster on one database, unless there is some sort of contention issue (everyone reading and writing to the same pages of the same table at the same time). Each database uses machine resources (probably threads and memory) for its own housekeeping, so these extra DBs can't be free.
Anyhow, the best thing to do is to simulate the two architectures and use a random data generator to simulate load and see how they perform.
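For what it's worth, a minimal sketch of the single-database option from the question, with UserID as the leading key column so each user's queries only touch that user's range of rows (the table and column names are invented):

```sql
CREATE TABLE Invoices (
    UserID      INT           NOT NULL,
    InvoiceID   BIGINT        NOT NULL,
    InvoiceDate DATE          NOT NULL,
    Amount      DECIMAL(18,2) NOT NULL,
    -- Tenant key first, so the clustered index groups each user's rows together.
    CONSTRAINT PK_Invoices PRIMARY KEY CLUSTERED (UserID, InvoiceID)
);

-- Every query filters on the tenant key:
DECLARE @UserID INT = 42;

SELECT InvoiceID, InvoiceDate, Amount
FROM Invoices
WHERE UserID = @UserID
ORDER BY InvoiceDate DESC;
```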
It's not an easy answer to give.
First, there is logical design to be considered. Then you have integrity, security, management and performance (in this very order).
A database is a logical unit of data, self contained. Ideally, you should be able to take a database, move it to another instance, probably change the connection strings and be running again.
All the constraints are database-level. No foreign keys can exist referencing some object outside the database.
So, try thinking in these terms first.
How would you reliably prevent one user from messing up another user's data? Keep in mind that it's just a matter of time before someone opens an Excel sheet and fires up queries against the database, bypassing your application. Row-level security in SQL Server is something you don't want to deal with.
Multiple databases mean that all management tasks have to be scripted out and executed on all databases. Yes, there is some overhead to it, but once you set it up it's just a matter of monitoring. If a database goes suspect, it's a single customer down, not all of them. You can even have different versions for different customers if each customer has its own database. Additionally, if you roll out an upgrade, you can do it per customer, so the impact will be much smaller.
Performance is the least relevant factor here. Of course, it really depends on how many customers and how much data, but proper indexing will solve these issues. Scale-out is much easier with multiple databases.
BTW, partitioning, as you mentioned it, is never a performance booster; it's simply a management feature, allowing for faster loading and eviction of data from a table.
I'd probably put each customer in a separate database, but ultimately it's up to you to make that decision for yourself. Hope I've helped some with this.

Should I create multiple tables, or even databases for multiple users of a CRM

I'm working on creating an application best described as a CRM. There is a relatively complex table structure, and I'm thinking about allowing users to do a fair bit of customization (adding fields and the like). One concern is that I will be reaching a certain level of scale almost immediately. We have about 50,000 individual users who will be coming online within about nine months of launch. So I want to build to last.
I'm thinking about two and maybe even three options.
One table set with a userID column on everything, with custom attributes handled by creating one table which indexes the custom attributes and another table which holds their values, which can then be joined to the existing contact records for the user. -- From what I've read, this seems like the right option, but I keep feeling like it's not. It seems like once these tables start reaching millions of records, searching for just one user's records in every query is going to become a database hog.
For each user account, recreate the table set, prefixed with a unique identifier (the userID, for example). Then rather than using WHERE userID=? everywhere, I can use FROM ?_contacts. For attributes I could then have a custom attributes table where users could add additional columns for custom attributes. -- This feels like the simplest way to go, though of course when I decide to change the database structure there would be a migration from hell.
The third option, which I'm pretty confident is wrong, but for that reason alone I cannot rule out, is that a new database should be created for each user, with all the requisite tables.
Am I crazy? Is option one really the best?
The first method is the best. Create individual userIds and then you can assign specific roles to them. Database retrieval time does depend on the number of records, but there is a trade-off: you can write efficient SQL queries to fetch the data. According to this site, you probably won't run out of memory or run into concurrency issues, because with a good server the performance ought to be good, provided you are efficient in writing queries.
If you recreate table sets, you will just end up creating lots of tables, which can slow down indexing and is bad practice. Instead, opt for a relational schema and normalize the database and data tables to improve efficiency.
Creating a new database for each and every user just combines the complexity of both of the above, resulting in shabby and disorganized database access. If you decide to run individual database instances for every single user, you will end up consuming your server's physical resources (RAM and CPU), which will affect the quality of service for all the other users.
Go with option 1. Assign separate userIds and give them roles and privileges where needed. That is more efficient than the other two methods.
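For illustration, a rough sketch of what option 1 could look like (every name here is made up): one shared table set keyed by userID, with custom fields modeled as an attribute-definition table plus a value table that joins back to the user's contacts.

```sql
CREATE TABLE contacts (
    contact_id BIGINT       PRIMARY KEY,
    user_id    INT          NOT NULL,
    full_name  VARCHAR(200) NOT NULL
);

CREATE TABLE custom_attributes (
    attribute_id INT          PRIMARY KEY,
    user_id      INT          NOT NULL,   -- which account defined the field
    name         VARCHAR(100) NOT NULL
);

CREATE TABLE custom_attribute_values (
    contact_id   INT,
    attribute_id INT,
    attr_value   VARCHAR(4000),
    PRIMARY KEY (contact_id, attribute_id)
);

-- Composite index so one user's lookups don't scan other users' rows:
CREATE INDEX ix_contacts_user ON contacts (user_id, contact_id);

-- One user's contacts with their custom attribute values:
SELECT c.contact_id, c.full_name, a.name AS attribute, v.attr_value
FROM contacts c
LEFT JOIN custom_attribute_values v ON v.contact_id   = c.contact_id
LEFT JOIN custom_attributes a       ON a.attribute_id = v.attribute_id
WHERE c.user_id = 42;
```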