I am working for a big online store. At the moment our architecture is something weird where we have microservices which actually all share the same DB (doesn't work well at all...).
I am considering improving that but have some challenges on how to make them independant.
Here is a use case. I have customers, customers purchase products. Let say I have 3 microservices : customer authentication, order management, product management.
An order is linked to a customer and a product.
Could you describe a solution for the following problems :
How do you make the link between an order and a customer?
Let say both services share a customer ID, how do you handle data consistency? If you remove a customer on the customer service side, you end up with inconsistency. If your service has to notify the other services then you end up with tighlty coupled services which to me sounds like what you wanted to avoid in the first place. You could kind of avoid that by having an event mechanism which notify everyone but what about network errors when you don't even know who is supposed to receive the event?
I want to do a simple query : retrieve the customers from US that bought product A. Given that 3million people bought product A and we have 1 million customers in the US; How could you make that reasonably performant? (Our current DB would execute that in few milliseconds)
I can't think of any part of our code where we don't have this kind of relation. One of the solution I can think of is duplicating data. E.g. When a customer purchase something, the order management service will store the customer details and the product details. You end up with massive data replication, not sure if that's a good thing and I would still be worried about consistency.
I couldn't find a paper addressing those issues. What are the different options?
At the moment our architecture is something weird where we have microservices which actually all share the same DB (doesn't work well at all...). I am considering improving that but have some challenges on how to make them independant.
IMHO the architecture is more simple by having one OLTP database for orders, customers, and products since it allows you to make use of JOINS and stored procedures. It could be the case that the DB could use some configuration and tuning TLC vs. software re-architecture. Just keep that door open when you consider how to fix performance problems.
How do you make the link between an order and a customer?
In the orders table have a column for customer_id. The customer_id field in the orders table would be a foreign key to the id field on the customers table. This will give you the best performance.
You can do either periodic cleanup or event based cleanup of deleted users (and their orders). But please make sure that somewhere these old orders and customers are stored. Maybe archive tables or back-end data-warehouse where reports and analysis (OLAP) can be done on this data.
Let say both services share a customer ID, how do you handle data consistency? If you remove a customer on the customer service side, you end up with inconsistency. If your service has to notify the other services then you end up with tighlty coupled services which to me sounds like what you wanted to avoid in the first place. You could kind of avoid that by having an event mechanism which notify everyone but what about network errors when you don't even know who is supposed to receive the event?
There are various ways this can be done. As mentioned you can either create an event to deal with customer deletions or do periodic db cleanups. But one thing is for certain, the orders service does not NEED to be notified when this cleanup is done, unless you want it to. Not a need but could be a want if you want order culling to be done via the order services. The naive way to do this is to create a stored procedure that takes a customer_id (or list of customer_id's) as input and deletes all orders that match that customer_id from the orders table. Please make sure to backup the data for future data analysis and auditing.
I want to do a simple query : retrieve the customers from US that bought product A. Given that 3million people bought product A and we have 1 million customers in the US; How could you make that reasonably performant? (Our current DB would execute that in few milliseconds)
Again this is why it makes sense to keep the customers, products, and orders tables in the same DB as this query can more easily be made to execute quickly when they are on the same DB. You can take advantage of your DB's designing and optimization tools, and EXPLAIN/DESCRIBE output to tweak your tables indexes and such. If you are using Mysql you can change DB engines (I recommend TokuDB DB engine).
In the end my main suggestion is to leave in one DB for OLTP as you will get more efficiency and performance for the same amount of hardware. Splitting the DB into multiple DB's will have an overhead cost for your code, architecture, network, and CPU's. The important thing is that your DB can scale horizontally and is finely tuned for the queries being done on it. Move OLAP to its own DB. This can be done using ETL to move data from OLTP DB to OLAP DB. The query in your example sounds like something that would be done in an OLAP DB. For your OLAP database you can use a columnar DB, like Vertica or something equivalent that can easily scale horizontally. The important thing to note is that by splitting up your OLAP and OLTP you can tune and configure each for their respective purpose.
Whether you run your customer, orders, and products services as a monolith (my recommendation) or as microservices the DB design should not change. What will cause your queries in your code to change is if you split the OLTP DB into multiple DB's because now you can not do simple JOINs or stored procedures.
This is what Martin Fowler calls the Monolith First. http://martinfowler.com/bliki/MonolithFirst.html
Related
We have a need to know which database architecture makes more sense to use and why.
We have a list of customers who are all going to use the same table structure (with very few exceptions).
We would have about 10 thousand customers who might all have all about 50 thousand products each.
The processing on products may not be the same for each customer and we would also want to provide a plan where customers could have API access to their data.
Our customers do sell products and their SQL table structure would all have columns such as :
Feed_ID
Product_ID
Product_Description
Price
Weight
etc...
The Feed_ID is used to differentiate the origin of these products and will be unique for each customer - of course.
The 3 choices of relational table structure that we have thought about:
Each customer has its own database and in that database, he has 1 table per product-feed
All customers are hosted under 1 unique database under which all customers all have 1 table per feed - in that case, 1 customer can have 2 tables if he as 2 different product feed.
All customers are hosted under 1 unique database, HOWEVER, in this 3rd solution, we only have 1 unique table that host all products feed of all customers.
Which solution would you use and why you think that the solution you selected is better?
Thank you.
You haven't quite provided enough information. Under almost all circumstances (see below for exceptions), you want one set of tables for all customers. Here are some reasons:
Performance. A proliferation of tables means the data is spread through more data pages, so you have lots of partially filled data pages. The database is bigger and processing is slower.
Coding efficiency. If the tables for a customer all have different names, then all the code is dynamic SQL. That is harder to maintain.
Maintenance. Adding a column or index is very arduous when there are zillions of similar tables.
Analytics. When similar data is spread through tables, it is really hard to answer questions such as "Which client has the most products?".
Security. Granting access permissions on a single set of tables is less error prone than on zillions of tables.
And no doubt, I've missed a few reasons. You can see that it is almost a no-brainer to have a single database with a small number of tables.
There are situations where separate databases might be called for. I cannot think of a good reason to have separate tables for each client in a single database.
The number one reason would be security and isolation. There might be a business or even legal reason for storing data into "physically" separate databases, to further minimize the possibility of one client seeing another client's data (accidentally or through hacking).
Another reason would be if clients had bespoke solutions. That is, there are per-client customizations. I would still be inclined to try to put this into a single database solution, but that might not be possible.
Related to this would be an application that you intend to support both in the cloud and on premises. In that case, separate databases per client would probably simplify the application design.
But, in general, you would store the data in a pretty normalized single database, with one table per entity.
I think having separate tables (or ideally schemas) for each customer is not that bad idea. In addition to benefits you mentioned, this way you can scale your database easily, and you can give customers full control of their data if they want to.
Regarding the drawbacks:
Managing it is more complicated but not as bad either - you can write
a script to create columns/tables/indexes/etc. You
don't have to do it manually.
It will be a challenge to perform analytics on 10K tables,
although it's not the best idea to mix it with production anyway.
I'd create a separate database (or server) for analytics, running
some overnight job to update reporting tables.
Also, if your table is going to have hundreds of millions rows (10Kx50k?), it's a good idea to split it into smaller pieces, regardless which option you'll choose. If not by customer, then by region or some other bigger group (assuming you are building on premises RDBMS)
I've created a database to track computers at my company. The goal is for the data to be automatically updated nightly and any changes tracked in a history table. I created a temporal table and everything seems to work fine. However, I'd like to exempt the column that contains the lastLogon from AD for each computer account. History of the data is irrelevant, it would result in many unnecessary updates to the history table and I'm concerned it would grow too quickly. Is there any way to do something like "Update the history table on changes to any column EXCEPT m_lastLogon"?
The only way you will be able to do this is to store the m_lastlogon information in a separate, non-temporal table. However, you are losing some potentially valuable logging information that way, especially for usage patterns and possible accidental damage tracking. You may choose to have a simple login log table correlated to the hardware, so that only the login information is tracked, reducing the unnecessary multiple recording of the rest of the information.
According to a comment made by Borko Novakovid (a Program Manager in the SQL Server team), you cannot exclude columns.
His comment was
Currently we do not support filtering out changes that occur on
columns one is not interested to track in DW schema (I guess that was
the question). We are aware that some people need this capability, but
modifying ETL logic to exclude these updates is also viable option...
Here's the link to the webpage
https://channel9.msdn.com/Shows/Data-Exposed/Temporal-in-SQL-Server-2016
I'm working on a Compact Framework app running on Windows Mobile. It's to be used by delivery drivers to tell them their next job and track spending etc. I have a SQL CE database on the mobile devices and SQL Server on the server. After struggling with major performance and configuration problems with the Sync Framework I ended up writing my own sync code using WCF. This works well and is a lot faster than the Sync Framework but I've been asked to speed it up further. Now we get into the details of the problem. Hopefully I can explain this clearly.
The synchronisation works one table at a time and is only one-way. Updates are sent from the server to the PDA only. Data travelling back to the server is handled a completely different way. First of all I delete any records on the PDA that have been removed from the server. Because of database constraints I have to delete from 'child' tables before deleting from 'parent' tables so I work up the heirachy from the bottom. E.G. I delete records from the invoice table before deleting from the products table.
Next I add new records to the PDA that have been added on the server. In this case I have to update the parent tables first and work down the heirachy and update child tables later.
The problem is that my boss doesn't like the fact that my app will keep a large table like the products table synchronised with the server when the delivery driver only needs the
invoiceProduct table. The invoiceProduct table links the invoice and products table together and contains some information about the product. I mean that their database design is not normalised and the product name has been duplicated and stored in the invoiceProduct table as well as the product table. Of course we all know this is poor design but it seems they have done this to improve performance in this type of situations.
The obvious solution is to just remove the products table completely from the PDA database. However I can't do this because it is sometimes needed. Drivers have the ability to add a new product to an invoice on the fly. My boss suggests that they could just synchronise the large products table occasionally or when they try to add a product and find that it's not there.
This won't work with the current design bacause if an invoice is downloaded containing a new product that is not on the PDA it will throw a database foreign key error.
Sorry about posting such a large message. Hopefully it makes sense. I don't want to remove my database constraints and mess up my nice data structure :(
You seems to be running into some architecture problem. I work on a product that somewhat has a similar situation. I had a client-server application where the client loaded too much data that isn't needed.
We used ADO.NET (Dataset) to reflect what the database has on the client side. The Dataset class is like a in memory CE SQL Server.
Our company starts having bigger clients and our architecture isn't fast enough to handle all the data.
In the past, we did the following. These are no fast solution:
Remove the "most" of the constraints
on the client side
all the frequently used data still have constraint in the
dataset.
Create logic to load a subset of data, instead of loading everything to the client. For example, we only load 7 days of works data, instead of every work data (which is what we did in the past).
Denormalized certain data by adding new columns, so that we don't have to load extra data we don't need
Certain data is only loaded when it is needed based on the client modules.
As long as you keep your database constraint on the SQL Server, you should have no data integrity issue. However, on your PDA side, you will need to more testing to ensure your application runs properly.
This isn't an easy problem to solve when you already have an existing architecture. Hopefully these suggestions help you.
Add a created_on field for your products and keep track of when the last time each pda synced. When the invoice is downloaded, check if the product is newer than the last sync and if its re-sync the pda. Does not seem like it would screw up the DB too much?
SQL Server 2008 database design problem.
I'm defining the architecture for a service where site users would manage a large volume of data on multiple websites that they own (100MB average, 1GB maximum per site). I am considering whether to split the databases up such that the core site management tables (users, payments, contact details, login details, products etc) are held in one database, and the database relating to the customer's own websites is held in a separate database.
I am seeing a possible gain in that I can distribute the hardware architecture to provide more meat to the heavy lifting done in the websites database leaving the site management database in a more appropriate area. But I'm also conscious of losing the ability to directly relate the sites to the customers through a Foreign key (as far as I know this can't be done cross database?).
So, the question is two fold - in general terms should data in this sort of scenario be split out into multiple databases, or should it all be held in a single database?
If it is split into multiple, is there a recommended way to protect the integrity and security of the system at the database layer to ensure that there is a strong relationship between the two?
Thanks for your help.
This question and thus my answer may be close to the gray line of subjective, but at the least I think it would be common practice to separate out the 'admin' tables into their own db for what it sounds like you're doing. If you can tie a client to a specific server and db instance then by having separate db instances, it opens up some easy paths for adding servers to add clients. A single db would require you to monkey with various clustering approaches if you got too big.
[edit]Building in the idea early that each client gets it's own DB also just sets the tone for how you develop when it is easy to make structural and organizational changes. Discovering 2 yrs from now you need to do it will become a lot more painful. I've worked with split dbs plenty of times in the past and it really isn't hard to deal with as long as you can establish some idea of what the context is. Here it sounds like you already have the idea that the client is the context.
Just my two cents, like I said, you could be close to subjective on this one.
Single Database Pros
One database to maintain. One database to rule them all, and in the darkness - bind them...
One connection string
Can use Clustering
Separate Database per Customer Pros
Support for customization on per customer basis
Security: No chance of customers seeing each others data
Conclusion
The separate database approach would be valid if you plan to support per customer customization. I don't see the value if otherwise.
You can use link to connect the databases.
Your architecture is smart.
If you can't use a link, you can always replicate critical data to the website database from the users database in a read only mode.
concerning security - The best way is to have a service layer between ASP (or other web lang) and the database - so your databases will be pretty much isolated.
If you expect to have to split the databases across different hardware in the future because of heavy load, I'd say split it now. You can use replication to push copies of some of the tables from the main database to the site management databases. For now, you can run both databases on the same instance of SQL Server and later on, when you need to, you can move some of the databases to a separate machine as your volume grows.
Imagine we have infinitely fast computers, would you split your databases? Of course not. The only reason why we split them is to make it easy for us to scale out at some point. You don't really have any choice here, 100MB-1000MB per client is huge.
We have a SQL server that has a database for each client, and we have hundreds of clients. So imagine the following: database001, database002, database003, ..., database999. We want to combine all of these databases into one database.
Our thoughts are to add a siteId column, 001, 002, 003, ..., 999.
We are exploring options to make this transition as smoothly as possible. And we would LOVE to hear any ideas you have. It's proving to be a VERY challenging problem.
I've heard of a technique that would create a view that would match and then filter.
Any ideas guys?
Create a client database id for each of the client databases. You will use this id to keep the data logically separated. This is the "site id" concept, but you can use a derived key (identity field) instead of manually creating these numbers. Create a table that has database name and id, with any other metadata you need.
The next step would be to create an SSIS package that gets the ID for the database in question and adds it to the tables that have to have their data separated out logically. You then can run that same package over each database with the lookup for ID for the database in question.
After you have a unique id for the data that is unique, and have imported the data, you will have to alter your apps to fit the new schema (actually before, or you are pretty much screwed).
If you want to do this in steps, you can create views or functions in the different "databases" so the old client can still hit the client's data, even though it has been moved. This step may not be necessary if you deploy with some downtime.
The method I propose is fairly flexible and can be applied to one client at a time, depending on your client application deployment methodology.
Why do you want to do that?
You can read about Multi-Tenant Data Architecture and also listen to SO #19 (around 40-50 min) about this design.
The "site-id" solution is what's done.
Another possibility that may not work out as well (but is still appealing) is multiple schemas within a single database. You can pull common tables into a "common" schema, and leave the customer-specific stuff in customer-specific schema. In some database products, however, the each schema is -- effectively -- a separate database. In other products (Oracle, DB2, for example) you can easily write queries that work in multiple schemas.
Also note that -- as an optimization -- you may not need to add siteId column to EVERY table.
Sometimes you have a "contains" relationship. It's a master-detail FK, often defined with a cascade delete so that detail cannot exist without the parent. In this case, the children don't need siteId because they don't have an independent existence.
Your first step will be to determine if these databases even have the same structure. Even if you think they do, you need to compare them to make sure they do. Chances are there will be some that are customized or missed an upgrade cycle or two.
Now depending on the number of clients and the number of records per client, your tables may get huge. Are you sure this will not create a performance problem? At any rate you may need to take a fresh look at indexing. You may need a much more powerful set of servers and may also need to partion by client anyway for performance.
Next, yes each table will need a site id of some sort. Further, depending on your design, you may have primary keys that are now no longer unique. You may need to redefine all primary keys to include the siteid. Always index this field when you add it.
Now all your queries, stored procs, views, udfs will need to be rewritten to ensure that the siteid is part of them. PAy particular attention to any dynamic SQL. Otherwise you could be showing client A's information to client B. Clients don't tend to like that. We brought a client from a separate database into the main application one time (when they decided they didn't still want to pay for a separate server). The developer missed just one place where client_id had to be added. Unfortunately, that sent emails to every client concerning this client's proprietary information and to make matters worse, it was a nightly process that ran in the middle of the night, so it wasn't known about until the next day. (the developer was very lucky not to get fired.) The point is be very very careful when you do this and test, test, test, and test some more. Make sure to test all automated behind the scenes stuff as well as the UI stuff.
what I was explaining in Florence towards the end of last year is if you had to keep the database names and the logical layer of the database the same for the application. In that case you'd do the following:
Collapse all the data into consolidated tables into one master, consolidated database (hereafter referred to as the consolidated DB).
Those tables would have to have an identifier like SiteID.
Create the new databases with the existing names.
Create views with the old table names which use row-level security to query the tables in the consolidated DB, but using the SiteID to filter.
Set up the databases for cross-database ownership chaining so that the service accounts can't "accidentally" query the base tables in the consolidated DB. Access must happen through the views or through stored procedures and other constructs that will enforce row-level security. Now, if it's the same service account for all sites, you can avoid the cross DB ownership chaining and assign the rights on the objects in the consolidated DB.
Rewrite the stored procedures to either handle the change (since they are now referring to views and they don't know to hit the base tables and include SiteID) or use InsteadOf Triggers on the views to intercept update requests and put the appropriate site specific information into the base tables.
If the data is large you could look at using a partioned view. This would simplify your access code as all you'd have to maintain is the view; however, if the data is not large, just add a column to identify the customer.
Depending on what the data is and your security requirements the threat of cross contamination may be a show stopper.
Assuming you have considered this and deem it "safe enough". You may need/want to create VIEWS or impose some other access control to prevent customers from seeing each-other's data.
IIRC a product called "Trusted Oracle" had the ability to partition data based on such a key (about the time Oracle 7 or 8 was out). The idea was that any given query would automagically have "and sourceKey = #userSecurityKey" (or some such) appended. The feature may have been rolled into later versions of the popular commercial product.
To expand on Gregory's answer, you can also make a parent ssis that calls the package doing the actual moving within a foreach loop container.
The parent package queries a config table and puts this in an object variable. The foreach loop then uses this recordset to pass variables to the package, such as your database name and any other details the package might need.
You table could list all of your client databases and have a flag to mark when you are ready to move them. This way you are not sitting around running the ssis package on 32,767 databases. I'm hooked on the foreach loop in ssis.