Can we choose to shard only a few objects, typically those that can hold a large amount of data, and leave the others stored on all instances?
For example, in a banking application I might want to store the customers and transactions in different shards and the list of bank branches in all databases. Is this possible?
You can do something like that, but not in a single database.
The way it works, you have two databases on each server (and two document stores)
1) Sharded, for Transactions & Customers
2) Standard, for Banks.
The banks database is replicated to all nodes.
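To make the split concrete, here is a minimal in-memory sketch of the routing idea, not any particular product's API. The `Node` and `Cluster` classes, the CRC32-based shard choice, and the node names are all assumptions made for illustration.

```python
import zlib


class Node:
    """One server holding a sharded store and a replicated store."""

    def __init__(self, name):
        self.name = name
        self.sharded = {}     # transactions and customers owned by this node only
        self.replicated = {}  # full copy of the bank branch list


class Cluster:
    def __init__(self, node_names):
        self.nodes = [Node(name) for name in node_names]

    def shard_for(self, customer_id):
        # Stable hash, so the same customer always maps to the same node.
        index = zlib.crc32(customer_id.encode("utf-8")) % len(self.nodes)
        return self.nodes[index]

    def store_transaction(self, customer_id, txn_id, payload):
        # Sharded write: only the owning node receives the document.
        self.shard_for(customer_id).sharded[txn_id] = payload

    def store_branch(self, branch_id, payload):
        # Replicated write: every node receives a copy.
        for node in self.nodes:
            node.replicated[branch_id] = payload


cluster = Cluster(["node-a", "node-b", "node-c"])
cluster.store_transaction("customer-42", "txn-1", {"amount": 100})
cluster.store_branch("branch-7", {"city": "Springfield"})
```

Reads for a customer go to one node, while the branch list can be read locally from whichever node handles the request.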
I have a table design that is represented by this awesome hand drawn image.
Basically, I have an account event, which can be either a Transaction (Payment to or from a third party) or a Transfer (transfer between accounts held by the user).
All common data is held in the event table (Date, CreatedBy, Source Account Id...) and then if it's a transaction, then transaction specific data is held in the Account Transaction table (Third Party, transaction type (Debit, Credit)...). If the event is a transfer, then transfer specific data is in the account_transfer table (Amount, destination account id...).
Note, something I forgot to draw, is that the Event table has an event_type_id. If event_type_id = 1, then it's a transaction. If it's a 2, then it's a Transfer.
Both the transfer and transaction tables are linked to the event table via an event id foreign key.
Note though that a transaction doesn't have an amount, as the transaction can be split into multiple payment lines, so it has a child account_transaction_line. To get the amount of the transaction, you sum its child lines.
Foreign keys are all set up, with an index on primary keys...
My question is about design and querying. If I want to list all events for a specific account, I can either:
1. Select
   from Event,
   where event_type = 1 (transaction),
   then INNER join to the Transaction table,
   and INNER join to the transaction line (to sum the total)...
   and then UNION to another selection,
   selecting
   from Event,
   where event_type = 2 (transfer),
   INNER join to transfer table...
   and producing a list of all events.
or
2. Select
   from Event,
   then LEFT join to transaction,
   then LEFT join to transaction line,
   then LEFT join to transfer ...
   and sum up totals (because of the transaction lines).
Which is more efficient? I think option 1 is best, as it avoids the LEFT joins (Scans?)
OR...
An Indexed View of option 1?
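For reference, the two options can be written out as actual queries. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for SQL Server, and the table and column names are guesses based on the description, not the real schema. It shows only that both forms return the same rows; it says nothing about which one is faster on a real server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE event (
    event_id      INTEGER PRIMARY KEY,
    account_id    INTEGER NOT NULL,
    event_type_id INTEGER NOT NULL   -- 1 = transaction, 2 = transfer
);
CREATE TABLE account_transaction (
    transaction_id INTEGER PRIMARY KEY,
    event_id       INTEGER NOT NULL REFERENCES event(event_id)
);
CREATE TABLE account_transaction_line (
    line_id        INTEGER PRIMARY KEY,
    transaction_id INTEGER NOT NULL REFERENCES account_transaction(transaction_id),
    amount         INTEGER NOT NULL
);
CREATE TABLE account_transfer (
    transfer_id INTEGER PRIMARY KEY,
    event_id    INTEGER NOT NULL REFERENCES event(event_id),
    amount      INTEGER NOT NULL
);
INSERT INTO event VALUES (1, 10, 1), (2, 10, 2), (3, 11, 1);
INSERT INTO account_transaction VALUES (1, 1), (2, 3);
INSERT INTO account_transaction_line VALUES (1, 1, 30), (2, 1, 20), (3, 2, 5);
INSERT INTO account_transfer VALUES (1, 2, 75);
""")

# Option 1: INNER joins per event type, combined with UNION ALL
# (the two branches cannot overlap, so no duplicate removal is needed).
OPTION_1 = """
SELECT e.event_id, 'transaction' AS kind, SUM(l.amount) AS amount
FROM event e
INNER JOIN account_transaction t ON t.event_id = e.event_id
INNER JOIN account_transaction_line l ON l.transaction_id = t.transaction_id
WHERE e.account_id = :account AND e.event_type_id = 1
GROUP BY e.event_id
UNION ALL
SELECT e.event_id, 'transfer', x.amount
FROM event e
INNER JOIN account_transfer x ON x.event_id = e.event_id
WHERE e.account_id = :account AND e.event_type_id = 2
ORDER BY 1
"""

# Option 2: one pass over event with LEFT joins; the transfer amount is used
# when there are no transaction lines to sum.
OPTION_2 = """
SELECT e.event_id,
       CASE e.event_type_id WHEN 1 THEN 'transaction' ELSE 'transfer' END AS kind,
       COALESCE(SUM(l.amount), x.amount) AS amount
FROM event e
LEFT JOIN account_transaction t ON t.event_id = e.event_id
LEFT JOIN account_transaction_line l ON l.transaction_id = t.transaction_id
LEFT JOIN account_transfer x ON x.event_id = e.event_id
WHERE e.account_id = :account
GROUP BY e.event_id, e.event_type_id, x.amount
ORDER BY e.event_id
"""

union_rows = conn.execute(OPTION_1, {"account": 10}).fetchall()
left_join_rows = conn.execute(OPTION_2, {"account": 10}).fetchall()
```

Comparing the actual execution plans of both forms on the real data is the only reliable way to decide between them.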
On performance
For performance analysis in SQL server, there are quite a few factors at play, e.g.
What is the number of queries you are going to run, esp. on the same data? For example, if 80% of your queries are around 20% of your data, then caching may help significantly. (See below the design section on how this can matter)
Are your databases distributed or collocated on the same server? I assume it's a single server system, but if they were distributed, the design and optimization might vary.
Are these queries executed in a background process, or on demand while a user waits for the results?
Without these (and perhaps some other follow up questions once answers to these are provided), it would be unwise to give an answer stating one being preferable over the other.
Having said that, based on my personal experience, your best bet specifically for SQL server is to use query analyzer, which is actually pretty reasonable, as your first stop. After that, you can do some performance analyses to find the optimal solution. Typically, these are done by modeling the query traffic as it would be when the system is under regular load. (FYI: The modeling link is to ASP.NET performance modeling, but various core concepts apply to SQL as well.) You typically put the system under load and then:
Look at how many connections are lost -- this can increase if the queries are expensive.
Performance counters on the server(s) to see how the system is dealing with the load.
Responses from the queries to see if some start failing to provide a valid response, although this is unlikely to happen
FYI: This is based on my personal experience, after having done various types of performance analyses for multiple projects. We expect to do it again for our current project, although this time around we're using AD and Azure tables instead of SQL, and hence the methodology is not specific to SQL server, although the tools, traffic profiles, and what to measure vary.
On design
Introducing event id in the account transaction line:
Although you do not explicitly state it, it seems that the event ID and transaction ID are not going to change after the first entry has been made. If that's the case and you are only interested in getting the totals for a transaction in this query, then another option (which will optimize your queries) would be to add to the account transaction line table a foreign key to AccountEvent's primary key (which I think is the event id). In the strictest DB sense, you are de-normalizing the table a bit, but in practice, it often helps with performance.
Computing totals on inserts:
The other approach that I have taken in a past project (just because I was using FoxPro in the previous century and FoxPro tended to be extremely slow at joins) was to keep total amounts in the primary table, the equivalent of your transactions table. This would be quite useful if your reads heavily outweighed your writes, and in the case of SQL, you can issue a transaction to make entries in other tables and update totals simultaneously (hence my question about your query profiles).
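A minimal sketch of keeping the total on the parent row, again using SQLite via Python's sqlite3 as a stand-in for SQL Server. The `total_amount` column and the `add_line` helper are hypothetical names introduced for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account_transaction (
    transaction_id INTEGER PRIMARY KEY,
    total_amount   INTEGER NOT NULL DEFAULT 0   -- maintained on every line insert
);
CREATE TABLE account_transaction_line (
    line_id        INTEGER PRIMARY KEY,
    transaction_id INTEGER NOT NULL REFERENCES account_transaction(transaction_id),
    amount         INTEGER NOT NULL
);
""")


def add_line(conn, transaction_id, amount):
    # "with conn" commits both statements together, or rolls both back on error,
    # so the stored total cannot drift from the lines.
    with conn:
        conn.execute(
            "INSERT INTO account_transaction_line (transaction_id, amount) VALUES (?, ?)",
            (transaction_id, amount),
        )
        conn.execute(
            "UPDATE account_transaction SET total_amount = total_amount + ? "
            "WHERE transaction_id = ?",
            (amount, transaction_id),
        )


with conn:
    conn.execute("INSERT INTO account_transaction (transaction_id) VALUES (1)")
add_line(conn, 1, 30)
add_line(conn, 1, 20)

# Reading the total no longer needs a join or a SUM over the lines.
total = conn.execute(
    "SELECT total_amount FROM account_transaction WHERE transaction_id = 1"
).fetchone()[0]
```

The cost moves from every read to every write, which is why this pays off mainly when reads heavily outweigh writes.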
Join transaction & transfers tables:
Keep a value to indicate which is which, and keep the totals there -- similar to previous one but at a different level. This will decrease the joins on query, but still have sum of totals on inserts -- I would prefer the previous over this one.
De-normalize completely:
This is yet another approach that folks have used (esp. in the NoSQL space), but it gives me shivers when applied in SQL Server, so I have a personal bias against it. You could very well search for it and read up on it.
Forgive me if this is a silly question (I'm new to databases and SQL), but is it possible to lock a table, similar to the lock keyword in C#, so that I can query the database to see if a condition is met, then insert a row afterwards while ensuring the state of the table has not changed between the two actions?
In this case, I have a table transactions which has two columns: user and product. This is a many-to-one relationship; multiple users can have the same product. However, the number of products is limited.
When a user adds a product to their account, I want to first check whether the total number of items with the same product value is under a certain threshold, then add the transaction afterwards. However, since this is a multithreaded application, multiple transactions can come in at the same time. I want to make sure that one of these is rejected, and one succeeds, such that the number of transactions with the same product value can never be higher than the limit.
Rough pseudo-code for what I am trying to do:
my_user, my_product = ....
my_product_count = 0
for each transaction in transactions:
    if transaction.product == my_product:
        my_product_count += 1
if my_product_count < LIMIT:
    insert my_user, my_product into transactions
    return SUCCESS
else:
    return FAILURE
I am using SQLAlchemy with SQLite3, if that matters.
An explicit lock is not needed if you do both operations in a transaction, which databases support. Databases maintain locks to guarantee transactional integrity. In fact, that is one of the four pillars of what a database does: the ACID guarantees (Atomicity, Consistency, Isolation, Durability).
So, in your case, to ensure consistency you would perform both operations in one transaction and set the transaction parameters in such a way as to block reads on the rows already read.
SQL locking is WAY more powerful than the lock statement because, among other things, databases by definition have multiple threads (users) hitting the same data, something that is exceedingly rare in programming (where shared data access is avoided in multithreaded programming as much as possible).
I suggest a good book about SQL - because you need to simply LEARN some fundamental concepts at one point, or you will make mistakes that cost money.
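As a sketch of the check-then-insert transaction described above, here is one way it could look with Python's built-in sqlite3 module (the question uses SQLAlchemy, so treat this as an illustration of the idea only). The `add_product` helper and the `LIMIT` value are hypothetical. `BEGIN IMMEDIATE` is an SQLite-specific choice: it takes the write lock before the count is read, so two writers cannot both pass the check.

```python
import sqlite3

LIMIT = 2  # hypothetical cap on rows per product


def add_product(conn, user, product, limit=LIMIT):
    # Acquire SQLite's write lock up front; a concurrent writer waits or fails here.
    conn.execute("BEGIN IMMEDIATE")
    try:
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM transactions WHERE product = ?", (product,)
        ).fetchone()
        if count >= limit:
            conn.execute("ROLLBACK")
            return False
        conn.execute(
            "INSERT INTO transactions (user, product) VALUES (?, ?)", (user, product)
        )
        conn.execute("COMMIT")
        return True
    except Exception:
        # Undo any partial work before re-raising.
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise


# isolation_level=None puts the connection in autocommit mode, so the explicit
# BEGIN/COMMIT statements above control the transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE transactions (user TEXT NOT NULL, product TEXT NOT NULL)")
results = [add_product(conn, name, "widget") for name in ("alice", "bob", "carol")]
```

With a limit of 2, the third attempt is rejected. Other databases express the same intent differently, for example through a stricter isolation level.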
Transactions allow you to use multiple SQL statements atomically.
(SQLite implements transactions by locking the entire database, but the exact mechanism doesn't matter, and you might want to use another database later anyway.)
However, you don't even need to bother with explicit transactions if your desired algorithm can be handled with a single SQL statement, like this:
INSERT INTO transactions(user, product)
SELECT @my_user, @my_product
WHERE (SELECT COUNT(*)
       FROM transactions
       WHERE product = @my_product) < @LIMIT;
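Called from Python, the single-statement version might look like the sketch below. It uses the standard sqlite3 module directly; the `try_add` helper and the `:name` parameter style are choices made for this example. The cursor's `rowcount` tells the caller whether the row was actually inserted.

```python
import sqlite3


def try_add(conn, user, product, limit):
    # One statement does both the check and the insert, so no other writer
    # can slip in between them.
    with conn:
        cur = conn.execute(
            """
            INSERT INTO transactions(user, product)
            SELECT :user, :product
            WHERE (SELECT COUNT(*)
                   FROM transactions
                   WHERE product = :product) < :limit
            """,
            {"user": user, "product": product, "limit": limit},
        )
    # rowcount is 1 when the row went in and 0 when the limit blocked it.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user TEXT NOT NULL, product TEXT NOT NULL)")
outcomes = [try_add(conn, name, "widget", limit=2) for name in ("alice", "bob", "carol")]
```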
Suppose you have the following tables: Orders, Customers, Events, Lines, and LineAssignments. The only table that I can modify is the LineAssignments table.
Event 1 <---> * Orders
Customer 1 <---> * Orders
Order 1 <---> * LineAssignments
Line 1 <---> * LineAssignments
Different pages display different combinations of info with the line assignments. For example, on some pages I only display the event info with the line assignments, while on other pages I display the order info with them, etc.
Basically, whenever I add a new line assignment, should I also store the EventID, CustomerID, and OrderID, or should I store only the OrderID and do multiple joins to get the other data? Would it be better to create a view that joins these tables?
I tend to follow the school of thought that data should only be represented once in a database. This means, in your place, I would attempt to get what I need from multiple joins and only store OrderID.
The reason why I would do this is if there's any chance that the data stored in the other tables (the data you copied over to the LineAssignments table) is updated, the copied data would be wrong. I don't see it being super likely that the data in the other tables would change, but in the off-chance that it does... You'd be better off with the joins than potentially incorrect data.
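Here is a small sketch of the "store only OrderID and join" approach wrapped in a view, using SQLite through Python's sqlite3 module. The view name `LineAssignmentDetails` and the `Name` and `Label` columns are assumptions; only the table names and relationships come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Events    (EventID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Lines     (LineID INTEGER PRIMARY KEY, Label TEXT);
CREATE TABLE Orders (
    OrderID    INTEGER PRIMARY KEY,
    EventID    INTEGER REFERENCES Events(EventID),
    CustomerID INTEGER REFERENCES Customers(CustomerID)
);
-- Only OrderID and LineID are stored; nothing is copied from other tables.
CREATE TABLE LineAssignments (
    LineAssignmentID INTEGER PRIMARY KEY,
    OrderID          INTEGER REFERENCES Orders(OrderID),
    LineID           INTEGER REFERENCES Lines(LineID)
);
CREATE VIEW LineAssignmentDetails AS
SELECT la.LineAssignmentID, la.LineID, o.OrderID,
       e.EventID, e.Name AS EventName,
       c.CustomerID, c.Name AS CustomerName
FROM LineAssignments la
JOIN Orders o    ON o.OrderID = la.OrderID
JOIN Events e    ON e.EventID = o.EventID
JOIN Customers c ON c.CustomerID = o.CustomerID;

INSERT INTO Events VALUES (1, 'Spring Gala');
INSERT INTO Customers VALUES (1, 'Acme');
INSERT INTO Lines VALUES (1, 'Line A');
INSERT INTO Orders VALUES (1, 1, 1);
INSERT INTO LineAssignments VALUES (1, 1, 1);
""")

# A later change in a parent table shows up in the view automatically,
# because the view reads the current data instead of a stored copy.
conn.execute("UPDATE Customers SET Name = 'Acme Ltd' WHERE CustomerID = 1")
row = conn.execute(
    "SELECT EventName, CustomerName FROM LineAssignmentDetails WHERE LineAssignmentID = 1"
).fetchone()
```

Each page can then select just the columns it needs from the view.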
It is simply a question of performance. Generally, you should stick to the 3NF, i.e. no redundancy. Whereas this gives very tight and elegant data structures, it might also lead to heavy performance issues.
This is usually the case if your database is both for productive and historical data, i.e. grows over time.
When issuing the joined queries, your RDBMS will load as much information as possible into memory, usually index information to speed up your query. Now, if your indexes are so big that they don't fit into memory, your RDBMS (no, the OS in fact) will have to swap, which is a performance killer.
The real deal (in my eyes) is to completely separate productive data (open / unpaid orders for example) from historic data. The historic data can and should be optimized for fast retrieval as nothing changes anymore and hard discs are cheap.
Productive data should be nice and tight (3NF). Whenever a piece of information is not productive anymore (order is paid, parts are delivered etc.) it will be removed from the productive database and transferred to the historical data.
Get information on the topic 'data warehouse' in case you're not yet familiar with it and read about the concepts. It's quite easy to understand.
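The move from productive to historical data can be sketched as below. For brevity this uses two tables in one SQLite database rather than two separate databases, and the `orders` and `orders_history` schemas and the `archive_paid_orders` helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER, status TEXT
);
CREATE TABLE orders_history (
    order_id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER
);
INSERT INTO orders VALUES
    (1, 'Acme', 100, 'paid'), (2, 'Acme', 250, 'open'), (3, 'Globex', 75, 'paid');
""")


def archive_paid_orders(conn):
    # Copy and delete inside one transaction, so an order is never lost
    # and never present in both tables.
    with conn:
        conn.execute(
            "INSERT INTO orders_history (order_id, customer, amount) "
            "SELECT order_id, customer, amount FROM orders WHERE status = 'paid'"
        )
        moved = conn.execute("DELETE FROM orders WHERE status = 'paid'").rowcount
    return moved


moved = archive_paid_orders(conn)
remaining = conn.execute("SELECT order_id FROM orders").fetchall()
```

After the run only the open order is left in the productive table, which keeps it and its indexes small.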
I'm designing my DB for functionality and performance for realtime AJAX web applications, and I don't currently have the resources to add DB server redundancy or load-balancing.
Unfortunately, I have a table in my DB that could potentially end up storing hundreds of millions of rows, and will need to read and write quickly to prevent lagging the web-interface.
Most, if not all, of the columns in this table are individually indexed, and I'd love to know if there are other ways to ease the burden on the server when running queries on large tables. But is there eventually a cap for the size (in rows or GB) of a table before a single unclustered SQL server starts to choke?
My DB only has a dozen tables, with maybe a couple dozen foreign key relationships. None of my tables have more than 8 or so columns, and only one or two of these tables will end up storing a large number of rows. Hopefully the simplicity of my DB will make up for the massive amounts of data in these couple tables ...
Rows are limited strictly by the amount of disk space you have available. We have SQL Servers with hundreds of millions of rows of data in them. Of course, those servers are rather large.
In order to keep the web interface snappy you will need to think about how you access that data.
One example is to stay away from any type of aggregate queries which require processing large swaths of data. Things like SUM() can be a killer depending on how much data it's trying to process. In these situations you are much better off calculating any summary or grouped data ahead of time and letting your site query these analytic tables.
Next you'll need to partition the data. Split those partitions across different drive arrays. When SQL needs to go to disk it makes it easier to parallelize the reads. (@Simon touched on this).
Basically, the problem boils down to how much data you need to access at any one time. This is the main problem regardless of the amount of data you have on disk. Even small databases can be choked if the drives are slow and the amount of available RAM in the DB server isn't enough to keep enough of the DB in memory.
Usually for systems like this large amounts of data are basically inert, meaning that it's rarely accessed. For example, a PO system might maintain a history of all invoices ever created, but they really only deal with any active ones.
If your system has similar requirements, then you might have a table that is for active records and simply archive them to another table as part of a nightly process. You could even have statistics like monthly averages (as an example) recomputed as part of that archival.
Just some thoughts.
The only limit is the size of your primary key. Is it an INT or a BIGINT?
SQL will happily store the data without a problem. However, with 100 million rows, you're best off partitioning the data. There are many good articles on this.
With partitions, you can have 1 thread per partition working at the same time to parallelise the query even more than is possible without partitioning.
My gut tells me that you will probably be okay, but you'll have to deal with performance. It's going to depend on the acceptable time-to-retrieve results from queries.
For your table with the "hundreds of millions of rows", what percentage of the data is accessed regularly? Is some of the data rarely accessed? Do some users access selected data and other users select different data? You may benefit from data partitioning.
I have a postgres database with several million rows, which drives a web app. The data is static: users don't write to it.
I would like to be able to offer users query-able aggregates (e.g. the sum of all rows with a certain foreign key value), but the size of the database now means it takes 10-15 minutes to calculate such aggregates.
Should I:
1. start pre-calculating aggregates in the database (since the data is static)
2. move away from postgres and use something else?
The only problem with 1. is that I don't necessarily know which aggregates users will want, and it will obviously increase the size of the database even further.
If there was a better solution than postgres for such problems, then I'd be very grateful for any suggestions.
You are trying to solve an OLAP (Online Analytical Processing) database structure problem with an OLTP (Online Transaction Processing) database structure.
You should build another set of tables that store just the aggregates and update these tables in the middle of the night. That way your customers can query the aggregate set of tables and it won't interfere with the online transaction processing system at all.
The only caveat is that the aggregate data will always be one day behind.
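A minimal sketch of such a nightly refresh follows, with SQLite via Python's sqlite3 standing in for Postgres. The `facts` and `category_totals` tables and the `refresh_totals` function are hypothetical; the SQL itself is plain enough to carry over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE facts (
    id INTEGER PRIMARY KEY, category_id INTEGER NOT NULL, amount INTEGER NOT NULL
);
CREATE TABLE category_totals (
    category_id INTEGER PRIMARY KEY, total INTEGER NOT NULL, row_count INTEGER NOT NULL
);
INSERT INTO facts (category_id, amount) VALUES (1, 10), (1, 15), (2, 7);
""")


def refresh_totals(conn):
    # Rebuild the summary in one transaction, so readers see either the old
    # totals or the new ones, never a half-filled table.
    with conn:
        conn.execute("DELETE FROM category_totals")
        conn.execute(
            "INSERT INTO category_totals (category_id, total, row_count) "
            "SELECT category_id, SUM(amount), COUNT(*) FROM facts GROUP BY category_id"
        )


refresh_totals(conn)  # in production this would run from a nightly scheduled job
totals = dict(conn.execute("SELECT category_id, total FROM category_totals"))
```

User-facing queries then read `category_totals` by primary key instead of scanning `facts`.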
1. Yes
2. Possibly. Presumably there are a whole heap of things you would need to consider before changing your RDBMS. If you moved to SQL Server, you would use Indexed views to accomplish this: Improving Performance with SQL Server 2008 Indexed Views
If you store the aggregates in an intermediate Object (something like MyAggragatedResult), you could consider a caching proxy:
class ResultsProxy {
    calculateResult(param1, param2) {
        // retrieve from cache
        // if not found, calculate and store in cache
    }
}
There are quite a few caching frameworks for Java, and most likely for other languages/environments such as .NET as well. These solutions can take care of invalidation (how long a result should be stored in memory) and memory management (removing old cache items when reaching a memory limit, etc.).
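For illustration, the proxy idea can be sketched in Python with the standard library's `functools.lru_cache`, which evicts the least recently used entries once `maxsize` is reached. The `ResultsProxy` constructor, the `calls` counter, and the `slow_sum` stand-in for an expensive aggregate query are assumptions made for this example. Time-based invalidation is not covered here.

```python
from functools import lru_cache


class ResultsProxy:
    """Wraps an expensive computation and serves repeated requests from memory."""

    def __init__(self, compute, maxsize=1024):
        self.calls = 0  # how many times the real computation ran

        def counted(*params):
            self.calls += 1
            return compute(*params)

        # lru_cache drops the least recently used entries beyond maxsize.
        self._cached = lru_cache(maxsize=maxsize)(counted)

    def calculate_result(self, *params):
        # Retrieve from cache; if not found, calculate and store in cache.
        return self._cached(*params)


def slow_sum(low, high):
    # Stand-in for a slow aggregate query against the database.
    return sum(range(low, high))


proxy = ResultsProxy(slow_sum)
first = proxy.calculate_result(0, 1000)
second = proxy.calculate_result(0, 1000)  # served from the cache
```

Since the data in the question is static, cached results never go stale, which makes this kind of cache particularly simple to reason about.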
If you have a set of commonly-queried aggregates, it might be best to create an aggregate table that is maintained by triggers (or an observer pattern tied to your OR/M).
Example: say you're writing an accounting system. You keep all the debits and credits in a General Ledger table (GL). Such a table can quickly accumulate tens of millions of rows in a busy organization. To find the balance of a particular account on the balance sheet as of a given day, you would normally have to calculate the sum of all debits and credits to that account up to that date, a calculation that could take several seconds even with a properly indexed table. Calculating all figures of a balance sheet could take minutes.
Instead, you could define an account_balance table. For each account and dates or date ranges of interest (usually each month's end), you maintain a balance figure by using a trigger on the GL table to update balances by adding each delta individually to all applicable balances. This spreads the cost of aggregating these figures over each individual persistence to the database, which will likely reduce it to a negligible performance hit when saving, and will decrease the cost of getting the data from a massive linear operation to a near-constant one.
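The trigger-maintained balance table could look roughly like the sketch below, written for SQLite through Python's sqlite3 module. The `gl` and `account_balance` column names, the signed-amount convention, and the month-end rows are assumptions. Only inserts are handled; updates and deletes on the ledger would need their own triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gl (
    entry_id   INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL,
    entry_date TEXT NOT NULL,     -- ISO 8601 text, so string comparison orders dates
    amount     INTEGER NOT NULL   -- positive for debits, negative for credits
);
CREATE TABLE account_balance (
    account_id INTEGER NOT NULL,
    period_end TEXT NOT NULL,
    balance    INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (account_id, period_end)
);
-- Add each new ledger entry to every balance dated on or after the entry.
CREATE TRIGGER gl_after_insert AFTER INSERT ON gl
BEGIN
    UPDATE account_balance
    SET balance = balance + NEW.amount
    WHERE account_id = NEW.account_id
      AND period_end >= NEW.entry_date;
END;

INSERT INTO account_balance (account_id, period_end)
VALUES (1, '2024-01-31'), (1, '2024-02-29');
INSERT INTO gl (account_id, entry_date, amount) VALUES (1, '2024-01-15', 500);
INSERT INTO gl (account_id, entry_date, amount) VALUES (1, '2024-02-10', -200);
""")

# Balance lookups are now a primary-key read instead of a SUM over the ledger.
balances = conn.execute(
    "SELECT period_end, balance FROM account_balance "
    "WHERE account_id = 1 ORDER BY period_end"
).fetchall()
```

The January entry counts toward both month-end balances, while the February entry only affects the February one.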
For that data volume you shouldn't have to move off Postgres.
I'd look to tuning first - 10-15 minutes seems pretty excessive for 'a few million rows'. This ought to be just a few seconds. Note that the out-of-the-box config settings for Postgres don't (or at least didn't) allocate much disk buffer memory. You might look at that also.
More complex solutions involve implementing some sort of data mart or an OLAP front-end such as Mondrian over the database. The latter does pre-calculate aggregates and caches them.
If you have a set of common aggregates you can calculate them beforehand (say, once a week) in a separate table and/or columns so that users get them fast.
But I'd pursue the tuning route too: revise your indexing strategy. As your database is read-only, you don't need to worry about index update overhead.
Revise your database configuration as well; maybe you can squeeze some more performance out of it. Default configurations are normally aimed at making life easy for first-time users and quickly fall short with large databases.
Some denormalization might even speed things up if, after revising your indexing and database configuration, you still need more performance, but try it only as a last resort.
Oracle supports a concept called Query Rewrite. The idea is this:
When you want a lookup (WHERE ID = val) to go faster, you add an index. You don't have to tell the optimizer to use the index - it just does. You don't have to change the query to read FROM the index... you hit the same table as you always did but now instead of reading every block in the table, it reads a few index blocks and knows where to go in the table.
Imagine if you could add something like that for aggregation. Something that the optimizer would just 'use' without being told to change. Let's say you have a table called DAILY_SALES for the last ten years. Some sales managers want monthly sales, some want quarterly, some want yearly.
You could maintain a bunch of extra tables that hold those aggregations and then you'd tell the users to change their query to use a different table. In Oracle, you'd build those as materialized views. You do no work except defining the MV and an MV Log on the source table. Then if a user queries DAILY_SALES for a sum by month, Oracle will change the query to use an appropriate level of aggregation. The key is that this happens WITHOUT the user changing the query at all.
Maybe other DBs support that... but this is clearly what you are looking for.