GemFire Partitioning: Transaction Issues

In GemFire documentation and forums we normally see the use case of Customer, Order, Order items and partitioning done on these regions with Order and Order items co-located with Customer.
Our use case has a Capacity region which holds a large amount of inventory data and needs to be partitioned. It holds the capacity details of each train.
When we do a booking, suppose a person is going from point A to B, he might choose a route such as A - C - B,
where A-C he travels by train1 and C-B by train2.
So when the booking is made from the inventory, the capacity for train1 and train2 must be updated (reduced in this case).
Considering Capacity as a partitioned region, the train1 entry and train2 entry can be on separate data nodes. Here we cannot do any kind of data co-location on the train info.
How can we update the train1 and train2 data in a single transaction without getting a TransactionDataNotColocatedException?
Is this possible, or is it not possible to partition the Capacity region?
The Pivotal site mentions case studies of Indian Rail and China Rail, so such a use case might be a very common one?
Thanks

I've encountered something like this with an airline inventory system. The short answer is that you won't be able to use transactions to span the whole purchase since there is no partitioning scheme that colocates trains that are part of any possible journey.
You have to get creative. I've outlined one solution below. It may or may not exactly meet your needs but it should at least give you the idea of how to build a solution out of the pieces GemFire provides.
One way of doing this is to use the idea of "reserved seats". A "reserved seat" is one that may have been sold - it's "in doubt". Let a "journey" be one train going from location A to B on a particular date. There will be very many of these, and they will be stored in a partitioned region. Each train-journey would carry on it a capacity, seats sold, and a list of reservations. Each reservation contains the timestamp the reservation was made and a unique purchase identifier.
At any given time, the available capacity on a train-journey is: initial capacity - seats sold - reservations.size()
When selling a trip, which may contain multiple journeys:

for each journey in the trip:
    start txn
    retrieve train-journey
    check available capacity (see formula above)
    if capacity > 0, add a reservation to the list and commit
    else rollback and stop - the trip cannot be sold
If you succeed in reserving capacity on every journey, complete the sale and record the unique purchase identifier in a "recent purchase" region, along with a list of keys of all journeys in the trip. If any journey on the trip doesn't have capacity you tell the user the trip is not available.
This algorithm never oversells but can leave reservations in place that do not correspond to a completed purchase. This could happen because a journey on the trip was unavailable or because of failure.
The last piece of the puzzle is a couple of background jobs to process reservations and turn them into seats sold. One job would simply pass through all of the train-journeys on a regular basis and remove expired reservations from the list. Note that this can be done in an entirely distributed fashion with no inter-node coordination.
The other job would deal with recent purchases. You would loop over the recent purchases region. For each recent purchase, run an onRegion Function against the journey region with a filter consisting of the list of journey keys. This Function would, in a transaction, find the journey (local key lookup), remove the corresponding reservation and increment seats sold. Note that this Function is idempotent and does not require global transactions. If there is a failure, it can just be run again.
Hope this helps.

Related

What is the best practice database design for transactions aggregation?

I am designing a database which will hold transaction level data. It will work the same way as a bank account - debits/credits to an Account Number.
What is the best / most efficient way of obtaining the aggregation of these transactions.
I was thinking about using a summary table and then adding these to a list of today's transactions in order to derive how much each account has (i.e. their balance).
I want this to be scalable (i.e. 1 billion transactions), so I don't want to have to hit the main fact table, since finding all the debits/credits associated with a given account number could mean scanning potentially a billion rows.
Thanks, any help or resources would be awesome.
(I have been working in banks for almost 10 years. Here is how it is actually done.)
TLDR: your idea is good.
Every now and then you store the balance somewhere else (a "carry forward balance"), e.g. every month or so (or after a given number of transactions). To calculate the actual balance (or any balance in the past), you accumulate all relevant transactions going back in time until the most recent balance you kept (the "carry forward balance"), which you then need to add, of course.
The "current" balance is not kept anywhere, if only because of the locking problems you would have if you updated this balance all the time. (In real banks you'll hit some bank-internal accounts with almost every single transaction. There are plenty of bank-internal accounts needed to get the figures required by law. These accounts are hit very often and thus would cause locking issues if you updated them with every transaction. Instead, every transaction is just an insert; even the carry forward balances are just inserts.)
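In SQL this could look roughly like the following (a sketch only; balance_snapshots, transactions and the column names are hypothetical, with balance_snapshots holding the carry forward balances):

-- current balance = latest carry forward balance + all transactions posted since it
SELECT s.balance + COALESCE(SUM(t.amount), 0) AS current_balance
FROM balance_snapshots s
LEFT JOIN transactions t
       ON t.account_id = s.account_id
      AND t.tx_date   > s.snapshot_date
WHERE s.account_id = @AccountId
  AND s.snapshot_date = (SELECT MAX(snapshot_date)
                         FROM balance_snapshots
                         WHERE account_id = @AccountId)
GROUP BY s.balance;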
Also in real banks you have many use cases which make this approach more favourable:
- Being able to get back-dated balances at any time.
- Being able to get balances based on different dates for any time (e.g. value date vs. transaction date).
Reversals/cancellations are a fun of their own. Imagine reversing a transaction from two weeks ago and still keeping all of the above working.
You see, this is a long story. However, the answer to your question is: yes, your idea is good. You cannot accumulate an ever-increasing number of transactions; you need to keep intermediate balances to limit the number that must be accumulated. Hitting the main table for a limited number of rows should be no issue.
Make sure your main query uses an Index-Only Scan.
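For the balance query above, a covering index would be something along these lines (hypothetical names, SQL Server syntax); the point is that account_id, tx_date and amount are all held in the index, so the sum never touches the base table:

CREATE INDEX ix_transactions_account_date
    ON transactions (account_id, tx_date)
    INCLUDE (amount);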
Do an object-oriented design: create a table for each object, e.g. Account, Transaction, etc. Here's a good website for your reference, but there is a lot more on the web discussing OODBMSs. The reference I gave is just what I started from when doing an OODBMS.

Fact table with information that is regularly updatable in source system

I'm building a dimensional data warehouse and learning how to model my various business processes from my source system in my warehouse.
I'm currently modelling a "Bid" (bid for work) from our source system in our data warehouse as a fact table which contains information such as:
Bid amount
Projected revenue
Sales employee
Bid status (active, pending, rejected, etc)
etc.
The problem is that the bid (or most any other process I'm trying to model) can go through various states and have its information updated at any given moment in the source system. According to Ralph Kimball, fact tables should only be updated if they are considered "accumulating snapshot" and I'm sure that not all of these processes would be considered an "accumulating snapshot" by the definition below.
How should these type of processes be modeled in the data warehouse according to the Kimball group's recommendations? Further more, what type of fact table would work for a bid (given the facts I've outlined above)?
Excerpt from http://www.kimballgroup.com/2008/11/fact-tables/
The transaction grain corresponds to a measurement taken at a single
instant. The grocery store beep is a transaction grain. The measured
facts are valid only for that instant and for that event. The next
measurement event could happen one millisecond later or next month or
never. Thus, transaction grain fact tables are unpredictably sparse or
dense. We have no guarantee that all the possible foreign keys will be
represented. Transaction grain fact tables can be enormous, with the
largest containing many billions of records.
The periodic snapshot grain corresponds to a predefined span of time,
often a financial reporting period. Figure 1 illustrates a monthly
account periodic snapshot. The measured facts summarize activity
during or at the end of the time span. The periodic snapshot grain
carries a powerful guarantee that all of the reporting entities (such
as the bank account in Figure 1) will appear in each snapshot, even if
there is no activity. The periodic snapshot is predictably dense, and
applications can rely on combinations of keys always being present.
Periodic snapshot fact tables can also get large. A bank with 20
million accounts and a 10-year history would have 2.4 billion records
in the monthly account periodic snapshot!
The accumulating snapshot fact table corresponds to a predictable
process that has a well-defined beginning and end. Order processing,
claims processing, service call resolution and college admissions are
typical candidates. The grain of an accumulating snapshot for order
processing, for example, is usually the line item on the order. Notice
in Figure 1 that there are multiple dates representing the standard
scenario that an order undergoes. Accumulating snapshot records are
revisited and overwritten as the process progresses through its steps
from beginning to end. Accumulating snapshot fact tables generally are
much smaller than the other two types because of this overwriting
strategy.
As one of the comments mentions, Change Data Capture is a fairly generic term for "how do I handle changes to data entities over time", and there are entire books on it (and a gazillion posts and articles).
Regardless of any statements that seem to suggest a clear black-and-white or always-do-it-like-this answer, the real answer, as usual, is "it depends" - in your case, on what grain you need for your particular fact table.
If your data changes in unpredictable ways or very often, it can become challenging to implement Kimball's version of an accumulated snapshot (picture how many "milestone" date columns, etc. you might end up needing).
So, if you prefer, you can decide to make your fact table a transactional fact table rather than a snapshot, where the fact key would be (Bid Key, Timestamp). Then, in your application layer (whether a view, mview, actual app, or whatever), you can ensure that a given query only gets the latest version of each Bid (note that this can be thought of as a kind of virtual accumulated snapshot). If you find that you don't need the previous versions (the history of each Bid), you can have a routine that prunes them (i.e. deletes them or moves them somewhere else).
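A minimal sketch of such a "latest version" view, assuming a hypothetical dbo.FactBid table keyed by (BidKey, LoadTimestamp):

CREATE VIEW dbo.vw_CurrentBid
AS
SELECT BidKey, BidAmount, ProjectedRevenue, SalesEmployeeKey, BidStatusKey, LoadTimestamp
FROM (
    SELECT BidKey, BidAmount, ProjectedRevenue, SalesEmployeeKey, BidStatusKey, LoadTimestamp,
           ROW_NUMBER() OVER (PARTITION BY BidKey ORDER BY LoadTimestamp DESC) AS rn
    FROM dbo.FactBid
) latest
WHERE latest.rn = 1;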
Alternatively, you can only allow the fact (Bid) to be added when it is in its final state, but then you will likely have a significant lag where a new (updateable) Bid doesn't make it to the fact table for some time.
Either way, there are several solid and proven techniques for handling this - you just have to clearly identify the business requirements and design accordingly.
Good luck!

Optimal solution for massive number of requests on one database table

We have a system where customers are allocated a product on a first come first served basis.
Our products table contains an incrementing primary key that started at zero, which we use to keep track of how many products have been allocated, i.e. a user reserves a product and gets allocated 1, the next user gets 2, etc.
The problem is that potentially hundreds of thousands of users will access the system in any given hour, all of whom will be hitting this one table.
Since we need to ensure that each customer is only allocated one product and keep track of how many products have been allocated, we use a row lock for each customer accessing the system to ensure they write to the table before the next customer hits the system - i.e. enforcing the first come first served rule.
We are concerned about the bottleneck that is the processing time of each request coming into SQL Server 2008 Enterprise Edition and the row lock.
We can't use multiple servers as we need to ensure the integrity of the primary key, so anything that requires replication isn't going to work.
Does anyone know of any good solutions that are particularly efficient at handling a massive number of requests on one database table?
A bit more info:
The table in question essentially contains two fields only - ID and CustomerID. The solution is for a free giveaway of a million products - hence the expectation of high demand and why using the incrementing primary key as a key makes sense for us - once the key hits a million, no more customers can register. Also, the products are all different, so allocation of the correct key is important, e.g. the first 100 customers entered receive a higher value product than the next 100, etc.
First, to remove the issue of key generation, I would generate them all in advance. It's only 1m rows and it means you don't have to worry about managing the key generation process. It also means you don't have to worry about generating too many rows accidentally, because once you have the table filled, you will only do UPDATEs, not INSERTs.
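Pre-generating the rows could be as simple as this (a sketch only; dbo.Giveaway and its columns are hypothetical names):

WITH n AS (
    SELECT TOP (1000000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ID
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
)
INSERT INTO dbo.Giveaway (ID, CustomerID)
SELECT ID, NULL      -- every key starts out unassigned
FROM n;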
One important question here is: are all 1m items identical or not? If they are, then it doesn't matter what order the keys are in (or even whether they have an order), so as customers submit requests you just 'try' to UPDATE the table with something roughly like this:
UPDATE TOP (1) dbo.Giveaway
SET CustomerID = @CurrentCustomerID
OUTPUT inserted.ID          -- returns the key value that was allocated
WHERE CustomerID IS NULL;

IF @@ROWCOUNT = 0 -- no free items left
    PRINT 'Bad luck';
ELSE
    PRINT 'Winner';
If on the other hand the 1m items are different, then you need another solution, e.g. item 1 is X, items 2-10 are Y, 11-50 are Z, etc. In this case it's important to assign customers to keys in the order the requests are submitted, so you should probably look into a queuing system of some kind, perhaps using Service Broker. Each customer adds a request to the queue, then a stored procedure processes them one at a time, assigns each the lowest free key, and returns the details of what they won.
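The key-assignment step of that stored procedure could look roughly like this (a sketch only, ignoring the Service Broker plumbing that feeds @CustomerID in request order; table and variable names are hypothetical):

WITH nextFree AS (
    SELECT TOP (1) ID, CustomerID
    FROM dbo.Giveaway WITH (UPDLOCK, READPAST)   -- standard queue-processing hints
    WHERE CustomerID IS NULL
    ORDER BY ID                                  -- lowest free key = best remaining product
)
UPDATE nextFree
SET CustomerID = @CustomerID;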

Recommendations for best SQL Performance updating and/or calculating stockonhand totals

Apologies for the length of this question.
I have a section of our database design which I am worried may begin to cause problems. It is not at that stage yet, but I obviously don't want to wait until it is before resolving the issue. Before I start testing various scenarios, I would appreciate input from anyone who has experience with such a problem.
Situation is Stock Control and maintaining the StockOnHand value.
It would be possible to maintain a table holding the stock control figures, which can be updated whenever an order is entered, either manually or by using a database trigger.
Alternatively you can get SQL to calculate the quantities by reading and summing the actual sales values.
The program is installed on several sites some of which are using MS-SQL 2005 and some 2008.
My problem is complicated because the same design needs to cope with several scenarios,
such as :
1) Cash/Sale Point Of Sale Environment. Sale is entered and stock is reduced in one transaction. No amendments can be made to this transaction.
2) Order/Routing/Confirmation
In this environment, the order is created and can be placed on hold, released, routed, amended, delivered, and invoiced. And at any stage until it is invoiced the order can be amended. (I mention this because any database triggers may be invoked lots of time and has to determine if changes should affect the stock on hand figures)
3) Different businesses have different ideas of when their StockOnHand should be reduced. For example, some consider the stock as sold once they approve an order (as they have committed to sell the goods and hence it should not be sold to another person). Others do not consider the stock as sold until they have routed it, and some others only when it has been delivered or collected.
4) There can be a large variance in the number of transactions per product. For example, one system has four or five products which are sold several thousand times per month, so asking SQL to perform a sum on those transactions means reading tens of thousands of transactions per year. Whereas, on the same system, there are several thousand other products whose sales amount to fewer than a thousand transactions per year per product.
5) Historical information is important. For that reason, our system does not delete or archive transactions and has several years worth of transactions.
6) The system must have the ability to warn operators if they do not have the required stock when the order is entered (which quite often is in real time, e.g. a telephone order).
Note that this is only required for some products. (But I don't think it would be practical to sum the quantity across tens of thousands of transactions in real time.)
7) Average cost price. Some products can be priced based on the average cost of the items in stock. The way this is implemented is that the average cost price is re-calculated for every goods-in transaction, something like newAverageCostPrice = ((oldAverageCostPrice * oldStockOnHand) + newCostValue) / newStockOnHand. This means the stock on hand must be known for every goods-in transaction if the product uses average cost.
The way the system is currently implemented is twofold.
We have a table which holds the StockOnHand for each product and location. Whenever a sale is updated, this table is updated via the business layer of our application (C#)
This only provides the current stock on hand figures.
If you need to run a stock valuation for a particular date, the figure is calculated by summing the quantities on the lines involved. This also requires a join between the sales line and the sale header tables, as the quantity and product are stored in the line file and the date and status are held only in the header table.
However, there are downsides to this method, such as.
Running the stock valuation report is slow (though not unacceptably so), and I am not happy with it. (It works, and monitoring the server does not show it being overloaded, but it has the potential to cause problems and hence requires regular monitoring.)
The logic of the code updating the StockOnHand table is complicated.
This table is being updated frequently. In a lot of cases this is unnecessary, as the information does not need to be checked. For example, if 90% of your business is selling 4 or 5 products, you don't really need a computer to tell you that you are out of stock.
Database triggers.
I have never implemented complicated triggers before, so am wary of this.
For example, as stated before, we need configuration options to determine the conditions under which the stock figures should be updated. This is currently read once and cached in our program. To do this inside a trigger would presumably mean reading this information for every trigger invocation. Does this have a big impact on performance?
Also, we may need a trigger on both the sale header and the sale line. (This could mean that an amendment to the sale header would be forced to read the lines and update the StockOnHand for the relevant products, and then later the lines are saved and another database trigger would amend the StockOnHand table again, which may be inefficient.)
Another alternative would be to only update the StockOnHand table whenever the transaction is invoiced (which means no further amendments can be made), and to provide a function to calculate the StockOnHand figure based on a union of this table and the uninvoiced transactions which affect stock.
Any advice would be greatly appreciated.
First of all, I would strongly recommend you add "StockOnHand", "ReservedStock" and "SoldStock" columns to your table.
A cash sale would immediately subtract the sale from "StockOnHand" and add it to "SoldStock". For an order, you would leave "StockOnHand" alone and merely add the sale to "ReservedStock". When the stock is finally invoiced, you subtract the sale from "StockOnHand" and "ReservedStock" and add it to "SoldStock".
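In SQL those three transitions might look something like this (a sketch only; dbo.StockLevels and the variables are hypothetical names):

-- cash sale: the stock leaves immediately
UPDATE dbo.StockLevels
SET StockOnHand = StockOnHand - @Qty,
    SoldStock   = SoldStock + @Qty
WHERE ProductID = @ProductID;

-- order taken: reserve only
UPDATE dbo.StockLevels
SET ReservedStock = ReservedStock + @Qty
WHERE ProductID = @ProductID;

-- invoiced: the reservation becomes a sale
UPDATE dbo.StockLevels
SET StockOnHand   = StockOnHand - @Qty,
    ReservedStock = ReservedStock - @Qty,
    SoldStock     = SoldStock + @Qty
WHERE ProductID = @ProductID;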
The business users can then choose whether StockOnHand is just that or StockOnHand - ReservedStock.
Using a maintained StockOnHand figure will reduce your query times massively, versus the small risk that the figure can go out of kilter if you mess up your program logic.
If your customers are lucky enough to experience update contention when maintaining the StockOnHand figure (i.e. they are likely to process more than five sales a second at peak times), then you can consider the following scheme:
Overnight, calculate the StockOnHand figure by counting deliveries minus sales, or whatever applies.
When a sale is confirmed, insert a row into a "Todays Sales" table.
When you need to query stock on hand, total up today's sales and subtract them from the start-of-day figure.
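The stock-on-hand query for that scheme could then be as simple as this (a sketch only; dbo.StartOfDayStock and dbo.TodaysSales are hypothetical names):

SELECT s.ProductID,
       s.StartOfDayQty - ISNULL(SUM(t.Qty), 0) AS StockOnHand
FROM dbo.StartOfDayStock s
LEFT JOIN dbo.TodaysSales t
       ON t.ProductID = s.ProductID
WHERE s.ProductID = @ProductID
GROUP BY s.ProductID, s.StartOfDayQty;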
You could also place a "stock check threshold" on each product. If you start the day with 10,000 widgets, you can set the CheckThreshold to 100: if someone orders fewer than 100, don't bother checking the stock; if someone orders more than 100, check the stock and recalculate a new, lower threshold.
Could you create a view (or views) to represent your stock on hand? This would take the responsibility for doing the calculations out of synchronous triggers, which slow down your transactions. Using multiple views could satisfy the requirement "different businesses have different ideas of when their StockOnHand should be reduced". Assuming you can meet the stringent requirements, creating an indexed view could further improve your performance.
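A minimal sketch of such an indexed view, assuming a hypothetical dbo.StockTransactions table with a NOT NULL Qty column (signed, so receipts are positive and sales negative):

CREATE VIEW dbo.vStockOnHand
WITH SCHEMABINDING
AS
SELECT ProductID,
       SUM(Qty)     AS StockOnHand,
       COUNT_BIG(*) AS RowCnt      -- COUNT_BIG(*) is required in an indexed view with GROUP BY
FROM dbo.StockTransactions
GROUP BY ProductID;
GO
CREATE UNIQUE CLUSTERED INDEX IX_vStockOnHand ON dbo.vStockOnHand (ProductID);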
Just some ideas off the top of my head:
Instead of a trigger (and persistent SOH data), you could use a computed column (e.g. SOH per product per store). However, the performance impact of evaluating this would likely be abysmal unless there are far more writes to your source tables than reads from your computed column. (The trade-off assumes the only reason you calculate the SOH is so that you can read it again. If you update the source data for the calculation much more often than you actually need to read it, then the computed column might make sense, since it is evaluated just-in-time, only when needed. This would be unusual though; reads are usually more frequent than writes in most systems.)
I'm guessing that the reason you are looking at triggers is that the source tables for the SOH figures are updated from a large number of procs / pieces of code, and you want to prevent oversights (as opposed to calling a recalc sproc from every applicable point where the source data is modified?).
IMHO placing complicated logic in DB triggers is not advised, as this will adversely affect the performance of high-volume inserts / updates, and triggers aren't great for maintainability.
Does the SOH calculation need to be real time? If not, you could implement a mechanism to queue requests for recalculation (e.g. by using a trigger to indicate that a product / location balance is dirty) and then run a recalculation service every few minutes for near real-time. Mission critical calculations (e.g. financial - like your #6) could still however detect that a SOH calc is dirty and then force a recalc before doing a transaction.
Re 3: ouch. I would recommend that internally you agree on a consistent (and industry-accepted) set of terminology (Stock On Hand, Stock Committed, Stock In Transit, Shrinkage, etc.) and then try to convince your customers to conform to a standard. But that is in the ideal world, of course!

Need help/suggestions for creating fantasy sports scoring databases and queries

I'm trying to create a website for my friends and me to keep track of fantasy sports scoring. So far, I've been doing the calculations and storage in Excel, which is very tedious. I'm trying to make it more simplified and automated through a SQL database that I can then wrap a web app around to enter daily stat updates.
It's premised on our participation in another commercial site where we trade virtual shares of athletes, and thus acquire an "ownership percentage" in each athlete. For instance, if there are 100 shares of AROD and I own 10 shares, then I own 10%. It then applies this to traditional baseball rotisserie scoring. So, for instance, if AROD has 1 HR today, then his adjusted HR stat would be 1.10. If he also has 2 RBIs, then his adjusted RBI stat today would be 2.20, based on 2 x 1.10 (the 1 to normalize the stat, and the .10 to represent the ownership percentage).
All the stats for my team would then be summed each day and added to my stat history to come to an aggregated total. After that, points are allocated based on the ranking of each participant in each category at the end of the day. E.g. if there are 10 participants, and I have the highest total aggregate number of Adjusted HR's, then I get 10 pts. The points are then summed across the different stat categories to come up with a total point ranking for that day.
An added difficulty is that ownership %'s can change on a daily basis.
So far, in playing around with different schemas, I don't know that having a separate table for each athlete's stats and each player's ownership %'s is the wisest choice. It seems to me that simply having two tables would be enough: one that contains the daily stat information for each athlete, and another that shows the ownership % of each player. My friend suggested using a start and end date for each ownership % to represent the potential daily changes in this category.
I'm admittedly new to database development, so any suggestions on query code would be appreciated.
You could go nuts, and do the following:
A table named 'Athletes' that has a record for each athlete. Here is where you could store the static properties of the athlete, like what sport they are in, their batting average, etc.
A table named 'Owners' that has a record for each user. This might include their name, their password hash, join date, etc.
A table for each athlete, containing a record for each owner. Here is where you'd store a reference to the Owners table, along with the percentage ownership.
A table for each owner, containing the history of ownership.
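To make that concrete, here is one way the tables and the daily adjusted-stat query could look. This is only a sketch of the asker's two-table idea combined with the Athletes/Owners tables above; every table and column name is hypothetical, and the ownership date range follows the friend's suggestion:

CREATE TABLE Athletes (
    AthleteID INT PRIMARY KEY,
    Name      VARCHAR(100),
    Sport     VARCHAR(50)
);

CREATE TABLE Owners (
    OwnerID INT PRIMARY KEY,
    Name    VARCHAR(100)
);

-- raw daily stats per athlete
CREATE TABLE DailyStats (
    AthleteID INT REFERENCES Athletes (AthleteID),
    StatDate  DATE,
    HR        INT,
    RBI       INT,
    PRIMARY KEY (AthleteID, StatDate)
);

-- ownership % with a start/end date range to capture daily changes
CREATE TABLE Ownership (
    OwnerID   INT REFERENCES Owners (OwnerID),
    AthleteID INT REFERENCES Athletes (AthleteID),
    StartDate DATE,
    EndDate   DATE NULL,          -- NULL = still owned
    PctOwned  DECIMAL(5, 2),      -- e.g. 10.00 for 10%
    PRIMARY KEY (OwnerID, AthleteID, StartDate)
);

-- adjusted stats for every owner on one day, using the (1 + ownership%) multiplier
-- described in the question
SELECT w.OwnerID,
       SUM(d.HR  * (1 + w.PctOwned / 100.0)) AS AdjustedHR,
       SUM(d.RBI * (1 + w.PctOwned / 100.0)) AS AdjustedRBI
FROM Ownership w
JOIN DailyStats d
  ON d.AthleteID = w.AthleteID
 AND d.StatDate >= w.StartDate
 AND (w.EndDate IS NULL OR d.StatDate <= w.EndDate)
WHERE d.StatDate = @StatDate
GROUP BY w.OwnerID;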