I want to create a SQL Server table that has Department and Maximum Capacity columns (assume a maximum of 10 for this scenario). When users add themselves to a department, the system will check the current assignment count (assume 9 for this scenario) in the department and compare it to the maximum value. If it is below the maximum, they will be added.
The issue is this: what if two users submit at the same time, so that when the code retrieves the current assignment count it is 9 for both? One user updates the row sooner, so now it's 10, but the other user has already retrieved the previous value (9) before the update. Both pass the comparison and we end up with 11 users in the department.
Is this even possible and how can one solve it?
The answer to your problem lies in understanding "Database Concurrency" and then choosing the correct solution to your specific scenario.
It is too large a topic to cover in a single SO answer, so I would recommend doing some reading and coming back with specific questions.
However in simple form you either block the assignments out to the first person who tries to obtain them (pessimistic locking), or you throw an error after someone tries to assign over the limit (optimistic locking).
In the pessimistic case you then need ways to unblock them if the user fails to complete the transaction e.g. a timeout. A bit like on a ticket booking website it says "These tickets are being held for you for the next 10 minutes, you must complete your booking within that time else you may lose them".
And when you're down to the last few positions you are going to be turning everyone after the first away... no other way around it if you require this level of locking. (Well you could then create a waiting list, but that's another issue in itself).
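As a rough illustration of rejecting the request that would exceed the limit, the check and the increment can be folded into one conditional UPDATE so that no other session can slip in between them. This is only a sketch: it uses Python's built-in sqlite3 as a stand-in for SQL Server, and the table names, column names and the try_join helper are all made up for the example. In SQL Server the same UPDATE ... WHERE pattern applies, with the affected row count telling you whether the seat was obtained.

```python
import sqlite3


def make_db():
    """Create an in-memory database with hypothetical tables for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE department ("
        " id INTEGER PRIMARY KEY,"
        " max_capacity INTEGER NOT NULL,"
        " current_count INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute(
        "CREATE TABLE assignment ("
        " department_id INTEGER NOT NULL,"
        " user_id INTEGER NOT NULL,"
        " PRIMARY KEY (department_id, user_id))"
    )
    return conn


def try_join(conn, department_id, user_id):
    """Return True if the user got a seat, False if the department is full."""
    with conn:  # one transaction: commit on success, roll back on error
        # The capacity check and the increment are a single statement, so two
        # sessions cannot both act on the same stale count.
        cur = conn.execute(
            "UPDATE department SET current_count = current_count + 1"
            " WHERE id = ? AND current_count < max_capacity",
            (department_id,),
        )
        if cur.rowcount == 0:
            return False  # no row qualified: already at the maximum
        conn.execute(
            "INSERT INTO assignment (department_id, user_id) VALUES (?, ?)",
            (department_id, user_id),
        )
        return True
```

With a department at 9 of 10, the first call to try_join succeeds and the second returns False, whichever session arrives first.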
I have an application which calls the database multiple times to achieve one simple goal.
A little information about this application: in short, it scrapes data from a webpage and stores specific information from that page in a database. The important fields for this question are: player name, position (multiple players can sit at one specific position), kill points and class.
Player name has every potential to change or remain the same every day
Regarding the Position, there can be multiple sitting in one position
Kill points has the potential to increase or remain the same every day
Class: there are only 2 possibilities for a given name, e.g. A can change to B or remain A (and the same in reverse), but it cannot become C, D, E or F
The player name can change on any given day. Position can also change, depending on the kill point increase since the last update, which brings us back to the goal: to search the database day by day, from the current date as far back as 2021-02-22, starting at the most recent entry for a player name and backtracking to the previous day to check whether that player name is still the same or has changed.
What is being used as a main reference to the change is the kill points. As the days go on, this number will either be the exact same or increase, it can never decrease.
So now onto the implementation of this application.
The first query which runs finds the most recent entry for the player name
SELECT TOP(1) * FROM [changes] WHERE [CharacterName]=#charname AND [Territory]=#territory AND [Archived]=0 ORDER BY [Recorded] DESC
Then continue to check the previous days entries with the following query:
SELECT TOP(1) * FROM [changes] WHERE [Territory]=#territory AND [CharacterName]=#charname AND [Recorded]=#searchdate AND ([Class] LIKE '%{Class}%' OR [Class] LIKE '%{GetOpposite(Class)}%') AND [Archived]=0
If no results are found, will then proceed to find an alternative name with the following query:
SELECT TOP(5) * FROM [changes] WHERE [Kills] <= #kills AND [Recorded]='{Data.Recorded.AddDays(-1):yyyy-MM-dd}' AND [Territory]=#territory AND [Mode]=#mode AND ([Class] LIKE #original OR [Class] LIKE #opposite) AND [Archived]=0 ORDER BY [Kills] DESC
The aim of the query above is to get the top 5 entries that are the closest possible matches, and then cross-reference them with the day ahead:
SELECT COUNT(*) FROM [changes] WHERE [CharacterName]=#CharacterName AND [Territory]=#Territory AND [Recorded]=#SearchedDate AND [Archived]=0
So when checking the day ahead, if a candidate's character name is not found in the day ahead, it is considered to be the old player name for this specific character. Otherwise, if all 5 of the results have been searched and every one of them is present in the day ahead, then this name is considered to be new to the table.
Running from the date this application started up to today's date, that comes to over 400 individual queries on the database to achieve one goal.
It is also worth noting that this table grows by 14,400 - 14,500 rows every day.
The overall question: is it possible to combine all these queries into fewer calls to the database, to reduce the number of queries and improve performance?
What you can do to improve performance will be based on what parts of the application stack you can manipulate. Things to try:
Store Less Data - Database content retrieval speed is largely based on how well the database is ordered/normalized and just how much data needs to be searched for each query. Managing a cache of prior scraped pages and only storing data when there's been a change between the current scrape and the last one would guarantee less redundant requests to the db.
Separate specific classes of data - Separating data into dedicated tables would allow you to query a specific table for a specific character, etc... effectively removing one where clause.
Reduce time between queries - Fewer incoming concurrent requests mean less resource contention and faster response times to prior requests.
Use another data structure - The only reason you're using top() is because you need data ordered in some specific way (most-recent, etc...). If you just used a code data structure that keeps the data ordered and still easily-query-able you could then perhaps offload some sql requests to this structure instead of the db.
The suggestions above are not exhaustive, but what you do to improve performance is largely a function of what in the application stack you have the ability to modify.
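To make the "use another data structure" idea a little more concrete, here is a minimal sketch that assumes all rows for one territory and mode have already been fetched with a single query and that archived rows were filtered out by that query. The per-day lookups then happen in memory instead of as 400+ round trips. The row keys (recorded, name, kills, cls) and both function names are invented for the example, and the matching rules are only an approximation of the ones described in the question.

```python
from collections import defaultdict
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def index_by_day(rows):
    """Group rows as {recorded_date: {character_name: row}}."""
    by_day = defaultdict(dict)
    for row in rows:
        by_day[row["recorded"]][row["name"]] = row
    return by_day


def trace_names(by_day, name, newest, oldest, class_pair):
    """Walk back one day at a time and return [(day, name_on_that_day), ...]."""
    current = by_day[newest][name]
    history = [(newest, name)]
    day = newest - ONE_DAY
    while day >= oldest:
        rows_today = by_day.get(day, {})
        rows_ahead = by_day.get(day + ONE_DAY, {})
        match = rows_today.get(current["name"])
        if match is None:
            # Name not found: take the 5 closest rows by kills (never higher
            # than the current kills) with the same or the opposite class.
            candidates = sorted(
                (r for r in rows_today.values()
                 if r["kills"] <= current["kills"] and r["cls"] in class_pair),
                key=lambda r: r["kills"],
                reverse=True,
            )[:5]
            # A candidate whose name is gone the day ahead is the old name.
            match = next(
                (r for r in candidates if r["name"] not in rows_ahead), None
            )
            if match is None:
                break  # every candidate still exists: the name is new
        current = match
        history.append((day, current["name"]))
        day -= ONE_DAY
    return history
```

Whether this is practical depends on how many rows one territory produces over the whole date range; at roughly 14,500 rows a day it may be worth loading only a window of days at a time.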
Problem
we have ~50k scheduled financial reports that we periodically deliver to clients via email
reports have their own delivery frequency (date&time format - as configured by clients)
weekly
daily
hourly
weekdays only
etc.
Current architecture
we have a table called report_metadata that holds report information
report_id
report_name
report_type
report_details
next_run_time
last_run_time
etc...
every week, all 6 instances of our scheduler service poll the report_metadata database, extract metadata for all reports that are to be delivered in the following week, and put them in a timed-queue in-memory.
Only in the master/leader instance (which is one of the 6 instances):
data in the timed-queue is popped at the appropriate time
processed
a few API calls are made to get a fully-complete and current/up-to-date report
and the report is emailed to clients
the other 5 instances do nothing - they simply exist for redundancy
Proposed architecture
Numbers:
db can handle up to 1000 concurrent connections - which is good enough
total existing report number (~50k) is unlikely to get much larger in the near/distant future
Solution:
instead of polling the report_metadata db every week and storing data in a timed-queue in-memory, all 6 instances will poll the report_metadata db every 60 seconds (with a 10 s offset for each instance)
on average the scheduler will attempt to pick up work every 10 seconds
data for any single report whose next_run_time is in the past is extracted, the table row is locked, and the report is processed/delivered to clients by that specific instance
after the report is successfully processed, table row is unlocked and the next_run_time, last_run_time, etc for the report is updated
In general, the database serves as the master, individual instances of the process can work independently and the database ensures they do not overlap.
It would help if you could let me know if the proposed architecture is:
a good/correct solution
which table columns can/should be indexed
any other considerations
I have worked on a different kind of scheduler, for a program that reported analyses at specific moments of the month/week. What I did was combine the reports into so-called business-cycle-based time moments, such as "start of a new week", "start of the month" or "start/end of a D/W/M/Q/Y". So I standardised the moments at which the reports are sent and added the ids to a table that carries the details of each report. Now you add things to a cycle or remove them when needed; you could do this by adding a tag such as EOD (end of day), EOM (end of month), SOW (start of week), etc.
So you could index the moments of when the clients want to receive the reports and build on that track. Hope that this comment can help you with your challenge.
It seems good to simply query that metadata table by all 6 instances to check which is the next report to process as you are suggesting.
It seems odd though to have a staggered approach with a check once every 60 seconds offset by 10 seconds for your servers. You have 6 servers now but that may change. Also I don't understand the "locking" you are suggesting: why not simply set a flag on the row such as [State] = "processing", so that the next scheduler knows to skip that row and move on to the next available one? Once a run is processed, you can simply update a [Date_last_processed] column, or maybe something like [last_cycle_complete] = 'YES'.
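A sketch of how such a state flag could be claimed safely when several instances poll at once: the UPDATE repeats the state check in its WHERE clause, so only one instance can flip a given row to "processing". It uses Python's built-in sqlite3 as a stand-in for the real database. The state and claimed_by columns, the integer timestamps and all function names are assumptions added for the example; only report_id, next_run_time and last_run_time come from the question.

```python
import sqlite3


def create_schema(conn):
    """Hypothetical cut-down version of report_metadata (times as integers)."""
    conn.execute(
        "CREATE TABLE report_metadata ("
        " report_id INTEGER PRIMARY KEY,"
        " next_run_time INTEGER NOT NULL,"
        " last_run_time INTEGER,"
        " state TEXT NOT NULL DEFAULT 'idle',"
        " claimed_by TEXT)"
    )


def claim_due_report(conn, worker_id, now):
    """Claim one due report for this worker; return its id or None."""
    with conn:
        row = conn.execute(
            "SELECT report_id FROM report_metadata"
            " WHERE state = 'idle' AND next_run_time <= ?"
            " ORDER BY next_run_time LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        # Re-check the state in the UPDATE itself: if another instance got
        # there first, no row matches and rowcount is 0.
        cur = conn.execute(
            "UPDATE report_metadata SET state = 'processing', claimed_by = ?"
            " WHERE report_id = ? AND state = 'idle'",
            (worker_id, row[0]),
        )
        return row[0] if cur.rowcount == 1 else None


def finish_report(conn, report_id, finished_at, next_run_time):
    """Release the row and record when it ran and when it is due next."""
    with conn:
        conn.execute(
            "UPDATE report_metadata SET state = 'idle', claimed_by = NULL,"
            " last_run_time = ?, next_run_time = ? WHERE report_id = ?",
            (finished_at, next_run_time, report_id),
        )
```

One thing a plain flag does not cover is an instance that crashes mid-report and leaves the row in "processing"; a claim timestamp that lets stale claims be retried would be one way to handle that.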
Alternatively you could have one server process go through the table and, for each available row, send it off to one of the instances in a round-robin fashion (or keep track of who is busy and who isn't).
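The round-robin variant is simple enough to sketch in a few lines. The function name and the instance labels are placeholders, and a real dispatcher would send each id over whatever channel the instances listen on rather than just building lists.

```python
from itertools import cycle


def assign_round_robin(report_ids, instances):
    """Spread report ids over the instances in turn; returns {instance: [ids]}."""
    if not instances:
        raise ValueError("at least one instance is required")
    assignments = {name: [] for name in instances}
    # cycle() repeats the instance list endlessly; zip() stops when the
    # report ids run out.
    for report_id, name in zip(report_ids, cycle(instances)):
        assignments[name].append(report_id)
    return assignments
```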
If you are working with access control, you must have faced the issue where the Automatic Record Permission field (with Rules) does not update itself on recalculating the record. You either have to launch full recalculation or wait for a considerable amount of time for the changes to take place.
I am facing this issue where based on 10 different field values in the record, I have to give read/edit access to 10 different groups respectively.
For instance:
if rule 1 is true, give edit access to 1st group of users
if rule 1 and 2 are true, give edit access to 1st AND 2nd group of users.
I have selected 'No Minimum' and 'No Maximum' in the Auto RP field.
How to make the Automatic Record Permission field to update itself as quickly as possible? Am I missing something important here?
If you are working with access control, you must have faced the issue
where the Automatic Record Permission field (with Rules) does not
update itself on recalculating the record. You either have to launch
full recalculation or wait for a considerable amount of time for the
changes to take place.
Tanveer, in general, this is not a correct statement. You should not face this issue with [a] well-designed architecture (relationships between your applications) and [b] correct calculation order within the application.
About the case you described. I suggest you check and review the following possibilities:
1. Calculation order. Automatic Record Permissions [ARP from here] are treated by the Archer platform in the same way as calculated fields. This means that you can modify the order in which calculated fields and automatic record permissions are updated when you save the record. So it is possible that your ARP field is calculated before certain calculated fields you use in the ARP rules. For example, let's say you have two rules in the ARP field:
if A>0 then group AAA
if B>0 then group BBB
Now, you will have a problem if calculation order is the following:
"ARP", "A", "B"
ARP will not be updated after you click "Save" or "Apply" once, but it will be updated after you click "Save" or "Apply" twice within the same record. With calculation order "A", "B", "ARP" your ARP will get recalculated right away.
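The effect of the calculation order can be reproduced with a toy model. This is not how Archer is implemented; it only assumes that a save evaluates each calculated field once, in the configured order, against whatever values are stored in the record at that moment. The field names input_a and input_b and both helper functions are invented for the example.

```python
# Toy formulas: A and B copy user-entered inputs, ARP applies the two rules.
FORMULAS = {
    "A": lambda record: record["input_a"],
    "B": lambda record: record["input_b"],
    "ARP": lambda record: [
        group
        for group, field in (("AAA", "A"), ("BBB", "B"))
        if record[field] > 0
    ],
}


def new_record():
    """A record whose calculated fields have not been computed yet."""
    return {"input_a": 0, "input_b": 0, "A": 0, "B": 0, "ARP": []}


def save_record(record, calculation_order):
    """Recalculate every field once, in the given order, like one 'Save'."""
    for field in calculation_order:
        record[field] = FORMULAS[field](record)
    return record
```

With the order "ARP", "A", "B" and input_a set to 1, the first save leaves ARP empty because A still holds its old value when ARP is evaluated; a second save picks up group AAA. With the order "A", "B", "ARP" the first save is enough.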
2. Full recalculation queue.
Since ARPs are treated as calculated fields, this means that every time an ARP needs to be updated, recalculation job(s) will be created on the application server on the back end. If for some reason the calculation queue is full, the record permission will not be updated right away. The job engine recalculation queue can be full if you have a data feed running or if you have a massive amount of recalculations triggered via manual data imports. The recalculation job related to the ARP update will be created and added to the queue, and it will be processed based on the priorities defined for the job queue. You can monitor the job queue and alter the default job processing priorities in Archer v5.5 via the Archer Control Panel interface. I suggest you check the job queue state next time you see delays in ARP recalculations.
3. "Avalanche" of recalculations
It is important to design relationships and security inheritance between your applications so recalculation impact is minimal.
For example, let's say we have a Contacts application and a Department application.
- A record in the Contacts application inherits access from the Department record using Inherited Record Permission.
- The Department record has an automatic record permission and the Contacts record inherits it.
- Now the best part: Department D1 has 60 000 Contacts records linked to it, and Department D2 has 30 000 Contacts records linked to it.
The problem you described is reproducible in this configuration. Say I go to Department record D1 and update it in a way that forces the ARP in the department record to recalculate. This will add 60 000 jobs to the job engine queue to recalculate the 60k Contacts linked to the D1 record. Now, without waiting, I go to D2 and make a change that forces the ARP in this D2 record to recalculate. After I save record D2, new jobs to recalculate D2 and its 30 000 Contacts records will be created in the job engine queue. But record D2 will not be recalculated instantly, because the first set of 60k records has not been recalculated yet and the recalculation of the D2 record is still sitting in the queue.
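The queueing effect can be shown with a toy first-in, first-out queue. This assumes every job has the same priority and that each department save enqueues one job for the department plus one per linked contact, which simplifies what the real job engine does; the function names are made up for the example.

```python
from collections import deque


def enqueue_department_save(queue, department, linked_contacts):
    """Model one save: a job for the department, then one job per contact."""
    queue.append(("department", department))
    queue.extend(("contact", department, n) for n in range(linked_contacts))


def jobs_ahead_of(queue, job):
    """How many jobs must finish before this one is picked up (strict FIFO)."""
    return list(queue).index(job)
```

Saving D1 with 60 000 contacts and then D2 with 30 000 leaves the D2 job with tens of thousands of jobs ahead of it, which is the delay seen in the user interface.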
Unfortunately, there is not a good solution available at this point. However, this is what you can do:
- review and minimize inheritance
- review and minimize relationships between records where 1 record references 1000+ records.
- modify architecture and break inheritance and relationships and replace them with Archer to Archer data feeds if possible.
- add more "recalculation" power to your Application server(s). You can configure your web-servers to process recalculation jobs as well if they are not utilized to a certain point. Add more job slots.
Tanveer, I hope this helps. Good luck!
I'm currently working on a project in MongoDB where I want to get a random sampling of new products from the DB. But my problem is not MongoDB specific, I think it's a general database question.
The scenario:
Let's say we have a collection (or table) of products. And we also have a collection (or table) of users. Every time a user logs in, they are presented with 10 products. These products are selected randomly from the collection/table. Easy enough, but the catch is that every time the user logs in, they must be presented with 10 products that they have NEVER SEEN BEFORE. The two obvious ways that I can think of solving this problem are:
Every user begins with their own private list of all products. Each time they get one of these products, the product is removed from their private list. The result is that the next time products are chosen from this previously trimmed list, it already contains only new items.
Every user has a private list of previously viewed products. When a user logs in, the application selects 10 random products from the master list and compares the id of each against the user's list of previously viewed products. If an item appears on the previously viewed list, the application throws it away, selects a new one, and iterates until there are 10 new items, which it then adds to the previously viewed list for next time.
The problem with #1 is that it seems like a tremendous waste. You would basically be duplicating the list data for n users. Also, removing/adding items in the system would be a nightmare since it would have to iterate through all users. #2 seems preferable, but it too has issues. You could end up making a lot of extra and unnecessary calls to the DB in order to guarantee 10 new products. As a user goes through more and more products, there are fewer new ones to choose from, so the chance of having to throw one away and get a new one from the DB increases greatly.
Is there an alternative solution? My first and primary concern is performance. I will give up disk space in order to optimize performance.
Those 2 ways are a complete waste of both primary and secondary memory.
You want to show 10 never-before-seen products, but is this a real must?
If you have a lot of products, 10 random ones have a high chance of being unique.
#3: You could simply list 10 random products; even though this is not as easy as in MySQL, it is still less complicated than #1 and #2.
If you don't care how random the sequence of id's is you could do this:
Create a single randomized table of just product id's and a sequential integer surrogate key column. Start each customer at a random point in the list on first login and cycle through the list ordered by that key. If you reach the end, start again from the top.
The customer record would contain a single value for the last product they saw (the surrogate from the randomized list, not the actual id). You'd then pull the next ten on login and do a single update to the customer. It wouldn't really be random, of course. But this kind of table-seed strategy is how a lot of simpler pseudo-random number generators work.
The only problem I see is if your product list grows more quickly than your users log in. Then they'd never see the portions of the list which appear before wherever they started. Even so, with a large list of products and very active users this should scale much better than storing everything they've seen. So if it doesn't matter that products appear in a set pseudo-random sequence, this might be a good fit for you.
Edit:
If you stored the first record they started with as well, you could still generate the list of all things seen. It would be everything between that value and last viewed.
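A rough sketch of the cursor idea, with an in-memory list standing in for the randomized table and the list index standing in for the surrogate key. Both function names are made up, and a real version would read the next rows from the table ordered by that key and store the new cursor value on the customer record.

```python
import random


def build_shuffled_catalog(product_ids, seed=None):
    """Shuffle the product ids once; the list index plays the surrogate key."""
    catalog = list(product_ids)
    random.Random(seed).shuffle(catalog)
    return catalog


def next_batch(catalog, cursor, size=10):
    """Return the next `size` products after `cursor`, wrapping at the end,
    together with the new cursor value to store for the customer."""
    count = min(size, len(catalog))
    batch = [catalog[(cursor + offset) % len(catalog)] for offset in range(count)]
    new_cursor = (cursor + count) % len(catalog) if catalog else 0
    return batch, new_cursor
```

Each login is then one ordered read of ten rows plus one update of the cursor, regardless of how many products the customer has already seen.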
How about doing this: create a collection prodUser where you have just the id of the product and the list of customer ids (those who have seen that product).
{
    prodID : 1,
    userID : []
}
When a customer logs in, you find 10 prodIDs that have not been assigned to that user:
db.prodUser.find({
    userID : {
        $nin : [yourUser]
    }
}).limit(10)
(For some reason $not is not working for me and I have not had time to figure out why; if you do, please let me know.) After showing the person their products, you can update the prodUser collection. To mitigate MongoDB's inability to find random elements, you can insert elements in random order and just take the first 10.
Everything should work really fast.
I need to schedule events, tasks, appointments, etc. in my DB. Some of them will be one-time appointments, and some will be recurring "To-Dos" which must be checked off. After looking at Google's calendar layout and others, plus doing a lot of reading, here is what I have so far.
Calendar table (Could be called schedule table I guess): Basic_Event Title, start/end, reoccurs info.
Calendar occurrence table: ties to schedule table, occurrence specific text, next occurrence date / time????
Looked here at how SQL Server does its jobs: http://technet.microsoft.com/en-us/library/ms178644.aspx
but this is slightly different.
Why two tables: I need to track status of each instance of the reoccurring task. Otherwise this would be much simpler...
so... on to the questions:
1) Does this seem like the proper way to go about it? Is there a better way to handle the multiple occurrence issue?
2) How often / how should I trigger creation of the occurrences? I really don't want to create a bunch of occurrences... BUT... What if the user wants to view next year's calendar...
Makes sense to have your schedule definition for a task in one table and then a separate table to record each instance of that separately - that's the approach I've taken in the past.
And with regards to creating the occurrences, there's probably no need to create them all up front. Especially when you consider tasks that repeat indefinitely! Again, the approach I've used in the past is to only create the next occurrence. When that instance is actioned, the next instance is then calculated and created.
This leaves the issue of viewing future occurrences. For this, you can start off with the initial/next scheduled occurrence and just calculate the future occurrences on-the-fly at display time.
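Calculating occurrences at display time can be as small as the sketch below. It assumes the simplest possible recurrence, "every N days from a first date"; real schedules (monthly, weekdays only and so on) need more rules, and the function name and parameters are invented for the example.

```python
from datetime import date, timedelta


def occurrences(first, every_days, range_start, range_end):
    """Yield the occurrence dates that fall inside [range_start, range_end]."""
    if every_days <= 0:
        raise ValueError("every_days must be positive")
    current = first
    if range_start > first:
        # Jump straight to the first occurrence on or after range_start
        # instead of stepping through every earlier one.
        steps = -(-(range_start - first).days // every_days)  # ceiling division
        current = first + timedelta(days=steps * every_days)
    while current <= range_end:
        yield current
        current += timedelta(days=every_days)
```

Viewing next year's calendar then costs one pass over the schedule definitions for the visible range, with nothing written to the occurrence table until an instance is actually actioned.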
While this isn't an exact answer to your question I've solved this problem before in SQL Server (though database here is irrelevant) by modeling a solution based on Unix's cron.
Instead of string parsing we used integer columns in a table to store the various time units.
We had events which could be scheduled; they could either point to a one-time schedule table that represented a distinct point in time (a date/time) or to the recurring schedule table which is modelled after cron.
Additionally remember to model your solution correctly. An event has a duration but the duration is unrelated to the schedule (but an event's duration may impact the schedule by causing conflicts). Do not try to model duration as part of your schedule.
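For what the integer columns might look like, here is a minimal sketch of a cron-like recurring schedule row, where an empty (None) column means "any value". The class name, the field names and the Monday = 0 convention (taken from Python's datetime.weekday()) are assumptions for the example rather than the schema described above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class RecurringSchedule:
    """One row of a cron-style schedule table; None acts as a wildcard."""
    minute: Optional[int] = None
    hour: Optional[int] = None
    day_of_month: Optional[int] = None
    month: Optional[int] = None
    day_of_week: Optional[int] = None  # 0 = Monday, as in datetime.weekday()

    def matches(self, when: datetime) -> bool:
        """True if every non-wildcard column equals the matching time unit."""
        pairs = (
            (self.minute, when.minute),
            (self.hour, when.hour),
            (self.day_of_month, when.day),
            (self.month, when.month),
            (self.day_of_week, when.weekday()),
        )
        return all(want is None or want == got for want, got in pairs)
```

For example, RecurringSchedule(minute=0, hour=9, day_of_week=0) describes "every Monday at 09:00". Duration is deliberately absent, in line with the advice to keep it out of the schedule.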
In the past when we've done this, we had 2 tables:
1) Schedules -> Includes recurrence information
2) Exceptions -> Edit/changes to specific instances
Using SQL, it's possible to get the list of "Schedules" that have at least one instance in a given date range. Then you can expand in the GUI where each instance lies.
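The expansion step with exceptions applied might look something like the sketch below. It assumes a simple "every N days" recurrence and represents the Exceptions table as a mapping from the original date to either a replacement date or None for a cancelled instance; all names are invented for the example.

```python
from datetime import date, timedelta


def expand_instances(first, every_days, range_start, range_end, exceptions):
    """Expand one schedule into dated instances, applying per-instance edits.

    exceptions maps an original occurrence date to its replacement date,
    or to None when that instance was cancelled.
    """
    if every_days <= 0:
        raise ValueError("every_days must be positive")
    instances = []
    current = first
    while current <= range_end:
        actual = exceptions.get(current, current)
        if actual is not None and range_start <= actual <= range_end:
            instances.append(actual)
        current += timedelta(days=every_days)
    return sorted(instances)
```

Note that this simple version only looks at original dates up to range_end, so an instance moved into the range from a later date would be missed.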