I have a database of about 100 million customer relation records and 3 million distinct clients.
I need to write a SQL script to work out which clients have registered 5 or more complaints within 30 days of each other, at any point in the client's history.
I thought that a window function would be the answer, but I haven't had any luck.
Any ideas would be useful, but efficient ones would be even better, as I have low system priority and my code takes hours to run.
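One way a window function could express this, sketched here with Python's built-in sqlite3 so it can be run as-is: if a complaint and the complaint four rows earlier (per client, ordered by date) are at most 30 days apart, then those five complaints fall inside one 30-day span. The table name `complaints` and the columns `client_id` and `complaint_date` are assumptions, not taken from the question, and `julianday` is SQLite-specific; other engines have their own date-difference function.

```python
import sqlite3


def clients_with_complaint_bursts(conn, min_count=5, window_days=30):
    """Return ids of clients with min_count complaints inside any window_days span."""
    # Compare every complaint with the one (min_count - 1) rows before it.
    offset = int(min_count) - 1
    sql = f"""
        SELECT DISTINCT client_id
        FROM (
            SELECT client_id,
                   complaint_date,
                   LAG(complaint_date, {offset}) OVER (
                       PARTITION BY client_id ORDER BY complaint_date
                   ) AS earlier_date
            FROM complaints
        )
        WHERE earlier_date IS NOT NULL
          AND julianday(complaint_date) - julianday(earlier_date) <= ?
        ORDER BY client_id
    """
    return [row[0] for row in conn.execute(sql, (window_days,))]


def demo_connection():
    """Build a tiny in-memory table (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE complaints (client_id INTEGER, complaint_date TEXT)")
    rows = [
        # client 1: five complaints within 29 days
        (1, "2023-01-01"), (1, "2023-01-05"), (1, "2023-01-10"),
        (1, "2023-01-20"), (1, "2023-01-30"),
        # client 2: five complaints spread over 40 days
        (2, "2023-01-01"), (2, "2023-01-10"), (2, "2023-01-20"),
        (2, "2023-01-30"), (2, "2023-02-10"),
        # client 3: only four complaints
        (3, "2023-03-01"), (3, "2023-03-02"), (3, "2023-03-03"), (3, "2023-03-04"),
    ]
    conn.executemany("INSERT INTO complaints VALUES (?, ?)", rows)
    return conn


if __name__ == "__main__":
    print(clients_with_complaint_bursts(demo_connection()))
```

This needs only one ordered pass per client, so on a large table an index on (client_id, complaint_date) would probably help the sort.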
Imagine I'm a bank and I want a dashboard that shows our "top spenders" with data updating throughout the day.
Currently I query the database for all our customer IDs, and pass all those IDs to a service which calculates how much they've spent today. If I have 10,000 customers, it must make 10,000 calculations.
Then I pick the Top 10 and show them in the dashboard. 9,990 calculations were useless, but we can't know which ones until they're complete.
Is there some way to improve the performance? I can't pre-calculate because customers are making new purchases all the time, and the list should be dynamic.
What if the results are very consistent day-to-day? As in, the Top 10 spenders are almost always in yesterday's Top 20. We could store the prior day's top spenders, and only calculate those 20 to find the Top 10, but if someone jumps from #25 to #9, they would be missing from our Top 10 if we use that algorithm.
Any advice is appreciated!
Instead of asking another service for all customers, have this service subscribe to customer created/deleted events from that service and store the customer info in its own database. It then no longer needs to inefficiently query all the customers every time it computes the top spenders.
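A minimal sketch of that idea, assuming the events arrive as plain method calls. The class and method names are hypothetical, and the purchase event is an extra assumption that goes beyond the created/deleted events mentioned above: with it, the running totals stay current and the Top 10 is a cheap selection rather than 10,000 recalculations.

```python
import heapq
from collections import defaultdict


class TopSpenders:
    """Local store kept up to date by events (all event names are hypothetical)."""

    def __init__(self):
        self.customers = set()
        # Amounts are kept as integer cents to avoid float rounding.
        self.spent_today = defaultdict(int)

    def on_customer_created(self, customer_id):
        self.customers.add(customer_id)

    def on_customer_deleted(self, customer_id):
        self.customers.discard(customer_id)
        self.spent_today.pop(customer_id, None)

    def on_purchase(self, customer_id, amount_cents):
        # Assumption: purchases are published as events too.
        if customer_id in self.customers:
            self.spent_today[customer_id] += amount_cents

    def start_new_day(self):
        self.spent_today.clear()

    def top(self, n=10):
        # Selects the n largest totals without sorting every customer.
        return heapq.nlargest(n, self.spent_today.items(), key=lambda item: item[1])


if __name__ == "__main__":
    board = TopSpenders()
    for customer in ("a", "b", "c"):
        board.on_customer_created(customer)
    board.on_purchase("a", 5000)
    board.on_purchase("b", 3000)
    board.on_purchase("b", 4000)
    board.on_purchase("c", 1000)
    print(board.top(2))
```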
I have an application which calls the database multiple times to achieve one simple goal.
A little information about this application: in short, it scrapes data from a webpage and stores specific information from that page in a database. The important information in this query is player name, position, kill points and class:
Player name can change or remain the same every day.
Position: multiple players can sit at one position.
Kill points can increase or remain the same every day.
Class: there are only 2 possibilities for a given name, e.g. A can change to B or remain A (and the reverse), but cannot be C, D, E or F.
The player name can change on any given day, and position can also change depending on the kill point increase since the last update, which brings us back to the goal: search the database day by day, from the current date as far back as 2021-02-22, starting at the most recent entry for a player name and backtracking to the previous day to check whether that player name is still the same or has changed.
What is being used as the main reference for the change is the kill points. As the days go on, this number will either stay exactly the same or increase; it can never decrease.
So now onto the implementation of this application.
The first query which runs finds the most recent entry for the player name
SELECT TOP(1) * FROM [changes] WHERE [CharacterName]=#charname AND [Territory]=#territory AND [Archived]=0 ORDER BY [Recorded] DESC
It then continues to check the previous day's entries with the following query:
SELECT TOP(1) * FROM [changes] WHERE [Territory]=#territory AND [CharacterName]=#charname AND [Recorded]=#searchdate AND ([Class] LIKE '%{Class}%' OR [Class] LIKE '%{GetOpposite(Class)}%') AND [Archived]=0
If no results are found, the application then tries to find an alternative name with the following query:
SELECT TOP(5) * FROM [changes] WHERE [Kills] <= #kills AND [Recorded]='{Data.Recorded.AddDays(-1):yyyy-MM-dd}' AND [Territory]=#territory AND [Mode]=#mode AND ([Class] LIKE #original OR [Class] LIKE #opposite) AND [Archived]=0 ORDER BY [Kills] DESC
The aim of the query above is to get the top 5 entries that are the closest possible matches, and then cross-reference them with the day ahead:
SELECT COUNT(*) FROM [changes] WHERE [CharacterName]=#CharacterName AND [Territory]=#Territory AND [Recorded]=#SearchedDate AND [Archived]=0
When checking the day ahead, if the character name is not found there, it is considered to be the old player name for this specific character. Otherwise, if all 5 results have been checked and every one of them is present in the day-ahead search, the name is considered to be new to the table.
From the date this application started running up to today's date, that adds up to over 400 individual queries on the database to achieve one goal.
It is also worth noting that this table grows by 14,400 - 14,500 rows each and every day.
The overall question: is it possible to combine all these queries into fewer calls to the database, reducing the number of queries and improving performance?
What you can do to improve performance will be based on what parts of the application stack you can manipulate. Things to try:
Store Less Data - Database content retrieval speed is largely based on how well the database is ordered/normalized and how much data needs to be searched for each query. Managing a cache of previously scraped pages and only storing data when something has changed between the current scrape and the last one would mean fewer redundant requests to the db.
Separate specific classes of data - Separating data into dedicated tables would allow you to query a specific table for a specific character, etc... effectively removing one where clause.
Reduce time between queries - Fewer incoming concurrent requests means less resource contention and faster response times to prior requests.
Use another data structure - The only reason you're using top() is because you need data ordered in some specific way (most-recent, etc...). If you just used a code data structure that keeps the data ordered and still easily-query-able you could then perhaps offload some sql requests to this structure instead of the db.
The suggestions above are not exhaustive, but what you do to improve performance is largely a function of what in the application stack you have the ability to modify.
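As one illustration of the "use another data structure" suggestion, here is a sketch that assumes all rows for the territory, mode and class pair (non-archived) have been fetched with a single query and are then walked back day by day in memory. The function name and the tuple layout are hypothetical; the matching rule (up to 5 candidates with kills at or below the current value, keeping the first one that is absent from the day ahead) follows the description in the question.

```python
from collections import defaultdict
from datetime import date, timedelta


def trace_names(rows, name, start, stop):
    """Walk back from `start` to `stop`, following name changes.

    rows: iterable of (recorded: date, name: str, kills: int), already
    filtered to one territory/mode/class pair and to non-archived entries.
    Returns a list of (date, name_on_that_date), newest first.
    """
    by_day = defaultdict(dict)  # date -> {name: kills}
    for recorded, row_name, kills in rows:
        by_day[recorded][row_name] = kills

    current_name, current_kills = name, by_day[start][name]
    history = [(start, current_name)]
    day = start
    while day > stop:
        previous = day - timedelta(days=1)
        previous_rows = by_day.get(previous, {})
        if current_name in previous_rows:
            current_kills = previous_rows[current_name]
        else:
            # Closest matches: highest kills that do not exceed the current value.
            candidates = sorted(
                ((k, n) for n, k in previous_rows.items() if k <= current_kills),
                reverse=True,
            )[:5]
            # A candidate missing from the day ahead is taken as the old name.
            old = next((n for _, n in candidates if n not in by_day[day]), None)
            if old is None:
                break  # the name is new to the table
            current_name, current_kills = old, previous_rows[old]
        history.append((previous, current_name))
        day = previous
    return history


if __name__ == "__main__":
    d1, d2, d3 = date(2021, 2, 22), date(2021, 2, 23), date(2021, 2, 24)
    sample = [
        (d1, "Alicia", 90), (d1, "Bob", 45),
        (d2, "Alicia", 95), (d2, "Bob", 50),
        (d3, "Alice", 100), (d3, "Bob", 50),
    ]
    print(trace_names(sample, "Alice", d3, d1))
```

The database is then hit once per run instead of once per day checked, at the cost of holding the filtered rows in memory.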
I'm trying to find an efficient way to calculate the booked times for a user(object), given a list of free/available times for the same user\object.
I have an object that will return the "available" times for a given specific day. The duration/end time is fixed to 10 minutes.
Example Starting data:
12/23/2020 8:00 AM
12/23/2020 9:00 AM
12/23/2020 1:00 PM
In this case I need to generate the "unavailable" times and insert them into a database with a fairly simple schema:
start_date | end_date | start_time | end_time
The inserting is fairly trivial; I'm having a hard time determining the best way to calculate the unavailable timespans.
Using the example above, I would need to generate the following timespans:
12/23/2020 12:00 AM - 7:59 AM
12/23/2020 08:11 AM - 8:59 AM
12/23/2020 09:11 AM - 12:59 PM
12/23/2020 1:11 PM - 11:59 PM
Any frameworks or libraries that can do the heavy lifting on this for me? Is it possible to solve this problem without looping through the results and calculating all of the offsets?
To anyone asking "why" - I'm hooking together two legacy systems: one system returns the available appointments for a given date, and this needs to be plumbed into a system that needs the unavailable appointments for a given date.
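For reference, the complement can be computed in a single pass over the sorted start times, so a loop seems hard to avoid but stays cheap. The sketch below is only an illustration: the function name is hypothetical, and it copies the minute-level convention from the example above (a slot starting at 8:00 makes the next unavailable span begin at 8:11 and the previous one end at 7:59).

```python
from datetime import date, datetime, timedelta


def unavailable_spans(available_starts, day, slot_minutes=10):
    """Return (start, end) datetimes covering everything outside the available slots."""
    one_minute = timedelta(minutes=1)
    day_start = datetime(day.year, day.month, day.day)
    day_end = day_start + timedelta(hours=23, minutes=59)

    spans = []
    cursor = day_start  # first minute not yet accounted for
    for start in sorted(available_starts):
        if start > cursor:
            spans.append((cursor, start - one_minute))
        # Skip past the slot; max() keeps overlapping slots from moving backwards.
        cursor = max(cursor, start + timedelta(minutes=slot_minutes) + one_minute)
    if cursor <= day_end:
        spans.append((cursor, day_end))
    return spans


if __name__ == "__main__":
    slots = [
        datetime(2020, 12, 23, 8, 0),
        datetime(2020, 12, 23, 9, 0),
        datetime(2020, 12, 23, 13, 0),
    ]
    for begin, end in unavailable_spans(slots, date(2020, 12, 23)):
        print(begin.strftime("%m/%d/%Y %I:%M %p"), "-", end.strftime("%I:%M %p"))
```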
Well, first, I have written more tour booking systems than I can shake a stick at.
The one Rosetta stone that holds true?
You don't want to generate or have booking slots that are NOT being used in the system, PERIOD!!!
Thus you ONLY ever enter into the system a valid booking (start time and end time). And that start time should be a datetime column - this will VASTLY reduce the complexity of your queries. Given you have a date and a separate time? Well, then your queries will be more complex - I'll leave that to you.
Given the above? The simple logic to find a booking collision in ALL cases is this:
A collision occurs when:
RequestStartDate <= EndDate
and
RequestEndDate >= StartDate
Now in the above, I assume date values, or datetime values.
So if I want a list of any bookings for today?
RequestDTStart = 2020-12-23 9 AM
RequestDTEnd = 2020-12-23 5 PM
And thus any collision can be found with this:
strWhere = "dtRequestStartDate <= BookingEndDate" & _
" and dtRequestEndDate >= BookingStartDate"
Now, assuming .NET, the above would be something like this with parameters:
strWhere = "#dtRequestStart <= BookingEndDate" & _
" and #dtRequestEnd >= BookingStartDate"
So, the above would return all bookings for today, 9 am to 5 pm.
A REMARKABLY simple query! Now of course the above query could/would include the exam room, or hotel room or whatever as an additional criterion. But in ALL cases the above simple query returns ANY collision for that 9 am to 5 pm.
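To make the collision test concrete, here is a small runnable sketch using Python's built-in sqlite3. The table and column names (`bookings`, `room_id`, `start_time`, `end_time`) are assumptions for illustration; the WHERE clause is the same two comparisons as above, with the room added as an extra criterion. Timestamps are stored as ISO text, which compares correctly as strings.

```python
import sqlite3

COLLISION_SQL = """
    SELECT id
    FROM bookings
    WHERE room_id = :room_id
      AND :req_start <= end_time
      AND :req_end >= start_time
    ORDER BY start_time
"""


def find_collisions(conn, room_id, req_start, req_end):
    """Return ids of existing bookings that overlap the requested span."""
    params = {"room_id": room_id, "req_start": req_start, "req_end": req_end}
    return [row[0] for row in conn.execute(COLLISION_SQL, params)]


def demo_connection():
    """Build a tiny in-memory table (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bookings ("
        "id INTEGER PRIMARY KEY, room_id INTEGER, start_time TEXT, end_time TEXT)"
    )
    conn.executemany(
        "INSERT INTO bookings VALUES (?, ?, ?, ?)",
        [
            (1, 1, "2020-12-23 09:00", "2020-12-23 09:30"),
            (2, 1, "2020-12-23 10:00", "2020-12-23 10:40"),
            (3, 2, "2020-12-23 09:00", "2020-12-23 09:30"),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    # Any rows returned mean the requested span cannot be booked.
    print(find_collisions(conn, 1, "2020-12-23 09:15", "2020-12-23 10:10"))
```

Because the comparisons use <= and >=, a request that starts exactly when another booking ends also counts as a collision; strict < and > would allow back-to-back bookings.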
And the beauty of this system? As long as you never allow an overlap into the booking system, then you can book a 10 minute or a 20 minute or a 30 minute session as ONE entry in the database. I would thus not need to create 3x 10 minute slots.
So, this means you NEVER have to create booking slots. The whole system will and can be driven with a simple start + end booking record. And as noted, you can then book 1 hour, or 40 minutes. Your input (UI) can simply limit the time span to increments of 10 minutes - but that's the UI part.
Now, I suppose, to display things in 10 minute increments on a screen? Well, then you would have to submit 6 requests per hour to "display" the time slots. For a whole day, say 9 am to 5 pm, you would have to run 8 x 6 = 48 requests to get a list of 10 minute increments. But then again, you COULD just show the existing bookings for a day, and allow new bookings to be added - but don't allow one if there is an overlap.
So, as noted, the concept here is you don't really need "slots" in the database. I suppose you could try slots, but it makes the code a HUGE mess to deal with. If you ONLY ever store the start + end? Then I can, say, move the booking to another day by JUST changing the date. Or I can extend a booking from 10 minutes to say 20 or 40 minutes - and ONLY have to change the end time. As long as no overlap occurs with the above simple "test", then I can simply change the booking to be 40 minutes in length - and ZERO code to update multiple slots is required. And the same goes for reducing a booking from 40 minutes to 10 minutes. Again ONLY the end time need be reduced - a ONE row update in the database.
So if at all possible, I would dump the concept of having "slots" in the database. I might consider such a design if a booking was only ever 10 minutes. But if 10 or 20 or 30 is allowed, then you don't need to store ANY un-used slots in the database, but ONLY ever store a valid booked slot. Empty un-used time can thus ALSO be found with the above query. (if the query returns records - then you can't book).
So display of free time in some UI becomes more of a challenge, but showing bookings that span 10 or 20 or whatever minutes is far more easy, and as noted, you can even change a whole booking to a different room by a ONE row update of the room ID. If no collision occurs, then you allow this booking - and you achieve this result by ONLY updating one simple booking record that represents that start + end time.
And this means you also NEVER store the booking totals in the database - you query them!
I also found that if I, say, store any booking totals in the database? Well, with complex code, we always found that the totals often don't match perfectly. So then we wind up writing a routine to go through the data, sum up the totals and write those out.
But, if you never store any booked totals (say people on a bus, or people in a given hotel), then while the query for such a display is somewhat more difficult, it becomes dead simple to remove a person from, say, a tour by simply nulling out the tourID.
So, this display shows the above concepts in Action. And the available rooms in the hotel, people booked on bus, and even totals for "group tours" are ALL values NOT stored in the database:
So in the above, people booked on the bus, booked in rooms, and rooms used? All those values are NOT stored in the database. And no slots exist either. So if we have a bus, then we set the capacity of 46, but we do NOT create 46 slots to book into. So be it a bus, a hotel, a medical exam room? You don't create booking slots ahead of time, but simply insert bookings with a start + end, and then query against that concept.
So, to find a total on a bus (or say in an exam room), I query to find the total for that day. And if I want to move a group booking of 4 people from one bus to another? Then one FK update to the given bus they are on allows the whole system to "cascade" the existing values in the system. And the same goes for moving a person from exam room #1 to #5. You only have to update the FK value of the exam room. If no collisions occur, then this again is a one row update. If you have multiple exam rooms, and multiple slots, then what should be a simple one row update in the database becomes a whole hodge podge of now having to update multiple booking slots with whacks of code.
So you book "use" of resources "into" a "day", a "bus", a room, but it is the act of that booking that consumes the time slots - not that you pre-create records or timeslots for each "range". This thus allows you to leverage the relational database model, and reduce huge amounts of code - since you are not coding against "slots", but only against the fact that an exam room is open from 10 am to 4 pm. That available room for that day is thus ONLY ONE record you create in the system, and then you are free to book into that one day's room range. The bookings into that one room for the day can be 10 minutes, or 40 minutes - but it is ONLY one record being added to the database to achieve this goal (booking).
Regardless of the above, that simple collision query works for any collision (including a whole overlap, a request inside an existing span, or even just the end or start overlapping any booking). And that query is dead simple - and it works for all collisions. So I don't have a library to share, but that simple booking collision finder query can drive the whole system.
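A small sketch of "query the totals, don't store them", again with Python's built-in sqlite3. The schema (`buses`, `bookings`, `party_size`) is an assumption made up for the example: seats taken come from a SUM over the bookings, and moving a group is a one-row update of its FK.

```python
import sqlite3


def demo_connection():
    """Hypothetical schema: no stored totals and no pre-created seats."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE buses (id INTEGER PRIMARY KEY, capacity INTEGER NOT NULL);
        CREATE TABLE bookings (
            id INTEGER PRIMARY KEY,
            bus_id INTEGER REFERENCES buses(id),  -- NULL means not on any bus
            party_size INTEGER NOT NULL
        );
        INSERT INTO buses VALUES (1, 46), (2, 46);
        INSERT INTO bookings VALUES (1, 1, 4), (2, 1, 2), (3, 2, 10);
        """
    )
    return conn


def seats(conn, bus_id):
    """Return (taken, free) for a bus, computed from the bookings on demand."""
    capacity, taken = conn.execute(
        """
        SELECT b.capacity, COALESCE(SUM(k.party_size), 0)
        FROM buses b
        LEFT JOIN bookings k ON k.bus_id = b.id
        WHERE b.id = ?
        GROUP BY b.id, b.capacity
        """,
        (bus_id,),
    ).fetchone()
    return taken, capacity - taken


def move_booking(conn, booking_id, new_bus_id):
    """Moving a whole group is a single-row FK update; totals follow automatically."""
    conn.execute("UPDATE bookings SET bus_id = ? WHERE id = ?", (new_bus_id, booking_id))


if __name__ == "__main__":
    conn = demo_connection()
    print(seats(conn, 1), seats(conn, 2))
    move_booking(conn, 1, 2)  # the group of 4 changes bus
    print(seats(conn, 1), seats(conn, 2))
```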
So I'm developing a database for an agency that manages many relief staff.
Relief workers set their availability for each day in one of three categories (day, evening, night).
We also need to be able to set some part-time relief workers as busy on weekly, biweekly, and in one instance, on a 9-week rotation. Since we're already developing recurring patterns of availability here, we might as well also give the relief workers the option of setting recurring availability days.
We also need to be able to query the database, and determine if an employee is available for a given day.
But here's the gotcha - we need to be able to use change data capture. So I'm not sure if calculating availability is the best option.
My SQL prototype table looks like this:
TABLE Availability Day
employee_id_fk | workday (DATETIME) | day | eve | night (all booleans)| worksite_code_fk (can be null)
I'm really struggling to wrap my head around recurring events. I could create, say, a year's worth of availability days following a pattern in an 'x' day cycle. But how far ahead of time do we store information? I can see running into problems when we reach the end of the data set.
I was thinking of storing say, 6 months of information, then adding a server side task that runs monthly to keep the tables updated with 6 months of data, but my intuition is telling me this is a bad fix.
For absolute flexibility in the future, and to keep the data from bloating, my first thought would be something like:
Calendar Dimension Table - Make it for like 100 years or whatever you want, and have it include day-of-week information etc.
Time Dimension Table - Hours, minutes, every 15 minutes, whatever, but only for a 24 hour period.
Shifts Table - 1 record per shift, e.g. Day, Evening, and Night.
Specific Availability Table - Relationship to Calendar & Time with starts & stops. I recommend 1 record per day, so even if they choose a range of 7 days, split that into 1 record per day and 1 record per shift.
Recurring Availability Table - For day of week (1-7), Month, WeekOfYear, whatever you can think of. But again I am thinking 1 record per value, so if they are available Mondays and Tuesdays that would be 2 rows, and if multiple shifts then it would be multiple rows.
Now, and here is perhaps the weird part: I would put an Available column on the Specific and Recurring Availability tables, maybe make it a tinyint and store something like 0 not available, 1 available, 2 maybe available, 3 available with notice.
If you want to take into account Availability with Notice you could add columns for that too such as x # of days. If you want full flexibility maybe that becomes a related table too.
The queries would be complex but you could use a stored procedure or a table valued function to handle it fairly routinely.
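A rough sketch of how such a lookup might resolve availability without generating future rows. It is only an illustration: the names are hypothetical, and the recurrence is modelled as an anchor date plus a cycle length in days (7 for weekly, 14 for biweekly, 63 for a 9-week rotation), which is an assumption that differs slightly from the day-of-week columns suggested above. A specific entry for a date overrides any recurring rule.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, Optional, Tuple

NOT_AVAILABLE, AVAILABLE = 0, 1  # same coding idea as the Available column


@dataclass(frozen=True)
class RecurringRule:
    employee_id: int
    shift: str        # "day", "eve" or "night"
    anchor: date      # first date the rule applies to
    cycle_days: int   # 7 = weekly, 14 = biweekly, 63 = 9-week rotation
    available: int


def resolve_availability(
    employee_id: int,
    day: date,
    shift: str,
    specific: Dict[Tuple[int, date, str], int],
    recurring: Iterable[RecurringRule],
) -> Optional[int]:
    """Return the Available code for one shift, or None if nothing is recorded."""
    override = specific.get((employee_id, day, shift))
    if override is not None:
        return override  # a specific entry wins over any recurring pattern
    for rule in recurring:
        if (
            rule.employee_id == employee_id
            and rule.shift == shift
            and day >= rule.anchor
            and (day - rule.anchor).days % rule.cycle_days == 0
        ):
            return rule.available
    return None


if __name__ == "__main__":
    rules = [RecurringRule(7, "day", date(2024, 1, 1), 14, NOT_AVAILABLE)]
    overrides = {(7, date(2024, 1, 15), "day"): AVAILABLE}
    print(resolve_availability(7, date(2024, 1, 29), "day", overrides, rules))
```

Since nothing is materialized ahead of time, there is no "end of the data set" to run into; only the rules and the specific overrides are stored and change-tracked.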
This is probably a fork in the road question. I have a journal blog that date stamps a continuation of a single field within a record.
Example:
Proj #1 (ID): Notes (memo field:) 10/12/2012 - visited site. 10/11/2012 - updated information. 10/11/2012 - call client. 10/10/2012 - Input information.
Proj #2 (ID): Notes (memo field:) 10/10/12 - visited site. 10/10/2012 - call client. 10/9/2012 - Input information. 10/1/2012 - Started project. etc etc...
I need to count how many updates were made over a specific time frame. I know I can create a hidden field and add + 1 every time there is an update, which is useful for an OVERALL update count... but how can I keep track of the number of updates over the last 5 days? As in the example above, you may update it twice in one day, and I may not care about updates made 2 weeks ago.
I think I need to create an SQL query that counts the number of "dates" since 10/10/12 or since 10/2/12 etc.
I have done the SQL: SELECT memo FROM Projects WHERE memo IN ('%10/10/12%', '%10/9/2012%' etc)
and then Len(memoStringCombined) - Len(Replace(searchword""etc)/Len(searchword), and it works fine for counting a single date... but if I have to count multiple dates over 30 days it gets quite cumbersome to keep rewriting each search word. Is there a regex or obj that can loop through this for me?
Otherwise any other suggestions for counting updates between time frames would be greatly appreciated.
BTW - I can't really justify creating a new table dedicated to tracking updates because there will be 100's of updates for close to 10,000 records which means the update tracking table will be more monstrous than the data... or am I wrong with that idea too?
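For illustration, a regex can pull every date stamp out of a memo in one go, after which counting inside any time frame is a simple comparison. This is a sketch under stated assumptions: the function names are hypothetical, dates are read as month/day/year, two-digit years are taken to mean 20xx, and a malformed stamp such as 13/45/2012 would raise a ValueError.

```python
import re
from datetime import date

# Matches stamps like 10/9/2012 or 10/10/12 (month/day/year assumed).
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4}|\d{2})\b")


def memo_dates(memo):
    """Return every date stamp found in the memo text, in order of appearance."""
    found = []
    for month, day, year in DATE_PATTERN.findall(memo):
        year = int(year)
        if year < 100:
            year += 2000  # assumption: two-digit years mean 20xx
        found.append(date(year, int(month), int(day)))
    return found


def count_updates(memo, since, until=date.max):
    """Count the date stamps that fall within [since, until]."""
    return sum(since <= stamp <= until for stamp in memo_dates(memo))


if __name__ == "__main__":
    memo = (
        "10/10/12 - visited site. 10/10/2012 - call client. "
        "10/9/2012 - Input information. 10/1/2012 - Started project."
    )
    print(count_updates(memo, since=date(2012, 10, 9)))
```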