How to optimally get sorted records, when the sort column requires computation from a secondary service? - sql

Imagine I'm a bank and I want a dashboard that shows our "top spenders" with data updating throughout the day.
Currently I query the database for all our customer IDs, and pass all those IDs to a service which calculates how much they've spent today. If I have 10,000 customers, it must make 10,000 calculations.
Then I pick the Top 10 and show them in the dashboard. 9,990 calculations were useless, but we can't know which ones until they're complete.
Is there some way to improve the performance? I can't pre-calculate because customers are making new purchases all the time, and the list should be dynamic.
What if the results are very consistent day-to-day? As in, the Top 10 spenders are almost always in yesterday's Top 20. We could store the prior day's top spenders, and only calculate those 20 to find the Top 10, but if someone jumps from #25 to #9, they would be missing from our Top 10 if we use that algorithm.
Any advice is appreciated!

Instead of asking another service for all customers, have this service subscribe to customer created/deleted events from that service, and have it store the customer info in its database. So, it will not need to inefficiently query all the customers every time to compute the top spenders.

Related

Reducing database load from consecutive queries

I have an application which calls the database multiple times to achieve one simple goal.
A little information about this application; In short, the application scrapes data from a webpage & stores specific information from this page into a database. The important information in this query is: Player name, Position. There can be multiple sitting at one specific position, kill points & Class
Player name has every potential to change or remain the same every day
Regarding the Position, there can be multiple sitting in one position
Kill points has the potential to increase or remain the same every day
Class, there is only 2 possibilities that a name can be, Ex: A can change to B or remain A (same in reverse), but cannot be C,D,E,F
The player name can change at any particular day, Position can also change dependent on the kill point increase from the last update which spins back around to the goal. This is to search the database day by day, from the current date to as far back as 2021-02-22 starting at the most recent entry for a player name and back track to the previous day to check if that player name is still the same or has changed.
What is being used as a main reference to the change is the kill points. As the days go on, this number will either be the exact same or increase, it can never decrease.
So now onto the implementation of this application.
The first query which runs finds the most recent entry for the player name
SELECT TOP(1) * FROM [changes] WHERE [CharacterName]=#charname AND [Territory]=#territory AND [Archived]=0 ORDER BY [Recorded] DESC
Then continue to check the previous days entries with the following query:
SELECT TOP(1) * FROM [changes] WHERE [Territory]=#territory AND [CharacterName]=#charname AND [Recorded]=#searchdate AND ([Class] LIKE '%{Class}%' OR [Class] LIKE '%{GetOpposite(Class)}%' AND [Archived]=0 )
If no results are found, will then proceed to find an alternative name with the following query:
SELECT TOP(5) * FROM [changes] WHERE [Kills] <= #kills AND [Recorded]='{Data.Recorded.AddDays(-1):yyyy-MM-dd}' AND [Territory]=#territory AND [Mode]=#mode AND ([Class] LIKE #original OR [Class] LIKE #opposite) AND [Archived]=0 ORDER BY [Kills] DESC
The aim of the query above is to get the top 5 entries that are the closest possible matches & Then cross references with the day ahead
SELECT COUNT(*) FROM [changes] WHERE [CharacterName]=#CharacterName AND [Territory]=#Territory AND [Recorded]=#SearchedDate AND [Archived]=0
So with checking the day ahead, if the character name is not found in the day ahead, then this is considered to be the old player name for this specific character, else after searching all 5 of the results and they are all found to be present in the day aheads searches, then this name is considered to be new to the table.
Now with the date this application started to run up to today's date which is over 400 individual queries on the database to achieve one goal.
It is also worth a noting that this table grows by 14,400 - 14,500 Rows each and every day.
The overall question to this specific? Is it possible to bring all these queries into less calls onto the database, reduce queries & improve performance?
What you can do to improve performance will be based on what parts of the application stack you can manipulate. Things to try:
Store Less Data - Database content retrieval speed is largely based on how well the database is ordered/normalized and just how much data needs to be searched for each query. Managing a cache of prior scraped pages and only storing data when there's been a change between the current scrape and the last one would guarantee less redundant requests to the db.
Separate specific classes of data - Separating data into dedicated tables would allow you to query a specific table for a specific character, etc... effectively removing one where clause.
Reduce time between queries - Less incoming concurrent requests means less resource contention and faster response times to prior requests.
Use another data structure - The only reason you're using top() is because you need data ordered in some specific way (most-recent, etc...). If you just used a code data structure that keeps the data ordered and still easily-query-able you could then perhaps offload some sql requests to this structure instead of the db.
The suggestions above are not exhaustive, but what you do to improve performance is largely a function of what in the application stack you have the ability to modify.

Best way to implement multiple continuous aggregates in postgres

Imagine you have to display information about rainfall based on cities over time.
You have tables the provides the details on how much it rains in a specific city for every hour. There is an endpoint that returns the average amount of rainfall for the timeframe/city requested.
(so imagine a table called rainfall_california, rainfall_texas, etc... I realize this schema isn't ideal for rainfall, but using it for an example.)
So instead of calculating the average on each request, I setup a continuous aggregate to calculate the average into a new view and have a policy to refresh the last hour of data once every hour.
ca_texas_rainfall_1_day
ca_texas_rainfall_7_day
ca_texas_rainfall_30_day
ca_california_rainfall_1_day
ca_california_rainfall_7_day
ca_california_rainfall_30_day
This works great and is super fast, but I'm a little confused on the best way to set it up. Should I have a different view for each continuous aggregate and each city? Wouldn't that result in a ton of different views? Or should I consolidate the average of each table into a single view?

Movable window of fixed length of time - SQL

I have a database of about 100 million customer relation records and 3 million distinct clients
I need to write a SQL script to work out which clients have registered 5 or more complaints within 30 days of each other, over the entire history of the client
I thought that a window function would be the answer but I haven't had any luck
Any ideas would be useful, but efficient ones would be even better, as I have low system priority and my codes takes hours to run.

Recurring Orders

Hi everyone I'm working on a school project, and for my project I chose to create an ecommerce system that can process recurring orders. This is for my final project, I'll be graduating in May with an associates in computer science.
Keep in mind this is no where a final solution and it's basically a jumping off point for this database design.
A little background on the business processes.
- Customer will order a product, and will specify during checkout whether it is a one time order or a weekly/monthly order.
- Customer will specify a location in which to pick up their order (this location is specific only to the order)
- If the value of the order > 25.00 then it is accepted otherwise it is rejected.
- This will populate the orders_test and order_products_test tables respectively
Person on the back end will have a report generated for deliveries for the day based on these two tables.
They will be able to print it off and it will generate a list of what items go to what location.
Based on the following criteria.
date_of_next_scheduled_delivery = current date
remaining_deliveries > 0
Once they are satisfied with the delivery list they will press "Process Deliveries" button.
This will adjust the order_products_test table as follows
Subtract 1 from remaining_deliveries
Insert current date into date_of_last_delivery_processed
Based on delivery_frequency (i.e. once, weekly, monthly) it will change the date_of_next_scheduled_delivery
status values in the order_products_test table can either be active, hold, or canceled, expired
I just would like some opinions if I am approaching this correctly or if I should scratch this approach and start over again.
A few thoughts, though not necessarily complete (there's a lot to your question, but hopefully these points help):
I don't think you need to keep track of remaining deliveries. You only have 2 options - a one time order, or a recurring order. In both cases, there's no sense in calculating remaining deliveries. It's never leveraged.
In terms of tracking the next delivery date, you can just keep track of the day of the order. If it's recurring -- monthly or weekly, regardless -- everything is calculable from that first date. Most DB systems (MySQL, SQL Server, Oracle, etc) support more than enough date computation flexibility so that you can calculate this on the fly, as opposed to maintaining such a known schedule.
If the delivery location is only specific to the order, I see no use in creating a separate table for it -- it's functionally dependent on the order, you should keep it in the same table as the order. For most e-commerce systems, this is not the case because they tend to associate a list of delivery locations with accounts, which they prompt you about when you order more than once (e.g., Amazon).
Given the above, I bet you can just get away with 2 of your 4 tables above -- Account and Order. But again, if delivery locations are associated with Accounts, I would indeed break that out. (but your question above doesn't suggest that)
Do not name your tables with a "_test" suffix -- it's confusing.

track sales for week/month and find the best sellers

Lets say I have a website that sells widgets. I would like to do something similar to a tag cloud tracking best sellers. However, due to constantly aquiring and selling new widgets, I would like the sales to decay on a weekly time scale.
I'm having problems puzzling out how store and manipulate this data and have it decay properly over time so that something that was an ultra hot item 2 months ago but has since tapered off doesn't show on top of the list over the current best sellers. What would be the logic and database design for this?
Part 1: You have to have tables storing the data that you want to report on. Date/time sold is obviously key. If you need to work in decay factors, that raises the question: for how long is the data good and/or relevant? At what point in time as the "value" of the data decayed so much that you no longer care about it? When this point is reached for any given entry in the database, what do you do--keep it there but ensure it gets factored out of all subsequent computations? Or do you archive it--copy it to a "history" table and delete it from your main "sales" table? This is relevant, as it has to be factored into your decay formula (as well as your capacity planning, annual reporting requirements, and who knows what all else.)
Part 2: How much thought has been given to the decay formula that you want to use? There's no end of detail you can work into this. Options and factors to wade through include but are not limited to:
Simple age-based. Everything before the cutoff date counts as 1; everything after counts as 0. Sum and you're done.
What's the cutoff date? Precisly 14 days ago, to the minute? Midnight as of two Saturdays ago from (now)?
Does the cutoff date depend on the item that was sold? If some items are hot but some are not, does that affect things? What if you want to emphasize some things (the expensive/hard to sell ones) over others (the fluff you'd sell anyway)?
Simple age-based decays are trivial, but can be insufficient. Time to go nuclear.
Perhaps you want some kind of half-life, Dr. Freeman?
Everything sold is "worth" X, where the value of X is either always the same or varies on the item sold. And the value of X can decay over time.
Perhaps the value of X decreased by one-half every week. Or ever day. Or every month. Or (again) it may vary depending on the item.
If you do half-lifes, the value of X may never reach zero, and you're stuck tracking it forever (which is why I wrote "part 1" first). At some point, you probably need some kind of cut-off, some point after which you just don't care. X has decreased to one-tenth the intial value? Three months have passed? Either/or but the "range" depends on the inherent valud of the item?
My real point here is that how you calculate your decay rate is far more important than how you store it in the database. So long as the data's there that the formalu needs to do it's calculations, you should be good. And if you only need the last month's data to do this, you should perhaps move everything older to some kind of archive table.
you could just count the sales for the last month/week/whatever, and sort your items according to that.
if you want you can always add the total amonut of sold items into your formula.
You might have a table which contains the definitions of the pointing criterion (most sales, most this, most that, etc.), then for a given period, store in another table the attribution of points for each of the criterion defined in the criterion table. Obviously, a historical table will be used to store the score for each sellers for a given period or promotion, call it whatever you want.
Does it help a little?