I am using SQL Server 2005.
I have a site that people can vote on awesome motorcycles. Each time a user votes, there is one for the first bike and one vote against the second bike. Two votes are stored in the database. The vote table looks like this:
VoteID VoteDate BikeID Vote
1 2012-01-12 123 1
2 2012-01-12 125 0
3 2012-01-12 126 0
4 2012-01-12 129 1
I want to tally the votes for each bike quite frequently, say each hour. My idea is to store the tally as a percentage of contest won versus lost on the bike table as an attribute of the bike. So, if a bike won 10 contests and lost 20 contest, they would have a score (tally) of 33. I would tally up daily, weekly, and monthly scores.
BikeID BikeName DailyTally WeeklyTally MonthlyTally
1 Big Dog 5 10 50
2 Big Cat 3 15 40
3 Small Dog 9 8 0
4 Fish Face 19 21 0
Right now, there are about 500 votes per day being cast. We anticipate 2500 - 5000 per day in the next month or so.
What is the best way to tally the data and what is the best way to store it? Should the tallies be on their own table? Should a trigger be used to run a new tally each time a bike is voted on? Should a stored procedure be run hourly to get all tallies?
Any ideas would be very helpful!
Store your VoteDate as a datetime value instead of just date.
For your tallies, you can just make that a view and calculate it on the fly. This should be very simple to do using GROUP BY and DATEPART functions. If you need exact code for how to do this, please open a new question.
For that low volume of rows it doesn't make any sense to store aggregations in a table when you can just calculate them whenever you want to see them and get accurate and immediate results that are up-to-date.
I agree with #JNK try a view or just a normal stored proc to calculate the outputs on the fly. If you find it becomes too slow as your data grows I would investigate other routes then (like caching the data in another table etc). Probably worth keeping it simple to start with; you can always resuse the logic from the SP/VIEW later if you do want to setup a scheduled task.
Edit :
Removed the index view as per #Damien_The_Unbeliever comments its not deterministic and i'm stupid :)
Related
I am trying to improve the performance of updating only about 60K rows with data coming from different rows in the same table. At about 2 minutes, it's not terrible, but it's not great either, and my application really doesn't work if you have to wait so long between recalculations.
The app generates a set of financial statements for a business, where it calculates basic formulas on 1300 line items, like Rent, or Direct Labor, or Inventory costs, all of which roll up to totals that mimic the Balance Sheet, P&Ls, Cash Flow etc. Many of the line items need to calculate on a month by month basis, where for instance it has figure out April's On Hand Inventory before knowing what April's Inventory Value is. So the total program ends up looping through 48 months over 30 calculation passes, requiring about 8000 SQL statements. (fortunately it figures it all out by itself!) Each SQL is taking only a few milliseconds, but it adds up.
I'm pretty sure I can't reduce the number of loops, so I keep trying to figure out how to make each SQL quicker. The basic structure is as follows:
LI: Line item table that holds the basic info of each item, primary key LID
LID Name
123 Sales_1
124 Sales_2
200 Total Sales
Formula: Master/Detail tables that create any formula from the line items
Total sales=Sales_1 + Sales_2
or
{200}={123}+{124}
(I use curly braces to be able to find and replace the LIDs within the formula, as shown in the SQL below)
FC: Formula Calculation table: all line items by month, about 1300 items x 48 months=62K records. Primary key FID
FID SQL_ID LID LID_brace LIN OutputMonth Formula Amount
3232 25 123 {123} Sales_1 1 1200
3255 26 124 {124} Sales_2 1 1500
5454 177 200 {200} Total Sales 1 {123}+{124}
DMO:Operand Join table, which links a formula to its detail lines within the same table, so once Sales_1 is calculated, it can find the Total Sales record and update it, which then will evaluate then send its amount up the chain to the other LIDs that depend on it, such as Total Income. It locates the record to update based on the SQL_id, which is set based on the calc pass and month. Its complex to setup, but pretty straightforward once you actually run things
Master_FID Detail_FID
5454 3232 (links total sales to sales_1)
5454 3255 (links total sales to sales_2)
SQL1:
Update FC inner join DMO on FC.FID=DMO.Master_FID inner join FC2 on DMO.Detail_FID=FC2.FID set FC.formula=replace(FC.Formula,FC.LID_brace,FC2.Amount) where FC.sql_id=177
The above will change {123} + {124} to 1200+1500 which will then evaluate to 2700 when I run the following
SQL2:
UPDATE FC SET FC.amount = Eval([fc].[formula]) WHERE (((FC.calc_sql_id)=177 )
So those two sql statements are run over and over again, with the only thing changing is the SQL_id.
There are indexes on the SQL_ID, LID, FID etc
When measuring, the milliseconds per record can range from .04ms if there are many records included (~10K for some passes), up to 10 or 15 ms for just one record updated. Perhaps it is the setup of the query causing a whole lot of overhead time, because it doesn't seem to be a function of the actual number of records updated? Also its not very consistent, where some runs have 20+ ms compared to less than 3ms when it runs it again.
I know this is a complex question i'm asking that probably doesn't have a simple answer, but I'm just looking for directions for what might help. For instance, a parameter query if there isn't a whole lot of change between runs? Does Access have a better time of running a query if knows about it in advance, i.e a named query with parameters vs dynamic SQL? Am I just doomed because it still needs to run those 8000 queries?
Also, is there inherently a problem with trying to update the same table through a secondary join table, and/or is there a better way to do it?
Is it also because string replacing isn't efficient this way? If I tried RegEx would that be quicker? I would have to make a function that could do that within a query, but it seems like that's going to be slower.
Thanks in advance, this has been a most vexing problem!!!
I have a list of about 800 sales items that have a rating (from 1 to 5), and the number of ratings. I'd like to list the items that are most probable of having a "good" rating in an unbiased way, meaning that 1 person voting 5.0 isn't nearly as good as 50 people having voted and the rating of the item being a 4.5.
Initially I thought about getting the smallest amount of votes (which will be zero 99% of the time), and the highest amount of votes for an item on the list and factor that into the ratings, giving me a confidence level of 0 to 100%, however I'm thinking that this approach would be too simplistic.
I've heard about Bayesian probability but I have no idea on how to implement it. My list of items, ratings and number of ratings is on a MySQL view, but I'm parsing the code using Python, so I can make the calculations on either side (but preferably at the SQL view).
Is there any practical way that I can normalize this voting with SQL, considering the rating and number of votes as parameters?
|----------|--------|--------------|
| itemCode | rating | numOfRatings |
|----------|--------|--------------|
| 12330 | 5.00 | 2 |
| 85763 | 4.65 | 36 |
| 85333 | 3.11 | 9 |
|----------|--------|--------------|
I've started off trying to assign percentiles to the rating and numOfRatings, this way I'd be able to do normalization (sum them with an initial 50/50 weight). Here's the code I've attempted:
SELECT p.itemCode AS itemCode, (p.rating - min(p.rating)) / (max(p.rating) - min(p.rating)) AS percentil_rating,
(p.numOfRatings - min(p.numOfRatings)) / (max(p.numOfRatings) - min(p.numOfRatings)) AS percentil_qtd_ratings
FROM products p
WHERE p.available = 1
GROUP BY p.itemCode
However that's only bringing me a result for the first itemCode on the list, not all of them.
Clearly the issue here is the low number of observations your data has. Implementing Bayesian's method is the way to go because it provides great probability distribution for applications involving ratings especially if there is limited observations, and it easily decides the future likelihood ratio based on given parameters (this article provides an excellent explanation about Bayesian probability for beginners).
I would suggest storing your data in CSV files so it becomes easier to manipulate in python. Denormalizing the data via joins is the first task to do before analyzing your ratings.
This is Bayesian's simplified formula to use in your python code:
R – Confidence level aka number of observations
v – number of votes for a single product
C – avg vote for all products
m - tuneable parameter aka cutoff number required for votes to be considered (How many votes do you want displayed)
Since this is the simplified formula, this article explains how its been derived from its original formula. This article is helpful too in explaining the parameters.
Knowing the formula pretty much gets 50% of your work done, the rest is just importing your data and working with it. I provided below examples similar to your problem in case you need full demonstration:
Github example 1
Github example 2
I realize that referring to these as dimension and fact tables is not exactly appropriate. I am at a lost for better terminology, so please excuse this categorization that I use in the post.
I am building an application for employee record keeping.
The database will contain organizational information. The information is mostly defined in three tables: Locations, Divisions, and Departments. However, there are others with similar problems. First, I need to store the available values for these tables. This will allow for available values in the application when managing an employee and for management of these values when adding/deleting departments and such. For instance, the Locations table may look like,
LocationId | LocationName | LocationStatus
1 | New York | Active
2 | Denver | Inactive
3 | New Orleans | Active
I then need to store these values for each employee and keep their history. My first thought was to create LocationHistory, DivisionHistory, and DepartmentHistory tables. I cannot pinpoint why, but this struck me as poor design. My next inclination was to create a DimLocation/FactLocation, DimDivision/FactDivision, DimDepartment/FactDepartment set of tables. I do not believe this makes sense either. I have also considered naming them as a combination of Employee, i.e. EmployeeLocations, EmployeeDivisions, etc. Regardless of the naming convention for these tables, I imagine that data would look similar to a simplified version I have below:
EmployeeId | LocationId | EffectiveDate | EndDate
1 | 3 | 2008-07-01 | NULL
1 | 2 | 2007-04-01 | 2008-06-30
I realize any of the imagined solutions I described above could work, but I am really looking to create a design that will be easy for others to maintain with an intuitive, familiar structure. I would like to receive this community's help, opinions, and experience with this matter. I am open to and would welcome any suggestion to consider. For instance, should I even store the available values for these three tables in the database? Should they be maintained in the application code/business logic layer? Do I just need to get over seeing the word History repeating three times?
Thanks!
Firstly, I see no issue in describing these as Dimension and Fact tables outside of a warehouse :)
In terms of conceptualising and understanding the relationships, I personally see the use of start/end dates perfectly easy for people to understand. Allowing Agent and Location fact tables, and then time dependant mapping tables such as Agent_At_Location, etc. They do, however, have issues worthy of taking note.
If EndDate is 2008-08-30, was the employee in that location UP TO 30th August, or UP TO and INCLUDING 30th August.
Dealing with overlapping date periods in queries can give messy queries, but more importantly, slow queries.
The first one seems simply a matter of convention, but it can have certain implications when dealign with other data. For example, consider that an EndDate of 2008-08-30 means that they ARE at that location UP TO and INCLUDING 30th August. Then you join on to their Daily Agent Data for that day (Such as when they Actually arrived at work, left for breaks, etc). You need to join ON AgentDailyData.EventTimeStamp < '2008-08-30' + 1 in order to include all the events that happened during that day.
This is because the data's EventTimeStamp isn't measured in days, but probably minutes or seconds.
If you consider that the EndDate of '2008-08-30' means that the Agent was at that Location UP TO but NOT INCLDUING 30th August, the join does not need the + 1. In fact you don't need to know if the date is DAY bound, or can include a time component or not. You just need TimeStamp < EndDate.
By using EXCLUSIVE End markers, all of your queries simplify and never need + 1 day, or + 1 hour to deal with edge conditions.
The second one is much harder to resolve. The simplest way of resolving an overlapping period is as follows:
SELECT
CASE WHEN TableA.InclusiveFrom > TableB.InclusiveFrom THEN TableA.InclusiveFrom ELSE TableB.InclusiveFrom END AS [NetInclusiveFrom],
CASE WHEN TableA.ExclusiveFrom < TableB.ExclusiveFrom THEN TableA.ExclusiveFrom ELSE TableB.ExclusiveFrom END AS [NetExclusiveFrom],
FROM
TableA
INNER JOIN
TableB
ON TableA.InclusiveFrom < TableB.ExclusiveFrom
AND TableA.ExclusiveFrom > TableB.InclusiveFrom
-- Where InclusiveFrom is the StartDate
-- And ExclusiveFrom is the EndDate, up to but NOT including that date
The problem with that query is one of indexing. The first condition TableA.InclusiveFrom < TableB.ExclusiveFrom could be be resolved using an index. But it could give a Massive range of dates. And then, for each of those records, the ExclusiveDates could all be just about anything, and certainly not in an order that could help quickly resolve TableA.ExclusiveFrom > TableB.InclusiveFrom
The solution I have previously used for that is to have a maximum allowed gap between InclusiveFrom and ExclusiveFrom. This allows something like...
ON TableA.InclusiveFrom < TableB.ExclusiveFrom
AND TableA.InclusiveFrom >= TableB.InclusiveFrom - 30
AND TableA.ExclusiveFrom > TableB.InclusiveFrom
The condition TableA.ExclusiveFrom > TableB.InclusiveFrom STILL can't benefit from indexes. But instead we've limitted the number of rows that can be returned by searching TableA.InclusiveFrom. It's at most only ever 30 days of data, because we know that we restricted the duration to a maximum of 30 days.
An example of this is to break up the associations by calendar month (max duration of 31 days).
EmployeeId | LocationId | EffectiveDate | EndDate
1 | 2 | 2007-04-01 | 2008-05-01
1 | 2 | 2007-05-01 | 2008-06-01
1 | 2 | 2007-06-01 | 2008-06-25
(Representing Employee 1 being in Location 2 from 1st April to (but not including) 25th June.)
It's effectively a trade off; using Disk Space to gain performance.
I've even seen this pushed to the extreme of not actually storing date Ranges, but storing the actual mapping for each and every day. Essentially, it's like restricting the maximum duration to 1 day...
EmployeeId | LocationId | EffectiveDate
1 | 2 | 2007-06-23
1 | 2 | 2007-06-24
1 | 3 | 2007-06-25
1 | 3 | 2007-06-26
Instinctively I initially rebelled against this. But in subsequent ETL, Warehousing, Reporting, etc, I actually found it Very powerful, adaptable, and maintainable. I actually saw people making fewer coding mistakes, writing code in less time, the code ending up running faster, and being much more able to adapt to clients' changing needs.
The only two down sides were:
1. More disk space taken (But trival compared to the size of fact tables)
2. Inserts and Updates to this mapping was slower
The actual slow down for Inserts and Updates only actually matter Once, where this model was being used to represent a constantly changing process net; where the app wanted to change the mapping about 30 times a second. Even then it worked, it just chomped up more CPU time than was ideal.
If you want to be efficient and keep a history, do these things. There are multiple solutions to this problem, but this is the one that I keep going back to:
Remember that each row represents a single entity, if you make corrections that entity, that's fine, but don't re-use and ID for a new Location. Set it up so that instead of deleting a Location, you mark it as deleted with a bit and hide it from the interface, that way when it's referenced historically, it's still there.
Create a history table that includes the current value, or no records if a value isn't currently set. Have the foreign key tie back to the employee and tie to the location.
Create a column in the employee table that points to the current active location in the history. When you need to get the employees location, you join in the history table based on this ID. When you need to get all of the history for an employee you join from the history table.
This structure keeps it all normalized, and gives you an easy way to find the current value without having to do any date comparisons.
As far as using the word history, think of it in different terms: since it contains the current item as well as historical items, it's really just a junction table that keeps around the old item. As such you can name it something like EmployeeLocations.
Yeah, so I'm filling out a requirements document for a new client project and they're asking for growth trends and performance expectations calculated from existing data within our database.
The best source of data for something like this would be our logs table as we pretty much log every single transaction that occurs within our application.
Now, here's the issue, I don't have a whole lot of experience with MySql when it comes to collating cumulative sum and running averages. I've thrown together the following query which kind of makes sense to me, but it just keeps locking up the command console. The thing takes forever to execute and there are only 80k records within the test sample.
So, given the following basic table structure:
id | action | date_created
1 | 'merp' | 2007-06-20 17:17:00
2 | 'foo' | 2007-06-21 09:54:48
3 | 'bar' | 2007-06-21 12:47:30
... thousands of records ...
3545 | 'stab' | 2007-07-05 11:28:36
How would I go about calculating the average number of records created for each given day of the week?
day_of_week | average_records_created
1 | 234
2 | 23
3 | 5
4 | 67
5 | 234
6 | 12
7 | 36
I have the following query which makes me want to murderdeathkill myself by casting my body down an elevator shaft... and onto some bullets:
SELECT
DISTINCT(DAYOFWEEK(DATE(t1.datetime_entry))) AS t1.day_of_week,
AVG((SELECT COUNT(*) FROM VMS_LOGS t2 WHERE DAYOFWEEK(DATE(t2.date_time_entry)) = t1.day_of_week)) AS average_records_created
FROM VMS_LOGS t1
GROUP BY t1.day_of_week;
Halps? Please, don't make me cut myself again. :'(
How far back do you need to go when sampling this information? This solution works as long as it's less than a year.
Because day of week and week number are constant for a record, create a companion table that has the ID, WeekNumber, and DayOfWeek. Whenever you want to run this statistic, just generate the "missing" records from your master table.
Then, your report can be something along the lines of:
select
DayOfWeek
, count(*)/count(distinct(WeekNumber)) as Average
from
MyCompanionTable
group by
DayOfWeek
Of course if the table is too large, then you can instead pre-summarize the data on a daily basis and just use that, and add in "today's" data from your master table when running the report.
I rewrote your query as:
SELECT x.day_of_week,
AVG(x.count) 'average_records_created'
FROM (SELECT DAYOFWEEK(t.datetime_entry) 'day_of_week',
COUNT(*) 'count'
FROM VMS_LOGS t
GROUP BY DAYOFWEEK(t.datetime_entry)) x
GROUP BY x.day_of_week
The reason why your query takes so long is because of your inner select, you are essentialy running 6,400,000,000 queries. With a query like this your best solution may be to develop a timed reporting system, where the user receives an email when the query is done and the report is constructed or the user logs in and checks the report after.
Even with the optimization written by OMG Ponies (bellow) you are still looking at around the same number of queries.
SELECT x.day_of_week,
AVG(x.count) 'average_records_created'
FROM (SELECT DAYOFWEEK(t.datetime_entry) 'day_of_week',
COUNT(*) 'count'
FROM VMS_LOGS t
GROUP BY DAYOFWEEK(t.datetime_entry)) x
GROUP BY x.day_of_week
For ten years we've been using the same custom sorting on our tables, I'm wondering if there is another solution which involves fewer updates, especially since today we'd like to have a replication/publication date and wouldn't like to have our replication replicate unnecessary entries.I had a look into nested sets, but it doesn't seem to do the job for us.
Base table:
id | a_sort
---+-------
1 10
2 20
3 30
After inserting:
insert into table (a_sort) values(15)
An entry at the second position.
id | a_sort
---+-------
1 10
2 20
3 30
4 15
Ordering the table with:
select * from table order by a_sort
and resorting all the a_sort entries, updating at least id=(2,3,4)
will of course produce the desired output:
id | a_sort
---+-------
1 10
4 20
2 30
3 40
The column names, the column count, datatypes, a possible join, possible triggers or the way the resorting is done is/are irrelevant to the problem.Also we've found some pretty neat ways to do this task fast.
only; how the heck can we reduce the updates in the db to 1 or 2 max.
Seems like an awfully common problem.
The captain obvious in me thougth once "use an a_sort float(53), insert using a fixed value of ordervaluefirstentry+abs(ordervaluefirstentry-ordervaluenextentry)/2".
But this would only allow around 1040 "in between" entries - so never resorting seems a bit problematic ;)
You really didn't describe what you're doing with this data, so forgive me if this is a crazy idea for your situation:
You could make a sort of 'linked list' where instead of a column of values, you have a column for the 'next highest valued' id. This would decrease the number of updates to a maximum of 2.
You can make it doubly linked and also have a column for next lowest, which would bring the maximum number of updates to 3.
See:
http://en.wikipedia.org/wiki/Linked_list