Slow MS Access Query (Using DSum & DCount Functions) - sql

I'm having an issue in Microsoft Access where my query runs extremely slowly (it takes hours and hours). The query reads a table that has 150,000 records, and each record belongs to one of 4,000 unique groups (called API_10).
The goal of the query is to calculate a running cumulative production value (organized by API_10 and Date) such that the running cumulative production starts over at each new API_10 group. Each record in the table has a field called No, an AutoNumber that MS Access generates so that the table has a primary key. An example of what I'm describing is shown below:
MyTable:
No API_10 Date Production
1 1 1/1/2010 1000
2 1 2/1/2010 500
3 2 7/1/2014 300
4 2 8/1/2014 400
MyQuery:
No API_10 Date Production Cumulative_Production
1 1 1/1/2010 1000 1000
2 1 2/1/2010 500 1500
3 2 7/1/2014 300 300
4 2 8/1/2014 400 700
Here is a sample of the code (typed in the Expression Builder in MS Access) used to create the Cumulative_Production column in MyQuery:
Cumulative_Production:
DSum("[Production]","[MyTable]","[API_10]='" & [API_10] & "' AND [No]<=" & [No])
Do note that this is a simplified version of the actual query/table. The real query also computes another field, Normalized_Prod_Month, which counts the number of production dates (starting at 1) for each unique API_10, as shown below:
NORMALIZED_PROD_MONTH:
DCount("[Date]","[MyTable]","[API_10]='" & [API_10] & "' AND [No]<=" & [No])
Any tips for improving these types of calculations would greatly help!!

If you apply this query to each record, then you must access n * (n + 1) / 2 records per group. If all 4,000 groups have about the same size of 38 records, you get 4000 * 38 * (38 + 1) / 2 ≈ 3 million accesses. And this is the best case, since larger groups have an over-proportional cost because of the quadratic nature of n * (n + 1) / 2.
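A quick back-of-the-envelope check (plain Python, illustrative only) comparing the record accesses of the per-record DSum approach with a single ordered pass:

```python
groups = 4000   # distinct API_10 values
n = 38          # records per group (150,000 / 4,000 is roughly 38)

# DSum re-scans every earlier record of the group for each row: n*(n+1)/2 per group
dsum_accesses = groups * n * (n + 1) // 2

# a single pass ordered by (API_10, Date) touches each record exactly once
single_pass_accesses = groups * n

print(dsum_accesses)         # 2964000 (~3 million)
print(single_pass_accesses)  # 152000
```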
You are better off by creating the running sum in a loop in VBA, and accessing each record only once.
Dim db As DAO.Database, rs As DAO.Recordset
Dim lastApi As Variant, runningSum As Long

Set db = CurrentDb
Set rs = db.OpenRecordset("SELECT * FROM MyTable ORDER BY API_10, [Date]")
Do Until rs.EOF
    If rs!API_10 <> lastApi Then    ' new group: restart the running sum
        runningSum = 0
        lastApi = rs!API_10
    End If
    runningSum = runningSum + rs!Production
    'TODO: insert the result into a temporary table
    rs.MoveNext
Loop
rs.Close: Set rs = Nothing
Set db = Nothing
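For reference, the same single-pass idea in Python (illustrative only, using the question's sample rows, assumed pre-sorted by API_10 and Date):

```python
# (API_10, Date, Production), assumed already sorted by (API_10, Date)
rows = [
    (1, "1/1/2010", 1000),
    (1, "2/1/2010", 500),
    (2, "7/1/2014", 300),
    (2, "8/1/2014", 400),
]

result = []
last_api, running = None, 0
for api, date, prod in rows:
    if api != last_api:          # new group: restart the running sum
        running, last_api = 0, api
    running += prod
    result.append((api, date, prod, running))

# cumulative column comes out as 1000, 1500, 300, 700
print([r[3] for r in result])
```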
Or use the following query. It still has a quadratic cost, but a single query almost always performs better than repeated calls to DCount, DSum, or DLookup.
SELECT
    A.API_10,
    A.[Date],
    A.Production,
    (SELECT Sum(B.Production)
     FROM MyTable B
     WHERE B.API_10 = A.API_10 AND B.[No] <= A.[No]) AS Cumulative_Production
FROM MyTable AS A
ORDER BY A.API_10, A.[Date];
This assumes that the No column is consistent with the date sequence. If the dates are unique within each group, you can also replace B.[No] <= A.[No] with B.[Date] <= A.[Date].

Related

Adding distinct sum and count columns based on another to datatable in VB.NET

I have an extra large DataTable (from a delimited string cell in my PostgreSQL DB) with ~40k rows. Example data columns:
invoice customer_id amount
1 1 150,50
2 1 149,50
3 2 50,50
4 3 49,50
I'm trying to add 2 columns to this DataTable. One should show the number of invoices (customer_id count), and the other the sum of amount for every customer, like this:
invoice customer_id amount invoice_count amount_total
1 1 150,50 2 300,00
2 1 149,50 2 300,00
3 2 50,50 1 50,50
4 3 49,50 1 49,50
Using this:
For i = 0 To dt.Rows.Count - 1
    Dim distinctDT As DataTable = dt.DefaultView.ToTable(True, "customer_id", "amount")
    distinctDT.DefaultView.RowFilter = "customer_id = " & dt.Rows(i).Item("customer_id")
    dt.Rows(i).Item("count") = distinctDT.DefaultView.Count
Next
works, but takes a very long time (the whole DataTable fills in about 2 hours!) because an auxiliary DataTable is created (I think) for every 'i'. In Postgres I could simply use count(customer_id) over(partition by customer_id) in the SELECT and group by customer_id, and my query results display in a few seconds.
Is it possible to solve this problem without creating distinct datatable and filtering it every 'i' ticking? Thanks in advance!
You could use the power of LINQ, in this case combined with a Lookup(Of TKey, TElement), which is similar to a dictionary. It is efficient and makes the code concise and easy to read:
Dim customerLookup = dt.AsEnumerable().ToLookup(Function(r) r("customer_id"))
For Each row As DataRow In dt.Rows
    Dim customerRows = customerLookup(row("customer_id"))
    row("count") = customerRows.Count()
    row("amount_total") = customerRows.Sum(Function(r) r.Field(Of Decimal)("amount"))
Next
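The lookup approach is language-independent; an equivalent sketch in Python (illustrative only, using the question's sample data) groups once and then reads the precomputed aggregates per row:

```python
from collections import defaultdict

# (invoice, customer_id, amount) from the question
rows = [(1, 1, 150.50), (2, 1, 149.50), (3, 2, 50.50), (4, 3, 49.50)]

amounts_by_customer = defaultdict(list)   # one grouping pass, O(n)
for invoice, customer, amount in rows:
    amounts_by_customer[customer].append(amount)

enriched = [
    (invoice, customer, amount,
     len(amounts_by_customer[customer]),            # invoice_count
     round(sum(amounts_by_customer[customer]), 2))  # amount_total
    for invoice, customer, amount in rows
]
print(enriched[0])  # (1, 1, 150.5, 2, 300.0)
```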

SQL Sampling based on the whole population

I have a population of records...let's say 10,000 athletes, grouped by sports, where (numbers below would be variable):
4,000 are from NBA
2,000 are from NHL
3,000 are from MLB
1,000 are from NFL
How can I build a sample query that will sample 100 records based on the population? Not fully random, but proportional, i.e. pull out:
NBA / Whole Population = X
SELECT TOP X * FROM MainTable WHERE League = 'NBA' (something like this)
40 names are from NBA
20 names are from NHL
30 names are from MLB
10 names are from NFL
This is just a sample of the population; the logic here is to calculate the ratios with regard to the whole population and then apply them to the sample size.
Regards
Consider using a count correlated subquery for a rank order that you then use as filtering criteria for sample ratio.
SELECT main.*
FROM
    (SELECT t.*,
            (SELECT Count(*) FROM MainTable sub
             WHERE sub.League = t.League AND sub.UniqueID <= t.UniqueID) AS Rank
     FROM MainTable t) AS main
WHERE main.Rank <= CInt((SELECT Count(*) FROM MainTable sub
                         WHERE sub.League = main.League) /
                        (SELECT Count(*) FROM MainTable) * 100)
ORDER BY main.League, main.Rank;
To explain the above query, with its nested subqueries and derived tables:
The derived table, main, is exactly the source MainTable with a new column called Rank that gives an ordinal count of records for each League. So the first NBA record (not necessarily the first row) is tagged rank 1; the next NBA record (which can appear anywhere, say the 89th row) is tagged 2; and so on for each League. And yes, Rank will go up to 4,000 if needed!
Once this Rank field is calculated, giving ordinal 1, 2, 3, ... indicators for each League grouping, we position this SELECT statement as a derived table in the FROM clause so that Rank can be used in a WHERE filter for the sample ratio. We cannot calculate a column and filter on it in the same SELECT.
The sample ratio is the last two subqueries, used for a quotient that calculates: (# of League records matching the current row / total # of table records). This value is then multiplied by 100, the sample quota. CInt is used to return integer values from possibly fractional ratios. Consider also Round(..., 0), which rounds instead of truncating the decimal part.
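The quota arithmetic described above can be sketched in Python (illustrative only, using the question's league counts):

```python
from collections import Counter

population = ["NBA"] * 4000 + ["NHL"] * 2000 + ["MLB"] * 3000 + ["NFL"] * 1000
sample_size = 100
total = len(population)

counts = Counter(population)
# each league's share of the sample = (league count / total) * sample size
quota = {league: int(n / total * sample_size) for league, n in counts.items()}

print(quota)                # {'NBA': 40, 'NHL': 20, 'MLB': 30, 'NFL': 10}
print(sum(quota.values()))  # 100
```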
Dim db As DAO.Database, cf As DAO.Recordset, samp As DAO.Recordset
Dim Leagues(1 To 4) As String
Dim sqql As String, x As Long, y As Long, i As Long

Leagues(1) = "NBA"
Leagues(2) = "NHL"
Leagues(3) = "MLB"
Leagues(4) = "NFL"

Set db = CurrentDb
Set samp = db.OpenRecordset("RANDOMSAMPLE")   ' open the target table once
For x = 1 To 4
    y = 0
    sqql = "SELECT * FROM MainTable WHERE League = '" & Leagues(x) & "'"
    Set cf = db.OpenRecordset(sqql)
    cf.MoveLast     ' populate RecordCount
    Do While y < (x * 1000)  ' adjust as necessary; this assumes 1000 from league 1, 2000 from league 2, etc.
        cf.MoveFirst
        i = Int(cf.RecordCount * Rnd)   ' random offset: 0 to RecordCount - 1
        cf.Move i
        With samp
            .AddNew
            .Fields("YourFieldHere") = cf![YourFieldHere]
            ' repeat for the other fields as necessary
            .Update
        End With
        y = y + 1
    Loop
    cf.Close
Next x
samp.Close

Row_Number() in Access select statement

I believe similar questions have been asked but I can't quite find a solution that works for me.
I've got a database that I use to sort through digitised books and their pages, and I'm trying to sort through several thousand pages that contain maps. Of the two tables I'm using, the first lists all the pages in a book and the order they occur in the book; it's got three columns (bookID, pageOrder, pageID), and each page has its own row. The second table lists all the places (on a map) that occur on each page; it has two columns (pageID, placeID), and if there are multiple places on one page, a new row is added to the table for each place.
What I need to do is create a select statement that gives every pageID/placeID combination a unique number but the numbers must go in the order they appear in the book. In SQL Server I would do this:
SELECT ROW_NUMBER() OVER(ORDER BY bp.bookID, bp.pageOrder, pp.placeID) AS uniqueNumber, pp.pageID, pp.placeID
FROM booksAndPages AS bp INNER JOIN pagesAndPlaces AS pp ON bp.pageID = pp.pageID
Unfortunately, I'm stuck using Access. Ideally I'd like to do it (if possible) with a single SQL statement, similar to the one above but I would also try it using VBA.
Any help is greatly appreciated.
In a database that supports window functions (such as SQL Server), this is the query that you want:
SELECT ROW_NUMBER() OVER (ORDER BY bp.bookID, bp.pageOrder, pp.placeID) AS uniqueNumber,
pp.pageID, pp.placeID
FROM booksAndPages AS bp INNER JOIN
pagesAndPlaces AS pp
ON bp.pageID = pp.pageID;
In Access, which lacks window functions, you can get the same result using a correlated subquery. More complicated and more expensive, but possible:
SELECT (select count(*)
from booksAndPages AS bp2 INNER JOIN
pagesAndPlaces AS pp2
ON bp2.pageID = pp2.pageID
where bp2.bookID < bp.bookID or
(bp2.bookID = bp.bookID and bp2.pageOrder < bp.pageOrder) or
(bp2.bookID = bp.bookID and bp2.pageOrder = bp.pageOrder and
pp2.placeID <= pp.placeID
)
) as uniqueNumber,
pp.pageID, pp.placeID
FROM booksAndPages AS bp INNER JOIN
pagesAndPlaces AS pp
ON bp.pageID = pp.pageID;
This assumes that the combination (bookID, pageOrder, placeID) is unique.
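What the correlated subquery computes, a rank obtained by counting rows that sort at or before the current one, can be sketched in Python (illustrative only, with made-up IDs; the chained OR conditions collapse to a tuple comparison):

```python
# (bookID, pageOrder, placeID) rows, hypothetical sample data
rows = [("b1", 2, "p3"), ("b1", 1, "p2"), ("b1", 1, "p1"), ("b2", 1, "p1")]

def rank(row):
    # count rows sorting at or before this one, mirroring the SQL's <= branches
    return sum(1 for other in rows if other <= row)

numbered = sorted((rank(r), r) for r in rows)
print(numbered)
# [(1, ('b1', 1, 'p1')), (2, ('b1', 1, 'p2')), (3, ('b1', 2, 'p3')), (4, ('b2', 1, 'p1'))]
```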
I know this is an old question, but this was a top search and I haven't seen any other solutions on the internet so I hope this will help others.
My solution works for any dataset regardless of if it has a unique identifier or not.
Add the following VBA code into a module:
Public row As Variant

Function RowNum(dummy) As Integer
    row = row + 1
    RowNum = row
End Function

Function GetRowNum(dummy) As Integer
    GetRowNum = row
End Function

Function ResetRowNum()
    row = 0
End Function
Now here's a sample query:
SELECT Table1.Field1, Table1.Field2, RowNum([Field1]) AS RowId,
       "Row: " & GetRowNum([Field1]) AS RowText
FROM Table1;
You can add any 'ORDER BY' or even 'GROUP BY' if you wish. You can use any field that will be in the query output as the input for RowNum and GetRowNum. Important to note: only use RowNum the first time you want the row number, and use GetRowNum every time after. This prevents one row from increasing the counter more than once.
The last thing you need to do is create a macro that runs ResetRowNum and run it after every query you use with this method, or if you're running a series of queries through a macro or VBA, make sure to run ResetRowNum after every query that uses these functions.
Also avoid datasheet view, as it seems to constantly recalculate the formulas when you scroll, making the numbers steadily increase.
Query to sort and/or group
SELECT Table1.Field1,
Table1.SomeDate,
Table1.Field2,
RowNumber([Field1]) AS RowId,
"Row: " & GetRowNum([Field1]) AS RowText
FROM Table1
ORDER BY Table1.Field1, Table1.SomeDate;
Field1 Field2 RowId RowText
James 2 1 Row: 1
James 35 2 Row: 2
James 6 3 Row: 3
James 86 4 Row: 4
James 67 5 Row: 5
James 35 6 Row: 6
Maria 4 1 Row: 1
Maria 54 2 Row: 2
Samuel 46 1 Row: 1
Samuel 32 2 Row: 2
Samuel 7 3 Row: 3
Thomas 43 1 Row: 1
Thomas 65 2 Row: 2
Thomas 5 3 Row: 3
Public StoredRowNumber As Variant
Public OldlastField As Variant

Function RowNumber(TheField) As Integer
    If OldlastField = TheField Then
        ' same group: keep counting
    Else
        ResetRowNum
    End If
    StoredRowNumber = StoredRowNumber + 1
    RowNumber = StoredRowNumber
    OldlastField = TheField
End Function

Function GetRowNum(TheField) As Integer
    GetRowNum = StoredRowNumber
End Function

Function ResetRowNum()
    StoredRowNumber = 0
    'OldlastField = Null
End Function
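The reset-per-group behaviour of RowNumber/ResetRowNum amounts to numbering within consecutive runs of the same value; a Python sketch with itertools.groupby (illustrative only, using the names from the sample output):

```python
from itertools import groupby

# rows assumed already sorted/grouped by Field1, as in the sample query
names = ["James"] * 6 + ["Maria"] * 2 + ["Samuel"] * 3 + ["Thomas"] * 3

numbered = []
for name, group in groupby(names):          # consecutive runs of the same value
    for i, _ in enumerate(group, start=1):  # counter restarts for each run
        numbered.append((name, i))

print(numbered[5])  # ('James', 6)
print(numbered[6])  # ('Maria', 1), counter reset at the group boundary
```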

SQL Access db - select every third row from database

How can I select every third row from the table?
if a table has
1
2
3
4
5
6
7
8
9
records,
it should pick up records 3, 6 and 9, regardless of what their data is.
Modulo is what you want...
Assuming contiguous values:
SELECT *
FROM Mytable
WHERE [TheColumn] Mod 3 = 0
And with gaps:
SELECT *
FROM Mytable
WHERE DCount("TheColumn", "Mytable", "TheColumn <= " & [TheColumn]) Mod 3 = 0
Edit: To exclude every 3rd record, ...Mod 3 <> 0
If it's SQL Server, you could use ROW_NUMBER and OVER (see this), then WHERE rownumvar % 3 = 0, but I'm not sure that works in Access.
Or you could load the table into a recordset and iterate through it, checking the index for Mod 3 = 0, if you're using any kind of code.
How about a Count() on a field that has unique members (id?), then Mod 3 on that.
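All three suggestions boil down to numbering the rows and testing with modulo 3; a Python sketch (illustrative only):

```python
ids = list(range(1, 10))  # contiguous values 1..9

# contiguous IDs: test the value itself
every_third_by_value = [i for i in ids if i % 3 == 0]
print(every_third_by_value)  # [3, 6, 9]

# IDs with gaps: test the row's position instead of its value
ids_with_gaps = [2, 5, 9, 11, 14, 20]
every_third_by_position = [v for pos, v in enumerate(ids_with_gaps, start=1) if pos % 3 == 0]
print(every_third_by_position)  # [9, 20]
```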

Problem sorting, grouping, and re-ordering a selection of records

Bottom line:
I have a Sub that should be re-ordering a group of records, but the query at the heart of it is not grouping and sorting the records as expected under rare, specific circumstances.
Background:
I'm developing an upgrade to a system for the education staff to post class information to our intranet. In the existing and in the upgraded system, the Classes_Dates table contains all the information related to the date, including a "Series" number.
The Series number was (and still is) used to group and sort dates, mostly to speed up the page generation on the front-end. Classes can have one or more (no limit) dates in a given series.
In the existing system, the series number is managed manually. Normally, it isn't an issue. Classes are entered in sequentially, in the order they occur. Occasionally, a class gets added in the middle of the chronological flow, and the staff will manually re-order the series numbers to properly group/sort the dates. It works, but is difficult for new staff to learn and existing staff to retain if they do not frequently use the system.
In the upgrade, I wrote a sub to automatically handle the re-ordering of the groups. I'm trying to keep the concept, but bury it so the staff don't need to be aware it still exists.
Here's the sub itself, called every time a new class date is added:
Sub ReorderGroups(intClassID)
    strSQL = "SELECT DateID, Series, ClassStart "
    strSQL = strSQL & "FROM Classes_Dates "
    strSQL = strSQL & "WHERE ClassID = " & intClassID & " "
    strSQL = strSQL & "GROUP BY Series, ClassStart, DateID "
    strSQL = strSQL & "ORDER BY ClassStart;"

    Dim objSQLDB : Set objSQLDB = CreateObject("ADODB.Command")
    objSQLDB.ActiveConnection = strSQLConn

    Dim objDates : Set objDates = Server.CreateObject("ADODB.Recordset")
    objDates.Open strSQL, strSQLConn, adOpenDynamic, adLockReadOnly, adCmdText

    If Not objDates.BOF Then objDates.MoveFirst
    If Not objDates.EOF Then
        Dim intNewSeries : intNewSeries = 1
        Dim intCurrentOld : intCurrentOld = CLng(objDates("Series"))
        Do Until objDates.EOF
            If intCurrentOld <> CLng(objDates("Series")) Then
                intNewSeries = CLng(intNewSeries) + 1
                intCurrentOld = CLng(objDates("Series"))
            End If
            objSQLDB.CommandText = "UPDATE Classes_Dates SET Series = " & intNewSeries & " WHERE DateID = " & objDates("DateID")
            objSQLDB.Execute , , adCmdText
            objDates.MoveNext
        Loop
    End If

    objDates.Close
    Set objDates = Nothing
    Set objSQLDB = Nothing
End Sub
I'm sure there's a more efficient way to write this, but my first concern was getting it working - then I may post it over to CodeReview.SE for some help with optimization.
The sub works great as long as there are not two series with overlapping dates. The following:
SELECT DateID, Series, ClassStart
FROM Classes_Dates
WHERE ClassID = 11
GROUP BY Series, ClassStart, DateID
ORDER BY ClassStart;
Is gathering this result set:
DateID Series ClassStart
------ ------ --------------
49 1 20100907080000
51 1 20100913080000
50 1 20100916080000
56 2 20100921080000
57 2 20100927080000
58 2 20100929080000
'-- snip --'
670 12 20110614080000
671 12 20110615080000
672 13 20110705080000
676 15 20110707080000
674 14 20110709090000
673 13 20110714080000
675 14 20110716080000
Instead of what I expected:
DateID Series ClassStart
------ ------ --------------
49 1 20100907080000
51 1 20100913080000
50 1 20100916080000
56 2 20100921080000
57 2 20100927080000
58 2 20100929080000
'-- snip --'
670 12 20110614080000
671 12 20110615080000
672 13 20110705080000
673 13 20110714080000
676 15 20110707080000
674 14 20110709090000
675 14 20110716080000
What do I need to fix in the SQL? Or is there a better way to get the same end result?
The latter would likely be better; now that I look at it again, I can see this is not going to scale well as time goes on...
I think you want:
SELECT DateID, Series, ClassStart
FROM Classes_Dates
WHERE ClassID = 11
GROUP BY Series, ClassStart, DateID
ORDER BY MIN(ClassStart) OVER(PARTITION BY Series)
, ClassStart
Note that if the (Series, ClassStart, DateID) is a unique key in this table, then you don't even need the GROUP BY:
SELECT DateID, Series, ClassStart
FROM Classes_Dates
WHERE ClassID = 11
ORDER BY MIN(ClassStart) OVER(PARTITION BY Series)
, ClassStart
And just to catch the (probably rare) case where two Series have the same MIN(ClassStart), you should use this one so data from these two Series don't get mixed up in the results:
SELECT DateID, Series, ClassStart
FROM Classes_Dates
WHERE ClassID = 11
ORDER BY MIN(ClassStart) OVER(PARTITION BY Series)
, Series
, ClassStart
How the query works:
What your problem describes is that you want the data shown in groups (of the same Series), but you also want these groups ordered by the MIN(ClassStart) of each group.
To find MIN(ClassStart) we'd normally use GROUP BY Series, but we can't do that here because the multiple rows of the same group would collapse into one.
This is what MIN(ClassStart) OVER(PARTITION BY Series) achieves: it calculates the minimum of ClassStart as if we had used GROUP BY Series, without collapsing the rows.
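The effect of ORDER BY MIN(ClassStart) OVER (PARTITION BY Series), Series, ClassStart can be sketched in Python (illustrative only, using the overlapping rows from the question):

```python
# (DateID, Series, ClassStart): the overlapping tail of the result set
rows = [
    (672, 13, "20110705080000"),
    (676, 15, "20110707080000"),
    (674, 14, "20110709090000"),
    (673, 13, "20110714080000"),
    (675, 14, "20110716080000"),
]

# MIN(ClassStart) OVER (PARTITION BY Series): earliest start per series
min_start = {}
for _, series, start in rows:
    min_start[series] = min(min_start.get(series, start), start)

ordered = sorted(rows, key=lambda r: (min_start[r[1]], r[1], r[2]))
print([date_id for date_id, _, _ in ordered])  # [672, 673, 676, 674, 675]
```

The sorted order keeps each Series together while ordering whole groups by their earliest date, which matches the expected result set in the question.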