SQL - Finding open spaces in a schedule

I am working with a SQLite database and I have three tables describing buildings, rooms, and scheduled events.
The tables look like this:
Buildings(ID,Name)
Rooms(ID,BuildingID,Number)
Events(ID,BuildingID,RoomID,Days,s_time,e_time)
So every event is associated with a building and a room. The column Days contains an integer which is a product of prime numbers corresponding to days of the week (a value of 21 means the event occurs on Tuesday = 3 and Thursday = 7).
I am hoping to find a way to generate a report of rooms in a specific building that will be open in the next few hours, along with how long they will be open for.
Here is what I have so far:
SELECT Rooms.Number
FROM Rooms
INNER JOIN Buildings on ( Rooms.BuildingID = Buildings.ID )
WHERE
Buildings.Name = "BuildingName"
EXCEPT
SELECT Rooms.Number
FROM Events
INNER JOIN Rooms on ( Events.RoomID = Rooms.ID )
INNER JOIN Buildings on ( Events.BuildingID = Buildings.ID )
WHERE
Buildings.Name = "BuildingName" AND
Events.days & 11 = 0 AND
time("now", "localtime" BETWEEN events.s_time AND events.e_time;
Here I find all rooms for a specific building and then remove the rooms which currently have a scheduled event in progress.
Any helpful tips or comments are welcome.

If you're storing days as a product of primes, the modulo (%) operator might be more useful:
SELECT * FROM Events
INNER JOIN Buildings on (Events.BuildingID = Buildings.ID)
WHERE
(Events.Days % 2 = 0 OR Events.Days % 5 = 0)
This would select events happening on either a Monday or a Wednesday (Monday = 2, Wednesday = 5).
I do have to point out, though, that storing the product of primes is expensive in both computation and storage. It is much easier to store the sum of powers of two (Mon = 1, Tues = 2, Wed = 4, Thurs = 8, Fri = 16, Sat = 32, Sun = 64).
The largest possible value for your current implementation is 510,510. The smallest data type to store such a number is int (32 bits per row) and retrieving the encoded data requires up to 7 modulo (%) operations.
The largest possible value for the 2^n summation method is 127, which can be stored in a tinyint (8 bits per row), and retrieving the encoded data uses a bitwise AND (&), which is somewhat cheaper (and therefore faster).
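For illustration, here is a sketch of the earlier Monday-or-Wednesday check under the 2^n encoding (it assumes Days now holds the bit flags above rather than the prime product):
-- Sketch: an event on Monday and Wednesday stores Days = 1 + 4 = 5,
-- and "occurs on Monday or Wednesday" becomes a single bitwise AND:
SELECT * FROM Events
INNER JOIN Buildings on (Events.BuildingID = Buildings.ID)
WHERE (Events.Days & 5) <> 0;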
Probably not an issue for what you're working with, but it's a good habit to choose whatever method gives you the best space and performance efficiency lest you hit serious problems should your solution be implemented at larger scales.

Related

Creating a Google Data Studio GPS report, using data from two sources in BigQuery - scheduled report data and real-time data

I'd like to prefix this post by saying I'm an SQL novice, new to BigQuery and a first time poster, so thanks in advance!
I'm currently recreating a report in Google Data Studio that was originally built in Excel for a bus company, which compares a daily schedule with data we receive daily from third-party software. The 'Schedule' data table includes the route name, location names, the scheduled times and GPS coordinates. The 'Real-time' data that we receive from the third-party software includes: the date, a timestamp every 40 seconds, and the GPS coordinates for that timestamp. Note, there is no 'route name'; this is key for later in my question.
The third-party CSV data is uploaded daily as individual CSV files to a Google Cloud Data Storage bucket, which is in turn connected to Google BigQuery as one, single Real-Time data table, ready to be compared against the Schedule data table.
The aim of this report is to compare the uploaded Real-Time data table against the Schedule data, to create a Google Data Studio report that our logistics team can check to answer the two main questions: 1) Punctuality of the bus - was the bus on time that day and, if it was late, by how much. 2) Location of the bus - what is the exact GPS location of the bus in relation to the scheduled GPS coordinates, per timestamp. My ideal report would have filters to select different Bus Routes and choose the Date in question.
My first question lies in how to construct this report. I envision that I will need to perform a join with my data; to be explicit, two LEFT joins that answer my two key questions as follows: 1) Punctuality calculation through a left join via GPS coordinates and, 2) GPS calculation through a left join via the timestamps.
If that outline is clear and my proposal to use two LEFT joins is correct, which I suspect may not be so for reasons discussed below, let's move on to stage two. This is a good checkpoint for anyone who has read up until this point and believes I need to make changes to my approach.
Moving on, when and if I join my two tables, I have initially observed the following technical considerations that I will need a little bit of help with:
1) Timestamps in the Real-Time data are taken every 40 seconds. Therefore, I don't necessarily have data for the bus at exact timestamps, e.g. the bus in my Schedule is meant to be at location x at 09:00:00, but my closest timestamp in my Real-Time data might be 09:00:04. How do I match that data to select the correct data point? Initially, I thought about truncating the timestamps, but then there would also be a data point at 09:00:44, 40 seconds later; truncated to 09:00, the two data points would match equally and be treated as the same. Any ideas? Maybe a LEFT join with MIN.
2) GPS coordinates to be matched. I've attempted to use the ST_CLOSESTPOINT geography function, ST_CLOSESTPOINT(geography_1, geography_2[, spheroid=FALSE]), but I don't fully understand it. What is spheroid=FALSE? (https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions)
3) Thirdly, and currently my most difficult issue, my Real-Time data table is linked to a Google Cloud Storage bucket with over a year's worth of CSV files, for 9 different buses with different routes - 1 CSV file per bus, per day. Also, as mentioned in paragraph 2, there is no 'route name' in my Real-Time data, just a bunch of GPS coordinates and timestamps. I need to think of a way to be able to differentiate between these different CSV files and my Schedule data table, so that it will be functional in Google Data Studio with the two filters previously mentioned, to firstly select the Bus Route and then select the date. It is this point that makes me question whether a LEFT join is appropriate, as this functionality would not be possible with data already joined. Currently, with a large Real-Time data set, a join with my Schedule data table is matching with the nearest point of this whole data set, seemingly at random, with no capability to select by day etc.
This is quite a big project, a little out of my comfort zone and a long detailed question, but any guidance would be much appreciated as I'm relatively new to SQL and BigQuery.
Many thanks in advance!
-- JOIN 1 - via ST_CLOSESTPOINT to determine punctuality of the bus
SELECT
r1.Direction,
r1.ScheduledLocation,
r1.ScheduledNextLocation,
r1.ScheduledTime,
r1.ScheduledCoordinates,
r1.ScheduledXCoordinates,
r1.ScheduledYCoordinates,
r1.ScheduledFullCoordinates,
r2.RealTime,
r2.RealTimeDate,
r2.RealTimeXCoordinates,
r2.RealTimeYCoordinates,
r2.RealTimeFullCoordinates,
ST_CLOSESTPOINT(r1.ScheduledFullCoordinates, r2.RealTimeFullCoordinates) as ClosestPoint
FROM `SCHEDULE DATA SOURCE` r1
LEFT JOIN `REAL-TIME DATA SOURCE` r2 ON r1.ScheduledTime = r2.RealTime
-- JOIN 2 - via timestamp to determine GPS location of the bus
SELECT
r1.Direction,
r1.ScheduledLocation,
r1.ScheduledNextLocation,
r1.ScheduledTime,
r1.ScheduledCoordinates,
r1.ScheduledXCoordinates,
r1.ScheduledYCoordinates,
r1.ScheduledFullCoordinates,
r2.RealTime,
r2.RealTimeXCoordinates,
r2.RealTimeYCoordinates,
r2.RealTimeFullCoordinates
FROM `SCHEDULE DATA SOURCE` r1
LEFT JOIN `REAL-TIME DATA SOURCE` r2 ON r1.ScheduledTime = r2.RealTime
What you can do instead of a join is use analytic functions. I assume the data per bus route is small enough for this. If you have multiple bus routes in the same query, add PARTITION BY (a sketch follows the result table below).
I used r1 as sample scheduled times / locations, and r2 as real times / locations. I then union them and add a sched flag denoting whether a row is a scheduled or a real-time event. Then I ORDER all events by time, and for each event add the previous and next location. You can now filter to only the scheduled events - for each one you'll have the next and previous location. My code is somewhat simplistic, as it may choose the previous or next scheduled event rather than the previous or next real event, but if real events are collected often enough that is unlikely to happen.
Finally, about ST_CLOSESTPOINT - the function is used to find the point of one complex shape that is closest to another complex shape. I don't think you'll need it: since you deal with points, it just returns the single available point, i.e. its first argument. What you need is ST_DISTANCE to calculate the distance to the real point. I calculate two distances, to the previous real event and to the next real event, and choose the one that is closer.
with r1 as (
select time(10, 0, 0) as sched_tm, ST_GeogPoint(10, 10) as sched_loc union all
select time(10, 10, 0) as sched_tm, ST_GeogPoint(11, 11) as sched_loc union all
select time(10, 20, 0) as sched_tm, ST_GeogPoint(12, 13) as sched_loc
), r2 as (
select time(10, 0, 10) as real_tm, ST_GeogPoint(10.1, 10) as real_loc union all
select time(10, 0, 50) as real_tm, ST_GeogPoint(10.2, 10) as real_loc union all
select time(10, 9, 40) as real_tm, ST_GeogPoint(10.9, 11) as real_loc union all
select time(10, 10, 20) as real_tm, ST_GeogPoint(11.1, 11) as real_loc union all
select time(10, 20, 0) as real_tm, ST_GeogPoint(12, 13) as real_loc
), r12 as (
select TRUE as sched, sched_tm tm, sched_loc as loc from r1
union all
select FALSE as sched, real_tm tm, real_loc as loc from r2
), r12_sort as (
select sched, tm, loc,
LAG(loc, 1) OVER(ORDER BY tm) as prev_loc,
LEAD(loc, 1) OVER(ORDER BY tm) as next_loc
from r12
)
select sched, tm as sched_tm, loc, prev_loc, next_loc,
LEAST(coalesce(st_distance(loc, prev_loc), 1e9),
coalesce(st_distance(loc, next_loc), 1e9)) as distance
from r12_sort
where sched
The result is like this:
Line sched sched_tm loc prev_loc next_loc distance
1 true 10:00:00 POINT(10 10) null POINT(10.1 10) 10950.579731746193
2 true 10:10:00 POINT(11 11) POINT(10.9 11) POINT(11.1 11) 10915.213347763152
3 true 10:20:00 POINT(12 13) POINT(11.1 11) POINT(12 13) 0.0
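As mentioned above, if several routes end up in the same query, the window functions need a PARTITION BY so that one bus's points never become another bus's neighbours. A minimal sketch, assuming a route column could be added to both tables (it does not exist yet in the real-time data; it might, for example, be derived from the CSV file name at load time):
-- Sketch only: same shape as r12_sort above, but partitioned by a
-- hypothetical route column present in both the schedule and real-time rows.
select route, sched, tm, loc,
LAG(loc, 1) OVER(PARTITION BY route ORDER BY tm) as prev_loc,
LEAD(loc, 1) OVER(PARTITION BY route ORDER BY tm) as next_loc
from (
select route, TRUE as sched, sched_tm as tm, sched_loc as loc from r1
union all
select route, FALSE as sched, real_tm as tm, real_loc as loc from r2
) r12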

Closest position between randomly moving objects

I have a large database table that contains grid references (X and Y) associated with various objects (each with a unique object identifier) as they move over time. The objects move at approximately constant speed but in random directions.
The table looks something like this:
CREATE TABLE positions (
objectId INTEGER,
x_coord INTEGER,
y_coord INTEGER,
posTime TIMESTAMP);
I want to find which two objects got closest to each other and at what time.
Finding the distance between two fixes is relatively easy – simple Pythagoras for the differences between the X and Y values should do the trick.
The first problem seems to be one of volume. The grid itself is large, 100,000 possible X co-ordinates and a similar number of Y co-ordinates. For any given time period the table might contain 10,000 grid reference positions for 1000 different objects – 10 million rows in total.
That’s not in itself a large number, but I can’t think of a way of avoiding doing a ‘product query’ to compare every fix to every other fix. Doing this with 10 million rows will produce 100 million million results.
The next issue is that I’m not just interested in the closest two fixes to each other, I’m interested in the closest two fixes from different objects.
Another issue is that I need to match time as well as position – I’m not just interested in two objects that have visited the same grid square, they need to have done so at the same time.
The other point (which may not be relevant) is that the items are unlikely to ever occupy exactly the same location at the same time.
I’ve got as far as a simple product query with a few sample rows, but I’m not sure of my next steps. I’m beginning to think this isn’t something I can pull off with a single SQL query (please prove me wrong) and I’m likely to have to extract the data and subject it to some procedural programming.
Any suggestions?
I’m not sure which SE forum this is best suited for – database/SQL? Programming? Maths?
UPDATE - Another issue to add to the complexity: the timestamping for each object and position is irregular, so one item might have a position recorded at 14:10:00 and another at 14:10:01. If these two positions are right next to each other and one second apart then they may actually represent the closest approach, even though the times don't match!
In order to reduce the number of tested combinations you should segregate the rows by postime using subqueries. Also, it's recommended you create an index on postime to increase performance.
create index ix1_time on positions (postime);
Since you didn't mention any specific database I assumed PostgreSQL since it's easy to use (for me). Using a LATERAL join, the solution would look something like this:
with t as (
  select distinct postime as pt from positions
)
select *
from t
cross join lateral (
  select *
  from (
    select
      a.objectid as aid, b.objectid as bid,
      a.x_coord + a.y_coord + b.x_coord + b.y_coord as dist -- fix here!
    from positions a
    join positions b on b.postime = a.postime
    where a.postime = t.pt
      and a.objectid <> b.objectid
  ) x
  order by dist
  limit 1
) y;
This SQL compares the positions recorded at each postime only against other positions with the same postime, never against rows from other postime values.
Please note: I used a.x_coord + a.y_coord + b.x_coord + b.y_coord as the distance formula. I leave the correct one for you to implement here.
In total the work is, for each distinct postime, roughly the square of the number of positions recorded at that time - still a lot, but far fewer comparisons than the 100 million million of a full cross join. It returns the closest pair of points for each postime, i.e. one row per distinct timestamp.
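For reference, the "correct one" left open above is just Pythagoras on the coordinate differences. A sketch of a replacement dist expression, shown as a standalone query against the positions table (comparing squared distances orders pairs exactly like true distances, so the square root can be skipped):
-- Sketch: squared Euclidean distance between two fixes taken at the same postime.
select
a.objectid as aid, b.objectid as bid,
(a.x_coord - b.x_coord) * (a.x_coord - b.x_coord)
+ (a.y_coord - b.y_coord) * (a.y_coord - b.y_coord) as dist
from positions a
join positions b on b.postime = a.postime and b.objectid <> a.objectid;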

Using Real numbers for explicit sorting in sql database

I'm facing a recurring problem: I have to let a user reorder a list that is stored in a database.
The first straightforward approach I can think of is to have a "position" column with the ordering saved as an integer, e.g.
Data, Order
A 1
B 2
C 3
D 4
The problem here is that if I have to insert FOO at position 2, my table becomes
Data, Order
A 1
FOO 2
B 3
C 4
D 5
So to insert a new row, I have to do one INSERT and three UPDATEs on a table of five elements.
So my new idea is to use real numbers instead of integers; my table becomes
Data, Order
A 1.0
B 2.0
C 3.0
D 4.0
If I want to insert an element FOO after A, this becomes
Data, Order
A 1.0
FOO 1.5
B 2.0
C 3.0
D 4.0
With only one SQL query executed.
This would work fine with ideal real numbers, but floating-point numbers have limited precision. I am wondering how feasible this is, and whether and how I can optimize it to avoid exhausting double precision after a reasonable number of modifications.
Edit:
this is how I have implemented it for now in Python:
from decimal import Decimal  # needed by get_middle_priority

@classmethod
def get_middle_priority(cls, p, n):
    # return a decimal strictly between p and n, using as few digits as possible
    p = Decimal(str(p))
    n = Decimal(str(n))
    m = p + ((n - p) / 2)
    i = 0
    while True:
        m1 = round(m, i)
        if m1 > p and m1 < n:
            return m1
        else:
            i += 1

@classmethod
def create(cls, data, user):
    prev = data.get('prev')
    if prev is None or len(prev) < 1:
        # no predecessor given: put the new row before the current first row
        first = cls.list().first()
        if first is None:
            priority = 1.0
        else:
            priority = first.priority - 1.0
    else:
        prev = cls.list().filter(Rotator.codice == prev).first()
        next = cls.list().filter(Rotator.priority > prev.priority).first()
        if next is None:
            priority = prev.priority + 1.0
        else:
            priority = cls.get_middle_priority(prev.priority, next.priority)
    r = cls(data.get('codice'), priority)
    DBSession.add(r)
    return r
If you want to control the position and there is no ORDER BY solution, then a rather simple and robust approach is to have each row point to the next or to the previous row. Updates/inserts/deletes (other than at the first and last positions) will require 3 operations:
Insert the new Item
Update the Item Prior the New Item
Update the Item After the New Item
After you have that established you can use a CTE (with a UNION ALL) to create a sorted list that will never have a limit.
I have seen rather large implementations of this that were done via Triggers to keep the list in perfect form. I however am not a fan of triggers and would just put the logic for the entire operation in a stored procedure.
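A minimal sketch of the "CTE (with a UNION ALL)" idea, assuming a hypothetical items table in which each row stores the id of the row that precedes it (prev_id is NULL for the head of the list). The RECURSIVE keyword is PostgreSQL/SQLite syntax; SQL Server uses plain WITH:
-- Sketch only: walk the linked list from its head and number the rows.
WITH RECURSIVE ordered AS (
SELECT id, data, prev_id, 1 AS pos
FROM items
WHERE prev_id IS NULL              -- head of the list
UNION ALL
SELECT i.id, i.data, i.prev_id, o.pos + 1
FROM items i
JOIN ordered o ON i.prev_id = o.id -- follow the pointers
)
SELECT id, data FROM ordered ORDER BY pos;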
You may use a string rather than numbers:
item order
A ffga
B ffgaa
C ffgb
Here, the problem of finite precision is handled by the possibility of growing the string. String storage in the database is theoretically unlimited, bounded only by the size of the storage device. But there is no better solution for absolute ordering of items. Relative ordering, like linked lists, might work better (but then you can't do an ORDER BY query).
The linked list idea is neat, but it's expensive to pull the data out in order. If you have a database which supports it, you can use something like CONNECT BY to pull it out. "linked list in sql" is a question dedicated to that problem.
Now if you don't, I was thinking of how one can achieve an infinitely divisible range, and thought of sections in a book. What about storing the list initially as
1
2
3
and then, to insert between 1 and 2, you insert a "subsection under 1" so that your list becomes
1
1.1
2
3
If you want to insert another one between 1.1 and 2 you place a second subsection under 1 and get
1
1.1
1.2
2
3
and lastly if you want to add something between 1.1 and 1.2 you need to introduce a subsubsection and get
1
1.1
1.1.1
1.2
2
3
Maybe using letters instead of numbers would be less confusing.
I'm not sure if there is any standard lexicographic ordering in sql databases which could sort this type of list correctly. But I think you could roll your own with some "order by case" and substringing. Edit: I found a question pertaining to this: linky
Another downside is that the worst case field size of this solution would grow exponentially with the number of input items (You could get long rows like 1.1.1.1.1.1 etc). But in the best case it would be linear or almost constant (Rows like 1.934856.1).
This solution is also quite close to what you already had in mind, and I'm not sure that it's an improvement. A decimal number using the binary partitioning strategy that you mentioned will probably grow by one decimal place with each insert, right? So you would get
1,2 -> 1,1.5,2 -> 1,1.25,1.5,2 -> 1,1.125,1.25,1.5,2
So the best case of the subsectioning-strategy seems better, but the worst case a lot worse.
I'm also not aware of any infinite precision decimal types for sql databases. But you could of course save your number as a string, in which case this solution becomes even more similar to your original one.
Set all rows to a unique number starting at 1 and incrementing by 1 at the start. When you insert a new row, set it to count(*) of the table + 1 (there are a variety of ways of doing this).
When the user updates the Order of a row, always update it by calling a stored procedure with the Id (PK) of the row to update and the new order. In the stored procedure:
update tableName set Order = Order + 1 where Order >= @updatedRowOrder;
update tableName set Order = @updatedRowOrder where Id = @pk;
That guarantees that there will always be space and a continuous sequence with no duplicates. I haven't worked out what would happen if you supply a silly new Order number for a row (e.g. <= 0), but probably bad things; that's for the front-end app to prevent.
Cheers -

Can a dimension ID record also be used as a column (attribute)?

Can a dimension ID record also be used as a column (attribute)? I do not know if this will work, or if it is against best practices if I do. Here is a more detailed explanation of why I am asking this, and what I am looking for:
I have a dimension with only 6 records, called a Past Due Dimension.
It looks like the following:
PastDueBandDimID  PastDueMin  PastDueMax  PastDueDesc
1                 0           0           Current
2                 1           29          1-29 Days Past Due
And on it goes with 30-59 days / 60 - 89 days / 90-180 days.
This pattern works fine all in all, but I would like to create a user hierarchy with this data so I can group it in different ways. What I thought of doing was creating additional fields in the DSV called 1-29 / 30-59... and referencing the DimID in these fields, so I could create my hierarchy. I do not feel this is a good way of doing it, but I have no other ideas. Any suggestions are appreciated! In some (though not all) of my reports I would like to group 0-59 days and 60-180 days, and a user hierarchy would enable the users to do this.
When doing bucketing like this, I almost always create physical bucket columns on an aggregated fact rather than a "past due" dimension.
I can understand the temptation in building the dimension like this since it's very "flexible", but as you are discovering it makes using automated tools (such as SSAS) more difficult, as well as forcing you to constantly update your fact tables to reflect new "past due" dimension values.
Instead, why not just build an aggregate that sits on top of your fact and is rebuilt daily (or even a view if your DB is strong enough). Using invoices as an example:
Invoice
Invoice Due Date
PastDueLTE29 (1 if <= 29, 0 otherwise)
PastDue30to59 (1 if >= 30 and <= 59, 0 otherwise)
PastDue60to89 (1 if >= 60 and <= 89, 0 otherwise)
PastDue90to180 (1 if >= 90 and <= 179, 0 otherwise)
PastDueGTE180 (1 if >= 180, 0 otherwise)
If you then want to group on, say all invoices < 60 days past due, you would just filter where either of the first two columns = 1.
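A rough sketch of what that aggregate/view could look like in T-SQL; the table and column names are made up for illustration, and the bucket conditions simply follow the list above:
-- Sketch only: hypothetical invoice fact with one flag column per bucket.
CREATE VIEW InvoicePastDueBuckets AS
SELECT
InvoiceID,
InvoiceDueDate,
CASE WHEN DATEDIFF(day, InvoiceDueDate, GETDATE()) <= 29 THEN 1 ELSE 0 END AS PastDueLTE29,
CASE WHEN DATEDIFF(day, InvoiceDueDate, GETDATE()) BETWEEN 30 AND 59 THEN 1 ELSE 0 END AS PastDue30to59,
CASE WHEN DATEDIFF(day, InvoiceDueDate, GETDATE()) BETWEEN 60 AND 89 THEN 1 ELSE 0 END AS PastDue60to89,
CASE WHEN DATEDIFF(day, InvoiceDueDate, GETDATE()) BETWEEN 90 AND 179 THEN 1 ELSE 0 END AS PastDue90to180,
CASE WHEN DATEDIFF(day, InvoiceDueDate, GETDATE()) >= 180 THEN 1 ELSE 0 END AS PastDueGTE180
FROM Invoices;
Grouping "everything less than 60 days past due" is then just a filter on PastDueLTE29 = 1 OR PastDue30to59 = 1, as described above.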
If you really want a hierarchy, couldn't you just add a few columns to your table?
I really don't like using "Level" in the names of columns... but:
PastDueBandDimID
PastDueLevel1Name ("Past Due" or "Current")
PastDueLevel2Name ("1-60" or "61-180" or "180+")
PastDueLevel3Name ("1-30", "31-60", "61-90", "90+"
PastDueLevel3Min
PastDueLevel3Max

Group by run when there is no run number in data (was Show how changing the length of a production run affects time-to-build)

It would seem that there is a much simpler way to state the problem. Please see Edit 2, following the sample table.
I have a number of different products on a production line. I have the date that each product entered production. Each product has two identifiers: item number and serial number. I have the total number of labour hours for each product, by item number and by serial number (i.e. I can tell you how many hours went into each object that was manufactured and what the average build time is for each kind of object).
I want to determine how (if) varying the length of production runs affects the average time it takes to build a product (item number). A production run is the sequential production of multiple serial numbers for a single item number. We have historical records going back several years with production runs varying in length from 1 to 30.
I think to achieve this, I need to be able to assign 'run id'. To me, that means building a query that sorts by start date and calculates a new unique value at each change in item number. If I knew how to do that, I could solve the rest of the problem on my own.
So that suggests a series of related questions:
Am I thinking about this the right way?
If I am on the right track, how do I generate those run id values? Calculate and store is an option, although I have a (misguided?) preference for direct queries. I know exactly how I would generate the run numbers in Excel, but I have a (misguided?) preference to do this in the database.
If I'm not on the right track, where might I find that track? :)
Edit:
Table structure (simplified) with sample data:
AutoID Item Serial StartDate Hours RunID (proposed calculation)
1 Legend 1234 2010-06-06 10 1
3 Legend 1235 2010-06-07 9 1
2 Legend 1237 2010-06-08 8 1
4 Apex 1236 2010-06-09 12 2
5 Apex 1240 2010-06-10 11 2
6 Legend 1239 2010-06-11 10 3
7 Legend 1238 2010-06-12 8 3
I have shown that start date, serial, and AutoID are mutually unrelated. I have shown the expectation that labour goes down as the run length increases (but this is a 'fact' only via received wisdom, not data analysis). I have shown what I envision as the heart of the solution, that being a RunID that reflects sequential builds of a single item. I know that if I could get that RunID, I could group by run to get counts, averages, totals, max, min, etc. In addition, I could do something like each unit's hours relative to the first unit of its run to get the percentage change from the start of the run. At that point I could graph the trends associated with different run lengths either globally across all items or on a per item basis. (At least I think I could do all that. I might have to muck about a bit, but I think I could get it done.)
Edit 2: This problem would appear to be: how do I get the 'starting' member (earliest start date) of each run when I don't already have a runID? (The runID shown in the sample table does not exist and I was originally suggesting that being able to calculate runID was a potentially viable solution.)
AutoID Item
1 Legend
4 Apex
6 Legend
I'm assuming that having learned how to find the first member of each run that I would then be able to use what I've learned to find the last member of each run and then use those two results to get all other members of each run.
Edit 3: my version of a query that uses the AutoID of the first item in a run as the RunID for all units in a run. This was built entirely from samples and direction provided by Simon, who has the accepted answer. Using this as the basis for grouping by run, I can produce a variety of run statistics.
SELECT first_product_of_run.AutoID AS runID, run_sibling.AutoID AS itemID, run_sibling.Item, run_sibling.Serial, run_sibling.StartDate, run_sibling.Hours
FROM (SELECT first_of_run.AutoID, first_of_run.Item, first_of_run.Serial, first_of_run.StartDate, first_of_run.Hours
FROM dbo.production AS first_of_run LEFT OUTER JOIN
dbo.production AS earlier_in_run ON first_of_run.AutoID - 1 = earlier_in_run.AutoID AND
first_of_run.Item = earlier_in_run.Item
WHERE (earlier_in_run.AutoID IS NULL)) AS first_product_of_run LEFT OUTER JOIN
dbo.production AS run_sibling ON first_product_of_run.Item = run_sibling.Item AND first_product_of_run.AutoID <> run_sibling.AutoID AND
first_product_of_run.StartDate < run_sibling.StartDate LEFT OUTER JOIN
dbo.production AS product_between ON first_product_of_run.Item <> product_between.Item AND
first_product_of_run.StartDate < product_between.StartDate AND product_between.StartDate < run_sibling.StartDate
WHERE (product_between.AutoID IS NULL)
Could you describe your table structure some more? If the "date that each product entered production" is a full timestamp, or if there is a sequential identifier across products, you can write queries to identify the first and last products of a run. From that, you can assign IDs to the runs or calculate their lengths.
Edit:
Once you've identified 1,4, and 6 as the start of a run, you can use this query to find the other IDs in the run:
select first_product_of_run.AutoID, run_sibling.AutoID
from first_product_of_run
left join production run_sibling on first_product_of_run.Item = run_sibling.Item
and first_product_of_run.AutoID <> run_sibling.AutoID
and first_product_of_run.StartDate < run_sibling.StartDate
left join production product_between on first_product_of_run.Item <> product_between.Item
and first_product_of_run.StartDate < product_between.StartDate
and product_between.StartDate < run_sibling.StartDate
where product_between.AutoID is null
first_product_of_run can be a temp table, table variable, or sub-query that you used to find the start of a run. The key is the where product_between.AutoID is null. That restricts the results to only pairs where no different items were produced between them.
Edit 2, here's how to get the first of each run:
select first_of_run.AutoID
from
(
select product.AutoID, product.Item, MAX(previous_product.StartDate) as PreviousDate
from production product
left join production previous_product on product.AutoID <> previous_product.AutoID
and product.StartDate > previous_product.StartDate
group by product.AutoID, product.Item
) first_of_run
left join production earlier_in_run
on first_of_run.PreviousDate = earlier_in_run.StartDate
and first_of_run.Item = earlier_in_run.Item
where earlier_in_run.AutoID is null
It's not pretty, and will break if StartDate is not unique. The query could be simplified by adding a sequential and unique identifier with no gaps. In fact, that step will probably be necessary if StartDate is not unique. Here's how it would look:
select first_of_run.AutoID
from production first_of_run
left join production earlier_in_run
on (first_of_run.Sequence - 1) = earlier_in_run.Sequence
and first_of_run.Item = earlier_in_run.Item
where earlier_in_run.AutoID is null
Using outer joins to find where things aren't still twists my brain, but it's a very powerful technique.
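For what it's worth, on a database with window functions (SQL Server 2012+, PostgreSQL, and others) the RunID the question asks for can also be derived with the row-number difference ("gaps and islands") trick. This is not part of the answer above, just an alternative sketch:
-- Sketch: rows belonging to one run share the same (Item, grp) value, because
-- the overall row number and the per-Item row number only advance together
-- while the same Item is being built consecutively.
SELECT AutoID, Item, Serial, StartDate, Hours,
FIRST_VALUE(AutoID) OVER (PARTITION BY Item, grp ORDER BY StartDate) AS RunID
FROM (
SELECT p.*,
ROW_NUMBER() OVER (ORDER BY StartDate)
- ROW_NUMBER() OVER (PARTITION BY Item ORDER BY StartDate) AS grp
FROM dbo.production p
) runs;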