Should I denormalize to a join with an added bit, or what? - sql

I have many documents in my system such as order invoices, requisitions etc. In order to track their approval workflow I have one common table in which I have the following columns.
WFID | ActionDate | DocInstancetype | DocinstanceID | iscurrent | status
1 | 2017-04-04 | PO | 58 | 0 | Submitted
2 | 2017-04-05 | PO | 58 | 1 | Approved
3 | 2017-04-04 | PR | 74 | 1 | Submitted
In my reports I usually need to consider only approved documents. Is it a good idea to add an IsApproved bit in the Documents Master table (in PR and PO tables) and sync it using a trigger so that I can avoid a join with workflow table every time I need to get approved documents only?
Any other better suggestion would also be appreciated.

The document table should contain the status (not is_approved), because you might need to filter documents by status in the future, even though today your needs are limited to status = 'Approved'. The status change history should be maintained in another table, which is the workflow table in your case, though I would suggest that 'document_status_tracking' or just 'document_tracking' would be a better name for it.
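A minimal sketch of that layout, using Python's built-in sqlite3 module so it can be run as-is. The table names purchase_orders and document_tracking, the trigger name and the 'Draft' default are illustrative assumptions, and the trigger syntax is SQLite's, so it would need adapting for SQL Server:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchase_orders (
    id     INTEGER PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'Draft'   -- current status lives on the document (default is assumed)
);
CREATE TABLE document_tracking (
    wfid        INTEGER PRIMARY KEY,
    action_date TEXT NOT NULL,
    doc_type    TEXT NOT NULL,
    doc_id      INTEGER NOT NULL,
    status      TEXT NOT NULL
);
-- Keep the document's status in step with the newest tracking row.
CREATE TRIGGER trg_sync_po_status
AFTER INSERT ON document_tracking
WHEN NEW.doc_type = 'PO'
BEGIN
    UPDATE purchase_orders SET status = NEW.status WHERE id = NEW.doc_id;
END;
""")

conn.execute("INSERT INTO purchase_orders (id) VALUES (58)")
conn.executemany(
    "INSERT INTO document_tracking (action_date, doc_type, doc_id, status) "
    "VALUES (?, ?, ?, ?)",
    [("2017-04-04", "PO", 58, "Submitted"), ("2017-04-05", "PO", 58, "Approved")],
)

# Reports filter on the document table alone; no join to the history is needed.
approved = conn.execute(
    "SELECT id FROM purchase_orders WHERE status = 'Approved'"
).fetchall()
```

The sketch assumes tracking rows are inserted in chronological order, so the last insert always carries the current status.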

Related

Is it possible to match the "next" unmatched record in a SQL query where there is no strictly unique common field between tables?

Using Access 2010 and its version of SQL, I am trying to find a way to relate two tables in a query where I do not have strict, unique values in each table, using concatenated fields that are mostly unique, then matching each unmatched next record (measured by a date field or the record id) in each table.
My business receives checks that we do not cash ourselves, but rather forward to a client for processing. I am trying to build a query that will match the checks that we forward to the client with a static report that we receive from the client indicating when checks were cashed. I have no control over what the client reports back to us.
When we receive a check, we record the name of the payor, the date that we received the check, the client's account number, the amount of the check, and some other details in a table called "Checks". We add a matching field which comes as close as we can get to a unique identifier to match against the client reports (more on that in a minute).
Checks:
ID Name Acct Amt Our_Date Match
__ ____ ____ ____ _____ ______
1 Dave 1001 10.51 2/14/14 1001*10.51
2 Joe 1002 12.14 2/28/14 1002*12.14
3 Sam 1003 50.00 3/01/14 1003*50.00
4 Sam 1003 50.00 4/01/14 1003*50.00
5 Sam 1003 50.00 5/01/14 1003*50.00
The client does not report back to us the date that WE received the check, the check number, or anything else useful for making unique matches. They report the name, account number, amount, and the date of deposit. The client's report comes weekly. We take that weekly report and append the records to make a second table out of it.
Return:
ID Name Acct Amt Their_Date Unique1
__ ____ ____ ____ _____ ______
355 Dave 1001 10.51 3/25/14 1001*10.51
378 Joe 1002 12.14 4/04/14 1002*12.14
433 Sam 1003 50.00 3/08/14 1003*50.00
599 Sam 1003 50.00 5/11/14 1003*50.00
Instead of giving us back the date we received the check, we get back the date that they processed it. There is no way to make a rule to compare the two dates, because the deposit dates vary wildly. So the closest thing I can get for a unique identifier is a concatenated field of the account number and the amount.
I am trying to match the records on these two tables so that I know when the checks we forward get deposited. If I do a simple join using the two concatenated fields, it works most of the time, but we run into a problem with payors like Sam, above, who is making regular monthly payments of the same amount. In a simple join, if one of Sam's payments appears in the Return table, it matches to all of the records in the Checks table.
To limit that behavior and match the first Sam entry on the Return table to the first Sam entry on the Checks table, I wrote the following query:
SELECT return.*, checks.*
FROM return, checks
WHERE checks.id = (SELECT TOP 1 id
                   FROM checks
                   WHERE match = return.unique1
                   ORDER BY [our_date]);
This works when there is only one of Sam's records in the Return table. The problem comes when the second entry for Sam hits the Return table (Return.ID 599) as the client's weekly reports are added to the table. When that happens, the query appropriately (for my purposes) only lists that two of Sam's checks have been processed, but uses the "Top 1 ID" record to supply the row's details from the Return table:
Checks_Return_query:
Checks.ID Name Acct Amt Our_Date Their_Date Return.ID
__ ____ ____ ____ _____ ______ ________
1 Dave 1001 10.51 2/14/14 3/25/14 355
2 Joe 1002 12.14 2/28/14 4/04/14 378
3 Sam 1003 50.00 3/01/14 3/08/14 433
4 Sam 1003 50.00 4/01/14 3/08/14 433
In other words, the query repeats the Return table info for record Return.ID 433 instead of matching Return.ID 599, which is I guess what I should expect from the TOP 1 operator.
So I am trying to figure out how I can get the query to take the two concatenated fields in Checks and Return, compare them to find matching sets, then select the next unmatched record in Checks (with "next" being measured either by the ID or Our_Date) with the next unmatched record in Return (again, with "next" being measured either by the ID or Their_Date).
I spent many hours in a dark room turning the query into various joins, and back again, looking at functions like WHERE NOT IN, WHERE NOT EXISTS, FIRST() NEXT() MIN() MAX(). I am afraid I am way over my head.
I am beginning to think that I may have a structural problem, and may need to write the "matched" records in this query to another table of completed transactions, so that I can differentiate between "matched" and "unmatched" records better. But that still wouldn't help me if two of Sam's transactions are on the same weekly report I get from my client.
Are there any suggestions as to query functions I should look into for further research, or confirmation that I am barking up the wrong tree?
Thanks in advance.
I'd say that you really do need another table of completed transactions; it could be a temporary table.
Regarding your worry about "... if two of Sam's transactions are on the same weekly report", you can use a cursor to write the records one by one instead of in a single set-based operation.
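Here is a rough sketch of that one-by-one matching, written in Python against an in-memory SQLite database rather than Access so that it runs as-is. The table and column names (returns, matched, match_key) are my own, and the dates are stored as ISO text so that they sort correctly; in Access the same loop would typically be written over a recordset in VBA.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE checks  (id INTEGER PRIMARY KEY, match_key TEXT, our_date TEXT);
CREATE TABLE returns (id INTEGER PRIMARY KEY, match_key TEXT, their_date TEXT);
-- Completed transactions: each check and each return row may appear only once.
CREATE TABLE matched (check_id INTEGER UNIQUE, return_id INTEGER UNIQUE);
""")
conn.executemany("INSERT INTO checks VALUES (?, ?, ?)", [
    (1, "1001*10.51", "2014-02-14"),
    (2, "1002*12.14", "2014-02-28"),
    (3, "1003*50.00", "2014-03-01"),
    (4, "1003*50.00", "2014-04-01"),
    (5, "1003*50.00", "2014-05-01"),
])
conn.executemany("INSERT INTO returns VALUES (?, ?, ?)", [
    (355, "1001*10.51", "2014-03-25"),
    (378, "1002*12.14", "2014-04-04"),
    (433, "1003*50.00", "2014-03-08"),
    (599, "1003*50.00", "2014-05-11"),
])

# Walk the unmatched return rows oldest first, as a cursor would.
unmatched_returns = conn.execute("""
    SELECT id, match_key FROM returns
    WHERE id NOT IN (SELECT return_id FROM matched)
    ORDER BY their_date, id
""").fetchall()

for return_id, key in unmatched_returns:
    # Earliest check with the same key that has not been paired yet.
    row = conn.execute("""
        SELECT id FROM checks
        WHERE match_key = ?
          AND id NOT IN (SELECT check_id FROM matched)
        ORDER BY our_date, id
        LIMIT 1
    """, (key,)).fetchone()
    if row is not None:
        conn.execute("INSERT INTO matched VALUES (?, ?)", (row[0], return_id))

pairs = conn.execute(
    "SELECT check_id, return_id FROM matched ORDER BY check_id"
).fetchall()
```

Because each pairing is written to matched before the next return row is examined, two of Sam's deposits arriving in the same weekly report still get two different checks.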

Optimal selection for ordering multiple items (parts) from multiple suppliers (vendors)

The task here is to define the optimal (as detailed below) way of ordering items (parts) from suppliers.
The relevant parts of the table schema (with some sample data) are
Items
ID NUMBER
1 Item0001
2 Item0002
3 Item0003
Suppliers
ID NAME DELIVERY DISCOUNT
1 Supplier0001 0 0
2 Supplier0002 0 0.025
3 Supplier0003 20 0
DELIVERY is the delivery charge (in dollars) levied by that supplier on each delivery. DISCOUNT is the settlement discount (as a percentage i.e. 2.5% for ID=2 above) allowed by that supplier for on time payment.
SupplierItems
SUPPLIER_ID ITEM_ID PRICE
1 2 21.67
1 5 45.54
1 7 32.97
This is the many-to-many join between suppliers and items with the price that supplier charges for that item (in dollars). Every item has at least 1 supplier but some have more than one. A supplier may have no items.
PartsRequests
ID ITEM_ID QUANTITY LOCATION_ID ORDER_ID
1 59 4 2 (null)
2 89 5 2 (null)
3 42 4 2 (null)
This table is a request from a field site for parts to be ordered and delivered by the supplier to that site. A delivery of any number of items to a site attracts a delivery charge. When the parts are ordered, the ORDER_ID is inserted into the table so we are only concerned with those where ORDER_ID IS NULL
The question is, what is the optimal way to order these parts for each LOCATION, where there are 3 optimal solutions that need to be presented to the user for selection:
1. The combination of orders with the least number of suppliers
2. The combination of orders with the lowest total cost, i.e. the sum of QUANTITY*PRICE for each item plus the DELIVERY for each order, summed over all orders, ignoring DISCOUNT
3. As item 2 but accounting for DISCOUNT
Clearly I need to determine the combinations of orders that are available and then determining the optimal ones becomes trivial but I am a bit stuck on an efficient way to deal with building the combinations.
I have built some SQL fiddles in SQL Server 2008 with random data. This one has 100 items, 10 suppliers and 100 requests. This one has 1000 items, 50 suppliers and 250 requests. The table schema is the same.
Update
I reasoned that the solution had to be recursive and I built a nice table-valued function, but I ran into the hard limit of 32 on recursion in SQL Server. I was uncomfortable with it anyway because it hinted more at a procedural-language solution than an RDBMS one.
So I am now playing with CTE recursion.
The root query is:
SELECT DISTINCT
     '' SOLUTION_ID
    ,LOCATION_ID
    ,SUPPLIER_ID
    ,(subquery I haven't quite worked out) SOLE_SUPPLIER
FROM PartsRequests pr
INNER JOIN SupplierItems si
    ON pr.ITEM_ID = si.ITEM_ID
WHERE pr.ORDER_ID IS NULL
This gets all the suppliers that can supply the required items and is certainly a solution, probably not optimal. The subquery sets a flag if the supplier is the sole supplier of any product required for that location; if so they must be part of any solution.
The recursive part is to remove suppliers one by one by means of CTE.SUPPLIER_ID<>CTE.SUPPLIER_ID and add them if they still cover all the items. The SOLUTION_ID will be a CSV list of the suppliers removed, partly to uniquely identify each solution and partly to check against so I get combinations instead of permutations.
Still working on the details, the purpose of this update was to allow the Community to say "Yay, looks like that will work" or, alternatively "You moron, that won't work because ..."
Thanks
This is a more general answer (as in, not sql) as I think solving this problem will require something more powerful. Your first scenario is to select a minimum number of suppliers. This problem can be seen as a set cover problem as you are trying to cover all demands per site with the suppliers. This problem is already NP-complete.
Your third scenario seems to be basically the same as the second. You just have to take the discount into account in the prices, assuming you pay on time for every order.
The second scenario is at least NP-hard as I see a lot of resemblance with the facility location problem. You are trying to decide which suppliers (facilities) to use (open) to cover your orders (demands) based on their prices and delivery costs (opening costs).
Enumerating your possible solutions seems infeasible as with 10 suppliers, you have 2^10 possibilities of using them, further complicated by the distribution of demands internally.
I would suggest some dynamic programming to first select the suppliers that you have to use (=they are the only ones that deliver a specific thing), eliminating some possibilities (if the cost for supplier A +delivery cost A< cost for supplier B) and then trying to expand your set of possible solutions. Linear programming is also a valid train of thought.
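To make the three objectives concrete, here is a small Python sketch that simply tries every subset of suppliers for one location. All the data in it is made up, and it assumes the settlement discount applies to item prices but not to the delivery charge. Since the number of subsets doubles with each supplier, treat it as a way to check results on small cases rather than as a general solution.

```python
from itertools import combinations

# Hypothetical data for one location (not taken from the question's fiddles).
prices = {            # supplier -> {item: unit price}
    1: {"A": 10.0, "B": 5.0},
    2: {"A": 9.0, "C": 7.0},
    3: {"B": 4.0, "C": 6.0},
}
delivery = {1: 0.0, 2: 0.0, 3: 20.0}     # charge per order
discount = {1: 0.0, 2: 0.025, 3: 0.0}    # settlement discount as a fraction
demand = {"A": 4, "B": 5, "C": 4}        # item -> requested quantity


def plan_cost(suppliers, use_discount):
    """Cost of buying every item from its cheapest supplier in `suppliers`.

    Returns None when some item cannot be supplied by this subset.
    Assumption: the discount applies to item prices only, not to delivery.
    """
    total, used = 0.0, set()
    for item, qty in demand.items():
        offers = []
        for s in suppliers:
            if item in prices[s]:
                factor = 1.0 - discount[s] if use_discount else 1.0
                offers.append((prices[s][item] * factor, s))
        if not offers:
            return None
        unit_price, chosen = min(offers)
        total += unit_price * qty
        used.add(chosen)
    # Delivery is charged only for suppliers that actually receive an order.
    return total + sum(delivery[s] for s in used)


def best_plans():
    """Try every non-empty supplier subset: 2^n - 1 candidates."""
    feasible = []
    for size in range(1, len(prices) + 1):
        for subset in combinations(sorted(prices), size):
            if plan_cost(subset, False) is not None:
                feasible.append(subset)
    fewest = min(len(s) for s in feasible)
    cheapest = min(plan_cost(s, False) for s in feasible)
    cheapest_discounted = min(plan_cost(s, True) for s in feasible)
    return fewest, cheapest, cheapest_discounted


fewest, cheapest, cheapest_discounted = best_plans()
```

Picking the cheapest supplier per item inside a subset can pull in a supplier whose delivery charge outweighs the saving, but the subset without that supplier is also tried, so the minimum over all subsets is still the best plan.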

SQL (Access): Multiple values per "ID" - Store as TRUE unless any FALSE

Good afternoon!
I have recently come across an issue that I am hoping can be solved with your help. Our system is [sadly] run on Access (2007). I have decent experience with SQL and elect to use this method for most of my queries rather than the Design View. However, this is the issue I have come across recently:
A table (with its own primary key of course) contains the ParticipantID and Records. This table may contain multiple records per person due to having events at various locations. With this information we track whether or not each record is within our system already due to the location falling under our large "umbrella" (Internal). To make it look simple, it looks something like this, ignoring the primary key as we only care about the participant ID.
ParticipantID Internal
1 -1
1 -1
2 0
3 -1
3 -1
3 0
4 -1
4 0
I want to be able to say if ANY of the records of a participant are not Internal (eg. =0), then in this query's results, store it as 0.
Hence, the results table would look something like:
ParticipantID Internal
1 -1
2 0
3 0
4 0
Does this make sense? Thank you in advance!
You can use Max:
SELECT internal.ParticipantID, Max(internal.Internal) AS MaxOfInternal
FROM internal
GROUP BY internal.ParticipantID;
I built the above using the query design window.
If the values of Internal can only be 0 and -1, the following may help:
Select ParticipantID,max(internal) from thetable
Group by ParticipantID
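A quick way to check this against the sample data, sketched with Python's sqlite3 module instead of Access (the table name records is made up). It works because Access stores True as -1 and False as 0, so any 0 in a group becomes the maximum:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (ParticipantID INTEGER, Internal INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?)", [
    (1, -1), (1, -1), (2, 0), (3, -1), (3, -1), (3, 0), (4, -1), (4, 0),
])

# MAX picks 0 whenever at least one record in the group is not Internal.
rows = conn.execute("""
    SELECT ParticipantID, MAX(Internal) AS Internal
    FROM records
    GROUP BY ParticipantID
    ORDER BY ParticipantID
""").fetchall()
```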

How should you separate dimension tables from fact tables if you are not building a data warehouse?

I realize that referring to these as dimension and fact tables is not exactly appropriate. I am at a loss for better terminology, so please excuse this categorization that I use in the post.
I am building an application for employee record keeping.
The database will contain organizational information. The information is mostly defined in three tables: Locations, Divisions, and Departments. However, there are others with similar problems. First, I need to store the available values for these tables. This will allow for available values in the application when managing an employee and for management of these values when adding/deleting departments and such. For instance, the Locations table may look like,
LocationId | LocationName | LocationStatus
1 | New York | Active
2 | Denver | Inactive
3 | New Orleans | Active
I then need to store these values for each employee and keep their history. My first thought was to create LocationHistory, DivisionHistory, and DepartmentHistory tables. I cannot pinpoint why, but this struck me as poor design. My next inclination was to create a DimLocation/FactLocation, DimDivision/FactDivision, DimDepartment/FactDepartment set of tables. I do not believe this makes sense either. I have also considered naming them as a combination of Employee, i.e. EmployeeLocations, EmployeeDivisions, etc. Regardless of the naming convention for these tables, I imagine that data would look similar to a simplified version I have below:
EmployeeId | LocationId | EffectiveDate | EndDate
1 | 3 | 2008-07-01 | NULL
1 | 2 | 2007-04-01 | 2008-06-30
I realize any of the imagined solutions I described above could work, but I am really looking to create a design that will be easy for others to maintain with an intuitive, familiar structure. I would like to receive this community's help, opinions, and experience with this matter. I am open to and would welcome any suggestion to consider. For instance, should I even store the available values for these three tables in the database? Should they be maintained in the application code/business logic layer? Do I just need to get over seeing the word History repeating three times?
Thanks!
Firstly, I see no issue in describing these as Dimension and Fact tables outside of a warehouse :)
In terms of conceptualising and understanding the relationships, I personally find start/end dates perfectly easy for people to understand. They allow Agent and Location fact tables, plus time-dependent mapping tables such as Agent_At_Location, etc. They do, however, have two issues worth noting:
1. If EndDate is 2008-08-30, was the employee in that location UP TO 30th August, or UP TO and INCLUDING 30th August?
2. Dealing with overlapping date periods can give messy queries and, more importantly, slow queries.
The first one seems simply a matter of convention, but it can have certain implications when dealing with other data. For example, consider that an EndDate of 2008-08-30 means that they ARE at that location UP TO and INCLUDING 30th August. Then you join on to their Daily Agent Data for that day (such as when they actually arrived at work, left for breaks, etc). You need to join ON AgentDailyData.EventTimeStamp < '2008-08-30' + 1 in order to include all the events that happened during that day.
This is because the data's EventTimeStamp isn't measured in days, but probably minutes or seconds.
If you consider that the EndDate of '2008-08-30' means that the Agent was at that Location UP TO but NOT INCLUDING 30th August, the join does not need the + 1. In fact you don't need to know whether the date is DAY bound or can include a time component. You just need TimeStamp < EndDate.
By using EXCLUSIVE End markers, all of your queries simplify and never need + 1 day, or + 1 hour to deal with edge conditions.
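As a rough illustration of the exclusive-end convention, here is a sketch using Python's sqlite3 module. The table layouts and the AgentId column are assumptions based on the names used above, and the timestamps are ISO text so that plain comparison orders them correctly. Each event falls into exactly one mapping row, with no + 1 anywhere:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE AgentAtLocation (AgentId INTEGER, LocationId INTEGER,
                              EffectiveDate TEXT, EndDate TEXT);
CREATE TABLE AgentDailyData (AgentId INTEGER, EventTimeStamp TEXT);
""")
# EndDate is exclusive: the agent is at location 2 up to, not including, 2008-08-30.
conn.execute("INSERT INTO AgentAtLocation VALUES (1, 2, '2008-08-01', '2008-08-30')")
conn.execute("INSERT INTO AgentAtLocation VALUES (1, 3, '2008-08-30', '2008-09-15')")
conn.executemany("INSERT INTO AgentDailyData VALUES (?, ?)", [
    (1, "2008-08-29 23:59:59"),
    (1, "2008-08-30 00:00:00"),
    (1, "2008-08-30 09:15:00"),
])

# Half-open interval test: start <= timestamp < end.
rows = conn.execute("""
    SELECT d.EventTimeStamp, m.LocationId
    FROM AgentDailyData d
    JOIN AgentAtLocation m
      ON m.AgentId = d.AgentId
     AND d.EventTimeStamp >= m.EffectiveDate
     AND d.EventTimeStamp <  m.EndDate
    ORDER BY d.EventTimeStamp
""").fetchall()
```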
The second one is much harder to resolve. The simplest way of resolving an overlapping period is as follows:
SELECT
    CASE WHEN TableA.InclusiveFrom > TableB.InclusiveFrom THEN TableA.InclusiveFrom ELSE TableB.InclusiveFrom END AS [NetInclusiveFrom],
    CASE WHEN TableA.ExclusiveFrom < TableB.ExclusiveFrom THEN TableA.ExclusiveFrom ELSE TableB.ExclusiveFrom END AS [NetExclusiveFrom]
FROM
    TableA
INNER JOIN
    TableB
        ON  TableA.InclusiveFrom < TableB.ExclusiveFrom
        AND TableA.ExclusiveFrom > TableB.InclusiveFrom
-- Where InclusiveFrom is the StartDate
-- And ExclusiveFrom is the EndDate, up to but NOT including that date
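To see what the overlap join returns, here is a small sketch run through Python's sqlite3 module with made-up rows (the column names follow the query above). A period that starts exactly where another one ends is not treated as overlapping, which is what the exclusive end gives you:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableA (InclusiveFrom TEXT, ExclusiveFrom TEXT);
CREATE TABLE TableB (InclusiveFrom TEXT, ExclusiveFrom TEXT);
INSERT INTO TableA VALUES ('2008-01-01', '2008-03-01');
INSERT INTO TableB VALUES ('2008-02-01', '2008-04-01');  -- overlaps A during February
INSERT INTO TableB VALUES ('2008-03-01', '2008-05-01');  -- starts where A ends: no overlap
""")

# Net period = later of the two starts, earlier of the two ends.
rows = conn.execute("""
    SELECT
        CASE WHEN a.InclusiveFrom > b.InclusiveFrom
             THEN a.InclusiveFrom ELSE b.InclusiveFrom END AS NetInclusiveFrom,
        CASE WHEN a.ExclusiveFrom < b.ExclusiveFrom
             THEN a.ExclusiveFrom ELSE b.ExclusiveFrom END AS NetExclusiveFrom
    FROM TableA a
    INNER JOIN TableB b
        ON  a.InclusiveFrom < b.ExclusiveFrom
        AND a.ExclusiveFrom > b.InclusiveFrom
""").fetchall()
```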
The problem with that query is one of indexing. The first condition, TableA.InclusiveFrom < TableB.ExclusiveFrom, could be resolved using an index, but it could give a massive range of dates. Then, for each of those records, the ExclusiveFrom values could be just about anything, and certainly not in an order that helps to quickly resolve TableA.ExclusiveFrom > TableB.InclusiveFrom.
The solution I have previously used for that is to have a maximum allowed gap between InclusiveFrom and ExclusiveFrom. This allows something like...
ON TableA.InclusiveFrom < TableB.ExclusiveFrom
AND TableA.InclusiveFrom >= TableB.InclusiveFrom - 30
AND TableA.ExclusiveFrom > TableB.InclusiveFrom
The condition TableA.ExclusiveFrom > TableB.InclusiveFrom STILL can't benefit from indexes. But we've limited the number of rows that the search on TableA.InclusiveFrom can return: at most 30 days of data, because we know that we restricted the duration to a maximum of 30 days.
An example of this is to break up the associations by calendar month (max duration of 31 days).
EmployeeId | LocationId | EffectiveDate | EndDate
1 | 2 | 2007-04-01 | 2007-05-01
1 | 2 | 2007-05-01 | 2007-06-01
1 | 2 | 2007-06-01 | 2007-06-25
(Representing Employee 1 being in Location 2 from 1st April to (but not including) 25th June.)
It's effectively a trade off; using Disk Space to gain performance.
I've even seen this pushed to the extreme of not actually storing date Ranges, but storing the actual mapping for each and every day. Essentially, it's like restricting the maximum duration to 1 day...
EmployeeId | LocationId | EffectiveDate
1 | 2 | 2007-06-23
1 | 2 | 2007-06-24
1 | 3 | 2007-06-25
1 | 3 | 2007-06-26
Instinctively I initially rebelled against this. But in subsequent ETL, Warehousing, Reporting, etc, I actually found it Very powerful, adaptable, and maintainable. I actually saw people making fewer coding mistakes, writing code in less time, the code ending up running faster, and being much more able to adapt to clients' changing needs.
The only two down sides were:
1. More disk space taken (but trivial compared to the size of fact tables)
2. Inserts and updates to this mapping were slower
The slowdown for inserts and updates only actually mattered once, where this model was being used to represent a constantly changing process net and the app wanted to change the mapping about 30 times a second. Even then it worked; it just chomped up more CPU time than was ideal.
If you want to be efficient and keep a history, do these things. There are multiple solutions to this problem, but this is the one that I keep going back to:
Remember that each row represents a single entity. If you make corrections to that entity, that's fine, but don't re-use an ID for a new Location. Set it up so that instead of deleting a Location, you mark it as deleted with a bit and hide it from the interface; that way, when it's referenced historically, it's still there.
Create a history table that includes the current value, or no records if a value isn't currently set. Have the foreign keys tie back to the employee and to the location.
Create a column in the employee table that points to the current active location in the history. When you need to get the employee's location, you join in the history table based on this ID. When you need to get all of the history for an employee, you join from the history table.
This structure keeps it all normalized, and gives you an easy way to find the current value without having to do any date comparisons.
As far as using the word history, think of it in different terms: since it contains the current item as well as historical items, it's really just a junction table that keeps around the old item. As such you can name it something like EmployeeLocations.
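A minimal sketch of that structure, run through Python's sqlite3 module. The column names CurrentEmployeeLocationId, EmployeeLocationId and IsDeleted are placeholders of mine, and the dates are borrowed from the question's example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Locations (
    LocationId   INTEGER PRIMARY KEY,
    LocationName TEXT NOT NULL,
    IsDeleted    INTEGER NOT NULL DEFAULT 0   -- soft delete; ids are never reused
);
CREATE TABLE Employees (
    EmployeeId                INTEGER PRIMARY KEY,
    CurrentEmployeeLocationId INTEGER          -- points at the active history row
);
CREATE TABLE EmployeeLocations (
    EmployeeLocationId INTEGER PRIMARY KEY,
    EmployeeId    INTEGER NOT NULL REFERENCES Employees(EmployeeId),
    LocationId    INTEGER NOT NULL REFERENCES Locations(LocationId),
    EffectiveDate TEXT NOT NULL
);
INSERT INTO Locations (LocationId, LocationName) VALUES (2, 'Denver'), (3, 'New Orleans');
INSERT INTO Employees (EmployeeId) VALUES (1);
INSERT INTO EmployeeLocations VALUES (10, 1, 2, '2007-04-01'), (11, 1, 3, '2008-07-01');
UPDATE Employees SET CurrentEmployeeLocationId = 11 WHERE EmployeeId = 1;
""")

# Current location: follow the pointer, no date comparison needed.
current = conn.execute("""
    SELECT l.LocationName
    FROM Employees e
    JOIN EmployeeLocations h ON h.EmployeeLocationId = e.CurrentEmployeeLocationId
    JOIN Locations l ON l.LocationId = h.LocationId
    WHERE e.EmployeeId = 1
""").fetchone()

# Full history: read the junction table directly.
history = conn.execute("""
    SELECT LocationId, EffectiveDate
    FROM EmployeeLocations
    WHERE EmployeeId = 1
    ORDER BY EffectiveDate
""").fetchall()
```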

Cross Tab - Storing different dates (Meeting1, Meeting2, Meeting 3 etc) in the same column

I need to keep track of different dates (dynamic). So for a specific Task you could have X number of dates to track (for example DDR1 meeting date, DDR2 meeting date, Due Date, etc).
My strategy was to create one table (DateTypeID, DateDescription) which would store the description of each date. Then I could create the main table (ID, TaskDescription, DateTypeID). So all the dates would be in one column and you could tell what that date represents by looking at the TypeID. The problem is displaying it in a grid. I know I should use a cross tab query, but I cannot get it to work. For example, I use a CASE statement in SQL Server 2000 to pivot the table so that each column name is the name of the date type. If we have the following tables:
DateType Table
DateTypeID | DateDescription
1 | DDR1
2 | DDR2
3 | DueDate
Tasks Table
ID | TaskDescription
1 | Create Design
2 | Submit Paperwork
Tasks_DateType Table
TasksID | DateTypeID | Date
1 | 1 | 09/09/2009
1 | 2 | 10/10/2009
2 | 1 | 11/11/2009
2 | 3 | 12/12/2009
THE RESULT SHOULD BE:
TaskDescription | DDR1 | DDR2 | DueDate
Create Design | 09/09/2009 | 10/10/2009 | null
Submit Paperwork | 11/11/2009 | null | 12/12/2009
IF anyone has any idea how I can go about researching this, I appreciate it. The reason I do this instead of making a column for each date, has to do with the ability to let the user in the future add as many dates as they want without having to manually add columns to the table and editing html code. This also allows simple code for comparing dates or show upcoming tasks by their type (ex. 'Create design's DDR1 date is coming up' ) If anyone can point me in the right direction, I appreciate it.
Here is a proper answer, tested with your data. I only used the first two date types, but you'd build this up on the fly anyway.
Select
    Tasks.TaskDescription,
    Min(Case DateType.DateDescription When 'DDR1' Then Tasks_DateType.Date End) As DDR1,
    Min(Case DateType.DateDescription When 'DDR2' Then Tasks_DateType.Date End) As DDR2
From
    Tasks_DateType
    INNER JOIN Tasks ON Tasks_DateType.TasksID = Tasks.ID
    INNER JOIN DateType ON Tasks_DateType.DateTypeID = DateType.DateTypeID
Group By
    Tasks.TaskDescription
EDIT
van mentioned that tasks with no dates won't show up. This is correct. Using left joins (again, mentioned by van) and restructuring the query a bit will return all tasks, even though this is not your need at the moment.
Select
    Tasks.TaskDescription,
    Min(Case DateType.DateDescription When 'DDR1' Then Tasks_DateType.Date End) As DDR1,
    Min(Case DateType.DateDescription When 'DDR2' Then Tasks_DateType.Date End) As DDR2
From
    Tasks
    LEFT OUTER JOIN Tasks_DateType ON Tasks_DateType.TasksID = Tasks.ID
    LEFT OUTER JOIN DateType ON Tasks_DateType.DateTypeID = DateType.DateTypeID
Group By
    Tasks.TaskDescription
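For anyone who wants to try the CASE pivot without a SQL Server instance, here is a sketch that loads the question's sample data into an in-memory SQLite database from Python. It uses the column names from the question's tables (Tasks.ID and Tasks_DateType.TasksID), includes all three date types, and groups by the task ID as well, which is my own addition to keep identically named tasks apart:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DateType (DateTypeID INTEGER PRIMARY KEY, DateDescription TEXT);
CREATE TABLE Tasks (ID INTEGER PRIMARY KEY, TaskDescription TEXT);
CREATE TABLE Tasks_DateType (TasksID INTEGER, DateTypeID INTEGER, Date TEXT);
INSERT INTO DateType VALUES (1, 'DDR1'), (2, 'DDR2'), (3, 'DueDate');
INSERT INTO Tasks VALUES (1, 'Create Design'), (2, 'Submit Paperwork');
INSERT INTO Tasks_DateType VALUES
    (1, 1, '09/09/2009'), (1, 2, '10/10/2009'),
    (2, 1, '11/11/2009'), (2, 3, '12/12/2009');
""")

# One MIN(CASE ...) per date type; MIN ignores the NULLs from non-matching rows.
rows = conn.execute("""
    SELECT t.TaskDescription,
           MIN(CASE dt.DateDescription WHEN 'DDR1'    THEN td.Date END) AS DDR1,
           MIN(CASE dt.DateDescription WHEN 'DDR2'    THEN td.Date END) AS DDR2,
           MIN(CASE dt.DateDescription WHEN 'DueDate' THEN td.Date END) AS DueDate
    FROM Tasks t
    LEFT JOIN Tasks_DateType td ON td.TasksID = t.ID
    LEFT JOIN DateType dt ON dt.DateTypeID = td.DateTypeID
    GROUP BY t.ID, t.TaskDescription
    ORDER BY t.ID
""").fetchall()
```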
If the pivoted columns are unknown (dynamic), then you'll have to build up your query manually in either MS SQL 2000 or 2005, i.e. with or without PIVOT.
This involves either executing dynamic sql in a stored procedure (generally a no-no) or querying a view with dynamic sql. The latter is the approach I generally go with.
For pivoting, I prefer the Rozenshtein method over case statements, as explained here:
http://www.stephenforte.net/PermaLink.aspx?guid=2b0532fc-4318-4ac0-a405-15d6d813eeb8
EDIT
You can also do this in linq-to-sql, but it emits some pretty inefficient code (at least when I view it through linqpad), so I don't recommend it. If you're still curious I can post an example of how to do it.
I don't have personal experience with the pivot operator, it may provide a better solution.
But I've used a case statement in the past
SELECT
    TaskDescription,
    MIN(CASE WHEN Tasks_DateType.DateTypeID = 1 THEN Tasks_DateType.Date END) AS DDR1,
    MIN(CASE WHEN Tasks_DateType.DateTypeID = 2 THEN Tasks_DateType.Date END) AS DDR2,
    ...
FROM Tasks
INNER JOIN Tasks_DateType ON Tasks.ID = Tasks_DateType.TasksID
INNER JOIN DateType ON Tasks_DateType.DateTypeID = DateType.DateTypeID
GROUP BY TaskDescription
This will work, but will require you to change the SQL whenever more date types are added, so it's not ideal.
EDIT:
It appears as though the PIVOT keyword was added in SqlServer 2005, this example shows how to do a pivot query in both 2000 & 2005, but it is similar to my answer.
Version-1: +simple, -must be changed every time DateType is added. So is not great for a dynamic solution:
SELECT tt.ID,
       tt.TaskDescription,
       td1.Date AS DDR1,
       td2.Date AS DDR2,
       td3.Date AS DueDate
FROM Tasks tt
LEFT JOIN Tasks_DateType td1
    ON td1.TasksID = tt.ID AND td1.DateTypeID = 1
LEFT JOIN Tasks_DateType td2
    ON td2.TasksID = tt.ID AND td2.DateTypeID = 2
LEFT JOIN Tasks_DateType td3
    ON td3.TasksID = tt.ID AND td3.DateTypeID = 3
Version-2: completely dynamic (with some limitations, but they can be handled - just google for it):
Dynamic pivot query creation. See Dynamic Cross-Tabs/Pivot Tables: you need to create one SP or UDF and can then use it for multiple purposes. This is the original post, to which you may find many links and improvements.
Version-3: just leave it for your client code to handle. I would not design my SQL to return a dynamic set of data, but rather handle it on the client (presentation layer). I would not like to handle dynamic columns that come back from my query, where I need to guess what each one is. The only reason I use Version-2 is when the result is presented directly as a table for a report. In all other cases for truly dynamic data I use client code. For example: with the structure you have, how will you attach logic that the field DueDate is mandatory? You cannot use DB constraints. How will you ensure that DDR1 is not later than DDR2? If these are not separate (static) columns in the database (where you can use CONSTRAINTS), then the client code is the one that validates your data consistency.
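To give an idea of what the dynamic variant (Version-2) involves, here is a sketch that builds the cross-tab query text from the rows in DateType. It is written in Python against SQLite purely for illustration: the helper name build_pivot_sql and the extra 'Review' date type are made up, and in SQL Server the same string would be assembled in T-SQL and executed as dynamic SQL.

```python
import sqlite3


def build_pivot_sql(conn):
    """Return a cross-tab SELECT with one output column per row in DateType."""
    select_list = ["t.TaskDescription"]
    for type_id, description in conn.execute(
        "SELECT DateTypeID, DateDescription FROM DateType ORDER BY DateTypeID"
    ):
        # Quote the alias as an identifier so odd descriptions cannot break the SQL.
        alias = '"' + description.replace('"', '""') + '"'
        select_list.append(
            f"MIN(CASE WHEN td.DateTypeID = {int(type_id)} THEN td.Date END) AS {alias}"
        )
    return (
        "SELECT " + ", ".join(select_list) + " "
        "FROM Tasks t LEFT JOIN Tasks_DateType td ON td.TasksID = t.ID "
        "GROUP BY t.ID, t.TaskDescription ORDER BY t.ID"
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DateType (DateTypeID INTEGER PRIMARY KEY, DateDescription TEXT);
CREATE TABLE Tasks (ID INTEGER PRIMARY KEY, TaskDescription TEXT);
CREATE TABLE Tasks_DateType (TasksID INTEGER, DateTypeID INTEGER, Date TEXT);
INSERT INTO DateType VALUES (1, 'DDR1'), (2, 'DDR2'), (3, 'DueDate'), (4, 'Review');
INSERT INTO Tasks VALUES (1, 'Create Design'), (2, 'Submit Paperwork');
INSERT INTO Tasks_DateType VALUES
    (1, 1, '09/09/2009'), (1, 2, '10/10/2009'),
    (2, 1, '11/11/2009'), (2, 3, '12/12/2009');
""")

cursor = conn.execute(build_pivot_sql(conn))
headers = [column[0] for column in cursor.description]
rows = cursor.fetchall()
```

The 'Review' type added to DateType shows up as a fourth date column without any change to the code, which is the point of the dynamic version; the cost is that the caller has to read the column names from the result instead of knowing them up front.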
Good luck!