Merge only Missing Data - sql

I am working on an HR project that provides data to me in the form of an Excel document.
I have created a package that captures the data from the Spreadsheet and imports it into SQL. The customer then wanted to create a data connection and place the data into Pivot Tables to manipulate and run calculations on.
This brought to light a small issue that I have tried to get fixed at the source, but it looks like it cannot be resolved on the system side (we are working with an SAP backend).
What I have is information that comes into SQL from the import that is either missing the Cost Center Name or both the cost center number and the cost center name.
EXAMPLE:
EmpID  EmployeeName  CostCenterNo  CostCenterName
001    Bob Smith     123456        Sales
010    Adam Eve      543211        Marketing
050    Thomas Adams  121111
121    James Avery
I worked with HR to get the appropriate information for these employees and added it to a separate table.
What I would like to do is figure out a way to insert the missing information as the data is imported into the staging table, essentially completing the data:
EmpID  EmployeeName  CostCenterNo  CostCenterName
001    Bob Smith     123456        Sales
010    Adam Eve      543211        Marketing
050    Thomas Adams  121111        Supply Chain
121    James Avery   555316        Human Resources

Is there an issue with a basic update like this?
Update <tablename> set CostCenterNo = (SELECT CostCenterNo from <hr_sourced_table> where EmpID = x) where EmpID = x
If needed, you can restrict it to the rows that are missing data by adding
and CostCenterNo is null
Without that condition the update touches every matching row, which should still be correct as long as the HR table is right. You can also update both fields in a single query like this:
Update <tablename> set CostCenterNo = (SELECT CostCenterNo from <hr_sourced_table> where EmpID = x), CostCenterName = (SELECT CostCenterName from <hr_sourced_table> where EmpID = x) where EmpID = x

If your data source table and the extra mapping information are both accessible from the same place, you don't have to update anything with SSIS. Just build a view that joins the two tables and populate the pivot table from the view. You will have to decide what to do if the data source and the mapping table disagree, but that is a business rule question.
Select e.EMPLID, e.EmployeeName, cc.CostCenterNo, cc.CostCenterName
From Employees e
Left Join CCMapping cc on e.emplid=cc.emplid
OR
Select e.EMPLID, e.EmployeeName,
coalesce(e.CostCenterNo, cc.CostCenterNo) as CostCenterNo,
coalesce(e.CostCenterName, cc.CostCenterName) as CostCenterName
From Employees e
Left Join CCMapping cc on e.emplid=cc.emplid
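The COALESCE variant can be tried end to end with a small sketch. This uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the view name EmployeesComplete, the single EmpID key spelling, and the sample mapping rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (EmpID TEXT PRIMARY KEY, EmployeeName TEXT,
                        CostCenterNo TEXT, CostCenterName TEXT);
CREATE TABLE CCMapping (EmpID TEXT PRIMARY KEY,
                        CostCenterNo TEXT, CostCenterName TEXT);
INSERT INTO Employees VALUES
  ('001', 'Bob Smith',    '123456', 'Sales'),
  ('050', 'Thomas Adams', '121111', NULL),
  ('121', 'James Avery',  NULL,     NULL);
INSERT INTO CCMapping VALUES
  ('050', '121111', 'Supply Chain'),
  ('121', '555316', 'Human Resources');
-- Hypothetical view name; source values win, the mapping only fills gaps.
CREATE VIEW EmployeesComplete AS
SELECT e.EmpID, e.EmployeeName,
       COALESCE(e.CostCenterNo,   cc.CostCenterNo)   AS CostCenterNo,
       COALESCE(e.CostCenterName, cc.CostCenterName) AS CostCenterName
FROM Employees e
LEFT JOIN CCMapping cc ON e.EmpID = cc.EmpID;
""")

rows = conn.execute("SELECT * FROM EmployeesComplete ORDER BY EmpID").fetchall()
for row in rows:
    print(row)
```

If the import stores blanks as empty strings rather than NULL, COALESCE alone will not fill them; wrapping the source column in NULLIF(column, '') is one way to handle that.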

I would use a lookup transformation in your data flow that sources the missing data you got from HR. Then join this lookup data on a mutual field in the data coming from your sources (EmpID?). You can then add the cost center no and cost center name from the missing data table to the data flow. In a derived column transformation you can test to see if the data from the source is null and if so, use the columns that came from the missing data table to store in the destination table.

As I see it, your options are to complete the data in flight or update the data after it has landed. Which route I would choose depends on the level of complexity.
In flight
Generally speaking, this is my preference. I'd rather have all the scrubbing take place while the data is moving versus applying a series of patches afterward to shine the data.
In your Data Flow, I would have a Conditional Split to funnel the data into 2 to 3 streams: has all data, has cost center, and has nothing.
"Has all data" would route directly into a Union All
"Has cost center" would lead to a Lookup Component, which would use the supplied cost center to look up the associated text in the reference table. The Lookup Component expects to find matches, so if a cost center might not exist in your reference table, you will need to handle that situation. The version of SSIS you are using determines whether you can just use the no-match output (2008+) or whether you have to commandeer the Error Output (2005). Either way, you will need to indicate to the Lookup that failure to match should not result in a package level failure. Once you've handled this lookup and the no-match option, join that stream to the Union All.
"has nothing" might behave as the "has cost center" stream where you will perform some lookup on other columns to determine cost center or you might simply apply a default/known-unknown value for the missing entities. How that works will depend on the rules your business owners have supplied.
Post processing
This keeps your data flow exactly as it is. You would simply add an Execute SQL Task after the Data Flow to polish any tarnished data. Whether I do this entirely in-line in the Execute SQL Task or create a dedicated clean up stored procedure would be based in part on the level of effort it takes to get code changed. Some places, pushing an SSIS package change is a chipshot activity. Other places, it takes an act of the SOX deities to get a package change pushed, but they are fine with proc changes.
My gut would be to push the scrubber logic into a stored procedure. Then your package wouldn't have to change every time they come up with scenarios that the original queries didn't satisfy.
You would have 2 statements in the proc, much as we performed in the In flight section. One query will update populating the Cost Center name. The other will apply cost center and name. If you need help with the specifics of the actual query, let me know and I can update this answer.
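As a rough idea of what those two statements might look like, here is a sketch run against SQLite through Python's sqlite3 module rather than SQL Server. The table names Staging, CostCenterRef (cost center number to name) and HrFixes (the per-employee data from HR) are hypothetical, and blanks are assumed to arrive as NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Staging (EmpID TEXT PRIMARY KEY, EmployeeName TEXT,
                      CostCenterNo TEXT, CostCenterName TEXT);
CREATE TABLE CostCenterRef (CostCenterNo TEXT PRIMARY KEY, CostCenterName TEXT);
CREATE TABLE HrFixes (EmpID TEXT PRIMARY KEY,
                      CostCenterNo TEXT, CostCenterName TEXT);
INSERT INTO Staging VALUES
  ('001', 'Bob Smith',    '123456', 'Sales'),
  ('050', 'Thomas Adams', '121111', NULL),
  ('121', 'James Avery',  NULL,     NULL);
INSERT INTO CostCenterRef VALUES ('121111', 'Supply Chain');
INSERT INTO HrFixes VALUES ('121', '555316', 'Human Resources');
""")

# Statement 1: the number is present, only the name is missing.
conn.execute("""
UPDATE Staging
SET CostCenterName = (SELECT r.CostCenterName FROM CostCenterRef r
                      WHERE r.CostCenterNo = Staging.CostCenterNo)
WHERE CostCenterNo IS NOT NULL
  AND CostCenterName IS NULL
  AND EXISTS (SELECT 1 FROM CostCenterRef r
              WHERE r.CostCenterNo = Staging.CostCenterNo)
""")

# Statement 2: both number and name are missing, so take both from HR's table.
conn.execute("""
UPDATE Staging
SET CostCenterNo   = (SELECT h.CostCenterNo   FROM HrFixes h WHERE h.EmpID = Staging.EmpID),
    CostCenterName = (SELECT h.CostCenterName FROM HrFixes h WHERE h.EmpID = Staging.EmpID)
WHERE CostCenterNo IS NULL
  AND EXISTS (SELECT 1 FROM HrFixes h WHERE h.EmpID = Staging.EmpID)
""")

result = conn.execute("SELECT * FROM Staging ORDER BY EmpID").fetchall()
for row in result:
    print(row)
```

The EXISTS guards leave a row untouched when no replacement is available, instead of overwriting it with NULL. In SQL Server the same logic would more commonly be written with UPDATE ... FROM and a join.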

I worked with another developer to create a solution; here is what we came up with.
I created an "Execute SQL Task" to run after the data flow, with this script in it:
MERGE [Staging].[HRIS_EEMaster] AS tgt
USING (
    SELECT PersNo AS EmpID,
           CostCenterNo AS CCNo,
           CostCenterName AS CCName
    FROM [dbo].[MissingTermedCC]
) AS src ON src.EmpID = tgt.PersNo
WHEN NOT MATCHED BY TARGET
    THEN INSERT (
        PersNo,
        CostCenterNo,
        CostCenterSubDiv
    )
    VALUES (
        src.EmpID,
        src.CCNo,
        src.CCName
    )
WHEN MATCHED
    THEN UPDATE
    SET tgt.CostCenterNo = CASE
            WHEN src.CCNo > '' THEN src.CCNo
            ELSE tgt.CostCenterNo
        END,
        tgt.CostCenterSubDiv = CASE
            WHEN src.CCName > '' THEN src.CCName
            ELSE tgt.CostCenterSubDiv
        END;
I wanted to share in case anyone else runs into a similar issue. Thanks again for all of the help everyone.


MS Access - Query - Required Forms for Each Employee

I have 3 tables, all SharePoint lists. I am trying to create a query that will show me all of the required DQ_File Forms that do not have an attachment in the DQ_File.
DQ_File_Lookup is a lookup table for the description field in the DQ_File. It also has the "DQRequired" flag I am looking for to see all of the required fields that do not have an attachment.
I have included a screen shot showing the table layouts and relations.
Any help would be appreciated, I am sure I am just overlooking something obvious.
An example would be as follows:
Employee Name | Document Name
You would have employee Joe, who has forms A, B and D out of the possible forms A, B, C, D, E and F, so he would be missing forms C, E and F.
So the employee name would come from the employee table, and the document name needs to get passed through the DQ_File table from DQ_File_Lookup.
The way I thought to do it was to show all documents from the DQ_File table that are missing an attachment, which I can do. But that only shows the information that has an entry. There are certain forms that are required for every employee, and I want to be able to see if an employee is missing any of those forms.
Using what @June7 posted below I got it to work, and it now shows me all 15 documents that are required for every driver. But when I add the attachment field from DQ_File it shows them all as having zero attachments, when I know some of them do indeed have attachments already.
Here is a screen cap showing this.
Williams in particular should only have about 5 documents that should be on this list, but instead it is showing like all 15 are missing.
Here is the SQL from the combined query:
SELECT [qryEmployees+DQFileLookup].Last, [qryEmployees+DQFileLookup].Description, DQ_File.Attachment
FROM DQ_File RIGHT JOIN [qryEmployees+DQFileLookup] ON DQ_File.EmployeeNo = [qryEmployees+DQFileLookup].EmployeeCode
WHERE (((DQ_File.Attachment.FileURL) Is Null) AND (([qryEmployees+DQFileLookup].CURRENT)=True) AND (([qryEmployees+DQFileLookup].DRIVER)=True) AND (([qryEmployees+DQFileLookup].DQRequired)=True));
If you want to know which required docs employees do not have, then you need a dataset of all possible combinations of employees and docs. Then match that dataset against DQ_File to see what is missing. The all-combinations dataset can be generated with a Cartesian query (a query without a JOIN clause), in which every record of each table is associated with every record of the other table.
SELECT Employees.*, DQ_File_Lookup.* FROM Employees, DQ_File_Lookup;
Then join that query with DQ_File.
SELECT Query1.EmployeeID, Query1.First, Query1.Last, Query1.ID, Query1.Title, Query1.DQRequired, DQ_File.Description, DQ_File.EmployeeNo
FROM DQ_File RIGHT JOIN Query1 ON (DQ_File.EmployeeNo = Query1.EmployeeID) AND (DQ_File.Description = Query1.ID)
WHERE (((Query1.DQRequired)=True) AND ((DQ_File.EmployeeNo) Is Null));
Advise not to use exact same field names in multiple tables. For instance, Title in DQ_File_Lookup could be DocTitle and Title in Employees could be JobTitle. And there will be less confusion if ID is not used as name in all tables.
It seems unnecessary to repeat Title and [Compliance Asset ID] in all 3 tables.
Strongly advise not to use spaces in naming convention. Title case is better than all upper case.
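The Cartesian-then-outer-join idea from this answer can be checked with a small sketch. It runs on SQLite via Python's sqlite3 module instead of Access, uses a LEFT JOIN where the answer uses the equivalent RIGHT JOIN, and the sample rows (Joe, Ann, forms A to G) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Last TEXT);
CREATE TABLE DQ_File_Lookup (ID INTEGER PRIMARY KEY, Title TEXT, DQRequired INTEGER);
CREATE TABLE DQ_File (EmployeeNo INTEGER, Description INTEGER);
INSERT INTO Employees VALUES (1, 'Joe'), (2, 'Ann');
INSERT INTO DQ_File_Lookup VALUES
  (1, 'A', 1), (2, 'B', 1), (3, 'C', 1),
  (4, 'D', 1), (5, 'E', 1), (6, 'F', 1),
  (7, 'G', 0);                        -- G is not required
-- Joe has A, B and D; Ann has every required form.
INSERT INTO DQ_File VALUES (1, 1), (1, 2), (1, 4),
  (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6);
""")

missing = conn.execute("""
SELECT e.Last, l.Title
FROM Employees e
CROSS JOIN DQ_File_Lookup l            -- every employee paired with every document
LEFT JOIN DQ_File f
       ON f.EmployeeNo = e.EmployeeID
      AND f.Description = l.ID         -- match on employee AND document
WHERE l.DQRequired = 1
  AND f.EmployeeNo IS NULL             -- keep only pairs with no file row
ORDER BY e.Last, l.Title
""").fetchall()
print(missing)
```

The outer join matches on both the employee and the document. Joining on the employee alone cannot tell which document a DQ_File row belongs to, which may be why every required document showed up as missing in the combined query from the question.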

stored procedure that returns in one row duplicate column names and converts 1->N to a string

I'm trying to put all of the below in a single stored procedure that returns a single row, because the data is up on SQL Azure and the rule for it is to do everything in a single query with a single return.
I have the following tables:
Person (
PersonId
FirstName
...
)
CompanyDomains (
CompanyId
EmailDomain
)
Company (
CompanyId
CompanyName
Billing_PersonId
Admin_PersonId
...
)
I have two problems here. The first is I want to get all the elements of a Company row, and the 2 Person rows of data. That's easy with a join. But the columns for the 2 person columns will have duplicate names. I can do 'as' one by one, which is a pain as the database schema is still in a state of flux. Is there a global way to apply 'as' so all the columns brought in from Billing_PersonId get a Billing_ prepended to the column name and Admin_ prepended to the admin column name?
The second is there is a 1->N list of company domains. Is there a way to pull all those and add a column that is a single string that has "domain1;domain2;" in it? We have the distinct domains in the CompanyDomain table so we can quickly find the company that owns any domain. But a single string works fine when I'm reading the company in.
I know single SQL selects pretty well. But I've got very little experience with stored procedures (aside from calling them) and so what I'm asking here may be basic. If so, sorry. And again, this is for Sql Azure.
thanks - dave
If you are using Azure, then your application should be able to parse XML.
Write a stored procedure to join the three tables, select the data given an input like company id, and return an xml record containing information from all three.
Look at the following references.
FOR XML
http://msdn.microsoft.com/en-us/library/ms178107.aspx
CREATE PROCEDURE
http://msdn.microsoft.com/en-us/library/ms187926.aspx
If you need more help, you need to post a simple schema with sample data.
USE MY EXAMPLE BELOW FOR SCHEMA + DATA
sql select a field into 2 columns
1 - Without detailed information, no one will be able to help you.
2 - Try it on your own. I can give you the answer but you will not learn anything.
Sincerely
John
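For readers who want something concrete to experiment with, the sketch below shows a different technique from the FOR XML route suggested above: it builds the prefixed column aliases from table metadata and folds the domains into one string. It runs on SQLite through Python's sqlite3 module, so PRAGMA table_info and group_concat are SQLite features standing in for whatever SQL Azure offers, and the helper prefixed() and all sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (PersonId INTEGER PRIMARY KEY, FirstName TEXT);
CREATE TABLE Company (CompanyId INTEGER PRIMARY KEY, CompanyName TEXT,
                      Billing_PersonId INTEGER, Admin_PersonId INTEGER);
CREATE TABLE CompanyDomains (CompanyId INTEGER, EmailDomain TEXT);
INSERT INTO Person VALUES (1, 'Ann'), (2, 'Raj');
INSERT INTO Company VALUES (10, 'Acme', 1, 2);
INSERT INTO CompanyDomains VALUES (10, 'acme.com'), (10, 'acme.org');
""")

def prefixed(table, alias, prefix, skip=("PersonId",)):
    """Build 'alias.col AS prefixcol' for each column, read from table metadata.

    PersonId is skipped because Company already carries it as
    Billing_PersonId / Admin_PersonId.
    """
    cols = [r[1] for r in conn.execute(f"PRAGMA table_info({table})")]
    return ", ".join(f"{alias}.{c} AS {prefix}{c}" for c in cols if c not in skip)

sql = f"""
SELECT c.*,
       {prefixed('Person', 'b', 'Billing_')},
       {prefixed('Person', 'a', 'Admin_')},
       (SELECT group_concat(d.EmailDomain, ';')
        FROM CompanyDomains d
        WHERE d.CompanyId = c.CompanyId) AS Domains
FROM Company c
JOIN Person b ON b.PersonId = c.Billing_PersonId
JOIN Person a ON a.PersonId = c.Admin_PersonId
WHERE c.CompanyId = ?
"""
cur = conn.execute(sql, (10,))
names = [d[0] for d in cur.description]
row = dict(zip(names, cur.fetchone()))
print(row)
```

Because the alias list is generated from the table definition, new Person columns are picked up without editing the query, which is the part that matters while the schema is still changing. The order of the concatenated domains is not guaranteed here.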

How to create a new table from multiple spatial searches in SQL Server?

I was wondering if anyone could help. I am always asked by my colleagues to create a postcode lookup table for them and this takes me considerable time using a desktop GIS. I was wondering if it is possible to create a table using SQL Server that would create the Lookup for me?
The Lookup table is based on finding the relevant boundaries for a postcode so a sample table would look like this:
Table A
POSTCODE SCHOOL WARD POLLING DISTRICT
BH15 2RU ST PETERS ELMSDALE PD
All of the information for School, Ward and Polling district are coming from different tables. Each of these tables has a geometry column (as has the postcode table).
I can run a select statement to do a simple join (say postcode to school) and create that table, but I would like to run all of the separate spatial queries in one large query to create the singular table. I have about 20 or so different tables with boundaries that are required for the large lookup table.
I hope that makes sense!
Any help would be greatly appreciated.
First you'll need to create your table, a one-off process:
CREATE TABLE [Example]
(
    Postcode VARCHAR(8) NOT NULL,
    School VARCHAR(50),
    Ward VARCHAR(50)
    -- Repeat for each column (setting appropriate length / type etc)
);
Then you can populate it like this (I suggest making this a stored procedure, indexing Example.Postcode, and ensuring there is an appropriate spatial index on all geometry columns):
MERGE [Example] AS TARGET
USING
(
    SELECT
        P.[Name] AS PostcodeName,
        SCH.[Name] AS SchoolName,
        WRD.[Name] AS WardName
        -- Other variables as required
    FROM
        [Postcodes] P
        JOIN [Schools] SCH ON SCH.[Geometry].STIntersects(P.[Geometry]) = 1
        JOIN [Wards] WRD ON WRD.[Geometry].STIntersects(P.[Geometry]) = 1
        -- OTHER JOINS AS REQUIRED
)
AS SOURCE
ON TARGET.[Postcode] = SOURCE.[PostcodeName]
WHEN MATCHED
THEN
    UPDATE SET
        [School] = SOURCE.[SchoolName],
        [Ward] = SOURCE.[WardName]
        -- Repeat for other variables
WHEN NOT MATCHED BY TARGET
THEN
    INSERT
        ([Postcode], [School], [Ward]) -- etc.
    VALUES
        (SOURCE.[PostcodeName], SOURCE.[SchoolName], SOURCE.[WardName]) -- etc.
WHEN NOT MATCHED BY SOURCE -- Only include this if you want to remove any records where the postcode has been deleted since the last "run"
THEN
    DELETE;
I haven't tested obviously, but it should be 99%+ of the way there.
MERGE, if you haven't used it, provides a wonderful way to handle insert / update / delete in a single statement. I am also assuming that your geometry data is well formed, that the tables all share the same (or no) SRID and specify coordinates in the same datum / projection, and that there are no overlaps in each table to deal with. Also, please note that even with a lot of oomph, this may take some time to run if you have national coverage (1.7m postcodes, 225k OAs, 9k wards etc.). I suggest you add a WHERE P.[Name] LIKE 'BH%' or similar to the source query to test first.

Versioning Lookup Tables

I'm creating some enterprise master lookup tables in SQL Server 2008r2. I'm required to keep a table specific to a domain. For example Gender would be a table and City could be another table. In each table I keep Start and End dates to keep history.
Now I'm being asked to version these so you would know something like V1 had these codes and V2 had these updated codes. Thinking about one lookup table is pretty straightforward. Using Gender for example, you could have a Gender table, a GenderVersion table and a GenderToGenderVersion linking table. This would require 2 extra tables per lookup. With just a few tables that is manageable, but I’m hoping to get an idea for a better way to do this with many lookup tables.
A few requirements.
As stated, lookup tables must be domain specific. I cannot combine them.
Junior level programmers must be able to build a select statement to find all the lookup values based on a specific version.
Any ideas? I hate the idea of having 3x the required tables.
[Edit]
Additional Details
We need to be able to determine exactly what codes were in a particular version.
We need codes that are not edited or removed from a previous version to be part of the next version.
We do maintain start and end dates in the codes. The version labels are in addition to the start and end dates.
Here is a sample model using Race.
RaceVersion table
    RaceVersionID
    VersionLabel
    StartDate
    EndDate
RaceToRaceVersion table
    RaceID
    RaceVersionID
Race table
    RaceID (used just for the master code DB)
    RaceCode (used for all other systems that use these codes)
    RaceDescription
    StartDate
    EndDate
Both the code and the version get date searchable effective dates.
You can find a version and then relate it to the codes.
Whenever there is a change to the codes, in this case Race, you can add a new version then find all the rows that are currently effective and populate the linking table.
This works well for a few sets of codes. I'm hoping for functionality like this that scales better.
You could avoid multiple tables if you added a version column to each lookup table. Each time a new version of the data is entered, increment the version. When selecting data from the lookup, you'll always have to specify the correct version.
id | version | name
---------------------------------
1 | 1 | Lookup Name for V1
1 | 2 | Lookup Name for V2
SELECT name
FROM lookup
WHERE id = 1
AND version = 2
If you want to easily select the latest version, you'll need to create a view:
SELECT id, name
FROM lookup lu
WHERE version = (
SELECT MAX(version)
FROM lookup lu2
WHERE lu.id = lu2.id
)
However, this view would have a performance cost that may outweigh the cost of maintaining more verbose select statements that include the version every time.
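The version-column idea and the latest-version view can be exercised with a short sketch, again using SQLite via Python's sqlite3 module in place of SQL Server 2008 R2. The view name lookup_latest and the second code (id 2) are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lookup (id INTEGER, version INTEGER, name TEXT,
                     PRIMARY KEY (id, version));
INSERT INTO lookup VALUES
  (1, 1, 'Lookup Name for V1'),
  (1, 2, 'Lookup Name for V2'),
  (2, 1, 'Other code');               -- hypothetical code that never changed
-- Hypothetical view name: the highest version stored for each id.
CREATE VIEW lookup_latest AS
SELECT id, name
FROM lookup lu
WHERE version = (SELECT MAX(version) FROM lookup lu2 WHERE lu.id = lu2.id);
""")

v2 = conn.execute(
    "SELECT name FROM lookup WHERE id = ? AND version = ?", (1, 2)
).fetchall()
latest = conn.execute("SELECT id, name FROM lookup_latest ORDER BY id").fetchall()
print(v2)
print(latest)
```

Note that id 2 has no version 2 row in this sample, so a plain WHERE version = 2 filter would not return it. If unchanged codes must appear in every later version, they either need a row per version or a query that takes the highest version at or below the one requested.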
The first thing that popped into my mind is a versionId field in all these tables and a version table describing what each versionId means. Same number of tables, more records in each.
Edit starts here
The second thing that popped into my head was to also add an ApplicationId and table. That way not all applications have to be on the same version.

Query - If more than one record ID# (non primary key) matches, use later date or larger primary key

I built an MS Access database that takes survey data and creates a custom report. The survey application that was used does not give us the reports we need, so I usually grab the data (Excel), import it into Access, and build the reports the way we need them.
For the first time, we have people redoing the survey because they are updating something or forgot to add something. I need to be able to grab the most recent survey's data so we don't get duplicates when we run the report. (My main report is composed of several subreports. Some subreports will not be visible if null, and any unanswered questions are hidden and shrunk to prevent bulky reports with unnecessary whitespace.)
record ID (PK) | FName | LName | IDNum | Completed
1 | Bob | Smith | 57 | 3/31/2013 5:00pm
2 | Bob | Smith | 57 | 3/31/2013 7:00pm
I want record ID 2 or the one that was completed at 7pm.
The queries and reports are already completed, so I have been trying to find a way to add a line of code in the criteria line of my query to grab the most recent record if the IDNum matches more than one record.
I have been trying to find the best way to do it for the past several hours. I don't think modifying my table into a 'table without duplicates' is an option, because after the database is complete, someone less technical will be using it. All they are going to do is import a new Excel file to overwrite the table, and the queries do everything to build the report. I don't want to manually delete the duplicate records either.
I know I need to do something along the lines with
IIF(count(IDNum)>1, *something, *something)
I get stuck on the true and false parts. How do I tell Access that it needs to check within the table again to find the record with the larger primary key?
I thought this was going to be easy, but I guess I was wrong. lol
I am fairly new to MS Access, so I know I am not using its full potential and I might be approaching this from the wrong angle. Any advice would be greatly appreciated.
I'm a student going into Info Systems, so I would really like to learn how to do this.
I believe the query you are looking for is
SELECT t1.*
FROM YourTable t1 INNER JOIN
(SELECT IDNum, MAX(Completed) AS MaxOfCompleted
FROM YourTable GROUP BY IDNum
) t2
ON t1.IDNum = t2.IDNum AND t1.Completed = t2.MaxOfCompleted;
When you are using an if function it should be iif not iff.
I'd recommend a correlated subquery, such as the following:
SELECT
Data.RecordID
, Data.FName
, Data.LName
, Data.IDNum
, Data.Completed
FROM
Data
WHERE
Data.Completed IN
(
SELECT TOP 1
DataSQ.Completed
FROM
Data as DataSQ
WHERE
DataSQ.IDNum = Data.IDNum
GROUP BY
DataSQ.Completed
ORDER BY
DataSQ.Completed DESC
)
GROUP BY
Data.RecordID
, Data.FName
, Data.LName
, Data.IDNum
, Data.Completed
;
Explanation
Instead of using a function such as Max or IIF, you can embed another SELECT query within the WHERE clause of your main query. The nested query is used to determine the most recent Completed date for every IDNum. Unlike selecting the most recent survey directly from your table with SELECT TOP 1 + ORDER BY, which would only return one record, the WHERE clause in your nested query refers back to the main query and produces a result for each IDNum. This is known as the Top N per Group pattern, and I've found it to be very useful. Note that in the nested query you will need to use a table name alias so that Access will be able to differentiate between the two queries.
Also, I'd generally recommend against trying to use a table PK to perform sorts. There are many cases when the PK order value will not be a good indicator of the values of related fields.
This code worked when tested on dummy data - best of luck!
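Both suggested queries, the join against MAX(Completed) and the correlated subquery, can be compared side by side in a small sketch. It runs on SQLite via Python's sqlite3 module rather than Access, so LIMIT 1 replaces TOP 1, the dates are stored as ISO-formatted text so that they sort correctly, and the third row (Amy Jones) is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Data (RecordID INTEGER PRIMARY KEY, FName TEXT, LName TEXT,
                   IDNum INTEGER, Completed TEXT);
INSERT INTO Data VALUES
  (1, 'Bob', 'Smith', 57, '2013-03-31 17:00'),
  (2, 'Bob', 'Smith', 57, '2013-03-31 19:00'),
  (3, 'Amy', 'Jones', 58, '2013-04-01 09:00');
""")

# Approach 1: join each row to the latest Completed value for its IDNum.
join_sql = """
SELECT t1.RecordID
FROM Data t1
JOIN (SELECT IDNum, MAX(Completed) AS MaxOfCompleted
      FROM Data
      GROUP BY IDNum) t2
  ON t1.IDNum = t2.IDNum AND t1.Completed = t2.MaxOfCompleted
ORDER BY t1.RecordID
"""

# Approach 2: correlated subquery that picks the top Completed value per IDNum.
subquery_sql = """
SELECT d.RecordID
FROM Data d
WHERE d.Completed = (SELECT sq.Completed
                     FROM Data sq
                     WHERE sq.IDNum = d.IDNum
                     ORDER BY sq.Completed DESC
                     LIMIT 1)
ORDER BY d.RecordID
"""

via_join = [r[0] for r in conn.execute(join_sql)]
via_subquery = [r[0] for r in conn.execute(subquery_sql)]
print(via_join, via_subquery)
```

Both approaches keep every row that ties for the latest Completed value within an IDNum, so two surveys submitted at exactly the same time would both come through.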