Change Data Capture Using Spark SQL

I have a few tables which are related as A -> Left Join -> B -> Left Join -> C. Let's call A the driving table and B and C the "supporting" tables. Each of these tables has a last-update date column (last_updt_dt or last_updt_date). My requirement is to identify the records that changed since the last processing date (available as a parameter), not only when the driving table changed but also when any column in the supporting table(s) changed.
Table A
------
empid|salary|last_updt_dt
123|20000|05/14/2019
Table B
-------
empid|fname|lname|last_updt_date
123|John|Taylor|05/16/2019
Table C
-------
empid|address|last_updt_dt
123|Maryland|05/17/2019
Assume last_exec_date = 05/10/2019
So, if the job runs on Day 1 (05/20/2019), the output should be:
empid|fname|lname|salary|address|last_exec_date
-----------------------------------------------
123|John|Taylor|20000|Maryland|05/20/2019
Now, let's assume that on Day 2 (05/21/2019), the address got changed from Maryland to California. So, on Day 2, the output table should look like:
empid|fname|lname|salary|address|last_exec_date
-----------------------------------------------
123|John|Taylor|20000|Maryland|05/20/2019
123|John|Taylor|20000|California|05/21/2019
561|Peter|Anderson|50000|Missouri|05/21/2019
The point to note is that on Day 2 a change in any "supporting table" (the 'address' column of Table C in this case) triggered the insertion of another record for an employee that had already been processed the day before, now with the updated value in the address column. Also note that on Day 2 any other qualifying records (if any), e.g. empid=561, are inserted as regular inserts.
SELECT
A.empid, B.fname, B.lname, A.salary, C.address, current_date() as last_exec_date
from A
left outer join B
on A.empid = B.empid
left outer join C
on B.empid = C.empid
where to_date(A.last_updt_dt, 'yyyyMMdd') > {last_exec_date}
OR to_date(B.last_updt_date, 'yyyyMMdd') > {last_exec_date}
OR to_date(C.last_updt_dt, 'yyyyMMdd') > {last_exec_date}
My challenge is how to trigger and propagate any changes from any of the participating supporting tables, even when that change pertains to a record which had been processed and inserted to the target table earlier, so that a new record with the updated value shows in the target table.
In other words, how can I trigger a new output record when the change comes from one of the supporting (non-driver) tables?
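One way to picture the requirement is a filter that compares every table's last-update column against the last execution date, so a change in any of the three tables re-selects the joined row. The following is a minimal sketch using Python's built-in sqlite3 rather than Spark, assuming the dates are stored as ISO strings (yyyy-MM-dd) so that they compare correctly as text. The helper name changed_since is hypothetical, and C is joined on A.empid here for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (empid INTEGER, salary INTEGER, last_updt_dt TEXT);
    CREATE TABLE B (empid INTEGER, fname TEXT, lname TEXT, last_updt_date TEXT);
    CREATE TABLE C (empid INTEGER, address TEXT, last_updt_dt TEXT);
    INSERT INTO A VALUES (123, 20000, '2019-05-14');
    INSERT INTO B VALUES (123, 'John', 'Taylor', '2019-05-16');
    -- Day 2 state: only the supporting table C changed
    INSERT INTO C VALUES (123, 'California', '2019-05-21');
""")

# A row qualifies when ANY of the three tables changed after the last run.
CHANGED_SINCE = """
    SELECT A.empid, B.fname, B.lname, A.salary, C.address
    FROM A
    LEFT JOIN B ON A.empid = B.empid
    LEFT JOIN C ON A.empid = C.empid
    WHERE A.last_updt_dt   > :last_exec_date
       OR B.last_updt_date > :last_exec_date
       OR C.last_updt_dt   > :last_exec_date
"""

def changed_since(last_exec_date):
    """Hypothetical helper: rows to append to the target table for this run."""
    return conn.execute(CHANGED_SINCE, {"last_exec_date": last_exec_date}).fetchall()

day2_rows = changed_since("2019-05-20")  # C changed after this date
day3_rows = changed_since("2019-05-21")  # nothing changed since
print(day2_rows)
print(day3_rows)
```

With the Day 2 data, the row for empid 123 is selected again even though tables A and B are unchanged, which is the behaviour described above. A NULL date from an unmatched left join simply fails its own comparison and does not suppress the row.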

Related

SQL Query based on specific conditions

Figure 1 denotes the current state of the TABLE A and TABLE B.
The current implementation is to fetch the new MIds to Table B and copy the SqlQuery from base process, in case of a new market or an existing market. Below query is used for this:
SELECT A.MId, B1.Loop, B1.Segment, B1.SqlQuery, B1.UseDefault
FROM TableB B1 WITH (NOLOCK)
INNER JOIN TableA A WITH (NOLOCK) ON B1.MId IN (100, 200)
AND B1.MId = A.BaseMarket
AND ISNULL(A.POCId, 0) > 0
LEFT JOIN TableB B2 WITH (NOLOCK) ON A.MId = B2.MId
WHERE B2.MId IS NULL
Figure 2 shows the updated data in Table A and the desired state of Table B. The required implementation would be:
To fetch the new MIds to Table B and copy the SqlQuery from Base Process, if it's a new market (XYZ Market - 2001, 2002)
If the market configuration already exists in Table B (Market ABC - 1001 and 1002), then copy the existing configuration's SqlQuery.
Here's the complete flow for Table A and B. The base configurations (100 and 200) in both tables were inserted manually initially including the loop and segments.
A new market is introduced and a new MId is created in Table A. Let's assume that to be 1001 and 1002 for Market ABC.
Corresponding records are inserted in Table B for each MId and it copies data from Base Configuration in Table B. Inserted Records (SqlId - 3 and 4)
SqlQuery column in Table B is updated manually due to a specific business request. (SqlId - 3 and 4). Hence, the different query.
Market ABC is updated in front end, which creates two new entries in Table A. (MId - 1003 and 1004). Also, new market XYZ (MId - 2001 and 2002) is created.
Corresponding entries created in Table B should refer to the Base Configuration for Market XYZ (SqlId - 7 and 8), since it's a new market, but should copy the existing configuration for Market ABC (MId - 1001 and 1002), since its configuration already existed.
I am looking for suggestions on whether a single query can implement this requirement using a CASE statement. I'll appreciate your help!
I guess that by "market configuration already exists" you actually mean the combination of MarketName and Type. If so, here's the query:
SELECT
A.NewId, B.Loop, B.Segment, B.SqlQuery, B.UseDefault
FROM (
SELECT
A1.MId AS NewId, A2.MId AS RefId
FROM
TableA A1
INNER JOIN
TableA A2
ON
(A1.MarketName = A2.MarketName AND A1.Type = A2.Type) -- use your market configuration logic here
OR
A1.BaseMarket = A2.BaseMarket
WHERE
A1.Mid NOT IN (SELECT MId FROM TableB)
) As A
INNER JOIN
TableB B
ON (A.RefId = B.MID)
First we self-join TableA to get the reference MId (RefId here). Then we join that derived table with TableB.
Hope this helps. Thank you!

Delete records from 2 tables with 1 query

I need to delete records from 2 different tables that are linked via job number.
One table contains completion dates, and I need to delete all records that were completed between 1999 and 2001. In the second table I need to delete all phases of the job where the job number from table 1 matches the job number in table 2.
I've researched a bit and came up with something like this, but when running it I receive "Too Many Fields Selected":
DELETE a.,b.
FROM PUB_jc_job a
LEFT JOIN PUB_jc_phase b
ON b.jph_job = a.job_num
WHERE PUB_jc_job.job_compdate BETWEEN #3/31/1999# AND #12/31/2001#
Please remove the dot (.) after the table aliases ("a., b.") and retry. Since the table is aliased, also refer to it by its alias in the WHERE clause:
DELETE a, b FROM PUB_jc_job a INNER JOIN PUB_jc_phase b ON b.jph_job = a.job_num WHERE a.job_compdate BETWEEN #3/31/1999# AND #12/31/2001#

SQL Server Inner Join with Timestamps: is each record only assigned once?

I am working with timestamped records and need to do an inner join based on the timestamp difference. I have been using the DATEDIFF function and it seems to be working well. However, the amount of time between timestamps varies. To clarify, sometimes the record appears in table 2 within the same second as table 1, and sometimes the record in table 2 is up to 15 seconds behind the record in table 1. The records in table 1 are always timestamped before table 2. There is no other common field with which I can join, however there is a register number in each table that I am using to increase accuracy by ensuring that the registers are the same.
My question is: if I increase the timestamp difference to do the inner join (e.g. where the DATEDIFF = 1 or 2 or 3... or 15) will records only be joined once? Or would my table contain duplicate records from table 1 (e.g. where record 1 is joined to record 4 in table 2 where the diff is 4 seconds, and is also joined with record 7 from table 2 where the diff is 11 seconds)?
The reason my statement works now is that no registers have records with less than 6 seconds in between, so even if there are multiple timestamps that would match, the matching of registers eliminates this problem.
My Statement is currently working as:
SELECT *
INTO AtriumSequoiaJoin5
FROM Atrium INNER JOIN Sequoia ON Atrium.Reader = Sequoia.theader_pos_name
WHERE (
((DateDiff(s,[Atrium].[Date2],[Sequoia].[theader_tdatetime]))=0
Or (DateDiff(s,[Atrium].[Date2],[Sequoia].[theader_tdatetime]))=1
Or (DateDiff(s,[Atrium].[Date2],[Sequoia].[theader_tdatetime]))=2
Or (DateDiff(s,[Atrium].[Date2],[Sequoia].[theader_tdatetime]))=3
Or (DateDiff(s,[Atrium].[Date2],[Sequoia].[theader_tdatetime]))=4
Or (Datediff(s,[Atrium].[Date2],[Sequoia].[theader_tdatetime]))=5)
)
ORDER BY Sequoia.theader_id;
You could CROSS APPLY to the closest record in proximity. That is by no means ideal, however: what if multiple records are written at the same time? A better option may be to give the first table an identity column and then update the second table with SCOPE_IDENTITY().
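SELECT *
INTO AtriumSequoiaJoin5
FROM Atrium CROSS APPLY
(SELECT TOP 1 * FROM Sequoia WHERE
Atrium.Reader = Sequoia.theader_pos_name
-- table 1 is always stamped first, so only look at or after its timestamp
AND Sequoia.theader_tdatetime >= Atrium.Date2
-- the earliest remaining row is the closest one
ORDER BY Sequoia.theader_tdatetime) DQ
-- the applied rows are exposed under the alias DQ, not Sequoia
ORDER BY DQ.theader_id;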
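To make the duplication question concrete, here is a minimal sketch using Python's built-in sqlite3 with made-up sample rows (one Atrium record, and two Sequoia records 4 and 11 seconds later on the same register). SQLite has no CROSS APPLY, so a correlated subquery stands in for the "closest record" idea. The table and column names follow the question, and everything else is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Atrium (Reader TEXT, Date2 TEXT);
    CREATE TABLE Sequoia (theader_id INTEGER, theader_pos_name TEXT,
                          theader_tdatetime TEXT);
    INSERT INTO Atrium  VALUES ('R1', '2020-01-01 12:00:00');
    INSERT INTO Sequoia VALUES (4, 'R1', '2020-01-01 12:00:04');
    INSERT INTO Sequoia VALUES (7, 'R1', '2020-01-01 12:00:11');
""")

# Plain inner join on "difference in seconds between 0 and N".
WINDOW_JOIN = """
    SELECT a.Date2, s.theader_id
    FROM Atrium a
    JOIN Sequoia s ON a.Reader = s.theader_pos_name
    WHERE strftime('%s', s.theader_tdatetime) - strftime('%s', a.Date2)
          BETWEEN 0 AND ?
    ORDER BY s.theader_id
"""
narrow = conn.execute(WINDOW_JOIN, (5,)).fetchall()   # 5-second window
wide = conn.execute(WINDOW_JOIN, (15,)).fetchall()    # 15-second window

# "Closest record" variant: at most one Sequoia row per Atrium row.
NEAREST = """
    SELECT a.Date2,
           (SELECT s.theader_id
            FROM Sequoia s
            WHERE s.theader_pos_name = a.Reader
              AND s.theader_tdatetime >= a.Date2
            ORDER BY s.theader_tdatetime
            LIMIT 1) AS theader_id
    FROM Atrium a
"""
nearest = conn.execute(NEAREST).fetchall()

print(len(narrow), len(wide), len(nearest))
```

In this sample the 15-second window joins the single Atrium record to both Sequoia records, so an inner join does not by itself guarantee that each record is assigned only once. The nearest-match variant returns one row per Atrium record, but it can still hand the same Sequoia row to two different Atrium records.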

SQL Query - Ensure a row exists for each value in ()

I am currently struggling to find an efficient way to validate 2 tables (Table A has lots of rows).
I have two tables
Table A
ID
A
B
C
Table matched
ID Number
A 1
A 2
A 9
B 1
B 9
C 2
I am trying to write a SQL Server query that checks that, for every value in Table A, there exists a row for each of a variable set of values (1, 2, 9).
The example above is incorrect because, for every record in A, Table matched should have a corresponding record for each value (1, 2, 9). The end goal is:
Table matched
ID Number
A 1
A 2
A 9
B 1
B 2
B 9
C 1
C 2
C 9
I know it's confusing, but in general for every X in (some set) there should be a corresponding record in Table matched. I have obviously simplified things.
Please let me know if you all need clarification.
Use:
SELECT a.id
FROM TABLE_A a
JOIN TABLE_B b ON b.id = a.id
WHERE b.number IN (1, 2, 9)
GROUP BY a.id
HAVING COUNT(DISTINCT b.number) = 3
The DISTINCT in the COUNT prevents duplicates (i.e. A having two records in TABLE_B with the value "2") from being falsely counted as a complete set. It can be omitted if the combination of id and number is guaranteed unique by a unique or primary key constraint.
The HAVING COUNT(...) must equal the number of values provided in the IN clause.
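As a quick check of the query above, here is a minimal sketch that runs it against the question's sample data using Python's built-in sqlite3. The second query, which lists the incomplete ids through a LEFT JOIN, is an addition for illustration and not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE_A (id TEXT);
    CREATE TABLE TABLE_B (id TEXT, number INTEGER);
    INSERT INTO TABLE_A VALUES ('A'), ('B'), ('C');
    INSERT INTO TABLE_B VALUES ('A', 1), ('A', 2), ('A', 9),
                               ('B', 1), ('B', 9),
                               ('C', 2);
""")

# ids that have a row for every value in the set (1, 2, 9)
complete = [row[0] for row in conn.execute("""
    SELECT a.id
    FROM TABLE_A a
    JOIN TABLE_B b ON b.id = a.id
    WHERE b.number IN (1, 2, 9)
    GROUP BY a.id
    HAVING COUNT(DISTINCT b.number) = 3
""")]

# ids missing at least one value; LEFT JOIN keeps ids with no rows at all
incomplete = [row[0] for row in conn.execute("""
    SELECT a.id
    FROM TABLE_A a
    LEFT JOIN TABLE_B b ON b.id = a.id AND b.number IN (1, 2, 9)
    GROUP BY a.id
    HAVING COUNT(DISTINCT b.number) < 3
    ORDER BY a.id
""")]

print(complete, incomplete)
```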
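Create a temp table of the values you want. You can do this dynamically if the values 1, 2 and 9 are in some table you can query from.
Then select the combinations built from the temp table that have no match in TableMatched, along the lines of: SELECT ... FROM tempTable WHERE ... NOT IN (SELECT ... FROM TableMatched)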
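The temp-table idea can be sketched as follows with Python's built-in sqlite3: cross join the ids with the wanted values and keep the pairs that have no row in TableMatched. The temp table name Wanted is hypothetical, and NOT EXISTS is used here in place of NOT IN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (ID TEXT);
    CREATE TABLE TableMatched (ID TEXT, Number INTEGER);
    CREATE TEMP TABLE Wanted (Number INTEGER);  -- hypothetical name
    INSERT INTO TableA VALUES ('A'), ('B'), ('C');
    INSERT INTO TableMatched VALUES ('A', 1), ('A', 2), ('A', 9),
                                    ('B', 1), ('B', 9),
                                    ('C', 2);
    INSERT INTO Wanted VALUES (1), (2), (9);
""")

# Every (ID, Number) pair that should exist but does not.
missing = conn.execute("""
    SELECT a.ID, w.Number
    FROM TableA a
    CROSS JOIN Wanted w
    WHERE NOT EXISTS (SELECT 1
                      FROM TableMatched m
                      WHERE m.ID = a.ID AND m.Number = w.Number)
    ORDER BY a.ID, w.Number
""").fetchall()
print(missing)

# Optionally fill the gaps to reach the desired end state.
conn.executemany("INSERT INTO TableMatched VALUES (?, ?)", missing)
total = conn.execute("SELECT COUNT(*) FROM TableMatched").fetchone()[0]
```

After the insert, TableMatched holds one row for each of the three ids and each of the three values, matching the end goal shown in the question.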
I had this situation one time. My solution was as follows.
In addition to TableA and TableMatched, there was a table that defined the rows that should exist in TableMatched for each row in TableA. Let’s call it TableMatchedDomain.
The application then accessed TableMatched through a view that controlled the returned rows, like this:
create view TableMatchedView as
select a.ID,
d.Number,
m.OtherValues
from TableA a
-- every row of TableA is paired with every Number in the domain table
cross join TableMatchedDomain d
left join TableMatched m on m.ID = a.ID and m.Number = d.Number
This way, the rows returned were always correct. If there were missing rows from TableMatched, then the Numbers were still returned but with OtherValues as null. If there were extra values in TableMatched, then they were not returned at all, as though they didn't exist. By changing the rows in TableMatchedDomain, this behavior could be controlled very easily. If a value were removed from TableMatchedDomain, then it would disappear from the view. If it were added back again in the future, then the corresponding OtherValues would appear again as they were before.
The reason I designed it this way was that I felt that establishing an invariant on the row configuration in TableMatched was too brittle and, even worse, introduced redundancy. So I removed the restriction from groups of rows (in TableMatched) and instead made the entire contents of another table (TableMatchedDomain) define the correct form of the data.
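The behaviour described for the view can be sketched with Python's built-in sqlite3. The sample rows are made up for illustration: TableMatched lacks some required rows and contains one extra row (Number 7) that is not in the domain table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (ID TEXT);
    CREATE TABLE TableMatchedDomain (Number INTEGER);
    CREATE TABLE TableMatched (ID TEXT, Number INTEGER, OtherValues TEXT);
    INSERT INTO TableA VALUES ('A'), ('B');
    INSERT INTO TableMatchedDomain VALUES (1), (2);
    INSERT INTO TableMatched VALUES ('A', 1, 'x'),
                                    ('A', 7, 'extra'),  -- not in the domain
                                    ('B', 2, 'y');

    CREATE VIEW TableMatchedView AS
    SELECT a.ID, d.Number, m.OtherValues
    FROM TableA a
    CROSS JOIN TableMatchedDomain d
    LEFT JOIN TableMatched m ON m.ID = a.ID AND m.Number = d.Number;
""")

view_rows = conn.execute(
    "SELECT ID, Number, OtherValues FROM TableMatchedView ORDER BY ID, Number"
).fetchall()
for row in view_rows:
    print(row)
```

Missing combinations come back with OtherValues as NULL (None in Python), and the extra row with Number 7 never appears in the view.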

How can I make multiple records to be printed as a single row

alt text http://img59.imageshack.us/img59/962/62737835.jpg
These three columns are taken from 3 tables. In other words, these records are retrieved by joining 3 tables.
It is basically a very simple time sheet that keeps track of shift starts time, lunch time and so on.
I want these four records to show in one row, for example:
setDate --- ShiftStarted --- LunchStarted --- LunchEnded ---- ShiftEnded ----- TimeEntered
Note: discard the TimeEntered column. I will deal with it later; once I know how to solve the above issue, it will be easy for me to handle the rest.
How can I do it?
Further Info - Here is my query:
SELECT TimeSheet.setDate, TimeSheetType.tsTypeTitle
FROM TimeSheet
INNER JOIN TimeSheetDetail ON TimeSheet.timeSheetID = TimeSheetDetail.timeSheetID
INNER JOIN TimeSheetType ON TimeSheetType.timeSheetTypeID = TimeSheetDetail.timeSheetTypeID
TimeSheet table consists of the following columns:
timeSheetID
employeeID - FK
setDate
setDate represents today's date.
TimeSheetType table consists of the following columns:
timeSheetTypeID
tsTypeTitle
tsTypeTitle represents shifts e.g. shift starts at, lunch starts at, shift ends at, etc.
TimeSheetDetail table consists of the following columns:
timeSheetDetailID
timeSheetID - FK
timeSheetTypeID - FK
timeEntered
addedOn
timeEntered represents the time that the employee set manually.
addedOn represents the system time, the time that a record was inserted.
I must admit I haven't fully read everything, but I think you can work out the rest for yourself. Basically, you can join the timesheet table with itself.
I did this ...
create table timesheet (timesheet number, setdate timestamp, timesheettype varchar2(200), timeentered timestamp);
insert into timesheet values (1,to_date('2010-08-02','YYYY-MM-DD'),'Shift Started',current_timestamp);
insert into timesheet values (1,to_date('2010-08-02','YYYY-MM-DD'),'Lunch Started',current_timestamp);
insert into timesheet values (1,to_date('2010-08-02','YYYY-MM-DD'),'Lunch Ended',current_timestamp);
insert into timesheet values (1,to_date('2010-08-02','YYYY-MM-DD'),'Shift Ended',current_timestamp);
commit;
select * from timesheet t1
left join timesheet t2 on (t1.timesheet = t2.timesheet)
where t1.timesheettype = 'Shift Started'
and t2.timesheettype = 'Lunch Started'
... and got out this
TIMESHEET SETDATE TIMESHEETTYPE TIMEENTERED TIMESHEET_1 SETDATE_1 TIMESHEETTYPE_1 TIMEENTERED_1
1 02.08.2010 00:00:00.000000 Shift Started 05.08.2010 12:35:56.264075 1 02.08.2010 00:00:00.000000 Lunch Started 05.08.2010 12:35:56.287357
It was not SQL Server but in principle it should work for you too.
Let me know if you still have a question
You might want to check out the PIVOT operator. It basically allows you to use particular row values to create new columns in your result set.
You'll have to supply an aggregate function for combining multiple rows - for instance (assuming you split your data on a per day basis), you'll have to decide how to deal with multiple "shift started" events on the same day. Assuming that such events never occur, you'll still have to use an aggregate. MAX() is usually a safe choice in those circumstances.
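The same row-to-column idea can be written without PIVOT as conditional aggregation, where MAX() is wrapped around a CASE expression per target column. Below is a minimal sketch using Python's built-in sqlite3 (which has no PIVOT operator). The table and column names follow the question, while the tsTypeTitle values and times are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TimeSheet (timeSheetID INTEGER, setDate TEXT);
    CREATE TABLE TimeSheetType (timeSheetTypeID INTEGER, tsTypeTitle TEXT);
    CREATE TABLE TimeSheetDetail (timeSheetID INTEGER, timeSheetTypeID INTEGER,
                                  timeEntered TEXT);
    INSERT INTO TimeSheet VALUES (1, '2010-08-02');
    INSERT INTO TimeSheetType VALUES (1, 'ShiftStarted'), (2, 'LunchStarted'),
                                     (3, 'LunchEnded'),   (4, 'ShiftEnded');
    INSERT INTO TimeSheetDetail VALUES (1, 1, '09:00'), (1, 2, '12:00'),
                                       (1, 3, '12:30'), (1, 4, '17:00');
""")

# One output row per time sheet; each CASE picks out a single event type and
# MAX() collapses the four detail rows into that one row.
row = conn.execute("""
    SELECT ts.setDate,
           MAX(CASE WHEN tt.tsTypeTitle = 'ShiftStarted' THEN d.timeEntered END) AS ShiftStarted,
           MAX(CASE WHEN tt.tsTypeTitle = 'LunchStarted' THEN d.timeEntered END) AS LunchStarted,
           MAX(CASE WHEN tt.tsTypeTitle = 'LunchEnded'   THEN d.timeEntered END) AS LunchEnded,
           MAX(CASE WHEN tt.tsTypeTitle = 'ShiftEnded'   THEN d.timeEntered END) AS ShiftEnded
    FROM TimeSheet ts
    JOIN TimeSheetDetail d ON ts.timeSheetID = d.timeSheetID
    JOIN TimeSheetType tt ON tt.timeSheetTypeID = d.timeSheetTypeID
    GROUP BY ts.timeSheetID, ts.setDate
""").fetchone()
print(row)
```

If the same event type could occur more than once per time sheet, MAX() would keep only the latest entry, which is the aggregation decision mentioned above.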