Combining multiple sources of data into a unified table

My company is working with 3 partners and each partner can have multiple brands. Each week, I get a dump of each brand's user list,
which I store in a MySQL database with a table for each brand. Each brand table contains a list of users and some basic information
(birthyear, zip code, gender). Some users can be signed up with different brands and each brand can have its own set of data on a user.
For example, a user is signed up with Canvas and MNM. At Canvas, their profile looks like this:
ID                               GENDER BIRTHYEAR POSTCODE MODIFIED
94bafdb3e155d30349f1113a25c0714f M      1973      2800     2009-01-01 09:01:01
and at MNM, like this (GENDER is empty):
ID                               GENDER BIRTHYEAR POSTCODE MODIFIED
94bafdb3e155d30349f1113a25c0714f        1973      1000     2009-09-09 09:01:01
I'd like to create a view (or table - I'm not sure which is best) that would combine the two records using the most recent version of the data, but also letting me know where the data came from.
So the above two records would combine to:
ID                               GENDER G_DATE              G_BRAND BIRTHYEAR B_DATE              B_BRAND POSTCODE P_DATE              P_BRAND
94bafdb3e155d30349f1113a25c0714f M      2009-01-01 09:01:01 Canvas  1973      2009-09-09 09:01:01 MNM     1000     2009-09-09 09:01:01 MNM
I'm imagining some convoluted series of unions and sub queries, but I'm not even really sure where to begin.
I've created a view that merges all of the tables:
CREATE VIEW view_combine AS
SELECT ID, GENDER, MODIFIED as G_DATE, 'Canvas' as G_BRAND,
       BIRTHYEAR, MODIFIED as B_DATE, 'Canvas' as B_BRAND,
       POSTCODE, MODIFIED as P_DATE, 'Canvas' as P_BRAND FROM canvas
UNION ALL
SELECT ID, GENDER, MODIFIED as G_DATE, 'Een' as G_BRAND,
       BIRTHYEAR, MODIFIED as B_DATE, 'Een' as B_BRAND,
       POSTCODE, MODIFIED as P_DATE, 'Een' as P_BRAND FROM een
UNION ALL
SELECT ID, GENDER, MODIFIED as G_DATE, 'MNM' as G_BRAND,
       BIRTHYEAR, MODIFIED as B_DATE, 'MNM' as B_BRAND,
       POSTCODE, MODIFIED as P_DATE, 'MNM' as P_BRAND FROM mnm
and then I'm trying to perform selections on that, but I don't think it's the right direction.
SELECT v1.id, ge.gender, ge.g_date, ge.g_brand,
       bi.birthyear, bi.b_date, bi.b_brand,
       pc.postcode, pc.p_date, pc.p_brand
FROM view_combine v1
JOIN (
    select g.id, g.gender, g.g_date, g.g_brand
    from view_combine g
    left join view_combine g1 ON g.id = g1.id AND g.g_date < g1.g_date
    WHERE g1.id IS NULL
) ge ON ge.id = v1.id
JOIN (
    select b.id, b.birthyear, b.b_date, b.b_brand
    from view_combine b
    left join view_combine b1 ON b.id = b1.id AND b.b_date < b1.b_date
    WHERE b1.id IS NULL
) bi ON bi.id = v1.id
JOIN (
    select p.id, p.postcode, p.p_date, p.p_brand
    from view_combine p
    left join view_combine p1 ON p.id = p1.id AND p.p_date < p1.p_date
    WHERE p1.id IS NULL
) pc ON pc.id = v1.id
GROUP BY v1.id

I've managed to solve this. Essentially, I needed to select on the view and then sub-select on the view to get the fields I wanted. I found that ordering on the date within the sub-select returned the values I needed.
SELECT v1.id, ge.gender, ge.g_date, ge.g_brand,
       bi.birthyear, bi.b_date, bi.b_brand,
       pc.postcode, pc.p_date, pc.p_brand
FROM view_combine v1
JOIN (
    select g.id, g.gender, g.g_date, g.g_brand
    from view_combine g
    left join view_combine g1 ON g.id = g1.id AND g.g_date < g1.g_date and g1.gender is not null
    WHERE g1.id IS NULL
    order by g.g_date
) ge ON ge.id = v1.id
JOIN (
    select b.id, b.birthyear, b.b_date, b.b_brand
    from view_combine b
    left join view_combine b1 ON b.id = b1.id AND b.b_date < b1.b_date and b1.birthyear is not null
    WHERE b1.id IS NULL
    order by b.b_date
) bi ON bi.id = v1.id
JOIN (
    select p.id, p.postcode, p.p_date, p.p_brand
    from view_combine p
    left join view_combine p1 ON p.id = p1.id AND p.p_date < p1.p_date and p1.postcode is not null
    WHERE p1.id IS NULL
    order by p.p_date
) pc ON pc.id = v1.id
GROUP BY v1.id
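
For anyone who wants to try the idea end to end, here is a minimal sketch run through Python's sqlite3 module rather than MySQL (that swap is my assumption, as is the helper view name all_brands, which carries a single brand column instead of three). It is not the query above: for each column it joins the row holding the newest non-NULL value, and filters NULLs on both sides so an empty field in a newer record can never win. It only loads the two sample records from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE canvas (id TEXT, gender TEXT, birthyear INTEGER, postcode TEXT, modified TEXT);
CREATE TABLE mnm    (id TEXT, gender TEXT, birthyear INTEGER, postcode TEXT, modified TEXT);
INSERT INTO canvas VALUES ('94bafdb3e155d30349f1113a25c0714f', 'M', 1973, '2800', '2009-01-01 09:01:01');
INSERT INTO mnm    VALUES ('94bafdb3e155d30349f1113a25c0714f', NULL, 1973, '1000', '2009-09-09 09:01:01');

-- Hypothetical helper view: every brand row, tagged with the brand it came from.
CREATE VIEW all_brands AS
SELECT id, gender, birthyear, postcode, modified, 'Canvas' AS brand FROM canvas
UNION ALL
SELECT id, gender, birthyear, postcode, modified, 'MNM' AS brand FROM mnm;
""")

# One join per column: pick the row with the newest non-NULL value for that user.
query = """
SELECT u.id,
       g.gender,    g.modified AS g_date, g.brand AS g_brand,
       b.birthyear, b.modified AS b_date, b.brand AS b_brand,
       p.postcode,  p.modified AS p_date, p.brand AS p_brand
FROM (SELECT DISTINCT id FROM all_brands) AS u
LEFT JOIN all_brands AS g
       ON g.id = u.id AND g.gender IS NOT NULL
      AND g.modified = (SELECT MAX(x.modified) FROM all_brands AS x
                        WHERE x.id = u.id AND x.gender IS NOT NULL)
LEFT JOIN all_brands AS b
       ON b.id = u.id AND b.birthyear IS NOT NULL
      AND b.modified = (SELECT MAX(x.modified) FROM all_brands AS x
                        WHERE x.id = u.id AND x.birthyear IS NOT NULL)
LEFT JOIN all_brands AS p
       ON p.id = u.id AND p.postcode IS NOT NULL
      AND p.modified = (SELECT MAX(x.modified) FROM all_brands AS x
                        WHERE x.id = u.id AND x.postcode IS NOT NULL)
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

If two brands report the same field with exactly the same MODIFIED timestamp, this sketch returns more than one row for that user, so a tie-breaker (for example a brand priority) would still be needed.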

I realize you've already solved this, but as a secondary viewpoint, this is something that I would pre-process.
Given the data:
Partner 1 - UserA, Male, Null, 6300, 9/9/09
Partner 2 - UserA, Null, 1980, 2300, 9/10/09
When querying for UserA, you would most likely want a "Most Current Record":
UserA, Male, 1980, 2300
Using the following tables:
Partner
    TypeCode
    DisplayName
CurrentUser
    UserId
    Gender
    GenderSourcePartner
    BirthYear
    BirthYearSourcePartner
    PostalCode
    PostalCodeSourcePartner
PartnerSourceData
    PartnerTypeCode
    UserId
    Gender
    BirthYear
    PostalCode
    ModifiedDate
Then, when I receive the partner source files, I'd process them line by line to update the CurrentUser table and append to the PartnerSourceData table (using it as a log).
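
A rough sketch of that line-by-line processing, in plain Python with dicts standing in for the CurrentUser and PartnerSourceData tables. The function name apply_row, the field names and the ISO date strings are all made up for illustration (I am reading 9/9/09 and 9/10/09 as two consecutive days, which is an assumption; the second date is the later one either way).

```python
FIELDS = ("gender", "birth_year", "postal_code")


def apply_row(current_users, source_log, partner, user_id, modified, **values):
    """Append the raw row to the log, then refresh each non-empty field."""
    source_log.append({"partner": partner, "user_id": user_id,
                       "modified": modified, **values})
    user = current_users.setdefault(user_id, {})
    for name in FIELDS:
        value = values.get(name)
        if value is None:
            continue  # an empty field never overwrites data we already have
        known = user.get(name)
        # ISO date strings compare correctly as plain strings.
        if known is None or modified >= known["modified"]:
            user[name] = {"value": value, "source": partner, "modified": modified}


current_users, source_log = {}, []
apply_row(current_users, source_log, "Partner 1", "UserA", "2009-09-09",
          gender="Male", birth_year=None, postal_code="6300")
apply_row(current_users, source_log, "Partner 2", "UserA", "2009-09-10",
          gender=None, birth_year=1980, postal_code="2300")

user_a = current_users["UserA"]
summary = {name: (field["value"], field["source"]) for name, field in user_a.items()}
print(summary)
```

Keeping the source partner next to each value is what lets you answer "where did this come from" later without re-scanning the log.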

Related

Can't order query correctly

A while ago I requested help coding a LEFT JOIN that filters in a particular way, so that the desired value ends up in the first row:
Need to retrieve table's last inserted/updated record with some exclusions
The problem now is that in many cases the data gets mixed. The scenario is that the same table holds 2 values that we need to put in different columns. The PO_ID is unique but can have 1 or more values in the other tables, and in this particular case 1 PO_ID has 3 SHIP_ID_CUS values. We only need 1 row per PO_ID (no duplicates), which is why we used MAX() and GROUP BY.
Here is the piece of the code that I think causes the issue.
select
    z.po_id,
    max(scdc.ship_id) as ship_id_cdc,
    max(lscdc.ship_evnt_cd) as last_event_cdc,
    max(lscdc.ship_evnt_tms) as event_tms_cdc,
    max(scus.SHIP_ID) as ship_id_cus,
    max(lscus.ship_evnt_cd) as last_event_cus,
    max(lscus.ship_evnt_tms) as event_tms_cus
from TABLE.A z
left join (
    select distinct po_id, iltc.ship_id, s.ship_to_loc_code
    from TABLE.B iltc
    inner join TABLE.C s
        on iltc.ship_id=s.ship_id
        and iltc.ship_to_loc_code=s.ship_to_loc_code
        and s.ship_to_ctry<>' '
) AS A ON z.po_id = a.po_id
left JOIN TABLE.C scus
    ON A.SHIP_ID = scus.SHIP_ID
    AND A.SHIP_TO_LOC_CODE = scus.SHIP_TO_LOC_CODE
    and scus.loc_type = 'CUS'
    AND DAYS(scus.shipment_tms)+10 >= DAYS(z.ship_tms)
left JOIN TABLE.C scdc
    ON A.SHIP_ID = scdc.SHIP_ID
    AND A.SHIP_TO_LOC_CODE = scdc.SHIP_TO_LOC_CODE
    and scdc.loc_type = 'CDC'
    AND DAYS(scdc.shipment_tms)+10 >= DAYS(z.ship_tms)
left join (
    select ship_id_856, ship_to_loc_cd856, ship_evnt_cd, ship_evnt_tms, carr_tracking_num, event_srv_lvl
        , row_number() over(partition by ship_id order by updt_job_tms desc) as RN
    FROM TABLE.D
    WHERE LEFT(ship_evnt_cd, 1) <> '9'
) lscus
    ON lscus.ship_id_856=scus.ship_id
    and scus.ship_to_loc_code=lscus.ship_to_loc_cd856
    and lscus.rn = 1
left join (
    select ship_id_856, ship_to_loc_cd856, ship_evnt_cd, ship_evnt_tms, carr_tracking_num, event_srv_lvl
        , row_number() over(partition by ship_id order by updt_job_tms desc) as RN
    FROM TABLE.D
    WHERE LEFT(ship_evnt_cd, 1) <> '9'
) lscdc
    ON lscdc.ship_id_856=scdc.ship_id
    and lscdc.ship_to_loc_cd856=scdc.ship_to_loc_code
    and lscdc.rn = 1
WHERE
    z.po_id = 'T1DLDC'
GROUP BY z.po_id
By searching that condition we get the following result
The problem is that if we query TABLE.D directly, the last event we need (the one with the latest update timestamp) is a different one (X1), and somehow the date is incorrect.
Even stranger, if we filter the original query by ship_id_cus instead, we get the correct code but still with a wrong date:
WHERE
--z.po_id = 'T1DLDC'
scus.ship_id = 'D30980'
GROUP BY z.po_id
I tried other logic changes, like replacing the left joins with joins on an ordered subquery:
left JOIN ( select * from TABLE.C order by updt_job_tms desc) scus ON A.SHIP_ID = scus.SHIP_ID AND A.SHIP_TO_LOC_CODE = scus.SHIP_TO_LOC_CODE and scus.loc_type = 'CUS' AND DAYS(scus.shipment_tms)+10 >= DAYS(z.ship_tms)
But this gives exactly the same results, whether I search by po_id or by ship_id_cus.
Any ideas or comments will be much appreciated.
Thanks
------------------------------------UPDATE-----------------------------------
Adding the result of the LEFT JOIN with row_number() over (partition by ...), including all the ship_id_cus values for that po_id and all the codes with their tms. None match here.
Based on all this, it should be the last ship_id_cus with the X1 event/tms. If we also exclude the ones starting with 9, we would get the following result.
(I am not ordering by ship_id_cus here; as described above, that did not work either, at least the way I implemented it.)

If you have a table TBL1:
ID      APPROVED  APPROVER  DATE_APPROVED
======  ========  ========  =============
ABC     Y         JOE       2019-01-13
ABC     N         ZACK      2018-12-23
ABC     N         SUE       2019-02-23
and you run this SQL:
SELECT ID, MAX(APPROVED) AS APPROVAL,
       MAX(APPROVER) AS APPROVED_BY, MAX(DATE_APPROVED) AS APPROVED_ON
FROM TBL1 GROUP BY ID
you will get this result:
ID      APPROVAL  APPROVED_BY  APPROVED_ON
======  ========  ===========  =============
ABC     Y         ZACK         2019-02-23
which is what the code asks for but is NOT what you want: each MAX() is evaluated independently, so the values in one output row can come from different source rows.
Try the following:
SELECT T1.ID, T1.APPROVED, T1.APPROVER, T1.DATE_APPROVED
FROM TBL1 AS T1
INNER JOIN (SELECT ID, MAX(DATE_APPROVED) AS APPROVED_ON
FROM TBL1 GROUP BY ID
) AS T2
ON T1.ID =T2.ID
AND T1.DATE_APPROVED = T2.APPROVED_ON
Result:
ID      APPROVED  APPROVER  DATE_APPROVED
======  ========  ========  =============
ABC     N         SUE       2019-02-23
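
If you want to reproduce both results yourself, here is a small sketch that loads the three TBL1 rows into an in-memory SQLite database through Python (SQLite is my choice for convenience only; the original question is on another database, and the variable names mixed and latest are made up).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl1 (id TEXT, approved TEXT, approver TEXT, date_approved TEXT);
INSERT INTO tbl1 VALUES ('ABC', 'Y', 'JOE',  '2019-01-13');
INSERT INTO tbl1 VALUES ('ABC', 'N', 'ZACK', '2018-12-23');
INSERT INTO tbl1 VALUES ('ABC', 'N', 'SUE',  '2019-02-23');
""")

# Each MAX() is computed on its own column, so the row is a mix of three records.
mixed = conn.execute("""
    SELECT id, MAX(approved), MAX(approver), MAX(date_approved)
    FROM tbl1 GROUP BY id
""").fetchall()

# Find the latest date per id first, then join back to fetch that whole row.
latest = conn.execute("""
    SELECT t1.id, t1.approved, t1.approver, t1.date_approved
    FROM tbl1 AS t1
    INNER JOIN (SELECT id, MAX(date_approved) AS approved_on
                FROM tbl1 GROUP BY id) AS t2
            ON t1.id = t2.id AND t1.date_approved = t2.approved_on
""").fetchall()

print(mixed)
print(latest)
```

The same reasoning applies to the question's query: max(ship_evnt_cd) and max(ship_evnt_tms) are picked separately, so the code and the timestamp do not have to belong to the same event.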

How to find highest count of result set using multiple tables in SQL (Oracle)

I have four tables. Here are the skeletons...
ACADEMIC_TBL
    academic_id
    academic_name
AFFILIATION_TBL
    academic_id*
    institution_id*
    joined_date
    leave_date
INSTITUTION_TBL
    institution_id
    institution_name
REVIEW_TBL
    academic_id*
    institution_id*
    date_posted
    review_score
Using these tables I need to find the academic (displaying their name, not ID) with the highest number of reviews and the institution name (not ID) they are currently affiliated with. I imagine this will need to be done using multiple sub-select scripts but I'm having trouble figuring out how to structure it.

this will work:
SELECT at.academic_name,
       it.institution_name,
       Max(rt.review_score)
FROM academic_tbl at,
     affiliation_tbl afft,
     institution_tbl it,
     review_tbl rt
WHERE at.academic_id=afft.academic_id
  AND afft.institution_id=it.institution_id
  AND afft.academic_id=rt.academic_id
GROUP BY at.academic_name, it.institution_name

You need an aggregated query that JOINs all 4 tables to count how many reviews were performed by each academic.
Query :
SELECT
inst.institution_name,
aca.academic_name,
COUNT(*)
FROM
academic_tbl aca
INNER JOIN affiliation_tbl aff ON aff.academic_id = aca.academic_id
INNER JOIN institution_tbl inst ON inst.institution_id = aff.institution_id
INNER JOIN review_tbl rev ON rev.academic_id = aca.academic_id AND rev.institution_id = aff.institution_id
GROUP BY
inst.institution_name,
aca.academic_name,
inst.institution_id,
aca.academic_id
NB:
I added the academic and institution ids to the GROUP BY clause to prevent academics or institutions that happen to share a name from being (wrongly) grouped together.
If the same academic performed reviews for different institutions, you will get one row for each academic / institution pair, which, if I understood you right, is what you want.
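
To go from "count per academic" to "the academic with the most reviews and their current institution", one possible sketch is below, run against SQLite from Python. Two things are my assumptions, not something stated in the question: a NULL leave_date marks the current affiliation, and the sample names and rows are invented purely to exercise the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE academic_tbl    (academic_id INTEGER PRIMARY KEY, academic_name TEXT);
CREATE TABLE institution_tbl (institution_id INTEGER PRIMARY KEY, institution_name TEXT);
CREATE TABLE affiliation_tbl (academic_id INTEGER, institution_id INTEGER,
                              joined_date TEXT, leave_date TEXT);
CREATE TABLE review_tbl      (academic_id INTEGER, institution_id INTEGER,
                              date_posted TEXT, review_score INTEGER);

INSERT INTO academic_tbl VALUES (1, 'Ada');
INSERT INTO academic_tbl VALUES (2, 'Grace');
INSERT INTO institution_tbl VALUES (10, 'North University');
INSERT INTO institution_tbl VALUES (20, 'South College');
INSERT INTO affiliation_tbl VALUES (1, 10, '2010-01-01', '2015-12-31');
INSERT INTO affiliation_tbl VALUES (1, 20, '2016-01-01', NULL);
INSERT INTO affiliation_tbl VALUES (2, 10, '2012-01-01', NULL);
INSERT INTO review_tbl VALUES (1, 10, '2012-05-01', 4);
INSERT INTO review_tbl VALUES (1, 10, '2013-05-01', 5);
INSERT INTO review_tbl VALUES (1, 20, '2017-05-01', 3);
INSERT INTO review_tbl VALUES (2, 10, '2014-05-01', 5);
""")

# Count reviews per academic, keep the top count, then attach the
# current affiliation (assumed here to be the one with no leave_date).
query = """
SELECT aca.academic_name, inst.institution_name, cnt.review_count
FROM (SELECT academic_id, COUNT(*) AS review_count
      FROM review_tbl GROUP BY academic_id) AS cnt
INNER JOIN academic_tbl    AS aca  ON aca.academic_id = cnt.academic_id
INNER JOIN affiliation_tbl AS aff  ON aff.academic_id = aca.academic_id
                                  AND aff.leave_date IS NULL
INNER JOIN institution_tbl AS inst ON inst.institution_id = aff.institution_id
WHERE cnt.review_count = (SELECT MAX(c)
                          FROM (SELECT COUNT(*) AS c
                                FROM review_tbl GROUP BY academic_id))
"""
top = conn.execute(query).fetchall()
print(top)
```

Counting before joining to the affiliations keeps past affiliations from multiplying the review count. If several academics tie for the highest count, all of them are returned.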

Try this one:
select
    inst.institution_name
  , aca.academic_name
from
    academic_tbl aca
  , institution_tbl inst
  , affiliation_tbl aff
  , review_tbl rev
  , (
        select
            max(rt.review_score) max_score
        from
            review_tbl rt
          , affiliation_tbl aff_inn
        where
            rt.date_posted >= aff_inn.joined_date
            and rt.date_posted <= aff_inn.leave_date
            and rt.academic_id = aff_inn.academic_id
            and rt.institution_id = aff_inn.institution_id
    )
    agg
where
    aca.academic_id = aff.academic_id
    and inst.institution_id = aff.institution_id
    and aff.institution_id = rev.institution_id
    and aff.academic_id = rev.academic_id
    and rev.date_posted >= aff.joined_date
    and rev.date_posted <= aff.leave_date
    and rev.review_score = agg.max_score
;
It might return more than one academic if several share the maximum score.

Select closest date to another date

I'm querying a table (COURSEPLACE) to generate a result set of students. This contains a range of variables taken from the course table.
I then want to join an Addresses table (which contains multiple address records per student) but only append to the results the 1 postcode that was created on the same (or the closest) date to the date that the Student record was created.
What I've tried is the following, but this only gets me those records where the date value is an exact match. How do I extend this to (effectively) find and select the postcode value from the address record that has the closest date stamp to the student record date stamp?
SELECT cp.CONTACTNO, cp.AGEONENTRY, cp.COURSETITLE, cp.FACULTY, ad.POSTCODE
FROM COURSEPLACE cp
LEFT OUTER JOIN ADDRESS ad ON ad.CONTACTNO=cp.CONTACTNO
WHERE
cp.TYPE = 'Application'
AND cp.TERM = '2015/6'
AND
(
ad.TYPE = 'Home' AND CONVERT(VARCHAR(23),ad.CREATIONDATE,103) = CONVERT(VARCHAR(23),cp.CREATIONDATE,103)
)

SELECT TOP 1
    cp.CONTACTNO, cp.AGEONENTRY, cp.COURSETITLE, cp.FACULTY, ad.POSTCODE
FROM
    COURSEPLACE cp
    INNER JOIN ADDRESS ad ON ad.CONTACTNO=cp.CONTACTNO
WHERE cp.TYPE = 'Application'
    AND cp.TERM = '2015/6'
    AND ad.TYPE = 'Home'
ORDER BY
    ABS(DATEDIFF(dd, ad.CREATIONDATE, cp.CREATIONDATE)) ASC;

Solved it with OUTER APPLY, given as an answer to a previous question (ABS makes the ordering pick the closest date in either direction):
OUTER APPLY (SELECT TOP 1 * FROM ADDRESS ad2 WHERE ad2.CONTACTNO=cp.CONTACTNO ORDER BY ABS(DATEDIFF(dd,cp.CREATIONDATE,ad2.CREATIONDATE))) AD

Give this a whirl....
SELECT cp.CONTACTNO, cp.AGEONENTRY, cp.COURSETITLE, cp.FACULTY, ad.POSTCODE
FROM COURSEPLACE cp
LEFT JOIN ADDRESS ad ON
    ad.CONTACTNO=cp.CONTACTNO
    -- keep only the latest address created on or before the student record
    AND ad.CREATIONDATE = (SELECT MAX(ad2.CREATIONDATE)
                           FROM ADDRESS ad2
                           WHERE ad2.CONTACTNO = cp.CONTACTNO
                             AND ad2.CREATIONDATE <= cp.CREATIONDATE)
WHERE
    cp.TYPE = 'Application'
    AND cp.TERM = '2015/6'

Extract only the topline of data from a specific table SQL

I'm having trouble extracting the topline of data from a table and joining it with other extracted fields from other tables.
I have 3 tables:
Person
Folder
Earnings
Person:
PERSONID |FORENAMES|SURNAME|DOB |GENDER|NINO
1000000 |JOHNSTON |ALI |10/10/80 |M |JK548754A
Folder:
FOLDERID|FOLDERREF
1000000 |104567LK
Earnings:
FOLDERID|DATESTARTED|DATEENDED |GROSSEARNINGS
1000000 |01-04-2014 |31-03-2015 |31846.00
1000000 |01-04-2013 |31-03-2014 |31160.04
1000000 |01-04-2012 |31-03-2013 |30011.04
1000000 |01-04-2011 |31-03-2012 |29123.94
I need my data to look like:
JOHNSTON |ALI| 10-10-1980 | 31-03-2015 | 31846.00 | 31649.60
I've tried:
SELECT A.PERSONID, A.SURNAME, A.FORENAMES, A.DOB, B.FOLDERREF, C.DATEENDED, C.GROSSEARNINGS, C.BASICEARNINGS, C.FLUCTUATINGEARNINGS
FROM PERSON A, FOLDER B, EARNINGS C
WHERE A.PERSONID = B.FOLDERID AND B.FOLDERID = C.FOLDERID
Which extracts all of the data from the EARNINGS table, but I only wish to extract the top line.
Any advice would be gratefully received.

If you want just the data from the latest date then you could do something like the query below. Bear in mind that you're selecting fields like c.BasicEarnings and c.FluctuatingEarnings that aren't in the Earnings table you posted.
SELECT a.PersonID
    ,a.Surname
    ,a.Forenames
    ,a.DOB
    ,b.FolderRef
    ,c.DateEnded
    ,c.GrossEarnings
FROM Person a
JOIN Folder b ON a.PersonID = b.FolderID
JOIN (
    SELECT e.FolderID
        ,e.DateEnded
        ,e.GrossEarnings
    FROM Earnings e
    JOIN (
        SELECT FolderID
            ,MAX(DateEnded) DateEnded
        FROM Earnings
        GROUP BY FolderID
    ) m ON e.FolderID = m.FolderID
        AND e.DateEnded = m.DateEnded
) c ON b.FolderID = c.FolderID

Assuming the final field in your expected output is GROSSEARNINGS, and that by "I only wish to extract the top line" you mean the latest row by date, use MAX in a subquery to pick the latest DATEENDED for each folder. (Grouping by GROSSEARNINGS in the outer query would just return every earnings row again.)
SELECT p.FORENAMES, p.SURNAME, p.DOB, e.DATEENDED, e.GROSSEARNINGS
FROM Person p
INNER JOIN Earnings e ON p.PERSONID = e.FOLDERID
WHERE e.DATEENDED = (SELECT MAX(e2.DATEENDED)
                     FROM Earnings e2
                     WHERE e2.FOLDERID = e.FOLDERID)

Get percentages of larger group

The query below is kind of an ugly one, so I hope I've spaced it well enough to be readable. It finds the percentage of people from a certain area who visit a given hospital. For instance, if 100 people live in county X and 20 go to hospital A and 80 go to hospital B, the query outputs:
hospital A 20
hospital B 80
The query below works exactly like I want it to, but it got me thinking: how could this be done for every county in my table? Let me know if I need to document the query or do anything else to make it clearer.
select hospitalname, round(cast(counts as float)/cast(fayettestrokepop as float)*100,2)as percentSeen
from
(
SELECT tblHospitals.hospitalname, COUNT(tblHospitals.hospitalname) AS counts, tblStateCounties_1.countyName,
(SELECT COUNT(*) AS Expr1
FROM Patient INNER JOIN
tblStateCounties ON Patient.stateCode = tblStateCounties.stateCode AND Patient.countyCode = tblStateCounties.countyCode
WHERE (tblStateCounties.stateCode = '21') AND (tblStateCounties.countyName = 'fayette')) AS fayetteStrokePop
FROM Patient AS Patient_1 INNER JOIN
tblHospitals ON Patient_1.hospitalnpi = tblHospitals.hospitalnpi INNER JOIN
tblStateCounties AS tblStateCounties_1 ON Patient_1.stateCode = tblStateCounties_1.stateCode AND Patient_1.countyCode = tblStateCounties_1.countyCode
WHERE (tblStateCounties_1.stateCode = '21') AND (tblStateCounties_1.countyName = 'fayette')
GROUP BY tblHospitals.hospitalname, tblStateCounties_1.countyName
) as t
order by percentSeen desc
EDIT: sample data
The sample data below is the output without the outermost query (the "as t order by" part).
The countsInTheCounty column is the (select count(*) ...) part after 'tblStateCounties_1.countyName'.
hospitalName  hospitalCounts  countyName  countsInTheCounty
st. james     23              X           300
st. jude      40              X           300
Now with the outer query we would get
st. james  0.0767 (23/300)
st. jude   0.1333 (40/300)

Here is my guess. You'll have to test against your data or provide proper DDL + sample data.
;WITH totalCounts AS
(
SELECT StateCode, countyCode, COUNT(*) AS totalcount
FROM dbo.Patient GROUP BY StateCode, countyCode
)
SELECT
h.hospitalName,
hospitalCounts = COUNT(p.hospitalnpi),
c.countyName,
countsInTheCounty = tc.totalCount,
percentseen = CONVERT(DECIMAL(5,2), COUNT(p.hospitalnpi)*100.0/tc.totalCount)
FROM
dbo.Patient AS p
INNER JOIN
dbo.tblHospitals AS h
ON p.hospitalnpi = h.hospitalnpi
INNER JOIN
totalCounts AS tc
ON p.StateCode = tc.StateCode
AND p.countyCode = tc.countyCode
INNER JOIN
dbo.tblStateCounties AS c
ON tc.StateCode = c.stateCode
AND tc.countyCode = c.countyCode
GROUP BY
h.hospitalname,
c.countyName,
tc.totalcount
ORDER BY
c.countyName,
percentseen DESC;
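
Since the answer above is untested against real data, here is a scaled-down check of the same per-county-total idea. It runs on SQLite from Python instead of SQL Server (my substitution, so ROUND replaces CONVERT(DECIMAL(5,2), ...)), and the county codes, hospital ids and patient rows are invented: county X has 5 patients split 1 to 4, county Y has 2 patients at one hospital.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Patient (stateCode TEXT, countyCode TEXT, hospitalnpi INTEGER);
CREATE TABLE tblHospitals (hospitalnpi INTEGER, hospitalname TEXT);
CREATE TABLE tblStateCounties (stateCode TEXT, countyCode TEXT, countyName TEXT);

INSERT INTO tblHospitals VALUES (1, 'hospital A');
INSERT INTO tblHospitals VALUES (2, 'hospital B');
INSERT INTO tblStateCounties VALUES ('21', '001', 'X');
INSERT INTO tblStateCounties VALUES ('21', '002', 'Y');
""")
patients = ([("21", "001", 1)] + [("21", "001", 2)] * 4   # county X: 1 at A, 4 at B
            + [("21", "002", 1)] * 2)                     # county Y: 2 at A
conn.executemany("INSERT INTO Patient VALUES (?, ?, ?)", patients)

# Total patients per county once, then each hospital's share of that total.
query = """
WITH totalCounts AS (
    SELECT stateCode, countyCode, COUNT(*) AS totalCount
    FROM Patient
    GROUP BY stateCode, countyCode
)
SELECT c.countyName,
       h.hospitalname,
       ROUND(COUNT(*) * 100.0 / tc.totalCount, 2) AS percentSeen
FROM Patient AS p
INNER JOIN tblHospitals AS h ON p.hospitalnpi = h.hospitalnpi
INNER JOIN totalCounts AS tc ON p.stateCode = tc.stateCode
                            AND p.countyCode = tc.countyCode
INNER JOIN tblStateCounties AS c ON tc.stateCode = c.stateCode
                                AND tc.countyCode = c.countyCode
GROUP BY c.countyName, h.hospitalname, tc.totalCount
ORDER BY c.countyName, percentSeen DESC
"""
shares = conn.execute(query).fetchall()
for row in shares:
    print(row)
```

The 100.0 literal matters: it forces floating-point division, so the integer counts are not truncated to 0.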
