Complex select query question for hardcore SQL designers

This is a very complex query; I have been trying to construct it for a few days with no real success.
I'm using SQL Server 2005 Standard.
What I need is:
5 CampaignVariants from Campaigns, of which 2 have the highest PPU and 3 are random.
Next condition is that CampaignDailyBudget and CampaignTotalBudget are below what is set in Campaign ( calculation is number of clicks in Visitors table connected to Campaigns via CampaignVariants on which users click)
Next condition: CampaignLanguage, CampaignCategory, CampaignRegion and CampaignCountry must be the ones I send to this select (languageID, categoryID, regionID and countryID).
Next condition is that the IP address I send to this select statement won't be in the IP list for the current Campaign (I delete IPs that have been inactive for 24 hours).
In other words it gets 5 CampaignVariants for user that enters the site, when i take from user PublisherRegionUID,IP,Language,Country and Region
view diagram
more details
I get countryID, regionID, ipID, PublisherRegionUID and languageID from the Visitor. These are the filter parameters. First I need to get what the Publisher is about to show on his site, by its categories, language and so on, and then I filter all remaining Campaigns by the Visitor's params, using all parameters besides PublisherRegionUID.
So it has two actual filters: one for what the Publisher wants to publish and another for what the Visitor can view.
campaignDailyBudget and campaignTotalBudget are values set by the Users who create a Campaign. Those two are compared to (number of clicks per campaign)*(campaignPPU). For campaignDailyBudget the clicks are filtered by date, from 12:00AM to 11:59PM of today; campaignTotalBudget is not filtered by date, since it covers the whole lifetime of the campaign.
Demo of Stored Procedure
ALTER PROCEDURE dbo.CampaignsGetCampaignVariants4Visitor
    @publisherSiteRegionUID uniqueidentifier,
    @visitorIP varchar(15),
    @browserID tinyint,
    @countryID tinyint,
    @osID tinyint,
    @languageID tinyint,
    @acceptsCookies bit
AS
BEGIN
    SET NOCOUNT ON;
    -- check if such @publisherRegionUID exists
    if exists(select publisherSiteRegionID from PublisherSiteRegions where publisherSiteRegionUID=@publisherSiteRegionUID)
    begin
        declare @publisherSiteRegionID int
        select @publisherSiteRegionID = publisherSiteRegionID from PublisherSiteRegions where publisherSiteRegionUID=@publisherSiteRegionUID
        -- get CampaignVariants
        -- ** choose 2 highest PPU and 3 random CampaignVariants from Campaigns list
        -- where regionID,countryID,categoryID,languageID meets Publisher and Visitor requirements
        -- and Campaign.campaignDailyBudget<(sum of Clicks in Visitors per this Campaign)*Campaign.PPU during this day
        -- and Campaign.campaignTotalBudget<(sum of Clicks in Visitors per this Campaign)*Campaign.PPU
        -- and @visitorID does not appear in Campaigns2IPs with this Campaign
        -- insert visitor
        insert into Visitors (ipAddress,browserID,countryID,languageID,OSID,acceptsCookies)
        values (@visitorIP,@browserID,@countryID,@languageID,@osID,@acceptsCookies)
        declare @visitorID int
        select @visitorID = IDENT_CURRENT('Visitors')
        -- add IP to pool Campaigns ** adding ip to all Campaigns whose CampaignVariants were chosen
        -- add PublisherRegion2Visitor relationship
        insert into PublisherSiteRegions2Visitors values (@visitorID,@publisherSiteRegionID)
        -- add CampaignVariant2Visitor relationship
    end
END
GO

I also make a number of assumptions about your oblique requirements. I’ll spell them out as I go along, along with explaining the code. Please note that I of course have no reasonable way of testing this code for typos or minor logic errors.
It might be possible to write this as a single ginormous query, but that would be awkward, ugly, and prone to performance issues, as the SQL optimizer can have problems building plans for overly large queries. An option would be to write it as a series of queries, populating temp tables for use in subsequent queries (which allows for much simpler debugging). I chose to write this as a large common table expression statement with a series of CTE tables, largely because it kind of “flows” better that way, and it'd probably perform better than the many-temp-tables version.
First assumption: there are several circular references in there. Campaign has links to both Countries and Regions, so both of these parameter values must be checked, even though, based on the table link from Countries to Region, this filter could possibly be simplified to just a check on Country (assuming that the country parameter value is always “in” the region parameter). The same applies to Language and Category, and perhaps to IPs and Visitors. This appears to be sloppy design; if it can be cleared up, or if assumptions on the validity of the data can be made, the query could be simplified.
Second assumption: Parameters are passed in as variables in the form of @Region, @Country, etc. Also, there is only one IP address being passed in; if not, then you’ll need to pass in multiple values, set up a temp table containing those values, and add that as a filter where I use the @IP parameter.
So, step 1 is a first pass identifying “eligible” campaigns, by pulling out all those that share the desired country, region, language, and category, and that do not have the one IP address associated with them:
WITH cteEligibleCampaigns (CampaignId)
as (select CampaignId
    from Campaigns2Regions
    where RegionId = @RegionId
    intersect select CampaignId
    from Campaign2Countries
    where CountryId = @CountryId
    intersect select CampaignId
    from Campaign2Languages
    where LanguageId = @LanguageId
    intersect select CampaignId
    from Campaign2Categories
    where CategoryId = @CategoryId
    except select CampaignId
    from Campaigns2IPs
    where IPID = @IPId)
Next up, from these filter out those items where “CampaignDailyBudget and CampaignTotalBudget are below what is set in Campaign ( calculation is number of clicks in Visitors table connected to Campaigns via CampaignVariants on which users click)”. This requirement is not entirely clear to me. I have chosen to interpret it as “only include those campaigns where, if you count the number of visitors for those campaign’s CampaignVariants, the total count is less than both CampaignDailyBudget and CampaignTotalBudget”. Note that here I introduce a random value, used later on in selecting random rows.
,cteTargetCampaigns (CampaignId, RandomNumber)
as (select ec.CampaignId, checksum(newid()) RandomNumber
    from cteEligibleCampaigns ec
    inner join Campaigns ca
    on ca.CampaignId = ec.CampaignId
    inner join CampaignVariants cv
    on cv.CampaignId = ec.CampaignId
    inner join CampaignVariants2Visitors cvv
    on cvv.CampaignVariantId = cv.CampaignVariantId
    -- the budget columns must be grouped to be usable in HAVING
    group by ec.CampaignId, ca.CampaignDailyBudget, ca.CampaignTotalBudget
    having count(*) < ca.CampaignDailyBudget
    and count(*) < ca.CampaignTotalBudget)
Next up, identify the two “best” items.
,cteTopTwo (CampaignId, Ranking)
as (select tc.CampaignId, row_number() over (order by ca.CampaignPPU desc)
    from cteTargetCampaigns tc
    inner join Campaigns ca
    on ca.CampaignId = tc.CampaignId)
Next, line up all other campaigns by the randomly assigned number:
,cteRandom (CampaignId, Ranking)
as (select CampaignId, row_number() over (order by RandomNumber)
from cteTargetCampaigns
where CampaignId not in (select CampaignId
from cteTopTwo
where Ranking < 3))
And, at last, pull the data sets together:
select CampaignId
from cteTopTwo
where Ranking <= 2
union all select CampaignId
from cteRandom
where Ranking <= 3
Lump the above sections of code together, debug typos, invalid assumptions, and missed requirements (such as ordering, or flags distinguishing the top two items from the random ones), and you should be good.
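As a small runnable illustration of the "top two plus three random" shape, here is a sketch in Python using an in-memory SQLite database instead of SQL Server. The ten-row Campaigns table and its PPU values are made up for the demo, and SQLite's random() stands in for checksum(newid()); window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Campaigns (CampaignId INTEGER PRIMARY KEY, CampaignPPU REAL)")
# Hypothetical data: campaign i pays i * 0.1 per click, so ids 10 and 9 rank highest.
conn.executemany("INSERT INTO Campaigns VALUES (?, ?)",
                 [(i, i * 0.1) for i in range(1, 11)])

QUERY = """
WITH cteTopTwo AS (
    SELECT CampaignId,
           ROW_NUMBER() OVER (ORDER BY CampaignPPU DESC) AS Ranking
    FROM Campaigns
),
cteRandom AS (
    -- random() plays the role of checksum(newid()) in the T-SQL version
    SELECT CampaignId,
           ROW_NUMBER() OVER (ORDER BY random()) AS Ranking
    FROM Campaigns
    WHERE CampaignId NOT IN (SELECT CampaignId FROM cteTopTwo WHERE Ranking <= 2)
)
SELECT CampaignId, 'top' AS Source FROM cteTopTwo WHERE Ranking <= 2
UNION ALL
SELECT CampaignId, 'random' FROM cteRandom WHERE Ranking <= 3
"""

rows = conn.execute(QUERY).fetchall()
top_ids = sorted(cid for cid, source in rows if source == "top")
random_ids = [cid for cid, source in rows if source == "random"]
print(top_ids, random_ids)
```

The Source column is one way to supply the "flag" mentioned above, so the caller can tell the two best-paying campaigns from the random ones.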

I'm not sure I understand this portion of your post:
"it gets 5 CampaignVariants for user that enters the site, when i take from user PublisherRegionUID,IP,Language,Country and Region"
I'm assuming "it" is the query. Is the user, given your second "Next condition", the IP? What does "when I take from user" mean? Does that mean that this is the information you have at the time you execute your query, or is it information returned from your query? If the latter, then there are a host of questions that would need to be answered, since many of those columns are part of a many-to-many relationship.
Regardless, below is a means to get the 5 campaigns where, according to your second "Next condition", you have an IP address that you want filter out. I'm also assuming that you want five campaigns total which means that the three random ones cannot include the two "highest PPU" ones.
With
ValidCampaigns As
    (
    Select C.campaignId
        , C.campaignPPU  -- needed by CampaignPPURanks below
    From Campaigns As C
        Left Join (Campaigns2IPs As CIP
            Join IPs
                On IPs.ipID = CIP.ipID
                    And IPs.ipAddress = @IPAddress)
            On CIP.campaignId = C.campaignId
    Where CIP.campaignID Is Null
    )
, CampaignPPURanks As
    (
    Select C.campaignId
        , Row_Number() Over ( Order By C.campaignPPU desc ) As ItemRank
    From ValidCampaigns As C
    )
, RandomRanks As
    (
    Select C.campaignId
        , Row_Number() Over ( Order By newid() desc ) As ItemRank
    From ValidCampaigns As C
        Left Join CampaignPPURanks As CR
            On CR.campaignId = C.campaignId
                And CR.ItemRank <= 2
    Where CR.campaignId Is Null
    )
Select ...
From CampaignPPURanks As CPR
    Join CampaignVariants As CV
        On CV.campaignId = CPR.campaignId
            And CPR.ItemRank <= 2
Union All
Select ...
From RandomRanks As RR
    Join CampaignVariants As CV
        On CV.campaignId = RR.campaignId
            And RR.ItemRank <= 3

Related

SQL Select Count Subquery, Joins messing up everything

I've got a task to work with 4 different tables. I think I've got the "logic" correct, but I think I'm failing on joining the various separately working things together.
The Case somehow returns two rows when the comparison is true; if it isn't, it displays (correctly) just one. Works fine without joins.
The count subquery works by itself, but when I try to tie it together, it either shows the same number everywhere or displays far too large numbers (likely multiples of multiples).
Select Distinct RPD_PERSONS.PERSON_ID "id",
RPD_PERSONS.SURN_TXT ||' '|| RPD_PERSONS.NAME_TXT "Name",
Case ADD_ROLE_PERS.ROLE_CODE When 'Manager'
Then 'yes'
Else 'no'
End "Manager",
(
Select Count(LDD_CERTS.Cert_ID)
From LDD_CERTS
Join LDD_PERS_CERTS
On LDD_PERS_CERTS.CERT_ID = LDD_CERTS.CERT_ID
Where MONTHS_BETWEEN(LDD_CERTS.VALID_TO,SYSDATE)>0
And LDD_PERS_CERTS.CERT_CHANGE_TYPE>=0
) "no. of certificates"
From RPD_PERSONS
Join ADD_ROLE_PERS
On ADD_ROLE_PERS.Person_ID = RPD_PERSONS.Person_ID
Where RPD_PERSONS.Partic_ID = 1
Group By RPD_PERSONS.PERSON_ID, RPD_PERSONS.SURN_TXT ||' '|| RPD_PERSONS.NAME_TXT, ADD_ROLE_PERS.ROLE_CODE
Order By RPD_PERSONS.Person_ID;
This is the subquery that, by itself, seems to work perfectly.
Select LDD_PERS_CERTS.PERSON_UID,Count(LDD_CERTS.Cert_ID)
From LDD_CERTS
Join LDD_PERS_CERTS
ON LDD_PERS_CERTS.CERT_ID = LDD_CERTS.CERT_ID
Where MONTHS_BETWEEN(LDD_CERTS.VALID_TO,SYSDATE)>0
AND LDD_PERS_CERTS.CERT_CHANGE_TYPE>=0
Group By LDD_PERS_CERTS.PERSON_UID
order by LDD_PERS_CERTS.PERSON_UID;

You have a lot going on for such a short query, so let me try to summarize what I THINK you are trying to get.
You want a list of distinct people within the company with a count of how many ACTIVE certs (not expired) per person. From that, you also want to know if they are in a management position or not (via roles).
Q: For a person who may be a manager, but also an under-manager to a higher-up, do you want to see that person in both roles as typical business structures could have multiple layers of management, OR... Do you only care to see a person once, and if they are a manager OR some other level. What if a person has 3 or more roles, do you want to see them every instance? If your PRIMARY care is Manager Yes or No, the query gets even more simplified.
Now, your query of counts for valid certs. The MONTHS_BETWEEN() function suggests you are running on Oracle. Comparing Valid_To against SYSDATE this way indicates that Valid_To is intended to be in the future (i.e. the cert is still active). If this is the case, the query cannot be optimized well, because wrapping the column in a function call is not sargable: an index on Valid_To cannot be used for the comparison.
Instead, you should only need WHERE Valid_To > SysDate, in other words, only those certs that have not yet expired. You MIGHT even be better served by pre-aggregating the counts of still-active certs per Cert ID, then joining to the person certs table, since the person cert check is for all rows where cert_change_type >= 0, which could imply ALL. Under what condition would Cert_Change_Type be less than zero? If never, that WHERE clause is pointless.
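To make the sargability point concrete, here is a small sketch in Python with SQLite rather than Oracle. The certs table, its rows, the index name and the fixed "today" are all invented for the demo, and a julianday() difference stands in for MONTHS_BETWEEN(). Both predicates return the same rows, but only the bare column comparison lets the planner seek on the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE certs (cert_id INTEGER PRIMARY KEY, valid_to TEXT);
CREATE INDEX idx_certs_valid_to ON certs (valid_to);
INSERT INTO certs VALUES
    (1, '2023-06-30'), (2, '2024-01-01'), (3, '2024-09-15'), (4, '2025-03-01');
""")
TODAY = "2024-01-01"  # fixed stand-in for SYSDATE so the demo is repeatable

# Column wrapped in a function: the index on valid_to cannot be used to seek.
wrapped = "SELECT cert_id FROM certs WHERE julianday(valid_to) - julianday(?) > 0"
# Bare column compared to a value: sargable.
direct = "SELECT cert_id FROM certs WHERE valid_to > ?"

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail text.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, (TODAY,))]

wrapped_ids = sorted(r[0] for r in conn.execute(wrapped, (TODAY,)))
direct_ids = sorted(r[0] for r in conn.execute(direct, (TODAY,)))
wrapped_plan = plan(wrapped)
direct_plan = plan(direct)

print(wrapped_ids, direct_ids)
print(wrapped_plan)  # a full SCAN of the table or index
print(direct_plan)   # a SEARCH using idx_certs_valid_to
```

The exact plan wording differs between SQLite versions and will look different again in Oracle's EXPLAIN PLAN, but the scan-versus-seek distinction is the same idea.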
Next, your SELECT DISTINCT query needs a bit of adjustments. Your column-based select has no context to the outer person ID and is just aggregating the total certs. There is no correlation to the person ID to the certs being counted for. I can only GUESS that there is some relationship such as
RPD_Persons.Person_id = LDD_Pers_Certs.Person_UID
Having stated all that, I would have the following table/indexes
table            index
LDD_PERS_CERTS   ( CERT_CHANGE_TYPE, PERSON_UID, CERT_ID )
LDD_CERTS        ( valid_to, cert_id )
RPD_PERSONS      ( partic_id, person_id, surn_txt, name_txt )
ADD_ROLE_PERS    ( person_id, role_code )
I would try something like
Select
lpc.PERSON_UID,
ValCerts.CertCount
From
( select
Cert_id,
count(*) CertCount
from
LDD_CERTS
where
Valid_To > sysDate
group by
Cert_id ) ValCerts
JOIN LDD_PERS_CERTS lpc
on ValCerts.Cert_id = lpc.cert_id
Where
lpc.CERT_CHANGE_TYPE >= 0
Now, if you only care whether a given person is a manager or not, I would pre-query only that, as you were not actually returning a person's SPECIFIC ROLE, just the fact that they were a manager or not. My final query might look like
select
p.PERSON_ID id,
max( p.SURN_TXT || ' ' || p.NAME_TXT ) Name,
max( Case when arp.Person_id IS NULL
then 'no' else 'yes' end ) Manager,
max( coalesce( certs.CertCount, 0 )) ActiveCertsForUser
from
RPD_PERSONS p
LEFT Join ADD_ROLE_PERS arp
On p.Person_ID = arp.Person_ID
AND arp.role_code = 'Manager'
LEFT JOIN
( Select
lpc.PERSON_UID,
sum( ValCerts.CertCount ) CertCount -- total active certs per person
From
( select
Cert_id,
count(*) CertCount
from
LDD_CERTS
where
Valid_To > sysDate
group by
Cert_id ) ValCerts
JOIN LDD_PERS_CERTS lpc
on ValCerts.Cert_id = lpc.cert_id
AND lpc.CERT_CHANGE_TYPE >= 0
group by
lpc.PERSON_UID ) Certs
on p.Person_id = Certs.Person_uid
Where
p.Partic_ID = 1
GROUP BY
p.PERSON_ID
Now, if p.partic_id = 1 represents only 1 person, then it won't make as much sense to query all people with a given certificate status, etc. But if Partic_id = 1 represents a group of people, such as a given association / division of a company, then it should be fine.
Any questions, let me know and I can revise / update answer

CASE issue: there can, presumably, be multiple records in ADD_ROLE_PERS for each person. If a person can have two or more roles running concurrently, then you need to decide what business logic to use to handle this. If a person can only have one active role at a time, presumably there is an "active/disabled" column or effective date columns you should be using to identify the active record (or, potentially, there is a data issue).
The subquery should return the same value for every single row in your resultset, as it is completely isolated/standalone from your main query. If you want it to produce counts that are relevant to each row, then you will need to connect it to the tables in the main query (look up correlated subqueries if you don't know how to do this).
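A minimal sketch of the difference, in Python with an in-memory SQLite database. The persons and pers_certs tables and their rows are invented stand-ins for RPD_PERSONS and LDD_PERS_CERTS, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE persons (person_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE pers_certs (person_id INTEGER, cert_id INTEGER);
INSERT INTO persons VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO pers_certs VALUES (1, 10), (1, 11), (2, 10);
""")

# Standalone subquery: nothing ties it to the outer row, so every row gets the total.
uncorrelated = conn.execute("""
    SELECT p.name, (SELECT COUNT(*) FROM pers_certs)
    FROM persons p
    ORDER BY p.person_id
""").fetchall()

# Correlated subquery: the inner WHERE references the outer row's person_id.
correlated = conn.execute("""
    SELECT p.name,
           (SELECT COUNT(*) FROM pers_certs pc WHERE pc.person_id = p.person_id)
    FROM persons p
    ORDER BY p.person_id
""").fetchall()

print(uncorrelated)  # [('Ann', 3), ('Bob', 3), ('Cy', 3)]
print(correlated)    # [('Ann', 2), ('Bob', 1), ('Cy', 0)]
```

In the original query the equivalent change would be an extra condition inside the subquery tying LDD_PERS_CERTS.PERSON_UID to RPD_PERSONS.PERSON_ID, assuming those are the matching keys.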

SQL Server 2016: Query to create multiple unique pairs of IDs from the same table?

I'm working on building a query to handle random pairings of people so that each one can assess many others. I'm looking for a way to handle this in bulk - cross join perhaps? -rather than using a cursor to loop through people one at a time, which when tested was pretty slow as there will likely be hundreds of pairings at a time.
There are a few main parameters:
Each pair must be unique - two IDs can only be paired once.
There will be a specific number of pairs per ID - both the person being assessed and the person doing the assessing can have no more or less than the specific number of pairs.
All IDs are in this one table.
Must be able to create the pairs in random order.
No ID can be paired with itself.
Any ideas for how I could approach this?
Here's the query I've been working on
DECLARE @assessmentID INT=[N];
DECLARE @assessmentPairs TABLE(
    assessorID INT,
    authorID INT,
    assessorCounter INT,
    authorCounter INT
    UNIQUE NONCLUSTERED ([assessorID], [authorID])
);
INSERT INTO @assessmentPairs
SELECT assessorID,authorID,assessorCounter,authorCounter
FROM (
    SELECT
        e1.personID AS assessorID,
        e2.personID AS authorID,
        assessorCounter=ROW_NUMBER() OVER(PARTITION BY e1.personID ORDER BY e1.personID),
        authorCounter=ROW_NUMBER() OVER(PARTITION BY e2.personID ORDER BY NEWID())
    FROM People e1
    JOIN Assessments a ON a.courseOfferingID=e1.courseOfferingID
    CROSS JOIN People e2
    WHERE e2.personID<>e1.personID
    AND a.assessmentID=@assessmentID
    GROUP BY e1.personID,e2.personID
) AS x
WHERE authorCounter<=10
ORDER BY assessorID,authorCounter,authorID,assessorCounter
SELECT *
FROM @assessmentPairs
ORDER BY authorID,assessorID
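One possible set-based approach, sketched here in Python rather than T-SQL, is a shuffled "circular shift": put everyone in a random order, then let the person at position i assess the people at positions i+1 through i+k, wrapping around. This is only an illustration under stated assumptions, not a tested SQL Server solution; make_pairs and its parameters are hypothetical names. Every ID then appears exactly k times as assessor and k times as author, and nobody is paired with themselves.

```python
import random

def make_pairs(person_ids, pairs_per_person, rng=random):
    """Return (assessor, author) pairs via a shuffled circular shift."""
    ids = list(person_ids)
    n = len(ids)
    if not 0 < pairs_per_person < n:
        raise ValueError("pairs_per_person must be between 1 and len(ids) - 1")
    rng.shuffle(ids)  # the random order is the only source of randomness
    return [
        (ids[i], ids[(i + k) % n])          # position i assesses position i + k
        for i in range(n)
        for k in range(1, pairs_per_person + 1)
    ]

# 25 hypothetical person IDs, 10 pairs each, seeded for repeatability.
pairs = make_pairs(range(1, 26), 10, random.Random(0))
print(len(pairs))
```

If "paired once" means that A assessing B also rules out B assessing A, this scheme only guarantees it when 2 * pairs_per_person is smaller than the number of people. The same idea could be expressed in SQL by numbering people with ROW_NUMBER() OVER (ORDER BY NEWID()) and joining on the position difference modulo the head count.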

Query build to find records where all of a series of records have a value

Let me explain a little bit about what I am trying to do, because I don't even know the vocab to use to ask. I have an Access 2016 database that records staff QA data. When a staff member misses a QA we assign a job aid that explains the process, and they can optionally send back a worksheet showing they learned about what was missed. If they do all of these in a 3 month period they get a credit on their QA score. So I have a series of records, all of which have a date we assigned the work (RA1) and MAY have a work returned date (RC1).
In the below image "lavalleer" has earned the credit because both of her sheets got returned. "maduncn" did not earn the credit because he didn't do one.
I want to create a query that returns only the people that are like "lavalleer". I tried hitting Google and searched here and access.programmers.co.uk, but I'm only coming up with instructions to use Not Null statements. That wouldn't work for me, because if I did an IS NOT NULL on "maduncn" I would get the 4 records with dates and it would just exclude the null one.
What I need to do is build a query where I can see staff that have dates in ALL of their RC1 fields. If any of their RC1 fields are blank I don't want them to return.

Consider:
SELECT * FROM tablename WHERE NOT UserLogin IN (SELECT UserLogin FROM tablename WHERE RC1 IS NULL);

You could use a not exists clause with a correlated subquery, e.g.
select t.* from YourTable t where not exists
(select 1 from YourTable u where t.userlogin = u.userlogin and u.rc1 is null)
Here, select 1 is used purely for optimisation - we don't care what the query returns, just that it has records (or doesn't have records).
Or, you could use a left join to exclude those users for which there is a null rc1 record, e.g.:
select t.* from YourTable t left join
(select u.userlogin from YourTable u where u.rc1 is null) v on t.userlogin = v.userlogin
where v.userlogin is null
In all of the above, change all occurrences of YourTable to the name of your table.
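To check the NOT EXISTS variant end to end, here is a small sketch in Python using SQLite instead of Access. The qa table name and the dates are made up; only the two logins and the RA1/RC1 fields come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE qa (userlogin TEXT, ra1 TEXT, rc1 TEXT);
INSERT INTO qa VALUES
    ('lavalleer', '2019-01-05', '2019-01-10'),
    ('lavalleer', '2019-02-01', '2019-02-03'),
    ('maduncn',   '2019-01-07', '2019-01-09'),
    ('maduncn',   '2019-02-11', NULL);
""")

# Keep a user only if no row for that same user has a NULL rc1.
rows = conn.execute("""
    SELECT DISTINCT t.userlogin
    FROM qa t
    WHERE NOT EXISTS (
        SELECT 1 FROM qa u
        WHERE u.userlogin = t.userlogin AND u.rc1 IS NULL
    )
""").fetchall()

print(rows)  # [('lavalleer',)]
```

Note that maduncn is dropped entirely, including the row that does have a returned date, which is the behaviour the question asks for.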

SELECT query with conditional joins, return all rows when no data in lowest join

I'm looking for a solution where a query should return:
a) a limited set of rows when there are rows in the lowest joined table
b) all rows if there is no data in the lowest joined table
c) taking into account that it is possible that there is more than 1 such join
Objective:
we are implementing security using data. Rows from the table (MainTable) are filtered on 1 or more columns. These columns have a relationship with other tables (LookupTable). Security is defined on the LookupTable.
Example1: the MainTable contains contact information. One of the columns holds the country code, this column has a relationship with a LookupTable that contains the country codes. The user can only select a country code that exists in the LookupTable. The security admin can then define that a user can only work with contacts of one or more countries. When that user accesses the MainTable he/she will only get the contacts of that limited list of countries.
Example2: the MainTable contains products. One column holds the country of origin code, another column the product group. Security setup can limit the access to the product MainTable of a user to a list of countries AND a list of product groups.
The security setup works by Management-by-Exception, which means that the MainTable is filtered when one or more "security filters" are defined, but if no security filters are defined then the user will get ALL rows from MainTable. So my query should return a limited number of rows if any security filter is defined but should return all rows if there are no security filters defined.
Current situation:
I have been working on a query for the case of Example2. There are 4 possible scenarios:
1. No security filters are defined
   expected outcome: all rows are returned
2. Security filter defined only for first LookupTable
   expected outcome: only rows matching values between LookupTable1 and security filter are returned
3. Security filter defined only for second LookupTable
   expected outcome: only rows matching values between LookupTable2 and security filter are returned
4. Security filter defined for both LookupTables
   expected outcome: only rows matching values between LookupTable1 AND LookupTable2 and security filter are returned
The query I have is correct for cases 2,3 and 4 but fails for case 1 where no rows are returned (as per my understanding this is due to the fact that both JOINS return an empty result set).
Background:
The application provides the (power) users with some kind of table designer which means that they can define which columns are linked to a LookupTable and which of these LookupTables can be used for the "security filters".
This means that, potentially, we could have a MainTable with for example 200 columns of which 20 are linked to a LookupTable which are defined as security filter. The queries are stored procedures which are generated when "design" changes are saved.
With the query I have now (working for 3 out 4 cases) the number of scenarios is equal to 2^N where N is the number of LookupTables. If N is 20 the total goes over 1 million.
Security setup is done with Profiles assigned to Users and Filter Sets assigned to Profiles and Filter Set Entries containing the actual values to filter on (if any).
The environment is currently on MS SQL 2017 but will be put into production on SQL on Azure.
Example of the query (but look further below for the link to dbfiddle):
SELECT E.col_pk, E.col_28, E.col_7, E.col_8, E.col_9, E.col_1052
FROM MainTable AS E
LEFT JOIN LookupTable2 AS L28 ON L28.col_pk = E.col_28
JOIN SecUserProfile AS UP28 ON UP28.IdentityUserId = @UserId
JOIN SecProfileFilterSets AS PFS28 ON PFS28.SecProfileId = UP28.SecProfileId
LEFT JOIN SecFilterSetEntry AS SE28 ON SE28.SecFilterSetId = PFS28.SecFilterSetId AND SE28.MdxEntityId = 2 AND SE28.EntityKey = L28.col_pk
LEFT JOIN LookupTable13 AS L1052 ON L1052.col_pk = E.col_1052
JOIN SecUserProfile AS UP1052 ON UP1052.IdentityUserId = @UserId
JOIN SecProfileFilterSets AS PFS1052 ON PFS1052.SecProfileId = UP1052.SecProfileId
LEFT JOIN SecFilterSetEntry AS SE1052 ON SE1052.SecFilterSetId = PFS1052.SecFilterSetId AND SE1052.MdxEntityId = 13 AND SE1052.EntityKey = L1052.col_pk
WHERE
(SE28.SecFilterSetId IS NOT NULL AND SE1052.SecFilterSetId IS NOT NULL)
OR
(
SE28.SecFilterSetId IS NOT NULL AND
NOT EXISTS
(
SELECT TOP 1 NUP1052.Id FROM SecUserProfile AS NUP1052
JOIN SecProfileFilterSets AS NPFS1052 ON NPFS1052.SecProfileId = NUP1052.SecProfileId
JOIN SecFilterSetEntry AS NSE1052 ON NSE1052.SecFilterSetId = NPFS1052.SecFilterSetId AND NSE1052.MdxEntityId = 13
WHERE NUP1052.IdentityUserId = @UserId
)
)
OR
(
NOT EXISTS
(
SELECT TOP 1 NUP28.Id FROM SecUserProfile AS NUP28
JOIN SecProfileFilterSets AS NPFS28 ON NPFS28.SecProfileId = NUP28.SecProfileId
JOIN SecFilterSetEntry AS NSE28 ON NSE28.SecFilterSetId = NPFS28.SecFilterSetId AND NSE28.MdxEntityId = 2
WHERE NUP28.IdentityUserId = @UserId
)
AND SE1052.SecFilterSetId IS NOT NULL
)
OR
(
NOT EXISTS
(
SELECT TOP 1 NUP28.Id FROM SecUserProfile AS NUP28
JOIN SecProfileFilterSets AS NPFS28 ON NPFS28.SecProfileId = NUP28.SecProfileId
JOIN SecFilterSetEntry AS NSE28 ON NSE28.SecFilterSetId = NPFS28.SecFilterSetId AND NSE28.MdxEntityId = 2
WHERE NUP28.IdentityUserId = @UserId
)
AND
NOT EXISTS
(
SELECT TOP 1 NUP1052.Id FROM SecUserProfile AS NUP1052
JOIN SecProfileFilterSets AS NPFS1052 ON NPFS1052.SecProfileId = NUP1052.SecProfileId
JOIN SecFilterSetEntry AS NSE1052 ON NSE1052.SecFilterSetId = NPFS1052.SecFilterSetId AND NSE1052.MdxEntityId = 13
WHERE NUP1052.IdentityUserId = @UserId
)
)
Issue:
I have the following issues but they probably boil down to 1 in the end:
my current query is only 75% correct
even if my current query is correct it cannot be used in production with the potential high(er) number of lookup tables.
performance needs to be taken into account. Just as we don't know the number of columns and lookup tables at design time we don't know how many rows the tables will contain. The main table may hold 500, 50000 or 500000 records.
In the end all this will boil down to the right solution :)
I think this is not the easiest of questions (otherwise I will feel very stupid) and for those willing to take a look I've prepared a sandbox environment on dbfiddle representing the use-case I'm working with. I've setup the query to run 4 times, once for each of the scenarios.
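One way the "filter only if a filter is defined" rule might be expressed without enumerating 2^N scenarios is one independent predicate per secured column: "no filter entries exist for this column, OR the row's value is among them", with the predicates ANDed together. The sketch below is in Python with SQLite, not the real schema; main_table, filter_entry and their columns are simplified, hypothetical stand-ins for MainTable and the Sec* tables, and the profile/filter-set indirection is flattened into a single user_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE main_table (id INTEGER PRIMARY KEY, country TEXT, product_group TEXT);
CREATE TABLE filter_entry (user_id INTEGER, entity TEXT, entity_key TEXT);
INSERT INTO main_table VALUES (1,'BE','A'), (2,'BE','B'), (3,'FR','A'), (4,'FR','B');
-- user 1: no filters; user 2: country only; user 3: group only; user 4: both
INSERT INTO filter_entry VALUES
    (2,'country','BE'),
    (3,'product_group','A'),
    (4,'country','BE'), (4,'product_group','A');
""")

SECURED_COLUMNS = ["country", "product_group"]  # trusted names from the table design

def build_query(columns):
    """One predicate per secured column, so the SQL grows linearly with N."""
    clauses = []
    for col in columns:
        clauses.append(f"""(
            NOT EXISTS (SELECT 1 FROM filter_entry f
                        WHERE f.user_id = :uid AND f.entity = '{col}')
            OR EXISTS (SELECT 1 FROM filter_entry f
                       WHERE f.user_id = :uid AND f.entity = '{col}'
                         AND f.entity_key = m.{col}))""")
    return ("SELECT m.id FROM main_table m WHERE "
            + " AND ".join(clauses) + " ORDER BY m.id")

def visible_ids(user_id):
    sql = build_query(SECURED_COLUMNS)
    return [row[0] for row in conn.execute(sql, {"uid": user_id})]

for uid in (1, 2, 3, 4):
    print(uid, visible_ids(uid))
```

Because each column contributes a single clause, a generated stored procedure for 20 secured columns would contain 20 such predicates. Whether this performs acceptably on SQL Server at 500000 rows would still need to be measured.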

How to refactor complicated SQL query which is broken

Here is the simplified model of the domain
In a nutshell, a unit grants documents to a customer. There are two types of units: main units and their child units. Both belong to the same province, and multiple cities may belong to one province. A document has numerous events (its processing history). A customer belongs to one city and province.
I have to write query, which returns random set of documents, given a target main unit code. Here is the criteria:
Return 10 documents where the newest event_code = 10
Each document must belong to a different customer living in any city of the unit's region (prefer different cities)
Return the Customers newest Document which meets the criteria
There must be both document types present in the result
Result (customers chosen) should be random with each query
But...
If there's not enough customers, try to use multiple documents of the same customer as a last resort
If there aren't enough documents either, return as much as possible
If there's not a single instance of another document type, then return all the same
There may be millions of rows, and the query must be as fast as possible, since it is executed frequently.
I'm not sure how to structure this kind of complex query in a sane manner. I'm using Oracle and PL/SQL. Here is something I tried, but it isn't working as expected (returns wrong data). How should I refactor this query and get the random result, and also honor all those borderline rules? I'm also worried about the performance regarding the joins and wheres.
CURSOR c_documents IS
WITH documents_cte AS (
    SELECT d.document_id AS document_id, d.create_dt AS create_dt,
           c.customer_id
    FROM documents d
    JOIN customers c ON (c.customer_id = d.customer_id AND
        c.province_id = (SELECT region_id FROM unit WHERE unit_code = 1234))
    WHERE EXISTS (
        SELECT 1
        FROM event
        WHERE document_id = d.document_id AND
              event_code = 10
          AND create_dt = (
              SELECT MAX(create_dt)
              FROM event
              WHERE document_id = d.document_id))
)
SELECT * FROM documents_cte d
WHERE create_dt = (SELECT MAX(create_dt)
                   FROM documents_cte
                   WHERE customer_id = d.customer_id);
How to correctly make this query with efficiency, randomness in mind? I'm not asking for exact solution, but guidelines at least.

I'd avoid hierarchic tables whenever possible. In your case you are using a hierarchic table to allow for an unlimited depth, but at last it's just two levels you store: provinces and their cities. That should better be just two tables: one for provinces and one for cities. Not a big deal, but that would make your data model simpler and easier to query.
Below I am starting with a WITH clause to get a city table, as such doesn't exist. Then I go step by step: get the customers belonging to the unit, then get their documents and rank them. At last I select the ranked documents and randomly take 10 of the best ranked ones.
with cities as
(
select
c.region_id as city_id,
p.region_id as province_id
from region c
join region p on p.region_id = c.parent_region_id
)
, unit_customers as
(
select customer_id
from customer
where city_id in
(
select city_id
from cities
where
(
select region_id
from unit
where unit_code = 1234
) in (city_id, province_id)
)
)
, ranked_documents as
(
select
document.*,
row_number() over (partition by customer_id order by create_dt desc) as rn
from document
where customer_id in -- customers belonging to the unit
(
select customer_id
from unit_customers
)
and document_id in -- documents with latest event code = 10
(
select document_id
from event
group by document_id
having max(event_code) keep (dense_rank last order by create_dt) = 10
)
)
select *
from ranked_documents
order by rn, dbms_random.value
fetch first 10 rows only;
This doesn't take into account to get both document types, as this contradicts the rule to get the latest documents per customer.
FETCH FIRST is available as of Oracle 12c. In earlier versions you would use one more subquery and another ROW_NUMBER instead.
As to speed, I'd recommend these indexes for the query:
create index idx_r1 on region(region_id); -- already exists for region_id = primary key
create index idx_r2 on region(parent_region_id, region_id);
create index idx_u1 on unit(unit_code, region_id);
create index idx_c1 on customer(city_id, customer_id);
create index idx_e1 on event(document_id, create_dt, event_code);
create index idx_d1 on document(document_id, customer_id, create_dt);
create index idx_d2 on document(customer_id, document_id, create_dt);
One of the last two will be used, the other not. Check which with EXPLAIN PLAN and drop the unused one.
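The "order by rn, then randomly" step is what gives the fallback behaviour asked for: every customer's newest document (rn = 1) is taken before any customer's second-newest. Here is a small sketch in Python with SQLite rather than Oracle, with random() standing in for dbms_random.value and LIMIT for FETCH FIRST. The five-row document table is invented, and the unit and event filters are left out to keep the demo short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (document_id INTEGER PRIMARY KEY, customer_id INTEGER, create_dt TEXT);
-- customer 1 has three documents, customers 2 and 3 have one each
INSERT INTO document VALUES
    (1, 1, '2020-01-01'),
    (2, 1, '2020-02-01'),
    (3, 1, '2020-03-01'),
    (4, 2, '2020-01-15'),
    (5, 3, '2020-02-20');
""")

QUERY = """
WITH ranked_documents AS (
    SELECT document_id, customer_id,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY create_dt DESC) AS rn
    FROM document
)
SELECT document_id, customer_id, rn
FROM ranked_documents
ORDER BY rn, random()   -- newest-per-customer first, random within each rank
LIMIT 4                 -- asking for more documents than there are customers
"""

rows = conn.execute(QUERY).fetchall()
picked_ids = sorted(r[0] for r in rows)
print(picked_ids)  # [2, 3, 4, 5]
```

With only three customers and four documents requested, the three newest documents (3, 4 and 5) are always included and the fourth slot falls back to customer 1's second-newest document (2). When there are more customers than slots, the random sort decides which customers are chosen.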
