Get a unique current user, depending on what Online Platform they are using - sql

Our goal is to measure if they are still active over the last month, and if not the last month the last time the user was active. We are storing two tables, Employees and Online Transactions. The Online transactions table is built from monthly reports where we go out to the Online Platforms and get a listing of all users, with the date their id was activated and when it was last logged in. We get other data that matters to us, like what role they have in the Online Platform and how much data they are storing.
The Employee table has termination dates for employees that have left the company, as well as a unique id. The unique id is what we join to the Online table.
We use these two tables to manage the Online tool access as well as to report overall usage of the platforms by individual user per platform. There are three different platforms used, Portal, AGOL or Training. The Online transactions table stores all three in the same table.
The issue is I am not understanding the proper way to get the results I’m looking for. I am attaching example code, which I know by the result set isn’t what I want. Of course, since this is PII, I must scrub my example results to remove information but still show what I am having trouble with.
SELECT DISTINCT
U.EMPLOYEE_TRACKING_ID,U.LAST_NAME, U.FIRST_NAME, U.EMAIL_ADDRESS, U.PERSON_TYPE,
U.SERVICE_LINE, U.SUPERVISOR_NAME, U.OFFICE_LOCATION, U.OFFICE_CITY, U.OFFICE_STATE,
U.OFFICE_COUNTRY, U.OFFICE_POSTAL_CODE, U.ACTUAL_TERMINATION_DATETIME, O.FiscalPeriod,
O.Role, O.Source, O.LogDate
FROM dbo.Users U
INNER JOIN dbo.Online_Transactions O ON U.EMPLOYEE_TRACKING_ID = O.TrackingId
WHERE O.Source = 'AGOL'
In the example result set you can see that there are repeating records, because I have more than just the Tracking ID as part of the distinct, thus resulting in each field’s distinct value being considered.
Do I need to do some type of inner select to get down to the distinct user, by office and the latest log date they were reported? For example, our records go back 3.5 years, but what I want is the all unique users by the last time they logged in. I can remove the termination date for current users and keep the termination date so I can remove users who have left. I thought I would do a series of views to get me each scenario, therefore there would be a total of 6, one for each Online type and the second for terminated or not?
I would appreciate it if anyone could help me learn how to do this.
EMPLOYEE_TRACKING_ID,LAST_NAME,FIRST_NAME,EMAIL_ADDRESS,PERSON_TYPE,SERVICE_LINE,SUPERVISOR_NAME,OFFICE_LOCATION,OFFICE_CITY,OFFICE_STATE,OFFICE_COUNTRY,OFFICE_POSTAL_CODE,ACTUAL_TERMINATION_DATETIME,FiscalPeriod,Role,Source,LogDate
111483,Name4,User4,User4.Name4#mycompany.com,Employee,NULL,,,Melbourne,VIC,Australia,3008,4/30/2021,FY2020-Q4-08,AECOM PUBLISHER,AGOL,2020-09-02 00:00:00.000
111483,Name4,User4,User4.Name4#mycompany.com,Employee,NULL,,,Melbourne,VIC,Australia,3008,4/30/2021,FY2020-Q4-09,AECOM PUBLISHER,AGOL,2020-09-30 00:00:00.000
111483,Name4,User4,User4.Name4#mycompany.com,Employee,NULL,,,Melbourne,VIC,Australia,3008,4/30/2021,FY2021-Q1-11,AECOM PUBLISHER,AGOL,2021-02-02 00:00:00.000
111483,Name4,User4,User4.Name4#mycompany.com,Employee,NULL,,,Melbourne,VIC,Australia,3008,4/30/2021,FY2021-Q1-12,AECOM PUBLISHER,AGOL,2021-03-01 00:00:00.000
111483,Name4,User4,User4.Name4#mycompany.com,Employee,NULL,,,Melbourne,VIC,Australia,3008,4/30/2021,FY2021-Q2-01,AECOM PUBLISHER,AGOL,2021-05-01 00:00:00.000
111483,Name4,User4,User4.Name4#mycompany.com,Employee,NULL,,,Melbourne,VIC,Australia,3008,4/30/2021,FY2021-Q2-02,AECOM PUBLISHER,AGOL,2020-09-02 00:00:00.000
111483,Name4,User4,User4.Name4#mycompany.com,Employee,NULL,,,Melbourne,VIC,Australia,3008,4/30/2021,FY2021-Q2-03,AECOM PUBLISHER,AGOL,2021-01-04 00:00:00.000
113311,Name3,User3,User3.Name3#mycompany.com,Employee,NULL,,,Sydney,NSW,Australia,2000,5/21/2021,FY2020-Q3-06,AECOM COLLECTOR,AGOL,2020-09-02 00:00:00.000
14001627,Name1,User1,user1.name1#mycompany.com,Employee,NULL,,,Melbourne,VIC,Australia,3008,3/3/2021,FY2021-Q1-12,AECOM COLLECTOR,AGOL,2021-02-02 00:00:00.000
14001627,Name1,User1,user1.name1#mycompany.com,Employee,NULL,,,Melbourne,VIC,Australia,3008,3/3/2021,FY2021-Q2-01,AECOM COLLECTOR,AGOL,2021-04-01 00:00:00.000
14001627,Name1,User1,user1.name1#mycompany.com,Employee,NULL,,,Melbourne,VIC,Australia,3008,3/3/2021,FY2021-Q2-02,AECOM COLLECTOR,AGOL,2020-09-02 00:00:00.000
14007604,Name2,User2,User2.Name2#mycompany.com,Employee,NULL,,,Newcastle upon Tyne,POST-TWR,United Kingdom,NE1 2HF,9/30/2020,FY2020-Q2-03,AECOM COLLECTOR,AGOL,2020-09-02 00:00:00.000

It looks like your Online_Transactions table has multiple entries for your users?
You could use a CROSS APPLY to get just one row per user. Note the ORDER BY, which makes TOP 1 pick the latest one:
SELECT DISTINCT
U.EMPLOYEE_TRACKING_ID,U.LAST_NAME, U.FIRST_NAME, U.EMAIL_ADDRESS, U.PERSON_TYPE,
U.SERVICE_LINE, U.SUPERVISOR_NAME, U.OFFICE_LOCATION, U.OFFICE_CITY, U.OFFICE_STATE,
U.OFFICE_COUNTRY, U.OFFICE_POSTAL_CODE, U.ACTUAL_TERMINATION_DATETIME,
T.FiscalPeriod, T.Role, T.Source, T.LogDate
FROM dbo.Users AS U
CROSS APPLY (
SELECT TOP 1 O.FiscalPeriod, O.Role, O.Source, O.LogDate
FROM dbo.Online_Transactions AS O
WHERE O.TrackingId = U.EMPLOYEE_TRACKING_ID
AND O.Source = 'AGOL'
ORDER BY O.LogDate DESC
) AS T
This could also be done with a correlated subquery, but I like this syntax.
You might want to swap CROSS APPLY for OUTER APPLY just to see the results. I normally use CROSS APPLY, which behaves like an INNER JOIN in that it only shows users that have a matching row in Online_Transactions; OUTER APPLY also keeps users with no match.
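To try the "latest row per user" idea without a SQL Server instance, here is a minimal sketch in Python using SQLite. SQLite has no CROSS APPLY, so this swaps in a different technique: a join against a grouped MAX(LogDate) subquery. The table and column names follow the question, but the sample rows (including the `NoLogins` user) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users (EMPLOYEE_TRACKING_ID INTEGER PRIMARY KEY, LAST_NAME TEXT);
CREATE TABLE Online_Transactions (
    TrackingId INTEGER, Source TEXT, FiscalPeriod TEXT, LogDate TEXT);
-- Hypothetical sample rows, loosely modelled on the scrubbed result set.
INSERT INTO Users VALUES (111483, 'Name4'), (113311, 'Name3'), (999, 'NoLogins');
INSERT INTO Online_Transactions VALUES
    (111483, 'AGOL',   'FY2020-Q4-08', '2020-09-02'),
    (111483, 'AGOL',   'FY2021-Q2-01', '2021-05-01'),
    (111483, 'Portal', 'FY2021-Q2-02', '2021-06-01'),
    (113311, 'AGOL',   'FY2020-Q3-06', '2020-09-02');
""")

# One row per user: keep only the AGOL row whose LogDate equals that
# user's latest AGOL LogDate. Users with no AGOL rows drop out, which
# matches the CROSS APPLY (inner-join-like) behaviour.
latest_per_user = conn.execute("""
SELECT U.EMPLOYEE_TRACKING_ID, U.LAST_NAME, O.FiscalPeriod, O.LogDate
FROM Users AS U
JOIN Online_Transactions AS O ON O.TrackingId = U.EMPLOYEE_TRACKING_ID
JOIN (
    SELECT TrackingId, MAX(LogDate) AS LastLog
    FROM Online_Transactions
    WHERE Source = 'AGOL'
    GROUP BY TrackingId
) AS L ON L.TrackingId = O.TrackingId AND L.LastLog = O.LogDate
WHERE O.Source = 'AGOL'
ORDER BY U.EMPLOYEE_TRACKING_ID
""").fetchall()
print(latest_per_user)
```

Unlike TOP 1, this form returns more than one row for a user if two transactions share the same latest LogDate, so it may need a tie-breaker on real data.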

Related

SQL question with attempt on customer information

Schema
Question: List all paying customers with users who had 4 or 5 activities during the week of February 15, 2021; also include how many of the activities sent were paid, organic and/or app store. (i.e. include a column for each of the three source types).
My attempt so far:
SELECT source_type, COUNT(*)
FROM activities
WHERE activity_time BETWEEN '02-15-21' AND '02-19-21'
GROUP BY source_type
I would like to get a second opinion on it. I didn't include the accounts table because I don't believe that I need it for this query, but I could be wrong.
Have you tried to run this? It doesn't satisfy the brief on FOUR counts:
List all the ... customers (that match criteria)
There is no customer information included in the results at all, so this is an outright fail.
paying customers
This is the top level criteria, only customers that are not free should be included in the results.
Criteria: users who had 4 or 5 activities
There has been no attempt to evaluate this user criteria in the query, and the results do not provide enough information to deduce it.
There is further ambiguity in this requirement: does it mean that results should only be included if the account has individual users with 4 or 5 activities each, or simply that the account should have 4 or 5 activities overall?
If this is a test question (clearly this is contrived, if it is not please ask for help on how to design a better schema) then the use of the term User is usually very specific and would suggest that you need to group by or otherwise make specific use of this facet in your query.
Bonus: (i.e. include a column for each of the three source types).
This is the only element that was attempted, as the data is grouped by source_type but the information cannot be correlated back to any specific user or customer.
Next time please include example data and the expected outcome with your post. In preparing the data for this post you would have come across these issues yourself and may have been inspired to ask a different question, or through the process of writing the post up you may have resolved the issue yourself.
Without further clarification, we can still start to evolve this query. A good place to start is to set the criteria aside and focus on the format of the output. The requirement mentions the following output requirements:
List Customers
Include a column for each of the source types.
Firstly, even though you don't think you need to, the request clearly states that Customer is an important facet in the output, and in your schema account holds the customer information, so although we do not need to, it makes the data readable by humans if we do include information from the account table.
This is a standard PIVOT style response then, we want a row for each customer, presenting a count that aggregates each of the values for source_type. Most RDBMS will support some variant of a PIVOT operator or function, however we can achieve the same thing with simple CASE expressions to conditionally put a value into projected columns in the result set that match the values we want to aggregate, then we can use GROUP BY to evaluate the aggregation, in this case a COUNT
The following syntax is for MS SQL; however, you can achieve something similar easily enough in other RDBMSs.
OP please tag this question with your preferred database engine...
NOTE: there is NO filtering in this query... yet
SELECT accounts.company_id
, accounts.company_name
, paid = COUNT(st_paid)
, organic = COUNT(st_organic)
, app_store = COUNT(st_app_store)
FROM activities
INNER JOIN accounts ON activities.company_id = accounts.company_id
-- PIVOT the source_type
CROSS APPLY (SELECT st_paid = CASE source_type WHEN 'paid' THEN 1 END
,st_organic = CASE source_type WHEN 'organic' THEN 1 END
,st_app_store = CASE source_type WHEN 'app store' THEN 1 END
) as PVT
GROUP BY accounts.company_id, accounts.company_name
This results in the following shape of result:
company_id  company_name  paid  organic  app_store
----------  ------------  ----  -------  ---------
apl01       apples        4     8        0
ora01       oranges       6     12       0
Criteria
When you are happy with the shape of the results and that all the relevant information is available, it is time to apply the criteria to filter this data.
From the requirement, the following criteria can be identified:
paying customers
The spec doesn't mention paying specifically, but it does include a note that (free customers have current_mrr = 0)
Now aren't we glad we did join on the account table :)
users who had 4 or 5 activities
This is very specific about explicitly 4 or 5 activities, no more, no less.
For the sake of simplicity, let's assume that the user facet of this requirement is not important and that it is simply a reference to all users on an account, not just users who have individually logged 4 or 5 activities on their own. Proving the per-user interpretation would require more demo data than I care to manufacture right now.
during the week of February 15, 2021.
This one was correctly identified in the original post, but we need to call it out just the same.
OP has used Monday to Friday of that week, there is no mention that weeks start on a Monday or that they end on Friday but we'll go along, it's only the syntax we need to explore today.
In the real world the actual values specified in the criteria should be parameterised, mainly because you don't want to manually re-construct the entire query every time, but also to sanitise input and prevent SQL injection attacks.
Even though it seems overkill for this post, using parameters even in simple queries helps to identify the variable elements, so I will use parameters for the 2nd criteria to demonstrate the concept.
DECLARE @from DateTime = '2021-02-15' -- Date in ISO format
DECLARE @to DateTime = (SELECT DateAdd(d, 5, @from)) -- 2021-02-20 00:00, so BETWEEN covers all of Friday 2021-02-19
/* NOTE: requirement only mentioned the start date, not the end
so your code should also only rely on the single fixed start date */
SELECT accounts.company_id, accounts.company_name
, paid = COUNT(st_paid), organic = COUNT(st_organic), app_store = COUNT(st_app_store)
FROM activities
INNER JOIN accounts ON activities.company_id = accounts.company_id
-- PIVOT the source_type
CROSS APPLY (SELECT st_paid = CASE source_type WHEN 'paid' THEN 1 END
,st_organic = CASE source_type WHEN 'organic' THEN 1 END
,st_app_store = CASE source_type WHEN 'app store' THEN 1 END
) as PVT
WHERE -- paid accounts = exclude 'free' accounts
accounts.current_mrr > 0
-- Date range filter
AND activity_time BETWEEN @from AND @to
GROUP BY accounts.company_id, accounts.company_name
-- The fun bit, we use HAVING to apply a filter AFTER the grouping is evaluated
-- Wording was explicitly 4 OR 5, not BETWEEN so we use IN for that
HAVING COUNT(source_type) IN (4,5)
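The shape of that final query (conditional aggregation for the pivot, WHERE for row-level filters, HAVING for the post-grouping filter) can be tried out with a small sketch in Python and SQLite. SQLite has no CROSS APPLY, so the CASE expressions sit directly inside COUNT. The sample rows, the `fre01` account, and the half-open date range (`>=` start, `<` end) are assumptions made for this illustration; bound parameters play the role of the declared variables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (company_id TEXT PRIMARY KEY, company_name TEXT, current_mrr REAL);
CREATE TABLE activities (company_id TEXT, source_type TEXT, activity_time TEXT);
INSERT INTO accounts VALUES
    ('apl01', 'apples', 100), ('ora01', 'oranges', 50), ('fre01', 'freebies', 0);
""")

# Hypothetical data: apl01 has 4 activities in the week (plus 1 outside it),
# ora01 has 6 in the week, and fre01 has 4 but is a free account.
sample_activities = (
    [("apl01", "paid", "2021-02-15 09:00:00"),
     ("apl01", "paid", "2021-02-16 10:00:00"),
     ("apl01", "organic", "2021-02-17 11:00:00"),
     ("apl01", "app store", "2021-02-19 23:00:00"),
     ("apl01", "paid", "2021-03-01 08:00:00")]
    + [("ora01", "organic", "2021-02-16 12:00:00")] * 6
    + [("fre01", "paid", "2021-02-17 12:00:00")] * 4
)
conn.executemany("INSERT INTO activities VALUES (?, ?, ?)", sample_activities)

# Bound parameters stand in for the declared start and end variables.
week_start, week_end = "2021-02-15", "2021-02-20"
matching = conn.execute("""
SELECT a.company_id, a.company_name,
       COUNT(CASE WHEN act.source_type = 'paid'      THEN 1 END) AS paid,
       COUNT(CASE WHEN act.source_type = 'organic'   THEN 1 END) AS organic,
       COUNT(CASE WHEN act.source_type = 'app store' THEN 1 END) AS app_store
FROM activities AS act
JOIN accounts AS a ON a.company_id = act.company_id
WHERE a.current_mrr > 0            -- paying customers only
  AND act.activity_time >= ?       -- start of the week (inclusive)
  AND act.activity_time <  ?       -- day after Friday (exclusive)
GROUP BY a.company_id, a.company_name
HAVING COUNT(*) IN (4, 5)          -- filter applied after grouping
""", (week_start, week_end)).fetchall()
print(matching)
```

Only `apl01` survives: `ora01` has too many activities and `fre01` is filtered out by `current_mrr > 0` before grouping ever happens.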
I believe you are missing some information there.
Without more information on the tables, I can only guess that you also have a customer table. I am going to assume there is a customer_id column that serves as the key between both tables.
I would take your query and do something like:
SELECT customer_id,
SUM(num_operations) AS Total,
MAX(CASE WHEN source_type = 'app' THEN num_operations END) AS app_totals,
MAX(CASE WHEN source_type = 'paid' THEN num_operations END) AS paid_totals,
MAX(CASE WHEN source_type = 'organic' THEN num_operations END) AS organic_totals
FROM (
-- one row per customer and source type, with its activity count
SELECT customer_id, source_type, COUNT(*) AS num_operations
FROM activities
WHERE activity_time BETWEEN '02-15-21' AND '02-19-21'
GROUP BY customer_id, source_type
) tb1 GROUP BY customer_id
This is the most generic case I can think of, but it does not scale very well. If you get new source types, you need to modify the query, and the structure of the output table also changes. Depending on the SQL engine you are using (e.g. MySQL vs. Microsoft SQL Server) you could also use a pivot function.
The previous query is a little bit rough, but it will give you a general idea. You can add "ELSE" statements to the clause, to zero the fields when they have no values, and join with the customer table if you want only active customers, etc.

Microsoft Access query to retrieve random transactions with seemingly impossible criteria

I was asked to assist with developing a report to retrieve a 25% sample of random transactions within a specific date range. I am not a programmer but I was able to devise the following fairly quickly:
SELECT TOP 25 PERCENT account.CID, account.ACCT, account.NAME, log.DATE, log.action_txt, log.field_nm, log.from_data, log.to_data, log.tran_id, log.init
FROM account INNER JOIN log ON account.ACCT = log.ACCT
GROUP BY account.CID, account.ACCT, account.NAME, log.DATE, log.action_txt, log.field_nm, log.from_data, log.to_data, log.tran_id, log.init
HAVING (((log.DATE) Between #2/7/2018# And #6/15/2018#) AND ((log.action_txt)="mod" Or (log.action_txt)="del") AND ((log.init)="J1X"))
ORDER BY log.tran_dt
This returns 25% of the records within the date range. Each record row is unique but each account number potentially has multiple records on each day. In some cases the records have the same date and tran_id as well.
Upon further discussion with the requester, he actually wants to see all of the transactions for 25% of the accounts that have activity on each day within the date range. Thus if there were 100 accounts on 3/1/2018 with records in this table, he wants to see all of the transactions for 25 of those accounts; if there were 60 accounts on 3/2/2018 with records in this table, he wants to see all of the transactions for 15 of those accounts; and so on.
I was thinking that an Access module would work best in this scenario as I believe there are multiple parts to this. I figured that I need a function to loop through the date range and for each day:
1. Count the account numbers only one time
2. Return all of the transactions for 25% of the total accounts
But as I mentioned, I am not a programmer and I am exhausted from searching possible solutions for the many parts.
I think the key to your question is that you only really need a pseudo random selection of results for your report. So you can force the Random number generator to reorder your results based on a value in the record and the current time.
Something like this should work. I assume your action_txt field is a text field, so I take the length of that field and combine it with the current date/time to create a pseudo random number that can be sorted on.
All I really do is change your ORDER BY line
See if this works for you
SELECT TOP 25 PERCENT
account.CID, account.ACCT, account.NAME, log.DATE, log.action_txt, log.field_nm, log.from_data,
log.to_data, log.tran_id, log.init
FROM account
INNER JOIN log ON account.ACCT = log.ACCT
GROUP BY account.CID, account.ACCT, account.NAME, log.DATE, log.action_txt, log.field_nm, log.from_data, log.to_data, log.tran_id, log.init
HAVING (((log.DATE) Between #2/7/2018# And #6/15/2018#) AND ((log.action_txt)="mod" Or (log.action_txt)="del") AND ((log.init)="J1X"))
ORDER BY Rnd(CLng(Now()*Len(log.action_txt))-(Now()*Len(log.action_txt)));
Modified similar idea from other StackOverflow question and response
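The requester's revised requirement (all transactions for a random 25% of the accounts active on each day) is a two-step procedure: collect the distinct accounts per day, then keep every transaction belonging to a random quarter of them. The following is a minimal sketch of that logic in plain Python rather than an Access module. The function name, the tuple layout `(day, acct, tran_id)`, the rounding rule, and the "at least one account per day" floor are all assumptions made for illustration, not part of the original question.

```python
import random
from collections import defaultdict

def sample_accounts_per_day(transactions, fraction=0.25, seed=None):
    """Return every transaction for a random `fraction` of each day's accounts.

    `transactions` is a list of (day, acct, tran_id) tuples (assumed layout).
    """
    rng = random.Random(seed)

    # Step 1: count each account only once per day.
    accounts_by_day = defaultdict(set)
    for day, acct, _tran_id in transactions:
        accounts_by_day[day].add(acct)

    # Step 2: pick the sampled accounts for each day.
    chosen = {}
    for day, accts in accounts_by_day.items():
        k = max(1, round(len(accts) * fraction))  # assumed: never zero accounts
        chosen[day] = set(rng.sample(sorted(accts), k))

    # Step 3: keep all transactions of the chosen accounts on that day.
    return [t for t in transactions if t[1] in chosen[t[0]]]

# Hypothetical data: 8 accounts with 2 transactions each on day 1,
# 4 accounts with 1 transaction each on day 2.
demo = [("2018-03-01", f"A{i}", n) for i in range(8) for n in (1, 2)]
demo += [("2018-03-02", f"B{i}", 1) for i in range(4)]
sampled = sample_accounts_per_day(demo, seed=42)
print(sampled)
```

With 8 accounts on the first day and 4 on the second, the sample contains 2 accounts (4 transactions) and 1 account (1 transaction) respectively; which accounts are chosen depends on the seed.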

Access/SQL Select Query - Return "Most Like" Value Only

We have a chargeback process in an AccessDB where Departments must approve the expenses entered by another department. We only want a single 'default' approver, but the way the data has been set-up and the query we currently use to fill in the approver returns multiple results.
In the tUserSec table, for example, we have two columns. Name(UserIDX) and UserCode
User1 - 550*
User2 - 55003*
The idea here being that User1 is the Director and so is a 'catchall' for everything in this department, while User2 is a Manager and is specifically assigned to a narrower division. Departments are always 7 characters total.
Say the Department is 5500309, the idea is that User2 should populate as the approver since their code is most closely matched to the Department ID. However, using the "Like" criteria returns both users and the form appears to select one of the two users at random with no rhyme or reason that I can determine. It always selects User1 for 5500309 but always selects User2 for 5500301, despite there being no further delineation - but ideally User1 shouldn't be populating at all unless no one else matches closer.
Below is a simplified version of the SQL, I cut out some other stuff that muddies the situation:
SELECT TDepts.Dept, TDepts.DDescr, tUserSec.UserIDX
FROM tUserSec, TDepts
WHERE (((TDepts.Dept) Like [usercode] & "*"));
How can I change this up so that I only pull in the UserID who is most like the usercode? I tried to figure out a way to pull in the UserID based on the length or max of the usercode, etc. but I wasn't able to find a way that worked. It's a safe assumption that if two users have usercodes that are "like" the department that the usercode that is longest is the one we want.
(This is my first question on here and I struggled with how to best explain this issue. Please be gentle :) )
First, I have to say that the main problem here is when a developer thought that they would be clever and build a lot of logic into the department and user IDs. Hiding this sort of information within a column is a big source of headaches in general (as you're just starting to see).
I don't develop with Access, so I'm not certain of the syntax, but hopefully you'll get the general idea. Please let me know if the syntax needs to be tweaked for future users who find this question:
SELECT
D.Dept,
D.DDescr,
U.UserIDX
FROM
TDepts D
LEFT OUTER JOIN
(
SELECT
SQ_D.Dept,
MAX(LEN(SQ_U.usercode)) AS max_len_usercode
FROM
TDepts SQ_D
INNER JOIN tUserSec SQ_U ON SQ_D.Dept LIKE SQ_U.usercode & "*"
GROUP BY
SQ_D.Dept
) SQ ON SQ.Dept = D.Dept
LEFT OUTER JOIN tUserSec U ON
D.Dept LIKE U.usercode & "*" AND
LEN(U.usercode) = SQ.max_len_usercode
The query gets a list of all of the departments along with the length of the longest usercode that matches for that department. Then it uses that to determine which user matches for the "most like" the department.
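Since the answer itself was written without Access at hand, the "longest matching usercode wins" logic can at least be checked with a small sketch in Python and SQLite. SQLite uses `||` for concatenation and `%` as the LIKE wildcard where Access uses `&` and `*`, and `LENGTH` instead of `LEN`. The sample departments and descriptions are made up; the two user codes are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tUserSec (UserIDX TEXT, UserCode TEXT);
CREATE TABLE TDepts (Dept TEXT, DDescr TEXT);
INSERT INTO tUserSec VALUES ('User1', '550'), ('User2', '55003');
-- Hypothetical departments: one under User2's division, one only under
-- User1's catch-all prefix, and one with no matching approver at all.
INSERT INTO TDepts VALUES
    ('5500309', 'Dept A'), ('5501000', 'Dept B'), ('6600000', 'Dept C');
""")

approvers = conn.execute("""
SELECT D.Dept, U.UserIDX
FROM TDepts AS D
LEFT JOIN (
    -- length of the longest usercode that prefixes each department
    SELECT SQ_D.Dept, MAX(LENGTH(SQ_U.UserCode)) AS max_len_usercode
    FROM TDepts AS SQ_D
    JOIN tUserSec AS SQ_U ON SQ_D.Dept LIKE SQ_U.UserCode || '%'
    GROUP BY SQ_D.Dept
) AS SQ ON SQ.Dept = D.Dept
LEFT JOIN tUserSec AS U
    ON D.Dept LIKE U.UserCode || '%'
   AND LENGTH(U.UserCode) = SQ.max_len_usercode
ORDER BY D.Dept
""").fetchall()
print(approvers)
```

Department 5500309 resolves to User2, 5501000 falls back to User1, and 6600000 keeps a NULL approver because of the LEFT JOINs.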

Access query, grouped sum of 2 columns where either column contains values

Another team has an Access database that they use to track call logs. It's very basic, really just a table with a few lookups, and they enter data directly in the datasheet view. They've asked me to assist with writing a report to sum up their calls by week and reason and I'm a bit stumped on this problem because I'm not an Access guy by any stretch.
The database consists of two core tables, one holding the call log entries (Calls) and one holding the lookup list of call reasons (ReasonsLookup). Relevant table structures are:
Calls
-----
ID (autonumber, PK)
DateLogged (datetime)
Reason (int, FK to ReasonLookup.ID)
Reason 2 (int, FK to ReasonLookup.ID)
ReasonLookup
------------
ID (autonumber PK)
Reason (text)
What they want is a report that looks like this:
WeekNum Reason Total
------- ---------- -----
10 Eligibility Request 24
10 Extension Request 43
10 Information Question 97
11 Eligibility Request 35
11 Information Question 154
... ... etc ...
My problem is that there are TWO columns in the Calls table, because they wanted to log a primary and secondary reason for receiving the call, i.e. someone calls for reason A and while on the phone also requests something under reason B. Every call will have a primary reason column value (Calls.Reason not null) but not necessarily a secondary reason column value (Calls.[Reason 2] is often null).
What they want is, for each WeekNum, a single (distinct) entry for each possible Reason, and a Total of how many times that Reason was used in either the Calls.Reason or Calls.[Reason 2] column for that week. So in the example above for Eligibility Request, they want to see one entry for Eligibility Request for the week and count every record in Calls that for that week that has Calls.Reason = Eligibility Request OR Calls.[Reason 2] = Eligibility Request.
What is the best way to approach a query that will display as shown above? Ideally this is a straight query, no VBA required. They are non-technical so the simpler and easier to maintain the better if possible.
Thanks in advance, any help much appreciated.
The "normal" approach would be to use a union all query as a subquery to create a set of weeks and reasons, however Access doesn't support this, but what you can do that should work is to first define a query to make the union and then use that query as a source for the "main" query.
So the first query would be
SELECT datepart("ww",datelogged) as week, Reason from calls
UNION ALL
SELECT datepart("ww",datelogged), [Reason 2] from calls;
Save this as UnionQuery and make another query mainQuery:
SELECT uq.week, rl.reason, Count(*) AS Total
FROM UnionQuery AS uq
INNER JOIN reasonlookup AS rl ON uq.reason = rl.id
GROUP BY uq.week, rl.reason;
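The two-query approach above (stack both reason columns with UNION ALL, then join to the lookup and count) can be sketched in Python with SQLite as follows. This is an illustration under assumptions: `strftime('%W', ...)` stands in for Access's `DatePart("ww", ...)` and may number weeks differently, the union is written as an inline subquery instead of a saved query, and the sample calls are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ReasonLookup (ID INTEGER PRIMARY KEY, Reason TEXT);
CREATE TABLE Calls (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    DateLogged TEXT, Reason INTEGER, [Reason 2] INTEGER);
INSERT INTO ReasonLookup VALUES
    (1, 'Eligibility Request'), (2, 'Extension Request'), (3, 'Information Question');
-- Hypothetical calls; [Reason 2] is NULL when there was no secondary reason.
INSERT INTO Calls (DateLogged, Reason, [Reason 2]) VALUES
    ('2018-03-05', 1, NULL),
    ('2018-03-06', 3, 1),
    ('2018-03-07', 2, 3),
    ('2018-03-12', 3, NULL),
    ('2018-03-13', 1, 3);
""")

weekly_totals = conn.execute("""
SELECT uq.week, rl.Reason, COUNT(*) AS Total
FROM (
    -- one row per (call, reason): primary reasons first, then secondary
    SELECT CAST(strftime('%W', DateLogged) AS INTEGER) AS week, Reason FROM Calls
    UNION ALL
    SELECT CAST(strftime('%W', DateLogged) AS INTEGER), [Reason 2] FROM Calls
) AS uq
JOIN ReasonLookup AS rl ON uq.Reason = rl.ID   -- NULL secondary reasons drop out here
GROUP BY uq.week, rl.Reason
ORDER BY uq.week, rl.Reason
""").fetchall()
for row in weekly_totals:
    print(row)
```

UNION ALL matters here: it keeps every stacked row so that COUNT(*) sees each use of a reason, whether it was logged as primary or secondary.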
You can use a Union query to append individual Group By Aggregate queries for both Reason and Reason 2:
SELECT DatePart("ww", Calls.DateLogged) AS WeekNum, ReasonLookup.Reason,
Count(Calls.ID) AS [Total]
FROM Calls
INNER JOIN ReasonLookup ON Calls.Reason = ReasonLookup.ID
GROUP BY DatePart("ww", Calls.DateLogged), ReasonLookup.Reason
UNION
SELECT DatePart("ww", Calls.DateLogged) AS WeekNum, ReasonLookup.Reason,
Count(Calls.ID) AS [Total]
FROM Calls
INNER JOIN ReasonLookup ON Calls.[Reason 2] = ReasonLookup.ID
GROUP BY DatePart("ww", Calls.DateLogged), ReasonLookup.Reason;
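DatePart() outputs the specific date's week number in the calendar year. Also, UNION as opposed to UNION ALL prevents duplicate rows from appearing. Note, however, that each SELECT produces its own row per week and reason, so a reason used in both columns shows up as two rows rather than one summed total, and UNION collapses those two rows into one whenever the two counts happen to be equal. If a single total per week and reason is needed, use UNION ALL and sum the results in an outer query.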

SQL query seems to work for 'AND T1.email_address_ IN (subquery)', but returns 0 rows for 'AND T1.email_address_ NOT IN (subquery)'

Good morning. I'm working in Responsys Interact, which is an Oracle-based email campaign management type SAAS product. I'm creating a query to basically filter a target list for an email campaign designed to target a specific sub-set of our master email contact list. Here's the query I created a few weeks ago that appears to work:
/*
Table Symbolic Name
CONTACTS_LIST $A$
Engaged $B$
TRANSACTIONS_RAW $C$
TRANSACTION_LINES_RAW $D$
-- A Responsys Filter (Engaged) will return only an RIID_, nothing else, according to John @ Responsys....so,....let's join on that to contact list...
*/
SELECT
DISTINCT $A$.EMAIL_ADDRESS_,
$A$.RIID_,
$A$.FIRST_NAME,
$A$.LAST_NAME,
$A$.EMAIL_PERMISSION_STATUS_
FROM
$A$
JOIN $B$ ON $B$.RIID_ = $A$.RIID_
LEFT JOIN $C$ ON $C$.EMAIL_ADDRESS_ = $A$.EMAIL_ADDRESS_
LEFT JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID
WHERE
$A$.EMAIL_DOMAIN_ NOT IN ('none.com', 'noemail.com', 'mailinator.com', 'nomail.com') AND
/* don't include hp customers */
$A$.HP_PLAN_START_DATE IS NULL AND
$A$.EMAIL_ADDRESS_ NOT IN
(
SELECT
$C$.EMAIL_ADDRESS_
FROM
$C$
JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID
WHERE
/* Get only purchase transactions for certain item_id's/SKU's */
($D$.ITEM_FAMILY_ID IN (3,4,5,8,14,15) OR $D$.ITEM_ID IN (704,769,1893,2808,3013) ) AND
/* .... within last 60 days (i.e. 2 months) */
$A$.TRANDATE > ADD_MONTHS(CURRENT_TIMESTAMP, -2)
)
;
This seems to work, in that if I run the query without the sub-query, we get 720K rows; and if I add back the 'AND NOT IN...' subquery, we get about 700K rows, which appears correct based on what my user knows about her data. What I'm (supposedly) doing with the NOT IN subquery is filtering out any email addresses where the customer has purchased certain items from us in the last 60 days.
So, now I need to add in another constraint. We still don't want customers who made certain purchases in the last 60 days as above, but now also we want to exclude customers who have purchased another particular item, but now within the last 12 months. So, I thought I would add another subquery, as shown below. Now, this has introduced several problems:
Performance - the query, which took a couple minutes to run before, now takes quite a few more minutes to run - in fact it seems to time out....
So, I wondered if there's an issue having two subqueries, but before I went to think about alternatives to this, I decided to test my new subquery by temporarily deleting the first subquery, so that I had just one subquery similar to above, but with the new item = 11 and within the last 12 months logic. And so with this, the query finally returned after a few minutes now, but with zero rows.
Trying to figure out why, I tried simply changing the AND NOT IN (subquery) to AND IN (subquery), and that worked, in that it returned a few thousand rows, as expected.
So why would the same SQL "work" when using AND IN (subquery), but return zero rows when simply changed to AND NOT IN (subquery), instead of what I would expect, which would be my 700-something thousand rows less the couple thousand encapsulated by the subquery result?
Also, what is the best i.e. most performant way to accomplish what I'm trying to do, which is filter by some purchases made within one date range, AND by some other purchases made within a different date range?
Here's the modified version:
SELECT
DISTINCT $A$.EMAIL_ADDRESS_,
$A$.RIID_,
$A$.FIRST_NAME,
$A$.LAST_NAME,
$A$.EMAIL_PERMISSION_STATUS_
FROM
$A$
JOIN $B$ ON $B$.RIID_ = $A$.RIID_
LEFT JOIN $C$ ON $C$.EMAIL_ADDRESS_ = $A$.EMAIL_ADDRESS_
LEFT JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID
WHERE
$A$.EMAIL_DOMAIN_ NOT IN ('none.com', 'noemail.com', 'mailinator.com', 'nomail.com') AND
/* don't include hp customers */
$A$.HP_PLAN_START_DATE IS NULL AND
$A$.EMAIL_ADDRESS_ NOT IN
(
SELECT
$C$.EMAIL_ADDRESS_
FROM
$C$
JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID
WHERE
/* Get only purchase transactions for certain item_id's/SKU's */
($D$.ITEM_FAMILY_ID IN (3,4,5,8,14,15) OR $D$.ITEM_ID IN (704,769,1893,2808,3013) ) AND
/* .... within last 60 days (i.e. 2 months) */
$C$.TRANDATE > ADD_MONTHS(CURRENT_TIMESTAMP, -2)
)
AND
$A$.EMAIL_ADDRESS_ NOT IN
(
/* get purchase transactions for another type of item within last year */
SELECT
$C$.EMAIL_ADDRESS_
FROM
$C$
JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID
WHERE
$D$.ITEM_FAMILY_ID = 11 AND $C$.TRANDATE > ADD_MONTHS(CURRENT_TIMESTAMP, -12)
)
;
Thanks for any ideas/insights. I may be missing or mis-remembering some basic SQL concept here - if so please help me out! Also, Responsys Interact runs on top of Oracle - it's an Oracle product - but I don't know off hand what version/flavor. Thanks!
Looks like my problem with the new subquery was due to poor performance due to lack of indexes. Thanks to Alex Poole's comments, I looked in Responsys and there is a facility to get an 'explain' type analysis, and it was throwing warnings, and suggesting I build some indexes. Found the way to do that on the data sources, went back to the explain, and it said, "The query should run without placing an unnecessary burden on the system". And while it still ran for quite a few minutes, it did finally come back with close to the expected number of rows.
Now, I'm on to tackle the other half of the issue, which is to now incorporate this second sub-query in addition to the first, original subquery....
Ok, upon further testing/analysis and refining my stackoverflow search criteria, the answer to the main part of my question dealing with the IN vs. NOT IN can be found here: SQL "select where not in subquery" returns no results
My performance was helped by using Responsys's explain-like feature and adding some indexes, but when I did that, I also happened to add in a little extra SQL in my sub-query's WHERE clause.... when I removed that, even after the indexes were built, I was back to zero rows returned. That's because, as it turned out, at least one of the transaction rows for the item family id I was interested in for this additional sub-query had a null value for email address. And as further explained in the link above, when using NOT IN, as soon as the sub-query returns a null value, SQL can't definitively say a value is NOT IN the list, since a comparison to null yields unknown rather than true or false. The NOT IN condition therefore never evaluates to true for any row, thus zero rows. When using IN, even though there are nulls present, if you get one positive match, well, that's a match, so the condition is true for that row, and that's why you'll get rows with IN, but not with NOT IN. I hadn't realized that some of our transaction data may have null email addresses. Now I know, so I just added an IS NOT NULL condition on the email address to the sub-query's WHERE clause, and now all's good.
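The NULL behaviour described above is easy to reproduce. Here is a minimal sketch in Python and SQLite (standing in for the Oracle backend; the three-valued logic is the same in standard SQL). The table names and email addresses are hypothetical. It also shows NOT EXISTS, a common alternative that is not affected by NULLs in the subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (email TEXT);
CREATE TABLE purchases (email TEXT);
INSERT INTO contacts VALUES ('a@example.com'), ('b@example.com'), ('c@example.com');
-- One purchase row has no email address, as in the transaction data above.
INSERT INTO purchases VALUES ('b@example.com'), (NULL);
""")

def emails(sql):
    """Run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# IN still finds the positive match despite the NULL.
with_in = emails(
    "SELECT email FROM contacts WHERE email IN (SELECT email FROM purchases) ORDER BY email")

# NOT IN returns nothing: for every contact, the comparison with NULL is
# unknown, so the condition is never true.
with_not_in = emails(
    "SELECT email FROM contacts WHERE email NOT IN (SELECT email FROM purchases) ORDER BY email")

# Fix 1: filter the NULLs out of the subquery.
not_in_filtered = emails("""
    SELECT email FROM contacts
    WHERE email NOT IN (SELECT email FROM purchases WHERE email IS NOT NULL)
    ORDER BY email""")

# Fix 2: NOT EXISTS with a correlated subquery ignores NULL rows naturally.
not_exists = emails("""
    SELECT c.email FROM contacts AS c
    WHERE NOT EXISTS (SELECT 1 FROM purchases AS p WHERE p.email = c.email)
    ORDER BY c.email""")

print(with_in, with_not_in, not_in_filtered, not_exists)
```

Both fixes return the two contacts without a purchase; which one performs better depends on the engine and indexes, so it is worth checking with the explain facility mentioned earlier.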