Amazon SQL job interview question: customers who made 2+ purchases - sql

You have a simple table that has only two fields: CustomerID, DateOfPurchase. List all customers that made at least 2 purchases in any period of six months. You may assume the table has the data for the last 10 years. Also, there is no PK or unique value.
My friend already got the job, despite the fact that he couldn't answer this question. I was curious how this kind of question can be solved.
Thanks

Viewed abstractly, this problem is about self-joining a table that has no PK or unique identifier.
That is tricky because two different scenarios have to be handled:
a customer making exactly 2 purchases within 6 months, both on the same date (which can look like a duplicate record)
a customer making 2 or more purchases within 6 months on different dates (the usual case).
What is needed is a generated column that can act as a unique identifier, which row_number can provide.
With a unique identifier in place, it is easy to join on the required conditions plus unique identifier from 1st alias != unique identifier from 2nd alias. That pairs every row with every other row but never with itself, while still pairing two different rows that carry the same data, as in the 1st scenario.
Putting it all together, this can be achieved using
a common table expression, which gives a single source of data that includes the generated unique identifier and on which the required business logic is then applied
row_number, which assigns that unique identifier to each row inside the common table expression.
Refer to the query below for the technical details.
with tempPurchase as (
select *,
row_number() over (order by CustomerID) as rowNumber -- this is crucial part
from purchase
)
select distinct(tp1.CustomerID) from tempPurchase as tp1
join tempPurchase as tp2 on tp1.CustomerID = tp2.CustomerID
and tp1.DateOfPurchase >= tp2.DateOfPurchase
and tp1.DateOfPurchase <= DATEADD(month, 6, tp2.DateOfPurchase)
and tp1.rowNumber != tp2.rowNumber; -- this is crucial part
Refer db fiddle here for complete working solution.
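For anyone who wants to try the idea without a SQL Server instance, here is a minimal sketch that runs the same query shape in SQLite through Python's sqlite3 module. The sample rows are made up for illustration, and date(x, '+6 months') is SQLite's stand-in for DATEADD(month, 6, x); the two can differ at month ends, so treat this as an approximation of the query above rather than an exact port. It also assumes a SQLite build with window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchase (CustomerID INTEGER, DateOfPurchase TEXT)")
# Hypothetical sample data, ISO dates stored as text.
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?)",
    [
        (1, "2020-01-10"), (1, "2020-01-10"),  # two purchases on the same day
        (2, "2020-01-10"), (2, "2020-05-01"),  # two purchases within six months
        (3, "2020-01-10"), (3, "2021-01-10"),  # two purchases a year apart
        (4, "2020-03-03"),                     # a single purchase
    ],
)

query = """
WITH tempPurchase AS (
    SELECT CustomerID, DateOfPurchase,
           row_number() OVER (ORDER BY CustomerID) AS rowNumber
    FROM purchase
)
SELECT DISTINCT tp1.CustomerID
FROM tempPurchase AS tp1
JOIN tempPurchase AS tp2
  ON tp1.CustomerID = tp2.CustomerID
 AND tp1.DateOfPurchase >= tp2.DateOfPurchase
 AND tp1.DateOfPurchase <= date(tp2.DateOfPurchase, '+6 months')
 AND tp1.rowNumber != tp2.rowNumber   -- never pair a row with itself
ORDER BY tp1.CustomerID
"""
qualifying = [row[0] for row in conn.execute(query)]
print(qualifying)
```

Customers 1 and 2 qualify; customer 1 only qualifies because the generated row number tells the two same-day rows apart.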

We can try using exists logic here to detect records for the same customer occurring within 6 months. Then, find distinct customers, which implies that any such matching customer has at least two purchases.
SELECT DISTINCT CustomerID
FROM yourTable t1
WHERE EXISTS (SELECT 1 FROM yourTable t2
WHERE t2.CustomerID = t1.CustomerID AND
t2.DateOfPurchase > t1.DateOfPurchase AND
t2.DateOfPurchase <= DATEADD(month, 6, t1.DateOfPurchase));
Note that this answer assumes that there would only be at most one distinct purchase per day by a given customer. A better approach would be:
SELECT DISTINCT CustomerID
FROM yourTable t1
WHERE EXISTS (SELECT 1 FROM yourTable t2
WHERE t2.CustomerID = t1.CustomerID AND
t2.PK <> t1.PK AND
t2.DateOfPurchase >= t1.DateOfPurchase AND
t2.DateOfPurchase <= DATEADD(month, 6, t1.DateOfPurchase));
The above query reads: for each customer, find any pair of records that lie within 6 months of each other and are distinct purchases. This assumes that the table has a primary key column, here called PK. Ideally, every table should have some kind of logical primary key.
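To see the difference between the two queries, here is a small sketch in SQLite via Python's sqlite3. Since the table in the question has no PK, the sketch borrows SQLite's implicit rowid as the distinguishing key; that is a SQLite-specific assumption, not something SQL Server offers. The data is made up, and date(x, '+6 months') stands in for DATEADD.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (CustomerID INTEGER, DateOfPurchase TEXT)")
# Hypothetical sample data.
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?)",
    [
        (1, "2021-03-01"), (1, "2021-03-01"),  # same-day purchases
        (2, "2021-03-01"), (2, "2022-03-01"),  # a year apart
        (3, "2021-03-01"), (3, "2021-08-15"),  # within six months
    ],
)

# First version: strictly later date, so same-day pairs are missed.
strict_sql = """
SELECT DISTINCT CustomerID FROM yourTable t1
WHERE EXISTS (SELECT 1 FROM yourTable t2
              WHERE t2.CustomerID = t1.CustomerID
                AND t2.DateOfPurchase > t1.DateOfPurchase
                AND t2.DateOfPurchase <= date(t1.DateOfPurchase, '+6 months'))
ORDER BY CustomerID
"""

# Second version: a key comparison (rowid here, in place of PK) plus >=.
keyed_sql = """
SELECT DISTINCT CustomerID FROM yourTable t1
WHERE EXISTS (SELECT 1 FROM yourTable t2
              WHERE t2.CustomerID = t1.CustomerID
                AND t2.rowid <> t1.rowid
                AND t2.DateOfPurchase >= t1.DateOfPurchase
                AND t2.DateOfPurchase <= date(t1.DateOfPurchase, '+6 months'))
ORDER BY CustomerID
"""

strict_result = [r[0] for r in conn.execute(strict_sql)]
keyed_result = [r[0] for r in conn.execute(keyed_sql)]
print(strict_result, keyed_result)
```

The strict version finds only customer 3, while the keyed version also finds customer 1.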

Try this:
SELECT distinct CustomerID
FROM purchase t1
WHERE 1 < (SELECT count(1) FROM purchase t2
WHERE t2.CustomerID = t1.CustomerID AND
t2.DateOfPurchase >= t1.DateOfPurchase AND
t2.DateOfPurchase <= DATEADD(month, 6, t1.DateOfPurchase))
The idea is to pick one record from the outer table t1 and count, in the inner table t2, the purchases that customer made within 6 months of it, including the record picked from the outer table. If the count from the subquery is greater than 1, the customer is eligible.
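As a cross-check outside the database, the same rule can be sketched in plain Python. This is a different technique from the correlated subquery: it sorts each customer's dates and compares neighbours only, which is enough because if any two purchases fall within six months of each other, two adjacent ones do too. The helper add_months and the sample data are assumptions for illustration; the helper clamps to the last day of the month, which is how DATEADD(month, ...) behaves in SQL Server.

```python
import calendar
from collections import defaultdict
from datetime import date


def add_months(d, months):
    """Add calendar months, clamping the day to the end of the target month."""
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def customers_with_repeat(purchases, months=6):
    """Return sorted customer ids with 2+ purchases inside any window of `months`."""
    by_customer = defaultdict(list)
    for customer_id, purchased_on in purchases:
        by_customer[customer_id].append(purchased_on)

    result = []
    for customer_id, dates in by_customer.items():
        dates.sort()
        # Only adjacent purchases need checking once the dates are sorted.
        if any(later <= add_months(earlier, months)
               for earlier, later in zip(dates, dates[1:])):
            result.append(customer_id)
    return sorted(result)


# Hypothetical sample data.
purchases = [
    (1, date(2021, 3, 1)), (1, date(2021, 3, 1)),   # same day
    (2, date(2021, 3, 1)), (2, date(2022, 3, 1)),   # a year apart
    (3, date(2021, 3, 1)), (3, date(2021, 8, 15)),  # within six months
]
repeat_customers = customers_with_repeat(purchases)
print(repeat_customers)
```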

Related

SQL Aggregating data based on condition containing the key fields for aggregation

I am new to SQL (Oracle SQL if it makes a difference) but it so happens I have to use it. I need to aggregate data by some key fields (CustId, AppId). I also have some AppDate, PDate and Amount.
Initial data
What I need to do is aggregate but for each key field combination I need to aggregate the data from other rows with the following conditions:
CustID = CustID aka take only information for this custID
AppId != AppId aka take only information for application different than the current one.
AppDate >= PDate aka take only information available at time of application
From a quick look at SQL language my approach was the use of:
select CustId, AppId, Sum(case when
custid=custid and Appid!=Appid and AppDate >= PDate then Amount else 0 end) as SumAmount
From Table
Group by CustId, AppId
Unfortunately, the results I get are all 0 for SumAmount. My guess is that it is because of the last 2 conditions.
The results I want to get from the example table are: Results
Also, I would probably add a condition that excludes rows from the aggregated amounts when AppDate minus the AppDate of the other AppId is more than 6 months.
P.S. I am really sorry for the substandard formatting and probably bad code. I am not really experienced on how to do it.
Edit: I've found a solution as follows:
select distinct a.CustId, a.AppId, a.AppDate, b.PDate, b.Amount
from table a
inner join (select CustId, AppId, Amount, PDate from Table) b
on a.CustId = b.CustId and a.AppId != b.AppId
where a.AppDate >= b.PDate
After that I aggregate by AppId summing the amount.
Basically, I just append the same information based on a condition and since I get a lot of full duplicates I deduplicate with distinct.
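A note on why the original CASE version returns 0 everywhere: inside a single-table query every condition compares a row with itself, so Appid != Appid can never be true. A self-join gives the second copy of the table that the comparison needs. Below is a minimal sketch of a variant of the solution above, run in SQLite through Python's sqlite3; it joins and sums in one step, and uses a LEFT JOIN so applications with nothing to aggregate still appear with 0. The table name apps and the rows are hypothetical, and it assumes one row per AppId (with repeated rows the sums would be inflated).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE apps (CustId INTEGER, AppId TEXT, AppDate TEXT, "
    "PDate TEXT, Amount INTEGER)"
)
# Hypothetical sample data: one row per application.
conn.executemany(
    "INSERT INTO apps VALUES (?, ?, ?, ?, ?)",
    [
        (1, "A", "2020-01-01", "2020-01-05", 100),
        (1, "B", "2020-03-01", "2020-03-05", 200),
        (1, "C", "2020-06-01", "2020-06-05", 300),
        (2, "D", "2020-02-01", "2020-02-05", 50),
    ],
)

query = """
SELECT a.CustId, a.AppId, COALESCE(SUM(b.Amount), 0) AS SumAmount
FROM apps a
LEFT JOIN apps b
  ON b.CustId = a.CustId        -- same customer
 AND b.AppId <> a.AppId         -- a different application
 AND a.AppDate >= b.PDate       -- known at the time of application
GROUP BY a.CustId, a.AppId
ORDER BY a.CustId, a.AppId
"""
sums = conn.execute(query).fetchall()
print(sums)
```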

SQL - Removing result of one table based on the tree structure of another

To better define the question I'm asking, it'd probably be easier to introduce you to the data I'm working with.
Essentially I have two tables joined that kind of look like this:
Table 1
Product ID AccountLinkID (FK)
PRODUCT00001 AC000001
PRODUCT00001 AC000002
PRODUCT00001 AC000003
PRODUCT00001 AC000004
Table 2
Link (FK) AccountType
AC000001 1
AC000002 2
AC000003 3
AC000004 4
As part of some data I'm looking at, I want to make sure that if any ProductID is linked to an account of type '4', that ProductID is removed from the search.
The problem is that the foreign key isn't always a single value, as one product can be linked to multiple account types (for example, one product could be linked to a seller's account, a buyer's account, a customer account, etc.).
So in this instance, account type 4 is something like a 'dummy' account, and any ProductIDs linked to it aren't IDs I want included in the search.
I can't think of how to use the account type as a means to remove the ProductID.
Thank you in advance for any advice.
If you want just one row per productid, you can join, aggregate, and filter out with a having clause:
select t1.productid
from table1 t1
inner join table2 t2 on t2.link = t1.accountlinkid
group by t1.productid
having max(case when t2.accounttype = 4 then 1 else 0 end) = 0
If, on the other hand, you want entire rows from t1, then window functions are a better option:
select t.*
from (
select t1.*,
max(case when t2.accounttype = 4 then 1 else 0 end) over(partition by t1.productid) has_type4
from table1 t1
inner join table2 t2 on t2.link = t1.accountlinkid
) t
where has_type4 = 0
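Here is a small sketch of the having-clause version, run in SQLite through Python's sqlite3. The first product uses the rows from the question; PRODUCT00002 is a made-up product that is not linked to account type 4, added so there is something left to return.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (productid TEXT, accountlinkid TEXT)")
conn.execute("CREATE TABLE table2 (link TEXT, accounttype INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [
        ("PRODUCT00001", "AC000001"),
        ("PRODUCT00001", "AC000002"),
        ("PRODUCT00001", "AC000003"),
        ("PRODUCT00001", "AC000004"),
        ("PRODUCT00002", "AC000001"),  # hypothetical extra product
        ("PRODUCT00002", "AC000002"),
    ],
)
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?)",
    [("AC000001", 1), ("AC000002", 2), ("AC000003", 3), ("AC000004", 4)],
)

query = """
SELECT t1.productid
FROM table1 t1
INNER JOIN table2 t2 ON t2.link = t1.accountlinkid
GROUP BY t1.productid
-- the flag is 1 as soon as any linked account has type 4
HAVING MAX(CASE WHEN t2.accounttype = 4 THEN 1 ELSE 0 END) = 0
ORDER BY t1.productid
"""
kept_products = [row[0] for row in conn.execute(query)]
print(kept_products)
```

PRODUCT00001 is dropped because one of its four links points at a type 4 account.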

Get Same data in different row in to a same row

I have a table with columns
Account number (number)
Account Status(Number)
Datetime(text)
Some accounts are repeated with the same timestamp.
For an account that repeats with the same timestamp, I need to bring the account status from the repeated row into the same row as a new column (new Account Status).
Account_Number Account Status Timestamp
7856277 5 9155070519
4527882 5 1045225522
7856277 1 9155070519
I want to get:
Account_Number Account Status Timestamp new account Status
7856277 5 9155070519 1
You will want to join the table with itself, using the account number as the linked column. Consider the query below (where o is the older record and n is the newer one)
SELECT o.account_number, o.account_status, n.timestamp, n.account_status "new account status"
FROM table o join table n on (o.account_number = n.account_number)
WHERE n.timestamp>o.timestamp
SQL tables represent unordered sets. So, there is no "new" or "old" unless a column specifies the ordering. You don't seem to have such a column.
But you can use aggregation to get the results on a single row:
select Account_Number, Timestamp,
min(AccountStatus) as status_min,
max(AccountStatus) as status_max
from t
group by Account_Number, Timestamp
having count(*) > 1;
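A quick sketch of the aggregation query against the three rows from the question, using SQLite through Python's sqlite3. The table name t and the column name AccountStatus follow the query above; the timestamp is stored as text, as the question describes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (Account_Number INTEGER, AccountStatus INTEGER, Timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (7856277, 5, "9155070519"),
        (4527882, 5, "1045225522"),
        (7856277, 1, "9155070519"),
    ],
)

query = """
SELECT Account_Number, Timestamp,
       MIN(AccountStatus) AS status_min,
       MAX(AccountStatus) AS status_max
FROM t
GROUP BY Account_Number, Timestamp
HAVING COUNT(*) > 1   -- keep only account/timestamp pairs that repeat
"""
repeated = conn.execute(query).fetchall()
print(repeated)
```

Only account 7856277 comes back, with both of its statuses (1 and 5) on one row.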
You need a self join:
SELECT
t.account_number, t.account_status, t.timestamp,
tt.account_status new_account_status
FROM tablename t INNER JOIN tablename tt
ON
tt.account_number = t.account_number AND
tt.timestamp = t.timestamp AND
tt.account_status < t.account_status

Extra records when used as a sub query: Access

I'm a rookie developer with basic SQL experience and this problem has been 'doing my head in' for the last couple of days. I've gone to ask a question here a couple times and thought... not yet... keep trying.
I have a table:
ID
Store
Product_Type
Delivery_Window
Despatch_Time
Despatch_Type
Pallets
Cartons
Many other columns (start_week and day_num are two of them)
My goal is to get a list of stores by product_type with the minimum despatch_time, along with all the other column information.
I've tested the base query.
SELECT Product_Type, Store, Min(Despatch_Time) as MinDes
FROM table
GROUP BY Store, Product_Type
Works well, I get 200 rows as expected.
Now I want those 200 rows to have the other related record information : Delivery_Window, start_week, etc
I've tried the following.
SELECT * FROM Table WHERE EXISTS
(SELECT Product_Type, Store, Min(Despatch_Time) as MinDes
FROM table
GROUP BY Store, Product_Type)
I've tried inner and right joins; all of them returned more records than the 200 from my original query.
I inspected the additional records and it is where there is the same despatch time for a store and product type but for a different despatch type.
So I need a hand in creating a query where I limit it by the initial sub query but even if there is matching minimum despatch times it will still limit the count to one record by store and product type.
Current Query is:
SELECT *
FROM table AS A INNER JOIN
(Select Min(Despatch_Time) as MinDue, store, product_type
FROM table
WHERE day_num = [Forms]![FRM_SomeForm]![combo_del_day] AND start_week =[Forms]![FRM_SomeForm]![txt_date1]
GROUP BY store, product_type) AS B
ON (A.product_type = B.product_type) AND (A.store = B.store) AND (A.Despatch_Time = B.MinDue);
I think you want:
SELECT t.*
FROM table as t
WHERE t.Despatch_Time = (SELECT MIN(t2.Despatch_Time)
FROM table as t2
WHERE t2.Store = t.Store AND t2.Product_Type = t.Product_Type);
The above will return duplicates. In order to avoid duplicates, you need a key to provide uniqueness. Let me assume you have a primary key pk:
SELECT t.*
FROM table as t
WHERE t.pk = (SELECT TOP (1) t2.pk
FROM table as t2
WHERE t2.Store = t.Store AND t2.Product_Type = t.Product_Type
ORDER BY t2.Despatch_Time, t2.pk
);
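To illustrate the tie problem and the fix, here is a minimal sketch in SQLite through Python's sqlite3. SQLite has no TOP, so LIMIT 1 plays that role here. The table name deliveries, the pk column (assumed, as in the answer) and the rows are all hypothetical; store S1 has two rows tied on the earliest time with different despatch types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deliveries (pk INTEGER PRIMARY KEY, Store TEXT, "
    "Product_Type TEXT, Despatch_Time TEXT, Despatch_Type TEXT)"
)
# Hypothetical sample data.
conn.executemany(
    "INSERT INTO deliveries VALUES (?, ?, ?, ?, ?)",
    [
        (1, "S1", "Chilled", "06:00", "Road"),
        (2, "S1", "Chilled", "06:00", "Rail"),  # ties with pk 1
        (3, "S1", "Chilled", "09:00", "Road"),
        (4, "S2", "Chilled", "07:30", "Road"),
    ],
)

# Matching on the minimum time returns every row that ties for it.
min_sql = """
SELECT t.pk FROM deliveries AS t
WHERE t.Despatch_Time = (SELECT MIN(t2.Despatch_Time)
                         FROM deliveries AS t2
                         WHERE t2.Store = t.Store
                           AND t2.Product_Type = t.Product_Type)
ORDER BY t.pk
"""

# Matching on the key of the first row per group returns exactly one row.
first_sql = """
SELECT t.pk FROM deliveries AS t
WHERE t.pk = (SELECT t2.pk
              FROM deliveries AS t2
              WHERE t2.Store = t.Store
                AND t2.Product_Type = t.Product_Type
              ORDER BY t2.Despatch_Time, t2.pk   -- pk breaks the tie
              LIMIT 1)
ORDER BY t.pk
"""

with_ties = [r[0] for r in conn.execute(min_sql)]
one_per_group = [r[0] for r in conn.execute(first_sql)]
print(with_ties, one_per_group)
```

The first query returns pk 1, 2 and 4; the second returns only pk 1 and 4, one row per store and product type.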

How to delete multiple rows with 2 columns as composite primary key in ms sql server

I have a table named ProductLog (RequestID, Date, Product, Price). Here RequestID and Product form a composite primary key.
There is a cleaning tool which cleans up product data after some days. Previously the cleaning was straightforward: if Date was more than 15 days old, delete the row. Now a new situation has come up: suppose there is a product whose latest log entry is 18 days old. If we delete data older than 15 days, that product's data disappears completely. Data for every product should remain because it is used for monitoring purposes. So the deletion requirement has changed:
Choose a product.
If the product has a record within the last 15 days, keep the last 15 days of data and delete the rest.
If the product has no record within the last 15 days, don't delete its data.
Now I am trying to use a query like
select RequestID, Product from ProductLog where Date < '201505161505'
EXCEPT
select RequestID, Product from ProductLog where Product not in (
select distinct Product from ProductLog where Date > '201505161505'
)
I am able to select the data which should be deleted. Now I have to delete it. As RequestID and Product form a composite primary key, I can't use IN for the deletion. Does anyone have an idea how I can achieve this?
If you are just looking to write a delete statement based on your query above, I would use an EXISTS query to handle both columns (RequestID and Product) at the same time.
DELETE
FROM ProductLog
WHERE EXISTS (
SELECT *
FROM (
SELECT RequestID
,Product
FROM ProductLog
WHERE DATE < '201505161505'
EXCEPT
SELECT RequestID
,Product
FROM ProductLog p
WHERE NOT EXISTS (
SELECT *
FROM ProductLog p1
WHERE p.product = p1.product
AND DATE > '201505161505'
)
) t
WHERE t.RequestID = ProductLog.RequestID
AND t.Product = ProductLog.Product
);
Also, I would use NOT EXISTS instead of NOT IN in your subquery.
The most important thing to note about NOT EXISTS and NOT IN is that,
unlike EXISTS and IN, they are not equivalent in all cases.
Specifically, when NULLs are involved they will return different
results. To be totally specific, when the subquery returns even one
null, NOT IN will not match any rows.
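The NULL behaviour described above is easy to reproduce. Here is a minimal sketch in SQLite through Python's sqlite3, with two throwaway tables a and b invented for the demonstration; the same three-valued logic applies in SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INTEGER)")
conn.execute("CREATE TABLE b (x INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO b VALUES (?)", [(1,), (None,)])  # one NULL in b

# x NOT IN (1, NULL) is never true: it is false for 1 and unknown otherwise.
not_in_rows = [
    r[0] for r in conn.execute(
        "SELECT x FROM a WHERE x NOT IN (SELECT x FROM b) ORDER BY x"
    )
]

# NOT EXISTS only asks whether a matching row exists, so the NULL is harmless.
not_exists_rows = [
    r[0] for r in conn.execute(
        "SELECT x FROM a WHERE NOT EXISTS "
        "(SELECT 1 FROM b WHERE b.x = a.x) ORDER BY x"
    )
]
print(not_in_rows, not_exists_rows)
```

NOT IN returns no rows at all, while NOT EXISTS returns 2 and 3.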
Add the number of days to keep to the product table. Then:
delete from ProductLog
from ProductLog pl
inner join Product p on p.ProductId = pl.ProductId
and pl.Date < DateAdd(day, - p.NumberOfDaysToKeep, getdate());