Count the number of repeating pairs of information - sql

I have a table with customer_ID, date, and payment_method as 3 columns. payment_method can be 'cash', 'credit', or 'others'. I want to find out the number of customers who have used credit as a payment method more than 5 times, in the last 6 months.
I found this solution for displaying the rows where the customer used credit:
SELECT customer_ID, payment_method, COUNT(*) AS unique_pair_repeats
FROM tab1
WHERE customer_ID IS NOT NULL
GROUP BY customer_ID, payment_method
HAVING count(*) > 1;
The problem is, I don't want a list of the names/ids, I want to know how many people used their credit card for a purchase 5 times or more in the last 6 months.

This is one way you could do it:
SELECT COUNT(*)
FROM
(
SELECT customer_id
FROM tab1
WHERE
customer_ID IS NOT NULL and
payment_method = 'credit' and
tran_date > add_months(sysdate, -6)
GROUP BY customer_ID
HAVING count(*) > 5
) x
The inner query generates a list of all customer ids that have used credit more than 5 times in 6 months. The outer query counts them
You might feel it more logical to write it like this:
SELECT COUNT(*)
FROM
(
SELECT customer_id, count(*) as ctr
FROM tab1
WHERE
customer_ID IS NOT NULL and
payment_method = 'credit' and
tran_date > add_months(sysdate, -6)
GROUP BY customer_ID
) x
WHERE x.ctr > 5

So, remove customer_ID, payment_method, from select.
Though, that still doesn't answer "at least 5 times in last 6 months", so you need another condition: date (presuming you use Oracle, although you didn't tag the question but - you do use Oracle SQL Developer):
and date_column >= add_months(trunc(sysdate), -6)
Finally, something like this might help:
SELECT COUNT(*) AS unique_pair_repeats --> changes here
FROM tab1
WHERE customer_ID IS NOT NULL
and date_column >= add_months(trunc(sysdate), -6) --> here
GROUP BY customer_ID, payment_method
HAVING count(*) >= 5; --> here

Related

Retrieve Customers with a Monthly Order Frequency greater than 4

I am trying to optimize the below query to help fetch all customers in the last three months who have a monthly order frequency +4 for the past three months.
Customer ID
Feb
Mar
Apr
0001
4
5
6
0002
3
2
4
0003
4
2
3
In the above table, the customer with Customer ID 0001 should only be picked, as he consistently has 4 or more orders in a month.
Below is a query I have written, which pulls all customers with an average purchase frequency of 4 in the last 90 days, but not considering there is a consistent purchase of 4 or more last three months.
Query:
SELECT distinct lines.customer_id Customer_ID, (COUNT(lines.order_id)/90) PurchaseFrequency
from fct_customer_order_lines lines
LEFT JOIN product_table product
ON lines.entity_id= product.entity_id
AND lines.vendor_id= product.vendor_id
WHERE LOWER(product.country_code)= "IN"
AND lines.date >= DATE_SUB(CURRENT_DATE() , INTERVAL 90 DAY )
AND lines.date < CURRENT_DATE()
GROUP BY Customer_ID
HAVING PurchaseFrequency >=4;
I tried to use window functions, however not sure if it needs to be used in this case.
I would sum the orders per month instead of computing the avg and then retrieve those who have that sum greater than 4 in the last three months.
Also I think you should select your interval using "month(CURRENT_DATE()) - 3" instead of using a window of 90 days. Of course if needed you should handle the case of when current_date is jan-feb-mar and in that case go back to oct-nov-dec of the previous year.
I'm not familiar with Google BigQuery so I can't write your query but I hope this helps.
So I've found the solution to this using WITH operator as below:
WITH filtered_orders AS (
select
distinct customer_id ID,
extract(MONTH from date) Order_Month,
count(order_id) CountofOrders
from customer_order_lines` lines
where EXTRACT(YEAR FROM date) = 2022 AND EXTRACT(MONTH FROM date) IN (2,3,4)
group by ID, Order_Month
having CountofOrders>=4)
select distinct ID
from filtered_orders
group by ID
having count(Order_Month) =3;
Hope this helps!
An option could be first count the orders by month and then filter users which have purchases on all months above your threshold:
WITH ORDERS_BY_MONTH AS (
SELECT
DATE_TRUNC(lines.date, MONTH) PurchaseMonth,
lines.customer_id Customer_ID,
COUNT(lines.order_id) PurchaseFrequency
FROM fct_customer_order_lines lines
LEFT JOIN product_table product
ON lines.entity_id= product.entity_id
AND lines.vendor_id= product.vendor_id
WHERE LOWER(product.country_code)= "IN"
AND lines.date >= DATE_SUB(CURRENT_DATE() , INTERVAL 90 DAY )
AND lines.date < CURRENT_DATE()
GROUP BY PurchaseMonth, Customer_ID
)
SELECT
Customer_ID,
AVG(PurchaseFrequency) AvgPurchaseFrequency
FROM ORDERS_BY_MONTH
GROUP BY Customer_ID
HAVING COUNT(1) = COUNTIF(PurchaseFrequency >= 4)

Find Customers Who Shop at Multiple Stores

I need a query that will give me a count of customers who have shopped at multiple store locations within the last 3 years.
I have formulated the following query, but it's not what I need to know:
SELECT STORE_ID, CUSTOMER_ID, COUNT(DISTINCT CUSTOMER_ID) as SERVICE_COUNT
From SALES INNER JOIN
STORE_DETAILS
ON trim(STORE_ID) = trim(STORE_ID)
WHERE (CURRENT_DATE - cast(SALE_DATE AS DATE format 'mm/dd/yyyy')) < 1095
ORDER BY 1,2
Group by 1,2
HAVING COUNT(DISTINCT SALE_DATE) > 1
If you want customers at multiple stores, then something like:
SELECT CUSTOMER_ID
FROM SALES INNER JOIN
STORE_DETAILS
ON trim(STORE_ID) = trim(STORE_ID)
WHERE (CURRENT_DATE - cast(SALE_DATE AS DATE format 'mm/dd/yyyy')) < 1095
GROUP BY 1
HAVING COUNT(DISTINCT STORE_ID) > 1;
I don't understand your date expression, but presumably you know what it is supposed to be doing.
Optimized version of Gordon's query based on your comments:
Often, the store_id has trailing spaces not allowing a true match
Comparing strings ignores trailing spaces. As long as there are no leading spaces (which is a worst case and should be fixed during load) you don't have to TRIM (it's quite bad for performance).
The datatype for SALE_DATE is DATE
If it's a date there's no need for a CAST. Additionally the within three years logic can be simplified to avoid date calculattion on every row.
SELECT CUSTOMER_ID, COUNT(DISTINCT CUSTOMER_ID) as SERVICE_COUNT
FROM SALES
JOIN STORE_DETAILS
ON STORE_ID = STORE_ID
WHERE SALE_DATE >= ADD_MONTHS(CURRENT_DATE, -12*3)
GROUP BY 1
HAVING SERVICE_COUNT > 1
;

Find rows with similar date values

I want to find customers where for example, system by error registered duplicates of an order.
It's pretty easy, if reg_date is EXACTLY the same but I have no idea how to implement it in query to count as duplicate if for example there was up to 1 second difference between transactions.
select * from
(select customer_id, reg_date, count(*) as cnt
from orders
group by 1,2
) x where cnt > 1
Here is example dataset:
https://www.db-fiddle.com/f/m6PhgReSQbVWVZhqe8n4mi/0
CUrrently only customer's 104 orders are counted as duplicates because its reg_date is identical, I want to count also orders 1,2 and 4,5 as there's just 1 second difference
demo:db<>fiddle
SELECT
customer_id,
reg_date
FROM (
SELECT
*,
reg_date - lag(reg_date) OVER (PARTITION BY customer_id ORDER BY reg_date) <= interval '1 second' as is_duplicate
FROM
orders
) s
WHERE is_duplicate
Use the lag() window function. It allows to have a look hat the previous record. With this value you can do a diff and filter the records where the diff time is more than one second.
Try this following script. This will return you day/customer wise duplicates.
SELECT
TO_CHAR(reg_date :: DATE, 'dd/mm/yyyy') reg_date,
customer_id,
count(*) as cnt
FROM orders
GROUP BY
TO_CHAR(reg_date :: DATE, 'dd/mm/yyyy'),
customer_id
HAVING count(*) >1

Same output in two different lateral joins

I'm working on a bit of PostgreSQL to grab the first 10 and last 10 invoices of every month between certain dates. I am having unexpected output in the lateral joins. Firstly the limit is not working, and each of the array_agg aggregates is returning hundreds of rows instead of limiting to 10. Secondly, the aggregates appear to be the same, even though one is ordered ASC and the other DESC.
How can I retrieve only the first 10 and last 10 invoices of each month group?
SELECT first.invoice_month,
array_agg(first.id) first_ten,
array_agg(last.id) last_ten
FROM public.invoice i
JOIN LATERAL (
SELECT id, to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE id = i.id
ORDER BY invoice_date, id ASC
LIMIT 10
) first ON i.id = first.id
JOIN LATERAL (
SELECT id, to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE id = i.id
ORDER BY invoice_date, id DESC
LIMIT 10
) last on i.id = last.id
WHERE i.invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
GROUP BY first.invoice_month, last.invoice_month;
This can be done with a recursive query that will generate the interval of months for who we need to find the first and last 10 invoices.
WITH RECURSIVE all_months AS (
SELECT date_trunc('month','2018-01-01'::TIMESTAMP) as c_date, date_trunc('month', '2018-05-11'::TIMESTAMP) as end_date, to_char('2018-01-01'::timestamp, 'YYYY-MM') as current_month
UNION
SELECT c_date + interval '1 month' as c_date,
end_date,
to_char(c_date + INTERVAL '1 month', 'YYYY-MM') as current_month
FROM all_months
WHERE c_date + INTERVAL '1 month' <= end_date
),
invocies_with_month as (
SELECT *, to_char(invoice_date::TIMESTAMP, 'YYYY-MM') invoice_month FROM invoice
)
SELECT current_month, array_agg(first_10.id), 'FIRST 10' as type FROM all_months
JOIN LATERAL (
SELECT * FROM invocies_with_month
WHERE all_months.current_month = invoice_month AND invoice_date >= '2018-01-01' AND invoice_date <= '2018-05-11'
ORDER BY invoice_date ASC limit 10
) first_10 ON TRUE
GROUP BY current_month
UNION
SELECT current_month, array_agg(last_10.id), 'LAST 10' as type FROM all_months
JOIN LATERAL (
SELECT * FROM invocies_with_month
WHERE all_months.current_month = invoice_month AND invoice_date >= '2018-01-01' AND invoice_date <= '2018-05-11'
ORDER BY invoice_date DESC limit 10
) last_10 ON TRUE
GROUP BY current_month;
In the code above, '2018-01-01' and '2018-05-11' represent the dates between we want to find the invoices. Based on those dates, we generate the months (2018-01, 2018-02, 2018-03, 2018-04, 2018-05) that we need to find the invoices for.
We store this data in all_months.
After we get the months, we do a lateral join in order to join the invoices for every month. We need 2 lateral joins in order to get the first and last 10 invoices.
Finally, the result is represented as:
current_month - the month
array_agg - ids of all selected invoices for that month
type - type of the selected invoices ('first 10' or 'last 10').
So in the current implementation, you will have 2 rows for each month (if there is at least 1 invoice for that month). You can easily join that in one row if you need to.
LIMIT is working fine. It's your query that's broken. JOIN is just 100% the wrong tool here; it doesn't even do anything close to what you need. By joining up to 10 rows with up to another 10 rows, you get up to 100 rows back. There's also no reason to self join just to combine filters.
Consider instead window queries. In particular, we have the dense_rank function, which can number every row in the result set according to groups:
SELECT
invoice_month,
time_of_month,
ARRAY_AGG(id) invoice_ids
FROM (
SELECT
id,
invoice_month,
-- Categorize as end or beginning of month
CASE
WHEN month_rank <= 10 THEN 'beginning'
WHEN month_reverse_rank <= 10 THEN 'end'
ELSE 'bug' -- Should never happen. Just a fall back in case of a bug.
END AS time_of_month
FROM (
SELECT
id,
invoice_month,
dense_rank() OVER (PARTITION BY invoice_month ORDER BY invoice_date) month_rank,
dense_rank() OVER (PARTITION BY invoice_month ORDER BY invoice_date DESC) month_rank_reverse
FROM (
SELECT
id,
invoice_date,
to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
) AS fiscal_year_invoices
) ranked_invoices
-- Get first and last 10
WHERE month_rank <= 10 OR month_reverse_rank <= 10
) first_and_last_by_month
GROUP BY
invoice_month,
time_of_month
Don't be intimidated by the length. This query is actually very straightforward; it just needed a few subqueries.
This is what it does logically:
Fetch the rows for the fiscal year in question
Assign a "rank" to the row within its month, both counting from the beginning and from the end
Filter out everything that doesn't rank in the 10 top for its month (counting from either direction)
Adds an indicator as to whether it was at the beginning or end of the month. (Note that if there's less than 20 rows in a month, it will categorize more of them as "beginning".)
Aggregate the IDs together
This is the tool set designed for the job you're trying to do. If really needed, you can adjust this approach slightly to get them into the same row, but you have to aggregate before joining the results together and then join on the month; you can't join and then aggregate.

SQL Count Query Using Non-Index Column

I have a query similar to this, where I need to find the number of transactions a specific customer had within a time frame:
select customer_id, count(transactions)
from transactions
where customer_id = 'FKJ90838485'
and purchase_date between '01-JAN-13' and '31-AUG-13'
group by customer_id
The table transactions is not indexed on customer_id but rather another field called transaction_id. Customer_ID is character type while transaction_id is numeric.
'accounting_month' field is also indexed.. this field just stores the month that transactions occured... ie, purchase_date = '03-MAR-13' would have accounting_month = '01-MAR-13'
The transactions table has about 20 million records in the time frame from '01-JAN-13' and '31-AUG-13'
When I run the above query, it has taken more than 40 minutes to come back, any ideas or tips?
As others have already commented, the best is to add an index that will cover the query, So:
Contact the Database administrator and request that they add an index on (customer_id, purchase_date) because the query is doing a table scan otherwise.
Sidenotes:
Use date and not string literals (you may know that and do it already, still noted here for future readers)
You don't have to put the customer_id in the SELECT list and if you remove it from there, it can be removed from the GROUP BY as well so the query becomes:
select count(*) as number_of_transactions
from transactions
where customer_id = 'FKJ90838485'
and purchase_date between DATE '2013-01-01' and DATE '2013-08-31' ;
If you don't have a WHERE condition on customer_id, you can have it in the GROUP BY and the SELECT list to write a query that will count number of transactions for every customer. And the above suggested index will help this, too:
select customer_id, count(*) as number_of_transactions
from transactions
where purchase_date between DATE '2013-01-01' and DATE '2013-08-31'
group by customer_id ;
This is just an idea that came up to me. It might work, try running it and see if it is an improvement over what you currently have.
I'm trying to use the transaction_id, which you've said is indexed, as much as possible.
WITH min_transaction (tran_id)
AS (
SELECT MIN(transaction_ID)
FROM TRANSACTIONS
WHERE
CUSTOMER_ID = 'FKJ90838485'
AND purchase_date >= '01-JAN-13'
), max_transaction (tran_id)
AS (
SELECT MAX(transaction_ID)
FROM TRANSACTIONS
WHERE
CUSTOMER_ID = 'FKJ90838485'
AND purchase_date <= '31-AUG-13'
)
SELECT customer_id, count(transaction_id)
FROM transactions
WHERE
transaction_id BETWEEN min_transaction.tran_id AND max_transaction.tran_id
GROUP BY customer_ID
May be this will run faster since it look at the transaction_id for the range instead of the purchase_date. I also take in consideration that accounting_month is indexed :
select customer_id, count(*)
from transactions
where customer_id = 'FKJ90838485'
and transaction_id between (select min(transaction_id)
from transactions
where accounting_month = '01-JAN-13'
) and
(select max(transaction_id)
from transactions
where accounting_month = '01-AUG-13'
)
group by customer_id
May be you can also try :
select customer_id, count(*)
from transactions
where customer_id = 'FKJ90838485'
and accounting_month between '01-JAN-13' and '01-AUG-13'
group by customer_id