Find rows with similar date values - SQL

I want to find customers where, for example, the system registered duplicate orders by mistake.
It's pretty easy if reg_date is EXACTLY the same, but I have no idea how to write the query so that rows count as duplicates when there is, for example, up to a 1-second difference between the transactions.
select *
from (
    select customer_id, reg_date, count(*) as cnt
    from orders
    group by 1, 2
) x
where cnt > 1
Here is an example dataset:
https://www.db-fiddle.com/f/m6PhgReSQbVWVZhqe8n4mi/0
Currently only customer 104's orders are counted as duplicates because their reg_date is identical; I also want to count orders 1,2 and 4,5, as there is just a 1-second difference between them.

SELECT
    customer_id,
    reg_date
FROM (
    SELECT
        *,
        reg_date - lag(reg_date) OVER (PARTITION BY customer_id ORDER BY reg_date) <= interval '1 second' AS is_duplicate
    FROM
        orders
) s
WHERE is_duplicate
Use the lag() window function. It lets you look at the previous record. With that value you can compute the difference and filter the records where the time difference is at most one second.
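Note that this returns only the later row of each near-duplicate pair. If you also want the earlier rows (e.g. both of orders 1 and 2 from the fiddle), here is a rough sketch that additionally checks the following row with lead(); it assumes the same orders table as in the fiddle:
SELECT customer_id, reg_date
FROM (
    SELECT *,
           -- within 1 second of the previous order of the same customer
           reg_date - lag(reg_date) OVER w <= interval '1 second' AS dup_of_prev,
           -- within 1 second of the next order of the same customer
           lead(reg_date) OVER w - reg_date <= interval '1 second' AS dup_of_next
    FROM orders
    WINDOW w AS (PARTITION BY customer_id ORDER BY reg_date)
) s
WHERE dup_of_prev OR dup_of_next;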

Try the following script. It will return day/customer-wise duplicates.
SELECT
    TO_CHAR(reg_date::DATE, 'dd/mm/yyyy') reg_date,
    customer_id,
    count(*) as cnt
FROM orders
GROUP BY
    TO_CHAR(reg_date::DATE, 'dd/mm/yyyy'),
    customer_id
HAVING count(*) > 1


Data recurring in previous 90 days

I hope you can support me with a piece of code I'm writing. I'm working with the following query:
SELECT case_id, case_date, people_id FROM table_1;
and I have to search the DB for how many times the same people_id is repeated (with different case_id) within the case_date minus 90 days timeframe. Any advice on how to address that?
[data sample image omitted]
Additional info: as a result I'm expecting the list of people_id with how many cases each received in the 90 days from the last case_date.
Expected result sample: [image omitted]
The way I understood the question, it would be something like this:
select people_id,
case_id,
count(*)
from table_1
where case_date >= trunc(sysdate) - 90
group by people_id,
case_id
You want to filter WHERE the case_date is greater than or equal to 90 days before the start of today and then GROUP BY the people_id and COUNT the number of DISTINCT (different) case_id:
SELECT people_id,
COUNT( DISTINCT case_id ) AS number_of_cases
FROM table_1
WHERE case_date >= TRUNC( SYSDATE ) - INTERVAL '90' DAY
GROUP BY
people_id;
If you only want to count repeated case_id per people_id then:
SELECT people_id,
       COUNT(*) AS number_of_repeated_cases
FROM (
    SELECT people_id,
           case_id
    FROM table_1
    WHERE case_date >= TRUNC( SYSDATE ) - INTERVAL '90' DAY
    GROUP BY
        people_id,
        case_id
    HAVING COUNT(*) >= 2
)
GROUP BY
    people_id;
I think you want window functions:
select t.*,
       count(*) over (partition by people_id
                      order by case_date
                      range between interval '90' day preceding and current row
                     ) as person_count_90_day
from table_1 t;
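If what you ultimately want is one row per people_id showing how many cases fall in the 90 days up to that person's last case_date (as your expected result suggests), here is a rough sketch building on the same window count; it assumes the same table_1:
SELECT people_id, case_date AS last_case_date, person_count_90_day
FROM (
    SELECT t.*,
           -- rolling count of this person's cases in the 90 days up to each case_date
           COUNT(*) OVER (PARTITION BY people_id ORDER BY case_date
                          RANGE BETWEEN INTERVAL '90' DAY PRECEDING AND CURRENT ROW) AS person_count_90_day,
           -- rn = 1 marks the person's most recent case
           ROW_NUMBER() OVER (PARTITION BY people_id ORDER BY case_date DESC) AS rn
    FROM table_1 t
)
WHERE rn = 1;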

Count the number of repeating pairs of information

I have a table with customer_ID, date, and payment_method as 3 columns. payment_method can be 'cash', 'credit', or 'others'. I want to find out the number of customers who have used credit as a payment method more than 5 times, in the last 6 months.
I found this solution for displaying the rows where the customer used credit:
SELECT customer_ID, payment_method, COUNT(*) AS unique_pair_repeats
FROM tab1
WHERE customer_ID IS NOT NULL
GROUP BY customer_ID, payment_method
HAVING count(*) > 1;
The problem is, I don't want a list of the names/ids, I want to know how many people used their credit card for a purchase 5 times or more in the last 6 months.
This is one way you could do it:
SELECT COUNT(*)
FROM
(
SELECT customer_id
FROM tab1
WHERE
customer_ID IS NOT NULL and
payment_method = 'credit' and
tran_date > add_months(sysdate, -6)
GROUP BY customer_ID
HAVING count(*) > 5
) x
The inner query generates a list of all customer ids that have used credit more than 5 times in the last 6 months; the outer query counts them.
You might feel it more logical to write it like this:
SELECT COUNT(*)
FROM
(
SELECT customer_id, count(*) as ctr
FROM tab1
WHERE
customer_ID IS NOT NULL and
payment_method = 'credit' and
tran_date > add_months(sysdate, -6)
GROUP BY customer_ID
) x
WHERE x.ctr > 5
So, remove customer_ID and payment_method from the SELECT.
Though that still doesn't cover "at least 5 times in the last 6 months", so you need another condition on the date (presuming you use Oracle; you didn't tag the question, but you do use Oracle SQL Developer):
and date_column >= add_months(trunc(sysdate), -6)
Finally, something like this might help:
SELECT COUNT(*) AS unique_pair_repeats --> changes here
FROM tab1
WHERE customer_ID IS NOT NULL
and date_column >= add_months(trunc(sysdate), -6) --> here
GROUP BY customer_ID, payment_method
HAVING count(*) >= 5; --> here
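If you then want a single number (how many such customers), here is a sketch that wraps the grouped query and counts its rows, with the 'credit' filter added; date_column is the same assumed column name as above:
SELECT COUNT(*) AS customers_with_5_or_more
FROM (
    SELECT customer_ID
    FROM tab1
    WHERE customer_ID IS NOT NULL
      AND payment_method = 'credit'
      AND date_column >= add_months(trunc(sysdate), -6)
    GROUP BY customer_ID
    HAVING COUNT(*) >= 5   -- 5 or more credit payments in the last 6 months
);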

Add columns to SQL query and filter by min(date) and sum(price)

I am trying to generate, in SQL, a list of users whose first purchase was in December 2018 and who have spent over 100 dollars since then. I'm able to generate the list of users, but I'm unable to include what their first purchase was or other variables, and it appears to be an issue because the columns I'm trying to include are neither grouped nor aggregated. I'm hoping someone can point me in the right direction, as I'm new to SQL.
Here's my code to generate the list I want to add more columns to:
select billing_address.name, contact_email, min(processed_at) as First_Purchase_Date, sum(total_price) as Total_Revenue
FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY id) AS instance
FROM `table.orders`
) orders -- identify duplicate rows
WHERE instance = 1
group by contact_email, billing_address.name
having min(processed_at) between '2019-01-01 00:00:00 UTC' and '2019-02-01 00:00:00 UTC' and sum(total_price) > 100
order by sum(total_price) desc
Is there some way I can modify this to pull each user's purchases from this list into separate rows and include more columns? So I'd pull in each user (and ALL of their purchases) who has a min(processed_at) in December 2018 AND whose sum(total_price) > 100? Something like this:
SELECT contact_email, billing_address, line_items, min(processed_at), sum(total_price) OVER (PARTITION BY contact_email)
FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY id) AS instance
FROM `table.orders`
) orders -- identify duplicate rows
WHERE instance = 1
However, the sum(total_price) doesn't work in this case and I can't filter by min(processed_at). Can someone guide me in the right direction?
I think you should use window functions instead of aggregation. You can compute the date of the first purchase and the total amount spent on the fly in a subquery, without aggregating (your original GROUP BY columns become the partition columns of the window functions). Then you can use this information to filter in the outer query.
This should get you close to what you want:
select o.*
from (
select
o.*,
min(processed_at) over(partition by contact_email, billing_address.name) min_processed_at,
sum(total_price) over(partition by contact_email, billing_address.name) sum_total_price
from (
select
o.*,
row_number() over(partition by id) instance
from orders o
) o
where instance = 1
) o
where
min_processed_at between '2019-01-01 00:00:00 UTC' and '2019-02-01 00:00:00 UTC'
and sum_total_price > 100
Your question was a bit unclear as you did not provide much detail about your input tables or your expected output, so this is a guess.
The following query gets all transactions from users who meet the criteria:
-- BigQuery StandardSQL
with ordered_orders as (
--rank each ID by processed_at date first to last
select *, row_number() over(partition by id order by processed_at asc) as rn
from `table.orders`
),
first_criteria as (
-- select IDs where first processed_at date is in 2018-12
select id, processed_at as first_order_date
from ordered_orders
where rn = 1
and extract(year from processed_at) = 2018
and extract(month from processed_at) = 12
),
second_criteria as (
-- further select IDs who meet first criteria and have a total of > 100
select id, sum(total_price) as total_revenue
from ordered_orders
inner join first_criteria using(id)
group by id
having total_revenue > 100
),
orders_with_criteria as (
-- get all orders for users who meet both criteria
select ordered_orders.* except(rn), first_order_date, total_revenue
from ordered_orders
inner join first_criteria using(id)
inner join second_criteria using(id)
)
-- select any fields you want
select * from orders_with_criteria
I prefer liberal use of CTEs in cases like this to keep the logic clear.
I also wouldn't be surprised if this query doesn't work as you intend. I think it is highly doubtful that the ID column in your orders table refers to the customer id, which is what you/we are partitioning on. Depending on who set up your tables, id probably refers to the order id. If you have a customer_id (or account #, etc), then I would use that instead of id in the query.
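For example, if the table does have a customer_id column (an assumption about your schema), the first CTE would partition on that instead, and the later joins would switch to using(customer_id):
-- hypothetical: assumes a customer_id column exists in `table.orders`
with ordered_orders as (
  -- rank each customer's orders first to last, instead of each order id's rows
  select *, row_number() over(partition by customer_id order by processed_at asc) as rn
  from `table.orders`
)
select * from ordered_orders where rn = 1  -- e.g. each customer's first order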
No need for row_number() to pick out each customer's first order in BigQuery; ARRAY_AGG() can do that directly (the deduplication subquery from your query stays):
SELECT billing_address.name, contact_email,
MIN(processed_at) as First_Purchase_Date,
SUM(total_price) as Total_Revenue,
ARRAY_AGG(o ORDER BY processed_at LIMIT 1) as first_order
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id) AS instance
    FROM `table.orders`
) o -- identify duplicate rows, as in your original query
WHERE instance = 1
GROUP BY contact_email, billing_address.name
HAVING MIN(processed_at) >= '2019-01-01 00:00:00 UTC' AND
MIN(processed_at) < '2019-02-01 00:00:00 UTC' AND
SUM(total_price) > 100
ORDER BY SUM(total_price) desc;
This returns the entire first order as a struct. You can select specific columns, if you prefer.
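For instance, to keep only a few fields of that first order instead of the whole row, you could aggregate a STRUCT and unwrap it. A sketch (the chosen fields are just examples; the deduplication subquery is kept as in your original query):
SELECT billing_address.name, contact_email,
       MIN(processed_at) AS First_Purchase_Date,
       SUM(total_price) AS Total_Revenue,
       -- keep only selected fields of the earliest order
       ARRAY_AGG(STRUCT(o.id, o.processed_at, o.total_price)
                 ORDER BY processed_at LIMIT 1)[OFFSET(0)] AS first_order
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id) AS instance
    FROM `table.orders`
) o
WHERE instance = 1
GROUP BY contact_email, billing_address.name
HAVING MIN(processed_at) >= '2019-01-01 00:00:00 UTC' AND
       MIN(processed_at) < '2019-02-01 00:00:00 UTC' AND
       SUM(total_price) > 100
ORDER BY SUM(total_price) DESC;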

Is it possible to look at two consecutive rows and determine the difference in time between the two using SQL?

I am relatively new to SQL, so please bear with me! I am trying to see how many customers make a purchase after being dormant for two years. Relevant fields include cust_id and purchase_date (there can be several observations for the same cust_id but with different dates). I am using Redshift for my SQL scripts.
I realize I cannot put the same thing in for the DATEDIFF parameters (it just doesn't make any sense), but I am unsure what else to do.
SELECT *
FROM tickets t
LEFT JOIN d_customer c
ON c.cust_id = t.cust_id
WHERE DATEDIFF(year, t.purchase_date, t.purchase_date) between 0 and 2
ORDER BY t.cust_id, t.purchase_date
;
I think you want lag(). To get the relevant tickets:
SELECT t.*
FROM (SELECT t.*,
LAG(purchase_date) OVER (PARTITION BY cust_id ORDER BY purchase_date) as prev_pd
FROM tickets t
) t
WHERE prev_pd < purchase_date - interval '2 year';
If you want the number of customers, use count(distinct):
SELECT COUNT(DISTINCT cust_id)
FROM (SELECT t.*,
LAG(purchase_date) OVER (PARTITION BY cust_id ORDER BY purchase_date) as prev_pd
FROM tickets t
) t
WHERE prev_pd < purchase_date - interval '2 year';
Note that these do not use DATEDIFF(), which counts the number of boundaries crossed between two date values. So 2018-12-31 and 2019-01-01 have a difference of 1 year.
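A quick illustration of that boundary-counting behavior (ad-hoc queries, Redshift syntax):
SELECT DATEDIFF(year, '2018-12-31'::date, '2019-01-01'::date);  -- 1: a year boundary is crossed, although the dates are one day apart
SELECT DATEDIFF(year, '2019-01-01'::date, '2019-12-31'::date);  -- 0: no boundary crossed, although the dates are almost a year apart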

Same output in two different lateral joins

I'm working on a bit of PostgreSQL to grab the first 10 and last 10 invoices of every month between certain dates. I am having unexpected output in the lateral joins. Firstly the limit is not working, and each of the array_agg aggregates is returning hundreds of rows instead of limiting to 10. Secondly, the aggregates appear to be the same, even though one is ordered ASC and the other DESC.
How can I retrieve only the first 10 and last 10 invoices of each month group?
SELECT first.invoice_month,
array_agg(first.id) first_ten,
array_agg(last.id) last_ten
FROM public.invoice i
JOIN LATERAL (
SELECT id, to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE id = i.id
ORDER BY invoice_date, id ASC
LIMIT 10
) first ON i.id = first.id
JOIN LATERAL (
SELECT id, to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE id = i.id
ORDER BY invoice_date, id DESC
LIMIT 10
) last on i.id = last.id
WHERE i.invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
GROUP BY first.invoice_month, last.invoice_month;
This can be done with a recursive query that generates the interval of months for which we need to find the first and last 10 invoices.
WITH RECURSIVE all_months AS (
    SELECT date_trunc('month', '2018-01-01'::TIMESTAMP) AS c_date,
           date_trunc('month', '2018-05-11'::TIMESTAMP) AS end_date,
           to_char('2018-01-01'::TIMESTAMP, 'YYYY-MM') AS current_month
    UNION
    SELECT c_date + INTERVAL '1 month' AS c_date,
           end_date,
           to_char(c_date + INTERVAL '1 month', 'YYYY-MM') AS current_month
    FROM all_months
    WHERE c_date + INTERVAL '1 month' <= end_date
),
invoices_with_month AS (
    SELECT *, to_char(invoice_date::TIMESTAMP, 'YYYY-MM') AS invoice_month
    FROM invoice
)
SELECT current_month, array_agg(first_10.id), 'FIRST 10' AS type
FROM all_months
JOIN LATERAL (
    SELECT * FROM invoices_with_month
    WHERE all_months.current_month = invoice_month
      AND invoice_date >= '2018-01-01' AND invoice_date <= '2018-05-11'
    ORDER BY invoice_date ASC LIMIT 10
) first_10 ON TRUE
GROUP BY current_month
UNION
SELECT current_month, array_agg(last_10.id), 'LAST 10' AS type
FROM all_months
JOIN LATERAL (
    SELECT * FROM invoices_with_month
    WHERE all_months.current_month = invoice_month
      AND invoice_date >= '2018-01-01' AND invoice_date <= '2018-05-11'
    ORDER BY invoice_date DESC LIMIT 10
) last_10 ON TRUE
GROUP BY current_month;
In the code above, '2018-01-01' and '2018-05-11' represent the dates between which we want to find the invoices. Based on those dates, we generate the months (2018-01, 2018-02, 2018-03, 2018-04, 2018-05) for which we need to find the invoices.
We store this data in all_months.
After we get the months, we do a lateral join in order to join the invoices for every month. We need 2 lateral joins in order to get the first and last 10 invoices.
Finally, the result is represented as:
current_month - the month
array_agg - ids of all selected invoices for that month
type - type of the selected invoices ('first 10' or 'last 10').
So in the current implementation, you will have 2 rows for each month (if there is at least 1 invoice for that month). You can easily combine those into a single row per month if you need to.
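For example, here is a sketch of combining them with conditional aggregation; per_month_sets stands in for the UNION query above (mocked here with two hard-coded rows so the snippet runs on its own):
WITH per_month_sets AS (
    -- stand-in for the UNION query above: month, aggregated ids, and set type
    SELECT '2018-01' AS current_month, ARRAY[1, 2, 3] AS ids, 'FIRST 10' AS type
    UNION ALL
    SELECT '2018-01', ARRAY[8, 9, 10], 'LAST 10'
)
SELECT current_month,
       MAX(ids) FILTER (WHERE type = 'FIRST 10') AS first_ten,
       MAX(ids) FILTER (WHERE type = 'LAST 10')  AS last_ten
FROM per_month_sets
GROUP BY current_month;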
LIMIT is working fine. It's your query that's broken. JOIN is just 100% the wrong tool here; it doesn't even do anything close to what you need. By joining up to 10 rows with up to another 10 rows, you get up to 100 rows back. There's also no reason to self join just to combine filters.
Consider instead window queries. In particular, we have the dense_rank function, which can number every row in the result set according to groups:
SELECT
invoice_month,
time_of_month,
ARRAY_AGG(id) invoice_ids
FROM (
SELECT
id,
invoice_month,
-- Categorize as end or beginning of month
CASE
WHEN month_rank <= 10 THEN 'beginning'
WHEN month_reverse_rank <= 10 THEN 'end'
ELSE 'bug' -- Should never happen. Just a fall back in case of a bug.
END AS time_of_month
FROM (
SELECT
id,
invoice_month,
dense_rank() OVER (PARTITION BY invoice_month ORDER BY invoice_date) month_rank,
dense_rank() OVER (PARTITION BY invoice_month ORDER BY invoice_date DESC) month_reverse_rank
FROM (
SELECT
id,
invoice_date,
to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
) AS fiscal_year_invoices
) ranked_invoices
-- Get first and last 10
WHERE month_rank <= 10 OR month_reverse_rank <= 10
) first_and_last_by_month
GROUP BY
invoice_month,
time_of_month
Don't be intimidated by the length. This query is actually very straightforward; it just needed a few subqueries.
This is what it does logically:
Fetch the rows for the fiscal year in question
Assign a "rank" to the row within its month, both counting from the beginning and from the end
Filter out everything that doesn't rank in the 10 top for its month (counting from either direction)
Add an indicator as to whether it was at the beginning or end of the month. (Note that if there are fewer than 20 rows in a month, it will categorize more of them as "beginning".)
Aggregate the IDs together
This is the tool set designed for the job you're trying to do. If really needed, you can adjust this approach slightly to get them into the same row, but you have to aggregate before joining the results together and then join on the month; you can't join and then aggregate.
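A sketch of that adjustment, assuming the same public.invoice table: aggregate the first-10 and last-10 sets separately, then join them on the month:
WITH ranked_invoices AS (
    SELECT id,
           invoice_date,
           to_char(invoice_date, 'Mon-yy') AS invoice_month,
           dense_rank() OVER (PARTITION BY to_char(invoice_date, 'Mon-yy')
                              ORDER BY invoice_date)      AS month_rank,
           dense_rank() OVER (PARTITION BY to_char(invoice_date, 'Mon-yy')
                              ORDER BY invoice_date DESC) AS month_reverse_rank
    FROM public.invoice
    WHERE invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
),
first_ten AS (
    SELECT invoice_month, array_agg(id ORDER BY invoice_date) AS first_ten_ids
    FROM ranked_invoices
    WHERE month_rank <= 10
    GROUP BY invoice_month
),
last_ten AS (
    SELECT invoice_month, array_agg(id ORDER BY invoice_date DESC) AS last_ten_ids
    FROM ranked_invoices
    WHERE month_reverse_rank <= 10
    GROUP BY invoice_month
)
SELECT invoice_month, first_ten_ids, last_ten_ids
FROM first_ten
JOIN last_ten USING (invoice_month);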