SQL get distinct customer count by hour - sql

I have two table with datein and timein that is recorded when an order is placed and another table with the column datepicked and timepicked that is recorded when the invoice from the order is picked up. I need to find out how many customer I have every hour, but some are placing order and some are picking up invoices and some are doing both. There could be more than one order and invoice for the same customer on the same day/hour.
OrderTable:
Ordernum
CustomerID
datein
timein
InvoiceTable:
CustomerID
InvoiceID
Ordernum
datepicked
timepicked
I tried this SQL, but I can't find out how to get the DISTINCT CUSTOMERID from both tables and the date and hours lined up on both tables, I noticed in the result if there was no order for one hour / day the columns did not lineup.
Select o.datein, i.datepicked, (o.datein) As iDay, HOUR(o.timein) as iH,
DayOfMonth(i.datepicked) As pDay, HOUR(i.timepicked) as pH, Count(*) as Total
from OrderTable o, InvoiceTable i
Where
o.datein >= '2019-01-01' and o.datein <= '2019-12-31'
GROUP BY o.datein, i.datepicked, iDay, iH, pDay, pH
Thanks for any help.
Kim

Not sure why the tables setup as they are, but if all you really care about is the DISTINCT customer per date/hour, I would do the following by pre-unioning just those records, then distinct count from that. Dont worry about joining if the transactions were done at separate times unless your consideration is that the order and invoice are BOTH handled within the same hour. What happens if one order is done at 10:59 and the invoice is 11:00 only 1 minute apart, but representative of 2 different hours. It would be the same 1 customer showing up in each individual hour component.
Notice the first "FROM" clause has a union to get all records to the same column name bulk of records, each of their own respective 2019 calendar activity date. Once that is done, get and group by for the COUNT DISTINCT customers.
select
AllRecs.DateIn,
hour( AllRecs.TimeIn ) ByHour,
DayOfMonth(AllRecs.DateIn) pDay,
Count( distinct AllRecs.CustomerID ) UniqueCustomers
from
( select
ot.CustomerID,
ot.datein,
ot.timein
from
OrderTable ot
where
ot.datein >= '2019-01-01'
and ot.datein <= '2019-12-31'
union all
select
it.CustomerID,
it.datepicked datein,
it.timepicked timein
from
InvoiceTable it
where
it.datepicked >= '2019-01-01'
and it.datepicked <= '2019-12-31' ) AllRecs
group by
AllRecs.DateIn,
hour( AllRecs.TimeIn ),
DayOfMonth(AllRecs.DateIn)

If you had a relation between these two tables it would be possible. If I understand what you are trying to do, InvoiceTable needs to be a child table of OrderTable with a foreign key field "OrderNum" that relates back to its parent "OrderTable" primary key "OrderNum". Therefore, you don't need a field "CusotmerID" on InvoiceTable and you would know when an Invoice been picked up belongs to an order from the same day.

Related

SQL - Counting users that have multiple transactions and have at least one transaction that has been made within 7 days interval of the other one

Dataset Here is the task : Count users that have multiple transactions and have at least one transaction that has been made within 7 days interval of the other one.
Structure of dataset: Row, userId, orderId, date
Date is formatted as YYYY-MM-DDTHH:MM:SS Example: 2016-09-16T11:32:06
I have completed the first part (counting users with multiple transactions), but I do not know how to do the second part in the same query. I will be thankful for help.
Here is the console:
query = '''
SELECT COUNT(*)
FROM
(SELECT userId FROM `dataset` GROUP BY userId HAVING COUNT(orderId) > 1)
'''
project_id = 'acdefg'
df = pd.io.gbq.read_gbq(query, project_id=project_id, dialect='standard')
display(df)
To solve this issue you want to be able to compare each record to a previous record: when was the last order from the same user. This hints to the use of partitions and window functions, in this case LAG.
A possible way to solve the problem is to organise records per user and order them by orderDate and then for each record have a look at the record just above:
WITH intermediate_table AS (
SELECT
userId,
orderDate,
LAG(orderDate)
OVER (PARTITION BY userId ORDER BY orderDate) -- this is where we pick the orderDate of the record right above, once the orders are organized by userId and ordered by orderDate
FROM `dataset.table`
)
SELECT userId
FROM intermediate_table
WHERE DATE_DIFF(orderDate, previous_order, DAY) <= 7
GROUP BY userId
Once orderDate and previous_order info are gathered in the same record, it's easy to compare them and see if there is less than 7 days between the two.
(GROUP BY is used for returning userIds only once in the resulting table)
This may be what you need:
-- for each order calculate the days since that customer's last order
order_profiler AS (
SELECT
orderId,
orderDate,
custId,
DATE_DIFF(orderDate, LAG(orderDate) OVER (PARTITION BY custId ORDER BY orderDate), day) AS order_latency_days,
FROM
`dataset.table`
)
SELECT
custId,
FROM order_profiler
WHERE order_latency_days <= 7
GROUP BY custId

Calculating business days in Teradata

I need help in business days calculation.
I've two tables
1) One table ACTUAL_TABLE containing order date and contact date with timestamp datatypes.
2) The second table BUSINESS_DATES has each of the calendar dates listed and has a flag to indicate weekend days.
using these two tables, I need to ensure business days and not calendar days (which is the current logic) is calculated between these two fields.
My thought process was to first get a range of dates by comparing ORDER_DATE with TABLE_DATE field and then do a similar comparison of CONTACT_DATE to TABLE_DATE field. This would get me a range from the BUSINESS_DATES table which I can then use to calculate count of days, sum(Holiday_WKND_Flag) fields making the result look like:
Order# | Count(*) As DAYS | SUM(WEEKEND DATES)
100 | 25 | 8
However this only works when I use a specific order number and cant' bring all order numbers in a sub query.
My Query:
SELECT SUM(Holiday_WKND_Flag), COUNT(*) FROM
(
SELECT
* FROM
BUSINESS_DATES
WHERE BUSINESS.Business BETWEEN (SELECT ORDER_DATE FROM ACTUAL_TABLE
WHERE ORDER# = '100'
)
AND
(SELECT CONTACT_DATE FROM ACTUAL_TABLE
WHERE ORDER# = '100'
)
TEMP
Uploading the table structure for your reference.
SELECT ORDER#, SUM(Holiday_WKND_Flag), COUNT(*)
FROM business_dates bd
INNER JOIN actual_table at ON bd.table_date BETWEEN at.order_date AND at.contact_date
GROUP BY ORDER#
Instead of joining on a BETWEEN (which always results in a bad Product Join) followed by a COUNT you better assign a bussines day number to each date (in best case this is calculated only once and added as a column to your calendar table). Then it's two Equi-Joins and no aggregation needed:
WITH cte AS
(
SELECT
Cast(table_date AS DATE) AS table_date,
-- assign a consecutive number to each busines day, i.e. not increased during weekends, etc.
Sum(CASE WHEN Holiday_WKND_Flag = 1 THEN 0 ELSE 1 end)
Over (ORDER BY table_date
ROWS Unbounded Preceding) AS business_day_nbr
FROM business_dates
)
SELECT ORDER#,
Cast(t.contact_date AS DATE) - Cast(t.order_date AS DATE) AS #_of_days
b2.business_day_nbr - b1.business_day_nbr AS #_of_business_days
FROM actual_table AS t
JOIN cte AS b1
ON Cast(t.order_date AS DATE) = b1.table_date
JOIN cte AS b2
ON Cast(t.contact_date AS DATE) = b2.table_date
Btw, why are table_date and order_date timestamp instead of a date?
Porting from Oracle?
You can use this query. Hope it helps
select order#,
order_date,
contact_date,
(select count(1)
from business_dates_table
where table_date between a.order_date and a.contact_date
and holiday_wknd_flag = 0
) business_days
from actual_table a

Number of Unique Visits in SQL

I am using Postgres 9.1 to count the number of unique patient visits over a given period of time, using the invoices entered for each patient.
I have two columns, transactions.ptnumber and transactions.dateofservice, and can calculate the the patient visits in the following manner:
select count(*)
from transactions
where transactions.dateofservice between '2012-01-01' and '2013-12-31'
The problem is that sometimes one patient might get two invoices for the same day, but that should be counted as only one patient visit.
If I use SELECT DISTINCT or GROUP BY on the column transactions.ptnumber, that would count the number of patients who were seen (but not the number of times they were seen).
If I use SELECT DISTINCT or GROUP BY on the column transactions.dateofservice, that would count the number of days that had an invoice.
Not sure how to approach this.
This will return unique patients per day.
select count(distinct transactions.ptnumber) as cnt
from transactions
where transactions.dateofservice between '2012-01-01' and '2013-12-31'
group by transactions.dateofservice
You can sum them up to get the unique patients for the whole period
select sum(cnt) from (
select count(distinct transactions.ptnumber) as cnt
from transactions
where transactions.dateofservice between '2012-01-01' and '2013-12-31'
group by transactions.dateofservice
)
You might use a subselect, consider
Select count(*)
from (select ptnumber, dateofservice
from transactions
where dateofservice between '2012-01-01' and '2013-12-31'
group by ptnumber, dateofservice
)
You may also want to make this a stored procedure so you can pass in the date range.
There are multiple ways to achieve this, but you could use the WITH clause to construct a temporary table that contains the unique visits, then count the results !
WITH UniqueVisits AS
(SELECT DISTINCT transactions.ptnumber, transactions.dateofservice
FROM transactions
WHERE transactions.dateofservice between '2012-01-01' and '2013-12-31')
SELECT COUNT(*) FROM UniqueVisits

SQL Count Query Using Non-Index Column

I have a query similar to this, where I need to find the number of transactions a specific customer had within a time frame:
select customer_id, count(transactions)
from transactions
where customer_id = 'FKJ90838485'
and purchase_date between '01-JAN-13' and '31-AUG-13'
group by customer_id
The table transactions is not indexed on customer_id but rather another field called transaction_id. Customer_ID is character type while transaction_id is numeric.
'accounting_month' field is also indexed.. this field just stores the month that transactions occured... ie, purchase_date = '03-MAR-13' would have accounting_month = '01-MAR-13'
The transactions table has about 20 million records in the time frame from '01-JAN-13' and '31-AUG-13'
When I run the above query, it has taken more than 40 minutes to come back, any ideas or tips?
As others have already commented, the best is to add an index that will cover the query, So:
Contact the Database administrator and request that they add an index on (customer_id, purchase_date) because the query is doing a table scan otherwise.
Sidenotes:
Use date and not string literals (you may know that and do it already, still noted here for future readers)
You don't have to put the customer_id in the SELECT list and if you remove it from there, it can be removed from the GROUP BY as well so the query becomes:
select count(*) as number_of_transactions
from transactions
where customer_id = 'FKJ90838485'
and purchase_date between DATE '2013-01-01' and DATE '2013-08-31' ;
If you don't have a WHERE condition on customer_id, you can have it in the GROUP BY and the SELECT list to write a query that will count number of transactions for every customer. And the above suggested index will help this, too:
select customer_id, count(*) as number_of_transactions
from transactions
where purchase_date between DATE '2013-01-01' and DATE '2013-08-31'
group by customer_id ;
This is just an idea that came up to me. It might work, try running it and see if it is an improvement over what you currently have.
I'm trying to use the transaction_id, which you've said is indexed, as much as possible.
WITH min_transaction (tran_id)
AS (
SELECT MIN(transaction_ID)
FROM TRANSACTIONS
WHERE
CUSTOMER_ID = 'FKJ90838485'
AND purchase_date >= '01-JAN-13'
), max_transaction (tran_id)
AS (
SELECT MAX(transaction_ID)
FROM TRANSACTIONS
WHERE
CUSTOMER_ID = 'FKJ90838485'
AND purchase_date <= '31-AUG-13'
)
SELECT customer_id, count(transaction_id)
FROM transactions
WHERE
transaction_id BETWEEN min_transaction.tran_id AND max_transaction.tran_id
GROUP BY customer_ID
May be this will run faster since it look at the transaction_id for the range instead of the purchase_date. I also take in consideration that accounting_month is indexed :
select customer_id, count(*)
from transactions
where customer_id = 'FKJ90838485'
and transaction_id between (select min(transaction_id)
from transactions
where accounting_month = '01-JAN-13'
) and
(select max(transaction_id)
from transactions
where accounting_month = '01-AUG-13'
)
group by customer_id
May be you can also try :
select customer_id, count(*)
from transactions
where customer_id = 'FKJ90838485'
and accounting_month between '01-JAN-13' and '01-AUG-13'
group by customer_id

Can I limit the amount of rows to be used for a group in a GROUP BY statement

I'm having an odd problem
I have a table with the columns product_id, sales and day
Not all products have sales every day. I'd like to get the average number of sales that each product had in the last 10 days where it had sales
Usually I'd get the average like this
SELECT product_id, AVG(sales)
FROM table
GROUP BY product_id
Is there a way to limit the amount of rows to be taken into consideration for each product?
I'm afraid it's not possible but I wanted to check if someone has an idea
Update to clarify:
Product may be sold on days 1,3,5,10,15,17,20.
Since I don't want to get an the average of all days but only the average of the days where the product did actually get sold doing something like
SELECT product_id, AVG(sales)
FROM table
WHERE day > '01/01/2009'
GROUP BY product_id
won't work
If you want the last 10 calendar day since products had a sale:
SELECT product_id, AVG(sales)
FROM table t
JOIN (
SELECT product_id, MAX(sales_date) as max_sales_date
FROM table
GROUP BY product_id
) t_max ON t.product_id = t_max.product_id
AND DATEDIFF(day, t.sales_date, t_max.max_sales_date) < 10
GROUP BY product_id;
The date difference is SQL server specific, you'd have to replace it with your server syntax for date difference functions.
To get the last 10 days when the product had any sale:
SELECT product_id, AVG(sales)
FROM (
SELECT product_id, sales, DENSE_RANK() OVER
(PARTITION BY product_id ORDER BY sales_date DESC) AS rn
FROM Table
) As t_rn
WHERE rn <= 10
GROUP BY product_id;
This asumes sales_date is a date, not a datetime. You'd have to extract the date part if the field is datetime.
And finaly a windowing function free version:
SELECT product_id, AVG(sales)
FROM Table t
WHERE sales_date IN (
SELECT TOP(10) sales_date
FROM Table s
WHERE t.product_id = s.product_id
ORDER BY sales_date DESC)
GROUP BY product_id;
Again, sales_date is asumed to be date, not datetime. Use other limiting syntax if TOP is not suported by your server.
Give this a whirl. The sub-query selects the last ten days of a product where there was a sale, the outer query does the aggregation.
SELECT t1.product_id, SUM(t1.sales) / COUNT(t1.*)
FROM table t1
INNER JOIN (
SELECT TOP 10 day, Product_ID
FROM table t2
WHERE (t2.product_ID=t1.Product_ID)
ORDER BY DAY DESC
)
ON (t2.day=t1.day)
GROUP BY t1.product_id
BTW: This approach uses a correlated subquery, which may not be very performant, but it should work in theory.
I'm not sure if I get it right but If you'd like to get the average of sales for last 10 days for you products you can do as follows :
SELECT Product_Id,Sum(Sales)/Count(*) FROM (SELECT ProductId,Sales FROM Table WHERE SaleDAte>=#Date) table GROUP BY Product_id HAVING Count(*)>0
OR You can use AVG Aggregate function which is easier :
SELECT Product_Id,AVG(Sales) FROM (SELECT ProductId,Sales FROM Table WHERE SaleDAte>=#Date) table GROUP BY Product_id
Updated
Now I got what you meant ,As far as I know it is not possible to do this in one query.It could be possible if we could do something like this(Northwind database):
select a.CustomerId,count(a.OrderId)
from Orders a INNER JOIN(SELECT CustomerId,OrderDate FROM Orders Order By OrderDate) AS b ON a.CustomerId=b.CustomerId GROUP BY a.CustomerId Having count(a.OrderId)<10
but you can't use order by in subqueries unless you use TOP which is not suitable for this case.But maybe you can do it as follows:
SELECT PorductId,Sales INTO #temp FROM table Order By Day
select a.ProductId,Sum(a.Sales) /Count(a.Sales)
from table a INNER JOIN #temp AS b ON a.ProductId=b.ProductId GROUP BY a.ProductId Having count(a.Sales)<=10
If this is a table of sales transactions, then there should not be any rows in there for days on which there were no Sales. I.e., If ProductId 21 had no sales on 1 June, then this table should not have any rows with productId = 21 and day = '1 June'... Therefore you should not have to filter anything out - there should not be anything to filter out
Select ProductId, Avg(Sales) AvgSales
From Table
Group By ProductId
should work fine. So if it's not, then you have not explained the problem completely or accurately.
Also, in yr question, you show Avg(Sales) in the example SQL query but then in the text you mention "average number of sales that each product ... " Do you want the average sales amount, or the average count of sales transactions? And do you want this average by Product alone (i.e., one output value reported for each product) or do you want the average per product per day ?
If you want the average per product alone, for just thpse sales in the ten days prior to now? or the ten days prior to the date of the last sale for each product?
If the latter then
Select ProductId, Avg(Sales) AvgSales
From Table T
Where day > (Select Max(Day) - 10
From Table
Where ProductId = T.ProductID)
Group By ProductId
If you want the average per product alone, for just those sales in the ten days with sales prior to the date of the last sale for each product, then
Select ProductId, Avg(Sales) AvgSales
From Table T
Where (Select Count(Distinct day) From Table
Where ProductId = T.ProductID
And Day > T.Day) <= 10
Group By ProductId