I'm trying to find the correct and efficient way to write this query. For a sales data set, I want to isolate customers with more than 4 transactions in a given month, but along with that I want the details of each sale.
What I've written so far is two volatile tables: T1 (Details) and T2 (Count). T1 puts all the details of the sales into a volatile table, and T2 counts the transactions grouped by the identifiers Store, Date, Time, Trans_ID, Cashier and TotalSale. Some transactions show up twice based on the detail rows in T1, so I'm trying to eliminate the duplicates without losing the details.
My third query joins T1 against T2, keeping rows where the count in T2 is greater than 1.
I think this works; however, I'm not sure it's the most efficient way to run it, as it takes hours this way. Any input on changes I can make? Code is pasted below:
create volatile table IC_DTL as
(
SELECT DISTINCT
o.orgn_nm REGION
, o.str_org_nbr STORE
, d.sltrn_dt SLS_DATE
, cast(d.last_upd_ts as TIME) SLS_TIME
, d.pos_rgstr_id REGISTER
, d.sltrn_id TRANS_ID
, dp.user_id CASHIER_ID
, d.sltrn_line_nbr LINE_ITEM
, i.sub_class_nbr SUB_CLASS
, i.item_sc_desc SC_DESC
, i.item_sku_nbr SKU
, i.item_desc SKU_DESC
, d.ITEM_QTY QTY
, d.EXT_RETL_AMT EXT_RETAIL
, st.grs_sls_amt GROSS_SALE_AMT
, pc.s_paymt_meth_desc PAYMT_METH
, py.cardhldr_acct_nbr ACCT_NBR
FROM ALL MY TABLES (9 JOINS)
WHERE MY CONDITIONS
)
WITH DATA
ON COMMIT PRESERVE ROWS;
Create volatile table IC_COUNT as
(
select cashier_id, count(Trans_ID) as TOTAL
FROM IC_DTL
GROUP BY store, sls_date, sls_time, trans_id, cashier_id
,gross_sale_amt
)
WITH DATA
ON COMMIT PRESERVE ROWS;
select *
from IC_DTL D
JOIN IC_COUNT C
ON C.CASHIER_ID = D.CASHIER_ID
WHERE C.TOTAL >= 2
Any help is greatly appreciated!
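One single-pass alternative that might be worth timing (a sketch only, not tested against your schema; it assumes your Teradata release supports QUALIFY with windowed aggregates, and the partition keys simply mirror the GROUP BY in IC_COUNT): computing the per-transaction count with a window function avoids building IC_COUNT and the re-join on CASHIER_ID entirely.
-- Sketch: replaces IC_COUNT and the final join with one windowed count.
-- Partition keys mirror IC_COUNT's GROUP BY; >= 2 mirrors the final filter.
SELECT d.*
FROM IC_DTL d
QUALIFY COUNT(*) OVER (
    PARTITION BY d.STORE, d.SLS_DATE, d.SLS_TIME,
                 d.TRANS_ID, d.CASHIER_ID, d.GROSS_SALE_AMT
) >= 2;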
Hi, how are you doing? I have a sales table with DATE, TICKET_ID (transaction id) and PRODUCT_ID (product sold). I'd like a list of the items sold together PER DAY (that is, today product X was sold with product Y 10 times; yesterday product X was sold with product Y 5 times...).
I have this code; however, it has two problems:
1- It generates inverted duplicates. Example:
product_id   product_id_bought_with   counting
12345        98765                    130
98765        12345                    130
abcde        fghij                    88
fghij        abcde                    88
2- This code ran fine WITHOUT the date column. After I added it, the data volume is much larger and I get a limit error:
"Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 152% of limit. Top memory consumer(s): ORDER BY operations: 99% other/unattributed: 1%"
My code:
SELECT
c.DATE,
c.product_id,
c.product_id_bought_with,
count(*) counting
FROM ( SELECT a.DATE, a.product_id, b.product_id as product_id_bought_with
FROM `MY-TABLE` a
INNER JOIN `THE-SAME-TABLE` b
ON a.ID_TICKETS = b.ID_TICKETS
AND a.product_id != b.product_id
AND a.DATE = b.DATE
) c
GROUP BY DATE, product_id, product_id_bought_with
ORDER BY counting DESC
I'm open to new ideas on how to do this. Thanks in advance!
Edit: Sample data
CREATE TABLE `project_id.dataset.table_name` (
DAT_VTE DATE,
ID_TICKET STRING,
product_id int
);
INSERT INTO `project_id.dataset.table_name` (DAT_VTE, ID_TICKET, product_id)
VALUES
(DATE('2022-01-01'), '123_abc', 876123),
(DATE('2022-01-01'), '123_abc', 324324),
(DATE('2022-01-02'), '456_def', 876123),
(DATE('2022-01-02'), '456_def', 324324),
(DATE('2022-01-02'), '456_def', 432321),
(DATE('2022-05-23'), '987_xyz', 876123),
(DATE('2022-05-23'), '987_xyz', 324324)
For your requirement, you can try the below query:
with mytable as(
select *, row_number() over (partition by DAT_VTE, ID_TICKET) rownum from `project_id.dataset.MY-TABLE`
)
select DAT_VTE
,product_id
,product_id_bought_with
,count(*) counting
from (
select a.DAT_VTE,a.ID_TICKET,a.product_id as product_id, b.product_id as product_id_bought_with
from mytable a
join mytable b
ON a.ID_TICKET = b.ID_TICKET
AND a.DAT_VTE = b.DAT_VTE
and a.rownum < b.rownum
)
GROUP BY DAT_VTE, product_id, product_id_bought_with
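An alternative sketch that skips the ROW_NUMBER step entirely (assuming product_id is numeric, as in your sample DDL): ordering the pair inside the join condition emits each combination exactly once, so the inverted duplicates never appear.
-- Sketch: self-join with an ordered-pair condition instead of row numbers.
SELECT a.DAT_VTE
, a.product_id
, b.product_id as product_id_bought_with
, count(*) counting
FROM `project_id.dataset.table_name` a
JOIN `project_id.dataset.table_name` b
ON a.ID_TICKET = b.ID_TICKET
AND a.DAT_VTE = b.DAT_VTE
AND a.product_id < b.product_id -- keeps (x, y), drops the mirrored (y, x)
GROUP BY a.DAT_VTE, a.product_id, b.product_id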
Regarding the error you provided: resources-exceeded errors are usually triggered when an operation needs to gather all the data on a single computation unit, and if it doesn't fit, the job fails. Ordering a huge amount of data involves heavy computational resources that can be used far more effectively when the data is partitioned.
Below are two ways to resolve your issue:
1. Partitioning usually helps with resource issues, as given in the documentation (see the sketch after this list).
2. You can also try to split your query: write the results of every individual sub/inner query to another table as a temporary storage space for further processing.
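For option 1, a minimal sketch (the partitioned table's name is illustrative; the column comes from your sample DDL). Note also that the error message blames ORDER BY operations for 99% of memory use, and the rewritten query above already drops the final ORDER BY counting DESC.
-- Hypothetical partitioned copy of the sample table, partitioned on its DATE column.
CREATE TABLE `project_id.dataset.table_name_partitioned`
PARTITION BY DAT_VTE
AS SELECT * FROM `project_id.dataset.table_name`;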
I am trying to run a query to gather the total items on hand in our database. However, it seems I'm getting incorrect data. I am selecting just the amount field and summing it, using joins across separate tables based on certain parameters. However, if I display additional fields such as order number and date, all of a sudden I'm getting different data, even though those fields are being used as filters in the query. Is it because they're not in the select statement? If they need to be in the select statement, is it possible to not display them?
Here are the two queries.
-- Items On Hand
select CONVERT(decimal(25, 2), SUM(tw.amount)) as 'Amt'
from [Sales Header] sh
join
(
select *
from TWAllOrders
where [Status] like 'Released'
) tw
on tw.[Order Nb] = sh.No_
join
(
select *
from OnHand
) oh
on tw.No_ = oh.[Item No_]
where sh.[Requested Delivery Date] < getdate()
HAVING SUM(tw.Quantity) <= SUM(oh.Qty)
providing a sum of 21667457.20
and with the added columns
-- Items On Hand
select CONVERT(decimal(25, 2), SUM(tw.amount)) as 'Amt', [Requested Delivery Date], sh.No_, tw.[Status]
from [Sales Header] sh
join
(
select *
from TWAllOrders
where [Status] like 'Released'
) tw
on tw.[Order Nb] = sh.No_
join
(
select *
from OnHand
) oh
on tw.No_ = oh.[Item No_]
where sh.[Requested Delivery Date] < getdate()
group by sh.[Requested Delivery Date], sh.No_, tw.[Status]
HAVING SUM(tw.Quantity) <= SUM(oh.Qty)
order by sh.[Requested Delivery Date] ASC
Providing a sum of 12319998
I'm self-taught in SQL, so I may be misunderstanding something obvious. Thanks for the help.
With no sample data, I am going to have to demonstrate this in principle. In the latter query you have a GROUP BY, meaning the scope of the values in the HAVING differs, and thus the filtering from that HAVING differs too.
Let's take the following sample data:
CREATE TABLE dbo.MyTable (Grp char(1),
Quantity int,
Required int);
INSERT INTO dbo.MyTable (Grp, Quantity, [Required])
VALUES('a',2,7),
('a',14,2),
('b',4, 7),
('b',3,4),
('c',17,5);
Now we'll perform an overly simplified version of your query:
SELECT SUM(Quantity)
FROM dbo.MyTable
HAVING SUM(Quantity) > SUM(Required);
This brings back the value 40, which is the SUM of all the values in Quantity. A row is returned because the total SUM of Quantity (40) exceeds the total SUM of Required (25).
Now let's add a GROUP BY like your second query:
SELECT SUM(Quantity)
FROM dbo.MyTable
GROUP BY Grp
HAVING SUM(Quantity) > SUM(Required);
Now we have 2 rows, with the values 16 and 17, giving a total of 33. That's because the rows where Grp is 'b' are filtered out, as their SUM of Quantity (7) is lower than their SUM of Required (11).
The same is happening in your data; in the grouped data you have groups where the HAVING condition isn't met, so those rows aren't returned.
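If what you ultimately want is a single overall figure that still respects the per-group filter, one option (a sketch against the sample table above) is to aggregate the grouped result a second time:
-- Sums only the groups that survive the HAVING; returns 33 for the sample data.
SELECT SUM(GrpTotal) AS Amt
FROM (SELECT Grp,
             SUM(Quantity) AS GrpTotal
      FROM dbo.MyTable
      GROUP BY Grp
      HAVING SUM(Quantity) > SUM(Required)) AS filtered;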
I'm a rookie developer with basic SQL experience and this problem has been 'doing my head in' for the last couple of days. I've gone to ask a question here a couple times and thought... not yet... keep trying.
I have a table:
ID
Store
Product_Type
Delivery_Window
Despatch_Time
Despatch_Type
Pallets
Cartons
Many other columns (start_week and day_num are two of them)
My goal is to get a list of stores by product_type with the minimum despatch_time, along with all the other column information.
I've tested the base query.
SELECT Product_Type, Store, Min(Despatch_Time) as MinDes
FROM table
GROUP BY Store, Product_Type
It works well; I get 200 rows as expected.
Now I want those 200 rows to carry the other related record information: Delivery_Window, start_week, etc.
I've tried the following.
SELECT * FROM Table WHERE EXISTS
(SELECT Product_Type, Store, Min(Despatch_Time) as MinDes
FROM table
GROUP BY Store, Product_Type)
I've tried inner and right joins; all of them returned more than my original 200 records.
I inspected the additional records: they occur where the same despatch time exists for a store and product type, but with a different despatch type.
So I need a hand creating a query that limits results by the initial subquery, yet still returns only one record per store and product type even when minimum despatch times tie.
Current Query is:
SELECT *
FROM table AS A INNER JOIN
(Select Min(Despatch_Time) as MinDue, store, product_type
FROM table
WHERE day_num = [Forms]![FRM_SomeForm]![combo_del_day] AND start_week =[Forms]![FRM_SomeForm]![txt_date1]
GROUP BY store, product_type) AS B
ON (A.product_type = B.product_type) AND (A.store = B.store) AND (A.Despatch_Time = B.MinDue);
I think you want:
SELECT t.*
FROM table as t
WHERE t.Despatch_Time = (SELECT MIN(t2.Despatch_Time)
FROM table as t2
WHERE t2.Store = t.Store AND t2.Product_Type = t.Product_Type);
The above can return duplicates when two rows tie on the minimum despatch time. In order to avoid duplicates, you need a key to provide uniqueness. Let me assume you have a primary key pk:
SELECT t.*
FROM table as t
WHERE t.pk = (SELECT TOP 1 t2.pk
FROM table as t2
WHERE t2.Store = t.Store AND t2.Product_Type = t.Product_Type
ORDER BY t2.Despatch_Time, t2.pk
);
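Applied to your Access query, the second version would look something like this (a sketch; ID is assumed to be your table's unique key, per your column list, and the form filters are repeated inside the subquery so the minimum is taken over the same filtered set):
SELECT t.*
FROM table AS t
WHERE t.day_num = [Forms]![FRM_SomeForm]![combo_del_day]
AND t.start_week = [Forms]![FRM_SomeForm]![txt_date1]
AND t.ID = (SELECT TOP 1 t2.ID
            FROM table AS t2
            WHERE t2.Store = t.Store
              AND t2.Product_Type = t.Product_Type
              AND t2.day_num = t.day_num
              AND t2.start_week = t.start_week
            ORDER BY t2.Despatch_Time, t2.ID);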
I am using Teradata SQL Assistant connected to an enterprise DW. I have written the query below to show an inventory of outstanding items as of a specific point in time. The referenced table loads and stores new records as changes are made to their state, by load date (it does not delete historical records). The output of my query is one row for the specified date. Can I create a stored procedure or recursive query of some sort to build a history of these summary rows, with one new row per day? I have not used such functions in the past; links to pertinent previously answered questions, or suggestions on how I could get on the right track researching other possible solutions, are totally fine if applicable; I'm just trying to bridge this gap in my knowledge.
SELECT
'2017-10-02' as Dt
,COUNT(DISTINCT A.RECORD_NBR) as Pending_Records
,SUM(A.PAY_AMT) AS Total_Pending_Payments
FROM DB.RECORD_HISTORY A
INNER JOIN
(SELECT MAX(LOAD_DT) AS LOAD_DT
,RECORD_NBR
FROM DB.RECORD_HISTORY
WHERE LOAD_DT <= '2017-10-02'
GROUP BY RECORD_NBR
) B
ON A.RECORD_NBR = B.RECORD_NBR
AND A.LOAD_DT = B.LOAD_DT
WHERE
A.RECORD_ORDER =1 AND Final_DT Is Null
GROUP BY Dt
ORDER BY 1 desc
Here is my interpretation of your query: for the most recent load_dt (up until 2017-10-02) for record_order #1, return
1) the number of distinct pending records
2) the total amount of pending payments
Is this correct? If you're looking for this info but want one row for each "Load_Dt", you just need to remove that INNER JOIN:
SELECT
load_Dt,
COUNT(DISTINCT record_nbr) AS Pending_Records,
SUM(pay_amt) AS Total_Pending_Payments
FROM DB.record_history
WHERE record_order = 1
AND final_Dt IS NULL
GROUP BY load_Dt
ORDER BY 1 DESC
If you want to get the summary info per record_order, just add record_order as a grouping column:
SELECT
load_Dt,
record_order,
COUNT(DISTINCT record_nbr) AS Pending_Records,
SUM(pay_amt) AS Total_Pending_Payments
FROM DB.record_history
WHERE final_Dt IS NULL
GROUP BY load_Dt, record_order
ORDER BY 1,2 DESC
If you want to get one row per day (if there are calendar days with no corresponding "load_dt" days), then you can SELECT from the sys_calendar.calendar view and LEFT JOIN the query above on the "load_dt" field:
SELECT cal.calendar_date, src.Pending_Records, src.Total_Pending_Payments
FROM sys_calendar.calendar cal
LEFT JOIN (
SELECT
load_Dt,
COUNT(DISTINCT record_nbr) AS Pending_Records,
SUM(pay_amt) AS Total_Pending_Payments
FROM DB.record_history
WHERE record_order = 1
AND final_Dt IS NULL
GROUP BY load_Dt
) src ON cal.calendar_date = src.load_Dt
WHERE cal.calendar_date BETWEEN <start_date> AND <end_date>
ORDER BY 1 DESC
I don't have access to a TD system, so you may get syntax errors. Let me know if that works or if you're looking for something else.
I have those two tables
1-Add to queue table
TransID , ADD date
10 , 10/10/2012
11 , 14/10/2012
11 , 18/11/2012
11 , 25/12/2012
12 , 1/1/2013
2-Removed from queue table
TransID , Removed Date
10 , 15/1/2013
11 , 12/12/2012
11 , 13/1/2013
11 , 20/1/2013
TransID is the key between the two tables, and I can't modify those tables. What I want is to query the amount of time each transaction spent in the queue.
It's easy when there is one item in each table, but when an item gets queued more than once, how do I calculate that?
Assuming the order TransIDs are entered into the Add table is the same order they are removed, you can use the following:
WITH OrderedAdds AS
( SELECT TransID,
AddDate,
[RowNumber] = ROW_NUMBER() OVER(PARTITION BY TransID ORDER BY AddDate)
FROM AddTable
), OrderedRemoves AS
( SELECT TransID,
RemovedDate,
[RowNumber] = ROW_NUMBER() OVER(PARTITION BY TransID ORDER BY RemovedDate)
FROM RemoveTable
)
SELECT OrderedAdds.TransID,
OrderedAdds.AddDate,
OrderedRemoves.RemovedDate,
[DaysInQueue] = DATEDIFF(DAY, OrderedAdds.AddDate, ISNULL(OrderedRemoves.RemovedDate, CURRENT_TIMESTAMP))
FROM OrderedAdds
LEFT JOIN OrderedRemoves
ON OrderedAdds.TransID = OrderedRemoves.TransID
AND OrderedAdds.RowNumber = OrderedRemoves.RowNumber;
The key part is that each record gets a row number based on the transaction id and the date it was entered; you can then join on both RowNumber and TransID to stop any cross joining.
Example on SQL Fiddle
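For reference, run against the sample rows in the question (reading the dates as day/month/year), the row-number pairing should produce output along these lines:
TransID   AddDate       RemovedDate   DaysInQueue
10        2012-10-10    2013-01-15    97
11        2012-10-14    2012-12-12    59
11        2012-11-18    2013-01-13    56
11        2012-12-25    2013-01-20    26
12        2013-01-01    NULL          (days from 2013-01-01 to the current date)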
DISCLAIMER: There is probably a problem with this, but I hope it sends you in one possible direction. Make sure to expect problems.
You can try in the following direction (which might work in some way depending on your system, version, etc) :
SELECT transId, (SUM(remove_date_sum) - SUM(add_date_sum)) / (60*60*24)
FROM
(
SELECT transId, SUM(UNIX_TIMESTAMP(add_date)) as add_date_sum, 0 as remove_date_sum
FROM add_to_queue
GROUP BY transId
UNION ALL
SELECT transId, 0 as add_date_sum, SUM(UNIX_TIMESTAMP(remove_date)) as remove_date_sum
FROM remove_from_queue
GROUP BY transId
) t
GROUP BY transId;
A bit of explanation: as far as I know, you cannot sum dates directly, but you can convert them to some sort of timestamp. Check whether UNIX_TIMESTAMP works for you, or figure out something else. Then you can sum within each table, build a union by conveniently leaving the other column as zero, and take the difference of the two sums in the outer query.
As for that division at the end of the first SELECT: UNIX_TIMESTAMP returns seconds, so you divide by 60*60*24 to get days - or whatever granularity it is that you want.
This all said - I would probably solve this using a stored procedure or some client script. SQL is not a weapon for every battle. Making two separate queries can be much simpler.
Answer 2: after your comments. (As a side note, some of your dates, such as 15/1/2013 and 13/1/2013, are not in a standard date format.)
select transId, sum(numberOfDays) totalQueueTime
from (
select a.transId,
datediff(day,a.addDate,isnull(r.removeDate,a.addDate)) numberOfDays
from AddTable a left join RemoveTable r on a.transId = r.transId
) X
group by transId
Answer 1: before your comments
Assuming that there won't be a new record added unless the previous one has been removed. Also note the following query will bring numberOfDays back as zero for unremoved records:
select a.transId, a.addDate, r.removeDate,
datediff(day,a.addDate,isnull(r.removeDate,a.addDate)) numberOfDays
from AddTable a left join RemoveTable r on a.transId = r.transId
order by a.transId, a.addDate, r.removeDate