T-SQL aggregate over contiguous dates more efficiently

I need to aggregate a sum over contiguous dates. I've seen solutions to similar problems that return the start and end dates, but they don't aggregate the data between those ranges. It's further complicated by the extremely large amount of data involved, to the point that a simple self join takes an impractical amount of time (especially since the start and end date fields are unindexed).
I have a solution involving cursors, but I've generally been led to believe that cursors can always be replaced with joins that execute more efficiently. So far, though, every set-based query that comes anywhere close to giving me the data I need takes at least an hour, while my cursor solution takes about 10 seconds. So I'm asking whether there is a more efficient answer.
The data includes both buy and sell transactions, and each row of aggregated contiguous dates returned also needs to list the transaction ID of the last sell that occurred before the first buy of the contiguous set of buy transactions.
An example of the data:
+------------------+----------+-----------+----------+---------+
| TRANSACTION_TYPE | TRANS_ID | StartDate | EndDate  | Amount  |
+------------------+----------+-----------+----------+---------+
| sell             | 100      | 2/16/16   | 2/18/16  | $100.00 |
| sell             | 101      | 3/1/16    | 6/6/16   | $121.00 |
| buy              | 102      | 6/10/16   | 6/12/16  | $22.00  |
| buy              | 103      | 6/12/16   | 6/14/16  | $0.35   |
| buy              | 104      | 6/29/16   | 7/2/16   | $5.00   |
| sell             | 105      | 7/3/16    | 7/6/16   | $115.00 |
| buy              | 106      | 7/8/16    | 7/9/16   | $200.00 |
| sell             | 107      | 7/10/16   | 7/13/16  | $4.35   |
| sell             | 108      | 7/17/16   | 7/20/16  | $0.50   |
| buy              | 109      | 7/25/16   | 7/29/16  | $33.00  |
| buy              | 110      | 7/29/16   | 8/1/16   | $75.00  |
| buy              | 111      | 8/1/16    | 8/3/16   | $0.33   |
| sell             | 112      | 9/1/16    | 9/2/16   | $99.00  |
+------------------+----------+-----------+----------+---------+
Should have results like the following:
+-----------+-----------+----------+---------+
| Last_Sell | StartDate | EndDate  | Amount  |
+-----------+-----------+----------+---------+
| 101       | 6/10/16   | 6/14/16  | $22.35  |
| 101       | 6/29/16   | 7/2/16   | $5.00   |
| 105       | 7/8/16    | 7/9/16   | $200.00 |
| 108       | 7/25/16   | 8/3/16   | $108.33 |
+-----------+-----------+----------+---------+
Right now I use queries to split the data into buys and sells, and I walk through the buy data, aggregating as I go and inserting a row into the return table every time I find a break in the dates; alongside that, I step through the sell table until I reach the last sell before the start date of the current set of buys.
Walking linearly through cursors gives me O(n) computation. Even though cursors are orders of magnitude less efficient per row, the work is still linear, while I suspect the joins I would need would cost at least O(n log n). With the ridiculous amount of data I'm working with, the per-row inefficiency of cursors is swamped once anything goes beyond linear time.
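For reference, a rough sketch of the kind of cursor walk described above (the table name dbo.Transactions and the variable names are illustrative, not the real schema, and the last sell is looked up once per group rather than walked with a second cursor, just to keep the sketch short):

DECLARE @results TABLE (Last_Sell int, StartDate date, EndDate date, Amount money);
DECLARE @start date, @end date, @amt money;
DECLARE @grpStart date, @grpEnd date, @grpAmt money;

DECLARE buy_cur CURSOR FAST_FORWARD FOR
    SELECT StartDate, EndDate, Amount
    FROM dbo.Transactions
    WHERE TRANSACTION_TYPE = 'buy'
    ORDER BY StartDate;

OPEN buy_cur;
FETCH NEXT FROM buy_cur INTO @start, @end, @amt;

WHILE @@FETCH_STATUS = 0
BEGIN
    IF @grpStart IS NULL
        SELECT @grpStart = @start, @grpEnd = @end, @grpAmt = @amt;   -- first buy of a group
    ELSE IF @start <= DATEADD(day, 1, @grpEnd)
        SELECT @grpEnd = @end, @grpAmt = @grpAmt + @amt;             -- contiguous: extend the group
    ELSE
    BEGIN
        -- break in the dates: close the group and record the last sell before it started
        INSERT @results
        SELECT MAX(TRANS_ID), @grpStart, @grpEnd, @grpAmt
        FROM dbo.Transactions
        WHERE TRANSACTION_TYPE = 'sell' AND StartDate < @grpStart;

        SELECT @grpStart = @start, @grpEnd = @end, @grpAmt = @amt;
    END;

    FETCH NEXT FROM buy_cur INTO @start, @end, @amt;
END;

IF @grpStart IS NOT NULL          -- flush the final group
    INSERT @results
    SELECT MAX(TRANS_ID), @grpStart, @grpEnd, @grpAmt
    FROM dbo.Transactions
    WHERE TRANSACTION_TYPE = 'sell' AND StartDate < @grpStart;

CLOSE buy_cur;
DEALLOCATE buy_cur;

SELECT * FROM @results;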

If we assume that the transaction ID increases along with the dates, then we can get the last sell using a cumulative max. The adjacency can then be found with similar logic, but using a lag first:
with cte as (
      select t.*,
             max(case when transaction_type = 'sell' then trans_id end) over
                 (order by trans_id) as last_sell,
             lag(enddate) over (partition by transaction_type order by trans_id) as prev_enddate
      from t
     )
select last_sell, min(startdate) as startdate, max(enddate) as enddate,
       sum(amount) as amount
from (select cte.*,
             -- a buy starts a new group unless it begins on, or the day after, the previous buy's end date
             sum(case when startdate <= dateadd(day, 1, prev_enddate) then 0 else 1 end) over
                 (partition by last_sell order by trans_id) as grp
      from cte
      where transaction_type = 'buy'
     ) x
group by last_sell, grp;

Related

SQL/Power BI Joins without common column

So I have the following problem:
I have 2 tables: one containing different bids for a product_type, and one containing the price, date, etc. at which the product was sold.
The tables look like this:
Table bids:
+--------+---------------------+---------------------+--------------+-------+
| Bid_id | Start_time          | End_time            | Product_type | price |
+--------+---------------------+---------------------+--------------+-------+
| 1      | 18.01.2020 06:00:00 | 18.01.2020 06:02:33 | blue         | 5 €   |
| 2      | 18.01.2020 06:00:07 | 18.01.2020 06:00:43 | blue         | 7 €   |
| 3      | 18.01.2020 06:01:10 | 19.01.2020 15:03:15 | red          | 3 €   |
| 4      | 18.01.2020 06:02:20 | 18.01.2020 06:05:44 | blue         | 6 €   |
+--------+---------------------+---------------------+--------------+-------+
Table sells:
+---------+---------------------+--------------+--------+
| Sell_id | Sell_time           | Product_type | Price  |
+---------+---------------------+--------------+--------+
| 1       | 18.01.2020 06:00:31 | Blue         | 6,50 € |
| 2       | 18:01.2020 06:51:03 | Red          | 2,50 € |
+---------+---------------------+--------------+--------+
The sell_id and the bid_id have no relation with each other.
What I want to find out is the maximum bid at the time we sold the product_type. So if we take sell_id 1, it should check which bids for this specific product_type were active during the sell_time (in this case bid_id 1 and 2) and return the higher price (in this case bid_id 2).
I tried to solve this problem in Power BI, but I was not able to get a solution. I assume that I have to work with SQL joins to solve it.
Is it possible to join based on criteria instead of matching columns? Something like:
SELECT bids.start_time, bids.end_time, bids.product_type, MAX(bids.price), sells.sell_time, sells.product_type, sells.price
FROM sells
INNER JOIN bids ON bids.start_time<sells.sell_time AND bids.end_time > sells.sell_time;
I am sorry if this question is confusing, I am still new to this. Thanks in advance for ANY help!
Your sample data Sell_time should be 18.01.2020, right? You can try the code below (it can be resource-intensive relative to the amount of data, because of the Cartesian-style join). If you are sure that the sell day always falls on the bid's start day, you can add a date column to both tables and add a further filter such as TREATAS ( VALUES ( bids[day] ), sells[day] ).
Test =
VAR __treatasfilter =
    TREATAS ( VALUES ( bids[Product_type] ), sells[Product_type] )
RETURN
    SUMMARIZE (
        FILTER (
            SUMMARIZECOLUMNS (
                sells[Sell_id],
                bids[Price],
                bids[Start_time],
                sells[Sell_time],
                bids[End_time],
                sells[Product_type],
                __treatasfilter
            ),
            [Start_time] <= [Sell_time]
                && [End_time] >= [Sell_time]
        ),
        sells[Sell_id],
        "MaxPrice", MAX ( bids[Price] )
    )
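For comparison, the non-equi join the question was reaching for also works in plain SQL; a minimal sketch using the sample table and column names (the Product_type match is added so each sale is only compared with bids of the same type, and whether Blue/blue match depends on your collation):

SELECT s.Sell_id,
       s.Sell_time,
       s.Product_type,
       MAX(b.price) AS max_active_bid      -- highest bid that was open at the moment of the sale
FROM sells AS s
JOIN bids  AS b
  ON  b.Product_type = s.Product_type
  AND b.Start_time  <= s.Sell_time
  AND b.End_time    >= s.Sell_time
GROUP BY s.Sell_id, s.Sell_time, s.Product_type;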

how to join tables on cases where none of function(a) in b

Say in MonetDB (specifically, the embedded version from the "MonetDBLite" R package) I have a table "events" containing entity ID codes and event start and end dates, of the format:
| id  | start_date | end_date   |
| 1   | 2010-01-01 | 2010-03-30 |
| 1   | 2010-04-01 | 2010-06-30 |
| 2   | 2018-04-01 | 2018-06-30 |
| ... | ...        | ...        |
The table is approximately 80 million rows of events, attributable to approximately 2.5 million unique entities (ID values). The dates appear to align nicely with calendar quarters, but I haven't thoroughly checked them so assume they can be arbitrary. However, I have at least sense-checked them for end_date > start_date.
I want to produce a table "nonevent_qtrs" listing calendar quarters where an ID has no event recorded, e.g.:
| id  | last_doq   |
| 1   | 2010-09-30 |
| 1   | 2010-12-31 |
| ... | ...        |
| 1   | 2018-06-30 |
| 2   | 2010-03-30 |
| ... | ...        |
(doq = day of quarter)
If the extent of an event spans any days of the quarter (including the first and last dates), then I wish for it to count as having occurred in that quarter.
To help with this, I have produced a "calendar table"; a table of quarters "qtrs", covering the entire span of dates present in "events", and of the format:
| first_doq  | last_doq   |
| 2010-01-01 | 2010-03-30 |
| 2010-04-01 | 2010-06-30 |
| ...        | ...        |
And tried using a non-equi merge like so:
create table nonevents as
select id, last_doq
from events
full outer join qtrs
  on start_date > last_doq or
     end_date < first_doq
group by id, last_doq
But this is a) terribly inefficient and b) certainly wrong, since most IDs are listed as being non-eventful for all quarters.
How can I produce the table "nonevent_qtrs" I described, which contains a list of quarters for which each ID had no events?
If it's relevant, the ultimate use-case is to calculate runs of non-events to look at time-till-event analysis and prediction. Feels like run length encoding will be required. If there's a more direct approach than what I've described above then I'm all ears. The only reason I'm focusing on non-event runs to begin with is to try to limit the size of the cross-product. I've also considered producing something like:
| id  | last_doq   | event |
| 1   | 2010-01-31 | 1     |
| ... | ...        | ...   |
| 1   | 2018-06-30 | 0     |
| ... | ...        | ...   |
But although more useful, this may not be feasible due to the size of the data involved. A wide format:
| id  | 2010-01-31 | ... | 2018-06-30 |
| 1   | 1          | ... | 0          |
| 2   | 0          | ... | 1          |
| ... | ...        | ... | ...        |
would also be handy, but since MonetDB is column-store I'm not sure whether this is more or less efficient.
Let me assume that you have a table of quarters, with the start date of a quarter and the end date. You really need this if you want the quarters that don't exist. After all, how far back in time or forward in time do you want to go?
Then, you can generate all id/quarter combinations and filter out the ones that exist:
select i.id, q.*
from (select distinct id from events) i cross join
     quarters q left join
     events e
     on e.id = i.id and
        e.start_date <= q.quarter_end and
        e.end_date >= q.quarter_start
where e.id is null;
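If the qtrs table does not exist yet, it only has to be built once. A sketch in standard SQL with a recursive CTE, using the question's qtrs column names and an assumed date range of 2010 through mid-2018 (MonetDB's support for recursive CTEs varies by version, so generating this small table in R and writing it into MonetDBLite is a perfectly good fallback):

create table qtrs as
with recursive q (first_doq) as (
    select date '2010-01-01'                 -- assumed first quarter start
    union all
    select first_doq + interval '3' month
    from q
    where first_doq < date '2018-04-01'      -- assumed last quarter start
)
select first_doq,
       first_doq + interval '3' month - interval '1' day as last_doq   -- calendar quarter end (e.g. 2010-03-31)
from q;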

Is there a way to create a "pivot group" of columns with t-sql?

This seems like a common need but, unfortunately, I can't find a solution.
Assume you have a query that outputs the following content:
| TimeFrame  | User       | Metric1 | Metric2 |
+------------+------------+---------+---------+
| TODAY      | John Doe   | 10      | 20      |
| MONTHTODAY | John Doe   | 100     | 200     |
| TODAY      | Jack Frost | 15      | 25      |
| MONTHTODAY | Jack Frost | 150     | 250     |
What I need as output after a pivot is data that looks like this:
| User       | TODAY_Metric1 | TODAY_Metric2 | MONTHTODAY_Metric1 | MONTHTODAY_Metric2 |
+------------+---------------+---------------+--------------------+--------------------+
| John Doe   | 10            | 20            | 100                | 200                |
| Jack Frost | 15            | 25            | 150                | 250                |
Note that I'm pivoting on TimeFrame; Metric1 and Metric2 remain columns, but are grouped by the time frame values.
Can this be done within standard PIVOT syntax or will I need to write a more complex query to pull this data together in a result set specific to my needs?
You can do this with conditional aggregation:
select [User],
       sum(case when TimeFrame = 'TODAY' then Metric1 end) as TODAY_Metric1,
       sum(case when TimeFrame = 'TODAY' then Metric2 end) as TODAY_Metric2,
       sum(case when TimeFrame = 'MONTHTODAY' then Metric1 end) as MONTHTODAY_Metric1,
       sum(case when TimeFrame = 'MONTHTODAY' then Metric2 end) as MONTHTODAY_Metric2
from mytable
group by [User];
I tend to prefer the conditional aggregation technique over the vendor-specific implementations, because:
- I find it simpler to understand and maintain
- it is cross-RDBMS (so you can easily port it to some other database if needed)
- it usually performs as well as, or even better than, the vendor implementation (which usually relies on it under the hood)
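For completeness, the same result with T-SQL's PIVOT operator is possible but clumsier, because PIVOT handles a single value column: the two metric columns first have to be unpivoted into one. A sketch using the question's column names and the mytable name from the query above:

select [User],
       [TODAY_Metric1], [TODAY_Metric2], [MONTHTODAY_Metric1], [MONTHTODAY_Metric2]
from (
    select t.[User],
           t.TimeFrame + '_' + m.MetricName as col_name,   -- e.g. 'TODAY_Metric1'
           m.MetricValue
    from mytable t
    cross apply (values ('Metric1', t.Metric1),
                        ('Metric2', t.Metric2)) m (MetricName, MetricValue)
) src
pivot (sum(MetricValue)
       for col_name in ([TODAY_Metric1], [TODAY_Metric2],
                        [MONTHTODAY_Metric1], [MONTHTODAY_Metric2])) p;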

How to optimize nested inner hive query

I have a table with the following stock data, with columns such as date, ticker, open and close (stock prices).
To query this data, I want to know, for each stock, the date on which it gave the highest margin. So if I have 516 different stocks, my query should return 516 rows of ticker, date, open, close and a new column Margin (which will be max(close - open)).
+--------------------+---------------------+-------------------+--------------------+
| deep_stocks.date_  | deep_stocks.ticker  | deep_stocks.open  | deep_stocks.close  |
+--------------------+---------------------+-------------------+--------------------+
| 20100721           | A                   | 27.68             | 27.58              |
| 20100722           | A                   | 27.95             | 28.72              |
| 20100723           | A                   | 28.56             | 29.3               |
| 20100726           | A                   | 29.22             | 29.64              |
| 20100727           | A                   | 29.73             | 28.87              |
| 20100728           | A                   | 28.79             | 28.78              |
| 20100729           | A                   | 28.97             | 28.15              |
| 20100730           | A                   | 27.78             | 27.93              |
| 20100802           | A                   | 28.35             | 28.82              |
| 20100803           | A                   | 28.7              | 27.84              |
+--------------------+---------------------+-------------------+--------------------+
I have written a query where my approach was:
Step 1 - Get the difference between Close and Open prices (Inner/Sub query)
Step 2 - Get the maximum of margin for every stock (used group by with max function)
Step 3 - Join the results with Main Table and get the data.
I'll put my query below; can someone please correct it, as it is taking too much time? I would also like to know whether there is an alternative approach.
Following the approach described above, here is my query:
SELECT ds.ticker, ds.date_, ds.close, ds.open, ds.Margin
FROM
  (SELECT ticker, date_, close, open,
          CASE (close - open) > 0 WHEN true THEN round(close - open, 2) ELSE 0 END AS Margin
   FROM DataStocks) ds
JOIN
  (SELECT dsIn.ticker, max(dsIn.Margin) AS mxMargin
   FROM (SELECT ticker,
                CASE (close - open) > 0 WHEN true THEN round(close - open, 2) ELSE 0 END AS Margin
         FROM DataStocks) dsIn
   GROUP BY dsIn.ticker) dsEx
ON ds.ticker = dsEx.ticker AND ds.Margin = dsEx.mxMargin
ORDER BY ds.Margin;
Do we have any other alternatives for this query, or is it possible to optimize it further?
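One alternative worth sketching (assuming a Hive version with windowing functions, 0.11 or later): compute the margin once and rank it per ticker in the same pass, so DataStocks is scanned only once and the self-join disappears.

SELECT ticker, date_, open, close, Margin
FROM (
    SELECT ticker, date_, open, close,
           CASE WHEN close - open > 0 THEN round(close - open, 2) ELSE 0 END AS Margin,
           row_number() OVER (PARTITION BY ticker ORDER BY (close - open) DESC) AS rn
    FROM DataStocks
) t
WHERE rn = 1          -- one row per ticker; use rank() instead if you want to keep ties
ORDER BY Margin;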

Sum of transactions between varying dates

Background
I have a table containing transactions. There are two types of transactions: "normal entries" (type=N), and "fix" entries (type=F). Each transaction has a client-ID, a date, a type code, and an EUR amount. Some example data is below:
| client_id | date      | transaction_type | amount |
|-----------|-----------|------------------|--------|
| 111       | 01jan2015 | N                | 1000.0 |
| 111       | 01jan2015 | F                | -500.0 |
| 222       | 05mar2015 | N                | 2000.0 |
| 222       | 06mar2015 | F                | -100.0 |
| 222       | 07mar2015 | F                | -100.0 |
| 222       | 09mar2015 | N                | 1000.0 |
| 222       | 10mar2015 | N                | 400.0  |
| 222       | 15jun2015 | F                | -200.0 |
The fix entries are manual corrections to normal transactions made by someone at the register. They can be done on the same day or after the normal transaction, but if a new normal transaction is entered for the same client, all that client's consecutive fixes concern the new transaction (until yet another normal transaction is entered). So in effect, all fixes are "fixing" only the latest transaction of that client.
The fixes can be positive or negative numbers, the normal transactions only positive.
Desired output
What I want is a set of "normal" transactions per client, with a sum amount corrected by all the fixes related to that transaction. Example data below:
| client_id | date      | amount |
|-----------|-----------|--------|
| 111       | 01jan2015 | 500.0  |
| 222       | 05mar2015 | 1800.0 |
| 222       | 09mar2015 | 1000.0 |
| 222       | 10mar2015 | 200.0  |
So this is a sum of one transaction of type N and all the consecutive F-transactions up until the next N-transaction.
What I have so far
If all the fixes happen on the same date as the original transaction (as is usually the case), this is very simple:
select client_id, date, sum(amount)
from transaction_table
group by client_id, date
However, I'm having problems handling fixes that happen after the original transaction date, because I need to pick only those that happen before the next normal transaction (and this needs to apply for each normal transaction).
A note on products in use
I'm actually using SAS 9.4, but through SAS's proc sql procedure I can apply basic SQL and that's what I'm more comfortable using. Nothing fancy though (so cursors, CTEs and such are out). A nice SAS answer will be accepted, too!
Create a grouping flag that increments at every N.
What happens if there are multiple purchases on the same day?
data want;
    set have;                    /* assumes have is sorted by client_id and date */
    by client_id;
    retain purchaseGroup;
    if transaction_type = 'N' then purchaseGroup + 1;
    if first.client_id then purchaseGroup = 1;
run;
Then summarize using a SQL step, grouping by client_id and purchaseGroup.
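A minimal sketch of that SQL step, assuming the data step above produced a data set named want with the purchaseGroup variable (want_summary is an illustrative name; the date kept for each group is the earliest one, i.e. the date of the N transaction that starts the group):

proc sql;
    create table want_summary as
    select client_id,
           min(date) as date format=date9.,   /* date of the N transaction */
           sum(amount) as amount              /* N amount corrected by its fixes */
    from want
    group by client_id, purchaseGroup
    order by client_id;
quit;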