Sum of transactions between varying dates - sql

Background
I have a table containing transactions. There are two types of transactions: "normal entries" (type=N), and "fix" entries (type=F). Each transaction has a client-ID, a date, a type code, and an EUR amount. Some example data is below:
| client_id | date | transaction_type | amount |
|-----------|-----------|------------------|---------|
| 111 | 01jan2015 | N | 1000.0 |
| 111 | 01jan2015 | F | -500.0 |
| 222 | 05mar2015 | N | 2000.0 |
| 222 | 06mar2015 | F | -100.0 |
| 222 | 07mar2015 | F | -100.0 |
| 222 | 09mar2015 | N | 1000.0 |
| 222 | 10mar2015 | N | 400.0 |
| 222 | 15jun2015 | F | -200.0 |
The fix entries are manual corrections to normal transactions made by someone at the register. They can be entered on the same day as the normal transaction or later, but once a new normal transaction is entered for the same client, all of that client's subsequent fixes apply to the new transaction (until yet another normal transaction is entered). So in effect, every fix is "fixing" only the latest normal transaction of that client.
The fixes can be positive or negative numbers, the normal transactions only positive.
Desired output
What I want is a set of "normal" transactions per client, with a sum amount corrected by all the fixes related to that transaction. Example data below:
| client_id | date      | amount |
|-----------|-----------|--------|
| 111       | 01jan2015 | 500.0  |
| 222       | 05mar2015 | 1800.0 |
| 222       | 09mar2015 | 1000.0 |
| 222       | 10mar2015 | 200.0  |
So this is a sum of one transaction of type N and all the consecutive F-transactions up until the next N-transaction.
What I have so far
If all the fixes happen on the same date as the original transaction (as is usually the case), this is very simple:
select client_id, date, sum(amount)
from transaction_table
group by client_id, date
However, I'm having problems handling fixes that happen after the original transaction date, because I need to pick only those that happen before the next normal transaction (and this needs to apply for each normal transaction).
A note on products in use
I'm actually using SAS 9.4, but through SAS's proc sql procedure I can apply basic SQL and that's what I'm more comfortable using. Nothing fancy though (so cursors, CTEs and such are out). A nice SAS answer will be accepted, too!

Create a grouping counter that is incremented at every N.
What happens if there are multiple purchases on the same day?
data want;
    set have;            /* assumes have is sorted by client_id and date */
    by client_id;
    if first.client_id then purchaseGroup = 0;
    /* sum statement: purchaseGroup is implicitly retained across rows */
    if transaction_type = 'N' then purchaseGroup + 1;
run;
Then summarize with a PROC SQL step, grouping by client_id and purchaseGroup, as sketched below.
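A sketch of that step (assuming the data step above and the column names from the question; want_summary is a made-up output name):
proc sql;
    create table want_summary as
    select client_id,
           min(date) as date format=date9.,
           sum(amount) as amount
    from want
    group by client_id, purchaseGroup
    order by client_id, purchaseGroup;
quit;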

Related

T-SQL aggregate over contiguous dates more efficiently

I have a need to aggregate a sum over contiguous dates. I've seen solutions to similar problems that return the start and end dates, but they didn't need to aggregate the data within those ranges. It's further complicated by the extremely large amount of data involved, to the point that a simple self join takes an impractical amount of time (especially since the start and end date fields are unindexed).
I have a solution involving cursors, but I've generally been led to believe that cursors can always be replaced with joins that will execute faster. So far, though, every join-based query I've tried that comes anywhere close to giving me the data I need takes at least an hour, while my cursor solution takes about 10 seconds. So I'm asking if there is a more efficient answer.
The data also includes both buy and sell transactions, and each row of aggregated contiguous dates returned needs to list the transaction ID of the last sell that occurred before the first buy of that contiguous set of buy transactions.
An example of the data:
+------------------+----------+-----------+---------+---------+
| TRANSACTION_TYPE | TRANS_ID | StartDate | EndDate | Amount  |
+------------------+----------+-----------+---------+---------+
| sell             | 100      | 2/16/16   | 2/18/16 | $100.00 |
| sell             | 101      | 3/1/16    | 6/6/16  | $121.00 |
| buy              | 102      | 6/10/16   | 6/12/16 | $22.00  |
| buy              | 103      | 6/12/16   | 6/14/16 | $0.35   |
| buy              | 104      | 6/29/16   | 7/2/16  | $5.00   |
| sell             | 105      | 7/3/16    | 7/6/16  | $115.00 |
| buy              | 106      | 7/8/16    | 7/9/16  | $200.00 |
| sell             | 107      | 7/10/16   | 7/13/16 | $4.35   |
| sell             | 108      | 7/17/16   | 7/20/16 | $0.50   |
| buy              | 109      | 7/25/16   | 7/29/16 | $33.00  |
| buy              | 110      | 7/29/16   | 8/1/16  | $75.00  |
| buy              | 111      | 8/1/16    | 8/3/16  | $0.33   |
| sell             | 112      | 9/1/16    | 9/2/16  | $99.00  |
+------------------+----------+-----------+---------+---------+
Should have results like the following:
+-----------+-----------+---------+---------+
| Last_Sell | StartDate | EndDate | Amount  |
+-----------+-----------+---------+---------+
| 101       | 6/10/16   | 6/14/16 | $22.35  |
| 101       | 6/29/16   | 7/2/16  | $5.00   |
| 105       | 7/8/16    | 7/9/16  | $200.00 |
| 108       | 7/25/16   | 8/3/16  | $108.33 |
+-----------+-----------+---------+---------+
Right now I use queries to split the data into buys and sells and walk through the buy data, aggregating as I go and inserting into the return table every time I find a break in the dates; for each set of buys I step through the sell table until I reach the last sell before the start date of that set.
Walking linearly through cursors gives me O(n) computation time. Even though cursors are orders of magnitude less efficient per row, the work is still linear, while I suspect the joins I would need to do would be at least O(n log n). With the ridiculous amount of data I'm working with, the inefficiencies of cursors get swamped if the alternative goes beyond linear time.
If we assume that the transaction id increases along with the dates, then you can get the id of the last sell using a cumulative max. The adjacency can then be found using similar logic, but with a lag first:
with cte as (
      select t.*,
             max(case when transaction_type = 'sell' then trans_id end) over
                 (order by trans_id) as last_sell,
             lag(enddate) over (partition by transaction_type order by trans_id) as prev_enddate
      from t
     )
select last_sell, min(startdate) as startdate, max(enddate) as enddate,
       sum(amount) as amount
from (select cte.*,
             -- a buy starts a new group when it neither touches nor is adjacent to the previous buy's end date
             sum(case when startdate <= dateadd(day, 1, prev_enddate) then 0 else 1 end) over
                 (partition by last_sell order by trans_id) as grp
      from cte
      where transaction_type = 'buy'
     ) x
group by last_sell, grp;
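To see why this works: the inner windowed sum increments grp only when a buy's startdate is not adjacent to (or touching) the previous buy's enddate, so each contiguous run of buys within a last_sell partition shares one grp value and collapses to a single row in the outer aggregate.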

How should I create an SQL table with stock information so that I can add new stocks and new fields easily?

I want to create an SQL table, where I can have any number of stocks (e.g. MSFT, GOOG, IBM) and any number of fields (e.g. Full Name, Sector, Country). But I want the flexibility to add new stocks and new fields as I go along. Say I want to add a new stock like AAPL, or I want a new boolean field for whether they pay dividends or not. I don't expect to store dynamic fields like CurrentStockPrice, but the information will have to change periodically. For instance, when a company changes its dividend policy. How do I design the table so that I don't have to change its structure?
I had one idea where I could have a new table for each stock, and a master table that has all the stocks, and a pointer to each individual stock's table. That way, I can freely add new stocks, and new fields easily. But I'm not very familiar with SQL, and would like an expert opinion on how it should be implemented.
The simple answer is that your requirements are not a good fit for SQL. The most important concern is not how to store the data, but how you will retrieve it - what kind of query will you need to run?
EAV allows you to store data whose schema you don't know in advance - but it has lots of drawbacks when querying. Even moderately complex queries (find all stocks where the dividend was paid between 1 and 12 Jan, in the tech sector, whose CEO is female) run into a lot of complexity.
Creating a new table for each type of record very quickly gets crazy too - imagine the query above if you have to search dozens or hundreds of type-specific tables.
The relational model works best when you know the schema of the information in advance.
If you don't know the schema, consider using a NoSQL solution, or use SQL Server's support for XML or JSON. Store the fixed data in rows & columns, and the variable data in XML or JSON. Performance for searching is pretty good, and it's much less convoluted as a solution.
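A rough sketch of that hybrid (assuming SQL Server 2016+ and made-up column names; not a definitive design):
CREATE TABLE Stocks (
    StockId   int IDENTITY(1,1) PRIMARY KEY,
    Symbol    varchar(10)   NOT NULL,
    FullName  nvarchar(100) NOT NULL,
    ExtraData nvarchar(max) NULL            -- variable attributes stored as JSON
        CONSTRAINT CK_Stocks_ExtraData CHECK (ExtraData IS NULL OR ISJSON(ExtraData) = 1)
);

INSERT INTO Stocks (Symbol, FullName, ExtraData)
VALUES ('MSFT', N'Microsoft', N'{"Sector":"Tech","PaysDividend":true}');

-- Query a JSON attribute alongside the fixed columns
SELECT Symbol, FullName,
       JSON_VALUE(ExtraData, '$.Sector')       AS Sector,
       JSON_VALUE(ExtraData, '$.PaysDividend') AS PaysDividend
FROM Stocks
WHERE JSON_VALUE(ExtraData, '$.Sector') = 'Tech';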
Just to expand on my comment, because the question itself begs for a couple of common schema anti-patterns. Some hybrid of EAV may actually be a good fit if you are willing to give up some flexibility and simplicity in your SQL and you aren't looking for fast queries.
EAV
EAV, or Entity-Attribute-Value, is a design where, in your case, you would have a master table of stocks with some common attributes, or maybe even ticker info with a datetime. Something like:
+---------+--------+--------------+
| stockid | symbol | name |
+---------+--------+--------------+
| 1 | goog | Google |
| 2 | msft | Microsoft |
| 3 | gpro | GoPro |
| 4 | xom | Exxon Mobile |
+---------+--------+--------------+
And a second table (the EAV table) to store ever changing attributes:
+---------+-----------+------------+
| stockid | attribute | value |
+---------+-----------+------------+
| 1 | country | us |
| 1 | favorite | TRUE |
| 1 | startyear | 2004 |
| 3 | favorite | |
| 3 | bobspick | TRUE |
| 4 | country | us |
| 3 | country | us |
| 2 | startyear | 1986 |
| 2 | employees | 18000 |
| 3 | marketcap | 1850000000 |
+---------+-----------+------------+
And perhaps a third table to get that minute by minute ticker info stored:
+---------+----------------+--------+
| stockid | datetime | value |
+---------+----------------+--------+
| 1 | 9/21/2016 8:15 | 771.41 |
| 1 | 9/21/2016 8:14 | 771.39 |
| 1 | 9/21/2016 8:12 | 771.37 |
| 1 | 9/21/2016 8:10 | 771.35 |
| 1 | 9/21/2016 8:08 | 771.33 |
| 1 | 9/21/2016 8:06 | 771.31 |
| 1 | 9/21/2016 8:04 | 771.29 |
| 2 | 9/21/2016 8:15 | 56.81 |
| 2 | 9/21/2016 8:14 | 56.82 |
| 2 | 9/21/2016 8:12 | 56.83 |
| 2 | 9/21/2016 8:10 | 56.84 |
+---------+----------------+--------+
Generally this is considered not great design since stitching data back together in a format like:
+-------------+-----------+---------+-----------+----------+--------------+
| stocksymbol | stockname | country | startyear | bobspick | currentvalue |
+-------------+-----------+---------+-----------+----------+--------------+
causes you to write a query that is not fun to look at:
SELECT
    stocks.symbol,
    stocks.name,
    country.value AS country,
    bobspick.value AS bobspick,
    startyear.value AS startyear,
    stockvalue.stockvalue
FROM
    stocks
    LEFT OUTER JOIN (SELECT stockid, value FROM fieldsTable WHERE attribute = 'country') AS country
        ON stocks.stockid = country.stockid
    LEFT OUTER JOIN (SELECT stockid, value FROM fieldsTable WHERE attribute = 'bobspick') AS bobspick
        ON stocks.stockid = bobspick.stockid
    LEFT OUTER JOIN (SELECT stockid, value FROM fieldsTable WHERE attribute = 'startyear') AS startyear
        ON stocks.stockid = startyear.stockid
    LEFT OUTER JOIN (SELECT MAX(value) AS stockvalue, stockid FROM tickerTable GROUP BY stockid) AS stockvalue
        ON stocks.stockid = stockvalue.stockid
WHERE stocks.symbol IN ('goog', 'msft')
You can see that every "field" in the EAV table gets its own subquery, which means we read that table from storage three times. We gain the flexibility on the front end over the database design, but we lose flexibility when querying.
Imagine a more traditional schema:
+---------+--------+--------------+---------+----------+----------+-----------+------------+-----------+
| stockid | symbol | name | country | bobspick | favorite | startyear | marketcap | employees |
+---------+--------+--------------+---------+----------+----------+-----------+------------+-----------+
| 1 | goog | Google | us | | TRUE | 2004 | | |
| 2 | msft | Microsoft | | | | 1986 | | 18000 |
| 3 | gpro | GoPro | us | TRUE | | | 1850000000 | |
| 4 | xom | Exxon Mobile | us | | | | | |
+---------+--------+--------------+---------+----------+----------+-----------+------------+-----------+
and
+---------+----------------+--------+
| stockid | datetime | value |
+---------+----------------+--------+
| 1 | 9/21/2016 8:15 | 771.41 |
| 1 | 9/21/2016 8:14 | 771.39 |
| 1 | 9/21/2016 8:12 | 771.37 |
| 1 | 9/21/2016 8:10 | 771.35 |
| 1 | 9/21/2016 8:08 | 771.33 |
| 1 | 9/21/2016 8:06 | 771.31 |
| 1 | 9/21/2016 8:04 | 771.29 |
| 2 | 9/21/2016 8:15 | 56.81 |
| 2 | 9/21/2016 8:14 | 56.82 |
| 2 | 9/21/2016 8:12 | 56.83 |
| 2 | 9/21/2016 8:10 | 56.84 |
+---------+----------------+--------+
To get the same results:
SELECT
    stocks.symbol,
    stocks.name,
    stocks.country,
    stocks.bobspick,
    stocks.startyear,
    stockvalue.stockvalue
FROM
    stocks
    LEFT OUTER JOIN (SELECT MAX(value) AS stockvalue, stockid FROM tickerTable GROUP BY stockid) AS stockvalue
        ON stocks.stockid = stockvalue.stockid
WHERE stocks.symbol IN ('goog', 'msft')
Now we have the flexibility in the query where we can quickly change out fields without monkeying around in subqueries, but we have to hassle our DBA every time we want to add a field.
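That hassle is just a one-off schema change, e.g. for the dividend-policy idea from the question (hypothetical column name):
ALTER TABLE stocks ADD dividendpolicy varchar(50) NULL;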
There is a further abstraction from EAV that is definitely something to avoid. I don't know if it has a name, but I call it "database in a database". Here you have a table of tables, a table of fields, and a table of values. The entire schema is kept as records, as are the values that would be stored in that schema. Ultimate flexibility is gained, but the SQL you will write to get at your data will be nightmarish, and your query speeds will degrade quickly as you add to your data/schema/data/schema mess.
As for your last idea of adding a new table for each stock: if the fields you are going to track for each stock are different (startyear, employees, and market cap for one stock; marketmax, country, address, yearsinbusiness for another) and you aren't planning on adding new stocks often, then it may be a good fit. I'm betting, though, that the attributes/fields you track on stock1 are also going to be tracked on stock2, and therefore suggest that you have a single stock table with all those common attributes, and maybe an EAV table to track attributes that are particular to each stock, so you can have the flexibility you need.
In each of these schemas I would also suggest that you put your ticker data in its own table. Whether you are capturing ticker data by the minute, hour, day, week, or month, because it's datetime-level data, it deserves its own table. (Unless you are only going to track the most current value, in which case it becomes a field.)
If you want to add fields dynamically, but without actually altering the schema of the table, then you should use a vertical schema for the table and retrieve the data via a PIVOT statement.
In this manner you can add as many Field/Value pairs as you wish for each stock/customer pairing.
The basic table would have 5 columns perhaps:
ID (Identity); StockName; AttributeName; Value; Timestamp;
If you take a look at how SQL organizes its table schema in INFORMATION_SCHEMA.COLUMNS, it provides this very same vertical schema layout for you.
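A sketch of that vertical layout and a PIVOT over it (assuming SQL Server and made-up attribute names):
CREATE TABLE StockAttributes (
    ID            int IDENTITY(1,1) PRIMARY KEY,
    StockName     varchar(10)   NOT NULL,
    AttributeName varchar(50)   NOT NULL,
    Value         nvarchar(100) NULL,
    [Timestamp]   datetime2     NOT NULL DEFAULT SYSUTCDATETIME()
);

-- Pivot the name/value rows back into columns; the attribute list is written out in the query
SELECT StockName, [Sector], [Country], [PaysDividend]
FROM (
    SELECT StockName, AttributeName, Value
    FROM StockAttributes
) AS src
PIVOT (
    MAX(Value) FOR AttributeName IN ([Sector], [Country], [PaysDividend])
) AS p;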

Matching disambiguating data to existing duplicate records

I have a table called transactions that has the ledger from a storefront. Let's say it looks like this, for simplicity:
trans_id | cust | date | num_items | cost
---------+------+------+-----------+------
1 | Joe | 4/18 | 6 | 14.83
2 | Sue | 4/19 | 3 | 8.30
3 | Ann | 4/19 | 1 | 2.28
4 | Joe | 4/19 | 4 | 17.32
5 | Sue | 4/19 | 3 | 8.30
6 | Lee | 4/19 | 2 | 9.55
7 | Ann | 4/20 | 1 | 2.28
For the credit card purchases, I subsequently get an electronic ledger that has the full timestamp. So I have a table called cctrans with date, time, cust, cost, and some other info. I want to add a column trans_id to the cctrans table, that references the transactions table.
The update statement for this is simple enough, except for one hitch: I have an 11 AM transaction from Sue on 4/19 for $8.30 and a 3 PM transaction from Sue on 4/19 for $8.30 that are the same in the transactions table except for the trans_id field. I don't really care which record of the cctrans table gets linked to trans_id 2 and which one gets linked to trans_id 5, but they can't both be assigned the same trans_id.
The question here is: How do I accomplish that (ideally in a way that also works when a customer makes the same purchase three or four times in a day)?
The best I have so far is to do:
UPDATE cctrans AS cc
JOIN transactions AS t
    ON cc.cust = t.cust AND cc.date = t.date AND cc.cost = t.cost
SET cc.trans_id = t.trans_id;
And then fix them one-by-one via manual inspection. But obviously that's not my preferred solution.
Thanks for any help you can provide.
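One way to break such ties (a sketch only, assuming SQL Server-style updatable CTEs and window functions, and using the time column on cctrans to order the duplicates) is to number the duplicate rows on both sides and join on that row number:
WITH t AS (
    SELECT trans_id, cust, [date], cost,
           ROW_NUMBER() OVER (PARTITION BY cust, [date], cost ORDER BY trans_id) AS rn
    FROM transactions
),
cc AS (
    SELECT trans_id, cust, [date], cost,
           ROW_NUMBER() OVER (PARTITION BY cust, [date], cost ORDER BY [time]) AS rn
    FROM cctrans
)
UPDATE cc
SET trans_id = t.trans_id
FROM cc
JOIN t
    ON t.cust = cc.cust AND t.[date] = cc.[date] AND t.cost = cc.cost AND t.rn = cc.rn;
Each duplicate on the card side then gets a distinct trans_id, and the same approach extends to three or four identical purchases in a day.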

How to combine two tables allocating Sold amounts vs Demand without loops/cursor

My task is to combine two tables in a specific way. I have a table Demands that contains demands for some goods (tovar). Each record has its own ID, Tovar, Date of demand and Amount. And I have another table Unloads that contains unloads of tovar. Each record has its own ID, Tovar, Order of unload and Amount. Demands and Unloads do not correspond to each other one-to-one, and the amounts in demands and unloads are not exactly equal. One demand may be for 10 units and there can be two unloads with 4 and 6 units. And two demands may be for 3 and 5 units and there can be one unload with 11 units.
The task is to get a table which will show how the demands are covered by the unloads. I have a solution (SQL Fiddle) but I think that there is a better one. Can anybody tell me how such tasks are solved?
What I have:
+--------------+-------+--------+-------+
| DemandNumber | Tovar | Amount | Order |
+--------------+-------+--------+-------+
| Demand#1     | Meat  | 2      | 1     |
| Demand#2     | Meat  | 3      | 2     |
| Demand#3     | Milk  | 6      | 1     |
| Demand#4     | Eggs  | 1      | 1     |
| Demand#5     | Eggs  | 5      | 2     |
| Demand#6     | Eggs  | 3      | 3     |
+--------------+-------+--------+-------+
+------------+-------+--------+-------+
| SaleNumber | Tovar | Amount | Order |
+------------+-------+--------+-------+
| Sale#1     | Meat  | 6      | 1     |
| Sale#2     | Milk  | 2      | 1     |
| Sale#3     | Milk  | 1      | 2     |
| Sale#4     | Eggs  | 2      | 1     |
| Sale#5     | Eggs  | 1      | 2     |
| Sale#6     | Eggs  | 4      | 3     |
+------------+-------+--------+-------+
What I want to receive
-------------------------------------------------
| DemandNumber | SaleNumber | Tovar | Amount |
-------------------------------------------------
| Demand#1 | Sale#1 | Meat | 2 |
| Demand#2 | Sale#1 | Meat | 3 |
| Demand#3 | Sale#2 | Milk | 2 |
| Demand#3 | Sale#3 | Milk | 1 |
| Demand#4 | Sale#4 | Eggs | 1 |
| Demand#5 | Sale#4 | Eggs | 1 |
| Demand#5 | Sale#5 | Eggs | 1 |
| Demand#5 | Sale#6 | Eggs | 3 |
| Demand#6 | Sale#6 | Eggs | 1 |
-------------------------------------------------
Here is additional explanation from author's comment:
Demand#1 needs 2 Meat and it can take them from Sale#1.
Demand#2 needs 3 Meat and can take them from Sale#1.
Demand#3 needs 6 Milk but there are only 2 Milk in Sale#2 and 1 Milk in Sale#3, so we show only the available amounts.
And so on.
The field Order in the example determines the order of calculations. We have to process Demands according to their Order: Demand#1 must be processed before Demand#2. And Sales also must be allocated according to their Order number; we cannot allocate eggs from a sale while an earlier sale (with a lower Order) still has unallocated eggs.
The only way I can get this is using loops. Is it possible to avoid loops and solve this task with T-SQL only?
If the Amount values are int and not too large (not millions), then I'd use a table of numbers to generate as many rows as the value of each Amount.
Here is a good article describing how to generate it.
Then it is easy to join Demand with Sale and group and sum as needed.
Otherwise, a plain straightforward cursor (in fact, two cursors) would be simple to implement, easy to understand, and O(n) in complexity. If the Amounts are small, the set-based variant is likely to be faster than a cursor. If the Amounts are large, a cursor may be faster. You need to measure performance with actual data.
Here is a query that uses a table of numbers. To understand how it works run each query in the CTE separately and examine its output.
SQLFiddle
WITH
CTE_Demands
AS
(
    SELECT
        D.DemandNumber
        ,D.Tovar
        ,ROW_NUMBER() OVER (PARTITION BY D.Tovar ORDER BY D.SortOrder, CA_D.Number) AS rn
    FROM
        Demands AS D
        CROSS APPLY
        (
            SELECT TOP(D.Amount) Numbers.Number
            FROM Numbers
            ORDER BY Numbers.Number
        ) AS CA_D
)
,CTE_Sales
AS
(
    SELECT
        S.SaleNumber
        ,S.Tovar
        ,ROW_NUMBER() OVER (PARTITION BY S.Tovar ORDER BY S.SortOrder, CA_S.Number) AS rn
    FROM
        Sales AS S
        CROSS APPLY
        (
            SELECT TOP(S.Amount) Numbers.Number
            FROM Numbers
            ORDER BY Numbers.Number
        ) AS CA_S
)
SELECT
    CTE_Demands.DemandNumber
    ,CTE_Sales.SaleNumber
    ,CTE_Demands.Tovar
    ,COUNT(*) AS Amount
FROM
    CTE_Demands
    INNER JOIN CTE_Sales ON
        CTE_Sales.Tovar = CTE_Demands.Tovar
        AND CTE_Sales.rn = CTE_Demands.rn
GROUP BY
    CTE_Demands.Tovar
    ,CTE_Demands.DemandNumber
    ,CTE_Sales.SaleNumber
ORDER BY
    CTE_Demands.DemandNumber
    ,CTE_Sales.SaleNumber
;
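The query above assumes a Numbers table already exists; here is one minimal way to build it (a sketch, assuming SQL Server; the linked article covers better options):
-- A Numbers table with values 1..100000 (make it larger than the biggest Amount)
CREATE TABLE Numbers (Number int NOT NULL PRIMARY KEY);

INSERT INTO Numbers (Number)
SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;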
Having said all this, usually it is better to perform this kind of processing on the client using a procedural programming language. You still have to transmit all rows from Demands and Sales to the client, so by joining the tables on the server you don't reduce the number of bytes that must go over the network. In fact, you increase it, because an original row may be split into several rows.
This kind of processing is sequential in nature, not set-based, so it is easy to do with arrays, but tricky in SQL.
I have no idea what your requirements are or what the business rules are or what the goals are but I can say this -- you are doing it wrong.
This is SQL. In SQL you do not do loops. In SQL you work with sets. Sets are defined by select statements.
If this problem is not resolved with a select statement (maybe with sub-selects) then you probably want to implement this in another way. (C# program? Some other ETL system?).
However, I can also say there is probably a way to do this with a single select statement, though you have not given enough information for me to know what that statement is. To say you have a working example and that it should be enough doesn't work on this site, because this site is about answering questions about problems, and you don't have a problem: you have some code.
Re-phrase the question with inputs, expected outputs, what you have tried, and what your question is. This is covered well in the FAQ.
Or if you have working code you want reviewed, it may be appropriate for the code review site.
I see 2 additional possible ways:
1. For 'advanced' data processing and calculations you can use cursors.
2. You can use SELECT with a CASE construction.

Merge computed data from two tables back into one of them

I have the following situation (as a reduced example). Two tables, Measures1 and Measures2, each of which stores an ID, a Weight in grams, and optionally a Volume in fluid ounces. (In reality, Measures1 has a good deal of other data that is irrelevant here.)
Contents of Measures1:
+----+----------+--------+
| ID | Weight | Volume |
+----+----------+--------+
| 1 | 100.0000 | NULL |
| 2 | 200.0000 | NULL |
| 3 | 150.0000 | NULL |
| 4 | 325.0000 | NULL |
+----+----------+--------+
Contents of Measures2:
+----+----------+----------+
| ID | Weight | Volume |
+----+----------+----------+
| 1 | 75.0000 | 10.0000 |
| 2 | 400.0000 | 64.0000 |
| 3 | 100.0000 | 22.0000 |
| 4 | 500.0000 | 100.0000 |
+----+----------+----------+
These tables describe equivalent weights and volumes of a substance. E.g. 10 fluid ounces of substance 1 weighs 75 grams. The IDs are related: ID 1 in Measures1 is the same substance as ID 1 in Measures2.
What I want to do is fill in the NULL volumes in Measures1 using the information in Measures2, but keeping the weights from Measures1 (then, ultimately, I can drop the Measures2 table, as it will be redundant). For the sake of simplicity, assume that all volumes in Measures1 are NULL and all volumes in Measures2 are not.
I can compute the volumes I want to fill in with the following query:
SELECT Measures1.ID, Measures1.Weight,
(Measures2.Volume * (Measures1.Weight / Measures2.Weight))
AS DesiredVolume
FROM Measures1 JOIN Measures2 ON Measures1.ID = Measures2.ID;
Producing:
+----+----------+-----------------+
| ID | Weight | DesiredVolume |
+----+----------+-----------------+
| 4 | 325.0000 | 65.000000000000 |
| 3 | 150.0000 | 33.000000000000 |
| 2 | 200.0000 | 32.000000000000 |
| 1 | 100.0000 | 13.333333333333 |
+----+----------+-----------------+
But I am at a loss for how to actually insert these computed values into the Measures1 table.
Preferably, I would like to be able to do it with a single query, rather than writing a script or stored procedure that iterates through every ID in Measures1. But even then I am worried that this might not be possible because the MySQL documentation says that you can't use a table in an UPDATE query and a SELECT subquery at the same time, and I think any solution would need to do that.
I know that one workaround might be to create a new table with the results of the above query (also selecting all of the other non-Volume fields in Measures1) and then drop both tables and replace Measures1 with the newly-created table, but I was wondering if there was any better way to do it that I am missing.
UPDATE Measures1
JOIN Measures2
    ON Measures1.ID = Measures2.ID
SET Measures1.Volume = Measures2.Volume * (Measures1.Weight / Measures2.Weight);
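This uses MySQL's multi-table UPDATE form: because Measures2 is joined directly in the UPDATE rather than referenced through a subquery on the table being updated, it avoids the restriction mentioned in the question. Running the SELECT from the question first is a quick way to verify the computed volumes before writing them.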