Is there a way to create a "pivot group" of columns with t-sql? - sql

This seems like a common need but, unfortunately, I can't find a solution.
Assume you have a query that outputs the following content:
| TimeFrame | User | Metric1 | Metric2 |
+------------+------------+---------+---------+
| TODAY | John Doe | 10 | 20 |
| MONTHTODAY | John Doe | 100 | 200 |
| TODAY | Jack Frost | 15 | 25 |
| MONTHTODAY | Jack Frost | 150 | 250 |
What I need as output after a pivot is data that looks like this:
| User | TODAY_Metric1 | TODAY_Metric2 | MONTHTODAY_Metric1 | MONTHTODAY_Metric2 |
+------------+---------------+---------------+--------------------+--------------------+
| John Doe | 10 | 20 |100 | 200 |
| Jack Frost | 15 | 25 |150 | 250 |
Note that I'm doing the pivoting on TimeFrame, however, columns Metric1 and Metric2 remain columns but are grouped by time frame values.
Can this be done within standard PIVOT syntax or will I need to write a more complex query to pull this data together in a result set specific to my needs?

You can do this with conditional aggregation:
select
user,
sum(case when timeframe = 'TODAY' then Metric_1 end) TODAY_Metric1,
sum(case when timeframe = 'TODAY' then Metric_2 end) TODAY_Metric2,
sum(case when timeframe = 'MONTHTODAY' then Metric_1 end) MONTHTODAY_Metric1,
sum(case when timeframe = 'MONTHTODAY' then Metric_2 end) MONTHTODAY_Metric2
from mytable
group by user
I tend to prefer the conditional aggregation technique over the vendor-specific implementations, because:
I find it simpler to understand and maintain
it is cross-RDBMS (so you can easily port it to some other database if needed)
it usually performs as well, or even better than vendor implementation (that usually rely upon it under the hood)

Related

how to join tables on cases where none of function(a) in b

Say in MonetDB (specifically, the embedded version from the "MonetDBLite" R package) I have a table "events" containing entity ID codes and event start and end dates, of the format:
| id | start_date | end_date |
| 1 | 2010-01-01 | 2010-03-30 |
| 1 | 2010-04-01 | 2010-06-30 |
| 2 | 2018-04-01 | 2018-06-30 |
| ... | ... | ... |
The table is approximately 80 million rows of events, attributable to approximately 2.5 million unique entities (ID values). The dates appear to align nicely with calendar quarters, but I haven't thoroughly checked them so assume they can be arbitrary. However, I have at least sense-checked them for end_date > start_date.
I want to produce a table "nonevent_qtrs" listing calendar quarters where an ID has no event recorded, e.g.:
| id | last_doq |
| 1 | 2010-09-30 |
| 1 | 2010-12-31 |
| ... | ... |
| 1 | 2018-06-30 |
| 2 | 2010-03-30 |
| ... | ... |
(doq = day of quarter)
If the extent of an event spans any days of the quarter (including the first and last dates), then I wish for it to count as having occurred in that quarter.
To help with this, I have produced a "calendar table"; a table of quarters "qtrs", covering the entire span of dates present in "events", and of the format:
| first_doq | last_doq |
| 2010-01-01 | 2010-03-30 |
| 2010-04-01 | 2010-06-30 |
| ... | ... |
And tried using a non-equi merge like so:
create table nonevents
as select
id,
last_doq
from
events
full outer join
qtrs
on
start_date > last_doq or
end_date < first_doq
group by
id,
last_doq
But this is a) terribly inefficient and b) certainly wrong, since most IDs are listed as being non-eventful for all quarters.
How can I produce the table "nonevent_qtrs" I described, which contains a list of quarters for which each ID had no events?
If it's relevant, the ultimate use-case is to calculate runs of non-events to look at time-till-event analysis and prediction. Feels like run length encoding will be required. If there's a more direct approach than what I've described above then I'm all ears. The only reason I'm focusing on non-event runs to begin with is to try to limit the size of the cross-product. I've also considered producing something like:
| id | last_doq | event |
| 1 | 2010-01-31 | 1 |
| ... | ... | ... |
| 1 | 2018-06-30 | 0 |
| ... | ... | ... |
But although more useful this may not be feasible due to the size of the data involved. A wide format:
| id | 2010-01-31 | ... | 2018-06-30 |
| 1 | 1 | ... | 0 |
| 2 | 0 | ... | 1 |
| ... | ... | ... | ... |
would also be handy, but since MonetDB is column-store I'm not sure whether this is more or less efficient.
Let me assume that you have a table of quarters, with the start date of a quarter and the end date. You really need this if you want the quarters that don't exist. After all, how far back in time or forward in time do you want to go?
Then, you can generate all id/quarter combinations and filter out the ones that exist:
select i.id, q.*
from (select distinct id from events) i cross join
quarters q left join
events e
on e.id = i.id and
e.start_date <= q.quarter_end and
e.end_date >= q.quarter_start
where e.id is null;

T-SQL aggregate over contiguous dates more efficiently

I have a need to aggregate a sum over contiguous dates. I've seen solutions to similar problems that will return the start and end dates, but don't have a need to aggregate the data between those ranges. It's further complicated by the extremely large amounts of data involved, to the point that a simple self join takes an impractical amount of time (especially since the start and end date fields are unindexed)
I have a solution involving cursors, but I've generally been led to believe that cursors can always be more efficiently replaced with joins that will execute faster, but so far every solution I've tried with a query anywhere close to giving me the data I need takes an hour at least, and my cursor solutions takes about 10 seconds. So I'm asking if there is a more efficient answer.
And the data includes both buy and sell transactions and each row of aggregated contiguous dates returned also needs to list the transaction ID of the last sell that occurred before the first buy of the contiguous set of buy transactions.
An example of the data:
+------------------+------------+------------+------------------+--------------------+
| TRANSACTION_TYPE | TRANS_ID | StartDate | EndDate | Amount |
+------------------+------------+------------+------------------+--------------------+
| sell | 100 | 2/16/16 | 2/18/18 | $100.00 |
| sell | 101 | 3/1/16 | 6/6/16 | $121.00 |
| buy | 102 | 6/10/16 | 6/12/16 | $22.00 |
| buy | 103 | 6/12/16 | 6/14/16 | $0.35 |
| buy | 104 | 6/29/16 | 7/2/16 | $5.00 |
| sell | 105 | 7/3/16 | 7/6/16 | $115.00 |
| buy | 106 | 7/8/16 | 7/9/16 | $200.00 |
| sell | 107 | 7/10/16 | 7/13/16 | $4.35 |
| sell | 108 | 7/17/16 | 7/20/16 | $0.50 |
| buy | 109 | 7/25/16 | 7/29/16 | $33.00 |
| buy | 110 | 7/29/16 | 8/1/16 | $75.00 |
| buy | 111 | 8/1/16 | 8/3/16 | $0.33 |
| sell | 112 | 9/1/16 | 9/2/16 | $99.00 |
+------------------+------------+------------+------------------+--------------------+
Should have results like the following:
+------------+------------+------------------+--------------------+
| Last_Sell | StartDate | EndDate | Amount |
+------------+------------+------------------+--------------------+
| 101 | 6/10/16 | 6/14/18 | $22.35 |
| 101 | 6/29/16 | 7/2/16 | $5.00 |
| 105 | 7/8/16 | 7/9/16 | $200.00 |
| 108 | 7/25/16 | 8/3/16 | $108.33 |
+------------------+------------+------------+--------------------+
Right now I use queries to split the data into buys and sells, and just walk through the buy data, aggregating as I go, inserting into the return table every time I find a break in the dates, and I step through the sell table until I reach the last sell before the start date of the set of buys.
Walking linearly through cursors gives me a computational time of n. Even though cursors are orders of magnitude less efficient, it's still calculating in n, while I suspect the joins I would need to do would give me at least n log n. With the ridiculous amount of data I'm working with, the inefficiencies of cursors get swamped if it goes beyond linear time.
If I assume that the transaction id increases along with the dates, then you can get the last sales date using a cumulative max. Then, the adjacency can be found by using similar logic, but with a lag first:
with cte as (
select t.*,
sum(case when transaction_type = 'sell' then trans_id end) over
(order by trans_id) as last_sell,
lag(enddate) over (partition by transaction_type order by trans_id) as prev_enddate
from t
)
select last_sell, min(startdate) as startdate, max(enddate) as enddate,
sum(amount) as amount
from (select cte.*,
sum(case when startdate = dateadd(day, 1, prev_enddate) then 0 else 1 end) over (partition by last_sell order by trans_id) as grp
from cte
where transaction_type = 'buy'
) x
group by last_sell, grp;

SAP Business Objects Cross Table Data Duplication

I'm using Business Objects to construct a simple report on whether a unit is on or off for a given day. When constructing a vertical table, the data is correct and looks like such:
Unit ID | Status | Date
1 | On | 2016-09-10
1 | On | 2016-09-11
1 | Off | 2016-09-12
2 | Off | 2016-09-10
2 | Off | 2016-09-11
2 | On | 2016-09-12
However the cross table I've created, with columns of "date" and rows of "Unit ID" is duplicating Unit ID and having an entire row of 'On' followed by an entire row of 'Off' like:
____| 2016-09-10 | 2016-09-11 | 2016-09-12
1 | On | On | On
1 | Off | Off | Off
2 | On | On | On
2 | Off | Off | Off
instead of what it should be as:
____| 2016-09-10 | 2016-09-11 | 2016-09-12
1 | On | On | Off
2 | Off | Off | On
Any suggestions as to why it's doing this? The table isn't particularly useful if it has these duplicate rows and I can't understand why it's resulting in this odd table.
Turns out what happened is the "Status" field was a dimension type, but the cross table requires the data field to be a measure type. Simply making a new variable that was a measure equal to "Status" solved the issue.

How should I create an SQL table with stock information so that I can add new stocks and new fields easily?

I want to create an SQL table, where I can have any number of stocks (ie. MSFT, GOOG, IBM) and any number of fields (ie. Full Name, Sector, Country). But I want the flexibility to add new stocks and new fields as I go along. Say I want to add a new stock like AAPL, or I want a new boolean field for whether they pay dividends or not. I don't expect to store dynamic fields like CurrentStockPrice, but the information will have to change periodically. For instance, when a company changes its dividend policy. How do I design the table so that I don't have to change its structure?
I had one idea where I could have a new table for each stock, and a master table that has all the stocks, and a pointer to each individual stock's table. That way, I can freely add new stocks, and new fields easily. But I'm not very familiar with SQL, and would like an expert opinion on how it should be implemented.
The simple answer is that your requirements are not a good fit for SQL. The most important concern is not how to store the data, but how you will retrieve it - what kind of query will you need to run?
EAV allows you to store data whose schema you don't know in advance - but has lots of drawbacks when querying. Even moderately complex queries (find all stocks where the dividend was paid between 1 and 12 Jan, in the tech sector, whose CEO is female) run into a lot of compexity.
Creating a new table for each type of record very quickly gets crazy too - imagine the query above if you have to search dozens or hundreds of type-specific tables.
The relational model works best when you know the schema of the information in advance.
If you don't know the schema, consider using a NoSQL solution, or use SQL Server's support for XML or JSON. Store the fixed data in rows & columns, and the variable data in XML or JSON. Performance for searching is pretty good, and it's much less convoluted as a solution.
Just to expand on my comment, because the question itself begs for a couple of common schema anti-patterns. Some hybrid of EAV may actually be a good fit if you are willing to give up some flexibility and simplicity in your SQL and you aren't looking for fast queries.
EAV
EAV, or Entity-Attribute-Value is a design where, in your case, you would have a master table of stocks with some common attributes, or maybe even ticker info with a datetime. Something like:
+---------+--------+--------------+
| stockid | symbol | name |
+---------+--------+--------------+
| 1 | goog | Google |
| 2 | msft | Microsoft |
| 3 | gpro | GoPro |
| 4 | xom | Exxon Mobile |
+---------+--------+--------------+
And a second table (the EAV table) to store ever changing attributes:
+---------+-----------+------------+
| stockid | attribute | value |
+---------+-----------+------------+
| 1 | country | us |
| 1 | favorite | TRUE |
| 1 | startyear | 2004 |
| 3 | favorite | |
| 3 | bobspick | TRUE |
| 4 | country | us |
| 3 | country | us |
| 2 | startyear | 1986 |
| 2 | employees | 18000 |
| 3 | marketcap | 1850000000 |
+---------+-----------+------------+
And perhaps a third table to get that minute by minute ticker info stored:
+---------+----------------+--------+
| stockid | datetime | value |
+---------+----------------+--------+
| 1 | 9/21/2016 8:15 | 771.41 |
| 1 | 9/21/2016 8:14 | 771.39 |
| 1 | 9/21/2016 8:12 | 771.37 |
| 1 | 9/21/2016 8:10 | 771.35 |
| 1 | 9/21/2016 8:08 | 771.33 |
| 1 | 9/21/2016 8:06 | 771.31 |
| 1 | 9/21/2016 8:04 | 771.29 |
| 2 | 9/21/2016 8:15 | 56.81 |
| 2 | 9/21/2016 8:14 | 56.82 |
| 2 | 9/21/2016 8:12 | 56.83 |
| 2 | 9/21/2016 8:10 | 56.84 |
+---------+----------------+--------+
Generally this is considered not great design since stitching data back together in a format like:
+-------------+-----------+---------+-----------+----------+--------------+
| stocksymbol | stockname | country | startyear | bobspick | currentvalue |
+-------------+-----------+---------+-----------+----------+--------------+
causes you to write a query that is not fun to look at:
SELECT
stocks.stocksymbol,
stocks.name,
country.value,
bobspick.value,
startyear.value,
stockvalue.stockvalue
FROM
stocks
LEFT OUTER JOIN (SELECT stockid, value FROM fieldsTable WHERE attribute = 'country') as country ON
stocks.stockid = country.stockid
LEFT OUTER JOIN (SELECT stockid, value FROM fieldsTable WHERE attribute = 'Bobspick') as bobspick ON
stocks.stockid = bobspick.stockid
LEFT OUTER JOIN (SELECT stockid, value FROM fieldsTable WHERE attribute = 'startyear') as startyear ON
stocks.stockid = startyear.stockid
LEFT OUTER JOIN (SELECT max(value) as stockvalue, stockid FROM ticketTable GROUP BY stockid) as stockvalue ON
stocks.stockid = stockvalue.stockid
WHERE symbol in ('goog', 'msft')
You can see that every "field" in the EAV table gets its own subquery, which means we read that table from storage three times. We gain the flexibility on the front end over the database design, but we lose flexibility when querying.
Imagine a more traditional schema:
+---------+--------+--------------+---------+----------+----------+-----------+------------+-----------+
| stockid | symbol | name | country | bobspick | favorite | startyear | marketcap | employees |
+---------+--------+--------------+---------+----------+----------+-----------+------------+-----------+
| 1 | goog | Google | us | | TRUE | 2004 | | |
| 2 | msft | Microsoft | | | | 1986 | | 18000 |
| 3 | gpro | GoPro | us | TRUE | | | 1850000000 | |
| 4 | xom | Exxon Mobile | us | | | | | |
| | | | | | | | | |
+---------+--------+--------------+---------+----------+----------+-----------+------------+-----------+
and
+---------+----------------+--------+
| stockid | datetime | value |
+---------+----------------+--------+
| 1 | 9/21/2016 8:15 | 771.41 |
| 1 | 9/21/2016 8:14 | 771.39 |
| 1 | 9/21/2016 8:12 | 771.37 |
| 1 | 9/21/2016 8:10 | 771.35 |
| 1 | 9/21/2016 8:08 | 771.33 |
| 1 | 9/21/2016 8:06 | 771.31 |
| 1 | 9/21/2016 8:04 | 771.29 |
| 2 | 9/21/2016 8:15 | 56.81 |
| 2 | 9/21/2016 8:14 | 56.82 |
| 2 | 9/21/2016 8:12 | 56.83 |
| 2 | 9/21/2016 8:10 | 56.84 |
+---------+----------------+--------+
To get the same results:
SELECT
stocks.stocksymbol,
stocks.name,
stocks.country,
stocks.bobspick,
stocks.startyear,
stockvalue.stockvalue
FROM
stocks
LEFT OUTER JOIN (SELECT max(value) as stockvalue, stockid FROM ticketTable GROUP BY stockid) as stockvalue ON
stocks.stockid = stockvalue.stockid
WHERE symbol in ('goog', 'msft')
Now we have the flexibility in the query where we can quickly change out fields without monkeying around in subqueries, but we have to hassle our DBA every time we want to add a field.
There is a further abstraction from EAV that is definitely something to avoid. I don't know if it has a name, but I call it "Database in a database". Here you have a table of tables, table of fields, and a table of values. The entire schema is kept as records as our the values that would be stored in the schema. Ultimatele flexibility is gained, but the sql you will write to get at your data will be nightmarish and your query speeds will degrade at a fast rate as you add to your data/schema/data/schema mess.
As for your last idea of adding a new table for each stock, if the fields you are going to track for each stock are different (startyear, employees, and market cap for one stock and marketmax, country, address, yearsinbusiness in another) and you aren't planning on adding new stocks often, then it may be a good fit. I'm betting though that the attributes/fields that you track on stock1 are going to also be tracked on stock2, and therefore suggest that your should have a single stock table with all those common attributes and maybe an EAV to track attributes that are particular to each stock so you can have the flexibility you need.
In each of these schemas I would also suggest that you put your ticker data in it's own table. Whether you are capturing ticket data by the minute, hour, day, week, or month, because it's datetime level data, it deserves it's own table. (Unless you are only going to track the most current value, then it becomes a field).
If you want to add fields dynamically, but without actually altering the schema of the table, then you should use a vertical schema for the table and retrieve the data via a PIVOT statement.
In this manner you can add as many Field/Value pairs as you wish for each stock/customer pairing.
The basic table would have 5 columns perhaps:
ID (Identity); StockName; AttributeName; Value; Timestamp;
If you take a look at how SQL organizes it's table schema in INFORMATION_SCHEMA.COLUMNS, it provides this very same vertical schema layout for you.

Only Some Dates From SQL SELECT Being Set To "0" or "1969-12-31" -- UNIX_TIMESTAMP

So I have been doing pretty well on my project (Link to previous StackOverflow question), and have managed to learn quite a bit, but there is this one problem that has been really dogging me for days and I just can't seem to solve it.
It has to do with using the UNIX_TIMESTAMP call to convert dates in my SQL database to UNIX time-format, but for some reason only one set of dates in my table is giving me issues!
==============
So these are the values I am getting -
#abridged here, see the results from the SELECT statement below to see the rest
#of the fields outputted
| firstVst | nextVst | DOB |
| 1206936000 | 1396238400 | 0 |
| 1313726400 | 1313726400 | 278395200 |
| 1318910400 | 1413604800 | 0 |
| 1319083200 | 1413777600 | 0 |
when I use this SELECT statment -
SELECT SQL_CALC_FOUND_ROWS *,UNIX_TIMESTAMP(firstVst) AS firstVst,
UNIX_TIMESTAMP(nextVst) AS nextVst, UNIX_TIMESTAMP(DOB) AS DOB FROM people
ORDER BY "ref DESC";
So my big question is: why in the heck are 3 out of 4 of my DOBs being set to date of 0 (IE 12/31/1969 on my PC)? Why is this not happening in my other fields?
I can see the data quite well using a more simple SELECT statement and the DOB field looks fine...?
#formatting broken to change some variable names etc.
select * FROM people;
| ref | lastName | firstName | DOB | rN | lN | firstVst | disp | repName | nextVst |
| 10001 | BlankA | NameA | 1968-04-15 | 1000000 | 4600000 | 2008-03-31 | Positive | Patrick Smith | 2014-03-31 |
| 10002 | BlankB | NameB | 1978-10-28 | 1000001 | 4600001 | 2011-08-19 | Positive | Patrick Smith | 2011-08-19 |
| 10003 | BlankC | NameC | 1941-06-08 | 1000002 | 4600002 | 2011-10-18 | Positive | Patrick Smith | 2014-10-18 |
| 10004 | BlankD | NameD | 1952-08-01 | 1000003 | 4600003 | 2011-10-20 | Positive | Patrick Smith | 2014-10-20 |
It's because those DoB's are from before 12/31/1969, and the UNIX epoch starts then, so anything prior to that would be negative.
From Wikipedia:
Unix time, or POSIX time, is a system for describing instants in time, defined as the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970, not counting leap seconds.
A bit more elaboration: Basically what you're trying to do isn't possible. Depending on what it's for, there may be a different way you can do this, but using UNIX timestamps probably isn't the best idea for dates like that.