SQL query with GROUP BY for multiple date ranges

I need to formulate a T-SQL query and so far I have been unable to do so. The table I need to query is called Operations and has two columns: an FK OperationTypeID and an OperationDate. The query needs to return the count of each operation type ID within the range specified for it.
Through the application interface the user can specify multiple OperationTypeIDs as well as their individual date ranges. For instance, operation type 'A' can be looked for in the range
22/04/2010 to 22/04/2012, operation type 'B' can be searched in 15/10/2012 to 15/11/2013, and so on for the other operation type IDs. I need to return a count for each operation type ID during the range specified for that individual operation type ID.
What is the most efficient way to achieve this in a single T-SQL query, considering performance? A rough layout of the desired output is presented below.
+---------------+----------+----------+-----+
|OperationTypeID| Min Date | Max Date |Count|
+---------------+----------+----------+-----+
|A              |22/04/2010|22/04/2012|899  |
+---------------+----------+----------+-----+
|B              |15/10/2012|15/11/2013|789  |
+---------------+----------+----------+-----+
... and so on
Would appreciate it if anyone can help. The query needs to return a count for each operation type ID based on the min/max date range specified by the user; the MIN/MAX aggregate functions available in SQL Server probably don't apply here. One possible approach I have thought of so far is UNION ALL: formulate a single query for each operation type ID and its date range, then UNION ALL the results together. Are there any performance impacts?

You will need to store the search criteria somewhere. The best place would probably be a temporary table with the following columns:
CREATE TABLE #SearchCriteria (
    OperationTypeId VARCHAR(1),
    MinDate DATETIME,
    MaxDate DATETIME
)
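For example, populating it might look like this (the literal values here are just placeholders for whatever the user selects in the application):
INSERT INTO #SearchCriteria (OperationTypeId, MinDate, MaxDate)
VALUES ('A', '20100422', '20120422'),   -- 22/04/2010 to 22/04/2012
       ('B', '20121015', '20131115');   -- 15/10/2012 to 15/11/2013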
Now, once you have populated this table, a simple query like this should give you what you want:
SELECT OperationTypeId,
       MinDate,
       MaxDate,
       (SELECT COUNT(*)
        FROM Operations
        WHERE OperationDate BETWEEN SC.MinDate AND SC.MaxDate
          AND OperationTypeId = SC.OperationTypeId) AS [Count]
FROM #SearchCriteria SC
If you must have everything in a single query (without using a temporary table), do something like this:
SELECT OperationTypeId,
       MinDate,
       MaxDate,
       (SELECT COUNT(*)
        FROM Operations
        WHERE OperationDate BETWEEN SC.MinDate AND SC.MaxDate
          AND OperationTypeId = SC.OperationTypeId) AS [Count]
FROM (VALUES ('A', '20100422', '20120422')   -- 22/04/2010 to 22/04/2012
           , ('B', '20121015', '20131115')   -- 15/10/2012 to 15/11/2013
           /* ... etc ... */
     ) SC(OperationTypeId, MinDate, MaxDate)
The date literals are written in the unambiguous yyyymmdd format so that they convert correctly regardless of the server's DATEFORMAT setting.
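Regarding the performance concern: if the correlated subquery turns out to be slow on a large Operations table, a join with GROUP BY is a common alternative worth comparing. This is only a sketch, and it assumes an index on (OperationTypeID, OperationDate) exists for it to perform well:
SELECT SC.OperationTypeId,
       SC.MinDate,
       SC.MaxDate,
       COUNT(O.OperationTypeID) AS [Count]   -- counts matching rows; returns 0 when nothing matches
FROM #SearchCriteria SC
     LEFT JOIN Operations O
         ON O.OperationTypeID = SC.OperationTypeId
        AND O.OperationDate BETWEEN SC.MinDate AND SC.MaxDate
GROUP BY SC.OperationTypeId, SC.MinDate, SC.MaxDate
Both forms should return the same result; comparing their execution plans against real data volumes is the most reliable way to choose between them.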


How to aggregate data stored column-wise in a matrix table

I have a table where the ellipses (...) represent multiple columns of a similar type:
TABLE: diagnosis_info
COLUMNS: visit_id,
patient_diagnosis_code_1 ...
patient_diagnosis_code_100 -- char(100) with a value of '0' or '1'
How do I find the most common diagnosis_code? There are 101 columns including the visit_id. The table is like a matrix table of 0s and 1s. How do I write something that can dynamically account for all the columns and count all the rows where the value is 1?
What I would normally do is not feasible as there are too many columns:
SELECT COUNT(patient_diagnosis_code_1), COUNT(patient_diagnosis_code_2), ... FROM diagnosis_info WHERE patient_diagnosis_code_1 = '1' and patient_diagnosis_code_2 = '1' and ...
Then even if I typed all that out, how would I select which column had the highest count of values = 1? The table is column-oriented rather than row-oriented.
Unfortunately your data design is bad from the start. Instead it could be as simple as:
patient_id, visit_id, diagnosis_code
where a patient with 1 diagnosis code would have 1 row and a patient with 100 diagnosis codes would have 100 rows. At any time you could transpose this into the format you presented (what is called a pivot or cross tab). Also, in some databases, for example PostgreSQL, you could put all those diagnosis codes into an array field, in which case it would look like:
patient_id, visit_id, diagnosis_code (data type: bool or int array)
Now you need the reverse of that, which is called an unpivot. Some databases, such as SQL Server, have an UNPIVOT operator for this; a sketch using it follows the UNION query below.
Without knowing what your backend is, you could do it with an ugly SQL query like:
select code, pdc
from
(
    select 1 as code, count(*) as pdc
    from myTable where patient_diagnosis_code_1 = '1'
    union
    select 2 as code, count(*) as pdc
    from myTable where patient_diagnosis_code_2 = '1'
    union
    ...
    select 100 as code, count(*) as pdc
    from myTable where patient_diagnosis_code_100 = '1'
) tmp
order by pdc desc, code;
PS: This returns all the codes with their frequencies, ordered from most to least common. You could limit the result to one row to get the maximum (with ties in case more than one code matches the maximum count).
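If the backend does turn out to be SQL Server, the same result can be produced without the long UNION by using the UNPIVOT operator mentioned above. This is only a sketch; the column list still has to be written out (or built dynamically), and the code column it returns is the column name rather than a number:
SELECT dx.code, COUNT(*) AS pdc
FROM diagnosis_info
UNPIVOT (flag FOR code IN (patient_diagnosis_code_1,
                           patient_diagnosis_code_2,
                           -- ... list all columns through ...
                           patient_diagnosis_code_100)) AS dx
WHERE dx.flag = '1'          -- keep only rows where the flag is set
GROUP BY dx.code
ORDER BY pdc DESC, dx.code;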

Approaching SQL query building

I'm unsure what method to use to create this query. Each week I need to pull a count of invoices grouped by status type, along with the most recent invoice date entered.
I have many vendor tables that store sales records and I create a report each week that pulls the following.
Select Invoice_status, COUNT(Invoice_status) As Total, Max(Invoice_date)
From VendorABCRecordsTable
Group By Invoice_Status
Results for each vendor
|Invoice_status| Total | column3
I run this for VendorABCRecordsTable, Vendor123RecordsTable, and VendorXYZRecordsTable and copy and paste the results into a spreadsheet.
How would I write it so the results would come out
Vendor | Invoice_status | Total | column3
VendorABC |
Vendor123 |
VendorXYZ |
SELECT
'VendorABC' Vendor
,Invoice_status
,COUNT(Invoice_status) Total
,MAX(Invoice_date) MaxInvoiceDate
from VendorABCRecordsTable
group by Invoice_Status
UNION ALL SELECT
'Vendor123' Vendor
,Invoice_status
,COUNT(Invoice_status) Total
,MAX(Invoice_date) MaxInvoiceDate
from Vendor123RecordsTable
group by Invoice_Status
UNION ALL SELECT
'VendorXYZ' Vendor
,Invoice_status
,COUNT(Invoice_status) Total
,MAX(Invoice_date) MaxInvoiceDate
from VendorXYZRecordsTable
group by Invoice_Status
Note the use of UNION ALL, as there will be no duplicate rows to remove. If the result needs to be ordered somehow, add an ORDER BY clause to (only!) the last query. (And you can alias the MAX(Invoice_date) column as "Column3" if necessary.)
Definitely not an optimized schema. However, perhaps a UNION query will help you deal with this.
SELECT *, "ABC" AS Vendor FROM VendorABCRecordsTable
UNION SELECT *, "123" FROM Vendor123RecordsTable
UNION SELECT *, "XYZ" FROM VendorXYZRecordsTable;
In order to use the * wildcard, the tables must have the same number of fields, in the same order, and with the same data types. Otherwise, explicitly list the fields in each SELECT. There is a limit of 50 SELECT statements in a UNION query. The first SELECT defines the field names and data types.
Now use that query in subsequent queries.
You cannot edit data via a UNION query.
You could specify fields to do aggregate calculations and grouping in each SELECT. However, building the UNION from the raw data provides a dataset that can serve as the source for various manipulations of the data, as shown below. This UNION is essentially what a properly designed table would look like.
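For example, the weekly report from the first answer could then be produced by aggregating once over the UNION as a derived table. This is only a sketch and assumes the relevant columns are named Invoice_status and Invoice_date:
SELECT Vendor,
       Invoice_status,
       COUNT(Invoice_status) AS Total,
       MAX(Invoice_date) AS Column3
FROM (SELECT 'VendorABC' AS Vendor, Invoice_status, Invoice_date FROM VendorABCRecordsTable
      UNION ALL
      SELECT 'Vendor123', Invoice_status, Invoice_date FROM Vendor123RecordsTable
      UNION ALL
      SELECT 'VendorXYZ', Invoice_status, Invoice_date FROM VendorXYZRecordsTable) AS AllVendors
GROUP BY Vendor, Invoice_status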

SQL - Insert using Column based on SELECT result

I currently have a table called tempHouses that looks like:
avgprice | dates | city
dates are stored as yyyy-mm-dd
However I need to move the records from that table into a table called houses that looks like:
city | year2002 | year2003 | year2004 | year2005 | year2006
The information in tempHouses contains average house prices from 1995 - 2014.
I know I can use SUBSTRING to get the year from the dates:
SUBSTRING(dates, 1, 4)
So basically, for each city in tempHouses.city I need to get the average house price from the above years into one record.
Any ideas on how I would go about doing this?
This is an SQL Server approach, and a PIVOT may be a better option (a sketch of that appears at the end of this answer), but here's one way:
SELECT City,
       AVG(year2002) AS year2002,
       AVG(year2003) AS year2003,
       AVG(year2004) AS year2004
FROM (
    SELECT City,
           CASE WHEN Dates BETWEEN '2002-01-01T00:00:00' AND '2002-12-31T23:59:59' THEN avgprice
           END AS year2002,
           CASE WHEN Dates BETWEEN '2003-01-01T00:00:00' AND '2003-12-31T23:59:59' THEN avgprice
           END AS year2003,
           CASE WHEN Dates BETWEEN '2004-01-01T00:00:00' AND '2004-12-31T23:59:59' THEN avgprice
           END AS year2004
           -- Repeat for each year; with no ELSE the CASE returns NULL,
           -- so rows from other years are ignored by AVG
    FROM tempHouses
) AS t
GROUP BY City
The inner query gets the data into the correct format for each record (City, year2002, year2003, year2004), whilst the outer query gets the average for each City.
There may be many ways to do this, and performance may be the deciding factor on which one to choose.
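As one alternative, here is a rough sketch of the PIVOT idea mentioned at the start of this answer (SQL Server syntax). It assumes dates is a character column in yyyy-mm-dd format, as described in the question:
SELECT City,
       [2002] AS year2002,
       [2003] AS year2003,
       [2004] AS year2004   -- extend the column list through year2006 as needed
FROM (SELECT City, LEFT(dates, 4) AS yr, avgprice
      FROM tempHouses) AS src
PIVOT (AVG(avgprice) FOR yr IN ([2002], [2003], [2004])) AS p
The result of either form could then be inserted into the houses table with an INSERT ... SELECT.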
The best way would be to use a script to perform the query execution for you, because you will need to run it multiple times, extracting the data year by year. Make sure that the only required columns are city and the row id:
http://dev.mysql.com/doc/refman/5.0/en/insert-select.html
INSERT INTO <table> (city) SELECT DISTINCT `city` FROM <old_table>;
Then for each city extract the average values, insert them into a temporary table and then insert into the main table.
SELECT city, AVG(avgprice), SUBSTRING(dates, 1, 4) AS yr FROM <old_table> GROUP BY city, SUBSTRING(dates, 1, 4);
Otherwise you're looking at a combination query using joins, and potentially unions, to reshape the data. Because you're flattening the table into a single row per city, it's going to be a little tough to do. You should create an index on the date column first if you don't want the query to hit memory limits or take a very long time to execute.

CDC in SQL Server

I have enabled the CDC feature on one of my databases. Now I have the below data in the CDC tables:
MemberID  LastName  __$operation
1         David     4
1         Dave      4
2         Jimmy     4
2         Test      4
Now my problem is that I have to query the CDC table and get the latest row for each member (the most recently updated value). For example, the query would return:
MemberID  LastName  __$operation
1         Dave      4
2         Test      4
In addition to the __$operation column, there are also the __$start_lsn and __$seqval columns. Ordering by those two should get you there.
You cannot determine the latest row from __$operation alone. To do it correctly, use the other CDC metadata columns (a rough sketch using them follows this list):
__$start_lsn
__$end_lsn
__$seqval
__$update_mask
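A minimal sketch of that ordering, assuming (hypothetically) that the change table for your capture instance is cdc.dbo_members_CT:
SELECT MemberID, LastName, __$operation
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY MemberID
                                ORDER BY __$start_lsn DESC, __$seqval DESC) AS rn
      FROM cdc.dbo_members_CT) AS x   -- rn = 1 is the latest change row per member
WHERE rn = 1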
So I'm not 100% sure I understand what you are asking for, but if you need the latest values for all the members in the table, then ignore the CDC table and just query the table itself, as that is where all the latest values are, after all.
If you need to see the latest values for all the members that have been changed within a certain time period, then you should use the cdc.fn_cdc_get_net_changes_(capture_instance) function, detailed here:
cdc.fn_cdc_get_net_changes
This allows you to specify a start and end date for the capture period (via the sys.fn_cdc_map_time_to_lsn function, which maps actual times to LSNs) and it will then output the net changes to the table within this period.
The name of the cdc.fn_cdc_get_net_changes_(capture_instance) function is generated from your capture instance name. As you have not specified what this is, I have called it dbo_members; please change as required. Here is an example of how you can get a list of the latest values for all members changed within the last day, using the functions detailed above:
DECLARE @begin_time DATETIME,
        @end_time DATETIME,
        @begin_lsn BINARY(10),
        @end_lsn BINARY(10);

SELECT @begin_time = GETDATE() - 1,
       @end_time = GETDATE();

SELECT @begin_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than', @begin_time);
SELECT @end_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', @end_time);

SELECT [MemberID],
       [LastName]
FROM cdc.fn_cdc_get_net_changes_dbo_members(@begin_lsn, @end_lsn, 'all');
GO
As per steoleary, you can simply check the data table for the latest values and ignore CDC altogether, but if you want to see what changed, including the from and to values, then you will need to refer to the __$operation values 3 (deleted) and 4 (inserted) in conjunction with __$start_lsn. The inserted and deleted values correspond to the tables you would use when writing triggers, by the way.
To see which column values changed, as a precursor to actually evaluating those values, you can use the __$update_mask column tied to the cdc.captured_columns table (which provides the actual column names), checking where the sys.fn_cdc_is_bit_set(captured_columns.column_ordinal, __$update_mask) function returns 1.
Welcome to the wacky world of CDC and the copious amounts of late nights and caffeine hits required to even attempt to master it!
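As a rough sketch of that __$update_mask approach, again assuming the capture instance is called dbo_members (so the change table is cdc.dbo_members_CT), listing which columns changed for each captured update row might look something like this:
SELECT ct.__$start_lsn,
       ct.MemberID,
       cc.column_name        -- one row per changed column per captured row
FROM cdc.dbo_members_CT AS ct
     INNER JOIN cdc.captured_columns AS cc
         ON cc.[object_id] = OBJECT_ID('cdc.dbo_members_CT')
WHERE ct.__$operation = 4    -- after-image of an update
  AND sys.fn_cdc_is_bit_set(cc.column_ordinal, ct.__$update_mask) = 1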
If your CDC change table is named cdc.dbo_demo_ct, then the following query will get you the desired result:
SELECT *
FROM (SELECT Row_number() OVER (partition BY a.MemberID ORDER BY b.tran_end_time DESC) t,
*
FROM cdc.dbo_demo_ct a
INNER JOIN cdc.lsn_time_mapping b
ON a.__$start_lsn = b.start_lsn) T
WHERE T.t = 1

Converting Rows to Columns in SQL SERVER 2008

In SQL Server 2008,
I have a table for tracking the status history of actions (STATUS_HISTORY) that has three columns ([ACTION_ID],[STATUS],[STATUS_DATE]).
Each ACTION_ID can have a variable number of statuses and status dates.
I need to convert these rows into columns that preferably look something like this:
[ACTION_ID], [STATUS_1], [STATUS_2], [STATUS_3], [DATE_1], [DATE_2], [DATE_3]
Where the total number of status columns and date columns is unknown, and - of course - DATE_1 correlates to STATUS_1, etc. And I'd like for the status to be in chronological order (STATUS_1 has the earliest date, etc.)
My reason for doing this is so I can put the 10 most recent Statuses on a report in an Access ADP, along with other information for each action. Using a subreport with each status in a new row would cause the report to be far too large.
Is there a way to do this using PIVOT? I don't want to use the date or the status as a column heading.
Is it possible at all?
I have no idea where to even begin. It's making my head hurt.
Let us suppose for brevity that you only want the 3 most recent statuses for each action_id (as in your example).
Then this query using CTE should do the job:
WITH rownrs AS
(
SELECT
action_id
,status
,status_date
,ROW_NUMBER() OVER (PARTITION BY action_id ORDER BY status_date DESC) AS rownr
FROM
status_history
)
SELECT
s1.action_id AS action_id
,s1.status AS status_1
,s2.status AS status_2
,s3.status AS status_3
,s1.status_date AS date_1
,s2.status_date AS date_2
,s3.status_date AS date_3
FROM
(SELECT * FROM rownrs WHERE rownr=1) AS s1
LEFT JOIN
(SELECT * FROM rownrs WHERE rownr=2) AS s2
ON s1.action_id = s2.action_id
LEFT JOIN
(SELECT * FROM rownrs WHERE rownr=3) AS s3
ON s1.action_id = s3.action_id
NULL values will appear in the rows where the action_id has fewer than 3 statuses.
I haven't had to do it with two columns, but a PIVOT sounds like what you should try. I've done this in the past with dates in a result set where I needed the date in each row to be turned into columns across the top.
http://msdn.microsoft.com/en-us/library/ms177410.aspx
I sympathize with the headache from trying to design and visualize it, but the best thing to do is try getting it working with one of the columns and then go from there. It helps once you start playing with it. A rough sketch of what a PIVOT version might look like is below.
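This is only a sketch (SQL Server 2008 syntax), reusing the row-numbering CTE from the first answer; rownr 1 is the most recent status to match that answer, the IN lists would be extended to [10] for the ten most recent statuses, and statuses and dates are pivoted separately and joined back on ACTION_ID:
WITH rownrs AS
(
    SELECT ACTION_ID, STATUS, STATUS_DATE,
           ROW_NUMBER() OVER (PARTITION BY ACTION_ID ORDER BY STATUS_DATE DESC) AS rownr
    FROM STATUS_HISTORY
)
SELECT s.ACTION_ID,
       s.[1] AS STATUS_1, s.[2] AS STATUS_2, s.[3] AS STATUS_3,
       d.[1] AS DATE_1,   d.[2] AS DATE_2,   d.[3] AS DATE_3
FROM (SELECT ACTION_ID, rownr, STATUS FROM rownrs) AS st
     PIVOT (MAX(STATUS) FOR rownr IN ([1], [2], [3])) AS s
     INNER JOIN
     (SELECT ACTION_ID, rownr, STATUS_DATE FROM rownrs) AS dt
     PIVOT (MAX(STATUS_DATE) FOR rownr IN ([1], [2], [3])) AS d
         ON d.ACTION_ID = s.ACTION_ID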