SQL: getting a count in a date range

I'm looking for input on getting a COUNT of records that were 'active' in a certain date range.
CREATE TABLE member (
id int identity,
name varchar,
active bit
)
The scenario is one where the number of "members" fluctuates over time. So I could have linear growth from 10 members at the beginning of the month to 20 at the end. Currently we go off the number of CURRENTLY ACTIVE members (as marked by an 'active' flag in the DB) AT THE TIME OF THE REPORT. This is hardly accurate, and worse, six months from now my "members" figure may be substantially different than it is now. Since I'm doing averages per user, a report run now and the same report run six months from now will probably show different figures.
I don't think a simple "dateActive" and "dateInactive" will do the trick, due to members coming and going and coming back, etc. So:
JOE may be active 12-1 and deactivated 12-8 and activated 12-20
so JOE counts as being a 'member' for 8 days and then 11 days for a total of 19 days
but the revolving-door status of members means (presumably) keeping a separate table of UserId, status, date:
CREATE TABLE memberstatus (
member_id int,
status bit, -- 0 for inactive, 1 for active
date date
)
(Adding this table would make the 'active' field in member obsolete.)
In order to get a "good" average of members per month (or any date range), it seems I'd need to get a daily count and then take an average of those daily figures over 'x' days. Or is there already some way in SQL to do this?
This extra "status" table would allow an accurate count going back in time. So in a case where you have a revenue or cost figure that DOESN'T change (it's fixed, not an aggregate), when you want cost per member for last June, you certainly don't want to use your current member count; you want last June's.
Is this how it's done? I know it's one way, but is it the 'better' way?
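For what it's worth, a status-change table like memberstatus plus a calendar table (one row per date; a helper you'd have to create, it isn't part of the schema above) lets you compute the average of daily active counts directly in SQL. A rough, untested sketch in SQL Server syntax:

-- Daily active count for December 2014, then the average of those daily counts.
-- A member counts as active on a day if their most recent status change on or
-- before that day has status = 1. "calendar(calDate)" is an assumed helper table.
SELECT AVG(CAST(dailyActive AS decimal(10,2))) AS avgActiveMembers
FROM (
    SELECT c.calDate,
           (SELECT COUNT(*)
            FROM member m
            WHERE 1 = (SELECT TOP 1 s.status
                       FROM memberstatus s
                       WHERE s.member_id = m.id
                         AND s.date <= c.calDate
                       ORDER BY s.date DESC)
           ) AS dailyActive
    FROM calendar c
    WHERE c.calDate BETWEEN '2014-12-01' AND '2014-12-31'
) daily;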
@Gordon - I got ya, but I guess I was looking at records like this:
Members
1 Joe
2 Tom
3 Sue
MemberStatus
1 1 '12-01-2014'
1 0 '12-08-2014'
1 1 '12-20-2014'
In this way I only need the last record for a user to get their current status, but I can also track back and "know" their status on any given day.
If I'm understanding your method, it might look like this:
CREATE TABLE memberstatus (
member_id int,
active_date date,
inactive_date date
)
so on the 1st through the 7th the record would look like this:
1 '12-01-2014' null
and on the 8th it would change to:
1 '12-01-2014' '12-08-2014'
then on the 20th:
1 '12-01-2014' '12-08-2014'
1 '12-20-2014' null
Although I can get the same data out, it seems more difficult without any benefit - am I missing something?

You could also use a 2 table method to have a one-to-many relationship for working periods. For example you have a User table
User
UserID int, UserName varchar
and an Activity table that holds ranges
Activity
ActivityID int, UserID int, startDate date, (duration int or endDate date)
Then whenever you wanted information you could do something like (for example)...
SELECT User.UserName, COUNT(*) FROM Activity
LEFT OUTER JOIN User ON User.UserID = Activity.UserID
WHERE startDate >= '2014-01-01' AND startDate < '2015-01-01'
GROUP BY User.UserID, User.UserName
...to get a count, grouped by user (and labeled by username), of the times they became active in 2014.
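If you also need how many days each user was active inside a reporting window (closer to the original question), something along these lines should work with the endDate variant of the Activity table, where a NULL endDate means the period is still open (SQL Server syntax; User is bracketed because it is a reserved word there):

DECLARE @rangeStart date = '2014-12-01', @rangeEnd date = '2014-12-31';

-- Clamp each activity period to the reporting window and sum the overlap per user.
SELECT u.UserName,
       SUM(DATEDIFF(day,
                    CASE WHEN a.startDate > @rangeStart THEN a.startDate ELSE @rangeStart END,
                    CASE WHEN a.endDate IS NULL OR a.endDate > @rangeEnd THEN @rangeEnd ELSE a.endDate END)
          ) AS activeDays  -- DATEDIFF counts day boundaries (12-01 to 12-08 = 7); add 1 per row inside the SUM for inclusive counts
FROM Activity a
JOIN [User] u ON u.UserID = a.UserID
WHERE a.startDate <= @rangeEnd
  AND (a.endDate IS NULL OR a.endDate >= @rangeStart)
GROUP BY u.UserName;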

I have used two main ways to accomplish what you want. First would be something like this:
CREATE TABLE [MemberStatus](
[MemberID] [int] NOT NULL,
[ActiveBeginDate] [date] NOT NULL,
[ActiveEndDate] [date] NULL,
CONSTRAINT [PK_MemberStatus] PRIMARY KEY CLUSTERED
(
[MemberID] ASC,
[ActiveBeginDate] ASC
)
)
Every time a member becomes active, you add an entry, and when they become inactive you update their ActiveEndDate to the current date.
This is easy to maintain, but can be hard to query. Another option is to do basically what you are suggesting: you can create a scheduled job that runs at the end of each day and adds entries to the table.
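For example, that nightly job's insert could be as simple as the following sketch, where MemberDailySnapshot is a hypothetical snapshot table (one row per member per day they were active):

-- Hypothetical daily snapshot table plus the insert a nightly job could run.
CREATE TABLE MemberDailySnapshot (
    SnapshotDate date NOT NULL,
    MemberID     int  NOT NULL,
    CONSTRAINT PK_MemberDailySnapshot PRIMARY KEY (SnapshotDate, MemberID)
);

INSERT INTO MemberDailySnapshot (SnapshotDate, MemberID)
SELECT CAST(GETDATE() AS date), ms.MemberID
FROM MemberStatus ms
WHERE ms.ActiveBeginDate <= CAST(GETDATE() AS date)
  AND (ms.ActiveEndDate IS NULL OR ms.ActiveEndDate > CAST(GETDATE() AS date));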

I recommend setting up your tables so that you store more data, but in exchange the structure supports much simpler queries to achieve the reporting you require.
-- whenever a user's status changes, we update this table with the new "active"
-- bit, and we set "activeLastModified" to today.
CREATE TABLE member (
id int identity,
name varchar,
active bit,
activeLastModified date
)
-- whenever a user's status changes, we insert a new record here
-- with "startDate" set to the current "activeLastModified" field in member,
-- and "endDate" set to today (date of status change).
CREATE TABLE memberStatusHistory (
member_id int,
status bit, -- 0 for inactive, 1 for active
startDate date,
endDate date,
days int
)
As for the report you're trying to create (average # of actives in a given month), I think you need yet another table. Pure SQL can't calculate that based on these table definitions. Pulling that data from these tables is possible, but it requires programming.
If you ran something like this once-per-day and stored it in a table, then it would be easy to calculate weekly, monthly and yearly averages:
INSERT INTO myStatsTable (date, activeSum, inactiveSum)
SELECT
GETDATE(), -- depends on DBMS, e.g. "current_date" for Postgres
active.cnt,
inactive.cnt
FROM
(SELECT COUNT(id) AS cnt FROM member WHERE active = 1) active
CROSS JOIN
(SELECT COUNT(id) AS cnt FROM member WHERE active = 0) inactive
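Once that snapshot table has one row per day, the monthly average is a plain GROUP BY. A rough example (SQL Server syntax, using the column names assumed above):

-- Average daily active count per month, from the daily snapshot rows.
SELECT YEAR(date)  AS statYear,
       MONTH(date) AS statMonth,
       AVG(CAST(activeSum AS decimal(10,2))) AS avgActiveMembers
FROM myStatsTable
GROUP BY YEAR(date), MONTH(date)
ORDER BY statYear, statMonth;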


How to design this table schema (and how to query it)

I have a table that stores a list of members - for the sake of simplicity, I will use a simple real-world case that models my use case.
Let's use the analogy of a sports club or gym.
The membership of the gym changes every three months (for example) - with some old members leaving, some new members joining and some members staying unchanged.
I want to run a query on the table - spanning a multi-time period and return the average weight of all of the members in the club.
These are the tables I have come up with so far:
-- A table containing all members the gym has ever had
-- current members have their leave_date field left at NULL
-- departed members have their leave_date field set to the date they left the gym
CREATE TABLE IF NOT EXISTS member (
id INTEGER PRIMARY KEY NOT NULL,
name TEXT NOT NULL,
join_date DATE NOT NULL,
-- set to NULL if user has not left yet
leave_date DATE DEFAULT NULL
);
-- A table of members' weights.
-- This table is populated DAILY, after the weights of CURRENT members
-- have been recorded
CREATE TABLE IF NOT EXISTS current_member_weight (
id INTEGER PRIMARY KEY NOT NULL,
calendar_date DATE NOT NULL,
member_id INTEGER REFERENCES member(id) NOT NULL,
weight REAL NOT NULL
);
-- I want to write a query that returns the AVERAGE daily weight of
-- CURRENT members of the gym. The query should take a starting_date
-- and an ending_date between which to calculate the daily averages.
-- PSEUDO SQL BELOW!
SELECT calendar_date, AVG(weight)
FROM member
JOIN current_member_weight ON current_member_weight.member_id = member.id
WHERE calendar_date BETWEEN starting_date AND ending_date
GROUP BY calendar_date;
I have two questions:
Can the schema above be improved? If yes, please illustrate.
How can I write an SQL* query to return the average weights calculated for all members in the gym during a specified period (t1, t2), where (t1, t2) spans a period in which members have joined/left the gym?
[[Note about SQL]]
Preferably, any SQL shown would be database agnostic; however, if a particular flavour of SQL is to be used, I'd prefer PostgreSQL, since that is the database I'm using.
The SQL below would work as long as the data in the gym_member table is consistent with each member's joining and leaving dates (i.e. for any member, the gym_member table should have no rows with a calendar_date earlier than their join date or later than their leave date):
SELECT
gm.calendar_date,
AVG(gm.weight) avg_weight
FROM
member m,
gym_member gm
WHERE
m.id = gm.member_id
AND
gm.calendar_date >= '1-Jan-2017'
AND
gm.calendar_date <= '31-Dec-2017'
GROUP BY
gm.calendar_date
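If you can't guarantee that consistency, you can enforce it in the join instead. A PostgreSQL-flavoured sketch using the question's current_member_weight table:

-- Only count weigh-ins that fall inside each member's membership window.
SELECT w.calendar_date,
       AVG(w.weight) AS avg_weight
FROM current_member_weight w
JOIN member m
  ON m.id = w.member_id
 AND w.calendar_date >= m.join_date
 AND (m.leave_date IS NULL OR w.calendar_date <= m.leave_date)
WHERE w.calendar_date BETWEEN DATE '2017-01-01' AND DATE '2017-12-31'
GROUP BY w.calendar_date
ORDER BY w.calendar_date;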

CDC in SQL Server

I have enabled the CDC feature on one of my databases. Now I have the data below in the CDC tables:
MemberID LastName __$operation
1 David 4
1 Dave 4
2 Jimmy 4
2 Test 4
Now my problem is that I have to query the CDC table and get the latest row for each member (the most recently updated value). For example, the query would return:
MemberID LastName __$operation
1 Dave 4
2 Test 4
In addition to the __$operation column, there are also the __$start_lsn and __$seqval columns. Ordering by those two should get you there.
You can't determine this from __$operation alone. If you want to do it correctly, use the other metadata columns as well:
__$start_lsn
__$end_lsn
__$seqval
__$update_mask
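For example (not tested against your instance), a latest-row-per-member query driven purely by those metadata columns could look like this, assuming the change table is cdc.dbo_members_CT (substitute your own capture instance name):

-- Latest change row per member, ordered by the CDC metadata columns.
SELECT MemberID, LastName, __$operation
FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (PARTITION BY c.MemberID
                              ORDER BY c.__$start_lsn DESC, c.__$seqval DESC) AS rn
    FROM cdc.dbo_members_CT c
) x
WHERE x.rn = 1;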
So I'm not 100% sure I understand what you are asking for, but if you need the latest values for all the members in the table, then ignore the CDC table and just query the table itself, as that is where all the latest values are after all.
If you need to see the latest values for all the members that have been changed within a certain time period, then you should use the cdc.fn_cdc_get_net_changes_(capture_instance) function, detailed here:
cdc.fn_cdc_get_net_changes
This allows you to specify a start and end date for the capture period (via the sys.fn_cdc_map_time_to_lsn function which allows you to map the LSNs to actual times) and it will then output the net changes to the table within this period.
The cdc.fn_cdc_get_net_changes_(capture_instance) function name is generated from your capture instance. Since you have not specified what that is, I have called it dbo_members; please change as required. Here is an example of how you can get a list of the latest values for all members changed within the last day, using the functions detailed above:
DECLARE @begin_time DATETIME ,
@end_time DATETIME ,
@begin_lsn BINARY(10) ,
@end_lsn BINARY(10);
SELECT @begin_time = GETDATE() - 1 ,
@end_time = GETDATE();
SELECT @begin_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than',
@begin_time);
SELECT @end_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal',
@end_time);
SELECT [MemberID] ,
[LastName]
FROM cdc.fn_cdc_get_net_changes_dbo_members(@begin_lsn, @end_lsn, 'all')
GO
As per steoleary, you can simply check the data table for the latest values and ignore CDC altogether, but if you want to see what changed from which values to which, then you will need to refer to the __$operation values 3 (the before image, like the deleted table in a trigger) and 4 (the after image, like the inserted table), in conjunction with __$start_lsn.
To see which column values changed, as a precursor to actually evaluating those values, you can use the __$update_mask column together with the cdc.captured_columns table (which gives you the actual column names), via the sys.fn_cdc_is_bit_set(captured_columns.column_ordinal, __$update_mask) function where the result = 1.
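As a rough illustration (capture instance name again assumed to be dbo_members), listing the changed columns for each change row could look something like this:

-- For each change row, list the captured columns whose bit is set in __$update_mask
-- (most meaningful for update rows, i.e. __$operation = 3 or 4).
SELECT c.MemberID,
       c.__$start_lsn,
       cc.column_name
FROM cdc.dbo_members_CT c
JOIN cdc.captured_columns cc
  ON cc.[object_id] = (SELECT ct.[object_id]
                       FROM cdc.change_tables ct
                       WHERE ct.capture_instance = 'dbo_members')
WHERE sys.fn_cdc_is_bit_set(cc.column_ordinal, c.__$update_mask) = 1;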
Welcome to the wacky world of CDC and the copious amounts of late nights and caffeine hits required to even attempt to master it!
If your CDC system table name is cdc.dbo_demo_ct, then the following query will get the desired result:
SELECT *
FROM (SELECT Row_number() OVER (partition BY a.MemberID ORDER BY b.tran_end_time DESC) t,
*
FROM cdc.dbo_demo_ct a
INNER JOIN cdc.lsn_time_mapping b
ON a.__$start_lsn = b.start_lsn) T
WHERE T.t = 1

SQL Query to show attendance between two dates

I've got a .NET application with an attendance table which has fields for a start and end date. I'm struggling to show a graph of attendance for a given period. I can easily find how many rows are applicable on any given day using BETWEEN, but I can't get my head around pivoting the results so that I can graph a count of rows per day. I could run a SQL query for every day individually and then graph the results, but is there any way of doing this with T-SQL that I could then use for the graph?
Edit:
Apologies, as this is the first time I've asked a question here, but as huMpty duMpty has stated, the question probably needs more clarification. I've got both a startdate and an enddate column in the SQL db, and I need to count, per day, whether the range between these dates falls within the range of the selection criteria. E.g. if I've got a start date of 2013-01-01 and an end date of 2013-01-10, and I report on the period 2013-01-09 to 2013-01-11, then I'm looking to get a result of 1 for 2013-01-09 and 1 for 2013-01-10. Hope this makes more sense, and thanks for your assistance.
I think you have a table with start and end dates; for a given date range, you would like to know the number of records that fall on each date.
I believe this problem can be solved with a numbers table. I create a numbers table on the fly below, but I recommend creating a permanent numbers table in your production code.
Create Table Attendance (
id int primary key identity(1,1) not null
,start_date date not null
,end_date date not null
);
Go
Insert Attendance(start_date, end_date)
Values ('1/1/2013', '1/10/2013')
,('1/10/2013', '1/15/2013')
,('2/20/2013', '3/1/2013');
Go
-- Create numbers table. See: Method 3 of http://stackoverflow.com/a/1407488/772086
With Numbers(Number) As
(
Select 1 As Number
Union All
Select Number + 1 From Numbers Where Number < 10000
)
Select
AttendanceDate = Convert(date, DateAdd(day,Numbers.number, '1/1/2000'))
,AttendanceCount = Count(*)
From dbo.Attendance
Join Numbers
On Numbers.Number
Between DateDiff(day, '1/1/2000', Attendance.start_date)
And DateDiff(day, '1/1/2000', Attendance.end_date)
-- Reporting range between 1/9 and 1/11
Where DateAdd(day,Numbers.number, '1/1/2000') Between '1/9/2013'
And '1/11/2013'
Group By Convert(date, DateAdd(day,Numbers.number, '1/1/2000'))
Option(MaxRecursion 10000);
All dates are in US format (m/d/yy) - you may want to switch those to the internationalized standard (yyyy-mm-dd) in your production code.
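If you do create a permanent numbers table, one common way to fill it (SQL Server; the table name and row count are just examples) is:

-- One-off creation of a permanent numbers table with 100,000 rows.
CREATE TABLE dbo.Numbers (Number int NOT NULL PRIMARY KEY);

INSERT INTO dbo.Numbers (Number)
SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects a
CROSS JOIN sys.all_objects b;

The query above can then join dbo.Numbers directly and drop both the recursive CTE and the MAXRECURSION hint.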
You said you wanted a count by day in a date range. That can be done with a COUNT with a GROUP BY clause.
I don't know your schema, but a solution might look like this:
declare @MyTable table
(
ID int identity(1,1) primary key clustered,
MyDate smalldatetime
)
insert into @MyTable (MyDate)
values
('2012-12-31'), -- before the date range, so not included in results
('2013-01-10'),
('2013-01-10'), -- appears twice
('2013-01-11'), -- appears once
('2013-01-12') -- after the date range, so not included in results
select * from @MyTable
select
MyDate,
count(*)
from @MyTable
where MyDate between '2013-01-09' and '2013-01-11'
group by MyDate

How can I get multiple records printed as a single row

These three columns are taken from 3 tables. In other words, the records are
retrieved by joining 3 tables.
It is basically a very simple time sheet that keeps track of shift start time, lunch time and so on.
I want these four records to show in one row, for example:
setDate --- ShiftStarted --- LunchStarted --- LunchEnded ---- ShiftEnded ----- TimeEntered
Note: disregard the TimeEntered column. I will deal with this later; once I know how to solve the above issue, it will be easy for me to handle the rest.
How can I do it?
Further Info - Here is my query:
SELECT TimeSheet.setDate, TimeSheetType.tsTypeTitle
FROM TimeSheet
INNER JOIN TimeSheetDetail ON TimeSheet.timeSheetID = TimeSheetDetail.timeSheetID
INNER JOIN TimeSheetType ON TimeSheetType.timeSheetTypeID = TimeSheetDetail.timeSheetTypeID
TimeSheet table consists of the following columns:
timeSheetID
employeeID - FK
setDate
setDate represents today's date.
TimeSheetType table consists of the following columns:
timeSheetTypeID
tsTypeTitle
tsTypeTitle represents shifts e.g. shift starts at, lunch starts at, shift ends at, etc.
TimeSheetDetail table consists of the following columns:
timeSheetDetailID
timeSheetID - FK
timeSheetTypeID - FK
timeEntered
addedOn
timeEntered represents the time that the employee set manually.
addedOn represents the system time, the time that a record was inserted.
I must admit I haven't fully read everything, but I think you can work out the rest for yourself. Basically, you can join the timesheet table with itself.
I did this ...
create table timesheet (timesheet number, setdate timestamp, timesheettype varchar2(200), timeentered timestamp);
insert into timesheet values (1,to_date('2010-08-02','YYYY-MM-DD'),'Shift Started',current_timestamp);
insert into timesheet values (1,to_date('2010-08-02','YYYY-MM-DD'),'Lunch Started',current_timestamp);
insert into timesheet values (1,to_date('2010-08-02','YYYY-MM-DD'),'Lunch Ended',current_timestamp);
insert into timesheet values (1,to_date('2010-08-02','YYYY-MM-DD'),'Shift Ended',current_timestamp);
commit;
select * from timesheet t1
left join timesheet t2 on (t1.timesheet = t2.timesheet)
where t1.timesheettype = 'Shift Started'
and t2.timesheettype = 'Lunch Started'
... and got out this
TIMESHEET SETDATE TIMESHEETTYPE TIMEENTERED TIMESHEET_1 SETDATE_1 TIMESHEETTYPE_1 TIMEENTERED_1
1 02.08.2010 00:00:00.000000 Shift Started 05.08.2010 12:35:56.264075 1 02.08.2010 00:00:00.000000 Lunch Started 05.08.2010 12:35:56.287357
It was not SQL Server but in principle it should work for you too.
Let me know if you still have a question
You might want to check out the PIVOT operator. It basically allows you to use particular row values to create new columns in your result set.
You'll have to supply an aggregate function for combining multiple rows - for instance (assuming you split your data on a per day basis), you'll have to decide how to deal with multiple "shift started" events on the same day. Assuming that such events never occur, you'll still have to use an aggregate. MAX() is usually a safe choice in those circumstances.
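For the timesheet schema above, a PIVOT version might look roughly like the sketch below, using MAX() as the aggregate. Note that the literal titles in the IN list ('Shift Started' and so on) are assumptions and must match whatever values actually exist in TimeSheetType.tsTypeTitle:

-- One row per timesheet/date, with one column per entry type.
SELECT timeSheetID, setDate,
       [Shift Started], [Lunch Started], [Lunch Ended], [Shift Ended]
FROM (
    SELECT ts.timeSheetID, ts.setDate, tst.tsTypeTitle, tsd.timeEntered
    FROM TimeSheet ts
    INNER JOIN TimeSheetDetail tsd ON ts.timeSheetID = tsd.timeSheetID
    INNER JOIN TimeSheetType tst ON tst.timeSheetTypeID = tsd.timeSheetTypeID
) src
PIVOT (
    MAX(timeEntered)
    FOR tsTypeTitle IN ([Shift Started], [Lunch Started], [Lunch Ended], [Shift Ended])
) p;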

Standard SQL: select current records from an audit log

My memory is failing me. I have a simple audit log table based on a trigger:
ID int (identity, PK)
CustomerID int
Name varchar(255)
Address varchar(255)
AuditDateTime datetime
AuditCode char(1)
It has data like this:
ID CustomerID Name    Address             AuditDateTime           AuditCode
1  123        Bob     123 Internet Way    2009-07-17 13:18:06.353 I
2  123        Bob     123 Internet Way    2009-07-17 13:19:02.117 D
3  123        Jerry   123 Internet Way    2009-07-17 13:36:03.517 I
4  123        Bob     123 My Edited Way   2009-07-17 13:36:08.050 U
5  100        Arnold  100 SkyNet Way      2009-07-17 13:36:18.607 I
6  100        Nicky   100 Star Way        2009-07-17 13:36:25.920 U
7  110        Blondie 110 Another Way     2009-07-17 13:36:42.313 I
8  113        Sally   113 Yet another Way 2009-07-17 13:36:57.627 I
What would be an efficient select statement to get the most current record for each customer between a start and end time? FYI: I is for insert, D for delete, and U for update.
Am I missing anything in the audit table? My next step is to create an audit table that only records changes, yet lets you extract the most recent records for a given time frame. For the life of me I cannot find this easily on any search engine. Links would work too. Thanks for the help.
Another (better?) method to keep audit history is to use a 'startDate' and 'endDate' column rather than an auditDateTime and AuditCode column. This is often the approach in tracking Type 2 changes (new versions of a row) in data warehouses.
This lets you more directly select the current rows (WHERE endDate is NULL), and you will not need to treat updates differently than inserts or deletes. You simply have three cases:
Insert: copy the full row along with a start date and NULL end date
Delete: set the End Date of the existing current row (endDate is NULL)
Update: do a Delete then Insert
Your select would simply be:
select * from AuditTable where endDate is NULL
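As a minimal sketch of those three cases, assuming an AuditTable with CustomerID, Name, Address, startDate and endDate columns (the @ variables are just placeholders for the incoming values):

-- Insert: open a new current row.
INSERT INTO AuditTable (CustomerID, Name, Address, startDate, endDate)
VALUES (@CustomerID, @Name, @Address, GETDATE(), NULL);

-- Delete: close the current row.
UPDATE AuditTable
SET endDate = GETDATE()
WHERE CustomerID = @CustomerID AND endDate IS NULL;

-- Update: close the current row, then open a new one with the new values.
UPDATE AuditTable
SET endDate = GETDATE()
WHERE CustomerID = @CustomerID AND endDate IS NULL;

INSERT INTO AuditTable (CustomerID, Name, Address, startDate, endDate)
VALUES (@CustomerID, @NewName, @NewAddress, GETDATE(), NULL);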
Anyway, here's my query for your existing schema:
declare @from datetime
declare @to datetime
select b.* from (
select
customerId,
max(auditdatetime) 'auditDateTime'
from
AuditTable
where
auditcode in ('I', 'U')
and auditdatetime between @from and @to
group by customerId
having
/* rely on "current" being defined as INSERTS > DELETES */
sum(case when auditcode = 'I' then 1 else 0 end) >
sum(case when auditcode = 'D' then 1 else 0 end)
) a
cross apply(
select top 1 customerId, name, address, auditdateTime
from AuditTable
where auditdatetime = a.auditdatetime and customerId = a.customerId
) b
References
A cribsheet for data warehouses, but has a good section on type 2 changes (what you want to track)
MSDN page on data warehousing
Ok, a couple of things for audit log tables.
For most applications, we want audit tables to be extremely quick on insertion.
If the audit log is truly for diagnostic or for very irregular audit reasons, then the quickest insertion criteria is to make the table physically ordered upon insertion time.
And this means to put the audit time as the first column of the clustered index, e.g.
create unique clustered index idx_mytable on mytable(AuditDateTime, ID)
This will allow for extremely efficient select queries upon AuditDateTime O(log n), and O(1) insertions.
If you wish to look up your audit table on a per CustomerID basis, then you will need to compromise.
You may add a nonclustered index upon (CustomerID, AuditDateTime), which will allow for O(log n) lookup of per-customer audit history, however the cost will be the maintenance of that nonclustered index upon insertion - that maintenance will be O(log n) conversely.
However that insertion time penalty may be preferable to the table scan (that is, O(n) time complexity cost) that you will need to pay if you don't have an index on CustomerID and this is a regular query that is performed.
An O(n) lookup, which locks the table against the writing process for an irregular query, may block up writers. So it is sometimes in the writers' interest to be slightly slower, if that guarantees readers won't block their commits by having to table scan for lack of a good index to support them.
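The nonclustered index being described would be along these lines (index name is arbitrary):

-- Supports per-customer history lookups, at the cost of extra index maintenance on insert.
CREATE NONCLUSTERED INDEX idx_mytable_customer
ON mytable (CustomerID, AuditDateTime);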
Addition: if you are looking to restrict to a given timeframe, the most important thing first of all is the index upon AuditDateTime. And make it clustered as you are inserting in AuditDateTime order. This is the biggest thing you can do to make your query efficient from the start.
Next, if you are looking for the most recent update for all CustomerIDs within a given timespan, then a full scan of the data, restricted by insertion date, is required.
You will need to do a subquery upon your audit table, between the range,
select CustomerID, max(AuditDateTime) MaxAuditDateTime
from AuditTrail
where AuditDateTime >= @begin and AuditDateTime <= @end
group by CustomerID
and then incorporate that into your select query proper, eg.
select AuditTrail.* from AuditTrail
inner join
(select CustomerID, max(AuditDateTime) MaxAuditDateTime
from AuditTrail
where AuditDateTime >= @begin and AuditDateTime <= @end
group by CustomerID
) filtration
on filtration.CustomerID = AuditTrail.CustomerID and
filtration.MaxAuditDateTime = AuditTrail.AuditDateTime
Another approach is to use a sub-select:
select a.ID
, a.CustomerID
, a.Name
, a.Address
, a.AuditDateTime
, a.AuditCode
from myauditlogtable a,
(select s.CustomerID
, max(s.AuditDateTime) as maxAuditDateTime
from myauditlogtable as s
group by s.CustomerID)
as subq
where subq.CustomerID = a.CustomerID
and subq.maxAuditDateTime = a.AuditDateTime;
Do you mean a start and end time, e.g. between 1am and 3am?
Or a start and end date-time, e.g. 2009-07-17 13:36 to 2009-07-18 13:36?