Optimize SELECT query for a large database - sql

This is a part of my database:
ID  EmployeeID  Status    EffectiveDate
1   110545      Active    2011-08-01
2   110700      Active    2012-01-05
3   110060      Active    2012-01-05
4   110222      Active    2012-06-30
5   110545      Resigned  2012-07-01
6   110545      Active    2013-02-12
I want to generate records which select Active employees:
ID  EmployeeID  Status    EffectiveDate
2   110700      Active    2012-01-05
3   110060      Active    2012-01-05
4   110222      Active    2012-06-30
So, I tried this query:
SELECT *
FROM Employee AS E
WHERE E.Status = 'Active'
  AND E.EffectiveDate BETWEEN '2011-08-01' AND '2012-07-02'
  AND NOT EXISTS (SELECT *
                  FROM Employee AS E2
                  WHERE E2.EmployeeID = E.EmployeeID
                    AND E2.Status = 'Resigned'
                    AND E2.EffectiveDate BETWEEN '2011-08-01' AND '2012-07-02');
It works with a small amount of data, but I get a timeout error on a large database.
Can you help me optimize this?

This is how I read your request: You want to show active employees. For this to happen, you look at their latest entry, which is either 'Active' or 'Resigned'.
You want to restrict this to a certain time range. That probably means you want to find all employees that became active without becoming immediately inactive again within that time frame.
So, get the latest date per employee first, then stay with those rows in case they are active.
select *
from employee
where (employeeid, effectivedate) in
(
  select employeeid, max(effectivedate)
  from employee
  where effectivedate between date '2011-08-01' and date '2012-07-02'
  group by employeeid
)
and status = 'Active'
order by employeeid;
The subquery restricts rows to the time range and then finds each employee's latest date within it. I'd offer the DBMS this index:
create index idx on employee (effectivedate, employeeid);
The main query wants to find that row again by using employeeid and effectivedate and would then look up the status. The above index could be used again. We could even add the status in order to ease the lookup:
create index idx on employee (effectivedate, employeeid, status);
The DBMS may use this index or not. That's up to the DBMS to decide. I find it likely that it will, for it can be used for all steps in the execution of the query and even contains all columns the query works with, so the table itself wouldn't even have to be read.
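As a quick cross-check of the latest-row-per-employee approach, here is a sketch against the question's sample data using Python's sqlite3. The row-value `IN` is rewritten as a join, since SQLite's support for row values is version-dependent; table and column names mirror the question.

```python
import sqlite3

# Build the question's sample data in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (id INTEGER, employeeid INTEGER, status TEXT, effectivedate TEXT);
INSERT INTO employee VALUES
  (1, 110545, 'Active',   '2011-08-01'),
  (2, 110700, 'Active',   '2012-01-05'),
  (3, 110060, 'Active',   '2012-01-05'),
  (4, 110222, 'Active',   '2012-06-30'),
  (5, 110545, 'Resigned', '2012-07-01'),
  (6, 110545, 'Active',   '2013-02-12');
""")

# Latest row per employee within the range, kept only if it is 'Active'.
rows = conn.execute("""
    SELECT e.id, e.employeeid, e.status, e.effectivedate
    FROM employee e
    JOIN (SELECT employeeid, MAX(effectivedate) AS maxdate
          FROM employee
          WHERE effectivedate BETWEEN '2011-08-01' AND '2012-07-02'
          GROUP BY employeeid) latest
      ON latest.employeeid = e.employeeid
     AND latest.maxdate = e.effectivedate
    WHERE e.status = 'Active'
    ORDER BY e.employeeid
""").fetchall()
print(rows)  # employee 110545 drops out: its latest row in range is 'Resigned'
```

Employee 110545's latest row inside the range is the 'Resigned' one, so the filter removes them; the other three employees survive, matching the expected output.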

I have tried to achieve the above result set using CASE expressions.
Hope this helps.
CREATE TABLE employee_test
(rec           NUMBER,
 employee_id   NUMBER,
 status        VARCHAR2(100),
 effectivedate DATE);

INSERT INTO employee_test VALUES(1,110545,'Active',TO_DATE('01-08-2011','DD-MM-YYYY'));
INSERT INTO employee_test VALUES(2,110700,'Active',TO_DATE('05-01-2012','DD-MM-YYYY'));
INSERT INTO employee_test VALUES(3,110060,'Active',TO_DATE('05-01-2012','DD-MM-YYYY'));
INSERT INTO employee_test VALUES(4,110222,'Active',TO_DATE('30-06-2012','DD-MM-YYYY'));
INSERT INTO employee_test VALUES(5,110545,'Resigned',TO_DATE('01-07-2012','DD-MM-YYYY'));
INSERT INTO employee_test VALUES(6,110545,'Active',TO_DATE('12-02-2013','DD-MM-YYYY'));
COMMIT;
SELECT *
FROM (SELECT e.*,
             CASE WHEN (effectivedate BETWEEN TO_DATE('2011-08-01','YYYY-MM-DD')
                                          AND TO_DATE('2012-07-02','YYYY-MM-DD')
                        AND status = 'Active')
                  THEN 'Y' ELSE 'N'
             END AS flag
      FROM employee_test e)
WHERE flag = 'Y';

I'm adding another answer with another interpretation of the request. Just in case :-)
The table shows statuses per employee. An employee can become active, then resign, then become active again. But they cannot, of course, become active and then active again without resigning in between.
We are looking at a time range and want to find all employees who became active but never resigned within it - no matter whether they became active again after resigning in that period.
This makes it easy. We are looking for employees that have exactly one row in that time range, and that row is 'Active'. One way to do this:
select employeeid, any_value(effectivedate), max(status)
from employee
where effectivedate between date '2011-08-01' and date '2012-07-02'
group by employeeid
having max(status) = 'Active'
order by employeeid;
As in my other answer, an appropriate index would be
create index idx on employee (effectivedate, employeeid, status);
as we want to look into the date range and look up the statuses per employee.
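The grouped query can likewise be sketched with Python's sqlite3 against the question's sample data. ANY_VALUE isn't available in SQLite, so MIN(effectivedate) stands in for it here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (id INTEGER, employeeid INTEGER, status TEXT, effectivedate TEXT);
INSERT INTO employee VALUES
  (1, 110545, 'Active',   '2011-08-01'),
  (2, 110700, 'Active',   '2012-01-05'),
  (3, 110060, 'Active',   '2012-01-05'),
  (4, 110222, 'Active',   '2012-06-30'),
  (5, 110545, 'Resigned', '2012-07-01'),
  (6, 110545, 'Active',   '2013-02-12');
""")

# MAX(status) = 'Active' works because 'Active' < 'Resigned' alphabetically:
# any 'Resigned' row inside the range bumps the maximum and excludes the employee.
rows = conn.execute("""
    SELECT employeeid, MIN(effectivedate) AS effectivedate, MAX(status) AS status
    FROM employee
    WHERE effectivedate BETWEEN '2011-08-01' AND '2012-07-02'
    GROUP BY employeeid
    HAVING MAX(status) = 'Active'
    ORDER BY employeeid
""").fetchall()
```

Note that the HAVING trick leans on the alphabetical ordering of the two status strings; with a third status value, an explicit `SUM(CASE WHEN status = 'Resigned' ...)` check would be safer.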

Related

How to get emp id for any changed values that happened after the first record

I have an emp table.
It contains many columns, and an employee can have many rows depending on the changes made to his/her records. It has no primary key because emp id can repeat across an employee's rows.
There's a column "health" that describes the employee's health, with values ('heart', 'skin', null, etc.), and a modification_date for each change of value in the health column.
Let's say employee number 1 has a heart problem as the first record registered in the health column.
Then the employee got well, so a second row was added with health = null.
After some time the employee got sick with another disease, 'skin'.
How do I get the employee number if his/her "health" column has been changed to any value - whether the value became null or something else?
Any help please?
select empid, health_status
from
(
  select e.emp_id empid,
         e.health health_status,
         count(e.health) over (partition by e.emp_id
                               order by e.modification_date asc) sick_count
  from emp e
)
where sick_count > 1
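A sketch of that running-count query using Python's sqlite3; the sample rows are assumptions that mirror the question's narrative (heart, then well, then skin).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp (emp_id INTEGER, health TEXT, modification_date TEXT);
INSERT INTO emp VALUES
  (1, 'heart', '2020-01-01'),  -- first sickness
  (1, NULL,    '2020-02-01'),  -- got well
  (1, 'skin',  '2020-03-01');  -- sick again
""")

# COUNT(health) ignores NULLs, so the running count only advances on
# "sick" rows; rows where it exceeds 1 mark a repeat sickness.
rows = conn.execute("""
    SELECT empid, health_status
    FROM (SELECT e.emp_id AS empid, e.health AS health_status,
                 COUNT(e.health) OVER (PARTITION BY e.emp_id
                                       ORDER BY e.modification_date) AS sick_count
          FROM emp e)
    WHERE sick_count > 1
""").fetchall()
```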
It seems you need to count NULLs and NOT-NULLs. The NVL2() function would suit well to compute this, such as:
SELECT e.emp_id, e.health,
       SUM(NVL2(health, 1, 0)) OVER (PARTITION BY e.emp_id) AS "Sick",
       SUM(NVL2(health, 0, 1)) OVER (PARTITION BY e.emp_id) AS "Got well"
FROM emp e
If health is NOT NULL, then NVL2 returns its second argument; otherwise it returns the third. By the way, an ORDER BY clause in these analytic functions would be redundant.
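NVL2 is Oracle-specific; as a portable cross-check, the same per-employee counts can be computed with CASE expressions in SQLite (sample rows assumed, mirroring the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp (emp_id INTEGER, health TEXT, modification_date TEXT);
INSERT INTO emp VALUES
  (1, 'heart', '2020-01-01'),
  (1, NULL,    '2020-02-01'),
  (1, 'skin',  '2020-03-01');
""")

# CASE WHEN health IS NOT NULL THEN 1 ELSE 0 END is the portable spelling
# of NVL2(health, 1, 0); the per-employee sums count sick vs. recovered rows.
rows = conn.execute("""
    SELECT e.emp_id, e.health,
           SUM(CASE WHEN health IS NOT NULL THEN 1 ELSE 0 END)
               OVER (PARTITION BY e.emp_id) AS sick,
           SUM(CASE WHEN health IS NULL THEN 1 ELSE 0 END)
               OVER (PARTITION BY e.emp_id) AS got_well
    FROM emp e
""").fetchall()
```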
From Oracle 12, you can use MATCH_RECOGNIZE to find the employees who were sick, got well and then got sick again:
SELECT emp_id
FROM   emp
MATCH_RECOGNIZE (
  PARTITION BY emp_id
  ORDER BY modification_date
  PATTERN (sick well+ sick)
  DEFINE
    sick AS health IS NOT NULL,
    well AS health IS NULL
);
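MATCH_RECOGNIZE requires Oracle 12c or later; where it isn't available, the same sick/well+/sick pattern can be checked per employee in plain Python. The `history` data here is an assumption mirroring the question.

```python
from itertools import groupby

# health values per employee, ordered by modification_date
# (non-NULL = sick, NULL = well); sample data is assumed.
history = {
    1: ['heart', None, 'skin'],  # sick, well, sick again -> matches
    2: ['heart', None],          # never got sick again   -> no match
}

def sick_well_sick(values):
    # Collapse consecutive rows into runs of sick (True) / well (False),
    # then look for a sick run, a well run, and a sick run in a row,
    # which is exactly what PATTERN (sick well+ sick) matches.
    runs = [key for key, _ in groupby(values, key=lambda v: v is not None)]
    return any(runs[i:i + 3] == [True, False, True] for i in range(len(runs) - 2))

matching = [emp for emp, values in history.items() if sick_well_sick(values)]
print(matching)  # only employee 1 got sick, recovered, and got sick again
```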

How to replace a column value for the matching records within a table in Oracle

I have a table that has some matching values for employees; i.e., an employee could be in multiple departments.
I want to identify those records on the basis of their "name" and "dob". If they match, then replace the "id" with an increment of the decimal.
In the example below, Mike is in 2 departments (IT, Finance), so I want his IT dept id (as an increment of the decimal) in the final outcome. The base id can be identified on the basis of department IT.
Please let me know how I can do this.
Let's take the min id and the row number divided by 10:
SELECT
  MIN(id) OVER (PARTITION BY name, dob)
    + (ROW_NUMBER() OVER (PARTITION BY name, dob ORDER BY id) - 1) / 10 AS id,
  name,
  dob,
  department
FROM
  emp
I chose the min id for the employee as the base id. If you have a different strategy - e.g., you want the IT dept id to form the base value - then instead of MIN(id), consider something like FIRST_VALUE(id) OVER(PARTITION BY ... ORDER BY CASE WHEN department = 'IT' THEN 0 ELSE 1 END).
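A sketch of the min-id-plus-decimal idea using Python's sqlite3. The emp rows are made up for illustration (Mike's base id of 5 is an assumption); dividing by 10.0 forces decimal arithmetic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp (id INTEGER, name TEXT, dob TEXT, department TEXT);
INSERT INTO emp VALUES
  (5, 'Mike', '1980-01-01', 'IT'),
  (8, 'Mike', '1980-01-01', 'Finance'),
  (6, 'Ann',  '1975-05-05', 'HR');
""")

# Base id = smallest id per (name, dob); each further row for the
# same person adds 0.1, 0.2, ... on top of that base.
rows = conn.execute("""
    SELECT MIN(id) OVER (PARTITION BY name, dob)
             + (ROW_NUMBER() OVER (PARTITION BY name, dob ORDER BY id) - 1) / 10.0
             AS new_id,
           name, dob, department
    FROM emp
    ORDER BY name, id
""").fetchall()
```

Mike's two rows come out as 5.0 and 5.1, while Ann's single row keeps her own id as the base.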
I agree with Tim, though; a good deal of your question seems unclear, poorly specified, or not completely thought through. What if an employee is in 10 departments and an id conflict occurs? Generally we don't care what an id number is, so we don't change it or try to fill gaps, etc.

sql server row number order by

I have the below table in SQL Server 2014:
Empno  Resign      Hour  Dept
1000   2999-01-01  40    20
1000   2999-01-01  40    21
1001   2999-01-01  40    22
1001   2999-01-01  40    23
I need to pick the top record per employee based on resignation date and hour. It doesn't matter which dept's row gets picked. So I went with this query:
SELECT *
FROM (SELECT Empno, Resign, Hour, Dept,
             ROW_NUMBER() OVER (PARTITION BY Empno
                                ORDER BY Resign DESC, Hour DESC) AS Row
      FROM Table) AS master
WHERE master.Row = 1
  AND master.Empno = '1000';
I got back:
EmployeeNumber  ResignationDate  Hour  Dept  Row
1000            2999-01-01       40    20    1
I understand SQL Server doesn't guarantee the order of the row numbering (in this case, the row with which dept) unless an ORDER BY clause is specified for Dept.
I don't mind which dept's row gets picked up, but would this happen consistently - picking one based on some index or id? How would the top row be produced by the query plan?
In the ROW_NUMBER I could simply add another ORDER BY column based on dept so that it consistently picks one, but I don't want to do that.
No. You have to add an ORDER BY with a unique combination of columns to force non-arbitrary output.
And why? That's an easy question: SQL Server doesn't see tables the way a human does. It reads and finds pages, which may not be adjacent or in order.
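The point - only a unique ORDER BY forces a stable pick - can be sketched with the question's data using Python's sqlite3. Since Resign and Hour tie within each Empno, Dept is added as a final tiebreaker to make the result deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp (Empno INTEGER, Resign TEXT, Hour INTEGER, Dept INTEGER);
INSERT INTO emp VALUES
  (1000, '2999-01-01', 40, 20),
  (1000, '2999-01-01', 40, 21),
  (1001, '2999-01-01', 40, 22),
  (1001, '2999-01-01', 40, 23);
""")

# Resign and Hour tie for each Empno, so without the Dept tiebreaker the
# engine may number the rows either way; with it, the result is stable.
rows = conn.execute("""
    SELECT Empno, Resign, Hour, Dept
    FROM (SELECT Empno, Resign, Hour, Dept,
                 ROW_NUMBER() OVER (PARTITION BY Empno
                                    ORDER BY Resign DESC, Hour DESC, Dept) AS rn
          FROM emp)
    WHERE rn = 1 AND Empno = 1000
""").fetchall()
```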

SQL getting count in a date range

I'm looking for input on getting a COUNT of records that were 'active' in a certain date range.
CREATE TABLE member (
  id int identity,
  name varchar,
  active bit
)
The scenario is one where the number of "members" fluctuates over time. So I could have linear growth where I have 10 members at the beginning of the month and 20 at the end. Currently we go off the number of CURRENTLY ACTIVE members (as marked by an 'active' flag in the DB) AT THE TIME OF THE REPORT - this is hardly accurate, and worse, 6 months from now my "members" figure may be substantially different than it is now. And since I'm doing averages per user, if I run a report now and again 6 months from now, the figures will probably be different.
I don't think a simple "dateActive" and "dateInactive" will do the trick... due to members coming and going and coming back, etc. So:
JOE may be activated 12-1, deactivated 12-8, and activated again 12-20,
so JOE counts as being a 'member' for 8 days and then 11 days, for a total of 19 days.
But the revolving-door status of members means keeping a separate table (presumably) of UserId, status, date:
CREATE TABLE memberstatus (
  member_id int,
  status bit, -- 0 for in-active, 1 for active
  date date
)
(Adding this table would make the 'active' field in member obsolete.)
In order to get a "good" average of members per month (or date range), it seems I'd need to get a daily average and then take an average of averages over 'x' days. OR is there some way in SQL to do this already?
This extra "status" table would allow an accurate count going back in time. So in a case where you have a revenue or cost figure that DOESN'T change and is not an aggregate - it's fixed - when you want cost/members for last June, you certainly don't want to use your current member count; you want last June's.
Is this how it's done? I know it's one way, but is it the 'better' way?
@Gordon - I got ya, but I guess I was looking at records like this:
Members
1 Joe
2 Tom
3 Sue
MemberStatus
1 1 '12-01-2014'
1 0 '12-08-2014'
1 1 '12-20-2014'
In this way I only need the last record for a user to get their current status, but I can track back and "know" their status on any given day.
If I'm understanding your method, it might look like this:
CREATE TABLE memberstatus (
  member_id int,
  active_date date,
  inactive_date date
)
so on the 1st through the 7th the record would look like this:
1 '12-01-2014' null
and on the 8th it would change to
1 '12-01-2014' '12-08-2014'
then on the 20th:
1 '12-01-2014' '12-08-2014'
1 '12-20-2014' null
Although I can get the same data out, it seems more difficult without any benefit - am I missing something?
You could also use a 2 table method to have a one-to-many relationship for working periods. For example you have a User table
User
UserID int, UserName varchar
and an Activity table that holds ranges
Activity
ActivityID int, UserID int, startDate date, (duration int or endDate date)
Then whenever you wanted information you could do something like (for example)...
SELECT User.UserName, count(*)
FROM Activity
LEFT OUTER JOIN User ON User.UserID = Activity.UserID
WHERE startDate >= '2014-01-01' AND startDate < '2015-01-01'
GROUP BY User.UserID, User.UserName
...to get a count, grouped by user (and labeled by username), of the times they became active in 2014.
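A sketch of that activations-per-user count using Python's sqlite3. The sample rows are made up for illustration, and the `User` table is renamed `users` to dodge reserved-word issues in some engines:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (UserID INTEGER, UserName TEXT);
CREATE TABLE activity (ActivityID INTEGER, UserID INTEGER, startDate TEXT, endDate TEXT);
INSERT INTO users VALUES (1, 'Joe'), (2, 'Tom');
INSERT INTO activity VALUES
  (1, 1, '2014-12-01', '2014-12-08'),
  (2, 1, '2014-12-20', NULL),
  (3, 2, '2014-03-01', NULL),
  (4, 2, '2013-06-01', '2013-07-01');  -- outside 2014, not counted
""")

# Count how many times each user became active during 2014.
rows = conn.execute("""
    SELECT users.UserName, COUNT(*)
    FROM activity
    LEFT OUTER JOIN users ON users.UserID = activity.UserID
    WHERE startDate >= '2014-01-01' AND startDate < '2015-01-01'
    GROUP BY users.UserID, users.UserName
    ORDER BY users.UserName
""").fetchall()
```

Joe's two 2014 activations are counted, and Tom's 2013 period is filtered out by the date range.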
I have used two main ways to accomplish what you want. First would be something like this:
CREATE TABLE [MemberStatus](
  [MemberID] [int] NOT NULL,
  [ActiveBeginDate] [date] NOT NULL,
  [ActiveEndDate] [date] NULL,
  CONSTRAINT [PK_MemberStatus] PRIMARY KEY CLUSTERED
  (
    [MemberID] ASC,
    [ActiveBeginDate] ASC
  )
)
Every time a member becomes active, you add an entry, and when they become inactive you update their ActiveEndDate to the current date.
This is easy to maintain, but can be hard to query. Another option is to do basically what you are suggesting: you can create a scheduled job to run at the end of each day to add entries to the table.
I recommend setting up your tables so that you store more data, but in exchange the structure supports much simpler queries to achieve the reporting you require.
-- whenever a user's status changes, we update this table with the new "active"
-- bit, and we set "activeLastModified" to today.
CREATE TABLE member (
  id int identity,
  name varchar,
  active bit,
  activeLastModified date
)
-- whenever a user's status changes, we insert a new record here
-- with "startDate" set to the current "activeLastModified" field in member,
-- and "endDate" set to today (date of status change).
CREATE TABLE memberStatusHistory (
  member_id int,
  status bit, -- 0 for in-active, 1 for active
  startDate date,
  endDate date,
  days int
)
As for the report you're trying to create (average # of actives in a given month), I think you need yet another table. Pure SQL can't calculate that based on these table definitions. Pulling that data from these tables is possible, but it requires programming.
If you ran something like this once-per-day and stored it in a table, then it would be easy to calculate weekly, monthly and yearly averages:
INSERT INTO myStatsTable (date, activeSum, inactiveSum)
SELECT
  GETDATE(), -- based on DBMS; e.g., "current_date" for Postgres
  active.count,
  inactive.count
FROM
  (SELECT COUNT(id) AS count FROM member WHERE active = true) active
CROSS JOIN
  (SELECT COUNT(id) AS count FROM member WHERE active = false) inactive
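The daily-snapshot idea can be sketched with Python's sqlite3. GETDATE() becomes date('now'), the bit flag is stored as 0/1, and the member rows and myStatsTable are sample assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE member (id INTEGER, name TEXT, active INTEGER);
CREATE TABLE myStatsTable (date TEXT, activeSum INTEGER, inactiveSum INTEGER);
INSERT INTO member VALUES (1, 'Joe', 1), (2, 'Tom', 1), (3, 'Sue', 0);
""")

# One row per day: today's date plus the current active/inactive head counts.
conn.execute("""
    INSERT INTO myStatsTable (date, activeSum, inactiveSum)
    SELECT date('now'),
           (SELECT COUNT(id) FROM member WHERE active = 1),
           (SELECT COUNT(id) FROM member WHERE active = 0)
""")
snapshot = conn.execute("SELECT activeSum, inactiveSum FROM myStatsTable").fetchall()
```

Run daily, the table accumulates one snapshot row per day, which makes weekly, monthly, and yearly averages simple GROUP BY queries.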

SQL standard select current records from an audit log question

My memory is failing me. I have a simple audit log table based on a trigger:
ID int (identity, PK)
CustomerID int
Name varchar(255)
Address varchar(255)
AuditDateTime datetime
AuditCode char(1)
It has data like this:
ID  CustomerID  Name     Address              AuditDateTime            AuditCode
1   123         Bob      123 Internet Way     2009-07-17 13:18:06.353  I
2   123         Bob      123 Internet Way     2009-07-17 13:19:02.117  D
3   123         Jerry    123 Internet Way     2009-07-17 13:36:03.517  I
4   123         Bob      123 My Edited Way    2009-07-17 13:36:08.050  U
5   100         Arnold   100 SkyNet Way       2009-07-17 13:36:18.607  I
6   100         Nicky    100 Star Way         2009-07-17 13:36:25.920  U
7   110         Blondie  110 Another Way      2009-07-17 13:36:42.313  I
8   113         Sally    113 Yet another Way  2009-07-17 13:36:57.627  I
What would an efficient select statement be to get all the most current records between a start and end time? FYI: I is for insert, D for delete, and U for update.
Am I missing anything in the audit table? My next step is to create an audit table that only records changes, yet lets you extract the most recent records for a given time frame. For the life of me I cannot find this easily with any search engine. Links would work too. Thanks for the help.
Another (better?) method to keep audit history is to use a 'startDate' and 'endDate' column rather than an auditDateTime and AuditCode column. This is often the approach in tracking Type 2 changes (new versions of a row) in data warehouses.
This lets you more directly select the current rows (WHERE endDate is NULL), and you will not need to treat updates differently than inserts or deletes. You simply have three cases:
Insert: copy the full row along with a start date and NULL end date
Delete: set the End Date of the existing current row (endDate is NULL)
Update: do a Delete then Insert
Your select would simply be:
select * from AuditTable where endDate is NULL
Anyway, here's my query for your existing schema:
declare @from datetime
declare @to datetime

select b.* from (
  select
    customerId,
    max(auditdatetime) 'auditDateTime'
  from
    AuditTable
  where
    auditcode in ('I', 'U')
    and auditdatetime between @from and @to
  group by customerId
  having
    /* rely on "current" being defined as INSERTS > DELETES */
    sum(case when auditcode = 'I' then 1 else 0 end) >
    sum(case when auditcode = 'D' then 1 else 0 end)
) a
cross apply (
  select top 1 customerId, name, address, auditdatetime
  from AuditTable
  where auditdatetime = a.auditdatetime and customerId = a.customerId
) b
References
A cribsheet for data warehouses, but has a good section on type 2 changes (what you want to track)
MSDN page on data warehousing
Ok, a couple of things for audit log tables.
For most applications, we want audit tables to be extremely quick on insertion.
If the audit log is truly for diagnostic or for very irregular audit reasons, then the quickest insertion criteria is to make the table physically ordered upon insertion time.
And this means to put the audit time as the first column of the clustered index, e.g.
create unique clustered index idx_mytable on mytable(AuditDateTime, ID)
This will allow for extremely efficient select queries upon AuditDateTime O(log n), and O(1) insertions.
If you wish to look up your audit table on a per CustomerID basis, then you will need to compromise.
You may add a nonclustered index upon (CustomerID, AuditDateTime), which will allow for O(log n) lookup of per-customer audit history, however the cost will be the maintenance of that nonclustered index upon insertion - that maintenance will be O(log n) conversely.
However that insertion time penalty may be preferable to the table scan (that is, O(n) time complexity cost) that you will need to pay if you don't have an index on CustomerID and this is a regular query that is performed.
An O(n) lookup that locks the table for an irregular query may block writers, so it is sometimes in the writers' interest to be slightly slower on insert if that guarantees readers won't block their commits by having to table-scan for lack of a good index to support them.
Addition: if you are looking to restrict to a given timeframe, the most important thing first of all is the index upon AuditDateTime. And make it clustered as you are inserting in AuditDateTime order. This is the biggest thing you can do to make your query efficient from the start.
Next, if you are looking for the most recent update for all CustomerID's within a given timespan, well thereafter a full scan of the data, restricted by insertion date, is required.
You will need to do a subquery upon your audit table, between the range,
select CustomerID, max(AuditDateTime) MaxAuditDateTime
from AuditTrail
where AuditDateTime >= @begin and AuditDateTime <= @end
and then incorporate that into your select query proper, e.g.:
select AuditTrail.*
from AuditTrail
inner join
  (select CustomerID, max(AuditDateTime) MaxAuditDateTime
   from AuditTrail
   where AuditDateTime >= @begin and AuditDateTime <= @end
  ) filtration
  on filtration.CustomerID = AuditTrail.CustomerID
 and filtration.MaxAuditDateTime = AuditTrail.AuditDateTime
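A sketch of that subquery-plus-join against the question's audit rows using Python's sqlite3 (the date range here is an assumption spanning all sample rows; note this version does not yet exclude customers whose last action was a delete):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE AuditTrail (ID INTEGER, CustomerID INTEGER, Name TEXT,
                         Address TEXT, AuditDateTime TEXT, AuditCode TEXT);
INSERT INTO AuditTrail VALUES
  (1, 123, 'Bob',     '123 Internet Way',    '2009-07-17 13:18:06.353', 'I'),
  (2, 123, 'Bob',     '123 Internet Way',    '2009-07-17 13:19:02.117', 'D'),
  (3, 123, 'Jerry',   '123 Internet Way',    '2009-07-17 13:36:03.517', 'I'),
  (4, 123, 'Bob',     '123 My Edited Way',   '2009-07-17 13:36:08.050', 'U'),
  (5, 100, 'Arnold',  '100 SkyNet Way',      '2009-07-17 13:36:18.607', 'I'),
  (6, 100, 'Nicky',   '100 Star Way',        '2009-07-17 13:36:25.920', 'U'),
  (7, 110, 'Blondie', '110 Another Way',     '2009-07-17 13:36:42.313', 'I'),
  (8, 113, 'Sally',   '113 Yet another Way', '2009-07-17 13:36:57.627', 'I');
""")

# Latest audit row per customer within the range: max(AuditDateTime)
# per CustomerID in the subquery, joined back to fetch the full row.
rows = conn.execute("""
    SELECT a.ID, a.CustomerID, a.Name
    FROM AuditTrail a
    INNER JOIN (SELECT CustomerID, MAX(AuditDateTime) AS MaxAuditDateTime
                FROM AuditTrail
                WHERE AuditDateTime >= '2009-07-17 00:00:00'
                  AND AuditDateTime <= '2009-07-18 00:00:00'
                GROUP BY CustomerID) f
      ON f.CustomerID = a.CustomerID
     AND f.MaxAuditDateTime = a.AuditDateTime
    ORDER BY a.ID
""").fetchall()
```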
Another approach is using a subselect:
select a.ID
     , a.CustomerID
     , a.Name
     , a.Address
     , a.AuditDateTime
     , a.AuditCode
from myauditlogtable a,
     (select s.CustomerID
           , max(s.AuditDateTime) as maxdate
      from myauditlogtable as s
      group by s.CustomerID) as subq
where subq.CustomerID = a.CustomerID
  and subq.maxdate = a.AuditDateTime;
Start and end time, e.g., as in between 1am and 3am?
Or start and end date-time, e.g., as in 2009-07-17 13:36 to 2009-07-18 13:36?