Finding, grouping and deleting duplicate records in MS SQL while keeping the oldest

I have a table that holds employees' bank account records. A daily job repeatedly re-creates the last record entered by a user, producing duplicates.
For example: for Employee E1 below are the bank account records in DB
Jan 1 2015 - Bank X
Jan 1 2018 - Bank Y
Jan 1 2020 - Bank X
There are multiple duplicate copies of each of these rows, created after each manual change on Jan 1 of 2015, 2018 and 2020.
Now, I have to delete the records that are duplicates in terms of the values of the columns "BankName" and "BankAccountNumber".
In this scenario, the records that should remain in the system are the ones entered on Jan 1 of 2015, 2018 and 2020, even though the name and account number are the same for the two Bank X entries.
The columns we are considering in the table for preparing script are:
1. RecordId (UNIQUEIDENTIFIER), primary key
2. RecordSequence (INT), identity column incremented by 1
3. EmployeeID (INT); the set of records is linked to an employee through the employee table.
My current logic to find and delete the duplicates:
;WITH BARecords AS (
    SELECT recordid,
           ROW_NUMBER() OVER (
               PARTITION BY employeeID, BankName, AccountNumber
               ORDER BY recordsequence
           ) AS row_num
    FROM employeebankaccount WITH (NOLOCK)
    WHERE employeeid IN (SELECT Id FROM #EMPLOYEEIDs)
)
DELETE FROM BARecords
WHERE row_num > 1
My current logic removes the Jan 1 2020 Bank X record as well and keeps only the Jan 1 2015 one.
Since the user re-entered Bank X on Jan 1 2020, that record, with creation date Jan 1 2020, should also remain in the system.
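A common fix for exactly this requirement is the gaps-and-islands technique: number the rows once per employee and once per employee-and-bank, and use the difference of the two ROW_NUMBERs to mark each consecutive run, so a re-entered Bank X in 2020 starts a new group. The sketch below demonstrates the idea in Python against SQLite with an invented miniature of the table (only recordsequence, employeeid and bankname, not the asker's full schema); the real DELETE would target row_num > 1 within each island instead of selecting the keepers.

```python
import sqlite3

# Invented miniature of the employeebankaccount table: only the columns the
# logic needs (recordsequence, employeeid, bankname); the real table has more.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employeebankaccount (
    recordsequence INTEGER PRIMARY KEY,
    employeeid     INTEGER,
    bankname       TEXT
);
INSERT INTO employeebankaccount VALUES
    (1, 1, 'Bank X'),  -- manual change, Jan 1 2015
    (2, 1, 'Bank X'),  -- daily-job duplicate
    (3, 1, 'Bank Y'),  -- manual change, Jan 1 2018
    (4, 1, 'Bank Y'),  -- daily-job duplicate
    (5, 1, 'Bank X'),  -- manual change, Jan 1 2020: must survive
    (6, 1, 'Bank X');  -- daily-job duplicate
""")

# The difference of the two ROW_NUMBERs is constant within a consecutive run
# of the same bank, so each manual change opens a new "island"; only rows
# with row_num > 1 inside an island are daily-job duplicates.
survivors = [r[0] for r in con.execute("""
WITH islands AS (
    SELECT recordsequence, employeeid, bankname,
           ROW_NUMBER() OVER (PARTITION BY employeeid
                              ORDER BY recordsequence)
         - ROW_NUMBER() OVER (PARTITION BY employeeid, bankname
                              ORDER BY recordsequence) AS grp
    FROM employeebankaccount
),
numbered AS (
    SELECT recordsequence,
           ROW_NUMBER() OVER (PARTITION BY employeeid, bankname, grp
                              ORDER BY recordsequence) AS row_num
    FROM islands
)
SELECT recordsequence FROM numbered
WHERE row_num = 1
ORDER BY recordsequence
""")]
print(survivors)  # the three manual changes remain
```

In SQL Server the same two-ROW_NUMBER shape can feed the existing DELETE-from-CTE pattern, with the computed grp column added to the PARTITION BY of the final ROW_NUMBER.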

Related

Nested Impala query to find occurrences of an event twice

I have a table which contains employee name, employee ID and timestamp of times logged.
employee_id  employee_name  event_time
1            Harry          2021-11-18T20:03:25Z
1            Harry          2021-11-19T20:03:25Z
1            Harry          2021-11-20T20:03:25Z
2            Charlie        2021-11-18T20:03:25Z
I need to find out a compliance percentage: basically, the percentage of days logged in, out of the difference between the max day and the min day. This can be done in Impala as follows:
SELECT employee_id, employee_name,
       COUNT(DISTINCT CAST(event_time AS TIMESTAMP))
         / DATEDIFF(CAST(MAX(event_time) AS TIMESTAMP),
                    CAST(MIN(event_time) AS TIMESTAMP)) * 100.0 AS compliance_percentage
FROM employee
GROUP BY employee_id, employee_name;
Now, if a day's event_time is to be considered compliant only when it occurs exactly twice, i.e. if a person logs two times and only two times, how can we write a query in Impala so that we get the required result?
For example, if Harry has logged exactly 2 times on 8 different days between 1 Feb 2021 and 1 Feb 2022, this would be the expected result (cp would be 8/365 * 100):
employee_id  employee_name  cp
1            Harry          2.19
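One way to count only the days with exactly two events is to aggregate per day first, keep the days whose count is 2, and then compute the percentage over the min/max day span. The sketch below shows the shape of that two-level aggregation in Python against SQLite with invented sample rows; SQLite's date() and julianday() stand in for Impala's to_date() and datediff(), which is an assumption for the demo, not the asker's exact functions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee (
    employee_id   INTEGER,
    employee_name TEXT,
    event_time    TEXT
);
-- Invented sample: Harry logs exactly twice on two days, once on a third.
INSERT INTO employee VALUES
    (1, 'Harry', '2021-11-18T08:00:00Z'), (1, 'Harry', '2021-11-18T20:00:00Z'),
    (1, 'Harry', '2021-11-19T08:00:00Z'), (1, 'Harry', '2021-11-19T20:00:00Z'),
    (1, 'Harry', '2021-11-20T08:00:00Z');
""")

# Aggregate per day first, then count only the days whose event count is
# exactly two, divided by the min/max day span.
rows = con.execute("""
WITH per_day AS (
    SELECT employee_id, employee_name,
           date(event_time) AS d, COUNT(*) AS n
    FROM employee
    GROUP BY employee_id, employee_name, date(event_time)
)
SELECT employee_id, employee_name,
       ROUND(100.0 * SUM(n = 2)
             / (julianday(MAX(d)) - julianday(MIN(d))), 2) AS cp
FROM per_day
GROUP BY employee_id, employee_name
ORDER BY employee_id
""").fetchall()
print(rows)  # two compliant days over a two-day span
```

In Impala, the inner level would be a subquery grouped by employee_id, employee_name and to_date(event_time), with the "exactly two" condition written either as a conditional sum as here or as a HAVING COUNT(*) = 2 filter.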

SQL/HQL - Partition By

I'm trying to understand PARTITION BY and getting super confused. I have the following data:
Name ID Product Date Amount
Jason 1 Car Jan 2017 $10
Jason 1 Car Feb 2017 $5
Jason 2 Car Jan 2017 $50
Jason 2 Car Feb 2017 $60
Jason 3 House Jan 2017 $20
Jason 3 House Feb 2017 $30
Would doing:
Select Name, ID, Product, Date, Amount,
    LAG(Amount, 1) OVER (PARTITION BY Name ORDER BY Date)
FROM table
give me Jason's correct previous-month amount for the appropriate Product and ID number? So, for example, at Feb 2017, Jason, ID 1 and Product Car should give me the amount $5.
Or would I need to modify the PARTITION BY to include the Product and ID, such as:
Select Name, ID, Product, Date, Amount,
    LAG(Amount, 1) OVER (PARTITION BY Name, ID, Product ORDER BY Date)
FROM table
Thanks!
I also came here in search of some understanding of the "partition by" clause. To answer your question: LAG gives you the previous row's value within each partition, so with PARTITION BY Name alone the "previous" row can belong to a different ID or Product. You do need to add the other columns (i.e. ID and Product) to your PARTITION BY clause so that each series is lagged separately.
Essentially, you would have your existing 5 columns, plus one that contains the previous row's value of "amount" within that partition.
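The difference between the two PARTITION BY choices is easy to see on the sample data. The sketch below runs both window definitions side by side in Python against SQLite (with the dates rewritten as sortable '2017-01' strings, an assumption for the demo): partitioning by Name alone picks up the previous row of a different ID/Product, while partitioning by Name, ID, Product lags each series separately.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (Name TEXT, ID INTEGER, Product TEXT, Date TEXT, Amount INTEGER);
INSERT INTO sales VALUES
    ('Jason', 1, 'Car',   '2017-01', 10),
    ('Jason', 1, 'Car',   '2017-02', 5),
    ('Jason', 2, 'Car',   '2017-01', 50),
    ('Jason', 2, 'Car',   '2017-02', 60),
    ('Jason', 3, 'House', '2017-01', 20),
    ('Jason', 3, 'House', '2017-02', 30);
""")

# prev_name_only lags over all of Jason's rows mixed together; prev_full
# lags within each (Name, ID, Product) series, which is what the question needs.
rows = con.execute("""
SELECT ID, Product, Date, Amount,
       LAG(Amount, 1) OVER (PARTITION BY Name
                            ORDER BY Date, ID)               AS prev_name_only,
       LAG(Amount, 1) OVER (PARTITION BY Name, ID, Product
                            ORDER BY Date)                   AS prev_full
FROM sales
ORDER BY ID, Date
""").fetchall()
for row in rows:
    print(row)
# For the Feb 2017 row of ID 1, Car: prev_full is 10 (January's amount),
# while prev_name_only is 20 (a House row) -- the partition matters.
```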

How to retrieve equivalent records for a specific person's record in a specific month using SQL

I have an assignment which is to view the names of the top 3 high achievers in my department. A high achiever is a regular employee who stayed the longest hours in the company for a certain month and whose tasks with a deadline in this month are all fixed. My query:
SELECT DISTINCT TOP 3 Regular_Employees.username,
       DATEDIFF(HOUR, start_time, end_time)
FROM Regular_Employees
INNER JOIN Staff_Members
    ON Regular_Employees.username = Staff_Members.username
INNER JOIN Tasks
    ON Staff_Members.username = Tasks.regular_employee
INNER JOIN Attendance_Records
    ON Regular_Employees.username = Attendance_Records.staff
WHERE MONTH(deadline) = MONTH('11/11/2017')
  AND YEAR(deadline) = YEAR('11/11/2017')
  AND department = Staff_Members.department
ORDER BY DATEDIFF(HOUR, start_time, end_time) DESC
I wanted to retrieve the top 3 records for November, and the result was
daniel.magdi 2
farid.elsoury 2
joy.ahmed 2
where the employees are the right employees, but the value '2' is the figure for one day, not the whole month. How can I compute the total for each employee over the whole month and view the results?
I want to insert the date as '11/11/2017' and the result would be
Result
Employee HOURS
daniel.magdi 180
farid.elsoury 179
joy.ahmed 175
where those are the highest records in November
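The day-level DATEDIFF values need to be summed per employee over the month before ranking. Below is a minimal sketch of that aggregation in Python against SQLite, using an invented stand-in for Attendance_Records only (the joins to the other tables are omitted); SQLite has no DATEDIFF(HOUR, ...), so julianday arithmetic stands in for it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Invented stand-in for Attendance_Records: one row per employee per day.
CREATE TABLE Attendance_Records (staff TEXT, start_time TEXT, end_time TEXT);
INSERT INTO Attendance_Records VALUES
    ('daniel.magdi', '2017-11-01 09:00', '2017-11-01 18:00'),
    ('daniel.magdi', '2017-11-02 06:00', '2017-11-02 18:00'),
    ('joy.ahmed',    '2017-11-01 06:00', '2017-11-01 12:00');
""")

# GROUP BY the employee and sum the per-day durations across the month;
# SQL Server's DATEDIFF(HOUR, ...) is emulated with julianday arithmetic.
rows = con.execute("""
SELECT staff,
       SUM((julianday(end_time) - julianday(start_time)) * 24) AS hours
FROM Attendance_Records
WHERE strftime('%Y-%m', start_time) = '2017-11'
GROUP BY staff
ORDER BY hours DESC
LIMIT 3
""").fetchall()
print(rows)  # monthly totals: 9 + 12 = 21h for daniel, 6h for joy
```

In the original T-SQL, the equivalent move is SUM(DATEDIFF(HOUR, start_time, end_time)) with a GROUP BY on the username, then TOP 3 ... ORDER BY that sum.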

Get values for absent records for each employee

I have two tables:
Table 1 - EmployeeCommitment
This table describes the percentage of time commitment for employees.
Columns:
Date: The date they commit their time to.
Employee: employee username
Commitment: E.g. 20% means the employee is planning to commit 20% of their daily time on work
Table 2 - Calendar
This table is basically a calendar with a row for each day of the year.
Columns:
Calendar_Date
Include: a binary column that indicates whether this date is a working day or not (e.g. if the day is weekend or holiday the value is 0 and otherwise 1)
The EmployeeCommitment table does not contain the days on which employees have no commitments, which is quite reasonable, since it only gets a new row when somebody commits to work, and not the other way around.
But what I want to have is to get a row for each day that employees have not committed to any work. E.g. if employee john.smith has committed to only 3 days this week (mon, tue, wed), he should get two extra rows for thu and fri (let's say this is a normal week with no holidays) containing his name and a 0% commitment which will be a total of 5 rows for this week.
I have tried to join the two tables like this:
SELECT * FROM Calendar c LEFT JOIN EmployeeCommitment e ON c.Calendar_Date = e.Date
But this JOIN gives me NULL columns for the table on the right, and there is no way of knowing which employee they belong to.
You need to generate all the rows using a cross join and then bring in the information you want, for instance using a left join:
SELECT c.*, e.Employee, COALESCE(ec.Commitment, 0) AS Commitment
FROM Calendar c CROSS JOIN
     (SELECT DISTINCT Employee FROM Employees) e LEFT JOIN
     EmployeeCommitment ec
     ON c.Calendar_Date = ec.Date AND
        e.Employee = ec.Employee;
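The cross-join-then-left-join shape can be demonstrated end to end. The sketch below uses Python with SQLite and invented sample data; since only the two tables from the question exist here, the distinct employee list is derived from EmployeeCommitment rather than from a separate Employees table, which is an adaptation for the demo.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Calendar (Calendar_Date TEXT, Include INTEGER);
INSERT INTO Calendar VALUES           -- an invented Mon-Fri working week
    ('2023-01-02', 1), ('2023-01-03', 1), ('2023-01-04', 1),
    ('2023-01-05', 1), ('2023-01-06', 1);
CREATE TABLE EmployeeCommitment (Date TEXT, Employee TEXT, Commitment INTEGER);
INSERT INTO EmployeeCommitment VALUES -- john.smith committed Mon-Wed only
    ('2023-01-02', 'john.smith', 20),
    ('2023-01-03', 'john.smith', 50),
    ('2023-01-04', 'john.smith', 100);
""")

# CROSS JOIN builds one row per (working day, employee); the LEFT JOIN then
# attaches the commitment where one exists and COALESCE fills the gaps with 0.
rows = con.execute("""
SELECT c.Calendar_Date, e.Employee, COALESCE(ec.Commitment, 0) AS Commitment
FROM Calendar c
CROSS JOIN (SELECT DISTINCT Employee FROM EmployeeCommitment) e
LEFT JOIN EmployeeCommitment ec
       ON c.Calendar_Date = ec.Date AND e.Employee = ec.Employee
WHERE c.Include = 1
ORDER BY c.Calendar_Date
""").fetchall()
print(rows)  # five rows; Thu and Fri appear with a 0% commitment
```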

Oracle SQL Month Statement Generation

I am having a performance issue with a set of SQL queries that generate the current month's statement in real time.
Customers purchase goods using points in an online system, and a statement containing "open_balance", "point_earned", "point_used" and "current_balance" has to be generated.
The following shows the shortened schema :
//~200k records
customer: {account_id:string, create_date:timestamp, bill_day:int} //totally 14 fields
//~250k records per month, kept for 6 month
history_point: {point_id:long, account_id:string, point_date:timestamp, point:int} //totally 9 fields
//each customer have maximum of 12 past statements kept
history_statement: {account_id:string, open_date:date, close_date:date, open_balance:int, point_earned:int, point_used:int, close_balance:int} //totally 9 fields
On every bill day, the view should automatically create a new monthly statement.
I.e. if bill_day is 15, then a transaction done on or after 16 Dec 2013 00:00:00 belongs to the new bill cycle of 16 Dec 2013 00:00:00 - 15 Jan 2014 23:59:59.
I tried the approach described below:
1. Calculate the last close day for each account (in a materialized view, so that it updates only after a new customer or a past month's statement is inserted into history_statement).
2. Generate a record for each customer for each month that I need to calculate (also in a materialized view).
3. Sieve the point records to keep only those within the dates I will calculate (this takes ~0.1s only).
4. Join 2 with 3 to obtain the points earned and used for each customer for each month.
5. Join 4 with itself on date less than open date to sum the open and close balances.
6a. Select from 5, where the open date is less than 1 month old, as the current balance (these statements are not closed yet, and the points reflect what each customer owns now).
6b. All the statements are obtained by the union of history_statement and 5.
On a development server, the average response time (200K customers, 1.5M transactions in the current month) is ~3s, which is pretty slow for a web application; on the testing server, where resources are likely to be shared, the average response time (200K customers, ~200K transactions per month for 8 months) is 10-15s.
Does anyone have some idea on writing a query with better approach or to speed up the query?
Related SQL:
2: IV_STCLOSE_2_1_T (materialized view)
3: IV_STCLOSE_2_2_T (~0.15s)
SELECT ACCOUNT_ID, POINT_DATE, POINT
FROM history_point
WHERE point_date >= (
SELECT MIN(open_date)
FROM IV_STCLOSE_2_1_t
)
4: IV_STCLOSE_3_T (~1.5s)
SELECT p0.account_id, p0.open_date, p0.close_date,
       COALESCE(SUM(DECODE(SIGN(p.point), -1, p.point)), 0) AS point_used,
       COALESCE(SUM(DECODE(SIGN(p.point), 1, p.point)), 0) AS point_earned
FROM iv_stclose_2_1_t p0
LEFT JOIN iv_stclose_2_2_t p
    ON p.account_id = p0.account_id
    AND p.point_date >= p0.open_date
    AND p.point_date < p0.close_date + INTERVAL '1' DAY
GROUP BY p0.account_id, p0.open_date, p0.close_date
5: IV_STCLOSE_4_T (~3s)
WITH t AS (SELECT * FROM IV_STCLOSE_3_T)
SELECT t1.account_id AS STAT_ACCOUNT_ID, t1.open_date, t1.close_date,
       t1.open_balance, t1.point_earned AS point_earn, t1.point_used,
       t1.open_balance + t1.point_earned + t1.point_used AS close_balance
FROM (
    SELECT v1.account_id, v1.open_date, v1.close_date, v1.point_earned, v1.point_used,
           COALESCE(SUM(v2.point_used + v2.point_earned), 0) AS OPEN_BALANCE
    FROM t v1
    LEFT JOIN t v2
        ON v1.account_id = v2.account_id
        AND v1.OPEN_DATE > v2.OPEN_DATE
    GROUP BY v1.account_id, v1.open_date, v1.close_date, v1.point_earned, v1.point_used
) t1
It turns out that in IV_STCLOSE_4_T the line
WITH t AS (SELECT * FROM IV_STCLOSE_3_T)
is problematic.
At first I thought WITH t AS would be faster, since IV_STCLOSE_3_T is only evaluated once, but it apparently forced materializing the whole of IV_STCLOSE_3_T, generating over 200k records even though I only ever need at most 12 of them for a single customer.
With that clause removed and account_id appropriately indexed, the query cost dropped from over 500k to less than 500.
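As an aside on the expensive step 5: the self-join on v1.OPEN_DATE > v2.OPEN_DATE is a running total in disguise, and a window SUM over the preceding rows computes the same open and close balances in a single pass instead of O(n^2) joined rows. Below is a minimal sketch in Python against SQLite with an invented miniature of IV_STCLOSE_3_T; Oracle supports the same windowed SUM syntax.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Invented miniature of IV_STCLOSE_3_T: one row per account per statement
-- month; point_used is negative, as produced by the DECODE(SIGN(...)) step.
CREATE TABLE iv_stclose_3_t (
    account_id TEXT, open_date TEXT, point_earned INTEGER, point_used INTEGER
);
INSERT INTO iv_stclose_3_t VALUES
    ('A1', '2013-10-16', 100,   0),
    ('A1', '2013-11-16',  50, -30),
    ('A1', '2013-12-16',  20, -10);
""")

# open_balance = everything earned/used in strictly earlier statements;
# close_balance = the same running sum including the current statement.
rows = con.execute("""
SELECT account_id, open_date,
       COALESCE(SUM(point_earned + point_used) OVER (
           PARTITION BY account_id ORDER BY open_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS open_balance,
       SUM(point_earned + point_used) OVER (
           PARTITION BY account_id ORDER BY open_date) AS close_balance
FROM iv_stclose_3_t
ORDER BY account_id, open_date
""").fetchall()
print(rows)  # balances run 0->100, 100->120, 120->130
```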