Finding Duplicate Rows in a Table - sql

I am trying to find out how many duplicate records I have in a table. I can use count, but I'm not sure how best to eliminate records where the count is only 1.
select first_name, last_name, start_date, count(1)
from employee
group by first_name, last_name, start_date;
I can try to order by the count, but I am still not eliminating those with a count of one.

You can add a HAVING clause, HAVING Count(*) > 1, after the GROUP BY, like this:
select
first_name,
last_name,
start_date,
Count(*) AS Count
from
employee
group by
first_name,
last_name,
start_date
having
Count(*) > 1
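To see the HAVING filter at work, here is a minimal sketch that runs the same query against a throwaway in-memory SQLite database through Python's sqlite3 module. The sample rows and the cnt alias are made up for illustration; only the table and column names come from the question.

```python
import sqlite3

# In-memory database with a toy copy of the employee table (sample data is invented).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (first_name TEXT, last_name TEXT, start_date TEXT)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        ("Ann", "Lee", "2020-01-01"),
        ("Ann", "Lee", "2020-01-01"),
        ("Bob", "Ray", "2021-05-01"),
        ("Cy", "Fox", "2019-03-01"),
        ("Cy", "Fox", "2019-03-01"),
        ("Cy", "Fox", "2019-03-01"),
    ],
)

# HAVING filters groups after aggregation, so groups with a count of 1 drop out.
duplicates = conn.execute(
    """
    SELECT first_name, last_name, start_date, COUNT(*) AS cnt
    FROM employee
    GROUP BY first_name, last_name, start_date
    HAVING COUNT(*) > 1
    ORDER BY cnt DESC
    """
).fetchall()

for row in duplicates:
    print(row)
```

Bob Ray appears only once in the sample data, so his group is filtered out and only the two duplicated groups remain.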

Related

Analytic query trying to solve

I'm solving the following task with analytic functions and I'm stuck.
task: Write a query that shows the latest hired employee per department. In case of ties, use the lowest employee ID.
select a.EMPLOYEE_ID,
a.DEPARTMENT_ID,
a.FIRST_NAME,
a.LAST_NAME,
a.HIRE_DATE,
a.JOB_ID
from (select ROW_NUMBER() over (PARTITION by department_id order by hire_date desc)
from hr.EMPLOYEES a) A
where A = 1 ;
You need to include the columns you want to select in the outer query in the SELECT clause of the inner query and need to give an alias to the ROW_NUMBER computed value:
select EMPLOYEE_ID,
DEPARTMENT_ID,
FIRST_NAME,
LAST_NAME,
HIRE_DATE,
JOB_ID
from (
select EMPLOYEE_ID,
DEPARTMENT_ID,
FIRST_NAME,
LAST_NAME,
HIRE_DATE,
JOB_ID,
ROW_NUMBER() over (PARTITION by department_id order by hire_date desc) AS rn
from hr.EMPLOYEES
)
where rn = 1 ;
You still need to address the second part of the question:
In case of ties, use the lowest employee ID.
However, since this appears to be a homework question, I'll leave that for you to solve.
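The aliasing pattern from the answer can be tried on a toy table. This is a sketch only: it uses an in-memory SQLite database (3.25 or later is needed for window functions) instead of Oracle, the sample rows are invented and contain no ties, and the ranked alias is hypothetical. The tie-breaking rule is deliberately left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, department_id INTEGER, hire_date TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        (100, 10, "2019-02-01"),
        (101, 10, "2021-06-15"),
        (102, 20, "2018-09-30"),
        (103, 20, "2020-11-05"),
        (104, 20, "2017-01-20"),
    ],
)

# The inner query exposes every column the outer query needs, plus the
# ROW_NUMBER value under the alias rn so the outer WHERE can reference it.
latest = conn.execute(
    """
    SELECT employee_id, department_id, hire_date
    FROM (
        SELECT employee_id, department_id, hire_date,
               ROW_NUMBER() OVER (
                   PARTITION BY department_id
                   ORDER BY hire_date DESC
               ) AS rn
        FROM employees
    ) AS ranked
    WHERE rn = 1
    ORDER BY department_id
    """
).fetchall()

for row in latest:
    print(row)
```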

Calculate salary difference between two rows in HIVE

I have a table with the following columns:
last_name, first_name, department, salary
I want to list the employees whose salary is less than 100 below that of the employee with the next-higher salary in the same department. I looked at the answer to "Compute differences between succesive records in Hadoop with Hive Queries" and tried it, but I think I am doing something wrong, as I am new to Hive.
Below is the query I am running:
select last_name,first_name, salary from emp where
100 = LEAD(salary,1) OVER(PARTITION BY department ORDER BY salary)-salary;
Please help me with the solution.
Use a case expression.
SELECT last_name,
first_name,
salary
FROM (SELECT last_name,
first_name,
salary,
CASE
WHEN 100 > LEAD(salary, 1)
OVER(
PARTITION BY department
ORDER BY salary) - salary THEN 1
ELSE 0
END sal_flag
FROM emp)
WHERE sal_flag = 1;
Hive requires every subquery to be given an alias. I have just added an alias to Kaushik's query. Try this; it should work.
SELECT last_name,
first_name,
salary
FROM (SELECT last_name,
first_name,
salary,
CASE
WHEN 100 > LEAD(salary, 1)
OVER(
PARTITION BY department
ORDER BY salary) - salary THEN 1
ELSE 0
END sal_flag
FROM employee) v
WHERE sal_flag = 1;
I personally prefer a WITH clause to a subquery, as below. WITH clauses make the query more readable. Depending on the engine, they may also lead to a better execution plan.
WITH sal_view
AS (SELECT last_name,
first_name,
salary,
CASE
WHEN 100 > LEAD(salary, 1)
OVER(
PARTITION BY department
ORDER BY salary) - salary THEN 1
ELSE 0
END sal_flag
FROM employee)
SELECT last_name,
first_name,
salary
FROM sal_view
WHERE sal_flag = 1;
Try:
with temp as(
select last_name,
first_name,
department,
salary,
LEAD(salary, 1)
OVER( PARTITION BY department
ORDER BY salary) - salary as diff
FROM emp
)
select last_name,
first_name,
department,
salary
from temp
where diff < 100
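The LEAD-based difference used in these answers can be checked on a small data set. The sketch below runs against an in-memory SQLite database (3.25 or later) instead of Hive, so syntax details may differ; the sample rows and the v and diff names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp (last_name TEXT, first_name TEXT, department TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?, ?)",
    [
        ("Adams", "Al", "IT", 1000),
        ("Brown", "Bea", "IT", 1050),
        ("Clark", "Cy", "IT", 1300),
        ("Diaz", "Di", "HR", 900),
        ("Evans", "Ed", "HR", 980),
    ],
)

# A window function cannot be used directly in WHERE, so the difference to the
# next-higher salary is computed in a subquery and filtered in the outer query.
close_to_next = conn.execute(
    """
    SELECT last_name, first_name, salary
    FROM (
        SELECT last_name, first_name, salary,
               LEAD(salary, 1) OVER (
                   PARTITION BY department
                   ORDER BY salary
               ) - salary AS diff
        FROM emp
    ) v
    WHERE diff < 100
    ORDER BY last_name
    """
).fetchall()

for row in close_to_next:
    print(row)
```

The top earner in each department has no following row, so LEAD returns NULL there and the comparison filters that row out.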

Retrieving rows randomly in pl/sql query

I have a table (t1). I know how to retrieve a random percentage of its rows.
What I want is to insert a randomly selected 30% of the rows into t2, and insert the remaining 70% into table t3.
Is there any way other than inserting 30% into table t2, then comparing t2 with t1 and inserting the rest into t3? That method is not good for me since the table is huge.
P.S. Oracle version: 11g
Look into ora_hash. Generate a hash from the table's PK (or some similar column combination) with a max bucket of 9, which yields buckets 0 through 9. Rows with a bucket of 0-6 go in one table, and rows with 7, 8 or 9 go in the other. The split is approximately, not exactly, 70/30.
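The bucket idea can be illustrated outside the database. The sketch below is not ORA_HASH: it substitutes zlib.crc32 as a stand-in deterministic hash, and the bucket helper, the integer keys and the t2_ids/t3_ids lists are all hypothetical. It only shows how ten buckets split a key set into roughly 30% and 70% in a single pass.

```python
import zlib


def bucket(pk: int, buckets: int = 10) -> int:
    """Map a key to a bucket 0..buckets-1 (stand-in hash, NOT Oracle's ORA_HASH)."""
    return zlib.crc32(str(pk).encode("ascii")) % buckets


t2_ids = []  # buckets 7, 8, 9: roughly 30% of the keys
t3_ids = []  # buckets 0 through 6: roughly 70% of the keys

# One pass over the keys; each key is assigned to exactly one target list.
for pk in range(1, 10001):
    if bucket(pk) >= 7:
        t2_ids.append(pk)
    else:
        t3_ids.append(pk)

share = len(t2_ids) / 10000
print(f"t2 share: {share:.3f}")
```

Because the hash is deterministic, the same key always lands in the same bucket, so no comparison between the two target tables is needed.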
Would an INSERT ALL work? Here is one I did with the HR employees table. I ordered the rows randomly and took 30 percent of them; those rows got an indicator of 1. I then did a UNION ALL with the whole table, giving those rows an indicator of 0, and took the MAX of the indicator for each row. Finally, the INSERT ALL puts rows with an indicator of 1 into the first table and the remaining 70% into the second.
INSERT ALL
WHEN (table_one_ind = 1) THEN
INTO table_one
(
employee_id,
first_name,
last_name,
email,
hire_date,
job_id
)
VALUES
(
employee_id,
first_name,
last_name,
email,
hire_date,
job_id
)
ELSE
INTO table_two
(
employee_id,
first_name,
last_name,
email,
hire_date,
job_id
)
VALUES
(
employee_id,
first_name,
last_name,
email,
hire_date,
job_id
)
SELECT MAX (table_one_ind) table_one_ind,
employee_id,
first_name,
last_name,
email,
hire_date,
job_id
FROM
(SELECT t.*,
1 AS table_one_ind
FROM
( SELECT * FROM employees ORDER BY dbms_random.value
) t
WHERE rownum <=
( SELECT ceil(COUNT(*)*.3) FROM employees
)
UNION ALL
SELECT t.*, 0 FROM employees t
)
GROUP BY employee_id,
first_name,
last_name,
email,
hire_date,
job_id

How to use distinct in third column

I want to run the query
select first_name, last_name, distinct salary from employees
But it throws an error. If I instead use select distinct salary, first_name, last_name from employees, it runs.
I want the output with first_name as the first column, then last_name, then the distinct salary.
Try this:
SELECT Salary, First_Name, Last_Name
FROM table_name
GROUP BY Salary
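The above should return one row per salary, together with a first_name and last_name for it. Note that it only runs on engines that accept non-aggregated columns alongside GROUP BY (for example MySQL with ONLY_FULL_GROUP_BY disabled); there, the name shown for each salary is an arbitrary one from the group.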
If your data set contains duplicate rows you may want to do this to get rid of duplicate rows:
WITH salaries
AS ( SELECT DISTINCT Salary,
First_Name,
Last_Name
FROM table_name )
SELECT Salary,
First_Name,
Last_Name
FROM salaries
GROUP BY Salary;
Here's the trick: first insert into a temp table while applying DISTINCT, then select the data from the temp table with your desired arrangement of columns.
select distinct salary, first_name, last_name into #temp from employees
After the DISTINCT, you can do what you want in a second query without the DISTINCT.
select first_name, last_name, salary from #temp
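It may help to remember that DISTINCT applies to the whole select list, not to a single column, so the columns after it can be listed in any order. A minimal sketch on an in-memory SQLite database, with invented sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (first_name TEXT, last_name TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ann", "Lee", 5000),
        ("Ann", "Lee", 5000),  # exact duplicate row
        ("Bob", "Ray", 5000),  # same salary, different person
        ("Cy", "Fox", 6000),
    ],
)

# DISTINCT goes right after SELECT and de-duplicates whole rows;
# the column order after it is free.
rows = conn.execute(
    """
    SELECT DISTINCT first_name, last_name, salary
    FROM employees
    ORDER BY first_name
    """
).fetchall()

for row in rows:
    print(row)
```

The exact duplicate of Ann Lee is removed, but the salary 5000 still appears twice because the two rows differ in the name columns.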

SQL Server query with multiple aggregate columns

I need to write a query in SQL Server to get data like this.
Essentially it is a group by dept, race, gender and then
SUM(employees_of_race_by_gender), SUM(employees_Of_Dept).
I could get the data for the first four columns, but getting the sum of employees in that dept is proving difficult.
Could you please help me write the query?
All these details are in the same table, Emp. The columns of Emp are Emp_Number, Race_Name, Gender, Dept.
Your "num_of_emp_in_race" is actually by Gender too
SELECT DISTINCT
Dept,
Race_name,
Gender,
COUNT(*) OVER (PARTITION BY Dept, Race_name, Gender) AS num_of_emp_in_race,
COUNT(*) OVER (PARTITION BY Dept) AS num_of_emp_dept
FROM
MyTable
You should probably have this
COUNT(*) OVER (PARTITION BY Dept, Gender) AS PerDeptGender,
COUNT(*) OVER (PARTITION BY Dept, Race_name) AS PerDeptRace,
COUNT(*) OVER (PARTITION BY Dept, Race_name, Gender) AS PerDeptRaceGender,
COUNT(*) OVER (PARTITION BY Dept) AS PerDept
Edit: the DISTINCT appears to be applied before the COUNT (which would be odd based on this), so try this instead:
SELECT DISTINCT
*
FROM
(
SELECT
Dept,
Race_name,
Gender,
COUNT(*) OVER (PARTITION BY Dept, Race_name, Gender) AS num_of_emp_in_race,
COUNT(*) OVER (PARTITION BY Dept) AS num_of_emp_dept
FROM
MyTable
) foo
Since the two sums you're looking for are based on different aggregations, you need to calculate them separately and join the results. In such cases I first build the selects to show me the different results, which makes it easy to catch errors early:
SELECT Dept, Gender, Race_Name, COUNT(*) as num_of_emp_in_race
FROM Emp
GROUP BY Dept, Gender, Race_Name;
SELECT Dept, COUNT(*) as num_of_emp_in_dept
FROM Emp
GROUP BY Dept;
Afterwards, joining those two is pretty straightforward:
SELECT *
FROM ( first statement here ) as by_race
JOIN ( second statement here ) as by_dept ON (by_race.Dept = by_dept.Dept)
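Both approaches from the answers can be compared on a toy table. The sketch below uses an in-memory SQLite database (3.25 or later) instead of SQL Server, with invented sample rows, and checks that the join of two aggregates and the windowed COUNT(*) version return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Emp (Emp_Number INTEGER, Race_Name TEXT, Gender TEXT, Dept TEXT)"
)
conn.executemany(
    "INSERT INTO Emp VALUES (?, ?, ?, ?)",
    [
        (1, "A", "F", "Sales"),
        (2, "A", "F", "Sales"),
        (3, "A", "M", "Sales"),
        (4, "B", "F", "Sales"),
        (5, "A", "M", "IT"),
        (6, "B", "M", "IT"),
    ],
)

# Approach 1: aggregate at each level separately, then join on Dept.
joined = conn.execute(
    """
    SELECT by_race.Dept, by_race.Race_Name, by_race.Gender,
           by_race.num_of_emp_in_race, by_dept.num_of_emp_in_dept
    FROM (SELECT Dept, Race_Name, Gender, COUNT(*) AS num_of_emp_in_race
          FROM Emp
          GROUP BY Dept, Race_Name, Gender) AS by_race
    JOIN (SELECT Dept, COUNT(*) AS num_of_emp_in_dept
          FROM Emp
          GROUP BY Dept) AS by_dept
      ON by_race.Dept = by_dept.Dept
    ORDER BY by_race.Dept, by_race.Race_Name, by_race.Gender
    """
).fetchall()

# Approach 2: window counts over two different partitions, de-duplicated.
windowed = conn.execute(
    """
    SELECT DISTINCT Dept, Race_Name, Gender,
           COUNT(*) OVER (PARTITION BY Dept, Race_Name, Gender) AS num_of_emp_in_race,
           COUNT(*) OVER (PARTITION BY Dept) AS num_of_emp_in_dept
    FROM Emp
    ORDER BY Dept, Race_Name, Gender
    """
).fetchall()

for row in joined:
    print(row)
```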