Need help on sub-query of Spark-SQL Databricks - sql

I have below mentioned SQL and getting below mentioned dataset as result. But i want to display only one Open status record which has MIN date.
SELECT distinct o.svc_ord_nbr AS SVC_ORD_NBR,
o.svc_ord_stat_nm AS SVC_ORD_STAT_NM,
min(t.start_date_est) AS STRT_DT_EST, t.status_text
FROM A o inner join B t on t.ticket=o.notif_nbr
and o.svc_ord_nbr in ('021519_574819','110714_246149')
Group by o.svc_ord_nbr, o.svc_ord_stat_nm, t.status_text
The Result dataset looks like this:
I want only the first row which is having MIN of STRT_DT_EST.
Thanks in Advance...

Have you tried with window functions for this use case .
spark.sql(
“””
|SELECT a.*,
|ROW_NUMBER() OVER(PARTITION BY dept ORDER BY salary) as rn,
|RANK() OVER(PARTITION BY dept ORDER BY salary) as rank,
|DENSE_RANK() OVER(PARTITION BY dept ORDER BY salary) as dense_rank,
|PERCENT_RANK() OVER(PARTITION BY dept ORDER BY salary) as percent_rank,
|NTILE(3) OVER(PARTITION BY dept ORDER BY salary) as ntile
|FROM employee a
|”””.stripMargin).show(false)

Related

Why doesn't DISTINCT work in this case? (SQL)

SELECT DISTINCT
employees.departmentname,
employees.firstname,
employees.salary,
employees.departmentid
FROM employees
JOIN (
SELECT MAX(salary) AS Highest, departmentID
FROM employees
GROUP BY departmentID
) departments ON employees.departmentid = departments.departmentid
AND employees.salary = departments.highest;
Why doesn't the DISTINCT work here?
I'm trying to have each department to show only once because the question is asking the highest salary in each department.
Use the ROW_NUMBER() function, as in:
select departmentname, firstname, salary, departmentid
from (
select e.*,
row_number() over(partition by departmentid, order by salary desc) as rn
from employees e
) x
where rn = 1
I'm trying to have each department to show only once because the question is asking the highest salary in each department.
Use window functions:
SELECT e.*
FROM (SELECT e.*,
ROW_NUMBER() OVER (PARTITION BY departmentID ORDER BY salary DESC) as seqnum
FROM employees e
) e
WHERE seqnum = 1;
This is guaranteed to return one row per department, even when there are ties. If you want all rows when there are ties, use RANK() instead.
Why doesn't the DISTINCT work here?
DISTINCT is not a function; it is a keyword that will eliminate duplicate rows when ALL the column values are duplicates. It does NOT apply to a single column.
The DISTINCT keyword has "worked" (i.e. done what it is intended to do) because there are no rows where all the column values are a duplicate of another row's values.
However, it hasn't solved your problem because DISTINCT is not the correct solution to your problem. For that, you want to "fetch the row which has the max value for a column [within each group]" (as per this question).
Gwen, Elena and Paula all have the same salary
and they are in the same department

Finding department having maximum number of employee

I have a table employee
id name dept
1 bucky shp
2 name shp
3 other mrk
How can i get the name of the department(s) having maximum number of employees ? ..
I need result
dept
--------
shp
SELECT cnt,deptno FROM (
SELECT rank() OVER (ORDER BY cnt desc) AS rnk,cnt,deptno from
(SELECT COUNT(*) cnt, DEPTNO FROM EMP
GROUP BY deptno))
WHERE rnk = 1;
Assuming you are using SQL Server and each record representing an employee. So you can use window function to get the result
WITH C AS (
SELECT RANK() OVER (ORDER BY dept) Rnk
,name
,dept
FROM table
)
SELECT TOP 1 dept FROM
(SELECT COUNT(Rnk) cnt, dept FROM C GROUP BY dept) t
ORDER BY cnt DESC
With common table expressions, count the number of rows per department, then find the biggest count, then use that to select the biggest department.
WITH depts(dept, size) AS (
SELECT dept, COUNT(*) FROM employee GROUP BY dept
), biggest(size) AS (
SELECT MAX(size) FROM depts
)
SELECT dept FROM depts, biggest WHERE depts.size = biggest.size
Based on one of the answer, Let me try to explain step by step
First of all we need to get the employee count department wise. So the firstly innermost query will run
select count(*) cnt, deptno from scott.emp group by deptno
This will give result as
Now out of this we have to get the one which is having max. employee i.e. department 30.
Also please note there are chances that 2 departments have same number of employees
The second level of query is
select rank() over (order by cnt desc) as rnk,cnt,deptno from
(
select count(*) cnt, deptno from scott.emp group by deptno
)
Now we have assigned ranking to each department
Now to select rank 1 out of it. we have a simplest outer query
select * from
(
select rank() over (order by cnt desc) as rnk,cnt,deptno from
(
select count(*) cnt, deptno from scott.emp group by deptno
)
)
where rnk=1
So we have the final result where we got the department which has the maximum employees. If we want the minimum one we have to include the department table as there are chances there is a department which has no employees which will not get listed in this table
You can ignore the scott in scott.emp as that is the table owner.
The above SQL can be practised at Practise SQL online

Top 2 Salary Grouped By Department

Below is the table I am referring to.
I want to find ou the 2 Employees in each department with highest salary.
Further to the above answer, if there are ties (multiple employees sharing the same salary), you can use the following to bring them all through instead of just picking two at random (which is what the ROW_NUMBER clause will do)
SELECT *
FROM (
SELECT *, DENSE_RANK() OVER (PARTITION BY Dept ORDER BY Salary DESC) AS rn
FROM MyTable ) t
WHERE t.rn <= 2
Use ROW_NUMBER() to get the top salaries per Department, then select the first two records from each departmental partiton:
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Dept ORDER BY Salary DESC) AS rn
FROM MyTable ) t
WHERE t.rn <= 2

How to find out 2nd highest salary of employees?

Created table named geosalary with columns name, id, and salary:
name id salary
patrik 2 1000
frank 2 2000
chinmon 3 1300
paddy 3 1700
I tried this below code to find 2nd highest salary:
SELECT salary
FROM (SELECT salary, DENSE_RANK() OVER(ORDER BY SALARY) AS DENSE_RANK FROM geosalary)
WHERE DENSE_RANK = 2;
However, getting this error message:
ERROR: subquery in FROM must have an alias
SQL state: 42601
Hint: For example, FROM (SELECT ...) [AS] foo.
Character: 24
What's wrong with my code?
I think the error message is pretty clear: your sub-select needs an alias.
SELECT t.salary
FROM (
SELECT salary,
DENSE_RANK() OVER (ORDER BY SALARY DESC) AS DENSE_RANK
FROM geosalary
) as t --- this alias is missing
WHERE t.dense_rank = 2
The error message is pretty obvious: You need to supply an alias for the subquery.
Here is a simpler / faster alternative:
SELECT DISTINCT salary
FROM geosalary
ORDER BY salary DESC NULLS LAST
OFFSET 1
LIMIT 1;
This finds the "2nd highest salary" (1 row), as opposed to other queries that find all employees with the 2nd highest salary (1-n rows).
I added NULLS LAST, as NULL values typically shouldn't rank first for this purpose. See:
PostgreSQL sort by datetime asc, null first?
SELECT department_id, salary, RANK1 FROM (
SELECT department_id,
salary,
DENSE_RANK ()
OVER (PARTITION BY department_id ORDER BY SALARY DESC)
AS rank1
FROM employees) result
WHERE rank1 = 3
This above query will get you the 3rd highest salary in the individual department. If you want regardless of the department, then just remove PARTITION BY department_id
Your SQL engine doesn't know the "salary" column of which table you are using, that's why you need to use an alias to differentiate the two columns.
Try this:
SELECT salary
FROM (SELECT G.salary ,DENSE_RANK() OVER(ORDER BY G.SALARY) AS DENSE_RANK FROM geosalary G)
WHERE DENSE_RANK=2;
WITH salaries AS (SELECT salary, DENSE_RANK() OVER(ORDER BY SALARY) AS DENSE_RANK FROM geosalary)
SELECT * FROM salaries WHERE DENSE_RANK=2;
select level, max(salary)
from geosalary
where level=2
connect by
prior salary>salary
group by level;
In case of duplicates in salary column below query will give the right result:
WITH tmp_tbl AS
(SELECT salary,
DENSE_RANK() OVER (ORDER BY SALARY) AS DENSE_RANK
FROM geosalary
)
SELECT salary
FROM tmp_tbl
WHERE dense_rank =
(SELECT MAX(dense_rank)-1 FROM tmp_tbl
)
AND rownum=1;
Here is SQL standard
SELECT name, salary
FROM geosalary
ORDER BY salary desc
OFFSET 1 ROW
FETCH FIRST 1 ROW ONLY
To calculate nth highest salary change offset value
SELECT MAX(salary) FROM geosalary WHERE salary < ( SELECT MAX(salary) FROM geosalary )

SQl server query multiple aggregate columns

I need to write a query in sql server to data get like this.
Essentially it is group by dept, race, gender and then
SUM(employees_of_race_by_gender),Sum(employees_Of_Dept).
I could get data of first four columns, getting sum of employees in that dept is becoming difficult.
Could you pls help me in writing the query?
All these details in same table Emp. Columns of Emp are Emp_Number, Race_Name,Gender,Dept
Your "num_of_emp_in_race" is actually by Gender too
SELECT DISTINCT
Dept,
Race_name,
Gender,
COUNT(*) OVER (PARTITION BY Dept, Race_name, Gender) AS num_of_emp_in_race,
COUNT(*) OVER (PARTITION BY Dept) AS num_of_emp_dept
FROM
MyTable
You should probably have this
COUNT(*) OVER (PARTITION BY Dept, Gender) AS PerDeptRace
COUNT(*) OVER (PARTITION BY Dept, Race_name) AS PerDeptGender,
COUNT(*) OVER (PARTITION BY Dept, Race_name, Gender) AS PerDeptRaceGender,
COUNT(*) OVER (PARTITION BY Dept) AS PerDept
Edit: the DISTINCT appears to be applied before the COUNT (which would odd based on this) so try this instead
SELECT DISTINCT
*
FROM
(
SELECT
Dept,
Race_name,
Gender,
COUNT(*) OVER (PARTITION BY Dept, Race_name, Gender) AS num_of_emp_in_race,
COUNT(*) OVER (PARTITION BY Dept) AS num_of_emp_dept
FROM
MyTable
) foo
Since the two sums you're looking for are based on a different aggregation, you need to calculate them separately and join the result. In such cases I first build the selects to show me the different results, making it easy to catch errors early:
SELECT Dept, Gender, race_name, COUNT(*) as num_of_emp_in_race
FROM Emp
GROUP BY 1, 2, 3
SELECT Dept, COUNT(*) as num_of_emp_in_dept
FROM Emp
GROUP BY 1
Afterwards, joining those two is pretty straight forward:
SELECT *
FROM ( first statement here ) as by_race
JOIN ( second statement here ) as by_dept ON (by_race.Dept = by_dept.Dept)