I am new to Spark and have a question about Spark SQL. Consider this EMPLOYEE table:
Employee Sub_department Department
A 105182 10
A 105182 10 (data can be redundant !)
A 114256 11
A 127855 12
A 125182 12
B 136234 13
B 133468 13
Department is defined as substring(sub_department, 0, 2) to extract only the first 2 digits of the sub_department.
What I want is to identify 3 sets of employees:
Set 1 : Employees having at least 3 different departments (regardless of their sub_departments)
Set 2 : Employees having at least 5 different sub_departments AND 2 different departments
Set 3 : Employees having at least 10 different sub_departments with the same department
Concretely I have no idea how to do this even in classical SQL. But at least, I think the final output can be something like this :
Employee Sub_department total_sub_dept Department total_dept
A 105182 4 10 3
A 114256 4 11 3
A 127855 4 12 3
A 125182 4 12 3
And "eventually" a column named "Set" to show in which set an employee can belong to, but it's optional and I'm scared it will be too heavy to compute such a value...
It's important to display the different values AND the count for each of the 2 columns (the sub_department and the department).
I have a very big table (with many columns and a lot of possibly redundant data), so I thought of doing this with a first partition on the sub_department, stored in a first table, then a second partition on the department (regardless of the "sub_department" value), stored in a second table, and finally an inner join between the two tables on the employee name.
But I got some wrong results, and I don't know if there is a better way to do this, or at least an optimisation, since the department column depends on the sub_department (one GROUP BY rather than two).
So, how can I fix this? I tried, but it seems impossible to display a column together with count(column) for each of the 2 columns.
I'll help you out with the requirement for set 1 to get you started. Try to understand the query below; once you do, set 2 and set 3 follow the same pattern.
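SELECT
    employee,
    total_dept
FROM
(
    SELECT
        employee,
        COUNT(Department) AS total_dept
    FROM
    (
        -- number the rows inside each (employee, department) pair so duplicates can be dropped
        select
            employee,
            Sub_department,
            SUBSTRING(Sub_department,0,2) AS Department,
            -- Spark requires an ORDER BY in the window of ROW_NUMBER()
            ROW_NUMBER() OVER (partition by employee,SUBSTRING(Sub_department,0,2) order by Sub_department) AS redundancy
        FROM
            employee_table
    ) dept_rows
    WHERE redundancy = 1
    GROUP BY employee
) dept_counts
WHERE total_dept >= 3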
EDIT1:
SELECT
    full_data.employee,
    full_data.sub_department,
    total_sub_dept_count.total_sub_dept,
    SUBSTRING(full_data.Sub_department,0,2) AS Department,
    total_dept_count.total_dept
FROM
(
    -- distinct departments per employee
    SELECT
        employee,
        COUNT(Department) AS total_dept
    FROM
    (
        select
            employee,
            Sub_department,
            SUBSTRING(Sub_department,0,2) AS Department,
            ROW_NUMBER() OVER (partition by employee,SUBSTRING(Sub_department,0,2) order by Sub_department) AS redundancy
        FROM
            employee_table
    ) dept_rows
    WHERE redundancy = 1
    GROUP BY employee
) total_dept_count
JOIN
(
    -- distinct sub_departments per employee
    SELECT
        employee,
        COUNT(Sub_department) AS total_sub_dept
    FROM
    (
        select
            employee,
            Sub_department,
            ROW_NUMBER() OVER (partition by employee,Sub_department order by Sub_department) AS redundancy
        FROM
            employee_table
    ) sub_dept_rows
    WHERE redundancy = 1
    GROUP BY employee
) total_sub_dept_count
ON(total_dept_count.employee = total_sub_dept_count.employee)
JOIN
    employee_table full_data
ON(total_sub_dept_count.employee = full_data.employee)
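Since the department is derived from the sub_department, both counts can probably come from a single GROUP BY with COUNT(DISTINCT ...) instead of two ROW_NUMBER() passes. The following is only a minimal sketch of that idea, not the method used above. It runs the SQL through Python's sqlite3 module purely so it can be executed without a cluster, and it writes substr(..., 1, 2) where Spark would use SUBSTRING(..., 0, 2). The table and column names are taken from the question; that the same statement behaves identically in Spark SQL is an assumption.

```python
import sqlite3

# In-memory copy of the sample EMPLOYEE rows from the question (duplicate row included).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee_table (employee TEXT, sub_department TEXT)")
conn.executemany(
    "INSERT INTO employee_table VALUES (?, ?)",
    [("A", "105182"), ("A", "105182"), ("A", "114256"), ("A", "127855"),
     ("A", "125182"), ("B", "136234"), ("B", "133468")],
)

# One GROUP BY yields both distinct counts; joining back keeps the detail rows.
# SELECT DISTINCT removes the redundant (employee, sub_department) duplicates.
query = """
SELECT DISTINCT
    d.employee,
    d.sub_department,
    c.total_sub_dept,
    substr(d.sub_department, 1, 2) AS department,
    c.total_dept
FROM employee_table d
JOIN (
    SELECT employee,
           COUNT(DISTINCT sub_department)               AS total_sub_dept,
           COUNT(DISTINCT substr(sub_department, 1, 2)) AS total_dept
    FROM employee_table
    GROUP BY employee
) c ON c.employee = d.employee
ORDER BY d.employee, d.sub_department
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Filtering for set 1 would then be a matter of adding WHERE c.total_dept >= 3 to the outer query.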
You can use collect_set() as a window function and take its size() to get the distinct counts. Check this out:
scala> val df = Seq(("A","105182","10"), ("A","105182","10" ), ("A","114256","11"), ("A","127855","12"), ("A","125182","12"), ("B","136234","13"), ("B","133468","13")).toDF("emp","subdept","dept")
df: org.apache.spark.sql.DataFrame = [emp: string, subdept: string ... 1 more field]
scala> df.printSchema
root
|-- emp: string (nullable = true)
|-- subdept: string (nullable = true)
|-- dept: string (nullable = true)
scala> df.show
+---+-------+----+
|emp|subdept|dept|
+---+-------+----+
| A| 105182| 10|
| A| 105182| 10|
| A| 114256| 11|
| A| 127855| 12|
| A| 125182| 12|
| B| 136234| 13|
| B| 133468| 13|
+---+-------+----+
scala> val df2 = df.withColumn("dept2",substring('subdept,3,7))
df2: org.apache.spark.sql.DataFrame = [emp: string, subdept: string ... 2 more fields]
scala> df2.createOrReplaceTempView("salaman")
scala> spark.sql(""" select *, size(collect_set(subdept) over(partition by emp)) sub_dep_count, size(collect_set(dept) over(partition by emp)) dep_count from salaman """).show(false)
+---+-------+----+-----+-------------+---------+
|emp|subdept|dept|dept2|sub_dep_count|dep_count|
+---+-------+----+-----+-------------+---------+
|B |136234 |13 |6234 |2 |1 |
|B |133468 |13 |3468 |2 |1 |
|A |105182 |10 |5182 |4 |3 |
|A |105182 |10 |5182 |4 |3 |
|A |125182 |12 |5182 |4 |3 |
|A |114256 |11 |4256 |4 |3 |
|A |127855 |12 |7855 |4 |3 |
+---+-------+----+-----+-------------+---------+
scala>
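The optional "Set" column from the question could be derived from the same kind of counts. Below is a hedged sketch of one way to do it, using Python's sqlite3 module as a stand-in for Spark so that it runs anywhere. The thresholds come from the question, but several things are assumptions: "2 different departments" is read as "at least 2", an employee who qualifies for more than one set gets the first matching one, and the column name emp_set is made up for the example. Because each sub_department belongs to exactly one department, summing the per-department distinct counts gives the employee's total distinct sub_departments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee_table (employee TEXT, sub_department TEXT)")
conn.executemany(
    "INSERT INTO employee_table VALUES (?, ?)",
    [("A", "105182"), ("A", "105182"), ("A", "114256"), ("A", "127855"),
     ("A", "125182"), ("B", "136234"), ("B", "133468")],
)

query = """
WITH per_dept AS (
    -- distinct sub_departments inside each (employee, department) pair
    SELECT employee,
           substr(sub_department, 1, 2)   AS department,
           COUNT(DISTINCT sub_department) AS subs_in_dept
    FROM employee_table
    GROUP BY employee, substr(sub_department, 1, 2)
),
per_emp AS (
    SELECT employee,
           COUNT(*)          AS total_dept,           -- one per_dept row per department
           SUM(subs_in_dept) AS total_sub_dept,       -- valid because dept is derived from sub_dept
           MAX(subs_in_dept) AS max_subs_in_one_dept  -- needed for set 3
    FROM per_dept
    GROUP BY employee
)
SELECT employee, total_dept, total_sub_dept,
       CASE WHEN total_dept >= 3                          THEN 'Set 1'
            WHEN total_sub_dept >= 5 AND total_dept >= 2  THEN 'Set 2'
            WHEN max_subs_in_one_dept >= 10               THEN 'Set 3'
       END AS emp_set   -- NULL when the employee is in none of the sets
FROM per_emp
ORDER BY employee
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With the sample data, A lands in set 1 and B in none. The classification works on one row per employee, so it should be cheap compared with the scan that produces the counts.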
I have a table structured like this where I need to get the ID's last number, how many people's ID ends with that number, and the person with the highest ID:
Members: |ID |Name |
-----------------
|123 |foo |
|456 |bar |
|789 |boo |
|1226|far |
The result I need to get looks something like this
|LAST_NUMBER |OCCURENCES |HIGHEST_ID_GUY |
---------------------------------------------
|3 |1 |foo |
|6 |2 |far |
|9 |1 |boo |
However, while I can get the first two results to display correctly, I have no idea how to display HIGHEST_ID_GUY. My code looks like this:
SELECT DISTINCT SUBSTR(id, LENGTH(id - 1), LENGTH(id)) AS LAST_NUMBER,
COUNT(*) AS OCCURENCES
/* This is where I need to add HIGHEST_ID_GUY */
FROM Members
GROUP BY SUBSTR(id, LENGTH(id - 1), LENGTH(id))
ORDER BY LAST_NUMBER
Any help appreciated :)
If id is a number, then use arithmetic operations:
select mod(id, 10) as last_digit,
count(*),
max(name) keep (dense_rank first order by id desc) as name_at_biggest
from t
group by mod(id, 10);
If id is a string, then you need to convert to a number or something similar to define the "highest id". For instance:
select substr(id, -1) as last_digit,
count(*),
max(name) keep (dense_rank first order by to_number(id) desc) as name_at_biggest
from t
group by substr(id, -1);
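The KEEP (DENSE_RANK FIRST ...) clause is Oracle syntax. On engines without it, one possible alternative is to rank the rows inside each last-digit group and keep the top one. The sketch below shows that variant against the question's sample data using Python's sqlite3 module; it assumes ID is stored as an integer and that the SQLite build supports window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Members (ID INTEGER, Name TEXT)")
conn.executemany(
    "INSERT INTO Members VALUES (?, ?)",
    [(123, "foo"), (456, "bar"), (789, "boo"), (1226, "far")],
)

query = """
SELECT last_number, occurrences, name AS highest_id_guy
FROM (
    SELECT ID % 10                                                   AS last_number,
           COUNT(*)     OVER (PARTITION BY ID % 10)                  AS occurrences,
           ROW_NUMBER() OVER (PARTITION BY ID % 10 ORDER BY ID DESC) AS rn,
           Name                                                      AS name
    FROM Members
)
WHERE rn = 1          -- the row with the highest ID in each last-digit group
ORDER BY last_number
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```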
I'm trying to write a SQL Select query that uses the DIFFERENCE() function to find similar names in a database to identify duplicates.
The short version of the code I'm using is:
SELECT *, DIFFERENCE(FirstName, LEAD(FirstName) OVER (ORDER BY SOUNDEX(FirstName))) d
WHERE d >= 3
The problem is my database has additional columns that include middle names and nicknames. So if I have a customer who has multiple names they go by, they might be in the database multiple times, and I need to compare a variety of columns against each other.
Sample Data:
+----+--------+--------+--------+--------+
|ID |First |Middle |AKA1 |AKA2 |
+----+--------+--------+--------+--------+
|1 |Sally |Ann |NULL |NULL |
|2 |Ann |NULL |NULL |NULL |
|3 |Sue |NULL |NULL |NULL |
|4 |Suzy |NULL |NULL |NULL |
|5 |Patricia|NULL |Trish |Patty |
|6 |Patty |NULL |Patricia|Trish |
|7 |Trish |NULL |Patty |Patricia|
+----+--------+--------+--------+--------+
In the above, rows 1+2 are duplicates of each other, as are 3+4, and 5+6+7.
So I'm not sure the best way to get what I want. Here's the longer version of the code I'm actually using:
WITH A AS (SELECT *,
SOUNDEX(FirstName) AS "FirstSoundex",
SOUNDEX(LastName) AS "LastSoundex",
LAG (SOUNDEX(FirstName)) OVER (ORDER BY SOUNDEX(FirstName)) AS "PreviousFirstSoundex",
LAG (SOUNDEX(LastName)) OVER (ORDER BY SOUNDEX(LastName)) AS "PreviousLastSoundex"
FROM Clients),
B AS (
SELECT *,
ISNULL(DIFFERENCE(FirstName, LEAD(FirstName) OVER (ORDER BY FirstSoundex)),0) AS "FirstScore",
ISNULL(DIFFERENCE(LastName, LEAD(LastName) OVER (ORDER BY LastSoundex)),0) AS "LastScore"
FROM A),
C AS (
SELECT *,
ISNULL(LAG (FirstScore) OVER (ORDER BY FirstSoundex),0) AS "PreviousFirstScore",
ISNULL(LAG (LastScore) OVER (ORDER BY LastSoundex),0) AS "PreviousLastScore"
FROM B
),
D AS (
SELECT *,
(CASE WHEN (PreviousFirstScore >=3 AND PreviousLastScore >=3) THEN (PreviousFirstSoundex + PreviousLastSoundex)
WHEN (FirstScore >= 3 AND LastScore >=3) THEN (FirstSoundex + LastSoundex)
END) AS "GroupName"
FROM C
WHERE ((PreviousFirstScore >=3 AND PreviousLastScore >=3) OR (FirstScore >= 3 AND LastScore >=3))
)
SELECT *,
LAG(GroupName) OVER (ORDER BY GroupName) AS "PreviousGroup",
LEAD(GroupName) OVER (ORDER BY GroupName) AS "NextGroup"
FROM D
WHERE (D.GroupName = D.PreviousGroup OR D.GroupName = D.NextGroup)
This lets me group together bundles of potential duplicates and it works well for me. However, I now want to add in a way to check against multiple columns, and I don't know how to do that.
I was thinking about creating a union, something like:
SELECT ClientID,
LastName,
FirstName AS "TempName"
FROM Clients
UNION
SELECT ClientID,
LastName,
MiddleName AS "TempName"
FROM Clients
WHERE MiddleName IS NOT NULL
...etc
But then my LAG() and LEAD() wouldn't work because I'd have multiple rows with the same ClientID. I don't want to identify a single Client as a duplicate of itself.
Anyways, any suggestions? Thanks in advance.
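One way the UNION idea might be made to work is to replace LAG()/LEAD() with a self-join on the unpivoted names, using a.ClientID < b.ClientID so that a client is never paired with itself and each pair appears once. The sketch below illustrates only that structure. It uses Python's sqlite3 module and a plain case-insensitive equality as a stand-in for the similarity test, because SOUNDEX/DIFFERENCE are not available in a default SQLite build. On SQL Server the join condition would presumably become DIFFERENCE(a.TempName, b.TempName) >= 3. The column names AKA1 and AKA2 are taken from the sample data; LastName is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Clients (ClientID INTEGER, FirstName TEXT, MiddleName TEXT, AKA1 TEXT, AKA2 TEXT)"
)
conn.executemany("INSERT INTO Clients VALUES (?, ?, ?, ?, ?)", [
    (1, "Sally", "Ann", None, None),
    (2, "Ann", None, None, None),
    (3, "Sue", None, None, None),
    (4, "Suzy", None, None, None),
    (5, "Patricia", None, "Trish", "Patty"),
    (6, "Patty", None, "Patricia", "Trish"),
    (7, "Trish", None, "Patty", "Patricia"),
])

query = """
WITH names AS (
    -- one row per (client, name the client goes by)
    SELECT ClientID, FirstName AS TempName FROM Clients
    UNION
    SELECT ClientID, MiddleName FROM Clients WHERE MiddleName IS NOT NULL
    UNION
    SELECT ClientID, AKA1 FROM Clients WHERE AKA1 IS NOT NULL
    UNION
    SELECT ClientID, AKA2 FROM Clients WHERE AKA2 IS NOT NULL
)
SELECT DISTINCT a.ClientID AS id_a, b.ClientID AS id_b
FROM names a
JOIN names b
  ON a.ClientID < b.ClientID                   -- never compare a client with itself
 AND lower(a.TempName) = lower(b.TempName)     -- placeholder for a fuzzy comparison
ORDER BY id_a, id_b
"""
pairs = conn.execute(query).fetchall()
for pair in pairs:
    print(pair)
```

With exact matching, rows 1+2 and 5+6+7 are paired, but 3+4 (Sue/Suzy) are not; catching those needs the phonetic comparison. Note that a self-join compares every name with every other name, which may be expensive on a large table.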
I have a table as follows:
+---+---+---+
|obj|col|Val|
+---+---+---+
|1 |c1 | v1|
+---+---+---+
|1 |c2 | v2|
+---+---+---+
|2 |c1 | v3|
+---+---+---+
|2 |c2 | v4|
+---+---+---+
And I am looking for SQL that will give the result in the following format
+---+---+---+
|obj|c1 |c2 |
+---+---+---+
|1 |v1 | v2|
+---+---+---+
|2 |v3 | v4|
+---+---+---+
In this SQL, each CASE checks for col = 'c?' and returns the corresponding Val, or NULL when the condition doesn't match. The GROUP BY is there to get rid of those NULLs: grouping on obj and taking MAX collapses each obj into a single row that holds only the non-NULL values, which is the desired result.
SELECT obj,
MAX( CASE WHEN col = 'c1' THEN Val END ) AS c1,
MAX( CASE WHEN col = 'c2' THEN Val END ) AS c2
FROM Table
GROUP BY obj;
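To see why the grouping matters, it can help to look at the CASE expressions before and after aggregation. The sketch below runs both against the sample data with Python's sqlite3 module. The table name obj_values is hypothetical (the question does not name the table, and Table itself is a reserved word in most dialects).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obj_values (obj INTEGER, col TEXT, Val TEXT)")
conn.executemany(
    "INSERT INTO obj_values VALUES (?, ?, ?)",
    [(1, "c1", "v1"), (1, "c2", "v2"), (2, "c1", "v3"), (2, "c2", "v4")],
)

# Without GROUP BY: one output row per input row, NULL wherever the CASE does not match.
ungrouped = conn.execute("""
    SELECT obj,
           CASE WHEN col = 'c1' THEN Val END AS c1,
           CASE WHEN col = 'c2' THEN Val END AS c2
    FROM obj_values
    ORDER BY obj, col
""").fetchall()

# With GROUP BY: MAX ignores the NULLs, leaving one row per obj.
pivoted = conn.execute("""
    SELECT obj,
           MAX(CASE WHEN col = 'c1' THEN Val END) AS c1,
           MAX(CASE WHEN col = 'c2' THEN Val END) AS c2
    FROM obj_values
    GROUP BY obj
    ORDER BY obj
""").fetchall()

print(ungrouped)
print(pivoted)
```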
First you need to select all the unique obj values from your table:
select distinct obj
from a_table_you_did_not_name
Now you can use that as a base and left join to it once per column:
select base.obj, one.val as c1, two.val as c2
from (
select distinct obj
from a_table_you_did_not_name
) base
left join a_table_you_did_not_name one on one.obj = base.obj and one.col = 'c1'
left join a_table_you_did_not_name two on two.obj = base.obj and two.col = 'c2'
Note: your case is a relatively simple instance of this kind of join. I coded it like this because the method extends to more complicated cases and still works. There are other ways that might be simpler for this particular requirement.
In particular, the most common case is joining to multiple tables rather than to the same table several times. My method still works in those cases.
I have data in a table. There are 3 columns (ID, Interval, ContactInfo). This table lists all phone contacts. I'm attempting to get a count of phone numbers that called twice on the same day and have no idea how to go about this. I can get duplicate entries for the same number but it does not match on date. The code I have so far is below.
SELECT ContactInfo, COUNT(Interval) AS NumCalls
FROM AllCalls
GROUP BY ContactInfo
HAVING COUNT(AllCalls.ContactInfo) > 1
I'd like to have it return the date, the number of calls on that date if more than 1, and the phone number.
Sample data:
|ID |Interval |ContactInfo|
|--------|------------|-----------|
|1 |3/1/2017 |8009999999 |
|2 |3/1/2017 |8009999999 |
|3 |3/2/2017 |8001234567 |
|4 |3/2/2017 |8009999999 |
|5 |3/3/2017 |8007771111 |
|6 |3/3/2017 |8007771111 |
|--------|------------|-----------|
Expected result:
|Interval |ContactInfo|NumCalls|
|------------|-----------|--------|
|3/1/2017 |8009999999 |2 |
|3/3/2017 |8007771111 |2 |
|------------|-----------|--------|
Just as juergen d suggested, you should try to add Interval in your GROUP BY. Like so:
SELECT AC.ContactInfo
, AC.Interval
, COUNT(*) AS qnty
FROM AllCalls AS AC
GROUP BY AC.ContactInfo
, AC.Interval
HAVING COUNT(*) > 1
The code should look like this:
select Interval , ContactInfo, count(ID) AS NumCalls from AllCalls group by Interval, ContactInfo having count(ID)>1;
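As a quick check of the GROUP BY suggestion, here is a small sketch that runs it against the sample data using Python's sqlite3 module. It assumes Interval holds a plain date with no time component, as in the sample; if it carried a time of day, the grouping would need to be on the date part instead. The column name is quoted only as a precaution, since INTERVAL is a keyword in some dialects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE AllCalls (ID INTEGER, "Interval" TEXT, ContactInfo TEXT)')
conn.executemany(
    "INSERT INTO AllCalls VALUES (?, ?, ?)",
    [(1, "3/1/2017", "8009999999"), (2, "3/1/2017", "8009999999"),
     (3, "3/2/2017", "8001234567"), (4, "3/2/2017", "8009999999"),
     (5, "3/3/2017", "8007771111"), (6, "3/3/2017", "8007771111")],
)

# Grouping on both the date and the number counts calls per number per day.
query = """
SELECT "Interval", ContactInfo, COUNT(*) AS NumCalls
FROM AllCalls
GROUP BY "Interval", ContactInfo
HAVING COUNT(*) > 1
ORDER BY "Interval"
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```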
I have two tables with data like:
table: test_results
ID |test_id |test_type |result_1 |amps |volts |power |
----+-----------+-----------+-----------+-----------+-----------+-----------+
1 |101 |static |10.1 |5.9 |15 |59.1 |
2 |101 |dynamic |300.5 |9.1 |10 |40.1 |
3 |101 |prime |48.9 |8.2 |14 |49.2 |
4 |101 |dual |235.2 |2.9 |11 |25.8 |
5 |101 |static |11.9 |4.3 |9 |43.3 |
6 |101 |prime |49.9 |5.8 |15 |51.6 |
and
table: test_records
ID |model |test_date |operator |
----+-----------+-----------+-----------+
101 |m-300 |some_date |john doe |
102 |m-243 |some_date |john doe |
103 |m-007 |some_date |john doe |
104 |m-523 |some_date |john doe |
105 |m-842 |some_date |john doe |
106 |m-252 |some_date |john doe |
and I'm making a report that looks like this:
|static |dynamic |
test_id |model |test_date |operator |result_1 |amps |volts |power |result_1 |amps |volts |power |
-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
101 |m-300 |some_date |john doe |10.1 |5.9 |15 |59.1 |300.5 |9.1 |10 |40.1 |
with left outer joins like so:
SELECT
A.ID AS test_id, model, test_date, operator,
B.result_1, B.amps, B.volts, B.power,
C.result_1, C.amps, C.volts, C.power
FROM
test_records A
LEFT JOIN
test_results B
ON
A.ID = B.test_id
AND
B.test_type = 'static'
LEFT JOIN
test_results C
ON
A.ID = C.test_id
AND
C.test_type = 'dynamic'
But I have run into a problem. The "static" and "prime" tests are run twice.
I don't know how to differentiate between them to create their own 4 fields.
An abstracted(simplified) view of the desired report would look like:
|static |dynamic |prime |dual |static2 |prime2 |
|4 fields |4 fields |4 fields |4 fields |4 fields |4 fields |
Is this even possible?
Notes:
I'm labeling the groups of 4 fields with html so don't worry about the labels
Not every test will run "static" and "prime" twice. So this is a case of If ("static" and "prime") are found twice, do this SQL.
I think we're going to get our engineers to append a 2 to the second tests, eliminating the problem, so this question is more out of curiosity to know what method could solve a problem like this.
If you have another field (here I use ID) that you know always increases in the order the tests were run, you can use a window function to number the rows within each test type and then join on that number. Like this:
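WITH test_results_numbered AS
(
-- number repeated runs of the same test_type within a test, in ID order
SELECT test_id, test_type, result_1, amps, volts, power,
ROW_NUMBER() OVER (PARTITION BY test_id, test_type ORDER BY ID) as type_num
FROM test_results
)
SELECT
A.ID AS test_id, model, test_date, operator,
B.result_1, B.amps, B.volts, B.power,
C.result_1, C.amps, C.volts, C.power,
D.result_1, D.amps, D.volts, D.power
FROM test_records A
LEFT JOIN test_results_numbered B
ON A.ID = B.test_id AND B.test_type = 'static' and B.type_num = 1
LEFT JOIN test_results_numbered C
ON A.ID = C.test_id AND C.test_type = 'dynamic' and C.type_num = 1
-- second run of "static"; all NULL when the test was only run once
LEFT JOIN test_results_numbered D
ON A.ID = D.test_id AND D.test_type = 'static' and D.type_num = 2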
I use a CTE to make it clearer, but you could use sub-queries instead; you would (of course) have to repeat the same sub-query in the SQL. I expect most servers would have no issue optimizing that without the CTE.
I feel this solution is a bit of a "hack." You really want your original data to have all the information it needs. So I think it is good you are having your app developers modify their code (FWIW).
If this had to go into production, I think I would break out the numbering as a view to highlight the codification of questionable business rules (and to make it easy to change).
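For anyone who wants to try the numbering approach, below is a reduced sketch that runs on the sample data with Python's sqlite3 module (it assumes a SQLite build with window functions, 3.25 or later). To stay short it carries only result_1 and shows the first static, first dynamic and second static run. The CTE name numbered and the output aliases are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_records (ID INTEGER, model TEXT)")
conn.execute("CREATE TABLE test_results (ID INTEGER, test_id INTEGER, test_type TEXT, result_1 REAL)")
conn.executemany("INSERT INTO test_records VALUES (?, ?)", [(101, "m-300"), (102, "m-243")])
conn.executemany("INSERT INTO test_results VALUES (?, ?, ?, ?)", [
    (1, 101, "static", 10.1), (2, 101, "dynamic", 300.5), (3, 101, "prime", 48.9),
    (4, 101, "dual", 235.2), (5, 101, "static", 11.9), (6, 101, "prime", 49.9),
])

query = """
WITH numbered AS (
    -- 1 for the first run of a test_type within a test, 2 for the second, ...
    SELECT test_id, test_type, result_1,
           ROW_NUMBER() OVER (PARTITION BY test_id, test_type ORDER BY ID) AS type_num
    FROM test_results
)
SELECT a.ID, a.model,
       s1.result_1 AS static_1,
       d1.result_1 AS dynamic_1,
       s2.result_1 AS static_2
FROM test_records a
LEFT JOIN numbered s1 ON s1.test_id = a.ID AND s1.test_type = 'static'  AND s1.type_num = 1
LEFT JOIN numbered d1 ON d1.test_id = a.ID AND d1.test_type = 'dynamic' AND d1.type_num = 1
LEFT JOIN numbered s2 ON s2.test_id = a.ID AND s2.test_type = 'static'  AND s2.type_num = 2
ORDER BY a.ID
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Test 102 has no results in the sample, so all three of its result columns come back NULL, which is also what static_2 looks like for a test that ran "static" only once.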