Combining similar names and counting distinct from a single column - sql

I am attempting to combine rows with similar names within a single column. The data I am using has some clients listed with the same name:
| Client_Column | Address |
| XYZ ISD | 1 Drive |
| XYZ ISD | 2 Drive |
| XYZ ISD | 3 Drive |
However, there are some records where the same client has a distinct, but similar name with an alternate address:
| Client_Column | Address |
| ABC Inc. | 101 Lane |
| ABC Inc. - P1 | 102 Lane |
| ABC Inc. - P2 | 103 Lane |
I would like to count the total number of distinct addresses for each client in the table. So in this example, XYZ ISD and ABC Inc. would each have 3 unique addresses. How can I adjust my script so that I can achieve this result?
I think I need to use SUM to aggregate the total, but I am not sure how. I am able to count the addresses for clients with the exact same name with accuracy, but the ones with only similar names display a single record for AddressCount.
Here is the script:
SELECT ClientName, COUNT(Address) AS AddressCount
FROM tableA
GROUP BY ClientName
ORDER BY AddressCount
Result:
| Client_Column | AddressCount |
| Blue LLC | 15 |
| PSC Parts | 14 |
| ABC Inc. | 1 |
| ABC Inc. - P1 | 1 |
| ABC Inc. - P2 | 1 |

Related

SQL - joining 3 tables and choosing newest logged entry per id

I got rather complicated riddle to solve. So far I'm unlocky.
I got 3 tables which I need to join to get the result.
Most important is that I need highest h_id per p_id. h_id is uniqe entry in log history. And I need newest one for given point (p_id -> num).
Apart from that I need ext and name as well.
history
+----------------+---------+--------+
| h_id | p_id | str_id |
+----------------+---------+--------+
| 1 | 1 | 11 |
| 2 | 5 | 15 |
| 3 | 5 | 23 |
| 4 | 1 | 62 |
+----------------+---------+--------+
point
+----------------+---------+
| p_id | num |
+----------------+---------+
| 1 | 4564 |
| 5 | 3453 |
+----------------+---------+
street
+----------------+---------+-------------+
| str_id | ext | name |
+----------------+---------+-------------+
| 15 | | Mein st. 33 | - bad name
| 11 | | eck st. 42 | - bad name
| 62 | abc | Main st. 33 |
| 23 | efg | Back st. 42 |
+----------------+---------+-------------+
EXPECTED RESULT
+----------------+---------+-------------+-----+
| num | ext | name |h_id |
+----------------+---------+-------------+-----+
| 3453 | efg | Back st. 42 | 3 |
| 4564 | abc | Main st. 33 | 4 |
+----------------+---------+-------------+-----+
I'm using Oracle SQL. Tried using query below but result is not true.
SELECT num, max(name), max(ext), MAX(h_id) maxm FROM history
INNER JOIN street on street.str_id = history._str_id
INNER JOIN point on point.p_id = history.p_id
GROUP BY point.num
In Oracle, you can use keep:
SELECT p.num,
MAX(h.h_id) as maxm,
MAX(s.name) KEEP (DENSE_RANK FIRST ORDER BY h.h_id DESC) as name,
MAX(s.ext) KEEP (DENSE_RANK FIRST ORDER BY h.h_id DESC) as ext
FROM history h INNER JOIN
street s
ON s.str_id = h._str_id INNER JOIN
point p
ON p.p_id = h.p_id
GROUP BY p.num;
The keep syntax allows you to do "first()" and "last()" for aggregations.

SQL: Cascading conditions on Join

I have found a few similar questions to this on SO but nothing which applies to my situation.
I have a large dataset with hundreds of millions of rows in Table 1 and am looking for the most efficient way to run the following query. I am using Google BigQuery but I think this is a general SQL question applicable to any DBMS?
I need to apply an owner to every row in Table 1. I want to join in the following priority:
1: if item_id matches an identifier in Table 2
2: if no item_id matches try match on item_name
3: if no item_id or item_name matches try match on item_division
4: if no item_division matches, return null
Table 1 - Datapoints:
| id | item_id | item_name | item_division | units | revenue
|----|---------|-----------|---------------|-------|---------
| 1 | xyz | pen | UK | 10 | 100
| 2 | pqr | cat | US | 15 | 120
| 3 | asd | dog | US | 12 | 105
| 4 | xcv | hat | UK | 11 | 140
| 5 | bnm | cow | UK | 14 | 150
Table 2 - Identifiers:
| id | type | code | owner |
|----|---------|-----------|-------|
| 1 | id | xyz | bob |
| 2 | name | cat | dave |
| 3 | division| UK | alice |
| 4 | name | pen | erica |
| 5 | id | xcv | fred |
Desired output:
| id | item_id | item_name | item_division | units | revenue | owner |
|----|---------|-----------|---------------|-------|---------|-------|
| 1 | xyz | pen | UK | 10 | 100 | bob | <- id
| 2 | pqr | cat | US | 15 | 120 | dave | <- code
| 3 | asd | dog | US | 12 | 105 | null | <- none
| 4 | xcv | hat | UK | 11 | 140 | fred | <- id
| 5 | bnm | cow | UK | 14 | 150 | alice | <- division
My attempts so far have involved multiple joining the table onto itself and I fear it is becoming hugely inefficient.
Any help much appreciated.
Another option for BigQuery Standard SQL
#standardSQL
SELECT ARRAY_AGG(a)[OFFSET(0)].*,
ARRAY_AGG(owner
ORDER BY CASE
WHEN type = 'id' THEN 1
WHEN type = 'name' THEN 2
WHEN type = 'division' THEN 3
END
LIMIT 1
)[OFFSET(0)] owner
FROM Datapoints a
JOIN Identifiers b
ON (a.item_id = b.code AND b.type = 'id')
OR (a.item_name = b.code AND b.type = 'name')
OR (a.item_division = b.code AND b.type = 'division')
GROUP BY a.id
ORDER BY a.id
It leaves out entries which k=have no owners - like in below result (id=3 is out as it has no owner)
Row id item_id item_name item_division units revenue owner
1 1 xyz pen UK 10 100 bob
2 2 pqr cat US 15 120 dave
3 4 xcv hat UK 11 140 fred
4 5 bnm cow UK 14 150 alice
I am using the following query (thanks #Barmar) but want to know if there is a more efficient way in Google BigQuery:
SELECT a.*, COALESCE(b.owner,c.owner,d.owner) owner FROM datapoints a
LEFT JOIN identifiers b on a.item_id = b.code and b.type = 'id'
LEFT JOIN identifiers c on a.item_name = c.code and c.type = 'name'
LEFT JOIN identifiers d on a.item_division = d.code and d.type = 'division'
I'm not sure if BigQuery optimizes today a query like this - but at least you would be writing a query that gives strong hints to not run the subqueries when not needed:
#standardSQL
SELECT COALESCE(
null
, (SELECT MIN(payload)
FROM `githubarchive.year.2016`
WHERE actor.login=a.user)
, (SELECT MIN(payload)
FROM `githubarchive.year.2016`
WHERE actor.id = SAFE_CAST(user AS INT64))
)
FROM (SELECT '15229281' user) a
4.2s elapsed, 683 GB processed
{"action":"started"}
For example, the following query took a long time to run, but BigQuery could optimize its execution massively in the future (depending on how frequently users needed an operation like this):
#standardSQL
SELECT COALESCE(
"hello"
, (SELECT MIN(payload)
FROM `githubarchive.year.2016`
WHERE actor.login=a.user)
, (SELECT MIN(payload)
FROM `githubarchive.year.2016`
WHERE actor.id = SAFE_CAST(user AS INT64))
)
FROM (SELECT actor.login user FROM `githubarchive.year.2016` LIMIT 10) a
114.7s elapsed, 683 GB processed
hello
hello
hello
hello
hello
hello
hello
hello
hello
hello

Compare two most recent columns for same ID

I have a table for persons’ addresses. The Class column tells whether it is a home (H), postal (P) or internal (I) address. Each time a person updates the address, the date is stored as well.
The internal address is used by my company when a correspondence is returned to us due to incorrect addresses, so we update that to our address until we can obtain new contact details.
I’d like to know which persons have changed their address to a postal or home address with an effective date after the effective date of the internal address.
Here’s an example of what the table looks like:
| **PersonID**| **Class** | **EffDate** | **Line1**
| 1 | H | 12/01/2010 | 31 Academy Avenue
| 1 | H | 13/09/2010 | 433 Hillcrest Drive
| 1 | I | 26/10/2015 | 1 Bond Circle
| 2 | H | 17/12/2012 | 761 Circle St.
| 2 | H | 12/11/2013 | 597 Elm Lane
| 2 | I | 1/10/2015 | 1 Bond Circle
| 2 | H | 6/12/2016 | 8332 Mountainview St.
| 3 | P | 27/09/2010 | 8 Bow Ridge Lane
| 3 | H | 6/12/2010 | 22 Shady St.
| 3 | I | 7/12/2015 | 1 Bond Circle
| 3 | H | 8/12/2016 | 7423 Rockcrest Ave.
| 4 | P | 9/12/2015 | 888 N. Shady Street
| 4 | I | 10/12/2016 | 1 Bond Circle
I'd like the query to only return:
| **PersonID**| **Class** | **EffDate** | **Line1**
| 2 | H | 6/12/2016 | 8332 Mountainview St.
| 3 | H | 8/12/2016 | 7423 Rockcrest Ave.
Any ideas? I'm using SQL Server 2012.
Here is one method:
select t.*
from t
where t.class in ('H', 'P') and
t.effdate > (select max(t2.effdate)
from t t2
where t2.class = 'I' and t2.personId = t.personId
);
Another method that uses window functions:
select t.*
from (select t.*,
max(case when class = 'I' then date end) over (partition by personid) as max_idate
from t
) t
where t.class in ('H', 'P') and
t.date > t.maxidate;

Multiple self joins plus one inner join

I have two tables: ck_startup and ck_price. The price table contains the columns cu_type, prd_type, part_cd, qty, and dllrs. The startup table is linked to the price table through a one-to-many relationship on ck_startup.prd_type_cd = ck_price.prd_type.
The price table contains multiple entries for the same product/part/qty but under different customer types. Not all customer types have the same unique combination of those three values. I'm trying to create a query that will do two things:
Join some columns from ck_startup onto ck_price (description, and some additional values).
Join ck_price onto itself with a dllrs column for each customer type. So in total I would only have one instance of each unique key of product/part/qty, and a value in each customer's price column if they have one.
I've never worked with self joining tables, and so far I can only get records to show up where both customers have the same options available.
And because someone is going to demand I post sample code, here's the crappy query that doesn't show missing prices:
select pa.*, pac.dllrs from ck_price pa
join ck_price pac on pa.prd_type = pac.prd_type and pa.part_carbon_cd = pac.part_carbon_cd and pa.qty = pac.qty
where pa.cu_type = 'A' and pac.cu_type = 'AC';
EDIT: Here's sample data from the two tables, and how I want them to look when I'm done:
CK_STARTUP
+-----+-----------------+-------------+
| CD | DSC | PRD_TYPE_CD |
+-----+-----------------+-------------+
| 3D | Stuff | SKD3 |
| DC | Different stuff | SKD |
| DN2 | Similar stuff | SKD |
+-----+-----------------+-------------+
CK_PRICE
+---------+-------------+---------+-----+-------+
| CU_TYPE | PRD_TYPE_CD | PART_CD | QTY | DLLRS |
+---------+-------------+---------+-----+-------+
| A | SKD3 | 1 | 100 | 10 |
| A | SKD3 | 1 | 200 | 20 |
| A | SKD3 | 1 | 300 | 30 |
| A | SKD | 1 | 100 | 50 |
| A | SKD | 1 | 200 | 100 |
| AC | SKD3 | 1 | 300 | 30 |
| AC | SKD | 1 | 100 | 100 |
| AC | SKD | 1 | 200 | 200 |
| AC | SKD | 1 | 300 | 300 |
| AC | SKD | 1 | 400 | 400 |
+---------+-------------+---------+-----+-------+
COMBO:
+----+-----------------+---------+-----+---------+----------+
| CD | DSC | PART_CD | QTY | DLLRS_A | DLLRS_AC |
+----+-----------------+---------+-----+---------+----------+
| 3D | Stuff | 1 | 100 | 10 | null |
| 3D | Stuff | 1 | 200 | 20 | null |
| 3D | Stuff | 1 | 300 | 30 | 60 |
| DC | Different stuff | 1 | 100 | 50 | 100 |
| DC | Different stuff | 1 | 200 | 100 | 200 |
| DC | Different stuff | 1 | 300 | null | 300 |
| DC | Different stuff | 1 | 400 | null | 400 |
+----+-----------------+---------+-----+---------+----------+
Ok, take a look at below query and at the results:
SELECT *
FROM (SELECT
cs.cd, cs.dsc, cp.part_cd, cp.qty, cp.dllrs, cp.cu_type
FROM ck_startup cs
JOIN ck_price cp ON (cs.prd_type_cd = cp.prd_type_cd))
PIVOT (SUM(dllrs) AS dlllrs FOR (cu_type) IN ('A' AS a, 'AC' AS ac))
ORDER BY cd, qty
;
Output:
CD DSC PART_CD QTY A_DLLLRS AC_DLLLRS
-------- ----------------- ---------- ------- ---------- ----------
3D Stuff 1 100 10
3D Stuff 1 200 20
3D Stuff 1 300 30 30
DC Different stuff 1 100 50 50
DC Different stuff 1 200 100 100
DC Different stuff 1 300 150
DC Different stuff 1 400 200
DN2 Similar stuff 1 100 50 50
DN2 Similar stuff 1 200 100 100
DN2 Similar stuff 1 300 150
DN2 Similar stuff 1 400 200
It is not what you would expect, because I do not understand why you have different values in DLLRS_AC column that are in the CK_PRICE table? I mean, for example, why do you have 400 in last line of your output, not 200? Why is this value doubled (as others are in DLLRS_AC column)?
If you are using Oracle 10g, you can achieve the same result using DECODE and GROUP BY, take a look:
SELECT
cd,
dsc,
part_cd,
qty,
SUM(DECODE(cu_type, 'A', dllrs, NULL)) AS dllrs_a,
SUM(DECODE(cu_type, 'AC', dllrs, NULL)) AS dllrs_ac
FROM (
SELECT
cs.cd, cs.dsc, cp.part_cd, cp.qty, cp.dllrs, cp.cu_type
FROM ck_startup cs
JOIN ck_price cp ON (cs.prd_type_cd = cp.prd_type_cd)
)
GROUP BY cd, dsc, part_cd, qty
ORDER BY cd, qty;
Result is the same.
If you want to read more about pivoting, I recommend article by Tim Hall: Pivot and Unpivot at Oracle Base

MySQL: How to select and display ALL rows from one table, and calculate the sum of a where clause on another table?

I'm trying to display all rows from one table and also SUM/AVG the results in one column, which is the result of a where clause. That probably doesn't make much sense, so let me explain.
I need to display a report of all employees...
SELECT Employees.Name, Employees.Extension
FROM Employees;
--------------
| Name | Ext |
--------------
| Joe | 123 |
| Jane | 124 |
| John | 125 |
--------------
...and join some information from the PhoneCalls table...
--------------------------------------------------------------
| PhoneCalls Table |
--------------------------------------------------------------
| Ext | StartTime | EndTime | Duration |
--------------------------------------------------------------
| 123 | 2010-09-05 10:54:22 | 2010-09-05 10:58:22 | 240 |
--------------------------------------------------------------
SELECT Employees.Name,
Employees.Extension,
Count(PhoneCalls.*) AS CallCount,
AVG(PhoneCalls.Duration) AS AverageCallTime,
SUM(PhoneCalls.Duration) AS TotalCallTime
FROM Employees
LEFT JOIN PhoneCalls ON Employees.Extension = PhoneCalls.Extension
GROUP BY Employees.Extension;
------------------------------------------------------------
| Name | Ext | CallCount | AverageCallTime | TotalCallTime |
------------------------------------------------------------
| Joe | 123 | 10 | 200 | 2000 |
| Jane | 124 | 20 | 250 | 5000 |
| John | 125 | 3 | 100 | 300 |
------------------------------------------------------------
Now I want to filter out some of the rows that are included in the SUM and AVG calculations...
WHERE PhoneCalls.StartTime BETWEEN "2010-09-12 09:30:00" AND NOW()
...which will ideally result in a table looking something like this:
------------------------------------------------------------
| Name | Ext | CallCount | AverageCallTime | TotalCallTime |
------------------------------------------------------------
| Joe | 123 | 5 | 200 | 1000 |
| Jane | 124 | 10 | 250 | 2500 |
| John | 125 | 0 | 0 | 0 |
------------------------------------------------------------
Note that John has not made any calls in this date range, so his total CallCount is zero, but he is still in the list of results. I can't seem to figure out how to keep records like John's in the list. When I add the WHERE clause, those records are filtered out.
How can I create a select statement that displays all of the Employees and only SUMs/AVGs the values returned from the WHERE clause?
Use:
SELECT e.Name,
e.Extension,
Count(pc.*) AS CallCount,
AVG(pc.Duration) AS AverageCallTime,
SUM(pc.Duration) AS TotalCallTime
FROM Employees e
LEFT JOIN PhoneCalls pc ON pc.extension = e.extension
AND pc.StartTime BETWEEN "2010-09-12 09:30:00" AND NOW()
GROUP BY e.Name, e.Extension
The issue is when using an OUTER JOIN, specifying criteria in the JOIN section is applied before the JOIN takes place--like a derived table or inline view. The WHERE clause is applied after the OUTER JOIN, which is why when you specified the WHERE clause on the table being LEFT OUTER JOIN'd to that the rows you still wanted to see are being filtered out.