Trying to match a column's value with another containing multiple values - google-bigquery

I have two tables, with the same column name/type.
Table A: Property listings
ID | Postal Code | Town
12 | xxxxx | California
13 | xxxxx | Nashville
14 | xxxxx | New York
Table B: User preferences
ID | Name | Preferred Towns
909| Dave | ["California", "New York"]
The Town column in Table A is a string.
The Preferred Towns column in Table B is a JSON array.
The goal is to match Dave with property listings located in specific town(s).
Expected output:
User to Property Matches
User ID | User Name | Matched Property ID
909 | Dave | 12, 14

Consider the approach below (BigQuery):
select ID as User_ID, Name as User_Name,
( select string_agg(p.ID)
from t.Preferred_Towns Town
join Property_listings p
using (Town)
) Matched_Property_ID
from User_Preferences t
If applied to the sample data in your question, as in the example below:
with Property_listings as (
select '12' ID, 'xxxxx' PostalCode, 'California' Town union all
select '13', 'xxxxx', 'Nashville' union all
select '14', 'xxxxx', 'New York'
), User_Preferences as (
select '909' ID, 'Dave' Name, ["California", "New York"] Preferred_Towns
)
select ID as User_ID, Name as User_Name,
( select string_agg(p.ID)
from t.Preferred_Towns Town
join Property_listings p
using (Town)
) Matched_Property_ID
from User_Preferences t
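the output is one row: User_ID 909, User_Name Dave, and a Matched_Property_ID holding the comma-separated IDs 12 and 14.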
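To make the logic of the correlated subquery concrete, here is a minimal Python sketch (not BigQuery code) of the same steps: unnest each user's preferred towns, join them to the listings on the town, and aggregate the matching IDs. The function and variable names are illustrative assumptions; the data is the sample from the question.

```python
# Sample data from the question, as plain Python structures.
property_listings = [
    {"ID": "12", "PostalCode": "xxxxx", "Town": "California"},
    {"ID": "13", "PostalCode": "xxxxx", "Town": "Nashville"},
    {"ID": "14", "PostalCode": "xxxxx", "Town": "New York"},
]
user_preferences = [
    {"ID": "909", "Name": "Dave", "Preferred_Towns": ["California", "New York"]},
]


def match_users(users, listings):
    """For each user, join the unnested preferred towns to listings on Town."""
    results = []
    for user in users:
        matched_ids = [
            listing["ID"]
            for town in user["Preferred_Towns"]  # "unnest" the array
            for listing in listings
            if listing["Town"] == town  # join ... using (Town)
        ]
        # Rough string_agg equivalent: join the IDs with a comma.
        results.append((user["ID"], user["Name"], ",".join(matched_ids)))
    return results


matches = match_users(user_preferences, property_listings)
print(matches)
```

One difference worth noting: when nothing matches, string_agg yields NULL, whereas this sketch yields an empty string.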


Multi-Pass Duplication Identification with Exclusions

I have a customer table with several hundred thousand records. There are a LOT of duplicates of varying degrees. I am trying to identify duplicate records with level of possibility of being a duplicate.
My source table has 7 fields and looks like this:
I look for duplicates, and put them into an intermediate table with the level of possibility, table name, and the customer number.
Intermediate Table
CREATE TABLE DataCheck (
id int identity(1,1),
reason varchar(100) DEFAULT NULL,
tableName varchar(100) DEFAULT NULL,
tableID varchar(100) DEFAULT NULL
)
Here is my code to identify and insert:
-- Match on Company, Contact, Address, City, and Phone
-- DUPE
INSERT INTO DataCheck
SELECT 'Duplicate','CUSTOMER',tcd.uid
FROM #tmpCoreData tcd
INNER JOIN
(SELECT
company,
fname,
lname,
add1,
city,
phone1,
COUNT(*) AS count
FROM #tmpCoreData
WHERE company <> ''
GROUP BY company, fname, lname, add1, city, phone1
HAVING COUNT(*) > 1) dl
ON dl.company = tcd.company
ORDER BY tcd.company
In this example, it would insert ids 101, 102
The problem is when I perform the next pass:
-- Match on Company, Address, City, Phone (Diff Contacts)
-- LIKELY DUPE
INSERT INTO DataCheck
SELECT 'Likely Duplicate','CUSTOMER',tcd.uid
FROM #tmpCoreData tcd
INNER JOIN
(SELECT
company,
add1,
city,
phone1,
COUNT(*) AS count
FROM #tmpCoreData
WHERE company <> ''
GROUP BY company, add1, city, phone1
HAVING COUNT(*) > 1) dl
ON dl.company = tcd.company
ORDER BY tcd.company
This pass would then insert, 101, 102 & 103.
The next pass drops the phone so it would insert 101, 102, 103, 104
The next pass would look for company only which would insert all 5.
I now have 14 entries into my intermediate table for 5 records.
How can I add an exclusion so the 2nd pass groups on the same Company, Address, City, Phone but DIFFERENT fname and lname? Then it should only insert 101 and 103.
I considered adding a NOT IN (SELECT tableID FROM DataCheck) to ensure IDs aren't added multiple times, but the 3rd or 4th pass may find a duplicate and enter it 700 records after the row it duplicates, so you lose the context of what it's a dupe of.
My output uses:
SELECT
dc.reason,
dc.tableName,
tcd.*
FROM DataCheck dc
INNER JOIN #tmpCoreData tcd
ON tcd.uid = dc.tableID
ORDER BY dc.id
And looks something like this, which is a bit confusing:
I'm going to challenge your perception of your issue, and instead propose that you calculate a simple "confidence score", which will also help you vastly simplify your results table:
WITH FirstCompany AS (SELECT custNo, company, fname, lname, add1, city, phone1
FROM(SELECT custNo, company, fname, lname, add1, city, phone1,
ROW_NUMBER() OVER(PARTITION BY company ORDER BY custNo) AS ordering
FROM CoreData) FC
WHERE ordering = 1)
SELECT RankMapping.description, Duplicate.custNo, Duplicate.company, Duplicate.fname, Duplicate.lname, Duplicate.add1, Duplicate.city, Duplicate.phone1
FROM (SELECT FirstCompany.custNo AS originalCustNo, Duplicate.*,
CASE WHEN FirstCompany.custNo = Duplicate.custNo THEN 1 ELSE 0 END
+ CASE WHEN FirstCompany.fname = Duplicate.fname AND FirstCompany.lname = Duplicate.lname THEN 1 ELSE 0 END
+ CASE WHEN FirstCompany.add1 = Duplicate.add1 AND FirstCompany.city = Duplicate.city THEN 1 ELSE 0 END
+ CASE WHEN FirstCompany.phone1 = Duplicate.phone1 THEN 1 ELSE 0 END
AS ranking
FROM FirstCompany
JOIN CoreData Duplicate
ON Duplicate.custNo >= FirstCompany.custNo
AND Duplicate.company = FirstCompany.company) Duplicate
JOIN (VALUES (4, 'original'),
(3, 'duplicate'),
(2, 'likely dupe'),
(1, 'possible dupe'),
(0, 'not likely dupe')) RankMapping(score, description)
ON RankMapping.score = Duplicate.ranking
ORDER BY Duplicate.originalCustNo, Duplicate.ranking DESC
SQL Fiddle Example
... which generates results that look like this:
| description | custNo | company | fname | lname | add1 | city | phone1 |
|-----------------|--------|----------|---------|--------|--------------|--------------|------------|
| original | 101 | ACME INC | JOHN | DOE | 123 ACME ST | LOONEY HILLS | 1231234567 |
| duplicate | 102 | ACME INC | JOHN | DOE | 123 ACME ST | LOONEY HILLS | 1231234567 |
| likely dupe | 103 | ACME INC | JANE | SMITH | 123 ACME ST | LOONEY HILLS | 1231234567 |
| possible dupe | 104 | ACME INC | BOB | DOLE | 123 ACME ST | LOONEY HILLS | 4564567890 |
| not likely dupe | 105 | ACME INC | JESSICA | RABBIT | 456 ROGER LN | WARNER | 4564567890 |
This code baselessly assumes that the smallest custNo is the "original", and assumes matches will be equivalent to solely that one, but it's completely possible to get other matches as well (just unnest the subquery in the CTE, and remove the row number).
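As a rough, runnable illustration of the confidence-score idea, the sketch below rebuilds the five sample rows in an in-memory SQLite database and computes the same score. It is an adaptation, not the query above: it uses MIN(custNo) per company instead of ROW_NUMBER(), relies on SQLite comparisons evaluating to 0 or 1, and maps scores to labels in Python. The names SCORE_SQL, labels, and scored are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CoreData (custNo INTEGER PRIMARY KEY, company TEXT, "
    "fname TEXT, lname TEXT, add1 TEXT, city TEXT, phone1 TEXT)"
)
conn.executemany(
    "INSERT INTO CoreData VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (101, "ACME INC", "JOHN", "DOE", "123 ACME ST", "LOONEY HILLS", "1231234567"),
        (102, "ACME INC", "JOHN", "DOE", "123 ACME ST", "LOONEY HILLS", "1231234567"),
        (103, "ACME INC", "JANE", "SMITH", "123 ACME ST", "LOONEY HILLS", "1231234567"),
        (104, "ACME INC", "BOB", "DOLE", "123 ACME ST", "LOONEY HILLS", "4564567890"),
        (105, "ACME INC", "JESSICA", "RABBIT", "456 ROGER LN", "WARNER", "4564567890"),
    ],
)

# The lowest custNo per company is treated as the "original" (same assumption
# as the answer). Each matching attribute group adds one point to the score.
SCORE_SQL = """
SELECT d.custNo,
       (d.custNo = o.custNo)
     + (d.fname = o.fname AND d.lname = o.lname)
     + (d.add1 = o.add1 AND d.city = o.city)
     + (d.phone1 = o.phone1) AS score
FROM CoreData d
JOIN (SELECT company, MIN(custNo) AS firstNo
      FROM CoreData
      GROUP BY company) f ON f.company = d.company
JOIN CoreData o ON o.custNo = f.firstNo
ORDER BY d.custNo
"""

labels = {
    4: "original",
    3: "duplicate",
    2: "likely dupe",
    1: "possible dupe",
    0: "not likely dupe",
}
scored = [(cust_no, labels[score]) for cust_no, score in conn.execute(SCORE_SQL)]
print(scored)
```

Because every row gets exactly one score, each customer appears once in the output instead of once per pass.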

Select only rows with COLUMN=<value> if matches exist, or COLUMN IS NULL otherwise

Here's a sample script to demonstrate the problem:
CREATE TABLE person (
id NUMERIC PRIMARY KEY,
name VARCHAR(255) NOT NULL,
city VARCHAR(255)
);
INSERT INTO person(id, name, city) VALUES(1, 'John', 'New York');
INSERT INTO person(id, name, city) VALUES(2, 'Mike', 'Boston');
INSERT INTO person(id, name, city) VALUES(3, 'Ralph', NULL);
For city='Boston' I want only the 2nd row to be returned. For city='Chicago' I want the 3rd row to be returned.
If you are looking for one row:
select p.*
from person p
where city is null or city = 'Boston'
order by (city is not null) desc -- the matching row sorts before the NULL fallback
fetch first 1 row only;
If you can have multiple matches, then I would suggest:
select p.*
from person p
where p.city = 'Boston'
union all
select p.*
from person p
where p.city is null and
not exists (select 1 from person p2 where p2.city = 'Boston');
Or, using window functions:
select p.*
from (select p.*, count(*) filter (where p.city = 'Boston') over () as cnt
from person p
) p
where (cnt > 0 and p.city = 'Boston') or
(cnt = 0 and p.city is null);
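The UNION ALL / NOT EXISTS variant can be tried out with the sample table from the question. The sketch below runs it in an in-memory SQLite database with the city passed as a named parameter; the helper name people_for_city is an assumption made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (
        id NUMERIC PRIMARY KEY,
        name VARCHAR(255) NOT NULL,
        city VARCHAR(255)
    );
    INSERT INTO person(id, name, city) VALUES(1, 'John', 'New York');
    INSERT INTO person(id, name, city) VALUES(2, 'Mike', 'Boston');
    INSERT INTO person(id, name, city) VALUES(3, 'Ralph', NULL);
    """
)

# First branch: rows matching the city.
# Second branch: NULL-city rows, but only when no row matches the city.
FALLBACK_SQL = """
SELECT id, name, city FROM person WHERE city = :city
UNION ALL
SELECT id, name, city FROM person
WHERE city IS NULL
  AND NOT EXISTS (SELECT 1 FROM person WHERE city = :city)
ORDER BY id
"""


def people_for_city(city):
    """Return matching rows, or the NULL-city rows if nothing matches."""
    return conn.execute(FALLBACK_SQL, {"city": city}).fetchall()


print(people_for_city("Boston"))
print(people_for_city("Chicago"))
```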
Try the following, using a subquery:
select * from person
where
city='Chicago' or
( city is null and
0 = ( select count(*) from person where city='Chicago' )
)
Result using Chicago:
ID NAME CITY
3 Ralph
select * from person
where
city='Boston' or
( city is null and
0 = ( select count(*) from person where city='Boston' )
)
Result using Boston:
ID NAME CITY
2 Mike Boston
With multiple override columns there is no clean SQL solution. Assuming there is only one such column, as in the example (city only), you can try this query:
SELECT
IFNULL(person.id, person_override.id) AS id,
IFNULL(person.name, person_override.name) AS name,
IFNULL(person.city, person_override.city) AS city
FROM person AS person_override
LEFT JOIN person ON person.city = :city
WHERE person_override.city IS NULL;
Output for Boston is:
| id | name | city |
|----|------|--------|
| 2 | Mike | Boston |
Output for Chicago is:
| id | name | city |
|----|-------|--------|
| 3 | Ralph | |
Note that you won't be able to process multiple entries at the same time but it should satisfy the illustrated example.

Questions with SQL - separating a column in two

I don't know how I can answer this question. Because the name and last name are in one column. I'm not allowed to change the columns.
"Get the average spending (per customer) of all customers who share a last name with another customer"
I thought to say in sqlite3
SELECT avg_spending
FROM customer
JOIN customer on WHERE name is name;
This is how the table is defined:
CREATE TABLE customer
(
cuid INTEGER,
name STRING,
age INTEGER,
avg_spending REAL,
PRIMARY KEY(cuid)
);
For example, these two rows share the same last name:
INSERT INTO customer VALUES (4, "Henk Krom", 65, 24);
INSERT INTO customer VALUES (9, "Bob Krom", 66, 4);
From the sample data you posted I guess the format of the column name is:
FirstName LastName
so you need to extract the LastName and use group by to get the average:
select
substr(name, instr(name, ' ') + 1) lastname,
avg(avg_spending) avg_spending
from customer
group by lastname
having count(*) > 1
The having clause restricts the results to those customer names that have at least 1 other customer name with the same last name.
See the demo.
For the sample data:
> cuid | name | age | avg_spending
> :--- | :-------- | :-- | :-----------
> 4 | Henk Krom | 65 | 24
> 9 | Bob Krom | 66 | 4
> 5 | Jack Doe | 66 | 4
> 7 | Jill Doe | 66 | 6
> 1 | Alice No | 66 | 44
you get results:
> lastname | avg_spending
> :------- | :-----------
> Doe | 5
> Krom | 14
As mentioned in the comments, the crux of this is finding a rule that reliably extracts the surname from the name. Apart from that, you merely need an exists clause, because you want to select customers where another customer with the same surname exists.
("Get the average spending (per customer)" simply means get a row from the table, because each row contains exactly one customer and their average spending.)
If all names were in the format first name - blank - last name, that would be:
select *
from customer c
where exists
(
select *
from customer other
where other.cuid <> c.cuid
and substr(other.name, instr(other.name, ' ') + 1) = substr(c.name, instr(c.name, ' ') + 1)
);
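Since the question targets sqlite3, both approaches can be checked directly. The sketch below loads the sample rows into an in-memory SQLite database and runs the grouped version and the exists version. It assumes every name has the form "FirstName LastName" with a single space; the variable names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (cuid INTEGER, name STRING, age INTEGER, "
    "avg_spending REAL, PRIMARY KEY(cuid))"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [
        (4, "Henk Krom", 65, 24),
        (9, "Bob Krom", 66, 4),
        (5, "Jack Doe", 66, 4),
        (7, "Jill Doe", 66, 6),
        (1, "Alice No", 66, 44),
    ],
)

# Average spending per shared last name (text after the first space).
per_lastname = conn.execute(
    """
    SELECT substr(name, instr(name, ' ') + 1) AS lastname,
           avg(avg_spending) AS avg_spending
    FROM customer
    GROUP BY lastname
    HAVING count(*) > 1
    ORDER BY lastname
    """
).fetchall()

# One row per customer who shares a last name with another customer.
per_customer = conn.execute(
    """
    SELECT c.cuid, c.name, c.avg_spending
    FROM customer c
    WHERE EXISTS (
        SELECT 1
        FROM customer other
        WHERE other.cuid <> c.cuid
          AND substr(other.name, instr(other.name, ' ') + 1)
              = substr(c.name, instr(c.name, ' ') + 1)
    )
    ORDER BY c.cuid
    """
).fetchall()

print(per_lastname)
print(per_customer)
```

Which of the two shapes is wanted depends on how "per customer" in the assignment is read: one row per shared last name, or one row per qualifying customer.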
You were correct to join the customer table to itself, but you also need to parse out the last name for the comparison. Once a match is found you also have to remove duplicates, because if nameA matches nameB then nameB also matches nameA.
with custs AS
(
select distinct
a.name as name_1 ,
b.name as name_2
from customer a
join customer b
on substr(a.name, instr(a.name, ' ') + 1) = substr(b.name, instr(b.name, ' ') + 1)
where a.name <> b.name
)
select * from customer where name in (select name_1 from custs)
union
select * from customer where name in (select name_2 from custs)

SELECT subquery with 2 return values

I want to select multiple columns from a subquery. Here my minimal example:
A function that returns two values:
CREATE OR REPLACE FUNCTION dummy_function(my_text text)
RETURNS TABLE (id Integer, remark text) AS $$
BEGIN
RETURN QUERY SELECT 42, upper(my_text);
END;
$$ LANGUAGE plpgsql;
My query, which does not work:
SELECT
id,
city_name,
dummy_function(city_name)
FROM
(SELECT 1 as id, 'Paris' as city_name
UNION ALL
SELECT 2 as id, 'Barcelona' as city_name
) AS dummy_table
My wrong result:
id | city_name | dummy_function
----+-----------+----------------
1 | Paris | (42,PARIS)
2 | Barcelona | (42,BARCELONA)
But I would like to have a result like this:
id | city_name | number | new_text
----+-----------+--------+-----------
1 | Paris | 42 | PARIS
2 | Barcelona | 42 | BARCELONA
Do you know how to achieve this without running the function twice?
Use the function returning row (or set of rows) in the FROM clause:
SELECT
dummy_table.id,
city_name,
dummy_function.id,
remark
FROM
(SELECT 1 as id, 'Paris' as city_name
UNION ALL
SELECT 2 as id, 'Barcelona' as city_name
) AS dummy_table,
LATERAL dummy_function(city_name)
id | city_name | id | remark
----+-----------+----+-----------
1 | Paris | 42 | PARIS
2 | Barcelona | 42 | BARCELONA
(2 rows)
Per the documentation:
Table functions appearing in FROM can also be preceded by the key word LATERAL, but for functions the key word is optional; the function's arguments can contain references to columns provided by preceding FROM items in any case.
So LATERAL can be omitted, and column aliases give the desired output names:
SELECT
dummy_table.id,
city_name,
df.id as number,
df.remark as new_text
FROM
(SELECT 1 as id, 'Paris' as city_name
UNION ALL
SELECT 2 as id, 'Barcelona' as city_name
) AS dummy_table,
dummy_function(city_name) df
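For readers less familiar with lateral joins, the semantics can be modeled in a few lines of Python: for each row of the outer table the function is called once, and each row it returns is spread into extra columns. This is only an analogy with stand-in names (the Python dummy_function mimics the SQL function from the question), not PostgreSQL code.

```python
def dummy_function(my_text):
    """Stand-in for the SQL function: returns a list of (id, remark) rows."""
    return [(42, my_text.upper())]


dummy_table = [(1, "Paris"), (2, "Barcelona")]

# For each outer row, call the function once and spread its columns,
# which is what the (implicit) LATERAL join does.
result = [
    (row_id, city_name, number, new_text)
    for row_id, city_name in dummy_table
    for number, new_text in dummy_function(city_name)
]
print(result)
```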

How to select all attributes (*) with distinct values in a particular column(s)?

Here is a link to the W3Schools database for learners:
W3School Database
If we execute the following query:
SELECT DISTINCT city FROM Customers
it returns a list of the distinct City values in the table.
What should we do if we want all the columns, as returned by the SELECT * FROM Customers query, but with a unique City value in each row?
DISTINCT, when used with multiple columns, applies to all the columns together. The combination of values across all selected columns is considered, not just one column.
If you want to have distinct values, then concatenate all the columns, which will make it distinct.
Or, you could group the rows using GROUP BY.
You need to select all rows from the customers table whose city occurs only once. So, logically, I came up with this query:
SELECT * FROM `customers` WHERE `city` in (SELECT `city` FROM `customers` GROUP BY `city` HAVING COUNT(*) = 1)
I think you want something like this:
(change PK field to your Customers Table primary key or index like Id)
In SQL Server (and standard SQL)
SELECT
*
FROM (
SELECT
*, ROW_NUMBER() OVER (PARTITION BY City ORDER BY PK) rn
FROM
Customers ) Dt
WHERE
(rn = 1)
In MySQL
SELECT
*
FROM (
SELECT
a.City, a.PK, count(*) as rn
FROM
Customers a
JOIN
Customers b ON a.City = b.City AND a.PK >= b.PK
GROUP BY a.City, a.PK ) As DT
WHERE (rn = 1)
I hope this query returns your cities distinctly while also showing the other columns.
You can use GROUP BY clause for getting distinct values in a particular column. Consider the following table - 'contact':
+---------+------+---------+
| id | name | city |
+---------+------+---------+
| 1 | ABC | Chennai |
+---------+------+---------+
| 2 | PQR | Chennai |
+---------+------+---------+
| 3 | XYZ | Mumbai |
+---------+------+---------+
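To select all columns with distinct values in the city attribute, use the following query (in MySQL this requires the ONLY_FULL_GROUP_BY SQL mode to be disabled, and which row is returned for each city is not guaranteed):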
SELECT *
FROM contact
GROUP BY city;
This will give you output like the following:
+---------+------+---------+
| id | name | city |
+---------+------+---------+
| 1 | ABC | Chennai |
+---------+------+---------+
| 3 | XYZ | Mumbai |
+---------+------+---------+
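If a deterministic pick per city is wanted without depending on how a particular database treats non-aggregated columns under GROUP BY, one option is to join back on the smallest id per city. The sketch below shows this on the contact table above using an in-memory SQLite database. Choosing the lowest id as the representative row is an assumption made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contact (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO contact VALUES (?, ?, ?)",
    [(1, "ABC", "Chennai"), (2, "PQR", "Chennai"), (3, "XYZ", "Mumbai")],
)

# Keep, for each city, the full row with the smallest id.
first_per_city = conn.execute(
    """
    SELECT c.id, c.name, c.city
    FROM contact c
    JOIN (SELECT city, MIN(id) AS first_id
          FROM contact
          GROUP BY city) f ON f.first_id = c.id
    ORDER BY c.id
    """
).fetchall()
print(first_per_city)
```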