Google's Big Query using SQL: Associate the assignee name and harmonized assignee name when there are multiple assignees - sql

My goal is to create a table from Google's Big Query patents-public-data.patents.publications_201710 table using standard SQL that has one row for the publication_number, assignee and assignee_harmonized.name where the publication_number is repeated for records that have multiple assignees. Here's an example of my desired output:
publication_number|assignee|assignee_harm
US-6044964-A|Sony Corporation|SONY CORP
US-6044964-A|Digital Audio Disc Corporation|DIGITAL AUDIO DISC CORP
US-8746747-B2|IPS Corporation—Weld-On Division|IPS CORPORATION—WELD ON DIVISION
US-8746747-B2|null|MCPHERSON TERRY R
I've tried the following query based off of the UNNEST suggestion found in this post
#standard SQL
SELECT
p.publication_number,
p.assignee,
a.name AS assignee_harm
FROM
`patents-public-data.patents.publications_201710` AS p,
UNNEST(assignee_harmonized) AS a
WHERE
p.publication_number IN ('US-6044964-A',
'US-8746747-B2')
However, the output appears as follows:
row|publication_number|assignee|assignee_harm
1|US-6044964-A|Sony Corporation|SONY CORP
||Digital Audio Disc Corporation|
2|US-6044964-A|Sony Corporation|DIGITAL AUDIO DISC CORP
||Digital Audio Disc Corporation|
3|US-8746747-B2|IPS Corporation—Weld-On Division|MCPHERSON TERRY R
4|US-8746747-B2|IPS Corporation—Weld-On Division|IPS CORPORATION—WELD ON DIVISION
You can see that the "Sony Corporation" assignee is inappropriately associated with the "DIGITAL AUDIO DISC CORP" harmonized name in row 2 with a similar issue appearing in row 3. Also, rows 1 and 2 contain two lines each but don't repeat the publication_number identifier. I don't see a straightforward way to do this because the number of "assignee" doesn't always equal the number of "assignee_harmonized.name" and they don't always appear in the same order (otherwise I could try creating two tables and merging them somehow). On the other hand, there has to be a way to associate the "assignee" variable with its harmonized value "assignee_harmonized.name", otherwise the purpose of having a harmonized value is lost. Could you please suggest a query (or set of queries) that will produce the desired output when there are either multiple "assignee" or multiple "assignee_harmonized.name" or both?

You're querying for a string and two arrays - the whole thing basically looks like this:
{
"publication_number": "US-8746747-B2",
"assignee": [
"IPS Corporation—Weld-On Division"
],
"assignee_harm": [
"MCPHERSON TERRY R",
"IPS CORPORATION—WELD ON DIVISION"
]
}
So that's the data and you somehow need to decide how to treat the combination of them ... either you cross join everything:
#standard SQL
SELECT
p.publication_number,
assignee,
assignee_harmonized.name AS assignee_harm
FROM
`patents-public-data.patents.publications_201710` AS p
,p.assignee assignee
,p.assignee_harmonized AS assignee_harmonized
WHERE
p.publication_number IN ('US-6044964-A','US-8746747-B2')
.. which gives you relational data .. or you leave it as two separate arrays:
#standard SQL
SELECT
p.publication_number,
assignee,
ARRAY( (SELECT name FROM p.assignee_harmonized)) AS assignee_harm
FROM
`patents-public-data.patents.publications_201710` AS p
WHERE
p.publication_number IN ('US-6044964-A','US-8746747-B2')
You can save this nested result as a table in bq as well.

Related

Microsoft SQL Server generates two select queries and puts data in separate columns

I am looking for a query that separate the data with the condition WHERE in the same output but in separates columns.
Example: I have the table Product_2:
I have two separates queries (to separate the products by Produt_Tag):
SELECT
Product_Mark AS "PIT-10_Product_Mark",
Product_Model AS "PIT-10_Product_Model"
FROM Product_2
WHERE Product_Tag = 'PIT-10';
SELECT
Product_Mark AS "PIT-11_Product_Mark",
Product_Model AS "PIT-11_Product_Model"
FROM Product_2
WHERE Product_Tag = 'PIT-11';
And I get this output:
But I need the output to be like this:
Can someone tell me how I need to modify my query to have the four columns in the same table/ output?
Thank you
I forgot to tell that in the data I Have the “Porduct_Mark” that only appears one time. (in reality the data in “Product_Mark” is the name of the place where the instrument is located and one place can have one or two instruments “Product_Model”. At the end I’m looking for the result show in the image here below. I tried to use LEFT JOIN but that don’t work.
here is the new table "Product_2"
Result that I'm looking for:
Luis Ardila
I am assuming Product_PK is the primary key for the table and the repeated value 1002 shown in the question is a mistake. Considering this assumption, you can get the result set using self join as below.
SELECT pa.Product_Mark AS "PIT-10_Product_Mark", pa.Product_Model AS "PIT-10_Product_Model",
pb.Product_Mark AS "PIT-11_Product_Mark", pb.Product_Model AS "PIT-11_Product_Model"
FROM Product_2 pa
INNER JOIN Product_2 pb
ON pa.Product_Mark = pb.Product_Mark
WHERE pa.product_pk != pb.product_pk
and pa.Product_Tag = 'PIT-10'
and pb.Product_Tag = 'PIT-11';
verified same in https://dbfiddle.uk/NiOO8zc1

SQL Update based on secondary table in BQ

I have 2 tables, 1 containing the main body of information, the second contains information on country naming convensions. in the information table, countries are identified by Name, I would like to update this string to contain an ISO alpha 3 value which is contained in the naming convention table. e.g turning "United Kingdom" -> "GBR"
I have wrote the following query to make the update, but it effects 0 rows
UPDATE
`db.catagory.test_votes_ds`
SET
`db.catagory.test_votes_ds`.country = `db.catagory.ISO-Alpha`.Alpha_3_code
FROM
`db.catagory.ISO-Alpha`
WHERE
`LOWER(db.catagory.ISO-Alpha`.Country) = LOWER(`db.catagory.test_votes_ds`.country)
I've done an inner join outside of the update between the 2 to make sure that the values are compatable and it returns the correct value, any ideas as to why it isn't updating?
The join used to validate the result is listed below, along with the result:
SELECT
`db.catagory.test_votes_ds`.country, `db.catagory.ISO-Alpha`.Alpha_3_code
from
`db.catagory.test_votes_ds`
inner join
`db.catagory.ISO-Alpha`
on
LOWER(`db.catagory.test_votes_ds`.country) = LOWER(`db.catagory.ISO-Alpha`.Country)
1,Ireland,IRL
2,Australia,AUS
3,United States,USA
4,United Kingdom,GBR
This is not exactly an answer. But your test may not be sufficient. You need to check where the values do not match. So, to return those:
select tv.*
from `db.catagory.test_votes_ds` tv left join
`db.catagory.ISO-Alpha` a
on LOWER(tv.country) = LOWER(a.Country)
where a.Country IS NULL;
I suspect that you will find countries that do not match. So when you run the update, the matches are getting changed the first time. Then the non-matches are never changed.

Joining SQL statements Atrium Syntess Firebird DB

I'm trying to get (one or multiple) number lines (PROGCODE) that are attached to an OBJECT (i.e. a building) that is connected to a Relation which in turn has a GC_ID (relation unique ID). I need all the buildings & progcodes connected to a relation ID in a firebird 2.5 database generated by my companies ERP system.
I can look through all the tables in the firebird database and run select queries on them.I like to think I have the join statement syntax down and I know how to find the unique ID belonging to a relation, unfortunately I'm unsure how I can find the correct table that houses the information I seek.
The table I think this data is in has the following fields:
GC_ID, DEVICE_GC_ID, USER_GC_ID, CODE, DESCRIPTION.
However when I query it using
select GC_ID, DEVICE_GC_ID, USER_GC_ID, CODE, DESCRIPTION
from AT_PROGCODE A
Then I get a description of the fields I'm trying to query.
i.e.
| GC_ ID : 100005 | DEVICE_GC_ID : 100174 | USER_GC_ID : 1000073 | DESCRIPTION: >description of what I'm trying to query< |
Can anyone shed some insight how I should handle this?
Update 7-09-2017
I spoke with the ERP consultant and was told the tables I needed (if anyone reading this is using syntess Atrium; the AT_BRENT table holds a description of all the tables.)
However, I've run into a new problem; the data I get from my sql query keeps streaming (it seems to never end with me stopping the script at 90 mil loops and the ERP program crashing if I ask for a count).
select A.GC_OMSCHRIJVING Bedrijf, A.GC_CODE ,M.GC_OMSCHRIJVING Werktitel,
M.TELEFOON1, M.TELEFOON2, M.MOBIEL, M.EMAIL,
M.URL, M.DOORKIES_NR, M.WERKLOCATIE, M.EMAIL_INTERN
from AT_MEDEW M , AT_RELATIE A
JOIN AT_MEDEW ON A.GC_ID = M.GC_ID
WHERE M.TELEFOON1 <> '' OR M.TELEFOON2 <> '' OR M.MOBIEL <> ''
Any ideas on what's the cause for my latest peril?
First I had to find the AT_BRENT table which holds all the descriptions for tables in Syntess Atrium
Then I was using a CROSS JOIN (as pointed out by https://stackoverflow.com/users/696808/bacon-bits )
I ended up using
select A.GC_OMSCHRIJVING Bedrijf, A.GC_CODE ,M.GC_OMSCHRIJVING Werktitel,
M.TELEFOON1, M.TELEFOON2, M.MOBIEL, M.EMAIL,
M.URL, M.DOORKIES_NR, M.WERKLOCATIE, M.EMAIL_INTERN
from AT_MEDEW M
JOIN AT_RELATIE A ON A.GC_ID = M.GC_ID
WHERE M.TELEFOON1 <> '' OR M.TELEFOON2 <> '' OR M.MOBIEL <> ''
Thank you all who helped.

Ways to Clean-up messy records in sql

I have the following sql data:
ID Company Name Customer Address 1 City State Zip Date
0108500 AAA Test Mish~Sara Newa Claims Chtiana CO 123 06FE0046
0108500 AAA.Test Mish~Sara Newa Claims Chtiana CO 123 06FE0046
1802600 AAA Test Company Ban, Adj.~Gorge PO Box 83 MouLaurel CA 153 09JS0025
1210600 AAA Test Company Biwel~Brce 97kehst ve Jacn CA 153 04JS0190
AAA Test, AAA.Test and AAA Test Company are considered as one company.
Since their data is messy I'm thinking either to do this:
Is there a way to search all the records in the DB wherein it will search the company name with almost the same name then re-name it to the longest name?
In this case, the AAA Test and AAA.Test will be AAA Test Company.
OR Is there a way to filter only record with company name that are almost the same then they can have option to change it?
If there's no way to do it via sql query, what are your suggestions so that we can clean-up the records? There are almost 1 million records in the database and it's hard to clean it up manually.
Thank you in advance.
You could use String matching algorithm like Jaro-Winkler. I've written an SQL version that is used daily to deduplicate People's names that have been typed in differently. It can take awhile but it does work well for the fuzzy match you're looking for.
Something like a self join? || is ANSI SQL concat, some products have a concat function instead.
select *
from tablename t1
join tablename t2 on t1.companyname like '%' || t2.companyname || '%'
Depending on datatype you may have to remove blanks from the t2.companyname, use TRIM(t2.companyname) in that case.
And, as Miguel suggests, use REPLACE to remove commas and dots etc.
Use case-insensitive collation. SOUNDEX can be used etc etc.
I think most Database Servers support Full-Text search ability, and if so there are some functions related to Full-Text search that support Proximity.
for example there is a Near function in SqlServer and here is its documentation https://msdn.microsoft.com/en-us/library/ms142568.aspx
You can do the clean-up in several stages.
Create new columns
Convert everything to upper case, remove punctuation & whitespace, then match on the first 6 to 10 characters (using self join). Assuming your table is called "vendor": add two columns, "status", "dupstr", then update as follows
/** Populate dupstr column for fuzzy match **/
update vendor v
set v.dupstr = left(upper(regex_replace(regex_replace(v.companyname,'.',''),' ','')),6)
;
Identify duplicate records
Add an index on the dupstr column, then do an update like this to identify "good" records:
/** Mark the good duplicates **/
update vendor v
set v.status = 'keep' --indicate keeper record
where
--dupes to clean up
exists ( select 1 from vendor v1 where v.dupstr = v1.dupstr
and v.id != v1.id )
and
( --keeper has longest name
length(v.companyname) =
( select max(length(v2.companyname)) from vendor v2
where v.dupstr = v2.dupstr
)
or
--keeper has latest record (assuming ID is sequential)
v.id =
( select max(v3.id) from vendor v3
where v.dupstr = v3.dupstr
)
)
group by v.dupstr
;
The above SQL can be refined to add "dupe" status to other records , or you can do a separate update.
Clean Up Stragglers
Report any remaining partial matches to be reviewed by a human (i.e. dupe records without a keeper record)
You can use SQL query with SOUDEX of DIFFRENCE
For example:
SELECT DIFFERENCE ('AAA Test','AAA Test Company')
DIFFERENCE returns 0 - 4 ( 4 = almost the same, 0 - totally diffrent)
See also: https://learn.microsoft.com/en-us/sql/t-sql/functions/difference-transact-sql?view=sql-server-2017

Way to combine filtered results using LIKE

I have a many to many relationship between people and some electronic codes. The table with the codes has the code itself, and a text description of the code. A typical result set from a query might be (there are many codes that contain "broken" so I feel like it's better to search the text description rather than add a bunch of ORs for every code.)
id# text of code
1234 broken laptop
1234 broken mouse
Currently the best way for me to get a result set like this is to use the LIKE%broken% filter. Without changing the text description, is there any way I can return only one instance of a code with broken? So in the example above the query would only return 1234 and broken mouse OR broken laptop. In this scenario it doesn't matter which is returned, all I'm looking for is the presence of "broken" in any of the text descriptions of that person's codes.
My solution at the moment is to create a view that would return
`id# text of code
1234 broken laptop
1234 broken mouse`
and using SELECT DISTINCT ID# while querying the view to get only one instance of each.
EDIT ACTUALLY QUERY
SELECT tblVisits.kha_id, tblICD.descrip, min(tblICD.Descrip) as expr1
FROM tblVisits inner join
icd_jxn on tblVisits.kha_id = icd_jxn.kha)id inner join tblICD.icd_fk=tblICD.ICD_ID
group by tblVisits.kha_id, tblicd.descrip
having (tblICD.descrip like n'%broken%')
You could use the below query to SELECT the MIN code. This will ensure only text per id.
SELECT t.id, MIN(t.textofcode) as textofcode
FROM table t
WHERE t.textofcode LIKE '%broken%'
GROUP BY t.id
Updated Actual Query:
SELECT tblVisits.kha_id,
MIN(tblICD.Descrip)
FROM tblVisits
INNER JOIN icd_jxn ON tblVisits.kha_id = icd_jxn.kha)id
INNER JOIN tblicd ON icd_jxn.icd_fk = tbl.icd_id
WHERE tblICD.descrip like n'%broken%'
GROUP BY tblVisits.kha_id