Find string from table in cell in BiqQuery --> Query exceeded resource limits - google-bigquery

I have two tables in BigQuery:
City List: Table: invertible-fin-XXX238.Reports.City
StationionNames: invertible-fin-XXX238.Reports.Station
Most of the StationNames containing City Names. Now I want to extract the city from the Station Table.
Here some example data:
City: Berlin
Stationname: inStore_Berlin_Alexanderplatz
Stationname: Berlin Schönefeld Airport
Stationname: Train Station Franchise Berlin
I tried the INSTR Function, but had no success (the INSTR works only with Legacy SQL and there I couldn’t use SUBSELECTS).
SELECT City,
INSTR((SELECT AdGroupName
FROM [invertible-fin-XXX238.Reports.City]),City) AS Match
FROM [invertible-fin-XXX238.Reports.Station]
Therefore I tried it with WHERE LIKE. Below the SQL Code:
SELECT a.City
FROM [invertible-fin-XXX238.Reports.City] a
CROSS JOIN [invertible-fin-XXX238.Reports.Station] b
WHERE b. Name LIKE '%' + a.City + '%'
GROUP BY a.City
But now the Query is too computationally intensive and I got the Error Code “Query exceeded resource limits for tier 1. Tier 18 or higher required.” back.
Could some please help me, writing a more resource friendly query.
Thanks in advance,
Philipp

Below are few of many possible versions for BiigQuery Standard SQL
#standardSQL
SELECT city, station
FROM `invertible-fin-XXX238.Reports.Station` AS s
JOIN `invertible-fin-XXX238.Reports.City` AS c
ON REPLACE(LOWER(station), LOWER(city), '') <> LOWER(station)
or
#standardSQL
SELECT city, station
FROM `invertible-fin-XXX238.Reports.Station` AS s
JOIN `invertible-fin-XXX238.Reports.City` AS c
ON LOWER(station) LIKE CONCAT('%',LOWER(city),'%')
You can remove LOWER() function if names of City are spelled in same case in both tables
While above versions look more straightforward - i would prefer below one as it allows control way you extract city from station -r'([^ _]+)' - you should all characters that you observe being delimiters in column station. So in this case you will extract only city when it is not part of longer name
Of course you should validate if you even need to worry of this
#standardSQL
WITH tokens AS (
SELECT token, station
FROM `invertible-fin-XXX238.Reports.Station` AS s,
UNNEST(REGEXP_EXTRACT_ALL(LOWER(station), r'([^ _]+)')) token
)
SELECT city, station
FROM tokens AS s
JOIN `invertible-fin-XXX238.Reports.City` AS c
ON LOWER(city) = token

I also wonder how the performance for a sub-query would be in this case. For instance:
WITH City AS(
SELECT 'Berlin' As Name UNION ALL
SELECT 'Hamburg'
),
StationNames AS(
SELECT 'inStore_Berlin_Alexanderplatz' AS Name UNION ALL
SELECT 'Berlin Schönefeld Airport' UNION ALL
SELECT 'Train Station Franchise Berlin' UNION ALL
SELECT 'Train Station Hamburg' UNION ALL
SELECT 'Train Station Pluton'
)
SELECT
Name StationName,
(SELECT Name FROM City c WHERE LOWER(s.Name) LIKE CONCAT('%', LOWER(c.Name), '%')) city
FROM StationNames s
Or in your case:
SELECT
Name StationName,
(SELECT Name FROM `invertible-fin-XXX238.Reports.City` c WHERE LOWER(s.Name) LIKE CONCAT('%', LOWER(c.Name), '%')) city
FROM `invertible-fin-XXX238.Reports.Station` s
I know it's common sense for most databases that JOIN has better performance than sub-queries but BigQuery have lots of different optimization techniques for storing and querying data, I was curious to know how different the performance would be in this case.

Related

SQL query that will return results separated by commas

I has two tables on MS SQL Server. First with the cities, second with student names and city ID from first table, where this student lives.
I need a sql query that will return results like this:
Stella Paris
Bob Moscow,New York
Mary Paris,New York
WITH CITIES(ID,CITY_NAME)AS
(
SELECT 1,'MOSCOW' UNION ALL
SELECT 2,'PARIS' UNION ALL
SELECT 3,'NEW YORK'
),
STUDENTS(ID,STUDENT_NAME,LIVE_IN) AS
(
SELECT 1,'STELLA','2' UNION ALL
SELECT 2,'BOB','1,3' UNION ALL
SELECT 3,'MARY','2,3'
)
SELECT X.ID,X.STUDENT_NAME,STRING_AGG(X.CITY_NAME,',')CITY_NAME FROM
(
SELECT S.ID,S.STUDENT_NAME,VALUE AS CITY_ID,C.CITY_NAME
FROM STUDENTS AS S
CROSS APPLY string_split(S.LIVE_IN,',')
JOIN CITIES AS C ON VALUE=C.ID
)X
GROUP BY X.ID,X.STUDENT_NAME;
For SQL Server 2017 and later, as pointed by #Squirrel, you can use STRING_SPLIT and STRING_AGG
Steps you can use:-
Use two different queries
first returning students records
then loop through to get cities records
update the array data as required
A better suggestion would be redesign database so as to be used table JOIN

Grouping multiple rows into one string postgres for each position in select

I have a table education that has a column university. For each of the rows in the table I want to find 3 most similar universities from the table.
Here is my query that finds 3 most similar universities to a given input:
select distinct(university),
similarity(unaccent(lower(university)),
unaccent(lower('Boston university')))
from education
order by similarity(unaccent(lower(university)),
unaccent(lower('Boston university'))) desc
limit 3;
It works fine. But now I would like to modify this query so that I get two columns and a row for each existing university in the table: the first column would be the university name and the second would be the three most similar universities found in the database (or if its easier - four columns where the first is the university and the next 3 are the most similar ones).
What should this statement look like?
You could use an inline aggregated query:
with u as (select distinct university from education)
select
university,
(
select string_agg(u.university, ',')
from u
where u.university != e.university
order by similarity(
unaccent(lower(u.university)),
unaccent(lower(e.university))
) desc
limit 3
) similar_universities
from education e
This assumes that a given university may occur more than once in the education table (hence the need for a cte).
I think a lateral join and aggregation does what you want:
select e.university, e2.universities
from education e cross join lateral
(select array_agg(university order by sim desc) as universities
from (select e2.university,
similarity(unaccent(lower(e2.university)),
unaccent(lower(e.university))
) as sim
from education e2
order by sim desc
limit 3
) e2
) e2;
Note: The most similar university is probably the one with the same name. (You can filter that out with a where clause in the subquery.)
This returns the value as an array, because I prefer working with arrays rather than strings in Postgres.

Access SQL Group by with condition

I'm using MS Access for the following task (due to office restrictions). I'm quite new to SQL.
I have the following table:
I want to select all stores grouped by street, zip and place. But i only want to group them, if the SquareSum (after Group by) is < 1000. Rue de gare 2 should be grouped, while Bahnhofstrasse 23 should be seperate lines.
So far as i know MS Access doesn't allow a case statement. So my query looks like this:
SELECT
Street,
ZIP,
Place,
Sum(Square) AS SumSquare,
FROM Table1
SWITCH (SumSquare > 1000, GROUP BY (Street, ZIP, Place))
I also tried:
GROUP BY
SWITCH (SumSquare > 1000, (Street, ZIP, Place))
But it keeps telling me i have a syntax error. Could someone please help me?
In Access, I would do this with several queries.
This would be easier to do if you had an id on the rows (such as an autonumber).
First query identifies the streets that should be summed.
query: SumTheseStreets
SELECT
Street,
ZIP,
Place,
Sum(Square) AS SumSquare
FROM Table1
GROUP BY Street, ZIP, Place
HAVING sum(Square) < 1000
Note the HAVING which is a bit like a WHERE clause that's applied outside of the GROUP BY or SUM
Second query identifies the other rows (notes on this one below):
query: StreetsNotSummed
SELECT
Street,
ZIP,
Place,
Square AS SumSquare
FROM Table1
LEFT JOIN SumTheseStreets ON Table1.Street = SumTheseStreets.Street AND Table1.ZIP = SUmTheseStreets.ZIP AND Table1.Place = SumTheseStreets.Place
WHERE SumTheseStreets.Street IS NULL;
A couple of notes:
I've called the field SumSquare because I want it to be the same name as the SumSquare field in the first query
It uses the first query as one of the input "tables"
This uses a LEFT JOIN which means "give me all of the rows in the first table (table1) and if any rows in the second table (SumTheseStreets) match, put those in as well.
but then it filters out the rows that DO match.
So this query only lists the streets that you want NOT summed.
So now you need a third query.
This simply includes all of the rows in both of those queries.
I'm not too sure on the Access syntax on this one, but there's a union query wizard if this isn't right.
Query: TheAnswerRequired
SELECT
Street,
ZIP,
Place,
SumSquare
FROM SumTheseStreets
UNION
SELECT
Street,
ZIP,
Place,
SumSquare
FROM StreetsNotSummed
(it might need to be UNION ALL)
Good luck.
You can use UNION ALL:
SELECT ts.*
FROM (SELECT Street, Zip, Place, SUM(Square) as SumSquare
FROM Table1
GROUP BY Street, Zip, Place
) as ts
WHERE ts.SumSquare < 1000
UNION ALL
SELECT t1.*
FROM Table1 as t1 INNER JOIN
(SELECT Street, Zip, Place, SUM(Square) as SumSquare
FROM Table1
GROUP BY Street, Zip, Place
) as ts
ON t1.Street = ts.Street AND t1.Zip = ts.Zip and t1.Place = ts.Place
WHERE ts.SumSquare >= 1000

Need to arrange employee names as per their city column wise

I have written a query which extracts the data from different columns group by city name.
My query is as follows:
select q.first_name
from (select employee_id as eid,first_name,city
from employees
group by city,first_name,employee_id
order by first_name)q
, employees e
where e.employee_id = q.eid;
The output of the query is employee names in a single column grouped by their cities.
Now I would like to enhance the above query to classify the employees by their city names in different columns.
I tried using pivot to make this work. Here is my pivot query:
select * from (
select q.first_name
from (select employee_id as eid,first_name,city
from employees
group by city,first_name,employee_id
order by first_name)q
, employees e
where e.employee_id = q.eid
) pivot
(for city in (select city from employees))
I get some syntax issue saying missing expression and I am not sure how to use pivot to achieve the below expected output.
Expected Output:
DFW CH NY
---- --- ---
TripeH John Hitman
Batista Cena Yokozuna
Rock James Mysterio
Appreciate if anyone can guide me in the right direction.
Unfortunately what you are trying to do is not possible, at least not in "straight" SQL - you would need dynamic SQL, or a two-step process (in the first step generating a string that is a new SQL statement). Complicated.
The problem is that you are not including a fixed list of city names (as string literals). You are trying to create columns based on whatever you get from (select city from employees). Thus the number of columns and the name of the columns is not known until the Oracle engine reads the data from the table, but before the engine starts it must already know what all the columns will be. Contradiction.
Note also that if this was possible, you almost surely would want (select distinct city from employees).
ADDED: The OP asks a follow-up question in a comment (see below).
The ideal arrangement is for the cities to be in their own, smaller table, and the "city" in the employees table to have a foreign key constraint so that the "city" thing is manageable. You don't want one HR clerk to enter New York, another to enter New York City and a third to enter NYC for the same city. One way or the other, first try your code by replacing the subquery that follows the operator IN in the pivot clause with simply the comma-separated list of string literals for the cities: ... IN ('DFW', 'CH', 'NY'). Note that the order in which you put them in this list will be the order of the columns in the output. I didn't check the entire query to see if there are any other issues; try this and let us know what happens.
Good luck!
select
(CASE WHEN CITY="DFW" THEN EMPLOYEE_NAME END) DFW,
(CASE WHEN CITY="CH" THEN EMPLOYEE_NAME END) CH,
(CASE WHEN CITY="NY" THEN EMPLOYEE_NAME END) NY
FROM employees
order by first_name
Maybe you need to transpose your result. See this link . I think DECODE or CASE works best for your case:
select
(CASE WHEN CITY="DFW" THEN EMPLOYEE_NAME END) DFW,
(CASE WHEN CITY="CH" THEN EMPLOYEE_NAME END) CH,
(CASE WHEN CITY="NY" THEN EMPLOYEE_NAME END) NY
FROM employees
order by first_name
Normally I would "edit" my first answer, but the question has changed so much, it's quite different from the original one so my older answer can't be "edited" - this now needs a completely new answer.
You can do what you want with pivoting, as I show below. Wondering why you want to do this in basic SQL and not by using reporting tools, which are written specifically for reporting needs. There's no way you need to keep your data in the pivoted format in the database.
You will see 'York' twice in the Chicago column; you will recognize that's on purpose (you will see I had a duplicate row in the "test" table at the top of my code); this is to demonstrate a possible defect of your arrangement.
Before you ask if you could get the list but without the row numbers - first, if you are simply generating a set of rows, those are not ordered. If you want things ordered for reporting purposes, you can do what I did, and then select "'DFW'", "'CHI'", "'NY'" from the query I wrote. Relational theory and the SQL standard do not guarantee the row order will be preserved, but Oracle apparently does preserve it, at least in current versions; you can use that solution at your own risk.
max(name) in the pivot clause may look odd to the uninitiated; one of the weird limitations of the PIVOT operator in Oracle is that it requires an aggregate function to be used, even if it's over a set of exactly one element.
Here's the code:
with t (city, name) as -- setting up input data for testing
(
select 'DFW', 'Smith' from dual union all
select 'CHI', 'York' from dual union all
select 'DFW', 'Matsumoto' from dual union all
select 'NY', 'Abu Osman' from dual union all
select 'DFW', 'Adams' from dual union all
select 'CHI', 'Wilson' from dual union all
select 'CHI', 'Arenas' from dual union all
select 'NY', 'Theodore' from dual union all
select 'CHI', 'McGhee' from dual union all
select 'NY', 'Zhou' from dual union all
select 'NY' , 'Simpson' from dual union all
select 'CHI', 'Narayanan' from dual union all
select 'CHI', 'York' from dual union all
select 'NY', 'Perez' from dual
)
select * from
(
select row_number() over (partition by city order by name) rn,
city, name
from t
)
pivot (max(name) for city in ('DFW', 'CHI', 'NY') )
order by rn
/
And the output:
RN 'DFW' 'CHI' 'NY'
---------- --------- --------- ---------
1 Adams Arenas Abu Osman
2 Matsumoto McGhee Perez
3 Smith Narayanan Simpson
4 Wilson Theodore
5 York Zhou
6 York
6 rows selected.

Oracle Query - Use of Analytical functions

Assume we have loaded a flat file with patient diagnosis data into a table called “Data”. The table structure is:
Create table Data (
Firstname varchar(50),
Lastname varchar(50),
Date_of_birth datetime,
Medical_record_number varchar(20),
Diagnosis_date datetime,
Diagnosis_code varchar(20))
The data in the flat file looks like this:
'jane','jones','2/2/2001','MRN-11111','3/3/2009','diabetes'
'jane','jones','2/2/2001','MRN-11111','1/3/2009','asthma'
'jane','jones','5/5/1975','MRN-88888','2/17/2009','flu'
'tom','smith','4/12/2002','MRN-22222','3/3/2009','diabetes'
'tom','smith','4/12/2002','MRN-33333','1/3/2009','asthma'
'tom','smith','4/12/2002','MRN-33333','2/7/2009','asthma'
'jack','thomas','8/10/1991','MRN-44444','3/7/2009','asthma'
You can assume that no two patients have the same firstname, lastname, and date of birth combination. However one patient might have several visits on different days. These should all have the same medical record number.
The problem is this: Tom Smith has 2 different medical record numbers. Write a query that would always show all the patients
who are like Tom Smith – patients with more than one medical record number.
I came up with below query. It works perfectly fine, but wanted to know if there is a better way to write this query using Oracle Analytical function's. Thank you in advance
SELECT a.firstname,
a.lastname,
a.date_of_birth,
a.medical_record_number
FROM data a, data b
WHERE a.firstname = b.firstname
AND a.lastname = b.lastname
AND a.date_of_birth = b.date_of_birth
AND a.medical_record_number <> .medical_record_number
GROUP BY a.firstname,
a.lastname,
a.date_of_birth,
a.medical_record_number
It is possible to do via analytic functions, but whether it's faster than doing the join in your query* or not depends on what data you have. You'd need to test.
with data (firstname, lastname, date_of_birth, medical_record_number, diagnosis_date, diagnosis_code)
as (select 'jane','jones','2/2/2001','MRN-11111',to_date('3/3/2009', 'mm/dd/yyyy'),'diabetes' from dual union all
select 'jane','jones','2/2/2001','MRN-11111',to_date('1/3/2009', 'mm/dd/yyyy'),'asthma' from dual union all
select 'jane','jones','5/5/1975','MRN-88888',to_date('2/17/2009', 'mm/dd/yyyy'),'flu' from dual union all
select 'tom','smith','4/12/2002','MRN-22222',to_date('3/3/2009', 'mm/dd/yyyy'),'diabetes' from dual union all
select 'tom','smith','4/12/2002','MRN-33333',to_date('1/3/2009', 'mm/dd/yyyy'),'asthma' from dual union all
select 'tom','smith','4/12/2002','MRN-33333',to_date('2/7/2009', 'mm/dd/yyyy'),'asthma' from dual union all
select 'jack','thomas','8/10/1991','MRN-44444',to_date('3/7/2009', 'mm/dd/yyyy'),'asthma' from dual),
-- end of mimicking your table and its data
res as (select firstname,
lastname,
date_of_birth,
medical_record_number,
count(distinct medical_record_number) over (partition by firstname, lastname, date_of_birth) cnt_med_rec_nums
from data)
select distinct firstname,
lastname,
date_of_birth,
medical_record_number
from res
where cnt_med_rec_nums > 1;
*btw, the group by in your example query is not necessary; it would make much more sense to switch it out for a distinct - it makes your intent much clearer, since you're wanting to get a distinct set of records.
You can probably simplify the query a bit using a HAVING clause rather than doing a self-join
SELECT a.firstname,
a.lastname,
a.date_of_birth,
MIN(a.medical_record_number) lowest_medical_record_number,
MAX(a.medical_record_number) highest_medical_record_number
FROM data a
GROUP BY a.firstname,
a.lastname,
a.date_of_birth
HAVING COUNT( DISTINCT a.medical_record_number ) > 1
I'm returning the smallest and largest medical record number for each patient here (that's what I'd do if most of the patients with this problem have just two numbers rather than having dozens). You could return just one or you could return a comma-separated list of all the medical record numbers if you'd rather (which would probably make more sense if most of the bad folks have dozens of numbers).