Output different query results in same table - google-bigquery

After the recent update, Google BigQuery now allows querying country-specific tables. I wanted to find the number of origins (websites) in the US table containing the word 'space' and display it side by side with a similar result from the Japan table. The query I'm making is:
WITH
query_1 as
(select distinct origin as japan
from `chrome-ux-report.country_jp.201712` where
origin like "%space%"),
query_2 as
(select distinct origin as usa
from `chrome-ux-report.country_us.201712`
where origin like "%space%" )
SELECT japan,usa from query_1,query_2
But it results in a table with multiple repetitions of the same origin in both the japan and usa columns. Another strange thing is that the output table contains the same number of rows for japan and usa, whereas clearly the number of sites containing the word 'space' is not the same in the two tables. I'm using standard SQL, not legacy.
Any help is appreciated. Thanks.
Note: by side by side, I mean there will be two columns, the japan column displaying sites for japan and the usa column displaying results for usa.

In BigQuery Standard SQL (which you are using in your query), a comma between tables in the FROM clause means CROSS JOIN. This explains why the result has multiple repetitions of the same origin in both the japan and usa columns, and why both columns have the same number of rows: every Japan row is paired with every USA row.
Depending on how exactly you want your result to look, you can construct your query in many different ways. For example:
WITH
query_1 AS
(SELECT DISTINCT origin AS japan
FROM `chrome-ux-report.country_jp.201712` WHERE
origin LIKE "%space%"),
query_2 AS
(SELECT DISTINCT origin AS usa
FROM `chrome-ux-report.country_us.201712`
WHERE origin LIKE "%space%" )
SELECT
ARRAY(SELECT japan FROM query_1) AS japan,
ARRAY(SELECT usa FROM query_2) AS usa
Also you can check counts as below
WITH
query_1 AS
(SELECT DISTINCT origin AS japan
FROM `chrome-ux-report.country_jp.201712` WHERE
origin LIKE "%space%"),
query_2 AS
(SELECT DISTINCT origin AS usa
FROM `chrome-ux-report.country_us.201712`
WHERE origin LIKE "%space%" )
SELECT
ARRAY_LENGTH(ARRAY(SELECT japan FROM query_1)) AS japan_count,
ARRAY_LENGTH(ARRAY(SELECT usa FROM query_2)) AS usa_count
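To see the cross-join effect on a small scale, here is a minimal sketch using Python's built-in sqlite3 module rather than BigQuery. The tables jp and us and their rows are made up for illustration; only the comma-join behaviour and the side-by-side counts mirror the queries above.

```python
import sqlite3

# Hypothetical stand-ins for the two country tables (not the real CrUX data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jp (origin TEXT);
    CREATE TABLE us (origin TEXT);
    INSERT INTO jp VALUES ('https://space.example.jp'),
                          ('https://myspace.example.jp');
    INSERT INTO us VALUES ('https://space.example.com'),
                          ('https://spacex.example.com'),
                          ('https://aerospace.example.com');
""")

# A comma between tables is a CROSS JOIN: every jp row pairs with every us row.
cross_rows = conn.execute("SELECT jp.origin, us.origin FROM jp, us").fetchall()
print(len(cross_rows))  # 2 rows x 3 rows

# Independent scalar subqueries give one row with the two counts side by side.
counts = conn.execute("""
    SELECT (SELECT COUNT(DISTINCT origin) FROM jp
            WHERE origin LIKE '%space%') AS japan_count,
           (SELECT COUNT(DISTINCT origin) FROM us
            WHERE origin LIKE '%space%') AS usa_count
""").fetchone()
print(counts)
```

The cross join yields the product of the two row counts, which is why both columns in the original output had the same length.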

Related

When a statement contains an item in a list, show it in a new column

I would appreciate a little help with an SQL script. I have a list like the one below and a database table, Table1, with Statement as a column name. I would like to create a column called location: the script should search the Statement column, and when it finds any of the items from the list in a row, it should put that item in the location column.
(Tema, london, Sydney, Germany, China, Africa)
Statement
-------------------
Going to london
Apples in Tema
Sydney is a city
China is a country
Africa is a continent
In the end I hope to see a table like this:
Statement              location
---------------------  --------
Going to london        London
Apples in Tema         Tema
Sydney is a city       Sydney
china is a country     China
Africa is a continent  Africa
By using this script,
SELECT Statement,
Case
WHEN Statement::text ~~* '%london%'::character varying::text
THEN 'london'::character varying
ELSE NULL::character varying
END AS location
FROM Table1
I think I would have to write a very tall script, but I was wondering if I could get help with something efficient and quite simple to achieve this
If you have a list of places, you can use that:
select t1.*, v.place
from table1 t1 join
(values ('tema'), ('london'), ('sydney'), ('germany'), ('china'), ('africa')
) v(place)
on t1.Statement::text ilike '%' || v.place || '%';
(Use left join instead if statements that match no place should still appear, with a NULL place.)
Note: You might want to use regular expressions so you can include word boundaries, but your example code doesn't do this.
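The join-against-a-list idea can be tried out with a minimal sketch in Python's sqlite3 module. This is SQLite, not PostgreSQL, so it relies on SQLite's LIKE being case-insensitive for ASCII instead of ilike, and the extra 'Nothing to see here' row is made up to show the no-match case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (statement TEXT);
    INSERT INTO table1 VALUES
        ('Going to london'), ('Apples in Tema'), ('Sydney is a city'),
        ('China is a country'), ('Africa is a continent'),
        ('Nothing to see here');  -- hypothetical row with no place in it
""")

# The list of places lives in a CTE; LEFT JOIN keeps rows with no match.
rows = conn.execute("""
    WITH places(place) AS (
        VALUES ('Tema'), ('London'), ('Sydney'),
               ('Germany'), ('China'), ('Africa')
    )
    SELECT t1.statement, p.place AS location
    FROM table1 AS t1
    LEFT JOIN places AS p
           ON t1.statement LIKE '%' || p.place || '%'
    ORDER BY t1.rowid
""").fetchall()

for statement, location in rows:
    print(statement, "->", location)
```

Adding a place then means adding one value to the list rather than another WHEN branch to a CASE expression.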

How does SQL count(distinct) work in this case?

I'm trying to find the match no in which Germany played against Poland. This is from https://www.w3resource.com/sql-exercises/soccer-database-exercise/sql-subqueries-exercise-soccer-database-4.php. There are two tables : match_details and soccer_country. I don't understand how the count(distinct) works in this case. Can someone please clarify? Thanks!
SELECT match_no
FROM match_details
WHERE team_id = (
SELECT country_id
FROM soccer_country
WHERE country_name = 'Germany')
OR team_id = (
SELECT country_id
FROM soccer_country
WHERE country_name = 'Poland')
GROUP BY match_no
HAVING COUNT(DISTINCT team_id) = 2;
As Lamak mentioned, this is an ugly way to write the query, but there are many ways to approach it.
As mentioned, counting DISTINCT team_id makes sure that two different teams, not just two rows, are present for the match. If there is ever a Cartesian result or duplicated data, you could get multiple rows for the same team in a match. Counting distinct TEAM_ID values eliminates that.
That said, other "team" data structures I have seen have a single record per match with a column for EACH TEAM playing the match. That is easier by a long shot, but this is still a relatively easy query.
Break the query down a little, and consider a large-scale set of data (not that this, or even any professional league, would have record counts large enough to slow down a SQL engine).
Your first criterion is games with Germany. So let's start with that.
SELECT
md1.match_no
FROM
match_details md1
JOIN soccer_country sc1
on md1.team_id = sc1.country_id
AND sc1.country_name = 'Germany'
So, why even look at any other match if Germany is not part of it on either side? This query by itself would return 6 matches from the sample data of 51 matches. Now all you need to do is join to the match_details table a second time for only those matches, requiring that the second team is Poland.
SELECT
md1.match_no
FROM
match_details md1
JOIN soccer_country sc1
on md1.team_id = sc1.country_id
AND sc1.country_name = 'Germany'
-- joining again for the same match Germany was already qualified
JOIN match_details md2
on md1.match_no = md2.match_no
-- but we want the OTHER team record since Germany was first team
and md1.team_id != md2.team_id
-- and on to the second country table based on the SECOND team ID
JOIN soccer_country sc2
on md2.team_id = sc2.country_id
-- and the second team was Poland
AND sc2.country_name = 'Poland'
Yes, it may be a longer query, but by eliminating 45 other matches up front (again, thinking of a LARGE database), you avoid blowing through tons of data and work on a very finite set, finishing with only the Germany / Poland matches. No aggregates, counts, or distincts, just direct joins.
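The double join can be checked on a tiny data set. This is a minimal sketch using Python's sqlite3 module; the country IDs and match numbers are made up and are not the w3resource sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE soccer_country (country_id INTEGER, country_name TEXT);
    CREATE TABLE match_details (match_no INTEGER, team_id INTEGER);
    INSERT INTO soccer_country VALUES (1, 'Germany'), (2, 'Poland'), (3, 'France');
    INSERT INTO match_details VALUES
        (10, 1), (10, 3),   -- Germany v France
        (11, 1), (11, 2),   -- Germany v Poland
        (12, 2), (12, 3);   -- Poland v France
""")

# md1 is pinned to Germany's row; md2 is the other row of the same match,
# which must belong to Poland.
self_join = conn.execute("""
    SELECT md1.match_no
    FROM match_details AS md1
    JOIN soccer_country AS sc1
      ON md1.team_id = sc1.country_id AND sc1.country_name = 'Germany'
    JOIN match_details AS md2
      ON md1.match_no = md2.match_no AND md1.team_id != md2.team_id
    JOIN soccer_country AS sc2
      ON md2.team_id = sc2.country_id AND sc2.country_name = 'Poland'
""").fetchall()
print(self_join)
```

Only the match in which both Germany and Poland have a row survives all four joins.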
FEEDBACK
Let's take a look at some BAD sample data... which, as all programmers know, never exists (NOT). Anyhow, let's take a look at these few matches (team names are shown in place of team IDs for simplicity).
Match  Team ID
52     Poland
52     Poland
53     Germany
53     Germany
If you were to run the query with COUNT(team_id) instead of COUNT(DISTINCT team_id), both match 52 and match 53 would show up, as Poland appears 2 times for match 52 and Germany appears 2 times for match 53. With DISTINCT, each of these matches counts only 1 team, so both are excluded. Does that help? Again, no such thing as bad data :)
And yet another sample, where a match has more than 2 teams or only 1:
Match  Team ID
54     France
54     Poland
54     England
55     Hungary
56     Austria
None of these matches would be returned. Match 54 has 3 distinct teams, but only Poland passes the WHERE filter, so its distinct count is 1. Matches 55 and 56 each have a single entry, so there is no opponent to compete against, and neither team is Germany or Poland.
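The effect of DISTINCT on the bad data above can be reproduced with a minimal sketch in Python's sqlite3 module. The team IDs are made up, match 11 is a hypothetical clean Germany v Poland match, and the two OR-ed subqueries of the original are folded into one IN subquery for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE soccer_country (country_id INTEGER, country_name TEXT);
    CREATE TABLE match_details (match_no INTEGER, team_id INTEGER);
    INSERT INTO soccer_country VALUES
        (1, 'Germany'), (2, 'Poland'), (3, 'France'), (4, 'England');
    INSERT INTO match_details VALUES
        (11, 1), (11, 2),            -- clean match: Germany v Poland
        (52, 2), (52, 2),            -- bad data: Poland twice
        (53, 1), (53, 1),            -- bad data: Germany twice
        (54, 3), (54, 2), (54, 4);   -- bad data: three teams
""")

query = """
    SELECT match_no
    FROM match_details
    WHERE team_id IN (SELECT country_id FROM soccer_country
                      WHERE country_name IN ('Germany', 'Poland'))
    GROUP BY match_no
    HAVING COUNT({expr}) = 2
    ORDER BY match_no
"""
# Counting rows lets the duplicated matches through; counting distinct
# team IDs keeps only the match with both teams.
plain = [r[0] for r in conn.execute(query.format(expr="team_id"))]
distinct = [r[0] for r in conn.execute(query.format(expr="DISTINCT team_id"))]
print(plain, distinct)
```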
2nd FEEDBACK
To clarify the query: if you look at the short query for just Germany, the aliased instance "md1" is already sitting on a Germany record for any given match. For the second join, to "md2", I only care about the same match, so I join on the same match_no. However, the "!=" in the "md2" join means NOT EQUAL ("!" is logical NOT). So the join says: from md1, join to md2 on the same match number, but only give me rows where the teams are NOT the same. The first instance holds Germany's team ID (already qualified), so md2 gives me the other team's ID. Now I can use the md2 team ID to join to the country table and confirm that it is Poland.
Does this now clarify things for you?

Needing Clarity on SQL Join Query

Having some trouble understanding this query, particularly the WHERE in the subquery. I don't really get what it is accomplishing. Any help would be appreciated. Thanks
# Find the largest country (by area) in each continent. Show the continent,
# name, and area.
SELECT continent, name, area
FROM countries AS a
WHERE area = (
SELECT MAX(area)
FROM countries AS b
WHERE a.continent = b.continent
)
Consider the following subset of the countries data:
Continent      Country  Area
North America  USA      3718691
North America  Canada   3855081
North America  Mexico   761602
Europe         France   211208
Europe         Germany  137846
Europe         UK       94525
Europe         Italy    116305
This is a correlated subquery that behaves as follows:
1) Reads the first row returned by the outer query (North America, USA, 3718691).
2) Runs the subquery, which is correlated on a.continent (North America), and returns 3855081, the maximum area in North America.
3) Evaluates the WHERE equality, checking whether 3855081 matches the area on the row we're working on.
4) It doesn't match, so the next row in the outer query is read and we start over at step 1, this time working on the second row.
5) Repeats for all rows in the outer query.
When we're looking at rows 2 and 4, the check in step 3 will match, so those rows will be returned by the query.
You can check the results by using this data in your countries table and running the query.
Note that this can be a poor way to determine the country with the maximum area per continent because, as written, it logically repeats the subquery for every country (although some query optimizers may rewrite it). Using my sample data, it determines the maximum area for North America 3 times and the maximum area for Europe 4 times.
Since you asked in your comment, I would write this query as follows:
SELECT a.continent, a.name, a.area
FROM countries AS a
inner join (select continent, max(area) max_area
from countries
group by continent) as b on a.continent = b.continent
WHERE a.area = b.max_area
In this version of the query, the maximum for each continent is only determined once. The original query was written to illustrate correlated queries and it's important to understand them. Correlated queries can often be used to resolve complex logic.
The subquery is finding the maximum area for countries. Which countries? All countries that match the continent of the country in the outer query.
So, for each country it gets the area of the largest country on the same continent.
The WHERE clause then says "are the two areas the same -- the maximum area and the area of this country?". It chooses only countries that have the maximum area.
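Both forms of the query discussed above can be run against the sample rows. This is a minimal sketch using Python's sqlite3 module; the table layout follows the sample data, and the ORDER BY is added only to make the comparison deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (continent TEXT, name TEXT, area INTEGER);
    INSERT INTO countries VALUES
        ('North America', 'USA',     3718691),
        ('North America', 'Canada',  3855081),
        ('North America', 'Mexico',   761602),
        ('Europe',        'France',   211208),
        ('Europe',        'Germany',  137846),
        ('Europe',        'UK',        94525),
        ('Europe',        'Italy',    116305);
""")

# Correlated form: the subquery is evaluated per outer row's continent.
correlated = conn.execute("""
    SELECT continent, name, area
    FROM countries AS a
    WHERE area = (SELECT MAX(area) FROM countries AS b
                  WHERE a.continent = b.continent)
    ORDER BY continent
""").fetchall()

# Join form: the per-continent maximum is computed once in a derived table.
joined = conn.execute("""
    SELECT a.continent, a.name, a.area
    FROM countries AS a
    INNER JOIN (SELECT continent, MAX(area) AS max_area
                FROM countries
                GROUP BY continent) AS b
            ON a.continent = b.continent
    WHERE a.area = b.max_area
    ORDER BY a.continent
""").fetchall()
print(correlated == joined, correlated)
```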

Oracle REGEXP_SUBSTR for string matching b/w two columns

The problem
Users are frequently inputting "country name" strings into the "city name" field. Heuristically, this appears to be an extremely common practice. For example, a user might put "TAIPEI TAIWAN" in the city name when only "TAIPEI" should be input and then the country would be "TAIWAN". I am working to aggregate these instances for this specific field (your help will allow me to expand this to other columns and tables) and then identify where possible rankings associated with strictly the "country" names in the "city" field.
I have two tables that I am attempting to leverage to track down data validation issues. Tbl1 is named "Customer_Address" comprised of geographic columns like (Customer_Num, Address, City_Name, State, Country_Code, Zipcode). Tbl2 named "HR_Countries" is clean table of 2-digit ISO country codes with their corresponding name values (Lebanon, Taiwan, China, Syria, Russia, Ukraine, etc) and some other fields not presently used.
The initial step is to query "Customer_Address" to find City_Names LIKE a series of OR statements (LIKE '%CHINA', OR LIKE 'TAIWAN', OR etc etc) and count the number of occurrences where the City_Name is like the designated country_name string I passed it and the results are pretty good. I've coded in some exclusions to deal with things like "Lebanon, OH" so my overall results are satisfactory for the first phase.
Part of the query does a LEFT join from Tbl1 to Tbl2 to add the risk rating from tbl2 as a result of the query against tbl1:
LEFT JOIN tbl2 risk
ON INSTR(addr.CITY_NM, risk.COUNTRY_NAME,1) <> 0
Example of Tbl1 Data Output (head(tbl1), n=7)
CountryNameInCity   CountOfOccurences  RR
China               15                 High
Taiwan              2000               Medium
Japan               250                Low
Taipei, Taiwan      25                 NULL
Kabul, Afghanistan  10                 NULL
Shenzen China       100                NULL
Afghanistan         52                 Very High
Example of Tbl2 Data (head(tbl2), n=6)
CountryName  CountryCode  RR
China        CN           High
Taiwan       TW           High
Iraq         IQ           Very High
Cuba         CU           Medium
Lebanon      LB           Very High
Greece       GR           High
So my question(s) are as follows:
1) Instead of manually passing in a series of OR statements for country codes, is there a better way to use Tbl2 as the matching "LIKE" that drives the query?
2) Can you recommend a better way of comparing the output of the query (see Tbl1 example) and ensuring that multiple strings (Taipei, Taiwan, etc) are appropriately aggregated and bring back the correct 'RR' rating?
Thanks for taking the time to review this and respond.
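One possible direction for question 1, offered only as a sketch: let the join on INSTR do the matching, so every row of the country table acts as a pattern, and group by the country name so that variants such as 'TAIPEI TAIWAN' and 'TAIWAN' roll up together. The example below runs in SQLite through Python's sqlite3 module, not Oracle, and the table names, column names and rows are made up. It also does not handle exclusions such as "Lebanon, OH".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, cut-down versions of the two tables.
    CREATE TABLE customer_address (customer_num INTEGER, city_nm TEXT);
    CREATE TABLE hr_countries (country_name TEXT, country_code TEXT, rr TEXT);
    INSERT INTO customer_address VALUES
        (1, 'TAIPEI TAIWAN'), (2, 'TAIWAN'),
        (3, 'SHENZHEN CHINA'), (4, 'CHICAGO');
    INSERT INTO hr_countries VALUES
        ('China', 'CN', 'High'), ('Taiwan', 'TW', 'High'),
        ('Cuba', 'CU', 'Medium');
""")

# Each country name is searched for inside the city field; UPPER on both
# sides makes the substring test case-insensitive.
matches = conn.execute("""
    SELECT risk.country_name, risk.rr, COUNT(*) AS occurrences
    FROM customer_address AS addr
    JOIN hr_countries AS risk
      ON INSTR(UPPER(addr.city_nm), UPPER(risk.country_name)) > 0
    GROUP BY risk.country_name, risk.rr
    ORDER BY risk.country_name
""").fetchall()
print(matches)
```

Grouping by the matched country rather than by the raw city string is what brings the rating back for the combined city-and-country values.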

SQL Selecting distinct rows from multiple columns based on max value in one column

This is my SQL View - let's call it MyView:
ECode   SHCode  TotalNrShare  CountryCode  Country
000001  +00010  100           UKI          United Kingdom
000001  ABENSO  900           USA          United States
000355  +00012  1000          ESP          Spain
000355  000010  50            FRA          France
000042  009999  10            GER          Germany
000042  +00012  999           ESP          Spain
000787  ABENSO  500           USA          United States
000787  000150  500           ITA          Italy
001010  009999  100           GER          Germany
I would like to return the single row with the highest number in the column TotalNrShare for each ECode.
For example, I’d like to return these results from the above view:
ECode   SHCode  TotalNrShare  CountryCode  Country
000001  ABENSO  900           USA          United States
000355  +00012  1000          ESP          Spain
000042  +00012  999           ESP          Spain
000787  ABENSO  500           USA          United States
001010  009999  100           GER          Germany
(Note: in the case of ECode 000787, where there are two SHCodes with 500 each, we can just return the first row rather than both, since they are the same amount. It isn't important to me which row is returned, since this will happen very rarely and my analysis doesn't need to be 100% accurate.)
I've tried various things but do not seem to be able to return either unique results or the additional country code/country info that I need.
This is one of my attempts (based on other solutions on this site, but I am doing something wrong):
SELECT tsh.ECode, tsh.SHCode, tsh.TotalNrShare, tsh.CountryCode, tsh.Country
FROM dbo.MyView AS tsh INNER JOIN
(SELECT DISTINCT ECode, MAX(TotalNrShare) AS MaxTotalSH
FROM dbo.MyView
GROUP BY ECode) AS groupedtsh ON tsh.ECode = groupedtsh.ECode AND tsh.TotalNrShare = groupedtsh.MaxTotalSH
WITH
sequenced_data AS
(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY ECode ORDER BY TotalNrShare DESC) AS sequence_id
FROM
myView
)
SELECT
*
FROM
sequenced_data
WHERE
sequence_id = 1
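This should give the same results as your example query, except that it returns a single row per ECode when two rows tie for the maximum (the join on MAX() returns both). It's simply a different approach to accomplish the same thing.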
As you say that something is wrong, however, please could you elaborate on what is going wrong? Is TotalNrShare actually a string for example? And is that messing up your ordering (and so the MAX())?
EDIT:
Even if the above code was not compatible with your SQL Server, it shouldn't crash it out completely. You should just get an error message. Try executing Select * By Magic, for example, and it should just give an error. I strongly suggest getting your installation of Management Studio looked at and/or re-installed.
In terms of an alternative, you could do this...
SELECT
*
FROM
(SELECT ECode FROM MyView GROUP BY ECode) AS base
CROSS APPLY
(SELECT TOP 1 * FROM MyView WHERE ECode = base.ECode ORDER BY TotalNrShare DESC) AS data
Ideally you would replace the base sub-query with a table that already has a distinct list of all the ECodes that you are interested in.
Try this:
with cte as (
SELECT tsh.ECode, tsh.SHCode, tsh.TotalNrShare, tsh.CountryCode, tsh.Country,
ROW_NUMBER() over (partition by tsh.ECode order by tsh.TotalNrShare desc) as row_num
FROM dbo.MyView AS tsh)
select * from cte where row_num = 1
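The ROW_NUMBER() approach can be tried on the rows from the question. This is a minimal sketch using Python's sqlite3 module instead of SQL Server, assuming an SQLite build with window functions (3.25 or later). MyView is created as a plain table here, and the SHCode tie-breaker is an addition so that the tied ECode 000787 gives a repeatable result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MyView (ECode TEXT, SHCode TEXT, TotalNrShare INTEGER,
                         CountryCode TEXT, Country TEXT);
    INSERT INTO MyView VALUES
        ('000001', '+00010',  100, 'UKI', 'United Kingdom'),
        ('000001', 'ABENSO',  900, 'USA', 'United States'),
        ('000355', '+00012', 1000, 'ESP', 'Spain'),
        ('000355', '000010',   50, 'FRA', 'France'),
        ('000042', '009999',   10, 'GER', 'Germany'),
        ('000042', '+00012',  999, 'ESP', 'Spain'),
        ('000787', 'ABENSO',  500, 'USA', 'United States'),
        ('000787', '000150',  500, 'ITA', 'Italy'),
        ('001010', '009999',  100, 'GER', 'Germany');
""")

# Number the rows of each ECode from the largest TotalNrShare downwards
# and keep only the first one.
top_rows = conn.execute("""
    WITH sequenced_data AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY ECode
                   ORDER BY TotalNrShare DESC, SHCode  -- SHCode only breaks ties
               ) AS sequence_id
        FROM MyView
    )
    SELECT ECode, SHCode, TotalNrShare, CountryCode, Country
    FROM sequenced_data
    WHERE sequence_id = 1
    ORDER BY ECode
""").fetchall()

for row in top_rows:
    print(row)
```

The DESC in the window's ORDER BY is what selects the highest TotalNrShare; with ascending order the same query returns the lowest row per ECode.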