Consider a relation that contained the names and number of locations of restaurants including split and stand alone restaurants:
RESTAURANT: NUM_OF_LOC
Pizza Hut 1
Pizza Hut/Taco Bell 2
Taco Bell 2
Also assume you do not know in advance the restaurant names, whether a restaurant is stand alone or split, or the number of locations. The only consistent piece is the "/" character between the names in a split restaurant.
How can I return the above table with the number of stand alone locations summed into the number of split locations, sorted in descending order, like so:
RESTAURANT: NUM_OF_LOC
Pizza Hut/Taco Bell 5
Taco Bell 2
Pizza Hut 1
So are you looking to get the count of all restaurants for just Taco Bell and Pizza Hut where a joint counts as 1 for each or are you looking to count all occurrences of each variant?
I'm thinking you aren't just looking for totals and are looking to tear apart the combined restaurants so you can do something like
SELECT name, count(*)
FROM restaurants
WHERE CONTAINS (name, 'Taco Bell')
GROUP BY name
Looking again, it seems like you want the consolidated groups to include all occurrences of either, which would be something like:
CREATE TABLE sums AS
SELECT name, count(*) AS num_of_loc
FROM restaurants
WHERE CONTAINS (name, 'Taco Bell')
OR CONTAINS (name, 'Pizza Hut')
GROUP BY name
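One possible way to get the exact table from the question is to add, to each split row, the counts of the stand alone rows whose names appear among its "/"-separated parts. This is only a minimal sketch under assumptions: it runs the SQL through Python's built-in sqlite3 module so it can be tried as-is, and the table and column names (restaurants, restaurant, num_of_loc) are guesses based on the sample data. The instr() and || string functions are SQLite's; other engines would need their own equivalents.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE restaurants (restaurant TEXT, num_of_loc INTEGER);
INSERT INTO restaurants VALUES
  ('Pizza Hut', 1),
  ('Pizza Hut/Taco Bell', 2),
  ('Taco Bell', 2);
""")

# For a split row (name contains '/'), add the counts of every stand alone
# row whose name is one of its '/'-separated parts. Wrapping both sides in
# '/' avoids partial-name matches such as 'Bell' inside 'Taco Bell'.
query = """
SELECT r.restaurant,
       r.num_of_loc + COALESCE((
           SELECT SUM(s.num_of_loc)
           FROM restaurants s
           WHERE instr(r.restaurant, '/') > 0
             AND instr(s.restaurant, '/') = 0
             AND instr('/' || r.restaurant || '/',
                       '/' || s.restaurant || '/') > 0
       ), 0) AS total
FROM restaurants r
ORDER BY total DESC
"""
rows = con.execute(query).fetchall()
for name, total in rows:
    print(name, total)
```

Stand alone rows keep their own count because the subquery only contributes when the outer name contains a "/".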
I'm using BigQuery.
I have two simple tables with "bad" data quality from our systems. One represents revenue and the other production rows for bus journeys.
I need to match every journey to a revenue transaction but I only have a set of fields and no key and I don't really know how to do this matching.
This is a sample of the data:
Revenue
Year, Agreement, Station_origin, Station_destination, Product
2020, 123123, London, Manchester, Qwerty
Journeys
Year, Agreement, Station_origin, Station_destination, Product
2020, 123123, Kings Cross, Piccadilly Gardens, Qwer
2020, 123123, Kings Cross, Victoria Station, Qwert
2020, 123123, London, Manchester, Qwerty
Every station has a maximum of 9 alternative names and these are stored in a "station" table.
Stations
Station Name, Station Name 2, Station Name 3,...
London, Kings Cross, Euston,...
Manchester, Piccadilly Gardens, Victoria Station,...
I would like to start by joining the tables on the original fields. This will generate some matches, but many journeys will remain unmatched. For the unmatched revenue rows, I would like to change the product name (shorten it to two letters, which may match many rows in the production table) and then the station names, first changing station_origin and then station_destination. When a shortened product name gives many matches, I want the row from the production table with the most common product.
Something like this:
1. Do a direct match. That is, I can use the fields as they are in the tables.
2. Do a match where the revenue.product is changed by shortening it to two letters. substr(product,0,2)
3. Change the rev.station_origin to the first alternative, Station Name 2, and then try a join. The product or other station are not changed.
4. Change the rev.station_origin to the first alternative, Station Name 2, and then try a join. The product is changed as above with a substr(product,0,2) but rev.station_destination is not changed.
5. Change the rev.station_destination to the first alternative, Station Name 2, and then try a join. The product or other station are not changed.
I was told that maybe I should create an intermediate table with all combinations of stations and products and let a rank column decide the order. The station names in the station's table are in order of importance so "station name" is more important than "station name 2" and so on.
I started to do a query with a subquery per rank and do a UNION ALL but there are so many combinations that there must be another way to do this.
Don't know if this makes any sense but I would appreciate any help or ideas to do this in a better way.
Cheers,
Cris
To implement a complex joining strategy with approximate matching, it might make more sense to define the strategy within JavaScript - and call the function from a BigQuery SQL query.
For example, the following query does the following steps:
Take the top 200 male names in the US.
Find if one of the top 200 female names matches.
If not, look for the most similar female name within the options.
Note that the logic to choose the closest option is encapsulated within the JS UDF fhoffa.x.fuzzy_extract_one(). See https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83 to learn more about this.
WITH data AS (
SELECT name, gender, SUM(number) c
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY 1,2
), top_men AS (
SELECT * FROM data WHERE gender='M'
ORDER BY c DESC LIMIT 200
), top_women AS (
SELECT * FROM data WHERE gender='F'
ORDER BY c DESC LIMIT 200
)
SELECT name male_name,
COALESCE(
(SELECT name FROM top_women WHERE name=a.name)
, fhoffa.x.fuzzy_extract_one(name, ARRAY(SELECT name FROM top_women))
) female_version
FROM top_men a
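The "intermediate table with a rank column" idea from the question can also be sketched in plain SQL, without one subquery per rank. The following is a rough illustration only, run through Python's sqlite3 module rather than BigQuery. It assumes the wide stations table has been unpivoted into a hypothetical long-form station_alias table (station, alias, priority), and it invents a simple penalty (1 for a product that only matches on its first two letters, plus the alias priorities) to order the candidates. Both the table layout and the weighting are assumptions, not something the original data defines.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE revenue  (id INTEGER PRIMARY KEY, year, agreement, origin, destination, product);
CREATE TABLE journeys (id INTEGER PRIMARY KEY, year, agreement, origin, destination, product);
-- Hypothetical long form of the stations table: priority 1 is "Station Name",
-- priority 2 is "Station Name 2", and so on.
CREATE TABLE station_alias (station, alias, priority);

INSERT INTO revenue VALUES (1, 2020, 123123, 'London', 'Manchester', 'Qwerty');
INSERT INTO journeys VALUES
  (1, 2020, 123123, 'Kings Cross', 'Piccadilly Gardens', 'Qwer'),
  (2, 2020, 123123, 'Kings Cross', 'Victoria Station',   'Qwert'),
  (3, 2020, 123123, 'London',      'Manchester',         'Qwerty');
INSERT INTO station_alias VALUES
  ('London', 'London', 1), ('London', 'Kings Cross', 2), ('London', 'Euston', 3),
  ('Manchester', 'Manchester', 1), ('Manchester', 'Piccadilly Gardens', 2),
  ('Manchester', 'Victoria Station', 3);
""")

# Every candidate pairing gets a penalty; 0 means a direct match on all fields.
candidates = con.execute("""
SELECT j.id, r.id,
       (r.product <> j.product) + (o.priority - 1) + (d.priority - 1) AS penalty
FROM journeys j
JOIN station_alias o ON o.alias = j.origin
JOIN station_alias d ON d.alias = j.destination
JOIN revenue r
  ON r.year = j.year
 AND r.agreement = j.agreement
 AND r.origin = o.station
 AND r.destination = d.station
 AND substr(r.product, 1, 2) = substr(j.product, 1, 2)
ORDER BY j.id, penalty
""").fetchall()

# Keep only the lowest-penalty revenue row for each journey.
best = {}
for journey_id, revenue_id, penalty in candidates:
    best.setdefault(journey_id, (revenue_id, penalty))
print(best)
```

In BigQuery the final "keep the best candidate" step could be done with a window function instead of the Python loop; the loop is used here only to keep the sketch short.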
I can't seem to find all comics with certain characters in them. The comics and characters tables have a many to many relationship as follows:
My database schema:
**comics table**
comic_id
comic_name
comic_date
**characters table**
character_id
character_name
**comics_characters table**
comic_id
character_id
This works fine for one character:
sqlite3 comics.db
select comic_name
from comics as c, comics_characters as cc, characters as h
on c.comic_id = cc.comic_id and h.character_id = cc.character_id
where h.character_name = 'Superman';
But if I want all comics with say Superman and Batman in them, I tried using this:
sqlite3 comics.db
select comic_name
from comics as c, comics_characters as cc, characters as h
on c.comic_id = cc.comic_id and h.character_id = cc.character_id
where h.character_name in ('Batman', 'Superman');
but this only gets me a list of comics featuring either Batman OR Superman rather than comics with both Batman AND Superman in them.
I've also tried this, which doesn't return anything:
sqlite3 comics.db
select comic_name
from comics as c, comics_characters as cc, characters as h
on (c.comic_id = cc.comic_id and h.character_id = cc.character_id)
where (h.character_name = 'Batman' and h.character_name = 'Superman');
I've tried other variations but can't get the desired result.
The OR route doesn't work, for reasons you've worked out - you get rows for Superman or Batman, and comics that have both characters have two rows with the same comic id and different character ids. The AND route doesn't work because a row cannot be both characters simultaneously.
So, you need to use the OR route to get comics with one or both characters and then also use a count to show only comics with both characters. Essentially, "filter to Superman or Batman, and then filter again to only comic ids that appear on two rows" or "filter to only Batman or Superman, then group them up based on the comic id and only take groups that have two entities in them". Ultimately, the lesson here is that database rows are thought of as different entities, and when you want to treat them as one you have to group them, so we are identifying comics based on some attribute of the group after (deliberately) losing the detail of exactly which entities the group contains:
SELECT comic_id
FROM comics_characters
WHERE character_id IN (1,2) --Batman or Superman
GROUP BY comic_id
HAVING COUNT(*) = 2
The number on the right hand side of the = must be the same as the number of character IDs in the IN() clause. If you IN for 4 character IDs, then use COUNT(*) = 4.
You can join in other tables so you can use names etc; I simplified this to make the point without extraneous detail
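As a rough sketch of what joining in the other tables might look like, the following runs the same filter-then-group idea against character names instead of ids. It uses Python's sqlite3 module with the table names from the question's queries; the sample rows and comic names are invented purely for illustration. COUNT(DISTINCT ...) is used instead of COUNT(*) so that a duplicated link row would not produce a false match.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE comics (comic_id INTEGER PRIMARY KEY, comic_name TEXT);
CREATE TABLE characters (character_id INTEGER PRIMARY KEY, character_name TEXT);
CREATE TABLE comics_characters (comic_id INTEGER, character_id INTEGER);

-- Made-up sample data
INSERT INTO comics VALUES (1, 'Comic A'), (2, 'Comic B'), (3, 'Comic C');
INSERT INTO characters VALUES (1, 'Batman'), (2, 'Superman'), (3, 'Robin');
INSERT INTO comics_characters VALUES
  (1, 1), (1, 2), (1, 3),  -- Comic A: Batman, Superman, Robin
  (2, 2),                  -- Comic B: Superman
  (3, 1), (3, 3);          -- Comic C: Batman, Robin
""")

# Filter to the wanted characters, group per comic, keep comics with both.
both = [row[0] for row in con.execute("""
SELECT c.comic_name
FROM comics AS c
JOIN comics_characters AS cc ON cc.comic_id = c.comic_id
JOIN characters AS h ON h.character_id = cc.character_id
WHERE h.character_name IN ('Batman', 'Superman')
GROUP BY c.comic_id, c.comic_name
HAVING COUNT(DISTINCT h.character_id) = 2
""")]
print(both)
```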
Footnote: this technique will find comics that feature at least Batman and Superman. The comic could very well contain other characters too, but we lost those other guys at the WHERE stage before we did the GROUP BY. If you wanted comics that ONLY featured Batman and Superman, it's a different thing. For that we could do something like grouping first and counting conditionally: give Batman or Superman a score of 1 and everyone else a score of 10. Comics that had only B and S would score 2, comics that featured only one of them would score 1, and anyone else's presence would cause a score of 10 or more, so we could filter on the 2.
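The scoring idea in the footnote might look something like this. It is a sketch only, again using Python's sqlite3 module with invented sample rows, and it assumes the link table holds no duplicate (comic_id, character_id) pairs, since a duplicate would distort the sum.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE characters (character_id INTEGER PRIMARY KEY, character_name TEXT);
CREATE TABLE comics_characters (comic_id INTEGER, character_id INTEGER);

-- Made-up sample data
INSERT INTO characters VALUES (1, 'Batman'), (2, 'Superman'), (3, 'Robin');
INSERT INTO comics_characters VALUES
  (1, 1), (1, 2), (1, 3),  -- comic 1: Batman, Superman, Robin -> score 12
  (2, 2),                  -- comic 2: Superman only           -> score 1
  (3, 1), (3, 3),          -- comic 3: Batman, Robin           -> score 11
  (4, 1), (4, 2);          -- comic 4: Batman, Superman        -> score 2
""")

# No WHERE filter this time: every character must contribute to the score.
only_both = [row[0] for row in con.execute("""
SELECT cc.comic_id
FROM comics_characters AS cc
JOIN characters AS h ON h.character_id = cc.character_id
GROUP BY cc.comic_id
HAVING SUM(CASE WHEN h.character_name IN ('Batman', 'Superman')
                THEN 1 ELSE 10 END) = 2
""")]
print(only_both)
```

The penalty of 10 only needs to be larger than the number of wanted characters, so that no mix of wanted and unwanted characters can add up to the target score.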
I have a table of invoice data with over 100k unique invoices and several thousand unique company names associated with them.
I'm trying to group these company names into more general groups to understand how many invoices they're responsible for, how often they receive them, etc.
Currently, I'm using the following code to identify unique company names:
SELECT DISTINCT(company_name)
FROM invoice_data
ORDER BY company_name
The problem is that this only gives me exact matches, when it's obvious that there are many string values in company_name that are similar. For example: McDonalds Paddington, McDonlads Oxford Square, McDonalds Peckham, etc.
How can I make my GROUP BY statement more general?
Sometimes the issue isn't as simple as the example listed above; occasionally there is simply an extra space or PTY/LTD, which throws off a GROUP BY match.
EDIT
To give an example of what I'm looking for, I'd be looking to turn the following:
company_name
----------------------
Jim's Pizza Paddington
Jim's Pizza Oxford
McDonald's Peckham
McDonald's Victoria
----------------------
And be able to group by their company name rather than exclusively with an exact string match.
Have you tried using the Soundex function?
SELECT
SOUNDEX(name) AS code,
MAX( name) AS sample_name,
count(name) as records
FROM ((
SELECT
"Jim's Pizza Paddington" AS name)
UNION ALL (
SELECT
"Jim's Pizza Oxford" AS name)
UNION ALL (
SELECT
"McDonald's Peckham" AS name)
UNION ALL (
SELECT
"McDonald's Victoria" AS name))
GROUP BY
1
You can then use the soundex code to create groupings, with a split or similar function to pull out the part of the string that matches the name group, or use a window function to pull back one occurrence to get the name string. It's not perfect, but it means you do not need to pull the data into other tools with advanced language recognition.
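For the simpler cases mentioned in the question (extra spaces, punctuation, PTY/LTD suffixes), a plain normalisation step before grouping may already help. This is a different technique from Soundex, shown here as a small Python sketch under assumptions: the helper name group_key, the suffix list, and the choice of "first remaining word" as the grouping key are all hypothetical and would need tuning against real company names. It will not catch misspellings such as "McDonlads".

```python
import re
from collections import Counter

# Hypothetical list of legal suffixes to ignore when grouping
SUFFIXES = {"pty", "ltd"}

def group_key(name):
    """Lowercase, drop punctuation and legal suffixes, keep the first word."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    tokens = [t for t in cleaned.split() if t not in SUFFIXES]
    return tokens[0] if tokens else ""

names = [
    "Jim's Pizza Paddington",
    "Jim's Pizza  Oxford",          # extra space
    "McDonald's Peckham",
    "McDonald's Victoria PTY LTD",  # legal suffix
]
counts = Counter(group_key(n) for n in names)
print(counts)
```

The resulting key could be stored in an extra column and used in the GROUP BY in place of company_name.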
I have categories and Listings stored in a ListingCategory table and a Listing table respectively.
A listing can be stored in many categories and a category can have many listings. These are joined by a table *ListingCategory_Listings*:
ID | ListingCategoryID | ListingID
I need to somehow grab all the ListingCategories where listings in them meet a certain criteria.
As an example, Imagine categories such as: Food, Drink, Lodging.
A bar listing would be linked to Food and Drink and a hotel would link to Food, Drink and Lodging, a hostel would link to lodging etc etc.
Each of these listings is geo-coded and I want to be able to display the categories where there are listings within X miles of a determined geo-location. So if just the bar fell within the X miles, we would show Food and Drink. If just the hostel fell in this radius, we only show Lodging, etc. I have the logic to work out the distance, I just don't know how to get my desired result.
Lastly... apologies for the horrible post title
It should be as simple as:
SELECT DISTINCT c.ID, c.name
FROM ListingCategory c
JOIN ListingCategory_Listings lc
ON c.ID = lc.ListingCategoryID
WHERE lc.ListingID IN (<list of listings comma separated>)
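If the distance check can be expressed in SQL, the list of listing ids does not have to be built separately; the Listing table can be joined in and filtered directly. The following is a minimal sketch using Python's sqlite3 module. The distance_miles column is a hypothetical stand-in for the asker's own distance logic, and the sample rows are invented to mirror the bar/hotel/hostel example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ListingCategory (ID INTEGER PRIMARY KEY, name TEXT);
-- distance_miles is a hypothetical stand-in for the real distance calculation
CREATE TABLE Listing (ID INTEGER PRIMARY KEY, name TEXT, distance_miles REAL);
CREATE TABLE ListingCategory_Listings (
    ID INTEGER PRIMARY KEY, ListingCategoryID INTEGER, ListingID INTEGER);

INSERT INTO ListingCategory VALUES (1, 'Food'), (2, 'Drink'), (3, 'Lodging');
INSERT INTO Listing VALUES (1, 'bar', 2.0), (2, 'hotel', 50.0), (3, 'hostel', 80.0);
INSERT INTO ListingCategory_Listings (ListingCategoryID, ListingID) VALUES
  (1, 1), (2, 1),          -- bar: Food, Drink
  (1, 2), (2, 2), (3, 2),  -- hotel: Food, Drink, Lodging
  (3, 3);                  -- hostel: Lodging
""")

# Categories that have at least one listing within the given radius
radius_miles = 5.0
categories = con.execute("""
SELECT DISTINCT c.ID, c.name
FROM ListingCategory c
JOIN ListingCategory_Listings lc ON c.ID = lc.ListingCategoryID
JOIN Listing l ON l.ID = lc.ListingID
WHERE l.distance_miles <= ?
ORDER BY c.ID
""", (radius_miles,)).fetchall()
print(categories)
```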
I have a question about updating multiple rows with different values in MS Access 2010.
Table 1: Food
ID | Favourite Food
1 | Apple
2 | Orange
3 | Pear
Table 2: New
ID | Favourite Food
1 | Watermelon
3 | Cherries
Right now, it looks deceptively simple to execute them separately (because this is just an example). But how would I execute a whole lot of them at the same time if I had, say, 500 rows to update out of 1000 records.
So what I want to do is to update the "Food" table based on the new values from the "New" table.
Would appreciate if anyone could give me some direction / syntax so that I can test it out on MS Access 2010. If this requires VBA, do provide some samples of how I should carry this out programmatically, not manually statement-by-statement.
Thank you!
ADDENDUM (REAL DATA)
Table: Competitors
Columns: CompetitorNo (PK), FirstName, LastName, Score, Ranking
query: FinalScore
Columns: CompetitorNo, Score, Ranking
Note - this query is a query of another query, which in turn, is a query of another query (could there be a potential problem here? There are at least 4 queries before this FinalScore query is derived. Should I post them?)
In the competitors table, all the columns except "Score" and "Ranking" are filled. We would need to take the values from the FinalScore query and insert them into the relevant competitor columns.
Addendum (Brief Explanation of Query)
Table: Competitors
Columns: CompetitorNo (PK), FirstName, LastName, Score, Ranking
Sample Data: AX1234, Simpson, Danny, <blank initially>, <blank initially>
Table: CompetitionRecord
Columns: EventNo (PK composite), CompetitorNo (PK composite), Timing, Bonus
Sample Data1: E01, AX1234, 14.4, 1
Sample Data2: E01, AB1938, 12.5, 0
Sample Data3: E01, BB1919, 13.0, 2
Event No specifies unique event ID
Timing measures the time taken to run 200 metres. Lower is better.
Bonus can be given in 3 values (0 - Disqualified, 1 - Normal, 2 - Exceptional). Competitors with Exceptional are given bonus points (5% off their timing).
Query: FinalScore
Columns: CompetitorNo (PK), Score, Ranking
Score is calculated by wins. For example, in the above event (E01), there are three competitors. The winner of the event is BB1919. Winners get 1 point. Losers don't get any points. Those that are disqualified do not receive any points as well.
This query lists the competitors and their cumulative scores (from a list of many events - E01, E02, E03 etc.) and calculates their ranking in the ranking column every time the query is executed. (For example, a person who wins the most 200m events would be at the top of this list).
Now, I am required to update the Competitors table with this information. The query is rather complex - with all the grouping, summations, rankings and whatnots. Thus, I had to create multiple queries to achieve the end result.
How about:
UPDATE Food
INNER JOIN [New]
ON Food.ID=New.ID
SET Food.[Favourite Food] = New.[Favourite Food]
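To try the same set-based update outside Access, here is a small sketch using Python's sqlite3 module. SQLite does not accept the UPDATE ... INNER JOIN form above, so this version uses a correlated subquery instead, which is an assumption-laden substitute rather than Access syntax. It uses the sample Food and New tables from the question.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript('''
CREATE TABLE Food (ID INTEGER PRIMARY KEY, "Favourite Food" TEXT);
CREATE TABLE "New" (ID INTEGER PRIMARY KEY, "Favourite Food" TEXT);
INSERT INTO Food VALUES (1, 'Apple'), (2, 'Orange'), (3, 'Pear');
INSERT INTO "New" VALUES (1, 'Watermelon'), (3, 'Cherries');

-- Update every Food row that has a counterpart in New, in one statement.
-- The WHERE clause leaves rows without a counterpart untouched.
UPDATE Food
SET "Favourite Food" = (
    SELECT n."Favourite Food" FROM "New" AS n WHERE n.ID = Food.ID
)
WHERE ID IN (SELECT ID FROM "New");
''')

food = con.execute('SELECT ID, "Favourite Food" FROM Food ORDER BY ID').fetchall()
print(food)
```

Either form updates all matching rows at once, so 500 rows out of 1000 need no loop or per-row statement.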