What are the cases whereby EXCEPT and DISTINCT are different?

What are the cases whereby EXCEPT and DISTINCT are different? - sql

Looking into my notes for introduction to databases, I have stumbled upon a case that i do not understand (Between except and distinct).
It says so in my notes that:
The two queries below have the same results, but this will not be the case in general.
First query:
Select c.first_name,c.last_name,c.email
FROM customers as c
WHERE c.country = 'Japan'
EXCEPT
Select c.first_name,c.last_name,c.email
FROM customers as c
WHERE c.last_name LIKE 'D%';
Second query:
Select DISTINCT c.first_name,c.last_name,c.email
FROM customers as c
WHERE c.country = 'Japan' AND NOT (c.last_name LIKE 'D%');
Could anyone provide me some insights as to what are cases whereby the results would differ?

Number 1 selects first, last & email from customers who are from Japan and whose last names do not start with D.
Number 2 selects first, last & email, where no two records have all 3 fields the same, where the customers are from Singapore and their last names do not begin with D.
I suppose I can imagine a table where these would yield the same results, but I don't think it would ever appear except in very contrived circumstances.
Joe Smith jsmith#abc.com Japan
Joe Smith jsmith#abc.com Singapore
Would be one of them. Both queries would yield Joe Smith jsmith#abc.com. Another case would be if no-one was from either country or everyone's last name started with D, then they would both yield nothing.
None of this is tested, and the EXCEPT statement is something I've read about but never had occasion to use.

The first is looking at Japan, the second at Singapore, so I don't see why these would generally -- or specifically -- return the same data.
Even if the countries were the same you have another issue with NULL values. So, if your data looks like this:
first_name last_name email country
xxx NULL a Japan
Your first query would return the row. The second would not.

Related

How does SQL count(distinct) work in this case?

I'm trying to find the match no in which Germany played against Poland. This is from https://www.w3resource.com/sql-exercises/soccer-database-exercise/sql-subqueries-exercise-soccer-database-4.php. There are two tables : match_details and soccer_country. I don't understand how the count(distinct) works in this case. Can someone please clarify? Thanks!
SELECT match_no
FROM match_details
WHERE team_id = (
SELECT country_id
FROM soccer_country
WHERE country_name = 'Germany')
OR team_id = (
SELECT country_id
FROM soccer_country
WHERE country_name = 'Poland')
GROUP BY match_no
HAVING COUNT(DISTINCT team_id) = 2;

As Lamak mentioned, what an ugly consideration for a query, but many ways to approach a query.
As mentioned, counting for (Distinct team_id) makes sure that there are only 2 unique teams. If there is ever a Cartesian result, you could get repetition of multiple rows showing more than one instance of both teams. So the count of distinct on the TEAM_ID eliminates that.
Now, that said, Other "team" query data structures I have seen have a single record for the match and a column for EACH TEAM playing the match. That is easier by a long-shot, but still a relatively easy query.
Break the query down a little, and consider a large scale set of data (not that this, or any sort of even professional league would have such large record counts to give delay with a sql engine).
Your first criteria is games with Germany. So lets start with that.
SELECT
md1.match_no
FROM
match_details md1
JOIN soccer_country sc1
on md1.team_id = sc1.country_id
AND sc1.country_name = 'Germany'
So, why even look at any other record/match if Germany is not even part of the match on either side. Of which this in itself would return 6 matches from the sample data of 51 matches. So now, all you need to do is join AGAIN to the match details table a second time for only those matches, but ALSO the second team is Poland
SELECT
md1.match_no
FROM
match_details md1
JOIN soccer_country sc1
on md1.team_id = sc1.country_id
AND sc1.country_name = 'Germany'
-- joining again for the same match Germany was already qualified
JOIN match_details md2
on md1.match_no = md2.match_no
-- but we want the OTHER team record since Germany was first team
and md1.team_id != md2.team_id
-- and on to the second country table based on the SECOND team ID
JOIN soccer_country sc2
on md2.team_id = sc2.country_id
-- and the second team was Poland
AND sc2.country_name = 'Poland'
Yes, may be a longer query, but by eliminating 45 other matches (again, thinking a LARGE database), you have already saved blowing through tons of data to a very finite set. And now finishing only those Germany / Poland. No aggregates, counts, distincts, just direct joins.
FEEDBACK
Lets take a look at some BAD sample data... which as all programmers know, there is no such thing (NOT). Anyhow, lets take a look at these few matches.
Match Team ID blah
52 Poland Just put the names here for simplistic purposes
52 Poland
53 Germany
53 Germany
If you were to run the query without DISTINCT Teams, both match 52 and 53 would show up... As Poland is one team and appears 2 times for match 52, and similarly Germany 2 times for match 53. By doing DISTINCT Team, you can see that for each match, there is only 1 team being returned and thus excluded. Does that help? Again, no such thing as bad data :)
And yet another sample match where more than 2 teams created
Match Team ID
54 France
54 Poland
54 England
55 Hungary
56 Austria
In each of these matches, NONE would be returned. Match 54 has 3 distinct teams, and Match 55 and 56 only have single entry, thus no opponent to compete against.
2nd FEEDBACK
To clarify the query. If you look at the short query for just Germany, that aliased instance of "md1" is already sitting on any given record for a Germany match. So the second join to the "md2", I only care about the same match, so I can join on the same match_no. However, in the "md2" alias, the "!=" means NOT EQUAL. ! = logical NOT. So the join is saying from the MD1, join to the MD2 alias on the same match id. However, only give me where the teams are NOT the same. So the first instance holds Germany's team ID (already qualified) and thus give me the secondary team id. So now I can use the secondary (md2) instance team ID to join to the country to confirm only for Poland.
Does this now clarify things for you?

match tables with intermediate mapping table (fuzzy joins with similar strings)

I'm using BigQuery.
I have two simple tables with "bad" data quality from our systems. One represents revenue and the other production rows for bus journeys.
I need to match every journey to a revenue transaction but I only have a set of fields and no key and I don't really know how to do this matching.
This is a sample of the data:
Revenue
Year, Agreement, Station_origin, Station_destination, Product
2020, 123123, London, Manchester, Qwerty
Journeys
Year, Agreement, Station_origin, Station_destination, Product
2020, 123123, Kings Cross, Piccadilly Gardens, Qwer
2020, 123123, Kings Cross, Victoria Station, Qwert
2020, 123123, London, Manchester, Qwerty
Every station has a maximum of 9 alternative names and these are stored in a "station" table.
Stations
Station Name, Station Name 2, Station Name 3,...
London, Kings Cross, Euston,...
Manchester, Piccadilly Gardens, Victoria Station,...
I would like to test matching or joining the tables first with the original fields. This will generate some matches but there are many journeys that are not matched. For the unmatched revenue rows, I would like to change the product name (shorten it to two letters and possibly get many matches from production table) and then station names by first change the station_origin and then station_destination. When using a shorter product name I could possibly get many matches but I want the row from the production table with the most common product.
Something like this:
1. Do a direct match. That is, I can use the fields as they are in the tables.
2. Do a match where the revenue.product is changed by shortening it to two letters. substr(product,0,2)
3. Change the rev.station_origin to the first alternative, Station Name 2, and then try a join. The product or other station are not changed.
4. Change the rev.station_origin to the first alternative, Station Name 2, and then try a join. The product is changed as above with a substr(product,0,2) but rev.station_destination is not changed.
5. Change the rev.station_destination to the first alternative, Station Name 2, and then try a join. The product or other station are not changed.
I was told that maybe I should create an intermediate table with all combinations of stations and products and let a rank column decide the order. The station names in the station's table are in order of importance so "station name" is more important than "station name 2" and so on.
I started to do a query with a subquery per rank and do a UNION ALL but there are so many combinations that there must be another way to do this.
Don't know if this makes any sense but I would appreciate any help or ideas to do this in a better way.
Cheers,
Cris

To implement a complex joining strategy with approximate matching, it might make more sense to define the strategy within JavaScript - and call the function from a BigQuery SQL query.
For example, the following query does the following steps:
Take the top 200 male names in the US.
Find if one of the top 200 female names matches.
If not, look for the most similar female name within the options.
Note that the logic to choose the closest option is encapsulated within the JS UDF fhoffa.x.fuzzy_extract_one(). See https://medium.com/#hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83 to learn more about this.
WITH data AS (
SELECT name, gender, SUM(number) c
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY 1,2
), top_men AS (
SELECT * FROM data WHERE gender='M'
ORDER BY c DESC LIMIT 200
), top_women AS (
SELECT * FROM data WHERE gender='F'
ORDER BY c DESC LIMIT 200
)
SELECT name male_name,
COALESCE(
(SELECT name FROM top_women WHERE name=a.name)
, fhoffa.x.fuzzy_extract_one(name, ARRAY(SELECT name FROM top_women))
) female_version
FROM top_men a

Having trouble with query giving different results inside subform vs just running query

I'm trying to use a query to make a list of records for use in a subform. This will allow a user to pick the record they want to edit. With no changes to the SQL or tables I get different results running the query directly or using the same query as the recordsource of my form.
households has a record for each address - h_id is the ID number
people is a list of people in each household - household_id links to a household
Query SQL
SELECT P.last_name, H.h_id, P.first_name, H.second_last_name, H.address, H.mailing_address, H.phone, H.second_phone FROM households AS H INNER JOIN people AS P ON H.h_id = P.household_id WHERE (((P.last_name) Is Not Null))
the only difference in the form is that I have to put the WHERE part under Filter
if the query is run alone I get
last_name h_id first_name address
Johnson 1289 Mike 26 Aiken Ave
Raty- Johnson 1289 Mary 26 Aiken Ave
when the query is the recordsource for the form I get
last_name h_id first_name address
Raty- Johnson 1289 Mary 26 Aiken Ave
So far I assume that the form is killing the second record for maybe matching h_id, but I have no idea how to stop it. The output from running the query alone is what I want, only in the form. This will allow the form to have other people in the household listed, in case one of them come instead of the person who signed up.

Oracle: LIKE where any part of one string matches any part of another string

I am using PL/SQL v7.1
I am trying to find all address records where the country name has been entered in one of the address line fields, and also the country field.
The problem is that the country details have not been entered consistently eg.
addr4 addr5 country
---------- ---------- ---------------
JERSEY UK(JERSEY)
IRELAND REPUBLIC OFIRELAND
DOUGLAS ISLE OF MAN UK(ISLE OF MAN)  
So, I need to find the records where ANY PART of the Country field is also found in either addr4 or addr5.
I started with this
SELECT *
FROM test_addresses
WHERE addr4||addr5 LIKE '%'||country||'%'
I know this doesn't work because it will, taking the 1st record as an example, check if 'UK(JERESEY)' is found in addr4||addr5 and ,so, no match will be found. But how do I make it check if 'JERSEY' is found in addr4||addr5

Try this way:
SELECT *
FROM test_addresses
WHERE (addr4 is not null and country like '%'||addr4||'%')
or (addr5 is not null and country like '%'||addr5||'%')
Sql Fiddle Demo

I don't know so much about plsql
but I think your query is backwards, try this.
SELECT *
FROM test_addresses
WHERE country LIKE '%'||addr4||'%'
or country LIKE '%'||addr5||'%'

Getting distinct rows based on a certain field from a database in Django

I need to construct a query in Django, and I'm wondering if this is somehow possible (it may be really obvious but I'm missing it...).
I have a normal query Model.objects.filter(x=True)[:5] which can return results like this:
FirstName LastName Country
Bob Jones UK
Bill Thompson UK
David Smith USA
I need to only grab rows which are distinct based on the Country field, something like Model.objects.filter(x=True).distinct('Country')[:5] would be ideal but that's not possible with Django.
The rows I want the query to grab ultimately are:
FirstName LastName Country
Bob Jones UK
David Smith USA
I also need the query to use the same ordering as set in the model's Meta class (ie. I can't override the ordering in any way).
How would I go about doing this?
Thanks a lot.

I haven't tested this, but it seems to me a dict should do the job, although ordering could be off then:
d = {}
for x in Model.objects.all():
d[x.country] = x
records_with_distinct_countries = d.values()

countries = [f.country in Model.objects.all()]
for c in countries:
try:
print Model.objects.filter(country=c)
except Model.DoesNotExist:
pass

I think that #skrobul is on the right track, but a little bit off.
I don't think you'll be able to do this with a single query, because the distinct() method adds the SELECT DISTINCT modifier to the query, which acts on the entire row. You'll likely have to create a list of countries and then return limited QuerySets based on iterating that list.
Something like this:
maxrows = 5
countries = set([x.country for x in Model.objects.all()])
rows = []
count = 0
for c in countries:
if count >= maxrows:
break
try:
rows.append(Model.objects.filter(country=c)[0])
except Model.DoesNotExist:
pass
count += 1
This is a very generic example, but it gives the intended result.

Can you post the raw SQL that returns what you want from the source database? I have a hunch that the actual problem here is the query/data structure...

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas