SQL ending with certain letter -> strange behaviour? - sql

I've got a simple SQL question:
I want to get all customers (more precisely, their name and balance) working in a sector ending with E, ordered alphabetically by name. My query is:
SELECT Name,Balance FROM customer WHERE sector LIKE '%E' ORDER BY Name
which gives me wrong results.
I tested it by looking up which sectors exist:
SELECT Distinct(Sector) FROM customer
giving me:
Sector
----------
AUTOMOBILE
BUILDING
FURNITURE
HOUSEHOLD
MACHINERY
Now I tried a query like
SELECT Distinct(Sector) FROM customer WHERE Sector LIKE '%E'
only giving me:
Sector
----------
AUTOMOBILE
It's probably me being stupid here, but why won't the last query give me AUTOMOBILE and FURNITURE? I don't see the problem. I'm using DB2, if that's important.
Thank you!

The values probably have trailing spaces (a fixed-length CHAR column is padded with blanks), so they don't actually end with E. Trim them:
SELECT Distinct(Sector)
FROM customer
WHERE RTRIM(Sector) LIKE '%E'
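
The effect can be reproduced with a small sketch. It uses Python's sqlite3 module as a stand-in for DB2 and assumes the column is a fixed-length CHAR(10), which is a guess based on only the 10-character value AUTOMOBILE matching. The padding is simulated by hand, and the helper name sectors_ending_in_e is made up for this illustration.

```python
import sqlite3

SECTORS = ["AUTOMOBILE", "BUILDING", "FURNITURE", "HOUSEHOLD", "MACHINERY"]


def sectors_ending_in_e(trim):
    """Return the sectors matching LIKE '%E', optionally trimming blanks first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (sector TEXT)")
    # Assumption: simulate a CHAR(10) column by padding every value to 10 characters.
    conn.executemany(
        "INSERT INTO customer VALUES (?)", [(s.ljust(10),) for s in SECTORS]
    )
    column = "RTRIM(sector)" if trim else "sector"
    rows = conn.execute(
        f"SELECT DISTINCT RTRIM(sector) FROM customer "
        f"WHERE {column} LIKE '%E' ORDER BY 1"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


print(sectors_ending_in_e(trim=False))  # padded values: 'FURNITURE ' ends with a blank
print(sectors_ending_in_e(trim=True))   # trimmed values
```

Without trimming, only the value that fills the whole column width ends with E; with RTRIM, FURNITURE matches as well.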

Related

match tables with intermediate mapping table (fuzzy joins with similar strings)

I'm using BigQuery.
I have two simple tables with "bad" data quality from our systems. One represents revenue and the other production rows for bus journeys.
I need to match every journey to a revenue transaction but I only have a set of fields and no key and I don't really know how to do this matching.
This is a sample of the data:
Revenue
Year, Agreement, Station_origin, Station_destination, Product
2020, 123123, London, Manchester, Qwerty
Journeys
Year, Agreement, Station_origin, Station_destination, Product
2020, 123123, Kings Cross, Piccadilly Gardens, Qwer
2020, 123123, Kings Cross, Victoria Station, Qwert
2020, 123123, London, Manchester, Qwerty
Every station has a maximum of 9 alternative names and these are stored in a "station" table.
Stations
Station Name, Station Name 2, Station Name 3,...
London, Kings Cross, Euston,...
Manchester, Piccadilly Gardens, Victoria Station,...
I would like to first try matching or joining the tables on the original fields. This will generate some matches, but many journeys will remain unmatched. For the unmatched revenue rows, I would like to change the product name (shorten it to two letters, which may produce many matches from the production table) and then the station names, first station_origin and then station_destination. With a shorter product name I may get many matches, but I only want the row from the production table with the most common product.
Something like this:
1. Do a direct match. That is, I can use the fields as they are in the tables.
2. Do a match where the revenue.product is changed by shortening it to two letters. substr(product,0,2)
3. Change the rev.station_origin to the first alternative, Station Name 2, and then try a join. The product or other station are not changed.
4. Change the rev.station_origin to the first alternative, Station Name 2, and then try a join. The product is changed as above with a substr(product,0,2) but rev.station_destination is not changed.
5. Change the rev.station_destination to the first alternative, Station Name 2, and then try a join. The product or other station are not changed.
I was told that maybe I should create an intermediate table with all combinations of stations and products and let a rank column decide the order. The station names in the station's table are in order of importance so "station name" is more important than "station name 2" and so on.
I started to do a query with a subquery per rank and do a UNION ALL but there are so many combinations that there must be another way to do this.
Don't know if this makes any sense but I would appreciate any help or ideas to do this in a better way.
Cheers,
Cris
To implement a complex joining strategy with approximate matching, it might make more sense to define the strategy within JavaScript - and call the function from a BigQuery SQL query.
For example, the query below performs these steps:
Take the top 200 male names in the US.
Find if one of the top 200 female names matches.
If not, look for the most similar female name within the options.
Note that the logic to choose the closest option is encapsulated within the JS UDF fhoffa.x.fuzzy_extract_one(). See https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83 to learn more about this.
WITH data AS (
  SELECT name, gender, SUM(number) c
  FROM `bigquery-public-data.usa_names.usa_1910_2013`
  GROUP BY 1,2
), top_men AS (
  SELECT * FROM data WHERE gender='M'
  ORDER BY c DESC LIMIT 200
), top_women AS (
  SELECT * FROM data WHERE gender='F'
  ORDER BY c DESC LIMIT 200
)
SELECT name male_name,
  COALESCE(
    (SELECT name FROM top_women WHERE name=a.name)
    , fhoffa.x.fuzzy_extract_one(name, ARRAY(SELECT name FROM top_women))
  ) female_version
FROM top_men a
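
The intermediate-table idea from the question (one row per station alias, with a rank column deciding the order) can also be sketched without a UDF. The example below is only an illustration under stated assumptions: it uses Python's sqlite3 module rather than BigQuery, the station_alias table is a hand-unpivoted version of the Stations sample, and the scoring rule (sum of the two alias ranks plus 1 when the product is not an exact match) is invented here, not something from the question.

```python
import sqlite3


def rank_journey_candidates():
    """Return candidate journeys for the sample revenue row, best score first."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE revenue  (year INT, agreement INT, origin TEXT,
                               destination TEXT, product TEXT);
        CREATE TABLE journeys (year INT, agreement INT, origin TEXT,
                               destination TEXT, product TEXT);
        -- Assumption: the wide Stations table unpivoted to one row per alias.
        CREATE TABLE station_alias (canonical TEXT, alias TEXT, alias_rank INT);

        INSERT INTO revenue VALUES (2020, 123123, 'London', 'Manchester', 'Qwerty');
        INSERT INTO journeys VALUES
            (2020, 123123, 'Kings Cross', 'Piccadilly Gardens', 'Qwer'),
            (2020, 123123, 'Kings Cross', 'Victoria Station', 'Qwert'),
            (2020, 123123, 'London', 'Manchester', 'Qwerty');
        INSERT INTO station_alias VALUES
            ('London', 'London', 1), ('London', 'Kings Cross', 2),
            ('London', 'Euston', 3),
            ('Manchester', 'Manchester', 1),
            ('Manchester', 'Piccadilly Gardens', 2),
            ('Manchester', 'Victoria Station', 3);
    """)
    rows = conn.execute("""
        SELECT j.origin, j.destination, j.product,
               -- Hypothetical score: lower is better.
               o.alias_rank + d.alias_rank + (j.product <> r.product) AS score
        FROM revenue r
        JOIN station_alias o ON o.canonical = r.origin
        JOIN station_alias d ON d.canonical = r.destination
        JOIN journeys j
          ON j.year = r.year AND j.agreement = r.agreement
         AND j.origin = o.alias AND j.destination = d.alias
         AND substr(j.product, 1, 2) = substr(r.product, 1, 2)
        ORDER BY score
    """).fetchall()
    conn.close()
    return rows


for candidate in rank_journey_candidates():
    print(candidate)
```

A single join against the alias table replaces the chain of UNION ALL subqueries; keeping only the first row per revenue record would give the preferred match under this scoring.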

BigQuery: grouping by similar strings for a large dataset

I have a table of invoice data with over 100k unique invoices and several thousand unique company names associated with them.
I'm trying to group these company names into more general groups to understand how many invoices they're responsible for, how often they receive them, etc.
Currently, I'm using the following code to identify unique company names:
SELECT DISTINCT(company_name)
FROM invoice_data
ORDER BY company_name
The problem is that this only gives me exact matches, when it's obvious that many string values in company_name are similar. For example: McDonalds Paddington, McDonlads Oxford Square, McDonalds Peckham, etc.
How can I make my GROUP BY statement more general?
Sometimes the issue isn't as simple as the example above. Occasionally there is just an extra space or a PTY/LTD suffix that throws off a GROUP BY match.
EDIT
To give an example of what I'm looking for, I'd be looking to turn the following:
company_name
----------------------
Jim's Pizza Paddington
Jim's Pizza Oxford
McDonald's Peckham
McDonald's Victoria
I'd like to group these rows by company name rather than only by exact string match.
Have you tried using the Soundex function?
SELECT
SOUNDEX(name) AS code,
MAX( name) AS sample_name,
count(name) as records
FROM ((
SELECT
"Jim's Pizza Paddington" AS name)
UNION ALL (
SELECT
"Jim's Pizza Oxford" AS name)
UNION ALL (
SELECT
"McDonald's Peckham" AS name)
UNION ALL (
SELECT
"McDonald's Victoria" AS name))
GROUP BY
1
ORDER BY
You can then use the Soundex code to create groupings, and use a split or similar function to pull out the part of the string that matches the name group, or a window function to pull back one occurrence as the name string. It's not perfect, but it means you do not need to pull the data into other tools with advanced language recognition.

What are the cases whereby EXCEPT and DISTINCT are different?

Looking through my notes for an introduction to databases, I stumbled upon a case that I do not understand (the difference between EXCEPT and DISTINCT).
My notes say:
The two queries below have the same results, but this will not be the case in general.
First query:
Select c.first_name,c.last_name,c.email
FROM customers as c
WHERE c.country = 'Japan'
EXCEPT
Select c.first_name,c.last_name,c.email
FROM customers as c
WHERE c.last_name LIKE 'D%';
Second query:
Select DISTINCT c.first_name,c.last_name,c.email
FROM customers as c
WHERE c.country = 'Japan' AND NOT (c.last_name LIKE 'D%');
Could anyone give me some insight into the cases where the results would differ?
Number 1 selects first, last & email from customers who are from Japan and whose last names do not start with D.
Number 2 selects first, last & email, where no two records have all 3 fields the same, where the customers are from Singapore and their last names do not begin with D.
I suppose I can imagine a table where these would yield the same results, but I don't think it would ever appear except in very contrived circumstances.
Joe Smith jsmith@abc.com Japan
Joe Smith jsmith@abc.com Singapore
Would be one of them. Both queries would yield Joe Smith jsmith@abc.com. Another case would be if no-one was from either country or everyone's last name started with D, then they would both yield nothing.
None of this is tested, and the EXCEPT statement is something I've read about but never had occasion to use.
The first is looking at Japan, the second at Singapore, so I don't see why these would generally -- or specifically -- return the same data.
Even if the countries were the same you have another issue with NULL values. So, if your data looks like this:
first_name  last_name  email  country
xxx         NULL       a      Japan
Your first query would return the row. The second would not.
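
The NULL case can be checked with a short sketch. It runs the two queries from the question against the single row above using Python's sqlite3 module, which is only a stand-in here since the question does not name a database; the function name is made up for this example.

```python
import sqlite3


def compare_except_and_distinct():
    """Run both queries against one row whose last_name is NULL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers "
        "(first_name TEXT, last_name TEXT, email TEXT, country TEXT)"
    )
    conn.execute("INSERT INTO customers VALUES ('xxx', NULL, 'a', 'Japan')")

    except_rows = conn.execute("""
        SELECT c.first_name, c.last_name, c.email
        FROM customers AS c
        WHERE c.country = 'Japan'
        EXCEPT
        SELECT c.first_name, c.last_name, c.email
        FROM customers AS c
        WHERE c.last_name LIKE 'D%'
    """).fetchall()

    distinct_rows = conn.execute("""
        SELECT DISTINCT c.first_name, c.last_name, c.email
        FROM customers AS c
        WHERE c.country = 'Japan' AND NOT (c.last_name LIKE 'D%')
    """).fetchall()

    conn.close()
    return except_rows, distinct_rows


except_rows, distinct_rows = compare_except_and_distinct()
print(except_rows)    # the NULL row survives EXCEPT
print(distinct_rows)  # NOT (NULL LIKE 'D%') is NULL, so the row is filtered out
```

In the second query, NULL LIKE 'D%' evaluates to NULL, NOT NULL is still NULL, and a WHERE clause only keeps rows where the condition is true.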

SQL Statement with 3 conditions

I have two tables:
Name  Forename  CostCentre
Max   Meier     11111
Paul  Peters    22222
Kai   Green     11111

CostCentre  Department
11111       HR
22222       IT
Besides this I have a Searchfield and a combobox for the cost centre.
If I enter "a" in the searchfield and "11111" in cost centre, I'll get all records...
But I just want to get Max and Kai. Here's my SQL statement:
SELECT tbl_Employee.Name, tbl_Employee.Forename, tbl_Employee.CostCentre, tbl_Department.Department
FROM tbl_Department INNER JOIN tbl_Employee ON tbl_Department.CostCentre = tbl_Employee.CostCentre
WHERE tbl_Employee.Name Like "*a*" OR tbl_Employee.Forename Like "*a*" AND tbl_Employee.CostCentre=44444;
I really don't know where's the error.... If I delete the name or forename condition it works fine, but with both I get weird results...
If you want the cost centre condition to always apply and the name conditions to apply to either name, then you need to use parentheses:
SELECT * FROM tbl_Employee
WHERE (tbl_Employee.Name Like '*a*' Or
tbl_Employee.Forename Like '*a*') And
tbl_Employee.CostCentre=22222;
Otherwise, And binds more closely than Or and you're instead saying that either the Name condition must match or that both the Forename and CostCentre conditions must match.
Your question did already include some parentheses in your code which I've removed. I'm not sure what they related to.
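
The precedence difference is easy to see on the sample rows from the question. The sketch below uses Python's sqlite3 module instead of Access, so the wildcard is % rather than *, and it searches for "a" in cost centre 11111 as described in the question; the function name find_employees is made up for this example.

```python
import sqlite3


def find_employees(parenthesised):
    """Search for 'a' in either name column, restricted to cost centre 11111."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tbl_Employee (Name TEXT, Forename TEXT, CostCentre INTEGER)"
    )
    conn.executemany(
        "INSERT INTO tbl_Employee VALUES (?, ?, ?)",
        [("Max", "Meier", 11111), ("Paul", "Peters", 22222), ("Kai", "Green", 11111)],
    )
    if parenthesised:
        # The cost centre condition applies to both name conditions.
        where = "(Name LIKE '%a%' OR Forename LIKE '%a%') AND CostCentre = 11111"
    else:
        # AND binds tighter: Name LIKE ... OR (Forename LIKE ... AND CostCentre = ...)
        where = "Name LIKE '%a%' OR Forename LIKE '%a%' AND CostCentre = 11111"
    rows = conn.execute(
        f"SELECT Name FROM tbl_Employee WHERE {where} ORDER BY rowid"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


print(find_employees(parenthesised=False))  # all three rows match on Name alone
print(find_employees(parenthesised=True))   # only the rows in cost centre 11111
```

Without parentheses every row is returned because all three Name values contain an "a"; with parentheses only Max and Kai remain.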
Based on updated query:
SELECT tbl_Employee.Name, tbl_Employee.Forename, tbl_Employee.CostCentre, tbl_Department.Department
FROM tbl_Department INNER JOIN tbl_Employee ON tbl_Department.CostCentre = tbl_Employee.CostCentre
WHERE
(
tbl_Employee.Name Like "*a*"
OR
tbl_Employee.Forename Like "*a*"
)
AND
tbl_Employee.CostCentre=44444;
SELECT
*
FROM
tbl_Employee
WHERE
tbl_Employee.Name LIKE '%a%'
AND tbl_Employee.CostCentre = 11111;

Oracle: LIKE where any part of one string matches any part of another string

I am using PL/SQL v7.1
I am trying to find all address records where the country name has been entered in one of the address line fields, and also the country field.
The problem is that the country details have not been entered consistently, e.g.:
addr4 addr5 country
---------- ---------- ---------------
JERSEY UK(JERSEY)
IRELAND REPUBLIC OF IRELAND
DOUGLAS ISLE OF MAN UK(ISLE OF MAN)  
So, I need to find the records where ANY PART of the Country field is also found in either addr4 or addr5.
I started with this
SELECT *
FROM test_addresses
WHERE addr4||addr5 LIKE '%'||country||'%'
I know this doesn't work because, taking the 1st record as an example, it checks whether 'UK(JERSEY)' is found in addr4||addr5, so no match is found. But how do I make it check whether 'JERSEY' is found in addr4||addr5?
Try this way:
SELECT *
FROM test_addresses
WHERE (addr4 is not null and country like '%'||addr4||'%')
or (addr5 is not null and country like '%'||addr5||'%')
I don't know much about PL/SQL,
but I think your query is backwards. Try this:
SELECT *
FROM test_addresses
WHERE country LIKE '%'||addr4||'%'
or country LIKE '%'||addr5||'%'
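
A quick way to compare the two directions is the sketch below. It uses Python's sqlite3 module, not Oracle, so NULL and empty-string handling differ, and COALESCE is added to the original query only so the concatenation is not NULL. Placing the single address value of the first two sample rows in addr4 is an assumption (the sample's column layout is ambiguous), and the PARIS/FRANCE row is a made-up non-matching record.

```python
import sqlite3

ROWS = [
    ("JERSEY", None, "UK(JERSEY)"),
    ("IRELAND", None, "REPUBLIC OF IRELAND"),
    ("DOUGLAS", "ISLE OF MAN", "UK(ISLE OF MAN)"),
    ("PARIS", None, "FRANCE"),  # hypothetical row that should not match
]

# Direction used in the question: is the whole country inside the address lines?
ORIGINAL = """
    SELECT country FROM test_addresses
    WHERE COALESCE(addr4, '') || COALESCE(addr5, '') LIKE '%' || country || '%'
    ORDER BY rowid
"""

# Direction used in the answers: is an address line inside the country?
REVERSED = """
    SELECT country FROM test_addresses
    WHERE (addr4 IS NOT NULL AND country LIKE '%' || addr4 || '%')
       OR (addr5 IS NOT NULL AND country LIKE '%' || addr5 || '%')
    ORDER BY rowid
"""


def matching_countries(query):
    """Load the sample rows and return the countries selected by the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test_addresses (addr4 TEXT, addr5 TEXT, country TEXT)")
    conn.executemany("INSERT INTO test_addresses VALUES (?, ?, ?)", ROWS)
    result = [row[0] for row in conn.execute(query).fetchall()]
    conn.close()
    return result


print(matching_countries(ORIGINAL))
print(matching_countries(REVERSED))
```

The IS NOT NULL guards from the first answer matter on Oracle, where concatenating a NULL address line yields '%%' and would match every country.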