Background
To replace invalid zip codes.
Sample Data
Consider the following data set:
Typo | City | ST | Zip5
-------+------------+----+------
33967 | Fort Myers | FL | 33902
33967 | Fort Myers | FL | 33965
33967 | Fort Myers | FL | 33911
33967 | Fort Myers | FL | 33901
33967 | Fort Myers | FL | 33907
33967 | Fort Myers | FL | 33994
34115 |Marco Island| FL | 34145
34115 |Marco Island| FL | 34146
86405 | Kingman | FL | 86404
86405 | Kingman | FL | 86406
33967 closely matches 33965, although 33907 could also be correct. (In this case, 33967 is a valid zip code, but not in our zip code database.)
34115 closely matches is 34145 (off by one digit, with a difference of 3 for that digit).
86405 closely matches both.
Sometimes digits are simply reversed (e.g,. 89 instead of 98).
Question
How would you write a SQL statement that finds the "minimum distance" between multiple numbers that have the same number of digits, returning at most one result no matter what?
Ideas
Subtract the digits.
Use LIMIT 1.
Conditions
PostgreSQL 8.3
This sounds like a case for Levenshtein distance.
The Levenshtein distance between two
strings is defined as the minimum
number of edits needed to transform
one string into the other, with the
allowable edit operations being
insertion, deletion, or substitution
of a single character.
It looks like PostgreSQL has it built-in:
test=# SELECT levenshtein('GUMBO', 'GAMBOL');
levenshtein
-------------
2
(1 row)
http://www.postgresql.org/docs/8.3/static/fuzzystrmatch.html
Redfilter answered the question that was asked, but I just wanted to clarify that the requested solution will not resolve what appears to be the real problem.
The real problem here seems to be that you have a database which was hand keyed and some numbers were transcribed giving garbage data.
The ONLY way to solve this problem is to validate the full address against a database like the USPS, MapQuest, or another provider. I know the first two have API's available for doing this.
The example I gave in a comment above was to consider a zip of 75084 and a city value of Richardson. Richardson has zip codes in the range of 75080, 81, 82, 83, and 85. The minimum number of edits will be 1. However, which one?
Another equal problem is what if the entered zip code was 75083 for Richardson. Which is a valid zipcode for that city; however, what if the address resided in 75082?
The only way to get that is to have the full address validated.
Related
I'm wracking my brain trying to figure this out. I have a dataset / table that looks like this:
ID | Person1 | Person2 | Person3 | EffortPerPerson
01 | Bob | Ann | Frank | 2
02 | Frank | Bob | Joe | 3
03 | Ann | Joe | Beth | 1
I'm trying add up "Effort" for each person. For example, Bob is 2+3, Joe is 3+1, etc. My goal is to produce a PowerBI scatter plot showing total Effort for each person.
In a perfect world, the query shouldn't care how many "Person" fields there are. It should just count up the Effort value for every row that the individual's name appears.
I thought GROUP BY would work, but obviously that's only for one column, and I can't wrap my head around how to make nested queries work here.
Any one have any ideas? Thanks in advance!
As Nick suggested, you should go with the Unpivot transformation. Go to Edit Queries and select Transform tab:
Select columns you want to transform in rows, open dropdown menu under Unpivot Columns and select "Unpivot Only Selected Columns":
And that's it! Power BI will aggregate values for you:
I want to know what is two-way tables in SQL?
And how can i read these two-way tables
Two-way tables is no way of storing data, but of displaying data. It doesn't say anything about how the data is stored.
Let's say we store persons along with their IQ and the country they live in. The table may look like this:
name iq country
John Smith 125 GB
Mary Jones 150 GB
Juan Lopez 150 ES
Liz Allen 125 GB
The two-way table to show the relation between IQ and country would be:
| 125 | 150
---+------+----
GB | 2 | 1
ES | 0 | 1
or
| GB | ES
----+-----+---
125 | 2 | 0
150 | 1 | 1
In order to retrieve this data from the database you might write this query:
select iq, country, count(*)
from persons
group by iq, country;
SQL is meant to retrieve data; it is not really meant to care about it's presentation, the layout. So you'd write a program (in PHP, C#, Java, whatever) sending the query to the database, receiving the data and filling a GUI grid in a loop.
In rare cases SQL can provide the layout itself, i.e. give you the data in columns and rows. This is when the attributes of one dimensions are known beforehand. This is usually not the case with IQs or countries as in the example given (i.e. you wouldn't have known which countries and which IQs are present in the table, if I hadn't shown you). But of course you could have retrieved either the countries or the IQs first and then build the main query dynamically (with conditional aggregation or pivot). Another case when values are known beforehand is booleans, e.g. a flag in the persons table to show whether a person is homeless. If you wanted results for how many homeless persons in which countries, you could easily write a query with two columns for homeless and not homeless.
As mentioned: that you can display data in a two-way table doesn't say anything about how this data is stored. Above I showed you a one table example. But let's say you have stores in many cities and want to know in which cities live thinner or thicker people. You decide to check which t-shirt sizes you sold in which cities. So you take your clients orders, look up the clients and the cities they live in. You also look up the order details and the items they refer to, then take all items of type t-shirt. There are many tables involved, but the result is again a two-sided table showing the relation of two attributes. E.g:
city | S | M | L | XL
------------+-----+-----+-----+-----
New York | 5% | 8% | 7% | 10%
Los Angeles | 10% | 7% | 7% | 8%
Chicago | 1% | 4% | 6% | 11%
Houston | 2% | 2% | 5% | 7%
I am not a programmer, but doing Excel work for a small library. We have these fields in an excel sheet:
John | J | Smith | BMI | 123 | 100 |
Sarah | P | Crown | ASCAP | 564 | 100 |
Tommy | T | Stew | BMI | 134 | 100 |
Suzy | S | Smith | BMI | 678 | 50 |
John | J | Smith | BMI | 123 | 50
What I would like to be able to combine any of the cells (in the same row)into one cell that would read like this:
John J Smith, (BMI), 100%, IPI 123
or
Suzy S Smith, (BMI), 50%, IPI 678 | John J Smith (BMI), 50% IPI 123
I figured out how to use the Concatenate function to do this, but it doesn't skip empty cells, and I get extra "|" or "()" in those spots. I also found the =StringConCat topic, and that works great for skipping, but I can't figure out how to add the extra characters.
Any help would be most appreciated.
Thank you!!
EDIT: Thanks for the quick responses so far. I should be more clear -
the pipes in my example were only to designate different cells - they are not actual characters in the cells (thanks for converting it to a table for me, Bruce). The only Pipe character I would like to use is in the results, as in my example between Suzy and John.
There will rarely be more than 2 entries on the same result line, but it is possible. Mostly it will be to composers that are sharing the credit. But there is a chance that they will work on a Public Domain song and I have to list "Traditional" or maybe "Mozart" as another composer.
Sorry that I don't know enough to ask my question as intelligently as I should. Just learning how to do this, and trying to figure it out as I go.
Thanks again!
For the extra spaces, use substitute to get rid of empties.
So, if your code is =concatenate(A1,B1,C1) and your 'empty spaces' are "| " then edit your formula to become =substitute(concatenate(A1,B1,C1),"| ","")
You can even stack the substitutes to add more possible 'empties', like " " (two spaces) or the like. =substitute(substitute(concatenate(A1,B1,C1),"| ","")," ","")
This may be a strange question, but I'm wondering how users on sites like stackoverflow format the example resultsets when asking / answering questions? Is there a clean and easy way to create something like:
+----------+-------------------+
| Count(*) | MAX(created_date) |
+----------+-------------------+
| 234 | 10-may-14 |
| 847 | 03-Apr-14 |
+----------+-------------------+
It does a great job at representing everything in plain text, I'm just wondering if everyone takes the time to format that manually? Or is that an export option for some sort of database software? Something tells me I'm showing my inexperience here haha.
After much searching, I found what I was looking for! This simple tool lets you mock up your table/data in excel, then copy/paste in the default tab delimited format. It then converts it into the ASCII style table seen in my question. Worth a bookmark in my opinion.
http://www.sensefulsolutions.com/2010/10/format-text-as-table.html
Your example is:
+----------+-------------------+
| Count(*) | MAX(created_date) |
+----------+-------------------+
| 234 | 10-may-14 |
| 847 | 03-Apr-14 |
+----------+-------------------+
I think the following is a pretty desirable format:
Count(*) MAX(created_date)
234 10-may-14
847 03-Apr-14
(The only improvement is to represent the dates in ISO standard YYYY-MM-DD format. ;-)
To get this format, just prepend each row with four spaces. That creates the background shading and the fixed-width font. You can also omit the first four spaces and just type in what you want:
Count(*) MAX(created_date)
234 10-may-14
847 03-Apr-14
Oh. That looks ugly. Select the three lines and click on the {} just above the input box and the four spaces are added automagically.
Say I have an employee table, with a record for each employee in my company, and a column for supervisor (as seen below). I would like to prepare a report, which lists the names and title for each step in a supervision line. eg for dick robbins, 1d #15, i'd like a list of each supervisor in his "chain of command," all the way to the president, big cheese. I'd like to avoid using cursors, but if that's the only way to do this then that's ok.
id fname lname title supervisorid
1 big cheese president 1
2 jim william vice president 1
3 sally carr vice president 1
4 ryan allan senior manager 2
5 mike miller manager 4
6 bill bryan manager 4
7 cathy maddy foreman 5
8 sean johnson senior mechanic 7
9 andrew koll senior mechanic 7
10 sarah ryans mechanic 8
11 dana bond mechanic 9
12 chris mcall technician 10
13 hannah ryans technician 10
14 matthew miller technician 11
15 dick robbins technician 11
The real data probably won't be more than 10 levels deep...but I'd rather not just do 10 outside joins...I was hoping there was something better than that, and less involved than cursors.
Thanks for any help.
This is basically a port of the accepted answer on my question that I linked to in the OP comments.
you can use common-table expressions
WITH Family As
(
SELECT e.id, e.supervisorid, 0 as Depth
FROM Employee e
WHERE id = #SupervisorID
UNION All
SELECT e2.ID, e2.supervisorid, Depth + 1
FROM Employee e2
JOIN Family
On Family.id = e2.supervisorid
)
SELECT*
FROM Family
For more:
Recursive Queries Using Common Table Expressions
You might be interested in the "Materialized Path" solution, which does slightly de-normalize the table but can be used on any type of SQL database and prevents you from having to do recursive queries. In fact, it can even be used on no-SQL databases.
You just need to add a column which holds the entire ancestry of the object. For example, the table below includes a column named tree_path:
+----+-----------+----------+----------+
| id | value | parent | tree_path|
+----+-----------+----------+----------+
| 1 | Some Text | 0 | |
| 2 | Some Text | 0 | |
| 3 | Some Text | 2 | -2-|
| 4 | Some Text | 2 | -2-|
| 5 | Some Text | 3 | -2-3-|
| 6 | Some Text | 3 | -2-3-|
| 7 | Some Text | 1 | -1-|
+----+-----------+----------+----------+
Selecting all the descendants of the record with id=2 looks like this:
SELECT * FROM comment_table WHERE tree_path LIKE '-2-%' ORDER BY tree_path ASC
To build a tree, you can sort by tree_path to get an array that's fairly easy to convert to a tree.
You can also index tree_path and the index can be used when the wildcard is not at the beginning.
For example, tree_path LIKE '-2-%' can use the index, but tree_path LIKE '%-2-' cannot.
Some recursive function which either return the supervisor (if any) or null. Could be a SP which invokes itself as well, and using UNION.
SQL is a language for performing set operations and recursion is not one of them. Further, many database systems have limitations on recursion using stored procedures as a safety measure to prevent rogue code from running away with precious server resources.
So, when working with SQL always think 'flat', not 'hierarchical'. So I would highly recommend the 'tree_path' method that has been suggested. I have used the same approach and it works wonderfully and crucially, very robustly.