Struggling with a MySQL database of phone numbers - sql

My application wants to store a list of international phone number in a mysql database. Then the application would need to query the database and search for a specific number. Sounds simple but it's actually a huge problem.
Because users can search for that number in a different format we'll have to do a full scan to the database each time.
For example. We might have the number 17162225555 stored in the database (along with another 5 million entries). Now the user comes along and attempt to search using 7162225555. Another user might try to serach with 2225555. etc etc. So in other words, the database have to issue the SQL query using a "like %number%" which would result in a full scan.
How should we design this application? Is there some way to tweat the Mysql to handle this better? Or should we not use SQL at all?
PS. We have millions of entries, and 10s of these search request per second.

This is very weird, I've struggled with this issue myself many times, over the last 15 years and generally come up with structures that separate area codes, country codes and number into separate fields etc. But whilst reading your question another solution just popped into my head, it does require a separate field though so may not be appropriate for you.
You could have a separate field called reverse_phone_number, have this automatically populated by the DB engine then when people search simply reverse the search string and use the indexed reverse field with just a % at the end of the like string, thereby allowing the use of the index.
Dependant on your DB engine you may be able to create an index based on a user-defined function that does the reverse for you obviating the need for an additional field.
In some countries, e.g. the UK, you may have an issue with leading zeros. A UK phone number is represented as (area code)(Phone Number) e.g. 01634 511098, when this is internationalised the leading zero of the area code is removed and the international dial code (+ or 00) and the country code (44) are added. This results in an international phone number of +441634511098. Any user searching for 0163451109 would not find the phone number if it was entered in internationalised format. You can overcome this issue by removing leading zeros from the search string.
EDIT
Based on suggestions from Ollie Jones you should store the number as entered by the user and then strip leading zeros, punctuation and white space from the number before reversing and storing in the reversed field. Then simply use the same algorithm to strip the search string before reversing, find the record and then display the originally entered number back to the user.

Related

sql search for data format

Is it possible to search an SQL database by an entry's format?
I'm specifically trying to find instances in a database where the structure of an entry is A##.# (Any letter, followed by 2 numbers, a decimal, then a final number)
And in case it could cause complications or may need separate attention, would the same code be able to find instances where there is no decimal, just A##?

data sanitization/clean-up

Just wondering…
We have a table where the data in certain fields is alphanumeric, comprising a 1-2 digit alpha followed by a 1-2 digit number e.g. x2, x53, yz1, yz95
The number of letters added before the number can be determined by the field so that certain fields will always have the same 1 letter added before the number while others will always have the same 2 letters.
For each field, the actual letters and number of letters added (1 or 2) are always the same, thus, we can always tell which letters appear before the numbers just via the field names.
For the purposes of all downstream data analysis, it is only ever the numeric value from the string which is important.
Sql queries are constructed dynamically behind a user form where the final sql can take many forms depending on which selections and switches the user has chosen. With this, the VBA generating the sql constructs is fairly involved, containing many conditions/variable pathways to the final sql construct.
With this, it would make the VBA and sql much easier to write, read, debug, and perhaps increase the sql execution speed, etc. – if we were only dealing with a numeric datatype e.g. I wouldn’t need to accommodate the many apostrophes within the numerous lines of “strSQL = strSQL & …”
Given that the data itself being analysed is a copy that’s imported via regular .csv extracts from a live source, would it be acceptable to pre sanitize/clean-up these fields around the import stage by converting the data within to numeric values and field datatypes?
- perhaps either by modifying the sql used to generate the extract or by modifying the schema/vba process used to import the extract into the analysis table e.g. using something like a Replace function such as “ = Replace(OriginalField,”yz”,””) “ to strip out the yz characters.
Yes, link the csv "as is", and for each linked table create a straight select query that does the sanitization, like:
Select
Val(Mid([Field1], 2)) As NumField1,
Val(Mid([Field2], 1)) As NumField2,
etc.
Val(Mid([FieldN], 2)) As NumFieldN
From
YourLinkedCsvTable
then use this query throughout your application when you need the data.

Yahoo finance API field misalignment

I'm trying to use the Yahoo Finance API to create a custom csv but depending upon the stock there is field misalignment.
For instance, if I just want to download the "k3" field for yahoo which corresponds to last trade size, I would craft the url like so:
http://finance.yahoo.com/d/quotes.csv?s=yhoo&f=k3
However, if you download that csv there are two columns of data.
Similarly, if I decide to get Float Shares , I want the url:
http://finance.yahoo.com/d/quotes.csv?s=yhoo&f=f6
However that gives me 3 columns. Is there a way to get it in exactly one column? I want to automate this process but the different column orientations make it very difficult as different rows then have different column lengths and I am unable to easily match up the column name with the row.
Bonus: If someone can explain where the 3 float share numbers come from that would be great, I seem to only be able to match up the first to the site...
Thank you for your help!
In short, you are describing known bugs that Yahoo isn't going to fix as the feed is officially unsupported.
Specifically re. Float (f6): the number returned is the entire float. It is not 3 csv numbers. The commas are not delimiters; rather, they are 1,000s separators. (I suspect the same is the case with K3. As it is with a couple of other known numbers. (See link below.))
Two solutions:
(1) Write your own workaround using conditional statements (if or case) in your code.
(2) Download the buggy parameters separately from the clean ones.
See: Yahoo's official reply to your question.
The multiple columns is because that excel (or whatever csv viewer you are using) treats "thousand-seperator" as the the "comma-seperator". We used to have this problem in our school project, and found a hack which is good only if you are using this api for some hobby project and not concerning data usage.
The idea is instead of treating the results as a csv, pick a static column (column A) where you will know the value beforehand (e.g. column 's' stock symbol) or put this value as the first column. When constructing the query, use this column to surround those columns (float columns) with formatting problem. once you get the quotes.csv, manual seperate the results on the column A value.
for example using
http://download.finance.yahoo.com/d/quotes.csv?s=yhoo&f=sf6sa5sb6
will get you
"YHOO", 887,675,000,"YHOO",400,"YHOO",N/A
Then use ,"YHOO", to seperate the results (excluding first column).
Not an elegent way to solve the problem, but at least it gives you the correct result.

Is there a way to make the Sql Server FullText engine to stop recognizing numbers as numbers?

I have an user input processor that transforms this search into "this" AND "search". My processor works great, it breaks the user input into pieces according to what I believe to be word breakers, which include punctuation.
The problem is that, while indexing, Sql Server applies a special approach when it comes to numbers.
if you index the following string:
12,20.1231,213.23,45,345.234.324.556,234.534.345
it will find the following index keys:
12
1231
20
213
23,45 (detected decimal separator)
234.534.345 (detected thousand separator)
345.234.324.556 (detected thousand separator)
Now, if the user searches for 324 he/she won't find anything, because 324 is contained within the last entry, not in the beggining, so it is not going to be found. I'd like it to stop treating numbers as numbers and just index it the way it does with words.
Is there a way to alter this behavior? without implementing too much code?
Not sure if I understand the question correctly, but if you say SQL is finding numbers, where it should find strings then casting them as strings seems to be the solution.
select cast(345234324556 as nvarchar)

How do you know when to use varchar and when to use text in sql?

It seems like a very arbitrary decision.
Both can accomplish the same thing in most cases.
By limiting the varchar length seems to me like you're shooting yourself in the foot cause you never know how long of a field you will need.
Is there any specific guideline for choosing VARCHAR or TEXT for your string fields?
I will be using postgresql with the sqlalchemy orm framework for python.
In PostgreSQL there is no technical difference between varchar and text
You can see a varchar(nnn) as a text column with a check constraint that prohibits storing larger values.
So each time you want to have a length constraint, use varchar(nnn).
If you don't want to restrict the length of the data use text
This sentence is wrong:
By limiting the varchar length seems to me like you're shooting yourself in the foot cause you never know how long of a field you will need.
If you are saving, for example, MD5 hashes you do know how large the field is your storing and your storage becomes more efficient. Other examples are:
Usernames (64 max)
Passwords (128 max)
Zip codes
Addresses
Tags
Many more!
In brief:
Variable length fields save space, but because each field can have different length, it makes table operations slower
Fixed length fields make table operations fast, although must be large enough for the maximum expected input, so can use more space
Think of an analogy to arrays and linked lists, where arrays are fixed length fields, and linked lists are like varchars. Which is better, arrays or linked lists? Lucky we have both, because they are both useful in different situations, so too here.
In the most cases you do know what the max length of a string in a field is. In case of a first of lastname you don't need more then 255 characters for example. So by design you choose wich type to use, if you always use text you're wasting resources
Check this article on PostgresOnline, it also links to two other usefull articles.
Most problems with TEXT in PostgreSQL occur when you're using tools, applications and drivers that treat TEXT very different from VARCHAR because other databases behave very different with these two datatypes.
Database designers almost always know how many characters a column needs to hold. US delivery addresses need to hold up to 64 characters. (The US Postal Service publishes addressing guidelines that say so.) US ZIP codes are 5 characters long.
A database designer will look at representative sample data from her clients when she's specifying columns. She'll ask herself, questions like "What's the longest product name?" And when the answer is "70 characters", she won't make the column 3000 characters wide.
VARCHAR has a limit of 8k in SQL Server (I think). Most applications don't require nearly that much storage for a single column.