Is there a way to make the SQL Server full-text engine stop recognizing numbers as numbers?

I have a user input processor that transforms a search like this search into "this" AND "search". My processor works great: it breaks the user input into pieces at what I believe to be word breakers, which include punctuation.
The problem is that, while indexing, SQL Server treats numbers specially.
If you index the following string:
12,20.1231,213.23,45,345.234.324.556,234.534.345
it will find the following index keys:
12
1231
20
213
23,45 (detected decimal separator)
234.534.345 (detected thousand separator)
345.234.324.556 (detected thousand separator)
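You can see the same tokenization yourself with something like this sketch (assuming SQL Server 2008 or later, which exposes sys.dm_fts_parser; 1033 is the English LCID and the last two arguments pick the stoplist and accent sensitivity):
-- inspect how the word breaker splits the sample string into index keys
SELECT display_term, special_term, occurrence
FROM sys.dm_fts_parser('"12,20.1231,213.23,45,345.234.324.556,234.534.345"', 1033, 0, 0)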
Now, if the user searches for 324, he/she won't find anything, because 324 is contained within the last entry rather than at its beginning, so it is not going to be found. I'd like the engine to stop treating numbers as numbers and just index them the way it does words.
Is there a way to alter this behavior without implementing too much code?

Not sure if I understand the question correctly, but if SQL is finding numbers where it should find strings, then casting them as strings seems to be the solution.
select cast(345234324556 as nvarchar)
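Note that cast(... as nvarchar) without a length defaults to 30 characters in SQL Server, which is enough for this value, but it is better to specify the length explicitly, e.g. nvarchar(30).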

Related

How to find Bad characters in the column

I am trying to pull the 'COURSE_TITLE' column value from the 'PS_TRAINING' table in PeopleSoft and write it into a UTF-8 text file to be loaded into the Workday system. The file is erroring out while loading because of bad characters (Ã, â and many more) present in the column. I have used a procedure which converts non-ASCII values into spaces, but because of this procedure, 'COURSE_TITLE' values written in non-English languages like Chinese, Korean and Spanish are also being replaced with spaces.
I even tried using regular expressions (regexp_like(course_title, 'Ã')) to find the bad characters, but since the table has hundreds of thousands of rows, it would be difficult to find all bad characters this way. Please suggest a way to solve this.
If you change your approach, this may work.
Define what you want, and retrieve it.
-- rows whose title contains at least one character outside the allowed set
select *
from PS_TRAINING
where regexp_like(course_title, '[^ 0-9A-Za-z]')
If you take out too much data, just add it to the regex
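Another rough option, assuming the database behind PeopleSoft here is Oracle (regexp_like is Oracle syntax): compare each title with its ASCII-only form. ASCIISTR() rewrites every non-ASCII character as a \XXXX escape, so rows where the two differ contain non-ASCII data. This flags legitimate Chinese, Korean or Spanish titles as well, so treat it as a way to narrow down the rows to review rather than as a definition of bad characters.
-- rows whose title contains at least one non-ASCII character
select course_title
from PS_TRAINING
where course_title <> asciistr(course_title)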

No space before ORDER BY - Why does this work?

I came across some SQL in an application which had no space before the ORDER BY clause. I was surprised that this even works.
Given a table of numbers called [counter], with a single column counter_id holding an incrementing list of integers, this SQL works fine in Microsoft SQL Server 2012:
select
*
FROM [counter] c
where c.counter_id = 1000ORDER by counter_id
This also works with strings, e.g.:
WHERE some_string = 'test'ORDER BY something
My question is: are there any potential pitfalls or dangers with this query? And conversely, are there any benefits, other than saving, what, 8 bits of network traffic for that whitespace (which may well be a consideration in some applications)?
Let me explain the reason why this works with numbers and strings.
The reason is that numbers cannot start identifiers, unless the name is escaped. Basically, the first thing that happens to a SQL query is tokenization: the components of the query are broken into identifiers and keywords, which are then analyzed.
In SQL Server, keywords and identifiers and function names (and so on) cannot start with a digit (unless the name is escaped, of course). So, when the tokenizer encounters a digit, it knows that it has a number. The number ends when a non-digit character is encountered. So, a sequence of characters such as 1000ORDER BY is easily turned into three tokens, 1000, ORDER, and BY.
Similarly, the first time that a single quote is encountered, it always represents a string literal. The string literal ends when the final single quote is encountered. The next set of characters represents another token.
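As a rough illustration of both rules (a sketch only, in SQL Server syntax; other databases tokenize differently):
-- "1WHERE" splits into the number 1 and the keyword WHERE because a non-digit ends the number;
-- the quotes delimit the string literal, so 'abc'AS splits into the literal and the keyword AS
SELECT 1WHERE 1 = 1
SELECT'abc'AS col1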
Let me add that there is exactly zero reason to ever use these nuances. First, these rules are properties of SQL Server's tokenization and do not necessarily apply to other databases. Second, the purpose of SQL is for humans to be able to express queries. It is way, way more important that we read them.
As jarlh mentioned, there may be differences in how the tokens are scanned and parsed, but the execution plan is created correctly either way, so there is no significant advantage or disadvantage.
When the parser examines the characters, it checks for keywords, identifiers and string constants, and matches the overall semantic and syntactic structure of the language. Since ORDER BY is a keyword and the SQL parser knows its possible syntactic locations in a query, it will interpret it accordingly without throwing any error. This is why your ORDER BY does not throw an error.
See also: Parsing sql query and Parsing SQL

Lucene: problems when keeping punctuation

I need Lucene to keep some punctuation marks when indexing my texts, so I'm now using a WhitespaceAnalyzer which doesn't remove the symbols.
If there is a sentence like oranges, apples and bananas in the text, I want the phrase query "oranges, apples" (with the comma, not without it) to be a match, and this is working fine.
However, I'd also want the simple query oranges to produce a hit, but it seems the indexed token contains the comma too (oranges,), so it won't be a match unless I write the comma in the query too, which is undesirable.
Is there any simple way to make this work the way I need?
Thanks in advance.
I know this is a very old question, but I'm bored and I'll reply anyway. I see two ways of doing this:
Creating a TokenFilter that adds a synonym (i.e. inserts an extra token into the stream with a position increment of 0) containing the non-punctuation version whenever a word contains punctuation.
Adding another field with the same content, but analyzed with a standard tokenizer that removes all punctuation; queries can then match against either field.

How do you know when to use varchar and when to use text in sql?

It seems like a very arbitrary decision.
Both can accomplish the same thing in most cases.
Limiting the varchar length seems to me like shooting yourself in the foot, because you never know how long a field you will need.
Is there any specific guideline for choosing VARCHAR or TEXT for your string fields?
I will be using postgresql with the sqlalchemy orm framework for python.
In PostgreSQL there is no technical difference between varchar and text.
You can see a varchar(nnn) as a text column with a check constraint that prohibits storing larger values.
So each time you want to have a length constraint, use varchar(nnn).
If you don't want to restrict the length of the data, use text.
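To make that concrete, a minimal sketch of the equivalence (PostgreSQL syntax; the table names are just for illustration):
-- both columns accept at most 50 characters; varchar(50) is essentially shorthand
-- for the explicit check constraint on the text column
CREATE TABLE person_a (name varchar(50));
CREATE TABLE person_b (name text CHECK (char_length(name) <= 50));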
This sentence is wrong:
Limiting the varchar length seems to me like shooting yourself in the foot, because you never know how long a field you will need.
If you are saving, for example, MD5 hashes, you do know how large the field you're storing is, and your storage becomes more efficient. Other examples are:
Usernames (64 max)
Passwords (128 max)
Zip codes
Addresses
Tags
Many more!
In brief:
Variable length fields save space, but because each value can have a different length, they make table operations slower
Fixed length fields make table operations fast, although they must be large enough for the maximum expected input, so they can use more space
Think of an analogy to arrays and linked lists, where arrays are fixed length fields, and linked lists are like varchars. Which is better, arrays or linked lists? Lucky we have both, because they are both useful in different situations, so too here.
In most cases you do know what the max length of a string in a field is. For a first or last name, for example, you don't need more than 255 characters. So you choose which type to use by design; if you always use text you're wasting resources.
Check this article on PostgresOnline; it also links to two other useful articles.
Most problems with TEXT in PostgreSQL occur when you're using tools, applications and drivers that treat TEXT very differently from VARCHAR, because other databases handle these two datatypes very differently.
Database designers almost always know how many characters a column needs to hold. US delivery addresses need to hold up to 64 characters. (The US Postal Service publishes addressing guidelines that say so.) US ZIP codes are 5 characters long.
A database designer will look at representative sample data from her clients when she's specifying columns. She'll ask herself questions like "What's the longest product name?" And when the answer is "70 characters", she won't make the column 3000 characters wide.
VARCHAR has a limit of 8k in SQL Server (I think). Most applications don't require nearly that much storage for a single column.

Struggling with a MySQL database of phone numbers

My application needs to store a list of international phone numbers in a MySQL database. The application then needs to query the database and search for a specific number. Sounds simple, but it's actually a huge problem.
Because users can search for that number in different formats, we'll have to do a full scan of the database each time.
For example, we might have the number 17162225555 stored in the database (along with another 5 million entries). Now a user comes along and attempts to search using 7162225555. Another user might try to search with 2225555, and so on. So in other words, the database has to satisfy a SQL query using LIKE '%number%', which results in a full scan.
How should we design this application? Is there some way to tweak MySQL to handle this better? Or should we not use SQL at all?
PS: We have millions of entries, and tens of these search requests per second.
This is very weird; I've struggled with this issue myself many times over the last 15 years and generally come up with structures that separate country codes, area codes and the number itself into separate fields. But whilst reading your question another solution just popped into my head. It does require a separate field, though, so it may not be appropriate for you.
You could have a separate field called reverse_phone_number, have it automatically populated by the DB engine, and then, when people search, simply reverse the search string and query the indexed reversed field with a % only at the end of the LIKE pattern, thereby allowing the use of the index.
Depending on your DB engine, you may be able to create an index based on a user-defined function that does the reversing for you, obviating the need for an additional field.
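A rough sketch of what that could look like in MySQL (5.7 or later for generated columns; the table and column names are made up for illustration):
-- keep a reversed copy of the number, maintained by the engine, and index it
ALTER TABLE phone_numbers
    ADD COLUMN reverse_phone_number VARCHAR(20)
        GENERATED ALWAYS AS (REVERSE(phone_number)) STORED,
    ADD INDEX idx_reverse_phone (reverse_phone_number);
-- a search for 2225555 then becomes an index-friendly prefix search
SELECT *
FROM phone_numbers
WHERE reverse_phone_number LIKE CONCAT(REVERSE('2225555'), '%');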
In some countries, e.g. the UK, you may have an issue with leading zeros. A UK phone number is represented as (area code)(phone number), e.g. 01634 511098; when this is internationalised, the leading zero of the area code is removed and the international dial prefix (+ or 00) and the country code (44) are added. This results in an international phone number of +441634511098. A user searching for 01634511098 would not find the phone number if it was entered in internationalised format. You can overcome this issue by removing leading zeros from the search string.
EDIT
Based on suggestions from Ollie Jones, you should store the number as entered by the user, then strip leading zeros, punctuation and white space from it before reversing it and storing the result in the reversed field. Then simply use the same algorithm to strip the search string before reversing it, find the record, and display the originally entered number back to the user. A sketch of the stripping step follows below.
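A sketch of that stripping-and-reversing expression (assuming MySQL 8.0 or later for REGEXP_REPLACE; the input value here is just an example):
-- remove everything that is not a digit, drop leading zeros, then reverse
SELECT REVERSE(TRIM(LEADING '0' FROM REGEXP_REPLACE('01634 511-098', '[^0-9]', '')))
The same expression can be applied to the search string before doing the reversed prefix search.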