T-SQL CONTAINS and CONTAINSTABLE and punctuation like the period / dot

T-SQL CONTAINS and CONTAINSTABLE and punctuation like the period / dot - sql-server-2012

Is there a way to make full-text search in SQL Server 2012 a little more "naive", removing the "intelligence" that causes it to treat the [.] character as a word separator?
We have meaningful strings that contain dots and want to treat the entire string as a contiguous token, not as separate chunks.
Is there a place where such "punctuation" marks are defined, which can be easily edited to remove the "."?

Related

Field Name including a period gives me error (using brackets)

I put together an Access Database for a department. They've been using it frequently for the past few months with no hiccups.
However, they changed one of the field names of a linked Excel File, which forces me to go into Access and update the query a bit.
The field name has gone from "PacU" to "Mr. Cooper"
Original:
SELECT Round(BidTemplate.[PacU],6) AS PacU
New:
SELECT Round(BidTemplate.[Mr. Cooper],6) AS [Mr. Cooper]
I am receing an error as follows "Invalid bracketing of the name 'BidTeample.[Mr.Cooper]'.
I'm sure the issue is driven off of the period that is now included in the field. But shouldn't the brackets take care of this?
What am I missing?

Field names cannot contain a period.
From the MS Access Documentation:
Names of fields, controls, and objects in Microsoft Access desktop databases:
Can be up to 64 characters long.
Can include any combination of letters, numbers, spaces, and special characters except a period (.), an exclamation point (!), an
accent grave (`), and brackets ([ ]).
Can't begin with leading spaces.
Can't include control characters (ASCII values 0 through 31).
Can't include a double quotation mark (") in table, view, or stored procedure names in a Microsoft Access project.

remove extra space
SELECT Round(BidTemplate.[Mr Cooper],6) AS [Mr Cooper]

Regex to validate if string is valid SQL column name

I am searching a regex to validate if a string could be a valid SQL column name.
I would like to use PCRE syntax.
Up to now I found this:
[\w-]+
But I think this is not enough. I have seen the / too (in SAP).
AFAIK the spec is closed source (you need to pay for it).
From the docs (Python re):
\w
Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the
underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
How does the regex look like to validate SQL column names?
The string should be able to used like this my_column.
AFAIK reserved words are valid, since you can use them like this:
select * from my_table where "where" = 'here'
"where" is the name of a column. The regex does not need to care for reserved words.

The manual clarifies:
SQL identifiers and key words must begin with a letter (a-z, but also
letters with diacritical marks and non-Latin letters) or an underscore
(_). Subsequent characters in an identifier or key word can be
letters, underscores, digits (0-9), or dollar signs ($). Note that
dollar signs are not allowed in identifiers according to the letter of
the SQL standard, so their use might render applications less
portable. The SQL standard will not define a key word that contains
digits or starts or ends with an underscore, so identifiers of this
form are safe against possible conflict with future extensions of the
standard.
The system uses no more than NAMEDATALEN-1 bytes of an identifier;
longer names can be written in commands, but they will be truncated.
By default, NAMEDATALEN is 64 so the maximum identifier length is 63
bytes. If this limit is problematic, it can be raised by changing the
NAMEDATALEN constant in src/include/pg_config_manual.h.
And:
There is a second kind of identifier: the delimited identifier or
quoted identifier. It is formed by enclosing an arbitrary sequence of
characters in double-quotes ("). [...]
Quoted identifiers can contain any character, except the character
with code zero. (To include a double quote, write two double quotes.)
This allows constructing table or column names that would otherwise
not be possible, such as ones containing spaces or ampersands. The
length limitation still applies.
There is more, you can even use escaped unicode characters like: U&"d\0061t\+000061". Read the whole chapter.
So any character, except the character with code zero is allowed in a valid identifier, once the name is double-quoted. And without double-quotes, even simple strings like 'select' may be invalid if they happen to be reserved words. (The concept of reserved words is an unfortunate one, set by the SQL standard, hard to change now.)
You might just let Postgres do the work, using quote_ident():
SELECT quote_ident('0of') = '0of';
Quotes are added only if necessary.
The expression returns true for valid identifiers. Or just used the result of quote_ident('$identifier') to get a legal name in either case (quoted if necessary).

If we follow the PostgreSQL documentation:
SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($). Note that dollar signs are not allowed in identifiers according to the letter of the SQL standard [...]
we could write a regular expression for identifiers like this:
^([[:alpha:]_][[:alnum:]_]*|("[^"]*")+)$
The second branch of the regular expression takes care of quoted identifiers.

When to use single quotes in an SQL statement?

I know that I should use it when I deal with data of TEXT type (and I guess the ones that fall back to TEXT), but is it the only case?
Example:
UPDATE names SET name='Mike' WHERE id=3
I'm writing an SQL query auto generation in C++, so I want to make sure I don't miss cases, when I have to add quotes.

Single quotes (') denote textual data, as you noted (e.g., 'Mike' in your example). Numeric data (e.g., 3 in your example), object (table, column, etc) names and syntactic elements (e.g., update, set, where) should not be wrapped in quotes.

The single quote is the delimiter for the string. It lets the parser know where the string starts and where it ends as well as that is is a string. You will find that sometimes you get away with a double quote too.
The only way to be certain you don't miss any cases would be to escape the input, otherwise this will be vulnerable to abuse when somehow a single quote ends up in in the text.

Return sql rows where field contains ONLY non-alphanumeric characters

I need to find out how many rows in a particular field in my sql server table, contain ONLY non-alphanumeric characters.
I'm thinking it's a regular expression that I need along the lines of [^a-zA-Z0-9] but Im not sure of the exact syntax I need to return the rows if there are no valid alphanumeric chars in there.

SQL Server doesn't have regular expressions. It uses the LIKE pattern matching syntax which isn't the same.
As it happens, you are close. Just need leading+trailing wildcards and move the NOT
WHERE whatever NOT LIKE '%[a-z0-9]%'

If you have short strings you should be able to create a few LIKE patterns ('[^a-zA-Z0-9]', '[^a-zA-Z0-9][^a-zA-Z0-9]', ...) to match strings of different length. Otherwise you should use CLR user defined function and a proper regular expression - Regular Expressions Make Pattern Matching And Data Extraction Easier.

This will not work correctly, e.g. abcÑxyz will pass thru this as it has a,b,c... you need to work with Collate or check each byte.

How can you query a SQL database for malicious or suspicious data?

Lately I have been doing a security pass on a PHP application and I've already found and fixed one XSS vulnerability (both in validating input and encoding the output).
How can I query the database to make sure there isn't any malicious data still residing in it? The fields in question should be text with allowable symbols (-, #, spaces) but shouldn't have any special html characters (<, ", ', >, etc).
I assume I should use regular expressions in the query; does anyone have prebuilt regexes especially for this purpose?

If you only care about non-alphanumerics and it's SQL Server you can use:
SELECT *
FROM MyTable
WHERE MyField LIKE '%[^a-z0-9]%'
This will show you any row where MyField has anything except a-z and 0-9.
EDIT:
Updated pattern would be: LIKE '%[^a-z0-9!-# ]%' ESCAPE '!'
I had to add the ESCAPE char since you want to allow dashes -.

For the same reason that you shouldn't be validating input against a black-list (i.e. list of illegal characters), I'd try to avoid doing the same in your search. I'm commenting without knowing the intent of the fields holding the data (i.e. name, address, "about me", etc.), but my suggestion would be to construct your query to identify what you do want in your database then identify the exceptions.
Reason being there are just simply so many different character patterns used in XSS. Take a look at the XSS Cheat Sheet and you'll start to get an idea. Particularly when you get into character encoding, just looking for things like angle brackets and quotes is not going to get you too far.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas