Can you use a field alias with a space in Google BigQuery?

We need to create descriptive aliases for fields. Ideally we would like to create views whose aliases contain spaces. Is this possible? How can we do this?
Example:
SELECT word, word_count "Word Count" FROM [publicdata:samples.shakespeare] LIMIT 1000

No, the rules for field names (and aliases) in BigQuery are quite simple, and I quote:
Fields must contain only letters, numbers, and underscores, start with
a letter or underscore, and be at most 128 characters long.
As you see, spaces, quote characters, and other punctuation, are not allowed. Feel free to open a feature request at https://code.google.com/p/google-bigquery/ (explaining your use case, esp. why using underscores in lieu of spaces is not acceptable) -- or star an existing FR at https://code.google.com/p/google-bigquery/issues/list if it coincides with your requirements.
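The quoted rule is simple enough to check before a query is built. Here is a minimal sketch in Python, assuming that "letters" means ASCII letters (the quote does not say). The helper names `is_valid_field_name` and `to_field_alias` are made up for illustration and are not part of any BigQuery client library.

```python
import re

# Rule quoted above: letters, numbers, underscores; starts with a letter or
# underscore; at most 128 characters. ASCII-only letters are an assumption.
FIELD_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,127}")


def is_valid_field_name(name: str) -> bool:
    """Return True if `name` satisfies the quoted field-name rule."""
    return FIELD_NAME_RE.fullmatch(name) is not None


def to_field_alias(label: str) -> str:
    """Turn a descriptive label into a legal alias (hypothetical helper).

    Every run of disallowed characters becomes a single underscore, and an
    underscore is prepended if the result would start with a digit.
    """
    alias = re.sub(r"[^A-Za-z0-9_]+", "_", label.strip())
    if not alias or alias[0].isdigit():
        alias = "_" + alias
    return alias[:128]


print(is_valid_field_name("Word Count"))  # False
print(to_field_alias("Word Count"))       # Word_Count
```

With this, the alias in the question would become Word_Count, which is the underscore workaround the answer alludes to.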

Related

Regex to validate if string is valid SQL column name

I am looking for a regex that validates whether a string could be a valid SQL column name.
I would like to use PCRE syntax.
Up to now I found this:
[\w-]+
But I think this is not enough. I have seen the / too (in SAP).
AFAIK the spec is closed source (you need to pay for it).
From the docs (Python re):
\w
Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the
underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
What does the regex look like to validate SQL column names?
The string should be usable like this: my_column.
AFAIK reserved words are valid, since you can use them like this:
select * from my_table where "where" = 'here'
"where" is the name of a column. The regex does not need to care for reserved words.
The PostgreSQL manual clarifies:
SQL identifiers and key words must begin with a letter (a-z, but also
letters with diacritical marks and non-Latin letters) or an underscore
(_). Subsequent characters in an identifier or key word can be
letters, underscores, digits (0-9), or dollar signs ($). Note that
dollar signs are not allowed in identifiers according to the letter of
the SQL standard, so their use might render applications less
portable. The SQL standard will not define a key word that contains
digits or starts or ends with an underscore, so identifiers of this
form are safe against possible conflict with future extensions of the
standard.
The system uses no more than NAMEDATALEN-1 bytes of an identifier;
longer names can be written in commands, but they will be truncated.
By default, NAMEDATALEN is 64 so the maximum identifier length is 63
bytes. If this limit is problematic, it can be raised by changing the
NAMEDATALEN constant in src/include/pg_config_manual.h.
And:
There is a second kind of identifier: the delimited identifier or
quoted identifier. It is formed by enclosing an arbitrary sequence of
characters in double-quotes ("). [...]
Quoted identifiers can contain any character, except the character
with code zero. (To include a double quote, write two double quotes.)
This allows constructing table or column names that would otherwise
not be possible, such as ones containing spaces or ampersands. The
length limitation still applies.
There is more: you can even use escaped Unicode characters, as in U&"d\0061t\+000061". Read the whole chapter.
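To see what that escaped identifier stands for, the two escape forms (a backslash followed by four hex digits, or a backslash, a plus sign and six hex digits) can be decoded by hand. The following Python sketch does only that. The function name is hypothetical, and it deliberately ignores the doubled-backslash case and custom escape characters.

```python
import re

# Matches the two escape forms: \XXXX (4 hex digits) and \+XXXXXX (6 hex digits).
_ESCAPE_RE = re.compile(r"\\(?:\+([0-9A-Fa-f]{6})|([0-9A-Fa-f]{4}))")


def decode_unicode_identifier(body: str) -> str:
    """Decode the text between the quotes of a U&"..." identifier.

    Simplified: handles only the default backslash escape character and does
    not treat a doubled backslash as a literal backslash.
    """
    def replace(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    return _ESCAPE_RE.sub(replace, body)


print(decode_unicode_identifier(r"d\0061t\+000061"))  # data
```

Both escapes encode code point 0x61, the letter a, so the identifier is simply data.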
So any character, except the character with code zero is allowed in a valid identifier, once the name is double-quoted. And without double-quotes, even simple strings like 'select' may be invalid if they happen to be reserved words. (The concept of reserved words is an unfortunate one, set by the SQL standard, hard to change now.)
You might just let Postgres do the work, using quote_ident():
SELECT quote_ident('0of') = '0of';
Quotes are added only if necessary.
The expression returns true for valid identifiers. Or just use the result of quote_ident('$identifier') to get a legal name in either case (quoted if necessary).
If we follow the PostgreSQL documentation:
SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($). Note that dollar signs are not allowed in identifiers according to the letter of the SQL standard [...]
we could write a regular expression for identifiers like this:
^([[:alpha:]_][[:alnum:]_]*|("[^"]*")+)$
The second branch of the regular expression takes care of quoted identifiers.
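Since the question asked for PCRE-style syntax and quoted the Python re docs, note that Python's re module does not support POSIX classes such as [[:alpha:]]. A rough Python equivalent, offered as a sketch rather than a faithful port, could look like this. The name `looks_like_identifier` is made up, and like the expression above it leaves out the dollar sign and does not check the length limit.

```python
import re

# [^\W\d] means "a word character that is not a digit": a letter or underscore.
# The second branch accepts one or more adjacent "..." runs, which is how a
# doubled quote inside a quoted identifier ends up being matched.
IDENTIFIER_RE = re.compile(r'(?:[^\W\d]\w*|(?:"[^"]*")+)')


def looks_like_identifier(text: str) -> bool:
    """Return True if `text` matches the unquoted or the quoted form."""
    return IDENTIFIER_RE.fullmatch(text) is not None


for candidate in ["my_column", '"where"', '"say ""hi"""', "0of", "my-column"]:
    print(candidate, looks_like_identifier(candidate))
```

The first three candidates are accepted and the last two rejected. One known gap: the quoted branch also accepts an empty pair of quotes, so an extra check would be needed if zero-length identifiers must be refused.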

T-SQL CONTAINS and CONTAINSTABLE and punctuation like the period / dot

Is there a way to make full-text search in SQL Server 2012 a little more "naive", removing the "intelligence" that causes it to treat the [.] character as a word separator?
We have meaningful strings that contain dots and want to treat the entire string as a contiguous token, not as separate chunks.
Is there a place where such "punctuation" marks are defined, which can be easily edited to remove the "."?

What are pros and cons of using special characters in SQL identifiers?

Should I avoid special characters like "é á ç" in SQL table names and column names?
What are the pros and cons of using special characters?
As you can guess, there are pros and cons. This is more or less a subjective question.
SQL (unlike most programming languages) allows you to use special characters, whitespace, punctuation, or reserved words in your table or column identifiers.
It's pretty nice that people have the choice to use appropriate characters for their native language.
Especially in cases where a word changes its meaning significantly when spelled with the closest ASCII characters: e.g. año vs. ano.
But the downside is that if you do this, you have to use "delimited identifiers" every time you reference the table with special characters. In standard SQL, delimited identifiers use double-quotes.
SELECT * FROM "SELECT"
This is actually okay! If you want to use an SQL reserved word as a table name, you can do it. But it might cause some confusion for some readers of the code.
Likewise if you use special non-ASCII characters, it might make it hard for English-speaking programmers to maintain the code, because they are not familiar with the key sequence to type those special characters. Or they might forget that they have to delimit the table names.
SELECT * FROM "año"
Then there are non-standard delimited identifiers. Microsoft uses square brackets by default:
SELECT * FROM [año]
And MySQL uses back-ticks by default:
SELECT * FROM `año`
Though both can use the standard double-quotes as identifier delimiters if you enable certain options, you can't always rely on that, and if the option gets disabled, your code will stop working. So users of Microsoft and MySQL are kind of stuck using the non-standard delimiters, unfortunately.
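If identifiers are generated by application code, the three delimiter styles above can be handled in one place. This is a minimal sketch: the dialect keys and the function name are labels chosen for illustration, and it relies on the usual convention that a closing delimiter inside the name is escaped by doubling it.

```python
# Opening and closing delimiter per dialect; the dialect keys are labels
# chosen for this sketch, not names from any driver or library.
DELIMITERS = {
    "standard": ('"', '"'),
    "sqlserver": ("[", "]"),
    "mysql": ("`", "`"),
}


def quote_identifier(name: str, dialect: str = "standard") -> str:
    """Wrap `name` in the dialect's delimiters, doubling the closing one."""
    if not name or "\x00" in name:
        raise ValueError("identifier must be non-empty and free of NUL characters")
    opening, closing = DELIMITERS[dialect]
    return opening + name.replace(closing, closing * 2) + closing


for dialect in DELIMITERS:
    print(dialect, quote_identifier("año", dialect))
```

Keeping the delimiter choice in one table means the rest of the code does not depend on whether a server option for standard double-quotes happens to be enabled.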
Maintaining the code is simpler in some ways if you can stick with ASCII characters. But there are legitimate reasons to want to use special characters too.

SQL2008 fulltext index search without word breakers

I am trying to search an FTI using CONTAINS for Twitter-style usernames, e.g. #username, but word breakers will ignore the # symbol. Is there any way to disable word breakers? From research, there is a way to create a custom word breaker DLL, install it, and assign it, but that all seems a bit intensive and, frankly, over my head. I disabled stop words so that dashes are not ignored, but I need that # symbol. Any ideas?
You're not going to like this answer, but full-text indexes only consider the characters _ and ` while indexing. All other characters are ignored, and words get split where those characters occur. This is mainly because full-text indexes are designed to index large documents, where only proper words are considered in order to make the search more refined.
We faced a similar problem. To solve it, we used a translation table in which characters like #, -, and / were replaced with special sequences like '`at`', '`dash`', '`slash`', etc. When searching the full-text index, you have to replace those characters in the search string with the same special sequences before searching. This should take care of the special characters.
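The translation-table idea can be sketched in a few lines of Python. The mapping below is illustrative only (the answer does not spell out which sequence belongs to which character), and it takes the answer's word for it that the backtick survives indexing.

```python
# Hypothetical mapping: the answer does not give its exact table, so these
# sequences are illustrative.
REPLACEMENTS = {
    "#": "`hash`",
    "-": "`dash`",
    "/": "`slash`",
}
_TABLE = str.maketrans(REPLACEMENTS)


def encode_for_fulltext(text: str) -> str:
    """Replace special characters with sequences the index is said to keep."""
    return text.translate(_TABLE)


# Apply the same function to the stored text and to the search term.
stored = encode_for_fulltext("thanks #username for the follow-up")
term = encode_for_fulltext("#username")
print(term)            # `hash`username
print(term in stored)  # True
```

The important design point is that one function encodes both sides: if the indexed column and the search string ever use different tables, matches silently disappear.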

What is causing the LIKE statement to disregard html-tags, words after commas, or end in periods?

I'm working on a search module that searches in text columns that contain HTML code. The queries are constructed like: WHERE htmlcolumn LIKE '% searchterm %';
By default, the module searches with a space at both ends of the search term. When a wildcard is given at the beginning and/or end of the search term, the corresponding space is removed (*searchterm -> LIKE '%searchterm %'). I've also added the possibility to exclude results containing certain words (-searchterm -> NOT LIKE '% searchterm %'). So far so good.
The problem is that words preceded by an HTML tag are not found (<br/>searchterm is not found when searching with LIKE '% searchterm ...'), and neither are words that come after a comma, end with a period, etc.
What I would like to do is search for words that are not preceded or followed by the characters A-Z and a-z. Any other character is OK.
Any ideas how i should achieve this? Thanks!
Look into MySQL's full-text search; it might be able to use non-letter characters as delimiters. It will also be much faster than a %term% search, since that requires a full table scan.
You could use a regular expression: http://dev.mysql.com/doc/refman/5.0/en/regexp.html
Generally speaking, it is better to use full text search facilities, but if you really want a small SQL, here it is:
SELECT * FROM `t` WHERE `htmlcolumn` REGEXP '[[:<:]]term[[:>:]]'
It returns all records that contain the word 'term', whether it is surrounded by spaces, punctuation, special characters, etc.
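If the filtering has to happen in application code instead, the rule from the question (no A-Z or a-z directly before or after the term) maps onto lookarounds. The following Python sketch shows that. The function name is made up, and the case-insensitive matching is an assumption of this sketch. Unlike the word-boundary markers above, it treats digits and underscores as acceptable neighbours, which is what the question asked for.

```python
import re


def contains_term(text: str, term: str) -> bool:
    """True if `term` occurs with no ASCII letter directly before or after it."""
    # re.escape keeps characters such as "." or "+" in the term literal.
    pattern = r"(?<![A-Za-z])" + re.escape(term) + r"(?![A-Za-z])"
    # Case-insensitive matching is an assumption, not part of the question.
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(contains_term("<br/>searchterm, and more", "searchterm"))  # True
print(contains_term("researchterms", "searchterm"))              # False
```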
I don't think SQL's "LIKE" operator alone is the right tool for the job you are trying to do. Consider using Lucene, or something like it. I was able to integrate Lucene.NET into my application in a couple days. You'll spend more time than that trying to salvage your current approach.
If you have no choice but to make your current approach work, then consider storing the text in two columns in your database. The first column is for the original text, with punctuation etc. The second column holds the text after preprocessing: just words, no punctuation, normalized so as to be easier for your "LIKE" approach.
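The preprocessing for that second column might look roughly like the following Python sketch. It is only an outline under stated assumptions: tags are stripped with a naive regex rather than a real HTML parser, only ASCII letters and digits are kept, and the function name is hypothetical.

```python
import html
import re


def normalize_for_like(raw_html: str) -> str:
    """Build the value for the second, search-only column."""
    text = re.sub(r"<[^>]*>", " ", raw_html)    # drop tags (naive, not a parser)
    text = html.unescape(text)                  # turn entities such as &amp; into characters
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)  # punctuation and the like become one space
    # Leading and trailing spaces let LIKE '% term %' match the first and last word.
    return " " + text.strip().lower() + " "


print(repr(normalize_for_like("<p>First,<br/>searchterm.</p>")))  # ' first searchterm '
```

Because every word in the normalized column is surrounded by single spaces, the existing LIKE '% searchterm %' pattern finds words next to tags, commas and periods without any change to the query builder.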