Query Chinese characters(utf-8) in Google Big Query

I want to query titles that contain Chinese characters (e.g. 數學) from my Google dataset, and I have tried several methods, as follows.
Google BigQuery only has a LENGTH() function; it doesn't have DATALENGTH(), so I can't compare the character length with the data size.
Then I tried REGEXP_MATCH() with '[\u4e00-\u9fa5]' to match Chinese characters, but that doesn't work either.
I can't figure out whether there are other methods to solve this problem.
Please help, thank you.

BigQuery's LENGTH function currently has a bug that returns an incorrect STRING length for characters that fall outside the ASCII range: https://code.google.com/p/google-bigquery/issues/detail?id=109
Possible workaround: if you just need an accurate LENGTH count, you could use the REGEXP_REPLACE function to replace every character with a single ASCII character (such as '_') and count the result:
SELECT '數學',
  LENGTH(REGEXP_REPLACE('數學', r'.', '_')) AS correct,
  LENGTH('數學') AS incorrect;
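The two ideas in this thread (matching a Chinese character range, and comparing character length with byte size) can be sketched outside BigQuery. The following is a minimal Python illustration, not BigQuery code: the helper names are hypothetical, and the range U+4E00..U+9FFF is an assumption that covers only the main CJK Unified Ideographs block. In BigQuery itself the closest counterparts would presumably be REGEXP_CONTAINS and BYTE_LENGTH, which should be checked against the current documentation.

```python
import re

# Assumption: the main CJK Unified Ideographs block (U+4E00..U+9FFF) is enough;
# rarer ideographs live in extension blocks that this range does not cover.
HAN = re.compile(r'[\u4e00-\u9fff]')


def contains_han(title):
    """Return True if the title contains at least one character in the range."""
    return HAN.search(title) is not None


def char_and_byte_length(title):
    """Return (number of characters, number of UTF-8 bytes)."""
    return len(title), len(title.encode('utf-8'))


# Hypothetical sample titles.
titles = ['數學', 'Math 101', '微積分 Calculus']
han_titles = [t for t in titles if contains_han(t)]

print(han_titles)                    # titles with at least one Chinese character
print(char_and_byte_length('數學'))  # character count differs from byte count
```

A title whose byte length exceeds its character length contains non-ASCII characters, which is the comparison the question wanted to make with DATALENGTH().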

Related

Pattern matching in Big query vs SSMS-Return strings which contain special characters or numerics

I'm a bit lost.
I've had a look at the documentation, but I'm not sure whether LIKE pattern matching works the same way in BigQuery as in SSMS.
The code shown here works in SSMS, but the results are not correct in BigQuery, so I was wondering if there is another way to do it.
WHERE column_name NOT LIKE '[a-Z]%'
I'm looking to return strings which contain special characters or numerics.

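Use REGEXP_CONTAINS instead:
where not regexp_contains(column_name, r'[a-zA-Z]')
LIKE is also supported in BigQuery as a comparison operator, but only with the % and _ wildcards; bracketed character classes such as '[a-Z]' are a SQL Server extension and are treated as literal text.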
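The T-SQL pattern in the question and the REGEXP_CONTAINS call in the answer do not test quite the same thing, and neither matches the stated goal word for word. The Python sketch below emulates the three readings with the standard re module. The function names are hypothetical, and '[a-Z]' is assumed to mean "any ASCII letter", which in SQL Server actually depends on the collation.

```python
import re


def starts_with_non_letter(s):
    """Emulates T-SQL: column_name NOT LIKE '[a-Z]%' (first character only)."""
    return re.match(r'[A-Za-z]', s) is None


def has_no_letters(s):
    """Emulates: NOT REGEXP_CONTAINS(column_name, r'[a-zA-Z]')."""
    return re.search(r'[A-Za-z]', s) is None


def has_special_or_digit(s):
    """Emulates: REGEXP_CONTAINS(column_name, r'[^a-zA-Z]')."""
    return re.search(r'[^A-Za-z]', s) is not None


# Hypothetical sample values.
samples = ['abc', '1abc', 'ab#c', '123']
for s in samples:
    print(s, starts_with_non_letter(s), has_no_letters(s), has_special_or_digit(s))
```

If the goal is "strings that contain at least one special character or digit", the third form is the closest match; to reproduce the original SSMS behaviour, anchoring the pattern at the start (r'^[a-zA-Z]') would be the closer translation.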

How do I remove a character from strings of different lengths with sql? Intersystems cache sql

I have a column of strings that have an '&' at the beginning and end of each one, which I need to remove for a Crystal report I'm creating. I'm writing the SQL code outside of Crystal, using Intersystems Cache SQL. Below is an example:
&This&     This
&is&       is
&What&     what
&it&       I
&looks&    need
&like&     it
&now&      to
           look
           like
Any suggestions would be greatly appreciated!!!

Assuming the ampersands are always the leading and trailing characters, here's at least a start. Use a combination of SUBSTR (or SUBSTRING, if using stream data) and LENGTH, like so:
SELECT SUBSTR(column, 2, LENGTH(column) - 2) FROM table
This returns a substring that starts at the 2nd character of the original string (the first argument to SUBSTR) and runs for the total number of characters in the original string less 2 (i.e. less the two ampersands).
If you need to include trailing blanks and/or the string termination character, you may need to use a different variation of the LENGTH function. See these resources for details on the functions and their variants:
https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls?KEY=RSQL_substr
https://cedocs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=RSQL_length
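The SUBSTR/LENGTH arithmetic above can be checked with a small emulation. This is a Python sketch, not Cache SQL: substr is a hypothetical helper that mimics a 1-based SUBSTR(string, start, length), and strip_if_wrapped is an extra guard for rows that do not have both ampersands, which the original answer assumes never happens.

```python
def substr(s, start, length):
    """Hypothetical helper mimicking a 1-based SUBSTR(string, start, length)."""
    return s[start - 1:start - 1 + length]


def strip_ampersands(s):
    """Same arithmetic as SUBSTR(column, 2, LENGTH(column) - 2)."""
    return substr(s, 2, len(s) - 2)


def strip_if_wrapped(s):
    """Only strip when the value really starts and ends with '&'."""
    if len(s) >= 2 and s.startswith('&') and s.endswith('&'):
        return strip_ampersands(s)
    return s


print(strip_ampersands('&This&'))   # the ampersands are removed
print(strip_if_wrapped('plain'))    # left untouched: no ampersands to remove
```

Without the guard, a value that lacks the ampersands would silently lose its first and last characters.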

Here's a Crystal formula that does the same:
ExtractString({YourData},"&","&")

Redshift SQL - Extract numbers from string

In an Amazon Redshift table, I have a string column from which I need to extract only the numbers. Currently I use
translate(stringfield, '0123456789'||stringfield, '0123456789')
I tried the REPLACE function, but the result is not going to be elegant.
Any thoughts on converting the string into ASCII first and then doing some operation to extract only the numbers? Or any other alternatives?
It is hard here, as Redshift does not support functions and is missing a lot of the traditional ones.
Edit:
I am trying the query below, but it returns 051-a92, whereas I need 05192 as output. I am thinking of substring etc., but I only have regexp_substr available right now. How do I get rid of the characters in between?
select REGEXP_SUBSTR('somestring-051-a92', '[0-9]+..[0-9]+', 1)

This might be late, but I was solving the same problem and finally came up with this:
select REGEXP_replace('somestring-051-a92', '[a-z/-]', '')
Alternatively, you can now create a Python UDF.

Typically your inputs will conform to some sort of pattern that can be used to do the parsing with SUBSTRING() and CHARINDEX() (aka STRPOS(), POSITION()).
E.g. find the first hyphen and the second hyphen and take the data between them.
If not (and assuming your character range is limited to ASCII), your best bet would be to nest 26+ REPLACE() functions to remove all of the standard alpha characters (and any punctuation as well).
If you have multibyte characters in your data, though, this is a non-starter.
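The "find the first and second hyphen" idea can be sketched as follows. This is a Python illustration of the position arithmetic only; str.find stands in for CHARINDEX/STRPOS, and between_hyphens is a hypothetical helper name.

```python
def between_hyphens(s):
    """Return the text between the first and second hyphen, or None."""
    first = s.find('-')               # position of the first hyphen
    second = s.find('-', first + 1)   # next hyphen after the first one
    if first == -1 or second == -1:
        return None                   # the input does not follow the pattern
    return s[first + 1:second]


print(between_hyphens('somestring-051-a92'))
```

Note that for the sample value this yields only the middle segment, so it suits inputs whose digits sit in one known position rather than the "all digits" case from the question.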

A better method is to remove all the non-numeric characters:
select REGEXP_replace('somestring-051-a92', '[^0-9]', '')
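To see why the negated class is the more robust choice, the two patterns can be compared on an input that also contains uppercase letters and an underscore. This is a Python sketch using the standard re module rather than Redshift itself; the second sample string is a made-up example.

```python
import re


def remove_listed(s):
    """Remove only lowercase letters, '/' and '-' (the '[a-z/-]' pattern)."""
    return re.sub(r'[a-z/-]', '', s)


def keep_digits(s):
    """Remove everything that is not a digit (the '[^0-9]' pattern)."""
    return re.sub(r'[^0-9]', '', s)


for value in ['somestring-051-a92', 'Some_String-051-A92']:
    print(value, remove_listed(value), keep_digits(value))
```

Both functions agree on the original sample, but only the negated class still returns digits alone once characters outside the listed set appear.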

You can specify "any non-digit", which includes non-printable characters, symbols, letters, etc. E.g.,
regexp_replace('brws--A*1','[\D]')
returns "1".

count number of characters in nvarchar column

Does anyone know a good way to count characters in a text (nvarchar) column in SQL Server?
The values there can be text, symbols and/or numbers.
So far I have used sum(datalength(column))/2, but this only works for text. (It's a method based on datalength, and the result can vary from one type to another.)

You can find the number of characters using system function LEN.
i.e.
SELECT LEN(Column) FROM TABLE

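Use the length function (note that LENGTH is the MySQL/Oracle/PostgreSQL spelling; the SQL Server equivalent is LEN):
SELECT length(yourfield) FROM table;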

Use the LEN function:
Returns the number of characters of the specified string expression, excluding trailing blanks.

Doesn't SELECT LEN(column_name) work?

The text data type doesn't work with the LEN function.
ntext, text, and image data types will be removed in a future version
of Microsoft SQL Server. Avoid using these data types in new
development work, and plan to modify applications that currently use
them. Use nvarchar(max), varchar(max), and varbinary(max) instead. For
more information, see Using Large-Value Data Types.
Source

I had a similar problem recently, and here's what I did:
SELECT
  columnname AS 'Original_Value',
  LEN(LTRIM(columnname)) AS 'Orig_Val_Char_Count',
  N'[' + columnname + ']' AS 'UnicodeStr_Value',
  LEN(N'[' + columnname + ']') - 2 AS 'True_Char_Count'
FROM mytable
The first two columns look at the original value and count the characters (minus leading/trailing spaces).
I needed to compare that with the true count of characters, which is why I used the second LEN function. It sets the column value to a string, forces that string to Unicode, and then counts the characters.
By using the brackets, you ensure that any leading or trailing spaces are also counted as characters; of course, you don't want to count the brackets themselves, so you subtract 2 at the end.
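The effect of the bracket trick can be reproduced with a small emulation. This is a Python sketch under stated assumptions: sql_len mimics LEN by ignoring trailing spaces, datalength_nvarchar mimics DATALENGTH on an nvarchar value by counting UTF-16 bytes, and all function names are hypothetical. Python counts code points, so results for characters outside the Basic Multilingual Plane may differ from SQL Server's.

```python
def sql_len(s):
    """Emulates LEN: counts characters, excluding trailing spaces."""
    return len(s.rstrip(' '))


def bracketed_len(s):
    """Emulates LEN(N'[' + col + ']') - 2: trailing spaces are protected."""
    return sql_len('[' + s + ']') - 2


def datalength_nvarchar(s):
    """Emulates DATALENGTH on nvarchar: two bytes per UTF-16 code unit."""
    return len(s.encode('utf-16-le'))


value = '  abc  '                    # two leading and two trailing spaces
print(sql_len(value.lstrip()))       # like LEN(LTRIM(col))
print(sql_len(value))                # like LEN(col)
print(bracketed_len(value))          # like the True_Char_Count column
print(datalength_nvarchar(value) // 2)
```

For plain text the bracketed count and DATALENGTH/2 agree, which is why both count the spaces that LEN drops.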

How do I check the end of a particular string using SQL pattern matching?

I am trying to use SQL pattern matching to check whether a string value is in the correct format.
The string code should have the correct format of:
alphanumericvalue.alphanumericvalue
Therefore, the following are valid codes:
D0030.2190
C0052.1925
A0025.2013
And the following are invalid codes:
D0030
.2190
C0052.
A0025.2013.
A0025.2013.2013
So far I have the following SQL IF clause to check that the string is correct:
IF #vchAccountNumber LIKE '_%._%[^.]'
I believe that the "_%" part checks for 1 or more characters. Therefore, this statement checks for one or more characters, followed by a "." character, followed by one or more characters and checking that the final character is not a ".".
It seems that this would work for all combinations except for the following format which the IF clause allows as a valid code:
A0025.2013.2013
I'm having trouble correcting this IF clause to allow it to treat this format as incorrect. Can anybody help me to correct this?
Thank you.

This stackoverflow question mentions using word-boundaries: [[:<:]] and [[:>:]] for whole word matches. You might be able to use this since you don't have spaces in your code.

This is an ANSI SQL solution.
This LIKE expression finds any pattern that is not alphanumeric.alphanumeric, so NOT LIKE finds only the values that match as you wish:
IF #vchAccountNumber NOT LIKE '%[^A-Z0-9].[^A-Z0-9]%'
However, based on your examples, you can use this...
LIKE '[A-Z][0-9][0-9][0-9][0-9].[0-9][0-9][0-9][0-9]'
...or one like this if you have 5 alphanumerics, a dot, and 4 alphanumerics:
LIKE '[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9].[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]'
The 2nd one is slightly more obvious for fixed-length values. The 1st one is slightly less intuitive but works with variable-length codes on either side of the dot.
Other SO questions: "Creating a Function in SQL Server with a Phone Number as a parameter and returns a Random Number" and "Best equivalent for IsInteger in SQL Server".
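The question's sample codes make a convenient test set. The Python sketch below translates the asker's LIKE pattern into a regular expression and compares it with a stricter "alphanumerics, exactly one dot, alphanumerics" rule. It illustrates the matching logic only: the regex translations are my own reading of the patterns, and the names are hypothetical.

```python
import re

# Translation of LIKE '_%._%[^.]': one or more characters, a dot, one or more
# characters, then a final character that is not a dot.
ASKER = re.compile(r'.+\..+[^.]', re.DOTALL)

# Stricter rule: alphanumerics, exactly one dot, alphanumerics.
STRICT = re.compile(r'[A-Za-z0-9]+\.[A-Za-z0-9]+')

valid = ['D0030.2190', 'C0052.1925', 'A0025.2013']
invalid = ['D0030', '.2190', 'C0052.', 'A0025.2013.', 'A0025.2013.2013']

# Invalid codes that each rule nevertheless accepts.
asker_false_positives = [c for c in invalid if ASKER.fullmatch(c)]
strict_false_positives = [c for c in invalid if STRICT.fullmatch(c)]

print(asker_false_positives)
print(strict_false_positives)
```

In T-SQL, which has no regular expressions in LIKE, a similar effect could plausibly be reached by combining conditions, for example LIKE '%_._%' together with NOT LIKE '%.%.%' and NOT LIKE '%[^A-Z0-9.]%'. That combination assumes a case-insensitive collation and has not been run against a server here.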
