How to find out the square sequence in the number string? - google-bigquery

How to find out sequence of square patterns in the number string and eliminate it?
Example,
Input: 1232416256789
Output: 123789

Related

Teradata regular expressions, 0 or 1 spaces

In Teradata, I'm looking for one regular expression pattern that would allow me to find a pattern of some numbers, then a space or maybe no space, and then 'SF'. It should return 7 in both cases below:
SELECT
REGEXP_INSTR('12345 1000SF', pattern),
REGEXP_INSTR('12345 1000 SF', pattern)
Or, my actual goal is to extract the 1000 in both cases if there's an easier way, probably using REGEXP_SUBSTR. More details are below if you need them.
I have a column that contains free text and I would like to extract the square footage. But, in some cases, there is a space between the number and 'SF' and in some cases there is not:
'other stuff 1000 SF'
'other stuff 1000SF'
I am trying to use the REGEXP_INSTR function to find the starting position. Through google, I have found the pattern for the first to be
'([0-9])+ SF'
When I try the pattern for the second, I try
'([0-9])+SF'
and I get the error
SELECT Failed. [2662] SUBSTR: string subscript out of bounds
I've also found an answer to a similar questions, but they don't work for Teradata. For example, I don't think you can use ? in Teradata.
The error message indicates you're using SUBSTR, not REGEXP_SUBSTR.
Try this:
RegExp_Substr(col, '[0-9]*(?= {0,1}SF)')
Find multiple digits followed by a single optional blank followed by SF and extract those digits.
I would pattern it like this:
\b(\d+)\s*[Ss][Ff]\b
\b # word boundary
(\d+) # 1 or more digits (captured)
\s* # 0 or more white-space characters
[Ss] # character class
[Ff] # character class
\b # word boundary
Demo

regex capture middle of url

I'm trying to figure out the base regex to capture the middle of a google url out of a sql database.
For example, a few links:
https://www.google.com/cars/?year=2016&model=dodge+durango&id=1234
https://www.google.com/cars/?year=2014&model=jeep+cherokee+crossover&id=6789
What would be the regex to capture the text to get dodge+durango , or jeep+cherokee+crossover ? (It's alright that the + still be in there.)
My Attempts:
1)
\b[=.]\W\b\w{5}\b[+.]?\w{7}
, but this clearly does not work as this is a hard coded scenario that would only work like something for the dodge durango example. (would extract "dodge+durango)
2) Using positive lookback ,
[^+]( ?=&id )
but I am not fully sure how to use this, as this only grabs one character behind the & symbol.
How can I extract a string of (potentially) any length with any amount of + delimeters between the "model=" and "&id" boundaries?
seems like you could use regexp_replace and access match groups:
regexp_replace(input, 'model=(.*?)([&\\s]|$)', E'\\1')
from here:
The regexp_replace function provides substitution of new text for
substrings that match POSIX regular expression patterns. It has the
syntax regexp_replace(source, pattern, replacement [, flags ]). The
source string is returned unchanged if there is no match to the
pattern. If there is a match, the source string is returned with the
replacement string substituted for the matching substring. The
replacement string can contain \n, where n is 1 through 9, to indicate
that the source substring matching the n'th parenthesized
subexpression of the pattern should be inserted, and it can contain \&
to indicate that the substring matching the entire pattern should be
inserted. Write \ if you need to put a literal backslash in the
replacement text. The flags parameter is an optional text string
containing zero or more single-letter flags that change the function's
behavior. Flag i specifies case-insensitive matching, while flag g
specifies replacement of each matching substring rather than only the
first one
I may be misunderstanding, but if you want to get the model, just select everything between model= and the ampersand (&).
regexp_matches(input, 'model=([^&]*)')
model=: Match literally
([^&]*): Capture
[^&]*: Anything that isn't an ampersand
*: Unlimited times

How hive sentences function breaks each sentence

Before posting, I tried the hive sentences function and did some search but couldn't get a clear understanding, my question is based on what delimiter hive sentences function breaks each sentence? hive manual says "appropriate boundary" what does that mean? Below is an example of my tries, I tried adding period (.) and exclamatory sign(!) at different points of the sentence. I'm getting different outputs, can someone explain on this?
with period (.)
select sentences('Tokenizes a string of natural language text into words and sentences. where each sentence is broken at the appropriate sentence boundary and returned as an array of words.') from dummytable
output - 1 array
[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences","where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]
with '!'
select sentences('Tokenizes a string of natural language text into words and sentences! where each sentence is broken at the appropriate sentence boundary and returned as an array of words.') from dummytable
output - 2 arrays
[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences"],["where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]
If you understand the functionality of sentences()..it clears your doubt.
Definition of sentences(str):
Splits str into arrays of sentences, where each sentence is an array
of words.
Example:
SELECT sentences('Hello there! I am a UDF.') FROM src LIMIT 1;
[ ["Hello", "there"], ["I", "am", "a", "UDF"] ]
SELECT sentences('review . language') FROM movies;
[["review","language"]]
An exclamation point is a type of punctuation mark that goes at the end of a sentence. Other examples of related punctuation marks include periods and question marks, which also go at the end of sentences.But as per the definition of sentences() ,Unnecessary punctuation, such as periods and commas in English, is automatically stripped.So,we are able to get two arrays of words with !. It completely involves java.util.Locale.java
I don't know the actual reason but observed after period(.) if you put space and next word first letter as capital then it is working.
Here I changed from where to Where it it worked. However this is not require for !
Tokenizes a string of natural language text into words and sentences. Where each sentence is broken at the appropriate sentence boundary and returned as an array of words.
And this is giving below output
[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences"],["Where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]

pdf decoding, what do the bracket and numbers do?

my PDF file has deflate encoding, when inflating the string, it outputs something like this:
[(Lorem)-21( ipsum)-55( dolor)-14( sit)-55( amet,)-56( consectetur)-8( adipiscing)-14( elit.)-34( Donec)-15( faucibus)-49( lorem)-42( varius2)-56( mauris)-28( porttitor,)-34( et)-28( pellentesque)-1( )]TJ
what do the numbers and brackets mean?
it does not seems to be character count, or spacing,
does anyone know?
That is an array for showing text (Stuff in brackets denote array objects []), it should be followed by the TJ operator. The number is used to translate the text matrix (adjust the positioning of the text). Assuming horizontal text, a negative number moves the next glyph to the right.
From 9.4.3 Text-Showing Operators (Please see the spec for more details)
Show one or more text strings, allowing individual glyph positioning.
Each element of array shall be either a string or a number. If the
element is a string, this operator shall show the string. If it is a
number, the operator shall adjust the text position by that amount;
that is, it shall translate the text matrix, Tm. The number shall be
expressed in thousandths of a unit of text space (see 9.4.4, "Text
Space Details"). This amount shall be subtracted from the current
horizontal or vertical coordinate, depending on the writing mode. In
the default coordinate system, a positive adjustment has the effect of
moving the next glyph painted either to the left or down by the given
amount.
The parentheses denote string objects:
String objects shall be written in one of the following two ways:
As a sequence of literal characters enclosed in parentheses ( ) (using
LEFT PARENTHESIS (28h) and RIGHT PARENThESIS (29h)); see 7.3.4.2,
"Literal Strings."
...
A literal string shall be written as an arbitrary number of characters
enclosed in parentheses. Any characters may appear in a string except
unbalanced parentheses (LEFT PARENHESIS (28h) and RIGHT PARENTHESIS
(29h)) and the backslash (REVERSE SOLIDUS (5Ch)), which shall be
treated specially as described in this sub-clause. Balanced pairs of
parentheses within a string require no special treatment.
I suggest getting the PDF Spec and reading it to find out more info.

How Do I get this Split Function to Work? (VB.NET)

So, I made a program that for the most part, converts numbers to letters. My problem before was it was converting each individual digit instead of each number e.g. (1-0-1 instead of 101). Someone suggested that I use the Split function:
Dim numbers As String() = DTB.Split(" ")
So now it's reading the number all the way through being that it will only the split if there's a space in between. My problem now is that it's translating for example: "[102, 103, 104]" as "[102", "103" and "104]" because it will only split if there's a space between. Obviously, you can't convert "[102" or "104]" because they aren't actual numbers.
Does anyone have a solution on what I should do to get this to convert no matter the spacing? Would Regex be the way to go?
use a regular expression with \d+ it will match numbers
so
12234abcsdf23434
will return two matches
12234
23434