Finding the location of one string within another string in BigQuery - google-bigquery

I couldn't find a function in the BigQuery query reference that looks for one string within a second one and returns the index of its location, something like instr() in other SQL dialects. Is there a substitute or technique to achieve this?
For example: looking for "de" in "abcdef" should return 4.

One way you can do this is with a Regular Expression extract (see reference here):
SELECT
title, LENGTH(REGEXP_EXTRACT(title, r'^(.*)def.*')) + 1 AS location_of_fragment
FROM
[publicdata:samples.wikipedia]
WHERE
REGEXP_MATCH(title, r'^(.*)def.*')
LIMIT 10;
Returns:
Row title location_of_fragment
1 Austrian air defense 14
2 Talk:Interface defeat 16
3 High-definition television 6
4 Talk:IAU definition of planet 10
5 Wikipedia:Articles for deletion/Culture defines politics 41
6 Wikipedia:WikiProject Spam/LinkReports/defenders.org 40
7 Adenine phosphoribosyltransferase deficiency 35
8 Stay-at-home defenceman 14
9 Manganese deficiency (plant) 11
10 High-definition television 6
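
The prefix-length idea behind this query can be checked outside BigQuery. Below is a minimal Python sketch, not BigQuery code: the function name position_via_regex and the greedy flag are made up for illustration. It also shows one caveat of the pattern above: a greedy (.*) captures everything up to the last occurrence of the fragment, so a non-greedy (.*?) is needed to get the first one.

```python
import re

def position_via_regex(haystack, needle, greedy=True):
    """One-based position of needle, computed as len(captured prefix) + 1; 0 if absent."""
    quantifier = ".*" if greedy else ".*?"
    pattern = "^(" + quantifier + ")" + re.escape(needle)
    match = re.match(pattern, haystack, flags=re.DOTALL)
    return len(match.group(1)) + 1 if match else 0

# Greedy capture points at the LAST occurrence, non-greedy at the FIRST.
print(position_via_regex("abcdefdef", "def", greedy=True))
print(position_via_regex("abcdefdef", "def", greedy=False))
```

When the fragment occurs only once, as in the titles listed above, both variants give the same position.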

The old answer is now deprecated and @carlos's answer works:
STRPOS(string, substring)
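
For readers who want to sanity-check the expected result, here is a rough Python model of the same semantics (one-based position of the first occurrence, 0 when the substring is absent). The function name strpos is only an illustration here; this is not BigQuery code.

```python
def strpos(haystack, needle):
    """One-based index of the first occurrence of needle, or 0 if it is absent."""
    # str.find returns a zero-based index, or -1 when nothing is found,
    # so adding 1 yields the one-based position and 0 for "not found".
    return haystack.find(needle) + 1

print(strpos("abcdef", "de"))
```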

The legacy SQL INSTR(str1,str2) function "Returns the one-based index of the first occurrence of a string." So that should work for you.
https://cloud.google.com/bigquery/docs/reference/legacy-sql

I'm late to the party, but the BigQuery API changed. The regex syntax is now as follows:
SELECT mydomains FROM `myproject.mydataset.mytable`
where regexp_contains(mydomains, r'^(.*)example.*');
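To answer the question's example (looking for "de" in "abcdef" should return 4), combine it with REGEXP_EXTRACT as in the older answer; mystring is a placeholder column name:
SELECT LENGTH(REGEXP_EXTRACT(mystring, r'^(.*?)de')) + 1 AS location_of_fragment
FROM `myproject.mydataset.mytable`
WHERE REGEXP_CONTAINS(mystring, r'de');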
REGEXP_MATCH is now part of Legacy SQL Functions and Operators as per the reference link.
Hope it helps someone! :)

Related

Fulltext search in SQL Server

I'm trying to create a simple search page on my site but finding it difficult to get full text search working as I would expect it to work.
Example search:
select *
from Movies
where contains(Name, '"Movie*" and "2*"')
Example data:
Id Name
------------------
1 MyMovie
2 MyMovie Part 2
3 MyMovie Part 3
4 MyMovie X2
5 Other Movie
Searches like "movie*" return no results since the term is in the middle of a word.
Searches like "MyMovie*" and "2*" only return MyMovie Part 2 and not MyMovie X2.
It seems like I could just hack together a dynamic SQL query that appends
and name like '%movie%' and name like '%x2%' and it would work better than full text search. That seems odd, since full text search is a large part of SQL Server but doesn't seem to work as well as a simple LIKE.
I've turned off my stop list so the number and single-character results appear, but it just doesn't seem to work well for what I'm doing, which seems rather basic.
select
*
from Movies
where
name like ('%Movie%')
or name like ('%2%')
;
select * from Movies where freetext(Name, ' "Movie*" or "2*" ')
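
The difference the question runs into can be modelled with a small Python sketch. It assumes a simplified view of full-text prefix terms (a prefix must start a whole word, with words split on whitespace) and is not SQL Server's actual word breaker; prefix_search and substring_search are hypothetical helper names.

```python
MOVIES = ["MyMovie", "MyMovie Part 2", "MyMovie Part 3", "MyMovie X2", "Other Movie"]

def prefix_search(names, prefixes):
    """Keep names in which every prefix starts at least one whole word."""
    def has_word(name, prefix):
        return any(word.lower().startswith(prefix.lower()) for word in name.split())
    return [name for name in names if all(has_word(name, p) for p in prefixes)]

def substring_search(names, fragments):
    """Keep names containing every fragment anywhere, like chained LIKE '%x%' tests."""
    return [name for name in names if all(f.lower() in name.lower() for f in fragments)]

print(prefix_search(MOVIES, ["MyMovie", "2"]))    # word-prefix model
print(substring_search(MOVIES, ["movie", "2"]))   # LIKE '%...%' model
```

Under this model "Movie*" cannot match MyMovie (the word starts with "My") and "2*" cannot match X2, which is consistent with the results described in the question.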

Using Lucene Fuzzy search with a word that has no aliases

I want to run searches using fuzzy search. Using Luke to help me, if I search for a word that has aliases (i.e. similar words), it all works as expected:
However if I enter a search term that doesn't have any similar words (eg a serial code), the search fails and I get no results, even though it should be valid:
Do I need to structure my search in a different way? Why don't I get the same in the second search as the first, but with only one "term"?
You have not specified the Lucene version, so I will assume you are using 6.x.x.
The behavior you are seeing is the correct behavior of Lucene fuzzy search.
Refer to this, and I quote:
At most, this query will match terms up to 2 edits.
Roughly speaking, this means that two texts differing by at most two character edits, at any positions, are returned as a match when using FuzzyQuery.
Below is sample output from one of my simple Java programs to illustrate this.
Let's say three indexed docs have a field with the values "123456787", "123456788" and "123456789" (7, 8 and 9 appended to 12345678).
Results:
No Hits Found for search string -> 123456 ( Edit distance = 3 , last 3 digits are missing)
3 Docs found !! for Search String -> 1234567 ( Edit distance = 2 )
3 Docs found !! for Search String -> 12345678 ( Edit distance = 1 )
1 Docs found !! for Search String -> 1236787 ( Edit distance = 2 for found one, missing 4 , 5 and last digit for remaining two documents)
No Hits Found for search string -> 123678789 ( Edit distance = 4 , missing 4 , 5 and last two digits)
So you should read more about Edit Distance.
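
To make the numbers above easier to verify, here is a small Python sketch of the plain Levenshtein edit distance (insertions, deletions and substitutions each cost 1). It is only an illustration of the concept; Lucene's own computation may differ in details, for example in how it treats transpositions.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # deleting the first i characters of a
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

for term in ["123456", "1234567", "12345678", "1236787"]:
    print(term, edit_distance(term, "123456787"))
```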
If your requirement is to match N-Continuous characters without worrying about edit distance , then N-Gram Indexing using NGramTokenizer is the way to go.
See this too for more about N-Gram

SQL Server 2014 equivalent to mysql's find_in_set()

I'm working with a database that has a locations table such as:
locationID | locationHierarchy
1 | 0
2 | 1
3 | 1,2
4 | 1
5 | 1,4
6 | 1,4,5
which makes a tree like this
1
--2
----3
--4
----5
------6
where locationHierarchy is a CSV string of the locationIDs of all its ancestors (think of a hierarchy tree). This makes it easy to determine the hierarchy when working toward the top of the tree given a starting locationID.
Now I need to write code to start with an ancestor and recursively find all descendants. MySQL has a function called 'find_in_set' which easily parses a csv string to look for a value. It's nice because I can just say "find in set the value 4" which would give all locations that are descendants of locationID of 4 (including 4 itself).
Unfortunately this is being developed on SQL Server 2014 and it has no such function. The CSV string is of variable length (virtually unlimited levels allowed) and I need a way to find all descendants of a location.
A lot of what I've found on the internet to mimic the find_in_set function in SQL Server assumes a fixed depth of hierarchy (such as 4 levels maximum), which wouldn't work for me.
Does anyone have a stored procedure or anything that I could integrate into a query? I'd really rather not have to pull all records from this table to use code to individually parse the CSV string.
I would imagine searching the locationHierarchy string for locationID% or %,{locationid},% would work but be pretty slow.
I think you want LIKE, in either database. Something like this:
select l.*
from locations l
where l.locationHierarchy like @LocationHierarchy + ',%';
If you want the original location included, then one method is:
select l.*
from locations l
where l.locationHierarchy + ',' like @LocationHierarchy + ',%';
I should also note that SQL Server has proper support for recursive queries, so it has other options for hierarchies apart from hierarchy trees (which are still a very reasonable solution).
Finally, this worked for me:
SELECT * FROM locations WHERE locationHierarchy LIKE CONCAT(@param, ',%') OR
locationHierarchy LIKE CONCAT('%,', @param, ',%') OR
locationHierarchy LIKE CONCAT('%,', @param)
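
As a cross-check of the matching logic, here is a Python sketch over the sample table from the question. The find_in_set function imitates MySQL's behaviour (one-based position, 0 when absent), and like_descendants uses the common trick of wrapping both the value and the CSV string in commas, which in SQL Server would correspond to something like ',' + locationHierarchy + ',' LIKE '%,4,%'. All names here are illustrative, not part of either database.

```python
# Sample data from the question: locationID -> locationHierarchy
LOCATIONS = {1: "0", 2: "1", 3: "1,2", 4: "1", 5: "1,4", 6: "1,4,5"}

def find_in_set(value, csv):
    """One-based position of value in a comma-separated string, 0 if absent."""
    items = csv.split(",")
    return items.index(value) + 1 if value in items else 0

def descendants(ancestor_id):
    """Locations whose hierarchy lists ancestor_id as one of its items."""
    return sorted(loc for loc, hierarchy in LOCATIONS.items()
                  if find_in_set(str(ancestor_id), hierarchy) > 0)

def like_descendants(ancestor_id):
    """Same result via the comma-wrapping trick that a LIKE pattern can express."""
    needle = "," + str(ancestor_id) + ","
    return sorted(loc for loc, hierarchy in LOCATIONS.items()
                  if needle in "," + hierarchy + ",")

print(descendants(4))
print(like_descendants(1))
```

Wrapping in commas means a search for 1 cannot accidentally match an ID such as 11, and it covers the first, middle, last and only-item cases with a single pattern.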

query, where field contains 2 t's

Is there a way to write a where clause that matches only values containing exactly 2 t's, independent of where they are located?
Such as:
Matthew -- would work
Thanatos -- would work
Thanatos T -- would not work
Tom -- would not work
I've been Googling but can't find anything specific about this.
Any help is appreciated.
You could try
SELECT *
FROM Table
WHERE Field LIKE '%t%t%' AND Field NOT LIKE '%t%t%t%'
I'm curious which would be faster, this or Goat CO's answer.
You could use LEN() and REPLACE():
SELECT *
FROM Table
WHERE LEN(REPLACE(field,'t','tt')) - LEN(Field) = 2
Demo: SQL Fiddle
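
Both ideas can be tried outside the database with a short Python sketch. It assumes case-insensitive matching (as with a case-insensitive collation), which is why "Thanatos" counts as having two t's; the function names are made up for this example.

```python
import re

def count_by_replace(value, letter="t"):
    """Count occurrences of letter: doubling each one grows the string by one per hit."""
    lowered = value.lower()
    return len(lowered.replace(letter, letter * 2)) - len(lowered)

def like_two_not_three(value, letter="t"):
    """Model of: LIKE '%t%t%' AND NOT LIKE '%t%t%t%'."""
    lowered = value.lower()
    e = re.escape(letter)
    at_least_two = re.search(e + ".*" + e, lowered, re.DOTALL)
    at_least_three = re.search(e + ".*" + e + ".*" + e, lowered, re.DOTALL)
    return bool(at_least_two) and not at_least_three

for name in ["Matthew", "Thanatos", "Thanatos T", "Tom"]:
    print(name, count_by_replace(name) == 2, like_two_not_three(name))
```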

Separate String with separator ###

I have to call the string below three times using SQL.
String: JAN###PIET###HENK
The first call should return:
JAN
The second call:
PIET
The third:
HENK
So we could use the ### as a separator, but the string could also be
JAN###PIET only. All three calls will still be made, where
call 1 returns:
JAN
call 2 returns:
PIET
call 3 returns:
<>
The same could happen for the string JAN only.
I hope this explanation is sufficient for someone to help me with this case.
Thanks in advance.
Regards,
Ryni
If you're using SQL Server 2005, adapt this post to your needs:
Split Funcin - SQL 2005
It sounds to me like you are asking for a split function that maintains state like an enumerator. You don't actually want that; it could potentially be very bad.
I'd recommend a split string function (in sql server, they're called table valued functions). If you do that, worst case scenario is you have to loop over the table result. In any event, you can find the function here: http://dpatrickcaldwell.blogspot.com/2008/08/table-valued-function-to-split-strings.html
Your usage would look like this:
SELECT *
FROM dbo.SplitString('JAN###PIET###HENK', '###')
-- PartId Part
-- ----------- --------
-- 1 JAN
-- 2 PIET
-- 3 HENK
SELECT *
FROM dbo.SplitString('JAN###PIET###HENK', '###')
WHERE PartId = 2
-- PartId Part
-- ----------- --------
-- 2 PIET
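
The "ask for part N, get nothing if it does not exist" behaviour from the question can be sketched in Python as follows. This only models the logic of filtering the split result by PartId; nth_part is a hypothetical helper, and returning None for a missing part is an assumption about what the empty result should look like.

```python
def nth_part(text, separator, n):
    """Return the n-th (one-based) part of text, or None when there is no such part."""
    parts = text.split(separator)
    return parts[n - 1] if 1 <= n <= len(parts) else None

for call in (1, 2, 3):
    print(call, nth_part("JAN###PIET", "###", call))
```

Each call is stateless: the caller passes the part number explicitly instead of relying on the function to remember how often it has been called.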
Every possible SQL Server split string method, with detailed pros and cons:
http://www.sommarskog.se/arrays-in-sql.html