Document search in BigQuery based on specific text? - google-bigquery

Here is my use case:
Employee profiles/resumes are stored in BigQuery.
The names of the employees need to be returned when searching for a particular skill or skills.

For searching text within a field, the LIKE operator is helpful.
There are more advanced ways to do this search: regular expressions, splitting the text by a delimiter, or the SOUNDEX function (more useful for different spellings of names).
First, the WITH clause creates a temporary table with your sample data. The where skills like "%task_to_search%" condition then searches for the skill and keeps only the matching rows.
With tbl as
(Select "Jane" as name, " Java, Salesforce" as skills
UNION ALL SELECT "John", "python" )
SELECT *
from tbl
where lower(skills) like "%java%"
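To illustrate the difference between the LIKE search and the "split by delimiter" idea mentioned above, here is a minimal sketch. It uses an in-memory SQLite database as a stand-in for BigQuery, so the SQL dialect is not identical. The row for "Ann" and the helper names are made up for the example.

```python
import sqlite3

# In-memory SQLite stands in for BigQuery; "Ann" is an extra, hypothetical row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (name TEXT, skills TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?)",
    [("Jane", " Java, Salesforce"), ("John", "python"), ("Ann", "JavaScript, SQL")],
)


def names_with_skill_like(skill):
    """Substring match, the same idea as lower(skills) LIKE '%skill%'."""
    rows = conn.execute(
        "SELECT name FROM tbl "
        "WHERE lower(skills) LIKE '%' || lower(?) || '%' ORDER BY name",
        (skill,),
    )
    return [r[0] for r in rows]


def names_with_skill_exact(skill):
    """Split on the comma delimiter and compare whole, trimmed tokens."""
    wanted = skill.strip().lower()
    result = []
    for name, skills in conn.execute("SELECT name, skills FROM tbl ORDER BY name"):
        tokens = [t.strip().lower() for t in skills.split(",")]
        if wanted in tokens:
            result.append(name)
    return result


print(names_with_skill_like("java"))   # substring match also hits "JavaScript"
print(names_with_skill_exact("java"))  # token match only hits the exact skill
```

The substring version is simpler, but it cannot tell "java" from "javascript"; splitting on the delimiter avoids that at the cost of a little more work.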

Related

Full text search VS Fuzzy Search based on many columns

I have an Employee table with these columns :
EmployeeId
Fullname
Phone
Department
Team
Function
Manager
I have a form with a search text input where a user can type a value for one column or for all of them. For example:
a user can search by Fullname only
a user can search by combining the Fullname + Phone + Team
What is the difference between Full text search and Fuzzy Search in SQL Server in this case?
So you have two options:
Using full-text search:
If the data is huge and you are looking for scalable search, this method is preferable, although it is harder to maintain. I suggest adding a computed column to the table and putting a full-text index on it:
ALTER TABLE tablename
ADD cmptcolumn AS CONCAT_WS(',', EmployeeId, FullName, PhoneNumber,...);
-- full-text catalog
CREATE FULLTEXT CATALOG catalogName AS DEFAULT;
-- full-text index (KEY INDEX must name a unique, single-column, non-nullable index; PK_tablename is a placeholder)
CREATE FULLTEXT INDEX ON tablename (cmptcolumn) KEY INDEX PK_tablename;
-- search:
SELECT * FROM tablename
WHERE CONTAINS(cmptcolumn, 'SearchString');
With full-text search you can also search for synonyms and for words related to each other:
select * from tablename
where freetext(cmptcolumn, 'SearchString');
Read more about the different full-text search options here.
Using a search query, which again can benefit from the computed column, or can search inside each column separately:
select *
from tablename
where (Fullname like '%'+@fullNameSearchString+'%' or @fullNameSearchString is null)
and (Department = @DepartmentSearchString or @DepartmentSearchString is null)
and ...
While the first method is a faster way to search inside strings, the second method provides more accurate results. However, FREETEXT also looks for the meaning of the word, so in that case it might be slower.
In the second method, whichever way you go (with or without a computed column), having an index on the column(s) is a necessity to improve performance. However, a like '%...%' pattern with a leading wildcard usually can't use an index effectively.
Create a stored procedure and pass the columns as a comma-separated list. Check this example:
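The "optional filter" pattern from the second option (a condition is skipped when its parameter is null) can be sketched as follows. This is only an illustration under assumptions: it runs against an in-memory SQLite database rather than SQL Server, and the sample rows and the search helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (fullname TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Alice Smith", "Sales"), ("Bob Smith", "IT"), ("Carol Jones", "Sales")],
)

# Each filter is ignored when its parameter is NULL (None in Python).
QUERY = """
SELECT fullname FROM employee
WHERE (:fullname IS NULL OR fullname LIKE '%' || :fullname || '%')
  AND (:department IS NULL OR department = :department)
ORDER BY fullname
"""


def search(fullname=None, department=None):
    rows = conn.execute(QUERY, {"fullname": fullname, "department": department})
    return [r[0] for r in rows]


print(search())                                      # no filters: every row
print(search(fullname="Smith"))                      # name filter only
print(search(fullname="Smith", department="Sales"))  # both filters combined
```

One query text then serves every combination of filled-in form fields, which is the main attraction of this approach.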
CREATE PROCEDURE [dbo].Create_Sp
@SearchColumn varchar(500) = NULL
AS
BEGIN
DECLARE @SQL nvarchar(max)
SELECT @SQL = N'select ' + @SearchColumn + ' from #Employee'
EXECUTE sp_executesql @SQL
END
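Because the procedure above concatenates the column list straight into the SQL text, it is worth checking the list before it gets that far. Here is a rough sketch of such a check on the application side. The ALLOWED_COLUMNS set, the build_select helper and the table name Employee are assumptions for illustration, not part of the original answer.

```python
# Hypothetical whitelist of column names that callers may request.
ALLOWED_COLUMNS = {"EmployeeId", "Fullname", "Phone", "Department", "Team"}


def build_select(columns_csv):
    """Build a SELECT from a comma-separated column list, rejecting unknown names."""
    requested = [c.strip() for c in columns_csv.split(",") if c.strip()]
    unknown = [c for c in requested if c not in ALLOWED_COLUMNS]
    if not requested or unknown:
        raise ValueError(f"unsupported column list: {columns_csv!r}")
    return "SELECT " + ", ".join(requested) + " FROM Employee"


print(build_select("Fullname, Phone"))
```

Anything that is not a known column name, such as "Fullname; DROP TABLE Employee", raises ValueError instead of reaching the database.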
The main difference between full-text and fuzzy search is "exactly what I'm looking for" versus "similarities." Applied to SQL Server and to your example, that means a full-text search on Fullname as opposed to mining similarities across three different columns, which might produce a lot of noise depending on how clean your data is and how "guided" the end user is when entering those values.
With full-text search on Fullname, I'm pretty sure it's a straightforward answer:
SELECT * FROM tblEmployee WHERE FullName=@FullNamParameter;
With fuzzy search, it can get very gross and nasty. Without a clearer understanding of what the UI is doing, I am forced to assume we want to check similarities on all parameters for all fields. To be honest, I'm pretty sure this query is the absolute worst, but it does demonstrate the idea for you.
SELECT * FROM tblEmployee WHERE FullName LIKE '%'+@Parameter1+'%'
UNION
SELECT * FROM tblEmployee WHERE FullName LIKE '%'+@Parameter2+'%'
UNION
SELECT * FROM tblEmployee WHERE FullName LIKE '%'+@Parameter3+'%'
UNION
SELECT * FROM tblEmployee WHERE Phone LIKE '%'+@Parameter1+'%'
UNION
SELECT * FROM tblEmployee WHERE Phone LIKE '%'+@Parameter2+'%'
UNION
SELECT * FROM tblEmployee WHERE Phone LIKE '%'+@Parameter3+'%'
UNION
SELECT * FROM tblEmployee WHERE Team LIKE '%'+@Parameter1+'%'
UNION
SELECT * FROM tblEmployee WHERE Team LIKE '%'+@Parameter2+'%'
UNION
SELECT * FROM tblEmployee WHERE Team LIKE '%'+@Parameter3+'%'
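The same "every term against every column" idea can also be generated as a single statement with OR conditions instead of a chain of UNIONs. Below is a sketch under assumptions: SQLite in memory instead of SQL Server, invented sample rows, and a hypothetical fuzzy_search helper. On a table without fully duplicated rows it should return the same set of rows as the UNION version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblEmployee (FullName TEXT, Phone TEXT, Team TEXT)")
conn.executemany(
    "INSERT INTO tblEmployee VALUES (?, ?, ?)",
    [
        ("Ann Lee", "555-0101", "Blue"),
        ("Bob Ray", "555-0202", "Red"),
        ("Cy Dunn", "555-0303", "Blue"),
    ],
)

# Fixed list of searchable columns; never taken from user input.
SEARCH_COLUMNS = ("FullName", "Phone", "Team")


def fuzzy_search(terms):
    """Return names of rows where any term appears in any searchable column."""
    terms = [t for t in terms if t]
    if not terms:
        return []
    clauses, params = [], []
    for col in SEARCH_COLUMNS:
        for term in terms:
            clauses.append(f"{col} LIKE '%' || ? || '%'")
            params.append(term)
    sql = (
        "SELECT FullName FROM tblEmployee WHERE "
        + " OR ".join(clauses)
        + " ORDER BY FullName"
    )
    return [r[0] for r in conn.execute(sql, params)]


print(fuzzy_search(["Blue"]))
print(fuzzy_search(["0202", "Ann"]))
```

The search terms are bound as parameters; only the fixed column names are interpolated into the SQL text.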
A very good solution is searloc. It is a CLR library with zero dependencies and many features.
It supports full-text search, phonetic matching for all languages, keyboard matching, fuzzy search, multi-column search, and many others.
Most importantly, it is very fast: it needs just a few milliseconds for millions of records.

How to match a regular expression from a SQL table to a text

I have used the LIKE operator before to match patterns against a specific SQL table column, for example to get all the rows whose name starts with "A". But the case I am trying to solve now is reversed: I have a column "Description" that holds one regular expression per row, and I need to return the rows whose expression matches an input text.
Table A
Description
============
\b[0-9A-Z ]*WT[A-Z0-9 ]*BALL\b
\b[0-9A-Z ]*WG[A-Z0-9 ]*BAT\b
\b[0-9A-Z ]*AX[A-Z0-9 ]*BAR\b
So the Description column holds these regular expressions, and the input text is "BKP 200 WT STAR BALL". The select query needs to return the first row, since that is the only match. Any help with the select query, or an idea, would be very helpful. If more details are required, please mention that as well.
Cross join your regex table to the one that you are searching within. Then just match the two columns against each other.
Here's an example of how you can match any of your expressions.
Here's how you can match all of them.
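For readers without access to the linked examples, here is a small sketch of the cross-join idea. It rests on assumptions: it uses SQLite from Python, where regular expression matching has to be supplied as a user-defined function (called re_search here), and the table names patterns and inputs are invented. Other databases have their own built-in regex operators.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite has no built-in regex matching, so register a Python function for it.
conn.create_function(
    "re_search", 2, lambda pattern, text: 1 if re.search(pattern, text) else 0
)

conn.execute("CREATE TABLE patterns (Description TEXT)")
conn.executemany(
    "INSERT INTO patterns VALUES (?)",
    [
        (r"\b[0-9A-Z ]*WT[A-Z0-9 ]*BALL\b",),
        (r"\b[0-9A-Z ]*WG[A-Z0-9 ]*BAT\b",),
        (r"\b[0-9A-Z ]*AX[A-Z0-9 ]*BAR\b",),
    ],
)
conn.execute("CREATE TABLE inputs (txt TEXT)")
conn.execute("INSERT INTO inputs VALUES (?)", ("BKP 200 WT STAR BALL",))

# Cross join every input text with every stored pattern, keep the matching pairs.
matches = conn.execute(
    """
    SELECT i.txt, p.Description
    FROM inputs AS i CROSS JOIN patterns AS p
    WHERE re_search(p.Description, i.txt)
    """
).fetchall()

print(matches)
```

For the sample input from the question, only the pattern containing WT ... BALL survives the filter.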

Ignore a SELECT LIKE statement if an identical one has already been satisfied.

I have a table with 4 entries.
CREATE TABLE tab(
name Text
);
INSERT INTO "tab" VALUES('Intertek');
INSERT INTO "tab" VALUES('Pntertek');
INSERT INTO "tab" VALUES('Ontertek');
INSERT INTO "tab" VALUES('ZTPay');
Pntertek & Ontertek are fuzzy duplicates of the correctly spelt Intertek. I wish to create a list consisting of fuzzy duplicates and the correctly spelt names.
As I have 4 names, I have 4 search criteria:
SELECT name FROM tab WHERE name LIKE '%ntertek'
AND (SELECT COUNT(*) FROM tab WHERE name LIKE '%ntertek') >1;
SELECT name FROM tab WHERE name LIKE '%ntertek'
AND (SELECT COUNT(*) FROM tab WHERE name LIKE '%ntertek') >1;
SELECT name FROM tab WHERE name LIKE '%ntertek'
AND (SELECT COUNT(*) FROM tab WHERE name LIKE '%ntertek') >1;
SELECT name FROM tab WHERE name LIKE '%TPay'
AND (SELECT COUNT(*) FROM tab WHERE name LIKE '%TPay') >1;
This produces 3 lists containing the same information. I would like to ignore the 2nd and 3rd identical SELECT statements if the first one returns a result. Is this possible using SQLite and how would I do this?
I'm very much a beginner when it comes to sqlite and programming in general so any help would be greatly appreciated.
Thanks in advance.
What do you want the query to return? Just potential duplicates? If so, you could do the above with one query by including a HAVING clause. However, the method that you are using at the moment only allows for differences at the start of the name. I would suggest looking into something like an edit-distance algorithm (sometimes referred to as Levenshtein distance) to identify the number of characters you would need to change in one field to make it the same as another.
There are details of a possible SQLite implementation in the following link: http://www.sqlite.org/spellfix1.html
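Both suggestions can be tried out quickly from Python. The sketch below makes some assumptions: the HAVING query groups names by everything after the first character (one possible reading of "one query"), and the edit distance is a plain Python Levenshtein function registered with SQLite, not the spellfix1 extension from the link.

```python
import sqlite3


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


conn = sqlite3.connect(":memory:")
conn.create_function("levenshtein", 2, levenshtein)
conn.execute("CREATE TABLE tab (name TEXT)")
conn.executemany(
    "INSERT INTO tab VALUES (?)",
    [("Intertek",), ("Pntertek",), ("Ontertek",), ("ZTPay",)],
)

# One query with HAVING: groups of names that share everything after the first letter.
same_tail = conn.execute(
    """
    SELECT substr(name, 2) AS tail, COUNT(*)
    FROM tab GROUP BY tail HAVING COUNT(*) > 1
    """
).fetchall()

# Edit distance: names that have at least one other name within one change.
near_duplicates = [
    r[0]
    for r in conn.execute(
        """
        SELECT DISTINCT a.name
        FROM tab AS a JOIN tab AS b
          ON a.name <> b.name AND levenshtein(a.name, b.name) <= 1
        ORDER BY a.name
        """
    )
]

print(same_tail)
print(near_duplicates)
```

The self-join compares every pair of names, so it is only practical for small tables; for larger ones an indexed approach such as spellfix1 is probably the better fit.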

SQL : query that only contain one word

I want to select names that contain only one word, using SQL wildcards.
I have tried
select name from employee where name not like '% %'
It works, but I wonder if there are other ways to do it using SQL wildcards.
Note: I am a college student and I am studying wildcards right now. I was just wondering if there are other ways, apart from the above, to show data that contains only one word using wildcards.
Your method makes proper use of wildcards. Alternatively, you could do it with CHARINDEX or a similar function, depending on the RDBMS:
select name
from employee
where CHARINDEX(' ',name) = 0
Likewise, the PATINDEX function and similar functions use wildcards, but that is much the same as CHARINDEX; it just allows for patterns, so it would be helpful if you were looking for multiple spaces. I don't think there's much in the way of variation from your method for using wildcards.
If you have a large database, I would suggest creating a new indexed column word_count that is filled automatically by an insert/update trigger. You will then be able to search for such records more efficiently.
That's the way I'd do it using wildcards. The other way would be:
select name
from employee
where charindex(' ', name) = 0
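To compare the two approaches side by side, here is a small sketch. It assumes SQLite, where the closest equivalent of CHARINDEX is instr (note the swapped argument order); the sample names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?)",
    [("Madonna",), ("John Smith",), ("Cher",), ("Mary Ann Lee",)],
)


def one_word_names(where_clause):
    """Run the query with a fixed, trusted WHERE clause and return the names."""
    sql = f"SELECT name FROM employee WHERE {where_clause} ORDER BY name"
    return [r[0] for r in conn.execute(sql)]


# Wildcard version from the question.
with_wildcard = one_word_names("name NOT LIKE '% %'")
# instr(name, ' ') plays the role of CHARINDEX(' ', name) in SQLite.
with_instr = one_word_names("instr(name, ' ') = 0")

print(with_wildcard)
print(with_instr)
```

Both conditions test the same thing (no space anywhere in the name), so they return the same rows.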

How to implement a Keyword Search in MySQL?

I am new to SQL programming.
I have a table job where the fields are id, position, category, location, salary range, description, refno.
I want to implement a keyword search from the front end. The keyword can reside in any of the fields of the above table.
This is the query I have tried, but it returns many duplicate rows:
SELECT
a.*,
b.catname
FROM
job a,
category b
WHERE
a.catid = b.catid AND
a.jobsalrange = '15001-20000' AND
a.jobloc = 'Berkshire' AND
a.jobpos LIKE '%sales%' OR
a.jobloc LIKE '%sales%' OR
a.jobsal LIKE '%sales%' OR
a.jobref LIKE '%sales%' OR
a.jobemail LIKE '%sales%' OR
a.jobsalrange LIKE '%sales%' OR
b.catname LIKE '%sales%'
For a single keyword on VARCHAR fields you can use LIKE:
SELECT id, category, location
FROM table
WHERE
(
category LIKE '%keyword%'
OR location LIKE '%keyword%'
)
For a description you're usually better off adding a full-text index and doing a Full-Text Search (MyISAM only before MySQL 5.6; InnoDB supports FULLTEXT indexes from 5.6 on):
SELECT id, description
FROM table
WHERE MATCH (description) AGAINST('keyword1 keyword2')
SELECT
*
FROM
yourtable
WHERE
id LIKE '%keyword%'
OR position LIKE '%keyword%'
OR category LIKE '%keyword%'
OR location LIKE '%keyword%'
OR description LIKE '%keyword%'
OR refno LIKE '%keyword%';
Ideally, have a keyword table containing the fields:
Keyword
Id
Count (possibly)
with an index on Keyword. Create an insert/update/delete trigger on the other table so that, when a row is changed, every keyword is extracted and put into (or replaced in) this table.
You'll also need a table of words to not count as keywords (if, and, so, but, ...).
In this way, you'll get the best speed for queries wanting to look for the keywords and you can implement (relatively easily) more complex queries such as "contains Java and RCA1802".
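A rough sketch of the keyword-table idea follows, with several assumptions: it uses SQLite from Python, the keyword table is filled by application code (add_job) rather than by triggers, and the names job_keyword, STOP_WORDS and jobs_with_all are invented for the example.

```python
import re
import sqlite3

# Words that should not count as keywords (a tiny, hypothetical stop list).
STOP_WORDS = {"if", "and", "so", "but", "a", "the", "in", "with"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, description TEXT)")
# The primary key starts with keyword, so it doubles as the index on Keyword.
conn.execute(
    "CREATE TABLE job_keyword (keyword TEXT, job_id INTEGER, "
    "PRIMARY KEY (keyword, job_id))"
)


def add_job(job_id, description):
    """Insert a job and extract its keywords (done by a trigger in the answer)."""
    conn.execute("INSERT INTO job VALUES (?, ?)", (job_id, description))
    words = set(re.findall(r"[a-z0-9]+", description.lower())) - STOP_WORDS
    conn.executemany(
        "INSERT INTO job_keyword VALUES (?, ?)", [(w, job_id) for w in words]
    )


def jobs_with_all(keywords):
    """Return ids of jobs that contain every one of the given keywords."""
    wanted = sorted({k.lower() for k in keywords})
    if not wanted:
        return []
    placeholders = ", ".join("?" for _ in wanted)
    sql = (
        f"SELECT job_id FROM job_keyword WHERE keyword IN ({placeholders}) "
        "GROUP BY job_id HAVING COUNT(*) = ? ORDER BY job_id"
    )
    return [r[0] for r in conn.execute(sql, wanted + [len(wanted)])]


add_job(1, "Java developer with RCA1802 experience")
add_job(2, "Java and Python developer")
add_job(3, "Sales manager in Berkshire")

print(jobs_with_all(["Java", "RCA1802"]))  # "contains Java and RCA1802"
```

The GROUP BY ... HAVING COUNT(*) = n step is what turns a set of single-keyword lookups into an "all of these keywords" query.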
"LIKE" queries will work but they won't scale as well.
Personally, I wouldn't use the LIKE string comparison on the ID field or any other numeric field. It doesn't make sense for a search for ID# "216" to return 16216, 21651, 3216087, 5321668..., and so on and so forth; likewise with salary.
Also, if you want to use prepared statements to prevent SQL injections, you would use a query string like:
SELECT * FROM job WHERE `position` LIKE CONCAT('%', ? ,'%') OR ...
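Here is what the prepared-statement version could look like from application code. This is a sketch under assumptions: it uses Python's sqlite3 module, where string concatenation is written with || instead of CONCAT, and the table contents and the search_jobs helper are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE job (id INTEGER, "position" TEXT, location TEXT)')
conn.executemany(
    "INSERT INTO job VALUES (?, ?, ?)",
    [
        (1, "Sales Manager", "Berkshire"),
        (2, "Developer", "London"),
        (3, "Pre-sales Engineer", "Leeds"),
    ],
)


def search_jobs(keyword):
    """Search two text columns; the keyword is bound, never pasted into the SQL."""
    sql = (
        'SELECT id FROM job WHERE "position" LIKE \'%\' || ? || \'%\' '
        "OR location LIKE '%' || ? || '%' ORDER BY id"
    )
    return [r[0] for r in conn.execute(sql, (keyword, keyword))]


print(search_jobs("sales"))
print(search_jobs("' OR '1'='1"))  # treated as plain text, matches nothing
```

Because the keyword is passed as a bound parameter, an injection attempt is just searched for as a literal string.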
I will explain the method I usually prefer:
First of all, take into consideration that this method sacrifices memory with the aim of gaining computation speed.
Second, you need to have the right to edit the table structure.
1) Add a field (i usually call it "digest") where you store all the data from the table.
The field will look like:
"n-n1-n2-n3-n4-n5-n6-n7-n8-n9" etc.. where n is a single word
I achieve this using a regular expression that replaces " " with "-".
This field is the result of all the table data "digested" into one single string.
2) Use the LIKE operator with %keyword% on the digest field:
SELECT * FROM table WHERE digest LIKE '%keyword%'
You can even build a query with a little loop so you can search for multiple keywords at the same time, looking like:
SELECT * FROM table WHERE
digest LIKE '%keyword1%' AND
digest LIKE '%keyword2%' AND
digest LIKE '%keyword3%' ...
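The digest approach, including the "little loop" that builds the multi-keyword query, might look roughly like this. It is a sketch under assumptions: SQLite from Python, a digest built by application code on insert, and invented helper names (make_digest, add_job, search) and sample rows.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job (id INTEGER PRIMARY KEY, position TEXT, location TEXT, "
    "description TEXT, digest TEXT)"
)


def make_digest(*fields):
    """Join all field values and replace runs of whitespace with '-'."""
    text = " ".join(str(f) for f in fields if f is not None).lower()
    return re.sub(r"\s+", "-", text.strip())


def add_job(job_id, position, location, description):
    digest = make_digest(position, location, description)
    conn.execute(
        "INSERT INTO job VALUES (?, ?, ?, ?, ?)",
        (job_id, position, location, description, digest),
    )


def search(keywords):
    """Build one 'digest LIKE ?' condition per keyword and AND them together."""
    keywords = [k for k in keywords if k]
    if not keywords:
        return []
    where = " AND ".join("digest LIKE '%' || ? || '%'" for _ in keywords)
    sql = f"SELECT id FROM job WHERE {where} ORDER BY id"
    return [r[0] for r in conn.execute(sql, keywords)]


add_job(1, "Sales Manager", "Berkshire", "Lead the regional sales team")
add_job(2, "Developer", "London", "Build sales reporting tools")

print(make_digest("Sales Manager", "Berkshire"))
print(search(["sales", "london"]))
```

The digest has to be rebuilt whenever one of the source columns changes, which is the maintenance cost of this trade of memory for speed.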
You can find another, simpler option in a thread here: Match Against.. with more detailed help in 11.9.2. Boolean Full-Text Searches
This is just in case someone needs a more compact option. It requires creating a FULLTEXT index on the table, which can be accomplished easily.
Information on how to create indexes (MySQL): MySQL FULLTEXT Indexing and Searching
A FULLTEXT index can list more than one column. The result would be an SQL statement using an index named search:
SELECT *, MATCH (`column`) AGAINST('+keyword1* +keyword2* +keyword3*') AS relevance FROM `documents` USE INDEX(search) WHERE MATCH (`column`) AGAINST('+keyword1* +keyword2* +keyword3*' IN BOOLEAN MODE) ORDER BY relevance;
I tried with multiple columns, with no luck. Even though multiple columns are allowed in an index, the column list in MATCH() has to match the column list of a FULLTEXT index exactly, so you still need an index for each column you want to search on its own.
Depending on your criteria, you can use either option.
I know this is a bit late, but this is what I did in our application. I hope it helps someone. Each column needs its own LIKE, and '%keyword%' already covers the 'keyword%' and '%keyword' patterns:
SELECT * FROM `landmarks`
WHERE `landmark_name` LIKE '%keyword%'
OR `landmark_description` LIKE '%keyword%'
OR `landmark_address` LIKE '%keyword%'