search any word inside a string in million rows - sql

I have a set of 50k values say X. each value i want to compare with a set of 10k values say Y. if X is present any where in the string Y it matches.
So each value in X i want to check across each value in Y and assign X if it matches.
what would be the best method to complete this task. It is required for a data mining project.
I loaded the data into MS Access database.
then using a vba program
take each X . Update Y if it matches (Like '%X%') but it is a never ending process. The columns are indexed but no effect.
Is there any algorithm or steps to reduce it into step-by-step process and complete the mapping faster?
Please let me know if there is any other options available other than the answers given below. I ll explain the scenario bit more
Table1.Data
sentense1
sentense2
sentense3
sentense4
sentense5
sentense6
-
-
-
Sentense100k
Table2.Phrase (Means multiple words)
Phrase1
Phrase2
Phrase3
Phrase4
Phrase5
-
-
-
Phrase 100k
Want to check Phrase1 has any Match in Sentense1 to Sentense100k Exact Match of Phrase, anywhere Match of Phrase, Maximum Words in Phrase1 Match in Sentense etc.. and create a map based on best Match(ideally exact phrase available anywhere in the sentense)
Table3 Output
Data Best Possible Phrase Second Best Phrase(Optional)
Sentense1 Phrase1000 Phrase50k
Sentense2 Phrase10 Phrase70k
Please let me know any tool,logic to perform this. The logic what i tried in SQL
1.
Select A.Data,B.Phrase from Table1 A left join Table2 B on A.Data Like '%' + B.Phrase + '%'
2.
Check for any word in phrase available in sentense. So replaced all spaces with % like word1%word2%word3. then did query as
A.Data Like '%' + B.Phrase + '%' which is
A.Data Like '%word1%word2%word3%'
But it takes days to complete the task for this much data.
Any readily usable tools, indexing methods,queries would really help. The answers given below seems too technical for me to adapt. Please guide

You can build a suffix tree in linear time (you can look up suffix trees online), out of the concatenation of all strings in X and Y, with special unique symbols that end each string.
Then for each string Xi in X, you look it up in the suffix tree (linear time in length of Xi) and assign Xi to each string in Y that is somewhere in the subtree rooted at the end of Xi.
This is linear time in the number of strings in Y that Xi is assigned to.
Thus you get an optimal O(N + k) time algorithm, where:
N is the total length of all the strings in X and Y,
and k is the total number of matches between query strings in X and target strings in Y.

Related

Is it possible to explicitly and minimally index all integer solutions to x + y + z = 2n?

For example, the analogous two dimensional question, x + y = 2n is easy to solve: one can simply consider pairs (i,2n-i) for i=1,2,...,n and thus index every solution, exactly once. We note that we have n such pairs solving x + y = 2n, for every fixed value of positive integer n, and so the cardinality of such a set is equal to n as expected.
However, trying to repeat the same problem for x + y + z = 2n, it is not clear to me how (or if it is possible) to write down a minimal set {(2n-i-j,i,j)} such that varying i and j over particular intervals precisely produces every such triplet, exactly once. It can be shown that the number of elements in such a minimal set would be equal to the nearest integer to n^2/3.
It is not hard to see how one can obtain such an indexing with repetitions, or how one can algorithmically remove repetitions, but what I would like to know is whether there is a clean, general construction, as for the x + y = 2n case. Is this possible, or will one always have to artificially restrict certain values of the parameters on the intervals for which they are defined?

What does CAST(SUBSTRING()AS INTEGER)=x do?

I am having an exam on SQL which I have rarely ever worked on. While going through the study material I encountered this example:
DELETE FROM table_name WHERE
CAST (SUBSTRING (attribute_name from x for y) AS INTEGER) =z;
Now, I am guessing that this would delete a specific line, where an attribute name would be specified in the code, but am unsure.
The substring() function extracts part of string (from the column attribute_name) from the position numbered x for y characters (or until position x + y - 1). For instance 1 to 3 would be the first three characters.
This is then converted to an integer and compared to another value.
Rows where the comparison returns "true" are deleted.

How do I calculate the sum efficiently?

Given an integer n such that (1<=n<=10^18)
We need to calculate f(1)+f(2)+f(3)+f(4)+....+f(n).
f(x) is given as :-
Say, x = 1112222333,
then f(x)=1002000300.
Whenever we see a contiguous subsequence of same numbers, we replace it with the first number and zeroes all behind it.
Formally, f(x) = Sum over all (first element of the contiguous subsequence * 10^i ), where i is the index of first element from left of a particular contiguous subsequence.
f(x)=1*10^9 + 2*10^6 + 3*10^2 = 1002000300.
In, x=1112222333,
Element at index '9':-1
and so on...
We follow zero based indexing :-)
For, x=1234.
Element at index-'0':-4,element at index -'1':3,element at index '2':-2,element at index 3:-1
How to calculate f(1)+f(2)+f(3)+....+f(n)?
I want to generate an algorithm which calculates this sum efficiently.
There is nothing to calculate.
Multiplying each position in the array od numbers will yeild thebsame number.
So all you want to do is end up with 0s on a repeated number
IE lets populate some static values in an array in psuedo code
$As[1]='0'
$As[2]='00'
$As[3]='000'
...etc
$As[18]='000000000000000000'```
these are the "results" of 10^index
Given a value n of `1234`
```1&000 + 2&00 +3 & 0 + 4```
Results in `1234`
So, if you are putting this on a chip, then probably your most efficient method is to do a bitwise XOR between each register and the next up the line as a single operation
Then you will have 0s in all the spots you care about, and just retrive the values in the registers with a 1
In code, I think it would be most efficient to do the following
```$n = arbitrary value 11223334
$x=$n*10
$zeros=($x-$n)/10```
Okay yeah we can just do bit shifting to get a value like 100200300400 etc.
To approach this problem, it could help to begin with one digit numbers and see what sum you get.
I mean like this:
Let's say, we define , then we have:
F(1)= 45 # =10*9/2 by Euler's sum formula
F(2)= F(1)*9 + F(1)*100 # F(1)*9 is the part that comes from the last digit
# because for each of the 10 possible digits in the
# first position, we have 9 digits in the last
# because both can't be equal and so one out of ten
# becomse zero. F(1)*100 comes from the leading digit
# which is multiplied by 100 (10 because we add the
# second digit and another factor of 10 because we
# get the digit ten times in that position)
If you now continue with this scheme, for k>=1 in general you get
F(k+1)= F(k)*100+10^(k-1)*45*9
The rest is probably straightforward.
Can you tell me, which Hackerrank task this is? I guess one of the Project Euler tasks right?

Ranking Big O Functions By Complexity

I am trying to rank these functions — 2n, n100, (n + 1)2, n·lg(n), 100n, n!, lg(n), and n99 + n98 — so that each function is the big-O of the next function, but I do not know a method of determining if one function is the big-O of another. I'd really appreciate if someone could explain how I would go about doing this.
Assuming you have some programming background. Say you have below code:
void SomeMethod(int x)
{
for(int i = 0; i< x; i++)
{
// Do Some Work
}
}
Notice that the loop runs for x iterations. Generalizing, we say that you will get the solution after N iterations (where N will be the value of x ex: number of items in array/input etc).
so This type of implementation/algorithm is said to have Time Complexity of Order of N written as O(n)
Similarly, a Nested For (2 Loops) is O(n-squared) => O(n^2)
If you have Binary decisions made and you reduce possibilities into halves and pick only one half for solution. Then complexity is O(log n)
Found this link to be interesting.
For: Himanshu
While the Link explains how log(base2)N complexity comes into picture very well, Lets me put the same in my words.
Suppose you have a Pre-Sorted List like:
1,2,3,4,5,6,7,8,9,10
Now, you have been asked to Find whether 10 exists in the list. The first solution that comes to mind is Loop through the list and Find it. Which means O(n). Can it be made better?
Approach 1:
As we know that List of already sorted in ascending order So:
Break list at center (say at 5).
Compare the value of Center (5) with the Search Value (10).
If Center Value == Search Value => Item Found
If Center < Search Value => Do above steps for Right Half of the List
If Center > Search Value => Do above steps for Left Half of the List
For this simple example we will find 10 after doing 3 or 4 breaks (at: 5 then 8 then 9) (depending on how you implement)
That means For N = 10 Items - Search time was 3 (or 4). Putting some mathematics over here;
2^3 + 2 = 10 for simplicity sake lets say
2^3 = 10 (nearly equals --- this is just to do simple Logarithms base 2)
This can be re-written as:
Log-Base-2 10 = 3 (again nearly)
We know 10 was number of items & 3 was the number of breaks/lookup we had to do to find item. It Becomes
log N = K
That is the Complexity of the alogorithm above. O(log N)
Generally when a loop is nested we multiply the values as O(outerloop max value * innerloop max value) n so on. egfor (i to n){ for(j to k){}} here meaning if youll say for i=1 j=1 to k i.e. 1 * k next i=2,j=1 to k so i.e. the O(max(i)*max(j)) implies O(n*k).. Further, if you want to find order you need to recall basic operations with logarithmic usage like O(n+n(addition)) <O(n*n(multiplication)) for log it minimizes the value in it saying O(log n) <O(n) <O(n+n(addition)) <O(n*n(multiplication)) and so on. By this way you can acheive with other functions as well.
Approach should be better first generalised the equation for calculating time complexity. liken! =n*(n-1)*(n-2)*..n-(n-1)so somewhere O(nk) would be generalised formated worst case complexity like this way you can compare if k=2 then O(nk) =O(n*n)

Math Equation to text

Is there a way to convert Math Equations to a Text?
ex.
3π * 3 = x
out put will be
3(3.141592) multiplied by 3 equals to x
By default Its not possible, You can write a code in such way that it will get the correct string(word) for different symbols and numbers. You can use the database for storing the words or you can do it with your functions.