Levenshtein for multiple words on multiple columns - SQL

I'm trying to make search a bit more friendly and wanted to exploit the Levenshtein distance. This works great, but if a value in a column is 25 characters long, the distance to a 3-character search term is too large; in that case it performs worse than the LIKE method. I solved this by splitting all words into their own rows using regexp_split_to_table. This is nice, but it still doesn't work well if the input contains multiple words.
For example:
Let the data look as follows:
id | col1    | col2
---+---------+-------
 1 | one two | three
 2 | two     | one
 3 | horse   | tree
 4 | house   | three
Using regexp_split_to_table would transform this to:
id | col
---+-------
 1 | one
 1 | two
 1 | three
 2 | one
 2 | two
 3 | horse
 3 | tree
 4 | house
 4 | three
If I search for "one tree", I'd like to compare "one" with each word, and also compare "tree" with each word, and then order by the sum of both distances.
I have no idea where to start. I also do not know if this is the best approach (it seems somewhat excessive, but I'm not an expert either). Maybe I'm overthinking this. I'd appreciate a hint in the right direction :).
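For what it's worth, below is a minimal sketch of one way this could be written in PostgreSQL, assuming the fuzzystrmatch extension is installed (it provides levenshtein()); the table name "mytable" and the search terms are placeholders, not from the question:

-- Sketch only: "mytable" and the search terms are assumptions.
-- Requires: CREATE EXTENSION fuzzystrmatch;  (provides levenshtein())
WITH words AS (
    SELECT id,
           regexp_split_to_table(col1 || ' ' || col2, '\s+') AS word
    FROM mytable
),
terms AS (
    SELECT unnest(ARRAY['one', 'tree']) AS term     -- the user's search words
),
best AS (
    -- for each row and each search term, keep the closest word of that row
    SELECT w.id, t.term, MIN(levenshtein(w.word, t.term)) AS dist
    FROM words w
    CROSS JOIN terms t
    GROUP BY w.id, t.term
)
SELECT id, SUM(dist) AS total_dist
FROM best
GROUP BY id
ORDER BY total_dist;

Taking the MIN per term (instead of summing every pairwise distance) keeps rows with many words from being penalized just for being long; whether that is the right aggregation is a design choice.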

Related

Looping through a jagged array while fixing some dimensions in VB.NET

I have a 7-dimensional jagged array that is essentially just a collection of decimal numbers. I need to go through the array and add up all the decimals that have certain values in certain columns. For example:
(A)(B)(..)(..)(..)(..)(..)
Where .. is the entire size of the dimension. For the above case I can simply use a bunch of nested for loops, because I know that A and B are at the start of the array. But how can I deal with this if the dimension in which A and B are located is randomised? E.g.:
(..)(A)(..)(..)(B)(..)(..)
Or
(..)(..)(..)(..)(..)(A)(B)
Or
(..)(..)(A)(..)(..)(..)(B)
Etc.
I thought about having a Select Case for the locations of A and B, but this leads to hundreds (if not thousands) of lines of repeated code, and it feels like bad practice.
Any suggestions?
Edit #1
This is difficult to explain, so I'm going to use a much simpler example. Instead of 7 dimensions let's say it's 2 dimensions (each with a length of 4), and instead of A and B let's say it's just A. I wish to add the following elements:
(A)(0)
(A)(1)
(A)(2)
(A)(3)
(0)(A)
(1)(A)
(2)(A)
(3)(A)
As you can see, this is every element where A is in either of the dimensions (A is a real number, in this case either 0, 1, 2, or 3). Now in my case both A and B need to be in one of the dimensions, with the requirement that A always comes before B. But since there are 7 dimensions, there are so many possible locations for A and B that writing code for each scenario is not ideal (I'd also like to extend it to C, D, etc.).

Store multiple disconnected datasets in SQL correctly

I have multiple datasets with an identical schema, and I am not sure how I should design the SQL correctly here. The question is very simple, but I just do not have experience with SQL. Let's say that there are 40 tables that store matrix data as row_num, col_num, val. Each such table has its own name. Because the tables have hundreds of millions of rows, fitting all of them into just one table seems wrong from a performance standpoint. So I am thinking of creating 40 tables, but I am not sure what the optimal schema should look like in this case. Each such table, which represents a matrix, will in turn have related tables with a different schema:
table_of_type_MATRIX_1 --> table_of_type_BIRDS (relevant for table_of_type_MATRIX_1 only!)
table_of_type_MATRIX_2 --> table_of_type_BIRDS (relevant for table_of_type_MATRIX_2 only!)
So, basically, there is a bunch of somewhat disconnected data that I want to store in one database, and I am not sure how to organize it. There will be queries, of course, that will require looking into multiple tables with identical schemas. Any suggestions would be greatly appreciated.
Example
Matrix looks like that:
gene cell_id expr
0 0610005C13Rik GCTAAGTATTTN_CTL-6_OPC 0.000000
1 0610007N19Rik GCTAAGTATTTN_CTL-6_OPC 0.000000
2 0610007P14Rik GCTAAGTATTTN_CTL-6_OPC 3.593143
3 0610009B22Rik GCTAAGTATTTN_CTL-6_OPC 3.593143
4 0610009D07Rik GCTAAGTATTTN_CTL-6_OPC 10.779429
...
and dozens of millions more rows
It is a matrix of gene expression: the first column holds the gene that is expressed in the cell shown in the second column, with the expression level in the third. The cells (second column) are also grouped into clusters after dimensionality reduction and clustering algorithms are run, so we have a second table that is related to the first:
cell_id cluster
GCTAAGTATTTN_CTL-6_OPC 1
GCTGGGTATTTN_CTL-6_OPC 2
GCTAAGTATAAN_CTL-6_OPC 2
GCTAAGTATTTN_CTL-6_OPC 3
...
and so on for all of the cells
So these two related tables, the gene expression matrix and the cells' cluster assignments, form a disconnected dataset in themselves. There will be many such groups of two tables that need to be stored.
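For reference, one common alternative to 40 separate table pairs is a single pair of tables keyed by a dataset identifier; whether that performs acceptably at hundreds of millions of rows depends on indexing and partitioning. A rough, PostgreSQL-flavoured sketch in which every table and column name is invented for illustration:

-- Illustrative sketch only; all names below are made up, not from the question.
CREATE TABLE dataset (
    dataset_id int PRIMARY KEY,
    name       text NOT NULL               -- e.g. 'MATRIX_1'
);

CREATE TABLE expression (
    dataset_id int NOT NULL REFERENCES dataset,
    gene       text NOT NULL,
    cell_id    text NOT NULL,
    expr       double precision NOT NULL
);
CREATE INDEX ON expression (dataset_id, gene);

CREATE TABLE cell_cluster (
    dataset_id int NOT NULL REFERENCES dataset,
    cell_id    text NOT NULL,
    cluster    int NOT NULL
);

With declarative partitioning on dataset_id (PostgreSQL 10+), this keeps the "one logical table, many physical tables" layout the 40-table idea is reaching for, while cross-dataset queries remain straightforward.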

Recoding multiple variables in the same way

I am looking for the shortest way to recode many variables in the same way.
For example, I have a data frame where columns a, b, c are names of survey items and rows are observations.
d <- data.frame(a=c(1,2,3), b=c(1,3,2), c=c(1,2,1))
I want to change the values of all observations for selected columns. For instance, value 1 of columns "a" and "c" should be replaced with the string "low", and values 2 and 3 of these columns should be replaced with "high".
I do it often with many columns so I am looking for function which can do it in very simple way, like this:
recode2(data=d, columns=a,c, "1=low, 2,3=high").
The recode function from the car package is almost what I need, but if I have 10 columns to recode I have to call it 10 times, which is not as efficient as I would like.

Numbering repeated values in column in Excel using VBA

I have a column with varying values, and some of these values can be repeated; if there are two of the same value, I need the first occurrence followed by 1 and the second followed by 2.
For Example:
Apple1
Apple2
Lemon1
Apple3
Pear1
Lemon2
Apple4
Orange1
Pear2
I've tried using nested if loops but I can't seem to find an efficient way to do this.
You can use two loops to go through all the elements.
By the way, you can add one more step that checks whether the last character is numeric and skips it, for a faster process.

Power-law distribution in T-SQL

I basically need the answer to this SO question that provides a power-law distribution, translated to T-SQL for me.
I want to pull a last name, one at a time, from a census provided table of names. I want to get roughly the same distribution as occurs in the population. The table has 88,799 names ranked by frequency. "Smith" is rank 1 with 1.006% frequency, "Alderink" is rank 88,799 with frequency of 1.7 x 10^-6. "Sanders" is rank 75 with a frequency of 0.100%.
The curve doesn't have to fit precisely. Just give me about 1% "Smith" and about 1 in a million "Alderink".
Here's what I have so far.
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank] = ROUND(88799 * RAND(), 0)
But this of course yields a uniform distribution.
I promise I'll still be trying to figure this out myself by the time a smarter person responds.
Why settle for a power-law distribution when you can draw from the actual distribution?
I suggest you alter the LastNames table to include a numeric column which would contain a value representing the actual number of individuals with a more common name. You'll probably want a number on a smaller but proportional scale, say 10,000 for each percent of representation.
The list would then look something like:
(other than the 3 names mentioned in the question, I'm guessing about White, Johnson et al)
Smith 0
White 10,060
Johnson 19,123
Williams 28,456
...
Sanders 200,987
..
Alderink 999,997
And the name selection would be
SELECT TOP 1 [LastName]
FROM [LastNames] as LN
WHERE LN.[number_described_above] < ROUND(1000000 * RAND(), 0)
ORDER BY [number_described_above] DESC
That picks the first name whose number does not exceed the [uniform-distribution] random number. Note how the query uses less-than and orders in descending order; this guarantees that the very first entry (Smith) can get picked. The alternative would be to start the series with Smith at 10,060 rather than zero and to discard random draws smaller than this value.
Aside from the matter of boundary management (starting at zero rather than 10,060) mentioned above, this solution, along with the two other responses so far, is the same as the one suggested in dmckee's answer to the question referenced by this question. Essentially, the idea is to use the CDF (cumulative distribution function).
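In case it helps, here is one way that running-total column could be populated; the [Frequency] and [NumMoreCommon] column names are assumptions, and the window frame needs SQL Server 2012 or later:

-- Sketch: assumes a [Frequency] column holding each name's percentage share.
ALTER TABLE [LastNames] ADD [NumMoreCommon] INT;

UPDATE LN
SET [NumMoreCommon] = CAST(ROUND(ISNULL(s.PrevPct, 0) * 10000, 0) AS INT)
FROM [LastNames] AS LN
JOIN (
    SELECT [Rank],
           SUM([Frequency]) OVER (ORDER BY [Rank]
                                  ROWS BETWEEN UNBOUNDED PRECEDING
                                           AND 1 PRECEDING) AS PrevPct
    FROM [LastNames]
) AS s ON s.[Rank] = LN.[Rank];

The exclusive window frame means Smith gets 0 (ISNULL handles the empty frame), White gets roughly 10,060, and so on, matching the list above.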
Edit:
If you insist on using a mathematical function rather than the actual distribution, the following should provide a power-law function which would somewhat convey the "long tail" shape of the real distribution. You may want to tweak the @PwrCoef value (which, by the way, needn't be an integer); essentially, the bigger the coefficient, the more skewed towards the beginning of the list the function is.
DECLARE @PwrCoef INT
SET @PwrCoef = 2
SELECT 88799 - ROUND(POWER(POWER(88799.0, @PwrCoef) * RAND(), 1.0/@PwrCoef), 0)
Notes:
- the extra ".0" in the function above is important to force SQL to perform float operations rather than integer operations.
- the reason we subtract the power calculation from 88799 is that the calculation's distribution is such that the closer a number is to the end of our scale, the more likely it is to be drawn. The list of family names being sorted in reverse order (most likely names first), we need this subtraction.
Assuming a power of, say, 3, the query would then look something like:
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank]
= 88799 - ROUND(POWER(POWER(88799.0, 3) * RAND(), 1.0/3), 0)
This is the query from the question, except for the last line.
Re-Edit:
In looking at the actual distribution, as apparent in the Census data, the curve is extremely steep and would require a very big power coefficient, which in turn would cause overflows and/or extreme rounding errors in the naive formula shown above.
A more sensible approach may be to operate in several tiers, i.e. to perform an equal number of draws in each of, say, three thirds (or four quarters, or...) of the cumulative distribution; within each of these partial lists, we would draw using a power-law function, possibly with the same coefficient but with different ranges.
For example
Assuming thirds, the list divides as follows:
First third = 425 names, from Smith to Alvarado
Second third = 6,277 names, ending at Gainer
Last third = 82,097 names, from Frisby to the end
If we were to need, say, 1,000 names, we'd draw 334 from the top third of the list, 333 from the second third and 333 from the last third.
For each of the thirds we'd use a similar formula, maybe with a bigger power coefficient for the first third (where we are really interested in favoring the earlier names in the list, and also where the relative frequencies are more statistically relevant). The three selection queries could look like the following:
-- Random Drawing of a single Name in top third
-- Power Coef = 12
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank]
= 425 - ROUND(POWER(POWER(425.0, 12) * RAND(), 1.0/12), 0)
-- Second third; Power Coef = 7
...
WHERE LN.[Rank]
= (425 + 6277) - ROUND(POWER(POWER(6277.0, 7) * RAND(), 1.0/7), 0)
-- Bottom third; Power Coef = 4
...
WHERE LN.[Rank]
= (425 + 6277 + 82097) - ROUND(POWER(POWER(82097.0, 4) * RAND(), 1.0/4), 0)
Instead of storing the PDF as rank, store the CDF (the sum of all frequencies up to that name, starting from Alderink).
Then modify your SELECT to retrieve the first LN with rank greater than the result of your formula.
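As a rough illustration of that idea, the draw could look like the following; the [CumFreq] column name and its 0-to-1 scaling are assumptions, not part of the suggestion above:

-- Sketch: [CumFreq] is an assumed column holding the running total of the
-- name frequencies, scaled to the range 0..1.
DECLARE @r FLOAT = RAND();

SELECT TOP 1 [LastName]
FROM [LastNames]
WHERE [CumFreq] >= @r
ORDER BY [CumFreq] ASC;

Since the census frequencies don't add up to exactly 100% (as noted further down), a draw can land above the largest stored value and return no row, so a fallback to the last name may be needed.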
I read the question as "I need to get a stream of names which will mirror the frequency of last names from the 1990 US Census"
I might have read the question a bit differently than the other suggestions, and although an answer has been accepted, and a very thorough answer it is, I will contribute my experience with the Census last names.
I had downloaded the same data from the 1990 census. My goal was to produce a large number of names to be submitted for search testing during performance testing of a medical record app. I inserted the last names and the percentage frequency into a table. I added a column and filled it with an integer that was the product of "total names required * frequency". The frequency data from the census did not add up to exactly 100%, so my total number of names was also a bit short of the requirement. I was able to correct the number by selecting random names from the list and increasing their count until I had exactly the required number; the randomly added count never amounted to more than 0.05% of the total of 10 million.
I generated 10 million random numbers in the range of 1 to 88,799. With each random number I would pick that name from the list and decrement the counter for that name. My approach was to simulate dealing a deck of cards, except my deck had many more distinct cards and a varying number of each card.
Do you store the actual frequencies with the ranks?
Converting the algebra from that accepted answer to T-SQL is no bother, if you know what values to use for n. y would be what you currently have, ROUND(88799 * RAND(), 0), and x0, x1 = 1, 88799 I think, though I might misunderstand it. The only non-standard maths operator involved from a T-SQL perspective is ^, which is just POWER(x, y) == x^y.
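For what it's worth, a sketch of that substitution using the standard inverse CDF for a power-law density p(x) proportional to x^n on [x0, x1]; the exponent value below is an arbitrary placeholder, and the whole thing assumes n is not -1:

-- Sketch only: @n is an assumed exponent; verify against the linked answer.
DECLARE @n  FLOAT = -0.5;        -- power-law exponent (placeholder value)
DECLARE @x0 FLOAT = 1;           -- lowest rank
DECLARE @x1 FLOAT = 88799;       -- highest rank
DECLARE @y  FLOAT = RAND();      -- uniform draw in [0, 1)

-- x = ((x1^(n+1) - x0^(n+1)) * y + x0^(n+1)) ^ (1 / (n+1))
SELECT ROUND(
         POWER((POWER(@x1, @n + 1) - POWER(@x0, @n + 1)) * @y
               + POWER(@x0, @n + 1),
               1.0 / (@n + 1)),
         0) AS [Rank];

With a negative exponent like this, small ranks (the common names) should come out more often, which matches the intent of the original query.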