String distances and variable substitution costs - stringdist

I want to quantify the distance between word pairs based on phonological features. Insertion and deletion costs will stay constant, but substitution costs will vary by letter pair; those costs are stored in a matrix. I am thinking of using the stringdist package to do this, but I do not know how to incorporate variable substitution costs.
Thanks!
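As far as I know, stringdist's weight argument only takes one cost per operation type (deletion, insertion, substitution, transposition), not one per letter pair, so this may need a hand-rolled dynamic program. Below is a minimal Python sketch of the idea (the recurrence ports directly to R); the sub_cost lookup and the example costs are hypothetical stand-ins for a phonological matrix:

```python
from collections import defaultdict

def weighted_edit_distance(s, t, sub_cost, indel=1.0):
    """Edit distance where the substitution cost depends on the letter pair."""
    m = len(t)
    prev = [j * indel for j in range(m + 1)]  # row for the empty prefix of s
    for i in range(1, len(s) + 1):
        curr = [i * indel] + [0.0] * m
        for j in range(1, m + 1):
            sub = 0.0 if s[i-1] == t[j-1] else sub_cost[(s[i-1], t[j-1])]
            curr[j] = min(prev[j] + indel,      # delete s[i-1]
                          curr[j-1] + indel,    # insert t[j-1]
                          prev[j-1] + sub)      # substitute s[i-1] -> t[j-1]
        prev = curr
    return prev[m]

# Hypothetical phonological costs: 'p' and 'b' are close, so cheap to swap
costs = defaultdict(lambda: 1.0, {('p', 'b'): 0.2, ('b', 'p'): 0.2})
print(weighted_edit_distance("pat", "bat", costs))  # 0.2
```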

Related

Does pandas categorical data speed up indexing?

Somebody told me it is a good idea to convert identifying columns (e.g. person numbers) from strings to categorical. This would speed up some operations like searching, filtering, and grouping.
I understand that a 40-character string costs much more RAM, and takes longer to compare, than a simple integer.
But I would have some overhead from a string-to-integer lookup table for translating between the two types and for knowing which integer belongs to which string "number".
Maybe .astype('category') can help me here? Isn't this an integer internally? Does this speed up some operations?
The user guide has the following about categorical data use cases:
The categorical data type is useful in the following cases:
A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.
The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see here.
As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
See also the API docs on categoricals.
The book, Python for Data Analysis by Wes McKinney, has the following on this topic:
The categorical representation can yield significant performance improvements when you are doing analytics. You can also perform transformations on the categories while leaving the codes unmodified.
Some example transformations that can be made at relatively low cost are:
Renaming categories
Appending a new category without changing the order or position of the existing categories
GroupBy operations can be significantly faster with categoricals because the underlying algorithms use the integer-based codes array instead of an array of strings.
Series containing categorical data have several special methods similar to the Series.str specialized string methods. This also provides convenient access to the categories and codes.
In large datasets, categoricals are often used as a convenient tool for memory savings and better performance.
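As a rough illustration of both points (the column name and values below are made up):

```python
import pandas as pd

# Hypothetical data: a column of repeated string identifiers
df = pd.DataFrame({"person": ["A0001", "A0002", "A0001", "A0003"] * 250_000})

print(df["person"].memory_usage(deep=True))  # object dtype: tens of MB

df["person"] = df["person"].astype("category")
print(df["person"].memory_usage(deep=True))  # category: a fraction of that

# Internally, a categorical is integer codes plus one copy of each
# distinct string -- the str-to-int table from the question, managed for you
print(df["person"].cat.codes.dtype)   # int8 (only 3 distinct values)
print(df["person"].cat.categories)    # Index(['A0001', 'A0002', 'A0003'], ...)

# Filtering and grouping now compare integer codes, not strings
df[df["person"] == "A0001"]
df.groupby("person", observed=True).size()
```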

How can I map a complex number to a qubit in Q#?

In theory, the state of a qubit is defined by 2 complex numbers, following this formula: |ψ⟩ = α|0⟩ + β|1⟩, where α and β are complex and |α|² + |β|² = 1.
The rule is that the number of complex numbers needed to define the state of a set of qubits is 2ⁿ, where n is the number of qubits used.
If I have an array of complex numbers, how can I map or assign each number to a qubit?
For instance:
I have this complex number: 0.0020908999722450972 + i*0.001669629942625761.
What would the state of a qubit be in this case?
Would I need more qubits to represent this number?
I think that depends on what you are going to do with these numbers after you have mapped them to qubits.
If you need to use 2ⁿ numbers to prepare a quantum state on n qubits that is a weighted superposition of the basis states, you can use the PrepareArbitraryState operation, which does exactly that. Internally it implements the technique from the paper Synthesis of Quantum Logic Circuits by Shende, Bullock, and Markov.
If you need to represent these numbers in a way that would allow you to read them out by measuring the qubits, you might have to do something like converting them to binary and storing each bit in a separate qubit.
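To make the first option concrete: the 2ⁿ numbers have to be normalized so that their squared magnitudes sum to 1 before they can serve as amplitudes. A minimal Python sketch of that preprocessing (the helper name and the padding value are made up for illustration):

```python
import math

def to_amplitudes(values):
    """Normalize 2^n complex numbers into valid n-qubit amplitudes."""
    n = math.log2(len(values))
    if not n.is_integer():
        raise ValueError("need exactly 2^n values for n qubits")
    norm = math.sqrt(sum(abs(v) ** 2 for v in values))
    return [v / norm for v in values]

# A single complex number doesn't define a qubit state by itself; here we
# pad it with a hypothetical second amplitude of 1 to get a 1-qubit state.
alpha, beta = to_amplitudes([0.0020908999722450972 + 0.001669629942625761j, 1.0])
print(abs(alpha) ** 2 + abs(beta) ** 2)  # 1.0
```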

How to get surrogate variables in rpart

I have looked everywhere I can, but I couldn't find an answer to my question regarding the rpart package.
I have built a regression tree using rpart, with around 700 variables. I want to get the variables actually used to build the tree, including the surrogates. I can find the actual split variables using tree$variable.importance, but I also have to get the surrogates because I need them to predict on my test set. I do not want to keep all 700 variables in the test set, as the data is very big (20 million observations) and I am running out of memory.
The variable.importance vector in an rpart object does show the surrogate variables, but it only shows the top variables, limited by a minimum importance value.
The splits matrix in an rpart object lists all of the split variables and their surrogate variables, along with other data such as the index (the value on which the split occurs for a continuous variable, or the categories that are split for a categorical variable) and a count of how many observations the split applies to. It doesn't give a hierarchy of which surrogates apply to which split, but it does list every variable. To get the hierarchy, you have to run summary(rpart_object).

When writing big O notation, can unknown variables be used?

I do not know if the language I am using in the title is correct, but here is an example that illustrates what I am asking.
What would the time complexity be for this non-optimal algorithm that removes character pairs from a string?
The function loops through a string. When it finds two identical characters next to each other, it returns the string without the found pair. It then recursively calls itself until no pair is found.
Example (each line is the return string from one recursive function call):
iabccba
iabba
iaa
i
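For reference, a direct Python sketch of the algorithm just described:

```python
def remove_pairs(s):
    """Recursively remove adjacent identical character pairs.

    Each call scans the string (O(|Characters|) work) and recurses once
    per pair removed, giving O(|Characters| * |Pairs|) work in total.
    """
    for i in range(len(s) - 1):
        if s[i] == s[i + 1]:
            # drop the pair and start over on the shorter string
            return remove_pairs(s[:i] + s[i + 2:])
    return s

print(remove_pairs("iabccba"))  # "i"
```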
Would it be fair to describe the time complexity as O(|Characters| * |Pairs|)?
What about O(|Characters|^2)? Can pairs be used to describe the time complexity even though the number of pairs is not knowable at the initial function call?
It was argued to me that this algorithm was O(n^2) because the number of pairs is not known.
You're right that this is, strictly speaking, O(|Characters| * |Pairs|).
However, in the worst case the number of pairs can be the same as the number of characters (or the same order of magnitude), for example in the string 'abcdeedcba'.
So it also makes sense to describe it as O(n^2) in the worst case.
I think this largely depends on the problem you mean to solve and its definition.
For graph algorithms, for example, everyone is comfortable writing the complexity as O(|V| + |E|), although in the worst case of a dense graph |E| = |V|^2. In other problems we just look at the worst possible case and write O(n^2), without breaking it into more specific variables.
I'd say that if there's no special convention, and no special data in the problem regarding the number of pairs, O(...) implies worst-case performance, and hence O(n^2) would be more appropriate.

VB.NET Comparing files with Levenshtein algorithm

I'd like to use the Levenshtein algorithm to compare two files in VB.NET. I know I can use an MD5 hash to determine if they're different, but I want to know HOW MUCH different the two files are. The files I'm working with are both around 250 MB. I've experimented with different ways of doing this and I've realized I really can't load both files into memory (all kinds of string-related issues). So I figured I'd just stream the bytes I need as I go. Fine. But the implementations of the Levenshtein algorithm that I've found all dimension a matrix that's length1 × length2 in size, which in this case is impossible to work with. I've heard there's a way to do this with just two vectors instead of the whole matrix.
How can I compute Levenshtein distance of two large files without declaring a matrix that's the product of their file sizes?
Note that the values in each row of the Levenshtein matrix depend only on the values in the row above it. This means that you only need two one-dimensional arrays: one contains the values of the current row; the other is populated with the new values that you can compute from the current row. Then, you swap their roles (the "new" row becomes the "current" row and vice versa) and continue.
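Here is a minimal Python sketch of that two-row scheme (the structure carries over directly to VB.NET arrays; for files you'd compare streamed bytes instead of characters):

```python
def levenshtein(a, b):
    """Levenshtein distance using two reusable rows instead of a full matrix."""
    prev = list(range(len(b) + 1))   # distances from the empty prefix of a
    curr = [0] * (len(b) + 1)
    for i, ca in enumerate(a, start=1):
        curr[0] = i
        for j, cb in enumerate(b, start=1):
            curr[j] = min(prev[j] + 1,               # delete ca
                          curr[j - 1] + 1,           # insert cb
                          prev[j - 1] + (ca != cb))  # substitute ca -> cb
        prev, curr = curr, prev   # the "new" row becomes the "current" row
    return prev[len(b)]

print(levenshtein("kitten", "sitting"))  # 3
```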
Note that this approach only lets you compute the Levenshtein distance (which seems to be what you want); it cannot tell you which operations must be done in order to transform one string into the other. There exists a very clever modification of the algorithm that lets you reconstruct the edit operations without using n·m memory, but I've forgotten how it works.