BaseX Index or Query Optimization for Partial String Match

I am currently having an issue where my BaseX query is taking much longer than it should to search through my dataset. The problem is that I am searching for a substring of a fixed-length text field. If I search for the exact string, for example, it returns in about 6 ms, whereas searching for the last 8 characters of a 15-character string takes 5 seconds. I have tried writing the search 3 different ways, and every time it takes 5 seconds. My concern is that it needs an index for the last characters, and it only seems to allow indexing of the full string. Anyhow, here is some sample info:
xml:
<xmlfile><sometag>FIXED39LENGTH</sometag></xmlfile>
query:
<result>{
for $c in db:open('Test')
where $c/xmlfile/sometag[text() contains text ".{6,6}9LENGTH" using wildcards]
return <result sometag="{$c/xmlfile/sometag}"/>
}</result>
As for the "full-text" index, it would just be specified as "sometag".

Related

Optimizing VB.NET code that uses a very large string list, over 400 000 entries

I use a static list of unique strings T (a French dictionary) with 402 325 entries. My code plays Scrabble and uses a specialized construction called a GADDAG to construct playable words and verify that the words are actually in the list. Is there a faster way than list.IndexOf(T) to find whether a word exists? I looked at HashSet.Contains(T), but it does not give me an index that I can use to retrieve a word. For example, for a given turn of play there could be thousands of valid solutions: I currently store only the index into the list, but with a HashSet I would not be able to do that and would have to store all the words, which increases memory usage. In most cases solutions are found in one or two seconds, but in some cases (i.e. with blanks) it takes up to 15 seconds, and I need to reduce that if at all possible with VB!
As Craig suggested, using List.BinarySearch(T) on a sorted list of T improves the speed by around 10-fold. A Scrabble play with a blank letter now takes no more than 1 or 2 seconds, compared to 15 to 20 seconds when I was using IndexOf.
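For illustration, here is a minimal sketch of that idea in Python rather than VB.NET (the word list is a tiny stand-in for the real dictionary): sort the list once up front, then use a binary search, which still gives you an index you can store, just like IndexOf did.

from bisect import bisect_left

# A tiny stand-in for the 402 325-entry dictionary; in practice, load and sort it once.
words = sorted(["bateau", "chat", "chien", "maison", "scrabble"])

def index_of(word):
    # Binary search: O(log n) instead of the O(n) scan done by IndexOf.
    i = bisect_left(words, word)
    if i < len(words) and words[i] == word:
        return i          # the index into the sorted list can still be stored and reused
    return -1

print(index_of("chien"))   # 2
print(index_of("zèbre"))   # -1

HashSet.Contains would also be fast, but as noted above it does not give you an index; a sorted list with binary search keeps that property.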

Random string generation using arc4random

I'm trying to create a method that generates a random string consisting of 32 characters. This method will generate a random number using arc4random_uniform(62) to choose a number between 0 and 61 and then pick a character from a string that holds the digits 0 to 9 and the alphabet letters, both lowercase and uppercase, in that order. For instance, if arc4random_uniform(62) returns 10, the chosen character will be a; if it returns 61, the chosen character will be Z. The method will do this 32 times to create the final generated string.
I was wondering when this approach will fail to generate a unique string and produce a repeated one. I searched this topic and didn't find a satisfying answer. I hope that you will help me with this, since I am trying to use this method to generate unique IDs for use in my app.
This method will generate a random number using arc4random_uniform(62) to choose a number between 0 and 61 and then pick a character from a string that holds the digits 0 to 9 and the alphabet letters, both lowercase and uppercase, in that order.
You could create an array or a string with all the characters you want to include and randomly pick values. Alternatively, you could take advantage of the fact that the ASCII encoding has mostly sequential character positions, so you can fairly easily convert an ASCII number to an NSString.
An integer between 48 and 57 is one of the digits 0-9 in ASCII, 65 to 90 is A-Z, and 97 to 122 is a-z: https://en.wikipedia.org/wiki/Ascii_table#ASCII_printable_code_chart
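As a rough illustration of the pick-from-an-alphabet approach (the question is about Objective-C and arc4random_uniform; this is just a Python sketch, using the same digits/lowercase/uppercase ordering so that index 10 is a and index 61 is Z):

import secrets

# 62-character alphabet: digits, then lowercase, then uppercase, so index 10 = 'a' and 61 = 'Z'.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def random_id(length=32):
    # Pick one of the 62 characters uniformly at random, `length` times.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(random_id())  # e.g. a 32-character string like 'k3R9...'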
I was wondering when this approach will fail to generate a unique string and produce a repeated one. I searched this topic and didn't find a satisfying answer.
It's often referred to as the "birthday problem". As long as your value is reasonably long (say, 20 characters), it is effectively impossible to have a collision. The world is more likely to be destroyed in the next 2 seconds than your app is ever to create a collision.
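To put a rough number on that, here is a small Python calculation using the standard birthday-problem approximation for 32-character IDs drawn from a 62-character alphabet (illustrative figures, not taken from the original answer):

import math

# Birthday-problem approximation: p(collision) ~= n^2 / (2 * N) when p is small,
# where N is the number of possible IDs and n is how many IDs you generate.
N = 62 ** 32                        # ~2.3e57 possible 32-character IDs
for n in (10**6, 10**9, 10**12):    # a million, a billion, a trillion IDs
    p = -math.expm1(-n * n / (2 * N))   # 1 - exp(-n^2 / (2N)), computed without underflow
    print(f"{n:>16,} IDs -> collision probability ~ {p:.1e}")

Even after a trillion IDs the probability is on the order of 1e-34, which is what "effectively impossible" means in practice.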
I hope that you will help me with this, since I am trying to use this method to generate unique IDs for use in my app.
Apple provides an API for generating unique IDs. You should use that instead of inventing your own system:
NSString *id = [NSUUID UUID].UUIDString;
That will give you a value like D19B40AA-322C-4ADF-BEF6-2EC4D4CE7BA8. It conforms to "Version 4" of the UUID standard; according to Wikipedia, if you generate 1 billion UUIDs every second for the next 100 years, there is a 50% chance of getting two IDs that are the same.
If the UUID is longer than you want, you could grab a smaller part of the string. Beware that the 4 at the start of the third block means this is a "Version 4" UUID and is not a random value. Also, the first character of the 4th block has only four possible values, so avoid or strip off those two characters if you want to use a smaller part of the string as your random ID. See the Wikipedia page on UUIDs for more detail.

How to improve a single character PrefixQuery performance?

I have a RAMDirectory with 1.5 million documents and I'm searching using a PrefixQuery for a single field. When the search text has a length of 3 or more characters, the search is extremely fast, less than 20 milliseconds. But when the search text is shorter than 3 characters, the search can take as much as a full second.
Since it's an autocomplete feature and the user starts with one character (and there are results that are indeed 1 character long), I cannot restrict the length of the search text.
The code is pretty much:
var symbolCodeTopDocs = searcher.Search(new PrefixQuery(new Term("SymbolCode", searchText)), 10);
The SymbolCode is a NOT_ANALYZED field. The Lucene.NET version is 3.0.3.
The example is simplified, and I might have to use a BooleanQuery to apply additional constraints in a real world scenario.
How can I improve performance on this specific case? These single-char or two-char queries are bringing the server down.
Consider removing stop words from your index if you haven't already.
To understand how stop words slow down a PrefixQuery, consider how PrefixQuery works: it is rewritten as a BooleanQuery that includes every term from the index beginning with the PrefixQuery's term. For example, a* becomes a OR and OR aardvark OR anchor OR ... So far this isn't bad, and it will perform surprisingly well even with thousands of terms. The real drain is when stop words like "a" and "and" are included, because they'll likely be found multiple times in every single document in your index. This creates a lot more work for the gathering/collecting/scoring portion of the search and thus slows things down.
On a side note, I highly recommend not running the autocomplete search until the user has entered at least 2 or 3 characters, purely from a usability perspective. I can't imagine the results would be at all relevant. Imagine running a search for a*: there's no way to tell which results are more relevant. If you must display something to the user, then consider an n-gram approach like Jf Beaulac suggested in the comments.
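For the n-gram idea, here is a conceptual Python sketch (not Lucene.NET code): precompute a map from every 1- and 2-character prefix to its matching terms, so a one- or two-character query becomes a single lookup instead of a huge expanded OR.

from collections import defaultdict

# Conceptual sketch only: map short prefixes (1-2 chars) to the terms that start with them.
terms = ["A", "AAPL", "AMZN", "GE", "GOOG", "MSFT"]   # illustrative symbol codes

prefix_index = defaultdict(list)
for term in terms:
    for n in (1, 2):
        if len(term) >= n:
            prefix_index[term[:n].lower()].append(term)

def autocomplete(text):
    text = text.lower()
    if len(text) <= 2:
        return prefix_index.get(text, [])                          # cheap lookup for short inputs
    return [t for t in terms if t.lower().startswith(text)]        # ordinary prefix search otherwise

print(autocomplete("a"))    # ['A', 'AAPL', 'AMZN']
print(autocomplete("goo"))  # ['GOOG']

In Lucene this generally corresponds to indexing edge n-grams of the field into a separate field at index time and querying that field with a plain term query for short inputs.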

How to modify data in a Text File in VB.NET

I have a text file with more than 2000 rows, like this:
10
21
13
...
and I want to find the average of the last 1440 rows, counting from the bottom up, and find the max; then find the average for each 30 rows, put those averages beside the data, and find the max of these averages, like this:
max(of data) = ----
max(average) = ----
While the question shows a lack of effort, I'll still give some basic guidelines to help your search.
Here are some things you are going to have to understand to tackle your problem:
1. How to handle text files in .NET
You can easily process files using the System.IO.File class. This class has several static methods that are very useful. (Static methods allow you to call the method without explicitly creating an object.)
System.IO.File Reference on MSDN
System.IO.File.ReadAllLines: this method lets you read each line of the file into an array
ReadAllLines is most useful when the file is short enough to read all at once. At 2000 rows this should not be a problem. If you had millions of rows you would have to look at how to work with something called streams (deal with data in small chunks)
2. How to convert a String to a number
The strings you read in with ReadAllLines aren't very useful as strings. You need to convert them to numbers to do math with them. And of course there is a class for that...
System.Int32.Parse Converts a string to a number, throws an exception for bad formats
System.Int32.TryParse Converts a string to a number, returns a default value on error
3. How to do a for loop in VB.NET
Any introductory tutorial should cover for loops, but here is one from MSDN
For Loops in VB.NET
4. How to do something every nth time through a loop
Use the modulus operator. This operator is like division, except that it returns the remainder. Every time the mod operation returns zero you have an exact multiple.
Example of using the Mod operator in VB.NET
5. How to find the max in a list of numbers
Have a variable to store the max value. Initialize it to something smaller than any possible value; Int32.MinValue is a safe choice. Loop through every number: if it is larger than the max value, assign it to the max value (it's the new max). When you have processed every number, the max value holds the largest number you found.
There are a few other details but if you can accomplish 1-5 you'll be able to ask a more specific question. This type of specific question will be better received by the stackoverflow community.
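Putting steps 1-5 together, here is a minimal sketch of the whole calculation in Python (the question is about VB.NET, where the equivalents are System.IO.File.ReadAllLines, Integer.TryParse, For loops, and the Mod operator; the file name data.txt is an assumption):

values = []
with open("data.txt") as f:                  # step 1: read the text file
    for line in f:
        try:
            values.append(int(line))         # step 2: convert each line to a number
        except ValueError:
            pass                             # ignore lines that are not valid integers

last_1440 = values[-1440:]                   # the 1440 rows counted from the bottom up
print("max(of data) =", max(last_1440))

block_averages = []
block_sum = 0
for i, v in enumerate(last_1440, start=1):   # step 3: loop over the rows
    block_sum += v
    if i % 30 == 0:                          # step 4: every 30th row, via the modulus operator
        block_averages.append(block_sum / 30)
        block_sum = 0

print("max(average) =", max(block_averages)) # step 5: the maximum of the 30-row averages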
Happy coding.

How does Lucene work

I would like to find out how Lucene search works so fast. I can't find any useful docs on the web. If you have anything (short of the Lucene source code) to read, let me know.
A text search query using MySQL 5 text search with an index takes about 18 minutes in my case. A Lucene search for the same query takes less than a second.
Lucene is an inverted full-text index. This means that it takes all the documents, splits them into words, and then builds an index for each word. Since the index is an exact string-match, unordered, it can be extremely fast. Hypothetically, an SQL unordered index on a varchar field could be just as fast, and in fact I think you'll find the big databases can do a simple string-equality query very quickly in that case.
Lucene does not have to optimize for transaction processing. When you add a document, it need not ensure that queries see it instantly. And it need not optimize for updates to existing documents.
However, at the end of the day, if you really want to know, you need to read the source. Both things you reference are open source, after all.
Lucene creates a big index. The index contains the word id, the number of docs where the word is present, and the positions of the word in those documents. So when you issue a single-word query it just searches the index (O(1) time complexity). Then the result is ranked using different algorithms. For a multi-word query, it just takes the intersection of the sets of files where the words are present.
Thus Lucene is very very fast.
For more info, read this article by the Google developers: http://infolab.stanford.edu/~backrub/google.html
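As a toy illustration of that inverted index and the intersection step (a simplification in Python, not how Lucene is actually implemented):

from collections import defaultdict

# Build a tiny inverted index: each word maps to the set of documents containing it.
docs = {
    1: "random access memory is the main memory",
    2: "hard disks are secondary memory",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

def search(query):
    # A single-word query is one lookup; a multi-word query intersects the posting sets.
    posting_sets = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*posting_sets) if posting_sets else set()

print(search("memory"))        # {1, 2}
print(search("main memory"))   # {1}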
In a word: indexing.
Lucene creates an index of your document that allows it to search much more quickly.
It's the same as the difference between a list, an O(N) data structure, and a hash table, an O(1) data structure. The list has to walk through the entire collection to find what you want. The hash table has an index that lets it figure out exactly where the desired item is and simply fetch it.
Update:
I'm not certain what you mean by "Lucene index searches are a lot faster than mysql index searches."
My guess is that you're using MySQL "WHERE document LIKE '%phrase%'" to search for a document. If that's true, then MySQL has to do a table scan on every row, which will be O(N).
Lucene gets to parse the document into tokens, group them into n-grams at your direction, and calculate indexes for each one of those. It's O(1) to find a word in an indexed Lucene document.
Lucene works with term frequency and inverse document frequency. It creates an index mapping each word to the documents containing it and its frequency count, which is nothing but an inverted index on the documents.
Example:
File 1: Random Access Memory is the main memory.
File 2: Hard disks are secondary memory.
Lucene creates a reverse index, something like:
File 1:
    Term: Random
    Frequency: 1
    Position: 0
    Term: Memory
    Frequency: 2
    Position: 3
    Position: 6
So it is able to search and retrieve the matching content quickly. When there are too many matches for the search query, it orders the output based on weight. Consider the search query "Main Memory": it searches for both words individually, and the result would be something like:
Main
    File 1: Frequency - 1
Memory
    File 1: Frequency - 2
    File 2: Frequency - 1
The result would be File 1 followed by File 2. To avoid being carried away by the weights of the most common words like 'and', 'or', 'the', it considers the inverse document frequency (i.e. it decreases the weight of a word that is very common across the document set).
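As a rough illustration of that weighting (a toy Python calculation, not Lucene's actual scoring formula), here is a sketch that scores the two files above for the query "Main Memory"; the word "memory", which occurs in every file, gets a lower idf and therefore contributes less than the rarer word "main":

import math

docs = {
    "File 1": "random access memory is the main memory",
    "File 2": "hard disks are secondary memory",
}
N = len(docs)

def score(query, text):
    words = text.split()
    total = 0.0
    for term in query.lower().split():
        tf = words.count(term)                              # term frequency in this document
        df = sum(term in d.split() for d in docs.values())  # how many documents contain the term
        idf = math.log(1 + N / df) if df else 0.0           # smoothed inverse document frequency
        total += tf * idf
    return total

for name, text in docs.items():
    print(name, "->", round(score("Main Memory", text), 3))   # File 1 scores above File 2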