git_index_get_byindex vs git_status_foreach_ext - libgit2

It looks like git_index_get_bypath and git_status_foreach_ext (with GIT_STATUS_SHOW_INDEX_ONLY) are just different ways of reading the index. What are the differences, and why would I use one vs. the other?

git_index_get_bypath lets you look up a particular entry in a given index.
git_status_foreach_ext does a status check, which is a comparison between the worktree, the index and HEAD, and then iterates over the results of that comparison, calling the passed function for each one. With that flag, it skips the worktree in that comparison.
Which one to use depends on what you're looking for: a particular entry in the index or a list of differences between the index and HEAD.

Related

Lucene DocValuesField, SortedDocValuesField usage for filtering and sorting

I am going to switch to newest (4.10.2) version of Lucene and I'd like to make some optimization in my index and code.
I would like to use DocValuesField to get values but also for filtering and sorting.
So here I have some questions:
If I'd like to use range filter (FieldCacheRangeFilter) I need to store a value in XxxDocValuesField,
but if I want to use terms filter (FieldCacheTermsFilter) I need to store a value in SortedDocValuesField.
So it looks like if I want to use range and terms filters I need to have two different fields. Am I right? Am I using it correctly?
Another thing is Sort. I can choose between SortedNumericSortField and SortField. The first one requires SortedNumericDocValues, the other a NumericDocValuesField. Is there any (big) difference in performance?
Should I use SortedNumericSortField (adding another field to the index)?
And the last one. Am I right that all corresponding DocValuesField will be removed from index when doc is removed? I saw an IndexWriter method for an update doc value but no delete method for doc value.
Regards
Piotr

Any way to use strings as the scores in a Redis sorted set (zset)?

Or maybe the question should be: What's the best way to represent a string as a number, such that sorting their numeric representations would give the same result as if sorted as strings? I devised a way that could sort up to 9 characters per string, but it seems like there should be a much better way.
In advance, I don't think using Redis's lexicographical commands will work. (See the following example.)
Example: Suppose I want to presort all of the names linked to some ID so that I can use ZINTERSTORE to quickly get an ordered list of IDs based on their names (without using redis' SORT command). Ideally I would have the IDs as the zset's members, and the numeric representation of each name would be the zset's scores.
Does that make sense? Or am I going about it wrong?
You're trying to use an order preserving hash function to generate a score for each id. While it appears you've written one, you've already found out that the score's range allows you to use only the first 9 characters (it would be interesting to see your function btw).
Instead of this approach, here's a simpler one that would be easier IMO - use set members of the form <name>:<id> and set the score to 0. You'll be able to use lexicographical ordering this way and use something like split(':') to get the id from the set's members.
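A minimal sketch of that member layout, in plain Python rather than against a live Redis server. It relies on the fact that Redis orders members with equal scores by comparing their bytes; the helper names (`make_member`, `id_of`, `ids_sorted_by_name`) are made up for illustration. Splitting on the last ':' keeps names that themselves contain a colon intact.

```python
def make_member(name, record_id):
    # Member of the form <name>:<id>; every member would be stored with score 0.
    return f"{name}:{record_id}"


def id_of(member):
    # Split on the LAST ':' so a name containing ':' does not break parsing.
    return member.rsplit(":", 1)[1]


def ids_sorted_by_name(names_by_id):
    # Encode to bytes and sort, mimicking how Redis orders equal-score members.
    members = [
        make_member(name, record_id).encode("utf-8")
        for record_id, name in names_by_id.items()
    ]
    members.sort()
    return [id_of(m.decode("utf-8")) for m in members]


if __name__ == "__main__":
    print(ids_sorted_by_name({"17": "carol", "4": "alice", "23": "bob"}))
```

One caveat of this layout: a name containing a character that sorts below ':' (a space, for example) can end up ahead of a shorter name that is its prefix, so the order is that of the combined member string, not strictly of the name alone.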

Hibernate Search - possible to get new Lucene query after facets applied?

A Lucene Query is generated as so:
Query luceneQuery = builder.all().createQuery();
Then facets are applied.
I'm not sure whether, when facets are applied, the luceneQuery is ANDed and ORed with other Queries, resulting in a new Lucene Query. Alternatively, perhaps a bunch of BitSets are applied to the original Query to refine the results. (I don't know).
If a new query is generated I'd like to retrieve it. If not, I need a rethink. That's the crux of the question.
Why:
I'm applying a faceted search on a field with multiple possible values.
E.g. TMovie.class many-to-many TTag.class (multiple-value-facet)
I'm filtering on TMovie where TTag is some value.
Anyway, the filtering works but there is a known problem whereby the Facet-counts returned are incorrect.
Detailed here: Add faceting over multivalued to application using Hibernate Search and https://forum.hibernate.org/viewtopic.php?f=9&t=1010472
I'm using this solution:
http://sujitpal.blogspot.ie/2007/04/lucene-search-within-search-with.html (see comment on new API under article)
The BitSet solution (in this example at least) generates counts based on the original Lucene Query. This works perfectly. However.....
If alternate (different, not TTags) facets are applied to the original query some complications arise.
The Bitset solution calculates on the original Lucene query. It does not calculate on the lucene query now reduced by the application of alternate Facets (a different FacetSelection) (or even TTag Facets themselves for that matter). I.e. the count calculations are irrespective of any other FacetSelection Facets applied.
So...
A. can I get the new Lucene query after facets are applied? The BitSet solution applied to this would be correct.
B. Any other alternative suggestions?
Thanks so much.. All comments welcome.
John
Regarding your first question, applying a facet does not modify the original query; it uses a custom Collector called FacetCollector - see https://github.com/hibernate/hibernate-search/blob/master/engine/src/main/java/org/hibernate/search/query/collector/impl/FacetCollector.java. Under the hood the collector uses a Lucene FieldCache for doing the facet count. That is also the root of the limitation for multi-value faceting: FieldCache does not support multiple values per field.
Anyway, no additional queries are applied during faceting and the original query is unmodified. The benefit of course is performance. The solution you are pointing to probably works as well, but relies on running multiple queries. However, it might be a valid workaround for your use case.

Efficient method for storing simple regular expressions

I have a list of simple regular expressions:
ABC.+DE.+FHIJ.+
.+XY.+Z.+AB
.+KLM.+NO.+J.+
QRST.+UV
They all have alternating patterns of .+ and some text (which I will call "words") repeated some number of times. A pattern may or may not begin or end in .+. These regular expressions are all mutually exclusive. When another regex is added I want to remove any other matching regular expressions, and add one regular expression that combines the added one with all of its matches. For example, adding:
.+J.+
would match,
ABC.+DE.+FHIJ.+
.+KLM.+NO.+J.+
and thus, these would be removed and replaced with the added regular expression, resulting in:
.+J.+
.+XY.+Z.+AB
QRST.+UV
I need to store these patterns either in some data structure or (preferably) in a database in an efficient manner. I first tried a tree of dictionaries, only to realize that when a regex starts with a .+ it has to search the entire tree for the next word, which is order O(2^n). Unfortunately, (unless I am mistaken) it appears that neither SQLite (which I am using) nor any other relational database that I have used supports "regular expression" as a data type. My question is, is there an efficient method for storing and retrieving such simple regular expressions? If there is no canned method, is there some data structure that would be relatively efficient (say, at worst amortized polynomial time)?
Could you please explain what you are using these regular expressions for as that would make it easier to provide a better answer? In particular when I see the way you are splitting your regular expressions I'm wondering if a Trie or a Directed acyclic word graph would be a better fit.
From there you may find your answer is as simple as providing better normalization or finding an alternative NoSQL database made specifically for your problem area.
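For reference, the update rule described in the question can be sketched as a plain linear scan. This is only a baseline, not the efficient structure being asked for, and it assumes that "matches" means the new regex fully matches the stored pattern's own text, which is consistent with the .+J.+ example. The function name `add_pattern` is made up for illustration.

```python
import re


def add_pattern(stored, new_pattern):
    # Assumption: a stored pattern is subsumed when the new regex fully
    # matches the stored pattern's text (e.g. ".+J.+" vs "ABC.+DE.+FHIJ.+").
    compiled = re.compile(new_pattern)
    survivors = [p for p in stored if compiled.fullmatch(p) is None]
    # The new pattern replaces everything it subsumed.
    return [new_pattern] + survivors


if __name__ == "__main__":
    stored = ["ABC.+DE.+FHIJ.+", ".+XY.+Z.+AB", ".+KLM.+NO.+J.+", "QRST.+UV"]
    for pattern in add_pattern(stored, ".+J.+"):
        print(pattern)
```

Each insertion costs one regex match per stored pattern, so this stays linear in the number of patterns; a trie or word graph would only pay off if that scan becomes the bottleneck.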

Testing phrases to see if they match each other

I have a large number of phrases (~ several million), each less than six or seven words and the large majority less than five, and I would like to see if they "phrase match" each other. This is a search engine marketing term - essentially, A phrase matches B if A is contained in B. Right now, they are stored in a db (postgres), and I am performing a join on regexes (see this question). It is running impossibly slowly even after trying all basic optimization tricks (indexing, etc) and trying the suggestions provided.
Is there an easier way to do this? I am not averse to a non-DB solution. Is there any reason to think that regexes are overkill and are taking way longer than a different solution?
An ideal algorithm for doing substring matching is Aho-Corasick.
Although you will have to read the data out of the database to use it, it is tremendously fast, when compared to more naive methods.
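As a rough illustration, here is a compact Aho-Corasick sketch in Python. It treats whole words as the symbols, on the assumption that "phrase match" means word-level containment rather than raw character substrings; the function names are made up for illustration, and a production version would more likely use an existing library.

```python
from collections import deque


def build_automaton(patterns):
    # goto[n] maps a symbol to the next node; out[n] lists pattern indices
    # that end at node n (directly or via failure links). Node 0 is the root.
    goto, fail, out = [{}], [0], [[]]
    for idx, pattern in enumerate(patterns):
        node = 0
        for sym in pattern:
            if sym not in goto[node]:
                goto[node][sym] = len(goto)
                goto.append({})
                fail.append(0)
                out.append([])
            node = goto[node][sym]
        out[node].append(idx)
    # Breadth-first pass to fill failure links; depth-1 nodes fail to the root.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for sym, child in goto[node].items():
            f = fail[node]
            while f and sym not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(sym, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def contained_phrases(text, phrases):
    # Returns the phrases that occur as a run of consecutive words in text.
    patterns = [tuple(p.lower().split()) for p in phrases]
    goto, fail, out = build_automaton(patterns)
    node, found = 0, set()
    for word in text.lower().split():
        while node and word not in goto[node]:
            node = fail[node]
        node = goto[node].get(word, 0)
        found.update(out[node])
    return sorted(phrases[i] for i in found)


if __name__ == "__main__":
    phrases = ["red shoes", "shoes", "buy red shoes online", "blue"]
    print(contained_phrases("buy red shoes online", phrases))
```

The automaton is built once over all phrases, after which each phrase is scanned in a single pass, which is where the speed-up over pairwise regex joins comes from.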
See here for a related question on substring matching:
And here for an Aho-Corasick implementation in Java:
It would be great to get a little more context as to why you need to see which phrases are subsets of others: for example, it seems strange that the DB would be built in such a way anyway: you're having to do the work now because the DB is not in an appropriate format, so it makes sense that you should 'fix' the DB or the way in which it is built, instead.
It depends massively on what you are doing with the data and why, but I have found it useful in the past to break things down into single words and pairs of words, then link resources or phrases to those singles/pairs.
For example to implement a search I have done:
Source text: Testing phrases to see
Entries:
testing
testing phrases
phrases
phrases to
to
to see
see
To see if another phrase was similar (granted, not contained within) you would break down the other phrase in the same way and count the number of phrases common between them.
It has the nice side effect of still matching if you were to use (for example) "see phases to testing", because the individual words would match. Because the order is different, though, the pairs wouldn't, so it takes phrases (consecutive words) into account at the same time: the number of matches wouldn't be as high, which makes it good for use as a 'score' in matching.
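A small sketch of that breakdown, assuming lowercase whitespace-separated words; the function names `entries` and `overlap_score` are made up for illustration.

```python
def entries(phrase):
    # Single words plus pairs of consecutive words, as in the listing above.
    words = phrase.lower().split()
    pairs = [" ".join(pair) for pair in zip(words, words[1:])]
    return set(words) | set(pairs)


def overlap_score(phrase_a, phrase_b):
    # Number of entries the two phrases have in common.
    return len(entries(phrase_a) & entries(phrase_b))


if __name__ == "__main__":
    source = "Testing phrases to see"
    print(sorted(entries(source)))
    # Same words in a different order: the words match, the pairs do not.
    print(overlap_score(source, "see to phrases testing"))
```

In a database, each entry would become a row linking back to its phrase, so the overlap count turns into a join and a GROUP BY instead of a regex comparison.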
As I say that -kind- of thing has worked for me, but it would be great to hear some more background/context, so we can see if we can find a better solution.
When you have the 'cleaned column' from MaasSQL's previous answer, you could, depending on the way "phrase match" works exactly (I don't know), sort this column based on the length of the containing string.
Then make sure you run the comparison query in a converging manner in a procedure instead of a flat query, by stepping through your table (with a cursor) and eliminating candidates for comparison through WHERE statements and through deleting candidates that have already been tested (completely). You may need a temporary table to do this.
What do I mean by the 'WHERE' statements above? Well, if the comparison value is in a column sorted on length, you'll never have to test whether a longer string matches inside a shorter string.
And with deleting candidates: starting with the shortest strings, once you've tested all strings of a certain length, you can remove them from the comparison table, as any later test will never get a match on them.
Of course, this requires a bit more programming than just one SQL statement. And is dependent on the way "phrase match" works exactly.
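The same pruning idea can be sketched in memory, outside the database. This assumes plain character-level containment and a made-up function name, `containment_pairs`; in SQL the inner loop would be the cursor step with a length condition in the WHERE clause.

```python
def containment_pairs(phrases):
    # Sort by length so each phrase is only tested against longer ones.
    ordered = sorted(set(phrases), key=len)
    pairs = []
    for i, short in enumerate(ordered):
        # Everything before index i is shorter (or equal) and already done,
        # which mirrors deleting tested candidates from the comparison table.
        for longer in ordered[i + 1:]:
            # Distinct strings of equal length can never contain each other.
            if len(longer) > len(short) and short in longer:
                pairs.append((short, longer))
    return pairs


if __name__ == "__main__":
    for pair in containment_pairs(["red shoes", "shoes", "buy red shoes"]):
        print(pair)
```

This still does a quadratic number of comparisons in the worst case; the ordering only removes the half that could never match.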
DTS or SSIS may be your friend here as well.