Building dictionary of words from large text - lucene

I have a text file containing posts in English/Italian. I would like to read the posts into a data matrix so that each row represents a post and each column a word. The cells in the matrix are the counts of how many times each word appears in the post. The dictionary should consist of all the words in the whole file or a non exhaustive English/Italian dictionary.
I know this is a common essential preprocessing step for NLP. And I know it's pretty trivial to code it, sill I'd like to use some NLP domain specific tool so I get stop-words trimmed etc..
Does anyone know of a tool\project that can perform this task?
Someone mentioned apache lucene, do you know if lucene index can be serialized to a data-structure similar to my needs?

Maybe you want to look at GATE. It is an infrastructure for text-mining and processing. This is what GATE does (I got this from the site):
open source software capable of solving almost any text processing problem
a mature and extensive community of developers, users, educators, students and scientists
a defined and repeatable process for creating robust and maintainable text processing workflows
in active use for all sorts of language processing tasks and applications, including: voice of the customer; cancer research; drug research; decision support; recruitment; web mining; information extraction; semantic annotation
the result of a €multi-million R&D programme running since 1995, funded by commercial users, the EC, BBSRC, EPSRC, AHRC, JISC, etc.
used by corporations, SMEs, research labs and Universities worldwide
the Eclipse of Natural Language Engineering, the Lucene of Information Extraction, the ISO 9001 of Text Mining

What you want is so simple that, in most languages, I would suggest you roll your own solution using an array of hash tables that map from strings to integers. For example, in C#:
foreach (var post in posts)
{
var row = new Dictionary<string, int>();
foreach (var word in GetWordsFromPost(post))
{
IncrementContentOfRow(row, word);
}
}
// ...
private void IncrementContentOfRow(IDictionary<string, int> row, string word)
{
int oldValue;
if (!row.TryGet(word, out oldValue))
{
oldValue = 0;
}
row[word] = oldValue + 1;
}

You can check out:
bow - a veteran C library for text classification; I know it stores the matrix, it may require some hacking to get it.
Weka - a Java machine learning framework that can handle text and build the matrix
Sujit Pal's blog post on building the term-document matrix from scratch
If you insist on using Lucene, you should create an index using term vectors, and use something like a loop over getTermFreqVector() to get the matrix.

Thanks to #Mikos' comment, I googled the term "term-document matrix' and found TMG (Text to Matrix Generator).
I found it suitable for my needs.

Related

wxDataViewListCtrl is slow with 100k items from another thread

The requirements:
100k lines
One of the columns is not text - its custom painted with wxDC*.
The items addition is coming from another thread using wxThreadEvent.
Up until now I used wxDataViewListCtrl, but it takes too long to AppendItem 100 thousand time.
wxListCtrl (in virtual mode) does not have the ability to use wxDC* - please correct me if I am wrong.
The only thing I can think of is using wxDataViewCtrl + wxDataViewModel. But I can't understand how to add items.
I looked at the samples (https://github.com/wxWidgets/wxWidgets/tree/WX_3_0_BRANCH/samples/dataview), too complex for me.
I cant understand them.
I looked at the wiki (https://wiki.wxwidgets.org/WxDataViewCtrl), also too complex for me.
Can somebody please provide a very simple example of a wxDataViewCtrl + wxDataViewModel with one string column and one wxDC* column.
Thanks in advance.
P.S.
Per #HajoKirchhoff's request in the comments, I am posting some code:
// This is called from Rust 100k times.
extern "C" void Add_line_to_data_view_list_control(unsigned int index,
const char* date,
const char* sha1) {
wxThreadEvent evt(wxEVT_THREAD, 44);
evt.SetPayload(ViewListLine{index, std::string(date), std::string(sha1)});
wxQueueEvent(g_this, evt.Clone());
}
void TreeWidget::Add_line_to_data_view_list_control(wxThreadEvent& event) {
ViewListLine view_list_line = event.GetPayload<ViewListLine>();
wxVector<wxVariant> item;
item.push_back(wxVariant(static_cast<int>(view_list_line.index)));
item.push_back(wxVariant(view_list_line.date));
item.push_back(wxVariant(view_list_line.sha1));
AppendItem(item);
}
Appending 100k items to a control will always be slow. That's because it requires moving 100k items from your storage to the controls storage. A much better way for this amount of data is to have a "virtual" list control or wxGrid. In both cases the data is not actually transferred to the control. Instead when painting occurs, a callback function will transfer only the data required to paint. So for a 100k list you will only have "activity" for the 20-30 lines that are visible.
With wxListCtrl see https://docs.wxwidgets.org/3.0/classwx_list_ctrl.html, specify the wxLC_VIRTUAL flag, call SetItemCount and then provide/override
OnGetItemText
OnGetItemImage
OnGetItemColumnImage
Downside: You can only draw items contained in a wxImageList, since the OnGetItemImage return indizes into the list. So you cannot draw arbitrary items using a wxDC. Since the human eye will be overwhelmed with 100k different images anyway, this is usually acceptable. You may have to provide 20/30 different images beforehand, but you'll have a fast, flexible list.
That said, it is possible to override the OnPaint function and use that wxDC to draw anything in the list. But that'll get difficult pretty soon.
So an alternative would be to use wxGrid, create a wxGridTableBase derived class that acts as a bridge between the grid and your actual 100k data and create wxGridCellRenderer derived classes to render the actual data onscreen. The wxGridCellRenderer class will get a wxDC. This will give you more flexibility but is also much more complex than using a virtual wxListCtrl.
The full example of doing what you want will inevitably be relatively complex. But if you decompose in simple parts, it's really not that difficult: you do need to define a custom model, but if your list is flat, this basically just means returning the value of the item at the N-th position, as you can trivially implement all model methods related to the tree structure. An example of such a model, although with multiple columns can be found in the sample, so you just need to simplify it to a one (or two) column version.
Next, you are going to need a custom renderer too, but this is not difficult neither and, again, there is an example of this in the sample too.
If you have any concrete questions, you should ask them, but it's going to be difficult to do much better than what the sample shows and it does already show exactly what you want to do.
Thank you every one who replied!
#Vz.'s words "If you have any concrete questions, you should ask them" got me thinking and I took another look at the samples of wxWidgets. The full code can be found here. Look at the following classes:
TreeDataViewModel
TreeWidget
TreeCustomRenderer

Writing a computer program that will analyse the quality of another computer program?

I'm interested in knowing the possibilities of this. I'm working on a project that validates the skills of a software engineer, currently we validate skills based on code reviews by credentialed developers.
I know the answer if far more completed that the question, I couldn't imagine how complex the program would have to be able to analyse complex code but I am starting with basic programming interview questions.
For example, the classic FizzBuzz question:
Write a program that prints the numbers from 1 to 20. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”.
and below is the solution in python:
for num in range(1,21):
string = ""
if num % 3 == 0:
string = string + "Fizz"
if num % 5 == 0:
string = string + "Buzz"
if num % 5 != 0 and num % 3 != 0:
string = string + str(num)
print(string)
Question is, can we programatically analyse the validity of this solution?
I would like to know if anyone has attempted this, and if there are current implementations I can take a look at. Also if anyone has used z3, and if it is something I can use to solve this problem.
As Vilx- mentioned, correctness of programs (including whether or not they terminate) is in general known to be undecidable. However, tools such as Z3 show that relevant concrete cases can still be reasoned about, despite the general undecidability of the problem.
Static analysers typically look for "simple" problems (e.g. null dereferences, out-of-bounds accesses, numerical overflows), but are comparably fast and require little user guidance (think of guidance in the spirit of adding type annotations to your code).
A non-exhaustive (and biased) list of keywords to search for: "static analysers", "abstract interpretation"; "facebook infer", "airbus absint", "juliasoft".
Verifiers attempt to prove much richer properties, in particular functional correctness, e.g. "does this sort-implementation really sort my array (and not do anything else, e.g. deallocate some global memory or update an element reachable from the array)?" or "does that crypto-implementation really implement the crypto protocol it promises to implement?". This is a much harder task and tools from that line of research are typically rather slow, require expert users with a background in formal verification and significant user guidance.
A non-exhaustive (and biased) list of keywords to search for: "verification", "hoare logic", "separation logic"; "eth viper", "microsoft dafny", "kuleuven verifast", "microsoft f*".
Other formal methods exist, e.g. refinement (or correct-by-construction), but with even less tool support and, as far as I know, industry acceptance.
Let's put it this way: it's been mathematically proven that you CANNOT determine if a program will ever terminate. So if you want a mathematically perfect answer of if the target program is correct, you're doomed.
That said, you can still do unit tests and "linting" which will give you plenty of intetesting insights.
But for simple pieces of code like the FizzBuzz, I think that eyeballing by an experienced dev will probably bring the best results.

Where does the KeY verification tool shine?

What are some code examples demonstrating KeY’s strength?
Details
With so many Formal Method tools available, I was wondering where KeY is better than its competition, and how? Some readable code examples would be quite helpful for comparison and understanding.
Updates
Searching through the KeY website, I found code examples from the book — is there a suitable code example in there somewhere?
Furthermore, I found a paper about the bug that KeY found in Java 8’s mergeCollapse in TimSort. What is a minimal code from TimSort that demonstrates KeY’s strength? I do not understand, however, why model checking supposedly cannot find the bug — a bit array with 64 elements should not be too large to handle. Are other deductive verification tools just as capable of finding the bug?
Is there an established verification competition with suitable code examples?
This is a very hard question, which is why it hasn't yet been answered after having already been asked more than one year ago (and although we from the KeY community are well aware of it...).
The Power of Interaction
First, I'd like to point out that KeY is basically the only tool out there allowing for interactive proofs of Java programs. Although many proofs work automatically and we have quite powerful automatic strategies at hand, sometimes interaction is required to understand why a proof fails (too weak or even wrong specifications, wrong code or "just" a prover incapacity) and to add suitable corrections or strengthenings.
Feedback from Proof Inspection
Especially in the case of a prover incapacity (specification and program are OK, but the problem is too hard for the prover to succeed automatically), interaction is a powerful feature. Many program provers (like OpenJML, Dafny, Frama-C etc.) rely on SMT solvers in the backend which they feed with many more or less small verification conditions. The verification status for these conditions is then reported back to the user, basically as pass or fail -- or timeout. When an assertion failed, a user can change the program or refine the specifications, but cannot inspect the state of the proof to deduct information about why something went wrong; this style is sometimes called "auto-active" as opposed to interactive. While this can be quite convenient in many cases (especially when proofs pass, since the SMT solvers can be really quick in proving something), it can be hard to mine SMT solver output for information. Not even the SMT solvers themselves know why something went wrong (although they can produce a counterexample), as they just are fed a set of formulas for which they attempt to find a contradiction.
TimSort: A Complicated Algorithmic Problem
For the TimSort proof which you mentioned, we had to use a lot of interaction to make them pass. Take, for instance, the mergeHi method of the sorting algorithm which has been proven by one of the most experienced KeY power users known to me. In this proof of 460K proof nodes, 3K user interactions were necessary, consisting of quite a lot of simple ones like the hiding of distracting formulas, but also of 478 quantifier instantiations and about 300 cuts (on-line lemma introduction). The code of that method features many difficult Java features like nested loops with labeled breaks, integer overflows, bit arithmetic and so on; especially, there are a lot of potential exceptions and other reasons for branching in the proof tree (which is why in addition, also five manual state merging rule applications have been used in the proof). The workflow for proving this method basically was to give the strategies a try for some time, check the proof state afterward, prune back the proof and introduce a helpful lemma to reduce the overall proof work and to start again; occasionally, quantifiers were instantiated manually if the strategies failed to find the right instantiation directly by themselves, and proof tree branches were merged to tackle state explosion. I would just claim here that proving this code is (at least currently) not possible with auto-active tools, where you cannot guide the prover in that way, and also cannot obtain the right feedback for knowing how to guide it.
Strength of KeY
Concluding, I'd say that KeY's strong in proving hard algorithmic problems (like sorting etc.) where you have complicated quantified invariants and integer arithmetic with overflows, and where you need to find quantifier instantiations and small lemmas on the fly by inspecting and interacting with the proof state. The KeY approach of semi-interactive verification also excels in general for cases where SMT solvers time out, such that a user cannot tell whether something is wrong or an additional lemma is required.
KeY can of course also proof "simple" problems, however there you need to take care that your program does not contain an unsupported Java feature like floating point numbers or multithreading; also, library methods can be quite a problem if they're not yet specified in JML (but this problem applies to other approaches as well).
Ongoing Developments
As a side remark, I also would like to point out that KeY is now more and more being transformed to a platform for static analysis of different kinds of program properties (not only functional correctness of Java programs). On the one hand, we have developed tools such as the Symbolic Execution Debugger which can be used also by non-experts to examine the behavior of a sequential Java program. On the other hand, we are currently busy in refactoring the architecture of the system for making it possible to add frontends for languages different than Java (in our internal project "KeY-RED"); furthermore, there are ongoing efforts to modernize the Java frontend such that also newer language features like Lambdas and so on are supported. We are also looking into relational properties like compiler correctness. And while we already support the integration of third-party SMT solvers, our integrated logic core will still be there to support understanding proof situations and manual interactions for cases where SMT and automation fails.
TimSort Code Example
Since you asked for a code example... I cannot right know think of "the" code example showing KeY's strength, but maybe for giving you a flavor of the complexity of mergeHi in the TimSort algorithm, here a shortened excerpt with some comments (the full method has about 100 lines of code):
private void mergeHi(int base1, int len1, int base2, int len2) {
// ...
T[] tmp = ensureCapacity(len2); // Method call by contract
System.arraycopy(a, base2, tmp, 0, len2); // Manually specified library method
// ...
a[dest--] = a[cursor1--]; // potential overflow, NullPointerException, ArrayIndexOutOfBoundsException
if (--len1 == 0) {
System.arraycopy(tmp, 0, a, dest - (len2 - 1), len2);
return; // Proof branching
}
if (len2 == 1) {
// ...
return; // Proof branching
}
// ...
outer: // Loop labels...
while (true) {
// ...
do { // Nested loop
if (c.compare(tmp[cursor2], a[cursor1]) < 0) {
// ...
if (--len1 == 0)
break outer; // Labeled break
} else {
// ...
if (--len2 == 1)
break outer; // Labeled break
}
} while ((count1 | count2) < minGallop); // Bit arithmetic
do { // 2nd nested loop
// That's one complex statement below...
count1 = len1 - gallopRight(tmp[cursor2], a, base1, len1, len1 - 1, c);
if (count1 != 0) {
// ...
if (len1 == 0)
break outer;
}
// ...
if (--len2 == 1)
break outer;
count2 = len2 - gallopLeft(a[cursor1], tmp, 0, len2, len2 - 1, c);
if (count2 != 0) {
// ...
if (len2 <= 1)
break outer;
}
a[dest--] = a[cursor1--];
if (--len1 == 0)
break outer;
// ...
} while (count1 >= MIN_GALLOP | count2 >= MIN_GALLOP);
// ...
} // End of "outer" loop
this.minGallop = minGallop < 1 ? 1 : minGallop; // Write back to field
if (len2 == 1) {
// ...
} else if (len2 == 0) {
throw new IllegalArgumentException(
"Comparison method violates its general contract!");
} else {
System.arraycopy(tmp, 0, a, dest - (len2 - 1), len2);
}
}
Verification Competition
VerifyThis is an established competition for logic-based verification tools which will have its 7th iteration in 2019. The concrete challenges for past events can be downloaded from the "archive" section of the website I linked. Two KeY teams participated there in 2017. The overall winner that year was Why3. An interesting observation is that there was one problem, Pair Insertion Sort, which came as a simplified and as an optimized Java version, for which no team succeeded in verifying the real-world optimized version on site. However, a KeY team finished that proof in the weeks after the event. I think that highlights my point: KeY proofs of difficult algorithmic problems take their time and require expertise, but they're likely to succeed due to the combined power of strategies and interaction.

Evaluate Formal tools

What are all the factors one should consider in order to compare 3 formal verification tools?
Eg: Jaspergold, Onespin, Incisive.
From my little research, Jaspergold comes on top. But i want to do it myself on a project.
I have noted down some points such as
1.Supported languages(vhdl, sv, verilog, sva, psl,etc)
2.GUI
3.Capability(how much big design can they handle)
4.Number of Evaluation cycles
5.Performance(How fast they find proof or counter example)
With what other features can i extend this list?
Thanks!

text based RPG command interpreter

I was just playing a text based RPG and I got to wondering, how exactly were the command interpreters implemented and is there a better way to implement something similar now? It would be easy enough to make a ton of if statements, but that seems cumbersome especially considering for the most part pick up the gold is the same as pick up gold which has the same effect as take gold. I'm sure this is a really in depth question, I'd just like to know the general idea of how interpreters like that were implemented. Or if there's an open source game with a decent and representative interpreter, that would be perfect.
Answers can be language independent, but try to keep it in something reasonable, not prolog or golfscript or something. I'm not sure exactly what to tag this as.
The usual name for this sort of game is text adventure or interactive fiction, if it is single player, or MUD if it is multiplayer.
There are several special purpose programming languages for writing interactive fiction, such as Inform 6, Inform 7 (an entirely new language that compiles down to Inform 6), TADS, Hugo, and more.
Here's an example of a game in Inform 7, that has a room, an object in the room, and you can pick up, drop, and otherwise manipulate the object:
"Example Game" by Brian Campbell
The Alley is a room. "You are in a small, dark alley." A bronze key is in the
Alley. "A bronze key lies on the ground."
Produces when played:
Example Game
An Interactive Fiction by Brian Campbell
Release 1 / Serial number 100823 / Inform 7 build 6E59 (I6/v6.31 lib 6/12N) SD
Alley
You are in a small, dark alley.
A bronze key lies on the ground.
>take key
Taken.
>drop key
Dropped.
>take the key
Taken.
>drop key
Dropped.
>pick up the bronze key
Taken.
>put down the bronze key
Dropped.
>
For the multiplayer games, which tend to have simpler parsers than interactive fiction engines, you can check out a list of MUD servers.
If you would like to write your own parser, you can start by simply checking your input against regular expressions. For instance, in Ruby (as you didn't specify a language):
case input
when /(?:take|pick +up)(?: +(?:the|a))? +(.*)/
take_command(lookup_name($3))
when /(?:drop|put +down)(?: +(?:the|a))? +(.*)/
drop_command(lookup_name($3))
end
You may discover that this becomes cumbersome after a while. You could simplify it somewhat using some shorthands to avoid repetition:
OPT_ART = "(?: +(?:the|a))?" # shorthand for an optional article
case input
when /(?:take|pick +up)#{OPT_ART} +(.*)/
take_command(lookup_name($3))
when /(?:drop|put +down)#{OPT_ART} +(.*)/
drop_command(lookup_name($3))
end
This may start to get slow if you have a lot of commands, and it checks the input against each command in sequence. You also may find that it still becomes hard to read, and involves some repetition that is difficult to simply extract into shorthands.
At that point, you might want to look into lexers and parsers, a topic much too big for me to do justice to in a reply here. There are many lexer and parser generators, that given a description of a language, will produce a lexer or parser that is capable of parsing that language; check out the linked articles for some starting points.
As an example of how a parser generator would work, I'll give an example in Treetop, a Ruby based parser generator:
grammar Adventure
rule command
take / drop
end
rule take
('take' / 'pick' space 'up') article? space object {
def command
:take
end
}
end
rule drop
('drop' / 'put' space 'down') article? space object {
def command
:drop
end
}
end
rule space
' '+
end
rule article
space ('a' / 'the')
end
rule object
[a-zA-Z0-9 ]+
end
end
Which can be used as follows:
require 'treetop'
Treetop.load 'adventure.tt'
parser = AdventureParser.new
tree = parser.parse('take the key')
tree.command # => :take
tree.object.text_value # => "key"
If by 'text based RPG' you are referring to Interactive Fiction, there are specific programming languages for this. My favorite (the only one I know ;P) is Inform: http://en.wikipedia.org/wiki/Inform
The rec.arts.int-fiction FAQ has further information: http://www.plover.net/~textfire/raiffaq/FAQ.htm