Extracting Fields Value using Lucene

I would like to parse a single document (not multiple documents) containing textual data and extract some relevant information based on my query.
For example, if I have the following text:
This is a sample document.
Name: Te
Age: 25
Email: te#gmail.com
Some text in the end of the document
I would like to extract the fields (Name, Age, Email) with their corresponding values.
Most of the examples I found are about searching for documents that match a query. I would appreciate it if someone could tell me which Analyzer or Query classes to look at in the Lucene library, or point me to any materials to read.

This should get you started. With a regular expression in Java (java.util.regex.Pattern and java.util.regex.Matcher), where the document content has been assigned to the variable text:
String expr = "Name:\\s(\\w+)\\sAge:\\s+(\\d+)\\s+Email:\\s+([a-z0-9.#]+)\\s+";
Pattern r = Pattern.compile(expr, Pattern.CASE_INSENSITIVE);
Matcher m = r.matcher(text);
if (m.find()) {
    System.out.println("Name: " + m.group(1));
    System.out.println("Age: " + m.group(2));
    System.out.println("Email: " + m.group(3));
} else {
    System.out.println("Match not found");
}
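
If the fields may appear in any order, a line-oriented variant of the same idea may be easier to maintain. The following is only a sketch: the class name FieldExtractor and the method extract are made up for illustration, and it assumes each field sits on its own line in the form "Key: value".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: collects "Key: value" lines into a map.
class FieldExtractor {
    // MULTILINE makes ^ and $ match at line boundaries, so each field is matched per line.
    private static final Pattern FIELD = Pattern.compile(
            "^(Name|Age|Email):\\s*(.+?)\\s*$",
            Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);

    static Map<String, String> extract(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            // group(1) is the field name as written in the text, group(2) its value
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = "This is a sample document.\n"
                + "Name: Te\n"
                + "Age: 25\n"
                + "Email: te#gmail.com\n"
                + "Some text in the end of the document";
        System.out.println(extract(text));
    }
}
```

Lines that do not start with one of the listed field names are simply ignored, so the surrounding free text does not need to be described in the pattern.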

Related

How to add prefix and suffix when indexing

How is it possible to add a suffix and prefix to an entity in Hibernate Search during indexing?
I need this to perform exact search.
E.g. if one is searching for "this is a test", then the following entries are found:
* this is a test
* this is a test and ...
So I found the idea to add a prefix and suffix to the whole value during indexing, e.g.:
_____ this is a test _____
and if one is searching for "this is a test" with the checkbox for exact search enabled, I'll change the search string to:
"_____ this is a test _____"
I created a FilterFactory for this, but with this one it adds the prefix and suffix to every term:
public boolean incrementToken() throws IOException {
    if (!this.input.incrementToken()) {
        return false;
    } else {
        String input = termAtt.toString();
        // add "_____" at the beginning and ending of the phrase for exact match searching
        input = "_____ " + input + " _____";
        char[] newBuffer = input.toLowerCase().toCharArray();
        termAtt.setEmpty();
        termAtt.copyBuffer(newBuffer, 0, newBuffer.length);
        return true;
    }
}

This is not how you should do it.
What you need is that the string you index is considered a unique token. This way, you will only have results having the exact token.
To do so you need to define an analyzer based on the KeywordTokenizer.
@Entity
@AnalyzerDefs({
    @AnalyzerDef(name = "keyword",
        tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class)
    )
})
@Indexed
public class YourEntity {
    @Fields({
        @Field, // your default field with default analyzer if you need it
        @Field(name = "propertyKeyword", analyzer = @Analyzer(definition = "keyword"))
    })
    private String property;
}
Then you should search on the propertyKeyword field. Note that the analyzer definition is global so you only need to declare the definition for one entity for it to be available for all your entities.
Take a look at the documentation about analyzers: http://docs.jboss.org/hibernate/stable/search/reference/en-US/html_single/#example-analyzer-def .
It's important to understand what an analyzer is for because usually the default one is not exactly the one you are looking for.
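
To see why a single-token field gives exact matches, here is a plain Java illustration of the idea. It does not call Lucene or Hibernate Search at all: the class ExactMatchSketch and its methods are hypothetical stand-ins that only mimic "one token per word" versus "the whole value as one token".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; real analysis is done by the configured analyzer.
class ExactMatchSketch {
    // Roughly what a word-splitting analyzer produces: one token per word.
    static List<String> wordTokens(String value) {
        return Arrays.asList(value.trim().split("\\s+"));
    }

    // Phrase search on word tokens: the query words only need to appear in sequence.
    static boolean phraseMatches(String indexedValue, String query) {
        return Collections.indexOfSubList(wordTokens(indexedValue), wordTokens(query)) >= 0;
    }

    // Search on a keyword-style field: the whole value is a single token,
    // so only an identical value matches.
    static boolean keywordMatches(String indexedValue, String query) {
        return indexedValue.equals(query);
    }

    public static void main(String[] args) {
        String query = "this is a test";
        for (String indexed : List.of("this is a test", "this is a test and more")) {
            System.out.println(indexed + " -> phrase: " + phraseMatches(indexed, query)
                    + ", keyword: " + keywordMatches(indexed, query));
        }
    }
}
```

The phrase check accepts both entries, while the keyword check accepts only the identical one, which is the behaviour the propertyKeyword field is meant to provide.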

Twitter4j gather only a certain country

For a project I am currently working on, I need to gather tweets from a stream for one country only.
Although the Twitter4j streaming API allows filtering by language, the results aren't accurate enough. So I thought I would put a second filter on top of the language filter by checking whether the country attribute of the tweet is filled. This works fine when I only check whether there's a value at all:
    if (status.getPlace().getCountry() != null) {
        System.out.println("User: " + status.getUser().getName());
        System.out.println("Text: : " + status.getText());
        System.out.println("Country: " + status.getPlace().getCountry());
        System.out.println("Language: " + status.getLang());
    }
}
TwitterStream ts = new TwitterStreamFactory(cb.build()).getInstance();
ts.addListener(listener);
FilterQuery filter = new FilterQuery();
String[] language = {"country"};
String[] keywords = {"some keywords"};
filter.track(keywords);
filter.language(language);
ts.filter(filter);
But if I check for a specific country, e.g. Germany, I don't receive any tweets:
    if (status.getPlace().getCountry() != "germany") {
        System.out.println("User: " + status.getUser().getName());
        System.out.println("Text: : " + status.getText());
        System.out.println("Country: " + status.getPlace().getCountry());
        System.out.println("Language: " + status.getLang());
    }
}
It would be great if there's someone who can help me with this.

filter.track(keywords);
filter.language(language);
Keep in mind that the above code means track(keywords) OR language(language).
This is not logical AND.
If you want tweets from only one country, and with certain keywords, then remove filter.language(language) and check the country after you get status.
if(status.getPlace().getCountry().equalsIgnoreCase("germany")) {}
// String comparison
You won't get a tweet from Germany if nobody from Germany tweets with the filters you have specified.
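
The country check can also be made null-safe, since a tweet may carry no place at all. The sketch below uses small stand-in classes (Place and Tweet are hypothetical and only mimic the getPlace().getCountry() chain used above), so it runs without Twitter4j; in a real listener the same check would be applied to the status object.

```java
import java.util.Optional;

// Hypothetical stand-ins; they are not Twitter4j classes.
class CountryFilter {
    static final class Place {
        final String country;
        Place(String country) { this.country = country; }
        String getCountry() { return country; }
    }

    static final class Tweet {
        final String text;
        final Place place;
        Tweet(String text, Place place) { this.text = text; this.place = place; }
        Place getPlace() { return place; }
    }

    // True only if the tweet has a place, the place has a country,
    // and that country equals the wanted one (ignoring case).
    static boolean isFromCountry(Tweet tweet, String wanted) {
        return Optional.ofNullable(tweet.getPlace())
                .map(Place::getCountry)
                .map(c -> c.equalsIgnoreCase(wanted))
                .orElse(false);
    }

    public static void main(String[] args) {
        System.out.println(isFromCountry(new Tweet("hallo", new Place("Germany")), "germany"));
        System.out.println(isFromCountry(new Tweet("no place", null), "germany"));
    }
}
```

Comparing with == or != on strings compares object references rather than contents, which is why equalsIgnoreCase is used here.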

Lucene Highlighter class: highlight different words in different colors

Probably most people reading the title who know a bit about Lucene won't need much further explanation. NB I use Jython but I think most Java users will understand the Java equivalent...
It's a classic thing to want to do: you have more than one term in your search string... in Lucene terms this returns a BooleanQuery. Then you use something like this code to highlight (NB I am a Lucene newbie, this is all closely tweaked from Net examples):
yellow_highlight = SimpleHTMLFormatter( '<b style="background-color:yellow">', '</b>' )
green_highlight = SimpleHTMLFormatter( '<b style="background-color:green">', '</b>' )
...
stream = FrenchAnalyzer( Version.LUCENE_46 ).tokenStream( "both", StringReader( both ) )
scorer = QueryScorer( fr_query, "both" )
fragmenter = SimpleSpanFragmenter(scorer)
highlighter = Highlighter( yellow_highlight, scorer )
highlighter.setTextFragmenter(fragmenter)
best_fragments = highlighter.getBestTextFragments( stream, both, True, 5 )
if best_fragments:
    for best_frag in best_fragments:
        print "=== best frag: %s, type %s" % ( best_frag, type( best_frag ))
        html_text += "&bull %s<br>\n" % unicode( best_frag )
... and then the html_text is put in a JTextPane for example.
But how would you make the first word in your query highlight with a yellow background and the second word highlight with a green background? I have tried to understand the various classes in org.apache.lucene.search... to no avail. So my only way of learning was googling. I couldn't find any clues...

I asked this question four years ago... At the time I did manage to implement a solution using javax.swing.text.html.HTMLDocument. There's also the interface org.w3c.dom.html.HTMLDocument in the standard Java library. This way is hard work.
But for anyone interested there's a far simpler solution. Taking advantage of the fact that Lucene's SimpleHTMLFormatter returns about the simplest imaginable "marked up" piece of text: chosen words are highlighted with the HTML B tag. That's it. It's not even a "proper" HTML fragment, just a String with <B>s and </B>s in it.
A multi-word query generates a BooleanQuery... from which you can extract multiple TermQuerys by going booleanQuery.clauses() ... getQuery()
I'm working in Groovy. The colouring I want to apply is console codes, as per BASH (or Cygwin). Other types of colouring can be worked out on this model.
So you set up a map before to hold your "markup details":
def markupDetails = [:]
Then for each TermQuery, you call this, with the same text param each time, stipulating a different colour param for each term. NB I'm using Lucene 6.
def createHighlightAndAnalyseMarkup( TermQuery tq, String text, String colour ) {
    def termQueryScorer = new QueryScorer( tq )
    def termQueryHighlighter = new Highlighter( formatter, termQueryScorer )
    TokenStream stream = TokenSources.getTokenStream( fieldName, null, text, analyser, -1 )
    String[] frags = termQueryHighlighter.getBestFragments( stream, text, 999999 )
    // not sure under what circs you get > 1 fragment...
    assert frags.size() <= 1
    // NB you don't always get all terms in all returned LDocuments...
    if( frags.size() ) {
        String highlightedFrag = frags[ 0 ]
        Matcher boldTagMatcher = highlightedFrag =~ /<\/?B>/
        def pos = 0
        def previousEnd = 0
        while( boldTagMatcher.find()) {
            pos += boldTagMatcher.start() - previousEnd
            previousEnd = boldTagMatcher.end()
            markupDetails[ pos ] = boldTagMatcher.group() == '<B>'? colour : ConsoleColors.RESET
        }
    }
}
As I said, I wanted to colourise console output. The colour parameter in the method here is per the console colour codes as found here, for example. E.g. yellow is \033[033m. ConsoleColors.RESET is \033[0m and marks the place where each coloured bit of text stops.
... after you've finished doing this with all TermQuerys you will have a nice map telling you where individual colours begin and end. You work backwards from the end of the text so as to insert the "markup" at the right position in the String. NB here text is your original unmarked-up String:
markupDetails.sort().reverseEach{ pos, markup ->
    String firstPart = text.substring( 0, pos )
    String secondPart = text.substring( pos )
    text = firstPart + markup + secondPart
}
... at the end of which text contains your marked-up String: print to console. Lovely.
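
For Java readers, the position bookkeeping described above can be sketched without Lucene. The class MarkupMerger and its methods are hypothetical names; the input strings stand for what a highlighter with <B> and </B> tags would return for each single-term query, and the colour strings are placeholders for whatever markup you want to insert.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: merges several single-term highlights into one marked-up string.
class MarkupMerger {
    private static final Pattern BOLD_TAG = Pattern.compile("</?B>");

    // Records where each tag of one highlighted copy falls in the ORIGINAL text.
    static void collect(String highlighted, String colour, String reset,
                        Map<Integer, String> markup) {
        Matcher m = BOLD_TAG.matcher(highlighted);
        int pos = 0;          // offset in the original, tag-free text
        int previousEnd = 0;  // end of the previous tag in the highlighted text
        while (m.find()) {
            pos += m.start() - previousEnd; // plain characters since the last tag
            previousEnd = m.end();
            markup.put(pos, m.group().equals("<B>") ? colour : reset);
        }
    }

    // Inserts the markers from the end of the text backwards so earlier offsets stay valid.
    static String apply(String text, TreeMap<Integer, String> markup) {
        StringBuilder sb = new StringBuilder(text);
        for (Map.Entry<Integer, String> e : markup.descendingMap().entrySet()) {
            sb.insert(e.getKey(), e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "red fox and blue bird";
        TreeMap<Integer, String> markup = new TreeMap<>();
        String reset = "\u001B[0m";
        collect("<B>red</B> fox and blue bird", "\u001B[33m", reset, markup); // yellow
        collect("red fox and <B>blue</B> bird", "\u001B[32m", reset, markup); // green
        System.out.println(apply(text, markup));
    }
}
```

As in the Groovy version, the map holds one marker per position, so two terms that touch at exactly the same offset would overwrite each other's marker; that case would need a list per position.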

How do I get a list of fields in a generic sObject?

I'm trying to build a query builder, where the sObject result can contain an indeterminate number of fields. I'm using the result to build a dynamic table, but I can't figure out a way to read the sObject for a list of fields that were in the query.
I know how to get a list of ALL fields using the getDescribe information, but the query might not contain all of those fields.
Is there a way to do this?

Presumably you're building the query up as a string, since it's dynamic, so couldn't you just loop through the fields in the describe information, and then use .contains() on the query string to see if it was requested? Not crazy elegant, but seems like the simplest solution here.
Taking this further, maybe you have the list of fields selected in a list of strings or similar, and you could just use that list?

Not sure if this is exactly what you were after but something like this?
public list<sObject> Querylist {get; set;}
Define the search string:
string QueryString = 'select field1__c, field2__c from Object where';
Add as many of these blocks as you need, one for each field the user can search on:
if (searchParameter.field1__c != null && searchParameter.field1__c != '')
{
    QueryString += ' field1__c like \'' + searchParameter.field1__c + '%\' and ';
}
if (searchParameter.field2__c != null && searchParameter.field2__c != '')
{
    QueryString += ' field2__c like \'' + searchParameter.field2__c + '%\' and ';
}
Remove the trailing "and" (this assumes at least one condition was added):
QueryString = QueryString.substring(0, (QueryString.length()-4));
QueryString += ' limit 200';
Add the query results to the list:
Querylist = new list<sObject>();
for (sObject obj : Database.query(QueryString))
{
    Querylist.add(obj);
}

To get the list of fields in an sObject, you could use a method such as:
public Set<String> getFields(sObject sobj) {
    Set<String> fieldSet = new Set<String>();
    for (String field : sobj.getSobjectType().getDescribe().fields.getMap().keySet()) {
        try {
            // get() throws if the field was not included in the query
            sobj.get(field);
            fieldSet.add(field);
        } catch (Exception e) {
            // field was not queried, so skip it
        }
    }
    return fieldSet;
}
You should refactor this approach to bulkify it for your context, but it works. Just pass in an sObject and it'll give you back a set of the field names.

I suggest using a list of fields for creating both the query and the table. You can put the list of fields in the result so that it's accessible to anyone using it. Then you can construct the table by using result.getFields() and retrieve the data by using result.getRows().
for (sObject obj : result.getRows()) {
    for (String fieldName : result.getFields()) {
        table.addCell(obj.get(fieldName));
    }
}
If you're trying to work with a query that's out of your control, you would have to parse the query to get the list of fields. But I wouldn't suggest trying that. It complicates code in ways that are hard to follow.
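
The shape of that idea, written as a self-contained Java sketch rather than Apex: one field list drives both the query text and the table. QueryResult, buildQuery and toTable are hypothetical names, and rows are modelled as plain maps instead of sObjects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical result wrapper: carries the field list alongside the rows.
class QueryResult {
    private final List<String> fields;
    private final List<Map<String, Object>> rows;

    QueryResult(List<String> fields, List<Map<String, Object>> rows) {
        this.fields = fields;
        this.rows = rows;
    }

    List<String> getFields() { return fields; }
    List<Map<String, Object>> getRows() { return rows; }

    // The same field list that later builds the table also builds the query.
    static String buildQuery(List<String> fields, String objectName) {
        return "SELECT " + String.join(", ", fields) + " FROM " + objectName;
    }

    // Header row first, then one row of cells per record, in field-list order.
    static List<List<Object>> toTable(QueryResult result) {
        List<List<Object>> table = new ArrayList<>();
        table.add(new ArrayList<>(result.getFields()));
        for (Map<String, Object> row : result.getRows()) {
            List<Object> cells = new ArrayList<>();
            for (String field : result.getFields()) {
                cells.add(row.get(field));
            }
            table.add(cells);
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> fields = List.of("Name", "Email__c");
        System.out.println(buildQuery(fields, "Contact"));
        QueryResult result = new QueryResult(fields,
                List.of(Map.of("Name", "Te", "Email__c", "te@example.com")));
        System.out.println(toTable(result));
    }
}
```

Because the table is built from getFields() rather than from the describe information, it only ever shows columns that were actually requested.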

Salesforce API: How to identify a Case from an email reference code ("[Ref: ... :Ref]")?

I'm writing a Windows service that will poll my IMAP4 inbox for emails from clients and create new Cases in Salesforce based on them.
Sometimes emails come in with a Case reference code in the subject. Ex: "[ ref:00FFwxyz.500FFJJS5:ref ]". I'd like to assign such emails to the existing Case identified by the code rather than create a new one.
My question is: Is there a definitive formula for extracting a unique Case identifier from the ref code? I've seen a few formulas that do the reverse, but they all look like guesswork: Blog post on KnowThyCloud.com, Force.com Discussion Board thread.

Found a decent enough solution. I was wrong in calling the post on KnowThyCloud.com guesswork. In the right context it works fine.
My solution is to create a new custom field on the Case record of type "Formula (Text)". The field's value is the formula mentioned in the blog post:
TRIM(" [ ref:" + LEFT( $Organization.Id, 4) + RIGHT($Organization.Id, 4) +"."+ LEFT( Id, 4) + SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(Id, RIGHT( Id, 4), ""), LEFT( Id, 4), ""), "0", "") + RIGHT( Id, 4) + ":ref ] ")
Now the value of the custom field for each Case record is the same as the reference Id in emails and I can simply query for it with the Salesforce API.

I implemented urig's solution and it works well.
Here is an Apex code solution that locates the Case without this field.
String emailSubject = 'FW: Re: RE: order 36000862A Case: 00028936 [ ref:00D7JFzw.5007Ju10k:ref ]';
String caseNumber = null;
/*
Extract the text after ref: and before the period. Could use this to locate the organization.
In the text after the period and before the :ref split into a 4 digit number and remaining number.
Insert 0's to get ref id.
*/
String patternString = '.*ref:(.{8}).(.{4})(.+):ref.*';
Pattern thePattern = Pattern.compile(patternString);
Matcher matcher = thePattern.matcher(emailSubject);
if (matcher.matches()) {
    String caseId = matcher.group(2) + '000000' + matcher.group(3);
    Case[] matchingCases = [Select CaseNumber from Case where Id = :caseId];
    if(matchingCases.size() == 1) {
        Case theCase = matchingCases[0];
        caseNumber = theCase.CaseNumber;
    }
}

I have modified Jan's code snippet above to support the new reference string format containing underscores (e.g. _00DC0PxQg._500C0KoOZS).
String patternString = '.*ref:(.{11}).(.{5})(.+):ref.*';
Pattern thePattern = Pattern.compile(patternString);
Matcher matcher = thePattern.matcher(emailSubject);
if (matcher.matches()) {
    String caseId = matcher.group(2) + '00000' + matcher.group(3);
    system.debug('### '+caseId);
    Case[] matchingCases = [Select CaseNumber from Case where Id = :caseId];
    if(matchingCases.size() == 1) {
        Case theCase = matchingCases[0];
        caseNumber = theCase.CaseNumber;
    }
}
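
Since the original poster's service runs outside Salesforce, the same reconstruction can be sketched in plain Java. This is not an official formula: CaseRefParser is a hypothetical name, and the rule it applies (keep the first 4 characters, or the first 5 in the underscore format, then pad with zeros to 15 characters) is only inferred from the two Apex snippets above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that rebuilds a 15-character Case Id from an email ref code.
class CaseRefParser {
    // Old style: ref:00D7JFzw.5007Ju10k:ref   Newer style: ref:_00DC0PxQg._500C0KoOZS:ref
    private static final Pattern REF = Pattern.compile("ref:(_?)\\w+\\.(_?)(\\w+):ref");
    private static final int ID_LENGTH = 15;

    static Optional<String> caseIdFromSubject(String subject) {
        Matcher m = REF.matcher(subject);
        if (!m.find()) {
            return Optional.empty();
        }
        // Assumption taken from the snippets: the underscore format keeps 5 leading characters.
        int prefixLength = m.group(2).isEmpty() ? 4 : 5;
        String compact = m.group(3);
        if (compact.length() <= prefixLength || compact.length() > ID_LENGTH) {
            return Optional.empty();
        }
        String zeros = "0".repeat(ID_LENGTH - compact.length());
        return Optional.of(compact.substring(0, prefixLength) + zeros
                + compact.substring(prefixLength));
    }

    public static void main(String[] args) {
        String subject = "FW: Re: RE: order 36000862A Case: 00028936 [ ref:00D7JFzw.5007Ju10k:ref ]";
        System.out.println(caseIdFromSubject(subject).orElse("no ref code found"));
    }
}
```

For the two sample codes in this thread the result equals what the Apex snippets build with their fixed runs of zeros; the returned Id would then be used to look up the Case through the Salesforce API.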
