lucene highlighter: how can I get locations of fragments? - lucene

I know how to get relevant highlighted fragments together with some surrounding text using Lucene highlighter, namely, using
Highlighter highlighter = new Highlighter(scorer);
String[] fragments = highlighter.getBestFragments(stream, fieldContents, fragmentNumber);
But can I instead get pointers to these fragments in the original contents? In other words, I need to know where these fragments start and, if possible, end.

If you use the GetBestTextFragments method instead, you will get back an array of TextFragments. These have properties textStartPos and textEndPos.
(They are marked internal in Lucene.NET, which will require you to make some code changes to get access to them. I'm not sure about Java Lucene.)

Related

Tabulator - formatting print and PDF output

I am a relatively new user of Tabulator so please forgive me if I am asking anything that, perhaps, should be obvious.
I have a Tabulator report that I am able to print and create as a PDF, but the report's formatting (as shown on the screen) is not used in either output.
For printing I have used printAsHtml and printStyled=true, but this doesn't produce a printout that matches what is on the screen. I have formatted number fields (with comma separators) and these are showing correctly, but the number columns should be right-aligned but all of the columns appear as left-aligned.
I am also using Tree View where the tree rows are coloured differently to the main table, but when I print the report with a tree open it colours the whole table with the tree colours and not just the tree.
For the PDF none of the Tabulator formatting is being used. I've looked for anything similar to the printStyled option, but I can't see anything. I've also looked at the autoTable option, but I am struggling to find what to use.
I want to format the print and PDF outputs so that they look as close to the screen representation as possible.
Is there anywhere I could look that would provide examples of how to achieve the above? The Tabulator documentation is very good, but the provided examples don't appear to explain what I am trying to do.
Perhaps there are there CSS classes that I am missing or even mis-using? I have tried including .tabulator-print-table in my CSS, but I am probably not using it correctly. I also couldn't find anything equivalent for producing PDFs. Some examples would help immensely.
Thank you in advance for any advice or assistance.
Formatting is deliberately not included in these, below i will outline why:
Downloaders
Downloaded files do not contain formatted data, only the raw data, this is because a lot of the formatters create visual elements (progress bar, star formatter etc) that cannot be replicated sensibly in downloaded files.
If you want to change the format of data in the download you will need to use an accessor, the accessorDownload option is the one you want to use in this case. The accessors transform the data as it is leaving the table.
For instance we could create an accessor that prepended "Mr " to the front of every name in a column:
var mrAccessor= function(value, data, type, params, column, row){
return "Mr " + value;
}
Assign it to a columns definition:
{title:"Name", field:"name", accessorDownload:mrAccessor}
Printing
Printing also does not include the formatters, this is because when you print a Tabulator table, the whole table is actually rebuilt as a standard HTML table, which allows the printer to work out how to layout everything across multiple pages with column headers etc. The downside of this is that it is only loosely styled like a Tabulator and so formatted contents generated inside Tabulator cells will likely break when added to a normal td element.
For this reason there is also a accessorPrint option that works in the same way as the download accessor but for printing.
If you want to use the same accessor for both occasions, you can assign the function once to the accessor option and it will be applied in both instances.
Checkout the Accessor Documentation for full details.

unable to perform search on custom_field(JIRA-Python)

I'm getting the below error when I search on custom_field.
{"errorMessages":["Field \'customfield_10029\' does not exist or you do not have permission to view it."],"warningMessages":[]}
But I have enough permissions(Admin) to access that field. And also I enabled the field visible.
URL = 'https://xyz.atlassian.net/rest/api/2/search?jql=status="In+Progress"+and+customfield_10029=125&fields=id,key,status'
Custom fields in JQL searches are referenced using the abbreviation 'cf' followed by their ID inside square brackets '[id]', so your URL would be:
URL =
'https://xyz.atlassian.net/rest/api/2/search?jql=status="In+Progress"+and+cf[10029]=125&fields=id,key,status'
Make sure you properly encode the square brackets in UTF-8 format in your language's encoding method.
PS. Generally speaking, it's much easier to reference custom fields in JQL searches by their names, not their IDs. It makes the search URL easier to read and understand what is being searched for.
I get a 400 response code with customized field syntax:
https://domain/rest/api/2/search?maxResults=500&jql=cf[10025]='xxxxxxxxxd'&fields=id,key,issuetype,status,customfield_10025

Lucene Incomplete Word Spellchecking

I am using Lucene to perform spell checking. I am using
https://lucene.apache.org/core/5_4_1/suggest/org/apache/lucene/search/spell/SpellChecker.html
What I really want is, for example, I have a word:
spellingmistake
And now I type:
speli.
In this case, I want to spellchecker to return me the correct word or at least return spell. So to achieve this while indexing the dictionary in the indexwriter config I used a EdgeNGramTokenizer assuming it would work for this case. But unfortunately, it is not working.
How do I get this working?
Thanks.

TSearch2 - dots explosion

Following conversion
SELECT to_tsvector('english', 'Google.com');
returns this:
'google.com':1
Why does TSearch2 engine didn't return something like this?
'google':2, 'com':1
Or how can i make the engine to return the exploded string as i wrote above?
I just need "Google.com" to be foundable by "google".
Unfortunately, there is no quick and easy solution.
Denis is correct in that the parser is recognizing it as a hostname, which is why it doesn't break it up.
There are 3 other things you can do, off the top of my head.
You can disable the host parsing in the database. See postgres documentation for details. E.g. something like ALTER TEXT SEARCH CONFIGURATION your_parser_config
DROP MAPPING FOR url, url_path
You can write your own custom dictionary.
You can pre-parse your data before it's inserted into the database in some manner (maybe splitting all domains before going into the database).
I had a similar issue to you last year and opted for solution (2), above.
My solution was to write a custom dictionary that splits words up on non-word characters. A custom dictionary is a lot easier & quicker to write than a new parser. You still have to write C tho :)
The dictionary I wrote would return something like 'www.facebook.com':4, 'com':3, 'facebook':2, 'www':1' for the 'www.facebook.com' domain (we had a unique-ish scenario, hence the 4 results instead of 3).
The trouble with a custom dictionary is that you will no longer get stemming (ie: www.books.com will come out as www, books and com). I believe there is some work (which may have been completed) to allow chaining of dictionaries which would solve this problem.
First off in case you're not aware, tsearch2 is deprecated in favor of the built-in functionality:
http://www.postgresql.org/docs/9/static/textsearch.html
As for your actual question, google.com gets recognized as a host by the parser:
http://www.postgresql.org/docs/9.0/static/textsearch-parsers.html
If you don't want this to occur, you'll need to pre-process your text accordingly (or use a custom parser).

Controlling Doxygen's LaTeX output for making PDF documentation

I'm using Doxygen to generate documentation for my code. I need to make a PDF version of this and using Doxygen's LaTeX output appears to be the way to do it.
However I've run into a number of annoying problems, and not knowing anything about LaTeX previously haven't really got much of an idea on how to approach them, and the countless references for LaTeX related things are not much help...
I worked out how to create a custom style thing in a sty file and how to get Doxygen to use it. After a lot of searching I found out how to set the page margins etc. through this, and I'm guessing the perhaps this is the file I want for doing the other things I want, but I cant seem to find any commands for doign what I want :(
The table of contents at the start of the document contains a lot of items Id rather it didn't as it makes the contents very long. Is there some way to limit this contents to just say the first two levels, rather than having entries for every single individual function, variable, etc.? Id quite like to keep all the bookmarks however. I did try the "COMPACT_LATEX" option but as well as removing items on the contents pages, it removed the bookmarks and the member lists at the start of each section, which I do really want to keep.
Is there a way to change the order of things, like putting the full class description at the start of the section, rather than after all the members and attributes?
Wow, that's kind of evil of Doxygen.
Okay, to get around the tocdepth counter problem, add the following line to your .sty file:
\AtBeginDocument{\setcounter{tocdepth}{2}}% or whatever level you want
You can set the PDF bookmarks depth to a separate value:
% requires you \usepackage{hyperref} first
\hypersetup{
bookmarksdepth = section, % of whatever level you want
}
Also note that if you have a list of figures/tables, the tocdepth must be at least 2 for them to show up.
I don't see any way of rearranging those items within the LaTeX files---Doxygen just barfs them out there, so we can't do much. You'll have to poke around the Doxygen documentation to see if there's any way to specify the order I guess. (Here's hoping!)
You're so close.
Googling on "latex contents level" brought me to LaTeX - customizing the depth of the table of contents for different parts of the thesis which suggests
\setcounter{tocdepth}{n}
where n starts at zero for only the highest level division. This is presumable defined in all the default styles, but is worth a try in doxygen.
You could write a Perl/Awk script to simply delete the unwanted lines from the table of contents. For the file burble.tex, Latex will generate the file burble.toc, which will contain lines such as:
\contentsline {subsection}{Class F rewrites}{38}
\contentsline {subsection}{Class M rewrites}{39}
\contentsline {section}{\numberline {7}Definition and properties of the translation}{44}
\contentsline {paragraph}{Well-formedness}{54}
Simple regexes will identify which levels each line belongs to, and you can filter the file based on that. Once you have the table of contents the way you want it, insert \nofiles in the appropriate place (the style sheet?), which means that Latex will read the auxiliary files but not overwrite them.