Tess4J doOCR() for *First Page* of pdf / tif - pdf

Is there a way to tell Tess4J to OCR only a certain number of pages or characters?
I will potentially be working with 200+ page PDFs, but I really only want to OCR the first page, if that!
As far as I understand, the common sample
package net.sourceforge.tess4j.example;

import java.io.File;
import net.sourceforge.tess4j.*;

public class TesseractExample {
    public static void main(String[] args) {
        File imageFile = new File("eurotext.tif");
        Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
        // Tesseract1 instance = new Tesseract1(); // JNA Direct Mapping
        try {
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}
would attempt to OCR the entire 200+ page document into a single String.
For my particular case, that is way more than I need it to do, and I'm worried it could take a very long time if I let it process all 200+ pages and then just take a substring of the first 500 characters or so.

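The library has a PdfUtilities class that can extract a range of pages from a PDF (see its splitPdf method), so you can write the first page to its own file and pass only that file to doOCR().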
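For the TIFF half of the question, one option that does not depend on the library's PDF helpers is to decode only the first image of the file with the JDK's ImageIO and hand that single BufferedImage to doOCR(BufferedImage). This is a minimal sketch rather than anything Tess4J ships: the names FirstPageReader and readFirstPage are made up for illustration, and reading TIFF this way assumes a JDK whose ImageIO includes a TIFF reader (Java 9 or later).

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

// Hypothetical helper, not part of Tess4J.
public class FirstPageReader {

    /** Decodes only image index 0 of the file; later pages are not decoded. */
    public static BufferedImage readFirstPage(File imageFile) {
        try (ImageInputStream in = ImageIO.createImageInputStream(imageFile)) {
            if (in == null) {
                throw new IOException("Cannot open " + imageFile);
            }
            // Pick a reader based on the file's content (TIFF, PNG, ...).
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("No ImageIO reader for " + imageFile);
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                return reader.read(0); // index 0 = first page
            } finally {
                reader.dispose();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the sample above, the call would then become instance.doOCR(FirstPageReader.readFirstPage(imageFile)). For PDF input this does not help, and the PdfUtilities route is still the one to use.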

Related

JSF2.0: Show certain pdf page on load

I'd like to open a PDF in a new page from JSF2 and display a certain page of this PDF on load. I have a kind of TOC in my JSF page and want to jump from there directly to the page in the PDF.
What I know (this is not what I need, just an example of telling Adobe Reader and other PDF readers which page to jump to):
Something like this will open the page (an arbitrary document picked from the internet):
https://www.cdc.gov/diabetes/pdfs/data/statistics/national-diabetes-statistics-report.pdf#page=10
The #page=10 makes the pdf plugin of the browser display page 10.
Requirements for selecting the PDF:
The PDF is downloaded dynamically from a web service according to an ID that must reside only in the managed beans, since it is secret and should not be passed to others (like a session ID). (The answer I give below passes the ID as a GET parameter, which should not be done.)
The PDF should not reside in the file system, since I don't want to handle temporary files. (The answer I give below actually uses PDFs on the file system; with a stream only it does not work.)
Now my real problem: I have to change the URL being displayed/used in JSF, but I can't use the normal way with includeViewParams, because this will insert a "?" and not a "#" in the URL.
Also, I have a backing bean, that gets the content of the PDF from a backend service, based on some other parameters I'm giving, so a solution with would be cool, but I'm aware that this is probably not possible...
Does anyone have an idea, how to solve this?
I didn't include any code, since it doesn't work anyways, and I probably need a completely new way to solve this anyways...
Turns out PrimeFaces already has this implemented (although the implementation has its restrictions):
<p:media player="pdf" value="#{viewerBean.media}" width="100%" height="100%">
<f:param name="#page" value="#{viewerBean.pageNumber}"/>
<f:param name="toolbar" value="1"/>
<!--<f:param name="search" value="#{viewerBean.queryText}"/>-->
</p:media>
https://www.primefaces.org/showcase/ui/multimedia/media.xhtml
Restriction: it can't read from a stream, at least not reliably. Save your energy: write the stream to a temp file and set the file name dynamically. I'm not sure whether this is complete, but you should get the idea:
import javax.annotation.PostConstruct;
import javax.faces.bean.ManagedBean;
import javax.faces.bean.ManagedProperty;
import javax.faces.bean.RequestScoped;
import java.io.*;
import java.nio.file.Files;
import org.apache.commons.io.IOUtils;
import org.primefaces.model.DefaultStreamedContent;
import org.primefaces.model.StreamedContent;

@ManagedBean
@RequestScoped
public class ViewerBean implements Serializable {
    // LOGGER and getStreamedContent() are assumed to be defined elsewhere.
    @ManagedProperty(value = "#{param.page}")
    private String pageNumber;
    private File media;

    @PostConstruct
    public void init() {
        try {
            // Copy the streamed PDF into a temp file that the viewer can read.
            media = Files.createTempFile("car", ".pdf").toFile();
            try (FileOutputStream outputStream = new FileOutputStream(media)) {
                IOUtils.copy(getStreamedContent().getStream(), outputStream);
            }
        } catch (IOException e) {
            LOGGER.error(e);
            throw new RuntimeException("Error creating temp file", e);
        }
    }

    public StreamedContent getMedia() {
        try {
            return new DefaultStreamedContent(new FileInputStream(media), "application/pdf");
        } catch (FileNotFoundException e) {
            String message = "Error reading file " + media.getAbsolutePath();
            LOGGER.error(message, e);
            throw new RuntimeException(message, e);
        }
    }
}
If the pagename is not needed, you could use this:
http://balusc.omnifaces.org/2006/05/pdf-handling.html
Maybe if you can utilize outputLink for this you'll be lucky, but I ran out of time to test this option.
Found the (THE) solution. The approach in the answer above cannot cope with @ViewScoped beans and sends many requests to the underlying bean just to read a single InputStream. I found this unacceptable for load reasons.
So here we go:
Create a JSF page with <f:event type="preRenderView" listener="#{documentDownloadBean.download}"/>
Put the necessary data for dynamic retrieval of the PDF into flash scope
Redirect to the above JSF page like so: return "document_search/view_pdf.xhtml?faces-redirect=true#page=" + page;
@ManagedBean
@ViewScoped
public class DocumentDownloadBean implements Serializable {
    @ManagedProperty(value = "#{documentSearchBean}")
    private DocumentSearchBean documentSearchBean;

    public String activeDocumentToFlashScope(String page) {
        Document document = documentSearchBean.getSelectedDocument();
        FacesContext.getCurrentInstance().getExternalContext().getFlash().put("document", document);
        // Everything is prepared now: redirect to the viewing JSF page with the page=xxx
        // fragment in the URL, which is evaluated by Adobe Reader (and other readers, too).
        return "document_search/view_pdf.xhtml?faces-redirect=true#page=" + page;
    }

    public void download() {
        Document document = (Document) FacesContext.getCurrentInstance().getExternalContext().getFlash().get("document");
        InputStream inputStream = getInputstreamFromBackingWebserviceSomehow(document);
        FacesUtils.writeToResponseStream(FacesContext.getCurrentInstance().getExternalContext(), inputStream, document.getFileName());
    }
}
Calling JSF page:
<p:commandLink id="outputText" action="#{documentDownloadBean.activeDocumentToFlashScope(page)}"
               target="_blank" ajax="false">
    <h:outputText value="View PDF"/>
</p:commandLink>
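The redirect outcome used above only works because the "#page=" fragment is appended after the query string. A minimal sketch of building such an outcome in one place follows; PdfRedirects and outcomeForPage are hypothetical names, and whether a given JSF implementation keeps the fragment across the redirect is taken from the answer above rather than verified here.

```java
// Hypothetical helper; not part of JSF or PrimeFaces.
public class PdfRedirects {

    /** Builds a redirect outcome whose URL ends in "#page=N". */
    public static String outcomeForPage(String viewId, int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1");
        }
        // Query parameters come first; the fragment must be the last part of a URL.
        String separator = viewId.contains("?") ? "&" : "?";
        return viewId + separator + "faces-redirect=true#page=" + page;
    }
}
```

Parsing the result with java.net.URI shows the two parts separately: the query is faces-redirect=true and the fragment is page=10, which is the part the PDF viewer evaluates.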

How to merge 10000 pdf into one using pdfbox in most effective way

The PDFBox API works fine for a small number of files. But I need to merge 10000 PDF files into one, and when I pass 10000 files (about 5 GB) it uses 5 GB of RAM and finally runs out of memory.
Is there an implementation for such a requirement in PDFBox?
I tried to tune it by using AutoClosedInputStream, which gets closed automatically after reading, but the result is still the same.
I have a similar scenario here, but I need to merge only 1000 documents into a single one.
I tried to use the PDFMergerUtility class, but I was getting an OutOfMemoryError. So I refactored my code to load each document, take its first page (my source documents have one page only), and import it into the target, instead of using PDFMergerUtility. It now works fine, with no more OutOfMemoryError.
public void merge(final List<Path> sources, final Path target) {
    final int firstPage = 0;
    try (PDDocument doc = new PDDocument()) {
        for (final Path source : sources) {
            // MemoryUsageSetting (org.apache.pdfbox.io) buffers each source in a temp file instead of RAM.
            try (final PDDocument sdoc = PDDocument.load(source.toFile(), MemoryUsageSetting.setupTempFileOnly())) {
                final PDPage spage = sdoc.getPage(firstPage);
                doc.importPage(spage);
            }
        }
        doc.save(target.toAbsolutePath().toString());
    } catch (final IOException e) {
        throw new IllegalStateException(e);
    }
}

Text Extraction, Not Image Extraction

Please help me understand if my solution is correct.
I'm trying to extract text from a PDF file with a LocationTextExtractionStrategy parser. I'm getting exceptions because the ParseContentMethod tries to parse inline images? The code is simple and looks similar to this:
RenderFilter[] filter = { new RegionTextRenderFilter(cropBox) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
PdfTextExtractor.GetTextFromPage(pdfReader, pageNumber, strategy);
I realize the images are in the content stream, but I have a PDF file that fails text extraction because of inline images. It throws an UnsupportedPdfException with "The filter /DCTDECODE is not supported" and then finally fails with an InlineImageParseException with "Could not find image data or EI", when all I really care about is the text. The BI/EI exists in my file, so I assume this failure is caused by the /DCTDECODE exception. But again, I don't care about images; I'm looking for text.
My current solution for this is to add a filterHandler in the InlineImageUtils class that assigns the Filter_DoNothing() filter to the DCTDECODE filterHandler dictionary. This way I don't get exceptions when I have InlineImages with DCTDECODE. Like this:
private static bool InlineImageStreamBytesAreComplete(byte[] samples, PdfDictionary imageDictionary) {
    try {
        IDictionary<PdfName, FilterHandlers.IFilterHandler> handlers = new Dictionary<PdfName, FilterHandlers.IFilterHandler>(FilterHandlers.GetDefaultFilterHandlers());
        handlers[PdfName.DCTDECODE] = new Filter_DoNothing();
        PdfReader.DecodeBytes(samples, imageDictionary, handlers);
        return true;
    } catch (IOException e) {
        return false;
    }
}

public class Filter_DoNothing : FilterHandlers.IFilterHandler
{
    public byte[] Decode(byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary)
    {
        return b;
    }
}
My problem with this "fix" is that I had to change the iTextSharp library. I'd rather not do that so I can try to stay compatible with future versions.
Here's the PDF in question:
https://app.box.com/s/7eaewzu4mnby9ogpl2frzjswgqxn9rz5

Lucene 4.1 : How split words that contains "dots" when indexing?

I'm trying to figure out what I should do to index keywords that contain ".".
Example: this.name
I want to index the terms this and name.
I use the StandardAnalyzer. I tried extending WhitespaceTokenizer or TokenFilter, but I'm not sure I'm heading in the right direction.
If I use the StandardAnalyzer, I get "this.name" as a keyword, which is not what I want, but the analyzer does the rest correctly for me.
You can put a CharFilter in front of StandardTokenizer that converts periods and underscores to spaces. MappingCharFilter will work.
Here's MappingCharFilter added to a stripped-down version of the Lucene 4.1 StandardAnalyzer:
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;
import java.io.IOException;
import java.io.Reader;
public final class MyAnalyzer extends StopwordAnalyzerBase {
    private int maxTokenLength = 255;

    public MyAnalyzer() {
        super(Version.LUCENE_41, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }

    @Override
    protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
        final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
        src.setMaxTokenLength(maxTokenLength);
        TokenStream tok = new StandardFilter(matchVersion, src);
        tok = new LowerCaseFilter(matchVersion, tok);
        tok = new StopFilter(matchVersion, tok, stopwords);
        return new TokenStreamComponents(src, tok) {
            @Override
            protected void setReader(final Reader reader) throws IOException {
                src.setMaxTokenLength(MyAnalyzer.this.maxTokenLength);
                super.setReader(reader);
            }
        };
    }

    // Runs before the tokenizer: periods and underscores become spaces.
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add(".", " ");
        builder.add("_", " ");
        NormalizeCharMap normMap = builder.build();
        return new MappingCharFilter(normMap, reader);
    }
}
Here's a quick test to demonstrate it works:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
public class TestMyAnalyzer extends BaseTokenStreamTestCase {
    private Analyzer analyzer = new MyAnalyzer();

    public void testPeriods() throws Exception {
        BaseTokenStreamTestCase.assertAnalyzesTo(
            analyzer,
            "this.name; here.i.am; sentences ... end with periods.",
            new String[] { "name", "here", "i", "am", "sentences", "end", "periods" });
    }

    public void testUnderscores() throws Exception {
        BaseTokenStreamTestCase.assertAnalyzesTo(
            analyzer,
            "some_underscore_term _and____ stuff that is_not in it",
            new String[] { "some", "underscore", "term", "stuff" });
    }
}
If I understand you correctly, you need to use a tokenizer that removes dots -- that is, any name that contains a dot should be split at that point ("here.i.am" becomes "here" + "i" + "am").
You are getting caught by this documented behavior:
However, a dot that's not followed by whitespace is considered part of a token.
StandardTokenizer introduces more complex parsing rules than you may be looking for. This one, in particular, is intended to prevent tokenization of URLs, IPs, identifiers, etc. A simpler implementation, like LetterTokenizer, might suit your needs.
If that doesn't really suit your needs (and it might well turn out to be throwing the baby out with the bathwater), then you may need to modify StandardTokenizer yourself, which is explicitly encouraged by the Lucene docs:
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
Sebastien Dionne: I didn't understand how to split a word, do I have to parse the document char by char ?
Sebastien Dionne: I still want to know how to split a token into multiple part, and index them all
You may have to write a custom analyzer.
An Analyzer is a combination of a Tokenizer and, optionally, a chain of TokenFilter instances.
Tokenizer: takes the input text you pass in, typically as a java.io.Reader. It just breaks the text down into tokens; it doesn't alter it.
TokenFilter: takes the tokens emitted by the Tokenizer, adds, removes, or alters tokens, and emits them one by one until all are finished. If it replaces a token with multiple tokens, it buffers them all and emits them one by one to the indexer.
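The "replace one token with several, buffer them, emit them one by one" behavior described above can be sketched without Lucene. The class below is a plain-Java illustration of that buffering pattern only: DotSplittingFilter is a made-up name, it wraps an Iterator<String> instead of extending TokenFilter, and it ignores offsets, position increments, and the other attributes a real Lucene filter would have to maintain.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Plain-Java illustration of the buffering pattern; not a Lucene TokenFilter.
public class DotSplittingFilter implements Iterator<String> {
    private final Iterator<String> input;                      // upstream "tokenizer"
    private final Deque<String> pending = new ArrayDeque<>();  // parts waiting to be emitted

    public DotSplittingFilter(Iterator<String> input) {
        this.input = input;
    }

    @Override
    public boolean hasNext() {
        // Refill the buffer until it holds a part or the input is exhausted.
        while (pending.isEmpty() && input.hasNext()) {
            for (String part : input.next().split("\\.")) {
                if (!part.isEmpty()) {
                    pending.add(part);
                }
            }
        }
        return !pending.isEmpty();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return pending.poll();
    }

    /** Convenience for trying it out: runs the filter over a list of tokens. */
    public static List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        new DotSplittingFilter(tokens.iterator()).forEachRemaining(out::add);
        return out;
    }
}
```

Feeding it the tokens "this.name" and "here.i.am" yields this, name, here, i, am, each emitted separately. A real TokenFilter would do the same bookkeeping inside incrementToken().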
You may check the following resource; unfortunately, you may have to sign up for a trial membership.
By writing a custom analyzer, you can break down the text the way you want to. You may even reuse existing components like LowerCaseFilter. Fortunately, Lucene lets you build an Analyzer that serves your purpose if you can't find one built in or on the web.
"Writing Custom Filters: Lucene in Action 2"

How to capture and record video from webcam using JavaCV

I'm new to JavaCV and I'm having a difficult time finding good tutorials on the topics I'm interested in. I've succeeded in implementing some sort of real-time video streaming from my webcam, but the problem is that I use this code snippet, which I found on the net:
@Override
public void run() {
    FrameGrabber grabber = new VideoInputFrameGrabber(0); // 1 for next camera
    int i = 0;
    try {
        grabber.start();
        IplImage img;
        while (true) {
            img = grabber.grab();
            if (img != null) {
                cvFlip(img, img, 1); // flip mode 1 mirrors the image left-right
                cvSaveImage((i++) + "-aa.jpg", img);
                // show image on window
                canvas.showImage(img);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
that results in multiple jpg files.
What I really want to do is capture my webcam input and, along with showing it, save it to a proper video file. I found out about FFmpegFrameRecorder but don't know how to use it. I've also been wondering what the options are for the video file format, because FLV might be more useful for me.
It's been quite a journey. There are still a few things whose meaning I'm not sure about, but here is a working example for capturing and recording video from a webcam using JavaCV:
import com.googlecode.javacv.CanvasFrame;
import com.googlecode.javacv.FFmpegFrameRecorder;
import com.googlecode.javacv.OpenCVFrameGrabber;
import com.googlecode.javacv.cpp.avutil;
import com.googlecode.javacv.cpp.opencv_core.IplImage;

public class CameraTest {
    public static final String FILENAME = "output.mp4";

    public static void main(String[] args) throws Exception {
        OpenCVFrameGrabber grabber = new OpenCVFrameGrabber(0);
        grabber.start();
        IplImage grabbedImage = grabber.grab();

        CanvasFrame canvasFrame = new CanvasFrame("Cam");
        canvasFrame.setCanvasSize(grabbedImage.width(), grabbedImage.height());

        System.out.println("framerate = " + grabber.getFrameRate());
        grabber.setFrameRate(grabber.getFrameRate());

        FFmpegFrameRecorder recorder = new FFmpegFrameRecorder(FILENAME, grabber.getImageWidth(), grabber.getImageHeight());
        recorder.setVideoCodec(13); // numeric FFmpeg codec ID, as given in the original answer
        recorder.setFormat("mp4");
        recorder.setPixelFormat(avutil.PIX_FMT_YUV420P);
        recorder.setFrameRate(30);
        recorder.setVideoBitrate(10 * 1024 * 1024);
        recorder.start();

        // Show and record each frame until the window is closed or the camera stops delivering.
        while (canvasFrame.isVisible() && (grabbedImage = grabber.grab()) != null) {
            canvasFrame.showImage(grabbedImage);
            recorder.record(grabbedImage);
        }

        recorder.stop();
        grabber.stop();
        canvasFrame.dispose();
    }
}
It was somewhat hard for me to make this work, so for anyone who has the same issue: if you follow the official guide on setting up JavaCV on Windows 7 (64-bit) and want to capture video using the code above, create a new directory C:\ffmpeg and extract into it the files from the ffmpeg release that the official guide tells you to download. Then add C:\ffmpeg\bin to your PATH environment variable, and that's all. All credit for this step goes to karlphillip and his post.
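To confirm from Java that the PATH change is visible, something like the following can be run. It is a minimal sketch with made-up names (PathCheck, isOnPath); it only compares directory strings case-insensitively and does not check that the ffmpeg files are actually present in the directory.

```java
import java.io.File;
import java.util.regex.Pattern;

// Hypothetical helper for checking a PATH-style value; not part of JavaCV.
public class PathCheck {

    /** Returns true if dir appears as an entry of the given PATH-style value. */
    public static boolean isOnPath(String pathValue, String dir, String separator) {
        if (pathValue == null) {
            return false;
        }
        String wanted = stripTrailingSlashes(dir);
        for (String entry : pathValue.split(Pattern.quote(separator))) {
            // Windows paths are case-insensitive, so compare without regard to case.
            if (stripTrailingSlashes(entry.trim()).equalsIgnoreCase(wanted)) {
                return true;
            }
        }
        return false;
    }

    private static String stripTrailingSlashes(String s) {
        return s.replaceAll("[\\\\/]+$", "");
    }

    public static void main(String[] args) {
        System.out.println(isOnPath(System.getenv("PATH"), "C:\\ffmpeg\\bin", File.pathSeparator));
    }
}
```

A process reads its environment when it starts, so an IDE or console that was already open needs to be restarted before the new PATH entry shows up.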