How to force Mobile Vision for Android to read full lines of text - android-vision

I have implemented Google's Mobile Vision for Android by following a tutorial. I am trying to build an app that will scan a receipt and find the numeric total. However, as I scan different receipts that are printed in different formats, the API will detect TextBlocks in what seems to be an arbitrary way. For example, in one receipt, if several words of text are separated by single spaces, then they are grouped into a single TextBlock. However, if two words of text are separated by lots of spaces, then they are separated as independent TextBlocks, even though they appear on the same "line". What I am trying to do is force the API to recognize each entire line of the receipt as a single entity. Is this possible?

public ArrayList<T> getAllGraphicsInRow(float rawY) {
synchronized (mLock) {
ArrayList<T> row = new ArrayList<>();
// Get the position of this View so the raw location can be offset relative to the view.
int[] location = new int[2];
this.getLocationOnScreen(location);
for (T graphic : mGraphics) {
float rawX = this.getWidth();
for (int i=0; i<rawX; i+=10){
if (graphic.contains(i - location[0], rawY - location[1])) {
if(!row.contains(graphic)) {
row.add(graphic);
}
}
}
}
return row;
}
}
This should be in the GraphicOverlay.java file and essentially fetches all the graphics in that row.
public static boolean almostEqual(double a, double b, double eps){
return Math.abs(a-b)<(eps);
}
public static boolean pointAlmostEqual(Point a, Point b){
return almostEqual(a.y,b.y,10);
}
public static boolean cornerPointAlmostEqual(Point[] rect1, Point[] rect2){
boolean almostEqual=true;
for (int i=0; i<rect1.length;i++){
if (!pointAlmostEqual(rect1[i],rect2[i])){
almostEqual=false;
}
}
return almostEqual;
}
private boolean onTap(float rawX, float rawY) {
String priceRegex = "(\\d+[,.]\\d\\d)";
ArrayList<OcrGraphic> graphics = mGraphicOverlay.getAllGraphicsInRow(rawY);
OcrGraphic currentGraphics = mGraphicOverlay.getGraphicAtLocation(rawX,rawY);
if (graphics !=null && currentGraphics!=null) {
List<? extends Text> currentComponents = currentGraphics.getTextBlock().getComponents();
final Pattern pattern = Pattern.compile(priceRegex);
final Pattern pattern1 = Pattern.compile(priceRegex);
TextBlock text = null;
Log.i("text results", "This many in the row: " + Integer.toString(graphics.size()));
ArrayList<Text> combinedComponents = new ArrayList<>();
for (OcrGraphic graphic : graphics) {
if (!graphic.equals(currentGraphics)) {
text = graphic.getTextBlock();
Log.i("text results", text.getValue());
combinedComponents.addAll(text.getComponents());
}
}
for (Text currentText : currentComponents) { // goes through components in the row
final Matcher matcher = pattern.matcher(currentText.getValue()); // looks for
Point[] currentPoint = currentText.getCornerPoints();
for (Text otherCurrentText : combinedComponents) {//Looks for other components that are in the same row
final Matcher otherMatcher = pattern1.matcher(otherCurrentText.getValue()); // looks for
Point[] innerCurrentPoint = otherCurrentText.getCornerPoints();
if (cornerPointAlmostEqual(currentPoint, innerCurrentPoint)) {
if (matcher.find()) { // if you click on the price
Log.i("oh yes", "Item: " + otherCurrentText.getValue());
Log.i("oh yes", "Value: " + matcher.group(1));
itemList.add(otherCurrentText.getValue());
priceList.add(Float.valueOf(matcher.group(1)));
}
if (otherMatcher.find()) { // if you click on the item
Log.i("oh yes", "Item: " + currentText.getValue());
Log.i("oh yes", "Value: " + otherMatcher.group(1));
itemList.add(currentText.getValue());
priceList.add(Float.valueOf(otherMatcher.group(1)));
}
Toast toast = Toast.makeText(this, " Text Captured!" , Toast.LENGTH_SHORT);
toast.show();
}
}
}
return true;
}
return false;
}
This should be in OcrCaptureActivity.java and it breaks up the TextBlock into lines and finds the blocks in the same row as the line and checks if the components are all prices, and prints all value accordingly.
The eps value in almostEqual is the tolerance for how tall it checks for graphics in the row.

Related

Using Lucene's highlighting, getting too much highlighted, is there a workaround for this?

I am using the highlighting feature of Lucene to isolate matching terms for my query, but some of the matched terms are excessive.
I have some simple test cases which are delivered in an Ant project (download details below).
Materials
You can download the test case here: mydemo_with_libs.zip
That archive includes the Lucene 8.6.3 libraries which my test uses; if you prefer a copy without the JAR files you can download that from here: mydemo_without_libs.zip
The necessary libraries are: core, analyzers, queries, queryparser, highlighter, and memory.
You can run the test case by unzipping the archive into an empty directory and running the Ant command ant synsearch
Input
I have provided a short synonym list which is used for indexing and analysing in the highlighting methods:
cope,manage
jobs,tasks
simultaneously,at once
and there is one document being indexed:
Queues are a useful way of grouping jobs together in order to manage a number of them at once. You can:
hold or release multiple jobs at the same time;
group multiple tasks (for the same event);
control the priority of jobs in the queue;
Eventually log all events that take place in a queue.
Use either job.queue or task.queue in specifications.
Process
When building the index I am storing the text field, and using a custom analyzer. This is because (in the real world) the content I am indexing is technical documentation, so stripping out punctuation is inappropriate because so much of it may be significant in technical expressions. My analyzer uses a TechTokenFilter which breaks the stream up into tokens consisting of strings of words or digits, or individual characters which don't match the previous pattern.
Here's the relevant code for the analyzer:
public class MyAnalyzer extends Analyzer {
public MyAnalyzer(String synlist) {
if (synlist != "") {
this.synlist = synlist;
this.useSynonyms = true;
}
}
public MyAnalyzer() {
this.useSynonyms = false;
}
#Override
protected TokenStreamComponents createComponents(String fieldName) {
WhitespaceTokenizer src = new WhitespaceTokenizer();
TokenStream result = new TechTokenFilter(new LowerCaseFilter(src));
if (useSynonyms) {
result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
result = new FlattenGraphFilter(result);
}
return new TokenStreamComponents(src, result);
}
and here's my filter:
public class TechTokenFilter extends TokenFilter {
private final CharTermAttribute termAttr;
private final PositionIncrementAttribute posIncAttr;
private final ArrayList<String> termStack;
private AttributeSource.State current;
private final TypeAttribute typeAttr;
public TechTokenFilter(TokenStream tokenStream) {
super(tokenStream);
termStack = new ArrayList<>();
termAttr = addAttribute(CharTermAttribute.class);
posIncAttr = addAttribute(PositionIncrementAttribute.class);
typeAttr = addAttribute(TypeAttribute.class);
}
#Override
public boolean incrementToken() throws IOException {
if (this.termStack.isEmpty() && input.incrementToken()) {
final String currentTerm = termAttr.toString();
final int bufferLen = termAttr.length();
if (bufferLen > 0) {
if (termStack.isEmpty()) {
termStack.addAll(Arrays.asList(techTokens(currentTerm)));
current = captureState();
}
}
}
if (!this.termStack.isEmpty()) {
String part = termStack.remove(0);
restoreState(current);
termAttr.setEmpty().append(part);
posIncAttr.setPositionIncrement(1);
return true;
} else {
return false;
}
}
public static String[] techTokens(String t) {
List<String> tokenlist = new ArrayList<String>();
String[] tokens;
StringBuilder next = new StringBuilder();
String token;
char minus = '-';
char underscore = '_';
char c, prec, subc;
// Boolean inWord = false;
for (int i = 0; i < t.length(); i++) {
prec = i > 0 ? t.charAt(i - 1) : 0;
c = t.charAt(i);
subc = i < (t.length() - 1) ? t.charAt(i + 1) : 0;
if (Character.isLetterOrDigit(c) || c == underscore) {
next.append(c);
// inWord = true;
}
else if (c == minus && Character.isLetterOrDigit(prec) && Character.isLetterOrDigit(subc)) {
next.append(c);
} else {
if (next.length() > 0) {
token = next.toString();
tokenlist.add(token);
next.setLength(0);
}
if (Character.isWhitespace(c)) {
// shouldn't be possible because the input stream has been tokenized on
// whitespace
} else {
tokenlist.add(String.valueOf(c));
}
// inWord = false;
}
}
if (next.length() > 0) {
token = next.toString();
tokenlist.add(token);
// next.setLength(0);
}
tokens = tokenlist.toArray(new String[0]);
return tokens;
}
}
Examining the index I can see that the index contains the separate terms I expect, including the synonym values. For example the text at the end of the first line has produced the terms
of
them
at , simultaneously
once
.
You
can
:
and the text at the end of the third line has produced the terms
same
event
)
;
When the application performs a search it analyzes the query without using the synonym list (because the synonyms are already in the index), but I have discovered that I need to include the synonym list when analyzing the stored text to identify the matching fragments.
Searches match the correct documents, but the code I have added to identify the matching terms over-performs. I won't show all the search method here, but will focus on the code which lists matched terms:
public static void doSearch(IndexReader reader, IndexSearcher searcher,
Query query, int max, String synList) throws IOException {
SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter("\001", "\002");
Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
Analyzer analyzer;
if (synList != null) {
analyzer = new MyAnalyzer(synList);
} else {
analyzer = new MyAnalyzer();
}
// Collect all the docs
TopDocs results = searcher.search(query, max);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = Math.toIntExact(results.totalHits.value);
System.out.println("\nQuery: " + query.toString());
System.out.println("Matches: " + numTotalHits);
// Collect matching terms
HashSet<String> matchedWords = new HashSet<String>();
int start = 0;
int end = Math.min(numTotalHits, max);
for (int i = start; i < end; i++) {
int id = hits[i].doc;
float score = hits[i].score;
Document doc = searcher.doc(id);
String docpath = doc.get("path");
String doctext = doc.get("text");
try {
TokenStream tokens = TokenSources.getTokenStream("text", null, doctext, analyzer, -1);
TextFragment[] frag = highlighter.getBestTextFragments(tokens, doctext, false, 100);
for (int j = 0; j < frag.length; j++) {
if ((frag[j] != null) && (frag[j].getScore() > 0)) {
String match = frag[j].toString();
addMatchedWord(matchedWords, match);
}
}
} catch (InvalidTokenOffsetsException e) {
System.err.println(e.getMessage());
}
System.out.println("matched file: " + docpath);
}
if (matchedWords.size() > 0) {
System.out.println("matched terms:");
for (String word : matchedWords) {
System.out.println(word);
}
}
}
Problem
While the correct documents are selected by these queries, and the fragments chosen for highlighting do contain the query terms, the highlighted pieces in some of the selected fragments extend over too much of the input.
For example, if the query is
+text:event +text:manage
(the first example in the test case) then I would expect to see 'event' and 'manage' in the highlighted list. But what I actually see is
event);
manage
Despite the highlighting process using an analyzer which breaks terms apart and treats punctuation characters as single terms, the highlight code is "hungry" and breaks on whitespace alone.
Similarly if the query is
+text:queeu~1
(my final test case) I would expect to only see 'queue' in the list. But I get
queue.
job.queue
task.queue
queue;
It is so nearly there... but I don't understand why the highlighted pieces are inconsistent with the index, and I don't think I should have to parse the list of matches through yet another filter to produce the correct list of matches.
I would really appreciate any pointers to what I am doing wrong or how I could improve my code to deliver exactly what I need.
Thanks for reading this far!
I managed to get this working by replacing the WhitespaceTokenizer and TechTokenFilter in my analyzer with a PatternTokenizer; the regular expression took a bit of work but once I had it all the matching terms were extracted with pinpoint accuracy.
The replacement analyzer:
public class MyAnalyzer extends Analyzer {
public MyAnalyzer(String synlist) {
if (synlist != "") {
this.synlist = synlist;
this.useSynonyms = true;
}
}
public MyAnalyzer() {
this.useSynonyms = false;
}
private static final String tokenRegex = "(([\\w]+-)*[\\w]+)|[^\\w\\s]";
#Override
protected TokenStreamComponents createComponents(String fieldName) {
PatternTokenizer src = new PatternTokenizer(Pattern.compile(tokenRegex), 0);
TokenStream result = new LowerCaseFilter(src);
if (useSynonyms) {
result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
result = new FlattenGraphFilter(result);
}
return new TokenStreamComponents(src, result);
}

Text displayed in blue although PDAnnotation removed

We have a requirement where we need to remove annotation on some matched conditional check. PDAnnotaion gets removed when I have executed allPageAnnotationsList.remove(annotationTobeRemoved) statement.
But corresponding text remained displayed in blue color only. How could I update the text color to normal(black)?
Originally I thought you asked for all non-black text on a page to be changed to black. This resulted in my original answer, now the first section 'Updating All Text to Black'. Then you clarified that you only wanted the text in the areas of the removed annotations to be made black. That's shown in the second section 'Updating Text in Areas to Black'.
Updating All Text to Black
First of all, as already described by Tilman in comments, removing link annotations usually merely removes the interactivity of that link but the text in the area of the link annotation remains as is. If you want to update the text color to normal(black), therefore, you have to add a second step and manipulate the colors in the static page contents.
The static page content is defined by a stream of instructions which change the graphics state or draw something. The color used for drawing is part of the graphics state and is set by explicit color setting instructions. Thus, one could think you could simply replace all color setting instructions by instructions selecting normal(black).
Unfortunately it's not that easy because colors may be changed to draw other things, too. E.g. in your document at the start the whole page is filled with white; if you replaced the color setting instruction before that fill instruction, your whole page would be black. Not exactly what you want.
To update the text color to normal(black) but not change other colors, therefore, you have to consider the context of instructions you want to change.
The PDFBox parsing framework can help you here, iterating over a content stream and keeping track of the graphics state.
Based upon that framework, furthermore, a generic content stream editor helper class has been created in this answer, the PdfContentStreamEditor. (For details and example uses see that answer.) Now you merely have to customize it for your use case, e.g. like this:
PDDocument document = ...;
for (PDPage page : document.getDocumentCatalog().getPages()) {
PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
#Override
protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
String operatorString = operator.getName();
if (TEXT_SHOWING_OPERATORS.contains(operatorString)) {
if (currentlyReplacedColor == null)
{
PDColor currentFillColor = getGraphicsState().getNonStrokingColor();
if (!isBlack(currentFillColor))
{
currentlyReplacedColor = currentFillColor;
super.write(contentStreamWriter, SET_NON_STROKING_GRAY, GRAY_BLACK_VALUES);
}
}
} else if (currentlyReplacedColor != null) {
PDColorSpace replacedColorSpace = currentlyReplacedColor.getColorSpace();
List<COSBase> replacedColorValues = new ArrayList<>();
for (float f : currentlyReplacedColor.getComponents())
replacedColorValues.add(new COSFloat(f));
if (replacedColorSpace instanceof PDDeviceCMYK)
super.write(contentStreamWriter, SET_NON_STROKING_CMYK, replacedColorValues);
else if (replacedColorSpace instanceof PDDeviceGray)
super.write(contentStreamWriter, SET_NON_STROKING_GRAY, replacedColorValues);
else if (replacedColorSpace instanceof PDDeviceRGB)
super.write(contentStreamWriter, SET_NON_STROKING_RGB, replacedColorValues);
else {
//TODO
}
currentlyReplacedColor = null;
}
super.write(contentStreamWriter, operator, operands);
}
PDColor currentlyReplacedColor = null;
final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
final Operator SET_NON_STROKING_CMYK = Operator.getOperator("k");
final Operator SET_NON_STROKING_RGB = Operator.getOperator("rg");
final Operator SET_NON_STROKING_GRAY = Operator.getOperator("g");
final List<COSBase> GRAY_BLACK_VALUES = Arrays.asList(COSInteger.ZERO);
};
editor.processPage(page);
}
document.save("withBlackText.pdf");
(ChangeTextColor test testMakeTextBlackTestAfterRemovingAnnotation)
Here we check whether the current instruction is a text drawing instruction. If it is and the current color is not already replaced, we check whether the current color is already black'ish. If it is not black, we store it and add an instruction to replace the current fill color by black.
Otherwise, i.e. if the current instruction is not a text drawing instruction, we check whether the current color has been replaced by black. If it has, we restore the original color.
To check whether a given color is black'ish, we use the following helper method.
static boolean isBlack(PDColor pdColor) {
PDColorSpace pdColorSpace = pdColor.getColorSpace();
float[] components = pdColor.getComponents();
if (pdColorSpace instanceof PDDeviceCMYK)
return (components[0] > .9f && components[1] > .9f && components[2] > .9f) || components[3] > .9f;
else if (pdColorSpace instanceof PDDeviceGray)
return components[0] < .1f;
else if (pdColorSpace instanceof PDDeviceRGB)
return components[0] < .1f && components[1] < .1f && components[2] < .1f;
else
return false;
}
(ChangeTextColor helper method)
Updating Text in Areas to Black
In comments you clarified that you only want the text in the areas of the removed annotations to become black.
For this you have to collect the rectangles of the annotations you remove and later check the position before switching colors whether it's inside one of those rectangles.
This can be done by extending the code above as follows. Here I remove every other annotation only and collect their rectangles to check against them later. Also I override the PDFStreamEngine method showText(byte[]) to store the position of the text shown in the current text drawing instruction.
PDDocument document = ...;
for (PDPage page : document.getDocumentCatalog().getPages()) {
List<PDRectangle> areas = new ArrayList<>();
// Remove every other annotation, collect their areas
List<PDAnnotation> annotations = new ArrayList<>();
boolean remove = true;
for (PDAnnotation annotation : page.getAnnotations()) {
if (remove)
areas.add(annotation.getRectangle());
else
annotations.add(annotation);
remove = !remove;
}
page.setAnnotations(annotations);
PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
#Override
protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
String operatorString = operator.getName();
if (TEXT_SHOWING_OPERATORS.contains(operatorString) && isInAreas()) {
if (currentlyReplacedColor == null)
{
PDColor currentFillColor = getGraphicsState().getNonStrokingColor();
if (!isBlack(currentFillColor))
{
currentlyReplacedColor = currentFillColor;
super.write(contentStreamWriter, SET_NON_STROKING_GRAY, GRAY_BLACK_VALUES);
}
}
} else if (currentlyReplacedColor != null) {
PDColorSpace replacedColorSpace = currentlyReplacedColor.getColorSpace();
List<COSBase> replacedColorValues = new ArrayList<>();
for (float f : currentlyReplacedColor.getComponents())
replacedColorValues.add(new COSFloat(f));
if (replacedColorSpace instanceof PDDeviceCMYK)
super.write(contentStreamWriter, SET_NON_STROKING_CMYK, replacedColorValues);
else if (replacedColorSpace instanceof PDDeviceGray)
super.write(contentStreamWriter, SET_NON_STROKING_GRAY, replacedColorValues);
else if (replacedColorSpace instanceof PDDeviceRGB)
super.write(contentStreamWriter, SET_NON_STROKING_RGB, replacedColorValues);
else {
//TODO
}
currentlyReplacedColor = null;
}
super.write(contentStreamWriter, operator, operands);
before = null;
after = null;
}
PDColor currentlyReplacedColor = null;
final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
final Operator SET_NON_STROKING_CMYK = Operator.getOperator("k");
final Operator SET_NON_STROKING_RGB = Operator.getOperator("rg");
final Operator SET_NON_STROKING_GRAY = Operator.getOperator("g");
final List<COSBase> GRAY_BLACK_VALUES = Arrays.asList(COSInteger.ZERO);
#Override
protected void showText(byte[] string) throws IOException {
Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
if (before == null)
before = getTextMatrix().multiply(ctm);
super.showText(string);
after = getTextMatrix().multiply(ctm);
}
Matrix before = null;
Matrix after = null;
boolean isInAreas() {
return isInAreas(before) || isInAreas(after);
}
boolean isInAreas(Matrix m) {
return m != null && areas.stream().anyMatch(rect -> rect.contains(m.getTranslateX(), m.getTranslateY()));
}
};
editor.processPage(page);
}
document.save("WithoutSomeAnnotation-withBlackTextThere.pdf");

PDF- Can text chunk contains 2 or more words?

Im using LocationTextExtractionStrategy to render text from PDF.
Text is rendered in function called RenderText.
So my question is: Can one chunk contains more than 2 words ?
For example we have text:
'MKL is a helpfull person'
Can it be written in chunks like (the most important chunk is bolded):
MK
L
is a h
elpfull
per son
?
Below is the code i use for word separation.
Im doing the word separation during adding text(chunk from renderText function) to current line.
public class TextLineLocation
{
public float X { get; set; }
public float Y { get; set; }
public float Height { get; set; }
public float Width { get; set; }
private string Text;
private List<char> bannedSings = new List<char>() {' ',',', '.', '/', '|', Convert.ToChar(#"\"), ';', '(', ')', '*', '&', '^', '!','?' };
public void AddText(TextInfo text)
{
Text += text;
foreach (char sign in bannedSings)
{
//creating new word
if (text.textChunk.Text.Contains(sign))
{
string[] splittedText = text.textChunk.Text.Split(sign);
foreach (string val in splittedText)
{
//if its first element, add it to current word
if (splittedText[0] == val)
{
// if its space, just ignore...
if (splittedText[0] == " ")
{
continue;
}
wordList[wordList.Count - 1].Text += val;
wordList[wordList.Count - 1].Width += text.getFontWidth();
wordList[wordList.Count - 1].Height += text.getFontHeight();
}
else
{
//if it isnt a first element, create another word
wordList.Add(new WordLocation(text.textChunk.StartLocation[1], text.textChunk.StartLocation[0], text.getFontWidth(), text.getFontHeight(), val));
//TODO: what if chunk has more than 2 words separated ?
}
}
}
}
else
{
//update last word
wordList[wordList.Count-1].Text += text.textChunk.Text;
wordList[wordList.Count - 1].Width += text.getFontWidth();
wordList[wordList.Count - 1].Height += text.getFontHeight();
}
}
public List<WordLocation> wordList = new List<WordLocation>();
}
Not sure from what library LocationTextExtractionStrategy comes, or what it does exactly, but in the PDF representation itself you can group characters together in a "chunk".
How this is used totally depends on the program that produces the PDF: Some programs keep words together, some programs only group word fragments (for example for kerning), some program do other, random things.
So, if LocationTextExtractionStrategy returns these as chunks, you can't rely on anything. If LocationTextExtractionStrategy doesn't return these, but instead relies on spacing heuristics to group characters into chunks, then this will be as good as the heuristics are.
Bottom line: A PDF doesn't contain text, and contains glyphs and their position on the page. Trying to reconstruct text from it is and remains guesswork. You may get it to work in the majority of cases, but there'll always be PDFs where whatever you are doing fails.

JavaFX troubles with removing items from ArrayList

I have 2 TableViews (tableProduct, tableProduct2). The first one is populated by database, the second one is populated with selected by user items from first one (addMeal method, which also converts those to simple ArrayList). After adding/deleting few objects user can save current data from second Table to txt file. It seems to work just fine at beginning. But problem starts to show a bit randomly... I add few items, save it, delete few items, save it, everything is fine. Then after few actions like that, one last object stays in txt file, even though the TableView is empty. I just can't do anything to remove it and I get no errors...
Any ideas what's going on?
public void addMeal() {
productData selection = tableProduct.getSelectionModel().getSelectedItem();
if (selection != null) {
tableProduct2.getItems().add(new productData(selection.getName() + "(" + Float.parseFloat(weightField.getText()) + "g)", String.valueOf(Float.parseFloat(selection.getKcal())*(Float.parseFloat(weightField.getText())/100)), String.valueOf(Float.parseFloat(selection.getProtein())*(Float.parseFloat(weightField.getText())/100)), String.valueOf(Float.parseFloat(selection.getCarb())*(Float.parseFloat(weightField.getText())/100)), String.valueOf(Float.parseFloat(selection.getFat())*(Float.parseFloat(weightField.getText())/100))));
productlist.add(new productSimpleData(selection.getName() + "(" + Float.parseFloat(weightField.getText()) + "g)", String.valueOf(Float.parseFloat(selection.getKcal())*(Float.parseFloat(weightField.getText())/100)), String.valueOf(Float.parseFloat(selection.getProtein())*(Float.parseFloat(weightField.getText())/100)), String.valueOf(Float.parseFloat(selection.getCarb())*(Float.parseFloat(weightField.getText())/100)), String.valueOf(Float.parseFloat(selection.getFat())*(Float.parseFloat(weightField.getText())/100))));
}
updateSummary();
}
public void deleteMeal() {
productData selection = tableProduct2.getSelectionModel().getSelectedItem();
if(selection != null){
tableProduct2.getItems().remove(selection);
Iterator<productSimpleData> iterator = productlist.iterator();
productSimpleData psd = iterator.next();
if(psd.getName().equals(String.valueOf(selection.getName()))) {
iterator.remove();
}
}
updateSummary();
}
public void save() throws IOException {
File file = new File("C:\\Users\\Maciek\\Desktop\\test1.txt");
if(file.exists()){
file.delete();
}
FileWriter fw = null;
BufferedWriter bw = null;
try {
fw = new FileWriter(file);
bw = new BufferedWriter(fw);
Iterator iterator;
iterator = productlist.iterator();
while (iterator.hasNext()) {
productSimpleData pd;
pd = (productSimpleData) iterator.next();
bw.write(pd.toString());
bw.newLine();
}
} catch (IOException e) {
e.printStackTrace();
} finally {
bw.flush();
bw.close();
}
}
and yeah, I realize addMethod inside if statement looks scary but don't mind it, that part is allright after all...
You only ever check the first item in the productlist list to determine, if the item should be removed. Since you do not seem to write to the List anywhere without doing a similar modification to the items of tableProduct2, you can just do the same in this case.
public void deleteMeal() {
int selectedIndex = tableProduct2.getSelectionModel().getSelectedIndex();
if(selectedIndex >= 0) {
tableProduct2.getItems().remove(selectedIndex);
productlist.remove(selectedIndex);
}
updateSummary();
}
This way you also prevent issues, if there are 2 equal items in the list, which could lead to the first one being deleted when the second one is selected...
and yeah, I realize addMethod [...] looks scary
Yes, it does, so it's time to rewrite this:
Change the properties in productData and productSimpleData to float and don't convert the data to String until you need it as String.
if (selection != null) {
float weight = Float.parseFloat(weightField.getText());
float weight100 = weight / 100;
float calories = Float.parseFloat(selection.getKcal())*weight100;
float protein = Float.parseFloat(selection.getProtein())*weight100;
float carb = Float.parseFloat(selection.getCarb())*weight100;
float fat = Float.parseFloat(selection.getFat())*weight100;
ProductData product = new productData(
selection.getName() + "(" + weight + "g)",
calories,
protein,
carb,
fat);
productlist.add(new productSimpleData(product.getName(), calories, protein, carb, fat));
tableProduct2.getItems().add(product);
}
Also that this kind of loop can be rewritten to an enhanced for loop:
Iterator iterator;
iterator = productlist.iterator();
while (iterator.hasNext()) {
productSimpleData pd;
pd = (productSimpleData) iterator.next();
bw.write(pd.toString());
bw.newLine();
}
Assuming you've declared productlist as List<productSimpleData> or a subtype, you can just do
for (productSimpleData pd : productlist) {
bw.write(pd.toString());
bw.newLine();
}
furthermore you could rely on a try-with-resources to close the writers for you:
try (FileWriter fw = new FileWriter(file);
BufferedWriter bw = new BufferedWriter(fw)){
...
} catch (IOException e) {
e.printStackTrace();
}
Also there is no need to delete the file since java overwrites the file by default and only appends data if you specify this in an additional constructor parameter for FileWriter.

How to Detect table start in itextSharp?

I am trying to convert pdf to csv file. pdf file has data in tabular format with first row as header. I have reached to the level where I can extract text from a cell, compare the baseline of text in table and detect newline but I need to compare table borders to detect start of table. I do not know how to detect and compare lines in PDF. Can anyone help me?
Thanks!!!
As you've seen (hopefully), PDFs have no concept of tables, just text placed at specific locations and lines drawn around them. There is no internal relationship between the text and the lines. This is very important to understand.
Knowing this, if all of the cells have enough padding you can look for gaps between characters that are large enough such as the width of 3 or more spaces. If the cells don't have enough spacing this will unfortunately probably break.
You could also look at every line in the PDF and try to figure out what represents your "table-like" lines. See this answer for how to walk every token on a page to see what's being drawn.
I was also searching the answer for the similar question, but unfortunately I didn't found one so I did it on my own.
A PDF page like this
Will give the output as
Here is the github link for the dotnet Console Application I made.
https://github.com/Justabhi96/Detect_And_Extract_Table_From_Pdf
This application detects the table in the specific page of the PDF and prints them in a table format on the console.
Here is the code that i used to make this application.
First of all I took the text out of PDF along with their coordinates using a class which extends iTextSharp.text.pdf.parser.LocationTextExtractionStrategy class of iTextSharp. The Code is as follows:
This is the Class that is going to store the chunks with there coordinates and text.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
namespace itextPdfTextCoordinates
{
public class RectAndText
{
public iTextSharp.text.Rectangle Rect;
public String Text;
public RectAndText(iTextSharp.text.Rectangle rect, String text)
{
this.Rect = rect;
this.Text = text;
}
}
}
And this is the class that extends the LocationTextExtractionStrategy class.
using iTextSharp.text.pdf.parser;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
namespace itextPdfTextCoordinates
{
public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
public List<RectAndText> myPoints = new List<RectAndText>();
//Automatically called for each chunk of text in the PDF
public override void RenderText(TextRenderInfo renderInfo)
{
base.RenderText(renderInfo);
//Get the bounding box for the chunk of text
var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
var topRight = renderInfo.GetAscentLine().GetEndPoint();
//Create a rectangle from it
var rect = new iTextSharp.text.Rectangle(
bottomLeft[Vector.I1],
bottomLeft[Vector.I2],
topRight[Vector.I1],
topRight[Vector.I2]
);
//Add this to our main collection
this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
}
}
}
This class is overriding the RenderText method of the LocationTextExtractionStrategy class which will be called each time you extract the chunks from a PDF page using PdfTextExtractor.GetTextFromPage() method.
using itextPdfTextCoordinates;
using iTextSharp.text.pdf;
//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();
var path = "F:\\sample-data.pdf";
//Parse page 1 of the document above
using (var r = new PdfReader(path))
{
for (var i = 1; i <= r.NumberOfPages; i++)
{
// Calling this function adds all the chunks with their coordinates to the
// 'myPoints' variable of 'MyLocationTextExtractionStrategy' Class
var ex = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(r, i, t);
}
}
//Here you can loop over the chunks of PDF
foreach(chunk in t.myPoints){
Console.WriteLine("character {0} is at {1}*{2}",i.Text,i.Rect.Left,i.Rect.Top);
}
Now for Detecting the start and end of the table you can use the coordinates of the chunks extracted from the PDF.
Like if the specific line is not having table then there will be no jumps in the right coordinate of the current chunk and and Left coordinate of next chunk. But the lines having table will be having those coordinate jumps of at least 3 points.
Like for Lines having table will have coordinates of chunks something like this:
right coord of current chunk -> 12.75pts
left coords of next chunk -> 20.30pts
so further you can use this logic to detect tables in the PDF.
The code is as follows:
using itextPdfTextCoordinates;
using iTextSharp.text.pdf;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace ConsoleApp1
{
class LineUsingCoordinates
{
public static List<List<string>> getLineText(string path, int page, float[] coord)
{
//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();
//Parse page 1 of the document above
using (var r = new PdfReader(path))
{
// Calling this function adds all the chunks with their coordinates to the
// 'myPoints' variable of 'MyLocationTextExtractionStrategy' Class
var ex = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(r, page, t);
}
// List of columns in one line
List<string> lineWord = new List<string>();
// temporary list for working around appending the <List<List<string>>
List<string> tempWord;
// List of rows. rows are list of string
List<List<string>> lineText = new List<List<string>>();
// List consisting list of chunks related to each line
List<List<RectAndText>> lineChunksList = new List<List<RectAndText>>();
//List consisting the chunks for whole page;
List<RectAndText> chunksList;
// List consisting the list of Bottom coord of the lines present in the page
List<float> bottomPointList = new List<float>();
//Getting List of Coordinates of Lines in the page no matter it's a table or not
foreach (var i in t.myPoints)
{
Console.WriteLine("character {0} is at {1}*{2}", i.Text, i.Rect.Left, i.Rect.Top);
// If the coords passed to the function is not null then process the part in the
// given coords of the page otherwise process the whole page
if (coord != null)
{
if (i.Rect.Left >= coord[0] &&
i.Rect.Bottom >= coord[1] &&
i.Rect.Right <= coord[2] &&
i.Rect.Top <= coord[3])
{
float bottom = i.Rect.Bottom;
if (bottomPointList.Count == 0)
{
bottomPointList.Add(bottom);
}
else if (Math.Abs(bottomPointList.Last() - bottom) > 3)
{
bottomPointList.Add(bottom);
}
}
}
// else process the whole page
else
{
float bottom = i.Rect.Bottom;
if (bottomPointList.Count == 0)
{
bottomPointList.Add(bottom);
}
else if (Math.Abs(bottomPointList.Last() - bottom) > 3)
{
bottomPointList.Add(bottom);
}
}
}
// Sometimes the above List will be having some elements which are from the same line but are
// having different coordinates due to some characters like " ",".",etc.
// And these coordinates will be having the difference of at most 4 points between
// their bottom coordinates.
//so to remove those elements we create two new lists which we need to remove from the original list
//This list will be having the elements which are having different but a little difference in coordinates
List<float> removeList = new List<float>();
// This list is having the elements which are having the same coordinates
List<float> sameList = new List<float>();
// Here we are adding the elements in those two lists to remove the elements
// from the original list later
for (var i = 0; i < bottomPointList.Count; i++)
{
var basePoint = bottomPointList[i];
for (var j = i+1; j < bottomPointList.Count; j++)
{
var comparePoint = bottomPointList[j];
//here we are getting the elements with same coordinates
if (Math.Abs(comparePoint - basePoint) == 0)
{
sameList.Add(comparePoint);
}
// here ae are getting the elements which are having different but the diference
// of less than 4 points
else if (Math.Abs(comparePoint - basePoint) < 4)
{
removeList.Add(comparePoint);
}
}
}
// Here we are removing the matching elements of remove list from the original list
bottomPointList = bottomPointList.Where(item => !removeList.Contains(item)).ToList();
//Here we are removing the first matching element of same list from the original list
foreach (var r in sameList)
{
bottomPointList.Remove(r);
}
// Here we are getting the characters of the same line in a List 'chunkList'.
foreach (var bottomPoint in bottomPointList)
{
chunksList = new List<RectAndText>();
for (int i = 0; i < t.myPoints.Count; i++)
{
// If the character is having same bottom coord then add it to chunkList
if (bottomPoint == t.myPoints[i].Rect.Bottom)
{
chunksList.Add(t.myPoints[i]);
}
// If character is having a difference of less than 3 in the bottom coord then also
// add it to chunkList because the coord of the next line will differ at least 10 points
// from the coord of current line
else if (Math.Abs(t.myPoints[i].Rect.Bottom - bottomPoint) < 3)
{
chunksList.Add(t.myPoints[i]);
}
}
// Here we are adding the chunkList related to each line
lineChunksList.Add(chunksList);
}
bool sameLine = false;
//Here we are looping through the lines consisting the chunks related to each line
foreach(var linechunk in lineChunksList)
{
var text = "";
// Here we are looping through the chunks of the specific line to put the texts
// that are having a cord jump in their left coordinates.
// because only the line having table will be having the coord jumps in their
// left coord not the line having texts
for (var i = 0; i< linechunk.Count-1; i++)
{
// If the coord is having a jump of less than 3 points then it will be in the same
// column otherwise the next chunk belongs to different column
if (Math.Abs(linechunk[i].Rect.Right - linechunk[i + 1].Rect.Left) < 3)
{
if (i == linechunk.Count - 2)
{
text += linechunk[i].Text + linechunk[i+1].Text ;
}
else
{
text += linechunk[i].Text;
}
}
else
{
if (i == linechunk.Count - 2)
{
// add the text to the column and set the value of next column to ""
text += linechunk[i].Text;
// this is the list of columns in other word its the row
lineWord.Add(text);
text = "";
text += linechunk[i + 1].Text;
lineWord.Add(text);
text = "";
}
else
{
text += linechunk[i].Text;
lineWord.Add(text);
text = "";
}
}
}
if(text.Trim() != "")
{
lineWord.Add(text);
}
// creating a temporary list of strings for the List<List<string>> manipulation
tempWord = new List<string>();
tempWord.AddRange(lineWord);
// "lineText" is the type of List<List<string>>
// this is our list of rows. and rows are List of strings
// here we are adding the row to the list of rows
lineText.Add(tempWord);
lineWord.Clear();
}
return lineText;
}
}
}
You can call getLineText() method of the above class and run the following loop to see the output in the table structure on the console.
var testFile = "F:\\sample-data.pdf";
float[] limitCoordinates = { 52, 671, 357, 728 };//{LowerLeftX,LowerLeftY,UpperRightX,UpperRightY}
// This line gives the lists of rows consisting of one or more columns
//if you pass the third parameter as null the it returns the content for whole page
// but if you pass the coordinates then it returns the content for that coords only
var lineText = LineUsingCoordinates.getLineText(testFile, 1, null);
//var lineText = LineUsingCoordinates.getLineText(testFile, 1, limitCoordinates);
// For detecting the table we are using the fact that the 'lineText' item which length is
// less than two is surely not the part of the table and the item which is having more than
// 2 elements is the part of table
foreach (var row in lineText)
{
if (row.Count > 1)
{
for (var col = 0; col < row.Count; col++)
{
string trimmedValue = row[col].Trim();
if (trimmedValue != "")
{
Console.Write("|" + trimmedValue + "|");
}
}
Console.WriteLine("");
}
}
Console.ReadLine();