Table Detection Algorithms - pdf

Context
I have a bunch of PDF files. Some of them are scanned (i.e. images). They consist of text + pictures + tables.
I want to turn the tables into CSV files.
Current Plan:
1) Run Tesseract OCR to get the text of all the documents.
2) ??? Run some type of Table Detection Algorithm ???
3) Extract the rows / columns / cells, and the text in them.
Question:
Is there some standard "Table Extraction Algorithm" to use?
Thanks!

Abbyy FineReader includes table detection and will be the easiest approach. It can scan and import PDFs, TIFFs, etc. You will also be able to manually adjust the tables and columns when the auto-detection fails.
www.abbyy.com - You should be able to download a trial version, and you will also find the OCR results are much more accurate than Tesseract's, which will save you a lot of time.
Trying to write something yourself will be hit and miss, as there are too many different types of tables to cope with: with lines, without lines, shaded, multi-line cells, different alignments, headers, footers, etc.
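That said, if you do attempt the simplest case yourself - tables with visible ruling lines - a common baseline is morphological line detection. Here is a minimal Python/OpenCV sketch of that idea; the file name, kernel lengths, and size filter are all placeholder values you would have to tune:

import cv2

# Load the page image (placeholder file name) and binarize it so that
# ink is white on black.
img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
_, bw = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# Extract long horizontal and vertical strokes with directional kernels;
# kernel lengths are tuning parameters that depend on scan resolution.
h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
h_lines = cv2.morphologyEx(bw, cv2.MORPH_OPEN, h_kernel)
v_lines = cv2.morphologyEx(bw, cv2.MORPH_OPEN, v_kernel)

# The union of the two line masks is the table grid; its outer contours
# are candidate table regions you could crop and OCR cell by cell.
mask = cv2.bitwise_or(h_lines, v_lines)
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if w > 100 and h > 100:  # crude size filter to skip stray marks
        print("candidate table at", (x, y, w, h))

This only handles ruled tables; borderless or shaded tables are exactly the hit-and-miss cases mentioned above.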
Good luck.

Related

Is it possible for RapidMiner to mine 50,000 documents using 4GB RAM?

I am facing a challenge processing 50k rows of data with feedback in text form, and I am trying to find a good way to reduce the dimensions. So far I have used the text-processing steps tokenize, transform to lower case, remove stop words, and stem, but it still gives a very large dimension space of about 15,000, with meaningless words included as well. What else can I do to extract only the relevant words?
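One standard remedy is pruning terms by document frequency on top of the steps listed above - the equivalent of the prune settings in RapidMiner's Process Documents operator. A hedged sketch of the same idea with scikit-learn, where the sample corpus and thresholds are made up:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the delivery was late and the product was damaged",
    "great product, fast delivery",
    "support never answered my emails",
]  # stand-in for your 50k feedback rows

vec = TfidfVectorizer(
    lowercase=True,
    stop_words="english",
    min_df=2,           # drop terms seen in fewer than 2 docs (try ~5 on 50k rows)
    max_df=0.9,         # drop terms seen in over 90% of docs
    max_features=2000,  # hard cap on the vocabulary size
)
X = vec.fit_transform(docs)
print(X.shape)  # (n_documents, n_terms) - far fewer than 15,000

Rare terms are mostly typos and noise, and near-ubiquitous terms carry no signal, so both ends of the frequency range can usually be cut.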

Strings indexing tool for binary files

Very often I have to deal with very large binary files (from 50 to 500 GB), in different formats, which contain basically mixed data including strings.
I need to index the strings inside the file, creating a database or an index, so I can do quick searches (basic, or complex with regex). The output of the search should of course be the offset of the found string in the binary file.
Does anyone know a tool, framework or library which can help me on this task?
You can run 'strings -t d' (Linux / OS X) on it to pull out strings with their corresponding offsets and then put that into Solr or Elasticsearch. If you want more than just ASCII, though, it gets more complex.
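For reference, the core of that approach fits in a few lines of Python: scan the file for runs of printable ASCII and emit (offset, string) pairs you can then ship to your index. The file name is a placeholder, and strings spanning a chunk boundary get split in this simplified version (keep a small tail between chunks to fix that):

import re

MIN_LEN = 4  # same default minimum length as the strings(1) tool
pattern = re.compile(rb"[\x20-\x7e]{%d,}" % MIN_LEN)

with open("dump.bin", "rb") as f:
    offset = 0
    while True:
        chunk = f.read(1 << 20)  # 1 MiB at a time, so 500 GB files are fine
        if not chunk:
            break
        for m in pattern.finditer(chunk):
            print(offset + m.start(), m.group().decode("ascii"))
        offset += len(chunk)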
Autopsy has its own string-extraction code (for UTF-8 and UTF-16) and puts the results into Solr (using Tika if the file format is supported), but it doesn't record the offset within a binary file, so it may not meet your needs.

How to look for the content of a text file in Pentaho?

I have an ETL which gives text file output, and I have to check whether that text content contains the word "error" or "bad" using Pentaho.
Is there any simple way to find it?
If you are trying to process a number of files, you can use a Get Filenames step to get all the filenames. Then, if your text files are small, you can use a Get File Content step to get the whole file as one row, then use a Java Filter or other matching step (e.g. RegEx) to search for the words.
If your text files are too big but line-based or otherwise in a fixed format (which they likely are if you used a Text File Output step), you can use a Text File Input step to get the lines, then a matcher step (see above) to find the words in each line. Then you can use a Filter Rows step to keep just those rows that contain the words, then Select Values to keep just the filename, then Sort Rows on the filename, then a Unique Rows step. The result should be a list of filenames whose contents contain the search words.
This may seem like a lot of steps, but Pentaho Data Integration, or PDI (aka Kettle), is designed as a flow of steps with distinct (and very reusable) functionality. A smaller but less "PDI" approach is to write a User Defined Java Class (or other scripting) step to do all the work. That solution has fewer steps but is not very configurable or reusable.
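For comparison, here is the same flow sketched outside PDI in plain Python - read each file line by line, flag files containing the words, and keep the unique filenames. The path and search words are placeholders:

import glob

SEARCH_WORDS = ("error", "bad")
matches = set()                                    # like Unique Rows

for path in glob.glob("/data/etl_output/*.txt"):   # like Get Filenames
    with open(path, encoding="utf-8") as f:
        for line in f:                             # like Text File Input
            if any(w in line.lower() for w in SEARCH_WORDS):
                matches.add(path)                  # Filter Rows + Select Values
                break                              # one hit per file is enough

for path in sorted(matches):                       # like Sort Rows
    print(path)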
If you're writing these files out yourself, then don't you already know the content? So scan the fields at the point where you already have them in memory.
If you're trying to see if Pentaho has written an error to the file, then you should use error handling on the output step.
Finally, PDI is not a text-searching tool. If you really need to do this, then your best bet is probably good old grep.

Customizing output from a database and formatting it

Say you have an average-looking database, and you want to generate a variety of text files, each with its own specific formatting - so the files may have rudimentary tables and spacing. You'd be taking the data from the database, transforming it into a specified format (while doing some basic logic), and saving it as a text file (you can store it in XML as an intermediate step).
So if you had to create 10 of these unique files, what would be the ideal approach? I suppose you could create classes for each type of transformation, but then you'd need to create quite a few classes - and what if you needed to create another 10 of these files a year down the road?
What do you think is a good approach to this problem - one that keeps the output files customizable without creating a mess of code and maintenance effort?
Here is what I would do if I were to come up with a general approach to this vague question. I would write three pieces of code, independent of each other:
a) A query processor which can run a query on a given database and output results in a well-known xml format.
b) An XSL stylesheet which can interpret the well-known xml format in (a) and transform it to the desired format.
c) An XML-to-Text transformer which can read the files in (a) and (b) and put out the result.
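As a rough illustration of steps (b) and (c), here is a minimal Python sketch using lxml. The file names and the stylesheet are hypothetical; with <xsl:output method="text"/> in the stylesheet, the transform result is plain text:

from lxml import etree

xml = etree.parse("query_result.xml")    # output of step (a)
xslt = etree.parse("report_format.xsl")  # one stylesheet per file format
transform = etree.XSLT(xslt)
result = transform(xml)

with open("report.txt", "w", encoding="utf-8") as f:
    f.write(str(result))                 # str() yields the transformed text

The appeal of this design is that adding an eleventh file format a year later means writing one more .xsl file, not another class.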

Compare Images in SQL

What is the best way to compare two Images in the database?
I tried to compare them (@Image is a parameter of type Image):
SELECT * FROM Photos
WHERE [Photo] = @Image
But I receive the error "The data types image and image are incompatible in the equal to operator".
Since the Image data type is binary and takes a huge amount of space to store, IMO the easiest way to compare Image fields is hash comparison.
So you need to store a hash of the Photo column in your table.
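As a sketch of that idea: compute the hash in application code when inserting, store it next to the image, and compare hashes instead of blobs. This example uses Python with sqlite3 purely so it is self-contained; the table and column names are made up:

import hashlib
import sqlite3

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Photos (Id INTEGER PRIMARY KEY, Photo BLOB, PhotoHash TEXT)")

image_bytes = b"...pretend these are JPEG bytes..."  # placeholder image data
conn.execute("INSERT INTO Photos (Photo, PhotoHash) VALUES (?, ?)",
             (image_bytes, md5_hex(image_bytes)))

# Later: find rows whose stored image is byte-for-byte identical to a new one.
rows = conn.execute("SELECT Id FROM Photos WHERE PhotoHash = ?",
                    (md5_hex(image_bytes),)).fetchall()
print(rows)  # [(1,)]

Note that this only finds byte-identical images; a re-encoded or resized copy will hash differently, which is the content-comparison problem discussed below.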
If you need to compare the images, you should retrieve all the images from the database and do the comparison in the language you use for accessing the database. This is one of the reasons why it's not best practice to store images or other binary files in a relational database. You should create a unique file name every time you want to store an image: rename the file with this unique name, store the image on disk, and insert into your database its name on disk (and optionally the original name of the file, or the one provided by the user of your app).
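A sketch of that store-on-disk scheme in Python, where the storage directory and the table layout are assumptions:

import shutil
import uuid
from pathlib import Path

def store_image(upload_path: str, storage_dir: str = "images") -> tuple[str, str]:
    """Copy the upload under a generated unique name; return (stored name, original name)."""
    Path(storage_dir).mkdir(parents=True, exist_ok=True)
    original = Path(upload_path).name
    unique = uuid.uuid4().hex + Path(upload_path).suffix
    shutil.copy(upload_path, Path(storage_dir) / unique)
    # Then e.g.: INSERT INTO Photos (FileName, OriginalName) VALUES (?, ?)
    return unique, original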
Generally, as has been mentioned already, you need to use dedicated algorithms from the image-processing shelf.
Moreover, it's hard to give a precise answer because the question is too general. Two images may be considered different or not based on a number of properties.
For instance, you can have one image of a flower at 100x100 pixels and an image of the same flower resized to 50x50 pixels. For some purposes and applications these two will be considered similar (regardless of the different dimensions), but for other purposes they will be different images.
You may want to check how image comparison is implemented by some tools and learn how they work:
pdiff
ImageMagick compare
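To make the flower example concrete, here is a crude similarity measure in Python with Pillow: shrink both images to a common thumbnail size and compare pixels, so the 100x100 and 50x50 versions come out as near-identical. The file names are placeholders, and real tools like the two above use far more sophisticated perceptual metrics:

from PIL import Image, ImageChops

def similarity(path_a: str, path_b: str, size=(32, 32)) -> float:
    # Normalize both images to grayscale thumbnails of the same size.
    a = Image.open(path_a).convert("L").resize(size)
    b = Image.open(path_b).convert("L").resize(size)
    # Mean absolute pixel difference, rescaled so 1.0 = identical.
    diff = ImageChops.difference(a, b)
    mean = sum(diff.getdata()) / (size[0] * size[1])
    return 1.0 - mean / 255.0

print(similarity("flower_100x100.png", "flower_50x50.png"))  # close to 1.0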
If you don't want to compare image content but just want to check whether two binary streams (image files, other binary files, binary objects with image content) are identical, then you can compare the MD5 checksums of the images.
It depends on how accurate you want to be and how many images you need to compare. You can use functions like DATALENGTH, SUBSTRING, or READTEXT to do some comparisons. Alternatively, you could write code in the CLR and call it through a stored procedure.
Comparing images falls under a specific branch of computer science called image processing. You should look for libraries that provide image-processing capabilities. Using them, you can measure to what degree two given images are the same or similar - two images might match each other by 50%, or more, or less. There are mathematical algorithms that define the comparison formulae and return this ratio.
Hope this gives you a direction to work further on your problem.