How to batch extract images from a PDF

TL;DR version: How do I extract the image from the Type B file below? Note that there are around 600 such files, so I would prefer some sort of batch operation.
(Example files: Type A and Type B)
Details: I'm redesigning my company's online catalog and need to extract the design images from ~2000 PDFs, which are either Type A (where I can export the images using Acrobat XI Tools - Document Processing - Extract All Images) or Type B.
I don't know how these were designed or what causes the difference (the PDF creation was contracted out to a now-defunct company two years ago).
As noted above, I can batch process (Acrobat XI Action Wizard) all Type A files, but that still leaves me with ~600 Type B files, for which I am clueless.
Any ideas?

This can be done with pdfimages (poppler utils):
http://cgit.freedesktop.org/poppler/poppler/tree/utils
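Since pdfimages is a command-line tool, the ~600 Type B files can be handled in a simple loop. Below is a minimal sketch in Python, assuming poppler's pdfimages is installed and on the PATH; the folder names are placeholders.

# Batch-run pdfimages over a folder of PDFs.
# Assumes poppler-utils is installed; "catalog_pdfs" and "extracted_images" are placeholder folders.
import subprocess
from pathlib import Path

PDF_DIR = Path("catalog_pdfs")
OUT_DIR = Path("extracted_images")
OUT_DIR.mkdir(exist_ok=True)

for pdf in sorted(PDF_DIR.glob("*.pdf")):
    prefix = OUT_DIR / pdf.stem            # images are written as <prefix>-NNN.<ext>
    # -j keeps JPEG-encoded images as .jpg instead of converting them to PNM
    subprocess.run(["pdfimages", "-j", str(pdf), str(prefix)], check=True)

Whether this pulls out the design image from the Type B files depends on how those PDFs embed their images, so it is worth testing on a handful of files first.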

Related

Architectural design clarification

I built an API in Node.js + Express that allows React clients to upload CSV files (maximum size at most 1 GB) to the server.
I also wrote another API which, given a filename and an array of row numbers as input, selects the rows corresponding to those row numbers from the previously stored file and writes them to another result file (writeStream).
Then the resultant file is piped back to the client (all via streaming).
Currently, as you can see, I am using files (basically Node.js read and write streams) to manage this asynchronously.
But I have faced serious latency (only 2 cores are used) and some memory leaks (900 MB consumption) when I have 15 requests, each supplying about 600 rows to retrieve from files of approximately 150 MB.
I also have planned an alternate design.
Basically, I will store the entire file as a SQL table with row numbers as the primary indexed key.
I will convert the user-supplied array of row numbers to another table using SQL unnest and then join these two tables to get the rows needed.
Then I will supply back the resultant table as a csv file to the client.
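Roughly, the lookup I have in mind looks like the sketch below (assuming PostgreSQL and psycopg2; the table, column, and connection details are placeholders):

# Sketch of the planned unnest-and-join lookup, assuming PostgreSQL + psycopg2.
# Table/column names (csv_rows, file_id, row_number, line) are placeholders.
import psycopg2

def fetch_rows(conn, file_id, row_numbers, out_path):
    sql = """
        SELECT r.row_number, r.line
        FROM csv_rows AS r
        JOIN unnest(%s::int[]) AS wanted(n) ON r.row_number = wanted.n
        WHERE r.file_id = %s
        ORDER BY r.row_number
    """
    with conn.cursor() as cur, open(out_path, "w") as out:
        cur.execute(sql, (row_numbers, file_id))
        for _, line in cur:
            out.write(line + "\n")   # each stored line is assumed to be a raw CSV row

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=catalog")   # placeholder connection string
    fetch_rows(conn, file_id=1, row_numbers=[3, 17, 42], out_path="result.csv")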
Would this architecture be better than the previous architecture?
Any suggestions from devs are highly appreciated.
Thanks.
Use the client to do all the heavy lifting by using the XLSX package for any manipulation of content. Then have an API to save information about the transaction. This will remove the upload to and download from the server and help you provide a better experience.

Google bigquery export big table to multiple objects in Google Cloud storage

I have two BigQuery tables, both bigger than 1 GB.
To export them to storage,
https://googlecloudplatform.github.io/google-cloud-php/#/docs/google-cloud/v0.39.2/bigquery/table?method=export
$destinationObject = $storage->bucket('myBucket')->object('tableOutput_*');
$job = $table->export($destinationObject);
I used a wildcard.
The strange thing is that one BigQuery table is exported to 60 files, each of them 3-4 MB in size.
The other table is exported to 3 files, each of them close to 1 GB (about 900 MB).
The code is the same. The only difference is that in the case of the table exported to 3 files, I put them into a subfolder.
The one exported to 60 files is one level above the subfolder.
My question is: how does BigQuery decide whether a table will be broken into dozens of smaller files or just a few big files (as long as each file is less than 1 GB)?
Thanks!
BigQuery makes no guarantees on the sizes of the exported files, and there is currently no way to adjust this.
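For reference, the equivalent wildcard export with the Python BigQuery client looks roughly like the sketch below (the table reference is a placeholder), and the sharding behaviour is the same: when a wildcard URI is used, BigQuery itself picks the number and size of the output files.

# Sketch of the same wildcard export using google-cloud-bigquery (Python).
# "my_project.my_dataset.my_table" is a placeholder table reference.
from google.cloud import bigquery

client = bigquery.Client()
destination_uri = "gs://myBucket/tableOutput_*"   # wildcard: BigQuery decides the shard count and sizes

extract_job = client.extract_table("my_project.my_dataset.my_table", destination_uri)
extract_job.result()   # wait for the export job to finish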

customizing output from database and formatting it

Say you have an average-looking database, and you want to generate a variety of text files, each with its own specific formatting - so the files may have rudimentary tables and spacing. So you'd be taking the data from the database, transforming it into a specified format (while doing some basic logic) and saving it as a text file (you can store it in XML as an intermediate step).
So if you had to create 10 of these unique files, what would be the ideal approach to creating them? I suppose you could create a class for each type of transformation, but then you'd need quite a few classes, and what if you needed to create another 10 of these files a year down the road?
What do you think is a good approach to this problem: being able to maintain the customizability of the output files, yet not creating a mess of code and maintenance effort?
Here is what I would do if I were to come up with a general approach to this vague question. I would write three pieces of code, independent of each other:
a) A query processor which can run a query on a given database and output results in a well-known xml format.
b) An XSL stylesheet which can interpret the well-known xml format in (a) and transform it to the desired format.
c) An XML-to-Text transformer which can read the files in (a) and (b) and put out the result.
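A minimal sketch of how (a)-(c) could fit together in Python, assuming sqlite3 for the query step and lxml for the XSLT step; the query, element names, and file names are placeholders:

# Sketch of the three-step pipeline: (a) query -> XML, then (b)+(c) XSLT -> text.
# Assumes sqlite3 and lxml; query, element names and file names are placeholders.
import sqlite3
from lxml import etree

def query_to_xml(db_path, sql, xml_path):
    # (a) Run a query and dump the result set in a simple, well-known XML format.
    conn = sqlite3.connect(db_path)
    cur = conn.execute(sql)
    columns = [d[0] for d in cur.description]
    root = etree.Element("resultset")
    for values in cur:
        row = etree.SubElement(root, "row")
        for name, value in zip(columns, values):
            etree.SubElement(row, name).text = "" if value is None else str(value)
    etree.ElementTree(root).write(xml_path, encoding="utf-8", xml_declaration=True)

def xml_to_text(xml_path, xsl_path, out_path):
    # (b)+(c) Apply an XSL stylesheet (with <xsl:output method="text"/>) to the XML.
    transform = etree.XSLT(etree.parse(xsl_path))
    result = transform(etree.parse(xml_path))
    with open(out_path, "w") as out:
        out.write(str(result))

if __name__ == "__main__":
    query_to_xml("catalog.db", "SELECT * FROM orders", "orders.xml")        # placeholders
    xml_to_text("orders.xml", "orders_report.xsl", "orders_report.txt")

With this split, adding an eleventh output file mostly means writing one more stylesheet (and possibly one more query), rather than another class.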

Table Detection Algorithms

Context
I have a bunch of PDF files. Some of them are scanned (i.e. images). They consist of text + pictures + tables.
I want to turn the tables into CSV files.
Current Plan:
1) Run Tesseract OCR to get text of all the documents.
2) ??? Run some type of Table Detection Algorithm ???
3) Extract the rows / columns / cells, and the text in them.
Question:
Is there some standard "Table Extraction Algorithm" to use?
Thanks!
ABBYY FineReader includes table detection and will be the easiest approach. It can scan, import PDFs, TIFFs, etc. You will also be able to manually adjust the tables and columns when the auto-detection fails.
www.abbyy.com - You should be able to download a trial version, and you will also find the OCR results are much more accurate than Tesseract's, which will also save you a lot of time.
Trying to write something yourself will be hit and miss, as there are too many different types of tables to cope with, i.e. with lines, without lines, shaded, multiple lines, different alignments, headers, footers, etc.
Good luck.
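For completeness, if the Tesseract route from the question is attempted anyway, steps 2-3 might start out as something like the naive sketch below: word bounding boxes from pytesseract, rows taken from Tesseract's line numbers, columns guessed from horizontal gaps. The file names and the gap threshold are placeholders, and this is exactly the kind of fragile heuristic the answer above warns about.

# Naive sketch of steps 2-3: turn Tesseract word boxes into rough CSV rows.
# Assumes pytesseract + Pillow; "page.png" and the 60 px gap are placeholders.
import csv
from PIL import Image
import pytesseract
from pytesseract import Output

data = pytesseract.image_to_data(Image.open("page.png"), output_type=Output.DICT)

# Group words into text lines using Tesseract's own block/paragraph/line ids.
lines = {}
for i, word in enumerate(data["text"]):
    if not word.strip():
        continue
    key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
    lines.setdefault(key, []).append((data["left"][i], data["width"][i], word))

with open("page.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for key in sorted(lines):
        words = sorted(lines[key])
        # Very crude column detection: start a new cell after a large horizontal gap.
        cells, current = [], [words[0][2]]
        for (prev_left, prev_width, _), (left, _, word) in zip(words, words[1:]):
            if left - (prev_left + prev_width) > 60:   # gap threshold in pixels
                cells.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        cells.append(" ".join(current))
        writer.writerow(cells)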

Hadoop Input Files

Is there a difference between having, say, n files with 1 line each in the input folder and having 1 file with n lines in the input folder when running Hadoop?
If there are n files, does the InputFormat just see them all as 1 continuous file?
There's a big difference. It's frequently referred to as "the small files problem", and it has to do with the fact that Hadoop expects to split giant inputs into smaller tasks, but not to collect small inputs into larger tasks.
Take a look at this blog post from Cloudera:
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/
If you can avoid creating lots of files, do so. Concatenate when possible. Large splittable files are MUCH better for Hadoop.
I once ran Pig on the Netflix dataset. It took hours to process just a few gigs. I then concatenated the input files (I think it was a file per movie, or a file per user) into a single file and had my result in minutes.
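As a rough illustration of the "concatenate when possible" advice, here is a small sketch that merges many tiny input files into one big file before the job runs (the paths are placeholders):

# Sketch: merge many small input files into one large file before the Hadoop job.
# "input_parts" and "merged_input.txt" are placeholder paths.
from pathlib import Path

SMALL_FILES_DIR = Path("input_parts")     # e.g. one small file per movie or per user
MERGED_FILE = Path("merged_input.txt")

with MERGED_FILE.open("w") as out:
    for part in sorted(SMALL_FILES_DIR.glob("*")):
        if not part.is_file():
            continue
        text = part.read_text()
        out.write(text if text.endswith("\n") else text + "\n")  # keep records line-separated

The merged file can then be copied into HDFS with hadoop fs -put, where Hadoop can split it into sensibly sized tasks.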