I have a list of image files (jpg, png, etc...) that the dot before the suffix was replaced with # this makes the file unrecognizable to the Android os. In dos, it would be easy. How can I do it in js easily because I've never used js before?
If you're using Node.js, you can use the fs module.
I haven't been able to make time to test a solution, but a naive implementation would look probably like this:
// Import the fs module
const fs = require('fs');
// Read all files in directory
var files = fs.readDirSync('/path/to/images/folder/');
// Loop through the files and rename each file
files.forEach(path => fs.rename(path, path.replace('#', '.')));
I have found somebody talks libhdfs does not support read/write gzip file at about 2010.
I download the newest hadoop-2.0.4 and read hdfs.h. There is also no compressing arguments.
Now I am wondering if it supports reading compressed file now?
If it not, how can I make a patch for the libhdfs and make it work?
Thanks in advance.
Best Regards
Haiti
As I have known, libhdfs only uses JNI to access the HDFS. If you are familiar with HDFS Java API, libhdfs is just a wrapper of org.apache.hadoop.fs.FSDataInputStream. So it can not read compressed files directly now.
I guess that you want to access the file in the HDFS by C/C++. If so, you can use libhdfs to read the raw file, and use the zip/unzip C/C++ library to decompress the content. The compressed file format is the same. For example, if the files are compressed by lzo, then you can use lzo library to decompress them.
But if the file is a sequence file, then you may need to use the JNI to access them as they are Hadoop special file. I have seen Impala does the similar work before. But it's not out-of-the-box.
Thanks for the reply. Use libhdfs to read raw file, then use zlib to inflate the content. This can work. The file used gzip. I used the codes like these.
z_stream gzip_stream;
gzip_stream.zalloc = (alloc_func)0;
gzip_stream.zfree = (free_func)0;
gzip_stream.opaque = (voidpf)0;
gzip_stream.next_in = buf;
gzip_stream.avail_in = readlen;
gzip_stream.next_out = buf1;
gzip_stream.avail_out = 4096 * 4096;
ret = inflateInit2(&gzip_stream, 16 + MAX_WBITS);
if (ret != Z_OK) {
printf("deflate init error\n");
}
ret = inflate(&gzip_stream, Z_NO_FLUSH);
ret = inflateEnd(&gzip_stream);
printf("the buf \n%s\n", buf1);
return buf;
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
Improve this question
This has been asked before, but I don't really know if the answers help me. Here is my problem: I got a bunch of (10,000 or so) pdf files. Some were text files that were saved using adobe's print feature (so their text is perfect and I don't want to risk screwing them up). And some were scanned images (so they don't have any text and I will have to settle for OCR). The files are in the same directory and I can't tell which is which. Ultimately I want to turn them into .txt files and then do string processing on them. So I want the most accurate OCR possible.
It seems like people have recommended:
adobe pdf (I don't have a licensed copy of this so ... plus if ABBYY finereader or something is better, why pay for it if I won't use it)
ocropus (I can't figure out how to use this thing),
Tesseract (which seems like it was great in 1995 but I'm not sure if there's something more accurate plus it doesn't do pdfs natively and I've have to convert to TIFF. that raises its own problem as I don't have a licensed copy of acrobat so I don't know how I'd convert 10,000 files to tiff. plus i don't want 10,000 30 page documents turned into 30,000 individual tiff images).
wowocr
pdftextstream (that was from 2009)
ABBYY FineReader (apparently its' $$$, but I will spend $600 to get this done if this thing is significantly better, i.e. has more accurate ocr).
Also I am a n00b to programming so if it's going to take like weeks to learn how to do something, I would rather pay the $$$. Thx for input/experiences.
BTW, I'm running Linux Mint 11 64 bit and/or windows 7 64 bit.
Here are the other threads:
Batch OCRing PDFs that haven't already been OCR'd
Open source OCR
PDF Text Extraction Approach Using OCR
https://superuser.com/questions/107678/batch-ocr-for-many-pdf-files-not-already-ocred
Just to put some of your misconceptions straight...
" I don't have a licensed copy of acrobat so I don't know how I'd convert 10,000 files to tiff."
You can convert PDFs to TIFF with the help of Free (as in liberty) and free (as in beer) Ghostscript. Your choice if you want to do it on Linux Mint or on Windows 7. The commandline for Linux is:
gs \
-o input.tif \
-sDEVICE=tiffg4 \
input.pdf
"i don't want 10,000 30 page documents turned into 30,000 individual tiff images"
You can have "multipage" TIFFs easily. Above command does create such TIFFs of the G4 (fax tiff) flavor. Should you even want single-page TIFFs instead, you can modify the command:
gs \
-o input_page_%03d.tif \
-sDEVICE=tiffg4 \
input.pdf
The %03d part of the output filename will automatically translate into a series of 001, 002, 003 etc.
Caveats:
The default resolution for the tiffg4 output device is 204x196 dpi. You probably want a better value. To get 720 dpi you should add -r720x720 to the commandline.
Also, if your Ghostscript installation uses letter as its default media size, you may want to change it. You can use -gXxY to set widthxheight in device points. So to get ISO A4 output page dimensions in landscape you can add a -g8420x5950 parameter.
So the full command which controls these two parameters, to produce 720 dpi output on A4 in portrait orientation, would read:
gs \
-o input.tif \
-sDEVICE=tiffg4 \
-r720x720 \
-g5950x8420 \
input.pdf
Figured I would try to contribute by answering my own question (have written some nice code for myself and could not have done it without help from this board). If you cat the pdf files in unix (well, osx for me), then the pdf files that have text will have the word "Font" in them (as a string, but mixed in with other text) b/c that's how the file tells Adobe what fonts to do display.
The cat command in bash seems to have the same output as reading the file in binary mode in python (using 'rb' mode when opening file instead of 'w' or 'r' or 'a'). So I'm assuming that all pdf files that contain text with have the word "Font" in the binary output and that no image-only files ever will. If that's always true, then this code will make a list of all pdf files in a single directory that have text and a separate list of those that have only images. It saves each list to a separate .txt file, then you can use a command in bash to move the pdf files to the appropriate folder.
Once you have them in their own folders, then you can run your batch ocr solution on just the pdf files in the images_only folder. I haven't gotten that far yet (obviously).
import os, re
#path is the directory with the files, other 2 are the names of the files you will store your lists in
path = 'C:/folder_with_pdfs'
files_with_text = open('files_with_text.txt', 'a')
image_only_files = open('image_only_files.txt', 'a')
#have os make a list of all files in that dir for a loop
filelist = os.listdir(path)
#compile regular expression that matches "Font"
mysearch = re.compile(r'.*Font.*', re.DOTALL)
#loop over all files in the directory, open them in binary ('rb'), search that binary for "Font"
#if they have "Font" they have text, if not they don't
#(pdf does something to understand the Font type and uses this word every time the pdf contains text)
for pdf in filelist:
openable_file = os.path.join(path, pdf)
cat_file = open(openable_file, 'rb')
usable_cat_file = cat_file.read()
#print usable_cat_file
if mysearch.match(usable_cat_file):
files_with_text.write(pdf + '\n')
else:
image_only_files.write(pdf + '\n')
To move the files, I entered this command in bash shell:
cat files_with_text.txt | while read i; do mv $i Volumes/hard_drive_name/new_destination_directory_name; done
Also, I didn't re-run the python code above, I just hand-edited the thing, so it might be buggy, Idk.
This is an interesting problem. If you are willing to work on Windows in .NET, you can do this with dotImage (disclaimer, I work for Atalasoft and wrote most of the OCR engine code). Let's break the problem down into pieces - the first is iterating over all your PDFs:
string[] candidatePDFs = Directory.GetFiles(sourceDirectory, "*.pdf");
PdfDecoder decoder = new PdfDecoder();
foreach (string path in candidatePDFs) {
using (FileStream stm = new FileStream(path, FileMode.Open)) {
if (decoder.IsValidFormat(stm)) {
ProcessPdf(path, stm);
}
}
}
This gets a list of all files that end in .pdf and if the file is a valid pdf, calls a routine to process it:
public void ProcessPdf(string path, Stream stm)
{
using (Document doc = new Document(stm)) {
int i=0;
foreach (Page p in doc.Pages) {
if (p.SingleImageOnly) {
ProcessWithOcr(path, stm, i);
}
else {
ProcessWithTextExtract(path, stm, i);
}
i++;
}
}
}
This opens the file as a Document object and asks if each page is image only. If so it will OCR the page, else it will text extract:
public void ProcessWithOcr(string path, Stream pdfStm, int page)
{
using (Stream textStream = GetTextStream(path, page)) {
PdfDecoder decoder = new PdfDecoder();
using (AtalaImage image = decoder.Read(pdfStm, page)) {
ImageCollection coll = new ImageCollection();
coll.Add(image);
ImageCollectionImageSource source = new ImageCollectionImageSource(coll);
OcrEngine engine = GetOcrEngine();
engine.Initialize();
engine.Translate(source, "text/plain", textStream);
engine.Shutdown();
}
}
}
what this does is rasterizes the PDF page into an image and puts it into a form that is palatable for engine.Translate. This doesn't strictly need to be done this way - one could get an OcrPage object from the engine from an AtalaImage by calling Recognize, but then it would be up to client code to loop over the structure and write out the text.
You'll note that I've left out GetOcrEngine() - we make available 4 OCR engines for client use: Tesseract, GlyphReader, RecoStar, and Iris. You would select the one that would be best for your needs.
Finally, you would need the code to extract text from the pages that already have perfectly good text on them:
public void ProcessWithTextExtract(string path, Stream pdfStream, int page)
{
using (Stream textStream = GetTextStream(path, page)) {
StreamWriter writer = new StreamWriter(textStream);
using (PdfTextDocument doc = new PdfTextDocument(pdfStream)) {
PdfTextPage page = doc.GetPage(i);
writer.Write(page.GetText(0, page.CharCount));
}
}
}
This extracts the text from the given page and writes it to the output stream.
Finally, you need GetTextStream():
public Stream GetTextStream(string sourcePath, int pageNo)
{
string dir = Path.GetDirectoryName(sourcePath);
string fname = Path.GetFileNameWithoutExtension(sourcePath);
string finalPath = Path.Combine(dir, String.Format("{0}p{1}.txt", fname, pageNo));
return new FileStream(finalPath, FileMode.Create);
}
Will this be a 100% solution? No. Certainly not. You could imagine PDF pages that contain a single image with a box draw around it - this would clearly fail the image only test but return no useful text. Probably, a better approach is to just use the extracted text and if that doesn't return anything, then try an OCR engine. Changing from one approach to the other is a matter of writing a different predicate.
The simplest approach would be to use a single tool such a ABBYY FineReader, Omnipage etc to process the images in one batch without having to sort them out into scanned vs not scanned images. I believe FineReader converts the PDF's to images before performing OCR anyway.
Using an OCR engine will give you features such as automatic deskew, page orientation detection, image thresholding, despeckling etc. These are features you would have to buy an image processng library for and program yourself and it could prove difficult to find an optimal set of parameters for your 10,000 PDF's.
Using the automatic OCR approach will have other side effects depending on the input images and you would find you would get better results if you sorted the images and set optimal parameters for each type of images. For accuracy it would be much better to use a proper PDF text extraction routine to extract the PDF's that have perfect text.
At the end of the day it will come down to time and money versus the quality of the results that you need. At the end of the day, a commercial OCR program will be the quickest and easiest solution. If you have clean text only documents then a cheap OCR program will work as well as an expensive solution. The more complex your documents, the more money you will need to spend to process them.
I would try finding some demo / trial versions of commercial OCR engines and just see how they perform on your different document types before spending too much time and money.
I have written a small wrapper for Abbyy OCR4LINUX CLI engine (IMHO, doesn't cost that much) and Tesseract 3.
The wrapper can batch convert files like:
$ pmocr.sh --batch --target=pdf --skip-txt-pdf /some/directory
The script uses pdffonts to determine if a PDF file has already been OCRed to skip them. Also, the script can work as system service to monitor a directory and launch an OCR action as soon as a file enters the directory.
Script can be found here:
https://github.com/deajan/pmOCR
Hopefully, this helps someone.
I've got a module that has to let users upload files and everything works as long as the files are in the standard array of allowed extensions. I've tried using file_validate_extensions, but this doesn't seem to change anything.
This is the code I'm using to upload now (the docx extension is added to the standard drupal allowed ones, but it doesn't seem to get picked up):
$fid = $form_state['values']['attachment'];
$file = file_load($fid);
if($file != null){
file_validate_extensions($file, "jpg jpeg gif png txt doc xls pdf ppt pps odt ods odp docx");
$file->status = FILE_STATUS_PERMANENT;
file_save($file);
}
I just looked to this Drupal API, and it seems that you can use the function "file_save_upload" (with $validator as an array of valid extension), this get the file in a temporary state. And then, you have to call "file_save" to make it permanent.
How can I convert pdf files from version 1.1 to 1.4 (or higher)?
Actually I need some sort of command line tool for batch converting or some API to be able to convert dynamically severall documents.
Use Ghostscript tool.
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -o output.pdf input.pdf
Pdf 1.1 is forward compatible with pdf 1.4. Everything in pdf 1.1 will work with pdf 1.4 - it's guaranteed by the spec. Let's assume that you've got some justifiable reason why this is not good enough for you (let's assume, for example, that you have a non-spec compliant tool that consumes PDF and explodes on any file version less that 1.4).
We can focus on the main syntactic differences between versions.
All PDF files have a header somewhere in the first 1024 bytes. In most cases, it's the very first line, but that's not guaranteed (I'm looking at you GhostScript!). The header looks like this in PDF 1.1:
%PDF-1.1
in PDF 1.4, it looks like this:
%PDF-1.4
So in theory, all you need is a tool that will look in the first 1024 bytes for a file for "%PDF-1.1" and change it to "%PDF-1.4". You could use sed, perl, etc to do something like that for you. You could write it in C and you would be tempted to do something like this:
#define PDFHEADERSIZE 1024
bool ChangeFileToNewPdfVersion(char *file)
{
char *replacePoint = NULL;
FILE *fp = fopen(file, "rw");
char buf[PDFHEADERSIZE + 1];
buf[PDFHEADERSIZE] = '\0';
if (fread(buf, 1, PDFHEADERSIZE, fp) != PDFHEADERSIZE) { fclose(fp); return false; }
fseek(fp, 0, SEEK_SET);
if ((replacePoint = strstr(buf, "%PDF-1.1")) == NULL) { fclose(fp); return false; }
replacePoint[7] = '4';
if (fwrite(buf, 1, PDFHEADERSIZE, fp) != PDFHEADERSIZE) { fclose(fp); return false; }
fflush(fp);
fclose(fp);
return;
}
which will work in most sane cases. It will not work if the file starts, for example, with 0 bytes, which would serve as null terminators in the block of data.
A better choice (really) would be to cobble up a simple state machine to find %PDF-1. by reading 1 byte at a time until it either finds it or passes 1017 (1024 less the header length), then reads the next byte, if it's a '1', it seeks back a byte and writes a '4'.
The only other thing you would need to worry about is that PDF 1.4 suggests that the document catalog should contain a Version key with the file version. Since this is defined as optional in the spec, you are safe to ignore it.
So this will solve your problem. I do not, however, believe that you should need to do this. Really.
You should take some time to read part of the PDF spec, specifically section I.2 about version numbers and compatibility.
I just had this problem. Trying to submit some PDF's to a finanicial institution. "We only support PDF 1.4 or newer". Apparently our HP scanner creates version 1.3 PDF's.
I opened the PDF file with Notepad++ and changed the 3 to a 4 and saved it. It was that simple.
It's the very first part of the file and it's in plain text.
Another option for a small number of pdf files is to open them in Chrome or other browser then save as PDF or print to PDF. In my case, using Chrome, it saved to a newer pdf version and the bank accepted it