Convert PDF to TIFF Format - vb.net

I am writting a VB.Net application. I need to be able to convert either a Word or PDF file to TIF format.
Free would be nice but I will also accept low cost.
I would like sample code if possible, VB is preferable but I also know c#

It's very simple with imagemagick (you do have to download ghostscript, too.). You just need to use VB to run it as a process.
Dim imgmgk As New Process()
With imgmgk.StartInfo
.FileName = v_locationOfImageMagickConvert.exe
.UseShellExecute = False
.CreateNoWindow = True
.RedirectStandardOutput = True
.RedirectStandardError = True
.RedirectStandardInput = False
.Arguments = " -units PixelsPerInch " & v_pdf_filename & " -depth 16 -flatten +matte –monochrome –density 288 -compress ZIP " & v_tiff_filename
End With
imgmgk.Start()
Dim output As String = imgmgk.StandardOutput.ReadToEnd()
Dim errorMsg As String = imgmgk.StandardError.ReadToEnd()
imgmgk.WaitForExit()
imgmgk.Close()
The arguments are varied - use imagemagick docs to see what they are. You can do something as simple as just pass the pdf file name and the tiff file name for a simple conversion.

You can do this with Atalasoft DotImage (commerial product - obligatory disclaimer: I work for Atalasoft and write a lot of the PDF code):
// this code extracts images from the PDF - there may be multiple images per page
public void PdfToTiff(Stream pdf, Stream tiff)
{
TiffEncoder encoder = new TiffEncoder(tiff);
PdfImageSource images = new PdfImageSource(pdf);
encoder.Save(tiff, images, null);
}
// this code renders each page
public void PdfToTiff(string pathToPdf, Stream tiff)
{
TiffEncoder encoder = new TiffEncoder(tiff);
FileSystemImageSource images = new FileSystemImageSource(pathToPdf, true);
encoder.Save(tiff, images, null);
}
The latter example is probably what you want. It works on a path because FileSystemImageSource takes advantage of code to operate on the file system with wildcards. It's overkill for the task, really. If you wanted to do it without you would have this:
public void PdfToTiff(Stream pdf, Stream tiff)
{
TiffEncoder encoder = new TiffEncoder();
encoder.Append = true;
PdfDecoder decoder = new PdfDecoder();
int pageCount = decoder.GetFrameCount(pdf);
for (int i=0; i < pageCount; i++) {
using (AtalaImage image = decoder.Read(pdf, i, null)) {
encoder.Save(tiff, image, null);
}
}
}

There are many products I found that can do this. Some have a cost while some are free.
I used Black Ice http://www.blackice.com/Printer%20Drivers/Tiff%20Printer%20Drivers.htm. This is afordable at about $40. I didn't buy this one though.
I ended up using a free one called MyMorph.

Related

Automating Photoshop to edit en rename +600 files with the name of the folder

I have +600 product images on my mac already cut out and catalogued in their own folder. They are all PSD's and I need a script that will do the following.
Grab the name of the folder
Grab all the PSD's in said folder
Combine them in one big PSD in the right order (the filenames are saved sequentially as 1843, 1845, 1846 so they need to open in that order)
save that PSD
save the separate layers as PNG with the name from the folder + _1, _2, _3
I have previous experience in Bash (former Linux user) and tried for hours in Automator but to no success.
Welcome to Stack Overflow. The quick answer is yes this is possible to do via scripting. I might even suggest breaking down into two scripts, one to grab and save the PSDs and the second to save out the layers.
It's not very clear about "combining" the PSDs or about "separate layers, only I don't know if they are different canvas sizes, where you want each PSD to be positioned (x, y offsets & layering) Remember none of use have your files infront of us to refer from.
In short, if you write out pseudo code of what is it you expect your code to do it makes it easier to answer your question.
Here's a few code snippets to get you started:
This will open a folder and retrieve alls the PSDs as an array:
// get all the files to process
var folderIn = Folder.selectDialog("Please select folder to process");
if (folderIn != null)
{
var tempFileList = folderIn.getFiles();
}
var fileList = new Array(); // real list to hold images, not folders
for (var i = 0; i < tempFileList.length; i++)
{
// get the psd extension
var ext = tempFileList[i].toString();
ext = ext.substring(ext.lastIndexOf("."), ext.length);
if (tempFileList[i] instanceof File)
{
if (ext == ".psd") fileList.push (tempFileList[i]);
// else (alert("Ignoring " + tempFileList[i]))
}
}
alert("Files:\n" + fileList.length);
You can save a png with this
function save_png(afilePath)
{
// Save as a png
var pngFile = new File(afilePath);
pngSaveOptions = new PNGSaveOptions();
pngSaveOptions.embedColorProfile = true;
pngSaveOptions.formatOptions = FormatOptions.STANDARDBASELINE;
pngSaveOptions.matte = MatteType.NONE; pngSaveOptions.quality = 1;
activeDocument.saveAs(pngFile, pngSaveOptions, false, Extension.LOWERCASE);
}
To open a psd just use
app.open(fileRef);
To save it
function savePSD(afilePath)
{
// save out psd
var psdFile = new File(afilePath);
psdSaveOptions = new PhotoshopSaveOptions();
psdSaveOptions.embedColorProfile = true;
psdSaveOptions.alphaChannels = true;
activeDocument.saveAs(psdFile, psdSaveOptions, false, Extension.LOWERCASE);
}

Split PDF into separate files based on text

I have a large single pdf document which consists of multiple records. Each record usually takes one page however some use 2 pages. A record starts with a defined text, always the same.
My goal is to split this pdf into separate pdfs and the split should happen always before the "header text" is found.
Note: I am looking for a tool or library using java or python. Must be free and available on Win 7.
Any ideas? AFAIK imagemagick won't work for this. May itext do this? I never used and it's
pretty complex so would need some hints.
EDIT:
Marked Answer led me to solution. For completeness here my exact implementation:
public void splitByRegex(String filePath, String regex,
String destinationDirectory, boolean removeBlankPages) throws IOException,
DocumentException {
logger.entry(filePath, regex, destinationDirectory);
destinationDirectory = destinationDirectory == null ? "" : destinationDirectory;
PdfReader reader = null;
Document document = null;
PdfCopy copy = null;
Pattern pattern = Pattern.compile(regex);
try {
reader = new PdfReader(filePath);
final String RESULT = destinationDirectory + "/record%d.pdf";
// loop over all the pages in the original PDF
int n = reader.getNumberOfPages();
for (int i = 1; i < n; i++) {
final String text = PdfTextExtractor.getTextFromPage(reader, i);
if (pattern.matcher(text).find()) {
if (document != null && document.isOpen()) {
logger.debug("Match found. Closing previous Document..");
document.close();
}
String fileName = String.format(RESULT, i);
logger.debug("Match found. Creating new Document " + fileName + "...");
document = new Document();
copy = new PdfCopy(document,
new FileOutputStream(fileName));
document.open();
logger.debug("Adding page to Document...");
copy.addPage(copy.getImportedPage(reader, i));
} else if (document != null && document.isOpen()) {
logger.debug("Found Open Document. Adding additonal page to Document...");
if (removeBlankPages && !isBlankPage(reader, i)){
copy.addPage(copy.getImportedPage(reader, i));
}
}
}
logger.exit();
} finally {
if (document != null && document.isOpen()) {
document.close();
}
if (reader != null) {
reader.close();
}
}
}
private boolean isBlankPage(PdfReader reader, int pageNumber)
throws IOException {
// see http://itext-general.2136553.n4.nabble.com/Detecting-blank-pages-td2144877.html
PdfDictionary pageDict = reader.getPageN(pageNumber);
// We need to examine the resource dictionary for /Font or
// /XObject keys. If either are present, they're almost
// certainly actually used on the page -> not blank.
PdfDictionary resDict = (PdfDictionary) pageDict.get(PdfName.RESOURCES);
if (resDict != null) {
return resDict.get(PdfName.FONT) == null
&& resDict.get(PdfName.XOBJECT) == null;
} else {
return true;
}
}
You can create a tool for your requirements using iText.
Whenever you are looking for code samples concerning (current versions of) the iText library, you should consult iText in Action — 2nd Edition the code samples from which are online and searchable by keyword from here.
In your case the relevant samples are Burst.java and ExtractPageContentSorted2.java.
Burst.java shows how to split one PDF in multiple smaller PDFs. The central code:
PdfReader reader = new PdfReader("allrecords.pdf");
final String RESULT = "record%d.pdf";
// We'll create as many new PDFs as there are pages
Document document;
PdfCopy copy;
// loop over all the pages in the original PDF
int n = reader.getNumberOfPages();
for (int i = 0; i < n; ) {
// step 1
document = new Document();
// step 2
copy = new PdfCopy(document,
new FileOutputStream(String.format(RESULT, ++i)));
// step 3
document.open();
// step 4
copy.addPage(copy.getImportedPage(reader, i));
// step 5
document.close();
}
reader.close();
This sample splits a PDF in single-page PDFs. In your case you need to split by different criteria. But that only means that in the loop you sometimes have to add more than one imported page (and thus decouple loop index and page numbers to import).
To recognize on which pages a new dataset starts, be inspired by ExtractPageContentSorted2.java. This sample shows how to parse the text content of a page to a string. The central code:
PdfReader reader = new PdfReader("allrecords.pdf");
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
System.out.println("\nPage " + i);
System.out.println(PdfTextExtractor.getTextFromPage(reader, i));
}
reader.close();
Simply search for the record start text: If the text from page contains it, a new record starts there.
Apache PDFBox has a PDFSplit utility that you can run from the command-line.
If you like Python, there's a nice library: PyPDF2. The library is pure python2, BSD-like license.
Sample code:
from PyPDF2 import PdfFileWriter, PdfFileReader
input1 = PdfFileReader(open("C:\\Users\\Jarek\\Documents\\x.pdf", "rb"))
# analyze pdf data
print input1.getDocumentInfo()
print input1.getNumPages()
text = input1.getPage(0).extractText()
print text.encode("windows-1250", errors='backslashreplacee')
# create output document
output = PdfFileWriter()
output.addPage(input1.getPage(0))
fout = open("c:\\temp\\1\\y.pdf", "wb")
output.write(fout)
fout.close()
For non coders PDF Content Split is probably the easiest way without reinventing the wheel and has an easy to use interface: http://www.traction-software.co.uk/pdfcontentsplitsa/index.html
hope that helps.

How can I programmatically export pdf annotations (such as a formula encircled in a rectangle) as images?

I'm wondering if it is possible to export some annotations as images. I already know how to export highlighted text as text, but this doesn't work well with equations. If equations were denoted by an annotation, such as a box encircling them, could I convert them all at once to images using a pdf snapshot tool?
It is easy to do each one individually by hand with the pdf snapshot tool. Do any pdf libraries or programs have any tools that let you make image snapshots programmatically, not of whole pages, but of individual equations that are marked somehow with an annotation?
For the purposes of the question, they don't necessarily have to be free programs.
Thanks.
I came up with a full ruby based solution here, using the ruby gems pdf-reader and rmagick (along with an installation of imagemagick).
require 'pdf-reader'
require 'RMagick'
pdf_file_name='statmech' #without extension
doc = PDF::Reader.new(File.expand_path(pdf_file_name+".pdf"))
$objects = doc.objects
def convertpagetojpgandcrop(filename,pagenum,croprect,imgname)
pagename = filename+".pdf[#{pagenum-1}]"
#higher density used for quality purposes (otherwise fuzzy)
pageim = Magick::Image.read(pagename){ |opts| opts.density = 216}.first
#factors of 3 needed because higher density TODO: generalize to pdf density!=72
#SouthWestGravity puts coordinate origin in bottom left to match pdf coords
eqim =pageim.crop(Magick::SouthWestGravity,...
3*croprect[0],3*croprect[1],3*croprect[2]-3*croprect[0],3*croprect[3]-3*croprect[1])
eqim.write(imgname)
end
def is_square?(object)
object[:Type] == :Annot && object[:Subtype] == :Square
end
def is_highlight?(object)
object[:Type] == :Annot && object[:Subtype] == :Highlight
end
def annots_on_page(page)
references = (page.attributes[:Annots] || [])
lookup_all(references).flatten
end
def lookup_all(refs)
refs = *refs
refs.map { |ref| lookup(ref) }
end
def lookup(ref)
object = $objects[ref]
return object unless object.is_a?(Array)
lookup_all(object)
end
def highlights_on_page(page)
all_annots = annots_on_page(page)
all_annots.select { |a| is_highlight?(a) }
end
def squares_on_page(page)
all_annots = annots_on_page(page)
all_annots.select { |a| is_square?(a) }
end
def restricted_annots_on_page(page)
all_annots = annots_on_page(page)
all_annots.select { |a| is_square?(a)||is_highlight?(a) }
end
#This block exports a jpg for each 'square' annotation in pdf
doc.pages.each do |page|
eqnum=0
all_squares = squares_on_page(page)
all_squares.each do |annot|
eqnum = eqnum+1
puts "#{annot[:Rect]}"
convertpagetojpgandcrop(pdf_file_name,page.number,annot[:Rect],...
pdf_file_name+"page#{page.number}eq#{eqnum}.jpg")
end
end
#This block gives the text of the highlights and wikilinks to the images
#TODO:(needs to go in text file)
doc.pages.each do |page|
eqnum = 0
annots = restricted_annots_on_page(page)
if annots.length>0
puts "# Page #{page.number}"
end
annots.each do |annot|
if is_square?(annot)
eqnum = eqnum+1
puts "{{wiki:#{pdf_file_name}page#{page.number}eq#{eqnum}.jpg}}"
else
puts "#{annot[:Contents]}"
end
end
end
This code expands upon example code for the pdf-reader and rmagick gems found online. Few of the lines are original.
This code sample uses Amyuni PDF Creator .Net, it will export the page with only one annotation visible at a time:
using System.IO;
using Amyuni.PDFCreator;
using System.Collections;
//open a pdf document
FileStream testfile = new FileStream(filename, FileMode.Open, FileAccess.Read, FileShare.Read);
IacDocument document = new IacDocument(null);
document.SetLicenseKey("your license", "your code");
document.Open(testfile, "");
document.CurrentPageNumber = 1;
IacAttribute attribute = document.CurrentPage.AttributeByName("Objects");
// listobj is an array list of objects
ArrayList listobj = (System.Collections.ArrayList)attribute.Value;
ArrayList annotations = new ArrayList();
foreach (Amyuni.PDFCreator.IacObject iacObj in listobj)
{
if ((bool)iacObj.AttributeByName("Annotation").Value)
{
annotations.Add(iacObj);
// Put the annotation out of sight
iacObj.Coordinates = Rectangle.FromLTRB(
-iacObj.Coordinates.Left,
-iacObj.Coordinates.Top,
-iacObj.Coordinates.Right,
-iacObj.Coordinates.Bottom);
}
else
iacObj.Delete(false);
}
ArrayList images = new ArrayList();
int i = 0;
foreach (Amyuni.PDFCreator.IacObject iacObj in annotations)
{
// Back on sight
iacObj.Coordinates = Rectangle.FromLTRB(
-iacObj.Coordinates.Left,
-iacObj.Coordinates.Top,
-iacObj.Coordinates.Right,
-iacObj.Coordinates.Bottom);
//Draw the page
Bitmap bmp = new Bitmap(1000, 1000);
Graphics gr = Graphics.FromImage(bmp);
IntPtr hdc = gr.GetHdc();
document.DrawCurrentPage(hdc.ToInt32(), true);
gr.ReleaseHdc();
images.Add(bmp);
bmp.Save("c:\\temp\\image" + i + ".pdf");
iacObj.Delete(false); // object not needed anymore
i++;
}
If needed, you can extract the part of the resulting image that corresponds to the annotation by using the Coordinates property of the annotation object.
If you want to extract all objects from a rectangular area (annotations or otherwise) you can replace the loop that collects annotations with a call to the method IacDocument.GetObjectsInRectangle
Usual disclaimer applies

Splitted PDF size is larger when using the iTextSharp

Dear Team,
In my application, i want to split the pdf using itextsharp. If i upload PDF contains 10 pages with file size 10 mb for split, After splitting the combine file size of each pdfs will result into above 20mb file size. If this possible to reduce the file size(each pdf).
Please help me to solve the issue.
Thanks in advance
This may have to do with the resources in the file. If the original document uses an embedded font on each, for example, then there will only be one instance of the font in the original file. When you split it, each file will be required have that font as well. The total overhead will be n pages × sizeof(each font). Elements that will cause this kind of bloat include fonts, images, color profiles, document templates (aka forms), XMP, etc.
And while it doesn't help you in your immediate problem, if you use the PDF tools in Atalasoft dotImage, your task becomes a 1 liner:
PdfDocument.Separate(userpassword, ownerpassword, origPath, destFolder, "Separated Page{0}.pdf", true);
which will take the PDF in orig file and create new pages in the dest folder each named with the pattern. The bool at the end is to overwrite an existing file.
Disclaimer: I work for Atalasoft and wrote the PDF library (also used to work at Adobe on Acrobat versions 1, 2, 3, and 4).
Hi Guys i modified the above code to split a PDF file into multiple Pdf file.
iTextSharp.text.pdf.PdfReader reader = null;
int currentPage = 1;
int pageCount = 0;
//string filepath_New = filepath + "\\PDFDestination\\";
System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding();
//byte[] arrayofPassword = encoding.GetBytes(ExistingFilePassword);
reader = new iTextSharp.text.pdf.PdfReader(filepath);
reader.RemoveUnusedObjects();
pageCount = reader.NumberOfPages;
string ext = System.IO.Path.GetExtension(filepath);
for (int i = 1; i <= pageCount; i++)
{
iTextSharp.text.pdf.PdfReader reader1 = new iTextSharp.text.pdf.PdfReader(filepath);
string outfile = filepath.Replace((System.IO.Path.GetFileName(filepath)), (System.IO.Path.GetFileName(filepath).Replace(".pdf", "") + "_" + i.ToString()) + ext);
reader1.RemoveUnusedObjects();
iTextSharp.text.Document doc = new iTextSharp.text.Document(reader.GetPageSizeWithRotation(currentPage));
iTextSharp.text.pdf.PdfCopy pdfCpy = new iTextSharp.text.pdf.PdfCopy(doc, new System.IO.FileStream(outfile, System.IO.FileMode.Create));
doc.Open();
for (int j = 1; j <= 1; j++)
{
iTextSharp.text.pdf.PdfImportedPage page = pdfCpy.GetImportedPage(reader1, currentPage);
pdfCpy.SetFullCompression();
pdfCpy.AddPage(page);
currentPage += 1;
}
doc.Close();
pdfCpy.Close();
reader1.Close();
reader.Close();
}
Have you tried setting the compression on the writer?
Document doc = new Document();
using (MemoryStream ms = new MemoryStream())
{
PdfWriter writer = PdfWriter.GetInstance(doc, ms);
writer.SetFullCompression();
}

How to automate Photoshop?

I am trying to automate the process of scanning/cropping photos in Photoshop. I need to scan 3 photos at a time, then use Photoshop's Crop and Straighten Photos command, which creates 3 separate images. After that I'd like to save each of the newly created images as a PNG.
I looked at the JSX scripts and they seem to a lot of promise. Is what I described possible to automate in Photoshop using JavaScript or VBScript or whatever?
I just found this script just did the work for me! It automatically crop & straighten the photo and save each result to directory you specified.
http://www.tranberry.com/photoshop/photoshop_scripting/PS4GeeksOrlando/IntroScripts/cropAndStraightenBatch.jsx
Save it to local then run it in the PS=>File=>Command=>Browse
P.S I found in the comment it said the script can be executed directly by double clicking from Mac Finder or Windows Explorer.
Backup gist for the script here
I actually got the answer on the Photoshop forums over at adobe. It turns out that Photoshop CS4 is totally scriptable via JavaScript, VBScript and comes with a really kick-ass Developer IDE, that has everything you'd expect (debugger, watch window, color coding and more). I was totally impressed.
Following is an extract for reference:
you can run the following script that will create a new folder off the existing one and batch split all the files naming them existingFileName#001.png and put them in the new folder (edited)
#target Photoshop
app.bringToFront;
var inFolder = Folder.selectDialog("Please select folder to process");
if(inFolder != null){
var fileList = inFolder.getFiles(/\.(jpg|tif|psd|)$/i);
var outfolder = new Folder(decodeURI(inFolder) + "/Edited");
if (outfolder.exists == false) outfolder.create();
for(var a = 0 ;a < fileList.length; a++){
if(fileList[a] instanceof File){
var doc= open(fileList[a]);
doc.flatten();
var docname = fileList[a].name.slice(0,-4);
CropStraighten();
doc.close(SaveOptions.DONOTSAVECHANGES);
var count = 1;
while(app.documents.length){
var saveFile = new File(decodeURI(outfolder) + "/" + docname +"#"+ zeroPad(count,3) + ".png");
SavePNG(saveFile);
activeDocument.close(SaveOptions.DONOTSAVECHANGES) ;
count++;
}
}
}
};
function CropStraighten() {
function cTID(s) { return app.charIDToTypeID(s); };
function sTID(s) { return app.stringIDToTypeID(s); };
executeAction( sTID('CropPhotosAuto0001'), undefined, DialogModes.NO );
};
function SavePNG(saveFile){
pngSaveOptions = new PNGSaveOptions();
pngSaveOptions.embedColorProfile = true;
pngSaveOptions.formatOptions = FormatOptions.STANDARDBASELINE;
pngSaveOptions.matte = MatteType.NONE;
pngSaveOptions.quality = 1;
pngSaveOptions.PNG8 = false; //24 bit PNG
pngSaveOptions.transparency = true;
activeDocument.saveAs(saveFile, pngSaveOptions, true, Extension.LOWERCASE);
}
function zeroPad(n, s) {
n = n.toString();
while (n.length < s) n = '0' + n;
return n;
};
Visit here for complete post.
Have you tried using Photoshop Actions? I don't now about the scanning part, but the rest can all be done by actions quite easily.