PowerShell script to convert PDF to TIFF with Ghostscript

I have been asked to write a script that automatically converts PDF files to TIFF files so they can be processed further. With a lot of help from Google and this site (I never studied any programming language), I created the code below.
Even though it's working now, it is not quite what I was hoping for, since it creates 13 files every time it runs where it should create only one.
Could someone be kind enough to take a look at the script and tell me where I went wrong?
Thank you in advance!
EDIT:
In this (test) case there's only one PDF in the folder and it's named test.pdf, however the idea is that the script looks through all the PDFs in the given folder, since it's unclear how many PDFs are in the folder at any given time. Let it run as a service in the background(?)
I'll edit the post with the error code/description once I find out how to capture them; I can't keep up with the command-line output.
#Path to your Ghostscript EXE
$tool = 'C:\Program Files\gs\gs9.10\bin\gswin64c.exe'
#Directory containing the PDF files that will be converted
$inputDir = 'C:\test\'
#Output path where converted PDF files will be stored
$outputDirPDF = 'C:\test\oud\'
#Output path where the TIF files will be saved
$outputDir = 'C:\test\TIFF'
$pdfs = get-childitem $inputDir -recurse | where {$_.Extension -match "pdf"}
foreach($pdf in $pdfs)
{
    $tif = $outputDir + $pdf.BaseName + ".tif"
    if(test-path $tif)
    {
        "tif file already exists " + $tif
    }
    else
    {
        'Processing ' + $pdf.Name
        $param = "-sOutputFile=$tif"
        & $tool -q -dNOPAUSE -sDEVICE=tiffg4 $param -r300 $pdf.FullName -c quit
    }
    Move-Item $pdf $outputDirPDF
}

It's working now; apparently I was missing an "EXIT" at the end of the code. It might not be the most beautiful piece of code, but it seems to do the job, so I'm happy with it.
Below is the piece of code that actually works:
#Path to your Ghostscript EXE
$tool = 'C:\Program Files\gs\gs9.10\bin\gswin64c.exe'
#Directory containing the PDF files that will be converted
$inputDir = 'C:\test\'
#Output path where converted PDF files will be stored
$outputDirPDF = 'C:\test\oud\'
#Output path where the TIF files will be saved
$outputDir = 'C:\test\TIFF\'
$pdfs = get-childitem $inputDir -recurse | where {$_.Extension -match "pdf"}
foreach($pdf in $pdfs)
{
    $tif = $outputDir + $pdf.BaseName + ".tif"
    if(test-path $tif)
    {
        "tif file already exists " + $tif
    }
    else
    {
        'Processing ' + $pdf.Name
        $param = "-sOutputFile=$tif"
        & $tool -q -dNOPAUSE -sDEVICE=tiffg4 $param -r300 $pdf.FullName -c quit
    }
    Move-Item $pdf $outputDirPDF
}
EXIT

It appears to be creating one TIFF file for each PDF file in the source directory. How many PDF files are in the directory (and any sub-directories)? How many pages are in the input PDF file?
I note that you move the original PDF from 'InputDir' to 'OutputDirPDF' when completed, but 'OutputDirPDF' is a child of 'InputDir', so if you recurse child directories when looking for input files you may find files you have already processed. NB I know nothing about PowerShell, so this may be just fine.
I'd suggest making 'InputDir' and 'OutputDirPDF' siblings at the same level, e.g. "c:\temp\input" and "c:\temp\outputPDF" (see the sketch below).
That's about all I can say on the information here; you could state what the input PDF filename(s) and output filename(s) are, and what the processing messages say.
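A rough sketch of that layout, reusing the example paths above plus a similar one for the TIFFs (all of these folder names are just examples, not taken from the question): keep the input folder and both output folders as siblings and match only *.pdf files, so a recursive search can never pick up PDFs that have already been moved.
#Hypothetical sibling folders instead of nesting the output under the input
$inputDir = 'C:\temp\input\'
$outputDirPDF = 'C:\temp\outputPDF\'
$outputDir = 'C:\temp\TIFF\'
#-Filter *.pdf matches only PDFs; because the output folders are no longer
#inside $inputDir, -Recurse cannot revisit files that were already processed
$pdfs = Get-ChildItem $inputDir -Recurse -Filter *.pdf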

Related

How to merge multiple markdown files with pandoc while retaining cross document links?

I am trying to merge multiple markdown documents in a single folder together into a PDF with pandoc.
The documents may contain links to each other which should be browseable in the markdown format, e.g. through IntelliJ or within GitLab.
Simple example documents:
0001-file-a.md
---
id: 0001
---
# File a
This is a simple file without an [external link](www.stackoverflow.com).
0002-file-b.md
---
id: 0002
---
# File b
This file links to [another file](0001-file-a.md).
By default pandoc does not handle this case out of the box, e.g. when running the following command:
pandoc -s -f markdown -t pdf *.md -V linkcolor=blue -o test.pdf
It merges the files, creates a PDF and highlights the links correctly, but when clicking the second link it wants to open the file instead of jumping to the right location in the document.
This problem has been experienced by many before me but none of the solutions I found so far have solved it. The closest I came was with the help of this answer: https://stackoverflow.com/a/61908457/6628753
It defines a filter that is first applied to each file and then the resulting JSON files are merged.
I modified this filter to fit my needs:
Add the number of the file to the label of the top-level header
Prepend the top-level header to all other header labels
Remove .md from internal links
Here is the filter:
#!/usr/bin/env python3
from pandocfilters import toJSONFilter, Header, Link
import re
import sys

"""
Pandoc filter to convert internal links for multifile documents
"""

headerL1 = []

def fix_links(key, value, format, meta):
    global headerL1
    # Store level 1 headers
    if key == "Header":
        [level, [label, t1, t2], header] = value
        if level == 1:
            id = meta.get("id")
            newlabel = f"{id['c'][0]['c']}-{label}"
            headerL1 = [newlabel]
            sys.stderr.write(f"\nGlobal header: {headerL1}\n")
            return Header(level, [newlabel, t1, t2], header)
        # Prepend level 1 header label to all other header labels
        if level > 1:
            prefix = headerL1[0]
            newlabel = prefix + "-" + label
            sys.stderr.write(f"Header label: {label} -> {newlabel}\n")
            return Header(level, [newlabel, t1, t2], header)
    if key == "Link":
        [t1, linktext, [linkref, t4]] = value
        if ".md" in linkref:
            newlinkref = re.sub(r'.md', r'', linkref)
            sys.stderr.write(f'Link: {linkref} -> {newlinkref}\n')
            return Link(t1, linktext, [newlinkref, t4])
        else:
            sys.stderr.write(f'External link: {linkref}\n')

if __name__ == "__main__":
    toJSONFilter(fix_links)
And here is a script that executes the whole thing:
#!/bin/bash
MD_INPUT=$(find . -type f | grep md | sort)

# Pass the markdown through the gitlab filters into Pandoc JSON files
echo "Filtering Gitlab markdown"
for file in $MD_INPUT
do
    echo "Filtering $file"
    pandoc \
        --filter fix-links.py \
        "$file" \
        -t json \
        -o "${file%.md}.json"
done

JSON_INPUT=$(find . -type f | grep json | sort)

echo "Generating LaTeX"
pandoc -s -f json -t latex $JSON_INPUT -V linkcolor=blue -o test.tex
echo "Generating PDF"
pandoc -s -f json -t pdf $JSON_INPUT -V linkcolor=blue -o test.pdf
Applying this script generates a PDF where the second link does not work at all.
Looking at the LaTeX code, the problem can be solved by replacing the generated \href directive with \hyperlink.
Once this is done the linking works as expected.
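For instance, sticking with PowerShell for the sketch (sed would do the same job from the bash script above), a rough post-processing pass over test.tex could look like the lines below. It assumes internal targets are the only \href entries whose target lacks an http(s): prefix, so schemeless external links such as the www.stackoverflow.com example would need extra handling.
#Rewrite \href{target}{text} to \hyperlink{target}{text} whenever the target
#does not start with http: or https: (assumed to mean an internal anchor)
(Get-Content 'test.tex' -Raw) -replace '\\href\{(?!https?:)([^}]+)\}', '\hyperlink{$1}' |
    Set-Content 'test.tex'
The patched test.tex can then be compiled to PDF as usual.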
The problem now is that this isn't done automatically by pandoc, which almost seems like a bug.
Is there a way to tell pandoc a link is internal from within the filter?
After running the filter it is non-trivial to fix the issue since there is no good way to differentiate internal and external links.

Bigquery error (ASCII 0) encountered for external table and when loading table

I'm getting this error
"Error: Error detected while parsing row starting at position: 4824. Error: Bad character (ASCII 0) encountered."
The data is not compressed.
My external table points to multiple CSV files, and one of them contains a couple of lines with that character. In my table definition I added "MaxBadRecords", but that had no effect. I also get the same problem when loading the data in a regular table.
I know I could use Dataflow or even try to fix the CSVs, but is there an alternative that does not include writing a parser, and is hopefully just as easy and efficient?
is there an alternative that does not include writing a parser, and is hopefully just as easy and efficient?
Try the below in the Google Cloud SDK Shell (using the tr utility):
gsutil cp gs://bucket/badfile.csv - | tr -d '\000' | gsutil cp - gs://bucket/fixedfile.csv
This will
Read your "bad" file
Remove ASCII 0
Save the "fixed" file into a new file
After you have the new file, just make sure your table now points to that fixed one.
Sometimes it happens that a stray final byte appears in the file.
What can help is replacing it with:
tr '\0' ' ' < file1 > file2
You can clean the file using an external tool like Python or PowerShell; there is no way to load a file containing ASCII 0 into BigQuery.
This is a script that can clean the file with Python:
import os
import shutil
from tempfile import mkstemp

def replace_chars(file_path, original_string, new_string):
    #Create temp file
    fh, abs_path = mkstemp()
    with os.fdopen(fh, 'w', encoding='utf-8') as new_file:
        with open(file_path, encoding='utf-8', errors='replace') as old_file:
            print("\nCurrent line: \t")
            i = 0
            for line in old_file:
                print(i, end="\r", flush=True)
                i = i + 1
                line = line.replace(original_string, new_string)
                new_file.write(line)
    #Copy the file permissions from the old file to the new file
    shutil.copymode(file_path, abs_path)
    #Remove original file
    os.remove(file_path)
    #Move new file
    shutil.move(abs_path, file_path)
The same but for PowerShell:
(Get-Content "C:\Source.DAT") -replace "`0", " " | Set-Content "C:\Destination.DAT"
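If the files are too large to read into memory in one go, a streaming variant along the same lines should also work (a sketch only; the paths are placeholders):
#Hypothetical paths; stream the file line by line and strip NUL characters
$source = 'C:\Source.DAT'
$destination = 'C:\Destination.DAT'
$writer = New-Object System.IO.StreamWriter($destination)
foreach ($line in [System.IO.File]::ReadLines($source)) {
    $writer.WriteLine(($line -replace "`0", ' '))
}
$writer.Dispose()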

Exporting image data with bcp in sql2000

Hi, I have a SQL 2000 database with a large number of scanned documents, stored as PDFs and Word documents in an image data type.
I need to export them to files.
I have written code to do this using xp_cmdshell and bcp. Looking at other questions, I have created a fmt file as below:
8.0
1
1 SQLIMAGE 0 0 "" 1 FILEDATA ""
The command is:
bcp "select filedata FROM attacheddocuments where pkey = '+ convert (varchar, @imageid) + '" queryout "c:\scans\' + @imagefilename + '" -T -f c:\scans\attached.fmt
However, when I run the query it creates all the files, but they cannot be opened in either Word or Acrobat; both report that the file is corrupt.
If instead I run the command
bcp "select filedata FROM attacheddocuments where pkey = '+ convert (varchar, @imageid) + '" queryout "c:\scans\' + @imagefilename + '" -T -N
The PDF files now open OK, but the Word documents are still corrupt.
Does anyone have any ideas where I am going wrong?
I know this is a really old post, but I am having this issue with all files except PDFs.
I have tried with and without the -N switch, and with and without a format file. The strange thing is that I used to use my script often but had not used it for a while; during that period some SQL updates came out. Now the script only exports PDF documents without corruption.
ZIP files and just about any other file I can run through a repair tool and they are fixed, but that is not a doable option due to the volume.
Plus, my format file had a 4 instead of 0 in the prefix-length column. That alone made all files but PDFs come out corrupt, with corrupt headers.

How to use 7z.dll?

I now have a script to download a file and copy it to a directory. But how could I make it so that if I compress a folder to a zip file, it then gets extracted when that zipped folder is downloaded? It takes too much time to write the lines for every file separately. I know that I could use 7z.dll to decompress, but I don't know how to put that in code.
[Code]
procedure InitializeWizard;
begin
    idpDownloadAfter(wpReady);
end;

procedure CurPageChanged(CurPageID: Integer);
begin
    if CurPageID = wpReady then
    begin
        idpClearFiles;
        if IsComponentSelected('IGR') then
            idpAddFile('http://www.mediafire.com/download/f9hnlkt1t75ykjk/waterfall_IGR.model', ExpandConstant('{tmp}\waterfall_IGR.model'));
    end;
end;

procedure CurStepChanged(CurStep: TSetupStep);
begin
    if CurStep = ssPostInstall then
    begin
        // Copy downloaded files to application directory
        FileCopy(ExpandConstant('{tmp}\waterfall_IGR.model'), ExpandConstant('{app}\res_mods\0.8.10\content\Environment\env_waterfall\waterfall_IGR.model'), false);
    end;
end;
I don't know if 7z.dll will work directly, but what can be done is to download 7-Zip portable, include its folder in your package, and pass the unzipping command to 7za.exe.
E.g.:
7za.exe x <path to>\in.zip -oc:\pathToOutFolder
I had the same problem when creating a 7zip file and splitting it into several small files using the -v option. The way I fixed it: using PowerShell I get the list of files and then dynamically create the Inno project. It looks something like:
$Files = Get-Item "$zipFilesLocation\*.*"
$files | Select-Object @{Name="Address"; Expression={"idpAddFile('<webaddress>" + $_.Name + "' , ExpandConstant('{tmp}\58-Formulary_201311.7z.001'));"}}
and then just write each object into the .iss file like
foreach ($elem in $files)
{
    $e = "idpAddFile( WebWrlString + '" + $elem.Name + "', ExpandConstant('{tmp}\" + $elem.Name + "'));"
    $e | Out-File "Innopackage.iss" -Encoding ASCII -Append
}
I hope this helps

Renaming a list of files and creating folder in Powershell

I'm in need of a script, in PowerShell or a batch script, that will do the following.
Rename a file to append creation date minus 1 day to the filename.
For example:
foo.xlsx (created 7/27/2011)
foo-2011-07-26.xlsx --note, it's yesterday's date.
Date format isn't too important as long as it's there. There will be 10 files (all with the same creation date), so either I can copy and paste the same renaming line for the different files (just rename the filename) or just have the script affect all *.xlsx files in the existing folder.
Create a new folder where those files are and name it 'fooFolder-2011-07-26' (yesterday's date).
Move those renamed files to that folder.
I only have limited experience with PowerShell. It's on my to-do list of languages to learn.
Here you go. It could be shortened up a lot using aliases and piping and whatnot, but since you're still unfamiliar with PowerShell, I decided to write it in a more procedural style for your reading:
function MoveFilesAndRenameWithDate([string]$folderPrefix, [string]$filePattern) {
    $files = Get-ChildItem .\* -include $filePattern
    ForEach ($file in $files) {
        $yesterDate = $file.CreationTime.AddDays(-1).ToString('yyyy-MM-dd')
        $newSubFolderName = '{0}-{1}' -f $folderPrefix,$yesterDate
        if (!(Test-Path $newSubFolderName)) {
            mkdir $newSubFolderName
        }
        $newFileName = '{0}-{1}{2}' -f $file.BaseName,$yesterDate,$file.Extension
        Move-Item $file (Join-Path $newSubFolderName $newFileName)
    }
}
You would paste the above into your PowerShell session (place it in your profile). Then you call the function like this:
MoveFilesAndRenameWithDate 'fooFolder' '*.xlsx'
I tend to use more aliases and piping than the above function. The first version I wrote was this, and then I separated parts of it to make it more comprehensible to a PowerShell newcomer:
function MoveFilesAndRenameWithDate([string]$folderPrefix, [string]$filePattern) {
    gci .\* -include $filePattern |
        % { $date = $_.CreationTime.AddDays(-1).ToString('yyyy-MM-dd')
            mkdir "$folderPrefix-$date" 2>$null
            mv $_ (join-path "$folderPrefix-$date" ('{0}-{1}{2}' -f $_.BaseName,$date,$_.Extension)) }
}
Edit: Modified both functions to create a dated folder for the files that match that date. I considered making a temporary directory, grabbing a single date from the files moved to it, and finally renaming the directory after the loop. However, if a day should be missed and files for 2 (or more) days get processed together, this way there would still be a folder for each day, which is more consistent.
OK, I've made it:
function NameOfFunction([string]$folderpath)
{
    foreach ($filepath in [System.IO.Directory]::GetFiles($folderpath))
    {
        $file = New-Object System.IO.FileInfo($filepath);
        $date = $file.CreationTime.AddDays(-1).ToString('yyyy-MM-dd');
        if (![System.IO.Directory]::Exists("$folderpath\foo-$date"))
        {
            [System.IO.Directory]::CreateDirectory("$folderpath\foo-$date");
        }
        $filename = $file.Name.Remove($file.Name.LastIndexOf('.'));
        $fileext = $file.Name.SubString($file.Name.LastIndexOf('.'));
        $targetpath = "$folderpath\foo-$date" + '\' + $filename + '-' + $date + $fileext;
        [System.IO.File]::Move($filepath, $targetpath);
    }
}
Explanation:
First, get all files in the root folder.
For each file, we create a FileInfo object and get the CreationTime minus 1 day.
Then we check whether the directory that should be created exists, and create it if it doesn't.
Then we get the filename and the extension.
At the end we move the file to the new directory, under the new filename.
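A call would then look like this (the folder path is just an example):
NameOfFunction 'C:\test'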
Hope that helps you.