Programmatically rename file in ocrmypdf - pdf

I'm a new programmer making a first attempt at a larger data science project. I have written a class that is supposed to open PDFs with ocrmypdf, and a while statement that walks through all the documents in a folder. Here is the class:
class DocumentReader:
    # This class is used to open and read a
    # document using OCR and then
    # creating the document in its place
    def __init__(self, file):
        self.file = file

    def convert(self):
        ocrmypdf.ocr(self.file, new_doc, deskew=True)
And here is the while statement:
count = 0
while count < final:
    for file in os.listdir('PayStubs'):
        if file.endswith(".pdf"):
            index = str(file).find('.pdf')
            new_doc = file[:index] + '_new' + file[index:]
            d1 = DocumentReader(file)
            d1.convert()
I can make each piece work if I run it individually, but when I run them programmatically the '.pdf' extension is what trips me up. Does anyone know how to build a new file name programmatically for the second argument of the ocrmypdf.ocr call?
I have tried several different ways of making this work but I keep getting errors. The most common errors that my attempts have yielded are:
InputFileError: File not found - 20070928ch6495.pdf.pdf
and
IsADirectoryError: [Errno 21] Is a directory: '_new/'
I'm at the point where I'm running in circles. Any help would be greatly appreciated. Thanks!
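One possible way to build the output name is sketched below. It rests on a few assumptions: the PDFs sit directly inside the PayStubs folder, and the output should be written next to the input. os.listdir returns bare file names without the folder, so the sketch uses pathlib to keep the folder attached to every path. The helper names new_name, pending_pdfs and convert_folder are made up for illustration; only ocrmypdf.ocr with deskew=True comes from the question.

```python
from pathlib import Path


def new_name(pdf_path):
    """Return a sibling path with '_new' inserted before the extension."""
    pdf_path = Path(pdf_path)
    # stem is the file name without its last suffix; suffix is '.pdf'
    return pdf_path.with_name(pdf_path.stem + "_new" + pdf_path.suffix)


def pending_pdfs(folder):
    """List the PDFs in folder, skipping '_new' outputs from an earlier run."""
    return sorted(
        p for p in Path(folder).glob("*.pdf") if not p.stem.endswith("_new")
    )


def convert_folder(folder="PayStubs"):
    """OCR every pending PDF, writing '<name>_new.pdf' next to the original."""
    import ocrmypdf  # imported here so the helpers above work without it

    for src in pending_pdfs(folder):
        # Both arguments are full paths, so the working directory does not matter.
        ocrmypdf.ocr(str(src), str(new_name(src)), deskew=True)
```

Calling convert_folder("PayStubs") would then replace both the class and the while loop. Because the extension is split off with stem and suffix rather than by string searching, '.pdf' is never appended twice.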

Related

Unpack a rar file

Okay, so I searched for DLLs that would let me unrar files and found quite a few, such as unrar.dll, Chilkat, SharpCompress and others, but I wanted to use the one provided by RAR themselves.
So I referenced the DLL in my project and imported it. I was using unrar.dll.
But I wasn't able to find any up-to-date code to test and try things out. All the examples I found were either out of date or not for VB.NET.
I also tried the official example that came with the installation, but it didn't work even after I fixed it, and whenever I tried to use the code I got this error:
Object reference not set to an instance of an object
I just want to unrar a RAR file from a specific location into my program's root directory. For example, if my program were on the desktop, I would want it to unrar a file in My Documents and extract the files to my desktop.
If you just want to unrar files, I was able to do that with SharpCompress.
First I created a new VB.Net App and added a reference to SharpCompress.dll before using this code to extract all the files from a Rar file.
'Imports
Imports SharpCompress.Archives
Imports SharpCompress.Common

'Unrar code
Dim archive As IArchive = ArchiveFactory.Open("C:\file.rar")
For Each entry In archive.Entries
    If Not entry.IsDirectory Then
        Console.WriteLine(entry.Key)
        'Open brace kept on the With line so no line-continuation character is needed
        entry.WriteToDirectory("C:\unrar", New ExtractionOptions With {
            .ExtractFullPath = True, .Overwrite = True})
    End If
Next
More code samples
For those trying this in VB.NET: the extract options have been renamed and are now used like this:
Dim options As New ExtractionOptions With {
    .ExtractFullPath = True,
    .Overwrite = True
}
entry.WriteToDirectory(Application.StartupPath, options)

I can't get netbeans to find a txt file I have in the same directory... java.io.FileNotFoundException

I can't make the path specific because once I get this program to work (this is the last thing I have to do) I'm uploading it to my university's ilearn website, and it has to run on my professor's computer with no modifications. I've tried a few different variations of code similar to the following...
File file = new File("DataFile.txt");
Scanner document = new Scanner(new File("DataFile.txt"));
Or...
java.io.File file = new java.io.File("DataFile.txt");
Scanner document = new Scanner(file);
But nothing seems to work. I've got the necessary stuff imported. I've tried moving DataFile around to a few different folders (the src folder, and other random folders in the project's NetBeansProjects folder). I also tried creating a folder in the project, putting the file in that folder, and using some kind of
documents/DataFile.txt
bit I found online (I named the folder documents).
I've tried renaming the file, saving it in different ways. I'm all out of ideas.
The file is just a list of numbers used to generate random data for the program we were assigned, a gas station simulator. The program runs great when I just use user input from the console. But I cannot get NetBeans to find that file for the life of me! Help!?!?!?
Try adding the file to the build path:
// Needs: import java.io.File; import java.io.FileNotFoundException; import java.util.Scanner;
public void readTextFile() {
    try {
        Scanner scFile = new Scanner(new File("filename.txt"));
        while (scFile.hasNext()) {
            String line = scFile.nextLine();
            Scanner details = new Scanner(line).useDelimiter("symbol");
            // From here you can store the integer values, e.g. in an array.
            // litterArr and size are assumed to be fields declared elsewhere;
            // size counts how many values the array holds so far.
            litterArr[size] = details.nextInt();
        }
        scFile.close();
    } catch (FileNotFoundException e) {
        // ..... *code*
    }
}
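Keep the file in the same folder as the program. If it is saved in another folder, you need to supply the path indicating the location of the file as part of the file name, e.g. memAthletics.Lines.LoadFromFile('C:\MyFiles\Athletics.txt'); Note that in Java a relative name such as "filename.txt" is resolved against the working directory, which for a project run from NetBeans is typically the project's root folder rather than src.
Hope this helps clear the problem up :)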

MsTest, DataSourceAttribute - how to get it working with a runtime generated file?

For some tests I need to run a data-driven test with a configuration that is generated via reflection in the ClassInitialize method. I have tried everything, but I just cannot get the data source set up properly.
The test takes a list of classes in a csv file (one line per class) and then will test that the mappings to the database work out well (i.e. try to get one item from the database for every entity, which will throw an exception when the table structure does not match).
The test method is:
[DataSource(
    "Microsoft.VisualStudio.TestTools.DataSource.CSV",
    "|DataDirectory|\\EntityMappingsTests.Types.csv",
    "EntityMappingsTests.Types#csv",
    DataAccessMethod.Sequential)
]
[TestMethod()]
public void TestMappings () {
Obviously the file is EntityMappingsTests.Types.csv. It should be in the DataDirectory.
Now, in the Initialize method (marked with ClassInitialize) I put that together and then try to write it.
WHERE should I write it to? WHERE IS THE DataDirectory?
I tried:
File.WriteAllText(context.TestDeploymentDir + "\\EntityMappingsTests.Types.csv", types.ToString());
File.WriteAllText("EntityMappingsTests.Types.csv", types.ToString());
Both result in "the unit test adapter failed to connect to the data source or read the data". More precisely:
Error details: The Microsoft Jet database engine could not find the
object 'EntityMappingsTests.Types.csv'. Make sure the object exists
and that you spell its name and the path name correctly.
So where should I put that file?
I also tried just writing it to the current directory and taking out the DataDirectory part - same result. Sadly, there is limited debugging support here.
Please use the Process Monitor tool from technet.microsoft.com/en-us/sysinternals/bb896645. Put a filter on MSTest.exe or the associated qtagent32.exe and find out which locations it is trying to load from, and at what point in the test loading process. Then please post an update with those details here.
After you add the CSV file to your VS project, you need to open the properties for it. Set the Property "Copy To Output Directory" to "Copy Always". The DataDirectory defaults to the location of the compiled executable, which runs from the output directory so it will find it there.

itext outofmemory error while attempting to count the number of pages in a pdf file

I'm trying to execute the following code:
PdfReader reader = new PdfReader("/path/to/file.pdf");
int pages = reader.getNumberOfPages();
It works on most files, but on one particular file, it crashes with error:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:572)
at java.lang.StringBuffer.append(StringBuffer.java:320)
at com.itextpdf.text.pdf.PRTokeniser.readString(PRTokeniser.java:158)
at com.itextpdf.text.pdf.PRTokeniser.getStartxref(PRTokeniser.java:224)
at com.itextpdf.text.pdf.PRTokeniser.getStartxref(PRTokeniser.java:229)
...goes on for a while
at com.itextpdf.text.pdf.PRTokeniser.getStartxref(PRTokeniser.java:229)
I know that it's something wrong with the input file. I'm just wondering if there's a way of knowing before attempting to make the method call, that the file is going to cause a problem.
It turns out it was a bug in the version of iText I am using (5.0.1). I logged a query with the developers, and a fix was put in, which I tested and which will hopefully find its way into the next version (5.0.2).

SSIS 2005 flat file source - partial row which isn't actually a partial row

I'm currently working on an SSIS package to load mainframe logs from multiple server/file sources into a database.
As it stands at the moment I'm using a foreach loop container to loop through a recordset containing filenames and load the files using a Data Flow task from a Flat File Source and File connection to an OLE DB Destination through a Derived column.
I've built error handling into the Data Flow task to allow for the fact that there won't always be a log file in the location specified (i.e. because the server was down for maintenance during a specific period, as the files are generated on an hourly basis), but the problems start after it finishes handling these errors.
If the file immediately following one that wasn't found does exist, the package begins to load it but then throws the following warning message and doesn't load all of the records in that file: [Message Log File Source (NORDXSL) [57]] Warning: There is a partial row at the end of the file.
However, when I remove the files I know won't exist from the recordset (so that it only attempts to load files that do exist, including the one with the alleged "partial row"), everything works fine and all files/rows are loaded without a problem. It just seems unable to correctly load the first file after a missing-file failure, and I can't for the life of me work out why.
I've tried calling Dispose() and ReleaseConnection() on the file connection after the Data Flow task has finished processing but this makes no difference and I'm now completely out of ideas.
Any help would be really appreciated as this is the last bug in this project and I want to get it out the door. PLEASE!!
Thanks,
James
I've now found a workaround for this problem...
I've added a Script Task before the Data Flow Task that loads the files; the script checks whether the file I want to read exists:
If (System.IO.File.Exists(Dts.Variables("MQLogMessagePath").Value.ToString)) Then
    Dts.TaskResult = Dts.Results.Success
Else
    Dts.TaskResult = Dts.Results.Failure
End If
If it doesn't exist it fails the iteration of the Foreach Loop container and continues onto the next file.
BINGO!