fitz.fitz.FileDataError: cannot open document - pdf

import fitz  # PyMuPDF

def read_pdf(path: str) -> str:
    doc = fitz.open(path)
    txt = ""
    for page in doc:
        txt += page.get_text("text")
    return txt
I get the error fitz.fitz.FileDataError: cannot open document. Could someone help? Thank you in anticipation.
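One way to narrow this down is to check the file itself before handing it to fitz.open. The sketch below is only a rough diagnostic, assuming the error comes from a file that is missing, empty, or not really a PDF (for example an HTML error page saved with a .pdf extension). The helper name looks_like_pdf is made up for illustration and is not part of PyMuPDF.

```python
from pathlib import Path
from typing import Tuple

def looks_like_pdf(path: str) -> Tuple[bool, str]:
    """Run basic sanity checks on a file; this does not validate the whole PDF."""
    p = Path(path)
    if not p.is_file():
        return False, "path does not exist or is not a file"
    if p.stat().st_size == 0:
        return False, "file is empty"
    with p.open("rb") as fh:
        head = fh.read(1024)
    # PDF files carry a "%PDF-" marker at (or very near) the start of the file.
    if b"%PDF-" not in head:
        return False, "no %PDF- header in the first 1024 bytes"
    return True, "header looks like a PDF"
```

Calling looks_like_pdf(path) before read_pdf(path) and printing the reason may show whether the problem is the file rather than the code. A file that passes this check can still be corrupt further in.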

Related

Extracting and save relevant data from a txt file

I tried to write some code but did not succeed. Could someone help me, please?
I want my program to read the data.txt file.
The data.txt file contains:
Name: Christian
Phone: x
Address: x
Name: Alexander
Phone: x
Address: x
I would like the program to save the names in an output file, output_data.txt.
The output_data.txt file should contain:
Christian
Alexander
This is what I have so far:
Using DataReader As New Microsoft.VisualBasic.FileIO.TextFieldParser("data.txt")
    Dim DataSaver As System.IO.StreamWriter
    DataReader.TextFieldType = FileIO.FieldType.Delimited
    DataReader.SetDelimiters("Name: ")
    Dim Row As String()
    While Not DataReader.EndOfData
        Row = DataReader.ReadFields()
        Dim DataSplited As String
        For Each DataSplited In Row
            My.Computer.FileSystem.WriteAllText("output_data.txt", DataSplited, False)
            'MsgBox(DataSplited)
        Next
    End While
End Using
But output_data.txt does not end up containing what MsgBox(DataSplited) shows. MsgBox(DataSplited) splits the text at "Name: " but also shows the rest of each record, such as the address and phone. I don't know what the problem is.
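This will work:
' Requires Imports System.IO at the top of the file
Dim strFile = "c:\test5\data.txt"
Dim InputBuf As String = File.ReadAllText(strFile)
Dim OutBuf As String = ""
' Element 0 is the text before the first "Name:", so start at 1
Dim InBufHold() As String = Split(InputBuf, "Name:")
For i = 1 To InBufHold.Length - 1
    ' Keep only the first line of each chunk, which is the name
    OutBuf += Trim(Split(InBufHold(i), vbCrLf)(0)) & vbCrLf
Next
File.WriteAllText("c:\test5\output_data.txt", OutBuf)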
Alternatively, using LINQ:
Dim names = File.ReadLines("data.txt").
    Where(Function(line) line.StartsWith("Name: ")).
    Select(Function(line) line.Substring(5).Trim())
File.WriteAllText("output_data.txt", String.Join(vbCrLf, names))
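For comparison, the same filter-and-strip idea can be sketched in Python. This is only an illustration, assuming every name sits on its own line starting with "Name:" as in the sample data.txt. The function names extract_names and save_names are made up for this example.

```python
from pathlib import Path
from typing import Iterable, List

PREFIX = "Name:"

def extract_names(lines: Iterable[str]) -> List[str]:
    """Keep only lines that start with the prefix and strip the prefix off."""
    return [line[len(PREFIX):].strip() for line in lines if line.startswith(PREFIX)]

def save_names(src: str, dst: str) -> None:
    """Read src line by line and write one name per line to dst."""
    with open(src, encoding="utf-8") as fh:
        names = extract_names(fh)
    Path(dst).write_text("\n".join(names) + "\n", encoding="utf-8")
```

With the sample file from the question, save_names("data.txt", "output_data.txt") would write Christian and Alexander on separate lines. Unlike the original attempt, the output file is written once, after all names have been collected, so nothing is overwritten inside the loop.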

Create multiple PDF files from one input PDF file, split using bookmarks

I've been working on a VB.NET project to dynamically create report packs in PDF format using a SQL database and a number of input PDF templates. To cut a long story short, due to the way that Business Objects creates the input files it will be much more efficient to allow input of compiled PDF reports rather than individual report template pages. In order for this to work however, we would need to split the input PDF files into sections using the Bookmarks created by BOBJ. We are not sure how many pages will be in the range of each bookmark but require a consistent naming convention of the split files so that the next part of the process can pick the correct templates up and merge them in the required combinations.
The second part of this process is designed and working well using a .NET library called PDFsharp. I have used the samples on their website to write code that splits an input PDF file into one output file per page, but I do not understand how to split it using the bookmarks.
If I could understand how to parse the PDF and read the bookmark metadata (the name, start page and end page of each bookmark), then I think I could finish it.
An example of the input PDF format is here:
https://drive.google.com/open?id=0B0GZGW6CFCI-UWY2WGRvV0dQSWZSNnNOWlp4R21zbFVPZDBn
There are 5 bookmarks (TID01, TID02 ...) and 6 pages. Section TID04 would have two pages output.
The file names I would need would be in the format of "ExamplePDF_TID01.pdf"
Any help to move forward would be greatly appreciated. Looking at the project's wiki, it does not seem very active any more, and while other people have asked about this in the past, I cannot find any answers.
Code to Split by Page:
Sub Splitfiles()
    Dim inputdir As String = "O:\Transformation\Standardisation\Input PDFs"
    Dim outputdir As String = "O:\Transformation\Standardisation\Input PDFs\output\"
    'inputdir = folder path containing input files
    Dim fileEntries As String() = Directory.GetFiles(inputdir)
    Dim filename As String
    Dim pdfpage As PdfPage
    Dim ccid As String
    Dim pageid As Integer
    Dim outputfilename As String
    For Each filename In fileEntries
        Dim importdoc As PdfDocument = PdfReader.Open(filename, PdfSharp.Pdf.IO.PdfDocumentOpenMode.Import)
        Dim count As Integer = importdoc.PageCount
        Dim x = 0
        Do Until x = count
            Dim outputdoc As PdfDocument = New PdfDocument
            pdfpage = importdoc.Pages(x)
            outputdoc.AddPage(pdfpage)
            ccid = Strings.Right(filename, Len(filename) - Len(inputdir)) 'expand this to find CC ID
            ccid = Strings.Left(ccid, Len(ccid) - 4)
            pageid = x
            outputfilename = outputdir & ccid & "_" & pageid & ".pdf"
            outputdoc.Save(outputfilename)
            x = x + 1
        Loop
    Next
End Sub
And the code I started to split by bookmark but couldn't finish:
Sub SplitPDFByBookmark()
    Dim inputfile As String = "O:\Transformation\Standardisation\Input PDFs\Business Sub Area Report - Project Management - FY16_FP02 - 17062016_0709.PDF"
    Dim outputdir As String = "O:\Transformation\Standardisation\Input PDFs\output\"
    'inputdir = folder path containing input files
    'Dim fileEntries As String() = Directory.GetFiles(inputdir)
    Dim filename As String
    Dim pdfpage As PdfPage
    Dim ccid As String
    Dim pageid As Integer
    Dim outputfilename As String
    filename = inputfile
    'For Each filename In fileEntries
    Dim importdoc As PdfDocument = PdfReader.Open(filename, PdfSharp.Pdf.IO.PdfDocumentOpenMode.Import)
    Dim count As Integer = importdoc.PageCount
    Dim x = 0
    For Each bookmark In importdoc.Outlines
        Dim outputdoc As PdfDocument = New PdfDocument
        pdfpage = importdoc.Pages(importdoc.Outlines.)
        outputdoc.AddPage(pdfpage)
        pageid = x
        outputfilename = outputdir & "OutputFile_" & pageid & ".pdf"
        outputdoc.Save(outputfilename)
        x = x + 1
    Next
    'Next
End Sub
Thanks in advance for your help!
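The page-range logic behind a bookmark split can be sketched independently of the PDF library. The example below is in Python and uses PyMuPDF (fitz) rather than PDFsharp, so it is not a drop-in answer for the VB.NET project. It assumes each top-level bookmark marks the first page of its section and that a section runs up to the page before the next bookmark. The function names, the prefix argument and the bookmark positions in the usage note are assumptions for illustration.

```python
import os
from typing import List, Sequence, Tuple

def bookmark_ranges(toc: Sequence[Sequence], page_count: int) -> List[Tuple[str, int, int]]:
    """Map top-level bookmarks to (title, first_page, last_page), 0-based inclusive.

    toc entries are [level, title, page] with 1-based page numbers,
    which is the shape PyMuPDF's Document.get_toc() returns.
    """
    top = [(entry[1], entry[2] - 1) for entry in toc if entry[0] == 1]
    ranges = []
    for i, (title, start) in enumerate(top):
        # A section ends on the page before the next bookmark, or on the last page.
        next_start = top[i + 1][1] if i + 1 < len(top) else page_count
        ranges.append((title, start, max(start, next_start - 1)))
    return ranges

def split_by_bookmarks(pdf_path: str, out_dir: str, prefix: str) -> None:
    """Write one PDF per top-level bookmark, named prefix_title.pdf."""
    import fitz  # PyMuPDF; imported here so bookmark_ranges works without it
    src = fitz.open(pdf_path)
    for title, first, last in bookmark_ranges(src.get_toc(), src.page_count):
        part = fitz.open()  # new, empty PDF
        part.insert_pdf(src, from_page=first, to_page=last)
        part.save(os.path.join(out_dir, f"{prefix}_{title}.pdf"))
        part.close()
    src.close()
```

For a 6-page file whose bookmarks TID01 to TID05 start on pages 1, 2, 3, 4 and 6 (an assumed layout that matches the description), bookmark_ranges gives TID04 the two pages 4 and 5, and split_by_bookmarks(path, out_dir, "ExamplePDF") would produce names such as ExamplePDF_TID01.pdf. Bookmarks that do not point at a page are not handled in this sketch.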

How to save decoded image in server using vb.net

I have written this code for saving decoded images locally and it works fine. What I actually need is to save the decoded image on a server, but I don't know how to achieve this. I have seen many examples of saving files on a server, but here I have a base64-decoded image. Can I get a hint? Thanks in advance.
Dim bt64 As Byte() = System.Convert.FromBase64String(srcFile)
Dim destFile As String = "C:\Users\Administrator\Desktop\MASavedImages"
Dim imgName As String
imgName = String.Format("{0:dd-MM-yyyy hh-mm-ss tt-fff}", DateTime.Now)
imgName += ".jpg"
If (Not System.IO.Directory.Exists(destFile)) Then
    System.IO.Directory.CreateDirectory(destFile)
    File.WriteAllBytes(destFile + "\" + imgName, bt64)
Else
    File.WriteAllBytes(destFile + "\" + imgName, bt64)
End If
Why can't you save to the server with the same method? What happens when you change destFile to \\server\path\filename and run it?
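The point of that suggestion is that writing the decoded bytes does not depend on where the destination folder lives. A minimal Python sketch of the same steps follows, assuming the server share is reachable as an ordinary path and the process has write permission there. The function name save_base64_image is made up, and the timestamp format is only similar to the one in the question.

```python
import base64
from datetime import datetime
from pathlib import Path

def save_base64_image(src_b64: str, dest_dir: str) -> Path:
    """Decode a base64 string and write it as a timestamped .jpg in dest_dir."""
    data = base64.b64decode(src_b64)
    folder = Path(dest_dir)
    # Creates the folder if needed; does nothing when it already exists.
    folder.mkdir(parents=True, exist_ok=True)
    name = datetime.now().strftime("%d-%m-%Y %H-%M-%S-%f") + ".jpg"
    target = folder / name
    target.write_bytes(data)
    return target
```

On Windows, dest_dir could be a UNC path such as r"\\server\path" (a placeholder, as in the answer above). Whether the write succeeds then comes down to network access and permissions rather than to the decoding code.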

Replace before save to CSV

I'm using Scrapy's export to CSV, but sometimes the content I'm scraping contains quotes and commas, which I don't want.
How can I replace those characters with nothing ('') before outputting to CSV?
Here's my CSV containing the unwanted characters in the strTitle column:
strTitle,strLink,strPrice,strPicture
"TOYWATCH 'Metallic Stones' Bracelet Watch, 35mm",http://shop.nordstrom.com/s/toywatch-metallic-stones-bracelet-watch-35mm/3662824?origin=category,0,http://g.nordstromimage.com/imagegallery/store/product/Medium/11/_8412991.jpg
Here's my code, which errors on the replace line:
def parse(self, response):
    hxs = Selector(response)
    titles = hxs.xpath("//div[@class='fashion-item']")
    items = []
    for titles in titles[:1]:
        item = watch2Item()
        item["strTitle"] = titles.xpath(".//a[@class='title']/text()").extract()
        item["strTitle"] = item["strTitle"].replace("'", '').replace(",",'')
        item["strLink"] = urlparse.urljoin(response.url, titles.xpath("div[2]/a[1]/@href").extract()[0])
        item["strPrice"] = "0"
        item["strPicture"] = titles.xpath(".//img/@data-original").extract()
        items.append(item)
    return items
EDIT
Try adding this line before the replace.
item["strTitle"] = ''.join(item["strTitle"])
strTitle = "TOYWATCH 'Metallic Stones' Bracelet Watch, 35mm"
strTitle = strTitle.replace("'", '').replace(",",'')
strTitle == "TOYWATCH Metallic Stones Bracelet Watch 35mm"
In the end the solution was:
item["strTitle"] = [titles.xpath(".//a[@class='title']/text()").extract()[0].replace("'", '').replace(",",'')]
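The underlying issue is that extract() returns a list of strings, and a list has no replace method. The cleanup can be sketched as a small standalone helper that accepts either form. The name clean_title and the choice to also drop double quotes are assumptions for illustration, not part of Scrapy.

```python
from typing import List, Union

# Characters to drop: single quote, double quote, comma.
_UNWANTED = str.maketrans("", "", "'\",")

def clean_title(raw: Union[str, List[str]]) -> str:
    """Join a list of extracted text nodes if needed, then strip quotes and commas."""
    text = "".join(raw) if isinstance(raw, list) else raw
    return text.translate(_UNWANTED).strip()
```

In the spider this could be used as item["strTitle"] = clean_title(titles.xpath(".//a[@class='title']/text()").extract()), which stores a plain string rather than a one-element list.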

Reading the next line from a text file

I'm working on an RPG type game for a project and I am stuck.
Basically, this code searches for a name in a text file (structure: names on odd lines, levels on even lines). It then needs to output the next line, which is the level they were on. I have a counter (variable "count") that gives the number of the line on which the level is written, but I cannot use that count to read that line (using "FileSystem.LineInput(count)").
Here is my full code:
Sub LoadGame()
    Dim filename, filepath, searchitem, question, read As String
    Dim found As Boolean
    Dim count As Integer = 1
    filename = "save.txt"
    filepath = CurDir() & "\" & filename
    searchitem = name
    FileOpen(1, filename, OpenMode.Input)
    Do While Not EOF(1)
        read = LineInput(1)
        If read = searchitem Then
            found = True
            Exit Do
        Else
            found = False
        End If
        count = count + 1
    Loop
    If found = True Then
        If count >= 3 Then
            count = count + 1
        End If
        question = FileSystem.LineInput(count) ' This bit is broken
        Console.WriteLine("Found save game... Loading: " & question)
        Console.ReadLine()
        Console.BackgroundColor = ConsoleColor.Black
        Console.ForegroundColor = ConsoleColor.Red
        Console.Clear()
        Race(question)
    Else
        Console.WriteLine("No save game...")
        Console.ReadLine()
    End If
    FileClose(1)
End Sub
I am not sure what is wrong, but any help would be greatly appreciated (using VB 2010).
LineInput reads the next line of the specified file (parameter FileNumber).
Your file has the FileNumber 1, and after the loop the file pointer already points to the desired line. Therefore, it should be sufficient to call:
question = FileSystem.LineInput(1)
In my opinion, you should avoid this kind of file access (by FileNumber) in .NET. It is an old relic from VB6 times. In .NET you have easy-to-use classes such as StreamReader for that purpose. But if you want to do it the old-fashioned way, at least use the FreeFile method to obtain the file number.
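The same idea, that the reader is already positioned on the following line once a match is found, can be sketched in Python with a plain iterator. This is only an illustration and assumes the save file alternates name and level lines as described in the question. The function name level_for is made up.

```python
from typing import Iterable, Optional

def level_for(name: str, lines: Iterable[str]) -> Optional[str]:
    """Return the line that follows the first line equal to name, or None."""
    it = iter(lines)
    for line in it:
        if line.rstrip("\n") == name:
            # The iterator now sits on the next line, so no line counter is needed.
            nxt = next(it, None)
            return None if nxt is None else nxt.rstrip("\n")
    return None
```

With a file, this might be used as: with open("save.txt") as fh: level = level_for(player_name, fh), where player_name is a placeholder for the name being searched. Because the loop and the next() call share one iterator, there is no need to track line numbers at all.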