Extracting word document with styles associated to the content - formatting

I'm trying to extract the formatting of a Word document that contains text in different fonts and font sizes, images, comments, etc. I used the zipfile module to extract the XML files from the document.
The archive contains:
['[Content_Types].xml',
'_rels/.rels',
'word/_rels/document.xml.rels',
'word/document.xml',
'word/footer2.xml',
'word/header1.xml',
'word/footer1.xml',
'word/endnotes.xml',
'word/footnotes.xml',
'word/_rels/header1.xml.rels',
'word/header2.xml',
'word/_rels/header2.xml.rels',
'word/embeddings/Microsoft_Word_97_-_2003_Document1.doc',
'word/media/image3.wmf',
'word/media/image2.emf',
'word/theme/theme1.xml',
'word/media/image1.png',
'word/embeddings/oleObject1.bin',
'word/comments.xml',
'word/settings.xml',
'word/styles.xml',
'customXml/itemProps1.xml',
'word/numbering.xml',
'customXml/_rels/item1.xml.rels',
'customXml/item1.xml',
'docProps/app.xml',
'word/stylesWithEffects.xml',
'word/webSettings.xml',
'word/fontTable.xml',
'docProps/core.xml',
'docProps/custom.xml']
I can't work out which styles apply to the content in word/document.xml.
I'd like to capture the results in the following form:
{
    "text": "some-text-in-document",
    "font": "some-font",
    "font_size": 10,
    "some_field": "some-more-value",
    ...
}
I tried using python-docx to get the fonts and font sizes, but the value is mostly None.
Here's the code snippet:
from docx import Document
from docx.enum.style import WD_STYLE_TYPE

# "example.docx" is a placeholder path
document = Document("example.docx")
styles = document.styles

# Fonts defined on the paragraph styles
paragraph_styles = [s for s in styles if s.type == WD_STYLE_TYPE.PARAGRAPH]
for style in paragraph_styles:
    if style.font.name:
        print(style.font.name, style.font.size)

# Font of the character style applied to each run
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        print(run.text)
        font = run.style.font
        print(font.size)
The results are mostly None for both font and size.

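In python-docx, a value of None means the property is not set at that level, so it is inherited from the style hierarchy; content with no explicit style uses the default one, Normal.
All paragraphs have a style; it's just that most share the same one, so Word doesn't spell it out for that majority case, perhaps to save space.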
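To make the inheritance concrete, here is a minimal sketch that reads word/styles.xml directly and follows each style's w:basedOn link until a font name and size turn up, falling back to w:docDefaults. The helper names (`resolve_font`, `_font_from_rpr`) and the sample XML are made up for illustration. It assumes the font is given as a literal w:ascii attribute, and it ignores theme fonts and formatting applied directly to runs, so treat it as a starting point rather than a full resolver.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _font_from_rpr(rpr):
    """Return (name, size_in_points) set directly in one w:rPr element."""
    if rpr is None:
        return None, None
    fonts = rpr.find("w:rFonts", NS)
    sz = rpr.find("w:sz", NS)
    name = fonts.get(f"{{{W}}}ascii") if fonts is not None else None
    # w:sz stores the size in half-points
    size = int(sz.get(f"{{{W}}}val")) / 2 if sz is not None else None
    return name, size


def resolve_font(styles_xml, style_id):
    """Follow the w:basedOn chain from style_id, then fall back to w:docDefaults."""
    root = ET.fromstring(styles_xml)
    styles = {s.get(f"{{{W}}}styleId"): s for s in root.findall("w:style", NS)}
    name = size = None
    seen = set()  # guards against a cyclic basedOn chain
    while style_id is not None and style_id in styles and style_id not in seen:
        seen.add(style_id)
        style = styles[style_id]
        found_name, found_size = _font_from_rpr(style.find("w:rPr", NS))
        name = name if name is not None else found_name
        size = size if size is not None else found_size
        based_on = style.find("w:basedOn", NS)
        style_id = based_on.get(f"{{{W}}}val") if based_on is not None else None
    default_name, default_size = _font_from_rpr(
        root.find("w:docDefaults/w:rPrDefault/w:rPr", NS)
    )
    return (
        name if name is not None else default_name,
        size if size is not None else default_size,
    )


# Hypothetical styles.xml fragment: Heading1 sets only a size and inherits the font
SAMPLE = """<w:styles xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:style w:type="paragraph" w:styleId="Normal">
    <w:rPr><w:rFonts w:ascii="Calibri"/><w:sz w:val="22"/></w:rPr>
  </w:style>
  <w:style w:type="paragraph" w:styleId="Heading1">
    <w:basedOn w:val="Normal"/>
    <w:rPr><w:sz w:val="32"/></w:rPr>
  </w:style>
</w:styles>"""

print(resolve_font(SAMPLE, "Heading1"))  # ('Calibri', 16.0)
```

With a real document, the bytes returned by zipfile.ZipFile(path).read("word/styles.xml") can be passed as styles_xml.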


Display UTF-8 JSPdf with VueJS

I am trying to display a catalog of products as a PDF using jsPDF with VueJS. The data comes from an API.
Depending on the chosen language, the data will be in English, French, etc.
Letters display correctly in English. In French, however, letters such as "é è à ç" are displayed in a strange way. I have read a lot about setting a new font, but I still can't get it to work.
So I am asking here in case somebody can share a VueJS script that sets up a custom font that works with UTF-8 encoding.
Here is a minimal function, but it shows the idea:
download() {
  let pdfRef = this.$refs.pdf;
  html2canvas(pdfRef).then(canvas => {
    let pdf = new jsPDF();
    const str = "éçàeéi test";
    pdf.save("Articles.pdf");
  });
},
In the PDF I see the following (sometimes some letters are blank as well):
é è à ç test
instead of
éçàeéi test
Thank you in advance.
Have a nice day!
Following this link, it works perfectly. Funnily enough, I had followed this doc before, but I had not used a binary string to convert my TTF font.
I downloaded the font from Google Fonts and then used a website to convert the TTF file to base64, which gave me the string for the variable const myFont. Don't be scared by its length: it can be thousands of characters long.
The solution is just below!
Thank you for your help and have a nice day :)
const myFont = ... // load the *.ttf font file as binary string
// add the font to jsPDF
doc.addFileToVFS("MyFont.ttf", myFont);
doc.addFont("MyFont.ttf", "MyFont", "normal");
doc.setFont("MyFont");
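The conversion step does not need a website. As a rough sketch, assuming Python is available on the development machine, the TTF can be base64-encoded locally and written out as a small JavaScript module. The function names, the file paths and the exported variable name are placeholders, not part of jsPDF.

```python
import base64
from pathlib import Path


def font_to_base64(data: bytes) -> str:
    """Encode raw font bytes as a base64 ASCII string."""
    return base64.b64encode(data).decode("ascii")


def write_font_module(ttf_path, js_path, var_name="myFont"):
    """Write a JS module that exports the encoded font as a string constant."""
    encoded = font_to_base64(Path(ttf_path).read_bytes())
    Path(js_path).write_text(
        f'export const {var_name} = "{encoded}";\n', encoding="ascii"
    )


# Example call with placeholder paths:
# write_font_module("MyFont.ttf", "my-font.js")
```

The exported string is what then gets passed as the second argument to addFileToVFS.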

How to change the font in the SpTextPresenter?

Pharo 9, Spec 2 -- I have a Spec 2 presenter with a text widget:
initializePresenters
    text := self newText.
    super initializePresenters
As I understand it, its type is SpTextPresenter. How do I change the font of this text, that is, the font face and size of all the text shown in this widget? For example, to "Courier New", 9.
EDIT 1:
Also I tried:
text addStyle: { SpStyleSTONReader fromString:
    '
    Font {
        #name: "Source Sans Pro",
        #size: 12,
        #bold: false,
        #italic: true
    }' }.
but it does not work; the error is: Improper store into indexable object.
EDIT 2:
I also found this documentation. It seems that the scenario must be:
1. Read the styles as STON.
2. Set the styles somewhere (where?) for the whole application. They are described under their names in the STON, so they can be referred to by those names in the application.
3. Call addStyle: 'the-name' so that the widget picks up the styles named the-name from the loaded STON.
The problem is in step 2: I have no application, just one presenter, which I open with openWithSpec.
I didn't notice this until now.
Spec "styles" cannot be added directly to the component; they need to be part of a stylesheet.
Stylesheets are defined in your application (specifically, in your application configuration).
You can take a look at StPharoApplication>>resetConfiguration, StPharoMorphicConfiguration>>styleSheet and StPharoMorphicConfiguration>>styleSheetCommon as examples (you will also see there that using STON to declare your styles is just a convenience, not mandatory).
Here is a simplified version of what you will find there:
StPharoApplication >> resetConfiguration
    self useBackend: #Morphic with: StPharoMorphicConfiguration new

StPharoMorphicConfiguration >> styleSheet
    ^ SpStyle defaultStyleSheet, self styleSheetCommon

StPharoMorphicConfiguration >> styleSheetCommon
    "Just an example on how to build styles programmatically ;)"
    ^ SpStyleSTONReader fromString: '
.application [
    .searchInputField [
        Font { #size: 12 }
    ]
]
'
Then you can add the style to your component:
text addStyle: 'searchInputField'

How to use font in other PDF files? (itext7 PDF)

I am trying to modify a PDF file that contains ONLY text content. When I use
TextRenderInfo.getFont()
it returns a font that is actually an indirect object belonging to the source document.
pdf.inderect.object.belong.to.other.pdf.document.Copy.object.to.current.pdf.document
is thrown in this case when I close the PdfDocument.
Is there a way to reuse this font in a new PDF file? Or is there a way to edit the text content of a PDF in place (without changing the font, color, or font size)?
I'm using iText 7.
Thanks
First of all, from the error message I see that you are not using the latest version of iText, which is 7.0.2 at the moment. So I recommend that you update your iText version.
Secondly, it is indeed possible to use a font in another document. But to do that, you first have to copy the corresponding font object to that other document (as stated in the exception message by the way). But you should be warned that this approach has some limitations, e.g. in case of a font subset, you will only be able to use the glyphs that are present in the original font subset in the source document and will not be able to use other glyphs.
PdfFont font = textRenderInfo.getFont(); // font from source document
PdfDocument newPdfDoc = ... // new PdfDocument you want to write some text to
// copy the font dictionary to the new document
PdfDictionary fontCopy = font.getPdfObject().copyTo(newPdfDoc);
// create a PdfFont instance corresponding to the font in the new document
PdfFont newFont = PdfFontFactory.createFont(fontCopy);
// Use newFont in newPdfDoc, e.g.:
Document doc = new Document(newPdfDoc);
doc.add(new Paragraph("Hello").setFont(newFont));

HTML string to PDF conversion

I need to create various reports in PDF format and email them to a specific person. I managed to load an HTML template into a string and am replacing certain "custom markers" with real data. At the end I have a fully viewable HTML file. This file must now be printed to PDF, which I am able to do after following this link: https://www.appcoda.com/pdf-generation-ios/. My problem is that I do not understand how to determine the number of pages from the HTML file, as the PDF renderer requires the document to be created page by page.
I know this is an old thread, but I would like to leave this answer here. I used the same tutorial you mentioned, and here is what I did to produce multiple pages. Just modify the drawPDFUsingPrintPageRenderer method like this:
func drawPDFUsingPrintPageRenderer(printPageRenderer: UIPrintPageRenderer) -> NSData! {
    let data = NSMutableData()
    UIGraphicsBeginPDFContextToData(data, CGRect.zero, nil)
    printPageRenderer.prepare(forDrawingPages: NSMakeRange(0, printPageRenderer.numberOfPages))
    let bounds = UIGraphicsGetPDFContextBounds()
    // Half-open range, so a renderer with zero pages does not trap
    for i in 0..<printPageRenderer.numberOfPages {
        UIGraphicsBeginPDFPage()
        printPageRenderer.drawPage(at: i, in: bounds)
    }
    UIGraphicsEndPDFContext()
    return data
}
In your custom PrintPageRenderer you can read numberOfPages to get the total page count.

How do I figure out the font family and the font size of the words in a pdf document?

How do I figure out the font family and the font size of the words in a PDF document? We are trying to generate a PDF document programmatically using iText, but we are not sure how to find out the font family and the font size of the original document that needs to be reproduced. The document properties don't seem to contain this information.
Fonts are stored in dictionaries inside the file (entries of type Font). If you open a PDF as a text file, you should be able to find these entries (they begin and end with "<<" and ">>" respectively).
In a simple PDF file, I found the following:
<</Type/Font/BaseFont/Helvetica-Bold/Subtype/Type1/Encoding/WinAnsiEncoding>>
so searching for that prefix should help (in some PDF files there are spaces between the components, but '/Type /Font' should work).
Of course this is a manual process, and you would probably prefer an automatic one.
On another note, we sometimes use Identifont or WhatTheFont to identify uncommon fonts that give us problems (logo fonts).
Regards,
Guillaume
Edit: the following code finds all the fonts in the pages. In short, you search each page's dictionary for the "Resources" subdictionary and then for its "Font" subdictionary. Each entry in the latter is a font dictionary, describing a font.
PdfReader reader = new PdfReader(
        new FileInputStream(new File("file.pdf")));
int nbmax = reader.getNumberOfPages();
System.out.println("nb pages " + nbmax);
for (int i = 1; i <= nbmax; i++) {
    System.out.println("----------------------------------------");
    System.out.println("Page " + i);
    PdfDictionary dico = reader.getPageN(i);
    PdfDictionary ressource = dico.getAsDict(PdfName.RESOURCES);
    // Skip pages that declare no resources or no fonts
    PdfDictionary font = (ressource == null) ? null : ressource.getAsDict(PdfName.FONT);
    if (font == null) {
        continue;
    }
    // we got the page fonts
    Set keys = font.getKeys();
    Iterator it = keys.iterator();
    while (it.hasNext()) {
        PdfName name = (PdfName) it.next();
        PdfDictionary fontdict = font.getAsDict(name);
        PdfObject typeFont = fontdict.getDirectObject(PdfName.SUBTYPE);
        PdfObject baseFont = fontdict.getDirectObject(PdfName.BASEFONT);
        System.out.println(baseFont.toString());
    }
}
The name (the variable "name" in the code above) is what the text uses to switch fonts. In the PDF, you have to find it next to the text. The number that follows it is the size. Here, for example, the size is 12 (sorry, still no code for this part).
BT
/F13 12 Tf
288 720 Td
(the text to find) Tj
ET
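For the part that has no code, here is a rough sketch of pulling the font resource name and size out of a content stream with a regular expression. It assumes the stream has already been decompressed (streams are usually stored with a filter such as FlateDecode) and that it is passed in as bytes. `find_font_selections` and `TF_PATTERN` are made-up names. The size is only the Tf operand, so any scaling applied by the text matrix is not taken into account.

```python
import re

# A name object, a number, then the Tf operator, e.g. "/F13 12 Tf"
TF_PATTERN = re.compile(rb"/([^\s/\[\]<>()]+)\s+(-?\d*\.?\d+)\s+Tf\b")


def find_font_selections(content: bytes):
    """Return (resource_name, size) for every Tf operator in the stream."""
    return [
        (match.group(1).decode("latin-1"), float(match.group(2)))
        for match in TF_PATTERN.finditer(content)
    ]


stream = b"BT\n/F13 12 Tf\n288 720 Td\n(the text to find) Tj\nET\n"
print(find_font_selections(stream))  # [('F13', 12.0)]
```

The resource name (F13 here) is the key to look up in the page's Font dictionary to get the actual BaseFont.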
Depending on the PDF, if the text hasn't been outlined, you may be able to open it in Adobe Illustrator, double-click the text and select some of it to see its font family, size, etc.
If the text is outlined, use one of the online tools that PATRY suggests to identify the font.
Good luck
If you have Adobe Acrobat you can see the fonts inside and examine the objects and text streams. I wrote a blog post on this at http://pdf.jpedal.org/java-pdf-blog/bid/10479/Viewing-PDF-objects