Create Word/PDF files from Excel content - scripting

I have an Excel file that I want to split into several files (Word; PDF is also fine), based on content. The content looks roughly like this:

Person  Fase  Date        Item  Text
A       1     01-01-2012  Z     Lorem ipsum
A       2     01-02-2012  X     Lorem ipsum
B       1     02-01-2012  Y     Lorem ipsum
C       2     01-01-2012  Z     Lorem ipsum
I want Word/PDF documents with names like

Person_Fase.docx

with the date, item, and text as content, ideally in a table layout. Any hints/clues on how to get there? It's about 700 clients, with up to 300 Excel entries each.

A trick I have used in the past is to create HTML documents, replacing the tab-delimited data with an HTML table. This HTML file can be opened by MS Word, and maybe even renamed to .docx, even though it is HTML.
Another way to do it is to use VBScript and build proper Office documents.
Download ofhtml9.exe from http://msdn.microsoft.com/en-us/library/office/aa155477(v=office.10).aspx for more info; it describes the proprietary attributes used to format HTML files for Office.
Good Luck!
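To make the HTML trick concrete, here is a minimal Python sketch. It assumes the Excel sheet has been saved as a tab-delimited file called data.txt with the columns shown above (the file name is an assumption), groups rows by Person and Fase, and writes one HTML table per group. It uses the .doc extension rather than .docx, since Word opens plain HTML under .doc while .docx is a zip container.

import csv
from collections import defaultdict

# Group rows by (Person, Fase); data.txt is the Excel sheet saved as tab-delimited text.
groups = defaultdict(list)
with open('data.txt', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f, delimiter='\t')
    for row in reader:
        groups[(row['Person'], row['Fase'])].append(row)

# Write one HTML file per group; Word opens .doc files containing plain HTML.
for (person, fase), rows in groups.items():
    cells = ''.join(
        '<tr><td>{Date}</td><td>{Item}</td><td>{Text}</td></tr>'.format(**r)
        for r in rows
    )
    html = ('<html><body><table border="1">'
            '<tr><th>Date</th><th>Item</th><th>Text</th></tr>'
            + cells + '</table></body></html>')
    with open('{}_{}.doc'.format(person, fase), 'w', encoding='utf-8') as out:
        out.write(html)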

Related

ways to separate passages in pdf using gap?

I have some PDFs with 2-3 passages on every page. Every passage is separated by some line gap, but while reading with PyMuPDF, I cannot see any machine-readable separator between passages. Is there any other way, or another library, that can do this?
code:
import fitz  # PyMuPDF

doc = fitz.open('IT_past.pdf')
single_doc = doc.load_page(0)  # put here the page number
text = single_doc.get_text('text')
print(text)
(A screenshot of the page and a link to the full PDF accompanied the original question.)
There is no gap as such. Just for the moment, since it is much easier, let's look closer at your linked viewer rendering (screenshot not reproduced here).
So let's replicate what is inside the real PDF (which has no web-side HTML <p> markers):
support, product design, HR Management, knowledge process outsourcing for
pharmaceutical companies and large complex projects.
Software exports make up 20 % of India's total export revenue in 2003-04, up from 4.9 %
in 1997.This figure is expected to go up to 44% of annual exports by 2010. Though India
See there is "no gap" just left aligned non justified (ragged) text that needs a style such as a font name and stretched out locations added to hold in a page de-void of line feeds nor true carriage returns. (occasionally there are some backspace or vertical/horizontal moves but generally meaningless in line printer text). Even "Tabs" "Indents" and some spatial characters are normally discarded in a PDF printout.
If you want gaps or line-wrap you need to add them.
A good alternative is export the -layout using poppler or xpdf here to - (console) or pipe it or replace that with a path/name.txt, many other options available like -nopgbrk
xpdf-tools-win-4.04\bin32>pdftotext -f 1 -l 1 -layout IT_past.pdf -
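If you do want to recover the visual gaps programmatically, here is a minimal sketch using PyMuPDF's block extraction: split wherever the vertical distance between consecutive text blocks exceeds a threshold. The file name and the GAP value are assumptions you would tune per document.

import fitz  # PyMuPDF

doc = fitz.open('IT_past.pdf')
page = doc.load_page(0)

# Each block is (x0, y0, x1, y1, text, block_no, block_type); type 0 = text.
blocks = [b for b in page.get_text('blocks') if b[6] == 0]
blocks.sort(key=lambda b: b[1])  # sort top to bottom

GAP = 12  # points of vertical whitespace treated as a passage break (tune this)
passages, current, prev_bottom = [], [], None
for x0, y0, x1, y1, text, _, _ in blocks:
    if prev_bottom is not None and y0 - prev_bottom > GAP:
        passages.append('\n'.join(current))
        current = []
    current.append(text.strip())
    prev_bottom = y1
if current:
    passages.append('\n'.join(current))

for i, p in enumerate(passages, 1):
    print('--- passage %d ---\n%s\n' % (i, p))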

Arabic pdf text extraction

I'm trying to extract text from Arabic PDFs - raw data extraction, not OCR.
I tried many packages and tools and none of them worked: Python packages, PDFBox, the Adobe API, and many others. All of them failed to extract the text correctly; either they read the text left-to-right or they decode it wrongly.
Here are two samples from different tools.
sample 1:
املحتويات
7 الثانية الطبعة مقدمة
9 وتاريخه األدب -١
51 الجاهليون -٢
95 الشعر نحل أسباب -٣
149 والشعراء الشعر -٤
213 مرض شعر -٥
271 الشعر -٦
285 الجاهيل النثر -٧
sample 2:
ﺔﻴﻧﺎﺜﻟا ﺔﻌﺒﻄﻟا ﺔﻣﺪﻘﻣ
ﻪﺨﻳرﺎﺗو بدﻷا -١
نﻮﻴﻠﻫﺎﺠﻟا -٢
ﺮﻌﺸﻟا ﻞﺤﻧ بﺎﺒﺳأ -٣
ءاﺮﻌﺸﻟاو ﺮﻌﺸﻟا -٤
ﴬﻣ ﺮﻌﺷ -٥
ﺮﻌﺸﻟا -٦
ﲇﻫﺎﺠﻟا ﺮﺜﻨﻟا -٧
original text
And yes, I can copy it and get the same rendered text.
Is there any tool that can extract Arabic text correctly?
The book link can be found here.
The text in a PDF is not the same as the text used for its construction; we can see that in your example, where page 7 is shown in Arabic on the surface but is coded as 7 in the plain text.
However, a greater problem is language support in fonts: in Notepad I had to accept a script font to see a similarity, but that relies on font substitution.
Another complication is Unicode and whitespace ordering.
So the result from
pdftotext -f 5 -l 5 في_الأدب_الجاهلي.pdf try.txt
will at best look like the samples above (the original answer showed a screenshot).
Thus, in summary, your Sample 1 is equal to, if not better than, any other simple attempt.
Later edit, from B.A.'s comment below:
I found a way to work around this: after extracting the text, I open the txt file and normalize its content using the unicodedata Python module, which offers the unicodedata.normalize() function. So I can now say that pdftotext is the best tool for Arabic text extraction.
Unicode normalization should fix that issue (you can choose NFKC).
Most programming languages have a normalization function.
Check here for more info about normalization:
https://unicode.org/reports/tr15/
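As a minimal sketch of that workaround (the file names are assumptions; NFKC folds Arabic presentation forms back to their base characters):

import unicodedata

# Read the raw pdftotext output and apply NFKC normalization.
with open('try.txt', encoding='utf-8') as f:
    raw = f.read()

normalized = unicodedata.normalize('NFKC', raw)

# Write the normalized text back out.
with open('try_normalized.txt', 'w', encoding='utf-8') as f:
    f.write(normalized)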

Trouble with tabulizer library in r recognizing non-alphanumeric (symbol) characters on a table in a PDF

I am using the tabulizer library in r to capture data from a table located inside a PDF on a public website
(https://www.waterboards.ca.gov/sandiego/water_issues/programs/basin_plan/docs/update082812/Chpt_2_2012.pdf).
The example table that I am interested in is on page 23 of the PDF (p. 2-21; the document has a couple of blank pages at the beginning). The table has a non-standard format and also different symbols (non-alphanumeric characters) in the cells.
I want to extract most if not all tables from this document.
I want to end up with a table that has the symbols replaced with codes (e.g., black circles with 999, white circles with 777, plus signs with -99, etc.).
Tabulizer does a good job for the most part, converting the dark circles into consistent alphanumeric codes and keeping the plus signs, but it runs into problems on the REC1 column with white circles, which is odd since it does seem to recognize exotic characters in other columns.
Could anyone please help fix this? I also tried selecting the table area, but the output was worse. Below is the R code I am using.
I know I can complete this process by hand for all the tables in the document using the PDF reader's built-in select and export tools, but I would like to automate the process.
library("tabulizer")
f2 <- "https://www.waterboards.ca.gov/sandiego/water_issues/programs/basin_plan/docs/update082812/Chpt_2_2012.pdf"
tab <- extract_tables(f2, pages = 23, method = 'lattice')  # list with one matrix per detected table
head(tab[[1]])
df <- as.data.frame(tab[[1]])  # convert the first detected table; as.data.frame() on the whole list misbehaves
write.csv(df, file = "test.csv")

How to use "lorem" inside paragraph tag

I am just learning zen-coding and am trying to do this: entering 'lorem' and hitting Tab should produce 50 words of lorem text.
However, if I start with <p></p> and then type lorem + Tab inside the tag, it does not produce the lorem text.
Is there a better way to do this?
Use this:
p>lorem
and then tab
Since people are still reading this, I will add: if you want to print off many lorems inside of p tags, you can do this:
(p>lorem)*7
Replace 7 with however many lines you want!
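For reference, an abbreviation like (p>lorem)*2 expands to two filled paragraphs, roughly like this (the exact placeholder text Emmet generates varies):

<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. ...</p>
<p>Doloremque saepe ad quis commodi enim quos. ...</p>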
Or use this:
<p>
lorem*(however many lines you want)

SPSS tables to latex (PDF) without creating an A4 page

This may be a stupid question, but I can't figure it out.
I have made some tables in SPSS. Now I want them over to my latex document.
What I do is that I right-click the table in SPSS and press Export.
Here I can choose between PDF and .doc. BUT the PDF file created puts the table on top of a full page (A4 size, with "page 1" at the bottom). I do not want this; I only want the table.
(Screenshots of how it turns out and of how I want it to turn out were attached to the original question.)
If I export to Word, I can then save as PDF, but the same problem occurs.
A screenshot works, but does not give me the picture quality that I prefer.
Does anyone have any tips for me?
Thanks :)
Unfortunately, SPSS does not provide native table export to LaTeX. It does provide table export to HTML and XLS, which can be converted to TeX tables after the fact. PDF output always exports the full page (very annoying for graphics as well), but you probably don't want to insert an image of the table anyway (you could crop the PDF if need be); you want a TeX table, in the same font as your document.
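As a sketch of that HTML-to-TeX route (the file name and table index are assumptions, and pandas needs lxml or html5lib installed for read_html): export the table from SPSS to HTML, then convert it in Python:

import pandas as pd

# read_html returns one DataFrame per <table> in the exported file.
tables = pd.read_html('spss_export.html')
print(tables[0].to_latex(index=False))  # emit the first table as a LaTeX tabular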
One thing I have done in the past, to export text tables with specific markup, is to use the PRINT or LIST commands to print a text table to the output (or to a text file) that is closer to the end goal. In this NABBLE post I have some syntax that makes pandoc-flavored pipe-style markdown tables; it should be pretty clear how that same approach can be used for TeX tables (actually, TeX tables should be much simpler).
Here is an example of some code using LIST to make the markup closer to TeX tables.
DATA LIST FREE / Variable (A1) Mean Median (2F4.2).
BEGIN DATA
A 3.25 2.00
B 2.56 2.50
C 9.87 10.20
END DATA.
*Using LIST to make Latex style table.
STRING Mid (A1) End (A2).
COMPUTE Mid = "&".
COMPUTE End = "//".
LIST /VARIABLES = Variable Mid Mean Mid Median End.
(The original answer included a screenshot of the produced output.)
So here I would still have to copy-paste the text output into my TeX document, and make the header row.
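For reference, the copied output is meant to drop into a tabular environment roughly like this (a sketch; the header row and column alignment are written by hand):

\begin{tabular}{lrr}
Variable & Mean & Median \\
A & 3.25 & 2.00 \\
B & 2.56 & 2.50 \\
C & 9.87 & 10.20 \\
\end{tabular}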
You can also use OMS to save designated output items in a variety of formats, including XML, and then use an XML-to-LaTeX tool such as xmltex. You could probably even generate such a conversion from the XML with XSLT.
From the Viewer, you can also retrieve the table with Python scripting and use a Python-based converter tool.