I would like to populate a Word document with data from our MS SQL database.
Is this possible, and if so, how?
I have done it various ways in the past. It depends on whether the user initiates the action from OUTSIDE of Microsoft Word or from INSIDE Microsoft Word.
From INSIDE Microsoft Word, you can use one of the following techniques:
Open a template with placeholders and use VBA or VSTO to iterate over them, filling each from a data source using copy & paste. Note that tables are also a pain here. This approach resembles approach 1 under "OUTSIDE" below. Its disadvantages are that it is relatively slow (copy & paste) and that Microsoft Word's autocorrect likes to kick in when least needed.
Open a template with placeholders and use VBA or VSTO to parse the XML representation of the document and replace the placeholders directly. This is faster, but harder to write, especially since the XML representation can split a placeholder across XML fragments (such as '<<PUT_<xxx/>IT_HERE>>' and more complex cases). You also need to make sure the result remains a valid, well-balanced XML document.
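For illustration, here is a minimal sketch of the placeholder-replacement idea using Apache POI (Java) rather than VBA/VSTO; the placeholder convention and file paths are assumptions:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.util.Map;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.usermodel.XWPFParagraph;
    import org.apache.poi.xwpf.usermodel.XWPFRun;

    public class TemplateFiller {
        // Replace <<NAME>>-style placeholders in a .docx template.
        public static void fill(String inPath, String outPath,
                                Map<String, String> values) throws Exception {
            try (FileInputStream in = new FileInputStream(inPath);
                 XWPFDocument doc = new XWPFDocument(in)) {
                for (XWPFParagraph para : doc.getParagraphs()) {
                    for (XWPFRun run : para.getRuns()) {
                        String text = run.getText(0);
                        if (text == null) continue;
                        for (Map.Entry<String, String> e : values.entrySet()) {
                            text = text.replace("<<" + e.getKey() + ">>", e.getValue());
                        }
                        run.setText(text, 0);
                    }
                }
                try (FileOutputStream out = new FileOutputStream(outPath)) {
                    doc.write(out);
                }
            }
        }
    }

Note that this naive version suffers from exactly the fragmentation problem described above: Word frequently splits a placeholder across several runs, so a production version must first merge the runs of each paragraph before searching.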
From OUTSIDE Microsoft Word (such as from a web interface) you can use one of the following techniques:
Store a template somewhere as RTF (which is far easier to process than Word's own format). Put '<<PLACEHOLDER-FOR-NAME>>' or similarly easy-to-recognize text wherever you want data inserted. When the user requests the Word document, fetch the RTF, fetch the data, replace the placeholders, and serve the RTF to the user (a minimal sketch follows this list). RTF has some restrictions, but also some advantages: it is easy to create new templates, and the result also works with Microsoft WordPad and other office packages. The disadvantages are that tables are a real mess to process and that not all Microsoft Word constructs are possible; repeating rows in a table are especially troublesome. High volume can also be an issue.
Use a reporting package that happens to also output docx, doc or RTF documents, and write a report. In general this is perfectly suited for high volume. It is less suited when you want the end user to type a large amount of additional text, since reporting packages typically work in terms of pages rather than flowing text, with explicit or implicit page breaks inserted here and there. But if you only need the end user to add one or two sentences, it is sufficient.
Fat client. Put the SQL data somewhere, open Word, read the data, and then apply one of the techniques described above for working from INSIDE Microsoft Word.
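Here is the minimal sketch of the RTF technique promised above; the placeholder convention and the (simplified) escaping rules are assumptions:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Map;

    public class RtfMerge {
        // Escape characters that have special meaning in RTF.
        static String rtfEscape(String s) {
            StringBuilder sb = new StringBuilder();
            for (char c : s.toCharArray()) {
                if (c == '\\' || c == '{' || c == '}') sb.append('\\').append(c);
                else if (c > 127) sb.append("\\u").append((int) c).append('?');
                else sb.append(c);
            }
            return sb.toString();
        }

        // RTF is plain text, so the merge itself is just string replacement.
        public static void merge(String templatePath, String outPath,
                                 Map<String, String> data) throws IOException {
            String rtf = new String(Files.readAllBytes(Paths.get(templatePath)),
                                    StandardCharsets.US_ASCII);
            for (Map.Entry<String, String> e : data.entrySet()) {
                rtf = rtf.replace("<<PLACEHOLDER-FOR-" + e.getKey() + ">>",
                                  rtfEscape(e.getValue()));
            }
            Files.write(Paths.get(outPath), rtf.getBytes(StandardCharsets.US_ASCII));
        }
    }

The escaping matters: a value containing a backslash or brace would otherwise corrupt the RTF. One practical gotcha: type each placeholder into the template in one go, otherwise Word may break it up with formatting codes and a simple string replace will not find it.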
If you need to fill a Word document from SQL Server (or any other database or data platform), I can suggest the free edition of Invantive Composition for filling Word documents from the database (please note that I've been involved with that product). It opens templates and merges them from within Word, but it is targeted more at non-developers: just specify the template and the (possibly nested) data blocks and publish. Developers can only add some C# in plugins. I think it is a good product when you have MANY templates (over 50) because it scales more easily.
You may use Microsoft Query to fetch data from a SQL database into your document; this video may be useful:
https://vimeo.com/83983247
You could also try using MS Excel, as it binds to XML better than Word does.
It's also easy to make Excel produce 'Word'-styled output.
I am tasked with creating a template that business users will fill in with employee information; our program will then load this into the database using external tables.
However, our business users constantly change the template by adding, removing, or reordering fields.
I decided to use XLSX instead of CSV so that I can lock the column headers, preventing users from removing, adding, or reordering columns.
However, when I query the external table, I get non-ASCII characters when reading the XLSX, because the format is binary.
How can I do either of the following?
Effectively read Excel files from external tables?
Lock the headers of CSV files?
What you have here is a political problem, but you are looking for a technical fix. Not a good fit.
The problem comes in two halves:
Somebody decided it was a good idea to collect user input in a spreadsheet, which it is generally not.
Users are fiddling with the input format, which they should not.
Fixes are:
Strictly enforce the data structure. Reject any CSV which doesn't match and make the users edit them. They will quickly tire of tweaking the spreadsheets when they realise they're just creating more work for themselves. But they will also get resentful, so consider ...
Building a data input screen. It's pretty simple to knock up a spreadsheet-like grid UI, and you don't need anything complicated in Java: Oracle's APEX is intended for exactly this sort of thing.
However, if you are stuck with Excel as a UI, I suggest you have a look at Anton Scheffer's excellent PL/SQL as_read_xlsx package on the AMIS site. You'll probably need to replace your external table with a view over a (perhaps pipelined) table function.
I have been trying to write a simple console application or PowerShell script to extract the text from a large number of PDF documents. There are several libraries and CLI tools that offer to do this, but it turns out that none are able to reliably identify document structure. In particular I am concerned with the recognition of text columns. Even the very expensive PDFLib TET tool frequently jumbles the content of two adjacent columns of text.
It is frequently noted that the PDF format does not have any concept of columns, or even words. Several answers to similar questions on SO mention this. The problem is so great that it even warrants academic research. This journal article notes:
All data objects in a PDF file are represented in a visually-oriented way, as a sequence of operators which...generally do not convey information about higher level text units such as tokens, lines, or columns—information about boundaries between such units is only available implicitly through whitespace
Hence, all extraction tools I have tried (iTextSharp, PDFLib TET, and Python PDFMiner) have failed to recognize text column boundaries. Of these tools, PDFLib TET performs best.
However, SumatraPDF, the very lightweight and open-source PDF reader, and many others like it, can identify columns and text areas perfectly. If I open a document in one of these applications, select all the text on a page (or even the entire document with CTRL+A), and copy and paste it into a text file, the text is rendered in the correct order almost flawlessly; it only occasionally mixes the footer and header text into one of the columns.
So my question is, how can these applications do what is seemingly so difficult (even for the expensive tools like PDFLib)?
EDIT 31 March 2014: For what it's worth, I have found that PDFBox is much better at text extraction than iTextSharp (notwithstanding a bespoke Strategy implementation), and PDFLib TET is slightly better than PDFBox, but quite expensive. Python PDFMiner is hopeless. The best results I have seen come from Google: one can upload PDFs (2GB at a time) to Google Drive and then download them as text. This is what I am doing: I have written a small utility that splits my PDFs into 10-page files (Google will only convert the first 10 pages) and then stitches them back together once downloaded (a sketch of the splitting step follows).
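The splitting utility itself isn't shown here; a rough equivalent using iText 5 (Java) could look like this (output naming is invented):

    import java.io.FileOutputStream;
    import com.itextpdf.text.Document;
    import com.itextpdf.text.pdf.PdfCopy;
    import com.itextpdf.text.pdf.PdfReader;

    public class PdfSplitter {
        // Split src into chunks of at most 10 pages: prefix-1.pdf, prefix-2.pdf, ...
        public static void split(String src, String outPrefix) throws Exception {
            PdfReader reader = new PdfReader(src);
            int pages = reader.getNumberOfPages();
            int chunk = 0;
            for (int start = 1; start <= pages; start += 10) {
                chunk++;
                Document doc = new Document();
                PdfCopy copy = new PdfCopy(doc,
                        new FileOutputStream(outPrefix + "-" + chunk + ".pdf"));
                doc.open();
                for (int p = start; p < start + 10 && p <= pages; p++) {
                    copy.addPage(copy.getImportedPage(reader, p));
                }
                doc.close();
            }
            reader.close();
        }
    }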
EDIT 7 April 2014: Cancel my last. The best extraction is achieved by MS Word, and the PDF-to-Word conversion can be automated in Acrobat Pro (Tools > Action Wizard > Create New Action). Word-to-text can then be automated using the .NET OpenXml library. Here is a class that will do the extraction (docx to txt) very neatly. My initial testing finds that the MS Word conversion is considerably more accurate with regard to document structure, but this matters less once everything is converted to plain text.
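The class referred to above targets the .NET OpenXml SDK; for reference, a roughly equivalent docx-to-txt step in Java, using Apache POI's extractor (paths are placeholders), might look like:

    import java.io.FileInputStream;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;

    public class DocxToTxt {
        // Extract the plain text from a .docx produced by the Word conversion step.
        public static void convert(String docxPath, String txtPath) throws Exception {
            try (FileInputStream in = new FileInputStream(docxPath);
                 XWPFWordExtractor extractor =
                         new XWPFWordExtractor(new XWPFDocument(in))) {
                Files.write(Paths.get(txtPath),
                            extractor.getText().getBytes(StandardCharsets.UTF_8));
            }
        }
    }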
I once wrote an algorithm that did exactly what you mentioned for a PDF editor product that is still the number one PDF editor used today. There are a couple of reasons for what you mention (I think) but the important one is focus.
You are correct that PDF (usually) doesn't contain any structure information. PDF is interested in the visual representation of a page, not necessarily in what the page "means". This means in its purest form it doesn't need information about lines, paragraphs, columns or anything like that. Actually, it doesn't even need information about the text itself and there are plenty of PDF files where you can't even copy and paste the text without ending up with gibberish.
So if you want to be able to extract formatted text, you indeed have to look at all the pieces of text on the page, perhaps taking some of the line-art information into account as well, and you have to piece them back together. Usually that happens by writing an engine that looks at whitespace and then decides first what are lines, then what are paragraphs, and so on. Tables, for example, are notoriously difficult because they are so diverse.
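To make that concrete, here is a toy sketch (invented for this answer, not from any real product) of the whitespace heuristic: cluster glyphs into lines by their y position, then flag unusually large horizontal gaps as possible column boundaries. Every threshold is made up, and a real engine does far more, e.g. requiring a gap to persist across many lines before calling it a column:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class TextAssembler {
        record Glyph(char c, double x, double y) {}

        static String assemble(List<Glyph> glyphs, double lineTol, double colGap) {
            List<Glyph> sorted = new ArrayList<>(glyphs);
            // PDF y grows upward, so sort top-to-bottom, then left-to-right.
            sorted.sort(Comparator.comparingDouble((Glyph g) -> -g.y)
                                  .thenComparingDouble(g -> g.x));
            StringBuilder out = new StringBuilder();
            double lastY = Double.NaN, lastX = Double.NaN;
            for (Glyph g : sorted) {
                if (!Double.isNaN(lastY) && Math.abs(g.y - lastY) > lineTol) {
                    out.append('\n');   // vertical jump: a new line of text
                    lastX = Double.NaN;
                } else if (!Double.isNaN(lastX) && g.x - lastX > colGap) {
                    out.append('\t');   // large horizontal gap: possible column boundary
                }
                out.append(g.c);
                lastY = g.y;
                lastX = g.x;
            }
            return out.toString();
        }
    }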
Alternative strategies could be to:
Look at some of the structure information that is available in some PDF files. Some PDF/A files and all PDF/UA files (PDF for archival and PDF for Universal Accessibility) must have structure information that can very well be used to retrieve structure. Other PDF files may have that information as well.
Look at the creator of the PDF document and have specific algorithms to handle those PDFs well. If you know you're only interested in Word or if you know that 99% of the PDFs you will ever handle will come out of Word 2011, it might be worth using that knowledge.
So why are some products better at this than others? Focus I guess. The PDF specification is very broad, and some tools focus more on lower-level PDF tasks, some more on higher-level PDF tasks. Some are oriented towards "office" use - some towards "graphic arts" use. Depending on your focus you may decide that a certain feature is worth a lot of attention or not.
Additionally, and that may seem like a lousy answer, but I believe it's actually true, this is an algorithmically difficult problem and it takes only one genius developer to implement an algorithm that is much better than the average product on the market. It's one of those areas where - if you are clever and you have enough focus to put some of your attention on it, and especially if you have a good idea what the target market is you are writing this for - you'll get it right, while everybody else will get it mediocre.
(And no, I didn't get it right back then when I was writing that code - we never had enough focus to follow-through and make something that was really good)
To properly extract formatted text a library/utility should:
Retrieve correct information about properties of the fonts used in the PDF (glyph sizes, hinting information etc.)
Maintain graphics state (i.e. non-font parameters like text and page scaling etc.)
Implement some algorithm to decide which symbols on a page should be treated like words, lines or columns.
I am not really an expert in products you mentioned in your question, so the following conclusions should be taken with a grain of salt.
The tools that do not draw PDFs tend to have less expertise in the first two requirements. They do not have to deal with font details on a deeper level, and they might not be that well tested in maintaining graphics state.
Any decent tool that translates PDFs to images will probably become aware of its shortcomings in text positioning sooner or later. And fixing those will help to excel in text extraction.
We have a large database with hundreds of tables that have lookup text stored in them, with spelling errors from when the application was originally written offshore. Is there a way to run a spellcheck on the data in SQL Server so we can find all of these errors quickly?
One thought - You could write a CLR function to access the spell checker included with Microsoft Word. See: Walkthrough: Accessing the Spelling Checker in Word for a starting point.
Export your data to Excel (a scripted export sketch follows this list).
Distribute the sheets to multiple people to break up the workload.
Have them run spell check to identify the misspelled words.
Gather up the bad records and create UPDATE statements to correct them in the db.
Don't offshore applications where English spelling is important.
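If you go the Excel route, the export step can itself be scripted. Here is a minimal JDBC sketch (Java) that dumps distinct lookup texts to a CSV for spell checking; the connection string, table, and column names are invented:

    import java.io.PrintWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class LookupTextExport {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:sqlserver://localhost;databaseName=MyDb;integratedSecurity=true";
            try (Connection con = DriverManager.getConnection(url);
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT DISTINCT lookup_text FROM dbo.SomeLookupTable");
                 PrintWriter out = new PrintWriter("lookup_texts.csv", "UTF-8")) {
                while (rs.next()) {
                    // Quote values so embedded commas don't break the CSV.
                    out.println("\"" + rs.getString(1).replace("\"", "\"\"") + "\"");
                }
            }
        }
    }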
EDIT: I should have pointed out that while you can use the spell check to identify the issues, you want human eyes on the actual data as well as on the suggested fixes. Thus, being able to spread the spell-check work around is important. There won't be any fully automated solution that catches everything; or worse, it will catch too much and muck the data up even more.
Just stumbled across this one.
Another way to do this is with Microsoft Access: link to the tables on the SQL Server side using ODBC, then open each table in Access and run the spell check there; you can apply the corrections on the fly.
I am primarily a web developer, mostly working with PHP, MySQL, and JavaScript. I was recently contacted by a local Sheriff's Office (small town word of mouth, nerds are always needed) to digitize a 4 page monstrosity of a form... because nobody could read the handwriting of the deputies.
The catch here is that this is a small town department and, while they are fancy enough to carry computers in the field, they are not connected to the Internet. Visual Basic was the first solution that came to mind and I have been scrambling to learn the basics. I am confident in my ability to organize the content of the form and perform any necessary validation but I am unsure where to begin in terms of storing each report locally (database) and printing the end result.
Another matter that complicates things is that they want the end result to look exactly the same as the original form, only typed instead of handwritten.
So, to sum things up, here are the questions I have:
There seem to be several options for databases in VB 2010 Express. What is the best option for LOCAL storage of records?
It looks as though the best way to format the form exactly the way they want, with populated data, would be to create a form within the application with just this content on it. Is this the best solution, or might there be a better way, possibly outputting to another file? And if the data is put on another form, how would I go about printing it?
Many thanks!
The word "best" is of course subjective, so instead I'll give you some pros and cons for a database.
SQL Server Express is a really awesome database to work with that acts almost exactly like the big paid version. Some high-level things like replication and encryption aren't supported, but you probably have no need for those. I've built many websites that target it with zero performance problems. The downside of SQL Server Express is that you need to install it on every machine, and it pretty much needs to be running all the time. It doesn't "weigh" a whole lot, but it's still going to be running in the background 24/7. If you create an installer from within Visual Studio/VB Express (which you should), you can check it as a prerequisite and the installer will pretty much take care of it for you. Also, because SQL Server is a major security target, you are opening up potential security issues that you should be aware of.
SQLite would be another great choice; there are some great .NET wrappers available. If you're used to SQL Server or MySQL you might find SQLite limiting, but you get used to it. SQLite doesn't run a separate database engine or service; its goal is to be a very lightweight, open-source, embedded SQL database system.
The third option I'd recommend is just writing to an XML file. Simple: no engine, no tables, no third-party anything, just raw text that anyone can parse if something breaks. EDIT: VB.Net also has some wonderful built-in XML support, such as XML literals:
    Dim MyXml = <Person>
                    <FirstName><%= txtFirstName.Text %></FirstName>
                    <LastName><%= txtLastName.Text %></LastName>
                </Person>
For the form generation, I'd recommend using something like iTextSharp. (Free, but make sure you check that its license is compatible with yours.) Take their actual form (or create a PDF of it), use Acrobat or something similar to turn it into a "PDF Form", and then just use iTextSharp to fill in the form. There's a bunch of support on this site if you've got any questions about it.
I have a PDF file (not created by me - I have no control over the design etc.) which allows users to fill in some form fields in Adobe Reader and save the result. I want to automate the process of populating the fields, using the following steps:
Fetch data from database.
Open PDF template.
Populate form fields with data.
Save modified file to a separate location on disk.
Lock modified file so that the form fields can no longer be edited.
Send file to user.
I'm happy to use PHP, Perl, Python or Java to do steps 2-5 (in descending order of preference), but whatever I use has to work under Linux (i.e. it mustn't rely on libraries which are only available on Windows for example).
The end result should be a PDF which the average user can open and print, but not modify (I'm sure advanced users could find a way to do so, but I accept that I can't guarantee complete security against modification). I don't want to change the structure of the PDF, merely populate the form fields.
Is there a standard piece of software for doing this? I've seen mentions of FDF Toolkit, but I'm not entirely sure if that's what I want and whether it will allow me to lock the file afterwards, and whether what I want to do fits in with the EULA.
Edit: Final answer is to use iText (as suggested by Mark Storer) but to implement it as a web service which allows you to pass in an array of form field names and values and the PDF file 'template'. The web service will be open source (and available on GitHub once I've written it), as per the AGPL, but anything connecting to it won't have to be.
Filling
Any number of different libraries can fill in field values. I'm partial to iText (Java) or iTextSharp (C#). (I wrote one in Java a number of years ago. It's not that hard.) There are lots. Search SO, you'll find 'em.
Locking
There are a couple different levels of "lock the fields".
Each field has a "read only" flag. This is pretty much a courtesy as far as other libraries capable of setting field values are concerned. In fact, it's generally considered to mean "the UI cannot make changes"; form script can, regardless.
Form flattening: drawing the field contents directly into the page and removing all the interactivity.
Each one has pros and cons.
Flag: None too secure. Form data still easily accessible. Scrolling fields still scroll.
Flattening: pretty much the exact opposite. It's harder to modify (though far from impossible). The form data can only be extracted via text extraction (which is hard, but becoming increasingly common). List & text fields that contain more than is visible will no longer scroll.
The ability to flatten forms is relatively rare. Again, iText can do it (as can iTextSharp), but I'm not aware of any other third party libraries that can... I'm sure they exist, I just can't name them off the top of my head.
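For concreteness, here is a minimal iText 5 (Java) sketch of both steps, filling and then flattening; the field names are invented (AcroFields.getFields() will list the real ones in your template):

    import java.io.FileOutputStream;
    import com.itextpdf.text.pdf.AcroFields;
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.PdfStamper;

    public class FormFiller {
        public static void fillAndFlatten(String template, String out)
                throws Exception {
            PdfReader reader = new PdfReader(template);
            PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(out));
            AcroFields form = stamper.getAcroFields();
            form.setField("name", "Jane Doe");        // hypothetical field names
            form.setField("caseNumber", "2014-0042");
            stamper.setFormFlattening(true);  // draw values into the page, drop interactivity
            stamper.close();
            reader.close();
        }
    }

If you only want the read-only flag rather than full flattening, iText's AcroFields.setFieldProperty can set the field flags instead.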