Export MFC CView into vectorial format - pdf

With my MFC application, I am able to render my CDocument on screen using the CView class.
Basically, I use the CDC class to write text and draw polygons on screen to provide a view representation of my document.
Now let's say I would like to use that output view in Microsoft Word.
From a user's point of view, and without any further development work, I can:
copy-paste my drawing into Word: this produces a raster BMP image that I can paste into Word
print my drawing and use a PDF exporter: this produces a vector PDF file that is light and zoomable, but not easy to reuse in Word.
These two effortless solutions are great because they keep the exact layout of my view, but each has a drawback (raster output or an inconvenient format).
Another way to solve my problem would be to write SVG or VML, but I would not get the same layout and it would require a lot of work.
Is there a library that offers the same kind of export/print mechanism as the PDF route, but into a standard format?
What would you suggest? Thanks a lot.

To draw your view into an Enhanced Metafile (EMF), first read the documentation on MSDN:
http://msdn.microsoft.com/en-us/library/427wezx1%28v=VS.80%29.aspx
Here is an example of how this works:
CMetaFileDC MFDC;
// Bounding rectangle of the picture, in HIMETRIC units (0.01 mm)
CRect rect(0, 0, width, height);
MFDC.CreateEnhanced(NULL, NULL, rect, NULL);
MFDC.SetBkMode(TRANSPARENT);
MFDC.SetMapMode(MM_HIMETRIC);
// Attribute DC, used for queries such as text metrics
CDC tempDC;
tempDC.CreateCompatibleDC(&MFDC);
MFDC.SetAttribDC(tempDC.m_hDC);
// now draw into the DC as if it were your original view
HENHMETAFILE hEnhMetaFile = MFDC.CloseEnhanced();
// Save a copy of the in-memory metafile to disk (_T keeps this valid in Unicode builds)
HENHMETAFILE hEMF = CopyEnhMetaFile(hEnhMetaFile, _T("C:\\Temp\\Test.emf"));
DeleteEnhMetaFile(hEMF);
DeleteEnhMetaFile(hEnhMetaFile);

Tabulator - formatting print and PDF output

I am a relatively new user of Tabulator so please forgive me if I am asking anything that, perhaps, should be obvious.
I have a Tabulator report that I am able to print and create as a PDF, but the report's formatting (as shown on the screen) is not used in either output.
For printing I have used printAsHtml and printStyled=true, but this doesn't produce a printout that matches what is on the screen. I have formatted number fields (with comma separators) and these are showing correctly, but the number columns, which should be right-aligned, appear left-aligned like all the other columns.
I am also using Tree View where the tree rows are coloured differently to the main table, but when I print the report with a tree open it colours the whole table with the tree colours and not just the tree.
For the PDF none of the Tabulator formatting is being used. I've looked for anything similar to the printStyled option, but I can't see anything. I've also looked at the autoTable option, but I am struggling to find what to use.
I want to format the print and PDF outputs so that they look as close to the screen representation as possible.
Is there anywhere I could look that would provide examples of how to achieve the above? The Tabulator documentation is very good, but the provided examples don't appear to explain what I am trying to do.
Perhaps there are CSS classes that I am missing or even misusing? I have tried including .tabulator-print-table in my CSS, but I am probably not using it correctly. I also couldn't find anything equivalent for producing PDFs. Some examples would help immensely.
Thank you in advance for any advice or assistance.
Formatting is deliberately not included in these outputs; I will outline why below:
Downloaders
Downloaded files do not contain formatted data, only the raw data. This is because a lot of the formatters create visual elements (progress bar, star formatter etc.) that cannot be replicated sensibly in downloaded files.
If you want to change the format of data in the download you will need to use an accessor; the accessorDownload option is the one you want in this case. Accessors transform the data as it is leaving the table.
For instance we could create an accessor that prepended "Mr " to the front of every name in a column:
var mrAccessor = function(value, data, type, params, column, row){
    return "Mr " + value;
};
Assign it to a column definition:
{title:"Name", field:"name", accessorDownload:mrAccessor}
Printing
Printing does not include the formatters either. When you print a Tabulator table, the whole table is actually rebuilt as a standard HTML table, which allows the printer to work out how to lay everything out across multiple pages with column headers etc. The downside is that it is only loosely styled like a Tabulator table, so formatted contents generated inside Tabulator cells will likely break when added to a normal td element.
For this reason there is also an accessorPrint option that works in the same way as the download accessor, but for printing.
If you want to use the same accessor for both occasions, you can assign the function once to the accessor option and it will be applied in both instances.
Check out the Accessor Documentation for full details.
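Since the question mentions comma-separated numbers, here is a minimal sketch of an accessor that applies thousands separators on the way out of the table. The names withThousands and thousandsAccessor, and the "amount" column, are made up for illustration; only the accessorDownload and accessorPrint options come from the explanation above. It only changes the cell text and does not affect alignment or colours.

```javascript
// Hypothetical helper: adds comma thousands separators to the integer part of a number.
function withThousands(value) {
    var parts = String(value).split(".");
    parts[0] = parts[0].replace(/\B(?=(\d{3})+(?!\d))/g, ",");
    return parts.join(".");
}

// Hypothetical accessor with the same signature as mrAccessor above.
// Non-numeric values are passed through untouched.
function thousandsAccessor(value, data, type, params, column, row) {
    return typeof value === "number" && isFinite(value) ? withThousands(value) : value;
}

// Hypothetical column definitions: the same function is used for downloads and printing.
var columns = [
    {title: "Name", field: "name"},
    {title: "Amount", field: "amount", accessorDownload: thousandsAccessor, accessorPrint: thousandsAccessor}
];
```

The columns array would then be passed as the columns option when the table is constructed.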

Looking for software or API that will give me co-ordinates of text in a pdf

Simple question I hope - I have a pdf and want to detect the co-ordinates of specific word(s) or placeholder text. I then intend to use itextsharp to stamp a replacement bit of text on top at the co-ordinates found.
Can anyone recommend anything please?
Thanks
As answered in the comments, you could use iText for this task. There may be better solutions, but I doubt it. The cause of the issue you mention, i.e. "[itextsharp] sometimes give co-ords of the start of the sentence the search text is in", is that glyphs are sometimes so close together that their boxes overlap, so I don't see how it could be handled the way you want.
So you can do the following:
extend LocationTextExtractionStrategy class and override eventOccurred, for example, as follows:
@Override
public void eventOccurred(IEventData data, EventType type) {
    if (type.equals(EventType.RENDER_TEXT)) {
        TextRenderInfo renderInfo = (TextRenderInfo) data;
        // Obtain all the necessary information from renderInfo, for example
        LineSegment segment = renderInfo.getBaseline();
        // ...
    }
}
pass an instance of such an extended class to PdfTextExtractor.getTextFromPage as follows:
PdfTextExtractor.getTextFromPage(pdfDocument.getPage(1), new ExtendedLocationTextExtractionStrategy());
once text is found, the event will be triggered.
There are some difficulties in such a solution, of course, because the text you want to find and write over could be present in the PDF not as "Text", but as "T", "ex", "t", or even "t", "x", "e", "T". However, since you use iText, you may want to take advantage of one of its products, pdfSweep. This product aims to completely remove unnecessary content from a PDF, with the content to remove specified either as locations (which is what you are trying to obtain, so that is not an option) or as regexes.
This is how to create such a regex strategy, which finds all "Dolor" and "dolor" instances in the document and completely removes them (from all the streams, so that they are neither visible in a PDF viewer nor found in the underlying PDF objects):
RegexBasedCleanupStrategy strategy = new RegexBasedCleanupStrategy("(D|d)olor").setRedactionColor(ColorConstants.GREEN);
This is how to use it:
PdfAutoSweep autoSweep = new PdfAutoSweep(strategy);
autoSweep.cleanUp(pdf); // a PdfDocument instance
And this is how to write some text on the location, at which the unnecessary text was present:
for (IPdfTextLocation location : strategy.getResultantLocations()) {
    Rectangle rect = location.getRectangle();
    // do something, for example, write some text
}
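To illustrate the chunk-splitting difficulty mentioned above ("T", "ex", "t"), here is a rough, library-independent sketch written in JavaScript rather than against the iText API. It assumes you have already collected text chunks with their boxes, for instance from a custom extraction strategy, and that they are in reading order; the record shape {text, x, y, width, height} and the function name locateText are made up for illustration. It joins the chunks, searches the joined string, and returns the union of the boxes of the chunks that overlap the match.

```javascript
// chunks: hypothetical records {text, x, y, width, height}, assumed to be in reading order.
function locateText(chunks, needle) {
    var joined = chunks.map(function (c) { return c.text; }).join("");
    var start = joined.indexOf(needle);
    if (needle.length === 0 || start < 0) {
        return null; // nothing to search for, or not found
    }
    var end = start + needle.length;
    var offset = 0;
    // Keep every chunk whose character range overlaps the match
    var hits = chunks.filter(function (c) {
        var from = offset;
        offset += c.text.length;
        return offset > start && from < end;
    });
    // Union of the bounding boxes of the matching chunks
    var left = Math.min.apply(null, hits.map(function (c) { return c.x; }));
    var bottom = Math.min.apply(null, hits.map(function (c) { return c.y; }));
    var right = Math.max.apply(null, hits.map(function (c) { return c.x + c.width; }));
    var top = Math.max.apply(null, hits.map(function (c) { return c.y + c.height; }));
    return {x: left, y: bottom, width: right - left, height: top - bottom};
}
```

Note that the result is only as precise as the chunks: if a chunk contains more than the search text, the returned box covers the whole chunk. Chunks that arrive out of reading order would have to be sorted first.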

Hiding text but not matplotlib plots in IPython notebooks

Sometimes I have a long notebook (essentially a lab-notebook) with lots of text, headings, plots etc. What I'd like is to be able to filter out all the text and just show the plots, so that I can quickly get an overview of what's in the notebook or find that one plot I want but I can't remember exactly where I put it. There's enough text in the notebooks that it takes a long while to scroll through it all. I'm aware that it's possible with an extension to hide input cells, which helps somewhat, but often there's a lot of text in the outputs too. The matplotlib plots are typically 'inline', so that they are just embedded pngs. Thus it should be sufficient to just hide text while preserving images.
I've looked through the extension index but haven't found anything appropriate. I'm guessing I could achieve something like this using an nbconvert template, or some javascript, but perhaps someone has a good way already.
Depending on the type of text output, you could hide the output_text class via your custom.js. To do this, add the following lines to your custom.js:
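define([
    'base/js/namespace',
    'base/js/events'
],
function(IPython, events) {
    events.on("app_initialized.NotebookApp",
        function () {
            // Build the menu entry with its own id (so it cannot clash with existing View menu entries)
            var item = $("<li id=\"toggle_text_output\" title=\"Show/Hide text output\"><a href=\"#\">Toggle Text Output</a></li>");
            // Without a click handler the entry would do nothing
            item.click(text_toggle);
            $("#view_menu").append(item);
        }
    );
}
);
var text_show = true;
// Show or hide all text outputs, leaving image outputs visible
function text_toggle() {
    if (text_show){
        $('div.output_text').hide();
    } else {
        $('div.output_text').show();
    }
    text_show = !text_show;
}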
This adds a menu entry in the view menu to toggle the visibility of the output_text class.
Of course, if you have some text in markdown cells, these are not hidden. If required, it should be straightforward to adapt the above code to hide the input cells (class input), the markdown cells (class text_cell), etc.

Extract only the text from PDF files with CGPDFScanner

There are a number of questions (some answered, others not) about extracting simple text from PDF files. Stack Overflow answers have helpfully pointed out that the Adobe PDF documentation is clear about how to detect objects during parsing: i.e. one should use the 'BT' and 'ET' operators from the PDF reference to construct the callbacks when using CGPDFScanner.
The Apple documentation shows a callback example:
static void op_BT (CGPDFScannerRef s, void *info) {
    const char *name;
    if (!CGPDFScannerPopName(s, &name))
        return;
    printf("BT /%s\n", name);
}
And, among other CGPDFScanner commands, the above callback is set up by first creating:
myTable = CGPDFOperatorTableCreate();
CGPDFOperatorTableSetCallback (myTable, "BT", &op_BT);
All good so far, but the Apple documentation doesn't appear to help low-to-intermediate programmers like me understand the next step: beyond identifying the text block (presumably between the BT and ET callbacks?), what few steps/lines are needed during/in/outside the callback to capture the identified text block into an NSString?
Many thanks.
The first thing you should do is download the PDF reference. These days that's an ISO standard, but you can download the Acrobat SDK (http://www.adobe.com/devnet/acrobat.html) which contains an Adobe copy that will serve you just as well.
Read chapter 9. It'll teach you that on the one hand you need to understand text operators (Tj, ', ", TJ) and on the other hand you need to understand fonts and encodings.
The text operators are the operators that you can intercept that add "strings" to the PDF document; while all text operators must appear between BT and ET blocks, intercepting these BT and ET blocks by itself isn't going to do much for you I think.
Fonts are important because they will define how the bytes used by those operators correspond to actual (Unicode) characters. So if you want to derive the meaning of the bytes you get from the PDF file, you need to know how to use fonts to derive that meaning.
Some additional points:
Don't assume BT and ET correspond to an actual text block or paragraph as you may know it from an application such as InDesign or Word. One text block may contain a whole page or a single character (or nothing).
There are also text state operators that determine how the text is going to be shown on the page. There are ways for example to draw invisible text; you may or may not wish to extract that type of text. If you don't, you'll need to support enough text state operators that you can tell the difference.
Not a small task :)
Update after looking at sample PDF
Because in comments the question was refined to indicate text extraction of a specific type of PDF file, let me add a little additional information.
1) Looking at the PDF file you reference, you won't be able to skip the font/encoding problem. The fonts in the sample PDF file are subsetted which means that you don't have "cleartext" in the PDF page description but instead indexes that have to be mapped through the encoding of the fonts used to get meaningful text.
2) Extracting the text is possible, if you look at the following output from pdfToolbox (warning, I'm affiliated rather heavily with this tool):
<page id="33">
<words>
<word txt="Senator">
<parts>
<part tlh="28.3481" tlv="868.534" trh="55.4455" trv="868.534" blh="28.3481" blv="859.902" brh="55.4455" brv="859.902"></part>
</parts>
</word>
<word txt="House,">
<parts>
<part tlh="57.5305" tlv="868.534" trh="82.123" trv="868.534" blh="57.5305" blv="859.902" brh="82.123" brv="859.902"></part>
</parts>
</word>
<word txt="85">
<parts>
<part tlh="84.208" tlv="868.534" trh="92.548" trv="868.534" blh="84.208" blv="859.902" brh="92.548" brv="859.902"></part>
</parts>
</word>
There are undoubtedly other tools which can give a similar (or better) result, so extracting the text by itself should be doable.
The big problem is going to be finding the text you're interested in, in the right order. The extraction I used here gives the text of each "word" and its position (bounding box) on the page. Looking through the XML, once you get to the table the challenge is going to be working out which text belongs to which table cell, where rows and columns end, etc.
In a way this problem is harder than simply detecting lines of text, because you're dealing with a pretty dense table: where my problem was largely one-dimensional (gathering everything on the same line), this problem is two-dimensional.
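To make the one-dimensional part concrete, here is a minimal sketch (in JavaScript, not tied to any PDF library) of gathering words onto lines from their bounding boxes. It assumes each word has been reduced to a record {txt, tlh, tlv}, named after the attributes in the XML above, and that larger tlv values are higher on the page, as the sample coordinates suggest. The function name groupIntoLines and the tolerance parameter are made up for illustration. Splitting each line into table cells would additionally need column boundaries, which this sketch does not attempt.

```javascript
// words: hypothetical records {txt, tlh, tlv} taken from the top-left corner of each word box.
// tolerance: maximum vertical distance for two words to count as being on the same line.
function groupIntoLines(words, tolerance) {
    // Sort top-to-bottom first (larger tlv is assumed to be higher on the page), then left-to-right
    var sorted = words.slice().sort(function (a, b) {
        return b.tlv - a.tlv || a.tlh - b.tlh;
    });
    var lines = [];
    sorted.forEach(function (word) {
        var last = lines[lines.length - 1];
        if (last && Math.abs(last.tlv - word.tlv) <= tolerance) {
            last.words.push(word); // close enough vertically: same line
        } else {
            lines.push({tlv: word.tlv, words: [word]}); // start a new line
        }
    });
    // Within each line, order the words left-to-right and join their text
    return lines.map(function (line) {
        return line.words
            .sort(function (a, b) { return a.tlh - b.tlh; })
            .map(function (w) { return w.txt; })
            .join(" ");
    });
}
```

A sensible tolerance depends on the font size; a fraction of the word height (about 8.6 units in the sample) would be a reasonable starting point.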

Set histogram ticks/label using Syntax

Let me preface this by saying that I am a programmer by trade, but not very familiar with SPSS.
I am helping a friend set up some histogram plots using SPSS Syntax language. Using the Chart Builder, we have arrived at the code below:
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=OurVariable MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE
TEMPLATE=[
"C:\some\path\greenHistogram.sgt"].
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: OurVariable=col(source(s), name("OurVariable"))
GUIDE: axis(dim(1), label("OurVariable"))
GUIDE: axis(dim(2), label("Frequency"))
GUIDE: text.title(label("Bla bla",
"bla"))
ELEMENT: interval(position(summary.count(bin.rect(OurVariable, binStart(0.5)))),
shape.interior(shape.square))
END GPL.
As you can see, she would like to make the histogram columns green. We could not achieve that using the Chart Builder, but we could easily make a template via the Chart Editor window and apply that. This seems like a very sensible approach, as she has many charts she wants green.
She would also like to customize the y-axis labels (number of decimal places, tick "major increment" etc.). This can also be achieved using the Chart Editor and saving a template. However, this is a much more individualized edit, and making a custom template for each and every plot seems cumbersome. Is it possible to adjust these things directly in the Syntax-script which generates the plots?
In many other places there is a nice Paste button which generates the necessary code, but I could not find one in the Chart Editor.