How to emit a blank line using yaml-cpp

Using yaml-cpp, version 0.2.5 ...
I would like to emit a blank line between entries in a list (for readability purposes).
Is this possible?
I have tried experimentation with the Verbatim, and Null manipulators, but have not had success.

As of revision 420, this is possible:
YAML::Emitter emitter;
emitter << YAML::BeginSeq;
emitter << "a" << YAML::Newline << "b" << "c";
emitter << YAML::EndSeq;
produces
---
- a

- b
- c
I decided to go with YAML::Newline instead of YAML::Newline(n) since I found
I usually just wanted a single newline.
Besides, with the YAML::Newline(n) form I kept accidentally typing YAML::Newline without the argument, which implicitly cast the function to a bool (specifically, true), so I figured other people would probably make the same mistake.
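If you do want more than one blank line, streaming the manipulator repeatedly should give you one per Newline; treat this as a sketch, since I haven't specifically tested consecutive Newlines against this revision:
YAML::Emitter emitter;
emitter << YAML::BeginSeq;
// assumption: consecutive Newline manipulators are not collapsed
emitter << "a" << YAML::Newline << YAML::Newline << "b";
emitter << YAML::EndSeq;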
If you just want to patch your copy (and not sync with the trunk), use revisions 418-420. (Note: it's slightly trickier than the patch in the link you posted, since you have to be careful about a newline after an implicit key. See the comment on the (now closed) Issue 77 for more details.)

Related

Raku operator for 2's complement arithmetic?

I sometimes use this:
$ perl -e "printf \"%d\", ((~18446744073709551592)+1)"
24
I can't seem to do it with Raku. The best I could get is:
$ raku -e "say +^18446744073709551592"
-18446744073709551593
So: how can I make Raku give me the same answer as Perl?
Gotta go with (my variant¹ of) Liz's custom op (in her comment below).
sub prefix:<²^>(uint $a) { (+^ $a) + 1 }
say ²^ 18446744073709551592; # 24
My original "semi-educated wild guess"² that turned out to be acceptable to #zentrunix and the basis for Liz's op:
say (+^ my uint $ = 18446744073709551592) + 1; # 24
\o/ It works!³
Footnotes
¹ I flipped the two character op because I wanted to follow the +^ form, have it sub-vocalize as "two's complement", and avoid it looking like ^2.
² One line of thinking was about the particular integer. I saw that 18446744073709551592 is close to 2**64. Another was that integers are limited precision in Perl unless you do something to make them otherwise, whereas in Raku they are arbitrary precision unless you do something to make them otherwise. A third line of thinking came from reading the doc for prefix +^ which says "converts the number to binary using as many bytes as needed" which I interpreted as meaning that the representation is somehow important. Hmm. What if I try an int variable? Overflow. (Of course.) uint? Bingo.
³ I've no idea if this solution is right for the wrong reasons. Or even worse. One thing that's concerning is that uint in Raku is defined to correspond to the largest native unsigned integer size supported by the Raku compiler used to compile the Raku code. (Iirc.) In practice today this means Rakudo and whatever underlying platform is being targeted, and I think that almost certainly means C's uint64_t in almost all cases. I imagine perl has some similar platform dependent definition. So my solution, if it is a reasonable one, is presumably only portable to the degree that the Raku compiler (which in practice today means Rakudo) agrees with the perl binary (which in practice today means P5P's perl) when run on some platform. See also #p6steve's comment below.
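For what it's worth, one way to sidestep the native-uint question entirely is to do the 64-bit reduction explicitly (a sketch; the choice of 64 bits is itself an assumption, matching the width Perl happens to use here):
raku -e 'say ((+^18446744073709551592) + 1) % 2**64'          # 24
raku -e 'say ((+^18446744073709551592) + 1) +& (2**64 - 1)'   # 24
The first reduces the complemented value modulo 2**64; the second masks it to the low 64 bits. Both give the same 24 without relying on how wide a native uint is.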
'Long-hand' answer:
raku -e 'put ( (18446744073709551592.base(2) - 0b1).comb.map({!$_.Int+0}).join.parse-base(2));'
OR
raku -e 'say 18446744073709551592.base(2).comb.map({!$_.Int+0}).join.parse-base(2) + 1;'
Sample Output: 24
The answers above (should?) implement "Two's-Complement" encoding directly. Neither uses Raku's +^ twos-complement operator. The first one subtracts one from the binary representation, then inverts. The second one inverts first, then adds one. Neither answer feels truly correct, yet the same answer as Perl5 is obtained (24).
Looking at the Raku Docs page, one would conclude that the "twos-complement" of a positive number would be negative, hence it's not clear what the Perl (and now Raku) answers represent. Hopefully the foregoing is somewhat useful.
https://docs.raku.org/routine/+^

With Jison, how do I scan right shift operator and nested generic type definitions

I'm working on a grammar for a language that supports the right shift operator and generic types. For example:
function rectangle(): Pair<Tuple<Float, Float>> {
    let x = 0 >> 2;
}
My problem is, during scanning, the right shift operator is correctly tokenized, but the >> in Pair<Tuple<Float, Float>> becomes a single >> token instead of two separate > tokens (unless I add a space). This is because I have the >> before the > in my .jison file:
">>" { return '>>' }
">" { return '>' }
Is there a good way to resolve this in Jison? I feel like this is a common problem as my syntax is similar to every other C-style language, but I haven't found a solution to it yet (besides writing a pre-scan script that manually space-delimits the >s).
The easiest solution is to just not recognize >> as a single token in the lexer. Instead, in your parser, recognize two consecutive > tokens as a right shift, and then check to make sure there's nothing (no whitespace or comments) between them (and give a syntax error if there is).
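One way that can look in a .jison grammar, as a sketch (the rule and token names here are placeholders, and the adjacency test assumes Jison's location tracking, where @2/@3 carry line and column information):
%lex
%%
\s+     { /* skip whitespace; the parser checks adjacency via locations */ }
">"     { return '>'; }
/lex

%%

shift_expr
    : additive '>' '>' additive
        {
            /* the two '>' tokens must touch: same line, no gap */
            if (@2.last_line !== @3.first_line || @2.last_column !== @3.first_column) {
                throw new Error("whitespace or comments are not allowed inside '>>'");
            }
            $$ = { op: '>>', left: $1, right: $4 };
        }
    ;
With only '>' in the lexer, Pair<Tuple<Float, Float>> tokenizes as separate '>' tokens for the generic-type rules, while 0 >> 2 still parses as a shift because its two '>' tokens are adjacent.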

type of literal words

I was reading Bindology and tried this:
>> type? first ['x]
== lit-word!
>> type? 'x
== word!
I expected type? 'x to return lit-word! too. Appreciate any insights.
A LIT-WORD!, if seen in a "live" context by the evaluator, resolves to the word itself. It can be used to suppress evaluation with just a single token when you want to pass a WORD! value to a function. (Of course, in your own dialects, when you are playing the role of "evaluator", it's a Tinker-Toy and you can make it mean whatever you want.)
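For example, that is what lets you hand a word to a function such as SET without it being evaluated first:
>> set 'x 10
== 10
>> x
== 10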
Had you wanted to get an actual LIT-WORD! you would have to somehow suppress the evaluator from turning it into a WORD!. You noticed that can be achieved by picking it out of an unevaluated block, such as with first ['x]. But the more "correct" way is to use quote 'x:
>> type? quote 'x
== lit-word!
Beware an odd bug known as "lit-word decay":
>> x-lit: quote 'x
>> type? x-lit
== word!
That has been corrected in Red and is pending correction in Rebol. Until then you have to use a GET-WORD! to extract a lit-word value from the variable holding it:
>> x-lit: quote 'x
>> type? :x-lit
== lit-word!
(You may have already encountered this practice as the way of fetching the value of a word vs. "running" it through the evaluator...as when you want to deal with a function's value vs. invoking it. It should not be necessary on values holding lit-word!. Accident of history, it would seem.)

PostScript mark token

In PostScript if you have
[4 5 6]
you have the following tokens:
mark integer integer integer mark
The stack goes like this:
| mark |
| mark | integer |
| mark | integer | integer |
| mark | integer | integer | integer |
| array |
Now my question:
Is the ]-mark operator a literal object or an executable object?
Am I correct that the [-mark is a literal object (just data) and that the ]-mark is an executable object (because you always need to create an array when you see this ]-mark operator) ?
PostScript Language Reference Manual section 3.3.2 gives me:
The [ and ] operators, when executed, produce a literal array object with the enclosed objects as elements. Likewise, << and >> (LanguageLevel 2) produce a literal dictionary object.
It is not clear to me whether both the [ and ] operators are executable, or only the ] operator.
Summary.
All of these special tokens, [, ], <<, >>, come out of the scanner as executable names. [ and << are defined to yield a marktype object (so they are not operators per se, but they are executable names defined in systemdict where all the operators live). ] and >> are defined as procedures or operators which are executed just like any other procedure or operator. These use the counttomark operator to find the opening bracket. But all of these tokens are treated specially by the scanner, which recognizes them without surrounding whitespace since they are part of its delimiter set.
Details.
It all depends on when you look at it. Let's trace through what the interpreter does with these tokens. I'm going to illustrate this with a string, but it works just the same with a file.
So if you have an input string
([4 5 6]) cvx exec
cvx makes a literal object executable. The program stream is a file object also labeled executable. exec pushes an object on the Execution Stack, where it is encountered by the interpreter on the next iteration of the inner interpreter processing loop. When executing the program stream, the executable file object is topmost on the Execution Stack.
The interpreter uses token to call the scanner. The scanner skips initial whitespace, then reads all non-whitespace characters up to the next delimiter, then attempts to interpret the string as a number, and failing that it becomes an executable name. The brackets are part of the set of delimiters, and so are termed 'self-delimiting'. So the scanner reads the one bracket character, stops reading because it's a delimiter, discovers it cannot be a number, so it yields an executable name.
Top of Exec Stack | Operand Stack
(4 5 6]) [ |
Next, the interpreter loop executes anything executable (unless it's an array). Executing a token means loading it from the dictionary, and then executing the definition if it's executable. [ is defined as a -mark- object, same as the name mark is defined. It's not technically an operator or a procedure, it's just a definition. Automatic loading happens because the name comes out of the scanner with the executable flag set.
(4 5 6]) | -mark-
The scanner then yields 4, 5, and 6 which are numbers and get pushed straight to the operand stack. 6 is delimited by the ] which is pushed back on the stream.
(]) | -mark- 4 5 6
The interpreter doesn't execute the numbers since they are not executable, but it would be just the same if it did. The action for executing a number is simply to push it on the stack.
Then, finally, the scanner encounters the right bracket ]. And that's where the magic happens. Self-delimited, it doesn't need to be followed by any whitespace. The scanner yields the executable name ], and the interpreter executes it by loading it, which yields ...
{ counttomark array astore exch pop }
Or maybe an actual operator that does this. But, yeah. counttomark yields the number of elements. array creates an array of that size. astore fills an array with elements from the stack. And exch pop to discard that pesky mark once and for all.
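You can check the equivalence interactively; since [ is only a name for a mark, spelling the whole thing out by hand builds the same array (a sketch using the procedure body above, which may differ from your interpreter's actual definition of ]):
[ 4 5 6 ]                                       % the usual syntax
mark 4 5 6 counttomark array astore exch pop    % what ] effectively does
Both leave a three-element array on the operand stack.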
For dictionaries, << is exactly the same as [. It drops a mark. Then you line up some key-value pairs, and >> is a procedure that does something to the effect of ...
{ counttomark dup dict begin 2 idiv { def } repeat pop currentdict end }
Make a dictionary. Define all the pairs. Pop the mark. Yield the dictionary. This version of the procedure tries to create a fast dictionary by making it double-sized. Move the 2 idiv to before dup to make a small dictionary.
So, to get philosophical, counttomark is the operator you're using. And it requires a special object-type that isn't used for anything else, the marktype object, -mark-. The rest is just syntactical sugar to let you access this stack-counting ability to create linear data-structures.
Appendix
Here's a procedure that models the interpreter loop reading from currentfile.
{currentfile token not {exit} if dup type /arraytype ne {exec} if }loop
exec is responsible for loading (and further executing) any executable names. You can see from this that token really is the name of the scanner; and that procedures (arrays) directly encountered by the interpreter loop are not executed (type /arraytype ne {exec} if).
Using token on strings lets you do really cool stuff, however. For example, you can dynamically construct procedure bodies with substituted names. This is very much like a lisp macro.
/makeadder {              % n  .  { n add }
    1 dict begin
    /n exch def
    ({//n add}) token     % ()  { n add }  true
    pop exch pop          % { n add }
    end
} def
token reads the entire procedure from the string, substituting the immediately-evaluated name //n with its currently defined value. Notice here that the scanner reads an executable array all at once, effectively executing [ ... ] cvx internally before returning (In certain interpreters, like my own xpost, this allows you to bypass the stack-size limits to build an array, because the array is built in separate memory. But Level 2 garbage collection makes this largely irrelevant).
There is also the bind operator which modifies a procedure by replacing operator names with the operator objects themselves. These tricks help you to factor-out name lookups in speed-critical procedures (like inner loops).
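A tiny example:
/double { 2 mul } bind def    % bind replaces the name mul with the mul operator object itself
3 double                      % leaves 6 on the stack, with no runtime lookup of /mul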
Both [ and ] are executable tokens. [ produces a mark object; ] collects the objects down to the last mark into an array.

Unicode in PDF

My program generates relatively simple PDF documents on request, but I'm having trouble with unicode characters, like kanji or odd math symbols. To write a normal string in PDF, you place it in brackets:
(something)
There is also the option to escape a character with octal codes:
(\527)
but this only goes up to 512 characters. How do you encode or escape higher characters? I've seen references to byte streams and hex-encoded strings, but none of the references I've read seem to be willing to tell me how to actually do it.
Edit: Alternatively, point me to a good Java PDF library that will do the job for me. The one I'm currently using is a version of gnujpdf (which I've fixed several bugs in, since the original author appears to have gone AWOL), that allows you to program against an AWT Graphics interface, and ideally any replacement should do the same.
The alternatives seem to be either HTML -> PDF, or a programmatic model based on paragraphs and boxes that feels very much like HTML. iText is an example of the latter. This would mean rewriting my existing code, and I'm not convinced they'd give me the same flexibility in laying out.
Edit 2: I didn't realise before, but the iText library has a Graphics2D API and seems to handle unicode perfectly, so that's what I'll be using. Though it isn't an answer to the question as asked, it solves the problem for me.
Edit 3: iText is working nicely for me. I guess the lesson is, when faced with something that seems pointlessly difficult, look for somebody who knows more about it than you.
In the PDF reference in chapter 3, this is what they say about Unicode:
Text strings are encoded in either PDFDocEncoding or Unicode character encoding. PDFDocEncoding is a superset of the ISO Latin 1 encoding and is documented in Appendix D. Unicode is described in the Unicode Standard by the Unicode Consortium (see the Bibliography).
For text strings encoded in Unicode, the first two bytes must be 254 followed by 255. These two bytes represent the Unicode byte order marker, U+FEFF, indicating that the string is encoded in the UTF-16BE (big-endian) encoding scheme specified in the Unicode standard. (This mechanism precludes beginning a string using PDFDocEncoding with the two characters thorn ydieresis, which is unlikely to be a meaningful beginning of a word or phrase.)
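So for a text string (the string type used in places like the document Information dictionary and bookmark titles, as opposed to strings shown in a content stream), "Héllo" can be written as a hex string holding the BOM plus UTF-16BE code units; a small sketch:
/Title <FEFF004800E9006C006C006F>    % FE FF (BOM), then 0048 00E9 006C 006C 006F = H é l l o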
The simple answer is that there's no simple answer. If you take a look at the PDF specification, you'll see an entire chapter — and a long one at that — devoted to the mechanisms of text display. I implemented all of the PDF support for my company, and handling text was by far the most complex part of exercise. The solution you discovered — use a 3rd party library to do the work for you — is really the best choice, unless you have very specific, special-purpose requirements for your PDF files.
Algoman's answer is wrong about several things. You can make a PDF document with Unicode in it, and it's not rocket science, though it needs some work.
He is right that to use more than 255 characters with one font you have to create a composite font (CIDFont) PDF object.
Then you just reference the actual TrueType font you want to use through the DescendantFonts entry of the composite font.
The trick is that after that you have to use the font's glyph indices instead of character codes. To obtain this index map you have to parse the cmap table of the font: get the contents of the font with the GetFontData function and consult the TTF specification.
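A minimal Win32 sketch of that step, assuming hdc is a device context with the TrueType font already selected; the byte-swapped tag value is my reading of the GetFontData documentation, so verify it:
// Fetch the raw 'cmap' table of the currently selected TrueType font.
// GetFontData wants the table tag with its bytes in file order, which on
// little-endian Windows means the reversed DWORD below spells 'c' 'm' 'a' 'p'.
const DWORD cmapTag = 0x70616D63;
DWORD size = GetFontData(hdc, cmapTag, 0, nullptr, 0);   // query the table size only
if (size != GDI_ERROR) {
    std::vector<BYTE> cmap(size);
    GetFontData(hdc, cmapTag, 0, cmap.data(), size);
    // ... walk the cmap subtables per the TrueType spec to build the
    //     character-code -> glyph-index map ...
}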
And that's it! I've just done it and now I have a Unicode PDF!
Sample Code for parsing cmap section is here: https://web.archive.org/web/20150329005245/http://support.microsoft.com/en-us/kb/241020
And yes, don't forget the /ToUnicode entry, as user2373071 pointed out, or the user will not be able to search your PDF or copy text from it.
As dredkin pointed out, you have to use the glyph indices instead of the Unicode character value in the page content stream. This is sufficient to display Unicode text in PDF, but the Unicode text would not be searchable. To make the text searchable or have copy/paste work on it, you will also need to include a /ToUnicode stream. This stream should translate each glyph in the document to the actual Unicode character.
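For reference, a minimal ToUnicode CMap stream looks roughly like the sketch below; the three glyph-to-character mappings are invented for illustration, while the surrounding boilerplate follows the structure shown in the PDF specification:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfrange
<0001> <0001> <0048>              % glyph 1 -> U+0048 (H)
<0002> <0003> [<20AC> <006C>]     % glyphs 2..3 -> U+20AC, U+006C
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end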
See Appendix D (page 995) of the PDF specification. There is a limited number of fonts and character sets pre-defined in a PDF consumer application. To display other characters you need to embed a font that contains them. It is also preferable to embed only a subset of the font, including only required characters, in order to reduce file size. I am also working on displaying Unicode characters in PDF and it is a major hassle.
Check out PDFBox or iText.
http://www.adobe.com/devnet/pdf/pdf_reference.html
I have worked several days on this subject now and what I have learned is that unicode is (as good as) impossible in pdf. Using 2-byte characters the way plinth described only works with CID-Fonts.
Seemingly, CID-Fonts are a pdf-internal construct and they are not really fonts in that sense - they seem to be more like graphics subroutines that can be invoked by addressing them (with 16-bit addresses).
So to use unicode in pdf directly
you would have to convert normal fonts to CID-Fonts, which is probably extremely hard - you'd have to generate the graphics routines from the original font(?), extract character metrics etc.
you cannot use CID-Fonts like normal fonts - you cannot load or scale them the way you load and scale normal fonts
also, 2-byte characters don't even cover the full Unicode space
IMHO, these points make it absolutely unfeasible to use unicode directly.
What I am doing instead now is using the characters indirectly in the following way:
For every font, I generate a codepage (and a lookup table for fast lookups) - in C++ this would be something like
std::map<std::string, std::vector<wchar_t> > Codepage;
std::map<std::string, std::map<wchar_t, int> > LookupTable;
Then, whenever I want to put some Unicode string on a page, I iterate over its characters, look them up in the lookup table and, if they are new, add them to the codepage, like this:
for(std::wstring::const_iterator i = str.begin(); i != str.end(); i++)
{
    if(LookupTable[fontname].find(*i) == LookupTable[fontname].end())
    {
        LookupTable[fontname][*i] = Codepage[fontname].size();
        Codepage[fontname].push_back(*i);
    }
}
Then I generate a new string, where the characters from the original string are replaced by their positions in the codepage, like this:
static std::string hex = "0123456789ABCDEF";
std::string result = "<";
for(std::wstring::const_iterator i = str.begin(); i != str.end(); i++)
{
    int id = LookupTable[fontname][*i] + 1;
    result += hex[(id & 0x00F0) >> 4];
    result += hex[(id & 0x000F)];
}
result += ">";
for example, "H€llo World!" might become <01020303040506040703080905>
and now you can just put that string into the pdf and have it printed, using the Tj operator as usual...
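In the page's content stream that looks something like this (the position and font size are arbitrary; /F1 is the font defined below):
BT
/F1 12 Tf
72 720 Td
<01020303040506040703080905> Tj
ET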
But you now have a problem: the pdf doesn't know that you mean "H" by a 01. To solve this problem, you also have to include the codepage in the pdf file. This is done by adding an /Encoding dictionary to the Font object and setting its Differences array.
For the "H€llo World!" example, this Font-Object would work:
5 0 obj
<<
  /F1
  <<
    /Type /Font
    /Subtype /Type1
    /BaseFont /Times-Roman
    /Encoding
    <<
      /Type /Encoding
      /Differences [ 1 /H /Euro /l /o /space /W /r /d /exclam ]
    >>
  >>
>>
endobj
I generate it with this code:
ObjectOffsets.push_back(stream->tellp()); // xrefs entry
(*stream) << ObjectCounter++ << " 0 obj \n<<\n";
int fontid = 1;
for(std::list<std::string>::iterator i = Fonts.begin(); i != Fonts.end(); i++)
{
    (*stream) << " /F" << fontid++ << " << /Type /Font /Subtype /Type1 /BaseFont /" << *i;
    (*stream) << " /Encoding << /Type /Encoding /Differences [ 1 \n";
    for(std::vector<wchar_t>::iterator j = Codepage[*i].begin(); j != Codepage[*i].end(); j++)
        (*stream) << " /" << GlyphName(*j) << "\n";
    (*stream) << " ] >>";
    (*stream) << " >> \n";
}
(*stream) << ">>\n";
(*stream) << "endobj \n\n";
Notice that I use a global font-register - I use the same font names /F1, /F2,... throughout the whole pdf document. The same font-register object is referenced in the /Resources Entry of all pages. If you do this differently (e.g. you use one font-register per page) - you might have to adapt the code to your situation...
So how do you find the names of the glyphs (/Euro for "€", /exclam for "!" etc.)? In the above code, this is done by simply calling "GlyphName(*j)". I have generated this method with a BASH-Script from the list found at
http://www.jdawiseman.com/papers/trivia/character-entities.html
and it looks like this
const std::string GlyphName(wchar_t UnicodeCodepoint)
{
    switch(UnicodeCodepoint)
    {
        case 0x00A0: return "nonbreakingspace";
        case 0x00A1: return "exclamdown";
        case 0x00A2: return "cent";
        ...
    }
}
A major problem I have left open is that this only works as long as you use at most 254 different characters from the same font. To use more than 254 different characters, you would have to create multiple codepages for the same font.
Inside the pdf, different codepages are represented by different fonts, so to switch between codepages, you would have to switch fonts, which could theoretically blow your pdf up quite a bit, but I for one, can live with that...
dredkin's answer has worked fine for me in the forward direction (unicode text to PDF representation).
I was writing an increasingly convoluted comment there about the reverse direction (PDF representation to text, when copying from the PDF document), explained by user2373071. The method referred to throughout this thread is the definition of a /ToUnicode map (which, incidentally, is optional). I found it simplest to map from glyphs to characters using the beginbfrange srcCode1 srcCode2 [ dstString1 ... dstStringn ] endbfrange construct.
This seems to work OK in Adobe Reader, but two glyphs (0x100 and 0x1ef) cause the mapping for Cyrillic characters to fail in browsers and SumatraPDF (copy/paste yields the glyph IDs instead of the characters). By excluding those two glyphs I made it work there. (I really can't see what's special about these glyphs, and it's independent of font: it's the same glyphs, but different characters, in Times/Georgia/Palatino, and these values are afaik identically mapped in UTF-16. Any ideas welcome!)
However, and more importantly,
I have reached the conclusion that the whole /ToUnicode mechanism is fundamentally flawed in concept, because many fonts re-use glyphs for multiple characters. Consider simple ones like 0x20 and 0xa0 (ordinary and non-breaking space); 0x2d and 0xad (hyphen and soft hyphen); these two pairs are in the 8-bit character range. Slightly beyond that are 0x3b and 0x37e (semicolon and Greek question mark). And it would be quite reasonable to re-use Cyrillic small a and Latin small a, and similar homoglyphs. So the point is: in the non-ASCII world that prompts us to worry about Unicode at all, we will encounter a one-to-many mapping from glyphs to characters, and will therefore be bound to pick up the wrong character at some point - which rather removes the point of being able to extract the text in the first place.
The other method in the (1.7) PDF reference is to use /ActualText instead of /ToUnicode. This is better in principle, because it completely avoids the homoglyph problem I've mentioned above, and the overhead is probably bearable, but it only seems to be implemented in Adobe Reader (i.e. I haven't got anything consistent or meaningful from SumatraPDF or four browsers).
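For completeness, /ActualText hangs the replacement text on a marked-content span in the content stream, roughly like this sketch (the string, font, position and glyph codes shown are placeholders):
/Span << /ActualText (Hello) >> BDC
BT /F1 12 Tf 72 720 Td <0102030304> Tj ET
EMC
For non-ASCII text the ActualText string itself would be written as a UTF-16BE text string with the FE FF prefix, as described in the excerpt from the reference quoted earlier in this thread.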
I'm not a PDF expert, and (as Ferruccio said) the PDF specs at Adobe should tell you everything, but a thought popped up in my mind:
Are you sure you are using a font that supports all the characters you need?
In our application, we create PDF from HTML pages (with a third party library), and we had this problem with cyrillic characters...