How to dump a digit string with leading zeros, as a valid yaml string in yaml-cpp? - yaml-cpp

yaml-cpp does not quote a digit string with leading zeros when it dumps it, so the text written to a file no longer reads back as a YAML string: leading_zeros: 00005 is the integer 5 according to the YAML 1.2 specification (try it yourself: http://www.yamllint.com/).
YAML::Node node;
node["leading_zeros"] = "00005";
std::cout << YAML::Dump(node)<<std::endl;
// output: leading_zeros: 00005
// instead of: leading_zeros: "00005"
How can I make yaml-cpp quote a string with leading zeros, so that other YAML parsers do not interpret it as an integer?
Escaping manually does not seem to be the correct answer.
node["leading_zeros"] = "\"00005\"";
Update:
The digit value is stored in a YAML::Node! I am pretty sure it is a bug.

After some time spent on analyzing code I went to this hacky solution: https://github.com/nikich340/yaml-cpp/commit/468e0832b39c8320faa7c925708b76f6a3b1b840
It keeps double quotes (or you can change the emitter manipulator to single quotes) around all scalars that were quoted strings originally.

Use the YAML::Emitter directly:
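#include <iostream>
#include <yaml-cpp/yaml.h>

YAML::Emitter out;
out << YAML::BeginMap;
out << YAML::Key << "leading_zeros";
out << YAML::Value << YAML::DoubleQuoted << "00005";
out << YAML::EndMap;
std::cout << out.c_str() << std::endl;
// output: leading_zeros: "00005"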

Related

How to print a specific integer value in a Bottle in YARP?

I'm trying to print a specific integer value of a Bottle by using cout (the Bottle contains only integer values), but it seems that this is a wrong way to do it. The command I used in a for loop is (the b Bottle is defined outside the loop):
std::cout << b.get(i) << std::endl;
The corresponding error is:
I'd like to see an example regarding the reading of a Bottle value.
You would have to get at the underlying type of the Value, if you know that it is indeed an int32_t (in other words b.get(i).isInt32() is true) then
std::cout << b.get(i).asInt32() << std::endl;
For the purposes of writing, without having to check the underlying type, you might also consider simply stringifying the Value
std::cout << b.get(i).toString() << std::endl;

Returning uint Value from Char in gawk

I'm trying to get value of an ASCII char I receive via RS232 to convert them into binary like values.
Example:
0xFF-->########
0x01-->       #
0x02-->      #
...
My Problem is to get the value of ASCII chars higher than 127.
Test-Code to get the int value:
echo -e "\xFF" | gawk -l ordchr -e '{printf("%c : %i", ord($0),ord($0))}'
Return:
� : -1
Test-Code 2:
echo -e "\x61" | gawk -l ordchr -e '{printf("%c : %i", ord($0),ord($0))}'
Return:
a : 97
So my solution to convert the values into unsigned int, is like this:
if (ord($0) < 0)
{
    new_char = ord($0) + 256;
}
else new_char = ord($0) + 0
But I wanted to know if there was a way to cast directly an int as uint in gawk.
Later I tried to write my own ord() function.
#!/bin/bash
echo -e "\xFF" | awk 'BEGIN {_ord_init()}
{
    printf("%s : %d\n", $0, ord($0))
}
function _ord_init(    i, t)
{
    for (i = 0; i <= 255; i++) {
        t = sprintf("%c", i)
        _ord_[t] = i
    }
}
function ord(str,    c)
{
    # only first character is of interest
    c = substr(str, 1, 1)
    return _ord_[c]
}'
0xFF returns:
� : 0
0x61 returns:
a : 97
Can someone explain me the behavior?
I'm using:
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4-p1, GNU MP 6.1.1)
But I wanted to know if there was a way to cast directly an int as uint in gawk.
Actually, awk converts between strings and numbers on its own, as the manual describes:
Strings are converted to numbers and numbers are converted to strings,
if the context of the awk program demands it. [...] A string is
converted to a number by interpreting any numeric prefix of the string
as numerals: "2.5" converts to 2.5, "1e3" converts to 1,000, and
"25fix" has a numeric value of 25. Strings that can’t be interpreted
as valid numbers convert to zero. source
Let's make a quick test:
BEGIN {
print 0xff
print 0xff + 0
print 0xff +0.0
print "0xff"
}
# 255
# 255
# 255
# 0xff
So a hexadecimal constant in the program text is read as a non-negative number, while the string "0xff" is printed unchanged. Casting an int to a uint is a tricky question in general: for an 8-bit value in two's complement, a negative int maps to its value plus 256, which is what your workaround already does. But you should not need to do so in awk.
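For comparison, here is a minimal C++ sketch of that signed-to-unsigned reinterpretation. It assumes the -1 reported for 0xFF comes from the byte being read as a signed 8-bit value. The function names are made up for illustration.

```cpp
#include <cstdint>

// Map a signed 8-bit reading (-128..127) to 0..255 by adding 256 to
// negative values, mirroring the ord($0)+256 workaround in the question.
int toUnsignedByte(int value) {
    return value < 0 ? value + 256 : value;
}

// The same mapping expressed as a cast: conversion to an unsigned 8-bit
// type is defined as reduction modulo 256.
int toUnsignedByteCast(int value) {
    return static_cast<std::uint8_t>(value);
}
```

Both functions agree for every input in -128..127, so the explicit branch in the awk code is just the manual form of the cast.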
Remember that conversion is made as a call to sprintf() and you may control it via the CONVFMT variable:
CONVFMT
A string that controls the conversion of numbers to strings
(see section Conversion of Strings and Numbers). It works by being
passed, in effect, as the first argument to the sprintf() function
(see section String-Manipulation Functions). Its default value is
"%.6g". CONVFMT was introduced by the POSIX standard. source
Remember that locale settings may affect the way the conversion is performed, especially with the decimal separator. For more, see this, which is out of scope.
Can someone explain me the behavior?
I can't actually reproduce it, but I suspect this line of code:
# only first character is of interest
c = substr(str, 1, 1)
In your example, the first char is always 0 and the output should always be the same. I'm testing this online.
I'll make another example of mine:
BEGIN {
a = 0xFF
b = 0x61
printf("a: %d %f %X %s %c\n", a,a,a,a,a)
printf("b: %d %f %X %s %c\n", b,b,b,b,b)
}
# a: 255 255.000000 FF 255 ÿ
# b: 97 97.000000 61 97 a
Run gawk in byte mode (gawk -b) to stop it from pre-stitching UTF-8 code points. Split the input on the empty string (//); each slot in the resulting array then holds something exactly 1 byte wide.
For the other way around, just pre-make an array from 0 to 255. Gawk doesn't stop there at all: in my routine gawk startup sequence, I run that same custom ord setup from 0x3134F all the way back to zero (around 210k entries or so). The reason to do it backwards is that, for whatever reason, some code points come out as an IDENTICAL character that gawk can't differentiate; going in reverse ensures the lowest code point is the one assigned to it. For this mode, I run gawk in its regular UTF-8 mode.
For your scenario I'd just pre-make a 4-hex-digit-wide array from 0x0000 to 0xFFFF, mapping back to the integer values; then for each 0xZZ 0xWW pair, look ZZWW up in that dictionary and get back an integer.
If you just try ord() from 128 to 255 it usually won't work like that, because 128 (0x80) is where UTF-8 starts using 2 bytes; 0x800 starts 3 bytes and 0x10000 starts 4 bytes. I'm not too familiar with the encodings that extend ASCII to 256 values - they usually require iconv or similar to get to UTF-8 first.
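To make those byte-count thresholds concrete, here is a small C++ sketch of a UTF-8 encoder. It is an illustration only, not what gawk does internally. The function name encodeUtf8 is hypothetical, and the input is assumed to be a valid code point up to U+10FFFF.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8. The branch boundaries are the
// thresholds mentioned above: 0x80 (2 bytes), 0x800 (3 bytes), 0x10000 (4 bytes).
std::string encodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // plain ASCII
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // leading byte
        out += static_cast<char>(0x80 | (cp & 0x3F));          // continuation
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Code point 0xFF therefore comes out as the two bytes C3 BF, never as a single FF byte, which is why a table built with sprintf("%c", i) in UTF-8 mode has no entry for a raw 0xFF.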
A quick note: if you have raw UTF-8 bytes and want to figure out how many stitched UTF-8 code points there are, just delete every byte in 0x80 - 0xBF. The length() of the residual is the number of code points.
In decimal lingo, out of the 4 ranges of 64 numbers from 0 to 255 :
000 - 063 - ASCII
064 - 127 - ASCII
128 - 191 - UTF8 multi-byte continuation bytes (0x80 - 0xBF)
192 - 255 - the leading byte of a UTF8 multi-byte char
and this looks hideous. Luckily, octal comes to the rescue: the 0x80 - 0xBF range is just \200-\277. You can use any of AWK's regexes to find those (also for FS / RS etc). I spent time manually coding up the UTF-8 algorithm with all that bit-shifting before I realized, much later, that I didn't need it to get to my end goal.
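The counting trick described above (drop every continuation byte, count what is left) can be sketched in C++ as follows. The function name countCodePoints is hypothetical, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Count UTF-8 code points by skipping continuation bytes, i.e. bytes in
// 0x80..0xBF (octal \200..\277), whose top two bits are 10.
std::size_t countCodePoints(const std::string& bytes) {
    std::size_t count = 0;
    for (unsigned char b : bytes) {
        if ((b & 0xC0) != 0x80) {
            ++count;  // ASCII byte or leading byte of a multi-byte sequence
        }
    }
    return count;
}
```

No decoding or bit-shifting of the payload is needed, only a test on the top two bits of each byte.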
You can easily beat the system built in wc -m command if you want to count utf8 code-points when combining the logic above with mawk2. On my 2.5 year old laptop, against a 1.83 GB flat text file FILLED with unicode all over, I got it down to approx 19 seconds or so to count out 1.29 billion utf8 code points, using just awk.
I've run into the same problem myself. I ended up first writing a detector for whether gawk is running in unicode mode or byte mode (check whether the length() of a 3-octal-value combo that makes up one UTF8 code point returns 1 or 3).
Then, when it sees gawk in unicode mode, it runs a custom shell command from gawk, uses unix printf to print out bytes 128-255, and chunks them back into gawk into an array. If you need it I can paste the code sometime (but it's SUPER hideous, so I hope I won't get dinged for its lack of elegance).
This is needed because there are simply bytes like C0, C1, or FF that don't exist in UTF8: no matter what combination you attempt, you cannot get gawk to generate all 256 on its own. Another way to do it would be pre-making that chain and using something like xxd -ps to store it as a hex string, only converting it back at runtime, but it's admittedly slower.

NSString Unicode display

I want to look at how a number of Unicode characters are displayed, using a for loop. However, the compiler doesn't like "%x" or "%d" in the string used to build a Unicode escape. Is there a workaround?
for (int k = 0; k < 16; k++){
    lbl.text = [NSString stringWithFormat:@"\u00B%x", k]; // <-- incomplete universal character name \u00B
}
thanks
I'm not sure I fully understand what you're trying to achieve. For this answer, I assume that you want to generate the Unicode characters in the range between B0 and BF.
Your code doesn't work due to the \u escape sequence (and not because of the %x or %d format specifiers). Just read the error message carefully. The code assumes that the %x specifier will be substituted with a number first and that the escape sequence will be evaluated second. However, it happens the other way round: First the \u sequence is evaluated by the compiler and an error thrown because it is invalid.
A better (and simpler) approach is the following code:
for (unichar ch = 0xB0; ch <= 0xBF; ch++){
    lbl.text = [NSString stringWithFormat:@"%C", ch];
}
This code directly puts a Unicode character into the string.
Use this method instead:
NSString stringWithUTF8String:
From the documentation here on the section Strings and Non-ASCII Characters:
Formatting Strings

How to emit a blank line using yaml-cpp

Using yaml-cpp, version 0.2.5 ...
I would like to emit a blank line between entries in a list (for readability purposes).
Is this possible?
I have tried experimentation with the Verbatim, and Null manipulators, but have not had success.
As of revision 420, this is possible:
YAML::Emitter emitter;
emitter << YAML::BeginSeq;
emitter << "a" << YAML::Newline << "b" << "c";
emitter << YAML::EndSeq;
produces
---
- a

- b
- c
I decided to go with YAML::Newline instead of YAML::Newline(n) since I found I usually just wanted a single newline. With the function form, I kept accidentally typing YAML::Newline without the parentheses, which implicitly cast the function to a bool (specifically, true), so I figured other people would probably make the same mistake.
If you just want to patch your copy (and not sync with the trunk), use revisions 418-420. (Note: it's slightly trickier than the patch in the link you posted, since you have to be careful about a newline after an implicit key. See the comment on the (now closed) Issue 77 for more details.)

Unicode in PDF

My program generates relatively simple PDF documents on request, but I'm having trouble with unicode characters, like kanji or odd math symbols. To write a normal string in PDF, you place it in brackets:
(something)
There is also the option to escape a character with octal codes:
(\527)
but this only goes up to 512 characters. How do you encode or escape higher characters? I've seen references to byte streams and hex-encoded strings, but none of the references I've read seem to be willing to tell me how to actually do it.
Edit: Alternatively, point me to a good Java PDF library that will do the job for me. The one I'm currently using is a version of gnujpdf (which I've fixed several bugs in, since the original author appears to have gone AWOL), that allows you to program against an AWT Graphics interface, and ideally any replacement should do the same.
The alternatives seem to be either HTML -> PDF, or a programmatic model based on paragraphs and boxes that feels very much like HTML. iText is an example of the latter. This would mean rewriting my existing code, and I'm not convinced they'd give me the same flexibility in laying out.
Edit 2: I didn't realise before, but the iText library has a Graphics2D API and seems to handle unicode perfectly, so that's what I'll be using. Though it isn't an answer to the question as asked, it solves the problem for me.
Edit 3: iText is working nicely for me. I guess the lesson is, when faced with something that seems pointlessly difficult, look for somebody who knows more about it than you.
In the PDF reference in chapter 3, this is what they say about Unicode:
Text strings are encoded in
either PDFDocEncoding or Unicode character encoding. PDFDocEncoding is a
superset of the ISO Latin 1 encoding and is documented in Appendix D. Unicode
is described in the Unicode Standard by the Unicode Consortium (see the Bibliography).
For text strings encoded in Unicode, the first two bytes must be 254 followed by
255. These two bytes represent the Unicode byte order marker, U+FEFF, indicating
that the string is encoded in the UTF-16BE (big-endian) encoding scheme specified
in the Unicode standard. (This mechanism precludes beginning a string using
PDFDocEncoding with the two characters thorn ydieresis, which is unlikely to
be a meaningful beginning of a word or phrase).
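As a rough illustration of the quoted rule, the following C++ sketch writes a text string as a PDF hexadecimal string: the FE FF marker followed by UTF-16BE code units. The function name pdfTextString is hypothetical. Note that the quoted passage is about text strings (for example document information entries); how a string drawn with a font is interpreted depends on that font's encoding, so this is not by itself a way to show arbitrary Unicode on a page.

```cpp
#include <cstdint>
#include <string>

// Encode code points as a PDF hex text string: "<FEFF" + UTF-16BE units + ">".
std::string pdfTextString(const std::u32string& text) {
    static const char* hex = "0123456789ABCDEF";
    std::string out = "<FEFF";  // bytes 254 and 255, the byte order marker
    auto appendUnit = [&out](std::uint16_t u) {
        out += hex[(u >> 12) & 0xF];
        out += hex[(u >> 8) & 0xF];
        out += hex[(u >> 4) & 0xF];
        out += hex[u & 0xF];
    };
    for (char32_t cp : text) {
        if (cp >= 0x10000) {
            // Outside the BMP: split into a surrogate pair.
            cp -= 0x10000;
            appendUnit(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            appendUnit(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            appendUnit(static_cast<std::uint16_t>(cp));
        }
    }
    out += ">";
    return out;
}
```

For instance, the two characters "H€" come out as <FEFF004820AC>.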
The simple answer is that there's no simple answer. If you take a look at the PDF specification, you'll see an entire chapter (and a long one at that) devoted to the mechanisms of text display. I implemented all of the PDF support for my company, and handling text was by far the most complex part of the exercise. The solution you discovered, using a 3rd party library to do the work for you, is really the best choice, unless you have very specific, special-purpose requirements for your PDF files.
Algoman's answer is wrong in many things. You can make a PDF document with Unicode in it and it's not rocket science, though it needs some work.
Yes, he is right that to use more than 255 characters in one font you have to create a composite font (CIDFont) PDF object.
Then you just mention the actual TrueType font you want to use as the DescendantFonts entry of that composite font.
The trick is that after that you have to use the font's glyph indices instead of character codes. To get this index map you have to parse the cmap section of the font: get the contents of the font with the GetFontData function and get your hands on the TTF specification.
And that's it! I've just done it and now I have a Unicode PDF!
Sample Code for parsing cmap section is here: https://web.archive.org/web/20150329005245/http://support.microsoft.com/en-us/kb/241020
And yes, don't forget the /ToUnicode entry, as @user2373071 pointed out, or users will not be able to search your PDF or copy text from it.
As dredkin pointed out, you have to use the glyph indices instead of the Unicode character value in the page content stream. This is sufficient to display Unicode text in PDF, but the Unicode text would not be searchable. To make the text searchable or have copy/paste work on it, you will also need to include a /ToUnicode stream. This stream should translate each glyph in the document to the actual Unicode character.
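A minimal C++ sketch of the glyph-to-character table inside such a stream is shown below. It only builds the bfchar section; the surrounding CMap boilerplate (begincmap, code space range and so on) is omitted. The function names are hypothetical, the glyph indices in any example are invented, and only characters in the Basic Multilingual Plane are handled. As far as I know, each such section is limited to 100 entries, so a real generator would split larger maps into several sections.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

// Format a 16-bit value as four uppercase hex digits.
static std::string hex4(std::uint16_t v) {
    char buf[5];
    std::snprintf(buf, sizeof buf, "%04X", static_cast<unsigned>(v));
    return buf;
}

// Build one bfchar section mapping glyph index -> Unicode (BMP only).
std::string bfcharSection(const std::map<std::uint16_t, std::uint16_t>& glyphToUnicode) {
    std::string out = std::to_string(glyphToUnicode.size()) + " beginbfchar\n";
    for (const auto& entry : glyphToUnicode) {
        out += "<" + hex4(entry.first) + "> <" + hex4(entry.second) + ">\n";
    }
    out += "endbfchar\n";
    return out;
}
```

Called with a map such as {0x002B -> 0x0048, 0x0188 -> 0x20AC}, it yields two lines of the form <glyph> <unicode> between the beginbfchar and endbfchar keywords.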
See Appendix D (page 995) of the PDF specification. There is a limited number of fonts and character sets pre-defined in a PDF consumer application. To display other characters you need to embed a font that contains them. It is also preferable to embed only a subset of the font, including only required characters, in order to reduce file size. I am also working on displaying Unicode characters in PDF and it is a major hassle.
Check out PDFBox or iText.
http://www.adobe.com/devnet/pdf/pdf_reference.html
I have worked several days on this subject now and what I have learned is that unicode is (as good as) impossible in pdf. Using 2-byte characters the way plinth described only works with CID-Fonts.
Seemingly, CID-Fonts are a pdf-internal construct and they are not really fonts in that sense - they seem to be more like graphics-subroutines that can be invoked by addressing them (with 16-bit addresses).
So to use unicode in pdf directly:
- you would have to convert normal fonts to CID-Fonts, which is probably extremely hard - you'd have to generate the graphics routines from the original font(?), extract character metrics etc.
- you cannot use CID-Fonts like normal fonts - you cannot load or scale them the way you load and scale normal fonts
- also, 2-byte characters don't even cover the full Unicode space
IMHO, these points make it absolutely unfeasible to use unicode directly.
What I am doing instead now is using the characters indirectly in the following way:
For every font, I generate a codepage (and a lookup-table for fast lookups) - in c++ this would be something like
std::map<std::string, std::vector<wchar_t> > Codepage;
std::map<std::string, std::map<wchar_t, int> > LookupTable;
then, whenever I want to put some unicode-string on a page, I iterate its characters, look them up in the lookup-table and - if they are new, I add them to the code-page like this:
for(std::wstring::const_iterator i = str.begin(); i != str.end(); i++)
{
    if(LookupTable[fontname].find(*i) == LookupTable[fontname].end())
    {
        LookupTable[fontname][*i] = Codepage[fontname].size();
        Codepage[fontname].push_back(*i);
    }
}
then, I generate a new string, where the characters from the original string are replaced by their positions in the codepage like this:
static std::string hex = "0123456789ABCDEF";
std::string result = "<";
for(std::wstring::const_iterator i = str.begin(); i != str.end(); i++)
{
    int id = LookupTable[fontname][*i] + 1;
    result += hex[(id & 0x00F0) >> 4];
    result += hex[(id & 0x000F)];
}
result += ">";
for example, "H€llo World!" becomes <010203030405060407030809>
and now you can just put that string into the pdf and have it printed, using the Tj operator as usual...
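Put together, the two steps can be sketched as one self-contained unit. This is a reworking for illustration, not the code I actually use: the Codepage struct and the names codeFor and encodeWithCodepage are made up, char32_t replaces wchar_t, and it assumes a font never needs more than 255 distinct characters.

```cpp
#include <map>
#include <string>
#include <vector>

// Per-font codepage: chars[i] is the character that was assigned code i + 1.
struct Codepage {
    std::vector<char32_t> chars;
    std::map<char32_t, int> index;

    // Return the 1-based code for c, assigning a new code on first use.
    int codeFor(char32_t c) {
        auto it = index.find(c);
        if (it != index.end()) {
            return it->second;
        }
        chars.push_back(c);
        int code = static_cast<int>(chars.size());
        index[c] = code;
        return code;
    }
};

// Replace each character by its one-byte code, written as a PDF hex string.
std::string encodeWithCodepage(Codepage& page, const std::u32string& text) {
    static const char* hex = "0123456789ABCDEF";
    std::string result = "<";
    for (char32_t c : text) {
        int code = page.codeFor(c);
        result += hex[(code >> 4) & 0xF];
        result += hex[code & 0xF];
    }
    result += ">";
    return result;
}
```

Starting from an empty codepage, "H€llo World!" uses nine distinct characters, so the codes run from 01 to 09 and repeated letters reuse their code.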
but you now have a problem: the pdf doesn't know that you mean "H" by a 01. To solve this problem, you also have to include the codepage in the pdf file. This is done by adding an /Encoding to the Font object and setting its Differences
For the "H€llo World!" example, this Font-Object would work:
5 0 obj
<<
/F1
<<
/Type /Font
/Subtype /Type1
/BaseFont /Times-Roman
/Encoding
<<
/Type /Encoding
/Differences [ 1 /H /Euro /l /o /space /W /r /d /exclam ]
>>
>>
>>
endobj
I generate it with this code:
ObjectOffsets.push_back(stream->tellp()); // xrefs entry
(*stream) << ObjectCounter++ << " 0 obj \n<<\n";
int fontid = 1;
for(std::list<std::string>::iterator i = Fonts.begin(); i != Fonts.end(); i++)
{
    (*stream) << " /F" << fontid++ << " << /Type /Font /Subtype /Type1 /BaseFont /" << *i;
    (*stream) << " /Encoding << /Type /Encoding /Differences [ 1 \n";
    for(std::vector<wchar_t>::iterator j = Codepage[*i].begin(); j != Codepage[*i].end(); j++)
        (*stream) << " /" << GlyphName(*j) << "\n";
    (*stream) << " ] >>";
    (*stream) << " >> \n";
}
(*stream) << ">>\n";
(*stream) << "endobj \n\n";
Notice that I use a global font-register - I use the same font names /F1, /F2,... throughout the whole pdf document. The same font-register object is referenced in the /Resources Entry of all pages. If you do this differently (e.g. you use one font-register per page) - you might have to adapt the code to your situation...
So how do you find the names of the glyphs (/Euro for "€", /exclam for "!" etc.)? In the above code, this is done by simply calling "GlyphName(*j)". I have generated this method with a BASH-Script from the list found at
http://www.jdawiseman.com/papers/trivia/character-entities.html
and it looks like this
const std::string GlyphName(wchar_t UnicodeCodepoint)
{
    switch(UnicodeCodepoint)
    {
        case 0x00A0: return "nonbreakingspace";
        case 0x00A1: return "exclamdown";
        case 0x00A2: return "cent";
        ...
    }
}
A major problem I have left open is that this only works as long as you use at most 254 different characters from the same font. To use more than 254 different characters, you would have to create multiple codepages for the same font.
Inside the pdf, different codepages are represented by different fonts, so to switch between codepages, you would have to switch fonts, which could theoretically blow your pdf up quite a bit, but I for one, can live with that...
dredkin's answer has worked fine for me in the forward direction (unicode text to PDF representation).
I was writing an increasingly convoluted comment there about the reverse direction (PDF representation to text, when copying from the PDF document), explained by user2373071. The method referred to throughout this thread is the definition of a /ToUnicode map (which, incidentally, is optional). I found it simplest to map from glyphs to characters using the beginbfrange srcCode1 srcCode2 [ dstString1 m ] endbfrange construct.
This seems to work OK in Adobe Reader, but two glyphs (0x100 and 0x1ef) cause the mapping for cyrillic characters to fail in browsers and SumatraPDF (the copy/paste provides the glyph IDs instead of the characters). By excluding those two glyphs I made it work there. (I really can't see what's special about these glyphs, and it's independent of font, i.e. it's the same glyphs, but different characters, in Times/Georgia/Palatino, and these values are afaik identically mapped in UTF-16. Any ideas welcome!)
However, and more importantly,
I have reached the conclusion that the whole /ToUnicode mechanism is fundamentally flawed in concept, because many fonts re-use glyphs for multiple characters. Consider simple ones like 0x20 and 0xa0 (ordinary and non-breaking space); 0x2d and 0xad (hyphen and soft hyphen); these two are in the 8-bit character range. Slightly beyond that are 0x3b and 0x37e (semi-colon and greek question mark). And it would be quite reasonable to re-use cyrillic small a and latin small a, and similar homoglyphs. So the point is, in the non-ASCII world that prompts us to worry about Unicode at all, we will encounter a one-to-many mapping from glyphs to characters, and will therefore be bound to pick up the wrong character at some point - which rather removes the point of being able to extract the text in the first place.
The other method in the (1.7) PDF reference is to use /ActualText instead of /ToUnicode. This is better in principle, because it completely avoids the homoglyph problem I've mentioned above, and the overhead is probably bearable, but it only seems to be implemented in Adobe Reader (i.e. I haven't got anything consistent or meaningful from SumatraPDF or four browsers).
I'm not a PDF expert, and (as Ferruccio said) the PDF specs at Adobe should tell you everything, but a thought popped up in my mind:
Are you sure you are using a font that supports all the characters you need?
In our application, we create PDF from HTML pages (with a third party library), and we had this problem with cyrillic characters...