Unwanted space in a PDF generated with TCPDF - pdf

We use TCPDF to generate PDFs. In one special case I got a strange behaviour, it looks like TCPDF puts a space inbetween two characters.
I use the cid0cs as font, the strange behaviour appears if I place "µg" in the PDF, it looks like "µ g" (with some space inbetween) now.
I edited the cid0cs.php on index 181 (like here: http://bytethinker.com/blog/correct-display-of-imported-fonts-in-tcpdf) with no success.
Any help is really appreciated.

Did you edit the character µ or g? If you select the letters you can see which letter the extra space belongs to. So... for a small "g" (the first letter after which is the space, you must edit the entry "130=>???" of the $cw array.
$cw=array(0=>0,1=>750,2=>750,3=>750,4=>750
Make it half the value. If its 750, make it 400 and try. Or even better: search for a letter that could be the same with as your "g" (an "a" for example).
Cheers,
Guido
(customer service is when you look at all the links that lead to your website :)

Related

How do I make iText 7 diacritic mark stacking work correctly?

I have run into a problem with iText 7 where diacritic marks are painted on top of one another instead of stacking properly when multiple marks are used on a single character. Is there a setting that makes them appear correctly, or is this a bug in iText 7? Any help greatly appreciated. This can be observed if you create a text object in your PDF like below. Obviously, replace the relevant bit with an actual font object, rather than than what I have in there.
new Text("ḗ and ṓ are characters that display incorrectly").setFont(<UNICODE COMPATIBLE FONT LIKE CHARIS>);
While Bruno and Benoit correctly pointed out that for advanced typography stuff like stacking diacritical marks you need pdfCalligraph module, there is a workaround you can try at your own risk. If your combinations of base glyph and diacritics are real, meaning they occur in real texts in some languages or some other known contexts, then such combinations are most probably present in Unicode and have their own number associated with them. For instance, in text you provided, they are 0xU1E17 and 0x1E53 Unicode characters. Some fonts may contain such glyphs, so that there is a second option to showing base glyph and stacking diacritics: showing combined glyphs. For example, ArialUni shipped with Windows does contain the above mentioned glyphs.
To try this approach, you would need the following code for composing known Unicode base glyph + diacritics combinations into single glyphs:
String originalStr = "ḗ and ṓ are characters that display incorrectly";
String normalizedStr = java.text.Normalizer.normalize(originalStr, Normalizer.Form.NFC);
new Text(normalizedStr); // Use this normalized Text instance
The result that I got with ArialUni:
But again as I mentioned do it at your own risk because it will only work if there are necessary combinations present both in Unicode and font. For correct rendering you still should use pdfCalligraph.

Scrapy: how to solve the "empty" item in html due to a foreign language symbol?

One of the scrapy-ed items seems contain no content in HTML. In MySQL database, it does have content including a non-regular - (dash) that is slightly longer. It could be a dash symbol from Chinese input, or something similar. I am copy it below, not sure whether it will keep the original form. The web link is here and this non-regular dash is in the title and the beginning of the description.
**Hospitalist – Chattanooga**
To further prove it, the export CVS file from MySQL convert this weird dash to ?€?. Most likely this weird symbol causes the non-display problem.
I want to either delete this weird symbol or replace it with a , or a regular dash. Where can it be done? During Scrapy? Or in MySQL? Sorry this is not a specific coding question. I need some guidance before figuring out any codes for this problem.
The long dash is called an EM dash fileformat - EM dash
The reason you are seeing it is likely due to the chosen encoding.
Try setting a different encoding or replacing the EM dash with the , sign as you mentioned in your question.
In php you can do so with the following code:
str_replace(chr(151), ',' $input);

Special unicode question mark characters in database table

Firstly anyone who reads this and response, thanks for your assistance.
I'm having a problem where I have a site (primarily in English), with many translations for different language. I have a database which stores these translations. Unfortunately one of the language seems to be populated with question mark characters between each general character. Because of this, any text which contains these characters wont show up in IE.
Is there any SQL statements that will seek these characters out and remove them? There's a find/replace option, but I can't seem to find a rule that applies.
Thanks for any help you can give.
As an example, this is how text shows in a table:
�i�O�N� �k�i�t� �d�e� �s�u�p�p�o�r�t� �V�é�l�o� - which stops it showing IE.
Removing these as below will show it in IE:
iON kit de support Vélo
Any idea how I go about this?
Thanks :)
Your translation database contains mangled data that has come from misinterpreting UTF-16-encoded input as ISO-8859-1 (or the closely related Windows code page 1252; you can't tell the difference from the example data).
You could attempt to undo the damage by extracting the data, encoding it back to what is hopefully the original set of bytes, and re-decoding it, then inserting it back into the database. For example in PHP:
$mangled = "i\0O\0N\0 \0k\0i\0t\0 \0d\0e\0 \0s\0u\0p\0p\0o\0r\0t\0 \0V\0\xE9\0l\0o\0"
$fixed = iconv('utf-16le', 'utf-8', $mangled)
# "iON kit de support V\xC3\xA9lo"
but it would be best to go back to the original input data and re-import it properly really.
Just removing zero bytes from a UTF-16-encoded bytes string (str_replace("\0", '', $mangled)) isn't really fixing it, it would work for the ASCII characters (U+0000–U+007F) but you would end up with ISO-8859-1 bytes for characters U+0080–U+00FF (more usually you would want UTF-8) and any other characters outside that range would remain unreadable nonsense.

Special characters in the URL

I am working on a ASP.NET MVC app.
This app displays a detail information regarding a product.
The product name can have any special chars like single quote, the percentage symbol, the Registered symbol the one with a circle and 'R' inside, the Trademark symbol etc.
Currently all these are replaced with a '-'.
If the name is like this:
Super - Men's 100% Polyester Knit Shirts
It appears like this in the URL:
8080/super---men-s-100-polyester-knit-shirts/maverick
- men-s-100-polyester-knit-shirts
This is done in Js like so:
Name.replace(/([~!##$%^&*()_+=`{}\[\]\|\\:;'"<>,.\/? ])+/g, '-').replace(/^(-)+|(-)+$/g, '');
So the question is, should the name be displayed as-is in the URL?
If yes, some pointers please.
If no, please provide some valid reasons like standards as followed today that will help me put the point across the table.
Regards.
The short answer is not to fiddle with it. It's as good as it gets out of the box.
The Url can only contain a small number of alphanumeric letters. which basically means you can only have 0-9 a-z and - _ . ~.
All other characters need to be encoded. Now that you can have arabic url's too it has gotten a little more complicated.
But assuming your website is indo-european this is it. So you will never be able to have full product names in your url.
And renaming them as something more cool like replacing % with "percent" in the url can bring desaster upon your url's as in some cases the "fake" names may not end up unique and therefore end up with unreliable routing.
look at URI characters on wiki

Asc(Chr(254)) returns 116 in .Net 1.1 when language is Hungarian

I set the culture to Hungarian language, and Chr() seems to be broken.
System.Threading.Thread.CurrentThread.CurrentCulture = "hu-US"
System.Threading.Thread.CurrentThread.CurrentUICulture = "hu-US"
Chr(254)
This returns "ţ" when it should be "þ"
However, Asc("ţ") returns 116.
This: Asc(Chr(254)) returns 116.
Why would Asc() and Chr() be different?
I checked and the 'wide' functions do work correctly: ascw(chrw(254)) = 254
Chr(254) interprets the argument in a system dependent way, by looking at the System.Globalization.CultureInfo.CurrentCulture.TextInfo.ANSICodePage property. See the MSDN article about Chr. You can check whether that value is what you expect. "hu-US" (the hungarian locale as used in the US) might do something strange there.
As a side-note, Asc() has no promise about the used codepage in its current documentation (it was there until 3.0).
Generally I would stick to the unicode variants (ending on -W) if at all possible or use the Encoding class to explicitly specify the conversions.
My best guess is that your Windows tries to represent Chr(254)="ţ" as a combined letter, where the first letter is Chr(116)="t" and the second ("¸" or something like that) cannot be returned because Chr() only returns one letter.
Unicode text should not be handled character-by-character.
It sounds like you need to set the code page for the current thread -- the current culture shouldn't have any effect on Asc and Chr.
Both the Chr docs and the Asc docs have this line:
The returned character depends on the code page for the current thread, which is contained in the ANSICodePage property of the TextInfo class. TextInfo.ANSICodePage can be obtained by specifying System.Globalization.CultureInfo.CurrentCulture.TextInfo.ANSICodePage.
I have seen several problems in VBA on the Mac where characters over 127 and some control characters are not treated properly.
This includes paragraph marks (especially in text copied from the internet or scanned), "¥", and "Ω".
They cannot always be searched for, cannot be used in file names - though they could in the past, and when tested, come up as another ascii number. I have had to write algorithms to change these when files open, as they often look like they are the right character, but then crash some of my macros when they act strangely. The character will look and act right when I save the file, but may be changed when it is reopened.
I will eventually try to switch to unicode, but I am not sure if that will help this issue.
This may not be the issue that you are observing, but I would not rule out isolated problems with certain characters like this. I have sent notes to MS about this in the past but have received no joy.
If you cannot find another solution and the character looks correct when you type it in, then I recommend using a macro snippet like the one below, which I run when updating tables. You of course have to setup theRange as the area you are looking at. A whole file can take a while.
For aChar = 1 To theRange.Characters.count
theRange.Characters(aChar).Select
If Asc(Selection.Text) = 95 And Selection.Text <> "_" Then Selection.TypeText "Ω"
Next aChar