PDF viewers are not rendering all of the tamil letters as expected in the PDF generated using PDFbox. It seems I have to do the required substitutions while generating the PDF to get it rendered as expected.
Attempting the substitutions, any help would be appreciated. Basically these are the three cases requiring the substitution or change for Tamil letters. Most of the substitutions for the remaining letters more or less follow the same.
Need to reverse the glyphs
கெ = க + ெ = க ெ -> ெ + க = கெ
Need to split and reorder the glyphs
கொ = க + ொ = க ொ -> க + ெ + ா -> ெ + க + ா = கொ
New resultant glyphe - Substitute new glyphe for a series of glyphes. The new glyphe do not have unicode, only exist in the font file.
கு = க + ு = க ு -> கு
Input text
Char list from JDK
Code points from JDK
gid in ttf
Actual*
Expected
கெ
க + ெ
2965 3014 Character : க Codepoint : 2965 unicode : ub95 Character : ெ Codepoint : 3014 unicode : ubc6
1828 1856
க + ெ = க ெ
ெ + க = கெ
Reversing the glyphes expected.
கொ
க + ொ
2965 3018 Character : க Codepoint : 2965 unicode : ub95 Character : ொ Codepoint : 3018 unicode : ubca
1828 1859
க + ொ = க ொ
க + ெ + ா ெ + க + ா = கொ
Split and reorder expected.
கு
க + ு
2965 3009 Character : க Codepoint : 2965 unicode : ub95 Character : ு Codepoint : 3009 unicode : ubc1
1828 1854
க + ு = க ு
கு (gid = 6698)
New glyphe expected. The new glyphe do not have unicode, only exist in the font file.
Below is the actual content rendering from PDF viewer
Below is the expected content, did a hard coded substitutions(For the glyphe id without having unicode, hardcoded at PDCIDFontType2#public byte[] encode(int unicode). Reverse, split and reorder input text charsequence before calling the showtext. Also added the glyphe id that does not have a unicode at TrueTypeEmbedder Subsetter for embedding the glyphe into the generated pdf.) just to obtain it.
How to handle these substitutions in an efficient way?
Looking at the GlyphSubstitutionTable, fontbox.cmap.Identity-H, fontbox.unicode.Scripts.txt. Couldn’t get it so far. Any help would be appreciated.
Links, Font Actual Expected Use cases PDFBox Jira
You need to implement a text shaping engine to handle Tamil writing.
Please see the OpenType specification: https://learn.microsoft.com/en-us/typography/opentype/spec/ , the GSUB/GPOS tables are the main interest for you.
This is no easy task so maybe using an external library such as HarfBuzz is a better choice.
There is also this PDFBox issue (4189) regarding Bengali writing. Maybe it will help you implement support for Tamil
Update: for example this HarfBuzz command line:
hb-shape -O json -u U+0B95,U+0BC1 --no-glyph-names FreeSerif.otf
will return:
[{"g":6698,"cl":0,"dx":0,"dy":0,"ax":858,"ay":0}]
You have to parse the json output, get the glyph ids and provide them to PDFBox.
Related
I have an ISO 8583 ascuu message with Bitmap Hex.
I tried to unpack it as follows:
String hexMsg="0110E23800018ED006060000000000000011168...M..."
byte[] bmsg = ISOUtil.hex2byte(hexMsg);
ISOMsg mes = new ISOMsg();
mes.setPackager(new ISO87BPackager());
mes.unpack(bmsg);
But I got into trouble because the message was not hex and it contains the character M, for example
I tried to keep the bitmap separate and hex the rest and again the problem was not solved:
can any body help?
I'm trying to convert pdf to text files. The problem is that those pdf contain images, which I don't care about (this is the type of file I want to extract (https://www.sia.aviation-civile.gouv.fr/pub/media/store/documents/file/l/f/lf_sup_2020_213_fr.pdf). Note that if I do copy/paste with my mouse, it work quite well (except the line break), so I'd guess that it's possible. Most of the answer I found online work pretty well on dummy pdf with text only, but give especially bad result on the map.
For instance, something like this
from tika import parser # pip install tika
raw = parser.from_file('test2.pdf')
print(raw['content'])
works well for retrieving the text, but I have a lot of trash like this :
ERY
CTR
3
CH
A
which appear because of the map.
Something like this, which work by converting the pdf to images and then reading the images, face the same problem (I found it on a very similar thread on stackoverflow, but there is no answer) :
import pytesseract as pt
from PIL import Image
import sys
def convert(name):
pages = convert_from_path(name, dpi=200)
for idx,page in enumerate(pages):
page.save('page'+str(idx)+'.jpg', 'JPEG')
quote = Image.open('page'+str(idx)+'.jpg')
text = pt.image_to_string(quote, lang="fra")
file_ex = open('page'+str(idx)+'.text',"w")
file_ex.write(text)
file_ex.close()
if __name__ == '__main__':
convert(sys.argv[1])
Finally, I tried to remove the image first, and then using one of the solutions above, but it didn't work better :
from tika import parser # pip install tika
from PyPDF2 import PdfFileWriter, PdfFileReader
# Remove the images
inputStream = open("lf_sup_2020_213_fr.pdf", "rb")
outputStream = open("test3.pdf", "wb")
src = PdfFileReader(inputStream)
output = PdfFileWriter()
[output.addPage(src.getPage(i)) for i in range(src.getNumPages())]
output.removeImages()
output.write(outputStream)
outputStream.close()
# Read from pdf without images
raw = parser.from_file('test2.pdf')
print(raw['content'])
Do you know how to solve this ? It can be in any language.
Thanks
One approach you could try is to use a toolkit capable of parsing the text characters in the PDF then use the object properties to try and remove the unwanted map labels while keeping the text characters required.
For example, the ParsePages method from LEADTOOLS PDF toolkit (which is what I am familiar with since I work for the vendor of this toolkit) can be used to obtain the text from the PDF:
using (PDFDocument document = new PDFDocument(pdfFileName))
{
PDFParsePagesOptions options = PDFParsePagesOptions.All;
document.ParsePages(options, 1, -1);
using (StreamWriter writer = File.CreateText(txtFileName))
{
IList<PDFObject> objects = document.Pages[0].Objects;
writer.WriteLine("Objects: {0}", objects.Count);
foreach (PDFObject obj in objects)
{
if (obj.TextProperties.IsEndOfLine)
writer.WriteLine(obj.Code);
else
writer.Write(obj.Code);
}
writer.WriteLine("---------------------");
}
}
This will obtain all the text in the PDF for the first page, with the unwanted results as you mentioned. Here is an excerpt below:
Objects: 3918
5
91L
F5
4
1 LF
N
OY
L2
1AM
TService
8
26
1de l’Information
0
B09SUP AIP 213/20
7
Aéronautique
Date de publication : 05 NOV
e-mail : sia.qualite#aviation-civile.gouv.fr
Internet : www.sia.aviation-civile.gouv.fr
141
17˚
82
N20
9Objet : Création de 4 zones réglementées temporaires (ZRT) pour l’exercice VOLOPS en région de Chambéry
En vigueur : Du mercredi 25 Novembre 2020 au vendredi 04 décembre 2020
More code can be used to examine the properties for each parsed character:
writer.WriteLine(" ObjectType: {0}", obj.ObjectType.ToString());
writer.WriteLine(" Bounds: {0}, {1}, {2}, {3}", obj.Bounds.Left, obj.Bounds.Top, obj.Bounds.Right, obj.Bounds.Bottom);
writer.WriteLine(" TextProperties.FontHeight: {0}", obj.TextProperties.FontHeight.ToString());
writer.WriteLine(" TextProperties.FontIndex: {0}", obj.TextProperties.FontIndex.ToString());
writer.WriteLine(" Code: {0}", obj.Code);
writer.WriteLine("------");
This will give the properties for each character:
Objects: 3918
ObjectType: Text
Bounds: -60.952693939209, 1017.25231933594, -51.8431816101074, 1023.71826171875
TextProperties.FontHeight: 7.10454273223877
TextProperties.FontIndex: 48
Code: 5
------
Using these properties, the unwanted text might be filtered using their properties. For example, I noticed that the FontHeight for a good portion of the unwanted text is around 7 PDF units, so the first code might be altered to avoid extracting any text smaller than 7.25 PDF units:
foreach (PDFObject obj in objects)
{
if (obj.TextProperties.FontHeight > 7.25)
{
if (obj.TextProperties.IsEndOfLine)
writer.WriteLine(obj.Code);
else
writer.Write(obj.Code);
}
}
The extracted output would give a better result, an excerpt follows:
Objects: 3918
Service
de l’Information
SUP AIP 213/20
Aéronautique
Date de publication : 05 NOV
e-mail : sia.qualite#aviation-civile.gouv.fr
Internet : www.sia.aviation-civile.gouv.fr
Objet : Création de 4 zones réglementées temporaires (ZRT) pour l’exercice VOLOPS en région de Chambéry
En vigueur : Du mercredi 25 Novembre 2020 au vendredi 04 décembre 2020
Lieu : FIR : Marseille LFMM - AD : Chambéry Aix-Les-Bains LFLB, Chambéry Challes les Eaux LFLE
ZRT LE SIRE, MOTTE CASTRALE, ALLEVARD
*
C
D
E
In the end, you will have to try and come up with a good criteria to filter out the unwanted text without removing the text you need to keep, using this approach.
I need help constructing the Authorization header to PUT a block blob.
PUT\n\n\n11\n\n\n\n\n\n\n\n\nx-ms-blob-type:BlockBlob\nx-ms-date:Sat, 25 Feb 2017 22:20:13 GMT\nx-ms-version:2015-02-21\n/myaccountname/mycontainername/blob.txt\n
I take this, UTF 8 encode it. Then I take my access key in my Azure account and HMAC sha256 this UTF 8 encoded string with the key. Then I output that in base64. Let's call this output string.
My authorization header looks like this: SharedKey myaccountname:output string
It is not working.
The header in Postman also has x-ms-blob-type, x-ms-date, x-ms-version, Content-Length, and Authorization. The body for now says hello world.
Can anyone help me make this successful request in Postman?
<?xml version="1.0" encoding="utf-8"?>
<Error>
<Code>AuthenticationFailed</Code>
<Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:cdeb9a5e-0001-0029-5fb5-8f7995000000
Time:2017-02-25T22:22:32.0300016Z</Message>
<AuthenticationErrorDetail>The MAC signature found in the HTTP request 'jiJtirohvi1syXulqkPKESnmQEJI4GpDU5JBn7BM/xY=' is not the same as any computed signature. Server used following string to sign: 'PUT
11
text/plain;charset=UTF-8
x-ms-date:Sat, 25 Feb 2017 22:20:13 GMT
x-ms-version:2015-02-21
/myaccountname/mycontainername/blob.txt'.</AuthenticationErrorDetail>
</Error>
EDIT:
First, I want to thank you and everyone who responded. I truly truly appreciate it. I have one last question and then I think I'll be set!! I'm not using that code - I'm doing this all by hand. If I have my key: X2iiy6v47j1jZZH5555555555zzQRrIAdxxVs55555555555av8uBUNGcBMotmS7tDqas14gU5O/w== changed slightly for anonymity - do I decode it: using an online base64decoder. Then, when I have my string which now looks like this: PUT\n\n\n11\n\ntext/plain;charset=UTF-8\n\n\n\n\n\n\nx-ms-blob-type:BlockBlob\nx-ms-date:Mon, 27 Feb 2017 21:53:13 GMT\nx-ms-version:2015-02-21\n/myaccount/mycontainer/blob.txt\n so I run this in https://mothereff.in/utf-8 and then use this in HMAC with my decoded key: https://www.liavaag.org/English/SHA-Generator/HMAC/ - using sha256 and base64 at the end.
Is that how I get the correct string to put here?: SharedKey myaccount:<string here>
I believe there's an issue with how you're specifying StringToSign here:
PUT\n\n\n11\n\n\n\n\n\n\n\n\nx-ms-blob-type:BlockBlob\nx-ms-date:Sat,
25 Feb 2017 22:20:13
GMT\nx-ms-version:2015-02-21\n/myaccountname/mycontainername/blob.txt\n
If you notice the error message returned from the server, string to sign by server is different than yours and the difference is that the server is using Content-Type (text/plain;charset=UTF-8) in signature calculation while you're not. Please include this content type in your code and things should work just fine.
Here's the sample code (partial only) I used:
var requestMethod = "PUT";
var urlPath = "test" + "/" + "myblob.txt";
var storageServiceVersion = "2015-12-11";
var date = DateTime.UtcNow.ToString("R", CultureInfo.InvariantCulture);
var blobType = "BlockBlob";
var contentBytes = Encoding.UTF8.GetBytes("Hello World");
var canonicalizedResource = "/" + accountName + "/" + urlPath;
var canonicalizedHeaders = "x-ms-blob-type:" + blobType + "\nx-ms-date:" + date + "\nx-ms-version:" + storageServiceVersion + "\n";
var stringToSign = requestMethod + "\n" +
"\n" + //Content Encoding
"\n" + //Content Language
"11\n" + //Content Length
"\n" + //Content MD5
"text/plain;charset=UTF-8" + "\n" + //Content Type
"\n" + //Date
"\n" + //If - Modified - Since
"\n" + //If - Match
"\n" + //If - None - Match
"\n" + //If - Unmodified - Since
"\n" + //Range +
canonicalizedHeaders +
canonicalizedResource;
string authorizationHeader = GenerateSharedKey(stringToSign, accountKey, accountName);
private static string GenerateSharedKey(string stringToSign, string key, string account)
{
string signature;
var unicodeKey = Convert.FromBase64String(key);
using (var hmacSha256 = new HMACSHA256(unicodeKey))
{
var dataToHmac = Encoding.UTF8.GetBytes(stringToSign);
signature = Convert.ToBase64String(hmacSha256.ComputeHash(dataToHmac));
}
return string.Format(CultureInfo.InvariantCulture, "{0} {1}:{2}", "SharedKey", account, signature);
}
According to your error message, it indicates that authorization signature is incorrect.
If the Content-Type "text/plain; charset=UTF-8" is not included in the header, please add it in the stringTosign and postman.
When we try to get the signature, we need to make sure the length of the blob.txt matches the Content length in the stringTosign. That means request body length should match the content length in the stringTosign.
I test it with Postman, it works correctly. We can get the signature with the code in another SO Thread. The following is my detail steps
Add the following header
Add the request body (example: Hello World)
Send the put blob request.
Update :
Please have a try to use the online tool to generate signature for test.
I have the following code that opens files in a directory, runs spaCy NLP on them, and the outputs dependency parse info into a file in a new directory.
import spacy, os
nlp = spacy.load('en')
path1 = 'C:/Path/to/my/input'
path2 = '../output'
for file in os.listdir(path1):
with open(file, encoding='utf-8') as text:
txt = text.read()
doc = nlp(txt)
for sent in doc.sents:
f = open(path2 + '/' + file, 'a+')
for token in sent:
f.write(file + '\t' + str(token.dep_) + '\t' + str(token.head) + '\t' + str(token.right_edge) + '\n')
f.close()
The trouble is that this won't preserver the order of the dependencies in the output file. I can't seem to find any references to character positions in the API documentation.
The character index is at token.idx. The word index is at token.i. I know this isn't particularly intuitive.
Tokens also compare by position, so you could do:
for child in sent:
word1, word2 = sorted((child, child.head))
This would get you each dependency arc, arranged in document order. I'm not sure what you're trying to do with the right edge there, though, so I'm not sure if this does quite what you want.
Let me start by saying I'm no expert in cryptography algorithms...
I am trying to build a method which formats an HTTP header for Windows Azure - and this header requires part of its message to be encrypted via HMAC with SHA256 (and then also base64 encoded).
I chose to use CryptoJS because it's got an active user community.
First, my code:
_encodeAuthHeader : function (url, params, date) {
//http://msdn.microsoft.com/en-us/library/windowsazure/dd179428
var canonicalizedResource = '/' + this.getAccountName() + url;
/*
StringToSign = Date + "\n" + CanonicalizedResource
*/
var stringToSign = date + '\n' + canonicalizedResource;
console.log('stringToSign >> ' + stringToSign)
var encodedBits = CryptoJS.HmacSHA256(stringToSign, this.getAccessKey());
console.log('encodedBits >> ' + encodedBits);
var base64Bits = CryptoJS.enc.Base64.stringify(encodedBits);
console.log('base64Bits >> ' + base64Bits);
var signature = 'SharedKeyLite ' + this.getAccountName() + ':' + base64Bits;
console.log('signature >> ' + signature);
return signature;
},
The method successfully returns a "signature" with the appropriate piece encrypted/encoded. However, Azure complains that it's not formatted correctly.
Some example output:
stringToSign >> Mon, 29 Jul 2013 16:04:20 GMT\n/senchaazurestorage/Tables
encodedBits >> 6723ace2ec7b0348e1270ccbaab802bfa5c1bbdddd108aece88c739051a8a767
base64Bits >> ZyOs4ux7A0jhJwzLqrgCv6XBu93dEIrs6IxzkFGop2c=
signature >> SharedKeyLite senchaazurestorage:ZyOs4ux7A0jhJwzLqrgCv6XBu93dEIrs6IxzkFGop2c=
Doing some debugging, I am noticing that CryptoJS is not returning the same value (HMAC with SHA256) as alternative implementations. For example, the string "Mon, 29 Jul 2013 16:04:20 GMT\n/senchaazurestorage/Tables" appears as:
"6723ace2ec7b0348e1270ccbaab802bfa5c1bbdddd108aece88c739051a8a767" via CryptoJS
"faa89f45ef029c63d04b8522d07c54024ae711924822c402b2d387d05398fc9f" via PHP hash_hmac('sha256', ... )
Digging even deeper, I'm seeing that most HMAC/SHA265 algorithms return data which matches the output from PHP... am I missing something in CryptoJS? Or is there a legitimate difference?
As I mentioned in my first comment, the newline ("\n") was causing problems. Escaping that ("\ \n", without the space inbetween) seems to have fixed the inconsistency in HMAC/SHA256 output.
I'm still having problems with the Azure HTTP "Authorization" header, but that's another issue.