I am a fresher. My requirement is that I have a file containing English and special characters, say Chinese, French, etc. I need to read the file and replace some English characters in it.
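For what it's worth, here is a minimal sketch of one way to do this in Python, assuming the file is UTF-8 encoded; the file names and the 'old'/'new' strings are placeholders for whatever actually needs replacing:

import io

# Read the whole file as Unicode text; Chinese, French, etc. survive intact.
with io.open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Replace some English text; non-English characters pass through untouched.
text = text.replace('old', 'new')

# Write the result back out, again as UTF-8.
with io.open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)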
I'm trying to read the official pdf specification "Document management — Portable document format — Part 1: PDF 1.7" (PDF32000_2008.pdf) as bytes and then interpret them according to that specification.
In Annex D, Character Sets and Encodings, there is a list of all named characters.
When I parse PDF32000_2008.pdf, there are also named characters like "f_f", "uni00D0" and "a204", which are missing in that specification.
My guess is that "f_f" is a symbol for two 'f' characters, which might be printed with a special glyph. There is a Unicode character "Latin Small Ligature Ff" (U+FB00) for 'ff'.
For example, there is also "f_i" in that file, which I expect to mean 'fi': one glyph showing the two characters 'f' and 'i'. However, the PDF specification already has 'fi' as the named character "fi", so what is the point of having two named characters pointing to the same symbol?
I can imagine that "uni00D0" means the Unicode character 'Ð'. However, PDF already defines it as the named character "Eth".
What could "a204" be? Maybe ANSI 204, 'Ì', which already has a named character "Igrave"?
And why would they also use "a62", which would be just a '<'?
However, my main question is: Where can I find a specification for these additional named characters ?
Of course, Adobe Acrobat understands them, but Gmail also seems to have no problem with them. So I guess their meaning must be specified somewhere.
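For what it's worth, names of this shape are commonly interpreted along the lines of Adobe's glyph-naming conventions (the Adobe Glyph List). Below is a rough sketch of that interpretation for the "uniXXXX" and "x_y" cases only; glyph_name_to_text is my own name and this is not an authoritative implementation:

import re

def glyph_name_to_text(name):
    # Drop any suffix after the first period, e.g. "f.alt" -> "f".
    name = name.split('.', 1)[0]
    parts = []
    # Underscores separate components, so "f_f" -> "f", "f" -> the text "ff".
    for component in name.split('_'):
        m = re.match(r'uni((?:[0-9A-F]{4})+)$', component)
        if m:
            # "uni" followed by groups of four hex digits names code points,
            # so "uni00D0" -> U+00D0.
            digits = m.group(1)
            parts.extend(chr(int(digits[i:i + 4], 16))
                         for i in range(0, len(digits), 4))
        elif len(component) == 1:
            parts.append(component)      # single letters name themselves
        else:
            parts.append('?')            # e.g. "a204" is not covered by this sketch
    return ''.join(parts)

print(glyph_name_to_text('uni00D0'))     # Ð
print(glyph_name_to_text('f_f'))         # ff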
I want to use the Chinese BERT model. In tokenization.py, I found WordpieceTokenizer (https://github.com/google-research/bert/blob/master/tokenization.py), but I don't think WordPiece is needed for Chinese, because the minimal unit of Chinese is the character.
WordpieceTokenizer is just for English text, am I right?
From the README:
We use character-based tokenization for Chinese, and WordPiece tokenization for all other languages.
However, from the Multilingual README (emphasis added):
Because Chinese (and Japanese Kanji and Korean Hanja) does not have whitespace characters, we add spaces around every character in the CJK Unicode range before applying WordPiece.
So WordPiece is presumably run on the whole sentence, though it only matters for sentences that contain non-Chinese characters. To run the code as-is, you would want WordPiece.
However, to clarify:
- WordPiece is not just for English; it can be used on any language, and in practice it is used on many.
- Whether single-character tokenization is the best decision for Chinese is debated.
- WordPiece is not available outside Google; SentencePiece could be used as a replacement (though I think the BERT code might have a pretrained model).
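A minimal sketch (not the actual BERT code) of the space-around-CJK preprocessing described above; the function name and the exact set of Unicode ranges checked here are simplified:

def pad_cjk_chars(text):
    # Add spaces around each character in (a subset of) the CJK Unicode
    # ranges, so whitespace tokenization yields one token per character.
    out = []
    for ch in text:
        cp = ord(ch)
        is_cjk = (0x4E00 <= cp <= 0x9FFF or    # CJK Unified Ideographs
                  0x3400 <= cp <= 0x4DBF or    # Extension A
                  0xF900 <= cp <= 0xFAFF)      # Compatibility Ideographs
        out.append(' %s ' % ch if is_cjk else ch)
    return ''.join(out)

print(pad_cjk_chars(u'BERT模型').split())      # ['BERT', '模', '型']

Non-CJK spans (like "BERT" above) are left for WordPiece to split further, which is why the pipeline still runs WordPiece even on mostly-Chinese input.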
I am using a StreamReader to read a text file and then saving the content of the file into another text file using a StreamWriter. The file I am trying to read can be in ANSI or UTF-8 format.
I am facing a problem with non-English characters. When the input file contains Chinese or Japanese text, everything works fine.
But if the input file contains characters like ã, then the output text file shows a question-mark-like symbol.
I tried to fix this by using the encoding iso-8859-1 for the StreamReader, but now Chinese and Japanese text comes out like this: ¢ãƒãƒ¼ãƒ»ãƒ¦ãƒ¼ã
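A small Python illustration (independent of StreamReader/StreamWriter) of why both symptoms can appear when the same bytes are decoded with the wrong encoding:

# 'ã' in an "ANSI" (Windows-1252) file is the single byte 0xE3; decoding that
# byte as UTF-8 fails and typically shows up as a question-mark-like symbol.
ansi_bytes = u'ã'.encode('windows-1252')
print(ansi_bytes.decode('utf-8', errors='replace'))   # '\ufffd'

# Conversely, decoding UTF-8 bytes as ISO-8859-1 produces mojibake similar to
# the garbled Japanese shown above.
utf8_bytes = u'日本語'.encode('utf-8')
print(utf8_bytes.decode('iso-8859-1'))

One common fix is to determine the input encoding first (for example by checking for a UTF-8 BOM, or trying UTF-8 and falling back to the ANSI code page) and to always write the output as UTF-8, which can represent every character.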
I am currently trying to estimate the number of times each character is used in a large sample of traditional Chinese characters. I am interested in characters not words. The file also includes punctuation and western characters.
I am reading in an example file containing a large sample of traditional Chinese characters. Here is a small subset:
首映鼓掌10分鐘 評語指不及《花樣年華》
該片在柏林首映,完場後獲全場鼓掌10分鐘。王家衛特別為該片剪輯「柏林版本
增減20處 趙本山香港戲分被刪
在柏林影展放映的《一代宗師》版本
教李小龍武功 葉問決戰散打王
另一增加的戲分是開場時葉問(梁朝偉飾)
My strategy is to read each line, split each line into a list, and go through and check each character to see if it already exists in a list or a dictionary of characters. If the character does not yet exist in my list or dictionary, I will add it; if it does exist, I will increase the counter for that specific character. I will probably use two lists: a list of characters and a parallel list containing the counts. This will be more processing, but should also be much easier to code.
I have not gotten anywhere near this point yet.
I am able to read in the example file successfully. Then I am able to make a list for each line of my file. I am able to print out those individual lines into my output file and sort of reconstitute the original file, and the traditional Chinese comes out intact.
However, I run into trouble when I try to make a list of each character on a particular line.
I've read through the following article. I understood many of the comments, but unfortunately, was unable to understand enough of it to solve my problem.
How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?
My code looks like the following:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
wordfile = open('Chinese_example.txt', 'r')
output = open('Chinese_output_python.txt', 'w')
LINES = wordfile.readlines()
Through various tests I am sure the following line is not splitting the string LINES[0] into its component Chinese characters.
A_LINE = list(LINES[0])
output.write(A_LINE[0])
It turns out you want to use this, from answerer 'flow' at How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator? :
from re import compile as _Re
_unicode_chr_splitter = _Re( '(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)' ).split
def split_unicode_chrs( text ):
    return [ chr for chr in _unicode_chr_splitter( text ) if chr ]
to successfully split a line of traditional Chinese characters. I just had to know the proper syntax to handle encoded characters; pretty basic.
my_new_list = list(unicode(LINES[0].decode('utf8')))
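Putting the pieces together, here is a minimal sketch of the counting strategy described in the question, reusing the file names from the code above; it uses a Counter rather than the two parallel lists the questioner proposed:

import io
from collections import Counter

counts = Counter()
with io.open('Chinese_example.txt', 'r', encoding='utf8') as wordfile:
    for line in wordfile:
        # Each element of a decoded line is a single character, so iterating
        # the line yields individual Chinese characters (plus punctuation and
        # western characters, which could be filtered out here if desired).
        for ch in line.strip():
            counts[ch] += 1

with io.open('Chinese_output_python.txt', 'w', encoding='utf8') as output:
    for ch, n in counts.most_common():
        output.write(u'%s\t%d\n' % (ch, n))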
I am working with FontNames from PDF documents and wish to convert them to one of the 14 standard fonts if possible. An example is:
KAIKCD+Helvetica-Oblique
Are there standards for the punctuation and values of these prefixes and suffixes? I have found -Oblique, .I and -Italic as suffixes, all presumably meaning Italic. Or are the names semantically void?
The prefix is as defined in the PDF specification:
For a font subset, the PostScript name of the font — the value of the
font’s BaseFont entry and the font descriptor’s FontName entry — shall
begin with a tag followed by a plus sign (+). The tag shall consist of
exactly six uppercase letters; the choice of letters is arbitrary, but
different subsets in the same PDF file shall have different tags.
Otherwise there are merely some common naming patterns; Jon Tan has tried to catalogue some of them here.
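As a rough sketch of how the quoted rule can be applied in practice (the function and set names here are mine, and anything beyond stripping the subset tag remains heuristic):

import re

# The names of the 14 standard fonts, as listed in the PDF specification.
STANDARD_14 = {
    'Times-Roman', 'Times-Bold', 'Times-Italic', 'Times-BoldItalic',
    'Helvetica', 'Helvetica-Bold', 'Helvetica-Oblique', 'Helvetica-BoldOblique',
    'Courier', 'Courier-Bold', 'Courier-Oblique', 'Courier-BoldOblique',
    'Symbol', 'ZapfDingbats',
}

def normalize_font_name(name):
    # Drop a subset tag: exactly six uppercase letters followed by '+'.
    name = re.sub(r'^[A-Z]{6}\+', '', name)
    return name if name in STANDARD_14 else None

print(normalize_font_name('KAIKCD+Helvetica-Oblique'))   # Helvetica-Oblique

Suffixes such as -Italic or .I would still need ad hoc mapping, which is exactly the part that is not standardized.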