How do I split a Chinese string into characters using TensorFlow?

I want to use tf.data.TextLineDataset() to read Chinese sentences and then use the map() function to split them into single words, but tf.split doesn't work for Chinese.

I also hope someone can kindly help with this issue.
Here is my current solution (a rough code sketch follows the list):
1. Read Chinese sentences from the file in UTF-8 encoding.
2. Tokenize the sentences with a tool such as jieba.
3. Construct the vocab table.
4. Convert the source/target sentences according to the vocab table.
5. Convert to a dataset using from_tensor_slices.
6. Get an iterator from the dataset.
7. Do the rest of the processing.
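A rough sketch of this pipeline (assuming TF 1.x to match the answer below, jieba installed, and a hypothetical UTF-8 file train.txt with one sentence per line; the vocab and padding handling are deliberately simplified):

# -*- coding: utf-8 -*-
import jieba
import tensorflow as tf

# 1. Read Chinese sentences from a UTF-8 encoded file (hypothetical file name).
with open('train.txt', encoding='utf-8') as f:
    sentences = [line.strip() for line in f if line.strip()]

# 2. Tokenize each sentence with jieba.
tokenized = [list(jieba.cut(s)) for s in sentences]

# 3. Construct a vocab table (id 0 is reserved for padding here).
vocab = {}
for tokens in tokenized:
    for tok in tokens:
        vocab.setdefault(tok, len(vocab) + 1)

# 4. Convert the sentences to id sequences, padded to a fixed length.
max_len = max(len(t) for t in tokenized)
ids = [[vocab[tok] for tok in tokens] + [0] * (max_len - len(tokens))
       for tokens in tokenized]

# 5./6. Build the dataset with from_tensor_slices and get an iterator (TF 1.x style).
dataset = tf.data.Dataset.from_tensor_slices(ids)
iterator = dataset.make_one_shot_iterator()
next_sentence = iterator.get_next()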
If TextLineDataset is used to load Chinese sentences directly, the contents of the dataset look strange: they are displayed as a byte stream.
Maybe we could treat every byte as one character, the way it works for English-like languages.
Can anyone confirm this, or does anyone have another suggestion?

The approach above is a common option when handling non-English languages like Chinese, Korean, Japanese, etc.
You can also use the code below.
By the way, as you know, TextLineDataset reads text content as a byte string.
So if we want to handle Chinese, we first need to decode it to Unicode.
Unfortunately, there is no built-in option for this in TensorFlow, so we need to use another method, such as tf.py_func.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import tensorflow as tf

def preprocess_func(x):
    ret = "*".join(x.decode('utf-8'))
    return ret

str = tf.py_func(
    preprocess_func,
    [tf.constant(u"我爱,南京")],
    tf.string)

with tf.Session() as sess:
    value = sess.run(str)
    print(value.decode('utf-8'))

Output: 我*爱*,*南*京
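For reference, newer TensorFlow releases (roughly 1.13+ and 2.x) do provide a built-in op for this, so the py_func detour may no longer be needed; a minimal sketch, assuming TF 2.x eager execution:

import tensorflow as tf

# tf.strings.unicode_split splits a UTF-8 byte string into one substring per character.
chars = tf.strings.unicode_split(tf.constant(u"我爱,南京"), 'UTF-8')
print(chars)  # a string tensor holding the individual characters; also usable inside dataset.map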

Related

How to tokenize word with hyphen in Spacy

I want to tokenize bs-it to ["bs", "it"] using spaCy, as I am using it with Rasa. The output which I currently get is ["bs-it"]. Can somebody help me with that?
You can add custom rules to spaCy's tokenizer. By default, spaCy's tokenizer treats hyphenated words as a single token. To change that, you can add a custom tokenization rule. In your case, you want to tokenize on an infix, i.e. something that occurs between two words; these are usually hyphens or underscores.
import re
import spacy
from spacy.tokenizer import Tokenizer

infix_re = re.compile(r'[-]')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("bs-it")
print([t.text for t in doc])

Output:
['bs', '-', 'it']
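If you also want to drop the hyphen token itself to get ['bs', 'it'], one simple option is to filter out punctuation after tokenizing:

print([t.text for t in doc if not t.is_punct])
# ['bs', 'it']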

SpaCy use Lemmatizer as stand-alone component

I want to use SpaCy's lemmatizer as a standalone component (because I have pre-tokenized text, and I don't want to re-concatenate it and run the full pipeline because SpaCy will most likely tokenize differently in some cases).
I found the lemmatizer in the package, but I somehow need to load the dictionaries with the rules to initialize this Lemmatizer.
These files must be somewhere in the English or German model, right? I couldn't find them there.
from spacy.lemmatizer import Lemmatizer
lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)

Where do the LEMMA_INDEX, etc. files come from?
I found a similar question here: Spacy lemmatizer issue/consistency
but this one did not entirely answer how to get these dictionary files from the model. The spacy.lang.* parameter seems to no longer exist in newer versions.
Here's an extracted bit of code I had that used the spaCy lemmatizer by itself. I'm not somewhere I can run it, so it might have a small bug or two if I made an editing mistake.
Note that in general, you need to know the upos for the word in order to lemmatize correctly. This code will return all the possible lemmas but I would advise modifying it to pass in the correct upos for your word.
class SpacyLemmatizer(object):
    def __init__(self, smodel):
        import spacy
        self.lemmatizer = spacy.load(smodel).vocab.morphology.lemmatizer

    # get the lemmas for every upos
    def getLemmas(self, entry):
        possible_lemmas = set()
        for upos in ('NOUN', 'VERB', 'ADJ', 'ADV'):
            lemmas = self.lemmatizer(entry, upos, morphology=None)
            lemma = lemmas[0]  # See morphology.pyx::lemmatize
            possible_lemmas.add(lemma)
        return possible_lemmas
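A brief usage sketch (the model name is only an example, and the exact lemmas depend on the model and spaCy version):

lemmatizer = SpacyLemmatizer('en_core_web_sm')
print(lemmatizer.getLemmas('running'))  # e.g. {'run', 'running'}, one lemma per upos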

How to parse mxnet params file into plain text?

I'm trying to use Python to parse an MXNet params file into plain text. The code looks like the below, but the parsing result is not a plain string; it is some encoded text that looks like this: "... \xaa>\x0f\xed\x8e>\xaf!\x8f>g ...". Could anybody give me some tips on it? Thanks a lot!
...
param_file = 'resnet-50-0000.params'
with open(param_file, 'rb') as f:
    net_params = f.read()
...
The parameters are stored as a binary file. If you want to read them as plain text, you first need to load them as a dictionary of parameter_name -> NDArray, which you can then convert to NumPy. From NumPy you can convert each array to a list and process it as a (nested) list of scalars.
import mxnet as mx

params = mx.nd.load('resnet-50-0000.params')
for k, param in params.items():
    print(k)
    print(param.asnumpy().tolist())

How to create dataset in the same format as the FSNS dataset?

I'm working on this project based on TensorFlow.
I just want to train an OCR model with attention_ocr on my own dataset, but I don't know how to store my images and ground truth in the same format as the FSNS dataset.
Is there anybody else working on this project, or does anyone know how to solve this problem?
The data format for storing training/test data is defined in the FSNS paper https://arxiv.org/pdf/1702.03970.pdf (Table 4).
To store tfrecord files with tf.Example protos you can use tf.python_io.TFRecordWriter. There is a nice tutorial, an existing answer on Stack Overflow, and a short gist.
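The gist itself is not reproduced in the answer; typical implementations of those two helpers look roughly like this (a sketch, not necessarily identical to the gist's versions):

import tensorflow as tf

def _int64_feature(values):
    # Accept a single int or a list of ints.
    if not isinstance(values, (list, tuple)):
        values = [values]
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[int(v) for v in values]))

def _bytes_feature(value):
    # tf.train.BytesList expects bytes, not str.
    if isinstance(value, str):
        value = value.encode('utf-8')
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))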
Assume you have a numpy ndarray img which has num_of_views images stored side-by-side (see Fig. 3 in the paper), and the corresponding text in a variable text. You will need to define some function to convert a unicode string into a list of character ids, padded to a fixed length, as well as unpadded. For example:
char_ids_padded, char_ids_unpadded = encode_utf8_string(
    text='abc',
    charset={'a': 0, 'b': 1, 'c': 2},
    length=5,
    null_char_id=3)

The result should be:

char_ids_padded = [0, 1, 2, 3, 3]
char_ids_unpadded = [0, 1, 2]
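encode_utf8_string is left undefined in the answer; a minimal sketch consistent with the example above (unknown characters and over-long strings are not handled) could be:

def encode_utf8_string(text, charset, length, null_char_id):
    # Map each character to its id from the charset.
    char_ids_unpadded = [charset[c] for c in text]
    # Pad with the null character id up to the fixed length.
    char_ids_padded = char_ids_unpadded + [null_char_id] * (length - len(char_ids_unpadded))
    return char_ids_padded, char_ids_unpadded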
If you use the functions _int64_feature and _bytes_feature defined in the gist, you can create an FSNS-compatible tf.Example proto using the following snippet:
char_ids_padded, char_ids_unpadded = encode_utf8_string(
    text, charset, length, null_char_id)
example = tf.train.Example(features=tf.train.Features(
    feature={
        'image/format': _bytes_feature("PNG"),
        'image/encoded': _bytes_feature(img.tostring()),
        'image/class': _int64_feature(char_ids_padded),
        'image/unpadded_class': _int64_feature(char_ids_unpadded),
        'height': _int64_feature(img.shape[0]),
        'width': _int64_feature(img.shape[1]),
        'orig_width': _int64_feature(img.shape[1] // num_of_views),
        'image/text': _bytes_feature(text)
    }
))
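Once the example proto is built, it can be written to a tfrecord file with the TFRecordWriter mentioned above (the output file name is just a placeholder):

import tensorflow as tf

with tf.python_io.TFRecordWriter('fsns_train-00000-of-00001') as writer:
    writer.write(example.SerializeToString())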
You should not use this line from the tf.Example snippet directly:

'image/encoded': _bytes_feature(img.tostring()),

In my code, I wrote this instead:

_, jpegVector = cv2.imencode('.jpeg', img)
imgStr = jpegVector.tostring()
'image/encoded': _bytes_feature(imgStr)

Greek letters passed by argparse to matplotlib

I am using matplotlib to plot some data; however, the labels for the plots are sent in via argparse. The problem is that I want some of the labels to contain Greek letters. Which "Greek letter code" should I use to handle this?
I've used argparse to read in command-line arguments for my matplotlib legends. It translates Greek characters just fine when I specify something like '$\Delta G$' as an input argument (output as ΔG).
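A minimal sketch of that setup (the argument name and plot data are just placeholders):

import argparse
import matplotlib.pyplot as plt

parser = argparse.ArgumentParser()
parser.add_argument('--label', default=r'$\Delta G$')  # e.g. pass --label '$\alpha$' on the command line
args = parser.parse_args()

# matplotlib's mathtext renders '$\Delta G$' in the legend as ΔG.
plt.plot([0, 1, 2], [0, 1, 4], label=args.label)
plt.legend()
plt.show()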