TTS Error: utterance lists extracted from utt2spk and text - text-to-speech

I'm using a Telugu Text To Speech (TTS) system and I am getting an error like "utterance lists extracted from utt2spk and text".
I would like some help overcoming this error.
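This message usually comes from a Kaldi-style validation step, where the utt2spk and text files must list exactly the same utterance IDs in sorted order. A minimal diagnostic sketch under that assumption (the first whitespace-separated field of each line is taken to be the utterance ID; "data/train" is a hypothetical path):

def utt_ids(path):
    # Collect the utterance IDs (first field) from a Kaldi-style data file.
    with open(path, encoding="utf-8") as f:
        return {line.split()[0] for line in f if line.strip()}

utt2spk_ids = utt_ids("data/train/utt2spk")
text_ids = utt_ids("data/train/text")

# Any ID printed here appears in one file but not the other,
# which is what triggers the mismatch error.
print("in utt2spk but not in text:", sorted(utt2spk_ids - text_ids)[:20])
print("in text but not in utt2spk:", sorted(text_ids - utt2spk_ids)[:20])

If you are on a Kaldi recipe, running utils/fix_data_dir.sh on the data directory should drop the mismatched entries and re-sort the files for you.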

Related

Arabic PDF text extraction

I'm trying to extract text from Arabic PDFs - raw data extraction, not OCR.
I tried many packages and tools and none of them worked: Python packages, PDFBox, the Adobe API, and many others. All of them failed to extract the text correctly; they either read the text left-to-right or decode it incorrectly.
Here are two samples from different tools.
sample 1:
املحتويات
7 الثانية الطبعة مقدمة
9 وتاريخه األدب -١
51 الجاهليون -٢
95 الشعر نحل أسباب -٣
149 والشعراء الشعر -٤
213 مرض شعر -٥
271 الشعر -٦
285 الجاهيل النثر -٧
sample 2:
ﺔﻴﻧﺎﺜﻟا ﺔﻌﺒﻄﻟا ﺔﻣﺪﻘﻣ
ﻪﺨﻳرﺎﺗو بدﻷا -١
نﻮﻴﻠﻫﺎﺠﻟا -٢
ﺮﻌﺸﻟا ﻞﺤﻧ بﺎﺒﺳأ -٣
ءاﺮﻌﺸﻟاو ﺮﻌﺸﻟا -٤
ﴬﻣ ﺮﻌﺷ -٥
ﺮﻌﺸﻟا -٦
ﲇﻫﺎﺠﻟا ﺮﺜﻨﻟا -٧
original text
And yes, when I copy the text I get the same rendered result.
Are there any tools that can extract Arabic text correctly?
The book link can be found here.
The text stored in a PDF is not the same as the text used for its construction; we can see that in your example, where page 7 is shown in Arabic on the surface but is coded as a plain 7 in the underlying text.
However, a greater problem is the language support of the fonts: in Notepad I had to accept a script font to see something similar, but that is using font substitution.
Another complication is Unicode and whitespace ordering.
So the result from
pdftotext -f 5 -l 5 في_الأدب_الجاهلي.pdf try.txt
will at best look like your samples above.
Thus, in summary, your Sample 1 is equal to, if not better than, any other simple attempt.
Later edit, from B.A.'s comment below:
I found a way to work around this: after extracting the text, I open the txt file and normalize its content with the unicodedata Python module, which offers the unicodedata.normalize() function. So I can now say that pdftotext is the best tool for Arabic text extraction.
Unicode normalization should fix that issue (you can choose NFKC).
Most programming languages have a normalization function.
Check here for more info about normalization:
https://unicode.org/reports/tr15/
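Putting the two steps together, a minimal sketch of the workflow from the edit above: pdftotext extraction followed by NFKC normalization (the file names are taken from the command shown earlier). NFKC maps Arabic presentation forms, like those in sample 2, back to their base letters.

import subprocess
import unicodedata

# Extract page 5 with pdftotext, as in the command above.
subprocess.run(
    ["pdftotext", "-f", "5", "-l", "5", "في_الأدب_الجاهلي.pdf", "try.txt"],
    check=True,
)

with open("try.txt", encoding="utf-8") as f:
    raw = f.read()

# NFKC folds presentation-form codepoints into the canonical letters.
normalized = unicodedata.normalize("NFKC", raw)

with open("try_normalized.txt", "w", encoding="utf-8") as f:
    f.write(normalized)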

Camelot in Python does not behave as expected

I have two PDF documents, both with the same layout but containing different information. The problem is:
I can read one perfectly, but in the other the data is unrecognizable.
This is an example that I can read perfectly (download here):
import camelot

from_pdf = camelot.read_pdf('2019_05_2.pdf', flavor='stream', strict=False)
df_pdf = from_pdf[0].df
camelot.plot(from_pdf[0], kind='text').show()
print(from_pdf[0].parsing_report)
This is the dataframe as expected:
This is an example where, after reading, the information is unrecognizable (download here):
from_pdf = camelot.read_pdf('2020_04_2.pdf', flavor='stream', strict=False)
df_pdf = from_pdf[0].df
camelot.plot(from_pdf[0], kind='text').show()
print(from_pdf[0].parsing_report)
This is the dataframe with unrecognizable information:
I don't understand what I have done wrong and why the same code doesn't work for both files. I need some help, thanks.
The problem: malformed PDF
Simply put, the problem is that your second PDF is malformed/corrupted. It doesn't contain correct font information, so it is impossible to extract text from your PDF as is. This is a known and difficult problem (see this question).
You can check this by trying to open the PDF with Google Docs.
Google Docs tries to extract the text, and this is the result:
Possible solutions
If you want to extract the text, you can print the document to an image-based PDF and perform an OCR text extraction.
However, Camelot does not currently support image-based PDFs, so it is not possible to extract the table.
If you have no way to recover a well-formed PDF, you could try this strategy (sketched after this list):
print PDF to an image-based PDF
add a good text layer to your image-based PDF (using OCRmyPDF)
try using Camelot to extract tables
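A minimal sketch of that strategy, assuming OCRmyPDF is installed (with force_ocr=True it covers steps 1 and 2: it rasterizes each page and adds a fresh Tesseract text layer); '2020_04_2_ocr.pdf' is a hypothetical output name:

import camelot
import ocrmypdf

# Steps 1-2: rasterize the pages and replace the broken text layer
# with Tesseract's OCR output.
ocrmypdf.ocr("2020_04_2.pdf", "2020_04_2_ocr.pdf", force_ocr=True)

# Step 3: run Camelot against the repaired file.
tables = camelot.read_pdf("2020_04_2_ocr.pdf", flavor="stream")
print(tables[0].parsing_report)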

Custom Speech: "normalized text is empty"

I uploaded a wav and text file to the custom speech portal. I got the following error: “Error: normalized text is empty.”
The text file is UTF-8 BOM, and is similar in format to a file that did work.
How can I troubleshoot this?
There can be several reasons for a normalized text to be empty, e.g. if a sentence mixes words of Latin and non-Latin characters (depending on the locale). Words that are repeated multiple times in a row may also cause this. Can you share which locale you're using to import the data? If you could share the text, we can find the reason. Otherwise, you could try reducing the input text (there is no need to cut the audio for this) to find out what causes normalization to discard the sentence.
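If you want to pre-screen the transcript yourself, here is a rough sketch that flags the two causes mentioned above (sentences mixing Latin and non-Latin words, and words repeated back-to-back). It is only a heuristic, not the service's actual normalizer; 'transcript.txt' is a hypothetical file name with one utterance per line:

import re

latin = re.compile(r"[A-Za-z]")
non_ascii = re.compile(r"[^\x00-\x7F]")

# utf-8-sig strips the BOM mentioned in the question.
with open("transcript.txt", encoding="utf-8-sig") as f:
    for lineno, line in enumerate(f, start=1):
        words = line.split()
        # Cause 1: a sentence containing both Latin and non-Latin words.
        if any(latin.search(w) for w in words) and any(non_ascii.search(w) for w in words):
            print(f"line {lineno}: mixes Latin and non-Latin words")
        # Cause 2: the same word repeated immediately in a row.
        if any(a == b for a, b in zip(words, words[1:])):
            print(f"line {lineno}: repeats a word back-to-back")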

Extracting PDF Text in Hindi using PDFBox

So I am trying to extract English and Hindi text from a PDF file. The English text is extracted properly, but when I try to extract the Hindi text, some characters are replaced by circles/squares.
I copied the Hindi text snippet directly from the PDF file into a Word document and got the same squares for some characters.
PDFBox Version: 2.0.7
PDF Version: 1.6 (Acrobat 7.x)
Security Details (PDF): (screenshot omitted)
Font Details: (screenshot omitted)
I cannot attach the PDF, but here is a snippet of the PDF file (Adobe Acrobat Reader).
Note: I have drawn the black bar over it, as that area contains someone's address.
Output of text extracted using PDFBox:
पता: कालकाजी, दि ण िद ी, िद ी - 110019
As you can see from the PDFBox extraction output above, some of the characters are replaced by circles. The same happens when I manually copy from the PDF into a Word document.
I have also tried Tesseract OCR, but that gives even worse output. I would like to know what other options I can try.
For instance, extracting the data using PDFBox not as text but as an image?
EDIT: I am also getting the following warnings.
03:58:38.711 [main] WARN o.a.pdfbox.pdmodel.font.PDType0Font - No Unicode mapping for CID+26 (26) in font Lohit-Devanagari
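For the render-to-image route, here is a minimal sketch in Python rather than PDFBox (assuming pdf2image with Poppler, and pytesseract with the Hindi 'hin' traineddata installed; 'document.pdf' is a placeholder name). Devanagari OCR often improves with the right language model and a higher rendering DPI, though there is no guarantee it beats direct extraction:

import pytesseract
from pdf2image import convert_from_path

# Render each page to an image, since the embedded font lacks Unicode
# mappings for some glyphs (hence the PDType0Font warnings above).
pages = convert_from_path("document.pdf", dpi=300)
for page in pages:
    # OCR the rendered page with the Hindi language model.
    text = pytesseract.image_to_string(page, lang="hin")
    print(text)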

Extract sections of PDF

I am trying to extract sections of a PDF file for use in text analysis. I tried using pdf-extract to accomplish this. However, a command such as
pdf-extract extract --regions --no-lines Bauer2010.pdf
only extracts the (x,y) coordinates of each region, as in the example below.
<region x="226.32" y="750.47" width="165.57" height="6.37" line_height="6.37" font="BGBFHO+AdvP4DF60E">Patient Education and Counseling 79 (2010) 315-319</region>
Can sections of a PDF be extracted?
Have a look at http://text-analyzer.com, where you can upload your PDF file and it will convert it into a format suitable for natural language processing. Once converted into a text file, it can then process the file, breaking it down into sentences with sentiment analysis. It has over 40 different types of sentence views in which you can tag sections, and those tagged sentences can be exported.
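Alternatively, the pdf-extract region output shown in the question can be post-processed directly. A rough sketch that groups consecutive regions by their font attribute as a crude section heuristic (it assumes the regions have been saved under a single XML root element; 'Bauer2010_regions.xml' is a hypothetical file name):

import xml.etree.ElementTree as ET

tree = ET.parse("Bauer2010_regions.xml")
sections = []
current_font, current_text = None, []
for region in tree.iter("region"):
    font = region.get("font")
    # A font change is taken as a crude signal of a new section
    # (e.g. headings vs. body text typically use different fonts).
    if current_text and font != current_font:
        sections.append(" ".join(current_text))
        current_text = []
    current_font = font
    current_text.append((region.text or "").strip())
if current_text:
    sections.append(" ".join(current_text))

for section in sections:
    print(section[:80])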