Getting UnicodeDecodeError while using gTTS in Python - text-to-speech

While using the gTTS (Google Translate text-to-speech) module in Python 2.x, I am getting this error:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gtts/tts.py", line 94, in __init__
    if self._len(text) <= self.MAX_CHARS:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gtts/tts.py", line 154, in _len
    return len(unicode(text))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
Even though I have included # -*- coding: utf-8 -*- in my Python script, I get this error whenever the text contains non-ASCII characters. Alternatively, tell me another way to achieve this, such as writing the sentence in English and having it translated into the other language. But that does not work either: I get speech in English with only the accent changed.
I have searched everywhere but can't find an answer. Please help!

I have tried writing the string as a Unicode literal:
u"Qu'est-ce que tu fais? Gardez-le de côté."
Because the text is then already a unicode object rather than a byte string, the len(unicode(text)) call inside gTTS no longer has to decode it with the ASCII codec, which resolves the error. So the text you want converted into speech can contain UTF-8 (non-ASCII) characters and is transformed without trouble.
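For illustration, here is a minimal sketch of the whole flow. It assumes the gTTS constructor and save() method of the version in the question, and relies on the lang parameter (not the text itself) to select the spoken language, which also addresses the "speech in English, only accent changed" complaint:
# -*- coding: utf-8 -*-
from gtts import gTTS

# a unicode literal, so gTTS never has to decode bytes itself
text = u"Qu'est-ce que tu fais? Gardez-le de côté."
tts = gTTS(text=text, lang='fr')  # 'fr' picks the French voice
tts.save('speech.mp3')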

You also need to decode the incoming argument in gtts-cli.py. Change the following lines from this:
if args.text == "-":
    text = sys.stdin.read()
else:
    text = args.text
to this (codecs must be imported at the top of the file):
if args.text == "-":
    text = sys.stdin.read()
else:
    text = codecs.decode(args.text, "utf-8")
This works for me; give it a try.

Related

I need to replace non-ASCII characters in a pandas data frame column in Python 2.7

This question has been asked many times, but none of the solutions worked for me.
The data frame was pulled from a third-party Excel file with 'UTF-8' encoding:
pd.read_excel(file, encoding='UTF-8', sheet_name=worksheet)
But I still have characters like " ’ " instead of " ' " in some lines.
On the top of the code I have the following
# -*- encoding: utf-8 -*-
The following line does not throw errors, but does not change anything in the data:
df['text'] = df['text'].str.replace("’", "'")
I tried it with a dictionary (which has the same core), like
repl_dict = {"’": "'"}
for k, v in repl_dict.items():
    df.loc[df.text.str.contains(k), 'text'] = df.text.str.replace(pat=k, repl=v)
and tried many other approaches including regex, but nothing worked.
When I tried:
def replace_apostrophy(text):
    return text.replace("’", "'")

df['text'] = df['text'].apply(lambda x: replace_apostrophy(x))
I received the following error -
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
When I tried:
df["text"] = df["text"].apply(lambda text: unicodedata.normalize('NFKD', text))
I got the following error -
TypeError: normalize() argument 2 must be unicode, not float
The text also contains emojis that I afterwards need to count somehow.
Can someone give me a good advice?
Thank you very much!
I have found a solution myself. It might look clumsy, but works perfectly in my case:
df["text"] = df["text"].apply(lambda text: unicodedata.normalize('NFKD', text).encode('ascii','backslashreplace'))
I had to replace nan values prior to running that code.
That operation leaves me with ASCII symbols only, which can be easily replaced:
def replace_apostrophy(text):
    return text.replace("a\u0302\u20acTM", "'")
Hope this would help someone.
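A cleaner alternative, not from the original answer: if the column holds UTF-8 bytes that were mis-decoded as cp1252 (the usual source of "’"), re-encoding and re-decoding reverses the damage directly. A sketch, assuming the values are unicode strings:
def fix_mojibake(text):
    # u"â€™" -> bytes b"\xe2\x80\x99" -> u"’"
    try:
        return text.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # leave already-clean strings alone

df['text'] = df['text'].apply(fix_mojibake)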

Python can't decode \x byte

I have a csv file with about 9 million rows. While processing it in Python, I got an error:
UnicodeEncodeError: 'charmap' codec can't encode character '\xe9' in position 63: character maps to <undefined>
Turns out the string is Beyonc\xe9. So I guess it's something like é.
I tried just printing '\xe' in Python and it failed:
>>> print('\xe')
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-2: truncated \xXX escape
So I can't even replace or strip the backslash by s.replace('\\x', '') or s.strip('\\x').
Is there a quick way to fix this over the whole file? I tried to set the encoding while reading the file:
pandas.read_csv(inputFile, encoding='utf-8')
but it didn't help. Same problem.
Python version:
python --version
Python 3.5.2
although I installed 3.6.5. This is on Windows 10.
Update:
Following Matti's answer I changed the encoding in pandas.read_csv() to latin1, and now the string comes out as Beyonc\xc3\xa9. And \xc3\xa9 is the UTF-8 encoding of é.
This is the line that's failing:
print(str(title) + ' , ' + str(artist))
title = 'Crazy In Love'
artist = 'Beyonc\xc3\xa9'
The api is from lyricsgenius.
The '\xe9' in the error message isn't an actual backslash followed by letters, it's just a representation of a single byte in the file. Your file is probably encoded as Latin-1, not UTF-8 as you specify. Specify 'latin1' as the encoding instead.
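If, as in the update above, latin1 decoding leaves UTF-8 byte pairs such as \xc3\xa9 in the strings, the bytes were UTF-8 after all, and already-read strings can be repaired by round-tripping them. A sketch in Python 3 (hypothetical, not from the thread):
artist = 'Beyonc\xc3\xa9'  # what latin1 decoding produced
fixed = artist.encode('latin1').decode('utf-8')
print(fixed)  # Beyoncé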

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 3: invalid continuation byte

I'm trying to load a csv file using pd.read_csv but I get the following unicode error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 3: invalid continuation byte
Unfortunately, CSV files have no built-in method of signalling character encoding.
read_csv defaults to guessing that the bytes in the CSV file represent text encoded in the UTF-8 encoding. This results in UnicodeDecodeError if the file is using some other encoding that results in bytes that don't happen to be a valid UTF-8 sequence. (If they by luck did also happen to be valid UTF-8, you wouldn't get the error, but you'd still get wrong input for non-ASCII characters, which would be worse really.)
It's up to you to specify what encoding is in play, which requires some knowledge (or guessing) of where it came from. For example if it came from MS Excel on a western install of Windows, it would probably be Windows code page 1252 and you could read it with:
pd.read_csv('../filename.csv', encoding='cp1252')
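If you have no idea where the file came from, one common way to guess is the third-party chardet package (pip install chardet). This is a sketch, not part of the original answer:
import chardet
import pandas as pd

# sample the first ~100 kB of raw bytes and ask chardet for its best guess
with open('../filename.csv', 'rb') as f:
    guess = chardet.detect(f.read(100000))

print(guess)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.7, ...}
df = pd.read_csv('../filename.csv', encoding=guess['encoding'])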
I got the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 51: invalid continuation byte
This was because I had made changes to the file and its encoding. You could also try to change the encoding of the file to UTF-8, either with some code or with the Notepadqq (nqq) editor on Ubuntu, which provides an option to change the encoding. If the problem remains, try to undo all the changes made to the file, or change the directory.
Hope this helps.
I had this same issue recently. This is what I did:
import pandas as pd

data = pd.read_csv(filename, encoding='unicode_escape')
Copy the code, paste it into a new .py file, and save.

Using the Google text-to-speech service with Spanish text

I am trying to generate a voice message from text in Python, using the following URL:
"http://translate.google.com/translate_tts?tl=es&q=La+castaña+está+muy+buena"
but the urllib library fails with:
File "/usr/lib/python2.7/urllib.py", line 227, in retrieve
    url = unwrap(toBytes(url))
File "/usr/lib/python2.7/urllib.py", line 1051, in toBytes
    " contains non-ASCII characters")
UnicodeError: URL u'http://translate.google.com/translate_tts?tl=es&q=La+casta\xf1a+est\xe1+muy+buena' contains non-ASCII characters
How can I pass special characters (ñ, á, é, í, ó, ú), which have phonetic meaning, to the Google text-to-speech service?
I was facing the same problem and found the solution in this question:
all I had to do was pass an additional parameter: &ie=UTF-8
That's it.
My resulting PHP looked something like:
$q = "Años más, años menos";
$data = file_get_contents('http://translate.google.com/translate_tts?tl=es&ie=UTF-8&q='.$q);
Note the translate_tts?tl=es&ie=UTF-8&q='.$q part of the GET parameters.
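Since the question itself is about Python, here is an equivalent sketch for Python 2: percent-encoding the query with urllib.quote makes the URL pure ASCII, which also avoids the UnicodeError from the question. This is only a sketch; the endpoint may require extra parameters or headers in practice:
# -*- coding: utf-8 -*-
import urllib

text = u'La castaña está muy buena'
url = ('http://translate.google.com/translate_tts?tl=es&ie=UTF-8&q='
       + urllib.quote(text.encode('utf-8')))  # percent-encode the UTF-8 bytes
urllib.urlretrieve(url, 'speech.mp3')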

making a list of traditional Chinese characters from a string

I am currently trying to estimate the number of times each character is used in a large sample of traditional Chinese text. I am interested in characters, not words. The file also includes punctuation and Western characters.
I am reading in an example file of traditional Chinese characters. Here is a small subset:
首映鼓掌10分鐘 評語指不及《花樣年華》
該片在柏林首映,完場後獲全場鼓掌10分鐘。王家衛特別為該片剪輯「柏林版本
增減20處 趙本山香港戲分被刪
在柏林影展放映的《一代宗師》版本
教李小龍武功 葉問決戰散打王
另一增加的戲分是開場時葉問(梁朝偉飾)
My strategy is to read each line, split it into a list, and check each character to see whether it already exists in a list or a dictionary of characters. If the character does not yet exist in my list or dictionary, I will add it; if it does, I will increment the counter for that specific character. I will probably use two parallel lists: a list of characters and a list containing the counts. This will be more processing, but should also be much easier to code.
I have not gotten anywhere near this point yet.
I am able to read in the example file successfully. Then I am able to make a list for each line of my file. I am able to print out those individual lines into my output file and sort of reconstitute the original file, and the traditional Chinese comes out intact.
However, I run into trouble when I try to make a list of each character on a particular line.
I've read through the following article. I understood many of the comments, but unfortunately, was unable to understand enough of it to solve my problem.
How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?
My code looks like the following
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
wordfile = open('Chinese_example.txt', 'r')
output = open('Chinese_output_python.txt', 'w')
LINES = wordfile.readlines()
Through various tests I am sure the following line is not splitting the string LINES[0] into its component Chinese characters.
A_LINE = list(LINES[0])
output.write(A_LINE[0])
It turns out I wanted to use this, from answerer 'flow' at How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?:
from re import compile as _Re

_unicode_chr_splitter = _Re( '(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)' ).split

def split_unicode_chrs( text ):
    return [ chr for chr in _unicode_chr_splitter( text ) if chr ]
to successfully split a line of traditional Chinese characters. I just had to know the proper syntax to handle encoded characters; pretty basic:
my_new_list = list(unicode(LINES[0].decode('utf8')))
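Putting the pieces together, here is a minimal end-to-end sketch in Python 2 (hypothetical, not from the thread) that counts every character in a UTF-8 file with collections.Counter, which replaces the parallel-lists bookkeeping described in the question:
# -*- coding: utf-8 -*-
import codecs
from collections import Counter

counts = Counter()
with codecs.open('Chinese_example.txt', 'r', encoding='utf-8') as wordfile:
    for line in wordfile:
        # each element of a unicode string is already a single character
        counts.update(ch for ch in line if not ch.isspace())

# print the ten most frequent characters and their counts
for ch, n in counts.most_common(10):
    print ch.encode('utf-8'), n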