Using the Google text-to-speech service with Spanish text

I am trying to generate a voice message from text using the following URL:
"http://translate.google.com/translate_tts?tl=es&q=La+castaña+está+muy+buena"
I am using Python, but the urllib library fails with:
File "/usr/lib/python2.7/urllib.py", line 227, in retrieve
url = unwrap(toBytes(url))
File "/usr/lib/python2.7/urllib.py", line 1051, in toBytes
" contains non-ASCII characters")
UnicodeError: URL u'http://translate.google.com/translate_tts?tl=es&q=La+casta\xf1a+est\xe1+muy+buena' contains non-ASCII characters
How can I pass special characters (ñ, á, é, í, ó, ú), which carry phonetic meaning, to the Google text-to-speech service?

I was facing the same problem and found the solution in this question:
all I had to do was pass an additional parameter: &ie=UTF-8
That's it.
My resulting PHP looked something like:
$q = "Años más, años menos";
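$data = file_get_contents('http://translate.google.com/translate_tts?tl=es&ie=UTF-8&q='.urlencode($q));
Note the translate_tts?tl=es&ie=UTF-8&q= part of the GET parameters. The text itself is passed through urlencode() so that spaces and accented characters are safe inside the URL.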
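
Since the question uses Python, the same idea can be sketched there: build the query string with an encoder that percent-encodes the text as UTF-8, so the final URL is pure ASCII. This is a minimal Python 3 sketch, not the original poster's code. The helper name build_tts_url is made up for illustration, and the code only constructs the URL without contacting the service, so it says nothing about whether the endpoint still answers such requests.

```python
from urllib.parse import urlencode

BASE_URL = "http://translate.google.com/translate_tts"  # endpoint from the question


def build_tts_url(text, lang="es"):
    """Return an ASCII-only URL (build_tts_url is a hypothetical helper name)."""
    # urlencode() percent-encodes non-ASCII characters as UTF-8 by default
    # and turns spaces into '+'.
    query = urlencode({"tl": lang, "ie": "UTF-8", "q": text})
    return BASE_URL + "?" + query


url = build_tts_url("La castaña está muy buena")
print(url)
```

Because the resulting string contains only ASCII characters, it no longer triggers the "contains non-ASCII characters" check shown in the traceback.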

Related

Is there any way in Botium to assert response text that contains emojis?

I have a response from the bot like "Hi there! 👋". Does Botium provide any support for testing emojis in the response text?
BotiumScript files are UTF-8, so as long as you are using UTF-8 characters it will work.

How to scrape NON-ASCII characters alone

I am using BeautifulSoup to scrape data. The text I want to scrape is "€ 48,50", which contains a non-ASCII character. I would like to remove the euro sign so that the final output is "48,50", but I keep getting errors because the console cannot print it. I am using Python 2.7 on Windows and would appreciate a solution.
I do not know how to go about this. Alternatively, is there a way to extract the non-ASCII characters alone?
w= item.find_all("div",{"class":"product-price"}).find("strong",
{"class":"product-price__money"}).text.replace("\\u20ac"," ")
print w
You need to decode the string and pass the replace function a unicode string.
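# -*- coding: utf-8 -*-
text = "€ 48,50"  # a UTF-8 encoded byte string in Python 2
# Decode to unicode first, then remove the euro sign and the leftover space.
w = text.decode("utf-8").replace(u"\u20ac", u"").strip()
print w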
See How to replace unicode characters in string with something else python? for more details.
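
In Python 3 strings are already unicode, so no decode step is needed. The following is a small sketch under that assumption. The helper names strip_non_ascii and only_non_ascii are invented here, and the second one addresses the side question of extracting only the non-ASCII characters.

```python
def strip_non_ascii(text):
    """Drop every character outside the ASCII range and trim whitespace."""
    return "".join(ch for ch in text if ord(ch) < 128).strip()


def only_non_ascii(text):
    """Keep only the characters outside the ASCII range."""
    return "".join(ch for ch in text if ord(ch) >= 128)


price = "\u20ac 48,50"  # the euro sign written as an escape sequence
print(strip_non_ascii(price))
# ascii() escapes the euro sign, so printing cannot fail on a limited console.
print(ascii(only_non_ascii(price)))
```

Filtering by code point removes every non-ASCII character, not just the euro sign; if only the euro sign should go, price.replace("\u20ac", "").strip() is the narrower choice.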

How to handle emoji when scraping text in Python with bs4

I'm creating a scraper that scrapes all the comments in a URL page and I'm saving the text in a txt file (1 comment = 1 txt).
Now I'm having a problem when there are emoji in the text of a comment: the program stops and says "UnicodeEncodeError: 'charmap' codec can't encode the character". How can I get around this problem? (I'm using bs4.)
The structure of the code is like this:
q = requests.get(url)
soup = BeautifulSoup(q.content, "html.parser")
x = soup.find("a", {"class": "comments"})
y = x.find_all("div", {"class": "blabla"})
i = 0
for item in y:
    name = str(i)
    comment = item.find_all("p")
    out_file = open('%s.txt' % CreatorName, "w")
    out_file.write(str(comment))
    out_file.close()
    i = i + 1
Thanks to everyone.
My guess is that you are on Windows; your code works perfectly on Linux. Change the encoding of the file you open to UTF-8 like this:
out_file=open('%s.txt'%CreatorName, "w", encoding='utf-8')
This should write to the file without error. The emoji may not display properly in Notepad, but you can always open the file in Firefox or another application if you want to see them. The rest of the comment text should be readable in Notepad.
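
To illustrate the fix in isolation, here is a self-contained Python 3 sketch that writes a comment containing an emoji to a file with an explicit UTF-8 encoding and reads it back. The sample comment and the file name are made up, and a temporary directory is used so that nothing is left behind.

```python
import os
import tempfile

comment = "Great video! \U0001F44B"  # made-up sample comment ending in an emoji

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "0.txt")  # hypothetical file name
    # An explicit encoding avoids the platform default (often cp1252 on
    # Windows), which cannot represent emoji.
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write(comment)
    with open(path, "r", encoding="utf-8") as in_file:
        read_back = in_file.read()

print(read_back == comment)
```

The with statement also closes the file automatically, so there is no separate close() call to forget.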

Getting a UnicodeDecodeError while using gTTS in Python

While using the gTTS Google text-to-speech module in Python 2.x, I am getting this error:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gtts/tts.py", line 94, in __init__
    if self._len(text) <= self.MAX_CHARS:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gtts/tts.py", line 154, in _len
    return len(unicode(text))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
Even though I have included # -*- coding: utf-8 -*- in my Python script, I get the error whenever I use non-ASCII characters. Is there some other way to do this, for example writing the sentence in English and having it translated into the other language? I tried that, but it does not work either: the speech I get is still in English, with only the accent changed.
I have searched everywhere but can't find an answer. Please help!
I tried writing the string as a unicode literal:
u"Qu'est-ce que tu fais? Gardez-le de côté."
Because the text is then already a unicode object, it no longer has to be decoded with the ASCII codec, which resolves the error. The text you want converted into speech can therefore contain non-ASCII characters.
You also need to decode the incoming argument in gtts-cli.py.
Change the following lines from this:
if args.text == "-":
    text = sys.stdin.read()
else:
    text = args.text
to this:
if args.text == "-":
    text = sys.stdin.read()
else:
    text = codecs.decode(args.text, "utf-8")
Make sure codecs is imported at the top of the script. It works for me, give it a try.
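
The error message itself can be reproduced without gTTS. The sketch below (Python 3, with a made-up Hindi sample string written as escapes) shows that UTF-8 encoded text outside the ASCII range cannot be decoded with the ASCII codec, while decoding it as UTF-8 recovers the text. Byte 0xe0 is a typical first byte of a three-byte UTF-8 sequence, which is consistent with the traceback above.

```python
# Made-up sample text; the bytes are what a Python 2 byte string would hold.
raw = "\u0928\u092e\u0938\u094d\u0924\u0947".encode("utf-8")

print(hex(raw[0]))

try:
    raw.decode("ascii")  # what an implicit ASCII decode attempts
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    print(exc.reason)

text = raw.decode("utf-8")  # explicit decoding with the right codec succeeds
print(len(text))
```

This is the same step that codecs.decode(args.text, "utf-8") performs in the patched script: the bytes are turned into text explicitly instead of being left to an implicit ASCII decode.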

making a list of traditional Chinese characters from a string

I am currently trying to estimate the number of times each character is used in a large sample of traditional Chinese text. I am interested in characters, not words. The file also includes punctuation and Western characters.
I am reading in an example file. Here is a small subset:
首映鼓掌10分鐘 評語指不及《花樣年華》
該片在柏林首映,完場後獲全場鼓掌10分鐘。王家衛特別為該片剪輯「柏林版本
增減20處 趙本山香港戲分被刪
在柏林影展放映的《一代宗師》版本
教李小龍武功 葉問決戰散打王
另一增加的戲分是開場時葉問(梁朝偉飾)
My strategy is to read each line, split it into a list of characters, and check whether each character already exists in a list or dictionary of characters. If the character does not exist there yet, I will add it; if it does, I will increase the counter for that character. I will probably use two lists: a list of characters and a parallel list containing the counts. This will take more processing, but should also be much easier to code.
I have not gotten anywhere near this point yet.
I am able to read in the example file successfully. Then I am able to make a list for each line of my file. I am able to print out those individual lines into my output file and sort of reconstitute the original file, and the traditional Chinese comes out intact.
However, I run into trouble when I try to make a list of each character on a particular line.
I've read through the following article. I understood many of the comments, but unfortunately, was unable to understand enough of it to solve my problem.
How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?
My code looks like the following
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
wordfile = open('Chinese_example.txt', 'r')
output = open('Chinese_output_python.txt', 'w')
LINES = wordfile.readlines()
Through various tests I am sure the following line is not splitting the string LINES[0] into its component Chinese characters.
A_LINE = list(LINES[0])
output.write(A_LINE[0])
I think you want to use this, from answerer 'flow' at How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?:
from re import compile as _Re
_unicode_chr_splitter = _Re( '(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)' ).split
def split_unicode_chrs( text ):
    return [ chr for chr in _unicode_chr_splitter( text ) if chr ]
To successfully split a line of traditional Chinese characters, I just had to know the proper syntax for handling encoded characters, which turned out to be pretty basic: decode the line from UTF-8 before turning it into a list.
my_new_list = list(LINES[0].decode('utf8'))
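
Once a line is decoded, counting is straightforward. As a hedged sketch of the counting strategy from the question (Python 3, where str is already unicode and iterating over a string yields its characters), collections.Counter can stand in for the two parallel lists. The function name count_characters and the choice to skip whitespace are assumptions made here, and the sample lines are taken from the excerpt in the question rather than read from a file.

```python
from collections import Counter


def count_characters(lines):
    """Count every non-whitespace character across an iterable of lines."""
    counts = Counter()
    for line in lines:
        # Iterating over a str yields one character at a time.
        counts.update(ch for ch in line if not ch.isspace())
    return counts


sample = [
    "首映鼓掌10分鐘 評語指不及《花樣年華》",
    "教李小龍武功 葉問決戰散打王",
]
counts = count_characters(sample)
for ch, n in counts.most_common(3):
    # ascii() escapes the characters so printing works on any console.
    print(ascii(ch), n)
```

For a real file, passing open('Chinese_example.txt', encoding='utf-8') as lines should work the same way, since iterating over a file object yields its lines.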