HtmlUtilities.ConvertToText really slow

I am writing a simple app that gets some data from an XML-RPC server, and I noticed that HtmlUtilities.ConvertToText is really slow. Here is a test that takes 900 ms on a quad-core machine:
[TestMethod]
public void TestHtmlToStringDogSlow()
{
string text = @"Zdjelice od dinje
Tartar od morskih delicija
Salata od ječma
Sirni namaz s avokadom
<p>Dosadili su vam uvijek isti namazi? Potpuno vas razumijemo i stoga nudimo namaz od avokada. Za one kojima treba više informacija, glavni su mu sastojci – uz avokado – pinjoli i svježi sir, a neodoljiv je uz prepečenac.</p>
Hladetina
<p>Popularno jelo s kolinja možete poslužiti i kao predjelo svečanog, pa i blagdanskog objeda ili večere, ali i kao samostalno malo jelo.</p>
Jaja u umaku od ajvara
<p>I kad vam nije do maštanja i velikih egzibicija, iz kuhinje možete iznijeti zanimljiva mala iznenađenja posve neočekivana okusa.</p>
Gurmanski zalogajčići
<p>Jelo – dosjetka ne vrednuje se ni brzinom pripreme ni brojem korištenih sastojaka, nego rezultatom. A ovi obogaćeni krekerčići zalogajčići su za bogove.</p>
Pikantni namaz od sira
<p>Dolaze vam gosti ili imate svježega kravljeg sira, a nemate ideju što biste s njime? Načinite vrlo zanimljiv i neuobičajen namaz, koji možete kratko sačuvati i u hladnjaku.</p>
Tuna <em>alla carpaccio</em>
<p>Iako je <em>carpaccio</em> izvorno od sirove zamrznute junetine odnosno govedine, priprema se od različitih namirnica. A pripremate li ga od ribe, ona osobito mora biti vrlo svježa.</p>
Namaz od svježeg sira i gorgonzole
<p>Gorgonzola i začinske trave i običnome sirnom namazu daju fini mediteranski štih. Uz pomno odabran kruh, pa s bademima i orasima, može se poslužiti i u svečanijim prilikama.</p>
Pašteta <em>Twist</em>
<p>Volite li paštete, i u njima možete uživati gurmanski. Idealno za veća okupljanja, za hladan bife.</p>
";
var convertedItems = new List<string>();
var items = text.Split(new string[] {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries);
foreach (string oneItem in items)
{
string cnv = HtmlUtilities.ConvertToText(oneItem);
//string cnv = oneItem.Replace("<p>", "").Replace("</p>", "");
convertedItems.Add(cnv);
}
}
I tried commenting out the ConvertToText line and doing a simple string.Replace instead, and then the test takes only 2 ms. I am well aware that a simple replace is very different from ConvertToText; this is just to put things in perspective.
So my question is: has anyone else experienced this slowness when using HtmlUtilities.ConvertToText?

Yes, I have seen this as well. I have an app that loads data from a syndication feed. When I use this method, performance is absolutely horrible. Instead, I used regex to strip out the HTML tags.

You should use WebUtility.HtmlDecode instead of HtmlUtilities.ConvertToText. It's much faster, and it doesn't replace whitespace characters with space.
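To make the two suggestions above concrete, here is a rough sketch in Python rather than .NET. It strips tags with a regular expression and then decodes entities. Note that, as far as I know, an entity decoder such as WebUtility.HtmlDecode only turns things like &amp;amp; back into characters and does not remove tags, so the two steps complement each other. The names TAG_RE and html_to_text are made up for this illustration, and a regex like this is only reasonable for simple, well-formed fragments such as the ones in the question.

```python
import html
import re

# Matches anything that looks like a tag: "<", then non-">" characters, then ">".
TAG_RE = re.compile(r"<[^>]+>")


def html_to_text(fragment):
    """Strip tags with a regex, then decode entities such as &amp;."""
    without_tags = TAG_RE.sub("", fragment)
    return html.unescape(without_tags)


print(html_to_text("<p>Tuna <em>alla carpaccio</em> &amp; kruh</p>"))
```

This does not normalise whitespace or handle script/style blocks, which a real HTML-to-text converter would.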

Related

How to convert PDF with images which I don't care about to text?

I'm trying to convert PDFs to text files. The problem is that these PDFs contain images, which I don't care about (this is the type of file I want to extract: https://www.sia.aviation-civile.gouv.fr/pub/media/store/documents/file/l/f/lf_sup_2020_213_fr.pdf). Note that if I copy and paste with my mouse, it works quite well (except for the line breaks), so I'd guess it's possible. Most of the answers I found online work pretty well on dummy PDFs with text only, but give especially bad results on the map.
For instance, something like this
from tika import parser # pip install tika
raw = parser.from_file('test2.pdf')
print(raw['content'])
works well for retrieving the text, but I get a lot of junk like this:
ERY
CTR
3
CH
A
which appears because of the map.
Something like this, which works by converting the PDF to images and then reading the images, faces the same problem (I found it in a very similar thread on Stack Overflow, but there is no answer):
import sys
import pytesseract as pt
from pdf2image import convert_from_path  # pip install pdf2image
from PIL import Image
def convert(name):
    # Render each PDF page to an image, then OCR that image
    pages = convert_from_path(name, dpi=200)
    for idx, page in enumerate(pages):
        page.save('page'+str(idx)+'.jpg', 'JPEG')
        quote = Image.open('page'+str(idx)+'.jpg')
        text = pt.image_to_string(quote, lang="fra")
        file_ex = open('page'+str(idx)+'.text', "w")
        file_ex.write(text)
        file_ex.close()
if __name__ == '__main__':
    convert(sys.argv[1])
Finally, I tried removing the images first and then using one of the solutions above, but it didn't work any better:
from tika import parser # pip install tika
from PyPDF2 import PdfFileWriter, PdfFileReader
# Remove the images
inputStream = open("lf_sup_2020_213_fr.pdf", "rb")
outputStream = open("test3.pdf", "wb")
src = PdfFileReader(inputStream)
output = PdfFileWriter()
[output.addPage(src.getPage(i)) for i in range(src.getNumPages())]
output.removeImages()
output.write(outputStream)
outputStream.close()
# Read from pdf without images
raw = parser.from_file('test2.pdf')
print(raw['content'])
Do you know how to solve this? It can be in any language.
Thanks

One approach you could try is to use a toolkit capable of parsing the text characters in the PDF then use the object properties to try and remove the unwanted map labels while keeping the text characters required.
For example, the ParsePages method from LEADTOOLS PDF toolkit (which is what I am familiar with since I work for the vendor of this toolkit) can be used to obtain the text from the PDF:
using (PDFDocument document = new PDFDocument(pdfFileName))
{
PDFParsePagesOptions options = PDFParsePagesOptions.All;
document.ParsePages(options, 1, -1);
using (StreamWriter writer = File.CreateText(txtFileName))
{
IList<PDFObject> objects = document.Pages[0].Objects;
writer.WriteLine("Objects: {0}", objects.Count);
foreach (PDFObject obj in objects)
{
if (obj.TextProperties.IsEndOfLine)
writer.WriteLine(obj.Code);
else
writer.Write(obj.Code);
}
writer.WriteLine("---------------------");
}
}
This will obtain all the text in the PDF for the first page, with the unwanted results as you mentioned. Here is an excerpt below:
Objects: 3918
5
91L
F5
4
1 LF
N
OY
L2
1AM
TService
8
26
1de l’Information
0
B09SUP AIP 213/20
7
Aéronautique
Date de publication : 05 NOV
e-mail : sia.qualite@aviation-civile.gouv.fr
Internet : www.sia.aviation-civile.gouv.fr
141
17˚
82
N20
9Objet : Création de 4 zones réglementées temporaires (ZRT) pour l’exercice VOLOPS en région de Chambéry
En vigueur : Du mercredi 25 Novembre 2020 au vendredi 04 décembre 2020
More code can be used to examine the properties for each parsed character:
writer.WriteLine(" ObjectType: {0}", obj.ObjectType.ToString());
writer.WriteLine(" Bounds: {0}, {1}, {2}, {3}", obj.Bounds.Left, obj.Bounds.Top, obj.Bounds.Right, obj.Bounds.Bottom);
writer.WriteLine(" TextProperties.FontHeight: {0}", obj.TextProperties.FontHeight.ToString());
writer.WriteLine(" TextProperties.FontIndex: {0}", obj.TextProperties.FontIndex.ToString());
writer.WriteLine(" Code: {0}", obj.Code);
writer.WriteLine("------");
This will give the properties for each character:
Objects: 3918
ObjectType: Text
Bounds: -60.952693939209, 1017.25231933594, -51.8431816101074, 1023.71826171875
TextProperties.FontHeight: 7.10454273223877
TextProperties.FontIndex: 48
Code: 5
------
Using these properties, the unwanted text might be filtered out. For example, I noticed that the FontHeight for a good portion of the unwanted text is around 7 PDF units, so the first code snippet might be altered to extract only text taller than 7.25 PDF units:
foreach (PDFObject obj in objects)
{
if (obj.TextProperties.FontHeight > 7.25)
{
if (obj.TextProperties.IsEndOfLine)
writer.WriteLine(obj.Code);
else
writer.Write(obj.Code);
}
}
The extracted output would give a better result, an excerpt follows:
Objects: 3918
Service
de l’Information
SUP AIP 213/20
Aéronautique
Date de publication : 05 NOV
e-mail : sia.qualite@aviation-civile.gouv.fr
Internet : www.sia.aviation-civile.gouv.fr
Objet : Création de 4 zones réglementées temporaires (ZRT) pour l’exercice VOLOPS en région de Chambéry
En vigueur : Du mercredi 25 Novembre 2020 au vendredi 04 décembre 2020
Lieu : FIR : Marseille LFMM - AD : Chambéry Aix-Les-Bains LFLB, Chambéry Challes les Eaux LFLE
ZRT LE SIRE, MOTTE CASTRALE, ALLEVARD
*
C
D
E
In the end, with this approach you will have to come up with good criteria for filtering out the unwanted text without removing the text you need to keep.
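The filtering idea is independent of the toolkit. Below is a minimal Python sketch of it, assuming you already have a list of parsed characters with a font height and an end-of-line flag. The TextObject class and extract_text function are hypothetical stand-ins for whatever your PDF library returns, not part of LEADTOOLS or any other real API.

```python
from dataclasses import dataclass


@dataclass
class TextObject:
    """Hypothetical stand-in for one parsed PDF character."""
    code: str
    font_height: float
    is_end_of_line: bool = False


def extract_text(objects, min_font_height=7.25):
    """Keep only characters taller than min_font_height (in PDF units)."""
    parts = []
    for obj in objects:
        if obj.font_height > min_font_height:
            parts.append(obj.code)
            if obj.is_end_of_line:
                parts.append("\n")
    return "".join(parts)


sample = [
    TextObject("5", 7.10),          # small map label, dropped
    TextObject("S", 9.0),
    TextObject("U", 9.0),
    TextObject("P", 9.0, True),
    TextObject("N", 7.0, True),     # small map label, dropped
    TextObject("A", 9.0),
]
print(extract_text(sample))
```

The threshold is a parameter because the right value depends on the document. Position (the Bounds property above) or font index could be added as further criteria in the same way.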

static class on .web project silverlight

I have had this problem for a long time but have not found a solution so far, and I hope you can help me.
I have a Silverlight application where I use WCF for queries that retrieve information from the database, for duplex communication with the clients, and alongside a socket (to receive and send information between my application and others).
To manage the duplex channels, when a client accesses a specific module of my application, I register that client in a static class that I have in my Projeto.Web project (application start), as in the code below:
//Static class
public static class ncClientes
{
private static List<IncServicoDuplex> clientesSupRecAlarme = new List<IncServicoDuplex>();
public static List<IncServicoDuplex> ClientesSupRecAlarme
{
get { return ncClientes.clientesSupRecAlarme; }
set { ncClientes.clientesSupRecAlarme = value; }
}
}
//Method called via WCF when the client accesses the module
public void VinculaCliente(string strProjeto)
{
//GetCallbackChannel - gets the communication channel between the service and the client - returns the channel instance between the service and the client.
IncServicoDuplex cliente = OperationContext.Current.GetCallbackChannel<IncServicoDuplex>();
//The lock keyword marks a statement block as a critical section: it obtains the mutual-exclusion lock for a given object,
//executes a statement and then releases the lock.
switch (strProjeto)
{
case "ncPrincipal|Alarme":
case "ncAlarme":
case "ncConfiguracao|Supervisao":
lock (ncClientes.ClientesSupRecAlarme)
{
ncClientes.ClientesSupRecAlarme.Add(cliente);
}
break;
}
}
When I make a change to the database, all online clients that are in the receiving module get that change from the WCF service through the following code:
//Method in my WCF service that broadcasts a change, insertion or deletion made by one client to the others
public void SupervisaoAlarmeOnline(ncAlarme objAlarme)
{
var varClientesAlarme = new List<IncServicoDuplex>(ncClientes.ClientesSupRecAlarme);
foreach (var item in varClientesAlarme)
{
try
{
item.SupervisaoAlarmeOnlineRetorno(objAlarme);
}
catch
{
ncClientes.ClientesSupRecAlarme.Remove(item);
}
}
}
My problem occurs when I receive information through the socket (a class located in Projeto.Web), which creates an instance of my ServicoWCF and calls the method that sends the received information to the clients. Apparently, my static variable is being reset when it is accessed from this side of the application.
Does it make any difference which side I call the service from? When I call it from the WCF client side, my static list has the correct count, but when I call it from the socket class, the count is 0.
I hope you can help me. I tried to be as clear as possible; if anything is unclear, please let me know.
Thanks in advance !

Why does the Zebra QL 220 printer shut off in the middle of my talking to it?

I've got C# CE CF code that runs on a handheld device (Motorola MC3100) which should cause the Zebra QL220 belt printer to which it is attached to print something (code appended to this post).
I turn on the QL 220 (via the big green button at its base or top, depending on your perspective) as I start my app, but the printer shuts itself off in the middle of my code executing, and so nothing is printed (I’m assuming that’s the reason nothing is printed, anyway).
If I'm right about the cause for the silence of the printer, what must I do to make its “On” button “sticky”?
I tried mashing the blue button on the QL 220, also (icon of a roller and sheet of paper being ejected from it), but all that did was spit out some of the tape/printer paper in "real time."
. . .
using (SerialPort serialPort = new SerialPort())
{
serialPort.BaudRate = 19200;
serialPort.Handshake = Handshake.XOnXOff; // Handshake AKA Flowcontrol?
serialPort.DataBits = 8;
serialPort.Parity = Parity.None;
serialPort.StopBits = StopBits.One;
serialPort.PortName = "COM1:";
serialPort.ReadTimeout = 500;
serialPort.WriteTimeout = 500;
serialPort.StopBits = StopBits.One;
serialPort.Open();
Thread.Sleep(2500); // I don't know why this is needed, or if it really is...
// Try this first:
serialPort.WriteLine("! 0 200 200 210 1");
serialPort.WriteLine("TEXT 4 0 30 40 Bonjour la Monde"); //Hola el Mundo --- Hallo die Welt
serialPort.WriteLine("FORM");
serialPort.WriteLine("PRINT");
// or (if WriteLine does not include a carriage return and line feed):
// serialPort.Write("! 0 200 200 210 1\r\n");
// serialPort.Write("TEXT 4 0 30 40 Bonjour la Monde\r\n"); //Hola el Mundo --- Hallo die Welt
// serialPort.Write("FORM\r\n");
// serialPort.Write("PRINT\r\n");
serialPort.Close();
}

Besides appending the colon to "COM1" as ctacke revealed was necessary on another SO post, I also needed to swap the WriteLine lines for Write lines with the "\r\n" appended to each line, so that they are now:
serialPort.Write("! 0 200 200 210 1\r\n");
serialPort.Write("TEXT 4 0 30 40 Bonjour la Monde\r\n"); //Hola el Mundo --- Hallo die Welt
serialPort.Write("FORM\r\n");
serialPort.Write("PRINT\r\n");
That successfully printed out "Bonjour la Monde" although with too much wasted paper (about a mile above and below the line was printed).
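The underlying point is that the printer's command language wants every command terminated by a carriage return plus line feed, whereas (as far as I recall) SerialPort.WriteLine appends SerialPort.NewLine, which defaults to a bare line feed. A small Python sketch of building such a job follows. The function name build_cpcl_job is made up for this illustration, and actually sending the bytes would need a serial library, which is left out here.

```python
def build_cpcl_job(commands):
    """Join printer commands, ending each one with CR LF, and encode as bytes."""
    return "".join(command + "\r\n" for command in commands).encode("ascii")


job = build_cpcl_job([
    "! 0 200 200 210 1",
    "TEXT 4 0 30 40 Bonjour la Monde",
    "FORM",
    "PRINT",
])
print(job)
```

Building the whole job first and writing it in one call also makes it easy to log exactly which bytes went to the printer.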

HtmlAgilityPack: scraping text between broken tags

I need to scrape a p tag that is followed by an h3 tag but has no closing p tag. It looks like this:
<script ad>asdasdasd</script>
<p>Translation companies are
-----------------------
-----------------------
<h3 class="this_class">mind blown site</h3>
There is no </p> tag, so I cannot parse it completely. I have a few questions:
1) Can this be parsed using HtmlAgilityPack XPath?
2) I have a function that finds the text between two strings (getbetween). But I have a doubt: if I use "asdasdasd" and " as the delimiters, is it 100% certain that VB.NET will match the script tag just above the h3? There are 2-3 identical "asdasdasd" lines.
3) Is there any other method you are aware of?
(I had to write this as code so the HTML does not get mangled.)
Regards,

It might be a good idea to post some more "real" html to really help you, at least the tags between the h3 and the p.
Anyway, this should get you the p-Tag from the h3-Tag.
HtmlDocument doc = new HtmlDocument();
doc.Load(... //Load the Html...
//Either of these lines will do
HtmlNode pNode = doc.DocumentNode.SelectSingleNode("//h3[@class='this_class']/preceding-sibling::p");
//HtmlNode pNode = doc.DocumentNode.SelectSingleNode("//h3[contains(text(),'mind blown site')]/preceding-sibling::p");
string pInnerHtml = pNode.NextSibling.InnerHtml; //Has the text "Translation companies are...."

So in general, to get all the nodes from the opening p tag to the start of a tag you don't want, you could do this:
var p = doc.DocumentNode.SelectSingleNode("//p");
var h3 = p.SelectSingleNode("following-sibling::h3[@class='this_class']");
var following = new List<string>();
for (var current = p.NextSibling; current != h3; current = current.NextSibling)
{
following.Add(current.InnerText);
}
var innerText = String.Concat(following);
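The same "collect everything from the opening p up to the tag you don't want" idea can be sketched without HtmlAgilityPack. Here is a rough Python version using only the standard library's event-based HTMLParser, which never needs the missing closing tag. The class name TextBetween and the function text_between are made up for this example, and the sample markup is only a guess at the real page.

```python
from html.parser import HTMLParser


class TextBetween(HTMLParser):
    """Collect text from the first <p> up to the first <h3> with a given class."""

    def __init__(self, stop_class):
        super().__init__()
        self.stop_class = stop_class
        self.collecting = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "p" and not self.collecting:
            self.collecting = True
        elif tag == "h3" and dict(attrs).get("class") == self.stop_class:
            self.collecting = False
            self.done = True

    def handle_data(self, data):
        if self.collecting:
            self.chunks.append(data)


def text_between(markup, stop_class="this_class"):
    parser = TextBetween(stop_class)
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks).strip()


page = (
    "<script>asdasdasd</script>\n"
    "<p>Translation companies are\n"
    "more text here\n"
    '<h3 class="this_class">mind blown site</h3>'
)
print(text_between(page))
```

Because the parser reacts to start tags only, the unclosed p is not a problem, and the repeated "asdasdasd" script lines are ignored since collection starts at the p tag.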

Google Text-To-Speech API

I want to know how can I use Google Text-to-Speech API in my .NET project. I think I need to call a URL to use the web service, but the idea for me is not clear. Can anyone help?

Old answer:
Try using this URL:
http://translate.google.com/translate_tts?tl=en&q=Hello%20World
It will automatically generate a wav file, which you can easily fetch with an HTTP request from any .NET code.
Edit:
Ohh Google, you thought you could prevent people from using your wonderful service with flimsy http header verification.
Here is a solution to get a response in multiple languages (I'll try to add more as we go):
NodeJS
// npm install `request`
const fs = require('fs');
const request = require('request');
const text = 'Hello World';
const options = {
url: `https://translate.google.com/translate_tts?ie=UTF-8&q=${encodeURIComponent(text)}&tl=en&client=tw-ob`,
headers: {
'Referer': 'http://translate.google.com/',
'User-Agent': 'stagefright/1.2 (Linux;Android 5.0)'
}
}
request(options)
.pipe(fs.createWriteStream('tts.mp3'))
Curl
curl 'https://translate.google.com/translate_tts?ie=UTF-8&q=Hello%20Everyone&tl=en&client=tw-ob' -H 'Referer: http://translate.google.com/' -H 'User-Agent: stagefright/1.2 (Linux;Android 5.0)' > google_tts.mp3
Note that the headers are based on @Chris Cirefice's example. If they stop working at some point, I'll attempt to recreate the conditions for this code to function. All credit for the current headers goes to him and the wonderful tool that is Wireshark. (Also, thanks to Google for not patching this.)

In an update to Schahriar SaffarShargh's answer, Google has recently implemented a 'Google abuse' feature, making it impossible to send just any regular old HTTP GET to a URL such as:
http://translate.google.com/translate_tts?tl=en&q=Hello%20World
which worked just fine and dandy previously. Now, following such a link presents you with a CAPTCHA. This also affects HTTP GET requests out-of-browser (such as with cURL), because using that URL gives a redirect to the abuse protection page (the CAPTCHA).
To start, you have to add the query parameter client to the request URL:
http://translate.google.com/translate_tts?tl=en&q=Hello%20World&client=t
Google Translate sends &client=t, so you should too.
Before you make that HTTP request, make sure that you set the Referer header:
Referer: http://translate.google.com/
Evidently, the User-Agent header is also required, but interestingly enough it can be blank:
User-Agent:
Edit: NOTE - on some user-agents, such as Android 4.X, the custom User-Agent header is not sent, meaning that Google will not service the request. In order to solve that problem, I simply set the User-Agent to a valid one, such as stagefright/1.2 (Linux;Android 5.0). Use Wireshark to debug requests (as I did) if Google's servers are not responding, and ensure that these headers are being set properly in the GET! Google will respond with a 503 Service Unavailable if the request fails, followed by a redirect to the CAPTCHA page.
This solution is a bit brittle; it is entirely possible that Google will change the way they handle these requests in the future, so in the end I would suggest asking Google to make a real API endpoint (free or paid) that we can use without feeling dirty for faking HTTP headers.
Edit 2: For those interested, this cURL command should work perfectly fine to download an mp3 of Hello in English:
curl 'http://translate.google.com/translate_tts?ie=UTF-8&q=Hello&tl=en&client=t' -H 'Referer: http://translate.google.com/' -H 'User-Agent: stagefright/1.2 (Linux;Android 5.0)' > google_tts.mp3
As you may notice, I have set both the Referer and User-Agent headers in the request, as well as added the client=t parameter to the querystring. You may use https instead of http, your choice!
Edit 3: Google now requires a token for each GET request (noted by tk in the querystring). Below is the revised cURL command that will correctly download a TTS mp3:
curl 'https://translate.google.com/translate_tts?ie=UTF-8&q=hello&tl=en&tk=995126.592330&client=t' -H 'user-agent: stagefright/1.2 (Linux;Android 5.0)' -H 'referer: https://translate.google.com/' > google_tts.mp3
Notice the &tk=995126.592330 in the querystring; this is the new token. I obtained this token by pressing the speaker icon on translate.google.com and looking at the GET request. I simply added this querystring parameter to the previous cURL command, and it works.
NOTE: obviously this solution is very frail, and breaks at the whim of the architects at Google who introduce new things like tokens required for the requests. This token may not work tomorrow (though I will check and report back)... the point is, it is not wise to rely on this method; instead, one should turn to a commercial TTS solution, especially if using TTS in production.
For further explanation of the token generation and what you might be able to do about it, see Boude's answer.
If this solution breaks any time in the future, please leave a comment on this answer so that we can attempt to find a fix for it!

Expanding on Chris' answer. I managed to reverse engineer the token generation process.
The token for the request is based on the text and a global TKK variable set in the page script. These are hashed in JavaScript thus resulting in the tk param.
Somewhere in the page script you will find something like this:
TKK='403413';
This is the number of hours elapsed since the Unix epoch.
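For reference, that value is easy to reproduce outside the browser. A tiny Python sketch follows; the function name hours_since_epoch is made up here, and the timestamp in the example was chosen so that it lands on the TKK value quoted above.

```python
import time


def hours_since_epoch(timestamp=None):
    """Whole hours elapsed since 1970-01-01 00:00 UTC."""
    if timestamp is None:
        timestamp = time.time()
    return int(timestamp // 3600)


print(hours_since_epoch(1452286800))  # 403413 * 3600 seconds
print(hours_since_epoch())            # current value
```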
The text is pumped in the following function (somewhat deobfuscated):
var query = "Hello person";
var cM = function(a) {
return function() {
return a
}
};
var of = "=";
var dM = function(a, b) {
for (var c = 0; c < b.length - 2; c += 3) {
var d = b.charAt(c + 2),
d = d >= t ? d.charCodeAt(0) - 87 : Number(d),
d = b.charAt(c + 1) == Tb ? a >>> d : a << d;
a = b.charAt(c) == Tb ? a + d & 4294967295 : a ^ d
}
return a
};
var eM = null;
var cb = 0;
var k = "";
var Vb = "+-a^+6";
var Ub = "+-3^+b+-f";
var t = "a";
var Tb = "+";
var dd = ".";
var hoursBetween = Math.floor(Date.now() / 3600000);
window.TKK = hoursBetween.toString();
fM = function(a) {
var b;
if (null === eM) {
var c = cM(String.fromCharCode(84)); // char 84 is T
b = cM(String.fromCharCode(75)); // char 75 is K
c = [c(), c()];
c[1] = b();
// So basically we're getting window.TKK
eM = Number(window[c.join(b())]) || 0
}
b = eM;
// This piece of code is used to convert d into the utf-8 encoding of a
var d = cM(String.fromCharCode(116)),
c = cM(String.fromCharCode(107)),
d = [d(), d()];
d[1] = c();
for (var c = cb + d.join(k) +
of, d = [], e = 0, f = 0; f < a.length; f++) {
var g = a.charCodeAt(f);
128 > g ? d[e++] = g : (2048 > g ? d[e++] = g >> 6 | 192 : (55296 == (g & 64512) && f + 1 < a.length && 56320 == (a.charCodeAt(f + 1) & 64512) ? (g = 65536 + ((g & 1023) << 10) + (a.charCodeAt(++f) & 1023), d[e++] = g >> 18 | 240, d[e++] = g >> 12 & 63 | 128) : d[e++] = g >> 12 | 224, d[e++] = g >> 6 & 63 | 128), d[e++] = g & 63 | 128)
}
a = b || 0;
for (e = 0; e < d.length; e++) a += d[e], a = dM(a, Vb);
a = dM(a, Ub);
0 > a && (a = (a & 2147483647) + 2147483648);
a %= 1E6;
return a.toString() + dd + (a ^ b)
};
var token = fM(query);
var url = "https://translate.google.com/translate_tts?ie=UTF-8&q=" + encodeURI(query) + "&tl=en&total=1&idx=0&textlen=12&tk=" + token + "&client=t";
document.write(url);
I managed to successfully port this to python in my fork of gTTS, so I know this works.
Edit: By now the token generation code used by gTTS has been moved into gTTS-token.
Edit 2: Google has changed the API (somewhere around 2016-05-10), this method requires some modification. I'm currently working on this. In the meantime changing the client to tw-ob seems to work.
Edit 3:
The changes are minor, yet annoying to say the least. The TKK now has two parts. Looking something like 406986.2817744745. As you can see the first part has remained the same. The second part is the sum of two seemingly random numbers. TKK=eval('((function(){var a\x3d2680116022;var b\x3d137628723;return 406986+\x27.\x27+(a+b)})())'); Here \x3d means = and \x27 is '. Both a and b change every UTC minute. At one of the final steps in the algorithm the token is XORed by the second part.
The new token generation code is:
var xr = function(a) {
return function() {
return a
}
};
var yr = function(a, b) {
for (var c = 0; c < b.length - 2; c += 3) {
var d = b.charAt(c + 2)
, d = "a" <= d ? d.charCodeAt(0) - 87 : Number(d)
, d = "+" == b.charAt(c + 1) ? a >>> d : a << d;
a = "+" == b.charAt(c) ? a + d & 4294967295 : a ^ d
}
return a
};
var zr = null;
var Ar = function(a) {
var b;
if (null !== zr)
b = zr;
else {
b = xr(String.fromCharCode(84));
var c = xr(String.fromCharCode(75));
b = [b(), b()];
b[1] = c();
b = (zr = window[b.join(c())] || "") || ""
}
var d = xr(String.fromCharCode(116))
, c = xr(String.fromCharCode(107))
, d = [d(), d()];
d[1] = c();
c = "&" + d.join("") +
"=";
d = b.split(".");
b = Number(d[0]) || 0;
for (var e = [], f = 0, g = 0; g < a.length; g++) {
var l = a.charCodeAt(g);
128 > l ? e[f++] = l : (2048 > l ? e[f++] = l >> 6 | 192 : (55296 == (l & 64512) && g + 1 < a.length && 56320 == (a.charCodeAt(g + 1) & 64512) ? (l = 65536 + ((l & 1023) << 10) + (a.charCodeAt(++g) & 1023),
e[f++] = l >> 18 | 240,
e[f++] = l >> 12 & 63 | 128) : e[f++] = l >> 12 | 224,
e[f++] = l >> 6 & 63 | 128),
e[f++] = l & 63 | 128)
}
a = b;
for (f = 0; f < e.length; f++)
a += e[f],
a = yr(a, "+-a^+6");
a = yr(a, "+-3^+b+-f");
a ^= Number(d[1]) || 0;
0 > a && (a = (a & 2147483647) + 2147483648);
a %= 1E6;
return c + (a.toString() + "." + (a ^ b))
}
;
Ar("test");
Of course I can't generate a valid url anymore, since I don't know how a and b are generated.

An additional alternative is responsivevoice.org. A simple example (originally linked as a JsFiddle) follows:
HTML
<div id="container">
<input type="text" name="text">
<button id="gspeech" class="say">Say It</button>
<audio id="player1" src="" class="speech" hidden></audio>
</div>
jQuery
$(document).ready(function(){
$('#gspeech').on('click', function(){
var text = $('input[name="text"]').val();
responsiveVoice.speak("" + text +"");
// http://responsivevoice.org/
});
});
External Resource:
https://code.responsivevoice.org/responsivevoice.js

I built the URL like this: q is the URL-encoded text and tl is the language code.
Just try this:
https://translate.google.com.vn/translate_tts?ie=UTF-8&q=%E0%A6%86%E0%A6%AE%E0%A6%BF%20%E0%A6%A4%E0%A7%8B%E0%A6%AE%E0%A6%BE%E0%A6%AF%E0%A6%BC%20%E0%A6%AD%E0%A6%BE%E0%A6%B2%E0%A7%8B%E0%A6%AC%E0%A6%BE%E0%A6%B8%E0%A6%BF+&tl=bn&client=tw-ob

All right, so Google has introduced tokens (see the tk parameter in the new URL) and the old solution doesn't seem to work. I've found an alternative, which I think even sounds better and has more voices! The command isn't pretty, but it works. Please note that this is for testing purposes only (I use it for a little home-automation project); use the real version from Acapela Group if you're planning on using this commercially.
curl $(curl --data 'MyLanguages=sonid10&MySelectedVoice=Sharon&MyTextForTTS=Hello%20World&t=1&SendToVaaS=' 'http://www.acapela-group.com/demo-tts/DemoHTML5Form_V2.php' | grep -o "http.*mp3") > tts_output.mp3
Some of the supported voices are:
Sharon
Ella (genuine child voice)
EmilioEnglish (genuine child voice)
Josh (genuine child voice)
Karen
Kenny (artificial child voice)
Laura
Micah
Nelly (artificial child voice)
Rod
Ryan
Saul
Scott (genuine teenager voice)
Tracy
ValeriaEnglish (genuine child voice)
Will
WillBadGuy (emotive voice)
WillFromAfar (emotive voice)
WillHappy (emotive voice)
WillLittleCreature (emotive voice)
WillOldMan (emotive voice)
WillSad (emotive voice)
WillUpClose (emotive voice)
It also supports multiple languages and more voices. For those, I refer you to their website: http://www.acapela-group.com/

You can download the Voice using Wget:D
wget -q -U Mozilla "http://translate.google.com/translate_tts?tl=en&q=Hello"
Save the output into a mp3 file:
wget -q -U Mozilla "http://translate.google.com/translate_tts?tl=en&q=Hello" -O hello.mp3
Enjoy !!

Google text to speech
<!DOCTYPE html>
<html>
<head>
<script>
function play(id){
var text = document.getElementById(id).value;
var url = 'http://translate.google.com/translate_tts?tl=en&q=' + encodeURIComponent(text); // encode spaces and special characters
var a = new Audio(url);
a.play();
}
</script>
</head>
<body>
<input type="text" id="text" />
<button onclick="play('text');"> Speak it </button>
</body>
</html>

Use http://www.translate.google.com/translate_tts?tl=en&q=Hello%20World
note the www.translate.google.com

As of now, Google's official Text-to-Speech service is available at https://cloud.google.com/text-to-speech/
It's free for the first 4 million characters.

I used the URL from above: http://translate.google.com/translate_tts?tl=en&q=Hello%20World
I requested it with a Python library; however, I got HTTP 403 Forbidden.
In the end I had to set the User-Agent header to a browser's one to succeed.
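A minimal Python sketch of that header fix, using only the standard library. The function name build_tts_request is made up, and the client=tw-ob parameter and the header values are simply the ones quoted in the other answers here. This is an unofficial endpoint that may stop responding at any time, so the sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_tts_request(text, lang="en"):
    """Build (but do not send) a GET request with Referer and User-Agent set."""
    query = urlencode({"ie": "UTF-8", "q": text, "tl": lang, "client": "tw-ob"})
    return Request(
        "https://translate.google.com/translate_tts?" + query,
        headers={
            "Referer": "http://translate.google.com/",
            "User-Agent": "stagefright/1.2 (Linux;Android 5.0)",
        },
    )


request = build_tts_request("Hello World")
print(request.full_url)
# To actually fetch the audio, something like the following would be needed:
#   from urllib.request import urlopen
#   with urlopen(request) as response, open("tts.mp3", "wb") as out:
#       out.write(response.read())
```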

Go to console.developer.google.com, log in, and get an API key,
or use Microsoft Bing's API:
https://msdn.microsoft.com/en-us/library/?f=255&MSPPError=-2147217396
or, even better, use AT&T's speech API at developer.att.com (a paid one).
For voice recognition:
Public Class Voice_recognition
Public Function convertTotext(ByVal path As String, ByVal output As String) As String
Dim request As HttpWebRequest = DirectCast(HttpWebRequest.Create("https://www.google.com/speech-api/v1/recognize?xjerr=1&client=speech2text&lang=en-US&maxresults=10"), HttpWebRequest)
'path = Application.StartupPath & "curinputtmp.mp3"
request.Timeout = 60000
request.Method = "POST"
request.KeepAlive = True
request.ContentType = "audio/x-flac; rate=8000"
request.UserAgent = "speech2text"
Dim fInfo As New FileInfo(path)
Dim numBytes As Long = fInfo.Length
Dim data As Byte()
Using fStream As New FileStream(path, FileMode.Open, FileAccess.Read)
data = New Byte(CInt(fStream.Length - 1)) {}
fStream.Read(data, 0, CInt(fStream.Length))
fStream.Close()
End Using
Using wrStream As Stream = request.GetRequestStream()
wrStream.Write(data, 0, data.Length)
End Using
Try
Dim response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
Dim resp = response.GetResponseStream()
If resp IsNot Nothing Then
Dim sr As New StreamReader(resp)
MessageBox.Show(sr.ReadToEnd())
resp.Close()
resp.Dispose()
End If
Catch ex As System.Exception
MessageBox.Show(ex.Message)
End Try
Return 0
End Function
End Class
And for text to speech, use this.
I think you'll understand it; if not, use a VBScript to VB/C# converter.
If that still doesn't help, contact me.
I have done this before, but I can't find the code now, which is why I'm not giving you the code directly.

Because this came up in chat here, and this page was the first Google result, I decided to share what I found after some more googling.
You really don't need to go to any great lengths to make this work anymore; simply stand on the shoulders of giants.
There is a standard:
https://dvcs.w3.org/hg/speech-api/raw-file/tip/webspeechapi.html
and an example:
http://html5-examples.craic.com/google_chrome_text_to_speech.html
At least for your web projects (e.g. ASP.NET), this should work.

#! /usr/bin/python2
# -*- coding: utf-8 -*-
def run(cmd):
    import os
    import sys
    from subprocess import Popen, PIPE
    print(cmd)
    proc=Popen(cmd, stdin=None, stdout=PIPE, stderr=None, shell=True)
    while True:
        data = proc.stdout.readline() # Alternatively proc.stdout.read(1024)
        if len(data) == 0:
            print("Finished process")
            break
        sys.stdout.write(data)
import urllib
msg='Hello preety world'
msg=urllib.quote_plus(msg)
# -v verbosity
cmd='curl '+ \
'--output tts_responsivevoice.mp2 '+ \
"\""+'https://code.responsivevoice.org/develop/getvoice.php?t='+msg+'&tl=en-US&sv=g2&vn=&pitch=0.5&rate=0.5&vol=1'+"\""+ \
' -H '+"\""+'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0'+"\""+ \
' -H '+"\""+'Accept: audio/webm,audio/ogg,audio/wav,audio/*;q=0.9,application/ogg;q=0.7,video/*;q=0.6,*/*;q=0.5'+"\""+ \
' -H '+"\""+'Accept-Language: pl,en-US;q=0.7,en;q=0.3'+"\""+ \
' -H '+"\""+'Range: bytes=0-'+"\""+ \
' -H '+"\""+'Referer: http://code.responsivevoice.org/develop/examples/example2.html'+"\""+ \
' -H '+"\""+'Cookie: __cfduid=ac862i73b6a61bf50b66713fdb4d9f62c1454856476; _ga=GA1.2.2126195996.1454856480; _gat=1'+"\""+ \
' -H '+"\""+'Connection: keep-alive'+"\""+ \
''
print('***************************')
print(cmd)
print('***************************')
run(cmd)
The line
/getvoice.php?t='+msg+'&tl=en-US&sv=g2&vn=&pitch=0.5&rate=0.5&vol=1'+"\""+ \
is responsible for the language:
tl=en-US
There is another pretty interesting site with TTS engines that can be used in this manner
(replace each 0 with o):
iv0na.c0m
Have a nice day.

The 2023 Answer:
There's a Google Text to Speech Service in Google Cloud. It is an API service. To use it, you must first enable the Text-to-Speech API in Google Console.
Next, go to APIs & Services > Credentials and create a new API Key (Create Credentials > API Key).
Finally, you can call the POST API endpoint https://texttospeech.googleapis.com/v1/text:synthesize?key=[API KEY]
{ "audioConfig": { "audioEncoding": "LINEAR16", "pitch": "0.00", "speakingRate": "1.00" }, "input": { "text": "Hello World" }, "voice": { "languageCode": "en-US", "name": "en-US-Wavenet-E" } }
Response:
{
"audioContent": "UklGRr7CAABXQVZF..."
}
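A rough Python sketch of that call, using only the standard library. The function names build_synthesize_request and decode_audio are made up for this example, and "YOUR_API_KEY" is a placeholder. The audioContent field in the response is base64-encoded audio, so it has to be decoded before it is written to a file. The request is only built here, not sent.

```python
import base64
import json
from urllib.request import Request

ENDPOINT = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_synthesize_request(text, api_key):
    """Build (but do not send) the POST request described above."""
    body = {
        "audioConfig": {"audioEncoding": "LINEAR16", "pitch": 0.0, "speakingRate": 1.0},
        "input": {"text": text},
        "voice": {"languageCode": "en-US", "name": "en-US-Wavenet-E"},
    }
    return Request(
        f"{ENDPOINT}?key={api_key}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def decode_audio(response_body):
    """Extract the raw audio bytes from the JSON response text."""
    return base64.b64decode(json.loads(response_body)["audioContent"])


request = build_synthesize_request("Hello World", "YOUR_API_KEY")
# Sending it would look roughly like:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       audio = decode_audio(response.read())
#   with open("hello.wav", "wb") as out:
#       out.write(audio)
```

With LINEAR16 the decoded bytes start with a RIFF/WAVE header, which is why the sample response above begins with "UklGR".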
