How to scrape NON-ASCII characters alone - beautifulsoup

I am using BeautifulSoup to scrape data. The text I want to scrape is "€ 48,50", which contains a non-ASCII character (the euro sign). I would like to strip the euro sign so that the final output is "48,50". I have been getting errors because the console cannot print it. I am using Python 2.7 on Windows. I would appreciate a solution.
I do not know how to go about this. Or is there a way to extract just the non-ASCII characters?
w= item.find_all("div",{"class":"product-price"}).find("strong",
{"class":"product-price__money"}).text.replace("\\u20ac"," ")
print w

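You need to decode the byte string and pass the replace function a unicode string. Replacing the euro sign with an empty string and stripping the leftover space gives "48,50".
# -*- coding: utf-8 -*-
text = "€ 48,50"  # a UTF-8 encoded byte string in Python 2
# Decode to unicode, drop the euro sign, then trim the leading space.
w = text.decode("utf-8").replace(u"\u20ac", u"").strip()
print w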
See How to replace unicode characters in string with something else python? for more details.
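For readers on Python 3, where strings are already Unicode, a minimal sketch of the same idea might look like this. The helper names (strip_euro, keep_ascii, only_non_ascii) are made up for illustration, and the sample string stands in for the scraped .text value.

```python
# Python 3 sketch; helper names are hypothetical, not part of BeautifulSoup.

def strip_euro(price_text):
    """Remove the euro sign (U+20AC) and any surrounding whitespace."""
    return price_text.replace("\u20ac", "").strip()

def keep_ascii(text):
    """Drop every non-ASCII character, whatever it is."""
    return text.encode("ascii", "ignore").decode("ascii").strip()

def only_non_ascii(text):
    """Return just the non-ASCII characters, in their original order."""
    return "".join(ch for ch in text if ord(ch) > 127)

scraped = "\u20ac 48,50"  # stands in for the text of the scraped tag
print(strip_euro(scraped))  # 48,50
print(keep_ascii(scraped))  # 48,50
```

strip_euro targets one known symbol, while keep_ascii removes anything outside ASCII; the second is blunter and would also remove accented letters.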

Related

how to correct incoming data that contains char not in Ascii to unicode before saving to the database

I have a web service API in VB.NET that accepts a string, but I cannot control the data coming to this API. I sometimes receive characters in between words in this format (–, Á, •ï€,ââ€ï€, etc. ). Is there a way for me to handle these, or convert these characters to their correct symbols, before saving to the database?
I know that the best solution would be to go after the source where the characters get malformed, but I'll keep that as plan B.
My code already uses UTF-8 as its encoding, but what if a client that uses my API messes up and inadvertently sends malformed characters through the API? Can I clean that string and convert the malformed characters to the correct symbols?
If you only want to accept ASCII characters, you could remove non-ASCII characters by encoding and decoding the string - the default ASCII encoding uses "?" as a substitute for unrecognized characters, so you probably want to override that:
' Using System.Text
Dim input As String = "âh€eÁlâl€o¢wïo€râlâd€ï€"
Dim ascii As Encoding = Encoding.GetEncoding(
"us-ascii",
New EncoderReplacementFallback(" "),
New DecoderReplacementFallback(" ")
)
Dim bytes() As Byte = ascii.GetBytes(input)
Dim output As String = ascii.GetString(bytes)
Output:
h e l l o w o r l d
The replacement given to the En/DecoderReplacementFallback can be empty if you just want to drop the non-ASCII characters.
You could use a different encoding than ASCII if you want to accept more characters - but I would imagine that most of the characters you listed are valid in most European character sets.
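As a rough Python 3 analogue of the approach above (not the VB.NET API itself), the sketch below drops or replaces non-ASCII characters, and also tries one common repair: text whose UTF-8 bytes were wrongly decoded as Windows-1252. Both function names are hypothetical, and the repair only helps if that specific mix-up is what actually happened.

```python
# Python 3 sketch; function names are hypothetical.

def to_ascii(text, replacement=""):
    """Replace each non-ASCII character with `replacement` (empty = drop)."""
    return "".join(ch if ord(ch) < 128 else replacement for ch in text)

def repair_mojibake(text):
    """Try to undo UTF-8 bytes that were wrongly decoded as Windows-1252.

    Assumption: the damage came from exactly that mix-up. If the text
    does not round-trip cleanly, it is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

garbled = "\u00e2h\u20ace\u00c1l\u00e2l\u20aco"  # the first part of the sample input above
print(to_ascii(garbled))  # hello
```

Filtering throws information away, whereas the repair attempt keeps the intended symbol when it succeeds, so it may be worth trying the repair first and filtering only as a fallback.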
Your question is a bit vague, but here is something I think you could potentially do:
Sub Main()
Dim allowedValues = "abcdefghijklmnopqrstuvwxyz".ToCharArray()
Dim someGoodSomeBad = "###$##okay##$##"
Dim onlyGood = New String(someGoodSomeBad.ToCharArray().Where(Function(x) allowedValues.Contains(x)).ToArray)
Console.WriteLine(onlyGood)
End Sub
The first line defines the valid characters. In my example I chose lowercase letters; you could add more characters, and digits too. Basically you are creating a whitelist of acceptable characters that you, the developer, maintain.
The next line stands in for output from your API that has some good and some bad characters.
The next part is simpler than it looks. I convert the string to an array of characters, then use a lambda to keep ONLY the characters that match my whitelist. I then call ToArray on the result, because New String in .NET takes a char array.
That gives me a good string, where 'good' is defined entirely by the whitelist.
The bigger question, though, is WHY is your Web API sending garbled data over? It should be sending well-formed JSON or XML that can then be parsed and strongly typed to models. What I have shown above is more of a band-aid than a real fix to the underlying problem, and it will have MANY holes.
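The whitelist idea from this answer can be sketched in Python 3 as well. The names ALLOWED and keep_allowed are illustrative only, and the whitelist here mirrors the lowercase-letters example rather than any recommended set.

```python
import string

# Whitelist of acceptable characters; extend it with digits, spaces, etc. as needed.
ALLOWED = set(string.ascii_lowercase)

def keep_allowed(text, allowed=ALLOWED):
    """Keep only the characters that appear in the whitelist."""
    return "".join(ch for ch in text if ch in allowed)

print(keep_allowed("###$##okay##$##"))  # okay
```

Using a set makes each membership check cheap, which matters more as the whitelist grows.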

Using the Google text to speech service with spanish text

I am trying to generate a voice message from text using the following URL:
"http://translate.google.com/translate_tts?tl=es&q=La+castaña+está+muy+buena"
With python, but the urllib library fails with:
File "/usr/lib/python2.7/urllib.py", line 227, in retrieve
url = unwrap(toBytes(url))
File "/usr/lib/python2.7/urllib.py", line 1051, in toBytes
" contains non-ASCII characters")
UnicodeError: URL u'http://translate.google.com/translate_tts?tl=es&q=La+casta\xf1a+est\xe1+muy+buena' contains non-ASCII characters
How can I pass special characters (ñ, á, é, í, ó, ú), which have phonetical meaning, to the Google t2s service?
I was facing the same problem and found the solution in this question:
all I had to do was pass an additional parameter: &ie=UTF-8
That's it.
My resulting PHP looked something like:
$q = "Años más, años menos";
$data = file_get_contents('http://translate.google.com/translate_tts?tl=es&ie=UTF-8&q=' . urlencode($q));
Note the ie=UTF-8 parameter in the query string; urlencode() keeps the accented characters and spaces valid inside the URL.
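Since the question was asked from Python, here is a minimal Python 3 sketch of building the same URL with the accented characters percent-encoded as UTF-8. The function name build_tts_url is hypothetical, the sketch only builds the URL without fetching it, and whether the service still accepts such requests is not something this example verifies.

```python
from urllib.parse import urlencode

BASE_URL = "http://translate.google.com/translate_tts"  # URL taken from the question

def build_tts_url(text, lang="es"):
    """Return the request URL with every parameter percent-encoded as UTF-8."""
    params = {"tl": lang, "ie": "UTF-8", "q": text}
    return BASE_URL + "?" + urlencode(params)

url = build_tts_url("La castaña está muy buena")
print(url)
```

The resulting URL contains only ASCII characters, which is what the urllib error in the question was complaining about.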

making a list of traditional Chinese characters from a string

I am trying to estimate how often each character is used in a large sample of traditional Chinese text. I am interested in characters, not words. The file also includes punctuation and western characters.
I am reading in an example file. Here is a small subset:
首映鼓掌10分鐘 評語指不及《花樣年華》
該片在柏林首映,完場後獲全場鼓掌10分鐘。王家衛特別為該片剪輯「柏林版本
增減20處 趙本山香港戲分被刪
在柏林影展放映的《一代宗師》版本
教李小龍武功 葉問決戰散打王
另一增加的戲分是開場時葉問(梁朝偉飾)
My strategy is to read each line, split each line into a list, and go through and check each character to see if it already exists in a list or a dictionary of characters. If the character does not yet exist in my list or dictionary I will add it to that list, if it does exist in my list or dictionary, I will increase the counter for that specific character. I will probably use two lists, a list of characters, and a parallel list containing the counts. This will be more processing, but should also be much easier to code.
I have not gotten anywhere near this point yet.
I am able to read in the example file successfully. Then I am able to make a list for each line of my file. I am able to print out those individual lines into my output file and sort of reconstitute the original file, and the traditional Chinese comes out intact.
However, I run into trouble when I try to make a list of each character on a particular line.
I've read through the following article. I understood many of the comments, but unfortunately, was unable to understand enough of it to solve my problem.
How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?
My code looks like the following
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
wordfile = open('Chinese_example.txt', 'r')
output = open('Chinese_output_python.txt', 'w')
LINES = wordfile.readlines()
Through various tests I am sure the following line is not splitting the string LINES[0] into its component Chinese characters.
A_LINE = list(LINES[0])
output.write(A_LINE[0])
I mean you want to use this, from answerer 'flow' at How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator? :
from re import compile as _Re
_unicode_chr_splitter = _Re( '(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)' ).split
def split_unicode_chrs( text ):
    return [ chr for chr in _unicode_chr_splitter( text ) if chr ]
To successfully split a line of traditional Chinese characters, I just had to know the proper syntax for handling encoded characters. It is pretty basic: decode the UTF-8 bytes to unicode before splitting.
my_new_list = list(LINES[0].decode('utf8'))
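The counting strategy described in the question can be sketched in Python 3 with a single dictionary-like counter instead of two parallel lists. The function name count_cjk is hypothetical, and restricting the count to the range U+4E00 to U+9FFF is an assumption: it covers the main CJK Unified Ideographs block and skips digits, punctuation and western characters, but also skips rarer ideographs outside that block.

```python
from collections import Counter

def count_cjk(lines):
    """Count how often each CJK ideograph occurs across the given lines.

    Assumption: only U+4E00..U+9FFF counts as a Chinese character.
    """
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line if "\u4e00" <= ch <= "\u9fff")
    return counts

sample = ["首映鼓掌10分鐘", "完場後獲全場鼓掌10分鐘"]
counts = count_cjk(sample)
print(counts.most_common(3))
```

To run it on the file from the question, one could open Chinese_example.txt with encoding="utf-8" and pass the file object to count_cjk, since iterating a text file yields already decoded lines.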

Code for converting long string to pass in URL

I am trying to take a string like "Hello my name is Nick" and transform it to "Hello+my+name+is+Nick" to be passed through a URL. This could easily be done by replacing all the spaces with a + character; however, I also need to replace all special characters (. , ! &) with their ASCII values. I have searched the net but cannot find anything. Does anyone know of existing code to do this, as it is a fairly common task?
I think you're looking for this: HttpUtility.UrlEncode Method (String)
Handles non-URL compliant characters and spaces.
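For comparison, Python 3's standard library offers a similar helper. This is an analogue of the .NET method named above, not the same API, and the sample strings are only illustrative.

```python
from urllib.parse import quote_plus

# Spaces become '+', reserved characters become %XX escapes.
plain = quote_plus("Hello my name is Nick")
special = quote_plus("Fish & chips, please!")

print(plain)    # Hello+my+name+is+Nick
print(special)  # Fish+%26+chips%2C+please%21
```

The %XX values are the hexadecimal byte values of each character, which for ASCII characters matches the "ASCII values" the question asks for.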

How can I write special character in VB code

I have a SQL statement that uses special characters (e.g. ('), (/), (&)) and I don't know how to write them in my VB.NET code. Please help me. Thanks.
Find out the Unicode code point for the character (from http://www.unicode.org) and then use ChrW to convert from the code point to the character. (To put this in another string, use concatenation. I'm somewhat surprised that VB doesn't have an escape sequence, but there we go.)
For example, for the Euro sign (U+20AC) you'd write:
Dim euro as Char = ChrW(&H20AC)
The advantage of this over putting the character directly into source code is that your source code stays "just pure ASCII" - which means you won't have any strange issues with any other program trying to read it, diff it, etc. The disadvantage is that it's harder to see the symbol in the code, of course.
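The same trade-off exists in other languages. As a small Python 3 illustration (an analogue, not VB.NET), a character can be built from its code point or written as an escape, and either way the source file stays pure ASCII.

```python
euro = chr(0x20AC)            # build the character from its code point
same_euro = "\u20ac"          # the same character written as an escape sequence
price = euro + " 48,50"       # combine with other text by concatenation

print(hex(ord(euro)))         # 0x20ac
```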
The most common way seems to be to append a character of the form Chr(34)... 34 represents a double quote character. The character codes can be found from the windows program "charmap"... just windows/Run... and type charmap
If you are passing strings to be processed as a SQL statement, try doubling the characters, for example:
"SELECT * FROM MyRecords WHERE MyRecords.MyKeyField = ""With a "" Quote"" "
The '' double works with the other special characters as well.
The ' character can be doubled up to allow it into a string e.g
lSQLSTatement = "Select * from temp where name = 'fred''s'"
Will search for all records where name = fred's
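To make the quote-doubling rule concrete, here is a hedged Python 3 sketch using the standard sqlite3 module rather than VB.NET. It compares manual doubling with parameter binding, which is a different technique from the one in the answers above but avoids manual escaping altogether. The table name people and the helper quote_sql_literal are made up for the example.

```python
import sqlite3

def quote_sql_literal(value):
    """Wrap a value in single quotes, doubling any single quote inside it."""
    return "'" + value.replace("'", "''") + "'"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.execute("INSERT INTO people (name) VALUES (?)", ("fred's",))

# Manual doubling builds: SELECT name FROM people WHERE name = 'fred''s'
sql = "SELECT name FROM people WHERE name = " + quote_sql_literal("fred's")
manual = conn.execute(sql).fetchall()

# Parameter binding lets the driver handle the quote character.
bound = conn.execute("SELECT name FROM people WHERE name = ?", ("fred's",)).fetchall()

print(manual == bound)  # True
```

Both queries find the same row; binding parameters is generally the safer habit because no special character in the value can change the structure of the statement.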
Three points:
1) The example characters you've given are not special characters. They're directly available on your keyboard. Just press the corresponding key.
2) To type characters that don't have a corresponding key on the keyboard, use this:
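Alt + (the code number of the special character, typed on the numeric keypad)
For example, to type ¿, hold Alt and key in 168. Strictly speaking, 168 is the character's code in the OEM code page (for example code page 437), not an ASCII code, since ASCII only covers 0 to 127.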
You can use this method to type a special character in practically any program not just a VB.Net text editor.
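3) What you are probably looking for is what is called 'escaping' characters in a string. Note that VB.NET string literals have no backslash escapes: a double quote inside a string is written by doubling it (""), and a single quote inside a SQL string literal is likewise doubled ('').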
Chr() is probably the most popular.
ChrW() can be used if you want to generate unicode characters
The ControlChars class contains some special and 'invisible' characters, plus the quote - for example, ControlChars.Quote