lxml clean breaks href attribute - lxml

import lxml.html.clean as clean
cleaner = clean.Cleaner(style=True, remove_tags=['div','span',], safe_attrs_only=['href',])
text = cleaner.clean_html('link')
print text
prints
link
how to get:
link
i.e href in normal encoding?

clean does the right thing -- the string in parentheses should be properly encoded, and the seemingly garbled thing is the proper encoding.
You might not know, but kyrillic domain names don't exist -- there's a complex system to map these to "allowed" characters.

Related

read_csv pandas, encoding issue

I have a csv-file with a list of keywords that I want to use for some filtering of texts.
I saved the csv-file, and tried to open it in my notebook using pd.from_csv('file.csv', encoding = 'UTF-8')
This didn't work even though I specified the encoding to this encoding type.
After some searching, I found some different encodings, I decided to go for
keywords = pd.read_csv('file.csv', encoding = 'latin1')
gets me the actual keywords, but when inspecting the words, I get that the spaces are passed as follows:
['falsification\xa0',
'détournement\xa0de\xa0subsides\xa0',
'parachutes\xa0dorés\xa0',...]
about the csv-file: it has two columns of keywords, one column in dutch, the other one in French. The issue with the spaces persists even when I use other encodings like

making a list of traditional Chinese characters from a string

I am currently trying to estimate the number of times each character is used in a large sample of traditional Chinese characters. I am interested in characters not words. The file also includes punctuation and western characters.
I am reading in an example file of traditional Chinese characters. The file contains a large sample of traditional Chinese characters. Here is a small subset:
首映鼓掌10分鐘 評語指不及《花樣年華》
該片在柏林首映,完場後獲全場鼓掌10分鐘。王家衛特別為該片剪輯「柏林版本
增減20處 趙本山香港戲分被刪
在柏林影展放映的《一代宗師》版本
教李小龍武功 葉問決戰散打王
另一增加的戲分是開場時葉問(梁朝偉飾)
My strategy is to read each line, split each line into a list, and go through and check each character to see if it already exists in a list or a dictionary of characters. If the character does not yet exist in my list or dictionary I will add it to that list, if it does exist in my list or dictionary, I will increase the counter for that specific character. I will probably use two lists, a list of characters, and a parallel list containing the counts. This will be more processing, but should also be much easier to code.
I have not gotten anywhere near this point yet.
I am able to read in the example file successfully. Then I am able to make a list for each line of my file. I am able to print out those individual lines into my output file and sort of reconstitute the original file, and the traditional Chinese comes out intact.
However, I run into trouble when I try to make a list of each character on a particular line.
I've read through the following article. I understood many of the comments, but unfortunately, was unable to understand enough of it to solve my problem.
How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?
My code looks like the following
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
wordfile = open('Chinese_example.txt', 'r')
output = open('Chinese_output_python.txt', 'w')
LINES = wordfile.readlines()
Through various tests I am sure the following line is not splitting the string LINES[0] into its component Chinese characters.
A_LINE = list(LINES[0])
output.write(A_LINE[0])
I mean you want to use this, from answerer 'flow' at How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator? :
from re import compile as _Re
_unicode_chr_splitter = _Re( '(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)' ).split
def split_unicode_chrs( text ):
return [ chr for chr in _unicode_chr_splitter( text ) if chr ]
to successfully split a line of traditional Chinese characters.. I just had to know the proper syntax to handle encoded characters.. pretty basic.
my_new_list = list(unicode(LINE[0].decode('utf8')));

Encoding issue in I/O with Jena

I'm generating some RDF files with Jena. The whole application works with utf-8 text. The source code as well is stored in utf-8.
When I print a string contaning non-English characters on the console, I get the right format, e.g. Est un lieu généralement officielle assis....
Then, I use the RDF writer to output the file:
Model m = loadMyModelWithMultipleLanguages()
log.info( getSomeStringFromModel(m) ) // log4j, correct output
RDFWriter w = m.getWriter( "RDF/XML" ) // default enc: utf-8
w.setProperty("showXmlDeclaration","true") // optional
OutputStream out = new FileOutputStream(pathToFile)
w.write( m, out, "http://someurl.org/base/" )
// file contains garbled text
The RDF file starts with: <?xml version="1.0"?>. If I add utf-8 nothing changes.
By default the text should be encoded to utf-8.
The resulting RDF file validates ok, but when I open it with any editor/visualiser (vim, Firefox, etc.), non-English text is all messed up: Est un lieu généralement officielle assis ... or Est un lieu g\u221A\u00A9n\u221A\u00A9ralement officielle assis....
(Either way, this is obviously not acceptable from the user's viewpoint).
The same issue happens with any output format supported by Jena (RDF, NT, etc.).
I can't really find a logical explanation to this.
The official documentation doesn't seem to address this issue.
Any hint or tests I can run to figure it out?
My guess would be that your strings are messed up, and your printStringFromModel() method just happens to output them in a way that accidentally makes them display correctly, but it's rather hard to say without more information.
You're instructing Jena to include an XML declaration in the RDF/XML file, but don't say what encoding (if any) Jena declares in the XML declaration. This would be helpful to know.
You're also not showing how you're printing the strings in the printStringFromModel() method.
Also, in Firefox, go to the View menu and then to Character Encoding. What encoding is selected? If it's not UTF-8, then what happens when you select UTF-8? Do you get it to show things correctly when selecting some other encoding?
Edit: The snippet you show in your post looks fine and should work. My best guess is that the code that reads your source strings into a Jena model is broken, and reads the UTF-8 source as ISO-8859-1 or something similar. You should be able to confirm or disconfirm that by checking the length() of one of the offending strings: If each of the troublesome characters like é are counted as two, then the error is on reading; if it's correctly counted as one, then it's on writing.
My hint/answer would be to inspect the byte sequence in 3 places:
The data source. Using a hex editor, confirm that the é character in your source data is represented by the expected utf-8 hex sequence 0xc3a8.
In memory. Right after your call to printStringFromModel, put a breakpoint and inspect the bytes in the string (or convert to hex and print them out.
The output file. Again, use a hex editor to inspect the byte sequence is 0xc3a8.
This will tell exactly what is happening to the bytes as they travel along the path of your program, and also where they deviate from the expected 0xc3a8.
The best way to address this would be to package up the smallest unit of your code that you can that demonstrates the issue, and submit a complete, runnable test case as a ticket on the Jena Jira.

Pound sign (#) in filename causing error

I have a very simple file upload that allows users to upload PDF files. On another page I then reference those files through an anchor tag. However, it seems that when a user upload a file that contains the pound sign (#) it breaks the anchor tag. It doesn't cause any type of Coldfusion error, it just can't find the file. If I remove the #, it works just fine. I am sure there are a number of other characters that would have this same issue.
I've tried putting URLEncodedFormat() around the file name inside the anchor but that doesn't help. The only other thing I could think of was to rename the file each time it was uploaded and remove the "#" character (and any other "bad" character).
There has got to be an easier solution. Any ideas?
If you control the file upload code try validating the string with
IsValid("url",usersFileName) or
IsValid("regex",usersFileName,"[a-zA-Z0-9]")
Otherwise if you are comfortable with regex I would suggest something like the previous posters are commenting on
REReplace(usersfilename,"[^a-zA-Z0-9]","","ALL")
These samples assume you will add the ".pdf" and only allows letters and numbers. If you need underscores or the period it would look like this...
REReplace(usersfilename,"[^a-zA-Z0-9\._]","","ALL")
I am not a regex guru, if I have one of these wrong I am sure several will jump in and correct me :)
Pound signs are not legal within filenames on the web. They are used for in-page anchor targets:
<a name="target">
So if you have file#name.pdf, the browser is actually looking for the file "file" and the internal anchor "name.pdf".
Yes, you will have to rename your files on upload.
I can't comment yet,
but Kevink's solution is good unless you need to perserve what you're replacing.
We ran into an instance where we needed to rename the filename but the filename needed to be somewhat preserved (user requirement). Simply removing special characters wasn't an option. As a result we had to handle each replace individually, something like.
<cfset newName = replace(thisFile, "##", "(pound)", "All")>
<cfset newName = replace(newName , "&", "(amp)", "All")>
<cffile action="rename"source = "#ExpandPath("\uploads\#thisFolder#\#thisFile#")#" destination = "#newName#">
Probably you would have to replace # with ## to avoid this, I think this is caused because # is figured as Coldfusion keyword.

Inserting special character in Redmine wiki page

I'm using Redmine and I'm trying to insert the special character | inside a table in a Redmine wiki page. I don't want this character to be parsed as a column separator.
I've achieved this by doing a <code>|</code> around this character, but I don't want to use the code tag, since this character will gain code attributes, namely the courier new font.
Is there a tag for displaying plain text and avoid the parsing from the Redmine wiki engine?
I'm reading the redmine wiki formatting documentation but it is very poor and points me to textile formatting which doesn't seem to include this special case.
I could not get the exclimation point to work, but this works for me.
<notextile>|</notextile>
The only way I found out to overcome this problem is to insert the HTML code for the character I want to isolate. For instance, instead of putting an underscore and make the wiki think I'm starting an italic word, I have to put the HTML code for it:
_
Example:
this is a _test - _text comment here_
Without the underscore code (_) redmine wiki engine will think that italic starts at test and this is the wrong result:
this is a test - text comment here
So, putting the ASCII code for the underscore corrects this problem. Unfortunately, this parsing is not very clever (yet I hope).
Here is a link for an ASCII code table with many symbols and characters:
http://www.ascii.cl/htmlcodes.htm