Feed Encoding Problems Ruby 1.9 - ruby-on-rails-3

I am trying to parse RSS/Atom feeds in my Rails app, but I ran into serious problems with non-ASCII characters, e.g. the German umlauts ÄÖÜ or ß. Some feeds in the wild use proper UTF-8, but others make me cry. The general problem is:
I must be able to parse any feed, whatever encoding it has. Losing characters is not an option (though that is my current state), because I do text and language analysis on the feed items.
What I use so far:
FeedZirra for fetching and parsing the feeds, which works well so far. I also "sanitize" the values I get from FeedZirra.
HTMLEntities (gem) for unescaping special characters, like "&Auml;", which means "Ä"
rCharDet19 gem, to figure out which encoding the feed might have
string.encode! to convert from whatever it is to UTF-8
Ruby 1.9.3 (latest) and Rails 3.2.8 on Ubuntu Linux 12.04
The problem is that I literally have no idea what I'm doing wrong.
def self.sanitize_encoding_and_htmlentities str
  cd = CharDet.detect str
  s = str.encode(:invalid => :replace, :undef => :replace, :replace => '')
  coder = HTMLEntities.new
  coder.decode(s)
end
This is my current sanitize method. As a sample feed I use
http://www.N24.de/2/index.rss
So far, the "special" characters are removed completely. This is the only variant I found that works without raising an invalid-byte error. I changed the encode call slightly, because I read in the Ruby docs that without a target encoding, encode should transcode to the app's default_internal encoding, which is UTF-8 in my case. The CharDet call is only there in case I need the detected encoding later.
I used the magic_encoding gem, so every file in my project should have the encoding comment on its first line. My database is sqlite3 with UTF-8.
As of 2012, is there anything I should look at? Did I do anything really wrong?
Thanks for your help!
EDIT:
The feeds may be RSS of any kind, Atom, and/or just invalid XML. The encoding may be UTF-8, something different, or declared as "utf-8" while the content is actually some windows-XXX encoding, and so on. I really need one solution that covers all of these cases.
Fetching and parsing must also be as fast as possible, which is why I picked FeedZirra.
My current idea is to get the feed content, replace every character in the "title" and "description" nodes with HTML entities where possible, use the encode! method to switch to UTF-8, and then unescape the HTML entities. After this, special characters should be kept, I think, but I can't get something like this working at the moment. Might this be a good approach?

Finally I found the main problem:
FeedZirra already returns UTF-8 when you access entries and their attributes. But I used the sanitize method to access attributes, which returns ASCII-8BIT with the weird characters escaped as HTML entities.
So I removed all the sanitizing and encoding code, and now it just works. It seems that FeedZirra has something built in to transcode the feeds if necessary.
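
When a library does not transcode for you, the "declared as utf-8 but actually windows-XXX" case from the question can be handled with a decode-then-fallback step. The following is a minimal sketch in Python rather than Ruby, not FeedZirra's actual mechanism; the function name to_text and the fallback order (declared encoding, UTF-8, windows-1252) are assumptions made for illustration.

```python
import html


def to_text(raw: bytes, declared: str = "utf-8") -> str:
    """Decode feed bytes to str, then unescape HTML entities.

    Hypothetical helper: tries the declared encoding first, then UTF-8,
    then windows-1252, and only replaces bytes as a last resort.
    """
    for encoding in (declared, "utf-8", "windows-1252"):
        try:
            text = raw.decode(encoding)
            break
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding name: try the next candidate.
            continue
    else:
        # No candidate decoded cleanly; keep what can be kept.
        text = raw.decode("utf-8", errors="replace")
    # Turn entities such as &Auml; into the real character.
    return html.unescape(text)


print(to_text("Ärger".encode("windows-1252")))  # mislabelled feed bytes
print(to_text(b"&Auml;rger"))                   # entity-escaped text
```

The important design choice is to decode bytes exactly once and to unescape entities only afterwards, so that no characters are dropped along the way.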

Related

Scrapy: how to solve the "empty" item in html due to a foreign language symbol?

One of the items scraped with Scrapy seems to have no content in the HTML. In the MySQL database it does have content, including a non-regular dash (-) that is slightly longer than usual. It could be a dash from a Chinese input method, or something similar. I am copying it below, though I am not sure it will keep its original form. The web link is here and this non-regular dash is in the title and at the beginning of the description.
**Hospitalist – Chattanooga**
As further evidence, the CSV file exported from MySQL converts this weird dash to ?€?. Most likely this weird symbol causes the display problem.
I want to either delete this weird symbol or replace it with a comma or a regular dash. Where can that be done? During Scrapy? Or in MySQL? Sorry that this is not a specific coding question. I need some guidance before figuring out any code for this problem.
The long dash is called an em dash (reference: fileformat - EM dash).
The reason you are seeing it is likely the chosen encoding.
Try setting a different encoding, or replace the em dash with a comma as you mentioned in your question.
In PHP you can do so with the following code:
str_replace(chr(151), ',', $input);
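
Note that chr(151) is the single byte 0x97, which is the em dash only in Windows-1252; in UTF-8 text the dash is a multi-byte sequence and a single-byte replace will not match it. Since Scrapy is a Python framework, here is a minimal sketch of doing the replacement on decoded text instead. The function name normalize_dashes and the choice to map both en dash and em dash to a plain hyphen are assumptions for illustration.

```python
# Map the en dash (U+2013) and em dash (U+2014) to a plain hyphen.
DASH_TABLE = str.maketrans({"\u2013": "-", "\u2014": "-"})


def normalize_dashes(text: str) -> str:
    """Replace typographic dashes in already-decoded text."""
    return text.translate(DASH_TABLE)


print(normalize_dashes("Hospitalist \u2013 Chattanooga"))

# The byte 0x97 only means "em dash" when interpreted as Windows-1252.
print(b"\x97".decode("windows-1252") == "\u2014")
```

Doing this in the scraper, before the item reaches MySQL, keeps the stored data independent of the database connection encoding.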

Twisted.web File directory listing issues

I'm trying to use Twisted in a web app, and I've come across an interesting issue. I'm very new to Twisted, so I'm not sure whether I'm seeing a bug in Twisted or just not using it correctly.
According to the example, a File resource object can be used both to serve files from a directory and to provide the directory listing. So, assuming the variables (port, reportsDir) are defined elsewhere before the code snippet, I do the following:
rootResource = Resource()
rootResource.putChild("reports", File(reportsDir))
reactor.listenTCP(port, Site(rootResource))
reactor.run(installSignalHandlers=False)
Now, when I access '/reports' on my host, I get the message "Request did not return bytes" in my browser, along with a bunch of output that was obviously produced by Twisted and that also contains a printed u'.....' string literal, which in fact holds the directory listing. So the DirectoryLister is obviously creating the listing HTML, but something in Twisted does not accept it as valid. It doesn't seem to like the unicode string, which was in fact produced by Twisted itself.
Do I need to set some other configuration item to get it to convert the unicode string to the necessary bytes object (or whatever), or is there some other approach?
Many thanks,
-D
Well, it seems the issue is that Python promotes the result of a format operation to unicode if any of the strings involved is unicode. In my case, "reportsDir" was unicode because it came from an XML file, and that sent it down the error path.
Changing the above line:
rootResource.putChild("reports", File(reportsDir))
to:
rootResource.putChild("reports", File(reportsDir.encode('ascii', 'ignore')))
fixed the issue. I would, however, suggest that the Twisted developers check for unicode in the File constructor, or have the DirectoryLister check for unicode and, if it finds one, return the ASCII-encoded version.
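
One caveat with encode('ascii', 'ignore') is that it silently drops non-ASCII characters, so a directory name containing them would turn into a different path. The sketch below is written for Python 3 (the thread above is about Python 2 unicode strings) and uses only the standard library; the helper name as_bytes_path is hypothetical.

```python
import os


def as_bytes_path(path):
    """Return a filesystem path as bytes without dropping characters.

    os.fsencode uses the filesystem encoding instead of forcing ASCII.
    """
    return os.fsencode(path)


print(as_bytes_path("reports"))

# For comparison: 'ignore' quietly removes the accented character.
print("r\u00e9ports".encode("ascii", "ignore"))
```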

IOS JSON escaping special characters

I'm working in iOS and trying to pass some content to a web server via an NSURLRequest. On the server I have a PHP script set up to accept the request string and convert it into a JSON object using the Zend_Json framework. The issue I am having is that whenever the character "ø" appears in any part of the request parameters, the request string is cut short by one character.
Request string before going to server.
[{"description":"Blah blah","type":"Russebuss","name":"Roscoe Simulator","appVersion":"1.0.20","osVersion":"IOS 5.1","phone":"5555555","country":"Østfold","udid":"bed164974ea0d436a43f3cdee0e005a1"}]
Request string on server before any parsing
[{"description":"Blah blah","type":"Russebuss","name":"Roscoe Simulator","appVersion":"1.0.20","osVersion":"IOS 5.1","phone":"5555555","country":"Nord-Trøndelag","udid":"bed164974ea0d436a43f3cdee0e005a1"}
Everything looks exactly the same except that the final closing ] is missing. I think something goes wrong when the string is converted to UTF-8, but I'm not sure of the correct way to fix it.
Does anyone have any ideas why this is happening?
First of all, do not trust the Xcode console in such cases. You never know which encoding the console is actually using.
Second, escape the invalid characters before you build your JSON string. The easiest way is probably to make sure you use the same Unicode encoding, such as UTF-8, everywhere.
Third, if there are still invalid characters, use a JSON library with a parser (it handles the encoding). Validate the output by parsing it back to e.g. an NSString, or validate it manually with a web form like http://jsonformatter.curiousconcept.com/
The worst way is to replace the individual characters in the string, build your JSON, and convert back. One way to do this could be to replace e.g. a German ä with its Unicode representation U+00E4 (http://www.utf8-chartable.de/).
That's the way I do it. I am glad that I never needed to go further than step three, and that is the step you should do anyway to keep your code simple.
Please try to use Zend's internal JSON encoding:
Zend_Json::$useBuiltinEncoderDecoder = true;
should fix your issue.
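
A common cause of this exact symptom, offered here as an assumption since the thread does not confirm it, is sizing the request body by character count instead of by UTF-8 byte count: "ø" is one character but two bytes, so the body ends one byte early. A minimal Python sketch of the effect, using a shortened hypothetical payload:

```python
# Shortened, hypothetical payload containing one two-byte character.
payload = '[{"country":"\u00d8stfold"}]'

char_count = len(payload)            # number of characters
body = payload.encode("utf-8")       # bytes actually sent on the wire
byte_count = len(body)               # number of bytes

# Simulate a body truncated to the character count.
truncated = body[:char_count].decode("utf-8", errors="ignore")

print(char_count, byte_count)
print(truncated)
```

If this is the cause, the fix is to compute the length from the encoded data (the byte length of the UTF-8 body) rather than from the string.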

Char.ConvertFromUtf32 not available in Silverlight

I'm converting a WinForms app to Silverlight (VB.NET). What should I use instead of Char.ConvertFromUtf32 as it's not available to use in Silverlight?
UTF-32 is currently not part of Silverlight, so you have to find a way around the limitation. I think you should stop a moment and think exactly why you need to read UTF32-encoded text.
If you are reading such text from a database or a file on the server, I would perform the conversion server-side (if possible I would convert everything to UTF-8 and get rid of the UTF-32 data in one shot).
If you are parsing a user-provided file on the client side, I would detect the UTF-32 encoding and gently tell the user that the file encoding is not supported. UTF32 is pretty rare nowadays, so I guess it should not be a very common case (but I could be wrong not knowing your exact situation).
To detect the file encoding, you have to look at the first few bytes (the byte order mark). If they are not present, the task becomes much harder and involves heuristics based on character frequency.
From: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/types/how-to-convert-between-hexadecimal-strings-and-numeric-types
You can use a direct cast, like:
// Get the character corresponding to the integral value.
string stringValue = Char.ConvertFromUtf32(value);
char charValue = (char)value;
A small warning: the cast only works up to 0xFFFF. It does not work for the high Unicode range from 0x10000 to 0x10FFFF.
Also, if you need to parse \uXXXX, try this other question: How do I convert Unicode escape sequences to Unicode characters in a .NET string?
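
For code points above 0xFFFF, a single char cannot hold the value; in UTF-16 it has to be split into a surrogate pair. The arithmetic is small enough to write by hand and port to VB.NET. The following is a sketch in Python, and the function name utf16_code_units is hypothetical.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (one or two ints) for a code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point < 0x10000:
        # Fits in one 16-bit unit; this is the case a direct cast covers.
        return [code_point]
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


print([hex(u) for u in utf16_code_units(0x41)])
print([hex(u) for u in utf16_code_units(0x1F600)])
```

In .NET, the resulting one or two values would each be cast to a char and combined into a string.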

Rails ActiveRecord: Inserting text containing unprintable/weird characters

I am inserting some text scraped from the web into my database. Some of the fields in the string have unprintable/weird characters. For example,
if text is "C__O__?__P__L__E__T__E",
then the text in the database is stored only as "C__O__"
I know about h(), strip_tags(), sanitize, etc. But I do not want to sanitize this SQL. ActiveRecord logs the SQL correctly, and when run in phpMySQL the query executes correctly. Something happens between the SQL query being generated and being executed.
Help is much appreciated.
Just replace the question mark in the string with a placeholder and pass a string containing a question mark as its value. I haven't found any other way either:
["C__O__?__P__L__E__T__E", '?']
works perfectly.
Can you escape the question mark using "\?"?
Hmm, using CGI escape I found out that the character coming into the system is not what I expected. It is not a real question mark (%3F) but a different byte (%D5) that is merely displayed as a question mark.
C__%D5__M__P__L__%80___T__%80__
C__%3F__M__P__L__%3F___T__%3F__
Eventually I gsubbed out the non-printable characters before saving.
gsub(/[^[:print:]]/, '')
Only after removing the invalid characters from my string was I able to save the item properly.
None of the other solutions worked, partly because the problem was not clearly understood up front.
I know this is way late, but I ran into the same problem when we were trying to process a file as UTF-8 that actually used the ISO-8859-1 character encoding. I suspect you had a similar issue in your scraping where you assumed the wrong encoding and it ended up causing things to fail.
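
If the wrong-encoding explanation applies, decoding the scraped bytes with the right encoding keeps the characters instead of stripping them. The following is a minimal Python sketch of that idea; the helper name decode_scraped and the choice of ISO-8859-1 as the only fallback are assumptions taken from the comment above, not something verified against the original data.

```python
def decode_scraped(raw: bytes) -> str:
    """Decode as UTF-8 if valid, otherwise fall back to ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail.
        return raw.decode("iso-8859-1")


print(decode_scraped(b"C__\xd5__M"))             # invalid UTF-8, falls back
print(decode_scraped("caf\u00e9".encode("utf-8")))  # valid UTF-8, unchanged
```

The result is a proper Unicode string, which can then be written to a UTF-8 database without truncation at the first invalid byte.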