Rails truncate UTF-8 strings containing é (for example) - ruby-on-rails-3

I am working on a rails 3.1 app with ruby 1.9.3 and mongoid as my ORM. I am facing an annoying issue. I would like to truncate the content of a post like this:
<%= raw truncate(strip_tags(post.content), :length => 200) %>
I am using raw and strip_tags because my post.content is actually handled with a rich text editor.
I have a serious issue with non ASCII characters. Imagine my post content is the following:
éééé éééé éééé éééé éééé éééé éééé éééé
What I am doing above in a naive way does this:
éééé éééé éééé éééé éééé &eac...
Looks like truncate is seeing every word of the string as &eacute;&eacute;&eacute;&eacute;.
Is there a way to either:
Have truncate handle actual UTF-8 strings, where 'é' counts as a single character? That would be my favorite approach.
Hack the above instruction so that the result is better, for example by forcing Rails to truncate between two words.
I am asking this question because I have not found any solution so far. This is the only place in my app where I have problems with such characters, and it is a major issue since the whole content of the website is in French, so it contains a lot of é, ç, à, ù.
Also, I think this behavior is quite unfortunate for the truncate helper because in my case it does not truncate at 200 characters at all, but at approximately 25 characters!
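The output above suggests the rich text editor stored entities such as &eacute; rather than the character itself, so truncate counts eight characters for every é. A minimal sketch of decoding before truncating, written in plain Ruby so it runs outside Rails. The names decode_named_entities and truncate_chars are hypothetical, and the entity table is an assumption that only covers the characters mentioned in the question:

```ruby
# Replace a few known named entities with the character they stand for.
# The table is hypothetical and deliberately tiny; unknown entities are
# left untouched.
def decode_named_entities(text)
  table = {
    "eacute" => "é", "egrave" => "è", "agrave" => "à",
    "ccedil" => "ç", "ugrave" => "ù"
  }
  text.gsub(/&(\w+);/) { |match| table.fetch($1, match) }
end

# Cut to at most `length` characters (not bytes), omission included.
def truncate_chars(text, length, omission = "...")
  return text if text.length <= length
  text[0, length - omission.length] + omission
end

stored  = (["&eacute;" * 4] * 8).join(" ")  # what the editor might have saved
decoded = decode_named_entities(stored)     # eight words of four é each
puts truncate_chars(decoded, 20)
```

Once the entities are decoded, the length limit applies to the characters the reader actually sees.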

Probably too late to help with your issue, but...
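You can use the ActiveSupport::Multibyte::Chars limit method, which caps the string at a given number of bytes (not characters) without cutting a multibyte character in half, like so: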
post.content.mb_chars.limit(200).to_s
see http://api.rubyonrails.org/v3.1.1/classes/ActiveSupport/Multibyte/Chars.html#method-i-limit
I was having a very similar issue (truncating strings in different languages) and this worked for my case. This is after making sure the encoding is set to UTF-8 everywhere: rails config, database config and/or database table definitions, and any html templates.
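To illustrate the difference between a byte limit and a character limit, here is a rough plain-Ruby approximation of a byte cap that does not split characters. It is only a sketch, not ActiveSupport's implementation, and limit_bytes is a hypothetical name:

```ruby
# Keep at most max_bytes bytes. byteslice may cut a multibyte character in
# half; scrub("") then drops the incomplete trailing bytes.
def limit_bytes(text, max_bytes)
  return text if text.bytesize <= max_bytes
  text.byteslice(0, max_bytes).scrub("")
end

word = "éééé"
puts word.length           # number of characters
puts word.bytesize         # number of bytes in UTF-8
puts limit_bytes(word, 5)  # the fifth byte is half of an é, so it is dropped
```

With French text, a 200-byte cap therefore yields fewer than 200 characters whenever accented letters are present.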

If your string is HTML then I would suggest you check out the truncate_html gem. I've not used it with characters like this but it should be aware of where it can safely truncate the string.

There is a simple way, but it is not a nice solution. First you have to make sure the content you save is UTF-8. This might not be necessary.
content = "éééé"
content = content.dup.force_encoding('utf-8') unless content.encoding == Encoding::UTF_8
post.content = content
Then, when you read it, you can force it back:
<%= raw truncate(strip_tags(post.content.force_encoding('utf-8')), :length => 200) %>
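Note that force_encoding only changes the encoding label on the string; it does not convert any bytes. So it helps only when the bytes are already UTF-8 but tagged as something else. A small sketch of that situation, with made-up variable names:

```ruby
utf8_text = "éééé"
binary    = utf8_text.b                       # same bytes, tagged as binary
relabeled = binary.dup.force_encoding("UTF-8")

puts binary.length              # counts bytes, because the label is binary
puts relabeled.length           # counts characters again
puts relabeled.valid_encoding?
```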

I've written strings to help truncate, align, and wrap multibyte text, with support for languages written without whitespace (Japanese, Chinese, etc.):
Strings.truncate('ラドクリフ、マラソン五輪代表に1万m出場にも含み', 12)
# => "ラドクリフ…"

Related

HTML not rendering through EJS

So basically I have a bunch of HTML strings in a MySQL table and I am trying to display them through EJS.
For instance, I have a string that looks like: this is a link with some <code>code</code> next to it. In my template I try to display it this way:
<%- listOfStrings["myString"] -%>
However, as you probably guessed when reading the title, the string seems to be escaped when displaying on the screen.
What's even weirder to me is that I have two tables with such strings, and it works for the first one, while it doesn't for the second one. One difference though, is that the first one is hardcoded, while the second one can be edited through some tool on my website. Encoding is utf32_unicode_ci for both tables, if that matters.
For debugging purposes I tried to store the aforementioned strings in a JS variable and display them in the console: it then seems that the < and > characters are all escaped for some reason. Is there an explanation for this behavior, and if so, how can I fix it so that the HTML renders correctly?
Thanks for your help!
You can try this:
<%=listOfStrings["myString"]%>

How to fix this XSS in Rails

I don't know if it counts as an XSS, but it is causing errors.
I have an image_tag whose :alt attribute is supplied by the user.
However, using sanitize/h/html_escape doesn't help with this string (from OWASP):
';alert(String.fromCharCode(88,83,83))//';alert(String.fromCharCode(88,83,83))//";alert(String.fromCharCode(88,83,83))//";alert(String.fromCharCode(88,83,83))//--></SCRIPT>">'><SCRIPT>alert(String.fromCharCode(88,83,83))</SCRIPT>
when doing
:alt => (the string above)
the output of the image is messed up
Is there a way to fix this XSS?
I'm using the latest Rails and Ruby.
Since Rails 3.2.8 and thus the fix of CVE-2012-3464, the Rails escape helpers escape both double quotes and single quotes.
If you are actually using the correct version, you should be just fine.
>> ERB::Util.h '\';alert(String.fromCharCode(88,83,83))//\';alert(String.fromCharCode(88,83,83))//";alert(String.fromCharCode(88,83,83))//";alert(String.fromCharCode(88,83,83))//--></SCRIPT>">\'><SCRIPT>alert(String.fromCharCode(88,83,83))</SCRIPT>'
=> "&#x27;;alert(String.fromCharCode(88,83,83))//&#x27;;alert(String.fromCharCode(88,83,83))//&quot;;alert(String.fromCharCode(88,83,83))//&quot;;alert(String.fromCharCode(88,83,83))//--&gt;&lt;/SCRIPT&gt;&quot;&gt;&#x27;&gt;&lt;SCRIPT&gt;alert(String.fromCharCode(88,83,83))&lt;/SCRIPT&gt;"
(Note: the backslashes in the above raw string need to be there for Ruby to properly parse the string which then contains the single quotes verbatim.)
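As a quick check outside Rails, the same escaping can be tried with the ERB::Util module from Ruby's standard library. This is only a sketch: user_alt is a made-up, shortened payload, and the exact entity used for the single quote depends on your Ruby and Rails versions:

```ruby
require "erb"

# Hypothetical user-supplied alt text that tries to break out of the attribute.
user_alt = %q{'"><SCRIPT>alert(String.fromCharCode(88,83,83))</SCRIPT>}
safe_alt = ERB::Util.html_escape(user_alt)

# The tag is built by hand only to show where the escaped value ends up;
# in a Rails view, image_tag escapes attribute values itself.
puts %(<img src="photo.png" alt="#{safe_alt}">)
```

If the rendered attribute still contains a raw quote or angle bracket, the value was probably marked html_safe or passed through raw somewhere before reaching the helper.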
You can fix it by filtering ", ' and >. Escaping just these three characters is enough.

Rails 3.1: UrlEncode issues with periods (.)

I have a site where when a user searches for an artist, song or album and click search, the search results are displayed. The individual search terms are then set to be clickable, meaning each use their own paths (or routes) to generate links.
The issue I keep running into is with random weird characters showing up in some of the artist, song or album names (such as periods (.)). Is there any way to URL-encode these?
Here is my current code:
<% artists[0..5].each do |art| %>
<li><span><strong><%= link_to "#{art}", artist_path(CGI::escape(art)) %></strong></span></li>
<% end %>
Assume you have an album name like "slash@^[]/=hi?qqq=123" stored in str_to_be_encoded. There are two ways to encode it:
encoded = URI.escape str_to_be_encoded
encoded = URI.escape(str_to_be_encoded, Regexp.new("[^#{URI::PATTERN::UNRESERVED.gsub('.','')}]"))
The first one would encode to
"slash@%5E[]/=hi?qqq=123"
The second would encode to
"slash%40%5E%5B%5D%2F%3Dhi%3Fqqq%3D123"
What happens is that most URL-encoding methods do not escape characters that they consider part of a valid URL, so symbols like the equals sign and the question mark are not escaped.
The second method tells the escape function to also escape URL-legal characters, including the period, so you get a more thoroughly encoded string.
You can then append it to your url like
"http://localhost:3000/albums/1-#{encoded}"
Hope this helps.
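URI.escape was removed in Ruby 3.0, so on a current Ruby the same idea needs a different call. A possible sketch using ERB::Util.url_encode from the standard library, which leaves the period alone, followed by an explicit replacement for the period. The helper name encode_path_segment and the artist name are made up, and whether your router treats %2E differently from a literal period is an assumption worth verifying:

```ruby
require "erb"

# Percent-encode everything except unreserved characters, then encode the
# period too, since ERB::Util.url_encode does not touch it.
def encode_path_segment(name)
  ERB::Util.url_encode(name).gsub(".", "%2E")
end

puts encode_path_segment("slash@^[]/=hi?qqq=123")
puts encode_path_segment("R.E.M.")
```

In Rails, another option is to keep the period and relax the route's segment constraint instead, for example :constraints => { :id => /[^\/]+/ }.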

help to get rid of HTML special chars in database

I've migrated my site from interspire CMS to Joomla! CMS.
I've managed to migrate the whole database of articles, but some of them have a weird issue: when I access the page from Joomla, the title contains HTML entities like &rsquo;.
As you can guess from the CMS's I use, I rely on PHP as my server side, and MySql for my database.
I tried to go over the titles of the articles in the database with htmlspecialchars_decode AND html_entity_decode in order to get rid of those, but it had no effect.
If I just grab an example from the DB and echo it, it will look OK:
What’s Your Pleasure, Lasagna Or Pizza Manchester Style?
If I go to the article page in Joomla, it will look like this:
What&rsquo;s Your Pleasure, Lasagna Or Pizza Manchester Style?
When I go to phpMyAdmin to see directly what is in the database, this is the content of the title:
What&rsquo;s Your Pleasure, Lasagna Or Pizza Manchester Style?
I even tried to remove the symbol with:
str_replace("&rsquo;","",$title);
or replace it like this
str_replace('&rsquo;',"'",$title);
but nothing.
When I tried to encode it again instead of decoding it (just to see if I'm on the right DB), it worked and encoded it again...
Please, I would be glad to have any new ideas...
Thanks,
Yanipan
Try setting encoding to cp1252. This worked out for me:
$decoded = html_entity_decode($your_string, ENT_QUOTES, 'cp1252');
Probably your best bet is to do the search and replace within the database itself rather than trying to do it with PHP. Search and replace in MySQL is done like this:
update TABLE_NAME set FIELD_NAME = replace(FIELD_NAME, 'find this string', 'replace found string with this string');
So yours should look something like:
update ARTICLES set TITLE = replace(TITLE, '&rsquo;', '\'');
Give that a shot.
Need more info
What is the character encoding on your database? That & or ; may be something other than the typical ASCII character.
It's possible that PHP/Joomla is double-encoding your string. Look at the browser's page source and find the text in the produced HTML. Instead of What&rsquo;s, it might be one of the following:
What&rsquo&#59;s
What&#38;rsquo&#59;s
What&amp;rsquo;s
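To show what double encoding looks like in practice, here is a small sketch. It is in Ruby rather than PHP, purely for illustration, and decode_until_stable is a hypothetical helper. It uses the numeric form of the entity because CGI.unescapeHTML only knows a handful of named entities. Decoding repeatedly would also unescape text that was meant to stay escaped, so treat it as a diagnostic rather than a fix:

```ruby
require "cgi"

# Decode HTML entities repeatedly until the text stops changing, with a cap
# on the number of passes.
def decode_until_stable(text, max_passes = 5)
  max_passes.times do
    decoded = CGI.unescapeHTML(text)
    return decoded if decoded == text
    text = decoded
  end
  text
end

double_encoded = "What&amp;#8217;s Your Pleasure"
puts CGI.unescapeHTML(double_encoded)      # one pass still leaves an entity
puts decode_until_stable(double_encoded)   # further passes reach the character
```

If a single decode pass still leaves an entity behind, the title was encoded more than once on its way into the database.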

Can you remove the _snowman in Rails 3?

I'm building an app on Rails 3 RC. I understand the point behind the _snowman param (http://railssnowman.info/)...however, I have a search form which makes a GET request to the index. Therefore, submitting the form is creating the following query string:
?_snowman=☃&search=Box
I don't know that supporting UTF encoding is as important as a clean query string for this particular form. (Perhaps I'm just too much of a perfectionist...hehe) Is there some way to remove the _snowman param for just this form? I'd rather not convert the form to a POST request to hide the snowman, but I'd also prefer it not be in my query string. Any thoughts?
You can avoid the snowman (now a checkmark) in Rails 3 by... not using Rails for the search form. Instead of using form_tag, write your own form as outlined in:
Rails 3 UTF-8 query string showing up in URL?
Rails helpers are great unless they're not helping. Do-it-yourself is good as long as you understand the consequences, and are willing to maintain it in the future.
I believe the snowman has to be sent over the wire to ensure your data is being encoded properly, which means you can't really remove the snowman input from forms. Since it's being sent in your GET request, it has to be appended to the URL.
I suppose you could write some JavaScript to clean up the URL once the search page loads, or you could set up a redirect to the equivalent URL minus the snowman. Neither option really feels right to me.
Also, it doesn't seem there is any way to configure Rails to not output it. If you really wanted to get rid of it, you could comment out those lines in Rails' source (the committed patches at the bottom of railssnowman.info should lead you to the files and line numbers). This adds some maintenance chores for you when you upgrade Rails. Perhaps you can submit a patch to be able to turn this off?
EDIT: Looks like they just switched it to what looks like a checkmark instead of a snowman.
EDIT: Oops, back to a snowman.
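For the redirect idea, computing the equivalent URL without the snowman could look roughly like this. It is a plain-Ruby sketch using the standard uri library; strip_param and the example URL are made up, and wiring it into a before_filter or a Rack middleware is left out:

```ruby
require "uri"

# Return the URL with every occurrence of the named query parameter removed.
def strip_param(url, name)
  uri   = URI.parse(url)
  pairs = URI.decode_www_form(uri.query.to_s).reject { |key, _value| key == name }
  uri.query = pairs.empty? ? nil : URI.encode_www_form(pairs)
  uri.to_s
end

puts strip_param("http://example.com/items?_snowman=%E2%98%83&search=Box", "_snowman")
```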
In Rails 4.1 you can use the option :enforce_utf8 => false to disable the utf8 input tag.
However, I want to use this in Rails 3, so I monkey-patched my Rails. I put the following in the config/initializers directory:
# allow removing utf8 using enforce_utf8, remove after Rails 4.1
module ActionView
  module Helpers
    module FormTagHelper
      def extra_tags_for_form(html_options)
        authenticity_token = html_options.delete("authenticity_token")
        method = html_options.delete("method").to_s
        method_tag = case method
          when /^get$/i # must be case-insensitive, but can't use downcase as might be nil
            html_options["method"] = "get"
            ''
          when /^post$/i, "", nil
            html_options["method"] = "post"
            token_tag(authenticity_token)
          else
            html_options["method"] = "post"
            tag(:input, :type => "hidden", :name => "_method", :value => method) + token_tag(authenticity_token)
        end
        enforce_utf8 = html_options.delete("enforce_utf8") { true }
        tags = (enforce_utf8 ? utf8_enforcer_tag : ''.html_safe) << method_tag
        content_tag(:div, tags, :style => 'margin:0;padding:0;display:inline')
      end
    end
  end
end