Rails utf-8 problem - ruby-on-rails-3

Hi there, I'm new to Ruby (and Rails) and having some problems when using Swedish letters in strings. In my action I create an instance variable like this:
@title = "Välkommen"
And I get the following error:
invalid multibyte char (US-ASCII)
syntax error, unexpected $end, expecting keyword_end
@title = "Välkommen"
^
What's happening?
EDIT: If I add:
# coding: utf-8
at the top of my controller it works. Why is that, and how can I solve this "issue"?

See Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
To quote the part that answers this question concisely:
The Single Most Important Fact About Encodings
If you completely forget everything I just explained, please remember
one extremely important fact. It does not make sense to have a string
without knowing what encoding it uses. You can no longer stick your
head in the sand and pretend that "plain" text is ASCII.
This is why you must tell Ruby what encoding is used in your file. Since the encoding is not stored in any metadata associated with the file, software has to assume ASCII until told otherwise. Ruby 1.9 does exactly that: it reads your source file as US-ASCII until it hits the magic comment, at which point it starts over and decodes the rest of the file as UTF-8.
Obviously, if you used some other Unicode encoding or some more local encoding for your ruby file, you would need to change the comment to indicate the correct encoding.

The "magic comment" in Ruby 1.9 (on which Rails 3 is based) tells the interpreter what encoding to expect. It is important because in Ruby 1.9, every string has an encoding. Prior to 1.9, every string was just a sequence of bytes.
A very good description of the issue is in James Gray's series of blog posts on Ruby and Unicode. The one that is exactly relevant to your question is http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings (but see the others because they are very good).
The important line from the article:
The first is the main rule of source Encodings: source files receive a US-ASCII Encoding, unless you say otherwise.
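A minimal illustration of that rule in Ruby 1.9 (variable name is just for the example):

```ruby
# encoding: utf-8
# Without the magic comment above, Ruby 1.9 would parse this file as
# US-ASCII and fail on the multibyte literal below with exactly the
# "invalid multibyte char (US-ASCII)" error from the question.
title = "Välkommen"
puts title.encoding.name  # => "UTF-8"
puts title.length         # => 9 (characters, not the 10 bytes on disk)
```

Note that from Ruby 2.0 onwards the default source encoding is UTF-8, so the magic comment is only needed on 1.9.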

There are several places that can cause problems with UTF-8 encoding, but a few tricks will solve most of them:

1. Make sure that every file in your project is UTF-8 based (if you are using RadRails this is simple to accomplish: select your project, choose Properties, and in the "text-file-encoding" box select "other: utf-8").
2. Re-enter your "å, ä, ö" characters in your files, or you'll get a MySQL error because they will have been changed to a "square" (unknown character) in your database.
3. In your database.yml, set the encoding for each server environment (in this example "development" with MySQL):
   development:
     adapter: mysql
     encoding: utf8
4. Set a before filter in your application controller (application_controller.rb):
   class ApplicationController < ActionController::Base
     before_filter :set_charset
     def set_charset
       headers["Content-Type"] = "text/html; charset=utf-8"
     end
   end
5. Be sure to set the encoding to utf8 in your MySQL database for every table (I've only used MySQL, so I don't know about other databases). If you use MySQL Administrator you can do it like this: edit the table, press the "Table Options" tab, change the charset to "utf8" and the collation to "utf8_general_ci".

(Courtesy: kombatsanta)


Encoding issue in Postgres ERROR "UTF8": is it best to set the encoding to UTF8 or to make the data WIN1252 compatible?

I created a table importing a CSV file from an excel spreadsheet. When I try to run the select statement below I get the error.
test=# SELECT * FROM dt_master;
ERROR: character with byte sequence 0xc2 0x9d in encoding "UTF8" has no equivalent in encoding "WIN1252"
I have read the solution posted in this Stack Overflow post and was able to overcome the issue by setting the encoding to UTF8, so up to that point I am still able to keep working with the data. My question, however, is whether setting the encoding to UTF8 actually solves the problem, or whether it is just a workaround that will create other problems down the road, in which case I would be better off removing the conflicting characters and making the data WIN1252 compliant.
Thank you
You have a weird character in your database (Unicode code point 9D, a control character) that probably got there by mistake.
You have to set the client encoding to the encoding that your application expects; no other value will produce correct results, even if you get rid of the error. The error has a reason.
You have two choices:
Fix the data in the database. The character is very likely not what was intended.
Change the application to use LATIN1 or (better) UTF-8 internally and set the client encoding appropriately.
Using UTF-8 everywhere would have the advantage that you are safe from this kind of problem.
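If you go with fixing the data, a minimal Ruby sketch of the cleanup (the sample value is hypothetical; the real fix could just as well be an UPDATE with replace() in Postgres):

```ruby
# Hypothetical cleanup: strip the stray C1 control character U+009D
# (bytes 0xC2 0x9D in UTF-8) before the value is stored.
dirty = "widget\u009Dname"   # made-up sample value
clean = dirty.delete("\u009D")

puts clean           # => "widgetname"
puts dirty.bytesize  # => 12 (U+009D costs two bytes in UTF-8)
puts clean.bytesize  # => 10
```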

Illegal xml parsing import to sql mac roman

I have an XML file that says its encoding is UTF-8. When I use OPENXML to import data into SQL, I always get "XML parsing: line xxxxxx, character xx, illegal xml character".
Right now I can go to each line and replace the offending character with a legal one, and that works. But sometimes there may be more than 5 Mac Roman characters, and replacing them gets tedious. I am currently using Notepad++, and there is probably a way to do this there.
Can anyone suggest whether anything can be done at the SQL level, or does it have to be checked before it is run in SQL?
So far, most of the characters found are x95, x92, x96, xbc, xbd, xb0.
Thanks.
In your question, you did not specify whether the illegal characters you had to remove were Unicode or not, or whether the file was really expected to contain UTF-8 characters. Unlike ASCII, UTF-8 treats some byte combinations as illegal, so if you declare the text file to be encoded in UTF-8, you might not be able to read it successfully to the end (something that could never happen with ASCII).
So it is possible that by removing <?xml version="1.0" encoding="UTF-8"?> you just declared some non-Unicode encoding for your file (instead of the previously declared UTF-8), so reading the data passed. You did not have many foreign characters like ľťčý in the file, did you? Normally, it is a must to check what happened to them after the import. It might happen that your import passes without error, but the city name Čadca becomes äadca and somebody will thank your company for rendering his address unreadable.
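An alternative to patching characters by hand is to transcode the whole file up front. A Ruby sketch, assuming the source is really Windows-1252 (which is what the listed bytes 0x92/0x95/0x96 look like: smart quote, bullet, en dash); if it is genuinely Mac Roman, substitute "macRoman" for the encoding name:

```ruby
# Reinterpret the raw bytes under the guessed legacy encoding, then
# transcode to UTF-8 before the XML is handed to SQL.
raw = "It\x92s a test".b                 # bytes as read from the file
fixed = raw.force_encoding("Windows-1252").encode("UTF-8")

puts fixed                 # => "It’s a test" (0x92 becomes U+2019)
puts fixed.valid_encoding? # => true
```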

UTF-8 vs. ASCII-8BIT Encoding Inconsistency in Ruby on Rails

I know there are many questions about rails and encoding out there, but I haven't been able to find anything about this specific question.
I have a Ruby on Rails application using Rails 3.1.3 and running under JRuby 1.6.7. We support both English and French, and we use the I18n library/gem to accomplish this.
Sample translation file parts:
#---- config/locales/en.yml ----
en:
  button_label_verify: "Verify"
#---- config/locales/fr.yml ----
fr:
  button_label_verify: "Vérifier"
In certain cases I am getting the following encoding error:
Internal Server Error: Encoding::CompatibilityError incompatible character encodings: UTF-8 and ASCII-8BIT
Case 1:
#---- app/views/_view_page.html.erb ----
.....
<h3><%= get_button_label() %></h3>
....
#---- app/helpers/page_helper.rb ----
def get_button_label
  return I18n.t(:button_label_verify)
end
This works - there are no encoding errors and translations between French and English work just fine.
Case 2:
#---- app/views/_view_page.html.erb ----
.....
<h3><%= get_button_label() %></h3>
....
#---- app/helpers/page_helper.rb ----
def get_button_label
  return "#{I18n.t(:button_label_verify)}"
end
This, however, does not work. The only difference is that the returned value is a string with code interpolated into it, as opposed to something like
return "string " + I18n.t(:button_label_verify)
Note: the above causes no errors either; the encoding issue appears only when the computed I18n translation is inside the quotes.
Case 3:
#---- app/views/_view_page.html.erb ----
.....
<h3><%= "#{I18n.t(:button_label_verify)}" %></h3>
....
This causes no error... so the problem seems to be related to interpolating dynamic code (with French characters) into a string, on top of printing out a string returned from a helper function.
I know how to work around/fix this, but what I am wondering is whether anyone can provide some insight into why it behaves this way. Is there a good reason for it? IMO, when you get down to a low level, printing out a string is printing out a string, so I don't understand how one way causes an error and another way doesn't.
Putting
# encoding: utf-8
at the top of your files containing ASCII-extended characters should fix encoding-related issues (at least the ones coming from project files...).
I couldn't tell you why it doesn't work in a helper when using interpolation, though...
Sometimes you also need to set the $KCODE global variable for the file (this is important for Ruby 1.8 compatibility):
# encoding: UTF-8
$KCODE = 'UTF8' unless RUBY_VERSION >= '1.9'
It could also be that your files are not encoded in UTF-8. For that you need more than just the plain-text header. In Eclipse the setting is hidden under Preferences -> General -> Editors -> Spelling, and for Notepad and most Windows programs it is in the Save As dialog. The enca command is one way of checking on Linux, but I'm sure there are others. I can't count the times I have seen a file claim to be UTF-8 when it was actually in some other encoding, because UTF-8 behaves like ASCII for 7-bit characters, so you often don't notice the problem until you check the bytes in a hex editor.
Please take some time to read about file encoding:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
Why perl doesn't use UTF-8
Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!)
This is incredibly important to get right and it can save you a lot of pain later when you port to East Asian languages (you should always plan to do this!)
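For what it's worth, the error itself is easy to reproduce in plain Ruby, independent of Rails or JRuby. A minimal sketch (the binary string here is a stand-in for whatever produces the ASCII-8BIT value in the real app):

```ruby
# encoding: utf-8
utf8  = "Vérifier"        # a UTF-8 string, e.g. from I18n.t
bytes = "caf\xC3\xA9".b   # the same kind of text, but tagged ASCII-8BIT

begin
  "#{utf8} #{bytes}"      # interpolation has to concatenate the two
rescue Encoding::CompatibilityError => e
  puts e.message          # => incompatible character encodings: UTF-8 and ASCII-8BIT
end
```

The fix is the same either way: make sure both operands are tagged UTF-8 before they meet.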

Encoding issue in I/O with Jena

I'm generating some RDF files with Jena. The whole application works with utf-8 text. The source code as well is stored in utf-8.
When I print a string containing non-English characters on the console, I get the right format, e.g. Est un lieu généralement officielle assis....
Then, I use the RDF writer to output the file:
Model m = loadMyModelWithMultipleLanguages();
log.info( getSomeStringFromModel(m) );        // log4j, correct output
RDFWriter w = m.getWriter( "RDF/XML" );       // default enc: utf-8
w.setProperty("showXmlDeclaration", "true");  // optional
OutputStream out = new FileOutputStream(pathToFile);
w.write( m, out, "http://someurl.org/base/" );
// file contains garbled text
The RDF file starts with: <?xml version="1.0"?>. If I add encoding="utf-8" to the declaration, nothing changes.
By default the text should be encoded to utf-8.
The resulting RDF file validates ok, but when I open it with any editor/visualiser (vim, Firefox, etc.), non-English text is all messed up: Est un lieu gÃ©nÃ©ralement officielle assis ... or Est un lieu g\u221A\u00A9n\u221A\u00A9ralement officielle assis....
(Either way, this is obviously not acceptable from the user's viewpoint).
The same issue happens with any output format supported by Jena (RDF, NT, etc.).
I can't really find a logical explanation to this.
The official documentation doesn't seem to address this issue.
Any hint or tests I can run to figure it out?
My guess would be that your strings are messed up, and your printStringFromModel() method just happens to output them in a way that accidentally makes them display correctly, but it's rather hard to say without more information.
You're instructing Jena to include an XML declaration in the RDF/XML file, but don't say what encoding (if any) Jena declares in the XML declaration. This would be helpful to know.
You're also not showing how you're printing the strings in the printStringFromModel() method.
Also, in Firefox, go to the View menu and then to Character Encoding. What encoding is selected? If it's not UTF-8, then what happens when you select UTF-8? Do you get it to show things correctly when selecting some other encoding?
Edit: The snippet you show in your post looks fine and should work. My best guess is that the code that reads your source strings into a Jena model is broken, and reads the UTF-8 source as ISO-8859-1 or something similar. You should be able to confirm or disconfirm that by checking the length() of one of the offending strings: If each of the troublesome characters like é are counted as two, then the error is on reading; if it's correctly counted as one, then it's on writing.
My hint/answer would be to inspect the byte sequence in 3 places:

1. The data source. Using a hex editor, confirm that the é character in your source data is represented by the expected UTF-8 hex sequence 0xc3 0xa9.
2. In memory. Right after your call to printStringFromModel, put a breakpoint and inspect the bytes in the string (or convert them to hex and print them out).
3. The output file. Again, use a hex editor to confirm the byte sequence is 0xc3 0xa9.

This will tell you exactly what is happening to the bytes as they travel along the path of your program, and also where they deviate from the expected 0xc3 0xa9.
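For step 2, a quick way to dump a string's bytes as hex (shown in Ruby for brevity; in Java you would inspect the result of String.getBytes("UTF-8")):

```ruby
# Dump a string's bytes as hex to see its real encoding.
s = "é"
puts s.bytes.map { |b| format("%02x", b) }.join(" ")  # => "c3 a9"
puts s.length    # => 1 (one character when read correctly as UTF-8)
# If the source had been mis-read as ISO-8859-1, the same text would
# arrive as the two characters "Ã©" and length would be 2.
```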
The best way to address this would be to package up the smallest unit of your code that you can that demonstrates the issue, and submit a complete, runnable test case as a ticket on the Jena Jira.

How to allow right to left languages in Ruby on Rails 3

I am interested in creating a website in Hebrew using Ruby on Rails 3. The problem is that when I put Hebrew into my view, I am told that it is not supported and that I should add UTF-8.
I've been working on this for a while and I can't seem to find out how to do it. I am also using SQLite3 and I would like to save Hebrew strings there too.
How would I achieve this?
The error code I am given is:
Your template was not saved as valid UTF-8. Please either specify UTF-8 as the encoding for your template in your text editor, or mark the template with its encoding by inserting the following as the first line of the template:...
Edit:
The problem was that I was working in Notepad++, which did not save my files in UTF-8 format even though they contained UTF-8 content. Solved by changing the file encoding.
If you are using Notepad++, first set the encoding to "Encode in UTF-8" and then start coding. If you have already created/saved the file, then just changing the encoding type will not do: you will have to keep a copy of the existing code, delete the existing file, open Notepad++, set the encoding first ("Encode in UTF-8"), and then start writing/copying the code into it. This way UTF-8 encoding is ensured and you won't have to put "# encoding: UTF-8" at the top of your file.
You should try adding on the first line of your .rb files the following:
# encoding: utf-8
and on the first line of your .erb
<%# encoding: utf-8 %>
encoding: utf-8 and coding: utf-8 are equivalent.
Hope this helps.
Make sure that in your database configurations utf-8 is the default character set, and not latin1.
If you use MySQL change it in the "MySQL Server Instance Config Wizard".
EDIT: Try putting this code in your application controller:
class ApplicationController < ActionController::Base
  before_filter :set_charset
  def set_charset
    headers["Content-Type"] = "text/html; charset=utf-8"
  end
end
read more on this article: http://www.dotmana.com/?p=95
You can put
config.encoding = "utf-8"
in your config/application.rb, which is equivalent to
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8
which in turn is the equivalent to putting:
# encoding: UTF-8
or a BOM at the top of every file.
This enables UTF-8 globally for all files of the Rails app.
If you want a global option on all ruby files, you can use the -Ku ruby option and set it via the RUBYOPT environment variable, like:
export RUBYOPT=-Ku
This might be caused by the encoding of the file itself. Make sure you have set UTF-8 as the default encoding for your project in your editor/IDE preferences.
Edit:
You can check a file's encoding with:
file -I myview.erb.html
(that's a capital 'i' on OS X; on Linux the equivalent flag is a lowercase 'i').