So I have some Spanish content saved in Excel that I am exporting to .csv so I can import it with the Firefox SQL manager add-on into a .sql db. The problem is that when I import it, whenever there is an accent mark (or whatever the technical name for those things is), Firefox doesn't recognize it and produces a big black diamond with a white ?. Is there a better way to do this? Is there something I can do to have my Spanish content readable in a SQL db? Maybe a better program than the Firefox extension? Please let me know if you have any thoughts or ideas. Thanks!
You need to follow the chain and make sure nothing gets lost "in translation".
Specifically:
determine which encoding is used in the CSV file; ensure that the special characters are actually in there, and see how they are encoded (UTF-8, a particular code page, ...)
ensure that the SQL server can
a) read these characters and
b) store them in an encoding which will preserve their integrity. (BTW, the encoding used in the CSV can of course be remapped to some other encoding of your choosing, i.e. one that you know will be suitable for consumption by your target application)
ensure that the database actually stored these characters correctly.
see if Firefox (or whichever "consumer" of this text) properly handles characters in this particular encoding.
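The first link of the chain can be checked mechanically. The sketch below assumes the Excel export is either UTF-8 or Windows-1252 (a common default for Excel on Western-language Windows, but only an assumption here), and `decode_csv_bytes` is a hypothetical helper name. Bytes that are not valid UTF-8 are the usual cause of the black diamond with a white ?, which is the Unicode replacement character.

```python
def decode_csv_bytes(raw, fallback="cp1252"):
    """Decode CSV bytes as UTF-8 when valid, otherwise with the fallback code page.

    The fallback is an assumption: cp1252 is only a guess at what Excel wrote.
    """
    try:
        # "utf-8-sig" also strips a leading byte order mark if one is present.
        return raw.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback


# Simulate a CSV exported in Windows-1252, then normalise it to UTF-8.
excel_export = "año,niño,José\n".encode("cp1252")
text, detected = decode_csv_bytes(excel_export)
utf8_bytes = text.encode("utf-8")
print(detected, text.strip())
```

Writing `utf8_bytes` back to disk and importing that file would remove one unknown from the chain, provided the database and the viewer are also set to UTF-8.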
It is commonplace, but useful for this type of inquiry, to recommend the following reading assignment:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
I have some .doc binary files stored in my database and I would like to now search them all (without converting them to .doc) to see which one contains the word "hello" for instance.
Is there any way to do this search in the binary file?
You could go down the route of using commercial tools. Aspose.Words can load a document from a stream and has all sorts of methods for finding text within the document.
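If you have the stream from the DB, then your code would look like this:
// Load the document directly from the stream read from the database.
Aspose.Words.Document doc = new Aspose.Words.Document(streamObjectFromDatabase);
// Case-insensitive check against the document's extracted text.
if (doc.GetText().ToLower().Contains("hello world"))
    MessageBox.Show("Hello World exists");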
Note: The benefit of this tool is that it does not require Word objects to be installed and it can work with streams in memory.
Not without a lot of pain, as far as I can tell. According to Wikipedia, Microsoft has within the past few years finally released the .doc specification. So you could create a parser based on the spec if you have the time, assuming all of your documents are in the same version of the .doc format.
Of course you could just search for the text you're looking for amid all the binary data, on the assumption that the actual text is stored as plain text. But even if that assumption were true, how could you be sure that the plain text you found was the actual document text, and not some of the document meta data that's also stored in plain text? And there's always the off chance that the binary data will match your text pattern.
If the Word libraries are available to you, I would go that route. If not, a homegrown parser may be your least bad option.
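For the "just search the binary data" route, a minimal sketch might look like the following. It assumes the text appears in the file either as 8-bit Windows-1252 or as UTF-16LE; `find_text_in_binary` is a hypothetical helper, the match is case-sensitive, and a hit can still be metadata or a coincidence, exactly as cautioned above.

```python
def find_text_in_binary(data, needle):
    """Return byte offsets where needle occurs, keyed by the encoding tried."""
    hits = {}
    for label, encoding in (("8-bit", "cp1252"), ("utf-16-le", "utf-16-le")):
        pattern = needle.encode(encoding)
        offsets = []
        start = data.find(pattern)
        while start != -1:
            offsets.append(start)
            start = data.find(pattern, start + 1)
        hits[label] = offsets
    return hits


# Synthetic stand-in for a blob read from the database (not a real .doc file).
blob = b"\x00\x01" + "hello".encode("utf-16-le") + b"\xff" + b"hello"
print(find_text_in_binary(blob, "hello"))
```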
I'm writing a quick front end to display guitar tablature. The front end is in Flash but I want to store the tab in some human-readable format. Anyone know of something that already exists? Any suggestions on how to go about it? One idea I got from reading some stackoverflow posts was to use a strict ASCII tab format like so:
e||-1------3--------------0--|----2-------0---
B||--1-----3------------1----|----3-------0---
G||---2----0----------0------|----2-------1---
D||----3---0--------2--------|----0-------2---
A||----3---2------3----------|------------2---
E||----1---3----3------------|------------0---
It has advantages. I can gain a lot of info from the structure (how many strings, their tunings, the relative placement of notes) but it is a bit verbose. I'm guessing the '-'s will compress away pretty well when sent over the wire.
If anyone knows of an existing data-format for describing guitar tab I'll take a look as well.
edit:
I should note that this format is 90% for me and may never be seen by anyone other than myself. I want an easy way to write tab files that will be displayed eventually as graphics in a Flash front-end and I don't want to have to write an editor front end.
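As a rough illustration of how much structure the strict ASCII layout carries, here is a small sketch that pulls out the tunings and note positions. `parse_ascii_tab` is a hypothetical name, and it assumes every line starts with the string name followed by `||`. Adjacent digits are read as one multi-digit fret, which is a real ambiguity of the format (57 could mean fret 5 then fret 7).

```python
def parse_ascii_tab(text):
    """Return (tunings, notes); notes are (string_index, column, fret) tuples."""
    tunings, notes = [], []
    for string_index, line in enumerate(text.strip().splitlines()):
        name, _, body = line.partition("||")
        tunings.append(name.strip())
        col = 0
        while col < len(body):
            if body[col].isdigit():
                # Consume a run of digits as a single fret number.
                start = col
                while col < len(body) and body[col].isdigit():
                    col += 1
                notes.append((string_index, start, int(body[start:col])))
            else:
                col += 1
    return tunings, notes


sample = "e||-1--3-|\nB||--12--|"
tunings, notes = parse_ascii_tab(sample)
print(tunings, notes)
```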
Check out the ASCII tab format. Also, a great description of the format is here:
http://www.howtoreadguitartabs.net/
ASCII export would be a great feature, but using ASCII as the internal data format is not a good idea. For example, note durations would be extremely hard to express (how would you store 32nds or even 16ths, not to mention triplets?), so parsing those files would be extremely difficult. Moreover, users would be tempted to load ASCII files created outside your app, which would be likely to fail.
To sum up, I'd recommend either reusing an existing format or, if that's not feasible, inventing your own. You may try to use XML for that.
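To make the duration point concrete, here is one way a home-grown XML format could carry durations explicitly. This is only a sketch: the element name `note` and the attributes `string`, `fret` and `duration` are invented for illustration and do not come from any existing tab format. Durations are stored as fractions of a whole note, so an eighth-note triplet becomes 1/12.

```python
import xml.etree.ElementTree as ET
from fractions import Fraction


def build_tab_xml(notes):
    """Serialise (string_no, fret, duration) tuples into a hypothetical XML layout."""
    root = ET.Element("tab")
    for string_no, fret, duration in notes:
        ET.SubElement(
            root,
            "note",
            string=str(string_no),
            fret=str(fret),
            duration=str(duration),  # e.g. "1/16", a fraction of a whole note
        )
    return ET.tostring(root, encoding="unicode")


xml_text = build_tab_xml([
    (1, 5, Fraction(1, 16)),   # sixteenth note
    (3, 7, Fraction(1, 32)),   # thirty-second note
    (3, 5, Fraction(1, 12)),   # one note of an eighth-note triplet
])
print(xml_text)
```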
EDIT: Besides DGuitar, I know of TuxGuitar and KGuitar, which support Guitar Pro files. You can look into their sources or ask their authors about file formats. I think there is also an open-source PowerTab-to-ASCII converter.
See Supported file formats in TuxGuitar.
TuxGuitar is open-source, multiplatform software for reading, writing and playing guitar tabs.
It supports the Guitar Pro and PowerTab formats mentioned above, and it also has its own TuxGuitar (.tg) format.
If you need the backend data structure to remain in human readable form I would probably stick it in a CDATA inside of XML. That could be inserted into a relational database with song/artist/title information and become searchable. Another option is to save it as zipped text files and insert links to those files in a database with the main artist info still searchable by sql.
These are not human readable:
Most common formats are Guitar Pro (proprietary) and PowerTab (freeware). DGuitar and TuxGuitar are open source viewers for Guitar Pro format. I'm sure that they have documentation for the format somewhere (at least in the code).
The advantage of using a common format would be the ease of making tabs with those programs.
The Guitar Pro 4 format is described here http://dguitar.sourceforge.net/GP4format.html
I wrote a quick utility for displaying tab. For personal use. You can happily take the internal format I used.
I use a very simple string based format. There are three important structures.
Column, a vertical column in the output tab - all notes played simultaneously.
Bar, a collection of Columns
Motif, a collection of Bars
A Column looks like '*:#|*:#|*:#' where each * is a string number and each # is a fret number. If you are playing a chord you separate each string:fret with a '|'
A Bar looks like '[*,*,-,*]' where each * is a Column. A - indicates an empty column where no notes are played.
A Motif is just many Bars running back to back. For instance
"[1:5,-,3:7,-,3:5,-,3:7,-,-,3:5,3:7,-,1:8,-,1:5]"
e||---------------|---------------||
B||---------------|---------------||
G||---------------|---------------||
D||--7-5-7--57----|--7-5-7--57----||
A||---------------|---------------||
E||5-----------8-5|5-----------8-5||
"[-,-,1:3|2:2|3:0|4:0|5:3|6:3,-,-][-,-,3:0|4:2|5:3|6:2,-,-]"
e||--3--|--2--||
B||--3--|--3--||
G||--0--|--2--||
D||--0--|--0--||
A||--2--|-----||
E||--3--|-----||
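A renderer for this format can be quite short. The sketch below is a reconstruction from the examples rather than the original utility: the function names are hypothetical, string 1 is taken to be the low E (inferred from the chord example), and every column is one character wide, so frets of 10 and above would break the alignment.

```python
import re

# String 1 is the low E string, inferred from the examples above.
STRING_NAMES = ["E", "A", "D", "G", "B", "e"]


def parse_motif(motif):
    """Parse a Motif into bars; each bar is a list of {string_no: fret} columns."""
    bars = []
    for bar_text in re.findall(r"\[([^\]]*)\]", motif):
        columns = []
        for col_text in bar_text.split(","):
            col_text = col_text.strip()
            notes = {}
            if col_text not in ("-", ""):
                for pair in col_text.split("|"):
                    string_no, fret = pair.split(":")
                    notes[int(string_no)] = int(fret)
            columns.append(notes)
        bars.append(columns)
    return bars


def render_motif(motif):
    """Render a Motif as six lines of ASCII tab, highest string first."""
    bars = parse_motif(motif)
    lines = []
    for string_no in range(6, 0, -1):
        rendered = [
            "".join(str(col[string_no]) if string_no in col else "-" for col in bar)
            for bar in bars
        ]
        lines.append(STRING_NAMES[string_no - 1] + "||" + "|".join(rendered) + "||")
    return lines


chords = "[-,-,1:3|2:2|3:0|4:0|5:3|6:3,-,-][-,-,3:0|4:2|5:3|6:2,-,-]"
print("\n".join(render_motif(chords)))
```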
I am trying to support arbitrary Unicode from a variety of international users. They have already put a bunch of data into sqlite databases on their iPhones, and now I want to capture the data into a database, then send it back to their device. Right now I am using a PHP page that is sending data back from an internet MySQL database. The data is saved in the MySQL database properly, but when it's sent back it comes out as escaped Unicode text, such as
Frank\u00e2\u0080\u0099s iPad
instead of just
Frank's iPad
where the apostrophe should really be a curly apostrophe.
The answer posted to another question indicates that there is no built-in Cocoa methods to convert the "\u00e2\u0080\u0099" portion of the unicode string from the webserver to an NSString object. Is this correct?
That seems really surprising (and scarily disappointing), since Cocoa definitely allows input from many different Unicode characters, and I need to support any arbitrary language that I have never heard of, and all of the possible characters. I save them to and from the local sqlite database just fine now, but once I send it to a web server, then perhaps pull down different data, I want to ensure the data pulled from the web server is correctly formatted.
[...] there is no built-in Cocoa methods to convert [...]. Is this correct?
It's not correct.
You might be interested in CFStringTransform and its capabilities. It is a full-blown ICU transformation engine, which can (also) perform your requested transformation.
See Using Objective C/Cocoa to unescape unicode characters, ie \u1234
All NSStrings are Unicode.
The problem with the “Frank\u00e2\u0080\u0099s iPad” data isn't that it's Unicode; it's that it's escaped to ASCII. “Frank’s iPad” is valid Unicode in any UTF, and is what you need.
So, you need to see whether the database is returning the data escaped or the PHP layer is escaping it at some point. If either of those is the case, fix it if you can; the PHP resource should return UTF-8/16/32. Only if that approach fails should you seek to unescape the string on the Cocoa side.
You're correct that there is no built-in way to unescape the string in Cocoa. If you get to that point, see if you can find some open-source code to do it; if not, you'll need to do it yourself, probably using NSScanner.
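Before writing an unescaper, it may help to look at what the escapes actually decode to. The sketch below is in Python rather than Cocoa, purely to inspect the data. It assumes the escapes are JSON-style and that the text was UTF-8 misread as Latin-1 somewhere on the server side. The three escapes \u00e2\u0080\u0099 are exactly the UTF-8 bytes E2 80 99 of the curly apostrophe U+2019, which suggests the string was mis-decoded once before it was escaped.

```python
import json

escaped = r"Frank\u00e2\u0080\u0099s iPad"

# Step 1: undo the ASCII escaping (assumed to follow JSON string rules).
unescaped = json.loads('"' + escaped + '"')

# Step 2: undo the assumed mis-decoding, where UTF-8 bytes were read as Latin-1.
repaired = unescaped.encode("latin-1").decode("utf-8")

print(repaired)
```

If step 2 is needed at all, the cleaner fix is on the server, as suggested above, so that the client never has to repair anything.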
Check that your web service response specifies a Content-Type and charset, and that the XML declares its encoding. In PHP you need to add the following before printing the XML:
header('Content-type: text/xml; charset=UTF-8');
print '<?xml version="1.0" encoding="UTF-8"?>';
I guess there is just no encoding specified.
My .NET library has to marshal strings to a C library that expects text encoded using the system's default ANSI code page. Since .NET supports Unicode, this makes it possible for users to pass a string to the library that doesn't properly convert to ANSI. For example, on an English machine, "デスクトップ" will turn into "?????" when passed to the C library.
To address this, I wrote a method that detects when this will happen by comparing the original string to a string converted using the ANSI code page. I'd like to test this method, but I really need a string that's guaranteed to be not encodable. For example, we test our code on English and Japanese machines (among other languages). If I write the test to use the Japanese string above, the test will fail when the Japanese system properly encodes the string. I could write the test to check the current system's encoding, but then I have a maintenance nightmare every time we add/remove a new language.
Is there a unicode character that doesn't encode with any ANSI code page? Failing that, could a string be constructed with characters from enough different code pages to guarantee failure? My first attempt was to use Chinese characters since we don't cover Chinese, but apparently Japanese can convert the Chinese characters I tried.
Edit: I'm going to accept the answer that proposes a Georgian string for now, but was really expecting a result with a smattering of characters from different languages. I don't know if we plan on supporting Georgian so it seems OK for now. Now I have to test it on each language. Joy!
There are quite a few Unicode-only languages. Georgian is one of them. Here's the word 'English' in Georgian: ინგლისური
You can find more in the Georgian file (ka.xml) of the CLDR DB.
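A quick way to sanity-check a candidate string off-device is to round-trip it through several code pages. This is a Python sketch of the same compare-after-conversion idea described in the question, not the .NET method itself. `survives_code_page` is a hypothetical name and the code page list is only a sample, so it would need extending to cover every locale under test.

```python
def survives_code_page(text, code_page):
    """True if text round-trips unchanged through the given code page."""
    encoded = text.encode(code_page, errors="replace")  # unencodable chars become "?"
    return encoded.decode(code_page, errors="replace") == text


# A sample of Windows code pages: Western European, Cyrillic, Japanese.
CODE_PAGES = ["cp1252", "cp1251", "cp932"]

georgian = "ინგლისური"
japanese = "デスクトップ"

results = {cp: survives_code_page(georgian, cp) for cp in CODE_PAGES}
print(results)
```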
If by "ANSI" you mean Windows code pages, I am pretty sure the characters outside the BMP (Basic Multilingual Plane) are not covered by any Windows code page.
For instance, try some of the Byzantine Musical Symbols.
There are Windows code pages that cover all Unicode characters (e.g. Cp1200, Cp12000, Cp65000 and Cp65001), so it's not always possible to create a string that is not convertible.
What do you mean by an 'ANSI code page'? On Windows, the code pages are Microsoft, not ANSI. ISO defines the 8859-x series of code sets; Microsoft has Windows code pages analogous to most of these.
Are you thinking of single-byte code sets? If so, you should look for Unicode characters in esoteric languages for which there is less likely to be a non-Unicode, single-byte code set.
You could look at scripts such as: Devanagari, Ol Chiki, Cherokee, Ogham.
How do you change the name of the font embedded in the .ttf file? (It's for a device that's expecting a hard coded font name that I'd like to swap w/ another more readable openly licensed font).
I'd prefer a method which I can implement myself rather than installing a program.
TrueType is a pretty complex binary data format -- the kind that takes an entire book-length spec to describe. I've worked with it in the distant past.
There are specialized tools that can edit fonts, including metadata like names. I would not recommend trying to mess with the binary data in a font file without such a tool. There might be libraries available that you could call to manipulate TrueType data; if one existed, I would guess Python would be the most likely language to find it in, because there's a long correlation between font hackers and Python (Guido van Rossum's brother is a well-known typographer.)
This may only be useful in very specific situations, but should you need to change a font's name to something else that is the exact same length, you can do so in a hex editor (e.g. Okteta). Find all the instances of the name, and then edit them to be the new name. I found there were 2 copies of the name in each place - one that's normal, and another with 0x00 in between each letter.
The only evidence I have that this actually works is empirical with a sample size of 1.
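The same hex-editor trick can be scripted. This is a sketch under the same restriction (old and new names of identical length), with `rename_font_bytes` as a hypothetical helper. The copy with 0x00 between letters is treated here as UTF-16 big-endian, which is an assumption. The sketch does not update the font's table checksums, so a consumer that verifies them might reject the result.

```python
def rename_font_bytes(data, old_name, new_name):
    """Replace every same-length occurrence of old_name in raw font bytes."""
    if len(old_name) != len(new_name):
        raise ValueError("names must have the same length to keep offsets intact")
    # Patch both the plain copy and the copy with 0x00 before each letter.
    for encoding in ("ascii", "utf-16-be"):
        data = data.replace(old_name.encode(encoding), new_name.encode(encoding))
    return data


# Synthetic bytes standing in for a .ttf file (not a real font).
sample = b"\x00\x01OldFont\xff" + "OldFont".encode("utf-16-be") + b"\x02"
patched = rename_font_bytes(sample, "OldFont", "NewFont")
print(patched)
```

With a real file, the bytes would be read with `open(path, "rb")` and the patched result written to a new file, keeping the original as a backup.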