Arabic and English text in PostgreSQL database

I need to insert both English and Arabic text into a PostgreSQL database.
I'm running the following command via a .bat script:
psql.exe --echo-all --username=postgres --dbname=dbname -f populate.sql
populate.sql has statements like this:
insert into table1 (column1, column2) VALUES (2, 'المستخدم ');
If I do this via pgAdmin, it works. The thing is, I need to do this via .sql population scripts that are run once the application is started.
In that case I get gibberish, like this:
العرض
I created the scripts in Notepad++ using the "Encode in UTF-8 without BOM" option, since encoding in plain UTF-8 adds an extra character (the BOM) to the start of the file and some of the inserts are then not made.
I'm assuming this is an encoding problem, but I have yet to figure out exactly what is wrong.
The database is in UTF-8.
Thanks in advance!

The Windows console doesn't speak Unicode of any form by default; it speaks a "native" codepage, and which codepage depends on your Windows install's language settings.
If you run chcp 65001, it will switch to UTF-8.
Overall, though, text encoding handling in batch/cmd files and the Windows command line is absolutely awful. I generally recommend that you instead put anything that isn't basically 7-bit ASCII into a separate .sql file and execute it via psql -f with a suitable client_encoding; the PGCLIENTENCODING environment variable is useful for this.
So try:
SET PGCLIENTENCODING=utf-8
psql.exe --username=postgres --dbname=dbname -f populate.sql
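For example, a minimal populate.bat sketch combining both approaches (file, database and user names are taken from the question; the chcp call switches the console to UTF-8 and is optional if everything non-ASCII lives in the .sql file):
@echo off
rem Switch the console codepage to UTF-8 (optional when all non-ASCII text is in the .sql file)
chcp 65001 >NUL
rem Tell psql which encoding the script file uses
SET PGCLIENTENCODING=utf-8
psql.exe --echo-all --username=postgres --dbname=dbname -f populate.sql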

Related

Issue on creating a language model for Sinhala using SRILM

I'm trying to create a Sinhala voice recognition system using pocketsphinx. I use the SRILM tool to create the language model. My source files for creating the language model are here. I'm using Cygwin on Windows 8.1 to run SRILM 1.7.1. But once I run the command
ngram-count -vocab sinhalalexicon.txt -text sinhalacorpus.Train -order 3 -write sinhala.count -unk
I'm getting
iconv: Invalid or incomplete multibyte or wide character
iconv: Invalid or incomplete multibyte or wide character
What did I do wrong here? The sinhalacorpus.Train file was created manually using Notepad++.
I found the solution to my issue: once I converted the corpus and lexicon files to Unix format and changed the encoding to UTF-8 without BOM, it worked. I used Notepad++ to make the changes.
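In case it helps, the same cleanup can be scripted from the Cygwin shell (a sketch assuming the dos2unix and file packages are installed; dos2unix converts the line endings in place and, by default, writes no BOM):
# Convert CRLF line endings to LF and drop the UTF-8 BOM
dos2unix sinhalacorpus.Train sinhalalexicon.txt
# Check the result: the output should not mention a BOM or CRLF line terminators
file sinhalacorpus.Train sinhalalexicon.txt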

text to pdf with utf8 encoding (alternative to a2ps)

The program a2ps does not support UTF-8. At least, my version only supports the latin-X encodings:
a2ps --list=encoding
Version:
GNU a2ps 4.14
How can I convert a simple utf-8 text to postscript or pdf?
If what you actually want is to use a2ps or enscript (which is a similar tool), and your only need is to use them with some UTF-8 document, you simply have to convert your document to ISO-8859-1 or another supported encoding. Various tools allow this. For instance, here is a workflow for enscript (but you can surely do the same with a2ps):
cat document.txt | iconv -c -f utf-8 -t ISO-8859-1 | enscript -o document.ps
But you may lose some characters during the conversion because such encodings have a smaller range than UTF-8.
On the other hand, if UTF-8 is a requirement, you may instead have to look for a more recent tool that can convert UTF-8 to PDF. I wrote a Python program called txt2pdf myself; you can find it here. Also have a look at tools like pandoc, gimli, rst2pdf or wkhtmltopdf.
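For example, with pandoc the conversion is a one-liner (a sketch; the input filename is assumed, and PDF output requires a LaTeX engine such as pdflatex to be installed):
pandoc document.txt -o document.pdf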
You can use Vim. Open the file and execute the command :hardcopy > output.ps in normal mode. You can also do this directly from the shell. Executing
$ vim -c ":hardcopy > output.ps" -c ":quit" input.txt
in your shell will open Vim, generate the output.ps, and then close Vim.
Use paps! For instance, I use it as follows:
paps --font="Monospace 10" input.txt > output.ps
and I have no problems with UTF-8 encoding.
If you need a PDF file, then
ps2pdf output.ps
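The two steps can also be combined in a single pipe (a sketch; ps2pdf reads the PostScript from standard input when the input file is given as -):
paps --font="Monospace 10" input.txt | ps2pdf - output.pdf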
I've gotten acceptable results (for printing code listings) from https://github.com/arsv/u2ps
https://gitlab.com/gnomify/u2ps is the replacement of gnome-u2ps.
If the text file is small, paps can convert the text to PostScript, which can then be fed to ps2pdf. The problem is that the PS file from paps makes ps2pdf create a very big PDF file. If that is acceptable, this approach works; currently I end up with a very large PDF from paps.
There's a utility based on GNOME libraries named gnome-u2ps. It has less functionality than a2ps, and it seems that it is not maintained anymore.

How to split SQL in Mac OS X?

Is there any app for Mac to split SQL files, or even a script?
I have a large file which I have to upload to hosting that doesn't support files over 8 MB.
*I don't have SSH access
You can use this: http://www.ozerov.de/bigdump/
Or
Use this command to split the SQL file:
split -l 5000 ./path/to/mysqldump.sql ./mysqldump/dbpart-
The split command takes a file and breaks it into multiple files. The -l 5000 part tells it to split the file every five thousand lines. The next bit is the path to your file, and the next part is the path you want to save the output to. Files will be saved as whatever filename you specify (e.g. “dbpart-”) with an alphabetical letter combination appended.
Now you should be able to import your files one at a time through phpMyAdmin without issue.
More info http://www.webmaster-source.com/2011/09/26/how-to-import-a-very-large-sql-dump-with-phpmyadmin/
This tool should do the trick: MySQLDumpSplitter
It's free and open source.
Unlike the accepted answer to this question, this app will always keep extended inserts intact so the precise form of your query doesn't matter; the resulting files will always have valid SQL syntax.
Full disclosure: I am a shareholder of the company that hosts this program.
The UploadDir feature in phpMyAdmin could help you, if you have FTP access and can modify your phpMyAdmin's configuration (or are allowed to install your own instance of phpMyAdmin).
http://docs.phpmyadmin.net/en/latest/config.html?highlight=uploaddir#cfg_UploadDir
You can split into working SQL statements with:
csplit -s -f db-part db.sql "/^# Dump of table/" "{99}"
This makes up to 99 files named 'db-part[n]' from db.sql.
You can use "CREATE TABLE" or "INSERT INTO" as the split pattern instead of "# Dump of ...".
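For instance, to split at every table definition (a sketch reusing the same csplit options; "{99}" just repeats the pattern up to 99 times):
csplit -s -f db-part db.sql "/^CREATE TABLE/" "{99}"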
Also: Avoid installing any programs or uploading your data into any online service. You don't know what will be done with your information!

How to print sqlite to file in utf-8?

I've opened sqlite3.exe in the Windows console and made a database with special characters.
.dump showed me the sql query with special characters.
Then I changed output to file: .output file.sql
And executed the .dump command.
The special characters were missing when I imported the database using .read file.sql.
I used pragma encoding="UTF-8"; but it didn't change anything (I don't know if it should).
The Windows console makes it hard to use UTF-8 correctly, and the Microsoft compiler has lots of bugs that make it impossible to use UTF-8 with portable I/O functions.
If you have entered data in the Windows console, those strings are not valid UTF-8.
If a non-ASCII string is output with correct characters in the Windows console, it is not valid UTF-8.
To ensure that your data is valid UTF-8, you have to go through files.
Alternatively, use any SQLite shell that does not use the console (such as the SQLite Manager Firefox extension).
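A minimal sketch of the file-based route mentioned above (filenames are assumed; the SQL is written in an external editor and saved as UTF-8 without BOM, never typed at the console):
rem Import the UTF-8 encoded script without typing it at the console
sqlite3.exe test.db ".read inserts.sql"
rem Dump back out to a file, bypassing the console on output as well
sqlite3.exe test.db .dump > dump.sql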
This works fine for CP852, but it could be used for any codepage known to iconv.
chcp 852 >NUL
echo INSERT into NAMES (name,timestamp) VALUES ('ěščřžýáíé','1429001515'); | iconv.exe -f cp852 -t utf-8 | ..\utilities\sqlite3.exe test.db
Windows can handle Unicode internally, but if you print it on the console (with the echo command, for example) the characters get mangled. Re-encoding on the fly solves this problem.

DB upload in mysql - loses UTF characters

I'm frequently updating my db on the server, and I run the following line from the command line:
mysqldump -u root --password=mypass mydb|mysql -h mysite.cc -u remotusr --password=remotpsw remotdb
The problem is that it loses the UTF characters along the way.
How can I keep the UTF characters in cmd, or what would be a better practice for doing this?
As documented under mysqldump — A Database Backup Program:
--default-character-set=charset_name
Use charset_name as the default character set. See Section 10.5, “Character Set Configuration”. If no character set is specified, mysqldump uses utf8, and earlier versions use latin1.
[ deletia ]
--set-charset
Add SET NAMES default_character_set to the output. This option is enabled by default. To suppress the SET NAMES statement, use --skip-set-charset.
Therefore, unless you have settings in an option file which are overriding these defaults (you can specify --no-defaults to ensure they are not), the output from mysqldump should be more than capable of being redirected to another mysql session without loss of Unicode characters.
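If you want to rule the dump side out entirely, here is a sketch of the same pipeline with the character set made explicit on both ends (hostnames, credentials and database names as in the question; --no-defaults must be the first option):
mysqldump --no-defaults --default-character-set=utf8 -u root --password=mypass mydb | mysql --default-character-set=utf8 -h mysite.cc -u remotusr --password=remotpsw remotdb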
Instead, the conversion to a smaller character set appears to be occurring on retrieving & displaying your data from the new database. Since you are using PHP's Original MySQL API for this purpose, despite the warning in the introduction to its manual chapter (below), you should use mysql_set_charset() to set the connection character set.
This extension is not recommended for writing new code. Instead, either the mysqli or PDO_MySQL extension should be used. See also the MySQL API Overview for further help while choosing a MySQL API.