Issue creating a language model for Sinhala using SRILM - voice-recognition

I'm trying to create a Sinhala voice recognition system using pocketsphinx. I use the SRILM tool to create the language model. My source files for creating the language model are here. I'm using Cygwin on Windows 8.1 to run SRILM 1.7.1. But once I run the command
ngram-count -vocab sinhalalexicon.txt -text sinhalacorpus.Train -order 3 -write sinhala.count -unk
I'm getting
iconv: Invalid or incomplete multibyte or wide character
iconv: Invalid or incomplete multibyte or wide character
What did I do wrong here? The sinhalacorpus.Train file was created manually using Notepad++.

I found the solution to my issue. Once I converted the corpus and lexicon files to Unix format and changed the encoding to UTF-8 without BOM, it worked. I used Notepad++ to make the changes.
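If there are many files to fix, the same conversion can be scripted instead of done by hand in Notepad++. Here is a minimal sketch in Python, assuming the files are already UTF-8 (possibly with a BOM) and use Windows line endings; the file names are the ones from the question:

import codecs

def to_unix_utf8(path):
    # Read raw bytes, strip a UTF-8 BOM if present,
    # and convert CRLF line endings to plain LF.
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    data = data.replace(b"\r\n", b"\n")
    with open(path, "wb") as f:
        f.write(data)

for name in ("sinhalacorpus.Train", "sinhalalexicon.txt"):
    to_unix_utf8(name)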

Related

OpenVMS: Extracting an RMS indexed file to Windows as a sequential flat file

I haven't used OpenVMS for 20+ years. It was my first OS. I've been asked if it is possible to copy the data from RMS files on an OpenVMS server to Windows as a text file, so that it's readable.
No one has experience or knowledge of the record structures etc.
The files are xyz.DAT and are relative files. I'm hoping the DAT files are fixed length.
My first attempt would be to try to use Datatrieve (DTR), but I get an error that the image isn't loaded.
I thought it might be as easy as using CONVERT/FDL = nnnn.FDL, changing the Relative to Sequential. The file still seems to be unreadable.
Is there an easy way to stream an RMS indexed file to a flat ASCII file?
I used to use COBOL and C to access the data in the past, but had lots of libraries to help...
I've noticed some solutions may use ODBC to connect, but I'm not sure what I can or cannot install on the server.
I can FTP to the server using FileZilla...
Another plan is writing a C application to read a file and output it as strings... or DCL too... it doesn't have to be quick...
Any ideas?
As mentioned before, the simple solution MIGHT be to just use: $ TYPE/OUT=test.TXT test.DAT.
This will handle Relative and Indexed files alike.
It is much the same as $ CONVERT /FDL=NL: test.DAT test.TXT
Both will just read records from the source and transfer the bytes, byte for byte, to the records in a sequential file.
FTP in ASCII mode will transfer that nicely to Windows.
You can also use an 'inline' FDL file to generate a 'unix' LF file like:
$ conv /fdl="record; format stream_lf" test.DAT test.TXT
Or CR-LF file using:
$ conv /fdl="record; format stream" test.DAT test.TXT
Both can be transferred in Binary or ASCII mode with FTP.
MOSTLY - because this really only works well for TEXT-ONLY source .DAT files.
There should be no CR, LF, FF or NUL characters in the source or things will break.
As 'habo' points out, use DUMP /RECORD=COUNT=3 to see how 'readable' the source data is.
If you spot 'binary' data using DUMP then you will need to find a record definition somewhere which maps bytes to Integers or Floating points or Dates as needed.
These definitions can be COBOL LIB files, or BASIC MAPs, and are often stored IN the CDD (Common Data Dictionary) or indeed in DATATRIEVE .DIC DICTIONARIES.
To use such a definition you likely need a program that just reads following the 'map' and writes/prints as text; a sketch follows below. Normally that's not too hard, notably not when you can find an example program on the server to tweak.
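If the data files can be pulled over to Windows in binary mode first, that kind of read-follow-the-map program doesn't have to live on the VMS box. Here is a minimal sketch in Python, assuming a purely hypothetical fixed-length record layout (a 4-byte little-endian integer, 8 raw bytes, and a 20-byte text field); the real field offsets and types must come from the COBOL/BASIC definition:

import struct

RECLEN = 32  # hypothetical fixed record length: 4 + 8 + 20 bytes

def dump_records(path):
    with open(path, "rb") as f:
        while True:
            rec = f.read(RECLEN)
            if len(rec) < RECLEN:
                break  # end of file (or a short trailing record)
            # "<i8s20s": little-endian int32, 8 raw bytes, 20 bytes of text
            num, raw, text = struct.unpack("<i8s20s", rec)
            print(num, raw.hex(), text.decode("ascii", "replace").rstrip())

dump_records("test.DAT")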
If it is just one or two 'suspect' byte ranges, then you can create a DCL loop to read and write and use F$EXTRACT to select the chunks you like.
If you want further help, kindly describe in words what kind of data is expected and perhaps provide the output from DUMP for 3 or 5 rows.
Good luck!
Hein.

dbt Error : Encountered an error: 'utf-8' codec can't decode byte 0xa0 in position 441: invalid start byte

I upgraded my dbt version to 1.0.0 yesterday night and ran a few connection tests; they went well. Now when I run my first dbt example model, I am getting the below error, even though I have not changed any code in this default example model.
I am getting the same error while running the dbt seed command for a CSV dataset. The CSV is UTF-8 encoded and has no special characters in it.
I am using Python 3.9.
Could anyone suggest what the issue is?
Below is my first dbt model SQL
After lots of back and forth, I figured out the issue. This is more of a fundamental concept issue.
Every time we execute dbt run, dbt will scan through the entire project directory (including the seeds directory, even though it is not materializing the seed) [Attached screenshot below].
If it finds any CSV, it also parses it.
In the case of the above error, I had a CSV file which looks as follows:
If we look at the highlighted line, it contains some symbol character which dbt (i.e. Python) was not able to parse, causing the above error.
This symbol was not visible earlier in Excel or Notepad++.
It could be an issue with the Snowflake Python connector, as #PeterH has pointed out.
As a temporary solution, for now we are manually removing these characters from the data file.
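To find such an invisible character without hunting in Excel or Notepad++, you can let Python report the exact byte. A minimal sketch, where seeds/my_seed.csv is a placeholder path for the CSV dbt complains about:

path = "seeds/my_seed.csv"  # placeholder; point at the offending CSV

with open(path, "rb") as f:
    data = f.read()

try:
    data.decode("utf-8")
    print("file is valid UTF-8")
except UnicodeDecodeError as e:
    # e.start is the byte offset from the dbt error message (e.g. 441)
    print("bad byte 0x%02x at offset %d" % (data[e.start], e.start))
    print("context:", data[max(e.start - 20, 0):e.start + 20])

For what it's worth, byte 0xa0 is a non-breaking space in Windows-1252/Latin-1, exactly the kind of character most editors render as an ordinary space.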
I’d leave this as a comment but I don’t have the rep yet…
This appears to be related to a recently-opened issue.
https://github.com/dbt-labs/dbt-snowflake/issues/66
Apparently it’s something to do with the snowflake python adapter.
Since you’re seeing the error from a different context, it might be helpful for you to post in that issue that you’re seeing this outside of query preview.

doxygen latex make fails for input encoding error

I have a git repo project in eclipse which I have been documenting using doxygen (v1.8.4).
If I run the latex make on a fresh clone of the project, it runs fine and the PDF is made.
However, if I then run a doxy build, which completes OK, and then attempt to run the latex make, it fails with
! Package inputenc Error: Keyboard character used is undefined
(inputenc) in inputencoding `utf8'.
See the inputenc package documentation for explanation.
Type H <return> for immediate help.
...
I have tried switching the encoding of the Doxyfile by setting DOXYFILE_ENCODING to ISO-8859-1, with no change in the result... How can I fix this? Thanks.
EDIT: I have used no non-UTF-8 chars, as far as I know, in my files; the file referenced before the error is very short and definitely doesn't have non-UTF-8 chars in it. I've even tried clearing my latex output dir and building from scratch, with no luck...
EDIT: I realised that the doxy build only appears to run correctly. It doesn't show any errors, but it should, for example, run DOT and build about 10 graphs. The console output says Running dot, but it doesn't say generating graph (n/x) like it should when it actually makes the graphs...
Short answer: By a slow process of elimination I found that this was caused by a single apostrophe in a file that had appeared to be already built and made without error!!
Long answer: Firstly, I used the project properties to flip the encoding from the default Cp1252 to UTF-8. Then I started removing files one by one, rebuilding and remaking after each removal, until the make ran successfully. I re-added all files, but deleted the content in the most recently removed file and tested the make, to confirm it was this file and only this file that caused the issue. The make ran fine. So I pasted the content back into the empty file and started deleting smaller and smaller sections of the file, again rebuilding and remaking each time, until I was left with a good make without the apostrophe and a bad one with it... I simply retyped the apostrophe (as this would then force it to be a UTF-8 char) and success!! Such an annoying bug!
Dude, you did it the hard way. Why not use Python to do the work for you:
f = open(fn, "rb")  # fn is the path of the file to scan
data = f.read()
f.close()
for i in range(len(data)):
    ch = data[i]
    if ch > 0x7F:  # non-ASCII byte (in Python 3, data[i] is an int)
        print("char: %c, idx: %d, file: %s" % (ch, i, fn))
        str2 = str(data[max(i - 30, 0):i + 30])  # surrounding bytes for context
        print("txt: %s" % (str2))

How to print SQLite to a file in UTF-8?

I've opened sqlite3.exe in the Windows console and made a database with special characters.
.dump showed me the sql query with special characters.
Then I changed output to file: .output file.sql
And executed the .dump command.
The special characters were missing when I imported the database using .read file.sql.
I used pragma encoding="UTF-8"; but it didn't change anything (I don't know if it should).
The Windows console makes it hard to use UTF-8 correctly, and the Microsoft compiler has lots of bugs that make it impossible to use UTF-8 with portable I/O functions.
If you have entered data in the Windows console, those strings are not valid UTF-8.
If a non-ASCII string is output with correct characters in the Windows console, it is not valid UTF-8.
To ensure that your data is valid UTF-8, you have to go through files.
Alternatively, use any SQLite shell that does not use the console (such as the SQLite Manager Firefox extension).
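If Python is available, you can also dump the database to a UTF-8 file without the console ever touching the text. A minimal sketch using the standard sqlite3 module, with test.db and file.sql as example names:

import sqlite3

con = sqlite3.connect("test.db")
# iterdump() yields the schema and INSERT statements,
# much like the shell's .dump command.
with open("file.sql", "w", encoding="utf-8") as f:
    for line in con.iterdump():
        f.write(line + "\n")
con.close()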
This works fine for CP852, but could be used for any codepage known to iconv.
chcp 852 >NUL
echo INSERT into NAMES (name,timestamp) VALUES ('ěščřžýáíé','1429001515'); | iconv.exe -f cp852 -t utf-8 | ..\utilities\sqlite3.exe test.db
Windows can handle Unicode internally, but if you print it on the console (with the 'echo' command, for example), the characters get mismatched. Re-encoding on the fly solves this problem.

MSI DB, Visual Basic and cp1252 encoded string problems

I have this line of code (generated by MakeMSI):
oRec.StringData(2) = "A publicitar a aplicação"
oRec is a record from the MSI database, opened with:
oInstaller = MkObject("WindowsInstaller.Installer")
oMsi = oInstaller.OpenDatabase(MsiName, msiOpenDatabaseModeDirect)
oMsi.OpenView(selectQuery)
After executing and committing, the string "A publicitar a aplicação" is converted to "A publicitar a aplicaçao" (ã is converted to a) in the database. I'm 100% sure the database is cp1252 encoded, as when I edit the field manually and insert ã it displays well. Any ideas how to work around this?
EDIT:
When building the installer on Portuguese Windows, everything is fine.
What is the code page of the computer where you edit the property?
I don't know whether VBA uses Unicode internally to store strings or not. If it does, then it should work on any computer; if it does not, then it should work correctly only where the system code page supports 'ã'.
So another part of the problem is the source file itself: to work as expected it should be Unicode-enabled (UTF-8 or UTF-16), and the interpreter should handle it this way. Otherwise, you'll get unexpected results wherever the current code page is not compatible with cp1252.
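The loss mechanism is easy to reproduce outside MSI. A minimal sketch in Python, using cp437 merely as an example of a code page that has 'ç' but lacks 'ã'; Windows' best-fit conversion presumably produces 'a' where Python shows '?':

s = "A publicitar a aplicação"

# cp1252 contains both 'ç' and 'ã', so the string round-trips intact
assert s.encode("cp1252").decode("cp1252") == s

# cp437 has no 'ã'; pushing the string through it loses the character
print(s.encode("cp437", errors="replace").decode("cp437"))
# prints: A publicitar a aplicaç?o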
Check the setting for Language for non-Unicode programs in the Regional settings in Windows. It should be set to Portuguese.