CSV encoding in the Text file input step in Kettle (Pentaho)

I am doing a simple text file input in Kettle (Pentaho PDI 8.1.0). The file is a .csv and contains several accented characters like "á".
In the settings of the Text file input step I set the encoding to ISO-8859-1, and when I click the "Show file content" button everything looks correct.
But when I press "Preview rows" to see the data separated into columns, every accented character is wrong and gets replaced with a question mark, so Mária becomes M�ria.
By "error" I do not mean that Kettle fails to run the transformation, but that the data is not correct.
Any ideas?

Your file is obviously not encoded in ISO-8859-1.
The Encoding field in the Content tab of 'Text file input' is used by the "Preview rows" button but not by the "Show file content" button.
Try another encoding.

Try the cp866 encoding, hope it helps, or you could also try latin-1.

Related

Characters not displayed correctly when reading CSV file

I have an issue when trying to read a string from a .CSV file. When I execute the application and the text is shown in a textbox, certain characters such as "é" or "ó" are shown as a question mark symbol.
The idea is that this code reads the whole CSV file and then splits each line into variables depending on the first word of the line.
The code I'm using to read is:
' Read the whole CSV file into an array of lines
Dim test() As String
test = IO.File.ReadAllLines("Libro1.csv")
' Find the line that starts with the keyword and split it into fields
Dim test_chart As String = Array.Find(test, Function(x) x.StartsWith("sample"))
Dim test_chart_div() As String = test_chart.Split(";"c)
variable1 = test_chart_div(1)
variable2 = test_chart_div(2)
...etc
I have also tried with:
Dim test() As String
test = IO.File.ReadAllLines("Libro1.csv", System.Text.Encoding.UTF8)
But neither of them works. The .csv file is supposed to be UTF-8: the "Web Options" dialog that you can see when saving the file in Excel shows the encoding as UTF-8. I also tried the trick of changing the file extension to HTML and opening it with the browser to see that the encoding is also correct.
Can someone advise anything else I can try?
Thanks in advance.
When an Excel file is exported using the CSV Comma Separated output format, the Encoding selected in Tools -> Web Options -> Encoding of Excel's Save As... dialog doesn't actually produce the expected result:
the text file is saved using the encoding associated with the language currently selected in the Excel application, not the Unicode (UTF-16LE) or UTF-8 encoding selected (which is ignored), nor the default encoding determined by the current system language.
To import the CSV file, you can use the Encoding.GetEncoding() method to specify the Name or CodePage of the Encoding used in the machine that generated the file: again, not the Encoding related to System Language, but the Encoding of the Language that the Excel Application is currently using.
CodePage 1252 (Windows-1252) and ISO-8859-1 are commonly used in the Latin-1 zone.
Based on the symbols you're referring to, one of these is most probably the original encoding used.
On Windows, use the former. ISO-8859-1 is still used, mostly in old web pages (or web pages created without care for the encoding used).
As a note, CodePage 1252 and ISO-8859-1 are not exactly the same Encoding, there are subtle differences.
If you find documentation that states the opposite, the documentation is wrong.
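As a minimal sketch, assuming the file was written with the Windows-1252 code page (substitute the name or code page the generating machine actually uses), the read in the question would become:
Dim enc As System.Text.Encoding = System.Text.Encoding.GetEncoding(1252) ' or GetEncoding("Windows-1252")
Dim test() As String = IO.File.ReadAllLines("Libro1.csv", enc)
Dim test_chart As String = Array.Find(test, Function(x) x.StartsWith("sample"))
Dim test_chart_div() As String = test_chart.Split(";"c)
If the accents then show correctly, the file was Windows-1252 all along; if not, try the code page matching the language of the Excel installation that saved it.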

Combining SQL files with the `copy` command in a batch file introduces an incorrect syntax error because it adds an invisible character `U+FEFF`

In a pre-build event, a batch file is executed to combine multiple SQL files into a single one.
It is done using this command:
COPY %#ProjectDir%\Migrations\*.sql %#ProjectDir%ContinuousDeployment\AllFilesMergedTogether.sql
Everything appears to work fine, but somehow the result gives an incorrect syntax error.
After two hours of investigation, it turned out the issue is caused by an invisible character that remains invisible even in Notepad++.
Using an online website, the character was spotted: it is U+FEFF.
Here are the two input scripts.
PRINT 'Script1'
PRINT 'Script2'
Here is the output given by the copy command.
PRINT 'Script1'
PRINT 'Script2'
Additional info:
Batch file is encoded with UTF-8
Input files are encoded with UTF-8-BOM
Output file is encoded with UTF-8-BOM.
I'm not sure it is possible to change the output encoding of the copy command.
I've tried and failed.
What should be done to eradicate this extremely frustrating parasitic character?
It turned out that changing the encoding of the input files to ANSI fixes the issue.
No more pesky character(s).
Also, doing so changes the encoding of the result file to UTF-8 instead of UTF-8-BOM, which is great, I believe.
The encoding can be changed in Notepad++ via the Encoding menu.
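If you would rather keep the input files as UTF-8-BOM, another option is to replace the COPY command in the pre-build event with a small tool that merges the files and drops the BOMs itself. Here is a minimal sketch in VB.NET; the folder and file names are placeholders, not the real project layout:
Imports System.IO
Imports System.Text

Module MergeSql
    Sub Main()
        ' Placeholder paths: adapt to the Migrations and ContinuousDeployment folders of the project.
        Dim parts As String() = Directory.GetFiles("Migrations", "*.sql")
        Array.Sort(parts)
        ' UTF8Encoding(False) writes UTF-8 without a BOM.
        Using output As New StreamWriter("ContinuousDeployment\AllFilesMergedTogether.sql", False, New UTF8Encoding(False))
            For Each part As String In parts
                ' File.ReadAllText consumes a leading BOM, so it never reaches the merged output.
                output.WriteLine(File.ReadAllText(part, Encoding.UTF8))
            Next
        End Using
    End Sub
End Module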

How do I write my output to a file using a file sink to analyze the output data?

I'm reading binary data from two files (.txt), performing a logical operation (XOR), and writing the output to another file (.txt) using a file sink. After I execute the flow graph and open the file, it looks something like a corrupted Word document. Please help me deal with it.
XORing two printable character bytes may produce unprintable characters; for example, 'A' (0x41) XOR 'B' (0x42) is 0x03, a control character. So your text editor may not be able to display the file properly. Try opening it with a hex viewer such as hexdump or okteta.

Management Studio "Save Results As" issue

I have a SQL query that returns 1 column. I run it in SQL Server Management Studio 2008R2.
I use File\Save Grid Results and create a .TXT file.
My problem is that the first record of the file has 3 bytes inserted in front of the data. The three bytes are x'EFBBBF'. This causes problems when I use the file in another process.
I get the same thing whether I save as .TXT or as .CSV.
Any ideas?
Found it.
Save Results As...
Choose a folder
Enter a file name
Save button now has a dropdown arrow to the right
Click on dropdown arrow and select Save with Encoding...
Select ANSI
Click OK
The ANSI-encoded file will not contain a UTF-8 BOM.
kuru kuru na is on the right track, those bytes are the UTF-8 BOM. I haven't found any settings to change the file encoding that Management Studio uses for saving results. I just use Vim to remove the BOM after saving the file. Your favorite text editor may have a similar option, or you could use a tool like iconv if you need to remove the mark or re-encode the file in a script.
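If you would rather strip the mark in code, for example as part of the process that consumes the file, a minimal .NET sketch (the file name is a placeholder) could look like this:
Imports System.IO
Imports System.Linq

Module StripBom
    Sub Main()
        ' Remove a leading UTF-8 BOM (EF BB BF) from the saved results file.
        Dim bytes As Byte() = File.ReadAllBytes("results.txt")
        If bytes.Length >= 3 AndAlso bytes(0) = &HEF AndAlso bytes(1) = &HBB AndAlso bytes(2) = &HBF Then
            ' Rewrite the file without the first three bytes.
            File.WriteAllBytes("results.txt", bytes.Skip(3).ToArray())
        End If
    End Sub
End Module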
I think it's called a "BOM" (byte order mark) signature, which has to do with telling whatever reads your file that it contains UTF-encoded characters. I suspect there might be a setting for it somewhere in your SSMS options. But at least this is a place to start.

How to import a flat file source to a database using SQL

I currently want to import my data from a flat file into the database.
The flat file is a .txt file in which I save a list of URLs. Example:
http://www.mimi.com/Hotels-g303188-Rurrenabaque-Hotels.html
I'm using the SQL Server Import and Export Wizard to do it, but at execution time it gives an error saying:
Error 0xc02020a1:
Data Flow Task 1: Data conversion failed. The data conversion for column
"Column 0" returned status value 4 and status text "Text was truncated or one
or more characters had no match in the target code page.".
Can anyone help?
You get this error because the text is too long for the column you've chosen to put it in.
Text was truncated or
You might want to check the size of the database column vis-à-vis your input data. Is the longest URL shorter than the column width?
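As a quick way to check that, here is a minimal sketch (the file name is a placeholder for your input file):
' Find the length of the longest line (URL) in the input file,
' to compare against the destination column width.
Dim longest As Integer = 0
For Each line As String In IO.File.ReadLines("urls.txt")
    If line.Length > longest Then longest = line.Length
Next
Console.WriteLine("Longest URL: " & longest & " characters")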
one or more characters had no match in the target code page.".
Check if your input file has any special characters. An easy way to check this would be to save your file in ANSI (Notepad > Save As > Encoding = ANSI). Note - you'd still have to select the right code page so that the import interprets your input text correctly.
Here's a very nice link that has some background on what code pages are - http://www.joelonsoftware.com/articles/Unicode.html
Note that you can also change the target column data type (to text stream, for example) in the Data source -> Advanced section.