Combining SQL files with the `copy` command in a batch file introduces a syntax error because it adds an invisible character (`U+FEFF`)

In a pre-build event, a batch file is executed to combine multiple SQL files into a single one.
It is done using this command:
COPY %#ProjectDir%\Migrations\*.sql %#ProjectDir%ContinuousDeployment\AllFilesMergedTogether.sql
Everything appears to work fine, but somehow the resulting file produces a syntax error.
After two hours of investigation, it turned out the issue is caused by an invisible character that remains invisible even in Notepad++.
Using an online tool, the character was identified as U+FEFF, as shown in the following image.
Here are the two input scripts.
PRINT 'Script1'
PRINT 'Script2'
Here is the output given by the copy command (the U+FEFF character sits, invisibly, at the start of the second line).
PRINT 'Script1'
PRINT 'Script2'
Additional info:
The batch file is encoded as UTF-8.
The input files are encoded as UTF-8-BOM.
The output file is encoded as UTF-8-BOM.
I'm not sure it is possible to change the output encoding of the copy command.
I've tried and failed.
What should be done to eradicate this extremely frustrating parasitic character?

It turned out that changing the encoding of the input files to ANSI fixes the issue.
No more pesky character(s).
Also, doing so changes the encoding of the resulting file to UTF-8 instead of UTF-8-BOM, which I believe is an improvement.
Encoding can be changed using Notepad++, as shown in the following picture.
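If you would rather keep the inputs as UTF-8-BOM, another option is to do the concatenation in a small script that strips the BOM itself. Here is a minimal Python sketch under that assumption; the Migrations folder and output file name are taken from the question, so adjust the paths to your project layout:

import glob

# Concatenate all SQL files, stripping the leading U+FEFF (BOM) from each;
# the "utf-8-sig" codec consumes the BOM if present and is harmless if absent.
with open("ContinuousDeployment/AllFilesMergedTogether.sql", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob("Migrations/*.sql")):
        with open(path, "r", encoding="utf-8-sig") as f:
            out.write(f.read())
        out.write("\n")  # keep scripts separated by a newline

The output is written as plain UTF-8 (no BOM), which matches the encoding the ANSI workaround ends up producing.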

Related

Fix Unicode Decode Error Without Specifying Encoding='UTF-8'

I am getting the following error:
'ascii' codec can't decode byte 0xf4 in position 560: ordinal not in range(128)
I find this very weird given that my .csv file doesn't have special characters. Perhaps it has special characters that specify header rows and whatnot; I don't know.
But the main problem is that I don't actually have access to the source code that reads in the file, so I cannot simply add the keyword argument encoding='UTF-8'. I need to figure out which encoding is compatible with codecs.ascii_decode(...). I DO have access to the .csv file that I'm trying to read, and I can adjust its encoding, but not the source file that reads it.
I have already tried exporting my .csv file into Western (ASCII) and Unicode (UTF-8) formats, but neither of those worked.
Fixed. It had nothing to do with Unicode shenanigans; my script was writing a Parquet file when my CloudFormation template was expecting a CSV file. Thanks for the help.
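For anyone who hits the same 'ascii' codec error and wants to see what is actually in the file, here is a small sketch that locates the first non-ASCII byte; the file name data.csv is an assumption, substitute your own:

with open("data.csv", "rb") as f:
    data = f.read()

# ASCII covers bytes 0x00-0x7F; anything above that will trip ascii_decode.
for i, b in enumerate(data):
    if b > 0x7F:
        print(f"non-ASCII byte 0x{b:02x} at offset {i}")
        break
else:
    print("file is pure ASCII")

With the offset in hand (560 in the question), you can open the file in a hex editor and see which character the 0xf4 byte belongs to.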

How to delete Â in Notepad++

I have a ksh file which contains a set of SQL commands (for example, 35 of them), and I open it in Notepad++. It executes and returns the expected results for the first 5 SQL queries, but for the rest it throws an error.
When I dug into it, I found that in between the SQL commands there is an extra character like Â. How can I remove it? I have tried a lot to get rid of it, with no luck.
Thanks.
There's a Hex Editor plugin for Notepad++. Look at your file as a sequence of byte values rather than as an encoded text file, and find the byte value that corresponds to the Â glyph.
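As a scripted alternative to the hex editor: a stray Â usually means the file contains the byte 0xC2, most often as the UTF-8 lead byte of a non-breaking space (0xC2 0xA0) being displayed as Latin-1. Under that assumption, here is a short sketch that replaces those pairs with ordinary spaces (the file name script.ksh is made up):

# Read the raw bytes so no decoding step can mangle them.
with open("script.ksh", "rb") as f:
    data = f.read()

# Swap UTF-8 non-breaking spaces (0xC2 0xA0) for plain ASCII spaces.
cleaned = data.replace(b"\xc2\xa0", b" ")

with open("script.ksh", "wb") as f:
    f.write(cleaned)

If the hex editor shows a different byte value, replace b"\xc2\xa0" with whatever sequence you actually find.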

What is the meaning of the file names flanked by the '#' sign and how can I remove them?

When I run the 'ls' command in the terminal on my Raspberry Pi 2, I see different kinds of file names, some like "#example.cpp#" and others like "homework1.cpp~".
What do these two kinds of files mean, and how can I get rid of them? Simply using the 'rm' command doesn't seem to be working for me. Thanks!
Some applications will create a copy of a file and use special characters in the copy's filename. For instance, some text editors will make a copy of a file you are starting to edit by using the same name and appending a tilde character (~), so that you have a backup of the file you are about to edit.
Another reason would be an application processing the file into a temporary file, with the temporary file then being used for the next step.
As for the filenames with the pound sign (#) on the front and back, Emacs uses exactly that pattern (#homework1.cpp#) for its auto-save files, which it writes periodically while you edit. I am pretty sure the files with the tilde character appended to the end of the file name are backup files from a text editor such as Emacs or vim, containing a copy of the file as it was when it was last opened.
One thing that you could do is look inside those files to see what is there, using a Linux command or a text editor. If you use a text editor, I would copy the file to another folder as a backup and then look at it there.
Edit: Someone just posted and then deleted an answer which also mentioned how to remove these files.
What I read was that the rm command is used; however, for some kinds of special characters you will need to put quotes around the name, and you may also need to escape certain characters.
The command shell reads the command line you type and makes changes to the text before passing it on to the command. So if the filename has a space in it, say jj Johny, then when you remove the file you have to specify rm "jj Johny", since spaces are used by the command processor to separate arguments.
The other poster mentioned that you have to escape the pound sign (#) with a backslash, as in rm \#example.cpp\#, to prevent it from being interpreted by the command processor.
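If shell quoting keeps getting in the way, you can sidestep it entirely with a few lines of Python; this sketch deletes Emacs-style auto-save (#name#) and backup (name~) files in the current directory, so double-check the matches before running it:

import glob
import os

# Collect auto-save (#name#) and backup (name~) files in the current directory.
for pattern in ("#*#", "*~"):
    for path in glob.glob(pattern):
        print("removing", path)
        os.remove(path)

Printing each path before removal gives you a chance to spot anything you did not mean to delete.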

How do I write my output to a file using a file sink to analyze the output data?

I'm reading binary data from two files (.txt); after performing a logical operation (XOR) on them, I write the output to another file (.txt) using a file sink. After I execute the flow graph and open the file, it looks like a corrupted Word document. Please help me deal with it.
XORing two printable character bytes may produce unprintable characters, so your text editor may not be able to display the result properly. Try opening the file with a hex viewer or editor such as hexdump or Okteta.
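To see why, XOR two ordinary letters and look at the result; and instead of a dedicated tool, a few lines of Python can also hex-dump the file-sink output (output.bin is an assumed name, use whatever your flow graph writes):

# XOR of two printable bytes is often an unprintable control character:
print(hex(ord("A") ^ ord("B")))  # 0x3 (ETX), which no text editor renders

# Minimal hex dump of the first 64 bytes of the output file.
with open("output.bin", "rb") as f:
    data = f.read(64)
for i in range(0, len(data), 16):
    chunk = data[i:i + 16]
    hexpart = " ".join(f"{b:02x}" for b in chunk)
    text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
    print(f"{i:08x}  {hexpart:<47}  {text}")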

Encoding issue in I/O with Jena

I'm generating some RDF files with Jena. The whole application works with UTF-8 text, and the source code is stored as UTF-8 as well.
When I print a string containing non-English characters to the console, I get the right output, e.g. Est un lieu généralement officielle assis....
Then I use the RDF writer to output the file:
Model m = loadMyModelWithMultipleLanguages();
log.info( getSomeStringFromModel(m) );          // log4j, correct output
RDFWriter w = m.getWriter( "RDF/XML" );         // default encoding: utf-8
w.setProperty( "showXmlDeclaration", "true" );  // optional
OutputStream out = new FileOutputStream( pathToFile );
w.write( m, out, "http://someurl.org/base/" );
// file contains garbled text
The RDF file starts with: <?xml version="1.0"?>. If I add encoding="utf-8" to the declaration, nothing changes.
By default the text should be encoded as utf-8.
The resulting RDF file validates fine, but when I open it with any editor/visualiser (vim, Firefox, etc.), the non-English text is all messed up: Est un lieu généralement officielle assis ... or Est un lieu g\u221A\u00A9n\u221A\u00A9ralement officielle assis....
(Either way, this is obviously not acceptable from the user's viewpoint.)
The same issue happens with any output format supported by Jena (RDF/XML, N-Triples, etc.).
I can't really find a logical explanation for this, and the official documentation doesn't seem to address the issue.
Any hints, or tests I can run to figure it out?
My guess would be that your strings are messed up, and your printStringFromModel() method just happens to output them in a way that accidentally makes them display correctly, but it's rather hard to say without more information.
You're instructing Jena to include an XML declaration in the RDF/XML file, but you don't say what encoding (if any) Jena declares in it. That would be helpful to know.
You're also not showing how you're printing the strings in the printStringFromModel() method.
Also, in Firefox, go to the View menu and then to Character Encoding. Which encoding is selected? If it's not UTF-8, what happens when you select UTF-8? Does the text show correctly when you select some other encoding?
Edit: The snippet you show in your post looks fine and should work. My best guess is that the code that reads your source strings into a Jena model is broken and reads the UTF-8 source as ISO-8859-1 or something similar. You should be able to confirm or rule that out by checking the length() of one of the offending strings: if each of the troublesome characters like é is counted as two, the error is on reading; if it's correctly counted as one, it's on writing.
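The same length check is easy to demonstrate outside Java. This Python sketch fakes the suspected failure mode by decoding UTF-8 bytes as ISO-8859-1 and shows how the length gives it away:

s = "généralement"
mojibake = s.encode("utf-8").decode("iso-8859-1")  # simulate the mis-read input
print(len(s), len(mojibake))  # 12 vs 14: each é now counts as two characters
print(mojibake)               # gÃ©nÃ©ralement: classic UTF-8-read-as-Latin-1

In the Java code, the equivalent check is simply calling length() on a string pulled from the model.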
My hint/answer would be to inspect the byte sequence in three places:
The data source. Using a hex editor, confirm that the é character in your source data is represented by the expected UTF-8 byte sequence 0xC3 0xA9.
In memory. Right after your call to printStringFromModel, put a breakpoint and inspect the bytes in the string (or convert them to hex and print them out).
The output file. Again, use a hex editor to verify that the byte sequence is 0xC3 0xA9.
This will tell you exactly what is happening to the bytes as they travel through your program, and where they deviate from the expected 0xC3 0xA9.
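Checking the output file for the expected sequence doesn't even require a hex editor; a short sketch does it (output.rdf is an assumed file name):

with open("output.rdf", "rb") as f:
    data = f.read()
print(b"\xc3\xa9" in data)  # True if at least one correctly encoded é is present

If this prints False while the source data contains é, the bytes were already altered somewhere between reading and writing.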
The best way to address this would be to package up the smallest unit of your code that demonstrates the issue, and submit a complete, runnable test case as a ticket on the Jena Jira.