Load raw data from a file without dropping backslash characters - sql

I have a file that contains the following content (simplified version that demonstrates the problem):
"abc\"def"
I would like to load the literal content of the file into a table without any mangling of the data. Here is what I am currently doing:
CREATE TABLE file_content (content text);
COPY file_content FROM '/path/to/test.txt';
The resulting line in the table is:
"abc"def"
In other words, the backslash was silently dropped/ignored. I've tried the copy with different encodings (UTF8, LATIN1, SQL_ASCII) without any change in behavior.
Also, the ESCAPE and QUOTE options seemed promising at first, but they are only for COPY ... TO.
Is there a way to load raw data from a file without the mangling? I'm using PostgreSQL version 9.4.6.

You need to change every \ to \\, because COPY's default text format treats the backslash as an escape character. You can use sed for that:
sed -i -- 's/\\/\\\\/g' import.file
Please make sure you have reviewed your data and backed it up before performing the operation above, since sed -i modifies the file in place.
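If editing the file in place is a concern, the same doubling can be written to a separate file instead. The following is a minimal Python sketch, not part of the original answer: the names double_backslashes and write_escaped_copy are hypothetical, and it assumes the file uses an ASCII-compatible encoding such as UTF8 or LATIN1 and fits in memory.

```python
from pathlib import Path


def double_backslashes(data: bytes) -> bytes:
    """Double every backslash so COPY's text format reads it back as a single one."""
    return data.replace(b"\\", b"\\\\")


def write_escaped_copy(src: str, dst: str) -> None:
    """Write an escaped copy of src to dst, leaving the original file untouched."""
    # Working on bytes avoids any newline or encoding translation.
    Path(dst).write_bytes(double_backslashes(Path(src).read_bytes()))
```

COPY would then be pointed at the escaped copy rather than at the original file.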

Related

Combining SQL files with the `copy` command in a batch file introduces a syntax error because it adds an invisible character `U+FEFF`

In a pre-build event, a batch file is executed to combine multiple SQL files into a single one.
It is done using this command:
COPY %#ProjectDir%\Migrations\*.sql %#ProjectDir%ContinuousDeployment\AllFilesMergedTogether.sql
Everything appears to work fine, but somehow the result gives an incorrect syntax error.
After two hours of investigation, it turns out the issue is caused by an invisible character that remains invisible even in Notepad++.
Using an online website, the character was identified as U+FEFF, as shown in the following image.
Here are the two input scripts.
PRINT 'Script1'
PRINT 'Script2'
Here is the output given by the copy command.
PRINT 'Script1'
PRINT 'Script2'
Additional info:
The batch file is encoded with UTF-8.
The input files are encoded with UTF-8-BOM.
The output file is encoded with UTF-8-BOM.
I'm not sure it is possible to change the output encoding of the copy command.
I've tried and failed.
What should be done to eradicate this extremely frustrating parasitic character?
It turned out that changing the encoding of the input files to ANSI fixes the issue.
No more pesky character(s).
Also, doing so changes the encoding of the result file to UTF-8 instead of UTF-8-BOM, which I believe is great.
The encoding can be changed using Notepad++, as shown in the following picture.
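An alternative to re-encoding every input file by hand is to merge the scripts with a tool that strips the byte order mark from each one. The sketch below is one possible approach in Python, not what the batch file above does: merge_sql_files is a hypothetical name, and it assumes all inputs are UTF-8 with or without a BOM. The likely cause of the stray character is that copy concatenates the files byte for byte, so the BOM of the second file lands in the middle of the output.

```python
def merge_sql_files(sources, destination):
    """Concatenate SQL files, dropping any leading UTF-8 BOM from each input."""
    parts = []
    for source in sources:
        # "utf-8-sig" decodes UTF-8 and skips a leading BOM if one is present.
        with open(source, encoding="utf-8-sig", newline="") as handle:
            text = handle.read()
        # End each script with a newline so statements do not run together.
        if text and not text.endswith("\n"):
            text += "\n"
        parts.append(text)
    # Plain "utf-8" writes no BOM at the start of the merged file.
    with open(destination, "w", encoding="utf-8", newline="") as out:
        out.write("".join(parts))
```

The list of sources could be built with glob.glob on the Migrations folder and sorted so the scripts are merged in a predictable order.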

SQL Server adding extra special characters in query result

I am trying to extract some records to a file using the BCP command in SQL Server. However, when the file is generated, there are extra spaces between the values of each column.
To test, I wrote a SQL query as simple as this:
select 'ABC', 40, 'TEST','NOTWORKING'
When we copy the output of the above query and paste it into Notepad, the output comes out as
ABC 40 TEST NOTWORKING
Notice the space between each value? The file that the system generates using the BCP command has the same space in its output, which is incorrect. What I want to see in the output file is
ABC40TESTNOTWORKING
What could be causing this issue? I am amazed to see such a weird issue and am hoping that it can be fixed by some change or setting. Please help.
Sample BCP command
EXEC xp_cmdshell 'bcp "select ''ABC'', 40, ''TEST'',''NOTWORKING''" queryout "E:\Testfile.txt" -c -T -S""'
Output in the File - Testfile.txt
ABC 40 TEST NOTWORKING
There are probably tabs between the values. If you want a single value, use concat():
select CONCAT('ABC', 40, 'TEST', 'NOTWORKING')
There's no issue. The command line has no field terminator argument, so the default is used, a tab. That's described in the docs:
-t field_term
Specifies the field terminator. The default is \t (tab character). Use this parameter to override the default field terminator. For more information, see Specify Field and Row Terminators (SQL Server).
If you specify the field terminator in hexadecimal notation in a bcp.exe command, the value will be truncated at 0x00. For example, if you specify 0x410041, 0x41 will be used.
If field_term begins with a hyphen (-) or a forward slash (/), do not include a space between -t and the field_term value.
The link points to an entire article that explains how to use terminators, for each of the bulk operations.
As for the Copy/Paste operation, it has nothing to do with SQL Server. SQL Server has no UI, it's a service. I suspect what was pasted in Notepad was copied from an SSMS grid.
SSMS is a client tool just like any other. When you copy data from it into the clipboard, it decides what to put there and what format to use. That format can be plain text, using spaces and tabs for layout, RTF, HTML etc.
Plain text with tabs as field separators is probably the best choice for any tool, as it preserves the visual layout up to a point and uses only a single character as a separator. A fixed-length layout using spaces could also be used but that would add characters that may well be part of a field.
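To confirm that the gaps are tabs rather than spaces, the line can be inspected outside Notepad. The following Python sketch is only an illustration: split_bcp_line is a hypothetical helper, and the sample line is what the bcp command above would be expected to write with -c and the default terminators.

```python
def split_bcp_line(line, field_terminator="\t"):
    """Split one exported row into its fields, ignoring the row terminator."""
    return line.rstrip("\r\n").split(field_terminator)


# Assumed sample of one row written with -c and no -t argument.
line = "ABC\t40\tTEST\tNOTWORKING\r\n"

print(repr(line))                      # 'ABC\t40\tTEST\tNOTWORKING\r\n'
print("".join(split_bcp_line(line)))   # ABC40TESTNOTWORKING
```

repr() makes the tab characters visible as \t, which a text editor renders as whitespace.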
Encodings and codepages
-c exports the data using the user's default codepage. This means that text stored in varchar fields using a different codepage (collation) may get mangled. Non-visible Unicode characters will also get mangled and appear as something else, or as ?.
-c
Performs the operation using a character data type. This option does not prompt for each field; it uses char as the storage type, without prefixes and with \t (tab character) as the field separator and \r\n (newline character) as the row terminator. -c is not compatible with -w.
It's better to export the file as UTF-16 using -w.
-w
Performs the bulk copy operation using Unicode characters. This option does not prompt for each field; it uses nchar as the storage type, no prefixes, \t (tab character) as the field separator, and \n (newline character) as the row terminator. -w is not compatible with -c.
The codepage can be specified using the -C parameter. -C 1252, for example, will export the data using Windows' Latin1 codepage. -C 1253 will export it using the Greek codepage.
-C { ACP | OEM | RAW | code_page }
Specifies the code page of the data in the data file. code_page is relevant only if the data contains char, varchar, or text columns with character values greater than 127 or less than 32.
SQL Server 2016 and later can also export text as UTF8 with -C 65001. Earlier versions don't support UTF8.
Versions prior to version 13 (SQL Server 2016 (13.x)) do not support code page 65001 (UTF-8 encoding). Versions beginning with 13 can import UTF-8 encoding to earlier versions of SQL Server.
All this is described in bcp's online documentation.
This subject is so important for any database that it has an entire section in the docs that describes data formats and considerations, using format files to specify different settings per column, and guidelines to ensure compatibility with other applications.
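The effect of the chosen code page can be illustrated without bcp at all. The Python sketch below is an analogy under stated assumptions: export_bytes is a hypothetical helper, and Python's cp1252, cp1253, utf-16-le and utf-8 codecs stand in for -C 1252, -C 1253, -w and -C 65001 respectively. It does not reproduce bcp's exact behavior.

```python
def export_bytes(text, code_page):
    """Encode text for a data file, replacing unrepresentable characters with '?'."""
    return text.encode(code_page, errors="replace")


sample = "Ωmega"

latin1 = export_bytes(sample, "cp1252")     # the Greek letter is lost: b'?mega'
greek = export_bytes(sample, "cp1253")      # the Greek code page keeps it
utf16 = export_bytes(sample, "utf-16-le")   # two bytes per character here
utf8 = export_bytes(sample, "utf-8")

print(latin1)
print(greek.decode("cp1253") == sample)
print(utf16.decode("utf-16-le") == sample)
print(utf8.decode("utf-8") == sample)
```

Only the Latin1 export loses information, which is the kind of mangling described above for text that does not fit the target code page.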

Does pig support load with no delimiter?

I'd like to load a lot of small files from HDFS with Pig and process them as tuples (filename, filecontent).
a=LOAD 'mydir' USING PigStorage('','-tagPath') AS (filepath:chararray, filecontents:chararray);
However it seems like I cannot omit specifying the delimiter. Is there some sort of a "NULL" in Pig or is there any other way to make sure the content of the file will not be split?
You will have to write your own custom loader by extending LoadFunc.
The short answer to your question is no. To make sure the content is not split, use a delimiter that does not exist in the content. That way, the whole content is loaded into the field filecontents:chararray. So, assuming your input files do not contain the special character '~':
a=LOAD 'mydir' USING PigStorage('~','-tagPath') AS (filepath:chararray, filecontents:chararray);
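Whether a character such as '~' is really absent from the data is worth checking before relying on it. The following Python sketch shows one way to do that on a local sample of the files: first_unused_delimiter and its candidate list are hypothetical and not part of Pig.

```python
def first_unused_delimiter(contents, candidates="~|^\x01"):
    """Return the first candidate character that appears in none of the contents."""
    for candidate in candidates:
        if all(candidate not in text for text in contents):
            return candidate
    # Every candidate occurs somewhere, so none of them is safe to use.
    return None


samples = ["first file ~ with a tilde", "second file | with a pipe"]
print(repr(first_unused_delimiter(samples)))   # '^'
```

The returned character could then be passed to PigStorage in place of '~'.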

What is the meaning of the file names flanked by the '#' sign and how can I remove them?

When I do the 'ls' command in the terminal on my Raspberry Pi 2, I see different types of names of files, some like "#example.cpp#", as well as others like "homework1.cpp~".
What do these two file types mean, and how can I get rid of them? Simply using the 'rm' command doesn't seem to be working for me. Thanks!
Some applications will create a copy of a file and use special characters when creating the filename for the copy. For instance some text editors will make a copy of a file you are starting to edit by using the same name and adding a tilde character (~) to the end of the file. That way you will have a backup of the file that you are about to edit.
Another reason would be if an application is processing the file into a temporary file with the temporary file then being used for the next step. For example perhaps the C/C++ compiler is reading the file homework1.cpp with the C Preprocessor to generate the temporary file #homework1.cpp# which is then compiled by the compiler to generate the object code file.
I am not familiar with the Raspberry Pi, so I am not sure what may be creating the filenames with the pound sign (#) at the front and back. Perhaps it is the C++ compiler. I am pretty sure the files with the tilde character appended to the end of the file name are backup files from vi or vim, containing a copy of the file at the time it was last opened with the text editor.
One thing that you could do is to look in those files to see what is there using a Linux command or a text editor. If you use a text editor I would copy the file to another folder as a back up and then look at it there.
Edit: Someone just posted and then deleted an answer which also mentioned about how to remove these files.
What I read was that the rm command is used; however, for some kinds of special characters you will need to put quotes around the name, and you may also need to escape certain special characters.
The command shell reads the command line you type in and makes changes to the text before passing it on to the command. So if the filename has a space in it, say jj Johny, then to remove the file you have to specify rm "jj Johny", since spaces are used by the command processor to separate arguments.
The other poster mentioned that you had to escape the pound sign (#) using the backslash character in order to prevent it from being modified by the command processor.
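The quoting problem only exists because the shell interprets the command line. A script that passes file names straight to the operating system avoids it. The Python sketch below illustrates this: remove_editor_leftovers is a hypothetical helper, it assumes the unwanted files are exactly those named like #name# or name~, and it defaults to a dry run so nothing is deleted by accident.

```python
from pathlib import Path


def remove_editor_leftovers(directory, dry_run=True):
    """List (and optionally delete) files named like '#name#' or 'name~'."""
    matches = []
    for path in sorted(Path(directory).iterdir()):
        name = path.name
        is_hash_wrapped = len(name) > 2 and name.startswith("#") and name.endswith("#")
        is_tilde_backup = len(name) > 1 and name.endswith("~")
        if path.is_file() and (is_hash_wrapped or is_tilde_backup):
            matches.append(path)
            if not dry_run:
                # No shell is involved, so '#' and '~' need no quoting or escaping.
                path.unlink()
    return matches
```

Calling remove_editor_leftovers(".") first shows what would be removed, and passing dry_run=False performs the deletion.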

Postgres 9.3 end-of-copy marker corrupt - Any way to change this setting?

I am trying to stream data through an AWK program to a Postgres COPY command. This usually works great. However, recently my data has contained long text strings with '\.' values in them.
The Postgres documentation mentions that this combination of characters represents the end-of-data marker, http://www.postgresql.org/docs/9.2/static/sql-copy.html, and I am getting the associated errors when trying to insert with COPY.
My question is, is there a way to turn this off? Perhaps change the end-of-data marker to a different combination of characters? Or do I have to alter/remove these strings before trying to insert using the COPY command?
You can try to filter your data through sed 's:\\:\\\\:g' - this would change every \ in your data to \\, which is the correct escape sequence for a single backslash in COPY data.
But I think the backslash is not the only problematic character. Newlines should also be encoded as \n, carriage returns as \r, and tabs as \t (tab is the default field delimiter in COPY).
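The escaping rules listed above can be gathered into one small function applied to each field before it is streamed to COPY. This is a minimal Python sketch, assuming COPY's default text format with tab as the delimiter and \N as the NULL marker. The names escape_copy_field and format_copy_row are hypothetical.

```python
def escape_copy_field(value):
    """Escape one field value for COPY's default text format."""
    return (
        value.replace("\\", "\\\\")   # backslash first, so later escapes are not doubled
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )


def format_copy_row(fields):
    """Join escaped fields with tabs; None is written as the default NULL marker."""
    cells = ["\\N" if field is None else escape_copy_field(field) for field in fields]
    return "\t".join(cells) + "\n"


# A value containing the end-of-data sequence no longer starts with a bare backslash.
print(format_copy_row(["text with \\. inside", None]), end="")
```

Because the backslash in '\.' is doubled, COPY reads it back as a literal backslash followed by a dot instead of treating it as the end-of-data marker.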