Postgres 9.3 end-of-copy marker corrupt - Any way to change this setting?

I am trying to stream data through an AWK program to a Postgres COPY command. This usually works great. However, my data recently has included long text strings containing '\.' values.
The Postgres documentation, http://www.postgresql.org/docs/9.2/static/sql-copy.html, mentions that this combination of characters represents the end-of-data marker, and I am getting the associated errors when trying to insert with COPY.
My question is, is there a way to turn this off? Perhaps change the end-of-data marker to a different combination of characters? Or do I have to alter/remove these strings before trying to insert using the COPY command?

You can try to filter your data through sed 's:\\:\\\\:g' - this would change every \ in your data to \\, which is the correct escape sequence for a single backslash in COPY data.
But backslash is probably not the only problematic character. Literal newlines should also be encoded as \n, carriage returns as \r, and tabs as \t (tab is the default field delimiter in COPY).
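If you are generating the stream yourself, you can apply the same escaping in code. A minimal Python sketch, assuming COPY's default text format with a tab delimiter (the function name is mine):
def escape_copy_field(value):
    value = value.replace('\\', '\\\\')  # backslash first, so later escapes aren't doubled
    value = value.replace('\n', '\\n')   # literal newline -> \n
    value = value.replace('\r', '\\r')   # carriage return -> \r
    value = value.replace('\t', '\\t')   # tab, the default field delimiter -> \t
    return value
row = ['text with \\. inside', 'line1\nline2']
print('\t'.join(escape_copy_field(f) for f in row))
Because every backslash in the data gets doubled, no data line can ever consist of exactly \. again, so the end-of-data marker problem disappears.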

Related

removing unconventional field separators (^@^@^@) in a text file [duplicate]

I have a text file containing unwanted null characters (ASCII NUL, \0). When I try to view it in vi I see ^@ symbols, interleaved with normal text. How can I:
Identify which lines in the file contain null characters? I have tried grepping for \0 and \x0, but this did not work.
Remove the null characters? Running strings on the file cleaned it up, but I'm just wondering if this is the best way?
I’d use tr:
tr < file-with-nulls -d '\000' > file-without-nulls
If you are wondering whether input redirection in the middle of the command arguments works: it does. Most shells will recognize and handle I/O redirection (<, >, …) anywhere in the command line.
Use the following sed command for removing the null characters in a file.
sed -i 's/\x0//g' null.txt
This solution edits the file in place, which is important if the file is still being used. Passing -i'ext' creates a backup of the original file with the 'ext' suffix added.
A large number of unwanted NUL characters, say one every other byte, indicates that the file is encoded in UTF-16 and that you should use iconv to convert it to UTF-8.
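If iconv is not at hand, the same check-and-convert can be done in Python. A minimal sketch, with placeholder file names and a rough version of the every-other-byte heuristic:
import codecs
with open('file-with-nulls', 'rb') as f:
    raw = f.read()
# Heuristic: a UTF-16 BOM, or NUL bytes making up a large share of the file.
if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)) or raw.count(b'\x00') > len(raw) // 3:
    text = raw.decode('utf-16')  # BOM-aware; assumes little-endian if there is no BOM
    with open('file-without-nulls', 'w', encoding='utf-8') as out:
        out.write(text)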
I discovered the following, which prints out which lines, if any, have null characters:
perl -ne '/\000/ and print;' file-with-nulls
Also, an octal dump can tell you if there are nulls:
od file-with-nulls | grep ' 000'
If the lines in the file end with \r\n\000, then what works is to delete the \n and \000 characters and then replace each \r with \n:
tr -d '\n\000' <infile | tr '\r' '\n' >outfile
Here is an example of how to remove NULL characters using ex (in-place):
ex -s +"%s/\%x00//g" -cwq nulls.txt
and for multiple files:
ex -s +'bufdo!%s/\%x00//g' -cxa *.txt
For recursion, you can use the globbing option **/*.txt (if it is supported by your shell).
Useful for scripting, since ex is specified by POSIX while sed's -i parameter is a non-standard extension.
See also: How to check if the file is a binary file and read all the files which are not?
I used:
recode UTF-16..UTF-8 <filename>
to get rid of the zeroes in the file.
I faced the same error with:
import codecs as cd
f = cd.open(filePath, 'r', 'ISO-8859-1')
I solved the problem by changing the encoding to UTF-16:
f = cd.open(filePath, 'r', 'utf-16')
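If you are not sure which encoding a file actually uses, checking for a BOM first can save some guessing. A minimal sketch (filePath stands for the same path as in the snippet above):
import codecs
filePath = 'input.txt'  # placeholder; same path as in the snippet above
with open(filePath, 'rb') as f:
    head = f.read(4)
if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
    print('looks like UTF-16')
elif head.startswith(codecs.BOM_UTF8):
    print('looks like UTF-8 with a BOM')
else:
    print('no BOM; could be a legacy codepage such as ISO-8859-1')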
Remove the trailing null character at the end of a PDF file using PHP. This is independent of the OS.
This script uses PHP to remove a trailing NULL value at the end of a binary file, solving a crashing issue that was triggered by the NULL value. You can edit this script to remove all NULL characters, but seeing it done once will help you understand how this works.
Backstory
We were receiving PDFs from a 3rd party that we needed to upload to our system using a PDF library. In the files being sent to us, a null value was sometimes appended to the PDF file. When our system processed these files, the ones with the trailing NULL value caused the system to crash.
Originally we were using sed, but sed behaves differently on Macs and Linux machines. We needed a platform-independent method to remove the trailing null value, and PHP was the best option. Also, it was a PHP application, so it made sense :)
This script performs the following operation:
Take the binary file and convert it to hex (binary files don't like being exploded by newlines or carriage returns), explode the string using the carriage return (0d0a) as the delimiter, pop the last member of the array if its value is null (00), implode the array using the carriage return as the glue, and process the file.
//In this case we are getting the file as a string from another application.
//We use this line to get a sample bad file.
$fd = file_get_contents($filename);
//We trim leading and trailing whitespace and convert the string into hex
$bin2hex = trim(bin2hex($fd));
//We create an array using carriage return (0d0a) as the delimiter
$bin2hex_ex = explode('0d0a', $bin2hex);
//Look at the last element. If the last element is equal to 00, we pop it off
$end = end($bin2hex_ex);
if($end === '00') {
    array_pop($bin2hex_ex);
}
//We implode the array using carriage return as the glue
$bin2hex = implode('0d0a', $bin2hex_ex);
//The new string no longer has the null character at the EOF
$fd = hex2bin($bin2hex);
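For comparison, here is the same trailing-NUL fix as a minimal Python sketch (file names are placeholders). It works on the raw bytes directly, so no hex round-trip is needed:
with open('input.pdf', 'rb') as f:
    data = f.read()
if data.endswith(b'\x00'):  # drop a single NUL byte stuck at the end, if present
    data = data[:-1]
with open('output.pdf', 'wb') as f:
    f.write(data)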

SQL Server adding extra special characters in query result

I am trying to extract some records into a file using the BCP command in SQL Server. However, when the file is generated, there are extra spaces between the results for each column.
To test, I wrote a basic SQL query as simple as this:
select 'ABC', 40, 'TEST','NOTWORKING'
When we copy the output of the above query and paste it in Notepad, the output comes out as
ABC 40 TEST NOTWORKING
Notice the space between each value? The file that the system generates using the BCP command has the same spaces in it, which is incorrect. What I want to see in the output file is
ABC40TESTNOTWORKING
What could be causing this issue? I am simply amazed to see such a weird issue and hope that it can be fixed by some change or setting. Please help.
Sample BCP command
EXEC xp_cmdshell 'bcp "select ''ABC'', 40, ''TEST'',''NOTWORKING''" queryout "E:\Testfile.txt" -c -T -S""'
Output in the File - Testfile.txt
ABC 40 TEST NOTWORKING
There are probably tabs between the values. If you want a single value, use concat():
select CONCAT('ABC', 40, 'TEST', 'NOTWORKING')
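To confirm what separator bcp actually wrote, you can look at the raw bytes of the generated file. A minimal Python sketch, using the path from the question:
with open(r'E:\Testfile.txt', 'rb') as f:
    print(repr(f.readline()))  # tabs show as \t, e.g. b'ABC\t40\tTEST\tNOTWORKING\r\n'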
There's no issue. The command line has no field terminator argument, so the default is used, a tab. That's described in the docs:
-t field_term
Specifies the field terminator. The default is \t (tab character). Use this parameter to override the default field terminator. For more information, see Specify Field and Row Terminators (SQL Server).
If you specify the field terminator in hexadecimal notation in a bcp.exe command, the value will be truncated at 0x00. For example, if you specify 0x410041, 0x41 will be used.
If field_term begins with a hyphen (-) or a forward slash (/), do not include a space between -t and the field_term value.
The link points to an entire article that explains how to use terminators, for each of the bulk operations.
As for the Copy/Paste operation, it has nothing to do with SQL Server. SQL Server has no UI; it's a service. I suspect what was pasted into Notepad was copied from an SSMS grid.
SSMS is a client tool just like any other. When you copy data from it into the clipboard, it decides what to put there and what format to use. That format can be plain text, using spaces and tabs for layout, RTF, HTML etc.
Plain text with tabs as field separators is probably the best choice for any tool, as it preserves the visual layout up to a point and uses only a single character as a separator. A fixed-length layout using spaces could also be used, but that would add characters that may well be part of a field.
Encodings and codepages
-c exports the data using the user's default codepage. This means that text stored in varchar fields using a different codepage (collation) may get mangled. Unicode characters that cannot be represented in that codepage will also get mangled and appear as something else, or as ?.
-c
Performs the operation using a character data type. This option does not prompt for each field; it uses char as the storage type, without prefixes and with \t (tab character) as the field separator and \r\n (newline character) as the row terminator. -c is not compatible with -w.
It's better to export the file as UTF-16 using -w.
-w
Performs the bulk copy operation using Unicode characters. This option does not prompt for each field; it uses nchar as the storage type, no prefixes, \t (tab character) as the field separator, and \n (newline character) as the row terminator. -w is not compatible with -c.
The codepage can be specified using the -C parameter. -C 1252, for example, will export the data using Windows' Latin1 (Western European) codepage, and 1253 will export it using the Greek codepage.
-C { ACP | OEM | RAW | code_page }
Specifies the code page of the data in the data file. code_page is relevant only if the data contains char, varchar, or text columns with character values greater than 127 or less than 32.
SQL Server 2016 and later can also export text as UTF-8 with -C 65001. Earlier versions don't support UTF-8.
Versions prior to version 13 (SQL Server 2016 (13.x)) do not support code page 65001 (UTF-8 encoding). Versions beginning with 13 can import UTF-8 encoding to earlier versions of SQL Server.
All this is described in bcp's online documentation.
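One more thing to remember with -w: the file comes back as UTF-16, so whatever reads it must decode it accordingly. A minimal Python sketch, assuming the defaults described above (tab field terminator, UTF-16LE, no BOM):
import csv
with open(r'E:\Testfile.txt', encoding='utf-16-le', newline='') as f:  # use 'utf-16' instead if the file has a BOM
    for row in csv.reader(f, delimiter='\t'):
        print(row)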
This subject is so important for any database that it has an entire section in the docs, which describes data formats and considerations, using format files to specify different settings per column, and guidelines to ensure compatibility with other applications.

Kotlin: Printing string with array elements that cuts off left side of answers

I am writing a small text-based game to familiarize myself with Kotlin. I am creating two strings that print out the multiple-choice options. I have confirmed that all four array elements are captured appropriately, but when the string prints, it cuts off the a) and c) options. I have used \t, spaces, etc., and it does the same thing. I have also tried to just use print() and add a \n at the end.
println(menuList[0])
println(menuList[1])
println(menuList[2])
println(menuList[3])
println("a) ${menuList[0]} b) ${menuList[1]}")
println("c) ${menuList[2]} d) ${menuList[3]}")
Output:
(screenshot: erroneous output of the multiple-choice text)
The source text came from a file which separated lines with \r\n, but the code reading it was splitting on \n. The result was that each entry ended with \r. When printed out, this caused the first value to be overwritten.
The solution is, when reading the file, to split by \r\n rather than \n.
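The effect is easy to reproduce. A minimal Python sketch (the behaviour is the same in Kotlin), including a splitlines-style fix that tolerates both \n and \r\n:
entry = "First choice\r"  # the stray \r left over from splitting CRLF data on '\n'
print("a) " + entry + " b) Second choice")  # most terminals show ' b) Second choice' - 'a) First choice' is overwritten
file_text = "First choice\r\nSecond choice\r\n"  # placeholder for the file contents
menu_list = file_text.splitlines()  # splits on \n, \r\n, and \r, so no stray \r survives
print(menu_list)  # ['First choice', 'Second choice']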

Embedded Newline Character Issue in Redshift Copy Command

We have fifteen embedded newline characters in a field of a source S3 file. The field size of the target table in Redshift is VARCHAR(5096). The field length in the source file is 5089 bytes. We are escaping each of the fifteen newline characters with a backslash \, as required by the ESCAPE option of the copy command. Our expectation with the ESCAPE option is that the backslash \ we inserted before each newline character will be ignored before the target is loaded in Redshift. However, when we use the copy command with the ESCAPE option we get
err_code:1204 - String length exceeds DDL length.
Is there a way to keep the added backslash \ characters from being counted towards the target column length in Redshift?
Note: When we truncated the above source field in the file to 4000 bytes and inserted the backslash \ before the newline characters, the copy command with the ESCAPE option successfully loaded the field in Redshift. Also, the backslash \ characters were not loaded into Redshift, as expected.
You could extend your VARCHAR length to allow for more characters.
Or, you could use the TRUNCATECOLUMNS option to load as much as possible without generating an error.
Our understanding w.r.t. the above issue was incorrect. The backslashes \ that we had inserted were not causing the error "err_code:1204 - String length exceeds DDL length." The ESCAPE option of the copy command was in fact not counting the inserted backslash characters towards the target limit, and was also removing them from the loaded value properly.
The actual issue we were facing was that some of the characters we were trying to load were multibyte UTF-8 characters. Since we were incorrectly assuming them to be 1 byte long, the size of the target field proved to be insufficient. We increased the length of the target field from varchar(5096) to varchar(7096), after which all the data loaded successfully.
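The character-length versus byte-length distinction is easy to check. A minimal Python sketch:
s = "café" * 1250
print(len(s))                  # 5000 characters - fits a 5096-character limit...
print(len(s.encode('utf-8')))  # 6250 - ...but not a 5096-BYTE limit: 'é' is 2 bytes in UTF-8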

handling strings with \n in plain text e-mail

I have a column in my database that contains a string like this:
"Warning set for 7 days.\nCritical Notice - Last Time Machine backup was 118 days ago at 2012-11-16 20:40:52\nLast Time Machine Destination was FreeAgent GoFlex Drive\n\nDefined Destinations:\nDestination Name: FreeAgent GoFlex Drive\nBackup Path: Not Specified\nLatest Backup: 2012-11-17"
I am displaying this data in an e-mail to users. I have been able to easily format the field in my HTML e-mails by doing the following:
simple_format(@servicedata.service_exit_details.gsub('\n', '<br>'))
The above code replaces the "\n" with "<br>" tags and simple_format handles the rest.
My issue arises with how to format it properly in the plain-text template. Initially I thought I could just output the column; seeing as it contains "\n", I assumed the plain text would interpret it and all would be well. However, this simply spits out the string with "\n" intact, just as displayed above, rather than creating line breaks as desired.
In an attempt to find a way to parse the string so the line breaks are acknowledged, I have tried:
@servicedata.service_exit_details.gsub('\n', '"\r\n"')
@servicedata.service_exit_details.gsub('\n', '\r\n')
raw @servicedata.service_exit_details
markdown(@servicedata.service_exit_details, autolinks: false) # with all the necessary markdown setup
simple_format(@servicedata.service_exit_details.html_safe)
none of which worked.
Can anyone tell me what I'm doing wrong or how I can make this work?
What I want is for the plain text to acknowledge the line breaks and format the string as follows:
Warning set for 7 days.
Critical Notice - Last Time Machine backup was 118 days ago at 2012-11-16 20:40:52
Last Time Machine Destination was FreeAgent GoFlex Drive
Defined Destinations:
Destination Name: FreeAgent GoFlex Drive
Backup Path: Not Specified
Latest Backup: 2012-11-17
I see.
You need to differentiate between a literal backslash followed by the letter n, which is a sequence of two characters, and an LF character (a.k.a. newline), which is usually represented as \n.
You also need to distinguish between the two different kinds of quoting you're using in Ruby: single and double quotes. Single quotes are literal: the only things interpreted specially inside single quotes are the sequence \', which escapes a single quote, and the sequence \\, which produces a single backslash. Thus, '\n' is a two-character string: a backslash and the letter n.
Double quotes allow for all kinds of weird things: you can use interpolation with #{}, and you can insert special characters via escape sequences, so "\n" is a string containing the LF control character.
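To see the difference concretely, here is a minimal Python sketch (Python treats single and double quotes the same, so the Ruby single-quoted string is written with an escaped backslash):
two_chars = "\\n"  # like Ruby's '\n': a backslash followed by the letter n
one_char = "\n"    # like Ruby's "\n": a single LF control character
print(len(two_chars))  # 2
print(len(one_char))   # 1
print(two_chars.replace("\\n", "\n") == one_char)  # True - this is what the gsub below does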
Now, in your database you seem to have the former (backslash and n), as hinted by two pieces of evidence: the fact that you see a literal backslash and n when you print it, and the fact that gsub finds a '\n'. What you need to do is replace the useless backslash-and-n with the actual line separator characters:
@servicedata.service_exit_details.gsub('\n', "\r\n")