First off, sorry for the long post. I wanted to be thorough with my examples/data, and the bulk of this post is just that.
I inherited a bulk import process that uses a format file (.fmt) at my new job. The process was created by the guy who worked here before me, and it is now my job to learn it (and fix it). I have limited knowledge of this stuff, but I have done some research; after a few weeks, I haven't really gotten anywhere. Here is what I am working with...
--BCP Command to import data from C:\Desktop\20180629_2377167_PR_NP.txt to table LA_Temp.dbo.ProvReg
bcp LA_Temp.dbo.ProvReg IN C:\Desktop\20180629_2377167_PR_NP.txt -f C:\Desktop\PROVREG.FMT -T -S SERVERNAME -k -m 1000000
--Table structure the format file was created from:
SELECT [NPI]
,[D1]
,[EntityType]
,[D2]
,[ReplaceNPI]
,[D3]
,[ProvName]
,[D4]
,[MailAddr1]
,[D5]
,[MailAddr2]
,[D6]
,[MailCity]
,[D7]
,[MailState]
,[D8]
,[MailZip]
,[D9]
,[MailCountry]
,[D10]
,[MailPhone]
,[D11]
,[MailFax]
,[D12]
,[LocAddr1]
,[D13]
,[LocAddr2]
,[D14]
,[LocCity]
,[D15]
,[LocState]
,[D16]
,[LocZip]
,[D17]
,[LocCountry]
,[D18]
,[LocPhone]
,[D19]
,[LocFax]
,[D20]
,[Taxonomy1]
,[D21]
,[Taxonomy2]
,[D22]
,[Taxonomy3]
,[D23]
,[OtherProvID]
,[D24]
,[OtherProvIDType]
,[D25]
,[ProvEnumDate]
,[D26]
,[LastUpdate]
,[D27]
,[DeactivateRC]
,[D28]
,[DeactivateDate]
,[D29]
,[ReactivateDate]
,[D30]
,[Gender]
,[D31]
,[License]
,[D32]
,[LicenseState]
,[D33]
,[AuthorizedContact]
,[D34]
,[ContactTitle]
,[D35]
,[ContactPhone]
,[D36]
,[PanelOpen]
,[D37]
,[Language1]
,[D38]
,[Language2]
,[D39]
,[Language3]
,[D40]
,[Language4]
,[D41]
,[Language5]
,[D42]
,[AgeRestrict]
,[D43]
,[PCPMax]
,[D44]
,[PCPActual]
,[D45]
,[PCPAll]
,[D46]
,[EnrollInd]
,[D47]
,[EnrollDate]
,[D48]
,[FamilyOnly]
,[D49]
,[SubSpec1]
,[D50]
,[SubSpec2]
,[D51]
,[SubSpec3]
,[D52]
,[ContractName]
,[D53]
,[ContractBegin]
,[D54]
,[ContractEnd]
,[D55]
,[Parish1]
,[D56]
,[Parish2]
,[D57]
,[Parish3]
,[D58]
,[Parish4]
,[D59]
,[Parish5]
,[D60]
,[Parish6]
,[D61]
,[Parish7]
,[D62]
,[Parish8]
,[D63]
,[Parish9]
,[D64]
,[Parish10]
,[D65]
,[Parish11]
,[D66]
,[Parish12]
,[D67]
,[Parish13]
,[D68]
,[Parish14]
,[D69]
,[Parish15]
,[D70]
,[PCPInd]
,[D71]
,[DisplayOnline]
,[D72]
,[ExpAgeRestrict]
,[D73]
,[Suffix]
,[D74]
,[Title]
,[D75]
,[PrescriberInd]
,[Spaces]
,[End]
FROM [LA_Temp].[dbo].[ProvReg]
--Example Text File Data (this is one line)
9999999999 ^0^ ^ ^3800 HMA BLVD STE 305 ^ ^METAIRIE ^LA^70006 ^ ^5048729679^ ^3800 HMA BLVD ^ ^METAIRIE ^LA^70006 ^ ^9999999999^ ^207Q00000X^ ^ ^0000000^2001^ ^00000000^ ^00000000^00000000^F^ ^LA^ ^ ^ ^N^1^0^0^0^0^2^00000^00000^00000^ ^ ^ ^ ^ ^ ^000000000000000000000000000000^00000000^00000000^26^00^00^00^00^00^00^00^00^00^00^00^00^00^00^0^0^Accept patients of age 000-000^ ^MD ^ ^
--Format file
11.0
153
1 SQLCHAR 0 40 "\t" 1 NPI SQL_Latin1_General_Pref_CP1_CI_AS
2 SQLCHAR 0 2 "\t" 2 D1 SQL_Latin1_General_Pref_CP1_CI_AS
3 SQLCHAR 0 2 "\t" 3 EntityType
...all the way to...
153 SQLCHAR 0 2 "\r\n" 153 End
I have changed directories, the server name, and some of the text file data to maintain security; otherwise it is very similar.
Here is the problem I am encountering:
With the "\t" used in the format file I just created from the SQL table, I get the error: [Microsoft][SQL Server Native Client 11.0]Unexpected EOF encountered in BCP data-file.
If I change this to just "" or "^" (as I think it should be, since the text file uses a caret delimiter), the rows begin to copy but fail with the error
[Microsoft][SQL Server Native Client 11.0]String data, right truncation SQLState = 22001, NativeError = 0. BCP copy in failed.
If anyone can point me in the right direction for troubleshooting this issue, or if you see anything out of place, please let me know. As I mentioned, I have been at this for some time and can use any suggestions I can get. Unfortunately, there is no one at my company I can ask about this.
Try adding the -e option to your bcp command. This gives you an error file in which BCP writes some sample lines from the file that it had problems with. It is very helpful for troubleshooting the type of error you are getting now (and you are correct to change the delimiter in the format file).
The error you are getting now ("string data", "right truncation") is just what it states. However, the truncation can occur for a number of reasons. The destination table's columns may not be large enough to hold the data contained between the defined field delimiters. Delimiters may also appear inside your data, tricking the bcp utility into thinking a column has ended before it was intended to end in the file (this is less likely with the delimiter you are using, but you never know; I always prefer fixed width if possible). And, of course, the source may well have written you a file that contradicts whatever agreed-upon spec led you to define your destination as you have.
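If the file really is caret-delimited, the terminator column of the format file should say so. A hedged sketch of how the entries might look with the delimiter swapped (the lengths shown are the ones from the question and are assumptions; each must be at least as wide as the longest value that field can hold in the text file, or you will get exactly this right-truncation error):
11.0
153
1 SQLCHAR 0 40 "^" 1 NPI SQL_Latin1_General_Pref_CP1_CI_AS
2 SQLCHAR 0 2 "^" 2 D1 SQL_Latin1_General_Pref_CP1_CI_AS
3 SQLCHAR 0 2 "^" 3 EntityType SQL_Latin1_General_Pref_CP1_CI_AS
...all the way to...
153 SQLCHAR 0 2 "\r\n" 153 End SQL_Latin1_General_Pref_CP1_CI_AS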
The error is accurate; the trick is finding where it occurs. Use the -e option to let BCP capture the problematic lines:
BCP table_dest IN "C:\FILE.TXT" -S SVR -T -f"C:\FORMAT_FILE.txt" -e"C:\ERROR_FILE.txt"
The "error_file.txt" will include line numbers and will include a sample of lines that it couldn't handle. Just copy and past to find in the file youare trying to load to see for yourself.
I strongly suggest using a more capable text editor. Do not use Windows Notepad or WordPad; use something like Notepad++ or UltraEdit to inspect ASCII text files.
I wrote the code below, which extracts the directory name along with the file name; I then run the PURGE command on that extracted text.
$ sear VAXMANAGERS_ROOT:[PROC]TEMP.LIS LOG/out=VAXMANAGERS_ROOT:[DEV]FVLIM.TXT
$ OPEN IN VAXMANAGERS_ROOT:[DEV]FVLIM.TXT
$ LOOP:
$ READ/END_OF_FILE=ENDIT IN ABCD ! read to end of file; ABCD ends up holding the last line
$ GOTO LOOP
$ ENDIT:
$ CLOSE IN
$ ERROR = F$EXTRACT(0,59,ABCD) ! take the first 59 characters of that last line
$ SH SYM ERROR
$ PURGE/KEEP=1 'ERROR'
The output is as follows:
ERROR = "$1$DKC102:[PROD_LIVE.LOG]DP2017_TMP2.LIS;27392 "
The problem: the directory length varies each time (it may be 59, 40, or some other value, but the directory plus file name never exceeds 59 characters on my system). So in the output above, the extract also picks up the file's version number, and I am not able to purge the file when the version number is included.
%PURGE-E-PURGEVER, version numbers not permitted
Any suggestions on how to eliminate the version number from the output?
I cannot use the exact length of the directory, as the directory length varies every time... :(
The answer with F$ELEMENT( 0, ";", ABCD ) should work, as confirmed. I might script something like this:
$ ERROR = F$PARSE(";",ERROR) ! will return $1$DKC102:[PROD_LIVE.LOG]DP2017_TMP2.LIS;
$ ERROR = ERROR - ";"
$ PURGE/KEEP=1 'ERROR'
I'm not sure why you have the read loop; what you get is the last line in the file, but I assume that's what you want.
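For completeness, a minimal sketch of the F$ELEMENT variant mentioned above (ABCD being the symbol your loop leaves holding the last line):
$ ERROR = F$ELEMENT(0, ";", ABCD) ! everything before the first ";"
$ PURGE/KEEP=1 'ERROR'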
While HABO explained it, here are some more details.
Suppose I use f$search to check if a file exists
a = f$search("sys$manager:net$server.log")
then I find it exists
wr sys$output a
shows
SYS$SYSROOT:[SYSMGR]NET$SERVER.LOG;9
From the help of f$parse I get
help lex f$parse arg
shows, among other things
Specifies a character string containing the name of a field
in a file specification. Specifying the field argument causes
the F$PARSE function to return a specific portion of a file
specification.
Specify one of the following field names (do not abbreviate):
NODE Node name
DEVICE Device name
DIRECTORY Directory name
NAME File name
TYPE File type
VERSION File version number
So I can do
wr sys$output f$parse(a,,,"DEVICE")
which shows
SYS$SYSROOT:
and also
wr sys$output f$parse(a,,,"DIRECTORY")
so I get
[SYSMGR]
and
wr sys$output f$parse(a,,,"NAME")
shows
NET$SERVER
and
wr sys$output f$parse(a,,,"TYPE")
shows
.LOG
the version is
wr sys$output f$parse(a,,,"VERSION")
shown as
;9
The lexical functions can be handy; check them using
help lexical
it shows
F$CONTEXT F$CSID F$CUNITS F$CVSI F$CVTIME F$CVUI F$DELTA_TIME F$DEVICE F$DIRECTORY F$EDIT
F$ELEMENT F$ENVIRONMENT F$EXTRACT F$FAO F$FID_TO_NAME F$FILE_ATTRIBUTES F$GETDVI F$GETENV
F$GETJPI F$GETQUI F$GETSYI F$IDENTIFIER F$INTEGER F$LENGTH F$LICENSE F$LOCATE F$MATCH_WILD
F$MESSAGE F$MODE F$MULTIPATH F$PARSE F$PID F$PRIVILEGE F$PROCESS F$READLINK F$SEARCH
F$SETPRV F$STRING F$SYMLINK_ATTRIBUTES F$TIME F$TRNLNM F$TYPE F$UNIQUE F$USER
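Putting those fields together, a sketch (using the same example file) of a purge that drops the version number by reassembling everything but the VERSION field:
$ a = F$SEARCH("sys$manager:net$server.log")
$ spec = F$PARSE(a,,,"DEVICE") + F$PARSE(a,,,"DIRECTORY") + -
  F$PARSE(a,,,"NAME") + F$PARSE(a,,,"TYPE")
$ PURGE/KEEP=1 'spec'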
We are currently using sed to filter output of regression runs. Sometimes we have a filter that looks like this:
/copyright/,/end copyright/d
If that end copyright is ever missing, the rest of the file is deleted. I'm wondering if there's some way to generate an error for this? awk would also be okay to use. I don't really want to add code that reads the file line by line and issues an error if it hits EOF.
Here's a sample input:
copyright
2016 jan 15
end copyright
date 2016 jan 5 time 15:36
last one
I'd like to get an error if end copyright is missing. The real filter also replaces the date line with DATE, so it's more than just ripping out the copyright.
You can persuade sed to generate an error if it reaches end of input (i.e. sees address $) between your start and end, but it won't be a very helpful message:
/copyright/,/end copyright/{
$s//\1/ # here
d
}
This will error if end copyright is missing or on the last line, with an exit status of 1 and the helpful message:
sed: -e expression #1, char 0: invalid reference \1 on `s' command's RHS
If you're using this in a makefile, you might want to echo a helpful message first, or (better) to wrap this in something that catches the error and produces a more useful one.
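For example, a minimal wrapper in a shell script might look like this (the file name and message are placeholders; sed's stderr is suppressed since its own message is unhelpful):
sed '/copyright/,/end copyright/{
$s//\1/
d
}' file 2>/dev/null || { echo 'error: unterminated copyright block' >&2; exit 1; }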
I tested this with GNU sed; though if you are using GNU sed, you could more easily use its useful extension:
q [EXIT-CODE]
This command only accepts a single address.
Exit 'sed' without processing any more commands or input. Note
that the current pattern space is printed if auto-print is not
disabled with the -n option. The ability to return an exit code
from the 'sed' script is a GNU 'sed' extension.
Q [EXIT-CODE]
This command only accepts a single address.
This command is the same as 'q', but will not print the contents of
pattern space. Like 'q', it provides the ability to return an exit
code to the caller.
So you could simply write
/copyright/,/end copyright/{
$Q 42
d
}
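From the calling shell you can then test for that specific exit code, along these lines (a sketch):
sed '/copyright/,/end copyright/{
$Q 42
d
}' file
if [ $? -eq 42 ]; then
  echo 'error: unterminated copyright block' >&2
  exit 1
fi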
Never use range expressions like /start/,/end/: they make trivial code slightly briefer, but they require a complete rewrite or duplicated conditions the moment your requirements change even slightly. Always use a flag instead. And since sed doesn't support variables, it doesn't support flag variables, so you shouldn't be using sed here; use awk instead.
In this case your original code would be:
awk '/copyright/{f=1} !f; /end copyright/{f=0}' file
And your modified code would be:
awk '/copyright/{f=1} !f; /end copyright/{f=0} END{if (f) print "Missing end copyright"}' file
The above is obviously untested since you didn't provide any sample input/output we could test a potential solution against.
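If you also want a non-zero exit status rather than just a message, one possible variant (equally untested; writing to /dev/stderr works in gawk and most modern awks):
awk '/copyright/{f=1} !f; /end copyright/{f=0}
  END{if (f) {print "Missing end copyright" > "/dev/stderr"; exit 1}}' file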
With sed you can build a loop:
sed -e '/copyright/{:a;/end copyright/d;N;ba;};' file
:a defines the label "a"
/end copyright/d deletes the pattern space, but only when "end copyright" matches
N appends the next line to the pattern space
ba jumps to the label "a"
Note that d also ends the loop: it deletes the pattern space and starts the next cycle.
This way you avoid deleting the text all the way to the end of the file when the closing marker is missing.
If you don't want that text displayed at all, and prefer an error message when a "copyright" block is left unclosed, you obviously need to wait for the end of the file. You can do that with sed too, storing all the lines in the hold space until the end:
sed -n -e '/copyright/{:a;/end copyright/d;${c\ERROR MESSAGE
;};N;ba;};H;${g;p};' file
H appends the current line to the hold space
g puts the content of the hold space into the pattern space
The file content is only displayed once the last line is reached, via ${g;p}; otherwise, when the closing "end copyright" is missing, the current line is changed into the error message by ${c\ERROR MESSAGE\n;} inside the loop.
This way you can test what sed outputs before redirecting it to whatever you want.
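Since this variant signals the problem through its output rather than its exit status, a caller could compare the output against the placeholder message, along these lines (a sketch):
out=$(sed -n -e '/copyright/{:a;/end copyright/d;${c\ERROR MESSAGE
;};N;ba;};H;${g;p};' file)
if [ "$out" = 'ERROR MESSAGE' ]; then
  echo 'unterminated copyright block' >&2
  exit 1
fi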
What method can I use to delete a specific line from a csv/txt file that is too big to load into memory and edit manually?
Background
My question is actually an indirect solution to a problem with importing csv files into SQL databases.
I have a series of 10-30 GB csv files that I want to import into an SQLite table from within R, since they are too large to import into R as whole data frames. I am using the 'RSQLite' package for this.
A couple of them fail because one of the lines is badly formatted, and the populating process is then cancelled. R returns the line number that caused the process to fail.
The error given is:
./csvfilename line 102206973 expected 9 columns of data but found 3)
So I know exactly which line causes the error.
I see two potential 'indirect' solutions that I was hoping someone could help me with:
(i) Deleting the line causing the error in the 20+ GB files, e.g. line 102,206,973 in the example above.
I am not concerned with 'losing' the data in line 102,206,973 by skipping or deleting it. However, I have tried and failed to find a way to access the csv file and remove the line.
(ii) Using sqlite directly (or anything else?) to import the csv in a way that allows skipping lines that cause errors.
Although not likely to be directly related to the solution, here is the R code used:
db <- dbConnect(SQLite(), dbname=name_of_table)
dbWriteTable(conn = db, name ="currentdata", value = csvfilename, row.names = FALSE, header = TRUE)
Thanks!
To delete a specific line you can use sed:
sed -e '102206973d' your_file
If you want the deletion to be done in place, use
sed -i.bak -e '102206973d' your_file
This will create a backup named your_file.bak, and your_file will have the specified line removed.
Example
$ cat a
1
2
3
4
5
$ sed -i.bak -e '3d' a
$ cat a
1
2
4
5
$ cat a.bak
1
2
3
4
5
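One caveat for this question's file sizes: sed -i rewrites the whole file behind the scenes, so for 20-30 GB inputs you may prefer to stream into a new file explicitly. Either single-pass sketch below works, using the line number reported in the error:
sed '102206973d' csvfilename > csvfilename.fixed
awk 'NR != 102206973' csvfilename > csvfilename.fixed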
I'm trying to use sqlcmd on a Windows machine running SQL Server 2005 to write query results to a csv file. The command-line options we typically use are:
-l 60 -t 300 -r 1 -b -W -h -1
However, the columns are getting truncated at 256 bytes. In an attempt to circumvent this, I tried using this command line option in place of -W:
-y 8000
This captures the entire fields, but the problem with this method is that the file balloons from just over 1 MB to about 200 MB due to all the extra space (I realize 8000 is probably overkill, but it would probably have to be at least 4000, and I'm currently only working with a small subset of the data). The -W option normally eliminates all this extra space, but when I try to use the two together it tells me they're mutually exclusive.
Is there a way to get sqlcmd around this limit, or does anyone know if another program (such as bcp or osql) would make this easier?
Edit:
Here are the code snippets we're using to get the field that's being truncated (similar code is used for a bunch of fields):
SELECT ALIASES.AliasList as complianceAliases,
...
LEFT OUTER JOIN (Select M1.ID, M1.LIST_ID,stuff((SELECT '{|}' + isnull(Content2,'')+' '+isnull(Content3,'')+' '+isnull(Content4,'')+' '+isnull(Content5,'')+' '+isnull(Content6,'')+' '+isnull(Content7,'')
FROM fs_HOST3_TEST_web.ISI_APP_COMP_MULTI M2 with (nolock)
WHERE M1.LIST_ID = M2.LIST_ID and M1.ID = M2.ID and M1.TYPE = M2.TYPE
FOR XML PATH('')
),1,1,'') as AliasList
FROM fs_HOST3_TEST_web.ISI_APP_COMP_MULTI M1 with (nolock)
WHERE M1.LIST_ID = 2001 AND M1.TYPE = 'Aliases'
GROUP BY m1.list_id,m1.ID,m1.Type) as ALIASES
ON ALIASES.LIST_ID = PAIR.COMP_LIST_ID AND ALIASES.ID = PAIR.COMP_ID
I ended up solving this by using the -y 0 argument. It still left a bunch of whitespace, but it looks like it only extends to the end of the longest piece of data in each field.
I then ran the output through a program that removes repeated spaces, and that solved all of the problems.
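For reference, a minimal sketch of that final pipeline (the file names are placeholders, and the cleanup assumes no field legitimately contains runs of spaces):
sqlcmd -S SERVERNAME -i query.sql -o output.txt -l 60 -t 300 -r 1 -b -h -1 -y 0
sed 's/  */ /g' output.txt > output_clean.txt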