First off, sorry for the long post. I wanted to be thorough with my examples/data, and the bulk of this post is just that.
I inherited a Bulk Import Process using a format file (.fmt) at my new job. The process was created by the guy who worked here before me, and it is now my job to learn it (and fix it). I have limited knowledge of this stuff, but I have done some research; after a few weeks, I haven't really gotten anywhere. Here is what I am working with...
--BCP Command to import data from C:\Desktop\20180629_2377167_PR_NP.txt to table LA_Temp.dbo.ProvReg
bcp LA_Temp.dbo.ProvReg IN C:\Desktop\20180629_2377167_PR_NP.txt -f C:\Desktop\PROVREG.FMT -T -S SERVERNAME -k -m 1000000
--Table Structure which format file is created from:
SELECT [NPI]
,[D1]
,[EntityType]
,[D2]
,[ReplaceNPI]
,[D3]
,[ProvName]
,[D4]
,[MailAddr1]
,[D5]
,[MailAddr2]
,[D6]
,[MailCity]
,[D7]
,[MailState]
,[D8]
,[MailZip]
,[D9]
,[MailCountry]
,[D10]
,[MailPhone]
,[D11]
,[MailFax]
,[D12]
,[LocAddr1]
,[D13]
,[LocAddr2]
,[D14]
,[LocCity]
,[D15]
,[LocState]
,[D16]
,[LocZip]
,[D17]
,[LocCountry]
,[D18]
,[LocPhone]
,[D19]
,[LocFax]
,[D20]
,[Taxonomy1]
,[D21]
,[Taxonomy2]
,[D22]
,[Taxonomy3]
,[D23]
,[OtherProvID]
,[D24]
,[OtherProvIDType]
,[D25]
,[ProvEnumDate]
,[D26]
,[LastUpdate]
,[D27]
,[DeactivateRC]
,[D28]
,[DeactivateDate]
,[D29]
,[ReactivateDate]
,[D30]
,[Gender]
,[D31]
,[License]
,[D32]
,[LicenseState]
,[D33]
,[AuthorizedContact]
,[D34]
,[ContactTitle]
,[D35]
,[ContactPhone]
,[D36]
,[PanelOpen]
,[D37]
,[Language1]
,[D38]
,[Language2]
,[D39]
,[Language3]
,[D40]
,[Language4]
,[D41]
,[Language5]
,[D42]
,[AgeRestrict]
,[D43]
,[PCPMax]
,[D44]
,[PCPActual]
,[D45]
,[PCPAll]
,[D46]
,[EnrollInd]
,[D47]
,[EnrollDate]
,[D48]
,[FamilyOnly]
,[D49]
,[SubSpec1]
,[D50]
,[SubSpec2]
,[D51]
,[SubSpec3]
,[D52]
,[ContractName]
,[D53]
,[ContractBegin]
,[D54]
,[ContractEnd]
,[D55]
,[Parish1]
,[D56]
,[Parish2]
,[D57]
,[Parish3]
,[D58]
,[Parish4]
,[D59]
,[Parish5]
,[D60]
,[Parish6]
,[D61]
,[Parish7]
,[D62]
,[Parish8]
,[D63]
,[Parish9]
,[D64]
,[Parish10]
,[D65]
,[Parish11]
,[D66]
,[Parish12]
,[D67]
,[Parish13]
,[D68]
,[Parish14]
,[D69]
,[Parish15]
,[D70]
,[PCPInd]
,[D71]
,[DisplayOnline]
,[D72]
,[ExpAgeRestrict]
,[D73]
,[Suffix]
,[D74]
,[Title]
,[D75]
,[PrescriberInd]
,[Spaces]
,[End]
FROM [LA_Temp].[dbo].[ProvReg]
--Example Text File Data (this is one line)
9999999999 ^0^ ^ ^3800 HMA BLVD STE 305 ^ ^METAIRIE ^LA^70006 ^ ^5048729679^ ^3800 HMA BLVD ^ ^METAIRIE ^LA^70006 ^ ^9999999999^ ^207Q00000X^ ^ ^0000000^2001^ ^00000000^ ^00000000^00000000^F^ ^LA^ ^ ^ ^N^1^0^0^0^0^2^00000^00000^00000^ ^ ^ ^ ^ ^ ^000000000000000000000000000000^00000000^00000000^26^00^00^00^00^00^00^00^00^00^00^00^00^00^00^0^0^Accept patients of age 000-000^ ^MD ^ ^
--Format file
11.0
153
1 SQLCHAR 0 40 "\t" 1 NPI SQL_Latin1_General_Pref_CP1_CI_AS
2 SQLCHAR 0 2 "\t" 2 D1 SQL_Latin1_General_Pref_CP1_CI_AS
3 SQLCHAR 0 2 "\t" 3 EntityType
...all the way to...
153 SQLCHAR 0 2 "\r\n" 153 End
I have changed directories, servername, and some of the text file data to maintain security, however, it is very similar.
Here is the problem I am encountering:
With the "\t" delimiter in the format file I just created from the SQL table, I get the error: [Microsoft][SQL Server Native Client 11.0]Unexpected EOF encountered in BCP data-file.
If I change this to just "" or "^" (as I 'think' it should be, since the text file uses a caret delimiter), the rows begin to copy but then fail with:
[Microsoft][SQL Server Native Client 11.0]String data, right truncation SQLState = 22001, NativeError = 0. BCP copy in failed.
If anyone can please point me in the right direction here for troubleshooting this issue, or if you see anything out of place, please let me know. As I mentioned, I have been at this for some time, and can use any suggestions I can get. Unfortunately, there is no one at my company I can ask about this.
Try adding the -e option to your bcp command. This will give you an error file in which BCP will write some sample lines from the file that it had problems with. Very helpful for troubleshooting the type of error you are getting now (and you are correct to change your delimiter in the format file).
The error you are getting now, "string data" and "truncation", is just what it states. However, the truncation can be occurring for a number of reasons. The destination table's columns may not be large enough to hold the data contained between the defined field delimiters. There may be delimiters appearing inside your data, which could be tricking the bcp utility into thinking a column has ended before it was intended to end in the file (this is less likely with the delimiter you are using... but you never know... I always prefer fixed width if possible). And, of course, the source of the data may very well have written you a file that contradicts whatever agreed-upon spec led you to define your destination as you have.
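For the first of those causes, you can compare field lengths in the data file against the column widths in the format file before ever running bcp. A minimal sketch, assuming a caret-delimited file and the 40-character NPI column from the format file above (the file path and sample data are made up):

```shell
# Build a two-line caret-delimited sample; line 2's first field is
# printed zero-padded to 45 characters, wider than the 40-character
# NPI column defined in the format file.
printf '1234567890^0^X\n'  >  /tmp/provreg_sample.txt
printf '%045d^0^X\n' 9     >> /tmp/provreg_sample.txt

# Report any line whose first field exceeds the 40-char NPI width
# (FS is the escaped caret so awk treats it as a literal character).
awk -F'\\^' 'length($1) > 40 { print NR ": field 1 is " length($1) " chars" }' /tmp/provreg_sample.txt
```

The same awk pattern can be repeated per column (`$3`, `$5`, ...) against the widths in the .fmt file to find exactly which field is triggering the right-truncation error.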
The error is accurate; the trick is finding where. Use the -e option to have BCP capture the problematic lines:
BCP table_dest IN "C:\FILE.TXT" -S SVR -T -f"C:\FORMAT_FILE.txt" -e"C:\ERROR_FILE.txt"
The "error_file.txt" will include line numbers and a sample of the lines it couldn't handle. Just copy and paste to find them in the file you are trying to load, and see for yourself.
Strongly suggest using a more capable text editor. Do not use Windows Notepad or WordPad; use something like Notepad++ or UltraEdit to inspect ASCII text files.
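For reference, the caret-delimited version of the format file would look something like this: only the terminator column changes from "\t" to "^", with the final field keeping its "\r\n" row terminator, and the widths and collations staying exactly as generated (this is a sketch, not your actual file):

```
11.0
153
1    SQLCHAR    0    40    "^"       1      NPI           SQL_Latin1_General_Pref_CP1_CI_AS
2    SQLCHAR    0    2     "^"       2      D1            SQL_Latin1_General_Pref_CP1_CI_AS
3    SQLCHAR    0    2     "^"       3      EntityType
...all the way to...
153  SQLCHAR    0    2     "\r\n"    153    End
```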
I'm running a regular upload job that loads a CSV into BigQuery. The job runs every hour. According to the recent failure log:
Error: [REASON] invalid [MESSAGE] Invalid argument: service.geotab.com [LOCATION] File: 0 / Offset:268436098 / Line:218637 / Field:2
Error: [REASON] invalid [MESSAGE] Too many errors encountered. Limit is: 0. [LOCATION]
I went to line 218638 (the original CSV has a header line, so I assume 218638 is the actual failed line; let me know if I'm wrong), but it seems all right. I also checked the corresponding table in BigQuery, and it has that line too, which means I successfully uploaded this line before.
Then why does it cause failures now?
project id: red-road-574
Job ID: Job_Upload-7EDCB180-2A2E-492B-9143-BEFFB36E5BB5
This indicates that there was a problem with the data in your file, where it didn't match the schema.
The error message says it occurred at File: 0 / Offset:268436098 / Line:218637 / Field:2. This means the first file (it looks like you just had one), and then the chunk of the file starting at 268436098 bytes from the beginning of the file, then the 218637th line from that file offset.
The reason for the offset portion is that bigquery processes large files in parallel in multiple workers. Each file worker starts at an offset from the beginning of the file. The offset that we include is the offset that the worker started from.
From the rest of the error message, it looks like the string service.geotab.com showed up in the second field, but the second field was a number, and service.geotab.com isn't a valid number. Perhaps there was a stray newline?
You can see what the lines looked like around the error by doing:
cat <yourfile> | tail -c +268436098 | tail -n +218636 | head -3
This will print out three lines... the one before the error (since I used -n +218636 instead of +218637), the one that had the error, and the next line as well.
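To sanity-check that offset arithmetic, here is the same pipeline run against a toy file (the file name and contents are made up; each line is 4 bytes):

```shell
# Five 4-byte lines: "a,1\n" through "e,5\n"
printf 'a,1\nb,2\nc,3\nd,4\ne,5\n' > /tmp/toy.csv

# tail -c +5 starts at byte 5 (the start of "b,2"), mimicking the worker
# offset; tail -n +2 takes from the 2nd line of that chunk; head -3 shows
# the line before the "error", the "error" line, and the one after.
tail -c +5 /tmp/toy.csv | tail -n +2 | head -3
```

This prints the three lines c,3 / d,4 / e,5, which is exactly the window the longer command above extracts around the reported error line.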
Note that if this is just one line in the file that has a problem, you may be able to work around the issue by specifying maxBadRecords.
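With the bq command-line tool, that option is exposed as the --max_bad_records flag. A hedged sketch (the dataset, table, and bucket names are placeholders, and the table is assumed to already exist):

```
bq load --source_format=CSV --max_bad_records=1 mydataset.mytable gs://mybucket/hourly.csv
```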
I ran into an error when importing gzipped tab delimited files into bigquery
The output I got was:
root@a20c6fbdf9b5:/opt/batch/jobs# bq show -j bqjob_r5720e2f2267a5a5b_0000014d09571f27_1
Job infra-bedrock-861:bqjob_r5720e2f2267a5a5b_0000014d09571f27_1
Job Type State Start Time Duration Bytes Processed
---------- --------- ----------------- ---------- -----------------
load FAILURE 30 Apr 08:00:44 0:02:05
Errors encountered during job execution. Bad character (ASCII 0) encountered: field starts with: <H:|\ufc0f\ufffd(>
Failure details:
- File: 1 / Line:1 / Field:1: Bad character (ASCII 0) encountered:
field starts with: <\ufff>
- File: 1 / Line:3 / Field:1: Bad character (ASCII 0) encountered:
field starts with: <\u0475\ufffd=\ufffd\ufffd\u03d6>
- File: 1 / Line:4 / Field:1: Bad character (ASCII 0) encountered:
field starts with: <-\ufffd\ufffdY\u049a\ufffd>
- File: 1 / Line:6 / Field:1: Bad character (ASCII 0) encountered:
field starts with: <\u018e\ufffd\ufffd\ufffd\ufffd>
I tried manually downloading the files, unzipping them, and then uploading the files again. The uncompressed files could be imported into BigQuery without any problems.
This looks like a bug in BigQuery's handling of gzipped files.
Inspecting the job configuration, you included a non-gzip file as the first URI, ending in .../20150426/_SUCCESS. BigQuery uses the first file to determine whether compression is enabled.
Assuming this file is empty, you can remove it from your load requests to fix this. If there is data in the file, attach a ".gz" suffix or re-order it so it is not first in the URI list.
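One quick way to check which files in a URI list are actually gzipped is to look at the two-byte gzip magic number (1f 8b) at the start of a local copy of each file. A toy demonstration (the file name is a placeholder; here we fabricate a file that starts with the magic bytes):

```shell
# Write a file beginning with the gzip magic bytes (0x1f 0x8b), then
# dump its first two bytes as hex; a real .gz file shows the same pair.
printf '\037\213' > /tmp/maybe.gz
head -c 2 /tmp/maybe.gz | od -An -t x1
```

Running the same `head -c 2 ... | od` pipeline on the _SUCCESS file would show immediately that it is not gzip, which is what confuses the compression detection.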
I tried to load the data from cloud and it failed 3 times.
Job ID: job_2ed0ded6ce1d4837873e0ab498b0bc1b
Start Time: 9:10pm, 1 Aug 2012
End Time: 10:55pm, 1 Aug 2012
Destination Table: 567402616005:company.ox_data_summary_ad_hourly
Source URI: gs://daily_log/ox_data_summary_ad_hourly.txt.gz
Delimiter:
Max Bad Records: 30000
Job ID: job_47447ab60d2a40f588c89dfe638aa438
Line:176073205 / Field:1, Bad character (ASCII 0) encountered. Rest of file not processed.
Too many errors encountered. Limit is: 0.
Should I try again, or is there an issue with the source file?
This is a known bug dealing with gzipped files. The only workaround currently is just to use an uncompressed file.
There are changes coming soon that should make it easier to handle large, uncompressed files (imports will be faster, and file size limits will increase).
The status is shown as success, but the file is not actually transferred to BigQuery.
# bq show -j abc
Job Type State Start Time Duration Bytes Processed
---------- --------- ----------------- ---------- -----------------
load SUCCESS 05 Jul 15:32:45 0:26:24
From the web interface, I can see the actual error.
Line:9732968, Too few columns: expected 27 column(s) but got 9 column(s)
Line:10893908 / Field:1, Bad character (ASCII 0) encountered. Rest of file not processed.
1) How do I know which bad character needs to be removed?
2) Why is "success" shown as the job status?
Update:
Job ID: summary_2012_07_09_to_2012_07_10a2
The error that I got at command prompt:
BigQuery error in load operation: Backend Error
A lot of lines were not processed at all. The details from the web interface:
Line:9857286 / Field:1, Bad character (ASCII 0) encountered: field starts with: <15>
Line:9857287 / Field:1, Bad character (ASCII 0) encountered. Rest of file not processed.
All the lines were successfully processed in the second attempt:
job_id: summary_2012_07_09_to_2012_07_10a3
Update 2:
Line:174952407 / Field:1, Bad character (ASCII 0) encountered. Rest of file not processed.
Job ID: job_19890847cbc3410495c3cecaf79b31fb
Sorry for the slow response; the holiday weekend meant most of the BigQuery team was not answering support questions. The 'bad character' error looks like it may be a known bug with some gzipped files, where we improperly detect an ASCII 0 value at the end of the file.
If the job is actually failing but reporting success, that sounds like a problem, but we'll need the job ID of the failing job in order to debug. Also, if you can reproduce it, that would be helpful, since we may not have the logs for the original job anymore.
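To answer the "which bad character" question locally: ASCII 0 is a NUL byte, which you can count and strip with tr before re-uploading. A small sketch (the file paths are made up):

```shell
# Make a sample file with one embedded NUL (ASCII 0) byte
printf 'good line\nbad\000line\n' > /tmp/sample.txt

# Count the NUL bytes: delete everything EXCEPT NULs, count what's left
tr -dc '\0' < /tmp/sample.txt | wc -c

# Strip the NULs to produce a clean copy for re-upload
tr -d '\0' < /tmp/sample.txt > /tmp/clean.txt
```

A nonzero count confirms the file really does contain ASCII 0 bytes (as opposed to the gzip end-of-file detection bug described above), and the cleaned copy should then load without the "Bad character" error.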