Snowflake COPY INTO from JSON - ON_ERROR = CONTINUE - Weird Issue

I am trying to load a JSON file from a staging area (S3) into a stage table using the COPY INTO command.
Table:
create or replace TABLE stage_tableA (
RAW_JSON VARIANT NOT NULL
);
Copy Command:
copy into stage_tableA from @stgS3/filename_45.gz file_format = (format_name = 'file_json')
I got the error below when executing the above (sample provided):
SQL Error [100069] [22P02]: Error parsing JSON: document is too large, max size 16777216 bytes If you would like to continue loading
when an error is encountered, use other values such as 'SKIP_FILE' or
'CONTINUE' for the ON_ERROR option. For more information on loading
options, please run 'info loading_data' in a SQL client.
When I set "ON_ERROR=CONTINUE", records were only partially loaded, i.e. only up to the record that exceeds the max size. No records after the error record were loaded.
Isn't "ON_ERROR=CONTINUE" supposed to skip only the record that exceeds the max size and load the records before and after it?

Yes, ON_ERROR=CONTINUE skips the offending line and continues loading the rest of the file.
To help us provide more insight, can you answer the following:
How many records are in your file?
How many got loaded?
At what line was the error first encountered?
You can find this information using the COPY_HISTORY() table function
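For example, a query along these lines (the table name is taken from the DDL above; the one-day lookback window is just an assumption) shows rows parsed, rows loaded, and the first error per file:
select file_name, status, row_parsed, row_count, error_count,
       first_error_message, first_error_line_number
from table(information_schema.copy_history(
    table_name => 'STAGE_TABLEA',
    start_time => dateadd(day, -1, current_timestamp())
));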

Try setting the option STRIP_OUTER_ARRAY = TRUE on the file format and attempt the load again.
The considerations for loading large semi-structured data are documented in the article below:
https://docs.snowflake.com/en/user-guide/semistructured-considerations.html
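For reference, a minimal sketch of that suggestion, reusing the stage, table, and 'file_json' format names from the question (treat them as placeholders):
alter file format file_json set strip_outer_array = true;

copy into stage_tableA
from @stgS3/filename_45.gz
file_format = (format_name = 'file_json')
on_error = 'CONTINUE';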

I partially agree with Chris. The ON_ERROR=CONTINUE option only helps if there are in fact multiple JSON objects in the file. If it's one massive object, then with ON_ERROR=CONTINUE you would simply get neither an error nor a loaded record.
If you know your JSON payload is smaller than 16 MB, then definitely try STRIP_OUTER_ARRAY = TRUE. Also, if your JSON has a lot of null values, use STRIP_NULL_VALUES = TRUE, as this will slim down your payload as well. A combined sketch is below; hope that helps.
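Both options can also be supplied inline on the COPY itself; this is only a sketch, again reusing the placeholder names from the question:
copy into stage_tableA
from @stgS3/filename_45.gz
file_format = (type = 'json' strip_outer_array = true strip_null_values = true)
on_error = 'CONTINUE';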

Related

Uploading to BigQuery GIS: "Invalid nesting: loop 1 should not contain loop 0"

I'm uploading a CSV file to BigQuery GIS but it fails with "Invalid nesting: loop 1 should not contain loop 0".
This is the error in full:
Upload complete.
BigQuery error in load operation: Error processing job 'mytable-1176:bqjob_rc625a5098ae5fb_0000017289ff6c01_1': Error while reading data,
error message: CSV table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.
Failure details:
- Error while reading data, error message: Could not parse 'POLYGON
((-0.02679766923455 51.8338973454281,-0.02665013926462
51.83390841216...' as geography for field geom (position 0)
starting at location 23 Invalid nesting: loop 1 should not contain
loop 0
I've pasted the offending row of the CSV file in full below. It appears to be valid WKT (or at any rate, if I do SELECT ST_GeomFromText('POLYGON(....)') in Postgres, or in BigQuery itself, I don't get an error).
I can upload other rows of the CSV file just fine, so this isn't a problem with the schema etc.
My CSV file, in full:
"POLYGON ((-0.02679766923455 51.8338973454281,-0.02665013926462 51.8339084121668,-0.026560668613456 51.8339151296535,-0.026487799104821 51.8339205969954,-0.026347243837181 51.8339311509483,-0.026281189190482 51.8339361120738,-0.026195955952021 51.8339425137374,-0.026119198116154 51.8339482752767,-0.026003486068282 51.8339569617883,-0.025952488850356 51.8339607906816,-0.025861270469634 51.8339676401587,-0.025858880441179 51.833967860811,-0.025828348676774 51.8339706933666,-0.025711206268708 51.8339815496773,-0.025689593686264 51.8339835517849,-0.025640538635981 51.8339884382609,-0.025529205899202 51.8339995358046,-0.025488766227802 51.8340035687339,-0.02540224517472 51.8340121962521,-0.025345346891212 51.8340178681142,-0.025335200706554 51.8340188757231,-0.025340349282672 51.834037216355,-0.025358081339574 51.8341162858309,-0.02530239219927 51.8342041307738,-0.025137161521987 51.8344648008,-0.025126602922354 51.8344835251042,-0.025117917332667 51.8344989177765,-0.025112142226516 51.8345091618299,-0.025029432267333 51.8347162751015,-0.025024329062831 51.8347290752289,-0.025009526856435 51.8347661622769,-0.025093294071485 51.8348450548926,-0.025096355164552 51.8348576953871,-0.02509500197774 51.8348697132258,-0.02509108994899 51.8348834865336,-0.025088461171685 51.8348894941415,-0.025083736509161 51.834900277382,-0.025072148729253 51.8349192991243,-0.025066276262158 51.8349271226369,-0.025058952287815 51.8349372687386,-0.025042611052947 51.834953135315,-0.025001680789442 51.8349786422223,-0.024920260658812 51.8350331883359,-0.024784590322317 51.8351240689296,-0.024592664989898 51.8352526349419,-0.024589543101483 51.8352549834202,-0.024520606248935 51.8353068434611,-0.024501351755 51.8353358705759,-0.024471109202685 51.8353814835663,-0.024447518472049 51.8354170560924,-0.024415970917911 51.8354646254313,-0.024417975754184 51.835564067945,-0.024418971386853 51.8356132810276,-0.02441636899482 51.8356607970846,-0.024412069417915 51.8357391278945,-0.024405890624276 51.835851768366,-0.024404645018278 51.8358745876421,-0.024405581774846 51.835921497729,-0.024408752345367 51.8360808568968,-0.024410061546665 51.8361464500484,-0.024413180002599 51.8363030207515,-0.024413726767404 51.8363306539892,-0.0244128839888 51.8363572028036,-0.024408747608483 51.8364878440004,-0.024407687807063 51.8365213401423,-0.024406476329442 51.8365599592925,-0.024391641544557 51.836678011519,-0.024386208097079 51.8367212266961,-0.024342690041741 51.8367069893255,-0.024319288279565 51.8366997621065,-0.024292175040315 51.8366914384262,-0.024256993656756 51.8366804074215,-0.024209378619192 51.8366665867489,-0.024155315566661 51.8366508502892,-0.024105536982117 51.8366373798885,-0.024053478095738 51.8366236103787,-0.024004689328762 51.8366107410593,-0.023970131471096 51.836601716714,-0.023909047139602 51.8365858891387,-0.023872509708533 51.8365763279375,-0.023813366264893 51.8365609285831,-0.023800608538063 51.8365575849119,-0.023784986418606 51.8365536895414,-0.023742454842103 51.8365431193584,-0.023711439553104 51.8365354043849,-0.023694010323261 51.8365310110389,-0.023665520720037 51.8365239589599,-0.023639661080915 51.8365175085891,-0.023621685803881 51.8365129891569,-0.023606995500241 51.8365093612042,-0.023594058239185 51.836506140384,-0.023579988098348 51.8365026037803,-0.023576657648261 51.836501756494,-0.023573212341634 51.8365008803008,-0.023570943037101 51.8365003296066,-0.023566131987943 51.8364991517017,-0.023560820359479 51.836497803524,-0.023550842839 51.836494614457,-0.023534152361942 
51.8364892623447,-0.023519322307527 51.8364845169988,-0.02348978408578 51.8364729061891,-0.023475831119834 51.8364673572881,-0.023454784263057 51.8364590634327,-0.023440395773423 51.8364535161996,-0.023420399054051 51.8364457705282,-0.023412745105461 51.8364422248457,-0.023381943124318 51.8364279850281,-0.023357997010233 51.8364146337661,-0.023352306254349 51.8364113368901,-0.023317752798112 51.8363896244224,-0.023286469209969 51.836365503042,-0.023270902824062 51.8363503952582,-0.023271184375594 51.8363476124069,-0.023383603050238 51.836110354259,-0.023465777469187 51.8359369179285,-0.023472651192556 51.8358651407666,-0.023528809877392 51.8357004122102,-0.023529686057616 51.835690679384,-0.023538063914411 51.83565382625,-0.023543689808186 51.835629102312,-0.023572573948308 51.8355021681838,-0.023577786720198 51.8354792537265,-0.023618788068211 51.8352990106071,-0.023627360729621 51.8352613514407,-0.023661334277455 51.8351119863322,-0.023669026973139 51.8350781790229,-0.023688225716162 51.8349937501145,-0.023717627975541 51.8348645765991,-0.023722804894815 51.8348418143991,-0.023756486300809 51.8346938021715,-0.023763907778347 51.8346612132372,-0.023790901015912 51.834542628006,-0.023795341167441 51.8345230996081,-0.023800489727373 51.8345105970367,-0.023838123662149 51.834419158237,-0.023854300226396 51.8343798733486,-0.023881261233424 51.8343143955334,-0.02360425553657 51.8343556996006,-0.023715342137444 51.8343230723896,-0.023748777209506 51.8343116206823,-0.023788430888355 51.8342980344079,-0.023870547343144 51.8342676988052,-0.023942069257488 51.8342416452568,-0.023977465644738 51.8342284909458,-0.024007122432287 51.8342174702501,-0.024015599076738 51.8342144833934,-0.024058719278701 51.8341992917174,-0.024087038370323 51.834189318598,-0.024105922723606 51.8341826669339,-0.024107853809983 51.8341819889956,-0.024165165953312 51.8341609030475,-0.024196280263318 51.834154052201,-0.024216549190068 51.834149590904,-0.02430529023458 51.8341300490434,-0.024364346544379 51.8341175259067,-0.024380183743731 51.8341141681084,-0.024434405357854 51.8341026697566,-0.024448191552983 51.8340997450887,-0.02450384848189 51.8340892779444,-0.024534730584179 51.8340827558198,-0.024571252512039 51.834075051532,-0.024617624661644 51.8340664695984,-0.02465593398552 51.8340593798021,-0.024659169226995 51.8340587507378,-0.024698227325573 51.8340511339728,-0.024715633647968 51.8340477395406,-0.024790094189827 51.8340371025226,-0.024838827806997 51.8340302776774,-0.024900748436463 51.8340216150771,-0.024978773610225 51.8340117751699,-0.025070395982381 51.8340016689899,-0.025162402800403 51.833992063766,-0.02519959947995 51.8339881743187,-0.025258434658431 51.8339820224963,-0.025299537926511 51.8339784144308,-0.025322340623602 51.8339764143945,-0.025367654362725 51.8339724453802,-0.025433566943196 51.833966754013,-0.02546940835365 51.8339636510231,-0.025478048048634 51.8339629058588,-0.025545930176343 51.8339589740019,-0.025596525520343 51.8339560377582,-0.025600745391944 51.8339557938799,-0.025724272639777 51.833947958345,-0.025744834562549 51.8339467478911,-0.025819588005694 51.8339423467392,-0.025865426412415 51.8339396542424,-0.025940870479972 51.8339357232096,-0.026005709012821 51.8339323514664,-0.026011573315888 51.8339320092857,-0.026097210950677 51.833926981359,-0.026163087127967 51.8339231143903,-0.026182935984022 51.8339219458446,-0.02626063552581 51.8339182143234,-0.026381722979675 51.8339123963649,-0.026433135483049 51.8339100220004,-0.026570879650513 51.833903674085,-0.026606068864626 
51.8339025380898,-0.026689092338078 51.8338998667835,-0.026732668330249 51.8338984578396,-0.02679766923455 51.8338973454281),(-0.026648986091666 51.8339075655333,-0.026350862668855 51.833928055419,-0.026484565304541 51.8339188701762,-0.026564475458192 51.8339133771113,-0.026648986091666 51.8339075655333),(-0.026648986091666 51.8339075655333,-0.026649000595502 51.8339075657766,-0.02679766923455 51.8338973454281,-0.026648986091666 51.8339075655333),(-0.024985527558019 51.8340129946412,-0.024837331236428 51.8340326354761,-0.024948153517607 51.834017951115,-0.024985527558019 51.8340129946412),(-0.025153445104997 51.8339934420137,-0.025006535825553 51.8340102091839,-0.025133197780121 51.8339957546967,-0.025153445104997 51.8339934420137))"
Could this be a winding order problem?
For reference my source data is an EPSG:27700 shapefile, and I converted it to EPSG:4326 WKT CSV using ogr2ogr.
This is a problem with polygon orientation, for details see: https://cloud.google.com/bigquery/docs/gis-data#polygon_orientation
When loading WKT geography data, BigQuery assumes the polygons are oriented as described in the link above (to allow loading polygons larger than a hemisphere).
ST_GeogFromText has a different default, and that allows it to consume this data. If you pass the second parameter oriented => TRUE to the ST_GeogFromText function, you get the same result as the loader.
The workaround is to either load the data as STRING and then convert it to GEOGRAPHY using ST_GeogFromText, or load using the GeoJSON format instead of WKT. GeoJSON is a planar map format, so there is no polygon orientation ambiguity.
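As a rough illustration of the load-as-STRING workaround (the dataset, table, and column names here are made up), the conversion after loading would look something like:
-- raw_polygons holds the CSV rows, with the WKT loaded into a STRING column named geom_wkt
CREATE OR REPLACE TABLE my_dataset.polygons AS
SELECT ST_GeogFromText(geom_wkt) AS geom  -- default (non-oriented) parsing, which accepts this data
FROM my_dataset.raw_polygons;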
You can make ogr2ogr produce CSV with GeoJSON using something like
ogr2ogr -f csv -dialect sqlite \
  -sql "select AsGeoJSON(ST_Transform(geometry, 4326)) geom, * from shp" \
  out.csv ./input/
See also https://medium.com/@mentin/loading-large-spatial-features-to-bigquery-geography-2f6ceb6796df

Ignore large or corrupt records when loading files with pig using PigStorage

I am seeing the following error when loading a large file using Pig.
java.io.IOException: Too many bytes before newline: 2147483971
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:251)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:176)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:94)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:123)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initialize(PigRecordReader.java:181)
at org.apache.tez.mapreduce.lib.MRReaderMapReduce.setupNewRecordReader(MRReaderMapReduce.java:157)
at org.apache.tez.mapreduce.lib.MRReaderMapReduce.setSplit(MRReaderMapReduce.java:88)
at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:703)
at org.apache.tez.mapreduce.input.MRInput.processSplitEvent(MRInput.java:631)
at org.apache.tez.mapreduce.input.MRInput.handleEvents(MRInput.java:590)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.handleEvent(LogicalIOProcessorRuntimeTask.java:732)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.access$600(LogicalIOProcessorRuntimeTask.java:106)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$1.runInternal(LogicalIOProcessorRuntimeTask.java:809)
at org.apache.tez.common.RunnableWithNdc.run(RunnableWithNdc.java:35)
at java.lang.Thread.run(Thread.java:748)
The command I am using is as follows:
A = LOAD 'file1.data' USING PigStorage('\u0001') AS (
id:long,
source:chararray
);
Is there any option that can be passed here to drop the record that is causing the issue and continue?
You can skip a certain number of records by using the following setting at the top of your Pig script:
set mapred.skip.map.max.skip.records 1000;
The documentation for this property reads: "The number of acceptable skip records surrounding the bad record PER bad record in mapper. The number includes the bad record as well. To turn the feature of detection/skipping of bad records off, set the value to 0. The framework tries to narrow down the skipped range by retrying until this threshold is met OR all attempts get exhausted for this task. Set the value to Long.MAX_VALUE to indicate that framework need not try to narrow down. Whatever records (depends on application) get skipped are acceptable."

CSV file input not working together with set field value step in Pentaho Kettle

I have a very simple Pentaho Kettle transformation that causes a strange error. It consists of reading a field X from a CSV file, adding a field Y, setting Y=X, and finally writing it back to another CSV file.
Here you can see the steps and the configuration for them:
You can also download the ktr file from here. The input data is just this:
1
2
3
When I run this transformation, I get this error message:
ERROR (version 5.4.0.1-130, build 1 from 2015-06-14_12-34-55 by buildguy) : Unexpected error
ERROR (version 5.4.0.1-130, build 1 from 2015-06-14_12-34-55 by buildguy) : org.pentaho.di.core.exception.KettleStepException:
Error writing line
Error writing field content to file
Y Number : There was a data type error: the data type of [B object [[B@b4136a] does not correspond to value meta [Number]
at org.pentaho.di.trans.steps.textfileoutput.TextFileOutput.writeRowToFile(TextFileOutput.java:273)
at org.pentaho.di.trans.steps.textfileoutput.TextFileOutput.processRow(TextFileOutput.java:195)
at org.pentaho.di.trans.step.RunThread.run(RunThread.java:62)
at java.lang.Thread.run(Unknown Source)
Caused by: org.pentaho.di.core.exception.KettleStepException:
Error writing field content to file
Y Number : There was a data type error: the data type of [B object [[B@b4136a] does not correspond to value meta [Number]
at org.pentaho.di.trans.steps.textfileoutput.TextFileOutput.writeField(TextFileOutput.java:435)
at org.pentaho.di.trans.steps.textfileoutput.TextFileOutput.writeRowToFile(TextFileOutput.java:249)
... 3 more
Caused by: org.pentaho.di.core.exception.KettleValueException:
Y Number : There was a data type error: the data type of [B object [[B@b4136a] does not correspond to value meta [Number]
at org.pentaho.di.core.row.value.ValueMetaBase.getBinaryString(ValueMetaBase.java:2185)
at org.pentaho.di.trans.steps.textfileoutput.TextFileOutput.formatField(TextFileOutput.java:290)
at org.pentaho.di.trans.steps.textfileoutput.TextFileOutput.writeField(TextFileOutput.java:392)
... 4 more
All of the above lines start with 2015/09/23 12:51:18 - Text file output.0 -, but I edited it out for brevity. I think the relevant, and confusing, part of the error message is this:
Y Number : There was a data type error: the data type of [B object [[B@b4136a] does not correspond to value meta [Number]
Some further notes:
If I bypass the set field value step by using the lower hop instead, the transformation finishes without errors. This leads me to believe that it is the set field value step that causes the problem.
If I replace the CSV file input with a data frame with the same data (1,2,3), everything works just fine.
If I replace the file output step with a dummy, the transformation finishes without errors. However, if I preview the dummy, it causes a similar error and the field Y has the value <null> on all three rows.
Before I created this MCVE, I got the error on all sorts of seemingly random steps, even when there was no file output present, so I do not think this is related to the file output.
If I change the format from Number to Integer, nothing changes. But if I change it to String, the transformation finishes without errors, and I get this output:
X;Y
1;[B@49e96951
2;[B@7b016abf
3;[B@1a0760b0
Is this a bug? Am I doing something wrong? How can I make this work?
It's because of lazy conversion. Turn it off. This is behaving exactly as designed - although admittedly the error and user experience could be improved.
Lazy conversion must not be used when you need to access the field value in your transformation, and that is exactly what the set field value step does. The default should probably be off rather than on.
If your field is going directly to a database, then use it and it will be faster.
You can even have "partially lazy" streams, where you use lazy conversion for speed, but then use a select values step to "un-lazify" the fields you want to access, while the remainder stay lazy.
Cunning huh?

File: 0: Unexpected from Google BigQuery load job

I have a compressed JSON file (900 MB, newline delimited), and when I load it into a new table via the bq command I get this load failure:
e.g.
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values mtdataset.mytable gs://xxx/data.gz schema.json
Waiting on bqjob_r3ec270ec14181ca7_000001461d860737_1 ... (1049s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r3ec270ec14181ca7_000001461d860737_1': Too many errors encountered. Limit is: 0.
Failure details:
- File: 0: Unexpected. Please try again.
Why the error?
I tried again with --max_bad_records, and still got no useful error message:
bq load --project_id=XXX --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values --max_bad_records 2 XXX.test23 gs://XXX/20140521/file1.gz schema.json
Waiting on bqjob_r518616022f1db99d_000001461f023f58_1 ... (319s) Current status: DONE
BigQuery error in load operation: Error processing job 'XXX:bqjob_r518616022f1db99d_000001461f023f58_1': Unexpected. Please try again.
I also cannot find any useful message in the console.
To the BigQuery team: can you have a look using the job ID?
As far as I know, there are two error sections on a job. There is one error result, and that's what you see now. And there is a second, which should be a stream of errors. This second one is important, as you could have errors in it while the actual job still succeeds.
Also, you can set --max_bad_records=3 on the bq tool. Check here for more params: https://developers.google.com/bigquery/bq-command-line-tool
You probably have an error that occurs on each line, so you should try a sample set from this big file first.
Also, there is an open feature request to improve the error message; you can star (vote for) this ticket: https://code.google.com/p/google-bigquery-tools/issues/detail?id=13
This answer will be picked up by the BQ team, so for them I am sharing this: we need an endpoint where we can query, based on a job ID, the state or the stream of errors. It would help a lot to get a full list of errors; it would help with debugging BQ jobs. This could be easy to implement.
I looked up this job in the BigQuery logs, and unfortunately, there isn't any more information than "failed to read" somewhere after about 930 MB have been read.
I've filed a bug that we're dropping important error information in one code path and submitted a fix. However, this fix won't be live until next week, and all that will do is give us more diagnostic information.
Since this is repeatable, it isn't likely a transient error reading from GCS. That means one of two problems: we have trouble decoding the .gz file, or there is something wrong with that particular GCS object.
For the first issue, you could try decompressing the file and re-uploading it as uncompressed. While it may sound like a pain to send gigabytes of data over the network, the good news is that the import will be faster since it can be done in parallel (we can't import a compressed file in parallel since it can only be read sequentially).
For the second issue (which is somewhat less likely) you could try downloading the file yourself to make sure you don't get errors, or try re-uploading the same file and seeing if that works.

Bigquery : Unexpected. Please try again when loading a 53GB CSV/ 1.4GB gZIP

I was trying to load 1.4 GB of gzipped data into my BigQuery table and I am consistently getting the error "Unexpected. Please try again".
job_7f1aa8d29ae641459c82243530eb1c65
I was trying to load a structure with the columns Row ID,Order Priority,Discount,Unit Price,Shipping Cost,Customer ID,Customer Name,Ship Mode,Product Category,Product Sub-Category,Product Base Margin,Region,State or Province,City,Postal Code,Order Date,Ship Date,Profit,Quantity ordered new,Sales,Order ID.
The error is not clear on what's going wrong.
Has anyone else encountered this error?
Thanks.
It looks like your job ran out of time: a 53 GB CSV file is a lot to process in one thread. Best practice is to either split your data into multiple chunks or upload uncompressed data, which can be processed in parallel.
I'm in the process of raising the allowed time somewhat, and we'll work on improving the error message when this happens.