I have a Pentaho transformation which reads a text file and checks some conditions (these checks can produce errors, e.g. a number that is supposed to be positive). From these errors I'm creating an Excel file, and in my job I need the number of lines in this error file, plus a log of which lines had problems.
The problem is that sometimes I get the error "the return values id can't be found in the input row".
The error does not happen every time. The job runs every night, and it can work without any problems for a month, and then one day it just fails with this error.
I don't think it comes from the file, because if I execute the job again with the same file it works. I can't understand what the reason for the failure is, because the message mentions the value "id", but I don't have such a value/column. Why is it looking for a value that doesn't exist?
Another strange thing is that the step which fails shouldn't be executed at all in this case (as far as I know), because no errors were found, so no rows reach that step.
Maybe the problem is connected to the "Prioritize Stream" step? That's where I collect all the errors (which use exactly the same columns). I tried putting a sort before the grouping steps, but it didn't help. Now I'm thinking of trying a "Blocking step".
The problem is that I don't know why this happens or how to fix it. Any suggestions?
Check if all your aggregates in the Group by step have a name.
However, sometimes the error comes from a previous step: the group (count...) requests data from the Prioritize Stream, and if that step has an error, the error gets reported mistakenly as coming from the group rather than from the Prioritize Stream.
Also, you mention a step which should not be executed because there is no data: I do not see any Filter which would prevent rows with a missing id from flowing from the Prioritize Stream to the count.
This is a bug. It happens randomly in one of my transformations, which often ends up with an empty stream (no rows). It mostly works, but once in a while it gives this error. It seems to fail only when the stream is empty, though.
I enriched a public dataset of Reddit comments with data from LIWC (Linguistic Inquiry and Word Count). I have 60 files of roughly 600 MB each. The idea is now to upload them to BigQuery, bring them together, and analyze the results. Alas, I ran into some problems.
For a first test I had a sample with 200 rows and 114 columns. Here is a link to the CSV I used.
I first asked on Reddit, and fhoffa provided a really good answer. The problem seems to be the newlines (\n) in the body_raw column, as redditors often include them in their text. It seems BigQuery cannot process them.
I tried to load the original data, which I had transferred to storage, back into BigQuery, unedited and untouched, but I got the same problem. BigQuery cannot even process the original data, which came from BigQuery...?
Anyway, I can open the CSV without problems in other programs such as R, which means the CSV itself is not damaged and the schema is not inconsistent. So fhoffa's command should take care of it.
bq load --allow_quoted_newlines --allow_jagged_rows --skip_leading_rows=1 tt.delete_201607a myproject.newtablename gs://my_testbucket/dat2.csv body_raw,score_hidden,archived,name,author,author_flair_text,downs,created_utc,subreddit_id,link_id,parent_id,score,retrieved_on,controversiality,gilded,id,subreddit,ups,distinguished,author_flair_css_class,WC,Analytic,Clout,Authentic,Tone,WPS,Sixltr,Dic,function,pronoun,ppron,i,we,you,shehe,they,ipron,article,prep,auxverb,adverb,conj,negate,verb,adj,compare,interrog,number,quant,affect,posemo,negemo,anx,anger,sad,social,family,friend,female,male,cogproc,insight,cause,discrep,tentat,certain,differ,percept,see,hear,feel,bio,body,health,sexual,ingest,drives,affiliation,achieve,power,reward,risk,focuspast,focuspresent,focusfuture,relativ,motion,space,time,work,leisure,home,money,relig,death,informal,swear,netspeak,assent,nonflu,filler,AllPunc,Period,Comma,Colon,SemiC,QMark,Exclam,Dash,Quote,Apostro,Parenth,OtherP
The output was:
Too many positional args, still have ['body_raw,score_h...]
If I take away "tt.delete_201607a" from the command, I get the same error message I have seen often by now:
BigQuery error in load operation: Error processing job 'xx': Too many errors encountered.
So I do not know what to do here. Should I get rid of the \n characters with Python? That would probably take days (although I'm not sure, I am not a programmer), as my complete data set is around 55 million rows.
Or do you have any other ideas?
I checked again, and I was able to load the file you left on dropbox without a problem.
First I made sure to download your original file:
wget https://www.dropbox.com/s/5eqrit7mx9sp3vh/dat2.csv?dl=0
Then I ran the following command:
bq load --allow_quoted_newlines --allow_jagged_rows --skip_leading_rows=1 \
tt.delete_201607b dat2.csv\?dl\=0 \
body_raw,score_hidden,archived,name,author,author_flair_text,downs,created_utc,subreddit_id,link_id,parent_id,score,retrieved_on,controversiality,gilded,id,subreddit,ups,distinguished,author_flair_css_class,WC,Analytic,Clout,Authentic,Tone,WPS,Sixltr,Dic,function,pronoun,ppron,i,we,you,shehe,they,ipron,article,prep,auxverb,adverb,conj,negate,verb,adj,compare,interrog,number,quant,affect,posemo,negemo,anx,anger,sad,social,family,friend,female,male,cogproc,insight,cause,discrep,tentat,certain,differ,percept,see,hear,feel,bio,body,health,sexual,ingest,drives,affiliation,achieve,power,reward,risk,focuspast,focuspresent,focusfuture,relativ,motion,space,time,work,leisure,home,money,relig,death,informal,swear,netspeak,assent,nonflu,filler,AllPunc,Period,Comma,Colon,SemiC,QMark,Exclam,Dash,Quote,Apostro,Parenth,OtherP,oops
As mentioned on Reddit, you need the following options:
--allow_quoted_newlines: There are newlines inside some strings, hence the CSV is not strictly newline delimited.
--allow_jagged_rows: Not every row has the same number of columns.
,oops: There is an extra column in some rows. I added this column to the list of columns.
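If at some point you want to script this instead of calling the CLI (you mentioned Python), the same two options are exposed in the BigQuery Python client. This is only a minimal sketch, not something I have run against your full dataset; the bucket path and destination table are placeholders taken from your command, and I declare every column as STRING for simplicity:

from google.cloud import bigquery

client = bigquery.Client()

# The full column list from the bq load command, comma separated
# (truncated here; paste the complete list you already have).
columns = "body_raw,score_hidden,archived,name,author".split(",")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    allow_quoted_newlines=True,  # newlines inside quoted body_raw fields
    allow_jagged_rows=True,      # rows with missing trailing columns
    schema=[bigquery.SchemaField(c, "STRING") for c in columns],
)

load_job = client.load_table_from_uri(
    "gs://my_testbucket/dat2.csv",   # your Cloud Storage path
    "myproject.tt.delete_201607b",   # one destination table: project.dataset.table
    job_config=job_config,
)
load_job.result()  # waits for the job and raises if the load fails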
When it says "too many positional arguments", it's because your command says:
tt.delete_201607a myproject.newtablename
Well, tt.delete_201607a is how I named my table. myproject.newtablename is how you named your table. Choose one, not both.
Are you sure you are not able to load the sample file you left on dropbox? Or are you getting errors from rows I can't find in that file?
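P.S. If you still prefer to strip the newlines with Python rather than rely on --allow_quoted_newlines, it should not take days: Python's csv module streams the file and understands newlines inside quoted fields, so it is a single pass per file. A rough sketch (file names are placeholders, and it assumes the input is well-formed quoted CSV):

import csv
import sys

csv.field_size_limit(sys.maxsize)  # some reddit comments are very long fields

with open("dat2.csv", newline="") as src, open("dat2_flat.csv", "w", newline="") as dst:
    reader = csv.reader(src)   # the csv module handles newlines inside quoted fields
    writer = csv.writer(dst)
    for row in reader:
        # replace any embedded newlines in each field with a space
        writer.writerow([field.replace("\r", " ").replace("\n", " ") for field in row])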