NiFi header index out of range - schema

I'm converting a pipe-delimited text file to Parquet. To do so I'm using the ConvertRecord processor.
Reader: CSVReader
Writer: ParquetRecordSetWriter
Schema strategy: Infer Schema in the reader and Inherit Record Schema in the writer.
But I'm getting an error in the ConvertRecord processor. It says that the index for header "ColumnName" is 24 but the record has only 24 values.

Based on the input provided, it looks like you have a line with 25 values because of an improperly delimited set of values: "on index 24, but there are only 24 values" means the reader expects a value at position 25 when counting from an offset of 0.
To debug this, if it's a really big CSV file, you can chain together SplitRecord and ValidateRecord to try to catch the line that has this problem.
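If you'd rather locate the bad line outside NiFi, a quick script along these lines can do it. This is only a sketch; the file name and the pipe delimiter are assumptions based on the question.
# Sketch: report lines whose field count differs from the header's.
# "input.txt" and the pipe delimiter are assumptions from the question.
with open("input.txt", encoding="utf-8") as f:
    header = next(f).rstrip("\n").split("|")
    expected = len(header)
    for lineno, line in enumerate(f, start=2):
        count = len(line.rstrip("\n").split("|"))
        if count != expected:
            print(f"line {lineno}: expected {expected} fields, found {count}")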

Related

How do I define the record structure of an EBCDIC file?

I have an EBCDIC file in HDFS. I want to load the data into a Spark DataFrame, process it, and write the results as ORC files. I found that there is an open-source solution, Cobrix, that allows reading data from EBCDIC files, but the developer must provide a copybook file, which is the schema definition.
A few lines of my EBCDIC file are shown in the attached image.
I want to work out the copybook format for this EBCDIC file: essentially I want to read vin, whose length is 17, vin_data, whose length is 3, and finally vin_val, whose length is 100.
How do I define a copybook file for EBCDIC data?
You don't.
A copybook may be used as a record definition (i.e., how the data is laid out); it has nothing to do with the encoding of the data that may be stored in it.
This leaves the question "How do I define the record structure?"
You'd need the number of fields, their lengths and types (it likely is not only USAGE DISPLAY), and then just define them with some fancy names. Ideally you just get the original record definition from the COBOL program writing the file, put that into a copybook if it isn't in one yet, and use that.
Your link has samples that show what a copybook actually looks like. If you struggle with the definition, then please edit your question with the copybook you've defined and we may be able to help.
Based on your comment in the question, and looking at the input file, you could start with this.
01 VIN-RECORD.
   05 VIN        PIC X(17).
   05 VIN-COUNT  PIC S9(5) COMP-3.
   05 VIN-VALUE  PIC X(100).
I'm guessing that the second field is COMP-3 based on the six examples all ending with a C byte. This indicates a positive COMP-3 value. A D byte would be a negative COMP-3 value. An F byte would indicate an unsigned COMP-3 value.
The third field is variable length and right padded with spaces.
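To make the sign-nibble convention above concrete, here is a minimal sketch of decoding a packed-decimal (COMP-3) field in Python; it is illustrative only and not part of Cobrix.
# Decode an IBM packed-decimal (COMP-3) field such as the 3-byte
# S9(5) COMP-3 VIN-COUNT above. Each byte holds two decimal nibbles;
# the final nibble is the sign (C = positive, D = negative, F = unsigned).
def unpack_comp3(raw: bytes) -> int:
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    *digits, sign = nibbles
    value = int("".join(str(d) for d in digits))
    return -value if sign == 0xD else value

# Example: bytes 00 12 3C decode to +123, bytes 00 12 3D to -123.
assert unpack_comp3(bytes([0x00, 0x12, 0x3C])) == 123
assert unpack_comp3(bytes([0x00, 0x12, 0x3D])) == -123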

'Missing close double quote (") character' error when there are line feeds in a CSV file being loaded into BigQuery

The culprit line is as follows. It should be composed of 14 columns, with one of the columns, starting with 'Hi I'm Niger...', spanning multiple lines with line feeds.
17935,9a7105ee-30c8-4a6d-9374-10875b7d6288.jpg,"""top""=>""0"", ""left""=>""0"", ""width""=>""180"", ""height""=>""180""",,"",2015-07-26 19:33:57.292058,2015-07-26 20:25:30.068887,fe43876f-1b2c-464a-aa20-bf335ed3ff62,c68c8c70-bc2b-11e4-90a1-22000b21105f,{},2e790350-15fb-0133-2cb8-22000ba51078,"Hi I'm Nigerian so wish to study in sweden.
so I'm Undergraduate student I want study Engineering.
Thanks.","",{}
When loading this CSV data into BigQuery via the command bq load --replace --source_format=CSV -F"," ..., the errors below are raised. Could anyone give me a solution for this BigQuery load command?
- File: 0 / Line:17192 / Field:12: Missing close double quote (")
character: field starts with: <Hi I'm N>
- File: 0 / Line:17193: Too few columns: expected 14 column(s) but
got 1 column(s). For additional help: http://goo.gl/RWuPQ
- File: 0 / Line:17194: Too few columns: expected 14 column(s) but
got 3 column(s). For additional help: http://goo.gl/RWuPQ
If you are loading CSV with embedded newlines, you need to specify allowQuotedNewlines.
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load.allowQuotedNewlines
The BigQuery default is to assume that CSV data does not contain newlines. This allows for a much higher parsing throughput when dealing with large data files since the input files can be split at arbitrary newlines. If your data contains newlines within strings, each file needs to be parsed linearly by a single machine.
Make sure you include this line before loading data to BigQuery: 'job_config.allow_quoted_newlines = True'
job_config = bigquery.LoadJobConfig()
job_config.allow_quoted_newlines = True
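For context, here is a minimal sketch of a complete load job using the Python client library; the table ID, the file name, and the header-row assumption are placeholders rather than details from the question.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,           # assumes the file has a header row
    allow_quoted_newlines=True,    # lets quoted fields span multiple lines
    autodetect=True,
)

# "my_dataset.my_table" and "data.csv" are placeholders.
with open("data.csv", "rb") as f:
    load_job = client.load_table_from_file(f, "my_dataset.my_table", job_config=job_config)
load_job.result()  # wait for the load to finish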
If you are trying to load a CSV file into a table from the BigQuery Google console, make sure you select the advanced option 'Quoted newlines'.

'Bad repeat count' while inputting a file, FORTRAN

I am trying to read a file into my code.
There are two subroutines, one which writes a file and the other which reads it.
The writing part was:
write(*,*)'entered refile, shall make file'
ileunitA=int(presentstep)
write(fname,1012)ileunitA
1012 format('DATA_',i6.6,'.dat')
write(fnam,1112)index
1112 format('pp',i3.3)
open(UNIT=ileunitA,FILE=fname)
!variables from module global
write(ileunita,*)u,v,w,pc,p,p0,rho1,gam,con
write(ileunita,*)aip,aim,ajp,ajm,akp,akm,ap,ap0
write(ileunita,*) scon,smomu,smomv,smomw
...
The reading part was as follows (in another subroutine):
ileunita=25;
open(unit=ILEUNITA,file='DATA_010500.dat')
!variables from module global
read(ileunita,*)u,v,w,pc,p,p0,rho1,gam,con
read(ileunita,*)aip,aim,ajp,ajm,akp,akm,ap,ap0
read(ileunita,*) scon,smomu,smomv,smomw
...
When I run the code, it shows the following error:
At line 3682 of file bub2.f90 (unit = 25, file = 'DATA_000001.dat')
Fortran runtime error: Bad repeat count in item 1 of list input
Can anyone help me figure out what could be the problem? And what is a 'repeat count'? What is a 'bad' repeat count? Thanks.
Guessing a little (you could show the text of the problematic line in your question...), but you are using list-directed input (and output), with * as the second specifier in the read (and write) statements. List-directed input allows multiple fields that have the same value to be represented using the syntax r*c, where r is a numeric repeat count and c is the value to be repeated.
If any of your output items generate a field that contains a *, then that could be confusing the processing of the input.
(It is permissible (though rare) for a processor to represent multiple output fields that have the same value using a repeat count; for example, WRITE (unit,*) 23, 23, 23, 23 could result in an input file that contains the text 4*23.)
List-directed input also has some other features, such as the handling of delimiter characters, the / character causing input processing to terminate, and the possibility and handling of null values. Some of these features may surprise those not familiar with the rules (which are inspired by typical shortcuts taken when input was submitted via punched cards), which is why it is often better to avoid list-directed input and output and use an explicit format instead.
If any of your data fields are of type character, you should consider using a non-default DELIM mode to prevent any special characters within the character variable values from confusing the input processing.
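If you want to check whether that is what happened, a quick scan of the written file for tokens containing characters that are special to list-directed input can help. This is only a sketch; the file name is taken from the runtime error message above.
# Sketch: flag tokens containing characters that list-directed READ treats
# specially ('*' introduces a repeat count, '/' terminates input).
# The file name comes from the error message in the question.
with open("DATA_000001.dat") as f:
    for lineno, line in enumerate(f, start=1):
        for token in line.split():
            if "*" in token or "/" in token:
                print(f"line {lineno}: suspicious token {token!r}")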

SSIS Bulk Insert where fields contain commas?

My bulk insert in SSIS is failing when a field contains a comma character. My flat file source is tab delimited and there are many instances in which a text field will contain commas. For example, a UserComment may have a comma. This causes the bulk insert to fail.
How can I tell SSIS to ignore the commas? I thought it would happen automatically since the row delimiter is {CR}{LF} and the column delimiter is "Tab". Why does it bark at the comma? Also please note that I am NOT currently using a format file.
Thanks in advance.
UPDATE:
Here is the error I get in SSIS:
Error: 0xC002F304 at Bulk Insert Task, Bulk Insert Task: An error occurred with the following error message: "Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 183, column 5 (EmailAddress).Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 182, column 5 (EmailAddress).Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 181, column 5 (EmailAddress).".
Task failed: Bulk Insert Task
It seems to fail on record 131988 which is why I think it's because of the "something,something" email with no space. Many records before 131988 come across fine.
131988 01 MEMPHIS, TN someone#somewhere.com
131988 02 NORTH LITTLE ROCK, AR someone#somewhere.com,someone1#somewhere1.com
131988 03 HOUSTON, TX someone#somewhere.com,someone1#somewhere1.com
I doubt the comma or the # sign is being called an "invalid character".
I see there are two tabs in the input record just before the field that contains the email addresses, so that email address column would be the fifth column. But when the error message refers to "column 5" it's presumably using zero-based indexing, so the email column is only index 4. Is there another tab and another column after it? Maybe the invalid character is there.
I suspect there is an invisible bad character embedded in whatever column is causing the error. I often pick up bad characters when cutting and pasting out of email address lines, so that's a likely suspect.
Run the failing line by itself to make sure it still fails.
Then copy it into, say, Notepad, and do a "Save As" with the Encoding set to ANSI. (It may complain at that point if there's a bad character.) Use the "Save As" file as the new import file. At this point you should be able to be reasonably confident that "what you see is what you get", and that there are no invisible characters embedded in the import file.
If this turns out to be the problem, you'll need some way to verify that future import files are clean, or else handle them somehow during the import process.
(I presume you've checked the destination column length is okay. That would definitely be a showstopper.)
"Type mismatch or invalid character for the specified codepage" is a misleading error message. The source table's field length exceeded the destination table's specified length and thus the error. After adjusting lengths, everything worked properly.

How to import a flat file source into a database using SQL

I currently want to import my data from a flat file into the database.
The flat file is a txt file. In that txt file, I save a list of URLs. Example:
http://www.mimi.com/Hotels-g303188-Rurrenabaque-Hotels.html
I'm using the SQL Server Import and Export Wizard to do it, but at execution time it fails with an error saying:
Error 0xc02020a1:
Data Flow Task 1: Data conversion failed. The data conversion for column
"Column 0" returned status value 4 and status text "Text was truncated or one
or more characters had no match in the target code page.".
Can anyone help?
You get this error because the text is too long for the column you've chosen to put it in.
Text was truncated or
You might want to check the size of the database column vis-a-vis your input data. Is the longest URL shorter than the column width?
one or more characters had no match in the target code page.".
Check if your input file has any special characters. An easy way to check this would be to save your file as ANSI (Notepad > Save As > Encoding = ANSI). Note: you'd still have to select the right code page so that the import interprets your input text correctly (a quick way to check both points is sketched below).
Here's a very nice link that has some background on what code pages are - http://www.joelonsoftware.com/articles/Unicode.html
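To check both points above before re-running the wizard, a small script like this can report the longest URL and any characters that don't map to the target code page. The file name and the Windows-1252 code page are assumptions, not details from the question.
# Sketch: report the longest URL (compare it with the destination column width)
# and any characters that can't be encoded in the assumed target code page.
# "urls.txt" and cp1252 (Windows-1252) are assumptions.
longest = 0
with open("urls.txt", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        url = line.rstrip("\n")
        longest = max(longest, len(url))
        try:
            url.encode("cp1252")
        except UnicodeEncodeError as err:
            print(f"line {lineno}: character {err.object[err.start:err.end]!r} has no match in cp1252")
print(f"longest URL is {longest} characters")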
Note you can also change the target column data type (to text stream for example) in the Datasource->Advanced section