Is there a way to apply a NEW Avro schema to an existing schema in NiFi without inferring order?

I am using NiFi to load CSVs, apply a NEW schema, and load them into a SQL db. Currently I am writing an Avro schema and applying it to each CSV. I am writing the schema based on the order of the incoming CSV: the first field = the first column in the CSV. Is there a way to map one schema to another based on column name? I.e., can I say 'csv.name -> sql.username'?
I know this can be done manually before uploading the CSVs; I am wondering if there is a way within NiFi to map a schema to data based on the data's current schema, not knowing the order of the current schema, just the fields.
I have read about RecordPaths and UpdateRecord. I am looking for something that matches the whole incoming schema to a new schema, not based on order.
Avro Schema Settings:
PutDatabaseRecord settings

As I see it, you have two options:
Option 1 (the better one):
Add a header line to your records and set Treat First Line as Header to True in your CSVReader
Option 2:
Set Schema Access Strategy in your CSVReader to Infer Schema (available since NiFi 1.9.0)
The first one can guarantee a correct mapping of your fields and their types.
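To make the name-based mapping concrete, here is a minimal sketch outside NiFi (the schema, column names, and sample rows below are made up for illustration): once the CSV carries a header and the reader knows the schema, fields are matched by name, so the column order in the file stops mattering.

import csv, io

# Hypothetical Avro schema, written as a Python dict for readability.
# With Treat First Line as Header = True, the CSVReader matches CSV
# columns to these fields by NAME, so column order no longer matters.
avro_schema = {
    "type": "record",
    "name": "users",
    "fields": [
        {"name": "username", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
        {"name": "age", "type": ["null", "string"], "default": None},
    ],
}

# Two CSVs with the same columns in different orders.
csv_a = "username,email,age\nalice,a@x.com,30\n"
csv_b = "age,username,email\n30,alice,a@x.com\n"

field_names = [f["name"] for f in avro_schema["fields"]]
for raw in (csv_a, csv_b):
    row = next(csv.DictReader(io.StringIO(raw)))              # header -> dict keyed by name
    record = {name: row.get(name) for name in field_names}    # name-based mapping
    print(record)                                             # identical result for both column orders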

Related

How to get an array from JSON in Azure Data Factory?

My actual (not properly working) setup has two pipelines:
Get API data to lake: for each row in a metadata table in SQL, call the REST API and copy the reply (JSON files) to the Blob data lake.
Copy data from the lake to SQL: for each file, auto-create a table in SQL.
The result is the correct number of tables in SQL, but the content of the tables is not what I hoped for. They all contain 1 column named odata.metadata and 1 entry: the link to the metadata.
If I manually remove the metadata from the JSON in the datalake and then run the second pipeline, the SQL table is what I want to have.
Have:
{ "odata.metadata":"https://test.com",
"value":[
{
"Key":"12345",
"Title":"Name",
"Status":"Test"
}]}
Want:
[
  {
    "Key": "12345",
    "Title": "Name",
    "Status": "Test"
  }
]
I tried to add $.['value'] in the API call. The result then was no odata.metadata line, but the array started with {value:, which resulted in an error when copying to SQL.
I also tried to use mapping (in the sink) to SQL. That gives the wanted result for the dataset I manually specified the mapping for, but it only goes well for datasets with the same number of columns in the array. I don't want to do the mapping manually for 170 calls...
Does anyone know how to handle this in ADF? For now I feel like the only solution is to add a Python step in the pipeline, but I hope for a somewhat standard ADF way to do this!
You can add another pipeline with a dataflow to remove that content from the JSON file before copying the data to SQL, using the flatten formatter.
Before flattening the JSON file:
This is what I see when the JSON data is copied to the SQL database without flattening:
After flattening the JSON file:
I added a pipeline with a dataflow to flatten the JSON file and remove the 'odata.metadata' content from the array.
Source preview:
Flatten formatter:
Select the required object from the input array.
After selecting the value object from the input array, you can see only the values under value in the flatten formatter preview.
Sink preview:
File generated after flattening.
Copy the generated file as input to SQL.
Note: if your input file schema is not constant, you can enable Allow schema drift to allow schema changes.
Reference: Schema drift in mapping data flow
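For clarity, this is the reshaping that the flatten step performs on each JSON file. If you do fall back to the Python step mentioned in the question, a minimal sketch of the same transformation could look like this (the file names are hypothetical):

import json

# Hypothetical local file names standing in for the blob files; in ADF the
# flatten formatter does this reshaping for you.
with open("api_reply.json") as f:          # the "Have" shape
    doc = json.load(f)

records = doc["value"]                     # drop "odata.metadata", keep only the array

with open("flattened.json", "w") as f:     # the "Want" shape: a bare JSON array
    json.dump(records, f, indent=2)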

Copy data from Blob to SQL via Azure Data Factory

I have two sample files in blob storage, sample1.csv and sample2.csv, as below:
data sample
The SQL table is named sample2, with columns Name, id, last name, amount.
I created an ADF flow without a schema; the result is as below:
preview data
Source settings: Allow schema drift is checked.
Sink settings: auto mapping turned on, Allow insert checked, Table action none.
I have also tried defining a schema in the dataset; the result is the same.
Any help here?
My expected outcome would be that the data in sample1 inserts null into the column "last name".
If I understand correctly, you said "my expected outcome would be data in sample1 will inserted null into the column last name", so you only need to add a derived column to your sample1.csv file.
You could follow my steps:
I created a sample1.csv file in Blob Storage and a sample2 table in my SQL database:
Use DerivedColumn to create the new column last name with a null value:
expression: toString(null())
Sink settings:
Run the pipeline and check the data in table:
Hope this helps.
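Outside ADF, the same idea can be sketched in a few lines of Python with pandas (the column list comes from the sample2 table in the question; the code is only an illustration of what the DerivedColumn achieves):

import pandas as pd

# Columns of the target SQL table sample2: Name, id, last name, amount.
target_columns = ["Name", "id", "last name", "amount"]

# sample1.csv is missing "last name"; reindex adds that column filled with
# NaN/NULL, which is the same effect as the DerivedColumn toString(null()).
df = pd.read_csv("sample1.csv")
df = df.reindex(columns=target_columns)
print(df.head())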
You cannot mix schemas in the same source in the same data flow execution.
Schema Drift will handle changes to the schema on an execution-per-execution basis.
But if you are reading multiple different schemas from a folder, you will get non-deterministic results.
Instead, if you loop through those files in a pipeline ForEach one-by-one, data flow will be able to handle the evolving schema.

How to auto detect schema from file in GCS and load to BigQuery?

I'm trying to load a file from GCS to BigQuery whose schema is auto-generated from the file in GCS. I'm using Apache Airflow to do this; the problem I'm having is that when I use auto-detect schema from the file, BigQuery creates the schema based on only the first ~100 values.
For example, in my case there is a column, say X; the values in X are mostly of Integer type, but there are some values of String type, so bq load will fail with a schema mismatch. In such a scenario we need to change the data type to STRING.
So what I could do is manually create a new table by generating the schema on my own. Or I could set the max_bad_records value to something like 50, but that doesn't seem like a good solution. An ideal solution would be like this:
Try to load the file from GCS to BigQuery; if the table was created successfully in BQ without any data mismatch, then I don't need to do anything.
Otherwise I need to be able to update the schema dynamically and complete the table creation.
As you cannot change a column type in bq (see this link):
BigQuery natively supports the following schema modifications:
* Adding columns to a schema definition
* Relaxing a column's mode from REQUIRED to NULLABLE
All other schema modifications are unsupported and require manual workarounds
So as a workaround I suggest:
Use --max_rows_per_request = 1 in your script
Use 1 line which is best suited for your case, with the optimized field types.
This will create the table with the correct schema and 1 line and from there you can load the rest of the data.
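If it helps, here is a hedged sketch of that idea with the google-cloud-bigquery Python client: load with an explicit schema in which X is forced to STRING, instead of letting auto-detect guess from the first rows (the project, dataset, bucket, and second column below are made up):

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical names; column X is declared STRING up front instead of
# letting schema auto-detection infer INTEGER from the initial values.
table_id = "my-project.my_dataset.my_table"
schema = [
    bigquery.SchemaField("X", "STRING"),
    bigquery.SchemaField("Y", "INTEGER"),
]

job_config = bigquery.LoadJobConfig(
    schema=schema,                             # explicit schema, no autodetect
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/my-file.csv", table_id, job_config=job_config
)
load_job.result()                              # wait for the load to finish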

What is the meaning of schema evolution for the Parquet and Avro file formats in Hive

Can anyone explain the meaning of schema evolution for the Parquet and Avro file formats in Hive?
Schema evolution is nothing but a term for how the store behaves when the schema changes. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet/Avro files with different but mutually compatible schemas.
So let's say you have one Avro/Parquet file and you want to change its schema: you can rewrite that file with a new schema inside. But what if you have terabytes of Avro/Parquet files and you want to change their schema? Will you rewrite all of the data every time the schema changes?
Schema evolution allows you to update the schema used to write new data, while maintaining backwards compatibility with the schema(s) of your old data. Then you can read it all together, as if all of the data has one schema. Of course there are precise rules governing the changes allowed, to maintain compatibility. Those rules are listed under Schema Resolution.
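As a small illustration of those Schema Resolution rules, here is a sketch using the fastavro library (the record type and fields are invented): a file written with an old schema is read back through a newer schema that adds a field with a default.

import io
from fastavro import writer, reader, parse_schema

# Old schema: what the existing files were written with.
old_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
    ],
})

# New (evolved) schema: one extra field, with a default so old files stay readable.
new_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string", "default": ""},
    ],
})

# Write a buffer with the OLD schema...
buf = io.BytesIO()
writer(buf, old_schema, [{"id": 1, "name": "alice"}])
buf.seek(0)

# ...and read it back through the NEW schema: the missing field gets its default.
for record in reader(buf, reader_schema=new_schema):
    print(record)  # {'id': 1, 'name': 'alice', 'email': ''}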

Appending data to a table created from an Avro file in BigQuery

Every morning, an automatic job creates a new table from an Avro file. In the afternoon, I would need to append some data to this table from a Query.
When trying to do so, I get the following error:
Error: Invalid schema update. Field chn has changed mode from REQUIRED to NULLABLE
I noticed that I can change the property of the field chn from REQUIRED to NULLABLE in the BigQuery Web UI and then it works fine, but I would have to do it manually every day, which is not what I am looking for.
Is there a way to "cast" the field as REQUIRED during the append query?
Or, during the first import from the Avro file, force the field to be NULLABLE and not REQUIRED?
Thanks!
The feature that allows relaxing a field as part of a query or a load job will be available in production shortly. I will update this answer when it goes live (likely within a week).
Update: 08/25/2016
You can supply schemaUpdateOptions in load or query job configuration.
Multiple options can be provided.
It allows the schema of the destination table to be updated as a side effect of the load or query job. Schema update options are supported in two cases:
When writeDisposition is WRITE_APPEND
When writeDisposition is WRITE_TRUNCATE and the destination table is a partition of a table, specified by partition decorators
For non-partitioned tables, WRITE_TRUNCATE will always overwrite the schema.
The following values are supported:
ALLOW_FIELD_ADDITION: allow adding a nullable field to the schema
ALLOW_FIELD_RELAXATION: allow relaxing a required field in the original schema to nullable
NOTE: This doesn't currently work with schema auto-detection. We plan to support that soon.
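For a more recent example of the same options, the google-cloud-bigquery Python client exposes them on query (and load) job configurations; a minimal sketch with hypothetical project, dataset, and table names:

from google.cloud import bigquery

client = bigquery.Client()

# Append query results to the Avro-created table and allow the REQUIRED
# field (e.g. chn) to be relaxed to NULLABLE as a side effect of the job.
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.avro_table",
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION],
)

query_job = client.query(
    "SELECT * FROM `my-project.my_dataset.afternoon_data`",
    job_config=job_config,
)
query_job.result()  # wait for the append to finish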