Converting StructType to Avro Schema, returns type as Union when using databricks spark-avro - apache-spark-sql

I am using databricks spark-avro to convert a dataframe schema into an Avro schema. The returned Avro schema fails to have a default value. This is causing issues when I am trying to create a GenericRecord out of the schema. Can anyone help with the right way of using this function?
Dataset<Row> sellableDs = sparkSession.sql("sql query");
SchemaBuilder.RecordBuilder<Schema> rb = SchemaBuilder.record("testrecord").namespace("test_namespace");
Schema sc = SchemaConverters.convertStructToAvro(sellableDs.schema(), rb, "test_namespace");
System.out.println(sc.toString());
System.out.println(sc.getFields().get(0).toString());
String schemaString = sc.toString();
sellableDs.foreach(
(ForeachFunction<Row>) row -> {
Schema scEx = new Schema.Parser().parse(schemaString);
GenericRecord gr;
gr = new GenericData.Record(scEx);
System.out.println("Generic record Created");
int fieldSize = scEx.getFields().size();
for (int i = 0; i < fieldSize; i++ ) {
// System.out.println( row.get(i).toString());
System.out.println("field: " + scEx.getFields().get(i).toString() + "::" + "value:" + row.get(i));
gr.put(scEx.getFields().get(i).toString(), row.get(i));
//i++;
}
}
);
This is the df schema:
StructType(StructField(key,IntegerType,true), StructField(value,DoubleType,true))
This is the avro converted schema:
{"type":"record","name":"testrecord","namespace":"test_namespace","fields":[{"name":"key","type":["int","null"]},{"name":"value","type":["double","null"]}]}

The problem is that the SchemaConverters class does not include default values as part of the schema creation. You have two options: modify the schema to add default values before creating the record, or fill the record with some value before building it (it could actually be the values from your row, or simply null). This is an example of how to create a Record using your schema:
import org.apache.avro.generic.GenericRecordBuilder
import org.apache.avro.Schema
var schema = new Schema.Parser().parse("{\"type\":\"record\",\"name\":\"testrecord\",\"namespace\":\"test_namespace\",\"fields\":[{\"name\":\"key\",\"type\":[\"int\",\"null\"]},{\"name\":\"value\",\"type\":[\"double\",\"null\"]}]}")
var builder = new GenericRecordBuilder(schema);
for (i <- 0 to schema.getFields().size() - 1 ) {
builder.set(schema.getFields().get(i).name(), null)
}
var record = builder.build();
print(record.toString())
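If you prefer the first option (a schema whose fields carry default values), a minimal sketch in Scala, assuming the two-column key/value schema from the question, is to build the record type directly with SchemaBuilder instead of converting it. Note that for a null default to be valid, "null" has to be the first branch of the union, which is why the optional* shortcuts are used here rather than the ["int","null"] order produced by the converter.
import org.apache.avro.Schema
import org.apache.avro.SchemaBuilder
import org.apache.avro.generic.GenericRecordBuilder

// Hand-built schema with null defaults for every field.
val schemaWithDefaults: Schema = SchemaBuilder.record("testrecord").namespace("test_namespace")
  .fields()
  .optionalInt("key")       // union of null and int, default null
  .optionalDouble("value")  // union of null and double, default null
  .endRecord()

// With defaults in place, GenericRecordBuilder fills any field you don't set.
val record = new GenericRecordBuilder(schemaWithDefaults).set("key", 42).build()
println(record) // {"key": 42, "value": null}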

Related

Not able to upload json data to Bigquery tables using c#

I am trying to upload JSON data to one of the tables created under a dataset in BigQuery, but it fails with "Google.GoogleApiException: 'Google.Apis.Requests.RequestError
Not found: Table currency-342912:sampleDataset.currencyTable [404]"
The service account is created with the roles BigQuery.Admin/DataEditor/DataOwner/DataViewer.
The roles are also applied to the table.
Below is the snippet
public static void LoadTableGcsJson(string projectId = "currency-342912", string datasetId = "sampleDataset", string tableId= "currencyTable ")
{
//Read the Serviceaccount key json file
string dir = Directory.GetParent(Directory.GetCurrentDirectory()).Parent.Parent.FullName + "\\" + "currency-342912-ae9b22f23a36.json";
GoogleCredential credential = GoogleCredential.FromFile(dir);
string toFileName = Directory.GetParent(Directory.GetCurrentDirectory()).Parent.Parent.FullName + "\\" + "sample.json";
BigQueryClient client = BigQueryClient.Create(projectId,credential);
var dataset = client.GetDataset(datasetId);
using (FileStream stream = File.Open(toFileName, FileMode.Open))
{
// Create and run job
BigQueryJob loadJob = client.UploadJson(datasetId, tableId, null, stream); //This throws error
loadJob.PollUntilCompleted();
}
}
Permissions for the table, using the service account "sampleservicenew" from the screenshot
Any leads on this would be much appreciated.
Your issue might reside in your credentials. Please follow these steps to check your code:
Check that the user you are using to execute your application has access to the table you want to insert data into.
Check that your JSON tags match your table columns.
Check that your JSON inputs are correct (table name, dataset name).
Use a dummy table to perform a quick test of your credentials and data integrity.
These steps will help you identify what could be missing on your side. I performed the following operations to reproduce your case:
I created a table on BigQuery based on the values of your JSON data:
create or replace table `projectid.datasetid.tableid` (
IsFee BOOL,
BlockDateTime timestamp,
Address STRING,
BlockHeight INT64,
Type STRING,
Value INT64
);
Created a .json file with your test data
{"IsFee":false,"BlockDateTime":"2018-09-11T00:12:14Z","Address":"tz3UoffC7FG7zfpmvmjUmUeAaHvzdcUvAj6r","BlockHeight":98304,"Type":"OUT","Value":1}
{"IsFee":false,"BlockDateTime":"2018-09-11T00:12:14Z","Address":"tz2KuCcKSyMzs8wRJXzjqoHgojPkSUem8ZBS","BlockHeight":98304,"Type":"IN","Value":18}
Built and ran the code below:
using System;
using Google.Cloud.BigQuery.V2;
using Google.Apis.Auth.OAuth2;
using System.IO;
namespace stackoverflow
{
class Program
{
static void Main(string[] args)
{
String projectid = "projectid";
String datasetid = "datasetid";
String tableid = "tableid";
String safilepath ="credentials.json";
var credentials = GoogleCredential.FromFile(safilepath);
BigQueryClient client = BigQueryClient.Create(projectid,credentials);
using (FileStream stream = File.Open("data.json", FileMode.Open))
{
BigQueryJob loadJob = client.UploadJson(datasetid, tableid, null, stream);
loadJob.PollUntilCompleted();
}
}
}
}
Output:
Row | IsFee | BlockDateTime           | Address                              | BlockHeight | Type | Value
1   | false | 2018-09-11 00:12:14 UTC | tz3UoffC7FG7zfpmvmjUmUeAaHvzdcUvAj6r | 98304       | OUT  | 1
2   | false | 2018-09-11 00:12:14 UTC | tz2KuCcKSyMzs8wRJXzjqoHgojPkSUem8ZBS | 98304       | IN   | 18
Note: you can use the above code to perform quick tests of your credentials and of the integrity of the data to insert.
I also made use of the following documentation:
Load Credentials from a file
Google.Cloud.BigQuery.V2
Load Json data into a new table

How can I automatically infer schemas of CSV files on S3 as I load them?

Context
Currently I am using Snowflake as a Data Warehouse and AWS' S3 as a data lake. The majority of the files that land on S3 are in the Parquet format. For these, I am using a new limited feature by Snowflake (documented here) that automatically detects the schema from the parquet files on S3, which I can use to generate a CREATE TABLE statement with the correct column names and inferred data types. This feature currently only works for Apache Parquet, Avro, and ORC files. I would like to find a way that achieves the same desired objective but for CSV files.
What I have tried to do
This is how I currently infer the schema for Parquet files:
select generate_column_description(array_agg(object_construct(*)), 'table') as columns
from table (infer_schema(location=>'${LOCATION}', file_format=>'${FILE_FORMAT}'))
However, if I try specifying the FILE_FORMAT as csv, that approach fails.
Other approaches I have considered:
Converting all files that land on S3 to Parquet (this involves more code and infra setup, so it wouldn't be my top choice, especially as I'd like to keep some files in their natural format on S3)
Having a script (using libraries like Pandas in Python, for example) that infers the schema for files in S3 (this also involves more code, and would be odd in the sense that Parquet files are handled in Snowflake but non-Parquet files are handled by some script on AWS).
Using a Snowflake UDF to infer the schema. Haven't fully considered my options there yet.
Desired Behaviour
As a new CSV file lands on S3 (on a pre-existing STAGE), I would like to infer the schema and be able to generate a CREATE TABLE statement with the inferred data types. Preferably, I would like to do that within Snowflake, since the aforementioned schema-inference solution already lives there. Happy to add further information if needed.
UPDATE: I modified the SP that infers data types in untyped (all string type columns) tables and it now works directly against Snowflake stages. The project code is available here: https://github.com/GregPavlik/InferSchema
I wrote a stored procedure to assist with this; however, its only goal is to infer the data types of untyped columns. It works as follows:
Load the CSV into a table with all columns defined as varchars.
Call the SP with a query against the new table (main point is to get only the columns you want and limit the row count to keep type inference times reasonable).
Also in the SP call is the DB, schema, and table for the old and new locations -- old with all varchar and new with the inferred types.
The SP will then infer the data types and create two SQL statements. One statement will create the new table with the inferred data types. One statement will copy from the untyped (all varchar) table to the new table with appropriate wrappers such as try_multi_timestamp(), a UDF that extends try_to_timestamp() to try various common formats.
I meant to extend this so that it didn't require the untyped (all varchar) table at all, but haven't gotten around to it. Since it's come up here, I may circle back and update the SP with that capability. You can specify a query that reads directly from the stage, but you'd have to use $1, $2... with aliases for the column names (or else the DDL will try to create column names like $1). If the query runs directly against a stage, for the old DB, schema, and table, you could put in whatever because that's only used to generate an insert from select statement.
-- This shows how to use on the Snowflake TPCH sample, but could be any query.
-- Keep the row count down to reduce the time it takes to infer the types.
call infer_data_types('select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM limit 10000',
'SNOWFLAKE_SAMPLE_DATA', 'TPCH_SF1', 'LINEITEM',
'TEST', 'PUBLIC', 'LINEITEM');
create or replace procedure INFER_DATA_TYPES(SOURCE_QUERY string,
DATABASE_OLD string,
SCHEMA_OLD string,
TABLE_OLD string,
DATABASE_NEW string,
SCHEMA_NEW string,
TABLE_NEW string)
returns string
language javascript
as
$$
/****************************************************************************************************
* *
* DataType Classes
* *
****************************************************************************************************/
class Query{
constructor(statement){
this.statement = statement;
}
}
class DataType {
constructor(db, schema, table, column, sourceQuery) {
this.db = db;
this.schema = schema;
this.table = table;
this.sourceQuery = sourceQuery
this.column = column;
this.insert = '"#~COLUMN~#"';
this.totalCount = 0;
this.notNullCount = 0;
this.typeCount = 0;
this.blankCount = 0;
this.minTypeOf = 0.95;
this.minNotNull = 1.00;
}
setSQL(sqlTemplate){
this.sql = sqlTemplate;
this.sql = this.sql.replace(/#~DB~#/g, this.db);
this.sql = this.sql.replace(/#~SCHEMA~#/g, this.schema);
this.sql = this.sql.replace(/#~TABLE~#/g, this.table);
this.sql = this.sql.replace(/#~COLUMN~#/g, this.column);
}
getCounts(){
var rs;
rs = GetResultSet(this.sql);
rs.next();
this.totalCount = rs.getColumnValue("TOTAL_COUNT");
this.notNullCount = rs.getColumnValue("NON_NULL_COUNT");
this.typeCount = rs.getColumnValue("TO_TYPE_COUNT");
this.blankCount = rs.getColumnValue("BLANK");
}
isCorrectType(){
return (this.typeCount / (this.notNullCount - this.blankCount) >= this.minTypeOf);
}
isNotNull(){
return (this.notNullCount / this.totalCount >= this.minNotNull);
}
}
class TimestampType extends DataType{
constructor(db, schema, table, column, sourceQuery){
super(db, schema, table, column, sourceQuery)
this.syntax = "timestamp";
this.insert = 'try_multi_timestamp(trim("#~COLUMN~#"))';
this.sourceQuery = SOURCE_QUERY;
this.setSQL(GetCheckTypeSQL(this.insert, this.sourceQuery));
this.getCounts();
}
}
class IntegerType extends DataType{
constructor(db, schema, table, column, sourceQuery){
super(db, schema, table, column, sourceQuery)
this.syntax = "number(38,0)";
this.insert = 'try_to_number(trim("#~COLUMN~#"), 38, 0)';
this.setSQL(GetCheckTypeSQL(this.insert, this.sourceQuery));
this.getCounts();
}
}
class DoubleType extends DataType{
constructor(db, schema, table, column, sourceQuery){
super(db, schema, table, column, sourceQuery)
this.syntax = "double";
this.insert = 'try_to_double(trim("#~COLUMN~#"))';
this.setSQL(GetCheckTypeSQL(this.insert, this.sourceQuery));
this.getCounts();
}
}
class BooleanType extends DataType{
constructor(db, schema, table, column, sourceQuery){
super(db, schema, table, column, sourceQuery)
this.syntax = "boolean";
this.insert = 'try_to_boolean(trim("#~COLUMN~#"))';
this.setSQL(GetCheckTypeSQL(this.insert, this.sourceQuery));
this.getCounts();
}
}
// Catch all is STRING data type
class StringType extends DataType{
constructor(db, schema, table, column, sourceQuery){
super(db, schema, table, column, sourceQuery)
this.syntax = "string";
this.totalCount = 1;
this.notNullCount = 0;
this.typeCount = 1;
this.minTypeOf = 0;
this.minNotNull = 1;
}
}
/****************************************************************************************************
* *
* Main function *
* *
****************************************************************************************************/
var pass = 0;
var column;
var typeOf;
var ins = '';
var newTableDDL = '';
var insertDML = '';
var columnRS = GetResultSet(GetTableColumnsSQL(DATABASE_OLD, SCHEMA_OLD, TABLE_OLD));
while (columnRS.next()){
pass++;
if(pass > 1){
newTableDDL += ",\n";
insertDML += ",\n";
}
column = columnRS.getColumnValue("COLUMN_NAME");
typeOf = InferDataType(DATABASE_OLD, SCHEMA_OLD, TABLE_OLD, column, SOURCE_QUERY);
newTableDDL += '"' + typeOf.column + '" ' + typeOf.syntax;
ins = typeOf.insert;
insertDML += ins.replace(/#~COLUMN~#/g, typeOf.column);
}
return GetOpeningComments() +
GetDDLPrefixSQL(DATABASE_NEW, SCHEMA_NEW, TABLE_NEW) +
newTableDDL +
GetDDLSuffixSQL() +
GetDividerSQL() +
GetInsertPrefixSQL(DATABASE_NEW, SCHEMA_NEW, TABLE_NEW) +
insertDML +
GetInsertSuffixSQL(DATABASE_OLD, SCHEMA_OLD, TABLE_OLD) ;
/****************************************************************************************************
* *
* Helper functions *
* *
****************************************************************************************************/
function InferDataType(db, schema, table, column, sourceQuery){
var typeOf;
typeOf = new IntegerType(db, schema, table, column, sourceQuery);
if (typeOf.isCorrectType()) return typeOf;
typeOf = new DoubleType(db, schema, table, column, sourceQuery);
if (typeOf.isCorrectType()) return typeOf;
typeOf = new BooleanType(db, schema, table, column, sourceQuery); // May want to do a distinct and look for two values
if (typeOf.isCorrectType()) return typeOf;
typeOf = new TimestampType(db, schema, table, column, sourceQuery);
if (typeOf.isCorrectType()) return typeOf;
typeOf = new StringType(db, schema, table, column, sourceQuery);
if (typeOf.isCorrectType()) return typeOf;
return null;
}
/****************************************************************************************************
* *
* SQL Template Functions *
* *
****************************************************************************************************/
function GetCheckTypeSQL(insert, sourceQuery){
var sql =
`
select count(1) as TOTAL_COUNT,
count("#~COLUMN~#") as NON_NULL_COUNT,
count(${insert}) as TO_TYPE_COUNT,
sum(iff(trim("#~COLUMN~#")='', 1, 0)) as BLANK
--from "#~DB~#"."#~SCHEMA~#"."#~TABLE~#";
from (${sourceQuery})
`;
return sql;
}
function GetTableColumnsSQL(dbName, schemaName, tableName){
var sql =
`
select COLUMN_NAME
from ${dbName}.INFORMATION_SCHEMA.COLUMNS
where TABLE_CATALOG = '${dbName}' and
TABLE_SCHEMA = '${schemaName}' and
TABLE_NAME = '${tableName}'
order by ORDINAL_POSITION;
`;
return sql;
}
function GetOpeningComments(){
return `
/**************************************************************************************************************
* *
* Copy and paste into a worksheet to create the typed table and insert into the new table from the old one. *
* *
**************************************************************************************************************/
`;
}
function GetDDLPrefixSQL(db, schema, table){
var sql =
`
create or replace table "${db}"."${schema}"."${table}"
(
`;
return sql;
}
function GetDDLSuffixSQL(){
return "\n);";
}
function GetDividerSQL(){
return `
/**************************************************************************************************************
* *
* The SQL statement below attempts to copy all rows from the string table to the typed table. *
* *
**************************************************************************************************************/
`;
}
function GetInsertPrefixSQL(db, schema, table){
var sql =
`\ninsert into "${db}"."${schema}"."${table}" select\n`;
return sql;
}
function GetInsertSuffixSQL(db, schema, table){
var sql =
`\nfrom "${db}"."${schema}"."${table}" ;`;
return sql;
}
//function GetInsertSuffixSQL(db, schema, table){
//var sql = '\nfrom "${db}"."${schema}"."${table}";';
//return sql;
//}
/****************************************************************************************************
* *
* SQL functions *
* *
****************************************************************************************************/
function GetResultSet(sql){
cmd1 = {sqlText: sql};
stmt = snowflake.createStatement(cmd1);
var rs;
rs = stmt.execute();
return rs;
}
function ExecuteNonQuery(queryString) {
var out = '';
cmd1 = {sqlText: queryString};
stmt = snowflake.createStatement(cmd1);
var rs;
rs = stmt.execute();
}
function ExecuteSingleValueQuery(columnName, queryString) {
var out;
cmd1 = {sqlText: queryString};
stmt = snowflake.createStatement(cmd1);
var rs;
try{
rs = stmt.execute();
rs.next();
return rs.getColumnValue(columnName);
}
catch(err) {
if (err.message.substring(0, 18) == "ResultSet is empty"){
throw "ERROR: No rows returned in query.";
} else {
throw "ERROR: " + err.message.replace(/\n/g, " ");
}
}
return out;
}
function ExecuteFirstValueQuery(queryString) {
var out;
cmd1 = {sqlText: queryString};
stmt = snowflake.createStatement(cmd1);
var rs;
try{
rs = stmt.execute();
rs.next();
return rs.getColumnValue(1);
}
catch(err) {
if (err.message.substring(0, 18) == "ResultSet is empty"){
throw "ERROR: No rows returned in query.";
} else {
throw "ERROR: " + err.message.replace(/\n/g, " ");
}
}
return out;
}
function getQuery(sql){
var cmd = {sqlText: sql};
var query = new Query(snowflake.createStatement(cmd));
try {
query.resultSet = query.statement.execute();
} catch (err) {
throw "ERROR: " + err.message.replace(/\n/g, " ");
}
return query;
}
$$;
Have you tried STAGES?
Create two stages: one with no header and the other with a header. See the examples below.
Then a bit of SQL and voila, your DDL.
Only issue: you need to know the number of columns in order to put the correct number of t.$'s.
If someone could automate that, we'd have an almost automatic DDL generator for CSVs.
Obviously, once you have the SQL statement, just add the create or replace table to the front and your table is nicely created with all the names from the CSV.
:-)
-- create or replace stage CSV_NO_HEADER
URL = 's3://xxx-x-dev-landing/xxx/'
STORAGE_INTEGRATION = "xxxLAKE_DEV_S3_INTEGRATION"
FILE_FORMAT = ( TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"' )
-- create or replace stage CSV
URL = 's3://xxx-xxxlake-dev-landing/xxx/'
STORAGE_INTEGRATION = "xxxLAKE_DEV_S3_INTEGRATION"
FILE_FORMAT = ( TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' )
select concat('select t.$1 ', t.$1, ',t.$2 ', t.$2,',t.$3 ', t.$3, ',t.$4 ', t.$4,',t.$5 ', t.$5,',t.$6 ', t.$6,',t.$7 ', t.$7,',t.$8 ', t.$8,',t.$9 ', t.$9,
',t.$10 ', t.$10, ',t.$11 ', t.$11,',t.$12 ', t.$12 ,',t.$13 ', t.$13, ',t.$14 ', t.$14 ,',t.$15 ', t.$15 ,',t.$16 ', t.$16 ,',t.$17 ', t.$17 ,' from #xxxx_NO_HEADER/SUB_TRANSACTION_20201204.csv t') from
--- CHANGE TABLE ---
#xxx/SUB_TRANSACTION_20201204.csv t limit 1;

VarcharType mismatch Spark dataframe

I'm trying to change the schema of a dataframe. Every time I have a column of string type I want to change its type to VarcharType(max), where max is the maximum length of the strings in that column. I wrote the following code. (I want to export the dataframe to SQL Server later and I don't want nvarchar in SQL Server, so I'm trying to limit it on the Spark side.)
val df = spark.sql(s"SELECT * FROM $tableName")
var l : List [StructField] = List()
val schema = df.schema
schema.fields.foreach(x => {
if (x.dataType == StringType) {
val dataColName = x.name
val maxLength = df.select(dataColName).reduce((x, y) => {
if (x.getString(0).length >= y.getString(0).length) {
x
} else {
y
}
}).getString(0).length
val dataType = VarcharType(maxLength)
l = l :+ StructField(dataColName, dataType)
} else {
l = l :+ x
}
})
val newSchema = StructType(l)
val newDf = spark.createDataFrame(df.rdd, newSchema)
However, when running it I get this error:
20/01/22 15:29:44 ERROR ApplicationMaster: User class threw exception: scala.MatchError:
VarcharType(9) (of class org.apache.spark.sql.types.VarcharType)
scala.MatchError: VarcharType(9) (of class org.apache.spark.sql.types.VarcharType)
Can a dataframe column be of type VarcharType(n)?
The data type mapping between a dataframe and a database happens in the dialect class. For MS SQL Server the class is org.apache.spark.sql.jdbc.MsSqlServerDialect. You can inherit from it and override getJDBCType to influence the data type mapping from a dataframe to a table, then register your dialect for it to take effect.
I have done this for Oracle (not SQL Server); however, it can be done similarly.
//Change this
override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
case TimestampType => Some(JdbcType("DATETIME", java.sql.Types.TIMESTAMP))
case StringType => Some(JdbcType("NVARCHAR(MAX)", java.sql.Types.NVARCHAR))
case BooleanType => Some(JdbcType("BIT", java.sql.Types.BIT))
case _ => None
}
You can't use VarcharType in the dataframe schema because it is not a supported DataType there. You also can't check the length of the actual data because it is not exposed; you only have access to "dt: DataType", so you can set a default size for the NVARCHAR if MAX is not acceptable.
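To make the mapping take effect you also need to register the dialect before writing. A minimal sketch in Scala, assuming a SQL Server JDBC URL and an illustrative VARCHAR(4000) cap (not taken from the answer above):
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

// Custom dialect: map Spark StringType to a bounded VARCHAR instead of NVARCHAR(MAX).
object SqlServerVarcharDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR(4000)", java.sql.Types.VARCHAR))
    case _          => None // fall back to the built-in mapping
  }
}

// Register once per JVM, before df.write.jdbc(...); registered dialects are tried first.
JdbcDialects.registerDialect(SqlServerVarcharDialect)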

How to instantiate com.google.cloud.bigquery.Table

I want to process a CSV file present in a Cloud Storage bucket and insert its data into a BigQuery table. I found the following piece of code, but I am not sure how to instantiate com.google.cloud.bigquery.Table for a given table name.
com.google.cloud.bigquery.Table table = null;
com.google.cloud.bigquery.Job job = table.load(FormatOptions.csv(), sourceUri);
com.google.cloud.bigquery.Job completedJob = job.waitFor(WaitForOption.checkEvery(1, TimeUnit.SECONDS),
WaitForOption.timeout(3, TimeUnit.MINUTES));
if (!(completedJob != null && completedJob.getStatus().getError() == null)) {
throw new InterruptedException("Unable to load file from bucket into BQ");
}
return job;
Snippet taken from here.
[imports]
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
TableId tableId = TableId.of("dataset", "table");
Table table = bigquery.getTable(tableId);
[..]
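Putting the lookup together with the load call from your snippet, a minimal sketch (written in Scala against the same Java client; the dataset, table, and bucket URI are placeholders):
import com.google.cloud.bigquery.{BigQueryOptions, FormatOptions, TableId}

// Look the table up by name, then start a CSV load job from the bucket.
val bigquery = BigQueryOptions.getDefaultInstance.getService
val table = bigquery.getTable(TableId.of("dataset", "table")) // returns null if the table does not exist
val job = table.load(FormatOptions.csv(), "gs://my-bucket/my-file.csv")
val completedJob = job.waitFor()
if (completedJob == null || completedJob.getStatus.getError != null) {
  throw new RuntimeException("Unable to load file from bucket into BQ")
}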
Side note - that is an Alpha client library you are using. Just so you know.

automated mapping in groovy for ODI with expression

This is my first question and I hope you can help me. I made a script in Groovy (in Oracle Data Integrator 12c) to automate mappings. Here is a description of my procedure:
1 step: removing old mapping if exists.
2 step: looking for the project and the folder (if they don't exist: create new ones).
3 step: create new mapping
4 step: implement source and target table
5 step: create expression
6 step: link every column
Now my question: Can someone help me to make this script with a dynamic expression? Like this:
step 1: get the data types of the target columns
step 2: get the right data types into the expression
step 3: change the false types (always Varchar) into the right types (Number or Date or still Varchar)
step 4: link every column
My handicap: I have never done anything with Groovy and I'm not very good at Java, so it is not possible for me to make this dynamic myself. Almost everything in my script is pieced together from various internet sites. It would be great to find someone who knows something about my problem. And I think it would be a useful script for everyone who moves from OWB to ODI.
Thanks!
//Created by ODI Studio
//
//name of the project
projectName = "SRC_TO_TRG"
//name of the folder
ordnerName = "FEN_TEST"
//name of the mapping
mappingName = "MAP1_FF_TO_TRG"
//name of the model
modelName = "DB_FEN"
//name of the source datastore
sourceDatastoreName = "SRC_TEST_FEN"
//name of the target datastore
targetDatastoreName = "TRG_TEST_FEN"
import oracle.odi.domain.project.finder.IOdiProjectFinder
import oracle.odi.domain.model.finder.IOdiDataStoreFinder
import oracle.odi.domain.project.finder.IOdiFolderFinder
import oracle.odi.domain.project.finder.IOdiKMFinder
import oracle.odi.domain.mapping.finder.IMappingFinder
import oracle.odi.domain.adapter.project.IKnowledgeModule.ProcessingType
import oracle.odi.domain.model.OdiDataStore
import oracle.odi.core.persistence.transaction.support.DefaultTransactionDefinition
//set expression to the component
def createExp(comp, tgtTable, propertyName, expressionText) {
DatastoreComponent.findAttributeForColumn(comp,tgtTable.getColumn(propertyName)) .setExpressionText(expressionText)
}
//delete mapping with the same name
def removeMapping(folder, map_name) {
txnDef = new DefaultTransactionDefinition()
tm = odiInstance.getTransactionManager()
tme = odiInstance.getTransactionalEntityManager()
txnStatus = tm.getTransaction(txnDef)
try {
Mapping map = ((IMappingFinder) tme.getFinder(Mapping.class)).findByName(folder, map_name)
if (map != null) {
odiInstance.getTransactionalEntityManager().remove(map);
}
} catch (Exception e) {e.printStackTrace();}
tm.commit(txnStatus)
}
//looking for a project and folder
def find_folder(project_code, folder_name) {
txnDef = new DefaultTransactionDefinition()
tm = odiInstance.getTransactionManager()
tme = odiInstance.getTransactionalEntityManager()
txnStatus = tm.getTransaction(txnDef)
pf = (IOdiProjectFinder)tme.getFinder(OdiProject.class)
ff = (IOdiFolderFinder)tme.getFinder(OdiFolder.class)
project = pf.findByCode(project_code)
//if there is no project, create new one
if (project == null) {
project = new OdiProject(project_code, project_code)
tme.persist(project)
}
//if there is no folder, create new one
folderColl = ff.findByName(folder_name, project_code)
OdiFolder folder = null
if (folderColl.size() == 1)
folder = folderColl.iterator().next()
if (folder == null) {
folder = new OdiFolder(project, folder_name)
tme.persist(folder)
}
tm.commit(txnStatus)
return folder
}
//name of the project and the folder
folder = find_folder(projectName,ordnerName)
//delete old mapping
removeMapping(folder, mappingName)
txnDef = new DefaultTransactionDefinition()
tm = odiInstance.getTransactionManager()
tme = odiInstance.getTransactionalEntityManager()
txnStatus = tm.getTransaction(txnDef)
dsf = (IOdiDataStoreFinder)tme.getFinder(OdiDataStore.class)
mapf = (IMappingFinder) tme.getFinder(Mapping.class)
//create new mapping
map = new Mapping(mappingName, folder);
tme.persist(map)
//insert source table
boundTo_emp = dsf.findByName(sourceDatastoreName, modelName)
comp_emp = new DatastoreComponent(map, boundTo_emp)
//insert target table
boundTo_tgtemp = dsf.findByName(targetDatastoreName, modelName)
comp_tgtemp = new DatastoreComponent(map, boundTo_tgtemp)
//create expression-operator
comp_expression = new ExpressionComponent(map, "EXPRESSION")
// define expression
comp_expression.addExpression("LAND_KM", "TO_NUMBER(SRC_TEST_FEN.LAND_KM)", null,null,null);
comp_expression.addExpression("DATE_OF_ELECTION", "TO_DATE(SRC_TEST_FEN.DATE_OF_ELECTION, 'DD.MM.YYYY')", null,null,null);
//further transformations can be appended here
//link source table with expression
comp_emp.connectTo(comp_expression)
//link expression with target table
comp_expression.connectTo(comp_tgtemp)
createExp(comp_tgtemp, boundTo_tgtemp, "ABBR", "SRC_TEST_FEN.ABBR")
createExp(comp_tgtemp, boundTo_tgtemp, "NAME", "SRC_TEST_FEN.NAME")
createExp(comp_tgtemp, boundTo_tgtemp, "LAND_KM", "EXPRESSION.LAND_KM")
createExp(comp_tgtemp, boundTo_tgtemp, "DATE_OF_ELECTION", "EXPRESSION.DATE_OF_ELECTION")
tme.persist(map)
tm.commit(txnStatus)
You can pass the Datatype as the third argument of the method addExpression.
You can also pass the size and the scale as fourth and fifth arguments.
For instance, for the LAND_KM expression, replace your line with this:
MapAttribute map_attr = DatastoreComponent.findAttributeForColumn(comp_tgtemp,boundTo_tgtemp.getColumn("LAND_KM"))
comp_expression.addExpression("LAND_KM", "TO_NUMBER(SRC_TEST_FEN.LAND_KM)", map_attr.getDataType(),map_attr.getSize(),map_attr.getScale());
It retrieves the target column for LAND_KM thanks to findAttributeForColumn, then retrieves the datatype, the size, and the scale, and uses them when adding the new expression in the Expression component.
If you want to auto map it based on the name, David Allan wrote a post on the official Oracle blog about how to do it and he provides his code : https://blogs.oracle.com/dataintegration/entry/odi_12c_mapping_sdk_auto