Permanently store sqoop --map-column-hive mapping for DB2 - hive

I am importing several DB2 database tables with an Oozie workflow that uses Sqoop to import to Hive. Currently I have to map each column with an unsupported data type manually with "--map-column-hive".
Is there any way to permanently store mappings for specific data types? I am importing several tables that contain DB2-"Character" columns which all have to be mapped to HIVE-"STRING" manually.
For ~50 tables there are ~200 columns that use the data type "Character" for foreign keys, all of which have to be mapped manually.
I want to permanently save that DB2-"Character" is mapped to the datatype HIVE-"STRING".
Can this be done?
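For reference, this is the kind of per-table mapping I currently have to repeat for every import (the connection string and the table/column names below are placeholders):

sqoop import \
  --connect jdbc:db2://db2host:50000/MYDB \
  --table SOME_TABLE \
  --hive-import \
  --map-column-hive CUSTOMER_FK=STRING,ORDER_FK=STRING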

As far as I can see, Sqoop does not provide the ability to pass type-to-type mappings as parameters.
They are all hardcoded explicitly:
switch (sqlType) {
  case Types.INTEGER:
  case Types.SMALLINT:
    return HIVE_TYPE_INT;
  case Types.VARCHAR:
  case Types.CHAR:
  case Types.LONGVARCHAR:
  case Types.NVARCHAR:
  case Types.NCHAR:
  case Types.LONGNVARCHAR:
  case Types.DATE:
  case Types.TIME:
  case Types.TIMESTAMP:
  case Types.CLOB:
    return HIVE_TYPE_STRING;
  case Types.NUMERIC:
  case Types.DECIMAL:
  case Types.FLOAT:
  case Types.DOUBLE:
  case Types.REAL:
    return HIVE_TYPE_DOUBLE;
  case Types.BIT:
  case Types.BOOLEAN:
    return HIVE_TYPE_BOOLEAN;
  case Types.TINYINT:
    return HIVE_TYPE_TINYINT;
  case Types.BIGINT:
    return HIVE_TYPE_BIGINT;
  default:
    // TODO(aaron): Support BINARY, VARBINARY, LONGVARBINARY, DISTINCT,
    // BLOB, ARRAY, STRUCT, REF, JAVA_OBJECT.
    return null;
}
Also there's a specific case for XML columns in DB2:
if (colTypeName.toUpperCase().startsWith("XML")) {
    return XML_TO_JAVA_DATA_TYPE;
}
If your column type is not recognized by this built-in mapping plus any user-defined mappings passed via the --map-column-hive parameter, you'll get an exception.
What I'd do in your case (if not considering manual column mapping):
1. Make sure once again that the type mapping for DB2 "Character" really does not work.
2. Download the sources of your Sqoop version, add a new if-branch in Db2Manager.toDbSpecificHiveType, then build and test with some of your tables (see the sketch below).
3. Create a PR and wait for the next release, OR use the customized version of Sqoop (which might be painful when you want to upgrade your Sqoop version).
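For illustration, a hedged sketch of what that if-branch could look like, modeled on the XML case shown above (the exact method signature and the literal to return are assumptions you'd verify against your Sqoop version's sources):

// hypothetical new branch in Db2Manager.toDbSpecificHiveType (signature assumed):
// map DB2 CHARACTER columns straight to Hive STRING
if (colTypeName.toUpperCase().startsWith("CHARACTER")) {
    return "STRING";
}

A branch like this would sit alongside the existing XML check, so every DB2 CHARACTER column would get the Hive STRING type without per-table --map-column-hive arguments.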

Related

How to get StreamSets Record Fields Type inside Jython Evaluator

I have a StreamSets pipeline where I read from a remote SQL Server database using a JDBC component as an origin and put the data into Hive and Kudu data lakes.
I'm facing some issues with Binary type columns, as there is no Binary type support in Impala, which I use to access both Hive and Kudu.
I decided to convert the Binary type columns (which flow through the pipeline as the Byte_Array type) to String and insert them like that.
I tried to use a Field Type Converter element to convert all Byte_Array types to String, but it didn't work. So I used a Jython component to convert all arr.arr types to String. It works fine until I get a Null value on such a field: the Jython type is then NoneType, so I am unable to detect the Byte_Array type and unable to convert it to String, which means I can't insert it into Kudu.
Any help on how to get StreamSets Record Field Types inside the Jython Evaluator? Or any suggested workaround for the problem I'm facing?
You need to use sdcFunctions.getFieldNull() to test whether the field is NULL_BYTE_ARRAY. For example:
import array

def convert(item):
    return ':-)'

def is_byte_array(record, k, v):
    # getFieldNull expects a field path, so we need to prepend the '/'
    return (sdcFunctions.getFieldNull(record, '/' + k) == NULL_BYTE_ARRAY
            or (type(v) == array.array and v.typecode == 'b'))

for record in records:
    try:
        record.value = {k: convert(v) if is_byte_array(record, k, v) else v
                        for k, v in record.value.items()}
        output.write(record)
    except Exception as e:
        error.write(record, str(e))
So here is my final solution:
You can use the logic below to detect any StreamSets type inside the Jython component by using the NULL_CONSTANTS:
NULL_BOOLEAN, NULL_CHAR, NULL_BYTE, NULL_SHORT, NULL_INTEGER, NULL_LONG,
NULL_FLOAT, NULL_DOUBLE, NULL_DATE, NULL_DATETIME, NULL_TIME, NULL_DECIMAL,
NULL_BYTE_ARRAY, NULL_STRING, NULL_LIST, NULL_MAP
The idea is to save the value of the field in a temp variable, set the value of the field to None, and use the function sdcFunctions.getFieldNull to determine the StreamSets type by comparing it to one of the NULL constants.
import binascii

def toByteArrayToHexString(value):
    if value is None:
        return NULL_STRING
    value = '0x' + binascii.hexlify(value).upper()
    return value

for record in records:
    try:
        for colName, value in record.value.items():
            temp = record.value[colName]
            record.value[colName] = None
            if sdcFunctions.getFieldNull(record, '/' + colName) is NULL_BYTE_ARRAY:
                temp = toByteArrayToHexString(temp)
            record.value[colName] = temp
        output.write(record)
    except Exception as e:
        error.write(record, str(e))
Limitation:
The code above converts the Date type to the Datetime type only when it has a value (when it's not NULL).

Can't figure out how to insert keys and values of nested JSON data into SQL rows with NiFi

I'm working on a personal project and very new (learning as I go) to JSON, NiFi, SQL, etc., so forgive any confusing language used here or a potentially really obvious solution. I can clarify as needed.
I need to take the JSON output from a website's API call and insert it into a table in my MariaDB local server that I've set up. The issue is that the JSON data is nested, and two of the key pieces of data that I need to insert are used as variable key objects rather than values, so I don't know how to extract it and put it in the database table. Essentially, I think I need to identify different pieces of the JSON expression and insert them as values, but I'm clueless how to do so.
I've played around with the EvaluateJSON, SplitJSON, and FlattenJSON processors in particular, but I can't make it work. All I can ever do is get the result of the whole expression, rather than each piece of it.
{"5381":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":74.0,"tm_def_snp":63.0,"temperature":58.0,"st_snp":8.0,"punts":4.0,"punt_yds":178.0,"punt_lng":55.0,"punt_in_20":1.0,"punt_avg":44.5,"humidity":47.0,"gp":1.0,"gms_active":1.0},
"1023":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":82.0,"tm_def_snp":56.0,"temperature":74.0,"off_snp":82.0,"humidity":66.0,"gs":1.0,"gp":1.0,"gms_active":1.0},
"5300":{"wind_speed":17.0,"tm_st_snp":27.0,"tm_off_snp":80.0,"tm_def_snp":64.0,"temperature":64.0,"st_snp":21.0,"pts_std":9.0,"pts_ppr":9.0,"pts_half_ppr":9.0,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl":4.0,"idp_sack":1.0,"idp_qb_hit":2.0,"humidity":100.0,"gp":1.0,"gms_active":1.0,"def_snp":23.0},
"608":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":53.0,"tm_def_snp":79.0,"temperature":88.0,"st_snp":4.0,"pts_std":5.5,"pts_ppr":5.5,"pts_half_ppr":5.5,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl_ast":1.0,"idp_tkl":5.0,"humidity":78.0,"gs":1.0,"gp":1.0,"gms_active":1.0,"def_snp":56.0},
"3396":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":60.0,"tm_def_snp":70.0,"temperature":63.0,"st_snp":19.0,"off_snp":13.0,"humidity":100.0,"gp":1.0,"gms_active":1.0}}
This is a snapshot of an output with a couple thousand lines. Each of the numeric keys that you see above (5381, 1023, 5300, etc) are player IDs for the following stats. I have a table set up with three columns: Player ID, Stat ID, and Stat Value. For example, I need that first snippet to be inserted into my table as such:
Player ID    Stat ID       Stat Value
5381         wind_speed    4.0
5381         tm_st_snp     26.0
5381         tm_off_snp    74.0
And so on, for each piece of data. But I don't know how to have NiFi select the right pieces of data to insert in the right columns.
I believe it's possible to use Jolt (e.g. NiFi's JoltTransformJSON processor) to transform your JSON into a format like:
[
{"playerId":"5381", "statId":"wind_speed", "statValue": 0.123},
{"playerId":"5381", "statId":"tm_st_snp", "statValue": 0.456},
...
]
and then use PutDatabaseRecord with a JSON record reader.
Another approach is to use the ExecuteGroovyScript processor:
Add a new parameter to it named SQL.mydb and link it to your DBCP controller service.
Then use the following script as the Script Body parameter:
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder

def ff = session.get()
if (!ff) return

// read flow file content and parse it
def body = ff.read().withReader("UTF-8") { reader ->
    new JsonSlurper().parse(reader)
}

def results = []

// use defined sql connection to create a batch
SQL.mydb.withTransaction {
    def cmd = 'insert into mytable(playerId, statId, statValue) values(?,?,?)'
    results = SQL.mydb.withBatch(100, cmd) { statement ->
        // run through all keys/subkeys in flow file body
        body.each { pid, keys ->
            keys.each { k, v ->
                statement.addBatch(pid, k, v)
            }
        }
    }
}

// write results as a new flow file content
ff.write("UTF-8") { writer ->
    new JsonBuilder(results).writeTo(writer)
}

// transfer to success
REL_SUCCESS << ff

QueryDsl column case sensitivity bug on identifier

I have seen many posts on column name case sensitivity for in-memory databases like HSQLDB and H2. We are using SQL Server with camel-case column names. However, I am testing using HyperSQL, which is case sensitive for column names. I don't see any setting to handle column name sensitivity in HyperSQL except quoting the column names when creating the table, which makes them whatever case is inside the quotes, for example
Insert Into AddressType ("AddressTypeName", "CreateUser")
Values ('Mailing', 'User')
This creates the table in HSQLDB with the column names AddressTypeName and CreateUser.
SQL Server is not going to make them all upper case, so when creating the columns in HSQLDB I quote them, which creates them in camel case. That works well, except that Querydsl does not quote its identifiers, so the lookup uses upper-case column names while the database has them in camel case.
The whole problem started because the Querydsl Q types, which were generated from the SQL Server columns, carry metadata with quoted column names, and since SQL Server had them in camel case they are looked up in camel case. But when running a Q type query, the identifier is used unquoted, which results in an upper-case lookup that fails. I don't see any workaround except letting the database make them upper case and changing all the Q types to use upper-case column names in the metadata lookups.
I assume this is a bug in Querydsl.
Example QAddressType addMetadata(). Notice the ColumnMetadata.named("AddressTypeName"): because these names are quoted, the database must have them in camel case.
public void addMetadata() {
    addMetadata(addressTypeID, ColumnMetadata.named("AddressTypeID").withIndex(1).ofType(Types.BIGINT).withSize(19).notNull());
    addMetadata(addressTypeName, ColumnMetadata.named("AddressTypeName").withIndex(2).ofType(Types.VARCHAR).withSize(50).notNull());
    addMetadata(createUser, ColumnMetadata.named("CreateUser").withIndex(3).ofType(Types.VARCHAR).withSize(100).notNull());
}
But if I switch the database to camel case, then the queryDslTemplate call fails because it does not quote the identifier. So in the code below, .where(qAddressType.addressTypeID.eq(id)) produces this SQL: a.addressTypeID = 1, where a is the alias for QAddressType. It needs to be a."addressTypeID" = 1, or else HSQLDB looks it up as a.ADDRESSTYPEID = 1.
private static QAddressType qAddressType = new QAddressType("a");

@Override
public AddressType getById(Long id) {
    AddressType addressType = null;
    try {
        SQLQuery sqlQuery = queryDslJdbcTemplate.newSqlQuery()
                .from(qAddressType)
                .where(qAddressType.addressTypeID.eq(id));
        addressType = queryDslJdbcTemplate.queryForObject(sqlQuery, new AddressTypeProjection(qAddressType));
        // the API was not throwing the Exception so let's explicitly throw it
        return addressType;
    } catch (RuntimeException e) {
        throw e;
    }
}
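One avenue worth checking, as a hedged sketch rather than a confirmed fix: the Querydsl SQL module lets you build the SQLTemplates with identifier quoting enabled, which should make the serializer emit a."addressTypeID" instead of a.addressTypeID. How the templates get wired into queryDslJdbcTemplate depends on your Spring configuration, so treat the snippet below as an assumption to verify against your Querydsl version:

// hedged sketch: HSQLDB templates with quoting of all identifiers enabled
// (HSQLDBTemplates and Builder.quote() are in the Querydsl SQL module; wiring
// them into queryDslJdbcTemplate is configuration-specific and assumed here)
SQLTemplates templates = HSQLDBTemplates.builder()
        .quote()   // quote every identifier in the generated SQL
        .build();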

Apache PIG, validate input

How do I handle bad records in Apache Pig scripts? In my case I'm processing a comma-separated file which usually has 14 fields on every row.
But sometimes a row contains a \n, so the record is split over two lines and my Pig script fails to insert this record and all records after it into HBase.
The problem is that the length of the map within the UDF is always 3, probably because of the schema defined within the Pig script. How can I determine whether a record has the number of fields the schema expects?
PIG
REGISTER 'files.py' USING jython AS myfuncs;
A = LOAD '/etl/incoming/test.txt' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
B = FOREACH A {
    GENERATE myfuncs.checkFormat(TOTUPLE(*)) AS fields;
}
DUMP B;
UDF
import org.apache.pig.data.DataType as DataType
import org.apache.pig.impl.logicalLayer.schema.SchemaUtil as SchemaUtil

@outputSchema("record:map[]")
def checkFormat(record):
    print(type(record))
    print(record)
    record = list(record)
    print("length: %d" % len(record))  # always returns 3
    return record
You can write validations as Pig UDFs in a variety of languages.
I usually return the same schema with an additional field that signifies validity, and then FILTER the results (once for logging into an error log and once for the continued operation); a sketch of that approach follows.
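As a hedged sketch in the question's Jython setup (the function name, the fixed three-field schema, and the validity flag are illustrative assumptions, not the exact UDF I'd ship):

# files.py -- hypothetical validation UDF: same fields plus a validity flag
@outputSchema("t:(name:chararray,age:int,gpa:float,valid:int)")
def validateRecord(name, age, gpa):
    # mark the record valid only when every expected field is present
    valid = 1 if (name is not None and age is not None and gpa is not None) else 0
    return (name, age, gpa, valid)

In the Pig script you would then FLATTEN the returned tuple and FILTER on the valid field, routing the invalid records to a separate relation for logging and the valid ones on to HBase.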

How to create a view against a table that has record fields?

We have a weekly backup process which exports our production Google Appengine Datastore onto Google Cloud Storage, and then into Google BigQuery. Each week, we create a new dataset named like YYYY_MM_DD that contains a copy of the production tables on that day. Over time, we have collected many datasets, like 2014_05_10, 2014_05_17, etc. I want to create a data set Latest_Production_Data that contains a view for each of the tables in the most recent YYYY_MM_DD dataset. This will make it easier for downstream reports to write their query once and always retrieve the most recent data.
To do this, I have code that gets the most recent dataset and the names of all the tables that dataset contains from the BigQuery API. Then, for each of these tables, I fire a tables.insert call to create a view that is a SELECT * from the table I am looking to create a reference to.
This fails for tables that contain a RECORD field, from what looks to be a pretty benign column-naming rule.
For example, I have a table, AccountDeletionRequest, that contains a RECORD field (the Datastore __key__), for which I issue this API call:
{
    'tableReference': {
        'projectId': 'redacted',
        'tableId': u'AccountDeletionRequest',
        'datasetId': 'Latest_Production_Data'
    },
    'view': {
        'query': u'SELECT * FROM [2014_05_17.AccountDeletionRequest]'
    },
}
This results in the following error:
HttpError: https://www.googleapis.com/bigquery/v2/projects//datasets/Latest_Production_Data/tables?alt=json returned "Invalid field name "__key__.namespace". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.">
When I execute this query in the BigQuery web console, the columns are renamed to translate the . to an _. I kind of expected the same thing to happen when I issued the create view API call.
Is there an easy way I can programmatically create a view for each of the tables in my dataset, regardless of their underlying schema? The problem I'm encountering now is for record columns, but another problem I anticipate is for tables that have repeated fields. Is there some magic alternative to SELECT * that will take care of all these intricacies for me?
Another idea I had was doing a table copy, but I would prefer not to duplicate the data if I can at all avoid it.
Here is the workaround code I wrote to dynamically generate a SELECT statement for each of the tables:
def get_leaf_column_selectors(dataset, table):
    schema = table_service.get(
        projectId=BQ_PROJECT_ID,
        datasetId=dataset,
        tableId=table
    ).execute()['schema']

    return ",\n".join([
        _get_leaf_selectors("", top_field)
        for top_field in schema["fields"]
    ])

def _get_leaf_selectors(prefix, field):
    if prefix:
        format = prefix + ".%s"
    else:
        format = "%s"

    if 'fields' not in field:
        # Base case
        actual_name = format % field["name"]
        safe_name = actual_name.replace(".", "_")
        return "%s as %s" % (actual_name, safe_name)
    else:
        # Recursive case
        return ",\n".join([
            _get_leaf_selectors(format % field["name"], sub_field)
            for sub_field in field["fields"]
        ])
We had a bug where you needed to select out the individual fields in the view and use an 'as' to rename the fields to something legal (i.e. names that don't have '.' in them).
The bug is now fixed, so you shouldn't see this issue any more. Please ping this thread or start a new question if you see it again.