How to get StreamSets Record Fields Type inside Jython Evaluator - hive

I have a StreamSets pipeline, where I read from a remote SQL Server database using JDBC component as an origin and put the data into a Hive and a Kudu Data Lake.
I'm facing some issues with Binary columns, as there is no Binary type support in Impala, which I use to access both Hive and Kudu.
I decided to convert the Binary columns (which flow through the pipeline as the Byte_Array type) to String and insert them that way.
I tried to use a Field Type Converter element to convert all Byte_Array types to String, but it didn't work. So I used a Jython component to convert all array.array values to String. That works fine until the field holds a Null value: the Jython type is then NoneType, so I cannot detect that the field is a Byte_Array, cannot convert it to String, and cannot insert it into Kudu.
Is there a way to get StreamSets record field types inside the Jython Evaluator? Or is there a suggested workaround for the problem I'm facing?

You need to use sdcFunctions.getFieldNull() to test whether the field is NULL_BYTE_ARRAY. For example:
import array

def convert(item):
    return ':-)'

def is_byte_array(record, k, v):
    # getFieldNull expects a field path, so we need to prepend the '/'
    return (sdcFunctions.getFieldNull(record, '/'+k) == NULL_BYTE_ARRAY
            or (type(v) == array.array and v.typecode == 'b'))

for record in records:
    try:
        record.value = {k: convert(v) if is_byte_array(record, k, v) else v
                        for k, v in record.value.items()}
        output.write(record)
    except Exception as e:
        error.write(record, str(e))

So here is my final solution:
You can use the logic below to detect any StreamSets type inside the Jython component by using the NULL_CONSTANTS:
NULL_BOOLEAN, NULL_CHAR, NULL_BYTE, NULL_SHORT, NULL_INTEGER, NULL_LONG,
NULL_FLOAT, NULL_DOUBLE, NULL_DATE, NULL_DATETIME, NULL_TIME, NULL_DECIMAL,
NULL_BYTE_ARRAY, NULL_STRING, NULL_LIST, NULL_MAP
The idea is to save the value of the field in a temp variable, set the value of the field to be None and use the function sdcFunctions.getFieldNull to know the StreamSets type by comparing it to one of the NULL_CONSTANTS.
import binascii

def toByteArrayToHexString(value):
    if value is None:
        return NULL_STRING
    value = '0x'+binascii.hexlify(value).upper()
    return value

for record in records:
    try:
        for colName,value in record.value.items():
            # remember the value, then null the field so getFieldNull reports its type
            temp = record.value[colName]
            record.value[colName] = None
            if sdcFunctions.getFieldNull(record,'/'+colName) is NULL_BYTE_ARRAY:
                temp = toByteArrayToHexString(temp)
            record.value[colName] = temp
        output.write(record)
    except Exception as e:
        error.write(record, str(e))
Limitation:
The code above converts the Date type to the Datetime type, but only when the field has a value (that is, when it is not NULL).
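The hex conversion itself can be tried outside StreamSets. The following is a minimal sketch for standalone Python 3, not for the Jython evaluator: `to_hex_string` is a hypothetical name, the sample bytes are made up, and the `.decode()` call is only there because `binascii.hexlify` returns bytes in Python 3.

```python
import array
import binascii

def to_hex_string(value):
    """Return a '0x...' hex string for a signed-byte array, or None for a null field."""
    if value is None:
        return None
    # tobytes() exposes the raw bytes, so the signed value -1 is rendered as FF
    return '0x' + binascii.hexlify(value.tobytes()).decode('ascii').upper()

sample = array.array('b', [1, -1, 16])  # made-up sample data
print(to_hex_string(sample))
print(to_hex_string(None))
```

Inside the pipeline the null case would return NULL_STRING instead of None, as in the solution above.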

Related

How do I use values within a list to specify changing selection conditions and export paths?

I'm trying to split a large csv data using a condition. To automate this process, I'm pulling a list of unique conditions from a column in the data set and wanting to use this list within a loop to specify condition and also rename the export file.
I've converted the array of values into a list and have tried fitting my function into a loop, however, I believe syntax is the main error.
# df1718 is my df
# znlist is my list of values (e.g. 0 1 2 3 4)
# serial is specified at the top e.g. '4'
for x in znlist:
    dftemps = df1718[(df1718.varname == 'RoomTemperature') & (df1718.zone == x)]
    dftemps.to_csv('E:\\path\\test%d_zone(x).csv', serial)
So in theory, I would like each iteration to export the data relevant to the next zone in the list and the export file to be named test33_zone0.csv (for example). Thanks for any help!
EDIT:
The error I am getting is: "delimiter" must be string, not int
If the error comes from saving the file, try this:
dftemps.to_csv('E:\\path\\test{}_zone{}.csv'.format(str(serial),str(x)))
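The error message suggests that `serial` was handed to `to_csv` as a second positional argument, which pandas most likely read as the separator, so the file name has to be fully built before the call. Below is a minimal sketch of the naming step alone, using only the standard library. `export_path` is a hypothetical helper, and the serial and zone values are sample data.

```python
def export_path(serial, zone, folder='E:\\path'):
    """Build the export file name for one zone, e.g. test33_zone0.csv."""
    return '{}\\test{}_zone{}.csv'.format(folder, serial, zone)

znlist = [0, 1, 2]  # sample zone values, assumed for illustration
for x in znlist:
    print(export_path('33', x))
```

Inside the loop from the question, the result would be passed as the only positional argument, for example dftemps.to_csv(export_path(serial, x)).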

Attribute Error when getting unique column values

I am new to Python and Pandas. I am trying to write a function to get a unique list of my column values. My function looks like the one below, where "placename" is the column whose unique values I want. 'placename' should be passed as a string argument corresponding to the header of the csv file.
def get_city_list(state, type, placename):
    city_dir = os.path.join(baseDir, type + ".csv")
    city_df = pd.read_csv(city_dir, quotechar = '"', skipinitialspace = True, sep = ",")
    state_df = city_df.loc[city_df["state"] == state]
    city_list = state_df.placename.unique()
    return city_list
However, when I call this function, it throws an AttributeError saying 'DataFrame' object has no attribute "placename". It seems that "placename" should not be a string, and when I substitute it with
city_list = state_df.cityname.unique(), it works, where cityname (without " ") is the actual header of the column in the original csv file. Since I want to make my function versatile, I want to find a way to deal with this case so that I don't have to manually change the content of "placename" every time.
Your help is greatly appreciated!
Thanks
The dot operator . is reserved for accessing attributes of an object. pandas is nice enough to make column names accessible as attributes, but the name after the dot is taken literally, so you can't do something like df."myplace" or put a variable there.
Change your code to:
state_df[placename].unique()
This way, you are passing the value of placename on to the __getitem__ method.
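The difference between the two lookups can be seen without pandas. The class below is a toy stand-in written for illustration, not pandas itself: `ToyFrame` and the sample city names are assumptions, and it only mimics the idea that columns are reachable both by item and by attribute.

```python
class ToyFrame:
    """Toy stand-in for a DataFrame: columns reachable by item or by attribute."""

    def __init__(self, columns):
        self._columns = columns

    def __getitem__(self, name):
        return self._columns[name]

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name)

df = ToyFrame({'cityname': ['Austin', 'Dallas', 'Austin']})
placename = 'cityname'

print(sorted(set(df[placename])))   # item access uses the value of the variable
print(hasattr(df, 'placename'))     # dot access would look for a column literally named "placename"
print(sorted(set(getattr(df, placename))))  # getattr also accepts the name as a string
```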

Aerospike limiting records by lexicographic order

Can Aerospike get records by lexicographic order? For example, if you want all the records that start with "a", you would search for bin >= "a" AND bin <= "az".
Aerospike supports UDF modules (in Lua and C): https://www.aerospike.com/docs/udf/developing_lua_modules.html
which can serve your purpose.
User-Defined Functions written in Lua extend the core functionality of Aerospike. You would create a stream UDF and attach it to a query.
One best practice for stream UDFs in Aerospike is to eliminate as many records as possible before passing the results into the UDF, so in this case I would create another bin to hold a prefix (first letter, or a substring, depending on your use case) and build a secondary index on it. The idea is that the query portion should return as small of a subset as you can reliably. For your example the prefix can be a single character, you can add a new bin 'firstchar' to the records in the set, then build a secondary index on it.
The stream UDF module would look something like:
local function range_filter(bin_name, substr_from, substr_to)
  return function(rec)
    local val = rec[bin_name]
    if type(val) ~= 'string' then
      return false
    end
    if val >= substr_from and val <= substr_to then
      return true
    else
      return false
    end
  end
end

local function rec_to_map(rec)
  local xrec = map()
  -- copy every bin of the record into the result map
  for i, bin_name in ipairs(record.bin_names(rec)) do
    xrec[bin_name] = rec[bin_name]
  end
  return xrec
end

function str_between(stream, bin_name, substr_from, substr_to)
  return stream : filter(range_filter(bin_name, substr_from, substr_to)) : map(rec_to_map)
end
In the Python client you'd invoke it as follows:
import aerospike
from aerospike import predicates as p
# instantiate the client and connect to the cluster, then:
query = client.query('test', 'this')
query.where(p.equals('firstchar', 'a'))
query.apply('strrangemod', 'str_between', ['a','az'])
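To see which values a range like "a" to "az" actually keeps, the filter logic can be mimicked in plain Python. This is a sketch of the comparison only, not Aerospike client code: the records are plain dicts, the bin name 'name' and the sample values are assumptions, and Python compares strings by code point, which may differ from Lua's locale-based ordering outside ASCII.

```python
def range_filter(bin_name, substr_from, substr_to):
    """Return a predicate keeping records whose string bin lies in [substr_from, substr_to]."""
    def keep(record):
        val = record.get(bin_name)
        if not isinstance(val, str):
            return False
        return substr_from <= val <= substr_to
    return keep

# Sample records, assumed for illustration
records = [{'name': 'a'}, {'name': 'apple'}, {'name': 'az'},
           {'name': 'azz'}, {'name': 'b'}, {'name': 7}]

keep = range_filter('name', 'a', 'az')
print([r['name'] for r in records if keep(r)])
```

Note that 'azz' sorts after 'az' and is filtered out, so the upper bound has to be chosen with the longest expected value in mind.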

Apache PIG, validate input

How do I handle bad records in an Apache Pig script? In my case I'm processing a comma-separated file which usually has 14 fields in every row.
But sometimes a row contains a \n, so the record is split across two lines and my Pig script fails to insert this record and all records after it into HBase.
The problem is that the length of the map within the UDF is always 3, probably because of the schema defined within the Pig script. How can I determine whether a record has a number of fields equal to the schema?
PIG
REGISTER 'files.py' using jython as myfuncs;
A = LOAD '/etl/incoming/test.txt' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
B = FOREACH A {
    GENERATE
        myfuncs.checkFormat(TOTUPLE(*)) as fields;
}
DUMP B;
UDF
import org.apache.pig.data.DataType as DataType
import org.apache.pig.impl.logicalLayer.schema.SchemaUtil as SchemaUtil

@outputSchema("record:map[]")
def checkFormat(record):
    print(type(record))
    print(record)
    record = list(record)
    print("length: %d" % len(record)) #always return 3
    return record
You can write validations as Pig UDFs in a variety of languages.
I usually return the same schema with an additional field that signifies validity and then filter the results twice: once to write the invalid rows to an error log and once to continue processing the valid ones.
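As a rough illustration of the validity-flag idea, the check can be written as a plain Python function that receives the whole row as one string, so the field count is taken before any schema is applied. This is a sketch under assumptions: `check_format` and `EXPECTED_FIELDS` are hypothetical names, the sample rows are made up, a simple split does not handle quoted commas, and the Pig-specific output schema declaration is left out so the function runs on its own.

```python
EXPECTED_FIELDS = 14  # field count stated in the question

def check_format(line, expected=EXPECTED_FIELDS):
    """Split one raw line and return (is_valid, fields)."""
    fields = line.rstrip('\n').split(',')
    return (len(fields) == expected, fields)

good = ','.join(str(i) for i in range(14))  # a complete row
bad = ','.join(str(i) for i in range(9))    # a row cut short by a stray newline
print(check_format(good)[0])
print(check_format(bad)[0])
```

The first element of the result is the flag to filter on, once for the error log and once for the rows that continue downstream.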

How to filter 'NaN' in Pig

I have data that has some rows that look like this:
(1655,var0,var1,NaN)
The first column is an ID, the second and third come from the correlation. The fourth column is the correlation value (from using the COR function).
I would like to filter these rows.
From the Apache Pig documentation, I was under the impression that NaN is equivalent to a null. Therefore I added this to my code:
filter_corr = filter correlation by (corr IS NOT NULL);
This obviously did not work since apparently Pig does not treat null and NaN in the same way.
I would like to know what is the correct way to filter NaN since it is not clear from the Pig documentation.
Thanks!
You could declare the column as chararray in your schema and filter with not matches 'NaN'.
Or, if you want to replace the NaNs with something else, declare the column as chararray as above and then:
Data = FOREACH Data GENERATE ..., (correlation matches 'NaN' ? 0 : (double) correlation), ...
I hope this helps, good luck ;)
You could read in the data as one chararray line and then use a UDF to parse the rows. I made a dataset that looks like this:
1665,var0,var1,NaN
1453,var2,var3,5.432
3452,var4,var5,7.654
8765,var6,var7,NaN
Create UDF
#!/usr/bin/env python
# -*- coding: utf-8 -*-
### name of file: udf.py ###

@outputSchema("id:int, col2:chararray, col3:chararray, corr:float")
def format_input(line):
    parsed = line.split(',')
    # replace a trailing 'NaN' with None so Pig sees a null
    if parsed[len(parsed) - 1] == 'NaN':
        parsed.pop()
        parsed.append(None)
    return tuple(parsed)
Then in the pig shell
$ pig -x local
grunt>
/* register udf */
register 'udf.py' using jython as udf;
data = load 'file' as (line:chararray);
A = foreach data generate FLATTEN(udf.format_input(line));
filtered = filter A by corr is not null;
dump filtered;
output
(1453,var2,var3,5.432)
(3452,var4,var5,7.654)
I've gone with this solution:
filter_corr = filter data by (corr != 'NaN');
data1 = foreach filter_corr generate ID, (double)corr as double_corr;
I renamed the column and reassigned the data type from chararray to double.
I appreciate the responses but I cannot use UDFs during prototyping due to a limitation in the UI that I am using (Cloudera)