Developing a Hive UDAF: hitting a ClassCastException with no idea why - hive

public class GenericUdafMemberLevel implements GenericUDAFResolver2 {
    private static final Log LOG = LogFactory
            .getLog(GenericUdafMemberLevel.class.getName());

    @Override
    public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo paramInfo)
            throws SemanticException {
        return new GenericUdafMeberLevelEvaluator();
    }

    @Override
    // validate the arguments
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters)
            throws SemanticException {
        if (parameters.length != 2) { // number of arguments
            throw new UDFArgumentTypeException(parameters.length - 1,
                    "Exactly two arguments are expected.");
        }
        // the arguments must be primitive types, i.e. not complex types
        if (parameters[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentTypeException(0,
                    "Only primitive type arguments are accepted but "
                            + parameters[0].getTypeName() + " is passed.");
        }
        if (parameters[1].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentTypeException(1,
                    "Only primitive type arguments are accepted but "
                            + parameters[1].getTypeName() + " is passed.");
        }
        return new GenericUdafMeberLevelEvaluator();
    }

    public static class GenericUdafMeberLevelEvaluator extends GenericUDAFEvaluator {
        private PrimitiveObjectInspector inputOI;
        private PrimitiveObjectInspector inputOI2;
        private DoubleWritable result;

        @Override
        public ObjectInspector init(Mode m, ObjectInspector[] parameters)
                throws HiveException {
            super.init(m, parameters);
            if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
                inputOI = (PrimitiveObjectInspector) parameters[0];
                inputOI2 = (PrimitiveObjectInspector) parameters[1];
                result = new DoubleWritable(0);
            }
            return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
        }

        /** class for storing count value. */
        static class SumAgg implements AggregationBuffer {
            boolean empty;
            double value;
        }

        @Override
        // Create the memory a new aggregation needs; it stores the running sum
        // across the mapper, combiner and reducer phases.
        // Before the buffer object is used, it is cleared with reset().
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            SumAgg buffer = new SumAgg();
            reset(buffer);
            return buffer;
        }

        @Override
        // Reset to 0.
        // MapReduce may reuse mappers and reducers, so for compatibility the
        // buffer has to be reusable as well.
        public void reset(AggregationBuffer agg) throws HiveException {
            ((SumAgg) agg).value = 0.0;
            ((SumAgg) agg).empty = true;
        }

        private boolean warned = false;

        // iterate
        // Called in the map phase: just add the input parameter to the agg
        // object that holds the current sum.
        @Override
        public void iterate(AggregationBuffer agg, Object[] parameters)
                throws HiveException {
            // parameters == null means the input table/split is empty
            if (parameters == null) {
                return;
            }
            try {
                double flag = PrimitiveObjectInspectorUtils.getDouble(parameters[1], inputOI2);
                if (flag > 1.0) // condition on the second argument
                    merge(agg, parameters[0]); // the post-map work is merged in the combiner
            } catch (NumberFormatException e) {
                if (!warned) {
                    warned = true;
                    LOG.warn(getClass().getSimpleName() + " "
                            + StringUtils.stringifyException(e));
                }
            }
        }

        @Override
        // The combiner merges the results returned by the maps, and the reducer
        // merges the results returned by the mappers or combiners.
        public void merge(AggregationBuffer agg, Object partial)
                throws HiveException {
            if (partial != null) {
                // read each field's data through the ObjectInspector
                double p = PrimitiveObjectInspectorUtils.getDouble(partial, inputOI);
                ((SumAgg) agg).value += p;
            }
        }

        @Override
        // The reducer returns the result, or, when there is only a mapper and
        // no reducer, the result is returned on the mapper side.
        public Object terminatePartial(AggregationBuffer agg)
                throws HiveException {
            return terminate(agg);
        }

        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            result.set(((SumAgg) agg).value);
            return result;
        }
    }
}
I have commented the code to explain my understanding of how it works.
The idea of the UDAF is as follows:
select test_sum(col1, col2) from tbl;
If col2 satisfies some condition, then sum col1's values.
Most of the code is copied from the official avg() UDAF.
I am hitting a weird exception:
java.lang.RuntimeException: Hive Runtime Error while closing operators
at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:226)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.DoubleWritable cannot be cast to org.apache.hadoop.io.LongWritable
at org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1132)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:558)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:567)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:567)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:567)
at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:193)
... 8 more
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.DoubleWritable cannot be cast to org.apache.hadoop.io.LongWritable
at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableLongObjectInspector.get(WritableLongObjectInspector.java:35)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serialize(LazyBinarySerDe.java:323)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serializeStruct(LazyBinarySerDe.java:255)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serialize(LazyBinarySerDe.java:202)
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.processOp(ReduceSinkOperator.java:236)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(GroupByOperator.java:1061)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1113)
... 13 more
Is there something wrong with my UDAF?
Please kindly point it out.
Thanks a lot.

Replace PrimitiveObjectInspectorFactory.writableLongObjectInspector in the init method with PrimitiveObjectInspectorFactory.writableDoubleObjectInspector.
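That is, the evaluator builds and returns a DoubleWritable, so the ObjectInspector declared in init must describe a double rather than a long. A minimal sketch of the corrected init (only the returned inspector changes; everything else is as in the question):
@Override
public ObjectInspector init(Mode m, ObjectInspector[] parameters)
        throws HiveException {
    super.init(m, parameters);
    if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
        inputOI = (PrimitiveObjectInspector) parameters[0];
        inputOI2 = (PrimitiveObjectInspector) parameters[1];
        result = new DoubleWritable(0);
    }
    // The aggregation emits a DoubleWritable, so declare a double inspector here.
    // Declaring the long inspector is what made LazyBinarySerDe cast the
    // DoubleWritable to LongWritable and fail while closing the operators.
    return PrimitiveObjectInspectorFactory.writableDoubleObjectInspector;
}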

Related

What happens if no records exist when I use Room to query?

I use Room in my Android Studio app; Code A will crash when no record exists to query.
What happens if no record exists when I use Room to query with Code B?
@Dao
interface RecordDao {
    // Code A
    @Query("SELECT * FROM record_table where id=:id")
    fun getByID(id: Int): Flow<RecordEntity>

    // Code B
    @Query("SELECT * FROM record_table")
    fun getAll(): Flow<List<RecordEntity>>
}
If you look at the code generated by Room for the RecordDao_Impl implementation of RecordDao, you'll notice multiple things:
The getByID function returns null when no matching record exists in the table, and since your 'Code A' return type is not nullable, Kotlin throws a NullPointerException for the first function (a possible fix is sketched after the generated code below).
The getAll function returns a Flow wrapping an ArrayList: if it finds any records it adds them to the list, otherwise it emits the empty ArrayList. You therefore always get a Flow with a list inside it, whether or not Room found matching records, so no exception is thrown.
You can understand this a bit more if you look at the code generated for the two functions:
@Override
public Flow<RecordEntity> getByID(final int id) {
final String _sql = "SELECT * FROM record_table where id=?";
final RoomSQLiteQuery _statement = RoomSQLiteQuery.acquire(_sql, 1);
int _argIndex = 1;
_statement.bindLong(_argIndex, id);
return CoroutinesRoom.createFlow(__db, false, new String[]{"record_table"}, new Callable<RecordEntity>() {
@Override
public RecordEntity call() throws Exception {
final Cursor _cursor = DBUtil.query(__db, _statement, false, null);
try {
final int _cursorIndexOfId = CursorUtil.getColumnIndexOrThrow(_cursor, "id");
final RecordEntity _result;
if (_cursor.moveToFirst()) {
final int _tmpId;
_tmpId = _cursor.getInt(_cursorIndexOfId);
_result = new RecordEntity(_tmpId);
} else {
_result = null;
}
return _result;
} finally {
_cursor.close();
}
}
@Override
protected void finalize() {
_statement.release();
}
});
}
@Override
public Flow<List<RecordEntity>> getAll() {
final String _sql = "SELECT * FROM record_table";
final RoomSQLiteQuery _statement = RoomSQLiteQuery.acquire(_sql, 0);
return CoroutinesRoom.createFlow(__db, false, new String[]{"record_table"}, new Callable<List<RecordEntity>>() {
@Override
public List<RecordEntity> call() throws Exception {
final Cursor _cursor = DBUtil.query(__db, _statement, false, null);
try {
final int _cursorIndexOfId = CursorUtil.getColumnIndexOrThrow(_cursor, "id");
final List<RecordEntity> _result = new ArrayList<RecordEntity>(_cursor.getCount());
while (_cursor.moveToNext()) {
final RecordEntity _item;
final int _tmpId;
_tmpId = _cursor.getInt(_cursorIndexOfId);
_item = new RecordEntity(_tmpId);
_result.add(_item);
}
return _result;
} finally {
_cursor.close();
}
}
@Override
protected void finalize() {
_statement.release();
}
});
}

Guava synchronizedSortedSetMultimap compareTo NullPointerException

I use Guava's synchronizedSortedSetMultimap to store data:
public static final Multimap<TimestampedDeviceId, ParsedPayload> multimap = Multimaps.synchronizedSortedSetMultimap(
TreeMultimap.create(new Comparator<TimestampedDeviceId>() {
@Override
public int compare(TimestampedDeviceId o1, TimestampedDeviceId o2) {
return o1.compareTo(o2);
}
}, new Comparator<ParsedPayload>() {
@Override
public int compare(ParsedPayload o1, ParsedPayload o2) {
return o1.getCurrentPkgNum()-o2.getCurrentPkgNum();
}
}));
The key of the map is the combination of timestamp and devID:
@Data
@AllArgsConstructor
public class TimestampedDeviceId implements Comparable<TimestampedDeviceId> {
String deviceId;
TimeStamp timeStamp;
@Override
public int compareTo(TimestampedDeviceId o) {
// log.info("o:{},{}",o,timeStamp);
if(!deviceId.equals(o.deviceId)){
return deviceId.compareTo(o.deviceId);
}
else {
return timeStamp.compareTo(o.getTimeStamp());
} }
@Override
public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
TimestampedDeviceId that = (TimestampedDeviceId) o;
return Objects.equals(deviceId, that.deviceId) &&
Objects.equals(timeStamp, that.timeStamp);
}
@Override
public int hashCode() {
return Objects.hash(deviceId, timeStamp);
}
}
The data arrives in order, and when the first item comes in, I get this exception:
2021-03-18 10:03:06.284 INFO com.aier.camerawater.service.OnenetService - id:"668609823",total:28,current:1,time:{"day":18,"hour":10,"minute":3,"month":3,"second":4,"year":2021},exception:{}
java.lang.NullPointerException: null
at com.aier.camerawater.model.TimeStamp.compareTo(TimeStamp.java:39)
at com.aier.camerawater.model.TimestampedDeviceId.compareTo(TimestampedDeviceId.java:24)
at com.aier.camerawater.util.Common$1.compare(Common.java:32)
at com.aier.camerawater.util.Common$1.compare(Common.java:28)
at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:376)
at java.util.TreeMap.getEntry(TreeMap.java:345)
at java.util.TreeMap.get(TreeMap.java:278)
at com.google.common.collect.AbstractMapBasedMultimap.put(AbstractMapBasedMultimap.java:184)
at com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:137)
at com.google.common.collect.TreeMultimap.put(TreeMultimap.java:74)
at com.google.common.collect.Synchronized$SynchronizedMultimap.put(Synchronized.java:636)
at com.aier.camerawater.service.OnenetService.parse(OnenetService.java:92)
at com.aier.camerawater.controller.OnenetController.receive(OnenetController.java:28)
at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
It doesn't make sense to me why the first piece of data causes an NPE in the compareTo method.
Any help would be greatly appreciated.

List of maps as input to a custom Hive UDF

I created a UDF that expects a List<MAP<STRING,STRING>> argument. It works fine in unit tests. However, when I use it in HQL, I cannot get any value from the map by key; it's always null.
However, the log shows that every key has a non-null value. I don't know why.
Error message:
Caused by: java.lang.RuntimeException: service_type cannot be
null:{value=告訴%{name}為甚麼你需要更改訂單,
source_hash=d793db7dee0d1941600c29427383bce8c03ebd84,
source_locale=en, source_updated_at=1501377418000, content_type=PLAIN,
service_type=HUMAN}
// UDF code:
public class MusselContentUDF extends GenericUDF {
private static final Log LOG = LogFactory.getLog(MusselContentUDF.class);
private ListObjectInspector listObjectInspector;
private MapObjectInspector mapObjectInspector;
@Override
public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
ObjectInspector a = arguments[0];
if (!(a instanceof ListObjectInspector)) {
throw new UDFArgumentException("first argument must be a list / array");
}
this.listObjectInspector = (ListObjectInspector) a;
if (!(listObjectInspector.getListElementObjectInspector() instanceof MapObjectInspector)) {
throw new UDFArgumentException("element must be map type");
}
this.mapObjectInspector =
(MapObjectInspector) (listObjectInspector.getListElementObjectInspector());
return PrimitiveObjectInspectorFactory.javaByteArrayObjectInspector;
}
@Override
public byte[] evaluate(DeferredObject[] arguments) throws HiveException {
List<MusselContentUnit> contentUnits =
this.listObjectInspector
.getList(arguments[0].get())
.stream()
.map(
e -> {
Map<?, ?> map = mapObjectInspector.getMap(e);
if (map.get("service_type") == null) {
LOG.error("service_type cannot be null:" + map);
throw new RuntimeException("service_type cannot be null:" + map);
}
if (map.get("value") == null) {
LOG.error("value cannot be null:" + map);
throw new RuntimeException("value cannot be null:" + map);
}
if (map.get("source_hash") == null) {
LOG.error("source_hash cannot be null:" + map);
throw new RuntimeException("source_hash cannot be null:" + map);
}
if (map.get("source_locale") == null) {
LOG.error("source_locale cannot be null:" + map);
throw new RuntimeException("source_locale cannot be null:" + map);
}
if (map.get("source_updated_at") == null) {
LOG.error("source_updated_at cannot be null:" + map);
throw new RuntimeException("source_updated_at cannot be null:" + map);
}
return MusselContentUnit.builder()
.serviceType((String) map.get("service_type"))
.value((String) map.get("value"))
.sourceContentDescriptor(
SourceContentDescriptor.builder()
.sourceHash((String) map.get("source_hash"))
.sourceLocale((String) map.get("source_locale"))
.sourceUpdatedAt(Long.valueOf((String) map.get("source_updated_at")))
.contentType((String) map.get("content_type"))
.build())
.build();
})
.collect(Collectors.toList());
try {
return ThriftCodec.serialize(MusselContent.builder().units(contentUnits).build(), true);
} catch (TException e) {
throw new HiveException("Cannot parse idl request content");
}
}
@Override
public String getDisplayString(String[] children) {
return "MusselContentUDF(" + children[0] + ")";
}
}
HQL code:
q4 AS (
SELECT MUSSEL_PRIMARY_KEY(publisher_name, model, field_name, shard_num) AS primary_key
,CONCAT(field_name, '.', locale) AS secondary_key
,MAP(
'service_type', service_type,
'value', value,
'source_hash', source_hash,
'source_locale', source_locale,
'source_updated_at', source_updated_at,
'content_type', content_type
) AS content_unit
FROM q3
)
,
q5 AS (
SELECT primary_key
,secondary_key
,COLLECT_LIST(content_unit) AS content_units
FROM q4
GROUP BY primary_key, secondary_key
)
The map keys may not be Java Strings when you get the map from mapObjectInspector. Currently I see two possible ways to solve the problem:
Try to use the getMapValueElement(Object data, Object key) method to get the map value.
Work out the key and value ObjectInspectors, e.g.,
protected transient PrimitiveObjectInspector keyOI;
protected transient PrimitiveObjectInspector valueOI;
keyOI = (PrimitiveObjectInspector) this.mapObjectInspector.getMapKeyObjectInspector();
valueOI = (PrimitiveObjectInspector) this.mapObjectInspector.getMapValueObjectInspector();
then use getPrimitiveJavaObject to get the Java object and cast it to String; this way you can turn the Map<?, ?> into a Map<String, String> and call new_map.get("service_type").
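For example, a rough sketch of that second option (toJavaMap is a hypothetical helper, not from the original post; it assumes both keys and values are string-typed primitives and that keyOI/valueOI were initialized as above):
// Rebuild the lazy Hive map as a plain java.util.Map<String, String> so that
// lookups such as map.get("service_type") behave as expected.
private static Map<String, String> toJavaMap(Map<?, ?> hiveMap,
                                             PrimitiveObjectInspector keyOI,
                                             PrimitiveObjectInspector valueOI) {
    Map<String, String> javaMap = new HashMap<>();
    for (Map.Entry<?, ?> entry : hiveMap.entrySet()) {
        // getPrimitiveJavaObject converts lazy/writable objects into plain Java objects
        String key = (String) keyOI.getPrimitiveJavaObject(entry.getKey());
        String value = (String) valueOI.getPrimitiveJavaObject(entry.getValue());
        javaMap.put(key, value);
    }
    return javaMap;
}
Inside evaluate, toJavaMap(mapObjectInspector.getMap(e), keyOI, valueOI).get("service_type") should then return the expected value.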
Unit tests are not really helpful here; I would suggest using HiveRunner instead.

Oracle Coherence index not working with ContainsFilter query

I've added an index to a cache. The index uses a custom extractor that extends AbstractExtractor and overrides only the extract method to return a List of Strings. Then I have a ContainsFilter which uses the same custom extractor and looks for the occurrence of a single String in the List of Strings. It does not look like my index is being used, based on the time it takes to execute my test. What am I doing wrong? Also, is there some debugging I can switch on to see which indices are used?
public class DependencyIdExtractor extends AbstractExtractor {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public Object extract(Object oTarget) {
if (oTarget == null) {
return null;
}
if (oTarget instanceof CacheValue) {
CacheValue cacheValue = (CacheValue)oTarget;
// returns a List of String objects
return cacheValue.getDependencyIds();
}
throw new UnsupportedOperationException();
}
}
Adding the index:
mCache = CacheFactory.getCache(pCacheName);
mCache.addIndex(new DependencyIdExtractor(), false, null);
Performing the ContainsFilter query:
public void invalidateByDependencyId(String pDependencyId) {
ContainsFilter vContainsFilter = new ContainsFilter(new DependencyIdExtractor(), pDependencyId);
#SuppressWarnings("rawtypes")
Set setKeys = mCache.keySet(vContainsFilter);
mCache.keySet().removeAll(setKeys);
}
I solved this by adding a hashCode and equals method implementation to the DependencyIdExtractor class. It is important that you use exactly the same value extractor when adding an index and creating your filter.
public class DependencyIdExtractor extends AbstractExtractor {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public Object extract(Object oTarget) {
if (oTarget == null) {
return null;
}
if (oTarget instanceof CacheValue) {
CacheValue cacheValue = (CacheValue)oTarget;
return cacheValue.getDependencyIds();
}
throw new UnsupportedOperationException();
}
@Override
public int hashCode() {
return 1;
}
@Override
public boolean equals(Object obj) {
if (obj == null) {
return false;
}
if (obj instanceof DependencyIdExtractor) {
return true;
}
return false;
}
}
To debug Coherence indices/queries, you can generate an explain plan similar to database query explain plans.
http://www.oracle.com/technetwork/tutorials/tutorial-1841899.html
#SuppressWarnings("unchecked")
public void invalidateByDependencyId(String pDependencyId) {
ContainsFilter vContainsFilter = new ContainsFilter(new DependencyIdExtractor(), pDependencyId);
if (mLog.isTraceEnabled()) {
QueryRecorder agent = new QueryRecorder(RecordType.EXPLAIN);
Object resultsExplain = mCache.aggregate(vContainsFilter, agent);
mLog.trace("resultsExplain = \n" + resultsExplain + "\n");
}
#SuppressWarnings("rawtypes")
Set setKeys = mCache.keySet(vContainsFilter);
mCache.keySet().removeAll(setKeys);
}

How does Hive implement count(distinct ...)?

In GenericUDAFCount.java:
#Description(name = "count",
value = "_FUNC_(*) - Returns the total number of retrieved rows, including "
+ "rows containing NULL values.\n"
+ "_FUNC_(expr) - Returns the number of rows for which the supplied "
+ "expression is non-NULL.\n"
+ "_FUNC_(DISTINCT expr[, expr...]) - Returns the number of rows for "
+ "which the supplied expression(s) are unique and non-NULL.")
but I don't see any code that deals with the 'distinct' expression.
public static class GenericUDAFCountEvaluator extends GenericUDAFEvaluator {
private boolean countAllColumns = false;
private LongObjectInspector partialCountAggOI;
private LongWritable result;
@Override
public ObjectInspector init(Mode m, ObjectInspector[] parameters)
throws HiveException {
super.init(m, parameters);
partialCountAggOI =
PrimitiveObjectInspectorFactory.writableLongObjectInspector;
result = new LongWritable(0);
return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
}
private GenericUDAFCountEvaluator setCountAllColumns(boolean countAllCols) {
countAllColumns = countAllCols;
return this;
}
/** class for storing count value. */
static class CountAgg implements AggregationBuffer {
long value;
}
@Override
public AggregationBuffer getNewAggregationBuffer() throws HiveException {
CountAgg buffer = new CountAgg();
reset(buffer);
return buffer;
}
@Override
public void reset(AggregationBuffer agg) throws HiveException {
((CountAgg) agg).value = 0;
}
@Override
public void iterate(AggregationBuffer agg, Object[] parameters)
throws HiveException {
// parameters == null means the input table/split is empty
if (parameters == null) {
return;
}
if (countAllColumns) {
assert parameters.length == 0;
((CountAgg) agg).value++;
} else {
assert parameters.length > 0;
boolean countThisRow = true;
for (Object nextParam : parameters) {
if (nextParam == null) {
countThisRow = false;
break;
}
}
if (countThisRow) {
((CountAgg) agg).value++;
}
}
}
@Override
public void merge(AggregationBuffer agg, Object partial)
throws HiveException {
if (partial != null) {
long p = partialCountAggOI.get(partial);
((CountAgg) agg).value += p;
}
}
@Override
public Object terminate(AggregationBuffer agg) throws HiveException {
result.set(((CountAgg) agg).value);
return result;
}
@Override
public Object terminatePartial(AggregationBuffer agg) throws HiveException {
return terminate(agg);
}
}
How does Hive implement count(distinct ...)? When the task runs, it really takes a lot of time.
Where is it in the source code?
Just as you can run SELECT DISTINCT column1 FROM table1 on its own, the DISTINCT expression isn't a flag or option handled inside the UDAF; it's evaluated independently.
This page says:
"The actual filtering of data bound to parameter types for the DISTINCT implementation is handled by the framework and not the COUNT UDAF implementation."
If you want to drill down into the source details, have a look at the Hive git repository.
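For intuition, a count over distinct values behaves as if the framework de-duplicated the rows first and the plain count UDAF then counted them; the illustrative rewrite below (sometimes applied manually when a single count(distinct) stage is slow) makes that separation explicit:
-- Logically equivalent to: SELECT COUNT(DISTINCT column1) FROM table1
-- The inner query de-duplicates; the outer count only counts the remaining rows.
SELECT COUNT(*)
FROM (SELECT DISTINCT column1 FROM table1) t;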