List of map as input of custom Hive UDF - hive

I created a UDF that expects a List<MAP<STRING,STRING>> argument. It works fine in a unit test. However, when I use it in HQL I cannot get any value from the map by key; it is always null.
Yet the log shows that every key has a non-null value, and I don't know why.
Error message:
Caused by: java.lang.RuntimeException: service_type cannot be
null:{value=告訴%{name}為甚麼你需要更改訂單,
source_hash=d793db7dee0d1941600c29427383bce8c03ebd84,
source_locale=en, source_updated_at=1501377418000, content_type=PLAIN,
service_type=HUMAN}
// UDF code:
public class MusselContentUDF extends GenericUDF {
private static final Log LOG = LogFactory.getLog(MusselContentUDF.class);
private ListObjectInspector listObjectInspector;
private MapObjectInspector mapObjectInspector;
@Override
public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
ObjectInspector a = arguments[0];
if (!(a instanceof ListObjectInspector)) {
throw new UDFArgumentException("first argument must be a list / array");
}
this.listObjectInspector = (ListObjectInspector) a;
if (!(listObjectInspector.getListElementObjectInspector() instanceof MapObjectInspector)) {
throw new UDFArgumentException("element must be map type");
}
this.mapObjectInspector =
(MapObjectInspector) (listObjectInspector.getListElementObjectInspector());
return PrimitiveObjectInspectorFactory.javaByteArrayObjectInspector;
}
@Override
public byte[] evaluate(DeferredObject[] arguments) throws HiveException {
List<MusselContentUnit> contentUnits =
this.listObjectInspector
.getList(arguments[0].get())
.stream()
.map(
e -> {
Map<?, ?> map = mapObjectInspector.getMap(e);
if (map.get("service_type") == null) {
LOG.error("service_type cannot be null:" + map);
throw new RuntimeException("service_type cannot be null:" + map);
}
if (map.get("value") == null) {
LOG.error("value cannot be null:" + map);
throw new RuntimeException("value cannot be null:" + map);
}
if (map.get("source_hash") == null) {
LOG.error("source_hash cannot be null:" + map);
throw new RuntimeException("source_hash cannot be null:" + map);
}
if (map.get("source_locale") == null) {
LOG.error("source_locale cannot be null:" + map);
throw new RuntimeException("source_locale cannot be null:" + map);
}
if (map.get("source_updated_at") == null) {
LOG.error("source_updated_at cannot be null:" + map);
throw new RuntimeException("source_updated_at cannot be null:" + map);
}
return MusselContentUnit.builder()
.serviceType((String) map.get("service_type"))
.value((String) map.get("value"))
.sourceContentDescriptor(
SourceContentDescriptor.builder()
.sourceHash((String) map.get("source_hash"))
.sourceLocale((String) map.get("source_locale"))
.sourceUpdatedAt(Long.valueOf((String) map.get("source_updated_at")))
.contentType((String) map.get("content_type"))
.build())
.build();
})
.collect(Collectors.toList());
try {
return ThriftCodec.serialize(MusselContent.builder().units(contentUnits).build(), true);
} catch (TException e) {
throw new HiveException("Cannot parse idl request content");
}
}
@Override
public String getDisplayString(String[] children) {
return "MusselContentUDF(" + children[0] + ")";
}
}
HQL code:
q4 AS (
SELECT MUSSEL_PRIMARY_KEY(publisher_name, model, field_name, shard_num) AS primary_key
,CONCAT(field_name, '.', locale) AS secondary_key
,MAP(
'service_type', service_type,
'value', value,
'source_hash', source_hash,
'source_locale', source_locale,
'source_updated_at', source_updated_at,
'content_type', content_type
) AS content_unit
FROM q3
)
,
q5 AS (
SELECT primary_key
,secondary_key
,COLLECT_LIST(content_unit) AS content_units
FROM q4
GROUP BY primary_key, secondary_key
)

The map keys may not be java.lang.String objects when you get the map from mapObjectInspector, which is why map.get("service_type") finds nothing. Currently I see two possible ways to solve the problem:
1. Use the getMapValueElement(Object data, Object key) method to get the map value.
2. Work out the key and value ObjectInspectors, e.g.,
protected transient PrimitiveObjectInspector keyOI;
protected transient PrimitiveObjectInspector valueOI;
keyOI = (PrimitiveObjectInspector) this.mapObjectInspector.getMapKeyObjectInspector();
valueOI = (PrimitiveObjectInspector) this.mapObjectInspector.getMapValueObjectInspector();
Then use getPrimitiveJavaObject to get the Java object for each key and value and cast it to String. This way you can convert the Map<?, ?> into a Map<String, String> and call new_map.get("service_type").
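For illustration, a minimal sketch of the second option (the toJavaMap helper name is made up; it assumes keyOI and valueOI are resolved in initialize() right after mapObjectInspector, as shown above, and that java.util.HashMap is imported):
private Map<String, String> toJavaMap(Object mapObject) {
    // getMap() returns the map in Hive's internal representation; keys and values
    // may be Text or lazy objects rather than java.lang.String.
    Map<?, ?> raw = mapObjectInspector.getMap(mapObject);
    Map<String, String> converted = new HashMap<>();
    for (Map.Entry<?, ?> entry : raw.entrySet()) {
        // getPrimitiveJavaObject turns the internal representation into a plain Java object.
        Object key = keyOI.getPrimitiveJavaObject(entry.getKey());
        Object value = valueOI.getPrimitiveJavaObject(entry.getValue());
        converted.put(key == null ? null : key.toString(),
                value == null ? null : value.toString());
    }
    return converted;
}
With such a converted map, new_map.get("service_type") looks up a real String key and the null checks in evaluate behave as expected.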
Unit tests are not really helpful here; I would suggest you use HiveRunner instead, for example:
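A rough sketch of what such a HiveRunner test could look like (JUnit 4 style API; the temporary function name, table layout and UDF package are assumptions for illustration, and the HiveRunner package paths may differ between versions):
import static org.junit.Assert.assertEquals;
import java.util.List;
import org.junit.Test;
import org.junit.runner.RunWith;
import com.klarna.hiverunner.HiveShell;
import com.klarna.hiverunner.StandaloneHiveRunner;
import com.klarna.hiverunner.annotations.HiveSQL;

@RunWith(StandaloneHiveRunner.class)
public class MusselContentUDFTest {

    @HiveSQL(files = {})
    private HiveShell shell;

    @Test
    public void mapValuesAreReadableByKey() {
        // Register the UDF the way the real job would (package name is assumed).
        shell.execute("CREATE TEMPORARY FUNCTION mussel_content AS 'com.example.MusselContentUDF'");
        shell.execute("CREATE TABLE q3 (service_type STRING, value STRING, source_hash STRING, "
                + "source_locale STRING, source_updated_at STRING, content_type STRING)");
        shell.execute("INSERT INTO q3 VALUES ('HUMAN', 'v', 'h', 'en', '1501377418000', 'PLAIN')");

        // Unlike a plain unit test, the maps now arrive through Hive's ObjectInspectors,
        // so this exercises the key-lookup problem described above.
        List<Object[]> rows = shell.executeStatement(
                "SELECT mussel_content(COLLECT_LIST(MAP("
                + "'service_type', service_type, 'value', value, "
                + "'source_hash', source_hash, 'source_locale', source_locale, "
                + "'source_updated_at', source_updated_at, 'content_type', content_type))) FROM q3");
        assertEquals(1, rows.size());
    }
}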

Related

Spring R2DBC: Is there a way to get a constant stream from a PostgreSQL database and process it?

I want to fetch newly created records in a PostgreSQL table as a live/continuous stream. Is it possible to do this using Spring R2DBC? If so, what options do I have?
Thanks
You need to use pg_notify and start listening on it. Any change that you want to see should be wrapped in a simple trigger that sends a message to pg_notify.
I have an example of this on my GitHub, but long story short:
Prepare the function and trigger:
CREATE OR REPLACE FUNCTION notify_member_saved()
RETURNS TRIGGER
AS $$
BEGIN
PERFORM pg_notify('MEMBER_SAVED', row_to_json(NEW)::text);
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER member_saved_trigger
AFTER INSERT OR UPDATE
ON members
FOR EACH ROW
EXECUTE PROCEDURE notify_member_saved();
In the Java code, prepare a listener:
@Service
@RequiredArgsConstructor
@Slf4j
class NotificationService {
private final ConnectionFactory connectionFactory;
private final Set<NotificationTopic> watchedTopics = Collections.synchronizedSet(new HashSet<>());
@Qualifier("postgres-event-mapper")
private final ObjectMapper objectMapper;
private PostgresqlConnection connection;
@PreDestroy
private void preDestroy() {
this.getConnection().close().subscribe();
}
private PostgresqlConnection getConnection() {
if(connection == null) {
synchronized(NotificationService.class) {
if(connection == null) {
try {
connection = Mono.from(connectionFactory.create())
.cast(Wrapped.class)
.map(Wrapped::unwrap)
.cast(PostgresqlConnection.class)
.toFuture().get();
} catch(InterruptedException e) {
throw new RuntimeException(e);
} catch(ExecutionException e) {
throw new RuntimeException(e);
}
}
}
}
return this.connection;
}
public <T> Flux<T> listen(final NotificationTopic topic, final Class<T> clazz) {
if(!watchedTopics.contains(topic)) {
executeListenStatement(topic);
}
return getConnection().getNotifications()
.log("notifications")
.filter(notification -> topic.name().equals(notification.getName()) && notification.getParameter() != null)
.handle((notification, sink) -> {
final String json = notification.getParameter();
if(!StringUtils.isBlank(json)) {
try {
sink.next(objectMapper.readValue(json, clazz));
} catch(JsonProcessingException e) {
log.error(String.format("Problem deserializing an instance of [%s] " +
"with the following json: %s ", clazz.getSimpleName(), json), e);
Mono.error(new DeserializationException(topic, e));
}
}
});
}
private void executeListenStatement(final NotificationTopic topic) {
getConnection().createStatement(String.format("LISTEN \"%s\"", topic)).execute()
.doOnComplete(() -> watchedTopics.add(topic))
.subscribe();
}
public void unlisten(final NotificationTopic topic) {
if(watchedTopics.contains(topic)) {
executeUnlistenStatement(topic);
}
}
private void executeUnlistenStatement(final NotificationTopic topic) {
getConnection().createStatement(String.format("UNLISTEN \"%s\"", topic)).execute()
.doOnComplete(() -> watchedTopics.remove(topic))
.subscribe();
}
}
Start listening from the controller:
@GetMapping("/events")
public Flux<ServerSentEvent<Object>> listenToEvents() {
return Flux.merge(listenToDeletedItems(), listenToSavedItems())
.map(o -> ServerSentEvent.builder()
.retry(Duration.ofSeconds(4L))
.event(o.getClass().getName())
.data(o).build()
);
}
@GetMapping("/unevents")
public Mono<ResponseEntity<Void>> unlistenToEvents() {
unlistenToDeletedItems();
unlistenToSavedItems();
return Mono.just(
ResponseEntity
.status(HttpStatus.I_AM_A_TEAPOT)
.body(null)
);
}
private Flux<Member> listenToSavedItems() {
return this.notificationService.listen(MEMBER_SAVED, Member.class);
}
private void unlistenToSavedItems() {
this.notificationService.unlisten(MEMBER_SAVED);
}
But remember that if something breaks, you lose pg_notify events for some time, so this is only for non-mission-critical solutions.

Oracle Coherence index not working with ContainsFilter query

I've added an index to a cache. The index uses a custom extractor that extends AbstractExtractor and overrides only the extract method to return a List of Strings. Then I have a ContainsFilter which uses the same custom extractor and looks for the occurrence of a single String in that List of Strings. Based on the time it takes to execute my test, it does not look like my index is being used. What am I doing wrong? Also, is there some debugging I can switch on to see which indices are used?
public class DependencyIdExtractor extends AbstractExtractor {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public Object extract(Object oTarget) {
if (oTarget == null) {
return null;
}
if (oTarget instanceof CacheValue) {
CacheValue cacheValue = (CacheValue)oTarget;
// returns a List of String objects
return cacheValue.getDependencyIds();
}
throw new UnsupportedOperationException();
}
}
Adding the index:
mCache = CacheFactory.getCache(pCacheName);
mCache.addIndex(new DependencyIdExtractor(), false, null);
Performing the ContainsFilter query:
public void invalidateByDependencyId(String pDependencyId) {
ContainsFilter vContainsFilter = new ContainsFilter(new DependencyIdExtractor(), pDependencyId);
@SuppressWarnings("rawtypes")
Set setKeys = mCache.keySet(vContainsFilter);
mCache.keySet().removeAll(setKeys);
}
I solved this by adding hashCode and equals implementations to the DependencyIdExtractor class. It is important that you use exactly the same value extractor when adding an index and when creating your filter.
public class DependencyIdExtractor extends AbstractExtractor {
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public Object extract(Object oTarget) {
if (oTarget == null) {
return null;
}
if (oTarget instanceof CacheValue) {
CacheValue cacheValue = (CacheValue)oTarget;
return cacheValue.getDependencyIds();
}
throw new UnsupportedOperationException();
}
@Override
public int hashCode() {
return 1;
}
@Override
public boolean equals(Object obj) {
if (obj == null) {
return false;
}
if (obj instanceof DependencyIdExtractor) {
return true;
}
return false;
}
}
To debug Coherence indices/queries, you can generate an explain plan similar to database query explain plans.
http://www.oracle.com/technetwork/tutorials/tutorial-1841899.html
@SuppressWarnings("unchecked")
public void invalidateByDependencyId(String pDependencyId) {
ContainsFilter vContainsFilter = new ContainsFilter(new DependencyIdExtractor(), pDependencyId);
if (mLog.isTraceEnabled()) {
QueryRecorder agent = new QueryRecorder(RecordType.EXPLAIN);
Object resultsExplain = mCache.aggregate(vContainsFilter, agent);
mLog.trace("resultsExplain = \n" + resultsExplain + "\n");
}
@SuppressWarnings("rawtypes")
Set setKeys = mCache.keySet(vContainsFilter);
mCache.keySet().removeAll(setKeys);
}

Entity Framework 5 change log: how to implement?

I am creating an application with MVC4 and Entity Framework 5. How can I implement a change log?
I have looked around and found that I need to override SaveChanges.
Does anyone have any sample code for this? I am using the code-first approach.
As an example, the way I am saving data is as follows:
public class AuditZoneRepository : IAuditZoneRepository
{
private AISDbContext context = new AISDbContext();
public int Save(AuditZone model, ModelStateDictionary modelState)
{
if (model.Id == 0)
{
context.AuditZones.Add(model);
}
else
{
var recordToUpdate = context.AuditZones.FirstOrDefault(x => x.Id == model.Id);
if (recordToUpdate != null)
{
recordToUpdate.Description = model.Description;
recordToUpdate.Valid = model.Valid;
recordToUpdate.ModifiedDate = DateTime.Now;
}
}
try
{
context.SaveChanges();
return 1;
}
catch (Exception ex)
{
modelState.AddModelError("", "Database error has occured. Please try again later");
return -1;
}
}
}
There is no need to override SaveChanges. You can:
Trigger Context.ChangeTracker.DetectChanges(); // may be necessary depending on your proxy approach
Then analyze the context BEFORE saving.
You can then add the change log to the CURRENT unit of work, so the log gets saved in one COMMIT transaction, or process it as you see fit. Saving your change log at the same time makes sure it is ONE transaction.
Analyzing the context, a sample:
I have a simple tool to dump the context content to debug output, so when in the debugger I can use the immediate window to check the content. You can use this as a starter to prepare your CHANGE log.
Try it in the debugger immediate window; I have a FULL dump method on my Context class.
Sample immediate-window call: UoW.Context.FullDump();
public void FullDump()
{
Debug.WriteLine("=====Begin of Context Dump=======");
var dbsetList = this.ChangeTracker.Entries();
foreach (var dbEntityEntry in dbsetList)
{
Debug.WriteLine(dbEntityEntry.Entity.GetType().Name + " => " + dbEntityEntry.State);
switch (dbEntityEntry.State)
{
case EntityState.Detached:
case EntityState.Unchanged:
case EntityState.Added:
case EntityState.Modified:
WriteCurrentValues(dbEntityEntry);
break;
case EntityState.Deleted:
WriteOriginalValues(dbEntityEntry);
break;
default:
throw new ArgumentOutOfRangeException();
}
Debug.WriteLine("==========End of Entity======");
}
Debug.WriteLine("==========End of Context======");
}
private static void WriteCurrentValues(DbEntityEntry dbEntityEntry)
{
foreach (var cv in dbEntityEntry.CurrentValues.PropertyNames)
{
Debug.WriteLine(cv + "=" + dbEntityEntry.CurrentValues[cv]);
}
}
private static void WriteOriginalValues(DbEntityEntry dbEntityEntry)
{
foreach (var cv in dbEntityEntry.OriginalValues.PropertyNames)
{
Debug.WriteLine(cv + "=" + dbEntityEntry.OriginalValues[cv]);
}
}
}
EDIT: Get the changes
I use this routine to get the changes:
public class ObjectPair {
public string Key { get; set; }
public object Original { get; set; }
public object Current { get; set; }
}
public virtual IList<ObjectPair> GetChanges(object poco) {
var changes = new List<ObjectPair>();
var thePoco = (TPoco) poco;
foreach (var propName in Entry(thePoco).CurrentValues.PropertyNames) {
var curr = Entry(thePoco).CurrentValues[propName];
var orig = Entry(thePoco).OriginalValues[propName];
if (curr != null && orig != null) {
if (curr.Equals(orig)) {
continue;
}
}
if (curr == null && orig == null) {
continue;
}
var aChangePair = new ObjectPair {Key = propName, Current = curr, Original = orig};
changes.Add(aChangePair);
}
return changes;
}
EDIT 2: If you must use the internal object tracking:
var context = ???// YOUR DBCONTEXT class
// get objectcontext from dbcontext...
var objectContext = ((IObjectContextAdapter) context).ObjectContext;
// for each tracked entry
foreach (var dbEntityEntry in context.ChangeTracker.Entries()) {
//get the state entry from the statemanager per changed object
var stateEntry = objectContext.ObjectStateManager.GetObjectStateEntry(dbEntityEntry.Entity);
var modProps = stateEntry.GetModifiedProperties();
Debug.WriteLine(modProps.ToString());
}
I decompiled EF6. GetModifiedProperties is indeed using a private bit array to track the fields that have been changed.
// EF decompiled source..... _modifiedFields is a bitarray
public override IEnumerable<string> GetModifiedProperties()
{
this.ValidateState();
if (EntityState.Modified == this.State && this._modifiedFields != null)
{
for (int i = 0; i < this._modifiedFields.Length; ++i)
{
if (this._modifiedFields[i])
yield return this.GetCLayerName(i, this._cacheTypeMetadata);
}
}
}

Developing a Hive UDAF: met a ClassCastException and have no idea why

public class GenericUdafMemberLevel implements GenericUDAFResolver2 {
private static final Log LOG = LogFactory
.getLog(GenericUdafMemberLevel.class.getName());
@Override
public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo paramInfo)
throws SemanticException {
return new GenericUdafMeberLevelEvaluator();
}
@Override
// validate the arguments
public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters)
throws SemanticException {
if (parameters.length != 2) { // number of arguments
throw new UDFArgumentTypeException(parameters.length - 1,
"Exactly two arguments are expected.");
}
// the arguments must be primitive types
if (parameters[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
throw new UDFArgumentTypeException(0,
"Only primitive type arguments are accepted but "
+ parameters[0].getTypeName() + " is passed.");
}
if (parameters[1].getCategory() != ObjectInspector.Category.PRIMITIVE) {
throw new UDFArgumentTypeException(1,
"Only primitive type arguments are accepted but "
+ parameters[1].getTypeName() + " is passed.");
}
return new GenericUdafMeberLevelEvaluator();
}
public static class GenericUdafMeberLevelEvaluator extends GenericUDAFEvaluator {
private PrimitiveObjectInspector inputOI;
private PrimitiveObjectInspector inputOI2;
private DoubleWritable result;
@Override
public ObjectInspector init(Mode m, ObjectInspector[] parameters)
throws HiveException {
super.init(m, parameters);
if (m == Mode.PARTIAL1 || m == Mode.COMPLETE){
inputOI = (PrimitiveObjectInspector) parameters[0];
inputOI2 = (PrimitiveObjectInspector) parameters[1];
result = new DoubleWritable(0);
}
return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
}
/** class for storing count value. */
static class SumAgg implements AggregationBuffer {
boolean empty;
double value;
}
@Override
// Allocate the new aggregation buffer needed for the computation; it stores the running sum across the mapper, combiner and reducer phases.
// Before using the buffer object, clear its memory first with reset.
public AggregationBuffer getNewAggregationBuffer() throws HiveException {
SumAgg buffer = new SumAgg();
reset(buffer);
return buffer;
}
@Override
// reset to 0
// MapReduce supports reuse of mappers and reducers, so for compatibility the buffer memory also needs to be reusable.
public void reset(AggregationBuffer agg) throws HiveException {
((SumAgg) agg).value = 0.0;
((SumAgg) agg).empty = true;
}
private boolean warned = false;
// iterate
// Called in the map phase; just add the input parameter to the agg object that holds the current sum.
@Override
public void iterate(AggregationBuffer agg, Object[] parameters)
throws HiveException {
// parameters == null means the input table/split is empty
if (parameters == null) {
return;
}
try {
double flag = PrimitiveObjectInspectorUtils.getDouble(parameters[1], inputOI2);
if (flag > 1.0) // condition on the parameter
merge(agg, parameters[0]); // the per-map results are then combined in the combiner
} catch (NumberFormatException e) {
if (!warned) {
warned = true;
LOG.warn(getClass().getSimpleName() + " "
+ StringUtils.stringifyException(e));
}
}
}
@Override
// The combiner merges the results returned by the map phase, and the reducer merges the results returned by the mappers or combiners.
public void merge(AggregationBuffer agg, Object partial)
throws HiveException {
if (partial != null) {
// read each field's data through the ObjectInspector
double p = PrimitiveObjectInspectorUtils.getDouble(partial, inputOI);
((SumAgg) agg).value += p;
}
}
@Override
// The reducer returns the result, or, when there is only a mapper and no reducer, the result is returned on the mapper side.
public Object terminatePartial(AggregationBuffer agg)
throws HiveException {
return terminate(agg);
}
@Override
public Object terminate(AggregationBuffer agg) throws HiveException {
result.set(((SumAgg) agg).value);
return result;
}
}
}
I have added comments to the code to explain the idea.
Actually, the idea of the UDAF is as follows:
select test_sum(col1, col2) from tbl;
If col2 satisfies some condition, then sum col1's values.
Most of the code is copied from the official avg() UDAF.
I got a weird exception:
java.lang.RuntimeException: Hive Runtime Error while closing operators
at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:226)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.DoubleWritable cannot be cast to org.apache.hadoop.io.LongWritable
at org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1132)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:558)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:567)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:567)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:567)
at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:193)
... 8 more
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.DoubleWritable cannot be cast to org.apache.hadoop.io.LongWritable
at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableLongObjectInspector.get(WritableLongObjectInspector.java:35)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serialize(LazyBinarySerDe.java:323)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serializeStruct(LazyBinarySerDe.java:255)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serialize(LazyBinarySerDe.java:202)
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.processOp(ReduceSinkOperator.java:236)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(GroupByOperator.java:1061)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1113)
... 13 more
Is there something wrong with my UDAF?
Please kindly point it out.
Thanks a lot.
Replace PrimitiveObjectInspectorFactory.writableLongObjectInspector in the init method with PrimitiveObjectInspectorFactory.writableDoubleObjectInspector. The evaluator stores and returns a DoubleWritable, but the declared output ObjectInspector says long, which is what triggers the ClassCastException when the partial result is serialized.
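A minimal sketch of the corrected init(), with only the return value changed from the code in the question:
@Override
public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
    super.init(m, parameters);
    if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
        inputOI = (PrimitiveObjectInspector) parameters[0];
        inputOI2 = (PrimitiveObjectInspector) parameters[1];
        result = new DoubleWritable(0);
    }
    // The buffer holds a double and terminate() returns a DoubleWritable,
    // so the declared output ObjectInspector must be the double one, not the long one.
    return PrimitiveObjectInspectorFactory.writableDoubleObjectInspector;
}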

How does Hive achieve count(distinct ...)?

In GenericUDAFCount.java:
@Description(name = "count",
value = "_FUNC_(*) - Returns the total number of retrieved rows, including "
+ "rows containing NULL values.\n"
+ "_FUNC_(expr) - Returns the number of rows for which the supplied "
+ "expression is non-NULL.\n"
+ "_FUNC_(DISTINCT expr[, expr...]) - Returns the number of rows for "
+ "which the supplied expression(s) are unique and non-NULL.")
But I don't see any code that deals with the 'distinct' expression.
public static class GenericUDAFCountEvaluator extends GenericUDAFEvaluator {
private boolean countAllColumns = false;
private LongObjectInspector partialCountAggOI;
private LongWritable result;
@Override
public ObjectInspector init(Mode m, ObjectInspector[] parameters)
throws HiveException {
super.init(m, parameters);
partialCountAggOI =
PrimitiveObjectInspectorFactory.writableLongObjectInspector;
result = new LongWritable(0);
return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
}
private GenericUDAFCountEvaluator setCountAllColumns(boolean countAllCols) {
countAllColumns = countAllCols;
return this;
}
/** class for storing count value. */
static class CountAgg implements AggregationBuffer {
long value;
}
@Override
public AggregationBuffer getNewAggregationBuffer() throws HiveException {
CountAgg buffer = new CountAgg();
reset(buffer);
return buffer;
}
@Override
public void reset(AggregationBuffer agg) throws HiveException {
((CountAgg) agg).value = 0;
}
@Override
public void iterate(AggregationBuffer agg, Object[] parameters)
throws HiveException {
// parameters == null means the input table/split is empty
if (parameters == null) {
return;
}
if (countAllColumns) {
assert parameters.length == 0;
((CountAgg) agg).value++;
} else {
assert parameters.length > 0;
boolean countThisRow = true;
for (Object nextParam : parameters) {
if (nextParam == null) {
countThisRow = false;
break;
}
}
if (countThisRow) {
((CountAgg) agg).value++;
}
}
}
@Override
public void merge(AggregationBuffer agg, Object partial)
throws HiveException {
if (partial != null) {
long p = partialCountAggOI.get(partial);
((CountAgg) agg).value += p;
}
}
@Override
public Object terminate(AggregationBuffer agg) throws HiveException {
result.set(((CountAgg) agg).value);
return result;
}
@Override
public Object terminatePartial(AggregationBuffer agg) throws HiveException {
return terminate(agg);
}
}
How does Hive implement count(distinct ...)? When the task runs, it takes a lot of time.
Where is this in the source code?
Since you can just run SELECT DISTINCT column1 FROM table1, the DISTINCT expression isn't a flag or an option of the UDAF; it's evaluated independently.
This page says:
The actual filtering of data bound to parameter types for DISTINCT
implementation is handled by the framework and not the COUNT UDAF
implementation.
If you want to drill down into the source details, have a look at the Hive git repository.