How does hive achieve count(distinct ...)?

How does hive achieve count(distinct ...)? - hive

In the GenericUDAFCount.java:
#Description(name = "count",
value = "_FUNC_(*) - Returns the total number of retrieved rows, including "
+ "rows containing NULL values.\n"
+ "_FUNC_(expr) - Returns the number of rows for which the supplied "
+ "expression is non-NULL.\n"
+ "_FUNC_(DISTINCT expr[, expr...]) - Returns the number of rows for "
+ "which the supplied expression(s) are unique and non-NULL.")
but I don`t see any code to deal with the 'distinct' expression.
public static class GenericUDAFCountEvaluator extends GenericUDAFEvaluator {
private boolean countAllColumns = false;
private LongObjectInspector partialCountAggOI;
private LongWritable result;
#Override
public ObjectInspector init(Mode m, ObjectInspector[] parameters)
throws HiveException {
super.init(m, parameters);
partialCountAggOI =
PrimitiveObjectInspectorFactory.writableLongObjectInspector;
result = new LongWritable(0);
return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
}
private GenericUDAFCountEvaluator setCountAllColumns(boolean countAllCols) {
countAllColumns = countAllCols;
return this;
}
/** class for storing count value. */
static class CountAgg implements AggregationBuffer {
long value;
}
#Override
public AggregationBuffer getNewAggregationBuffer() throws HiveException {
CountAgg buffer = new CountAgg();
reset(buffer);
return buffer;
}
#Override
public void reset(AggregationBuffer agg) throws HiveException {
((CountAgg) agg).value = 0;
}
#Override
public void iterate(AggregationBuffer agg, Object[] parameters)
throws HiveException {
// parameters == null means the input table/split is empty
if (parameters == null) {
return;
}
if (countAllColumns) {
assert parameters.length == 0;
((CountAgg) agg).value++;
} else {
assert parameters.length > 0;
boolean countThisRow = true;
for (Object nextParam : parameters) {
if (nextParam == null) {
countThisRow = false;
break;
}
}
if (countThisRow) {
((CountAgg) agg).value++;
}
}
}
#Override
public void merge(AggregationBuffer agg, Object partial)
throws HiveException {
if (partial != null) {
long p = partialCountAggOI.get(partial);
((CountAgg) agg).value += p;
}
}
#Override
public Object terminate(AggregationBuffer agg) throws HiveException {
result.set(((CountAgg) agg).value);
return result;
}
#Override
public Object terminatePartial(AggregationBuffer agg) throws HiveException {
return terminate(agg);
}
}
How does hive achieve count(distinct ...)? When task runs, it really cost much time.
Where is it in the source code?

As you can just run SELECT DISTINCT column1 FROM table1, DISTINCT expression isn't a flag or option, it's evaluated independently
This page says:
The actual filtering of data bound to parameter types for DISTINCT
implementation is handled by the framework and not the COUNT UDAF
implementation.
If you want drill down to source details, have a look into hive git repository

Related

What happens if no records exists when I use Room to query?

I use Room in my Android Studio App, the Code A will crash when no record exists to query
What happens if no records exists when I use Room to query with Code B ?
#Dao
interface RecordDao {
// Code A
#Query("SELECT * FROM record_table where id=:id")
fun getByID(id:Int): Flow<RecordEntity>
// Code B
#Query("SELECT * FROM record_table")
fun getAll(): Flow<List<RecordEntity>>
}

If you look at the code generated by room for the RecordDao_Impl class implementation of RecordDao, you'll notice multiple things:
The getById function code returns null when no matching records exist in the table, and since your 'Code A' function return type is not null, kotlin throws a NullPointerException for the first function.
The getAll function code returns a Flow object with an ArrayList, in case it found any records then it adds them to the list, otherwise it just emits the empty ArrayList to the Flow object, therefore, you'll always get a flow object with a list inside it regardless if room found matching records or not, so no exception is thrown.
You can understand this a bit more if you look at the code generated for the two functions here:
#Override
public Flow<RecordEntity> getByID(final int id) {
final String _sql = "SELECT * FROM record_table where id=?";
final RoomSQLiteQuery _statement = RoomSQLiteQuery.acquire(_sql, 1);
int _argIndex = 1;
_statement.bindLong(_argIndex, id);
return CoroutinesRoom.createFlow(__db, false, new String[]{"record_table"}, new Callable<RecordEntity>() {
#Override
public RecordEntity call() throws Exception {
final Cursor _cursor = DBUtil.query(__db, _statement, false, null);
try {
final int _cursorIndexOfId = CursorUtil.getColumnIndexOrThrow(_cursor, "id");
final RecordEntity _result;
if (_cursor.moveToFirst()) {
final int _tmpId;
_tmpId = _cursor.getInt(_cursorIndexOfId);
_result = new RecordEntity(_tmpId);
} else {
_result = null;
}
return _result;
} finally {
_cursor.close();
}
}
#Override
protected void finalize() {
_statement.release();
}
});
}
#Override
public Flow<List<RecordEntity>> getAll() {
final String _sql = "SELECT * FROM record_table";
final RoomSQLiteQuery _statement = RoomSQLiteQuery.acquire(_sql, 0);
return CoroutinesRoom.createFlow(__db, false, new String[]{"record_table"}, new Callable<List<RecordEntity>>() {
#Override
public List<RecordEntity> call() throws Exception {
final Cursor _cursor = DBUtil.query(__db, _statement, false, null);
try {
final int _cursorIndexOfId = CursorUtil.getColumnIndexOrThrow(_cursor, "id");
final List<RecordEntity> _result = new ArrayList<RecordEntity>(_cursor.getCount());
while (_cursor.moveToNext()) {
final RecordEntity _item;
final int _tmpId;
_tmpId = _cursor.getInt(_cursorIndexOfId);
_item = new RecordEntity(_tmpId);
_result.add(_item);
}
return _result;
} finally {
_cursor.close();
}
}
#Override
protected void finalize() {
_statement.release();
}
});
}

Is there a standard way to package many Restlets into a single Restlet?

I have a situation where the application developers and the framework provider are not the people. As a framework provider, I would like to be able to hand the developers what looks like a single Filter, but is in fact a chain of standard Filters (such as authentication, setting up invocation context, metrics, ++).
I don't seem to find this functionality in the standard library, but maybe there is an extension with it.

Instead of waiting for an answer, I went ahead with my own implementation and sharing here if some needs this.
/**
* Composes an array of Restlet Filters into a single Filter.
*/
public class ComposingFilter extends Filter
{
private final Filter first;
private final Filter last;
public ComposingFilter( Filter... composedOf )
{
Objects.requireNonNull( composedOf );
if( composedOf.length == 0 )
{
throw new IllegalArgumentException( "Filter chain can't be empty." );
}
first = composedOf[ 0 ];
Filter prev = first;
for( int i = 1; i < composedOf.length; i++ )
{
Filter next = composedOf[ i ];
prev.setNext( next );
prev = next;
}
last = composedOf[ composedOf.length - 1 ];
}
#Override
protected int doHandle( Request request, Response response )
{
if( first != null )
{
first.handle( request, response );
Response.setCurrent( response );
if( getContext() != null )
{
Context.setCurrent( getContext() );
}
}
else
{
response.setStatus( Status.SERVER_ERROR_INTERNAL );
getLogger().warning( "The filter " + getName() + " was executed without a next Restlet attached to it." );
}
return CONTINUE;
}
#Override
public synchronized void start()
throws Exception
{
if( isStopped() )
{
first.start();
super.start();
}
}
#Override
public synchronized void stop()
throws Exception
{
if( isStarted() )
{
super.stop();
first.stop();
}
}
#Override
public Restlet getNext()
{
return last.getNext();
}
#Override
public void setNext( Class<? extends ServerResource> targetClass )
{
last.setNext( targetClass );
}
#Override
public void setNext( Restlet next )
{
last.setNext( next );
}
}
NOTE: Not tested yet.

Oracle Coherence index not working with ContainsFilter query

I've added an index to a cache. The index uses a custom extractor that extends AbstractExtractor and overrides only the extract method to return a List of Strings. Then I have a ContainsFilter which uses the same custom extractor that looks for the occurence of a single String in the List of Strings. It does not look like my index is being used based on the time it takes to execute my test. What am I doing wrong? Also, is there some debugging I can switch on to see which indices are used?
public class DependencyIdExtractor extends AbstractExtractor {
/**
*
*/
private static final long serialVersionUID = 1L;
#Override
public Object extract(Object oTarget) {
if (oTarget == null) {
return null;
}
if (oTarget instanceof CacheValue) {
CacheValue cacheValue = (CacheValue)oTarget;
// returns a List of String objects
return cacheValue.getDependencyIds();
}
throw new UnsupportedOperationException();
}
}
Adding the index:
mCache = CacheFactory.getCache(pCacheName);
mCache.addIndex(new DependencyIdExtractor(), false, null);
Performing the ContainsFilter query:
public void invalidateByDependencyId(String pDependencyId) {
ContainsFilter vContainsFilter = new ContainsFilter(new DependencyIdExtractor(), pDependencyId);
#SuppressWarnings("rawtypes")
Set setKeys = mCache.keySet(vContainsFilter);
mCache.keySet().removeAll(setKeys);
}

I solved this by adding a hashCode and equals method implementation to the DependencyIdExtractor class. It is important that you use exactly the same value extractor when adding an index and creating your filter.
public class DependencyIdExtractor extends AbstractExtractor {
/**
*
*/
private static final long serialVersionUID = 1L;
#Override
public Object extract(Object oTarget) {
if (oTarget == null) {
return null;
}
if (oTarget instanceof CacheValue) {
CacheValue cacheValue = (CacheValue)oTarget;
return cacheValue.getDependencyIds();
}
throw new UnsupportedOperationException();
}
#Override
public int hashCode() {
return 1;
}
#Override
public boolean equals(Object obj) {
if (obj == null) {
return false;
}
if (obj instanceof DependencyIdExtractor) {
return true;
}
return false;
}
}
To debug Coherence indices/queries, you can generate an explain plan similar to database query explain plans.
http://www.oracle.com/technetwork/tutorials/tutorial-1841899.html
#SuppressWarnings("unchecked")
public void invalidateByDependencyId(String pDependencyId) {
ContainsFilter vContainsFilter = new ContainsFilter(new DependencyIdExtractor(), pDependencyId);
if (mLog.isTraceEnabled()) {
QueryRecorder agent = new QueryRecorder(RecordType.EXPLAIN);
Object resultsExplain = mCache.aggregate(vContainsFilter, agent);
mLog.trace("resultsExplain = \n" + resultsExplain + "\n");
}
#SuppressWarnings("rawtypes")
Set setKeys = mCache.keySet(vContainsFilter);
mCache.keySet().removeAll(setKeys);
}

Passing custom parameters to a pig udf function in java

This is the way I am looking to process my data.. from pig..
A = Load 'data' ...
B = FOREACH A GENERATE my.udfs.extract(*);
or
B = FOREACH A GENERATE my.udfs.extract('flag');
So basically extract either has no arguments or takes an argument... 'flag'
On my udf side...
#Override
public DataBag exec(Tuple input) throws IOException {
//if flag == true
//do this
//else
// do that
}
Now how do i implement this in pig?

The preferred way is to use DEFINE.
,,Use DEFINE to specify a UDF function when:
...
The constructor for the
function takes string parameters. If you need to use different
constructor parameters for different calls to the function you will
need to create multiple defines – one for each parameter set"
E.g:
Given the following UDF:
public class Extract extends EvalFunc<String> {
private boolean flag;
public Extract(String flag) {
//Note that a boolean param cannot be passed from script/grunt
//therefore pass it as a string
this.flag = Boolean.valueOf(flag);
}
public Extract() {
}
public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0) {
return null;
}
try {
if (flag) {
...
}
else {
...
}
}
catch (Exception e) {
throw new IOException("Caught exception processing input row ", e);
}
}
}
Then
define ex_arg my.udfs.Extract('true');
define ex my.udfs.Extract();
...
B = foreach A generate ex_arg(); --calls extract with flag set to true
C = foreach A generate ex(); --calls extract without any flag set
Another option (hack?) :
In this case the UDF gets instantiated with its noarg constructor and you pass the flag you want to evaluate in its exec method. Since this method takes a tuple as a parameter you need to first check whether the first field is the boolean flag.
public class Extract extends EvalFunc<String> {
public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0) {
return null;
}
try {
boolean flag = false;
if (input.getType(0) == DataType.BOOLEAN) {
flag = (Boolean) input.get(0);
}
//process rest of the fields in the tuple
if (flag) {
...
}
else {
...
}
}
catch (Exception e) {
throw new IOException("Caught exception processing input row ", e);
}
}
}
Then
...
B = foreach A generate Extract2(true,*); --use flag
C = foreach A generate Extract2();
I'd rather stick to the first solution as this smells.

Developing Hive UDAF meet a ClassCastException without an idea

`public class GenericUdafMemberLevel implements GenericUDAFResolver2 {
private static final Log LOG = LogFactory
.getLog(GenericUdafMemberLevel.class.getName());
#Override
public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo paramInfo)
throws SemanticException {
return new GenericUdafMeberLevelEvaluator();
}
#Override
//参数校验
public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters)
throws SemanticException {
if (parameters.length != 2) {//参数大小
throw new UDFArgumentTypeException(parameters.length - 1,
"Exactly two arguments are expected.");
}
//参数必须是原型，即不能是
if (parameters[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
throw new UDFArgumentTypeException(0,
"Only primitive type arguments are accepted but "
+ parameters[0].getTypeName() + " is passed.");
}
if (parameters[1].getCategory() != ObjectInspector.Category.PRIMITIVE) {
throw new UDFArgumentTypeException(1,
"Only primitive type arguments are accepted but "
+ parameters[1].getTypeName() + " is passed.");
}
return new GenericUdafMeberLevelEvaluator();
}
public static class GenericUdafMeberLevelEvaluator extends GenericUDAFEvaluator {
private PrimitiveObjectInspector inputOI;
private PrimitiveObjectInspector inputOI2;
private DoubleWritable result;
#Override
public ObjectInspector init(Mode m, ObjectInspector[] parameters)
throws HiveException {
super.init(m, parameters);
if (m == Mode.PARTIAL1 || m == Mode.COMPLETE){
inputOI = (PrimitiveObjectInspector) parameters[0];
inputOI2 = (PrimitiveObjectInspector) parameters[1];
result = new DoubleWritable(0);
}
return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
}
/** class for storing count value. */
static class SumAgg implements AggregationBuffer {
boolean empty;
double value;
}
#Override
//创建新的聚合计算的需要的内存，用来存储mapper,combiner,reducer运算过程中的相加总和。
//使用buffer对象前，先进行内存的清空——reset
public AggregationBuffer getNewAggregationBuffer() throws HiveException {
SumAgg buffer = new SumAgg();
reset(buffer);
return buffer;
}
#Override
//重置为0
//mapreduce支持mapper和reducer的重用，所以为了兼容，也需要做内存的重用。
public void reset(AggregationBuffer agg) throws HiveException {
((SumAgg) agg).value = 0.0;
((SumAgg) agg).empty = true;
}
private boolean warned = false;
//迭代
//map阶段调用，只要把保存当前和的对象agg，再加上输入的参数，就可以了。
#Override
public void iterate(AggregationBuffer agg, Object[] parameters)
throws HiveException {
// parameters == null means the input table/split is empty
if (parameters == null) {
return;
}
try {
double flag = PrimitiveObjectInspectorUtils.getDouble(parameters[1], inputOI2);
if(flag > 1.0) //参数条件
merge(agg, parameters[0]); //这里将Map之后的操作，放入combiner进行合并
} catch (NumberFormatException e) {
if (!warned) {
warned = true;
LOG.warn(getClass().getSimpleName() + " "
+ StringUtils.stringifyException(e));
}
}
}
#Override
//combiner合并map返回的结果，还有reducer合并mapper或combiner返回的结果。
public void merge(AggregationBuffer agg, Object partial)
throws HiveException {
if (partial != null) {
//通过ObejctInspector取每一个字段的数据
double p = PrimitiveObjectInspectorUtils.getDouble(partial, inputOI);
((SumAgg) agg).value += p;
}
}
#Override
//reducer返回结果，或者是只有mapper，没有reducer时，在mapper端返回结果。
public Object terminatePartial(AggregationBuffer agg)
throws HiveException {
return terminate(agg);
}
#Override
public Object terminate(AggregationBuffer agg) throws HiveException {
result.set(((SumAgg) agg).value);
return result;
}
}
}`
I have used some chinese to comment the code for understanding the theory.
Actually, the idea of the UDAF is like follow:
select test_sum(col1,col2) from tbl ;
if col2 satisfy some condition, then sum col1's value.
Most of the code are copied from the offical avg() udaf function.
I met a weried Exception:
java.lang.RuntimeException: Hive Runtime Error while closing operators
at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:226)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.DoubleWritable cannot be cast to org.apache.hadoop.io.LongWritable
at org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1132)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:558)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:567)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:567)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:567)
at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:193)
... 8 more
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.DoubleWritable cannot be cast to org.apache.hadoop.io.LongWritable
at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableLongObjectInspector.get(WritableLongObjectInspector.java:35)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serialize(LazyBinarySerDe.java:323)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serializeStruct(LazyBinarySerDe.java:255)
at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serialize(LazyBinarySerDe.java:202)
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.processOp(ReduceSinkOperator.java:236)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(GroupByOperator.java:1061)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1113)
... 13 more
Am I have something wrong with my UDAF??
please kindly point it out.
Thanks a lllllllot .

Replace PrimitiveObjectInspectorFactory.writableLongObjectInspector in init method with PrimitiveObjectInspectorFactory.writableDoubleObjectInspector.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How does hive achieve count(distinct ...)? - hive

Related

What happens if no records exists when I use Room to query?

Is there a standard way to package many Restlets into a single Restlet?

Oracle Coherence index not working with ContainsFilter query

Passing custom parameters to a pig udf function in java

Developing Hive UDAF meet a ClassCastException without an idea

Categories

Resources