Why is Text preferred over String in a Hive UDF Java class?

Here is a Hive UDF Java class:
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class Strip extends UDF {
private Text result = new Text();
public Text evaluate(Text str) {
if (str == null) {
return null;
}
result.set(StringUtils.strip(str.toString()));
return result;
}
public Text evaluate(Text str, String stripChars) {
if (str == null) {
return null;
}
result.set(StringUtils.strip(str.toString(), stripChars));
return result;
}
}
Hive actually supports Java
primitives in UDFs (and a few other types, such as java.util.List and
java.util.Map), so a signature like:
public String evaluate(String str)
would work equally well. However, by using Text we can take advantage of object reuse,
which can bring efficiency savings, so this is preferred in general.
Can someone explain why Text is preferred? How does using Text let us take advantage of object reuse? Suppose we execute the following command in Hive:
hive> SELECT strip(' bee ') FROM dummy;
If we then execute another command that uses the Strip function, the Strip object is created again, right? So we cannot reuse it, right?

You can reuse a Text instance by calling one of the set() methods on it. For example:
Text t = new Text("hadoop");
t.set("pig");
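As for when the reuse happens: Hive calls evaluate() once per row on the same UDF instance while a query runs, so the single result field serves every row of that query instead of allocating a new object each time. A new query does create a new Strip object, but the saving is per row, not per query. The following is a minimal sketch of that pattern; MutableText and StripLike are hypothetical stand-ins so the example runs without Hadoop on the classpath, and trim() stands in for StringUtils.strip().

```java
public class ReuseSketch {
    // Hypothetical stand-in for org.apache.hadoop.io.Text: a mutable holder with set().
    static final class MutableText {
        private String value = "";
        void set(String v) { value = v; }
        @Override public String toString() { return value; }
    }

    // Shaped like the Strip UDF above: one result object per UDF instance.
    static final class StripLike {
        private final MutableText result = new MutableText();

        MutableText evaluate(String str) {
            if (str == null) {
                return null;
            }
            result.set(str.trim()); // overwrite the existing object, no new allocation
            return result;
        }
    }

    public static void main(String[] args) {
        StripLike udf = new StripLike();             // created once for the query
        String[] rows = {" bee ", " ant ", " wasp "};
        MutableText first = null;
        for (String row : rows) {
            MutableText out = udf.evaluate(row);     // called once per row
            if (first == null) {
                first = out;
            }
            System.out.println(out + " sameObject=" + (out == first));
        }
        // prints:
        // bee sameObject=true
        // ant sameObject=true
        // wasp sameObject=true
    }
}
```

With a String return type, every row would produce a fresh String that the framework then has to wrap; with a reused holder, only its contents change.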

Related

How to use a predefined lambda in kotlin?

I am learning Kotlin, coming from Java, and I stumbled upon some unexpected behavior.
I noticed that in the code below I seem to accidentally declare a new lambda in the wrong place instead of using the one I already have. How can I fix this?
I wrote these two declarations:
/**
* Dataclass used as an example.
*/
data class Meeple(var name: String, var color: String = "translucent")
/**
* Function to map from a List<T> to a new List of equal length,
* containing the ordered elements received by applying a Mapper's map
* function to every element of the input List.
*
* @param T Type of input List-elements
* @param O Type of output List-elements
* @param mapper The mapping function applied to every input element.
* @return The List of output elements received by applying the mapper on all
* input elements.
*/
fun <T, O> List<T>.map(mapper: (T) -> O?): List<O?> {
val target = ArrayList<O?>();
for (t in this) {
val mapped: O? = mapper.invoke(t)
target.add(mapped);
}
return target;
}
The data class is just a dummy example of a simple object. The List.map extension function is meant to map from the elements of the list to a new type and return a new List of that new type, almost like a Stream.map would in Java.
I then create some dummy Meeples and try to map them to their respective names:
fun main(args: Array<String>) {
val meeples = listOf(
Meeple("Jim", "#fff"),
Meeple("Cassidy"),
Meeple("David", "#f00")
)
var toFilter: String = "Cassidy"
val lambda: (Meeple) -> String? =
{ if (it.name == toFilter) null else it.name }
toFilter = "Jim"
for (name in meeples.map { lambda }) {
println(name ?: "[anonymous]") // This outputs "(Meeple) -> kotlin.String?" (x3 because of the loop)
}
}
I did this to check the behavior of the lambda, and whether it would later filter "Jim" or "Cassidy", my expectation being the latter, as that was the state of toFilter at lambda initialization.
However, I got an entirely different result. The invoke method, though described by IntelliJ as being (T) -> O?, seems to yield the name of the lambda instead of the name of the Meeple.
It seems that the call to meeples.map { lambda } does not bind the lambda as I expected, but creates a new lambda that returns lambda, and println then calls toString on it.
How would I actually invoke the real lambda method, instead of declaring a new one?
You already mentioned in the comments you figured out that you were passing a new lambda that returns your original lambda.
As for the toFilter value changing: The lambda function is like any other interface. As you have defined it, it captures the toFilter variable, so it will always use the current value of it when the lambda is executed. If you want to avoid capturing the variable, copy its current value into the lambda when you define the lambda. There are various ways to do this. One way is to copy it to a local variable first.
var toFilter: String = "Cassidy"
val constantToFilter = toFilter
val lambda: (Meeple) -> String? =
{ if (it.name == constantToFilter) null else it.name }
toFilter = "Jim"
Pretty much anything you can do with Stream in Java, you can do to an Iterable directly in Kotlin. The map function is already available, as mentioned in the comments.
Edit: Since you mentioned Java behavior in the comments.
Java can capture member variables, but local variables have to be final (or, since Java 8, effectively final) for the compiler to allow you to pass them to a lambda or anonymous class. So in this sense Java captures values only (unless you use a member variable). The equivalent of Java's final for a local variable in Kotlin is val.
Kotlin is more lenient than Java in this situation, and also allows you to pass a non-final local variable (var) to an interface or lambda, and it captures the variable in this case. This is what your original code is doing.
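For comparison, here is a small Java sketch of both behaviors. The class and method names are made up for illustration, and the names are plain Strings rather than Meeple objects. byValue shows the value capture Java enforces for locals; byReference uses an AtomicReference as a mutable box, which is one way to imitate what Kotlin does when a lambda captures a var.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

public class CaptureSketch {
    // Value capture: the local must be final or effectively final.
    static Function<String, String> byValue() {
        String toFilter = "Cassidy";
        // Reassigning toFilter anywhere in this method would make the lambda below a compile error.
        return name -> name.equals(toFilter) ? null : name;
    }

    // Variable-like capture: the lambda holds a reference to a mutable box
    // and reads the box's current content every time it runs.
    static Function<String, String> byReference(AtomicReference<String> toFilter) {
        return name -> name.equals(toFilter.get()) ? null : name;
    }

    public static void main(String[] args) {
        AtomicReference<String> toFilter = new AtomicReference<>("Cassidy");
        Function<String, String> lambda = byReference(toFilter);
        toFilter.set("Jim"); // changed after the lambda was created

        System.out.println(lambda.apply("Jim"));        // null
        System.out.println(lambda.apply("Cassidy"));    // Cassidy
        System.out.println(byValue().apply("Cassidy")); // null
    }
}
```

The boxed version filters "Jim", matching what the original Kotlin code does once map(lambda) is called correctly.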
Even though you have found the issue as you mention in comments, I am adding this answer with some details to help any future readers.
So when you create lambda using
val lambda: (Meeple) -> String? = { if (it.name == toFilter) null else it.name }
This basically translates to
final Function1 lambda = (Function1)(new Function1() {
public Object invoke(Object var1) {
return this.invoke((Meeple)var1);
}
@Nullable
public final String invoke(@NotNull Meeple it) {
Intrinsics.checkNotNullParameter(it, "it");
return Intrinsics.areEqual(it.getName(), (String)toFilter.element) ? null : it.getName();
}
});
Now the correct way to pass this to your map method is, as you have mentioned in the comments,
name in meeples.map(lambda)
but instead of (lambda) you wrote { lambda }, this is the trailing lambda convention
name in meeples.map { lambda }
// if the last parameter of a function is a function, then a lambda expression passed as the corresponding argument can be placed outside the parentheses:
// If the lambda is the only argument in that call, the parentheses can be omitted entirely
This creates a new lambda which returns the lambda we defined above. The line is basically translated to the following:
HomeFragmentKt.map(meeples, (Function1)(new Function1() {
public Object invoke(Object var1) {
return this.invoke((Meeple)var1);
}
@Nullable
public final Function1 invoke(@NotNull Meeple it) {
Intrinsics.checkNotNullParameter(it, "it");
return lambda; // It simply returns the lambda you defined, and the code to filter never gets invoked
}
}))
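The same distinction can be reproduced in plain Java, which may be easier to read than the decompiled output. This is a simplified sketch, not the code from the question: the class name is made up, map is a static helper rather than an extension function, and the list holds Strings instead of Meeple objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class TrailingLambdaSketch {
    // Plain-Java counterpart of the List<T>.map extension function.
    static <T, O> List<O> map(List<T> input, Function<T, O> mapper) {
        List<O> target = new ArrayList<>();
        for (T t : input) {
            target.add(mapper.apply(t));
        }
        return target;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Jim", "Cassidy", "David");
        Function<String, String> lambda = n -> n.equals("Jim") ? null : n;

        // Like meeples.map(lambda): the predefined function runs on each element.
        List<String> applied = map(names, lambda);

        // Like meeples.map { lambda }: a NEW function that ignores its argument
        // and returns the lambda object itself, so the filter code never runs.
        List<Function<String, String>> wrapped = map(names, n -> lambda);

        System.out.println(applied);                  // [null, Cassidy, David]
        System.out.println(wrapped.get(0) == lambda); // true
    }
}
```

In the second call the result is a list of function objects, which is why printing each element shows the function's string form instead of a name.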

How best to return a single value of different types from function

I have a function that returns either an error message (String) or a Firestore DocumentReference. I was planning to use a class containing both and testing if the error message is non-null to detect an error and if not then the reference is valid. I thought that was far too verbose however, and then thought it may be neater to return a var. Returning a var is not allowed however. Therefore I return a dynamic and test if result is String to detect an error.
For example:
dynamic varResult = insertDoc(_sCollection,
dataRec.toJson());
if (varResult is String) {
Then after checking for compliance, I read the following from one of the gurus:
"It is bad style to explicitly mark a function as returning Dynamic (or var, or Any or whatever you choose to call it). It is very rare that you need to be aware of it (only when instantiating a generic with multiple type arguments where some are known and some are not)."
I'm quite happy using dynamic for the return value if that is appropriate, but generally I try to comply with best practice. I am also very aware of bloated software and I go to extremes to avoid it. That is why I didn't want to use a Class for the return value.
What is the best way to handle the above situation where the return type could be a String or alternatively some other object, in this case a Firestore DocumentReference (emphasis on very compact code)?
One option would be to create an abstract state class. Something like this:
abstract class DocumentInsertionState {
const DocumentInsertionState();
}
class DocumentInsertionError extends DocumentInsertionState {
final String message;
const DocumentInsertionError(this.message);
}
class DocumentInsertionSuccess<T> extends DocumentInsertionState {
final T object;
const DocumentInsertionSuccess(this.object);
}
class Test {
void doSomething() {
final state = insertDoc();
if (state is DocumentInsertionError) {
}
}
DocumentInsertionState insertDoc() {
try {
return DocumentInsertionSuccess("It worked");
} catch (e) {
return DocumentInsertionError(e.toString());
}
}
}
Full example here: https://github.com/ReactiveX/rxdart/tree/master/example/flutter/github_search

How to add a UDF with variable number parameter in calcite?

I'm using Apache Calcite to validate SQL. I add tables and UDFs dynamically.
The problem is that when I add a UDF with a variable number of parameters, the validator cannot find the function.
The Calcite version is 1.18.0.
And this is my code.
TestfuncFunction.java
public class TestfuncFunction {
public String testfunc(String... arg0) {
return null;
}
}
Add UDF
Function schemafunction = ScalarFunctionImpl.create(TestfuncFunction.class, "testfunc");
SchemaPlus schemaPlus = Frameworks.createRootSchema(true);
schemaPlus.add("testfunc", schemafunction);
SQL
select testfunc(field1, field2) from test_table
testfunc is a ScalarFunction with a variable number of parameters; field1 and field2 are columns of test_table. So this is legal SQL. But I got this CalciteContextException when validating:
No match found for function signature testfunc(<CHARACTER>, <CHARACTER>)
I tried changing my SQL to pass a single parameter, like this:
select testfunc(field1) from test_table
and got this exception
java.lang.AssertionError: No assign rules for OTHER defined
at org.apache.calcite.sql.type.SqlTypeAssignmentRules.canCastFrom(SqlTypeAssignmentRules.java:386)
at org.apache.calcite.sql.type.SqlTypeUtil.canCastFrom(SqlTypeUtil.java:864)
at org.apache.calcite.sql.SqlUtil.lambda$filterRoutinesByParameterType$4(SqlUtil.java:554)
...
It seems that Calcite transforms the Java array type into SqlTypeName.OTHER.
I have tried to override the method createJavaType in JavaTypeFactoryImpl like this:
private static class CustomJavaTypeFactoryImpl extends JavaTypeFactoryImpl {
@Override
public RelDataType createJavaType(Class clazz) {
if (clazz.isArray()) {
return new ArraySqlType(super.createJavaType(clazz.getComponentType()), true);
}
return super.createJavaType(clazz);
}
}
but it did not work.
Does Calcite support UDFs with a variable number of parameters, and if so, what should I do?
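On the observation above about array types: at the reflection level, a Java varargs method has exactly one parameter, and that parameter is an array. The sketch below only demonstrates this JDK behavior. It makes no claim about how Calcite resolves or validates functions, and the wrapper class name VarargsSketch is made up for the example.

```java
import java.lang.reflect.Method;

public class VarargsSketch {
    // Same shape as TestfuncFunction in the question.
    public static class TestfuncFunction {
        public String testfunc(String... arg0) {
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // A varargs method is looked up with the array type of its last parameter.
        Method m = TestfuncFunction.class.getMethod("testfunc", String[].class);
        Class<?> param = m.getParameterTypes()[0];

        System.out.println(m.getParameterCount());                   // 1
        System.out.println(param.isArray());                         // true
        System.out.println(param.getComponentType().getSimpleName()); // String
        System.out.println(m.isVarArgs());                           // true
    }
}
```

Any code that inspects the method through reflection therefore sees one String[] parameter unless it checks isVarArgs() and expands it, which is consistent with the one-argument and two-argument calls failing in different ways.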

How to refer to an outer function from a lambda?

The question is in the comment. I want to refer to the outer function append, and not the one that's defined in the StringBuilder, how do I do this?
fun append(c: Char) {
println("TEST")
}
fun sbTest() = with(StringBuilder()) {
for(c in 'A'..'Z') {
append(c) // how do I refer to the append function declared above?
}
toString()
}
I know I can introduce a function reference variable, like this:
val f = ::append
and call f instead of append, but is there another way?
The problem is that anything called within with shadows the outer functions, because this is introduced. The same problem appears if you have a class and a top-level function with a function with the same signature.
The obvious option would just be re-naming it. Also, the function you have there isn't really descriptive compared to what it actually does. But if you for some reason can't rename, there are still other options.
Top-level functions can be referenced by package in Kotlin, for instance as com.some.package.method. They can also be imported, which is the most common way to do it. Very few functions are called as com.some.package.method().
Kotlin, like Python, allows as in imports. Which means, you can do this:
package com.example;
// This is the important line; it declares the import with a custom name.
import com.example.append as appendChar; // Just an example name; this could be anything ***except append***. If it's append, it defeats the purpose
fun append(c: Char) {
println("TEST")
}
fun sbTest() = with(StringBuilder()) {
for(c in 'A'..'Z') {
appendChar(c)
}
toString()
}
Alternatively, as I mentioned, you can add the package:
for(c in 'A'..'Z') {
com.example.append(c)
}
val f = ::append is of course an option too, either way, it is still easier to rename the function than create imports with as or constants, assuming you have access to the function (and that it doesn't belong to a dependency).
If your file is outside a package, which I do not recommend you do, you can just declare the import as:
import append as appendChar
You could also use an extension function instead of with(), such as .let{...}.
This will send StringBuilder as an argument to the extension function as it (You can rename it to whatever you want btw):
fun sbTest() = StringBuilder().let{ it ->
for(c in 'A'..'Z') {
// using string builder
it.append(c)
// using your function
append(c)
}
it.toString()
}
The .let{...} function returns your last statement, aka the String from toString(), so your original function would still return it properly. Other functions can return this instead, such as .also{...}
I tend to use extension functions rather than with() as they're more flexible.
See this post to master extension functions: https://medium.com/@elye.project/mastering-kotlin-standard-functions-run-with-let-also-and-apply-9cd334b0ef84
EDIT: Got also{} and let{} mixed up. I switched them

Does PetaPoco handle enums?

I'm experimenting with PetaPoco to convert a table into POCOs.
In my table, I've got a column named TheEnum. The values in this column are strings that represent the following enum:
public enum MyEnum
{
Fred,
Wilma
}
PetaPoco chokes when it tries to convert the string "Fred" into a MyEnum value.
It does this in the GetConverter method, in the line:
Convert.ChangeType( src, dstType, null );
Here, src is "Fred" (a string), and dstType is typeof(MyEnum).
The exception is an InvalidCastException, saying Invalid cast from 'System.String' to 'MyEnum'
Am I missing something? Is there something I need to register first?
I've got around the problem by adding the following into the GetConverter method:
if (dstType.IsEnum && srcType == typeof(string))
{
converter = delegate( object src )
{
return Enum.Parse( dstType, (string)src ) ;
} ;
}
Obviously, I don't want to run this delegate on every row as it'll slow things down tremendously. I could register this enum and its values into a dictionary to speed things up, but it seems to me that something like this would likely already be in the product.
So, my question is, do I need to do anything special to register my enums with PetaPoco?
Update 23rd February 2012
I submitted a patch a while ago but it hasn't been pulled in yet. If you want to use it, look at the patch and merge into your own code, or get just the code from here.
I'm using 4.0.3 and PetaPoco automatically converts enums to integers and back. However, I wanted to convert my enums to strings and back. Taking advantage of Steve Dunn's EnumMapper and PetaPoco's IMapper, I came up with this. Thanks guys.
Note that it does not handle Nullable<TEnum> or null values in the DB. To use it, set PetaPoco.Database.Mapper = new MyMapper();
class MyMapper : PetaPoco.IMapper
{
static EnumMapper enumMapper = new EnumMapper();
public void GetTableInfo(Type t, PetaPoco.TableInfo ti)
{
// pass-through implementation
}
public bool MapPropertyToColumn(System.Reflection.PropertyInfo pi, ref string columnName, ref bool resultColumn)
{
// pass-through implementation
return true;
}
public Func<object, object> GetFromDbConverter(System.Reflection.PropertyInfo pi, Type SourceType)
{
if (pi.PropertyType.IsEnum)
{
return dbObj =>
{
string dbString = dbObj.ToString();
return enumMapper.EnumFromString(pi.PropertyType, dbString);
};
}
return null;
}
public Func<object, object> GetToDbConverter(Type SourceType)
{
if (SourceType.IsEnum)
{
return enumVal =>
{
string enumString = enumMapper.StringFromEnum(enumVal);
return enumString;
};
}
return null;
}
}
You're right, handling enums is not built into PetaPoco and usually I just suggest doing exactly what you've done.
Note that this won't slow things down for requests that don't use the enum type. PetaPoco generates code to map responses to pocos so the delegate will only be called when really needed. In other words, the GetConverter will only be called the first time a particular poco type is used, and the delegate will only be called when an enum needs conversion. Not sure on the speed of Enum.Parse, but yes you could cache in a dictionary if it's too slow.
If you are using PetaPoco's T4 generation and you want enums in your generated type, you can use the PropertyType override in Database.tt:
tables["App"]["Type"].PropertyType = "Full.Namespace.To.AppType";
If you want to store the name of the enum value instead of its index number (1, 2, 4, for example), you can edit the update function in the PetaPoco class. When you add PetaPoco as a NuGet package, it adds the .cs file to your project, so the code is yours to change. Suppose we have the enum variable Color = {red, yellow, blue}.
Instead of:
// Store the parameter in the command
AddParam(cmd, pc.GetValue(poco), pc.PropertyInfo);
change to:
//enum?
if (i.Value.PropertyInfo.PropertyType.IsEnum)
{
AddParam(cmd, i.Value.GetValue(poco).ToString(), i.Value.PropertyInfo);
}
else
{
// Store the parameter in the command
AddParam(cmd, i.Value.GetValue(poco), i.Value.PropertyInfo);
}
It would store "yellow" instead of 2