apache pig & piggybank avro union types - apache-pig

I have a record of union type of
union {TypeA, TypeB, TypeC, TypeD, TypeE} mydata;
I have the serialized data in avro format, however when I am trying to use piggybank.jar's AvroStorage function to load the avro data, it gives me the following error:
Caused by: java.io.IOException: We don't accept schema containing generic unions.
at org.apache.pig.piggybank.storage.avro.AvroSchema2Pig.convert(AvroSchema2Pig.java:54)
at org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:384)
at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:174)
... 23 more
So, after reading the piggybank source code here https://github.com/triplel/pig/blob/branch-0.12/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorageUtils.java
/** determine whether a union is a nullable union;
* note that this function doesn't check containing
* types of the input union recursively. */
public static boolean isAcceptableUnion(Schema in) {
if (! in.getType().equals(Schema.Type.UNION))
return false;
List<Schema> types = in.getTypes();
if (types.size() <= 1) {
return true;
} else if (types.size() > 2) {
return false; /*contains more than 2 types */
} else {
/* one of two types is NULL */
return types.get(0).getType().equals(Schema.Type.NULL) || types.get(1) .getType().equals(Schema.Type.NULL);
}
}
basically piggybank's AvroStorage does not support more than 2 union types, I am wondering what is the idea behind this decision? Why not just make it compatible with Avro?

Related

Room database query returns null where it should be null-safe in Kotlin

I have a query that does not have a result, when the DB is empty. Therefore NULL is the correct return value.
However, the compiler in Android Studio gives me the warning: Condition 'maxDateTime != null' is always 'true'.
If I debug the code, the null check performs correctly as the value is actually null.
When I rewrite the interface to 'fun queryMaxServerDate(): String?' (notice the question mark), the compiler warning goes away.
But should not 'fun queryMaxServerDate(): String' result in a compilation error since it can be null?
#Dao
interface CourseDao {
// Get latest downloaded entry
#Query("SELECT MAX(${Constants.COL_SERVER_LAST_MODIFIED}) from course")
fun queryMaxServerDate(): String
}
// calling function
/**
* #return Highest server date in table in milliseconds or 1 on empty/error.
*/
fun queryMaxServerDateMS(): Long {
val maxDateTime = courseDao.queryMaxServerDate()
var timeMS: Long = 0
if (maxDateTime != null) { // Warning: Condition 'maxDateTime != null' is always 'true'
timeMS = TimeTools.parseDateToMillisOrZero_UTC(maxDateTime)
}
return if (timeMS <= 0) 1 else timeMS
}
The underlying code generated by the annotation is java and thus the exception to null safety as per :-
Kotlin's type system is aimed to eliminate NullPointerException's from
our code. The only possible causes of NPE's may be:
An explicit call to throw NullPointerException(); Usage of the !!
operator that is described below;
Some data inconsistency with regard
to initialization, such as when:
An uninitialized this available in a
constructor is passed and used somewhere ("leaking this");
A
superclass constructor calls an open member whose implementation in
the derived class uses uninitialized state;
Java interoperation:
Attempts to access a member on a null reference of a platform type;
Generic types used for Java interoperation with incorrect nullability,
e.g. a piece of Java code might add null into a Kotlin
MutableList, meaning that MutableList should be used
for working with it;
Other issues caused by external Java code.
Null Safety
e.g. the generated code for queryMaxServerDate() in CourseDao would be along the lines of :-
#Override
public String queryMaxServerDate() {
final String _sql = "SELECT max(last_mopdified) from course";
final RoomSQLiteQuery _statement = RoomSQLiteQuery.acquire(_sql, 0);
__db.assertNotSuspendingTransaction();
final Cursor _cursor = DBUtil.query(__db, _statement, false, null);
try {
final String _result;
if(_cursor.moveToFirst()) {
final String _tmp;
_tmp = _cursor.getString(0);
_result = _tmp;
} else {
_result = null;
}
return _result;
} finally {
_cursor.close();
_statement.release();
}
}
As you can see, no data extracted (no first row) and null is returned.

Flatbuffers: How to allow multiple types for a single field

I'm writing a communication protocol schema for a list of parameters which can be of multiple values: uint64, float64, string or bool.
How can I set a table field to a union of multiple primitive scalar & non-scalar primitive type?
I've already tried using a union of those types, but I end up with the following error when building:
$ schemas/foobar.fbs:28: 0: error: type referenced but not defined
(check namespace): uint64, originally at: schemas/request.fbs:5
Here's the schema in its current state:
namespace Foobar;
enum RequestCode : uint16 { Noop, Get, Set, BulkGet, BulkSet }
union ParameterValue { uint64, float64, bool, string }
table Parameter {
name:string;
value:ParameterValue;
unit:string;
}
table Request {
code:RequestCode = Noop;
payload:[Parameter];
}
table Result {
request:Request;
success:bool = true;
payload:[Parameter];
}
The end result I'm looking for is the Request and Result tables to contain a list of parameters, where a parameter contains a name and value, and optionally the units.
Thx in advance!
Post-answer solution:
Here's what I came up with in the end, thx to Aardappel.
namespace foobar;
enum RequestCode : uint16 { Noop, Get, Set, BulkGet, BulkSet }
union ValueType { UnsignedInteger, SignedInteger, RealNumber, Boolean, Text }
table UnsignedInteger {
value:uint64 = 0;
}
table SignedInteger {
value:int64 = 0;
}
table RealNumber {
value:float64 = 0.0;
}
table Boolean {
value:bool = false;
}
table Text {
value:string (required);
}
table Parameter {
name:string (required);
valueType:ValueType;
unit:string;
}
table Request {
code:RequestCode = Noop;
payload:[Parameter];
}
table Result {
request:Request (required);
success:bool = true;
payload:[Parameter];
}
You currently can't put scalars directly in a union, so you'd have to wrap these in a table or a struct, where struct would likely be the most efficient, e.g.
struct UInt64 { u:uint64 }
union ParameterValue { UInt64, Float64, Bool, string }
This is because a union must be uniformly the same size, so it only allows types to which you can have an offset.
Generally though, FlatBuffers is a strongly typed system, and the schema you are creating here is undoing that by emulating dynamically typed data, since your data is essentially a list of (string, any type) pairs. You may be better off with a system designed for this particular use case, such as FlexBuffers (https://google.github.io/flatbuffers/flexbuffers.html, currently only C++) which explicitly has a map type that is all string -> any type pairs.
Of course, even better is to not store data so generically, but instead make a new schema for each type of request and response you have, and make parameter names into fields, rather than serialized data. This is by far the most efficient, and type safe.

Use multiple return statement

Is there a considerable difference of optimization between these two codes (in Java and/or C++, currently, even if I guess it's the same in every languages) ? Or is it just a question of code readability ?
int foo(...) {
if (cond) {
if (otherCondA)
return 1;
if (otherCondB)
return 2;
return 3;
}
int temp = /* context and/or param-dependent */;
if (otherCondA)
return 4 * temp;
if (otherCondB)
return 4 / temp;
return 4 % temp;
}
and
int foo(...) {
int value = 0;
if (cond) {
if (otherCondA)
value = 1;
else if (otherCondB)
value = 2;
else value = 3;
}
else {
int temp = /* context and/or param-dependent */;
if (otherCondA)
value = 4 * temp;
else if (otherCondB)
value = 4 / temp;
else
value = 4 % temp;
}
return value;
}
The first one is shorter, avoid multiple imbrications of else statement and economize one variable (or at least seems to do so), but I'm not sure that it really changes something...
After looking deeper into the different assembly codes generated by GCC, here's the results :
The multiple return statement is more efficient during "normal" compilation, but with the -O_ flag, the balance change :
The more you optimise the code, the less the first approach worths. It makes the code harder to optimise, so, use it carefully. As said in comments, it's very powerful when used at the front of the function when testing preconditions, but in the middle of the function, it's a nightmare for the compiler.
Of course the multiple return is acceptable.
Because you can halt the program as soon as the function is finished

CallableStatement + registerOutParameter + multiple row result

I've got a SQL statement of the form:
BEGIN\n
UPDATE tab
SET stuff
WHERE stuff
RETURNING intA, intB, stringC
INTO ?,?,?
I've registered the appropriate Out parameters.
Here's where I have some questions: Do I call stmt.executeQuery() or stmt.execute()? Further, I know with a normal SELECT query I can loop through the resultSet and populate my object -- what's the equivalent for multiple rows of Out parameters?
EDIT:
Perhaps I can register a single out parameter of type CURSOR and loop over this result.
EDIT2:
Could I potentially have multiple resultSet's that I need to loop over?
Thanks!
I believe you can achieve what you are looking for, but you will need to handle PL/SQL arrays rather than cursors or result sets. Below is a demonstration.
I have a table, called TEST, with the following structure:
SQL> desc test;
Name Null? Type
----------------------------------------- -------- -----------------
A NUMBER(38)
B NUMBER(38)
C NUMBER(38)
and containing the following data:
SQL> select * from test;
A B C
---------- ---------- ----------
1 2 3
4 5 6
7 8 9
I need to create an array type for each type of column used. Here, I only have NUMBERs, but if you have one or more VARCHAR2 columns as well, you'll need to create a type for those too.
SQL> create type t_integer_array as table of integer;
2 /
Type created.
The table and any necessary types are all we need to set up in the database. Once we've done that, we can write a short Java class that does an UPDATE ... RETURNING ..., returning multiple values to Java:
import java.math.BigDecimal;
import java.util.Arrays;
import java.sql.*;
import oracle.sql.*;
import oracle.jdbc.*;
public class UpdateWithBulkReturning {
public static void main(String[] args) throws Exception {
Connection c = DriverManager.getConnection(
"jdbc:oracle:thin:#localhost:1521:XE", "user", "password");
c.setAutoCommit(false);
/* You need BULK COLLECT in order to return multiple rows. */
String sql = "BEGIN UPDATE test SET a = a + 10 WHERE b <> 5 " +
"RETURNING a, b, c BULK COLLECT INTO ?, ?, ?; END;";
CallableStatement stmt = c.prepareCall(sql);
/* Register the out parameters. Note that the third parameter gives
* the name of the corresponding array type. */
for (int i = 1; i <= 3; ++i) {
stmt.registerOutParameter(i, Types.ARRAY, "T_INTEGER_ARRAY");
}
/* Use stmt.execute(), not stmt.executeQuery(). */
stmt.execute();
for (int i = 1; i <= 3; ++i) {
/* stmt.getArray(i) returns a java.sql.Array for the output parameter in
* position i. The getArray() method returns the data within this
* java.sql.Array object as a Java array. In this case, Oracle converts
* T_INTEGER_ARRAY into a Java BigDecimal array. */
BigDecimal[] nums = (BigDecimal[]) (stmt.getArray(i).getArray());
System.out.println(Arrays.toString(nums));
}
stmt.close();
c.rollback();
c.close();
}
}
When I run this, I get the following output:
C:\Users\Luke\stuff>java UpdateWithBulkReturning
[11, 17]
[2, 8]
[3, 9]
The outputs displayed are the values returned from the columns A, B and C respectively. There are only two values for each column since we filtered out the row with B equal to 5.
You might want the values grouped by row instead of grouped by column. In other words, you might want the output to contain [11, 2, 3] and [17, 8, 9] instead. If that's what you want, I'm afraid you'll need to do that part yourself.
To build upon what Luke Woodward answered and to refine my previous answer, you can create an Oracle type, use it to temporarily store data, and then return a sys_refcursor with your updates.
Create the new type:
CREATE OR REPLACE TYPE rowid_tab AS TABLE OF varchar2(30);
Create the database function:
CREATE OR REPLACE
FUNCTION update_tab
RETURN sys_refcursor
IS
ref_cur sys_refcursor;
v_tab rowid_tab;
BEGIN
UPDATE tab
SET intA = intA+2
, intB = intB*2
, stringC = stringC||' more stuff.'
RETURNING ROWID
BULK COLLECT INTO v_tab;
OPEN ref_cur FOR
WITH DATA AS (SELECT * FROM TABLE(v_tab))
SELECT intA, intB, stringC
FROM tab
where rowid in (select * from data);
RETURN ref_cur;
END;
Now, call the function in your java:
import java.math.BigDecimal;
import java.util.Arrays;
import java.sql.*;
import oracle.sql.*;
import oracle.jdbc.*;
public class StructTest {
public static void main(String[] args)
throws Exception
{
System.out.println("Start...");
ResultSet results = null;
Connection c = DriverManager.getConnection( "jdbc:oracle:thin:#localhost:1521:xe", "scott", "tiger");
c.setAutoCommit(false);
String sql = "begin ? := update_tab(); end;";
System.out.println("sql = "+sql);
CallableStatement stmt = c.prepareCall(sql);
/* Register the out parameter. */
System.out.println("register out param");
stmt.registerOutParameter(1, OracleTypes.CURSOR);
// get the result set
stmt.execute();
results = (ResultSet) stmt.getObject(1);
while (results.next()){
System.out.println("intA: "+results.getString(1)+", intB: "+results.getString(2)+", stringC: "+results.getString(3));
}
c.rollback();
c.close();
}
}
With my test data, I got the following results:
intA: 3, intB: 4, stringC: a more stuff.
intA: 6, intB: 10, stringC: C more stuff.
intA: 3, intB: 4, stringC: a more stuff.

SQL set-based range

How can I have SQL repeat some set-based operation an arbitrary number of times without looping? How can I have SQL perform an operation against a range of numbers? I'm basically looking for a way to do a set-based for loop.
I know I can just create a small table with integers in it, say from 1 to 1000 and then use it for range operations that are within that range.
For example, if I had that table I could make a select to find the sum of numbers 100-200 like this:
select sum(n) from numbers where n between 100 and 200
Any ideas? I'm kinda looking for something that works for T-SQL but any platform would be okay.
[Edit] I have my own solution for this using SQL CLR which works great for MS SQL 2005 or 2008. See below.
I think the very short answer to your question is to use WITH clauses to generate your own.
Unfortunately, the big names in databases don't have built-in queryable number-range pseudo-tables. Or, more generally, easy pure-SQL data generation features. Personally, I think this is a huge failing, because if they did it would be possible to move a lot of code that is currently locked up in procedural scripts (T-SQL, PL/SQL, etc.) into pure-SQL, which has a number of benefits to performance and code complexity.
So anyway, it sounds like what you need in a general sense is the ability to generate data on the fly.
Oracle and T-SQL both support a WITH clause that can be used to do this. They work a little differently in the different DBMS's, and MS calls them "common table expressions", but they are very similar in form. Using these with recursion, you can generate a sequence of numbers or text values fairly easily. Here is what it might look like...
In Oracle SQL:
WITH
digits AS -- Limit recursion by just using it for digits.
(SELECT
LEVEL - 1 AS num
FROM
DUAL
WHERE
LEVEL < 10
CONNECT BY
num = (PRIOR num) + 1),
numrange AS
(SELECT
ones.num
+ (tens.num * 10)
+ (hundreds.num * 100)
AS num
FROM
digits ones
CROSS JOIN
digits tens
CROSS JOIN
digits hundreds
WHERE
hundreds.num in (1, 2)) -- Use the WHERE clause to restrict each digit as needed.
SELECT
-- Some columns and operations
FROM
numrange
-- Join to other data if needed
This is admittedly quite verbose. Oracle's recursion functionality is limited. The syntax is clunky, it's not performant, and it is limited to 500 (I think) nested levels. This is why I chose to use recursion only for the first 10 digits, and then cross (cartesian) joins to combine them into actual numbers.
I haven't used SQL Server's Common Table Expressions myself, but since they allow self-reference, recursion is MUCH simpler than it is in Oracle. Whether performance is comparable, and what the nesting limits are, I don't know.
At any rate, recursion and the WITH clause are very useful tools in creating queries that require on-the-fly generated data sets. Then by querying this data set, doing operations on the values, you can get all sorts of different types of generated data. Aggregations, duplications, combinations, permutations, and so on. You can even use such generated data to aid in rolling up or drilling down into other data.
UPDATE: I just want to add that, once you start working with data in this way, it opens your mind to new ways of thinking about SQL. It's not just a scripting language. It's a fairly robust data-driven declarative language. Sometimes it's a pain to use because for years it has suffered a dearth of enhancements to aid in reducing the redundancy needed for complex operations. But nonetheless it is very powerful, and a fairly intuitive way to work with data sets as both the target and the driver of your algorithms.
I created a SQL CLR table valued function that works great for this purpose.
SELECT n FROM dbo.Range(1, 11, 2) -- returns odd integers 1 to 11
SELECT n FROM dbo.RangeF(3.1, 3.5, 0.1) -- returns 3.1, 3.2, 3.3 and 3.4, but not 3.5 because of float inprecision. !fault(this)
Here's the code:
using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Collections;
[assembly: CLSCompliant(true)]
namespace Range {
public static partial class UserDefinedFunctions {
[Microsoft.SqlServer.Server.SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic = true, SystemDataAccess = SystemDataAccessKind.None, IsPrecise = true, FillRowMethodName = "FillRow", TableDefinition = "n bigint")]
public static IEnumerable Range(SqlInt64 start, SqlInt64 end, SqlInt64 incr) {
return new Ranger(start.Value, end.Value, incr.Value);
}
[Microsoft.SqlServer.Server.SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic = true, SystemDataAccess = SystemDataAccessKind.None, IsPrecise = true, FillRowMethodName = "FillRowF", TableDefinition = "n float")]
public static IEnumerable RangeF(SqlDouble start, SqlDouble end, SqlDouble incr) {
return new RangerF(start.Value, end.Value, incr.Value);
}
public static void FillRow(object row, out SqlInt64 n) {
n = new SqlInt64((long)row);
}
public static void FillRowF(object row, out SqlDouble n) {
n = new SqlDouble((double)row);
}
}
internal class Ranger : IEnumerable {
Int64 _start, _end, _incr;
public Ranger(Int64 start, Int64 end, Int64 incr) {
_start = start; _end = end; _incr = incr;
}
public IEnumerator GetEnumerator() {
return new RangerEnum(_start, _end, _incr);
}
}
internal class RangerF : IEnumerable {
double _start, _end, _incr;
public RangerF(double start, double end, double incr) {
_start = start; _end = end; _incr = incr;
}
public IEnumerator GetEnumerator() {
return new RangerFEnum(_start, _end, _incr);
}
}
internal class RangerEnum : IEnumerator {
Int64 _cur, _start, _end, _incr;
bool hasFetched = false;
public RangerEnum(Int64 start, Int64 end, Int64 incr) {
_start = _cur = start; _end = end; _incr = incr;
if ((_start < _end ^ _incr > 0) || _incr == 0)
throw new ArgumentException("Will never reach end!");
}
public long Current {
get { hasFetched = true; return _cur; }
}
object IEnumerator.Current {
get { hasFetched = true; return _cur; }
}
public bool MoveNext() {
if (hasFetched) _cur += _incr;
return (_cur > _end ^ _incr > 0);
}
public void Reset() {
_cur = _start; hasFetched = false;
}
}
internal class RangerFEnum : IEnumerator {
double _cur, _start, _end, _incr;
bool hasFetched = false;
public RangerFEnum(double start, double end, double incr) {
_start = _cur = start; _end = end; _incr = incr;
if ((_start < _end ^ _incr > 0) || _incr == 0)
throw new ArgumentException("Will never reach end!");
}
public double Current {
get { hasFetched = true; return _cur; }
}
object IEnumerator.Current {
get { hasFetched = true; return _cur; }
}
public bool MoveNext() {
if (hasFetched) _cur += _incr;
return (_cur > _end ^ _incr > 0);
}
public void Reset() {
_cur = _start; hasFetched = false;
}
}
}
and I deployed it like this:
create assembly Range from 'Range.dll' with permission_set=safe -- mod path to point to actual dll location on disk.
go
create function dbo.Range(#start bigint, #end bigint, #incr bigint)
returns table(n bigint)
as external name [Range].[Range.UserDefinedFunctions].[Range]
go
create function dbo.RangeF(#start float, #end float, #incr float)
returns table(n float)
as external name [Range].[Range.UserDefinedFunctions].[RangeF]
go
This is basically one of those things that reveal SQL to be less than ideal. I'm thinking maybe the right way to do this is to build a function that creates the range. (Or a generator.)
I believe the correct answer to your question is basically, "you can't".
(Sorry.)
You can use a common table expression to do this in SQL2005+.
WITH CTE AS
(
SELECT 100 AS n
UNION ALL
SELECT n + 1 AS n FROM CTE WHERE n + 1 <= 200
)
SELECT n FROM CTE
If using SQL Server 2000 or greater, you could use the table datatype to avoid creating a normal or temporary table. Then use the normal table operations on it.
With this solution you have essentially a table structure in memory that you can use almost like a real table, but much more performant.
I found a good discussion here: Temporary tables vs the table data type
Here's a hack you should never use:
select sum(numberGenerator.rank)
from
(
select
rank = ( select count(*)
from reallyLargeTable t1
where t1.uniqueValue > t2.uniqueValue ),
t2.uniqueValue id1,
t2.uniqueValue id2
from reallyLargeTable t2
) numberGenerator
where rank between 1 and 10
You can simplify this using the Rank() or Row_Number functions in SQL 2005