How to read data from data bag within a PIG script - apache-pig

I have a databag which is in the following format:
{([ChannelName#{ (bigXML,[])} ])}
The DataBag consists of only one item, which is a Tuple.
The Tuple consists of only one item, which is a Map.
The Map maps channel names to values.
Each value is of type DataBag, which consists of only one tuple.
That tuple consists of two items: a chararray (a very big string) and a map.
I have a UDF that emits the above bag.
Now I need to invoke another UDF, passing it the only tuple within the DataBag for a given channel from the Map.
If there were no data bag, just a tuple, as in
([ChannelName#{ (bigXML,[])} ])
I can access the data using $0.$0#'StdOutChannel'
Now with the tuple inside a bag
{([ChannelName#{ (bigXML,[])} ])}
If I do $0.$0.$0#'StdOutChannel' (prepending $0), I get the following error:
ERROR 1052: Cannot cast bag with schema bag({bytearray}) to map
How can I access data within a data bag?

Try to break this problem down a little.
Let's say you get your inner bag:
MYBAG = $0.$0#'StdOutChannel';
First, can you ILLUSTRATE or DUMP this?
What can you do with this bag? Usually FOREACH over the tuples inside.
A = FOREACH MYBAG {
    GENERATE $0 AS MyCharArray, $1 AS MyMap
};
ILLUSTRATE A; -- or if this doesn't work
DUMP A;
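If ILLUSTRATE confirms the structure described in the question, a fuller sketch of the unwrapping might look like this (untested; RAW stands for the relation produced by your UDF):
-- RAW: {({([channelMap#{(bigXML,[])}])})} as described in the question
B = FOREACH RAW GENERATE FLATTEN($0);                -- unwrap the outer bag: one map per row
C = FOREACH B GENERATE FLATTEN($0#'StdOutChannel');  -- look up the channel, unwrap its bag
-- C should now have two fields per row: the big chararray and the inner map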
Can you try this interactively, and then edit your question with some details of what you find?
Some editing hints for StackOverflow:
put backticks around your code (`ILLUSTRATE`)
indent code blocks by 4 spaces on each line

Related

How to chain filter expressions together

I have data in the following format
ArrayList<Map.Entry<String,ByteString>>
[
{"a":[a-bytestring]},
{"b":[b-bytestring]},
{"a:model":[amodel-bytestring]},
{"b:model":[bmodel-bytestring]},
]
I am looking for a clean way to transform this data into the format List<Map.Entry<ByteString,ByteString>>, where the key is the value of a and the value is the value of a:model.
Desired output
List<Map.Entry<ByteString,ByteString>>
[
{[a-bytestring]:[amodel-bytestring]},
{[b-bytestring]:[bmodel-bytestring]}
]
I assume this will involve the use of filters or other map operations, but I am not familiar enough with Kotlin yet to know how.
It's not possible to give an exact, tested answer without access to the ByteString class — but I don't think that's needed for an outline, as we don't need to manipulate byte strings, just pass them around. So here I'm going to substitute Int; it should be clear and avoid any dependencies, but still work in the same way.
I'm also going to use a more obvious input structure, which is simply a map:
val input = mapOf("a" to 1,
                  "b" to 2,
                  "a:model" to 11,
                  "b:model" to 12)
As I understand it, what we want is to link each key without :model with the corresponding one with :model, and return a map of their corresponding values.
That can be done like this:
val output = input.filterKeys{ !it.endsWith(":model") }
                  .map{ it.value to input["${it.key}:model"] }.toMap()
println(output) // Prints {1=11, 2=12}
The first line filters out all the entries whose keys end with :model, leaving only those without. Then the second creates a map from their values to the input values for the corresponding :model keys. (Unfortunately, there's no good general way to create one map directly from another; here map() creates a list of pairs, and then toMap() creates a map from that.)
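(As an aside, entries.associate can build the result map in a single pass, avoiding the intermediate list of pairs; an untested sketch of the same logic:
// Alternative: associate builds the map directly from the entries,
// skipping the intermediate List<Pair> created by map() + toMap().
val output2 = input.filterKeys { !it.endsWith(":model") }
    .entries.associate { (k, v) -> v to input["$k:model"] }
)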
I think if you replace Int with ByteString (or indeed any other type!), it should do what you ask.
The only thing to be aware of is that the output is a Map<Int, Int?> — i.e. the values are nullable. That's because there's no guarantee that each input key has a corresponding :model key; if it doesn't, the result will have a null value. If you want to omit those, you could call filterValues{ it != null } on the result.
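For example, a sketch of that null-filtering step (the !! is then safe, since the nulls were just removed):
// Drop pairs whose ":model" counterpart was missing, and narrow the
// value type from Int? back to Int.
val cleaned: Map<Int, Int> = output
    .filterValues { it != null }
    .mapValues { it.value!! }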
However, if there's an ‘orphan’ :model key in the input, it will be ignored.

Get field names from a TFRecord

Given a .tfrecord file, we can define a record iterator
record_iterator = tf.python_io.tf_record_iterator(path=record)
Then parse it using
example = tf.train.SequenceExample()
for element in record_iterator:
    example.ParseFromString(element)
The question is: how do we infer all the field names in the context?
If we know the structure in advance, we can say example.context.feature["width"]. In addition, str(example.context) returns a string with the entire record structure. However, I was wondering if there is any built-in function to get the field names and avoid parsing this string (e.g. by searching for "key")
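One untested possibility: since the parsed example is a compiled protobuf, map fields such as context.feature expose their keys directly, so the names can be enumerated without parsing the string form. A minimal sketch using the same TF 1.x API as the question (record is the path from above):
import tensorflow as tf

record_iterator = tf.python_io.tf_record_iterator(path=record)
example = tf.train.SequenceExample()
for element in record_iterator:
    example.ParseFromString(element)
    # context.feature and feature_lists.feature_list are protobuf map
    # fields, so the field names are simply the map keys
    context_keys = list(example.context.feature.keys())
    sequence_keys = list(example.feature_lists.feature_list.keys())
    print(context_keys, sequence_keys)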

Neighbours list extracted out of polygon regions

I've got a SQL database which contains some coded polygon structures. Those can be extracted as follows
poly <- data.frame(sqldf("SELECT ST_astext(geometry) FROM table"))
The data.frame 'poly' contains strings that now can be converted to real 'SpatialPolygons' objects as follows (for the first string)
readWKT(poly[1,1])
I can do the previous for each string, and save it in a vector
list <- c()
for (i in 1:100){
    list <- c(list, readWKT(poly[i,1]))
}
The last thing I want to do, is to create a neighbourhood list, based on all the SpatialPolygons by making use of the following function
poly2nb(list)
But sadly, this command results in the following error
Error: extends(class(pl), "SpatialPolygons") is not TRUE
I know that the problem has something to do with the class of the list, but I really don't see a way out. Any help will be appreciated!
Edit
As suggested, some parts of the output. Keep in mind that the rows of 'poly' are really long strings of coordinates
> poly[1,1]
[1] "POLYGON((4.155976 50.78233,...,4.153225 50.76121,4.152384 50.761191,4.151878 50.761194,4.151319 50.761163,4.150872 50.761126))"
> poly[2,1]
[1] "POLYGON((5.139526 50.914059,...,5.140994 50.913612,5.156976 50.895945))"
This seems to work:
list <- lapply(1:2, function(i) readWKT(poly[i,1], id=i))
sp <- SpatialPolygons(lapply(list, function(sp) sp@polygons[[1]]))
library(spdep)
poly2nb(sp)
The internal structure of SpatialPolygons is rather complex. A SpatialPolygons object is a collection (list) of Polygons objects (which represent geographies), and each of these is a list of Polygon objects, which represent geometric shapes. So for example, a SpatialPolygons object that represents US states, has 50 or so Polygons objects (one for each state), and each of those can have multiple Polygon objects (if the state is not contiguous, e.g. has islands, etc.).
It looks like poly2nb(...) takes a single SpatialPolygons object and calculates neighborhood structure based on the contained list of Polygons objects. You were passing a list of SpatialPolygons objects.
So the challenge is to convert the result of your SQL query to a single SpatialPolygons object. readWKT(...) converts each row to a SpatialPolygons object, each of which contains exactly one Polygons object. So you have to extract those and re-assemble them into a single SpatialPolygons object. The line:
sp <- SpatialPolygons(lapply(list, function(sp) sp@polygons[[1]]))
does that. The line before:
list <- lapply(1:2,function(i)readWKT(poly[i,1],id=i))
replaces your for (...) loop and also adds a polygon id to each polygon, which is necessary for the call to SpatialPolygons(...).
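Putting it together for all rows of poly, an untested sketch (assumes readWKT comes from rgeos, and sp/spdep are installed):
library(rgeos)  # readWKT
library(sp)     # SpatialPolygons
library(spdep)  # poly2nb

polys <- lapply(seq_len(nrow(poly)), function(i) readWKT(poly[i, 1], id = i))
sp    <- SpatialPolygons(lapply(polys, function(p) p@polygons[[1]]))
nb    <- poly2nb(sp)
summary(nb)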

ANTLR String Template - How to specify a template with two multi-valued attributes?

I have a scenario where I need to instantiate an anonymous template using two multi-valued attributes.
Here is what I tried from reading the documentation:
<mva_set1, mva_set2:{ x, y | Permit,IP,,<x>,,<y>,,,}; separator="\n">
For the sake of illustrating the problem, let's assume I am supplying the following two arrays to the two attributes:
string[] input_1 = new string[] { "128.230.0.0/16", "10.20.0.0/16" };
string[] input_2 = new string[] { "131.230.0.0/16", "154.20.0.0/16" };
I then supplied the attributes using the following two calls:
template.Add("mva_set1", input_1);
template.Add("mva_set2", input_2);
The result surprised me. I thought I would get four rows, because I expected the template to be instantiated for each of the four pairs. However, what I got was just two rows, from instantiating two pairs:
Permit IP 128.230.0.0/16 131.230.0.0/16
Permit IP 10.20.0.0/16 154.20.0.0/16
Am I using this incorrectly? Is there a better alternative way to do this?
The number of rows you get is the maximum length of any one list: parallel iteration pairs the attributes up rather than taking their cross product.
Ter
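If the cross product (four rows here) is what you actually want, one untested alternative is to nest the iterations instead of running them in parallel:
<mva_set1:{x | <mva_set2:{y | Permit,IP,,<x>,,<y>,,,}; separator="\n">}; separator="\n">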

apache pig group by output -- remove "(" and "{"

I do the following:
a = load '/hive/warehouse/' USING PigStorage('^') as (a1,b1,c1);
b = group a by (a1);
c = foreach b generate group, a.$2;
dump c;
Output shows all the groups:
abc {(1),(44),(66)}
cde {(1),(44),(66)}
How can I remove the "{" and "(" characters so that the final HDFS file can be read as a comma-delimited file?
You can't do this directly in Pig. The special syntax is required because you are storing a bag, and in order for Pig to be able to read this bag later, it needs to be stored with braces (for the bag) and parentheses (for the tuples contained in the bag).
You have a couple of options. You can read the file back into Pig, but instead of reading it as a bag, read it as a chararray. Then you can perform regex substitution to get rid of the punctuation (untested):
a = LOAD 'output' AS (grp:chararray, list:chararray); -- 'group' is a reserved word in Pig, hence the alias grp
b = FOREACH a GENERATE grp, REPLACE(list, '[{()}]', '');
Another option is to write a UDF which will turn a bag into a tuple. Note that this is not a well-defined operation: bags have no particular order, so from one run to the next, your tuple is not guaranteed to be in the same order. But for your purposes it sounds like that may not matter. The UDF could look like (very rough draft, untested):
import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class BAG_TO_TUPLE extends EvalFunc<Tuple> {
    public Tuple exec(Tuple input) throws IOException {
        // The first field of the input tuple is the bag to flatten.
        DataBag bag = (DataBag) input.get(0);
        Iterator<Tuple> iterator = bag.iterator();
        Tuple out = TupleFactory.getInstance().newTuple();
        while (iterator.hasNext()) {
            // Assumes each tuple in the bag has exactly one field we care about.
            out.append(iterator.next().get(0));
        }
        return out;
    }
}
The above UDF is terrible -- it assumes that you have exactly one element in every tuple of the bag (that you care about) and does no checking whatsoever that the input is valid, etc. But it should get you towards what you want.
The best solution, though, is to find a way to handle the extra punctuation outside of Pig if Pig is not part of your downstream processing.
This functionality is now provided in Pig as a built-in func (I'm using 0.11).
http://pig.apache.org/docs/r0.11.0/api/org/apache/pig/builtin/BagToString.html
c = foreach b generate group, a.$2 as stuff;
d = foreach c generate group, BagToString(stuff, ',');
I don't need a comma-delimited file for my use case, but I assume you can use a store func to get the final comma (between the group and the now-comma-delimited list of bag items).
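For example, a sketch of that final store step (out_csv is a placeholder path):
STORE d INTO 'out_csv' USING PigStorage(','); -- writes group,item1,item2,... per line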
Try the FLATTEN operator:
c = foreach b generate group, FLATTEN(a.$2);