How can I generate a schema from a text file? (Hadoop-Pig) - apache-pig

I somehow ended up with a file filename.log which looks like this, for example (tab-separated):
Name:Peter Age:18
Name:Tom Age:25
Name:Jason Age:35
Because the keys may differ from line to line, I cannot define a schema when I load the text, like:
a = load 'filename.log' as (Name:chararray,Age:int);
Nor do I want to refer to columns by position, like:
b = foreach a generate $0,$1;
What I want, given only that filename.log, is to be able to refer to each value by its key, for example:
a = load 'filename.log' using PigStorage('\t');
b = group a by Name;
c = foreach b generate group, COUNT(a);
dump c;
For that purpose, I wrote a Java UDF which separates each key:value pair and extracts the value for every field in the tuple, as below:
import java.util.ArrayList;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SPLITALLGETCOL2 extends EvalFunc<Tuple> {
    @Override
    public Tuple exec(Tuple input) {
        TupleFactory mTupleFactory = TupleFactory.getInstance();
        ArrayList<String> mProtoTuple = new ArrayList<String>();
        // Strip the surrounding parentheses from the tuple's string form.
        String target = input.toString().substring(1, input.toString().length() - 1);
        String[] tokenized = target.split(",");
        try {
            // Keep only the value part of each key:value field.
            for (int i = 0; i < tokenized.length; i++) {
                mProtoTuple.add(tokenized[i].split(":")[1]);
            }
            return mTupleFactory.newTupleNoCopy(mProtoTuple);
        } catch (Exception e) {
            // On failure, return whatever has been collected so far.
            return mTupleFactory.newTupleNoCopy(mProtoTuple);
        }
    }
}
How should I alter this method to get what I want? Or how should I write a different UDF to get there?

Whatever you do, don't use a tuple to store the output. Tuples are intended to store a fixed number of fields, where you know what every field contains. Since you don't know that the keys will be in Name,Age form (or even exist, or that there won't be more) you should use a bag. Bags are unordered sets of tuples. They can have any number of tuples in them as long as the tuples have the same schema. These are all valid bags for the schema B: {T:(key:chararray, value:chararray)}:
{(Name,Foo),(Age,Bar)}
{(Age,25),(Name,Jim)}
{(Name,Bob)}
{(Age,30),(Name,Roger),(Hair Color,Brown)}
{(Hair Color,),(Name,Victor)} -- Note the Null value for Hair Color
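If you want to keep writing this in Java, here is a minimal sketch of a UDF that builds such a bag. Everything in it is my own untested assumption: the class name, splitting on tabs, and loading each line into a single chararray field first.

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical sketch: turns "Name:Peter\tAge:18" into {(Name,Peter),(Age,18)}.
public class KvToBag extends EvalFunc<DataBag> {
    private static final TupleFactory TF = TupleFactory.getInstance();
    private static final BagFactory BF = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        DataBag bag = BF.newDefaultBag();
        for (String kv : input.get(0).toString().split("\t")) {
            String[] parts = kv.split(":", 2); // split on the first ':' only
            Tuple t = TF.newTuple(2);
            t.set(0, parts[0]);                           // key
            t.set(1, parts.length > 1 ? parts[1] : null); // value, null if absent
            bag.add(t);
        }
        return bag;
    }
}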
However, it sounds like you really want a map:
myudf.py
@outputSchema('M:map[]')
def mapize(the_input):
    out = {}
    # the log is tab-separated, so split the line on tabs
    for kv in the_input.split('\t'):
        k, v = kv.split(':')
        out[k] = v
    return out
myscript.pig
register '../myudf.py' using jython as myudf ;
A = LOAD 'filename.log' USING TextLoader() AS (total:chararray) ; -- TextLoader keeps the whole tab-separated line in one field
B = FOREACH A GENERATE myudf.mapize(total) ;
-- Sample usage, grouping by the name key.
C = GROUP B BY M#'Name' ;
Using the # operator you can pull the value for any key you give out of the map. You can read more about maps in the Pig documentation.
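If you would rather stay in Java than register a Jython UDF, a rough equivalent is a UDF that returns a java.util.Map. This is my own untested sketch (made-up class name, same single-chararray assumption as the bag sketch above), not code from the original answer:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class Mapize extends EvalFunc<Map<String, String>> {
    @Override
    public Map<String, String> exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        Map<String, String> out = new HashMap<String, String>();
        for (String kv : input.get(0).toString().split("\t")) {
            String[] parts = kv.split(":", 2);
            out.put(parts[0], parts.length > 1 ? parts[1] : null);
        }
        return out;
    }

    // Mirror the Python decorator @outputSchema('M:map[]'): declare a map output.
    @Override
    public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema("M", DataType.MAP));
    }
}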

Related

How to join and filter 2 relations by columns in Pig?

I'm a novice at Pig scripting, and I'm trying to modify some existing Pig script to extract some data from log files.
E.g. I have 2 log files, one with this schema:
message Class {
    message Student {
        optional int32 uid = 1;
        optional string name = 2;
    }
    optional int32 cid = 1;
    repeated Student students = 2;
}
After loading, I think a bag (say, bag1) is created (correct me if I'm wrong):
bag1:
{
(uid1, {(cid11, name11), (cid12, name12), (cid13, name13), ...}),
(uid2, {(cid21, name21), (cid22, name22), (cid23, name23), ...}),
...
}
The other log file is simpler; the resulting bag (bag2) is like this:
bag2:
{
(name11),
(name13),
(name22),
...
}
What I want is to get all the rows from bag1 where any name from bag2 is contained inside the row, like:
result bag:
{
(uid1, (name11, name13)),
(uid2, (name22)),
}
I think I'll need to do some join/filter on these 2 bags, but I don't know how.
I tried a script snippet like the one below, but it's not even a valid script.
res = FOREACH bag1 {
    names = FOREACH students GENERATE name;
    xnames = JOIN names by name, bag2 by name;
    GENERATE cid, xnames;
};
FILTER res BY not IsEmpty(xnames);
So could anyone please give me some help with the script?
You won't be able to use JOIN inside a nested FOREACH; instead, you can flatten your tuples and then join with the second relation:
bag1_flat = FOREACH bag1 GENERATE $0 AS uid, FLATTEN($1);
bag1_flat = FOREACH bag1_flat GENERATE uid, $2 AS name;
An inner join will filter the lines:
bag12 = JOIN bag1_flat by name, bag2 by $0;
bag12 = FOREACH bag12 GENERATE bag1_flat::uid AS uid, bag1_flat::name AS name;
Finally, group by uid. Note that you won't get tuples, since tuples cannot have different sizes; you'll get bags:
bag12_group = GROUP bag12 BY uid;
res = FOREACH bag12_group GENERATE group AS uid, bag12.name AS names;

piglatin positional notation, how to specify an interval

Let's say I have a very wide data source:
big_thing = LOAD 'some_path' using MySpecialLoader;
Now I want to generate some smaller thing composed of a subset of big_thing's columns.
smaller_thing = FOREACH big_thing GENERATE
$21,$22,$23 ...... $257;
Is there a way to achieve this without having to write out all the columns?
I'm assuming yes, but my searches aren't coming up with much; I think I'm just using the wrong terminology.
EDIT:
So it looks like my question is being misunderstood. Since I'm a Python person, I'll give a Python analogy.
Say I have an array l1 which is made up of arrays, so it looks like a grid, right? Now I want the array l2 to be a subset of l1 such that l2 contains a bunch of columns from l1. I would do something like this:
l2 = [[l[a],l[b],l[c],l[d]] for l in l1]
# a,b,c,d are just some constants.
In Pig this is equivalent to something like:
smaller_thing = FOREACH big_thing GENERATE
$1,$22,$3,$21;
But I have a heck of a lot of columns, and the columns I'm interested in are all sequential, and there are a lot of them. In Python I would then do this:
l2 = [l[x:y] for l in l1]
# again, x and y are constants, e.g. x=20, y=180000000. See, lots of stuff I don't want to type out
My question is: what is the Pig equivalent of this?
smaller_thing = FOREACH big_thing GENERATE ?????
And what about stuff like this:
Python:
l2 = [l[x:y]+l[a:b]+[l[b],l[c],l[d]] for l in l1]
Pig:
smaller_thing = FOREACH big_thing GENERATE ?????
Yes, you can simply load the dataset without naming the columns, but loading the data with column names will help you identify the columns in later scripts.
A UDF can help you perform this kind of query. For example:
REGISTER UDF\path;
a = load 'data' as (a1);
b = foreach a generate col_gen(a1, 2, 4);
UDF:
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class col_gen extends EvalFunc<String> {
    @Override
    public String exec(Tuple tuple) throws IOException {
        String data = tuple.get(0).toString(); // the comma-separated row
        int x = (Integer) tuple.get(1);        // first column to keep
        int y = (Integer) tuple.get(2);        // last column to keep
        String[] fields = data.split(",");
        String result = fields[x];
        for (int i = x + 1; i <= y; i++) {
            result += "," + fields[i];
        }
        return result; // e.g. col_gen("a,b,c,d,e", 2, 4) returns "c,d,e"
    }
}
The answer can be found in this post: http://blog.cloudera.com/blog/2012/08/process-a-million-songs-with-apache-pig/
Distanced = FOREACH Different GENERATE artistLat..songPreview;
The .. says use everything from artistLat to songPreview.
The same thing can be done with positional notation, e.g. $1..$6. For your case:
smaller_thing = FOREACH big_thing GENERATE $21..$257;
You can also mix several ranges and single columns in one GENERATE, e.g. $20..$179, $200..$210, $300, which covers your combined-slices example.

hibernate how to assign ParameterList for the IN clause

I need to use an IN clause in a SQL statement.
I have data in one table with the type Int(11),
and I have a String from another table that is the criteria.
For example, in table A I have the value 3 of type Int.
In table/process B I have the String "0123".
I need to query table A to meet this criteria:
Select * from Table A where attrib_1 IN (0,1,2,3)
Because record n has the value 3, it should be returned.
So I'm trying to use .setParameterList, like this:
List<BloqueCruzamiento> bloques = session.createQuery("FROM BloqueCruzamiento AS b WHERE b.anio=:anio AND b.activo=true AND b.grupo=:categoria AND b.pr IN (:pr_set) ORDER BY b.nroParcela, b.cruza, b.pedigree")
.setParameter("anio", grupo.getAnio())
.setParameter("categoria", grupo.getCategoria())
.setParameterList("pr_set", pr_parm)
.list();
The crux is the "pr_set" parameter.
I want to know how to convert a String, "0123", to a Collection of Integers (0,1,2,3),
so I can pass this parameter to the setParameterList() method.
The approach I have right now is to convert the String to a char array, then loop and convert each element into an Integer array.
Can somebody give another solution?
Regards
You can use the code below to get a list from the String:
String s = "0123";
List<Integer> pr_parm = new ArrayList<Integer>();
for(int i=0;i<s.length();i++) {
if (Character.isDigit(s.charAt(i))) {
pr_parm.add(Integer.parseInt(String.valueOf(s.charAt(i))));
}
}
System.out.println(pr_parm);
Then you can use the list in your setParameterList("pr_set", pr_parm) call.
This was the solution:
String[] sele = grupo.getPr_sel().split(",");
Integer[] pr_parm_int = new Integer[sele.length];
for (int x = 0; x < sele.length; x++) {
    pr_parm_int[x] = Integer.valueOf(sele[x]);
}
The first line parses the string, splitting on the commas.

Can I pass parameters to UDFs in Pig script?

I am relatively new to Pig script. I would like to know if there is a way of passing parameters to Java UDFs in Pig.
Here is the scenario:
I have a log file which has different columns (each representing a primary key in another table). My task is to get the count of distinct primary key values in the selected column.
I have written a Pig script which does the job of getting the distinct primary keys and counting them.
However, I am now supposed to write a new UDF for each column. Is there a better way to do this? For instance, if I could pass the column position as a parameter to the UDF, it would avoid the need to write multiple UDFs.
The way to do it is by using DEFINE and the constructor of the UDF. So here is an example of a custom "splitter":
REGISTER com.sample.MyUDFs.jar;
DEFINE CommaSplitter com.sample.MySplitter(',');
B = FOREACH A GENERATE f1, CommaSplitter(f2);
Hopefully that conveys the idea.
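On the Java side, com.sample.MySplitter might look roughly like this. The body below is my own sketch; only the constructor mechanism is the point, and the splitting logic is a placeholder:

package com.sample;

import java.io.IOException;
import java.util.Arrays;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MySplitter extends EvalFunc<Tuple> {
    private final String delimiter;

    // Pig passes the DEFINE argument (the ',' above) into this constructor.
    public MySplitter(String delimiter) {
        this.delimiter = delimiter;
    }

    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        String[] parts = input.get(0).toString().split(delimiter);
        return TupleFactory.getInstance().newTuple(Arrays.asList(parts));
    }
}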
To pass parameters you do the following in your pigscript:
UDF(document, '$param1', '$param2', '$param3')
Edit: not sure if those params need to be wrapped in ' ' or not.
while in your UDF you do, for example:
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.UDFContext;

public class UDF extends EvalFunc<Boolean> {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return false;
        // Treat each parameter as a path and open it from HDFS.
        FileSystem fs = FileSystem.get(UDFContext.getUDFContext().getJobConf());
        String var1 = input.get(1).toString();
        InputStream var1In = fs.open(new Path(var1));
        String var2 = input.get(2).toString();
        InputStream var2In = fs.open(new Path(var2));
        String var3 = input.get(3).toString();
        InputStream var3In = fs.open(new Path(var3));
        return doyourthing(input.get(0).toString());
    }
}
Yes, you can pass any parameter in the Tuple input of your UDF:
exec(Tuple input)
and access it using
input.get(index)
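For instance, here is a minimal sketch (made-up class name and placeholder logic) where the Pig script would call MYUDF(f1, 'someoption'):

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class MYUDF extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2)
            return null;
        String value = input.get(0).toString();  // the column value
        String option = input.get(1).toString(); // the extra argument
        return option + ":" + value;             // placeholder logic
    }
}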

Find all available values for a field in lucene .net

If I have a field x that can contain a value of y or z etc., is there a way I can query so that I return only the values that have actually been indexed?
Example
x available settable values = test1, test2, test3, test4
Item 1 : Field x = test1
Item 2 : Field x = test2
Item 3 : Field x = test4
Item 4 : Field x = test1
Performing the required query would return a list of:
test1, test2, test4
I've implemented this before as an extension method:
public static class ReaderExtentions
{
    public static IEnumerable<string> UniqueTermsFromField(
        this IndexReader reader, string field)
    {
        var termEnum = reader.Terms(new Term(field));
        do
        {
            var currentTerm = termEnum.Term();
            // stop once the enumerator runs out or leaves this field
            if (currentTerm == null || currentTerm.Field() != field)
                yield break;
            yield return currentTerm.Text();
        } while (termEnum.Next());
    }
}
You can use it very easily like this:
var allPossibleTermsForField = reader.UniqueTermsFromField("FieldName");
That will return you what you want.
EDIT: I was skipping the first term above, due to some absent-mindedness. I've updated the code accordingly to work properly.
TermEnum te = indexReader.Terms(new Term("fieldx"));
do
{
    Term t = te.Term();
    if (t == null || t.Field() != "fieldx") break;
    Console.WriteLine(t.Text());
} while (te.Next());
You can use facets to return the first N values of a field if the field is indexed as a string, or indexed using KeywordTokenizer and no filters; this means the field is not tokenized but saved as it is.
Just set the following properties on the query (note: these are Solr query parameters):
facet=true
facet.field=fieldname
facet.limit=N //the number of values you want to retrieve
I think a WildcardQuery searching on field 'x' and value of '*' would do the trick.
I once used Lucene 2.9.2, and there I used the approach with the FieldCache, as described in the book "Lucene in Action" (Manning):
String[] fieldValues = FieldCache.DEFAULT.getStrings(indexReader, fieldname);
The array fieldValues contains all values in the index for the field fieldname (example: ["NY", "NY", "NY", "SF"]), so it is up to you how to process the array. Usually you create a HashMap<String,Integer> that sums up the occurrences of each possible value; in this case NY=3, SF=1.
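A quick sketch of that counting step, continuing from the fieldValues line above (plain Java fragment; java.util.HashMap and java.util.Map imports assumed):

Map<String, Integer> counts = new HashMap<String, Integer>();
for (String value : fieldValues) {
    Integer seen = counts.get(value);
    counts.put(value, seen == null ? 1 : seen + 1);
}
// counts is now {NY=3, SF=1}; counts.keySet() holds the distinct values.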
Maybe this helps. It is quite slow and memory-consuming for very large indexes (1,000,000 documents in the index), but it works.