DataFu BagGroup will group all the bags instead of grouping at the FOREACH scope. How to fix? - apache-pig

I am using DataFu's BagGroup to group a bag. The script is as follows:
pvlist_grp = GROUP pvlist by uid;
uid_vid_pv = FOREACH pvlist_grp {
    vids = FOREACH pvlist GENERATE date, vid;
    GENERATE uid,
        vids as vid,
        BagGroup(pvlist.(date, uid, vid), pvlist.date) as grouped;
}
uid_vid_pv: {uid: chararray,vid: {(date: chararray,vid: chararray)},grouped: {(group: chararray,{(date: chararray,uid: chararray,vid: chararray)})}}
When I dump the first 10 records, I see that vids correctly contains the (date, vid) tuples for each uid. However, grouped contains records from other uids. For example:
(60,{(20160103,255),(20160103,255),(20160103,257),(20160103,255),(20160101,252)},{(20160103,{(20160103,21,18),(20160103,21,453),(20160103,21,452),(20160103,21,67),(20160103,21,18),(20160103,21,455),(20160103,21,43),(20160103,21,453),(20160103,21,16),(20160103,21,45),(20160103,21,18),(20160103,21,18),(20160103,21,67),(20160103,21,455),.............})})
The dumped result shows the BagGroup output with other uids' data in it. It groups the vid bags from all uids together, but I want the grouping done per uid.
The ideal result would look like:
(60,{(20160103,255),(20160103,255),(20160103,257),(20160103,255),(20160101,252)},{(20160103,{(20160103,255),(20160103,255),(20160103,257),(20160103,255)}),(20160101,{(20160101,252)})})
Any idea why? I am using DataFu 1.2.0.
Update:
It looks like BagGroup accumulates state in memory across calls. The first uid's BagGroup output is always correct, but after that it keeps adding the bags from previously processed records. I.e., if the first record is uid 21, the BagGroup output has all of 21's results grouped; if the next record is uid 60, the BagGroup output contains the results of 21 and 60 together.

I had exactly the same problem. In order to solve it, I had to modify the BagGroup UDF (ver.1.2.0). Adding groups.clear(); at the beginning of the exec method resolves this issue.
@SuppressWarnings("unchecked")
@Override
public DataBag exec(Tuple input) throws IOException {
    fieldNames = (List<String>) getInstanceProperties().get(FIELD_NAMES_PROPERTY);
    DataBag inputBag = (DataBag) input.get(0);
    groups.clear(); // reset the state left over from the previous call
    for (Tuple tuple : inputBag) {
        Tuple key = extractKey(tuple);
        addGroup(key, tuple);
    }
    // ... remainder of the original exec method, which builds and returns the output bag
}


play-slick scala many to many

I have an endpoint, let's say /order/, where I can send a JSON object (my order) which contains some products, etc. My problem is that I have to first save the order, wait for the order ID to come back from the DB, and then save my products with this new order ID (it's a many-to-many relation, which is why there is another table).
Consider this controller method:
def postOrder = Action(parse.json[OrderRest]) { req =>
  Created(Json.toJson(manageOrderService.insertOrder(req.body)))
}
This is what my repo method looks like:
def addOrder(order: Order) = db.run {
  (orders returning orders) += order
}
How can I chain db.run calls to first insert the order, get its ID back, and then insert my products with the order ID I just got?
I'm thinking about putting a service between my controller and repo and managing those actions there, but I have no idea where to start.
You can use a for comprehension to chain database operations. Here is an example of adding a table to a DB by adding a header row to represent the table and then adding the data rows. In this case it is a simple table containing (age, value) pairs.
/** Add a new table to the database */
def addTable(name: String, table: Seq[(Int, Int)]) = {
  val action = for {
    key <- (Headers returning Headers.map(_.tableId)) += HeadersRow(0, name)
    _ <- Values ++= table.map { case (age, value) => ValuesRow(key, age, value) }
  } yield key
  db.run(action.transactionally)
}
This is cut down from the working code, but it should give you the idea of how to do what you want. In your case, the first generator would insert the order and yield the generated order ID, and the second statement would insert the products with that ID.
This is done transactionally so that the new order will not be created unless the order data is valid (in database terms).
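Adapted to your order/products case, a minimal sketch could look like this (Orders, OrderProducts, OrderRow, OrderProductsRow, and the id column are assumed names, not your actual schema):
// Hedged sketch: insert the order, get its generated id back, then insert
// the junction rows for the products, all in one transaction.
def addOrderWithProducts(order: OrderRow, productIds: Seq[Long]) = {
  val action = for {
    orderId <- (Orders returning Orders.map(_.id)) += order
    _ <- OrderProducts ++= productIds.map(pid => OrderProductsRow(orderId, pid))
  } yield orderId
  db.run(action.transactionally)
}
This would be the piece to put in the service layer you are considering between your controller and repo.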

Pig Latin: All pairs within a group - Nested foreach with a cross and filter

I have some grouped data:
glu: (
group:tuple(foo:bytearray, bar:chararray),
bam: bag {
:tuple(foo:bytearray, bar:chararray, pom:Long)
}
)
What I want is to do a nested cross-product to get all pairs of pom, and a filter to reduce to only pairs where the first pom is less than the second pom. Ending up with something like this:
glu: (
group:tuple(foo:bytearray, bar:chararray),
bam: bag {
:tuple(foo:bytearray, bar:chararray, pom1:Long, pom2:Long)
}
)
Something like:
glupairs = FOREACH glu {
    pairs = CROSS bam, bam;
    filtered = FILTER pairs BY (bam1 != bam2) AND (bam1 < bam2);
    GENERATE group, filtered;
};
This, of course, does not work. Is there a way to do this? Can I take a cross product of a relation against itself? How can I select the fields afterwards (to do the filter)?
Thanks in advance.
I figured this out by doing:
glupairs = FOREACH glu {
    copied = FOREACH bam GENERATE -(-pom) AS pom; -- deals with the self-cross bug
    pairs = CROSS bam, copied;
    filtered = FILTER pairs BY (bam::pom != copied::pom) AND (bam::pom < copied::pom);
    GENERATE group, filtered;
};

How do I accumulate vectors into a map?

I have an alias A like this:
{cookie: chararray,
keywords: {tuple_of_tokens: (token: chararray)},
weight: double}
where the 2nd and 3rd fields are defined as
keywords = TOKENIZE((chararray)$5,',');
weight = 1.0/(double)SIZE(keywords);
Now I want to do:
foreach (group A by cookie) generate
    group.cookie as cookie,
    ???? as keywords;
where keywords should be a map from each keyword to the sum of its weights.
E.g.,
1 k1,k2,k3
1 k2,k4
should turn into
1 {k1:1/3, k2:5/6, k3:1/3, k4:1/2}
I am already using datafu, but I am open to any alternative...
I'd do
A_counts = FOREACH A GENERATE cookie, FLATTEN(keywords) AS keyword, 1.0/(double)SIZE(keywords) AS weight;
then
A_counts_gr = GROUP A_counts BY (cookie, keyword); and
result = FOREACH A_counts_gr GENERATE FLATTEN(group) AS (cookie, keyword), SUM(A_counts.weight) AS weight;
Then you can group by cookie again to get a bag like you want; after that second grouping each cookie has a bag of (keyword, weight) sums, which you can turn into a map with DataFu.
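A rough sketch of that last step, assuming the corrected aliases above (BagToMap is a placeholder for whichever bag-to-map UDF you end up using, not a specific DataFu name):
-- group the per-(cookie, keyword) sums back by cookie
by_cookie = GROUP result BY cookie;
-- each group now holds a bag of (cookie, keyword, weight) tuples;
-- feed the (keyword, weight) pairs to your bag-to-map UDF of choice:
-- result_map = FOREACH by_cookie GENERATE group AS cookie, BagToMap(result.(keyword, weight));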

PIG: FILTER a relation against the next row of the same relation

I have been searching for a long time now to solve my problem but have found almost nothing helpful.
Hopefully some of you can give me a tip.
I have a relation A with the following format: username, timestamp, ip
For example:
Harald 2014-02-18T16:14:49.503Z 123.123.123.123
Harald 2014-02-18T16:14:51.503Z 123.123.123.123
Harald 2014-02-18T16:14:55.503Z 321.321.321.321
I want to find out who changed their IP address in less than 5 seconds, so the second and third rows are the interesting ones.
I want to group the relation by username and compare the timestamp of the current row with that of the next row. If the IP address is not the same and the timestamp is less than 5 seconds later, the rows should be in the output.
Could someone help me with this issue?
Regards.
First, I want to thank you for your time.
But I am actually stuck at the Sessionize part.
This is my data coming in:
aoebcu 2014-02-19T14:23:17.503Z 220.61.65.25
aoebcu 2014-02-19T14:23:14.503Z 222.117.144.19
aoebcu 2014-02-19T14:23:14.503Z 222.117.144.19
jekgru 2014-02-19T14:23:14.503Z 213.56.157.109
zmembx 2014-02-19T14:23:12.503Z 199.188.198.91
qhixcg 2014-02-19T14:23:11.503Z 203.40.104.119
And my code so far looks like this:
hijack_Reduced = FOREACH finalLogs GENERATE ClientUserName, timestamp, OriginalClientIP;
hijack_Filtered = FILTER hijack_Reduced BY OriginalClientIP != '-';
hijack_Sessionized = FOREACH (GROUP hijack_Filtered BY ClientUserName) {
    views = ORDER hijack_Filtered BY timestamp;
    GENERATE FLATTEN(Sessionize(views)) AS (ClientUserName,timestamp,OriginalClientIP,session_id);
}
But when I run this script, I get the following error message:
15:36:22 ERROR -
org.apache.pig.tools.pigstats.SimplePigStats.setBackendException(542)
| ERROR 0: Exception while executing [POUserFunc (Name:
POUserFunc(datafu.pig.sessions.Sessionize)[bag] - scope-199 Operator
Key: scope-199) children: null at []]:
java.lang.IllegalArgumentException: Invalid format: "aoebcu"
I have already tried a lot, but nothing worked.
Do you have an idea?
Regards
While you could write a UDF for this, you can actually make use of the UDFs already available in Apache DataFu to solve this.
My solution involves applying sessionization to the data. Basically you look at consecutive events and assign each event a session ID. If the time elapsed between two events exceeds a specified amount of time, in your case 5 seconds, then the next event gets a new session ID. Otherwise consecutive events get the same session ID. Once each event is assigned its session ID the rest is easy. We group by session ID and look for sessions that have more than one distinct IP address.
I'll walk through my solution.
Suppose you have the following input data. Both Harold and Kumar change their IP addresses, but Harold does it within 5 seconds, while Kumar does not. So the output of our script should simply be "Harold".
Harold,2014-02-18T16:14:49.503Z,123.123.123.123
Harold,2014-02-18T16:14:51.503Z,123.123.123.123
Harold,2014-02-18T16:14:55.503Z,321.321.321.321
Kumar,2014-02-18T16:14:49.503Z,123.123.123.123
Kumar,2014-02-18T16:14:55.503Z,123.123.123.123
Kumar,2014-02-18T16:15:05.503Z,321.321.321.321
Load the data
data = LOAD 'input' using PigStorage(',')
AS (user:chararray,time:chararray,ip:chararray);
Now define a couple UDFs from DataFu. The Sessionize UDF performs sessionization as I described earlier. The DistinctBy UDF will be used to find the distinct IP addresses within each session.
define Sessionize datafu.pig.sessions.Sessionize('5s');
define DistinctBy datafu.pig.bags.DistinctBy('1');
Group the data by user, sort by time, and apply the Sessionize UDF. Note that the timestamp must be the first field, as this is what Sessionize expects. This UDF appends a session ID to each tuple.
data = FOREACH data GENERATE time,user,ip;
data_sessionized = FOREACH (GROUP data BY user) {
    views = ORDER data BY time;
    GENERATE flatten(Sessionize(views)) as (time,user,ip,session_id);
}
Now that the data is sessionized, we can group by the user and session. I group by user too because I want to spit this value back out. We pass the bag of events into the DistinctBy UDF. Check the documentation of this UDF for a more detailed description. But essentially we will get as many tuples as there are distinct IP addresses per session. Note that I have removed the time from the relation below. This is because 1) it isn't needed, and 2) the DistinctBy in 1.2.0 of DataFu has a bug when handling fields containing dashes, as the time field does.
data_sessionized = FOREACH data_sessionized GENERATE user,ip,session_id;
data_sessionized = FOREACH (GROUP data_sessionized BY (user, session_id)) GENERATE
group.user as user,
SIZE(DistinctBy(data_sessionized)) as distinctIpCount;
Now select all the sessions that had more than one distinct IP address and return the distinct users for these sessions.
data_sessionized = FILTER data_sessionized BY distinctIpCount > 1;
data_sessionized = FOREACH data_sessionized GENERATE user;
data_sessionized = DISTINCT data_sessionized;
This produces simply:
Harold
Here is the full source code, which you should be able to paste directly into the DataFu unit tests and run:
/**
define Sessionize datafu.pig.sessions.Sessionize('5s');
define DistinctBy datafu.pig.bags.DistinctBy('1'); -- distinct by ip
data = LOAD 'input' using PigStorage(',') AS (user:chararray,time:chararray,ip:chararray);
data = FOREACH data GENERATE time,user,ip;
data_sessionized = FOREACH (GROUP data BY user) {
views = ORDER data BY time;
GENERATE flatten(Sessionize(views)) as (time,user,ip,session_id);
}
data_sessionized = FOREACH data_sessionized GENERATE user,ip,session_id;
data_sessionized = FOREACH (GROUP data_sessionized BY (user, session_id)) GENERATE
group.user as user,
SIZE(DistinctBy(data_sessionized)) as distinctIpCount;
data_sessionized = FILTER data_sessionized BY distinctIpCount > 1;
data_sessionized = FOREACH data_sessionized GENERATE user;
data_sessionized = DISTINCT data_sessionized;
STORE data_sessionized INTO 'output';
*/
@Multiline private String sessionizeUserIpTest;
private String[] sessionizeUserIpTestData = new String[] {
"Harold,2014-02-18T16:14:49.503Z,123.123.123.123",
"Harold,2014-02-18T16:14:51.503Z,123.123.123.123",
"Harold,2014-02-18T16:14:55.503Z,321.321.321.321",
"Kumar,2014-02-18T16:14:49.503Z,123.123.123.123",
"Kumar,2014-02-18T16:14:55.503Z,123.123.123.123",
"Kumar,2014-02-18T16:15:05.503Z,321.321.321.321"
};
@Test
public void sessionizeUserIpTest() throws Exception
{
    PigTest test = createPigTestFromString(sessionizeUserIpTest);
    this.writeLinesToFile("input", sessionizeUserIpTestData);
    List<Tuple> result = this.getLinesForAlias(test, "data_sessionized");
    assertEquals(result.size(), 1);
    assertEquals(result.get(0).get(0), "Harold");
}

Raven DB Count Queries

I need to get a count of the documents in a particular collection:
There is an existing index Raven/DocumentCollections that stores the Count and Name of the collection paired with the actual documents belonging to the collection. I'd like to pick up the count from this index if possible.
Here is the Map-Reduce of the Raven/DocumentCollections index :
// Map
from doc in docs
let Name = doc["#metadata"]["Raven-Entity-Name"]
where Name != null
select new { Name, Count = 1 }

// Reduce
from result in results
group result by result.Name into g
select new { Name = g.Key, Count = g.Sum(x => x.Count) }
On a side note, var Count = DocumentSession.Query<Post>().Count(); always returns 0 for me, even though there are clearly 500-odd documents in my DB and at least 50 of them have "Raven-Entity-Name" set to "Posts" in their metadata. I have absolutely no idea why this count query keeps returning 0. The Raven logs show this when the count is done:
Request # 106: GET - 0 ms - TestStore - 200 - /indexes/dynamic/Posts?query=&start=0&pageSize=1&aggregation=None
For anyone still looking for the answer (this question was posted in 2011), the appropriate way to do this now is:
var numPosts = session.Query<Post>().Count();
To get the results from the index, you can use:
session.Query<Collection>("Raven/DocumentCollections")
.Where(x=>x.Name == "Posts")
.FirstOrDefault();
That will give you the result you want.
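If you do read the count from the index, the value is on the document the query returns. A minimal sketch, assuming a simple DTO whose properties mirror the index output (the class name and shape here are assumptions, not a RavenDB built-in):
// Hedged sketch: a DTO matching the Name/Count fields the index emits
public class Collection
{
    public string Name { get; set; }
    public int Count { get; set; }
}

// Usage: read the stored count for the "Posts" collection
var posts = session.Query<Collection>("Raven/DocumentCollections")
                   .Where(x => x.Name == "Posts")
                   .FirstOrDefault();
var postCount = posts != null ? posts.Count : 0;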