i'm searching for a long time now to solve my problem but nearly found nothing helpful.
Hopefully some of you can give me a tip.
I have a relation A with the following format: username, timestamp, ip
For example:
Harald 2014-02-18T16:14:49.503Z 123.123.123.123
Harald 2014-02-18T16:14:51.503Z 123.123.123.123
Harald 2014-02-18T16:14:55.503Z 321.321.321.321
And i want to find out, who changed his ip adress in less then 5 seconds. So the second and the third row should be interesting.
I want do group the relation by username und want to compare the timestamp of the actuall row with the next row. if the ip adress isnt the same and the timestamp is less then 5 seconds bigger, this should be at the output.
could someone help me with that issue?
regards.
first i want to thank you for your time.
but i actually stuck at the Sessionize part.
this is my data comming in:
aoebcu 2014-02-19T14:23:17.503Z 220.61.65.25
aoebcu 2014-02-19T14:23:14.503Z 222.117.144.19
aoebcu 2014-02-19T14:23:14.503Z 222.117.144.19
jekgru 2014-02-19T14:23:14.503Z 213.56.157.109
zmembx 2014-02-19T14:23:12.503Z 199.188.198.91
qhixcg 2014-02-19T14:23:11.503Z 203.40.104.119
and my code till now looks like this:
hijack_Reduced = FOREACH finalLogs GENERATE ClientUserName, timestamp, OriginalClientIP;
hijack_Filtered = FILTER hijack_Reduced BY OriginalClientIP != '-';
hijack_Sessionized = FOREACH (GROUP hijack_Filtered BY ClientUserName) {
views = ORDER hijack_Filtered BY timestamp;
GENERATE FLATTEN(Sessionize(views)) AS (ClientUserName,timestamp,OriginalClientIP,session_id);
}
but when i run this script, i got the following error Message:
15:36:22 ERROR -
org.apache.pig.tools.pigstats.SimplePigStats.setBackendException(542)
| ERROR 0: Exception while executing [POUserFunc (Name:
POUserFunc(datafu.pig.sessions.Sessionize)[bag] - scope-199 Operator
Key: scope-199) children: null at []]:
java.lang.IllegalArgumentException: Invalid format: "aoebcu"
i already tried a lot, but nothing worked.
do you got an idea?
Regards
While you could write a UDF for this, you can actually make use of the UDFs already available in Apache DataFu to solve this.
My solution involves applying sessionization to the data. Basically you look at consecutive events and assign each event a session ID. If the time elapsed between two events exceeds a specified amount of time, in your case 5 seconds, then the next event gets a new session ID. Otherwise consecutive events get the same session ID. Once each event is assigned its session ID the rest is easy. We group by session ID and look for sessions that have more than one distinct IP address.
I'll walk through my solution.
Suppose you have the following input data. Both Harold and Kumar change their IP addresses. But Harold does it within 5 seconds, while Kumar does not. So the output of our script should just be simply "Harold".
Harold,2014-02-18T16:14:49.503Z,123.123.123.123
Harold,2014-02-18T16:14:51.503Z,123.123.123.123
Harold,2014-02-18T16:14:55.503Z,321.321.321.321
Kumar,2014-02-18T16:14:49.503Z,123.123.123.123
Kumar,2014-02-18T16:14:55.503Z,123.123.123.123
Kumar,2014-02-18T16:15:05.503Z,321.321.321.321
Load the data
data = LOAD 'input' using PigStorage(',')
AS (user:chararray,time:chararray,ip:chararray);
Now define a couple UDFs from DataFu. The Sessionize UDF performs sessionization as I described earlier. The DistinctBy UDF will be used to find the distinct IP addresses within each session.
define Sessionize datafu.pig.sessions.Sessionize('5s');
define DistinctBy datafu.pig.bags.DistinctBy('1');
Group the data by user, sort by time, and apply the Sessonize UDF. Note that the timestamp must be the first field, as this is what Sessionize expects. This UDF appends a session ID to each tuple.
data = FOREACH data GENERATE time,user,ip;
data_sessionized = FOREACH (GROUP data BY user) {
views = ORDER data BY time;
GENERATE flatten(Sessionize(views)) as (time,user,ip,session_id);
}
Now that the data is sessionized, we can group by the user and session. I group by user too because I want to spit this value back out. We pass the bag of events into the DistinctBy UDF. Check the documentation of this UDF for a more detailed description. But essentially we will get as many tuples as there are distinct IP addresses per session. Note that I have removed the time from the relation below. This is because 1) it isn't needed, and 2) the DistinctBy in 1.2.0 of DataFu has a bug when handling fields containing dashes, as the time field does.
data_sessionized = FOREACH data_sessionized GENERATE user,ip,session_id;
data_sessionized = FOREACH (GROUP data_sessionized BY (user, session_id)) GENERATE
group.user as user,
SIZE(DistinctBy(data_sessionized)) as distinctIpCount;
Now select all the sessions that had more than one distinct IP address and return the distinct users for these sessions.
data_sessionized = FILTER data_sessionized BY distinctIpCount > 1;
data_sessionized = FOREACH data_sessionized GENERATE user;
data_sessionized = DISTINCT data_sessionized;
This produces simply:
Harold
Here is the full source code, which you should be able to paste directly into the DataFu unit tests and run:
/**
define Sessionize datafu.pig.sessions.Sessionize('5s');
define DistinctBy datafu.pig.bags.DistinctBy('1'); -- distinct by ip
data = LOAD 'input' using PigStorage(',') AS (user:chararray,time:chararray,ip:chararray);
data = FOREACH data GENERATE time,user,ip;
data_sessionized = FOREACH (GROUP data BY user) {
views = ORDER data BY time;
GENERATE flatten(Sessionize(views)) as (time,user,ip,session_id);
}
data_sessionized = FOREACH data_sessionized GENERATE user,ip,session_id;
data_sessionized = FOREACH (GROUP data_sessionized BY (user, session_id)) GENERATE
group.user as user,
SIZE(DistinctBy(data_sessionized)) as distinctIpCount;
data_sessionized = FILTER data_sessionized BY distinctIpCount > 1;
data_sessionized = FOREACH data_sessionized GENERATE user;
data_sessionized = DISTINCT data_sessionized;
STORE data_sessionized INTO 'output';
*/
#Multiline private String sessionizeUserIpTest;
private String[] sessionizeUserIpTestData = new String[] {
"Harold,2014-02-18T16:14:49.503Z,123.123.123.123",
"Harold,2014-02-18T16:14:51.503Z,123.123.123.123",
"Harold,2014-02-18T16:14:55.503Z,321.321.321.321",
"Kumar,2014-02-18T16:14:49.503Z,123.123.123.123",
"Kumar,2014-02-18T16:14:55.503Z,123.123.123.123",
"Kumar,2014-02-18T16:15:05.503Z,321.321.321.321"
};
#Test
public void sessionizeUserIpTest() throws Exception
{
PigTest test = createPigTestFromString(sessionizeUserIpTest);
this.writeLinesToFile("input",
sessionizeUserIpTestData);
List<Tuple> result = this.getLinesForAlias(test, "data_sessionized");
assertEquals(result.size(),1);
assertEquals(result.get(0).get(0),"Harold");
}
At a high level, I need a query that can pull a subset of records based on the sum of a column, just like Linq: How to query items from a collection until the sum reaches a certain value.
However, the key difference is that he's already got his records in an object, and I don't and can't. My table can have millions of records. If I build my query the way he did, I get this error:
"A lambda expression with a statement body
cannot be converted to an expression tree"
Which makes sense after researching it, LINQ can't turn the answer in the above referenced question into valid SQL.
I'm going to make a hypothetical table that represents my situation.
Order Id | Cookie Name | Qty
1 Sugar 5
2 Snickerdoodle 4
3 Chocolate chip 8
4 Snickerdoodle 10
5 Snickerdoodle 5
Given this sample, I need to write a query that grabs the first X orders of Snickerdoodle until the summed Qty exceedes an input from the parameter (i.e. If the user chooses 13, it would return records 2 & 4 ).
I'm using Nhibernate.Linq, because I'm more comfortable in LINQ. I'm completely open to ICreate if the need arises.
As a side note, I'm interested in this as a concept as well as a direct problem. Even though I need a Sum, there has to be a way to do something akin to a takewhile that executes until a condition is met.
pragmatic approach
int needed = ...;
int actual = 0;
int page = 0;
const int pagesize = 20; // set to some sensible value, eg. the pagesize of the grid shown to the user
var results = new List<CookieOrder>();
while (actual < needed)
{
var partialResults = session.Query<CookieOrder>()
.Where(c => c.Name == "Snickerdoodle")
.OrderBy(c => c.Id)
.Skip(page * pagesize)
.Take(pagesize)
.ToList();
for(int i = 0; i < partialResults.Length && actual < needed; i++)
{
results.Add(partialResults[i]);
actual = partialResults[i].Quantity;
}
page++;
}
return results;
I have this scenario:
var query = Session.QueryOver<T>()
var criteria = query.UnderlyingCriteria.SomethingThereAddCriterion()
How can I transform criteria back to IQueryOver()?
Your criterias has been added to the UnderlyingCriteria of query. So you don't need to transform criteria to IQueryOver(). Just use query again.
I have a need to get a Count of Documents in a particular collection :
There is an existing index Raven/DocumentCollections that stores the Count and Name of the collection paired with the actual documents belonging to the collection. I'd like to pick up the count from this index if possible.
Here is the Map-Reduce of the Raven/DocumentCollections index :
from doc in docs
let Name = doc["#metadata"]["Raven-Entity-Name"]
where Name != null
select new { Name , Count = 1}
from result in results
group result by result.Name into g
select new { Name = g.Key, Count = g.Sum(x=>x.Count) }
On a side note, var Count = DocumentSession.Query<Post>().Count(); always returns 0 as the result for me, even though clearly there are 500 odd documents in my DB atleast 50 of them have in their metadata "Raven-Entity-Name" as "Posts". I have absolutely no idea why this Count query keeps returning 0 as the answer - Raven logs show this when Count is done
Request # 106: GET - 0 ms - TestStore - 200 - /indexes/dynamic/Posts?query=&start=0&pageSize=1&aggregation=None
For anyone still looking for the answer (this question was posted in 2011), the appropriate way to do this now is:
var numPosts = session.Query<Post>().Count();
To get the results from the index, you can use:
session.Query<Collection>("Raven/DocumentCollections")
.Where(x=>x.Name == "Posts")
.FirstOrDefault();
That will give you the result you want.
Imagine the following (simplified) database layout:
We have many "holiday" records that relate to going to a particular Accommodation on a certain date etc.
I would like to pull from the database the "best" holiday going to each accommodation (i.e. lowest price), given a set of search criteria (e.g. duration, departure airport etc).
There will be multiple records with the same price, so then we need to choose by offer saving (descending), then by departure date ascending.
I can write SQL to do this that looks like this (I'm not saying this is necessarily the most optimal way):
SELECT *
FROM Holiday h1 INNER JOIN (
SELECT h2.HolidayID,
h2.AccommodationID,
ROW_NUMBER() OVER (
PARTITION BY h2.AccommodationID
ORDER BY OfferSaving DESC
) AS RowNum
FROM Holiday h2 INNER JOIN (
SELECT AccommodationID,
MIN(price) as MinPrice
FROM Holiday
WHERE TradeNameID = 58001
/*** Other Criteria Here ***/
GROUP BY AccommodationID
) mp
ON mp.AccommodationID = h2.AccommodationID
AND mp.MinPrice = h2.price
WHERE TradeNameID = 58001
/*** Other Criteria Here ***/
) x on h1.HolidayID = x.HolidayID and x.RowNum = 1
As you can see, this uses a subquery within another subquery.
However, for several reasons my preference would be to achieve this same result in NHibernate.
Ideally, this would be done with QueryOver - the reason being that I build up the search criteria dynamically and this is much easier with QueryOver's fluent interface. (I had started out hoping to use NHibernate Linq, but unfortunately it's not mature enough).
After a lot of effort (being a relative newbie to NHibernate) I was able to re-create the very inner query that fetches all accommodations and their min price.
public IEnumerable<HolidaySearchDataDto> CriteriaFindAccommodationFromPricesForOffers(IEnumerable<IHolidayFilter<PackageHoliday>> filters, int skip, int take, out bool hasMore)
{
IQueryOver<PackageHoliday, PackageHoliday> queryable = NHibernateSession.CurrentFor(NHibernateSession.DefaultFactoryKey).QueryOver<PackageHoliday>();
queryable = queryable.Where(h => h.TradeNameId == website.TradeNameID);
var accommodation = Null<Accommodation>();
var accommodationUnit = Null<AccommodationUnit>();
var dto = Null<HolidaySearchDataDto>();
// Apply search criteria
foreach (var filter in filters)
queryable = filter.ApplyFilter(queryable, accommodationUnit, accommodation);
var query1 = queryable
.JoinQueryOver(h => h.AccommodationUnit, () => accommodationUnit)
.JoinQueryOver(h => h.Accommodation, () => accommodation)
.SelectList(hols => hols
.SelectGroup(() => accommodation.Id).WithAlias(() => dto.AccommodationId)
.SelectMin(h => h.Price).WithAlias(() => dto.Price)
);
var list = query1.OrderByAlias(() => dto.Price).Asc
.Skip(skip).Take(take+1)
.Cacheable().CacheMode(CacheMode.Normal).List<object[]>();
// Cacheing doesn't work this way...
/*.TransformUsing(Transformers.AliasToBean<HolidaySearchDataDto>())
.Cacheable().CacheMode(CacheMode.Normal).List<HolidaySearchDataDto>();*/
hasMore = list.Count() == take;
var dtos = list.Take(take).Select(h => new HolidaySearchDataDto
{
AccommodationId = (string)h[0],
Price = (decimal)h[1],
});
return dtos;
}
So my question is...
Any ideas on how to achieve what I want using QueryOver, or if necessary Criteria API?
I'd prefer not to use HQL but if it is necessary than I'm willing to see how it can be done with that too (it makes it harder (or more messy) to build up the search criteria though).
If this just isn't doable using NHibernate, then I could use a SQL query. In which case, my question is can the SQL be improved/optimised?
I have manage to achieve such dynamic search criterion by using Criteria API's. Problem I ran into was duplicates with inner and outer joins and especially related to sorting and pagination, and I had to resort to using 2 queries, 1st query for restriction and using the result of 1st query as 'in' clause in 2nd creteria.