How to count and then compute the total average in Pig - apache-pig

Each line in my dataset is a sale, and my goal is to compute the average number of times a client buys during their lifetime.
I have already grouped and counted by clientId like this:
byClientId = GROUP sales BY clientId;
countByClientId = FOREACH byClientId GENERATE group, COUNT($1);
This creates a table with 2 columns: clientId, count of transactions.
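For example (illustrative values only, not from the original question), countByClientId might contain:
(client1,3)
(client2,5)
(client3,1)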
Now I am trying to get the overall average of the second column (i.e. the average number of sales to the same client). I am using this code:
groupCount = GROUP countByClientId all;
avg = foreach groupCount generate AVG($1);
But I get this error message:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045:
<line 18, column 31> Could not infer the matching function for org.apache.pig.builtin.AVG
as multiple or none of them fit. Please use an explicit cast.
How to get the overall average of the second column?

It would have been simpler for us with a sample of your input data, so I created my own to be sure that my solution works. You have only one mistake: once you GROUP ... ALL, your schema becomes group:chararray, countByClientId:bag{:tuple(group:chararray,:long)}.
So $1 refers to a bag, which is why you can't compute the mean on it directly. To reach the second field inside this bag you have two choices: either $1.$1 or countByClientId.$1. Your last line should therefore be:
avg = foreach groupCount generate AVG(countByClientId.$1);
I hope it's clear.
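To put everything together, here is a minimal end-to-end sketch. It is hedged: the file name 'sales.txt', the tab separator, and the saleId column are illustrative assumptions, not details from the original question.
-- Hedged sketch: load sales, count per client, then average the counts.
sales = LOAD 'sales.txt' USING PigStorage('\t') AS (saleId:chararray, clientId:chararray);
byClientId = GROUP sales BY clientId;
countByClientId = FOREACH byClientId GENERATE group AS clientId, COUNT(sales) AS cnt;
groupCount = GROUP countByClientId ALL;
avg = FOREACH groupCount GENERATE AVG(countByClientId.cnt);
DUMP avg;
Naming the count (AS cnt) lets the last FOREACH reference countByClientId.cnt instead of a positional $1, which makes the projection into the bag explicit.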

Related

Create Hourly Average query from Logger table

I use a logger table with timestamp, point id, and value fields. The control system adds a record to the logger table each time a point id's value changes.
I want to run a query on that table and get back one hourly average value per tag (point ids with different values; for example, one hourly average for point_id=LT_174, one hourly average for point_id=AT_INK, and so on), and I would like to present the results as a pivot table.
Can I do that? And if it's impossible to get all requested tags together, how can I run the same query for one tag? (I use VB.Net as the platform for running this query, so I could build the result by looping over the requested tags, one tag at a time.)
I'll be happy to get ideas and suggestions for this problem.
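A grouped-average query is the usual starting point here. Below is a hedged sketch; it assumes SQL Server, and it guesses that the table is named logger with columns timestamp, point_id, and value, since the question does not give the exact schema.
-- Hedged sketch: one row per (tag, hour) with that hour's average value.
-- DATEADD/DATEDIFF truncate the timestamp down to the start of its hour.
SELECT point_id,
       DATEADD(hour, DATEDIFF(hour, 0, [timestamp]), 0) AS hour_start,
       AVG([value]) AS avg_value
FROM logger
GROUP BY point_id, DATEADD(hour, DATEDIFF(hour, 0, [timestamp]), 0)
ORDER BY hour_start, point_id;
The pivot (one column per tag) can then be produced either with SQL Server's PIVOT operator or on the VB.Net side while iterating over this result set, which also avoids running one query per tag in a loop.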

Pig: summing column b of rows with the same column a

I'm trying to count the number of tweets with a certain hashtag over a period of time, but I'm getting an error when trying to use the built-in SUM function.
Example:
data = LOAD 'tweets_2.csv' USING PigStorage('\t') AS (date:float,hashtag:chararray,count:int, year:int, month:int, day:int, hour:int, minute:int, second:int);
NBLNabilVoto_count = FILTER data BY hashtag == 'NBLNabilaVoto';
NBLNabilVoto_group = GROUP NBLNabilVoto by count;
X = FOREACH NBLNabilVoto GENERATE group, SUM(data.count);
Error:
<line 22, column 47> Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
First load the data, then filter for the time interval you want to process. Group the records by hashtag, and use the COUNT() function to count the number of tweets for each hashtag.
I am not sure that the code is doing what you think or want it to do, but the error you are getting is because you are running SUM on the wrong thing. You need to do this:
X = FOREACH NBLNabilVoto_group GENERATE group, SUM(NBLNabilVoto_count.count);
NBLNabilVoto_count is the name of the bag of tuples inside each grouped record.
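For context, running DESCRIBE NBLNabilVoto_group would print a schema along these lines (illustrative, assuming the GROUP statement was meant to run on NBLNabilVoto_count):
NBLNabilVoto_group: {group: int, NBLNabilVoto_count: {(date: float, hashtag: chararray, count: int, year: int, month: int, day: int, hour: int, minute: int, second: int)}}
The inner bag keeps the name of the relation that was grouped, which is why the SUM has to reference NBLNabilVoto_count.count.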
I think you are using the wrong relation in your SUM; you should SUM over NBLNabilVoto_count, not the data relation. I also have a question: why are you grouping by count?
If you want to count all your tweets with the hashtag NBLNabilVoto, I think the code should be:
data = LOAD 'tweets_2.csv' USING PigStorage('\t') AS (date:float,hashtag:chararray,count:int, year:int, month:int, day:int, hour:int, minute:int, second:int);
NBLNabilVoto_count = FILTER data BY hashtag == 'NBLNabilaVoto';
NBLNabilVoto_group = GROUP NBLNabilVoto_count ALL;
X = FOREACH NBLNabilVoto_group GENERATE group, SUM(NBLNabilVoto_count.count);

Qlikview calculation of range for frequencies

I am given a task to calculate the frequency of calls across a territory. If the rep called a physician regarding the sale of the product 5 times, then the frequency is 5 and the HCP count is 1. I generated frequencies from 1 to 124 in my pivot table using a calculated dimension, which is working fine. But my concern is:
My manager wants the frequencies up to 19 listed individually, in order: 1, 2, 3, ... 19.
The frequencies from 20 to 124 should be grouped together as 20+.
I would be grateful if someone could help me with this.
Use the Class function in the dimension to split the values into buckets:
=class(CallId,5)
And the expression:
=count(Distinct CallId)
You can then customize the output by adding parameters:
class( var,10 ) with var = 23 returns '20<=x<30'
class( var,5,'value' ) with var = 23 returns '20<= value <25'
class( var,10,'x',5 ) with var = 23 returns '15<=x<25'
I think you can do this with a calculated dimension.
If your data has one row per physician coming from the load statement, the below will likely work.
Dimension
- =IF(CallCount<=19,CallCount,'20+')
Expression
- =COUNT(DISTINCT Physician_ID)
Sort
- Numeric Value Ascending
If your data has to be aggregated (more than one call row per provider coming from the load), try the above, substituting the below for the Dimension.
Dimension
- =IF(AGGR(SUM(CallCount), Physician_ID)<=19, AGGR(SUM(CallCount), Physician_ID), '20+')
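Alternatively, the aggregation can be done once in the load script so the chart only needs the simple IF dimension above. A hedged sketch, assuming a previously loaded table named Calls with Physician_ID and CallId fields (names not confirmed by the question):
// Hedged sketch: builds one row per physician with their total call count.
CallCounts:
LOAD Physician_ID,
     Count(CallId) AS CallCount
RESIDENT Calls
GROUP BY Physician_ID;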

Percent of Group, not total

It seems like there are a lot of answers out there, but I can't seem to relate them to my specific issue. I want to get the breakdown of yes/no for the specific Group, not the percent of yes for the entire population of data.
I have tried the following expressions in the "What I'm Getting" % of Total cell:
=FormatPercent(Count(Fields!SessionID.Value) / Count(Fields!SessionID.Value, "Tablix1"))
=FormatPercent(Count(Fields!Value.Value) / Count(Fields!SessionID.Value, "Value"))
It should just be a case of changing the Scope in your expression to make sure the denominator is the total for the group, not the entire Dataset or Tablix, i.e. something like:
=Count(Fields!SessionID.Value) / Count(Fields!SessionID.Value, "MyGroup")
Where MyGroup is the name of the group.
If this is still not clear, your best option would be to add a few sample rows, and your desired result for these, to the question so we can replicate your exact issue.
Edit after more info added
Thanks for adding more details. I have created a Dataset based on your example and built a table from it, with the group based on the Group field.
The Group % expression is:
=Fields!YesNoCount.Value / Sum(Fields!YesNoCount.Value, "MyGroup")
This is taking the YesNoCount value of each row and comparing it to the total YesNoCount value in that particular group (i.e. the MyGroup scope).
Note that I'm using Sum here, not Count as in your example expression - that seems to be the appropriate aggregate for your data and the required value.
The results look correct to me.
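If you want the cell rendered as a percentage, you can either set the cell's Format property or wrap the expression in FormatPercent, as in your original attempt:
=FormatPercent(Fields!YesNoCount.Value / Sum(Fields!YesNoCount.Value, "MyGroup"))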

Grouping, totaling in Rails and Active Record

I'm trying to group a series of records in Active Record so I can do some calculations to normalize the quantity attribute of each record. For example:
A user enters a date and a quantity. Dates are not unique, so I may have 10-20 quantities for each date. I need to work with only the totals for each day, not every individual record, because after determining the highest and lowest values I convert each one, basically by dividing by n, which is usually 10.
This is what I'm doing right now:
def heat_map(project, word_count, n_div)
  return "freezing" if word_count == 0
  counts = project.words.map(&:quantity)        # quantity from every word record
  max = counts.max
  min = counts.min
  return "max" if word_count == max
  return "min" if word_count == min
  break_point = (max - min).to_f / n_div.to_f   # width of one heat bucket
  ((word_count - min).to_f / break_point).to_i  # bucket index for this count
end
This works great if I display a table of all the word counts, but I'm trying to apply the heat map to a calendar that displays running totals for each day. This obviously doesn't total the days, so I end up with numbers that are outside the normal scale.
I can't figure out a way to group the word counts and total them by day before I do the normalization. I tried doing a group_by and then adding the map call, but I got an undefined method error. Any ideas? I'm also open to better or cleaner ways of normalizing the word counts.
Hard to answer without knowing a bit more about your models. So I'm going to assume that the date you're interested in is just the created_at date in the words table. I'm assuming that you have a field in your words table called word where you store the actual word.
I'm also assuming that you might have multiple entries for the same word (possibly with different quantities) in the one day.
So, this will give you an ordered hash of total quantities per word per day:
project.words.group('DATE(created_at)').group('word').sum('quantity')
If those guesses make no sense, then perhaps you can give a bit more detail about the structure of your models.
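If only the per-day totals matter for the heat map (ignoring individual words), a sketch along these lines feeds daily sums into the same normalization. It is hedged on the same assumption that created_at holds the date you care about:
# Hedged sketch: total quantity per day, then bucket the daily totals
# with the same break-point logic as heat_map above (n_div = 10).
daily_totals = project.words.group('DATE(created_at)').sum(:quantity)
counts = daily_totals.values
min, max = counts.minmax
break_point = (max - min).to_f / 10
heat_indexes = daily_totals.transform_values do |total|
  ((total - min).to_f / break_point).to_i
end
heat_indexes then maps each date to its heat bucket, ready to color the calendar.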