Converting Apache Pig to Hive

I'm trying to understand what GROUP and, in particular, this FLATTEN code is doing. I have been working on the code below, off and on for a few days, trying to convert it to Hive, and I just don't get it. Normally, FLATTEN is used to create multiple rows for two or more columns that should have the same name in the output. In this case, though, I'm not sure what it is doing, so I can't replicate it in Hive. Any assistance would be greatly appreciated, as I don't have much time to work on this and I'm expected to complete and test it in the next couple of weeks. Thanks.
Change_pop = GROUP IPChange_pop BY (acct_num,strategy_code);
Oldest_GLChange = FOREACH Change_pop {
    OList = ORDER IPChange_pop BY process_date ASC, new_loc DESC;
    Oldest = LIMIT OList 1;
    GENERATE
        FLATTEN(GLChange_pop) as (email,acct_num,acct_nm,cust_num,type,strategy_code,process_date,last_5,cmGroup,current_loc,new_loc,update_ts),
        FLATTEN(group.strategy_code) as grp_strategy_code,
        FLATTEN(Oldest.process_date) as early_process_date, FLATTEN(Oldest.new_loc) as early_new_loc;
};

FLATTEN is used to un-nest tuples, bags, and maps. Off the top of my head, the Hive equivalent would be the explode() function used together with LATERAL VIEW.
https://pig.apache.org/docs/latest/basic.html#flatten
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode
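
For this particular script, a window function may be a closer fit than explode(). Flattening the grouped bag gives back every original row, and flattening the one-row Oldest bag attaches the earliest process_date and new_loc of the group to each of those rows. The sketch below checks that reading with Python's sqlite3 module (it needs SQLite 3.25 or later for window functions). The table name ip_change_pop, the trimmed column list, and the sample rows are made up for illustration, and it assumes GLChange_pop in the question refers to the same bag as IPChange_pop. The SELECT uses only standard window syntax, so it should translate closely to Hive, but it has not been run there.

```python
import sqlite3

# Hypothetical, trimmed-down stand-in for IPChange_pop:
# (email, acct_num, strategy_code, process_date, new_loc)
SAMPLE_ROWS = [
    ("a@x.com", "A1", "S1", "2020-01-05", "LOC2"),
    ("a@x.com", "A1", "S1", "2020-01-01", "LOC1"),
    ("a@x.com", "A1", "S1", "2020-01-01", "LOC9"),
    ("b@x.com", "B7", "S2", "2020-03-10", "LOC4"),
]

# PARTITION BY plays the role of GROUP ... BY (acct_num, strategy_code);
# the window ORDER BY mirrors ORDER ... BY process_date ASC, new_loc DESC;
# FIRST_VALUE picks the row that LIMIT 1 would keep in the Pig script.
QUERY = """
SELECT
    email, acct_num, strategy_code, process_date, new_loc,
    strategy_code AS grp_strategy_code,
    FIRST_VALUE(process_date) OVER w AS early_process_date,
    FIRST_VALUE(new_loc)      OVER w AS early_new_loc
FROM ip_change_pop
WINDOW w AS (
    PARTITION BY acct_num, strategy_code
    ORDER BY process_date ASC, new_loc DESC
)
ORDER BY acct_num, process_date, new_loc
"""


def with_oldest(rows):
    """Return every input row plus the earliest values of its group."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ip_change_pop ("
        "email TEXT, acct_num TEXT, strategy_code TEXT, "
        "process_date TEXT, new_loc TEXT)"
    )
    conn.executemany("INSERT INTO ip_change_pop VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in with_oldest(SAMPLE_ROWS):
    print(row)
```

Every input row survives, which is why the output has as many rows as the input rather than one row per group.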

Related

How to get the first tuple inside a bag when Grouping

I do not understand how to deal with duplicates when generating my output. I end up with several duplicates, but I want only one.
I've tried LIMIT, but I suppose that only applies when selecting. I also tried DISTINCT, but I guess that is for a different scenario.
grouped = GROUP wantedTails BY tail_number;
smmd = FOREACH grouped GENERATE wantedTails.tail_number as Tails, SUM(wantedTails.distance) AS totaldistance;
So for my grouped data, I got something like this (not the whole output):
({(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB),(N983JB)},44550)
but I expect (N983JB,44550). How can I delete those duplicates generated during grouping? Thank you!

The way I see it, there are two ways to de-duplicate data in Pig.
The less flexible but convenient way is to apply MAX to the columns that need to be de-duplicated after performing a GROUP BY. Apply SUM only if you want to add up values across the duplicates:
dataWithDuplicates = LOAD '<path_to_data>';
grouped = GROUP dataWithDuplicates BY tail_number;
dedupedData = FOREACH grouped GENERATE
    -- Since you have grouped on tail_number, it is already de-duped
    group AS tailNumber,
    MAX(dataWithDuplicates.distance) AS dedupedDistance,
    SUM(dataWithDuplicates.distance) AS totalDistance;
If you want more flexibility while de-duping, you can use a nested FOREACH in Pig. This question captures the gist of its usage: how to delete the rows of data which is repeating in Pig. Another reference for nested FOREACH: https://www.safaribooksonline.com/library/view/programming-pig/9781449317881/ch06.html
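
To make the effect of using the group key concrete, here is a rough Python model of the first approach. It is only an illustration of the logic, not Pig itself. The function name is hypothetical, and the sample rows are invented so that they add up to the total shown in the question.

```python
from collections import defaultdict


def total_distance_by_tail(records):
    """Model of GROUP ... BY tail_number followed by SUM(distance).

    Emitting the group key itself (Pig's `group`) yields one value per
    group, whereas projecting the column out of the bag repeats it once
    per input row.
    """
    totals = defaultdict(int)
    for tail_number, distance in records:
        totals[tail_number] += distance
    return sorted(totals.items())


# Hypothetical input: 18 rows for one tail number, chosen to total 44550.
wanted_tails = [("N983JB", 2475)] * 18
print(total_distance_by_tail(wanted_tails))
```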

Qlik View: Get Values WHERE (value2 = max)

I have a relatively simple problem:
I have a dataset that consists of an id (not for the entry but for a specific object), an age of the object, and a power value.
So what I get is a lot of entries, each giving a power at a specific age for a specific object.
I want to create a diagram that shows the average of all power values at the highest age over all objects (ids).
In SQL this would basically look something like SELECT power WHERE max(age).
Can anybody suggest a smart way to do this in QlikView?
I already tried using the sum() function with total and aggr over all ids, but I keep getting weird results.
I tried using set analysis with aggr ({} power, id), but it doesn't work.
Edit: I tried
aggr(if (age= max(age), power), id)
but as soon as I select an id with more than one entry (different ages), no data is displayed. The same happens when I remove the aggr function.
And:
Avg({$<age = max(age)>}Power)
Displays nothing at all (it also displays an error)
Also tried:
Sum({$ <age= {$(=max(age))} > } power )
Still nothing.
Thanks
Julian

Solved it with firstsortedvalue:
avg(aggr(firstsortedvalue (power, -age), id))
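
For readers less familiar with Qlik, the logic of that expression can be modelled in a few lines of Python: for each id keep the power recorded at the highest age, then average those values. This is only a sketch of the idea with made-up data and a hypothetical function name. One assumed difference: on a tie for the highest age this sketch keeps the first row it sees, whereas firstsortedvalue is documented to return NULL in that case.

```python
from statistics import mean


def avg_power_at_max_age(records):
    """Average, over all ids, of the power recorded at each id's highest age."""
    latest = {}  # id -> (age, power) of the highest age seen so far
    for obj_id, age, power in records:
        if obj_id not in latest or age > latest[obj_id][0]:
            latest[obj_id] = (age, power)
    return mean(power for _, power in latest.values())


# Hypothetical (id, age, power) rows.
sample = [
    (1, 1, 10.0),
    (1, 3, 30.0),  # highest age for id 1
    (2, 2, 50.0),  # highest age for id 2
]
print(avg_power_at_max_age(sample))
```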

Yes, set analysis should work.
Something like:
Avg({$<age = max(age)>}Power)
Alternatively, you can use a conditional sum as well:
if (age = max(age), avg(Power))
Aggr is used to run a statistic over a list of records with a 'group by' condition, as in SQL.

SSRS spatial Bubble map - Hide bubbles for 0 values

In SSRS, when you add a map and select "Bubble map" in the wizard, the map will display bubbles for 0 values too.
I’m trying to visualize data as follows:
It doesn’t matter if you count a field or sum. SSRS seems to show bubbles everywhere when there is a match on the spatial and the analytical table. Country_code in my case.
Can somebody please help me to hide the bubbles when the analytical data = 0 ?

I figured out how to do this with a little trick.
Right-click the map > Center Point properties > General, click the function button next to the Marker type field, and type the following expression:
=iif(Fields!Your_analytical_field.Value=0,"None","Circle")
Or if you want to do this only for null values:
=iif(Fields!Your_analytical_field.Value is nothing,"None","Circle")
That's it !
Don't know if this is the best way to accomplish what you need, but it's working anyway :)

Another way would be to filter your spatial dataset by joining it with the analytical one. If you are using cube data, use OPENQUERY to join like this:
SELECT a.*
FROM
(SELECT your_geo_data, some_matching_id FROM SpatialData) a
INNER JOIN
(SELECT "[some hierarchy].[some_other_matching_id]" some_other_matching_id FROM OPENQUERY(YOUR_LINKED_SERVER, 'SELECT NON EMPTY { ... } on 0 FROM ... ' ) ) b
on a.some_matching_id = b.some_other_matching_id
The problem here might be performance, as you would run the analytical dataset query twice: once for the analytical dataset itself and once for the join.

Computing average in Pig

I have data in the format
1,1.2
2,1.3
and so on..
So basically this is an id,val combination where id is unique.
I want to calculate the average of all the values.
So here: avg(1.2,1.3)
I was going through the documentation, but most of the aggregation functions involve grouping by some id and then using AVG. Since the id is unique, how do I group them?
So basically the outcome of this endeavor would be one float.
Any suggestions will be greatly appreciated.
Thanks

GROUP ... ALL should solve your problem :)
A = LOAD 'data' USING PigStorage(',') AS (f1:int, f2:float);
B = GROUP A ALL;
AV = FOREACH B GENERATE AVG(A.f2);
DUMP AV;
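
As a cross-check of what GROUP ... ALL does, the same computation can be modelled in plain Python: every row goes into one single group and the average is taken over the value column. The function name is hypothetical, and the input is the two sample lines from the question.

```python
from statistics import mean


def average_of_values(lines):
    """Put all rows in one group (like GROUP ... ALL) and average column 2."""
    values = []
    for line in lines:
        _id, val = line.strip().split(",")
        values.append(float(val))
    return mean(values)


print(average_of_values(["1,1.2", "2,1.3"]))
```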

What's the least expensive way to get the number of rows (data) in a SQLite DB?

When I need to get the number of rows (data) in a SQLite database, I run the following pseudocode.
cmd = "SELECT Count(*) FROM benchmark"
res = runcommand(cmd)
read res to get result.
But I'm not sure if it's the best way to go. What would be the optimal way to get the number of rows in a SQLite DB? I use Python for accessing SQLite.

Your query is correct but I would add an alias to make it easier to refer to the result:
SELECT COUNT(*) AS cnt FROM benchmark
Regarding this line:
count size of res
You don't want to count the number of rows in the result set - there will always be only one row. Just read the result out from the column cnt of the first row.
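
Since the question mentions Python, here is a minimal sketch with the standard sqlite3 module. The in-memory database and the single column are stand-ins invented for the example; only the table name benchmark comes from the question.

```python
import sqlite3


def count_rows(conn, table="benchmark"):
    """Return the row count of `table` via SELECT COUNT(*).

    Table names cannot be bound as SQL parameters, so `table` must come
    from trusted code, never from user input.
    """
    cur = conn.execute(f'SELECT COUNT(*) AS cnt FROM "{table}"')
    # COUNT(*) always yields exactly one row; read its single column.
    return cur.fetchone()[0]


# Hypothetical stand-in database with three rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE benchmark (value REAL)")
conn.executemany("INSERT INTO benchmark VALUES (?)", [(1.0,), (2.0,), (3.0,)])
print(count_rows(conn))
```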
