How to store grouped data into JSON in PySpark (DataFrame)

I am new to PySpark.
I have a dataset which looks like this (just a snapshot of a few columns).
I want to group my data by key. My key is
CONCAT(a.div_nbr, a.cust_nbr)
My ultimate goal is to convert the data into JSON, formatted like this
k1[{v1,v2,....},{v1,v2,....}], k2[{v1,v2,....},{v1,v2,....}],....
e.g.
248138339 [{ PRECIMA_ID:SCP 00248 0000138339, PROD_NBR:5553505, PROD_DESC:Shot and a Beer Battered Onion Rings (5553505 and 9285840) , PROD_BRND:Molly's Kitchen,PACK_SIZE:4/2.5 LB, QTY_UOM:CA } ,
{ PRECIMA_ID:SCP 00248 0000138339 , PROD_NBR:6659079 , PROD_DESC:Beef Chuck Short Rib Slices, PROD_BRND:Stockyards , PACK_SIZE:12 LBA , QTY_UOM:CA} ,{...,...,} ],
1384611034793[{},{},{}],....
I have created a dataframe (I am basically joining two tables to get some more fields):
joinstmt = sqlContext.sql("""
    SELECT a.precima_id, CONCAT(a.div_nbr, a.cust_nbr) AS key,
           a.prod_nbr, a.prod_desc, a.prod_brnd, a.pack_size, a.qty_uom,
           a.sales_opp, a.prc_guidance, a.pim_mrch_ctgry_desc, a.pim_mrch_ctgry_id,
           b.start_date, b.end_date
    FROM scoop_dtl a
    JOIN scoop_hdr b ON (a.precima_id = b.precima_id)""")
Now, in order to get the above result I need to group the result by key, so I did the following:
groupbydf = joinstmt.groupBy("key")
This resulted in a GroupedData object, and after reading I got to know that I cannot use it directly; I need to convert it back into a dataframe to store it.
I am new to this and need some help in order to convert it back into a dataframe, or I would appreciate it if there are any other ways as well.

If your joined dataframe looks like this:
gender age
M 5
F 50
M 10
M 10
F 10
You can then use the code below to get the desired output:
joinedDF.groupBy("gender") \
.agg(collect_list("age").alias("ages")) \
.write.json("jsonOutput.txt")
The output would look like this:
{"gender":"F","ages":[50,10]}
{"gender":"M","ages":[5,10,10]}
In case you have multiple columns, like name and salary, you can add them as below:
df.groupBy("gender")
.agg(collect_list("age").alias("ages"),collect_list("name").alias("names"))
Your output would look like:
{"gender":"F","ages":[50,10],"names":["ankit","abhay"]}
{"gender":"M","ages":[5,10,10],"names":["snchit","mohit","rohit"]}

You cannot use GroupedData directly; it has to be aggregated first. That could be partially covered by aggregation with built-in functions like collect_list, but it is simply not possible to achieve the desired output, with values used to represent keys, using DataFrameWriter alone.
You can try something like this instead:
from pyspark.sql import Row
from pyspark.sql.functions import struct
import json

def make_json(kvs):
    k, vs = kvs
    return json.dumps({k[0]: list(vs)})

# keys is a list of grouping column names; values is a struct of the payload columns
(df.select(struct(*keys), values)
    .rdd
    .mapValues(Row.asDict)
    .groupByKey()
    .map(make_json))
and then save the result with saveAsTextFile.
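For concreteness, keys and values could be defined like this for the question's schema (the column names are assumptions taken from the question, not from the answer):

from pyspark.sql.functions import struct

keys = ["key"]                                       # grouping column(s)
values = struct("prod_nbr", "prod_desc", "qty_uom")  # payload columns per record
# Each saved line would then look like:
# {"248138339": [{"prod_nbr": "5553505", ...}, {"prod_nbr": "6659079", ...}]}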

Related

How to take fields from parent dataframe into its subset in pyspark

I have been working on flattening a JSON; I have flattened it and got all the fields.
I have a parent dataset with 4 columns: firstname, last_name, gmail, age,
and a child subset extracted from the parent set using filter (filter on the highest age) and group by (group by gmail).
The columns I'm getting for the subset are max_age and gmail.
Now, using this subset, I want to extract the matching full rows (all columns) from the parent set.
How can this be done? Thanks in advance.
You can join based on both 'age' and 'gmail'.
Consider the following dataframe:
df = spark.createDataFrame(
    [["Amy", "Santiago", 32, "good99#gmail.com"],
     ["Rosa", "Dias", 31, "good99#gmail.com"],
     ["Norm", "Skully", 50, "bad99#gmail.com"]]
).toDF("Name", "LastName", "Age", "Gmail")
This would work:
import pyspark.sql.functions as F

df_agg = (
    df.alias("df1")
    .join(df.groupBy("Gmail").agg(F.max("Age").alias("Age")).alias("df2"),
          F.col("df1.Age").eqNullSafe(F.col("df2.Age"))
          & F.col("df1.Gmail").eqNullSafe(F.col("df2.Gmail")))  # join the subset back to the parent
    .selectExpr("df1.*")  # select only df1's columns
)
df_agg.show()
Output:
+----+--------+---+----------------+
|Name|LastName|Age|           Gmail|
+----+--------+---+----------------+
| Amy|Santiago| 32|good99#gmail.com|
|Norm|  Skully| 50| bad99#gmail.com|
+----+--------+---+----------------+
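As a side note, a window function can get the same result without the self-join; a hedged sketch, not from the answer, using the df defined above:

from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy("Gmail")
df_max = (df.withColumn("max_age", F.max("Age").over(w))  # per-gmail maximum
            .where(F.col("Age") == F.col("max_age"))      # keep only the oldest row(s)
            .drop("max_age"))
df_max.show()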

SQL filter elements of array

I have a table of employee similar to this:
Department Data
A [{"name":"John", "age":10, "job":"Manager"},{"name":"Eli", "age":40, "job":"Worker"},{"name":"Sam", "age":32, "job":"Manager"}]
B [{"name":"Jack", "age":50, "job":"CEO"},{"name":"Mike", "age":334 "job":"CTO"},{"name":"Filip", "age":63, "job":"Worker"}]
I want to get the department, name, and age of all employees, something similar to this:
Department Data
A [{"name":"John", "age":10},{"name":"Eli", "age":40},{"name":"Sam", "age":32}]
B [{"name":"Jack", "age":50},{"name":"Mike", "age":334},{"name":"Filip", "age":63}]
How can I achieve this using SQL query?
I assume you are using Hive/Spark and that the datatype of the column is an array of maps.
You can use the explode, collect_list, and map functions:
select dept,collect_list(map("name",t.map_elem['name'],"age",t.map_elem['age'])) as res
from tbl
lateral view explode(data) t as map_elem
group by dept
Note that this would not be as performant as a Spark solution, or as a UDF with which you can access the required keys in an array of maps directly, without a function like explode.
One more way to do this is with the Spark SQL functions transform and map_filter (only available starting with Spark version 3.0.0):
spark.sql("select dept,transform(data, map_elem -> map_filter(map_elem, (k, v) -> k != \"job\")) as res from tbl")
Another option, with Spark versions >= 2.4, is using the function element_at with transform and selecting the required keys:
spark.sql("select dept," +
"transform(data, map_elem -> map(\"name\",element_at(map_elem,\"name\"),\"age\",element_at(map_elem,\"age\"))) as res " +
"from tbl")
I'd get your table into tabular format first:
Department | Name | Age | Job
Then it is a simple projection:
SELECT Department, Name, Age
FROM EMPLOYEE

Count the grouped records in pig query

Below is my test data.
John,q1,Correct
Jack,q1,wrong
John,q2,Correct
Jack,q2,wrong
John,q3,wrong
Jack,q3,Correct
John,q4,wrong
Jack,q4,wrong
John,q5,wrong
Jack,q5,wrong
I want to find something like below:
John wrong 4
John correct 1
Jack wrong 3
Jack correct 2
My Code:
data = LOAD '/stackoverflowq4.txt' USING PigStorage(',') AS (
name:chararray,
number:chararray,
result:chararray);
B = GROUP data by (name,result);
Now the output looks like this:
((John,wrong),{(John,q5,wrong),(John,q4,wrong),(John,q2,wrong),(John,q1,wrong)})
((John,Correct),{(John,q3,Correct)})
((Jack,wrong),{(Jack,q5,wrong),(Jack,q4,wrong),(Jack,q3,wrong)})
((Jack,Correct),{(Jack,q2,Correct),(Jack,q1,Correct)})
How should I count the grouped records?
The COUNT function will give you the number of elements in a bag, which is exactly what you want. After grouping by name and result, you end up with a bag holding all the records for each combination, and counting that bag gives the number of times the combination appeared.
Therefore, you only have to add one line:
data = LOAD '/stackoverflowq4.txt' USING PigStorage(',') AS (
name:chararray,
number:chararray,
result:chararray);
B = GROUP data by (name,result);
C = foreach B generate FLATTEN(group) as (name,result), COUNT(data) as count;
dump C;
(Jack,wrong,4)
(Jack,Correct,1)
(John,wrong,3)
(John,Correct,2)
The FLATTEN(group) is there because after grouping, a tuple containing the fields you grouped by is generated, and by the looks of what you want as output you don't want it nested inside a tuple, as the output would otherwise be like ((Jack,wrong),4).

Sending relation to UDF functions

Can I send a relation to a Pig UDF function as input? A relation can have multiple tuples in it. How do we read each tuple one by one in a Pig UDF function?
OK, below is my sample input file:
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
Remya,AXIS,40000,CTS
Arun,SBI,30000,TCS
Vimal,SBI,10000,TCS
Ankur,HDFC,80000,CTS
Karthic,HDFC,95000,CTS
Sandhya,AXIS,60000,CTS
Amit,SBI,70000,CTS
myinput = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',') AS(name:chararray,bank:chararray,amt:long,company:chararray);
grouped = GROUP myinput BY company;
All I need are the details of the highest-paid employee in each company. How do I use a UDF for that?
I need something like this
CTS Karthic,HDFC,95000,CTS
TCS Raja,AXIS,80000,TCS
Can someone help me with this?
This script will give you the results you want:
A = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',') AS(name:chararray,bank:chararray,amt:long,company:chararray);
B = GROUP A BY (company);
topResults = FOREACH B {result = TOP(1, 2, A); GENERATE FLATTEN(result);}
dump topResults;
Explanation:
First we group A on the basis of company, so B is:
(CTS,{(Surender,HDFC,60000,CTS),(Kumar,AXIS,70000,CTS),(Remya,AXIS,40000,CTS),(Ankur,HDFC,80000,CTS),(Karthic,HDFC,95000,CTS),(Sandhya,AXIS,60000,CTS),(Amit,SBI,70000,CTS)})
(TCS,{(Raja,AXIS,80000,TCS),(Raj,HDFC,70000,TCS),(Arun,SBI,30000,TCS),(Vimal,SBI,10000,TCS)})
Then we say: for each tuple in B, generate another tuple, result, which is the top 1 record from the bag of A's records contained in B, on the basis of the value of column number 2, i.e. amt. The columns are numbered from 0.
Note
First, your data has extra spaces after the company name. Please remove the extra spaces or use the following data:
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
Remya,AXIS,40000,CTS
Arun,SBI,30000,TCS
Vimal,SBI,10000,TCS
Ankur,HDFC,80000,CTS
Karthic,HDFC,95000,CTS
Sandhya,AXIS,60000,CTS
Amit,SBI,70000,CTS
You don't need to write a UDF to do this; you can simply do it with the built-in TOP function from Pig: http://pig.apache.org/docs/r0.11.0/func.html#topx
Here is an example of code that should work (not tested):
grouped = GROUP myinput BY company;
result = FOREACH grouped GENERATE group AS company, FLATTEN(TOP(1, 2, myinput));

Using pig, how do I parse a mixed format line into tuples and a bag of tuples?

I'm new to pig, and I'm having an issue parsing my input and getting it into a format that I can use. The input file contains lines that have both fixed fields and KV pairs as follows:
FF1|FF2|FF3|FF4|KVP1|KVP2|...|KVPn
My goal here is to count the number of unique fixed field combinations for each of the KV Pairs. So considering the following input lines:
1|2|3|4|key1=value1|key2=value2
2|3|4|5|key1=value7|key2=value2|key3=value3
When I'm done, I'd like to be able to generate the following results (the output format doesn't really matter at this point, I'm just showing you what I'd like the results to be):
key1=value1 : 1
key1=value7 : 1
key2=value2 : 2
key3=value3 : 1
It seems like I should be able to do this by grouping on the fixed fields and flattening a bag of the KV pairs to generate the cross product.
I've tried reading this in with something like:
data = load 'myfile' using PigStorage('|');
A = foreach data generate $0 as ff1:chararray, $1 as ff2:long, $2 as ff3:chararray, $3 as ff4:chararray, TOBAG($4..) as kvpairs:bag{kvpair:tuple()};
B = foreach A { sorted = order A by ff2; lim = limit sorted 1; generate group.ff1, group.ff4, flatten( lim.kvpairs ); };
C = filter B by ff3 matches 'somevalue';
D = foreach C generate ff1, ff4, flatten( kvpairs ) as kvpair;
E = group D by (ff1, ff4, kvpair);
F = foreach E generate group, COUNT(E);
This generates records with a schema as follows:
A: {date: long,hms: long,id: long,ff1: chararray,ff2: long,ff3: chararray,ff4: chararray,kvpairs: {kvpair: (NULL)}}
While this gets me the schema that I want, there are several problems that I can't seem to solve:
By using TOBAG with .., no schema can be applied to my kvpairs, so I can't ever filter on kvpair, and I don't seem to be able to cast this at any point, so it's an all-or-nothing query.
The filter in statement 'C' seems to return no data regardless of what value I use, even if I use something like '.*' or '.+'. I don't know if this is because there is no schema, or if this is actually a bug in Pig. If I dump some data from statement B, I definitely see data there that would match those expressions.
So I've tried approaching the problem differently, by loading the data using:
data = load 'myfile' using PigStorage('\n') as (line:chararray);
init_parse = foreach data generate FLATTEN( STRSPLIT( line, '\\|', 5 ) ) as (ff1:chararray, ff2:chararray, ff3:chararray, ff4:chararray, kvpairsStr:chararray);
A = foreach init_parse generate ff1, ff2, ff3, ff4, TOBAG( STRSPLIT( kvpairsStr, '\\|', 500 ) ) as kvpairs:bag{t:(kvpair:chararray)};
The issue here is that the TOBAG( STRSPLIT( ... ) ) results in a bag of a single tuple, with each of the kvpairs being a field in that tuple. I really need the bag to contain, each of the individual kvpairs as a tuple of one field so that when I flatten the bag, I get the cross product of the bag and the group that I'm interested in.
I'm open to other ways of attacking this problem as well, but I can't seem to find a good way to transform my tuple of multiple fields into a bag of tuples, with each tuple having one field.
I'm using Apache Pig version 0.11.1.1.3.0.0-107
Thanks in advance.
Your second approach is on the right track. Unfortunately, you'll need a UDF to convert a tuple to a bag, and as far as I know there is no builtin to do this. It's a simple matter to write one, however.
You won't want to group on the fixed fields, but rather on the key-value pairs themselves. So you only need to keep the tuple of key-value pairs; you can completely ignore the fixed fields.
The UDF is pretty simple. In Java, the whole class can look something like this (the class name is illustrative):
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.*;

public class TupleToBag extends EvalFunc<DataBag> {
    public DataBag exec(Tuple input) throws IOException {
        DataBag b = BagFactory.getInstance().newDefaultBag();
        Tuple t = (Tuple) input.get(0);
        for (int i = 0; i < t.size(); i++) {
            // wrap each field in its own single-field tuple
            b.add(TupleFactory.getInstance().newTuple(t.get(i)));
        }
        return b;
    }
}
Once you have that, turn the tuple from STRSPLIT into a bag, flatten it, and then do the grouping and counting.