How to take fields from a parent dataframe into its subset in PySpark

I have been working on flattening a JSON file. I have flattened the JSON and got all the fields.
I have a parent dataset with 4 columns: firstname, last_name, gmail, age,
and a child subset extracted from the parent using a filter (filter on the highest age) and a group by (group by gmail).
The columns I'm getting for the subset are max_age and gmail.
Now, using this subset, I want to pull back all the columns from the parent set for those rows.
How can this be done? Thanks in advance.

You can join back to the parent based on both 'age' and 'gmail'.
Consider the following dataframe:
df = spark.createDataFrame([["Amy","Santiago", 32, "good99#gmail.com"],["Rosa", "Dias", 31,"good99#gmail.com"],["Norm", "Skully", 50,"bad99#gmail.com"]]).toDF("Name", "LastName", "Age", "Gmail");
This would work
from pyspark.sql import functions as F

df_agg = df.alias("df1") \
    .join(df.groupBy("Gmail").agg(F.max("Age").alias("Age")).alias("df2"),  # max age per gmail
          F.col("df1.Age").eqNullSafe(F.col("df2.Age")) & F.col("df1.Gmail").eqNullSafe(F.col("df2.Gmail"))) \
    .selectExpr("df1.*")  # keep only the parent (df1) columns
df_agg.show()
Output (row order may vary):
+----+--------+---+----------------+
|Name|LastName|Age|           Gmail|
+----+--------+---+----------------+
| Amy|Santiago| 32|good99#gmail.com|
|Norm|  Skully| 50| bad99#gmail.com|
+----+--------+---+----------------+
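If a join is undesirable, a window function gives an equivalent result. A minimal sketch, assuming the same df as above (ties on the maximum age are kept, as with the join):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank rows within each Gmail by descending Age and keep only the top-ranked
# rows, which preserves every parent column without a self-join.
w = Window.partitionBy("Gmail").orderBy(F.col("Age").desc())
df_max = df.withColumn("rnk", F.rank().over(w)) \
    .filter(F.col("rnk") == 1) \
    .drop("rnk")
df_max.show()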

Related

SQL filter elements of array

I have a table of employee similar to this:
Department Data
A [{"name":"John", "age":10, "job":"Manager"},{"name":"Eli", "age":40, "job":"Worker"},{"name":"Sam", "age":32, "job":"Manager"}]
B [{"name":"Jack", "age":50, "job":"CEO"},{"name":"Mike", "age":334 "job":"CTO"},{"name":"Filip", "age":63, "job":"Worker"}]
I want to get the department, name, and age of all employees, something similar to this:
Department Data
A [{"name":"John", "age":10},{"name":"Eli", "age":40},{"name":"Sam", "age":32}]
B [{"name":"Jack", "age":50},{"name":"Mike", "age":334},{"name":"Filip", "age":63}]
How can I achieve this using a SQL query?
I assume you are using Hive/Spark and that the datatype of the column is an array of maps.
You can use the explode, collect_list, and map functions:
select dept,collect_list(map("name",t.map_elem['name'],"age",t.map_elem['age'])) as res
from tbl
lateral view explode(data) t as map_elem
group by dept
Note that this would not be as performant as a Spark solution or a UDF with which you can access the required keys in an array of maps, without a function like explode.
One more way to do this is with the Spark SQL functions transform and map_filter (the latter is only available starting with Spark version 3.0.0):
spark.sql("select dept,transform(data, map_elem -> map_filter(map_elem, (k, v) -> k != \"job\")) as res from tbl")
Another option, with Spark versions 2.4 and above, is using the function element_at together with transform and selecting the required keys:
spark.sql("select dept," +
"transform(data, map_elem -> map(\"name\",element_at(map_elem,\"name\"),\"age\",element_at(map_elem,\"age\"))) as res " +
"from tbl")
I'd get your table into tabular format first:
Department | Name | Age | Job
Then it is a simple projection:
SELECT Department, Name, Age
FROM EMPLOYEE

how to query dataframe in for loop with changing values, ValueError: Lengths must match to compare?

This works, but I had to type it 4 different times, once for each value in the charge_names list:
charge_names = ['Vehicle Theft','Robbery','Burglary','Receive Stolen Property']
charges[charges['Charge Group Description']== 'Vehicle Theft'].head(2)
I tried to run for loop like this:
charge_names = ['Vehicle Theft','Robbery','Burglary','Receive Stolen Property']
for name in charge_names:
charges[charges['Charge Group Description']== name].head(2)
but without much success.
This is not working either:
charges[['Charge Group Description'].isin(['Robbery', 'Burglary'])]
How can I query for all 4 values in the charge_names list in one line?
DataFrame.isin: whether each element in the DataFrame is contained in values.
DataFrame.groupby: group the DataFrame based on entries.
Combining the two:
charge_names = ['Vehicle Theft','Robbery','Burglary','Receive Stolen Property']
charges[charges['Charge Group Description'].isin(charge_names)].groupby('Charge Group Description').head(2)
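As a side note on the loop in the question: each filtered frame was computed but never printed or stored, so nothing appeared. A minimal sketch that keeps the per-name results (the dict name is illustrative):
# Collect the first two rows per charge name into a dict of DataFrames,
# so each result is kept instead of being silently discarded.
results = {
    name: charges[charges['Charge Group Description'] == name].head(2)
    for name in charge_names
}
for name, subset in results.items():
    print(name)
    print(subset)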

how to store grouped data into json in pyspark

I am new to pyspark.
I have a dataset which looks like this (just a snapshot of a few columns).
I want to group my data by key. My key is
CONCAT(a.div_nbr,a.cust_nbr)
My ultimate goal is to convert the data into JSON, formatted like this:
k1[{v1,v2,....},{v1,v2,....}], k2[{v1,v2,....},{v1,v2,....}],....
e.g
248138339 [{ PRECIMA_ID:SCP 00248 0000138339, PROD_NBR:5553505, PROD_DESC:Shot and a Beer Battered Onion Rings (5553505 and 9285840) , PROD_BRND:Molly's Kitchen,PACK_SIZE:4/2.5 LB, QTY_UOM:CA } ,
{ PRECIMA_ID:SCP 00248 0000138339 , PROD_NBR:6659079 , PROD_DESC:Beef Chuck Short Rib Slices, PROD_BRND:Stockyards , PACK_SIZE:12 LBA , QTY_UOM:CA} ,{...,...,} ],
1384611034793[{},{},{}],....
I have created a dataframe (I am joining two tables basically to get some more fields)
joinstmt = sqlContext.sql("""
    SELECT a.precima_id, CONCAT(a.div_nbr, a.cust_nbr) AS key,
           a.prod_nbr, a.prod_desc, a.prod_brnd, a.pack_size, a.qty_uom,
           a.sales_opp, a.prc_guidance, a.pim_mrch_ctgry_desc, a.pim_mrch_ctgry_id,
           b.start_date, b.end_date
    FROM scoop_dtl a JOIN scoop_hdr b ON (a.precima_id = b.precima_id)
""")
Now, in order to get the above result I need to group the result by key, so I did the following:
groupbydf = joinstmt.groupBy("key")
This resulted in grouped data, and after reading I got to know that I cannot use it directly and need to convert it back into dataframes to store it.
I am new to this and need some help converting it back into dataframes, or I would appreciate it if there are any other ways as well.
If your joined dataframe looks like this:
gender age
M 5
F 50
M 10
M 10
F 10
You can then use the code below to get the desired output:
joinedDF.groupBy("gender") \
.agg(collect_list("age").alias("ages")) \
.write.json("jsonOutput.txt")
Output would look like below:
{"gender":"F","ages":[50,10]}
{"gender":"M","ages":[5,10,10]}
In case you have multiple columns, like name and salary, you can add them like below:
df.groupBy("gender") \
    .agg(collect_list("age").alias("ages"), collect_list("name").alias("names"))
Your output would look like:
{"gender":"F","ages":[50,10],"names":["ankit","abhay"]}
{"gender":"M","ages":[5,10,10],"names":["snchit","mohit","rohit"]}
You cannot use GroupedData directly. It has to be aggregated first. That could be partially covered by aggregation with built-in functions like collect_list, but it is simply not possible to achieve the desired output, with values used to represent keys, using DataFrameWriter.
You can try something like this instead:
from pyspark.sql import Row
from pyspark.sql.functions import struct
import json

def make_json(kvs):
    k, vs = kvs
    return json.dumps({k[0]: list(vs)})

# `keys` is a list of the key column names and `values` a struct of the value
# columns; both are placeholders to be filled in for your schema.
json_rdd = (df.select(struct(*keys), values)
            .rdd
            .mapValues(Row.asDict)
            .groupByKey()
            .map(make_json))
and then save it with saveAsTextFile.
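For example (the output path is illustrative):
# Write the RDD of JSON strings as plain text, one JSON document per line.
json_rdd.saveAsTextFile("grouped_json_output")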

pandas groupby by different key and merge

I have transaction data containing three variables: user_id/item_id/type_id. One user_id can have more than one item_id and type_id, and the type_id is in (1,2,3,4).
data=DataFrame({'user_id':['a','a','a','b','b','c'],'item_id':['1','3','3','2','4','1'],'type_id':['1','2','2','3','4','4']})
ui=data.groupby(['user_id','item_id','type_id']).size()
u=data.groupby(['user_id','type_id']).size()
What I want to get in the end is every user_id's number of distinct type_ids,
and also every (user_id, item_id)'s number of distinct type_ids, and then merge them by user_id.
Your question is difficult to answer but here is one solution:
import pandas as pd

data = pd.DataFrame({'user_id':['a','a','a','b','b','c'],
                     'item_id':['1','3','3','2','4','1'],
                     'type_id':['1','2','2','3','4','4']})

# Distinct type_id count per user, distinct type_id count per (user, item),
# then merge the two on user_id.
ui = data.groupby('user_id').type_id.nunique().reset_index()
u = data.groupby(['user_id','item_id']).type_id.nunique().reset_index()
final = ui.merge(u, on='user_id', how='inner').set_index('user_id')
final.columns = ['user_distinct_type_id', 'item_id', 'item_distinct_type_id']
print(final)
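For the sample data above, this prints roughly:
         user_distinct_type_id item_id  item_distinct_type_id
user_id
a                            2       1                      1
a                            2       3                      1
b                            2       2                      1
b                            2       4                      1
c                            1       1                      1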

Pig: Summing Fields

I have some census data in which each line has a number denoting the county and fields for the number of people in a certain age range (e.g., 5 and under, 5 to 17, etc.). After some initial processing in which I removed the unneeded columns, I grouped the labeled data as follows (labeled_data is of the schema {county: chararray,pop1: int,pop2: int,pop3: int,pop4: int,pop5: int,pop6: int,pop7: int,pop8: int}):
grouped_data = GROUP filtered_data BY county;
So grouped_data is of the schema
{group: chararray,filtered_data: {(county: chararray,pop1: int,pop2: int,pop3: int,pop4: int,pop5: int,pop6: int,pop7: int,pop8: int)}}
Now I would like to sum up all of the pop fields for each county, yielding the total population of each county. I'm pretty sure the command to do this will be of the form
pop_sums = FOREACH grouped_data GENERATE group, SUM(something about the pop fields);
but I've been unable to get this to work. Thanks in advance!
I don't know if this is helpful, but the following is a representative entry of grouped_data:
(147,{(147,385,1005,283,468,649,738,933,977),(147,229,655,178,288,394,499,579,481)})
Note that the 147 entries are actually county codes, not populations. They are therefore of type chararray.
Can you try the below approach?
Sample input:
147,1,1,1,1,1,1,1,1
147,2,2,2,2,2,2,2,2
145,5,5,5,5,5,5,5,5
PigScript:
A = LOAD 'input' USING PigStorage(',') AS (county:chararray, pop1:int, pop2:int, pop3:int, pop4:int, pop5:int, pop6:int, pop7:int, pop8:int);
B = GROUP A BY county;
C = FOREACH B GENERATE group, (SUM(A.pop1) + SUM(A.pop2) + SUM(A.pop3) + SUM(A.pop4) + SUM(A.pop5) + SUM(A.pop6) + SUM(A.pop7) + SUM(A.pop8)) AS totalPopulation;
DUMP C;
Output:
(145,40)
(147,24)