SQL filter elements of array

I have a table of employee similar to this:
Department | Data
A | [{"name":"John", "age":10, "job":"Manager"},{"name":"Eli", "age":40, "job":"Worker"},{"name":"Sam", "age":32, "job":"Manager"}]
B | [{"name":"Jack", "age":50, "job":"CEO"},{"name":"Mike", "age":334, "job":"CTO"},{"name":"Filip", "age":63, "job":"Worker"}]
I want to get the department, name, and age of all employees, something similar to this:
Department | Data
A | [{"name":"John", "age":10},{"name":"Eli", "age":40},{"name":"Sam", "age":32}]
B | [{"name":"Jack", "age":50},{"name":"Mike", "age":334},{"name":"Filip", "age":63}]
How can I achieve this using a SQL query?

I assume you are using Hive/Spark and that the datatype of the column is an array of maps.
You can use the explode, collect_list and map functions:
select dept,collect_list(map("name",t.map_elem['name'],"age",t.map_elem['age'])) as res
from tbl
lateral view explode(data) t as map_elem
group by dept
Note that this would not be as performant as a Spark solution or a UDF with which you can access the required keys in an array of maps without a function like explode.
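For illustration, here is a minimal PySpark sketch of such a UDF approach (an assumption on my part: the table is loaded as a DataFrame df and data is an array<map<string,string>> column; the names here are illustrative, not from the original answer):
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, MapType, StringType

# keep only the "name" and "age" keys of each map in the array
keep_name_age = udf(
    lambda maps: [{k: m[k] for k in ("name", "age") if k in m} for m in (maps or [])],
    ArrayType(MapType(StringType(), StringType())),
)

result = df.select("dept", keep_name_age("data").alias("res"))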
One more way to do this is with the Spark SQL functions transform and map_filter (map_filter is only available starting with Spark 3.0.0):
spark.sql("select dept,transform(data, map_elem -> map_filter(map_elem, (k, v) -> k != \"job\")) as res from tbl")
Another option, with Spark versions 2.4 and later, is to use the element_at function with transform and select the required keys:
spark.sql("select dept," +
"transform(data, map_elem -> map(\"name\",element_at(map_elem,\"name\"),\"age\",element_at(map_elem,\"age\"))) as res " +
"from tbl")

I'd get your table into a tabular format first:
Department | Name | Age | Job
Then:
SELECT Department, Name, Age
FROM EMPLOYEE

Querying Column Headers in GBQ

Is it possible to write a query that outputs the column headers of a specific table? I'm uploading multiple files onto our server via GBQ, and while it auto-detects the headers, I would like to list out the headers either in rows or as a comma-separated cell.
Thank you
I am assuming your files are in CSV format, so the table's schema does not have repeated fields. With this in mind, the query below is for BigQuery Standard SQL and requires just the fully qualified table name:
#standardSQL
SELECT
REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"(.+?)"') cols_as_array,
ARRAY_TO_STRING(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"(.+?)"'), ',') cols_as_string
FROM (SELECT 1) LEFT JOIN
(SELECT * FROM `project.dataset.table` WHERE FALSE) t
ON TRUE
If you apply it to a real table, as in the example below:
#standardSQL
SELECT
REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"(.+?)"') cols_as_array,
ARRAY_TO_STRING(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"(.+?)"'), ',') cols_as_string
FROM (SELECT 1) LEFT JOIN
(SELECT * FROM `bigquery-public-data.utility_us.us_states_area` WHERE FALSE) t
ON TRUE
the result will be:
cols_as_array: [region_code, division_code, state_fips_code, state_gnis_code, state_geo_id, state_abbreviation, state_name, legal_area_code, feature_class_code, functional_status_code, area_land_meters, area_water_meters, internal_point_lat, internal_point_lon, state_geom]
cols_as_string: region_code,division_code,state_fips_code,state_gnis_code,state_geo_id,state_abbreviation,state_name,legal_area_code,feature_class_code,functional_status_code,area_land_meters,area_water_meters,internal_point_lat,internal_point_lon,state_geom
You can choose which version to use: the list as an array or as a comma-separated string.
Also note that the above query does not incur any cost at all!
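A hedged alternative, assuming your project can query BigQuery's INFORMATION_SCHEMA views (not part of the original answer), is to read the column metadata directly; the dataset and table names below are placeholders:
#standardSQL
SELECT column_name
FROM `project.dataset`.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'table'
ORDER BY ordinal_position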

how to store grouped data into json in pyspark

I am new to pyspark
I have a dataset which looks like this (just a snapshot of a few columns):
I want to group my data by key. My key is
CONCAT(a.div_nbr,a.cust_nbr)
My ultimate goal is to convert the data into JSON, formatted like this:
k1[{v1,v2,....},{v1,v2,....}], k2[{v1,v2,....},{v1,v2,....}],....
e.g
248138339 [{ PRECIMA_ID:SCP 00248 0000138339, PROD_NBR:5553505, PROD_DESC:Shot and a Beer Battered Onion Rings (5553505 and 9285840) , PROD_BRND:Molly's Kitchen,PACK_SIZE:4/2.5 LB, QTY_UOM:CA } ,
{ PRECIMA_ID:SCP 00248 0000138339 , PROD_NBR:6659079 , PROD_DESC:Beef Chuck Short Rib Slices, PROD_BRND:Stockyards , PACK_SIZE:12 LBA , QTY_UOM:CA} ,{...,...,} ],
1384611034793[{},{},{}],....
I have created a dataframe (basically I am joining two tables to get some more fields):
joinstmt = sqlContext.sql(
    "SELECT a.precima_id, CONCAT(a.div_nbr, a.cust_nbr) as key, "
    "a.prod_nbr, a.prod_desc, a.prod_brnd, a.pack_size, a.qty_uom, a.sales_opp, "
    "a.prc_guidance, a.pim_mrch_ctgry_desc, a.pim_mrch_ctgry_id, b.start_date, b.end_date "
    "FROM scoop_dtl a join scoop_hdr b on (a.precima_id = b.precima_id)")
Now, in order to get the above result, I need to group the result by key, so I did the following:
groupbydf = joinstmt.groupBy("key")
This resulted in grouped data, and after some reading I learned that I cannot use it directly and need to convert it back into a dataframe to store it.
I am new to this and need some help in order to convert it back into a dataframe, or I would appreciate it if there are any other ways as well.
If your joined dataframe looks like this:
gender age
M 5
F 50
M 10
M 10
F 10
You can then use the code below to get the desired output:
from pyspark.sql.functions import collect_list

joinedDF.groupBy("gender") \
    .agg(collect_list("age").alias("ages")) \
    .write.json("jsonOutput.txt")
Output would look like below:
{"gender":"F","ages":[50,10]}
{"gender":"M","ages":[5,10,10]}
In case you have multiple columns, like name and salary, you can add them like below:
df.groupBy("gender")
.agg(collect_list("age").alias("ages"),collect_list("name").alias("names"))
Your output would look like:
{"gender":"F","ages":[50,10],"names":["ankit","abhay"]}
{"gender":"M","ages":[5,10,10],"names":["snchit","mohit","rohit"]}
You cannot use GroupedData directly. It has to be aggregated first. This could be partially covered by aggregation with built-in functions like collect_list, but it is simply not possible to achieve the desired output, with values used as keys, using DataFrameWriter.
You can try something like this instead:
from pyspark.sql import Row
from pyspark.sql.functions import struct
import json

def make_json(kvs):
    k, vs = kvs
    return json.dumps({k[0]: list(vs)})

# keys is a list of the key column names; values is the column holding the row values
(df.select(struct(*keys), values)
    .rdd
    .mapValues(Row.asDict)
    .groupByKey()
    .map(make_json))
and then call saveAsTextFile on the result.

Sending relation to UDF functions

Can I send a relation to a Pig UDF function as input? A relation can have multiple tuples in it. How do we read each tuple one by one in a Pig UDF function?
OK. Below is my sample input file:
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
Remya,AXIS,40000,CTS
Arun,SBI,30000,TCS
Vimal,SBI,10000,TCS
Ankur,HDFC,80000,CTS
Karthic,HDFC,95000,CTS
Sandhya,AXIS,60000,CTS
Amit,SBI,70000,CTS
myinput = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',') AS(name:chararray,bank:chararray,amt:long,company:chararray);
grouped = GROUP myinput BY company;
All I need is details about the highest paid employee in each company. How do I use a UDF for that?
I need something like this:
CTS Karthic,HDFC,95000,CTS
TCS Raja,AXIS,80000,TCS
Can someone help me with this?
This script will give you the results you want:
A = LOAD '/home/cloudera/surender/laurela/balance.txt' USING PigStorage(',') AS(name:chararray,bank:chararray,amt:long,company:chararray);
B = GROUP A BY (company);
topResults = FOREACH B {result = TOP(1, 2, A); GENERATE FLATTEN(result);}
dump topResults;
Explanation:
First we group A on the basis of company, so B is:
(CTS,{(Surender,HDFC,60000,CTS),(Kumar,AXIS,70000,CTS),(Remya,AXIS,40000,CTS),(Ankur,HDFC,80000,CTS),(Karthic,HDFC,95000,CTS),(Sandhya,AXIS,60000,CTS),(Amit,SBI,70000,CTS)})
(TCS,{(Raja,AXIS,80000,TCS),(Raj,HDFC,70000,TCS),(Arun,SBI,30000,TCS),(Vimal,SBI,10000,TCS)})
Then we say: for each tuple in B, generate another tuple, result, which is equal to the top 1 record from the bag A found in B, on the basis of the value of column number 2, i.e. amt. The columns are numbered from 0.
Note
First, your data has extra spaces after the company name. Please remove the extra spaces or use the following data:
Surender,HDFC,60000,CTS
Raja,AXIS,80000,TCS
Raj,HDFC,70000,TCS
Kumar,AXIS,70000,CTS
Remya,AXIS,40000,CTS
Arun,SBI,30000,TCS
Vimal,SBI,10000,TCS
Ankur,HDFC,80000,CTS
Karthic,HDFC,95000,CTS
Sandhya,AXIS,60000,CTS
Amit,SBI,70000,CTS
You don't need to write a UDF to do this; you can simply do it with the TOP function from Pig: http://pig.apache.org/docs/r0.11.0/func.html#topx
Here is an example of code that should work (not tested):
grouped = GROUP myinput BY company;
result = FOREACH grouped GENERATE group AS company, FLATTEN(TOP(1, 2, myinput));

Include column names in Grails SQL query results

I have a query that looks like this...
def data = session.createSQLQuery("""SELECT
a.id AS media_id,
a.uuid,
a.date_created,
a.last_updated,
a.device_date_time,
a.ex_time,
a.ex_source,
a.sequence,
a.time_zone,
a.time_zone_offset,
a.media_type,
a.size_in_bytes,
a.orientation,
a.width,
a.height,
a.duration,
b.id AS app_user_record_id,
b.app_user_identifier,
b.application_id,
b.date_created AS app_user_record_date_created,
b.last_updated AS app_user_record_last_updated,
b.instance_id,
b.user_uuid
FROM media a, app_user_record b
WHERE a.uuid = b.user_uuid
LIMIT :firstResult, :maxResults """)
.setInteger("firstResult", cmd.firstResult)
.setInteger("maxResults", cmd.maxResults)
.list()
The problem is the .list method returns an array that has no column names. Does anybody know of a way to include/add the column names from a Grails native SQL query? I could obviously transform the results into a map and hard-code the column names myself.
Use setResultTransformer(Criteria.ALIAS_TO_ENTITY_MAP) on the query. This returns each row as a map of entries.
import org.hibernate.Criteria
def query = """Your query"""
def data = session.createSQLQuery(query)
.setInteger("firstResult", cmd.firstResult)
.setInteger("maxResults", cmd.maxResults)
.setResultTransformer(Criteria.ALIAS_TO_ENTITY_MAP)
.list()
data.each{println it.UUID}
I tested it and realized that earlier I used to use the column number to fetch each field instead of the column name.
NOTE: keys are upper case, so ex_source would be EX_SOURCE in the result map.

getting count(*) using createSQLQuery in hibernate?

I have several SQL queries that I simply want to fire at the database.
I am using Hibernate throughout the whole application, so I would prefer to use Hibernate to call these SQL queries.
In the example below I want to get count + name, but I can't figure out how to get that info when I use createSQLQuery().
I have seen workarounds where people only need to get a single "count()" from the result, but in this case I am using count() + a column as output:
SELECT count(*), a.name as count FROM user a
WHERE a.user_id IN (SELECT b.user_id FROM user b)
GROUP BY a.name
HAVING COUNT(*) BETWEEN 2 AND 5;
FYI, the above query would deliver a result like this if I call it directly on the database:
1, John
2, Donald
1, Ralph
...
Alternatively, you can use
SQLQuery query = session.createSQLQuery("SELECT count(*) as num, a.name as name FROM user a WHERE a.user_id IN (SELECT b.user_id FROM user b) GROUP BY a.name HAVING COUNT(*) BETWEEN 2 AND 5");
query.addScalar("num", Hibernate.INTEGER).addScalar("name", Hibernate.STRING);
// you might need to use org.hibernate.type.StandardBasicTypes.INTEGER / STRING
// for Hibernate v3.6+,
// see https://hibernate.onjira.com/browse/HHH-5138
List<Object[]> result = query.list();
// result.get(i)[0] -> i-th row num
// result.get(i)[1] -> i-th row name
I'm using this in case of time pressure; IMO it is much faster to code than creating your own beans & transformers.
Cheers!
Jakub
Cheers for the info Thomas, it worked wonderfully for generating objects.
The problem I had with my initial query was that "count" was a reserved word :P
When I changed the name to something else, it worked.
If your SQL statement looks like this: SELECT count(*) as count, a.name as name..., you could use setResultTransformer(new AliasToBeanResultTransformer(YourSimpleBean.class)) on your Query,
where YourSimpleBean has the fields Integer count and String name, with the setters setCount and setName respectively.
On execution of the query with query.list(), Hibernate will return a List of YourSimpleBeans.
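For reference, a minimal sketch of such a bean (the class and field names are just the ones described above):
// plain POJO used by AliasToBeanResultTransformer; needs a no-arg constructor and setters
public class YourSimpleBean {
    private Integer count;
    private String name;

    public Integer getCount() { return count; }
    public void setCount(Integer count) { this.count = count; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}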