Extracting the value of a JSON key in Spark SQL - apache-spark-sql

I am looking to aggregate by extracting the value of a JSON key from one of the columns. Can someone help me with the right syntax in Spark SQL?
select count(distinct(Name)) as users, xHeaderFields['xyz'] as app group by app order by users desc
The table column looks something like this. I have removed other columns for simplification; the table has columns like Name, etc.

Assuming that your dataset is called ds and there is only one key=xyz object per column:
First, convert the string to JSON (if needed):
ds = ds.withColumn("xHeaderFields", expr("from_json(xHeaderFields, 'array<struct<key:string,value:string>>')"))
Then filter on key = 'xyz' and take the first element (assuming there is only one xyz key per row):
.withColumn("xHeaderFields", expr("filter(xHeaderFields, x -> x.key == 'xyz')[0]"))
Finally, extract value from your object:
.withColumn("xHeaderFields", expr("xHeaderFields.value"))
Final result:
+-------------+
|xHeaderFields|
+-------------+
|         null|
|         null|
|  Settheclass|
+-------------+
Good luck!
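If it helps to reason about what those three expressions do together, here is a plain-Python sketch of the same logic (the function name and sample payloads are made up for illustration):

```python
import json

def extract_header_value(x_header_fields, key="xyz"):
    """Mirror of the Spark pipeline above: parse the JSON array of
    {key, value} structs, keep the entries whose key matches, and take
    the first match's value (None when the key is absent, like the
    nulls in the result table)."""
    fields = json.loads(x_header_fields)                   # from_json(...)
    matches = [f for f in fields if f.get("key") == key]   # filter(..., x -> x.key == 'xyz')
    return matches[0]["value"] if matches else None        # [0].value

print(extract_header_value('[{"key":"xyz","value":"Settheclass"}]'))
print(extract_header_value('[{"key":"abc","value":"other"}]'))
```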

Related

How to transform a JSON String into a new table in Bigquery

I have a lot of raw data in one of my tables in BigQuery, and from this raw data I need to create a new table.
The raw data table has a column named raw_output; this column contains a JSON object that was stringified. It looks like this:
| raw_output |
| ----------------------------------------------------------------------|
| {"client":"A9310","c_integration":"889625","idntf":false,"nf_p":8.32} |
| {"client":"VB050","c_integration":"236590","idntf":true,"nf_p":4.36} |
| {"client":"XT5543","c_integration":"326957","idntf":true,"nf_p":2.33} |
From this table I would like to get something like:
| client | c_integration | idntf | nf_p |
| ------ | ------------- | ----- | ---- |
| A9310  | 889625        | false | 8.32 |
| VB050  | 236590        | true  | 4.36 |
| XT5543 | 326957        | true  | 2.33 |
so that I can perform JOINs and other operations with the data. I have looked into Google's BQ docs (JSON functions) but was not able to get the expected output. Any idea/solution is much appreciated.
Thank you all in advance.
This should help
with raw_data as (
  select '{"client":"A9310","c_integration":"889625","idntf":false,"nf_p":8.32}' as raw_input union all
  select '{"client":"VB050","c_integration":"236590","idntf":true,"nf_p":4.36}' as raw_input union all
  select '{"client":"XT5543","c_integration":"326957","idntf":true,"nf_p":2.33}' as raw_input
)
select
  json_extract_scalar(raw_input, '$.client') as client,
  json_extract_scalar(raw_input, '$.c_integration') as c_integration,
  json_extract_scalar(raw_input, '$.idntf') as idntf,
  json_extract_scalar(raw_input, '$.nf_p') as nf_p
from raw_data
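To see what the query above produces, here is a plain-Python sketch of the same extraction (the helper names are made up; json_extract_scalar yields strings, so booleans come back as "false"/"true" and numbers as their text form):

```python
import json

# The same stringified JSON rows as in the CTE above.
raw_rows = [
    '{"client":"A9310","c_integration":"889625","idntf":false,"nf_p":8.32}',
    '{"client":"VB050","c_integration":"236590","idntf":true,"nf_p":4.36}',
    '{"client":"XT5543","c_integration":"326957","idntf":true,"nf_p":2.33}',
]

def scalar(v):
    # Mimic json_extract_scalar's all-strings output.
    if isinstance(v, bool):
        return "true" if v else "false"
    return str(v)

def extract(row):
    obj = json.loads(row)
    return {k: scalar(obj[k]) for k in ("client", "c_integration", "idntf", "nf_p")}

print([extract(r) for r in raw_rows])
```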

Need Column data to be the ROW header for my query

I am trying to use a LATERAL JOIN on a particular data set; however, I cannot seem to get the syntax correct for the query.
What am I trying to achieve:
Take the first column in the dataset (see picture) and use its values as the table headers, populating the rows with the data from the StringValue column.
Currently it appears like this:
cfname              | stringvalue       |
-----------------------------------------
customerrequesttype | newformsubmission |
Assignmentgroup     | ITDEPT            |
and I would like to have it appear as this:
customerrequesttype | Assignmentgroup |
---------------------------------------
newformsubmission   | ITDEPT          |
As mentioned, I am very new to SQL and know only limited basics.
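One common way to get this rows-to-columns pivot, without a LATERAL join, is conditional aggregation (MAX over CASE expressions). Here is a hedged sketch using an in-memory SQLite table; the table name and the two cfname values are assumptions taken from the example rows:

```python
import sqlite3

# Build a tiny stand-in for the table shown in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customfield (cfname TEXT, stringvalue TEXT)")
conn.executemany(
    "INSERT INTO customfield VALUES (?, ?)",
    [("customerrequesttype", "newformsubmission"),
     ("Assignmentgroup", "ITDEPT")],
)

# Conditional aggregation: each CASE picks out one cfname's value,
# and MAX collapses the per-row NULLs into a single pivoted row.
row = conn.execute("""
    SELECT
      MAX(CASE WHEN cfname = 'customerrequesttype' THEN stringvalue END) AS customerrequesttype,
      MAX(CASE WHEN cfname = 'Assignmentgroup'     THEN stringvalue END) AS Assignmentgroup
    FROM customfield
""").fetchone()
print(row)  # ('newformsubmission', 'ITDEPT')
```

The same pattern works in most SQL dialects; the list of CASE branches has to be written out (or generated) for each cfname you want as a column.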

How to fix: cannot retrieve all fields MongoDB collection with Apache Drill SQL expression query

I'm trying to retrieve all(*) columns from a MongoDB object with Apache Drill expression SQL:
`_id`.`$oid`
Background: I'm using Apache Drill to query MongoDB collections. By default, Drill retrieves the ObjectId values in a different format than the stored in the database. For example:
Mongo: ObjectId("59f2c3eba83a576fe07c735c")
Drill query result: [B#3149a…]
In order to get the data in String format (59f2c3eba83a576fe07c735c) from the object, I changed the Drill config "store.mongo.bson.record.reader" to "false".
ALTER SESSION SET store.mongo.bson.record.reader = false
Drill query result after config set to false:
select * from calc;
+-------------------------------------+--------+
| _id                                 | name   |
+-------------------------------------+--------+
| {"$oid":"5cb0e161f0849231dfe16d99"} | thiago |
+-------------------------------------+--------+
Running a query by _id:
select `o`.`_id`.`$oid` , `o`.`name` from mongo.od_teste.calc o where `o`.`_id`.`$oid`='5cb0e161f0849231dfe16d99';
Result:
+--------------------------+--------+
| EXPR$0                   | name   |
+--------------------------+--------+
| 5cb0e161f0849231dfe16d99 | thiago |
+--------------------------+--------+
For an object with a few columns like the one above (_id, name) it's ok to specify all the columns in the select query by id. However, in my production database, the objects have a "hundred" of columns.
If I try to query all (*) columns from the collection, this is the result:
select `o`.* from mongo.od_teste.calc o where `o`.`_id`.`$oid`='5cb0e161f0849231dfe16d99';
or
select * from mongo.od_teste.calc o where `o`.`_id`.`$oid`='5cb0e161f0849231dfe16d99';
+-----+
| ** |
+-----+
+-----+
No rows selected (6.112 seconds)
Expected result: Retrieve all columns from a MongoDB collection instead of declaring all of them on the SQL query.
I have no suggestions here, because it is a bug in the Mongo storage plugin.
I have created Jira ticket for it, please take a look and feel free to add any related info there: DRILL-7176

In Postgres: Select columns from a set of arrays of columns and check a condition on all of them

I have a table like this:
I want to perform a count on different sets of columns (all subsets where there is at least one element from X and one element from Y). How can I do that in Postgres?
For example, I may have {x1,x2,y3}, {x4,y1,y2,y3}, etc. I want to count the number of "id"s having 1 in each set. So for the first set:
SELECT COUNT(id) FROM table WHERE x1=1 AND x2=1 AND y3=1;
and for the second set, the same:
SELECT COUNT(id) FROM table WHERE x4=1 AND y1=1 AND y2=1 AND y3=1;
Is it possible to write a loop that goes over all these sets and queries the table accordingly? There will be more than 10000 sets, so it cannot be done manually.
You should be able to convert the table columns to an array using ARRAY[col1, col2,...], then use the array_positions function, setting the second parameter to the value you're checking for. So, given your example above, this query:
SELECT id, array_positions(array[x1,x2,x3,x4,y1,y2,y3,y4], 1)
FROM tbl
ORDER BY id;
will yield this result:
+----+-------------------+
| id | array_positions |
+----+-------------------+
| a | {1,4,5} |
| b | {1,2,4,7} |
| c | {1,2,3,4,6,7,8} |
+----+-------------------+
Here's a SQL Fiddle.
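To connect this back to the original counting problem: once each id's 1-positions are an array, each of your sets becomes a containment test (in Postgres, positions @> ARRAY[...]). Here is a plain-Python model of that idea; the row values are reverse-engineered from the result table above, and array_positions is 1-based as in Postgres:

```python
# 1-based positions of val in arr, like Postgres array_positions(arr, val).
def array_positions(arr, val):
    return [i + 1 for i, v in enumerate(arr) if v == val]

rows = {  # id -> (x1, x2, x3, x4, y1, y2, y3, y4), consistent with the result table
    "a": (1, 0, 0, 1, 1, 0, 0, 0),
    "b": (1, 1, 0, 1, 0, 0, 1, 0),
    "c": (1, 1, 1, 1, 0, 1, 1, 1),
}
positions = {k: array_positions(v, 1) for k, v in rows.items()}

# Count ids whose 1-positions contain a required set -- the analogue of
# SELECT count(*) ... WHERE positions @> ARRAY[1,2] in Postgres.
wanted = {1, 2}  # e.g. the set {x1, x2}
count = sum(1 for p in positions.values() if wanted <= set(p))
print(positions, count)
```

Looping over 10000+ sets then means 10000+ containment checks against the precomputed arrays, rather than 10000+ hand-written WHERE clauses.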

Google BigQuery - Parsing string data from a Bigquery table column

I have a table A within a dataset in BigQuery. This table has multiple columns, and one of them, called hits_eventInfo_eventLabel, has values like below:
{ID:AEEMEO,Score:8.990000;ID:SEAMCV,Score:8.990000;ID:HBLION;Property
ID:DNSEAWH,Score:0.391670;ID:CP1853;ID:HI2367;ID:H25600;}
If you write this string out in a tabular form, it contains the following data:
**ID | Score**
AEEMEO | 8.990000
SEAMCV | 8.990000
HBLION | -
DNSEAWH | 0.391670
CP1853 | -
HI2367 | -
H25600 | -
Some IDs have scores, some don't. I have multiple records with similar strings populated under the column hits_eventInfo_eventLabel within the table.
My question is how can I parse this string successfully WITHIN BIGQUERY so that I can get a list of property ids and their respective recommendation scores (if existing)? I would like to have the order in which the IDs appear in the string to be preserved after parsing this data.
Would really appreciate any info on this. Thanks in advance!
I would use a combination of SPLIT (to separate into different rows) and REGEXP_EXTRACT (to separate into different columns), i.e.:
select
  regexp_extract(x, r'ID:([^,]*)') as id,
  regexp_extract(x, r'Score:([\d\.]*)') as score
from (
  select split(x, ';') x from (
    select 'ID:AEEMEO,Score:8.990000;ID:SEAMCV,Score:8.990000;ID:HBLION;Property ID:DNSEAWH,Score:0.391670;ID:CP1853;ID:HI2367;ID:H25600;' as x))
It produces the following result:
Row | id      | score
1   | AEEMEO  | 8.990000
2   | SEAMCV  | 8.990000
3   | HBLION  | null
4   | DNSEAWH | 0.391670
5   | CP1853  | null
6   | HI2367  | null
7   | H25600  | null
You can write your own JavaScript functions in BigQuery to get exactly what you want now: http://googledevelopers.blogspot.com/2015/08/breaking-sql-barrier-google-bigquery.html
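The same split-then-extract idea can be checked outside BigQuery. Here is a plain-Python sketch of the answer's logic; note the order of the IDs in the string is preserved, and entries without a Score come back as None (the nulls above):

```python
import re

# The sample value from the question's hits_eventInfo_eventLabel column.
s = ('ID:AEEMEO,Score:8.990000;ID:SEAMCV,Score:8.990000;ID:HBLION;'
     'Property ID:DNSEAWH,Score:0.391670;ID:CP1853;ID:HI2367;ID:H25600;')

rows = []
for part in s.split(';'):          # SPLIT(x, ';') -> one entry per row
    if not part:                   # skip the empty piece after the trailing ';'
        continue
    id_m = re.search(r'ID:([^,]*)', part)        # REGEXP_EXTRACT for the id
    score_m = re.search(r'Score:([\d.]+)', part)  # REGEXP_EXTRACT for the score
    rows.append((id_m.group(1), score_m.group(1) if score_m else None))

print(rows)
```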