Applying when condition only when column exists in the dataframe

I am using spark-sql 2.4.1 with Java 8. I have a scenario where I need to perform certain operations only if particular columns are present in the given dataframe's column list.
I have a sample dataframe as below; its columns differ depending on the external query executed on the database table.
val data = List(
("20", "score", "school", "2018-03-31", 14 , 12 , 20),
("21", "score", "school", "2018-03-31", 13 , 13 , 21),
("22", "rate", "school", "2018-03-31", 11 , 14, 22),
("21", "rate", "school", "2018-03-31", 13 , 12, 23)
)
val df = data.toDF("id", "code", "entity", "date", "column1", "column2" ,"column3"..."columnN")
As shown above, the columns of the dataframe are not fixed; it may have "column1", "column2", "column3" ... "columnN".
So, depending on which columns are available, I need to perform some operations.
For this I am trying to use a "when" clause: when a column is present, I have to perform a certain operation on that column, otherwise move on to the next operation.
I am trying the two approaches below using the "when" clause.
First-way :
Dataset<Row> resultDs = df.withColumn("column1_avg",
when( df.schema().fieldNames().contains(col("column1")) , avg(col("column1"))))
)
Second-way :
Dataset<Row> resultDs = df.withColumn("column2_sum",
when( df.columns().contains(col("column2")) , sum(col("column1"))))
)
Error:
Cannot invoke contains(Column) on the array type String[]
So how can I handle this scenario in Java 8 code?

You can create a column that holds all the column names, then check whether a given column is present and process it only if it is available:
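import org.apache.spark.sql.functions._

// Keep the dataframe's column names in an array column and guard each derived column on it.
df.withColumn("columns_available", array(df.columns.map(lit): _*))
.withColumn("column1_org",
when( array_contains(col("columns_available"),"column1") , col("column1")))
// "column4" is absent, so "x" is null; the value expression must still reference an existing column.
.withColumn("x",
when( array_contains(col("columns_available"),"column4") , col("column1")))
.withColumn("column2_new",
when( array_contains(col("columns_available"),"column2") , sqrt("column2")))
.show(false)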
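The error in the question arises because the column list holds plain strings, so the membership test has to be made against the name "column1" rather than against a Column object. The sketch below shows that guard in plain Python with no Spark dependency; `plan_operations` and the operation labels are hypothetical names used only for illustration.

```python
# Plain-Python sketch (no Spark): pick the operations to run from the list
# of column names. Every name here is an illustrative assumption.

def plan_operations(available_columns, wanted):
    """Return (output_name, operation, source_column) for columns that exist."""
    available = set(available_columns)  # column names are plain strings
    plan = []
    for source_column, operation in wanted:
        if source_column in available:  # string compared with string
            plan.append((f"{source_column}_{operation}", operation, source_column))
    return plan


columns = ["id", "code", "entity", "date", "column1", "column2", "column3"]
wanted = [("column1", "avg"), ("column2", "sum"), ("column4", "sum")]
plan = plan_operations(columns, wanted)
print(plan)
```

Only the entries returned in `plan` would then be turned into dataframe expressions, so a missing column such as column4 is never referenced.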

Related

Json Arrays of objects PostgreSQL Table format

I have a JSON file (array of objects) which I have to convert into a table format using a PostgreSQL query.
Sample data follows below.
"b", "c", "d" and "e" are to be extracted as separate tables, since each of them is an array of objects.
I have tried using json_populate_recordset(), but it only works if I have a single array, such as:
[{a:"1",b:"2"},{a:"10",b:"20"}]
I have referred to some links and code:
jsonb_array_element example
postgreSQL functions
Sample Data:
{
"b":[
{columnB1:value, columnB2:value},
{columnB1:value, columnB2:value},
],
"c":[
{columnC1:value, columnC2:value, columnC3:value},
{columnC1:value, columnC2:value, columnC3:value},
{columnC1:value, columnC2:value, columnC3:value}
],
"d":[
{columnD1:value, columnD2:value},
{columnD1:value, columnD2:value},
],
"e":[
{columnE1:value, columnE2:value},
]
}
Expected output:
b should be one table in which columnB1 and columnB2 are displayed with their values.
Similarly table c, d, e with their respective columns and values.
You can use jsonb_to_recordset(), but you need to unnest your JSON first. You need to do this inline, as this is a JSON processing function which cannot use derived values.
I am using validated JSON, simplified and formatted, shown at the end of this answer.
To unnest your JSON, use the notation below, which extracts the JSON object field with the given key.
--one level
select '{"a":1}'::json->'a'
result : 1
--two levels
select '{"a":{"b":[2]}}'::json->'a'->'b'
result : [2]
We now expand this to include json_to_recordset()
select * from
json_to_recordset(
'{"a":{"b":[{"f1":2,"f2":4},{"f1":3,"f2":6}]}}'::json->'a'->'b' --inner table b
)
as x("f1" int, "f2" int); --fields from table b
Alternatively, use json_array_elements. Either way we need to list our fields. With the second solution the type will be json, not int, so you can't sum the values, etc.
with b as (select json_array_elements('{"a":{"b":[{"f1":2,"f2":4},{"f1":3,"f2":6}]}}'::json->'a'->'b') as jx)
select jx->'f1' as f1, jx->'f2' as f2 from b;
Output
f1 f2
2 4
3 6
We now use your data structure in jsonb_to_recordset()
select * from jsonb_to_recordset( '{"a":{"b":[{"columnname1b":"value1b","columnname2b":"value2b"},{"columnname1b":"value","columnname2b":"value"}],"c":[{"columnname1":"value","columnname2":"value"},{"columnname1":"value","columnname2":"value"},{"columnname1":"value","columnname2":"value"}]}}'::jsonb->'a'->'b') as x(columnname1b text, columnname2b text);
Output:
columnname1b columnname2b
value1b value2b
value value
For table c
select * from jsonb_to_recordset( '{"a":{"b":[{"columnname1b":"value1b","columnname2b":"value2b"},{"columnname1b":"value","columnname2b":"value"}],"c":[{"columnname1":"value","columnname2":"value"},{"columnname1":"value","columnname2":"value"},{"columnname1":"value","columnname2":"value"}]}}'::jsonb->'a'->'c') as x(columnname1 text, columnname2 text);
Output
columnname1 columnname2
value value
value value
value value
Sample JSON
{
"a": {
"b": [
{
"columnname1b": "value1b",
"columnname2b": "value2b"
},
{
"columnname1b": "value",
"columnname2b": "value"
}
],
"c": [
{
"columnname1": "value",
"columnname2": "value"
},
{
"columnname1": "value",
"columnname2": "value"
},
{
"columnname1": "value",
"columnname2": "value"
}
]
}
}
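The two steps above (walk down to the nested array with ->, then project each object onto a declared field list) can be mirrored outside the database. This is a minimal plain-Python sketch of that logic, not of how PostgreSQL implements it; `to_rows` is a hypothetical helper name.

```python
import json

# Same literal as in the json_to_recordset() example above.
raw = '{"a": {"b": [{"f1": 2, "f2": 4}, {"f1": 3, "f2": 6}]}}'


def to_rows(document, path, fields):
    """Walk `path` into the document, then project each object onto `fields`."""
    node = document
    for key in path:  # equivalent of chaining ->'a'->'b'
        node = node[key]
    # As with the record definition in SQL, the field list is given explicitly.
    return [tuple(obj.get(name) for name in fields) for obj in node]


rows = to_rows(json.loads(raw), ["a", "b"], ["f1", "f2"])
print(rows)
```

Calling `to_rows` once per key ("b", "c", and so on) yields one table per nested array, which matches running one query per array in SQL.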
Well, I came up with some ideas; here is one that worked. I was able to get one table at a time.
https://www.postgresql.org/docs/9.5/functions-json.html
I am using json_populate_recordset.
The column used in the first SELECT statement comes from a table whose column is of a JSON type, which we are trying to extract into a table.
The 'tablename from column' in the json_populate_recordset function is the table we are trying to extract, followed by the alias b with its columns and data types.
WITH input AS(
SELECT cast(column as json) as a
FROM tablename
)
SELECT b.*
FROM input c,
json_populate_recordset(NULL::record,c.a->'tablename from column') as b(columnname1 datatype, columnname2 datatype)

How to filter dataframe by column which contains lists in Presto?

With query:
SELECT
value,
type
FROM dt
I get:
value type
12 [increase, upload]
12 [increase, download]
12 [decrease, delete]
I want to get values which have 'upload' in column type. However this:
SELECT
value,
type
FROM dt
WHERE type LIKE 'upload'
doesn't work. How can I do that?
Assuming type is an ARRAY of varchars you can simply use contains:
WITH dataset (value, type) AS (
VALUES (12, array [ 'increase', 'upload' ]),
(12, array [ 'increase', 'download' ]),
(12, array [ 'decrease', 'delete' ])
)
SELECT value,
type
FROM dataset
WHERE contains(type, 'upload')
Output:
value type
12 [increase, upload]
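LIKE is a string pattern match, while the type column holds an array, which is why the test has to be element membership instead. A small plain-Python sketch of the same filter, using the rows from the question:

```python
# Each row is (value, type), where type is a list of strings.
rows = [
    (12, ["increase", "upload"]),
    (12, ["increase", "download"]),
    (12, ["decrease", "delete"]),
]

# Membership test on the list, the counterpart of contains(type, 'upload').
matching = [(value, types) for value, types in rows if "upload" in types]
print(matching)
```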

How to update a value based on key in json array in postgres?

In a table there is a column data(jsonb)
and json array like this
[
{"pid": "123", "percentage": "10"},
{"pid": "456", "percentage": "50"},
{"pid": "789", "percentage": "40"}
]
I want to update percentage to 30 where pid is 789.
I used this query but did not succeed.
UPDATE table
SET data =
jsonb_set(data, '{pid}'::text[], data->'pid' || '{"percentage":"30"}'::jsonb)
WHERE (data->> 'pid') = '789' and id= '1'; [id is table's primary key]
There is no easy way to do this (except to change your data model to a properly normalized model). You will have to unnest the array and replace the percentage for the pid in question, then aggregate the elements back into an array.
You also can't use ->> with a key name on an array: with a text argument that operator only looks up object fields.
update the_table t
set data = (select jsonb_agg(case d.element ->> 'pid'
when '789' then d.element || '{"percentage": 30}'
else d.element
end)
from jsonb_array_elements(t.data) as d(element))
where id = 1
and data @> '[{"pid": "789"}]'
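The unnest, replace, re-aggregate sequence described in this answer can be sketched in plain Python as follows. `set_percentage` is a hypothetical helper, and the dictionary merge stands in for the jsonb || operator.

```python
import json

data = json.loads(
    '[{"pid": "123", "percentage": "10"},'
    ' {"pid": "456", "percentage": "50"},'
    ' {"pid": "789", "percentage": "40"}]'
)


def set_percentage(elements, pid, percentage):
    """Return a new array in which the element with the given pid is patched."""
    return [
        # Merging overwrites the existing key, like element || '{"percentage": 30}'.
        {**element, "percentage": percentage} if element.get("pid") == pid else element
        for element in elements
    ]


updated = set_percentage(data, "789", 30)
print(json.dumps(updated))
```

The whole array is rebuilt and written back in one piece, which is also what the UPDATE statement does.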

How do I select columns from the below schema?

Read a JSON file and registered a temporary table with the below schema(inferred from JSON file with Native Spark SQL inference).
df = spark.read.json('/path/to/json', multiLine=True)
df.registerTempTable("babynames")
Now I would like to select columns
"sid", "id", "position", "created_at", "created_meta", "updated_at", "updated_meta", "meta", "year", "first_name", "county", "sex", "count"
using Spark SQL select statement.
Here is the data source: https://data.cityofnewyork.us/api/views/25th-nujf/rows.json?accessType=DOWNLOAD
Once the JSON file is at a known location, you can read the columns as shown below, but you need a good understanding of the JSON structure.
Using Spark SQL:
val df = spark.read.option("multiline",true).json("/path/to/json")
df.createOrReplaceTempView("TestTable")
val selectedColumnsDf = spark.sql(""" Select meta.view.columns.id ,meta.view.columns.position, meta.view.createdAt from TestTable """)
Using the DataFrame API, it can be done as below:
val df = spark.read.option("multiline",true).json("/path/to/json")
val selectedColumnsDf = df.select("meta.view.columns.id","meta.view.columns.position","meta.view.createdAt")
I am selecting just three columns to give you an idea; you can add the remaining columns as per your requirement.
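One point worth keeping in mind with paths such as meta.view.columns.id: columns is an array, so selecting a field through it gives one value per array element rather than a single value. The plain-Python sketch below illustrates that path behaviour on a tiny made-up document; the sample values and the `select_path` helper are assumptions, not taken from the real data source.

```python
import json

# Tiny stand-in for the real file; values are invented for illustration.
doc = json.loads("""
{"meta": {"view": {"createdAt": 1, "columns": [
    {"id": -1, "name": "sid", "position": 0},
    {"id": 10, "name": "year", "position": 1}
]}}}
""")


def select_path(node, path):
    """Follow a dotted path; stepping through a list maps over its elements."""
    for key in path.split("."):
        if isinstance(node, list):
            node = [item[key] for item in node]
        else:
            node = node[key]
    return node


ids = select_path(doc, "meta.view.columns.id")
created_at = select_path(doc, "meta.view.createdAt")
print(ids, created_at)
```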

PostgreSQL: exclude complete jsonb array if one element fails the WHERE clause

Assume a table json_table with columns id (int), data (jsonb).
A sample jsonb value would be
{"a": [{"b":{"c": "xxx", "d": 1}},{"b":{"c": "xxx", "d": 2}}]}
When I use an SQL statement like the following:
SELECT data FROM json_table j, jsonb_array_elements(j.data#>'{a}') dt WHERE (dt#>>'{b,d}')::integer NOT IN (2,4,6,9) GROUP BY id;
... the two array elements are unnested and the one that satisfies the WHERE clause is still returned. This makes sense, since each array element is considered individually. In this example I will get back the complete row
{"a": [{"b":{"c": "xxx", "d": 1}},{"b":{"c": "xxx", "d": 2}}]}
I'm looking for a way to exclude the complete json_table row when any jsonb array element fails the condition.
You can move the condition to the WHERE clause and use NOT EXISTS:
SELECT data
FROM json_table j
WHERE NOT EXISTS (SELECT 1
FROM jsonb_array_elements(j.data#>'{a}') dt
WHERE (dt#>>'{b,d}')::integer IN (2, 4, 6, 9)
);
With PostgreSQL 12 or later, you can also achieve it with a jsonpath query:
select data
from json_table
where jsonb_path_match(data, '!exists($.a[*].b.d ? ( @ == 2 || @ == 4 || @ == 6 || @ == 9))')
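Both answers express the same rule: keep a row only if no element of the array has a forbidden value, instead of filtering elements one by one. A plain-Python sketch of that rule follows; the second row is made up for contrast, and `keep` is a hypothetical helper name.

```python
import json

rows = [
    # Row from the question: one element has d = 2, so the whole row is dropped.
    (1, json.loads('{"a": [{"b": {"c": "xxx", "d": 1}}, {"b": {"c": "xxx", "d": 2}}]}')),
    # Invented row with no forbidden value, so it is kept.
    (2, json.loads('{"a": [{"b": {"c": "xxx", "d": 1}}, {"b": {"c": "xxx", "d": 3}}]}')),
]

excluded = {2, 4, 6, 9}


def keep(data):
    """True when no array element carries an excluded d value (NOT EXISTS)."""
    return not any(element["b"]["d"] in excluded for element in data["a"])


kept_ids = [row_id for row_id, data in rows if keep(data)]
print(kept_ids)
```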