how to update nested json_data in delta table pyspark/hive

how to update nested json_data in delta table pyspark/hive - apache-spark-sql

(Databricks)
%sql
select * from df2
jsonData
--------
{"col1":"AA","col2":"BB","col3":"CC","col4":"DD"}
I want to update col4 value as TT like "col4":"TT"
I tried with below code:
update df2 set jsonData = JSON_SET(jsonData '$.col4', 'TT')
and
update df2 set jsonData = JSON_MODIFY(jsonData '$.col4', 'TT')
Getting below error:
Error in SQL statement: AnalysisException: Undefined function: 'JSON_MODIFY'.
This function is neither a registered temporary function nor a permanent function
registered in the database 'default'.

Use from_json function to flatten out json into columns then update col4 finally recreate json object using to_json function.
Example:
df.show(10,False)
#+-------------------------------------------------+
#|jsonData |
#+-------------------------------------------------+
#|{"col1":"AA","col2":"BB","col3":"CC","col4":"DD"}|
#+-------------------------------------------------+
from pyspark.sql.functions import *
df.selectExpr("from_json(jsonData,'col1 string,col2 string,col3 string,col4 string') as jsn_str").\
select("jsn_str.*").\
withColumn("col4",lit("TT")).\
withColumn("jsonData",to_json(struct(col("col1"),col("col2"),col("col3"),col("col4")))).\
select("jsonData").\
show(10,False)
#+-------------------------------------------------+
#|jsonData |
#+-------------------------------------------------+
#|{"col1":"AA","col2":"BB","col3":"CC","col4":"TT"}|
#+-------------------------------------------------+

Related

Postgres query with pandas - add explicit type casts

I am trying to query a postgres database via pandas connection, passing the query as a string, like so:
import pandas.io.sql as psql
from sqlalchemy import create_engine
engine = create_engine('postgresql://username:password#localhost:5432/FAPESP-Covid19')
PacientesStat = psql.read_sql("SELECT CD_Municipio, Min(2021-aa_nasc), Max(2021-aa_nasc), Count(*) Tot\
FROM Pacientes\
GROUP BY 1\
ORDER BY 1 NULLS FIRST;", engine)
But I get the error:
LINE 1: SELECT CD_Municipio, Min(2021-aa_nasc), Max(2021-aa_nasc), C...
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
I see that aa_nasc column is set as CHAR.
Ho do I fix the types here?

The following worked:
UPDATE pacientes SET aa_nasc = NULL WHERE aa_nasc = 'AAAA' OR aa_nasc = 'YYYY';
ALTER TABLE pacientes ALTER COLUMN aa_nasc TYPE INT USING aa_nasc::integer;

How to use widgets to pass dynamic column names in Dataframe select statement

I have a Dataframe and I want to dynamically pass the columns names through widgets in a select statement in my Databricks Notebook. How can I do it?
I am using the below code
df1 = spark.sql("select * from tableraw")
where df1 has columns "tablename" and "layer"
df = df1.select("tablename", "layer")
Now, our requirement is to use the values of the widgets to select those columns, something like:
df = df1.select(dbutils.widget.get("tablename"), dbutils.widget.get("datalayer"))

Python / Scala
Create Widgets
%python
dbutils.widgets.text(name = "pythonTextWidget", defaultValue = "columnName")
dbutils.widgets.dropdown(name = "pythonDropdownWidget", defaultValue = "col1", choices = ["col1", "col2", "col3"])
%scala
dbutils.widgets.text("scalaTextWidget", "columnName")
dbutils.widgets.dropdown("scalaDropdownWidget", "col1", Seq("col1", "col2", "col3"))
Extract Value from Widgets
%python
textColumn = dbutils.widgets.get("pythonTextWidget")
dropdownColumn = dbutils.widgets.get("pythonDropdownWidget")
%scala
val textColumn = dbutils.widgets.get("scalaTextWidget")
val dropdownColumn = dbutils.widgets.get("scalaDropdownWidget")
Use Value to select Column
%python
from pyspark.sql.functions import col
df.select(col(textColumn), col(dropdownColumn))
%scala
import org.apache.spark.sql.functions.col
df.select(col(textColumn), col(dropdownColumn))
SQL
The Widgets in SQL work slightly different compared to Python/Scala in the sense that you cannot use them to select a column. However, widgets can be used to dynamically adjust filters.
Create Widget
%sql CREATE WIDGET text sqlTextWidget DEFAULT "ACTIVE"
%sql CREATE WIDGET DROPDOWN sqlDropdownWidget DEFAULT "ACTIVE" CHOICES SELECT DISTINCT Status FROM <databaseName>.<tableName> WHERE Status IS NOT NULL
Apply Widget Value to Filter Statement
%sql SELECT * FROM <databaseName>.<tableName> WHERE Status = getArgument("sqlTextWidget")
More background can be found on the Databricks documentation on Widgets.

Right way to implement pandas.read_sql with ClickHouse

Trying to implement pandas.read_sql function.
I created a clickhouse table and filled it:
create table regions
(
date DateTime Default now(),
region String
)
engine = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY tuple()
SETTINGS index_granularity = 8192;
insert into regions (region) values ('Asia'), ('Europe')
Then python code:
import pandas as pd
from sqlalchemy import create_engine
uri = 'clickhouse://default:#localhost/default'
engine = create_engine(uri)
query = 'select * from regions'
pd.read_sql(query, engine)
As the result I expected to get a dataframe with columns date and region but all I get is empty dataframe:
Empty DataFrame
Columns: [2021-01-08 09:24:33, Asia]
Index: []
UPD. It occured that defining clickhouse+native solves the problem.
Can it be solved without +native?

There is encient issue https://github.com/xzkostyan/clickhouse-sqlalchemy/issues/10. Also there is a hint which assumes to add FORMAT TabSeparatedWithNamesAndTypes at the end of a query. So the init query will be look like this:
select *
from regions
FORMAT TabSeparatedWithNamesAndTypes

Cannot have map type columns in DataFrame which calls set operations

: org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column map_col is map
I have a hive table with a column of type - MAP<Float, Float>. I get the above error when I try to do an insertion on this table in a spark context. Insertion works fine without the 'distinct'.
create table test_insert2(`test_col` string, `map_col` MAP<INT,INT>)
location 's3://mybucket/test_insert2';
insert into test_insert2
select distinct 'a' as test_col, map(0,0) as map_col

Try to convert dataframe to .rdd then apply .distinct function.
Example:
spark.sql("select 'a'test_col,map(0,0)map_col
union all
select 'a'test_col,map(0,0)map_col").rdd.distinct.collect
Result:
Array[org.apache.spark.sql.Row] = Array([a,Map(0 -> 0)])

Listagg Alternative in Spark SQL [duplicate]

I have a table of two string type columns (username, friend) and for each username, I want to collect all of its friends on one row, concatenated as strings. For example: ('username1', 'friends1, friends2, friends3')
I know MySQL does this with GROUP_CONCAT. Is there any way to do this with Spark SQL?

Before you proceed: This operations is yet another another groupByKey. While it has multiple legitimate applications it is relatively expensive so be sure to use it only when required.
Not exactly concise or efficient solution but you can use UserDefinedAggregateFunction introduced in Spark 1.5.0:
object GroupConcat extends UserDefinedAggregateFunction {
def inputSchema = new StructType().add("x", StringType)
def bufferSchema = new StructType().add("buff", ArrayType(StringType))
def dataType = StringType
def deterministic = true
def initialize(buffer: MutableAggregationBuffer) = {
buffer.update(0, ArrayBuffer.empty[String])
}
def update(buffer: MutableAggregationBuffer, input: Row) = {
if (!input.isNullAt(0))
buffer.update(0, buffer.getSeq[String](0) :+ input.getString(0))
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
buffer1.update(0, buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0))
}
def evaluate(buffer: Row) = UTF8String.fromString(
buffer.getSeq[String](0).mkString(","))
}
Example usage:
val df = sc.parallelize(Seq(
("username1", "friend1"),
("username1", "friend2"),
("username2", "friend1"),
("username2", "friend3")
)).toDF("username", "friend")
df.groupBy($"username").agg(GroupConcat($"friend")).show
## +---------+---------------+
## | username| friends|
## +---------+---------------+
## |username1|friend1,friend2|
## |username2|friend1,friend3|
## +---------+---------------+
You can also create a Python wrapper as shown in Spark: How to map Python with Scala or Java User Defined Functions?
In practice it can be faster to extract RDD, groupByKey, mkString and rebuild DataFrame.
You can get a similar effect by combining collect_list function (Spark >= 1.6.0) with concat_ws:
import org.apache.spark.sql.functions.{collect_list, udf, lit}
df.groupBy($"username")
.agg(concat_ws(",", collect_list($"friend")).alias("friends"))

You can try the collect_list function
sqlContext.sql("select A, collect_list(B), collect_list(C) from Table1 group by A
Or you can regieter a UDF something like
sqlContext.udf.register("myzip",(a:Long,b:Long)=>(a+","+b))
and you can use this function in the query
sqlConttext.sql("select A,collect_list(myzip(B,C)) from tbl group by A")

In Spark 2.4+ this has become simpler with the help of collect_list() and array_join().
Here's a demonstration in PySpark, though the code should be very similar for Scala too:
from pyspark.sql.functions import array_join, collect_list
friends = spark.createDataFrame(
[
('jacques', 'nicolas'),
('jacques', 'georges'),
('jacques', 'francois'),
('bob', 'amelie'),
('bob', 'zoe'),
],
schema=['username', 'friend'],
)
(
friends
.orderBy('friend', ascending=False)
.groupBy('username')
.agg(
array_join(
collect_list('friend'),
delimiter=', ',
).alias('friends')
)
.show(truncate=False)
)
In Spark SQL the solution is likewise:
SELECT
username,
array_join(collect_list(friend), ', ') AS friends
FROM friends
GROUP BY username;
The output:
+--------+--------------------------+
|username|friends |
+--------+--------------------------+
|jacques |nicolas, georges, francois|
|bob |zoe, amelie |
+--------+--------------------------+
This is similar to MySQL's GROUP_CONCAT() and Redshift's LISTAGG().

Here is a function you can use in PySpark:
import pyspark.sql.functions as F
def group_concat(col, distinct=False, sep=','):
if distinct:
collect = F.collect_set(col.cast(StringType()))
else:
collect = F.collect_list(col.cast(StringType()))
return F.concat_ws(sep, collect)
table.groupby('username').agg(F.group_concat('friends').alias('friends'))
In SQL:
select username, concat_ws(',', collect_list(friends)) as friends
from table
group by username

-- the spark SQL resolution with collect_set
SELECT id, concat_ws(', ', sort_array( collect_set(colors))) as csv_colors
FROM (
VALUES ('A', 'green'),('A','yellow'),('B', 'blue'),('B','green')
) as T (id, colors)
GROUP BY id

One way to do it with pyspark < 1.6, which unfortunately doesn't support user-defined aggregate function:
byUsername = df.rdd.reduceByKey(lambda x, y: x + ", " + y)
and if you want to make it a dataframe again:
sqlContext.createDataFrame(byUsername, ["username", "friends"])
As of 1.6, you can use collect_list and then join the created list:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
join_ = F.udf(lambda x: ", ".join(x), StringType())
df.groupBy("username").agg(join_(F.collect_list("friend").alias("friends"))

Language: Scala
Spark version: 1.5.2
I had the same issue and also tried to resolve it using udfs but, unfortunately, this has led to more problems later in the code due to type inconsistencies. I was able to work my way around this by first converting the DF to an RDD then grouping by and manipulating the data in the desired way and then converting the RDD back to a DF as follows:
val df = sc
.parallelize(Seq(
("username1", "friend1"),
("username1", "friend2"),
("username2", "friend1"),
("username2", "friend3")))
.toDF("username", "friend")
+---------+-------+
| username| friend|
+---------+-------+
|username1|friend1|
|username1|friend2|
|username2|friend1|
|username2|friend3|
+---------+-------+
val dfGRPD = df.map(Row => (Row(0), Row(1)))
.groupByKey()
.map{ case(username:String, groupOfFriends:Iterable[String]) => (username, groupOfFriends.mkString(","))}
.toDF("username", "groupOfFriends")
+---------+---------------+
| username| groupOfFriends|
+---------+---------------+
|username1|friend2,friend1|
|username2|friend3,friend1|
+---------+---------------+

Below python-based code that achieves group_concat functionality.
Input Data:
Cust_No,Cust_Cars
1, Toyota
2, BMW
1, Audi
2, Hyundai
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
import pyspark.sql.functions as F
spark = SparkSession.builder.master('yarn').getOrCreate()
# Udf to join all list elements with "|"
def combine_cars(car_list,sep='|'):
collect = sep.join(car_list)
return collect
test_udf = udf(combine_cars,StringType())
car_list_per_customer.groupBy("Cust_No").agg(F.collect_list("Cust_Cars").alias("car_list")).select("Cust_No",test_udf("car_list").alias("Final_List")).show(20,False)
Output Data:
Cust_No, Final_List
1, Toyota|Audi
2, BMW|Hyundai

You can also use Spark SQL function collect_list and after you will need to cast to string and use the function regexp_replace to replace the special characters.
regexp_replace(regexp_replace(regexp_replace(cast(collect_list((column)) as string), ' ', ''), ',', '|'), '[^A-Z0-9|]', '')
it's an easier way.

Higher order function concat_ws() and collect_list() can be a good alternative along with groupBy()
import pyspark.sql.functions as F
df_grp = df.groupby("agg_col").agg(F.concat_ws("#;", F.collect_list(df.time)).alias("time"), F.concat_ws("#;", F.collect_list(df.status)).alias("status"), F.concat_ws("#;", F.collect_list(df.llamaType)).alias("llamaType"))
Sample Output
+-------+------------------+----------------+---------------------+
|agg_col|time |status |llamaType |
+-------+------------------+----------------+---------------------+
|1 |5-1-2020#;6-2-2020|Running#;Sitting|red llama#;blue llama|
+-------+------------------+----------------+---------------------+

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

how to update nested json_data in delta table pyspark/hive - apache-spark-sql

Related

Postgres query with pandas - add explicit type casts

How to use widgets to pass dynamic column names in Dataframe select statement

Right way to implement pandas.read_sql with ClickHouse

Cannot have map type columns in DataFrame which calls set operations

Listagg Alternative in Spark SQL [duplicate]

Categories

Resources