Pyspark: create new column by splitting text - apache-spark-sql

I have a pyspark dataframe like this:
df = spark.createDataFrame(
    [
        (1, '1234ESPNnonzodiac'),
        (2, '1234ESPNzodiac'),
        (3, '963CNNnonzodiac'),
        (4, '963CNNzodiac'),
    ],
    ['id', 'col1']
)
I would like to create a new column where I split col1 on the words zodiac or nonzodiac, so that I can eventually groupby this new column.
I would like the final output to be like this:
spark.createDataFrame(
    [
        (1, '1234ESPNnonzodiac', '1234ESPN'),
        (2, '1234ESPNzodiac', '1234ESPN'),
        (3, '963CNNnonzodiac', '963CNN'),
        (4, '963CNNzodiac', '963CNN'),
    ],
    ['id', 'col1', 'col2']
)

I would use regexp_extract from pyspark.sql.functions:
from pyspark.sql.functions import regexp_extract

df.withColumn("col2", regexp_extract(df.col1, r"([\s\S]+?)(?:non)?zodiac", 1)).show()
+---+-----------------+--------+
| id| col1| col2|
+---+-----------------+--------+
| 1|1234ESPNnonzodiac|1234ESPN|
| 2| 1234ESPNzodiac|1234ESPN|
| 3| 963CNNnonzodiac| 963CNN|
| 4| 963CNNzodiac| 963CNN|
+---+-----------------+--------+
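Since the stated goal is to group by the new column, here is a minimal follow-up sketch (assuming the df and import above):
from pyspark.sql.functions import regexp_extract

# Extract the prefix, then aggregate on it
df.withColumn("col2", regexp_extract(df.col1, r"([\s\S]+?)(?:non)?zodiac", 1)) \
  .groupBy("col2") \
  .count() \
  .show()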

Related

Querying struct within array - Databricks SQL

I am using Databricks SQL to query a dataset that has a column formatted as an array, and each item in the array is a struct with 3 named fields.
I have the following table:
id | array
---+----------------------------------------------------------------------------------------------------------
 1 | [{"firstName":"John","lastName":"Smith","age":"10"},{"firstName":"Jane","lastName":"Smith","age":"12"}]
 2 | [{"firstName":"Bob","lastName":"Miller","age":"13"},{"firstName":"Betty","lastName":"Miller","age":"11"}]
In a different SQL editor, I was able to achieve this by doing the following:
SELECT
id,
struct.firstName
FROM
table
CROSS JOIN UNNEST(array) as t(struct)
With a resulting table of:
id | firstName
---+----------
 1 | John
 1 | Jane
 2 | Bob
 2 | Betty
Unfortunately, this syntax does not work in the Databricks SQL editor, and I get the following error.
[UNRESOLVED_COLUMN] A column or function parameter with name `array` cannot be resolved.
I feel like there is an easy way to query this, but my search on Stack Overflow and Google has come up empty so far.
1. SQL API
The first solution uses the SQL API. The first code snippet prepares the test case, so you can ignore it if you already have it in place.
from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType, StringType

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('people', ArrayType(StructType([
        StructField('firstName', StringType(), True),
        StructField('lastName', StringType(), True),
        StructField('age', StringType(), True)
    ])), True)
])

sql_df = spark.createDataFrame([
    (1, [{"firstName": "John", "lastName": "Smith", "age": "10"}, {"firstName": "Jane", "lastName": "Smith", "age": "12"}]),
    (2, [{"firstName": "Bob", "lastName": "Miller", "age": "13"}, {"firstName": "Betty", "lastName": "Miller", "age": "11"}])
], schema)
sql_df.createOrReplaceTempView("sql_df")
What you need is the LATERAL VIEW clause (docs), which lets you explode the nested structure, like this:
SELECT id, exploded.firstName
FROM sql_df
LATERAL VIEW EXPLODE(sql_df.people) sql_df AS exploded;
+---+---------+
| id|firstName|
+---+---------+
| 1| John|
| 1| Jane|
| 2| Bob|
| 2| Betty|
+---+---------+
2. DataFrame API
The alternative approach is to use the explode method (docs), which gives you the same result:
from pyspark.sql.functions import explode, col
sql_df.select("id", explode(col("people.firstName"))).show()
+---+-----+
| id| col|
+---+-----+
| 1| John|
| 1| Jane|
| 2| Bob|
| 2|Betty|
+---+-----+
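Note that the exploded column is named col by default. As a small optional tweak (a sketch, not part of the original answer), you can alias it to keep the firstName name:
from pyspark.sql.functions import explode, col

# Same query, but rename the exploded column
sql_df.select("id", explode(col("people.firstName")).alias("firstName")).show()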

check first dataframe value startswith any of the second dataframe value

I have two pyspark dataframes as follows:
df1 = spark.createDataFrame(
    ["yes", "no", "yes23", "no3", "35yes", """41no["maybe"]"""],
    "string"
).toDF("location")

df2 = spark.createDataFrame(
    ["yes", "no"],
    "string"
).toDF("location")
I want to check whether the values in the location column of df1 start with the values in the location column of df2, and vice versa.
Something like:
df1.select("location").startsWith(df2.location)
Following is the output I am expecting:
+-------------+
| location|
+-------------+
| yes|
| no|
| yes23|
| no3|
+-------------+
Using Spark SQL looks the easiest to me:
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')

joined = spark.sql("""
    select df1.*
    from df1
    join df2
      on df1.location rlike '^' || df2.location
""")

PySpark - how to update Dataframe by using join?

I have a dataframe a:
id,value
1,11
2,22
3,33
And another dataframe b:
id,value
1,123
3,345
I want to update dataframe a with all matching values from b (based on column 'id').
Final dataframe 'c' would be:
id,value
1,123
2,22
3,345
How can I achieve that using dataframe joins (or another approach)?
Tried:
a.join(b, a.id == b.id, "inner").drop(a.value)
Gives (not desired output):
+---+---+-----+
| id| id|value|
+---+---+-----+
| 1| 1| 123|
| 3| 3| 345|
+---+---+-----+
Thanks.
I don't think there is update functionality, but this should work:
import pyspark.sql.functions as F

# Left join keeps every row of df1; take df2's value when it exists, otherwise keep df1's
df1.join(df2, df1.id == df2.id, "left_outer") \
   .select(df1.id,
           F.when(df2.value.isNull(), df1.value).otherwise(df2.value).alias("value"))
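An equivalent sketch (not from the original answer) uses coalesce and joins on the column name so the id column is not duplicated; here a and b are the dataframes from the question:
import pyspark.sql.functions as F

c = (
    a.join(b.withColumnRenamed("value", "value_b"), on="id", how="left")
     .select("id", F.coalesce(F.col("value_b"), F.col("value")).alias("value"))
)
c.show()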

How to aggregate data into ranges (bucketize)?

I have a table like
+---+-----+
| id|value|
+---+-----+
|  1|118.0|
|  2|109.0|
|  3|113.0|
|  4| 82.0|
|  5| 60.0|
|  6|111.0|
|  7|107.0|
|  8| 84.0|
|  9| 91.0|
| 10|118.0|
+---+-----+
and I would like to aggregate or bin the values into the ranges 0, 10, 20, 30, 40, ..., 80, 90, 100, 110, 120. How can I perform this in SQL or, more specifically, Spark SQL?
Currently I have a lateral view join with the range but this seems rather clumsy / inefficient.
Quantile discretization is not really what I want; rather, a CUT with this range.
Edit: https://github.com/collectivemedia/spark-ext/blob/master/sparkext-mllib/src/main/scala/org/apache/spark/ml/feature/Binning.scala would perform dynamic binning, but I need this specific range instead.
In the general case, static binning can be performed using org.apache.spark.ml.feature.Bucketizer:
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.functions.count

val df = Seq(
  (1, 118.0), (2, 109.0), (3, 113.0), (4, 82.0), (5, 60.0),
  (6, 111.0), (7, 107.0), (8, 84.0), (9, 91.0), (10, 118.0)
).toDF("id", "value")

val splits = (0 to 12).map(_ * 10.0).toArray

val bucketizer = new Bucketizer()
  .setInputCol("value")
  .setOutputCol("bucket")
  .setSplits(splits)

val bucketed = bucketizer.transform(df)
val solution = bucketed.groupBy($"bucket").agg(count($"id") as "count")
Result:
scala> solution.show
+------+-----+
|bucket|count|
+------+-----+
| 8.0| 2|
| 11.0| 4|
| 10.0| 2|
| 6.0| 1|
| 9.0| 1|
+------+-----+
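For PySpark users, roughly the same approach can be written as follows (a sketch, assuming df is the equivalent PySpark dataframe and pyspark.ml is available):
from pyspark.ml.feature import Bucketizer
from pyspark.sql import functions as F

splits = [i * 10.0 for i in range(13)]  # 0.0, 10.0, ..., 120.0
bucketizer = Bucketizer(splits=splits, inputCol="value", outputCol="bucket")
bucketed = bucketizer.transform(df)
solution = bucketed.groupBy("bucket").agg(F.count("id").alias("count"))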
Bucketizer throws an error when values lie outside the defined bins. You can define the outermost split points as Double.NegativeInfinity and Double.PositiveInfinity to capture outliers.
Bucketizer is designed to work efficiently with arbitrary splits by performing a binary search for the right bucket. In the case of regular bins like yours, one can simply do something like:
val binned = df.withColumn("bucket", (($"value" - bin_min) / bin_width) cast "int")
where bin_min and bin_width are the lower bound of the smallest bin and the bin width, respectively.
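The same arithmetic binning in PySpark would look roughly like this (a sketch; bin_min and bin_width are values you supply):
from pyspark.sql import functions as F

# e.g. bin_min = 0.0, bin_width = 10.0 for the ranges in the question
binned = df.withColumn("bucket", ((F.col("value") - bin_min) / bin_width).cast("int"))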
Try "GROUP BY" with this
SELECT id, (value DIV 10)*10 FROM table_name ;
The following would be the Dataset API equivalent in Scala:
df.select(('value divide 10).cast("int") * 10)

SparkSQL : SQL.DataFrame.Aggregate function on multiple columns with different operations

Is there any way to apply an aggregate function that joins or appends the lists of tuples in a dataframe's columns when doing a group by?
My data frame looks like this:
+--------+-----+-------------+----------------+
|WindowID|State| City| Details|
+--------+-----+-------------+----------------+
| 1| IA| Ames| [(524292, 2)]|
| 6| PA| Bala Cynwyd| [(6, 48)]|
| 7| AL| Birmingham| [(1048584, 6)]|
| 1| FL| Orlando| [(18, 27)]|
| 7| TN| Nashville| [(1048608, 9)]|
+--------+-----+-------------+----------------+
My goal is to group rows that have the same value in 'WindowID' and merge the contents of the 'State' and 'City' columns into lists of strings, and the contents of the 'Details' column into a list of tuples.
Result must look like this:
+--------+---------+------------------------+-----------------------------+
|WindowID| State| City| Details|
+--------+---------+------------------------+-----------------------------+
| 1| [IA, FL]| [Ames, Orlando]| [(524292, 2), (18, 27)]|
| 6| [PA]| [Bala Cynwyd]| [(6, 48)]|
| 7| [AL, TN]| [Birmingham, Nashville]| [(1048584, 6), (1048608, 9)]|
+--------+---------+------------------------+-----------------------------+
My code is:
sqlc = SQLContext(sc)
df = sqlc.createDataFrame(rdd, ['WindowID', 'State', 'City', 'Details'])
df1 = df.groupBy('WindowID').agg(...)  # here I want to do the merge operation
How can I do this using a Spark SQL dataframe in Python?
Creating the data for the input dataframe:
data = [(1, 'IA', 'Ames', (524292, 2)),
        (6, 'PA', 'Bala Cynwyd', (6, 48)),
        (7, 'AL', 'Birmingham', (1048584, 6)),
        (1, 'FL', 'Orlando', (18, 27)),
        (7, 'TN', 'Nashville', (1048608, 9))]

table = sqlContext.createDataFrame(data, ['WindowId', 'State', 'City', 'Details'])
table.show()
+--------+-----+-----------+-----------+
|WindowId|State| City| Details|
+--------+-----+-----------+-----------+
| 1| IA| Ames| [524292,2]|
| 6| PA|Bala Cynwyd| [6,48]|
| 7| AL| Birmingham|[1048584,6]|
| 1| FL| Orlando| [18,27]|
| 7| TN| Nashville|[1048608,9]|
+--------+-----+-----------+-----------+
Using the collect_list aggregate function:
from pyspark.sql.functions import collect_list

table.groupby('WindowId').agg(collect_list('State').alias('State'),
                              collect_list('City').alias('City'),
                              collect_list('Details').alias('Details')).show()
+--------+--------+--------------------+--------------------+
|WindowId| State| City| Details|
+--------+--------+--------------------+--------------------+
| 1|[FL, IA]| [Orlando, Ames]|[[18,27], [524292...|
| 6| [PA]| [Bala Cynwyd]| [[6,48]]|
| 7|[AL, TN]|[Birmingham, Nash...|[[1048584,6], [10...|
+--------+--------+--------------------+--------------------+
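From there, fields of the collected structs can still be addressed by name. A small follow-up sketch (tuples created this way get the default struct field names _1 and _2):
from pyspark.sql.functions import collect_list, col

grouped = table.groupby('WindowId').agg(collect_list('Details').alias('Details'))
# Pull the first tuple element out of every collected struct
grouped.select('WindowId', col('Details._1').alias('first_elems')).show()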