Split a column into multiple columns using Spark SQL

I have a column col1 that represents a GPS coordinate format:
25 4.1866N 55 8.3824E
I would like to split it into multiple columns, using whitespace as the separator, as in the example output for table_example below:
| 1st_split | 2nd_split | 3rd_split | 4th_split |
|:-----------|------------:|:------------:|:------------:|
| 25 | 4.1866N | 55 | 8.3824E |
Since there is a split() function, I tried this:
SELECT explode(split(`col1`, ' ')) AS `col` FROM table_example;
But instead of splitting into multiple columns, it splits into multiple rows, like in the output below:
Can someone clarify what the right approach would be to get the expected result?

If you have a dataframe as
+---------------------+
|col |
+---------------------+
|25 4.1866N 55 8.3824E|
+---------------------+
Using Scala API
You can simply use the built-in split function and select the array elements you need:
import org.apache.spark.sql.functions._
df.withColumn("split", split(col("col"), " "))
  .select(
    col("split")(0).as("1st_split"),
    col("split")(1).as("2nd_split"),
    col("split")(2).as("3rd_split"),
    col("split")(3).as("4th_split"))
  .show(false)
which would give you
+---------+---------+---------+---------+
|1st_split|2nd_split|3rd_split|4th_split|
+---------+---------+---------+---------+
|25 |4.1866N |55 |8.3824E |
+---------+---------+---------+---------+
Using SQL way
The SQL version is just as easy and closely mirrors the API approach:
df.createOrReplaceTempView("table_example")
val splitted = sqlContext.sql("SELECT split(`col`, ' ') AS `col` FROM table_example")
splitted.createOrReplaceTempView("splitted_table")
val result = sqlContext.sql("SELECT `col`[0] AS `1st_split`, `col`[1] AS `2nd_split`, `col`[2] AS `3rd_split`, `col`[3] AS `4th_split` FROM splitted_table")
result.show(false)
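To make the difference between explode and array indexing concrete, here is a minimal plain-Python sketch. It is only an illustration and does not use Spark; the variable names are assumptions made for this example.

```python
# Plain-Python illustration only; Spark is not involved here.
value = "25 4.1866N 55 8.3824E"

# split(col, ' ') yields one array per input row.
parts = value.split(" ")

# explode(...) emits one output row per array element.
exploded_rows = [(p,) for p in parts]

# Indexing the array keeps a single row and yields one column per element.
names = ["1st_split", "2nd_split", "3rd_split", "4th_split"]
wide_row = dict(zip(names, parts))

print(exploded_rows)
print(wide_row)
```

The exploded form has four rows of one value each, while the indexed form has one row with four named values, which is the shape asked for in the question.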
I hope the answer is helpful

Related

Scala convert Array to DataFrame Column

I am trying to add an Array of values as a new column to the DataFrame.
Ex:
Let's assume there is an Array(4,5,10) and a dataframe
+----------+-----+
| name | age |
+----------+-----+
| John | 32 |
| Elizabeth| 28 |
| Eric | 41 |
+----------+-----+
My requirement is to add the above array as a new column to the dataframe. My expected output is as follows:
+----------+-----+------+
| name | age | rank |
+----------+-----+------+
| John | 32 | 4 |
| Elizabeth| 28 | 5 |
| Eric | 41 | 10 |
+----------+-----+------+
I am trying to see whether I can achieve this using rdd and zipWithIndex.
df.rdd.zipWithIndex.map(_.swap).join(array_rdd.zipWithIndex.map(_.swap))
This is resulting in something of this sort.
(0,([John, 32],4))
I want to convert the above RDD back to required dataframe. Let me know how to achieve this.
Are there any alternatives available for achieving the desired result other than using rdd and zipWithIndex? What is the best way to do it?
PS:
Context for better understanding:
I am using the Xpress optimization suite to solve a mathematical problem. Xpress takes its inputs as Arrays and also outputs the result as an Array. I get the input as a DataFrame, extract the columns as Arrays (using collect), and pass them to Xpress. Xpress outputs an Array[Double] as the solution. I want to add this solution back to the dataframe as a column. Every value in the solution array corresponds to the dataframe row at the same index, i.e. the value at index 'n' of the output Array corresponds to the 'n'th row of the dataframe.
After the join just map the results to what you are looking for.
You can convert this back to a dataframe after joining the RDDs.
val originalDF = Seq(("John", 32), ("Elizabeth", 28), ("Eric", 41)).toDF("name", "age")
val rank = Array(4, 5, 10)
// convert to Seq first
val rankDF = rank.toSeq.toDF("rank")
val joined = originalDF.rdd.zipWithIndex.map(_.swap).join(rankDF.rdd.zipWithIndex.map(_.swap))
val finalRDD = joined.map{ case (k,v) => (k, v._1.getString(0), v._1.getInt(1), v._2.getInt(0)) }
val finalDF = finalRDD.toDF("id", "name", "age", "rank")
finalDF.show()
/*
+---+---------+---+----+
| id| name|age|rank|
+---+---------+---+----+
| 0| John| 32| 4|
| 1|Elizabeth| 28| 5|
| 2| Eric| 41| 10|
+---+---------+---+----+
*/
The only alternative that I can think of is to use the org.apache.spark.sql.functions.row_number() window function. This essentially achieves the same thing by adding an increasing, consecutive row number to the dataframe.
The drawback is that a large amount of data is shuffled into one partition, since the row numbers must be unique across all rows in the dataframe. If your data is very large this can lead to an out-of-memory issue. (Note: this may not apply in your case, since you mentioned you are doing a collect on the data and have not mentioned any memory issues.)
The approach of converting to an rdd and using zipWithIndex is an acceptable solution, but generally converting from dataframe to rdd is not recommended due to the performance difference of using an RDD instead of a dataframe.
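The index-based join described above can be sketched in plain Python. This is only an illustration of the idea behind zipWithIndex, swap and join, not Spark code, and the dictionary names are assumptions made for this example.

```python
# Plain-Python illustration of the zipWithIndex + join idea; not Spark code.
rows = [("John", 32), ("Elizabeth", 28), ("Eric", 41)]
rank = [4, 5, 10]

# zipWithIndex followed by swap: key every element by its position.
rows_by_index = {i: row for i, row in enumerate(rows)}
rank_by_index = {i: r for i, r in enumerate(rank)}

# Join on the shared index, then flatten to (name, age, rank).
joined = [
    rows_by_index[i] + (rank_by_index[i],)
    for i in sorted(rows_by_index.keys() & rank_by_index.keys())
]
print(joined)
```

In Spark the joined result is not guaranteed to come back in index order, so sort by the index afterwards if the row order matters.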

Create new column with fuzzy-score across two string columns in the same dataframe

I'm trying to calculate a fuzzy score (preferably the partial_ratio score) across two columns in the same dataframe.
| column1 | column2 |
| ----------- | ----------- |
| emmett holt | holt |
| greenwald | christopher |
It would need to look something like this:
| column1 | column2|partial_ratio|
| -------- | -------------- |-----------|
| emmett holt| holt|100|
| greenwald| christopher|22|
|schaefer|schaefer|100|
With the help of another question on this website, I worked towards the following code:
compare=pd.MultiIndex.from_product([ dataframe['column1'],dataframe ['column2'] ]).to_series()
def metrics(tup):
    return pd.Series([fuzz.partial_ratio(*tup)], ['partial_ratio'])
df['partial_ratio'] = df.apply(lambda x: fuzz.partial_ratio(x['original_title'], x['title']), axis=1)
But the problem already starts with the first line of the code that returns the following error notification:
Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
You can say I'm kind of stuck here so any advice on this is appreciated!
You need a UDF to use fuzzywuzzy:
from fuzzywuzzy import fuzz
import pyspark.sql.functions as F
# Register the function as a Spark UDF so it can be applied to columns.
@F.udf
def fuzzyudf(original_title, title):
    return fuzz.partial_ratio(original_title, title)
df2 = df.withColumn('partial_ratio', fuzzyudf('column1', 'column2'))
df2.show()
+-----------+-----------+-------------+
| column1| column2|partial_ratio|
+-----------+-----------+-------------+
|emmett holt| holt| 100|
| greenwald|christopher| 22|
+-----------+-----------+-------------+
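For readers unfamiliar with what a partial-ratio score measures, here is a simplified standard-library sketch. It is not fuzzywuzzy's implementation and its scores can differ from fuzz.partial_ratio; difflib.SequenceMatcher is swapped in as a stand-in, and the function name simple_partial_ratio is hypothetical.

```python
from difflib import SequenceMatcher


def simple_partial_ratio(a: str, b: str) -> int:
    """Best similarity (0-100) of the shorter string against any
    equally long window of the longer string."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    best = 0.0
    # Slide a window of len(shorter) across the longer string.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(best * 100)


# Row-wise scoring, analogous to applying a UDF to two columns.
rows = [("emmett holt", "holt"), ("schaefer", "schaefer")]
scored = [(c1, c2, simple_partial_ratio(c1, c2)) for c1, c2 in rows]
print(scored)
```

Because "holt" appears verbatim inside "emmett holt", one window matches exactly and the score is 100, which mirrors the first row of the output above.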

Pyspark dataframe - Illegal values appearing in the column?

So I have a table (sample)
I'm using the pyspark dataframe APIs to filter out the 'NOC's that have never won a gold medal, and here's the code I wrote
First part of my code
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
spark = SQLContext(sc)
df1 = spark.read.format("csv").options(header = 'true').load("D:\\datasets\\athlete_events.csv")
df = df1.na.replace('NA', '-')
countgdf = gdf.groupBy('NOC').agg(count('Medal').alias('No of Gold medals')).select('NOC').show()
It will generate the output
+---+
|NOC|
+---+
|POL|
|JAM|
|BRA|
|ARM|
|MOZ|
|JOR|
|CUB|
|FRA|
|ALG|
|BRN|
+---+
only showing top 10 rows
The next part of the code is something like
allgdf = df.select('NOC').distinct()
This displays the output
+-----------+
| NOC|
+-----------+
| DeRuyter|
| POL|
| Russia|
| JAM|
| BUR|
| BRA|
| ARM|
| MOZ|
| CUB|
| JOR|
| Sweden|
| FRA|
| ALG|
| SOM|
| IVB|
|Philippines|
| BRN|
| MAL|
| COD|
| FSM|
+-----------+
Notice the values that are more than 3 characters long? Those are supposed to be values of the 'Team' column, but I'm not sure why they are showing up in the 'NOC' column. I can't figure out why these illegal values end up in the column.
When I write the final code
final = allgdf.subtract(countgdf).show()
The same thing happens: illegal values appear in the final dataframe column.
Any help would be appreciated. Thanks.
You should specify a delimiter for your CSV file. By default, Spark uses a comma (,) as the separator.
This can be done, for example, with:
.option("delimiter",";")
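As a rough illustration of how a delimiter mismatch moves values into the wrong columns, here is a small sketch using Python's standard csv module rather than Spark. The sample line is hypothetical and is not taken from athlete_events.csv.

```python
import csv
import io

# Hypothetical semicolon-separated sample; not taken from athlete_events.csv.
raw = "ID;Name;Team;NOC\n1;Smith, John;Sweden;SWE\n"


def parse(text, delimiter):
    """Parse CSV text into a list of rows using the given delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


wrong = parse(raw, ",")  # comma parsing applied to a semicolon file
right = parse(raw, ";")  # delimiter that matches the file

print(wrong[1])
print(right[1])
```

With the wrong delimiter the data row is cut at the comma inside the name, so the fields no longer line up with the header; with the matching delimiter the fourth field is the NOC code as expected.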

Joining two pyspark dataframes by unique values in a column

Let's say, I have two pyspark dataframes, users and shops. A few sample rows for both the dataframes are shown below.
users dataframe:
+---------+-------------+---------+
| idvalue | day-of-week | geohash |
+---------+-------------+---------+
| id-1 | 2 | gcutjjn |
| id-1 | 3 | gcutjjn |
| id-1 | 5 | gcutjht |
+---------+-------------+---------+
shops dataframe
+---------+-----------+---------+
| shop-id | shop-name | geohash |
+---------+-----------+---------+
| sid-1 | kfc | gcutjjn |
| sid-2 | mcd | gcutjhq |
| sid-3 | starbucks | gcutjht |
+---------+-----------+---------+
I need to join both of these dataframes on the geohash column. I can do a naive equi-join for sure, but the users dataframe is huge, containing billions of rows, and geohashes are likely to repeat, within and across idvalues. So, I was wondering if there's a way to perform joins on unique geohashes in the users dataframe and geohashes in the shops dataframe. If we can do that, then it's easy to replicate the shops entries for matching geohashes in resultant dataframe.
It can probably be achieved with a pandas udf, where I would perform a groupby on users.idvalue, do a join with shops within the udf by only taking the first row from the group (because all ids are the same anyway within the group), and create a one-row dataframe. Logically it feels like this should work, but I'm not sure about the performance, as udfs are usually slower than Spark native transformations. Any ideas are welcome.
You said that your users dataframe is huge and that "geohashes are likely to repeat, within and across idvalues". You didn't mention, however, whether there might be duplicated geohashes in your shops dataframe.
If there are no repeated hashes in the latter, I think that a simple join would solve your problem:
val userDf = Seq(("id-1",2,"gcutjjn"),("id-2",2,"gcutjjn"),("id-1",3,"gcutjjn"),("id-1",5,"gcutjht")).toDF("idvalue","day_of_week","geohash")
val shopDf = Seq(("sid-1","kfc","gcutjjn"),("sid-2","mcd","gcutjhq"),("sid-3","starbucks","gcutjht")).toDF("shop_id","shop_name","geohash")
userDf.show
+-------+-----------+-------+
|idvalue|day_of_week|geohash|
+-------+-----------+-------+
| id-1| 2|gcutjjn|
| id-2| 2|gcutjjn|
| id-1| 3|gcutjjn|
| id-1| 5|gcutjht|
+-------+-----------+-------+
shopDf.show
+-------+---------+-------+
|shop_id|shop_name|geohash|
+-------+---------+-------+
| sid-1| kfc|gcutjjn|
| sid-2| mcd|gcutjhq|
| sid-3|starbucks|gcutjht|
+-------+---------+-------+
shopDf
  .join(userDf,Seq("geohash"),"inner")
  .groupBy($"geohash",$"shop_id",$"idvalue")
  .agg(collect_list($"day_of_week").alias("days"))
  .show
+-------+-------+-------+------+
|geohash|shop_id|idvalue| days|
+-------+-------+-------+------+
|gcutjjn| sid-1| id-1|[2, 3]|
|gcutjht| sid-3| id-1| [5]|
|gcutjjn| sid-1| id-2| [2]|
+-------+-------+-------+------+
If you have repeated hash values in your shops dataframe, a possible approach would be to remove those repeated hashes from your shops dataframe (if your requirements allow this), and then perform the same join operation.
val userDf = Seq(("id-1",2,"gcutjjn"),("id-2",2,"gcutjjn"),("id-1",3,"gcutjjn"),("id-1",5,"gcutjht")).toDF("idvalue","day_of_week","geohash")
val shopDf = Seq(("sid-1","kfc","gcutjjn"),("sid-2","mcd","gcutjhq"),("sid-3","starbucks","gcutjht"),("sid-4","burguer king","gcutjjn")).toDF("shop_id","shop_name","geohash")
userDf.show
+-------+-----------+-------+
|idvalue|day_of_week|geohash|
+-------+-----------+-------+
| id-1| 2|gcutjjn|
| id-2| 2|gcutjjn|
| id-1| 3|gcutjjn|
| id-1| 5|gcutjht|
+-------+-----------+-------+
shopDf.show
+-------+------------+-------+
|shop_id| shop_name|geohash|
+-------+------------+-------+
| sid-1| kfc|gcutjjn| << Duplicated geohash
| sid-2| mcd|gcutjhq|
| sid-3| starbucks|gcutjht|
| sid-4|burguer king|gcutjjn| << Duplicated geohash
+-------+------------+-------+
//Dataframe with hashes to exclude:
val excludedHashes = shopDf.groupBy("geohash").count.filter("count > 1")
excludedHashes.show
+-------+-----+
|geohash|count|
+-------+-----+
|gcutjjn| 2|
+-------+-----+
//Create a dataframe of shops without the ones with duplicated hashes
val cleanShopDf = shopDf.join(excludedHashes,Seq("geohash"),"left_anti")
cleanShopDf.show
+-------+-------+---------+
|geohash|shop_id|shop_name|
+-------+-------+---------+
|gcutjhq| sid-2| mcd|
|gcutjht| sid-3|starbucks|
+-------+-------+---------+
//Perform the same join operation
cleanShopDf.join(userDf,Seq("geohash"),"inner")
  .groupBy($"geohash",$"shop_id",$"idvalue")
  .agg(collect_list($"day_of_week").alias("days"))
  .show
+-------+-------+-------+----+
|geohash|shop_id|idvalue|days|
+-------+-------+-------+----+
|gcutjht| sid-3| id-1| [5]|
+-------+-------+-------+----+
The code provided was written in Scala but it can be easily converted to Python.
Hope this helps!
One idea: if possible, use PySpark SQL to select the distinct geohash values and register them as a temporary table. Then join against that table instead of the original dataframe.
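The distinct-keys idea can be sketched in plain Python as follows. This only illustrates the three steps (deduplicate the keys, join the small table against them, replicate the matches back onto the user rows) and is not Spark code; all variable names are assumptions made for this example.

```python
# Plain-Python illustration; in Spark these steps would be DataFrame operations.
users = [("id-1", 2, "gcutjjn"), ("id-1", 3, "gcutjjn"), ("id-1", 5, "gcutjht")]
shops = [("sid-1", "kfc", "gcutjjn"), ("sid-2", "mcd", "gcutjhq"),
         ("sid-3", "starbucks", "gcutjht")]

# Step 1: reduce the large side to its distinct join keys.
distinct_geohashes = {geohash for _, _, geohash in users}

# Step 2: join the small shops table against the distinct keys only.
shops_by_geohash = {}
for shop_id, shop_name, geohash in shops:
    if geohash in distinct_geohashes:
        shops_by_geohash.setdefault(geohash, []).append((shop_id, shop_name))

# Step 3: replicate the matched shops back onto every user row.
result = [
    (idvalue, day, geohash, shop_id, shop_name)
    for idvalue, day, geohash in users
    for shop_id, shop_name in shops_by_geohash.get(geohash, [])
]
print(result)
```

Whether this is faster than a direct equi-join in Spark depends on the data and the query plan, so it is worth measuring both on a sample.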

SQLAlchemy getting label names out from columns

I want to use the same labels from a SQLAlchemy table, to re-aggregate some data (e.g. I want to iterate through mytable.c to get the column names exactly).
I have some spending data that looks like the following:
| name | region | date | spending |
| John | A | .... | 123 |
| Jack | A | .... | 20 |
| Jill | B | .... | 240 |
I'm then passing it to an existing function we have, that aggregates spending over 2 periods (using a case statement) and groups by region:
grouped table:
| Region | Total (this period) | Total (last period) |
| A | 3048 | 1034 |
| B | 2058 | 900 |
The function returns a SQLAlchemy query object that I can then use subquery() on to re-query e.g.:
subquery = get_aggregated_data(original_table)
region_A_results = session.query(subquery).filter(subquery.c.region == 'A')
I then want to re-aggregate this subquery (summing every column that can be summed, and replacing the region column with the string 'other').
The problem is, if I iterate through subquery.c, I get labels that look like:
anon_1.region
anon_1.sum_this_period
anon_1.sum_last_period
Is there a way to get the textual label from a set of column objects, without the anon_1. prefix? Especially since I feel that the prefix may change depending on how SQLAlchemy decides to generate the query.
Split the name string and take the second part, and if you want to prepare for the chance that the name is not prefixed by the table name, put the code in a try - except block:
for col in subquery.c:
    try:
        print(col.name.split('.')[1])
    except IndexError:
        print(col.name)
Also, the result proxy (region_A_results) has a keys method that returns a list of column names. Again, if you don't need the table names, you can easily strip them.
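A slightly shorter variant of the same string handling, shown as a plain-Python sketch: rsplit with a maximum of one split returns the whole string when there is no dot, so no try/except is needed. The label strings below are hypothetical and simply mirror the ones in the question.

```python
# Hypothetical label strings mirroring the ones shown in the question.
labels = ["anon_1.region", "anon_1.sum_this_period", "sum_last_period"]

# rsplit with maxsplit=1 keeps only the text after the last dot and
# returns the whole string unchanged when there is no dot at all.
bare_names = [label.rsplit(".", 1)[-1] for label in labels]
print(bare_names)
```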