Spark join two dataframes on best .startswith match - dataframe

Two dataframes with phone numbers with ids that could only be matched via regex, e.g.
idPrefix
other columns
420
42055
420551
phoneNumber
other columns
420551666
421709560
I would need to join these dataframes on the best match of idPrefix to the phoneNumber, matching the longest starting prefix possible, if there is one. E.g. if there were any option to join on longest idPrefix for phoneNumber.startswith(idPrefix), that would be great.
e.g. result would look like
I tried a few UDFs with regex instead of a df.join(), but there does not seem to be a way to have a dataframe as an input to UDF.

Join on startsWith condition then use row_number to keep longest idPrefix per phoneNumber:
import org.apache.spark.sql.expressions.Window
val df1 = Seq(("420"), ("42055"), ("420551")).toDF("idPrefix")
val df2 = Seq(("420551666"), ("421709560")).toDF("phoneNumber")
val df = (df2
.join(df1, col("phoneNumber").startsWith(col("idPrefix")), "left")
.withColumn(
"rn",
row_number().over(
Window.partitionBy("phoneNumber").orderBy(length(col("idPrefix")).desc)
)
)
.filter("rn = 1")
.drop("rn")
)
df.show
//+-----------+--------+
//|phoneNumber|idPrefix|
//+-----------+--------+
//| 421709560| null|
//| 420551666| 420551|
//+-----------+--------+

It's true that you can use directly the startsWith in the join, but this will require that one of the tables can be broadcasted, if this case can't be achieved it will become a join with a cartesian product complexity making it very inefficient. If both dataframes are very large, you can try to make a join first by prefixes and then filter by the startsWith.
The code will look something like this:
val df1 = Seq(("420"), ("42055"), ("420551")).toDF("idPrefix")
val df2 = Seq(("420551666"), ("421709560")).toDF("phoneNumber")
val prefixSize = df1.select(min(length(col("idPrefix")))).as[Int].head()
//takes the length of the smallest prefix, or if its static set it hardcoded
val df = df2
.join(df1, substring(col("idPrefix"), 0, prefixSize) === substring(col("phoneNumber"), 0, prefixSize), "left")
.filter(col("phoneNumber").startsWith(col("idPrefix")) || col("idPrefix").isNull) //join by the prefix and filter all false elements
.groupBy("phoneNumber").agg(max(struct(length(col("idPrefix").as("size")), col("idPrefix"))).as("idPrefix"))
.withColumn("idPrefix", col("idPrefix.idPrefix")) //take the maximum length for the phone number in a single aggregation
df.show
//+-----------+--------+
//|phoneNumber|idPrefix|
//+-----------+--------+
//| 420551666| 420551|
//| 421709560| null|
//+-----------+--------+

Related

Is there a way in pyspark to count unique values

I have a spark dataframe (12m x 132) and I am trying to calculate the number of unique values by column, and remove columns that have only 1 unique value.
So far, I have used the pandas nunique function as such:
import pandas as pd
df = sql_dw.read_table(<table>)
df_p = df.toPandas()
nun = df_p.nunique(axis=0)
nundf = pd.DataFrame({'atr':nun.index, 'countU':nun.values})
dropped = []
for i, j in nundf.values:
if j == 1:
dropped.append(i)
df = df.drop(i)
print(dropped)
Is there a way to do this that is more native to spark - i.e. not using pandas?
Please have a look at the commented example below. The solution requires more python as pyspark specific knowledge.
import pyspark.sql.functions as F
#creating a dataframe
columns = ['asin' ,'ctx' ,'fo' ]
l = [('ASIN1','CTX1','FO1')
,('ASIN1','CTX1','FO1')
,('ASIN1','CTX1','FO2')
,('ASIN1','CTX2','FO1')
,('ASIN1','CTX2','FO2')
,('ASIN1','CTX2','FO2')
,('ASIN1','CTX2','FO3')
,('ASIN1','CTX3','FO1')
,('ASIN1','CTX3','FO3')]
df=spark.createDataFrame(l, columns)
df.show()
#we create a list of functions we want to apply
#in this case countDistinct for each column
expr = [F.countDistinct(c).alias(c) for c in df.columns]
#we apply those functions
countdf = df.select(*expr)
#this df has just one row
countdf.show()
#we extract the columns which have just one value
cols2drop = [k for k,v in countdf.collect()[0].asDict().items() if v == 1]
df.drop(*cols2drop).show()
Output:
+-----+----+---+
| asin| ctx| fo|
+-----+----+---+
|ASIN1|CTX1|FO1|
|ASIN1|CTX1|FO1|
|ASIN1|CTX1|FO2|
|ASIN1|CTX2|FO1|
|ASIN1|CTX2|FO2|
|ASIN1|CTX2|FO2|
|ASIN1|CTX2|FO3|
|ASIN1|CTX3|FO1|
|ASIN1|CTX3|FO3|
+-----+----+---+
+----+---+---+
|asin|ctx| fo|
+----+---+---+
| 1| 3| 3|
+----+---+---+
+----+---+
| ctx| fo|
+----+---+
|CTX1|FO1|
|CTX1|FO1|
|CTX1|FO2|
|CTX2|FO1|
|CTX2|FO2|
|CTX2|FO2|
|CTX2|FO3|
|CTX3|FO1|
|CTX3|FO3|
+----+---+
My apologies as I don't have the solution in pyspark but in pure spark, which may be transferable or used in case you can't find a pyspark way.
You can create a blank list and then using a foreach, check which columns have a distinct count of 1, then append them to the blank list.
From there you can use the list as a filter and drop those columns from your dataframe.
var list_of_columns: List[String] = ()
df_p.columns.foreach{c =>
if (df_p.select(c).distinct.count == 1)
list_of_columns ++= List(c)
df_p_new = df_p.drop(list_of_columns:_*)
you can group your df by that column and count distinct value of this column:
df = df.groupBy("column_name").agg(countDistinct("column_name").alias("distinct_count"))
And then filter your df by row which has more than 1 distinct_count:
df = df.filter(df.distinct_count > 1)

Spark SQL: How to apply specific functions to all specified columns

I there an easy way to call sql on multiple column on spark sql.
For example, Let's say I have a query that should be applied to most columns
select
min(c1) as min,
max(c1) as max,
max(c1) - min(c1) range
from table tb1
If there are multiple columns, is there a way to execute the query for all the columns, and get result one time.
Similar to how df.describe does.
Use the meta data (columns in this case) included in your dataframe (which you can get via spark.table("<table_name>") if you don't have it in scope already to get the column names, then apply the functions you want and pass to df.select (or df.selectExpr).
Build some test data:
scala> var seq = Seq[(Int, Int, Float)]()
seq: Seq[(Int, Int, Float)] = List()
scala> (1 to 1000).foreach(n => { seq = seq :+ (n,r.nextInt,r.nextFloat) })
scala> val df = seq.toDF("id", "some_int", "some_float")
Denote some functions we want to run on all the columns:
scala> val functions_to_apply = Seq("min", "max")
functions_to_apply: Seq[String] = List(min, max)
Setup the final Seq of SQL Columns:
scala> var select_columns = Seq[org.apache.spark.sql.Column]()
select_columns: Seq[org.apache.spark.sql.Column] = List()
Iterate over the columns and functions to apply to populate the select_columns Seq:
scala> val cols = df.columns
scala> cols.foreach(col => { functions_to_apply.foreach(f => {select_columns = select_columns :+ expr(s"$f($col)")})})
Run the actual query:
scala> df.select(select_columns:_*).show
+-------+-------+-------------+-------------+---------------+---------------+
|min(id)|max(id)|min(some_int)|max(some_int)|min(some_float)|max(some_float)|
+-------+-------+-------------+-------------+---------------+---------------+
| 1| 1000| -2143898568| 2147289642| 1.8781424E-4| 0.99964607|
+-------+-------+-------------+-------------+---------------+---------------+

Sql DataFrame - Operation

I am stuck with a situation where i need to perform division on output from two sql Data Frame . Any Suggestion How it can be done ?
scala> val TotalDie = sqlc.sql("select COUNT(DISTINCT XY) from Data")
TotalDie: org.apache.spark.sql.DataFrame = [_c0: bigint]
scala> TotalDie.show()
+---+
|_c0|
+---+
|887|
+---+
scala> val PassDie = sqlc.sql("select COUNT(DISTINCT XY) from Data where Sbin = '1'")
PassDie: org.apache.spark.sql.DataFrame = [_c0: bigint]
scala> PassDie.show()
+---+
|_c0|
+---+
|413|
+---+
I need to perform to calculate the Yield which refer to (PassDie/TotalDie)*100,
I am new to spark-shell
In case of multiple values (ie multiple rows): do you have a column (or key or id) to join the two dataframes (or tables) on ?
In case of always a single value (ie single row): something along the lines of: 100* PassDie.collect() / TotalDie.collect()
UPDATE
The exact syntax in case of 1 value:
100.0 * passdie.collect()(0).getInt(0) / totaldie.collect()(0).getInt(0)
res25: Double = 46.56144306651635
It is possible to do this with just SparkSQL, too.
Here's what i'd do to solve it that way:
>>> rdd1 = sc.parallelize([("a",1.12),("a",2.22)])
>>> rdd2 = sc.parallelize([("b",9.12),("b",12.22)])
>>> r1df = rdd1.toDF()
>>> r2df = rdd2.toDF()
>>> r1df.registerTempTable('r1')
>>> r2df.registerTempTable('r2')
>>> r3df = sqlContext.sql("SELECT * FROM r1 UNION SELECT * FROM r2").show()
>>> r3df.registerTempTable('r3')
>>> sqlContext.sql("SELECT * FROM r3") -------> do your aggregation / math here.
Now from here, in theory, you can do basic grouping and arithmetic just using SQL queries, since you've got this grand table of data. I realize in my example code, I didn't really declare a good schema with column names, and that makes this example not really work, but you have a schema, so you get the idea.

Concatenate columns in Apache Spark DataFrame

How do we concatenate two columns in an Apache Spark DataFrame?
Is there any function in Spark SQL which we can use?
With raw SQL you can use CONCAT:
In Python
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")
In Scala
import sqlContext.implicits._
val df = sc.parallelize(Seq(("foo", 1), ("bar", 2))).toDF("k", "v")
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")
Since Spark 1.5.0 you can use concat function with DataFrame API:
In Python :
from pyspark.sql.functions import concat, col, lit
df.select(concat(col("k"), lit(" "), col("v")))
In Scala :
import org.apache.spark.sql.functions.{concat, lit}
df.select(concat($"k", lit(" "), $"v"))
There is also concat_ws function which takes a string separator as the first argument.
Here's how you can do custom naming
import pyspark
from pyspark.sql import functions as sf
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
df = sqlc.createDataFrame([('row11','row12'), ('row21','row22')], ['colname1', 'colname2'])
df.show()
gives,
+--------+--------+
|colname1|colname2|
+--------+--------+
| row11| row12|
| row21| row22|
+--------+--------+
create new column by concatenating:
df = df.withColumn('joined_column',
sf.concat(sf.col('colname1'),sf.lit('_'), sf.col('colname2')))
df.show()
+--------+--------+-------------+
|colname1|colname2|joined_column|
+--------+--------+-------------+
| row11| row12| row11_row12|
| row21| row22| row21_row22|
+--------+--------+-------------+
One option to concatenate string columns in Spark Scala is using concat.
It is necessary to check for null values. Because if one of the columns is null, the result will be null even if one of the other columns do have information.
Using concat and withColumn:
val newDf =
df.withColumn(
"NEW_COLUMN",
concat(
when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))
Using concat and select:
val newDf = df.selectExpr("concat(nvl(COL1, ''), nvl(COL2, '')) as NEW_COLUMN")
With both approaches you will have a NEW_COLUMN which value is a concatenation of the columns: COL1 and COL2 from your original df.
concat(*cols)
v1.5 and higher
Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns.
Eg: new_df = df.select(concat(df.a, df.b, df.c))
concat_ws(sep, *cols)
v1.5 and higher
Similar to concat but uses the specified separator.
Eg: new_df = df.select(concat_ws('-', df.col1, df.col2))
map_concat(*cols)
v2.4 and higher
Used to concat maps, returns the union of all the given maps.
Eg: new_df = df.select(map_concat("map1", "map2"))
Using concat operator (||):
v2.3 and higher
Eg: df = spark.sql("select col_a || col_b || col_c as abc from table_x")
Reference: Spark sql doc
If you want to do it using DF, you could use a udf to add a new column based on existing columns.
val sqlContext = new SQLContext(sc)
case class MyDf(col1: String, col2: String)
//here is our dataframe
val df = sqlContext.createDataFrame(sc.parallelize(
Array(MyDf("A", "B"), MyDf("C", "D"), MyDf("E", "F"))
))
//Define a udf to concatenate two passed in string values
val getConcatenated = udf( (first: String, second: String) => { first + " " + second } )
//use withColumn method to add a new column called newColName
df.withColumn("newColName", getConcatenated($"col1", $"col2")).select("newColName", "col1", "col2").show()
From Spark 2.3(SPARK-22771) Spark SQL supports the concatenation operator ||.
For example;
val df = spark.sql("select _c1 || _c2 as concat_column from <table_name>")
Here is another way of doing this for pyspark:
#import concat and lit functions from pyspark.sql.functions
from pyspark.sql.functions import concat, lit
#Create your data frame
countryDF = sqlContext.createDataFrame([('Ethiopia',), ('Kenya',), ('Uganda',), ('Rwanda',)], ['East Africa'])
#Use select, concat, and lit functions to do the concatenation
personDF = countryDF.select(concat(countryDF['East Africa'], lit('n')).alias('East African'))
#Show the new data frame
personDF.show()
----------RESULT-------------------------
84
+------------+
|East African|
+------------+
| Ethiopian|
| Kenyan|
| Ugandan|
| Rwandan|
+------------+
Here is a suggestion for when you don't know the number or name of the columns in the Dataframe.
val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
Do we have java syntax corresponding to below process
val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
In Spark 2.3.0, you may do:
spark.sql( """ select '1' || column_a from table_a """)
In Java you can do this to concatenate multiple columns. The sample code is to provide you a scenario and how to use it for better understanding.
SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
Dataset<Row> reducedInventory = spark.sql("select * from table_name")
.withColumn("concatenatedCol",
concat(col("col1"), lit("_"), col("col2"), lit("_"), col("col3")));
class JavaSparkSessionSingleton {
private static transient SparkSession instance = null;
public static SparkSession getInstance(SparkConf sparkConf) {
if (instance == null) {
instance = SparkSession.builder().config(sparkConf)
.getOrCreate();
}
return instance;
}
}
The above code concatenated col1,col2,col3 seperated by "_" to create a column with name "concatenatedCol".
In my case, I wanted a Pipe-'I' delimited row.
from pyspark.sql import functions as F
df.select(F.concat_ws('|','_c1','_c2','_c3','_c4')).show()
This worked well like a hot knife over butter.
use concat method like this:
Dataset<Row> DF2 = DF1
.withColumn("NEW_COLUMN",concat(col("ADDR1"),col("ADDR2"),col("ADDR3"))).as("NEW_COLUMN")
Another way to do it in pySpark using sqlContext...
#Suppose we have a dataframe:
df = sqlContext.createDataFrame([('row1_1','row1_2')], ['colname1', 'colname2'])
# Now we can concatenate columns and assign the new column a name
df = df.select(concat(df.colname1, df.colname2).alias('joined_colname'))
Indeed, there are some beautiful inbuilt abstractions for you to accomplish your concatenation without the need to implement a custom function. Since you mentioned Spark SQL, so I am guessing you are trying to pass it as a declarative command through spark.sql(). If so, you can accomplish in a straight forward manner passing SQL command like:
SELECT CONCAT(col1, '<delimiter>', col2, ...) AS concat_column_name FROM <table_name>;
Also, from Spark 2.3.0, you can use commands in lines with:
SELECT col1 || col2 AS concat_column_name FROM <table_name>;
Wherein, is your preferred delimiter (can be empty space as well) and is the temporary or permanent table you are trying to read from.
We can simple use SelectExpr as well.
df1.selectExpr("*","upper(_2||_3) as new")
We can use concat() in select method of dataframe
val fullName = nameDF.select(concat(col("FirstName"), lit(" "), col("LastName")).as("FullName"))
Using withColumn and concat
val fullName1 = nameDF.withColumn("FullName", concat(col("FirstName"), lit(" "), col("LastName")))
Using spark.sql concat function
val fullNameSql = spark.sql("select Concat(FirstName, LastName) as FullName from names")
Taken from https://www.sparkcodehub.com/spark-dataframe-concat-column
val newDf =
df.withColumn(
"NEW_COLUMN",
concat(
when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))
Note: For this code to work you need to put the parentheses "()" in the "isNotNull" function. -> The correct one is "isNotNull()".
val newDf =
df.withColumn(
"NEW_COLUMN",
concat(
when(col("COL1").isNotNull(), col("COL1")).otherwise(lit("null")),
when(col("COL2").isNotNull(), col("COL2")).otherwise(lit("null"))))

Renaming columns for PySpark DataFrame aggregates

I am analysing some data with PySpark DataFrames. Suppose I have a DataFrame df that I am aggregating:
(df.groupBy("group")
.agg({"money":"sum"})
.show(100)
)
This will give me:
group SUM(money#2L)
A 137461285853
B 172185566943
C 271179590646
The aggregation works just fine but I dislike the new column name SUM(money#2L). Is there a way to rename this column into something human readable from the .agg method? Maybe something more similar to what one would do in dplyr:
df %>% group_by(group) %>% summarise(sum_money = sum(money))
Although I still prefer dplyr syntax, this code snippet will do:
import pyspark.sql.functions as sf
(df.groupBy("group")
.agg(sf.sum('money').alias('money'))
.show(100))
It gets verbose.
withColumnRenamed should do the trick. Here is the link to the pyspark.sql API.
df.groupBy("group")\
.agg({"money":"sum"})\
.withColumnRenamed("SUM(money)", "money")
.show(100)
I made a little helper function for this that might help some people out.
import re
from functools import partial
def rename_cols(agg_df, ignore_first_n=1):
"""changes the default spark aggregate names `avg(colname)`
to something a bit more useful. Pass an aggregated dataframe
and the number of aggregation columns to ignore.
"""
delimiters = "(", ")"
split_pattern = '|'.join(map(re.escape, delimiters))
splitter = partial(re.split, split_pattern)
split_agg = lambda x: '_'.join(splitter(x))[0:-ignore_first_n]
renamed = map(split_agg, agg_df.columns[ignore_first_n:])
renamed = zip(agg_df.columns[ignore_first_n:], renamed)
for old, new in renamed:
agg_df = agg_df.withColumnRenamed(old, new)
return agg_df
An example:
gb = (df.selectExpr("id", "rank", "rate", "price", "clicks")
.groupby("id")
.agg({"rank": "mean",
"*": "count",
"rate": "mean",
"price": "mean",
"clicks": "mean",
})
)
>>> gb.columns
['id',
'avg(rate)',
'count(1)',
'avg(price)',
'avg(rank)',
'avg(clicks)']
>>> rename_cols(gb).columns
['id',
'avg_rate',
'count_1',
'avg_price',
'avg_rank',
'avg_clicks']
Doing at least a bit to save people from typing so much.
It's simple as:
val maxVideoLenPerItemDf = requiredItemsFiltered.groupBy("itemId").agg(max("playBackDuration").as("customVideoLength"))
maxVideoLenPerItemDf.show()
Use .as in agg to name the new row created.
.alias and .withColumnRenamed both work if you're willing to hard-code your column names. If you need a programmatic solution, e.g. friendlier names for an aggregation of all remaining columns, this provides a good starting point:
grouping_column = 'group'
cols = [F.sum(F.col(x)).alias(x) for x in df.columns if x != grouping_column]
(
df
.groupBy(grouping_column)
.agg(
*cols
)
)
df = df.groupby('Device_ID').agg(aggregate_methods)
for column in df.columns:
start_index = column.find('(')
end_index = column.find(')')
if (start_index and end_index):
df = df.withColumnRenamed(column, column[start_index+1:end_index])
The above code can strip out anything that is outside of the "()". For example, "sum(foo)" will be renamed as "foo".
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
spark = SparkSession.builder.appName('test').getOrCreate()
data = [(1, "siva", 100), (2, "siva2", 200),(3, "siva3", 300),(4, "siva4", 400),(5, "siva5", 500)]
schema = ['id', 'name', 'sallary']
df = spark.createDataFrame(data, schema=schema)
df.show()
+---+-----+-------+
| id| name|sallary|
+---+-----+-------+
| 1| siva| 100|
| 2|siva2| 200|
| 3|siva3| 300|
| 4|siva4| 400|
| 5|siva5| 500|
+---+-----+-------+
**df.agg({"sallary": "max"}).withColumnRenamed('max(sallary)', 'max').show()**
+---+
|max|
+---+
|500|
+---+
While the previously given answers are good, I think they're lacking a neat way to deal with dictionary-usage in the .agg()
If you want to use a dict, which actually might be also dynamically generated because you have hundreds of columns, you can use the following without dealing with dozens of code-lines:
# Your dictionary-version of using the .agg()-function
# Note: The provided logic could actually also be applied to a non-dictionary approach
df = df.groupBy("group")\
.agg({
"money":"sum"
, "...": "..."
})
# Now do the renaming
newColumnNames = ["group", "money", "..."] # Provide the names for ALL columns of the new df
df = df.toDF(*newColumnNames) # Do the renaming
Of course the newColumnNames-list can also be dynamically generated. E.g., if you only append columns from the aggregation to your df you can pre-store newColumnNames = df.columns and then just append the additional names.
Anyhow, be aware that the newColumnNames must contain all column names of the dataframe, not only those to be renamed (because .toDF() creates a new dataframe due to Sparks immutable RDDs)!
Another quick little one liner to add the the mix:
df.groupBy('group')
.agg({'money':'sum',
'moreMoney':'sum',
'evenMoreMoney':'sum'
})
.select(*(col(i).alias(i.replace("(",'_').replace(')','')) for i in df.columns))
just change the alias function to whatever you'd like to name them. The above generates sum_money, sum_moreMoney, since I do like seeing the operator in the variable name.