Spark SQL: How to apply specific functions to all specified columns

Is there an easy way to apply the same SQL expressions to multiple columns in Spark SQL?
For example, let's say I have a query that should be applied to most columns:
select
  min(c1) as min,
  max(c1) as max,
  max(c1) - min(c1) as range
from tb1
If there are multiple columns, is there a way to execute the query for all of them and get the results in one pass, similar to what df.describe does?

Use the metadata (the columns, in this case) included in your dataframe (which you can get via spark.table("<table_name>") if you don't already have it in scope) to get the column names, then build the expressions you want and pass them to df.select (or df.selectExpr).
Build some test data:
scala> val r = new scala.util.Random
scala> var seq = Seq[(Int, Int, Float)]()
seq: Seq[(Int, Int, Float)] = List()
scala> (1 to 1000).foreach(n => { seq = seq :+ (n, r.nextInt, r.nextFloat) })
scala> val df = seq.toDF("id", "some_int", "some_float")
Denote some functions we want to run on all the columns:
scala> val functions_to_apply = Seq("min", "max")
functions_to_apply: Seq[String] = List(min, max)
Setup the final Seq of SQL Columns:
scala> var select_columns = Seq[org.apache.spark.sql.Column]()
select_columns: Seq[org.apache.spark.sql.Column] = List()
Iterate over the columns and the functions to apply, populating the select_columns Seq:
scala> val cols = df.columns
scala> cols.foreach(col => { functions_to_apply.foreach(f => {select_columns = select_columns :+ expr(s"$f($col)")})})
Run the actual query:
scala> df.select(select_columns:_*).show
+-------+-------+-------------+-------------+---------------+---------------+
|min(id)|max(id)|min(some_int)|max(some_int)|min(some_float)|max(some_float)|
+-------+-------+-------------+-------------+---------------+---------------+
|      1|   1000|  -2143898568|   2147289642|   1.8781424E-4|     0.99964607|
+-------+-------+-------------+-------------+---------------+---------------+
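If you also want the range (max - min) from the original question, you can emit a derived expression per column as well. Here is a minimal sketch along the same lines; the ${c}_min/${c}_max/${c}_range aliases are just illustrative:
import org.apache.spark.sql.functions.expr

// For each column, build min, max, and a derived range expression,
// then run them all in a single select.
val stats = df.columns.flatMap { c =>
  Seq(
    expr(s"min($c) as ${c}_min"),
    expr(s"max($c) as ${c}_max"),
    expr(s"max($c) - min($c) as ${c}_range"))
}
df.select(stats: _*).show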

Related

Spark join two dataframes on best .startswith match

Two dataframes, one with id prefixes and one with phone numbers (both with other columns as well), that can only be matched by prefix, e.g.
+--------+-------------+
|idPrefix|other columns|
+--------+-------------+
|     420|          ...|
|   42055|          ...|
|  420551|          ...|
+--------+-------------+
+-----------+-------------+
|phoneNumber|other columns|
+-----------+-------------+
|  420551666|          ...|
|  421709560|          ...|
+-----------+-------------+
I would need to join these dataframes on the best match of idPrefix to phoneNumber, matching the longest starting prefix possible, if there is one. E.g. if there were an option to join on the longest idPrefix for which phoneNumber.startswith(idPrefix), that would be great.
The result would pair each phoneNumber with its longest matching idPrefix (or null if none matches).
I tried a few UDFs with regex instead of a df.join(), but there does not seem to be a way to have a dataframe as an input to a UDF.
Join on the startsWith condition, then use row_number to keep the longest idPrefix per phoneNumber:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, length, row_number}

val df1 = Seq(("420"), ("42055"), ("420551")).toDF("idPrefix")
val df2 = Seq(("420551666"), ("421709560")).toDF("phoneNumber")

val df = (df2
  .join(df1, col("phoneNumber").startsWith(col("idPrefix")), "left")
  .withColumn(
    "rn",
    row_number().over(
      Window.partitionBy("phoneNumber").orderBy(length(col("idPrefix")).desc)
    )
  )
  .filter("rn = 1")
  .drop("rn")
)
df.show
//+-----------+--------+
//|phoneNumber|idPrefix|
//+-----------+--------+
//|  421709560|    null|
//|  420551666|  420551|
//+-----------+--------+
It's true that you can use startsWith directly in the join, but this requires that one of the tables can be broadcast; if that isn't possible, the join degenerates into one with cartesian-product complexity and becomes very inefficient. If both dataframes are very large, you can instead join on a fixed-length prefix first and then filter with startsWith.
The code will look something like this:
import org.apache.spark.sql.functions.{col, length, max, min, struct, substring}

val df1 = Seq(("420"), ("42055"), ("420551")).toDF("idPrefix")
val df2 = Seq(("420551666"), ("421709560")).toDF("phoneNumber")

// take the length of the smallest prefix (or hardcode it if it is static)
val prefixSize = df1.select(min(length(col("idPrefix")))).as[Int].head()

val df = df2
  .join(df1, substring(col("idPrefix"), 0, prefixSize) === substring(col("phoneNumber"), 0, prefixSize), "left")
  .filter(col("phoneNumber").startsWith(col("idPrefix")) || col("idPrefix").isNull) // join by the prefix and filter out the false matches
  .groupBy("phoneNumber")
  .agg(max(struct(length(col("idPrefix")).as("size"), col("idPrefix"))).as("idPrefix")) // take the longest prefix per phone number in a single aggregation
  .withColumn("idPrefix", col("idPrefix.idPrefix"))
df.show
//+-----------+--------+
//|phoneNumber|idPrefix|
//+-----------+--------+
//|  420551666|  420551|
//|  421709560|    null|
//+-----------+--------+
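Alternatively, if the prefix table is small enough to broadcast, you can keep the direct startsWith join and hint the broadcast explicitly, so the non-equi join is planned as a broadcast nested-loop join rather than a cartesian-style one. A sketch assuming the same df1 and df2 as above:
import org.apache.spark.sql.functions.{broadcast, col}

// Broadcasting the small prefix table keeps the startsWith join from
// falling back to a cartesian-product-style plan.
val joined = df2.join(
  broadcast(df1),
  col("phoneNumber").startsWith(col("idPrefix")),
  "left")
From there you would still keep only the longest matching prefix per phoneNumber, e.g. with the row_number window from the first answer.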

How to separate a collection inside a column in spark-dataframe and pass the values to other column?

Suppose this is a dataframe:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, MapType, StringType, StructType}

val data = Seq(
  Row("Ramesh", List("English", "German"), Map("hair" -> "black", "eye" -> "brown")),
  Row("Vijay", List("Spark", "French", null), Map("hair" -> "brown", "eye" -> null)),
  Row("Yann", List("Mandrin", ""), Map("hair" -> "red", "eye" -> "")),
  Row("Ram", null, null),
  Row("Jefferson", List(), Map())
)

val schema = new StructType()
  .add("name", StringType)
  .add("languages", ArrayType(StringType))
  .add("properties", MapType(StringType, StringType))

val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
OUTPUT :
+----------+--------------+
|name      |languages     |
+----------+--------------+
I want the output to pair each name with atomic values. Nulls should also be covered.
Use explode_outer from org.apache.spark.sql.functions. This flattens the collection column in the dataframe, and the outer variant also keeps rows where the collection is null.
df.select($"name",explode_outer($"languages"))
.show(false)
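The properties map can be flattened the same way: explode_outer on a MapType column produces key and value columns and keeps rows where the map is null. A small sketch assuming the df built above:
import org.apache.spark.sql.functions.explode_outer

// One row per map entry, with `key` and `value` columns;
// rows whose map is null are kept as a single row of nulls.
df.select($"name", explode_outer($"properties"))
  .show(false)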

how to pass in array into udf spark

I have two problems: 1) I don't really know how to call a registered UDF (I found an answer saying to use callUDF, so that is how I call the function in my code), and 2) I don't really know how to pass arrays in as parameters.
Here is my code:
val df = Seq(("1","2","3","4","5","6")).toDF("A","B","C","D","E","F")
val newdf = Seq(("1","2","3","4","5","6")).toDF("A","B","C","D","E","F")
val cols = df.columns
val temp = Array(df.select($"A"),df.select($"B"),df.select($"C"),df.select($"D"),df.select($"E"),df.select($"F"))
val temp2 = Array(newdf.select($"A"),newdf.select($"B"),newdf.select($"C"),newdf.select($"D"),newdf.select($"E"),newdf.select($"F"))
sparkSession.udf.register ( "myfunc" , ((A:Array[String],B:Array[String]) => {for(i <- 0 to 5)yield( if (A(i)==B(i)) "U" else "N")} ) )
val a = df.withColumn("A",callUDF("myfunc",(temp,temp2)))
Thanks in advance!
You are trying to use columns from two different dataframes which is illegal in a UDF. Spark's UDF can only work on a per row basis. You can't combine rows from different dataframes. To do so you need to perform a join between the two.
In your case you have just one row, but in a realistic case you would have multiple rows, so you need to make sure you have some unique key to join on, such as a unique id.
If you don't and both dataframes have the same number of rows and the same number of partitions you can easily create an id for both dataframes like this:
df.withColumn("id",monotonicallyIncreasingId)
You should probably also rename the columns to have different names.
Look at the different options for join (see http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset) to see what best matches your need.
As for registering and calling a udf you can do:
import org.apache.spark.sql.functions.{col, udf}

def myFunc(s1: Seq[String], s2: Seq[String]): Seq[String] = {
  for (i <- s1.indices) yield {
    if (s1(i) == s2(i)) "U" else "N"
  }
}
val u = udf(myFunc _)
val a = df.withColumn("A", u(col("temp"), col("temp2")))
Note that temp and temp2 should each be a column representing an array in the same dataframe, i.e. you should define them after the join on the relevant columns (see the sketch below).
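A minimal sketch of that flow, assuming both dataframes have the same number of rows and partitions so that monotonically_increasing_id can serve as a join key (the l_/r_ prefixes and the names joined, compare, and result are just illustrative):
import org.apache.spark.sql.functions.{array, col, monotonically_increasing_id, udf}

// Rename the columns so they don't clash, and give both sides a join key.
val left  = df.toDF(df.columns.map("l_" + _): _*).withColumn("id", monotonically_increasing_id())
val right = newdf.toDF(newdf.columns.map("r_" + _): _*).withColumn("id", monotonically_increasing_id())
val joined = left.join(right, "id")

// Compare the two arrays element-wise: "U" when equal, "N" otherwise.
val compare = udf((a: Seq[String], b: Seq[String]) =>
  a.zip(b).map { case (x, y) => if (x == y) "U" else "N" })

val result = joined.withColumn(
  "cmp",
  compare(
    array(df.columns.map(c => col("l_" + c)): _*),
    array(newdf.columns.map(c => col("r_" + c)): _*)))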

Sql DataFrame - Operation

I am stuck with a situation where I need to perform division on the output of two SQL DataFrames. Any suggestions on how it can be done?
scala> val TotalDie = sqlc.sql("select COUNT(DISTINCT XY) from Data")
TotalDie: org.apache.spark.sql.DataFrame = [_c0: bigint]
scala> TotalDie.show()
+---+
|_c0|
+---+
|887|
+---+
scala> val PassDie = sqlc.sql("select COUNT(DISTINCT XY) from Data where Sbin = '1'")
PassDie: org.apache.spark.sql.DataFrame = [_c0: bigint]
scala> PassDie.show()
+---+
|_c0|
+---+
|413|
+---+
I need to calculate the yield, which refers to (PassDie / TotalDie) * 100.
I am new to spark-shell.
In case of multiple values (i.e. multiple rows): do you have a column (or key or id) to join the two dataframes (or tables) on?
In case of always a single value (i.e. a single row): something along the lines of 100 * PassDie.collect() / TotalDie.collect().
UPDATE
The exact syntax in case of 1 value:
100.0 * PassDie.collect()(0).getLong(0) / TotalDie.collect()(0).getLong(0)
res25: Double = 46.56144306651635
It is possible to do this with just SparkSQL, too.
Here's what I'd do to solve it that way:
>>> rdd1 = sc.parallelize([("a",1.12),("a",2.22)])
>>> rdd2 = sc.parallelize([("b",9.12),("b",12.22)])
>>> r1df = rdd1.toDF()
>>> r2df = rdd2.toDF()
>>> r1df.registerTempTable('r1')
>>> r2df.registerTempTable('r2')
>>> r3df = sqlContext.sql("SELECT * FROM r1 UNION SELECT * FROM r2")
>>> r3df.show()
>>> r3df.registerTempTable('r3')
>>> sqlContext.sql("SELECT * FROM r3") -------> do your aggregation / math here.
Now from here, in theory, you can do basic grouping and arithmetic using just SQL queries, since you've got this grand table of data. I realize that in my example code I didn't really declare a good schema with column names, which keeps this example from fully working as-is, but you have a schema, so you get the idea.
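If you'd rather avoid the two collect calls entirely, a hedged alternative is to compute both distinct counts and the division in a single query against the same Data table from the question:
// Single query: the CASE expression restricts the numerator to Sbin = '1',
// and COUNT(DISTINCT ...) ignores the NULLs it produces for other rows.
val yieldDf = sqlc.sql("""
  SELECT 100.0 * COUNT(DISTINCT CASE WHEN Sbin = '1' THEN XY END)
               / COUNT(DISTINCT XY) AS yield
  FROM Data
""")
yieldDf.show()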

Concatenate columns in Apache Spark DataFrame

How do we concatenate two columns in an Apache Spark DataFrame?
Is there any function in Spark SQL which we can use?
With raw SQL you can use CONCAT:
In Python
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")
In Scala
import sqlContext.implicits._
val df = sc.parallelize(Seq(("foo", 1), ("bar", 2))).toDF("k", "v")
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")
Since Spark 1.5.0 you can use the concat function with the DataFrame API:
In Python :
from pyspark.sql.functions import concat, col, lit
df.select(concat(col("k"), lit(" "), col("v")))
In Scala :
import org.apache.spark.sql.functions.{concat, lit}
df.select(concat($"k", lit(" "), $"v"))
There is also concat_ws function which takes a string separator as the first argument.
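For example, a quick Scala sketch using the same df as above:
import org.apache.spark.sql.functions.concat_ws

// concat_ws(separator, cols...) also skips null columns instead of
// turning the whole result into null, unlike plain concat.
df.select(concat_ws(" ", $"k", $"v"))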
Here's how you can do custom naming
import pyspark
from pyspark.sql import functions as sf
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
df = sqlc.createDataFrame([('row11','row12'), ('row21','row22')], ['colname1', 'colname2'])
df.show()
gives,
+--------+--------+
|colname1|colname2|
+--------+--------+
|   row11|   row12|
|   row21|   row22|
+--------+--------+
create new column by concatenating:
df = df.withColumn('joined_column',
sf.concat(sf.col('colname1'),sf.lit('_'), sf.col('colname2')))
df.show()
+--------+--------+-------------+
|colname1|colname2|joined_column|
+--------+--------+-------------+
|   row11|   row12|  row11_row12|
|   row21|   row22|  row21_row22|
+--------+--------+-------------+
One option for concatenating string columns in Spark Scala is using concat.
It is necessary to check for null values, because if one of the columns is null, the result will be null even if the other columns do have information.
Using concat and withColumn:
import org.apache.spark.sql.functions.{col, concat, lit, when}

val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))
Using concat and select:
val newDf = df.selectExpr("concat(nvl(COL1, ''), nvl(COL2, '')) as NEW_COLUMN")
With both approaches you will have a NEW_COLUMN whose value is a concatenation of the columns COL1 and COL2 from your original df.
concat(*cols)
v1.5 and higher
Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns.
Eg: new_df = df.select(concat(df.a, df.b, df.c))
concat_ws(sep, *cols)
v1.5 and higher
Similar to concat but uses the specified separator.
Eg: new_df = df.select(concat_ws('-', df.col1, df.col2))
map_concat(*cols)
v2.4 and higher
Used to concat maps, returns the union of all the given maps.
Eg: new_df = df.select(map_concat("map1", "map2"))
Using concat operator (||):
v2.3 and higher
Eg: df = spark.sql("select col_a || col_b || col_c as abc from table_x")
Reference: Spark sql doc
If you want to do it using DF, you could use a udf to add a new column based on existing columns.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
case class MyDf(col1: String, col2: String)
//here is our dataframe
val df = sqlContext.createDataFrame(sc.parallelize(
Array(MyDf("A", "B"), MyDf("C", "D"), MyDf("E", "F"))
))
//Define a udf to concatenate two passed in string values
val getConcatenated = udf( (first: String, second: String) => { first + " " + second } )
//use withColumn method to add a new column called newColName
df.withColumn("newColName", getConcatenated($"col1", $"col2")).select("newColName", "col1", "col2").show()
From Spark 2.3 (SPARK-22771), Spark SQL supports the concatenation operator ||.
For example:
val df = spark.sql("select _c1 || _c2 as concat_column from <table_name>")
Here is another way of doing this for pyspark:
#import concat and lit functions from pyspark.sql.functions
from pyspark.sql.functions import concat, lit
#Create your data frame
countryDF = sqlContext.createDataFrame([('Ethiopia',), ('Kenya',), ('Uganda',), ('Rwanda',)], ['East Africa'])
#Use select, concat, and lit functions to do the concatenation
personDF = countryDF.select(concat(countryDF['East Africa'], lit('n')).alias('East African'))
#Show the new data frame
personDF.show()
----------RESULT-------------------------
+------------+
|East African|
+------------+
|   Ethiopian|
|      Kenyan|
|     Ugandan|
|     Rwandan|
+------------+
Here is a suggestion for when you don't know the number or name of the columns in the Dataframe.
val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
Do we have Java syntax corresponding to the process below?
val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
In Spark 2.3.0, you may do:
spark.sql( """ select '1' || column_a from table_a """)
In Java you can do this to concatenate multiple columns. The sample code provides a scenario and shows how to use it for better understanding.
import static org.apache.spark.sql.functions.*;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());

Dataset<Row> reducedInventory = spark.sql("select * from table_name")
        .withColumn("concatenatedCol",
                concat(col("col1"), lit("_"), col("col2"), lit("_"), col("col3")));

class JavaSparkSessionSingleton {
    private static transient SparkSession instance = null;

    public static SparkSession getInstance(SparkConf sparkConf) {
        if (instance == null) {
            instance = SparkSession.builder().config(sparkConf).getOrCreate();
        }
        return instance;
    }
}
The above code concatenates col1, col2, and col3 separated by "_" to create a column named "concatenatedCol".
In my case, I wanted a pipe ('|') delimited row.
from pyspark.sql import functions as F
df.select(F.concat_ws('|','_c1','_c2','_c3','_c4')).show()
This worked well, like a hot knife through butter.
Use the concat method like this:
Dataset<Row> DF2 = DF1.withColumn("NEW_COLUMN",
    concat(col("ADDR1"), col("ADDR2"), col("ADDR3")));
Another way to do it in pySpark using sqlContext...
#import the concat function
from pyspark.sql.functions import concat

#Suppose we have a dataframe:
df = sqlContext.createDataFrame([('row1_1','row1_2')], ['colname1', 'colname2'])
# Now we can concatenate columns and assign the new column a name
df = df.select(concat(df.colname1, df.colname2).alias('joined_colname'))
Indeed, there are some beautiful inbuilt abstractions for accomplishing your concatenation without needing to implement a custom function. Since you mentioned Spark SQL, I am guessing you are trying to pass it as a declarative command through spark.sql(). If so, you can do it in a straightforward manner with a SQL command like:
SELECT CONCAT(col1, '<delimiter>', col2, ...) AS concat_column_name FROM <table_name>;
Also, from Spark 2.3.0, you can use a command along the lines of:
SELECT col1 || col2 AS concat_column_name FROM <table_name>;
Here, <delimiter> is your preferred delimiter (it can be an empty space as well) and <table_name> is the temporary or permanent table/view you are trying to read from.
We can simply use selectExpr as well.
df1.selectExpr("*","upper(_2||_3) as new")
We can use concat() in the select method of a dataframe:
val fullName = nameDF.select(concat(col("FirstName"), lit(" "), col("LastName")).as("FullName"))
Using withColumn and concat
val fullName1 = nameDF.withColumn("FullName", concat(col("FirstName"), lit(" "), col("LastName")))
Using spark.sql concat function
val fullNameSql = spark.sql("select Concat(FirstName, LastName) as FullName from names")
Taken from https://www.sparkcodehub.com/spark-dataframe-concat-column
val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))
Note: in the Scala API, isNotNull is a parameterless method, so the code above compiles as written. It is the PySpark and Java equivalents that need the parentheses, i.e. isNotNull().