Alteryx regex_countmatches equivalent in PySpark? - dataframe

I am working on migrating an Alteryx workflow to PySpark, and as part of this task I came across the following filter condition.
length([acc_id]) = 9
AND
(REGEX_CountMatches(right([acc_id],7),"[[:alpha:]]")=0 AND
REGEX_CountMatches(left([acc_id],2),"[[:alpha:]]")=2)
OR
(REGEX_CountMatches(right([acc_id],7),"[[:alpha:]]")=0 AND
REGEX_CountMatches(left([acc_id],1),"[[:alpha:]]")=1 AND
REGEX_CountMatches(right(left([acc_id],2),1), '9')=1
)
Can someone help me rewrite this condition for a PySpark DataFrame?

You can use length with regexp_replace to get the equivalent of Alteryx's REGEX_CountMatches function:
REGEX_CountMatches(right([acc_id],7),"[[:alpha:]]")=0
Becomes:
# replace all non-alphabetic characters with '' then get the length
F.length(F.regexp_replace(F.expr("right(acc_id, 7)"), '[^A-Za-z]', '')) == 0
The right and left functions are only available in Spark SQL, so you can call them through expr.
Full example:
from pyspark.sql import Column
from pyspark.sql import functions as F

df = spark.createDataFrame([("AB1234567",), ("AD234XG1234TT5",)], ["acc_id"])

def regex_count_matches(c: Column, regex: str) -> Column:
    """
    Helper function equivalent to REGEX_CountMatches
    """
    return F.length(F.regexp_replace(c, regex, ''))
df.filter(
    (F.length("acc_id") == 9) &
    (
        (regex_count_matches(F.expr("right(acc_id, 7)"), '[^A-Za-z]') == 0)
        & (regex_count_matches(F.expr("left(acc_id, 2)"), '[^A-Za-z]') == 2)
    ) | (
        (regex_count_matches(F.expr("right(acc_id, 7)"), '[^A-Za-z]') == 0)
        & (regex_count_matches(F.expr("left(acc_id, 1)"), '[^A-Za-z]') == 1)
        & (regex_count_matches(F.expr("right(left(acc_id, 2), 1)"), '[^9]') == 1)
    )
).show()
#+---------+
#| acc_id|
#+---------+
#|AB1234567|
#+---------+
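If you are on Spark 3.4 or later, there is also a built-in regexp_count SQL function that maps one-to-one to REGEX_CountMatches; a minimal sketch against the same df:
# Spark 3.4+ only: regexp_count is a direct analogue of REGEX_CountMatches
df.filter(F.expr("regexp_count(right(acc_id, 7), '[A-Za-z]') = 0")).show()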

You can use size and split. You also need to use '[a-zA-Z]' for the regex, because POSIX classes like "[[:alpha:]]" are not supported in Spark.
For example,
REGEX_CountMatches(right([acc_id],7),"[[:alpha:]]")=0
should be equivalent to (in Spark SQL)
size(split(right(acc_id, 7), '[a-zA-Z]')) - 1 = 0
You can put the Spark SQL string directly into the filter clause for a Spark dataframe:
df2 = df.filter("size(split(right(acc_id, 7), '[a-zA-Z]')) - 1 = 0")

Related

compare after lower and trim df column in pyspark

I want to compare a dataframe column after trimming it and converting it to lower case in PySpark.
Is the below code wrong?
if f.trim(Loc_Country_df.LOC_NAME.lower) == f.trim(sdf.location_name.lower):
    print('y')
else:
    print('N')
No, you can't do it like this, because DataFrame columns are not plain variables; they are column expressions evaluated over collections of values.
The best way is to perform a join.
from pyspark.sql import functions as f

# trim and lower-case both columns before joining
Loc_Country_df = Loc_Country_df.withColumn("LOC_NAME", f.lower(f.trim(f.col("LOC_NAME"))))
sdf = sdf.withColumn("location_name", f.lower(f.trim(f.col("location_name"))))
join_df = Loc_Country_df.join(sdf, Loc_Country_df.LOC_NAME == sdf.location_name, "left")
# rows with no match in the left join end up with a null location_name
join_df.withColumn('Result', f.when(f.col('location_name').isNull(), "N").otherwise("Y")).show()
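For a quick check, a sketch with made-up rows (assuming a SparkSession named spark):
Loc_Country_df = spark.createDataFrame([(" India ",), ("France",)], ["LOC_NAME"])
sdf = spark.createDataFrame([("india",), ("Spain",)], ["location_name"])
# running the trim/lower/join steps above on these rows gives:
# ' India ' -> matches 'india' -> Result Y
# 'France'  -> no match        -> Result N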

Is there an equivalent of 'REGEXP_SUBSTR' of SnowFlake in PySpark?

Is there an equivalent of Snowflake's REGEXP_SUBSTR in PySpark/spark-sql?
REGEXP_EXTRACT exists, but that doesn't support as many parameters as are supported by REGEXP_SUBSTR.
Here is a link to REGEXP_SUBSTR.
Here is a link to REGEXP_EXTRACT.
More specifically, I'm looking for alternatives for position, occurrence and regex parameters which are supported by Snowflake's REGEXP_SUBSTR.
position: Number of characters from the beginning of the string where the function starts searching for matches.
occurrence: Specifies which occurrence of the pattern to match. The function skips the first (occurrence - 1) matches.
regex_parameters: I'm looking specifically for the parameter 'e', which does the following:
extract sub-matches.
So the query is something like:
REGEXP_SUBSTR(string, pattern, 1, 2, 'e', 2).
Sample Input: It was the best of times, it was the worst in times.
Expected output: worst
Assuming string1 = It was the best of times, it was the worst in times.
Equivalent SF query:
SELECT regexp_substr(string1, 'the(\\W+)(\\w+)', 1, 2, 'e', 2)
One of the best things about Spark is that you don't have to rely on a vendor to create a library of functions for you. You can create a user-defined function in Python and use it in a Spark SQL statement. E.g., starting with:
import re

from pyspark.sql.functions import lit, udf
from pyspark.sql.types import StringType

def regexp_substr(subject: str, pattern: str, position: int, occurrence: int, group: int) -> str:
    # Snowflake positions are 1-based
    s = subject[position - 1:]
    matches = list(re.finditer(pattern, s))
    if len(matches) >= occurrence:
        return matches[occurrence - 1].group(group)
    return None

# bench testing the python function
string1 = 'It was the best of times, it was the worst in times.'
pattern = r'the(\W+)(\w+)'
rv = regexp_substr(string1, pattern, 1, 2, 2)
print(rv)  # worst

# register for use in python
regexp_substr_udf = udf(regexp_substr, StringType())
# register for use in Spark SQL
spark.udf.register("REGEXP_SUBSTR", regexp_substr, StringType())

# create a spark DataFrame
df = spark.range(100).withColumn("s", lit(string1))
df.createOrReplaceTempView("df")
Then you can run Spark SQL queries like:
%%sql
select *, REGEXP_SUBSTR(s, 'the(\\W+)(\\w+)', 1, 2, 2) as ex from df
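If you are on Spark 3.1 or later, a built-in alternative sketch covers the occurrence and group parameters without a UDF (position would still need a substring): regexp_extract_all returns every match of a capture group as an array, and element_at picks the occurrence (both are 1-based).
spark.sql(
    "select *, element_at(regexp_extract_all(s, 'the(\\\\W+)(\\\\w+)', 2), 2) as ex from df"
).show(truncate=False)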

Spark dataframe filter issue

Coming from a SQL background here... I'm using df1 = spark.read.jdbc to load data from Azure SQL into a dataframe. I am trying to filter the data to exclude rows meeting the following criteria:
df2 = df1.filter("ItemID <> '75' AND Code1 <> 'SL'")
The dataframe ends up being empty, but when I run the equivalent SQL it is correct. When I change it to
df2 = df1.filter("ItemID = '75' AND Code1 = 'SL'")
it produces the rows I want to filter out.
What is the best way to remove the rows meeting the criteria, so they can be pushed to a SQL server?
Thank you
In the SQL world, <> checks whether the values of two operands are equal or not; if the values are not equal, the condition becomes true.
The equivalent of it in Spark SQL is !=. Thus your SQL condition inside filter becomes:
# A != B -> TRUE if expression A is not equivalent to expression B; otherwise FALSE
df2 = df1.filter("ItemID != '75' AND Code1 != 'SL'")
= has the same meaning in Spark SQL as in ANSI SQL:
df2 = df1.filter("ItemID = '75' AND Code1 = 'SL'")
Use the & operator with != in PySpark.
(<> was removed in Python 3.)
Example:
from pyspark.sql.functions import col

df = spark.createDataFrame([(75, 'SL'), (90, 'SL1')], ['ItemID', 'Code1'])
df.filter((col("ItemID") != '75') & (col("Code1") != 'SL')).show()
#or using negation
df.filter(~(col("ItemID") == '75') & ~(col("Code1") == 'SL')).show()
#+------+-----+
#|ItemID|Code1|
#+------+-----+
#| 90| SL1|
#+------+-----+
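Note that chaining != with & excludes a row as soon as either column matches. If the intent is to drop only the rows where both conditions hold at once, a negated conjunction is the sketch to use:
from pyspark.sql.functions import col

# drop only rows where ItemID is '75' AND Code1 is 'SL' at the same time
df2 = df1.filter(~((col("ItemID") == '75') & (col("Code1") == 'SL')))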

Is there a way in pyspark to count unique values

I have a spark dataframe (12m x 132) and I am trying to calculate the number of unique values by column, and remove columns that have only 1 unique value.
So far, I have used the pandas nunique function as such:
import pandas as pd
df = sql_dw.read_table(<table>)
df_p = df.toPandas()
nun = df_p.nunique(axis=0)
nundf = pd.DataFrame({'atr':nun.index, 'countU':nun.values})
dropped = []
for i, j in nundf.values:
    if j == 1:
        dropped.append(i)
        df = df.drop(i)
print(dropped)
Is there a way to do this that is more native to spark - i.e. not using pandas?
Please have a look at the commented example below. The solution requires more Python than PySpark-specific knowledge.
import pyspark.sql.functions as F
#creating a dataframe
columns = ['asin', 'ctx', 'fo']
l = [('ASIN1','CTX1','FO1')
    ,('ASIN1','CTX1','FO1')
    ,('ASIN1','CTX1','FO2')
    ,('ASIN1','CTX2','FO1')
    ,('ASIN1','CTX2','FO2')
    ,('ASIN1','CTX2','FO2')
    ,('ASIN1','CTX2','FO3')
    ,('ASIN1','CTX3','FO1')
    ,('ASIN1','CTX3','FO3')]
df=spark.createDataFrame(l, columns)
df.show()
#we create a list of functions we want to apply
#in this case countDistinct for each column
expr = [F.countDistinct(c).alias(c) for c in df.columns]
#we apply those functions
countdf = df.select(*expr)
#this df has just one row
countdf.show()
#we extract the columns which have just one value
cols2drop = [k for k,v in countdf.collect()[0].asDict().items() if v == 1]
df.drop(*cols2drop).show()
Output:
+-----+----+---+
| asin| ctx| fo|
+-----+----+---+
|ASIN1|CTX1|FO1|
|ASIN1|CTX1|FO1|
|ASIN1|CTX1|FO2|
|ASIN1|CTX2|FO1|
|ASIN1|CTX2|FO2|
|ASIN1|CTX2|FO2|
|ASIN1|CTX2|FO3|
|ASIN1|CTX3|FO1|
|ASIN1|CTX3|FO3|
+-----+----+---+
+----+---+---+
|asin|ctx| fo|
+----+---+---+
| 1| 3| 3|
+----+---+---+
+----+---+
| ctx| fo|
+----+---+
|CTX1|FO1|
|CTX1|FO1|
|CTX1|FO2|
|CTX2|FO1|
|CTX2|FO2|
|CTX2|FO2|
|CTX2|FO3|
|CTX3|FO1|
|CTX3|FO3|
+----+---+
My apologies, as I don't have the solution in PySpark but in Scala Spark, which may be transferable or useful in case you can't find a PySpark way.
You can create a blank list and then, using a foreach, check which columns have a distinct count of 1 and append them to the blank list.
From there you can use the list as a filter and drop those columns from your dataframe.
var list_of_columns: List[String] = List()
df_p.columns.foreach { c =>
  if (df_p.select(c).distinct.count == 1)
    list_of_columns ++= List(c)
}
val df_p_new = df_p.drop(list_of_columns: _*)
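A PySpark translation of the same idea might look like this; note it runs one distinct count (one job) per column:
# drop every column that holds a single distinct value
cols_to_drop = [c for c in df.columns if df.select(c).distinct().count() == 1]
df_new = df.drop(*cols_to_drop)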
You can aggregate with countDistinct to get the number of distinct values of a column:
from pyspark.sql.functions import countDistinct

counts = df.agg(countDistinct("column_name").alias("distinct_count"))
And then drop the column when its distinct_count is not greater than 1:
if counts.first()["distinct_count"] <= 1:
    df = df.drop("column_name")

Concatenate columns in Apache Spark DataFrame

How do we concatenate two columns in an Apache Spark DataFrame?
Is there any function in Spark SQL which we can use?
With raw SQL you can use CONCAT:
In Python
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")
In Scala
import sqlContext.implicits._
val df = sc.parallelize(Seq(("foo", 1), ("bar", 2))).toDF("k", "v")
df.registerTempTable("df")
sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")
Since Spark 1.5.0 you can use concat function with DataFrame API:
In Python:
from pyspark.sql.functions import concat, col, lit
df.select(concat(col("k"), lit(" "), col("v")))
In Scala:
import org.apache.spark.sql.functions.{concat, lit}
df.select(concat($"k", lit(" "), $"v"))
There is also concat_ws function which takes a string separator as the first argument.
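For example, reusing the df above, a quick sketch:
from pyspark.sql.functions import concat_ws, col

df.select(concat_ws(" ", col("k"), col("v")))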
Here's how you can do custom naming
import pyspark
from pyspark.sql import functions as sf
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
df = sqlc.createDataFrame([('row11','row12'), ('row21','row22')], ['colname1', 'colname2'])
df.show()
gives,
+--------+--------+
|colname1|colname2|
+--------+--------+
| row11| row12|
| row21| row22|
+--------+--------+
create new column by concatenating:
df = df.withColumn('joined_column',
                   sf.concat(sf.col('colname1'), sf.lit('_'), sf.col('colname2')))
df.show()
+--------+--------+-------------+
|colname1|colname2|joined_column|
+--------+--------+-------------+
| row11| row12| row11_row12|
| row21| row22| row21_row22|
+--------+--------+-------------+
One option to concatenate string columns in Spark Scala is using concat.
It is necessary to check for null values, because if one of the columns is null, the result will be null even if the other columns do have information.
Using concat and withColumn:
val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))
Using concat and select:
val newDf = df.selectExpr("concat(nvl(COL1, ''), nvl(COL2, '')) as NEW_COLUMN")
With both approaches you will have a NEW_COLUMN whose value is a concatenation of the columns COL1 and COL2 from your original df.
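Alternatively, concat_ws skips NULL inputs instead of propagating them, which can stand in for the null checks above; a PySpark sketch:
from pyspark.sql.functions import concat_ws, col

# concat_ws treats NULL columns as absent rather than nulling the whole result
new_df = df.withColumn("NEW_COLUMN", concat_ws("", col("COL1"), col("COL2")))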
concat(*cols)
v1.5 and higher
Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns.
Eg: new_df = df.select(concat(df.a, df.b, df.c))
concat_ws(sep, *cols)
v1.5 and higher
Similar to concat but uses the specified separator.
Eg: new_df = df.select(concat_ws('-', df.col1, df.col2))
map_concat(*cols)
v2.4 and higher
Used to concat maps, returns the union of all the given maps.
Eg: new_df = df.select(map_concat("map1", "map2"))
Using concat operator (||):
v2.3 and higher
Eg: df = spark.sql("select col_a || col_b || col_c as abc from table_x")
Reference: Spark sql doc
If you want to do it using DF, you could use a udf to add a new column based on existing columns.
val sqlContext = new SQLContext(sc)
case class MyDf(col1: String, col2: String)
//here is our dataframe
val df = sqlContext.createDataFrame(sc.parallelize(
  Array(MyDf("A", "B"), MyDf("C", "D"), MyDf("E", "F"))
))
//Define a udf to concatenate two passed in string values
val getConcatenated = udf( (first: String, second: String) => { first + " " + second } )
//use withColumn method to add a new column called newColName
df.withColumn("newColName", getConcatenated($"col1", $"col2")).select("newColName", "col1", "col2").show()
From Spark 2.3 (SPARK-22771), Spark SQL supports the concatenation operator ||.
For example;
val df = spark.sql("select _c1 || _c2 as concat_column from <table_name>")
Here is another way of doing this for pyspark:
#import concat and lit functions from pyspark.sql.functions
from pyspark.sql.functions import concat, lit
#Create your data frame
countryDF = sqlContext.createDataFrame([('Ethiopia',), ('Kenya',), ('Uganda',), ('Rwanda',)], ['East Africa'])
#Use select, concat, and lit functions to do the concatenation
personDF = countryDF.select(concat(countryDF['East Africa'], lit('n')).alias('East African'))
#Show the new data frame
personDF.show()
----------RESULT-------------------------
+------------+
|East African|
+------------+
| Ethiopian|
| Kenyan|
| Ugandan|
| Rwandan|
+------------+
Here is a suggestion for when you don't know the number or name of the columns in the Dataframe.
val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
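A PySpark sketch of the same idea, with dfSource standing in for your DataFrame:
from pyspark.sql import functions as F

# concatenate every column, whatever its name, separated by commas
df_results = dfSource.select(F.concat_ws(",", *[F.col(c) for c in dfSource.columns]))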
Do we have Java syntax corresponding to the below process?
val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))
In Spark 2.3.0, you may do:
spark.sql( """ select '1' || column_a from table_a """)
In Java you can do this to concatenate multiple columns. The sample code gives you a scenario and shows how to use it for better understanding.
import static org.apache.spark.sql.functions.*;

SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
Dataset<Row> reducedInventory = spark.sql("select * from table_name")
        .withColumn("concatenatedCol",
                concat(col("col1"), lit("_"), col("col2"), lit("_"), col("col3")));

class JavaSparkSessionSingleton {
    private static transient SparkSession instance = null;

    public static SparkSession getInstance(SparkConf sparkConf) {
        if (instance == null) {
            instance = SparkSession.builder().config(sparkConf)
                    .getOrCreate();
        }
        return instance;
    }
}
The above code concatenates col1, col2 and col3 separated by "_" to create a column named "concatenatedCol".
In my case, I wanted a pipe ('|') delimited row.
from pyspark.sql import functions as F
df.select(F.concat_ws('|','_c1','_c2','_c3','_c4')).show()
This worked well, like a hot knife through butter.
Use the concat method like this:
Dataset<Row> DF2 = DF1
        .withColumn("NEW_COLUMN", concat(col("ADDR1"), col("ADDR2"), col("ADDR3")));
Another way to do it in pySpark using sqlContext...
#import the concat function
from pyspark.sql.functions import concat
#Suppose we have a dataframe:
df = sqlContext.createDataFrame([('row1_1','row1_2')], ['colname1', 'colname2'])
# Now we can concatenate columns and assign the new column a name
df = df.select(concat(df.colname1, df.colname2).alias('joined_colname'))
Indeed, there are some beautiful inbuilt abstractions for you to accomplish your concatenation without the need to implement a custom function. Since you mentioned Spark SQL, I am guessing you are trying to pass it as a declarative command through spark.sql(). If so, you can accomplish it in a straightforward manner, passing a SQL command like:
SELECT CONCAT(col1, '<delimiter>', col2, ...) AS concat_column_name FROM <table_name>;
Also, from Spark 2.3.0, you can use commands in lines with:
SELECT col1 || col2 AS concat_column_name FROM <table_name>;
Wherein <delimiter> is your preferred delimiter (it can be empty space as well) and <table_name> is the temporary or permanent table you are trying to read from.
We can simply use selectExpr as well.
df1.selectExpr("*","upper(_2||_3) as new")
We can use concat() in select method of dataframe
val fullName = nameDF.select(concat(col("FirstName"), lit(" "), col("LastName")).as("FullName"))
Using withColumn and concat
val fullName1 = nameDF.withColumn("FullName", concat(col("FirstName"), lit(" "), col("LastName")))
Using spark.sql concat function
val fullNameSql = spark.sql("select Concat(FirstName, LastName) as FullName from names")
Taken from https://www.sparkcodehub.com/spark-dataframe-concat-column
val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))
Note: for this code to work, you need to put the parentheses "()" after "isNotNull" -> the correct form is "isNotNull()".
val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull(), col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull(), col("COL2")).otherwise(lit("null"))))