How to check if a column only contains certain letters - apache-spark-sql

I have a dataframe and I want to check whether one column contains only the letter A, for example.
The column contains long strings of letters, such as:
AAAAAAAAAAAAAAAA
AAABBBBBDBBSBSBB
I want to check if this column contains only the letter A, or only the letters A and B, but nothing else.
Do you know which function I should use?

Try this: I have considered four sample strings of letters. We can use the rlike function in Spark with the regex [^AB]. It returns true for values that contain any letter other than A or B, and false for values made up of only A, only B, or both. Filtering on false gives your answer.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName('SO') \
    .getOrCreate()

li = [["AAAAAAAAAAAAAAAAAAABBBBBDBBSBSBB"], ["AAAAAAAAA"], ["BBBBBBBB"], ["AAAAAABBBBBBBB"]]
df = spark.createDataFrame(li, ["letter"])
df.show(truncate=False)
#
# +--------------------------------+
# |letter |
# +--------------------------------+
# |AAAAAAAAAAAAAAAAAAABBBBBDBBSBSBB|
# |AAAAAAAAA |
# |BBBBBBBB |
# |AAAAAABBBBBBBB |
# +--------------------------------+
df1 = df.withColumn("contains_A_or_B", F.col('letter').rlike("[^AB]"))  # true if any character other than A or B is present
df1.show(truncate=False)
# +--------------------------------+---------------+
# |letter |contains_A_or_B|
# +--------------------------------+---------------+
# |AAAAAAAAAAAAAAAAAAABBBBBDBBSBSBB|true |
# |AAAAAAAAA |false |
# |BBBBBBBB |false |
# |AAAAAABBBBBBBB |false |
# +--------------------------------+---------------+
df1.filter(F.col('contains_A_or_B')==False).select("letter").show()
# +--------------+
# | letter|
# +--------------+
# | AAAAAAAAA|
# | BBBBBBBB|
# |AAAAAABBBBBBBB|
# +--------------+
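As a small sketch (not part of the answer above), the same check can be expressed as a positive full-string match, so that true directly means the value contains only A and/or B:
from pyspark.sql import functions as F
# ^[AB]+$ matches strings consisting solely of A and/or B characters
df.filter(F.col("letter").rlike("^[AB]+$")).show(truncate=False)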

Use rlike.
Example from the official documentation:
df.filter(df.name.rlike('ice$')).collect()
[Row(age=2, name='Alice')]
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=regex#pyspark.sql.Column.rlike

Related

Checking how many dictionaries a pyspark dataframe has and collecting values from them

Basically, I have a dataframe that looks exactly like this:
| id | values |
| --- | ------ |
| 01 | [{"final_price":10.0,"currency":"USD"},{"final_price":18.0,"currency":"CAD"}] |
| 02 | [{"final_price":44.15,"currency":"USD"},{"final_price":60.0,"currency":"CAD"}] |
| 03 | [{"final_price":99.99,"currency":"USD"},{"final_price":115.0,"currency":"CAD"}] |
| 04 | [{"final_price":25.0,"currency":"USD"},{"final_price":32.0,"currency":"CAD"}] |
The same product id has the price in US dollars and in Canadian dollars. However, I need to check how many dicts this column has, because some products only have the price in USD and others only in CAD. How can I check how many currencies there are and create new columns for each one of them?
Thanks!
Convert the JSON strings into an array of structs using from_json. The number of dicts (currencies) will correspond to the size of the resulting array. And to select them as new columns, you can pivot like this:
from pyspark.sql import functions as F

df = spark.createDataFrame([
    ("01", "[{'final_price':10.0,'currency':'USD'},{'final_price':18.0,'currency':'CAD'}]"),
    ("02", "[{'final_price':44.15,'currency':'USD'},{'final_price':60.0,'currency':'CAD'}]"),
    ("03", "[{'final_price':99.99,'currency':'USD'},{'final_price':115.0,'currency':'CAD'}]"),
    ("04", "[{'final_price':25.0,'currency':'USD'},{'final_price':32.0,'currency':'CAD'}]")
], ["id", "values"])

df.selectExpr(
    "id",
    "inline(from_json(values, 'array<struct<final_price:float,currency:string>>'))"
).groupby("id").pivot("currency").agg(
    F.first("final_price")
).show()
# +---+-----+-----+
# | id| CAD| USD|
# +---+-----+-----+
# | 01| 18.0| 10.0|
# | 03|115.0|99.99|
# | 02| 60.0|44.15|
# | 04| 32.0| 25.0|
# +---+-----+-----+
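Since the question also asks how many dicts each row holds, a minimal sketch of that counting step (reusing the same toy df and schema string as above) could be:
from pyspark.sql import functions as F
# parse the JSON string and count the structs per row
schema = "array<struct<final_price:float,currency:string>>"
df.withColumn("n_currencies", F.size(F.from_json("values", schema))).show(truncate=False)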

Insert data into a single column but in dictionary format after concatenating few column of data

I want to concatenate a number of columns into a single column, but stored in dictionary format, in PySpark.
I have concatenated the data into a single column but I am unable to store it in dictionary format.
Please find the attached screenshot below for more details.
Let me know if you need more information.
In your current situation, you can use str_to_map
from pyspark.sql import functions as F

df = spark.createDataFrame([("datatype:0,length:1",)], ['region_validation_check_status'])
df = df.withColumn(
    'region_validation_check_status',
    F.expr("str_to_map(region_validation_check_status, ',')")
)
df.show(truncate=0)
# +------------------------------+
# |region_validation_check_status|
# +------------------------------+
# |{datatype -> 0, length -> 1} |
# +------------------------------+
If you didn't have a string yet, you could do it from column values with to_json and from_json
from pyspark.sql import functions as F
df = spark.createDataFrame([(1, 2), (3, 4)], ['a', 'b'])
df.show()
# +---+---+
# | a| b|
# +---+---+
# | 1| 2|
# | 3| 4|
# +---+---+
df = df.select(
    F.from_json(F.to_json(F.struct('a', 'b')), 'map<string, int>')
)
df.show()
# +----------------+
# | entries|
# +----------------+
# |{a -> 1, b -> 2}|
# |{a -> 3, b -> 4}|
# +----------------+
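If the keys are just the column names, another option (a sketch using the same toy df, not part of the original answer) is create_map, which avoids the JSON round trip:
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, 2), (3, 4)], ['a', 'b'])

# build a map<string,int> column directly from literal keys and column values
df.select(
    F.create_map(F.lit('a'), F.col('a'), F.lit('b'), F.col('b')).alias('entries')
).show()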

Create new column with fuzzy-score across two string columns in the same dataframe

I'm trying to calculate a fuzzy score (preferable partial_ratio score) across two columns in the same dataframe.
| column1 | column2 |
| ----------- | ----------- |
| emmett holt | holt |
| greenwald | christopher |
It would need to look something like this:
| column1 | column2 | partial_ratio |
| ----------- | ----------- | ------------- |
| emmett holt | holt | 100 |
| greenwald | christopher | 22 |
| schaefer | schaefer | 100 |
With the help of another question on this website, I worked towards the following code:
compare = pd.MultiIndex.from_product([dataframe['column1'], dataframe['column2']]).to_series()

def metrics(tup):
    return pd.Series([fuzz.partial_ratio(*tup)], ['partial_ratio'])

df['partial_ratio'] = df.apply(lambda x: fuzz.partial_ratio(x['original_title'], x['title']), axis=1)
But the problem already starts with the first line of the code that returns the following error notification:
Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
You can say I'm kind of stuck here so any advice on this is appreciated!
You need a UDF to use fuzzywuzzy:
from fuzzywuzzy import fuzz
import pyspark.sql.functions as F

@F.udf('int')
def fuzzyudf(original_title, title):
    # plain Python UDF wrapping fuzz.partial_ratio, returning an int score
    return fuzz.partial_ratio(original_title, title)

df2 = df.withColumn('partial_ratio', fuzzyudf('column1', 'column2'))
df2.show()
+-----------+-----------+-------------+
| column1| column2|partial_ratio|
+-----------+-----------+-------------+
|emmett holt| holt| 100|
| greenwald|christopher| 22|
+-----------+-----------+-------------+
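For larger data, a vectorized pandas UDF can reduce per-row overhead. This is only a sketch, assuming Spark 3.0+ (for the type-hinted pandas_udf style), pyarrow installed, and the same column1/column2 columns:
import pandas as pd
from fuzzywuzzy import fuzz
from pyspark.sql.functions import pandas_udf

@pandas_udf('int')
def fuzzy_score(col1: pd.Series, col2: pd.Series) -> pd.Series:
    # score each pair of values in the batch
    return pd.Series([fuzz.partial_ratio(a, b) for a, b in zip(col1, col2)])

df2 = df.withColumn('partial_ratio', fuzzy_score('column1', 'column2'))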

Pyspark dataframe - Illegal values appearing in the column?

So I have a table (sample)
I'm using pyspark dataframe APIs to filter out the 'NOC's that have never won a gold medal, and here's the code I wrote.
First part of my code:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
spark = SQLContext(sc)
df1 = spark.read.format("csv").options(header = 'true').load("D:\\datasets\\athlete_events.csv")
df = df1.na.replace('NA', '-')
countgdf = gdf.groupBy('NOC').agg(count('Medal').alias('No of Gold medals')).select('NOC').show()
It will generate the output
+---+
|NOC|
+---+
|POL|
|JAM|
|BRA|
|ARM|
|MOZ|
|JOR|
|CUB|
|FRA|
|ALG|
|BRN|
+---+
only showing top 10 rows
The next part of the code is something like
allgdf = df.select('NOC').distinct()
This displays the output:
+-----------+
| NOC|
+-----------+
| DeRuyter|
| POL|
| Russia|
| JAM|
| BUR|
| BRA|
| ARM|
| MOZ|
| CUB|
| JOR|
| Sweden|
| FRA|
| ALG|
| SOM|
| IVB|
|Philippines|
| BRN|
| MAL|
| COD|
| FSM|
+-----------+
Notice the values that are more than 3 characters long? Those are supposed to be values of the 'Team' column, but I'm not sure why they are being displayed in the 'NOC' column. It's hard to figure out why this is happening, i.e. why illegal values appear in the column.
When I write the final code
final = allgdf.subtract(countgdf).show()
The same illegal values appear in the final dataframe column.
Any help would be appreciated. Thanks.
You should specify a delimiter for your CSV file. By default Spark uses a comma (,) as the separator.
This can be done, for example, with:
.option("delimiter",";")

How to use LinearRegression across groups in DataFrame?

Let us say my spark DataFrame (DF) looks like
id | age | earnings| health
----------------------------
1 | 34 | 65 | 8
2 | 65 | 12 | 4
2 | 20 | 7 | 10
1 | 40 | 75 | 7
. | .. | .. | ..
and I would like to group the DF, apply a function (say linear regression, which depends on multiple columns of the aggregated DF, two columns in this case) to each aggregated DF and get output like
id | intercept| slope
----------------------
1 | ? | ?
2 | ? | ?
from sklearn.linear_model import LinearRegression

lr_object = LinearRegression()

def linear_regression(ith_DF):
    # Note: for me it is necessary that ith_DF should contain all
    # data within this function scope, so that I can apply any
    # function that needs all data in ith_DF
    X = [i.earnings for i in ith_DF.select("earnings").rdd.collect()]
    y = [i.health for i in ith_DF.select("health").rdd.collect()]
    lr_object.fit(X, y)
    return lr_object.intercept_, lr_object.coef_[0]

coefficient_collector = []

# following iteration is not possible in spark as 'GroupedData'
# object is not iterable, please consider it as pseudo code
for ith_df in df.groupby("id"):
    c, m = linear_regression(ith_df)
    coefficient_collector.append((float(c), float(m)))

model_df = spark.createDataFrame(coefficient_collector, ["intercept", "slope"])
model_df.show()
I think this can be done since Spark 2.3 using pandas_udf. In fact, there is an example of fitting grouped regressions in the announcement post for pandas UDFs:
Introducing Pandas UDF for Python
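A minimal sketch of that grouped approach, assuming Spark 3.0+ (for applyInPandas), scikit-learn available on the workers, and the id/earnings/health columns from the question:
import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf contains all rows of a single id as a pandas DataFrame
    lr = LinearRegression().fit(pdf[["earnings"]], pdf["health"])
    return pd.DataFrame(
        [[pdf["id"].iloc[0], float(lr.intercept_), float(lr.coef_[0])]],
        columns=["id", "intercept", "slope"],
    )

model_df = df.groupby("id").applyInPandas(
    fit_group, schema="id long, intercept double, slope double"
)
model_df.show()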
What I'd do is filter the main DataFrame to create smaller DataFrames and do the processing, say a linear regression.
You can then execute the linear regressions in parallel (on separate threads using the same SparkSession, which is thread-safe), with the main DataFrame cached.
That should give you the full power of Spark.
p.s. My limited understanding of that part of Spark makes me think that a very similar approach is used for grid-search-based model selection in Spark MLlib and also in TensorFrames, which is an "Experimental TensorFlow binding for Scala and Apache Spark".