Spark regex 'COIN' in column values -> rlike approach - dataframe

I would like to check whether the values in a column contain 'COIN' or similar keywords.
Is it possible to change my regex so that I don't have to list "CRYPTOCOIN|KUCOIN|COINBASE" individually? I'd like to have something like
"regex associated with COIN word|BTCBIT.NET"
Please find my code below:
val CRYPTO_CARD_INDICATOR: String = ("BTCBIT.NET|KUCOIN|COINBASE|CRYPTCOIN")
val CryptoCheckDataset = df.withColumn("is_crypto_indicator",when(upper(col("company_name")).rlike(CRYPTO_CARD_INDICATOR), 1).otherwise(0))

I think the following should work:
COIN|BTCBIT.NET
Full test in PySpark:
from pyspark.sql.functions import col, upper, when
CRYPTO_CARD_INDICATOR = "COIN|BTCBIT.NET"
df = spark.createDataFrame([('kucoin',), ('coinbase',), ('crypto',)], ['company_name'])
CryptoCheckDataset = df.withColumn("is_crypto_indicator", when(upper(col("company_name")).rlike(CRYPTO_CARD_INDICATOR), 1).otherwise(0))
CryptoCheckDataset.show()
# +------------+-------------------+
# |company_name|is_crypto_indicator|
# +------------+-------------------+
# |      kucoin|                  1|
# |    coinbase|                  1|
# |      crypto|                  0|
# +------------+-------------------+
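One caveat about the pattern above: the "." in BTCBIT.NET is a regex wildcard, and COIN matches anywhere inside a value. The following is a minimal sketch that uses Python's re module as a stand-in for the Java regex engine behind rlike (an assumption: the two agree on these basic constructs). The sample strings are made up for illustration.

```python
import re

loose = "COIN|BTCBIT.NET"      # pattern from the answer; "." matches any character
strict = r"COIN|BTCBIT\.NET"   # escaped dot matches only a literal "."

# Hypothetical sample values, already upper-cased as in the Spark code
samples = ["KUCOIN", "COINBASE", "BTCBIT.NET", "BTCBITXNET", "CRYPTO"]

# re.search looks for the pattern anywhere in the string, like rlike does
loose_hits = [s for s in samples if re.search(loose, s)]
strict_hits = [s for s in samples if re.search(strict, s)]

print(loose_hits)
print(strict_hits)
```

If the stricter behaviour is wanted in the Scala code, the dot would be escaped inside the string literal as "COIN|BTCBIT\\.NET".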

Related

optimize pyspark code to find a keyword and its count in a dataframe

We have a lot of files in our S3 bucket. My current PySpark code reads each file, takes each column from that file, looks for the keyword, and returns a dataframe with the count of the keyword per column and file.
Here is the PySpark code (we are using Databricks, if that helps):
import s3fs
fs = s3fs.S3FileSystem()
from pyspark.sql.functions import lower, col
keywords = ['%keyword1%','%keyword2%']
prefix = ''
deployment_id = ''
pull_id = ''
paths = fs.ls(prefix+'/'+deployment_id+'/'+pull_id)
result = []
errors = []
try:
    for path in paths:
        df = spark.read.parquet('s3://'+path)
        print(path)
        for keyword in keywords:
            for col in df.columns:
                filtered_df = df.filter(lower(df[col]).like(keyword))
                filtered_count = filtered_df.count()
                if filtered_count > 0 :
                    #print(col +' has '+ str(filtered_count) +' appearences')
                    result.append({'keyword': keyword, 'column': col, 'count': filtered_count,'table':path.split('/')[-1]})
except Exception as e:
    errors.append({'error_msg':e})
try:
    errors = spark.createDataFrame(errors)
except Exception as e:
    print('no errors')
try:
    result = spark.createDataFrame(result)
    result.display()
except Exception as e:
    print('problem with results. May be no results')
I am new to PySpark, Databricks and Spark. The code here runs very slowly. I know that because we have local Python code that is faster than this. We wanted to use PySpark and Databricks because we thought it would be faster; with the local code we need to enter AWS access keys every day, and sometimes a huge file causes a memory error.
NOTE - The above code reads data faster, but the search seems to be slower than in the local Python code.
Here is the Python code on our local system:
def search_df(self,keyword,df,regex=False):
    start=time.time()
    if regex:
        mask = df.applymap(lambda x: re.search(keyword,x) is not None if isinstance(x,str) else False).to_numpy()
    else:
        mask = df.applymap(lambda x: keyword.lower() in x.lower() if isinstance(x,str) else False).to_numpy()
I was hoping for code changes to the PySpark version that would make it faster.
Thanks.
I tried changing .like(keyword) to .contains(keyword) to see if that is faster, but it doesn't seem to work.
Check out the code below. It defines a function that uses a list comprehension to count, for each column in the df, the rows that match a keyword. The function is then called once per keyword. Each call returns a new df, and these are then unioned using the reduce function.
import pyspark.sql.functions as F
from pyspark.sql import DataFrame
from functools import reduce
sampleData = [["Hello s1","Y","Hi s1"],["What is your name?","What is s1","what is s2?"] ]
df = spark.createDataFrame(sampleData,["col1","col2","col3"])
df.show()
# Sample input dataframe
+------------------+----------+-----------+
|              col1|      col2|       col3|
+------------------+----------+-----------+
|          Hello s1|         Y|      Hi s1|
|What is your name?|What is s1|what is s2?|
+------------------+----------+-----------+
keywords=["s1","s2"]
def calc(k) -> DataFrame:
    return df.select([F.count(F.when(F.col(c).rlike(k),c)).alias(c) for c in df.columns] ).withColumn("keyword",F.lit(k))
lst=[calc(k) for k in keywords]
fDf=reduce(DataFrame.unionByName, [y for y in lst])
stExpr="stack(3,'col1',col1,'col2',col2,'col3',col3) as (ColName,Count)"
fDf.select("keyword",F.expr(stExpr)).show()
# Output
+-------+-------+-----+
|keyword|ColName|Count|
+-------+-------+-----+
|     s1|   col1|    1|
|     s1|   col2|    1|
|     s1|   col3|    1|
|     s2|   col1|    0|
|     s2|   col2|    0|
|     s2|   col3|    1|
+-------+-------+-----+
You can add a where clause at the end to keep only the rows with a count greater than 0 ==>
where("Count >0")
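The stack expression above is hard-coded for three columns. As a rough sketch, it could be built from the column list instead; build_stack_expr is a hypothetical helper name (not a Spark function), and the backticks around column names are an added precaution for names with unusual characters.

```python
def build_stack_expr(columns, key_alias="ColName", value_alias="Count"):
    """Build a Spark SQL stack(...) expression that unpivots the given columns."""
    # One "'name', `name`" pair per column: the literal label, then the column itself
    pairs = ", ".join(f"'{c}', `{c}`" for c in columns)
    return f"stack({len(columns)}, {pairs}) as ({key_alias}, {value_alias})"

stExpr = build_stack_expr(["col1", "col2", "col3"])
print(stExpr)
```

With the dataframes from the answer this would be used as fDf.select("keyword", F.expr(build_stack_expr(df.columns))), so the same code keeps working when the files have different columns.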

Extract key value from dataframe in PySpark

I have the below dataframe which I have read from a JSON file.
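+-----------------------------+-------------------------+--------------------------+----------------------------+
|1                            |2                        |3                         |4                           |
+-----------------------------+-------------------------+--------------------------+----------------------------+
|{"todo":["wakeup", "shower"]}|{"todo":["brush", "eat"]}|{"todo":["read", "write"]}|{"todo":["sleep", "snooze"]}|
+-----------------------------+-------------------------+--------------------------+----------------------------+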
I need my output to be as below Key and Value. How do I do this? Do I need to create a schema?
+---+--------------+
|ID |todo          |
+---+--------------+
|1  |wakeup, shower|
|2  |brush, eat    |
|3  |read, write   |
|4  |sleep, snooze |
+---+--------------+
The key-value which you refer to is a struct. "keys" are struct field names, while "values" are field values.
What you want to do is called unpivoting. One of the ways to do it in PySpark is using stack. The following is a dynamic approach, where you don't need to hard-code the existing column names.
Input dataframe:
df = spark.createDataFrame(
    [((['wakeup', 'shower'],),(['brush', 'eat'],),(['read', 'write'],),(['sleep', 'snooze'],))],
    '`1` struct<todo:array<string>>, `2` struct<todo:array<string>>, `3` struct<todo:array<string>>, `4` struct<todo:array<string>>')
Script:
to_melt = [f"\'{c}\', `{c}`.todo" for c in df.columns]
df = df.selectExpr(f"stack({len(to_melt)}, {','.join(to_melt)}) (ID, todo)")
df.show()
# +---+----------------+
# | ID|            todo|
# +---+----------------+
# |  1|[wakeup, shower]|
# |  2|    [brush, eat]|
# |  3|   [read, write]|
# |  4| [sleep, snooze]|
# +---+----------------+
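To see what the script above actually passes to selectExpr, the expression can be built in plain Python without a Spark session. In this small sketch the columns list is an assumption that stands in for df.columns of the input dataframe.

```python
# Stand-in for df.columns of the input dataframe above
columns = ["1", "2", "3", "4"]

# Same construction as in the script: a label literal plus the struct field, per column
to_melt = [f"'{c}', `{c}`.todo" for c in columns]
expr = f"stack({len(to_melt)}, {','.join(to_melt)}) (ID, todo)"

print(expr)
```

Each pair contributes one output row: the column name becomes ID and the struct's todo field becomes todo.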
Use from_json to convert each string to a map, take the map values, and explode so that each element gets its own row.
data
df = spark.createDataFrame(
    [(('{"todo":"[wakeup, shower]"}'),('{"todo":"[brush, eat]"}'),('{"todo":"[read, write]"}'),('{"todo":"[sleep, snooze]"}'))],
    ('value1','values2','value3','value4'))
code
from pyspark.sql.functions import array, explode, flatten, from_json, map_values, translate
new = (df
    # From string to map values, then one individual row per value
    .withColumn('todo', explode(flatten(array(*[map_values(from_json(x, "MAP<STRING,STRING>")) for x in df.columns]))))
    .withColumn('todo', translate('todo', "[]", ''))  # Remove the square brackets
)
new.show(truncate=False)
outcome
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|value1 |values2 |value3 |value4 |todo |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|wakeup, shower|
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|brush, eat |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|read, write |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|sleep, snooze |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+

Grouping alternative items with PySpark

The sample of the dataset I am working on:
# Creating the DataFrame
test = sqlContext.createDataFrame([(1,2),(2,1),
                                   (1,3),(2,3),
                                   (3,2),(3,1),
                                   (4,5),(5,4)],
                                  ['cod_item','alter_cod'])
And it looks like this after grouping the equivalent items in lists:
test.createOrReplaceTempView("teste")
teste = spark.sql("""select cod_item,
                            collect_list(alter_cod) as alternative_item
                     from teste
                     group by cod_item""")
In the first column, I have certain items and in the second column, I have items that are equivalent. I would like, for each list, to have only one item that represents it.
I would like the final dataframe to look like this:
or
Where the items on the right are the items representing their respective equivalent items.
After collect_list, keep only the rows where every alter_cod is bigger than cod_item (that is, filter out rows where any alter_cod is smaller). This method would work on strings too.
from pyspark.sql import functions as F
test = (test
    .groupBy('cod_item')
    .agg(F.collect_list('alter_cod').alias('alter_cod'))
    .filter(F.forall('alter_cod', lambda x: x > F.col('cod_item')))  # F.forall needs Spark 3.1+
)
test.show()
# +--------+---------+
# |cod_item|alter_cod|
# +--------+---------+
# |       1|   [2, 3]|
# |       4|      [5]|
# +--------+---------+
Or add one line to your SQL:
select cod_item,
       collect_list(alter_cod) as alternative_item
from teste
group by cod_item
having forall(alternative_item, x -> x > cod_item)
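The filter can be checked by hand on the sample data. Here is a plain-Python sketch of the same logic (no Spark involved; the variable names are illustrative): group the alternatives per item, then keep an item only when every alternative is larger than it, so the smallest code of each equivalence group becomes its representative.

```python
from collections import defaultdict

# Same (cod_item, alter_cod) pairs as in the question
pairs = [(1, 2), (2, 1), (1, 3), (2, 3), (3, 2), (3, 1), (4, 5), (5, 4)]

# Equivalent of groupBy + collect_list
groups = defaultdict(list)
for cod_item, alter_cod in pairs:
    groups[cod_item].append(alter_cod)

# Equivalent of forall(alter_cod, x -> x > cod_item)
representatives = {
    item: alts for item, alts in groups.items() if all(a > item for a in alts)
}

print(representatives)
```

Note that this relies on every item listing all of its equivalents, as in the sample; with incomplete pairs more than one item per group could pass the filter.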

Add single quotes to the dataFrame column values

My DataFrame has a column QUALIFY with values like the ones below.
QUALIFY
=================
ColA|ColB|ColC
ColA
ColZ|ColP
The values in this column are separated by "|". I want the values in this column to look like 'ColA','ColB','ColC' ...
With the code below I am able to replace | with ','. How can I add a single quote at the start and end of the value?
newDf = df_qualify.withColumn('QUALIFY2', regexp_replace('QUALIFY', "\\|", "\\','"))
Your solution is almost there - you just need to add a single quote to the start and end. You can achieve this using pyspark.sql.functions.concat:
from pyspark.sql.functions import col, concat, lit, regexp_replace
df.withColumn(
    "QUALIFY2",
    concat(lit("'"), regexp_replace(col('QUALIFY'), r"\|", r"','"), lit("'"))
).show()
#+--------------+--------------------+
#|       QUALIFY|            QUALIFY2|
#+--------------+--------------------+
#|ColA|ColB|ColC|'ColA','ColB','ColC'|
#|          ColA|              'ColA'|
#|     ColZ|ColP|       'ColZ','ColP'|
#+--------------+--------------------+
Alternatively, you can avoid regular expressions and achieve the same using split and concat_ws:
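from pyspark.sql.functions import concat, concat_ws, lit, split
df.withColumn(
    "QUALIFY2",
    # split takes a regex, so the pipe is escaped; the raw string avoids an invalid "\|" escape in Python
    concat(lit("'"), concat_ws("','", split("QUALIFY", r"\|")), lit("'"))
).show()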
#+--------------+--------------------+
#|       QUALIFY|            QUALIFY2|
#+--------------+--------------------+
#|ColA|ColB|ColC|'ColA','ColB','ColC'|
#|          ColA|              'ColA'|
#|     ColZ|ColP|       'ColZ','ColP'|
#+--------------+--------------------+
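For reference, the split-and-join idea can be sketched in plain Python, which makes the expected string easy to verify outside Spark. quote_items is a hypothetical helper name used only for this illustration.

```python
def quote_items(value, sep="|"):
    """Split on sep, then join the parts with ',' and wrap the result in single quotes."""
    return "'" + "','".join(value.split(sep)) + "'"

quoted = [quote_items(v) for v in ["ColA|ColB|ColC", "ColA", "ColZ|ColP"]]
for q in quoted:
    print(q)
```

A single value without a separator still gets its surrounding quotes, which matches the second row of the Spark output above.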
Split the column on | and then join the resulting array back into a string:
import pyspark.sql.functions as F
import pyspark.sql.types as T
def str_list(x):
    # str() of a list of strings already quotes each item, so only the brackets are removed
    return str(x).replace("[", "").replace("]", "")
str_udf = F.udf(str_list, T.StringType())
df = df.withColumn("arr_split", F.split(F.col("QUALIFY"), r"\|")) # escape character
df = df.withColumn("QUALIFY2", str_udf(F.col("arr_split")))
My sample output frame:
df.drop("arr_split").show() # Please ignore a and b columns
+---+---+--------------+--------------------+
|  a|  b|           abc|            QUALIFY2|
+---+---+--------------+--------------------+
|  1|  1|col1|col2|col3|'col1', 'col2', '...|
|  2|  2|col1|col2|col3|'col1', 'col2', '...|
|  3|  3|col1|col2|col3|'col1', 'col2', '...|
|  4|  4|col1|col2|col3|'col1', 'col2', '...|
|  5|  5|col1|col2|col3|'col1', 'col2', '...|
+---+---+--------------+--------------------+
The code below worked for me; I added the square brackets back to make it look like an array:
import pyspark.sql.functions as F
import pyspark.sql.types as T
def str_list(x):
    return str(x).replace("[", "").replace("]", "")
str_udf = F.udf(str_list, T.StringType())
df = df.withColumn(column_name,str_udf(F.col(column_name)))
df = df.withColumn(column_name, F.expr("concat('[', " + column_name +", ']')"))

Spark DataFrame equivalent to Pandas Dataframe `.iloc()` method?

Is there a way to reference Spark DataFrame columns by position using an integer?
Analogous Pandas DataFrame operation:
df.iloc[:, 0] # Give me all the rows at column position 0
The closest equivalent of pandas df.iloc is collect, which returns all rows to the driver as a list of Row objects that you can then index.
PySpark examples:
X = df.collect()[0]['age']
or
X = df.collect()[0][1] #row 0 col 1
Not really, but you can try something like this:
Python:
df = sc.parallelize([(1, "foo", 2.0)]).toDF()
df.select(*df.columns[:1]) # I assume [:1] is what you really want
## DataFrame[_1: bigint]
or
df.select(df.columns[1:3])
## DataFrame[_2: string, _3: double]
Scala
val df = sc.parallelize(Seq((1, "foo", 2.0))).toDF()
df.select(df.columns.slice(0, 1).map(col(_)): _*)
Note:
Spark SQL doesn't support row indexing, and is unlikely ever to support it, so it is not possible to index along the row dimension.
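Since df.columns is an ordinary Python list, positional column selection comes down to list indexing. A small sketch of a helper that turns an integer or slice into column names follows; columns_at is a hypothetical name, and the column list stands in for df.columns.

```python
def columns_at(columns, positions):
    """Return the column names at an integer position or slice, always as a list."""
    picked = columns[positions]
    # An integer index yields a single name; wrap it so callers always get a list
    return picked if isinstance(picked, list) else [picked]

columns = ["_1", "_2", "_3"]  # stand-in for df.columns
first = columns_at(columns, 0)
middle = columns_at(columns, slice(1, 3))
print(first, middle)
```

With a real dataframe this would be used as df.select(*columns_at(df.columns, slice(1, 3))).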
You can use like this in spark-shell.
scala> df.columns
Array[String] = Array(age, name)
scala> df.select(df.columns(0)).show()
+----+
| age|
+----+
|null|
|  30|
|  19|
+----+
As of Spark 3.1.1 on Databricks, it's a matter of selecting the column of interest, and applying limit:
%python
from pyspark.sql.functions import col
retDF = (inputDF
    .select(col(inputDF.columns[0]))
    .limit(100)
)