I have a DataFrame that contains 8M rows. There is one column named EMAIL that contains email addresses. I have to check:
Email value must be of the form _@_._
Email value can only contain alphanumeric characters along with -_@.
In fact, there is a Python library designed for it, validate_email.
You can use the below code snippet to validate the email id.
from validate_email import validate_email
from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf
valid_email = udf(lambda x: validate_email(x), BooleanType())
emailvalidation.withColumn('is_valid', valid_email('email')).show()
+--------------------+--------+
|               email|is_valid|
+--------------------+--------+
|   aswin.raja@gm.com|    true|
|                 abc|   false|
+--------------------+--------+
Another way is to use regular expressions. You can use the below code snippet.
import re

regex = r'^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$'

def check(email):
    if re.search(regex, email):
        print("Valid Email")
    else:
        print("Invalid Email")

if __name__ == '__main__':
    email = "aswin.raja@gm.com"
    check(email)
    email = "aswinraja.com"
    check(email)
Output:
Valid Email
Invalid Email
You can use the below code to validate the email IDs in your table.
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType
import re

def regex_search(string):
    regex = r'^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$'
    if re.search(regex, string, re.IGNORECASE):
        return True
    return False

validateEmail_udf = udf(regex_search, BooleanType())
df = df.withColumn("is_valid", validateEmail_udf(col("email")))
No need for a UDF here, just use the rlike function:
from pyspark.sql.functions import col, when, lit

# not really the regex to validate emails, but this handles your requirement
regex = r"^[\w\d_.-]+\.[\w\d_.-]+@[\w\d]+\.[\w\d]+$"
df.withColumn("flag", when(col("email").rlike(regex), lit("valid")).otherwise(lit("invalid")))\
    .show()
Gives:
+---+-----+----+-----------------+-------+
|ids|first|last|            email|   flag|
+---+-----+----+-----------------+-------+
|  1|   aa| zxc|aswin.raja@gm.com|  valid|
|  2|   bb| asd|aswin.raja@gm.com|  valid|
|  3|   cc| qwe| aswinraja@ad.com|invalid|
|  4|   dd| qwe|aswin.raja@gm.com|  valid|
|  5|   ee| qws| aswinraja@ad.com|invalid|
+---+-----+----+-----------------+-------+
For a complete regex to validate email addresses, check this post.
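If you also want to enforce the asker's exact character rule (alphanumerics plus - _ .) rather than a generic email pattern, a stricter rlike expression can be used. This is only a sketch of that idea; the pattern and column name are assumptions, not a complete email validator:
from pyspark.sql.functions import col, when, lit

# sketch: one pattern for both stated rules -
#   form _@_._ (something@something.something)
#   characters limited to alphanumerics plus - _ .
strict_regex = r"^[A-Za-z0-9._-]+@[A-Za-z0-9._-]+\.[A-Za-z0-9-]+$"
df.withColumn("flag", when(col("email").rlike(strict_regex), lit("valid")).otherwise(lit("invalid"))).show()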
Related
We have a lot of files in our S3 bucket. The current PySpark code I have reads each file, takes one column at a time from that file, looks for the keyword, and returns a dataframe with the count of the keyword in that column and file.
Here is the code in PySpark. (We are using Databricks to write the code, if that helps.)
import s3fs
fs = s3fs.S3FileSystem()

from pyspark.sql.functions import lower, col

keywords = ['%keyword1%', '%keyword2%']
prefix = ''
deployment_id = ''
pull_id = ''
paths = fs.ls(prefix + '/' + deployment_id + '/' + pull_id)

result = []
errors = []

try:
    for path in paths:
        df = spark.read.parquet('s3://' + path)
        print(path)
        for keyword in keywords:
            for col in df.columns:
                filtered_df = df.filter(lower(df[col]).like(keyword))
                filtered_count = filtered_df.count()
                if filtered_count > 0:
                    # print(col + ' has ' + str(filtered_count) + ' appearances')
                    result.append({'keyword': keyword, 'column': col, 'count': filtered_count, 'table': path.split('/')[-1]})
except Exception as e:
    errors.append({'error_msg': e})

try:
    errors = spark.createDataFrame(errors)
except Exception as e:
    print('no errors')

try:
    result = spark.createDataFrame(result)
    result.display()
except Exception as e:
    print('problem with results. May be no results')
I am new to PySpark, Databricks, and Spark. The code here runs very slowly; I know this because we have local Python code that is faster than this one. We wanted to use PySpark and Databricks because we thought it would be faster, and because with the local code we need to enter AWS access keys every day, and sometimes, if the file is huge, it gives a memory error.
NOTE: The above code reads data faster, but the search functionality seems slower compared to the local Python code.
Here is the Python code on our local system:
def search_df(self, keyword, df, regex=False):
    start = time.time()
    if regex:
        mask = df.applymap(lambda x: re.search(keyword, x) is not None if isinstance(x, str) else False).to_numpy()
    else:
        mask = df.applymap(lambda x: keyword.lower() in x.lower() if isinstance(x, str) else False).to_numpy()
I was hoping for some changes to the PySpark code so that it runs faster.
Thanks.
I tried changing .like(keyword) to .contains(keyword) to see if that is faster, but it doesn't seem to help.
Check out the code below. I have defined a function that uses a list comprehension to search each column in the df for a keyword, then I call that function for each keyword. A new df is returned for each keyword; these then need to be unioned using the reduce function.
import pyspark.sql.functions as F
from pyspark.sql import DataFrame
from functools import reduce
sampleData = [["Hello s1","Y","Hi s1"],["What is your name?","What is s1","what is s2?"] ]
df = spark.createDataFrame(sampleData,["col1","col2","col3"])
df.show()
# Sample input dataframe
+------------------+----------+-----------+
| col1| col2| col3|
+------------------+----------+-----------+
| Hello s1| Y| Hi s1|
|What is your name?|What is s1|what is s2?|
+------------------+----------+-----------+
keywords=["s1","s2"]
def calc(k) -> DataFrame:
    return df.select([F.count(F.when(F.col(c).rlike(k), c)).alias(c) for c in df.columns]).withColumn("keyword", F.lit(k))

lst = [calc(k) for k in keywords]
fDf = reduce(DataFrame.unionByName, lst)
stExpr="stack(3,'col1',col1,'col2',col2,'col3',col3) as (ColName,Count)"
fDf.select("keyword",F.expr(stExpr)).show()
# Output
+-------+-------+-----+
|keyword|ColName|Count|
+-------+-------+-----+
| s1| col1| 1|
| s1| col2| 1|
| s1| col3| 1|
| s2| col1| 0|
| s2| col2| 0|
| s2| col3| 1|
+-------+-------+-----+
You can add a where clause at the end to filter out rows with a count of 0:
.where("Count > 0")
I would like to check if the column values contain 'COIN', etc.
Is there a possibility to change my regex so that I do not have to list "CRYPTOCOIN|KUCOIN|COINBASE" explicitly? I'd like to have something like
"regex associated with the COIN word|BTCBIT.NET"
Please find my attached code below:
val CRYPTO_CARD_INDICATOR: String = ("BTCBIT.NET|KUCOIN|COINBASE|CRYPTCOIN")
val CryptoCheckDataset = df.withColumn("is_crypto_indicator",when(upper(col("company_name")).rlike(CRYPTO_CARD_INDICATOR), 1).otherwise(0))
I think the following should work:
COIN|BTCBIT.NET
Full test in PySpark:
from pyspark.sql.functions import *
CRYPTO_CARD_INDICATOR = "COIN|BTCBIT.NET"
df = spark.createDataFrame([('kucoin',), ('coinbase',), ('crypto',)], ['company_name'])
CryptoCheckDataset = df.withColumn("is_crypto_indicator", when(upper(col("company_name")).rlike(CRYPTO_CARD_INDICATOR), 1).otherwise(0))
CryptoCheckDataset.show()
# +------------+-------------------+
# |company_name|is_crypto_indicator|
# +------------+-------------------+
# | kucoin| 1|
# | coinbase| 1|
# | crypto| 0|
# +------------+-------------------+
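One small caveat (my addition, not part of the answer above): in a regex, the unescaped dot in BTCBIT.NET matches any character, so a name like BTCBITXNET would also be flagged. If that matters, escaping the dot is safer; a minimal sketch using the same PySpark setup:
from pyspark.sql.functions import col, when, upper

# escape the literal dot so "." only matches an actual dot
CRYPTO_CARD_INDICATOR = r"COIN|BTCBIT\.NET"
df.withColumn("is_crypto_indicator", when(upper(col("company_name")).rlike(CRYPTO_CARD_INDICATOR), 1).otherwise(0)).show()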
I have the below dataframe, which I have read from a JSON file. It has four columns, named 1 to 4, each holding a JSON value:
1: {"todo":["wakeup", "shower"]}
2: {"todo":["brush", "eat"]}
3: {"todo":["read", "write"]}
4: {"todo":["sleep", "snooze"]}
I need my output to be as below, with Key and Value. How do I do this? Do I need to create a schema?
+---+--------------+
|ID |todo          |
+---+--------------+
|1  |wakeup, shower|
|2  |brush, eat    |
|3  |read, write   |
|4  |sleep, snooze |
+---+--------------+
The key-value which you refer to is a struct. "keys" are struct field names, while "values" are field values.
What you want to do is called unpivoting. One of the ways to do it in PySpark is using stack. The following is a dynamic approach, where you don't need to hard-code the existing column names.
Input dataframe:
df = spark.createDataFrame(
[((['wakeup', 'shower'],),(['brush', 'eat'],),(['read', 'write'],),(['sleep', 'snooze'],))],
'`1` struct<todo:array<string>>, `2` struct<todo:array<string>>, `3` struct<todo:array<string>>, `4` struct<todo:array<string>>')
Script:
to_melt = [f"\'{c}\', `{c}`.todo" for c in df.columns]
df = df.selectExpr(f"stack({len(to_melt)}, {','.join(to_melt)}) (ID, todo)")
df.show()
# +---+----------------+
# | ID| todo|
# +---+----------------+
# | 1|[wakeup, shower]|
# | 2| [brush, eat]|
# | 3| [read, write]|
# | 4| [sleep, snooze]|
# +---+----------------+
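The desired output in the question shows the items joined as plain text ("wakeup, shower") rather than as an array. If that exact form is needed, concat_ws can flatten the array; a short sketch continuing from the dataframe above:
import pyspark.sql.functions as F

# turn the array<string> column into "wakeup, shower"-style text
df = df.withColumn("todo", F.concat_ws(", ", "todo"))
df.show(truncate=False)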
Use from_json to convert each string to a map, then explode to put each element on its own row.
data
df = spark.createDataFrame(
[(('{"todo":"[wakeup, shower]"}'),('{"todo":"[brush, eat]"}'),('{"todo":"[read, write]"}'),('{"todo":"[sleep, snooze]"}'))],
('value1','values2','value3','value4'))
code
from pyspark.sql.functions import array, explode, flatten, from_json, map_values, translate

new = (df.withColumn('todo', explode(flatten(array(*[map_values(from_json(x, "MAP<STRING,STRING>")) for x in df.columns]))))  # from string to array to individual rows
         .withColumn('todo', translate('todo', "[]", '')))  # remove corner brackets
new.show(truncate=False)
outcome
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|value1 |values2 |value3 |value4 |todo |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|wakeup, shower|
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|brush, eat |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|read, write |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|sleep, snooze |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
I know, that one can load files with PySpark for RDD's using the following commands:
sc = spark.sparkContext
someRDD = sc.textFile("some.csv")
or for dataframes:
spark.read.options(delimiter=',') \
.csv("some.csv")
My file is a .csv with 10 columns, separated by ','. However, the very last column contains some text that also has a lot of ",". Splitting by "," results in different column sizes for each row and, moreover, I do not get the whole text in one column.
I am just looking for a good way to load a .csv file into a dataframe that has multiple "," in the very last column.
Maybe there is a way to only split on the first n columns? It is guaranteed that all columns before the text column are separated by only one ",". Interestingly, using pd.read_csv does not cause this issue! So far my workaround has been to load the file with
csv = pd.read_csv("some.csv", delimiter=",")
csv_to_array = csv.values.tolist()
df = spark.createDataFrame(csv_to_array)
which is not a pretty solution. Moreover, it did not allow me to use some schema on my dataframe.
If you can't correct the input file, then you can try to load it as text then split the values to get the desired columns. Here's an example:
input file
1,2,3,4,5,6,7,8,9,10,0,12,121
1,2,3,4,5,6,7,8,9,10,0,12,121
read and parse
from pyspark.sql import functions as F
nb_cols = 5
df = spark.read.text("file.csv")
df = df.withColumn(
    "values",
    F.split("value", ",")
).select(
    *[F.col("values")[i].alias(f"col_{i}") for i in range(nb_cols)],
    F.array_join(F.expr(f"slice(values, {nb_cols + 1}, size(values))"), ",").alias(f"col_{nb_cols}")
)
df.show()
#+-----+-----+-----+-----+-----+-------------------+
#|col_0|col_1|col_2|col_3|col_4| col_5|
#+-----+-----+-----+-----+-----+-------------------+
#| 1| 2| 3| 4| 5|6,7,8,9,10,0,12,121|
#| 1| 2| 3| 4| 5|6,7,8,9,10,0,12,121|
#+-----+-----+-----+-----+-----+-------------------+
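If you are on Spark 3.0 or later, split also accepts a limit argument, which avoids the slice/array_join step: only the first splits are made and everything beyond them, commas included, stays in the last array element. A shorter sketch of the same idea:
from pyspark.sql import functions as F

nb_cols = 5
df = spark.read.text("file.csv")

# split at most nb_cols times; values[nb_cols] keeps the rest of the line as-is
df = df.withColumn("values", F.split("value", ",", nb_cols + 1)).select(
    *[F.col("values")[i].alias(f"col_{i}") for i in range(nb_cols)],
    F.col("values")[nb_cols].alias(f"col_{nb_cols}")
)
df.show()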
I have a dataframe:
DF:
1,2016-10-12 18:24:25
1,2016-11-18 14:47:05
2,2016-10-12 21:24:25
2,2016-10-12 20:24:25
2,2016-10-12 22:24:25
3,2016-10-12 17:24:25
How do I keep only the latest record for each group? (There are 3 groups above: 1, 2, 3.)
Result should be:
1,2016-11-18 14:47:05
2,2016-10-12 22:24:25
3,2016-10-12 17:24:25
I am also trying to make this efficient (e.g. to finish within a few minutes on a moderate cluster with 100 million records), so any sorting/ordering should be done (if required) in the most efficient and correct manner.
You have to use the window function.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=window#pyspark.sql.Window
You have to partition the window by the group and order by time descending; the PySpark script below does the work:
from pyspark.sql.functions import *
from pyspark.sql.window import Window
schema = "Group int,time timestamp "
df = spark.read.format('csv').schema(schema).options(header=False).load('/FileStore/tables/Group_window.txt')
w = Window.partitionBy('Group').orderBy(desc('time'))
df = df.withColumn('Rank',dense_rank().over(w))
df.filter(df.Rank == 1).drop(df.Rank).show()
+-----+-------------------+
|Group| time|
+-----+-------------------+
| 1|2016-11-18 14:47:05|
| 3|2016-10-12 17:24:25|
| 2|2016-10-12 22:24:25|
+-----+-------------------+
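Since the question also asks about efficiency, and the expected result only contains each group and its latest timestamp, a plain aggregation avoids the window sort entirely. This is my own suggestion rather than part of the answer above, using the df loaded above, and it only works when no other columns need to be carried along:
from pyspark.sql import functions as F

# latest timestamp per group via aggregation (no window needed)
latest = df.groupBy('Group').agg(F.max('time').alias('time'))
latest.show()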
You can use window functions as described here for cases like this:
scala> val in = Seq((1,"2016-10-12 18:24:25"),
| (1,"2016-11-18 14:47:05"),
| (2,"2016-10-12 21:24:25"),
| (2,"2016-10-12 20:24:25"),
| (2,"2016-10-12 22:24:25"),
| (3,"2016-10-12 17:24:25")).toDF("id", "ts")
in: org.apache.spark.sql.DataFrame = [id: int, ts: string]
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val win = Window.partitionBy("id").orderBy(desc("ts"))
win: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@59fa04f7
scala> in.withColumn("rank", row_number().over(win)).where("rank == 1").show(false)
+---+-------------------+----+
| id| ts|rank|
+---+-------------------+----+
| 1|2016-11-18 14:47:05| 1|
| 3|2016-10-12 17:24:25| 1|
| 2|2016-10-12 22:24:25| 1|
+---+-------------------+----+