Optimize PySpark code to find a keyword and its count in a dataframe

We have a lot of files in our S3 bucket. The current PySpark code reads each file, goes through each column of that file looking for the keywords, and returns a dataframe with the keyword, the column, the count of matches, and the file name.
Here is the PySpark code (we are writing it in Databricks, if that helps):
import s3fs
fs = s3fs.S3FileSystem()
from pyspark.sql.functions import lower, col

keywords = ['%keyword1%', '%keyword2%']
prefix = ''
deployment_id = ''
pull_id = ''
paths = fs.ls(prefix + '/' + deployment_id + '/' + pull_id)

result = []
errors = []
try:
    for path in paths:
        df = spark.read.parquet('s3://' + path)
        print(path)
        for keyword in keywords:
            for column in df.columns:  # renamed from `col` to avoid shadowing pyspark.sql.functions.col
                filtered_df = df.filter(lower(df[column]).like(keyword))
                filtered_count = filtered_df.count()
                if filtered_count > 0:
                    # print(column + ' has ' + str(filtered_count) + ' appearances')
                    result.append({'keyword': keyword, 'column': column, 'count': filtered_count, 'table': path.split('/')[-1]})
except Exception as e:
    errors.append({'error_msg': str(e)})  # keep the message as a string so createDataFrame can infer a schema

try:
    errors = spark.createDataFrame(errors)
except Exception as e:
    print('no errors')

try:
    result = spark.createDataFrame(result)
    result.display()
except Exception as e:
    print('problem with results. May be no results')
I am new to PySpark, Databricks and Spark. The code above runs very slowly; I know this because we have local Python code that is faster. We wanted to use PySpark and Databricks because we thought it would be faster, and because with the local code we have to enter AWS access keys every day and sometimes get memory errors on huge files.
NOTE - The code above reads the data quickly, but the search itself seems slower than the local Python code.
Here is the Python code on our local system:
def search_df(self, keyword, df, regex=False):
    start = time.time()
    if regex:
        mask = df.applymap(lambda x: re.search(keyword, x) is not None if isinstance(x, str) else False).to_numpy()
    else:
        mask = df.applymap(lambda x: keyword.lower() in x.lower() if isinstance(x, str) else False).to_numpy()
I was hoping for any changes to the PySpark code that would make it faster.
Thanks.
I tried changing .like(keyword) to .contains(keyword) to see if that is faster, but it doesn't seem to work.
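For what it's worth, Column.contains() matches a literal substring rather than a SQL LIKE pattern, so passing '%keyword1%' would make it look for the percent signs themselves. A minimal sketch of the call without the wildcards, using the column loop variable from the code above:
# contains() takes a plain substring, so the '%' wildcards are dropped
filtered_df = df.filter(lower(df[column]).contains('keyword1'))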

Check out the code below. It defines a function that uses a list comprehension to count the matches of a keyword in every column of the df in a single pass, then calls that function for each keyword. A new df is returned for each keyword; these are then unioned using reduce.
import pyspark.sql.functions as F
from pyspark.sql import DataFrame
from functools import reduce
sampleData = [["Hello s1","Y","Hi s1"],["What is your name?","What is s1","what is s2?"] ]
df = spark.createDataFrame(sampleData,["col1","col2","col3"])
df.show()
# Sample input dataframe
+------------------+----------+-----------+
| col1| col2| col3|
+------------------+----------+-----------+
| Hello s1| Y| Hi s1|
|What is your name?|What is s1|what is s2?|
+------------------+----------+-----------+
keywords=["s1","s2"]
def calc(k) -> DataFrame:
    return df.select([F.count(F.when(F.col(c).rlike(k), c)).alias(c) for c in df.columns]).withColumn("keyword", F.lit(k))

lst = [calc(k) for k in keywords]
fDf = reduce(DataFrame.unionByName, lst)
stExpr="stack(3,'col1',col1,'col2',col2,'col3',col3) as (ColName,Count)"
fDf.select("keyword",F.expr(stExpr)).show()
# Output
+-------+-------+-----+
|keyword|ColName|Count|
+-------+-------+-----+
| s1| col1| 1|
| s1| col2| 1|
| s1| col3| 1|
| s2| col1| 0|
| s2| col2| 0|
| s2| col3| 1|
+-------+-------+-----+
You can add a where clause at the end to keep only rows with a count greater than 0:
.where("Count > 0")

Related

Spark regex 'COIN' in column values -> rlike approach

I would like to check if the column values contain 'COIN' etc.
Is there a way to change my regex so that I don't have to list "CRYPTOCOIN|KUCOIN|COINBASE" explicitly? I'd like to have something like
"regex associated with the COIN word|BTCBIT.NET"
Please find my attached code below:
val CRYPTO_CARD_INDICATOR: String = ("BTCBIT.NET|KUCOIN|COINBASE|CRYPTCOIN")
val CryptoCheckDataset = df.withColumn("is_crypto_indicator",when(upper(col("company_name")).rlike(CRYPTO_CARD_INDICATOR), 1).otherwise(0))
I think the following should work:
COIN|BTCBIT.NET
Full test in PySpark:
from pyspark.sql.functions import *
CRYPTO_CARD_INDICATOR = "COIN|BTCBIT.NET"
df = spark.createDataFrame([('kucoin',), ('coinbase',), ('crypto',)], ['company_name'])
CryptoCheckDataset = df.withColumn("is_crypto_indicator", when(upper(col("company_name")).rlike(CRYPTO_CARD_INDICATOR), 1).otherwise(0))
CryptoCheckDataset.show()
# +------------+-------------------+
# |company_name|is_crypto_indicator|
# +------------+-------------------+
# | kucoin| 1|
# | coinbase| 1|
# | crypto| 0|
# +------------+-------------------+
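One small caveat, not part of the original answer: in a regex the unescaped dot in BTCBIT.NET matches any character, so strictly the literal dot should be escaped, for example:
# escape the literal dot so it is not treated as a regex wildcard
CRYPTO_CARD_INDICATOR = "COIN|BTCBIT\\.NET"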

How to properly import CSV files with PySpark

I know that one can load files with PySpark into an RDD using the following commands:
sc = spark.sparkContext
someRDD = sc.textFile("some.csv")
or for dataframes:
spark.read.options(delimiter=',') \
.csv("some.csv")
My file is a .csv with 10 columns, separated by ','. However, the very last column contains text that also has a lot of ",". Splitting by "," results in a different number of columns for each row, and moreover I do not get the whole text in one column.
I am just looking for a good way to load a .csv file into a dataframe whose last column contains multiple ",".
Maybe there is a way to only split on the first n columns? It is guaranteed that all columns before the text column are separated by exactly one ",". Interestingly, using pd.read_csv does not cause this issue! So far my workaround has been to load the file with
csv = pd.read_csv("some.csv", delimiter=",")
csv_to_array = csv.values.tolist()
df = spark.createDataFrame(csv_to_array)
which is not a pretty solution. Moreover, it does not let me apply a schema to my dataframe.
If you can't correct the input file, you can try loading it as text and then splitting the values to get the desired columns. Here's an example:
input file
1,2,3,4,5,6,7,8,9,10,0,12,121
1,2,3,4,5,6,7,8,9,10,0,12,121
read and parse
from pyspark.sql import functions as F
nb_cols = 5
df = spark.read.text("file.csv")
df = df.withColumn(
    "values",
    F.split("value", ",")
).select(
    *[F.col("values")[i].alias(f"col_{i}") for i in range(nb_cols)],
    F.array_join(F.expr(f"slice(values, {nb_cols + 1}, size(values))"), ",").alias(f"col_{nb_cols}")
)
df.show()
#+-----+-----+-----+-----+-----+-------------------+
#|col_0|col_1|col_2|col_3|col_4| col_5|
#+-----+-----+-----+-----+-----+-------------------+
#| 1| 2| 3| 4| 5|6,7,8,9,10,0,12,121|
#| 1| 2| 3| 4| 5|6,7,8,9,10,0,12,121|
#+-----+-----+-----+-----+-----+-------------------+
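Since the question also mentions wanting to apply a schema, the parsed string columns can be cast afterwards; a minimal sketch (the integer types here are assumptions for illustration, adjust to the real schema):
from pyspark.sql import functions as F
# cast the fixed leading columns; the free-text column stays a string
typed_df = df.select(
    *[F.col(f"col_{i}").cast("int") for i in range(nb_cols)],
    F.col(f"col_{nb_cols}")
)
typed_df.printSchema()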

How to load big double numbers in a PySpark DataFrame and persist it back without changing the numeric format to scientific notation or precision?

I have a CSV like that:
COL,VAL
TEST,100000000.12345679
TEST2,200000000.1234
TEST3,9999.1234679123
I want to load it having the column VAL as a numeric type (due to other requirements of the project) and then persist it back to another CSV as per structure below:
+-----+------------------+
| COL| VAL|
+-----+------------------+
| TEST|100000000.12345679|
|TEST2| 200000000.1234|
|TEST3| 9999.1234679123|
+-----+------------------+
The problem I'm facing is that whenever I load it, the numbers are converted to scientific notation, and I cannot persist it back without specifying the precision and scale of my data (I want to use whatever precision is already in the file; I can't infer it).
Here's what I have tried:
Loading it with DoubleType() gives me scientific notation:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField('COL', StringType()),
    StructField('VAL', DoubleType())
])
csv_file = "Downloads/test.csv"
df2 = (spark.read.format("csv")
       .option("sep", ",")
       .option("header", "true")
       .schema(schema)
       .load(csv_file))
df2.show()
+-----+--------------------+
| COL| VAL|
+-----+--------------------+
| TEST|1.0000000012345679E8|
|TEST2| 2.000000001234E8|
|TEST3| 9999.1234679123|
+-----+--------------------+
Loading it with DecimalType() I'm required to specify precision and scale; otherwise I lose the decimals after the dot. However, specifying them, besides the risk of not getting the correct value (as my data might be rounded), gives me zeros after the dot:
For example, using StructField('VAL', DecimalType(38, 18)) I get:
[Row(COL='TEST', VAL=Decimal('100000000.123456790000000000')),
Row(COL='TEST2', VAL=Decimal('200000000.123400000000000000')),
Row(COL='TEST3', VAL=Decimal('9999.123467912300000000'))]
Notice that in this case I have zeros on the right side that I don't want in my new file.
The only way I found to address this was a UDF where I first use float() to remove the scientific notation and then convert it to a string to make sure it will be persisted as I want:
from pyspark.sql.functions import udf

to_decimal = udf(lambda n: str(float(n)))
df2 = df2.select("*", to_decimal("VAL").alias("VAL2"))
df2 = df2.select(["COL", "VAL2"]).withColumnRenamed("VAL2", "VAL")
df2.show()
display(df2.schema)
+-----+------------------+
| COL| VAL|
+-----+------------------+
| TEST|100000000.12345679|
|TEST2| 200000000.1234|
|TEST3| 9999.1234679123|
+-----+------------------+
StructType(List(StructField(COL,StringType,true),StructField(VAL,StringType,true)))
Is there any way to achieve the same without the UDF trick?
Thank you!
The best way I found to address it is below. It still uses a UDF, but now without the string workarounds to avoid scientific notation. I won't mark it as the correct answer yet, because I still expect someone to come along with a solution without a UDF (or a good explanation of why it's not possible without one).
The CSV:
$ cat /Users/bambrozi/Downloads/testf.csv
COL,VAL
TEST,100000000.12345679
TEST2,200000000.1234
TEST3,9999.1234679123
TEST4,123456789.01234567
Load the CSV applying a wide DecimalType precision and scale:
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

schema = StructType([
    StructField('COL', StringType()),
    StructField('VAL', DecimalType(38, 18))
])
csv_file = "Downloads/testf.csv"
df2 = (spark.read.format("csv")
       .option("sep", ",")
       .option("header", "true")
       .schema(schema)
       .load(csv_file))
df2.show(truncate=False)
output:
+-----+----------------------------+
|COL |VAL |
+-----+----------------------------+
|TEST |100000000.123456790000000000|
|TEST2|200000000.123400000000000000|
|TEST3|9999.123467912300000000 |
|TEST4|123456789.012345670000000000|
+-----+----------------------------+
When you are ready to report it (print it or save it to a new file), you normalize away the trailing zeros:
import decimal
import pyspark.sql.functions as F
normalize_decimals = F.udf(lambda dec: dec.normalize())
(df2
.withColumn('VAL', normalize_decimals(F.col('VAL')))
.show(truncate=False))
output:
+-----+------------------+
|COL |VAL |
+-----+------------------+
|TEST |100000000.12345679|
|TEST2|200000000.1234 |
|TEST3|9999.1234679123 |
|TEST4|123456789.01234567|
+-----+------------------+
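For what it's worth, one possible UDF-free variant (a sketch, not from the original answer) is to cast the decimal to a string and strip the trailing zeros with regexp_replace. This assumes every value keeps a decimal point after the cast, which is the case for DecimalType(38, 18) values like the ones above:
import pyspark.sql.functions as F
# cast the decimal to a string, then drop trailing zeros (and a dangling dot, if any)
(df2
 .withColumn('VAL', F.regexp_replace(F.col('VAL').cast('string'), r'\.?0+$', ''))
 .show(truncate=False))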
You can do that with Spark using a SQL query:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
val sparkConf: SparkConf = new SparkConf(true)
  .setAppName(this.getClass.getName)
  .setMaster("local[*]")
implicit val spark: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
val df = spark.read.option("header", "true").format("csv").load(csv_file)
df.createOrReplaceTempView("table")
val query = "Select cast(VAL as BigDecimal) as VAL, COL from table"
val result = spark.sql(query)
result.show()
result.coalesce(1).write.option("header", "true").mode("overwrite").csv(outputPath + table)

Pyspark number of unique values in dataframe is different compared with Pandas result

I have a large dataframe with 4 million rows. One of the columns is a variable called "name".
When I check the number of unique values in Pandas with df['name'].nunique() I get a different answer than from PySpark's df.select("name").distinct().show() (around 1800 in Pandas versus 350 in PySpark). How can this be? Is this a data partitioning thing?
EDIT:
The record "name" in the dataframe looks like: name-{number}, for example: name-1, name-2, etc.
In Pandas:
df['name'] = df['name'].str.lstrip('name-').astype(int)
df['name'].nunique() # 1800
In Pyspark:
import pyspark.sql.functions as f
df = df.withColumn("name", f.split(df['name'], '\-')[1].cast("int"))
df.select(f.countDistinct("name")).show()
IIUC, it's most likely caused by non-numeric chars (e.g. spaces) in the name column. Pandas forces the type conversion, while with Spark you get NULL; see the example below:
df = spark.createDataFrame([(e,) for e in ['name-1', 'name-22 ', 'name- 3']],['name'])
for PySpark:
import pyspark.sql.functions as f
df.withColumn("name1", f.split(df['name'], '\-')[1].cast("int")).show()
#+--------+-----+
#| name|name1|
#+--------+-----+
#| name-1| 1|
#|name-22 | null|
#| name- 3| null|
#+--------+-----+
for Pandas:
df.toPandas()['name'].str.lstrip('name-').astype(int)
#Out[xxx]:
#0     1
#1    22
#2     3
#Name: name, dtype: int64
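If the goal is to reproduce the lenient Pandas behaviour on the Spark side, one option (a sketch, not from the original answer) is to extract just the digits before casting, which tolerates the stray spaces:
import pyspark.sql.functions as f
# pull out the digit run regardless of surrounding spaces, then cast to int
df.withColumn("name1", f.regexp_extract("name", r"(\d+)", 1).cast("int")).show()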

Pyspark dataframe conversion to pandas drops data?

I have a fairly involved process of creating a pyspark dataframe, converting it to a pandas dataframe, and outputting the result to a flat file. I am not sure at which point the error is introduced, so I'll describe the whole process.
Starting out I have a pyspark dataframe that contains pairwise similarity for sets of ids. It looks like this:
+------+-------+-------------------+
| ID_A| ID_B| EuclideanDistance|
+------+-------+-------------------+
| 1| 1| 0.0|
| 1| 2|0.13103884200454394|
| 1| 3| 0.2176246463836219|
| 1| 4| 0.280568636550471|
...
I'd like to group it by ID_A, sort each group by EuclideanDistance, and only grab the top N pairs for each group. So first I do this:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col, row_number
window = Window.partitionBy(df['ID_A']).orderBy(df_sim['EuclideanDistance'])
result = (df.withColumn('row_num', row_number().over(window)))
I make sure ID_A = 1 is still in the "result" dataframe. Then I do this to limit each group to just 20 rows:
result1 = result.where(result.row_num<20)
result1.toPandas().to_csv("mytest.csv")
and ID_A = 1 is NOT in the resultant .csv file (although it's still there in result1). Is there a problem somewhere in this chain of conversions that could lead to a loss of data?
You are referencing two dataframes in the window of your solution. Not sure this is causing your error, but it's worth cleaning up. You don't need to reference a particular dataframe in a window definition; try
window = Window.partitionBy('ID_A').orderBy('EuclideanDistance')
As David mentioned, you reference a second dataframe "df_sim" in your window function.
I tested the following and it works on my machine (famous last words):
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col, row_number
import pandas as pd
import numpy as np

#simulate some data
df = pd.DataFrame({'ID_A': np.arange(100) % 5,
                   'ID_B': np.repeat(np.arange(20), 5),
                   'EuclideanDistance': np.random.rand(100) * 5})
#artificially set distance between point and self to 0
df.loc[df['ID_A'] == df['ID_B'], 'EuclideanDistance'] = 0
df = spark.createDataFrame(df)
#end simulation
window = Window.partitionBy(df['ID_A']).orderBy(df['EuclideanDistance'])
output = df.select('*', row_number().over(window).alias('rank')).filter(col('rank') <= 10)
output.show(50)
The simulation code is there just to make this a self-contained example. You can of course use your actual dataframe and ignore the simulation when you test it. Hope that works!
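One extra debugging step that may help (a suggestion, not part of the original answers): check inside Spark whether ID_A = 1 survives the row_number filter before converting to pandas, which narrows down whether the rows are lost in the Spark step or in the toPandas/to_csv step. For example:
# quick check on the Spark side before converting to pandas
result1.filter(col('ID_A') == 1).show()
print(result1.filter(col('ID_A') == 1).count())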