How to properly import CSV files with PySpark - dataframe

I know, that one can load files with PySpark for RDD's using the following commands:
sc = spark.sparkContext
someRDD = sc.textFile("some.csv")
or for dataframes:
spark.read.options(delimiter=',') \
.csv("some.csv")
My file is a .csv with 10 columns, seperated by ',' . However, the very last column contains some text, that also has a lot of ",". Splitting by "," will result in different column sizes for each row and moreover, I do not have the whole text in one column.
I am just looking for a good way to load a .csv file into a dataframe that has multiple "," at the very last index.
Maybe there is way to only split on the first n columns? Because it is guaranteed, that all columns before the text column are only seperated by one ",". Interestingly, using pd.read_csv does not cause this issue! So far my workaround has been to load the file with
csv = pd.read_csv("some.csv", delimiter=",")
csv_to_array = csv.values.tolist()
df = createDataFrame(csv_to_array)
which is not a pretty solution. Moreover, it did not allow me to use some schema on my dataframe.

If you can't correct the input file, then you can try to load it as text then split the values to get the desired columns. Here's an example:
input file
1,2,3,4,5,6,7,8,9,10,0,12,121
1,2,3,4,5,6,7,8,9,10,0,12,121
read and parse
from pyspark.sql import functions as F
nb_cols = 5
df = spark.read.text("file.csv")
df = df.withColumn(
"values",
F.split("value", ",")
).select(
*[F.col("values")[i].alias(f"col_{i}") for i in range(nb_cols)],
F.array_join(F.expr(f"slice(values, {nb_cols + 1}, size(values))"), ",").alias(f"col_{nb_cols}")
)
df.show()
#+-----+-----+-----+-----+-----+-------------------+
#|col_0|col_1|col_2|col_3|col_4| col_5|
#+-----+-----+-----+-----+-----+-------------------+
#| 1| 2| 3| 4| 5|6,7,8,9,10,0,12,121|
#| 1| 2| 3| 4| 5|6,7,8,9,10,0,12,121|
#+-----+-----+-----+-----+-----+-------------------+

Related

optimize pyspark code to find a keyword and its count in a dataframe

We have a lot of files in our s3 bucket. The current pyspark code I have reads each file, takes one column from that file and looks for the keyword and returns a dataframe with count of keyword in the column and the file.
Here is the code in pyspark. (we are using databricks to write code if that helps)
import s3fs
fs = s3fs.S3FileSystem()
from pyspark.sql.functions import lower, col
keywords = ['%keyword1%','%keyword2%']
prefix = ''
deployment_id = ''
pull_id = ''
paths = fs.ls(prefix+'/'+deployment_id+'/'+pull_id)
result = []
errors = []
try:
for path in paths:
df = spark.read.parquet('s3://'+path)
print(path)
for keyword in keywords:
for col in df.columns:
filtered_df = df.filter(lower(df[col]).like(keyword))
filtered_count = filtered_df.count()
if filtered_count > 0 :
#print(col +' has '+ str(filtered_count) +' appearences')
result.append({'keyword': keyword, 'column': col, 'count': filtered_count,'table':path.split('/')[-1]})
except Exception as e:
errors.append({'error_msg':e})
try:
errors = spark.createDataFrame(errors)
except Exception as e:
print('no errors')
try:
result = spark.createDataFrame(result)
result.display()
except Exception as e:
print('problem with results. May be no results')
I am new to pyspark,databricks and spark. Code here works very slow. I know that cause we have a local code in python that is faster than this one. we wanted to use pyspark, databricks cause we thought it would be faster and on local code we need to put aws access keys every day and some times if the file is huge it gives a memory error.
NOTE - The above code reads data faster but the search functionality seems to be slower when compared to local python code
here is the python code in our local system
def search_df(self,keyword,df,regex=False):
start=time.time()
if regex:
mask = df.applymap(lambda x: re.search(keyword,x) is not None if isinstance(x,str) else False).to_numpy()
else:
mask = df.applymap(lambda x: keyword.lower() in x.lower() if isinstance(x,str) else False).to_numpy()
I was hoping if I could have any code changes to the pyspark so its faster.
Thanks.
tried changing
.like(keyword) to .contains(keyword) to see if thats faster. but doesnt seem to work
Check out the below code. Have defined a function that uses List Comprehensions to search each column in the df for a keyword. Next calling that function for each keyword. There will be a new df returned for each keyword, which then need to be unioned using reduce function.
import pyspark.sql.functions as F
from pyspark.sql import DataFrame
from functools import reduce
sampleData = [["Hello s1","Y","Hi s1"],["What is your name?","What is s1","what is s2?"] ]
df = spark.createDataFrame(sampleData,["col1","col2","col3"])
df.show()
# Sample input dataframe
+------------------+----------+-----------+
| col1| col2| col3|
+------------------+----------+-----------+
| Hello s1| Y| Hi s1|
|What is your name?|What is s1|what is s2?|
+------------------+----------+-----------+
keywords=["s1","s2"]
def calc(k) -> DataFrame:
return df.select([F.count(F.when(F.col(c).rlike(k),c)).alias(c) for c in df.columns] ).withColumn("keyword",F.lit(k))
lst=[calc(k) for k in keywords]
fDf=reduce(DataFrame.unionByName, [y for y in lst])
stExpr="stack(3,'col1',col1,'col2',col2,'col3',col3) as (ColName,Count)"
fDf.select("keyword",F.expr(stExpr)).show()
# Output
+-------+-------+-----+
|keyword|ColName|Count|
+-------+-------+-----+
| s1| col1| 1|
| s1| col2| 1|
| s1| col3| 1|
| s2| col1| 0|
| s2| col2| 0|
| s2| col3| 1|
+-------+-------+-----+
You can add a where clause at the end to filter rows greater than 0 ==>
where("Count >0")

How to remove double quotes from column name while saving dataframe in csv in spark?

I am saving spark dataframe into csv file. All the records is saving in double quotes that is fine but column name also coming in double quotes. Could you please help me how to remove them?
Example:
"Source_System"|"Date"|"Market_Volume"|"Volume_Units"|"Market_Value"|"Value_Currency"|"Sales_Channel"|"Competitor_Name"
"IMS"|"20080628"|"183.0"|"16470.0"|"165653.256349"|"AUD"|"AUSTRALIA HOSPITAL"|"PFIZER"
Desirable Output:
Source_System|Date|Market_Volume|Volume_Units|Market_Value|Value_Currency|Sales_Channel|Competitor_Name
"IMS"|"20080628"|"183.0"|"16470.0"|"165653.256349"|"AUD"|"AUSTRALIA HOSPITAL"|"PFIZER"
I am using below code:
df4.repartition(1).write.csv(Output_Path_ASPAC, quote='"', header=True, quoteAll=True, sep='|', mode='overwrite')
I think only workaround would be concat quotes to the column values in dataframe before writing to csv.
Example:
df.show()
#+---+----+------+
#| id|name|salary|
#+---+----+------+
#| 1| a| 100|
#+---+----+------+
from pyspark.sql.functions import col, concat, lit
cols = [concat(lit('"'), col(i), lit('"')).alias(i) for i in df.columns]
df1=df.select(*cols)
df1.show()
#+---+----+------+
#| id|name|salary|
#+---+----+------+
#|"1"| "a"| "100"|
#+---+----+------+
df1.\
write.\
csv("<path>", header=True, sep='|',escape='', quote='',mode='overwrite')
#output
#cat tmp4/part*
#id|name|salary
#"1"|"a"|"100"

Pyspark number of unique values in dataframe is different compared with Pandas result

I have large dataframe with 4 million rows. One of the columns is a variable called "name".
When I check the number of unique values in Pandas by: df['name].nunique() I get a different answer than from Pyspark df.select("name").distinct().show() (around 1800 in Pandas versus 350 in Pyspark). How can this be? Is this a data partitioning thing?
EDIT:
The record "name" in the dataframe looks like: name-{number}, for example: name-1, name-2, etc.
In Pandas:
df['name'] = df['name'].str.lstrip('name-').astype(int)
df['name'].nunique() # 1800
In Pyspark:
import pyspark.sql.functions as f
df = df.withColumn("name", f.split(df['name'], '\-')[1].cast("int"))
df.select(f.countDistinct("name")).show()
IIUC, it's most likely from non-numeric chars(i.e. SPACE) shown in the name column. Pandas will force the type conversion while with Spark, you get NULL, see below example:
df = spark.createDataFrame([(e,) for e in ['name-1', 'name-22 ', 'name- 3']],['name'])
for PySpark:
import pyspark.sql.functions as f
df.withColumn("name1", f.split(df['name'], '\-')[1].cast("int")).show()
#+--------+-----+
#| name|name1|
#+--------+-----+
#| name-1| 1|
#|name-22 | null|
#| name- 3| null|
#+--------+-----+
for Pandas:
df.toPandas()['name'].str.lstrip('name-').astype(int)
#Out[xxx]:
#0 1
#1 22
#2 3
#Name: name, dtype: int64

pyspark withColumn, how to vary column name

is there any way to create/fill columns with pyspark 2.1.0 where the name of the column is the value of a different column?
I tried the following
def createNewColumnsFromValues(dataFrame, colName, targetColName):
"""
Set value of column colName to targetColName's value
"""
cols = dataFrame.columns
#df = dataFrame.withColumn(f.col(colName), f.col(targetColName))
df = dataFrame.withColumn('x', f.col(targetColName))
return df
The out commented line does not work, when calling the method I get the error
TypeError: 'Column' object is not callable
whereas the fixed name (as a string) is no problem. Any idea of how to also make the name of the column come from another one, not just the value? I also tried to use a UDF function definition as a workaround with the same no success result.
Thanks for help!
Edit:
from pyspark.sql import functions as f
I figured a solution which scales nicely for few (or not many) distinct values I need columns for. Which is necessarily the case or the number of columns would explode.
def createNewColumnsFromValues(dataFrame, colName, targetCol):
distinctValues = dataFrame.select(colName).distinct().collect()
for value in distinctValues:
dataFrame = dataFrame.withColumn(str(value[0]), f.when(f.col(colName) == value[0], f.col(targetCol)).otherwise(f.lit(None)))
return dataFrame
You might want to try the following code:
test_df = spark.createDataFrame([
(1,"2",5,1),(3,"4",7,8),
], ("col1","col2","col3","col4"))
def createNewColumnsFromValues(dataFrame, sourceCol, colName, targetCol):
"""
Set value column colName to targetCol
"""
for value in sourceCol:
dataFrame = dataFrame.withColumn(str(value[0]), when(col(colName)==value[0], targetCol).otherwise(None))
return dataFrame
createNewColumnsFromValues(test_df, test_df.select("col4").collect(), "col4", test_df["col3"]).show()
The trick here is to do select("COLUMNNAME").collect() to get a list of the values in the column. Then colName contains this list, which is a list of rows, where each row has a single element. So you can directly iterate through the list and access the element at position 0. In this case a cast to string was necessary to ensure the column name of the new column is a string. The target column is used for the values for each of the individual columns. So the result would look like:
+----+----+----+----+----+----+
|col1|col2|col3|col4| 1| 8|
+----+----+----+----+----+----+
| 1| 2| 5| 1| 5|null|
| 3| 4| 7| 8|null| 7|
+----+----+----+----+----+----+

Pyspark dataframe conversion to pandas drops data?

I have a fairly involved process of creating a pyspark dataframe, converting it to a pandas dataframe, and outputting the result to a flat file. I am not sure at which point the error is introduced, so I'll describe the whole process.
Starting out I have a pyspark dataframe that contains pairwise similarity for sets of ids. It looks like this:
+------+-------+-------------------+
| ID_A| ID_B| EuclideanDistance|
+------+-------+-------------------+
| 1| 1| 0.0|
| 1| 2|0.13103884200454394|
| 1| 3| 0.2176246463836219|
| 1| 4| 0.280568636550471|
...
I'like to group it by ID_A, sort each group by EuclideanDistance, and only grab the top N pairs for each group. So first I do this:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col, row_number
window = Window.partitionBy(df['ID_A']).orderBy(df_sim['EuclideanDistance'])
result = (df.withColumn('row_num', row_number().over(window)))
I make sure ID_A = 1 is still in the "result" dataframe. Then I do this to limit each group to just 20 rows:
result1 = result.where(result.row_num<20)
result1.toPandas().to_csv("mytest.csv")
and ID_A = 1 is NOT in the resultant .csv file (although it's still there in result1). Is there a problem somewhere in this chain of conversions that could lead to a loss of data?
You are referencing 2 dataframes in the window of your solution. Not sure this is causing your error, but it's worth cleaning up. In any case, you don't need to reference a particular dataframe in a window definition. In any case, try
window = Window.partitionBy('ID_A').orderBy('EuclideanDistance')
As David mentioned, you reference a second dataframe "df_sim" in your window function.
I tested the following and it works on my machine (famous last words):
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col, row_number
import pandas as pd
#simulate some data
df = pd.DataFrame({'ID_A': pd.np.arange(100)%5,
'ID_B': pd.np.repeat(pd.np.arange(20),5),
'EuclideanDistance': pd.np.random.rand(100)*5}
)
#artificially set distance between point and self to 0
df['EuclideanDistance'][df['ID_A'] == df['ID_B']] = 0
df = spark.createDataFrame(df)
#end simulation
window = Window.partitionBy(df['ID_A']).orderBy(df['EuclideanDistance'])
output = df.select('*', row_number().over(window).alias('rank')).filter(col('rank') <= 10)
output.show(50)
The simulation code is there just to make this a self-contained example. You can of course use your actual dataframe and ignore the simulation when you test it. Hope that works!