I'm trying to read a CSV file using pyspark-sql, and most of the column names contain special characters. I would like to remove the special characters from all column names using a PySpark dataframe. Is there a function available to remove special characters from all the column names at once? I appreciate your response.
Try using a regular expression to replace all the special characters, and then use .toDF().
Example:
df = spark.createDataFrame([('a','b','v','d')], ['._a','/b','c ','d('])
import re
# strip the special characters from every column name
cols = [re.sub(r"(_|\.|\(|\/)", "", i) for i in df.columns]
df.toDF(*cols).show()
#+---+---+---+---+
#| a| b| c | d|
#+---+---+---+---+
#| a| b| v| d|
#+---+---+---+---+
Using .withColumnRenamed():
for i, j in zip(df.columns, cols):
    df = df.withColumnRenamed(i, j)
df.show()
#+---+---+---+---+
#| a| b| c | d|
#+---+---+---+---+
#| a| b| v| d|
#+---+---+---+---+
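An equivalent single-pass alternative to the loop (a sketch of mine, not part of the original answer) starts from the original df and the cleaned cols list and does the rename in one select; the backticks guard column names that contain dots:

from pyspark.sql.functions import col

# one alias expression per original column; backticks keep names with dots from being parsed as struct fields
df.select([col(f"`{c}`").alias(n) for c, n in zip(df.columns, cols)]).show()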
I am trying to load a CSV file that looks like the following, using PySpark:
A^B^C^D^E^F
"Yash"^"12"^""^"this is first record"^"nice"^"12"
"jay"^"13"^""^"
In second record, I am new line at the beingnning"^"nice"^"12"
"Nova"^"14"^""^"this is third record"^"nice"^"12"
When I read this file and select a few columns, the entire dataframe gets messed up.
import pyspark.sql.functions as F

df = (
    spark.read
    .option("delimiter", "^")
    .option("header", True)
    .option("multiLine", True)
    .option("escape", "\"")
    .csv("test3.csv")
)
df.show()
df = df.withColumn("isdeleted", F.lit(True))
select_cols = ['isdeleted','B','D','E','F']
df = df.select(*select_cols)
df.show()
(Some import statements truncated for readability.)
This is what I see when the above code runs:
Before column selection (entire DF)
+----+---+----+--------------------+----+---+
| A| B| C| D| E| F|
+----+---+----+--------------------+----+---+
|Yash| 12|null|this is first record|nice| 12|
| jay| 13|null|\nIn second recor...|nice| 12|
|Nova| 14|null|this is third record|nice| 12|
+----+---+----+--------------------+----+---+
After df.select(*select_cols)
+---------+----+--------------------+----+----+
|isdeleted| B| D| E| F|
+---------+----+--------------------+----+----+
| true| 12|this is first record|nice| 12|
| true| 13| null|null|null|
| true|nice| null|null|null|
| true| 14|this is third record|nice| 12|
+---------+----+--------------------+----+----+
Here, the second row, which contains a newline character, is being broken into two rows; the output file is also messed up, just like the dataframe preview shown above.
I am using the AWS Glue image amazon/aws-glue-libs:glue_libs_4.0.0_image_01, which uses Spark 3.3.0. I also tried Spark 3.1.1 and see the same issue in both versions.
I am not sure whether this is a bug in Spark or whether I am missing something here. Any help will be appreciated.
You are giving the wrong escape character. It is usually \ but you are setting it to the quote character. Once you remove that option:
df = spark.read.csv('test.csv', sep='^', header=True, multiLine=True)
df.show()
df.select('B').show()
+----+---+----+--------------------+----+---+
| A| B| C| D| E| F|
+----+---+----+--------------------+----+---+
|Yash| 12|null|this is first record|nice| 12|
| jay| 13|null|\nIn second recor...|nice| 12|
|Nova| 14|null|this is third record|nice| 12|
+----+---+----+--------------------+----+---+
+---+
| B|
+---+
| 12|
| 13|
| 14|
+---+
You will get the desired result.
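If you prefer the option-style reader from the question, a minimal sketch with the same settings (the escape is left at, or explicitly set to, Spark's default backslash instead of the quote character):

df = (
    spark.read
    .option("sep", "^")
    .option("header", True)
    .option("multiLine", True)
    .option("escape", "\\")  # Spark's default CSV escape character; do not point it at the quote char
    .csv("test.csv")
)
df.show()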
I have a dataframe like this:
id,p1
1,A
2,null
3,B
4,null
4,null
2,C
Using PySpark, I want to remove all the duplicates. However, when an id has both a null and a non-null p1 value, I want to drop the null one. For example, I want to remove the first occurrence of id 2 and either of the two rows for id 4. Right now I am splitting the dataframe into two dataframes like this:
id,p1
1,A
3,B
2,C
id,p1
2,null
4,null
4,null
I remove the duplicates from both, then add back the rows that are not in the first dataframe. That way I get this dataframe:
id,p1
1,A
3,B
4,null
2,C
This is what I have so far:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()
d = spark.createDataFrame(
    [(1, "A"),
     (2, None),
     (3, "B"),
     (4, None),
     (4, None),
     (2, "C")],
    ["id", "p"]
)
d1 = d.filter(d.p.isNull())
d2 = d.filter(d.p.isNotNull())
d1 = d1.dropDuplicates()
d2 = d2.dropDuplicates()
d3 = d1.join(d2, "id", 'left_anti')
d4 = d2.unionByName(d3)
Is there a more elegant way of doing this? It feels redundant, but I can't come up with a better approach. I tried using groupBy but couldn't get it to work. Any ideas? Thanks.
from pyspark.sql.functions import col

(df1.sort(col('p1').desc())         # sort descending; nulls sort last
    .dropDuplicates(subset=['id'])  # drop duplicates on column id
    .show())
+---+----+
| id| p1|
+---+----+
| 1| A|
| 2| C|
| 3| B|
| 4|null|
+---+----+
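Since the question mentions trying groupBy, here is a minimal sketch using the d / "p" names from the question's own code: max ignores nulls, so any non-null p survives per id (this assumes keeping one arbitrary non-null value per id is acceptable):

import pyspark.sql.functions as F

# F.max() skips nulls, so an id with any non-null p keeps it; all-null ids stay null
d.groupBy("id").agg(F.max("p").alias("p")).show()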
Use the window function row_number() and sort by the "p" column descending.
Example:
d.show()
#+---+----+
#| id| p|
#+---+----+
#| 1| A|
#| 2|null|
#| 3| B|
#| 4|null|
#| 4|null|
#| 2| C|
#+---+----+
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

window_spec = row_number().over(Window.partitionBy("id").orderBy(col("p").desc()))
d.withColumn("rn", window_spec).filter(col("rn") == 1).drop("rn").show()
#+---+----+
#| id| p|
#+---+----+
#| 1| A|
#| 3| B|
#| 2| C|
#| 4|null|
#+---+----+
I have a dataframe that looks like this:
group, rate
A,0.1
A,0.2
B,0.3
B,0.1
C,0.1
C,0.2
How can I transpose this to a wide dataframe? This is what I expect to get:
group, rate_1, rate_2
A,0.1,0.2
B,0.3,0.1
C,0.1,0.2
The number of records in each group is the same. Also, how can I create consistent column names with a prefix or suffix while transposing?
Do you know which function I can use?
Thanks,
Try groupBy with collect_list, then dynamically split the array column into new columns.
Example:
df.show()
#+-----+----+
#|group|rate|
#+-----+----+
#| A| 0.1|
#| A| 0.2|
#| B| 0.3|
#| B| 0.1|
#+-----+----+
from pyspark.sql.functions import col, collect_list, expr

arr_size = 2
exprs = ['group'] + [expr('lst[' + str(x) + ']').alias('rate_' + str(x + 1)) for x in range(arr_size)]
df1 = df.groupBy("group").agg(collect_list(col("rate")).alias("lst"))
df1.select(*exprs).show()
#+-----+------+------+
#|group|rate_1|rate_2|
#+-----+------+------+
#| B| 0.3| 0.1|
#| A| 0.1| 0.2|
#+-----+------+------+
To preserve order in collect_list():
from pyspark.sql.functions import *
from pyspark.sql import *

df = spark.createDataFrame([('A', 0.1), ('A', 0.2), ('B', 0.3), ('B', 0.1)], ['group', 'rate']) \
    .withColumn("mid", monotonically_increasing_id()) \
    .repartition(100)

w = Window.partitionBy("group").orderBy("mid")
w1 = Window.partitionBy("group").orderBy(desc("mid"))

df1 = df.withColumn("lst", collect_list(col("rate")).over(w)) \
    .withColumn("snr", row_number().over(w1)) \
    .filter(col("snr") == 1) \
    .drop(*['mid', 'snr', 'rate'])
df1.show()
#+-----+----------+
#|group| lst|
#+-----+----------+
#| B|[0.3, 0.1]|
#| A|[0.1, 0.2]|
#+-----+----------+
arr_size = 2
exprs=['group']+[expr('lst[' + str(x) + ']').alias('rate_'+str(x+1)) for x in range(0, arr_size)]
df1.select(*exprs).show()
#+-----+------+------+
#|group|rate_1|rate_2|
#+-----+------+------+
#| B| 0.3| 0.1|
#| A| 0.1| 0.2|
#+-----+------+------+
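Another way to preserve order without window functions (a sketch under the same assumption that mid records the original row order) is to collect (mid, rate) structs and sort them:

from pyspark.sql.functions import col, collect_list, sort_array, struct

# collect (mid, rate) structs, sort by mid (the first struct field), then keep only the rates
ordered = (df.groupBy("group")
             .agg(sort_array(collect_list(struct("mid", "rate"))).alias("lst"))
             .select("group", col("lst.rate").alias("lst")))
ordered.show()

The same exprs split shown above can then be applied to ordered.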
I would create a column that ranks your "rate" values and then pivot:
First create a "rank" column by concatenating the string "rate_" with the row_number:
from pyspark.sql.functions import concat, first, lit, row_number
from pyspark.sql import Window
df = df.withColumn(
    "rank",
    concat(
        lit("rate_"),
        row_number().over(Window.partitionBy("group").orderBy("rate")).cast("string")
    )
)
df.show()
#+-----+----+------+
#|group|rate| rank|
#+-----+----+------+
#| B| 0.1|rate_1|
#| B| 0.3|rate_2|
#| C| 0.1|rate_1|
#| C| 0.2|rate_2|
#| A| 0.1|rate_1|
#| A| 0.2|rate_2|
#+-----+----+------+
Now group by the "group" column and pivot on the "rank" column. Since you need an aggregation, use first.
df.groupBy("group").pivot("rank").agg(first("rate")).show()
#+-----+------+------+
#|group|rate_1|rate_2|
#+-----+------+------+
#| B| 0.1| 0.3|
#| C| 0.1| 0.2|
#| A| 0.1| 0.2|
#+-----+------+------+
The above does not depend on knowing the number of records in each group ahead of time.
However, if (as you said) you know the number of records in each group, you can make the pivot more efficient by passing in the values:
num_records = 2
values = ["rate_" + str(i+1) for i in range(num_records)]
df.groupBy("group").pivot("rank", values=values).agg(first("rate")).show()
#+-----+------+------+
#|group|rate_1|rate_2|
#+-----+------+------+
#| B| 0.1| 0.3|
#| C| 0.1| 0.2|
#| A| 0.1| 0.2|
#+-----+------+------+
I need help converting the code below into PySpark (or PySpark SQL) code.
df["full_name"] = df.apply(lambda x: "_".join(sorted((x["first"], x["last"]))), axis=1)
It basically adds a new column named full_name, which concatenates the values of the columns first and last in sorted order.
I have written the code below, but I don't know how to sort the columns' text values:
df= df.withColumn('full_name', f.concat(f.col('first'),f.lit('_'), f.col('last')))
For Spark 2.4+:
We can use the array_join and array_sort functions for this case.
Example:
df.show()
#+-----+----+
#|first|last|
#+-----+----+
#| a| b|
#| e| c|
#| d| a|
#+-----+----+
from pyspark.sql.functions import *
# first build an array from the first and last columns, then sort it and join it with "_"
df.withColumn("full_name", array_join(array_sort(array(col("first"), col("last"))), "_")).show()
#+-----+----+---------+
#|first|last|full_name|
#+-----+----+---------+
#| a| b| a_b|
#| e| c| c_e|
#| d| a| a_d|
#+-----+----+---------+
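Since the question also asks for a PySpark SQL variant, here is a minimal sketch expressing the same logic through expr (the backticks are only there to keep first/last from being read as SQL function names):

from pyspark.sql.functions import expr

# same array -> sort -> join pipeline, written as a SQL expression
df.withColumn("full_name", expr("array_join(array_sort(array(`first`, `last`)), '_')")).show()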
The Spark SQL FROM statement can point directly at a file path and format,
but the header is ignored when loading a CSV.
Can the header row be used for the column names?
~ > cat test.csv
a,b,c
1,2,3
4,5,6
scala> spark.sql("SELECT * FROM csv.`test.csv`").show()
19/06/12 23:44:40 WARN ObjectStore: Failed to get database csv, returning NoSuchObjectException
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
| a| b| c|
| 1| 2| 3|
| 4| 5| 6|
+---+---+---+
I want:
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
+---+---+---+
If you want to do it in plain SQL you should create a table or view first:
CREATE TEMPORARY VIEW foo
USING csv
OPTIONS (
path 'test.csv',
header true
);
and then SELECT from it:
SELECT * FROM foo;
To use this method with SparkSession.sql, remove the trailing ; and execute each statement separately.
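For example, in PySpark (a sketch; the same two statements work from spark.sql in Scala as well):

# create the view, then query it; note there are no trailing semicolons
spark.sql("""
    CREATE TEMPORARY VIEW foo
    USING csv
    OPTIONS (path 'test.csv', header true)
""")
spark.sql("SELECT * FROM foo").show()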
I don't think a pure SQL solution is available in Spark 2.4.3, which is the latest version at the time of writing. This syntax is parsed by the rule ResolveSQLOnFile, which always calls the DataSource constructor with an empty options map.
I can verify that putting a breakpoint in the DataSource constructor and modifying options to Map("header" -> "true") does the trick, so this is evidently where it would have to be implemented.
You can try this:
scala> val df = spark.read.format("csv").option("header", "true").load("test.csv")
df: org.apache.spark.sql.DataFrame = [a: string, b: string ... 1 more field]
scala> df.show
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
+---+---+---+
A SQL way is below:
scala> val df = spark.read.format("csv").option("header", "true").load("test.csv")
df: org.apache.spark.sql.DataFrame = [a: string, b: string ... 1 more field]
scala> df.createOrReplaceTempView("table")
scala> spark.sql("SELECT * FROM table").show
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
+---+---+---+