How to make Spark SQL use the CSV header when loading a file in the FROM clause - sql

In Spark SQL, the FROM clause can specify a file path and format directly.
However, the header is ignored when a CSV file is loaded this way.
Is there a way to use the header row for the column names?
~ > cat test.csv
a,b,c
1,2,3
4,5,6
scala> spark.sql("SELECT * FROM csv.`test.csv`").show()
19/06/12 23:44:40 WARN ObjectStore: Failed to get database csv, returning NoSuchObjectException
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
| a| b| c|
| 1| 2| 3|
| 4| 5| 6|
+---+---+---+
This is what I want instead:
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
+---+---+---+

If you want to do it in plain SQL you should create a table or view first:
CREATE TEMPORARY VIEW foo
USING csv
OPTIONS (
path 'test.csv',
header true
);
and then SELECT from it:
SELECT * FROM foo;
To use this method with SparkSession.sql, remove the trailing ; and execute each statement separately.
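The note about running each statement separately can be sketched as a small helper. This is only a sketch under assumptions: `split_statements` and `run_script` are hypothetical names (not part of Spark), the split is deliberately naive and assumes no `;` appears inside string literals or comments, and `spark` is assumed to be an existing SparkSession.

```python
def split_statements(script):
    """Split a SQL script on ';' and drop empty pieces.

    Naive on purpose: assumes no ';' inside string literals or comments.
    """
    return [part.strip() for part in script.split(";") if part.strip()]


def run_script(spark, script):
    """Run each statement through spark.sql and return the last result (or None)."""
    result = None
    for statement in split_statements(script):
        result = spark.sql(statement)
    return result


# The two statements from the answer above, as one script.
SCRIPT = """
CREATE TEMPORARY VIEW foo
USING csv
OPTIONS (
  path 'test.csv',
  header true
);
SELECT * FROM foo;
"""
```

With a live session, `run_script(spark, SCRIPT).show()` would create the view and then display the result of the final SELECT.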

I don't think a pure SQL solution is available in Spark 2.4.3, which is the latest version at the time of writing. This syntax is resolved by the rule ResolveSQLOnFile, which always calls the DataSource constructor with an empty options map.
I can confirm that setting a breakpoint in the DataSource constructor and changing options to Map("header" -> "true") does the trick, so this is where support would need to be implemented.

You can try this:
scala> val df = spark.read.format("csv").option("header", "true").load("test.csv")
df: org.apache.spark.sql.DataFrame = [a: string, b: string ... 1 more field]
scala> df.show
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
+---+---+---+
A SQL way is below:
scala> val df = spark.read.format("csv").option("header", "true").load("test.csv")
df: org.apache.spark.sql.DataFrame = [a: string, b: string ... 1 more field]
scala> df.createOrReplaceTempView("table")
scala> spark.sql("SELECT * FROM table").show
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
+---+---+---+

Related

pySpark not able to handle Multiline string in CSV file while selecting columns

I am trying to load a CSV file that looks like the following, using PySpark.
A^B^C^D^E^F
"Yash"^"12"^""^"this is first record"^"nice"^"12"
"jay"^"13"^""^"
In second record, I am new line at the beingnning"^"nice"^"12"
"Nova"^"14"^""^"this is third record"^"nice"^"12"
When I read this file and select a few columns entire dataframe gets messed up.
import pyspark.sql.functions as F
df = (
spark.read
.option("delimiter", "^")
.option('header',True) \
.option("multiline", "true")
.option('multiLine', True) \
.option("escape", "\"")
.csv(
"test3.csv",
header=True,
)
)
df.show()
df = df.withColumn("isdeleted", F.lit(True))
select_cols = ['isdeleted','B','D','E','F']
df = df.select(*select_cols)
df.show()
(truncated some import statements for readability of code)
This is what I see when the above code runs
Before column selection (entire DF)
+----+---+----+--------------------+----+---+
| A| B| C| D| E| F|
+----+---+----+--------------------+----+---+
|Yash| 12|null|this is first record|nice| 12|
| jay| 13|null|\nIn second recor...|nice| 12|
|Nova| 14|null|this is third record|nice| 12|
+----+---+----+--------------------+----+---+
After df.select(*select_cols)
+---------+----+--------------------+----+----+
|isdeleted| B| D| E| F|
+---------+----+--------------------+----+----+
| true| 12|this is first record|nice| 12|
| true| 13| null|null|null|
| true|nice| null|null|null|
| true| 14|this is third record|nice| 12|
+---------+----+--------------------+----+----+
Here, the second row, which contains a newline character, is broken into 2 rows. The output file is also messed up in the same way as the dataframe preview shown above.
I am using the AWS Glue image amazon/aws-glue-libs:glue_libs_4.0.0_image_01, which uses Spark 3.3.0. I also tried Spark 3.1.1 and see the same issue in both versions.
I am not sure whether this is a bug in Spark or whether I am missing something here. Any help will be appreciated.
You are specifying the wrong escape character. The default is \, but you are setting it to the quote character. Drop that option:
df = spark.read.csv('test.csv', sep='^', header=True, multiLine=True)
df.show()
df.select('B').show()
+----+---+----+--------------------+----+---+
| A| B| C| D| E| F|
+----+---+----+--------------------+----+---+
|Yash| 12|null|this is first record|nice| 12|
| jay| 13|null|\nIn second recor...|nice| 12|
|Nova| 14|null|this is third record|nice| 12|
+----+---+----+--------------------+----+---+
+---+
| B|
+---+
| 12|
| 13|
| 14|
+---+
You will get the desired result.
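As a sanity check that is independent of Spark, the sample can be parsed with Python's standard `csv` module to confirm how many logical records the file really contains. This is a sketch only: it uses Python's parser, not Spark's, so it says nothing about Spark's `escape` handling, and `parse_records` is a hypothetical helper name.

```python
import csv
import io

# The sample file from the question, including the quoted field that spans two lines.
SAMPLE = (
    'A^B^C^D^E^F\n'
    '"Yash"^"12"^""^"this is first record"^"nice"^"12"\n'
    '"jay"^"13"^""^"\n'
    'In second record, I am new line at the beingnning"^"nice"^"12"\n'
    '"Nova"^"14"^""^"this is third record"^"nice"^"12"\n'
)


def parse_records(text):
    """Parse '^'-delimited text; a quoted field may contain newlines."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="^", quotechar='"')
    header = next(reader)
    return header, list(reader)


header, records = parse_records(SAMPLE)
print(header)
print(len(records))
```

If the number of records printed here differs from the row count Spark reports, the reader options (quote, escape, multiLine) are the first place to look.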

pyspark get latest non-null element of every column in one row

Let me explain my question using an example:
I have a dataframe:
pd_1 = pd.DataFrame({'day':[1,2,3,2,1,3],
'code': [10, 10, 20,20,30,30],
'A': [44, 55, 66,77,88,99],
'B':['a',None,'c',None,'d', None],
'C':[None,None,'12',None,None, None]
})
df_1 = sc.createDataFrame(pd_1)
df_1.show()
Output:
+---+----+---+----+----+
|day|code| A| B| C|
+---+----+---+----+----+
| 1| 10| 44| a|null|
| 2| 10| 55|null|null|
| 3| 20| 66| c| 12|
| 2| 20| 77|null|null|
| 1| 30| 88| d|null|
| 3| 30| 99|null|null|
+---+----+---+----+----+
What I want to achieve is a new dataframe, each row corresponds to a code, and for each column I want to have the most recent non-null value (with highest day).
In pandas, I can simply do
pd_2 = pd_1.sort_values('day', ascending=True).groupby('code').last()
pd_2.reset_index()
to get
code day A B C
0 10 2 55 a None
1 20 3 66 c 12
2 30 3 99 d None
My question is, how can I do it in pyspark (preferably version < 3)?
What I have tried so far is:
from pyspark.sql import Window
import pyspark.sql.functions as F
w = Window.partitionBy('code').orderBy(F.desc('day')).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
## Update: after applying @Steven's idea to remove the for loop:
df_1 = df_1.select([F.collect_list(x).over(w).getItem(0).alias(x) for x in df_1.columns])
##for x in df_1.columns:
## df_1 = df_1.withColumn(x, F.collect_list(x).over(w).getItem(0))
df_1 = df_1.distinct()
df_1.show()
Output
+---+----+---+---+----+
|day|code| A| B| C|
+---+----+---+---+----+
| 2| 10| 55| a|null|
| 3| 30| 99| d|null|
| 3| 20| 66| c| 12|
+---+----+---+---+----+
Which I'm not very happy with, especially due to the for loop.
I think your current solution is quite nice. If you want another solution, you can try the first/last window functions (df here is your input dataframe):
from pyspark.sql import functions as F, Window
w = Window.partitionBy("code").orderBy(F.col("day").desc())
df2 = (
df.select(
"day",
"code",
F.row_number().over(w).alias("rwnb"),
*(
F.first(F.col(col), ignorenulls=True)
.over(w.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
.alias(col)
for col in ("A", "B", "C")
),
)
.where("rwnb = 1")
.drop("rwnb")
)
and the result :
df2.show()
+---+----+---+---+----+
|day|code| A| B| C|
+---+----+---+---+----+
| 2| 10| 55| a|null|
| 3| 30| 99| d|null|
| 3| 20| 66| c| 12|
+---+----+---+---+----+
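To make the intended semantics explicit, here is a plain-Python reference of "latest non-null value per column within each code". It is only a sketch for checking results on small data, not a replacement for the Spark code; `latest_non_null` is a hypothetical helper, and it assumes the ordering column is numeric.

```python
from itertools import groupby
from operator import itemgetter

# Same data as the question's dataframe.
ROWS = [
    {"day": 1, "code": 10, "A": 44, "B": "a", "C": None},
    {"day": 2, "code": 10, "A": 55, "B": None, "C": None},
    {"day": 3, "code": 20, "A": 66, "B": "c", "C": "12"},
    {"day": 2, "code": 20, "A": 77, "B": None, "C": None},
    {"day": 1, "code": 30, "A": 88, "B": "d", "C": None},
    {"day": 3, "code": 30, "A": 99, "B": None, "C": None},
]


def latest_non_null(rows, key="code", order="day"):
    """For each key, take every column's first non-null value in descending order."""
    # Sort by key, then by the ordering column descending (assumes numeric values).
    ordered = sorted(rows, key=lambda r: (r[key], -r[order]))
    result = []
    for value, group in groupby(ordered, key=itemgetter(key)):
        group = list(group)
        merged = {key: value}
        for col in group[0]:
            if col != key:
                # First non-null value, or None if the whole column is null.
                merged[col] = next((r[col] for r in group if r[col] is not None), None)
        result.append(merged)
    return result


for row in latest_non_null(ROWS):
    print(row)
```

This mirrors what `first(..., ignorenulls=True)` over a window ordered by day descending computes for each partition.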
Here's another way of doing it, using array functions and struct ordering instead of a Window:
from pyspark.sql import functions as F
other_cols = ["day", "A", "B", "C"]
df_1 = df_1.groupBy("code").agg(
F.collect_list(F.struct(*other_cols)).alias("values")
).selectExpr(
"code",
*[f"array_max(filter(values, x-> x.{c} is not null))['{c}'] as {c}" for c in other_cols]
)
df_1.show()
#+----+---+---+---+----+
#|code|day| A| B| C|
#+----+---+---+---+----+
#| 10| 2| 55| a|null|
#| 30| 3| 99| d|null|
#| 20| 3| 66| c| 12|
#+----+---+---+---+----+

Filtering rows in pyspark dataframe and creating a new column that contains the result

I am trying to identify the crimes that happened within the SF downtown boundary on Sundays. My idea was to first write a UDF that labels whether each crime is in the area I define as downtown: "1" if it happened within the area and "0" if not. After that, I want to create a new column to store those results. I tried my best, but it doesn't work for some reason. Here is the code I wrote:
from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf
def filter_dt(x,y):
    if (((x < -122.4213) & (x > -122.4313)) & ((y > 37.7540) & (y < 37.7740))):
        return '1'
    else:
        return '0'
schema = StructType([StructField("isDT", BooleanType(), False)])
filter_dt_boolean = udf(lambda row: filter_dt(row[0], row[1]), schema)
#First, pick out the crime cases that happens on Sunday BooleanType()
q3_sunday = spark.sql("SELECT * FROM sf_crime WHERE DayOfWeek='Sunday'")
#Then, we add a new column for us to filter out(identify) if the crime is in DT
q3_final = q3_result.withColumn("isDT", filter_dt(q3_sunday.select('X'),q3_sunday.select('Y')))
The error I am getting is: [picture of the error message]
My guess is that the udf I am having right now doesn't support the whole column as input to be compared, but I don't know how to fix it to make it work. Please help! Thank you!
A sample data would have helped. For now I assume that your data looks like this:
+----+---+---+
|val1| x| y|
+----+---+---+
| 10| 7| 14|
| 5| 1| 4|
| 9| 8| 10|
| 2| 6| 90|
| 7| 2| 30|
| 3| 5| 11|
+----+---+---+
Then you don't need a UDF, as you can do the evaluation using the when() function:
import pyspark.sql.functions as F
tst= sqlContext.createDataFrame([(10,7,14),(5,1,4),(9,8,10),(2,6,90),(7,2,30),(3,5,11)],schema=['val1','x','y'])
tst_res = tst.withColumn("isdt",F.when(((tst.x.between(4,10))&(tst.y.between(11,20))),1).otherwise(0))
This will give the result:
tst_res.show()
+----+---+---+----+
|val1| x| y|isdt|
+----+---+---+----+
| 10| 7| 14| 1|
| 5| 1| 4| 0|
| 9| 8| 10| 0|
| 2| 6| 90| 0|
| 7| 2| 30| 0|
| 3| 5| 11| 1|
+----+---+---+----+
If I have got the data wrong and you still need to pass multiple values to a UDF, you have to pass them as an array or a struct. I prefer a struct:
from pyspark.sql.functions import udf
from pyspark.sql.types import *
@udf(IntegerType())
def check_data(row):
    # Same bounds as the when() version above: 4 <= x <= 10 and 11 <= y <= 20
    if 4 <= row.x <= 10 and 11 <= row.y <= 20:
        return 1
    else:
        return 0
tst_res1 = tst.withColumn("isdt",check_data(F.struct('x','y')))
The result will be the same. However, it is generally better to avoid UDFs and use Spark's built-in functions, since the Catalyst optimizer cannot see the logic inside a UDF and therefore cannot optimize it.
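If a UDF is still wanted for the original downtown check, the predicate itself can be kept as a plain Python function so it can be tested without Spark. This is a sketch under assumptions: `in_downtown` is a hypothetical name, the bounds are the ones from the question, and X and Y are assumed to be numeric (cast them first if the columns are strings).

```python
def in_downtown(x, y):
    """Return 1 if (x, y) lies strictly inside the question's bounding box, else 0.

    Nulls (None) are treated as "not downtown".
    """
    if x is None or y is None:
        return 0
    return int(-122.4313 < x < -122.4213 and 37.7540 < y < 37.7740)


# Assumed wiring into Spark (not executed here):
#   from pyspark.sql.functions import udf, col
#   from pyspark.sql.types import IntegerType
#   in_downtown_udf = udf(in_downtown, IntegerType())
#   df.withColumn("isDT", in_downtown_udf(col("X"), col("Y")))

print(in_downtown(-122.4250, 37.7600))
```

Keeping the logic in a plain function makes the bounds easy to unit-test, while the when()/between() version remains the better choice for performance.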
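A plain Python function cannot be applied to Column objects directly, so try wrapping filter_dt in a UDF and changing the last line as below:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
filter_dt_udf = udf(filter_dt, StringType())
q3_final = q3_sunday.withColumn("isDT", filter_dt_udf(col('X'), col('Y')))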

Spark SQL: Is there a way to distinguish columns with same name?

I have a CSV file whose header contains several columns with the same name.
I want to process them with Spark using only SQL and be able to refer to these columns unambiguously.
Ex.:
id name age height name
1 Alex 23 1.70
2 Joseph 24 1.89
I want to get only the first name column, using only Spark SQL.
As mentioned in the comments, I think the least error-prone approach would be to change the schema of the input data.
Still, if you are looking for a quick workaround, you can simply index the duplicated column names.
For instance, let's create a dataframe with three id columns.
val df = spark.range(3)
.select('id * 2 as "id", 'id * 3 as "x", 'id, 'id * 4 as "y", 'id)
df.show
+---+---+---+---+---+
| id| x| id| y| id|
+---+---+---+---+---+
| 0| 0| 0| 0| 0|
| 2| 3| 1| 4| 1|
| 4| 6| 2| 8| 2|
+---+---+---+---+---+
Then I can use toDF to set new column names. Let's consider that I know that only id is duplicated. If we don't, adding the extra logic to figure out which columns are duplicated would not be very difficult.
var i = -1
val names = df.columns.map( n =>
if(n == "id") {
i+=1
s"id_$i"
} else n )
val new_df = df.toDF(names : _*)
new_df.show
+----+---+----+---+----+
|id_0| x|id_1| y|id_2|
+----+---+----+---+----+
| 0| 0| 0| 0| 0|
| 2| 3| 1| 4| 1|
| 4| 6| 2| 8| 2|
+----+---+----+---+----+
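The "extra logic to figure out which columns are duplicated" mentioned above can be sketched in Python as follows. `dedupe_names` is a hypothetical helper, not a Spark API; it only builds the new list of names, which could then be passed to `df.toDF(*names)` in PySpark. It assumes that the generated names (such as `id_0`) do not collide with existing column names.

```python
from collections import Counter


def dedupe_names(names):
    """Append a running index to every name that occurs more than once."""
    totals = Counter(names)   # how often each name occurs overall
    seen = Counter()          # how many times each duplicated name was emitted so far
    result = []
    for name in names:
        if totals[name] > 1:
            result.append(f"{name}_{seen[name]}")
            seen[name] += 1
        else:
            result.append(name)
    return result


print(dedupe_names(["id", "x", "id", "y", "id"]))
print(dedupe_names(["id", "name", "age", "height", "name"]))
```

Unique names are left untouched, so only the ambiguous columns get new names.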

Add aggregated columns to pivot without join

Considering the table:
df=sc.parallelize([(1,1,1),(5,0,2),(27,1,1),(1,0,3),(5,1,1),(1,0,2)]).toDF(['id', 'error', 'timestamp'])
df.show()
+---+-----+---------+
| id|error|timestamp|
+---+-----+---------+
| 1| 1| 1|
| 5| 0| 2|
| 27| 1| 1|
| 1| 0| 3|
| 5| 1| 1|
| 1| 0| 2|
+---+-----+---------+
I would like to make a pivot on timestamp column keeping some other aggregated information from the original table. The result I am interested in can be achieved by
df1=df.groupBy('id').agg(sf.sum('error').alias('Ne'),sf.count('*').alias('cnt'))
df2=df.groupBy('id').pivot('timestamp').agg(sf.count('*')).fillna(0)
df1.join(df2, on='id').filter(sf.col('cnt')>1).show()
with the resulting table:
+---+---+---+---+---+---+
| id| Ne|cnt| 1| 2| 3|
+---+---+---+---+---+---+
| 5| 1| 2| 1| 1| 0|
| 1| 1| 3| 1| 1| 1|
+---+---+---+---+---+---+
However, there are at least two issues with the mentioned solution:
1. I am filtering by cnt at the end of the script. If I could do this at the beginning, I could avoid almost all of the processing, because a large portion of the data is removed by this filter. Is there any way to do this other than with the collect and isin methods?
2. I am doing groupBy on id twice: first to aggregate some columns I need in the results, and a second time to get the pivot columns. Finally, I need a join to merge these columns. I feel that I am missing a solution, because it should be possible to do this with just one groupBy and without a join, but I cannot figure out how.
I think you cannot get around the join, because the pivot needs the timestamp values while the first grouping should not consider them. To create the Ne and cnt values you have to group the dataframe by id only, which loses timestamp. If you want to preserve the timestamp values as columns, you have to do the pivot separately, as you did, and join it back.
The only improvement that can be made is to move the filter to the creation of df1. As you said, this could already improve performance, since df1 should be much smaller after the filtering on your real data.
from pyspark.sql.functions import *
df=sc.parallelize([(1,1,1),(5,0,2),(27,1,1),(1,0,3),(5,1,1),(1,0,2)]).toDF(['id', 'error', 'timestamp'])
df1=df.groupBy('id').agg(sum('error').alias('Ne'),count('*').alias('cnt')).filter(col('cnt')>1)
df2=df.groupBy('id').pivot('timestamp').agg(count('*')).fillna(0)
df1.join(df2, on='id').show()
Output:
+---+---+---+---+---+---+
| id| Ne|cnt| 1| 2| 3|
+---+---+---+---+---+---+
| 5| 1| 2| 1| 1| 0|
| 1| 1| 3| 1| 1| 1|
+---+---+---+---+---+---+
Actually, it is possible to avoid the join by using Window functions:
from pyspark.sql import Window
import pyspark.sql.functions as sf
w1 = Window.partitionBy('id')
w2 = Window.partitionBy('id', 'timestamp')
df.select('id', 'timestamp',
sf.sum('error').over(w1).alias('Ne'),
sf.count('*').over(w1).alias('cnt'),
sf.count('*').over(w2).alias('cnt_2')
).filter(sf.col('cnt')>1) \
.groupBy('id', 'Ne', 'cnt').pivot('timestamp').agg(sf.first('cnt_2')).fillna(0).show()
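To check either approach on small data, the expected result can be computed with a plain-Python reference. This is only a sketch: `pivot_reference` is a hypothetical helper, and it works on in-memory tuples of (id, error, timestamp), not on a Spark dataframe.

```python
from collections import Counter, defaultdict

# Same rows as the example dataframe: (id, error, timestamp).
ROWS = [(1, 1, 1), (5, 0, 2), (27, 1, 1), (1, 0, 3), (5, 1, 1), (1, 0, 2)]


def pivot_reference(rows, min_count=2):
    """Per id: sum of error (Ne), row count (cnt), and row count per timestamp.

    Only ids with at least min_count rows are kept (same as cnt > 1 by default).
    """
    errors = Counter()
    counts = Counter()
    per_timestamp = defaultdict(Counter)
    timestamps = sorted({ts for _, _, ts in rows})
    for id_, error, ts in rows:
        errors[id_] += error
        counts[id_] += 1
        per_timestamp[id_][ts] += 1
    return {
        id_: {
            "Ne": errors[id_],
            "cnt": counts[id_],
            # Missing timestamps count as 0, like fillna(0) after the pivot.
            **{ts: per_timestamp[id_][ts] for ts in timestamps},
        }
        for id_ in counts
        if counts[id_] >= min_count
    }


for id_, values in pivot_reference(ROWS).items():
    print(id_, values)
```

Comparing this dictionary with the collected Spark output is a quick way to confirm that the join-based and window-based versions agree.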