Add a new column to a DataFrame in Spark SQL - apache-spark-sql

I am trying to work with Spark SQL by adding a new column to a dataframe.
My code is:
val df = spark.read.json("C:/Users/A661758/Desktop/TEST-XSLT.json")
df.withColumn("UID", new org.apache.spark.sql.Column("UID"))
The error is:
cannot resolve 'UID' given input columns:
I am using Spark 2.1.0 and Scala 2.11.8.
Thank you.

If you need to add a new column to an existing dataframe, use withColumn. The error occurs because new org.apache.spark.sql.Column("UID") refers to an existing column named UID, which your input dataframe does not have; to add a constant value, wrap it in lit.
import org.apache.spark.sql.functions.{lit, monotonically_increasing_id}
import org.apache.spark.sql.types.{IntegerType, StringType}
For a new column of string type, try this:
df.withColumn("UID", lit("value").cast(StringType))
For a new column of integer type, try this:
df.withColumn("UID", lit("1").cast(IntegerType))
For a new auto-incrementing integer column, try this:
df.withColumn("RowID", monotonically_increasing_id().cast(IntegerType))

Related

Update some rows of a dataframe or create new dataframe in PySpark

I am new to PySpark and my objective is to use a PySpark script in AWS Glue for:
reading a dataframe from an input file in Glue => done
changing columns of some rows which satisfy a condition => facing issue
writing the updated dataframe with the same schema into S3 => done
The task seems very simple, but I could not find a way to complete it and am still facing different issues as my code changes.
So far, my code looks like this:
Transform2.printSchema()  # input schema after reading
Transform2 = Transform2.toDF()

def updateRow(row):
    # my logic to update row based on a global condition
    # if row["primaryKey"] == "knownKey": row["otherAttribute"] = None
    return row

LocalTransform3 = []  # creating new dataframe from Transform2
for row in Transform2.rdd.collect():
    row = row.asDict()
    row = updateRow(row)
    LocalTransform3.append(row)
print(len(LocalTransform3))

columns = Transform2.columns
Transform3 = spark.createDataFrame(LocalTransform3).toDF(*columns)
print('Transform3 count', Transform3.count())
Transform3.printSchema()
Transform3.show(1, truncate=False)

Transform4 = DynamicFrame.fromDF(Transform3, glueContext, "Transform3")
print('Transform4 count', Transform4.count())
I tried multiple things like:
using map to update existing rows in a lambda
using collect()
using createDataFrame() to create a new dataframe
But I faced errors at the following steps:
not being able to create a new updated RDD
not being able to create a new dataframe from the RDD using the existing columns
Some errors I got in Glue, at different stages:
ValueError: Some of types cannot be determined after inferring
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. Traceback (most recent call last):
Any working code snippet or help is appreciated.
from pyspark.sql.functions import col, lit, when
Transform2 = Transform2.toDF()
withKeyMapping = Transform2.withColumn('otherAttribute', when(col("primaryKey") == "knownKey", lit(None)).otherwise(col('otherAttribute')))
This should work for your use-case.
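If you also need to push the result back out through Glue (step 3 in the question), here is a minimal sketch of the remaining plumbing, assuming the glueContext from the question and a hypothetical S3 path; note that no collect() is needed, since the when/otherwise expression runs on the executors.
from awsglue.dynamicframe import DynamicFrame

# convert the updated Spark dataframe back to a Glue DynamicFrame
Transform4 = DynamicFrame.fromDF(withKeyMapping, glueContext, "Transform4")

# hypothetical sink settings; replace the path and format with your actual ones
glueContext.write_dynamic_frame.from_options(
    frame=Transform4,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},  # placeholder path
    format="json",
)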

A binary column created in the PySpark dataframe cannot be used as a filter?

I am using PySpark to create an additional binary column in my dataframe and then using it to filter the dataframe, but this raises an error.
The data, the created binary column, its schema, and the filter error were shared as screenshots (not reproduced here).
Try using the filter function:
from pyspark.sql.functions import col

df_filter = df_bc.filter(col('binary_col') == 'false')
df_filter.show()
You are adding binary_col to the df_bc dataframe, not to df_.
Try accessing binary_col from the df_bc dataframe:
df_filter=df_bc.where(df_bc.binary_col)
df_filter.show()
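Since the actual data is only in the screenshots, here is a self-contained sketch of the general pattern, using hypothetical id and amount columns in place of the real data: create the boolean column on the dataframe you intend to filter, then filter that same dataframe.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df_ = spark.createDataFrame([(1, 10), (2, 0), (3, 7)], ["id", "amount"])  # stand-in data

# create the boolean column on the same dataframe you later filter
df_bc = df_.withColumn("binary_col", col("amount") > 0)

df_bc.filter(col("binary_col")).show()   # rows where binary_col is True
df_bc.filter(~col("binary_col")).show()  # rows where binary_col is False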

How to set an existing field as a primary key in spark dataframe?

When I write the data of a Spark dataframe into a SQL DB using the JDBC connector, it overwrites the properties of the table.
So I want to set the key field in the Spark dataframe before writing the data.
url = "jdbc:sqlserver://{0}:{1};database={2};user={3};password={4};encrypt=true;trustServerCertificate=false; hostNameInCertificate=*.database.windows.net;loginTimeout=30;".format(jdbcHostname, jdbcPort, jdbcDatabase, JDBCusername, JDBCpassword)
newSchema_Product_names = [StructField('product__code',StringType(), False),
StructField('product__names__lang_code',StringType(),False),
StructField('product__names__name',StringType(),True),
StructField('product__country__code',StringType(),True),
StructField('product__country__name',StringType(),True)
]
Product_names1 = sqlContext.createDataFrame(Product_names_new,StructType(newSchema_Product_names))
Product_names1.write.mode("overwrite").jdbc(url, "product_names")
Before and after screenshots of the table were attached (not reproduced here).
@cronoik yes, that's correct. I thought the JDBC connector did not support the truncate option, but after changing the size of the fields it worked for me.
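For reference, a minimal sketch of the write that the comment refers to, reusing the url and dataframe from the question: with mode("overwrite") plus the truncate option, Spark truncates the existing table instead of dropping and recreating it, so keys and column definitions set in the database are preserved.
# truncate the existing table rather than drop and recreate it,
# so keys and column sizes defined in the database survive the overwrite
Product_names1.write \
    .mode("overwrite") \
    .option("truncate", "true") \
    .jdbc(url, "product_names")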

Spark Dataframe sql in java - How to escape single quote

I'm using spark-core, spark-sql, spark-hive 2.10 (1.6.1), and scala-reflect 2.11.2. I'm trying to filter a dataframe created through a hive context...
df = hiveCtx.createDataFrame(someRDDRow, someDF.schema());
One of the columns that I'm trying to filter on has multiple single quotes in it. My filter query will be something similar to:
df = df.filter("not (someOtherColumn= 'someOtherValue' and comment= 'That's Dany's Reply'"));
In my Java class where this filter occurs, I tried to replace the String variable, e.g. commentValueToFilterOut, which contains the value "That's Dany's Reply", with:
commentValueToFilterOut = commentValueToFilterOut.replaceAll("'", "\\\\'");
But when I apply the filter to the dataframe, I get the error below...
java.lang.RuntimeException: [1.103] failure: ``)'' expected but identifier
s found
not (someOtherColumn= 'someOtherValue' and comment= 'That\'s Dany\'s Reply'' )
^
scala.sys.package$.error(package.scala:27)
org.apache.spark.sql.catalyst.SqlParser$.parseExpression(SqlParser.scala:49)
org.apache.spark.sql.DataFrame.filter(DataFrame.scala:768)
Please advise...
We implemented a workaround to overcome this issue.
Workaround:
Create a new column in the dataframe and copy the values from the actual column (which contains special characters, such as the single quote, that may cause issues) into the new column with those special characters stripped.
df = df.withColumn("comment_new", functions.regexp_replace(df.col("comment"),"'",""));
Strip the special characters out of the condition value and apply the filter:
String commentToFilter = "That's Dany's Reply";
commentToFilter = commentToFilter.replaceAll("'", "");
df = df.filter("(someOtherColumn= 'someOtherValue' and comment_new= '" + commentToFilter + "')");
Now that the filter has been applied, you can drop the new column that you created for the sole purpose of filtering, restoring the original dataframe.
df = df.drop("comment_new");
If you don't want to create a new column in the dataframe, you can also replace the special character with some "never-happen" string literal in the same column, e.g.
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"), "'", "^^^^"));
and do the same with the string literal that you want to filter against:
String commentToFilter = "That's Dany's Reply";
commentToFilter = commentToFilter.replaceAll("'", "^^^^");
df = df.filter("(someOtherColumn= 'someOtherValue' and comment= '" + commentToFilter + "')");
Once filtering is done, restore the actual values by reverse-applying the string literal (note the escaping, since the first argument of regexp_replace is a regular expression):
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"), "\\^\\^\\^\\^", "'"));
Though this doesn't answer the actual issue, anyone facing the same problem can try this out as a workaround.
The actual solution could be to use sqlContext (instead of hiveContext) and/or Dataset (instead of DataFrame) and/or upgrade to spark-hive 2.12; experts are welcome to debate and answer.
PS: Thanks to KP, my lead
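As an alternative that sidesteps quote escaping entirely, the filter can be built with Column expressions instead of a SQL string, so the value is passed as data rather than parsed. A minimal sketch of the idea, shown in PySpark for brevity (the Java equivalent composes functions.col(...).equalTo(...) the same way); df and the column names are taken from the question.
from pyspark.sql.functions import col

comment_to_filter = "That's Dany's Reply"

# no SQL string is parsed here, so single quotes need no escaping
filtered = df.filter(
    ~((col("someOtherColumn") == "someOtherValue") & (col("comment") == comment_to_filter))
)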

Creating User Defined Function in Spark-SQL

I am new to Spark and Spark SQL, and I was trying to query some data using Spark SQL.
I need to fetch the month from a date which is given as a string.
I think it is not possible to query the month directly in Spark SQL, so I was thinking of writing a user-defined function in Scala.
Is it possible to write a UDF in Spark SQL, and if so, can anybody suggest the best method of writing one?
You can do this, at least for filtering, if you're willing to use a language-integrated query.
For a data file dates.txt containing:
one,2014-06-01
two,2014-07-01
three,2014-08-01
four,2014-08-15
five,2014-09-15
You can pack as much Scala date magic in your UDF as you want but I'll keep it simple:
def myDateFilter(date: String) = date contains "-08-"
Set it all up as follows -- a lot of this is from the Programming guide.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
// case class for your records
case class Entry(name: String, when: String)
// read and parse the data
val entries = sc.textFile("dates.txt").map(_.split(",")).map(e => Entry(e(0),e(1)))
You can use the UDF as part of your WHERE clause:
val augustEntries = entries.where('when)(myDateFilter).select('name, 'when)
and see the results:
augustEntries.map(r => r(0)).collect().foreach(println)
Notice the version of the where method I've used, declared as follows in the doc:
def where[T1](arg1: Symbol)(udf: (T1) ⇒ Boolean): SchemaRDD
So, the UDF can only take one argument, but you can compose several .where() calls to filter on multiple columns.
Edit for Spark 1.2.0 (and really 1.1.0 too)
While it's not really documented, Spark now supports registering a UDF so it can be queried from SQL.
The above UDF could be registered using:
sqlContext.registerFunction("myDateFilter", myDateFilter)
and if the table was registered
sqlContext.registerRDDAsTable(entries, "entries")
it could be queried using
sqlContext.sql("SELECT * FROM entries WHERE myDateFilter(when)")
For more details see this example.
In Spark 2.0, you can do this:
// define the UDF
def convert2Years(date: String) = date.substring(7, 11)
// register to session
sparkSession.udf.register("convert2Years", convert2Years(_: String))
val moviesDf = getMoviesDf // create dataframe usual way
moviesDf.createOrReplaceTempView("movies") // 'movies' is used in sql below
val years = sparkSession.sql("select convert2Years(releaseDate) from movies")
In PySpark 1.5 and above, we can easily achieve this with built-in functions.
Following is an example:
raw_data = [
    ("2016-02-27 23:59:59", "Gold", 97450.56),
    ("2016-02-28 23:00:00", "Silver", 7894.23),
    ("2016-02-29 22:59:58", "Titanium", 234589.66),
]
Time_Material_revenue_df = sqlContext.createDataFrame(raw_data, ["Sold_time", "Material", "Revenue"])
from pyspark.sql.functions import *
Day_Material_reveneu_df = Time_Material_revenue_df.select(to_date("Sold_time").alias("Sold_day"), "Material", "Revenue")
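And since the original question is about pulling the month out of a date string, the built-in month function covers that without any UDF; a small sketch continuing the example above, using the column names defined there:
from pyspark.sql.functions import to_date, month

Month_Material_revenue_df = Time_Material_revenue_df.select(
    month(to_date("Sold_time")).alias("Sold_month"),  # 2 for all three sample rows
    "Material",
    "Revenue",
)
Month_Material_revenue_df.show()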