Nullable field is changed upon writing a Spark DataFrame

The following code reads a Spark DataFrame from a Parquet file and writes it to another Parquet file. The nullable flag of an ArrayType field changes after the DataFrame is written to the new Parquet file.
Code:
SparkConf sparkConf = new SparkConf();
String master = "local[2]";
sparkConf.setMaster(master);
sparkConf.setAppName("Local Spark Test");
JavaSparkContext sparkContext = new JavaSparkContext(new SparkContext(sparkConf));
SQLContext sqc = new SQLContext(sparkContext);
// Read the original Parquet file and print the data type of the third field
DataFrame dataFrame = sqc.read().parquet("src/test/resources/users.parquet");
StructField[] fields = dataFrame.schema().fields();
System.out.println(fields[2].dataType());
// Write the DataFrame out, read it back, and print the same field's data type again
dataFrame.write().mode(SaveMode.Overwrite).parquet("src/test/resources/users1.parquet");
DataFrame dataFrame1 = sqc.read().parquet("src/test/resources/users1.parquet");
StructField[] fields1 = dataFrame1.schema().fields();
System.out.println(fields1[2].dataType());
Output:
ArrayType(IntegerType,false)
ArrayType(IntegerType,true)
Spark version is: 1.6.2

For Spark 2.4 or earlier, all columns written by Spark SQL to Parquet are nullable. Quoting the official guide:
Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
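Since the write itself always relaxes the columns to nullable, a common workaround is to re-apply the saved schema after reading the file back. A minimal sketch in PySpark (the question uses the Java API; the paths are placeholders, and Spark does not actually verify the data against the restored non-null constraint):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("nullable-roundtrip").getOrCreate()

# Capture the schema, including the nullable flags, before writing
original = spark.read.parquet("src/test/resources/users.parquet")
original_schema = original.schema
original.write.mode("overwrite").parquet("src/test/resources/users1.parquet")

# Re-impose the saved schema on the round-tripped data
roundtripped = spark.read.parquet("src/test/resources/users1.parquet")
restored = spark.createDataFrame(roundtripped.rdd, original_schema)
print(restored.schema.fields[2].dataType)  # the nullable flag from the original schema is back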

Related

How to convert S3 bucket content (.csv format) into a dataframe in AWS Lambda

I am trying to ingest S3 data (a CSV file) into RDS (MSSQL) through Lambda. Sample code:
import csv
import json
import boto3
from urllib.parse import unquote_plus

s3 = boto3.client('s3')

def lambda_handler(event, context):
    if event:
        file_obj = event["Records"][0]
        bucketname = str(file_obj["s3"]["bucket"]["name"])
        csv_filename = unquote_plus(str(file_obj["s3"]["object"]["key"]))
        print("Filename: ", csv_filename)
        csv_fileObj = s3.get_object(Bucket=bucketname, Key=csv_filename)
        file_content = csv_fileObj["Body"].read().decode("utf-8").split()
I have tried putting my CSV contents into a list, but it didn't work:
        results = []
        for row in csv.DictReader(file_content):
            results.append(row.values())
        print(results)
        print(file_content)
        return {
            'statusCode': 200,
            'body': json.dumps('S3 file processed')
        }
Is there any way I could convert "file_content" into a dataframe in Lambda? I have multiple columns to load.
Later I would follow this approach to load the data into RDS:
import pyodbc
import pandas as pd

# insert data from csv file into dataframe (df)
server = 'yourservername'
database = 'AdventureWorks'
username = 'username'
password = 'yourpassword'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()

# Insert Dataframe into SQL Server:
for index, row in df.iterrows():
    cursor.execute("INSERT INTO HumanResources.DepartmentTest (DepartmentID,Name,GroupName) values(?,?,?)", row.DepartmentID, row.Name, row.GroupName)
cnxn.commit()
cursor.close()
Can anyone suggest how to go about it?
You can use io.BytesIO to get the bytes into memory and then use pandas' read_csv to turn them into a dataframe. Note that there is a strange SSL download limit that will lead to issues when downloading more than 2 GB of data. That is why I have used chunking in the code below.
import io
import pandas as pd

obj = s3.get_object(Bucket=bucketname, Key=csv_filename)  # reuse the boto3 client from the question
# This should prevent the 2GB download limit from a python ssl internal
chunks = (chunk for chunk in obj["Body"].iter_chunks(chunk_size=1024**3))
data = io.BytesIO(b"".join(chunks))  # This keeps everything fully in memory
df = pd.read_csv(data)  # here you can also pass any args and kwargs read_csv needs
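As a follow-up on the loading side: once the DataFrame exists, the row-by-row cursor.execute loop from the question tends to be slow, and pyodbc's executemany with fast_executemany enabled is usually a better fit. A sketch, assuming the connection cnxn and the HumanResources.DepartmentTest table from the question:
cursor = cnxn.cursor()
cursor.fast_executemany = True  # send parameters in batches instead of one round trip per row
cursor.executemany(
    "INSERT INTO HumanResources.DepartmentTest (DepartmentID, Name, GroupName) VALUES (?, ?, ?)",
    df[["DepartmentID", "Name", "GroupName"]].values.tolist(),
)
cnxn.commit()
cursor.close()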
It appears that your goal is to load the contents of a CSV file from Amazon S3 into SQL Server.
You could do this without using DataFrames (a sketch follows the steps below):
Loop through the event Records (multiple can be passed in)
For each object:
Download the object to /tmp/
Use the Python csv reader to loop through the contents of the file
Generate INSERT statements to insert the data into the SQL Server table
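A minimal sketch of those steps, assuming the pyodbc connection cnxn and the HumanResources.DepartmentTest table from the question, and that the CSV header matches its columns:
import csv
import boto3

s3 = boto3.client('s3')

def load_csv_records(event, cnxn):
    cursor = cnxn.cursor()
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        local_path = "/tmp/" + key.split("/")[-1]  # Lambda can only write under /tmp/
        s3.download_file(bucket, key, local_path)
        with open(local_path, newline="") as f:
            for row in csv.DictReader(f):
                cursor.execute(
                    "INSERT INTO HumanResources.DepartmentTest (DepartmentID, Name, GroupName) VALUES (?, ?, ?)",
                    row["DepartmentID"], row["Name"], row["GroupName"],
                )
    cnxn.commit()
    cursor.close()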
You might also consider using aws-data-wrangler: Pandas on AWS, which is available as a Lambda Layer.
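For example, reading the object straight into pandas (a sketch; the S3 path is a placeholder):
import awswrangler as wr

# Read the CSV from S3 directly into a pandas DataFrame
df = wr.s3.read_csv("s3://your-bucket/path/to/file.csv")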

How to create dataframes from Amazon Ion file in Spark using python/scala?

I am trying to create a dataframe from a file in the Amazon Ion format, but Spark does not offer a built-in format for Ion. I used the json format to load the Ion file, but it throws an error. Is there any way to create a dataframe from an Ion file?
I tried the code below:
val df = spark.read
.format("json")
.load("file:/home/mohan/Downloads/test_journal.ion")
Short answer: no, at least not natively.
Ion is a data format developed by Amazon, with parsers provided in C, Java, JavaScript, and Python. You need to tell Spark how to parse this format somehow.
You can build your own DataFrame by extending the API and following this post
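One pragmatic sketch (assuming the amazon.ion Python package is installed on the executors and that each file holds a sequence of top-level Ion structs; values are flattened to strings for simplicity):
from amazon.ion import simpleion
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

def parse_ion(path_and_content):
    _, content = path_and_content
    # single_value=False returns every top-level value in the document
    for value in simpleion.loads(content, single_value=False):
        # Convert each Ion struct to a plain dict so Spark can infer a schema
        yield Row(**{str(k): str(v) for k, v in value.items()})

raw = spark.sparkContext.wholeTextFiles("file:/home/mohan/Downloads/test_journal.ion")
df = spark.createDataFrame(raw.flatMap(parse_ion))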

reading from hive table and updating same table in pyspark - using checkpoint

I am using Spark version 2.3 and trying to read a Hive table in Spark as:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
df = spark.table("emp.emptable")
Here I am adding a new column with the current system date to the existing dataframe:
import pyspark.sql.functions as F
newdf = df.withColumn('LOAD_DATE', F.current_date())
Now I am facing an issue when I try to write this dataframe as a Hive table:
newdf.write.mode("overwrite").saveAsTable("emp.emptable")
pyspark.sql.utils.AnalysisException: u'Cannot overwrite table emp.emptable that is also being read from;'
So I am checkpointing the dataframe to break the lineage, since I am reading from and writing to the same table:
checkpointDir = "/hdfs location/temp/tables/"
spark.sparkContext.setCheckpointDir(checkpointDir)
df = spark.table("emp.emptable").coalesce(1).checkpoint()
newdf = df.withColumn('LOAD_DATE', F.current_date())
newdf.write.mode("overwrite").saveAsTable("emp.emptable")
This way it works fine and the new column is added to the Hive table, but I have to delete the checkpoint files every time they get created. Is there a better way to break the lineage and write the same dataframe with the updated column, saving it to an HDFS location or as a Hive table?
Or is there any way to specify a temporary location for the checkpoint directory that gets deleted once the Spark session completes?
As we discussed in this post, setting the property below is the way to go.
spark.conf.set("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
That question had a different context: we wanted to retain the checkpointed dataset, so we did not bother with a cleanup solution.
Setting the above property works sometimes (tested with Scala, Java, and Python), but it is hard to rely on. The official documentation says the property "Controls whether to clean checkpoint files if the reference is out of scope." I don't know exactly what that means, because my understanding is that once the Spark session/context is stopped it should clean them up. It would be great if someone could shed light on it.
Regarding
Is there any best way to break the lineage
Check this question; @BiS found a way to cut the lineage using the createDataFrame(RDD, Schema) method. I haven't tested it myself, though.
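In PySpark that trick looks roughly like this (untested, as noted above): rebuilding the dataframe from its own RDD and schema detaches the plan from the source table scan.
import pyspark.sql.functions as F

df = spark.table("emp.emptable")
newdf = df.withColumn('LOAD_DATE', F.current_date())

# Rebuild from the RDD and schema to cut the lineage back to emp.emptable
newdf = spark.createDataFrame(newdf.rdd, newdf.schema)
newdf.write.mode("overwrite").saveAsTable("emp.emptable")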
Just FYI, I usually don't rely on the above property and delete the checkpointed directory in code itself, to be on the safe side.
We can get the checkpointed directory like below:
Scala :
//Set directory
scala> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoint/")
scala> spark.sparkContext.getCheckpointDir.get
res3: String = hdfs://<name-node:port>/tmp/checkpoint/625034b3-c6f1-4ab2-9524-e48dfde589c3
//It gives String so we can use org.apache.hadoop.fs to delete path
PySpark:
# Set directory
>>> spark.sparkContext.setCheckpointDir('hdfs:///tmp/checkpoint')
>>> t = sc._jsc.sc().getCheckpointDir().get()
>>> t
u'hdfs://<name-node:port>/tmp/checkpoint/dc99b595-f8fa-4a08-a109-23643e2325ca'
# note the u'' prefix: it is a unicode object, so wrap it with str(t)
# Below are the steps to get the Hadoop FileSystem object and delete the path
>>> fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
>>> fs.exists(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
True
>>> fs.delete(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
True
>>> fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
>>> fs.exists(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
False
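To address the "delete it once the session completes" part of the question, one option is to register the same deletion as a process-exit hook. An untested sketch that reuses the JVM FileSystem calls above and assumes the py4j gateway is still alive at exit:
import atexit

def cleanup_checkpoint_dir(spark):
    sc = spark.sparkContext
    ckpt = sc._jsc.sc().getCheckpointDir()
    if ckpt.isDefined():
        fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
        fs.delete(sc._jvm.org.apache.hadoop.fs.Path(ckpt.get()), True)  # recursive delete

# Run the cleanup when the Python process exits
atexit.register(cleanup_checkpoint_dir, spark)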

How to read parquet file directly without registering TempTable in SparkSQL

When I run SQL over Parquet files, I have always called sqlContext.read.parquet(), then df.registerTempTable(), then sqlContext.sql(), like this:
val df = sqlContext.read.parquet("path/to/2016.05.30/")
df.registerTempTable("tab")
sqlContext.sql("SELECT * FROM tab")
The Spark manual says:
Instead of using read API to load a file into DataFrame and query it, you can also query that file directly with SQL.
val df = sqlContext.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
So I changed it to this:
val df = sqlContext.sql("SELECT * FROM parquet.`path/to/2016.05.30/`")
But I'm getting an error
org.apache.spark.sql.AnalysisException: no such table parquet.path/to/2016.05.30/;
How Can I query directly?
Querying files directly is supported from Spark 1.6 onwards. Please check the Spark version you are running.

Dataproc + BigQuery examples - any available?

According to the Dataproc docs, it has "native and automatic integrations with BigQuery".
I have a table in BigQuery. I want to read that table and perform some analysis on it using the Dataproc cluster that I've created (using a PySpark job), then write the results of this analysis back to BigQuery. You may be asking "why not just do the analysis in BigQuery directly!?" The reason is that we are creating complex statistical models, and SQL is too high-level for developing them. We need something like Python or R, ergo Dataproc.
Are there any Dataproc + BigQuery examples available? I can't find any.
To begin, as noted in this question the BigQuery connector is preinstalled on Cloud Dataproc clusters.
Here is an example of how to read data from BigQuery into Spark. In this example, we read data from BigQuery as the input for a word count.
You read data from BigQuery in Spark using SparkContext.newAPIHadoopRDD; the Spark documentation has more information about using SparkContext.newAPIHadoopRDD.
import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
import com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredInputFormat
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable
val projectId = "<your-project-id>"
val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
val outputTableSchema =
"[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
val jobName = "wordcount"
val conf = sc.hadoopConfiguration
// Set the job-level projectId.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)
// Use the systemBucket for temporary BigQuery export data used by the InputFormat.
val systemBucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)
// Configure input and output for BigQuery access.
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
BigQueryConfiguration.configureBigQueryOutput(conf,
fullyQualifiedOutputTableId, outputTableSchema)
val fieldName = "word"
val tableData = sc.newAPIHadoopRDD(conf,
classOf[GsonBigQueryInputFormat], classOf[LongWritable], classOf[JsonObject])
tableData.cache()
tableData.count()
tableData.map(entry => (entry._1.toString(),entry._2.toString())).take(10)
You will need to customize this example with your settings, including your Cloud Platform project ID in <your-project-id> and your output table ID in <your-fully-qualified-table-id>.
Finally, if you end up using the BigQuery connector with MapReduce, this page has examples for how to write MapReduce jobs with the BigQuery connector.
The above example doesn't show how to write data to an output table. You need to do this:
// This call goes on an RDD of (String, JsonObject) pairs built from your results;
// hadoopConf is the Hadoop Configuration used for the BigQuery setup above (conf)
.saveAsNewAPIHadoopFile(
  hadoopConf.get(BigQueryConfiguration.TEMP_GCS_PATH_KEY),
  classOf[String],
  classOf[JsonObject],
  classOf[BigQueryOutputFormat[String, JsonObject]],
  hadoopConf)
where the key (a String) is actually ignored.