Rename whitespace in column name in Parquet file using Spark SQL

I want to show the content of a Parquet file using Spark SQL, but since the column names in the Parquet file contain spaces I am getting this error:
Attribute name "First Name" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
I have written the code below:
val r1 = spark.read.parquet("filepath")
val r2 = r1.toDF()
r2.select(r2("First Name").alias("FirstName")).show()
but I am still getting the same error.

Try renaming the column first instead of aliasing it:
val r3 = r2.withColumnRenamed("First Name", "FirstName")
r3.show()

For anyone still looking for an answer,
There is no optimised way to remove spaces from column names while dealing with parquet data.
What can be done is:
Change the column names at the source itself, i.e., while creating the Parquet data.
OR
(NOT the optimised way; it won't work for huge datasets) Read the Parquet file using pandas and rename the columns on the pandas dataframe. If required, write the dataframe back to Parquet using pandas itself and then continue with Spark (a sketch of this follows).
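A minimal sketch of the pandas route, assuming the file fits in driver memory ("filepath" is the same placeholder as above, and the output path is illustrative):
import pandas as pd

# read the Parquet file with pandas, strip the spaces from the column names,
# and write it back so Spark can read it without complaining
pdf = pd.read_parquet("filepath")
pdf.columns = [c.replace(" ", "") for c in pdf.columns]
pdf.to_parquet("filepath_renamed")

spark.read.parquet("filepath_renamed").show()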
PS: With the new pandas API on Spark scheduled to arrive in PySpark 3.2, using pandas alongside Spark should become much faster and better optimised when dealing with huge datasets.

For anybody struggling with this, the only thing that worked for me was:
# read once to get the schema, strip the spaces from the column names,
# then re-read the file with the cleaned-up schema
base_df = spark.read.parquet(filename)
for c in base_df.columns:
    base_df = base_df.withColumnRenamed(c, c.replace(" ", ""))
df = spark.read.schema(base_df.schema).parquet(filename)
This is from this thread: Spark Dataframe validating column names for parquet writes (scala)
Alias, withColumnRenamed, and "as" SQL select statements wouldn't work; PySpark would still use the old name whenever trying to .show() the dataframe.

Related

How do I specify data type for Databricks spark dataframe and avoid future problems?

I have a Databricks notebook that will run every 2-4 weeks. It will read in a small CSV, perform ETL in Python, then truncate and load to a Delta table.
This is what I am currently doing to avoid failures related to data types:
Python to replace all '-' with '0'
Python to drop rows with NaN or nan
spark_df = spark.createDataFrame(dfnew)
spark_df.write.saveAsTable("default.test_table", index=False, header=True)
This automatically detects the datatypes and is working right now.
BUT, what if a datatype cannot be detected, or is detected wrongly? I am mostly concerned about doubles, ints, and bigints.
I tested casting but it doesn't work on Databricks:
spark_df = spark.createDataFrame(dfnew.select(dfnew("Year").cast(IntegerType).as("Year")))
Is there a way to feed a DDL to the Spark dataframe on Databricks? Should I not use Spark?
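For reference, spark.createDataFrame does accept an explicit schema, including a DDL-formatted string, so the types don't have to be inferred. A minimal sketch, with illustrative column names and types rather than the question's real ones:
# build the Spark dataframe with an explicit DDL schema instead of relying on inference
spark_df = spark.createDataFrame(
    dfnew,
    schema="Year INT, Amount DOUBLE, CustomerId BIGINT",  # illustrative columns/types
)
spark_df.write.mode("overwrite").saveAsTable("default.test_table")
Individual columns can also be cast in PySpark, e.g. spark_df.withColumn("Year", spark_df["Year"].cast("int")), instead of mixing Scala-style syntax into Python.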

Using spark.read.csv when one column is XML

I have a CSV with 10 columns, one of which is an XML field. When I read this into a Databricks notebook from Azure Data Lake, it splits the XML up into new rows instead of keeping it in the one field.
Is there a way to stop this happening? The data is split across rows when displayed in the notebook, but when I open the CSV directly the XML is all in one field.
I'm using the following code to read the CSV:
sourceDf = spark.read.csv(sourceFilePath, sep=',', header=True, inferSchema=True)
I'm attempting to build a data pipeline in ADF and want to use Databricks to parse the XML field, but I need to be able to read it into Databricks first.
To read the data correctly I needed to set multiLine=True as an option, as below:
sourceDf = spark.read.csv(sourceFilePath, sep=',', header=True, inferSchema=True, multiLine=True)
Then I get a correctly formatted column.

csv file converted to parquet adds 'e0' to end of values

I am running a test to populate a table in Redshift. I added mock data to a CSV file and then converted it to Parquet with pandas. I'm using the COPY command to get the data from the Parquet file in the S3 bucket into my Redshift database.
I got the error:
'file has an incompatible Parquet schema for column'
Those columns are DECIMAL (12,3).
I checked in the S3 console and found that, looking at my converted Parquet file, 'e0' had been added to the end of the values, for example:
{"id":2873130000000000000,"field1":9.335e0,"field2":9.335e0}
My code to convert to parquet is standard:
import pandas as pd
df = pd.read_csv('test.csv')
df.to_parquet('test.parquet')
At this point it seems these added values are why I'm getting the 'incompatibility' error. Why would these values be added, and how can I prevent this?
It looks like you are writing the Parquet file with these fields in scientific notation, where e stands for 'times ten to the power of', e.g. 1.1e2 equals 110. Check how pandas is formatting those columns.
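A sketch of one way to avoid the mismatch, assuming pyarrow is the Parquet engine and that field1 and field2 are the DECIMAL(12,3) columns from the question: convert them to Python Decimals before writing, so the Parquet file stores a fixed-precision decimal rather than a double.
from decimal import Decimal
import pandas as pd

df = pd.read_csv('test.csv')

# pandas reads these columns as float64; convert to Decimal with three decimal
# places so the Parquet writer stores a decimal type instead of a double
for col in ['field1', 'field2']:
    df[col] = df[col].apply(lambda x: Decimal(str(x)).quantize(Decimal('0.001')))

df.to_parquet('test.parquet')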

How to create a table from a dataframe in SparkR

I am trying to find a way to convert a dataframe into a table to be used in another Databricks notebook. I cannot find any documentation regarding doing this in R.
First, convert the R dataframe to a SparkR dataframe using SparkR::createDataFrame(R_dataframe). Then use the saveAsTable function to save it as a permanent table, which can be accessed from other notebooks. SparkR::createOrReplaceTempView will not help if you try to access it from a different notebook.
require(SparkR)
data1 <- createDataFrame(output)
saveAsTable(data1, tableName = "default.sample_table", source="parquet", mode="overwrite")
In the above code, default is an existing database name, under which a new table named sample_table will be created.

Retrieving data from s3 bucket in pyspark

I am reading data from an S3 bucket in PySpark. I need to parallelize the read operation and do some transformation on the data, but it's throwing an error. Below is the code.
s3 = boto3.resource('s3',aws_access_key_id=access_key,aws_secret_access_key=secret_key)
bucket = s3.Bucket(bucket)
prefix = 'clickEvent-2017-10-09'
files = bucket.objects.filter(Prefix = prefix)
keys=[k.key for k in files]
pkeys = sc.parallelize(keys)
I have a global variable d, which is an empty list, and I am appending deviceID data into it.
I am applying flatMap on the keys:
pkeys.flatMap(map_func)
This is the function:
def map_func(key):
    print "in map func"
    for line in key.get_contents_as_string().splitlines():
        # parse one line of json
        content = json.loads(line)
        d.append(content['deviceID'])
But the above code gives me an error.
Can anyone help?
You have two issues that I can see. The first is that you are trying to manually read data from S3 using boto instead of using the direct S3 support built into Spark and Hadoop. It looks like you are trying to read text files containing one JSON record per line. If that is the case, you can just do this in Spark:
df = spark.read.json('s3://my-bucket/path/to/json/files/')
This will create a Spark DataFrame for you by reading in the JSON data with each line as a row. DataFrames require a rigid pre-defined schema (like a relational database table), which Spark will try to determine by sampling some of your JSON data. After you have the DataFrame, all you need to do to get your column is select it like this:
df.select('deviceID')
The other issue worth pointing out is that you are attempting to use a global variable to store data computed across your Spark cluster. It is possible to send data from your driver to all of the executors running on Spark workers using either broadcast variables or implicit closures. But there is no way in Spark to write to a variable in your driver from an executor! To transfer data from executors back to the driver you need to use Spark's action methods, which are intended for exactly this purpose.
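To illustrate that point, here is a sketch (collect_keys is a made-up helper, not from the question): the driver's list is only copied into each task's closure, so appends made on the executors never reach the driver's copy.
d = []

def collect_keys(key):
    d.append(key)  # mutates an executor-local copy of d, not the driver's list
    return key

pkeys.map(collect_keys).count()  # an action forces the tasks to actually run
print(len(d))  # still 0 on the driver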
Actions are methods that tell Spark you want a result computed, so it needs to go and execute the transformations you have told it about. In your case you would probably want one of the following (a sketch of both options is below):
If the results are large: use DataFrame.write to save the results of your transformations back to S3
If the results are small: use DataFrame.collect() to bring them back to your driver and do something with them
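A minimal sketch putting both options together (the bucket and output paths are placeholders):
# read the JSON click events directly with Spark's built-in S3 support
df = spark.read.json('s3://my-bucket/clickEvent-2017-10-09/')

# small result: collect the device IDs back to the driver
device_ids = [row['deviceID'] for row in df.select('deviceID').distinct().collect()]

# large result: write them back to S3 instead of collecting
df.select('deviceID').write.mode('overwrite').parquet('s3://my-bucket/output/device-ids/')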