I am running a test to populate a table in Redshift. I added mock data to a CSV file and then converted it to Parquet with pandas. I'm using the COPY command to get the data from the parquet file in the S3 bucket into my Redshift database.
I got the error:
'file has an incompatible Parquet schema for column'
Those columns are DECIMAL(12,3).
Looking at my converted parquet file in the S3 console, I found that 'e0' had been appended to the values, for example:
{"id":2873130000000000000,"field1":9.335e0,"field2":9.335e0}
My code to convert to parquet is standard:
import pandas as pd
df = pd.read_csv('test.csv')
df.to_parquet('test.parquet')
At this point it seems these added values are why I'm getting that 'incompatibility' error. Why are these values being added, and how can I prevent it?
It looks like you are writing the parquet file with these fields in scientific notation, where 'e' stands for 'times ten to the power of', e.g. 1.1e2 equals 110. Check how pandas is formatting those columns.
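If the Redshift target columns are DECIMAL(12,3), one way to avoid the mismatch is to give the parquet columns an explicit decimal type rather than the float64 pandas infers. A minimal sketch, assuming pyarrow is the parquet engine and reusing the column names from the example above:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from decimal import Decimal

df = pd.read_csv('test.csv')
# Quantize the float columns to three decimal places as Python Decimals
for col in ['field1', 'field2']:
    df[col] = df[col].apply(lambda x: Decimal(str(x)).quantize(Decimal('0.001')))

# Write with an explicit Arrow schema so the parquet columns are DECIMAL(12,3)
schema = pa.schema([
    ('id', pa.int64()),
    ('field1', pa.decimal128(12, 3)),
    ('field2', pa.decimal128(12, 3)),
])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pq.write_table(table, 'test.parquet')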
Related
I am loading a parquet file into BigQuery using the bq load command. My parquet file contains column names that start with a number (e.g. 00_abc, 01_xyz). Since BigQuery doesn't support column names starting with a number, I have created columns in BigQuery such as _00_abc, _01_xyz.
But I am unable to load the parquet file into BigQuery using the bq load command.
Is there any way to tell the bq load command that the source column 00_abc (from the parquet file) should load into the target column _00_abc (in BigQuery)?
Thanks in advance.
Regards,
Gouranga Basak
It's general best practice to not start a Parquet column name with a number. You will experience compatibility issues with more than just bq load. For example, many Parquet readers use the parquet-avro library, and Avro's documentation says:
The name portion of a fullname, record field names, and enum symbols must:
start with [A-Za-z_]
subsequently contain only [A-Za-z0-9_]
The solution here is to rename the column in the Parquet file. Depending on how much control you have over the Parquet file's creation, you may need to write a Cloud Function to rename the columns (pandas DataFrames won't complain about your column names).
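If you can regenerate the file, a minimal sketch of that rename step in pandas (the file names are placeholders, and the underscore prefix simply mirrors the target names from the question):

import pandas as pd

df = pd.read_parquet('source.parquet')
# Prefix any column that starts with a digit so it matches the BigQuery targets (_00_abc, _01_xyz, ...)
df.columns = ['_' + c if c[0].isdigit() else c for c in df.columns]
df.to_parquet('renamed.parquet', index=False)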
When I read a CSV file with a customized schema definition, the column count I get is different from the one I get with inferSchema. Can anyone help me understand why this happens?
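For context, a minimal PySpark sketch of the two reads being compared; the file name and schema here are hypothetical, but the pattern shows how a supplied schema whose field count doesn't match the file can yield a different column count than inferSchema:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Read 1: let Spark infer the schema from the data
inferred = spark.read.option('header', True).option('inferSchema', True).csv('data.csv')

# Read 2: supply a custom schema; Spark uses exactly these fields,
# so any mismatch with the file shows up as a different column count
custom = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True),
])
defined = spark.read.schema(custom).option('header', True).csv('data.csv')

print(len(inferred.columns), len(defined.columns))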
I am taking a pandas dataframe, converting it to a CSV file, uploading it to S3, and then copying that S3 file to a Redshift table.
Upon Redshift ingestion, I receive an error indicating that an integer column with a value of 0 is being read in as 0.0. I converted the data type to an integer at the dataframe level before writing it to a CSV file, so I think the error must arise when it becomes a CSV file. When I open the actual file, however, the field throwing the error still shows 0. Any guesses as to how I can preserve that value so the ingestion runs smoothly?
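One common cause is a NaN somewhere in that column: pandas upcasts integer columns containing NaN to float64, so to_csv writes 0.0 even after an earlier int conversion. A minimal sketch of forcing the nullable integer dtype right before writing (the column and file names are placeholders):

import pandas as pd

df = pd.DataFrame({'count': [0, 3, None]})   # the None upcasts the column to float64
df['count'] = df['count'].astype('Int64')    # pandas nullable integer dtype keeps 0 as 0
df.to_csv('output.csv', index=False)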
I exported CSV files from Cloud SQL to Cloud Storage and am trying to load them into BigQuery. I found that the files have unbalanced quotes instead of NULL values, e.g. "N
Table Header: Name Updated_Time
Sample row in CSV: "ABC","N
bq load --source_format=CSV --null_marker="\"N" --quote='"' \
    project_name:dataset_name.table_name gs://bucket_name/folder_name/*
Getting this error:
CSV table references column position 1, but line starting at position:0 contains only 1 column.
I want to show the content of a parquet file using Spark SQL, but since a column name in the parquet file contains a space, I am getting this error:
Attribute name "First Name" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
I have written the code below:
val r1 = spark.read.parquet("filepath")
val r2 = r1.toDF()
r2.select(r2("First Name").alias("FirstName")).show()
but I am still getting the same error.
Try renaming the column first instead of aliasing it:
val r3 = r2.withColumnRenamed("First Name", "FirstName")
r3.show()
For anyone still looking for an answer,
There is no optimised way to remove spaces from column names while dealing with parquet data.
What can be done is:
Change the column names at the source itself, i.e., while creating the parquet data itself.
OR
(NOT THE OPTIMISED WAY - WON'T WORK FOR HUGE DATASETS) Read the parquet file using pandas and rename the columns on the pandas dataframe; if required, write the dataframe back to parquet using pandas itself and then continue with Spark (see the sketch below).
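A minimal sketch of that second option, assuming the file is small enough to fit in memory (file names are placeholders):

import pandas as pd

df = pd.read_parquet('input.parquet')
# Strip the spaces from every column name, then write the file back out
df.columns = [c.replace(' ', '') for c in df.columns]
df.to_parquet('renamed.parquet', index=False)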
PS: With the new pandas API on Spark scheduled to arrive in PySpark 3.2, mixing pandas with Spark might be much faster and better optimised when dealing with huge datasets.
For anybody struggling with this, the only thing that worked for me was:
base_df = spark.read.parquet(filename)          # read once just to get the schema
for c in base_df.columns:
    base_df = base_df.withColumnRenamed(c, c.replace(" ", ""))
df = spark.read.schema(base_df.schema).parquet(filename)   # re-read with the cleaned schema
This is from this thread: Spark Dataframe validating column names for parquet writes (scala)
Alias, withColumnRenamed, and "as" SQL select statements wouldn't work. PySpark would still use the old name whenever trying to .show() the dataframe.