Synapse Pipeline Error: The column name is invalid

I'm running a Synapse Pipeline which is moving data from SQL Server into Parquet files, but I'm getting an unusual error message. The error implies I have invalid characters in my column names, which I do not.
{
"errorCode": "2200",
"message": "ErrorCode=ParquetInvalidColumnName,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=The column name is invalid. Column name cannot contain these character:[,;{}()\\n\\t=],Source=Microsoft.DataTransfer.Common,'",
"failureType": "UserError",
"target": "Copy SQL data to Parquet",
"details": []
}
Here are the column names:
As you can see there are no invalid characters.
What is causing this error?

Parquet format does not allow whitespace or special characters in column names. Make sure you do not have any trailing or leading spaces in the column names.
I reproduced this with your column names and was able to load data from SQL Server to a Parquet file successfully.
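One quick way to check for hidden leading or trailing whitespace is to inspect the column names straight from the source metadata. A minimal sketch, assuming a pyodbc connection to the source SQL Server (the server, database, and table name below are placeholders):

import pyodbc

# Placeholder connection string and table name; adjust for your source.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};Server=myserver;Database=mydb;Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.execute(
    "SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = ?",
    "MyTable",
)
for (name,) in cursor.fetchall():
    # repr() makes leading/trailing spaces, tabs, and newlines visible
    print(repr(name))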
Source, sink, mapping, and sink data preview after loading successfully (screenshots).

Related

HIVE_BAD_DATA: Error parsing field value '' for field 12: Cannot convert value of type String to a REAL value

Hello all!
The query in the Athena console:
The response error:
HIVE_BAD_DATA: Error parsing field value '' for field 12: Cannot convert value of type String to a REAL value
I'm trying to create a table queryable in Athena with a Glue crawler in which I specify each column's data type. My input is a CSV file with some empty fields.
Crawler description:
The crawler finds each column and assigns the correct column name but when reading the values I have a parsing error.
I'm wondering if the problem could come from the crawler trying to read empty values as the wrong type.
The error comes from column data that looks like:
Corresponding table schema:
I tried, unsuccessfully, to change the serialization as in:
AWS Athena: "HIVE_BAD_DATA: Error parsing column 'X' : empty String"
Specify a SerDe serialization lib with AWS Glue Crawler
Is there a parameter or a workaround to solve this issue?
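One workaround, not from the threads linked above, is to keep the column as a string in the table definition and convert it at query time; TRY_CAST in Athena returns NULL for values it cannot convert (such as the empty string) instead of failing. A rough sketch using pyathena, where the database, table, column, staging bucket, and region are all placeholders:

import pandas as pd
from pyathena import connect

# Placeholder staging dir and region; the column is assumed to be a string in the table.
conn = connect(
    s3_staging_dir="s3://my-athena-results/",
    region_name="eu-west-1",
)
df = pd.read_sql(
    "SELECT TRY_CAST(col_12 AS REAL) AS col_12 FROM my_database.my_table",
    conn,
)
print(df.head())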

What does this error mean: Required column value for column index: 8 is missing in row starting at position: 0

I'm attempting to upload a CSV file (which is the output of a BCP command) to BigQuery using the gcloud CLI bq load command. I have already uploaded a custom schema file (I was having major issues with autodetect).
One resource suggested this could be a datatype mismatch. However, the table from the SQL DB lists the column as a decimal, so in my schema file I have listed it as FLOAT since decimal is not a supported data type.
I couldn't find any documentation for what the error means and what I can do to resolve it.
What does this error mean? It means, in this context, a value is REQUIRED for a given column index and one was not found. (By the way, columns are usually 0-indexed, meaning a fault at column index 8 is most likely referring to column number 9.)
This can be caused by a myriad of different issues, of which I experienced two.
Incorrectly categorizing NULL columns as NOT NULL. After exporting the schema as JSON from SSMS, I needed to clean it up for BQ, and in doing so I mapped IS_NULLABLE:NO to MODE:NULLABLE and IS_NULLABLE:YES to MODE:REQUIRED. These values should have been reversed. This caused the error because there were NULL columns where BQ expected a REQUIRED value.
Using the wrong delimiter. The file I was outputting was not only comma-delimited but also tab-delimited. I was only able to validate this by using the Get Data tool in Excel and importing the data that way, after which I saw the errors caused by tabs inside the cells.
After outputting with a pipe ( | ) delimiter, I was finally able to successfully load the file into BigQuery without any errors.
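For illustration, here is a sketch of what the corrected load could look like with the Python BigQuery client rather than the bq CLI: optional columns marked NULLABLE and a pipe delimiter. The project, dataset, table, URI, and field names are invented:

from google.cloud import bigquery

client = bigquery.Client()
# Invented schema: mark optional columns as NULLABLE, decimals as FLOAT.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="|",
    schema=[
        bigquery.SchemaField("id", "INTEGER", mode="REQUIRED"),
        bigquery.SchemaField("amount", "FLOAT", mode="NULLABLE"),
        bigquery.SchemaField("comment", "STRING", mode="NULLABLE"),
    ],
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/export.csv",
    "my_project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # waits for completion and raises on load errors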

Fixing error in a SHOW TABLES IN DATABASE name query

I am trying to list all the table in a database in Amazon AWS Athena via a Python script.
Here is my script:
import pandas as pd

data = {'name': ['database1', 'database-name', 'database2']}
# Create DataFrame
df = pd.DataFrame(data)
for index, schema in df.iterrows():
    # conn is an existing Athena connection
    tables_in_schema = pd.read_sql("SHOW TABLES IN " + schema[0], conn)
There is an error running this
When I run the same query in the Athena query editor, I get an error
SHOW TABLES IN database-name
Here is the error
DatabaseError: Execution failed on sql: SHOW TABLES IN database-name
An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: line
1:19: mismatched input '-'. Expecting: '.', 'LIKE', <EOF>
unable to rollback
I think the issue is with the hyphen "-" in the database name.
How do I escape this in the query?
You can use the Glue client instead. It provides a function get_tables(), which returns a list of all the tables in a specific database.
Database, table, and column names cannot contain any special character other than the underscore "_". Any other special character will cause an issue when querying. Athena does not stop you from creating an object with special characters in its name, but you will hit issues when using those objects.
The only way around this is to re-create the database with a name that does not contain the special character, the hyphen "-" in this case.
https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html
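A minimal sketch of the Glue-client route with boto3; the database name and region below are placeholders, and get_tables() is unaffected by the hyphen:

import boto3

# Placeholder region; DatabaseName can contain the hyphen.
glue = boto3.client("glue", region_name="eu-west-1")
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="database-name"):
    for table in page["TableList"]:
        print(table["Name"])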

Parquet troubles with decimal in Azure Data Factory V2

For the last 3 or 4 days I have been having trouble writing decimal values in the Parquet file format with Azure Data Factory V2.
The repro steps are quite simple: from a SQL source containing a numeric value, I map it to a Parquet file using the copy activity.
At runtime the following exception is thrown:
{
"errorCode": "2200",
"message": "Failure happened on 'Source' side. ErrorCode=UserErrorParquetTypeNotSupported,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Decimal Precision or Scale information is not found in schema for column: ADDRESSLONGITUDE,Source=Microsoft.DataTransfer.Richfile.ParquetTransferPlugin,''Type=System.InvalidCastException,Message=Object cannot be cast from DBNull to other types.,Source=mscorlib,'",
"failureType": "UserError",
"target": "Copy Data"
}
In the source, the offending column is defined as numeric(32,6).
I think the problem is confined to the Parquet sink, because changing the destination format to CSV results in a successful pipeline.
Any suggestions?
Based on Jay's answer, here is the whole dataset:
SELECT
[ADDRESSLATITUDE]
FROM
[dbo].[MyTable]
Based on SQL Types to Parquet Logical Types and Data type mapping for Parquet files in data factory copy activity, the Decimal data type is supported. Decimal data is converted into a binary data type.
Back to your error message:
Failure happened on 'Source' side. ErrorCode=UserErrorParquetTypeNotSupported,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Decimal Precision or Scale information is not found in schema for column: ADDRESSLONGITUDE,Source=Microsoft.DataTransfer.Richfile.ParquetTransferPlugin,''Type=System.InvalidCastException,Message=Object cannot be cast from DBNull to other types.,Source=mscorlib,'
If your numeric data contains null values, they are converted into an int data type without any decimal precision or scale information. The CSV format does not go through this conversion, so you could set a default value for your numeric data.
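If you want to confirm that nulls are the trigger, and to see the shape of the fix, here is a small sketch (the connection details are placeholders, and the default value of 0 is only an example):

import pyodbc

# Placeholder connection string.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};Server=myserver;Database=mydb;Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM dbo.MyTable WHERE ADDRESSLONGITUDE IS NULL")
print(cursor.fetchone()[0], "NULL rows in ADDRESSLONGITUDE")

# If the count is non-zero, a source query along these lines in the copy
# activity keeps the precision/scale (the default of 0 is just an example):
#   SELECT CAST(ISNULL(ADDRESSLONGITUDE, 0) AS numeric(32,6)) AS ADDRESSLONGITUDE
#   FROM dbo.MyTable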

Importing CSV to BigQuery parsing error

I have a CSV with 225 rows and see that BigQuery expects the schema to be:
column1_name:data_type,
I have removed all spaces; however, BigQuery doesn't like my schema. It returns "Parsing Error" along with the first field name.
My pasted schema looks like this (partial):
transaction_status:STRING(6),dollarsobligated:NUMERIC(10,2),baseandexercisedoptionsvalue:NUMERIC(10,2),baseandalloptionsvalue:NUMERIC(12,2),maj_agency_cat:STRING(35),mod_agency:STRING(37),maj_fund_agency_cat:STRING(35),contractingofficeagencyid:STRING(37),contractingofficeid:STRING(51),
Try removing the dimensioning; it's not needed. Declaring "String" is optional, as it's the default. Instead of numeric, use "float".
So
transaction_status:STRING(6),dollarsobligated:NUMERIC(10,2),baseandexercisedoptionsvalue:NUMERIC(10,2),baseandalloptionsvalue:NUMERIC(12,2),maj_agency_cat:STRING(35),mod_agency:STRING(37),maj_fund_agency_cat:STRING(35),contractingofficeagencyid:STRING(37),contractingofficeid:STRING(51),
should be
transaction_status,dollarsobligated:float,baseandexercisedoptionsvalue:float,baseandalloptionsvalue:float,maj_agency_cat,mod_agency,maj_fund_agency_cat,contractingofficeagencyid,contractingofficeid