Synapse polybase data ingestion is not working - azure-synapse

I have a task to convert the jobs from synapse bulk insert to synapse polybase pattern. As part of that I see that it doesn't work straight away. It is complaining about some datatypes etc as below.... where as there is no double datatypes sometimes in the source query. Please help to understand if there a basic pattern or casting we need to do before we use polybase.
Here the source SQL I used
SELECT TOP (1000) cast([SiteCode_SourceId] as varchar(1000))
[SiteCode_SourceId]
,cast([EquipmentCode_SourceId] as varchar(1000))
[EquipmentCode_SourceId]
,FORMAT([RecordedAt],'yyyy-MM-dd HH:mm:ss.fffffff') AS
[RecordedAt]
,cast([DataLineage_SK] as varchar(1000)) [DataLineage_SK]
,cast([DataQuality_SK] AS varchar(1000)) [DataQuality_SK]
,cast([FixedPlantAsset_SK] as varchar(1000))
[FixedPlantAsset_SK]
,cast([ProductionTimeOfDay_SK] as varchar(1000))
[ProductionTimeOfDay_SK]
,cast([ProductionType_SK] as varchar(1000)) [ProductionType_SK]
,cast([Shift_SK] as varchar(1000)) [Shift_SK]
,cast([Site_SK] as varchar(1000)) [Site_SK]
,cast([tBelt] as varchar(1000)) [tBelt]
,FORMAT([ModifiedAt],'yyyy-MM-dd HH:mm:ss.fffffff') [ModifiedAt]
,FORMAT([SourceUpdatedAt],'yyyy-MM-dd HH:mm:ss.fffffff')
[SourceUpdatedAt]
FROM [ORXX].[public_XX].[fact_FixedXXXX]
Operation on target cp_data_movement failed: ErrorCode=PolybaseOperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error happened when loading data into SQL Data Warehouse. Operation: 'Polybase operation'.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=System.Data.SqlClient.SqlException,Message=HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: class parquet.io.api.Binary$ByteArraySliceBackedBinary cannot be cast to class java.lang.Double (parquet.io.api.Binary$ByteArraySliceBackedBinary is in unnamed module of loader 'app'; java.lang.Double is in module java.base of loader 'bootstrap'),Source=.Net SqlClient Data Provider,SqlErrorNumber=106000,Class=16,ErrorCode=-2146232060,State=1,Errors=[{Class=16,Number=106000,State=1,Message=HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: class parquet.io.api.Binary$ByteArraySliceBackedBinary cannot be cast to class java.lang.Double (parquet.io.api.Binary$ByteArraySliceBackedBinary is in unnamed module of loader 'app'; java.lang.Double is in module java.base of loader 'bootstrap'),},],'

Reasons for this error can be,
Order of the columns in the target table is not matching with the source table. So, there will be data type mismatch
Data types in parquet file is Incompatible to target tables' data type.
Solution:
Make sure the order of the columns are same as in parquet staging file.
Keep the same data types in source columns and target columns.

Related

Loading CSV data containing string and numeric format to Ignite is failing

I am evaluating Ignite and trying to load CSV data to Apache Ignite. I have created a table in Ignite:
jdbc:ignite:thin://127.0.0.1/> create table if not exists SAMPLE_DATA_PK(SID varchar(30),id_status varchar(50), active varchar, count_opening int,count_updated int,ID_caller varchar(50),opened_time varchar(50),created_at varchar(50),type_contact varchar, location varchar,support_incharge varchar,pk varchar(10) primary key);
I tried to load data to this table with command:
copy from '/home/kkn/data/sample_data_pk.csv' into SAMPLE_DATA_PK(SID,ID_status,active,count_opening,count_updated,ID_caller,opened_time,created_at,type_contact,location,support_incharge,pk) format csv;
But the data load is failing with this error:
Error: Server error: class org.apache.ignite.internal.processors.query.IgniteSQLException: Value conversion failed [column=COUNT_OPENING, from=java.lang.String, to=java.lang.Integer] (state=50000,code=1)
java.sql.SQLException: Server error: class org.apache.ignite.internal.processors.query.IgniteSQLException: Value conversion failed [column=COUNT_OPENING, from=java.lang.String, to=java.lang.Integer]
at org.apache.ignite.internal.jdbc.thin.JdbcThinConnection.sendRequest(JdbcThinConnection.java:1009)
at org.apache.ignite.internal.jdbc.thin.JdbcThinStatement.sendFile(JdbcThinStatement.java:336)
at org.apache.ignite.internal.jdbc.thin.JdbcThinStatement.execute0(JdbcThinStatement.java:243)
at org.apache.ignite.internal.jdbc.thin.JdbcThinStatement.execute(JdbcThinStatement.java:560)
at sqlline.Commands.executeSingleQuery(Commands.java:1054)
at sqlline.Commands.execute(Commands.java:1003)
at sqlline.Commands.sql(Commands.java:967)
at sqlline.SqlLine.dispatch(SqlLine.java:734)
at sqlline.SqlLine.begin(SqlLine.java:541)
at sqlline.SqlLine.start(SqlLine.java:267)
at sqlline.SqlLine.main(SqlLine.java:206)
Below is the sample data I am trying to load:
SID|ID_status|active|count_opening|count_updated|ID_caller|opened_time|created_at|type_contact|location|support_incharge|pk
|---|---------|------|-------------|-------------|---------|-----------|----------|------------|--------|----------------|--|
INC0000045|New|true|1000|0|Caller2403|29-02-2016 01:16|29-02-2016 01:23|Phone|Location143||1
INC0000045|Resolved|true|0|3|Caller2403|29-02-2016 01:16|29-02-2016 01:23|Phone|Location143||2
INC0000045|Closed|false|0|1|Caller2403|29-02-2016 01:16|29-02-2016 01:23|Phone|Location143||3
INC0000047|Active|true|0|1|Caller2403|29-02-2016 04:40|29-02-2016 04:57|Phone|Location165||4
INC0000047|Active|true|0|2|Caller2403|29-02-2016 04:40|29-02-2016 04:57|Phone|Location165||5
INC0000047|Active|true|0|489|Caller2403|29-02-2016 04:40|29-02-2016 04:57|Phone|Location165||6
INC0000047|Active|true|0|5|Caller2403|29-02-2016 04:40|29-02-2016 04:57|Phone|Location165||7
INC0000047|AwaitingUserInfo|true|0|6|Caller2403|29-02-2016 04:40|29-02-2016 04:57|Phone|Location165||8
INC0000047|Closed|false|0|8|Caller2403|29-02-2016 04:40|29-02-2016 04:57|Phone|Location165||9
INC0000057|New|true|0|0|Caller4416|29-02-2016 06:10||Phone|Location204||10
Need help to understand how to figure out what is the issue and resolve it
You have to upload CSV without header line. Which contains the column names. An error is thrown when trying to convert the string value "count_opening" to a Integer.

what is the alternative for double datatype from spark sql(Databricks) to Sql Server Data warehouse

I have to load the data from azure datalake to data warehouse.I have created set up for creating external tables.there is one column which is double datatype, i have used decimal type in sql server data warehouse for creating the external table and file format is parquet.But using csv it is working.
i'm getting the following error.
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered
filling record reader buffer: ClassCastException: class
java.lang.Double cannot be cast to class parquet.io.api.Binary
(java.lang.Double is in module java.base of loader 'bootstrap';
parquet.io.api.Binary is in unnamed module of loader 'app'.
Can some one help me on this issue?
Thanks in advance.
CREATE EXTERNAL TABLE [dbo].[EXT_TEST1]
( A VARCHAR(10),B decimal(36,19)))
(DATA_SOURCE = [Azure_Datalake],LOCATION = N'/A/B/PARQUET/*.parquet/',FILE_FORMAT =parquetfileformat,REJECT_TYPE = VALUE,REJECT_VALUE = 1)
Column datatype in databricks:
A string,B double
Data: A | B
'a' 100.0050
Use float(53) which is of 53 digits precision and 8 bytes length.

SSIS XML Source Error - Input string was not in a correct format

I have an attribute tlost with the definition below in the XSD file. I have tried both use="required" and use="optional".
<xs:attributeGroup name="defense">
<xs:attribute name="tlost" use="required" type="xs:decimal"/>
</xs:attributeGroup>
In the XML document I am trying to import I will get a value like the following:
<defense ast="0" category="special_team" tlost="0" int="0"/>
I am executing an SSIS package that takes the tlost value and inserts it into a sql database table. The column in the database table has a datatype of DECIMAL(28,10) and allows nulls.
When I execute the package, the previous values work perfectly and the data is inserted. However, when I get a value where tlost="" in the XML file, the package fails and the record is not inserted.
In the data flow path editor, the data type for tlost is DT_DECIMAL. When I check the Advanced Editor for the XML Source, the Input and Output properties have a data type for tlost as decimal [DT_DECIMAL].
I can't figure out why this is failing. I tried to create a derived column and cast it as a (DT_DECIMAL, 10) data type. That didn't work. I tried to check for a null value and replace with 0 if null, that didn't work. So I just ignored the column all together and in the Derived Column task, I replaced the tlost column value with (DT_DECIMAL, 10) 0 to just insert a 0 value and ignore whatever is in the xml file, and the job still failed with the following error message:
Error: 0xC020F444 at Load Play Summary Tables, XML Source [1031]: The error "Input string was not in a correct format." occurred while processing "XML Source.Outputs[defense].Columns[tlost]".
Error: 0xC02090FB at Load Play Summary Tables, XML Source [1031]: The "XML Source" failed because error code 0x80131537 occurred, and the error row disposition on "XML Source.Outputs[defense].Columns[tlost]" at "XML Source.Outputs[defense]" specifies failure on error. An error occurred on the specified object of the specified component.
Error: 0xC02092AF at Load Play Summary Tables, XML Source [1031]: The XML Source was unable to process the XML data. Pipeline component has returned HRESULT error code 0xC02090FB from a method call.
Error: 0xC0047038 at Load Play Summary Tables, SSIS.Pipeline: SSIS Error Code DTS_E_PRIMEOUTPUTFAILED. The PrimeOutput method on XML Source returned error code 0xC02092AF. The component returned a failure code when the pipeline engine called PrimeOutput(). The meaning of the failure code is defined by the component, but the error is fatal and the pipeline stopped executing. There may be error messages posted before this with more information about the failure.
Please help. I have exhausted everything I can think of to fix this issue. I am processing hundreds of files, and I can't keep fixing bad data files every time this issue occurs.
Can you please try these
1 - Change to data type to string in xsd and before loading into tables take care of data type conversion.
2 - If possible generate the xsd by passing your xml and then verify the data type and use it accordingly ...
rest of the xsd can be changed accordingly...
below is screen grab of what I tried. hope it helps]1

invalid datetime format

I have a question about powercenter message code: RR-4035. I have a mapping in which i am using a sql override query, this error is in sql override. This mapping is failing with an error,
'[IBM][CLI DRIVER]CLIO113E SQLSTATE 22007:An invalid datetime format
was detected, that is an invalid string representation or value was
specified'.
> Database driver error:
Function name:Fetch
SQL STMNT:
select s.employee_record_id,s.employee_id,s.record_origin,
cnt.employee_contract_id,cnt.employee_contract_efctv_dt,cnt.employee_contract_term_dt,club.employee_club
from
employee_main_info s
inner join
(select
employee_id,record_origin,employee_contract_term_dt,employee_contract_efctv_dt
from employee_perm
union
select
employee_id,record_origin,employee_contract_term_dt,employee_contract_efctv_dt
from employee_temp
) cnt on s.employee_id=cnt.employee_id,
employee_club_data club
where
club.employee_id=s.employee_id
and (cnt.employee_contract_efctv_dt <=sysdate or cnt.employee_contract_efctv_dt<'2016-01-01')
and s.employee_record_term_dt>sysdate;
native error code= -99999
I have tried everything, my previous mappings have run fine with the same datetime formats but this one is failing. One thing i have noticed is that if i remove all the transformations in between the source qualifier and target the mapping succeeds and data gets loaded to target, but as soon as i put any lookups or expressions between source qualifier and target except a pass through expression, the mapping fails again.
Any suggestion, any help regarding this is appreciated.
We've seen this error occurring when SELECTing from a table with a timestamp column via the IBM Data Server ODBC/CLI driver. It only happened on one Windows machine and we were able to make the error disappear by changing the regional setting main selection from Israel to USA.
While not tested yet, it may be that the IBM DB2 ODBC configuration option DateTimeStringFormat or the attributes SQL_ATTR_DATE_FMT and SQL_ATTR_TIME_FMT can be used to force a specific format (such as JIS). See https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.apdv.cli.doc/doc/r0011525.html

CSV file input not working together with set field value step in Pentaho Kettle

I have a very simple Pentaho Kettle transformation that causes a strange error. It consists of reading a field X from a CSV, add a field Y, set Y=X and finally write it back to another CSV.
Here you can see the steps and the configuration for them:
You can also download the ktr file from here. The input data is just this:
1
2
3
When I run this transformation, I get this error message:
ERROR (version 5.4.0.1-130, build 1 from 2015-06-14_12-34-55 by buildguy) : Unexpected error
ERROR (version 5.4.0.1-130, build 1 from 2015-06-14_12-34-55 by buildguy) : org.pentaho.di.core.exception.KettleStepException:
Error writing line
Error writing field content to file
Y Number : There was a data type error: the data type of [B object [[B©b4136a] does not correspond to value meta [Number]
at org.pentaho.di.trans.steps.textfiIeoutput.TextFiIeOutput.writeRowToFile(TextFiIeOutput.java:273)
at org.pentaho.di.trans.steps.textfiIeoutput.TextFileOutput.processRow(TextFiIeOutput.java:195)
at org.pentaho.di.trans.step.RunThread.run(RunThread.java:62)
atjava.Iang.Thread.run(Unknown Source)
Caused by: org.pentaho.di.core.exception.KettleStepException:
Error writing field content to file
Y Number : There was a data type error: the data type of [B object [[B©b4136a] does not correspond to value meta [Number]
at org.pentaho.di.trans.steps.textfiIeoutput.TextFiIeOutput.writeField(TextFileOutput.java:435)
at org.pentaho.di.trans.steps.textfiIeoutput.TextFiIeOutput.writeRowToFile(TextFiIeOutput.java:249)
3 more
Caused by: org.pentaho.di.core.exception.KettleVaIueException:
Y Number : There was a data type error: the data type of [B object [[B©b4136a] does not correspond to value meta [Number]
at org.pentaho.di.core.row.vaIue.VaIueMetaBase.getBinaryString(VaIueMetaBase.java:2185)
at org.pentaho.di.trans.steps.textfiIeoutput.TextFiIeOutput.formatField(TextFiIeOutput.java:290)
at org.pentaho.di.trans.steps.textfiIeoutput.TextFiIeOutput.writeField(TextFileOutput.java:392)
4 more
All of the above lines start with 2015/09/23 12:51:18 - Text file output.0 -, but I edited it out for brevity. I think the relevant, and confusing, part of the error message is this:
Y Number : There was a data type error: the data type of [B object [[B©b4136a] does not correspond to value meta [Number]
Some further notes:
If I bypass the set field value step by using the lower hop instead, the transformation finish without errors. This leads me to believe that it is the set field value step that causes the problem.
If I replace the CSV file input with a data frame with the same data (1,2,3) everything works just fine.
If I replace the file output step with a dummy the transformation finish without errors. However, if I preview the dummy, it causes a similar error and the field Y has the value <null> on all three rows.
Before I created this MCVE I got the error on all sorts of seemingly random steps, even when there was no file output present. So I do not think this is related to the file output.
If I change the format from Number to Integer, nothing changes. But if I change it to string the transformations finish without errors, and I get this output:
X;Y
1;[B#49e96951
2;[B#7b016abf
3;[B#1a0760b0
Is this a bug? Am I doing something wrong? How can I make this work?
It's because of lazy conversion. Turn it off. This is behaving exactly as designed - although admittedly the error and user experience could be improved.
Lazy conversion must not be used when you need to access the field value in your transformation. That's exactly what it does. The default should probably be off rather than on.
If your field is going directly to a database, then use it and it will be faster.
You can even have "partially lazy" streams, where you use lazy conversion for speed, but then use select values step, to "un-lazify" the fields you want to access, whilst the remainder remain lazy.
Cunning huh?