BizTalk - delimited flat file validation - schema

I have a schema for a semicolon-delimited flat file. The schema has 2 mandatory fields (minOccurs=1) and 2 optional fields (minOccurs=0). When I use a file with 2, 3 or 4 fields (i.e. "var1;var2" or "var1;var2;;var4"), it works as expected. However, when I use a file that has a 5th field (i.e. "var1;var2;;var4;invalidvar5"), rather than throwing an input validation error, BizTalk simply assumes the extra field is part of the value of field 4, so the resulting XML contains
<var4>var4;invalidvar5</var4>
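In other words, what I want the parser to do boils down to a simple field-count check, sketched here in plain Python (outside BizTalk) just to make the expected behaviour concrete:
line = "var1;var2;;var4;invalidvar5"
fields = line.split(";")
print(len(fields))            # 5 - one more than the 4 fields the schema defines
print(2 <= len(fields) <= 4)  # False - the file should be rejected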
Is there a way, without resorting to pipeline components but just using settings in the schema, to have BizTalk recognize that the wrong number of fields was supplied and treat the file as invalid?
Thanks!

Related

What happens if I send integers to a BigQuery "string" field?

One of the columns I send (in my code) to BigQuery contains integers. I added the columns to BigQuery too hastily and defined them as type string.
Will they be automatically converted? Or will the data be totally corrupted (i.e. I cannot trust the resulting string at all)?
Data shouldn't be automatically converted, as that would defeat the purpose of having a table schema.
What I've seen people do is save a whole JSON line as a string and then process that string inside BigQuery. Other than that, if you try to save values that don't correspond to the field's schema definition, you should see an error being thrown.
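A quick way to check what the API reports is a streaming-insert sketch like this one, using the google-cloud-bigquery Python client (the project, dataset, table and column names are made up for illustration):
from google.cloud import bigquery

client = bigquery.Client()
# Hypothetical table "my-project.my_dataset.my_table" with a STRING column "my_col".
errors = client.insert_rows_json("my-project.my_dataset.my_table", [{"my_col": 123}])
print(errors)  # a non-empty list if BigQuery rejects the row, [] if it accepts it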
If you need to change a table schema's definition, you can check this tutorial on updating a table schema.
Actually, BigQuery automatically converted the integers I sent it to strings, so my table populates OK.

How to validate data types in Pig?

I have been trying to validate the data types of the data that I get from a flat file through Pig.
A simple CAT can do the trick, but the flat files are huge and they sometimes contain special characters.
I need to filter out the records that contain special characters, and also those where the data type is not int.
Is there any way to do this in Pig?
I am trying to find a substitute for the getType().getName() kind of usage in Java here.
Enforcing a schema and using DESCRIBE while loading the data, then removing the mismatches, is what we do now, but is there any way to do it without enforcing the schema?
Any suggestions will be helpful.
Load the data into a line:chararray and use a regular expression to filter out the records that contain characters other than numbers:
A = LOAD 'data.txt' AS (line:chararray);
B = FILTER A BY (line matches '\\d+$'); -- Change according to your needs.
DUMP B;
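For example, if data.txt contains the lines 123, abc, 45x and 678, only (123) and (678) survive the filter: matches in Pig requires the whole line to match the pattern, so any line with a non-digit character is dropped.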

Exporting numeric to flat files

I'm creating a file export, and a number of the fields are set up as numeric(18,4) in both the source table and the flat file columns. When the file is generated, any numbers which are < 1 are written as .#, e.g. .52 instead of 0.52.
What needs to be done to fix this? The only ways I can think of, neither of which is ideal, are:
1. Output them as strings (see the sketch after this list)
2. Use a derived column on all the numeric fields.
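To illustrate option 1, this is the kind of string formatting that restores the leading zero, shown in plain Python just as a sketch (the real export would do the equivalent in the ETL layer):
from decimal import Decimal

value = Decimal(".52")   # the value as it currently appears in the file
print(f"{value:.4f}")    # 0.5200 - fixed-point formatting keeps the leading zero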

Importing File With Field Terminators In Data

I've been given some csv files that I want to turn into tables in a SQL database. However, the genius who created the files used comma delimiters, even though several data fields contain commas. So when I try to BCP the data into the database, I get a whole bunch of errors.
Is there a way that I can escape the commas that aren't field separators? At the moment I'm tempted to write a script to manually replace every comma in each file with a pipe, and then go through and manually change the affected rows back.
The only way to fix this is to write a script or program that fixes the data.
If the bad data is limited to a single field, the process should be trivial:
Consume the row from either side, counting off the good delimiters and replacing each with a new, unique delimiter; what remains in the middle is the column containing the extra old delimiters, which you can just leave as is.
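A minimal sketch of that single-bad-field case in Python (the column count, column position and sample row are made up for illustration):
NUM_COLS = 5   # total columns per row
BAD_COL = 2    # zero-based index of the column that may contain stray commas

def repair_row(row):
    parts = row.split(",", BAD_COL)                  # peel off the columns before the bad one
    head, rest = parts[:BAD_COL], parts[BAD_COL]
    tail = rest.rsplit(",", NUM_COLS - BAD_COL - 1)  # peel off the columns after it
    middle, right = tail[0], tail[1:]
    return "|".join(head + [middle] + right)         # good delimiters become pipes

print(repair_row("1,2020-01-02,Acme, Inc.,42,OK"))
# 1|2020-01-02|Acme, Inc.|42|OK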
If you have two bad fields straddling good fields, you need some kind of more advanced logic. For instance, I once had XML data containing delimiters; I had to parse the XML until I found a terminating tag and then process the remaining delimiters as needed.

Querying text file with SQL converts large numbers to NULL

I am importing data from a text file and have hit a snag. I have a numeric field which occasionally has very large values (10 billion+), and some of these values are being converted to NULLs.
Upon further testing I have isolated the problem as follows: the first 25 rows of data are used to determine the field type, and if none of the first 25 values are large, any value >= 2,147,483,648 (2^31) that comes after is thrown out.
I'm using ADO and the following connection string:
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=FILE_ADDRESS;Extended Properties=""text;HDR=YES;FMT=Delimited""
Therefore, can anyone suggest how I can get round this problem without having to sort the source data descending on the large-value column? Is there some way I could define the data types of the recordset prior to importing, rather than letting it decide for itself?
Many thanks!
You can use a schema.ini file, placed in the directory you are connecting to, which describes the column types.
See here for details:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms709353(v=vs.85).aspx
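For example, a minimal schema.ini sketch (the file and column names are made up; the file goes in the same directory as the text file being queried):
[yourdata.txt]
ColNameHeader=True
Format=CSVDelimited
MaxScanRows=0
Col1=BigNumberColumn Double
Col2=Description Text
With explicit Coln entries the driver no longer has to guess the types from the first rows it scans, so values above 2^31 in that column should come through intact.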