Context
Currently I am using Snowflake as a Data Warehouse and AWS' S3 as a data lake. The majority of the files that land on S3 are in the Parquet format. For these, I am using a new limited feature by Snowflake (documented here) that automatically detects the schema from the parquet files on S3, which I can use to generate a CREATE TABLE statement with the correct column names and inferred data types. This feature currently only works for Apache Parquet, Avro, and ORC files. I would like to find a way that achieves the same desired objective but for CSV files.
What I have tried to do
This is how I currently infer the schema for Parquet files:
select generate_column_description(array_agg(object_construct(*)), 'table') as columns
from table (infer_schema(location=>'${LOCATION}', file_format=>'${FILE_FORMAT}'))
However if I try specifying the FILE_FORMAT as csv that approach will fail.
Other approaches I have considered:
Transferring all files that land on S3 to parquet (this involves more code, and infra setup so wouldn't be my top choice, especially that I'd like to keep some files in their natural type on s3)
Having a script (using libraries like Pandas in Python for example) that infer the schema for files in S3 (this also involves more code, and will be strange in the sense that parquet files are handled in Snowflake, but non parquet files are handled by some script on aws).
Using a Snowflake UDF to infer the schema. Haven't fully considered my options there yet.
Desired Behaviour
As a new csv file lands on S3 (on a pre-existing STAGE), I would like to infer the schema, and be able to generate a CREATE TABLE statement with the inferred data types. Preferably, I would like to do that within Snowflake as the existing aforementioned schema-inference solution exists there. Happy to add further information if needed.
UPDATE: I modified the SP that infers data types in untyped (all string type columns) tables and it now works directly against Snowflake stages. The project code is available here: https://github.com/GregPavlik/InferSchema
I wrote a stored procedure to assist with this; however, its only goal is to infer the data types of untyped columns. It works as follows:
Load the CSV into a table with all columns defined as varchars.
Call the SP with a query against the new table (main point is to get only the columns you want and limit the row count to keep type inference times reasonable).
Also in the SP call is the DB, schema, and table for the old and new locations -- old with all varchar and new with the inferred types.
The SP will then infer the data types and create two SQL statements. One statement will create the new table with the inferred data types. One statement will copy from the untyped (all varchar) table to the new table with appropriate wrappers such as try_multi_timestamp(), a UDF that extends try_to_timestamp() to try various common formats.
I meant to extend this so that it didn't require the untyped (all varchar) table at all, but haven't gotten around to it. Since it's come up here, I may circle back and update the SP with that capability. You can specify a query that reads directly from the stage, but you'd have to use $1, $2... with aliases for the column names (or else the DDL will try to create column names like $1). If the query runs directly against a stage, for the old DB, schema, and table, you could put in whatever because that's only used to generate an insert from select statement.
-- This shows how to use on the Snowflake TPCH sample, but could be any query.
-- Keep the row count down to reduce the time it take to infer the types.
call infer_data_types('select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM limit 10000',
'SNOWFLAKE_SAMPLE_DATA', 'TPCH_SF1', 'LINEITEM',
'TEST', 'PUBLIC', 'LINEITEM');
create or replace procedure INFER_DATA_TYPES(SOURCE_QUERY string,
DATABASE_OLD string,
SCHEMA_OLD string,
TABLE_OLD string,
DATABASE_NEW string,
SCHEMA_NEW string,
TABLE_NEW string)
returns string
language javascript
as
$$
/****************************************************************************************************
* *
* DataType Classes
* *
****************************************************************************************************/
class Query{
constructor(statement){
this.statement = statement;
}
}
class DataType {
constructor(db, schema, table, column, sourceQuery) {
this.db = db;
this.schema = schema;
this.table = table;
this.sourceQuery = sourceQuery
this.column = column;
this.insert = '"#~COLUMN~#"';
this.totalCount = 0;
this.notNullCount = 0;
this.typeCount = 0;
this.blankCount = 0;
this.minTypeOf = 0.95;
this.minNotNull = 1.00;
}
setSQL(sqlTemplate){
this.sql = sqlTemplate;
this.sql = this.sql.replace(/#~DB~#/g, this.db);
this.sql = this.sql.replace(/#~SCHEMA~#/g, this.schema);
this.sql = this.sql.replace(/#~TABLE~#/g, this.table);
this.sql = this.sql.replace(/#~COLUMN~#/g, this.column);
}
getCounts(){
var rs;
rs = GetResultSet(this.sql);
rs.next();
this.totalCount = rs.getColumnValue("TOTAL_COUNT");
this.notNullCount = rs.getColumnValue("NON_NULL_COUNT");
this.typeCount = rs.getColumnValue("TO_TYPE_COUNT");
this.blankCount = rs.getColumnValue("BLANK");
}
isCorrectType(){
return (this.typeCount / (this.notNullCount - this.blankCount) >= this.minTypeOf);
}
isNotNull(){
return (this.notNullCount / this.totalCount >= this.minNotNull);
}
}
class TimestampType extends DataType{
constructor(db, schema, table, column, sourceQuery){
super(db, schema, table, column, sourceQuery)
this.syntax = "timestamp";
this.insert = 'try_multi_timestamp(trim("#~COLUMN~#"))';
this.sourceQuery = SOURCE_QUERY;
this.setSQL(GetCheckTypeSQL(this.insert, this.sourceQuery));
this.getCounts();
}
}
class IntegerType extends DataType{
constructor(db, schema, table, column, sourceQuery){
super(db, schema, table, column, sourceQuery)
this.syntax = "number(38,0)";
this.insert = 'try_to_number(trim("#~COLUMN~#"), 38, 0)';
this.setSQL(GetCheckTypeSQL(this.insert, this.sourceQuery));
this.getCounts();
}
}
class DoubleType extends DataType{
constructor(db, schema, table, column, sourceQuery){
super(db, schema, table, column, sourceQuery)
this.syntax = "double";
this.insert = 'try_to_double(trim("#~COLUMN~#"))';
this.setSQL(GetCheckTypeSQL(this.insert, this.sourceQuery));
this.getCounts();
}
}
class BooleanType extends DataType{
constructor(db, schema, table, column, sourceQuery){
super(db, schema, table, column, sourceQuery)
this.syntax = "boolean";
this.insert = 'try_to_boolean(trim("#~COLUMN~#"))';
this.setSQL(GetCheckTypeSQL(this.insert, this.sourceQuery));
this.getCounts();
}
}
// Catch all is STRING data type
class StringType extends DataType{
constructor(db, schema, table, column, sourceQuery){
super(db, schema, table, column, sourceQuery)
this.syntax = "string";
this.totalCount = 1;
this.notNullCount = 0;
this.typeCount = 1;
this.minTypeOf = 0;
this.minNotNull = 1;
}
}
/****************************************************************************************************
* *
* Main function *
* *
****************************************************************************************************/
var pass = 0;
var column;
var typeOf;
var ins = '';
var newTableDDL = '';
var insertDML = '';
var columnRS = GetResultSet(GetTableColumnsSQL(DATABASE_OLD, SCHEMA_OLD, TABLE_OLD));
while (columnRS.next()){
pass++;
if(pass > 1){
newTableDDL += ",\n";
insertDML += ",\n";
}
column = columnRS.getColumnValue("COLUMN_NAME");
typeOf = InferDataType(DATABASE_OLD, SCHEMA_OLD, TABLE_OLD, column, SOURCE_QUERY);
newTableDDL += '"' + typeOf.column + '" ' + typeOf.syntax;
ins = typeOf.insert;
insertDML += ins.replace(/#~COLUMN~#/g, typeOf.column);
}
return GetOpeningComments() +
GetDDLPrefixSQL(DATABASE_NEW, SCHEMA_NEW, TABLE_NEW) +
newTableDDL +
GetDDLSuffixSQL() +
GetDividerSQL() +
GetInsertPrefixSQL(DATABASE_NEW, SCHEMA_NEW, TABLE_NEW) +
insertDML +
GetInsertSuffixSQL(DATABASE_OLD, SCHEMA_OLD, TABLE_OLD) ;
/****************************************************************************************************
* *
* Helper functions *
* *
****************************************************************************************************/
function InferDataType(db, schema, table, column, sourceQuery){
var typeOf;
typeOf = new IntegerType(db, schema, table, column, sourceQuery);
if (typeOf.isCorrectType()) return typeOf;
typeOf = new DoubleType(db, schema, table, column, sourceQuery);
if (typeOf.isCorrectType()) return typeOf;
typeOf = new BooleanType(db, schema, table, column, sourceQuery); // May want to do a distinct and look for two values
if (typeOf.isCorrectType()) return typeOf;
typeOf = new TimestampType(db, schema, table, column, sourceQuery);
if (typeOf.isCorrectType()) return typeOf;
typeOf = new StringType(db, schema, table, column, sourceQuery);
if (typeOf.isCorrectType()) return typeOf;
return null;
}
/****************************************************************************************************
* *
* SQL Template Functions *
* *
****************************************************************************************************/
function GetCheckTypeSQL(insert, sourceQuery){
var sql =
`
select count(1) as TOTAL_COUNT,
count("#~COLUMN~#") as NON_NULL_COUNT,
count(${insert}) as TO_TYPE_COUNT,
sum(iff(trim("#~COLUMN~#")='', 1, 0)) as BLANK
--from "#~DB~#"."#~SCHEMA~#"."#~TABLE~#";
from (${sourceQuery})
`;
return sql;
}
function GetTableColumnsSQL(dbName, schemaName, tableName){
var sql =
`
select COLUMN_NAME
from ${dbName}.INFORMATION_SCHEMA.COLUMNS
where TABLE_CATALOG = '${dbName}' and
TABLE_SCHEMA = '${schemaName}' and
TABLE_NAME = '${tableName}'
order by ORDINAL_POSITION;
`;
return sql;
}
function GetOpeningComments(){
return `
/**************************************************************************************************************
* *
* Copy and paste into a worksheet to create the typed table and insert into the new table from the old one. *
* *
**************************************************************************************************************/
`;
}
function GetDDLPrefixSQL(db, schema, table){
var sql =
`
create or replace table "${db}"."${schema}"."${table}"
(
`;
return sql;
}
function GetDDLSuffixSQL(){
return "\n);";
}
function GetDividerSQL(){
return `
/**************************************************************************************************************
* *
* The SQL statement below this attempts to copy all rows from the string tabe to the typed table. *
* *
**************************************************************************************************************/
`;
}
function GetInsertPrefixSQL(db, schema, table){
var sql =
`\ninsert into "${db}"."${schema}"."${table}" select\n`;
return sql;
}
function GetInsertSuffixSQL(db, schema, table){
var sql =
`\nfrom "${db}"."${schema}"."${table}" ;`;
return sql;
}
//function GetInsertSuffixSQL(db, schema, table){
//var sql = '\nfrom "${db}"."${schema}"."${table}";';
//return sql;
//}
/****************************************************************************************************
* *
* SQL functions *
* *
****************************************************************************************************/
function GetResultSet(sql){
cmd1 = {sqlText: sql};
stmt = snowflake.createStatement(cmd1);
var rs;
rs = stmt.execute();
return rs;
}
function ExecuteNonQuery(queryString) {
var out = '';
cmd1 = {sqlText: queryString};
stmt = snowflake.createStatement(cmd1);
var rs;
rs = stmt.execute();
}
function ExecuteSingleValueQuery(columnName, queryString) {
var out;
cmd1 = {sqlText: queryString};
stmt = snowflake.createStatement(cmd1);
var rs;
try{
rs = stmt.execute();
rs.next();
return rs.getColumnValue(columnName);
}
catch(err) {
if (err.message.substring(0, 18) == "ResultSet is empty"){
throw "ERROR: No rows returned in query.";
} else {
throw "ERROR: " + err.message.replace(/\n/g, " ");
}
}
return out;
}
function ExecuteFirstValueQuery(queryString) {
var out;
cmd1 = {sqlText: queryString};
stmt = snowflake.createStatement(cmd1);
var rs;
try{
rs = stmt.execute();
rs.next();
return rs.getColumnValue(1);
}
catch(err) {
if (err.message.substring(0, 18) == "ResultSet is empty"){
throw "ERROR: No rows returned in query.";
} else {
throw "ERROR: " + err.message.replace(/\n/g, " ");
}
}
return out;
}
function getQuery(sql){
var cmd = {sqlText: sql};
var query = new Query(snowflake.createStatement(cmd));
try {
query.resultSet = query.statement.execute();
} catch (err) {
throw "ERROR: " + err.message.replace(/\n/g, " ");
}
return query;
}
$$;
Have you tried STAGES?
Create 2 stages ... one with no header and the other with header .. .
see examples below.
Then a bit of SQL and voila your DDL.
Only issue - you need to know the # of cols to put correct number of t.$'s.
If someone could automate that ... we'd have an almost automatic DDL generator for CSV's.
Obviously once you have the SQL stmt then just add the create or replace table to the front and your table is nicely created with all the names from the CSV.
:-)
-- create or replace stage CSV_NO_HEADER
URL = 's3://xxx-x-dev-landing/xxx/'
STORAGE_INTEGRATION = "xxxLAKE_DEV_S3_INTEGRATION"
FILE_FORMAT = ( TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"' )
-- create or replace stage CSV
URL = 's3://xxx-xxxlake-dev-landing/xxx/'
STORAGE_INTEGRATION = "xxxLAKE_DEV_S3_INTEGRATION"
FILE_FORMAT = ( TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' )
select concat('select t.$1 ', t.$1, ',t.$2 ', t.$2,',t.$3 ', t.$3, ',t.$4 ', t.$4,',t.$5 ', t.$5,',t.$6 ', t.$6,',t.$7 ', t.$7,',t.$8 ', t.$8,',t.$9 ', t.$9,
',t.$10 ', t.$10, ',t.$11 ', t.$11,',t.$12 ', t.$12 ,',t.$13 ', t.$13, ',t.$14 ', t.$14 ,',t.$15 ', t.$15 ,',t.$16 ', t.$16 ,',t.$17 ', t.$17 ,' from #xxxx_NO_HEADER/SUB_TRANSACTION_20201204.csv t') from
--- CHANGE TABLE ---
#xxx/SUB_TRANSACTION_20201204.csv t limit 1;
I am using databricks spark-avro to convert a dataframe schema into avro schema.The returned avro schema fails to have a default value. This is causing issues when i am trying to create a Generic record out of the schema. Can, any one help with the right way of using this function ?
Dataset<Row> sellableDs = sparkSession.sql("sql query");
SchemaBuilder.RecordBuilder<Schema> rb = SchemaBuilder.record("testrecord").namespace("test_namespace");
Schema sc = SchemaConverters.convertStructToAvro(sellableDs.schema(), rb, "test_namespace");
System.out.println(sc.toString());
System.out.println(sc.getFields().get(0).toString());
String schemaString = sc.toString();
sellableDs.foreach(
(ForeachFunction<Row>) row -> {
Schema scEx = new Schema.Parser().parse(schemaString);
GenericRecord gr;
gr = new GenericData.Record(scEx);
System.out.println("Generic record Created");
int fieldSize = scEx.getFields().size();
for (int i = 0; i < fieldSize; i++ ) {
// System.out.println( row.get(i).toString());
System.out.println("field: " + scEx.getFields().get(i).toString() + "::" + "value:" + row.get(i));
gr.put(scEx.getFields().get(i).toString(), row.get(i));
//i++;
}
}
);
This is the df schema:
StructType(StructField(key,IntegerType,true), StructField(value,DoubleType,true))
This is the avro converted schema:
{"type":"record","name":"testrecord","namespace":"test_namespace","fields":[{"name":"key","type":["int","null"]},{"name":"value","type":["double","null"]}]}
The problems is that the class SchemaConverters does not include default values as part of the schema creation. You have 2 options, modify the schema adding default values before Record creation or filling the record before building with some value( it could be actually values from your row). For example null. This is an example how create a Record using your schema
import org.apache.avro.generic.GenericRecordBuilder
import org.apache.avro.Schema
var schema = new Schema.Parser().parse("{\"type\":\"record\",\"name\":\"testrecord\",\"namespace\":\"test_namespace\",\"fields\":[{\"name\":\"key\",\"type\":[\"int\",\"null\"]},{\"name\":\"value\",\"type\":[\"double\",\"null\"]}]}")
var builder = new GenericRecordBuilder(schema);
for (i <- 0 to schema.getFields().size() - 1 ) {
builder.set(schema.getFields().get(i).name(), null)
}
var record = builder.build();
print(record.toString())
I've developed an apex API on salesforce which performs a SOQL on a list of CSV data. It has been working smoothly until yesterday, after making a few changes to code that follow the SOQL query, I started getting a strange 500 error:
[{"errorCode":"APEX_ERROR","message":"System.UnexpectedException:
common.exception.SfdcSqlException: ORA-01460: unimplemented or
unreasonable conversion requested\n\n\nselect /SampledPrequery/
sum(term0) \"cnt0\",\nsum(term1) \"cnt1\",\ncount(*)
\"totalcount\",\nsum(term0 * term1) \"combined\"\nfrom (select /*+
ordered use_nl(t_c1) /\n(case when (t_c1.deleted = '0') then 1 else 0
end) term0,\n(case when (upper(t_c1.val18) = ?) then 1 else 0 end)
term1\nfrom (select /+ index(sampleTab AKENTITY_SAMPLE)
*/\nentity_id\nfrom core.entity_sample sampleTab\nwhere organization_id = '00Dq0000000AMfz'\nand key_prefix = ?\nand rownum <=
?) sampleTab,\ncore.custom_entity_data t_c1\nwhere
t_c1.organization_id = '00Dq0000000AMfz'\nand t_c1.key_prefix = ?\nand
sampleTab.entity_id =
t_c1.custom_entity_data_id)\n\nClass.labFlows.queryContacts: line 13,
column 1\nClass.labFlows.fhaQuery: line 6, column
1\nClass.zAPI.doPost: line 10, column 1"}]
the zAPI.doPost() is simply our router class which takes in the post payload as well as the requested operation. It then calls whatever function the operation requests. In this case, the call is to labFlows.queryContacts():
Public static Map<string,List<string>> queryContacts(string[] stringArray){
//First get the id to get to the associative entity, Contact_Deals__c id
List<Contact_Deals__c> dealQuery = [SELECT id, Deal__r.id, Deal__r.FHA_Number__c, Deal__r.Name, Deal__r.Owner.Name
FROM Contact_Deals__c
Where Deal__r.FHA_Number__c in :stringArray];
//Using the id in the associative entity, grab the contact information
List<Contact_Deals__c> contactQuery = [Select Contact__r.Name, Contact__r.Id, Contact__r.Owner.Name, Contact__r.Owner.Id, Contact__r.Rule_Class__c, Contact__r.Primary_Borrower_Y_N__c
FROM contact_deals__c
WHERE Id in :dealQuery];
//Grab all deal id's
Map<string,List<string>> result = new Map<string,List<string>>();
for(Contact_Deals__c i:dealQuery){
List<string> temp = new list<string>();
temp.add(i.Deal__r.Id);
temp.add(i.Deal__r.Owner.Name);
temp.add(i.Deal__r.FHA_Number__c);
temp.add(i.Deal__r.Name);
for(Contact_Deals__c j:contactQuery){
if(j.id == i.id){
//This doesn't really help if there are multiple primary borrowers on a deal - but that should be a SF worflow rule IMO
if(j.Contact__r.Primary_Borrower_Y_N__c == 'Yes'){
temp.add(j.Contact__r.Owner.Id);
temp.add(j.Contact__r.Id);
temp.add(j.Contact__r.Name);
temp.add(j.Contact__r.Owner.Name);
temp.add(j.Contact__r.Rule_Class__c);
break;
}
}
}
result.put(i.Deal__r.id, temp);
}
return result;
}
The only thing I've changed is moving the temp list to add elements before the inner-loop (previously temp would only capture things from the inner-loop). The error above is referencing line 13, which is specifically the first SOQL call:
List<Contact_Deals__c> dealQuery = [SELECT id, Deal__r.id, Deal__r.FHA_Number__c, Deal__r.Name, Deal__r.Owner.Name
FROM Contact_Deals__c
Where Deal__r.FHA_Number__c in :stringArray];
I've tested this function in the apex anonymous window and it worked perfectly:
string a = '00035398,00035401';
string result = zAPI.doPost(a, 'fhaQuery');
system.debug(result);
Results:
13:36:54:947 USER_DEBUG
[5]|DEBUG|{"a09d000000HRvBAD":["a09d000000HRvBAD","Contacta","11111111","Plaza
Center
Apts"],"a09d000000HsVAD":["a09d000000HsVAD","Contactb","22222222","The
Garden"]}
So this is working. The next part is maybe looking at my python script that is calling the API,
def origQuery(file_name, operation):
csv_text = ""
with open(file_name) as csvfile:
reader = csv.reader(csvfile, dialect='excel')
for row in reader:
csv_text += row[0]+','
csv_text = csv_text[:-1]
data = json.dumps({
'data' : csv_text,
'operation' : operation
})
results = requests.post(url, headers=headers, data=data)
print results.text
origQuery('myfile.csv', 'fhaQuery')
I've tried looking up this ORA-01460 apex error, but I can't find anything that will help me fix this issue.
Can any one shed ore light on what this error is telling me?
Thank you all so much!
It turns out the error was in the PY script. For some reason the following code isn't functioning as it is supposed to:
with open(file_name) as csvfile:
reader = csv.reader(csvfile, dialect='excel')
for row in reader:
csv_text += row[0]+','
csv_text = csv_text[:-1]
This was returning one very long string that had zero delimiters. The final line in the code was cutting off the delimiter. What I needed instead was:
with open(file_name) as csvfile:
reader = csv.reader(csvfile, dialect='excel')
for row in reader:
csv_text += row[0]+','
csv_text = csv_text[:-1]
Which would cut off the final ','
The error was occurring because the single long string was above 4,000 characters.
$dbh_source2 = DBI->connect("dbi:Oracle:host=.......;port=......;sid=......",'..........','..........');
foreach $data_line (#raw_data) {
$SEL = "SELECT arg1,arg2 FROM TABLE_NAME WHERE DATA_NAME = '$data_line'";
$sth = $dbh_source2->prepare($SEL);
$sth->execute();
while (my #row = $sth->fetchrow_array() ) {
print #row;
print "\n";
}
}
END {
$dbh_source2->disconnect if defined($dbh_source2);
}
I am trying to grab several lines of data from a user. I want to take that data and use it to query a database and grab ARG1 and ARG2 WHERE USER_DATA = $data_line.
It will not display anything.
Here's a quick revision which uses SQL placeholders in order to keep Bobby Tables from destroying your database. It may also fix the problem you're currently having, but I haven't seen enough details of your problem yet to be sure.
my $dbh_source2 = DBI->connect("dbi:Oracle:host=.......;port=......;sid=......",'..........','..........');
my $SEL = "SELECT arg1,arg2 FROM TABLE_NAME WHERE DATA_NAME = ?";
my $sth = $dbh_source2->prepare($SEL);
foreach my $data_line (#raw_data) {
$sth->execute($data_line);
while (my #row = $sth->fetchrow_array() ) {
print #row;
print "\n";
}
}
END {
$dbh_source2->disconnect if defined($dbh_source2);
}
my $dbh_source2 = DBI->connect
("dbi:Oracle:host=.......;port=......;sid=......",'..........','..........');
my $SEL = "SELECT arg1,arg2 FROM TABLE_NAME WHERE DATA_NAME = ?";
my $sth = $dbh_source2->prepare($SEL);
foreach my $data_line (#raw_data) {
chomp $data_line;
$sth->execute($data_line);
while (my #row = $sth->fetchrow_array() ) {
print "$data_line\t #row\n";
}
}
END {
$dbh_source2->disconnect if defined($dbh_source2);
}
The issue I was having was the fact that the development database was not updated with the correct information so some of the items came up blank. When using the production database with the correct information it worked great!
Thank you all for your help!