How can I automatically infer schemas of CSV files on S3 as I load them?

How can I automatically infer schemas of CSV files on S3 as I load them? - amazon-s3

Context
Currently I am using Snowflake as a Data Warehouse and AWS' S3 as a data lake. The majority of the files that land on S3 are in the Parquet format. For these, I am using a new limited feature by Snowflake (documented here) that automatically detects the schema from the parquet files on S3, which I can use to generate a CREATE TABLE statement with the correct column names and inferred data types. This feature currently only works for Apache Parquet, Avro, and ORC files. I would like to find a way that achieves the same desired objective but for CSV files.
What I have tried to do
This is how I currently infer the schema for Parquet files:
select generate_column_description(array_agg(object_construct(*)), 'table') as columns
from table (infer_schema(location=>'${LOCATION}', file_format=>'${FILE_FORMAT}'))
However if I try specifying the FILE_FORMAT as csv that approach will fail.
Other approaches I have considered:
Transferring all files that land on S3 to parquet (this involves more code, and infra setup so wouldn't be my top choice, especially that I'd like to keep some files in their natural type on s3)
Having a script (using libraries like Pandas in Python for example) that infer the schema for files in S3 (this also involves more code, and will be strange in the sense that parquet files are handled in Snowflake, but non parquet files are handled by some script on aws).
Using a Snowflake UDF to infer the schema. Haven't fully considered my options there yet.
Desired Behaviour
As a new csv file lands on S3 (on a pre-existing STAGE), I would like to infer the schema, and be able to generate a CREATE TABLE statement with the inferred data types. Preferably, I would like to do that within Snowflake as the existing aforementioned schema-inference solution exists there. Happy to add further information if needed.

UPDATE: I modified the SP that infers data types in untyped (all string type columns) tables and it now works directly against Snowflake stages. The project code is available here: https://github.com/GregPavlik/InferSchema
I wrote a stored procedure to assist with this; however, its only goal is to infer the data types of untyped columns. It works as follows:
Load the CSV into a table with all columns defined as varchars.
Call the SP with a query against the new table (main point is to get only the columns you want and limit the row count to keep type inference times reasonable).
Also in the SP call is the DB, schema, and table for the old and new locations -- old with all varchar and new with the inferred types.
The SP will then infer the data types and create two SQL statements. One statement will create the new table with the inferred data types. One statement will copy from the untyped (all varchar) table to the new table with appropriate wrappers such as try_multi_timestamp(), a UDF that extends try_to_timestamp() to try various common formats.
I meant to extend this so that it didn't require the untyped (all varchar) table at all, but haven't gotten around to it. Since it's come up here, I may circle back and update the SP with that capability. You can specify a query that reads directly from the stage, but you'd have to use $1, $2... with aliases for the column names (or else the DDL will try to create column names like $1). If the query runs directly against a stage, for the old DB, schema, and table, you could put in whatever because that's only used to generate an insert from select statement.
-- This shows how to use on the Snowflake TPCH sample, but could be any query.
-- Keep the row count down to reduce the time it take to infer the types.
call infer_data_types('select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM limit 10000',
'SNOWFLAKE_SAMPLE_DATA', 'TPCH_SF1', 'LINEITEM',
'TEST', 'PUBLIC', 'LINEITEM');
create or replace procedure INFER_DATA_TYPES(SOURCE_QUERY string,
DATABASE_OLD string,
SCHEMA_OLD string,
TABLE_OLD string,
DATABASE_NEW string,
SCHEMA_NEW string,
TABLE_NEW string)
returns string
language javascript
as
$$
/****************************************************************************************************
* *
* DataType Classes
* *
****************************************************************************************************/
class Query{
constructor(statement){
this.statement = statement;
}
}
class DataType {
constructor(db, schema, table, column, sourceQuery) {
this.db = db;
this.schema = schema;
this.table = table;
this.sourceQuery = sourceQuery
this.column = column;
this.insert = '"#~COLUMN~#"';
this.totalCount = 0;
this.notNullCount = 0;
this.typeCount = 0;
this.blankCount = 0;
this.minTypeOf = 0.95;
this.minNotNull = 1.00;
}
setSQL(sqlTemplate){
this.sql = sqlTemplate;
this.sql = this.sql.replace(/#~DB~#/g, this.db);
this.sql = this.sql.replace(/#~SCHEMA~#/g, this.schema);
this.sql = this.sql.replace(/#~TABLE~#/g, this.table);
this.sql = this.sql.replace(/#~COLUMN~#/g, this.column);
}
getCounts(){
var rs;
rs = GetResultSet(this.sql);
rs.next();
this.totalCount = rs.getColumnValue("TOTAL_COUNT");
this.notNullCount = rs.getColumnValue("NON_NULL_COUNT");
this.typeCount = rs.getColumnValue("TO_TYPE_COUNT");
this.blankCount = rs.getColumnValue("BLANK");
}
isCorrectType(){
return (this.typeCount / (this.notNullCount - this.blankCount) >= this.minTypeOf);
}
isNotNull(){
return (this.notNullCount / this.totalCount >= this.minNotNull);
}
}
class TimestampType extends DataType{
constructor(db, schema, table, column, sourceQuery){
super(db, schema, table, column, sourceQuery)
this.syntax = "timestamp";
this.insert = 'try_multi_timestamp(trim("#~COLUMN~#"))';
this.sourceQuery = SOURCE_QUERY;
this.setSQL(GetCheckTypeSQL(this.insert, this.sourceQuery));
this.getCounts();
}
}
class IntegerType extends DataType{
constructor(db, schema, table, column, sourceQuery){
super(db, schema, table, column, sourceQuery)
this.syntax = "number(38,0)";
this.insert = 'try_to_number(trim("#~COLUMN~#"), 38, 0)';
this.setSQL(GetCheckTypeSQL(this.insert, this.sourceQuery));
this.getCounts();
}
}
class DoubleType extends DataType{
constructor(db, schema, table, column, sourceQuery){
super(db, schema, table, column, sourceQuery)
this.syntax = "double";
this.insert = 'try_to_double(trim("#~COLUMN~#"))';
this.setSQL(GetCheckTypeSQL(this.insert, this.sourceQuery));
this.getCounts();
}
}
class BooleanType extends DataType{
constructor(db, schema, table, column, sourceQuery){
super(db, schema, table, column, sourceQuery)
this.syntax = "boolean";
this.insert = 'try_to_boolean(trim("#~COLUMN~#"))';
this.setSQL(GetCheckTypeSQL(this.insert, this.sourceQuery));
this.getCounts();
}
}
// Catch all is STRING data type
class StringType extends DataType{
constructor(db, schema, table, column, sourceQuery){
super(db, schema, table, column, sourceQuery)
this.syntax = "string";
this.totalCount = 1;
this.notNullCount = 0;
this.typeCount = 1;
this.minTypeOf = 0;
this.minNotNull = 1;
}
}
/****************************************************************************************************
* *
* Main function *
* *
****************************************************************************************************/
var pass = 0;
var column;
var typeOf;
var ins = '';
var newTableDDL = '';
var insertDML = '';
var columnRS = GetResultSet(GetTableColumnsSQL(DATABASE_OLD, SCHEMA_OLD, TABLE_OLD));
while (columnRS.next()){
pass++;
if(pass > 1){
newTableDDL += ",\n";
insertDML += ",\n";
}
column = columnRS.getColumnValue("COLUMN_NAME");
typeOf = InferDataType(DATABASE_OLD, SCHEMA_OLD, TABLE_OLD, column, SOURCE_QUERY);
newTableDDL += '"' + typeOf.column + '" ' + typeOf.syntax;
ins = typeOf.insert;
insertDML += ins.replace(/#~COLUMN~#/g, typeOf.column);
}
return GetOpeningComments() +
GetDDLPrefixSQL(DATABASE_NEW, SCHEMA_NEW, TABLE_NEW) +
newTableDDL +
GetDDLSuffixSQL() +
GetDividerSQL() +
GetInsertPrefixSQL(DATABASE_NEW, SCHEMA_NEW, TABLE_NEW) +
insertDML +
GetInsertSuffixSQL(DATABASE_OLD, SCHEMA_OLD, TABLE_OLD) ;
/****************************************************************************************************
* *
* Helper functions *
* *
****************************************************************************************************/
function InferDataType(db, schema, table, column, sourceQuery){
var typeOf;
typeOf = new IntegerType(db, schema, table, column, sourceQuery);
if (typeOf.isCorrectType()) return typeOf;
typeOf = new DoubleType(db, schema, table, column, sourceQuery);
if (typeOf.isCorrectType()) return typeOf;
typeOf = new BooleanType(db, schema, table, column, sourceQuery); // May want to do a distinct and look for two values
if (typeOf.isCorrectType()) return typeOf;
typeOf = new TimestampType(db, schema, table, column, sourceQuery);
if (typeOf.isCorrectType()) return typeOf;
typeOf = new StringType(db, schema, table, column, sourceQuery);
if (typeOf.isCorrectType()) return typeOf;
return null;
}
/****************************************************************************************************
* *
* SQL Template Functions *
* *
****************************************************************************************************/
function GetCheckTypeSQL(insert, sourceQuery){
var sql =
`
select count(1) as TOTAL_COUNT,
count("#~COLUMN~#") as NON_NULL_COUNT,
count(${insert}) as TO_TYPE_COUNT,
sum(iff(trim("#~COLUMN~#")='', 1, 0)) as BLANK
--from "#~DB~#"."#~SCHEMA~#"."#~TABLE~#";
from (${sourceQuery})
`;
return sql;
}
function GetTableColumnsSQL(dbName, schemaName, tableName){
var sql =
`
select COLUMN_NAME
from ${dbName}.INFORMATION_SCHEMA.COLUMNS
where TABLE_CATALOG = '${dbName}' and
TABLE_SCHEMA = '${schemaName}' and
TABLE_NAME = '${tableName}'
order by ORDINAL_POSITION;
`;
return sql;
}
function GetOpeningComments(){
return `
/**************************************************************************************************************
* *
* Copy and paste into a worksheet to create the typed table and insert into the new table from the old one. *
* *
**************************************************************************************************************/
`;
}
function GetDDLPrefixSQL(db, schema, table){
var sql =
`
create or replace table "${db}"."${schema}"."${table}"
(
`;
return sql;
}
function GetDDLSuffixSQL(){
return "\n);";
}
function GetDividerSQL(){
return `
/**************************************************************************************************************
* *
* The SQL statement below this attempts to copy all rows from the string tabe to the typed table. *
* *
**************************************************************************************************************/
`;
}
function GetInsertPrefixSQL(db, schema, table){
var sql =
`\ninsert into "${db}"."${schema}"."${table}" select\n`;
return sql;
}
function GetInsertSuffixSQL(db, schema, table){
var sql =
`\nfrom "${db}"."${schema}"."${table}" ;`;
return sql;
}
//function GetInsertSuffixSQL(db, schema, table){
//var sql = '\nfrom "${db}"."${schema}"."${table}";';
//return sql;
//}
/****************************************************************************************************
* *
* SQL functions *
* *
****************************************************************************************************/
function GetResultSet(sql){
cmd1 = {sqlText: sql};
stmt = snowflake.createStatement(cmd1);
var rs;
rs = stmt.execute();
return rs;
}
function ExecuteNonQuery(queryString) {
var out = '';
cmd1 = {sqlText: queryString};
stmt = snowflake.createStatement(cmd1);
var rs;
rs = stmt.execute();
}
function ExecuteSingleValueQuery(columnName, queryString) {
var out;
cmd1 = {sqlText: queryString};
stmt = snowflake.createStatement(cmd1);
var rs;
try{
rs = stmt.execute();
rs.next();
return rs.getColumnValue(columnName);
}
catch(err) {
if (err.message.substring(0, 18) == "ResultSet is empty"){
throw "ERROR: No rows returned in query.";
} else {
throw "ERROR: " + err.message.replace(/\n/g, " ");
}
}
return out;
}
function ExecuteFirstValueQuery(queryString) {
var out;
cmd1 = {sqlText: queryString};
stmt = snowflake.createStatement(cmd1);
var rs;
try{
rs = stmt.execute();
rs.next();
return rs.getColumnValue(1);
}
catch(err) {
if (err.message.substring(0, 18) == "ResultSet is empty"){
throw "ERROR: No rows returned in query.";
} else {
throw "ERROR: " + err.message.replace(/\n/g, " ");
}
}
return out;
}
function getQuery(sql){
var cmd = {sqlText: sql};
var query = new Query(snowflake.createStatement(cmd));
try {
query.resultSet = query.statement.execute();
} catch (err) {
throw "ERROR: " + err.message.replace(/\n/g, " ");
}
return query;
}
$$;

Have you tried STAGES?
Create 2 stages ... one with no header and the other with header .. .
see examples below.
Then a bit of SQL and voila your DDL.
Only issue - you need to know the # of cols to put correct number of t.$'s.
If someone could automate that ... we'd have an almost automatic DDL generator for CSV's.
Obviously once you have the SQL stmt then just add the create or replace table to the front and your table is nicely created with all the names from the CSV.
:-)
-- create or replace stage CSV_NO_HEADER
URL = 's3://xxx-x-dev-landing/xxx/'
STORAGE_INTEGRATION = "xxxLAKE_DEV_S3_INTEGRATION"
FILE_FORMAT = ( TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"' )
-- create or replace stage CSV
URL = 's3://xxx-xxxlake-dev-landing/xxx/'
STORAGE_INTEGRATION = "xxxLAKE_DEV_S3_INTEGRATION"
FILE_FORMAT = ( TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' )
select concat('select t.$1 ', t.$1, ',t.$2 ', t.$2,',t.$3 ', t.$3, ',t.$4 ', t.$4,',t.$5 ', t.$5,',t.$6 ', t.$6,',t.$7 ', t.$7,',t.$8 ', t.$8,',t.$9 ', t.$9,
',t.$10 ', t.$10, ',t.$11 ', t.$11,',t.$12 ', t.$12 ,',t.$13 ', t.$13, ',t.$14 ', t.$14 ,',t.$15 ', t.$15 ,',t.$16 ', t.$16 ,',t.$17 ', t.$17 ,' from #xxxx_NO_HEADER/SUB_TRANSACTION_20201204.csv t') from
--- CHANGE TABLE ---
#xxx/SUB_TRANSACTION_20201204.csv t limit 1;

Related

While loop through each row of a table stored procedure snowflake

create or replace procedure Proc_Name("Trgt_DB" VARCHAR, "Trgt_Schema" VARCHAR, "Trgt_Stage" VARCHAR, "Trgt_Timeframe" VARCHAR, "Trgt_Number_Timeframe" FLOAT)
returns string
language JavaScript
EXECUTE AS CALLER
as
$$
--select all data from table
var select_stmt = `SELECT * FROM ` + Trgt_DB + `.schema.data_table ORDER BY ID`;
var select_stmt_sql = { sqlText: select_stmt };
var select_stmt_create = snowflake.createStatement(select_stmt_sql);
var select_stmt_exec = select_stmt_create.execute();
--loop throgh all the external tables
while (select_stmt_exec.next()) {
var TGT_SCHEMA_NAME = select_stmt_exec.getColumnValue(5);
var TGT_TABLE_NAME = select_stmt_exec.getColumnValue(6);
--delete data from target tables
var remove_data_tbl = `delete from ` +TGT_SCHEMA_NAME+ `.` +TGT_TABLE_NAME+ ` where SYS_DATE <= DATEADD(` +Trgt_Timeframe+ `,-` +Trgt_Number_Timeframe+ `, current_date())`;
var remove_data_tbl_sql = {sqlText: remove_data_tbl };
var remove_data_tbl_create = snowflake.createStatement(remove_data_tbl_sql);
var remove_details_exec = remove_data_tbl_create.execute();
CALL Proc_Name('Dev', 'ACQ', 'stage', 'days', 60)
--I have a table with all of the target schemas, (column 5) and target tables (column 6) in it. I want to create a loop which iterates through each target table and deletes the data from the target table which is past a certain timeframe.
--The code I have at the moment will only delete data from the first target table in the table

Snowflake error in JavaScript procedure column index/name does not exists in resultSet getColumnValue

I am having a procedure on snowflake which executing the following query:
select
array_size(split($1, ',')) as NO_OF_COL,
split($1, ',') as COLUMNS_ARRAY
from
#mystage/myfile.csv(file_format => 'ONE_COLUMN_FILE_FORMAT')
limit 1;
And the result would be like:
Why I run this query in a procedure:
CREATE OR REPLACE PROCEDURE ADD_TEMPORARY_TABLE(TEMP_TABLE_NAME STRING, FILE_FULL_PATH STRING, ONE_COLUMN_FORMAT_FILE STRING, FILE_FORMAT_NAME STRING)
RETURNS variant
LANGUAGE JAVASCRIPT
EXECUTE AS CALLER
AS
$$
try{
var final_result = [];
var nested_obj = {};
var nbr_rows = 0;
var NO_OF_COL = 0;
var COLUMNS_ARRAY = [];
var get_length_and_columns_array = "select array_size(split($1,',')) as NO_OF_COL, "+
"split($1,',') as COLUMNS_ARRAY from "+FILE_FULL_PATH+" "+
"(file_format=>"+ONE_COLUMN_FORMAT_FILE+") limit 1";
var stmt = snowflake.createStatement({sqlText: get_length_and_columns_array});
var array_result = stmt.execute();
array_result.next();
//return array_result.getColumnValue('COLUMNS_ARRAY');
NO_OF_COL = array_result.getColumnValue('NO_OF_COL');
COLUMNS_ARRAY = array_result.getColumnValue('COLUMNS_ARRAY');
return COLUMNS_ARRAY;
}
...
$$;
It will return an error as the following:
{
"code": 100183,
"message": "Given column name/index does not exist: NO_OF_COL",
"stackTraceTxt": "At ResultSet.getColumnValue, line 16 position 29",
"state": "P0000",
"toString": {}
}
The other issue is if I keep trying, it will return the desired array, but most of the times is returning this error.

The other issue is if I keep trying, it will return the desired array
If it works one time and not another time, my educated guess is that the stored procedure is called from different schemas.
Querying stage
( FILE_FORMAT => '<namespace>.<named_file_format>' )
If referencing a file format in the current namespace for your user session, you can omit the single quotes around the format identifier.
In the standalone query we can see:
select
array_size(split($1, ',')) as NO_OF_COL,
split($1, ',') as COLUMNS_ARRAY
from
#mystage/myfile.csv(file_format => 'ONE_COLUMN_FILE_FORMAT')
>-< >-<
But in the stored procedure body:
"(file_format=>"+ONE_COLUMN_FORMAT_FILE+") limit 1";
--here the text is appended, but without wrapping with ''
=>
"(file_format=>'"+ONE_COLUMN_FORMAT_FILE+"') limit 1";
Suggestion: always provide file format as as string wrapped with ', preferably prefixed with namespace '<schema_name>.<format_name>'.

ActiveJDBC , How can i query some columns i interest with in a single table

when i query a single table , i do not want all columns , i just want some column that i interest in.
For example, when i use where method to query a table, it will query all columns in a table like
public class SubjectSpecimenType extends Model {
}
SubjectSpecimenType.where("SUBJECT_ID = ? AND SITE_ID = ?", subjectId, siteId);
i don't know if there has a method named select that i can use to query some column like
SubjectSpecimenType.select("SUBJECT_NAME", "SITE_NAME").where("SUBJECT_ID = ? AND SITE_ID = ?", subjectId, siteId);
there are the source code in LazyList.java
/**
* Use to see what SQL will be sent to the database.
*
* #param showParameters true to see parameter values, false not to.
* #return SQL in a dialect for current connection which will be used if you start querying this
* list.
*/
public String toSql(boolean showParameters) {
String sql;
if(forPaginator){
sql = metaModel.getDialect().formSelect(null, null, fullQuery, orderBys, limit, offset);
}else{
sql = fullQuery != null ? fullQuery
: metaModel.getDialect().formSelect(metaModel.getTableName(), null, subQuery, orderBys, limit, offset);
}
if (showParameters) {
StringBuilder sb = new StringBuilder(sql).append(", with parameters: ");
join(sb, params, ", ");
sql = sb.toString();
}
return sql;
}
when call formSelect method, the second param columns always be null
is there a unfinish TODO ?

When operating on Models, ActiveJDBC always selects all columns, because if you load a model and it has partial attributes loaded, then you have a deficient model. The columns are specified in some edge cases, as in the RawPaginator: https://github.com/javalite/javalite/blob/e91ebdd1e4958bc0965d7ee99e6b7debc59a7b85/activejdbc/src/main/java/org/javalite/activejdbc/RawPaginator.java#L141
There is nothing to finish here, the behavior is intentional.

Converting StructType to Avro Schema, returns type as Union when using databricks spark-avro

I am using databricks spark-avro to convert a dataframe schema into avro schema.The returned avro schema fails to have a default value. This is causing issues when i am trying to create a Generic record out of the schema. Can, any one help with the right way of using this function ?
Dataset<Row> sellableDs = sparkSession.sql("sql query");
SchemaBuilder.RecordBuilder<Schema> rb = SchemaBuilder.record("testrecord").namespace("test_namespace");
Schema sc = SchemaConverters.convertStructToAvro(sellableDs.schema(), rb, "test_namespace");
System.out.println(sc.toString());
System.out.println(sc.getFields().get(0).toString());
String schemaString = sc.toString();
sellableDs.foreach(
(ForeachFunction<Row>) row -> {
Schema scEx = new Schema.Parser().parse(schemaString);
GenericRecord gr;
gr = new GenericData.Record(scEx);
System.out.println("Generic record Created");
int fieldSize = scEx.getFields().size();
for (int i = 0; i < fieldSize; i++ ) {
// System.out.println( row.get(i).toString());
System.out.println("field: " + scEx.getFields().get(i).toString() + "::" + "value:" + row.get(i));
gr.put(scEx.getFields().get(i).toString(), row.get(i));
//i++;
}
}
);
This is the df schema:
StructType(StructField(key,IntegerType,true), StructField(value,DoubleType,true))
This is the avro converted schema:
{"type":"record","name":"testrecord","namespace":"test_namespace","fields":[{"name":"key","type":["int","null"]},{"name":"value","type":["double","null"]}]}

The problems is that the class SchemaConverters does not include default values as part of the schema creation. You have 2 options, modify the schema adding default values before Record creation or filling the record before building with some value( it could be actually values from your row). For example null. This is an example how create a Record using your schema
import org.apache.avro.generic.GenericRecordBuilder
import org.apache.avro.Schema
var schema = new Schema.Parser().parse("{\"type\":\"record\",\"name\":\"testrecord\",\"namespace\":\"test_namespace\",\"fields\":[{\"name\":\"key\",\"type\":[\"int\",\"null\"]},{\"name\":\"value\",\"type\":[\"double\",\"null\"]}]}")
var builder = new GenericRecordBuilder(schema);
for (i <- 0 to schema.getFields().size() - 1 ) {
builder.set(schema.getFields().get(i).name(), null)
}
var record = builder.build();
print(record.toString())

Removing column gives syntax error [duplicate]

I have a problem: I need to delete a column from my SQLite database. I wrote this query
alter table table_name drop column column_name
but it does not work. Please help me.

Update: SQLite 2021-03-12 (3.35.0) now supports DROP COLUMN. The FAQ on the website is still outdated.
From: http://www.sqlite.org/faq.html:
(11) How do I add or delete columns from an existing table in SQLite.
SQLite has limited ALTER TABLE support that you can use to add a
column to the end of a table or to change the name of a table. If you
want to make more complex changes in the structure of a table, you
will have to recreate the table. You can save existing data to a
temporary table, drop the old table, create the new table, then copy
the data back in from the temporary table.
For example, suppose you have a table named "t1" with columns names
"a", "b", and "c" and that you want to delete column "c" from this
table. The following steps illustrate how this could be done:
BEGIN TRANSACTION;
CREATE TEMPORARY TABLE t1_backup(a,b);
INSERT INTO t1_backup SELECT a,b FROM t1;
DROP TABLE t1;
CREATE TABLE t1(a,b);
INSERT INTO t1 SELECT a,b FROM t1_backup;
DROP TABLE t1_backup;
COMMIT;

Instead of dropping the backup table, just rename it...
BEGIN TRANSACTION;
CREATE TABLE t1_backup(a,b);
INSERT INTO t1_backup SELECT a,b FROM t1;
DROP TABLE t1;
ALTER TABLE t1_backup RENAME TO t1;
COMMIT;

For simplicity, why not create the backup table from the select statement?
CREATE TABLE t1_backup AS SELECT a, b FROM t1;
DROP TABLE t1;
ALTER TABLE t1_backup RENAME TO t1;

This option works only if you can open the DB in a DB Browser like DB Browser for SQLite.
In DB Browser for SQLite:
Go to the tab, "Database Structure"
Select you table Select Modify table (just under the tabs)
Select the column you want to delete
Click on Remove field and click OK

=>Create a new table directly with the following query:
CREATE TABLE table_name (Column_1 TEXT,Column_2 TEXT);
=>Now insert the data into table_name from existing_table with the following query:
INSERT INTO table_name (Column_1,Column_2) FROM existing_table;
=>Now drop the existing_table by following query:
DROP TABLE existing_table;

PRAGMA foreign_keys=off;
BEGIN TRANSACTION;
ALTER TABLE table1 RENAME TO _table1_old;
CREATE TABLE table1 (
( column1 datatype [ NULL | NOT NULL ],
column2 datatype [ NULL | NOT NULL ],
...
);
INSERT INTO table1 (column1, column2, ... column_n)
SELECT column1, column2, ... column_n
FROM _table1_old;
COMMIT;
PRAGMA foreign_keys=on;
For more info:
https://www.techonthenet.com/sqlite/tables/alter_table.php

I've made a Python function where you enter the table and column to remove as arguments:
def removeColumn(table, column):
columns = []
for row in c.execute('PRAGMA table_info(' + table + ')'):
columns.append(row[1])
columns.remove(column)
columns = str(columns)
columns = columns.replace("[", "(")
columns = columns.replace("]", ")")
for i in ["\'", "(", ")"]:
columns = columns.replace(i, "")
c.execute('CREATE TABLE temptable AS SELECT ' + columns + ' FROM ' + table)
c.execute('DROP TABLE ' + table)
c.execute('ALTER TABLE temptable RENAME TO ' + table)
conn.commit()
As per the info on Duda's and MeBigFatGuy's answers this won't work if there is a foreign key on the table, but this can be fixed with 2 lines of code (creating a new table and not just renaming the temporary table)

For SQLite3 c++ :
void GetTableColNames( tstring sTableName , std::vector<tstring> *pvsCols )
{
UASSERT(pvsCols);
CppSQLite3Table table1;
tstring sDML = StringOps::std_sprintf(_T("SELECT * FROM %s") , sTableName.c_str() );
table1 = getTable( StringOps::tstringToUTF8string(sDML).c_str() );
for ( int nCol = 0 ; nCol < table1.numFields() ; nCol++ )
{
const char* pch1 = table1.fieldName(nCol);
pvsCols->push_back( StringOps::UTF8charTo_tstring(pch1));
}
}
bool ColExists( tstring sColName )
{
bool bColExists = true;
try
{
tstring sQuery = StringOps::std_sprintf(_T("SELECT %s FROM MyOriginalTable LIMIT 1;") , sColName.c_str() );
ShowVerbalMessages(false);
CppSQLite3Query q = execQuery( StringOps::tstringTo_stdString(sQuery).c_str() );
ShowVerbalMessages(true);
}
catch (CppSQLite3Exception& e)
{
bColExists = false;
}
return bColExists;
}
void DeleteColumns( std::vector<tstring> *pvsColsToDelete )
{
UASSERT(pvsColsToDelete);
execDML( StringOps::tstringTo_stdString(_T("begin transaction;")).c_str() );
std::vector<tstring> vsCols;
GetTableColNames( _T("MyOriginalTable") , &vsCols );
CreateFields( _T("TempTable1") , false );
tstring sFieldNamesSeperatedByCommas;
for ( int nCol = 0 ; nCol < vsCols.size() ; nCol++ )
{
tstring sColNameCurr = vsCols.at(nCol);
bool bUseCol = true;
for ( int nColsToDelete = 0; nColsToDelete < pvsColsToDelete->size() ; nColsToDelete++ )
{
if ( pvsColsToDelete->at(nColsToDelete) == sColNameCurr )
{
bUseCol = false;
break;
}
}
if ( bUseCol )
sFieldNamesSeperatedByCommas+= (sColNameCurr + _T(","));
}
if ( sFieldNamesSeperatedByCommas.at( int(sFieldNamesSeperatedByCommas.size()) - 1) == _T(','))
sFieldNamesSeperatedByCommas.erase( int(sFieldNamesSeperatedByCommas.size()) - 1 );
tstring sDML;
sDML = StringOps::std_sprintf(_T("insert into TempTable1 SELECT %s FROM MyOriginalTable;\n") , sFieldNamesSeperatedByCommas.c_str() );
execDML( StringOps::tstringTo_stdString(sDML).c_str() );
sDML = StringOps::std_sprintf(_T("ALTER TABLE MyOriginalTable RENAME TO MyOriginalTable_old\n") );
execDML( StringOps::tstringTo_stdString(sDML).c_str() );
sDML = StringOps::std_sprintf(_T("ALTER TABLE TempTable1 RENAME TO MyOriginalTable\n") );
execDML( StringOps::tstringTo_stdString(sDML).c_str() );
sDML = ( _T("DROP TABLE MyOriginalTable_old;") );
execDML( StringOps::tstringTo_stdString(sDML).c_str() );
execDML( StringOps::tstringTo_stdString(_T("commit transaction;")).c_str() );
}

In case anyone needs a (nearly) ready-to-use PHP function, the following is based on this answer:
/**
* Remove a column from a table.
*
* #param string $tableName The table to remove the column from.
* #param string $columnName The column to remove from the table.
*/
public function DropTableColumn($tableName, $columnName)
{
// --
// Determine all columns except the one to remove.
$columnNames = array();
$statement = $pdo->prepare("PRAGMA table_info($tableName);");
$statement->execute(array());
$rows = $statement->fetchAll(PDO::FETCH_OBJ);
$hasColumn = false;
foreach ($rows as $row)
{
if(strtolower($row->name) !== strtolower($columnName))
{
array_push($columnNames, $row->name);
}
else
{
$hasColumn = true;
}
}
// Column does not exist in table, no need to do anything.
if ( !$hasColumn ) return;
// --
// Actually execute the SQL.
$columns = implode('`,`', $columnNames);
$statement = $pdo->exec(
"CREATE TABLE `t1_backup` AS SELECT `$columns` FROM `$tableName`;
DROP TABLE `$tableName`;
ALTER TABLE `t1_backup` RENAME TO `$tableName`;");
}
In contrast to other answers, the SQL used in this approach seems to preserve the data types of the columns, whereas something like the accepted answer seems to result in all columns to be of type TEXT.
Update 1:
The SQL used has the drawback that autoincrement columns are not preserved.

Just in case if it could help someone like me.
Based on the Official website and the Accepted answer, I made a code using C# that uses System.Data.SQLite NuGet package.
This code also preserves the Primary key and Foreign key.
CODE in C#:
void RemoveColumnFromSqlite (string tableName, string columnToRemove) {
try {
var mSqliteDbConnection = new SQLiteConnection ("Data Source=db_folder\\MySqliteBasedApp.db;Version=3;Page Size=1024;");
mSqliteDbConnection.Open ();
// Reads all columns definitions from table
List<string> columnDefinition = new List<string> ();
var mSql = $"SELECT type, sql FROM sqlite_master WHERE tbl_name='{tableName}'";
var mSqliteCommand = new SQLiteCommand (mSql, mSqliteDbConnection);
string sqlScript = "";
using (mSqliteReader = mSqliteCommand.ExecuteReader ()) {
while (mSqliteReader.Read ()) {
sqlScript = mSqliteReader["sql"].ToString ();
break;
}
}
if (!string.IsNullOrEmpty (sqlScript)) {
// Gets string within first '(' and last ')' characters
int firstIndex = sqlScript.IndexOf ("(");
int lastIndex = sqlScript.LastIndexOf (")");
if (firstIndex >= 0 && lastIndex <= sqlScript.Length - 1) {
sqlScript = sqlScript.Substring (firstIndex, lastIndex - firstIndex + 1);
}
string[] scriptParts = sqlScript.Split (new string[] { "," }, StringSplitOptions.RemoveEmptyEntries);
foreach (string s in scriptParts) {
if (!s.Contains (columnToRemove)) {
columnDefinition.Add (s);
}
}
}
string columnDefinitionString = string.Join (",", columnDefinition);
// Reads all columns from table
List<string> columns = new List<string> ();
mSql = $"PRAGMA table_info({tableName})";
mSqliteCommand = new SQLiteCommand (mSql, mSqliteDbConnection);
using (mSqliteReader = mSqliteCommand.ExecuteReader ()) {
while (mSqliteReader.Read ()) columns.Add (mSqliteReader["name"].ToString ());
}
columns.Remove (columnToRemove);
string columnString = string.Join (",", columns);
mSql = "PRAGMA foreign_keys=OFF";
mSqliteCommand = new SQLiteCommand (mSql, mSqliteDbConnection);
int n = mSqliteCommand.ExecuteNonQuery ();
// Removes a column from the table
using (SQLiteTransaction tr = mSqliteDbConnection.BeginTransaction ()) {
using (SQLiteCommand cmd = mSqliteDbConnection.CreateCommand ()) {
cmd.Transaction = tr;
string query = $"CREATE TEMPORARY TABLE {tableName}_backup {columnDefinitionString}";
cmd.CommandText = query;
cmd.ExecuteNonQuery ();
cmd.CommandText = $"INSERT INTO {tableName}_backup SELECT {columnString} FROM {tableName}";
cmd.ExecuteNonQuery ();
cmd.CommandText = $"DROP TABLE {tableName}";
cmd.ExecuteNonQuery ();
cmd.CommandText = $"CREATE TABLE {tableName} {columnDefinitionString}";
cmd.ExecuteNonQuery ();
cmd.CommandText = $"INSERT INTO {tableName} SELECT {columnString} FROM {tableName}_backup;";
cmd.ExecuteNonQuery ();
cmd.CommandText = $"DROP TABLE {tableName}_backup";
cmd.ExecuteNonQuery ();
}
tr.Commit ();
}
mSql = "PRAGMA foreign_keys=ON";
mSqliteCommand = new SQLiteCommand (mSql, mSqliteDbConnection);
n = mSqliteCommand.ExecuteNonQuery ();
} catch (Exception ex) {
HandleExceptions (ex);
}
}

In Python 3.8...
Preserves primary key and column types.
Takes 3 inputs:
a sqlite cursor: db_cur,
table name: t and,
list of columns to junk: columns_to_junk
def removeColumns(db_cur, t, columns_to_junk):
# Obtain column information
sql = "PRAGMA table_info(" + t + ")"
record = query(db_cur, sql)
# Initialize two strings: one for column names + column types and one just
# for column names
cols_w_types = "("
cols = ""
# Build the strings, filtering for the column to throw out
for r in record:
if r[1] not in columns_to_junk:
if r[5] == 0:
cols_w_types += r[1] + " " + r[2] + ","
if r[5] == 1:
cols_w_types += r[1] + " " + r[2] + " PRIMARY KEY,"
cols += r[1] + ","
# Cut potentially trailing commas
if cols_w_types[-1] == ",":
cols_w_types = cols_w_types[:-1]
else:
pass
if cols[-1] == ",":
cols = cols[:-1]
else:
pass
# Execute SQL
sql = "CREATE TEMPORARY TABLE xfer " + cols_w_types + ")"
db_cur.execute(sql)
sql = "INSERT INTO xfer SELECT " + cols + " FROM " + t
db_cur.execute(sql)
sql = "DROP TABLE " + t
db_cur.execute(sql)
sql = "CREATE TABLE " + t + cols_w_types + ")"
db_cur.execute(sql)
sql = "INSERT INTO " + t + " SELECT " + cols + " FROM xfer"
db_cur.execute(sql)
You'll find a reference to a query() function. Just a helper...
Takes two inputs:
sqlite cursor db_cur and,
the query string: query
def query(db_cur, query):
r = db_cur.execute(query).fetchall()
return r
Don't forget to include a "commit()"!

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How can I automatically infer schemas of CSV files on S3 as I load them? - amazon-s3

Related

While loop through each row of a table stored procedure snowflake

Snowflake error in JavaScript procedure column index/name does not exists in resultSet getColumnValue

ActiveJDBC , How can i query some columns i interest with in a single table

Converting StructType to Avro Schema, returns type as Union when using databricks spark-avro

Removing column gives syntax error [duplicate]

Categories

Resources