COPY into Snowflake table without defining the table schema - amazon-s3

Is there a way to copy data from S3 into Snowflake without manually defining the columns beforehand?
I don't want to have to define the schema for the table in Snowflake OR the schema for which columns should be imported from S3. I want it to be schema-on-read, not schema-on-write.
I'm using a storage integration for access to the S3 external stage.
My question is a bit similar to this question, but I don't want to have to define any columns individually. If there's a way to just add additional columns on the fly, that would solve my issue too.

We do not currently have schema inference for COPY. I am assuming that you already know about the variant column option for JSON but it will not give you full schematization.
https://docs.snowflake.net/manuals/user-guide/semistructured-concepts.html
Dinesh Kulkarni
(PM, Snowflake)

You need to use a third-party tool that analyses your whole S3 data file in order to build an SQL schema from the data set in the file. Or maybe the tool is given access to the data source definition (which Snowflake doesn't have) to make its job easier.
You might find snippets of Snowflake stored procedure code by searching around here at Stack Overflow that output schema definitions by, e.g., recursively flattening JSON data files.
If you want the import to be flexible, you need to use a flexible data format like JSON and a flexible SQL data type like VARIANT. This will work even if your data structures change.
If you want to use rigid formats like CSV or rigid SQL data types (most are rigid), then things get complicated. Rigid formats are not flexible, and CSV files, for example, do not carry any embedded type information, which makes for massive, non-future-proof guesswork.
And maybe you are satisfied having all your columns end up as VARCHAR...
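For completeness, here is a minimal sketch of the JSON-into-VARIANT pattern mentioned above (the table, stage, and file format names are made up for the example; adjust them to your storage integration and external stage):

-- land everything in a single VARIANT column (schema-on-read)
create table raw_events (v variant);

create file format my_json_format type = 'json';

copy into raw_events
  from @my_s3_stage/events/
  file_format = (format_name = 'my_json_format');

-- pull attributes out of the VARIANT at query time
select v:customer.id::string as customer_id,
       v:event_type::string  as event_type
from   raw_events;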

Related

Looking for a non-cloud RDBMS to import partitioned tables (in CSV format) with their directory structure

Context: I have been working on Cloudera/Impala in order to use a big database and create more manageable "aggregate" tables which contain substantially less information. These more manageable tables are of the order of tens to hundreds of gigabytes, and there are about two dozen tables. I am looking at about 500 gigabytes of data which will fit on a computer in my lab.
Question: I wish to use a non-cloud RDBMS in order to further work on these tables locally from my lab. The original Impala tables, most of them partitioned by date, have been exported to CSV, in such a way that the "table" folder contains a subfolder for each date, each subfolder containing a unique csv file (in which the partitioned "date" column is absent, since it is in its dated subfolder). Which would be an adequate RDBMS and how would I import these tables?
What I've found so far:
there seem to be several GUIs or commands for MySQL which simplify importing, e.g.:
How do I import CSV file into a MySQL table?
Export Impala Table from HDFS to MySQL
How to load Excel or CSV file into Firebird?
However, these do not address my specific situation since 1. I only have access to Impala on the cluster, i.e. I cannot add any tools, so the heavy lifting must be done on the lab computer, and 2. they do not say anything about importing an already partitioned table with its existing directory/partition structure.
Constraints:
Lab computer is on Ubuntu 20.04
Ideally, I would like to avoid having to load each csv / partition manually, as I have tens of thousands of dates. I am hoping for an RDBMS which already recognizes the partitioned directory structure...
The RDBMS itself should have a fairly recent set of functions available, including lead/lag/first/last window functions. Aside from that, it needn't be too fancy.
I'm open to using Spark as an "overkill SQL engine" if that's the best way; I'm just not too sure whether this is the best approach for a single computer (not a cluster). Also, if need be (though I would ideally like to avoid this), I can export my Impala tables in another format in order to ease the import phase, e.g. a different format for text-based tables, Parquet, etc.
Edit 1
As suggested in the comments, I am currently looking at Apache Drill. It is correctly installed, and I have successfully run the basic queries from the documentation / tutorials. However, I am now stuck on how to actually "import" my tables (actually, I only need to "use" them, since Drill seems able to run queries directly on the filesystem). To clarify:
I currently have two "tables" in the directories /data/table1 and /data/table2 .
those directories contain subdirectories corresponding to the different partitions, e.g.: /data/table1/thedate=1995 , /data/table1/thedate=1996 , etc., and the same goes for table2.
within each subdirectory, I have a file (without an extension) that contains the CSV data, without headers.
My understanding (I'm still new to Apache Drill) is that I need to create a File System Storage Plugin so that Drill understands where to look and what it's looking at, so I created a pretty basic plugin (a quasi copy/paste from this one) using the web interface on the Plugin Management page. The net result is that I can now type use data; and Drill understands it. I can then say show files in data and it correctly lists table1 and table2 as my two directories. Unfortunately, I am still missing two key things to successfully be able to query these tables:
running select * from data.table1 fails with an error; I've also tried table1 and dfs.data.table1, and I get a different error for each (object 'data' not found, object 'table1' not found, and schema [[dfs,data]] is not valid with respect to either root schema or current default schema, respectively). I suspect this is because there are sub-directories within table1?
I still have not said anything about the structure of the CSV files, and that structure would need to incorporate the fact that there is a "thedate" field and value in the sub-directory name... (a sketch of how such files are usually addressed in Drill follows below).
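For what it's worth, here is a hedged sketch of how extension-less, header-less text files are usually queried in Drill, assuming the data workspace's defaultInputFormat attribute has been pointed at a CSV-style format so Drill knows how to parse files with no extension (the workspace and table names mirror the setup described above and are otherwise assumptions):

-- dir0 exposes the first-level partition directory (e.g. 'thedate=1995');
-- header-less delimited files surface as a single 'columns' array
select dir0, columns[0] as first_field, columns[1] as second_field
from dfs.data.`table1`
limit 10;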
Edit 2
After trying a bunch of things, I still had no luck using text-based files; however, using Parquet files worked:
I can query a parquet file
I can query a directory containing a partitioned table, each directory being in the format thedate=1995, thedate=1996, as stated earlier.
I used the advice here in order to be able to query a table the usual way, i.e. without using dir0 but using thedate. Essentially, I created a view:
create view drill.test as select dir0 as thedate, * from dfs.data/table1_parquet_partitioned
Unfortunately, thedate is now text that reads thedate=1994, rather than just 1994 (int). So I renamed the directories to contain only the date, but this was not a good solution either, as the type of thedate was still not an int and therefore I could not use it to join with table2 (which has thedate in a column). So finally, what I did was cast thedate to an int inside the view (sketched below).
=> This is all fine since, although these are not csv files, this alternative is doable for me. However, I am wondering: by using such a view, with a cast inside, will I still benefit from partition pruning? The answer in the referenced stackoverflow link suggests partition pruning is preserved by the view, but I am unsure about this when the column is used in a formula... Finally, given that the only way I can make this work is via Parquet, it raises the question: is Drill the best solution for this in terms of performance? So far, I like it, but migrating the database to this will be time-consuming and I would like to try to choose the best destination without too much trial and error...
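For reference, a sketch of the cast-inside-a-view approach described above (the view name mirrors the one used earlier; the exact path reference and column handling are assumptions):

-- 'thedate=' is 8 characters, so the value starts at position 9;
-- casting it to INT makes it joinable against table2.thedate
create view drill.test as
select cast(substr(dir0, 9) as int) as thedate, t.*
from dfs.`/data/table1_parquet_partitioned` t;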
I ended up using Spark. The only alternative I currently know about, which was brought to my attention by Simon Darr (whom I wish to thank again!), is Apache Drill. Pros and cons for each solution, as far as I could test:
Neither solution was great for offering a simple way to import the existing schema when the database is exported in text (in my case, CSV files).
Both solutions import the schema correctly using parquet files, so I have decided I must recreate my tables in the parquet format from my source cluster (which uses Impala).
The problem remaining is with respect to the partitioning: I was at long last able to figure out how to import partitioned files on Spark, and the process of adding that partition dimension is seamless (I got help from here and here for that part), whereas I was not able to find a way to do this convincingly using Drill (although the creation of a view, as suggested here, does help somewhat):
On Spark, I used spark.sql("select * from parquet.`file:///mnt/data/SGDATA/sliced_liquidity_parq_part/`") (a short sketch follows after these two points). Note that it is important not to use the * wildcard, as I first did, because with the wildcard each parquet file is read without looking at the directory it belongs to, so the directory structure is not taken into account for the partitioning or for adding those fields into the schema. Without the wildcard, the directory name with syntax field_name=value is correctly added to the schema, and the value types themselves are correctly inferred (in my case, int, because I use the thedate=intvalue syntax).
On Drill, the trick of creating a view is a bit messy since it involves, first, using a substring of dir0 to extract the field_name and value, and second, a cast to convert that field to the correct type in the schema. I am really not certain this sort of view would enable partition pruning in subsequent queries, so I was not fond of this hack. NB: there is likely another way to do this properly; I simply haven't found it.
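A minimal sketch of the partition-aware read on the Spark SQL side (same path as above, run via spark.sql(...) or the spark-sql shell; the WHERE clause is only there to show that the partition column behaves like any other column):

-- pointing at the base directory (no wildcard) triggers partition discovery,
-- so the thedate=... directory names become an int column in the schema
select *
from parquet.`file:///mnt/data/SGDATA/sliced_liquidity_parq_part/`
where thedate >= 1995;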
I learned along the way about Drill (which seems great for logs and other data without a known structure), and learned that Spark could do a lot of what Drill does if the data is structured (I had no idea it could read CSVs or parquet files directly without an underlying DB system). I also did not know that Spark was so easy to install on a standalone machine: after following the steps here, I simply created a script in my bashrc which launches the master, a worker, and the shell all in one go (although I cannot comment on the performance of using a standalone computer for this; perhaps Spark is bad at this). Having used Spark a bit in the past, this solution still seems best for me given my options. If there are any other solutions out there, keep them coming, as I won't accept my own answer just yet (I have a few days required to change all my tables to parquet anyway).

Mapping multiple layouts from a working SQL table - SSIS

I have a flat file as an input that has multiple layouts:
Client# FileType Data
------- -------- --------------------------------------
Client#1FileType0Dataxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Client#1FileType1Datayyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
Client#1FileType2Datazzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
Client#2FileType0Dataxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
My PLANNED workflow goes as follows: drop the temp table, load a SQL temp table with columns Client#, FileType, Data, and then from there map my 32 file types to the actual permanent SQL tables.
My question is: is that even doable, and how would you proceed?
Can you, from such a working table, split to 32 sources? With SQL substrings? I am not sure how I will map my columns for the differing file types from my temp table, nor what 'box' to use in my workflow.
What you are describing is a very reasonable approach to loading data in a database. The idea is:
Create a staging table where all the columns are strings.
Load data into the final table, using SQL manipulations.
The advantage of this approach is that you can debug any data anomalies in the database and that generally makes things much simpler.
The answer to your question is that the following functions are generally very useful in doing this:
substring()
try_convert()
This can get more complicated if the "data" is not fixed width. In that case, you would have to use more complex string processing, and recursive CTEs or JSON functionality might help.
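As a hedged illustration of that pattern (the table names, target columns, and offsets inside Data are invented for the example; the widths of the Client# and FileType prefixes follow the sample rows shown in the question):

-- 1. Staging table: everything lands as strings
create table dbo.Staging_Client (
    ClientNo varchar(8),
    FileType varchar(9),
    Data     varchar(4000)
);

-- permanent table for one of the 32 layouts (invented columns)
create table dbo.FileType0_Target (
    ClientNo   varchar(8),
    SomeAmount decimal(10, 2),
    SomeDate   date
);

-- 2. One INSERT ... SELECT per layout: SUBSTRING carves up the Data column,
--    TRY_CONVERT turns bad values into NULLs instead of failing the load
insert into dbo.FileType0_Target (ClientNo, SomeAmount, SomeDate)
select ClientNo,
       try_convert(decimal(10, 2), substring(Data, 1, 10)),
       try_convert(date,           substring(Data, 11, 8))
from   dbo.Staging_Client
where  FileType = 'FileType0';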

How to read a text file using pl/sql and do string search for each line

I want to read a text file and store its data in a table (using a CLOB datatype). I then want to do string comparisons on the loaded data.
The loaded text file contains DDL scripts, and I want to get a segregation of new/modified tables, new/modified indexes, and constraints.
This can be done as Tom suggested in this Ask Tom article.
The challenge I'm facing here is that I have to get the above details before running those scripts; otherwise I would have used a DDL trigger to audit schema changes.
My question is: is it feasible to do string comparison on large text, or is there a better alternative? Please share your views/ideas on this.
Example file
Create table table_one
Alter table table_two
create index index_table_one_idx on table_one (column_one)
etc etc... 100s of statements
From the above code I want to get table_one and table_two as modified tables, and index_table_one_idx as a newly created index.
I want to achieve this by looking for the 'create table' and 'alter table' strings in the large text file and getting the table name using substring.
It is perfectly feasible to do string comparison on large text.
There are a couple of different approaches. One is to read the file line by line using UTL_FILE. The other would be to load it into a temporary CLOB and process it in chunks. The second way is probably the better option. Make sure to use the DBMS_LOB functions for string manipulation, because they will perform better. Find out more.
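To make the line-by-line UTL_FILE variant concrete, here is a rough sketch (the directory object and file name are placeholders, and the token-based extraction is deliberately crude; real DDL parsing needs to handle more cases than this):

declare
  l_file  utl_file.file_type;
  l_line  varchar2(32767);
  l_lower varchar2(32767);
begin
  l_file := utl_file.fopen('DDL_DIR', 'ddl_script.sql', 'r');
  loop
    begin
      utl_file.get_line(l_file, l_line);
    exception
      when no_data_found then exit;  -- end of file
    end;
    l_lower := lower(l_line);
    if l_lower like 'create table%' or l_lower like 'alter table%' then
      -- crude extraction: third whitespace-separated token is the table name
      dbms_output.put_line('table: ' || regexp_substr(l_line, '\S+', 1, 3));
    elsif l_lower like 'create index%' then
      dbms_output.put_line('index: ' || regexp_substr(l_line, '\S+', 1, 3));
    end if;
  end loop;
  utl_file.fclose(l_file);
end;
/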
Your main problem is a specification one. You need to isolate all the different starting points for SQL statements. If your script has just CREATE, ALTER and DROP then it's not too difficult, depending on how much subsequent detail you need (extract object type? extract object name? etc.) and what additional processing you need to do.
If your scripts contain DML statements the task becomes harder. If the DDL encompasses programmatic objects (TYPE, PACKAGE, etc) then it's an order of magnitude harder.

Copy tables from query in Bigquery

I am attempting to fix the schema of a Bigquery table in which the type of a field is wrong (but contains no data). I would like to copy the data from the old schema to the new using the UI ( select * except(bad_column) from ... ).
The problem is that:
if I select into a table, BigQuery drops the REQUIRED mode from the columns and therefore rejects the insert.
Exporting via JSON loses information on dates.
Is there a better solution than creating a new table with all columns being nullable/repeated or manually transforming all of the data?
Update (2018-06-20): BigQuery now supports required fields on query output in standard SQL, and has done so since mid-2017.
Specifically, if you append your query results to a table with a schema that has required fields, that schema will be preserved, and BigQuery will check as results are written that it contains no null values. If you want to write your results to a brand-new table, you can create an empty table with the desired schema and append to that table.
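A hedged sketch of that flow in standard SQL (the dataset, table, and column names are invented for the example):

-- destination table with the corrected schema; NOT NULL marks columns REQUIRED
create table mydataset.events_fixed (
  event_id   string not null,
  event_date date   not null,
  payload    string
);

-- append the old rows, dropping the bad column; the destination's REQUIRED
-- modes are preserved and enforced as the results are written
insert into mydataset.events_fixed (event_id, event_date, payload)
select * except(bad_column)
from mydataset.events_old;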
Outdated:
You have several options:
Change your field types to nullable. Standard SQL returns only nullable fields, and this is intended behavior, so going forward it may be less useful to mark fields as required.
You can use legacy SQL, which will preserve required fields. You can't use except, but you can explicitly select all other fields.
You can export and re-import with the desired schema.
You mention that export via JSON loses date information. Can you clarify? If you're referring to the partition date, then unfortunately I think any of the above solutions will collapse all data into today's partition, unless you explicitly insert into a named partition using the table$yyyymmdd syntax. (Which will work, but may require lots of operations if you have data spread across many dates.)
BigQuery now supports table clones. A table clone is a lightweight, writable copy of another table.
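For example, something along these lines (names are placeholders):

-- the clone shares storage with the source at creation time, so it is cheap,
-- and it can then be modified independently of the original
create table mydataset.mytable_clone
clone mydataset.mytable;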

Can data and schema be changed with DB2/z load/unload?

I'm trying to find an efficient way to migrate tables with DB2 on the mainframe using JCL. When we update our application such that the schema changes, we need to migrate the database to match.
What we've been doing in the past is basically creating a new table, selecting from the old table into that, deleting the original and renaming the new table to the original name.
Needless to say, that's not a very high-performance solution when the tables are big (and some of them are very big).
With later versions of DB2, I know you can do simple things like alter column types, but we have migration jobs which need to do more complicated things to the data.
Consider for example the case where we want to combine two columns into one (firstname + lastname -> fullname). Never mind that it's not necessarily a good idea to do that, just take it for granted that this is the sort of thing we need to do. There may be arbitrarily complicated transformations to the data, basically anything you can do with a select statement.
My question is this. The DB2 unload utility can be used to pull all of the data out of a table into a couple of data sets (the load JCL used for reloading the data, and the data itself). Is there an easy way (or any way) to massage this output of unload so that these arbitrary changes are made when reloading the data?
I assume that I could modify the load JCL member and the data member somehow to achieve this but I'm not sure how easy that would be.
Or, better yet, can the unload/load process itself do this without having to massage the members directly?
Does anyone have any experience of this, or have pointers to redbooks or redpapers (or any other sources) that describe how to do this?
Is there a different (better, obviously) way of doing this other than unload/load?
As you have noted, SELECTing from the old table into the new table will have very poor performance. Poor performance here is generally due to the relatively high cost of insertion INTO the target table (index building and RI enforcement). The SELECT itself is generally not a performance issue. This is why the LOAD utility is generally preferred when large tables need to be populated from scratch: indices may be built more efficiently and RI may be deferred.
The UNLOAD utility allows unrestricted use of SELECT. If you can SELECT data using scalar and/or column functions to build a result set that is compatible with your new table column definitions, then UNLOAD can be used to do the data conversion. Specify a SELECT statement in SYSIN for the UNLOAD utility. Something like:
//SYSIN DD *
SELECT CONCAT(FIRST_NAME, LAST_NAME) AS "FULLNAME"
FROM OLD_TABLE
/*
The resulting SYSRECxx file will contain a single column that is a concatenation of the two identified columns (the result of the CONCAT function), and SYSPUNCH will contain a compatible column definition for FULLNAME - the converted column name for the new table. All you need to do is edit the new table name in SYSPUNCH (this will have defaulted to TBLNAME) and LOAD it. Try not to fiddle with the SYSRECxx data or the SYSPUNCH column definitions - a goof here could get ugly.
Use the REPLACE option when running the LOAD utility to create the new table (I think the default is LOAD RESUME, which won't work here). Often it is a good idea to leave RI off when running the LOAD; this will improve performance and save the headache of figuring out the order in which LOAD jobs need to be run. Once finished, you need to verify the RI and build the indices.
The LOAD utility is documented here
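For orientation, the generated SYSPUNCH boils down to a LOAD control statement along these lines (the table name, position, and length below are placeholders; in practice you keep what SYSPUNCH generated, fix the table name, and add REPLACE):

//SYSIN DD *
  LOAD DATA REPLACE
    INTO TABLE NEW_TABLE
    ( FULLNAME POSITION(1:60) CHAR(60) )
/*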
I assume that I could modify the load JCL member and the data member somehow to achieve this but I'm not sure how easy that would be.
I believe you have provided the answer within your question. As to the question of "how easy that would be," it would depend on the nature of your modifications.
SORT utilities (DFSORT, SyncSort, etc.) now have very sophisticated data manipulation functions. We use these to move data around, substitute one value for another, combine fields, split fields, etc. albeit in a different context from what you are describing.
You could do something similar with your load control statements, but that might not be worth the trouble. It will depend on the extent of your changes. It may be worth your time to attempt to automate modification of the load control statements if you have a repetitive modification that is necessary. If the modifications are all "one off" then a manual solution may be more expedient.