I have a flat file as an input that has multiple layouts:
Client# FileType Data
------- -------- --------------------------------------
Client#1FileType0Dataxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Client#1FileType1Datayyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
Client#1FileType2Datazzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
Client#2FileType0Dataxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
My planned workflow goes as follows: drop the temp table, load a SQL temp table with columns Client#, FileType, and Data, and then from there map my 32 file types to the actual permanent SQL tables.
My question is, is that even doable and how would you proceed?
Can you split such a working table into 32 sources? With SQL substrings? I am not sure how I will map my columns from the differing file types in my temp table, or which 'box' to use in my workflow.
What you are describing is a very reasonable approach to loading data in a database. The idea is:
Create a staging table where all the columns are strings.
Load data into the final table, using SQL manipulations.
The advantage of this approach is that you can debug any data anomalies in the database and that generally makes things much simpler.
The answer to your question is that the following functions are generally very useful in doing this:
substring()
try_convert()
This gets more complicated if the "data" is not fixed width. You would then need more complex string processing; recursive CTEs or JSON functionality might help.
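A minimal T-SQL sketch of that pattern (assuming SQL Server; the column offsets, table names, and target columns below are invented for illustration, not taken from the question):

```sql
-- Staging: one wide string column per line. Offsets below assume
-- client# = chars 1-8, file type = chars 9-17; adjust to the real layout.
CREATE TABLE #staging (raw_line varchar(4000) NOT NULL);

-- ...BULK INSERT / bcp / SSIS load into #staging goes here...

-- One INSERT per file type, with substring() slicing the fixed-width data
-- and try_convert() turning bad values into NULLs instead of errors.
INSERT INTO dbo.FileType1Target (ClientNo, SomeAmount, SomeDate)
SELECT SUBSTRING(raw_line, 1, 8),
       TRY_CONVERT(decimal(10,2), SUBSTRING(raw_line, 18, 10)),
       TRY_CONVERT(date,          SUBSTRING(raw_line, 28, 8))
FROM #staging
WHERE SUBSTRING(raw_line, 9, 9) = 'FileType1';
```

Repeat (or generate) one such INSERT ... SELECT for each of the 32 file types; a mapping table plus dynamic SQL can generate them if maintaining 32 statements by hand becomes painful.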
Is there a way to copy data from S3 into Snowflake without manually defining the columns beforehand?
I don't want to have to define the schema for the table in Snowflake OR the schema for which columns should be imported from S3. I want it to be schema-on-read, not schema-on-write.
I'm using a storage integration for access to the S3 external stage.
My question is a bit similar to this question, but I don't want to have to define any columns individually. If there's a way to just add on additional columns on the fly, that would solve my issue too.
We do not currently have schema inference for COPY. I am assuming that you already know about the variant column option for JSON but it will not give you full schematization.
https://docs.snowflake.net/manuals/user-guide/semistructured-concepts.html
Dinesh Kulkarni
(PM, Snowflake)
You need to use a third-party tool that analyses your whole S3 data file in order to build an SQL schema from the data set in the file. Or maybe the tool is given access to the data source definition (which Snowflake doesn't have) to make its job easier.
You might find snippets of Snowflake stored procedure code by searching around here at Stack Overflow that output schema definitions, e.g. by recursively flattening JSON data files.
If you want the import to be flexible, you need to use a flexible data format like JSON and a flexible SQL data type like VARIANT. This will work even if your data structures change.
If you want to use rigid formats like CSV or rigid SQL data types (most are rigid), then things get complicated. Rigid formats are not flexible, and CSV files, for example, carry no embedded type information, making for massive, non-future-proof guesswork.
And maybe you are satisfied having all your columns end up as VARCHAR...
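A sketch of the VARIANT approach described above (Snowflake syntax; the stage, table, and field names are made up for illustration):

```sql
-- Land each JSON document in a single VARIANT column: no column list to
-- define up front, i.e. schema-on-read.
CREATE TABLE raw_events (v VARIANT);

COPY INTO raw_events
  FROM @my_s3_stage/events/
  FILE_FORMAT = (TYPE = 'JSON');

-- Fields are resolved (and cast) at query time instead of load time.
SELECT v:user_id::STRING       AS user_id,
       v:created_at::TIMESTAMP AS created_at
FROM raw_events;
```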
Is it possible with Netezza queries to include SQL files (which contain specific SQL code), or is that not the right way to use it?
Here is an example.
I have some common SQL code (let's say common.sql) which creates a temp table and needs to be used across multiple other queries (let's say analysis1.sql, analysis2.sql, etc.). From a code-management perspective it is quite overwhelming to maintain if the code in common.sql is repeated across the many other queries. Is there a DRY way to do this - something like #include <common.sql> from the other queries to call the reused code in common.sql?
Including SQL files is not the right way to do it. If you wish to persist with this, you could use a preprocessor like cpp or even PHP to assemble the files for you, with a build process to generate the finished queries.
However, from a maintainability perspective you are better off creating views and functions for reusable content. Note that these can pose optimization barriers, so large queries are often the way to go.
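For example, instead of a #include, the shared logic from common.sql can live in a view that every analysis query references (table and column names invented for illustration):

```sql
-- Defined once, in place of common.sql:
CREATE VIEW common_customer_spend AS
SELECT c.customer_id,
       c.region,
       SUM(o.amount) AS total_spend
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.region;

-- analysis1.sql, analysis2.sql, ... then simply reference the view:
SELECT region, AVG(total_spend)
FROM common_customer_spend
GROUP BY region;
```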
I agree: views, functions (table-valued if needed), or more likely stored procedures are the way to go.
We have had a lot of luck letting stored procedures generate complex but repeatable code patterns on the fly based on input parameters and metadata on the tables being processed.
An example: all tables have a 'unique constraint' (which is not really unique, but that doesn't matter since it isn't enforced in Netezza) with a fixed name of UBK_[tablename].
UBK is used as a 'signal' to the stored procedure, identifying the columns of the business key for a classic Kimball-style type 2 dimension table.
The SP can then apply the 'incoming' rows to the target table just by being supplied with the name of the target table and a 'stage' table containing the same column names and data types.
Another example could be an SP that takes a table name and three arguments, each a 'string,of,columns', and does an Excel-style pivot: a group-by on the columns in the first argument, a 'select distinct' on the second argument to generate the new column names for the pivoted columns, and a 'sum' of the column in the third argument, into a target table whose name you specify.
Can you follow me?
I think that the nzsql command-line tool may be able to do an 'include', but a combination of strong 'building block' stored procedures and Perl/Python and/or an ETL tool will most likely prove a better choice.
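If your nzsql build does support includes, it would be via the psql-style \i meta-command (nzsql is derived from psql); this is a sketch worth verifying against your Netezza release:

```sql
-- inside analysis1.sql, run via the nzsql client:
\i /path/to/common.sql   -- creates the shared temp table

SELECT *
FROM shared_temp_table;  -- whatever common.sql produced
```

Note this only works when running through the nzsql client, not from arbitrary tools, which is another reason views and SPs travel better.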
I am attempting to fix the schema of a BigQuery table in which the type of a field is wrong (but contains no data). I would like to copy the data from the old schema to the new using the UI ( select * except(bad_column) from ... ).
The problem is that:
if I select into a table, then BigQuery removes the REQUIRED mode of the columns and therefore rejects the insert.
Exporting via JSON loses information on dates.
Is there a better solution than creating a new table with all columns being nullable/repeated or manually transforming all of the data?
Update (2018-06-20): BigQuery now supports required fields on query output in standard SQL, and has done so since mid-2017.
Specifically, if you append your query results to a table with a schema that has required fields, that schema will be preserved, and BigQuery will check as results are written that it contains no null values. If you want to write your results to a brand-new table, you can create an empty table with the desired schema and append to that table.
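A sketch of that append pattern in BigQuery standard SQL (dataset, table, and column names invented; NOT NULL in DDL corresponds to a REQUIRED field):

```sql
-- Empty table with the desired schema, including the REQUIRED fields.
CREATE TABLE mydataset.fixed_table (
  id   INT64  NOT NULL,   -- REQUIRED
  name STRING NOT NULL,   -- REQUIRED
  note STRING             -- the repaired column, now NULLABLE
);

-- Append the query results; BigQuery checks that the REQUIRED columns
-- receive no NULLs as the results are written.
INSERT INTO mydataset.fixed_table (id, name, note)
SELECT * EXCEPT(bad_column)
FROM mydataset.old_table;
```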
Outdated:
You have several options:
Change your field types to nullable. Standard SQL returns only nullable fields, and this is intended behavior, so going forward it may be less useful to mark fields as required.
You can use legacy SQL, which will preserve required fields. You can't use except, but you can explicitly select all other fields.
You can export and re-import with the desired schema.
You mention that export via JSON loses date information. Can you clarify? If you're referring to the partition date, then unfortunately I think any of the above solutions will collapse all data into today's partition, unless you explicitly insert into a named partition using the table$yyyymmdd syntax. (Which will work, but may require lots of operations if you have data spread across many dates.)
BigQuery now supports table clones. A table clone is a lightweight, writable copy of another table.
I've got a table that can contain a variety of different types and fields, and I've got a table definitions table that tells me which field contains which data. I need to select things from that table, so currently I build up a dynamic select statement based on what's in that table definitions table and select it all into a temp table, then work from that.
The actual amount of data I'm selecting is quite big, over 5 million records. I'm wondering if a temp table is really the best way to go about doing this.
Are there other more efficient options of doing what I need to do?
If your data is static, as in reports, cache the most popular query results, preferably on the application server, or do multidimensional modeling (cubes). That is really the "more efficient option".
Temp tables, table variables, table data types... In any case you will use tempdb, so if you want to optimize your queries, try to optimize tempdb storage (after checking the IO statistics). You can also create indexes on your temp tables.
You can use table variables to achieve this functionality.
If you are using the same structure in multiple queries, you can go for user-defined table types as well.
http://technet.microsoft.com/en-us/library/ms188927.aspx
http://technet.microsoft.com/en-us/library/bb522526(v=sql.105).aspx
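A small T-SQL sketch of both options (names invented for illustration):

```sql
-- Table variable: scoped to the batch, no explicit DROP needed.
DECLARE @work TABLE (
    id  int PRIMARY KEY,
    val varchar(100)
);
INSERT INTO @work (id, val) VALUES (1, 'a'), (2, 'b');
GO

-- User-defined table type: the same structure reusable across queries
-- (and passable to procedures as a table-valued parameter).
CREATE TYPE dbo.WorkRow AS TABLE (
    id  int PRIMARY KEY,
    val varchar(100)
);
GO

DECLARE @work2 dbo.WorkRow;  -- must be in a later batch than CREATE TYPE
```

One caveat at 5 million rows: table variables carry no column statistics, so the optimizer tends to misestimate them; an indexed #temp table is usually the safer choice at that size.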
I have recently learned what dynamic SQL is, and one of its most interesting features to me is that we can use dynamic column and table names. But I cannot think of useful real-life examples. The only one that came to my mind is a statistics table.
Let's say that we have a table with name, type, and created_data columns. Then we want a table whose columns are the years from the created_data column, and whose rows show, for each type, the number of names created in each year. (Sorry for my English.)
What can be other useful real life examples of using dynamic sql with column and table as parameters? How do you use it?
Thanks for any suggestions and help :)
regards
Gabe
Edit: Thanks for the replies. I am particularly interested in examples that do not involve administrative tasks, database conversion, or the like; I am looking for examples where the equivalent code in, for example, Java would be more complicated than using dynamic SQL in, say, a stored procedure.
An example of dynamic SQL is to fix a broken schema and make it more usable.
For example if you have hundreds of users and someone originally decided to create a new table for each user, you might want to redesign the database to have only one table. Then you'd need to migrate all the existing data to this new system.
You can query the information schema for table names matching a certain naming pattern or containing certain columns, then use dynamic SQL to select all the data from each of those tables and insert it into a single table.
INSERT INTO users (name, col1, col2)
SELECT 'foo', col1, col2 FROM user_foo
UNION ALL
SELECT 'bar', col1, col2 FROM user_bar
UNION ALL
...
Then hopefully after doing this once you will never need to touch dynamic SQL again.
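For reference, the UNION ALL statement above could itself be generated with dynamic SQL; a sketch for SQL Server 2017+ (STRING_AGG; older versions would use FOR XML PATH), assuming the per-user tables are named user_<name> with illustrative columns col1 and col2:

```sql
DECLARE @sql nvarchar(max);

-- Build one SELECT per user_% table and glue them with UNION ALL.
SELECT @sql = STRING_AGG(
    CAST('SELECT ''' + REPLACE(t.name, 'user_', '') + ''' AS name, col1, col2 '
         + 'FROM ' + QUOTENAME(t.name) AS nvarchar(max)),
    ' UNION ALL ')
FROM sys.tables t
WHERE t.name LIKE 'user[_]%';

SET @sql = 'INSERT INTO users (name, col1, col2) ' + @sql;
EXEC sp_executesql @sql;
```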
Long ago I worked with an application where users had their own tables in a common database.
Imagine that each user can create their own table in the database from the UI. To get access to the data in these tables, the developer needs to use dynamic SQL.
I once had to write an Excel import where the sheet was not like a CSV file but laid out like a matrix. So I had to deal with an unknown number of columns for three temporary tables (columns, rows, "infield"). The rows also formed a short kind of tree. Sounds weird, but it was fun to do.
In SQL Server there was no chance to handle this without dynamic SQL.
Another example, from a situation I recently came up against: a MySQL database of about 250 tables, all using the MyISAM engine, with no database design schema, chart, or other explanation at all - well, except the not-so-helpful table and column names.
To plan the conversion to InnoDB and find possible foreign keys, we either had to manually check all queries (and the conditions used in JOIN and WHERE clauses) generated by the web front-end code, or write a script that uses dynamic SQL to check all combinations of columns with compatible data types and compare the data stored in those column combinations (and then manually accept or reject the candidates).
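The catalog half of that script can be plain information_schema querying; a sketch (MySQL, schema name invented) that lists the candidate column pairs whose actual data the generated dynamic SQL would then compare:

```sql
-- Candidate foreign-key pairs: columns in different tables with the same
-- data type. Each pair then gets a generated query comparing stored values.
SELECT a.TABLE_NAME, a.COLUMN_NAME,
       b.TABLE_NAME AS ref_table, b.COLUMN_NAME AS ref_column
FROM information_schema.COLUMNS a
JOIN information_schema.COLUMNS b
  ON  a.DATA_TYPE   = b.DATA_TYPE
  AND a.TABLE_NAME <> b.TABLE_NAME
WHERE a.TABLE_SCHEMA = 'mydb'
  AND b.TABLE_SCHEMA = 'mydb';
```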