BQ STRUCT versus RECORD types? - google-bigquery

This may be a very trivial question.
What is the actual difference between the STRUCT and RECORD types in GCP BigQuery? Can I use them interchangeably? If I have a table created with a column defined as STRUCT, will it show a "schema" mismatch if I try to re-run a Terraform script with the field type changed to RECORD?

I believe they are mostly the same thing; you can view them as the same concept surfacing in different components of BigQuery.
For historical reasons the Legacy SQL dialect and the storage documentation talk mostly about RECORD, while the Standard SQL dialect uses STRUCT.
A column created as STRUCT with Standard SQL DDL will appear as RECORD in the storage UI, and a Terraform script using RECORD should be compatible.
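You can see the same equivalence through the google-cloud-bigquery Java client, for what it's worth. This is only a rough sketch (the field names are made up, and I'm assuming the client's Field/type classes behave as shown): a field declared with the Standard SQL STRUCT type reports itself under the legacy RECORD name.
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.LegacySQLTypeName;
import com.google.cloud.bigquery.StandardSQLTypeName;

public class StructVsRecord {
    public static void main(String[] args) {
        // Declare a nested column using the Standard SQL name (STRUCT)...
        Field address = Field.of("address", StandardSQLTypeName.STRUCT,
                Field.of("street", StandardSQLTypeName.STRING),
                Field.of("city", StandardSQLTypeName.STRING));

        // ...and the client reports it under the legacy name (RECORD),
        // which is also what the storage UI and Terraform schemas show.
        System.out.println(address.getType());                             // RECORD
        System.out.println(address.getType() == LegacySQLTypeName.RECORD); // true
    }
}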

Related

How to pass dynamic table names for sink database in Azure Data Factory

I am trying to copy tables from one schema to another within the same Azure SQL db. So far, I have created a lookup pipeline and passed the parameters for the ForEach loop and Copy activity. But my sink dataset is not taking the parameter value I have given under the "table option" field; instead it is taking the dummy table I chose when creating the sink dataset. Can someone tell me how I can pass a dynamic table name to a sink dataset?
I have given concat('dest_schema.STG_',#{item().table_name})} in the table option field.
To make the schema and table names dynamic, add Parameters to the Dataset:
Most important - do NOT import a schema. If you already have one defined in the Dataset, clear it. For this Dataset to be dynamic, you don't want improper schemas interfering with the process.
In the Copy activity, provide the values at runtime. These can be hardcoded values, variables, parameters, or expressions, so it's very flexible.
If it's the same database, you can even use the same Dataset for both, just provide different values for the Source and Sink.
WARNING: If you use the "Auto-create table" option, the schema for the new table will define any character field as varchar(8000), which can cause serious performance problems.
MY OPINION:
While you can do this, one of my personal rules is to not cross the database boundary. If the Source and Sink are on the same SQL database, I would try to solve this problem with a Stored Procedure rather than a data factory.

How can I extract the DDL for a particular object of a database in PostgreSQL

I need to get the DDL of a particular database object, like a schema, table, column, etc. Is there a way to extract it from the system catalog tables using SQL?
I tried to find a table in information_schema or pg_catalog with the required information, but I didn't find one.
There is a way to extract it from the system catalogs, but the method depends on what type of object it is, and is not easy.
The "pg_dump" knows how to do it. I would just use that, rather than reinventing things. You can get just the DDL (exclude the data itself) using "-s" option. Then you can fish out the DDL for your specific desired object using your favorite text editor. If the object is a table, you can tell pg_dump to dump just that table, but for other objects you can't.

Writing Avro to BigQuery using Beam

Q1: Say I load Avro encoded data using the BigQuery load tool. Now I need to write this data to a different table, still in Avro format. I am trying out different partitioning in order to test the table performance. How do I write SchemaAndRecord back to BigQuery using Beam? Also, would schema detection work in this case?
Q2: It looks like schema information is lost when converting from the Avro schema type to the BigQuery schema type. For example, both the double and float Avro types are converted to the FLOAT type in BigQuery. Is this expected?
Q1: If the table already exists and its schema matches the one you're copying from, you should be able to use the CREATE_NEVER CreateDisposition (https://cloud.google.com/dataflow/model/bigquery-io#writing-to-bigquery) and just write the TableRows directly from the output of readTableRows() on the original table. That said, I suggest using BigQuery's TableCopy command instead.
Q2: That's expected; BigQuery does not have a DOUBLE type. You can find more information on the type mapping here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions. Logical types will soon be supported as well: https://issuetracker.google.com/issues/35905894.
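For Q1, something along these lines is what I mean - a rough sketch with the Beam Java SDK, where the project, dataset, and table names are placeholders and the destination table is assumed to already exist with a matching schema:
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class CopyTable {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadSource", BigQueryIO.readTableRows()
                .from("my-project:my_dataset.source_table"))
         .apply("WriteDest", BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.dest_table_partitioned")
                // The destination table already exists with a matching schema,
                // so never let the sink create it.
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));

        p.run().waitUntilFinish();
    }
}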

Keep Column types from Java ResultSet in CSV export

I'm currently building a tool that pulls data directly from a database (because SPSS Modeler is too slow) and stores it in a Java ResultSet first.
I then try to export the data into a CSV (or similar) file while keeping as many of the column types as possible.
Currently I'm using opencsv, but it casts Decimals and many other types to String. When I load the file back into SPSS Modeler I get only Integers and Strings.
Are there any CSV libraries (maybe with a special encoding) or other file types I can use to export the data with its column types (like IBM InfoSphere Data Architect can do), so I can load it directly back into SPSS Modeler without changing the types back manually there?
Thank you!
Retrieving the Column Metadata from the Database
If the data is currently stored in a database, you can retrieve the column types from it via JDBC's ResultSetMetaData. All you need to do is retrieve this information after you have queried the table and store it so that you can reuse it later.
// connect to the DB as usual and create a statement
Statement stmt = conn.createStatement();
// Run any query against the table - a dummy query is enough here,
// since you only need the metadata of the result, regardless of the actual query.
ResultSet rse = stmt.executeQuery("Select A,B FROM table WHERE ..");
// get the ResultSetMetaData
ResultSetMetaData rsmd = rse.getMetaData();
// Database-specific type names
rsmd.getColumnTypeName(1); // e.g. VARCHAR for column 1
rsmd.getColumnTypeName(2); // e.g. DATETIME for column 2
// Generic JDBC types, see java.sql.Types
// (http://docs.oracle.com/javase/7/docs/api/java/sql/Types.html)
rsmd.getColumnType(1); // e.g. 12 (Types.VARCHAR) for column 1
rsmd.getColumnType(2); // generic type for column 2
Processing
You could store this information in a CSV schema and apply it during the transformation process.
I recommend that you use the SuperCSV library.
This library provides so-called cell processors, which allow you to define the types of the columns.
Description:
Cell processors are an integral part of reading and writing with Super CSV - they automate the data type conversions, and enforce constraints. They implement the chain of responsibility design pattern - each processor has a single, well-defined purpose and can be chained together with other processors to fully automate all of the required conversions and constraint validation for a single CSV column.
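As a rough sketch of what that looks like when writing a row with per-column processors (the column names, types, and date format below are placeholders, not anything from your actual schema):
import java.io.FileWriter;
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.Date;
import org.supercsv.cellprocessor.FmtDate;
import org.supercsv.cellprocessor.Optional;
import org.supercsv.cellprocessor.constraint.NotNull;
import org.supercsv.cellprocessor.ift.CellProcessor;
import org.supercsv.io.CsvListWriter;
import org.supercsv.io.ICsvListWriter;
import org.supercsv.prefs.CsvPreference;

public class TypedCsvExport {
    public static void main(String[] args) throws Exception {
        // One processor per column: they convert and validate the values while writing.
        CellProcessor[] processors = new CellProcessor[] {
                new NotNull(),                                    // ID      (e.g. BIGINT)
                new Optional(),                                   // AMOUNT  (e.g. DECIMAL)
                new Optional(new FmtDate("yyyy-MM-dd HH:mm:ss"))  // CREATED (e.g. TIMESTAMP)
        };

        try (ICsvListWriter writer = new CsvListWriter(
                new FileWriter("export.csv"), CsvPreference.STANDARD_PREFERENCE)) {
            writer.writeHeader("ID", "AMOUNT", "CREATED");
            // In the real tool this row would come out of the ResultSet shown above.
            writer.write(Arrays.asList(1L, new BigDecimal("12.50"), new Date()), processors);
        }
    }
}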

A few questions from a Java programmer regarding porting a preexisting database stored in a .txt file to MySQL

I've been writing a library management Java app lately, and, up until now, the main library database has been stored in a .txt file which is converted to an ArrayList in Java for creating and editing the database, with the alterations then saved back to the .txt file. A very primitive method indeed. Hence, having heard about SQL, I'm considering porting my preexisting .txt database to MySQL. I have absolutely no idea how SQL, and specifically MySQL, works, except for the fact that it can interact with Java code. Can you suggest any books/websites to visit/buy? Will the book Head First SQL help, especially when using Java code to interact with the SQL database? It should be mentioned that I'm already comfortable with using 3rd-party APIs.
View from 30,000 feet:
First, you'll need to figure out how to represent the text file data using the appropriate SQL tables and fields. Here is a good overview of the different SQL data types. If each line of your data represents a single Library record, then you'll only need to create 1 table. This is definitely the simplest way to do it, as conversion will be able to work line-by-line. If the records contain a LOT of data duplication, the most appropriate approach is to create multiple tables so that your database doesn't duplicate data. You would then link these tables together using IDs.
When you've decided how to split up the data, you create a MySQL database, and within that database, you create the tables (a database is just something that holds multiple tables). Connecting to your MySQL server with the console and creating a database and tables is described in this MySQL tutorial.
Once you've got the database created, you'll need to write the code to access the database. The link from OMG Ponies shows how to use JDBC in the simplest way to connect to your database. You then use that connection to create a Statement object and execute queries to insert, update, select or delete data. If you're selecting data, you get a ResultSet back and can view the data. Here's a tutorial for using JDBC to select and use data from a ResultSet.
Your first code should probably be a Java utility that reads the text file and inserts all the data into the database. Once you have the data in place, you'll be able to update the main program to read from the database instead of the file.
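That utility can be quite small. A rough sketch (the file layout, table, and connection details are placeholders you'd replace with your own):
import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// One-off loader: reads the existing library.txt (one record per line,
// fields separated by ';') and inserts every record into MySQL.
public class TxtToMySqlLoader {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/library"; // placeholder connection details
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO book (title, author, pub_year) VALUES (?, ?, ?)");
             BufferedReader in = new BufferedReader(new FileReader("library.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(";");
                ps.setString(1, fields[0]);
                ps.setString(2, fields[1]);
                ps.setInt(3, Integer.parseInt(fields[2]));
                ps.addBatch();      // queue the row
            }
            ps.executeBatch();       // send all queued rows at once
        }
    }
}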
Know that the connection between a program and a SQL database goes through a 'connection program' - in Java's case, a JDBC driver. You write an instruction as an SQL statement, say
Select * from Customer order by name;
and then set up to retrieve the data one record at a time. Or, in the other direction, you write
Insert into Customer (name, addr, ...) values (x, y, ...);
and either replace x, y, ... with actual values or bind them through the interface.
With this understanding you should be able to read pretty much any book or JDBC API description and get started.
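Concretely, in JDBC that boils down to something like the following sketch (the connection URL, credentials, and Customer columns are placeholders):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class CustomerExample {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://localhost:3306/library"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            // The "other direction": bind actual values to the ? placeholders, then insert.
            try (PreparedStatement ins = conn.prepareStatement(
                    "INSERT INTO Customer (name, addr) VALUES (?, ?)")) {
                ins.setString(1, "Ada Lovelace");
                ins.setString(2, "12 Analytical St");
                ins.executeUpdate();
            }
            // Retrieve data one record at a time: each rs.next() moves to the next row.
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT * FROM Customer ORDER BY name")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " - " + rs.getString("addr"));
                }
            }
        }
    }
}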