I am porting a Java application from Hadoop/Hive to Google Cloud/BigQuery. The application writes Avro files to HDFS and then creates Hive external tables with one or more partitions on top of the files.
I understand BigQuery only supports date/timestamp partitions for now, and no nested partitions.
The way we currently handle Hive is that we generate the DDL and then execute it with a REST call.
I could not find support for CREATE EXTERNAL TABLE in the BigQuery DDL docs, so I've switched to using the Java library.
I managed to create an external table, but I cannot find any reference to partitions in the parameters passed to the call.
Here's a snippet of the code I use:
....
ExternalTableDefinition extTableDef =
    ExternalTableDefinition.newBuilder(schemaName, null, FormatOptions.avro()).build();
TableId tableID = TableId.of(dbName, tableName);
TableInfo tableInfo = TableInfo.newBuilder(tableID, extTableDef).build();
Table table = bigQuery.create(tableInfo);
....
There is, however, support for partitions on non-external tables.
I have a few questions:
1) Is there support for creating external tables with partition(s)? Can you please point me in the right direction?
2) Is loading the data into BigQuery preferred to having it stored in GCS Avro files?
3) If yes, how would we deal with schema evolution?
Thank you very much in advance.
You cannot create partitioned tables over files on GCS, although you can use the special _FILE_NAME pseudo-column to filter out the files that you don't want to read.
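As a rough illustration of the _FILE_NAME approach, here is a minimal sketch that builds such a query as a string, in the same spirit as generating DDL in the application. The class name, table name, and bucket path are hypothetical; it assumes the files for one "partition" share a common GCS prefix.

```java
// Sketch only: builds a BigQuery standard SQL query that uses the _FILE_NAME
// pseudo-column to read just the files under one GCS prefix.
final class FileNameFilter {

    // table:     name of the external table (hypothetical value in main below)
    // uriPrefix: GCS prefix that plays the role of a partition (hypothetical)
    static String queryForPrefix(String table, String uriPrefix) {
        // Escape backslashes and single quotes so the prefix stays inside the SQL literal.
        String escaped = uriPrefix.replace("\\", "\\\\").replace("'", "\\'");
        return "SELECT * FROM `" + table + "` "
             + "WHERE STARTS_WITH(_FILE_NAME, '" + escaped + "')";
    }

    public static void main(String[] args) {
        System.out.println(queryForPrefix(
            "my_dataset.events_ext", "gs://my-bucket/events/dt=2019-01-01/"));
    }
}
```

The resulting string can be submitted the same way as any other query; the filter limits which files are read, but it is not a real partition.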
If you can, prefer just to load data into BigQuery rather than leaving it on GCS. Loading data is free, and queries will be way faster than if you run them over Avro files on GCS. BigQuery uses a columnar format called Capacitor internally, which is heavily optimized for BigQuery, whereas Avro is a row-based format and doesn't perform as well.
In terms of schema evolution, if you need to change a column type, drop a column, etc., you should recreate your table (CREATE OR REPLACE TABLE ...). If you are only ever adding columns, you can add the new columns using the API or UI.
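To make that rule concrete, here is a small sketch that compares the current and the desired schema and decides which path applies. The class, enum, and column names are made up for illustration, and the schema is simplified to a flat map of column name to type (no nested fields, names compared case-sensitively).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: decides whether a schema change is purely additive
// (new columns can be added in place) or requires recreating the table.
final class SchemaChange {

    enum Action { NONE, ADD_COLUMNS, RECREATE_TABLE }

    static Action classify(Map<String, String> current, Map<String, String> desired) {
        for (Map.Entry<String, String> col : current.entrySet()) {
            String newType = desired.get(col.getKey());
            // A dropped column or a changed type cannot be handled by adding columns.
            if (newType == null || !newType.equals(col.getValue())) {
                return Action.RECREATE_TABLE;
            }
        }
        // Every existing column is unchanged, so any extra entries are new columns.
        return desired.size() > current.size() ? Action.ADD_COLUMNS : Action.NONE;
    }

    public static void main(String[] args) {
        Map<String, String> current = new LinkedHashMap<>();
        current.put("id", "INT64");
        Map<String, String> desired = new LinkedHashMap<>(current);
        desired.put("created_at", "TIMESTAMP");
        System.out.println(classify(current, desired));
    }
}
```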
See also a relevant blog post about lazy data loading.
I have a requirement where I might have to update BigQuery external tables on a periodic basis.
The GCS location has a timestamped folder for every incremental run, and I would like to point the external table at the latest timestamp folder.
The only way I see is dropping the table and creating it again, pointing it to the latest folder. Is there any other way to update it without dropping the table?
As suggested by @Samuel, you can use the SQL statement CREATE OR REPLACE EXTERNAL TABLE for your requirement. Scheduled queries support DML and DDL statements, which can be used to create the new tables. You can use the template parameter shown below to name the table according to your schedule:
My_database_name.my_table_name.my_results_{run_date}
For more information you can refer to this documentation.
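For illustration, here is a minimal sketch that generates the CREATE OR REPLACE EXTERNAL TABLE statement for the latest folder. The class name, table name, bucket path, and the AVRO format are assumptions, since the question does not state them.

```java
// Sketch only: builds the DDL that re-points an external table at the
// newest timestamped GCS folder without dropping the table first.
final class ExternalTableDdl {

    // table:     external table name, e.g. "my_dataset.my_table" (hypothetical)
    // format:    source format accepted by BigQuery, e.g. "AVRO" (assumed here)
    // folderUri: GCS folder of the latest run (hypothetical)
    static String createOrReplace(String table, String format, String folderUri) {
        // Match every file in the folder with a trailing wildcard.
        String uri = folderUri.endsWith("/") ? folderUri + "*" : folderUri + "/*";
        return "CREATE OR REPLACE EXTERNAL TABLE `" + table + "`\n"
             + "OPTIONS (\n"
             + "  format = '" + format + "',\n"
             + "  uris = ['" + uri + "']\n"
             + ")";
    }

    public static void main(String[] args) {
        System.out.println(createOrReplace(
            "my_dataset.my_table", "AVRO", "gs://my-bucket/exports/20230101T000000"));
    }
}
```

The generated statement can be run from a scheduled query or submitted as a regular query job after each incremental run.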
I'm just starting out with BigQuery and trying to find the best way to upload database tables to it. We've been converting the table content to Avro using the avsc library, because from all the docs it seems that Avro is the fastest format to load, but it adds a root element to the schema, so that all the columns become root.name, root.time, etc. I saw there was another post about this from 2016 here, and the solution is to use a temporary table and strip out the root like this:
bq query --external_table_definition=foo::AVRO=gs://your_bucket/path/file.avro* --destination_table your_dataset.your_table "SELECT root.* FROM foo"
However, the Node.js library only has instructions for creating permanent tables, not temporary ones. Even if I wanted to create a permanent table, I can't: because of the "root" element, all columns are placed in one row, and the amount of data exceeds the maximum allowed in a single row. How can I load the data into BigQuery?
You can create an external table using the BigQuery Node.js library [1]; you need to set the options object appropriately [2].
[1] https://github.com/googleapis/nodejs-bigquery/blob/master/samples/createTable.js
[2] https://stackoverflow.com/a/42916251/5873699
Are two Hive tables (native and external) always required for querying a DynamoDB table from AWS EMR?
I have created a native Hive table (CTAS, create table as select) from a Hive external table that is mapped to a DynamoDB table. Read queries against the external table are slow and consume the DynamoDB table's read throughput, whereas queries against the native table are fast and consume no read throughput.
My questions:
Is this standard practice/best practice, i.e. create an external table mapped to a DynamoDB table, then create a CTAS table and run all read queries against the CTAS table?
Where or how do GSIs on DynamoDB come into the picture on the Hive side? Out of curiosity, I tried mapping a column of my external Hive table to a DynamoDB GSI and, somewhat expectedly, saw NULLs.
So, back to question 2: how are GSIs used with a native or external Hive table?
Thanks,
The answer is no.
However, from my observation, if a Hive native table is populated (CTAS) from a Hive external table that references a DynamoDB table, querying the native table from EMR does not count against the DynamoDB read throughput. You do have to take into account the periodic update (data refresh) of the Hive native table.
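A minimal sketch of that pattern, generating the HiveQL for the initial copy and for the periodic refresh. The class and table names are hypothetical, and using INSERT OVERWRITE for the refresh is one possible choice rather than something the original setup prescribes. Note that each refresh scans the external table, so it is the refresh itself that consumes DynamoDB read throughput.

```java
// Sketch only: HiveQL statements for keeping a native (managed) Hive copy
// of a DynamoDB-backed external table.
final class NativeTableRefresh {

    // Initial copy: create the native table from the external one (CTAS).
    static String createStatement(String nativeTable, String externalTable) {
        return "CREATE TABLE " + nativeTable + " AS SELECT * FROM " + externalTable;
    }

    // Periodic refresh: replace the native table's contents with a fresh scan.
    static String refreshStatement(String nativeTable, String externalTable) {
        return "INSERT OVERWRITE TABLE " + nativeTable + " SELECT * FROM " + externalTable;
    }

    public static void main(String[] args) {
        System.out.println(createStatement("orders_native", "orders_ddb_ext"));
        System.out.println(refreshStatement("orders_native", "orders_ddb_ext"));
    }
}
```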
I understand the difference between internal and external tables in Hive as follows:
1) If we drop an internal table, both the files and the metadata are deleted; for an external table, only the metadata is deleted.
2) If the data files need to be shared with other tools/applications, we use an external table, so that the data is still available to them after the table is dropped; otherwise we use an internal table.
I have gone through the answers to the question "Difference between Hive internal tables and external tables?"
but I am still not clear about the proper use cases for an internal table.
So my question is: why would I need to create an internal table? Why can't I make everything an external table?
Use EXTERNAL tables when:
The data is also used outside of Hive.
For example, the data files are read and processed by an existing program that doesn't lock the files.
The data is permanent, i.e. it is kept around and used whenever needed.
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the lifecycle of the table and data.
Let's understand it with two simple scenarios:
Suppose you have a data set and you have to solve several analytics problems on it. Because of the nature of the problems, some of them can be done in HiveQL, some need Pig Latin, and some need MapReduce, etc. This is where an external table comes into the picture: the same data set can be used for the entire analysis, instead of keeping separate copies of it for the different tools. Here Hive doesn't need authority over the data set, because several tools are going to use it.
There can also be a scenario where all the analytics problems can be solved with HiveQL alone. This is where an internal table comes into the picture: you can put the entire data set into Hive's warehouse, and Hive will have complete authority over the data set.
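The two scenarios differ only slightly in the DDL. Here is a small sketch that generates both variants; the class, table, and column names and the HDFS path are hypothetical, and STORED AS AVRO is just an assumed file format.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Sketch only: builds Hive DDL for an external table (Hive tracks metadata only)
// or an internal/managed table (Hive owns the data in its warehouse).
final class HiveTableDdl {

    // Pass an HDFS location for an external table, or null for an internal one.
    static String create(String table, Map<String, String> columns, String externalLocation) {
        StringJoiner cols = new StringJoiner(", ", "(", ")");
        for (Map.Entry<String, String> c : columns.entrySet()) {
            cols.add(c.getKey() + " " + c.getValue());
        }
        boolean external = externalLocation != null;
        String ddl = "CREATE " + (external ? "EXTERNAL " : "") + "TABLE " + table
                   + " " + cols + " STORED AS AVRO";
        // Only the external variant points at data that lives outside the warehouse.
        return external ? ddl + " LOCATION '" + externalLocation + "'" : ddl;
    }

    public static void main(String[] args) {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("id", "BIGINT");
        columns.put("name", "STRING");
        System.out.println(create("logs", columns, "/data/shared/logs")); // external
        System.out.println(create("logs_tmp", columns, null));            // internal
    }
}
```

Dropping the first table leaves /data/shared/logs untouched for the other tools; dropping the second removes its data as well.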
I am new to Hadoop and I have just started working with Hive. In my understanding, it provides a query language to process data in HDFS. With HiveQL we can create tables and load data into them from HDFS.
So my question is: where are those tables stored? Specifically, if we have a 100 GB file in our HDFS and we want to make a Hive table out of that data, what will be the size of that table and where is it stored?
If my understanding of this concept is wrong, please correct me.
If the data is 100 GB, you should consider a Hive external table (as opposed to a "managed table"; for the difference, see this).
With an external table, the data itself stays on HDFS in the file path that you specify (note that you may specify a directory of files as long as they all have the same structure), and Hive only records a mapping to it in the metastore, whereas a managed table stores the data "in Hive", i.e. under Hive's warehouse directory.
Dropping a managed table also drops the underlying data, whereas dropping a Hive external table only removes the metadata in the metastore that references that data.
Either way, you are using only 100 GB as viewed by the user, and you benefit from HDFS's robustness through replication of the data.
Hive will create a directory on HDFS. If you don't specify a location, it creates the table's directory under /user/hive/warehouse on HDFS. After a LOAD command, the files are moved into /user/hive/warehouse/tablename. You can also point the table at an existing HDFS directory if it contains partitions (if the files are partitioned), or use the external table concept.
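As a rough sketch of where a managed table ends up when no location is given: the helper below assumes the default warehouse directory /user/hive/warehouse and Hive's usual layout, in which tables of a non-default database live under a "<database>.db" subdirectory. The class and method names are made up for illustration.

```java
// Sketch only: computes the HDFS directory Hive uses for a managed table
// when no LOCATION is specified.
final class HiveWarehousePath {

    // Default value of hive.metastore.warehouse.dir.
    static final String DEFAULT_WAREHOUSE = "/user/hive/warehouse";

    static String managedTableDir(String warehouseDir, String database, String table) {
        // Normalize a trailing slash so the pieces join cleanly.
        String base = warehouseDir.endsWith("/")
            ? warehouseDir.substring(0, warehouseDir.length() - 1)
            : warehouseDir;
        if (database == null || database.equalsIgnoreCase("default")) {
            return base + "/" + table;
        }
        // Tables in other databases sit under "<database>.db".
        return base + "/" + database + ".db/" + table;
    }

    public static void main(String[] args) {
        System.out.println(managedTableDir(DEFAULT_WAREHOUSE, "default", "sales"));
        System.out.println(managedTableDir(DEFAULT_WAREHOUSE, "analytics", "sales"));
    }
}
```

The table does not take extra space beyond the files in that directory; the metastore only holds the schema and the path.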