BigQuery external table metadata - google-bigquery

I want to extract BigQuery external table metadata.
I've gone through the documentation, but I'm not able to find a field that gives me the external table's location on GCS.
Is there any other metadata table that gives me the location information of an external table?

Using SPLIT, since there can be multiple comma-separated URIs:
SELECT ddl, SPLIT(REGEXP_EXTRACT(ddl, r"(?i)uris\s*=\s*\[(.*)\]")) as uris
FROM `catalog.schema.INFORMATION_SCHEMA.TABLES`
WHERE table_type = 'EXTERNAL';
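If you also want to strip the quoting around each URI, something like this should work (a sketch; I'm assuming the generated DDL quotes each URI, so the trim removes spaces and either quote character):
-- same extraction as above, but each URI is trimmed of quotes and whitespace
SELECT
  ddl,
  ARRAY(
    SELECT TRIM(uri, " '\"")
    FROM UNNEST(SPLIT(REGEXP_EXTRACT(ddl, r"(?i)uris\s*=\s*\[(.*)\]"))) AS uri
  ) AS uris
FROM `catalog.schema.INFORMATION_SCHEMA.TABLES`
WHERE table_type = 'EXTERNAL';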

Related

How to create an external table for Google Cloud Storage and query the externally partitioned data?

I am trying to query externally partitioned data with reference to this BigQuery doc.
Google Cloud Storage (the CSV data contains string values only):
gs://project/myfolder/count=1000/file_1k.csv
gs://project/myfolder/count=10000/file_10k.csv
gs://project/myfolder/count=100000/file_100k.csv
Source URI prefix: gs://project/myfolder
But I am getting the following error while querying the table,
Error while reading table: project.dataset.partition_table,
error message: Cannot query hive partitioned data for table project.dataset.partition_table without any associated files.
Query:
SELECT * FROM `project.dataset.partition_table` where count=1000 order by rand() LIMIT 100;
Any inputs here are really appreciated.
The problem is that the engine can't find the files related to the partitions.
In your case, it's because when you created the table you referenced a folder in GCS, but not its files.
To solve your problem, use a wildcard: your path would be gs://project/myfolder/* instead of gs://project/myfolder.
I hope this helps.
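If you prefer to recreate the table with DDL, the wildcard goes into the uris option, roughly like this (a sketch based on the names in the question; I'm assuming the CSV schema and the count partition column can be auto-detected):
-- wildcard in uris, prefix (without wildcard) in hive_partition_uri_prefix
CREATE EXTERNAL TABLE `project.dataset.partition_table`
WITH PARTITION COLUMNS
OPTIONS (
  format = 'CSV',
  uris = ['gs://project/myfolder/*'],
  hive_partition_uri_prefix = 'gs://project/myfolder'
);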

create BigQuery external tables partitioned by one/multiple columns

I am porting a java application from Hadoop/Hive to Google Cloud/BigQuery. The application writes avro files to hdfs and then creates Hive external tables with one/multiple partitions on top of the files.
I understand BigQuery only supports date/timestamp partitions for now, and no nested partitions.
The way we handle Hive now is that we generate the DDL and then execute it with a REST call.
I could not find support for CREATE EXTERNAL TABLE in the BigQuery DDL docs, so I've switched to using the Java library.
I managed to create an external table, but I cannot find any reference to partitions in the parameters passed to the call.
Here's a snippet of the code I use:
....
// the first builder argument is the GCS source URI of the Avro files;
// the schema is left null here since Avro files carry their own schema
ExternalTableDefinition extTableDef =
    ExternalTableDefinition.newBuilder(schemaName, null, FormatOptions.avro()).build();
// target dataset and table name for the new external table
TableId tableID = TableId.of(dbName, tableName);
TableInfo tableInfo = TableInfo.newBuilder(tableID, extTableDef).build();
Table table = bigQuery.create(tableInfo);
....
There is, however, support for partitions on non-external tables.
I have a few questions:
Is there support for creating external tables with partition(s)? Can you please point me in the right direction?
Is loading the data into BigQuery preferred to having it stored in GCS Avro files?
If yes, how would we deal with schema evolution?
Thank you very much in advance.
You cannot create partitioned tables over files on GCS, although you can use the special _FILE_NAME pseudo-column to filter out the files that you don't want to read.
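For example, a _FILE_NAME filter could look like this (a sketch; the table and paths are hypothetical):
-- read only the files under a given "partition" prefix of the external table
SELECT *
FROM `project.dataset.avro_external`
WHERE _FILE_NAME LIKE 'gs://my-bucket/events/dt=2019-01-01/%';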
If you can, prefer just to load data into BigQuery rather than leaving it on GCS. Loading data is free, and queries will be way faster than if you run them over Avro files on GCS. BigQuery uses a columnar format called Capacitor internally, which is heavily optimized for BigQuery, whereas Avro is a row-based format and doesn't perform as well.
In terms of schema evolution, if you need to change a column type, drop a column, etc., you should recreate your table (CREATE OR REPLACE TABLE ...). If you are only ever adding columns, you can add the new columns using the API or UI.
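For example, a column-type change could be handled like this (a sketch; the events table and user_id column are hypothetical):
-- recreate the table in place, casting one column to its new type
CREATE OR REPLACE TABLE `project.dataset.events` AS
SELECT * REPLACE (CAST(user_id AS STRING) AS user_id)
FROM `project.dataset.events`;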
See also a relevant blog post about lazy data loading.

Why does the Hive archive not support EXTERNAL_TABLE?

At present, our tables are EXTERNAL_TABLE, and there is a large amount of metadata in the NameNode, so I need to archive in order to reduce it, but the Hive archive only supports MANAGED_TABLE.
Can someone explain why archives don't support EXTERNAL_TABLE?
Is there any downside if I change the code this way?
// proposed change: also allow EXTERNAL_TABLE through the archive check
if (!(tbl.getTableType() == TableType.MANAGED_TABLE || tbl.getTableType() == TableType.EXTERNAL_TABLE)) {
  throw new HiveException("ARCHIVE can only be performed on managed tables");
}
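For context, the operation being gated by this check is the partition ARCHIVE/UNARCHIVE DDL, which Hive only allows on managed tables (an illustration; the logs table and its dt partition are hypothetical):
-- requires hive.archive.enabled=true; packs the partition into a HAR file to reduce NameNode file count
ALTER TABLE logs ARCHIVE PARTITION (dt='2019-01-01');
-- and the reverse operation
ALTER TABLE logs UNARCHIVE PARTITION (dt='2019-01-01');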
Why do you want to archive the metadata of external tables? Just to reduce the information? If you have the external tables in Hive and you are doing this, then this is not right. If you have the tables in Hive, then the metadata should not be deleted.
In Hive, if you drop an external table, the metadata is automatically deleted as well. If you drop a managed table, both the data and the metadata get deleted;
if you drop an external table, only the metadata gets deleted.

Copying data from External table to database

I have data in an external table. Now I'm copying the data from the external table to a newly created table in a database. What kind of table will the table in the database be? Is it a managed table or an external table? I need your help to understand the concept behind this question.
Thanks,
Madan Mohan S
Hive tables get their type, "managed" or "external", at the time of their creation, not when data is inserted.
So the table employees is external (because it was created using "CREATE EXTERNAL" in the DDL and a location for the data files was provided).
The emp table is managed because "EXTERNAL" was NOT used in the DDL and a data location was not needed.
The difference now is: if the table employees is dropped, the data it was reading from the provided "location" is not deleted. So an external table is useful when the data is being read by multiple tools, e.g. Pig. If a Pig script is reading the same location, it will still function even though the employees table has been dropped.
But emp is managed (in other words, both the metadata and the data are managed by Hive), so when emp is dropped the data is also deleted. So after dropping it, if you check the Hive warehouse directory you will no longer find the "emp" HDFS directory.
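A minimal sketch of the two kinds of table discussed here (assuming comma-delimited files under /data/employees; a plain CTAS copy creates a managed table):
-- external: Hive only tracks metadata; the files stay where they are
CREATE EXTERNAL TABLE employees (name STRING, salary FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/employees';

-- managed: created by copying the data, so Hive owns both data and metadata
CREATE TABLE emp AS SELECT * FROM employees;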

Hive External table

When we create a table using
CREATE EXTERNAL TABLE employee (name string, salary float) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/emp'
and there are 2 emp files in the /emp directory,
then when we run SELECT * FROM employee, it gets the data from both files and displays it.
What will happen when there are also other files whose records have columns that don't match the employee table? Will it try to load all of the files when we run "SELECT * FROM employee"?
1. Can we specify the specific file name which we want to load?
2. Can we create another table with the same location?
Thanks
Prashant
It will load all the files in the emp directory, even if they don't match the table.
For your first question: you can use the Regex SerDe. If your data matches the regex, it is loaded into the table.
Regex SerDe for access logs in Hive:
https://github.com/apache/hive/blob/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java
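A sketch of what that could look like for comma-separated employee data (assuming the hive-contrib jar is available; note the contrib RegexSerDe requires every column to be STRING):
-- rows that don't match input.regex come back as NULLs instead of breaking the query
CREATE EXTERNAL TABLE employee_regex (name STRING, salary STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^,]*),([^,]*)")
LOCATION '/emp';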
Other options: I am pointing to some links; these links describe some ways to do it.
When creating an external table in Hive, can I point the location to specific files in a directory?
https://issues.apache.org/jira/browse/HIVE-951
For your second question: yes, we can create other tables with the same location.
Here are your answers:
1. If the data in a file doesn't match the table format, Hive doesn't throw an error. It tries to read the data as best as it can; if data for some columns is missing, it will put NULL for them.
2. No, we cannot specify a file name for a table to read data from. Hive will consider all the files under the table directory.
3. Yes, we can create other tables with the same location.