I'm streaming data into BigQuery, but somehow the table I created disappears from the web UI while the dataset remains.
I set the dataset to never expire; is there any configuration for the table itself?
I'd look into Mikhail's suggestion of checking the table's explicit expiration time. The tables could also be getting deleted via the tables.delete API, possibly by another user or process. You could check the operations on your table in your project's audit logs to see if something is deleting them.
is there any configuration for the table itself?
The expiration set on a dataset is only the default expiration for newly created tables.
The table itself can be given an expiration using the expirationTime property.
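As a sketch using the Python google-cloud-bigquery client (the table ID and retention period are placeholders), setting expirationTime on an existing table looks like this:

```python
from datetime import datetime, timedelta, timezone

def set_table_expiration(table_id, days=30):
    """Set an absolute expiration time (expirationTime) on an existing table."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client()
    table = client.get_table(table_id)  # e.g. "my_project.my_dataset.my_table"
    table.expires = datetime.now(timezone.utc) + timedelta(days=days)
    # Only the "expires" field is sent in the update request.
    return client.update_table(table, ["expires"])
```

To make a table never expire, set `table.expires = None` and update the same field.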
Related
In BigQuery a lot of temp_table_* tables are created, cluttering the datasets. Is there any way to delete these temp tables automatically?
On any dataset you can set a default expiration time for any new table you create in it. Then everything gets deleted on schedule.
The same goes for tables: even at creation time you can set an automatic deletion date.
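For example, a sketch with the Python google-cloud-bigquery client (the dataset ID and the seven-day window are assumptions); the dataset-level default applies to every table created afterwards:

```python
def set_default_table_expiration(dataset_id, days=7):
    """Set a dataset-level default expiration applied to newly created tables."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client()
    dataset = client.get_dataset(dataset_id)  # e.g. "my_project.my_dataset"
    dataset.default_table_expiration_ms = days * 24 * 60 * 60 * 1000
    return client.update_dataset(dataset, ["default_table_expiration_ms"])
```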
Now, who is creating these temp tables you see? I have no idea; I've never seen them.
I'm 99.9% sure these tables are not created by any process run by Google. No one at Google would format dates that way.
The temp tables are created by one of the third-party consumers; they use the Simba JDBC connector, which internally uses the temp_table_ prefix.
There is currently no option to set an expiration date for these tables; once that option is introduced, we will leverage it.
Reference:
https://www.simba.com/products/BigQuery/doc/JDBC_InstallGuide/content/jdbc/bq/options/largeresulttable.htm
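Until such an option exists, one workaround is a scheduled cleanup job. A minimal sketch using the Python google-cloud-bigquery client (the dataset ID, prefix, and retention window are assumptions):

```python
from datetime import datetime, timedelta, timezone

def delete_stale_temp_tables(dataset_id, prefix="temp_table_", max_age_hours=24):
    """Delete tables whose name starts with `prefix` and are older than the cutoff."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client()
    cutoff = datetime.now(timezone.utc) - timedelta(hours=max_age_hours)
    for item in client.list_tables(dataset_id):
        if not item.table_id.startswith(prefix):
            continue
        table = client.get_table(item.reference)  # fetch creation time
        if table.created < cutoff:
            client.delete_table(table, not_found_ok=True)
```

Run on a schedule (e.g. Cloud Scheduler or cron), this keeps the dataset free of connector leftovers.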
At first, I created an empty table with partitioning and clustering. After that, I wanted to configure the Data Transfer Service to fill my table from Google Cloud Storage. But when I configured the transfer, I didn't see a parameter field that allows choosing the clustering field.
I tried to do the same thing without the clustering and I could fill my table easily.
BigQuery error when I ran the transfer:
Failed to start job for table matable$20190701 with error INVALID_ARGUMENT: Incompatible table partitioning specification. Destination table exists with partitioning specification interval(type:DAY,field:) clustering(string_field_15), but transfer target partitioning specification is interval(type:DAY,field:). Please retry after updating either the destination table or the transfer partitioning specification.
When you define the table you specify the partitioning and clustering columns; that's everything you need to do.
When you load the data (from the CLI or UI) from GCS, BigQuery automatically partitions and clusters it.
If you can give more detail on how you create the table and set up the transfer, it would help provide a more detailed explanation.
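As an illustration, a sketch of an equivalent load with the Python google-cloud-bigquery client (the URI, file format, and the clustering column name are placeholders taken from the error message above):

```python
def load_from_gcs(table_id, gcs_uri):
    """Load a CSV file from GCS into a day-partitioned, clustered table."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        # Declares the layout if the destination does not exist yet; it must
        # match the existing table's partitioning/clustering otherwise.
        time_partitioning=bigquery.TimePartitioning(
            type_=bigquery.TimePartitioningType.DAY),
        clustering_fields=["string_field_15"],  # placeholder column name
    )
    job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
    return job.result()  # blocks until the load finishes
```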
Thanks for your time.
Of course:
empty table configuration
transfer configuration
I succeed in transferring data without clustering, but when I add a clustering field to my empty table, the transfer fails.
The issue I am facing in my nodejs application is identical to this user's question: Cannot insert new value to BigQuery table after updating with new column using streaming API.
To my understanding, changes such as widening a table's schema may require some period of time before streamed inserts can reference the new columns; otherwise a 'no such field' error is returned. For me this error is not always consistent, as sometimes I am able to insert successfully.
However, I specifically wanted to know whether you could use a load job instead of streaming. If so, what drawbacks does it have? I am not sure of the difference even after reading the documentation.
Alternatively, if I do use streaming but with the ignoreUnknownValues option, does that mean that all of the data is eventually inserted including data referencing new columns? Just that new columns are not queryable until the table schema is finished updating?
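For illustration, a hedged sketch of the load-job route using the Python google-cloud-bigquery client (the question is about Node.js, but the client API shape is similar; table ID and options here are assumptions). A load job is batch rather than real-time and is subject to load quotas, but it can widen the destination schema in the same request via ALLOW_FIELD_ADDITION:

```python
def load_rows(table_id, rows):
    """Insert a list of dicts via a batch load job instead of the streaming API."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        # Allow the load to add new columns to the destination schema.
        schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        autodetect=True,
    )
    job = client.load_table_from_json(rows, table_id, job_config=job_config)
    return job.result()  # blocks until the load completes
```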
I saw the following at this link, which affects Impala version 1.1:
Since Impala 1.1, REFRESH statement only works for existing tables. For new tables you need to issue "INVALIDATE METADATA" statement.
Does this still hold true for later versions of Impala?
According to Cloudera's Impala guide (Cloudera Enterprise 5.8; the text stayed the same for 5.9):
INVALIDATE METADATA and REFRESH are counterparts: INVALIDATE METADATA
waits to reload the metadata when needed for a subsequent query, but
reloads all the metadata for the table, which can be an expensive
operation, especially for large tables with many partitions. REFRESH
reloads the metadata immediately, but only loads the block location
data for newly added data files, making it a less expensive operation
overall. If data was altered in some more extensive way, such as being
reorganized by the HDFS balancer, use INVALIDATE METADATA to avoid a
performance penalty from reduced local reads. If you used Impala
version 1.0, the INVALIDATE METADATA statement works just like the
Impala 1.0 REFRESH statement did, while the Impala 1.1 REFRESH is
optimized for the common use case of adding new data files to an
existing table, thus the table name argument is now required.
and related to working on existing tables:
The table name is a required parameter [for REFRESH]. To flush the metadata for all
tables, use the INVALIDATE METADATA command.
Because REFRESH table_name only works for tables that the current
Impala node is already aware of, when you create a new table in the
Hive shell, enter INVALIDATE METADATA new_table before you can see the
new table in impala-shell. Once the table is known by Impala, you can
issue REFRESH table_name after you add data files for that table.
So it seems like it indeed stayed the same. I believe CDH 5.9 comes with Impala 2.7.
As per the Impala documentation on INVALIDATE METADATA and REFRESH:
INVALIDATE METADATA Statement
The INVALIDATE METADATA statement marks the metadata for one or all tables as stale. The next time the Impala service performs a query against a table whose metadata is invalidated, Impala reloads the associated metadata before the query proceeds. As this is a very expensive operation compared to the incremental metadata update done by the REFRESH statement, when possible, prefer REFRESH rather than INVALIDATE METADATA.
INVALIDATE METADATA is required when the following changes are made outside of Impala, in Hive and other Hive client, such as SparkSQL:
Metadata of existing tables changes.
New tables are added, and Impala will use the tables.
The SERVER or DATABASE level Sentry privileges are changed.
Block metadata changes, but the files remain the same (HDFS rebalance).
UDF jars change.
Some tables are no longer queried, and you want to remove their metadata from the catalog and coordinator caches to reduce memory requirements.
No INVALIDATE METADATA is needed when the changes are made by impalad.
REFRESH Statement
The REFRESH statement reloads the metadata for the table from the metastore database and does an incremental reload of the file and block metadata from the HDFS NameNode. REFRESH is used to avoid inconsistencies between Impala and external metadata sources, namely Hive Metastore (HMS) and NameNodes.
Usage notes:
The table name is a required parameter, and the table must already exist and be known to Impala.
Only the metadata for the specified table is reloaded.
Use the REFRESH statement to load the latest metastore metadata for a particular table after one of the following scenarios happens outside of Impala:
Deleting, adding, or modifying files.
For example, after loading new data files into the HDFS data directory for the table, appending to an existing HDFS file, inserting data from Hive via INSERT or LOAD DATA.
Deleting, adding, or modifying partitions.
For example, after issuing ALTER TABLE or other table-modifying SQL statement in Hive
In short: INVALIDATE METADATA refreshes both the metastore metadata and the data (structure and data), a complete flush.
REFRESH updates only the data, a lightweight flush.
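As an illustration, both statements are issued like any other query. A sketch using the impyla client (host, port, and table names are assumptions):

```python
def refresh_metadata(table_name, new_table=False, host="impala-host", port=21050):
    """Issue REFRESH for new data files in a table Impala already knows about,
    or INVALIDATE METADATA when the table was created outside Impala (e.g. Hive)."""
    from impala.dbapi import connect  # pip install impyla
    stmt = ("INVALIDATE METADATA " if new_table else "REFRESH ") + table_name
    conn = connect(host=host, port=port)
    try:
        conn.cursor().execute(stmt)
    finally:
        conn.close()
```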
I need to come up with a robust solution to detect when a new table is created in MS Access and then copy its contents to a master table.
One application writes new data as new tables into MS Access; this part can't be changed. These tables then have to be copied into a master table to be picked up by an interface.
Is there a trigger in MS Access for when a new table is created?
I was also thinking about a timer and then looking up all tables.
Any ideas or suggestions?
Access does not expose an event for table creation. So you will have to check whether a new table has been created.
If you're not deleting tables, you could examine whether CurrentDb.TableDefs.Count has increased since the last time you checked.
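A sketch of that polling idea from outside Access, using Python with pyodbc (the ODBC driver name and database path are assumptions): snapshot the table list and diff it on each timer tick.

```python
def list_user_tables(conn):
    """Return the set of non-system table names in an Access database."""
    cur = conn.cursor()
    return {row.table_name for row in cur.tables(tableType="TABLE")
            if not row.table_name.startswith("MSys")}

def detect_new_tables(conn, known):
    """Diff the current table set against a previous snapshot."""
    current = list_user_tables(conn)
    return current - known, current

# Usage sketch; the driver name and file path are assumptions:
# import pyodbc
# conn = pyodbc.connect(
#     r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:\data\app.accdb")
# new_tables, known = detect_new_tables(conn, set())
```

Comparing name sets rather than TableDefs.Count also survives table deletions.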
You need to trap the COMPLETION of putting all the data into the table, not the initial creation, and not some point during the inserts when the count may be greater than before.
The copy operation cannot start until all the data is in the table.
Thus the creating program needs to send a signal when it's done.
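One conventional signal is a small control table the producer writes to when it finishes; the copier then only picks up tables flagged as done. A hypothetical sketch (the LoadStatus table and its columns are assumptions, not part of the original schema):

```python
def tables_ready_to_copy(conn):
    """Return table names the producing program has flagged as complete."""
    cur = conn.cursor()
    cur.execute("SELECT table_name FROM LoadStatus WHERE done = True")
    return [row.table_name for row in cur.fetchall()]
```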