I am in the learning phase of Hive. I read that if we use the auto purge property on a table, the data won't go into the Trash. Can we still access the table's old data after overwriting it with new data, if the auto purge property is set to true?
No, you will not be able to access the old data if you set the "auto.purge"="true" property on a Hive table. It is rarely used, because you usually want deleted data to go to the Trash directory in case you removed it accidentally; Trash then automatically removes older files depending on the retention policy that is set.
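For reference, a minimal sketch of how the property is typically set; my_table and staging_table are hypothetical names:
-- set at creation time
CREATE TABLE my_table (id INT, name STRING)
TBLPROPERTIES ('auto.purge'='true');
-- or on an existing table
ALTER TABLE my_table SET TBLPROPERTIES ('auto.purge'='true');
-- with auto.purge=true, the old files skip the Trash on overwrite/drop/truncate
INSERT OVERWRITE TABLE my_table SELECT id, name FROM staging_table;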
I have a process that uses the PutBigQueryBatch processor, and I would like it to truncate the table before inserting the data. I defined an Avro schema and previously created the table in BigQuery, specifying how I wanted the fields. I am aware that if I set the "Write Disposition" property to "WRITE_TRUNCATE", it will truncate the table. However, when I use this option, the schema of the table in BigQuery ends up being deleted, which I would not like to happen, and a new schema is created to record the data. I understand that the "Create Disposition" property exists, and that if the "CREATE_NEVER" option is selected, the existing schema should be respected and not deleted.
When I run this processor with the "Write Disposition" property set to "WRITE_APPEND", the schema I created in BigQuery is respected, but with "WRITE_TRUNCATE" it is not.
Is there any way to use the "WRITE_TRUNCATE" option and the table schema not be deleted?
Am I doing something wrong?
Below is the configuration I am using in the PutBigQueryBatch processor:
PutBigQueryBatch processor configuration
It sounds like what you want is to run a TRUNCATE TABLE query before starting your process: https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#truncate_table_statement
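A minimal sketch of that approach, issued from the BigQuery console or any client that can run a query before the load; the project, dataset, and table names are hypothetical:
-- removes all rows but keeps the table definition and schema
TRUNCATE TABLE `my_project.my_dataset.my_table`;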
I saw the following at this link, which applies to Impala version 1.1:
Since Impala 1.1, REFRESH statement only works for existing tables. For new tables you need to issue "INVALIDATE METADATA" statement.
Does this still hold true for later versions of Impala?
According to Cloudera's Impala guide (for Cloudera Enterprise 5.8, unchanged in 5.9):
INVALIDATE METADATA and REFRESH are counterparts: INVALIDATE METADATA waits to reload the metadata when needed for a subsequent query, but reloads all the metadata for the table, which can be an expensive operation, especially for large tables with many partitions. REFRESH reloads the metadata immediately, but only loads the block location data for newly added data files, making it a less expensive operation overall. If data was altered in some more extensive way, such as being reorganized by the HDFS balancer, use INVALIDATE METADATA to avoid a performance penalty from reduced local reads. If you used Impala version 1.0, the INVALIDATE METADATA statement works just like the Impala 1.0 REFRESH statement did, while the Impala 1.1 REFRESH is optimized for the common use case of adding new data files to an existing table, thus the table name argument is now required.
and related to working on existing tables:
The table name is a required parameter [for REFRESH]. To flush the metadata for all tables, use the INVALIDATE METADATA command.
Because REFRESH table_name only works for tables that the current Impala node is already aware of, when you create a new table in the Hive shell, enter INVALIDATE METADATA new_table before you can see the new table in impala-shell. Once the table is known by Impala, you can issue REFRESH table_name after you add data files for that table.
So it seems like it indeed stayed the same. I believe CDH 5.9 comes with Impala 2.7.
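As a concrete sketch of that workflow, with a hypothetical table name new_table:
-- in impala-shell, after the table was created from the Hive shell
INVALIDATE METADATA new_table;
-- later, after adding data files to the now-known table
REFRESH new_table;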
As per the Impala documentation on Invalidate Metadata and Refresh:
INVALIDATE METADATA Statement
The INVALIDATE METADATA statement marks the metadata for one or all tables as stale. The next time the Impala service performs a query against a table whose metadata is invalidated, Impala reloads the associated metadata before the query proceeds. As this is a very expensive operation compared to the incremental metadata update done by the REFRESH statement, when possible, prefer REFRESH rather than INVALIDATE METADATA.
INVALIDATE METADATA is required when the following changes are made outside of Impala, in Hive and other Hive clients such as SparkSQL:
Metadata of existing tables changes.
New tables are added, and Impala will use the tables.
The SERVER or DATABASE level Sentry privileges are changed.
Block metadata changes, but the files remain the same (HDFS rebalance).
UDF jars change.
Some tables are no longer queried, and you want to remove their metadata from the catalog and coordinator caches to reduce memory requirements.
No INVALIDATE METADATA is needed when the changes are made by impalad.
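For illustration, the statement can target a single table or, more expensively, the whole catalog; db and tbl below are hypothetical names:
-- mark one table's metadata as stale
INVALIDATE METADATA db.tbl;
-- mark metadata for all tables as stale (expensive on large catalogs)
INVALIDATE METADATA;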
REFRESH Statement
The REFRESH statement reloads the metadata for the table from the metastore database and does an incremental reload of the file and block metadata from the HDFS NameNode. REFRESH is used to avoid inconsistencies between Impala and external metadata sources, namely Hive Metastore (HMS) and NameNodes.
Usage notes:
The table name is a required parameter, and the table must already exist and be known to Impala.
Only the metadata for the specified table is reloaded.
Use the REFRESH statement to load the latest metastore metadata for a particular table after one of the following scenarios happens outside of Impala:
Deleting, adding, or modifying files.
For example, after loading new data files into the HDFS data directory for the table, appending to an existing HDFS file, inserting data from Hive via INSERT or LOAD DATA.
Deleting, adding, or modifying partitions.
For example, after issuing ALTER TABLE or another table-modifying SQL statement in Hive.
Invalidate metadata is used to refresh the metastore metadata and the data metadata (structure & data) = a complete flush.
Refresh is used to update only the data metadata = a lightweight flush.
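A small sketch of typical REFRESH usage, with hypothetical database/table/column names; the per-partition form is only available in newer Impala releases:
-- reload file and block metadata for one table after new files were added
REFRESH db.tbl;
-- newer Impala versions can restrict the reload to a single partition
REFRESH db.tbl PARTITION (load_date = '2020-01-01');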
I'm streaming data into BigQuery, but somehow the table I created disappears from the web UI while the dataset remains.
I set up the dataset to never expire; is there any configuration for the table itself?
I'd look into Mikhail's suggestion of the table's explicit expiration time. The tables could also be getting deleted via the tables.delete API, possibly by another user or process. You could check operations on your table in your project's audit logs and see if something is deleting them.
is there any configuration for the table itself?
The expiration that is set for a dataset is just the default expiration for newly created tables.
The table itself can be given an expiration using the expirationTime property.
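If a table is picking up an expiration, one way to clear it is to update expirationTime through the API; an equivalent Standard SQL DDL sketch, with hypothetical project, dataset, and table names, would be:
-- clears any per-table expiration so the table never expires
ALTER TABLE `my_project.my_dataset.my_table`
SET OPTIONS (expiration_timestamp = NULL);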
When I load a (csv) file into a Hive table I can load without overwriting, thus adding the new file to the table.
Internally the file is just copied to the correct folder in HDFS (e.g. user/warehouse/dbname/tablName/datafile1.csv), and probably some metadata is updated.
After a few loads I want to remove the contents of a specific file from the table.
I am sure I cannot simply delete the file, because the metadata would need to be adjusted as well. There must be some kind of built-in function for this.
How do I do that?
Why do you need that? I mean, Hive was developed to serve as a warehouse where you put lots and lots of data, not to delete data every now and then. Such a need suggests a poorly thought-out schema or a poor use of Hive, at least to me.
And if you really have this kind of need, why don't you create partitioned tables? If you need to delete some specific data, just delete that particular partition using either TRUNCATE or ALTER, as shown below.
TRUNCATE TABLE table_name [PARTITION partition_spec];
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec, PARTITION partition_spec,...
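A rough sketch of that approach, using hypothetical table and column names, where each load goes into its own partition:
CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (load_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- each csv load targets its own partition
LOAD DATA INPATH '/user/data/datafile1.csv'
INTO TABLE sales PARTITION (load_date = '2016-01-01');
-- later, remove exactly that load
ALTER TABLE sales DROP IF EXISTS PARTITION (load_date = '2016-01-01');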
If this feature is needed more than just once in a while, you can use MapR's distribution, which allows this kind of operation with no problem (even via NFS). Otherwise, if you don't have partitions, I think you'll have to create a new table using CTAS, filtering out the data from the bad file, or just copy the good files back to the OS with "hadoop fs -copyToLocal" and move them back into HDFS into a new table; see the sketch below.
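A hedged sketch of the CTAS route, assuming some column (here a hypothetical load_date) identifies the rows that came from the bad file:
-- keep everything except the rows from the bad load
CREATE TABLE tablName_clean AS
SELECT * FROM tablName
WHERE load_date <> '2016-01-01';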
I want to save some temporary data in memory, which should be removed after the server shuts down.
There are temporary tables in HSQLDB, but their data is removed immediately after the transaction is committed, which is too short-lived for me. On the other hand, a memory table keeps a script log file and restores the data when the server starts again. It takes time and space to maintain such script logs, which are useless for my situation.
What I need is a type of table where only the table structure is persisted on disk, while the data and the data operations are handled only in memory. Otherwise, why would I need an in-memory DB instead of MySQL?
Is there such a type of table in HSQLDB?
Thanks.
Create the file: database, then create the tables. Perform SHUTDOWN. Edit the .properties file for the database, add the setting below and save.
files_readonly=true
When you perform your tests with this database, no data is written to disk.
Alternatively, with the latest versions of HSQLDB 2.2.x, you can specify this property on the connection URL during the tests. For example:
jdbc:hsqldb:file:myfilepath;files_readonly=true