Why is REFRESH required in Impala if INVALIDATE METADATA can do the same thing? - impala

Using Hive, if I create a table and insert records into it, then to see that table in Impala I first have to run INVALIDATE METADATA, which makes the table's metadata available to Impala and also shows the newly added records.
So my question is: why do we still need to run REFRESH on the table?

INVALIDATE METADATA and REFRESH are counterparts:
INVALIDATE METADATA is an asynchronous operation that simply
discards the loaded metadata from the catalog and coordinator caches.
After that operation, the catalog and all the Impala coordinators
only know about the existence of databases and tables and nothing
more. Metadata loading for tables is triggered by any subsequent
queries.
REFRESH reloads the metadata synchronously. REFRESH is more
lightweight than doing a full metadata load after a table has been
invalidated. However, REFRESH cannot detect changes in block locations (which INVALIDATE METADATA can) triggered by operations like the HDFS balancer, and can therefore cause remote
reads during query execution, with negative performance implications.
Use REFRESH after invalidating a specific table to separate the
metadata load from the first query that's run against that table.
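As a rough sketch of that last pattern over JDBC (the URL scheme and port depend on which driver you use, and the coordinator host and table names are placeholders, not taken from the question):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class WarmImpalaTableMetadata {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; scheme/port depend on the JDBC driver in use.
        String url = "jdbc:impala://impala-coordinator:21050/default";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Drop only this table's cached metadata (cheap, asynchronous).
            stmt.execute("INVALIDATE METADATA my_db.my_table");
            // Load the metadata now, so the first real user query does not pay for it.
            stmt.execute("REFRESH my_db.my_table");
            // Subsequent queries against my_db.my_table start with warm metadata.
        }
    }
}
```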
====
INVALIDATE METADATA is required when the following changes are made outside of Impala, in Hive or another Hive client such as SparkSQL:
Metadata of existing tables changes.
New tables are added, and Impala will use the tables.
The SERVER or DATABASE level Sentry privileges are changed from
outside of Impala.
Block metadata changes, but the files remain the same (HDFS rebalance).
UDF jars change.
Some tables are no longer queried, and you want to remove their
metadata from the catalog and coordinator caches to reduce memory
requirements.
No INVALIDATE METADATA is needed when the changes are made by impalad.
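To connect this back to the original question, here is a hedged sketch of which statement fits which Hive-side change (the table names and the JDBC Connection are placeholders):

```java
import java.sql.Connection;
import java.sql.Statement;

public class ImpalaMetadataSync {

    // After Hive/SparkSQL appends data files to a table Impala already knows about:
    // a table-level REFRESH is enough and is much cheaper than a full invalidation.
    static void afterHiveDataLoad(Connection impalaConn, String table) throws Exception {
        try (Statement stmt = impalaConn.createStatement()) {
            stmt.execute("REFRESH " + table);
        }
    }

    // After Hive creates a new table, changes a table's metadata, or an HDFS
    // rebalance moves blocks: discard and rebuild the cached metadata.
    static void afterHiveDdlOrRebalance(Connection impalaConn, String table) throws Exception {
        try (Statement stmt = impalaConn.createStatement()) {
            stmt.execute("INVALIDATE METADATA " + table);
        }
    }

    // Changes made through Impala itself (impalad) need neither call.
}
```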
For even more details, see the Impala documentation on metadata management.

Related

Is there any way to auto-refresh a Redshift materialized view based on an external table?

I have a materialized view in Redshift that is based on data from an external Redshift Spectrum table, so it's impossible to use Redshift's auto-refresh feature.
I just don't want to refresh it by hand.
I don't care much about data consistency, so a delay of some time (up to an hour or more) is fine for me.
So, is there any way to update the materialized view automatically? Maybe it's possible to configure some TTL for it? Any other ideas?
Usually, the location of the Spectrum files is Amazon S3.
You can automate the update with a scheduled Lambda function that runs periodically.
You can also establish a trigger that fires every time a file is created in a part of a bucket. This trigger creates an event that initiates a Lambda function that issues the desired update.
I'd start simple if you can and work up to the Redshift Data API and Step Functions.
The first option, the scheduled Lambda, might be a better solution than invoking a Lambda on every object upload, especially if you are receiving lots of files continuously.
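As a rough sketch of the scheduled-Lambda route (assuming the AWS Lambda Java runtime and the AWS SDK v2 Redshift Data API client; the cluster, database, secret ARN, and view names are placeholders):

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import software.amazon.awssdk.services.redshiftdata.RedshiftDataClient;
import software.amazon.awssdk.services.redshiftdata.model.ExecuteStatementRequest;

// Triggered by an EventBridge schedule, e.g. rate(1 hour).
public class RefreshMaterializedViewHandler implements RequestHandler<Object, String> {

    private final RedshiftDataClient dataClient = RedshiftDataClient.create();

    @Override
    public String handleRequest(Object event, Context context) {
        // The Data API call is asynchronous; the Lambda only submits the refresh.
        dataClient.executeStatement(ExecuteStatementRequest.builder()
                .clusterIdentifier("my-redshift-cluster")                        // placeholder
                .database("analytics")                                           // placeholder
                .secretArn("arn:aws:secretsmanager:...:secret:redshift-creds")   // placeholder
                .sql("REFRESH MATERIALIZED VIEW spectrum_based_mv")              // placeholder view
                .build());
        return "refresh submitted";
    }
}
```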

Auto Syncing for Keys in Apache Geode

I have an Apache Geode setup connected with an external Postgres datasource. I have a scenario where I define an expiration time for a key; let's say after time T the key is going to expire. Is there a way so that the keys which are going to expire can make a call to the external datasource and update the value in case the value has changed? I want a kind of automatic syncing for the keys that are in Apache Geode. Is there any interface which I can implement to get the desired behavior?
I am not sure I fully understand your question. Are you saying that the values in the cache may possibly be more recent than what is currently stored in the database?
Whether you are using Look-Aside Caching, Inline Caching, or even Near Caching, Apache Geode combined with Spring would take care of ensuring the cache and database are kept in sync, to some extent depending on the caching pattern.
With Look-Aside Caching, if used properly, the database (i.e. primary System of Record (SOR), e.g. Postgres in your case) should always be the most current. (Look-Aside) Caching is secondary.
With Synchronous Inline Caching (using a CacheLoader/CacheWriter combination for Read/Write-Through) and in particular, with emphasis on CacheWriter, during updates (e.g. Region.put(key, value) cache operations), the DB is written to first, before the entry is stored (or overwritten) in the cache. If the DB write fails, then the cache entry is not written or updated. This is true each time a value for a key is updated. If the key has not been updated recently, then the database should reflect the most recent value. Once again, the database should always be the most current.
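If the goal is for an expired key to be re-fetched from Postgres on its next read, one interface you can implement is Geode's CacheLoader, which is invoked on a cache miss (for example, after the entry expired). A minimal read-through sketch (the JDBC URL, table, and String key/value types are assumptions):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.geode.cache.CacheLoader;
import org.apache.geode.cache.CacheLoaderException;
import org.apache.geode.cache.LoaderHelper;

// Registered on the Region; called on a cache miss, e.g. after the entry expired.
public class PostgresCacheLoader implements CacheLoader<String, String> {

    @Override
    public String load(LoaderHelper<String, String> helper) throws CacheLoaderException {
        String key = helper.getKey();
        // Placeholder connection details and schema.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://db-host:5432/appdb", "app", "secret");
             PreparedStatement ps = conn.prepareStatement(
                "SELECT value FROM kv_table WHERE key = ?")) {
            ps.setString(1, key);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("value") : null;
            }
        } catch (Exception e) {
            throw new CacheLoaderException("Failed to load " + key + " from Postgres", e);
        }
    }

    @Override
    public void close() {
        // nothing to release in this sketch
    }
}
```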
With Asynchronous Inline Caching (using AEQ + Listener, for Write-Behind), the updates for a cache entry are queued and asynchronously written to the DB. If an entry is updated, then Geode can guarantee that the value is eventually written to the underlying DB regardless of whether the key expires at some time later or not. You can persist and replay the queue in case of system failures, conflate events, and so on. In this case, the cache and DB are eventually consistent and it is assumed that you are aware of this, and this is acceptable for your application use case.
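For the write-behind variant (AEQ + listener), the listener interface is AsyncEventListener; a rough sketch that pushes queued cache updates into Postgres (connection details, table, and the upsert SQL are assumptions):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

import org.apache.geode.cache.asyncqueue.AsyncEvent;
import org.apache.geode.cache.asyncqueue.AsyncEventListener;

// Attached to an AsyncEventQueue that the Region writes into (write-behind).
public class PostgresWriteBehindListener implements AsyncEventListener {

    @Override
    public boolean processEvents(List<AsyncEvent> events) {
        // Upsert each queued cache update into the database in one batch.
        String sql = "INSERT INTO kv_table (key, value) VALUES (?, ?) "
                   + "ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value";
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://db-host:5432/appdb", "app", "secret");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            for (AsyncEvent event : events) {
                ps.setString(1, String.valueOf(event.getKey()));
                ps.setString(2, String.valueOf(event.getDeserializedValue()));
                ps.addBatch();
            }
            ps.executeBatch();
            return true;   // events are removed from the queue
        } catch (Exception e) {
            return false;  // events stay queued and are retried
        }
    }

    @Override
    public void close() {
        // nothing to release in this sketch
    }
}
```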
Of course, all of these caching patterns and scenarios I described above assume nothing else is modifying the SOR/database. If another external application or process is also modifying the database, separate from your Geode-based application, then it would be possible for Geode to become out-of-sync with the database and you would need to take steps to identify this situation. This is rather an issue for reads, not writes. Of course, you further need to make sure that stale cache entries do not subsequently overwrite the database on an update. This is easy enough to handle with optimistic locking. You could even trigger a cache entry remove on a DB update failure to have the cache refreshed on read.
Anyway, all of this is to say, if you applied one of the caching patterns above correctly, the value in the cache should already be reflected in the DB (or will be, in the asynchronous Write-Behind caching use case), even if the entry eventually expires.
Make sense?

Snowflake - Loading data from cloud storage

I have some data stored in an S3 bucket and I want to load it into one of my Snowflake DBs. Could you help me better understand the two following points, please:
From the documentation (https://docs.snowflake.com/en/user-guide/data-load-s3.html), I see it is better to first create an external stage before loading the data with the COPY INTO operation, but it is not mandatory.
==> What is the advantage/usage of creating this external stage, and what happens under the hood if you do not create it?
==> In the COPY INTO doc, it is said that the data must be staged beforehand. If the data is not staged, does Snowflake create a temporary stage?
If my S3 bucket is not in the same region as my Snowflake DB, is it still possible to load the data directly, or must one first transfer the data to another S3 bucket in the same region as the Snowflake DB?
I expect it is still possible but slower because of network transfer time?
Thanks in advance
The primary advantage of creating an external stage is the ability to tie a file format directly to the stage and not have to worry about defining it on every COPY INTO statement. You can also tie a connection object that contains all of your security information to the stage to make that transparent to your users. Lastly, if you have a ton of code that references the stage but you wind up moving your bucket, you won't need to update any of your code. This is nice for Dev-to-Prod migrations as well.
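A rough sketch of what that looks like through the Snowflake JDBC driver (the account, warehouse, stage, storage integration, table, and bucket names are all placeholders, and the CSV file format is just an example):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class LoadFromS3 {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "LOADER");        // placeholder credentials
        props.put("password", "********");
        props.put("db", "ANALYTICS");
        props.put("schema", "PUBLIC");
        props.put("warehouse", "LOAD_WH");

        try (Connection conn = DriverManager.getConnection(
                "jdbc:snowflake://myaccount.snowflakecomputing.com/", props);
             Statement stmt = conn.createStatement()) {

            // One-time setup: the stage bundles the S3 location, the credentials
            // (via a storage integration) and the file format into one object.
            stmt.execute(
                "CREATE STAGE IF NOT EXISTS raw_events_stage " +
                "  URL = 's3://my-bucket/events/' " +
                "  STORAGE_INTEGRATION = my_s3_integration " +
                "  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)");

            // Every load can now just reference the stage instead of repeating
            // the URL, credentials and format on each COPY INTO statement.
            stmt.execute("COPY INTO events FROM @raw_events_stage");
        }
    }
}
```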
Snowflake can load from any S3 bucket regardless of region. It might be a little bit slower, but not any slower than it'd be for you to copy it to a different bucket and then load to Snowflake. Just be aware that you might incur some egress charges from AWS for moving data across regions.

JBoss Data Grid for a clustered enterprise application - what is the most efficient way?

We have a clustered enterprise application deployed on JBoss EAP that uses JTA transactions and Hibernate for database operations.
To increase system performance we are planning to use JBoss Data Grid. This is how I plan to use the data grid:
I add/replace the object in the cache whenever it is inserted/updated in the database, using cache.put.
When an object is deleted from the database, it is deleted from the cache using cache.remove.
While retrieving, I first try to get the data from the cache using a key or query. If the data is not present, I load it from the database.
However, I have below questions on data grid:
To query objects we use Hibernate Criteria; however, the data grid uses its own query builder. Can we avoid writing separate queries for Hibernate and the data grid?
I want a list of objects matching a criteria to be returned. If one of the objects matching the criteria is evicted from the cache, is it reloaded automatically from the database?
If the transaction is rolled back, is it rolled back in the data grid cache as well?
Are there any examples which I can refer to for my implementation of the data grid?
Which is the better choice for my requirement: Infinispan as a second-level cache, or the data grid in library or remote mode?
Galder's comment is right, the best practice is using Infinispan as the second-level cache provider. Trying to implement it on your own is very prone to timing issues (you'd have stale/non-updated entries in the cache).
Regarding queries: with 2LC query caching on, the cache keeps a map of 'SQL query' -> 'list of results'. However, once you update any type that's used in a query, all such queries are invalidated (e.g. if the query lists people with age > 60, updating a newborn still invalidates that query). Therefore this should be on only when the queries prevail over updates.
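As a rough sketch of wiring this up (the region factory value is version-dependent and on JBoss EAP the Infinispan-backed factory is typically provided by the container, so treat the property values, persistence unit name, and entity as assumptions):

```java
import java.util.HashMap;
import java.util.Map;
import javax.persistence.Cacheable;
import javax.persistence.Entity;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Id;
import javax.persistence.Persistence;
import javax.persistence.TypedQuery;

import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

public class SecondLevelCacheSketch {

    @Entity
    @Cacheable                                          // entity goes into the 2LC
    @Cache(usage = CacheConcurrencyStrategy.READ_WRITE) // concurrency strategy chosen as an example
    public static class Person {
        @Id Long id;
        String name;
        int age;
    }

    public static void main(String[] args) {
        Map<String, Object> props = new HashMap<>();
        props.put("hibernate.cache.use_second_level_cache", "true");
        props.put("hibernate.cache.use_query_cache", "true");   // only worthwhile if queries prevail over updates
        // Version-dependent value; shown here as an assumption.
        props.put("hibernate.cache.region.factory_class", "infinispan");

        EntityManagerFactory emf = Persistence.createEntityManagerFactory("my-unit", props);

        TypedQuery<Person> query = emf.createEntityManager()
                .createQuery("select p from Person p where p.age > 60", Person.class);
        query.setHint("org.hibernate.cacheable", true); // result list goes into the query cache
        query.getResultList();
    }
}
```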
Infinispan has its own query support but this is not exposed when using it as 2LC provider. It is assumed that the cache will hold only a (most frequently accessed) subset of the entities in the database and therefore the results of such queries would not be correct.
If you want to go for Infinispan but keep the DB persistence, an option might be using JPA cache store (and indexing). Note though that updates to DB that don't go through Infinispan would not be reflected in the cache, and the indexing may lag a bit (since it's asynchronous). You can split your dataset and use JPA for one part and Infinispan + JPA cache store for the other, too.
A third option is using Hibernate Search, which keeps the data in database but index is in Lucene (possibly stored in Infinispan caches, too) and you don't use the Criteria API but Hibernate Search API.

standard protocol on inserting data and querying data at the same time

Every day we have an SSIS package running to import data into a database.
Sometimes people are querying the database at the same time.
The loading (data import) times out because there's a table lock on the specific table.
What is the standard protocol on inserting data and querying data at the same time?
First you need to figure out where those locks are coming from. Use the link to see if there are any locks.
How to: Determine Which Queries Are Holding Locks
If you have another process that holds a table lock, then there's not much you can do.
Are you sure the error is "not able to OBTAIN a table lock"? If so, look at changing your SSIS package so it does not take table locks.
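A small sketch of what that lock check can look like against the SQL Server DMVs (the connection string is a placeholder):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ShowTableLocks {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://db-host;databaseName=MyDb;user=monitor;password=secret";
        // List sessions currently holding or waiting on object-level (table) locks.
        String sql =
            "SELECT request_session_id, request_mode, request_status, " +
            "       OBJECT_NAME(resource_associated_entity_id) AS object_name " +
            "FROM sys.dm_tran_locks " +
            "WHERE resource_type = 'OBJECT'";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("session %d: %s lock (%s) on %s%n",
                        rs.getInt("request_session_id"),
                        rs.getString("request_mode"),
                        rs.getString("request_status"),
                        rs.getString("object_name"));
            }
        }
    }
}
```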
There are several strategies.
One approach is to design your ETL pipeline to minimize lock time. All the data is prepared in staging tables and then, when complete, is switched in using fast partition switch operations; see Transferring Data Efficiently by Using Partition Switching. This way the ETL blocks reads only for a very short duration. It also has the advantage that the reads see all the ETL data at once, not intermediate stages. The drawback is a more difficult implementation.
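A compressed sketch of the switch-in step (table, staging, and partition numbers are placeholders; a real implementation needs matching schemas, constraints, and indexes on both tables):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SwitchInEtlBatch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://db-host;databaseName=MyDb;user=etl;password=secret");
             Statement stmt = conn.createStatement()) {
            // 1. SSIS (or any bulk load) fills dbo.Sales_Staging while readers
            //    keep querying dbo.Sales undisturbed.

            // 2. The switch itself is a metadata-only operation, so the
            //    exclusive lock on dbo.Sales is held only very briefly.
            stmt.execute("ALTER TABLE dbo.Sales_Staging SWITCH TO dbo.Sales PARTITION 42");
        }
    }
}
```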
Another approach is to enable snapshot isolation and/or read committed snapshot in the database; see Row Versioning-based Isolation Levels in the Database Engine. This way reads no longer block behind the locks held by the ETL. The drawback is resource consumption: the hardware must be able to drive the additional load of row versioning.
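Enabling that is a one-time change per database; a sketch (the database name is a placeholder, and turning on READ_COMMITTED_SNAPSHOT needs a moment without competing active connections):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class EnableRowVersioning {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://db-host;databaseName=master;user=sa;password=secret");
             Statement stmt = conn.createStatement()) {
            // Readers under the default READ COMMITTED level now read row versions
            // from tempdb instead of blocking behind the ETL's write locks.
            stmt.execute("ALTER DATABASE MyDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE");
            // Optional: also allow explicit SNAPSHOT isolation transactions.
            stmt.execute("ALTER DATABASE MyDb SET ALLOW_SNAPSHOT_ISOLATION ON");
        }
    }
}
```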
Yet another approach is to move the data querying to a read-only standby server, e.g. using log shipping.