Foundry Data Storage Optimization

Hi, I have a general question about pipeline optimization in order to reduce storage usage.
Does deleting trashed datasets help reduce disk storage? For example, removing obsolete datasets: a) based on business knowledge and utilization, and b) datasets already in the trash.
Also, we'd like to manage the copies of datasets that are stored each time a schedule runs. We believe that if we ever had to fall back to a previous version, we would only need to reference the most recent one, rather than keeping multiple copies.
Does this affect storage, and is there a way to configure this behavior?

Deleting trashed datasets will, in typical setups, result in their underlying files being deleted, but a larger driver of storage consumption is usually the set of previous dataset views that are kept by default.
You can control how long these files and views are kept using the Foundry Retention service. I'd recommend consulting the platform documentation and your support team when configuring this service.
Retention will identify and mark the files matching your configuration for deletion and periodically delete them, thus reducing your storage consumption.

Related

Cross-region movement - Transferring BigQuery and GCS data from US to EU locations

For compliance requirements, we would like to move all our BigQuery data and GCS data from the US region to the EU region.
From my understanding, a multi-region location is either within the US or within the EU; there is no cross-region option as such.
Question 1: In order to move the data from the US to the EU or vice versa, our understanding is that we need to explicitly move the data using a storage transfer service. Is there a cost associated with this movement even though it stays within Google Cloud?
Question 2: We are also considering maintaining copies in both locations. In that case, is there a provision for cross-region replication? If so, what would the associated cost be?
Question 1:
You are moving data from one part of the world to another, so yes, you will pay the egress cost of the source location.
Sadly, as of today (November 28th, 2023), I can't commit 100% to that cost. I reached out to Google Cloud about a very similar question, and my Google Cloud contact told me that the pricing page was out of date: the Cloud Storage egress cost should apply (instead of the Compute Engine networking egress cost shown in the documentation today).
Question 2:
You copy the data, so you end up with the volume of data duplicated across two datasets, and your storage cost is duplicated as well.
Every time you want to sync the data, you perform a copy. It is only a copy, not a smart delta update, so be careful if you update the data directly in the target dataset: a new copy will overwrite it!
Instead, use the target dataset as a base to query from, and duplicate (again) the data into an independent dataset, where you can add your region-specific data.
According to the docs, once the dataset is created, the location cannot be changed, but you can copy the dataset to a different location, or manually move (recreate) the dataset in a different location.
The easier approach is copying; you can learn more about the requirements, quotas, and limitations here: https://cloud.google.com/bigquery/docs/copying-datasets
So:
There is no need for the transfer service; you can copy datasets to a different location.
There is no mechanism for automatic replication across regions. Even a disaster recovery policy will require cross-region dataset copies.
BigQuery does not automatically provide a backup or replica of your data in another geographic region. You can create cross-region dataset copies to enhance your disaster recovery strategy.
https://cloud.google.com/bigquery/docs/availability#:%7E:text=cross%2Dregion%20dataset%20copies
So in both cases you need to work with dataset copies and deal with data freshness in the second scenario.
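To make the dataset-copy option concrete, here is a minimal sketch of scheduling a cross-region dataset copy with the BigQuery Data Transfer Service Java client (the built-in cross_region_copy data source). The project IDs, dataset IDs, display name, and schedule are placeholder assumptions; check the copying-datasets page linked above for the exact requirements and quotas.

```java
import com.google.cloud.bigquery.datatransfer.v1.CreateTransferConfigRequest;
import com.google.cloud.bigquery.datatransfer.v1.DataTransferServiceClient;
import com.google.cloud.bigquery.datatransfer.v1.ProjectName;
import com.google.cloud.bigquery.datatransfer.v1.TransferConfig;
import com.google.protobuf.Struct;
import com.google.protobuf.Value;

public class CrossRegionDatasetCopy {
  public static void main(String[] args) throws Exception {
    // Placeholder IDs -- replace with your own projects and datasets.
    String destinationProjectId = "my-eu-project";
    String destinationDatasetId = "my_dataset_eu"; // must already exist in the EU location
    String sourceProjectId = "my-us-project";
    String sourceDatasetId = "my_dataset_us";

    // Parameters understood by the built-in "cross_region_copy" data source.
    Struct params = Struct.newBuilder()
        .putFields("source_project_id", Value.newBuilder().setStringValue(sourceProjectId).build())
        .putFields("source_dataset_id", Value.newBuilder().setStringValue(sourceDatasetId).build())
        .build();

    TransferConfig transferConfig = TransferConfig.newBuilder()
        .setDestinationDatasetId(destinationDatasetId)
        .setDisplayName("US to EU dataset copy")
        .setDataSourceId("cross_region_copy")
        .setParams(params)
        .setSchedule("every 24 hours") // rerun periodically to refresh the EU copy
        .build();

    try (DataTransferServiceClient client = DataTransferServiceClient.create()) {
      CreateTransferConfigRequest request = CreateTransferConfigRequest.newBuilder()
          .setParent(ProjectName.of(destinationProjectId).toString())
          .setTransferConfig(transferConfig)
          .build();
      TransferConfig created = client.createTransferConfig(request);
      System.out.println("Created dataset copy config: " + created.getName());
    }
  }
}
```

The schedule is also the knob for data freshness if you decide to keep copies in both regions.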

On Google Compute Engine (GCE), where are snapshots stored?

I've made two snapshots using the GCE console. I can see them there on the console but cannot find them on my disks. Where are they stored? If something should corrupt one of my persistent disks, will the snapshots still be available? If they're not stored on the persistent disk, will I be charged extra for snapshot storage?
GCE has added a new level of abstraction: disks are separate from the VM instance. This allows you to attach a disk to several instances or restore snapshots to other VMs.
In case your VM or disk becomes corrupt, the snapshots are safely stored elsewhere. As for additional costs, keep in mind that snapshots store only the data that has changed since the last snapshot, so the space needed for 7 snapshots is often no more than 30% more than the space for one snapshot. You will be charged for the space they use, but the costs are quite low from what I observed (I was charged $0.09 for a 3.5 GB snapshot over one month).
The snapshots are stored separately on Google's servers; they are not attached to or part of your VM. You can create a new disk from an existing snapshot, but Google manages the internal storage and format of the snapshots.

Is Infinispan an improvement of JBoss Cache?

According to this link, which belongs to the JBoss documentation, I understood that Infinispan is a better product than JBoss Cache and a kind of improvement of it, which is why they recommend migrating from JBoss Cache to Infinispan, which is also supported by JBoss. Am I right in what I understood? Otherwise, what are the differences?
One more question: talking about replication and distribution, can either of them be better than the other depending on the need?
Thank you
Question:
Talking about replication and distribution, can any one of them be better than the other according to the need?
Answer:
I am taking this reference directly from Clustering modes - Infinispan:
Distributed:
Number of copies represents the tradeoff between performance and durability of data
The more copies you maintain, the lower performance will be, but also the lower the risk of losing data due to server outages
Uses a consistent hashing algorithm to determine where in the cluster entries should be stored
No need to replicate the data on every node, which takes more time than just communicating the hash
Best suited when the number of nodes is high
Best suited when the amount of data stored in the cache is large
Replicated:
Entries added to any of these cache instances will be replicated to all other cache instances in the cluster
This clustered mode provides a quick and easy way to share state across a cluster
Replication practically only performs well in small clusters (under 10 servers), due to the number of replication messages that need to happen as the cluster size increases
Practical Experience:
We are using the Infinispan cache in our live application running on a JBoss server with 8 nodes. Initially we used a replicated cache, but it took much longer to respond due to the large size of the data. Eventually we switched back to distributed mode, and now it is working fine.
Use a replicated or distributed cache only for data specific to a user session. If the data is common regardless of the user, prefer a local cache that is created separately on each node. A minimal configuration sketch for the two clustered modes is shown below.
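As a reference point for the two modes discussed above, here is a minimal programmatic sketch using Infinispan's ConfigurationBuilder API. The cache names and the numOwners value are arbitrary choices for illustration, and the exact API may vary slightly between Infinispan versions.

```java
import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class ClusteredCacheModes {
  public static void main(String[] args) {
    // Clustered (JGroups-based) cache manager.
    DefaultCacheManager manager =
        new DefaultCacheManager(GlobalConfigurationBuilder.defaultClusteredBuilder().build());

    // Distributed mode: each entry is stored on numOwners nodes, located via consistent hashing.
    ConfigurationBuilder distributed = new ConfigurationBuilder();
    distributed.clustering().cacheMode(CacheMode.DIST_SYNC).hash().numOwners(2);
    manager.defineConfiguration("dist-cache", distributed.build());

    // Replicated mode: every entry is copied to every node -- fine for small clusters.
    ConfigurationBuilder replicated = new ConfigurationBuilder();
    replicated.clustering().cacheMode(CacheMode.REPL_SYNC);
    manager.defineConfiguration("repl-cache", replicated.build());

    Cache<String, String> cache = manager.getCache("dist-cache");
    cache.put("key", "value");

    manager.stop();
  }
}
```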

RavenDB disk storage

I have a requirement to keep the RavenDB database running when the disk for the main database and index storage is full. I know I can provide a separate drive for index storage with the config option Raven/IndexStoragePath.
But I need to design for the corner case where this disk is full. What is the usual pattern used in this situation? One way is to halt all access while shutting the service down, updating the config file programmatically, and then starting the service again, but that is a bit risky.
I am aware of sharding, and this question is not related to that; assume that sharding is enabled, I have multiple shards, and I want to increase storage for each shard by adding a new drive to each. Is there an elegant solution to this?
user544550,
In a disk full scenario, RavenDB will continue to operate, but will refuse to accept further writes.
Indexing will fail as well and eventually mark the indexes as permanently failing.
What is your actual scenario?
Note that in RavenDB, indexes tend to be significantly smaller than the actual data size, so the major cause for disk space utilization is actually the main database, not the indexes.
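Not RavenDB-specific, but since the question is about designing for the disk-full corner case: a common mitigation is to watch free space on the database and index volumes and alert (or switch the application to read-only) before the server starts refusing writes. Below is a minimal sketch using only standard Java NIO calls; the paths and threshold are hypothetical placeholders.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DiskSpaceWatcher {
  // Hypothetical volumes holding the main database and the index storage path.
  private static final Path[] VOLUMES = {
      Paths.get("D:\\RavenData"),
      Paths.get("E:\\RavenIndexes")
  };
  private static final double WARN_THRESHOLD = 0.10; // warn below 10% free space

  public static void main(String[] args) throws IOException {
    for (Path volume : VOLUMES) {
      FileStore store = Files.getFileStore(volume);
      double freeRatio = (double) store.getUsableSpace() / store.getTotalSpace();
      if (freeRatio < WARN_THRESHOLD) {
        // In a real setup this would page someone or flip the app into read-only mode
        // before the database itself starts rejecting writes.
        System.out.printf("WARNING: %s has only %.1f%% free space%n", volume, freeRatio * 100);
      }
    }
  }
}
```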

Distributed datastore

We're trying to add some kind of persistence in our app.
The app generates about 250 entries per second. Each of these entries belongs to one of 2M files. For each file, we want to keep the last 10 entries, so we can look them up later.
The way our client application works (a sketch follows below):
it gets a stream of all the data
it fetches the right file (GET)
it adds the new content
it saves the file back (PUT)
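A rough sketch of that read-modify-write loop, assuming a hypothetical BlobStore interface that stands in for whatever backend is chosen (S3, Riak, etc.):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical minimal interface over the backing store (S3, Riak, ...).
interface BlobStore {
  List<String> get(String fileKey);            // fetch the current entries for a file
  void put(String fileKey, List<String> data); // write them back
}

class EntryAppender {
  private static final int MAX_ENTRIES = 10; // keep only the last 10 entries per file
  private final BlobStore store;

  EntryAppender(BlobStore store) {
    this.store = store;
  }

  // Called ~250 times per second as entries arrive on the stream.
  void append(String fileKey, String newEntry) {
    Deque<String> entries = new ArrayDeque<>(store.get(fileKey)); // GET
    entries.addLast(newEntry);                                    // add the new content
    while (entries.size() > MAX_ENTRIES) {
      entries.removeFirst();                                      // drop the oldest entries
    }
    store.put(fileKey, List.copyOf(entries));                     // PUT
  }
}
```

Every incoming entry costs one GET and one PUT against the backing store.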
We're looking for an efficient way to store this data that can scale horizontally as the amount of data we're getting is doubling every few weeks.
We initially looked at S3. It works fine, but becomes very expensive very fast (>$1000 monthly just in PUT operations! At 250 entries per second, that is roughly 650 million PUT requests per month.)
We then gave Riak a shot, but it seems we can't get more than 60 writes/sec on each node, which is very, very slow.
Any other solution out there?
There are lots of knobs you can turn in Riak - ask the mailing list if you haven't already and we'll figure out a sane configuration for you. 60 writes/sec is not within the norm.
See: http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
What about Hadoop's HDFS spread over Amazon EC2 instances? I know each instance has a good amount of storage space, and you don't have to pay for put/get, only the inbound transfer.
I would suggest looking at CloudIQ Storage from Appistry. It's a fully distributed file store, accessible via a REST-based API, and it can run on commodity hardware. You can define the number of copies retained on a file-by-file basis. It supports an eventually consistent model, so you can balance file consistency with performance.