How to make aliases of apptainer container tags? - singularity-container

Apptainer (and Singularity) can interact with container registries, much like Docker can. Common practice for these registries is to have a (semantic version) tag for each image pushed and to maintain the tag "latest" so that it points to the most recent version.
In apptainer, the tag is part of the target URI in the push command:
apptainer push /path/to/mycontainer_v1.2.1.sif oras://registry.tld/foo/mycontainer:1.2.1
As near as I can tell, if I want my "latest" tag to be the same as 1.2.1, the only way to accomplish this is to upload the image twice, wasting both bandwidth for the upload and storage space in the registry, which may not be insignificant when images often run to several GB.
Is there a way to define tag aliases without uploading and storing multiple copies of the container image? If it's registry-dependent, I am specifically interested in solutions relevant to the GitLab container registry.


When AEM is configured to use a S3 data store will it make blue-green deployments faster?

We know it's possible to set up a DevOps pipeline that deploys updates to AEM via a blue/green approach by using crx2oak to migrate the content from the old to the new environment. Why we do this is out of scope for this question.
The problem with this approach is that the content copy operation can take a significant amount of time as the amount of content in the JCR grows. Other ideas to mitigate this are appreciated.
We also know that AEM can have a S3 datastore that off-loads the binary content into a S3 bucket which would not be re-built during blue/green deployment as per:
What is unclear from Adobe's documentation is whether the same S3 bucket can be shared across AEM instances (i.e. blue/green instances). Maybe it's just my google fu that has failed...
When a new AEM instance is configured to use a S3 datastore that already has content in it from the old instance, when crx2oak is used to migrate content, will the new instance be able to access the existing content?
Are there any articles/blogs that describe what the potential time savings of this approach would be?
Yes, I could do an experiment, and may do so in the future to answer my own question. I'm looking for information from anyone who has already done this. I'm an engineer, so I will not re-invent the wheel if someone else has already done so.
You can certainly share the same S3 bucket between instances - in fact, this is commonly used along with binary-less replication from author->publisher(s) and is a tried and true configuration.
It's even possible to share the same bucket between completely different environments (e.g. DEV/STAGE, or BLUE/GREEN in your case). The main "gotcha" to be aware of is with regard to DataStore Garbage Collection (DSGC) because it's very possible that there will be blobs which are referenced by only some of the instances sharing the bucket and so when purging unused blobs this needs to be taken into account.
This is all part of the design though, and there is a flag designed specifically for this purpose which tells DSGC to only execute the first phase (the "mark" phase) of GC, and skip the 2nd "sweep" phase, until all instances have marked which blobs they wish to keep/discard. Once all instances have done so the sweep phase can be run to purge blobs not needed by any instances using the bucket.
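To make the two-phase idea concrete, here is a minimal sketch in Python. It is not Oak's implementation; the function name, the instance names and the data shapes are all hypothetical. It only illustrates the rule described above: nothing is swept until every instance sharing the bucket has finished its mark phase, and a blob is only a sweep candidate if no instance marked it.

```python
def sweep_candidates(all_blobs, marks_by_instance, expected_instances):
    """Return the blob IDs that are safe to purge from a shared datastore.

    all_blobs          -- iterable of every blob ID currently in the store
    marks_by_instance  -- dict: instance name -> blob IDs that instance marked as in use
    expected_instances -- names of all instances sharing the store (hypothetical)
    """
    # Refuse to sweep until every instance has completed its mark phase.
    missing = set(expected_instances) - set(marks_by_instance)
    if missing:
        raise RuntimeError("mark phase not finished for: " + ", ".join(sorted(missing)))

    # A blob is kept if at least one instance still references it.
    in_use = set()
    for marked in marks_by_instance.values():
        in_use |= set(marked)

    return set(all_blobs) - in_use


blobs = {"a", "b", "c", "d"}
marks = {"blue": {"a", "b"}, "green": {"b", "c"}}
print(sweep_candidates(blobs, marks, ["blue", "green"]))
```

In this toy run only "d" is purged: "a" is needed by blue only and "c" by green only, which is exactly the case that a single-instance GC would get wrong.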
For a more detailed explanation see the Oak docs:
I find it helps to understand that pretty much all of the datastore implementations store blobs according to their checksum, so the same file uploaded twice will only have one copy stored in the datastore, and there will be two segment store records referencing that same blob. In the same way, multiple AEM instances sharing the same bucket will be able to find a given blob regardless of which instance put it there in the first place.
You can easily see this in action with FileDataStore by finding a blob and sha256'ing it - e.g. (this example is on OS X; the checksum command on Linux/Windows will be slightly different):
$ shasum -a256 crx-quickstart/repository/datastore/0c/9e/40/0c9e405fc8d0f0405930cd0044611cfbf014938a1837ae0cfaa266d7732d1002
0c9e405fc8d0f0405930cd0044611cfbf014938a1837ae0cfaa266d7732d1002 crx-quickstart/repository/datastore/0c/9e/40/0c9e405fc8d0f0405930cd0044611cfbf014938a1837ae0cfaa266d7732d1002
There you can see that a) the filename is the checksum, and b) it's nested using the first 3 pairs of characters from that checksum, so you can locate the file by just knowing the hash and if you store the same binary, even if the name or JCR metadata is different, the blob referenced will be the same literal file on disk.
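The same lookup can be sketched in a few lines of Python. This is only an illustration of the layout visible in the shasum example above (SHA-256 file name, nested under the first three pairs of hex characters); the function names are made up here, and it is not taken from the Oak source.

```python
import hashlib
from pathlib import Path


def blob_id(data: bytes) -> str:
    """Content address of a binary: the hex SHA-256 of its bytes."""
    return hashlib.sha256(data).hexdigest()


def blob_path(datastore_root: Path, digest: str) -> Path:
    """Where a blob with this digest would live, following the layout shown above."""
    return datastore_root / digest[0:2] / digest[2:4] / digest[4:6] / digest


digest = "0c9e405fc8d0f0405930cd0044611cfbf014938a1837ae0cfaa266d7732d1002"
print(blob_path(Path("crx-quickstart/repository/datastore"), digest))

# Two uploads of identical bytes map to the same ID, hence the same file.
print(blob_id(b"same bytes") == blob_id(b"same bytes"))
```

Because the path depends only on the content, the file name and JCR metadata of the asset play no part in where the blob ends up.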
From memory, the S3 datastore uses prefixes rather than directory nesting because this performs better, but the principle is the same.
Finally, a couple of things to consider are:
1) S3 storage is relatively cheap (and practically unlimited) so there is an argument to be made that it's not as necessary to perform regular DSGC unless you're really trying to pinch pennies.
2) If you do run DSGC you need to think about how this will work with whatever backup strategy you're using for the AEM instances. For instance, if you roll back a segment store shortly after running DSGC you'll likely have to recover some of those purged blobs. You can use versioning and/or lifecycle rules to help with this, but it can add significant additional complexity and time to your restore process.
If you opt to simply skip DSGC and leave the blobs there indefinitely, it's a good idea to make sure the access key or IAM role AEM is using doesn't have the DeleteObject permission for the bucket, just to be sure a rogue GC process can't delete anything.
Hope this helps.
In all that I forgot to actually answer your question - yes it will save some time in cloning in most cases. You'll still need to sync the segment store (obviously) and there are various approaches for this. crx2oak is certainly one - you'll see in the documentation there are specific options for using it w/ S3 where you supply a configuration file (basically a serialised .config file like you'd use with Felix/OSGi).
You can also use something like rsync to simply copy the TAR files over (while at least the target AEM is stopped. Oak is generally atomic so a hot copy from the source can work in theory, but YMMV).
Finally, you could obviously use Mongo and cluster the segment store that way, but all the usual cost/complexity/performance issues with doing so apply.
Another interesting development on the horizon for blue/green-type deployments is the CompositeNodeStore - there is a good talk from the 2017 adaptTo() conference about this:
An external datastore will help a lot, as usually the most space is used by binary assets. The pure content typed in by real people is much less.
On my current project (quite small, but relations should be normal):
Repository 4.8 GB total (4.1 GB Segment Store, 780 MB Index)
File DataStore 222 GB total
If you want to do it, I have the following remarks:
There are different datastores available. For testing I would start with the File DataStore.
The S3 DataStore only makes sense, in my view, if you are hosting on Amazon's AWS anyway. Adobe Managed Services does this, so S3 makes sense for them. But even there, only if you have more than 500 GB of assets.
If you use the blue/green approach, then be careful with the DataStore garbage collection (just do it manually). The shared DataStore is meant for several publishers that have the same content. As an example, you could have the following situation: your editors delete some assets, you run the DataStore GC, and finally you roll back your environment. That means the assets are still in the content repository, but the binaries have been cleaned out of the DataStore.
In order to use a shared file datastore, you need to do the following:
1) Unpack Quickstart: java -jar AEM_6.3_Quickstart.jar -unpack
2) Create a directory for the file datastore (anywhere outside of the crx-quickstart folder)
3) Create a directory named install inside the extracted crx-quickstart folder
4) Create a file called org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.cfg inside this install folder
5) This file contains just 1 line: path=<path to file datastore> (see
6) Place a reference.key file inside the datastore directory. The first time, it will be created automatically. But if you always use the same key, the same hash values are used in all datastores across all your environments. This is also a prerequisite for a feature called "binary-less replication" (so a binary would only be replicated the first time between author and publisher)

Change resolutions of image files stored on S3 server

Is there a way to run ImageMagick or some other tool on S3 servers to resize the images?
The way I know is to first download all the image files to my machine, convert them, and then re-upload them to S3. The problem is that there are more than 10000 files. I don't want to download all the files to my local machine.
Is there a way to convert them on the S3 server itself?
look at it:
It is a library providing some features for s3 uploading including resizing as you want
Another option is NOT to change the resolution, but to use a service that can convert the images on-the-fly when they are accessed, such as:
Also check out the following article on Amazon's compute blog. I found myself here because I had the same question. I think I'm going to implement this in Lambda so I can just specify the size, and see if that helps. My problem is that I have image files on S3 that are 2 MB. I don't want them at full resolution because I have an app that retrieves them, and it sometimes takes a while for a phone to pull down a 2 MB image. But I don't mind storing them at full resolution if I can get a different size just by specifying it in the URL. Easy!
S3 does not, alone, enable arbitrary compute (such as resizing) on the data.
I would suggest looking into AWS Lambda (available in the AWS console), which will allow you to set up a little program (which they call a Lambda) to run when certain events occur in an S3 bucket. You don't need to set up a VM; you only need to specify a few files, with a particular entry point. The program can be written in a few languages, namely Node.js, Python and Java. You'd be able to do it all from the console's web GUI.
Usually those are set up for computing things on new files being uploaded. To trigger the program for files that are already in place on S3, you have to "force" S3 to emit one of the events you can hook into for the files you already have. The list is here. Forcing an S3 copy (copy A to B, delete B), an S3 rename operation (rename A to A.tmp, rename A.tmp to A), or the creation of new S3 objects would all work. You essentially just poke your existing files in a way that causes your Lambda to fire. You may also invoke your Lambda manually.
This example shows how to automatically generate a thumbnail out of an image on S3, which you could adapt to your resizing needs and reuse to create your Lambda:
Also, here is the walkthrough on how to configure your lambda with certain S3 events:
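As a rough sketch of what the entry point of such a Lambda might look like (separate from the linked example and walkthrough), the Python below only covers the parts that need no image library: reading the bucket and key out of the S3 event, and working out the target size. The "resized/" prefix, the 1024 pixel limit and the helper names are assumptions made for illustration; the actual download, resize and upload would need an image library and the AWS SDK, which are left out here.

```python
from urllib.parse import unquote_plus


def objects_from_event(event):
    """Return (bucket, key) pairs for every record in an S3 notification event."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in the event, with '+' standing for a space.
        key = unquote_plus(s3["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def fit_within(width, height, max_side):
    """Scale (width, height) down so the longest side is max_side, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    return (max(1, round(width * max_side / longest)),
            max(1, round(height * max_side / longest)))


def lambda_handler(event, context):
    """Entry point: list what would be resized (the resize itself is omitted here)."""
    planned = []
    for bucket, key in objects_from_event(event):
        # Hypothetical naming scheme: write results under a separate prefix.
        planned.append({"bucket": bucket, "source": key, "target": "resized/" + key})
    return planned


sample = {"Records": [{"s3": {"bucket": {"name": "photos"},
                              "object": {"key": "holiday+pics/img%201.jpg"}}}]}
print(lambda_handler(sample, None))
print(fit_within(4000, 3000, 1024))
```

If the resized copies are written back to the same bucket, the trigger should be limited so that it does not fire on the output prefix, otherwise the function would keep invoking itself.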

Is there an automated way to push all my javascript/css/images to s3 everytime I do a website push?

So I am in the process of moving all the thumbnails of my major sites to S3 and now I am thinking about how I can consistently put all my CSS/JS/images that power the actual sites to it. It's easy enough to upload everything the first time but I am trying to think of a way to somehow automate the process everytime I push out to production.
Does anyone have any clever ways of doing this?
I used to use s3sync to compare and update the assets just before uploading the site files, using a bash script to iterate through my files.
This works well, but when the number of files to compare gets big (let's say thousands), the process starts being really slow. If you have a small architecture (in terms of assets), this would do the trick.
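The compare step can be sketched roughly as follows in Python. This is not how s3sync works internally; it is a minimal illustration of the idea, assuming you already have a mapping of remote key to MD5 digest (for simple single-part uploads the S3 ETag is typically that MD5, but treat this as an assumption to verify for your bucket). The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def local_digests(root):
    """Map each file under root (as a '/'-separated relative path) to its MD5 hex digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.md5(path.read_bytes()).hexdigest()
    return digests


def files_to_upload(local, remote):
    """Names that are new locally or whose content differs from the remote copy."""
    return sorted(name for name, digest in local.items()
                  if remote.get(name) != digest)
```

Calling files_to_upload(local_digests("public"), remote) with a hypothetical "public" asset directory would return only the new or changed files, so each deploy uploads a handful of assets instead of everything. Hashing thousands of local files still takes time, which matches the slowdown described above.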
To make this better, I would recommend Capistrano or some other tool that helps you deploy. This way you can run it all at once:
upload assets
deploy your files
On the other hand, you could take a look at CloudFront (Amazon's CDN) and set it up using an origin. This way you don't need to worry about uploading the files to S3, since they will be pulled automatically on demand. The downside of this approach is caching: if you need to update a file and keep the same name (i.e. expire the object), you can do this in CloudFront, but you will need a script to do the task.
Depending on the traffic (and other factors, of course), one path or the other will fit best.

Storing Drupal SQL in Git

I have a drupal site, and I am storing the codebase in a git repository. This seems to be working out well, but I'm also making changes to the database. I'm considering doing periodic dumps of the database and committing to git. I had a few questions about this.
If I overwrite the file, will Git think it is a brand new file, or will it recognize that it is an altered version of the same file?
Will this potentially make my repo huge? (The database is 16 MB.)
Can I zip this file, or will this mess Git up? The zipped version is only 3 MB.
Any other suggestions?
If you have enough space, a non-compressed dump in source control is pretty handy because you can compare using a diff program what rows were added/modified/deleted.
Another solution is to use the features module which is supposed to capture drupal config in code. It stores this captured data as a feature module which you can put into version control.
For my database applications, I store scripts of DDL statements (like CREATE TABLE) in some sort of version control system. These scripts sometimes include static "seed" data as well. All the version control systems I use are good at recognizing differences in these files, and they are much smaller than the full database with data.
For the dynamically-generated data, I store backups (e.g. from mysqldump) in an appropriate location (depending on the importance of the data, that may include offsite backups).
1) It's all text, so Git will just see it as it would any other file.
2) No. Because it is text, the first commit should add 16 MB to the repo (or less, due to Git's own compression). After that, it won't add a whole new file every time, just the changes, so the repo grows by roughly the size of the additions.
3) No, don't zip it, or Git won't be able to see the differences - Git does its own compression anyway.

iPad - how should I distribute offline web content for use by a UIWebView in application?

I'm building an application that needs to download web content for offline viewing on an iPad. At present I'm loading some web content from the web for test purposes and displaying this with a UIWebView. Implementing that was simple enough. Now I need to make some modifications to support offline content. Eventually that offline content would be downloaded in user selectable bundles.
As I see it I have a number of options but I may have missed some:
Pack content in a ZIP (or other archive) file and unpack the content when it is downloaded to the iPad.
Put the content in a SQLite database. This seems to require some 3rd party libs like FMDB.
Use Core Data. From what I understand this supports a number of storage formats including SQLite.
Use the filesystem and download each required file individually. OK, not really a bundle but maybe this is the best option?
What are the storage limitations and performance limitations for each of these methods? And is there an overall storage limit per iPad app?
If I'm going to have the user navigate through the downloaded content, what option is easier to code up?
It would seem like spinning up a local web server would be one of the most efficient ways to handle the runtime aspects of displaying the content. Are there any open source examples of this which load from a bundle like options 1-3?
The other side of this is the content creation and it seems like zipping up the content (option 1) is the simplest from this angle. The other options would appear to require creation of tools to support the content creator.
If you have control over the content, I'd recommend a mix of the first and the third option. If the content is created by you (like levels, etc.), then simply store it on the server, download a zip and store it locally. Use Core Data to store an index of the things you've downloaded, like the path of the folder each item is stored in and its name/origin/etc., but not the raw data. Databases are not meant to hold massive amounts of raw content, but rather structured data. And even if they can -- I'd not do so.
For your considerations:
Disk space is the only limit I know of on the iPad. However, databases tend to get slower if they grow too large. If you barely scan through the data, use the file system directly -- it may prove faster and cheaper.
The index in Core Data could store all relevant data. You will have very easy and very quick access. Opening a piece of content will load it from the file system, which is quick, cheap and doesn't strain the index.
Why would you do so? Redirecting your WebView to a file:// URL will have the same effect, won't it?
Should be answered by now.
If you don't have control, then use the same approach as above but download each file separately, as suggested in option four. After unzipping, both cases are basically the same.
Please get back if you have questions.
You could create an XML file for each bundle, containing the path to each file in the bundle, and place it in a folder common to all bundles. When downloading, download and parse the XML first, then download each resource one by one. This will spare you the overhead of zipping and unzipping the content. Create a folder for each bundle locally and recreate the folder structure of the bundle there. This way the content will work online and offline without changes.
With a little effort, you could even keep track of file versions by including version numbers in the XML file for each resource, so that if your content has been partially updated, only the files with changed version numbers have to be downloaded again.
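The version check could look roughly like the sketch below. It is written in Python for brevity rather than in the app's own language, and the manifest layout (a bundle element containing resource elements with path and version attributes) is invented for illustration; any format that lists a path and a version per file would work the same way.

```python
import xml.etree.ElementTree as ET

# Hypothetical manifest format: one <resource> per file in the bundle.
MANIFEST = """<bundle name="intro">
  <resource path="index.html" version="3"/>
  <resource path="css/site.css" version="1"/>
  <resource path="img/logo.png" version="2"/>
</bundle>"""


def parse_manifest(xml_text):
    """Map each resource path to its integer version number."""
    root = ET.fromstring(xml_text)
    return {res.get("path"): int(res.get("version"))
            for res in root.iter("resource")}


def resources_to_download(remote, local):
    """Paths that are missing locally or whose version differs from the manifest."""
    return sorted(path for path, version in remote.items()
                  if local.get(path) != version)


# Versions recorded on the device after the previous download.
already_downloaded = {"index.html": 2, "css/site.css": 1}
print(resources_to_download(parse_manifest(MANIFEST), already_downloaded))
```

Here only index.html (version changed) and img/logo.png (not yet present) would be fetched; css/site.css is left alone.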