Can Sonatype Nexus publish indexes after a commit? - maven-2

We have a Nexus repository that holds our software (surprise ...). In our IDEs, we added the repositories so we can browse them visually.
Problem: the IDE uses the .index folder (with its gz file), which is currently updated only every night.
What I want is: if anyone adds an artifact, the .index should be updated automatically. Is that possible?
What I don't want is: to have the index update itself on a schedule every few minutes.

Exporting the index into the consumable binary format and chunking out the indexes isn't a trivial process, so it's not something we've set up to run every time a release or snapshot is published. It would also be wasteful, because it's unlikely that consumers of the index are updating every n seconds.
The internal indexes, however, are updated in real time, and searches in the Nexus UI will have the latest and greatest. We are working on integrating the IDE more tightly with Nexus's internal indexes to eliminate the publish/consume loop.

Related

How do I manually remove old release builds from an expired/deleted plan branch in Bamboo?

I use Bamboo regularly as a QA tester to deploy pull requests and feature branches/release branches, but I'm not a developer and have a layman's understanding of how it works.
Our Bamboo configuration is set up to remove inactive branches after a certain amount of time (2 weeks), which unfortunately happens pretty regularly with longer-term projects. (When that happens, I do know how to configure a new plan and run a new build.) Often, with these larger projects, they've been deployed manually many times over the course of the project, resulting in a large list of possible "release" versions when I go to "Promote existing release to this environment."
Now I have a brand-new build of a brand-new plan for a project I've been working on, off and on, for a year, and I would like to delete all these old builds (releases?) that show up in the dropdown when I just want to deploy the current version of the current build. But I can't figure out where to do it (neither can the devs I've asked, but it's NBD to them, whereas it's a constant annoyance for me).
All the advice I can find online says things like "all builds are automatically deleted when the branch expires ...." and that doesn't seem to be true, because these are definitely from old expired plan branches. They also explain how to delete things manually .... from an existing plan branch, which I don't have, because the older plan branches expired and were removed.
Am I using the wrong terminology here and these aren't "builds" and there's a separate way to delete them? Do we have a setup that's failing to delete them when it should? Do devs need to do something different with their branches? I obviously don't have access to global settings but I could put in a request if that's what needs to change.
To be clear, I'm talking about going to the deployment preview, selecting "Promote existing release to this environment," entering the Jira number/beginning of the branch name, and seeing a million of these (which all look identical because our branch names are hella long):
[deployment preview screenshot]
I have read through all the Bamboo documentation relating to plans, builds, branches, and deployment, and Googled various combinations of relevant keywords and haven't found a solution. I've also asked devs I work with and they don't know either.

When AEM is configured to use an S3 data store, will it make blue-green deployments faster?

Background
We know it's possible to set up a DevOps pipeline that deploys updates to AEM via a blue/green approach by using crx2oak to migrate the content from the old environment to the new one. Why is out of scope for this question.
The problem with this approach is that the content copy operation can take a significant amount of time as the amount of content in the JCR grows. Other ideas to mitigate this are appreciated.
We also know that AEM can have an S3 data store that off-loads the binary content into an S3 bucket, which would not need to be rebuilt during a blue/green deployment, as per:
https://helpx.adobe.com/experience-manager/6-3/sites/deploying/using/storage-elements-in-aem-6.html#OverviewofStorageinAEM6
What is unclear from Adobe's documentation is whether the same S3 bucket can be shared across AEM instances (i.e. blue/green instances). Maybe it's just my google fu that has failed...
Question(s)
When a new AEM instance is configured to use an S3 datastore that already has content in it from the old instance, and crx2oak is used to migrate the content, will the new instance be able to access the existing content?
Are there any articles/blogs that describe what the potential time savings of this approach would be?
Yes, I could do an experiment, and may do so in the future to answer my own question. For now I'm looking for information from anyone who has already done this. I'm an engineer, so I won't reinvent the wheel if someone else has already done so.
You can certainly share the same S3 bucket between instances - in fact, this is commonly used along with binary-less replication from author->publisher(s) and is a tried and true configuration.
It's even possible to share the same bucket between completely different environments (e.g. DEV/STAGE, or BLUE/GREEN in your case). The main "gotcha" to be aware of is DataStore Garbage Collection (DSGC): it's very possible that there will be blobs which are referenced by only some of the instances sharing the bucket, so this needs to be taken into account when purging unused blobs.
This is all part of the design though, and there is a flag designed specifically for this purpose which tells DSGC to only execute the first phase (the "mark" phase) of GC, and skip the 2nd "sweep" phase, until all instances have marked which blobs they wish to keep/discard. Once all instances have done so the sweep phase can be run to purge blobs not needed by any instances using the bucket.
For a more detailed explanation see the Oak docs:
https://jackrabbit.apache.org/oak/docs/plugins/blobstore.html#Shared_DataStore_Blob_Garbage_Collection_Since_1.2.0
I find it helps to understand that pretty much all of the datastore implementations are done such that blobs are stored according to their checksum, so the same file uploaded twice will only have one copy stored in the datastore, and there will be two segment store records referencing that same blob. In the same way, multiple AEM instances sharing the same bucket will be able to find a given blob regardless of which instance put it there in the first place.
You can see this in action easily with FileDataStore by finding a blob and sha256'ing it - e.g. (this example is on OS X; the checksum command on Linux/Windows will be slightly different):
$ shasum -a256 crx-quickstart/repository/datastore/0c/9e/40/0c9e405fc8d0f0405930cd0044611cfbf014938a1837ae0cfaa266d7732d1002
0c9e405fc8d0f0405930cd0044611cfbf014938a1837ae0cfaa266d7732d1002 crx-quickstart/repository/datastore/0c/9e/40/0c9e405fc8d0f0405930cd0044611cfbf014938a1837ae0cfaa266d7732d1002
There you can see that a) the filename is the checksum, and b) it's nested using the first three pairs of characters from that checksum, so you can locate the file just by knowing the hash. And if you store the same binary again, even if the name or JCR metadata is different, the blob referenced will be the same literal file on disk.
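As a quick illustration (a sketch only; the asset path is made up and this assumes the default FileDataStore layout with SHA-256 blob ids as shown above), you can predict where a given binary would land:
$ HASH=$(shasum -a 256 /tmp/some-asset.pdf | awk '{print $1}')
$ DS=crx-quickstart/repository/datastore
$ # the blob, if present, sits under the first three character pairs of its hash
$ ls -l "$DS/${HASH:0:2}/${HASH:2:2}/${HASH:4:2}/$HASH"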
From memory, the S3 datastore uses key prefixes rather than directory nesting because that performs better, but the principle is the same.
Finally, a couple of things to consider are:
1) S3 storage is relatively cheap (and practically unlimited) so there is an argument to be made that it's not as necessary to perform regular DSGC unless you're really trying to pinch pennies.
2) If you do run DSGC you need to think about how this will work with whatever backup strategy you're using for the AEM instances. For instance, if you roll back a segment store shortly after running DSGC you'll likely have to recover some of those purged blobs. You can use versioning and/or lifecycle rules to help with this, but it can add significant additional complexity and time to your restore process.
If you opt to simply skip DSGC and leave the blobs there indefinitely, it's a good idea to make sure the access key or IAM role AEM is using doesn't have the DeleteObject permission on the bucket, just to be sure a rogue GC process can't delete anything.
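For illustration only (the bucket and user names are invented, and in practice the S3 connector may need a few extra read-side permissions), an IAM policy along these lines grants read, write and list access but deliberately omits s3:DeleteObject:
$ cat > aem-datastore-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-aem-datastore/*" },
    { "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-aem-datastore" }
  ]
}
EOF
$ aws iam put-user-policy --user-name aem-datastore-user --policy-name aem-datastore-no-delete --policy-document file://aem-datastore-policy.json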
Hope this helps.
Edit
In all that I forgot to actually answer your question - yes it will save some time in cloning in most cases. You'll still need to sync the segment store (obviously) and there are various approaches for this. crx2oak is certainly one - you'll see in the documentation there are specific options for using it w/ S3 where you supply a configuration file (basically a serialised .config file like you'd use with Felix/OSGi).
You can also use something like rsync to simply copy the TAR files over (with at least the target AEM stopped; Oak is generally atomic, so a hot copy from the source can work in theory, but YMMV).
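For example, a rough sketch of that rsync approach (the host name and install path are assumptions; the TarMK files live under crx-quickstart/repository/segmentstore by default):
$ # stop the target (green) instance first, then pull the segment store from the source (blue)
$ rsync -av --delete blue-host:/opt/aem/crx-quickstart/repository/segmentstore/ /opt/aem/crx-quickstart/repository/segmentstore/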
Finally, you could obviously use MongoDB and cluster the node store that way, but all the usual cost/complexity/performance trade-offs of doing so apply.
Another interesting development on the horizon for blue/green-style deployments is the CompositeNodeStore - there is a good talk from the 2017 adaptTo() conference that covers this:
https://adapt.to/2017/en/schedule/zero-downtime-deployments-for-the-sling-based-apps-using-docker.html
An external datastore will help a lot, since most of the space is usually taken up by binary assets; the pure content typed in by real people is much smaller.
On my current project (quite small, but the proportions should be typical):
Repository: 4.8 GB total (4.1 GB segment store, 780 MB index)
File DataStore: 222 GB total
If you want to do it, I have the following remarks:
There are different datastores available. For testing I would start with the File DataStore.
The S3 DataStore only makes sense, in my view, if you are hosting on Amazon AWS anyway. Adobe Managed Services does this, so S3 makes sense for them - but even then only if you have more than 500 GB of assets.
If you use the green/blue approach, be careful with DataStore garbage collection (just run it manually). The shared DataStore is meant for several publishers that have the same content. As an example, you could have the following situation: your editors delete some assets, you run the DataStore GC, and finally you roll back your environment. That means the assets are still in the content repository, but the binaries have been cleaned out of the DataStore.
In order to use a shared file datastore, you need to do the following (a consolidated command sketch follows the list):
Unpack the Quickstart: java -jar AEM_6.3_Quickstart.jar -unpack
Create a directory for the file datastore (anywhere outside of the crx-quickstart folder)
Create a directory called install inside the extracted crx-quickstart folder
Create a file called org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.cfg inside this install folder
This file contains just one line: path=<path to file datastore> (see https://jackrabbit.apache.org/oak/docs/osgi_config.html)
Place a reference.key file inside the datastore directory. The first time, it will be created automatically. But if you always use the same key, the same hash values are used across all datastores in all your environments. This is also a prerequisite for a feature called "binary-less replication" (so a binary is only replicated the first time between author and publisher).
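Put together, the steps above look roughly like this (the /data/aem-datastore path is just an example):
$ java -jar AEM_6.3_Quickstart.jar -unpack
$ mkdir -p /data/aem-datastore
$ mkdir -p crx-quickstart/install
$ echo "path=/data/aem-datastore" > crx-quickstart/install/org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.cfg
$ # optionally copy an existing reference.key into /data/aem-datastore so all environments share the same key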
Kind regards,
Alex

Composite C1 - develop locally, sync to live site

I have a couple of Composite C1 CMS websites.
To edit them, I currently use the web-based CMS on the live site.
However, I would like to update the code & content locally in Visual Studio and then sync to the web. The problem is that if my local copy is older than the one online (e.g. a non-techy client has edited something on the live site) and I Web Deploy, it will go over the top of the newer file on the server.
I need a solution that works out which change is the newest. I can't find anything on Google or in the C1 docs.
How can I sync, preferably using Web Deploy? Do I need some kind of version control?
Is there a best practice for this? Editing the live site through the web interface seems a bit dicey and is slow.
The general answer to this type of scenario seems to be to use the Package Creator. With that you can develop locally, add the files you've changed to a package, and install that package on the live site. This solution does not cover all the parts of your question though, and has certain limitations:
You cannot selectively add content to a package. It's all pages or no pages.
Adding datatypes is easy, but updating them later requires you to delete the datatype (and data), and recreate the datatype.
In my experience, packages work well for incremental site updates if you limit the package content to front-end stuff, like CSS, images and such.
You say you need a solution that works out the newest changes - I believe the only solution to this is yourself, with the aid of some tooling. I don't think there's a silver bullet solution here.
Should you use a version control system? Yes! By all means. Even if you are not sharing your code with anyone, a VCS is a great way to get to know Composite C1 from a file-system perspective, as you can carefully track which files change on disk as you develop. This knowledge is crucial when you want to continuously add features to a website that is already alive and kicking - you need to know what to deploy, and what not to touch.
Make sure you read the docs on how Composite fits in VCS: http://docs.composite.net/Configuration/C1-and-Version-Control
I assume that your sites are using the XML data storage (if you were using the SQL Data Store, your content would not be overridden upon sync).
This means that your entire web application lives in one folder on disk on the web server, which can be an advantage here.
I'll try to outline a solution that could work for you, although I must stress that I've never tried this - I'm making it up as I type.
Let's say you're using git: download the site in its entirety from the production web server, and commit the whole damned thing* to your master branch.
Then you create a new feature branch from that commit and start making the changes you want to deploy later, carefully committing your work to the feature branch as you go along and making sure you only commit the changes that are needed for your feature to work.
Now you are ready to deploy: switch back to the master branch, download the entire site again, and commit it to master.
You then merge your feature branch into the master branch and have git do all the hard work of stitching your changes in with the changes from the live site. There are bound to be merge conflicts, and that is where you will have to jump in and decide for yourself what content needs to go live.
After this is done and tested, you can web deploy the site up to the production environment.
Changes to the live site might have occurred while you were merging, so consider closing the site, or parts of it, during this process.
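If it helps, the flow above might look roughly like this on the command line (branch and commit names are of course up to you):
$ git init && git add -A && git commit -m "snapshot of live site"
$ git checkout -b my-feature
$ # ...develop, committing only the files your feature needs...
$ git checkout master
$ # ...overwrite the working copy with a fresh download of the live site...
$ git add -A && git commit -m "snapshot of live site before deploy"
$ git merge my-feature
$ # resolve any conflicts by hand, deciding which content should go live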
If you are using the SQL Data Store, I suggest paying for a tool like Red Gate's SQL Compare and SQL Data Compare, or SQL Delta, to compare your dev database to the production database and hand-pick SQL scripts that can be applied to the production database along with your feature deployment.
* Do consider using a .gitignore file to avoid committing certain files - refer to the docs for more info.
I suppose you should use the Package Creator.
Also have a look here: http://docs.composite.net/Configuration/C1-and-Version-Control

Storing Drupal SQL in Git

I have a Drupal site, and I am storing the codebase in a Git repository. This seems to be working out well, but I'm also making changes to the database. I'm considering doing periodic dumps of the database and committing them to Git. I have a few questions about this.
If I overwrite the file, will Git think it is a brand-new file, or will it recognize that it is an altered version of the same file?
Will this potentially make my repo huge (the database is 16 MB)?
Can I zip this file, or will this mess Git up? The zipped version is only 3 MB.
Any other suggestions?
If you have enough space, a non-compressed dump in source control is pretty handy, because you can use a diff program to see which rows were added/modified/deleted.
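For example (a sketch assuming MySQL; the database name, user, and target path are placeholders), a dump with one INSERT per row and no timestamp keeps the diffs small and readable:
$ mysqldump --skip-extended-insert --skip-dump-date -u drupaluser -p drupaldb > db/dump.sql
$ git add db/dump.sql && git commit -m "DB snapshot"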
Another solution is to use the Features module, which is supposed to capture Drupal config in code. It stores this captured data as a feature module, which you can put into version control.
For my database applications, I store scripts of DDL statements (like CREATE TABLE) in some sort of version control system. These scripts sometimes include static "seed" data as well. All the version control systems I use are good at recognizing differences in these files, and they are much smaller than the full database with data.
For the dynamically-generated data, I store backups (e.g. from mysqldump) in an appropriate location (depending on the importance of the data, that may include offsite backups).
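If you want just the DDL under version control, as described above, a schema-only dump is one way to generate those scripts (again, names and paths are placeholders):
$ mysqldump --no-data --skip-dump-date -u drupaluser -p drupaldb > db/schema.sql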
1) It's all text, so Git will just see it as it would any other file.
2) No. Due to the above, it should add 16 MB to the repo (or less, thanks to Git's own compression). It won't store a full new copy every time, just the changes, so the repo will only grow by the size of the additions.
3) No, or Git won't be able to see the differences - Git does its own compression anyway.

How to maintain lucene indexes in azure cloud-app

I just started playing with the Azure Library for Lucene.NET (http://code.msdn.microsoft.com/AzureDirectory). Until now, I was using my own custom code for writing Lucene indexes to an Azure blob: I copied the blob to the local storage of the Azure web/worker role and read/wrote docs to the index there, with my own locking mechanism to make sure we don't have clashes between reads and writes to the blob. I am hoping the Azure Library will take care of these issues for me.
However, while trying out the test app, I tweaked the code to use the compound-file option, and that created a new file every time I wrote to the index. Now, my question is: if I have to maintain the index - i.e. keep a snapshot of the index file and use it if the main index gets corrupted - how do I go about doing this? Should I keep a backup of all the .cfs files that are created, or is handling only the latest one fine? Are there API calls to clean up the blob to keep only the latest file after each write to the index?
Thanks
Kapil
After I answered this, we ended up changing our search infrastructure and used Windows Azure Drive. We had a worker role which would mount a VHD using the Block Storage and host the Lucene.NET index on it. The code checked to make sure the VHD was mounted first and that the index directory existed. If the worker role fell over, the VHD would automatically dismount after 60 seconds, and a second worker role could pick it up.
We have since changed our infrastructure again and moved to Amazon with a Solr instance for search, but the VHD option worked well during development. It could have worked well in test and production, but requirements meant we needed to move to EC2.
I am using AzureDirectory for full-text indexing on Azure, and I am getting some odd results too... but hopefully this answer will be of some use to you...
Firstly, the compound-file option: from what I am reading and figuring out, the compound file is a single large file with all the index data inside. The alternative to this is having lots of smaller files (configured using the SetMaxMergeDocs(int) function of IndexWriter) written to storage. The problem with this is that once you get to lots of files (I foolishly set this to about 5000), it takes an age to download the indexes (on the Azure server it takes about a minute; on my dev box... well, it's been running for 20 minutes now and still not finished...).
As for backing up indexes, I have not come up against this yet, but given we have about 5 million records currently, and that will grow, I am wondering about this too. If you are using a single compound file, maybe downloading the files to a worker role, zipping them, and uploading them with today's date would work... If you have a smaller set of documents, you might get away with re-indexing the data if something goes wrong... but again, it depends on the number.