StormCrawler: Injecting new URLs to crawl without restarting the topology

Is there any way to inject new URLs to crawl from the command line, without stopping the topology and editing the relevant files? I want to do this with Elasticsearch as the indexer.

It depends on what you use as a backend for storing the status of the URLs. If the URLs are stored in Elasticsearch in the status index, you won't need to restart the crawl topology. You can use the injector topology separately in local mode to inject the new URLs into the status index.
This is also the case with the SOLR or SQL modules, but not with MemorySpout + MemoryStatusUpdater, as that lives only within the JVM and nowhere else.
Which spout do you use?
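For example, with the Elasticsearch module you can run the injector topology in local mode against your seed file. The jar name, Flux file and sleep value below are assumptions based on an archetype-generated project, so adapt them to your own build (the Flux definition points the FileSpout at your seed file):

storm jar target/mycrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-injector.flux --sleep 60000

The injector reads the seed URLs (typically one URL per line, optionally followed by key=value metadata) and writes them into the status index, where the spout of the already running crawl topology will pick them up.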

Related

Overriding an httpd/Apache upstream proxy HTTP status code with another

I have some React code (written by someone else) that needs to be served. The preferred method is via a Google Cloud Storage bucket, fronted by their Cloud CDN, and this works. However, due to some quirks in the code, there is a requirement to override 404s with 200s and serve content from the homepage instead (i.e. if a request would return a 404, don't serve a 404; serve the content of the homepage and return a 200 instead).
(If anyone is interested, this override currently is implemented in CloudFront on AWS. Google CDN does not provide this functionality yet)
So, if the code is served at "www.mysite.com/app/" and someone hits "www.mysite.com/app/not-here" (which would return a 404), what should happen is that the response should NOT be 404, but a 200 with the content being served from index.html instead.
I was able to get this working by bundling all the code inside a docker container and then using the solution here. However, this setup means if we have a code change, all the running containers need to be restarted, and the customer expects zero downtime, hence the bucket solution.
So I now need to do the same thing but with the files being proxied in (with the upstream being the CDN).
I cannot use the original solution since the files are no longer local, and httpd can't check for existence of something that is not local.
I've tried things like ProxyErrorOverride and ErrorDocument, and managed to get it to redirect, but that is not what is needed.
Does anyone know how/if this can be done?
If the question is how to catch, with httpd/Apache, the 404 error returned by Cloud Storage when a file is missing: I don't know.
However, I don't think that is the best solution. Serving files directly from Cloud Storage is convenient but not robust for production.
Imagine you deploy several broken files in succession: how do you roll back to a stable state?
The best approach is to package each code release as an atomic unit, a container for instance. Each version lives in a different container, which makes a rollback easier and more consistent.
Now, your "container restart" issue. I don't know on which platform you are running your containers. If you run them on Compute Engine (a VM), that is probably the worst option. Today there are container orchestration systems that let you deploy containers, scale them up and down, and perform progressive rollouts, replacing the running containers with a newer version without downtime.
Cloud Run is a wonderful serverless solution for that; you also have Kubernetes (GKE on Google Cloud), which you can use with Knative for a better developer experience.
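As a rough illustration of the zero-downtime rollout described above, a Cloud Run deployment is a single command; the service name, image path and region here are placeholders:

gcloud run deploy my-app --image gcr.io/my-project/my-app:v2 --region us-central1

Each deploy creates a new revision, and traffic is shifted to it once it is ready, so a code change does not require manually restarting containers.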

What is the best way of handling AEM caching in the dispatcher?

I recently came across a situation where I needed to clear my dispatcher cache manually. For instance, if I modify any JS/CSS files, I need to clear the dispatcher cache manually in order to get the modified JS/CSS served; otherwise the old version of the code keeps being served. I just heard that ACS developed versioned clientlibs, which will help us do versioning. I have many questions about this.
Before versioned clientlibs, how did AEM manage this?
Isn't AEM intelligent enough to manage versioned clientlibs itself?
Is this the correct way of handling it?
Can we create a script that will take a backup of the existing JS/CSS files before clearing them?
What other options do we have?
Versioned clientlibs is the correct solution, but you'll need a bit more:
Versioned clientlibs is a client-side cache-busting technique, used to bust the browser cache.
It will NOT bust the dispatcher cache. Pages cached at the dispatcher continue to be served unless the cache is manually cleared.
Refer here for a similar question.
To answer your queries:
Before versioned clientlibs, how did AEM manage this? As #Subhash points out, it is part of the production deployment scripts (Bamboo or Jenkins) to clear the dispatcher cache when clientlibs change.
Isn't AEM intelligent enough to manage versioned clientlibs itself? This is the same as with any CMS tool: the caching strategy has to be the responsibility of the HTTP servers, NOT AEM. Moreover, when you deploy JS code changes, you need to clear the dispatcher cache for pages to reflect the new JS.
Is this the correct way of handling it? For client-side busting, versioned clientlibs is a 100% foolproof technique. For dispatcher cache busting, you'll need a different method.
Can we create a script that will take a backup of the existing JS/CSS files before clearing them? This should be part of your CI process, defined in Jenkins/Bamboo jobs. It is not a responsibility of AEM.
What other options do we have? For dispatcher cache clearance, try dispatcher flush rules. You can configure them so that when /etc design paths are published, the corresponding tree is automatically cleared and subsequent requests hit the publisher and get the updated clientlibs.
Recommended:
Use Versioned Clientlibs + CI driven dispatcher cache clearance.
Since clientlibs are modified by the IT team and require a deployment, make clearing the cache part of the CI process. Dispatcher flush rules might help, but modifying JS/CSS and hitting the publish button in production is not a real use case; the production deployment cycle should perform this task. Reference links for dispatcher cache clearing scripts: 1. Adobe documentation, 2. Jenkins way, 3. Bamboo way
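For example, the CI job can invalidate the dispatcher cache with a request along these lines (the dispatcher host and content path are placeholders, and the /dispatcher/invalidate.cache endpoint must be allowed for the CI host in your dispatcher farm configuration):

curl -H "CQ-Action: Activate" -H "CQ-Handle: /content/mysite" -H "Content-Length: 0" -H "Content-Type: application/octet-stream" http://dispatcher-host/dispatcher/invalidate.cache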
Before versioned clientlibs:
You usually wire dispatcher invalidation into the build and deployment pipeline. In the last phase, after the packages have been deployed to the author and publish instances, the dispatcher is invalidated. This still leaves the issue of the browser cache not getting cleared (in cases where the clientlib name has not changed).
To overcome this, you can build custom cache-busting techniques where you maintain a naming scheme for clientlibs per release, e.g. /etc/designs/homepageclientlib.v1.js or /etc/designs/homepageclientlib.<<timestamp>>.js. This is just for the browser to trigger a fresh download of the file from the server. But with CI/CD and frequent releases these days, this is just overhead.
There are also inelegant ways of forcing a bypass of the dispatcher using query params. In fact, even now if you open any AEM page, you might notice the cq_ck query param, which is there to disable caching.
Versioned clientlibs from ACS Commons is now the way to go. It is hassle-free: the configuration generates a unique MD5 hash for each clientlib, thereby busting not just the dispatcher cache but also the browser-level cache.
There is an add-on for Adobe AEM that does resource fingerprinting (not limited to clientlibs, basically for all static website content), Cache-Control header management and true resource-only flushing of the AEM dispatcher cache. It also deletes updated resources from the dispatcher cache that are not covered by AEM's authoring process (e.g. when you deploy your latest code). A free trial version is available from https://www.browsercachebooster.com/

Remote streaming with Solr

I'm having trouble using remote streaming with Apache Solr.
We previously had Solr running on the same server where the files to be indexed are located, so all we had to do was pass it the path of the file we wanted to index.
We used something like this:
stream.file=/path/to/file.pdf
This worked fine. We have now moved Solr so that it runs on a different server to the website that uses it. This was because it was using up too many resources.
I'm now using the following to point Solr in the direction of the file:
stream.file=http://www.remotesite.com/path/to/file.pdf
When I do this Solr reports the following error:
http:/www.remotesite.com/path/to/file.pdf (No such file or directory)
Note that it is stripping one of the slashes from http://.
How can I get Solr to index a file at a certain URL, as I'm trying to do above? The enableRemoteStreaming parameter is already set to true.
Thank you
For remote streaming,
you need to enable it in solrconfig.xml:
<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />
and use stream.url (rather than stream.file) for URLs:
If remote streaming is enabled and URL content is called for during
request handling, the contents of each stream.url and stream.file
parameters are fetched and passed as a stream.
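For example, a remote PDF can be sent to the extracting request handler with a request like the following (the Solr host, core name and literal.id value are placeholders; adjust them to your setup):

curl "http://solrserver:8983/solr/mycore/update/extract?stream.url=http://www.remotesite.com/path/to/file.pdf&literal.id=doc1&commit=true"

Solr then fetches the file from that URL itself, so it no longer needs local filesystem access to the document.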

Serving dynamic zip files through Apache

One of the responsibilities of my Rails application is to create and serve signed xmls. Any signed xml, once created, never changes. So I store every xml in the public folder and redirect the client appropriately to avoid unnecessary processing from the controller.
Now I want a new feature: every xml is associated with a date, and I'd like to implement the ability to serve a compressed file containing every xml whose date lies in a period specified by the client. Nevertheless, the period cannot be limited to less than one month for the feature to be useful, and this implies some zip files being served will be as big as 50M.
My application is deployed as a Passenger module of Apache. Thus, it's totally unacceptable to serve the file with send_data, since the client would have to wait for the entire compressed file to be generated before the actual download begins. Although I have an idea of how to implement the feature in Rails so that the compressed file is produced while being served, I fear my server will run short on resources once several long-running Ruby/Passenger processes are tied up serving big zip files.
I've read about a better solution to serve static files through Apache, but not dynamic ones.
So, what's the solution to the problem? Do I need something like a custom Apache handler? How do I inform Apache, from my application, how to handle the request, compressing the files and streaming the result simultaneously?
Check out my mod_zip module for Nginx:
http://wiki.nginx.org/NgxZip
You can have a backend script tell Nginx which URL locations to include in the archive, and Nginx will dynamically stream a ZIP file to the client containing those files. The module leverages Nginx's single-threaded proxy code and is extremely lightweight.
The module was first released in 2008 and is fairly mature at this point. From your description I think it will suit your needs.
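As a rough sketch of how the backend drives mod_zip (the file paths, sizes and names below are made-up examples): the backend replies with an X-Archive-Files: zip header, and its body lists one entry per line in the form <crc-32> <size> <location> <file name>, where "-" can be used when the CRC is not known:

- 52420 /signed/2018-01/invoice-0001.xml invoice-0001.xml
- 48911 /signed/2018-01/invoice-0002.xml invoice-0002.xml

Nginx then fetches each location itself and streams the resulting ZIP to the client.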
You simply need to use whatever API is available to you to create a zip file and write it to the response, flushing the output periodically. If this serves large zip files, or will be requested frequently, consider running it in a separate process with a high nice/ionice value (low priority).
Worst case, you could run a command-line zip in a low priority process and pass the output along periodically.
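For instance, command-line zip can write the archive to standard output, so (as a hedged sketch; the directory path is a placeholder) the web process could spawn something like the following at low priority and copy its stdout to the response as it is produced:

nice -n 19 ionice -c3 zip -q -r - /var/app/public/xmls/2018-01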
It's tricky to do, but I've made a gem called zipline (http://github.com/fringd/zipline) that gets things working for me. I want to update it so that it can support plain file handles or paths; right now it assumes you're using CarrierWave.
Also, you probably can't stream the response with Passenger; I had to use Unicorn to make streaming work properly, and certain Rack middleware can even break it (calling response.to_s breaks it).
If anybody still needs this, bother me on the GitHub page.

Is mod_rewrite a valid option for caching dynamic pages with Apache?

I have read about a technique that involves writing a rendered dynamic page to disk and using mod_rewrite to serve it whenever it exists. I was thinking about cleaning out the cached version every X minutes using a cron job.
I was wondering if this was a viable option or if there were better alternatives that I am not aware of.
(Note that I'm on a shared machine and mod_cache is not an option.)
You could use your cron job to run the scripts and redirect the output to a file.
If you have a PHP file index.php, all you would have to do is run
php index.php > (location of static file)
You just have to make sure that your script runs the same on the command line as it does when served by Apache.
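A minimal .htaccess sketch of the serving side (assuming the cron job writes the rendered page to cache/index.html alongside index.php; both names are placeholders):

RewriteEngine On
RewriteCond %{DOCUMENT_ROOT}/cache/index.html -f
RewriteRule ^(index\.php)?$ cache/index.html [L]

Requests fall through to index.php whenever the cached copy is absent, e.g. right after the cron job has removed it.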
I would use a cache at the application level, because the application knows best when the cached version is out of date, and it is more flexible and powerful when it comes to cache negotiation.
Does the page need to be discarded every so often simply because time has passed? Or should a static version be regenerated whenever the page is updated?
If the latter, you could write a script that makes a copy of the just-edited page and saves it under its static filename. That should lighten the write load, since in that scenario you wouldn't need a fresh static copy unless a change was actually made that needs to be shown.