Apache regex: 301 redirects to eradicate duplicates in URL path

We are using a CMS that produces URLs of the format www.domain.com/home/help/contact/contact. Here the first occurrence of contact is the directory and the second occurrence is the HTML page itself. These URLs are causing SEO issues.
We have implemented canonical tags, but the business wants to make sure these duplicates do not show up in either the search engines or Google Analytics, and has asked us to implement a 301 solution on our web server.
We have a regex that finds these matches, but I also need the part of the URL before the match. The regex we have is .*?([\w]+)\/\1+ and this returns contact in /home/help/contact/contact. How can I capture the /home/help/ path as well, so that I can redirect to the right page? Can someone help with this please? I am a beginner when it comes to regex.

Since you're able to get contact using a matching group, enclose everything before that inside a matching group as well:
(.*?)(/[\w]+)\2+
I have put the / inside the capturing group too, so that you won't get false positives for paths such as
/home/some/app/page
With the original pattern, the p at the end of app would be captured as \1 and the p at the start of page would be found as its repetition.
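To check the pattern before putting it into a rewrite rule, it can be exercised in Python, whose backreference syntax is the same for this case. This is only a sketch: the function name dedupe_path is hypothetical, and the ^ and $ anchors are an added assumption (without them a path such as /home/some/app/apple would still match, because /app is followed by /app).

```python
import re

# The answer's pattern with ^ and $ added (the anchors are an assumption, not part of the answer).
DUPLICATE_TAIL = re.compile(r"^(.*?)(/\w+)\2+$")

def dedupe_path(path):
    """Return the redirect target if the last path segment is repeated, else None."""
    match = DUPLICATE_TAIL.match(path)
    if match is None:
        return None
    # Group 1 is everything before the repeated segment, group 2 is the segment itself.
    return match.group(1) + match.group(2)

print(dedupe_path("/home/help/contact/contact"))
print(dedupe_path("/home/some/app/page"))
```

In a RewriteRule the same two groups would be available as $1 and $2 for building the redirect target.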

Related

How do I enable query string in Cloudflare page rules?

I want to forward this URL
https://demo.example.com/page1?v=105
to
https://www.example.com/page1
It does not work when the page rule URL includes the ?v= part, but it works if I remove the ?v= part from the rule.
Is there any way to include the ?v= part in this page rule?
The rule you mentioned works only for that exact URL: https://demo.example.com/page1?v=105. If the URL has any additional query parameter, or even one additional character, e.g. https://demo.example.com/page1?v=1050, it won't match the rule.
If you need to use complicated URL matches that require specific query strings, you might need to use bulk redirects or a Cloudflare worker.
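The kind of matching such custom logic would need can be sketched as follows. This is not Cloudflare's API, only an illustration in Python of matching on host, path, and the presence of a v parameter instead of on the exact URL string; the function name redirect_target is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

def redirect_target(url):
    """Return the redirect URL when the request matches, else None."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)  # e.g. "v=105" -> {"v": ["105"]}
    # Match on host, path, and the presence of a non-empty v parameter, whatever its value.
    if parts.hostname == "demo.example.com" and parts.path == "/page1" and "v" in query:
        return "https://www.example.com/page1"
    return None

print(redirect_target("https://demo.example.com/page1?v=105"))
print(redirect_target("https://demo.example.com/page1?v=1050"))
```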

Mod_Rewrite how to ignore slashes that don't belong in url string

I'm trying to set up canonical links in our forum and I need to come up with a rewrite rule that ignores slashes that don't belong in the URL.
The proper URL would look like:
http://www.truckingtruth.com/truckers-forum/Topic-20315/Page-1/speak-to-recruiter
For this I'm using the following mod_rewrite rule to pass the 'topic', 'page', and 'subjectString' variables:
^Topic-(.*)/Page-(.*)/(.*)$ index.html?topic=$1&page=$2&subjectString=$3
But sometimes improper links to our site or improper links in a comment will add slashes to the URL that don't belong there and it throws off the rule. Example:
http://www.truckingtruth.com/truckers-forum/Topic-1652/Page-1/www.truckingtruth.com/free_truck_driving_schools/swift/how-to-use-the-qualcomm
When that happens the variables being passed are:
topic = "1652"
page = "1/www.truckingtruth.com/free_truck_driving_schools/swift"
subjectString = "how-to-use-the-qualcomm"
What I want it to do is pass:
topic = "1652"
page = "1"
subjectString = "www.truckingtruth.com/free_truck_driving_schools/swift/how-to-use-the-qualcomm"
How can I create a rewrite rule that will pass everything after "Page-1" as the subjectString even if there are slashes in it?
Since the topic is always an integer, for your first capturing group you can use \d, which matches any decimal digit (equivalent to [0-9]).
For the page, just make sure not to include any slashes; [^/] will take care of that.
Everything after that should go to the third capturing group, so the resulting regex will be:
^Topic-(\d*)/Page-([^/]*)/(.*)$
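The regex can be tried against both URLs from the question outside of Apache. The sketch below uses Python's re module, which treats this pattern the same way; the function name parse_forum_path is hypothetical, and the paths are given relative to /truckers-forum/ as the original rule expects.

```python
import re

# The regex from the answer, unchanged.
TOPIC_RULE = re.compile(r"^Topic-(\d*)/Page-([^/]*)/(.*)$")

def parse_forum_path(path):
    """Split a forum path into the three variables the rewrite rule passes on."""
    match = TOPIC_RULE.match(path)
    if match is None:
        return None
    topic, page, subject = match.groups()
    return {"topic": topic, "page": page, "subjectString": subject}

print(parse_forum_path("Topic-20315/Page-1/speak-to-recruiter"))
print(parse_forum_path(
    "Topic-1652/Page-1/www.truckingtruth.com/free_truck_driving_schools/swift/how-to-use-the-qualcomm"
))
```

Note that \d* and [^/]* also accept empty values; using + instead of * would reject URLs with a missing topic or page number.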

htaccess ignore part of query string

Is it possible to write a .htaccess rule that ignores part of the query string but still keeps it in the URL (without redirection)?
I need a way to ignore query string parameters that start with utm_ as if they were absent (invisible to the backend scripts) while keeping them present in the browser URL, because the query string is captured by the JavaScript analytics scripts (that is why solutions involving a redirect are unacceptable).
Let's say I have a URL without a query string: https://example.com/bla/hello
or with a query string: https://example.com/anything?hello=world
Then I need to add a query string part utm_source=123, which could be positioned anywhere among the other query string elements.
Some pages of the site stop working or start behaving differently when I add a query string to the URL. For example, https://example.com/?utm_source=123 throws a 404.
If this is possible, could you help me with the rules?

Alfresco FTS - why first digit of folder's name should be escaped?

I have a question regarding the Alfresco FTS/Lucene search. It is known that some special characters have to be escaped in the search query, such as a space (as _x0020_).
But it turns out that if the first character of a folder's name is a digit, it must also be escaped. This can easily be tested in the Node Browser by creating a folder such as 123456 and navigating to its parent folder (in my case I have the following folder structure: */2017/123456/):
Primary Path: /app:company_home/st:sites/<some-folders>/cm:_x0032_017/cm:_x0031_23456
(here _x0032_ is the escaped 2 and _x0031_ is the escaped 1)
If I don't escape the first character of the folder name, a 500 error is returned.
Why is that? I tried to find something relevant in the Alfresco documentation, but didn't manage to.
Alfresco v.4.2.0
Lucene search uses ISO 9075 encoding (from SQL), like similar frameworks, so we need to encode the path elements. It would be nice if the API hid this requirement the way the browser URL does, but you can use ISO9075Encode to do the job.
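To illustrate what the encoding does to the folder names from the question, here is a simplified sketch in Python. It is not Alfresco's implementation: the function name and the exact rules are assumptions (a leading character must be a letter or underscore, later characters may also be digits, hyphens, or dots, and anything else becomes _xHHHH_ with four hex digits).

```python
def iso9075_encode_sketch(name):
    """Simplified, assumed approximation of ISO 9075 name encoding."""
    encoded = []
    for index, char in enumerate(name):
        if index == 0:
            # Assumption: a name may only start with a letter or an underscore.
            allowed = char.isalpha() or char == "_"
        else:
            # Assumption: later characters may also be digits, hyphens, or dots.
            allowed = char.isalnum() or char in "_-."
        encoded.append(char if allowed else f"_x{ord(char):04x}_")
    return "".join(encoded)

print(iso9075_encode_sketch("2017"))
print(iso9075_encode_sketch("123456"))
print(iso9075_encode_sketch("my folder"))
```

Under these assumptions a digit is escaped only in the first position, which is consistent with the Primary Path shown in the question.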

Drupal 7 Apache solr faceted search with OR condition on two fields instead of drill down/AND

I have a Drupal 7 website that is running apachesolr search and is using faceting through the facetapi module.
When I use the facets to narrow my searches, everything works perfectly and I can see the filters being added to the search URL, so I can copy them as links (ready-made narrowed searches) elsewhere on the site.
Here is an example of how the apachesolr URL looks after I select several facets/filters:
search_url/search_keyword?f[0]=im_field_tag_term1%3A1&f[1]=im_field_tag_term2%3A100
Where the 'search_keyword' portion is the text I'm searching for and the '%3A' is just the url encoded ':' (colon).
Knowing this format, I can create any number of ready-made searches by creating the correct format for the URL. Perfect!
However, these filters are always ANDed, the same way they are when using the facet interface. Does anyone know if there is a syntax I can use, specifically in the search URL, to OR my filters/facets? Meaning, to make it such that the result is all entries that contains EITHER of the two filters?
New edit:
I do know how to OR terms for one facet through the URL, im_field_tag_term1:(x or y), but I need to know how to apply an OR condition between two facets.
Thanks in advance.