Finding 301 Redirects which are working on my site - apache

The .htaccess of my web site has accumulated a fair number of redirects (around 700) due to software upgrades. I think about half of these have now been indexed by Google. How can I find the list of redirects which are currently being used?
My idea is to find all "301" in the Apache Logs, such as this:
1.235.117.180 - - [01/Aug/2014:06:41:59 +0200] "GET /components/com_acesearch/assets/css/acesearch.css HTTP/1.1" 301 626 "http://example.com/link1/link2/page-2" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"
Is it safe to assume that any redirect which never shows up in the logs like the one above is not being used (so I can remove it)?
Thanks

No, that is not safe; do not rely on the Apache logs alone. Some old links might still be in a search index and can be crawled later on.
Can't you optimize your redirects? Can you give an example of some of them? Isn't there a pattern? If you can find a pattern (or a couple of them), regular expressions let you cover many URLs with a single rule.
There are more search engines than Google alone. If it is important that everything stays indexed, I would keep the redirects, but find the patterns and shrink the number of rules to 10 or so at most.
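As a rough illustration of the log-scanning idea from the question, here is a minimal Python sketch that counts how often each path was answered with a 301. It assumes the common/combined log format shown above; the function name `count_redirect_hits` and the second sample line are made up for the example. As noted, a path missing from the count only means it was not requested during the period the log covers, not that the redirect is unused.

```python
import re
from collections import Counter

# Matches the quoted request and the status code of a common/combined log line.
LOG_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<proto>[^"]*)" (?P<status>\d{3}) ')

def count_redirect_hits(lines):
    """Count requests per path that were answered with status 301."""
    hits = Counter()
    for line in lines:
        m = LOG_RE.search(line)
        if m and m.group("status") == "301":
            hits[m.group("path")] += 1
    return hits

sample = [
    # First line is the entry quoted in the question.
    '1.235.117.180 - - [01/Aug/2014:06:41:59 +0200] "GET /components/com_acesearch/assets/css/acesearch.css HTTP/1.1" 301 626 "http://example.com/link1/link2/page-2" "Mozilla/5.0"',
    # Second line is invented for contrast (a non-redirect response).
    '1.235.117.180 - - [01/Aug/2014:06:42:03 +0200] "GET /index.php HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]
hits = count_redirect_hits(sample)
for path, n in hits.most_common():
    print(n, path)
```

To run it on a real log, pass an open file object instead of `sample`; the log path depends on your server setup.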

Related

Why does one specific customer's IP get refused (403 error) from our apache2.4?

We never had any problem and we didn't deploy anything, but one particular customer on an IPv6 address is now getting a 403 error from our Apache and I just can't figure out why.
I'm not sure what to provide, but I double-checked every a2 config file.
I can see the customer's requests in the access.log (with the 403 status code), but nothing in the error.log.
access.log :
2a02:2788(...):102f - - [17/May/2021:12:54:12 +0200] "GET /page_url HTTP/1.0" 403 368 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36 Edg/89.0.774.75"
2a02:2788(...):102f - - [17/May/2021:12:54:15 +0200] "GET /page_url HTTP/1.0" 403 368 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36 Edg/89.0.774.75"
It's not at the application level either; we don't have anything that returns a 403 error.
Any idea what Apache could be doing to trigger a 403 error specifically for this IP?
Why/how is the customer seemingly making an HTTP/1.0 request? This alone could be sufficient reason for the server to reject the request since normal users using normal browsers don't send HTTP 1.0 requests. (HTTP/1.1 is expected.)
Generally, only certain bots make HTTP 1.0 requests.
An Apache module like mod_security could potentially have a rule that would block such requests. (Or any other rule using mod_rewrite, for instance, could also block such requests - but this is certainly not a default.)
Edg/89.0.774.75
It would seem this may have been a bug with Microsoft Edge, as the following Microsoft community post (from around the same time as this question) would seem to suggest:
https://answers.microsoft.com/en-us/microsoftedge/forum/all/internet-explorer-and-ms-edge-sends-ssl-requests/22708bcd-f196-45fb-84c9-6d8c34e7e08f
And as also noted in the above article, this would seem to have been "fixed" in later versions. So, your customer may also now be "fixed". (?)
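To check whether other clients are affected in the same way, one option is to pull every HTTP/1.0 request out of the access log together with its status code. This is only a sketch: the function name `http10_requests` is made up, and the sample line is the (elided) entry from the question.

```python
import re

# Captures the protocol version and the status code of a log line.
REQUEST_RE = re.compile(r'"[A-Z]+ \S+ (HTTP/\d\.\d)" (\d{3}) ')

def http10_requests(lines):
    """Return (status, line) pairs for requests made with HTTP/1.0."""
    found = []
    for line in lines:
        m = REQUEST_RE.search(line)
        if m and m.group(1) == "HTTP/1.0":
            found.append((int(m.group(2)), line))
    return found

sample = [
    '2a02:2788(...):102f - - [17/May/2021:12:54:12 +0200] "GET /page_url HTTP/1.0" 403 368 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36 Edg/89.0.774.75"',
]
matches = http10_requests(sample)
```

If every HTTP/1.0 request in the log is answered with 403 regardless of the client address, the protocol version rather than the IP is the more likely trigger.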

Random chars appearing in Apache access logs

We are seeing random letters appear in our access logs. The requests return 404 since the content does not exist. They are made by a variety of users, and other requests from the same IP usually look genuine. There is no way to request these URLs from the site. Some of these requests even come from internal traffic on our network.
Example:
157.203.177.191 - - [04/Feb/2018:23:51:20 +0000] "GET /VLTRP/content/dam/example/dotcom/images/ABtest/existing-customer-thumb.jpg HTTP/1.1" 404 60294 39082 "http://www.example.com/shop.html" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0" 2
Without the /VLTRP this is a genuine request. Has anyone seen something similar before?
For info we are running Apache/2.2.15 (Unix) with ModSec enabled. We do see similar behaviour on another site where we do not have ModSec configured. We see similar requests for internal, external and bot traffic.

Strange GET request in Apache Log

I'm monitoring my website with the Apache log and I saw some strange requests:
51.255.65.74 - - [28/May/2016:11:48:02 -0300] "GET /insert/xahanave.html HTTP/1.1" 404 1035 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.1; +http://ahrefs.com/robot/)"
207.46.13.128 - - [28/May/2016:11:49:13 -0300] "GET / HTTP/1.1" 200 14188 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
66.249.64.87 - - [28/May/2016:11:49:32 -0300] "GET /css/kin8tengoku-1144-may.html HTTP/1.1" 404 1039 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Well, my FTP doesn't have the folder "/insert/xahanave", nor a file "kin8tengoku" in the css folder. Is it possible to make a request for a non-existent file or folder?
Important: some days ago my site was hacked and an "insert" folder was created over FTP without permission, but now everything is clean and the "insert" folder doesn't exist anymore. My big question is: why do requests to this folder continue?
Because the files were picked up by Ahrefs, the Bing search engine and the Google search engine while they were up, and these crawlers periodically recheck files to see if there are any changes. This is how Google and the like return up-to-date information on your site.
You can see it's these companies from the user agent sent (at the end of each line). Some more nefarious bots pretend to be Googlebot, but a quick Google of these IP addresses shows them to be legitimate ones.
As you can see, your server correctly responds with a 404 (page not found status) and, provided there are no links to these pages, the companies will eventually take the hint, drop them from their index and stop requesting them. This can take a month or two. They don't do it immediately in case the 404 is an error because you accidentally removed the page or similar.
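Instead of searching for each IP address by hand, a crawler can also be checked with a reverse DNS lookup followed by a forward lookup of the returned hostname. This is a different technique from the web search used above, sketched here in Python. The function name `is_verified_crawler` and the suffix list are assumptions for illustration; check each crawler operator's documentation for the hostnames it actually uses.

```python
import socket

# Assumed hostname suffixes; verify against each operator's documentation.
ASSUMED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def is_verified_crawler(ip, suffixes=ASSUMED_SUFFIXES,
                        reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a crawler IP address."""
    try:
        host = reverse(ip)[0]          # IP -> hostname
    except OSError:
        return False
    if not host.lower().rstrip(".").endswith(tuple(suffixes)):
        return False
    try:
        return ip in forward(host)[2]  # hostname -> IPs must include the original
    except OSError:
        return False
```

The `reverse` and `forward` parameters exist only so the lookups can be replaced in tests; in normal use, call `is_verified_crawler("66.249.64.87")` and let it query DNS.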

Safari shows an error when IPB redirects it to an https:// URL with language accents

My https switch is almost complete. Everything works flawlessly on every browser except Safari (both iOS and OSX).
Some URLs contain Polish characters, such as żźćąłóń. Apparently in some cases Invision Power Board encodes them, so for example ó becomes %F3.
Example 1
HTTP version works: http://net4game.com/topic/256549-rekrutujemy-gamemaster%F3w/
HTTPS with a special UTF8 character works as well: https://net4game.com/topic/256549-rekrutujemy-gamemasterów/
Does not work when ó is encoded: https://net4game.com/topic/256549-rekrutujemy-gamemaster%F3w/
[29/Aug/2015:17:19:56 +0200] "GET /topic/256549-rekrutujemy-gamemaster%F3w/ HTTP/1.1" 301 31 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9"
Example 2
HTTP works: http://net4game.com/topic/256786-bug-ślub/
HTTPS works: https://net4game.com/topic/256786-bug-ślub/
HTTPS, UTF8, with a redirect, does not work: https://net4game.com/topic/256786-bug-ślub/?view=getlastpost
I'm using nginx behind CloudFlare. Every request goes to index.php.
What could be wrong? Why does this problem occur only on HTTPS? It seems Safari doesn't rewrite %F3 to ó over encrypted connections, but I'm not sure it's relevant, as the second example seems to stick to the unencoded ś.
Cheers.
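For reference, the two encodings of ó can be reproduced in Python. The single byte %F3 is what a legacy 8-bit charset produces; ISO-8859-2 is assumed here because the text is Polish, but the board may use another charset that places ó at the same byte. UTF-8 yields two bytes instead. The helper `decode_path` is a hypothetical sketch and says nothing about what Safari or Invision Power Board do internally.

```python
from urllib.parse import quote, unquote

slug = "gamemasterów"
legacy = quote(slug, encoding="iso-8859-2")  # 'gamemaster%F3w'
utf8 = quote(slug, encoding="utf-8")         # 'gamemaster%C3%B3w'

def decode_path(path):
    """Decode a percent-encoded path as UTF-8, falling back to ISO-8859-2."""
    try:
        return unquote(path, encoding="utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # A lone %F3 is not valid UTF-8, so the strict decode fails for it.
        return unquote(path, encoding="iso-8859-2", errors="strict"), "iso-8859-2"

print(decode_path("/topic/256549-rekrutujemy-gamemaster%F3w/"))
```

A client or server that expects UTF-8 in the path cannot interpret %F3 on its own, which is one reason mixing the two encodings in redirects tends to cause trouble.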

Fixing mistakes reading logs

I have a huge 1 GB log file. As far as I know, it shows errors on my site, but I don't understand it at all.
I have lots of rows like this:
8x.xxx.45.10x (my ip) - - [04/Feb/2011:09:59:48 -0500] "GET /post?slaps=bbrfd HTTP/1.1" 404 278 "http://mywebsite.com/" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.86 Safari/534.13"
What does it mean?
Thank you very much.
That entry indicates that a request for /post?slaps=bbrfd on your site was not found (404). The request came from your IP and transferred 278 bytes of data (the 404 error page's contents). The referrer shows that the link that couldn't be found was clicked on mywebsite.com, and the rest is how the browser identified itself. The two dashes are for the remote identity (as reported by identd) and the username the client authenticated with on the site. The remote identity is VERY rarely present, as it requires the remote machine to run identd, and looking it up would slow down your site massively.
It looks like an access log file from Apache. It has nothing to do with PHP or MySQL. It looks like the user got a 404 page when trying to access /post?slaps=bbrfd
This would suggest the URL does not exist.
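The field-by-field reading above can be sketched as a small Python parser for the combined log format. The names `COMBINED_RE` and `parse_line` are made up for the example, and the IP address 203.0.113.10 is a placeholder standing in for the masked address in the question.

```python
import re

# One named group per field of the Apache "combined" log format.
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    m = COMBINED_RE.match(line)
    return m.groupdict() if m else None

line = ('203.0.113.10 - - [04/Feb/2011:09:59:48 -0500] '
        '"GET /post?slaps=bbrfd HTTP/1.1" 404 278 "http://mywebsite.com/" '
        '"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.13 '
        '(KHTML, like Gecko) Chrome/9.0.597.86 Safari/534.13"')
fields = parse_line(line)
for name, value in fields.items():
    print(f"{name:8} {value}")
```

For a 1 GB file, read it line by line and tally `fields["status"]` and `fields["request"]` rather than loading everything into memory.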