Let's say I have a simple blog engine. I've published a simple post with the URL
http://example.org/blog/awesomr-post
A few days later I noticed the typo and fixed the URL:
http://example.org/blog/awesome-post
But search engines have already indexed "awesomr-post", and anybody who follows that link will get a 404 error. The same issue applies to bookmarked pages.
So I think the post should be reachable through two links:
http://example.org/blog/awesome-post
http://example.org/permalinks/1
Now I have to specify the relationship between them somehow. Here is what I can do:
http://example.org/permalinks/1
<!DOCTYPE html>
<html>
<head>
<link rel="canonical" href="http://example.org/blog/awesome-post">
</head>
<body>
page content
</body>
</html>
http://example.org/blog/awesome-post
<!DOCTYPE html>
<html>
<head>
<link rel="bookmark" href="http://example.org/permalinks/1">
</head>
<body>
page content
</body>
</html>
Is this the right solution? And should I use the canonical or the permalink URL when linking from pages on other sites?
One way is to have a 301 (permanent) redirect from http://example.org/blog/awesomr-post to http://example.org/blog/awesome-post
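To make the redirect idea concrete, here is a minimal sketch in Python of how a blog engine might decide between 200, 301 and 404. The RENAMED_SLUGS table, KNOWN_SLUGS set and resolve_blog_path function are hypothetical names used only for illustration; a real engine would keep this mapping in its database or express it as a web server rewrite rule.

```python
# Hypothetical lookup tables: retired slug -> current slug, plus the live slugs.
RENAMED_SLUGS = {"awesomr-post": "awesome-post"}
KNOWN_SLUGS = {"awesome-post"}


def resolve_blog_path(path):
    """Return (status, location) for a request path under /blog/."""
    prefix = "/blog/"
    if not path.startswith(prefix):
        return 404, None
    slug = path[len(prefix):]
    if slug in RENAMED_SLUGS:
        # Old URL: tell browsers and search engines the move is permanent.
        return 301, prefix + RENAMED_SLUGS[slug]
    if slug in KNOWN_SLUGS:
        return 200, None
    return 404, None
```

With this in place the misspelled URL keeps working for bookmarks and indexed links, while only the corrected URL serves content.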
I'm using scrapy to crawl an e-commerce website. I'm experienced with simpler websites, where scrapy alone or with splash/selenium handles most cases.
I have a new situation that I have no experience dealing with. From my investigations it could be something like a captcha, but without any request to the user.
I've tried to solve it with scrapy alone and with scrapy and selenium, with no success.
With my scrapy request I receive the following response:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<title>Challenge Validation</title>
<link rel="stylesheet" type="text/css" href="/_sec/cp_challenge/sec-2-9.css">
<script type="text/javascript">function cp_clge_done(){location.reload(true);}</script>
<script src="/_sec/cp_challenge/sec-cpt-int-2-9.js" async defer></script>
<script type="text/javascript">sessionStorage.setItem('data-duration', 5);</script>
</head>
<body>
<div class="sec-container">
<div id="sec-text-container"><iframe id="sec-text-if" class="custmsg" src="https://beta.elcorteingles.es/sgfm/statics/eci_non_food/contents/cc/cca.html"></iframe></div>
<div id="sec-if-container">
<iframe id="sec-cpt-if" class="crypto" data-key="" data-duration=5 src="/_sec/cp_challenge/ak-challenge-2-9.htm"></iframe>
</div>
</div>
</body>
</html>
With the Chrome inspector I also noticed two GET requests (non-java) that might be related:
check -> returns HTML ( ... <title>RP iframe</title> ...)
check-session?origin=https%3A%2F%2Fwww.elcorteingles.es -> returns HTML (...<title>OP iframe</title>...)
Using scrapy shell with view(response), it looks like a captcha situation, waiting for something. An example page could be:
scrapy shell "https://www.elcorteingles.es/supermercado/0110120903000022-coosur-aceite-de-oliva-intenso-1-botella-1-l/"
The title 'Challenge Validation' suggests it. I have no idea how to handle this case. From research, I've seen solutions involving scrapy middleware, but only for cases where input was requested from the user. I found no example similar to this case. Any guidance on how to proceed is appreciated.
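This does not solve the challenge, but as a first step a spider can at least recognise when it has been served the challenge page instead of the product page, for example to log it or retry through a real browser. A minimal sketch, assuming the two markers visible in the response above (the "Challenge Validation" title and the /_sec/cp_challenge/ asset path) are stable; looks_like_challenge is a hypothetical helper name, not part of scrapy.

```python
import re

# Grab the text between <title> and </title>, ignoring case and whitespace.
TITLE_RE = re.compile(r"<title>\s*(.*?)\s*</title>", re.IGNORECASE | re.DOTALL)


def looks_like_challenge(html):
    """Heuristic: True if the HTML resembles the challenge page shown above."""
    match = TITLE_RE.search(html)
    title = match.group(1) if match else ""
    return (
        title.lower() == "challenge validation"
        or "/_sec/cp_challenge/" in html
    )
```

Inside a spider callback this could be called as looks_like_challenge(response.text) before any parsing is attempted.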
I have a page set up for the Open Graph Protocol because our app is built on Angular 1.x. When we share a URL using LinkedIn, the share popup opens, but sometimes it does not crawl the Open Graph tags and sometimes it shows the properly crawled tags. It was working fine until last week. Here is the image which shows the preview area:
Scenario for sharing a link:
The user comes to our site at www.example.com/event/[EVENT_ID] and clicks share to LinkedIn.
The popup opens using: https://www.linkedin.com/shareArticle?mini=true&url=https://example.com/event/0u83s43rf6r/4295028179 where 4295028179 is the event ID and 0u83s43rf6r is a random key added for cache busting.
Now we are using Apache mod_rewrite to redirect the LinkedIn, Facebook and Twitter bots to our crawler page, where the Open Graph tags are rendered.
Apache mod_rewrite settings in the .htaccess file:
RewriteCond %{HTTP_USER_AGENT} ^(facebookexternalhit/(.*)|Facebot|Twitter(.*)|Pinterest|LinkedIn(.*)|LinkedInBot)$ [NC]
RewriteRule ^(event)/([_0-9a-zA-Z]+)/([0-9]+)$ https://share.example.com/web/crawler/details/$3 [R=301,L]
So when a crawler is redirected based on its User-Agent, the end URL, where the Open Graph tags are rendered, becomes: http://share.example.com/web/crawler/details/4295028179
Here are the rendered HTML tags:
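One way to debug a setup like this is to replay the same two regular expressions outside Apache and see which User-Agent strings and paths would actually be redirected. The sketch below copies the patterns from the RewriteCond and RewriteRule above into Python. It assumes Python's re module treats these particular patterns the same way Apache's PCRE does, and crawler_redirect is a hypothetical helper name. The User-Agent strings you feed it should come from your own access logs.

```python
import re

# Same pattern as the RewriteCond; [NC] becomes re.IGNORECASE.
BOT_UA = re.compile(
    r"^(facebookexternalhit/(.*)|Facebot|Twitter(.*)|Pinterest|LinkedIn(.*)|LinkedInBot)$",
    re.IGNORECASE,
)
# Same pattern as the RewriteRule; in .htaccess the path has no leading slash.
EVENT_PATH = re.compile(r"^(event)/([_0-9a-zA-Z]+)/([0-9]+)$")


def crawler_redirect(user_agent, path):
    """Return the redirect target for a bot request, or None if no rule applies."""
    if not BOT_UA.search(user_agent):
        return None
    match = EVENT_PATH.search(path)
    if match is None:
        return None
    # $3 in the RewriteRule is the numeric event id.
    return "https://share.example.com/web/crawler/details/" + match.group(3)
```

Because the condition is anchored with ^, a crawler whose User-Agent does not begin with one of the listed names would fall through to the Angular app and see no Open Graph tags.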
<html>
<head>
<script type="text/javascript">window.location = 'https://example.com/event/236129271' // if it's a browser then redirect it to website</script>
<meta property="og:title" content="Event Title" />
<meta property="og:description" content="Event Description" />
<meta property="og:image" content="Event Thumbnail" />
<meta name="title" content="LinkedIn Share Test" />
<meta name="description" content="Event Description" />
<meta property="og:image:width" content="188" />
<meta property="og:image:height" content="71" />
<!-- Twitter Card Working Fine-->
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:title" content="Event Title">
<meta name="twitter:description" content="Event Description">
<meta name="twitter:image" content="Event Image">
</head>
<body>
</body>
</html>
Last week this logic was working fine on LinkedIn, but now somehow it's not working.
Your code seems fine; you have the right og: tags, etc.
Whenever you're not sure how LinkedIn is reading your page, check your website with the LinkedIn Post Inspector, which will tell you how the LinkedIn API is looking at your webpage. It covers many things, from <title> tags to og: tags to oEmbed tags, etc.
Worried about caching? Why not test a URL like example.com?someFakeParameter=123? This will similarly bypass the caching at the LinkedIn Post Inspector.
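If you want to generate such test URLs programmatically, here is a small sketch using only Python's standard library. The with_cache_buster name is hypothetical, and the default parameter name and value are just the ones from the example above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_cache_buster(url, name="someFakeParameter", value="123"):
    """Append a throwaway query parameter, keeping any existing ones."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Changing the value on each test run gives a URL that has not been cached yet.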
If you could post your actual URL that you're sharing, I could give you a better answer, but hopefully something here helps!
This might sound like a very basic question, but I couldn't find much help from Google.
So, I have an HTML file:
<!doctype html>
<html>
<title>New Form Title</title>
<head>
<script type='text/javascript' src='http://localhost/whatever.js'></script>
</head>
<body>
</body>
</html>
When I hit F5 (after loading the page for the first time), I can see the server returned a 304 status, but I was under the assumption that a request would not even be sent in the first place (i.e. the browser would not send a request because the URL is the same, and it would use the cached item).
What am I missing? Is this the actual behaviour?
Thank you.
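For context, a 304 is what a server returns for a conditional request: the browser does send a request, carrying a validator for its cached copy, and the server replies "not modified" with an empty body. A minimal sketch of that server-side check, assuming an ETag validator; conditional_get and its arguments are hypothetical names, not any particular server's API.

```python
def conditional_get(request_headers, current_etag, body):
    """Return (status, headers, body) for a GET that may carry If-None-Match."""
    if request_headers.get("If-None-Match") == current_etag:
        # The cached copy is still valid: answer 304 and send no body.
        return 304, {"ETag": current_etag}, b""
    # No validator, or a stale one: send the full resource.
    return 200, {"ETag": current_etag}, body
```

The first load has no validator and gets a 200 with the script; a later request that revalidates with the same ETag gets the 304.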
Is there any way to capture all the redirects performed in JS on a page? For instance, let's take a look at this web page, which redirects using window.location:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Redirect JS</title>
</head>
<body>
<script>
window.location = "http://www.example.com";
</script>
</body>
</html>
or a meta tag:
<meta http-equiv="refresh" content="0; url=http://example.com/">
I would like to render a web page and get all the URLs the user has been redirected to. Is it possible? How can I do that in Selenium?
In Python (http://selenium-python.readthedocs.org/en/latest/api.html), the webdriver has a current_url property. After you call driver.get() on the page, I would assume current_url is the redirected URL. Is it not?
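A rough sketch of that idea: load the page, then poll current_url a few times and record each distinct value, so a redirect that fires shortly after the load is still noticed. observe_urls is a hypothetical helper. It only relies on the driver having get() and current_url, and the polls and interval defaults are arbitrary assumptions. Hops that come and go between two polls will be missed, so this gives the URLs observed, not a guaranteed full chain.

```python
import time


def observe_urls(driver, url, polls=10, interval=0.2):
    """Load url and return the distinct URLs seen in current_url, in order."""
    driver.get(url)
    seen = []
    for _ in range(polls):
        current = driver.current_url
        # Record the URL only when it differs from the last one recorded.
        if not seen or seen[-1] != current:
            seen.append(current)
        time.sleep(interval)
    return seen
```

With Selenium installed, it could be called as observe_urls(webdriver.Chrome(), "http://localhost/redirect.html"), where the local URL is just a placeholder.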
Your requirement "in Selenium" will make this impossible. Selenium interacts with a browser as a human would - a human should generally not know or care about all the redirects. If you are willing to abandon Selenium for this purpose, then there are libraries such as HttpBuilder (in the Java world) and many others (for other languages) that allow you to manipulate and watch HTTP traffic, which is what you are after here.
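As an illustration of working on the response instead of the browser, here is a small sketch in Python that pulls the target of a meta refresh tag out of raw HTML using only the standard library. MetaRefreshParser and meta_refresh_targets are hypothetical names. Since nothing executes JavaScript here, a window.location redirect would not be detected this way.

```python
import re
from html.parser import HTMLParser

# Matches the "url=..." part of a refresh content value such as "0; url=http://x/".
URL_RE = re.compile(r"url\s*=\s*['\"]?([^'\";]+)", re.IGNORECASE)


class MetaRefreshParser(HTMLParser):
    """Collects the targets of <meta http-equiv="refresh"> tags."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        match = URL_RE.search(attr.get("content", ""))
        if match:
            self.targets.append(match.group(1).strip())


def meta_refresh_targets(html):
    """Return every meta refresh target URL found in the HTML string."""
    parser = MetaRefreshParser()
    parser.feed(html)
    return parser.targets
```

Fetching each target in turn and repeating the check would give the chain of meta refresh hops, alongside whatever HTTP-level redirects the HTTP client reports.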
I have an HTML page open inside a UiWebViewController with Cordova. While index.html is loading inside the UiWebViewController, can we sniff the requests that originate from index.html?
For example, I have the following HTML being opened in the UiWebViewController:
<html>
<head>
<link rel="stylesheet" type="text/css" href="theme.css">
<script src="app.js"></script>
</head>
<body>
<img src="img.jpg"/>
</body>
</html>
Can I sniff and modify the URLs being requested inside the UiWebViewController, i.e. change img.jpg, theme.css, app.js to something like content/img.jpg, css/theme.css, js/app.js, using Objective-C?
Yes, that’s possible using NSURLProtocol, see this blog post by NSHipster and this related Stack Overflow thread.