How does this site work?
They have a FORM and an input submit, but I don't see any XHRs... there's just a "ping" that is sent frequently with no content (just the number 1).
How do they get the data?
Why is it that I don't see it in chrome console?
Weird...
Oh, okay... looks like Socket.IO... very interesting
Background: I'm trying to log into an HTTPS site with my valid credentials, navigate to a page that has a frequently updated list, and then scrape the list.
I was using code someone else wrote, which worked until a few weeks ago. I am new to this, but even I can see that the code was not very good, so I am trying to rewrite it.
First I log into the site and create a tunnel. Then I move to the page where my list is and grab the list, etc.
Here's what's weird. The login fails every time, until I turn on Fiddler. With Fiddler running it succeeds every time.
Any idea about why this would happen and how to fix?
Many thanks.
I got it working!
For anyone who finds themselves in the same situation (I've seen a number of postings of similar questions, but the answers hadn't worked for me, so I expect I am not alone): I eventually saw that I needed to set the security protocol to TLS. The specific syntax I used was:
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls;
The setting needs to be specified before the HttpWebRequest GET or POST is issued.
If you have a similar problem, I hope this helps.
I had an invalid "User-Agent" header. It contained invalid characters (ä, ö, ü).
I have a page tab app that I am hosting. I support both http and https. While I receive a signed_request package as expected, after I decode it, it does not contain page information. That data is simply missing.
I verified that the same scheme (https) is being used by Facebook, my hosted site and even the 'go-between', Facebook's static page handler.
I also created a new application with page tab support but got the same results: simply no page information in the signed_request.
Any other causes people can think of?
I add the app to the page tab using this link:
https://www.facebook.com/dialog/pagetab?app_id=176236832519816&next=https://www.intelligantt.com/Facebook/application.html
Here is the page tab I am using (Note: requires permissions):
https://www.facebook.com/pages/School-Auction-Test-2/154869721351873?id=154869721351873&sk=app_176236832519816
Here is the decoded signed_request I am receiving:
{"algorithm":"HMAC-SHA256","code":!REMOVED!,"issued_at":1369384264,"user_id":"1218470256"}
5/25 Update - I thought maybe the canvas app URLs didn't match the page tab URLs, so I spent several hours going through scenarios: both with or without a trailing slash, both with or without a trailing ?, and with or without query parameters.
I also tried changing the 'next' value when creating the page tab to the canvas app url and the page tab url.
No success on either count.
I did read that seeing the 'code' value in the signed_request means Facebook either couldn't match my URLs or that I'm capturing the second request. However, given all the URL permutations I went through, I believe the URLs match. I also subscribed to the 'auth.authResponseChange' event, which should give me the very first authResponse, and that should contain the signed_request with page.id in it (but doesn't).
If I had any reputation, I'd add a bounty to this.
Thanks.
I've just spent ~5 hours on this exact same problem and posted a prior answer that was incorrect. Here's the deal:
As you pointed out, signed_request appears to be missing the page data if your tab is implemented in pure javascript as a static html page (with *.htm extension).
I repeated the exact same test, on the exact same page, but wrapped my html page (including js) within a Perl script (with *.cgi extension)... and voila, signed_request has the page info.
Although confusing (and something Facebook should document better as a design choice), this may make some sense: it would be impossible to validate the signed_request wholly within JavaScript without placing your app secret within the scope of the page (and therefore revealing it to a potential attacker).
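To illustrate why the secret has to stay on the server, here is a minimal Python sketch of server-side validation. It assumes the commonly documented signed_request layout (a base64url signature, a dot, then a base64url JSON payload, signed with HMAC-SHA256 using the app secret). The function names and the secret "s3cret" are made up for the example, and make_signed_request exists only so the sketch can be exercised without Facebook.

```python
import base64
import hashlib
import hmac
import json


def _b64url_decode(data: str) -> bytes:
    # The "=" padding is stripped in transit; restore it before decoding.
    return base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def parse_signed_request(signed_request: str, app_secret: str) -> dict:
    """Verify the signature and return the decoded payload.

    Raises ValueError if the signature or the algorithm does not match.
    """
    encoded_sig, payload = signed_request.split(".", 1)
    expected = hmac.new(
        app_secret.encode("utf-8"), payload.encode("ascii"), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(_b64url_decode(encoded_sig), expected):
        raise ValueError("signed_request signature mismatch")
    data = json.loads(_b64url_decode(payload))
    if str(data.get("algorithm", "")).upper() != "HMAC-SHA256":
        raise ValueError("unexpected signing algorithm")
    return data


def make_signed_request(data: dict, app_secret: str) -> str:
    """Hypothetical helper: builds a signed_request for local testing only."""
    payload = _b64url_encode(json.dumps(data).encode("utf-8"))
    sig = hmac.new(
        app_secret.encode("utf-8"), payload.encode("ascii"), hashlib.sha256
    ).digest()
    return _b64url_encode(sig) + "." + payload
```

With this in place, parse_signed_request(request_value, app_secret)["page"]["id"] would give the page ID whenever Facebook includes the page block. The HMAC step is the part that cannot be done safely in browser-side JavaScript.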
It would be much easier with the PHP SDK, but if you just want to use JavaScript, maybe this will help:
Facebook Registration - Reading the data/signed request with Javascript
Also, you may want to check out this: https://github.com/diulama/js-facebook-signed-request
Simply put, you can't get the full parameters from the JavaScript signed_request. Use the PHP SDK to get the full signed_request, and copy the values you need into JavaScript variables.
With the PHP SDK, after instantiation, use the facebook object as follows:
$signed_request = $facebook->getSignedRequest();
var_dump($signed_request) ;
This is just for debugging, but you'll see that the printed array contains many values that you won't get with the JS SDK, for security reasons.
Hope this helps anyone who needs it, because this issue seems to cost at least 3 hours for everyone who runs into it.
I am trying to scrape this page with scrapy:
http://www.barnesandnoble.com/s?dref=4815&sort=SA&startat=7391
and the response I get is different from what I see in the browser. The browser shows the correct page, while the scrapy response is the page at:
http://www.barnesandnoble.com/s?dref=4815&sort=SA&startat=1
I have tried with urllib2 but still have the same issue. Any help is much appreciated.
I don't really understand the issue, but usually a different response for a browser and scrapy is caused by one of these:
the server analyzes your User-Agent header, and returns a specially crafted page for mobile clients or bots;
the server analyzes the cookies, and does something special when it looks like you are visiting for the first time;
you are trying to make a POST request via scrapy like the browser does, but you forgot some form fields, or put wrong values
etc.
There is no universal way to determine what's wrong, because it depends on the server logic, which you don't know. If you are lucky, you will analyze and fix all the mentioned issues and will make it work.
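One way to test the first two causes is to send a browser-like User-Agent and keep cookies between requests. The following is a rough sketch using only the Python 3 standard library (urllib.request, the successor to urllib2). The User-Agent string and the function names are placeholders chosen for the example, and whether this changes the response depends entirely on the server.

```python
import http.cookiejar
import urllib.request

# Placeholder value; copy the real string from your browser's network tab.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def make_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def browser_like_request(url, user_agent=BROWSER_UA):
    """Build a GET request carrying headers similar to a browser's."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", user_agent)
    req.add_header("Accept", "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8")
    return req


def fetch_with_warmup(first_url, target_url):
    """Visit first_url to pick up cookies, then fetch target_url with them."""
    opener, jar = make_session()
    opener.open(browser_like_request(first_url)).read()
    with opener.open(browser_like_request(target_url)) as resp:
        return resp.geturl(), resp.read(), len(jar)
```

Calling fetch_with_warmup with the site's front page and then the startat=7391 URL, and comparing the returned final URL with the one requested, would show whether the server redirected. If the result still differs from the browser, the next step is to compare the remaining request headers one by one.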
What I'm trying to achieve is something similar to an Add-on called Live Http Headers used with Firefox. I'm not trying to get the Headers or cookies, but the links that load on the page itself. Let us assume I visited Mail.Yahoo.com, this is pretty much what you would see when I use the add-on.
CLICK HERE
How can I achieve something similar? I only want the links that load on the page itself!
I'm looking forward to reading your suggestions, please enlighten me if you know!
You can download the webpage using a WebClient instance.
Then, with the resulting string, you can get the URLs using a regular expression:
http://www.geekzilla.co.uk/view2D3B0109-C1B2-4B4E-BFFD-E8088CBC85FD.htm
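The same idea can be sketched in Python (not the C# WebClient the answer refers to), with the standard library's HTML parser swapped in for the regular expression. The class and function names are invented for the example. It only sees URLs present in the downloaded markup, so anything requested later by scripts will not appear.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects every href/src attribute value, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Turn relative references into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the URLs referenced by the page, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

The html argument would be the string downloaded in the first step, for example the decoded body returned by urllib.request.urlopen(url).read().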
We're in the middle of writing a lot of URL rewrite code that would basically take ourdomain.com/SomeTag and do something dynamic to figure out what to display.
Now if the Tag doesn't exist in our system, we're going to display some information to help them find what they were looking for.
And now the question came up, do we need to send a 404 header? Should we? Are there any reasons to do it or not to do it?
Thanks
Nathan
You aren't required to, but it can be useful for automated checkers to detect the response code instead of having to parse the page.
I certainly send proper response codes in my applications, especially when I have database errors or other fatal errors. Then the search engine knows to give up and retry later instead of indexing the page. For example, I send code 503 for "Service Unavailable" together with a Retry-After: 600 header to tell it to try again in 10 minutes. Search engines won't take this badly.
404 codes are sent when the page should not be indexed or doesn't exist (e.g. non-existent tag)
So yes, do send status codes.
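As a rough illustration, a minimal Python WSGI sketch could look like this. The tag set, the backend_is_up check and the response texts are all hypothetical. The point is only that the helpful page for an unknown tag still goes out with a 404 status, and an outage goes out as 503 with Retry-After.

```python
def make_app(known_tags, backend_is_up=lambda: True):
    """Build a WSGI app that maps /SomeTag to content or to a proper error."""

    def app(environ, start_response):
        headers = [("Content-Type", "text/plain; charset=utf-8")]
        tag = environ.get("PATH_INFO", "/").strip("/")

        if not backend_is_up():
            # Fatal error: ask crawlers to come back in 600 seconds.
            status = "503 Service Unavailable"
            headers.append(("Retry-After", "600"))
            body = "Temporarily unavailable, please try again later."
        elif tag in known_tags:
            status = "200 OK"
            body = "Content for tag %r" % tag
        else:
            # Still show helpful text, but mark the page as not found.
            status = "404 Not Found"
            body = "No tag named %r. Try the search page instead." % tag

        start_response(status, headers)
        return [body.encode("utf-8")]

    return app
```

For a quick local check, the app can be served with wsgiref.simple_server.make_server from the standard library and requested with curl -i to see the status line.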
I say do it. If the client is actually an application acting on behalf of the user (e.g. cURL, wget, something custom, etc.), then a 404 would actually help quite a bit.
You have to keep in mind that the result code you return is not for the user; for the standard user, error codes are meaningless so don't display this info to the user.
However think about what could happen if the crawlers access your pages and consider them valid (with a 200 response); they will start indexing the content and your page will be added to the index. If you tell the search engine to index the same content for all your not found pages, it will certainly affect your ranking and if one page appears in the top search results, you will look like a fool.