Is there any way to set JSESSIONID while scraping with Scrapy? [closed]

I am writing a Scrapy spider for this website:
https://www.garageclothing.com/ca/
The site uses a JSESSIONID cookie, and I want to obtain it in my spider's code.
Can anybody guide me on how I can get the JSESSIONID in my code?
Currently I just copy and paste the JSESSIONID from the browser's developer tools after visiting the site in the browser.

This site uses JavaScript to set JSESSIONID. But if you disable JavaScript and try to load the page, you'll see that it requests the following URL:
https://www.dynamiteclothing.com/?postSessionRedirect=https%3A//www.garageclothing.com/ca&noRedirectJavaScript=true (1)
which redirects you to this URL:
https://www.garageclothing.com/ca;jsessionid=YOUR_SESSION_ID (2)
So you can do the following (see the sketch below):
start requests with URL (1)
in the callback, extract the session ID from URL (2) (which will be stored in response.url)
make the requests you want, passing the extracted session ID in the cookies
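A minimal sketch of that approach, assuming Scrapy's default redirect middleware is enabled so the spider ends up on URL (2); the spider name, the follow-up URL and the parse_page callback are illustrative placeholders:

import re
import scrapy

class GarageSpider(scrapy.Spider):
    name = "garage"  # hypothetical spider name

    # URL (1): requesting it with redirects enabled lands on URL (2),
    # which carries the jsessionid in its path.
    start_urls = [
        "https://www.dynamiteclothing.com/?postSessionRedirect="
        "https%3A//www.garageclothing.com/ca&noRedirectJavaScript=true"
    ]

    def parse(self, response):
        # response.url now looks like
        # https://www.garageclothing.com/ca;jsessionid=YOUR_SESSION_ID
        match = re.search(r"jsessionid=([^;?&]+)", response.url)
        if not match:
            self.logger.error("No jsessionid found in %s", response.url)
            return
        session_id = match.group(1)

        # Reuse the session ID as a cookie for the requests you actually need.
        yield scrapy.Request(
            "https://www.garageclothing.com/ca/",
            cookies={"JSESSIONID": session_id},
            callback=self.parse_page,
        )

    def parse_page(self, response):
        # Parse the target page here.
        pass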

Related

I have disallowed everything for 10 days [closed]

Due to a deployment error, I put into production a robots.txt file that was intended for a test server. As a result, production ended up with this robots.txt:
User-Agent: *
Disallow: /
That was 10 days ago, and I now have more than 7,000 URLs flagged with the error "Submitted URL blocked by robots.txt" or the warning "Indexed, though blocked by robots.txt".
Yesterday, of course, I corrected the robots.txt file.
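For reference, a corrected robots.txt that allows crawling again would presumably just contain an empty Disallow rule:
User-agent: *
Disallow: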
What can I do to speed up the correction by Google or any other search engine?
You could use the robots.txt testing tool: https://www.google.com/webmasters/tools/robots-testing-tool
Once the robots.txt test has passed, click the "Submit" button and a popup window should appear. Then click the "Submit" button for option #3:
Ask Google to update
Submit a request to let Google know your robots.txt file has been updated.
Other than that, I think you'll have to wait for Googlebot to crawl the site again.
Best of luck :).

Canonical Link Element for Dynamic Pages (rel="canonical") [closed]

I have a stack system that passes page tokens in the URL. My pages are also dynamically created content, so I have one PHP page that accesses the content with parameters.
index.php?grade=7&page=astronomy&pageno=2&token=foo1
I understand the search-indexing goal to be: have only one link per unique set of data on your website.
Bing has a way to specify specific parameters to ignore.
Google, it seems, uses rel="canonical", but is it possible to use this to tell Google to ignore the token parameter? My URLs (without tokens) can be anything like:
index.php?grade=5&page=astronomy&pageno=2
index.php?grade=6&page=math&pageno=1
index.php?grade=7&page=chemistry&page2=combustion&pageno=4
If there is no solution for Google, other possible approaches:
If I provide a sitemap for each base page, I can supply base URLs, but any crawling of that page's links will create tokens on the resulting pages. Plus, I would have to constantly recreate the sitemap to cover new pages (e.g. 25 posts per page, post 26 is on page 2).
One idea I've had is to identify bots on page load (I do this already) and disable all tokens for bots. Since (I'm presuming) bots don't use session data between pages anyway, the back buttons and editing features are useless. Is it feasible (or is it crazy) to write custom code for bots?
Thanks for your thoughts.
You can use Google Webmaster Tools to tell Google to ignore certain URL parameters.
This is covered on the Google Webmaster Help page.
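For completeness, the rel="canonical" approach mentioned in the question works by emitting the token-free URL as the canonical on every tokenized variant of the page. A hypothetical example for the first URL above (example.com is a placeholder; & is escaped as &amp; inside the attribute):
<link rel="canonical" href="https://www.example.com/index.php?grade=7&amp;page=astronomy&amp;pageno=2">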

How to use the domain search HTTP API provided by ResellerClub [closed]

I have been provided the ResellerClub HTTP API for domain search. The URL is as follows:
https://test.httpapi.com/api/domains/available.json?auth-userid=0&auth-password=password&domain-name=domain1&domain-name=domain2&tlds=com&tlds=net
Now I am not understanding how to use it, and what should be in place of test.httpapi.com.
Also, when I use my own domain name, say www.x.in, I use x.httpapi.com with the valid parameters, which makes the URL
https://x.httpapi.com/api/domains/available.json?auth-userid=xxxx&auth-password=xxxxx&domain-name=test.com&domain-name=test2.com&tlds=com&tlds=net
it shows an SSL error, and when I use
www.x.httpapi.com/api/domains/available.json?auth-userid=xxxx&auth-password=xxxxx&domain-name=test.com&domain-name=test2.com&tlds=com&tlds=net
it shows an nginx error.
Please advise.
You can find most of this information buried inside the ResellerClub HTTP API documentation; however, here is what you need to do in order to get going.
There is nothing wrong with the URL. That is their testing URL.
https://test.httpapi.com/api/domains/available.json?auth-userid=0&auth-password=password&domain-name=domain1&domain-name=domain2&tlds=com&tlds=net
...is the URL to check the availability of domain names. You are supposed to replace the auth-userid parameter with your ResellerClub user ID (you can get this from the reseller control panel) and the auth-password parameter with your password.
If your user ID is 123456 and password is albatross then the URL will look like:
https://test.httpapi.com/api/domains/available.json?auth-userid=123456&auth-password=albatross&domain-name=google&tlds=com&tlds=net
This URL will output JSON on your browser screen.
This URL is not working for you because you need to add your IP address to the list of white-listed IP addresses allowed to make API calls. You can find this setting inside your ResellerClub control panel: go to Settings -> API and add your IP address. Within 30 minutes, the URL will start returning JSON.
To use it within your code, white-list your server's IP in the control panel and then make the HTTP request from your code, for example with cURL or a socket connection.
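As a rough sketch (not part of the original answer), the same call could look like this in Python with the requests library, assuming your IP has already been white-listed; the credentials and the queried domain names are placeholders:

import requests

# Placeholder credentials: substitute your ResellerClub user ID and password.
AUTH_USERID = "123456"
AUTH_PASSWORD = "albatross"

resp = requests.get(
    "https://test.httpapi.com/api/domains/available.json",
    params=[
        ("auth-userid", AUTH_USERID),
        ("auth-password", AUTH_PASSWORD),
        ("domain-name", "example"),   # repeated keys are required by this API,
        ("domain-name", "example2"),  # hence the list of tuples
        ("tlds", "com"),
        ("tlds", "net"),
    ],
)
resp.raise_for_status()
print(resp.json())  # availability results as JSON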
Let me know if this post helped.

Creating an online website builder with Rails [closed]

I want to create a website builder. What I'm thinking is to have one server as the main web server.
My concept is as follows:
1 - the user enters a URL (http://www.userdomain.com)
2 - it masks and redirects to one of my custom domains (http://www.myapp.userdomain.com)
3 - from the custom domain (myapp.userdomain) my application will identify the website
4 - according to the website, it will render the pages
My concerns are:
1 - is this the proper way of doing something like this (an online website builder)?
2 - since I'm masking the URL, I will not be able to do something like http://www.myapp.userdomain.com/products, and if the user refreshes the page it goes to the home page (http://www.myapp.userdomain.com). How do I avoid that?
3 - I'm thinking of using Rails and Liquid for this. Would that be a good option?
Thanks in advance.
Cheers,
sameera
Masking domains with redirects is going to get messy, plus all those redirects may not play nicely with SEO. Rails doesn't care if you host everything under a common domain name; it's just as easy to detect the requested domain name as it is the requested subdomain.
I suggest pointing all of your end-user domains directly to the IP of your main server so that redirects are not required. Use the :domain and :subdomain conditions in the Rails router, or parse them in your application controller, to determine which site to actually render based on the hostname the user requested. This gives you added flexibility later, as you could tell Apache or Nginx which domains to listen for and set up different instances of your application to support rolling upgrades and things like that.
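A rough sketch of that idea in Rails, using a lambda routing constraint; the Site model, the sites#show action, and the myapp.example domain are illustrative placeholders, not part of the original answer:

# config/routes.rb
Rails.application.routes.draw do
  # Anything not addressed to the builder's own domain is treated as a customer site.
  not_our_domain = ->(req) { req.host != "myapp.example" }
  get "/", to: "sites#show", constraints: not_our_domain
  get "/*path", to: "sites#show", constraints: not_our_domain
end

# app/controllers/sites_controller.rb
class SitesController < ApplicationController
  def show
    # Resolve which customer site to render from the requested hostname.
    @site = Site.find_by!(domain: request.host)
    # ...render the site's page for params[:path]...
  end
end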
This sounds like #wukerplank's approach, and I agree. Custom routing that looks at the domain name of the current request keeps the rest of your application simple.
You may also get some help by looking at how existing online site builders work; you could look at Wix, Weebly, EcoSiteBuilder, WordPress, and many others.

Would a 401 error be a good choice? [closed]

One of my sites has a lot of restricted pages which are only available to logged-in users; for everyone else it outputs a default "you have to be logged in ..." view.
The problem is that a lot of these pages are listed on Google with the not-logged-in view, and it looks pretty bad when 80% of the pages in the list have the same title and description/preview.
Would it be a good choice to send a 401 Unauthorized header along with my default not-logged-in view? And would this stop Google (and other engines) from indexing these pages?
Thanks!
(and if you have another (better?) solution I would love to hear about it!)
Use a robots.txt file to tell search engines not to index the not-logged-in pages.
http://www.robotstxt.org/
Ex.
User-agent: *
Disallow: /error/notloggedin.html
401 Unauthorized is the response code for requests that require user authentication, so this is exactly the response code you want and have to send. See the HTTP Status Code Definitions.
EDIT: Your previous suggestion, response code 403, is for requests where authentication makes no difference, e.g. disabled directory browsing.
Here are the status codes Googlebot understands, along with recommendations:
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40132
In your case, an HTTP 403 would be the right one.
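Whichever of the two codes you pick, the key point is to keep serving the same "you have to be logged in" view but with a non-200 status. A minimal sketch, assuming a Python/Flask app purely for illustration (the question does not say which stack is in use):

from flask import Flask, session

app = Flask(__name__)
app.secret_key = "change-me"  # placeholder

LOGIN_PROMPT = "<p>You have to be logged in to see this page.</p>"

@app.route("/members/<page>")
def members_page(page):
    if not session.get("user_id"):
        # Same "please log in" body as before, but with a 403 (or 401) status
        # so search engines do not treat it as a normal indexable page.
        return LOGIN_PROMPT, 403
    return f"Restricted content for {page}"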