I have been trying to scrape TireRack.com to find all their product names, images, sizes, and prices, but nothing seems to work. I believe this is because the "main" tire page is simply a search box where one has to input tire dimensions, vehicle model, etc. I was wondering if there was a way around this.
You can use a Connector to simulate a search on the website, or create a Crawler and train it on the pages you'd like to scrape (for example, the BLIZZAK DM-V2 product page). Then give it URLs like their "Our Catalog" page so it follows all the available links that match the pages you trained it on.
You can also check these links:
Create a Connector
Create a Crawler
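If you'd rather script the search yourself instead of using a Connector, here is a minimal hedged sketch in Python with requests. The endpoint and parameter names are hypothetical; inspect the real form submission in your browser's network tab and substitute what you find there.

```python
# Hedged sketch, not the Connector approach above: simulate the tire-size
# search directly over HTTP. Endpoint and parameter names are hypothetical.
import requests

SEARCH_URL = "https://www.tirerack.com/tires/TireSearchResults.jsp"  # hypothetical

def search_by_size(width, ratio, diameter):
    params = {"width": width, "ratio": ratio, "diameter": diameter}  # hypothetical names
    resp = requests.get(SEARCH_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.text  # raw results HTML to parse for names, images, sizes, prices

if __name__ == "__main__":
    print(len(search_by_size("225", "45", "17")))
```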
Trying to figure out how I can search across multiple sites using the Google Custom Search JSON API, meaning that the search will only cover a specific list of sites.
I was playing with the API explorer - https://developers.google.com/custom-search/v1/reference/rest/v1/cse/list?apix_params=%7B%22cx%22%3A%22011602274690322925368%3Atkz2zvvpmk0%22%2C%22siteSearch%22%3A%22www.walla.co.il%22%7D
and noticed the siteSearch query key, but it can only accept a single string, not a list of sites.
What is the way to search only in specific sites?
Thanks
There are a couple of things you can do.
If you know the specific sites you want to search, you can add them as refinements to your engine. Then query for that refinement by adding 'more:<REFINEMENT_LABEL>' to the query.
Or, add 'site:' operators to the query itself. For example: cats site:cnn.com OR site:bbc.com
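If you are calling the JSON API directly, here is a minimal sketch of the second option (the key and engine ID are placeholders; the endpoint is the documented Custom Search v1 REST URL):

```python
# Sketch: restrict results to specific sites by chaining site: operators
# with OR inside the q parameter itself.
import requests

API_KEY = "YOUR_API_KEY"      # placeholder
ENGINE_ID = "YOUR_ENGINE_ID"  # placeholder (the cx value)

def search_sites(query, sites):
    # e.g. "cats site:cnn.com OR site:bbc.com"
    q = query + " " + " OR ".join("site:" + s for s in sites)
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": ENGINE_ID, "q": q},
        timeout=30,
    )
    resp.raise_for_status()
    return [item["link"] for item in resp.json().get("items", [])]

print(search_sites("cats", ["cnn.com", "bbc.com"]))
```

The same pattern works for the first option: append more:<REFINEMENT_LABEL> to q instead of the site: operators.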
TOPIC - Google Search Engine / Custom Search - with Database
References
Search for "Google Search Engine" and "Google Custom Search"
(New to StackOverflow; just joined the other day. I'm limited to 2 links I can post right now.)
NOTE:
I have not YET decided/committed to any specific coding language, framework, etc. Not until I figure out how to accomplish my question (below).
BACKGROUND INFO
What I'm trying to do (for now) is add a search box / search engine to a simple website I'm building out. Before I get too far into it (planning ahead), I would like to use Google CSE if at all possible (it can do A LOT of things and works well). However, I will have a database (not sure of the type yet; it will depend on what my options are and what I can do with CSE) of "items" that I want to be able to search quickly in the search box, i.e. like Amazon.com.
QUESTION:
Is there any way at all to use Google Custom Search and/or the Custom Search API to search/attach a database (SQL, NoSQL, or other)? I would HIGHLY prefer to do all of this in Google Cloud Platform, using one of their storage/database products.
If I understand what you're trying to do, Google CSE is enough.
From the Google doc you linked:
#Defining a Custom Search Engine in Control Panel
In the Sites to search section, add the pages you want to include in
your search engine. You can include any sites you want, not just the
sites you own. You can include whole site URLs or individual pages
URLs. You can also use URL patterns.
#Enabling Autocomplete
[...] you can enable or disable the autocomplete feature using the enableAutoComplete attribute.
As for the "Is there any way at all [...] to search a database" part, I'd say not directly, but it's not a big problem.
Google CSE works on "indexable web pages", so it won't work against a raw DB, a restricted intranet, or a custom network not served over http(s)://.
But in your case, if you build a DB, I suppose you'll have to build web pages to display the stored data to your users (like product pages on Amazon)?
If so, you can then run Google CSE against those pages by adding your http://[server ip] or http://[domain name] to the whitelist.
As far as I know, custom search won't guarantee all your content will be indexed.
You probably want to try exporting a full sitemap.xml or an RSS feed, and if the custom search results from either of those don't satisfy you, look at the Google Search Appliance product.
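If you go the sitemap route, here is a minimal sketch of a one-off export, assuming your items sit in a SQL table (the database file, schema, and URL shape are all placeholders):

```python
# Sketch: dump every item URL into a sitemap.xml that the engines can crawl.
import sqlite3
from xml.sax.saxutils import escape

rows = sqlite3.connect("site.db").execute("SELECT id FROM items")  # placeholder schema

with open("sitemap.xml", "w") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for (item_id,) in rows:
        f.write("  <url><loc>%s</loc></url>\n"
                % escape("https://www.example.com/item/%s" % item_id))
    f.write("</urlset>\n")
```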
There's also http://sphinxsearch.com/ by the way.
I'd like to hear about the differences between three approaches to using Scrapy to crawl 1000 sites.
For example, say I want to scrape 1000 photo sites that almost all share the same structure: one kind of photo list page and another kind of large photo page, but the HTML of these list and photo-description pages won't be identical across sites.
Another example: I want to scrape 1000 WordPress blogs, only the blogs' articles.
The first is crawling all 1000 sites using one Scrapy project.
The second is having all 1000 sites under the same Scrapy project, with all items in items.py and each site having its own spider.
The third is similar to the second, but with one spider for all the sites instead of separating them.
What are the differences, and which do you think is the right approach? Is there any other, better approach I've missed?
I had 90 sites to pull from, so creating one crawler per site wasn't a great option. The idea was to be able to run in parallel. I also split the work so that similar page formats were packed in one place.
So I ended up with 2 crawlers:
Crawler 1 - URL Extractor. This would extract all detail-page URLs from the top-level listing pages into a file (or files).
Crawler 2 - Fetch Details.
This would read from the URL file and extract item details.
This allowed me to fetch URLs first and estimate the number of threads I might need for the second crawler.
Since each crawler worked on a specific page format, there were quite a few functions I could reuse.
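A hedged sketch of that two-crawler split in Scrapy (the start URL and all selectors are placeholders; each site's markup will differ):

```python
import json
import scrapy

class UrlExtractorSpider(scrapy.Spider):
    # Crawler 1: pull detail-page URLs from the top-level listing pages.
    # In a Scrapy project, run: scrapy crawl url_extractor -o detail_urls.jl
    name = "url_extractor"
    start_urls = ["https://example-photos.com/gallery"]  # placeholder

    def parse(self, response):
        # placeholder selector; adjust per site's listing markup
        for href in response.css("a.photo-link::attr(href)").getall():
            yield {"url": response.urljoin(href)}

class DetailSpider(scrapy.Spider):
    # Crawler 2: read the URL file and extract the item details.
    name = "detail_fetcher"

    def start_requests(self):
        with open("detail_urls.jl") as f:
            for line in f:
                yield scrapy.Request(json.loads(line)["url"])

    def parse(self, response):
        yield {
            "title": response.css("h1::text").get(),             # placeholder
            "image": response.css("img.main::attr(src)").get(),  # placeholder
        }
```

Running the extractor first also gives you the URL count up front, which is what lets you size the second run.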
I've got a site where users can create groups (we call them games)
www.ongoingworlds.com/games/270/
www.ongoingworlds.com/games/287/ etc
Each of these games has its own user-generated content. I want to use a Google custom search for each game, but I can't see an easy way to amend the embed code to add a dynamic path, and I don't want to have to register hundreds of GCSEs separately to get an embed code for each.
What would be the best way of allowing each of these URLs (above) to have their own GCSE?
You can search subparts of your site by using a combination of the site: operator and the webSearchQueryAddition parameter on the gcse element.
webSearchQueryAddition appends an additional search term to your user's query. If, for each of the "games", you change webSearchQueryAddition to point at that game's base URL, the search results will match that URL. You can inject the parameter programmatically, e.g. with JavaScript, for each of the games.
Documentation is here: https://developers.google.com/custom-search/docs/element#supported_attributes
And here is a working example:
http://jsfiddle.net/t2s5M/
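A hedged Python sketch of that injection, using Flask purely as an example server; the cx value is a placeholder, and the data- attribute spelling should be verified against the element docs linked above:

```python
# Sketch: serve one shared GCSE embed, appending a per-game site: restriction
# via webSearchQueryAddition so results stay within that game's pages.
from flask import Flask

app = Flask(__name__)

EMBED = """
<script async src="https://cse.google.com/cse.js?cx=YOUR_CX_ID"></script>
<div class="gcse-search"
     data-webSearchQueryAddition="site:www.ongoingworlds.com/games/{game_id}/">
</div>
"""

@app.route("/games/<int:game_id>/search")
def game_search(game_id):
    return EMBED.format(game_id=game_id)
```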
I'm working on improving the site for the SEO purposes and hit an interesting issue. The site, among other things, includes a large directory of individual items (it doesn't really matter what these are). Each item has its own details page, which is accessed via
http://www.mysite.com/item.php?id=item_id
or
http://www.mysite.com/item.php/id/title
The directory is large - it has about 100,000 items in it. Naturally, only a few items are listed on any given page. For example, the main site homepage links to about 5 or 6 items, some other page links to about a dozen different items, etc.
When real users visit the site, they can use the search form to find items by keyword or location, which produces a list matching their search criteria. However, when a Google crawler visits the site, it won't attempt to put text into the keyword search field and submit the form. So as far as the bot is concerned, after indexing the entire site it has covered only a few dozen items at best. Naturally, I want it to index each individual item separately. What are my options here?
One thing I considered is to check the user agent and IP ranges and if the requestor is a bot (as best I can say), then add a div to the end of the most relevant page with links to each individual item. Yes, this would be a huge page to load - and I'm not sure how google bot would react to this.
Any other things I can do? What are best practices here?
Thanks in advance.
One thing I considered is to check the user agent and IP ranges and if the requestor is a bot (as best I can say), then add a div to the end of the most relevant page with links to each individual item. Yes, this would be a huge page to load - and I'm not sure how google bot would react to this.
That would be a very bad thing to do. Serving up different content to the search engines specifically for their benefit is called cloaking and is a great way to get your site banned. Don't even consider it.
Whenever a webmaster is concerned about getting their pages indexed, having an XML sitemap is an easy way to ensure the search engines are aware of the site's content. They're very easy to create and update, too, if your site is database driven. The XML file does not have to be static, so you can dynamically produce it whenever the search engines request it (Google, Yahoo, and Bing all support XML sitemaps). You can find out more about XML sitemaps at sitemaps.org.
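For example, here is a minimal sketch of serving the sitemap dynamically (framework, database, and schema are placeholder choices; the URL shape is copied from the question):

```python
# Sketch: build sitemap.xml on demand from the item table, so it is always
# current when a search engine fetches /sitemap.xml.
import sqlite3
from xml.sax.saxutils import escape
from flask import Flask, Response

app = Flask(__name__)

@app.route("/sitemap.xml")
def sitemap():
    rows = sqlite3.connect("site.db").execute("SELECT id, slug FROM items")  # placeholder schema
    urls = "".join(
        "<url><loc>%s</loc></url>"
        % escape("http://www.mysite.com/item.php/%s/%s" % (item_id, slug))
        for item_id, slug in rows
    )
    return Response(
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        + urls + "</urlset>",
        mimetype="application/xml",
    )
```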
If you want to make your content available to search engines and want to benefit from semantic markup (i.e. HTML), you should also make sure all of your content can be reached through hyperlinks (in other words, not through form submissions or JavaScript). The reason for this is twofold:
The anchor text in the links to your items will contain the keywords you want to rank well for. This is one of the more heavily weighted ranking factors.
Links count as "votes", especially to Google. Links from external websites, especially related websites, are what you'll hear people recommend the most and for good reason. They're valuable to have. But internal links carry weight, too, and can be a great way to prop up your internal item pages.
(Bonus) Google has PageRank which used to be a huge part of their ranking algorithm but plays only a small part now. But it still has value and links "pass" PageRank to each page they link to increasing the PageRank of that page. When you have as many pages as you do that's a lot of potential PageRank to pass around. If you built your site well you could probably get your home page to a PageRank of 6 just from internal linking alone.
Having an HTML sitemap that somehow links to all of your products is a great way to ensure that search engines, and users, can easily find all of your products. It is also recommended that you structure your site so more important pages are closer to the root of your website (home page), branching out to sub pages (categories) and then to specific items. This gives search engines an idea of which pages are important and helps them organize them (which helps them rank them). It also helps them follow those links from top to bottom and find all of your content.
Each item has its own details page, which is accessed via
http://www.mysite.com/item.php?id=item_id
or
http://www.mysite.com/item.php/id/title
This is also bad for SEO. When you can pull up the same page using two different URLs you have duplicate content on your website. Google is on a crusade to increase the quality of their index and they consider duplicate content to be low quality. Their infamous Panda Algorithm is partially out to find and penalize sites with low quality content. Considering how many products you have it is only a matter of time before you are penalized for this. Fortunately the solution is easy. You just need to specify a canonical URL for your product pages. I recommend the second format as it is more search engine friendly.
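A small hedged helper for that, with the URL shape copied from the question and the slug handling assumed:

```python
# Sketch: every render of item.php emits one canonical URL, whichever of the
# two address forms the visitor actually used.
def canonical_tag(item_id, title_slug):
    # prefer the /item.php/id/title form, as recommended above
    return ('<link rel="canonical" href="http://www.mysite.com/item.php/%s/%s"/>'
            % (item_id, title_slug))

# Place the returned tag inside the page's <head>.
print(canonical_tag(42, "blue-widget"))
```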
Read my answer to an SEO question at the Pro Webmaster's site for even more information on SEO.
I would suggest for starters having an xml sitemap. Generate a list of all your pages, and submit this to Google via webmaster tools. It wouldn't hurt having a "friendly" sitemap either - linked to from the front page, which lists all these pages, preferably by category, too.
If you're concerned with SEO, then having links to your pages is hugely important. Google could see your page and think "wow, awesome!" and give you lots of authority - this authority (some like to call it "link juice") is then passed down to the pages linked from it. You ought to make a hierarchy of pages, with more important ones closer to the top, and/or make it wide rather than deep.
Also, showing different stuff to the Google crawler than the "normal" visitor can be harmful in some cases, if Google thinks you're trying to con it.
Sorry - a little biased toward Google here - but the other engines are similar.