I am new to Scrapy and know very little Python. So far, I have installed Scrapy and gone through a few tutorials. Since then, I have been trying to find a way to search many sites for the same data. My goal is to use Scrapy to find links to "posts" that match a few search criteria. As an example, I would like to search sites A, B, and C. On each site, I would like to see whether there is a "post" about app name X, Y, or Z. If there are any "posts" about X, Y, or Z, I would like it to grab the link to each post. If it would be easier, it could instead scan each post for our company name: rather than X, Y, Z, it would search the contents of each "post" for [Example Company name]. The reason I am doing it this way is so that the JSON that is created contains only links to the "posts", so that we can review them and contact the website if need be.
I am on Ubuntu 10.12 and I have been able to scrape the sites that we want, but I have not been able to narrow down the JSON to the needed info. So currently we still have to go through hundreds of links, which is what we wanted to avoid by doing this. The reason we are getting so many links is that all the tutorials I have found are for scraping a specific HTML tag. I want it to search the tag to see if it contains any part of our app titles or package name.
With the code below, it displays the post info, but we still have to pick the links out of the JSON. It saves time but is still not really what we want. Part of the problem, I think, is that I am not referencing or calling it correctly. Please give me any help that you can. I have spent hours trying to figure this out.
posts = hxs.select("//div[@class='post']")
items = []
for post in posts:
    item = ScrapySampleItem()
    item["title"] = post.select("div[@class='bodytext']/h2/a/text()").extract()
    item["link"] = post.select("div[@class='bodytext']/h2/a/@href").extract()
    item["content"] = post.select("div[@class='bodytext']/p/text()").extract()
    items.append(item)
for item in items:
    yield item
I want to use this to cut down on piracy of our Android apps. If I can have this go out and search the piracy sites that we want, I can then email the site or hosting company with all of the links that we want removed. Under copyright law, they have to comply, but they require that we link them to every infringing "post", which is why app developers normally do not bother with this kind of thing. These sites host hundreds of apps, so finding the links to your own apps takes many hours of work.
Thank you for any help you can offer in advance. You will be helping out many App Developers in the long run!
Grady
Your XPath selectors are absolute. They have to be relative to the previous selector (the .)
posts = hxs.select('//div[@class="post"]')
for post in posts:
    item = ScrapySampleItem()
    item['title'] = post.select('.//div[@class="bodytext"]/h2/a/text()').extract()
    item['link'] = post.select('.//div[@class="bodytext"]/h2/a/@href').extract()
    item['content'] = post.select('.//div[@class="bodytext"]/p/text()').extract()
    yield item
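To narrow the output to posts that mention your apps, one option is to check each post's extracted text against a keyword list before yielding the item. The following is a minimal sketch in plain Python, assuming the extracted fields are lists of strings (as .extract() returns). The helper name mentions_any and the keyword list are hypothetical and not part of Scrapy.

```python
def mentions_any(fragments, keywords):
    """Return True if any keyword occurs, case-insensitively, in any text fragment."""
    text = " ".join(fragments).lower()
    return any(keyword.lower() in text for keyword in keywords)


# Hypothetical search terms: app titles, package names, or a company name.
KEYWORDS = ["App X", "com.example.appx", "Example Company"]

# Stand-ins for the lists that post.select(...).extract() would return.
title = ["Download App X Pro free"]
content = ["Latest cracked build, no license needed."]

if mentions_any(title + content, KEYWORDS):
    print("keep this post")  # in the spider, this is where the item would be yielded
```

Inside the loop, the yield could then be guarded with a check such as mentions_any(item['title'] + item['content'], KEYWORDS), so that only matching posts end up in the JSON.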
This is my link for getting one random article using Wiki API:
https://en.wikipedia.org/w/api.php?%20format=json&action=query&prop=extracts&exsentences=2&exintro=&explaintext=&generator=random&grnnamespace=0
I need to get from it the first two sentences of the first section, and it works pretty well.
I want to use this kind of link and search this random article in a specific category. This is what I have tried after searching online:
https://en.wikipedia.org/w/api.php?%20format=json&action=query&prop=extracts&exsentences=2&exintro=&explaintext=&generator=random&grnnamespace=0&cmtitle=Category:Music
(I have added this part to the original link: cmtitle=Category:Music )
It doesn't work for me.
It gets the random article like the first link (not under a wanted category, which is Music in this link).
There is no API to get a random category member (and using a parameter from some unrelated API module is certainly not going to help). You could screen scrape Special:RandomInCategory (or turn it into an API module - patches welcome :)
Try using cmlimit to get all of the category members. Then use a programming language such as Python to request the page, store every category member in an array, and use the random module to pick a random member from that array. You can then use it in a link to get the specific page for that category member, or anything else that you need.
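A rough sketch of that approach in Python, using only the standard library. It assumes the list=categorymembers module with the cmtitle, cmnamespace and cmlimit parameters; the function names and the User-Agent string are made up for illustration. It reads only the first batch of results, so a large category would also need continuation handling, which is left out here.

```python
import json
import random
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def categorymembers_url(category, limit=500):
    """Build a list=categorymembers query URL for one category."""
    params = {
        "format": "json",
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmnamespace": 0,  # main (article) namespace only
        "cmlimit": limit,
    }
    return API_URL + "?" + urlencode(params)


def member_titles(response):
    """Extract the page titles from a decoded categorymembers response."""
    return [member["title"] for member in response["query"]["categorymembers"]]


def random_member(category):
    """Fetch one batch of members and pick one at random (needs network access)."""
    # Placeholder User-Agent; replace it with one that identifies your script.
    request = Request(categorymembers_url(category),
                      headers={"User-Agent": "random-in-category-sketch/0.1"})
    with urlopen(request) as resp:
        data = json.load(resp)
    return random.choice(member_titles(data))
```

Calling random_member("Category:Music") returns a title that can be passed as titles=... to the extracts query from the question in place of generator=random.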
Google displays my website’s page title differently to how it is meant to be.
The page title should be:
Graphic Designer Brighton and Lewes | Lewis Wallis Graphic Design
It displays fine in Bing, Yahoo and on my actual website.
However, Google displays it differently:
Lewis Wallis Graphic Design: Graphic Designer Brighton and Lewes
This is annoying as I want my keywords "graphic designer brighton" to go before my name.
I am using the Yoast SEO plugin and my only suspicion is that there might be a conflict between that and my theme, Workality.
Has anyone got any suggestions as to why this might be happening?
Google Search may change webpage titles they show in the result page (since 2012-01):
We use many signals to decide which title to show to users, primarily the <title> tag if the webmaster specified one. But for some pages, a single title might not be the best one to show for all queries, and so we have algorithms that generate alternative titles to make it easier for our users to recognize relevant pages.
See also the documentation at http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35624:
Google's generation of page titles and descriptions (or "snippets") is completely automated and takes into account both the content of a page as well as references to it that appear on the web. The goal of the snippet and title is to best represent and describe each result and explain how it relates to the user's query.
[…]
While we can't manually change titles or snippets for individual sites, we're always working to make them as relevant as possible.
In my answer on Webmasters SE I linked to questions from people having the same issue.
Is it possible that you changed the title, or installed the plugin, and Google hasn't picked up the changes yet?
It can take a few weeks for Google to pick up changes to your site, depending on how often it spiders it. The HTML looks fine so I can only think that Google hasn't got round to picking up the changes yet.
I have tagged over 700 blog posts with tags containing hyphens, and these tags suddenly stopped working in 2011, because Tumblr decided (without any notice) to forbid hyphens in tags. I guess hyphens are blocked now because spaces in tags (which are allowed) get changed to hyphens. Unfortunately, Tumblr is not willing to globally rename all tags containing hyphens, although these tags are of no use anymore (→ 404).
Now I want to rename my tags myself.
I tried to do it with the "Mass Post Editor" (tumblr.com/mega-editor), but it's not possible to select posts by tag. I'd have to manually select post after post and look if a certain tag was used, and if so, delete it and add a new one instead. This would be a huge job (700 tagged posts, but more than 1000 in total).
So I thought that the Tumblr API might help me. I'm no programmer, but I'd be willing to dig into it, if I could get some help here as a starting point.
I think I need the following process:
select all posts that are tagged with x (= a tag containing hyphens)
tag all these posts with y (= a tag without hyphens)
delete the tag x on all these posts
I'd start this process for every affected tag manually.
I see that the method (or whatever you call it) /post knows the request parameter tag:
Limits the response to posts with the specified tag
(I guess I can only hope that this works for tags containing hyphens, too.)
After that I'd need a way to add and remove tags from that result set. /post/edit doesn't say anything about tags. Did I miss something? Isn't it possible to add/remove tags with the API?
Have you an idea how I could "easily" rename my tags?
Is it possible with the API? Could you give me a starting point, tip etc. how I could manage to do it?
I don't know if this might be helpful, but I noticed that the search function is still able to find posts "tagged" with tags that contain hyphens.
Example: let's say I have the tag foo-bar. It is linked with /tagged/foo-bar (→ 404). I can find the posts with /search/foo-bar (but this is of course not ideal because it might also find posts that contain (in the body text) words similar/equal to the tag name).
I tried to encode the hyphen (/tagged/foo%2Dbar), but no luck.
just for the record, because this is a popular google search: i've done it! you can use it at http://dev.goose.im/tags/.
i used a combo of PHP and jquery, basing my jquery off of a previous tumblr api script i wrote a year or two ago, and used this tumblr php oauth script for the authentication. if anyone wants me to put up the source code, i'd be happy to.
If you aren't a programmer, how much is your time worth to you? As they say, time is money. Not only do you have to figure out how to use the API, but you also have to choose a language and learn to write in it. That's no small task. You could hire a freelancer for $50 for an hour's worth of work.
To answer your question, yes it is possible to do this with the API. It mentions "These parameters are used for /post, /post/edit and /post/reblog methods." and tags is mentioned as a string of comma separated words.
What you want to do is get a listing of every single blog post using the /posts method. You'll want to look at the "Request" section to figure out the criteria to pass to this URL. You want it to be as general as possible to get a complete listing of all your posts.
After you get a listing of posts, you'll want to iterate over it and modify the tags parameter provided in the response for each post. You'll want to use the id parameter along with /post/edit, which again takes tags as a string.
The simplest language you can use for this task is PHP. You'll want to look at the curl extension to make your requests. You'll want to read up on arrays as you'll be using them a lot. You'll also need to look at explode, implode, str_replace (for the dashes), and foreach for iterating over the result.
When you do this I would highly recommend you use break at the end of your foreach loop so it only affects one post at first. Testing it first will be important, as you don't want to accidentally erase your tags/posts. print and var_dump are good ways to help you debug the code. xdebug is a nice extension that allows you to step through the code line by line as it runs. Netbeans is an IDE that has good xdebug support.
There's also a nice page here to get you started with PHP. You'll need to install PHP on your machine. You don't need to install a web server - for this PHP-CLI (command line) sapi is good enough.
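The same steps can also be sketched in Python instead of PHP. The example below covers only the tag handling and a dry run. It assumes each post is a dict with an id and a list of tags, as in a decoded /posts response, and leaves out the authenticated /post/edit request. The function names and sample posts are made up for illustration.

```python
def rename_tag(tags, old, new):
    """Return a copy of the tag list with `old` replaced by `new`, without duplicates."""
    renamed = []
    for tag in tags:
        tag = new if tag == old else tag
        if tag not in renamed:
            renamed.append(tag)
    return renamed


def plan_edits(posts, old, new, limit=None):
    """Return (post id, comma-separated tags) pairs for posts tagged `old`.

    `limit` plays the role of the `break` suggested above: pass 1 to try
    a single post before touching the rest.
    """
    edits = []
    for post in posts:
        if old in post["tags"]:
            edits.append((post["id"], ",".join(rename_tag(post["tags"], old, new))))
            if limit is not None and len(edits) >= limit:
                break
    return edits


# Made-up posts standing in for a decoded API response.
posts = [
    {"id": 101, "tags": ["foo-bar", "photo"]},
    {"id": 102, "tags": ["quote"]},
    {"id": 103, "tags": ["foo-bar", "foo bar"]},
]
print(plan_edits(posts, "foo-bar", "foo bar", limit=1))  # prints [(101, 'foo bar,photo')]
```

Each pair holds the id and the tags string that would be sent to /post/edit. Printing the plan first and checking it by hand serves the same purpose as the var_dump step.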
My neighbour popped over last night to ask me for help with regards to his company's website. He said that it used to be ranked pretty high on Google but has since fallen off completely.
Now, I'm a Windows app programmer, hence my request for help. I took a look and the meta tags seem OK. I recommended that he add an <h1>heading</h1> containing the page title to each page to help reinforce the content.
I also suggested that finding related websites and getting them to link to his site was good for search ranking.
Are there any other general strategies / tools that could help?
His site is: http://www.colofinder.co.uk/
ps. BTW: this isn't just an attempt to have StackOverflow link to my neighbour's site - I'm aware that links from SO don't add to its ranking.
Go to http://ooyes.net/blog/a-step-by-step-15-minute-seo-audit-%28a-sample-from-seo-secrets%29 and read it. Then go to http://www.searchenginejournal.com/55-quick-seo-tips-even-your-mother-would-love/6760/ and read it. Then go to your friend's site and look at it with that information in mind. Off the top of my head, I would flip the company name and page title in the "title" tags. Look at the Google Analytics account and see how people are coming to the site. That will give you an idea of where you should start your efforts to build a workable base.
First of all, he needs to make sure that his website's content is well organized and to the point. The page title has to be precise. Meta tags are obsolete, so use a meta description instead. The main heading should be in an h1 tag, subheadings in h2, and further subheadings in h3. Try to update the website once a month.
Use community websites like Facebook, Twitter, LinkedIn, and other related forums to post updates about completed projects, and be sure to include inbound links. You can use your company name as the link text pointing to your primary website, and a project name as the link text pointing to a subpage of your company website.
Keep posting at least once a week. Submitting the website URL to online directories will be a great help. Do not use black-hat SEO techniques like cloaking. Do not use any invisible text/div in your website. Whenever you post a link to your website anywhere, make sure it is the most specific and appropriate link.
The page you link to should contain the material that your post is about. Make a section on your website for tag clouds/Google tags; this will be a great attraction for search engines, and they will link your website to other popular websites.
Make sure these tags point to top-ranking websites that have relevant material. I hope this helps. Feel free to ask if you have trouble understanding anything I have mentioned above. Best of luck.
I'm maintaining an existing website that wants a site search. I implemented the search using the YAHOO API. The problem is that the API is returning irrelevant results. For example, there is a sidebar with a list of places and if a user searches for "New York" the top results will be for pages that do not have "New York" in the main content section. I have tried adding Yahoo's class="robots-nocontent" to the sidebar however that was two weeks ago and there has been no update.
I also tried out Google's Search API but am having the same problem.
This site has mostly static content and about 50 pages total so it is very small.
How can I implement a simple search that only searches the main content portions of the page?
At the risk of sounding completely self-promoting as well as pushing yet another API on you, I wrote a blog post about implementing Bing for your site using jQuery.
The advantage in using the jQuery approach is that you can tune the results quite specifically based on filters passed to the API and playing around with the JSON (or XML / SOAP if you prefer) result Bing returns, as well as having the ability to be more selective about what data you actually have jQuery display.
The other thing you should probably be aware of is how to effectively use rel attributes on your content (esp. links) so that search engines are aware of what the relationship is between the actual content they're crawling and the destination content it links to.
First, post a link to your website... we can probably help you more if we can see the problem.
It sounds like you're doing it wrong. Google Search should work on your website, unless your content is hidden behind JavaScript or forms or something, or your site isn't properly interlinked. Google solved crawling static pages, so if that's what you have, it will work.
So, tell me... does your site say New York anywhere? If it does, have a look at the page and see how the word is used... maybe your site isn't as static as you think. Also, are people really going to search your site for New York? Why don't you input some search terms that are likely on your site.
Another thing to consider is if your site is really just 50 pages, is it really realistic that people will want to search it? Maybe you don't need search... maybe you just need like a commonly used link section.
The BOSS Site Search Widget is pretty slick.
I use the bookmarklet thing but set as my "home" page in my browser. So whatever site I'm on I can hit my "home" button (which I never used anyway) and it pops up that handy site search thing.