Scrapy - Cleaning up text in <p> from nested links <a> etc - scrapy

I am new to Python and Scrapy as well. Nevertheless, I spent a few days trying to scrape news articles from an archive - SUCCESSFULLY.
The PROBLEM is that when I scrape the CONTENT of the article <p>, that content is filled with additional tags like strong, a etc., and as such Scrapy won't pull it out, and I am left with a news article containing 2/3 of the text. I will try HTML below:
<p> According to <a> Japan's newspapers </a> it happened ... </p>
Now I tried googling around and looking into the forum here. There were some suggestions, but what I tried either did not work or broke my spider:
I have read about normalize-space and remove tags, but it didn't work. Thank you for any insights in advance.

Please provide your selector for more detailed help.
Given what you're describing, I'd guess you're selecting p/text() (XPath) or p::text (CSS), which is not going to get the text in the children of <p> elements.
You should try selecting response.xpath('//p/descendant-or-self::*/text()') to get the text in the <p> and all its children.
You could also just select the <p>, not its text, and you'll get its children as well. From there you can start cleaning up the tags. There are answered questions regarding how to do that.
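As a minimal sketch of both suggestions (assuming the HTML from the question; w3lib ships alongside Scrapy):
from scrapy.selector import Selector
from w3lib.html import remove_tags

html = "<p> According to <a> Japan's newspapers </a> it happened ... </p>"
sel = Selector(text=html)

# Option 1: collect the text nodes of <p> and all of its descendants.
parts = sel.xpath('//p/descendant-or-self::*/text()').getall()
print(''.join(parts))  # " According to  Japan's newspapers  it happened ... "

# Option 2: take the whole <p> and strip the markup afterwards.
print(remove_tags(sel.xpath('//p').get()))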

You could use str.replace(old, new):
new_string = old_string.replace("<a>", "")
You could integrate this into a loop which iterates over a list that contains all of the substrings that you want to discard.
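A minimal sketch of that loop, reusing the example HTML from the question:
old_string = "<p> According to <a> Japan's newspapers </a> it happened ... </p>"
# Example list of literal substrings to strip; extend it as needed.
for tag in ["<p>", "</p>", "<a>", "</a>", "<strong>", "</strong>"]:
    old_string = old_string.replace(tag, "")
print(old_string)  # " According to  Japan's newspapers  it happened ... "
Note that this only removes literal substrings, so tags carrying attributes won't match; an HTML-aware cleaner is safer for real pages.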

Related

Wordpress Database search and replace html and keep content

I have a problem. An old plugin created a lot of unnecessary tags in all my 700 WordPress blog posts.
Currently the html of every h2-tag looks like this:
<h2><a class="chartbeat-section" target="_blank" rel="nofollow noopener" name="name"></a>title</h2>
The outcome should be just:
<h2>title</h2>
Is it possible to get rid of the a-tag inside of every h2-tag with some SQL query?
Thanks in advance.
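Not a SQL answer, but for comparison, a hypothetical Python sketch of the same cleanup with BeautifulSoup (assuming the post HTML is exported and rewritten outside the database):
from bs4 import BeautifulSoup

html = '<h2><a class="chartbeat-section" target="_blank" rel="nofollow noopener" name="name"></a>title</h2>'
soup = BeautifulSoup(html, 'html.parser')

# The title text sits outside the empty anchor, so removing the <a> is enough.
for a in soup.select('h2 > a.chartbeat-section'):
    a.decompose()

print(soup)  # <h2>title</h2>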

How can I search a Beautiful Soup tree to get the tag path to a text match?

I would like to search a Beautiful Soup element for a text match and return the sequence of tags that lead to the element containing that text.
For example, if at soup.html.head.meta there is text “Hello everybody”, I would like to search on “soup.head” for “Hello everybody” and return the result “soup.html.head.meta”.
Is there a good way to do this and if there is not a simple way, is there a good workaround for quickly finding out where certain known text is located?
Example:
I retrieved the HTML source code from this URL with wget: https://www.gitpod.io/docs/context-urls
I created a Beautiful Soup object from this document like so:
soup = bs4.BeautifulSoup(doc, 'html.parser')
The method soup.html.head.get_text() returns
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nGitpod
Contexts\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
I know that somewhere in the head element there is some text, "Gitpod Contexts". I would like to know the nearest enclosing element tag so I can delete everything except that element, because I am trying to prune the Beautiful Soup object down to just the elements that contain text, myself, rather than running get_text() over the entire object and automatically pulling it all out.
Example 2
A simpler demonstration would be this:
<html>
<body>
<p>
Hello!
</p>
<p>
Goodbye!
</p>
</body>
</html>
The function:
html.returnLocationOf("Hello!")
returns:
html.body.p
I don't know enough about Beautiful Soup to know how it would specify "the second p" for "Goodbye!" but I imagine it could be incorporated as a method somehow.
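A minimal sketch of such a helper (return_location_of is a hypothetical name, after the returnLocationOf in the question):
from bs4 import BeautifulSoup

def return_location_of(soup, text):
    """Return the dotted tag path leading to the element containing `text`."""
    node = soup.find(string=lambda s: s and text in s)
    if node is None:
        return None
    # Walk from the matched string up to the root, collecting tag names.
    names = [p.name for p in node.parents if p.name != '[document]']
    return '.'.join(reversed(names))

doc = "<html><body><p>Hello!</p><p>Goodbye!</p></body></html>"
soup = BeautifulSoup(doc, 'html.parser')
print(return_location_of(soup, "Hello!"))  # html.body.p
Distinguishing the second <p> for "Goodbye!" would need an extension, e.g. also recording each parent's position among its same-named siblings.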

How do I add HTML tags and links in QnA-maker?

I have tried the suggestions provided for other similar questions but they didn't work.
Here are 2 examples:
1) Link: [name of your link](https://url.com) is changed by QnAMaker during the save and train step into name of your link](https://url.com)) and it is displayed in the test feature:
name of your link](https://url.com))
2) <b> bold text </b> is not rendered as bold, and the same thing happens with the escaped HTML.
Did you consider markdown? Your examples are supported.
A cheatsheet to help you on your way.
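For instance, these raw markdown forms should render as a link and as bold text respectively:
[name of your link](https://url.com)
**bold text**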

Selenium find all the elements which have two divs

I am trying to collect texts and images from a website to help collect missing people related tweets. Here is the problem:
Some tweets don't have images so the corresponding <div class='c' ....> has only one <div>...</div>.
Some tweets have images, so the corresponding <div class='c' ....> has two <div>...</div>, as shown in the following codes:
<div class='c' id="M_D*****">
<div>...</div>
and
<div class='c' id="M_D*****">
<div>...</div>
<div>...</div>
I intend to check whether a tweet has an image, i.e. find out whether the corresponding <div class='c' ....> has two <div>...</div>.
PS: The following code is used to collect all the texts and image URLs, but not all tweets have images, so I want to match them by solving the above problem.
tweets = browser.find_elements_by_xpath("//span[@class='ctt']")
graph_links = browser.find_elements_by_xpath("//img[@alt='img' and @class='ib']")
This is a public welfare program, which aims to help the missing people go back home.
By collecting the text and the images separately, I think that it's going to be impossible to match the text with the related image after the fact. I would suggest a different approach. I would search for the <div class='c'...> that contains both the text and the optional image. Once you have the "container" DIV, you can then get the text and see if an image exists and put them all together. Without all the relevant HTML, you may have to tweak the code below but it should give you an idea on how to approach this.
containers = browser.find_elements_by_css_selector("div.c")
for container in containers:
    print(container.find_element_by_css_selector("span.ctt").text)  # the tweet text
    images = container.find_elements_by_css_selector("img.ib")
    if len(images) > 0:  # see if the image exists
        print(images[0].get_attribute("src"))  # the URL of the image
    print("-------------")  # separator between tweets
The HTML you provided is probably not enough, but based on it I suggest the XPath //div[@id='M_D*****' and ./div//img], which finds a div with the specified id that also contains a div with an image.
But to answer your question directly:
//div[./div[2] and not(./div[3])] will find all divs with exactly 2 div children.
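Combining the two ideas, a hedged sketch (assuming browser is the WebDriver from the question, and the span.ctt / img.ib selectors from the PS):
tweets_with_images = browser.find_elements_by_xpath(
    "//div[@class='c' and ./div[2] and not(./div[3])]")
for tweet in tweets_with_images:
    text = tweet.find_element_by_css_selector("span.ctt").text
    image_url = tweet.find_element_by_css_selector("img.ib").get_attribute("src")
    print(text, image_url)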

CSS locator for corresponding xpath for selenium

Some part of the HTML of the webpage which I'm testing looks like this:
<div id="twoWideCallouts">
  <div class="callout">
    <a target="_blank" href="http://facebook.com">Facebook</a>
  </div>
  <div class="callout last">
    <a target="_blank" href="http://youtube.com">Youtube</a>
  </div>
</div>
I have to check using Selenium that when I click on the text, the URL opened is the same as the one given in the href and not an error page.
Using XPath I've written the following command
//i is iterator
selenium.getAttribute("//div[contains(#class, 'callout')]["+i+"]/a/#href")
However, this is very slow and for some of the links it doesn't work. By reading many answers and comments on this site I've come to know that CSS locators are faster and cleaner to maintain, so I wrote it again as
css = div:contains(callout)
Firstly, I'm not able to reach the anchor tag.
Secondly, this page can have any number of divs with class = callout. Using getXpathCount I can get the count of these, and I'll be iterating over that count and performing the href check. How can something similar be done using a CSS locator?
Any help would be appreciated.
EDIT
I can click on the link using the locator css=div.callout a, but when I try to read the href value using String str = "css=div.callout a[href]";
selenium.getAttribute(str); I get the error - element not found. The console description is given below.
19:12:33.968 INFO - Command request: getAttribute[css=div.callout a[href], ] on session
19:12:33.993 INFO - Got result: ERROR: Element css=div.callout a[href not found on session
I tried to get the href attribute using XPath like this
"xpath=(//div[contains(@class, 'callout')])["+1+"]/a/@href" and it worked fine.
Please tell me what should be the corresponding CSS locator for this.
It should be -
css = div:contains(callout)
Did you notice the ":" instead of the "." you used?
For CSSCount this might help -
http://www.eviltester.com/index.php/2010/03/13/a-simple-getcsscount-helper-method-for-use-with-selenium-rc/
On a different note, did you see the proposal for a new Selenium site on Area 51 - http://area51.stackexchange.com/proposals/4693/selenium.
To read the attribute I used css=div.callout a@href and it worked. The problem was with the use of square brackets around the attribute name.
For the first part of your question, anchor your identifier on the hyperlink:
css=a[href='http://youtube.com']
For achieving a count of elements in the DOM, based on CSS selectors, here's an excellent article.
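For illustration, in WebDriver-style Python (the question itself uses Selenium RC, so the API differs) the count-and-check loop could look like:
links = browser.find_elements_by_css_selector("div.callout a")
print(len(links))  # the element count, no getXpathCount needed
for link in links:
    print(link.get_attribute("href"))  # compare against the expected URL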