rvest: filter nodes based on subsequent node - beautifulsoup

I'm trying to scrape links from a webpage. The webpage has hyperlinked images and hyperlinked h3 headers. I want to discard the links for the images. Unfortunately, there are no classes, ids, or attributes on the divs to identify the image hyperlinks. Is there some logic in rvest or bs4 to filter the links based on the HTML elements nested inside them? For example, if the next element is an img then ignore the link, and if the next element is an h3 then keep it?
html <- '<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<div>
<div>
<div>
<a href="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png" style="max-width:72px;max-height:72px">
</a>
</div>
<span>
<h3>
<div>
Smiley Face
</div>
</h3>
</span>
<span>
<div>
https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png
</div>
</span>
</div>
</div>
<div>
<div>
<a href="https://www.hbs.edu">
<h3>
<div>Harvard Business School</div>
</h3>
<div>https://www.hbs.edu</div>
</a>
</div>
</div>
</body>
</html>'
my_page <- read_html(html)
my_page %>%
html_nodes("a") %>%
html_attr("href")
# [1] "https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png" # Want to ignore this
# [2] "https://www.hbs.edu" # Want to keep this

With rvest you can use XPath (the parent axis) to specify the parent-child relationship between the anchor tag and the h3 tag, as follows:
library(rvest)
library(magrittr)
html <- '<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<div>
<div>
<div>
<a href="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png" style="max-width:72px;max-height:72px">
</a>
</div>
<span>
<h3>
<div>
Smiley Face
</div>
</h3>
</span>
<span>
<div>
https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png
</div>
</span>
</div>
</div>
<div>
<div>
<a href="https://www.hbs.edu">
<h3>
<div>Harvard Business School</div>
</h3>
<div>https://www.hbs.edu</div>
</a>
</div>
</div>
</body>
</html>'
my_page <- read_html(html)
my_page %>%
html_elements(xpath = "//h3/parent::a[@href]") %>%
html_attr("href")
With bs4 you can use the :has pseudo-class selector with the > child combinator to require that the anchor tag has an h3 element as a direct child. You can swap the child combinator for a descendant combinator if the h3 can sit at any depth inside the anchor rather than being a direct child (a potential difference in DOM depth).
from bs4 import BeautifulSoup as bs
html = '''<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<div>
<div>
<div>
<a href="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png" style="max-width:72px;max-height:72px">
</a>
</div>
<span>
<h3>
<div>
Smiley Face
</div>
</h3>
</span>
<span>
<div>
https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png
</div>
</span>
</div>
</div>
<div>
<div>
<a href="https://www.hbs.edu">
<h3>
<div>Harvard Business School</div>
</h3>
<div>https://www.hbs.edu</div>
</a>
</div>
</div>
</body>
</html>'''
soup = bs(html, 'lxml') # pip install lxml if missing
print([i['href'] for i in soup.select('a[href]:has(> h3)')])
In either case, I have specified that the parent anchor tag must have an href attribute.
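As a minimal sketch of the difference between the child and descendant forms mentioned above, the snippet below runs both selectors over a small made-up document. The example.com URLs and the third, more deeply nested link are assumptions added for illustration and are not part of the original page; html.parser is used here only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one image link, one link with a direct h3 child,
# and one link whose h3 is wrapped in an extra div.
html = """
<div><a href="https://example.com/image.png"><img src="https://example.com/image.png"></a></div>
<div><a href="https://example.com/direct"><h3>Direct child</h3></a></div>
<div><a href="https://example.com/nested"><div><h3>Nested heading</h3></div></a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Child combinator: the h3 must be a direct child of the anchor.
direct = [a["href"] for a in soup.select("a[href]:has(> h3)")]

# Descendant form: the h3 may sit at any depth inside the anchor.
any_depth = [a["href"] for a in soup.select("a[href]:has(h3)")]

print(direct)     # only the link with a direct h3 child
print(any_depth)  # both h3 links; the image link is excluded in either case
```

If the real page wraps its headings inconsistently, the descendant form is the more forgiving of the two, at the cost of also matching anchors that merely contain an h3 somewhere deep inside.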

Related

Slide divs horizontal with materialize css

I need someone who knows materializecss to tell me whether it is possible to make this with materializecss. I googled but didn't find anything for materializecss.
I am looking for something like this example: https://www.codeply.com/go/zAc3W2rzOd
Yes, there is a way to slide divs in materializecss. You can find some examples on their website.
I simply copied one of their slider examples and pasted it into a jsfiddle and the section below.
<html>
<head>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.99.0/css/materialize.css" />
</head>
<body>
<div class="container" style="padding-top:100px;">
<div class="carousel carousel-slider center" data-indicators="true">
<div class="carousel-fixed-item center">
<a class="btn waves-effect white grey-text darken-text-2">button</a>
</div>
<div class="carousel-item red white-text" href="#one!">
<h2>First Panel</h2>
<p class="white-text">This is your first panel</p>
</div>
<div class="carousel-item amber white-text" href="#two!">
<h2>Second Panel</h2>
<p class="white-text">This is your second panel</p>
</div>
<div class="carousel-item green white-text" href="#three!">
<h2>Third Panel</h2>
<p class="white-text">This is your third panel</p>
</div>
<div class="carousel-item blue white-text" href="#four!">
<h2>Fourth Panel</h2>
<p class="white-text">This is your fourth panel</p>
</div>
</div>
</div>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.2.1/jquery.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.99.0/js/materialize.js"></script>
<script>
$('.carousel.carousel-slider').carousel({fullWidth: true});
</script>
</body>
</html>
To make these sliders work, materializecss uses the jQuery plugin Carousel, which also has some options.

Dropdown on Bootstrap not expanding

I wanted a horizontal menu for my application, so I used Bootstrap to create it, following the example at http://www.bootply.com/113314. It is not working, though. I also need help understanding how the button identifies the appropriate list without an id or name being specified anywhere, given that there is more than one in the document.
<!DOCTYPE html>
<html>
<head>
<title>Home</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa"
crossorigin="anonymous"></script>
</head>
<body>
<div align="center">
<a href="home.erp">
<img title="Home" class="img-rounded" src="images/empireSmall.png">
</a>
</div>
<div align="right">
<button class="btn btn-danger" onclick="logout()">Logout</button>
</div>
<div class="btn-group">
<div class="btn-group">
<button type="button" class="btn btn-default dropdown-toggle" data-toggle="dropdown">
Administration
<span class="caret"></span>
</button>
<ul class="dropdown-menu" role="menu">
<li> Invoice Application</li>
<li> Karate Student</li>
</ul>
</div>
<div class="btn-group">
<button type="button" class="btn btn-default dropdown-toggle" data-toggle="dropdown">
System
<span class="caret"></span>
</button>
<ul class="dropdown-menu" role="menu">
<li> User Application</li>
</ul>
</div>
</div>
</div>
</body>
</html>
Just to confirm for anyone else finding the same issue - ensure jQuery is referenced prior to Bootstrap JavaScript (e.g. bootstrap.min.js), as this is a dependency.
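One way to check that ordering on a saved page is sketched below, using only the Python standard library. The helper name jquery_before_bootstrap and the rule of matching on file names that start with "jquery" or "bootstrap" are assumptions made for this illustration; they are not part of Bootstrap or jQuery.

```python
from html.parser import HTMLParser


class ScriptOrder(HTMLParser):
    """Collect the src of every <script> tag in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def jquery_before_bootstrap(html):
    """Return True if a jQuery script precedes the first Bootstrap script."""
    parser = ScriptOrder()
    parser.feed(html)
    # Assumption: the file name at the end of the URL identifies the library.
    names = [src.rsplit("/", 1)[-1].lower() for src in parser.sources]
    jquery = next((i for i, n in enumerate(names) if n.startswith("jquery")), None)
    bootstrap = next((i for i, n in enumerate(names) if n.startswith("bootstrap")), None)
    if jquery is None or bootstrap is None:
        return False
    return jquery < bootstrap


JQUERY = '<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>'
BOOTSTRAP = '<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"></script>'

good_order = JQUERY + BOOTSTRAP
bad_order = BOOTSTRAP + JQUERY

print(jquery_before_bootstrap(good_order))
print(jquery_before_bootstrap(bad_order))
```

A check like this only confirms the order of the script tags; it says nothing about whether the scripts actually load, so the browser console is still the first place to look for errors.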

Hidden column breaks layout of following row

I have a layout with three rows. I would like the first row to have one column that spans all 12 columns at the xs size. At larger sizes I would like two columns, one 9 columns wide and one 3 columns wide. I have attempted this by adding the second column and making it hidden at xs and visible and 3 wide at sm.
However, doing so breaks the layout of my second row at larger screen sizes for some reason. I'm not sure why.
Code working before adding hidden column:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- Latest compiled and minified CSS -->
<link rel="stylesheet" href="http://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">
<!-- jQuery library -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
<!-- Latest compiled JavaScript -->
<script src="http://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"></script>
<link href="../styles/StyleSheet1.css" rel="stylesheet" />
<title></title>
</head>
<body>
<div class="container-fluid">
<div class="row-fluid">
<div class="col-xs-12">
<div id="header">
<h1>This will be the page header</h1>
<ul class="nav nav-tabs">
<li class="active">Home</li>
<li>Page w/o Sidebar</li>
<li>Menu Item 3</li>
</ul>
</div>
</div>
</div>
<div class="row-fluid">
<div class="col-xs-6 col-sm-3 col-sm-push-9">
<div id="side-bar">
This is the side bar
<ul class="list-group">
<li class="list-group-item active">item a</li>
<li class="list-group-item">item b</li>
<li class="list-group-item">item c</li>
</ul>
</div>
</div>
<div class="col-xs-12 col-sm-9 col-sm-pull-3">
<div id="content">
<h2>This is the main content</h2>
</div>
</div>
</div>
<div class="row-fluid">
<div class="col-xs-12">
<div id="footer">
This is the footer
</div>
</div>
</div>
</div>
</body>
</html>
Code broken after adding hidden column:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- Latest compiled and minified CSS -->
<link rel="stylesheet" href="http://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">
<!-- jQuery library -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
<!-- Latest compiled JavaScript -->
<script src="http://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js"></script>
<link href="../styles/StyleSheet1.css" rel="stylesheet" />
<title></title>
</head>
<body>
<div class="container-fluid">
<div class="row-fluid">
<div class="col-xs-12 col-sm-9">
<div id="header">
<h1>This will be the page header</h1>
<ul class="nav nav-tabs">
<li class="active">Home
</li>
<li>Page w/o Sidebar
</li>
<li>Menu Item 3
</li>
</ul>
</div>
</div>
<div class="hidden-xs col-sm-3 visible-sm ">
</div>
</div>
<div class="row-fluid">
<div class="col-xs-6 col-sm-3 col-sm-push-9">
<div id="side-bar">
This is the side bar
<ul class="list-group">
<li class="list-group-item active">item a</li>
<li class="list-group-item">item b</li>
<li class="list-group-item">item c</li>
</ul>
</div>
</div>
<div class="col-xs-12 col-sm-9 col-sm-pull-3">
<div id="content">
<h2>This is the main content</h2>
</div>
</div>
</div>
<div class="row-fluid">
<div class="col-xs-12">
<div id="footer">
This is the footer
</div>
</div>
</div>
</div>
</body>
</html>
I seem to have figured out the solution. I'm setting the column that I want hidden as visible on small, medium, and large, and setting it to 3 columns on small screens or larger.
I also added a div between the rows with the clearfix class.
<body>
<div class="container-fluid">
<div class="row-fluid">
<div class="col-xs-12 col-sm-9">
<div id="header">
<h1>This will be the page header</h1>
<ul class="nav nav-tabs">
<li class="active">Home
</li>
<li>Page w/o Sidebar
</li>
<li>Menu Item 3
</li>
</ul>
</div>
</div>
<div class="visible-sm visible-md visible-lg col-sm-3">
<div id="logo">
<p>This is the hidden area</p>
</div>
</div>
</div>
<div class="clearfix"></div>
<div class="row-fluid">
<!-- rest of the code from before -->

Divs in Twitter Bootstrap Carousel appear on top of each other

I have an issue with my Twitter Bootstrap Carousel. The 4 DIV elements, which are supposed to appear one after the other, all appear together, on top of each other.
I followed CreativityTuts.org tutorial and cannot see where I am wrong...
Here is the HTML code:
<html>
<head>
<link rel="stylesheet" type="text/css" href="css/bootstrap.css"/>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script type="text/javascript">
$(document).ready(function(){
$('.carousel').carousel({
interval:2000
});
});
</script>
</head>
<body>
<div class="container">
<div class="row">
<div class="span6 well">
<div id="myCarousel" class="carousel">
<div class="carousel-inner">
<center>
<div class="item active"><img src="bkk.png"></div>
<div class="item"><img src="brr.png"></div>
<div class="item"><img src="buu.png"></div>
<div class="item"><img src="btt.png"></div>
</center>
</div>
<a class="carousel-control left" href="#myCarousel" data-slide="prev">‹</a>
<a class="carousel-control right" href="#myCarousel" data-slide="next">›</a>
</div>
</div>
</div>
</div>
<script src="js/carousel.js"></script>
</body>
</html>
My file structure is the following:
- bkk.png (and the 3 other images)
- css (file)
- fonts (file)
- js (file)
- carousel.js
- bootstrap.js
- bootstrap.min.js
- testcarousel.html
Can anyone help?
Thanks a lot in advance,
Paul
OK guys,
Thanks for your answers. They made me realize that I was using code for Bootstrap 2 while actually running Bootstrap 3.
I found the needed syntax here:
http://www.tutorialrepublic.com/twitter-bootstrap-tutorial/bootstrap-carousel.php

Capybara and Rails, Why my link expected to return some thing?

I am trying to test that a link does exist on the page.
I tried to check all the nested tags that contain the link tag, like this:
response.body.should have_selector("div.page_margins div.page div#nav div.hlist ul li#2")
and it passes correctly, but, if I added the link tag to the test like this:
response.body.should have_selector("div.page_margins div.page div#nav div.hlist ul li#2 a",:text => "Next")
I get the error:
expected css "div.page_margins div.page div#nav div.hlist ul li#2 a#next_page"
with text "Next" to return something
If I test it with have_link like this:
response.body.should have_link("div.page_margins div.page div#nav div.hlist ul li#2 a#next_page")
I get the error:
expected link "div.page_margins div.page div#nav div.hlist ul li#2
a#next_page" to return something
Can anybody help, please? I love Rails, but I still need a hand to get along with testing.
EDIT
Here is the page.html, I've noticed that the html in content_for in which the link is rendered is not rendered in yield
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-type">
<title></title>
<script src="/assets/application.js" type="text/javascript"></script><link href="/assets/application.css" media="screen" rel="stylesheet" type="text/css">
<script type="text/javascript">
//<![CDATA[
var auto_log_off = false;
//]]>
</script><script type="text/javascript">
//<![CDATA[
var student_logged = false;
//]]>
</script><script src="/assets/sessions.js" type="text/javascript"></script><!-- add your meta tags here --><link href="/assets/application_yaml/css/my_layout.css" rel="stylesheet" type="text/css">
<!--[if lte IE 7]> <![endif]--><link href="/assets/application_yaml/css/patch_my_layout.css" rel="stylesheet" type="text/css">
</head>
<body>
<div class="page_margins">
<div id="topnav">
<!-- start: skip link navigation -->
<a class="skip" href="#navigation" title="skip link">Skip to the navigation</a>
<span class="hideme">.</span>
<a class="skip" href="#content" title="skip link">Skip to the content</a>
<span class="hideme">.</span>
<!-- end: skip link navigation -->
</div>
<!-- start: skip link navigation -->
<!-- end: skip link navigation -->
<div class="page">
<div id="header">
<h1>Welcome to course builder!!</h1>
<p>Home</p>
</div>
<div id="nav">
<!-- skiplink anchor: navigation -->
<a id="navigation" name="navigation"></a>
<div class="hlist">
<!-- main navigation: horizontal list -->
<div class="quiz_review_buttons">
<!--
<ul>
<li class="active"><strong>Button 1</strong></li>
<li>Button 2</li>
<li>Button 3</li>
</ul>
-->
</div>
<!-- <ul> -->
<!-- <li class="active"><strong>Button 1</strong></li> -->
<!-- <li>Button 2</li> -->
<!-- <li>Button 3</li> -->
<!-- <li>Button 4</li> -->
<!-- <li>Button 5</li> -->
<!-- </ul> -->
</div>
</div>
<div id="main">
<div id="col1">
<div class="clearfix" id="col1_content">
<!-- add your content here -->
<div class="debug_div">
<p>
<b>
devise/sessions#new
</b>
</p>
</div>
</div>
</div>
<div id="col3">
<div class="clearfix" id="col3_content">
<!-- add your content here -->
<div class="alert" id="notice_alert">You need to sign in or sign up before continuing.</div>
<script type="text/javascript"></script><!-- <div style="clear:both"></div> --><h2>Sign in for student</h2>
<form accept-charset="UTF-8" action="/students/sign_in" class="student_new" id="student_new" method="post">
<div style="margin:0;padding:0;display:inline"><input name="utf8" type="hidden" value="✓"></div>
<div>
<label for="student_email">Email</label><br><input id="student_email" name="student[email]" size="30" type="email" value="">
</div>
<div>
<label for="student_password">Password</label><br><input id="student_password" name="student[password]" size="30" type="password">
</div>
<div>
<input name="student[remember_me]" type="hidden" value="0"><input id="student_remember_me" name="student[remember_me]" type="checkbox" value="1"><label for="student_remember_me">Remember me</label>
</div>
<div><input name="commit" type="submit" value="Sign in"></div>
</form>
Sign up<br>Forgot your password?<br>
</div>
<!-- IE Column Clearing -->
<div id="ie_clearing"> </div>
</div>
</div>
<div id="footer">
Layout based on
YAML
</div>
</div>
</div>
</body>
</html>
"expected to return something" just means that Capybara couldn't find the element it was looking for. It is hard to say why without seeing the HTML that Capybara is searching.
Note that the argument to have_link is not a css selector, it should be the text, id, title, or image alt attribute of the link.
Also, in controller specs, make sure you call render_views when checking markup.