Extract 3-level content from paginated pages with Scrapy

I have a seed url (say DOMAIN/manufacturers.php) with no pagination that looks like this:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<div class="st-text">
<table cellspacing="6" width="600">
<tr>
<td>
</td>
<td>
Name 1
</td>
<td>
</td>
<td>
Name 2
</td>
</tr>
<tr>
<td>
</td>
<td>
Name 3
</td>
<td>
</td>
<td>
Name 4
</td>
</tr>
<tr>
<td>
</td>
<td>
Name 5
</td>
<td>
</td>
<td>
Name 6
</td>
</tr>
</table>
</div>
</body>
</html>
From there I would like to get all the a['href'] values, for example manufacturer1-type-59.php. Note that these links do NOT include the DOMAIN prefix, so my guess is that I have to prepend it somehow, or maybe not?
Optionally, I would like to keep the links in memory (for the very next phase) and also save them to disk for future reference.
The content of each of these links, such as manufacturer1-type-59.php, looks like this:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<div class="makers">
<ul>
<li>
</li>
<li>
</li>
<li>
</li>
</ul>
</div>
<div class="nav-band">
<div class="nav-items">
<div class="nav-pages">
<span>Pages:</span><strong>1</strong>
2
3
»
</div>
</div>
</div>
</body>
</html>
Next, I would like to get all the a['href'] values, for example manufacturer_model1_type1.php. Again, note that these links do NOT include the domain prefix. One additional difficulty is that these pages are paginated, so I would like to visit every page of the listing too. As expected, manufacturer-type-59.php redirects to manufacturer-type-STRING-59-INT-p2.php.
Optionally, I would also like to keep these links in memory (for the very next phase) and save them to disk for future reference.
The third and final step should be to retrieve the content of all pages of type manufacturer_model1_type1.php, extract the title, and save the result in a file in the following form: (url, title, ).
EDIT
This is what I have done so far, but it doesn't seem to work:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class ArchiveItem(scrapy.Item):
    url = scrapy.Field()


class ArchiveSpider(CrawlSpider):
    name = 'gsmarena'
    allowed_domains = ['gsmarena.com']
    start_urls = ['http://www.gsmarena.com/makers.php3']
    rules = [
        Rule(LinkExtractor(allow=['\S+-phones-\d+\.php'])),
        Rule(LinkExtractor(allow=['\S+-phones-f-\d+-0-\S+\.php'])),
        Rule(LinkExtractor(allow=['\S+_\S+_\S+-\d+\.php']), 'parse_archive'),
    ]

    def parse_archive(self, response):
        torrent = ArchiveItem()
        torrent['url'] = response.url
        return torrent

I think you would be better off using a plain Spider (formerly BaseSpider) instead of CrawlSpider.
This code might help:
from scrapy import Request, Spider


class GsmArenaSpider(Spider):
    name = 'gsmarena'
    start_urls = ['http://www.gsmarena.com/makers.php3', ]
    allowed_domains = ['gsmarena.com']
    BASE_URL = 'http://www.gsmarena.com/'

    def parse(self, response):
        # level 1: links to the manufacturer pages
        markers = response.xpath('//div[@id="mid-col"]/div/table/tr/td/a/@href').extract()
        if markers:
            for marker in markers:
                yield Request(url=self.BASE_URL + marker, callback=self.parse_marker)

    def parse_marker(self, response):
        url = response.url
        # extracting phone urls
        phones = response.xpath('//div[@class="makers"]/ul/li/a/@href').extract()
        if phones:
            for phone in phones:
                # change callback function name as parse_events for first crawl
                yield Request(url=self.BASE_URL + phone, callback=self.parse_phone)
        else:
            return

        # pagination: follow the "Next page" link back into this same callback
        next_page = response.xpath('//a[contains(@title, "Next page")]/@href').extract()
        if next_page:
            yield Request(url=self.BASE_URL + next_page[0], callback=self.parse_marker)

    def parse_phone(self, response):
        # extract whatever you want here and yield items
        pass
EDIT
If you want to keep track of where these phone URLs come from, you can pass the URL along as meta from parse to parse_phone through parse_marker.
The requests will then look like this:
yield Request(url=self.BASE_URL + marker, callback=self.parse_marker, meta={'url_level1': response.url})
yield Request(url=self.BASE_URL + phone, callback=self.parse_phone, meta={'url_level2': response.url, 'url_level1': response.meta['url_level1']})
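
The `self.BASE_URL + marker` concatenation above works as long as every href is relative to the site root. As a hedged alternative sketch, the standard library's `urllib.parse.urljoin` resolves an href against the URL of the page it was found on; inside a Scrapy callback, `response.urljoin(href)` performs the same kind of resolution. The page URLs and hrefs below are illustrative values modelled on the question, not real crawl output.

```python
from urllib.parse import urljoin

# Hypothetical seed page and hrefs, modelled on the examples in the question.
page_url = "http://www.gsmarena.com/makers.php3"
hrefs = [
    "manufacturer1-type-59.php",               # relative to the current page
    "/manufacturer2-type-60.php",              # relative to the site root
    "http://www.gsmarena.com/already-abs.php", # already absolute
]


def absolutize(base, links):
    """Resolve each link against the URL of the page that contained it."""
    return [urljoin(base, link) for link in links]


absolute = absolutize(page_url, hrefs)
for link in absolute:
    print(link)

# A base URL inside a sub-directory shows why plain concatenation can go wrong:
nested = urljoin("http://www.gsmarena.com/dir/page.php", "other.php")
print(nested)
```

Already-absolute links pass through unchanged, so the same call can be applied to every extracted href without checking its form first.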

Related

Is it possible to change view content without performing reload

I need to create a view that contains a table with pagination, sorting and search. I get my data from a backend API and then pass it to the view.
My action method looks like this:
public async Task<IActionResult> Home(int page = 1)
{
    var list = await _service.Get<List<Model.Rezervacija>>(page);
    ViewBag.Page = page;
    return View(list);
}
And I have this view:
@using eBiblioteka.Model
@model List<Rezervacija>
@{
    ViewBag.Title = "Home";
    Layout = "~/Views/Shared/_Layout.cshtml";
}
<div id="view-all">
    @await Html.PartialAsync("_ViewAll", Model, null)
</div>
<a asp-action="Home" asp-route-page="@(ViewBag.Page - 1)">Previous</a>
<a asp-action="Home" asp-route-page="@(ViewBag.Page + 1)">Next</a>
That all works. Ignore the fact that the page number is changed with plain Previous and Next links; I just wanted to simplify my question. When I click Next, the data changes and the table updates very quickly, so that isn't the problem.
However, I can see in the browser tab that the whole page is reloaded.
Is there any way in ASP.NET Core to do this without a visible page reload? I need to call my action again with a new "page" parameter and then fetch the new data from the API without that ugly page reload.
Thank you for all solutions.

So you want a view that contains a table with pagination, sorting and search, without reloading the page. You can use jQuery (with the DataTables plugin) for this. Update your code as below:
@using eBiblioteka.Model
@model IEnumerable<Rezervacija>
<div>
    <table class="table table-striped border" id="myTable">
        <thead>
            <tr class="table-info">
                <th>
                    @Html.DisplayNameFor(c => c.Name)
                </th>
                <th>
                    @Html.DisplayNameFor(c => c.SpecialTagId)
                </th>
                <th></th>
                <th></th>
            </tr>
        </thead>
        <tbody>
            @foreach (var item in Model)
            {
                <tr>
                    <td> @item.Name </td>
                    <td> @item.SpecialTag.Name </td>
                    <td>
                        @* you can use a partial view for deleting or adding data *@
                        <partial name="_ButtonPartia3" model="@item.Id" />
                    </td>
                </tr>
            }
        </tbody>
    </table>
</div>
<br /> <br />
@section scripts{
    <script src="https://cdn.datatables.net/1.11.2/js/jquery.dataTables.min.js"></script>
    <script type="text/javascript">
        $(document).ready( function () {
            $('#myTable').DataTable();
        });
    </script>
}
Then add the link below to your _Layout.cshtml:
<link rel="stylesheet" type="text/css" href="https://cdn.datatables.net/1.11.2/css/jquery.dataTables.min.css">
Because DataTables paginates, sorts and searches the rows already rendered in the table on the client side, moving from one page to another does not reload the page.

You can use one of the many good front-end frameworks. I recommend layui, which comes from the Chinese community, though you can of course choose another one. When layui paginates, it requests a URL in the following format:
https://www.layui.com/demo/table/user/?page=2&limit=10
In this way, the .NET Core API can be called to perform a partial refresh without reloading the entire page.

Scrapy - Extract data from table

I am trying to fetch data from a table into separate fields of a CSV file.
(Part of) the source of the web page containing the table looks like this:
<div id="right">
<div id="rightwrap">
<h1>Krimpen aan den IJssel</h1>
<div class="tools">
print
terug
</div>
<h2 class="lijst">Krimpen aan den IJssel</h2>
<div class="dotted"> </div>
<div class="zoekres">
<h4>Aantal kindplaatsen</h4>
<div class="paratable">
<table cellpadding="0" cellspacing="0">
<tr>
<th> </th>
<th>2006</th>
<th>2008</th>
<th>2009</th>
<th>2010</th>
<th>2011</th>
</tr>
<tr>
<th>KDV</th>
<td>144</td>
<td>144</td>
<td>174</td>
<td>243</td>
<td>-</td>
</tr>
<tr>
<th>BSO</th>
<td>135</td>
<td>265</td>
<td>315</td>
<td>365</td>
<td>-</td>
</tr>
<tr>
<th>Totaal</th>
<td>279</td>
<td>409</td>
<td>489</td>
<td>608</td>
<td>-</td>
</tr>
</table>
</div>
</div>
</div>
</div>
<div class="brtotal"> </div>
</div>
I managed to retrieve the name of the place "Krimpen aan den IJssel" using this code:
def parse(self, response):
    item = OrderedDict()
    for col in self.cols:
        item[col] = 'None'
    item['Gemeente'] = response.css('h2.lijst::text').get('')
    yield item
But I am unable to retrieve the values displayed in the table. The standard approach for tables, using:
response.xpath('//*[@class="table paratable"]')
doesn't seem to work, or I am not experienced enough to set the parameters right.
Can anyone provide some lines of code that will put the values from this table into the following columns of my CSV file?
KDV_2006 KDV_2008 KDV_2009 KDV_2010 KDV_2011 BSO_2006 BSO_2008 BSO_2009 BSO_2010 BSO_2011

One possible way:
result = {}
# header row: skip the empty first cell and keep the year labels
years = response.xpath('//div[@class="paratable"]/table/tr[1]/th[position() > 1]/text()').getall()
# data rows: skip the header row and the final "Totaal" row
for row in response.xpath('//div[@class="paratable"]/table/tr[position() > 1][position() < last()]'):
    field_name = row.xpath('./th/text()').get()
    values = row.xpath('./td/text()').getall()
    for year, value in zip(years, values):
        result[f'{field_name}_{year}'] = value
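
To make the flattening step easier to follow outside a spider, here is a minimal standard-library sketch of the same `zip`-based idea. The `years` and `rows` values are hard-coded from the table in the question (in the spider they would come from the XPath queries above), and the helper names `flatten` and `to_csv` are illustrative, not part of Scrapy.

```python
import csv
import io

# Values copied from the table in the question; the "Totaal" row is left out,
# as in the XPath-based answer.
years = ["2006", "2008", "2009", "2010", "2011"]
rows = {
    "KDV": ["144", "144", "174", "243", "-"],
    "BSO": ["135", "265", "315", "365", "-"],
}


def flatten(years, rows):
    """Build one flat dict with keys such as 'KDV_2006'."""
    result = {}
    for field_name, values in rows.items():
        for year, value in zip(years, values):
            result[f"{field_name}_{year}"] = value
    return result


def to_csv(record):
    """Serialise a single flat record as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(record), lineterminator="\n")
    writer.writeheader()
    writer.writerow(record)
    return buffer.getvalue()


record = flatten(years, rows)
csv_text = to_csv(record)
print(csv_text)
```

In the spider itself, merging such a dict into the yielded item and running the crawl with Scrapy's `-o` option (for example `-o output.csv`) should produce the same columns without a hand-written CSV writer.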

Prestashop 1.5.6 - Add a custom field in Admin Product Page

I have been trying to add a new custom field to the Admin Product page but got a weird error. The field has the same textarea type as the existing ones. If I enter plain text it works fine, but if I enter anything that starts with an opening angle bracket (<) I get a 403 Forbidden error. I have already checked the permissions of the folders and files.
This is what I have done:
1. Added a new db field.
2. Changed the ProductCore class wherever 'description' is referenced, because the two fields are pretty much the same thing.
3. Modified information.tpl to display the new field.
After the change, plain text in the new field on the Admin Product page can be saved without any issue.
If I type in any HTML tag, which starts with an angle bracket, and then save, I get a 403 error, whereas an existing textarea such as Description on the same screen is totally fine.
I have checked the permissions of files and folders and the error logs but couldn't find any clue. Does anyone have an idea? Please help!
information.tpl
<tr>
<td class="col-left">
{include file="controllers/products/multishop/checkbox.tpl" field="description" type="tinymce" multilang="true"}
<label>{l s='Description:'}<br /></label>
<p class="product_description">({l s='Appears in the body of the product page'})</p>
</td>
<td style="padding-bottom:5px;">
{include file="controllers/products/textarea_lang.tpl" languages=$languages
input_name='description'
input_value=$product->description
}
<p class="clear"></p>
</td>
</tr>
<tr>
<td class="col-left">
{include file="controllers/products/multishop/checkbox.tpl" field="alternate_item" type="tinymce" multilang="true"}
<label>{l s='Alternate Item:'}<br /></label>
</td>
<td style="padding-bottom:5px;">
{include file="controllers/products/textarea_lang.tpl"
languages=$languages
input_name='alternate_item'
input_value=$product->alternate_item}
<p class="clear"></p>
</td>
</tr>
textarea_lang.tpl
<div class="translatable">
{foreach from=$languages item=language}
<div class="lang_{$language.id_lang}" style="{if !$language.is_default}display:none;{/if}float: left;">
<textarea cols="100" rows="10" id="{$input_name}_{$language.id_lang}"
name="{$input_name}_{$language.id_lang}"
class="autoload_rte" >{if isset($input_value[$language.id_lang])}{$input_value[$language.id_lang]|htmlentitiesUTF8}{/if}</textarea>
<span class="counter" max="{if isset($max)}{$max}{else}none{/if}"></span>
<span class="hint">{$hint|default:''}<span class="hint-pointer"> </span></span>
</div>
{/foreach}
</div>
<script type="text/javascript">
var iso = '{$iso_tiny_mce}';
var pathCSS = '{$smarty.const._THEME_CSS_DIR_}';
var ad = '{$ad}';
</script>

You must override the ProductCore class, NOT modify it:
class Product extends ProductCore
{
    public $alternate_item;

    public function __construct($id_product = null, $full = false, $id_lang = null, $id_shop = null, Context $context = null)
    {
        // register the new multilingual HTML field before the parent constructor runs
        self::$definition['fields']['alternate_item'] = array('type' => self::TYPE_HTML, 'lang' => true, 'validate' => 'isCleanHtml');
        parent::__construct($id_product, $full, $id_lang, $id_shop, $context);
    }
}
The override class must be placed in the override/classes folder or in yourmodule/override/classes.
When you override a class you must delete cache/class_index.php.

parsejson rendering through template in javascript

I have a parseJSON object as follows:
parseJSON([
{"BOOK_Name":"AAA","quickRead":[{"Page_Heading":"AAA-heading","Page_Url":"http://rtrt.com"},{"Page_Heading":"AAA-heading2","Page_Url":"http://bghfhghf.com"}]},
{"BOOK_Name":"BBB","quickRead":[{"Page_Heading":"BBB-heading","Page_Url":"http://dsdfdf.com"},{"Page_Heading":"BBB-heading2","Page_Url":"http://rtrtdfdf.com"}]}
])
I am able to render this partially in a tbody element using JsRender, as shown below: BOOK_Name is rendered, but quickRead is not. How can I render the data within quickRead inside, say, a list element?
$('tbody', '#bookTemplateTable').html($('#bookTemplate').render(data));
The template is as follows:
<script id="bookTemplate" type="text/html">
<tr>
<td>{{=BOOK_Name}}</td>
<td>
<ul>
<li> .. render 1st quickread value .. </li>
<li> .. render 1st quickread value .. </li>
</ul>
</td>
</tr>
</script>
Can anyone help with this?

It would be best to take a look at the JsRender demo page, where you can learn how to use the template tags.
First, to display data you should use
{{>propertyName}}
{{:propertyName}}
The first form HTML-encodes the value and the second outputs it as is; the demos explain when to use which.
For your particular case, here is a valid template:
<script id="bookTemplate" type="text/html">
<tr>
<td>{{>BOOK_Name}}</td>
<td>
<ul>
{{for quickRead}}
<li>{{>#data.Page_Heading}}</li>
{{/for}}
</ul>
</td>
</tr>
</script>
Working example can be found here:
http://jsfiddle.net/EGFbq/

Accessing a file dialog inside an iframe

I have an inline JSP page that contains an HTML input of type file. The IFRAME tag shown below is in the main JSP. I want to pass in the file name and submit this file dialog using WebDriver. Basically, I want to do something like:
WebElement elem = driver.findElement(By.id("attachmentfile"));
elem.sendKeys("C:\\Users\\Public\\Pictures\\Sample Pictures\\Koala.jpg");
However, I am not able to get hold of the element with the attachmentfile id. Any help would be appreciated. Thanks.
Main Jsp:
<IFRAME id=fileupload src="fileupload.jsp?type=uploadbutton"
frameBorder=0></IFRAME></TD></TR></FORM>
The fileupload.jsp is as below:
<html> <body>
<form name="frm_fileUpload" ENCTYPE="multipart/form-data"><%
<tr>
<td>
<input type="file" name="attachmentfile"
id="attachmentfile" onChange="uploadFile ();" />
<input type="button" name="uploadbutton" id="uploadbutton"
value="Upload" class="button" />
</td>
</tr>
</table>
</form>
</body>
</html>
