How do I filter out escape sequences while scraping tables using CSS selectors?

I am trying to scrape a table using CSS selectors in Scrapy. The method I use scrapes the table row by row into a single scrapy.Field() on an item object.
However, the scraped data contains a "\n\t\t" element between every other element in the table. How do I remove this during the scraping process? I can do post-processing on the data, but I would like to understand what is going on.
My parse method:
def parse_product(self, response):
    l = ItemLoader(item=KdramaItem(),
                   response=response,
                   )
    l.add_value('url', response.meta['source_url'])
    table_loader = l.nested_css('table')
    table_loader.add_css('table', 'tr ::text')
    yield l.load_item()
Part of output:
"url": ["http://www.koreandrama.org/angels-last-mission-love/"], "table": ["\n\t\t", "Date", "\n\t\t", "Ep", "\n\t\t", "TNmS", "\n\t\t", "TNmS", "\n\t\t", "AGB", "\n\t\t", "AGB", "\n\t", "\n\t\t", "\u00a0", "\n\t\t", "\u00a0", "\n\t\t", "Nationwide", "\n\t\t", "Seoul", "\n\t\t", "Nationwide", "\n\t\t", "Seoul", "\n\t", "\n\t\t",

Related

How to iterate over a dynamic array of objects and use each object as a parameter in test?

I started my adventure with Karate a month ago. I have a simple GET test called getAllCars.feature showing a list of cars currently available:
[
{
"brandName": "BMW",
"id": 1,
"winterTires": false,
"modelName": "X5"
},
{
"brandName": "Opel",
"id": 34,
"winterTires": true,
"modelName": "Insignia"
},
{
"brandName": "Mercedes-Benz",
"id": 36,
"winterTires": true,
"modelName": "GLE Coupe"
},
{
"brandName": "Huydai",
"id": 251,
"winterTires": false,
"modelName": "i30"
}
]
I have to use each id as a parameter for the next feature file. The problem is that the list of cars is dynamic, the ids don't repeat, and I will have to use this list of ids in several other feature files. I managed to create a helper getCarIds.feature, which creates an array of objects "carId": "#number":
Feature: Get car IDs
Scenario: Get car IDs
* call read('classpath:x/automation/cars/getAllCars.feature')
* def carIds = $response[*].id
* def carFeeder = karate.mapWithKey(carIds, 'carId')
The following getCarParameters.feature has to iterate over the array from getCarIds.feature and pass each id as a parameter to get a response with the performance parameters of each car, and I don't know how to use each id separately as a parameter (keeping in mind that the list of ids changes):
Feature: Get parameters of each car
Scenario: Get parameters for each car
* call read('classpath:x/automation/cars/getCarIds.feature')
Given url carUrl + '/carparameters'
And param carId =
When method GET
Then status 200
I managed to do it when passing the values from getCarIds.feature to getCarParameters.feature as described here, by adding the following line to getCarIds.feature:
* call read('classpath:x/automation/cars/getCarParameters.feature') carFeeder
but several other tests require the car ids. I need getCarIds.feature to be reusable, so I would have to retrieve the data from the feature file that creates the array of ids, instead of passing it to the GET feature, and apparently that isn't so easy. Maybe my approach is completely wrong.
I think this is a valid case for karate.callSingle(): https://github.com/karatelabs/karate#karatecallsingle
So you can actually stick this in any feature and it is guaranteed to execute only once across your test suite. If the data is indeed something used by a majority of your test suite, you could even do this initialization in karate-config.js.
So this should work. First the reusable feature common.feature. Instead of the hard-coded response, you know how to make an actual HTTP request.
@ignore
Feature:
Scenario:
* def response =
"""
[
{
"brandName": "BMW",
"id": 1,
"winterTires": false,
"modelName": "X5"
},
{
"brandName": "Opel",
"id": 34,
"winterTires": true,
"modelName": "Insignia"
}
]
"""
* print 'getting car ids'
* def carIds = response.map(x => ({ id: x.id }))
Note the use of the JS map() function above, which I have started to recommend instead of JsonPath.
And here is a feature that uses the above. This uses the new @setup annotation that makes it easy to "loop": https://github.com/karatelabs/karate#setup
You can try this example quickly, and watch it make 2 requests using a param id from the loop.
Feature:
@setup
Scenario:
* def data = karate.callSingle('call-single-common.feature').carIds
Scenario Outline:
* url 'https://httpbin.org/get'
* param id = id
* method get
Examples:
| karate.setup().data |
There are other ways to loop, refer: https://github.com/karatelabs/karate#data-driven-features

Scrapy: Extracting data from sub pages

I'm trying to extract price details from web links like this using Scrapy. When I select each color, the browser sends a new AJAX request to the server, e.g. for the color Vert cèdre.
import scrapy

class TestSpider(scrapy.Spider):
    name = "Test"

    def start_requests(self):
        urls = [
            'https://www.alinea.com/fr-fr/p/vence-canape-1.5-places-fixe-en-lin-vert-cedre-26943589.html',
            # ... other URLs
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # extract each color and color url
        # hit each color page url to get pricing details
Scrapy gives me the contents of the main page in the parse function. My question is how I can hit the sub-page links (for the colors) and extract their contents in the parse method, so I can get the pricing detail for each color in a single object.
e.g.
{
'url': 'https://www.alinea.com/fr-fr/p/vence-canape-1.5-places-fixe-en-lin-vert-cedre-26943589.html'
'pricing': [{
'color': 'Beige roucas',
'price': '699,00 €'
},{
'color': 'Blanc capelan',
'price': '699,00 €'
}
....... other colors
]
}
If I yield the color page URLs from the parse method as new requests, how can I merge the pricing to get the above structure?
After going over some of the links, it doesn't seem like the price changes with the color (if I'm wrong, you'd probably use CrawlSpider with rules, or Splash together with Scrapy, for a more robust spider).
But for now, for the colors, their respective links etc., you could try the parse function below. Edit accordingly.
def parse(self, response):
    url = response.url
    price = response.xpath('//*[@class="product-price product-pricing"]/div/span/text()').get().strip()
    prices = []
    # Selector for the color options
    color_list = response.xpath('//*/li[@class="attribute attr-color "]/div[@class="value full-line"]/select/option')
    # Check if that selector exists &
    # cycle through colors adding the data
    if color_list:
        for color_data in color_list:
            prices.append({
                'color': color_data.xpath('@data-title').get().rsplit(':: ')[1],
                'price': price,
                'link': color_data.xpath('@data-link').get()
            })
    yield {
        'url': url,
        'pricing': prices,
    }
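If the price did turn out to differ per color, one way to get everything into a single object is to chain the color-page requests and pass the partially built item along, yielding it only once the last color page has been parsed. A rough sketch of that approach, meant to replace the parse methods on the TestSpider above; the selectors for the color links, color name and price are assumptions to adapt to the actual page, and cb_kwargs needs Scrapy 1.7+ (on older versions pass the same data through meta):

def parse(self, response):
    # collect the color page links first (selector is an assumption, adjust to the page)
    options = response.xpath('//li[contains(@class, "attr-color")]//select/option')
    pending = [opt.xpath('@data-link').get() for opt in options]
    item = {'url': response.url, 'pricing': []}
    if pending:
        yield scrapy.Request(pending[0], callback=self.parse_color,
                             cb_kwargs={'item': item, 'pending': pending[1:]})
    else:
        yield item

def parse_color(self, response, item, pending):
    # hypothetical selectors for the color name and price on the color page
    item['pricing'].append({
        'color': response.xpath('//li[contains(@class, "attr-color")]//option[@selected]/text()').get(),
        'price': response.xpath('//*[@class="product-price product-pricing"]/div/span/text()').get(),
    })
    if pending:
        # keep walking the remaining color pages, carrying the same item along
        yield scrapy.Request(pending[0], callback=self.parse_color,
                             cb_kwargs={'item': item, 'pending': pending[1:]})
    else:
        # last color page reached, the merged item is complete
        yield item

Chaining the requests this way keeps the colors in order and avoids merging partial items later, at the cost of fetching the color pages sequentially rather than in parallel.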

DataTables Pager Showing Many Pages when there is Only One

This is a weird one.
I'm using datatables v1.10.19 with jQuery 3.3.1 and Bootstrap 3.3.7
My datatables grid is configured to display 1000 records (but you can change it to 2500, 5000 and "all").
I've only got about 60 records in my database.
It is using Server-Side processing to retrieve data.
When the grid loads, the pager displays 5 buttons plus an ellipsis (as if there were even more).
And even weirder, if I change the drop-down to display "all" records, it acts as I would expect, i.e. the pager has 1 page button.
The payloads are pretty much identical:
{
"data": {
"draw": 8,
"recordsTotal": 86,
"recordsFiltered": 66,
"data": [rows of data here]
},
"outcome": {
"opResult": "Success",
"message": ""
}
}
When you click on page 2, it does successfully retrieve a payload with 0 rows.
But there shouldn't be a page 2 available on the pager.
The config object for the datatable looks like this:
eventsSvr.buildConfig = function (url) {
    return {
        "processing": true,
        "serverSide": true,
        //"paging": true,
        "ajax": {
            url: url,
            type: ajax.requestPOST,
            dataSrc: 'data.data' // the path in the JSON structure to the array which will be the rows.
        },
        "order": [[1, "asc"]],
        "lengthMenu": [[1000, 2500, 5000, -1], [1000, 2500, 5000, "All"]],
        "initComplete": function (settings, json) {
            eventsSvr.searchTextSpan.text('Search').removeClass('search-is-on');
        },
        "columns": eventsSvr.grid.columns,
        "columnDefs": eventsSvr.grid.columnDefs,
        dom: 'ltp'
    };
};
I do have a bunch of custom searches on the page, so I've had to write a lot of code like this:
$.fn.dataTable.ext.search.push(
    function (settings, data, dataIndex) {
        var picker3 = $(eventsSvr.datePickerInputs[0]).data(icapp.kendoKey);
        var picker4 = $(eventsSvr.datePickerInputs[1]).data(icapp.kendoKey);
        var rowStartDate = moment(data[3], icapp.date.momentParseFormat).toDate();
        var rowEndDate = moment(data[4], icapp.date.momentParseFormat).toDate();
        // ... etc.
    }
);
But the odd thing is the different behavior between "All" records vs 1000 records.
As described above, selecting "All" records works (resulting in just 1 page button), but none of the other page sizes work (i.e. 1000, 2500, 5000). The data for the 1 page does return, but I get 5 page buttons and an ellipsis.
Any ideas why this would be happening?
When using server-side processing mode, DataTables expects draw, recordsTotal and recordsFiltered to be root-level elements. Consider changing your response to the following, and you can remove the dataSrc option.
{
"draw": 8,
"recordsTotal": 86,
"recordsFiltered": 66,
"data": [rows of data here],
"outcome": {
"opResult": "Success",
"message": ""
}
}
Alternatively, you can manipulate the response before passing it to DataTables using a function supplied as the value of the dataSrc option, but I would recommend keeping things in the expected format for more readable code.

Scrapy - Why Item Inside For Loop Has The Same Value While Accessed in Another Parser

I want to scrape the link inside the for loop. Inside the for loop there are items, and I pass the item to the callback function. But why does the item in the callback function have the same value? This is my code.
import scrapy
import re
from scraper.product_items import Product

class ProductSpider(scrapy.Spider):
    name = "productspider"
    start_urls = [
        'http://www.website.com/category-page/',
    ]

    def parse(self, response):
        item = Product()
        for products in response.css("div.product-card"):
            link = products.css("a::attr(href)").extract_first()
            item['sku'] = products.css("div.product-card::attr(data-sku)").extract_first()
            item['price'] = products.css("div.product-card__old-price::text").extract_first()
            yield scrapy.Request(url=link, callback=self.parse_product_page, meta={'item': item})

    def parse_product_page(self, response):
        item = response.meta['item']
        item['image'] = response.css("div.productImage::attr(data-big)").extract_first()
        return item
The result is this.
[
{"sku": "DI684OTAA55INNANID", "price": "725", "image": "http://website.com/image1.jpg"},
{"sku": "DI684OTAA55INNANID", "price": "725", "image": "http://website.com/image2.jpg"},
{"sku": "DI684OTAA55INNANID", "price": "725", "image": "http://website.com/image3.jpg"},
]
As you can see, the sku and price have the same value for each iteration. I want the sku and price to be different for each result. If I only take the result of the parse method and change the code like this:
import scrapy
import re
from scraper.product_items import Product

class LazadaSpider(scrapy.Spider):
    name = "lazada"
    start_urls = [
        'http://www.lazada.co.id/beli-jam-tangan-kasual-pria/',
    ]

    def parse(self, response):
        item = Product()
        for products in response.css("div.product-card"):
            link = products.css("a::attr(href)").extract_first()
            item['sku'] = products.css("div.product-card::attr(data-sku)").extract_first()
            item['price'] = products.css("div.product-card__old-price::text").extract_first()
            yield item
Then the value of sku and price is correct for each iteration.
[
{"sku": "CA199FA31FKAANID", "price": "299"},
{"sku": "SW437OTAA31QO3ANID", "price": "200"},
{"sku": "SW437OTAM1RAANID", "price": "235"},
]
You should create the item inside the for loop, otherwise you just share the same item across all the iterations, only repopulating its values. So the correct code is:
def parse(self, response):
    for products in response.css("div.product-card"):
        item = Product()
        link = products.css("a::attr(href)").extract_first()
        item['sku'] = products.css("div.product-card::attr(data-sku)").extract_first()
        item['price'] = products.css("div.product-card__old-price::text").extract_first()
        yield item
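If, as in the original spider, you still need the image from the product page, the same fix applies when chaining the request. A minimal sketch built from the question's own code, where the only change is that item = Product() has moved inside the loop so each request carries its own copy:

def parse(self, response):
    for products in response.css("div.product-card"):
        item = Product()  # a fresh item per iteration
        link = products.css("a::attr(href)").extract_first()
        item['sku'] = products.css("div.product-card::attr(data-sku)").extract_first()
        item['price'] = products.css("div.product-card__old-price::text").extract_first()
        yield scrapy.Request(url=link, callback=self.parse_product_page, meta={'item': item})

def parse_product_page(self, response):
    item = response.meta['item']
    item['image'] = response.css("div.productImage::attr(data-big)").extract_first()
    return item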

Scrapy output only the last incrementally updated item

Can someone help me with this please? I have been searching for this information for 2 days with no luck.
I have an item with one field that is a list of other items. The spider works fine, but in the output file I get every incremental version of this item.
For example, I need the JSON to be printed as:
{"id": "AAAA", "details": [
{"date" : "2013-01-10", type="A"},
{"date" : "2013-02-10", type="B"},
{"date" : "2013-03-10", type="C"},
{"date" : "2013-04-10"}, type="D"]}
but I get:
{"id": "AAAA", "details": [
{"date" : "2013-01-10", type="A"}]}
{"id": "AAAA", "details": [
{"date" : "2013-01-10", type="A"},
{"date" : "2013-02-10", type="B"}]}
{"id": "AAAA", "details": [
{"date" : "2013-01-10", type="A"},
{"date" : "2013-02-10", type="B"},
{"date" : "2013-03-10", type="C"}
]}
{"id": "AAAA", "details": [
{"date" : "2013-01-10", type="A"},
{"date" : "2013-02-10", type="B"},
{"date" : "2013-03-10", type="C"},
{"date" : "2013-04-10"}, type="D"]}
I use a function to update my parent item:
def rePackIt(parent, item):
    if 'details' in parent:
        items = parent.get('details')
    else:
        items = []
    items.append(dict(item))
    parent['details'] = items
    return parent
In the parse function I do:
parent = ParentItem()
parent['id'] = self.param  # actually I parse a text file with many IDs
parent['details'] = []
yield FormRequest.from_response(response,
                                formname='...',
                                formdata={'...': '...', '...': parent['id'], '...': ''},
                                meta={'parent': parent, 'dont_merge_cookies': True},
                                callback=self.parse1)

def parse1(self, response):
    parent = response.meta['parent']
    sel = HtmlXPathSelector(response)
    records = sel.select('//ul[@class="...."]')
    for record in records:
        item = DetailItem()
        item['type'] = record.select('child...')
        doc_link = record.select('child.../a/@href').extract()
        yield Request(doc_link,
                      callback=self.parse2,
                      method='GET',
                      headers={...},
                      meta={'dont_merge_cookies': True, 'cookiejar': cookieJar, 'item': item, 'parent': parent}
                      )

def parse2(self, response):
    item = response.meta['item']
    parent = response.meta['parent']
    sel = HtmlXPathSelector(response)
    # some other parsing code
    item['date'] = cell.select('span[1]/text()[1]').extract()
    rePackIt(parent, item)
    return parent
The page you are trying to scrape and output as JSON has this structure:
Main Item 1 {some information}
    Detail Item 1
    Detail Item 2
Main Item 2
    Detail Item 1
    Detail Item 2
You are returning the parent object for each of the detail items scraped, while your intention is to return the parent object only once, after it is "complete", meaning the parent is populated with all of its detail items 1..n. The problem is that you don't have a nice way to tell when you have finished building the parent item.
One way to handle this would be writing an item pipeline (http://doc.scrapy.org/en/latest/topics/item-pipeline.html). This might sound complicated but it's not.
Basically, there are three steps in the pipeline:
open_spider
    create your global object of the form: itemlist = []

process_item
    if the item is a parent:
        add the item to the itemlist
    if the item is a child:
        find its parent item in the itemlist
        parentitem["details"].append(childitem)

close_spider
    serialize the itemlist to JSON and write it to the desired file. A sketch of such a pipeline is below.

One caveat with this is that if you are scraping huge amounts of data, all the scraped items will live in memory until you write them out in this method, as you won't be able to stream-write your JSON items.
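A minimal sketch of such a pipeline. The items import path, the parent_id field on DetailItem linking it back to its parent, and the output file name are assumptions for illustration; the spider would then yield each ParentItem once and each DetailItem separately, and the pipeline would be enabled via ITEM_PIPELINES in the project settings:

import json

# adjust this import to wherever ParentItem and DetailItem are defined (assumption)
from myproject.items import ParentItem, DetailItem

class MergeDetailsPipeline(object):

    def open_spider(self, spider):
        self.parents = {}  # parent id -> parent dict

    def process_item(self, item, spider):
        if isinstance(item, ParentItem):
            self.parents[item['id']] = dict(item)
        elif isinstance(item, DetailItem):
            # assumes each DetailItem carries the id of its parent,
            # e.g. item['parent_id'] set in parse1 (hypothetical field)
            parent = self.parents[item['parent_id']]
            parent.setdefault('details', []).append(dict(item))
        return item

    def close_spider(self, spider):
        # everything collected stays in memory until this point (the caveat above)
        with open('output.json', 'w') as f:
            json.dump(list(self.parents.values()), f)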
Let me know if this works, or if you found a better solution.