Scrapy: Extracting data from sub pages

I'm trying to extract price details from web links like the one below using Scrapy. Each time I select a color, the browser sends a new AJAX request to the server, e.g. for the color Vert cèdre.
import scrapy
class TestSpider(scrapy.Spider):
    name = "Test"
    def start_requests(self):
        urls = [
            'https://www.alinea.com/fr-fr/p/vence-canape-1.5-places-fixe-en-lin-vert-cedre-26943589.html',
            # ... other URLs
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
    def parse(self, response):
        # extract each color and color url
        # hit each color page url to get pricing details
        pass
Scrapy gives me the contents of the main page in the parse function. My question is how I can hit the sub page links (one per color) and extract their contents in the parse method, so that I get the pricing details for each color in a single object.
e.g.
{
    'url': 'https://www.alinea.com/fr-fr/p/vence-canape-1.5-places-fixe-en-lin-vert-cedre-26943589.html',
    'pricing': [{
        'color': 'Beige roucas',
        'price': '699,00 €'
    },{
        'color': 'Blanc capelan',
        'price': '699,00 €'
    }
    ....... other colors
    ]
}
If I yield the color page URLs from the parse method as new requests, how can I merge the pricing to get the structure above?

After going over some of the links, it doesn't seem like the price changes with the color. (If I'm wrong, you would probably use a CrawlSpider with rules, or Splash together with Scrapy, for a more robust spider.)
But for now, to get each color and its respective link, you could try the parse function below and edit it as needed.
def parse(self, response):
    url = response.url
    price = response.xpath('//*[@class="product-price product-pricing"]/div/span/text()').get().strip()
    prices = []
    # Selector for the color options
    color_list = response.xpath('//*/li[@class="attribute attr-color "]/div[@class="value full-line"]/select/option')
    # Check that the selector matched, then
    # cycle through the colors, adding the data
    if color_list:
        for color_data in color_list:
            prices.append({
                'color': color_data.xpath('@data-title').get().rsplit(':: ')[1],
                'price': price,
                'link': color_data.xpath('@data-link').get()
            })
    yield {
        'url': url,
        'pricing': prices,
    }
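If the price does turn out to differ per color, the merge asked about in the question comes down to bookkeeping: carry one shared record through the color requests and emit it only once the last color page has been parsed. Below is a minimal, framework-free sketch of that bookkeeping. PricingCollector and its methods are hypothetical names, not part of Scrapy, and the URL and links are placeholders; in a spider the collector would travel with each color request and add would be called from the color-page callback.

```python
class PricingCollector:
    """Collects one price per color link and reports when all are in."""

    def __init__(self, url, color_links):
        self.url = url
        # Links that have not been answered yet.
        self.pending = set(color_links)
        self.pricing = []

    def _result(self):
        # Only hand out the item once nothing is outstanding.
        if self.pending:
            return None
        return {'url': self.url, 'pricing': self.pricing}

    def add(self, link, color, price):
        """Record one color page; return the finished item or None."""
        self.pending.discard(link)
        self.pricing.append({'color': color, 'price': price})
        return self._result()

    def skip(self, link):
        """Drop a failed color page so the item can still complete."""
        self.pending.discard(link)
        return self._result()


collector = PricingCollector(
    'https://example.com/product',   # placeholder product URL
    ['/color/a', '/color/b'],        # placeholder color links
)
first = collector.add('/color/a', 'Beige roucas', '699,00 €')
second = collector.add('/color/b', 'Blanc capelan', '699,00 €')
```

Here first is None because one color is still outstanding, and second is the complete item. A failed request would need to call skip (for instance from an error callback), otherwise the item is never emitted.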

Related

django rest pagination on api view decorator

I'm trying to add pagination to my Django REST code, but I get the same response when I change the page number. This is how I request a page: http://localhost:8000/movies?page=3
When I change the page number I get the same response. I don't know if I have to send the page number in some other way, but I did the same as in this Stack Overflow thread.
Here is the entire view code:
@api_view(['GET', 'POST', 'DELETE', 'PUT'])
def movies(request):
    if request.method == 'GET':
        if request.query_params.get('id'):
            try:
                id = request.query_params.get('id')
                movie = Movie.objects.get(id=id)
                serializer = MovieSerializer(movie, many=False)
                return Response(serializer.data)
            except Movie.DoesNotExist:
                return Response(status=status.HTTP_404_NOT_FOUND)
        movies = Movie.objects.all().order_by('release_date')
        serializer = MovieSerializer(movies, many=True, context={'request': request})
        if request.query_params.get('page'):
            paginator = LimitOffsetPagination()
            result_page = paginator.paginate_queryset(movies, request)
            serializer = MovieSerializer(result_page, many=True, context={'request': request})
            return Response(serializer.data)
        if request.query_params.get('Genre'):
            genreparam = request.query_params.get('Genre')
            genre = Genre.objects.get(name=genreparam)
            queryset = Movie.objects.filter(genre_relation=genre.id).values().order_by('release_date')
            return Response(queryset)
        return Response(serializer.data)
This is my settings.py:
REST_FRAMEWORK = {
    'DEFAULT_FILTER_BACKENDS': ['django_filters.rest_framework.DjangoFilterBackend'],
    'DEFAULT_PAGINATION_CLASS': 'rest_framework.pagination.PageNumberPagination',
    'PAGE_SIZE': 2,
}
This is what I get, whatever page number I send in the query parameters:
[
    {
        "id": 1,
        "title": "Guardians of the galaxy",
        "tagline": "this is a tagline",
        "overview": "this is an overview, starlord in the begins...",
        "release_date": "1971-07-13T03:00:00Z",
        "poster_url": "http\"//posterurl",
        "backdrop_url": "http\"//backdropurl",
        "imdb_id": "idk what is a imdb",
        "genre_relation": []
    },
    {
        "id": 2,
        "title": "Avengers endgame",
        "tagline": "this is a tagline",
        "overview": "tony stark dies, theres no more happy days, only days",
        "release_date": "2019-07-13T03:00:00Z",
        "poster_url": "http//posterurl",
        "backdrop_url": "http//backdropurl",
        "imdb_id": "idk what is a imdb",
        "genre_relation": [
            1
        ]
    }
]

You are not using the pagination properly. LimitOffsetPagination reads the limit and offset query parameters and ignores page, so /movies?page=3 always falls back to offset 0 and returns the same first items. You also return serializer.data directly, which drops the pagination metadata:
paginator = LimitOffsetPagination()
result_page = paginator.paginate_queryset(movies, request)
Since your requests (and your DEFAULT_PAGINATION_CLASS setting) use page numbers, you should rewrite this to:
paginator = PageNumberPagination()
result_page = paginator.paginate_queryset(movies, request)
Note that paginate_queryset also accepts an optional view argument. A class-based view would pass view=self; a function-based view has no self, so the argument is omitted here.
Furthermore, you should serialize result_page rather than the full queryset:
serializer = MovieSerializer(result_page, many=True, context={'request': request})
Finally you should return the paginated results with:
return paginator.get_paginated_response(serializer.data)
This will add pagination metadata to the response.
So a full example:
@api_view(['GET', 'POST', 'DELETE', 'PUT'])
def movies(request):
    # ...
    if request.query_params.get('page'):
        paginator = PageNumberPagination()
        result_page = paginator.paginate_queryset(movies, request)
        serializer = MovieSerializer(result_page, many=True, context={'request': request})
        return paginator.get_paginated_response(serializer.data)
    # ...
Note that using the @api_view decorator is often discouraged. You might want to consider using a class-based view instead.
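The view in the question uses LimitOffsetPagination while the settings name PageNumberPagination, and the two styles read different query parameters. The framework-free sketch below makes that difference concrete. It is not DRF's implementation: page_number_slice and limit_offset_slice are hypothetical helpers that only mimic which parameters each style looks at, with a page size of 2 as in the settings above.

```python
def page_number_slice(items, params, page_size=2):
    """Mimic page-number pagination: reads only the 'page' parameter."""
    page = int(params.get('page', 1))
    start = (page - 1) * page_size
    return items[start:start + page_size]


def limit_offset_slice(items, params, default_limit=2):
    """Mimic limit/offset pagination: reads 'limit' and 'offset', never 'page'."""
    limit = int(params.get('limit', default_limit))
    offset = int(params.get('offset', 0))
    return items[offset:offset + limit]


movies = ['m1', 'm2', 'm3', 'm4', 'm5', 'm6']

by_page = page_number_slice(movies, {'page': '3'})
ignored = limit_offset_slice(movies, {'page': '3'})   # 'page' is not consulted
by_offset = limit_offset_slice(movies, {'limit': '2', 'offset': '4'})
```

With these inputs by_page and by_offset both hold the last two entries, while ignored holds the first two, which mirrors the "same response for every page" symptom from the question.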

Passing subdomains to dash_leaflet.TileLayer

I have followed and adapted the LayersControl example from https://dash-leaflet.herokuapp.com/. I am trying to include a basemap from this (https://basemap.at/wmts/1.0.0/WMTSCapabilities.xml) source.
Upon running the code I get the error
Invalid argument subdomains passed into TileLayer.
Expected string.
Was supplied type array.
Value provided:
[
"map",
"map1",
"map2",
"map3",
"map4"
]
Looking into the documentation for dash_leaflet.TileLayer it says
- subdomains (string; optional):
Subdomains of the tile service. Can be passed in the form of one
string (where each letter is a subdomain name) or an array of
strings.
I think I understand the error message, but the error seems to disagree with the docstring of TileLayer. I am not sure whether I have missed a detail here.
MWE:
import dash
from dash import html
import dash_leaflet as dl
url = "https://{s}.wien.gv.at/basemap/geolandbasemap/normal/google3857/{z}/{y}/{x}.png"
subdomains = ["map", "map1", "map2", "map3", "map4"]
name = "Geoland Basemap"
attribution = "basemap.at"
app = dash.Dash(__name__)
app.layout = html.Div(
    dl.Map(
        [
            dl.LayersControl(
                [
                    dl.BaseLayer(dl.TileLayer(), name="default map", checked=True),
                    dl.BaseLayer(
                        dl.TileLayer(
                            url=url, attribution=attribution, subdomains=subdomains
                        ),
                        name=name,
                        checked=False,
                    ),
                ]
            )
        ],
        zoom=7,
        center=(47.3, 15.0),
    ),
    style={"width": "100%", "height": "50vh", "margin": "auto", "display": "block"},
)
if __name__ == "__main__":
    app.run_server(debug=True)
I am running
dash==2.6.1
dash_leaflet==0.1.23
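The two forms named in the docstring can be illustrated without Dash. In Leaflet-style tile URLs, {s} is filled with one entry from the subdomain list, and a plain string is treated as a sequence of single-letter names. The sketch below is an illustration under assumptions, not dash-leaflet's code: normalize_subdomains and tile_url are hypothetical helpers, the host is a placeholder, and choosing the entry by (x + y) modulo the list length is just one plausible rotation scheme.

```python
def normalize_subdomains(subdomains):
    """Return a list of names: 'abc' -> ['a', 'b', 'c'], a list stays as is."""
    return list(subdomains)


def tile_url(template, subdomains, x, y, z):
    """Fill a {s}/{z}/{y}/{x} template, rotating through the subdomains."""
    names = normalize_subdomains(subdomains)
    s = names[(x + y) % len(names)]
    return template.format(s=s, x=x, y=y, z=z)


template = "https://{s}.example.org/{z}/{y}/{x}.png"   # placeholder host

letters = normalize_subdomains("abc")
from_list = tile_url(template, ["map", "map1"], x=1, y=2, z=7)
from_string = tile_url(template, "map1", x=1, y=2, z=7)
```

In this sketch the string "map1" is split into the letters m, a, p and 1, so multi-character names such as map1 to map4 only survive intact in the list form, which is the form the question passes.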

Create Shopify product with Variant SKU using API

I am trying to create products on Shopify using the API.
In the CSV upload there is a field, Variant SKU, which sets the (default) product SKU. I can't seem to find the correct way to create a product along with this value.
I tried (Python 3):
import requests
import json
from config import SHOPIFY_URL
payload = {
    'product': {
        'title': 'Hello Product',
        'variants': {
            'option1': 'Primary',
            'sku': 'hello-product'
        }
    }
}
requests.post(
    f'{SHOPIFY_URL}/products.json',
    headers={'content-type': 'application/json'},
    data=json.dumps(payload)
)
The product gets created, but the SKU doesn't.
The TL;DR of my question:
What do I need to pass to fill the Variant SKU field of the product CSV upload file?
Update
Thanks to David Lazar's comments, I realized that I need to use a list of variants.
payload = {
    'product': {
        'title': 'Hello Product',
        'variants': [
            {
                'option1': 'Primary',
                'sku': 'hello-product'
            }
        ]
    }
}
This, however, creates the product with one variant using the passed SKU. What I am looking for is to create the product with its own SKU: no variations for the product, just a SKU for the product itself.

Scrapy - Why Item Inside For Loop Has The Same Value While Accessed in Another Parser

I want to scrape the link found inside the for loop. The loop fills an item, and I pass that item to the callback function. But why does the item in the callback function always have the same values? This is my code.
import scrapy
import re
from scraper.product_items import Product
class ProductSpider(scrapy.Spider):
    name = "productspider"
    start_urls = [
        'http://www.website.com/category-page/',
    ]
    def parse(self, response):
        item = Product()
        for products in response.css("div.product-card"):
            link = products.css("a::attr(href)").extract_first()
            item['sku'] = products.css("div.product-card::attr(data-sku)").extract_first()
            item['price'] = products.css("div.product-card__old-price::text").extract_first()
            yield scrapy.Request(url=link, callback=self.parse_product_page, meta={'item': item})
    def parse_product_page(self, response):
        item = response.meta['item']
        item['image'] = response.css("div.productImage::attr(data-big)").extract_first()
        return item
The result is this:
[
    {"sku": "DI684OTAA55INNANID", "price": "725", "image": "http://website.com/image1.jpg"},
    {"sku": "DI684OTAA55INNANID", "price": "725", "image": "http://website.com/image2.jpg"},
    {"sku": "DI684OTAA55INNANID", "price": "725", "image": "http://website.com/image3.jpg"},
]
As you can see, sku and price have the same value for every iteration, while I want them to differ per product. If I instead yield the item directly from parse, changing the code like this:
import scrapy
import re
from scraper.product_items import Product
class LazadaSpider(scrapy.Spider):
    name = "lazada"
    start_urls = [
        'http://www.lazada.co.id/beli-jam-tangan-kasual-pria/',
    ]
    def parse(self, response):
        item = Product()
        for products in response.css("div.product-card"):
            link = products.css("a::attr(href)").extract_first()
            item['sku'] = products.css("div.product-card::attr(data-sku)").extract_first()
            item['price'] = products.css("div.product-card__old-price::text").extract_first()
            yield item
then the values of sku and price are correct for each iteration:
[
    {"sku": "CA199FA31FKAANID", "price": "299"},
    {"sku": "SW437OTAA31QO3ANID", "price": "200"},
    {"sku": "SW437OTAM1RAANID", "price": "235"},
]

You should create the item inside the for loop; otherwise the same item object is shared between all iterations and only its values are overwritten. So the correct code is:
def parse(self, response):
    for products in response.css("div.product-card"):
        # a fresh item per product, so each request carries its own data
        item = Product()
        link = products.css("a::attr(href)").extract_first()
        item['sku'] = products.css("div.product-card::attr(data-sku)").extract_first()
        item['price'] = products.css("div.product-card__old-price::text").extract_first()
        yield scrapy.Request(url=link, callback=self.parse_product_page, meta={'item': item})
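The sharing effect described above can be reproduced without Scrapy. In the sketch below, plain dicts stand in for Product items and a list stands in for the queue of pending requests; shared_item and fresh_item are hypothetical helper names, and the sample rows are made up for illustration.

```python
def shared_item(rows):
    """Buggy pattern: one dict created before the loop and reused."""
    item = {}
    queued = []
    for sku, price in rows:
        item['sku'] = sku
        item['price'] = price
        queued.append(item)   # every entry references the same dict
    return queued


def fresh_item(rows):
    """Fixed pattern: a new dict for every iteration."""
    queued = []
    for sku, price in rows:
        item = {'sku': sku, 'price': price}
        queued.append(item)
    return queued


rows = [('A1', '299'), ('B2', '200'), ('C3', '235')]
buggy = shared_item(rows)
fixed = fresh_item(rows)

buggy_skus = [entry['sku'] for entry in buggy]
fixed_skus = [entry['sku'] for entry in fixed]
```

All entries in buggy are the same object, so they all show whatever values were written last, while fixed keeps one distinct record per row. This presumably also explains why yielding the item directly looked correct in the question: each item was consumed before the next iteration overwrote it, whereas the product-page callbacks only run later.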

Django Rest Framework Displaying Serialized data through Views.py

class International(object):
    """ International Class that stores versions and lists
    countries
    """
    def __init__(self, version, countrylist):
        self.version = version
        self.country_list = countrylist
class InternationalSerializer(serializers.Serializer):
    """ Serializer for International page
    Lists International countries and current version
    """
    version = serializers.IntegerField(read_only=True)
    country_list = CountrySerializer(many=True, read_only=True)
I have a serializer set up this way, and I wish to display serializer.data (which will be a dictionary like this: { "version": xx, "country_list": [ ] }) using views.py.
I have my views.py set up this way:
class CountryListView(generics.ListAPIView):
    """ Endpoint : somedomain/international/
    """
    ## want to display a dictionary like the one below
    # {
    #     "version": 5,
    #     "country_list" : [ { xxx } , { xxx } , { xxx } ]
    # }
What do I code in this CountryListView to render a dictionary like the one above? I'm really unsure.

Try this:
class CountryListView(generics.ListAPIView):
    """ Endpoint : somedomain/international/
    """
    def get(self, request):
        # get your version and country_list data and
        # init your object
        international_object = International(version, country_list)
        serializer = InternationalSerializer(instance=international_object)
        your_data = serializer.data
        # a DRF view must return a Response (from rest_framework.response import Response)
        return Response(your_data)
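For reference, the shape that serializer.data is expected to have can be sketched without DRF. The dataclasses below are stand-ins only: the Country fields (name, code) are hypothetical, since the question does not show CountrySerializer, and asdict merely plays the role of the serializer.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Country:
    # Hypothetical fields; the real CountrySerializer is not shown.
    name: str
    code: str


@dataclass
class International:
    version: int
    country_list: list = field(default_factory=list)


international = International(
    version=5,
    country_list=[Country('France', 'FR'), Country('Japan', 'JP')],
)

# asdict walks nested dataclasses, including those inside lists.
payload = asdict(international)
```

The resulting payload has a version key and a country_list key holding one dictionary per country, which is the layout the view is supposed to return.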

You can build on the idea from here:
http://www.django-rest-framework.org/api-guide/pagination/#example
Suppose we want to replace the default pagination output style with a modified format that includes the next and previous links in a nested 'links' key. We could specify a custom pagination class like so:
class CustomPagination(pagination.PageNumberPagination):
    def get_paginated_response(self, data):
        return Response({
            'links': {
                'next': self.get_next_link(),
                'previous': self.get_previous_link()
            },
            'count': self.page.paginator.count,
            'results': data
        })
As long as you don't need actual pagination, you can set up a custom pagination class that packs your response in whichever layout you need:
class CountryListPagination(BasePagination):
    def paginate_queryset(self, queryset, request, view=None):
        # no real paging: hand back every object
        return list(queryset)
    def get_paginated_response(self, data):
        return Response({
            'version': 5,
            'country_list': data
        })
Then all you need to do is specify this pagination class on your class-based view:
class CountryListView(generics.ListAPIView):
    # Endpoint : somedomain/international/
    pagination_class = CountryListPagination
Let me know how this works for you.
