Who is the parent if we use rules in Scrapy? - scrapy

rules = (
    Rule(LinkExtractor(
        restrict_xpaths='//need_data',
        deny=deny_urls), callback='parse_info'),
    Rule(LinkExtractor(allow=r'/need/', deny=deny_urls), follow=True),
)
These rules extract the URLs we need for scraping, right?
Inside the callback, can I get the URL we came from?
For example, the website is needdata.com.
Rule(LinkExtractor(allow=r'/need/', deny=deny_urls), follow=True) extracts URLs like needdata.com/need/1, right?
Rule(LinkExtractor(restrict_xpaths='//need_data', deny=deny_urls), callback='parse_info') extracts URLs from needdata.com/need/1, for example from a table with people, and then parse_info scrapes them. Right?
But inside parse_info, how do I know who the parent is?
If needdata.com/need/1 links to needdata.com/people/1, I want to add a parent column to the output file whose value is needdata.com/need/1.
How can I do that? Thank you very much.

We ended up using
lx = LinkExtractor(allow=(r'shop-online/',))
and then
for l in lx.extract_links(response):
    # l.url is the URL we want
and then passing the parent along via
meta={'category': category}
I did not find a better solution.
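Applied to the original question, a minimal sketch of that idea (the spider name, start URL and XPath are placeholders taken from the question, and deny_urls is left out): extract the people links inside the callback for /need/ pages and carry the parent URL along in request.meta:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class NeedSpider(CrawlSpider):
    name = 'needdata'
    start_urls = ['https://needdata.com/']

    rules = (
        # follow pages like needdata.com/need/1 into parse_need
        Rule(LinkExtractor(allow=r'/need/'), callback='parse_need', follow=True),
    )

    def parse_need(self, response):
        # extract the people links from the table and remember where they came from
        lx = LinkExtractor(restrict_xpaths='//need_data')
        for link in lx.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_info,
                                 meta={'parent': response.url})

    def parse_info(self, response):
        # the parent column is simply the URL stored in meta, e.g. needdata.com/need/1
        yield {'url': response.url, 'parent': response.meta['parent']}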

URL parsing in SQL

I have inconsistent URLs in one of my tables.
The sample looks like
https://blue.decibal.com.au/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0
or
https://www.google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0
or
https3A%google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0
For the first URL the expected result is "blue", but it contains two domain parts, blue and decibal.
The second one is google.
The third is again google.
My requirement is to parse the URL and match it against a lookup table of domain names containing blue, google, bing, etc.
However, the inconsistency of the URLs stored in the DB is a challenge. I need to write SQL that identifies the match and, if there are two domain parts, just picks the first one. The URLs are not expected to be standard or well-formed.
Appreciate some help.
Are you looking for something like this? If not, I do believe that using the SPLIT as part of your parsing will help, since it then creates an array that you can manipulate. This is an example for Snowflake SQL, not SQL Server. They are both tagged in the OP, so not sure which you are looking for.
WITH x AS (
    SELECT REPLACE(url, '3A%', '//') AS url
    FROM (VALUES
        ('https://blue.decibal.com.au/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0'),
        ('https://www.google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0'),
        ('https3A%google.com/Transact?pi=9024&pai=2&ct=0&gi=1950&byo=true&ai=49&pa=289&ppt=0')) AS x (url)
)
SELECT split(split_part(split_part(url, '//', 2), '/', 1), '.') AS url_array,
       array_construct('google')  AS google_array,
       array_construct('decibal') AS decibal_array,
       array_construct('bing')    AS bing_array,
       CASE WHEN arrays_overlap(url_array, google_array)  THEN 'GOOGLE'
            WHEN arrays_overlap(url_array, decibal_array) THEN 'DECIBAL'
            WHEN arrays_overlap(url_array, bing_array)    THEN 'BING'
       END AS domain_match
FROM x;

Best Postgres/DB format for storing an array of strings with a boolean attached to them in Django?

I'm storing an array of URL links in my Postgres database like this:
urls = ArrayField(
    models.CharField(max_length=250, blank=False)
)
But I want to track if a URL has been visited or not, as a boolean. What is the best way to do this?
You can use django.contrib.postgres.fields.JSONField, which stores data as JSON and is a PostgreSQL-specific field. You can then put dicts with url and visited keys in it.
from django.contrib.postgres.fields import JSONField
from django.core.serializers.json import DjangoJSONEncoder
...
urls = JSONField(encoder=DjangoJSONEncoder, default=dict)
https://docs.djangoproject.com/en/2.0/ref/contrib/postgres/fields/#django.contrib.postgres.fields.JSONField
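A minimal sketch of that approach, assuming a hypothetical Bookmark model (the model name and methods are placeholders, and default=list is used here because we store a list of url/visited dicts rather than a single dict):

from django.contrib.postgres.fields import JSONField
from django.db import models

class Bookmark(models.Model):
    # each entry is a dict like {"url": "...", "visited": False}
    urls = JSONField(default=list)

    def add_url(self, url):
        self.urls.append({'url': url, 'visited': False})
        self.save()

    def mark_visited(self, url):
        # flip the visited flag for every matching entry
        for entry in self.urls:
            if entry['url'] == url:
                entry['visited'] = True
        self.save()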

Many inputs to one output, access wildcards in input files

Apologies if this is a straightforward question, I couldn't find anything in the docs.
Currently my workflow looks something like the rule below: I'm taking a number of input files created as part of this workflow and summarizing them.
Is there a way to avoid this manual regex step to parse the wildcards in the filenames?
I thought about an "expand" of cross_ids and config["chromosomes"], but I am unsure how to guarantee a consistent order.
rule report:
    output:
        table="output/mendel_errors.txt"
    input:
        files=expand("output/{chrom}/{cross}.in", chrom=config["chromosomes"], cross=cross_ids)
    params:
        req="h_vmem=4G",
    run:
        df = pd.DataFrame(index=range(len(input.files)), columns=["stat", "chrom", "cross"])
        for i, fn in enumerate(input.files):
            # open fn / make calculations etc // stat =
            # manual regex of filename to get chrom cross // chrom, cross =
            df.loc[i] = stat, chrom, cross
This seems a bit awkward when this information must be in the environment somewhere.
(via Johannes Köster on the google group)
To answer your question:
Expand uses itertools.product from the standard library. Hence, you could write
from itertools import product
product(config["chromosomes"], cross_ids)
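A sketch of how that removes the manual regex, assuming (as the answer implies) that expand() emits input.files in the same product order; this would replace the body of the run: block above, with pandas imported as pd as in the question and compute_stat standing in for the actual calculation:

from itertools import product

# pair each input file with the (chrom, cross) combination that produced it
pairs = list(product(config["chromosomes"], cross_ids))
df = pd.DataFrame(index=range(len(input.files)), columns=["stat", "chrom", "cross"])
for i, (fn, (chrom, cross)) in enumerate(zip(input.files, pairs)):
    stat = compute_stat(fn)  # placeholder for "open fn / make calculations"
    df.loc[i] = stat, chrom, cross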

How to add a url suffix before performing a callback in scrapy

I have a crawler that works just fine in collecting the urls I am interested in. However, before retrieving the content of these urls (i.e. the ones that satisfy rule no 3), I would like to update them, i.e. add a suffix - say '/fullspecs' - on the right-hand side. That means that, in fact, I would like to retrieve and further process - through callback function - only the updated ones. How can I do that?
rules = (
    Rule(LinkExtractor(allow=('something1'))),
    Rule(LinkExtractor(allow=('something2'))),
    Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5')), callback='parse_archive'),
)
You can set the link extractor's process_value parameter to lambda x: x + '/fullspecs', or to a function if you want to do something more complex.
You'd end up with:
Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5'),
                   process_value=lambda x: x + '/fullspecs'),
     callback='parse_archive')
See more at: http://doc.scrapy.org/en/latest/topics/link-extractors.html#basesgmllinkextractor
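If the rewrite gets more involved than a lambda, a named function keeps it readable; a small sketch reusing the placeholder patterns from the question:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

def add_fullspecs(value):
    # leave links alone if they already point at the specs page
    if value.endswith('/fullspecs'):
        return value
    return value + '/fullspecs'

rules = (
    Rule(LinkExtractor(allow=('something1'))),
    Rule(LinkExtractor(allow=('something2'))),
    Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5'),
                       process_value=add_fullspecs),
         callback='parse_archive'),
)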

How to extract links anywhere at any depth?

I am scraping the dell.com website; my goal is pages like http://accessories.us.dell.com/sna/productdetail.aspx?c=us&cs=19&l=en&s=dhs&sku=A7098144. How do I set link-extraction rules so that they find these pages anywhere, at any depth? As far as I know, there is no depth limit by default. If I do:
rules = (
    Rule(
        SgmlLinkExtractor(allow=r"productdetail\.aspx"),
        callback="parse_item"
    ),
)
it doesn't work: it crawls only the starting page. If I do:
rules = (
    Rule(
        SgmlLinkExtractor(allow=r".*")
    ),
    Rule(
        SgmlLinkExtractor(allow=r"productdetail\.aspx"),
        callback="parse_item"
    ),
)
it crawls product pages but doesn't scrape them (I mean it doesn't call parse_item() on them). I tried including follow=True on the first rule, although follow should default to True when there is no callback.
EDIT:
This is the rest of my code except for parse function:
import re
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request

class DellSpider(CrawlSpider):
    name = 'dell.com'
    start_urls = ['http://www.dell.com/sitemap']

    rules = (
        Rule(
            SgmlLinkExtractor(allow=r".*")
        ),
        Rule(
            SgmlLinkExtractor(allow=r"productdetail\.aspx"),
            callback="parse_item"
        ),
    )
From the CrawlSpider documentation:
If multiple rules match the same link, the first one will be used, according to the order they’re defined in this attribute.
Thus, you need to invert the order of your Rules. Currently .* will match everything, before productdetail\.aspx is checked at all.
This should work:
rules = (
    Rule(
        SgmlLinkExtractor(allow=r"productdetail\.aspx"),
        callback="parse_item"
    ),
    Rule(
        SgmlLinkExtractor(allow=r".*")
    ),
)
However, you will have to make sure that links will be followed in parse_item, if you want to follow links on productdetail pages. The second rule will not be called on productdetail pages.
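To sketch that last point (using the SgmlLinkExtractor and Request imports already at the top of the spider): because the productdetail rule has a callback and follow therefore defaults to False, parse_item has to yield its own requests for any further links it wants crawled, for example:

def parse_item(self, response):
    # ... scrape the product fields here ...

    # manually follow links found on productdetail pages, since the
    # rules are not applied to responses from a rule with follow=False
    for link in SgmlLinkExtractor(allow=r".*").extract_links(response):
        yield Request(link.url)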