Variable number of crawled items each run using Scrapy

I am using Scrapy to crawl a website that contains a category menu with different sub-levels of categories (i.e. category, subcategory, sub-subcategory, sub-sub-subcategory, etc., depending on the category).
For example:
--Category 1
    Subcategory 11
        Subsubcategory 111
        Subsubcategory 112
    Subcategory 12
        Subsubcategory 121
            Subsubsubcategory 1211
            Subsubsubcategory 1212
--Category 2
    Subcategory 21
...
There are approximately 30,000 categories, subcategories, sub-subcategories, etc., and I am only scraping this section, following a single Rule:
rules = [
    Rule(
        LinkExtractor(
            restrict_xpaths=['//div[@class="treecategories"]//a'],
        ),
        follow=True,
        callback='parse_categories',
    ),
]
And it seems to work fine. The problem is that each time I run my scraper I get a different number of crawled items, even though I know the website is not being updated. What could be the reason for this behaviour?
These are the settings I am using:
settings = {
    'BOT_NAME': 'crawler',
    'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36',
    'CONCURRENT_REQUESTS': 64,
    'COOKIES_ENABLED': False,
    'LOG_LEVEL': 'DEBUG',
}
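One way to narrow this down (a sketch, not a confirmed fix) is to make failed and duplicate-filtered requests visible and to rule out server-side throttling, then compare the closing crawl stats between runs. The settings below are standard Scrapy settings; the values are only illustrative.

# Diagnostic settings (illustrative values): retry transient failures,
# log requests dropped by the duplicate filter, and slow the crawl down
# so the server is less likely to reject or throttle requests.
settings.update({
    'RETRY_ENABLED': True,
    'RETRY_TIMES': 5,              # retry failed requests a few extra times
    'DUPEFILTER_DEBUG': True,      # log every request filtered as a duplicate
    'AUTOTHROTTLE_ENABLED': True,  # back off automatically under load
    'CONCURRENT_REQUESTS': 16,     # lower concurrency to rule out throttling
})

Comparing the downloader/response_status_count/* and retry/* entries in the stats Scrapy prints when the spider closes, across two runs, usually shows where the missing items went.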

Related

Kusto | calculate percentage grouped by 2 columns

I have a result set that looks similar to the table below, and I extended it with Percentage like so:
datatable (Code:string, App:string, Requests:long)
[
    "200", "tra", 63,
    "200", "api", 1036,
    "302", "web", 12,
    "200", "web", 219,
    "500", "web", 2,
    "404", "api", 18
]
| as T
| extend Percentage = round(100.0 * Requests / toscalar(T | summarize sum(Requests)), 2)
The problem is that I really want the percentage to be calculated from the total Requests per App rather than from the grand total.
For example, for App "api" where Code is "200", instead of 76.74% of the grand total, I want to express it as a percentage of just the "api" rows, which would be 98.29% (i.e. 1036 / (1036 + 18)) of the total Requests for App "api".
I haven't really tried anything that would be considered valid syntax. Any help is much appreciated.
You can use the join or lookup operators:
datatable (Code:string, App:string, Requests:long)
[
    "200", "tra", 63,
    "200", "api", 1036,
    "302", "web", 12,
    "200", "web", 219,
    "500", "web", 2,
    "404", "api", 18
]
| as T
| lookup ( T | summarize sum(Requests) by App ) on App
| extend Percentage = round(100.0 * Requests / sum_Requests, 2)
| project Code, App, Requests, Percentage
Code | App | Requests | Percentage
200  | api | 1036     | 98.29
404  | api | 18       | 1.71
200  | tra | 63       | 100
302  | web | 12       | 5.15
200  | web | 219      | 93.99
500  | web | 2        | 0.86
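For completeness, the same calculation written with join instead of lookup might look like the sketch below (the column alias TotalRequests is just illustrative):

datatable (Code:string, App:string, Requests:long)
[
    "200", "tra", 63,
    "200", "api", 1036,
    "302", "web", 12,
    "200", "web", 219,
    "500", "web", 2,
    "404", "api", 18
]
| as T
| join kind=inner ( T | summarize TotalRequests = sum(Requests) by App ) on App
| extend Percentage = round(100.0 * Requests / TotalRequests, 2)
| project Code, App, Requests, Percentage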

LibreOffice: Identifying 'Named Destinations'

I am working on an application that can open and display a PDF page using Poppler. I understand that Named Destinations are the right way to go in order to open particular pages and, more specifically, to show an area within the page.
I figured out that it is possible to export headings and bookmarks in the PDF file by enabling the "Export outlines as named destinations" option. However, the names of these destinations look like the ones below.
13 [ XYZ 96 726 null ] "5F5FRefHeading5F5F5FToc178915F2378596536"
14 [ XYZ 92 688 null ] "5F5FRefHeading5F5F5FToc179995F2378596536"
14 [ XYZ 92 655 null ] "5F5FRefHeading5F5F5FToc180015F2378596536"
14 [ XYZ 92 622 null ] "5F5FRefHeading5F5F5FToc187075F2378596536"
14 [ XYZ 92 721 null ] "5F5FRefHeading5F5F5FToc187095F2378596536"
There is no way to identify which heading is mapped to which destination. Page numbers are there, but if there are multiple headings on the same page it again takes trial and error to identify the right one.
Questions
Is there any way in LibreOffice (Writer) to find out what the name of the destination WILL BE once exported? Adobe Acrobat and PDF Studio Viewer have options to navigate through the list of destinations and 'see where they go'. To the best of my knowledge, the navigation pane in LibreOffice does not show destination names.
Is there a guarantee that the names remain unique irrespective of any sections (headings) or pages that may get inserted before them?
I understand that LibreOffice uses 5F in place of _ because underscores are not allowed in PDF bookmarks. So if I replace those, I am left with:
13 [ XYZ 96 726 null ] "__RefHeading___Toc17891_2378596536"
14 [ XYZ 92 688 null ] "__RefHeading___Toc17999_2378596536"
14 [ XYZ 92 655 null ] "__RefHeading___Toc18001_2378596536"
14 [ XYZ 92 622 null ] "__RefHeading___Toc18707_2378596536"
14 [ XYZ 92 721 null ] "__RefHeading___Toc18709_2378596536"
15 [ XYZ 96 726 null ] "__RefHeading___Toc18492_2378596536"
Decoding further: the prefix (RefHeading) indicates that the destination comes from a heading, and the suffix (2378596536) is probably a unique number identifying the entire document (since it is the same for all entries). The middle portion appears to be a unique key, but I am unable to identify the heading (or its section number) from this part.
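For what it's worth, the substitution described above is easy to script. The snippet below is a minimal sketch (not part of the original question) in Python, assuming each line of the destination dump has the format shown above, i.e. page [ XYZ x y null ] "name":

import re

# One line of the destination dump shown above (sample data from the question).
line = '13 [ XYZ 96 726 null ] "5F5FRefHeading5F5F5FToc178915F2378596536"'

match = re.match(r'(\d+)\s+\[([^\]]*)\]\s+"([^"]+)"', line)
if match:
    page, coords, raw_name = match.groups()
    # Undo LibreOffice's escaping of the underscore ("5F" is hex for "_").
    decoded = raw_name.replace("5F", "_")
    print(page, decoded)  # 13 __RefHeading___Toc17891_2378596536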

Ruby on Rails iterate through column efficiently

created_at iteration group_hits_per_iteration
--------------------------------------------------------------------
2019-11-08 08:14:05.170492 300 34
2019-11-08 08:14:05.183277 300 24
2019-11-08 08:14:05.196785 300 63
2019-11-08 08:14:05.333424 300 22
2019-11-08 08:14:05.549140 300 1
2019-11-08 08:14:05.576509 300 15
2019-11-08 08:44:05.832730 301 69
2019-11-08 08:44:05.850111 301 56
2019-11-08 08:44:05.866771 301 18
2019-11-08 08:44:06.310749 301 14
Hello
My goal is to create a sum total of the values in 'group_hits_per_iteration' for each unique value in the 'iteration' column, which will then be graphed using chartkick.
For example, for iteration 300 I would sum together 34, 24, 63, 22, 1, 15 for a total of 159, then repeat for each unique entry.
The code I've included below does work and generates the required output, but it is slow and gets slower as more data is read into the database.
It creates a hash that is fed into chartkick.
hsh = {}
Group.pluck(:iteration).uniq.each do |x|
  date  = Group.where("iteration = #{x}").pluck(:created_at).first.localtime
  itsum = Group.where("iteration = #{x}").pluck('SUM(group_hits_per_iteration)')
  hsh[date] = itsum
end

<%= line_chart [
  { name: "#{@groupdata1.first.networkid}", data: hsh }
] %>
I'm looking for other ways to approach this. I was thinking of having SQL do the heavy lifting instead of doing the calculations in Rails, but I'm not really sure how to go about that.
Thanks for the help.
If you want to get just the sums for every iteration, the following code should work:
# new lines only for readability
group_totals =
  Group
    .select('iteration, min(created_at) AS created_at, sum(group_hits_per_iteration) AS hits')
    .group('iteration')
    .order('iteration') # I suppose you want the results in some order

group_totals.each do |group|
  group.iteration  # => 300
  group.hits       # => 159
  group.created_at # => 2019-11-08 08:14:05.170492
end
In this case all the hard work is done by the database; you can just read the results in your Ruby code.
Note: in your code you take the first created_at for every iteration; I took the lowest date.
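To feed this into chartkick, the hash can be built from the grouped relation; this is a sketch (not from the original answer), assuming the group_totals relation defined above:

# Build the { timestamp => total hits } hash that line_chart expects,
# using the single grouped query instead of one query per iteration.
hsh = group_totals.each_with_object({}) do |group, h|
  h[group.created_at.localtime] = group.hits
end

<%= line_chart [{ name: "#{@groupdata1.first.networkid}", data: hsh }] %>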

Remove rows with character(0) from a data.frame before proceeding to dtm

I'm analyzing a data frame of product reviews that contains some empty entries and some text written in a foreign language. The data also contain some customer attributes that can be used as "features" in later analysis.
To begin with, I will first convert the reviews column into a DocumentTermMatrix and then convert it to lda format. I then plan to throw the documents and vocab objects generated from the lda process, along with selected columns from the original data frame, into stm's prepDocuments() function, so that I can leverage the more versatile estimation functions from that package, using customer attributes as features to predict topic salience.
However, some of the empty cells, punctuation, and foreign characters might be removed during pre-processing, thereby creating character(0) rows in the lda documents object and making those reviews unable to match their corresponding rows in the original data frame. Eventually, this will prevent me from generating the desired stm object from prepDocuments().
Methods to remove empty documents certainly exist (such as the methods recommended in this previous thread), but I am wondering whether there are ways to also remove the rows corresponding to the empty documents from the original data frame, so that the number of lda documents and the row dimension of the data frame that will be used as meta in the stm functions are aligned. Will indexing help?
Part of my data is listed at below.
df = data.frame(reviews = c("buenisimoooooo", "excelente", "excelent",
"awesome phone awesome price almost month issue highly use blu manufacturer high speed processor blu iphone",
"phone multiple failure poorly touch screen 2 slot sim card work responsible disappoint brand good team shop store wine money unfortunately precaution purchase",
"//:", "//:", "phone work card non sim card description", "perfect reliable kinda fast even simple mobile sim digicel never problem far strongly anyone need nice expensive dual sim phone perfect gift love friend", "1111111", "great bang buck", "actually happy little sister really first good great picture late",
"good phone good reception home fringe area screen lovely just right size good buy", "##haha", "phone verizon contract phone buyer beware", "这东西太棒了",
"excellent product total satisfaction", "dreadful phone home button never screen unresponsive answer call easily month phone test automatically emergency police round supplier network nothing never electricals amazon good buy locally refund",
"good phone price fine", "phone star battery little soon yes"),
rating = c(4, 4, 4, 4, 4, 3, 2, 4, 1, 4, 3, 1, 4, 3, 1, 2, 4, 4, 1, 1),
source = c("amazon", "bestbuy", "amazon", "newegg", "amazon",
"amazon", "zappos", "newegg", "amazon", "amazon",
"amazon", "amazon", "amazon", "zappos", "amazon",
"amazon", "newegg", "amazon", "amazon", "amazon"))
This is a situation where embracing tidy data principles can really offer a nice solution. To start with, "annotate" the dataframe you presented with a new column that keeps track of doc_id (which document each word belongs to), and then use unnest_tokens() to transform this to a tidy data structure.
library(tidyverse)
library(tidytext)
library(stm)
df <- tibble(reviews = c("buenisimoooooo", "excelente", "excelent",
"awesome phone awesome price almost month issue highly use blu manufacturer high speed processor blu iphone",
"phone multiple failure poorly touch screen 2 slot sim card work responsible disappoint brand good team shop store wine money unfortunately precaution purchase",
"//:", "//:", "phone work card non sim card description", "perfect reliable kinda fast even simple mobile sim digicel never problem far strongly anyone need nice expensive dual sim phone perfect gift love friend", "1111111", "great bang buck", "actually happy little sister really first good great picture late",
"good phone good reception home fringe area screen lovely just right size good buy", "##haha", "phone verizon contract phone buyer beware", "这东西太棒了",
"excellent product total satisfaction", "dreadful phone home button never screen unresponsive answer call easily month phone test automatically emergency police round supplier network nothing never electricals amazon good buy locally refund",
"good phone price fine", "phone star battery little soon yes"),
rating = c(4, 4, 4, 4, 4, 3, 2, 4, 1, 4, 3, 1, 4, 3, 1, 2, 4, 4, 1, 1),
source = c("amazon", "bestbuy", "amazon", "newegg", "amazon",
"amazon", "zappos", "newegg", "amazon", "amazon",
"amazon", "amazon", "amazon", "zappos", "amazon",
"amazon", "newegg", "amazon", "amazon", "amazon"))
tidy_df <- df %>%
  mutate(doc_id = row_number()) %>%
  unnest_tokens(word, reviews)
tidy_df
#> # A tibble: 154 x 4
#> rating source doc_id word
#> <dbl> <chr> <int> <chr>
#> 1 4 amazon 1 buenisimoooooo
#> 2 4 bestbuy 2 excelente
#> 3 4 amazon 3 excelent
#> 4 4 newegg 4 awesome
#> 5 4 newegg 4 phone
#> 6 4 newegg 4 awesome
#> 7 4 newegg 4 price
#> 8 4 newegg 4 almost
#> 9 4 newegg 4 month
#> 10 4 newegg 4 issue
#> # … with 144 more rows
Notice that all the information you had before is still there; it is just arranged in a different structure. You can fine-tune the tokenization process to fit your particular analysis needs, perhaps handling non-English text however you need, or keeping/not keeping punctuation, etc. This is where empty documents get thrown out, if that is appropriate for you.
Next, transform this tidy data structure into a sparse matrix, for use in topic modeling. The columns correspond to the words and the rows correspond to the documents.
sparse_reviews <- tidy_df %>%
  count(doc_id, word) %>%
  cast_sparse(doc_id, word, n)
colnames(sparse_reviews) %>% head()
#> [1] "buenisimoooooo" "excelente" "excelent" "almost"
#> [5] "awesome" "blu"
rownames(sparse_reviews) %>% head()
#> [1] "1" "2" "3" "4" "5" "8"
Next, make a dataframe of covariate (i.e. meta) information to use in topic modeling from the tidy dataset you already have.
covariates <- tidy_df %>%
  distinct(doc_id, rating, source)
covariates
#> # A tibble: 18 x 3
#> doc_id rating source
#> <int> <dbl> <chr>
#> 1 1 4 amazon
#> 2 2 4 bestbuy
#> 3 3 4 amazon
#> 4 4 4 newegg
#> 5 5 4 amazon
#> 6 8 4 newegg
#> 7 9 1 amazon
#> 8 10 4 amazon
#> 9 11 3 amazon
#> 10 12 1 amazon
#> 11 13 4 amazon
#> 12 14 3 zappos
#> 13 15 1 amazon
#> 14 16 2 amazon
#> 15 17 4 newegg
#> 16 18 4 amazon
#> 17 19 1 amazon
#> 18 20 1 amazon
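Because both objects are derived from the same tidy_df, the documents that survived tokenization and the covariate rows stay aligned. A quick sanity check (a sketch, not part of the original answer):

# The sparse matrix's rownames are the doc_ids of the documents that made it
# through tokenization; they should match the covariate rows one-to-one.
all(as.integer(rownames(sparse_reviews)) == covariates$doc_id)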
Now you can put this together into stm(). For example, if you want to train a topic model with the document-level covariates looking at whether topics change a) with source and b) smoothly with rating, you would do something like this:
topic_model <- stm(sparse_reviews, K = 0, init.type = "Spectral",
                   prevalence = ~source + s(rating),
                   data = covariates,
                   verbose = FALSE)
Created on 2019-08-03 by the reprex package (v0.3.0)
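Once the model has finished fitting, a quick way to inspect it (a sketch, not part of the original answer) is with stm's own helpers:

# Number of topics chosen by the spectral initialization (K = 0) and the
# highest-probability words for each topic.
summary(topic_model)
labelTopics(topic_model, n = 5)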

What is the difference between "dom_content_loaded.histogram.bin.start" and "dom_content_loaded.histogram.bin.end" in Google's BigQuery?

I need to build a histogram concerning DOMContentLoaded of a webpage. When I used BigQuery, I noticed that apart from density there are 2 more attributes (start, end). In my head there should be only 1 attribute, since the DOMContentLoaded event is fired only once the DOM has loaded.
Can anyone help clarify the difference between .start and .end? These attributes always differ by 100 milliseconds (if start = X ms, then end = X + 100 ms). See the query example posted below.
I cannot understand what these attributes represent exactly:
dom_content_loaded.histogram.bin.START
AND
dom_content_loaded.histogram.bin.END
Q: Which one of them represents the time that the DOMContentLoaded event
is fired in a user's browser?
SELECT
  bin.START AS start,
  bin.END AS endd
FROM
  `chrome-ux-report.all.201809`,
  UNNEST(dom_content_loaded.histogram.bin) AS bin
WHERE
  origin = 'https://www.google.com'
Output:
Row | start | end
1   | 0     | 100
2   | 100   | 200
3   | 200   | 300
4   | 300   | 400
[...]
The following explains the meaning of bin.start, bin.end, and bin.density.
Run the SELECT statement below:
SELECT
  origin,
  effective_connection_type.name type_name,
  form_factor.name factor_name,
  bin.start AS bin_start,
  bin.end AS bin_end,
  bin.density AS bin_density
FROM `chrome-ux-report.all.201809`,
  UNNEST(dom_content_loaded.histogram.bin) AS bin
WHERE origin = 'https://www.google.com'
You will get 1550 rows in the result; below are the first 5 rows:
Row origin type_name factor_name bin_start bin_end bin_density
1 https://www.google.com 4G phone 0 100 0.01065
2 https://www.google.com 4G phone 100 200 0.01065
3 https://www.google.com 4G phone 200 300 0.02705
4 https://www.google.com 4G phone 300 400 0.02705
5 https://www.google.com 4G phone 400 500 0.0225
You can read them as:
for phones on 4G, dom_content_loaded fired within 100 milliseconds for 1.065% of loads; between 100 and 200 milliseconds for 1.065%; between 200 and 300 milliseconds for 2.705%; and so on.
To summarize: for each origin, connection type, and form factor you get a histogram, represented as a repeated record with the start and end of each bin along with a density that represents the percentage of user experiences falling into that bin.
Note: if you add up the dom_content_loaded densities across all dimensions for a single origin, you will get 1 (or a value very close to 1 due to approximations).
For example
SELECT SUM(bin.density) AS total_density
FROM `chrome-ux-report.all.201809`,
UNNEST(dom_content_loaded.histogram.bin) AS bin
WHERE origin = 'https://www.google.com'
returns
Row total_density
1 0.9995999999999978
Hope this helped
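As a further illustration (not from the original answer), the histogram can also be turned into a cumulative figure, for example the share of page loads whose DOMContentLoaded fired within 1 second; the sketch below uses the same table and columns as above:

-- Share of loads with DOMContentLoaded under 1 second for one origin:
-- sum the densities of every bin that starts below 1000 ms.
SELECT
  SUM(bin.density) AS density_under_1s
FROM
  `chrome-ux-report.all.201809`,
  UNNEST(dom_content_loaded.histogram.bin) AS bin
WHERE
  origin = 'https://www.google.com'
  AND bin.start < 1000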