How can I scrape a website with an onClick event listener using Nokogiri - ruby-on-rails-3

I am trying to scrape a website using Nokogiri and download the documents that are posted on it. I can scrape other websites like this one: Matatiela Website, and get the documents from it. But when I try to scrape this website: Mbhashe Website, I can't get the documents, because I first have to trigger the onclick event in order to get to the document.
The problem is that I don't know how to trigger the onclick event to get to the document. I have tried this code, which I worked on with my friend, but it didn't work:
if url.include?('http://www.alfredduma.gov.za/bids-tender-notices/')
  # the onclick value looks like: location.href='...';return false;
  file = anchor['onclick'].to_s.gsub("location.href=", "").gsub(";return false;", "").gsub("'", "")
  f = mech.get(file)
  file_name = f.header['content-disposition']
  # .andand (from the andand gem) guards against a nil match
  file_name = file_name.match('"(.*?)"').andand[1].to_s
  file_name = municipalityName + " -" + file_name.gsub("_", " ")
  downld(municipalityName, file, filepath, file_name, provinceName)
end
This code didn't work. Below is code similar to what I used to scrape the Matatiela website, but it is not working on the Mbhashe website. Can you please help me, because it does not return anything.
["https://www.mbhashemun.gov.za/procurement/tenders/","div.tb > div.tbrow > a","http://www.mbhashemun.gov.za","Mbhashe municipality","Eastern Cape"]
My function gets the CSS selector from this array.
if baseurl.include?('https://www.mbhashemun.gov.za/procurement/tenders/')
  puts "downloading from mbhashemun"
  parenturl = anchor['href']
  puts parenturl
  puts baseurl
  tenderurl = parenturl
  begin
    if tenderurl.include?('http://www.mbhashemun.gov.za/web/2018/11/upgrade-and-maintenance-of-data-centre-for-a-period-of-three-03-years/')
      puts "the document is currently not available"
    else
      puts tenderurl
      passingparentUrl = HTTParty.get(tenderurl)
      parsedparentUrl = Nokogiri::HTML(passingparentUrl)
      downloadtenderurl = parsedparentUrl.at_css('div.media div.media-body > div.wpfilebase-attachment > div.wpfilebase-rightcol > div.wpfilebase-filetitle > a')[:href]
      puts downloadtenderurl
      bean = downloadtenderurl
      puts bean
      # one consistent name for the downloaded file
      file_name = bean.split('/').last
      puts file_name
      if File.exist?(File.join('public/uploads', file_name))
        puts "the file exists in the upload folder and in the database already"
      else
        mech.pluggable_parser.default = Mechanize::Download
        mech.get(bean).save(File.join('public/uploads', file_name))
        Tender.create(municipality_name: municipalityName, tender_description: file_name,
                      tender_document: file_name, provincename: provinceName)
      end
    end
  rescue StandardError => e
    puts e
  end
end
The code is supposed to go through the website, download the documents, and save them in the public/uploads folder of the app.
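Since Nokogiri only parses HTML and cannot execute JavaScript, the usual workaround is to extract the target URL from the onclick attribute itself, as the first snippet attempts. A minimal hedged sketch of that idea (the selector and regex are assumptions, since the exact markup of the Mbhashe pages isn't shown):
require 'mechanize'

mech = Mechanize.new
page = mech.get('https://www.mbhashemun.gov.za/procurement/tenders/')

# find anchors whose onclick looks like: location.href='/path/to/doc.pdf';return false;
page.search('a[onclick]').each do |anchor|
  if (m = anchor['onclick'].to_s.match(/location\.href=['"]([^'"]+)['"]/))
    doc_url = m[1]
    puts doc_url
    # mech.get(doc_url) can then download the file as in the Matatiela branch
  end
end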

Related

How to attach images in mailer from active storage association in Rails

In Rails 5.2, I have a model using has_many_attached :images. I would like to send out an email that includes all associated images as attachments.
My mailer method currently looks like:
def discrepancy_alert(asset_discrepancy_id, options = {})
  @asset_discrepancy = AssetDiscrepancy.find asset_discrepancy_id
  @asset_discrepancy.images.each_with_index do |img, i|
    attachments["img_#{ i }"] = File.read(img)
  end
  mail to: 'noone@gmail.com', subject: "email subject"
end
Obviously, File.read does not work here, because img is not a path, it is a blob. I have not been able to find any info on this in the docs.
Question One:
Is there a Rails method for attaching a blob like this?
I can use the following instead:
@asset_discrepancy.images.each_with_index do |img, i|
  attachments["img_#{ i }"] = img.blob.download
end
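For what it's worth, a hedged sketch of the whole mailer method with that approach, reusing each blob's own filename and content type rather than a synthetic key (model and recipient names are taken from the question):
def discrepancy_alert(asset_discrepancy_id, options = {})
  @asset_discrepancy = AssetDiscrepancy.find(asset_discrepancy_id)
  @asset_discrepancy.images.each do |img|
    # attachments[] also accepts a hash of mime_type/content,
    # so the blob's metadata can be reused directly
    attachments[img.blob.filename.to_s] = {
      mime_type: img.blob.content_type,
      content: img.blob.download
    }
  end
  mail to: 'noone@gmail.com', subject: 'email subject'
end
Note that blob.download still loads each file fully into memory, so the RAM concern in Question Two below applies here as well.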
Question Two:
The download method could use a lot of RAM; is this usage ill advised?
It seems, with the addition of ActiveStorage, that Rails mailers would have some new methods for interaction between the two... I have not seen anything in the docs. All the mailer attachments[] examples use paths to local files.
in app/mailers/mailer.rb:
if @content.image.attached?
  @filename = object.id.to_s + object.image.filename.extension_with_delimiter
  if ActiveStorage::Blob.service.respond_to?(:path_for)
    attachments.inline[@filename] = File.read(ActiveStorage::Blob.service.send(:path_for, object.image.key))
  elsif ActiveStorage::Blob.service.respond_to?(:download)
    attachments.inline[@filename] = object.image.download
  end
end
in mailer view:
<% if @filename %>
  <%= image_tag(attachments[@filename].url) %>
<% else %>
  <%= image_tag(attachments['placeholder.png'].url) %>
<% end %>
This worked for me in production using Amazon S3.
in mailer view:
<% if @object.images %>
  <% @object.images.each do |image| %>
    <% path = "https://www.example.com" + Rails.application.routes.url_helpers.rails_blob_path(image, only_path: true) %>
    <img src="<%= path %>">
  <% end %>
<% end %>
Here is a working solution for an Active Storage URL in an email template; I have seen the images render in Gmail. You can use rails_blob_url. This works for files stored in AWS S3.
mailer.rb
....
@image_url = Rails.application.routes.url_helpers.rails_blob_url(blob),
....
mailer view file
<img src="<%= @image_url %>">
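One hedged caveat on this approach: rails_blob_url builds an absolute URL, so the app needs to know its host when the URL is generated outside a request cycle, e.g. from a mailer. Something like the following is usually required (the host value here is a placeholder):
# config/environments/production.rb (host value is an assumption)
Rails.application.routes.default_url_options[:host] = 'www.example.com'
Without a configured host, URL generation typically fails with a "Missing host to link to!" ArgumentError.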

Scrapy only shows the first result of each page

I need to scrape the items of the first page, then follow the next button to the second page, scrape it, and so on.
This is my code, but it only scrapes the first item of each page; if there are 20 pages, it visits every page but scrapes only the first item.
Could anyone please help me?
Thank you.
Apologies for my English.
class CcceSpider(CrawlSpider):
    name = 'ccce'
    item_count = 0
    # note: Scrapy expects the plural attribute name
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com./afiliados?value=&categoria=444&letter=']
    rules = (
        # rules for each item
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//li[@class="pager-next"]/a')),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        ml_item = CcceItem()
        # product info
        ml_item['nombre'] = response.xpath('normalize-space(//div[@class="news-col2"]/h2/text())').extract()
        ml_item['url'] = response.xpath('normalize-space(//div[@class="website"]/a/text())').extract()
        ml_item['correo'] = response.xpath('normalize-space(//div[@class="email"]/a/text())').extract()
        ml_item['descripcion'] = response.xpath('normalize-space(//div[@class="news-col4"]/text())').extract()
        self.item_count += 1
        if self.item_count > 5:
            # insert_table(ml_item)
            raise CloseSpider('item_exceeded')
        yield ml_item
As you haven't given a working target URL, I'm guessing a bit here, but most probably this is the problem:
parse_item should be a parse_page (and act accordingly)
Scrapy downloads a full page which - according to your description - contains multiple items, and then passes it as a response object to your parse method.
It's your parse method's responsibility to process the whole page by iterating over the items displayed on the page and creating multiple scraped items accordingly.
The scrapy documentation has several good examples for this, one is here: https://doc.scrapy.org/en/latest/topics/selectors.html#working-with-relative-xpaths
Basically your code structure in def parse_XYZ should look like this:
def parse_page(self, response):
    items_on_page = response.xpath('//...')
    for sel_item in items_on_page:
        ml_item = CcceItem()
        # product info
        ml_item['nombre'] = # ...
        # ...
        yield ml_item
Insert the right xpaths for getting all items on the page and adjust your item xpaths and you're ready to go.

create a (Prawn) PDF inside custom DelayedJob and upload it to S3?

Using: Rails 4.2, Prawn, Paperclip, DelayedJobs via ActiveJobs, Heroku.
I have a PDF that is very large and needs to be generated in the background. Inside a custom job I want to create it, upload it to S3, and then email the user a URL when it's ready. I facilitate this via a PdfUpload model.
Is there anything wrong with my approach/code? I'm using File.open() as outlined in examples I found, but this seems to be the root of my error (TypeError: no implicit conversion of FlightsWithGradesReport into String).
class PdfUpload < ActiveRecord::Base
  has_attached_file :report,
    path: "schools/:school/pdf_reports/:id_:style.:extension"
end
# pages_controller.rb
def flights_with_grades_report
  flash[:success] = "The report you requested is being generated. An email will be sent to '#{ current_user.email }' when it is ready."
  GenerateFlightsWithGradesReportJob.perform_later(current_user.id, @rating.id)
  redirect_to :back
  authorize @rating, :reports?
end
# the job
class GenerateFlightsWithGradesReportJob < ActiveJob::Base
  queue_as :generate_pdf

  def perform(recipient_user_id, rating_id)
    rating = Rating.find(rating_id)
    pdf = FlightsWithGradesReport.new(rating.id)
    pdf_upload = PdfUpload.new
    pdf_upload.report = File.open(pdf)
    pdf_upload.report_processing = true
    pdf_upload.report_file_name = "report.pdf"
    pdf_upload.report_content_type = "application/pdf"
    pdf_upload.save!
    PdfMailer.pdf_ready(recipient_user_id, pdf_upload.id)
  end
end
This results in an error:
TypeError: no implicit conversion of FlightsWithGradesReport into String
Changing this:
pdf_upload.report = File.open( pdf )
to this:
pdf_upload.report = StringIO.new(pdf.render)
fixed my problem.
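A hedged explanation of why that fix works: Prawn's render returns the finished PDF as a binary String, whereas File.open expects a path string, which is why passing the report object raised the TypeError. Wrapping the rendered string in StringIO hands Paperclip the IO-like object it expects:
pdf = FlightsWithGradesReport.new(rating.id)

# Prawn's #render returns the PDF bytes as a String;
# StringIO exposes them through the IO interface Paperclip wants.
pdf_upload.report = StringIO.new(pdf.render)
pdf_upload.report_file_name    = "report.pdf"
pdf_upload.report_content_type = "application/pdf"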

DNN 7+ search is not indexing custom module items

I have a DNN 7.2.2 development site running under dnndev.me on my local machine. I have created a simple product catalogue module and am trying to integrate with the new search in DNN 7.
Here is the implementation of ModuleSearchBase in my feature/business controller:
Imports DotNetNuke.Entities.Modules
Imports DotNetNuke.Services.Exceptions
Imports DotNetNuke.Services.Search
Imports DotNetNuke.Common.Globals
Namespace Components
    Public Class FeatureController
        Inherits ModuleSearchBase
        Implements IUpgradeable

        Public Overrides Function GetModifiedSearchDocuments(moduleInfo As ModuleInfo, beginDate As Date) As IList(Of Entities.SearchDocument)
            Try
                Dim SearchDocuments As New List(Of Entities.SearchDocument)
                ' get list of changed products
                Dim vc As New ViewsController
                Dim pList As List(Of vw_ProductList_Short_Active) = vc.GetProduct_Short_Active(moduleInfo.PortalID)
                If pList IsNot Nothing Then
                    ' for each product, create a SearchDocument
                    For Each p As vw_ProductList_Short_Active In pList
                        Dim SearchDoc As New Entities.SearchDocument
                        Dim ModID As Integer = 0
                        If p.ModuleId Is Nothing OrElse p.ModuleId = 0 Then
                            ModID = moduleInfo.ModuleID
                        Else
                            ModID = p.ModuleId
                        End If
                        Dim array() As String = {"mid=" + ModID.ToString, "id=" + p.ProductId.ToString, "item=" + Replace(p.Name, " ", "-")}
                        Dim DetailUrl = NavigateURL(moduleInfo.TabID, GetPortalSettings(), "Detail", array)
                        With SearchDoc
                            .AuthorUserId = p.CreatedByUserId
                            .Body = p.ShortInfo
                            .Description = p.LongInfo
                            .IsActive = True
                            .PortalId = moduleInfo.PortalID
                            .ModifiedTimeUtc = p.LastUpdatedDate
                            .Title = p.Name + " - " + p.ProductNumber
                            .UniqueKey = Guid.NewGuid().ToString()
                            .Url = DetailUrl
                            .SearchTypeId = 2
                            .ModuleId = p.ModuleId
                        End With
                        SearchDocuments.Add(SearchDoc)
                    Next
                    Return SearchDocuments
                Else
                    Return Nothing
                End If
            Catch ex As Exception
                LogException(ex)
                Return Nothing
            End Try
        End Function
    End Class
End Namespace
I cleared the site cache and then manually started a search re-index. I can see from the host schedule history that the re-index runs and completes.
PROBLEM
None of the items from the above code are added to the index. I even used the Luke inspector to look into the Lucene index, and it confirms that these items are not added.
QUESTION
I need help figuring out why these items are not getting added, or help with debugging the indexing to see if anything is going wrong during that process.
Thanks in advance,
JK
EDIT #1
I ran the following procedure in SQL Server to see if the module is even listed in the search modules:
exec GetSearchModules [PortalId]
The module in question does appear in this list. The indexing is called for the FeatureController, but the results are not added to the Lucene index. Still need help.
EDIT #2
So I upgraded to 7.3.1 in the hope that something during the installation would fix this issue, but it did not. The search documents are still getting created/returned by the GetModifiedSearchDocuments function, but they are not being added to the Lucene index and therefore do not appear in the search results.
EDIT #3
The breakpoint is not getting hit like I thought after the upgrade, but I added a Try/Catch to log exceptions, and the following error log is created when I try to manually re-index (cleaned up to keep it short):
AssemblyVersion:7.3.1
PortalID:-1
PortalName:
DefaultDataProvider:DotNetNuke.Data.SqlDataProvider, DotNetNuke
ExceptionGUID:d0a443da-3d68-4b82-afb3-8c9183cf8424
InnerException:Sequence contains more than one matching element
Method:System.Linq.Enumerable.Single
StackTrace:
Message:
System.InvalidOperationException: Sequence contains more than one matching element
at System.Linq.Enumerable.Single[TSource](IEnumerable`1 source, Func`2 predicate)
at DotNetNuke.Services.Scheduling.Scheduler.CoreScheduler.LoadQueueFromTimer()
at DotNetNuke.Services.Scheduling.Scheduler.CoreScheduler.Start()
Source:
Server Name: KING-PC
EDIT #4
Okay, I fixed the problem in edit #3 by following this discussion on the DNN issue tracker, but there are still no items being added to the Lucene index.
The breakpoint is hit, and once I leave the debugger running for a while I get the following error:
{"Exception of type 'Lucene.Net.Index.MergePolicy+MergeException' was
thrown."} {"Cannot overwrite:
C:\websites\dnndev.me\App_Data\Search\_1f0.fdt"}
Looks like a permission error. I'll see what I can work out
J King,
I just finished a series on DNNHero.com on implementing search in your module. Parts 3 and 4 cover implementing and debugging your ModuleSearchBase implementation.
EDIT: Remove your assignment to SearchTypeId in your implementation.
Also, here is a sample snippet showing how I am setting the attributes of the SearchDocument. Again, watch the videos for a whole bunch of other potential pitfalls in the search implementation.
SearchDocument doc = new SearchDocument
{
    UniqueKey = String.Format("{0}_{1}_{2}",
        moduleInfo.ModuleDefinition.DefinitionName, moduleInfo.PortalID, item.ItemId),
    AuthorUserId = item.AssignedUserId,
    ModifiedTimeUtc = item.LastModifiedOnDate.ToUniversalTime(),
    Title = item.ItemName,
    Body = item.ItemDescription,
    Url = "",
    CultureCode = "en-US",
    Description = "DotNetNuclear Search Content Item",
    IsActive = true,
    ModuleDefId = moduleInfo.ModuleDefID,
    ModuleId = item.ModuleId,
    PortalId = moduleInfo.PortalID,
    TabId = tab
};

Rails / Sitemap_Generator: Subdomain sitemaps

I'm trying to create a sitemap for my app, which features subdomains, using the sitemap_generator gem. However, I'm getting an error with my code:
the scheme http does not accept registry part: .foo.com (or bad hostname?)
My sitemap.rb:
SitemapGenerator::Sitemap.default_host = "http://www.foo.com"
SitemapGenerator::Sitemap.sitemaps_host = "http://s3.amazonaws.com/foo/"
SitemapGenerator::Sitemap.public_path = 'tmp/'
SitemapGenerator::Sitemap.sitemaps_path = 'sitemaps/'
SitemapGenerator::Sitemap.adapter = SitemapGenerator::WaveAdapter.new
SitemapGenerator::Sitemap.create do
  add '/home'
end

Customer.find_each do |customer|
  SitemapGenerator::Sitemap.default_host = "http://#{customer.user_name}.foo.com"
  SitemapGenerator::Sitemap.sitemaps_path = "sitemaps/#{customer.user_name}"
  SitemapGenerator::Sitemap.create do
    add '/home'
  end
end
What am I doing wrong?
I'm the author of this gem.
I am fairly certain that the problem is one of the customer user names containing a character that is not allowed in a URL. A simple test with simple names works, e.g.:
%w(bill mary bob).each do |customer|
  SitemapGenerator::Sitemap.default_host = "http://#{customer}.foo.com"
  SitemapGenerator::Sitemap.sitemaps_path = "sitemaps/#{customer}"
  SitemapGenerator::Sitemap.create do
    add '/home'
  end
end
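Building on that, a hedged sketch that validates each user name before generating its sitemap, so offending records are logged and skipped instead of aborting the run (the regex is an assumption about what counts as a valid DNS label):
Customer.find_each do |customer|
  # hostname labels may only contain a-z, 0-9 and interior hyphens
  unless customer.user_name.to_s =~ /\A[a-z0-9]([a-z0-9-]*[a-z0-9])?\z/i
    Rails.logger.warn "Skipping sitemap for invalid subdomain: #{customer.user_name.inspect}"
    next
  end

  SitemapGenerator::Sitemap.default_host = "http://#{customer.user_name}.foo.com"
  SitemapGenerator::Sitemap.sitemaps_path = "sitemaps/#{customer.user_name}"
  SitemapGenerator::Sitemap.create do
    add '/home'
  end
end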