Using SQS with related data (Haystack)

Using SQS with related data (Haystack) - indexing

I have two fields for a Paragraph model, one of them being a ManyToMany field.
class Tag(models.Model):
    tag = models.CharField(max_length=500)
    def __unicode__(self):
        return self.tag
admin.site.register(Tag)
class Paragraph(models.Model):
    article = models.ForeignKey(Article)
    text = models.TextField()
    tags = models.ManyToManyField(Tag)
    def __unicode__(self):
        return "Headline: " + self.article.headline + " Tags: " + ', '.join([t.tag for t in self.tags.all()])
admin.site.register(Paragraph)
And my .txt template file reflects the ManyToMany relationship so that the tags are indexed:
{{object.text}}
{% for tag in object.tags.all %}
{{tag.tag}}
{% endfor %}
My views.py then uses SQS to search for all the tags (I want to accomplish this first before including the text field) and retrieves those. In this case, the query is "Politics":
def politics(request):
    paragraphs = []
    sqs = SearchQuerySet().filter(tag="Politics")
    paragraphs = [a.object for a in sqs[0:10]]
    return render_to_response("search/home_politics.html",{"paragraphs":paragraphs},context_instance=RequestContext(request))
Edited:
and my search_indexes.py
class ParagraphIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    tags = indexes.CharField(model_attr='tags')
    def get_model(self):
        return Paragraph
    def index_queryset(self):
        return self.get_model().objects
    def load_all_queryset(self):
        # Pull all objects related to the Paragraph in search results.
        return Paragraph.objects.all().select_related()
However, this doesn't retrieve anything even though a few paragraphs have the tag "Politics". Am I missing anything here, or should I approach related data another way? I am a beginner with Haystack, so any help will be much appreciated. Thanks in advance!

So this is a very useful article that helped me solve the problem.
Based on the article, this is how my search_indexes.py looks now:
class ParagraphIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    tags = indexes.MultiValueField()
    def prepare_tags(self, object):
        # Index each related tag as a plain string
        return [tag.tag for tag in object.tags.all()]
    def get_model(self):
        return Paragraph
    def index_queryset(self):
        return self.get_model().objects
    def load_all_queryset(self):
        # Pull all objects related to Paragraph in search results.
        return Paragraph.objects.all().select_related()
and my views.py:
def politics(request):
    paragraphs = []
    sqs = SearchQuerySet().filter(tags='Politics')
    paragraphs = [a.object for a in sqs[0:10]]
    return render_to_response("search/home.html",
                              {"paragraphs": paragraphs},
                              context_instance=RequestContext(request))
And I am using elasticsearch for the engine. Hope this helps!
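To make the shape of the indexed data concrete, here is a plain-Python sketch of what prepare_tags hands to the index. It does not use Django or Haystack: the Tag and Paragraph classes below are hypothetical stand-ins, and filter_by_tag is only a rough analogy for SearchQuerySet().filter(tags=...), since a real backend applies its own analysis and matching rules.

```python
class Tag:
    """Stand-in for the Tag model (hypothetical, no Django required)."""
    def __init__(self, tag):
        self.tag = tag


class Paragraph:
    """Stand-in for the Paragraph model; `tags` is a plain list here."""
    def __init__(self, text, tags):
        self.text = text
        self.tags = tags


def prepare_tags(paragraph):
    # Mirrors prepare_tags in the index: one plain string per related tag.
    return [tag.tag for tag in paragraph.tags]


def filter_by_tag(paragraphs, wanted):
    # Rough analogy for filtering on the multi-valued `tags` field.
    return [p for p in paragraphs if wanted in prepare_tags(p)]


# Made-up sample data for illustration only.
paragraphs = [
    Paragraph("Budget vote delayed", [Tag("Politics"), Tag("Economy")]),
    Paragraph("Cup final preview", [Tag("Sport")]),
]
matches = filter_by_tag(paragraphs, "Politics")
print([p.text for p in matches])
```

The point of the sketch is that the field holds a list of strings, one per tag, rather than a reference to the related objects.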

Related

How can I extract the item id from the response in Scrapy?

import scrapy
class FarmtoolsSpider(scrapy.Spider):
    name = 'farmtools'
    allowed_domains = ['www.donedeal.ie']
    start_urls = ['https://www.donedeal.ie/farmtools/']
    def parse(self, response):
        rows = response.xpath('//ul[@class="card-collection"]/li')
        for row in rows:
            yield {
                'item_id': row.xpath('.//a/@href').get(),
                'item_title': row.xpath('.//div[1]/p[@class="card__body-title"]/text()').get(),
                'item_county': row.xpath('.//ul[@class="card__body-keyinfo"]/li[2]/text()').get(),
                'item_price': row.xpath('.//p[@class="card__price"]/span[1]/text()').get()
            }
I want to extract the item number from the item_id response which is a url.
Is it possible to do this?
The response looks like this:
{'item_id': 'https://www.donedeal.ie/farmtools-for-sale/international-784-tractor/25283884?campaign=3',
 'item_title': 'INTERNATIONAL 784 TRACTOR',
 'item_county': 'Derry', 'item_price': '3,000'}
I'd appreciate any advice, thanks

Something like this would work. It's not clean, but it splits the string up until you get the id you want.
def parse(self, response):
    rows = response.xpath('//ul[@class="card-collection"]/li')
    for row in rows:
        link = row.xpath('.//a/@href').get()
        # Last path segment, e.g. '25283884?campaign=3'
        link_split = link.split('/')[-1]
        # Drop the query string, leaving only the id
        link_id = link_split.split('?')[0]
        yield {
            'item_id': link_id,
            'item_title': row.xpath('.//div[1]/p[@class="card__body-title"]/text()').get(),
            'item_county': row.xpath('.//ul[@class="card__body-keyinfo"]/li[2]/text()').get(),
            'item_price': row.xpath('.//p[@class="card__price"]/span[1]/text()').get()
        }
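The same split can also be tried outside Scrapy with the standard library. This is a minimal sketch, not part of the spider: extract_item_id is a hypothetical helper name, and it assumes the id is always the last path segment of the URL, as in the example URL from the question.

```python
from urllib.parse import urlsplit


def extract_item_id(url):
    """Return the last path segment of `url`, ignoring any query string."""
    path = urlsplit(url).path  # the query string ('?campaign=3') is not part of the path
    return path.rstrip('/').rsplit('/', 1)[-1]


url = ("https://www.donedeal.ie/farmtools-for-sale/"
       "international-784-tractor/25283884?campaign=3")
print(extract_item_id(url))  # prints 25283884
```

Letting urlsplit separate the path from the query string avoids the second manual split on '?'.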
Update in response to comment
Complete code example
import scrapy
class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['donedeal.ie']
    start_urls = ['https://www.donedeal.ie/farmtools/']
    def parse(self, response):
        rows = response.xpath('//ul[@class="card-collection"]/li')
        for row in rows:
            link = row.xpath('.//a/@href').get()
            link_split = link.split('/')[-1]
            link_id = link_split.split('?')[0]
            yield {
                'item_id': link_id,
                'item_title': row.xpath('.//p[@class="card__body-title"]/text()').get(),
                'item_county': row.xpath('.//ul[@class="card__body-keyinfo"]/li[2]/text()').get(),
                'item_price': row.xpath('.//p[@class="card__price"]/span[1]/text()').get()
            }
A note: when looping over each 'card', you don't need to specify the div if you're aiming to get a selector with a unique class like card__body-title.
Please note that yielding a dictionary is only one of three ways of grabbing data from Scrapy. Consider using items and item loaders.
Items: Here
ItemLoaders: Here
ItemLoaders Example: Here

A cleaner alternative would be to use regex. You can even use it with Scrapy selectors (docs)
'item_id': row.xpath('.//a/@href').re_first(r'/(\d+)\?campaign')
In the snippet above, the regex will return a string with only the digits between / and ?campaign.
In this particular URL https://www.donedeal.ie/farmtools-for-sale/international-784-tractor/25283884?campaign=3 it would return '25283884'
Edited: Corrected the regex
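The regex from this answer can be checked on its own with Python's re module before wiring it into a selector. This is a small sketch; item_id_from_url is a hypothetical helper, and returning None for a URL without a match is an assumption about how you would want to handle that case.

```python
import re

# Same pattern as in the answer: the digits between '/' and '?campaign'.
ITEM_ID_RE = re.compile(r'/(\d+)\?campaign')


def item_id_from_url(url):
    """Return the numeric id as a string, or None if the pattern is absent."""
    match = ITEM_ID_RE.search(url)
    return match.group(1) if match else None


url = ("https://www.donedeal.ie/farmtools-for-sale/"
       "international-784-tractor/25283884?campaign=3")
print(item_id_from_url(url))  # prints 25283884
```

Note that the 784 in the URL is not picked up, because it is preceded by a hyphen rather than a slash.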

What are the correct tags and properties to select?

I want to crawl a web site (http://theschoolofkyiv.org/participants/220/dan-acostioaei) to extract only the artist's name and biography. When I define the tags and properties, the output comes back without the text that I want to see.
I am using Scrapy to crawl the web site. For other websites, it works fine. I have tested my code, but it seems I cannot define the correct tags or properties. Can you please have a look at my code?
This is the code that I used to crawl the website. (I do not understand why Stack Overflow forces me to enter irrelevant text all the time. I have already explained what I wanted to say.)
import scrapy
from scrapy.selector import Selector
from artistlist.items import ArtistlistItem
class ArtistlistSpider(scrapy.Spider):
    name = "artistlist"
    allowed_domains = ["theschoolofkyiv.org"]
    start_urls = ['http://theschoolofkyiv.org/participants/220/dan-acostioaei']
    def parse(self, response):
        titles = response.xpath("//div[@id='participants']")
        for titles in titles:
            item = ArtistlistItem()
            item['artist'] = response.css('.ng-binding::text').extract()
            item['biography'] = response.css('p::text').extract()
            yield item
This is the output that I get:
{'artist': [],
'biography': ['\n ',
'\n ',
'\n ',
'\n ',
'\n ',
'\n ']}

Simple illustration (assuming you already know about AJAX request mentioned by Tony Montana):
import scrapy
import re
import json
from artistlist.items import ArtistlistItem
class ArtistlistSpider(scrapy.Spider):
    name = "artistlist"
    allowed_domains = ["theschoolofkyiv.org"]
    start_urls = ['http://theschoolofkyiv.org/participants/220/dan-acostioaei']
    def parse(self, response):
        # re.search returns None when the URL contains no participant id
        match = re.search(r'participants/(\d+)', response.url)
        if match:
            participant_id = match.group(1)
            # Request the JSON endpoint that the page loads via AJAX
            yield scrapy.Request(
                url="http://theschoolofkyiv.org/wordpress/wp-json/posts/{participant_id}".format(participant_id=participant_id),
                callback=self.parse_participant,
            )
    def parse_participant(self, response):
        data = json.loads(response.body)
        item = ArtistlistItem()
        item['artist'] = data["title"]
        item['biography'] = data["acf"]["en_participant_bio"]
        yield item
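The two steps of this answer (deriving the JSON URL from the page URL, then reading fields out of the JSON body) can be sketched without Scrapy. The helper names api_url_for and parse_participant_body are hypothetical, and sample_body is a made-up payload that only imitates the keys used in the answer ("title" and "acf" / "en_participant_bio"); the real response may contain other fields.

```python
import json
import re

API_TEMPLATE = "http://theschoolofkyiv.org/wordpress/wp-json/posts/{participant_id}"


def api_url_for(page_url):
    """Build the JSON endpoint URL from a participant page URL, or return None."""
    match = re.search(r'participants/(\d+)', page_url)
    if match is None:
        return None
    return API_TEMPLATE.format(participant_id=match.group(1))


def parse_participant_body(body):
    """Pick the artist name and biography out of a JSON response body."""
    data = json.loads(body)
    return {
        "artist": data["title"],
        "biography": data["acf"]["en_participant_bio"],
    }


# Made-up payload for illustration only.
sample_body = json.dumps({
    "title": "Dan Acostioaei",
    "acf": {"en_participant_bio": "Placeholder biography text."},
})

print(api_url_for("http://theschoolofkyiv.org/participants/220/dan-acostioaei"))
print(parse_participant_body(sample_body))
```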

correct way to nest Item data in scrapy

What is the correct way to nest Item data?
For example, I want the output of a product:
{
    'price': price,
    'title': title,
    'meta': {
        'url': url,
        'added_on': added_on
    }
}
I have scrapy.Item of:
class ProductItem(scrapy.Item):
    url = scrapy.Field(output_processor=TakeFirst())
    price = scrapy.Field(output_processor=TakeFirst())
    title = scrapy.Field(output_processor=TakeFirst())
    added_on = scrapy.Field(output_processor=TakeFirst())
Now, the way I do it is just to reformat the whole item in the pipeline according to new item template:
class FormatedItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    meta = scrapy.Field()
and in pipeline:
def process_item(self, item, spider):
    formated_item = FormatedItem()
    formated_item['title'] = item['title']
    formated_item['price'] = item['price']
    formated_item['meta'] = {
        'url': item['url'],
        'added_on': item['added_on']
    }
    return formated_item
Is this the correct way to approach this, or is there a more straightforward way that doesn't break the philosophy of the framework?

UPDATE from comments: Looks like nested loaders is the updated approach. Another comment suggests this approach will cause errors during serialization.
Best way to approach this is by creating a main and a meta item class/loader.
from scrapy.item import Item, Field
# Scrapy 1.0+ moved these to scrapy.loader and scrapy.loader.processors
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst
class MetaItem(Item):
    url = Field()
    added_on = Field()
class MainItem(Item):
    price = Field()
    title = Field()
    meta = Field(serializer=MetaItem)
class MainItemLoader(ItemLoader):
    default_item_class = MainItem
    default_output_processor = TakeFirst()
class MetaItemLoader(ItemLoader):
    default_item_class = MetaItem
    default_output_processor = TakeFirst()
Sample usage:
from scrapy.spider import Spider  # scrapy.spiders in Scrapy 1.0+
from qwerty.items import MainItemLoader, MetaItemLoader
from scrapy.selector import Selector
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com"]
    def parse(self, response):
        mainloader = MainItemLoader(selector=Selector(response))
        mainloader.add_value('title', 'test')
        mainloader.add_value('price', 'price')
        # The nested item is built by its own loader
        mainloader.add_value('meta', self.get_meta(response))
        return mainloader.load_item()
    def get_meta(self, response):
        metaloader = MetaItemLoader(selector=Selector(response))
        metaloader.add_value('url', response.url)
        metaloader.add_value('added_on', 'now')
        return metaloader.load_item()
After that, you can easily expand your items in the future by creating more "sub-items."

I think it would be more straightforward to construct the dictionary in the spider. Here are two different ways of doing it, both achieving the same result. The only possible dealbreaker here is that the processors apply on the item['meta'] field, not on the item['meta']['added_on'] and item['meta']['url'] fields.
def parse(self, response):
    item = MyItem()
    item['meta'] = {'added_on': response.css("a::text").extract()[0]}
    item['meta']['url'] = response.xpath("//a/@href").extract()[0]
    return item
Is there a specific reason why you want to construct it that way instead of unpacking the meta field?
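For comparison, the reshaping that the question does in its pipeline can be expressed as a small plain-dict helper. This is only a sketch with ordinary dictionaries rather than Scrapy items; nest_meta and its meta_keys parameter are hypothetical names, and the sample values are made up.

```python
def nest_meta(flat_item, meta_keys=("url", "added_on")):
    """Return a copy of `flat_item` with `meta_keys` moved under a 'meta' key."""
    nested = {k: v for k, v in flat_item.items() if k not in meta_keys}
    nested["meta"] = {k: flat_item[k] for k in meta_keys if k in flat_item}
    return nested


# Made-up values for illustration only.
flat = {
    "price": "3,000",
    "title": "Example product",
    "url": "http://example.com/item/1",
    "added_on": "now",
}
print(nest_meta(flat))
```

Because the helper builds a new dictionary, the flat item stays unchanged and can still be passed to processors that expect the flat layout.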

How to display items in a sorted order?

I have this in my template file:
{% get_latest_show as slideshow %}
{% for slide in slideshow.slide_set.all %}
<img src="{% thumbnail slide.image 1174x640 upscale %}" alt="{{slide.title}}" width="1174"/>
{% endfor %}
models.py
from django.db import models
import datetime
class Slide(models.Model):
    title = models.CharField(max_length=50)
    description = models.TextField(blank=True, null=True)
    target_url = models.TextField(blank=True, null=True)
    slideshow = models.ForeignKey('Slideshow')
    image = models.ImageField(upload_to='slideshow', max_length=500, blank=True, null=True)
    def __unicode__(self):
        return self.title
class Slideshow(models.Model):
    title = models.CharField(max_length=50)
    pub_date = models.DateTimeField(auto_now=True)
    published = models.BooleanField(default=False)
    class Meta:
        ordering = ['-title']
    def __unicode__(self):
        return self.title
slide_tags.py
from django import template
from django.core.cache import cache
from django.contrib.contenttypes.models import ContentType
from slides.models import Slide, Slideshow
register = template.Library()
class GetSlideshowNode(template.Node):
    """
    Retrieves the latest published slideshow
    """
    def __init__(self, varname):
        self.varname = varname
    def render(self, context):
        try:
            show = Slideshow.objects.filter(published=True)[0]
        except:
            show = []
        context[self.varname] = show
        return ''
def get_latest_show(parser, token):
    """
    Retrieves the latest published slideshow
    {% get_latest_show as show %}
    """
    args = token.split_contents()
    argc = len(args)
    try:
        assert (argc == 3 and args[1] == 'as')
    except AssertionError:
        raise template.TemplateSyntaxError('get_latest_show syntax: {% get_latest_show as varname %}')
    varname = None
    t, a, varname = args
    return GetSlideshowNode(varname=varname)
register.tag(get_latest_show)
The problem is that my slides are being displayed out of order. When I print slideshow.slide_set.all on the page, I see:
[<Slide: Slide 2>, <Slide: Slide 3>, <Slide: Slide 4>, <Slide: Slide 1>]
How do I get the slides to appear in order?

You want the slide_set to be ordered, therefore the 'ordering' statement should be on the Slide model.
class Slide(models.Model):
    # fields
    class Meta:
        ordering = ['-title']
This will cause Slide.objects.all() to return a queryset ordered by the title field in descending order (drop the leading '-' for ascending order). The same default ordering applies to slideshow.slide_set.all().
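As a plain-Python analogy for what the ordering option decides, the sketch below sorts the four slides from the question by title in both directions. It does not touch Django: Slide here is a hypothetical namedtuple stand-in, and sorted() only imitates the ORDER BY that the database would apply.

```python
from collections import namedtuple

# Stand-in for the Slide model; only the title matters for ordering.
Slide = namedtuple("Slide", "title")

# Same order as the unsorted output shown in the question.
slides = [Slide("Slide 2"), Slide("Slide 3"), Slide("Slide 4"), Slide("Slide 1")]

# Analogous to ordering = ['title']
ascending = sorted(slides, key=lambda s: s.title)
# Analogous to ordering = ['-title']
descending = sorted(slides, key=lambda s: s.title, reverse=True)

print([s.title for s in ascending])
print([s.title for s in descending])
```

Titles are compared as strings, so "Slide 10" would sort before "Slide 2"; a dedicated numeric position field avoids that if the slideshow grows.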

Django: Multiple COUNTs from two models away

I am attempting to create a profile page that shows the number of dwarves assigned to each career. I have 4 careers, 2 jobs within each of those careers, and of course many dwarves that each have a single job. How can I get a count of the number of dwarves in each of those careers? My solution was to hardcode the career names in the HTML and to make a query for each career, but that seems like an excessive number of queries.
Here's what I "want" to see:
Unassigned: 3
Construction: 2
Farming: 0
Gathering: 1
Here are my models. I add some complexity by not connecting Careers directly to my Dwarves model (they are connected through their jobs).
from django.contrib.auth.models import User
from django.db import models
class Career(models.Model):
    name = models.CharField(max_length = 64)
    def __unicode__(self):
        return self.name
class Job(models.Model):
    career = models.ForeignKey(Career)
    name = models.CharField(max_length = 64)
    career_increment = models.DecimalField(max_digits = 4, decimal_places = 2)
    job_increment = models.DecimalField(max_digits = 4, decimal_places = 2)
    def __unicode__(self):
        return self.name
class Dwarf(models.Model):
    job = models.ForeignKey(Job)
    user = models.ForeignKey(User)
    created = models.DateTimeField(auto_now_add = True)
    modified = models.DateTimeField(auto_now = True)
    name = models.CharField(max_length = 64)
    class Meta:
        verbose_name_plural = 'dwarves'
    def __unicode__(self):
        return self.name
EDIT 1
my view looks something like:
def fortress(request):
    careers = Career.objects.annotate(Count('dwarf_set'))
    return render_to_response('ragna_base/fortress.html', {'careers': careers})
and template:
{% for career in careers %}
<li>{{ career.dwarf_set__count }}</li>
{% endfor %}
The error is:
Cannot resolve keyword 'dwarf_set' into field. Choices are: id, job, name
SOLUTION
view:
def fortress(request):
    # Follow Career -> Job -> Dwarf; the name must match the one used in the template
    careers = Career.objects.all().annotate(dwarves_in_career = Count('job__dwarf'))
    return render_to_response('ragna_base/fortress.html', {'careers': careers})
template:
{% for career in careers reversed %}
<li>{{ career.name }}: {{ career.dwarves_in_career }}</li>
{% endfor %}
EVEN BETTER SOLUTION
careers = Career.objects.filter(Q(job__dwarf__user = 1) | Q(job__dwarf__user__isnull = True)) \
.annotate(dwarves_in_career = Count('job__dwarf'))
Don't forget: from django.db.models import Count, Q
What I like about the above solution is that it returns not only the careers that have dwarves working but also the careers that have none, which was the next problem I encountered. Here's my template for completeness:
<ul>
{% for career in careers %}
<li>{{ career.name }}: {{ career.dwarves_in_career }}</li>
{% endfor %}
</ul>

Django's ORM isn't gonna make this uber-simple. The simple way is to do something like:
for career in Career.objects.all():
    Dwarf.objects.filter(job__career=career).count()
That will execute 1 query for each career, on top of the query that lists the careers (O(n) complexity).
You could try to speed that up by using Django's Aggregation feature, but I'm not entirely sure if it'll do what you need. You'd have to take a look.
The third option is to use custom SQL, which will absolutely get the job done. You just have to write it, and maintain it as your app grows and changes...

Does this do what you want?
from django.db.models import Count
Career.objects.annotate(Count('job__dwarf'))
Now each career object should have a job__dwarf__count property.

Can't you just get a count grouped by career? And do an outer join if you need the zero rows returned too.
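The grouped count with zero rows kept can be illustrated in plain Python, without a database. This is only a sketch of the idea: the job names and the dwarf list are made up, and dwarves_per_career is a hypothetical helper. Iterating over the full career list plays the role of the outer join, so careers without dwarves still appear with a count of 0.

```python
from collections import Counter

careers = ["Unassigned", "Construction", "Farming", "Gathering"]

# Made-up jobs, each belonging to exactly one career.
job_to_career = {
    "Idle": "Unassigned",
    "Mason": "Construction",
    "Carpenter": "Construction",
    "Farmer": "Farming",
    "Herbalist": "Gathering",
}

# One entry per dwarf: the job that dwarf holds.
dwarf_jobs = ["Idle", "Idle", "Idle", "Mason", "Carpenter", "Herbalist"]


def dwarves_per_career(careers, job_to_career, dwarf_jobs):
    """Count dwarves per career, including careers that have none."""
    counts = Counter(job_to_career[job] for job in dwarf_jobs)
    # A Counter returns 0 for missing keys, which keeps the empty careers.
    return {career: counts[career] for career in careers}


for name, total in dwarves_per_career(careers, job_to_career, dwarf_jobs).items():
    print("{}: {}".format(name, total))
```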
