Where to add a field that is calculated from other fields in Scrapy - scrapy

I am using Scrapy to crawl real estate ads.
I have the fields price and size (in m2), so I can calculate price_per_m2 as price/size.
My question is: where should I do this (in which class) according to Scrapy best practice?
Right now I have it in my scrapy.Spider, but should it live somewhere else (like a pipeline), and how (code examples preferred)?

While I can agree with Tomas in some points, I would never put this calculation in the spider itself.
I use spiders to extract data from pages. For me, that is the only purpose of a spider. I use ItemLoaders for cleaning the extracted data and for basic manipulation (like converting everything to the same base unit). And finally I use pipelines for any higher-level data manipulation, like combining fields from items.
Imagine you have a dozen spiders and you calculate price_per_m2 inside them. Your project has grown and you start coding spiders for another country. You have spiders getting prices in GBP, EUR and USD. Now if you want to compare price_per_m2 you have to 1) convert units in each spider before the calculation or 2) add metadata to the items in order to do post-processing. Both approaches are onerous in my opinion. My approach: spiders extract data, ItemLoaders convert every price to the same unit, and a pipeline calculates price_per_m2 for every item (all of them in the same units).
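To make that split concrete, here is a minimal sketch of the pipeline step; the field names (price, size, price_per_m2) come from the question, but the class name and settings path are just illustrative.

# pipelines.py - minimal sketch of computing a derived field in a pipeline.
# Assumes the ItemLoader has already turned price and size into numbers.
from scrapy.exceptions import DropItem

class PricePerM2Pipeline:
    def process_item(self, item, spider):
        price = item.get('price')
        size = item.get('size')
        if price is None or not size:
            raise DropItem("missing price or size, cannot compute price_per_m2")
        item['price_per_m2'] = price / size
        return item

Enable it in settings.py with ITEM_PIPELINES = {'myproject.pipelines.PricePerM2Pipeline': 300} (the module path is hypothetical), and every spider in the project gets the same calculation for free.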

If you have a calculation (or post-processing in general) common to all items, using a pipeline is one possible way to do it. In real estate, I can imagine for example geocoding addresses or storing each item in a database. The main reason for using a pipeline, in my opinion, is that you separate this additional logic out of the spiders, so you have a single place of maintenance when there's a need to change this logic. In the examples given above, you might decide to change the geocoding provider, or switch from one database engine to another. That's the real strength of pipelines. That said, calculating price per m2 from price and size, which is unlikely to change, can safely be put right into the spider code. On the other hand, if there are more such simple calculations, you might consider using pipelines just to avoid repeating the same code in every spider.
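For completeness, a hedged sketch of the in-spider variant this answer describes; the URL, CSS selectors and listing markup are made up for illustration.

# spider-side calculation - selectors and URL are illustrative only.
import scrapy

class RealEstateSpider(scrapy.Spider):
    name = 'realestate'
    start_urls = ['https://example.com/listings']  # placeholder URL

    def parse(self, response):
        for ad in response.css('.listing'):
            price = float(ad.css('.price::text').re_first(r'[\d.]+') or 0)
            size = float(ad.css('.size::text').re_first(r'[\d.]+') or 0)
            yield {
                'price': price,
                'size': size,
                'price_per_m2': price / size if size else None,
            }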

Related

Optaplanner VRP example, multiple vehicles required per stop in the same time window

We are using a customized VRP tutorial example to optimize daily routes for service engineers who travel to customers in order to execute certain repair and installation tasks. We do have time windows and we optimize 1000+ tasks for multiple weeks into the future.
Our (simplified) domain model consists of:
Engineer - the guy doing all the work
Task - a single work assignment at a certain location
DailyRoute - an Engineer's route for a given day, consisting of a linked list of Tasks
As a new requirement we must now support two engineers working in parallel on the same task.
Our current plan is to implement this by creating subtasks for the second engineer and adding a rule that their arrival time must be identical to that of the main task.
However, this is problematic, since moving one of the interdependent tasks to a different time (e.g. a different DailyRoute) will usually violate the above constraint.
So far, we have come up with the following ideas:
Allow single task moves only to a DailyRoute on the same day as the other task's assigned route
can be done via a SelectionFilter
Use CompositeMoves to move both of the parallel tasks at once to different days
Do we need a custom MoveIteratorFactory to select the connected tasks?
Or can this be done with a CartesianProductMoveSelector instead?
Can we use nearby selection for the second move to prefer the same day as the first move's newly assigned day (is the first move already applied at that point)?
For two engineers working in parallel on the same task, see the docs on "design patterns", specifically "the delay till last pattern". There is no example, but our support services have helped implement it a few times - it works.
For multiple stops at the same location: I've seen users split such visits up into smaller pieces to allow OptaPlanner to choose which of those pieces to aggregate. It works, but it's not perfect: the more fine-grained the pieces, the bigger the search space becomes, and the more a custom move that focuses on moving all pieces together might help (but I wouldn't start out with it). Generally speaking: if the smallest vehicle has a capacity of 100, I'd run some experiments with splitting up to half that capacity, and then try a quarter too, just to see what works best through benchmarking with optaplanner-benchmark.

Pulling large quantities of data takes too long. Need a way to speed it up

I'm creating a client dashboard website that displays many different graphs and charts of different views of data in our database.
The data is of records of medical patients and the companies that they work for, for insurance purposes. The data is displayed as aggregate charts, but there is a filter feature on the page that the user can use to filter individual patient records. The fields that they can filter by are:
Date range of the medical claim
Relationship to the insurance holder
Sex
Employer groups (user selects a number of different groups they work with, and can turn them on and off in the filter)
User Lists (the user of the site can create arbitrary lists of patients and save their IDs and edit them later). Either none, one, or multiple lists can be selected. There is also an any/all selector if multiple are chosen.
A set of filters that the user can define (with preset defaults) from other, more internally structured pieces of data. The user can customize up to three of them and can select any one, or none of them, and they return a list of patient IDs that is stored in memory until they're changed.
The problem is that loading the data can take a long time, with some pages taking from 30 seconds to a minute to load (the page is loaded first and the data is then downloaded as JSON via an AJAX call while a loading spinner is displayed). Some of the stored procedures we use are very complex, requiring multiple levels of nested queries. I've tried using the Query Analyzer to simplify them, but we've made all the recommended changes and it still takes a long time. Our database people have looked and don't see any other way to make the queries simpler while still getting the data that we need.
The way it's set up now, only changes to the date range and the employer groups cause the database to be hit again. The database never filters on any of the other fields. Any other changes to the filter selection are made on the front end. I tried changing the way it worked and sending all the fields to the back end for the database to filter on, and it ended up taking even longer, not to mention having to wait on every change instead of just a couple.
We're using MS SQL 2014 (SP1). My question is, what are our options for speeding things up? Even if it means completely changing the way our data is stored?
You don't provide any specifics - so this is pretty generic.
Speed up your queries - this is the best, easiest, least error-prone option. Modern hardware can cope with huge datasets and still provide sub-second responses. Post your queries, DDL, sample data and EXPLAINs to Stack Overflow - it's very likely you can get significant improvements.
Buy better hardware - if you really can't speed up the queries, figure out what the bottleneck is, and buy better hardware. It's so cheap these days that maxing out on SSDs, RAM and CPU will probably cost less than the time it takes to figure out how to deal with the less optimal routes below.
Caching - rather than going back to the database for everything, use a cache (a minimal sketch follows this list). Figure out how "up to date" your dashboards need to be, and how unique the data is, and cache query results if at all possible. Many development frameworks have first-class support for caching. The problem with caching is that it makes debugging hard - if a user reports a bug, are they looking at cached data? If so, is that cache stale - is it a bug in the data, or in the caching?
Pre-compute - if caching is not feasible, you can pre-compute data. For instance, when you create a new patient record, you could update the reports for "patients by sex", "patients by date", "patients by insurance co" etc. This creates a lot of work - and even more opportunity for bugs.
De-normalize - this is the nuclear option. Denormalization typically improves reporting speed at the expense of write speed, and at the expense of introducing lots of opportunities for bugs.
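To illustrate the caching option above, here is a minimal time-to-live cache sketch; run_claims_query and the cache key are made-up stand-ins, and a production setup would more likely use Redis or the web framework's own cache.

# A minimal time-based cache for expensive dashboard queries - illustrative only.
import time

_cache = {}           # key -> (expires_at, result)
CACHE_TTL = 300       # seconds; tune to how fresh the dashboards must be

def cached_query(key, run_query):
    now = time.time()
    hit = _cache.get(key)
    if hit and hit[0] > now:
        return hit[1]                        # still fresh, skip the database
    result = run_query()                     # the expensive SQL Server call
    _cache[key] = (now + CACHE_TTL, result)
    return result

# usage (run_claims_query is hypothetical):
# cached_query(('claims', date_from, date_to, group_id),
#              lambda: run_claims_query(date_from, date_to, group_id))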

How to append to a zipline bundle

I have a trading algorithm that I am backtesting on zipline.
I have successfully ingested a US common stocks bundle from a CSV file. Moving forward, I'd like to backtest it continuously at the end of each trading day.
So I'd like to append to my existing bundle the daily OHLCV prices for each US equity by downloading them from Interactive Brokers (I have written a Python script that does that).
Now my question:
How to append the new day's data row for each equity to my existing zipline bundle?
Specifically, I don't want to create new bundles.
I happen to be investigating this myself, and my conclusion is that it is not possible. If any zipline developer is on air, please correct me if I am wrong.
Each ingestion basically creates a new SQLite table; those are easy to find under ~/.zipline/data.
Say you have three different CSVs for three different exchanges: you would have to import them separately, in three different ingests.
What is disappointing (apparently; perhaps we are missing the intended usage) is that when running a backtest you are constrained to one single ingestion universe. If my symbol list is scattered - i.e. products on different exchanges - then it is not possible to backtest such an algo.
If you are relying on the default Quandl bundle then you do not face this issue, provided your registration has enough visibility (the free API key is pretty restricted).
One solution could be to import all of the CSVs together, under a common trading calendar. This sounds artificial, but the impact on evaluating a non-intraday strategy should be negligible.
So if, for example, you have three sets of CSVs for AS, DE and MI, just import them as one generic Yahoo-style bundle against one of the three calendars. The detailed procedure is explained here.
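A hedged sketch of that combined ingest, assuming zipline 1.x with its built-in csvdir bundle; the bundle name, directory path and calendar choice are placeholders.

# ~/.zipline/extension.py - register one bundle covering all exchanges' CSVs.
# Sketch only: bundle name, path and calendar are illustrative choices.
from zipline.data.bundles import register
from zipline.data.bundles.csvdir import csvdir_equities

register(
    'combined-equities-csv',                          # hypothetical bundle name
    csvdir_equities(['daily'], '/path/to/all/csvs'),  # daily/ subfolder of OHLCV CSVs
    calendar_name='NYSE',                             # one common trading calendar
)

# then ingest with: zipline ingest -b combined-equities-csv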

How to do inventory adjustments for more than 50k products?

I want to do an inventory adjustment for more than 50k products, but Odoo is not able to handle it. Every time it shows a time-exceeded error. It takes too much time; is there any way to speed it up?
Thanks for the help.
In this case, it might be best to develop a queued process that handles this in batches. The linked modules are from the OCA and are a base for others to develop their specific queues.
In your case, it may require manual or automated splitting of the Inventory Adjustments into smaller batches, or (ideally) you can process x lines at a time (a batching sketch follows below), such as:
stock_inventory.line_ids.filtered(lambda r: r.state == 'draft')
Note: It doesn't look like the above code will actually work because all stock.inventory.line "Status" are just related to the stock.inventory. You'll probably need to override this to be manually updated or take a different approach.
Documentation on filtered
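As a rough illustration of the batching idea above, here is a hedged sketch for the Odoo shell; the batch size is arbitrary and the per-batch work depends on your Odoo version, so treat it as a template rather than finished code.

# Odoo shell sketch: split one huge inventory adjustment into small commits.
# BATCH_SIZE, INVENTORY_ID and the per-batch work are illustrative.
BATCH_SIZE = 500
inventory = env['stock.inventory'].browse(INVENTORY_ID)
lines = inventory.line_ids

for start in range(0, len(lines), BATCH_SIZE):
    batch = lines[start:start + BATCH_SIZE]
    # ... validate/apply this slice of lines here ...
    env.cr.commit()  # keep each transaction small so it cannot time out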

RRDtool what use are multiple RRAs?

I'm trying to implement rrdtool. I've read the various tutorials and got my first database up and running. However, there is something that I don't understand.
What eludes me is why so many of the examples I come across instruct me to create multiple RRAs?
Allow me to explain: Let's say I have a sensor that I wish to monitor. I will want to ultimately see graphs of the sensor data on an hourly, daily, weekly and monthly basis and one that spans (I'm still on the fence on this one) about 1.5 yrs (for visualising seasonal influences).
Now, why would I want to create an RRA for each of these views? Why not just create a database like this (stepsize=300 seconds):
DS:sensor:GAUGE:600:U:U \
RRA:AVERAGE:0.5:1:160000
If I understand correctly, I can then create any graph I desire, for any given period with whatever resolution I need.
What would be the use of all the other RRAs people tell me I need to define?
BTW: I can imagine that in the past this would have been helpful when computing power was scarcer. Nowadays, with fast disks, high-speed interfaces and powerful CPUs, I guess you don't need the kind of pre-processing that RRAs seem to be designed for.
EDIT:
I'm aware of this page. Although it explains consolidation very clearly, it is my understanding that rrdtool graph can do this consolidation as well at the moment the data is graphed. There still appears (to me) to be no added value in "harvest-time consolidation".
Each RRA is a pre-consolidated set of data points at a specific resolution. This performs two important functions.
Firstly, it saves on disk space. So, if you are interested in high-detail graphs for the last 24h, but only low-detail graphs for the last year, then you do not need to keep the high-detail data for a whole year - consolidated data will be sufficient. In this way, you can minimise the amount of storage required to hold the data for graph generation (although of course you lose the detail, so you can't access it if you should want to). Yes, disk is cheap, but if you have a lot of metrics and are keeping fine-grained data for a long time, this can be a surprisingly large amount of space (in our case, it would be in the hundreds of GB).
Secondly, it means that the consolidation work is moved from graphing time to update time. RRDtool generates graphs very quickly, because most of the calculation work is already done in the RRAs at update time, if there is an RRA of the required configuration. If there is no RRA available at the correct resolution, then RRDtool will perform the consolidation on the fly from a higher-granularity RRA, but this takes time and CPU. RRDtool graphs are usually generated on the fly by CGI scripts, so this is important, particularly if you expect to have a large number of queries coming in. In your example, using a single 5-min RRA to make a 1.5-year graph (where 1 pixel would be about 1 day), you would need to read and process 288 times more data to generate the graph than if you had a 1-day granularity RRA available!
In short, yes, you could have a single RRA and let the graphing work harder. If your particular implementation needs faster updates and doesn't care about slower graph generation, and you need to keep the detailed data for the entire time, then maybe this is a solution for you, and RRDtool can be used in this way. However, usually, people will optimise for graph generation and disk space, meaning tiered sets of RRAs with decreasing granularity.
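To make the tiered layout concrete, here is a hedged sketch using the python-rrdtool bindings; the step, consolidation factors and row counts are illustrative choices, not recommendations.

# A tiered RRA layout - illustrative numbers only (step = 300 s).
# Requires the python-rrdtool bindings (pip install rrdtool).
import rrdtool

rrdtool.create(
    'sensor.rrd',
    '--step', '300',
    'DS:sensor:GAUGE:600:U:U',
    'RRA:AVERAGE:0.5:1:4032',     # 5-min detail kept for ~2 weeks
    'RRA:AVERAGE:0.5:12:1440',    # 1-hour averages kept for ~2 months
    'RRA:AVERAGE:0.5:288:550',    # 1-day averages kept for ~1.5 years
)

rrdtool graph then picks the best-matching RRA automatically, so the 1.5-year graph reads roughly 550 rows instead of about 160,000.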