Writing to db via spider vs. via pipelines.py - scrapy

Does it matter that my scrapy script writes to my MySQL db in the body of the spider instead of through pipelines.py? Does this slow down the spider? Note that I don't have any items defined in items.py.
Follow-up: how and when is pipelines.py invoked? What happens after the yield statement?

It highly depends on the implementation, but if you implement the database writing in a fashion that doesn't block too much, then there isn't much difference performance-wise.
There is, however, a pretty huge structural difference. Scrapy's design philosophy highly encourages using middlewares and pipelines for the sake of keeping spiders clean and understandable.
In other words: the spider should crawl data, middlewares should modify requests and responses, and pipelines should pipe the returned data through some external logic (like putting it into a database or a file).
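For illustration, here is a minimal sketch of that separation: the spider only yields plain dicts (or Items), while all MySQL logic lives in a pipeline. This assumes the MySQLdb (mysqlclient) driver; the connection settings and the pages table/columns are made-up placeholders.
```python
# pipelines.py -- minimal sketch of keeping storage out of the spider.
# Assumes the MySQLdb (mysqlclient) driver; table/column names are made up.
import MySQLdb

class MySQLStorePipeline:
    def open_spider(self, spider):
        self.conn = MySQLdb.connect(
            host="localhost", user="user", passwd="secret", db="scrapydb"
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Everything the spider yields passes through here.
        self.cursor.execute(
            "INSERT INTO pages (url, title) VALUES (%s, %s)",
            (item["url"], item["title"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```
The pipeline would then be enabled through the ITEM_PIPELINES setting in settings.py.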
Regarding your follow-up question:
how and when is pipelines.py invoked? What happens after the yield statement?
Take a look at the Architecture Overview documentation page, and if you'd like to dig deeper you'd have to understand the Twisted asynchronous framework, since Scrapy is just a big, smart framework built around it.

If you want the best performance, store items in a file (e.g. CSV) and bulk insert them into your database when your crawl completes. For CSV data, you could use mysqlimport (see MySQL bulk insert from CSV data files). The recommended approach is to not block while inserting; this would require a pipeline that uses the Twisted RDBMS API.
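A rough sketch of such a non-blocking pipeline, using twisted.enterprise.adbapi (the Twisted RDBMS API mentioned above); the connection settings and the pages table are placeholders:
```python
# pipelines.py -- non-blocking MySQL writes via Twisted's adbapi thread pool.
from twisted.enterprise import adbapi

class AsyncMySQLPipeline:
    def open_spider(self, spider):
        # adbapi runs the blocking DB-API calls in a thread pool, so the
        # reactor (and therefore the spider) never waits on an INSERT.
        self.dbpool = adbapi.ConnectionPool(
            "MySQLdb", host="localhost", user="user",
            passwd="secret", db="scrapydb", charset="utf8mb4",
        )

    def process_item(self, item, spider):
        d = self.dbpool.runInteraction(self._insert, item)
        d.addErrback(lambda failure: spider.logger.error(failure))
        return item

    def _insert(self, tx, item):
        tx.execute(
            "INSERT INTO pages (url, title) VALUES (%s, %s)",
            (item["url"], item["title"]),
        )

    def close_spider(self, spider):
        self.dbpool.close()
```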

Related

Functions on Wikipedia dump file

We can use the functions of the Wikipedia API to get some results from Wikipedia.
For example:
import wikipedia
print(wikipedia.search("Bill", results=2))
My question: how can I use the Wikipedia API functions on a specific version of Wikipedia (e.g. just Wikipedia as of 2017)?
I doubt that this is possible. PyWikibot uses the online API of MediaWiki (in this case for the site Wikipedia), and that API always serves the live data.
The dumps which you mention are offline snapshots of the data of Wikipedia (assuming you're talking about https://dumps.wikimedia.org/). This data is not connected to the MediaWiki API in any way and can therefore not be queried with it.
What you can do to go through the data of Wikipedia at a specific point in time:
If it's a limited number of pages only: you could write a script which goes through the available revisions of each page and selects the one that is closest to the time you want (see the sketch after this list). That's probably error-prone, a lot of work, and does not really scale.
Download the dump you want to query and write a script which can work on the files (e.g. the database dump or the static HTML dump, depending on what you want to do; that's not really clear from your question).
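As a sketch of the first option: the live MediaWiki Action API can return, for a single page, the revision that was current at a given timestamp (note this still queries the online API and its stored revision history, not an offline dump). The page title and date below are just examples.
```python
# Fetch the revision of a page that was current at a given date, using the
# standard MediaWiki Action API. Title and timestamp are example values.
import requests

API = "https://en.wikipedia.org/w/api.php"

def revision_as_of(title, timestamp):
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvlimit": 1,
        "rvstart": timestamp,   # with rvdir=older: newest revision at or before this time
        "rvdir": "older",
        "rvprop": "ids|timestamp|content",
        "rvslots": "main",
        "format": "json",
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    page = next(iter(pages.values()))
    return page["revisions"][0]

rev = revision_as_of("Bill Gates", "2017-12-31T23:59:59Z")
print(rev["revid"], rev["timestamp"])
```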
On a dump file of a specific version, we cannot use the Wikipedia API. We can only read the dump file with our own code and do what we need with it.

How do you implement search over static content within cshtml files

I am using ASP.NET Core and Razor, and as it is a help system I would like to implement some kind of search facility to bring back a list of hyperlinked results based on the search terms.
I would like the search to essentially iterate over the content contained within the tags and then link this to the appropriate page/view.
What is the best way to do this?
I'm not even sure how you get a handle on the actual content of your own cshtml pages and then go from there.
This question is far too broad. However, I can provide you some pointers.
First, you need to determine what you're actually wanting to surface and where that data lives. Your question says "static web pages", but then you mention .cshtml. Traditionally, when it comes to creating your own search, you're going to have access to some particular dataset (tables in a database, for example). It's much simpler to search across the more structured data than the end result of it being dumped in various and sundry places over a web page.
Search engines like Google only index in this way because they typically don't have access to the raw data (although some amount of "access" can be granted via things like JSON-LD and other forms of Schema.org markup). In other words, they actually read from the web page out of necessity, because that's what they have to work with. It's certainly not the approach you would take if you have access to the data directly.
If for some reason you need to actually spider and index your own site's HTML content, then you'll essentially have to do what the big boys do: create a bot, run it on a schedule, have it crawl your site link by link, download each document, and then parse and process it. The end result is a set of structured data that you can actually query against, which is why all of this is pretty much wasted effort if you already have that data.
Once you have the data, however you got there, you simply query it. In the most basic of forms, you could store it in a table in a database and literally issue SQL queries against it. Your search keywords/parameters are essentially the WHERE of the SELECT statement, so you'd have to figure out a way to map the keywords/parameters you're receiving to an acceptable WHERE clause that achieves that.
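As a language-agnostic illustration of that keyword-to-WHERE mapping (sketched here in Python with SQLite; the pages table and its columns are hypothetical, and in an ASP.NET Core app you would express the same query through your own data access layer):
```python
# Minimal sketch: each search keyword becomes a parameterized LIKE predicate,
# combined with AND. Table name and columns are hypothetical.
import sqlite3

def search(conn, keywords):
    clauses = " AND ".join("content LIKE ?" for _ in keywords)
    params = [f"%{kw}%" for kw in keywords]
    sql = f"SELECT url, title FROM pages WHERE {clauses}"
    return conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, title TEXT, content TEXT)")
conn.execute("INSERT INTO pages VALUES ('/help/export', 'Exporting', 'How to export data to CSV')")
print(search(conn, ["export", "csv"]))   # -> [('/help/export', 'Exporting')]
```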
More traditionally, you'd use an actual search engine: essentially a document database that is designed and optimized for search, and that generally provides a more search-appropriate API to query against. There are lots of options in this space, from roll-your-own to hosted SaaS solutions, and everything in between. Of course, the cost meter goes down the more work you have to do yourself and goes up the more it works out of the box.
One popular open-source and largely free option is Elasticsearch. It uses Lucene indexes, which it stitches together in a clustered environment to provide failover and scale. Deployment is a beast, to say the least, though it's gotten considerably better with things like containerization and orchestration. You can stand up an Elasticsearch cluster in something like Kubernetes with relative ease, though you'll still probably need to do a bit of config. Elasticsearch also has hosted options, but you know, cost.

Can Npgsql dump/restore an entire database?

Is it possible to use Npgsql in a way that basically mimics pg_dumpall to a single output file without having to iterate through each table in the database? Conversely, I'd also like to be able to take such output and use Npgsql to restore an entire database if possible.
I know that with more recent versions of Npgsql I can use the BeginBinaryExport, BeginTextExport, or BeginRawBinaryCopy methods to export from the database to STDOUT or to a file. On the other side of the process, I can use the BeginBinaryImport, BeginTextImport, or BeginRawBinaryCopy methods to import from STDIN or an existing file. However, from what I've been able to find so far, these methods use the COPY SQL syntax, which (AFAIK) is limited to a single table at a time.
Why am I asking this question? I currently have an old batch file that I use to export my production database to a file (using pg_dumpall.exe) before importing it back into my testing environment (using psql.exe with the < redirection operator). This has been working pretty much flawlessly for quite a while now, but we've recently moved the server to an off-site hosted environment, which is causing a delay that prevents the batch file from completing successfully. Because of the potential for other connectivity/timeout issues, I'm thinking of moving the batch file's functionality to a .NET application, but this part has got me a bit stumped.
Thanks for your help and let me know if you need any further clarification.
This has been asked for in https://github.com/npgsql/npgsql/issues/1397.
Long story short, Npgsql doesn't have any sort of support for dumping/restoring entire databases. Implementing that would be a pretty significant effort that would pretty much duplicate all of the pg_dump logic, and the danger of subtle omissions and bugs would be considerable.
If you just need to dump data for some tables, then as you mentioned, the COPY API is pretty good for that. If, however, you need to also save the schema itself as well as other, non-table entities (sequence state, extensions...), then the only current option AFAIK is to execute pg_dump as an external process (or use one of the other backup/restore options).
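If you go the external-process route, the batch file's job boils down to running pg_dumpall and then feeding the dump to psql. Here is a rough sketch of that idea, shown as a Python script purely for illustration; from a .NET application the same two executables would be launched as child processes. Host names, user, and file path are placeholders.
```python
# Dump every database on the source server to a file, then restore it on the
# test server. Credentials can be supplied via PGPASSWORD or a .pgpass file.
import subprocess

dump_file = "production.sql"

subprocess.run(
    ["pg_dumpall", "-h", "prod-host", "-U", "postgres", "-f", dump_file],
    check=True,
)

subprocess.run(
    ["psql", "-h", "test-host", "-U", "postgres", "-f", dump_file],
    check=True,
)
```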

Best Practice - XML Parsing to Database with heavy validation logic in a Rails app

I'm receiving big (around 120MB each), nested XML files. The parsing itself is very fast; currently I'm using the Nokogiri SAX parser, which is way faster than a DOM-based one. However, I need to check a lot of values back against the database (should a record be updated or not?). Even though I keep database queries as low as possible (eager loading, pure SQL selects), the performance loss is about 40x compared to parsing only. I can't use mass inserts due to the need for validation, the checks against existing records, and the many associations involved. Wrapping the whole process in a transaction sped things up around 1.5x. What approach would you take? I'm looking forward to any help! I'm not very skilled in the whole XML thing. Would XSLT help me? I also have an XSD file for the files that arrive.
Thanks in advance!
I ended up rebuilding the associations so that they now fit the third-party data better, and I can use mass inserts (watch out for the max_allowed_packet value!). I'm using the sax-machine gem. When most of the basic data is already in the database, I can now process a 120MB file (including the DB work) in about 10 seconds, which is totally fine. Feel free to ask.

Most optimized way to store crawler states?

I'm currently writing a web crawler (using the python framework scrapy).
Recently I had to implement a pause/resume system.
The solution I implemented is of the simplest kind and, basically, stores links when they get scheduled, and marks them as 'processed' once they actually are.
Thus, I'm able to fetch those links when resuming the spider (obviously there is a little bit more stored than just a URL: a depth value, the domain the link belongs to, etc.), and so far everything works well.
Right now, I've just been using a MySQL table to handle that storage, mostly for fast prototyping.
Now I'd like to know how I could optimize this, since I believe a database shouldn't be the only option available here. By optimize, I mean using a very simple and light system that can still handle a great amount of data written in a short time.
For now, it should be able to handle the crawling of a few dozen domains, which means storing a few thousand links a second ...
Thanks in advance for suggestions
The fastest way of persisting things is typically to just append them to a log -- such a totally sequential access pattern minimizes disk seeks, which are typically the largest part of the time costs for storage. Upon restarting, you re-read the log and rebuild the memory structures that you were also building on the fly as you were appending to the log in the first place.
Your specific application could be further optimized since it doesn't necessarily require 100% reliability -- if you miss writing a few entries due to a sudden crash, ah well, you'll just crawl them again. So, your log file can be buffered and doesn't need to be obsessively fsync'ed.
I imagine the search structure would also fit comfortably in memory (if it's only for a few dozen sites you could probably just keep a set with all their URLs, no need for bloom filters or anything fancy) -- if it didn't, you might have to keep in memory only a set of recent entries, and periodically dump that set to disk (e.g., merging all entries into a Berkeley DB file); but I'm not going into excruciating details about these options since it does not appear you will require them.
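A minimal sketch of the append-only log plus in-memory set approach described above; the field layout, file name, and buffer size are arbitrary placeholders:
```python
# Append every state change to a plain text log; on restart, replay the log
# to rebuild the in-memory sets. Writes are buffered and never fsync'ed,
# accepting that a crash may lose the last few entries.
import os

LOG_PATH = "crawl_state.log"

class CrawlState:
    def __init__(self):
        self.scheduled = set()
        self.processed = set()
        if os.path.exists(LOG_PATH):
            with open(LOG_PATH) as f:
                for line in f:
                    event, url = line.rstrip("\n").split("\t", 1)
                    if event == "done":
                        self.scheduled.discard(url)
                        self.processed.add(url)
                    else:
                        self.scheduled.add(url)
        # Buffered append-only writes; purely sequential, seek-free I/O.
        self.log = open(LOG_PATH, "a", buffering=1 << 16)

    def schedule(self, url):
        if url not in self.scheduled and url not in self.processed:
            self.scheduled.add(url)
            self.log.write(f"sched\t{url}\n")

    def mark_processed(self, url):
        self.scheduled.discard(url)
        self.processed.add(url)
        self.log.write(f"done\t{url}\n")
```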
There was a talk at PyCon 2009 that you may find interesting, Precise state recovery and restart for data-analysis applications by Bill Gribble.
Another quick way to save your application state may be to use pickle to serialize it to disk.
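For example, a tiny sketch of the pickle route (the file name and the shape of the state object are placeholders):
```python
# Periodically serialize whatever state object the crawler keeps (the sets
# above, for instance) and load it back on startup.
import pickle

def save_state(state, path="crawler_state.pkl"):
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_state(path="crawler_state.pkl"):
    with open(path, "rb") as f:
        return pickle.load(f)
```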