What is the purpose of Scrapy items, since they are only dictionaries?

What is the purpose of Scrapy items, since they are only dictionaries? I can send a dictionary (or a tuple) and process it in an item pipeline, so shouldn't I use plain dictionaries instead, or use @dataclass (available since Python 3.7), since it is more convenient?
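One practical difference can be sketched in plain Python: an item type with declared fields rejects a misspelled field name, while a plain dictionary accepts it silently. This is only a rough sketch. The `Product` class, its fields, and `clean_price` are hypothetical names, and Scrapy itself is not imported, so `clean_price` merely stands in for what a pipeline step might do. As far as I know, Scrapy 2.2 and later accept dataclass objects as items, but check the documentation for your version.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Product:
    # Hypothetical item with a fixed, declared set of fields.
    name: str
    price: Optional[float] = None


def clean_price(item: Product) -> Product:
    # Stand-in for a pipeline step: round the price if one was scraped.
    if item.price is not None:
        item.price = round(item.price, 2)
    return item


# A plain dict accepts a misspelled key without complaint.
as_dict = {"name": "Widget", "pirce": 9.999}

# The dataclass rejects the same typo at construction time.
try:
    Product(name="Widget", pirce=9.999)
    typo_rejected = False
except TypeError:
    typo_rejected = True

item = clean_price(Product(name="Widget", price=9.999))
record = asdict(item)  # convert back to a plain dict, e.g. for export
```

Whether that check is worth the extra class is a judgment call; for small spiders a plain dictionary may be all you need.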

Related

Iterating through Kotlin map from C native export

We have a Kotlin package that we native build and export to C. We have the header file with all the nested struct and pinned-style pointers.
In the Kotlin code, there is a Map which we want to access. We can get a hold of the Kotlin package enum (the key of the Map), but what's the C code for actually indexing into the "kref kotlin Map object" to get to the value in the map?
Basically, we'd like to know how to manipulate Map, List and Array from C code. Actual steps and/or reference doc would be appreciated.
The Kotlin/Native compiler does not export any of the collections' functions to the native library API. This decision was made some time ago to keep the library header from becoming too verbose. However, it leads to the problem you faced. Right now, the recommended approach is to write wrapper functions in your Kotlin code. For an example of this approach, please see this ticket at the Kotlin issue tracker. I also recommend subscribing to it to get updates on the problem's state ASAP. Posting the example here in case the ticket becomes unavailable to someone:
// Wrapper function that accesses a list element by index
fun getListElement(list: List<Any?>, index: Int) = list.get(index)

What is the difference between functions that start with "allocate", "create", "initialize" and so on

When it comes to naming a function, what are the main differences between the following words:
"allocate", "create", "initialize", "instantiate", "make", "build", "add" and "insert".
When should I use each word?
Thank you in advance :)
I associate allocate, create, instantiate and make with the creation of a new object whereas initialize is more associated with setting initial values. The words add and insert are used for functions or methods which add new elements to some collection, like a list or a tree. When I read build I think of a process for compiling and linking software source code.
I summarize the existing usage of allocate, initialize, instantiate, make, build, create, add, insert, and two more I needed in this comparison: put and update, below.
allocate; synonyms alloc (C); used for "allocating" or allotting space in memory
initialize; synonyms init, __init__ (Python); used for setting the initial state of an object that has just been instantiated from a class or prototype
instantiate;; there's not a super strong case for this anywhere, but you're welcome to look
make;; make is traditionally a command-line tool, run from a shell such as bash, that manages compilation of the different parts of a C or C++ project. In Go, make is a built-in function for creating and initializing a value of type slice, map, or chan.
build;; a lot of Makefiles define a build target, because it falls in line with "building" or compiling a project.
create; synonyms insert and POST (REST); used for creating a new web resource without an id. Errors if the web resource already exists.
put; synonyms add, sadd (redis), zadd (redis), set (redis), PUT (REST); creates a web resource by id. Updates the resource completely if it already exists.
update; synonyms hset (redis), PATCH (REST); updates a web resource by id. Some implementations throw, others create if resource does not exist.
I leave how you should be using these words to your discretion.
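To make the create / put / update distinction above concrete, here is a small sketch. `ResourceStore` and its methods are hypothetical and exist only to contrast the three verbs; real APIs differ. Two assumptions are baked in: the store assigns ids on create, and update raises when the id is missing (as noted above, some implementations create the resource instead).

```python
class ResourceStore:
    """Hypothetical in-memory store used only to contrast the three verbs."""

    def __init__(self):
        self._items = {}
        self._next_id = 1

    def create(self, data):
        # POST-like: the caller supplies no id; the store picks an unused one.
        while self._next_id in self._items:
            self._next_id += 1
        new_id = self._next_id
        self._items[new_id] = dict(data)
        return new_id

    def put(self, item_id, data):
        # PUT-like: create by id, or replace the resource completely.
        self._items[item_id] = dict(data)

    def update(self, item_id, changes):
        # PATCH-like: merge changes into an existing resource.
        # Assumption: a missing id is an error in this sketch.
        if item_id not in self._items:
            raise KeyError(item_id)
        self._items[item_id].update(changes)

    def get(self, item_id):
        return dict(self._items[item_id])


store = ResourceStore()
first_id = store.create({"name": "a", "color": "red"})
store.update(first_id, {"color": "blue"})
after_update = store.get(first_id)  # name kept, color changed
store.put(first_id, {"name": "b"})
after_put = store.get(first_id)     # fully replaced, color is gone
```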

How to read numpy array in ND4j

I have two components that deal with n-dimensional arrays. One component is written in Python; it processes the data and saves the processed ndarray with tobytes(). The other component is written in Java and needs to read the serialized ndarray produced by the first component.
I am curious whether there are any existing Java libraries that can read a serialized numpy array, or whether there is a better way to pass an ndarray between Java and Python.
Any advice is appreciated!
Thank you!
ND4J supports reading from and writing to Numpy arrays. Look at the ND4J javadocs for the xxxNpyYYYArray methods.
It can read and write from/to files, byte arrays and even raw pointers to a numpy array.
The pointer methods allow for using the arrays without copying or serialization. We use the pointer methods inside jumpy (which runs Java via pyjnius) and when using javacpp's cpython/numpy preset to run a cpython interpreter inside a Java process.
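One detail worth checking on the Python side: tobytes() writes only the raw element buffer, with no shape or dtype information, whereas the .npy format produced by numpy.save starts with a header that carries both. The sketch below assumes the Java side expects .npy-formatted bytes (as the Npy methods mentioned above suggest); the array contents are made up for illustration.

```python
import io

import numpy as np

arr = np.arange(6, dtype=np.float32).reshape(2, 3)

# Raw buffer only: 6 elements * 4 bytes, shape and dtype are not stored.
raw = arr.tobytes()

# .npy format: magic string, header (shape, dtype, order), then the data.
buf = io.BytesIO()
np.save(buf, arr)
npy_bytes = buf.getvalue()

# The .npy bytes are self-describing and round-trip without extra metadata.
restored = np.load(io.BytesIO(npy_bytes))

# The raw bytes need shape and dtype supplied out of band.
rebuilt = np.frombuffer(raw, dtype=np.float32).reshape(2, 3)
```

If you stay with tobytes(), the shape, dtype, and byte order have to travel alongside the bytes in some agreed-upon way.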
I have used Apache Arrow to solve this.
First, the pyarrow package has a numpy ndarray API that serializes the array into bytes. Basically, the ndarray becomes an Arrow batch serialized as a sequence of bytes.
Then the Java API provides a VectorSchemaRoot to read it from the bytes, and you can get the values from the Arrow array. You can use this array to create an ND4J array (if you need one) or operate on the array directly.
For detailed operations, refer to the Apache Arrow docs; if you run into any obstacles, we can discuss them here.
Also, Arrow uses native memory to store the buffer, so the data is off the Java heap. This may be an issue at some point.
If anyone has other solutions, please share them with me. :)

Creating PageObjects in WebDriverIO

I've been creating PageObjects for WebDriverIO and have been following the ES6 method for Page Object pattern in the WebDriverIO documentation.
However, someone on my team suggested simply making objects of selectors, and then calling those strings in the tests. Is there a good reason why the Page Object pattern returns elements and not string of the selectors?
A Page Object returns elements instead of just the selector strings so that actions can be called directly on the elements, e.g.
PageObject.Element.waitForDisplayed()
instead of
Browser.waitForDisplayed(PageObject.Element)
which may get lengthy and doesn't chain as well. You can find more actions that can be performed on elements here
However, you can also get the selector string if you want it, through the element's selector property:
PageObject.Element.selector
Chaining e.g.
PageObject.Element.waitForDisplayed().click()
I think the point is to allow you to use the objects directly. So:
MyPageObject.MyElement.click()
versus:
browser.click(MyPageObject.MyElement)
It's just a little less verbose.

Migrating Lucene HitCollector (2.x) to Collector (3.x)

In one of our projects, we use an old Lucene version (2.3.2). I'm now looking at current Lucene versions (3.5.0) and trying to re-write the old code. In the old project, we extended TopFieldDocCollector to do some extra filtering in the collect() method. I'm having a bit of trouble understanding the new Collector class however, and I couldn't find a good example.
1) The method setScorer(). How/where do I get a Scorer object from?
2) The method collect(). I guess I need to create my own Collection and store the docIds I'm interested in, correct?
3) When extending TopDocsCollector instead, I'd need to implement a PriorityQueue to use in the constructor, correct? There seems to be no standard implementation for it. But I still need my own Collection to store docIds (or rather, ScoreDocs), and call populateResults after the search is finished?
Overall, it seems like extending Collector is (a lot) easier than extending TopDocsCollector, but maybe I'm missing something.
setScorer() is a hook: the Scorer is passed in by IndexSearcher when it actually runs the search. So you basically override this method if you care about scores at all (e.g. saving the passed-in Scorer so you can use it later). From its javadocs:
Called before successive calls to {@link #collect(int)}. Implementations
that need the score of the current document (passed-in to
{@link #collect(int)}), should save the passed-in Scorer and call
scorer.score() when needed.
collect() is called for each matching document, passing in the per-segment docid. Note that if you need the 'rebased docid' (relative to the entire reader across all of its segments), you must override setNextReader, save the docBase, and compute docBase + docid. From the Collector javadocs:
NOTE: The doc that is passed to the collect
method is relative to the current reader. If your
collector needs to resolve this to the docID space of the
Multi*Reader, you must re-base it by recording the
docBase from the most recent setNextReader call.
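The rebasing arithmetic can be modeled in a few lines. This is a Python model of the callback order only, not the Lucene API (which is Java): `RebasingCollector`, `set_next_reader`, and the segment data are hypothetical stand-ins for a Collector subclass, setNextReader, and what IndexSearcher would drive.

```python
class RebasingCollector:
    """Python model of the Collector callbacks; not the Lucene API itself."""

    def __init__(self):
        self.doc_base = 0
        self.hits = []

    def set_next_reader(self, doc_base):
        # Called once per segment, before that segment's collect() calls.
        self.doc_base = doc_base

    def collect(self, doc):
        # doc is relative to the current segment; rebase it into the
        # index-wide docid space before storing it.
        self.hits.append(self.doc_base + doc)


# Simulate a searcher visiting two segments: (docBase, matching local ids).
segments = [(0, [1, 4]), (10, [0, 3])]
collector = RebasingCollector()
for doc_base, local_ids in segments:
    collector.set_next_reader(doc_base)
    for doc in local_ids:
        collector.collect(doc)
```

After the loop, collector.hits holds index-wide ids; local id 3 in the second segment, for instance, is stored as 13.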
TopDocsCollector is a base class for TopFieldCollector (sort by field) and TopScoreDocCollector (sort by score). If you are writing a custom collector that sorts by score, then it's probably easier to just extend TopScoreDocCollector.
Also: the simplest Collector example is TotalHitCountCollector!