How to append to a zipline bundle - zipline

I have a trading algorithm that I am backtesting on zipline.
I have successfully ingested a US common stocks bundle from a CSV file. Moving forward, I'd like to keep backtesting it continuously at the end of each trading day.
So I'd like to append to my existing bundle the daily OHLCV prices for each US equity by downloading them from Interactive Brokers (I have written a Python script that does that).
Now my question:
How to append the new day's data row for each equity to my existing zipline bundle?
Specifically, I don't want to create new bundles.

I happen to be investigating this myself, and my conclusion is that it is not possible. If any zipline developer is on air, please correct me if I am wrong.
Each ingestion basically creates a new SQLite database; those are easy to find under ~/.zipline/data.
Say you have three different CSVs for three different exchanges, you would have to import them separately in three different ingests.
What is disappointing (apparently; perhaps we are missing the intended usage) is that when running a backtest one is constrained to a single ingestion universe. If my symbols list is scattered - i.e. products on different exchanges - then it is not possible to backtest such an algo.
If you are relying on the default Quandl space then you do not face this issue, provided your registration has enough visibility (the free API key is pretty restricted).
One solution could be to import all of the CSVs together, under a common trading calendar. This sounds artificial, but the impact on evaluating a strategy that does not depend on the exact trading calendar should be negligible.
So if for example you have three sets of CSVs for AS, DE and MI, just import them as a generic Yahoo-style bundle against one of the three calendars. The detailed procedure is explained here.
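As a rough, untested sketch of that single-bundle approach: zipline lets you register a csvdir bundle in ~/.zipline/extension.py. The bundle name, CSV directory and calendar below are placeholder choices of mine, not anything from the question.

from zipline.data.bundles import register
from zipline.data.bundles.csvdir import csvdir_equities

# one bundle holding the merged daily CSVs for all exchanges,
# pinned to a single trading calendar (XAMS is just one possible pick)
register(
    'all-exchanges-csv',                                  # hypothetical bundle name
    csvdir_equities(['daily'], '/path/to/merged/csvs'),   # hypothetical directory
    calendar_name='XAMS',
)

After that, zipline ingest -b all-exchanges-csv builds the bundle, and as far as I can tell re-running the ingest after appending each new day's rows to the CSVs is the supported workflow, rather than appending to an existing ingestion.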
Thanks,

Related

How to handle hash key, partition and index on Azure?

I've studied Azure Synapse and distribution types.
A hash-distributed table needs a column to distribute the data between different nodes.
To me it sounds like the same idea as partitioning. I saw some examples that use a hash key, a partition and an index, but their differences and how to choose between them are not clear in my mind. How could hash key, partition and index work together?
Just an analogy which might explain the difference between Hash and Partition
Suppose there exists one massive book about all the history of the world. It has the size of a 42-story building.
Now what if the librarian splits that book into 1 book per year. That makes it much easier to find all the information you need for some specific years, because you can just keep the other books on the shelves.
A small book is easier to carry too.
That's what table partitioning is about. (Reference: Data Partitioning in Azure)
Keeping chunks of data together, based on a key (or set of columns) that is useful for the majority of the queries and has a nice, even distribution.
This can reduce IO because only the relevant chunks need to be accessed.
Now what if the chief librarian unbinds that book and sends sets of pages to many different libraries? When we then need certain information, we ask each library to send us copies of the pages we need.
Even better, those librarians could already summarize the information of their pages and then just send only their summaries to one library that collects them for you.
That's what the table distribution is about. (Reference: Table Distribution Guidance in Azure)
To spread out the data over the different nodes.
For more details:
What is a difference between table distribution and table partition in sql?
https://www.linkedin.com/pulse/partitioning-distribution-azure-synapse-analytics-swapnil-mule
And indexing is the physical arrangement of the data within those nodes.
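To see the three together, here is a minimal, untested sketch of a dedicated SQL pool table that combines them; the table, columns, partition boundaries and connection string are all made up for illustration, and pyodbc is only there to run the T-SQL.

import pyodbc

DDL = """
CREATE TABLE dbo.FactSales
(
    SaleKey      BIGINT NOT NULL,
    CustomerKey  INT    NOT NULL,
    OrderDateKey INT    NOT NULL,
    Amount       DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),   -- hash key: which node each row lands on
    CLUSTERED COLUMNSTORE INDEX,        -- index: physical arrangement inside each node
    PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (20220101, 20230101))  -- partition: chunks kept together by date
);
"""

with pyodbc.connect("DSN=synapse_pool", autocommit=True) as conn:  # hypothetical DSN
    conn.execute(DDL)

In analogy terms: the hash key decides which library (node) holds a page, the partition decides which pages are bound together inside that library, and the index decides how the pages are arranged on the shelf.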

Feed large text to PyTextRank

I would like to use PyTextRank for keyphrase extraction. How can I feed 5 million documents (each document consisting of a few paragraphs) to the package?
This is the example I see on the official tutorial.
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.\n"
import spacy
import pytextrank
from icecream import ic
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")
doc = nlp(text)
for phrase in doc._.phrases:
    ic(phrase.rank, phrase.count, phrase.text)
    ic(phrase.chunks)
Is my only option to concatenate the several million documents into a single string and pass it to nlp(text)? I do not think I could use nlp.pipe(texts), as I want to create one network by computing words/phrases from all documents.
No, instead it would almost certainly be better to run these tasks in parallel. Many use cases of pytextrank have used Spark, Dask, Ray, etc., to parallelize running documents through a spaCy pipeline with pytextrank to extract entities.
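Even on a single machine the key point is to not concatenate anything: run the documents through the pipeline individually (nlp.pipe batches them for you) and aggregate the phrases afterwards. A minimal sketch, assuming pytextrank 3.x with spaCy 3.x and a texts iterable standing in for your 5 million documents:

from collections import Counter
import spacy
import pytextrank  # registers the "textrank" pipeline component

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

texts = ["first document ...", "second document ..."]  # placeholder for the real corpus

phrase_counts = Counter()
for doc in nlp.pipe(texts, batch_size=100):
    for phrase in doc._.phrases:
        phrase_counts[phrase.text] += phrase.count      # aggregate across documents

The same per-document loop is what gets distributed when you move to Ray, Dask or Spark; the aggregation then becomes a reduce over the per-worker counters.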
For an example of parallelization with Ray, see https://github.com/Coleridge-Initiative/rclc/blob/4d5347d8d1ac2693901966d6dd6905ba14133f89/bin/index_phrases.py#L45
One question would be how you are associating the extracted entities with documents? Are these being collected into a dataset, or perhaps a database or key/value store?
However these results get collected, you could then construct a graph of co-occurring phrases, and also include additional semantics to help structure the results. A sister project kglab https://github.com/DerwenAI/kglab was created for these kinds of use cases. There are some examples in the Jupyter notebooks included with the kglab project; see https://derwen.ai/docs/kgl/tutorial/
FWIW, we'll have tutorials coming up at ODSC West about using kglab and pytextrank and there are several videos online (under Graph Data Science) for previous tutorials at conferences. We also have monthly public office hours through https://www.knowledgegraph.tech/ – message me #pacoid on Tw for details.

What is the best way to store market data for Algorithmic Trading setup?

I am making an Algorithmic Trading setup for trades automation. Currently, I have a broker API which helps me get historical data for all stocks that I'm interested in.
I am wondering how to store all the data, whether in a file system or a database (SQL-based or NoSQL). The data comes through a REST API, if that's relevant.
My use case here would be to query historical data to make trading decisions in live market. I would also have to develop a backtesting framework that will query Historical Data to check performance of strategy historically.
I am looking at a frequency of 5 mins - 1 hr candles and mostly Intraday trading strategies. Thanks
As you say, there are many options and as STLDeveloper says this is kind of off topic since it is opinion based... anyway...
A simple strategy which I used in my own Python back-testing engine is to use pandas DataFrame objects and save/load them to disk in an HDF5 file using to_hdf() and read_hdf(). The primary advantage (for me) of HDF5 is that it loads/saves far more quickly than CSV.
Using the above approach I easily manage several years of 1-minute data for back-testing purposes, and data access certainly is not my performance bottleneck.
You will need to determine for yourself if your chosen data management approach is fast enough for live trading, but in general I think if your strategy is based on 5-min candles then any reasonable database approach is going to be sufficiently performant for your purposes.
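For what it's worth, the HDF5 round trip looks roughly like this; the file name, key and random candles are placeholders, and to_hdf() needs the PyTables package installed.

import numpy as np
import pandas as pd

# placeholder 5-minute OHLCV candles, just to have something to store
index = pd.date_range("2023-01-02 09:30", periods=100, freq="5min")
candles = pd.DataFrame(
    np.random.rand(100, 5),
    index=index,
    columns=["open", "high", "low", "close", "volume"],
)

candles.to_hdf("market_data.h5", key="AAPL_5min")        # save
loaded = pd.read_hdf("market_data.h5", key="AAPL_5min")  # load it back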

Where to add field that is calculated from other files in Scrapy

I am using Scrapy to crawl real estate adds.
I have the fields price and size (in m2), so I can calculate price_per_m2 as price/size.
My question is where should I do this (in what class) according to best practice in Scrapy?
Now I have it in my scrapy.Spider, but should I have it in some other place (like pipeline) and how (code examples preferred)?
While I can agree with Tomas in some points, I would never put this calculation in the spider itself.
I use spiders to extract data from pages. For me, that is the only purpose of a spider. I use ItemLoaders for cleaning extracted data and basic manipulation (like converting everything to the same base unit). And finally I use pipelines for any higher-level data manipulation, like combining fields from items.
Imagine you have a dozen spiders and you calculate price_per_m2 inside them. Your project has grown and you start coding spiders for another country. You have spiders getting prices in GBP, EUR and USD. Now if you want to compare price_per_m2 you have to either 1) convert units in each spider before the calculation or 2) add metadata to items in order to do post-processing. Both approaches are onerous in my opinion. My approach: spiders extract data, ItemLoaders convert every price to the same unit, and a pipeline calculates price_per_m2 for every item (all of them in the same units), as in the sketch below.
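A minimal sketch of such a pipeline; the field names price, size and price_per_m2 come from the question, while the class and module names are mine.

# pipelines.py
class PricePerM2Pipeline:
    def process_item(self, item, spider):
        price = item.get("price")
        size = item.get("size")
        if price and size:                       # only when both fields were scraped
            item["price_per_m2"] = price / size  # assumes the item declares this field (or is a plain dict)
        return item

Enabling it once in settings.py, e.g. ITEM_PIPELINES = {"myproject.pipelines.PricePerM2Pipeline": 300}, keeps the calculation in one place for every spider.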
If you have a calculation (or post-processing in general) that is common to all items, using a pipeline is one possible way to do it. In real estate, I can imagine for example geocoding addresses or storing each item in a database. The main reason for using a pipeline, in my opinion, is that you separate this additional logic out of the spiders, so you have a single place of maintenance when there's a need to change this logic. In the examples given above, you might decide to change the geocoding provider, or to switch from one database engine to another. That's the real strength of pipelines. That said, calculating price per m2 from price and size, which is hardly going to change, can safely be put right into the spider code. On the other hand, if there are more such simple calculations, you might consider using a pipeline just to save time repeating the same code in every spider.

RRDtool what use are multiple RRAs?

I'm trying to implement rrdtool. I've read the various tutorials and got my first database up and running. However, there is something that I don't understand.
What eludes me is why so many of the examples I come across instruct me to create multiple RRAs?
Allow me to explain: Let's say I have a sensor that I wish to monitor. I will want to ultimately see graphs of the sensor data on an hourly, daily, weekly and monthly basis and one that spans (I'm still on the fence on this one) about 1.5 yrs (for visualising seasonal influences).
Now, why would I want to create an RRA for each of these views? Why not just create a database like this (stepsize=300 seconds):
DS:sensor:GAUGE:600:U:U \
RRA:AVERAGE:0.5:1:160000
If I understand correctly, I can then create any graph I desire, for any given period with whatever resolution I need.
What would be the use of all the other RRAs people tell me I need to define?
BTW: I can imagine that in the past this would have been helpful when computing power was scarcer. Nowadays, with fast disks, high-speed interfaces and powerful CPUs, I guess you don't need the kind of pre-processing that RRAs seem to be designed for.
EDIT:
I'm aware of this page. Although it explains consolidation very clearly, it is my understanding that rrdtool graph can do this consolidation as well at the moment the data is graphed. There still appears (to me) to be no added value in "harvest-time consolidation".
Each RRA is a pre-consolidated set of data points at a specific resolution. This performs two important functions.
Firstly, it saves on disk space. So, if you are interested in high-detail graphs for the last 24h, but only low-detail graphs for the last year, then you do not need to keep the high-detail data for a whole year -- consolidated data will be sufficient. In this way, you can minimise the amount of storage required to hold the data for graph generation (although of course you lose the detail, so you can't access it should you want to). Yes, disk is cheap, but if you have a lot of metrics and are keeping the detailed data for a long time, this can be a surprisingly large amount of space (in our case, it would be in the hundreds of GB).
Secondly, it means that the consolidation work is moved from graphing time to update time. RRDTool generates graphs very quickly, because most of the calculation work is already done in the RRAs at update time, if there is an RRA of the required configuration. If there is no RRA available at the correct resolution, then RRDtool will perform the consolidation on the fly from a higher-granularity RRA, but this takes time and CPU. RRDTool graphs are usually generated on the fly by CGI scripts, so this is important, particularly if you expect to have a large number of queries coming in. In your example, using a single 5-min RRA to make a 1.5-year graph (where 1 pixel would be about 1 day), you would need to read and process 288 times more data to generate the graph than if you had a 1-day granularity RRA available!
In short, yes, you could have a single RRA and let the graphing work harder. If your particular implementation needs faster updates and doesn't care about slower graph generation, and you need to keep the detailed data for the entire time, then maybe this is a solution for you, and RRDTool can be used in this way. However, usually, people will optimise for graph generation and disk space, meaning tiered sets of RRAs with decreasing granularity.
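For illustration, a tiered setup could look like the following sketch, here via the Python rrdtool bindings; the retention lengths are just one plausible choice for the hourly/daily/weekly/monthly and ~1.5-year views mentioned in the question.

import rrdtool

rrdtool.create(
    "sensor.rrd",
    "--step", "300",
    "DS:sensor:GAUGE:600:U:U",
    "RRA:AVERAGE:0.5:1:576",    # 5-min resolution kept for 2 days    (hourly/daily graphs)
    "RRA:AVERAGE:0.5:6:672",    # 30-min resolution kept for 2 weeks  (weekly graphs)
    "RRA:AVERAGE:0.5:24:732",   # 2-hour resolution kept for ~2 months (monthly graphs)
    "RRA:AVERAGE:0.5:288:548",  # 1-day resolution kept for ~1.5 years (seasonal view)
)

Each extra RRA costs a little space and some work at update time, but it is exactly what lets the 1.5-year graph be drawn from about 548 consolidated rows instead of the roughly 160,000 raw 5-minute rows in your single-RRA layout.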