Analysing huge amounts of blockchain data - data science

I am trying to go over all transaction data from every block on the Bitcoin blockchain from the previous 4 years. With almost 2k transactions per block, that adds up to a lot of queries per block.
I have a full node running locally and I tried two ways:
Python with RPC: This is very slow and keeps losing the connection after some time (httpx.ReadTimeout)
Python with os.popen commands: Doesn't have the connection problem, but still very slow.
Is there any other way? Any recommendations on how to analyze bulk data from the blockchain? The methods listed above are infeasible given the time they would take.
EDIT: The problem isn't memory, but the time the bitcoin node takes to answer the queries.

Hey, there are different ways to fetch Bitcoin blockchain data:
At the network level, using P2P messages (this method doesn't require setting up a node)
Parsing the .blk files synchronized by your node
Querying the node's RPC interface
P2P messages and .blk files are raw encoded, so you will need to decode blocks and transactions yourself.
The RPC interface abstracts away the raw decoding, but it's slower (because it does the decoding for you).
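If you do stay with RPC, one thing that helps a lot is batching many calls into a single JSON-RPC request, so you pay far fewer HTTP round trips (Bitcoin Core accepts batched requests). A rough Python sketch, assuming a local node; the RPC credentials are placeholders and the batch size is just an example:

```python
import json
import requests

RPC_URL = "http://127.0.0.1:8332"        # default mainnet RPC port
RPC_AUTH = ("rpc_user", "rpc_password")  # placeholders: use your own rpcuser/rpcpassword

def rpc_batch(calls, timeout=600):
    """Send a list of (method, params) pairs as one JSON-RPC batch request."""
    payload = [
        {"jsonrpc": "1.0", "id": i, "method": method, "params": params}
        for i, (method, params) in enumerate(calls)
    ]
    resp = requests.post(RPC_URL, auth=RPC_AUTH, data=json.dumps(payload), timeout=timeout)
    resp.raise_for_status()
    results = sorted(resp.json(), key=lambda item: item["id"])
    return [item["result"] for item in results]

# Fetch 50 blocks, with fully decoded transactions, in two round trips.
start_height = 500_000
hashes = rpc_batch([("getblockhash", [h]) for h in range(start_height, start_height + 50)])
blocks = rpc_batch([("getblock", [block_hash, 2]) for block_hash in hashes])  # verbosity 2 = decoded txs

for block in blocks:
    for tx in block["tx"]:
        pass  # per-transaction analysis goes here
```

Keep the batches moderate (a few dozen getblock calls per request) so you don't exhaust the node's RPC work queue; if even that is too slow, parsing the .blk files directly, as described below, avoids the RPC layer entirely.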
We wrote a paper with Matthieu Latapy giving instructions for collecting the whole Bitcoin blockchain and indexing it so that parsing is efficient.
Step-by-step procedure
Full paper
Repository
Website


Redis: Using lua and concurrent transactions

Two issues
Do Lua scripts really solve all the cases for Redis transactions?
What are best practices for asynchronous transactions from one client?
Let me explain. First issue:
Redis transactions are limited: you cannot unwatch specific keys, all keys are unwatched upon EXEC, and we are limited to a single ongoing transaction on a given client.
I've seen threads where many Redis users claim that Lua scripts are all they need. Even the official Redis docs state they may remove transactions in favour of Lua scripts. However, there are cases where this is insufficient, such as the most standard case: using Redis as a cache.
Let's say we want to cache some data from a persistent data store, in redis. Here's a quick process:
Check cache -> miss
Load data from database
Store in redis
However, what if, between step 2 (loading data), and step 3 (storing in redis) the data is updated by another client?
The data stored in Redis would be stale. So... we use a Redis transaction, right? We WATCH the key before loading from the DB, and if the key is updated somewhere else before we store, the store fails. Great! However, an atomic Lua script cannot load data from an external database, so Lua cannot be used here. Hopefully I'm simply missing something, or there is something wrong with our process.
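For concreteness, here is what that check-and-set cache fill looks like as an optimistic transaction. This is a sketch in Python with redis-py (the same WATCH/MULTI/EXEC flow applies from Node); load_from_database is a hypothetical stand-in for the persistent store:

```python
import json
import redis

r = redis.Redis()

def load_from_database(key):
    """Hypothetical loader for the persistent data store."""
    raise NotImplementedError

def get_with_cache(key, ttl=300):
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit

    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch(key)            # WATCH before reading from the DB
                value = load_from_database(key)
                pipe.multi()               # start MULTI
                pipe.set(key, json.dumps(value), ex=ttl)
                pipe.execute()             # EXEC: raises WatchError if the key changed
                return value
            except redis.WatchError:
                # Someone wrote the key between our DB read and EXEC; their value
                # is at least as fresh as ours, so read it back instead of overwriting.
                fresh = r.get(key)
                if fresh is not None:
                    return json.loads(fresh)
                # the key was deleted again; loop and retry
```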
Moving on to the 2nd issue (asynchronous transactions)
Let's say we have a socket.io cluster that processes various messages and requests for a game, for high-speed communication between server and client. This cluster is written in node.js with appropriate use of promises and asynchronous concepts.
Say two requests hit a server in our cluster, each requiring data to be loaded and cached in Redis. Using our transaction from above, multiple keys could be watched, and multiple MULTI/EXEC transactions could run in overlapping order on one Redis connection. Once the first EXEC runs, all watched keys are unwatched, even if the other transaction is still in progress. This may allow the second transaction to succeed when it should have failed.
These overlaps could happen in totally separate requests happening on the same server, or even sometimes in the same request if multiple data types need to load at the same time.
What is best practice here? Do we need to create a separate Redis connection for every individual transaction? It seems like we would lose a lot of speed, and we would see many connections created from a single server if this is the case.
As an alternative we could use redlock / mutex locking instead of redis transactions, but this is slow by comparison.
Any help appreciated!
I received the following after my query was escalated to Redis engineers:
Hi Jeremy,
Your method using multiple backend connections would be the expected way to handle the problem. We do not see anything wrong with multiple backend connections, each using an optimistic Redis transaction (WATCH/MULTI/EXEC) - there is no chance that the “second transaction will succeed where it should have failed”.
Using LUA is not a good fit for this problem.
Best Regards,
The Redis Labs Team
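To illustrate that recommendation: the point is that each in-flight optimistic transaction should own its connection for the duration of WATCH...EXEC. A small Python/redis-py sketch of the idea using a bounded connection pool (in Node you would similarly give each transaction its own client rather than sharing one); loader is a hypothetical callback for the external database:

```python
import redis

# Bounded pool: each in-flight WATCH/MULTI/EXEC cycle checks out its own connection,
# so one EXEC cannot unwatch keys belonging to a different, still-running transaction.
pool = redis.ConnectionPool(max_connections=20)

def cache_fill(key, loader, ttl=300):
    """Run one optimistic transaction on its own pooled connection."""
    client = redis.Redis(connection_pool=pool)
    with client.pipeline() as pipe:
        try:
            pipe.watch(key)
            value = loader(key)        # read from the external database (serialized string)
            pipe.multi()
            pipe.set(key, value, ex=ttl)
            pipe.execute()
            return True                # our value was stored
        except redis.WatchError:
            return False               # key changed concurrently; caller re-reads the cache
```

The pool keeps the total connection count bounded, so concurrent transactions do not translate into an unbounded number of sockets per server.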

Is blockchain a decentralised database?

I understand Bitcoin uses blockchain technology to maintain a decentralised ledger of all transactions. I have also read many posts alluding to future applications of blockchain technology, none of which have been very clear to me.
Is blockchain technology simply a decentralised database with consensus validation of the data? If that were the case, surely the database would grow too large to be effectively decentralised?
To help me understand, can anyone point me to a clear example of a non-bitcoin blockchain application?
Yes, it's true that the blockchain database grows over time, which is what is called "blockchain bloat". Currently the Bitcoin blockchain grows by roughly less than 100 MB a day. Today (2016) the Bitcoin blockchain takes up about 60-100 GB of space, which took about 6 years to accumulate. It is indeed growing faster, but it is also limited by the block size cap of 1 MB per block (every 10 minutes). Some proposed solutions have been:
SPV nodes: This is how your phone avoids downloading the entire blockchain; it retrieves the data it needs from nodes that do have the entire blockchain.
Lightning Network - This is how Bitcoin can overcome the 1 MB block size cap, by moving many transactions off-chain.
Those are just some of the Bitcoin solutions I know of. As for altcoin-related solutions, NXT/Ardor has implemented data pruning: because NXT/Ardor allows arbitrary data and messages to be uploaded onto its blockchain, bloat is much more apparent there. The NXT/Ardor blockchain can delete old data every 2 weeks and keep only the hash of that data on the blockchain, which takes just a few KB. Nodes can also retain all of the blockchain data with pruning turned off, which marks them as archival nodes, and other nodes can replicate an archival node to become archival nodes themselves.
From my understanding, NXT/Ardor is one of the few blockchains with a production-ready decentralized data storage system, marketplace, stock exchange, and messaging system built into its blockchain.
Blockchain is not just a decentralised database; it is much more than that. While the original Bitcoin blockchain allowed only value to be transferred, along with a limited amount of data per transaction, several newer blockchains developed in the past 2-3 years have much more advanced native scripting and programming capabilities.
Apart from the Bitcoin blockchain, I would say there are a few other major blockchains, like Ethereum, Ripple, R3's Corda, and Hyperledger. Although Ethereum has a crypto-currency called Ether, it is really a Turing-complete platform built around the EVM (Ethereum Virtual Machine). Using Ethereum, you can create smart contracts that themselves run in a decentralised manner. As a developer, it opens up completely new avenues and changes your perspective on writing programs. While Ripple is mainly geared towards payments, Corda and Hyperledger are built as private/permissioned blockchains, to address issues such as scalability, privacy, and identity. The target markets for Hyperledger and Corda are mostly banks and other financial institutions.
As for the non-bitcoin application of blockchain, you can certainly look at some companies like Consensys (multiple different use cases on blockchain), Digix Global (gold tokens on the blockchain), Everledger (tracking of diamonds on the blockchain), Otonomos (Company registration on the blockchain), OT Docs (Trade Finance and document versioning on the blockchain) amongst others.
Blockchain is:
Name for a data structure,
Name for an algorithm,
Name for a suite of technologies,
An umbrella term for purely distributed peer-to-peer systems with a common application area,
A peer-to-peer-based operating system with its own unique rule set that utilizes hashing to provide unique data transactions with a distributed ledger
Blockchain is much more than a "database". Yes, the blocks on the chain store data, but it is more like a service. There are many applications of blockchain. Read about them: here. If you want to see the code of a blockchain application, try this one: here.
Blockchain is a combination of a P2P network, a decentralised database, and asymmetric cryptography.
A P2P network means you can transfer data between two different network nodes without any middleman; a decentralised DB means every node of the network holds a replica of the network's DB; and asymmetric cryptography means you can use digital signatures to validate the authenticity and integrity of a message.
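To make the asymmetric-cryptography part concrete, signing and verifying a message looks roughly like this. A sketch using the third-party Python ecdsa package and the secp256k1 curve (illustrative only, not code from any particular blockchain):

```python
import hashlib
from ecdsa import SigningKey, SECP256k1, BadSignatureError

# Key pair: the private key signs, the public key lets anyone verify.
private_key = SigningKey.generate(curve=SECP256k1)
public_key = private_key.get_verifying_key()

message = b"send 1 coin from A to B"
signature = private_key.sign(message, hashfunc=hashlib.sha256)

# Any node holding only the public key can check authenticity and integrity:
try:
    public_key.verify(signature, message, hashfunc=hashlib.sha256)
    print("signature valid")
except BadSignatureError:
    print("signature invalid or message tampered with")
```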

Bigquery streaming inserts taking time

During load testing of our module we found that BigQuery insert calls are taking time (3-4 s). I am not sure if this is OK. We are using the Java BigQuery client library, and on average we push 500 records per API call. We are expecting a million records per second of traffic to our module, so BigQuery inserts are the bottleneck for handling this traffic. Currently it is taking hours to push the data.
Let me know if we need more info regarding code or scenario or anything.
Thanks
Pankaj
Since streaming has a limited payload size (see the Quota policy), it's easier to talk about times, as the payload is limited in the same way for both of us, but I will mention other side effects too.
We measure between 1200-2500 ms for each streaming request, and this was consistent over the last month as you can see in the chart.
We have seen several side effects, though:
the request randomly fails with type 'Backend error'
the request randomly fails with type 'Connection error'
the request randomly fails with type 'timeout' (watch out here, as only some rows are failing and not the whole payload)
some other error messages are non-descriptive, and so vague that they don't help you; just retry.
we see hundreds of such failures each day, so they are pretty much constant, and not related to Cloud health.
For all of these we opened cases with paid Google Enterprise Support, but unfortunately they didn't resolve them. It seems the recommended approach is exponential backoff with retry; even support told us to do so, which personally doesn't make me happy.
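For reference, the backoff-and-retry pattern looks roughly like this. A Python sketch with the google-cloud-bigquery client (the poster uses the Java library, but the flow is the same); the table ID is a placeholder and the retry limits are arbitrary:

```python
import random
import time

from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.my_dataset.my_table"   # placeholder

def stream_rows_with_backoff(rows, max_attempts=6):
    """Streaming insert with exponential backoff; retries only the rows that failed."""
    pending = list(rows)
    for attempt in range(max_attempts):
        errors = client.insert_rows_json(TABLE_ID, pending)
        if not errors:
            return                                       # all rows accepted
        # insert_rows_json reports per-row errors; keep only the failed rows.
        failed_indexes = {err["index"] for err in errors}
        pending = [row for i, row in enumerate(pending) if i in failed_indexes]
        time.sleep((2 ** attempt) + random.random())     # exponential backoff with jitter
    raise RuntimeError(f"{len(pending)} rows still failing after {max_attempts} attempts")
```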
If the approach you've chosen takes hours, that means it does not scale and won't scale. You need to rethink the approach and use async processes. To finish sooner, you need to run multiple workers in parallel; the streaming performance per request will stay the same, but just having 10 workers in parallel means the total time will be 10 times less.
Processing IO-bound or CPU-bound tasks in the background is now common practice in most web applications. There's plenty of software to help build background jobs, some based on messaging systems like Beanstalkd.
Basically, you need to distribute insert jobs across a closed network, prioritize them, and consume (run) them. Well, that's exactly what Beanstalkd provides.
Beanstalkd gives the possibility to organize jobs in tubes, each tube corresponding to a job type.
You need an API/producer that can put jobs on a tube, say a JSON representation of the row. This was a killer feature for our use case. So we have an API that receives the rows and places them on a tube; this takes just a few milliseconds, so you can achieve a fast response time.
On the other side, you now have a bunch of jobs sitting on some tubes. You need an agent: an agent/consumer can reserve a job.
It helps you also with job management and retries: When a job is successfully processed, a consumer can delete the job from the tube. In the case of failure, the consumer can bury the job. This job will not be pushed back to the tube, but will be available for further inspection.
A consumer can also release a job; Beanstalkd will push the job back into the tube and make it available to another client.
Beanstalkd clients exist for most common languages, and a web interface can be useful for debugging.
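A minimal producer/consumer pair for that flow could look like this. A hedged Python sketch using the third-party greenstalk Beanstalkd client and the google-cloud-bigquery client; the tube name, table ID, and batch size are placeholders:

```python
import json

import greenstalk
from google.cloud import bigquery

TUBE = "bq-inserts"                           # placeholder tube name
TABLE_ID = "my-project.my_dataset.my_table"   # placeholder table

def produce(rows, batch_size=500):
    """API side: enqueue batches of rows as jobs; this returns in milliseconds."""
    queue = greenstalk.Client(("127.0.0.1", 11300), use=TUBE)
    for i in range(0, len(rows), batch_size):
        queue.put(json.dumps(rows[i:i + batch_size]))
    queue.close()

def consume_forever():
    """Worker side: reserve a job, stream it to BigQuery, delete on success, bury on failure."""
    bq = bigquery.Client()
    queue = greenstalk.Client(("127.0.0.1", 11300), watch=TUBE)
    while True:
        job = queue.reserve()
        try:
            errors = bq.insert_rows_json(TABLE_ID, json.loads(job.body))
            if errors:
                queue.bury(job)       # keep the failed batch around for inspection
            else:
                queue.delete(job)     # done
        except Exception:
            queue.release(job)        # transient failure: hand it back for another worker
```

Running several copies of consume_forever in separate processes gives you the "N workers in parallel" scaling mentioned above without changing the producer.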

Need design & implementation input on a Cassandra-based use case

I am planning to store high-volume order transaction records from a commerce website in a repository (we have to use Cassandra here; that is our DB). Let us call this component commerceOrderRecorderService.
The second part of the problem is that I want to process these orders and push them to other downstream systems. This component can be called batchCommerceOrderProcessor.
Both commerceOrderRecorderService and batchCommerceOrderProcessor will run on a Java platform.
I need suggestions on the design of these components, especially the points below:
commerceOrderRecorderService
What is the best way to design the columns, considering performance and scalability? Should I store the entire order (a complex entity) as a single JSON object? There is no search requirement on the order attributes, at least not before they are processed by the batch processor. Consider that a single order can contain many sub-items, each of which can be fulfilled differently at processing time; designing columns for such a data structure may be overkill.
What should the key be, given that data volumes will be high - say 10 transactions per second at peak? Are there any libraries or best practices for writing such transactional data to Cassandra? Can TTL also be used effectively?
batchCommerceOrderProcessor
How should the rows be retrieved for processing?
How do we ensure that a multi-threaded implementation of the batch processor (potentially running on multiple nodes as well) has row-level isolation, i.e. that no two instances read and process the same row at the same time? No duplicate processing.
How do we purge the data after a certain period of time while staying friendly to Cassandra processes like compaction?
Appreciate design inputs, code samples and pointers to libraries. Thanks.
Depending on the overall requirements of your system, it could be feasible to employ an architecture composed of:
Cassandra to store the orders, analytics and what have you.
A message queue - your commerce order recorder service would simply enqueue each new order to a transactional, persistent queue and return. Scalability and performance should not be an issue here, as you can easily achieve thousands of transactions per second with a single queue server. You may have a look at RabbitMQ as one of the available choices.
A stream processing framework - you could read the stream of messages from the queue in a scalable fashion using a streaming framework such as Twitter Storm. You could then implement 3 simple pipelined processes in Storm in Java:
a) A spout process that dequeues the next order from the queue and passes it to the second process
b) A second process (a bolt) that inserts each order into Cassandra and passes it on to the third bolt
c) A third bolt process that pushes the order to the other downstream systems.
Such an architecture offers high performance, scalability, and near real-time, low-latency data processing. It takes into account that Cassandra is very strong at high-speed data writes, but not so strong at reading sequential lists of records. We use the Storm+Cassandra combination in our InnoQuant MOCA platform and handle 25,000 tx/second and more, depending on hardware.
Finally, you should consider whether such an architecture is overkill for your scenario. Nowadays, you can easily achieve 10 tx/second with nearly any single-box database.
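To tie this back to the Cassandra-specific questions (single JSON column, key choice, TTL, purge-friendliness), one common shape is a time-bucketed table where the whole order is one JSON blob and entire partitions age out together. A Python sketch with the DataStax cassandra-driver; the keyspace name, hourly bucket, and 30-day TTL are assumptions for illustration, and the "no two workers process the same row" requirement is handled by the queue/Storm layer above rather than by Cassandra itself:

```python
import json
import uuid
from datetime import datetime, timezone

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("commerce")      # assumed keyspace

# Partition by hour bucket so the batch processor can scan one bucket at a time,
# and whole partitions expire together (friendlier to compaction than row deletes).
session.execute("""
    CREATE TABLE IF NOT EXISTS orders_by_bucket (
        bucket text,
        order_id timeuuid,
        payload text,
        PRIMARY KEY (bucket, order_id)
    )
""")

insert_stmt = session.prepare(
    "INSERT INTO orders_by_bucket (bucket, order_id, payload) "
    "VALUES (?, ?, ?) USING TTL 2592000"   # 30 days, an example retention period
)

def record_order(order):
    """commerceOrderRecorderService: store the whole order as one JSON blob."""
    bucket = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H")   # hourly bucket
    session.execute(insert_stmt, (bucket, uuid.uuid1(), json.dumps(order)))

def read_bucket(bucket):
    """batchCommerceOrderProcessor: read one bucket's worth of orders for processing."""
    rows = session.execute(
        "SELECT order_id, payload FROM orders_by_bucket WHERE bucket = %s", (bucket,)
    )
    return [(row.order_id, json.loads(row.payload)) for row in rows]
```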
This example may help a little. It loads a lot of transactions using the jmxbulkloader and then batches the results into files of a certain size to be transported elsewhere. It is multi-threaded, but within the same process.
https://github.com/PatrickCallaghan/datastax-bulkloader-writer-example
Hope it helps. BTW, it uses the latest Cassandra, 2.0.5.

Best use of NSURLConnection when getting multiple JSON objects that depend on the previous one

What I am doing is querying an API to search for articles in various databases. There are multiple steps involved, and each returns a JSON object. Each step involves an NSURLConnection with a different query string to the API:
Step 1: returns a JSON object indicating the status of the query and a record set ID.
Step 2: takes the record set ID from step 1 and returns the list of databases that are valid for querying.
Step 3: queries each database that was ready according to step 2 and gets a JSON data array with the results.
I am confused as to the best way of going about this. Is it better to use one NSURLConnection and reopen that connection in connectionDidFinishLoading: based on which step I am in, or is it better to open a new connection at the end of each subsequent connection?
A couple of observations
Network latency:
The key phenomenon that we need to be sensitive to here (and it sounds like you are) is network latency. Too often we test our apps in an ideal scenario (on the simulator with high-speed internet access, or on a device connected to wifi). But when you use an app in a real-world scenario, network latency can seriously impact performance, and you'll want to architect a solution that minimizes this.
Simulating sub-optimal, real-world network situations:
By the way, if you're not doing it already, I'd suggest you install the "Network Link Conditioner" which is part of the "Hardware IO Tools" (available from the "Xcode" menu, choose "Open Developer Tool" - "More Developer Tools"). If you install the "Network Link Conditioner", you can then have your simulator simulate a variety of network experiences (e.g. Good 3G connection, Poor Edge connection, etc.).
Minimize network requests:
Anyway, I'd try to figure out how to minimize separate requests that are dependent upon the previous one. For example, I see step 1 and step 2 and wonder if you could merge those two into a single JSON request. Perhaps that's not possible, but hopefully you get the idea. You want to reduce the number of separate requests that have to happen sequentially.
I'd also look at step 3, and those look like they have to be dependent upon step 2, but perhaps you can run a couple of those step 3 requests concurrently, reducing the latency effect there.
Implementation:
In terms of how this would be implemented, I personally use a concurrent NSOperationQueue with a reasonable maxConcurrentOperationCount setting (e.g. 4 or 5, enough to enjoy concurrency and reduce latency, but not so many as to tax either the device or the server) and submit network operations to it. In this case, you'll probably submit step 1, with a completion operation that submits step 2, with a completion operation that submits a series of step 3 requests, and those step 3 requests might run concurrently.
In terms of how to make a good network operation object, I might suggest using something like AFNetworking, which already has a decent network operation object (including one that parses JSON), so maybe you can start there.
In terms of re-using an NSURLConnection, it is generally one connection per request. When I've had an app that needed a lengthy exchange of messages with a server (e.g. a chat-like service where you want the server to be able to push a message to the client whenever it wants), I've done a sockets implementation, but that doesn't seem like the right architecture here.
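This is an iOS question, but if it helps to see the shape of the dependency chain separately from the NSOperationQueue specifics, here is the same "step 1, then step 2, then several step-3 requests in parallel" structure as a plain Python sketch; the URLs and JSON field names are made up, and the bounded thread pool plays the role of maxConcurrentOperationCount:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

BASE = "https://api.example.com"          # hypothetical API

def run_search(term):
    # Step 1: start the query, get a record set ID (field name assumed).
    status = requests.get(f"{BASE}/search", params={"q": term}, timeout=10).json()
    record_set_id = status["recordSetId"]

    # Step 2: list the databases that are ready for this record set.
    databases = requests.get(
        f"{BASE}/databases", params={"rsid": record_set_id}, timeout=10
    ).json()

    # Step 3: query each ready database; these are independent, so run a few in parallel.
    def fetch_results(db):
        return requests.get(
            f"{BASE}/results", params={"rsid": record_set_id, "db": db}, timeout=10
        ).json()

    with ThreadPoolExecutor(max_workers=4) as pool:   # ~maxConcurrentOperationCount
        return list(pool.map(fetch_results, databases))
```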
I would dismiss the first connection and create a new one for each connection.
Just, don't ask me why.
BTW, I would understand the question if it were about reusing vs. creating new objects in some performance-sensitive context like scrolling through a table or animations, or if it happened over tens of thousands of iterations. But you are talking about 3 objects, to either create anew or reuse. What is the gain of even thinking about it?