Windowing functions in Dataflow and BigQuery

I am looking at analysing streaming data (web events).
Is there a good rule of thumb to help me determine whether I should:
1. Perform grouping and aggregation in Dataflow and write the output, or
2. Use Dataflow to stream into BigQuery, possibly use a range decorator to limit the data or a windowing function for partitions, and aggregate via SQL?
Looking at the examples in the documentation and this article
The Classic Batch Programming, Hourly Team Scores, All-time User Scores and User Behaviour Analysis examples feel like they are straightforward to create via SQL (given that "created" and "write" timestamps are recorded).
For the Spam filtering example I can see the limitations of using BQ (if this is applied on a per-event streaming basis).
The semantics of Dataflow seem to overlap with BigQuery in terms of GroupBy, Join, Combine and Windowing, and BQ supports streaming inserts with availability in seconds, which is well short enough for hour-level aggregation.
Is there something fundamental I have not understood? Or is there a case that streaming into BigQuery and then querying will start to become unreliable?
Thank you
(Apologies if this question is a bit vague - happy to be redirected to a better place to ask)

Whether to perform grouping and aggregation in Dataflow or with BigQuery operations (after ingesting the data using Dataflow) depends on the application logic and on what consumes the output. For example, sessions and sliding windows are both hard to express in SQL, while Dataflow supports arbitrary processing such as triggered estimates. Another thing to consider is that it may be easier to express the computation logic in an imperative programming language than in SQL.
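To illustrate why sessions are awkward in SQL but short in imperative code, here is a minimal Python sketch. It is not Dataflow code. It groups one user's event timestamps into sessions separated by a period of inactivity. The function name `sessionize` and the 30-minute gap are assumptions made for the example.

```python
from datetime import datetime, timedelta

def sessionize(timestamps, gap=timedelta(minutes=30)):
    """Group event timestamps into sessions split by inactivity longer than `gap`.

    `sessionize` and the 30-minute default are illustrative assumptions,
    not part of any Dataflow or BigQuery API.
    """
    sessions = []
    for ts in sorted(timestamps):
        # Start a new session when there is none yet, or the gap was exceeded.
        if not sessions or ts - sessions[-1][-1] > gap:
            sessions.append([ts])
        else:
            sessions[-1].append(ts)
    return sessions

events = [
    datetime(2016, 1, 1, 9, 0),
    datetime(2016, 1, 1, 9, 10),
    datetime(2016, 1, 1, 10, 30),
]
print([len(s) for s in sessionize(events)])  # prints [2, 1]
```

The session boundary depends on the previous event rather than on a fixed clock interval, which is what makes it harder to write as a plain GROUP BY.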

The following does not necessarily answer your exact question, but rather adds another aspect to consider:
1. If you are building a process that is supposed to power your infrastructure, Dataflow might be a good choice. Of course, you are then bound by your tech team's resources.
2. If you plan for ad hoc, self-serve activity by non-technical personnel (technical personnel are not excluded, of course), you can focus on BigQuery's query features (including windowing functions) and make sure you have good, real, working examples that the rest of your company can use as templates to start leveraging the power of BigQuery and GCP in general. This has proved to work great! Domain experts can now answer their own questions (like the ones you listed in your question) without having tech people in between. Quality and timing are much better in this scenario!


Can I use the same programming language in BigQuery and Google Cloud Dataflow?

I want to use the same function for parsing events in two different technologies: Google BigQuery and Dataflow. Is there a language I can do this in? If not, is Google planning to support one any time soon?
Background: Some of this parsing is complex (e.g., applying custom URL extraction rules, extracting information out of the user agent) but it's not computationally expensive and doesn't involve joining the events to any other large look-up tables. Because the parsing can be complex, I want to write my parsing logic in only one language and run it wherever I need it: sometimes in BigQuery, sometimes in other environments like DataFlow. I want to avoid writing the same complex parsers/extractors in different languages because of the bugs and inconsistencies that can result from that.
I know BigQuery supports javascript UDFs. Is there a clean way to run javascript on Google Cloud DataFlow? Will BigQuery someday support UDFs in some other language?
We tend to use Java to drive BigQuery jobs and parse their resulting data, and we do the same in Dataflow as well.
Likewise, you have leeway in how much SQL you write by hand versus auto-generate from the code base, and in how much you lean on BigQuery versus Dataflow.
(We have found, with our larger amounts of data, that there is a big benefit to offloading as much initial grouping/filtering as possible into BigQuery before pulling the data into Dataflow.)

What is the best way to store market data for Algorithmic Trading setup?

I am making an Algorithmic Trading setup for trades automation. Currently, I have a broker API which helps me get historical data for all stocks that I'm interested in.
I am wondering how to store all the data, whether in a file system or a database (SQL based or NoSQL). Data comes through a REST API, if that's relevant.
My use case here would be to query historical data to make trading decisions in live market. I would also have to develop a backtesting framework that will query Historical Data to check performance of strategy historically.
I am looking at a frequency of 5 mins - 1 hr candles and mostly Intraday trading strategies. Thanks
As you say, there are many options and as STLDeveloper says this is kind of off topic since it is opinion based... anyway...
A simple strategy which I used in my own Python back-testing engine is to use Python Pandas DataFrame objects, and save/load to disk in an HDF5 file using to_hdf() and read_hdf(). The primary advantage (for me) of HDF5 is that it loads/saves far more quickly than CSV.
Using the above approach I easily manage several years of 1 minute data for back testing purposes, and data access certainly is not my performance bottleneck.
You will need to determine for yourself if your chosen data management approach is fast enough for live trading, but in general I think if your strategy is based on 5-min candles then any reasonable database approach is going to be sufficiently performant for your purposes.

Web application receiving millions of requests, leading to millions of row inserts per 30 seconds in SQL Server 2008

I am currently addressing a situation where our web application receives at least a million requests per 30 seconds. These requests lead to 3-5 million row inserts across 5 tables. This is a pretty heavy load to handle. Currently we are using multithreading to handle this situation (which is a bit faster, but we are unable to get better CPU throughput). However, the load will definitely increase in future and we will have to account for that too. Six months from now we expect double the load we are currently receiving, so I am looking for a new solution that is scalable and can easily accommodate any further increase in this load.
Currently, multithreading makes the whole debugging scenario quite complicated, and sometimes we have problems tracing issues.
FYI we are already utilizing the SQL Bulk Insert/Copy that is mentioned in this previous post:
Sql server 2008 - performance tuning features for insert large amount of data
However I am looking for a more capable solution (which I think there should be one) that will address this situation.
Note: I am not looking for any code snippets or code examples. I am just looking for a big picture of a concept that I could possibly use and I am sure that I can take that further to an elegant solution :)
Also the solution should have a better utilization of the threads and processes. And I do not want my threads/processes to even wait to execute something because of some other resource.
Any suggestions will be deeply appreciated.
Update: Not every request will lead to an insert. However, most of them will lead to some SQL operation. The application performs different types of transactions, and these will lead to a lot of bulk SQL operations. I am more concerned about inserts and updates.
These operations need not be real time; there can be a bit of lag. However, processing them in real time would be much more helpful.
I think your problem is more about getting better CPU throughput, which will lead to better performance. So I would look at something like asynchronous processing, in which a thread never sits idle. You will probably have to maintain a queue in the form of a linked list or any other data structure that suits your programming model.
The way this would work is that your threads try to perform a given job immediately, and if anything stops them from doing it they push that job onto the queue. The queued items are then processed in whatever order the container/queue stores them.
In your case since you are already using bulk sql operations you should be good to go with this strategy.
lemme know if this helps you.
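As a rough picture of the queue idea above, here is a minimal Python sketch. It is a simpler variant than the one described: request threads always enqueue rows rather than trying the write first, and a single worker drains the queue and writes in batches. The names `batch_writer` and `flush` are hypothetical, and `flush` merely stands in for whatever bulk insert call the application already uses.

```python
import queue
import threading

def batch_writer(jobs, flush, batch_size=1000):
    """Drain `jobs` and hand rows to `flush` in batches.

    `flush` stands in for a bulk insert call; it is a hypothetical callback,
    not a SQL Server API. A `None` item tells the worker to stop.
    """
    batch = []
    while True:
        row = jobs.get()  # blocks only this worker, never the producers
        if row is None:
            break
        batch.append(row)
        if len(batch) >= batch_size:
            flush(batch)
            batch = []
    if batch:
        flush(batch)  # write whatever is left at shutdown

jobs = queue.Queue()
written = []  # collects the batches that would have gone to the database
worker = threading.Thread(target=batch_writer, args=(jobs, written.append, 3))
worker.start()

for i in range(7):
    jobs.put({"id": i})  # request threads would do this and return at once
jobs.put(None)  # sentinel: no more rows
worker.join()

print([len(b) for b in written])  # prints [3, 3, 1]
```

Keeping the database writes on one worker (or a small fixed pool) also tends to make tracing easier than having every request thread talk to the database directly.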
Can you partition the database so that the inserts are spread around? How is this data used after insert? Is there a natural partition to the data by client or geography or some other factor?
Since you are using SQL Server, I would suggest you get several of the books on high availability and high performance for SQL Server. The internals book might help as well. Amazon has a bunch of these. This is a complex subject and requires too much depth for a simple answer on a bulletin board. But basically there are several keys to high performance design including hardware choices, partitioning, correct indexing, correct queries, etc. To do this effectively, you have to understand in depth what SQL Server does under the hood and how changes can make a big difference in performance.
Since you do not need to have your inserts/updates in real time, you might consider having two databases: one for reads and one for writes. This is similar to having an OLTP db and an OLAP db:
Read Database:
Indexed as much as needed to maximize read performance.
Possibly denormalized if performance requires it.
Not always up to date.
Insert/Update database:
No indexes at all. This will help maximize insert/update performance
Try to normalize as much as possible.
Always up to date.
You would basically direct all insert/update actions to the Insert/Update db. You would then create a publication process that would move data over to the read database at certain time intervals. When I have seen this in the past, the data was usually moved over on a nightly basis when few people would be using the site. There are a number of options for moving the data over, but I would start by looking at SSIS.
This will depend on your ability to do a few things:
have read data be up to one day out of date
complete your nightly Read db update process in a reasonable amount of time.

Column-based query accelerator in SQL Server 2012

I have been researching SQL Server 2012 (aka Denali), for which Microsoft has a pre-release available. The pre-release is located here with some information on key features. I have downloaded the pre-release and installed it on a VM, and I am curious about the following key feature, but I'm not sure of its ability.
Column-based query accelerator
Column-Based Query Accelerator will help dramatically increase query performance ~10x and reduce performance tuning through interactive experiences with data for near instant response times and streamlined setup which removes the need to build summary aggregates.
What I would like to see is some explanation of the performance enhancement and perhaps an example, as I do not understand what "Column-based query" acceleration is. Any insight would be helpful.
Sounds like a Business Intelligence thing.
Queries aren't "interactive" and don't generally have "summary aggregates".
MS has put a lot into Analysis Services.
Edit: it's also possible that it's already known and blogged about, but the marketing monkeys changed the name :-)
Columnar storage is a physical layout optimization where data is stored by columns, and not rows. In some use cases, the advantages here are many:
1) less read time: to compute an aggregate on one value there is no need to read the rest of the row
2) data compression: as the data within a column is likely similar, you can get greater compression ratios
3) ordinal indexing (sometimes)
This approach falls apart when data is inserted and updated, but for read-only and append use cases the performance benefits can be astounding.
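The first two points can be sketched in a few lines of Python. This is only a toy model of the two layouts, not how SQL Server stores data; the table contents and the `run_length_encode` helper are assumptions made for the example.

```python
from array import array

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 12.5},
    {"id": 3, "country": "DE", "amount": 7.5},
]

# Column-oriented layout: one contiguous sequence per column.
columns = {
    "id": array("q", [1, 2, 3]),
    "country": ["US", "US", "DE"],
    "amount": array("d", [10.0, 12.5, 7.5]),
}

def run_length_encode(values):
    """Collapse runs of equal adjacent values into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# The row layout walks every record to sum a single field.
row_total = sum(r["amount"] for r in rows)
# The column layout reads only the "amount" column.
col_total = sum(columns["amount"])

print(row_total, col_total)                    # prints 30.0 30.0
print(run_length_encode(columns["country"]))   # prints [['US', 2], ['DE', 1]]
```

Run-length encoding is just one simple compression scheme that benefits from similar values sitting next to each other; it is used here only to make the compression point concrete.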

Would this method work to scale out SQL queries?

I have a database containing a single huge table. At the moment a query can take anything from 10 to 20 minutes and I need that to go down to 10 seconds. I have spent months trying different products like GridSQL. GridSQL works fine, but is using its own parser which does not have all the needed features. I have also optimized my database in various ways without getting the speedup I need.
I have a theory on how one could scale out queries, meaning that I utilize several nodes to run a single query in parallel. A precondition is that the data is partitioned (vertically), with one partition placed on each node. The idea is to take an incoming SQL query and simply run it exactly as it is on all the nodes. When the results are returned to a coordinator node, the same query is run on the union of the result sets. I realize that an aggregate function like average needs to be rewritten into a count and a sum sent to the nodes, and that the coordinator divides the sum of the sums by the sum of the counts to get the average.
What kinds of problems could not easily be solved using this model? I believe one issue would be the count distinct function.
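The average rewrite and the count distinct problem described above can be sketched in Python as follows. The function names and the three-node data are hypothetical, chosen only to make the arithmetic visible.

```python
def partial_avg(rows):
    """Per-node step: return (sum, count) instead of the average itself."""
    return (sum(rows), len(rows))

def merge_avg(partials):
    """Coordinator step: sum of sums divided by sum of counts."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Hypothetical values of one column as held by three nodes.
nodes = [[1, 2, 3], [3, 4], [4, 5, 5, 6]]

print(merge_avg([partial_avg(n) for n in nodes]))

# COUNT(DISTINCT) does not merge this way: adding per-node distinct counts
# double-counts values that occur on more than one node.
naive = sum(len(set(n)) for n in nodes)
# Shipping the distinct values themselves and taking their union is correct.
exact = len(set().union(*nodes))
print(naive, exact)  # prints 8 6
```

The exact count distinct needs each node to return its set of distinct values rather than a single number, so the amount of data sent to the coordinator grows with the number of distinct values.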
Edit: I am getting so many nice suggestions, but none have addressed the method.
It's a data volume problem, not necessarily an architecture problem.
Whether on 1 machine or 1000 machines, if you end up summarizing 1,000,000 rows, you're going to have problems.
Rather than normalizing your data, you need to de-normalize it.
You mention in a comment that your database is "perfect for your purpose", when, obviously, it's not. It's too slow.
So, something has to give. Your perfect model isn't working, as you need to process too much data in too short of a time. Sounds like you need some higher level data sets than your raw data. Perhaps a data warehousing solution. Who knows, not enough information to really say.
But there are a lot of things you can do to satisfy a specific subset of queries with a good response time, while still allowing ad hoc queries that respond in "10-20 minutes".
Edit regarding comment:
I am not familiar with "GridSQL", or what it does.
If you send several identical SQL queries to individual "shard" databases, each containing a subset, then the simple selection query will scale to the network (i.e. you will eventually become network bound to the controller), as this is a truly parallel, stateless process.
The problem becomes, as you mentioned, the secondary processing, notably sorting and aggregates, as this can only be done on the final, "raw" result set.
That means that your controller ends up, inevitably, becoming your bottleneck and, in the end, regardless of how "scaled out" you are, you still have to contend with a data volume issue. If you send your query out to 1000 nodes and inevitably have to summarize or sort the 1000-row result set from each node, resulting in 1M rows, you still have a long result time and a large data processing demand on a single machine.
I don't know what database you are using, and I don't know the specifics about individual databases, but you can see how if you actually partition your data across several disk spindles, and have a decent, modern, multi-core processor, the database implementation itself can handle much of this scaling in terms of parallel disk spindle requests for you. Which implementations actually DO do this, I can't say. I'm just suggesting that it's possible for them to (and some may well do this).
But my general point is that if you are running aggregates, specifically, then you are likely processing too much data if you're hitting the raw sources each time. If you analyze your queries, you may well be able to "pre-summarize" your data at various levels of granularity to help avoid the data saturation problem.
For example, if you are storing individual web hits, but are more interested in activity based on each hour of the day (rather than the subsecond data you may be logging), summarizing to the hour of the day alone can reduce your data demand dramatically.
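A minimal Python sketch of that kind of pre-summarization is shown below. It assumes the raw hits are available as timestamps and rolls them up to one count per calendar hour; the function name `rollup_by_hour` and the sample data are made up for the example.

```python
from collections import Counter
from datetime import datetime

def rollup_by_hour(hit_times):
    """Count hits per hour by truncating each timestamp to the hour."""
    return Counter(
        t.replace(minute=0, second=0, microsecond=0) for t in hit_times
    )

# Hypothetical raw hits with sub-second precision.
hits = [
    datetime(2011, 3, 1, 14, 0, 5, 120000),
    datetime(2011, 3, 1, 14, 59, 59),
    datetime(2011, 3, 1, 15, 0, 1),
]
hourly = rollup_by_hour(hits)
for hour, count in sorted(hourly.items()):
    print(hour.isoformat(), count)
# prints:
# 2011-03-01T14:00:00 2
# 2011-03-01T15:00:00 1
```

Queries that only need hourly activity can then read the small summary instead of the raw hits, at the cost of keeping the summary up to date.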
So, scaling out can certainly help, but it may well not be the only solution to the problem, rather it would be a component. Data warehousing is designed to address these kinds of problems, but does not work well with "ad hoc" queries. Rather you need to have a reasonable idea of what kinds of queries you want to support and design it accordingly.
One huge table - can this be normalised at all?
If you are doing mostly select queries, have you considered either normalising to a data warehouse that you then query, or running analysis services and a cube to do your pre-processing for you?
From your question, what you are doing sounds like the sort of thing a cube is optimised for, and could be done without you having to write all the plumbing.
By trying a custom solution (grid) you introduce a lot of complexity. Maybe it's your only solution, but did you first try partitioning the table (the native solution)?
I'd seriously be looking into an OLAP solution. The trick with the cube is that once built it can be queried in lots of ways that you may not have considered. And as @HLGEM mentioned, have you addressed indexing?
Even with millions of rows, a good search should be logarithmic, not linear. If you have even one query which results in a scan then your performance will be destroyed. We might need an example of your structure to see if we can help more.
I also agree fully with @Mason: have you profiled your query and investigated the query plan to see where your bottlenecks are? Adding nodes improving speed makes me think that your query might be CPU bound.
Are you using all of the features of GridSQL? You can also use constraint exclusion partitioning, effectively breaking out your big table into several smaller tables. Depending on your WHERE clause, when the query is processed it may look at a lot less data and return results much faster.
Also, are you using multiple logical nodes per physical server? Configuring it that way can take advantage of otherwise idle cores.
If you monitor the servers during execution, is the bottleneck IO or CPU?
Also alluded to here is that you may want to roll up rows in your fact table into summary tables/cubes. I do not know enough about Tableau: will it automatically use the appropriate cube and drill down only when necessary? If so, it seems like you would get big gains doing something like this.
My guess (based on nothing but my gut) is that any gains you might see from parallelization will be eaten up by reaggregation and subsequent queries of the results. Further, I would think that writing might get more complicated with pk/fk/constraints. If this were my world, I would probably create many indexed views on top of my table (and other views) that optimized for the particular queries I need to execute (which I have worked with successfully on 10million+ row tables.)
If you run the incoming query, unpartitioned, on each node, why will any node finish before a single node running the same query would finish? Am I misunderstanding your execution plan?
I think this is, in part, going to depend on the nature of the queries you're executing and, in particular, how many rows contribute to the final result set. But surely you'll need to partition the query somehow among the nodes.
Your method to scale out queries works fine.
In fact, I've implemented such a method in:
It uses a parser, but it supports most SQL constructs.
It doesn't yet support count(distinct expr) but this is doable and I plan to add support in the future.
I also have a tool called Flexviews (google for flexviews materialized views)
This tool lets you create materialized views (summary tables) which include various aggregate functions and joins.
Those tools combined together can yield massive scalability improvements for OLAP type queries.