I was wondering why Dataflow does not support 'SortByKey' like Apache Spark.
I have a huge table in BigQuery that I cannot sort because "Order By" is not scalable. So I was thinking of moving the output of BigQuery to Dataflow and sorting it there. But there is no SortByKey, and it seems I have to write a combiner.
Any suggestions will be appreciated.
Sorting (especially by key) requires globally serial processing, which is not a scalable operation. Apache Beam / Dataflow does not provide such support, as it is frequently unnecessary.
There are a variety of alternatives that generally address the need more scalably. For instance, you can sort the values within each key, which allows each key to be processed in parallel. Another common use case is TopN either globally or per-key. Again, this can be supported much more efficiently than actually sorting.
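To make the two alternatives concrete, here is a minimal plain-Python sketch (not Beam or Dataflow API; the function names are made up for illustration). It sorts values within each key independently and computes a top-N with a bounded heap instead of a full sort.

```python
import heapq
from collections import defaultdict


def sort_values_per_key(pairs):
    """Group (key, value) pairs and sort the values of each key independently."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Each key's list is sorted on its own, so keys could be handled in parallel.
    return {key: sorted(values) for key, values in grouped.items()}


def top_n(values, n):
    """Return the n largest values without sorting the whole input."""
    return heapq.nlargest(n, values)


def top_n_per_key(pairs, n):
    """Return the n largest values for each key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: heapq.nlargest(n, values) for key, values in grouped.items()}


pairs = [("a", 3), ("b", 7), ("a", 1), ("b", 2), ("a", 9)]
sorted_per_key = sort_values_per_key(pairs)
global_top_2 = top_n([value for _, value in pairs], 2)
top_1_per_key = top_n_per_key(pairs, 1)
```

In a distributed runner the per-key work would be spread across workers; the sketch only shows why no global ordering is needed for either result.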
Could you elaborate on what you need to sort by and why? It would make it possible to identify options for implementing this within the Beam and Dataflow SDKs.
Could you please bring some examples?
First, find a proper definition here for "Filter Pushdown":
One way to prevent loading data that is not actually needed is filter pushdown (sometimes also referred to as predicate pushdown), which enables the execution of certain filters at the data source before it is loaded to an executor process. This becomes even more important if the executors are not on the same physical machine as the data.
Note that:
In many cases, filter pushdown is automatically applied by Spark without explicit commands or input from the user. But in certain cases, users have to provide specific information or even implement certain functionality themselves, especially when creating custom data sources, i.e. for unsupported database types or unsupported file types.
Now, you can find a simple example from Databricks. It shows that the order of select and filter can be optimized to improve query execution performance.
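As a rough illustration of the idea (plain Python, not Spark; `InMemorySource` and its `rows_transferred` counter are hypothetical names invented for this sketch), compare filtering after loading with handing the filter to the source:

```python
class InMemorySource:
    """Hypothetical data source that counts how many rows leave it."""

    def __init__(self, rows):
        self._rows = rows
        self.rows_transferred = 0

    def scan(self, predicate=None):
        """Yield rows; if a predicate is given, apply it before handing rows over."""
        for row in self._rows:
            if predicate is None or predicate(row):
                self.rows_transferred += 1
                yield row


def is_german(row):
    return row["country"] == "DE"


rows = [{"id": i, "country": "DE" if i % 10 == 0 else "US"} for i in range(100)]

# Without pushdown: every row is loaded, then filtered on the "executor" side.
no_pushdown = InMemorySource(rows)
result_a = [row for row in no_pushdown.scan() if is_german(row)]

# With pushdown: the source evaluates the filter, so only matching rows are loaded.
with_pushdown = InMemorySource(rows)
result_b = list(with_pushdown.scan(predicate=is_german))
```

Both variants return the same rows; the difference is that the first transfers all 100 rows out of the source, while the second transfers only the 10 that match.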
Yugabyte seems to support Redis, Cassandra and SQL queries. Do they work with each other? For example, can I write data with Cassandra API and later perform SQL queries against them?
These APIs do not work with each other as is, meaning you would not be able to query YCQL data from YSQL. This is because not all data types are present in the other APIs, and they often have different semantics.
That said, we get asked this a lot and the plan is to enable this scenario using a foreign data wrapper. So, in effect, you would be able to "import" the YCQL table into the YSQL side and use it there. Note that PostgreSQL already has a bunch of these wrappers (for example, see this generic list of PG FDWs here - it has entries for Cassandra and Redis). The idea is to re-use/enhance these and get them to work out of the box.
If you're interested, please open a GitHub issue and we can continue there. Would love to understand your use-case better to make sure we are able to address it and work with you closely on this.
Apologies if the title made absolutely no sense. But, on the other hand, I would like to know if there is any programming model which would let us use Infinispan cache as a real datastore and not just a grid on top of an underlying RDBMS.
I know Key-Value stores have real limitations but I couldn't stop thinking about the possibilities of an in-memory solution with all or a subset of RDBMS functionalities. For example: If I want to retrieve a particular set of Keys based on value>34.56%, just like how we would use a where clause in an sql stmt.
My doubt is not specific to infinispan but any IMKVS for that purpose. I know it's a shot in the dark considering the data structures and algorithms behind IMKVS specifications.
Any help or direction to resources which talk about these lines would be of great help.
I suggest you write down all the queries that you execute against SQL DB and check if these could be translated into KVS language.
In Infinispan you can index the values and execute queries for such filtering, but you can't do any table joins.
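To show what indexing the values buys you, here is a toy sketch in plain Python. It is not Infinispan's API; `IndexedKVStore` and its methods are hypothetical. A sorted secondary index on the values turns a "where value > 34.56" filter into a binary search instead of a full scan.

```python
import bisect


class IndexedKVStore:
    """Toy key-value store with a sorted secondary index on the values."""

    def __init__(self):
        self._data = {}
        # Parallel lists kept sorted by value: _values[i] belongs to _keys[i].
        self._values = []
        self._keys = []

    def put(self, key, value):
        if key in self._data:
            self._remove_from_index(key, self._data[key])
        self._data[key] = value
        pos = bisect.bisect_right(self._values, value)
        self._values.insert(pos, value)
        self._keys.insert(pos, key)

    def get(self, key):
        return self._data[key]

    def _remove_from_index(self, key, value):
        # Several keys may share a value, so walk forward to the right entry.
        pos = bisect.bisect_left(self._values, value)
        while self._keys[pos] != key:
            pos += 1
        del self._values[pos]
        del self._keys[pos]

    def keys_where_value_greater_than(self, threshold):
        """Rough equivalent of: SELECT key FROM store WHERE value > threshold."""
        start = bisect.bisect_right(self._values, threshold)
        return self._keys[start:]


store = IndexedKVStore()
for key, value in [("a", 12.5), ("b", 34.56), ("c", 40.0), ("d", 99.9)]:
    store.put(key, value)
matching_keys = store.keys_where_value_greater_than(34.56)
```

Joins are a different matter: nothing in this structure relates one entry to another, which is why that part of SQL does not carry over.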
If you need a more powerful API, specifically one using JPA, take a look at Hibernate OGM.
And while KVSs offer some level of reliability, in practice I wouldn't trust the documentation too much. You need to perform extensive testing of your system and check that you can retrieve the data even after various types of crashes and network failures (or that you can live with throwing the data away).
I am working on a project which involves collecting dynamic form data. These forms are user-defined (think surveymonkey) and thus a fixed schema cannot be defined for them. Data in terms of questions/answers would be retrieved for these forms and then stored into the database. Reporting/Searching on these answers (filtering and aggregation) is of utmost importance. There are two approaches which are feasible.
Use a SQL database and store each field's data as a separate row. Reporting/Searching is then done via SQL. My apprehension is that it would result in complicated joins for reporting.
Use a NoSQL database like MongoDB. This seems to be a perfect fit for storing the dynamic data since it is schema-less. However, I am not sure how good its reporting capabilities are.
It seems easier for target users to learn SQL than to define map/reduce queries. How easy would it be to build a UI for reporting/searching over MongoDB?
Simple things like: a list of users who gave a particular set of answers, how many such users there were over a period of time, etc.
Thanks,
Pulkit
It's already been mentioned in the comments, but I'll re-iterate that you should look at Mongo's map/reduce functionality for reporting and the aggregation framework.
Having done map/reduce in both Couch and Mongo I can say that they are very similar. It's definitely a barrier to entry for a developer that isn't familiar with it, but once you get a few working examples, it's not too bad.
Consider that Mongo can output a map/reduce job to a collection, which I've found to be really useful. This means you can schedule the jobs and run them periodically and output to a place that you can then report on. It's not that hard to create a framework that lets developers write simple Javascript map and reduce functions and then plug them in to be run on a schedule.
The aggregation framework is much easier to understand for a developer coming from SQL. Still a learning curve, but not as bad as map/reduce. It is much better suited to ad-hoc reporting queries and there is nothing comparable in Couch.
You could maybe make a reporting UI that maps to the aggregation framework, but I wouldn't try to do something similar for map/reduce queries.
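For a sense of what such a UI would have to generate, here is a minimal sketch of a pipeline builder for the "users who gave a particular set of answers over a period of time" report. The document shape, field names and function name are assumptions made up for this example; only the stage and operator names (`$match`, `$all`, `$elemMatch`, `$group`, `$count`) are MongoDB's, and `$count` needs a reasonably recent server version.

```python
from datetime import datetime


def users_with_answers_pipeline(form_id, required_answers, start, end):
    """Build a pipeline counting distinct users who gave all required answers.

    Assumes hypothetical documents shaped like:
    {"form_id": ..., "user_id": ..., "submitted_at": <date>,
     "answers": [{"question": "q1", "value": "yes"}, ...]}
    """
    return [
        {"$match": {
            "form_id": form_id,
            "submitted_at": {"$gte": start, "$lt": end},
            # Every required question/answer pair must appear in the array.
            "answers": {"$all": [
                {"$elemMatch": {"question": question, "value": value}}
                for question, value in required_answers.items()
            ]},
        }},
        # One group per user collapses multiple submissions by the same user.
        {"$group": {"_id": "$user_id"}},
        {"$count": "users"},
    ]


pipeline = users_with_answers_pipeline(
    "survey-1",
    {"q1": "yes", "q2": "blue"},
    datetime(2013, 1, 1),
    datetime(2013, 2, 1),
)
```

With a driver such as pymongo, the resulting list would be passed to a collection's `aggregate` method. Because the pipeline is plain data, a reporting UI only has to assemble dictionaries from the user's selections.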
I am going to store nginx logs in either SimpleDB or Bigtable.
I want to know if SimpleDB or Bigtable supports regular expression queries (like in MongoDB).
Regarding Bigtable, it does support regex. I believe that #timmers correctly assumed that the poster was referring to AppEngine storage since Cloud Bigtable wasn't available in 2011, but now that Cloud Bigtable is publicly available, I want to make sure that people that search for this know that it is supported: https://googleapis.dev/java/google-cloud-clients/latest/com/google/cloud/bigtable/data/v2/models/Filters.ValueFilter.html#regex-java.lang.String-
Simple answer here is no for either AppEngine or SimpleDB.
AppEngine's queries are relatively restricted (see the AppEngine Java query documentation) and can only filter with the following operators:
Query.FilterOperator.LESS_THAN
Query.FilterOperator.LESS_THAN_OR_EQUAL
Query.FilterOperator.EQUAL
Query.FilterOperator.GREATER_THAN
Query.FilterOperator.GREATER_THAN_OR_EQUAL
Query.FilterOperator.NOT_EQUAL
Query.FilterOperator.IN (equal to any of the values in the provided list)
SimpleDB is slightly more sophisticated in its queries, but only stretches as far as the old-style SQL LIKE operator (see the Amazon SimpleDB query documentation), which can take a '%' before or after some text to allow starts-with or ends-with matching.
With either product, the intended usage pattern for queries that were not anticipated ahead of time is to perform a map-reduce type operation on the data and apply the regexp filter to the resulting dataset at the application level, rather than attempt to do it inside the DB.
Alternatively, if you know your regexps up front, you could pre-apply them and store the results in whichever datastore you choose.
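Both approaches can be sketched in a few lines of plain Python. The pattern names and sample log lines below are invented for illustration, and no datastore client is involved: `tag_log_line` pre-applies known regexps so the results can be stored as ordinary attributes and queried with equality filters, while `filter_at_application_level` is the fallback for unanticipated patterns.

```python
import re

# Hypothetical patterns known up front; the names are illustrative only.
PATTERNS = {
    "is_server_error": re.compile(r'" 5\d\d '),
    "is_bot": re.compile(r"(?i)bot|crawler|spider"),
    "is_api_call": re.compile(r'"(?:GET|POST|PUT|DELETE) /api/'),
}


def tag_log_line(line):
    """Return a record with one boolean attribute per known pattern."""
    record = {"raw": line}
    for name, pattern in PATTERNS.items():
        record[name] = bool(pattern.search(line))
    return record


def filter_at_application_level(lines, regex):
    """Fallback for unanticipated queries: filter already-fetched lines in the app."""
    compiled = re.compile(regex)
    return [line for line in lines if compiled.search(line)]


lines = [
    '1.2.3.4 - - [10/Oct/2011:13:55:36 +0000] "GET /api/users HTTP/1.1" 200 512 "-" "curl/7.21"',
    '5.6.7.8 - - [10/Oct/2011:13:55:37 +0000] "GET /index.html HTTP/1.1" 500 0 "-" "Googlebot/2.1"',
]
records = [tag_log_line(line) for line in lines]
googlebot_lines = filter_at_application_level(lines, r"Googlebot/\d")
```

The trade-off is the usual one: pre-computed attributes are cheap to query but only cover patterns you thought of in advance, whereas application-level filtering handles anything at the cost of reading the data out first.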