Does Scalding support record filtering via predicate pushdown with Parquet?

There are obvious speed benefits from not having to read records that would fail a filter. I see that Spark supports it, but I haven't found any documentation on how to do it with Scalding.

Unfortunately, there is no support for this in scalding-parquet yet. We at Tapad have started working on implementing predicate support in Scalding. Once we get something working, we'll share it.
We have implemented our own ParquetAvroSource that can read/store Avro records in Parquet. It is possible to use column projection and read only the columns/fields required by a Scalding job. In some cases, jobs using this feature read only 1% of the input bytes.
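For readers who haven't used column projection: conceptually it looks like the following Python sketch using pyarrow (Scalding itself is Scala; the file and column names below are made up for illustration):

    import pyarrow.parquet as pq

    # Column projection: Parquet stores each column separately, so only
    # the named columns are read from disk; everything else is skipped.
    table = pq.read_table("events.parquet", columns=["user_id", "timestamp"])
    print(table.num_rows)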

Predicate pushdown was added to Scalding, but it is not documented yet.
For more details, see Scalding issue #1089.
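To illustrate what the feature buys you, here is a minimal predicate-pushdown sketch in Python with pyarrow rather than Scalding (the file, column, and value are hypothetical):

    import pyarrow.parquet as pq

    # Predicate pushdown: the filter is checked against row-group
    # statistics, so row groups that cannot match are never read.
    table = pq.read_table(
        "events.parquet",
        filters=[("event_type", "=", "click")],
    )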

Related

What is filter pushdown optimization?

Could you please provide some examples?
First, here is a proper definition of "filter pushdown":
One way to prevent loading data that is not actually needed is filter pushdown (sometimes also referred to as predicate pushdown), which enables the execution of certain filters at the data source before it is loaded to an executor process. This becomes even more important if the executors are not on the same physical machine as the data.
Note that:
In many cases, filter pushdown is automatically applied by Spark without explicit commands or input from the user. But in certain cases, users have to provide specific information or even implement certain functionality themselves, especially when creating custom data sources, i.e. for unsupported database types or unsupported file types.
Now, you can find a simple example from Databricks. In this example, you can see that the order of select and filter operations can be optimized with regard to query execution performance.
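For a concrete picture, here is a minimal PySpark sketch (the input path and column names are made up); a filter that was pushed down shows up as PushedFilters in the physical plan:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

    events = spark.read.parquet("events.parquet")  # hypothetical input
    # Spark pushes this filter down to the Parquet reader, so it runs
    # at the data source instead of after loading into executors.
    clicks = events.filter(events.event_type == "click").select("user_id", "ts")
    clicks.explain()  # look for "PushedFilters" in the printed plan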

Can I take advantage of Yugabyte's compatibility?

Yugabyte seems to support Redis, Cassandra, and SQL queries. Do they work with each other? For example, can I write data with the Cassandra API and later perform SQL queries against it?
These APIs do not work with each other as is, meaning you would not be able to query YCQL data from YSQL. This is because the data types are not always present in the other API, and they often have different semantics.
That said, we get asked this a lot and the plan is to enable this scenario using a foreign data wrapper. So, in effect, you would be able to "import" the YCQL table into the YSQL side and use it there. Note that PostgreSQL already has a bunch of these wrappers (for example, see this generic list of PG FDWs here - it has entries for Cassandra and Redis). The idea is to re-use/enhance these and get them to work out of the box.
If you're interested, please open a GitHub issue and we can continue there. Would love to understand your use-case better to make sure we are able to address it and work with you closely on this.

How to use pandas in Apache Beam?

How to implement pandas in Apache Beam?
I cannot perform a left join on multiple columns, and PCollections do not support SQL queries. Even the Apache Beam documentation is not properly framed. I checked but couldn't find any kind of pandas implementation in Apache Beam.
Can anyone direct me to a relevant link?
There's some confusion going on here.
pandas is "supported", in the sense that you can use the pandas library the same way you'd be using it without Apache Beam, and the same way you can use any other library from your Beam pipeline as long as you specify the proper dependencies. It is also "supported" in the sense that it is bundled as a dependency by default so you don't have to specify it yourself. For example, you can write a DoFn that performs some computation using pandas for every element; a separate computation for each element, performed by Beam in parallel over all elements.
It is not supported in the sense that Apache Beam currently provides no special integration with it, e.g. you can't use a PCollection as a pandas dataframe, or vice versa. A PCollection does not physically contain any data (this should be particularly clear for streaming pipelines) - it is just a placeholder node in Beam's execution plan.
That said, a pandas-like API for working with Beam PCollections would certainly be a good idea, and would simplify learning Beam for many existing pandas users, but I don't think anybody is working on implementing this currently. However, the Beam community is currently discussing the idea of adding schemas to PCollections, which is a step in this direction.
As well as using pandas directly from DoFns, Beam now has an API to manipulate PCollections as DataFrames. See https://s.apache.org/simpler-python-pipelines-2020 for more details.
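A minimal sketch of that DataFrame API, assuming a recent Beam Python SDK (the words and counts are illustrative):

    import apache_beam as beam
    from apache_beam.dataframe.convert import to_dataframe

    with beam.Pipeline() as pipeline:
        rows = pipeline | beam.Create([
            beam.Row(word="hi", count=1),
            beam.Row(word="hi", count=2),
            beam.Row(word="bye", count=3),
        ])
        # to_dataframe wraps a schema-aware PCollection in a deferred,
        # pandas-like dataframe; operations are added to the pipeline lazily.
        df = to_dataframe(rows)
        totals = df.groupby("word").sum()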
pandas is supported in the Dataflow SDK for Python 2.x. As of writing, workers have pandas v0.18.1 pre-installed, so you should not have any issue with that. Stack Overflow does not accept questions that ask the community to point you to external documentation and/or tutorials, so maybe you should first try an implementation yourself, and then come back with more information about what is or isn't failing and what you achieved before stumbling on an error.
In any case, if what you want to achieve is a left join, you can also have a look at the CoGroupByKey transform, which is documented in the Apache Beam documentation. It is used to perform relational joins of several PCollections with a common key type. On that same page, you will find some examples that use CoGroupByKey and ParDo to join the contents of several data objects.
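Here is a minimal left-join sketch with CoGroupByKey (keys and values are made up); left rows with no match on the right are paired with None:

    import apache_beam as beam

    def emit_left_join(element):
        key, grouped = element
        rights = list(grouped["right"]) or [None]  # keep unmatched left rows
        for left_value in grouped["left"]:
            for right_value in rights:
                yield (key, left_value, right_value)

    with beam.Pipeline() as pipeline:
        left = pipeline | "left" >> beam.Create([("a", 1), ("b", 2)])
        right = pipeline | "right" >> beam.Create([("a", "x")])
        ({"left": left, "right": right}
         | beam.CoGroupByKey()
         | beam.FlatMap(emit_left_join)
         | beam.Map(print))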

Why are aggregate functions like group_by not supported in Hibernate Search?

Why are aggregate functions like group_by not supported in Hibernate Search?
I have a use case where I need to fetch results after applying a group by in the query.
There is no technical reason, if this is what you mean. We could probably add it, but there simply wasn't enough demand for this feature to make it to the top of our priority list.
If you want to see a feature added to Hibernate Search, feel free to create a ticket on our JIRA instance, describing in details your use case and the API you would expect.
Note that I am not 100% sure we would implement it for the Lucene backend, since that would probably require a lot of effort. But for people using Elasticsearch behind Hibernate Search, we may at least introduce ways to use Elasticsearch's aggregation support from within Hibernate Search. We are currently experimenting with Hibernate Search 6 and trying this is on my checklist.
In the meantime, if you want us to suggest alternatives, please provide more details about your use case: domain model, mapping, fields you would like to aggregate as part of your "group by"...
Why it's missing
The primary reason for this not to be supported by Hibernate Search is that no one ever asked for it or contributed it.
Another reason is that the results would be "groups of entities", while the FullTextQuery API returns a List of entities, so this would need a new API specifically to run such queries.
How to get it added
We could build that, but if there is not much interest in the feature, it might not be worth the maintenance work.
If you need such a feature, I suggest you open an issue on the Hibernate Search issue tracker so that other people can also vote or express interest in it. Ideally, someone needing it, like yourself, might be willing to create a patch or at least start a proof of concept.
Alternatives
Until Hibernate Search provides direct support for it, you can still run such queries yourself. See "Using IndexReaders directly" in the documentation for how to work on the Lucene index.
Using IndexReaders, you can always read and search the Lucene index using any advanced feature for which Hibernate Search doesn't provide an API.

Do Google Bigtable and Amazon SimpleDB support regular expressions?

I am going to store nginx logs in either SimpleDB or Bigtable.
I want to know whether SimpleDB or Bigtable supports regular-expression queries (like MongoDB does).
Regarding Bigtable: it does support regex. I believe that #timmers correctly assumed that the poster was referring to AppEngine storage, since Cloud Bigtable wasn't available in 2011, but now that Cloud Bigtable is publicly available, I want to make sure that people who search for this know that it is supported: https://googleapis.dev/java/google-cloud-clients/latest/com/google/cloud/bigtable/data/v2/models/Filters.ValueFilter.html#regex-java.lang.String-
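For example, with the Cloud Bigtable Python client you can attach a value-regex filter to a read (the project, instance, and table names below are placeholders):

    from google.cloud import bigtable
    from google.cloud.bigtable import row_filters

    client = bigtable.Client(project="my-project")              # placeholder
    table = client.instance("my-instance").table("nginx-logs")  # placeholder

    # Bigtable filters use RE2 syntax and match raw cell bytes.
    regex = row_filters.ValueRegexFilter(b"GET /api/.*")
    for row in table.read_rows(filter_=regex):
        print(row.row_key)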
Simple answer here is no for either AppEngine or SimpleDB.
AppEngine's queries are relatively restricted (see the AppEngine Java query documentation) and can only filter with the following operators:
Query.FilterOperator.LESS_THAN
Query.FilterOperator.LESS_THAN_OR_EQUAL
Query.FilterOperator.EQUAL
Query.FilterOperator.GREATER_THAN
Query.FilterOperator.GREATER_THAN_OR_EQUAL
Query.FilterOperator.NOT_EQUAL
Query.FilterOperator.IN (equal to any of the values in the provided list)
SimpleDB is slightly more sophisticated in its queries, but only stretches as far as an old-style SQL LIKE (see the Amazon SimpleDB query documentation), which can take a '%' before/after some text to allow a starts-with or ends-with type of operation.
With either product, if you need to perform queries that were not anticipated ahead of time, the intended usage pattern is to perform a map-reduce-type operation on the data and apply the regexp filter over the resulting dataset at the application level, rather than attempting to do it inside the DB. A minimal sketch of that application-level filtering follows.
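A minimal Python sketch of that application-level filtering (the log pattern is hypothetical):

    import re

    # Match nginx access-log lines for 5xx responses to /api endpoints.
    pattern = re.compile(r'"GET /api/\S*" 5\d\d')

    def filter_rows(rows):
        # rows: raw log lines already fetched from the datastore
        return [line for line in rows if pattern.search(line)]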
Alternatively, if you know your regexps up front, you could pre-apply them and store the results in whichever datastore.