Can Bloom Filters in BigTable be used to filter based only on row ID? - bigtable

BigTable uses Bloom filters to let point reads avoid accessing SSTables that do not contain any data for a given row-column pair. Can these Bloom filters also be used to avoid accessing SSTables when a query specifies only the row ID and no column ID?
BigTable uses row-column pairs as keys when inserting into its Bloom filters, which means a point read that specifies a row-column pair can use these filters.
Now, suppose we have a query to get all columns of a row based only on the row ID. As far as I can tell, such a query does not know in advance which columns belong to the row, so it cannot enumerate the possible row-column pairs to probe the Bloom filters, and it would therefore be less efficient.
In theory, BigTable could already address this by also inserting just the row ID into the Bloom filters, but I can't tell whether the current implementation does this.
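To make what I mean concrete, here is a toy sketch of the concern (a plain Python set stands in for the Bloom filter's membership test, and the key layout is made up for illustration; this is not BigTable's actual implementation):
# Toy illustration only: a set stands in for a Bloom filter's membership test.
sstable_filter = set()

def index_cell(row_id, column_id):
    # BigTable-style keying: the filter is keyed on the row-column pair.
    sstable_filter.add((row_id, column_id))

def may_contain_cell(row_id, column_id):
    # Point read with both row and column: the filter can be probed.
    return (row_id, column_id) in sstable_filter

def may_contain_row(row_id):
    # Row-only read: we cannot enumerate the row's columns, so we cannot form
    # the row-column keys to probe. We must assume the SSTable may contain the
    # row, unless row IDs were also inserted on their own.
    return True

index_cell("user#42", "profile:name")
print(may_contain_cell("user#42", "profile:name"))   # True
print(may_contain_cell("user#42", "profile:email"))  # False -> SSTable can be skipped
print(may_contain_row("user#42"))                    # True -> SSTable must be read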
This question matters for designing efficient queries to run on BigTable. Any hints would be wonderful.

HBase Bloom filters support both row and row-column checks. HBase was built based on the BigTable paper, so BigTable most probably does the same.
HBase Bloom Filter is a space-efficient mechanism to test whether a StoreFile contains a specific row or row-col cell.
Reference: https://learning.oreilly.com/library/view/hbase-administration-cookbook/9781849517140/ch09s11.html
The BigTable paper from 2006, however, mentions only row-column lookups via the Bloom filters.
https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf

Related

Search for a large number of IDs in MongoDB

Thanks for looking at my query. I have 20k+ unique identification IDs provided by a client, and I want to look up all of these IDs in MongoDB using a single query. I tried using $in, but it does not seem feasible to put all 20k IDs into $in and search. Is there a better way of achieving this?
If the id field is indexed, an $in query should be very fast. However, I don't think it is a good idea to run a single query with 20k IDs, as it may consume quite a lot of resources such as memory. You can split the IDs into groups of a reasonable size and run the queries separately, and you can still run them in parallel at the application level.
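For example, a rough sketch of that batching approach with PyMongo (the connection string, database, collection, and field names are assumptions):
from concurrent.futures import ThreadPoolExecutor
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # connection string is an assumption
coll = client["mydb"]["mycoll"]                    # database/collection names are assumptions

def chunks(ids, size=1000):
    # Split the full id list into batches of a reasonable size.
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def find_batch(batch):
    # One $in query per batch; assumes "id" is the indexed field.
    return list(coll.find({"id": {"$in": batch}}))

all_ids = []  # the 20k+ ids provided by the client go here
with ThreadPoolExecutor(max_workers=4) as pool:
    results = [doc for batch_docs in pool.map(find_batch, chunks(all_ids))
               for doc in batch_docs]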
Alternatively, consider importing your 20k+ IDs into a collection (say, using mongoimport). Then perform a $lookup from your root collection to that search collection. Depending on whether the $lookup result is an empty array or not, you can proceed with the operation you originally wanted to gate with $in.
Here is a Mongo playground for your reference.
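A rough sketch of that $lookup approach with PyMongo (the database, collection, and field names are assumptions; it assumes the IDs were imported into a "search_ids" collection as documents shaped like {"id": <value>}):
from pymongo import MongoClient

db = MongoClient()["mydb"]  # database name is an assumption
pipeline = [
    # Join each root document against the imported id list.
    {"$lookup": {
        "from": "search_ids",
        "localField": "id",
        "foreignField": "id",
        "as": "matched",
    }},
    # Keep only root documents whose id appears in the imported list.
    {"$match": {"matched": {"$ne": []}}},
]
for doc in db["root_collection"].aggregate(pipeline):
    # proceed with whatever the original $in-based operation needed
    print(doc["id"])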

Why do Bloom filters not work here? Please tell me

I have two tables: the original, and a new one with Bloom filters. The Bloom filter was created for an int column (CLUSTERED BY and 'orc.bloom.filter.columns'). In HDFS, within the partition, I see that the number of files equals the number of unique values in the column. But when I query these tables (select min(...) from table where id = ...), the queries finish in the same amount of time, and in the job logs and in 'explain analyze' I see no use of a Bloom filter; the query reads the entire partition. What else needs to be configured so that the Bloom filter works, queries run faster, and only the one file with the desired id is read instead of every file in the partition?
Bloom filters do not help in all cases.
ORC contains indexes at the file level, stripe level, and row level (every 10,000 rows, configurable). If predicate push-down (PPD) is configured, these indexes (min/max values) can be used to skip reading files (the footer is read anyway), and stripes can be skipped as well. These indexes are useful for filtering sortable, sequential values and for range queries, such as an integer id. For the indexes to be effective you should sort the data by the index keys when inserting; an unsorted index is not efficient because every stripe can contain every key.
Sorting during insert can be expensive.
Having only the indexes is enough in most cases.
Bloom filters are structures that can tell you, with 100 percent certainty, that a key is not present in the dataset (they produce no false negatives, only false positives).
Bloom filters are efficient for equality queries, especially for non-sequential, unsorted values like GUIDs. MIN/MAX indexes do not work well for such values, so filtering by a specific GUID should be very efficient with a Bloom filter.
For sortable, sequential values like an integer id, the min/max values stored in the (sorted) ORC indexes are the better fit.
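As a rough illustration of that "definitely not present" guarantee, here is a toy Bloom filter in Python (a conceptual sketch only, not Hive's or ORC's actual implementation):
import hashlib

class ToyBloomFilter:
    # Toy illustration; real implementations tune the bit-array size and hash
    # count for a target false-positive rate.
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely not present" (skip the file/stripe);
        # True means "possibly present" (must read and check).
        return all(self.bits[pos] for pos in self._positions(key))

bf = ToyBloomFilter()
bf.add("3f2504e0-4f89-11d3-9a0c-0305e82c3301")                    # a GUID-like key
print(bf.might_contain("3f2504e0-4f89-11d3-9a0c-0305e82c3301"))   # True
print(bf.might_contain("some-other-guid"))                        # almost certainly False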

How can I reduce the amount of data scanned by BigQuery during a query?

Could someone please tell me and explain the correct answer to the following multiple choice question?
You have a query that filters a BigQuery table using a WHERE clause on timestamp and ID columns. By using bq query --dry_run you learn that the query triggers a full scan of the table, even though the filter on timestamp and ID selects a tiny fraction of the overall data. You want to reduce the amount of data scanned by BigQuery with minimal changes to existing SQL queries. What should you do?
Create a separate table for each ID.
Use the LIMIT keyword to reduce the number of rows returned.
Recreate the table with a partitioning column and clustering column.
Use the bq query --maximum_bytes_billed flag to restrict the number of bytes billed.
As far as I know, the only ways to limit the number of bytes read by BigQuery are removing column references entirely, removing table references, or partitioning (and clustering, in some cases).
One of the challenges when starting to use BigQuery is that a query like this:
select *
from t
limit 1;
can be really, really expensive.
However, a query like this:
select sum(x)
from t;
on the same table can be quite cheap.
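You can check this yourself with a dry run, which reports how many bytes a query would process without actually running it. A sketch using the google-cloud-bigquery Python client (the project, dataset, and table names are assumptions):
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

for sql in ("select * from `my_project.my_dataset.t` limit 1",
            "select sum(x) from `my_project.my_dataset.t`"):
    # A dry-run job only estimates the bytes that would be scanned.
    job = client.query(sql, job_config=config)
    print(sql, "->", job.total_bytes_processed, "bytes")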
To answer the question, you should learn more about how BigQuery bills for usage.
Assuming these are the only four possible answers, the answer is almost certainly "Recreate the table with a partitioning column and clustering column."
Let's eliminate the others:
Use the LIMIT keyword to reduce the number of rows returned.
This isn't going to help at all, since the LIMIT is only applied after a full table scan has already happened, so you'll still be billed the same, despite the limit.
Create a separate table for each ID.
This doesn't seem likely to help: in addition to being an organizational mess, you'd have to query every table to find all the right timestamps, and you'd process the same amount of data as before (but with a lot more work).
Use the bq query --maximum_bytes_billed flag to restrict the number of bytes billed.
You could do this, but the query would simply fail whenever it would bill more than that maximum, so you wouldn't get your results.
So why partitioning and clustering?
BigQuery (on-demand) billing is based on the columns that you select, and the amount of data that you read in those columns. So you want to do everything you can to reduce the amount of data processed.
Depending on the exact query, partitioning by the timestamp allows you to only scan the data for the relevant days. This can obviously be a huge savings compared to an entire table scan.
Clustering allows you to put commonly used data together within a table by sorting on the clustering column, so that BigQuery can skip scanning data that is irrelevant to the filter (WHERE clause). Thus, you scan less data and reduce your cost. There is a similar benefit for aggregations.
This of course all assumes you have a good understanding of the queries you are actually making and which columns make sense to cluster on.
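For reference, recreating the table with a partitioning column and a clustering column might look something like the following (a sketch; the project, dataset, table, and column names are assumptions). Existing queries then only need the table name updated:
from google.cloud import bigquery

client = bigquery.Client()
# Standard SQL DDL: partition on the timestamp column, cluster on the id column.
ddl = """
CREATE TABLE `my_project.my_dataset.events_new`
PARTITION BY DATE(event_timestamp)
CLUSTER BY id
AS SELECT * FROM `my_project.my_dataset.events`
"""
client.query(ddl).result()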

Bigtable design and querying with respect to number of column families

From Cloud Bigtable schema design documentation:
Grouping data into column families allows you to retrieve data from a single family, or multiple families, rather than retrieving all of the data in each row. Group data as closely as you can to get just the information that you need, but no more, in your most frequent API calls.
In my use case, I can group all the data into one single column family (currently the access pattern is to retrieve all fields), or group it into, say, 3 logical column families and specify those column families every time I query. Is there any performance difference between these two designs? Which design is recommended?
In your case, there isn't a performance advantage either way. I would use the 3 logical column families so that you have cleaner code.
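For illustration, a read that fetches just one family from a row might look like this with the google-cloud-bigtable Python client (the project, instance, table, row key, and family names are assumptions):
from google.cloud import bigtable
from google.cloud.bigtable import row_filters

client = bigtable.Client(project="my-project")  # project/instance/table are assumptions
table = client.instance("my-instance").table("my-table")

# Fetch only the "stats" column family for one row instead of every family.
row = table.read_row(b"row-key#123",
                     filter_=row_filters.FamilyNameRegexFilter("stats"))
if row is not None:
    for qualifier, cells in row.cells["stats"].items():
        print(qualifier, cells[0].value)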

cursorMark is stateless and how it solves deep paging

As specified here, cursorMark is stateless, but I don't get how it solves the deep paging issue if it is stateless. Does Solr store the indexed data sorted by the uniqueKey field? If so, that would clear up my confusion.
If I am wrong, please explain how cursorMark solves deep paging, because if cursorMark is stateless it still needs to sort and calculate the cursorMark on every request, which seems similar to start=#start-position.
From the link you referenced...
Cursors in Solr are a logical concept, that doesn't involve caching any state information on the server. Instead the sort values of the last document returned to the client are used to compute a "mark" representing a logical point in the ordered space of sort values. That "mark" can be specified in the parameters of subsequent requests to tell Solr where to continue.
This is elaborated further in an explanation of the constraints on using cursorMark...
Cursor mark values are computed based on the sort values of each document in the result, which means multiple documents with identical sort values will produce identical Cursor mark values if one of them is the last document on a page of results. In that situation, the subsequent request using that cursorMark would not know which of the documents with the identical mark values should be skipped. Requiring that the uniqueKey field be used as a clause in the sort criteria guarantees that a deterministic ordering will be returned, and that every cursorMark value will identify a unique point in the sequence of documents.
If this doesn't help clarify things for you, the next best explanation I can offer is to think of the cursorMark as an encoded filter telling Solr to skip all documents with values in the sort field(s) that come "before" (based on the asc/desc sort order) some specific values.
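As a concrete illustration of that "continue after this point" behavior, deep paging with cursorMark is typically a loop like the following (a sketch using Python and the requests library; the core name, fields, and sort clause are assumptions, and the sort must include the uniqueKey field):
import requests

base_url = "http://localhost:8983/solr/mycore/select"  # core name is an assumption
params = {
    "q": "*:*",
    "rows": 100,
    "sort": "price asc, id asc",  # must end with the uniqueKey field (here "id")
    "cursorMark": "*",            # "*" means start from the beginning
    "wt": "json",
}

while True:
    data = requests.get(base_url, params=params).json()
    for doc in data["response"]["docs"]:
        print(doc["id"])  # process each document
    next_mark = data["nextCursorMark"]
    if next_mark == params["cursorMark"]:
        break  # no more results
    # The mark encodes the sort values of the last document returned, so the
    # next request resumes after that point instead of re-skipping N rows.
    params["cursorMark"] = next_mark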