Index in Parquet

I would like to be able to do a fast range query on a Parquet table. The amount of data to be returned is very small compared to the total size, but because a full column scan has to be performed it is too slow for my use case.
Using an index would solve this problem, and I read that this was to be added in Parquet 2.0. However, I cannot find any other information on it, so I am guessing it was not. I do not think there would be any fundamental obstacles preventing the addition of (multi-column) indexes, if the data were sorted, which in my case it is.
My question is: when will indexes be added to Parquet, and what would be the high level design for doing so? I think I would already be happy with an index that points out the correct partition.
Kind regards,
Sjoerd.

Update Dec/2018:
Parquet Format version 2.5 added column indexes.
https://github.com/apache/parquet-format/blob/master/CHANGES.md#version-250
See https://issues.apache.org/jira/browse/PARQUET-1201 for the list of sub-tasks for that new feature.
Notice that this feature has only just been merged into the Parquet format itself; it will take some time for the different backends (Spark, Hive, Impala, etc.) to start supporting it.
This new feature is called Column Indexes. Basically, Parquet has added two new structures to the Parquet layout: the Column Index and the Offset Index.
Below is a more detailed technical explanation of what they solve and how.
Problem Statement
In the current format, Statistics are stored for ColumnChunks in ColumnMetaData and for individual pages inside DataPageHeader structs. When reading pages, a reader has to process the page header in order to determine whether the page can be skipped based on the statistics. This means the reader has to access all pages in a column, thus likely reading most of the column data from disk.
Goals
Make both range scans and point lookups I/O efficient by allowing direct access to pages based on their min and max values. In particular:
A single-row lookup in a row group based on the sort column of that row group will only read one data page per retrieved column.
Range scans on the sort column will only need to read the exact data pages that contain relevant data.
Make other selective scans I/O efficient: if we have a very selective predicate on a non-sorting column, for the other retrieved columns we should only need to access data pages that contain matching rows.
No additional decoding effort for scans without selective predicates, e.g., full row group scans. If a reader determines that it does not need to read the index data, it does not incur any overhead.
Index pages for sorted columns use minimal storage by storing only the boundary elements between pages.
Non-Goals
Support for the equivalent of secondary indices, i.e., an index structure sorted on the key values over non-sorted data.
Technical Approach
We add two new per-column structures to the row group metadata:
ColumnIndex: this allows navigation to the pages of a column based on column values and is used to locate data pages that contain matching values for a scan predicate
OffsetIndex: this allows navigation by row index and is used to retrieve values for rows identified as matches via the ColumnIndex. Once rows of a column are skipped, the corresponding rows in the other columns have to be skipped. Hence the OffsetIndexes for each column in a RowGroup are stored together.
The new index structures are stored separately from the RowGroup data, near the footer, so that a reader does not have to pay the I/O and deserialization cost of reading them if it is not doing selective scans. The index structures' location and length are stored in ColumnChunk and RowGroup.
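To make this concrete, here is a toy sketch in plain Java (simplified lists rather than the real parquet-format Thrift structures or any actual reader API) of how a reader can combine the per-page min/max values of a ColumnIndex with the per-page file offsets of an OffsetIndex to touch only the pages that can satisfy a range predicate:

import java.util.List;

// Hypothetical, simplified stand-ins for ColumnIndex (per-page min/max)
// and OffsetIndex (per-page byte offset); only meant to illustrate the idea.
public class PageSkippingSketch {

    static void rangeScan(List<Long> pageMins, List<Long> pageMaxs,
                          List<Long> pageOffsets, long lo, long hi) {
        for (int page = 0; page < pageMins.size(); page++) {
            // A page can only contain matches if its [min, max] overlaps [lo, hi].
            if (pageMaxs.get(page) >= lo && pageMins.get(page) <= hi) {
                System.out.println("read page " + page + " at byte offset "
                        + pageOffsets.get(page));
            }
            // Otherwise the page is never read from disk at all.
        }
    }

    public static void main(String[] args) {
        // Three pages of a column the row group is sorted on.
        rangeScan(List.of(0L, 100L, 200L),       // per-page minimums
                  List.of(99L, 199L, 299L),      // per-page maximums
                  List.of(4096L, 8192L, 12288L), // per-page byte offsets
                  120, 130);                     // predicate: BETWEEN 120 AND 130
        // Prints only: read page 1 at byte offset 8192
    }
}

On a sorted column, at most a few pages overlap any given range, which is exactly the behaviour described in the goals above.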
Cloudera's Impala team ran some tests on this new feature (not yet available as part of the Apache Impala core product) and shared their performance numbers (charts not reproduced here).
Some of the queries showed a huge improvement in both CPU time and the amount of data that had to be read from disk.
Original answer back from 2016:
struct IndexPageHeader {
/** TODO: **/
}
https://github.com/apache/parquet-format/blob/6e5b78d6d23b9730e19b78dceb9aac6166d528b8/src/main/thrift/parquet.thrift#L505
The Index Page Header is not implemented as of yet; see the source code of the Parquet format above.
I don't see it even in Parquet 2.0 currently.
But yes, excellent answer from Ryan Blue on Parquet's pseudo-indexing capabilities (bloom filters).
If you're interested in more details, I recommend a great document on how Parquet bloom filters and predicate push-down work:
https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide
and a more technical, implementation-specific document:
https://homepages.cwi.nl/~boncz/msc/2018-BoudewijnBraams.pdf

Parquet currently keeps min/max statistics for each data page. A data page is a group of ~1MB of values (after encoding) for a single column; multiple pages are what make up Parquet's column chunks.
Those min/max values are used to filter both column chunks and the pages that make up a chunk. So you should be able to improve your query time by sorting records by the columns you want to filter on, then writing the data into Parquet. That way, you get the most out of the stats filtering.
You can also get more granular filtering with this technique by decreasing the page and row group sizes, though you are then trading away some encoding efficiency and I/O efficiency.
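As a minimal sketch of that advice (using Spark's Java API purely as an example writer; the paths and the event_time column name are made up), sort on the column you filter by before writing, so that each page's and row group's min/max range stays narrow and the statistics can rule most of them out:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SortedParquetWrite {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("sorted-parquet-write")
                .getOrCreate();

        Dataset<Row> events = spark.read().parquet("/data/events");

        // Sort by the range-query column, then write; Parquet records
        // min/max statistics per page and per row group as it writes.
        events.sort("event_time")
              .write()
              .mode("overwrite")
              .parquet("/data/events_sorted");

        spark.stop();
    }
}

Row group and page sizes can then be shrunk via the usual parquet-hadoop properties (parquet.block.size and parquet.page.size) if you want the more granular filtering mentioned above, at the cost mentioned above.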

Related

Unused index in PostgreSQL

I'm learning indexing in PostgreSQL now. I started by trying to create my own index and analyzing how it affects execution time. I created some tables with the following columns:
I also filled them with data. After that I created my custom index:
create index events_organizer_id_index on events(organizer_ID);
and executed this command (events table contains 148 rows):
explain analyse select * from events where events.organizer_ID = 4;
I was surprised that the search was executed without my index, and I got this result:
As far as I know, if my index were used in the search, the plan would contain text like "Index Scan on events".
So, can someone please explain, or point me to references on, how to use indexes effectively and where I should use them to see a difference?
From "Rows removed by filter: 125" I see there are too few rows in the events table. Just add couple of thousands rows and give it another go
from the docs
Use real data for experimentation. Using test data for setting up indexes will tell you what indexes you need for the test data, but that is all.
It is especially fatal to use very small test data sets. While selecting 1000 out of 100000 rows could be a candidate for an index, selecting 1 out of 100 rows will hardly be, because the 100 rows probably fit within a single disk page, and there is no plan that can beat sequentially fetching 1 disk page.
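For example, one quick way to rerun the experiment with a more realistic volume (a sketch using JDBC; the connection details are made up, and it assumes the remaining columns of events are nullable or have defaults):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class IndexExperiment {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/test", "postgres", "secret");
             Statement st = con.createStatement()) {

            // Bulk-insert enough rows that an index scan can actually win.
            st.executeUpdate(
                "INSERT INTO events (organizer_ID) " +
                "SELECT (random() * 1000)::int FROM generate_series(1, 100000)");
            st.execute("ANALYZE events");

            // Re-check the plan: with ~100k rows and a selective predicate,
            // the planner should now pick events_organizer_id_index.
            try (ResultSet rs = st.executeQuery(
                    "EXPLAIN ANALYZE SELECT * FROM events WHERE organizer_ID = 4")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}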
In most cases, when the database uses an index it gets only the address where the row is located. That address consists of the data block ID and an offset, because there may be many rows in one block of 4 or 8 KB.
So the database first searches the index for the block address, then it looks up the block on disk, reads it, and parses the row you need.
When there are too few rows, they fit into one or a couple of data blocks, which makes it easier and quicker for the DB to read the whole table without using the index at all.
See it the following way:
The database decides which way is faster to find your tuple (=record) with organizer_id 4. There are two ways:
a) Read the index and then skip to the block which contains the data.
b) Read the heap and find the record there.
The information in your screenshot shows 126 records (125 skipped + your record) with a length ("width") of 62 bytes, i.e. roughly 126 × 62 ≈ 7.6 KB of data. Including overhead, these data fit into two database blocks of 8 KB. As a rotating disk or SSD reads a series of blocks anyway (they always read more blocks into the buffer), it's one read operation for these two blocks.
So the database decides that it is pointless to first read the index to find the correct record and then read the data from the heap (in our case two blocks) with the information from the index. That would be two read operations. Even with modern technology newer than rotating disks, this needs more time than just scanning the two blocks. That's why the database doesn't use the index.
Indexes on such small tables aren't good for searching. Nevertheless, unique indexes are still useful to avoid duplicate entries.

SSIS does not recognize Indexes?

I have a table with a non-clustered index on a varchar column 'A'.
When I use an ORDER BY A clause, I can see that it scans the index and gives me the result in a few seconds.
But when I use the Sort component of SSIS on column 'A', I can see it takes minutes to sort the records.
So I understand that it does not recognize my non-clustered index.
Does anyone have any idea how to use indexes in SSIS, without resorting to queries instead of components?
ORDER BY A is run in the database.
When using a Sort component, the sort is done in the SSIS runtime. Note that the query you use to feed the sort does not have an ORDER BY in it (I assume).
It's done in the runtime because it is data-source agnostic: your source could be Excel or a text file or an in-memory dataset or a multicast or pivot or anything.
My advice is to use the database as much as possible.
The only reason to use a sort in an SSIS package is if your source doesn't support sorting (e.g. a flat file) and you want to do a merge join in your package to something else, which is a very rare and specific case.
From my research and from working with SSIS, I have found that the only way to use indexes is through the database connection. Once you fetch your data into the data flow, all you have are records and data, no indexes!
So for tasks like Merge Join, which need a Sort component before them, I tried using a Lookup component with the full cache option instead, caching the whole data set, and used ORDER BY in the Source component's query.
31 Days of SSIS – What The Sorts:
Whether there are one hundred rows or ten million rows – all of the rows have to be consumed by the Sort Transformation before it can return the first row. This potentially places all of the data for the data flow path in memory. And the potentially bit is because if there is enough data it will spill over out of memory.
In the image to the right (not reproduced here) you can see that until all ten million rows are received, the data after that point in the Data Flow cannot be processed.
This behavior should be expected if you consider what the transformation needs to do. Before the first row can be sent along, the last row needs to be checked to make sure that it is not the first row.
For small and narrow datasets this is not an issue. But if your datasets are large or wide, you can run into performance issues with packages that have sorts within them. All of the data being loaded and sorted in memory can be a serious performance hog.

JSONB performance degrades as the number of keys increases

I am testing the performance of the jsonb datatype in PostgreSQL. Each document will have about 1500 keys that are NOT hierarchical. The document is flattened. Here is what the table and document look like.
create table ztable0
(
    id serial primary key,
    data jsonb
)
Here is a sample document:
{ "0": 301, "90": 23, "61": 4001, "11": 929} ...
As you can see, the document does not contain hierarchies and all values are integers. However, some will be text in the future.
Rows: 86,000
Columns: 2
Keys in document: 1500+
When searching for a particular value of a key or performing a GROUP BY, the performance is very noticeably slow. This query:
select (data ->> '1')::integer, count(*) from ztable0
group by (data ->> '1')::integer
limit 100
took about 2 seconds to complete. Is there any way to improve the performance of jsonb documents?
This is a known issue in 9.4beta2; please have a look at this blog post, it contains some details and pointers to the mailing list threads.
About the issue.
PostgreSQL uses TOAST to store data values, which means that big values (typically around 2 kB and more) are stored in a separate, special kind of table. PostgreSQL also tries to compress the data, using its pglz method (which has been there for ages). "Tries" means that before deciding to compress data, the first 1 kB is probed, and if the results are not satisfactory, i.e. compression gives no benefit on the probed data, the decision is made not to compress.
So, the initial JSONB format stored a table of offsets at the beginning of its value. For values with a high number of root keys in the JSON, this resulted in the first 1 kB (and more) being occupied by offsets. This was a series of distinct values, i.e. it was not possible to find two adjacent 4-byte sequences that would be equal. Thus, no compression.
Note that if one skips over the offset table, the rest of the value is perfectly compressible.
So one of the options would be to tell the pglz code explicitly whether compression is applicable and where to probe for it (especially for newly introduced data types), but the existing infrastructure doesn't support this.
The fix
So the decision was made to change the way data is stored inside the JSONB value, making it more suitable for pglz to compress. Here's a commit message by Tom Lane with the change that implements a new JSONB on-disk format. Despite the format changes, lookup of a random element is still O(1).
It took around a month to be fixed, though. As far as I can see, 9.4beta3 has already been tagged, so you'll be able to re-test this soon, after the official announcement.
Important note: you'll have to do a pg_dump/pg_restore exercise or use the pg_upgrade tool to switch to 9.4beta3, as the fix for the issue you've identified required changes in the way data is stored, so beta3 is not binary-compatible with beta2.

Lucene Indexing

I would like to use Lucene for indexing a table in an existing database. I have been thinking the process would be something like this:
Create a 'Field' for every column in the table
Store all the Fields
'ANALYZE' all the Fields except for the Field with the primary key
Store each row in the table as a Lucene Document.
While most of the columns in this table are small in size, one is huge. This column is also the one containing the bulk of the data on which searches will be performed.
I know Lucene provides an option to not store a Field. I was thinking of two solutions:
Store the field regardless of the size and if a hit is found for a search, fetch the appropriate Field from Document
Don't store the Field and if a hit is found for a search, query the data base to get the relevant information out
I realize there may not be a one size fits all answer ...
For sure, your system will be more responsive if you store everything in Lucene. Stored fields do not affect the query time; they only make your index bigger, and probably not that much bigger if only a small portion of the rows have a lot of data. So if the index size is not an issue for your system, I would go with that.
I strongly disagree with Pascal's answer. Index size can have a major impact on search performance. The main reasons are:
stored fields increase the index size, which can be a problem with a relatively slow I/O system;
stored fields are all loaded when you load a Document into memory, which can put real stress on the GC;
stored fields are likely to impact reader reopen time.
The final answer is, of course, "it depends". If the original data is already stored somewhere else, it's good practice to retrieve it from the original data store.
When adding a row from the database to Lucene, you can judge whether each column actually needs to be written to the inverted index. If not, you can use Field.Index.NO to avoid writing too much data to the inverted index.
Likewise, you can judge whether a column will need to be retrieved by key-value. If not, you needn't use Field.Store.YES to store the data.
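A minimal sketch of that decision with the current Lucene field types (in recent Lucene versions the old Field.Store/Field.Index flags are expressed through types such as StringField and TextField; the column names and index path here are made up):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class RowIndexer {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/table-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {

            Document doc = new Document();
            // Primary key: indexed as a single token (not analyzed) and stored,
            // so a search hit can be mapped back to the database row.
            doc.add(new StringField("id", "42", Field.Store.YES));
            // Small column: analyzed and stored.
            doc.add(new TextField("title", "Quarterly report", Field.Store.YES));
            // The huge column: analyzed so it is searchable, but NOT stored;
            // with option 2 above, its content is fetched from the database by "id".
            doc.add(new TextField("body", "(very large text)", Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}

Switching the body field to Field.Store.YES would give you option 1 instead, at the cost of a larger index.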

Can anyone please explain "storing" vs "indexing" in databases?

What is storing and what is indexing a field when it comes to searching?
Specifically I am talking about MySQL or SOLR.
Is there any thorough article about this? I have made some searches without luck!
Thanks
Storing information in a database just means writing the information to a file.
Indexing a database involves looking at the data in a table and creating an 'index' which is then used to perform a more efficient lookup in the table when you want to retrieve the stored data.
From Wikipedia:
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random look ups and efficient access of ordered records. The disk space required to store the index is typically less than that required by the table (since indexes usually contain only the key-fields according to which the table is to be arranged, and excludes all the other details in the table), yielding the possibility to store indexes in memory for a table whose data is too large to store in memory.
Storing is just putting data in the tables.
Storing vs. indexing is a SOLR concept.
In SOLR, a field that is stored but not indexed cannot be searched or sorted on. It can be retrieved as part of the result of a query that searches on an indexed field.
In MySQL, on the contrary, you can search and sort on unindexed fields too: it will just be slower, but still possible (unlike in SOLR).
Storing data is just storing data somewhere so you can retrieve it later. Where indexing comes in is retrieving parts of the data efficiently. Wikipedia explains the idea quite well.
Storing is just that: saving the data to disk (or wherever) so that the database can retrieve it later on demand.
Indexing means creating a separate data structure to optimize locating and retrieving that data, faster than simply reading the entire database (or the entire table) and looking at each and every record until the database's search algorithm finds what you asked for... Generally, databases use what are called balanced-tree (B-tree) indexes, which are an extension of the concept of a binary tree. Look up "binary tree" on Google/Wikipedia to get a more in-depth understanding of how this works...
Data
L1. This
L2. Is
L3. My Data
And the index is
This -> L1
Is -> L2
My -> L3
Data -> L3
The data/index analogy holds for books as well.
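A tiny sketch of the same word-to-line index in plain Java, to show that "indexing" is just building a separate lookup structure next to the stored data:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TinyInvertedIndex {
    public static void main(String[] args) {
        // The stored data: one entry per line, exactly as in the example above.
        String[] lines = {"This", "Is", "My Data"};

        // The index: each word points to the line numbers that contain it.
        Map<String, List<Integer>> index = new HashMap<>();
        for (int lineNo = 0; lineNo < lines.length; lineNo++) {
            for (String word : lines[lineNo].split("\\s+")) {
                index.computeIfAbsent(word, w -> new ArrayList<>()).add(lineNo + 1);
            }
        }

        // Looking up "Data" consults the small index instead of scanning every line.
        System.out.println(index.get("Data")); // prints [3]
    }
}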