PostgreSQL index not used when data rows are large

Hi, I'm curious why the index stops being used as the table grows, even at just 100 rows.
Here's the select with 10 rows in the table:
mydb> explain select * from data where user_id=1;
+--------------------------------------------------------------------------------+
| QUERY PLAN                                                                     |
|--------------------------------------------------------------------------------|
| Index Scan using ix_data_user_id on data  (cost=0.14..8.15 rows=1 width=2043)  |
|   Index Cond: (user_id = 1)                                                    |
+--------------------------------------------------------------------------------+
EXPLAIN
Here's the same select with 100 rows in the table:
mydb> explain select * from data where user_id=1;
+------------------------------------------------------------+
| QUERY PLAN                                                 |
|------------------------------------------------------------|
| Seq Scan on data  (cost=0.00..44.67 rows=1414 width=945)  |
|   Filter: (user_id = 1)                                   |
+------------------------------------------------------------+
EXPLAIN
How can I get the index to be used when the table has 100 rows?

100 is not a large amount of data. Think 10,000 or 100,000 rows for a respectable amount.
To put it simply, records in a table are stored on data pages. A data page typically has about 8k bytes (it depends on the database and on settings). A major purpose of indexes is to reduce the number of data pages that need to be read.
If all the records in a table fit on one page, there is no need to reduce the number of pages being read; the one page will be read regardless. Hence, the index may not be particularly useful.
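If you want to confirm that this is just a planner costing decision rather than a broken index, you can discourage sequential scans for the current session; a quick sketch (enable_seqscan is a standard PostgreSQL planner setting, useful for testing only, not for production):
-- Testing aid only: make sequential scans look very expensive for this session.
SET enable_seqscan = off;
EXPLAIN SELECT * FROM data WHERE user_id = 1;
-- The plan should now show the Index Scan even on a 100-row table.
RESET enable_seqscan;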

Related

Efficiency problem querying postgresql table

I have the following PostgreSQL table with about 67 million rows, which stores the EOD prices for all the US stocks starting in 1985:
Table "public.eods"
 Column |         Type          | Collation | Nullable | Default
--------+-----------------------+-----------+----------+---------
 stk    | character varying(16) |           | not null |
 dt     | date                  |           | not null |
 o      | integer               |           | not null |
 hi     | integer               |           | not null |
 lo     | integer               |           | not null |
 c      | integer               |           | not null |
 v      | integer               |           |          |
Indexes:
"eods_pkey" PRIMARY KEY, btree (stk, dt)
"eods_dt_idx" btree (dt)
I would like to query efficiently the table above based on either the stock name or the date. The primary key of the table is stock name and date. I have also defined an index on the date column, hoping to improve performance for queries that retrieve all the records for a specific date.
Unfortunately, I see a big difference in performance for the queries below. While getting all the records for a specific stock takes a decent amount of time to complete (2 seconds), getting all the records for a specific date takes much longer (about 56 seconds). I have tried to analyze these queries using explain analyze, and I have got the results below:
explain analyze select * from eods where stk='MSFT';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on eods (cost=169.53..17899.61 rows=4770 width=36) (actual time=207.218..2142.215 rows=8364 loops=1)
Recheck Cond: ((stk)::text = 'MSFT'::text)
Heap Blocks: exact=367
-> Bitmap Index Scan on eods_pkey (cost=0.00..168.34 rows=4770 width=0) (actual time=187.844..187.844 rows=8364 loops=1)
Index Cond: ((stk)::text = 'MSFT'::text)
Planning Time: 577.906 ms
Execution Time: 2143.101 ms
(7 rows)
explain analyze select * from eods where dt='2010-02-22';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Index Scan using eods_dt_idx on eods (cost=0.56..25886.45 rows=7556 width=36) (actual time=40.047..56963.769 rows=8143 loops=1)
Index Cond: (dt = '2010-02-22'::date)
Planning Time: 67.876 ms
Execution Time: 56970.499 ms
(4 rows)
I really cannot understand why the second query runs 28 times slower than the first. They retrieve a similar number of records, and both seem to use an index. Could somebody please explain this difference in performance, and can I do something to improve the performance of queries that retrieve all the records for a specific date?
I would guess that this has to do with the data layout. I am guessing that you are loading the data by stk, so the rows for a given stk are on a handful of pages that pretty much only contain that stk.
So, the execution engine is only reading about 25 pages.
On the other hand, no single page contains two records for the same date. When you read by date, you have to read about 7,556 pages. That is, about 300 times the number of pages.
The scaling must also take into account the work for loading and reading the index. This should be about the same for the two queries, so the ratio is less than a factor of 300.
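If the date-based queries matter most, one option is to lay the table out in date order so that rows for one date become adjacent on disk. A hedged sketch: CLUSTER rewrites the table under an exclusive lock, and the ordering decays as new rows arrive, so this is a one-time fix, not a permanent one:
-- One-time physical reordering of the table by the date index.
CLUSTER eods USING eods_dt_idx;
-- Refresh statistics so the planner sees the improved correlation.
ANALYZE eods;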
There can be more than one issue, so it is hard to say where the problem is. An index scan should usually be faster than a bitmap heap scan; if it is not, the following problems are candidates:
an unhealthy index - try running REINDEX INDEX indexname
bad statistics - try running ANALYZE tablename
a suboptimal state of the table - try running VACUUM tablename
a too-low or too-high setting of effective_cache_size
I/O issues - some systems have problems with heavy random I/O; try increasing random_page_cost
Investigating the issue is a little bit of alchemy, but it is possible: there is only a closed set of very probable causes. A good start is (concrete commands below):
VACUUM ANALYZE tablename
and benchmarking your I/O if possible (e.g. with bonnie++)
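Using the names from the question, that starting point would look something like this (a sketch; run the steps one at a time and re-test in between):
VACUUM ANALYZE eods;        -- reclaim dead space and refresh statistics
REINDEX INDEX eods_dt_idx;  -- rebuild the possibly bloated date index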
To find the difference, you'll probably have to run EXPLAIN (ANALYZE, BUFFERS) on the query so that you see how many blocks are touched and where they come from.
I can think of two reasons:
Bad statistics that make PostgreSQL believe that dt has a high physical correlation when it actually does not. If the correlation is low, a bitmap index scan is often more efficient.
To see if that is the problem, run
ANALYZE eods;
and see if that changes the execution plans chosen.
Caching effects: perhaps the first query finds all required blocks already cached, while the second doesn't.
At any rate, it might be worth experimenting to see if a bitmap index scan would be cheaper for the second query:
SET enable_indexscan = off;
Then repeat the query.
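Putting that experiment together, with the slow query from the question (both settings are session-local):
-- Steer the planner away from the plain index scan for this session.
SET enable_indexscan = off;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM eods WHERE dt = '2010-02-22';
-- If the bitmap plan is faster, poor physical correlation on dt is the likely culprit.
RESET enable_indexscan;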

Tuning query using index. Best approach?

I'm stuck on a problem here. I'm using Oracle 11g, and I have this query:
SELECT /*+ PARALLEL(16) */
prdecdde,
prdenusi,
prdenpol,
prdeano,
prdedtpr
FROM stat_pro_det
WHERE prdeisin IS NULL AND PRDENUSI IS NOT NULL AND prdedprv = '20160114'
GROUP BY prdecdde,
prdenusi,
prdenpol,
prdeano,
prdedtpr;
I get the following execution plan:
--------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 53229 | 2287K| | 3652 (4)| 00:00:01 |
| 1 | HASH GROUP BY | | 53229 | 2287K| 3368K| 3652 (4)| 00:00:01 |
|* 2 | TABLE ACCESS BY INDEX ROWID| STAT_PRO_DET | 53229 | 2287K| | 3012 (3)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | STAT_PRO_DET_08 | 214K| | | 626 (4)| 00:00:01 |
--------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("PRDENUSI" IS NOT NULL AND "PRDEISIN" IS NULL)
3 - access("PRDEDPRV"='20160114')
Note
-----
- Degree of Parallelism is 1 because of hint
I still have a lot of CPU cost. The STAT_PRO_DET_08 index is:
CREATE INDEX STAT_PRO_DET_08 ON STAT_PRO_DET(PRDEDPRV)
I've tried adding PRDEISIN and PRDENUSI to the index, putting the most selective first, but with worse results.
This table has 128 million records (yes... maybe we need a partitioned table). But I cannot partition the table for now.
What other suggestions are there? Could a different index get better results, or is this as good as it gets?
Thanks in advance!!!!
EDIT1:
Guys, thanks a lot for all your help. Especially @Marmite.
I have a follow-up question, adding these two queries to the subject: should I create one index for each, or can I have a single index that resolves my performance problem for all three queries?
SELECT /*+ PARALLEL(16) */
prdecdde,
prdenuau,
prdenpol,
prdeano,
prdedtpr
FROM stat_pro_det
WHERE prdeisin IS NULL AND PRDENUSI IS NULL AND prdedprv = '20160114'
GROUP BY prdecdde,
prdenuau,
prdenpol,
prdeano,
prdedtpr;
and
SELECT /*+ PARALLEL(16) */
prdeisin, prdenuau
FROM stat_pro_det, mtauto
WHERE prdedprv = '20160114' AND prdenuau = autonuau AND autoisin IS NULL
GROUP BY prdenuau, prdeisin
First, you might as well rewrite the query as:
SELECT /*+ PARALLEL(16) */ DISTINCT
prdecdde, prdenusi, prdenpol, prdeano, prdedtpr
FROM stat_pro_det
WHERE prdeisin IS NULL AND PRDENUSI IS NOT NULL AND prdedprv = '20160114';
(This is shorter and makes it easier to change the list of columns you are interested in.)
The best index for this query is: stat_pro_det(prdedprv, prdeisin, prdenusi, prdecdde, prdenpol, prdeano, prdedtpr).
The first three columns are important for the WHERE clause and filtering the data. The remaining columns "cover" the query, meaning that the index itself can resolve the query without having to access data pages.
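As DDL, that would be something like this (the index name is illustrative):
CREATE INDEX stat_pro_det_cov ON stat_pro_det
  (prdedprv, prdeisin, prdenusi, prdecdde, prdenpol, prdeano, prdedtpr);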
First, make the following decisions:
do you access using an index or using a full table scan
do you use a parallel query or no parallelism
The general rule is that index access works fine for a small number of accessed records, but does not scale well to high numbers.
So the best way is to test all options and compare the results.
For parallel FULL TABLE SCAN
use a hint as follows (replace tab with your table name or alias):
SELECT /*+ FULL(tab) PARALLEL(16) */
This scales better, but is not instant for small numbers of records.
For index access
Note that this will not be done in parallel. Check the note in the explain plan in your question.
Defining an index containing all the columns (as proposed by Gordon), you will perform a (sequential) index range scan without accessing the table.
As noted above, depending on the number of accessed keys this will be quick or slow.
For parallel index access
You need to define a GLOBAL partitioned index
create index tab_idx on tab (col3,col2,col1,col4,col5)
global partition by hash (col3,col2,col1,col4,col5) PARTITIONS 16;
Then hint:
SELECT /*+ INDEX(tab tab_idx) PARALLEL_INDEX(tab,16) */
You will perform the same index range scan, but this time in parallel, so there is a chance it will respond better than serial execution. Whether you can really open DOP 16 depends, of course, on your database hardware settings and configuration...

Index is not being used by optimizer

I have a query which is performing very badly due to a full scan of a table. I have checked the statistics and rebuilt the indexes, but it's still not working.
SQL Statement:
select distinct NA_DIR_EMAIL d, NA_DIR_EMAIL r
from gcr_items , gcr_deals
where gcr_deals.GCR_DEALS_ID=gcr_items.GCR_DEALS_ID
and
gcr_deals.bu_id=:P0_BU_ID
and
decode(:P55_DIRECT,'ALL','Y',trim(upper(NA_ORG_OWNER_EMAIL)))=
decode(:P55_DIRECT,'ALL','Y',trim(upper(:P55_DIRECT)))
order by 1
Execution Plan:
Plan hash value: 3180018891
-------------------------------------------------------------------------
| Id | Operation | Name | Rows | Time |
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 8 | 00:11:42 |
| 1 | SORT ORDER BY | | 8 | 00:11:42 |
| 2 | HASH UNIQUE | | 8 | 00:11:42 |
|* 3 | HASH JOIN | | 7385 | 00:11:42 |
|* 4 | VIEW | index$_join$_002 | 10462 | 00:00:05 |
|* 5 | HASH JOIN | | | |
|* 6 | INDEX RANGE SCAN | GCR_DEALS_IDX12 | 10462 | 00:00:01 |
| 7 | INDEX FAST FULL SCAN| GCR_DEALS_IDX1 | 10462 | 00:00:06 |
|* 8 | TABLE ACCESS FULL | GCR_ITEMS | 7386 | 00:11:37 |
-------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("GCR_DEALS"."GCR_DEALS_ID"="GCR_ITEMS"."GCR_DEALS_ID")
4 - filter("GCR_DEALS"."BU_ID"=TO_NUMBER(:P0_BU_ID))
5 - access(ROWID=ROWID)
6 - access("GCR_DEALS"."BU_ID"=TO_NUMBER(:P0_BU_ID))
8 - filter(DECODE(:P55_DIRECT,'ALL','Y',TRIM(UPPER("NA_ORG_OWNER_EMAI
L")))=DECODE(:P55_DIRECT,'ALL','Y',TRIM(UPPER(:P55_DIRECT))))
To begin with, part of the condition in the WHERE clause must be decomposed (or "decompiled", or "re-engineered") into a simpler form that does not use the decode function, a form the query optimizer can understand:
AND
decode(:P55_DIRECT,'ALL','Y',trim(upper(NA_ORG_OWNER_EMAIL)))=
decode(:P55_DIRECT,'ALL','Y',trim(upper(:P55_DIRECT)))
into:
AND (
:P55_DIRECT = 'ALL'
OR
trim(upper(:P55_DIRECT)) = trim(upper(NA_ORG_OWNER_EMAIL))
)
To find rows in the table based on values stored in the index, Oracle uses an access method named index scan; see this link for details:
https://docs.oracle.com/cd/B19306_01/server.102/b14211/optimops.htm#i52300
One of the most common access methods is the Index Range Scan; see here:
https://docs.oracle.com/cd/B19306_01/server.102/b14211/optimops.htm#i45075
The documentation says (in the latter link) that:
The optimizer uses a range scan when it finds one or more leading
columns of an index specified in conditions, such as the following:
col1 = :b1
col1 < :b1
col1 > :b1
AND combination of the preceding conditions for leading columns in the
index
col1 like 'ASD%' wild-card searches should not be in a leading
position otherwise the condition col1 like '%ASD' does not result in a
range scan.
The above means that the optimizer is able to use the index to find rows only for query conditions that contain basic comparison operators: = < > <= >= LIKE, used to compare simple values with plain column names. What the documentation doesn't clearly say - you need to deduce it by reading between the lines - is the fact that when some function is used in the condition, in the form function( column_name ) or function( expression_involving_column_names ), then an index range scan cannot be used. In that case the query optimizer must evaluate the expression individually for each row in the table, and thus must read all rows (perform a full table scan).
A short conclusion and a rule of thumb:
Functions in the WHERE clause can prevent the optimizer from using indexes.
If you see some function somewhere in the WHERE clause, treat it as a sign that you are running a red light: STOP immediately and think three times about how this function impacts the query optimizer and the performance of your query, and try to rewrite the condition into a form that the optimizer is able to understand.
Now take a look at our rewritten condition:
AND (
:P55_DIRECT = 'ALL'
OR
trim(upper(:P55_DIRECT)) = trim(upper(NA_ORG_OWNER_EMAIL))
)
and STOP - there are still two functions trim and upper applied to a column named NA_ORG_OWNER_EMAIL. We need to think how they can impact the query optimizer.
I assume that you have created a plain index on a single column: CREATE INDEX somename ON GCR_ITEMS( NA_ORG_OWNER_EMAIL ). If so, the index contains only plain values of NA_ORG_OWNER_EMAIL.
But the query is trying to find trim(upper(NA_ORG_OWNER_EMAIL)) values, which are not stored in the index, so this index cannot be used in this case.
This condition requires a function based index:
https://docs.oracle.com/cd/E11882_01/appdev.112/e41502/adfns_indexes.htm#ADFNS00505
CREATE INDEX somename ON GCR_ITEMS( trim( upper( NA_ORG_OWNER_EMAIL )))
Unfortunately, even the function-based index will still not help, because the condition in the query is too general: if the value of :P55_DIRECT is 'ALL', the query must retrieve all rows from the table (perform a full table scan); otherwise it must use the index to search values within it.
This is because the query is planned (think of it as "compiled") by the query optimizer only once, during its first execution. The plan is then stored in the cache and used for all further executions. The value of the parameter is not known in advance, so the plan must cover every possible case, and thus will always perform a full table scan.
In 12c there is a new feature, "Adaptive query optimization":
https://docs.oracle.com/database/121/TGSQL/tgsql_optcncpt.htm#TGSQL94982
where the query optimizer analyses the query's parameters on each run, is able to detect that the plan is not optimal for some runtime parameter values, and can choose better "subplans" depending on the actual values ... but you must use 12c, and additionally pay for Enterprise Edition, because only that edition includes this feature. And it's still not certain whether an adaptive plan would work in this case or not.
What you can do without paying for 12c EE is to divide this general query into two separate variants: one for the case where :P55_DIRECT = 'ALL', and another for the remaining cases, and run the appropriate variant from the client (your application) depending on the value of this parameter.
A version for :P55_DIRECT = 'ALL', which will perform a full table scan:
where gcr_deals.GCR_DEALS_ID=gcr_items.GCR_DEALS_ID
and
gcr_deals.bu_id=:P0_BU_ID
order by 1
and a version for the other cases, which will use the function-based index:
where gcr_deals.GCR_DEALS_ID=gcr_items.GCR_DEALS_ID
and
gcr_deals.bu_id=:P0_BU_ID
and
trim(upper(:P55_DIRECT)) = trim(upper(NA_ORG_OWNER_EMAIL))
order by 1

Slow GroupAggregate in PostgreSQL

In PostgreSQL 9.2, I have a table of items that are being rated by users:
id | userid | itemid | rating | timestamp | !update_time
--------+--------+--------+---------------+---------------------+------------------------
522241 | 3991 | 6887 | 0.1111111111 | 2005-06-20 03:13:56 | 2013-10-11 17:50:24.545
522242 | 3991 | 6934 | 0.1111111111 | 2005-04-05 02:25:21 | 2013-10-11 17:50:24.545
522243 | 3991 | 6936 | -0.1111111111 | 2005-03-31 03:17:25 | 2013-10-11 17:50:24.545
522244 | 3991 | 6942 | -0.3333333333 | 2005-03-24 04:38:02 | 2013-10-11 17:50:24.545
522245 | 3991 | 6951 | -0.5555555556 | 2005-06-20 03:15:35 | 2013-10-11 17:50:24.545
... | ... | ... | ... | ... | ...
I want to perform a very simple query: for each user, select the total number of ratings in the database.
I'm using the following straightforward approach:
SELECT userid, COUNT(*) AS rcount
FROM ratings
GROUP BY userid
The table contains 10M records. The query takes... well, about 2 or 3 minutes. Honestly, I'm not satisfied with that, and I believe 10M is not so large a number that the query should take so long. (Or is it..??)
So I asked PostgreSQL to show me the execution plan:
EXPLAIN SELECT userid, COUNT(*) AS rcount
FROM ratings
GROUP BY userid
This results in:
GroupAggregate (cost=1756177.54..1831423.30 rows=24535 width=5)
-> Sort (cost=1756177.54..1781177.68 rows=10000054 width=5)
Sort Key: userid
-> Seq Scan on ratings (cost=0.00..183334.54 rows=10000054 width=5)
I read this as follows: first, the whole table is read from disk (seq scan). Second, it is sorted by userid in n*log(n) (sort). Finally, the sorted table is read row by row and aggregated in linear time. Not exactly the optimal algorithm, I think; if I were to implement it myself, I would use a hash table and build the result in a single pass. Never mind.
It seems that it is the sorting by userid which takes so long, so I added an index:
CREATE INDEX ratings_userid_index ON ratings (userid)
Unfortunately, this didn't help and the performance remained the same. I definitely do not consider myself an advanced user and I believe I'm doing something fundamentally wrong. However, this is where I got stuck. I would appreciate any ideas how to make the query execute in reasonable time. One more note: PostgreSQL worker process utilizes 100 % of one of my CPU cores during the execution, suggesting that disk access is not the main bottleneck.
EDIT
As requested by @a_horse_with_no_name. Wow, quite advanced for me:
EXPLAIN (analyze on, buffers on, verbose on)
SELECT userid,COUNT(userid) AS rcount
FROM movielens_10m.ratings
GROUP BY userId
Outputs:
GroupAggregate (cost=1756177.54..1831423.30 rows=24535 width=5) (actual time=110666.899..127168.304 rows=69878 loops=1)
Output: userid, count(userid)
Buffers: shared hit=906 read=82433, temp read=19358 written=19358
-> Sort (cost=1756177.54..1781177.68 rows=10000054 width=5) (actual time=110666.838..125180.683 rows=10000054 loops=1)
Output: userid
Sort Key: ratings.userid
Sort Method: external merge Disk: 154840kB
Buffers: shared hit=906 read=82433, temp read=19358 written=19358
-> Seq Scan on movielens_10m.ratings (cost=0.00..183334.54 rows=10000054 width=5) (actual time=0.019..2889.583 rows=10000054 loops=1)
Output: userid
Buffers: shared hit=901 read=82433
Total runtime: 127193.524 ms
EDIT 2
@a_horse_with_no_name's comment solved the problem. I'm happy to share my findings:
SET work_mem = '1MB';
EXPLAIN SELECT userid,COUNT(userid) AS rcount
FROM movielens_10m.ratings
GROUP BY userId
produces the same as above:
GroupAggregate (cost=1756177.54..1831423.30 rows=24535 width=5)
-> Sort (cost=1756177.54..1781177.68 rows=10000054 width=5)
Sort Key: userid
-> Seq Scan on ratings (cost=0.00..183334.54 rows=10000054 width=5)
However,
SET work_mem = '10MB';
EXPLAIN SELECT userid,COUNT(userid) AS rcount
FROM movielens_10m.ratings
GROUP BY userId
gives
HashAggregate (cost=233334.81..233580.16 rows=24535 width=5)
-> Seq Scan on ratings (cost=0.00..183334.54 rows=10000054 width=5)
The query now only takes about 3.5 seconds to complete.
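If the larger work_mem helps consistently, it can be applied per session or persisted per role rather than globally; a sketch (the role name is illustrative, not from the original setup):
-- Session-level: affects only the current connection.
SET work_mem = '10MB';
-- Persisted for one role (hypothetical name), applied at login:
ALTER ROLE reporting_user SET work_mem = '10MB';
-- Caution: work_mem is allocated per sort/hash node, per query,
-- so large global values can multiply memory consumption.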
Consider how your query could possibly return a result... You could build a variable-length hash and create/increment its values; or you could sort all rows by userid and count them. Computationally, the latter option is cheaper. That is what Postgres does.
Then consider how to sort the data, taking disk IO into account. One option is to read disk pages A, B, C, D, etc. in sequence and then sort the rows by userid in memory. In other words, a seq scan followed by a sort. The other option, called an index scan, would be to pull rows in order by using an index: visit page B, then D, then A, then B again, A again, C, ad nauseam.
An index scan is efficient when pulling a handful of rows in order; much less so when fetching many rows in order, let alone all rows in order. As such, the plan you're getting is the optimal one:
Plough through all rows (seq scan)
Sort rows by the group-by criteria
Count rows per criteria
The trouble is, you're sorting roughly 10 million rows in order to count them by userid. Nothing will make things faster short of investing in more RAM and super-fast SSDs.
You can, however, avoid this query altogether. Either:
Count ratings for the handful of users you actually need (using a where clause) instead of pulling the entire set; or
Add a ratings_count field to your users table and use triggers on ratings to maintain the count (a sketch follows this list); or
Use a materialized view, if a precise count is less important than a rough idea of it.
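A minimal sketch of the trigger approach, assuming a users table with an integer primary key id and a new ratings_count column (both names are illustrative, not from the original schema):
-- Hypothetical counter column on a users table.
ALTER TABLE users ADD COLUMN ratings_count bigint NOT NULL DEFAULT 0;

-- Backfill once so the counter starts out correct.
UPDATE users u
SET ratings_count = (SELECT count(*) FROM ratings r WHERE r.userid = u.id);

-- Keep the counter in sync on every insert/delete into ratings.
CREATE OR REPLACE FUNCTION bump_ratings_count() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    UPDATE users SET ratings_count = ratings_count + 1 WHERE id = NEW.userid;
  ELSIF TG_OP = 'DELETE' THEN
    UPDATE users SET ratings_count = ratings_count - 1 WHERE id = OLD.userid;
  END IF;
  RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ratings_count_trg
AFTER INSERT OR DELETE ON ratings
FOR EACH ROW EXECUTE PROCEDURE bump_ratings_count();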
Try the query below, because COUNT(*) and COUNT(userid) can make a lot of difference.
SELECT userid, COUNT(userid) AS rcount
FROM ratings
GROUP BY userid
You can try running VACUUM ANALYZE ratings to update the data statistics, so the optimizer can choose a better plan to execute the SQL.

Vertica and joins

I'm adapting a web analysis tool to use Vertica as the DB. I'm having real problems optimizing joins. I tried creating pre-join projections for some of my queries, and while it did make the queries blazing fast, it slowed data loading into the fact table to a crawl.
A simple INSERT INTO ... SELECT * FROM which we use to load data into the fact table from a staging table goes from taking ~5 seconds to taking 20+ minutes.
Because of this I dropped all pre-join projections and tried using the Database Designer to design query specific projections but it's not enough. Even with those projections a simple join is taking ~14 seconds, something that takes ~1 second with a pre-join projection.
My question is this: Is it normal for a pre-join projection to slow data insertion this much and if not, what could be the culprit? If it is normal, then it's a show stopper for us and are there other techniques we could use to speed up the joins?
We're running Vertica on a 5 node cluster, each node having 2 x quad core CPU and 32 GB of memory. The tables in my example query have 188,843,085 and 25,712,878 rows respectively.
The EXPLAIN output looks like this:
EXPLAIN SELECT referer_via_.url as referralPageUrl, COUNT(DISTINCT session.id) as visits
FROM owa_session as session
JOIN owa_referer AS referer_via_ ON session.referer_id = referer_via_.id
WHERE session.yyyymmdd BETWEEN '20121123' AND '20121123'
AND session.site_id = '49'
GROUP BY referer_via_.url
ORDER BY visits DESC LIMIT 250;
Access Path:
+-SELECT LIMIT 250 [Cost: 1M, Rows: 250 (STALE STATISTICS)] (PATH ID: 0)
| Output Only: 250 tuples
| Execute on: Query Initiator
| +---> SORT [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 1)
| | Order: count(DISTINCT "session".id) DESC
| | Output Only: 250 tuples
| | Execute on: All Nodes
| | +---> GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 2)
| | | Aggregates: count(DISTINCT "session".id)
| | | Group By: referer_via_.url
| | | Execute on: All Nodes
| | | +---> GROUPBY HASH (SORT OUTPUT) (RESEGMENT GROUPS) [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 3)
| | | | Group By: referer_via_.url, "session".id
| | | | Execute on: All Nodes
| | | | +---> JOIN HASH [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 4) Outer (RESEGMENT)
| | | | | Join Cond: ("session".referer_id = referer_via_.id)
| | | | | Execute on: All Nodes
| | | | | +-- Outer -> STORAGE ACCESS for session [Cost: 463, Rows: 1 (STALE STATISTICS)] (PUSHED GROUPING) (PATH ID: 5)
| | | | | | Projection: public.owa_session_projection
| | | | | | Materialize: "session".id, "session".referer_id
| | | | | | Filter: ("session".site_id = '49')
| | | | | | Filter: (("session".yyyymmdd >= 20121123) AND ("session".yyyymmdd <= 20121123))
| | | | | | Execute on: All Nodes
| | | | | +-- Inner -> STORAGE ACCESS for referer_via_ [Cost: 293K, Rows: 26M] (PATH ID: 6)
| | | | | | Projection: public.owa_referer_DBD_1_seg_Potency_20121122_Potency_20121122
| | | | | | Materialize: referer_via_.id, referer_via_.url
| | | | | | Execute on: All Nodes
To speed up the join (a DDL sketch follows this list):
design the session table as partitioned on column "yyyymmdd"; this will enable partition pruning
add a condition on column "yyyymmdd" to referer_via_ and partition on it, if that is possible (most likely not)
have column site_id as close as possible to the beginning of the ORDER BY list in the (super)projection used for session
have both tables segmented on referer_id and id respectively.
And having more nodes in the cluster does help.
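A rough sketch of what those suggestions could look like as DDL (the projection names and column lists are assumptions based on the query in the question; verify the exact syntax against your Vertica version):
-- Partition the session fact table by day so date predicates prune partitions.
ALTER TABLE owa_session PARTITION BY yyyymmdd;

-- Query-specific projection: filter columns early in the sort order,
-- segmented on the join key so the join can stay local to each node.
CREATE PROJECTION owa_session_join_p AS
SELECT id, referer_id, site_id, yyyymmdd
FROM owa_session
ORDER BY site_id, yyyymmdd
SEGMENTED BY HASH(referer_id) ALL NODES;

CREATE PROJECTION owa_referer_join_p AS
SELECT id, url
FROM owa_referer
ORDER BY id
SEGMENTED BY HASH(id) ALL NODES;

-- Populate the new projections.
SELECT REFRESH('owa_session');
SELECT REFRESH('owa_referer');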
My question is this: Is it normal for a pre-join projection to slow data insertion this much and if not, what could be the culprit? If it is normal, then it's a show stopper for us and are there other techniques we could use to speed up the joins?
I guess the amount affected would vary depending on data sets and structures you are working with. But, since this is the variable you changed, I believe it is safe to say the pre-join projection is causing the slowness. You are gaining query time at the expense of insertion time.
Someone please correct me if any of the following is wrong. I'm going by memory and by information picked up in conversations with others.
You can speed up your joins without a pre-join projection in a few ways; in this case, via the referrer ID. I believe that segmenting the projections for both tables on the join predicate would help, as would anything you can do to filter the data.
Looking at your explain plan, you are doing a hash join instead of a merge join, which you probably want to look at.
Lastly, I would like to know via the explain plan or through system tables if your query is actually using the projections Database Designer has recommended. If not, explicitly specify them in your query and see if that helps.
You seem to have a lot of STALE STATISTICS.
Responding to stale statistics is important, because that is the reason why your queries are slow: without statistics about the underlying data, Vertica's query optimizer cannot choose a good execution plan. Note that refreshing statistics only improves SELECT performance, not update performance.
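For reference, refreshing statistics in Vertica is one built-in call per table (table names taken from the query in the question):
SELECT ANALYZE_STATISTICS('public.owa_session');
SELECT ANALYZE_STATISTICS('public.owa_referer');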
If you update your tables regularly, remember there are additional things to consider in Vertica. Please check the answer that I posted to this question.
I hope that helps improve your update speed.
Explore the AHM settings as explained in that answer. If you don't need to be able to select deleted rows from a table later, it is often a good idea not to keep them around. There are ways to keep only the latest epoch version of the data, or you can manually purge deleted data.
Let me know how it goes.
I think your query could stand to be more explicit. Also, don't use that devil BETWEEN. Try this:
EXPLAIN SELECT
referer_via_.url as referralPageUrl,
COUNT(DISTINCT session.id) as visits
FROM owa_session as session
JOIN owa_referer AS referer_via_
ON session.referer_id = referer_via_.id
-- half-open range: >= the day, < the next day
WHERE session.yyyymmdd >= '20121123'
AND session.yyyymmdd < '20121124'
AND session.site_id = '49'
GROUP BY referer_via_.url
-- this `visits` column needs a table name
ORDER BY visits DESC LIMIT 250;
I'll say I'm really perplexed as to why you would use the same date on both ends of a BETWEEN; you may want to look into that.
This is my view, coming from an academic background working with column databases, including Vertica (I'm a recent PhD graduate in database systems).
My question is this: Is it normal for a pre-join projection to slow data insertion this much and if not, what could be the culprit? If it is normal, then it's a show stopper for us and are there other techniques we could use to speed up the joins?
Yes, updating projections is very slow, and you should ideally do it only in large batches to amortize the update cost. The fundamental reason is that each projection represents another copy of the data (of each table column that is part of the projection).
A single-row insert requires adding one value (one attribute) to each column in the projection. For example, a single-row insert into a table with 20 attributes requires at least 20 column updates. To make things worse, each column is sorted and compressed. This means that inserting the new value into a column requires multiple operations on large chunks of data: read data / decompress / update / sort / compress data / write data back. Vertica has several optimizations for updates but cannot completely hide the cost.
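In practice this means favoring large batched loads over row-at-a-time inserts; a sketch using Vertica's DIRECT hint, which writes straight to disk storage (the fact and staging table names are placeholders for whatever your schema uses):
-- Batch-load from staging into the fact table, bypassing the in-memory WOS.
INSERT /*+ DIRECT */ INTO owa_fact_table SELECT * FROM owa_staging_table;
COMMIT;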
Projections can be thought of as the equivalent of multi-column indexes in a traditional row store (MySQL, PostgreSQL, Oracle, etc.). The upside of projections versus traditional B-tree indexes is that reading them (using them to answer a query) is much faster than using traditional indexes. The reasons are multiple: no need to access heap data as with non-clustered indexes, smaller size due to compression, etc. The flipside is that they are much more difficult to update. Tradeoffs...