I'm facing an issue while merging some data.
I have two tables:
CREATE TABLE tmp_table
(
TROWID ROWID NOT NULL
, NEW_FK1 NUMBER(10)
, NEW_FK2 NUMBER(10)
, CONSTRAINT TMP_TABLE_PK_1 PRIMARY KEY
(
TROWID
)
ENABLE
)
CREATE UNIQUE INDEX TMP_TABLE_PK_1 ON tmp_table (TROWID ASC)
CREATE TABLE my_table
(
M_ID NUMBER(10) NOT NULL
, M_FK1 NUMBER(10)
, M_FK2 NUMBER(10)
, M_START_DATE DATE NOT NULL
, M_END_DATE DATE
, M_DELETED NUMBER(1) NOT NULL
, M_CHECK1 NUMBER(1) NOT NULL
, M_CHECK2 NUMBER(1) NOT NULL
, M_CHECK3 NUMBER(1)
, M_CREATION_DATE DATE
, M_CREATION_USER NUMBER(10)
, M_UPDATE_DATE DATE
, M_UPDATE_USER NUMBER(10)
, CONSTRAINT MY_TABLE_PK_1 PRIMARY KEY
(
M_ID
)
ENABLE
)
CREATE UNIQUE INDEX MY_TABLE_PK_1 ON my_table (M_ID ASC)
CREATE INDEX MY_TABLE_IX_1 ON my_table (M_UPDATE_DATE ASC, M_FK2 ASC)
CREATE INDEX MY_TABLE_IX_2 ON my_table (M_FK1 ASC, M_FK2 ASC)
The tmp_table is a temporary table where I stored only the records and information that will be updated in my_table. That means tmp_table.TROWID is the rowid of the my_table row that should be merged.
The total number of merged records should be 94M, out of a total of 540M rows in my_table.
The query:
MERGE /*+parallel*/ INTO my_table m
USING (SELECT /*+parallel*/ * FROM tmp_table) t
ON (m.rowid = t.TROWID)
WHEN MATCHED THEN
UPDATE SET m.M_FK1 = t.NEW_FK1 , m.M_FK2 = t.NEW_FK2 , m.M_UPDATE_DATE = trunc(sysdate)
, m.M_UPDATE_USER = 0 , m.M_CREATION_USER = 0
The execution plan is:
Operation | Table | Estimated Rows |
MERGE STATEMENT | | |
- MERGE | my_table | |
-- PX COORDINATOR | | |
--- PX SENDER | | |
---- PX SEND QC (RANDOM) | | 95M |
----- VIEW | | |
------ HASH JOIN BUFFERED | | 95M |
------- PX RECEIVE | | 95M |
-------- PX SEND HASH | | 95M |
--------- PX BLOCK ITERATOR | | 95M |
---------- TABLE ACCESS FULL | tmp_table | 95M |
------- PX RECEIVE | | 540M |
-------- PX SEND HASH | | 540M |
--------- PX BLOCK ITERATOR | | 540M |
---------- TABLE ACCESS FULL | my_table | 540M |
In the above plan the most expensive operation is the HASH JOIN BUFFERED.
The two full scans take no more than 5-6 minutes, but after 2 hours the hash join had only reached about 1% of the execution.
I have no idea why it requires that much time; any suggestions?
EDIT
-----------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------------------
| 0 | MERGE STATEMENT | | 94M| 9719M| | 3027K (2)| 10:05:29 |
| 1 | MERGE | my_table | | | | | |
| 2 | VIEW | | | | | | |
|* 3 | HASH JOIN | | 94M| 7109M| 3059M| 3027K (2)| 10:05:29 |
| 4 | TABLE ACCESS FULL| tmp_table | 94M| 1979M| | 100K (2)| 00:20:08 |
| 5 | TABLE ACCESS FULL| my_table | 630M| 33G| | 708K (3)| 02:21:48 |
-----------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("tmp_table"."TROWID"="m".ROWID)
You could do a number of things. Please check whether they are beneficial for your situation, as mileage will vary.
1) Use only the columns of the target table you touch (by select or update):
MERGE
INTO (SELECT m_fk1, m_fk2, m_update_date, m_update_user, m_creation_user
FROM my_table) m
2) Use only the columns of the source table you need. In your case that's all columns, so there won't be any benefit:
MERGE
INTO (...) m
USING (SELECT trowid, new_fk1, new_fk2 FROM tmp_table) t
Both 1) and 2) will reduce the size of the storage needed for a hash join and will enable the optimizer to use an index over all the columns if available.
3) In your special case with ROWIDs, it seems to be very beneficial (at least in my tests) to sort the source table. If you sort the rowids, you will likely update rows in the same physical block together, which may be more performant:
MERGE
INTO (...) m
USING (SELECT ... FROM tmp_table ORDER BY trowid) t
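Assembling 1) to 3), the full statement could look like the following sketch (column lists taken from the DDL above; NEW_FK1/NEW_FK2 are assumed to be the values you want written into M_FK1/M_FK2):
MERGE /*+ parallel */
INTO (SELECT m_fk1, m_fk2, m_update_date, m_update_user, m_creation_user
      FROM my_table) m
USING (SELECT trowid, new_fk1, new_fk2
       FROM tmp_table
       ORDER BY trowid) t          -- 3) sort the source by target rowid
ON (m.rowid = t.trowid)
WHEN MATCHED THEN UPDATE
SET m.m_fk1 = t.new_fk1
  , m.m_fk2 = t.new_fk2
  , m.m_update_date = trunc(sysdate)
  , m.m_update_user = 0
  , m.m_creation_user = 0;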
4) As your target table is quite large, I guess that its tablespace is distributed over a couple of datafiles. You can check this with the query
SELECT f, count(*) FROM (
SELECT dbms_rowid.rowid_relative_fno(trowid) as f from tmp_table
) GROUP BY f ORDER BY f;
If your target table uses more than a handful of datafiles, you could try to partition your temporary table by datafile:
CREATE TABLE tmp_table (
TROWID ROWID NOT NULL
, NEW_FK1 NUMBER(10)
, NEW_FK2 NUMBER(10)
, FNO NUMBER
) PARTITION BY RANGE(FNO) INTERVAL (1) (
PARTITION p0 VALUES LESS THAN (0)
);
You can fill the column FNO with the following expression:
dbms_rowid.rowid_relative_fno(rowid)
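For example, assuming you keep the original tmp_table and create the partitioned copy under a different name (tmp_table_part is a hypothetical name here), a sketch to carry the rows over and derive FNO from the stored target rowid could be:
INSERT /*+ append */ INTO tmp_table_part (trowid, new_fk1, new_fk2, fno)
SELECT trowid
     , new_fk1
     , new_fk2
     , dbms_rowid.rowid_relative_fno(trowid)  -- datafile of the target my_table row
FROM tmp_table;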
Now you can update datafile by datafile, reducing the required memory for the hash join. Get the list of file numbers with
SELECT DISTINCT fno FROM tmp_table;
14
15
16
17
and run the updates file by file:
MERGE
INTO (SELECT ... FROM my_table) m
USING (SELECT ... FROM tmp_table PARTITION FOR (14) ORDER BY trowid) t
and next PARTITION FOR (15) etc. The file numbers will obviously be different on your system.
5) Finally, try to use nested loops instead of a hash join. Usually the optimizer picks the better join plan, but I cannot resist trying it out:
MERGE /*+ USE_NL (m t) */
INTO (SELECT ... FROM my_table) m
USING (SELECT ... FROM tmp_table partition for (14) ORDER BY trowid) t
ON (m.rowid = t.TROWID)
I have an Oracle table as below:
CREATE TABLE "TABLE1"
(
"TABLE_ID" VARCHAR2(32 BYTE),
"TABLE_DATE" DATE,
"TABLE_NAME" VARCHAR2(2 BYTE)
)
PARTITION BY RANGE ("TABLE_DATE")
I guess this table has data partitioned by the TABLE_DATE column.
How can I use this partitioning column to fetch data faster from this table in a WHERE clause like ...
SELECT * FROM TABLE1 PARTITION (P1) p
WHERE p.TABLE_DATE > (SYSDATE - 90) ;
You should change your partitioning to match your queries, not change your queries to match your partitioning. In most cases we shouldn't have to specify which partitions to read from. Oracle can automatically determine how to prune the partitions at run time.
For example, with this table:
create table table1
(
table_id varchar2(32 byte),
table_date date,
table_name varchar2(2 byte)
)
partition by range (table_date)
(
partition p1 values less than (date '2019-05-06'),
partition p2 values less than (maxvalue)
);
There's almost never a need to directly reference a partition in the query. It's extra work, and if we list the wrong partition name the query won't work correctly.
We can see partition pruning in action using EXPLAIN PLAN like this:
explain plan for
SELECT * FROM TABLE1 p
WHERE p.TABLE_DATE > (SYSDATE - 90) ;
select *
from table(dbms_xplan.display);
In the results we can see the partitioning in the Pstart and Pstop columns. The KEY means that the partition will be determined at run time. In this case, the start partition is based on the value of SYSDATE.
Plan hash value: 434062308
---------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 30 | 2 (0)| 00:00:01 | | |
| 1 | PARTITION RANGE ITERATOR| | 1 | 30 | 2 (0)| 00:00:01 | KEY | 2 |
|* 2 | TABLE ACCESS FULL | TABLE1 | 1 | 30 | 2 (0)| 00:00:01 | KEY | 2 |
---------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("P"."TABLE_DATE">SYSDATE#!-90)
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
I am trying to run insert overwrite over a partitioned table.
The select query of insert overwrite omits one partition completely. Is it the expected behavior?
Table definition
CREATE TABLE `cities_red`(
`cityid` int,
`city` string)
PARTITIONED BY (
`state` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
'auto.purge'='true',
'last_modified_time'='1555591782',
'transient_lastDdlTime'='1555591782');
Table Data
+--------------------+------------------+-------------------+--+
| cities_red.cityid | cities_red.city | cities_red.state |
+--------------------+------------------+-------------------+--+
| 13 | KARNAL | HARYANA |
| 13 | KARNAL | HARYANA |
| 1 | Nagpur | MH |
| 22 | Mumbai | MH |
| 22 | Mumbai | MH |
| 755 | BPL | MP |
| 755 | BPL | MP |
| 10 | BANGLORE | TN |
| 10 | BANGLORE | TN |
| 10 | BANGLORE | TN |
| 10 | BANGLORE | TN |
| 12 | NOIDA | UP |
| 12 | NOIDA | UP |
+--------------------+------------------+-------------------+--+
Queries
insert overwrite table cities_red partition (state) select * from cities_red where city !='NOIDA';
It does not delete any data from the table
insert overwrite table cities_red partition (state) select * from cities_red where city !='Mumbai';
It removes the expected 2 rows from the table.
Is this expected behavior from Hive for partitioned tables?
Yes, this is expected behavior.
INSERT OVERWRITE TABLE ... PARTITION ... SELECT overwrites only the partitions present in the dataset returned by the SELECT.
In your example, partition state=UP has records with city='NOIDA' only. The filter where city != 'NOIDA' removes the entire state=UP partition from the returned dataset, which is why it is not being rewritten.
The filter city != 'Mumbai' does not remove an entire partition; the state=MH partition is returned partially, which is why it is overwritten with the filtered data.
It works as designed. Consider the scenario where you need to overwrite only the desired partitions, which is quite normal for incremental partition loads: you do not want to touch the other partitions, and you certainly do not want to overwrite unchanged partitions, which can be very expensive to recover.
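For example, an incremental load that touches only one partition could use a static partition spec, as in this sketch (staging_cities is a hypothetical source table):
INSERT OVERWRITE TABLE cities_red PARTITION (state='MH')
SELECT cityid, city        -- the partition column is not selected with a static spec
FROM staging_cities
WHERE state='MH';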
If you still want to drop partitions and modify data in existing partitions, you can drop and re-create the table (you may need an intermediate table for this) and then load the partitions into it.
Alternatively, calculate separately which partitions you need to drop and execute ALTER TABLE DROP PARTITION.
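A sketch for the state=UP case from the example above:
ALTER TABLE cities_red DROP IF EXISTS PARTITION (state='UP');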
I have a table in PostgreSQL with a timestamp column, and I want to modify the table to have a second timestamp column and seed it with the value of the immediately successive timestamp. Is there a way to do this? The tables are fairly large, so a correlated subquery might kill the machine.
More concretely, I want to go from this to that:
+----+------+ +----+------+------+
| ts | data | | ts | te | data |
+----+------+ +----+------+------+
| T | ... | --> | T | U | ... |
| U | ... | | U | V | ... |
| V | ... | | V | null | ... |
+----+------+ +----+------+------+
Basically, I want to be able to handle point-in-time queries much better (i.e., give me the data for time X).
Basically, I think you could just compute that timestamp at query time instead of storing it in the table, but if you're set on this approach and think it's what you need, then:
You need to add that column to your table:
ALTER TABLE tablename ADD COLUMN te timestamp;
Then perform an update, feeding the value with the LEAD window function:
UPDATE tablename t
SET te = x.te
FROM (
SELECT ts, lead(ts, 1) OVER (order by ts) AS te
FROM tablename t2
) x
WHERE t.ts = x.ts
Here's an example of how it works using sample integer data: SQL Fiddle.
It will perform exactly the same for timestamp data type values.
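With te populated, a point-in-time lookup could then look like this sketch (the literal timestamp is just an example value):
SELECT data
FROM tablename
WHERE ts <= timestamp '2015-01-03 22:02:30'
  AND (te > timestamp '2015-01-03 22:02:30' OR te IS NULL);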
SELECT ts, LEAD(ts) OVER (ORDER BY ts) AS te, data FROM table_name;
I'm new to PostgreSQL and am using version 9.4. I have a table with collected measurements stored as strings and need to convert it to a kind of PIVOT table using something that is always up to date, like a VIEW.
Furthermore, some values need to be converted, e.g. multiplied by 1000, as you
can see in the example below for "sensor3".
Source Table:
CREATE TABLE source (
id bigint NOT NULL,
name character varying(255),
"timestamp" timestamp without time zone,
value character varying(32672),
CONSTRAINT source_pkey PRIMARY KEY (id)
);
INSERT INTO source VALUES
(15,'sensor2','2015-01-03 22:02:05.872','88.4')
, (16,'foo27' ,'2015-01-03 22:02:10.887','-3.755')
, (17,'sensor1','2015-01-03 22:02:10.887','1.1704')
, (18,'foo27' ,'2015-01-03 22:02:50.825','-1.4')
, (19,'bar_18' ,'2015-01-03 22:02:50.833','545.43')
, (20,'foo27' ,'2015-01-03 22:02:50.935','-2.87')
, (21,'sensor3','2015-01-03 22:02:51.044','6.56');
Source Table Result:
| id | name | timestamp | value |
|----+-----------+---------------------------+----------|
| 15 | "sensor2" | "2015-01-03 22:02:05.872" | "88.4" |
| 16 | "foo27" | "2015-01-03 22:02:10.887" | "-3.755" |
| 17 | "sensor1" | "2015-01-03 22:02:10.887" | "1.1704" |
| 18 | "foo27" | "2015-01-03 22:02:50.825" | "-1.4" |
| 19 | "bar_18" | "2015-01-03 22:02:50.833" | "545.43" |
| 20 | "foo27" | "2015-01-03 22:02:50.935" | "-2.87" |
| 21 | "sensor3" | "2015-01-03 22:02:51.044" | "6.56" |
Desired Final Result:
| timestamp | sensor1 | sensor2 | sensor3 | foo27 | bar_18 |
|---------------------------+---------+---------+---------+---------+---------|
| "2015-01-03 22:02:05.872" | | 88.4 | | | |
| "2015-01-03 22:02:10.887" | 1.1704 | | | -3.755 | |
| "2015-01-03 22:02:50.825" | | | | -1.4 | |
| "2015-01-03 22:02:50.833" | | | | | 545.43 |
| "2015-01-03 22:02:50.935" | | | | -2.87 | |
| "2015-01-03 22:02:51.044" | | | 6560.00 | | |
Using this:
-- CREATE EXTENSION tablefunc;
SELECT *
FROM
crosstab(
'SELECT
source."timestamp",
source.name,
source.value
FROM
public.source
ORDER BY
1'
,
'SELECT
DISTINCT
source.name
FROM
public.source
ORDER BY
1'
)
AS
(
"timestamp" timestamp without time zone,
"sensor1" character varying(32672),
"sensor2" character varying(32672),
"sensor3" character varying(32672),
"foo27" character varying(32672),
"bar_18" character varying(32672)
)
;
I got the result:
| timestamp | sensor1 | sensor2 | sensor3 | foo27 | bar_18 |
|---------------------------+---------+---------+---------+---------+---------|
| "2015-01-03 22:02:05.872" | | | | 88.4 | |
| "2015-01-03 22:02:10.887" | | -3.755 | 1.1704 | | |
| "2015-01-03 22:02:50.825" | | -1.4 | | | |
| "2015-01-03 22:02:50.833" | 545.43 | | | | |
| "2015-01-03 22:02:50.935" | | -2.87 | | | |
| "2015-01-03 22:02:51.044" | | | | | 6.56 |
Unfortunately:
the values aren't assigned to the correct columns,
the columns aren't dynamic, which means the query fails when there is an additional entry in the name column like 'sensor4', and
I don't know how to change (multiply) the values of some columns.
Your query works like this:
SELECT * FROM crosstab(
$$SELECT "timestamp", name
, CASE name
WHEN 'sensor3' THEN value::numeric * 1000
-- WHEN 'sensor9' THEN value::numeric * 9000 -- add more ...
ELSE value::numeric END AS value
FROM source
ORDER BY 1, 2$$
,$$SELECT unnest('{bar_18,foo27,sensor1,sensor2,sensor3}'::text[])$$
) AS (
"timestamp" timestamp
, bar_18 numeric
, foo27 numeric
, sensor1 numeric
, sensor2 numeric
, sensor3 numeric);
To multiply the value for selected columns, use a "simple" CASE statement. But you need to cast to a numeric type first; the example uses value::numeric.
Which begs the question: Why not store value as numeric type to begin with?
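If every stored value really is numeric, a one-time conversion could be as simple as this sketch (it will fail if any non-numeric strings are present):
ALTER TABLE source
  ALTER COLUMN value TYPE numeric USING value::numeric;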
You need to use the version with two parameters. Detailed explanation:
PostgreSQL Crosstab Query
Truly dynamic cross tabulation is next to impossible, since SQL demands to know the result type in advance, at call time at the latest. But you can do something with polymorphic types:
Dynamic alternative to pivot with CASE and GROUP BY
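For comparison, the plain CASE / GROUP BY form of this pivot (still a fixed column list, but without tablefunc) might look like this sketch:
SELECT "timestamp"
     , max(CASE WHEN name = 'sensor1' THEN value::numeric END)        AS sensor1
     , max(CASE WHEN name = 'sensor2' THEN value::numeric END)        AS sensor2
     , max(CASE WHEN name = 'sensor3' THEN value::numeric * 1000 END) AS sensor3
     , max(CASE WHEN name = 'foo27'   THEN value::numeric END)        AS foo27
     , max(CASE WHEN name = 'bar_18'  THEN value::numeric END)        AS bar_18
FROM source
GROUP BY "timestamp"
ORDER BY "timestamp";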
#Erwin: It said "too long by 7128 characters" for a comment! Anyway:
Your post gave me the hints for the right direction, so thank you very much,
but particularly in my case I need it to be truly dynamic. Currently I've got
38886 rows with 49 different items (= columns to be pivoted).
To first answer yours and #Jasen's urgent question:
The source table layout is not up to me, I'm already very happy to get this
data into an RDBMS. If it were up to me, I'd always save UTC timestamps! But
there's also a reason for having the data saved as strings: it may contain
various data types, like boolean, integer, float, string etc.
To avoid confusing myself further I created a new demo dataset, prefixing the
column names (I know some hate this!) to avoid problems with keywords, and changing the
timestamps (--> minutes) for a better overview:
-- --------------------------------------------------------------------------
-- Create demo table of given schema and insert arbitrary data
-- --------------------------------------------------------------------------
DROP TABLE IF EXISTS table_source;
CREATE TABLE table_source
(
column_id BIGINT NOT NULL,
column_name CHARACTER VARYING(255),
column_timestamp TIMESTAMP WITHOUT TIME ZONE,
column_value CHARACTER VARYING(32672),
CONSTRAINT table_source_pkey PRIMARY KEY (column_id)
);
INSERT INTO table_source VALUES ( 15,'sensor2','2015-01-03 22:01:05.872','88.4');
INSERT INTO table_source VALUES ( 16,'foo27' ,'2015-01-03 22:02:10.887','-3.755');
INSERT INTO table_source VALUES ( 17,'sensor1','2015-01-03 22:02:10.887','1.1704');
INSERT INTO table_source VALUES ( 18,'foo27' ,'2015-01-03 22:03:50.825','-1.4');
INSERT INTO table_source VALUES ( 19,'bar_18','2015-01-03 22:04:50.833','545.43');
INSERT INTO table_source VALUES ( 20,'foo27' ,'2015-01-03 22:05:50.935','-2.87');
INSERT INTO table_source VALUES ( 21,'seNSor3','2015-01-03 22:06:51.044','6.56');
SELECT * FROM table_source;
Furthermore, based on #Erwin's suggestions, I created a view which already
converts the data type. This has the nice feature, besides being fast, of only
adding the required transformations for known items without impacting other (new)
items.
-- --------------------------------------------------------------------------
-- Create view to process source data
-- --------------------------------------------------------------------------
DROP VIEW IF EXISTS view_source_processed;
CREATE VIEW
view_source_processed
AS
SELECT
column_timestamp,
column_name,
CASE LOWER( column_name)
WHEN LOWER( 'sensor3') THEN CAST( column_value AS DOUBLE PRECISION) * 1000.0
ELSE CAST( column_value AS DOUBLE PRECISION)
END AS column_value
FROM
table_source
;
SELECT * FROM view_source_processed ORDER BY column_timestamp DESC LIMIT 100;
This is the desired result of the whole question:
-- --------------------------------------------------------------------------
-- Desired result:
-- --------------------------------------------------------------------------
/*
| column_timestamp | bar_18 | foo27 | sensor1 | sensor2 | seNSor3 |
|---------------------------+---------+---------+---------+---------+---------|
| "2015-01-03 22:01:05.872" | | | | 88.4 | |
| "2015-01-03 22:02:10.887" | | -3.755 | 1.1704 | | |
| "2015-01-03 22:03:50.825" | | -1.4 | | | |
| "2015-01-03 22:04:50.833" | 545.43 | | | | |
| "2015-01-03 22:05:50.935" | | -2.87 | | | |
| "2015-01-03 22:06:51.044" | | | | | 6560 |
*/
This is #Erwin's solution, adapted to the new demo source data. It's perfect
as long as the items (= columns to be pivoted) don't change:
-- --------------------------------------------------------------------------
-- Solution by Erwin, modified for changed demo dataset:
-- http://stackoverflow.com/a/27773730
-- --------------------------------------------------------------------------
SELECT *
FROM
crosstab(
$$
SELECT
column_timestamp,
column_name,
column_value
FROM
view_source_processed
ORDER BY
1, 2
$$
,
$$
SELECT
UNNEST( '{bar_18,foo27,sensor1,sensor2,seNSor3}'::text[])
$$
)
AS
(
column_timestamp timestamp,
bar_18 DOUBLE PRECISION,
foo27 DOUBLE PRECISION,
sensor1 DOUBLE PRECISION,
sensor2 DOUBLE PRECISION,
seNSor3 DOUBLE PRECISION
)
;
When reading through the links #Erwin provided, I found a dynamic SQL example
by #Clodoaldo Neto and remembered that I had already done it this way in
Transact-SQL; this is my attempt:
-- --------------------------------------------------------------------------
-- Dynamic attempt based on:
-- http://stackoverflow.com/a/12989297/131874
-- --------------------------------------------------------------------------
DO $DO$
DECLARE
list_columns TEXT;
BEGIN
DROP TABLE IF EXISTS temp_table_pivot;
list_columns := (
SELECT
string_agg( DISTINCT column_name, ' ' ORDER BY column_name)
FROM
view_source_processed
);
EXECUTE(
FORMAT(
$format_1$
CREATE TEMP TABLE
temp_table_pivot(
column_timestamp TIMESTAMP,
%1$s
)
$format_1$
,
(
REPLACE(
list_columns,
' ',
' DOUBLE PRECISION, '
) || ' DOUBLE PRECISION'
)
)
);
EXECUTE(
FORMAT(
$format_2$
INSERT INTO temp_table_pivot
SELECT
*
FROM crosstab(
$crosstab_1$
SELECT
column_timestamp,
column_name,
column_value
FROM
view_source_processed
ORDER BY
column_timestamp, column_name
$crosstab_1$
,
$crosstab_2$
SELECT DISTINCT
column_name
FROM
view_source_processed
ORDER BY
column_name
$crosstab_2$
)
AS
(
column_timestamp TIMESTAMP,
%1$s
);
$format_2$
,
REPLACE( list_columns, ' ', ' DOUBLE PRECISION, ')
||
' DOUBLE PRECISION'
)
);
END;
$DO$;
SELECT * FROM temp_table_pivot ORDER BY column_timestamp DESC LIMIT 100;
Besides getting this into a stored procedure, I will, for performance reasons,
try to adapt this to an intermediate table where only new values are inserted.
I'll keep you up to date!
Thanks!!!
L.
PS: NO, I don't want to answer my own question, but the "comment"-field is too small!