Greenplum's FDW duplicates data many times - sql

I have Greenplum database version 6.14.1 running on a CentOS 7.2 host.
I am trying to copy data from Postgres 11 to Greenplum 6.14 via a Foreign Data Wrapper.
With the default options I receive N rows, and all the data comes through the master node.
So I changed the options to (mpp_execute 'all segments'),
but in this case I receive 24*N rows, because my cluster has 24 segment nodes.
I think this is a well-known issue, but unfortunately I can't find a solution at all.
Steps to reproduce the behavior:
On Postgres server
create table x(id int, value float8);
insert into x select r, r * random() from generate_series(1,1000) r;
select count(*) from x;
1000
(1 row)
On Greenplum server
CREATE EXTENSION postgres_fdw;
create server foreign_server_x FOREIGN DATA WRAPPER postgres_fdw
OPTIONS(host '172.16.128.135', port '5432', dbname 'postgres');
-- user mapping
CREATE USER MAPPING FOR current_user
SERVER foreign_server_x OPTIONS (user 'admin', password 'admin');
-- foreign table foreign_x
CREATE FOREIGN TABLE foreign_x
(id int, value float8) SERVER foreign_server_x OPTIONS (schema_name 'public', table_name 'x');
select count(*) from foreign_x;
1000
(1 row)
-- mpp_execute = all segments
alter foreign table foreign_x options (add mpp_execute 'all segments');
-- foreign_x (24 segments)
select count(*) from foreign_x;
24000
(1 row)

This is expected behavior: you have 24 segments, and with mpp_execute 'all segments' you are asking every one of them to query the remote database, so each segment fetches the full table. I would suggest executing only from the master, selecting a distinct count instead, or leveraging an external table instead of the FDW.
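For reference, a minimal sketch of reverting to master-only execution ('master' is the default mpp_execute value, so each remote row is fetched once):
-- fetch through the master only
ALTER FOREIGN TABLE foreign_x OPTIONS (SET mpp_execute 'master');
select count(*) from foreign_x;
-- 1000 again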

Related

`CREATE TABLE AS SELECT FROM` in Oracle Cloud doesn't create a new table

I was trying to create a series of tables in a single SQL query in Oracle Cloud under the ADMIN account. In the minimal script below, RAW_TABLE refers to an existing table.
CREATE TABLE BASE1 AS SELECT * FROM RAW_TABLE;
CREATE TABLE BASE2 AS SELECT * FROM BASE1;
CREATE TABLE BASE3 AS SELECT * FROM BASE2;
SELECT * FROM BASE3
This returns a view of the first 100 rows in BASE3, but it doesn't create the three tables along the way. Did I miss something or is there something peculiar about create table statements in Oracle SQL?
EDIT: The environment is Oracle Database Actions in Oracle Cloud. The three tables would not be available in the list of tables in the database, and doing something like select * from BASE3 in a subsequent query would fail.
CREATE TABLE BASE1 AS SELECT * FROM RAW_TABLE;
CREATE TABLE BASE2 AS SELECT * FROM BASE1;
CREATE TABLE BASE3 AS SELECT * FROM BASE2;
SELECT * FROM BASE3
The above is a valid query sequence for an Oracle database. It should have created three new tables in the database. Since that's not happening, please do the work in a few steps to find out what's wrong.
First, please check whether RAW_TABLE is available in the database. Then try to select data from RAW_TABLE:
select * from RAW_TABLE;
If all of those are successful, then try to create a single table with the query below:
CREATE TABLE BASE1 AS SELECT * FROM RAW_TABLE;
Hopefully you will have found the problem by then.
DB-Fiddle:
Creating RAW_TABLE and populating data
create table RAW_TABLE (id int, name varchar(50));
insert into RAW_TABLE values (1,'A');
Query to create three more tables and select from the last table:
CREATE TABLE BASE1 AS SELECT * FROM RAW_TABLE;
CREATE TABLE BASE2 AS SELECT * FROM BASE1;
CREATE TABLE BASE3 AS SELECT * FROM BASE2;
SELECT * FROM BASE3
Output:
ID | NAME
---+-----
 1 | A
Your query fails because you are executing the whole script as one batch, and each statement depends on the previous one. Transactional DBMSs work with a block of code as one transaction, and that block of code doesn't commit until the SQL engine can parse and validate the whole block; since in your block the BASE1 and BASE2 tables don't exist yet, it fails.
So you need to run each statement as a separate batch, either by executing them one by one, or, in Oracle, by using / as a batch separator (like GO in SQL Server). These separators are not SQL or Oracle commands and are not sent to the database server; they just break the block of code into batches on your client (such as SQL*Plus, a shell, or SSMS for Microsoft SQL Server). So it would look like this:
CREATE TABLE BASE1 AS SELECT * FROM RAW_TABLE;
/
CREATE TABLE BASE2 AS SELECT * FROM BASE1;
/
CREATE TABLE BASE3 AS SELECT * FROM BASE2;
/
SELECT * FROM BASE3
If your client doesn't support that, then you will have to run them one by one as separate batches.
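If you must submit everything as one script (as appears to be the case in Database Actions), a hedged alternative is an anonymous PL/SQL block that issues each CREATE through dynamic SQL, so no statement is parsed before the previous one has executed:
BEGIN
    EXECUTE IMMEDIATE 'CREATE TABLE BASE1 AS SELECT * FROM RAW_TABLE';
    EXECUTE IMMEDIATE 'CREATE TABLE BASE2 AS SELECT * FROM BASE1';
    EXECUTE IMMEDIATE 'CREATE TABLE BASE3 AS SELECT * FROM BASE2';
END;
/
SELECT * FROM BASE3;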

PostgreSQL: function to query across multiple databases

I have several databases on the same PostgreSQL server, each with exactly the same tables and columns. I want to write a function that a user could use to query across all these databases at once, something like:
SELECT * FROM all_databases();
For the moment, I just found how to query another database:
-- 1. Get database names
SELECT datname
FROM pg_database
WHERE datname LIKE '%someString%';
-- 2. Get data from different databases with postgres_fdw (same host or remote host)
-- 2.1. Install the module
CREATE EXTENSION postgres_fdw;
-- 2.2. Create a server connection
CREATE SERVER foreign_db
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'localhost', dbname 'foreignDbName', port '5432');
-- 2.3. Create user mapping for the foreign server
CREATE USER MAPPING FOR CURRENT_USER
SERVER foreign_db
OPTIONS (user 'postgres', password 'password');
-- 2.4. Import the foreign schema
IMPORT FOREIGN SCHEMA public
FROM SERVER foreign_db INTO public;
So, what I want to do is to execute something like what is written in 2 for every result returned by 1. It looks like I will have to use some dynamic SQL, but I am a little bit lost...
Firstly, if the tables in both databases have the same names, you can't import them into the same schema; you have to import them into another schema, or create them manually with different foreign table names (see CREATE FOREIGN TABLE).
Secondly, you can do your query with a simple SELECT over your foreign tables. E.g.:
CREATE SCHEMA ft_db2; -- foreign tables db2 schema
IMPORT FOREIGN SCHEMA public
FROM SERVER foreign_db INTO ft_db2;
CREATE OR REPLACE FUNCTION all_databases()
RETURNS SETOF public.test AS
$$
SELECT * FROM public.test
UNION ALL
SELECT * FROM ft_db2.test;
$$
LANGUAGE sql;
SELECT * FROM all_databases();
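If there are many databases, maintaining that UNION ALL by hand gets tedious. Below is a hedged sketch of generating it dynamically; it assumes every foreign database was imported into a schema whose name starts with ft_ (like ft_db2 above) and that each one contains a test table matching public.test:
CREATE OR REPLACE FUNCTION all_databases()
RETURNS SETOF public.test AS
$$
DECLARE
  q text := 'SELECT * FROM public.test';
  s record;
BEGIN
  -- add one UNION ALL branch per imported foreign schema
  FOR s IN SELECT nspname FROM pg_namespace WHERE nspname LIKE 'ft\_%'
  LOOP
    q := q || format(' UNION ALL SELECT * FROM %I.test', s.nspname);
  END LOOP;
  RETURN QUERY EXECUTE q;
END;
$$ LANGUAGE plpgsql;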
maybe something like SELECT * FROM *;

How can I delete a partition function in SQL

So I have an ETL that stores 3 years: '17 (corrupt), '18 (corrupt), '19:
STG_tables: import data from 3 different DBs and export it to
DWH_tables: this is the relational phase, where all the historical information is stored. Here only the normalization and parameterization of the tables and the fields are carried out to adapt them to the developed logical model, but no business rules are applied.
DIM_tables: finally, in the dimensional phase, the business rules are applied and the tables and indexes are optimized for the queries, since this is where the analytical tools will hit.
I have 2 types of reloads:
Daily Reload: this job is responsible for executing the SSIS packages necessary to perform the incremental daily load of the Data Warehouse. It only loads the last partition of the large tables (corresponding to the current year) in the dimensional phase.
Full Reload: loads the full 3 years (this one is not working).
This wasn't done by me and I have zero technical documentation, so I'm just trying to figure out how this works. My thinking is that once I get this full reload working, the data will be restored.
I'm getting an error in the STG phase:
DROP TABLE DWH_PROD.DWH_XX;
DROP TABLE ... ':' The partition function 'pfPetitions' is being used in one or more partition schemes.'. Possible reasons for the error: problems with the query, the property 'ResultSet' was not set correctly, parameters not set correctly or connection poorly established.
I don't know how to drop this partition function so I can create it again,
and I can't find the 'ResultSet' property. Please help.
USE DB;
GO
DROP TABLE DWH_PROD.DWH_ALBARANES_TARIFA;
DROP TABLE DWH_PROD.DWH_PETICIONES;
DROP TABLE DWH_PROD.DWH_SOLICITUDES;
DROP TABLE DWH_PROD.DWH_RESULTADOS;
DROP TABLE DWH_PROD.DWH_INCIDENCIAS;
------- code removed here to keep the text short -------
Below are the CREATE statements for the tables dropped above:
IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = N'DWH_ALBARANES_TARIFA')
CREATE TABLE DWH_PROD.DWH_ALBARANES_TARIFA (
);
IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = N'DWH_INCIDENCIAS')
CREATE TABLE DWH_PROD.DWH_INCIDENCIAS (
);
IF EXISTS (SELECT * FROM sys.partition_functions WHERE name = N'pfPeticiones')
DROP PARTITION FUNCTION pfPeticiones;
CREATE PARTITION FUNCTION pfPeticiones (DATE)
AS RANGE RIGHT FOR VALUES
('2017-01-01', '2018-01-01', '2019-01-01');
IF EXISTS (SELECT * FROM sys.partition_schemes WHERE name = N'psPeticiones')
DROP PARTITION SCHEME psPeticiones;
CREATE PARTITION SCHEME psPeticiones
AS PARTITION pfPeticiones
ALL TO ([Primary]);
IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = N'DWH_PETICIONES')
CREATE TABLE DWH_PROD.DWH_PETICIONES (
) ON psPeticiones(FEC_PETICION);
IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = N'DWH_SOLICITUDES')
CREATE TABLE DWH_PROD.DWH_SOLICITUDES (
) ON psPeticiones(FEC_PETICION);
IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = N'DWH_RESULTADOS')
CREATE TABLE DWH_PROD.DWH_RESULTADOS (
) ON psPeticiones(FEC_PETICION);
You need to perform a few actions, in this order, to delete a partition function:
Delete or move (e.g. if you have a heap, create a clustered index on PRIMARY) all tables that use the partition scheme.
Delete the partition scheme.
Delete the partition function.
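Applied to the script above, where DWH_PETICIONES, DWH_SOLICITUDES and DWH_RESULTADOS all sit ON psPeticiones, a minimal sketch of that order (DROP TABLE IF EXISTS needs SQL Server 2016+):
-- 1. Drop (or move) every table that uses the partition scheme
DROP TABLE IF EXISTS DWH_PROD.DWH_PETICIONES;
DROP TABLE IF EXISTS DWH_PROD.DWH_SOLICITUDES;
DROP TABLE IF EXISTS DWH_PROD.DWH_RESULTADOS;
-- 2. The scheme is now unused and can be dropped
IF EXISTS (SELECT * FROM sys.partition_schemes WHERE name = N'psPeticiones')
    DROP PARTITION SCHEME psPeticiones;
-- 3. No scheme references the function any more, so it can be dropped too
IF EXISTS (SELECT * FROM sys.partition_functions WHERE name = N'pfPeticiones')
    DROP PARTITION FUNCTION pfPeticiones;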

Is there a way to create a temporary table in SQL that deletes right after the query finishes? [duplicate]

This question already has answers here:
Creating temporary tables in SQL
(2 answers)
Closed 6 years ago.
I have a complicated query I'm working on. It involves several tables.
It would be very helpful for me to create a new table and then simply query from that. However, this is a shared database and I don't want to make a new table, especially when I don't plan on using that table afterwards. (I just want it as a stepping stone in my query.)
Is it possible to create a table just for 1 query that deletes right when the query is done? (i.e a temporary table)
Sure. Use CREATE TEMPORARY TABLE:
=> CREATE TEMPORARY TABLE secret_table(id BIGSERIAL, name text);
=> INSERT INTO secret_table(name) VALUES ('Good day');
INSERT 0 1
=> INSERT INTO secret_table(name) VALUES ('Good night');
INSERT 0 1
=> SELECT * FROM secret_table;
id | name
----+------------
1 | Good day
2 | Good night
(2 rows)
But upon reconnection:
psql (9.5.4)
Type "help" for help.
=> SELECT * FROM secret_table;
ERROR: relation "secret_table" does not exist
LINE 1: SELECT * FROM secret_table;
You could use temporary tables, which drop themselves at the end of the session in which they were created (not right after the query finishes, as you said). You can always drop one manually at the end of your operation, though.
If you'd like to create such table as a result from a query then this is the sample to be expanded to your needs:
CREATE TEMP TABLE tmp_table_name AS ( SELECT 1 AS col1 );
But I'm thinking you may be looking for a CTE instead of a table since you're saying that you're planning to use it only once. Consider this:
WITH tmp_table AS ( SELECT 1 AS col1 )
SELECT *
FROM tmp_table
...
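If you need the table gone as soon as the enclosing transaction ends rather than at session end, PostgreSQL also supports ON COMMIT DROP on temporary tables; a minimal sketch:
BEGIN;
CREATE TEMP TABLE tmp_once ON COMMIT DROP AS ( SELECT 1 AS col1 );
SELECT * FROM tmp_once;
COMMIT;  -- tmp_once no longer exists after this point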
You can also do it dynamically; the result of a query is also a table:
select * from (select col1, col2, col3
from my_complex_table
... ) t1
Use the TEMPORARY keyword; the temporary table is only visible in your current connection and is dropped after you disconnect.
The other way would be to create a regular table and drop it yourself when you don't need it anymore, as sketched below.
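A minimal sketch of that second approach (the table name is hypothetical):
-- ordinary table used as a stepping stone; you drop it yourself
CREATE TABLE scratch_step AS SELECT 1 AS col1;
SELECT * FROM scratch_step;
-- clean up once the query is done
DROP TABLE scratch_step;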

How Do I Deep Copy a Set of Data, and Change FK References to Point to All the Copies?

Suppose I have Table A and Table B. Table B references Table A. I want to deep copy a set of rows in Table A and Table B. I want all of the new Table B rows to reference the new Table A rows.
Note that I'm not copying the rows into any other tables. The rows in table A will be copied into table A, and the rows in table B will be copied into table B.
How can I ensure that the foreign key references get readjusted as part of the copy?
To clarify, I'm trying to find a generic way to do this. The example I'm giving involves two tables, but in practice the dependency graph may be much more complicated. Even a generic way to dynamically generate SQL to do the work would be fine.
UPDATE:
People are asking why this is necessary, so I'll give some background. It may be way too much, but here goes:
I'm working with an old desktop application that's been moved to a client-server model. But, the application still uses a rudimentary in-house binary file format for storing data for its tables. A data file is just a header followed by a series of rows, each of which is just the binary serialized field values, the order of which is determined by a schema text file. The only thing good about it is that it's very fast. It's terrible in every other respect. I'm moving the application to SQL Server and trying not to degrade the performance too badly.
This is a kind of scheduling application; the data's not critical to anybody, and there's no audit tracking, etc. necessary. It's not a supermassive amount of data, and we don't necessarily need to keep very old data around if the database grows too large.
One feature that they are accustomed to is the ability to duplicate entire schedules in order to create "what-if" scenarios that they can muck with. Any user can do this as many times as they want, as often as they want. In the old database, the data files for each schedule are stored in their own data folder, identified by name. So, copying a schedule was as simple as copying the data folder and renaming it.
I must be able to do effectively the same thing with SQL Server or the migration will not work. Maybe you're thinking that I could copy only the data that actually gets changed in order to avoid redundancy, but that honestly sounds too complicated to be feasible.
To throw another wrench into the mix, there can be a hierarchy of schedule data folders. So, a data folder may contain a data folder, which may contain a data folder. And the copying can occur at any level.
In SQL Server, I'm implementing a nested set hierarchy to mimic this. I have a DATA_SET table like this:
CREATE TABLE dbo.DATA_SET
(
DATA_SET_ID UNIQUEIDENTIFIER PRIMARY KEY,
NAME NVARCHAR(128) NOT NULL,
LFT INT NOT NULL,
RGT INT NOT NULL
)
So, there's a tree structure of data sets. Each data set represents a schedule, and may contain child data sets. Every row in every table has a DATA_SET_ID FK reference, indicating which data set it belongs to. Whenever I copy a data set, I copy all the rows belonging to that data set (and to every data set beneath it) into the same tables, but referencing new data sets.
So, here's a simple concrete example:
CREATE TABLE FOO
(
FOO_ID BIGINT PRIMARY KEY,
DATA_SET_ID BIGINT NOT NULL FOREIGN KEY REFERENCES DATA_SET(DATA_SET_ID)
)
CREATE TABLE BAR
(
BAR_ID BIGINT PRIMARY KEY,
DATA_SET_ID BIGINT NOT NULL FOREIGN KEY REFERENCES DATA_SET(DATA_SET_ID),
FOO_ID BIGINT NOT NULL FOREIGN KEY REFERENCES FOO(FOO_ID)
)
INSERT INTO FOO
SELECT 1, 1 UNION ALL
SELECT 2, 1 UNION ALL
SELECT 3, 1
INSERT INTO BAR
SELECT 1, 1, 1 UNION ALL
SELECT 2, 1, 2 UNION ALL
SELECT 3, 1, 3
So, let's say I copy data set 1 into a new data set of ID 2. After I copy, the tables will look like this:
FOO
FOO_ID, DATA_SET_ID
1 1
2 1
3 1
4 2
5 2
6 2
BAR
BAR_ID, DATA_SET_ID, FOO_ID
1 1 1
2 1 2
3 1 3
4 2 4
5 2 5
6 2 6
As you can see, the new BAR rows are referencing the new FOO rows. It's not the rewiring of the DATA_SET_ID's that I'm asking about. I'm asking about rewiring the foreign keys in general.
So, that was surely too much information, but there you go.
I'm sure there are a lot of concerns about performance with the idea of bulk copying the data like this. The tables are not going to be huge. I'm not expecting more than 1000 records in any table, and most of the tables will be much much smaller than that. Old data sets can be deleted outright with no repercussions.
Thanks,
Tedderz
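In SQL Server (which the DDL above suggests), one hedged way to capture the old-to-new key mapping during the copy is MERGE with an OUTPUT clause, since INSERT ... OUTPUT cannot reference source columns but MERGE ... OUTPUT can. This is a sketch against the FOO/BAR example, assuming FOO_ID and BAR_ID are IDENTITY columns in the real schema:
DECLARE @foo_map TABLE (old_id BIGINT, new_id BIGINT);
-- copy FOO rows of data set 1 into data set 2, capturing old -> new ids
MERGE INTO FOO AS tgt
USING (SELECT FOO_ID FROM FOO WHERE DATA_SET_ID = 1) AS src
    ON 1 = 0                       -- never matches, so every source row is inserted
WHEN NOT MATCHED THEN
    INSERT (DATA_SET_ID) VALUES (2)
OUTPUT src.FOO_ID, inserted.FOO_ID INTO @foo_map (old_id, new_id);
-- copy BAR rows, rewiring FOO_ID through the captured mapping
INSERT INTO BAR (DATA_SET_ID, FOO_ID)
SELECT 2, m.new_id
FROM BAR AS b
JOIN @foo_map AS m ON b.FOO_ID = m.old_id
WHERE b.DATA_SET_ID = 1;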
Here is an example with three tables that can probably get you started.
DB schema
CREATE TABLE users
(user_id int auto_increment PRIMARY KEY,
user_name varchar(32));
CREATE TABLE agenda
(agenda_id int auto_increment PRIMARY KEY,
`user_id` int, `agenda_name` varchar(7));
CREATE TABLE events
(event_id int auto_increment PRIMARY KEY,
`agenda_id` int,
`event_name` varchar(8));
An SP to clone a user along with their agenda and events records
DELIMITER $$
CREATE PROCEDURE clone_user(IN uid INT)
BEGIN
DECLARE last_user_id INT DEFAULT 0;
INSERT INTO users (user_name)
SELECT user_name
FROM users
WHERE user_id = uid;
SET last_user_id = LAST_INSERT_ID();
INSERT INTO agenda (user_id, agenda_name)
SELECT last_user_id, agenda_name
FROM agenda
WHERE user_id = uid;
INSERT INTO events (agenda_id, event_name)
SELECT a3.agenda_id_new, e.event_name
FROM events e JOIN
(SELECT a1.agenda_id agenda_id_old,
a2.agenda_id agenda_id_new
FROM
(SELECT agenda_id, @n := @n + 1 n
FROM agenda, (SELECT @n := 0) n
WHERE user_id = uid
ORDER BY agenda_id) a1 JOIN
(SELECT agenda_id, @m := @m + 1 m
FROM agenda, (SELECT @m := 0) m
WHERE user_id = last_user_id
ORDER BY agenda_id) a2 ON a1.n = a2.m) a3
ON e.agenda_id = a3.agenda_id_old;
END$$
DELIMITER ;
To clone a user
CALL clone_user(3);
I recently found myself needing to solve a similar problem; that is, I needed to copy a set of rows in a table (Table A), as well as all of the rows in related tables which have foreign keys pointing to Table A's primary key. I was using Postgres, so the exact queries may differ, but the overall approach is the same. The biggest benefit of this approach is that it can be used recursively to go infinitely deep.
TL;DR: the approach looks like this:
1) Find all the related tables/columns of Table A.
2) Copy the necessary data into temporary tables.
3) Create a trigger and function to propagate primary key column updates to related foreign key columns in the temporary tables.
4) Update the primary key column in the temporary tables to the next value in the auto-increment sequence.
5) Re-insert the data back into the source tables, and drop the temporary tables/triggers/function.
1) The first step is to query the information schema to find all of the tables and columns which are referencing Table A. In Postgres this might look like the following:
SELECT tc.table_name, kcu.column_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu
ON ccu.constraint_name = tc.constraint_name
WHERE constraint_type = 'FOREIGN KEY'
AND ccu.table_name='<Table A>'
AND ccu.column_name='<Primary Key>'
2) Next we need to copy the data from Table A, and any other tables which reference Table A; let's say there is one called Table B. To start this process, let's create a temporary table for each of these tables and populate them with the data that we need to copy. This might look like the following:
CREATE TEMP TABLE temp_table_a AS (
SELECT * FROM <Table A> WHERE ...
);
CREATE TEMP TABLE temp_table_b AS (
SELECT * FROM <Table B> WHERE <Foreign Key> IN (
SELECT <Primary Key> FROM temp_table_a
)
);
3) We can now define a function that will cascade primary key column updates out to related foreign key columns, and a trigger which will execute whenever the primary key column changes. For example:
CREATE OR REPLACE FUNCTION cascade_temp_table_a_pk()
RETURNS trigger AS
$$
BEGIN
UPDATE <Temp Table B> SET <Foreign Key> = NEW.<Primary Key>
WHERE <Foreign Key> = OLD.<Primary Key>;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trigger_temp_table_a
AFTER UPDATE
ON <Temp Table A>
FOR EACH ROW
WHEN (OLD.<Primary Key> != NEW.<Primary Key>)
EXECUTE PROCEDURE cascade_temp_table_a_pk();
4) Now we just update the primary key column in <Temp Table A> to the next value of the sequence of the source table (<Table A>). This will activate the trigger, and the updates will be cascaded out to the foreign key columns in <Temp Table B>. In Postgres you can do the following:
UPDATE <Temp Table A>
SET <Primary Key> = nextval(pg_get_serial_sequence('<Table A>', '<Primary Key>'))
5) Insert the data from the temporary tables back into the source tables, and then drop the temporary tables, triggers, and function.
INSERT INTO <Table A> SELECT * FROM <Temp Table A>;
INSERT INTO <Table B> SELECT * FROM <Temp Table B>;
DROP TRIGGER trigger_temp_table_a ON <Temp Table A>;
DROP FUNCTION cascade_temp_table_a_pk();
It is possible to take this general approach and turn it into a script which can be called recursively in order to go infinitely deep. I ended up doing just that using Python (our application was using Django, so I was able to use the Django ORM to make some of this easier).