Say I run the following query on psql:
> select a.c1, b.c2 into temp_table from db.A as a inner join db.B as b
> on a.x = b.x limit 10;
I get the following message:
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using
column(s) named 'c1' as the Greenplum Database data distribution key
for this table.
HINT: The 'DISTRIBUTED BY' clause determines the
distribution of data. Make sure column(s) chosen are the optimal
data distribution key to minimize skew.
What is a DISTRIBUTED BY column?
Where is temp_table stored? Is it stored on my client or on the server?
DISTRIBUTED BY is how Greenplum determines which segment will store each row. Because Greenplum is an MPP database, most production deployments have multiple segment servers, so you usually want to make sure the distribution column is the column you will join on.
temp_table is a table that will be created for you on the Greenplum cluster. If you haven't set search_path to something else, it will be in the public schema.
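If you want to choose the key yourself rather than letting Greenplum default to the first column, a minimal sketch is to rewrite the SELECT ... INTO as CREATE TABLE ... AS with an explicit DISTRIBUTED BY; the choice of c1 here is only an example:
CREATE TABLE temp_table AS
SELECT a.c1, b.c2
FROM db.A AS a
INNER JOIN db.B AS b ON a.x = b.x
LIMIT 10
DISTRIBUTED BY (c1);  -- pick the column you usually join this table on
With an explicit clause the NOTICE goes away, and you control which column drives the distribution.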
For your first question, the DISTRIBUTED BY clause is used to tell the database server how to store the table's data on disk (see the Create Table documentation).
I did see one thing right away that could be wrong with the syntax of your JOIN clause, where you say on a.x = s.x -- there is no table referenced as s. Maybe your problem is as simple as changing this to on a.x = b.x?
As far as where the temp table is stored, I believe it is generally stored on the database server. This would be a question for your DBA, as it is a setup item when installing the database. You can always dump your data to a file on your computer and reload it at a later time if you want to save your results (without printing them).
As far as I know, a temp table is stored in memory. When there is not much data, it is faster, and using a temp table is recommended. On the other hand, because a temp table is stored in memory, it will consume a very large amount of memory if there is too much data; in that case it is recommended to use a regular table with a DISTRIBUTED BY clause, so the data is distributed across your cluster.
In addition, a temp table is created in a special schema, so you don't need to specify the schema name when creating it, and it only exists in the current connection; after you close the connection, PostgreSQL will drop the table automatically.
I would like to completely clear one table in my SQL Server database.
Unfortunately, the table is large (> 90GB). I am going to use the TRUNCATE statement.
The question is whether there is anything I should pay attention to beforehand?
I am also wondering if it will somehow affect the server's disk space (currently about 110 GB free)?
After all of this, DBCC SHRINKDATABASE will probably be necessary.
TRUNCATE TABLE is faster and uses fewer system and transaction log resources than DELETE with no WHERE clause. But if you need an even faster solution, you can create a new version of the table (table1), drop the old table, and rename table1 to the original name.
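A minimal T-SQL sketch of that create/drop/rename approach, assuming a table named dbo.big_table (the name is a placeholder) and that nothing else (foreign keys, views, permissions) depends on the old table:
-- Create an empty copy of the table's structure under a new name.
SELECT TOP (0) * INTO dbo.big_table_new FROM dbo.big_table;

-- Drop the old table and give the copy the original name.
DROP TABLE dbo.big_table;
EXEC sp_rename 'dbo.big_table_new', 'big_table';
Note that SELECT ... INTO does not copy indexes, constraints, or permissions, so you would have to recreate those on the new table.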
I am trying to do a full load of a table in BigQuery daily, as part of ETL. The target table has a dummy partition column of type integer and is clustered. I want the statement to be atomic, i.e. either it completely overwrites the table with the new data, or it rolls back to the old data if anything fails in between, and it keeps serving user queries with the old data until the table is completely overwritten.
One way of doing this is delete and insert, but BigQuery does not support multi-statement transactions.
I am thinking of using the statement below. Please let me know if this is atomic.
create or replace table table_1 partition by dummy_int cluster by dummy_column
as select col1, col2, col3 from stage_table1
Is there any way to write a COPY command directly that will copy data from one table and populate another table (ideally with some condition)?
What I have observed is that COPY performance is far better than INSERT INTO in Vertica, so I am trying to replace the INSERT INTO with a COPY command.
Thanks!!
What you want to do is an INSERT /*+ DIRECT */ INTO table2 SELECT ... FROM table1 WHERE .... The direct hint will make it do a direct load to ROS containers instead of through WOS. If you are doing large bulk loads, this would be fastest. If you are doing many small insert/selects like this, then it would be best to use WOS and leave out the DIRECT.
Another possibility would be to do a CREATE TABLE table2 AS SELECT ... FROM table1 WHERE ....
Finally, if you are really just copying all the data and not filtering (which I know isn't your question, but I'm including this for completeness), and the tables are partitioned, you can do a COPY_PARTITIONS_TO_TABLE, which will just create references from the source table's ROS containers to the target table. Any changes to the new table would reorganize the ROS containers over time (using the tuple mover, etc.); the containers wouldn't get cleaned up unless both tables reorganized them.
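A minimal sketch of those options, using the table names from the question; the filter and the partition range values ('2019', '2020') are placeholders you would replace with your own:
-- Direct load into ROS containers, with a filter.
INSERT /*+ DIRECT */ INTO table2 SELECT * FROM table1 WHERE some_condition;

-- Copy whole partitions by reference (assumes both tables share the same partition expression).
SELECT COPY_PARTITIONS_TO_TABLE('table1', '2019', '2020', 'table2');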
I have 1 million rows of data in a file, and I want to insert all the records into SQL Server. While inserting, I do some comparison with the existing data on the server; if the comparison is satisfied I update the existing record on the server, otherwise I insert the record from the file.
I'm currently doing this by looping in C#, which takes more than 3 hours to complete. Can anyone suggest ideas to improve the performance?
Thanks,
Xavier.
Check whether your database is in FULL or SIMPLE recovery mode:
SELECT recovery_model_desc
FROM sys.databases
WHERE name = 'MyDataBase';
If the database is in SIMPLE recovery mode, you can create a staging table right there. If it is in FULL mode, it is better to create the staging table in a separate database that uses the SIMPLE model.
Use any bulk insert operation/tool (for instance BCP, as already suggested) to load the file into the staging table.
Then insert only the rows from your staging table that do not exist in your target table (hope you know how to do it; a sketch follows below).
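A minimal T-SQL sketch of the last two steps; the file path, table names, key, and columns are all hypothetical, and the UPDATE covers the "comparison satisfied" case from the question:
-- Bulk load the file into the staging table.
BULK INSERT dbo.staging
FROM 'C:\data\records.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2, TABLOCK);

-- Update the rows that already exist in the target.
UPDATE t
SET    t.col1 = s.col1,
       t.col2 = s.col2
FROM   dbo.target AS t
JOIN   dbo.staging AS s ON s.id = t.id;

-- Insert the rows from staging that do not exist in the target yet.
INSERT INTO dbo.target (id, col1, col2)
SELECT s.id, s.col1, s.col2
FROM   dbo.staging AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.target AS t WHERE t.id = s.id);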
I am working on an in-house ETL solution, from db1 (Oracle) to db2 (Sybase). We need to transfer data incrementally (Change Data Capture?) into db2.
I have only read access to the tables, so I can't create any table or trigger in the Oracle db1.
The challenge I am facing is: how do I detect record deletion in Oracle?
The solution I can think of is to use an additional standalone/embedded db (e.g. Derby, H2, etc.). This db contains 2 tables, namely old_data and new_data.
old_data contains the primary key field from the table of interest in Oracle.
Every time the ETL process runs, the new_data table is populated with the primary key field from the Oracle table. After that, I run the following SQL command to get the deleted rows:
SELECT old_data.id FROM old_data WHERE old_data.id NOT IN (SELECT new_data.id FROM new_data)
I think this will be a very expensive operation when the volume of data becomes very large. Do you have any better idea for doing this?
Thanks.
Which edition of Oracle? If you have Enterprise Edition, look into Oracle Streams.
You can grab the deletes out of the REDO log rather than the database itself.
One approach you could take is using the Oracle flashback capability (if you're using version 9i or later):
http://forums.oracle.com/forums/thread.jspa?messageID=2608773
This will allow you to select from a prior database state.
Since there may not always be deleted records, you could be more efficient by:
Storing a row count with each query iteration.
Comparing that row count to the previous row count.
If they are different, you know you have a delete and you have to compare the current set with the historical data set from flashback (a sketch follows below). If not, then don't bother and you've saved a lot of cycles.
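A minimal sketch of that flashback comparison, assuming a table my_table keyed on id and a one-day look-back window (all of these are placeholders):
-- Keys that existed a day ago but are gone now, i.e. the deleted rows.
SELECT id
FROM my_table AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '1' DAY)
MINUS
SELECT id
FROM my_table;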
A quick note on your solution if flashback isn't an option: I don't think your select query is a big deal - it's all those inserts to populate those side tables that will really take a lot of time. Why not just run that query against the Sybase production server before doing your update?