Transfer files from one table to another in impala - hive

I have two tables in impala and I want to move the data from one to another.
Both tables have HDFS paths like
/user/hive/db/table1/partitiona/partitionb/partitionc/file (and the same for table2).
I know the procedure with INSERT INTO to move the data from one table to another.
What I do not know is whether I also need to move the files in the HDFS paths, or whether this happens automatically with the INSERT INTO statement.
Also, if a table is defined as sorted in its creation settings, will any data inserted into it be sorted too?

It happens automatically and is done by Hive/Impala. When you do INSERT INTO table1 SELECT * FROM table2, the data read from table2's location (/user/hive/db/table2/partitiona/partitionb/partitionc/...) is written as new files under table1's location (/user/hive/db/table1/partitiona/partitionb/partitionc/...).
You do not have to move anything yourself. You may need to analyze table1 (e.g. COMPUTE STATS in Impala) for better performance.
As for your second question: if you use SORT BY while creating table1, then data inserted into table1 will be sorted automatically, regardless of whether the data in table2 is sorted.
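A minimal sketch of both points, in Impala syntax (the column names and partition columns here are invented placeholders, not taken from the question):

CREATE TABLE table1 (id BIGINT, payload STRING)
PARTITIONED BY (partitiona STRING, partitionb STRING, partitionc STRING)
SORT BY (id)
STORED AS PARQUET;

-- the new data files are written under /user/hive/db/table1/... automatically;
-- for a dynamic-partition insert the partition columns come last in the SELECT
INSERT INTO table1 PARTITION (partitiona, partitionb, partitionc)
SELECT id, payload, partitiona, partitionb, partitionc FROM table2;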

Related

Keeping BigQuery table data up-to-date

This is probably an incorrect use case for BigQuery, but I have the following problem: I need to periodically update a BigQuery table. The update should be "atomic" in the sense that clients which read the data should use either only the old version of the data or the completely new version. The only solution I have now is to use date partitions. The problem with this solution is that clients which just need to read up-to-date data have to know about the partitions and get data only from certain partitions. Every time I want to make a query I would first have to figure out which partition to use and only then select from the table. Is there any way to improve this? Ideally I would like the solution to be easy and transparent for clients who read the data.
You didn't mention the size of your update, so I can only give some general guidelines.
Most BigQuery updates, including a single DML statement (INSERT/UPDATE/DELETE/MERGE) and a single load job, are atomic. Your readers see either the old data or the new data.
Lacking multi-statement transactions right now, if you have updates that don't fit into a single load job, the solution is:
Load the updates into a staging table; after all loads have finished,
use a single INSERT or MERGE to merge the updates from the staging table into the primary data table (a MERGE sketch follows below).
The drawback: scanning the staging table is not free.
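As a rough sketch of that merge step, assuming the primary table is keyed on an id column (the dataset, table and column names below are illustrative only):

MERGE mydataset.primary_table T
USING mydataset.staging_table S
ON T.id = S.id
WHEN MATCHED THEN
  UPDATE SET payload = S.payload
WHEN NOT MATCHED THEN
  INSERT (id, payload) VALUES (id, payload);

A single MERGE like this is itself atomic, so readers never see a half-applied batch.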
Update: since you have multiple tables to update atomically, there is a tiny trick which may be helpful.
Assuming each table you need to update has an ActivePartition column as its partition key, you can keep a table with only one row:
CREATE TABLE ActivePartition (active DATE);
Each time after loading, you set ActivePartition.active to the new active date, and your users then use a script like:
DECLARE active DATE DEFAULT (SELECT active FROM ActivePartition);
-- Actual query
SELECT ... FROM dataTable WHERE ActivePartition = active
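On the load side, the switch-over is then a single one-row update (the date literal is just a placeholder):

-- run after the new partition has finished loading
UPDATE ActivePartition SET active = DATE '2021-06-01' WHERE TRUE;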

Refreshing SQL Server table

Background
I have a huge table (Table A) in my database. I apply some filters to it based on business rules and load the filtered rows into another table (Table B).
Therefore, Table B will always contain data from A and will always be much smaller. Table A contains 500,000 entries and Table B contains 3,000 entries.
Both tables have the same structure.
Issue
The huge table (Table A) can be updated at any moment without notice. Therefore, to ensure that Table B contains the most up-to-date business data, it needs to be refreshed regularly. In this instance, I do this once a week.
Solution
How I go about this is by setting up a simple stored procedure that does the following:
TRUNCATE Table B
Apply filters on Table A and INSERT data into Table B
The issue with this is the fact that I have to truncate the table every week. I was wondering if there is a more efficient way of doing this, rather than deleting all the data and inserting it into the table all over again?
Is there a way to check which values are in A that are missing in B and add them accordingly?
Thanks
Instead of doing that in a stored procedure, create a view. A view stays in sync with the underlying table: as the data changes in Table A, the view automatically reflects it.
Here are some details you may want to read about views and how they work: How to create Views.
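A minimal sketch, assuming Table A lives at dbo.TableA and the business rules reduce to a WHERE clause (the object names and the predicate are placeholders):

CREATE VIEW dbo.FilteredTableA
AS
SELECT *
FROM dbo.TableA
WHERE SomeBusinessRuleColumn = 'SomeValue';

Queries that used to hit Table B can then read from dbo.FilteredTableA and always see current data.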

Postgres: UPSERT or INCREMENTAL update of a large table

I have a following problem.
We have two systems, an OLTP system and our Data Warehouse, both on PostgreSQL. I need to sync one table from the OLTP system to the DWH so that they are identical. I am using a Foreign Data Wrapper for communication between the databases, so the sync could be straightforward and look like this:
TRUNCATE TABLE target_table;
INSERT INTO target_table SELECT * FROM source_table;
Unfortunately, the table is huge for the hardware we use (hundreds of millions of rows, and about 50GB of data), so this is massively time consuming.
I have then figured out a way to do it faster by doing an incremental upsert, based on the following approach (sketched below).
Identify new rows in source table by taking rows with insert_time > SELECT max(insert_time) FROM target_table;.
Identify rows, that need to be updated, by joining both tables on id and source_table.update_time > target_table.update_time.
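In SQL those two steps could look roughly like this (col1/col2 stand in for the real columns; only id, insert_time and update_time come from the description above):

-- 1) new rows
-- (on an empty target, max() is NULL and nothing matches; coalesce with a floor date for the first load)
INSERT INTO target_table
SELECT * FROM source_table s
WHERE s.insert_time > (SELECT max(insert_time) FROM target_table);

-- 2) changed rows
UPDATE target_table t
SET col1 = s.col1,
    col2 = s.col2,
    update_time = s.update_time
FROM source_table s
WHERE s.id = t.id
  AND s.update_time > t.update_time;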
That worked absolutely fine; however, a new problem arose - rows in the source table can be removed, so I need to remove the corresponding rows from the target table. All the approaches I tried are way too slow, like this one for example:
CREATE VIEW deleted_rows AS SELECT
target_table.id
FROM target_table
LEFT JOIN source_table USING (id)
WHERE source_table.id IS NULL;
Any idea how to solve this problem smartly and quickly? Thanks in advance for your responses.

Finding changed records in a database table

I have a problem that I haven't been able to come up with a solution for yet. I have a database (actually thousands of them at customer sites) that I want to extract data from periodically. I'd like to do a full data extract one time (select * from table) then after that only get rows that have changed.
The challenge is that there aren't any updated date columns in most of the tables that could be used to constrain the SQL query. I can't use a trigger based approach nor change the application that writes to the database since it's another group that develops the app and they are way backed up already.
I may be able to write to the database tables when doing the data extract, but would prefer not to do that. Does anyone have any ideas for how we might be able to do this?
You will have to programmatically mark the records. I see suggestions of an auto-incrementing field, but that will only catch newly inserted records. How will you track updated or deleted records?
If you only want newly inserted records, then an auto-incrementing field will do the job; in subsequent data dumps, grab everything since the last value of the auto-incrementing field and then record the current value.
If you want updates, the minimum I can see is to have a last_update field and probably a trigger to populate it. If the last_update is later than the last data dump, grab that record. This will get inserts and updates but not deletes.
You could try something like an 'instead of delete' trigger, if your RDBMS supports it, and NULL the last_update field. On subsequent data dumps grab all records where this field is NULL and then delete them. But there would be problems with this (e.g. how to stop the app seeing them between the logical and physical delete).
The most foolproof method I can see is a set of history (audit) tables where each change gets written to them. Then you select your data dump from there.
By the way, do you only care about knowing that updates have happened? What if 2 (or more) updates have happened? The history table is the only way I can see of capturing that scenario.
This should isolate rows that have changed since your last backup. This assumes DestinationTable is a copy of SourceTable, key fields included; if not, you could list out the important fields.
SELECT * FROM SourceTable
EXCEPT
SELECT * FROM DestinationTable
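If you want to materialize that difference for the extract, the same query can feed a SELECT INTO (ChangesTable is a placeholder name; note that a row whose values changed shows up here as a whole new version of the row):

SELECT * INTO ChangesTable
FROM (
    SELECT * FROM SourceTable
    EXCEPT
    SELECT * FROM DestinationTable
) AS diff;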

Delete All / Bulk Insert

First off let me say I am running on SQL Server 2005 so I don't have access to MERGE.
I have a table with ~150k rows that I am updating daily from a text file. As rows fall out of the text file I need to delete them from the database and if they change or are new I need to update/insert accordingly.
After some testing I've found that performance wise it is exponentially faster to do a full delete and then bulk insert from the text file rather than read through the file line by line doing an update/insert. However I recently came across some posts discussing mimicking the MERGE functionality of SQL Server 2008 using a temp table and the output of the UPDATE statement.
I was interested in this because I am looking into how I can eliminate the time in my Delete/Bulk Insert method when the table has no rows. I still think that this method will be the fastest so I am looking for the best way to solve the empty table problem.
Thanks
I think your fastest method would be to (a T-SQL sketch follows the list):
Drop all foreign keys and indexes from your table.
Truncate your table.
Bulk insert your data.
Recreate your foreign keys and indexes.
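As a rough T-SQL sketch of those steps (the constraint, index, column and table names are all placeholders):

-- 1) drop foreign keys and indexes
ALTER TABLE dbo.ImportTable DROP CONSTRAINT FK_ImportTable_Parent;
DROP INDEX IX_ImportTable_Code ON dbo.ImportTable;

-- 2) truncate
TRUNCATE TABLE dbo.ImportTable;

-- 3) bulk insert
BULK INSERT dbo.ImportTable FROM N'C:\datafile.txt';

-- 4) recreate indexes and foreign keys
CREATE INDEX IX_ImportTable_Code ON dbo.ImportTable (Code);
ALTER TABLE dbo.ImportTable ADD CONSTRAINT FK_ImportTable_Parent
    FOREIGN KEY (ParentId) REFERENCES dbo.Parent (ParentId);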
Is the problem that Joe's solution is not fast enough, or that you can not have any activity against the target table while your process runs? If you just need to prevent users from running queries against your target table, you should contain your process within a transaction block. This way, when your TRUNCATE TABLE executes, it will create a table lock that will be held for the duration of the transaction, like so:
begin tran;
truncate table stage_table
bulk insert stage_table
from N'C:\datafile.txt'
commit tran;
An alternative solution which would satisfy your requirement of not having "down time" for the table you are updating.
It sounds like originally you were reading the file and doing an INSERT/UPDATE/DELETE one row at a time. A more performant approach, one that does not involve clearing down the table, is as follows:
1) bulk load the file into a new, separate table (no indexes)
2) then create the PK on it
3) Run 3 statements to update the original table from this new (temporary) table (sketched after this list):
DELETE rows in the main table that don't exist in the new table
UPDATE rows in the main table where there is a matching row in the new table
INSERT rows into main table from the new table where they don't already exist
This will perform better than row-by-row operations and should hopefully satisfy your overall requirements
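On SQL Server 2005 those three statements could look roughly like this (MainTable, Staging and the columns are placeholders; ID is assumed to be the key created in step 2):

-- 1) delete rows that no longer appear in the file
DELETE m
FROM MainTable m
WHERE NOT EXISTS (SELECT 1 FROM Staging s WHERE s.ID = m.ID);

-- 2) update rows that still exist
UPDATE m
SET m.Col1 = s.Col1,
    m.Col2 = s.Col2
FROM MainTable m
JOIN Staging s ON s.ID = m.ID;

-- 3) insert rows that are new
INSERT INTO MainTable (ID, Col1, Col2)
SELECT s.ID, s.Col1, s.Col2
FROM Staging s
WHERE NOT EXISTS (SELECT 1 FROM MainTable m WHERE m.ID = s.ID);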
There is a way to update the table with zero downtime: keep two days' data in the table, and delete the old rows after loading the new ones!
Add a DataDate column representing the date for which your ~150K rows are valid.
Create a one-row, one-column table with "today's" DataDate.
Create a view of the two tables that selects only rows matching the row in the DataDate table. Index it if you like. Readers will now refer to this view, not the table.
Bulk insert the rows. (You'll obviously need to add the DataDate to each row.)
Update the DataDate table. The view updates instantly!
Delete yesterday's rows at your leisure.
SELECT performance won't suffer; joining one row to 150,000 rows along the primary key should present no problem to any server less than 15 years old.
I have used this technique often, and have also struggled with processes that relied on sp_rename. Production processes that modify the schema are a headache. Don't.
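A sketch of the moving parts (the object names and date values are invented; DATETIME is used because the DATE type does not exist on SQL Server 2005):

-- one-row, one-column table holding the active DataDate
CREATE TABLE dbo.ActiveDataDate (DataDate DATETIME NOT NULL);
INSERT INTO dbo.ActiveDataDate (DataDate) VALUES ('20240101');
GO

-- readers query this view, never the base table directly
CREATE VIEW dbo.CurrentRows
AS
SELECT r.*
FROM dbo.DailyRows r
JOIN dbo.ActiveDataDate a ON a.DataDate = r.DataDate;
GO

-- after bulk inserting the new day's rows (tagged with the new DataDate):
UPDATE dbo.ActiveDataDate SET DataDate = '20240102';   -- readers switch instantly

-- later, at your leisure:
DELETE FROM dbo.DailyRows WHERE DataDate < (SELECT DataDate FROM dbo.ActiveDataDate);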
For raw speed, I think with ~150K rows in the table, I'd just drop the table, recreate it from scratch (without indexes) and then bulk load afresh. Once the bulk load has been done, then create the indexes.
This assumes, of course, that having a period of time when the table is empty or doesn't exist is acceptable, which it sounds like could be the case.