How to handle a cycle in dbt?

I need to add delta data to table x, so I need to create a cycle: I read table x, then I find the delta between the source and table x, and after that I insert the delta into table x. How do I do this in dbt? Is it possible to break the cycle?

I changed it from ref to source: instead of using ref, I use source, and I created another schema for the source, just to break the cycle in dbt. So basically, one table has ref and source at the same time.

What you need is an incremental model; have a look at:
incremental models in dbt

If the full-scan approach is fine, then you can just create a query:
select col1, col2, col3... from source_table
minus
select col1, col2, col3... from x
This will give you all the records that are in source_table but not in table x.
If a full scan is not fine, then you need to make sure that you have an updated_timestamp column in your source table, and based on that you can write an incremental model.
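For reference, a minimal sketch of such an incremental model; the source, table, and column names here are assumptions, not taken from the question:

{{ config(materialized='incremental', unique_key='id') }}

select id, col1, col2, updated_timestamp
from {{ source('my_source', 'source_table') }}

{% if is_incremental() %}
-- on incremental runs, only pull rows newer than what this model already holds
where updated_timestamp > (select max(updated_timestamp) from {{ this }})
{% endif %}

Because the filter reads from {{ this }} rather than ref()-ing the model itself, dbt's DAG stays acyclic.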


How to delete customer information from HDFS

Suppose I have several customers today, so I am storing their information: customer_id, customer_name, customer_emailid, etc. If a customer leaves and wants his personal information removed from my HDFS, I have the two approaches below to achieve that.
Approach 1:
1. Create an internal table on top of HDFS.
2. Create an external table from the first table using filter logic.
3. While creating the second table, apply UDFs on specific columns for more column filtering.
Approach 2:
Spark => read, filter, write
Is there any other solution?
Approach 2 is also possible in Hive: select, filter, write.
Create a table on top of the directory in HDFS (external or managed does not matter in this context; better external if you are going to drop the table later and keep the data as is). Then insert overwrite the table or partition from a select with a filter:
insert overwrite table mytable
select *
from mytable -- the same table
where customer_id not in (...) -- filter out the rows to delete
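For completeness, a hedged sketch of the external table behind that statement; the columns and HDFS path are assumptions:

create external table mytable (
  customer_id bigint,
  customer_name string,
  customer_emailid string
)
stored as textfile
location '/data/customers'; -- assumed HDFS path

After the insert overwrite runs, the files under that location contain only the remaining rows.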

Hive - Huge 10TB table repartitioning (Adding new partition columns)

Techies,
Background -
We have an existing 10 TB Hive table which has been range-partitioned on column A. The business case has changed and now requires adding a partition column B in addition to column A.
Problem statement -
Since the data on HDFS is too huge and needs to be restructured to inherit the new partition column B, we are facing difficulty copying the table to a backup and re-ingesting it into the main table using a simple IMPALA INSERT OVERWRITE.
We want to explore whether there is an efficient way to handle adding partition columns to such a huge table.
Alright!
If I understand your situation correctly, you have a table backed by 10 TB of data in HDFS, partitioned on column A, and you want to add a partition on column B as well.
So, if column B is going to be the sub-partition, the HDFS directory would look like user/hive/warehouse/database/table/colA/colB, or /colB/colA otherwise (considering it as a managed table).
Restructuring the HDFS directories manually won't be a great idea, because it would become a nightmare to scan the data in all the files and organize them into their corresponding folders.
Below is one way of doing it:
1. Create a new table with the new structure, i.e., with partitions on both col A and col B.
CREATE TABLE NEWTABLE ( COLUMNS ... ) PARTITIONED BY ( COL_A INT, COL_B INT )
2.a. Insert data from the old table into the new table (created in step 1) like below:
INSERT INTO NEWTABLE
SELECT * FROM OLDTABLE
But yes, this step is going to consume a lot of resources during execution if not handled properly: space in HDFS for storing the results as data for NEWTABLE, and of course time.
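As a side note, Hive typically needs dynamic partitioning enabled before such an insert; a hedged sketch, where the column list is an assumption and the partition columns must come last in the select:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE NEWTABLE PARTITION (COL_A, COL_B)
SELECT COL1, COL2, COL_A, COL_B -- partition columns must come last
FROM OLDTABLE;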
OR
2.b. If you think that HDFS will not have enough space to hold all the data, or there is a resource crunch, I would suggest you do this INSERT in batches, with removal of the old data after each INSERT operation:
INSERT INTO NEWTABLE
SELECT * FROM OLDTABLE
WHERE COL_A='abc'

DELETE FROM OLDTABLE
WHERE COL_A='abc'

INSERT INTO NEWTABLE
SELECT * FROM OLDTABLE
WHERE COL_A='def'

DELETE FROM OLDTABLE
WHERE COL_A='def'

...and so on, one batch per value of COL_A.
This way, you can free HDFS of the already-handled data as you go, balancing the space.
If you follow step 2.b., you can write a script to automate this process by passing the partition names (derived from SHOW PARTITIONS) dynamically for each run; a sketch of one parameterized batch follows. But try the first couple of batches manually before automating, to make sure things go as expected.
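A hedged template for one such batch, driven by a Hive variable; the variable name and file name are assumptions:

-- batch.hql; run once per value from SHOW PARTITIONS OLDTABLE, e.g.:
-- beeline --hivevar part=abc -f batch.hql
INSERT INTO NEWTABLE
SELECT * FROM OLDTABLE
WHERE COL_A = '${hivevar:part}';

DELETE FROM OLDTABLE
WHERE COL_A = '${hivevar:part}';

Note that DELETE with a WHERE clause only works on transactional (ACID) tables; if OLDTABLE is not transactional, ALTER TABLE OLDTABLE DROP PARTITION (COL_A = '${hivevar:part}') is the usual substitute.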
Let me know if it helps!

SQL: Does a temp table return different results when called later if the base table changes?

Let's say I have the following:
--Section 1
SELECT * INTO #TEMP1 FROM TABLEA
--Section 2
/*more code here*/
--Section 3
SELECT * FROM #TEMP1
During the processing of Section 2, TABLEA changes significantly. At Section 3, will my temp table return the results of the original SELECT INTO or will it now return the current results from the table?
As #TaPaKaHUA stated, the data in #TEMP1 would not change; it is persisted. It's not like a pointer in some programming languages that only references a memory location. You could test this easily, though, by running select * from #TEMP1 before and after section 2 of your code, and you'll see the data is identical. You'll want to use an ORDER BY, though, so the data is returned in the same order, making comparison easier; otherwise, the data could be returned in an arbitrary order. More on that in this blog post if you are interested.
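A minimal sketch of that test; the column names here (SOME_COL, ID) are hypothetical:

SELECT * INTO #TEMP1 FROM TABLEA;

-- simulate Section 2 changing the base table
UPDATE TABLEA SET SOME_COL = 'changed'; -- SOME_COL is a hypothetical column

-- the temp table still returns the rows captured at SELECT ... INTO time
SELECT * FROM #TEMP1 ORDER BY ID; -- ID is a hypothetical key column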
A temp table has no link or "knowledge" of how it got populated. Indeed, SELECT ... INTO is only one means of populating a temp table. It can be populated by multiple different statements, have some of its data UPDATEd or DELETEd, etc.
It doesn't care where the data came from, it just stores the rows you insert into it. In most respects, it's just like any other table.
If you want something that "stores" a query rather than a set of results (that is, if you had wanted it to always reflect the current state), you can get something like that by defining a cursor (with appropriate options set so that it is a dynamic cursor).
But almost always, a cursor is the wrong answer to whatever question is asked, and it's not directly usable in a later query in the same way that a temp table is.
No. A table is a snapshot from when it is created, unless you have explicitly changed the contents with DML ("data manipulation language": insert, delete, update, truncate).
If you want something that changes with the base table, then use a view:
create view v_tableA as
select a.*
from TABLEA a;
This will be recalculated when called, so it "changes" with the base table(s).

Huge database in Oracle

I have about 20,000,000 records in a table (random data), and I added an empty column to that table.
But when I update the table to fill that column, the process breaks down.
I tried using a cursor and an index, but with no results.
Do you have any fast solution, or any alternative solution?
Thank you in advance :)
Maybe the fastest way would be to create a new table as select * from the existing table (CTAS), and calculate the value of the new column inside the select statement of the CTAS. After that, you can rename the old table to something like table_bckp, then rename the new table to the original table name, and then apply constraints, indexes, and other scripts previously saved from the old table's definition. A sketch follows.
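A hedged sketch of that CTAS-and-rename approach; the table and column names, and the expression that computes the new column, are assumptions:

CREATE TABLE big_table_new NOLOGGING AS
SELECT t.*,
       UPPER(t.some_col) AS new_col -- compute the new column's value here
FROM big_table t;

ALTER TABLE big_table RENAME TO table_bckp;
ALTER TABLE big_table_new RENAME TO big_table;
-- then re-apply constraints, indexes, grants, and triggers from the old definition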

Copy data from a table and load it into another table

I have a table 'A' with 40 columns. I need to copy the data from 20 specific columns of 'A' to another table 'B' having those 20 columns. There will be around 3-10 million records.
What will be the most efficient way to do this in PL/SQL?
"daily table B will be truncated and new data will be inserted into it
from A."
Okay, so the most efficient way to do this is not to do it. Use a materialized view instead; a materialized view log on table A will allow you to capture incremental changes and apply them daily, or at any other window you like. Find out more.
Compared to that approach using handrolled PL/SQL - or even pure SQL - is laughably inefficient.
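A hedged sketch of that setup; the column names and refresh options are assumptions, and a fast refresh places additional requirements on the materialized view log:

CREATE MATERIALIZED VIEW LOG ON a WITH PRIMARY KEY;

CREATE MATERIALIZED VIEW b_mv
  REFRESH FAST ON DEMAND
AS SELECT id, c1, c2 /* ... the 20 columns ... */ FROM a;

-- refresh on whatever schedule suits, e.g. daily from a scheduler job:
BEGIN
  DBMS_MVIEW.REFRESH('B_MV', method => 'F'); -- 'F' = fast (incremental) refresh
END;
/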
Do you need to do any sort of conversion on the data or is it just copying data straight from one table to another?
The easiest way to do this is the following, although you would have to create the indexes separately:
create table B as (select A.c1, A.c2, A.c3..... from A);
If table B already exists, you could just do a
insert into B select A.c1, A.c2.... from A
To speed this up, you would want to drop all the indexes on table B until the insert is done.
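As a further hedged suggestion, not from the original answer: Oracle's direct-path insert hint is a common speed-up for bulk copies like this.

INSERT /*+ APPEND */ INTO B (c1, c2 /* ... */)
SELECT A.c1, A.c2 /* ... */ FROM A;
COMMIT; -- direct-path inserted rows are not readable until you commit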