when should we go for a external table and internal table in Hive? - hive

I understand the difference between Internal tables and external tables in hive as below
1) if we drop the internal Table File and metadata will be deleted, however , in case of External only metadata will be
deleted
2) if the file data need to be shared by other tools/applications then we go for external table if not
internal table, so that if we drop the table(external) data will still be available for other tools/applications
I have gone through the answers for question "Difference between Hive internal tables and external tables? "
but still I am not clear about the proper uses cases for Internal Table
so my question is why is that I need to make an Internal table ? why cant I make everything as External table?

Use EXTERNAL tables when:
The data is also used outside of Hive.
For example, the data files are read and processed by an existing program that doesn't lock the files.
The data is permanent i.e used when needed.
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the lifecycle of the table and data.

Let's understand it with two simple scenarios:
Suppose you have a data set, and you have to perform some analytics/problem statements on it. Because of the nature of problem statements, few of them can be done by HiveQL, few of them need Pig Latin and few of them need Map Reduce etc., to get the job done. In this situation External Table comes into picture- the same data set can be used to solve entire analytics instead of having different different copies of same data set for the different different tools. Here Hive don't need authority on the data set because several tools are going to use it.
There can be a scenario, where entire analytics/problem statements can be solved by only HiveQL. In such situation Internal Table comes into picture- Means you can put the entire data set into Hive's Warehouse and Hive is going to have complete authority on the data set.

Related

Checking of replicated data Pentaho

I have about 100 tables to which we replicate data, e.g. from the Oracle database.
I would like to quickly check that the data replicated to the tables in db2 is the same as in the source system.
Does anyone have a way to do this? I can create 100 transformations, but that's monotonous and time consuming. I would prefer to process this in a loop.
I thought I would keep the queries in a table and reach into it for records.
I read the data from Table input (sql_db2, sql_source, table_name) and write do copy rows to result. Next I read single record and I read a single record and put it into a loop.
But here came a problem because I don't know how to dynamically compare the data for the tables. Each table has different columns and here I have a problem.
I don't know if this is also possible?
You can inject metadata (in this case your metadata would be the column and table names) to a lot of steps in Pentaho, you create a transformation to collect the metadata to inject to another transformation that has only the steps and some basic information, but the bulk of the information of the columns affected by the different steps is in the transformation injecting the metadata.
Check Pentaho official documentation about Metadata Injection (MDI) and the sample with a basic example of metadata injection available in your PDI installation.

How to pass dynamic table names for sink database in Azure Data Factory

I am trying to copy tables from one schema to another with the same Azure SQL db. So far, I have created a lookup pipeline and passed the parameters for the for each loop and copy activity. But my sink dataset is not taking the parameter value I have given under "table option" field rather it is taking the dummy table I chose when creating the sink dataset. Can someone tell how can I pass dynamic table name to a sink dataset?
I have given concat('dest_schema.STG_',#{item().table_name})} in the table option field.
To make the schema and table names dynamic, add Parameters to the Dataset:
Most important - do NOT import a schema. If you already have one defined in the Dataset, clear it. For this Dataset to be dynamic, you don't want improper schemas interfering with the process.
In the Copy activity, provide the values at runtime. These can be hardcoded, variables, parameters, or expressions, so very flexible.
If it's the same database, you can even use the same Dataset for both, just provide different values for the Source and Sink.
WARNING: If you use the "Auto-create table" option, the schema for the new table will define any character field as varchar(8000), which can cause serious performance problems.
MY OPINION:
While you can do this, one of my personal rules is to not cross the database boundary. If the Source and Sink are on the same SQL database, I would try to solve this problem with a Stored Procedure rather than a data factory.

Is there a way to merge ORC files in HDFS without using ALTER TABLE CONCATENATE command?

This is my first week with Hive and HDFS, so please bear with me.
Almost all the ways I saw so far to merge multiple ORC files suggest using ALTER TABLE with CONCATENATE command.
But I need to merge multiple ORC files of the same table without having to ALTER the table. Another option is to create a copy of the existing table and then use ALTER TABLE on that so that my original table remains unchanged. But I can't do that as well because space and data redundancy reasons.
The thing I'm trying to achieve (ideally) is: I need to transport these ORCs as one file per table into a cloud environment. So, is there a way that I can merge the ORCs on-the-go during the transfer process into cloud? Can this be achieved with/without Hive, maybe directly in HDFS?
Two possible methods other than ALTER TABLE CONCATENATE:
Try to configure merge task, see details here: https://stackoverflow.com/a/45266244/2700344
Alternatively you can force single reducer. This method is quite applicable for not too big files. You can overwrite the same table with ORDER BY, this will force single reducer on the last ORDER BY stage. This will work slow or even fail with big files because all the data will be passed through single reducer:
INSERT OVERWRITE TABLE
SELECT * FROM TABLE
ORDER BY some_col; --this will force single reducer
As a side effect you will get better packed ORC file with efficient index on columns listed in order by.

SSIS : Huge Data Transfer from Source (SQL Server) to Destination (SQL Server)

Requirement :
Transfer millions of records from source (SQL Server) to destination (SQL Server).
Structure of source tables is different from destination tables.
Refresh data once per week in destination server.
Minimum amount of time for the processing.
I am looking for optimized approach using SSIS.
Was thinking these options :
Create Sql dump from source server and import that dump in destination server.
Directly copy the tables from source server to destination server.
Lots of issues to consider here. Such as are the servers in the same domain, on same network, etc.
Most of the time you will not want to move the data as a single large chunk of millions of records but in smaller amounts. An SSIS package handles that logic for you, but you can always recreate it as well but iterating the changes easier. Sometimes this is a reason to push changes more often rather than wait an entire week as smaller syncs are easier to manage with less downtime.
Another consideration is to be sure you understand your delta's and to ensure that you have ALL of the changes. For this reason I would generally suggest using a staging table at the destination server. By moving changes to staging and then loading to the final table you can more easily ensure that changes are applied correctly. Think of the scenario of a an increment being out of order (identity insert), datetime ordered incorrectly or 1 chunk failing. When using a staging table you don't have to rely solely on the id/date and can actually do joins on primary keys to look for changes.
Linked Servers proposed by Alex K. can be a great fit, but you will need to pay close attention to a couple of things. Always do it from Destination server so that it is a PULL not a push. Linked servers are fast at querying the data but horrible at updating/inserting in bulk. 1 XML column cannot be in the table at all. You may need to set some specific properties for distributed transactions.
I have done this task both ways and I would say that SSIS does give a bit of advantage over Linked Server just because of its robust error handling, threading logic, and ability to use different adapters (OLEDB, ODBC, etc. they have different performance do a search and you will find some results). But the key to your #4 is to do it in smaller chunks and from a staging table and if you can do it more often it is less likely to have an impact. E.g. daily means it would already be ~1/7th of the size as weekly assuming even daily distribution of changes.
Take 10,000,000 records changed a week.
Once weekly = 10mill
once daily = 1.4 mill
Once hourly = 59K records
Once Every 5 minutes = less than 5K records
And if it has to be once a week. just think about still doing it in small chunks so that each insert will have more minimal affect on your transaction logs, actual lock time on production table etc. Be sure that you never allow loading of a partially staged/transferred data otherwise identifying delta's could get messed up and you could end up missing changes/etc.
One other thought if this is a scenario like a reporting instance and you have enough server resources. You could bring over your entire table from production into a staging or update a copy of the table at destination and then simply do a drop of current table and rename the staging table. This is an extreme scenario and not one I generally like but it is possible and actual impact to the user would be very nominal.
I think SSIS is good at transfer data, my approach here:
1. Create a package with one Data Flow Task to transfer data. If the structure of two tables is different then it's okay, just map them.
2. Create a SQL Server Agent job to run your package every weekend
Also, feature Track Data Changes (SQL Server) is also good to take a look. You can config when you want to sync data and it's good at performance too
With SQL Server versions >2005, it has been my experience that a dump to a file with an export is equal to or slower than transferring data directly from table to table with SSIS.
That said, and in addition to the excellent points #Matt makes, this the usual pattern I follow for this sort of transfer.
Create a set of tables in your destination database that have the same table schemas as the tables in your source system.
I typically put these into their own database schema so their purpose is clear.
I also typically use the SSIS OLE DB Destination package's "New" button to create the tables.
Mind the square brackets on [Schema].[TableName] when editing the CREATE TABLE statement it provides.
Use SSIS Data Flow tasks to pull the data from the source to the replica tables in the destination.
This can be one package or many, depending on how many tables you're pulling over.
Create stored procedures in your destination database to transform the data into the shape it needs to be in the final tables.
Using SSIS data transformations is, almost without exception, less efficient than using server side SQL processing.
Use SSIS Execute SQL tasks to call the stored procedures.
Use parallel processing via Sequence Containers where possible to save time.
This can be one package or many, depending on how many tables you're transforming.
(Optional) If the transformations are complex, requiring intermediate data sets, you may want to create a separate Staging database schema for this step.
You will have to decide whether you want to use the stored procedures to land the data in your ultimate destination tables, or if you want to have the procedures write to intermediate tables, and then move the transformed data directly into the final tables. Using intermediate tables minimizes down time on the final tables, but if your transformations are simple or very fast, this may not be an issue for you.
If you use intermediate tables, you will need a package or packages to manage the final data load into the destination tables.
Depending on the number of packages all of this takes, you may want to create a Master SSIS package that will call the extraction package(s), then the transformation package(s), and then, if you use intermediate processing tables, the final load package(s).

Bteq Scripts to copy data between two Teradata servers

How do I copy data from multiple tables within one database to another database residing on a different server?
Is this possible through a BTEQ Script in Teradata?
If so, provide a sample.
If not, are there other options to do this other than using a flat-file?
This is not possible using BTEQ since you have mentioned both the databases are residing in different servers.
There are two solutions for this.
Arcmain - You need to use Arcmain Backup first, which creates files containing data from your tables. Then you need to use Arcmain restore which restores the data from the files
TPT - Teradata Parallel Transporter. This is a very advanced tool. This does not create any files like Arcmain. It directly moves the data between two teradata servers.(Wikipedia)
If I am understanding your question, you want to move a set of tables from one DB to another.
You can use the following syntax in a BTEQ Script to copy the tables and data:
CREATE TABLE <NewDB>.<NewTable> AS <OldDB>.<OldTable> WITH DATA AND STATS;
Or just the table structures:
CREATE TABLE <NewDB>.<NewTable> AS <OldDB>.<OldTable> WITH NO DATA AND NO STATS;
If you get real savvy you can create a BTEQ script that dynamically builds the above statement in a SELECT statement, exports the results, then in turn runs the newly exported file all within a single BTEQ script.
There are a bunch of other options that you can do with CREATE TABLE <...> AS <...>;. You would be best served reviewing the Teradata Manuals for more details.
There are a few more options which will allow you to copy from one table to another.
Possibly the simplest way would be to write a smallish program which uses one of their communication layers (ODBC, .NET Data Provider, JDBC, cli, etc.) and use that to take a select statement and an insert statement. This would require some work, but it would have less overhead than trying to learn how to write TPT scripts. You would not need any 'DBA' permissions to write your own.
Teradata also sells other applications which hides the complexity of some of the tools. Teradata Data Mover handles provides an abstraction layer between tools like arcmain and tpt. Access to this tool is most likely restricted to DBA types.
If you want to move data from one server to another server then
We can do this with the flat file.
First we have fetch data from source table to flat file through any utility such as bteq or fastexport.
then we can load this data into target table with the help of mload,fastload or bteq scripts.