I have my data in a Teradata table, which I imported into Hive using the sqoop import command.
However, the Teradata table receives new data daily, so I need to sqoop only the newly added (incremental) data from Teradata into the Hive table.
Can anyone suggest a way to achieve this?
If you have a column similar to a row-id or timestamp in your table, then you can use:
--incremental [mode] --last-value [value] --check-column [col]
If you have a saved job for this, you can skip --last-value, as it is maintained automatically.
--incremental [mode] has two modes, append and lastmodified; use whichever fits your requirement.
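Putting those flags together, an incremental import might look like the following sketch. The connection string, table, column names, and last value are placeholders for your own setup, and depending on your Sqoop version the lastmodified mode may have restrictions when combined with --hive-import:

```shell
# Incremental append import: fetch only rows whose check-column value
# exceeds the last imported value. All names/values are placeholders.
sqoop import \
  --connect jdbc:teradata://<td-host>/Database=<db> \
  --username <user> --password <password> \
  --table mytable \
  --check-column id \
  --incremental append \
  --last-value 11854 \
  --hive-import --hive-table hivedb.mytable

# Or create a saved job, so --last-value is tracked automatically
# between daily runs:
sqoop job --create daily_import -- import \
  --connect jdbc:teradata://<td-host>/Database=<db> \
  --table mytable \
  --check-column id \
  --incremental append \
  --hive-import --hive-table hivedb.mytable
sqoop job --exec daily_import
```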
In our BQ export schema, we have one table for each day, as per the screenshot below.
I want to copy the tables before a certain date (2021-02-07). I know how to copy one day at a time via the UI, but is there a way to use the Cloud Console to write code that copies a selected date range all at once? Or maybe a SQL command directly from a query window?
I think you should transform your sharded tables into a single partitioned table, so you can handle them with just one query. As mentioned in the official documentation, partitioned tables also perform better.
To make the conversion, you can execute the following command in the console.
bq partition \
--time_partitioning_type=DAY \
--time_partitioning_expiration 259200 \
mydataset.sourcetable_ \
mydataset.mytable_partitioned
This converts your sharded tables sourcetable_(xxx) into a single partitioned table, mytable_partitioned, which can be queried across your entire set of data with one query:
SELECT
*
FROM
`myprojectid.mydataset.mytable_partitioned`
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP('2022-01-01') AND TIMESTAMP('2022-01-03')
For more details about the conversion command, check this link. I also recommend the links about querying partitioned tables and about partitioned tables in general.
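Alternatively, if you'd rather keep the sharded layout and just copy the shards before a given date, a simple shell loop over bq cp can do it. A sketch, assuming GNU date is available; the dataset and table names are placeholders:

```shell
# Copy every day-sharded table from 2021-01-01 up to (but not including)
# 2021-02-07 into a backup dataset. Dataset/table names are placeholders.
start=2021-01-01
end=2021-02-07
d="$start"
while [ "$d" != "$end" ]; do
  shard=$(date -d "$d" +%Y%m%d)          # e.g. 20210101
  bq cp -f "mydataset.sourcetable_${shard}" "backupdataset.sourcetable_${shard}"
  d=$(date -d "$d + 1 day" +%F)          # advance one day
done
```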
I often want to load one day's worth of data into a date-partitioned BigQuery table, replacing any data that's already there. I know how to do this for 'old-style' date-partitioned tables (the ones that have a _PARTITIONTIME field), but I don't know how to do this with the new-style date-partitioned tables (which use a normal date/timestamp column to specify the partitioning), because they don't allow one to use the $ decorator.
Let's say I want to do this on my_table. With old-style date-partitioned tables, I accomplished this using a load job that utilized the $ decorator and the WRITE_TRUNCATE write disposition -- e.g., I'd set the destination table to be my_table$20181005.
However, I'm not sure how to perform the equivalent operation using DML. I find myself running separate DELETE and INSERT statements. This isn't great: it adds complexity, increases the number of queries, and the operation isn't atomic.
I want to know how to do this using the MERGE command to keep this all contained within a single, atomic operation. However I can't wrap my head around the MERGE command's syntax and haven't found an example for this use case. Does anyone know how this should be done?
The ideal answer would be a DML statement that selected all columns from source_table and inserted it into the 2018-10-05 date partition of my_table, deleting any existing data that was in my_table's 2018-10-05 date partition. We can assume that source_table and my_table have the same schemas, and that my_table is partitioned on the day column, which is of type DATE.
because they don't allow one to use the $ decorator
But they do: you can use table_name$YYYYMMDD when you load into a column-partitioned table as well. For example, I made a partitioned table:
$ bq query --use_legacy_sql=false "CREATE TABLE tmp_elliottb.PartitionedTable (x INT64, y NUMERIC, date DATE) PARTITION BY date"
Then I loaded into a specific partition:
$ echo "1,3.14,2018-11-07" > row.csv
$ bq load "tmp_elliottb.PartitionedTable\$20181107" ./row.csv
I tried to load into the wrong partition for the input data, and received an error:
$ echo "1,3.14,2018-11-07" > row.csv
$ bq load "tmp_elliottb.PartitionedTable\$20181105" ./row.csv
Some rows belong to different partitions rather than destination partition 20181105
I then replaced the data for the partition:
$ echo "2,0.11,2018-11-07" > row.csv
$ bq load --replace "tmp_elliottb.PartitionedTable\$20181107" ./row.csv
Yes, you can use MERGE as a way of replacing data for a partitioned table's partition, but you can also use a load job.
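If you do want the single-statement DML route, one known pattern is a MERGE with a constant-false join condition, so every source row becomes an insert and the old partition's rows are deleted in the same atomic statement. A sketch, assuming my_table is partitioned on a DATE column named day, source_table has the same schema and contains only rows for 2018-10-05:

```sql
-- Atomically replace the 2018-10-05 partition of my_table
-- with the contents of source_table.
MERGE dataset.my_table AS t
USING dataset.source_table AS s
ON FALSE  -- never matches: every source row falls into NOT MATCHED
WHEN NOT MATCHED AND s.day = DATE '2018-10-05' THEN
  INSERT ROW  -- insert all source rows for the target date
WHEN NOT MATCHED BY SOURCE AND t.day = DATE '2018-10-05' THEN
  DELETE      -- drop the existing rows in that partition
```

Because both the DELETE and the INSERT happen inside one MERGE, readers never observe the partition half-replaced.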
Currently we drop the table daily and run the script that loads the data into the tables. The script takes 3-4 hours, during which the data is unavailable. Our aim now is to keep the old Hive data available to analysts until the new data load completes.
I am achieving this in an HQL script by loading the daily data into Hive tables partitioned on load_year, load_month and load_day, and removing yesterday's data by dropping its partition.
But what is the option to achieve the same in a Pig script? Can we alter the table through a Pig script? I don't want to execute a separate HQL script to drop the partition after Pig.
Thanks
Since HDP 2.3 you can use HCatalog commands inside Pig scripts. Therefore, you can use the HCatalog command to drop a Hive table partition. The following is an example of dropping a Hive partition:
-- Set the correct hcat path
set hcat.bin /usr/bin/hcat;
-- Drop a table partition, or execute any other HCatalog command
sql ALTER TABLE midb1.mitable1 DROP IF EXISTS PARTITION(activity_id = "VENTA_ALIMENTACION",transaction_month = 1);
Another way is to use sh command execution inside the Pig script. However, I had problems escaping special characters in ALTER commands, so the first is the better option in my opinion.
Regards,
Roberto Tardío
I am moving data around within Impala (not my design) and I have lost some data. I need to copy the data from the Parquet tables back to their original non-Parquet tables. Originally, the developers did this with a simple one-liner in a script. Since I don't know anything about databases, and especially not about Impala, I was hoping you could help me out. This is the one-liner used to convert to a Parquet table, which I need reversed.
impala-shell -i <ipaddr>
USE db;
INVALIDATE METADATA <text_table>;
CREATE TABLE <parquet_table> LIKE <text_table> STORED AS PARQUET TABLE;
INSERT OVERWRITE <parquet_table> SELECT * FROM <text_table>;
Thanks.
Have you tried simply doing
CREATE TABLE <text_table>
AS
SELECT *
FROM <parquet_table>
Per the Cloudera documentation, this should be possible.
NOTE: Ensure that your <text_table> does not already exist, or use a table name that doesn't, so that you do not accidentally overwrite other data.
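If the original text table still exists with its schema and non-Parquet storage, you can skip the CREATE entirely and just overwrite it in place, mirroring the original one-liner in reverse. Table names below are placeholders:

```sql
-- Copy all rows from the Parquet table back into the existing text table,
-- replacing whatever is currently there.
INSERT OVERWRITE <text_table> SELECT * FROM <parquet_table>;
```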
I am trying to append the new data from SQLServer to Hive using the following command
sqoop import --connect 'jdbc:sqlserver://10.1.1.12;database=testdb' --username uname --password passwd --table testable --where "ID > 11854" --hive-import -hive-table hivedb.hivetesttable --fields-terminated-by ',' -m 1
This command appends the data.
But when I run
select * from hivetesttable;
it does not show the new data at the end.
This is because the sqoop import statement that appends the new data writes the mapper output as part-m-00000-copy.
So my data in the hive table directory looks like
part-m-00000
part-m-00000-copy
part-m-00001
part-m-00002
Is there any way to append the data at end by changing the name of mapper?
Hive, like any other relational database, doesn't guarantee any order unless you explicitly use an ORDER BY clause.
Your analysis is correct: the data appears in the "middle" because Hive reads the files one after another in lexicographical order, and Sqoop names the new files so that they land somewhere in the middle of that list.
However, this operation is fully valid: Sqoop appended the data to the Hive table, and because your query has no explicit ORDER BY, the result carries no ordering guarantees. In fact, Hive itself could change this behavior and read files by creation time without breaking any compatibility.
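If you do need a deterministic order, add an explicit ORDER BY on whatever key makes sense for your table; for example, assuming the imported ID column is named id:

```sql
-- Ordering is guaranteed only with an explicit ORDER BY,
-- regardless of how the underlying files are named.
SELECT * FROM hivetesttable ORDER BY id;
```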
I'm also curious how this affects your use case. I assume the query listing all rows is just a test; do you have issues with any actual production queries?