I want to write a Hive query and use it in a workflow. The query should take input and insert 'n' values into a table. The catch is that 'n' can be different every time the workflow is run. Is this possible to achieve with a Hive query?
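One pattern that may work (just a sketch; the hivevar name, delimiter, and table below are hypothetical) is to pass the values into the workflow as a single delimited parameter and let the query split and explode it, so the same statement handles any 'n':

-- values passed in e.g. as --hivevar input_values='v1,v2,v3' (hypothetical parameter name)
INSERT INTO TABLE target_table
SELECT val
FROM (SELECT explode(split('${hivevar:input_values}', ',')) AS val) exploded;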
I am trying to capture the CPU usage of my current SQL Server over a time period and came across a query from here.
Please guide me: how can I use the above query to insert the results into a permanent table over a collected period of time, without overwriting timestamp values or duplicating entries?
Thanks
You could use a scheduled job to execute the query and insert data into your table:
https://learn.microsoft.com/en-us/sql/ssms/agent/schedule-a-job?view=sql-server-ver15
If you use the syntax
INSERT INTO dbo.sysUseLig
SELECT --the query
Having previously created your table dbo.sysUseLig with columns of the correct data types, you will have what you need.
There is a timestamp-based column in the select.
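As a rough sketch of that job step (the column names are assumptions, since they depend on the CPU query being captured; only dbo.sysUseLig comes from the answer), including a guard so already-captured timestamps are not duplicated:

-- Run once: permanent table that accumulates the samples
CREATE TABLE dbo.sysUseLig (
    sample_time DATETIME NOT NULL PRIMARY KEY,  -- the timestamp-based column from the query
    cpu_pct     INT      NOT NULL
);

-- Scheduled job step: append only rows whose timestamp is not already stored
INSERT INTO dbo.sysUseLig (sample_time, cpu_pct)
SELECT q.sample_time, q.cpu_pct
FROM (
    -- stand-in for the actual CPU-usage query from the question
    SELECT CAST(GETDATE() AS DATETIME) AS sample_time, 0 AS cpu_pct
) AS q
WHERE NOT EXISTS (
    SELECT 1 FROM dbo.sysUseLig AS t WHERE t.sample_time = q.sample_time
);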
How will Hive behave if I use insert overwrite on a partitioned external or non-partitioned external table while other processes are writing into the same table?
I am trying the below on a non-partitioned table:
insert overwrite table customer_master (select * from customer_master);
Other Process:
insert into table customer_master select a, b, c;
By default, transactions are turned off in Hive. There is no conflict detection between different sessions updating the same data.
Statements in Hive are just MapReduce (or Spark, Tez, etc.) jobs, and they run independently. Since they operate on the same table and in the end write to the same directory, if the insert into job finishes before the insert overwrite job, the first result will be overwritten, because the insert overwrite job cleans the directory before writing its result.
To avoid this, use Hive transactions.
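For reference, a minimal sketch of what enabling Hive transactions usually involves (session settings plus a transactional table; the table here is a placeholder, and the exact requirements vary by Hive version; on Hive 1.x the table must be bucketed and stored as ORC):

-- settings commonly required for Hive ACID
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- transactional tables on Hive 1.x must be bucketed ORC tables
CREATE TABLE customer_master_acid (a STRING, b STRING, c STRING)
CLUSTERED BY (a) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');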
Scenario: We have a number of scheduled queries that copy data into a project that we use as our centralized data warehouse. These scheduled queries are configured to run nightly and are set to WRITE_TRUNCATE.
Problem: We added descriptions to the columns in several of our destination tables in order to document them. However, when the scheduled queries ran, they removed all of the column descriptions. (The table description was maintained.)
Desired Outcome: Is there a way to insert the column descriptions as part of the scheduled queries, or some other way to avoid having these deleted nightly? Or is that simply a limitation of WRITE_TRUNCATE scheduled queries?
I've searched Google & Stack Overflow, and reviewed the documentation, but I can't find any references to table / column descriptions in relation to scheduled queries.
One solution: instead of using WRITE_TRUNCATE with a SELECT, you can use:
CREATE OR REPLACE TABLE <table_name> ( <column_list_with_description> )
AS SELECT ...
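Spelled out with illustrative names (the dataset, table, and columns below are placeholders), the descriptions go into per-column OPTIONS clauses so they survive each run:

CREATE OR REPLACE TABLE mydataset.destination_table (
  customer_id INT64     OPTIONS(description="Unique customer identifier"),
  created_at  TIMESTAMP OPTIONS(description="When the record was created")
)
AS
SELECT customer_id, created_at
FROM mydataset.source_table;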
If you don't want to repeat the column descriptions in every scheduled query, you may use:
DELETE FROM table WHERE true;
INSERT INTO table SELECT ...
If atomicity of the update is required, the above queries can be written as a single MERGE statement, like:
MERGE full_table
USING (
SELECT *
FROM data_updates_table
)
ON FALSE
WHEN NOT MATCHED BY SOURCE THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
I have a partitioned hive table partitioned on column 'part'. The table has two partition values part='good' and part='bad'.
I need to move a record from 'bad' partition into 'good' partition and overwrite 'bad' partition to remove that moved record. To complicate this, I am looking for a way to do it in a single query as exception handling would be difficult otherwise.
I tried to do it with a multi-table insert having two insert queries on the same table, as below:
from tbl_partition
insert into tbl_partition partition (part='good') select a,b,c where a='a' and part='bad' -- this is where a record is moved from bad to good
insert overwrite table tbl_partition partition (part='bad') select a,b,c where part='bad' and a not in ('a'); -- Overwrite the bad partition excluding already moved record
But the above query always does an insert into for both parts, rather than one insert into and the other insert overwrite!
I even tried a common table expression and used it to insert into this table simultaneously, with no luck!
Is there any other way this can be achieved in a single query or am I doing something wrong in the above step?
Please note that I am doing this on an HDP cluster with Hive 1.2.
I have TableA that has millions of records and 40 columns.
I would like to move:
- columns 1-30 into Table B
- columns 31-40 into Table C
This multiple-insert question shows how I assume I should do it:
INSERT INTO TableB (col1, col2, ...)
SELECT c1, c2,...
FROM TableA...
I wanted to know if there was a different/quicker way I could pass the data. Essentially, I don't want to wait for one table to finish insert processing before the other insert statement starts to execute.
I'm afraid there is no way in the SQL standard to have what is often called a T junction at the end of an INSERT .. SELECT; that is the privilege of ETL tools. But the ETL tools connect twice to the database, once for each leg of the T junction, and the resulting two INSERT INTO tab_x VALUES (?,?,?,?) statements run in parallel.
Which brings me to a possible solution that could make sense:
Create two scripts. One goes INSERT INTO table_b1 SELECT col1,col2 FROM table_a;. The other goes INSERT INTO table_b2 SELECT col3,col4 FROM table_a;. Then, as it's SQL Server, launch two isql sessions in parallel, each running its own script.
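As a sketch of those two scripts (the sqlcmd commands in the comments are just one way to open the two parallel sessions and are an assumption, not part of the answer above):

-- insert_b1.sql, e.g. run with: sqlcmd -S <server> -d <database> -i insert_b1.sql
INSERT INTO table_b1 (col1, col2)
SELECT col1, col2 FROM table_a;

-- insert_b2.sql, launched in a second session at the same time
INSERT INTO table_b2 (col3, col4)
SELECT col3, col4 FROM table_a;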