My Hive query is running for long time

My Hive query is running for long time - hive

I ran the below query from hive CLI.
The query is running for long time and failing after that.
SET hive.tez.container.size=10240;
SET hive.tez.java.opts=-Xmx8192m;
set tez.runtime.io.sort.mb=4096;
set tez.runtime.unordered.output.buffer.size-mb=1024;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.vectorized.execution.reduce.enabled;
set hive.execution.engine=tez;
SELECT
cust_his.cname AS cname
,cust_his.creg AS creg
,Upper(Trim(cust_his.ccountry)) AS ccountry
,Upper(Trim(cust_his.cloc)) AS cloc
FROM
customer_history cust_his
WHERE
cust_his.cust_d BETWEEN 20160501 AND 20160531
AND Substr(Trim(cust_his.cloc), 1, Locate('|', cust_his.cloc, 1) - 1) <> ''
AND Substr(Trim(cust_his.cloc), 1, Locate('|', cust_his.cloc, 1) - 1) IS NOT NULL
AND cast(Trim(cust_his.cmfid) as int) NOT IN ( 1,2,3 )
AND cust_his.cmat = '8';
The table is partitioned on cust_d column.
The table is having 420TB of data.
Please help me to resolve this.
Thanks in Advance.

Your query should run on mapper only because there is no group by or join or files merge.
Check how many mappers starts and is partition pruning works, check query plan.
If partition pruning does not work, try to replace BETWEEN condition with >= and <=, sometimes it helps, vectorized execution may not support BETWEEN in your version according to this: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/query-vectorization.html.
Also add this:
set hive.vectorized.execution.enabled = true;
Also this condition is redundant:
AND Substr(Trim(cust_his.cloc), 1, Locate('|', cust_his.cloc, 1) - 1) IS NOT NULL
You do not need it because you already have the same <> '', this automatically excludes nulls

Related

Snowflake ignores statement in where clause where I'm comparing timestamps

so I'm building a SCD type 2 in snowflake, but it ignores the where clause in which is comparision between "to_timestamp" and "expiry_date". Expiry_date is a variable that is set to '9999-08-17 07:31:29.901000000' (as infinity) and To_timestamp is a column in table. I want to query only the rows that have to_timestamp set to infinity (they are still active) but snowflake seems to ignore this part of where clause. Below is some of the code (it should update the rows that are expired - that means change their "to_timestamp" to current time. and it does but it does to rows with timestamps of all kind - it ignores last line)
SET EXPIRY_DATE_NTZ = '9999-08-17 07:31:29.901000000';
SET CURRENT_DATE_NTZ = TO_TIMESTAMP_NTZ(CURRENT_TIMESTAMP());
UPDATE CUSTOMER_TARGET CT
SET CT.TO_TIMESTAMP = $CURRENT_DATE_NTZ
FROM POC.SNOWFLAKE_POC.CUSTOMER_STAGE CS
WHERE CT.C_CUSTOMER_ID = CS.C_CUSTOMER_ID
AND (CT.C_FIRST_NAME <> CS.C_FIRST_NAME OR CT.C_LAST_NAME <> CS.C_LAST_NAME OR CT.C_BIRTH_YEAR
<> CS.C_BIRTH_YEAR OR CT.C_BIRTH_COUNTRY <> CS.C_BIRTH_COUNTRY OR CT.C_LAST_REVIEW_DATE<>CS.C_LAST_REVIEW_DATE)
AND CT.TO_TIMESTAMP = $EXPIRY_DATE_NTZ;
I have two of these update statements (one for updates and one for deletes) and a merge statement for inserts. And it ignores the comparision in every single one, updating the rows that have "to_timestamp" set to something like "2021-08-24 07:11:53.510000000". I've tried every combination possible (between ... and ..., >= ... <=, <=, >=, comparing in "case" statement of update,...) - nothing. What could be the cause/solution?

As we do not know the structure of CUSTOMER_TARGET I would suggest to explicitly set the data type of EXPIRY_DATE_NTZ variable to match the column data type:
SET EXPIRY_DATE_NTZ = '9999-08-17 07:31:29.901000000';
SELECT $EXPIRY_DATE_NTZ;
DESCRIBE RESULT LAST_QUERY_ID();
to:
-- TIMESTAMP_NTZ as an example
SET EXPIRY_DATE_NTZ = '9999-08-17 07:31:29.901000000'::TIMESTAMP_NTZ;
SELECT $EXPIRY_DATE_NTZ;
DESCRIBE RESULT LAST_QUERY_ID();
By doing that way there are no "implicit conversions" involved in the process.
Another advice is usage of IS DISTINCT FROM instead of <>. IS DISTINCT FROM is NULL safe, which is important if columns are defined as nullable.
UPDATE CUSTOMER_TARGET CT
SET CT.TO_TIMESTAMP = $CURRENT_DATE_NTZ
FROM POC.SNOWFLAKE_POC.CUSTOMER_STAGE CS
WHERE CT.C_CUSTOMER_ID = CS.C_CUSTOMER_ID
AND (CT.C_FIRST_NAME IS DISTINCT FROM CS.C_FIRST_NAME
OR CT.C_LAST_NAME IS DISTINCT FROM CS.C_LAST_NAME
OR CT.C_BIRTH_YEAR IS DISTINCT FROM CS.C_BIRTH_YEAR
OR CT.C_BIRTH_COUNTRY IS DISTINCT FROM CS.C_BIRTH_COUNTRY
OR CT.C_LAST_REVIEW_DATE IS DISTINCT FROM CS.C_LAST_REVIEW_DATE)
AND CT.TO_TIMESTAMP = $EXPIRY_DATE_NTZ;

Your SQL does not have any issues with the filters (ORs are surrounded by the brackets etc). I assume that you have checked the execution profile, and did not see your filter (CT.TO_TIMESTAMP = '9999-08-17 07:31:29.901000000'). In this case, all rows in the target table should have this value in the column TO_TIMESTAMP.
I highly recommend you check the data first. If you are running multiple UPDATE/MERGE commands, you may miss that the data has already updated with this value.

Google Bigquery multiple updates based on WHERE - need a solution

is there a way to do multiple updates based on other field value
WHERE, not CASE
idea is below
thanks
#standardSQL
UPDATE dataset.people
SET CBSA_CODE = '54620' where substr(zip,1,5) = '99047',
SET CBSA_CODE = '31793' where substr(zip,1,5) = '45700'

A CASE expression is in fact the typical way you would handle this logic:
UPDATE dataset.people
SET CBSA_CODE = CASE SUBSTR(zip, 1, 5)
WHEN '99047' THEN '54620'
WHEN '45700' THEN '31793' END
WHERE
SUBSTR(zip, 1, 5) IN ('99047', '45700');
The only alternative to this which I can see would be to run mutliple update statements, one for each ZIP code value. But that seems unwieldy and undesirable as compared to using a CASE expression.

Use SQL CASE, which is part of Standard SQL (see official BQ docs):
#standardSQL
UPDATE dataset.people
SET CBSA_CODE = CASE
WHEN substr(zip,1,5) = '99047' THEN '54620'
WHEN substr(zip,1,5) = '45700' THEN '31793'
END
WHERE substr(zip,1,5) IN('99047', '45700')

Getting table(records) to update properply using the MERGE Statement

Good morning everyone!
Below is a piece of code I stitched together: I used a CTE to grab the records(data) from a link table and than convert strings to dates, than use the merge statement to get the data into a local table:
I am having a problem with the column(field) LAST_RACE_DATE this field is set to NULL and is not required but it does not update with my current set up. What I am trying to accomplished is for this field to populate when data is entered but also update, meaning it should also update with NULL.
So if the field has a specific date, and a new date is entered in the remote database, this field should update as well, even if the data is deleted in the back end, it should also remove the local table data for this field.
WITH CTE AS(
SELECT MEMBER_ID
,[MEMBER_DATE] = MAX(CONVERT(DATE, MEMBER_DATE))
,RACE_DATE = MAX(CONVERT(DATE, RACE_DATE))
,LAST_RACE_DATE = MAX(CONVERT(DATE, LAST_RACE_DATE))
FROM [EXAMPLE].[dbo].[LINKED_MEMBER_DATA]
WHERE (MEMBER_DATE IS NOT NULL) AND (ISDATE(MEMBER_DATE)<> 0) AND (RACE_DATE IS NOT NULL) AND (ISDATE(RACE_DATE)<> 0)
AND (LAST_RACE_DATE IS NULL) OR (ISDATE(LAST_RACE_DATE)<> 0)
GROUP BY MEMBER_ID)
MERGE dbo.LINKED_MEMBER_DATA AS Target
USING (SELECT
MEMBER_ID, MEMBER_DATE, RACE_DATE, LAST_RACE_DATE
FROM CTE
GROUP BY MEMBER_ID, RACE_DATE, LAST_RACE_DATE)AS SOURCE ON (Target.MEMBER_ID = SOURCE.MEMBER_ID)
WHEN MATCHED AND
(Target.MEMBER_DATE) <> (SOURCE.MEMBER_DATE)
OR (Target.RACE_DATE) <> (SOURCE.RACE_DATE)
OR ISNULL(TARGET.LAST_RACE_DATE , Target.LAST_RACE_DATE) <> ISNULL(SOURCE.LAST_RACE_DATE, SOURCE.LAST_RACE_DATE)
THEN UPDATE SET
Target.MEMBER_DATE = SOURCE.MEMBER_DATE
,Target.RACE_DATE = SOURCE.RACE_DATE
,Target.LAST_RACE_DATE = SOURCE.LAST_RACE_DATE
WHEN NOT MATCHED BY TARGET THEN
INSERT(
MEMBER_ID, MEMBER_DATE, RACE_DATE, LAST_RACE_DATE)
VALUES (Source.MEMBER_ID, Source.MEMBER_DATE, Source.RACE_DATE, Source.LAST_RACE_DATE);
I also tried this:
ISNULL(Target.LAST_RACE_DATE,'N/A') <> ISNULL(SOURCE.LAST_RACE_DATE,'N/A')
But it generates the below error for dates conversion:
Conversion failed when converting date and/or time from character string.
Thanks a Million!!

Your current statement is failing because the ISNULLs that you have don't do anything (if one of the values is NULL the expression will evaluate to NULL), and NULL values don't compare. Your second attempt doesn't work because ISNULL requires the data types of the two values to be the same, so you could try eg ISNULL(Target.LAST_RACE_DATE, '1970-01-01') <> ISNULL(Source.LAST_RACE_DATE, '1970-01-01').
Another option would be to simply enumerate the different cases (eg, (((Source.LAST_RACE_DATE IS NULL AND Target.LAST_RACE_DATE IS NOT NULL) OR (Source.LAST_RACE_DATE IS NOT NULL AND Target.LAST_RACE_DATE IS NULL) OR (Source.LAST_RACE_DATE <> Target.LAST_RACE_DATE))). Enumerating the different situations makes the code a bit more verbose, but it can result in better performance (whether it is measurably better really depends on how much data you are processing).

sqlite3 UPDATE generating nulls

I'm trying to transition from MySQL to SQLIte3 and running into an update problem. I'm using SQLite 3.6.20 on redhat.
My first line of code behaves normally
update atv_covar set noncomp= 2;
All values for noncomp (in the rightmost column) are appropriately set to 2.
select * from atv_covar;
A5202|S182|2
A5202|S183|2
A5202|S184|2
It is the second line of code that gives me problems:
update atv_covar
set noncomp= (select 1 from f4003 where
atv_covar.study = f4003.study and
atv_covar.rpid = f4003.rpid and
(rsoffrx="81" or rsoffrx="77"));
It runs without generating errors and appropriately sets atv_covar.noncomp to 1 where it matches the SELECT statement. The problem is that it changes atv_covar.noncomp for the non-matching rows to null, where I want it to keep them as 2.
select * from atv_covar;
A5202|S182|
A5202|S183|1
A5202|S184|
Any help would be welcome.

#Dan, the problem with your query is not specific to SQLite; you are updating all rows of atv_covar, but not all of them have correspondence in f4003, so these default to NULL. You should filter the update or provide a default value.
The following statement sets 1 only to the rows that macth the filtering condition:
UPDATE atv_covar
SET noncomp = 1
WHERE EXISTS (
SELECT 'x'
FROM f4003
WHERE atv_covar.study = f4003.study
AND atv_covar.rpid = f4003.rpid
AND (rsoffrx="81" or rsoffrx="77")
);
The following statement sets 1 or 2 for all rows of noncomp, depending on the filtering match (use this instead of two updates):
UPDATE atv_covar
SET noncomp = COALESCE((
SELECT 1
FROM f4003
WHERE atv_covar.study = f4003.study
AND atv_covar.rpid = f4003.rpid
AND (rsoffrx="81" or rsoffrx="77")
), 2);

Update field in table for all records using a select statement

A previous developer created a table that stores the absolute path to files in our server. I want to convert them to relative paths instead.
I already wrote the portion that properly strips the string down to a relative path. My issue is understanding how to basically update each record, with a new version of its own string.
Here is what I originally tried:
UPDATE LFRX_Attachments
SET [File] = (SELECT TOP 1 SUBSTRING([File], PATINDEX('%Files\%', [File]) + 6, LEN([File]))
FROM LFRX_Attachments A
WHERE [Type] = 4 AND AttachmentId = A.AttachmentId)
However, this tanked in epic fashion by just overwriting every record to have the value of the first record in the table. Any suggestions?

UPDATE LFRX_Attachments
SET [File] = SUBSTRING([File], PATINDEX('Files\', [File]) + 6, LEN([File]))
WHERE [Type] = 4

From a readability/maintenance standpoint, you're better off selecting for the data you want to alter, then iterating through the result set and updating each record separately.

Does this work for you?
UPDATE LFRX_Attachments SET [File] = SUBSTRING([File], PATINDEX('Files\', [File]) + 6, LEN([File]))

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

My Hive query is running for long time - hive

Related

Snowflake ignores statement in where clause where I'm comparing timestamps

Google Bigquery multiple updates based on WHERE - need a solution

Getting table(records) to update properply using the MERGE Statement

sqlite3 UPDATE generating nulls

Update field in table for all records using a select statement

Categories

Resources