Apache Drill error: Index out of bounds - Hive

I'm running queries with Apache Drill over a 100 GB dataset on a relatively small cluster (4 nodes with 16 GB). When I try to run a certain query, it gives me the following error:
Error: SYSTEM ERROR: IndexOutOfBoundsException: index: 88, length: 4
(expected: range(0, 64)).
I'm using the standard TPC-H data model, and the error happens in Query 12. The same query ran successfully when I was working with a smaller dataset. Can someone suggest why this happens?
Query:
select
l_shipmode,
sum(case
when o_orderpriority = '1-URGENT'
or o_orderpriority = '2-HIGH'
then 1
else 0
end) as high_line_count,
sum(case
when o_orderpriority <> '1-URGENT'
and o_orderpriority <> '2-HIGH'
then 1
else 0
end) as low_line_count
from
pq_orders,
pq_lineitem
where
o_orderkey = l_orderkey
and l_shipmode in ('REG AIR', 'MAIL')
and l_commitdate < l_receiptdate
and l_shipdate < l_commitdate
and l_receiptdate >= '1995-01-01'
and l_receiptdate < '1996-01-01'
group by
l_shipmode
order by
l_shipmode;

Related

Trying to grab data from two columns and format them properly

So I have a database here with a table that lists off whether or not certain processes have failed. There are two columns, IsProcessed, and IsFailed. A failed process can still be considered processed if the error was handled, but I still need to recognize that it failed. They're both bit values, and so I have to try and grab and separate them despite that they may depend on one another. After they've been separated out, I need to count the relative successes and relative failures.
I utilize an AND statement in my WHERE clause to try and separate out the successes from the failures. I honestly have no idea where to go from here.
SELECT CAST(PQ.ProcessedDate AS Date) AS Date, COUNT(PQ.IsProcessed) AS Successes
FROM PQueue PQ
WHERE PQ.ProcessDate BETWEEN '2019-10-1' AND '2019-10-31' AND PQ.IsFailed = 0 AND PQ.IsProcessed = 1
GROUP BY CAST(PQ.ProcessDate AS Date)
ORDER BY CAST(PQ.ProcessDate AS Date) ASC
Because a failed process can still be processed in the system, we have to do a check first to try and grab the data that was processed but didn't flag a failure. Now I need to try and find a way to not exclude the failures, but include them and place them in a group. I can do the group part, but I'm relatively new to SQL so I don't know whether or not I can place something in an IF statement somewhere or try to use variables to get this done. Thank you in advance.
You seem to want conditional aggregation:
SELECT CAST(PQ.ProcessDate AS Date) AS Date,
SUM(CASE WHEN PQ.IsFailed = 0 AND PQ.IsProcessed = 1 THEN 1 ELSE 0 END) as Successes,
SUM(CASE WHEN PQ.IsFailed = 1 AND PQ.IsProcessed = 1 THEN 1 ELSE 0 END) as Fails
FROM PQueue PQ
WHERE PQ.ProcessDate BETWEEN '2019-10-1' AND '2019-10-31'
GROUP BY CAST(PQ.ProcessDate AS Date)
ORDER BY CAST(PQ.ProcessDate AS Date) ASC
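To see the conditional aggregation pattern run end to end, here is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server. The sample rows are made up for illustration, and the date is stored as plain text so no CAST is needed; only the table and column names come from the question.

```python
import sqlite3

# In-memory database with hypothetical sample rows (not real data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PQueue (ProcessDate TEXT, IsProcessed INTEGER, IsFailed INTEGER)"
)
conn.executemany(
    "INSERT INTO PQueue VALUES (?, ?, ?)",
    [
        ("2019-10-01", 1, 0),  # processed, no failure
        ("2019-10-01", 1, 1),  # processed, but failed
        ("2019-10-01", 1, 0),  # processed, no failure
        ("2019-10-02", 1, 1),  # processed, but failed
        ("2019-10-02", 0, 0),  # not processed yet, counted in neither column
    ],
)

# One pass over the table: each CASE acts as a per-row 0/1 flag,
# and SUM turns the flags into a count per group.
query = """
SELECT ProcessDate AS Date,
       SUM(CASE WHEN IsFailed = 0 AND IsProcessed = 1 THEN 1 ELSE 0 END) AS Successes,
       SUM(CASE WHEN IsFailed = 1 AND IsProcessed = 1 THEN 1 ELSE 0 END) AS Fails
FROM PQueue
GROUP BY ProcessDate
ORDER BY ProcessDate
"""
result = conn.execute(query).fetchall()
print(result)
```

Because the filter on IsFailed moved out of the WHERE clause and into the CASE expressions, failed rows are no longer excluded before grouping, which is what lets both counts come out of the same query.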
If you are using SQL Server, a CASE expression may help you out.
For example:
SELECT ...........
CASE
WHEN IsFailed = 1 AND IsProcessed = 1 THEN 'Processed But Failed'
WHEN IsFailed = 0 AND IsProcessed = 0 THEN 'Not Processed'
WHEN IsFailed = 0 AND IsProcessed = 1 THEN 'Processed Successfully'
WHEN IsFailed = 1 AND IsProcessed = 0 THEN 'Failed'
END AS Result

Hive query with except conditions

I am trying to build a Hive query that returns only the features below, or a combination of them. For example, the features include
name = "summary"
name = "details"
name1 = "vehicle stats"
Basically, the query should exclude all the other features in name and name1.
I am quite new to Hive. In SQL, I know this can be done using the EXCEPT keyword. I am wondering whether there are functions that can achieve the same.
Thanks very much !!
If I understand correctly, I would approach this using GROUP BY and HAVING:
select ?
from t
group by ?
having sum(case when name = 'summary' then 1 else 0 end) > 0 and
sum(case when name = 'details' then 1 else 0 end) > 0 and
sum(case when name1 = 'vehicle stats' then 1 else 0 end) > 0;
The ? is for the column that you want the summary of.
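The GROUP BY and HAVING approach can be sketched as follows, using Python's sqlite3 module instead of Hive. The table `t` and the columns `name` and `name1` follow the answer; the grouping column `id` and all sample rows are hypothetical stand-ins for the `?` placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# `id` is a hypothetical grouping column standing in for the `?` above.
conn.execute("CREATE TABLE t (id INTEGER, name TEXT, name1 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "summary", "other"),
        (1, "details", "vehicle stats"),
        (2, "summary", "vehicle stats"),
        (2, "other", "other"),          # group 2 never has name = 'details'
        (3, "details", "vehicle stats"),
        (3, "summary", "other"),
    ],
)

# Each SUM(CASE ...) counts the rows in the group that match one feature;
# requiring every count to be > 0 keeps groups containing all three features.
query = """
SELECT id
FROM t
GROUP BY id
HAVING SUM(CASE WHEN name = 'summary' THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN name = 'details' THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN name1 = 'vehicle stats' THEN 1 ELSE 0 END) > 0
ORDER BY id
"""
result = conn.execute(query).fetchall()
print(result)
```

Note that this keeps groups that contain all three features; it does not by itself reject groups that also contain other values. If that is required, an additional HAVING condition counting non-matching rows and requiring it to be 0 would be one way to express it.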

Hive summary function inside case statement

I am trying to write a simple Hive query:
select sum(case when pot_sls_q > 2* avg(pit_sls_q) then 1 else 0)/count(*) from prd_inv_fnd.item_pot_sls where dept_i=43 and class_i=3 where p_wk_end_d = 2014-06-28;
Here pit_sls_q and pot_sls_q are both columns in the Hive table, and I want the proportion of records whose pot_sls_q is more than 2 times the average of pit_sls_q. However, I get this error:
FAILED: SemanticException [Error 10128]: Line 1:95 Not yet supported place for UDAF 'avg'
As an experiment, I also tried using a window function:
select sum(case when pot_sls_q > 2* avg(pit_sls_q) over (partition by dept_i,class_i) then 1 else 0 end)/count(*) from prd_inv_fnd.item_pot_sls where dept_i=43 and class_i=3 and p_wk_end_d = '2014-06-28';
That should be fine, since filtering and partitioning the data on the same condition give essentially the same data, but even with this I get an error:
FAILED: SemanticException [Error 10002]: Line 1:36 Invalid column reference 'avg': (possible column names are: p_wk_end_d, dept_i, class_i, item_i, pit_sls_q, pot_sls_q)
Please suggest the right way of doing this.
You are using AVG inside SUM, which won't work (along with other syntax errors).
Try the analytic AVG(...) OVER () like this:
select sum(case when pot_sls_q > 2 * avg_pit_sls_q then 1 else 0 end) / count(*)
from (
select t.*,
avg(pit_sls_q) over () avg_pit_sls_q
from prd_inv_fnd.item_pot_sls t
where dept_i = 43
and class_i = 3
and p_wk_end_d = '2014-06-28'
) t;
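The same two-level shape (compute the overall average in an inner query, then compare against it in the outer aggregate) can be tried out with Python's sqlite3 module. This is only a sketch with made-up rows: it assumes a SQLite build with window function support (3.25 or later), and it multiplies by 1.0 because SQLite, unlike Hive, performs integer division on integer operands.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical version of the table: only the two columns
# that take part in the comparison.
conn.execute("CREATE TABLE item_pot_sls (pit_sls_q INTEGER, pot_sls_q INTEGER)")
conn.executemany(
    "INSERT INTO item_pot_sls VALUES (?, ?)",
    [(10, 60), (20, 40), (30, 55), (40, 10)],  # avg(pit_sls_q) = 25
)

# Inner query: attach the overall average to every row with AVG() OVER ().
# Outer query: count rows above twice that average and divide by the row count.
query = """
SELECT SUM(CASE WHEN pot_sls_q > 2 * avg_pit_sls_q THEN 1 ELSE 0 END) * 1.0 / COUNT(*)
FROM (
    SELECT pot_sls_q,
           AVG(pit_sls_q) OVER () AS avg_pit_sls_q
    FROM item_pot_sls
) t
"""
proportion = conn.execute(query).fetchone()[0]
print(proportion)
```

With these sample rows, two of the four pot_sls_q values (60 and 55) exceed twice the average of 25, so the proportion is 0.5.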

Query optimization with 3,000,000 rows for a single date in Oracle

Table x contains millions of rows, and I have to fetch data for a single date using a function-based index (TRUNC).
A single date, e.g. 22-07-16, has 3,000,000 rows. I am also using CASE to sum columns. The query takes 18 seconds. How can I reduce the time?
EDIT
QUERY:
SELECT SUM(
CASE
WHEN cssgoldenc1_.impact='Low'
THEN 1
ELSE 0
END) AS col_0_0_,
SUM(
CASE
WHEN cssgoldenc1_.impact='High'
THEN 1
ELSE 0
END) AS col_1_0_
FROM CSSCOMPLIANCEDETAIL csscomplia0_,
CSSGoldenConfiguration cssgoldenc1_,
CSS css7_
WHERE csscomplia0_.cssGoldenConfigurationID_FK=cssgoldenc1_.CSSGoldenConfigurationId_PK
AND csscomplia0_.cssID_FK =css7_.cssId_PK
AND (cssgoldenc1_.cmcategory IN ('Access List','Application of QoS Policy','Archive','BFD','BGP', 'CPU','Clock','Debug','Default settings','Entity Check','IGP Routing','Inclusion in VRF', 'Interface Parameters','LDP','LDP Establishment','License','Logging/Syslog/Debug','MTU Size', 'Multicast','Multilink','NodeReadiness','Nomenclature Related','Performance Optimization', 'QoS','Router OAM','Routing','SNMP','Security','Services','System Recovery', 'Type of Interface','Unicast','Unrequired Services','mBGP'))
AND TRUNC(csscomplia0_.creationDate) =to_Date('22-07-16','dd-mm-yy')
AND (css7_.softwareVersion IN ('/asr920-universalk9.V155_1S2_SR635680903_6.bin', '/asr920-universalk9_npe.03.13.00.S.154-3.S-ext.bin','/asr920-universalk9_npe.03.14.02.S.155-1.S2-std.bin', '/asr920-universalk9_npe.V155_1_S2_SR635680903_2.bin','/asr920-universalk9_npe.V155_1_S2_SR635680903_6.bin', '/bootflash','asr901-universalk9-mz.155-3.S1a.bin','asr903rsp1-universalk9_npe.V155_1_S2_SR635680903_10.bin', 'asr920-universalk9.V155_1S2_SR635680903_6.bin','asr920-universalk9_npe.03.13.00.S.154-3.S-ext.bin', 'asr920-universalk9_npe.03.13.00z.S.154-3.S0z-ext.bin','asr920-universalk9_npe.03.14.02.S.155-1.S2-', 'asr920-universalk9_npe.03.14.02.S.155-1.S2-std.bin','asr920-universalk9_npe.03.15.01.S.155-2.S1-std.bin', 'asr920-universalk9_npe.03.16.01a.S.155-3.S1a-ext.bin','asr920-universalk9_npe.2016-05-10_07.53_saappuku.bin' ,'asr920-universalk9_npe.V155_1_S2_SR635680903_2.bin','asr920-universalk9_npe.V155_1_S2_SR635680903_6.bin',
'asr920-universalk9_npe.V155_1_S2_SR635680903_6.binn','bootflash'));
Index :
create index idx_fnc on CSSCOMPLIANCEDETAIL(trunc(creationDate));
Try this. Basically, I moved the CASE into a subquery, since this way it shouldn't be evaluated 3M times. I also changed the query to use JOIN.
with cssgoldenc1_ as
(select /*+ Materialize */ CASE WHEN impact='Low' THEN 1
ELSE 0 END AS col_0_0_,
CASE WHEN impact='High' THEN 1
ELSE 0 END AS col_1_0_,
CSSGoldenConfigurationId_PK
from CSSGoldenConfiguration
where cmcategory IN ('Access List','Application of QoS Policy','Archive','BFD','BGP', 'CPU','Clock','Debug','Default settings','Entity Check','IGP Routing','Inclusion in VRF', 'Interface Parameters','LDP','LDP Establishment','License','Logging/Syslog/Debug','MTU Size', 'Multicast','Multilink','NodeReadiness','Nomenclature Related','Performance Optimization', 'QoS','Router OAM','Routing','SNMP','Security','Services','System Recovery', 'Type of Interface','Unicast','Unrequired Services','mBGP')
)
SELECT SUM(col_0_0_) AS col_0_0_,
SUM(col_1_0_) AS col_1_0_
FROM CSSCOMPLIANCEDETAIL csscomplia0_ join cssgoldenc1_ on csscomplia0_.cssGoldenConfigurationID_FK = cssgoldenc1_.CSSGoldenConfigurationId_PK
join CSS css7_ on csscomplia0_.cssID_FK = css7_.cssId_PK
WHERE TRUNC(csscomplia0_.creationDate) =to_Date('22-07-16','dd-mm-yy')
AND css7_.softwareVersion IN ('/asr920-universalk9.V155_1S2_SR635680903_6.bin', '/asr920-universalk9_npe.03.13.00.S.154-3.S-ext.bin','/asr920-universalk9_npe.03.14.02.S.155-1.S2-std.bin', '/asr920-universalk9_npe.V155_1_S2_SR635680903_2.bin','/asr920-universalk9_npe.V155_1_S2_SR635680903_6.bin', '/bootflash','asr901-universalk9-mz.155-3.S1a.bin','asr903rsp1-universalk9_npe.V155_1_S2_SR635680903_10.bin', 'asr920-universalk9.V155_1S2_SR635680903_6.bin','asr920-universalk9_npe.03.13.00.S.154-3.S-ext.bin', 'asr920-universalk9_npe.03.13.00z.S.154-3.S0z-ext.bin','asr920-universalk9_npe.03.14.02.S.155-1.S2-', 'asr920-universalk9_npe.03.14.02.S.155-1.S2-std.bin','asr920-universalk9_npe.03.15.01.S.155-2.S1-std.bin', 'asr920-universalk9_npe.03.16.01a.S.155-3.S1a-ext.bin','asr920-universalk9_npe.2016-05-10_07.53_saappuku.bin' ,'asr920-universalk9_npe.V155_1_S2_SR635680903_2.bin','asr920-universalk9_npe.V155_1_S2_SR635680903_6.bin',
'asr920-universalk9_npe.V155_1_S2_SR635680903_6.binn','bootflash');

Trouble counting null and missing values (and differentiating between the two) using RODBC package

I am working to create a matrix of missingness for a SQL database consisting of 5 tables and nearly 10 years of data. I have established ODBC connectivity and am using the RODBC package in R as my working environment. I am trying to write a function that will output a count of rows for each year for each table, a count and percent of null values (values not present) in a given year for a given table, and a count and percent of missing values (questions skipped or not answered) for a given table. I am working with the code below, trying to get it to work on one variable and then turn it into a function once it works. However, when I run this code (see below), it does not appear to work, and I believe the issue lies with assigning an integer value to the character for null, NA. I am getting this message when trying to list vars in the function:
Error in as.environment(pos) : no item called "22018 245 [Microsoft][ODBC SQL Server Driver][SQL Server]Conversion failed when converting the varchar value 'NA' to data type int." on the search list.
Also, when I try to find the environment for the function, R returns NULL. I do not necessarily want to assign a new value to the already existing variable, and I am new to SQL, but I am trying to do something along these lines: if X = 'NA' then Y = 1, else 0. I get the following error message when I try to run the final 2 lines creating the percent vars:
Error in eval(substitute(expr), data, enclos = parent.frame()) : invalid 'envir' argument of type 'character'
Any insight?
test1 <- sqlQuery(channel, "select
[EVENT_YEAR] AS 'YEAR',
COUNT(*) AS 'TOTAL',
SUM(CASE WHEN MOTHER_EDUCATION_TRENDABLE = 'NA' THEN 1 ELSE 0 END) AS 'NULL_VAL',
SUM(CASE WHEN MOTHER_EDUCATION_TRENDABLE = -1 THEN 1 ELSE 0 END) AS 'MISS_VAL'
from [GA_CMH].[dbo].[BIRTHS]
GROUP BY [EVENT_YEAR]
ORDER BY [EVENT_YEAR]")
test1$nullpct<-with(test1, NULL_VAL/TOTAL)
test1$misspct<-with(test1, MISS_VAL/TOTAL)
I believe the data type of your column MOTHER_EDUCATION_TRENDABLE is an integer. If so, try:
select
[EVENT_YEAR] AS 'YEAR',
COUNT(*) AS 'TOTAL',
SUM(CASE WHEN MOTHER_EDUCATION_TRENDABLE IS NULL THEN 1 ELSE 0 END) AS 'NULL_VAL',
SUM(CASE WHEN MOTHER_EDUCATION_TRENDABLE = -1 THEN 1 ELSE 0 END) AS 'MISS_VAL'
from [GA_CMH].[dbo].[BIRTHS]
GROUP BY [EVENT_YEAR]
ORDER BY [EVENT_YEAR]
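The distinction between a true NULL and a sentinel value such as -1 can be illustrated with a small sketch in Python's sqlite3 module rather than RODBC and SQL Server. The rows are invented, the table is reduced to the two columns involved, and the percentage step mirrors what the two `with(...)` lines in the question are trying to do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical version of the BIRTHS table.
conn.execute(
    "CREATE TABLE BIRTHS (EVENT_YEAR INTEGER, MOTHER_EDUCATION_TRENDABLE INTEGER)"
)
conn.executemany(
    "INSERT INTO BIRTHS VALUES (?, ?)",
    [
        (2010, 3), (2010, None), (2010, -1), (2010, 2),
        (2011, None), (2011, None), (2011, 5), (2011, -1),
    ],  # None is stored as SQL NULL; -1 is the "question skipped" code
)

# NULL must be tested with IS NULL; comparing with = never matches a NULL.
query = """
SELECT EVENT_YEAR AS YEAR,
       COUNT(*) AS TOTAL,
       SUM(CASE WHEN MOTHER_EDUCATION_TRENDABLE IS NULL THEN 1 ELSE 0 END) AS NULL_VAL,
       SUM(CASE WHEN MOTHER_EDUCATION_TRENDABLE = -1 THEN 1 ELSE 0 END) AS MISS_VAL
FROM BIRTHS
GROUP BY EVENT_YEAR
ORDER BY EVENT_YEAR
"""
counts = conn.execute(query).fetchall()

# Per-year share of NULL values, computed client-side from the counts.
null_pct = [(year, null_val / total) for year, total, null_val, _ in counts]
print(counts)
print(null_pct)
```

The 'NA' that R displays for a missing value is R's own representation; in the database the value is NULL, which is why the comparison against the string 'NA' forces a failing varchar-to-int conversion on SQL Server.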