Hive query is returning no data

CREATE EXTERNAL TABLE invoiceitems (
InvoiceNo INT,
StockCode INT,
Description STRING,
Quantity INT,
InvoiceDate BIGINT,
UnitPrice DOUBLE,
CustomerID INT,
Country STRING,
LineNo INT,
InvoiceTime STRING,
StoreID INT,
TransactionID STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3a://streamingdata/data/*';
The data files were created by a Spark Structured Streaming job:
...
data/part-00000-006fc42a-c6a1-42a2-af03-ae0c326b40bd-c000.json 7.1 KB 29/08/2018 10:27:32 PM
data/part-00000-0075634b-8513-47b3-b5f8-19df8269cf9d-c000.json 1.3 KB 30/08/2018 10:47:32 AM
data/part-00000-00b6b230-8bb3-49d1-a42e-ad768c1f9a94-c000.json 2.3 KB 30/08/2018 1:25:02 AM
...
Here are the first few rows of the first file:
{"InvoiceNo":5421462,"StockCode":22426,"Description":"ENAMEL WASH BOWL CREAM","Quantity":8,"InvoiceDate":1535578020000,"UnitPrice":3.75,"CustomerID":13405,"Country":"United Kingdom","LineNo":6,"InvoiceTime":"21:27:00","StoreID":0,"TransactionID":"542146260180829"}
{"InvoiceNo":5501932,"StockCode":22170,"Description":"PICTURE FRAME WOOD TRIPLE PORTRAIT","Quantity":4,"InvoiceDate":1535578020000,"UnitPrice":6.75,"CustomerID":13952,"Country":"United Kingdom","LineNo":26,"InvoiceTime":"21:27:00","StoreID":0,"TransactionID":"5501932260180829"}
However, if I run the query, no data is returned:
hive> select * from invoiceitems limit 5;
OK
Time taken: 24.127 seconds
The Hive log directories are empty:
$ ls /var/log/hive*
/var/log/hive:
/var/log/hive-hcatalog:
/var/log/hive2:
How can I debug this further?

I received more of a hint about the error when I ran:
select count(*) from invoiceitems;
This returned the following error:
...
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1
killedVertices:1 FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed,
vertexName=Map 1, vertexId=vertex_1535521291031_0011_1_00,
diagnostics=[Vertex vertex_1535521291031_0011_1_00 [Map 1]
killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input:
invoiceitems initializer failed, vertex=vertex_1535521291031_0011_1_00
[Map 1], java.io.IOException: cannot find dir =
s3a://streamingdata/data/part-00000-006fc42a-c6a1-42a2-af03-ae0c326b40bd-c000.json
in pathToPartitionInfo: [s3a://streamingdata/data/*]
I decided to change the create table definition from:
LOCATION 's3a://streamingdata/data/*';
to
LOCATION 's3a://streamingdata/data/';
and this fixed the issue.
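For completeness, here is the working definition; it is the same DDL with only the LOCATION changed. The DROP statement and the sanity-check query at the end are just there to make it easy to re-create and verify the table:
DROP TABLE IF EXISTS invoiceitems;

CREATE EXTERNAL TABLE invoiceitems (
InvoiceNo INT,
StockCode INT,
Description STRING,
Quantity INT,
InvoiceDate BIGINT,
UnitPrice DOUBLE,
CustomerID INT,
Country STRING,
LineNo INT,
InvoiceTime STRING,
StoreID INT,
TransactionID STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
-- point LOCATION at the directory itself, not a glob
LOCATION 's3a://streamingdata/data/';

-- should now return rows
SELECT * FROM invoiceitems LIMIT 5;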

Related

APACHE PIG - error Projected field [Units_Sold] does not exist in schema: group:chararray,D2:bag{:tuple(Item_Type:chararray,Units_Sold:int)}

Good afternoon - I have a Sales Dataset and am trying to see which Item has the most units sold.
Here is my code:
Country:chararray,
Item_Type:chararray,
Sales_Channel:chararray,
Order_Priority_site:chararray,
Order_Date:chararray,
Order_ID:chararray,
Ship_Date:chararray,
Units_Sold:int,
Unit_Price: int,
Unit_Cost: int,
Total_Revenue: int,
Total_Cost: int,
Total_Profit:int);
D2 = FOREACH data GENERATE Item_Type, Units_Sold;
D3 = GROUP D2 BY Item_Type;
D4 = FOREACH D3 GENERATE group, SUM(Units_Sold);
DUMP D4;
However, I get the error:
<file D, line 20, column 36> Invalid field projection. Projected field [Units_Sold] does not exist in schema: group:chararray,D2:bag{:tuple(Item_Type:chararray,Units_Sold:int)}.
Does anybody know how to fix this? Let me know if you need more info; this is the first question I have posted on here.
SUM expects a bag, and after the GROUP the Units_Sold field lives inside the bag named D2. The error shows you the schema:
D2:bag{:tuple(Item_Type:chararray,Units_Sold:int)}
Therefore change the SUM to reference the field through the bag:
D4 = FOREACH D3 GENERATE group, SUM(D2.Units_Sold);

How to iterate many Hive scripts over Spark

I have many Hive scripts (around 20-25), each containing multiple queries. I want to run these scripts with Spark so the process finishes faster, since the MapReduce jobs Hive launches take a long time and running the queries from Spark should be much quicker. Below is the code I have written; it works for 3-4 files, but when given many files with multiple queries it fails.
Please help me optimize it if possible.
val spark = SparkSession.builder.master("yarn").appName("my app").enableHiveSupport().getOrCreate()
// pick up every .hql file in the scripts directory
val hqlFiles = new java.io.File("/mapr/tmp/validation_script/").listFiles.filter(_.getName.endsWith(".hql")).toList
for (file <- hqlFiles) {
  scala.io.Source.fromFile(file).getLines()
    .filterNot(_.isEmpty)               // filter out empty lines
    .foreach(query => spark.sql(query)) // each non-empty line is run as one query
}
Some of the errors I am getting look like this:
ERROR SparkSubmit: Job aborted.
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:224)
ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 12 (sql at validationtest.scala:67) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 byte(s) of direct memory (used: 1023410176, max: 1029177344) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528)
I get many different types of errors when I run the same code multiple times.
Below is what one of the HQL files looks like. Its name is xyz.hql and it contains:
drop table pontis_analyst.daydiff_log_sms_distribution
create table pontis_analyst.daydiff_log_sms_distribution as select round(datediff(date_sub(current_date(),cast(date_format(CURRENT_DATE ,'u') as int) ),cast(subscriberActivationDate as date))/7,4) as daydiff,subscriberkey as key from pontis_analytics.prepaidsubscriptionauditlog
drop table pontis_analyst.weekly_sms_usage_distribution
create table pontis_analyst.weekly_sms_usage_distribution as select sum(event_count_ge) as eventsum,subscriber_key from pontis_analytics.factadhprepaidsubscriptionsmsevent where effective_date_ge_prt < date_sub(current_date(),cast(date_format(CURRENT_DATE ,'u') as int) - 1 ) and effective_date_ge_prt >= date_sub(date_sub(current_date(),cast(date_format(CURRENT_DATE ,'u') as int) ),84) group by subscriber_key;
drop table pontis_analyst.daydiff_sms_distribution
create table pontis_analyst.daydiff_sms_distribution as select day.daydiff,sms.subscriber_key,sms.eventsum from pontis_analyst.daydiff_log_sms_distribution day inner join pontis_analyst.weekly_sms_usage_distribution sms on day.key=sms.subscriber_key
drop table pontis_analyst.weekly_sms_usage_final_distribution
create table pontis_analyst.weekly_sms_usage_final_distribution as select spp.subscriberkey as key, case when spp.tenure < 3 then round((lb.eventsum )/dayDiff,4) when spp.tenure >= 3 then round(lb.eventsum/12,4)end as result from pontis_analyst.daydiff_sms_distribution lb inner join pontis_analytics.prepaidsubscriptionsubscriberprofilepanel spp on spp.subscriberkey = lb.subscriber_key
INSERT INTO TABLE pontis_analyst.validatedfinalResult select 'prepaidsubscriptionsubscriberprofilepanel' as fileName, 'average_weekly_sms_last_12_weeks' as attributeName, tbl1_1.isEqual as isEqual, tbl1_1.isEqualCount as isEqualCount, tbl1_2.countAll as countAll, (tbl1_1.isEqualCount/tbl1_2.countAll)* 100 as percentage from (select tbl1_0.isEqual as isEqual, count(isEqual) as isEqualCount from (select case when round(aal.result) = round(srctbl.average_weekly_sms_last_12_weeks) then 1 when aal.result is null then 1 when aal.result = 'NULL' and srctbl.average_weekly_sms_last_12_weeks = '' then 1 when aal.result = '' and srctbl.average_weekly_sms_last_12_weeks = '' then 1 when aal.result is null and srctbl.average_weekly_sms_last_12_weeks = '' then 1 when aal.result is null and srctbl.average_weekly_sms_last_12_weeks is null then 1 else 0 end as isEqual from pontis_analytics.prepaidsubscriptionsubscriberprofilepanel srctbl left join pontis_analyst.weekly_sms_usage_final_distribution aal on srctbl.subscriberkey = aal.key) tbl1_0 group by tbl1_0.isEqual) tbl1_1 inner join (select count(*) as countAll from pontis_analytics.prepaidsubscriptionsubscriberprofilepanel) tbl1_2 on 1=1
Your issue is that your job is running out of memory, as shown here:
failed to allocate 16777216 byte(s) of direct memory (used: 1023410176, max: 1029177344)
What you are trying to do is not the optimal way of doing things in Spark, but I would recommend that you remove the in-memory serialization (caching), as it will not help here. You should cache data only if it is going to be reused across multiple transformations; if it is used only once, there is no reason to put the data in the cache.

Data ingest issue in Hive: java.lang.OutOfMemoryError: unable to create new native thread

I'm a Hive newbie and am having an odyssey of problems getting a large (1 TB) HDFS file into a partitioned Hive managed table. Can you please help me get around this? I feel like I have a bad config somewhere, because I'm not able to complete reducer jobs.
Here is my query:
DROP TABLE IF EXISTS ts_managed;
SET hive.enforce.sorting = true;
CREATE TABLE IF NOT EXISTS ts_managed (
svcpt_id VARCHAR(20),
usage_value FLOAT,
read_time SMALLINT)
PARTITIONED BY (read_date INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC
TBLPROPERTIES("orc.compress"="snappy","orc.create.index"="true","orc.bloom.filter.columns"="svcpt_id");
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;
SET hive.cbo.enable=true;
SET hive.tez.auto.reducer.parallelism=true;
SET hive.exec.reducers.max=20000;
SET yarn.nodemanager.pmem-check-enabled = true;
SET optimize.sort.dynamic.partitioning=true;
SET hive.exec.max.dynamic.partitions=10000;
INSERT OVERWRITE TABLE ts_managed
PARTITION (read_date)
SELECT svcpt_id, usage, read_time, read_date
FROM ts_raw
DISTRIBUTE BY svcpt_id
SORT BY svcpt_id;
My cluster specs are:
VM cluster
4 total nodes
4 data nodes
32 cores
140 GB RAM
Hortonworks HDP 3.0
Apache Tez as default Hive engine
I am the only user of the cluster
My yarn configs are:
yarn.nodemanager.resource.memory-mb = 32GB
yarn.scheduler.minimum-allocation-mb = 512MB
yarn.scheduler.maximum-allocation-mb = 8192MB
yarn-heapsize = 1024MB
My Hive configs are:
hive.tez.container.size = 682MB
hive.heapsize = 4096MB
hive.metastore.heapsize = 1024MB
hive.exec.reducer.bytes.per.reducer = 1GB
hive.auto.convert.join.noconditionaltask.size = 2184.5MB
hive.tez.auto.reducer.parallelism = True
hive.tez.dynamic.partition.pruning = True
My tez configs:
tez.am.resource.memory.mb = 5120MB
tez.grouping.max-size = 1073741824 Bytes
tez.grouping.min-size = 16777216 Bytes
tez.grouping.split-waves = 1.7
tez.runtime.compress = True
tez.runtime.compress.codec = org.apache.hadoop.io.compress.SnappyCodec
I've tried countless configurations including:
Partition on date
Partition on date, cluster on svcpt_id with buckets
Partition on date, bloom filter on svcpt, sort by svcpt_id
Partition on date, bloom filter on svcpt, distribute by and sort by svcpt_id
I can get my mapping vertex to run, but I have not gotten my first reducer vertex to complete. Here is my most recent example from the above query:
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1043 1043 0 0 0 0
Reducer 2 container RUNNING 9636 0 0 9636 1 0
Reducer 3 container INITED 9636 0 0 9636 0 0
----------------------------------------------------------------------------------------------
VERTICES: 01/03 [=>>-------------------------] 4% ELAPSED TIME: 6804.08 s
----------------------------------------------------------------------------------------------
The error was:
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Reducer 2, vertexId=vertex_1537061583429_0010_2_01, diagnostics=[Task failed, taskId=task_1537061583429_0010_2_01_000070, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : java.lang.OutOfMemoryError: unable to create new native thread
I either get this OOM error, which I cannot seem to get around, or I get datanodes going offline and failing to meet my replication factor requirements.
At this point I've been troubleshooting for over 2 weeks. Any contacts for professional consultants I can pay to solve this problem would also be appreciated.
Thanks in advance!
I ended up solving this after speaking with a Hortonworks tech guy. It turns out I was over-partitioning my table. Instead of partitioning by day over about 4 years, I partitioned by month and it worked great.
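For anyone hitting the same wall, here is a rough sketch of what partitioning by month instead of day can look like. It assumes read_date is an INT of the form YYYYMMDD; the read_month column and its derivation are illustrative, not the poster's exact DDL:
-- monthly partitions (~48 over 4 years) instead of daily ones (~1,460)
CREATE TABLE IF NOT EXISTS ts_managed (
svcpt_id VARCHAR(20),
usage_value FLOAT,
read_time SMALLINT,
read_date INT)
PARTITIONED BY (read_month INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC
TBLPROPERTIES("orc.compress"="snappy","orc.create.index"="true","orc.bloom.filter.columns"="svcpt_id");

INSERT OVERWRITE TABLE ts_managed
PARTITION (read_month)
SELECT svcpt_id, usage, read_time, read_date,
CAST(read_date / 100 AS INT) AS read_month  -- 20180915 -> 201809
FROM ts_raw
DISTRIBUTE BY svcpt_id
SORT BY svcpt_id;
With a dynamic partition insert, every open partition keeps its own ORC writer (and its threads) alive on the reducers, so dropping from well over a thousand daily partitions to a few dozen monthly ones is presumably what relieved the "unable to create new native thread" pressure.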

SQL, Get average job runtime from a log table

I have a table LOGS with these attributes:
ID(int)
Date(datetime)
Core(int)
Source(string)
Message(string)
That table contains log entries from multiple jobs.
A job has multiple log entries, but the start/end entries always have the same message.
Example:
->1, 07.12.2016 10:49:00, 2, Some DLL, Calling Execute on CleanUp // Start
2, 07.12.2016 10:49:01, 3, Other DLL, BLABLABLA
3, 07.12.2016 10:49:10, 1, Other DLL, BLABLABLA
->4, 07.12.2016 10:50:15, 2, Other DLL, BLABLABLA // Job does sth.
->5, 07.12.2016 10:50:50, 2, Other DLL, Execution completed // End
The rows marked with an arrow belong to the same job.
As you can see, a job starts with 'Calling Execute...' and ends with 'Execution completed'.
What I want to achieve:
My task is to get the average job running time. The initial approach was to filter with
WHERE Message LIKE '%JOBNAME%' OR Message LIKE 'Execution completed'
and then compare the datetimes. This worked for some jobs, but some jobs run rarely, so I only get "Execution completed" rows, and the precision is not great when doing this manually.
At the end I want a list with following attributes:
ID(start log),
Start-Date,
End-Date,
Core,
Source-Start,
Source-End,
Message-Start,
Message-End
So later it's easy to calculate the difference and do the avg on it.
My idea
-> Get jobs by searching for a message.
-> Get a list of entries with the message "Execution completed" having:
a higher ID (end log is always after start log)
a later datetime
the same core
For example:
Having a job with the attributes
1, 07.12.2016 11:33:00, 2, Source 1, Calling Execute on job Cleanup
Then searching for all logs with
ID>1,
dateTime>07.12.2016 11:33:00,
Core=2,
Message="Execution completed"
Picking the first item of that list should give the end log of the job.
How can I do this with a SQL query?
PS: I cannot change anything in the database, I can only read data.
You can identify the jobs using a correlated subquery to get the next end record. The following shows how to get these fields:
select l.*, lend.*
from (select l.*,
             (select min(l2.date)
              from logs l2
              where l2.core = l.core and
                    l2.message = 'Execution completed' and
                    l2.date > l.date
             ) as end_date
      from logs l
      where l.message like 'Calling Execute%'
     ) l join
     logs lend
     on lend.core = l.core and lend.date = l.end_date;
This assumes that the date/time values are unique for a given "core".
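To get from there to the average runtime, one option is to wrap that join and average the start/end difference. This is a sketch reusing the query above; DATEDIFF(second, ...) is SQL Server syntax and the * 1.0 just avoids integer averaging, so adjust both for your database:
select avg(datediff(second, start_date, end_date) * 1.0) as avg_runtime_seconds
from (select l.date as start_date,
             lend.date as end_date
      from (select l.*,
                   (select min(l2.date)
                    from logs l2
                    where l2.core = l.core and
                          l2.message = 'Execution completed' and
                          l2.date > l.date
                   ) as end_date
            from logs l
            where l.message like 'Calling Execute%'
           ) l join
           logs lend
           on lend.core = l.core and lend.date = l.end_date
     ) jobs;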

Error with function percentile_disc in BigQuery

I am currently running the following query in BigQuery:
SELECT
longTime,
PERCENTILE_DISC(0.99) OVER (ORDER BY longTime DESC) AS percentil_99,
cityList
FROM (FLATTEN(FLATTEN([table], longTime), clientId))
WHERE
cityList IS NOT NULL
AND clientId='UKTRAR'
AND bp_Time BETWEEN TIMESTAMP('2015-01-18')
AND TIMESTAMP('2015-12-18')
but when it finishes it displays the following error:
Error: Resources exceeded during query execution.
Job ID: table:job_Qm_LOGJAFMSXSeXXnVRoLer4C8A
The table has the following size and row count:
Table Size => 23.4 GB
Number of Rows => 71,809,730