Cannot load data into a table stored in ORC format - hive

I'm trying to load data into a Hive table stored in the ORC format. I use the approach described here:
Loading Data from a .txt file to Table Stored as ORC in Hive.
Unfortunately, this method does not work for me. I get the following messages trying to insert data into the table:
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2014-03-04 16:28:01,497 null map = 0%, reduce = 0%
Ended Job = job_local531196501_0001 with errors
Error during job, obtaining debugging information...
Execution failed with exit status: 2
Obtaining error information
Task failed!
Task ID:
Stage-1
Where can I find the real reason for failure? No logs are mentioned in the output.

Related

What is the Flow Run Size Limit of a notebook activity in Azure Synapse?

I created a large spark notebook and ran it successfully in Azure Synapse. Then I created a new pipeline with a new notebook activity pointing to the existing spark notebook. I triggered it and it failed with the error message:
ErrorCode=FlowRunSizeLimitExceeded, ErrorMessage=Triggering the pipeline failed
due to large run size. This could happen when a run has a large number of
activities or large inputs used in some of the activities, including parameters.
There is only one activity in that pipeline, so it can't be that the number of activities is being exceeded. I googled the flow run size limit on activities and there were no results. What is the flow run size limit on the notebook activity?
Here is the information:

Filename                           Blob Size
UID_ISO_FIPS_LookUp_Table.csv      396 KiB
05-11-2021.csv                     630 KiB
https://ghoapi.azureedge.net/api   476 KiB

Type             Size            Cell Total   Cluster Size
.ipynb notebook  668,522 bytes   43 cells     Small (4 vCores / 32 GB) - 3 to 3 nodes
The error message after triggering the pipeline is the FlowRunSizeLimitExceeded error quoted above. The notebook joins three files into a single file with a single table; the processing of the CSV files includes filtering, selecting columns, renaming columns, and aggregating values (a rough sketch of that kind of code follows).
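Roughly, the notebook does something like this; the paths, column names, and join keys below are placeholders rather than the real ones:
%%pyspark
# Sketch only: paths, column names, and join keys are placeholders.
base = 'abfss://someContainer@somestorageAccount.dfs.core.windows.net/raw/csv/'
lookup = spark.read.csv(base + 'UID_ISO_FIPS_LookUp_Table.csv', header=True)
daily = spark.read.csv(base + '05-11-2021.csv', header=True)
gho = spark.read.csv(base + 'gho_api_extract.csv', header=True)  # saved copy of the API response
combined = (daily
    .filter(daily['Country_Region'].isNotNull())                 # filtering
    .select('Country_Region', 'Confirmed', 'Deaths')             # selecting columns
    .withColumnRenamed('Country_Region', 'country')              # renaming columns
    .join(lookup.withColumnRenamed('Country_Region', 'country')
                .select('country', 'iso3'), on='country')
    .join(gho.withColumnRenamed('SpatialDim', 'iso3'), on='iso3')
    .groupBy('country')
    .agg({'Confirmed': 'sum', 'Deaths': 'sum'}))                 # aggregating values
combined.write.mode('overwrite').saveAsTable('combined_table')   # a single output table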
Could someone explain why the error message occurred?
I was able to import that .csv with the following Python code on a small Spark pool:
%%pyspark
df = spark.read.load('abfss://someContainer@somestorageAccount.dfs.core.windows.net/raw/csv/05-11-2021.csv', format='csv'
, header = True
)
display(df.limit(10))
df.createOrReplaceTempView("tmp")
Saving it as a temp view allows you to write conventional SQL to query the dataframe, e.g.:
%%sql
SELECT SUM(deaths) xsum
FROM tmp

AWS glue write dynamic frame out of memory (OOM)

I am using AWS Glue to run a PySpark job that reads a dynamic frame from the catalog (the data is in Redshift) and then writes it to S3 in CSV format. I am getting this error saying the executor is out of memory:
An error occurred while calling o883.pyWriteDynamicFrame. Job aborted due to stage failure: Task 388 in stage 21.0 failed 4 times, most recent failure: Lost task 388.3 in stage 21.0 (TID 10713, ip-10-242-88-215.us-west-2.compute.internal, executor 86): ExecutorLostFailure (executor 86 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 16.1 GB of 16 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
My guess is that the dataframe is not partitioned well before writing, so one executor runs out of memory. But when I follow this doc https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html to add partition keys to my dynamic frame, the job simply times out after 4 hours. (The partition key I chose splits the data set into around 10 partitions.)
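For reference, here is a sketch of the kind of explicit repartitioning before the write that I have in mind (the database, table, and bucket names are placeholders, not my actual job):
# Sketch only -- database, table, and bucket names are placeholders.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_redshift_table")
# Spread the rows across more tasks so no single executor buffers too much data.
repartitioned = DynamicFrame.fromDF(dyf.toDF().repartition(200), glueContext, "repartitioned")
glueContext.write_dynamic_frame.from_options(
    frame=repartitioned,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="csv")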
Some other approaches I tried:
Configuring the fetchsize, but the AWS doc shows that Glue already configures the fetchsize to 1000 by default: https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/#:~:text=With%20AWS%20Glue%2C%20Dynamic%20Frames,Spark%20executor%20and%20database%20instance.
Setting pushdown predicates, but the input dataset is created daily and is not partitioned. I also need all rows to perform joins/filters in the ETL, so this might not be a good solution for me.
Does anyone know of some good alternatives to try out?

Why does a Java OutOfMemoryError occur when selecting fewer columns in a Hive query?

I have two hive select statements:
select * from ode limit 5;
This successfully pulls out 5 records from the table 'ode', with all the columns included in the result. However, the following query caused an error:
select content from ode limit 5;
Where 'content' is one column in the table. The error is:
hive> select content from ode limit 5;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
The second query should be a lot cheaper, so why does it cause a memory issue? How can I fix it?
When you select the whole table, Hive triggers a fetch task instead of an MR job, which involves no parsing (it is like calling hdfs dfs -cat ... | head -5).
As far as I can see, in your case the Hive client tries to run the map task locally.
You can choose one of two ways:
Force remote execution with hive.fetch.task.conversion
Increase the Hive client heap size using the HADOOP_CLIENT_OPTS environment variable.
You can find more details regarding fetch tasks here.

Can I filter data returned by the BigQuery connector for Spark?

I have adapted the instructions at Use the BigQuery connector with Spark to extract data from a private BigQuery object using PySpark. I am running the code on Dataproc. The object in question is a view that has a cardinality of >500 million rows. When I issue this statement:
table_data = spark.sparkContext.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
in the job output I see:
Bigquery connector version 0.10.7-hadoop2
Creating BigQuery from default credential.
Creating BigQuery from given credential.
Using working path: 'gs://dataproc-9e5dc592-1a35-42e6-9dd6-5f9dd9c8df87-europe-west1/hadoop/tmp/bigquery/pyspark_input20181108091647'
Estimated number of shards < desired num maps (0 < 93); clipping to 0.
Computed '2' shards for sharded BigQuery export.
Table 'projectname:datasetname.viewname' to be exported has 0 rows and 0 bytes
Estimated number of shards < desired num maps (0 < 93); clipping to 0.
Computed '2' shards for sharded BigQuery export.
Table 'projectname:datasetname.viewname' to be exported has 0 rows and 0 bytes
(timestamp/message-level/namespace removed for readability)
That was over 2 hours ago and the job is still running; there has been no more output in that time. I have looked in the gs:// location mentioned above and can see that a directory called shard-0 exists, but it is empty. Essentially there has been no visible activity for the past 2 hours.
I'm worried that the bq connector is trying to extract the entirety of that view. Is there a way that I can issue a query to define what data to extract as opposed to extracting the entire view?
UPDATE
I was intrigued by this message in the output:
Estimated number of shards < desired num maps (0 < 93); clipping to 0
It seems strange to me that the estimated number of shards would be 0. I've taken a look at some of the code (ShardedExportToCloudStorage.java) that is getting executed here, and the above message is logged from computeNumShards(). Given numShards=0, I'm assuming that numTableBytes=0, which means the function call:
tableToExport.getNumBytes();
(ShardedExportToCloudStorage.java#L97)
is returning 0, and I assume that is because the object I am accessing is a view, not a table. Am I onto something here, or am I on a wild goose chase?
UPDATE 2...
To test out my theory (above) that the source object being a view is causing a problem I have done the following:
Created a table in the same project as my dataproc cluster
create table jt_test.jttable1 (col1 string)
Inserted data into it
insert into jt_test.jttable1 (col1) values ('foo')
Submitted a dataproc job to read the table and output the number of rows
Here's the code:
conf = {
    # Input Parameters.
    'mapred.bq.project.id': project
    ,'mapred.bq.gcs.bucket': bucket
    ,'mapred.bq.temp.gcs.path': input_directory
    ,'mapred.bq.input.project.id': 'myproject'
    ,'mapred.bq.input.dataset.id': 'jt_test'
    ,'mapred.bq.input.table.id': 'jttable1'
}
table_data = spark.sparkContext.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
print ('got table_data')
print (table_data.toDF().head(10))
print ('row tally={}'.format(table_data.toDF().count()))
When I run the dataproc pyspark job, here's the output:
18/11/08 14:59:26 INFO <cut> Table 'myproject:jt_test.jttable1' to be exported has 1 rows and 5 bytes
got table_data
[Row(_1=0, _2=u'{"col1":"foo"}')]
row tally=1
Created a view over the table
create view jt_test.v_jtview1 as select col1 from `myproject.jt_test.jttable1`
Ran the same job, but this time consuming the view instead of the table
conf = {
    # Input Parameters.
    'mapred.bq.project.id': project
    ,'mapred.bq.gcs.bucket': bucket
    ,'mapred.bq.temp.gcs.path': input_directory
    ,'mapred.bq.input.project.id': 'myproject'
    ,'mapred.bq.input.dataset.id': 'jt_test'
    ,'mapred.bq.input.table.id': 'v_jtview1'
}
When I run the dataproc pyspark job, here's the output:
Table 'dh-data-dev-53702:jt_test.v_jtview1' to be exported has 0 rows and 0 bytes
and that's it! There's no more output and the job is still running, exactly the same as I explained above. It's effectively hung.
Seems to be a limitation of the BigQuery connector - I can't use it to consume from views.
To close the loop here, @jamiet confirmed in the comments that the root cause is that BigQuery does not support export from views; it supports export only from tables.
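A common workaround, not covered in the exchange above, is to materialize the view into a regular table first and point the connector at that table. Here is a rough sketch with the google-cloud-bigquery Python client, reusing the names from the question (the destination table name is made up):
from google.cloud import bigquery

client = bigquery.Client(project='myproject')
# Materialize the view into a table that the sharded export can read.
job_config = bigquery.QueryJobConfig(
    destination='myproject.jt_test.jttable_from_view',  # hypothetical target table
    write_disposition='WRITE_TRUNCATE')
client.query('SELECT col1 FROM `myproject.jt_test.v_jtview1`',
             job_config=job_config).result()
# Then set 'mapred.bq.input.table.id' to 'jttable_from_view' in the conf dictionary.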

BigQuery Error: Destination deleted/expired during execution

I have a batch script to load data from a Google Cloud Storage bucket into a table in BigQuery. A scheduled SSIS job executes this batch file daily.
bq load -F "\t" --encoding=UTF-8 --replace=true db_name.tbl_name gs://GSCloudBucket/file.txt "column1:string, column2:string, column3:string"
Weirdly, the execution is successful on some days and fails at other times. Here is what I have in the log:
Waiting on bqjob_r790a43a4_00000155a65559c2_1 ... (0s) Current status: RUNNING ......
Waiting on bqjob_r790a43a4_00000155a65559c2_1 ... (7s) Current status: DONE
BigQuery error in load operation: Error processing job: Destination
deleted/expired during execution
One possibility is that you have a 1-day (or a multiple-of-a-day) expiration on that table, either set directly on the table or via the default table expiration on the dataset. In that case, because the actual load time varies, you can end up in a situation where the destination table has already expired by the time the load runs.
You can use the configuration.load.createDisposition attribute to address this.
And/or you can make sure you have a proper expiration set; for a daily process it could be, say, 26 hours, so your SSIS job has an extra 2 hours to complete before the table can expire.
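For illustration, here is a rough equivalent of that load written with the google-cloud-bigquery Python client, with createDisposition set explicitly and a 26-hour expiration applied afterwards; the project id is made up, and the other names come from the bq command above:
from datetime import datetime, timedelta, timezone
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter='\t',
    encoding='UTF-8',
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,      # --replace=true
    create_disposition=bigquery.CreateDisposition.CREATE_IF_NEEDED,  # recreate the table if it expired
    schema=[
        bigquery.SchemaField('column1', 'STRING'),
        bigquery.SchemaField('column2', 'STRING'),
        bigquery.SchemaField('column3', 'STRING'),
    ],
)
table_id = 'my_project.db_name.tbl_name'  # project id is a placeholder
client.load_table_from_uri('gs://GSCloudBucket/file.txt', table_id,
                           job_config=job_config).result()
# Give the table a 26-hour expiration so the daily job has slack before it expires.
table = client.get_table(table_id)
table.expires = datetime.now(timezone.utc) + timedelta(hours=26)
client.update_table(table, ['expires'])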