What is the Flow Run Size Limit of a notebook activity in Azure Synapse?

I created a large spark notebook and ran it successfully in Azure Synapse. Then I created a new pipeline with a new notebook activity pointing to the existing spark notebook. I triggered it and it failed with the error message:
ErrorCode=FlowRunSizeLimitExceeded, ErrorMessage=Triggering the pipeline failed
due to large run size. This could happen when a run has a large number of
activities or large inputs used in some of the activities, including parameters.
There is only one activity in that pipeline, so it can't be the number of activities being exceeded. I googled "flow run size limit" for activities and found no results. What is the flow run size limit for the notebook activity?
Here is the information:
Filename                            Blob Size
UID_ISO_FIPS_LookUp_Table.csv       396 KiB
05-11-2021.csv                      630 KiB
https://ghoapi.azureedge.net/api    476 KiB

Type              Size            Cell Total   Cluster Size
.ipynb notebook   668,522 bytes   43 cells     Small (4 vCores / 32 GB) - 3 to 3 nodes
The error message above appeared after triggering the pipeline. The notebook's purpose is to join the three files into a single table; the processing of the CSV files includes filtering, selecting columns, renaming columns, and aggregating values. Could someone explain why this error occurred?
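For reference, here is a minimal sketch of the kind of processing involved; the column names, filter, and aggregation below are hypothetical placeholders, not the actual notebook code:
%%pyspark
# Hypothetical sketch only: filter, select/rename columns, aggregate, then join.
base = 'abfss://someContainer@somestorageAccount.dfs.core.windows.net/raw/csv/'

lookup = spark.read.csv(base + 'UID_ISO_FIPS_LookUp_Table.csv', header=True)
daily = spark.read.csv(base + '05-11-2021.csv', header=True)

daily = (daily
         .filter(daily['Country_Region'].isNotNull())       # filtering
         .select('Country_Region', 'Confirmed', 'Deaths')   # selecting columns
         .withColumnRenamed('Country_Region', 'country'))   # renaming columns

agg = daily.groupBy('country').agg({'Confirmed': 'sum', 'Deaths': 'sum'})  # aggregating
joined = agg.join(lookup.withColumnRenamed('Country_Region', 'country'), on='country')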

I was able to import that .csv with the following Python code on a small Spark pool:
%%pyspark
df = spark.read.load(
    'abfss://someContainer@somestorageAccount.dfs.core.windows.net/raw/csv/05-11-2021.csv',
    format='csv',
    header=True
)
display(df.limit(10))
df.createOrReplaceTempView("tmp")
Saving it as a temp view allows you to write conventional SQL to query the dataframe, e.g.:
%%sql
SELECT SUM(deaths) xsum
FROM tmp

Related

AWS glue write dynamic frame out of memory (OOM)

I am using AWS Glue to run a PySpark job that reads a dynamic frame from the Data Catalog (data in Redshift) and then writes it to S3 in CSV format. I am getting this error saying the executor is out of memory:
An error occurred while calling o883.pyWriteDynamicFrame. Job aborted due to stage failure: Task 388 in stage 21.0 failed 4 times, most recent failure: Lost task 388.3 in stage 21.0 (TID 10713, ip-10-242-88-215.us-west-2.compute.internal, executor 86): ExecutorLostFailure (executor 86 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 16.1 GB of 16 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
My guess is that the dataframe is not partitioned well before writing, so one executor runs out of memory. But when I follow this doc https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html to add partition keys to my dynamic frame, the job simply times out after 4 hours. (The partition key I chose splits the data set into around 10 partitions.)
Some other approaches I tried:
Configuring the fetchsize, but the AWS docs show that Glue already sets the fetchsize to 1000 by default: https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/#:~:text=With%20AWS%20Glue%2C%20Dynamic%20Frames,Spark%20executor%20and%20database%20instance.
Setting pushdown predicates, but the input dataset is created daily and is not partitioned. I also need all rows to perform joins/filters in the ETL, so this is probably not a good solution for me.
Does anyone know of some good alternatives to try?
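One thing worth trying is to repartition the data explicitly before writing, so that no single executor ends up buffering a disproportionate share of the rows. The following is only a rough sketch: the database, table, and bucket names are placeholders, and 200 partitions is an arbitrary starting point to tune.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Placeholder catalog names -- substitute your own database/table.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table")

# Spread the rows across more partitions before writing, then convert back.
df = dyf.toDF().repartition(200)
dyf_out = DynamicFrame.fromDF(df, glue_context, "dyf_out")

glue_context.write_dynamic_frame.from_options(
    frame=dyf_out,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="csv")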

Not able to update Big query table with Transfer from a Storage file

I am not able to update a BigQuery table from a storage file. I have the latest data file and the transfer runs successfully, but it says "8:36:01 AM Detected that no changes will be made to the destination table.".
I tried multiple ways.
Please help.
Thanks,
-Srini
You have to wait 1 hour after your file has been updated in Cloud Storage: https://cloud.google.com/bigquery-transfer/docs/cloud-storage-transfer?hl=en_US#minimum_intervals
I had the same error. I created two transfers from GCS to BigQuery, with write preference set to MIRROR and APPEND. I got the logs below (no error). The GCS file was uploaded less than one hour before.
MIRROR: Detected that no changes will be made to the destination table. Summary: succeeded 0 jobs, failed 0 jobs.
APPEND: None of the 1 new file(s) found matching "gs://mybucket/myfile" meet the requirement of being at least 60 minutes old. They will be loaded in next run. Summary: succeeded 0 jobs, failed 0 jobs.
Both jobs went through one hour later.
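If you want to confirm whether a given object meets the 60-minute age requirement, here is a quick sketch using the google-cloud-storage Python client (the bucket and object names are placeholders):
from datetime import datetime, timezone
from google.cloud import storage

client = storage.Client()
blob = client.bucket('mybucket').get_blob('myfile')

# blob.updated is the object's last-modified timestamp (UTC).
age_minutes = (datetime.now(timezone.utc) - blob.updated).total_seconds() / 60
print('Object age: {:.0f} minutes (the transfer requires at least 60)'.format(age_minutes))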

Can you control hdfs file size for a HortonWorks HDP 3.4.1 Managed Table?

I'm currently testing a cluster, and when using "CREATE TABLE AS" the resulting managed table ends up being one file of ~1.2 GB, while the base file the query is created from has many small files. The SELECT portion runs fast, but the result is then 2 reducers running to create one file, which takes 75% of the run time.
Additional testing:
1) If "CREATE EXTERNAL TABLE AS" is used instead, the query runs very fast and there is no merge-files step involved.
2) Also, the merging doesn't appear to occur with version HDP 3.0.1.
You can set hive.exec.reducers.bytes.per.reducer=<number> to let Hive decide the number of reducers based on reducer input size (the default value is 1 GB, i.e. 1000000000 bytes). You can refer to the links provided by @leftjoin for more details about this property and how to fine-tune it for your needs.
Another option you can try is to change the following properties:
set mapreduce.job.reduces=<number>
set hive.exec.reducers.max=<number>

Can I filter data returned by the BigQuery connector for Spark?

I have adapted the instructions at Use the BigQuery connector with Spark to extract data from a private BigQuery object using PySpark. I am running the code on Dataproc. The object in question is a view with a cardinality of >500 million rows. When I issue this statement:
table_data = spark.sparkContext.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
in the job output I see:
Bigquery connector version 0.10.7-hadoop2
Creating BigQuery from default credential.
Creating BigQuery from given credential.
Using working path: 'gs://dataproc-9e5dc592-1a35-42e6-9dd6-5f9dd9c8df87-europe-west1/hadoop/tmp/bigquery/pyspark_input20181108091647'
Estimated number of shards < desired num maps (0 < 93); clipping to 0.
Computed '2' shards for sharded BigQuery export.
Table 'projectname:datasetname.viewname' to be exported has 0 rows and 0 bytes
Estimated number of shards < desired num maps (0 < 93); clipping to 0.
Computed '2' shards for sharded BigQuery export.
Table 'projectname:datasetname.viewname' to be exported has 0 rows and 0 bytes
(timestamp/message-level/namespace removed for readability)
That was over 2 hours ago, and the job is still running; there has been no more output in that time. I have looked in the mentioned gs location and can see that a directory called shard-0 exists, but it is empty. Essentially there has been no visible activity for the past 2 hours.
I'm worried that the bq connector is trying to extract the entirety of that view. Is there a way that I can issue a query to define what data to extract as opposed to extracting the entire view?
UPDATE
I was intrigued by this message in the output:
Estimated number of shards < desired num maps (0 < 93); clipping to 0
It seems strange to me that the estimated number of shards would be 0. I've taken a look at some of the code (ShardedExportToCloudStorage.java) that is being executed here, and the above message is logged from computeNumShards(). Given numShards=0, I'm assuming that numTableBytes=0, which means the function call:
tableToExport.getNumBytes();
(ShardedExportToCloudStorage.java#L97)
is returning 0, and I assume the reason for that is that the object I am accessing is a view, not a table. Am I onto something here, or am I on a wild goose chase?
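One quick way to test that theory outside the connector is to inspect the object's metadata directly. This is a rough sketch using the google-cloud-bigquery client (separate from the Hadoop connector, assuming a reasonably recent client library and the placeholder names from the log above):
from google.cloud import bigquery

client = bigquery.Client(project='projectname')
obj = client.get_table('projectname.datasetname.viewname')  # get_table also resolves views

# A logical view stores no data of its own, so num_rows/num_bytes come back
# as 0 (or unset), which would explain numTableBytes=0 in computeNumShards().
print(obj.table_type, obj.num_rows, obj.num_bytes)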
UPDATE 2...
To test out my theory (above) that the source object being a view is causing a problem I have done the following:
Created a table in the same project as my dataproc cluster
create table jt_test.jttable1 (col1 string)
Inserted data into it
insert into jt_test.jttable1 (col1) values ('foo')
Submitted a dataproc job to read the table and output the number of rows
Here's the code:
conf = {
    # Input Parameters.
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'myproject',
    'mapred.bq.input.dataset.id': 'jt_test',
    'mapred.bq.input.table.id': 'jttable1'
}
table_data = spark.sparkContext.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
print ('got table_data')
print (table_data.toDF().head(10))
print ('row tally={}'.format(table_data.toDF().count()))
When I run the dataproc pyspark job, here's the output:
8/11/08 14:59:26 INFO <cut> Table 'myproject:jt_test.jttable1' to be exported has 1 rows and 5 bytes
got table_data
[Row(_1=0, _2=u'{"col1":"foo"}')]
row tally=1
Create a view over the table
create view jt_test.v_jtview1 as select col1 from `myproject.jt_test.jttable1`
Run the same job but this time consume the view instead of the table
conf = {
    # Input Parameters.
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'myproject',
    'mapred.bq.input.dataset.id': 'jt_test',
    'mapred.bq.input.table.id': 'v_jtview1'
}
When I run the dataproc pyspark job, here's the output:
Table 'dh-data-dev-53702:jt_test.v_jtview1' to be exported has 0 rows and 0 bytes
and that's it! There's no more output and the job is still running, exactly the same as I explained above. It's effectively hung.
Seems to be a limitation of the BigQuery connector - I can't use it to consume from views.
To close the loop here, @jamiet confirmed in the comments that the root cause is that BigQuery does not support export from views; it supports export only from tables.
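If you need to read that data through the connector anyway, one possible workaround is to materialize the view into an ordinary table first and point mapred.bq.input.table.id at that table. This is a rough sketch using the google-cloud-bigquery client (separate from the connector; the staging table name and the optional filter are placeholders):
from google.cloud import bigquery

client = bigquery.Client(project='myproject')

# Materialize the view into a staging table. Adding a WHERE clause here also
# answers the original question of filtering what gets extracted.
dest = bigquery.TableReference.from_string('myproject.jt_test.v_jtview1_materialized')
job_config = bigquery.QueryJobConfig(destination=dest,
                                     write_disposition='WRITE_TRUNCATE')
client.query('SELECT col1 FROM `myproject.jt_test.v_jtview1`',
             job_config=job_config).result()  # block until the query job completes

# Then set 'mapred.bq.input.table.id': 'v_jtview1_materialized' in conf and rerun.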

Cannot load data into a table stored in ORC format

I'm trying to load data into a Hive table stored in the ORC format. I use the approach described here:
Loading Data from a .txt file to Table Stored as ORC in Hive.
Unfortunately, this method does not work for me. I get the following messages when trying to insert data into the table:
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2014-03-04 16:28:01,497 null map = 0%, reduce = 0%
Ended Job = job_local531196501_0001 with errors
Error during job, obtaining debugging information...
Execution failed with exit status: 2
Obtaining error information
Task failed!
Task ID:
Stage-1
Where can I find the real reason for the failure? No logs are mentioned in the output.