Unable to resolve ERROR 2017: Internal error creating job configuration on EMR when running PIG - amazon-s3

I have been trying to run a very simple task with Pig on Amazon EMR. When I run the commands in interactive shell, everything works fine. But when I ran the same thing as batch job, I get
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal
error creating job configuration.
and the running the script fails.
Here's my 7 line script. It's just computing averages over tuples of Google bigrams. mc is match count and vc is volume count.
bigrams = LOAD 's3n://<<bucket-name>>/gb­bigrams/*' AS (bigram:chararray, year:int, mc:int, vc:int);
grouped_bigrams = group bigrams by bigram;
answer1 = foreach grouped_bigrams generate group, ((DOUBLE) SUM(bigrams.mc))/COUNT(bigrams) AS avg_mc;
sort_answer1 = ORDER answer1 BY avg_mc desc;
answer2 = LIMIT sort_answer1 5;
STORE answer1 INTO 's3n://<bucket-name>/output/bigram/20130409/answer1';
STORE answer2 INTO 's3n://<bucket-name>/output/bigram/20130409/answer2';
I was guessing the error has to something to do with STORE and s3 path. So I have tried various combinations like using $OUTPUT, backslashes, etc. But keep getting the same error.
Any help would be highly appreciated.

Have you tried using the S3 Block File System instead of the native file system?
e.g.
s3://<<bucket-name>>/gb­bigrams/*
s3://<bucket-name>/output/bigram/20130409/answer1

Related

Deploy sql workflow with DBX

I am developing deployment via DBX to Azure Databricks. In this regard I need a data job written in SQL to happen everyday. The job is located in the file data.sql. I know how to do it with a python file. Here I would do the following:
build:
python: "pip"
environments:
default:
workflows:
- name: "workflow-name"
#schedule:
quartz_cron_expression: "0 0 9 * * ?" # every day at 9.00
timezone_id: "Europe"
format: MULTI_TASK #
job_clusters:
- job_cluster_key: "basic-job-cluster"
<<: *base-job-cluster
tasks:
- task_key: "task-name"
job_cluster_key: "basic-job-cluster"
spark_python_task:
python_file: "file://filename.py"
But how can I change it so I can run a SQL job instead? I imagine it is the last two lines of code (spark_python_task: and python_file: "file://filename.py") which needs to be changed.
There are various ways to do that.
(1) One of the most simplest is to add a SQL query in the Databricks SQL lens, and then reference this query via sql_task as described here.
(2) If you want to have a Python project that re-uses SQL statements from a static file, you can add this file to your Python Package and then call it from your package, e.g.:
sql_statement = ... # code to read from the file
spark.sql(sql_statement)
(3) A third option is to use the DBT framework with Databricks. In this case you probably would like to use dbt_task as described here.
I found a simple workaround (although might not be the prettiest) to simply change the data.sql to a python file and run the queries using spark. This way I could use the same spark_python_task.

DBT: How to fix Database Error Expecting Value?

I was running into troubles today while running Airflow and airflow-dbt-python. I tried to debug a bit using the logs and the error shown in the logs was this one:
[2022-12-27, 13:53:53 CET] {functions.py:226} ERROR - [0m12:53:53.642186 [error] [MainThread]: Encountered an error:
Database Error
Expecting value: line 2 column 5 (char 5)
Quite a weird one.
Possibly check your credentials file that allows DBT to run queries on your DB (in our case we run DBT with BigQuery), in our case the credentials file was empty. We even tried to run DBT directly in the worker instead of running it through airflow, giving as a result exactly the same error. Unfortunately this error is not really explicit.

Writing apache beam pCollection to bigquery causes type Error

I have a simple beam pipeline, as follows:
with beam.Pipeline() as pipeline:
output = (
pipeline
| 'Read CSV' >> beam.io.ReadFromText('raw_files/myfile.csv',
skip_header_lines=True)
| 'Split strings' >> beam.Map(lambda x: x.split(','))
| 'Convert records to dictionary' >> beam.Map(to_json)
| beam.io.WriteToBigQuery(project='gcp_project_id',
dataset='datasetID',
table='tableID',
create_disposition=bigquery.CreateDisposition.CREATE_NEVER,
write_disposition=bigquery.WriteDisposition.WRITE_APPEND
)
)
However upon runnning I get a typeError, stating the following:
line 2147, in __init__
self.table_reference = bigquery_tools.parse_table_reference(if isinstance(table,
TableReference):
TypeError: isinstance() arg 2 must be a type or tuple of types
I have tried defining a TableReference object and passing it to the WriteToBigQuery class but still facing the same issue. Am I missing something here? I've been stuck at this step for what feels like forever and I don't know what to do. Any help is appreciated!
This probably occurred since you installed Apache Beam without the GCP modules. Please make sure to do following (in a virtual environment).
pip install apache-beam[gcp]
It's a weird error though so feel free to file a Github issue against the Apache Beam project.

How do I get Source Extractor to Analyze an Image?

I'm relatively inexperienced in coding, so right now I'm just familiarizing myself with the basics of how to use SE, which I'll need to use in the near future.
At the moment I'm trying to get it to analyze a FITS file on my computer (which is a Mac). I'm sure this is something obvious, but I haven't been able to get it do that. Following the instructions in Chapters 6 and 7 of Source Extractor for Dummies (linked below), I input the following:
sex MedSpiral_20deg_Serl2_.45_.fits.fits -c configuration_file.txt
And got the following error message:
WARNING: configuration_file.txt not found, using internal defaults
----- SExtractor 2.19.5 started on 2020-02-05 at 17:10:59 with 1 thread
Setting catalog parameters
ERROR: can't read default.param
I then tried entering parameters manually:
sex MedSpiral_20deg_Ser12_.45_.fits.fits -c configuration_file.txt -DETECT_TYPE CCD -MAG_ZEROPOINT 2.5 -PIXEL_SCALE 0 -SATUR_LEVEL 1 -SEEING_FWHM 1
And got the same error message. I tried referencing default.sex directly:
sex MedSpiral_20deg_Ser12_.45_.fits.fits -c default.sex
And got the same error message again, substituting "configuration_file.txt not found" with "default.sex not found" (I checked that default.sex was on my computer, it is). The same thing happened when I tried to use default.param.
Here's the link to SE for Dummies (Chapter 6 begins on page 19):
http://astroa.physics.metu.edu.tr/MANUALS/sextractor/Guide2source_extractor.pdf
If you run the command "sex MedSpiral_20deg_Ser12_.45_fits.fits -c default.sex" within the config folder (within the sextractor folder), you will be able to run it.
However, I wonder how I can possibly run sextractor command from any folder in my computer?

Is there a way to get a nice error report summary when running many jobs on DRMAA cluster?

I need to run a snakemake pipeline on a DRMAA cluster with a total number of >2000 jobs. When some of the jobs have failed, I would like to receive in the end an easy readable summary report, where only the failed jobs are listed instead of the whole job summary as given in the log.
Is there a way to achieve this without parsing the log file by myself?
These are the (incomplete) cluster options:
jobs: 200
latency-wait: 5
keep-going: True
rerun-incomplete: True
restart-times: 2
I am not sure if there is another way than parsing the log file yourself, but I've done it several times with grep and I am happy with the results:
cat .snakemake/log/[TIME].snakemake.log | grep -B 3 -A 3 error
Of course you should change the TIME placeholder for whichever run you want to check.