getting error ERROR 1000: Error during parsing. Lexical error - apache-pig

I wrote a Pig script as follows:
my_script.pig
bag_1 = LOAD '$INPUT' USING PigStorage('|') AS (LN_NR:chararray,ET_NR:chararray,ET_ST_DT:chararray,ED_DT:chararray,PI_ID:chararray);
bag_2 = LIMIT bag_1 $SIZE;
DUMP bag_2;
and created a parameter file:
my_param.txt:
INPUT = hdfs://0.0.0.0:8020/user/training/example
SIZE = 10
Now I am calling the script with this command:
pig my_param.txt my_script.pig
but I am getting this error:
ERROR 1000: Error during parsing. Lexical error
Any suggestions?

I think you need to provide the parameter file using the -m or -param_file option. Refer to the help documentation below; an example invocation follows it.
$ pig --help
Apache Pig version 0.11.0-cdh4.7.1 (rexported)
compiled Nov 18 2014, 09:08:23
USAGE: Pig [options] [-] : Run interactively in grunt shell.
Pig [options] -e[xecute] cmd [cmd ...] : Run cmd(s).
Pig [options] [-f[ile]] file : Run cmds found in file.
options include:
-4, -log4jconf - Log4j configuration file, overrides log conf
-b, -brief - Brief logging (no timestamps)
-c, -check - Syntax check
-d, -debug - Debug level, INFO is default
-e, -execute - Commands to execute (within quotes)
-f, -file - Path to the script to execute
-g, -embedded - ScriptEngine classname or keyword for the ScriptEngine
-h, -help - Display this message. You can specify topic to get help for that topic.
properties is the only topic currently supported: -h properties.
-i, -version - Display version information
-l, -logfile - Path to client side log file; default is current working directory.
-m, -param_file - Path to the parameter file
-p, -param - Key value pair of the form param=val
-r, -dryrun - Produces script with substituted parameters. Script is not executed.
-t, -optimizer_off - Turn optimizations off. The following values are supported:
SplitFilter - Split filter conditions
PushUpFilter - Filter as early as possible
MergeFilter - Merge filter conditions
PushDownForeachFlatten - Join or explode as late as possible
LimitOptimizer - Limit as early as possible
ColumnMapKeyPrune - Remove unused data
AddForEach - Add ForEach to remove unneeded columns
MergeForEach - Merge adjacent ForEach
GroupByConstParallelSetter - Force parallel 1 for "group all" statement
All - Disable all optimizations
All optimizations listed here are enabled by default. Optimization values are case insensitive.
-v, -verbose - Print all error messages to screen
-w, -warning - Turn warning logging on; also turns warning aggregation off
-x, -exectype - Set execution mode: local|mapreduce, default is mapreduce.
-F, -stop_on_failure - Aborts execution on the first failed job; default is off
-M, -no_multiquery - Turn multiquery optimization off; default is on
-P, -propertyFile - Path to property file
$
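With that option, the command for the files in the question would be, for example (assuming my_param.txt is in the directory you run pig from):
pig -param_file my_param.txt my_script.pig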

You are not using the command correctly.
To pass a parameter file, use -param_file in the command:
pig -param_file <file> pig_script.pig
You can find more details under Parameter Substitution in the Pig documentation.
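Alternatively, you can pass each parameter directly on the command line with -param; for example, with the values from the question:
pig -param INPUT=hdfs://0.0.0.0:8020/user/training/example -param SIZE=10 my_script.pig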

Related

Transferring strings across GitLab CI stages using variables

I want to store the output from a script in a variable for use in subsequent commands from within GitLab CI.
Here is the script:
image: ...
build c-ares:
  variables:
    CARES_ARTIFACTS_DIR: "-"
  script:
    - CARES_ARTIFACTS_DIR=$(./build-c-ares.sh)
  after_script:
    - echo $CARES_ARTIFACTS_DIR
  artifacts:
    name: CARES_ARTIFACTS
    paths:
      - $CARES_ARTIFACTS_DIR
My intention is to:
- first declare the variable CARES_ARTIFACTS_DIR with global scope
- set the variable's value using the output from the build-c-ares.sh script
- recover the output from the build-c-ares.sh script in a later command using the variable
My code does not behave as intended: on dereferencing the variable, I find it still contains the value it was assigned at declaration:
$ CARES_ARTIFACTS_DIR=$(./build-c-ares.sh)
Cloning into 'c-ares'...
Running after_script
00:01
Running after script...
$ echo $CARES_ARTIFACTS_DIR
-
Uploading artifacts for successful job
00:00
Uploading artifacts...
WARNING: -: no matching files. Ensure that the artifact path is relative to the working directory
ERROR: No files to upload
It is probably easier to just redirect the script output to a file and define that as an artifact.
Something similar to:
image: ...
build c-ares:
  script:
    - ./build-c-ares.sh > script_output
    - cat script_output
  artifacts:
    paths:
      - script_output
Regarding the specific issue: the variables used in the artifacts step will again use the variable initialisation defined for the job. Both the artifacts and the script steps of the job start with the custom CARES_ARTIFACTS_DIR variable set to the value "-":
build c-ares:
  variables:
    CARES_ARTIFACTS_DIR: "-"
  script:
    # $CARES_ARTIFACTS_DIR=="-"
    - CARES_ARTIFACTS_DIR=$(./build-c-ares.sh)
    # $CARES_ARTIFACTS_DIR=="hello from build-c-ares.sh"
    - echo $CARES_ARTIFACTS_DIR # prints "hello from build-c-ares.sh"
  after_script:
    # $CARES_ARTIFACTS_DIR=="-"
    - echo $CARES_ARTIFACTS_DIR # prints "-"
Fundamentally, GitLab variables cannot feed information across job steps as intended in the original post. My subjective opinion is to keep steps independent where possible and to restrict input to artifacts from upstream jobs, or to variables explicitly defined in the pipeline script or settings.
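To illustrate that pattern, a downstream job in a later stage can read the value back out of the artifact produced above; here is a minimal sketch (the job name, stage name and artifact file are illustrative, reusing the script_output example from earlier):
use c-ares output:
  stage: deploy                 # assumes a later stage is defined in the pipeline
  dependencies:
    - build c-ares              # fetch the script_output artifact from the upstream job
  script:
    - CARES_ARTIFACTS_DIR=$(cat script_output)   # recover the value written by build-c-ares.sh
    - echo $CARES_ARTIFACTS_DIR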

Pig basic program error

I am getting the below error while running a Pig script:
[error screenshot from the original post not reproduced]
Please read the Pig manual carefully:
https://pig.apache.org/docs/r0.9.1/start.html
and observe that -x expects execution mode to be specified (either local or mapreduce). So the correct command would be
pig -x local wordcount.pig

How do we use the 'variables' keyword in gitlab-ci.yml?

I am trying to make use of the variables: keyword documented in the GitLab CI documentation here:
FROM: https://docs.gitlab.com/ce/ci/yaml/README.html
variables
This feature requires gitlab-runner with version equal or greater than
0.5.0.
GitLab CI allows you to add to .gitlab-ci.yml variables that are set
in build environment. The variables are stored in repository and are
meant to store non-sensitive project configuration, ie. RAILS_ENV or
DATABASE_URL.
variables:
  DATABASE_URL: "postgres://postgres@postgres/my_database"
These variables can be later used in all executed commands and
scripts.
The YAML-defined variables are also set to all created service
containers, thus allowing to fine tune them.
When I attempt to use it, my builds do not run any stages and are marked successful anyway, a good sign of bad YAML. I pasted my gitlab-ci.yml contents into the LINT tool in the settings area and the output error is:
Status: syntax is incorrect
Error: variables job: unknown parameter PACKAGE_NAME
My YAML syntax matches the docs, yet it will not work. I'm unable to find any open bugs related to this. Below are my current versions and a sanitized version of my gitlab-ci.yml.
Gitlab Version: 7.13.2 Omnibus
Gitlab Runner Version: 0.5.2
gitlab-ci.yml (Sanitized)
types:
  - test
  - build

variables:
  PACKAGE_NAME: "awesome-django-app"
  PACKAGE_SUMMARY: "Awesome webapp backend."
  MAJOR_RELEASE: "1"
  MINOR_RELEASE: "0"
  PATCH_LEVEL: "0dev"
  DEV_DB_URL: "db"
  DEV_SERVER: "pydev.example.com"
  PROD_SERVER: "pyprod.example.com"
  TEST_SERVER: "pytest.example.com"

envtest:
  type: test
  script:
    - ". ./testbuild.sh"
  tags:
    - python2.7
    - postgres
    - linux
  except:
    - tags

buildrpm:
  type: build
  script:
    - mkdir -p ~/rpmbuild/SOURCES
    - mkdir -p ~/rpmbuild/SPECS
    - mkdir -p ~/tarbuild/$PACKAGE_NAME-$MAJOR_RELEASE.$MINOR_RELEASE.$PATCH_LEVEL
    - cp $PACKAGE_NAME.spec ~/rpmbuild/SPECS/.
    - cp -r * ~/tarbuild/$PACKAGE_NAME-$MAJOR_RELEASE.$MINOR_RELEASE.$PATCH_LEVEL/.
    - cd ~/tarbuild
    - tar -zcf ~/rpmbuild/SOURCES/$PACKAGE_NAME-$MAJOR_RELEASE.$MINOR_RELEASE.$PATCH_LEVEL.tar.gz *
    - cd ~
    - rm -Rf ~/tarbuild
    - rpmlint -i ~/rpmbuild/SPECS/$PACKAGE_NAME.spec
    - echo $CI_BUILD_ID
    - 'rpmbuild -ba ~/rpmbuild/SPECS/$PACKAGE_NAME.spec \
      --define="_build_number $CI_BUILD_ID" \
      --define="_python_version_min 2.7" \
      --define="_version $MAJOR_RELEASE.$MINOR_RELEASE.$PATCH_LEVEL" \
      --define="_package_name $PACKAGE_NAME" \
      --define="_summary $SUMMARY"'
    - scp rpmbuild/RPMS/noarch/$PACKAGE_NAME-$MAJOR_RELEASE.$MINOR_RELEASE.$PATCH_LEVEL-$CI_BUILD_ID.noarch.rpm $DEV_SERVER:~/.
  tags:
    - python2.7
    - postgres
    - linux
    - rpm
  except:
    - tags
Question:
How do I use the variables: feature properly?
Additional info:
Removing this section from the YAML file makes everything work, so the rest of the file is in working order. (Of course, undefined variables then lead to script errors...)
Even reducing the variables block to just PACKAGE_NAME causes the same break.
Update: the original answer below is no longer correct.
The documentation now stands, and there are more ways as well: variables can be created from the GUI, the API, or by being defined in .gitlab-ci.yml.
https://docs.gitlab.com/ce/ci/variables/README.html
While it is in the documentation, I do not believe that variables were included in the latest version of GitLab (7.13). The functionality to read variables out of the YAML files was brought in by a commit by ayufan 9 days ago.
Looking at the parser on the 7.13 stable branch, you can see that his contribution did not make it in. So assuming you're on 7.13 or earlier, I'm afraid we are out of luck. Since it is on master, I am fairly certain we'll see it in the next release. Until then, we can either monkey patch, do a git pull if you're using the source directly, or just rely on project variables until the next release.
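As a stopgap on 7.13, one option (my own sketch, not part of the original answer) is to define the values inline at the top of each job's script block until YAML-level variables: support ships, for example:
buildrpm:
  type: build
  script:
    # stopgap: set the values inline until `variables:` is supported
    - export PACKAGE_NAME="awesome-django-app"
    - export MAJOR_RELEASE="1"
    - export MINOR_RELEASE="0"
    - export PATCH_LEVEL="0dev"
    - mkdir -p ~/rpmbuild/SOURCES
    # ... rest of the script unchanged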

SGE Command Not Found, Undefined Variable

I'm attempting to set up a new compute cluster, and I'm currently experiencing errors when using the qsub command in SGE. Here's a simple experiment that shows the problem:
test.sh
#!/usr/bin/zsh
test="hello"
echo "${test}"
test.sh.eXX
test=hello: Command not found.
test: Undefined variable.
test.sh.oXX
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
If I run the script on the head node (sh test.sh), the output is correct. I submit the job to SGE by typing "qsub test.sh".
If I submit the exact same script in the same way on an established compute cluster (an HPC system), it works exactly as expected. What setting could be causing this problem?
Thanks for any help on this matter.
Most likely the queues on your cluster are set to posix_compliant mode with a default shell of /bin/csh. The posix_compliant setting means your #! line is ignored. You can either change the queues to unix_behavior or specify the required shell using qsub's -S option.
#$ -S /bin/sh
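The shell can also be requested at submission time instead of inside the script; for example, using the zsh path from the shebang above:
qsub -S /usr/bin/zsh test.sh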

Pig Batch mode: how to set logging level to hide INFO log messages?

Using Apache Pig version 0.10.1.21 (rexported). When I execute a Pig script, there are a lot of INFO logging lines that look like this:
2013-05-18 14:30:12,810 [Thread-28] INFO org.apache.hadoop.mapred.Task - Task 'attempt_local_0005_r_000000_0' done.
2013-05-18 14:30:18,064 [main] WARN org.apache.pig.tools.pigstats.PigStatsUtil - Failed to get RunningJob for job job_local_0005
2013-05-18 14:30:18,094 [Thread-31] WARN org.apache.hadoop.mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-05-18 14:30:18,114 [Thread-31] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2013-05-18 14:30:18,254 [Thread-32] INFO org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3fcb2dd1
2013-05-18 14:30:18,265 [Thread-32] INFO org.apache.hadoop.mapred.MapTask - io.sort.mb = 10
Is there a SET command within the Pig script, or a command-line flag, to set the logging level? Basically I would like to hide the [Thread-xx] INFO messages and only show WARNING and ERROR. I have tried the command-line debug flag, but unfortunately the INFO messages still show up:
pig -x local -d WARN MyScript.pig
Hope there is a solution. Thanks in advance for any help.
SOLVED: the answer by Loran Bendig (set log4j.properties) did it; it is summarized here for convenience.
Step 1: copy the log4j config file to the folder where my Pig scripts are located.
cp /etc/pig/conf.dist/log4j.properties log4j_WARN
Step 2: edit the log4j_WARN file and make sure these two lines are present:
log4j.logger.org.apache.pig=WARN, A
log4j.logger.org.apache.hadoop = WARN, A
Step 3: run the Pig script and instruct it to use the custom log4j configuration:
pig -x local -4 log4j_WARN MyScript.pig
Another option is the following:
Create a file named nolog.conf, with the following content
log4j.rootLogger=fatal
and then run pig as follows
pig -x local -4 nolog.conf
You can override the default log configuration (which includes INFO messages) like this:
pig -4 log4j.properties MyScript.pig
You need to set rootLogger too:
log4j.rootLogger=ERROR, A
log4j.logger.org.apache.pig=ERROR, A
log4j.logger.org.apache.hadoop = ERROR, A
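For reference, a complete minimal log4j configuration along these lines could look like the sketch below; it assumes the appender named A is a plain console appender (the stock /etc/pig/conf.dist/log4j.properties already defines one):
# show WARN and above only, on the console
log4j.rootLogger=WARN, A
log4j.logger.org.apache.pig=WARN, A
log4j.logger.org.apache.hadoop=WARN, A
log4j.appender.A=org.apache.log4j.ConsoleAppender
log4j.appender.A.layout=org.apache.log4j.PatternLayout
log4j.appender.A.layout.ConversionPattern=%d [%t] %-5p %c - %m%n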