Pentaho Spoon Job Executes Fine, Endless Loop in Kitchen - pentaho

Without getting too much into the weeds, I have a Pentaho PDI job with multiple sub-transformations and sub-jobs (ETL from MySQL to Postgres). This job runs exactly as expected from Spoon, no errors, but when I run the job--with the following command--I am met with an endless loop error at the first step where a parameter would need to be defined and passed from within the job (the named params from the command seem to integrate fine). The command I am using is as follows:
sudo /bin/sh kitchen.sh \
-rep=KettleFileRepo \
-dir=M2P \
-job=ETL-M2P \
-level=Rowlevel \
-param:MY.PAR.LOADTYPE=full \
-param:MY.PAR.TABLELIST=table1 \
-param:MY.PAR.TENANTS=tenant1 \
/
Has anyone run into this type of issue with a discrepancy between Spoon and Kitchen? Is there some sort of config or command line option that I am missing? I am running version 6.0.1.0-386 on OS X 10.11.4.
If you think more details would be beneficial please let me know and I can provide whatever is necessary.

I am not aware of any discrepancy between Spoon and Kitchen. Are you sure, its not something in the ETL that causing the loop. I would suggest to go through your ETL in detail.
Another thing you can try to debug is run only part of the job in kitchen and keep adding more as you see success.

Related

TWS job is getting abended

While executing the job, I am getting below error
Job.sh is the script - while executing it in tws I am getting below error
path of the script - job.sh: sqlplus: not found
Can any please help on this
Thanks
I don't know how Tivoli works, but you are writing to run a "job.sh" script, so I assume it is a shell script.
Normally when working with oracle it is a good idea to enter absolute paths, so your script could be
#!/bin/sh
$ORACLE_HOME=<your path of oracle installation>
$ORACLE_SID=<your instance name>
$PWD=<Position of your sql script>
..........
$ORACLE_HOME/bin/sqlplus login/password #$PWD/script.sql
I hope that's what you were looking for.

Any specific problems running (linux) BCP on "too many" threads?

Are there any specific problems with running Microsoft's BCP utility (on CentOS 7, https://learn.microsoft.com/en-us/sql/linux/sql-server-linux-migrate-bcp?view=sql-server-2017) on multiple threads? Googling could not find much, but am looking at a problem that seems to be related to just that.
Copying a set of large TSV files from HDFS to a remote MSSQL Server with some code of the form
bcpexport() {
filename=$1
TO_SERVER_ODBCDSN=$2
DB=$3
TABLE=$4
USER=$5
PASSWORD=$6
RECOMMEDED_IMPORT_MODE=$7
DELIMITER=$8
echo -e "\nRemoving header from TSV file $filename"
echo -e "Current head:\n"
echo $(head -n 1 $filename)
echo "$(tail -n +2 $filename)" > $filename
echo "First line of file is now..."
echo $(head -n 1 $filename)
# temp. workaround safeguard for NFS latency
#sleep 5 #FIXME: appears to sometimes cause script to hang, workaround implemented below, throws error if timeout reached
timeout 30 sleep 5
echo -e "\nReplacing null literal values with empty chars"
NULL_WITH_TAB="null\t" # WARN: assumes the first field is prime-key so never null
TAB="\t"
sed -i -e "s/$NULL_WITH_TAB/$TAB/g" $filename
echo -e "Lines containing null (expect zero): $(grep -c "\tnull\t" $filename)"
# temp. workaround safeguard for NFS latency
#sleep 5 #FIXME: appears to sometimes cause script to hang, workaround implemented below
timeout 30 sleep 5
/opt/mssql-tools/bin/bcp "$TABLE" in "$filename" \
$TO_SERVER_ODBCDSN \
-U $USER -P $PASSWORD \
-d $DB \
$RECOMMEDED_IMPORT_MODE \
-t "\t" \
-e ${filename}.bcperror.log
}
export -f bcpexport
parallel -q -j 7 bcpexport {} "$TO_SERVER_ODBCDSN" $DB $TABLE $USER $PASSWORD $RECOMMEDED_IMPORT_MODE $DELIMITER \
::: $DATAFILES/$TARGET_GLOB
where $DATAFILES/$TARGET_GLOB constructs a glob that lists a set of files in a directory.
When running this code for a set of TSV files, finding that sometimes some (but not all) of the parallel BCP threads fail, ie. some files successfully copy to MSSQL Server
Starting copy...
5397376 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 154902 Average : (34843.8 rows per sec.)
while others output error message
Starting copy...
BCP copy in failed
Usually, see this pattern: a few successful BCP copy-in operations in the first few threads returned, then a bunch of failing threads return their output until run out of files (GNU Parallel returns output only when whole thread done to appear same as if sequential).
Notice in the code there is the -e option to produce an error file for each BCP copy-in operation (see https://learn.microsoft.com/en-us/sql/tools/bcp-utility?view=sql-server-2017#e). When examining the files after observing these failing behaviors, all are blank, no error messages.
Only have seen this with the number of threads >= 10 (and only for certain sets of data (assuming has something to do with total number of files are files sizes, and yet...)), no errors seen so far when using ~7 threads, which further makes me suspect this has something to do with multi-threading.
Monitoring system resources (via free -mh) shows that generally ~13GB or RAM is always available.
May be helpful to note that the data bcp is trying to copy-in may be ~500000-1000000 records long with an upper limit of ~100 columns per record.
Does anyone have any idea what could be going on here? Note, am pretty new to using BCP as well as GNU Parallel and multi-threading.
No, no issues specific to the BCP program being run in multiple threads. You seem to be on the track of what I would say your issue is, system resources. Have you monitored system resources while increasing the number of threads? If anything, there is likely an issue with BCP executing properly when memory/cpu/network resources are low. Regarding the "-e" option, it is meant to output data errors. login errors, bad table names... many other errros are not reported in the file created with the -e option. When you get output using the "-e" option, you'll see info like "value truncated" and such... will give you line numbers and sample data that was at issue.
TLDR: Adding more threads to run concurrently to have bcp copy-in files of data seems to have the affect of overwhelming the endpoint MSSQL Server with write instructions, causing the bcp threads to fail (maybe timeing out?). When the number of threads becomes too many seems to depend on the size of the files getting copy-in'ed by bcp (ie. both the number of records in the file as well as the width of each record (ie. number of columns)).
Long version (more reasons for my theory):
1.
When running a larger number of bcp threads and looking at the processes started on the machine (https://clustershell.readthedocs.io/en/latest/tools/clush.html)
ps -aux | grep bcp
seeing a bunch of sleeping processes (notice the S, see https://askubuntu.com/a/360253/760862) as shown below (added newlines for readability)
me 135296 14.5 0.0 77596 6940 ? S 00:32 0:01
/opt/mssql-tools/bin/bcp TABLENAME in /path/to/tsv/1_16_0.tsv -D -S MyMSSQLServer -U myusername -P -d myDB -c -t \t -e /path/to/logfile
These threads appear to sleep for very long time. Further debugging into why these threads are sleeping suggests that they may in fact be doing their intended job (which would further imply that the problem may be coming from BCP itself (see https://stackoverflow.com/a/52748660/8236733)). From https://unix.stackexchange.com/a/47259/260742 and https://unix.stackexchange.com/a/36200/260742)
A process in S state is usually in a blocking system call, such as reading or writing to a file or the network, or waiting for another called program to finish.
(eg. writing to the MSSQL Server endpoint destination given to bcp in the ODBCDSN)
Your process will be in S state when it is doing reads and possibly writes that are blocking. Can also happen while waiting on semaphores or other synchronization primitives... This is all normal and expected, and not usually a problem... you don't want it to waste CPU while it's waiting for user input.
2. When running different sets of files of varying record-amount-per-file (eg. ranges of 500000 - 1000000 rows/file) and record-width-per-file (~10 - 100 columns/row), found that in cases with either very large data width or amounts, running a fixed set of bcp threads would fail.
Eg. for a set of ~33 TSVs with ~500000 rows each, each row being ~100 columns wide, a set of 30 threads would write the first few OK, but then all the rest would start returning failure messages. Incorporating a bit from #jamie's answer, the fact the the failure messages returned are "BCP copy in failed" errors does not necessarily mean it has do do with the content of the data in question. Having no actual content being written into the -e errorlog files from my process, #jamie's post says this
Regarding the "-e" option, it is meant to output data errors. login errors, bad table names... many other errros are not reported in the file created with the -e option. When you get output using the "-e" option, you'll see info like "value truncated" and such... will give you line numbers and sample data that was at issue.
Meanwhile, a set of ~33 TSVs with ~500000 rows each, each row being ~100 wide, and still using 30 bcp threads would complete quickly and without error (also would be faster when reducing the number of threads or file set). The only difference here being the overall size of the data being bcp copy-in'ed to the MSSQL Server.
All this while
free -mh
still showed that the machine running the threads still had ~15GB of free RAM remaining in each case (which is again why I suspect that the problem has to do with the remote MSSQL Server endpoint rather than with the code or local machine itself).
3. When running some of the tests from (2), found that manually killing the parallel process (via CTL+C) and then trying to remotely truncate the testing table being written to with /opt/mssql-tools/bin/sqlcmd -Q "truncate table mytable" on the local machine would take a very long time (as opposed to manually logging into the MSSQL Server and executing a truncate mytable in the DB). Again this makes me think that this has something to do with the MSSQL Server having too many connections and just being overwhelmed.
** Anyone with any MSSQL Mgmt Studio experience reading this (I have basically none), if you see anything here that makes you think that my theory is incorrect please let me know your thoughts.

tcl tcltest unknown option -run

When I run ANY test I get the same message. Here is an example test:
package require tcltest
namespace import -force ::tcltest::*
test foo-1.1 {save 1 in variable name foo} {} {
set foo 1
} {1}
I get the following output:
WARNING: unknown option -run: should be one of -asidefromdir, -constraints, -debug, -errfile, -file, -limitconstraints, -load, -loadfile, -match, -notfile, -outfile, -preservecore, -relateddir, -singleproc, -skip, -testdir, -tmpdir, or -verbose
I've tried multiple tests and nothing seems to work. Does anyone know how to get this working?
Update #1:
The above error was my fault, it was due to it being run in my script. However if I run the following at a command line I got no output:
[root#server1 ~]$ tcl
tcl>package require tcltest
2.3.3
tcl>namespace import -force ::tcltest::*
tcl>test foo-1.1 {save 1 in variable name foo} {expr 1+1} {2}
tcl>echo [test foo-1.1 {save 1 in variable name foo} {expr 1+1} {2}]
tcl>
How do I get it to output pass or fail?
You don't get any output from the test command itself (as long as the test passes, as in the example: if it fails, the command prints a "contents of test case" / "actual result" / "expected result" summary; see also the remark on configuration below). The test statistics are saved internally: you can use the cleanupTests command to print the Total/Passed/Skipped/Failed numbers (that command also resets the counters and does some cleanup).
(When you run runAllTests, it runs test files in child processes, intercepting the output from each file's cleanupTests and adding them up to a grand total.)
The internal statistics collected during testing is available in AFACT undocumented namespace variables like ::tcltest::numTests. If you want to work with the statistics yourself, you can access them before calling cleanupTests, e.g.
parray ::tcltest::numTests
array set myTestData [array get ::tcltest::numTests]
set passed $::tcltest::numTests(Passed)
Look at the source for tcltest in your library to see what variables are available.
The amount of output from the test command is configurable, and you can get output even when the test passes if you add p / pass to the -verbose option. This option can also let you have less output on failure, etc.
You can also create a command called ::tcltest::ReportToMaster which, if it exists, will be called by cleanupTests with the pertinent data as arguments. Doing so seems to suppress both output of statistics and at least most resetting and cleanup. (I didn't go very far in investigating that method.) Be aware that messing about with this is more likely to create trouble than solve problems, but if you are writing your own testing software based on tcltest you might still want to look at it.
Oh, and please use the newer syntax for the test command. It's more verbose, but you'll thank yourself later on if you get started with it.
Obligatory-but-fairly-useless (in this case) documentation link: tcltest

Runing pentaho command from a scheduler (tidal)

I am trying to execute the pentaho job over the windows through TIDAL, but the TIDAL does not execute the job at all. But when i run seperately on CMD PROMPT is executes.
The below is command used, IT does not the read the parameters assigned to it.
Kindly suggest on what has to be done.
E:\apps\Pentaho\data-integration\kitchen.bat /rep:Merlin_Repository /user:admin /pass:admin /dir=wwclaims /job=J-CLAIMS /level:Basic
You forgot a slash in /dir: option and you must use : not = symbols in your command.
For example in a windows batch script command
#echo off
SET LOG_PATHFILE=C:\logs\KITCHEN_name_of_job_%DATETIME%.log
call Kitchen.bat /rep:"name_repository" /job:"name_of_job" /dir:/foo/sub_foo1 /user:dark /pass:vador /level:Detailed >> %LOG_PATHFILE%`

Hadoop jobs getting poor locality

I have some fairly simple Hadoop streaming jobs that look like this:
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar \
-files hdfs:///apps/local/count.pl \
-input /foo/data/bz2 \
-output /user/me/myoutput \
-mapper "cut -f4,8 -d," \
-reducer count.pl \
-combiner count.pl
The count.pl script is just a simple script that accumulates counts in a hash and prints them out at the end - the details are probably not relevant but I can post it if necessary.
The input is a directory containing 5 files encoded with bz2 compression, roughly the same size as each other, for a total of about 5GB (compressed).
When I look at the running job, it has 45 mappers, but they're all running on one node. The particular node changes from run to run, but always only one node. Therefore I'm achieving poor data locality as data is transferred over the network to this node, and probably achieving poor CPU usage too.
The entire cluster has 9 nodes, all the same basic configuration. The blocks of the data for all 5 files are spread out among the 9 nodes, as reported by the HDFS Name Node web UI.
I'm happy to share any requested info from my configuration, but this is a corporate cluster and I don't want to upload any full config files.
It looks like this previous thread [ why map task always running on a single node ] is relevant but not conclusive.
EDIT: at #jtravaglini's suggestion I tried the following variation and saw the same problem - all 45 map jobs running on a single node:
yarn jar \
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.2.0.2.0.6.0-101.jar \
wordcount /foo/data/bz2 /user/me/myoutput
At the end of the output of that task in my shell, I see:
Launched map tasks=45
Launched reduce tasks=1
Data-local map tasks=18
Rack-local map tasks=27
which is the number of data-local tasks you'd expect to see on one node just by chance alone.