How to force STORE (overwrite) to HDFS in Pig? - apache-pig

When developing Pig scripts that use the STORE command I have to delete the output directory for every run or the script stops and offers:
2012-06-19 19:22:49,680 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6000: Output Location Validation Failed for: 'hdfs://[server]/user/[user]/foo/bar More info to follow:
Output directory hdfs://[server]/user/[user]/foo/bar already exists
So I'm searching for an in-Pig solution to automatically remove the directory, also one that doesn't choke if the directory is non-existent at call time.
In the Pig Latin Reference I found the shell command invoker fs. Unfortunately the Pig script breaks whenever anything produces an error. So I can't use
fs -rmr foo/bar
(i. e. remove recursively) since it breaks if the directory doesn't exist. For a moment I thought I may use
fs -test -e foo/bar
which is a test and shouldn't break or so I thought. However, Pig again interpretes test's return code on a non-existing directory as a failure code and breaks.
There is a JIRA ticket for the Pig project addressing my problem and suggesting an optional parameter OVERWRITE or FORCE_WRITE for the STORE command. Anyway, I'm using Pig 0.8.1 out of necessity and there is no such parameter.

At last I found a solution on grokbase. Since finding the solution took too long I will reproduce it here and add to it.
Suppose you want to store your output using the statement
STORE Relation INTO 'foo/bar';
Then, in order to delete the directory, you can call at the start of the script
rmf foo/bar
No ";" or quotations required since it is a shell command.
I cannot reproduce it now but at some point in time I got an error message (something about missing files) where I can only assume that rmf interfered with map/reduce. So I recommend putting the call before any relation declaration. After SETs, REGISTERs and defaults should be fine.
Example:
SET mapred.fairscheduler.pool 'inhouse';
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
%default name 'foobar'
rmf foo/bar
Rel = LOAD 'something.tsv';
STORE Rel INTO 'foo/bar';

Once you use the fs command, there a lot of ways to do this. For an individual file, I wound up adding this to the beginning of my scripts:
-- Delete file (won't work for output, which will be a directory
-- but will work for a file that gets copied or moved during the
-- the script.)
fs -touchz top_100
rm top_100
For a directory
-- Delete dir
fs -rm -r out

Related

Run Script execution purpose in xcode run script

I have a app that uses the shell script that do something during compilation, I am trying to understand what exactly that script do since I am getting compilation error and not able to run the source code.
If someone can help me and explain what below script do
#!/bin/bash
echo BUILD_ROOT = "${BUILD_ROOT}"
p=`echo "${BUILD_ROOT}" | sed -ne 's/.*\(^\/.*\/DerivedData\).*/\1/p'`
echo "//${BUILD_ROOT}" > ${SRCROOT}/info.h
cat "${p}/../info.h" >> ${SRCROOT}/info.h
I am getting following errors during compilation
BUILD_ROOT =
/Working/XXXXXXX/XXXXX/testTarget/DerivedData/testTarget/Build/Products
cat: /Working/XXXXXX/XXXXXXX/testTarget/DerivedData/../info.h: No such
file or directory
This script doesn't make too much sense. What are you hoping for it to do? I'll break it down per line:
Essentially this does nothing, it's storing the value of BUILD_ROOT in itself.
By default, Xcode uses ~/Library/Developer/Xcode/DerivedData/<YOUR_APP>-XXXXX/... for its BUILD_ROOT variable. This line of code stores the full path of that folder in variable p.
Creates a file called info.h at the base folder of your app's source code with the value of BUILD_ROOT.
This line is going to the folder above DerivedData and printing the value of info.h - which doesn't actually exist and therefore crashing the script - and appending it to the info.h created in line 3.

IntelliJ: Dynamically updated file header

By default, IntelliJ Idea will insert (something like) the following as the header of a new source file:
/**
* Created by JohnDoe on 2016-04-27.
*/
The corresponding template is:
/**
* Created by ${USER} on ${DATE}.
*/
Is it possible to update this template so that it inserts the last date of modification when the file is changed? For example:
/**
* Created by JohnDoe on 2016-03-27.
* Last modified by JaneDoe on 2016-04-27
*/
It is not supported out of the box. I suggest you do not include information about author and last edit/create time in file at all.
The reason is that your version control system (Git, SVN) contains the same information automatically. So the manual labelling is just duplicate of already existing info, but is only more error prone and needs to be manually updated.
Here's a working solution similar to what I'm using. Tested on mac os.
Create a bash script which will replace first occurrence of Last modified by JaneDoe on $DATE only if the exact value is not contained in the file:
#!/bin/bash
FILE=src/java/test/Test.java
DATE=`date '+%Y-%m-%d'`
PREFIX="Last modified by JaneDoe on "
STRING="$PREFIX.*$"
SUBSTITUTE="$PREFIX$DATE"
if ! grep -q "$SUBSTITUTE" "$FILE"; then
sed -i '' "1,/$(echo "$STRING")/ s/$(echo "$STRING")/$(echo "$SUBSTITUTE")/" $FILE
fi
Install File Watchers plugin.
Create a file watcher with appropriate scope (it may be this single file or any other scope, so that any change in project's source code will update modified date or version etc.) and put a path to your bash script into Program field.
Now every time the file changes the date will update. If you want to update date for each file separately, an argument $FilePath$ should be passed to the script.
This might have been just a comment to #oleg-mikhailov excellent idea, but the code snippet won't fit. Basically, I just tweaked his solution.
I needed a slightly different syntax but that's not the issue. The issue was that when the script ran automatically upon file save using the File Watchers plugin, if ran on a file which doesn't include PREFIX it would run over and over for ever.
I presume the that the issue is with the plugin itself, as it didn't happen when run from the shell, but I'm not sure why it happened.
Anyway, I ended up running the following script (as I said only a slight change with respect to the original). The new script also raises an error if the the prefix doesn't exist. For me this is a feature as Pycharm prompts me with the error, and I can fix the file.
Tested with PyCharm 2021.2.3 on macOS 11.6.
#!/bin/bash
FILE=$1
DATE=`date '+%Y-%m-%d'`
PREFIX="last_modified_date: "
STRING="$PREFIX.*$"
SUBSTITUTE="$PREFIX$DATE"
if ! grep -q "$SUBSTITUTE" "$FILE"; then
if grep -q "$PREFIX" "$FILE"; then
sed -i '' "s/$(echo "$STRING")/$(echo "$SUBSTITUTE")/" $FILE
else
echo "Error!"
echo "'$PREFIX' doesn't appear in $FILE"
exit 1
fi
fi
PHPStorm has not a "hook" for launching task after detect a change in file (just for uploading in server yes). Code templating is based on the creation of file not change.
The behaviour you want (automatic change file after manual change file) can be useful for lot of things but it's circular headhache for editor. Because if you change a file it must change file (and if a file is change ? it change file ?).
However, You can, perhaps, "enable Live Templates" when you launch a "reformat code" which able to rewrite your begin template code that way rewrite date modification.
Other solution is that use a tools with as grunt but I don't know if manage php file.

ActiveTCL - Unable to run a batch file from an Expect Script

I was originally trying to run an executable (tftpd32.exe) from Expect with the following command, but for some unknown reason it would hanged the entire script:
exec c:/tftpd32.351/tftpd32.exe
So, decided to call a batch file that will start the executable.
I tried to call the batch file with the following command, but get an error message stating windows cannot find the file.
exec c:/tftpd32.351/start_tftp.bat
I also tried the following, but it does not start the executable:
spwan cmd.exe /c c:/tftpd32.351/start_tftp.bat
The batch file contains this and it run ok when I double click on it:
start tftpd32.exe
Any help would be very much appreciated.
Thanks
The right way to run that program from Tcl is to do:
set tftpd "c:/tftpd32.351/tftpd32.exe"
exec {*}[auto_execok start] "" [file nativename $tftpd]
Note that you should always have that extra empty argument when using start (due to the weird way that start works; it has an optional string in quotes that specifies the window title to create, but it tends to misinterpret the first quoted string as that even if that leaves it with no mandatory arguments) and you need to use the native system name of the executable to run, hence the file nativename.
If you've got an older version of Tcl inside your expect program (8.4 or before) you'd do this instead:
set tftpd "c:/tftpd32.351/tftpd32.exe"
eval exec [auto_execok start] [list "" [file nativename $tftpd]]
The list command in that weird eval exec construction adds some necessary quoting that you'd have trouble generating otherwise. Use it exactly as above or you'll get very strange errors. (Or upgrade to something where you don't need nearly as much code gymnastics; the {*} syntax was added for a good reason!)

batch scripting: how to get parent dir name without full path?

I'm working on a script that processes a folder and there is always one file in it I need to rename. The new name should be the parent directory name. How do I get this in a batch file? The full path to the dir is known.
It is not very clear how the script is supposed to become acquainted with the path in question, but the following example should at least give you an idea of how to proceed:
FOR %%D IN ("%CD%") DO SET "DirName=%%~nxD"
ECHO %DirName%
This script gets the path from the CD variable and extracts the name only from it to DirName.
You can use basename command:
FULLPATH=/the/full/path/is/known
JUSTTHENAME=$(basename "$FULLPATH")
You can use built-in bash tricks:
FULLPATH=/the/full/path/is/known
JUSTTHENAME=${FULLPATH##*/}
Explanations:
first # means 'remove the pattern from the begining'
second # means 'remove the longer possible pattern'
*/ is the pattern
Using built-in bash avoid to call an external command (i.e. basename) therefore this optimises you script. However the script is less portable.

Git - how do I view the change history of a method/function?

So I found the question about how to view the change history of a file, but the change history of this particular file is huge and I'm really only interested in the changes of a particular method. So would it be possible to see the change history for just that particular method?
I know this would require git to analyze the code and that the analysis would be different for different languages, but method/function declarations look very similar in most languages, so I thought maybe someone has implemented this feature.
The language I'm currently working with is Objective-C and the SCM I'm currently using is git, but I would be interested to know if this feature exists for any SCM/language.
Recent versions of git log learned a special form of the -L parameter:
-L :<funcname>:<file>
Trace the evolution of the line range given by "<start>,<end>" (or the function name regex <funcname>) within the <file>. You may not give any pathspec limiters. This is currently limited to a walk starting from a single revision, i.e., you may only give zero or one positive revision arguments. You can specify this option more than once.
...
If “:<funcname>” is given in place of <start> and <end>, it is a regular expression that denotes the range from the first funcname line that matches <funcname>, up to the next funcname line. “:<funcname>” searches from the end of the previous -L range, if any, otherwise from the start of file. “^:<funcname>” searches from the start of file.
In other words: if you ask Git to git log -L :myfunction:path/to/myfile.c, it will now happily print the change history of that function.
Using git gui blame is hard to make use of in scripts, and whilst git log -G and git log --pickaxe can each show you when the method definition appeared or disappeared, I haven't found any way to make them list all changes made to the body of your method.
However, you can use gitattributes and the textconv property to piece together a solution that does just that. Although these features were originally intended to help you work with binary files, they work just as well here.
The key is to have Git remove from the file all lines except the ones you're interested in before doing any diff operations. Then git log, git diff, etc. will see only the area you're interested in.
Here's the outline of what I do in another language; you can tweak it for your own needs.
Write a short shell script (or other program) that takes one argument -- the name of a source file -- and outputs only the interesting part of that file (or nothing if none of it is interesting). For example, you might use sed as follows:
#!/bin/sh
sed -n -e '/^int my_func(/,/^}/ p' "$1"
Define a Git textconv filter for your new script. (See the gitattributes man page for more details.) The name of the filter and the location of the command can be anything you like.
$ git config diff.my_filter.textconv /path/to/my_script
Tell Git to use that filter before calculating diffs for the file in question.
$ echo "my_file diff=my_filter" >> .gitattributes
Now, if you use -G. (note the .) to list all the commits that produce visible changes when your filter is applied, you will have exactly those commits that you're interested in. Any other options that use Git's diff routines, such as --patch, will also get this restricted view.
$ git log -G. --patch my_file
Voilà!
One useful improvement you might want to make is to have your filter script take a method name as its first argument (and the file as its second). This lets you specify a new method of interest just by calling git config, rather than having to edit your script. For example, you might say:
$ git config diff.my_filter.textconv "/path/to/my_command other_func"
Of course, the filter script can do whatever you like, take more arguments, or whatever: there's a lot of flexibility beyond what I've shown here.
The closest thing you can do is to determine the position of your function in the file (e.g. say your function i_am_buggy is at lines 241-263 of foo/bar.c), then run something to the effect of:
git log -p -L 200,300:foo/bar.c
This will open less (or an equivalent pager). Now you can type in /i_am_buggy (or your pager equivalent) and start stepping through the changes.
This might even work, depending on your code style:
git log -p -L /int i_am_buggy\(/,+30:foo/bar.c
This limits the search from the first hit of that regex (ideally your function declaration) to thirty lines after that. The end argument can also be a regexp, although detecting that with regexp's is an iffier proposition.
git log has an option '-G' could be used to find all differences.
-G Look for differences whose added or removed line matches the
given <regex>.
Just give it a proper regex of the function name you care about. For example,
$ git log --oneline -G'^int commit_tree'
40d52ff make commit_tree a library function
81b50f3 Move 'builtin-*' into a 'builtin/' subdirectory
7b9c0a6 git-commit-tree: make it usable from other builtins
The correct way is to use git log -L :function:path/to/file as explained in eckes answer.
But in addition, if your function is very long, you may want to see only the changes that various commit had introduced, not the whole function lines, included unmodified, for each commit that maybe touch only one of these lines. Like a normal diff does.
Normally git log can view differences with -p, but this not work with -L.
So you have to grep git log -L to show only involved lines and commits/files header to contextualize them. The trick here is to match only terminal colored lines, adding --color switch, with a regex. Finally:
git log -L :function:path/to/file --color | grep --color=never -E -e "^(^[\[[0-9;]*[a-zA-Z])+" -3
Note that ^[ should be actual, literal ^[. You can type them by pressing ^V^[ in bash, that is Ctrl + V, Ctrl + [. Reference here.
Also last -3 switch, allows to print 3 lines of output context, before and after each matched line. You may want to adjust it to your needs.
Show function history with git log -L :<funcname>:<file> as showed in eckes's answer and git doc
If it shows nothing, refer to Defining a custom hunk-header to add something like *.java diff=java to the .gitattributes file to support your language.
Show function history between commits with git log commit1..commit2 -L :functionName:filePath
Show overloaded function history (there may be many function with same name, but with different parameters) with git log -L :sum\(double:filepath
git blame shows you who last changed each line of the file; you can specify the lines to examine so as to avoid getting the history of lines outside your function.