How to determine why a process fails in a Kotlin script (JVM)

I am developing a Kotlin script which executes commands on the platform it is running on. Platform commands are called from the script using this method:
fun exec(command: String, vararg arguments: String, runLive: Boolean = !isDebug): OutputStream {
    val allArgs = arguments.joinToString(" ")
    if (runLive) {
        val process = Runtime.getRuntime().exec(command, arguments)
        val exitCode = process.waitFor()
        if (exitCode != 0) {
            val platformError = String(BufferedInputStream(process.errorStream).readAllBytes(), Charset.defaultCharset())
            throw IllegalStateException("Execution of '$command $allArgs' failed with exit code $exitCode!\n$platformError")
        }
        return process.outputStream
    } else {
        println("$command $allArgs")
        return object : OutputStream() {
            override fun write(b: Int) {
                println("dummy $b")
            }
        }
    }
}
In the script, I try to get all the tags for a git repository using this call:
exec(command = "git", "tag", runLive = true)
When the command fails, how can process.errorStream be read? Right now nothing readable comes out of it, and the script failure only says:
java.lang.IllegalStateException: Execution of 'git tag' failed with exit code 1!
at Snap_tag_main.exec(snap-tag.main.kts:75)
at Snap_tag_main.<init>(snap-tag.main.kts:64)

Not all commands write their error messages to the error stream. The solution is to read both the standard output stream (process.inputStream) and the error stream (process.errorStream), like this:
if (exitCode != 0) {
    val platformMessages = """
${String(BufferedInputStream(process.inputStream).readAllBytes(), Charset.defaultCharset())}
${String(BufferedInputStream(process.errorStream).readAllBytes(), Charset.defaultCharset())}
""".trimIndent().trim()
    throw IllegalStateException("Execution of '$command $allArgs' failed with exit code $exitCode!\n$platformMessages\n---")
}
The failure message now looks like
java.lang.IllegalStateException: Execution of 'git tag' failed with exit code 1!
usage: git [--version] [--help] [-C <path>] [-c <name>=<value>]
[--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
[-p | --paginate | -P | --no-pager] [--no-replace-objects] [--bare]
[--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
[--super-prefix=<path>] [--config-env=<name>=<envvar>]
<command> [<args>]
These are common Git commands used in various situations:
start a working area (see also: git help tutorial)
clone Clone a repository into a new directory
init Create an empty Git repository or reinitialize an existing one
work on the current change (see also: git help everyday)
add Add file contents to the index
mv Move or rename a file, a directory, or a symlink
restore Restore working tree files
rm Remove files from the working tree and from the index
examine the history and state (see also: git help revisions)
bisect Use binary search to find the commit that introduced a bug
diff Show changes between commits, commit and working tree, etc
grep Print lines matching a pattern
log Show commit logs
show Show various types of objects
status Show the working tree status
grow, mark and tweak your common history
branch List, create, or delete branches
commit Record changes to the repository
merge Join two or more development histories together
rebase Reapply commits on top of another base tip
reset Reset current HEAD to the specified state
switch Switch branches
tag Create, list, delete or verify a tag object signed with GPG
collaborate (see also: git help workflows)
fetch Download objects and refs from another repository
pull Fetch from and integrate with another repository or a local branch
push Update remote refs along with associated objects
'git help -a' and 'git help -g' list available subcommands and some
concept guides. See 'git help <command>' or 'git help <concept>'
to read about a specific subcommand or concept.
See 'git help git' for an overview of the system.
---
at Snap_tag_main.exec(snap-tag.main.kts:82)
at Snap_tag_main.<init>(snap-tag.main.kts:66)
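An alternative (not from the answer above, just a sketch) is to avoid juggling two streams altogether: launch the process with ProcessBuilder and call redirectErrorStream(true), so everything the command prints arrives on a single stream. A minimal Kotlin sketch, assuming you only need the combined output as a String rather than the OutputStream the original helper returns:

import java.nio.charset.Charset

// Sketch only: run the command with stderr merged into stdout,
// then fail with whatever the command printed if the exit code is non-zero.
fun execMerged(command: String, vararg arguments: String): String {
    val process = ProcessBuilder(command, *arguments)
        .redirectErrorStream(true)   // stderr is folded into process.inputStream
        .start()
    // Read before waitFor() so a full pipe buffer cannot block the child.
    val output = process.inputStream.readAllBytes().toString(Charset.defaultCharset())
    val exitCode = process.waitFor()
    check(exitCode == 0) {
        "Execution of '$command ${arguments.joinToString(" ")}' failed with exit code $exitCode!\n$output"
    }
    return output
}

// Hypothetical usage: val tags = execMerged("git", "tag")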

Related

Issue with a modification of youtube-dl in .zshrc

The code I have in my .zshrc is:
ytdcd () { # youtube-dl that automatically puts stuff in a specific folder and returns to the former working directory after.
    cd ~/youtube/new/ && {
        youtube-dl "$@"
        cd - > /dev/null
    }
}
ytd() { # so far, this function can only take one page. so, i can only send one youtube video code per line. will modify it to accept multiple lines..
    for i in $*;
    do
        params=" $params https://youtu.be/$i"
    done
    ytdcd -f 18 $params
}
So, on the command line (terminal), when I enter ytd DFreHo3UCD0, I would like the video at https://youtu.be/DFreHo3UCD0 to be downloaded. The problem is that when I enter the commands in succession, the system just tries to download the video from the previous command and rightly claims the download is already complete.
For example, entering:
> ytd DFreHo3UCD0
> ytd L3my9luehfU
would not attempt to download the video for L3my9luehfU but only the video for DFreHo3UCD0 twice.
First -- there's no point in returning to the old directory in ytdcd: you can change to the new directory inside a subshell, and then exec youtube-dl to replace that subshell with the application process.
This has fewer things to go wrong: Aborting the function's execution can't leave things in the wrong directory, because the parent shell (the one you're interactively using) never changed directories in the first place.
ytdcd () {
    (cd ~/youtube/new/ && exec youtube-dl "$@")
}
Second -- use an array when building argument lists, not a string.
If you use set -x to log its execution, you'll see that your original command runs something like:
ytdcd -f 18 'https://youtu.be/one https://youtu.be/two https://youtu.be/three'
See those quotes? That's because $params is a string, passed as a single argument, not an array. (In bash -- or another shell following POSIX rules -- an unquoted string expansion would be string-split and glob-expanded, but zsh doesn't follow POSIX rules).
The following builds up an array of separate arguments and passes them individually:
ytd() {
    local -a params=( )
    local i
    for i; do
        params+=( "https://youtu.be/$i" )
    done
    ytdcd -f 18 "${params[@]}"
}
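The same principle carries over to the Kotlin exec helper at the top of this page: keep the arguments as a list of separate strings and hand them to ProcessBuilder, rather than joining them into one string. A minimal Kotlin sketch of mine (not part of this answer), assuming youtube-dl is on the PATH:

// Sketch: each argument stays its own argv entry, so URLs are never re-split or glob-expanded.
fun main() {
    val ids = listOf("DFreHo3UCD0", "L3my9luehfU")
    val args = listOf("youtube-dl", "-f", "18") + ids.map { "https://youtu.be/$it" }
    val process = ProcessBuilder(args)
        .inheritIO()   // show youtube-dl's own progress output
        .start()
    process.waitFor()
}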
Finally, it's come up that you don't actually intend to pass all the URLs to just one youtube-dl instance. To run a separate instance per URL, use:
ytd() {
    local i retval=0
    for i; do
        ytdcd -f 18 "$i" || retval=$?
    done
    return "$retval"
}
Note here that we're capturing non-success exit status, so as not to hide an error in any ytdcd instance other than the last (which would otherwise occur).
I would declare params as local, so that you are not appending URL after URL...
You can try to add this awesome function to your .zshrc:
funfun() {
    local _fun1="$_fun1 fun1!"
    _fun2="$_fun2 fun2!"
    echo "1 says: $_fun1"
    echo "2 says: $_fun2"
}
To observe the thing ;)
EDIT (Explanation):
When sourcing a shell script, you add it to your current environment; that is why you can run the functions you define there. So, when those functions use variables, by default those variables are global and accessible from anywhere in your environment! Therefore, in this case params is defined globally for the whole length of your shell session. Since you want to allow the download of several videos at once, you keep appending values to this global variable, which will grow all the time.
Enforcing local tells zsh to limit the scope of params to the function only.
Another solution is to reset the variable when you call the function.

Remove artifacts from CI manually

I have a private repository at gitlab.com that uses the CI feature. Some of the CI jobs create artifacts files that are stored. I just implemented that the artifacts are deleted automatically after one day by adding this to the CI configuration:
expire_in: 1 day
That works great - however, old artifacts won't be deleted (as expected). So my question is:
How can I delete old artifacts or artifacts that do not expire? (on gitlab.com, no direct access to the server)
You can use the GitLab REST API to delete the artifacts from the jobs if you don't have direct access to the server. Here's a sample curl script that uses the API:
#!/bin/bash
# project_id, find it here: https://gitlab.com/[organization name]/[repository name]/edit inside the "General project settings" tab
project_id="3034900"
# token, find it here: https://gitlab.com/profile/personal_access_tokens
token="Lifg_azxDyRp8eyNFRfg"
server="gitlab.com"

# go to https://gitlab.com/[organization name]/[repository name]/-/jobs
# then open the JavaScript console
# copy/paste => copy(_.uniq($('.ci-status').map((x, e) => /([0-9]+)/.exec(e.href)).toArray()).join(' '))
# press enter, and then copy the result here:
# repeat for every page you want
job_ids=(48875658 48874137 48873496 48872419)

for job_id in "${job_ids[@]}"
do
    URL="https://$server/api/v4/projects/$project_id/jobs/$job_id/erase"
    echo "$URL"
    curl --request POST --header "PRIVATE-TOKEN:${token}" "$URL"
    echo "\n"
done
An API call should be easier to script, with GitLab 14.7 (January 2022), which now offers:
Bulk delete artifacts with the API
While a good strategy for managing storage consumption is to set regular expiration policies for artifacts, sometimes you need to reduce items in storage right away.
Previously, you might have used a script to automate the tedious task of deleting artifacts one by one with API calls, but now you can use a new API endpoint to bulk delete job artifacts quickly and easily.
See Documentation, Issue 223793 and Merge Request 75488.
curl --request DELETE --header "PRIVATE-TOKEN: <your_access_token>" \
"https://gitlab.example.com/api/v4/projects/1/artifacts"
Building on top of @David's answer, @Philipp pointed out that there is now an API endpoint to delete only the job artifacts instead of the entire job.
You can run this script directly in the browser's Dev Tools console, or use node-fetch to run in node.js.
//Go to: https://gitlab.com/profile/personal_access_tokens
const API_KEY = "API_KEY";
//You can find the project id inside the "General project settings" tab
const PROJECT_ID = 12345678;
const PROJECT_URL = "https://gitlab.com/api/v4/projects/" + PROJECT_ID + "/";

let jobs = [];
for (let i = 0, currentJobs = []; i == 0 || currentJobs.length > 0; i++) {
    currentJobs = await sendApiRequest(
        PROJECT_URL + "jobs/?per_page=100&page=" + (i + 1)
    ).then(e => e.json());
    jobs = jobs.concat(currentJobs);
}

//skip jobs without artifacts
jobs = jobs.filter(e => e.artifacts);
//keep the latest build
jobs.shift();

for (let job of jobs)
    await sendApiRequest(
        PROJECT_URL + "jobs/" + job.id + "/artifacts",
        {method: "DELETE"}
    );

async function sendApiRequest(url, options = {}) {
    if (!options.headers)
        options.headers = {};
    options.headers["PRIVATE-TOKEN"] = API_KEY;
    return fetch(url, options);
}
According to the documentation, deleting the entire job log (click on the trash can) will also delete the artifacts.
I am on GitLab 8.17 and am able to remove artifacts for a particular job by navigating to the storage directory on the server itself; the default path is:
/var/opt/gitlab/gitlab-rails/shared/artifacts/<year_month>/<project_id?>/<jobid>
Removing either the whole job folder or just its contents makes the artifact view disappear from the GitLab pipeline page.
The storage path can be changed as described in docs:
https://gitlab.com/gitlab-org/gitlab-ce/blob/master/doc/administration/job_artifacts.md#storing-job-artifacts
If you have deleted all the jobs by accident (thinking the artifacts would be gone too, but they weren't), what is the alternative other than brute-forcing a range of job IDs?
I have this code, which brute-forces a range of job IDs. But since I use the gitlab.com public runners, it's a long range:
# project_id, find it here: https://gitlab.com/[organization name]/[repository name]/edit inside the "General project settings" tab
project_id="xxxxxx"
# token, find it here: https://gitlab.com/profile/personal_access_tokens
token="yyyyy"
server="gitlab.com"

# Get a range from the oldest known job to the latest known one, then brute-force.
# Used in the case where you deleted pipelines and can't retrieve the job IDs.
# https://stackoverflow.com/questions/52609966/for-loop-over-sequence-of-large-numbers-in-bash
for (( job_id = 59216999; job_id <= 190239535; job_id++ )); do
    echo "Job ID being deleted is $job_id"
    curl --request POST --header "PRIVATE-TOKEN:${token}" "https://${server}/api/v4/projects/${project_id}/jobs/${job_id}/erase"
    echo -en '\n'
done
This Python solution worked for me with GitLab 13.11.3.
#!/bin/python3
# delete_artifacts.py

import json
import requests

# adapt accordingly
base_url = 'https://gitlab.example.com'
project_id = '1234'
access_token = '123412341234'

#
# Get version. Tested with version 13.11.3
# cf. https://docs.gitlab.com/ee/api/version.html#version-api
#
print('GET /version')
x = requests.get(f"{base_url}/api/v4/version", headers={"PRIVATE-TOKEN": access_token})
print(x)
data = json.loads(x.text)
print(f'Using GitLab version {data["version"]}. Tested with 13.11.3')

#
# List project jobs
# cf. https://docs.gitlab.com/ee/api/jobs.html#list-project-jobs
#
request_str = f'projects/{project_id}/jobs'
url = f'{base_url}/api/v4/{request_str}'
print(f'GET /{request_str}')
x = requests.get(url, headers={"PRIVATE-TOKEN": access_token})
print(x)
data = json.loads(x.text)

input('WARNING: This will delete all artifacts. Job logs will remain available. Press Enter to continue...')

#
# Delete job artifacts
# cf. https://docs.gitlab.com/ee/api/job_artifacts.html#delete-artifacts
#
for entry in data:
    request_str = f'projects/{project_id}/jobs/{entry["id"]}/artifacts'
    url = f'{base_url}/api/v4/{request_str}'
    print(f'DELETE /{request_str}')
    x = requests.delete(url, headers={"PRIVATE-TOKEN": access_token})
    print(x)
I'll keep an updated version here. Feel free to reach out and improve the code.

LibGit2Sharp equivalent of git diff --stat

I'm looking for a way to capture how many lines have changed in each file in my working directory - like git diff --stat in git - is there a way to do this with LibGit2Sharp?
I know I can get total LinesAdded/Deleted from a Patch, but I'm wondering on a file by file basis.
The following will enumerate all the files that have been changed between two commits, along with the number of changes (global, line additions and line removals).
var patch = repo.Diff.Compare<Patch>(fromCommit, untilCommit);

foreach (var pec in patch)
{
    Console.WriteLine("{0} = {1} ({2}+ and {3}-)",
        pec.Path,
        pec.LinesAdded + pec.LinesDeleted,
        pec.LinesAdded,
        pec.LinesDeleted);
}
Should you need to access a specific file within a Patch, the type exposes an indexer to ease that:
PatchEntryChanges entryChanges = patch["path/to/my/file.txt"];

How can I view all comments posted by users in a Bitbucket repository

On the repository home page, I can see comments posted in recent activity at the bottom, but it only shows 10 comments.
I want to see all the comments posted since the beginning.
Is there any way?
Comments of pull requests, issues and commits can be retrieved using bitbucket’s REST API.
However it seems that there is no way to list all of them at one place, so the only way to get them would be to query the API for each PR, issue or commit of the repository.
Note that this takes a long time, since Bitbucket has seemingly set a limit on the number of API accesses to repository data: I got "Rate limit for this resource has been exceeded" errors after retrieving around a thousand results, and after that I could only retrieve about one entry per second elapsed since the last rate-limit error.
Finding the API URL to the repository
The first step is to find the URL to the repo. For private repositories, it is necessary to get authenticated by providing username and password (using curl’s -u switch). The URL is of the form:
https://api.bitbucket.org/2.0/repositories/{repoOwnerName}/{repoName}
Running git remote -v from the local git repository should provide the missing values. Check the forged URL (below referred to as $url) by verifying that repository information is correctly retrieved as JSON data from it: curl -u username $url.
Fetching comments of commits
Comments of a commit can be accessed at $url/commit/{commitHash}/comments.
The resulting JSON data can be processed by a script. Beware that the results are paginated.
Below I simply extract the number of comments per commit. It is indicated by the value of the member size of the retrieved JSON object; I also request a partial response by adding the GET parameter fields=size.
My script getNComments.sh:
#!/bin/sh
pw=$1
id=$2
json=$(curl -s -u username:"$pw" \
https://api.bitbucket.org/2.0/repositories/{repoOwnerName}/{repoName}/commit/$id/comments'?fields=size')
printf '%s' "$json" | grep -q '"type": "error"' \
&& printf "ERROR $id\n" && exit 0
nComments=$(printf '%s' "$json" | grep -o '"size": [0-9]*' | cut -d' ' -f2)
: ${nComments:=EMPTY}
checkNumeric=$(printf '%s' "$nComments" | tr -dc 0-9)
[ "$nComments" != "$checkNumeric" ] \
&& printf >&2 "!ERROR! $id:\n%s\n" "$json" && exit 1
printf "$nComments $id\n"
To use it, taking into account the possibility for the error mentioned above:
A) Prepare input data. From the local repository, generate the list of commits as wanted (run git fetch -a prior to update the local git repo if needed); check out git help rev-list for how it can be customised.
git rev-list --all | sort > sorted-all.id
cp sorted-all.id remaining.id
B) Run the script. Note that the password is passed here as a parameter – so first assign it to a variable safely using stty -echo; IFS= read -r passwd; stty echo, in one line; also see security considerations below. The processing is parallelised onto 15 processes here, using the option -P.
< remaining.id xargs -P 15 -L 1 ./getNComments.sh "$passwd" > commits.temp
C) When the rate limit is reached, that is when getNComments.sh prints !ERROR!, then kill the above command (Ctrl-C), and execute these below to update the input and output files. Wait a while for the request limit to increase, then re-execute the above one command and repeat until all the data is processed (that is when wc -l remaining.id returns 0).
cat commits.temp >> commits.result
cut -d' ' -f2 commits.result | sort | comm -13 - sorted-all.id > remaining.id
D) Finally, you can get the commits which received comments with:
grep '^[1-9]' commits.result
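If you would rather not drive this with xargs and shell, the same per-commit request (with the fields=size partial response described above) can be issued from Kotlin with the JDK HTTP client. This is a sketch only; owner, repo, credentials and commit hash are placeholders:

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.util.Base64

// Sketch: ask the Bitbucket 2.0 API how many comments a commit has.
// The response body is a small JSON object such as {"size": 3}; parse it
// with your JSON library of choice, or grep it as the shell script does.
fun commentCountJson(owner: String, repo: String, commit: String, user: String, password: String): String {
    val auth = Base64.getEncoder().encodeToString("$user:$password".toByteArray())
    val request = HttpRequest.newBuilder()
        .uri(URI.create("https://api.bitbucket.org/2.0/repositories/$owner/$repo/commit/$commit/comments?fields=size"))
        .header("Authorization", "Basic $auth")
        .GET()
        .build()
    return HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())
        .body()
}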
Fetching comments of pull requests and issues
The procedure is the same as for fetching commits’ comments, but for the following two adjustments:
Edit the script to replace in the URL commit by pullrequests or by issues, as appropriate;
Let $n be the number of issues/PRs to search. The git rev-list command above becomes: seq 1 $n > sorted-all.id
The total number of PRs in the repository can be obtained with:
curl -su username $url/pullrequests'?state=&fields=size'
and, if the issue tracker is set up, the number of issues with:
curl -su username $url/issues'?fields=size'
Hopefully, the repository has few enough PRs and issues so that all data can be fetched in one go.
Viewing comments
They can be viewed normally via the web interface on their commit/PR/issue page at:
https://bitbucket.org/{repoOwnerName}/{repoName}/commits/{commitHash}
https://bitbucket.org/{repoOwnerName}/{repoName}/pull-requests/{prId}
https://bitbucket.org/{repoOwnerName}/{repoName}/issues/{issueId}
For example, to open all PRs with comments in firefox:
awk '/^[1-9]/{print "https://bitbucket.org/{repoOwnerName}/{repoName}/pull-requests/"$2}' PRs.result | xargs firefox
Security considerations
Arguments passed on the command line are visible to all users of the system, via ps ax (or /proc/$PID/cmdline). Therefore the bitbucket password will be exposed, which could be a concern if the system is shared by multiple users.
There are three commands getting the password from the command line: xargs, the script, and curl.
It appears that curl tries to hide the password by overwriting its memory, but it is not guaranteed to work, and even if it does, it leaves it visible for a (very short) time after the process starts. On my system, the parameters to curl are not hidden.
A better option could be to pass the sensitive information through environment variables. They should be visible only to the current user and root via ps axe (or /proc/$PID/environ); although it seems that there are systems that let all users access this information (do a ls -l /proc/*/environ to check the environment files’ permissions).
In the script simply replace the lines pw=$1 id=$2 with id=$1, then pass pw="$passwd" before xargs in the command line invocation. It will make the environment variable pw visible to xargs and all of its descendent processes, that is the script and its children (curl, grep, cut, etc), which may or may not read the variable. curl does not read the password from the environment, but if its password hiding trick mentioned above works then it might be good enough.
There are ways to avoid passing the password to curl via the command line, notably via standard input using the option -K -. In the script, replace curl -s -u username:"$pw" with printf -- '-s\n-u "%s"\n' "$authinfo" | curl -K - and define the variable authinfo to contain the data in the format username:password. Note that this method needs printf to be a shell built-in to be safe (check with type printf), otherwise the password will show up in its process arguments. If it is not a built-in, try with print or echo instead.
A simple alternative to an environment variable that will not appear in ps output in any case is via a file. Create a file with read/write permissions restricted to the current user (chmod 600), and edit it so that it contains username:password as its first line. In the script, replace pw=$1 with IFS= read -r authinfo < "$1", and edit it to use curl’s -K option as in the paragraph above. In the command line invocation replace $passwd with the filename.
The file approach has the drawback that the password will be written to disk (note that files in /proc are not on the disk). If this too is undesirable, it is possible to pass a named pipe instead of a regular file:
mkfifo pipe
chmod 600 pipe
# make sure printf is a builtin, or use an equivalent instead
(while :; do printf -- '%s\n' "username:$passwd"; done) > pipe&
pid=$!
exec 3<pipe
Then invoke the script passing pipe instead of the file. Finally, to clean up do:
kill $pid
exec 3<&-
This will ensure the authentication info is passed directly from the shell to the script (through the kernel), is not written to disk and is not exposed to other users via ps.
You can go to Commits and see the top line for each commit; you will need to click on each one to see further information.
If I find a way to see all without drilling into each commit, I will update this answer.

How can I inspect a Hadoop SequenceFile for which I lack full schema information?

I have a compressed Hadoop SequenceFile from a customer which I'd like to inspect. I do not have full schema information at this time (which I'm working on separately).
But in the interim (and in the hopes of a generic solution), what are my options for inspecting the file?
I found a tool forqlift: http://www.exmachinatech.net/01/forqlift/
I have tried 'forqlift list' on the file. It complains that it can't load the classes for the custom Writable subclasses included, so I will need to track down those implementations.
But is there any other option available in the meantime? I understand that most likely I can't extract the data, but is there some tool for scanning how many key/value pairs there are, and of what type?
From shell:
$ hdfs dfs -text /user/hive/warehouse/table_seq/000000_0
or directly from hive (which is much faster for small files, because it is running in an already started JVM)
hive> dfs -text /user/hive/warehouse/table_seq/000000_0
works for sequence files.
Check the SequenceFileReadDemo class in the 'Hadoop: The Definitive Guide' sample code. Sequence files have the key/value types embedded in them. Use SequenceFile.Reader.getKeyClass() and SequenceFile.Reader.getValueClass() to get the type information.
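A minimal Kotlin sketch of that approach, assuming the older SequenceFile.Reader(fs, path, conf) constructor used elsewhere in this thread; it only prints the key/value class names recorded in the file header, without trying to deserialize any records:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.SequenceFile

fun main(args: Array<String>) {
    val conf = Configuration()
    val fs = FileSystem.get(conf)
    // Pass the path to the sequence file as the first argument.
    SequenceFile.Reader(fs, Path(args[0]), conf).use { reader ->
        // The header records which Writable classes were used to write the file.
        println("key class:   ${reader.keyClassName}")
        println("value class: ${reader.valueClassName}")
        println("compression: ${reader.compressionType}")
    }
}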
My first thought would be to use the Java API for sequence files to try to read them. Even if you don't know which Writable is used by the file, you can guess and check the error messages (there may be a better way that I don't know).
For example:
private void readSeqFile(Path pathToFile) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, pathToFile, conf);
    Text key = new Text(); // this could be the wrong type
    Text val = new Text(); // also could be wrong
    while (reader.next(key, val)) {
        System.out.println(key + ":" + val);
    }
}
This program would crash if those are the wrong types, but the Exception should say which Writable type the key and value actually are.
Edit:
Actually if you do less file.seq usually you can read some of the header and see what the Writable types are (at least for the first key/value). On one file, for example, I see:
SEQ^F^Yorg.apache.hadoop.io.Text"org.apache.hadoop.io.BytesWritable
I'm not a Java or Hadoop programmer, so my way of solving the problem may not be the best one, but anyway.
I spent two days solving the problem of reading a FileSeq locally (Linux Debian amd64) without installing Hadoop.
The provided sample
while (reader.next(key, val)) {
    System.out.println(key + ":" + val);
}
works well for Text, but didn't work for BytesWritable compressed input data.
What did I do? I downloaded this utility for creating (writing) Hadoop SequenceFile data:
https://github.com/shsdev/sequencefile-utility/archive/master.zip
I got it working, then modified it for reading input Hadoop SeqFiles.
The instructions for running this utility from scratch on Debian:
sudo apt-get install maven2
sudo mvn install
sudo apt-get install openjdk-7-jdk
edit "sudo vi /usr/bin/mvn",
change `which java` to `which /usr/lib/jvm/java-7-openjdk-amd64/bin/java`
Also I've added (probably not required)
'
PATH="/home/mine/perl5/bin${PATH+:}${PATH};/usr/lib/jvm/java-7-openjdk-amd64/"; export PATH;
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
export JAVA_VERSION=1.7
'
to ~/.bashrc
Then usage:
sudo mvn install
~/hadoop_tools/sequencefile-utility/sequencefile-utility-master$ /usr/lib/jvm/java-7-openjdk-amd64/bin/java -jar ./target/sequencefile-utility-1.0-jar-with-dependencies.jar
-- and this doesn't break the default java 1.6 installation that is required for FireFox/etc.
For resolving an error with FileSeq compatibility (e.g. "Unable to load native-hadoop library for your platform... using builtin-java classes where applicable"), I used the libs from the Hadoop master server as-is (a kind of hack):
scp root@10.15.150.223:/usr/lib/libhadoop.so.1.0.0 ~/
sudo cp ~/libhadoop.so.1.0.0 /usr/lib/
scp root@10.15.150.223:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64/server/libjvm.so ~/
sudo cp ~/libjvm.so /usr/lib/
sudo ln -s /usr/lib/libhadoop.so.1.0.0 /usr/lib/libhadoop.so.1
sudo ln -s /usr/lib/libhadoop.so.1.0.0 /usr/lib/libhadoop.so
One night of drinking coffee, and I've written this code for reading FileSeq Hadoop input files (using this command to run it: "/usr/lib/jvm/java-7-openjdk-amd64/bin/java -jar ./target/sequencefile-utility-1.3-jar-with-dependencies.jar -d test/ -c NONE"):
import org.apache.hadoop.io.*;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.ValueBytes;
import org.apache.hadoop.fs.Path;

import java.io.DataOutputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;

// fs, conf, reader and logger are already set up by the surrounding utility code.
Path file = new Path("/home/mine/mycompany/task13/data/2015-08-30");
reader = new SequenceFile.Reader(fs, file, conf);
long pos = reader.getPosition();
logger.info("GO from pos " + pos);

DataOutputBuffer rawKey = new DataOutputBuffer();
ValueBytes rawValue = reader.createValueBytes();

int DEFAULT_BUFFER_SIZE = 1024 * 1024;
DataOutputBuffer kobuf = new DataOutputBuffer(DEFAULT_BUFFER_SIZE);
kobuf.reset();

int rl;
do {
    rl = reader.nextRaw(kobuf, rawValue);
    logger.info("read len for current record: " + rl + " and in more details ");
    if (rl >= 0) {
        logger.info("read key " + new String(kobuf.getData()) + " (keylen " + kobuf.getLength() + ") and data " + rawValue.getSize());
        FileOutputStream fos = new FileOutputStream("/home/mine/outb");
        DataOutputStream dos = new DataOutputStream(fos);
        rawValue.writeUncompressedBytes(dos);
        kobuf.reset();
    }
} while (rl > 0);
I've just added this chunk of code to file src/main/java/eu/scape_project/tb/lsdr/seqfileutility/SequenceFileWriter.java just after the line
writer = SequenceFile.createWriter(fs, conf, path, keyClass,
        valueClass, CompressionType.get(pc.getCompressionType()));
Thanks to these sources of info:
If using hadoop-core instead of mahout, you will have to download asm-3.1.jar manually:
https://search.maven.org/remotecontent?filepath=org/ow2/util/asm/asm/3.1/asm-3.1.jar
https://search.maven.org/#search|ga|1|asm-3.1
The list of available Mahout repos:
https://repo1.maven.org/maven2/org/apache/mahout/
Intro to Mahout:
https://mahout.apache.org/
Good resource for learning the interfaces and sources of Hadoop Java classes (I used it for writing my own code for reading FileSeq):
http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.action/0.2.7/org/apache/hadoop/io/BytesWritable.java
Sources of the project tb-lsdr-seqfilecreator that I used for creating my own FileSeq reader:
http://www.javased.com/?source_dir=scape/tb-lsdr-seqfilecreator/src/main/java/eu/scape_project/tb/lsdr/seqfileutility/ProcessParameters.java
https://stackoverflow.com/questions/5096128/sequence-files-in-hadoop - the same example (read key/value, which doesn't work here)
https://github.com/twitter/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/RawSequenceFileRecordReader.java - this one helped me (I used reader.nextRaw the same as in nextKeyValue() and other subs)
Also, I've changed ./pom.xml to use native apache.hadoop instead of mahout.hadoop, but probably this is not required, because the bugs for reader->next(key, value) are the same for both, so I had to use reader->nextRaw(keyRaw, valueRaw) instead:
diff ../../sequencefile-utility/sequencefile-utility-master/pom.xml ./pom.xml
9c9
< <version>1.0</version>
---
> <version>1.3</version>
63c63
< <version>2.0.1</version>
---
> <version>2.4</version>
85c85
< <groupId>org.apache.mahout.hadoop</groupId>
---
> <groupId>org.apache.hadoop</groupId>
87c87
< <version>0.20.1</version>
---
> <version>1.1.2</version>
93c93
< <version>1.1</version>
---
> <version>1.1.3</version>
I was just playing with Dumbo. When you run a Dumbo job on a Hadoop cluster, the output is a sequence file. I used the following to dump out an entire Dumbo-generated sequence file as plain text:
$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
-input totals/part-00000 \
-output unseq \
-inputformat SequenceFileAsTextInputFormat
$ bin/hadoop fs -cat unseq/part-00000
I got the idea from here.
Incidentally, Dumbo can also output plain text.
Following the answer of Praveen Sripati, here is a small example of SequenceFileReadDemo.java from Hadoop: The Definitive Guide by Tom White.
The data is in HDFS, at this location: user/hduser/output-hashsort/, and the file is
part-r-00001
In Eclipse, in the Arguments folder I've written this string:
and this is part of the output, with the debugger: