Mahout seqdirectory not making a new file

I am trying to convert a text file into a sequence file that I can run mahout kmeans on. When I run the seqdirectory utility, I do not get any errors and it says that the program is completed. However, when I look in the output directory, it is empty. I've looked around and can't find any solutions to this. Thoughts?
Here is what I run in the terminal:
hduser@ubuntu:~$ $MAHOUT_HOME/bin/mahout seqdirectory --input Downloads/google/ --output Downloads/sparsefiles/ -c UTF-8
This is the output I get:
12/07/06 06:24:19 INFO driver.MahoutDriver: Program took 1091 ms (Minutes: 0.018183333333333333)

I think it may be producing the output on hdfs. Try checking:
hadoop dfs -ls Downloads/sparsefiles/
Also, to ensure it writes to your local filesystem, you can modify the command like this:
$MAHOUT_HOME/bin/mahout seqdirectory --input file://<home path>/Downloads/google/ --output file://<home path>/Downloads/sparsefiles/ -c UTF-8
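If the output did land on HDFS rather than on local disk, you can sanity-check the generated sequence file with Mahout's seqdumper utility. A minimal sketch, assuming the same relative output path as above and the usual chunk-0 file name that seqdirectory produces (check the listing first if yours differs):
# list whatever seqdirectory wrote on HDFS
hadoop dfs -ls Downloads/sparsefiles/
# dump a few records of the sequence file to confirm it is readable
$MAHOUT_HOME/bin/mahout seqdumper -i Downloads/sparsefiles/chunk-0 | head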

Related

RISC-V: How to fix "file format not recognized" when disassembling a .img file?

I'm playing with RISC-V.
I have a .img file and I want to disassemble it into a .asm file, so I ran the following command:
> riscv64-unknown-elf-objdump -d xxx.img > xxx.asm
However, I got this issue:
riscv64-unknown-elf-objdump: xxx.img: file format not recognized
How can I fix it? I have no idea what to do with this issue.
If you run:
riscv64-unknown-elf-objdump --help
You'll see a line like:
riscv64-unknown-elf-objdump: supported architectures: riscv riscv:rv64 riscv:rv32
These are the supported architectures that you need to pass as the -m argument. Normally, an ELF file will encode this information so there's no guesswork, but in the case of using a flat file, there's no way for objdump to know how the instructions are supposed to be interpreted. The final command is:
riscv64-unknown-elf-objdump -b binary -m riscv:rv64 -D xxx.bin
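To end up with the .asm file the question was after, the same command can simply be redirected; a minimal sketch, assuming the image really is rv64 code as the toolchain name suggests:
# disassemble the raw image and save the listing to a text file
riscv64-unknown-elf-objdump -b binary -m riscv:rv64 -D xxx.img > xxx.asm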

zsh: command not found: duarouter to create a rou.xml file

I'm using sumo on macOS. I'm trying to create a .rou.xml file with duarouter by calling the following command, after creating random trips for a given network:
duarouter -n ~/SUMOTutorials/sumotest.net.xml --route-files ~/SUMOTutorials/sumotest.trips.xml -o ~/SUMOTutorials/sumotest.rou.xml --ignore-errors
However I just get the error:
zsh: command not found: duarouter
I see that the sumo directory contains a duarouter Unix executable in both sumo/bin and sumo/tests, and I have run the above command from each of those directories, but I still get the error.
I found the answer. Because duarouter is a Unix executable that is not on the shell's PATH, we have to run it with a ./ prefix from the directory where it lives:
./duarouter -n ~/SUMOTutorials/sumotest.net.xml --route-files ~/SUMOTutorials/sumotest.trips.xml -o ~/SUMOTutorials/sumotest.rou.xml --ignore-errors
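As an alternative to prefixing ./ every time, you could add SUMO's bin directory to your PATH; this is only a sketch, and the install location below is an assumption, so adjust it to wherever your sumo/bin actually is:
# make duarouter (and the other SUMO tools) callable from any directory; install path is hypothetical
export PATH="$PATH:$HOME/sumo/bin"
duarouter -n ~/SUMOTutorials/sumotest.net.xml --route-files ~/SUMOTutorials/sumotest.trips.xml -o ~/SUMOTutorials/sumotest.rou.xml --ignore-errors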

impala shell command to export a parquet file as a csv

I have some parquet files stored in HDFS that I want to convert to csv files first and then export to a remote machine over ssh.
I don't know whether it would be simpler to write a Spark job (I know we can convert parquet to csv just by using spark.read.parquet and then writing the same DataFrame back out as csv with spark.write), but I really wanted to do it with an impala-shell request.
So I thought about something like this:
hdfs dfs -cat my-file.parquet | ssh myserver.com 'cat > /path/to/my-file.csv'
Can you please help me with this request? Thank you!
Example without kerberos:
impala-shell -i servername:portname -B -q 'select * from table' -o filename '--output_delimiter=\001'
I could explain it all, but it is late; here is a link that shows how to do that, including the header if you want it: http://beginnershadoop.com/2019/10/02/impala-export-to-csv/
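Since the question also wants the result on a remote machine, one option is to follow the export above with a copy step; a minimal sketch, reusing the placeholder names from the command above and the remote host and path from the question:
# export the query result to a local delimited file, then copy it over ssh
impala-shell -i servername:portname -B -q 'select * from table' -o filename '--output_delimiter=,'
scp filename myserver.com:/path/to/my-file.csv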
You can do that in multiple ways.
One approach is shown in the example below.
With impala-shell you can run a query and pipe the output to ssh to write it to a file on a remote machine.
$ impala-shell --quiet --delimited --print_header --output_delimiter=',' -q 'USE fun; SELECT * FROM games' | ssh remoteuser@ip.address.of.remote.machine "cat > /home/..../query.csv"
This command switches from the default database to the fun database and runs the query against it.
You can change --output_delimiter (to '\t', for example), include or omit --print_header, and adjust the other options as needed.
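If the result set is large, compressing it in flight can help; this is only a sketch built on the same command, assuming gzip is available on both machines, and the remote path here is a stand-in for wherever you want the file:
# stream the query result through gzip and decompress on the remote side (remote path is hypothetical)
impala-shell --quiet --delimited --print_header --output_delimiter=',' -q 'USE fun; SELECT * FROM games' | gzip | ssh remoteuser@ip.address.of.remote.machine "gunzip > /home/remoteuser/query.csv"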

How to keep the snakemake shell file while running in cluster

While running my Snakemake file on a cluster I keep getting an error. This is what I run:
snakemake -j 20 --cluster "qsub -o out.txt -e err.txt -q debug" \
    -s seadragon/scripts/viral_hisat.snake \
    --config json="<input file>" output="<output file>"
Now this gives me the following error:
Error in job run_salmon while creating output file
/gpfs/home/user/seadragon/output/quant_v2_4/test.
ClusterJobException in line 58 of seadragon/scripts/viral_hisat.snake:
Error executing rule run_salmon on cluster (jobid: 1, external: 156618.sn-mgmt.cm.cluster, jobscript: /gpfs/home/user/.snakemake/tmp.j9nb0hyo/snakejob.run_salmon.1.sh). For detailed error see the cluster log.
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message
Now I can't find any way to track down the error, since my cluster does not give me a way to store the log files; on the other hand, the /gpfs/home/user/.snakemake/tmp.j9nb0hyo/snakejob.run_salmon.1.sh file is deleted immediately after the job finishes.
Please let me know if there is a way to keep this shell file even when snakemake fails.
I am not a qsub user anymore, but if I remember correctly, stdout and stderr are stored in the working directory, under the jobid that Snakemake gives you under external in the error message.
You need to redirect the standard output and standard error output to a file yourself instead of relying on the cluster or snakemake to do this for you.
Instead of the following:
my_script.sh
run the following:
my_script.sh > output_file.txt 2> error_file.txt
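If you also want qsub itself to keep a separate log per cluster job instead of overwriting a single out.txt/err.txt, the redirection can go into the --cluster string. This is only a sketch: it assumes your Snakemake version substitutes the {rule} and {jobid} placeholders in the cluster command and that a logs/ directory exists:
# give each submitted job its own stdout/stderr file so failed jobs leave a trace behind
mkdir -p logs
snakemake -j 20 --cluster "qsub -o logs/{rule}.{jobid}.out -e logs/{rule}.{jobid}.err -q debug" \
    -s seadragon/scripts/viral_hisat.snake \
    --config json="<input file>" output="<output file>"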

Redirect stderr through grep -v in LSF batch job

I'm using a library that generates a whole ton of output to stderr (and there is really no way to suppress the output directly in the code; it is ROOT's Minuit2 minimizer which is known for not having a way to suppress the output). I'm running batch jobs through the LSF system, and the error output files are so big that they exceed my disk quota. Erk.
When I run locally on a shell, I do:
python main.py 2> >( grep -v Minuit2 2>&1 )
to suppress the output, as is done here.
This works great, but unfortunately I can't seem to get that or any variation of it to work when running on LSF. I think this is due to LSF not spawning the necessary subshell, but it's not clear.
I run on batch by passing LSF a submit script. The relevant line is:
python main.py $INPUT_FILE
which works great, aside from the aforementioned problem of gigantic error files.
When I try changing that line to
python main.py $INPUT_FILE 2> >( grep -v Minuit2 2>&1 )
I end up with
./singleSubmit.sh: line 16: syntax error near unexpected token `>'
./singleSubmit.sh: line 16: `python $MAIN $1 2> >( grep -v Minuit2 2>&1 )'
in the error log file.
Any idea how I could accomplish what I want, or why this is not working?
Thanks a ton!
The syntax you're using works in bash, not in csh/tcsh. Try changing the first line of your submission script to
#!/bin/bash
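For reference, a minimal sketch of how the relevant part of singleSubmit.sh would then look, keeping the $MAIN and $1 variables from the error message:
#!/bin/bash
# bash understands the process-substitution syntax that csh/tcsh rejects,
# so stderr can be filtered through grep -v before it reaches the LSF error file
python $MAIN $1 2> >( grep -v Minuit2 2>&1 )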