How to extract Jester dataset files? - gzip

I am trying to extract the dataset given at this link: https://20bn.com/datasets/jester
I am unable to extract these files, which have no extension. I tried using tar and also what they mention on their website, i.e.,
cat 20bn-jester-v1-?? | tar zx
Please assist.

As you can read on the website:
The video data is provided as one large TGZ archive, split into parts of 1 GB max
so you need to combine the parts first and then extract. A command like this can help you:
cat 20bn-jester-v1-?? | gzip -dc | tar xf -
cat combines all the parts into one stream, gzip then decompresses that stream, and finally tar extracts the file(s).
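If you want to sanity-check the combined archive before extracting everything, a minimal sketch (assuming all the downloaded parts sit in the current directory) is to list the first few entries without writing anything to disk:
cat 20bn-jester-v1-?? | tar tzf - | head
If a part is missing or truncated, this will typically fail with an "unexpected end of file" error instead of printing file names.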

I also experienced the same problem for quite some time, but I figured out you can use Git Bash to extract the data. Just open Git Bash inside the folder where you downloaded the data by right-clicking and selecting Git Bash (you must have it installed), then type in the command cat 20bn-jester-v1-?? | tar zx. Press Enter and you are done.

First: unzip '*.zip'
Then: cat 20bn-jester-v1-?? | tar zx

Related

tar gzip of directory leaves old directory there and tar.gz is much larger

New to zipping files here. I used the following command to gzip a bunch of large files within a single directory:
tar -cvzf archive-RAW-MAFs.tar.gz RAW_MAFS/
When this was done, I noticed that it left the old directory tree where it was, and that the tar.gz was much larger. I'm not sure what the original size of the directory was as I didn't check it beforehand, but I think it was much larger than stated here...
-rw-r----- 1 xxx xxxx 21218045403 May 8 21:39 archive-RAW-MAFs.tar.gz
drwxr-s--- 34 xxx xxxx 4096 May 8 20:21 RAW_MAFS
I can also traverse through the original RAW_MAFs directory and open files. Ideally, I would like only the zipped file, because I don't need to touch this data again for a while and want to save as much as I can.
I'll take the second question first.
The original files are still there because you haven't told tar to delete them. Add the --remove-files option to the command line to get tar to do what you want:
tar -cvzf archive-RAW-MAFs.tar.gz RAW_MAFS/ --remove-files
Regarding the size of the RAW_MAFS directory tree: if it hasn't been deleted yet, can you not check its size?
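For example (a quick check, assuming GNU coreutils), du reports the total size of the tree, whereas the 4096 shown by ls -l is only the size of the directory entry itself, not of its contents:
du -sh RAW_MAFS/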
If the original files in RAW_MAFS are already compressed, then compressing them again when you put them in your tar file will increase the size. Can you provide more details on what you are storing in the tar file?
If you are storing compressed files in the tar, try running without the z option.
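For example, the same archive without recompression (keeping the --remove-files option so the originals are still deleted) would look like this:
tar -cvf archive-RAW-MAFs.tar RAW_MAFS/ --remove-files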

Muttrc: how to source a file in muttrc's directory

I have a muttrc file which sources a secondary file, mutt-secrets, that resides in the same directory. But I have what appear to be two conflicting needs:
Be free to reference the muttrc file from any working directory
Be free to move it (and mutt-secrets) without having to edit muttrc to change the source path for mutt-secrets
At present, the first line of my muttrc says: source mutt-secrets. That works fine when I run mutt from within the directory where the two files reside, but if I run mutt from elsewhere and reference muttrc with the -F flag, then mutt can find muttrc, but muttrc can't find mutt-secrets.
How can I solve this?
Use absolute paths. For example:
source ~/.mutt/mutt-secrets
TL;DR one-line solution:
source `lsof -c mutt -Fn | grep '/muttrc$' | sed 's|^n||; s|/muttrc$||;'`/mutt-secrets
or, if you want to reuse the muttrc directory, you can save it to a custom variable:
set my_muttrc_dir = `lsof -c mutt -Fn | grep '/muttrc$' | sed 's|^n||; s|/muttrc$||'`
source $my_muttrc_dir/mutt-secrets
If you want to see the output of the command when you launch mutt, you can put this line in your muttrc:
echo `lsof -c mutt -Fn | grep '/muttrc$' | sed 's|^n||; s|/muttrc$||'`
Assumptions: the Mutt process is called mutt and Mutt's initialization file is called muttrc. Furthermore, you could get into trouble if you have more than one Mutt instance running (for example, if you launch two or more Mutt instances in parallel with different initialization files, the command may select the wrong path).
Explanation
The idea is to look for the full path of muttrc in the list of files opened by Mutt. We can get this list using the lsof command (it has to be installed on your system), then extract the full path by parsing the lsof output with the grep and sed commands.
This approach is viable because Mutt's initialization files support the use of an external command's output via backticks (``). When Mutt encounters and executes our command enclosed in backticks, it is in the process of reading the muttrc file, so the muttrc file appears in the list of files currently open by Mutt. This enables us to use the lsof command.
lsof parameters
-c mutt: list open files of process named mutt;
-Fn: for each element, print only the name (in our case, the path). Because of the lsof output format, the path will be prefixed with the character n.
grep and sed
We use grep to select the line which contains the muttrc file path, assuming the filename is exactly muttrc. Then we clean the lsof output with sed by removing both the n character at the beginning of the line and the /muttrc string at the end of the line. This way we get the path of the directory containing the muttrc file.
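To illustrate (with a purely hypothetical path), the pipeline transforms the lsof output roughly like this:
lsof -c mutt -Fn              # prints, among other lines: n/home/user/.mutt/muttrc
grep '/muttrc$'               # keeps only: n/home/user/.mutt/muttrc
sed 's|^n||; s|/muttrc$||'    # leaves: /home/user/.mutt
so the line Mutt ends up reading is effectively source /home/user/.mutt/mutt-secrets.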
Is there a cleaner solution?
Mutt expands relative paths inside initialization files from its current working directory, i.e. from the directory you launch Mutt in. It supports a mechanism that allows paths to be expanded relative to something else, but the "initialization file directory" or anything similar is not available. See here.
Nor did I find a way to access, from inside the initialization file, the -F <path> option passed to the mutt command.
References
backticks in Mutt's initialization file;
current directory;
_mutt_buffer_expand_path, source code
source_rc, source code
source_rc call, source code
Tested with: Mutt 2.0.5, lsof 4.93.2, GNU grep 3.7, GNU sed 4.7.

Is there a way to download multiple PDF's that are linked on a website?

I am trying to download a bunch of PDFs from the Federal Reserve archives, but I have to click on a link and then view each PDF before I can download it. Is there a way to automate this?
Example: https://fraser.stlouisfed.org/title/5170#521653 is a link to speeches and then you have to click the title, then view pdf, then the actual download button.
All of the remote .pdf files follow the path format:
https://fraser.stlouisfed.org/files/docs/historical/frbatl/speeches/guynn_xxxxxxxx.pdf
where each x is a placeholder for a digit.
So, yes, it's very easy to download a bunch of these PDFs in one go using the command-line in Terminal or whatever shell program you have access to.
If you're on a *nix-based operating system (including macOS), that's good, because your shell probably already has the command-line utility curl installed. Windows may have it too, I'm not sure; I don't use Windows.
If you're using Windows, you'll have to make some tweaks to the code below, because the folder structures and file-naming conventions are different, so the first couple of commands won't work.
But, if you're happy to proceed, open up a Terminal window, and type in this command to create a new directory in your Downloads folder, into which the .pdf files will be downloaded:
mkdir ~/Downloads/FRASER_PDFs; cd ~/Downloads/FRASER_PDFs
Hit Enter. Next, if there's no error, copy-and-paste this long command and then hit Enter:
curl --url \
"https://fraser.stlouisfed.org/files/docs/historical/frbatl/speeches/guynn_{"$(curl \
https://fraser.stlouisfed.org/title/5170#521653 --silent \
| egrep -io -e '/files/docs/historical/frbatl/speeches/guynn_\d+\.pdf' \
| egrep -o -e '\d+' | tr '\n' ',')"}.pdf" -O --remote-name-all
You can see this uses the URL you supplied in your question, from which that command retrieves all the .pdf links. If you need to do the same with other similar pages, provided they all use the same URL format, you can just substitute 5170#521653 with whatever page reference contains another list of .pdfs.
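The trick here is curl's URL globbing: a comma-separated list inside {} makes curl request one URL per entry, and -O --remote-name-all saves each response under its remote file name. As a minimal sketch with made-up dates (the real list is produced by the inner curl/egrep pipeline above), it is equivalent to something like:
curl "https://fraser.stlouisfed.org/files/docs/historical/frbatl/speeches/guynn_{19970101,19980101}.pdf" -O --remote-name-all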

how to merge gz files into a tar.gz without decompression?

I have a program that only consumes uncompressed files. I have a couple of .gz files and my goal is to feed the concatenation of them to that program. If I had a tar.gz file, I could mount the archive with the archivemount command.
I know I can concatenate the gz files:
cat a.gz b.gz > c.gz
But there is no way, that I am aware of, to mount a gz file. I don't have enough disk space to uncompress all of the files, and the tar command does not accept stdin as input, so I cannot do this:
zcat *.gz | tar - | gzip > file.tar.gz
It is not clear what operations you need to perform on the tar.gz archive. But from what I can discern, tar.gz is not the format for this application. The entire archive stream is compressed by gzip, so you can't pull out or change a file without having to re-compress everything after it. The tar.gz stream can be specially prepared to keep the compression of each file independent, but then you might as well use the .zip format, which is better suited for random access and manipulation of individual files in the archive.
To address one of your comments, tar can in fact accept stdin as input. See pipe tar extract into tar create for some examples, where both GNU tar and BSD tar (with different syntax) can take in a tar file from stdin, delete entries, and write a new tar file to stdout.
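As a minimal illustration that tar really does accept stdin (both GNU tar and bsdtar treat -f - as standard input):
cat some-archive.tar | tar -tf -    # list the contents
cat some-archive.tar | tar -xf -    # extract them
The delete-and-rewrite examples in the linked answer build on this same -f - mechanism.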

Programmatically extract tar.gz in a single step (on Windows with 7-Zip)

Problem: I would like to be able to extract tar.gz files in a single step. This makes my question almost identical to this one: Stack Overflow question for tar-gz.
My question is almost the same, but not quite, because I would like to do this on Windows using the 7-Zip command line (or something similar) inside a bat file or a Ruby/Perl/Python script.
Question: This seemingly simple task is proving to be more involved than the first appearance would make it out to be. Does anyone have a script that does this already?
Old question, but I was struggling with it today, so here's my 2c. The 7-Zip command-line tool 7z.exe (I have v9.22 installed) can write to stdout and read from stdin, so you can do without the intermediate tar file by using a pipe:
7z x "somename.tar.gz" -so | 7z x -aoa -si -ttar -o"somename"
Where:
x = Extract with full paths command
-so = write to stdout switch
-si = read from stdin switch
-aoa = Overwrite all existing files without prompt.
-ttar = Treat the stdin byte stream as a TAR file
-o = output directory
See the help file (7-zip.chm) in the install directory for more info on the command line commands and switches.
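Since the question mentions doing this from a bat file, here is a minimal batch sketch built on the same pipe. The loop, the assumption that 7z.exe is on the PATH, and the output folder naming (%%~nf, i.e. the archive name minus its final .gz) are my additions, not part of the original answer:
@echo off
rem Extract every *.tar.gz in the current folder in one step.
rem Assumes 7z.exe is on the PATH; the output folder is the archive name without .gz.
for %%f in (*.tar.gz) do (
    7z x "%%f" -so | 7z x -aoa -si -ttar -o"%%~nf"
)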
As noted by @zespri, PowerShell will buffer the input to the second 7z process, so it can consume a lot of memory if your tar file is large, i.e.:
& 7z x "somename.tar.gz" -so | & 7z x -aoa -si -ttar -o"somename"
A workaround from this SO answer, if you want to do this from PowerShell, is to pass the commands to cmd.exe:
& cmd.exe '/C 7z x "somename.tar.gz" -so | 7z x -aoa -si -ttar -o"somename"'
7z e example.tar.gz && 7z x example.tar
Use && to combine two commands in one step. Use the 7-Zip portable (you will need 7z.exe and 7z.dll only).
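As a small aside (my addition, not part of the original answer): if you don't want to keep the intermediate .tar file around, you can chain a cleanup step onto the same line:
7z e example.tar.gz && 7z x example.tar && del example.tar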
Use the win32 port of tar.
tar -xvzf filename.tar.gz
Since you asked for 7-zip or something similar, I am providing an alternative tool that was written for your exact use case.
The tartool utility is a free and open source tool built using the .NET SharpZipLib library.
Example command to extract a .tar.gz / .tgz file:
C:\>TarTool.exe D:\sample.tar.gz ./
Disclaimer : I am the author of this utility.
As you can see, 7-Zip is not very good at this. People have been asking for a one-step tarball operation since 2009. As an alternative, you can use the Arc program. Example command:
arc unarchive test.tar.gz