I am given these usage statistics on my HDFS folder allocated to my project.
hdfs dfs -df -h hdfs://hp3/data/test_data.db
Filesystem Size Used Available Use%
hdfs://hp3 6.1 P 5.1 P 1.0 P 83%
What does the 'P' stand for? It cannot be petabytes because I know the data I have uploaded and it is ~ 10 GB.
The disk usage command is:
hdfs dfs -du -s -h hdfs://hp3/data/test_data.db
count can also be used:
hadoop fs -count -v -h hdfs://hp3/data/test_data.db
The command you used, -df, reports free space and capacity for the entire filesystem (hdfs://hp3), not just for the path you passed to it. So the 'P' really does stand for petabytes; it describes the whole cluster, while -du on your folder will show your ~10 GB.
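If you want a per-subdirectory breakdown instead of a single total, drop the -s flag from the du command above (a small variation, using the same example path):
hdfs dfs -du -h hdfs://hp3/data/test_data.db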
I am currently working with VMware virtualization; I am trying to make an image converted from qcow2 to vmdk work with ESXi Server 6.0.
I have myImage.qcow2 with a disk which is thin provisioned for 300GB.
I used the qemu-img conversion tool with the following command:
qemu-img convert -f qcow2 myImage.qcow2 -O vmdk myNewImage.vmdk
This command gives me a vmdk image that is only compatible with VMware Workstation. Therefore, in order to make it ESXi compatible, I have to use vmkfstools with the following command:
vmkfstools -i myImage.vmdk outputName.vmdk -d thin
The vmkfstools command gives me two files: a metadata (descriptor) .vmdk and the actual data .vmdk.
As mentioned above, my disk is thin provisioned for 300GB, and when I apply vmkfstools it expands the disk and gives me a 300GB image.
Deploying the image through the vSphere Client works without any problem; however, for the purpose of this project I want to use the ovftool and doing so with such a large image is not feasible.
Is there a way for me to make my .vmdk ESXi compatible without vmkfstools expanding my image to 300GB?
Or is there any other method for me to deploy those 300GB using the ovftool while the disk image is on the datastore, so that it doesn't have to be downloaded/uploaded through the deployment process?
I have been stuck on this for weeks and any help will be highly appreciated.
FYI: support for this has been added in QEMU 2.1 and above, per the changelogs:
qemu-img convert -f qcow2 -O vmdk -o adapter_type=lsilogic,subformat=streamOptimized,compat6 SC-1.qcow2 SC-1.vmdk
This worked for me with VMware 6.7
The TL;DR is
qemu-img convert -f qcow2 -O vmdk -o subformat=streamOptimized source_qcow_image_path destination_path_to_vmdk
For example:
qemu-img convert -f qcow2 -O vmdk -o subformat=streamOptimized \
CentOS-7-x86_64-GenericCloud-1503.qcow2 \
CentOS-7-x86_64-GenericCloud-1503.vmdk
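To confirm that the conversion produced a streamOptimized image, you can inspect the result with qemu-img info (just a sanity check on the example file above; recent qemu versions list the create type under "Format specific information"):
qemu-img info CentOS-7-x86_64-GenericCloud-1503.vmdk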
Update the vmdk version setting embedded in the converted image using the following command (this is what actually worked for me):
printf '\x03' | dd conv=notrunc of=<vmdk file name> bs=1 seek=$((0x4))
For example:
printf '\x03' | dd conv=notrunc of=CentOS-7-x86_64-GenericCloud-1503.vmdk bs=1 seek=$((0x4))
source: https://kb.vmware.com/s/article/2144687
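If you want to verify the patch, you can dump the first bytes of the image with xxd (an optional check, assuming xxd is installed; the file should start with the KDMV magic, and the dd command above sets the version field at offset 4 to 3):
xxd -l 8 CentOS-7-x86_64-GenericCloud-1503.vmdk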
How can I set the umask for a Hive HQL script, either via statements within the script or via a client-side configuration set before running the script? I want to make the change on the client side, without changing the server-side configuration.
I've found that this works from a shell prompt, but I'd like to do it from inside a hive script.
$ hdfs dfs -Dfs.permissions.umask-mode=000 -mkdir /user/jeff/foo
$ hdfs dfs -Dfs.permissions.umask-mode=000 -put bar /user/jeff/foo
These tries don't work:
hive> dfs -mkdir -Dfs.permissions.umask-mode=000 /user/jeff/foo;
-mkdir: Illegal option -Dfs.permissions.umask-mode=000
hive> dfs -Dfs.permissions.umask-mode=000 -mkdir /user/jeff/foo;
-Dfs.permissions.umask-mode=000: Unknown command
Setting hive.files.umask.value in .hiverc doesn't have the desired effect (the g+w and o+w bits aren't set, which is what I was trying to achieve with this umask):
hive> set hive.files.umask.value;
hive.files.umask.value=000
hive> dfs -mkdir /user/jeff/foo;
hive> dfs -ls -d /user/jeff/foo;
drwxr-xr-x - jeff hadoop 0 2016-02-23 15:19 /user/jeff/foo
It looks like I'll need to sprinkle a bunch of "dfs -chmod 777 ..." statements in my HQL script.
Ideas??
I have a requirement where I need to export report data directly to CSV, since getting the array/query response, then building the CSV, and then uploading the final CSV to Amazon takes time. Is there a way I can create the CSV directly with Redshift PostgreSQL?
PgSQL - Export select query data direct to amazon s3 servers with headers
Here is my version of PgSQL: PgSQL 8.0.2 on Amazon Redshift.
Thanks
You can use the UNLOAD statement to save results to an S3 bucket. Keep in mind that this will create multiple files (at least one per compute node).
You will have to download all the files, combine them locally, sort (if needed), then add column headers and upload result back to S3.
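For reference, the UNLOAD itself looks roughly like this (a sketch only: my_table, the column list and the credentials are placeholders, and the S3 prefix matches the files_prefix used in the steps below):
psql -h __your__redshift__host__ -U __your__redshift__user__ __your__redshift__database__name__ \
    -c "UNLOAD ('SELECT col1, col2 FROM my_table')
        TO 's3://path_to_files_on_s3/bucket/files_prefix'
        CREDENTIALS 'aws_access_key_id=__KEY__;aws_secret_access_key=__SECRET__'
        DELIMITER '\t' ALLOWOVERWRITE"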
Using an EC2 instance shouldn't take a lot of time; the connection between EC2 and S3 is quite fast.
In my experience, the quickest method is to use shell commands:
# run query on the redshift
export PGPASSWORD='__your__redshift__pass__'
psql \
-h __your__redshift__host__ \
-p __your__redshift__port__ \
-U __your__redshift__user__ \
__your__redshift__database__name__ \
-c "UNLOAD __rest__of__query__"
# download all the results
s3cmd get s3://path_to_files_on_s3/bucket/files_prefix*
# merge all the files into one
cat files_prefix* > files_prefix_merged
# sort merged file by a given column (if needed)
sort -n -k2 files_prefix_merged > files_prefix_sorted
# add column names to destination file
echo -e "column 1 name\tcolumn 2 name\tcolumn 3 name" > files_prefix_finished
# add merged and sorted file into destination file
cat files_prefix_sorted >> files_prefix_finished
# upload destination file to s3
s3cmd put files_prefix_finished s3://path_to_files_on_s3/bucket/...
# cleanup
s3cmd del s3://path_to_files_on_s3/bucket/files_prefix*
rm files_prefix* files_prefix_merged files_prefix_sorted files_prefix_finished
I normally compress using tar zcvf and decompress using tar zxvf (using gzip due to habit).
I've recently gotten a quad core CPU with hyperthreading, so I have 8 logical cores, and I notice that many of the cores are unused during compression/decompression.
Is there any way I can utilize the unused cores to make it faster?
You can also use the tar flag "--use-compress-program=" to tell tar what compression program to use.
For example use:
tar -c --use-compress-program=pigz -f tar.file dir_to_zip
You can use pigz instead of gzip, which does gzip compression on multiple cores. Instead of using the -z option, you would pipe it through pigz:
tar cf - paths-to-archive | pigz > archive.tar.gz
By default, pigz uses the number of available cores, or eight if it could not query that. You can ask for more with -p n, e.g. -p 32. pigz has the same options as gzip, so you can request better compression with -9. E.g.
tar cf - paths-to-archive | pigz -9 -p 32 > archive.tar.gz
Common approach
There is an option for the tar program:
-I, --use-compress-program PROG
filter through PROG (must accept -d)
You can use a multithreaded version of an archiver or compressor utility.
The most popular multithreaded compressors are pigz (instead of gzip) and pbzip2 (instead of bzip2). For instance:
$ tar -I pbzip2 -cf OUTPUT_FILE.tar.bz2 paths_to_archive
$ tar --use-compress-program=pigz -cf OUTPUT_FILE.tar.gz paths_to_archive
The compressor must accept -d. If your replacement utility doesn't have this parameter and/or you need to specify additional parameters, then use pipes (add parameters if necessary):
$ tar cf - paths_to_archive | pbzip2 > OUTPUT_FILE.tar.bz2
$ tar cf - paths_to_archive | pigz > OUTPUT_FILE.tar.gz
The output of the single-threaded and multithreaded versions is compatible: you can compress with the multithreaded version and decompress with the single-threaded version, and vice versa.
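For example, an archive created with pigz extracts fine with plain single-threaded gzip (a quick illustration of that compatibility):
$ tar -I pigz -cf OUTPUT_FILE.tar.gz paths_to_archive
$ gzip -dc OUTPUT_FILE.tar.gz | tar -xf -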
p7zip
For p7zip compression you need a small wrapper shell script like the following:
#!/bin/sh
case $1 in
-d) 7za -txz -si -so e;;
*) 7za -txz -si -so a .;;
esac 2>/dev/null
Save it as 7zhelper.sh. Here is an example of its usage:
$ tar -I 7zhelper.sh -cf OUTPUT_FILE.tar.7z paths_to_archive
$ tar -I 7zhelper.sh -xf OUTPUT_FILE.tar.7z
xz
Regarding multithreaded XZ support: if you are running version 5.2.0 or above of XZ Utils, you can utilize multiple cores for compression by setting -T or --threads to an appropriate value via the environment variable XZ_DEFAULTS (e.g. XZ_DEFAULTS="-T 0").
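For example, one way to wire this up (assuming tar's -J invokes an xz ≥ 5.2.0 on your PATH):
$ XZ_DEFAULTS="-T 0" tar -cJf OUTPUT_FILE.tar.xz paths_to_archive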
This is a fragment of the man page for the 5.1.0alpha version:
Multithreaded compression and decompression are not implemented yet, so this
option has no effect for now.
However, this will not work for decompression of files that haven't also been compressed with threading enabled. From the man page for version 5.2.2:
Threaded decompression hasn't been implemented yet. It will only work
on files that contain multiple blocks with size information in
block headers. All files compressed in multi-threaded mode meet this
condition, but files compressed in single-threaded mode don't even if
--block-size=size is used.
Recompiling with replacement
If you build tar from source, you can recompile it with these parameters:
--with-gzip=pigz
--with-bzip2=lbzip2
--with-lzip=plzip
After recompiling tar with these options you can check the output of tar's help:
$ tar --help | grep "lbzip2\|plzip\|pigz"
-j, --bzip2 filter the archive through lbzip2
--lzip filter the archive through plzip
-z, --gzip, --gunzip, --ungzip filter the archive through pigz
You can use the shortcut -I for tar's --use-compress-program switch, and invoke pbzip2 for bzip2 compression on multiple cores:
tar -I pbzip2 -cf OUTPUT_FILE.tar.bz2 DIRECTORY_TO_COMPRESS/
If you want to have more flexibility with filenames and compression options, you can use:
find /my/path/ -type f \( -name "*.sql" -o -name "*.log" \) -exec \
tar -P --transform='s#/my/path/##g' -cf - {} + | \
pigz -9 -p 4 > myarchive.tar.gz
Step 1: find
find /my/path/ -type f \( -name "*.sql" -o -name "*.log" \) -exec
This command will look for the files you want to archive, in this case /my/path/*.sql and /my/path/*.log. Add as many -o -name "pattern" clauses as you want. The \( ... \) grouping ensures that -exec applies to every matched pattern rather than only the last one.
-exec will execute the next command using the results of find: tar
Step 2: tar
tar -P --transform='s#/my/path/##g' -cf - {} +
--transform is a simple string-replacement parameter. It strips the path of the files inside the archive, so the tarball's root becomes the current directory when extracting. Note that you can't use the -C option to change directory here, as you'd lose the benefit of find: all files of the directory would be included.
-P tells tar to use absolute paths, so it doesn't trigger the warning "Removing leading `/' from member names". The leading '/' will be removed by --transform anyway.
-cf - tells tar to create the archive and write it to stdout so it can be piped to pigz; the final filename is supplied later by the shell redirection.
{} + passes all the files that find found to tar.
Step 3: pigz
pigz -9 -p 4
Use as many parameters as you want.
In this case -9 is the compression level and -p 4 is the number of cores dedicated to compression.
If you run this on a heavily loaded web server, you probably don't want to use all available cores.
Step 4: archive name
> myarchive.tar.gz
Finally, the compressed stream is redirected into the archive file.
A relatively newer (de)compression tool you might want to consider is zstandard. It does an excellent job of utilizing spare cores, and it has made some great trade-offs when it comes to compression ratio vs. (de)compression time. It is also highly tweak-able depending on your compression ratio needs.
Here is an example of tar with the modern zstd compressor, since good examples for this one were hard to find:
apt command to install the zstd and pv utilities on Ubuntu
Compress multiple files and folders (zstd command alone can only do single files)
Display progress using pv - shows the total bytes compressed and compression speed GB/sec real-time
Use all physical cores with -T0
Set compression level higher than the default with -8
Display the resulting wall clock and CPU time used after the operation is finished using time
apt install zstd pv
DATA_DIR=/path/to/my/folder/to/compress
TARGET=/path/to/my/archive.tar.zst
time (cd $DATA_DIR && tar -cf - * | pv | zstd -T0 -8 -o $TARGET)
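To extract such an archive later, tar can hand decompression off to zstd as well (a usage note; zstd accepts the -d flag that tar passes to its compress program):
tar -I zstd -xf $TARGET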
How can I extract the size of the total uncompressed file data in a .tar.gz file from command line?
This works for any file size:
zcat archive.tar.gz | wc -c
For files smaller than 4Gb you could also use the -l option with gzip:
$ gzip -l compressed.tar.gz
compressed uncompressed ratio uncompressed_name
132 10240 99.1% compressed.tar
This will sum the total content size of the extracted files:
$ tar tzvf archive.tar.gz | sed 's/ \+/ /g' | cut -f3 -d' ' | sed '2,$s/^/+ /' | paste -sd' ' | bc
The output is given in bytes.
Explanation: tar tzvf lists the files in the archive in verbose format like ls -l. sed and cut isolate the file size field. The second sed puts a + in front of every size except the first and paste concatenates them, giving a sum expression that is then evaluated by bc.
Note that this doesn't include metadata, so the disk space taken up by the files when you extract them is going to be larger - potentially many times larger if you have a lot of very small files.
The command gzip -l archive.tar.gz doesn't report the size correctly for large files, because the gzip format stores the uncompressed size in a 32-bit field that wraps around at 4 GiB. I would recommend zcat archive.tar.gz | wc --bytes instead for really large files.
I know this is an old question, but I wrote a tool just for this two years ago. It's called gzsize and it gives you the uncompressed size of a gzipped file without actually decompressing the whole file on disk:
$ gzsize <your file>
Use the following command:
tar -xzf archive.tar.gz --to-stdout | wc -c
I searched all over the web and could not find a solution to the problem of getting the size when the file is bigger than 4GB.
First, which is the fastest?
[oracle@base tmp]$ time zcat oracle.20180303.030001.dmp.tar.gz | wc -c
6667028480
real 0m45.761s
user 0m43.203s
sys 0m5.185s
[oracle@base tmp]$ time gzip -dc oracle.20180303.030001.dmp.tar.gz | wc -c
6667028480
real 0m45.335s
user 0m42.781s
sys 0m5.153s
[oracle@base tmp]$ time tar -tvf oracle.20180303.030001.dmp.tar.gz
-rw-r--r-- oracle/oinstall 111828 2018-03-03 03:05 oracle.20180303.030001.log
-rw-r----- oracle/oinstall 6666911744 2018-03-03 03:05 oracle.20180303.030001.dmp
real 0m46.669s
user 0m44.347s
sys 0m4.981s
Definitely, tar -tvf is the fastest option, because it prints the per-file sizes right at the start of its output; but how can I cancel execution after getting that header?
my solution is this:
[oracle@base tmp]$ time echo $(timeout --signal=SIGINT 1s tar -tvf oracle.20180303.030001.dmp.tar.gz | awk '{print $3}') | grep -o '[[:digit:]]*' | awk '{ sum += $1 } END { print sum }'
6667023572
real 0m1.005s
user 0m0.013s
sys 0m0.066s
A tar file is uncompressed until/unless it is filtered through another program, such as gzip, bzip2, lzip, compress, lzma, etc. The size of the tar file is roughly the same as the total size of the extracted files, plus a small amount of overhead: each member gets a 512-byte header, file data is padded to 512-byte boundaries, and the archive ends with zero-filled padding blocks.
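A toy example makes that overhead visible (just an illustration; with GNU tar's default 10 KiB blocking factor, even a 6-byte file produces a 10240-byte tarball):
$ printf 'hello\n' > demo.txt
$ tar -cf demo.tar demo.txt
$ ls -l demo.txt demo.tar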