Hive table: load data from an HDFS location while handling duplicate files - hive

There is a scenario where files are loaded daily into a particular HDFS path. On top of that path we have created a Hive external table to load the data into Hive. In the worst case, a file is pushed to that HDFS path twice, i.e. we get duplicate files.
How do we load the second file without deleting anything or running another job? What is the best practice to handle this scenario?
Kindly clarify

Duplicate files with the same filename are not possible in HDFS. If you are worried about two files with possibly identical content, you might want to load them as is to avoid missing data, and maintain a managed table that handles the duplicates.
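For the managed-table approach, a minimal sketch (the table names ext_tbl and managed_tbl are hypothetical; assumes the external table sits on top of the landing path) is to periodically rewrite the managed table with only the distinct rows:
# ext_tbl: external table over the HDFS landing path (hypothetical name)
# managed_tbl: managed table that downstream queries read from (hypothetical name)
hive -e "INSERT OVERWRITE TABLE managed_tbl SELECT DISTINCT * FROM ext_tbl;"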
Use case: get only the latest file
Detect the latest file in the HDFS directory:
hdfs dfs -ls -R /your/hdfs/dir/ | awk -F" " '{print $6" "$7" "$8}' | sort -nr | head -1 | cut -d" " -f3
Then, copy it to another HDFS directory. This directory should be emptied first, because we want the latest file only.
# delete old files in here
hdfs dfs -rm -r /your/hdfs/latest_dir/
# copy latest file
hdfs dfs -cp $(hadoop fs -ls -R /your/hdfs/dir/ | awk -F" " '{print $6" "$7" "$8}' | sort -nr | head -1 | cut -d" " -f3) /your/hdfs/latest_dir/
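A minimal sketch wrapping the two steps above into one script (the paths are placeholders; assumes the hdfs dfs -ls output format shown above, and recreates the target directory before copying):
#!/usr/bin/env bash
set -euo pipefail

SRC_DIR=/your/hdfs/dir
LATEST_DIR=/your/hdfs/latest_dir

# pick the most recently modified file (sort on the date and time columns)
latest_file=$(hdfs dfs -ls -R "$SRC_DIR" | awk '{print $6" "$7" "$8}' | sort -nr | head -1 | cut -d" " -f3)

# refresh the "latest only" directory that the external table points at
hdfs dfs -rm -r -f "$LATEST_DIR"
hdfs dfs -mkdir -p "$LATEST_DIR"
hdfs dfs -cp "$latest_file" "$LATEST_DIR/"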

Related

How to get around 'Argument list too long' error when concatenating multiple gzip files?

I am trying to concatenate around 21,000 gzip files that are all located in a local directory, so I can unzip one large gzip file and then convert the unzipped file into a CSV. Unfortunately this is over the maximum number of arguments that can be accepted by the cat command. I have tried using cat *gz > final.gz as well as ls *.gz | xargs cat, but both have given me the error 'Argument list too long'. How might I work around this error to concatenate all the gzipped files?
You can try something like:
find . -name \*.gz -print0 | xargs -0 cat > final.gz
If there are .gz files in subdirectories of the current directory, and you only want the ones in the current directory, then add -maxdepth 1 to the find options.
If you want to impose a particular order on the files, you can pipe through an appropriate sort between the find and the xargs.
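For example, to stay in the current directory and concatenate in filename order (a sketch assuming GNU find and sort, which support NUL-delimited records via -print0 and -z):
# exclude final.gz itself, since the shell creates it before find runs
find . -maxdepth 1 -name '*.gz' ! -name final.gz -print0 | sort -z | xargs -0 cat > final.gz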

Extract huge tar.gz archives from S3 without copying archives to a local system

I'm looking for a way to extract a huge dataset (18 TB+, found here: https://github.com/cvdfoundation/open-images-dataset#download-images-with-bounding-boxes-annotations). With this in mind, I need the process to be fast (i.e. I don't want to spend twice the time by first copying and then extracting the files). I also don't want the archives to take up extra space, not even one 20 GB+ archive.
Any thoughts on how one can achieve that?
If you can arrange to pipe the data straight into tar, it can uncompress and extract it without needing a temporary file.
Here is an example. First, create a tar file to play with:
$ echo abc >one
$ echo def >two
$ tar cvf test.tar one two
one
two
$ gzip test.tar
Remove the test files
$ rm one two
$ ls one two
ls: cannot access one: No such file or directory
ls: cannot access two: No such file or directory
Now extract the contents by piping the compressed tar file into the tar command.
$ cat test.tar.gz | tar xzvf -
one
two
$ ls one two
one two
The only part missing now is how to download the data and pipe it into tar. Assuming you can access the URL with wget, you can get it to send the data to stdout, so you end up with this:
wget -qO- https://youtdata | tar xzvf -
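If the archives live in S3 rather than behind a plain URL, the AWS CLI can stream an object to stdout the same way; a sketch with placeholder bucket and key names:
aws s3 cp s3://your-bucket/open-images/part-00.tar.gz - | tar xzvf -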

find the oldest file based on date in filename in Google Cloud Storage

I would like to list some of the oldest files in my GCS bucket, based on the timestamp in the file name.
The file name looks something like this:
abcdefghijklmnop_qrstu_vwxyz_table_v2.20190101000000.csv.gz
I was able to list the latest file with this command:
gsutil ls -l gs://bucket_name/folder/* | awk -F\. 'm<$4{m=$4;f=$0} END{print f}'
but I am unable to find the right command to list the oldest file based on the filename. sort -kn | head -n1 did not work.
To get the oldest file instead, track the minimum timestamp rather than the maximum (without -l, the timestamp is the second dot-separated field of the object name):
gsutil ls gs://***/ | awk -F\. 'BEGIN{t=2**64}{if(t>$2){t=$2;m=$0;}}END{print m}'
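An alternative sketch, assuming every object follows the same <name>.<timestamp>.csv.gz pattern and there are no other dots in the bucket or folder names: sort numerically on the timestamp field and take the first line for the oldest file (or the last for the newest):
gsutil ls gs://bucket_name/folder/* | sort -t. -k2,2n | head -n 1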

s3cmd copy files preserving path

Is there a way to copy files to an S3 bucket while preserving the file path?
This is the example:
1. I produce a list of files that are different in bucket1 than in bucket2 using s3cmd sync --dry-run
The list looks like this:
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/definition/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/nodetypes.cnd
s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/properties.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/.content.xml
s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/app-store/.content.xml
I need to process this list and copy only the files in the list to a new location in the bucket (e.g. s3://bucket/diff/), but with the full path as shown in the list.
A simple loop like this:
diff_file_list=$(s3cmd -c s3cfg sync --dry-run s3://BUCKET/20150831/PROD s3://BUCKET/20150831/DEV | awk '{print $2}')
for f in $diff_file_list; do
s3cmd -c s3cfg cp $f s3://BUCKET/20150831/DIFF/
done
does not work; it produces this:
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/definition/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/nodetypes.cnd copied to s3://BUCKET/20150831/DIFF/nodetypes.cnd
File s3://BUCKET/20150831/PROD/JC-migration-test-01/META-INF/vault/properties.xml copied to s3://BUCKET/20150831/DIFF/properties.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
File s3://BUCKET/20150831/PROD/JC-migration-test-01/jcr_root/content/origin-store/.content.xml copied to s3://BUCKET/20150831/DIFF/.content.xml
Thanks,
Short answer: no, it is not! That is because the paths in S3 buckets are not actually directories/folders, and an S3 bucket has no such concept of structure, even if various tools present it this way (including s3cmd, which is really confusing...).
So the "path" is actually a prefix (although s3cmd sync to a local target knows how to translate this prefix into a directory structure on your filesystem).
For a bash script the solution is:
1. Create a file listing all the paths from an s3cmd sync --dry-run command (basically a list of diffs) => file1
2. Copy that file and use sed to modify the paths as needed:
sed 's/\(^s3.*\)PROD/\1DIFF/' file1 > file2
3. Merge the files so that line 1 of file1 is followed by line 1 of file2, and so on:
paste file1 file2 > final.txt
4. Read final.txt, line by line, in a loop and use each line as a set of 2 parameters to a copy or sync command:
while IFS='' read -r line || [[ -n "$line" ]]; do
s3cmd -c s3cfg sync $line
done < "final.txt"
Notes:
1. $line in the s3cmd command must not be quoted; if it is, the sync command will complain that it received only one parameter... of course!
2. the [[ -n "$line" ]] is used here so that read does not fail if the last line has no trailing newline character
Unfortunately Boto could not help much more, so if you need something similar in Python you would do it pretty much the same way...
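Pulling those steps together, a minimal end-to-end bash sketch (the bucket name and the PROD/DIFF prefixes are the placeholders from the example above; it reads the two tab-separated columns produced by paste into separate variables instead of relying on word splitting):
s3cmd -c s3cfg sync --dry-run s3://BUCKET/20150831/PROD s3://BUCKET/20150831/DEV | awk '{print $2}' > file1
sed 's/\(^s3.*\)PROD/\1DIFF/' file1 > file2
paste file1 file2 > final.txt
# paste joins with a tab, so split each line into source and destination
while IFS=$'\t' read -r src dst || [[ -n "$src" ]]; do
  s3cmd -c s3cfg cp "$src" "$dst"
done < final.txt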

s3cmd: searching for files based on extension and delete from bucket

I have an S3 bucket with thousands of folders and many txt files inside those folders.
I would like to list all the txt files inside the bucket so I can check whether they're removable, then remove them if they are.
Any idea how to do this with s3cmd?
This is fairly simple, but depends on how sophisticated you want the check to be. Suppose you wanted to remove every text file whose filename includes 'foo':
s3cmd --recursive ls s3://mybucket |
awk '{ print $4 }' | grep '\.txt$' | grep "foo" | xargs s3cmd del
If you want a more sophisticated check than grep can handle, just redirect the first three commands to a file, then either manually edit the file or use awk or perl or whatever your favorite tool is, then cat the output into s3cmd (depending on the check, you could do it all with piping, too):
s3cmd --recursive ls s3://mybucket | awk '{ print $4 }' | grep '\.txt$' > /tmp/textfiles
magic-command-to-check-filenames /tmp/textfiles
cat /tmp/textfiles | xargs s3cmd del
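If the bucket holds a lot of objects, it can be safer to review the list first and let xargs batch the deletes (a sketch reusing the paths and the 'foo' filter from the example above):
# build the candidate list and eyeball it before deleting anything
s3cmd --recursive ls s3://mybucket | awk '{ print $4 }' | grep '\.txt$' | grep "foo" > /tmp/textfiles
wc -l /tmp/textfiles
# delete in batches of 100 object URIs per s3cmd invocation
xargs -n 100 s3cmd del < /tmp/textfiles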