HDFS: calculate the size of subfolders and sort by size

Please advise on how I can calculate the size of subfolders in HDFS and sort them by size.
hdfs dfs -ls -h /mds/snapshots/user/data | du -sh * | sort -rh | head -10
It seems like it should work, but as I understand it, hdfs doesn't work with additional commands after the |.

You can use:
hdfs dfs -du -s /path/* | sort -r -k 1 -g |
    awk '{ suffix="KMGT"; for(i=0; $1>1024 && i < length(suffix); i++) $1/=1024; print int($1) substr(suffix, i, 1), $3; }'

Related

Better way to process dmidecode data using awk/sed

I run
dmidecode -t 4 | awk '/Signature/,/L1 Cache Handle/' |
    grep -e 'Signature' -e 'L1 Cache Handle' |
    awk -v Model="$4" '{
        if ($4 == "Model")
            print $5 " " $7;
        else if ($1 == "L1")
            print " " $4}' >> data
The contents of 'data' on my system are:
49, 0
0x002E
Essentially, 'data' corresponds to :
Signature: Family 23, Model 49, Stepping 0
L1 Cache Handle: 0x002E
(Model # and L1 cache handle)
Looking for a better/more efficient way to do the above operation. Thanks.
Would you please try the following:
dmidecode -t 4 | sed -nE '/Signature|L1 Cache Handle/{s/.*Model ([0-9]+), Stepping ([0-9]+).*/\1 \2/p;s/.*L1 Cache Handle: ([0-9A-Za-z])/\1/p}' | xargs
The sed command extracts the values for Model, Stepping, and L1 Cache Handle:. The final xargs merges the two output lines into one.
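Given the Signature and L1 Cache Handle lines quoted above, the pipeline should print something like:
49 0 0x002E
i.e. Model, Stepping and the L1 cache handle merged onto one line.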

How do I keep the first n lines of a file/command, but grep the rest?

Easiest to give an example.
bash-$ psql -c 'select relname, reltype from pg_catalog.pg_class limit 5;'
relname | reltype
------------------------+---------
bme_reltag_02 | 0
bme_reltag_type1_type2 | 0
bme_reltag_10 | 0
bme_reltag_11 | 0
bme_reltag_cvalue3 | 0 👈 what I care about
But what I am really interested in is anything with cvalue in it. Rather than modifying each query by hand (yes, I know I could do it), I can egrep what I care about.
psql -c 'select relname, reltype from pg_catalog.pg_class limit 5;' | egrep 'cvalue'
but that strips out the first two lines with the column headers.
bme_reltag_cvalue3 | 0
I know I can also do this:
psql -c 'select relname, reltype from pg_catalog.pg_class limit 5;' | head -2 && psql -c 'select relname, reltype from pg_catalog.pg_class limit 5;' | egrep 'cvalue'
relname | reltype
------------------------+---------
bme_reltag_cvalue3 | 0
but what I really want to do is keep the first few lines (the head, or tail) one way and then process the rest another way.
My particular use case here is grepping the contents of arbitrary psql selects, but I'm curious as to what bash capabilities are in this domain.
I've done this before by writing to a temp file and then processing the temp file in multiple steps, but that's not what I am looking for.
A while read loop and grep, if that is acceptable.
#!/usr/bin/env bash
while IFS= read -r lines; do
    [[ $lines == [12]* ]] && echo "${lines#*:}"
    [[ $lines == *cvalue[0-9]* ]] && echo "${lines#*:}"
done < <(psql -c 'select relname, reltype from pg_catalog.pg_class limit 5;' | grep -n .)
Without the grep an alternative is a counter to know the line number, which will be a pure bash solution.
#!/usr/bin/env bash
counter=1
while IFS= read -r lines; do
    [[ $counter == [12] ]] && echo "$lines"
    [[ $lines == *cvalue[0-9]* ]] && echo "$lines"
    ((counter++))
done < <(psql -c 'select relname, reltype from pg_catalog.pg_class limit 5;')
If bash4+ is available.
#!/usr/bin/env bash
mapfile -t files < <(psql -c 'select relname, reltype from pg_catalog.pg_class limit 5;')
printf '%s\n' "${files[0]}" "${files[1]}"
unset 'files[0]' 'files[1]'
for file in "${files[@]}"; do
    [[ $file == *cvalue[0-9]* ]] && echo "$file"
done
By default the builtin read strips leading and trailing whitespace; we don't want that here, so we use IFS=.
grep -n . prefixes each line with its line number and a :
[12] is a glob, not a regex: it matches either 1 or 2, and with the trailing * the pattern matches when 1 or 2 is the first character of the line (i.e. the first two numbered lines).
*cvalue[0-9]* matches cvalue followed by a digit anywhere in the line.
"${lines#*:}" is a parameter expansion that strips everything up to and including the first :, i.e. the line-number prefix added by grep -n.
<( ) is called process substitution.
$ psql -c ... | awk 'NR<3 || /cvalue/'
NR<3 keeps the first two lines (the column headers); /cvalue/ keeps any later line that matches.
This can be done with sed using its range feature to only operate on lines 3 and beyond
sed '3,${/cvalue/!{d;};}'
Proof of Concept
$ cat ./psql
relname | reltype
------------------------+---------
bme_reltag_02 | 0
bme_reltag_type1_type2 | 0
bme_reltag_10 | 0
bme_reltag_11 | 0
bme_reltag_cvalue3 | 0
$ sed '3,${/cvalue/!{d;};}' ./psql
relname | reltype
------------------------+---------
bme_reltag_cvalue3 | 0
Explanation
3,${...;}: Start processing from line 3 until the end of file $
/cvalue/!{d;}: Delete d any line that does not match (!) the regex /cvalue/
You can use the bash head and tail commands:
cat file.sql | head -n 15 > head.sql
Replace the 15 with the number of lines you want to keep.
Or replace head with tail for the bottom of the file.
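If you do capture the psql output in a file first (which the asker wanted to avoid, so treat this only as a sketch of how the two halves combine), the header and the filtered body can be printed back to back:
psql -c 'select relname, reltype from pg_catalog.pg_class limit 5;' > out.txt
head -n 2 out.txt                    # the two header lines
tail -n +3 out.txt | grep cvalue     # everything else, filtered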

How to find the position of word in a list of string to use in awk?

Morning guys,
I often have files that I want to grep+awk but that have a lot of fields.
I'm interested in one in particular (so I'd like to awk '{print $i}'), but how can I know what position (i.e. "i" here) my column is at, other than counting it manually?
With files of around 70 fields, I'd be saving a lot of time! :)
Thanks a lot,
[Edit]
Following Ian McGowan's suggestion, I'll look for the column number in the file's header:
head -1 myFile | awk '{for (i=1; i<=NF; i++) printf("%3d=%s\n", i, $i); }' | grep -i <the_column_Im_looking_for>
Thanks everyone,
Will1v
I was looking for a sample but:
$ cat > file
this is
the sample
$ awk '{
    for(i=1;i<=NF;i++)
        if($i=="sample")
            print NR,i
}' file
2 2
I do this all the time when trying to profile some large text delimited file.
$ head -4 myfile
4A 1 321 537 513.30
4B 0.00
8 592 846 905.66
9B2 39 887 658.77
Transpose or pivot by looping over the columns/fields:
$ awk '{ for (i=1; i<=NF; i++) printf("%4d %3d=%s\n", NR, i, $i); }' < myfile
1 1=4A
1 2=1
1 3=321
1 4=537
1 5=513.30
2 1=4B
2 2=0.00
3 1=8
3 2=592
3 3=846
3 4=905.66
4 1=9B2
4 2=39
4 3=887
4 4=658.77
You can add printf("row=%-4d col=%-3d:%s\n", NR, i, $i); or whatever in the format mask for printf, and then grep for just the data you care about to find out the column. Or, if you already know the column, grep for col=44 to get the 44th column.
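For example, against the myfile sample above (just a sketch; col=3 is the label the printf writes for the third field):
$ awk '{ for (i=1; i<=NF; i++) printf("row=%-4d col=%-3d:%s\n", NR, i, $i); }' < myfile | grep 'col=3 '
This should print only the row number and value of the third field of each line that has one (321, 846 and 887 in the sample above).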
xargs -n1 will print the columns one per line, so you can do this:
head -1 file | xargs -n1 | grep -n "column_name"

How to kill all processes of the same name but keep the latest one running?

I've seen this but I can't quite figure out how to kill all processes of a given name but keep the most recent one running.
So for example, this will show me the running processes:
$ ps -eo pid,etime,comm | grep maya.bin
23752 10:40 maya.bin
23841 12:04 maya.bin
23891 14:46 maya.bin
Somehow, I would need to filter out the PIDs 23752 23841 (but not 23891) and kill those.
If I could just figure out how to find those PIDs, presumably via awk, I could kill via e.g.
ps -eo pid,etime,comm | awk ??? | kill -9
or
kill -9 $(ps aux | grep 'maya.bin' | awk ??? )
Is this possible using awk?
If I understand correctly, you need to select the first word of each line (the PID) and then keep all but the last row. You can do that with
awk '{ print $1 }' | head -n -1
The total command would then be
kill -9 $(ps aux | grep 'maya.bin' | awk '{ print $1 }' | head -n -1)
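With the ps -eo listing from the question, the same filter would select every PID except the one on the last row, so the command substitution effectively runs kill -9 23752 23841 (note that head -n -1, meaning all lines but the last, is a GNU coreutils extension):
$ ps -eo pid,etime,comm | grep maya.bin | awk '{ print $1 }' | head -n -1
23752
23841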

awk associative array grows fast

I have a file that assigns numbers to md5sums like follows:
0 0000001732816557DE23435780915F75
1 00000035552C6F8B9E7D70F1E4E8D500
2 00000051D63FACEF571C09D98659DC55
3 0000006D7695939200D57D3FBC30D46C
4 0000006E501F5CBD4DB56CA48634A935
5 00000090B9750D99297911A0496B5134
6 000000B5AEA2C9EA7CC155F6EBCEF97F
7 00000100AD8A7F039E8F48425D9CB389
8 0000011ADE49679AEC057E07A53208C1
Another file contains three md5sums in each line, as follows:
00000035552C6F8B9E7D70F1E4E8D500 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7
00000035552C6F8B9E7D70F1E4E8D500 44E48C092AADA3B171CE899FFC6943A8 1B757742E1BF2AA5DB6890E5E338F857
What I want to do is replace the first and third md5sums in the second file with the integers from the first file. Currently I am trying the following awk script:
awk '{OFS="\t"}FNR==NR{map[$2]=$1;next}
{print map[$1],$2,map[$3]}' mapping.txt relation.txt
The problem is that the script needs more than 16 GB of RAM, even though the first file is only 5.7 GB on disk.
If you don't have enough memory to store the first file, then you need to write something like this to look up the 1st file for each value in the 2nd file:
awk 'BEGIN{OFS="\t"}
{
    val1 = val3 = ""
    while ( (getline line < "mapping.txt") > 0 ) {
        split(line,flds)
        if (flds[2] == $1) {
            val1 = flds[1]
        }
        if (flds[2] == $3) {
            val3 = flds[1]
        }
        if ( (val1 != "") && (val3 != "") ) {
            break
        }
    }
    close("mapping.txt")
    print val1,$2,val3
}' relation.txt
It will be slow. You could add a cache of N getline-d lines to speed it up if you like.
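A minimal sketch of that idea, caching the result of each lookup rather than a fixed window of getline-d lines, so memory grows only with the number of distinct md5sums actually seen in relation.txt (this is my reading of the suggestion, not code from the answer):
awk '
# cache[] remembers md5sums we have already resolved, so repeated values
# in relation.txt do not trigger another scan of mapping.txt
function lookup(md5,    line, flds, val) {
    if (md5 in cache)
        return cache[md5]
    val = ""
    while ( (getline line < "mapping.txt") > 0 ) {
        split(line, flds)
        if (flds[2] == md5) { val = flds[1]; break }
    }
    close("mapping.txt")
    cache[md5] = val
    return val
}
BEGIN { OFS="\t" }
{ print lookup($1), $2, lookup($3) }
' relation.txt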
This problem could be solved as follows (file1.txt is the file with the integers and md5sums, while file2.txt is the file with the three columns of md5sums):
#!/bin/sh
# First sort each of file 1 and the first and third columns of file 2 by MD5
awk '{ print $2 "\t" $1}' file1.txt | sort >file1_n.txt
# Before we sort the file 2 columns, we number the rows so we can put them
# back into the original order later
cut -f1 file2.txt | cat -n - | awk '{ print $2 "\t" $1}' | sort >file2_1n.txt
cut -f3 file2.txt | cat -n - | awk '{ print $2 "\t" $1}' | sort >file2_3n.txt
# Now do a join between them, extract the two columns we want, and put them back in order
join -t' ' file2_1n.txt file1_n.txt | awk '{ print $2 "\t" $3}' | sort -n | cut -f2 >file2_1.txt
join -t' ' file2_3n.txt file1_n.txt | awk '{ print $2 "\t" $3}' | sort -n | cut -f2 >file2_3.txt
cut -f2 file2.txt | paste file2_1.txt - file2_3.txt >file2_new1.txt
For a case where file1.txt and file2.txt are each 1 million lines long, this solution and Ed Morton's awk-only solution take about the same length of time on my system. My system would take a very long time to solve the problem for 140 million lines, regardless of the approach used, but I ran a test case for files with 10 million lines.
I had assumed that a solution that relied on sort (which automatically uses temporary files when required) should be faster for large numbers of lines because it would be O(N log N) runtime, whereas a solution that re-reads the mapping file for each line of the input would be O(N^2) if the two files are of similar size.
Timing results
My assumption with respect to the performance relationship of the two candidate solutions turned out to be faulty for the test cases that I've tried. On my system, the sort-based solution and the awk-only solution took similar (within 30%) amounts of time to each other for each of 1 million and 10 million line input files, with the awk-only solution being faster in each case. I don't know if that relationship will hold true when the input file size goes up by another factor of more than 10, of course.
Strangely, the 10 million line problem took about 10 times as long to run with both solutions as the 1 million line problem, which puzzles me as I would have expected a non-linear relationship with file length for both solutions.
If the size of a file causes awk to run out of memory, then either use another tool, or another approach entirely.
The sed command might succeed with much less memory usage. The idea is to read the index file and create a sed script which performs the remapping, and then invoke sed on the generated sedscript.
The bash script below is an implementation of this idea. It includes some STDERR output to help track progress. I like to produce progress-tracking output for problems with large data sets or other kinds of time-consuming processing.
This script has been tested on a small set of data; it may work on your data. Please give it a try.
#!/bin/bash
# md5-indexes.txt
# 0 0000001732816557DE23435780915F75
# 1 00000035552C6F8B9E7D70F1E4E8D500
# 2 00000051D63FACEF571C09D98659DC55
# 3 0000006D7695939200D57D3FBC30D46C
# 4 0000006E501F5CBD4DB56CA48634A935
# 5 00000090B9750D99297911A0496B5134
# 6 000000B5AEA2C9EA7CC155F6EBCEF97F
# 7 00000100AD8A7F039E8F48425D9CB389
# 8 0000011ADE49679AEC057E07A53208C1
# md5-data.txt
# 00000035552C6F8B9E7D70F1E4E8D500 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7
# 00000035552C6F8B9E7D70F1E4E8D500 44E48C092AADA3B171CE899FFC6943A8 1B757742E1BF2AA5DB6890E5E338F857
# Goal replace field 1 and field 3 with indexes to md5 checksums from md5-indexes
md5_indexes='md5-indexes.txt'
md5_data='md5-data.txt'
talk() { echo 1>&2 "$*" ; }
talkf() { printf 1>&2 "$@" ; }
track() {
    local var="$1" interval="$2"
    local val
    eval "val=\$$var"
    if (( interval == 0 || val % interval == 0 )); then
        shift 2
        talkf "$@"
    fi
    eval "(( $var++ ))"   # increment the counter
}
# Build a sedscript to translate all occurrences of the 1st & 3rd MD5 sums into their
# corresponding indexes
talk "Building the sedscript from the md5 indexes.."
sedscript=/tmp/$$.sed
linenum=0
lines=`wc -l <$md5_indexes`
interval=$(( lines / 100 ))
while read index md5sum ; do
    track linenum $interval "..$linenum"
    echo "s/^[[:space:]]*[[:<:]]$md5sum[[:>:]]/$index/" >>$sedscript
    echo "s/[[:<:]]$md5sum[[:>:]]\$/$index/" >>$sedscript
done <$md5_indexes
talk ''
sedlength=`wc -l <$sedscript`
talkf "The sedscript is %d lines\n" $sedlength
cmd="sed -E -f $sedscript -i .bak $md5_data"
talk "Invoking: $cmd"
$cmd
changes=`diff -U 0 $md5_data.bak $md5_data | tail +3 | grep -c '^+'`
talkf "%d lines changed in $md5_data\n" $changes
exit
Here are the two files:
cat md5-indexes.txt
0 0000001732816557DE23435780915F75
1 00000035552C6F8B9E7D70F1E4E8D500
2 00000051D63FACEF571C09D98659DC55
3 0000006D7695939200D57D3FBC30D46C
4 0000006E501F5CBD4DB56CA48634A935
5 00000090B9750D99297911A0496B5134
6 000000B5AEA2C9EA7CC155F6EBCEF97F
7 00000100AD8A7F039E8F48425D9CB389
8 0000011ADE49679AEC057E07A53208C1
cat md5-data.txt
00000035552C6F8B9E7D70F1E4E8D500 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7
00000035552C6F8B9E7D70F1E4E8D500 44E48C092AADA3B171CE899FFC6943A8 1B757742E1BF2AA5DB6890E5E338F857
Here is the sample run:
$ ./md5-reindex.sh
Building the sedscript from the md5 indexes..
..0..1..2..3..4..5..6..7..8
The sedscript is 18 lines
Invoking: sed -E -f /tmp/83800.sed -i .bak md5-data.txt
2 lines changed in md5-data.txt
Finally, the resulting file:
$ cat md5-data.txt
1 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7
1 44E48C092AADA3B171CE899FFC6943A8 1B757742E1BF2AA5DB6890E5E338F857