base64 decoding from file column - awk

I have a file where every line has 6 columns separated by ",". The last column is gzipped and base64-encoded. The output file should contain column 3 and column 6 (decoded/unzipped).
I tried to do this by
awk -F',' '{"echo "$6" | base64 -di | gunzip" | getline x;print $3,x }' OFS=',' inputfile.csv >outptfile_decoded.csv
The result for the first lines is ok, but after some lines the decoded output is the same as on the line before. It seems that the decoding & unzipping hangs, but I didn't get an error message.
A single decode/unzip works fine, i.e.
echo "H4sIAAAAAAAAA7NJTkuxs0lMLrEztNEHUTZAgcy8tHw7m7zSXLuS1BwrbRNjMzMTc3MDAzMDG32QqE1uSWVBqh2QB2HYlCYX2xnb6IMoG324ASCWHQAaafi1YQAAAA==" | base64 -di | gunzip
What can be the reason for this effect? (there are no error messages).
Is there another way that works reliably?

Without a test case it's difficult to recommend anything. Here is a working script with input data.
create a test data file
$ while read f; do echo $f,$(echo $f | gzip -f | base64); done < <(seq 5) | tee file.g
1,H4sIAJhBuVkAAzPkAgBT/FFnAgAAAA==
2,H4sIAJhBuVkAAzPiAgCQr3xMAgAAAA==
3,H4sIAJhBuVkAAzPmAgDRnmdVAgAAAA==
4,H4sIAJhBuVkAAzPhAgAWCCYaAgAAAA==
5,H4sIAJhBuVkAAzPlAgBXOT0DAgAAAA==
and decode
$ awk 'BEGIN {FS=OFS=","}
{cmd="echo "$2" | base64 -di | gunzip"; cmd | getline v; close(cmd); print $1,v}' file.g
1,1
2,2
3,3
4,4
5,5
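A likely cause of the original symptom: the pipe opened by cmd | getline in the question's one-liner is never close()d, so a new pipe stays open for every line; once the process hits its open-file limit, getline simply fails, and a failed getline leaves x unchanged, so the previous line's decoded value is printed again with no visible error. A sketch of the same approach applied to the 6-column file from the question, closing the pipe and resetting the variable on every line:
awk 'BEGIN {FS=OFS=","}
{
    cmd = "echo " $6 " | base64 -di | gunzip"
    x = ""            # reset so a failed decode cannot silently reuse the previous value
    cmd | getline x
    close(cmd)        # close the pipe every line so file descriptors are not exhausted
    print $3, x
}' inputfile.csv > outptfile_decoded.csv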

Awk, order foreach 12 lines to insert query

I have the following script:
curl -s 'https://someonepage=5m' | jq '.[]|.[0],.[1],.[2],.[3],.[4],.[5],.[6],.[7],.[8],.[9],.[10],.[11],.[12]' | perl -p -e 's/\"//g' |awk '/^[0-9]/{print; if (++onr%12 == 0) print ""; }'
This is part of the result:
1517773500000
0.10250100
0.10275700
0.10243500
0.10256600
257.26700000
1517773799999
26.38912220
1229
104.32200000
10.70579910
0
1517773800000
0.10256600
0.10268000
0.10231600
0.10243400
310.64600000
1517774099999
31.83806883
1452
129.70500000
13.29758266
0
1517774100000
0.10243400
0.10257500
0.10211800
0.10230000
359.06300000
1517774399999
36.73708621
1296
154.78500000
15.84041910
0
I want to insert this data into a MySQL database. For each group of 12 values I want one line like this:
(1517773800000,0.10256600,0.10268000,0.10231600,0.10243400,310.64600000,1517774099999,31.83806883,1452,129.70500000,13.29758266,0)
(1517774100000,0.10243400,0.10257500,0.10211800,0.10230000,359.06300000,1517774399999,36.73708621,1296,154.78500000,15.84041910,0)
I need to merge every 12 lines into one; can anyone help me get this result?
Here's an all-jq solution:
.[] | .[0:12] | @tsv | gsub("\t";",") | "(\(.))"
In the sample, all the subarrays have length 12, so you might be able to drop the .[0:12] part of the pipeline. If using jq 1.5 or later, you could use join(",") instead of the @tsv|gsub portion of the pipeline. You might, for example, want to consider:
.[] | join(",") | "(\(.))"   # jq 1.5 or later
Invocation: use the -r command-line option
Sample output:
(1517627400000,0.10452300,0.10499000,0.10418200,0.10449400,819.50400000,1517627699999,85.57150693,2340,452.63400000,47.27213035,0)
(1517627700000,0.10435700,0.10449200,0.10366000,0.10370000,717.37000000,1517627999999,74.60582079,1996,321.25500000,33.42273846,0)
(1517628000000,0.10376600,0.10390000,0.10366000,0.10370400,519.59400000,1517628299999,53.88836170,1258,239.89300000,24.88613854,0)
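Putting it together (using the placeholder URL from the question), the full invocation might look like this:
curl -s 'https://someonepage=5m' | jq -r '.[] | .[0:12] | @tsv | gsub("\t";",") | "(\(.))"'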
$ awk 'BEGIN {RS=""; OFS=","} {$1=$1; $0="("$0")"}1' file
(1517773500000,0.10250100,0.10275700,0.10243500,0.10256600,257.26700000,1517773799999,26.38912220,1229,104.32200000,10.70579910,0)
(1517773800000,0.10256600,0.10268000,0.10231600,0.10243400,310.64600000,1517774099999,31.83806883,1452,129.70500000,13.29758266,0)
(1517774100000,0.10243400,0.10257500,0.10211800,0.10230000,359.06300000,1517774399999,36.73708621,1296,154.78500000,15.84041910,0)
RS="":
Treat groups of lines separated one or more blank lines as a record
OFS=","
Set the output separator to be a ","
$1=$1
Reconstitute the line, replacing the input separators with the output separator
$0="("$0")"
Surround the record with parens
1
Print the record
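If the blank lines between groups get lost somewhere along the way (the pasted sample above shows 36 consecutive lines with no separators), a hedged alternative is to group purely by line count instead of relying on RS="":
awk '{ printf "%s%s", (NR%12==1 ? "(" : ","), $0 }   # "(" before the 1st value of a group, "," before the rest
     NR%12==0 { print ")" }' file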

Only output line if value in specific column is unique

Input:
line1 a gh
line2 a dd
line3 c dd
line4 a gg
line5 b ef
Desired output:
line3 c dd
line5 b ef
That is, I want to output a line only in the case that no other line includes the same value in column 2. I thought I could do this with a combination of sort (e.g. sort -k2,2 input) and uniq, but it appears that with uniq I can only skip columns from the left (-f avoids comparing the first N fields). Surely there's some straightforward way to do this with awk or something.
You can do this as a two-pass awk script:
awk 'NR==FNR{a[$2]++;next} a[$2]<2' file file
This runs through the file once incrementing a counter in an array whose key is the second field of each line, then runs through a second time printing only those lines whose counter is less than 2.
You'd need multiple reads of the file because at any point during the first read, you can't possibly know whether there will be another instance of the second field of that line later in the file.
Here is a one pass awk solution:
awk '{a1[$2]++;a2[$2]=$0} END{for (a in a1) if (a1[a]==1) print a2[a]}' file
The original order of the file will be lost however.
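If keeping the original order matters, here is a sketch of a one-pass variant that buffers every line and prints the survivors in input order at the end (note it holds the whole file in memory):
awk '{cnt[$2]++; line[NR]=$0; key[NR]=$2}
     END{for (i=1; i<=NR; i++) if (cnt[key[i]]==1) print line[i]}' file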
You can combine awk, grep, sort and uniq for a quick one-liner:
grep -v "^[^ ]* $(awk '{print $2}' input.txt | sort | uniq -d) " input.txt
Edit: to escape regex metacharacters in the matched values (leaving + and digits alone so the escaping doesn't create \+ or \backreferences):
grep -v "^[^ ]* $(awk '{print $2}' input.txt | sort | uniq -d | sed 's/[^+0-9]/\\&/g') " input.txt
An alternative to awk, to demonstrate that it can still be done with sort and uniq (there is the option -u for this); however, setting up the right format requires some juggling (the decorate/do stuff/undecorate pattern).
$ paste file <(cut -d' ' -f2 file) | sort -k2 | uniq -uf3 | cut -f1
line5 b ef
line3 c dd
As a side effect you lose the original sorting order, which can be recovered as well if you add line numbers...
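For example, a sketch of that recovery (assuming bash for the process substitution): number the lines first, sort and filter on the duplicated key column, then sort back by line number and strip the decoration:
$ cat -n file | paste - <(cut -d' ' -f2 file) | sort -k3 | uniq -uf4 | sort -k1,1n | cut -f2
line3 c dd
line5 b ef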

Add text to beginning of awk result

Good day all,
I am running the command below:
netstat -an | awk '/:25/{ print $4 }' | sed 's/:25//' | paste -sd ',' -
which produces
192.168.2.22,127.0.0.1
I would like to amend the result to something like below (to be parsed as a csv by an application)
Manuallyaddedtext 192.168.2.22,127.0.0.1
Many thanks
echo -n "Mytext " ; netstat...

awk associative array grows fast

I have a file that assigns numbers to md5sums as follows:
0 0000001732816557DE23435780915F75
1 00000035552C6F8B9E7D70F1E4E8D500
2 00000051D63FACEF571C09D98659DC55
3 0000006D7695939200D57D3FBC30D46C
4 0000006E501F5CBD4DB56CA48634A935
5 00000090B9750D99297911A0496B5134
6 000000B5AEA2C9EA7CC155F6EBCEF97F
7 00000100AD8A7F039E8F48425D9CB389
8 0000011ADE49679AEC057E07A53208C1
Another file contains three md5sums on each line, as follows:
00000035552C6F8B9E7D70F1E4E8D500 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7
00000035552C6F8B9E7D70F1E4E8D500 44E48C092AADA3B171CE899FFC6943A8 1B757742E1BF2AA5DB6890E5E338F857
What I want to do is replace the first and third md5sums in the second file with the integers from the first file. Currently I am trying the following awk script:
awk '{OFS="\t"}FNR==NR{map[$2]=$1;next}
{print map[$1],$2,map[$3]}' mapping.txt relation.txt
The problem is that the script needs more than 16 GB of RAM, despite the fact that the first file is only 5.7 GB on disk.
If you don't have enough memory to store the first file, then you need to write something like this to look up the 1st file for each value in the 2nd file:
awk 'BEGIN{OFS="\t"}
{
    val1 = val3 = ""
    while ( (getline line < "mapping.txt") > 0 ) {
        split(line,flds)
        if (flds[2] == $1) {
            val1 = flds[1]
        }
        if (flds[2] == $3) {
            val3 = flds[1]
        }
        if ( (val1 != "") && (val3 != "") ) {
            break
        }
    }
    close("mapping.txt")
    print val1,$2,val3
}' relation.txt
It will be slow. You could add a cache of N getline-d lines to speed it up if you like.
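One way to read that caching suggestion, as a minimal sketch (the size cap max and the whole-array delete are assumptions; delete arr on an entire array is a widely supported awk extension): look each md5sum up in an in-memory cache first, and only fall back to scanning mapping.txt on a miss:
awk 'BEGIN{OFS="\t"; max=1000000}           # cap the cache at an assumed 1,000,000 entries
{
    val1 = ($1 in cache) ? cache[$1] : ""
    val3 = ($3 in cache) ? cache[$3] : ""
    if (val1 == "" || val3 == "") {         # cache miss: scan the mapping file
        while ( (getline line < "mapping.txt") > 0 ) {
            split(line,flds)
            if (flds[2] == $1) val1 = flds[1]
            if (flds[2] == $3) val3 = flds[1]
            if (val1 != "" && val3 != "") break
        }
        close("mapping.txt")
        if (n > max) { delete cache; n = 0 }   # crude eviction: dump the whole cache
        cache[$1] = val1; cache[$3] = val3; n += 2
    }
    print val1,$2,val3
}' relation.txt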
This problem could be solved as follows (file1.txt is the file with the integers and md5sums, while file2.txt is the file with the three columns of md5sums):
#!/bin/sh
# First sort each of file 1 and the first and third columns of file 2 by MD5
awk '{ print $2 "\t" $1}' file1.txt | sort >file1_n.txt
# Before we sort the file 2 columns, we number the rows so we can put them
# back into the original order later
cut -f1 file2.txt | cat -n - | awk '{ print $2 "\t" $1}' | sort >file2_1n.txt
cut -f3 file2.txt | cat -n - | awk '{ print $2 "\t" $1}' | sort >file2_3n.txt
# Now do a join between them, extract the two columns we want, and put them back in order
join -t' ' file2_1n.txt file1_n.txt | awk '{ print $2 "\t" $3}' | sort -n | cut -f2 >file2_1.txt
join -t' ' file2_3n.txt file1_n.txt | awk '{ print $2 "\t" $3}' | sort -n | cut -f2 >file2_3.txt
cut -f2 file2.txt | paste file2_1.txt - file2_3.txt >file2_new1.txt
For a case where file1.txt and file2.txt are each 1 million lines long, this solution and Ed Morton's awk-only solution take about the same length of time on my system. My system would take a very long time to solve the problem for 140 million lines, regardless of the approach used, but I ran a test case for files with 10 million lines.
I had assumed that a solution that relied on sort (which automatically uses temporary files when required) should be faster for large numbers of lines because it would be O(N log N) runtime, whereas a solution that re-reads the mapping file for each line of the input would be O(N^2) if the two files are of similar size.
Timing results
My assumption with respect to the performance relationship of the two candidate solutions turned out to be faulty for the test cases that I've tried. On my system, the sort-based solution and the awk-only solution took similar (within 30%) amounts of time to each other for each of 1 million and 10 million line input files, with the awk-only solution being faster in each case. I don't know if that relationship will hold true when the input file size goes up by another factor of more than 10, of course.
Strangely, the 10 million line problem took about 10 times as long to run with both solutions as the 1 million line problem, which puzzles me as I would have expected a non-linear relationship with file length for both solutions.
If the size of a file causes awk to run out of memory, then either use another tool, or another approach entirely.
The sed command might succeed with much less memory usage. The idea is to read the index file and create a sed script which performs the remapping, and then invoke sed on the generated sedscript.
The bash script below is an implementation of this idea. It includes some STDERR output to help track progress. I like to produce progress-tracking output for problems with large data sets or other kinds of time-consuming processing.
This script has been tested on a small set of data; it may work on your data. Please give it a try.
#!/bin/bash
# md5-indexes.txt
# 0 0000001732816557DE23435780915F75
# 1 00000035552C6F8B9E7D70F1E4E8D500
# 2 00000051D63FACEF571C09D98659DC55
# 3 0000006D7695939200D57D3FBC30D46C
# 4 0000006E501F5CBD4DB56CA48634A935
# 5 00000090B9750D99297911A0496B5134
# 6 000000B5AEA2C9EA7CC155F6EBCEF97F
# 7 00000100AD8A7F039E8F48425D9CB389
# 8 0000011ADE49679AEC057E07A53208C1
# md5-data.txt
# 00000035552C6F8B9E7D70F1E4E8D500 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7
# 00000035552C6F8B9E7D70F1E4E8D500 44E48C092AADA3B171CE899FFC6943A8 1B757742E1BF2AA5DB6890E5E338F857
# Goal replace field 1 and field 3 with indexes to md5 checksums from md5-indexes
md5_indexes='md5-indexes.txt'
md5_data='md5-data.txt'
talk()  { echo 1>&2 "$*" ; }
talkf() { printf 1>&2 "$@" ; }
track() {
    local var="$1" interval="$2"
    local val
    eval "val=\$$var"
    if (( interval == 0 || val % interval == 0 )); then
        shift 2
        talkf "$@"
    fi
    eval "(( $var++ ))" # increment the counter
}
# Build a sedscript to translate all occurrences of the 1st & 3rd MD5 sums into their
# corresponding indexes
talk "Building the sedscript from the md5 indexes.."
sedscript=/tmp/$$.sed
linenum=0
lines=`wc -l <$md5_indexes`
interval=$(( lines / 100 ))
while read index md5sum ; do
    track linenum $interval "..$linenum"
    echo "s/^[[:space:]]*[[:<:]]$md5sum[[:>:]]/$index/" >>$sedscript
    echo "s/[[:<:]]$md5sum[[:>:]]\$/$index/" >>$sedscript
done <$md5_indexes
talk ''
sedlength=`wc -l <$sedscript`
talkf "The sedscript is %d lines\n" $sedlength
cmd="sed -E -f $sedscript -i .bak $md5_data"
talk "Invoking: $cmd"
$cmd
changes=`diff -U 0 $md5_data.bak $md5_data | tail +3 | grep -c '^+'`
talkf "%d lines changed in $md5_data\n" $changes
exit
Here are the two files:
cat md5-indexes.txt
0 0000001732816557DE23435780915F75
1 00000035552C6F8B9E7D70F1E4E8D500
2 00000051D63FACEF571C09D98659DC55
3 0000006D7695939200D57D3FBC30D46C
4 0000006E501F5CBD4DB56CA48634A935
5 00000090B9750D99297911A0496B5134
6 000000B5AEA2C9EA7CC155F6EBCEF97F
7 00000100AD8A7F039E8F48425D9CB389
8 0000011ADE49679AEC057E07A53208C1
cat md5-data.txt
00000035552C6F8B9E7D70F1E4E8D500 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7
00000035552C6F8B9E7D70F1E4E8D500 44E48C092AADA3B171CE899FFC6943A8 1B757742E1BF2AA5DB6890E5E338F857
Here is the sample run:
$ ./md5-reindex.sh
Building the sedscript from the md5 indexes..
..0..1..2..3..4..5..6..7..8
The sedscript is 18 lines
Invoking: sed -E -f /tmp/83800.sed -i .bak md5-data.txt
2 lines changed in md5-data.txt
Finally, the resulting file:
$ cat md5-data.txt
1 276EC96E149571F8A27F4417D7C6BC20 9CFEFED8FB9497BAA5CD519D7D2BB5D7
1 44E48C092AADA3B171CE899FFC6943A8 1B757742E1BF2AA5DB6890E5E338F857
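One portability note, as an aside: the [[:<:]] / [[:>:]] word-boundary classes and the detached backup suffix in -i .bak are BSD sed conventions (the sample run above was presumably on such a system). With GNU sed, the equivalent generated lines and invocation would look roughly like this (older GNU sed spells -E as -r):
echo "s/^[[:space:]]*\<$md5sum\>/$index/" >>$sedscript   # GNU sed word boundaries are \< and \>
echo "s/\<$md5sum\>\$/$index/" >>$sedscript
cmd="sed -E -f $sedscript -i.bak $md5_data"              # GNU sed wants the backup suffix attached to -i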

Count files that have unique prefixes

I have set of files that looks like the following. I'm looking for a good way to count all files that have unique prefixes, where "prefix" is defined by all characters before the second hyphen.
0406-0357-9.jpg 0591-0349-9.jpg 0603-3887-27.jpg 59762-1540-40.jpg 68180-517-6.jpg
0406-0357-90.jpg 0591-0349-90.jpg 0603-3887-28.jpg 59762-1540-41.jpg 68180-517-7.jpg
0406-0357-91.jpg 0591-0349-91.jpg 0603-3887-29.jpg 59762-1540-42.jpg 68180-517-8.jpg
0406-0357-92.jpg 0591-0349-92.jpg 0603-3887-3.jpg 59762-1540-5.jpg 68180-517-9.jpg
0406-0357-93.jpg 0591-0349-93.jpg 0603-3887-30.jpg 59762-1540-6.jpg
Depending on what you actually want output, either of these might be what you want:
ls | awk -F'-' '{c[$1"-"$2]++} END{for (p in c) print p, c[p]}'
or
ls | awk -F'-' '!seen[$1,$2]++{count++} END{print count+0}'
If it's something else, update your question to show the output you're looking for.
This should do it:
ls *.jpg | cut -d- -s -f1,2 | uniq | wc -l
Or if your prefixes are always 4 digits, one dash, 4 digits, you don't need cut:
ls *.jpg | uniq -w9 | wc -l
Parses ls (bad, but it doesn't look like it will cause a problem with these filenames),
uses awk to set the field separator as -.
!seen[$1,$2]++ uses an associative array with $1,$2 as the key and increments it, then checks whether the pre-increment value was 0 to ensure each prefix is only counted once (based on $1 and $2).
print prints on screen :)
ls | awk 'BEGIN{FS="-" ; printf("%-20s%-10s\n","Prefix","Count")} {seen[$1"-"$2]++} END{ for (k in seen){printf("%-20s%-10i\n",k,seen[k])}}'
Will now count based on prefix with headers :)
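If parsing ls is a concern (as noted above), a hedged alternative is to feed the same counting idea from a shell glob instead, still one name per line but without involving ls:
printf '%s\n' *.jpg | awk -F'-' '!seen[$1,$2]++{count++} END{print count+0}'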