Format time from alphanumeric to numeric - awk
I have a text file:
ifile.txt
x y z t value
1 1 5 01hr01Jan2018 3
1 1 5 02hr01Jan2018 3.1
1 1 5 03hr01Jan2018 3.2
1 3.4 3 01hr01Jan2018 4.1
1 3.4 3 02hr01Jan2018 6.1
1 3.4 3 03hr01Jan2018 1.1
1 4.2 6 01hr01Jan2018 6.33
1 4.2 6 02hr01Jan2018 8.33
1 4.2 6 03hr01Jan2018 5.33
3.4 1 2 01hr01Jan2018 3.5
3.4 1 2 02hr01Jan2018 5.65
3.4 1 2 03hr01Jan2018 3.66
3.4 3.4 4 01hr01Jan2018 6.32
3.4 3.4 4 02hr01Jan2018 9.32
3.4 3.4 4 03hr01Jan2018 12.32
3.4 4.2 8.1 01hr01Jan2018 7.43
3.4 4.2 8.1 02hr01Jan2018 7.93
3.4 4.2 8.1 03hr01Jan2018 5.43
4.2 1 3.4 01hr01Jan2018 6.12
4.2 1 3.4 02hr01Jan2018 7.15
4.2 1 3.4 03hr01Jan2018 9.12
4.2 3.4 5.5 01hr01Jan2018 2.2
4.2 3.4 5.5 02hr01Jan2018 3.42
4.2 3.4 5.5 03hr01Jan2018 3.21
4.2 4.2 6.2 01hr01Jan2018 1.3
4.2 4.2 6.2 02hr01Jan2018 3.4
4.2 4.2 6.2 03hr01Jan2018 1
Explanation: each (x,y) coordinate has a z-value and three time values. The columns are separated by sequences of spaces, not tabs.
I would like to convert the t-column from alphanumeric to numeric and then convert the file to CSV. My expected output is:
ofile.txt
x,y,z,201801010100,201801010200,201801010300
1,1,5,3,3.1,3.2
1,3.4,3,4.1,6.1,1.1
1,4.2,6,6.33,8.33,5.33
3.4,1,2,3.5,5.65,3.66
3.4,3.4,4,6.32,9.32,12.32
3.4,4.2,8.1,7.43,7.93,5.43
4.2,1,3.4,6.12,7.15,9.12
4.2,3.4,5.5,2.2,3.42,3.21
4.2,4.2,6.2,1.3,3.4,1
The desired time format is YYYYMMDDHHMM (year, month, day, hour, minute).
I asked part of this question previously; please see Format and then convert txt to csv using shell script and awk. However, I am not able to change the time format within the following script:
awk -v OFS=, '{k=$1 OFS $2 OFS $3}                 # row key: x,y,z
!($4 in hdr){hn[++h]=$4; hdr[$4]}                  # collect unique t values, in order, for the header
k in row{row[k]=row[k] OFS $5; next}               # append this value to an already-seen row
{rn[++n]=k; row[k]=$5}                             # remember a new row and its first value
END {
  printf "%s", rn[1]
  for(i=1; i<=h; i++)
    printf "%s", OFS hn[i]
  print ""
  for (i=2; i<=n; i++)
    print rn[i], row[rn[i]]
}' ifile.txt
Expanding on my answer from your previous question:
gawk '
BEGIN {
SUBSEP = OFS = ","
month["Jan"] = "01"; month["Feb"] = "02"; month["Mar"] = "03";
month["Apr"] = "04"; month["May"] = "05"; month["Jun"] = "06";
month["Jul"] = "07"; month["Aug"] = "08"; month["Sep"] = "09";
month["Oct"] = "10"; month["Nov"] = "11"; month["Dec"] = "12";
}
function timestamp_to_numeric(s) {
# 03hr31Jan2001 => 200101310300
return substr(s,10,4) month[substr(s,7,3)] substr(s,5,2) substr(s,1,2) "00"
}
NR==1 {next}
{g = timestamp_to_numeric($4); groups[g]; value[$1,$2,$3][g] = $5}
END {
PROCINFO["sorted_in"] = "@ind_str_asc"
printf "x,y,z"; for (g in groups) printf ",%s", g; printf "\n"
for (a in value) {
printf "%s", a
for (g in groups) printf "%s%s", OFS, 0+value[a][g]
printf "\n"
}
}
' ifile.txt
x,y,z,201801010100,201801010200,201801010300
1,1,5,3,3.1,3.2
1,3.4,3,4.1,6.1,1.1
1,4.2,6,6.33,8.33,5.33
3.4,1,2,3.5,5.65,3.66
3.4,3.4,4,6.32,9.32,12.32
3.4,4.2,8.1,7.43,7.93,5.43
4.2,1,3.4,6.12,7.15,9.12
4.2,3.4,5.5,2.2,3.42,3.21
4.2,4.2,6.2,1.3,3.4,1
You have to create a mapping between the month name and the month number, then create a function to transform the timestamp to the new format. Beyond that, the code is the same.
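To sanity-check the conversion by itself, the same substr/month logic can be run on a single sample value (a throwaway sketch, defining only the one month entry the sample needs):
gawk 'BEGIN {
    month["Jan"] = "01"    # just the entry this sample needs
    s = "02hr01Jan2018"
    print substr(s,10,4) month[substr(s,7,3)] substr(s,5,2) substr(s,1,2) "00"
}'
This prints 201801010200, matching the second time column in the expected output.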
Related
awk equivalents for tidyverse concepts (melt and spread)
I have some text logs that I need to parse and format into CSV. I have a working R script, but it is slow once file sizes increase, and this problem seems like a good candidate for a speed-up using awk (or other command-line tools?) as I understand it. I have not done much with awk, and the issue I am having is translating how I think about processing in R to how awk scripting is done.
Example truncated input data (Scrap.log):
; these are comment lines
; *******************************************************************************
; \\C:\Users\Computer\Folder\Folder\Scrap.log
!!G 99999 % % % % % % % % CURRENT XYZ ABC STATE1 STATE2
_START Header1 Header2 Header3 Header4 Header5 Header6 Header7
10 12.23 1.91 6.63 1.68 50.03 0.50 13.97
11 11.32 1.94 6.64 1.94 50.12 0.58 15.10
12 12.96 2.15 6.57 2.12 55.60 0.62 16.24
13 11.43 2.18 6.60 2.36 50.89 0.68 17.39
14 14.91 2.32 6.64 2.59 56.09 0.73 18.41
15 13.16 2.38 6.53 2.85 51.62 0.81 19.30
16 15.02 2.50 6.67 3.05 56.22 0.85 20.12

!!G 99999 % % % % % % % % CURRENT XYZ ABC STATE1 STATE2
_START Header8 Header9 Header10 Header11 Header12 Header13 Header14
10 22.03 24.41 15.01 51.44 44.28 16.57 11.52
11 21.05 24.62 15.62 51.23 45.42 16.47 11.98
12 20.11 24.64 16.38 52.16 46.59 16.54 12.42
13 24.13 24.93 17.23 52.34 47.72 16.51 12.88
14 27.17 24.95 18.06 52.79 48.72 16.45 13.30
15 22.87 25.04 19.27 53.01 49.50 16.47 13.63
16 23.08 25.22 20.12 53.75 50.64 16.55 14.03
Expected output (truncated):
HH1,HH2,HH3,HH4,HH5,HH6,HH7,HH8,HH9,HH10,HH11,HH12,HH13,HH14,START,HeaderName,Value
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header1,12.23
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header2,1.91
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header3,6.63
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header4,1.68
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header5,50.03
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header6,0.5
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header7,13.97
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header1,11.32
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header2,1.94
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header3,6.64
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header4,1.94
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header5,50.12
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header6,0.58
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header7,15.1
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header1,12.96
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header2,2.15
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header3,6.57
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header4,2.12
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header5,55.6
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header6,0.62
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header7,16.24
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,13,Header1,11.43
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,13,Header2,2.18
...
My general steps in the R script:
add a single header row with new names at the top of the file
spread the top row (starting with !!G) to each row
melt the header column (_START) from wide to long format
Pieces I have working in awk so far include:
How to grab and print the header lines:
awk '/_START/ {header = $0; print header}' Scrap.log
How to write a single row with the new header values:
awk 'BEGIN{ ORS=" "; for (counter = 1; counter <= 14; counter++) print "HH",counter;}'
I know each block is separated by a newline and starts with a !!G, so I can write a match on that. Unsure if a split-apply-combine type of thinking works well in awk?
awk '/!!G/,/\n/ {print}' Scrap.log
Alternatively, I tried setting RS/FS parameters like:
awk 'BEGIN{RS="\n";FS=" ";} /^!!G/{header=$0;print header} /[0-9]/{print $2} END{}' Scrap.log
I then get stuck on iterating over the rows and fields to do the melt step, as well as combining the capture groups correctly. How do I combine all these pieces to get to the CSV format?
I think the following should output what you want (tested on repl):
awk '
BEGIN {
    # output the header line
    print "HH1,HH2,HH3,HH4,HH5,HH6,HH7,HH8,HH9,HH10,HH11,HH12,HH13,HH14,START,HeaderName,Value"
}
# ignore comment lines
/;/{next}
/!!G/{
    valcnt = 1
    # save and shuffle the values
    val[valcnt++] = $2
    val[valcnt++] = $11
    val[valcnt++] = $12
    val[valcnt++] = $13
    val[valcnt++] = $14
    val[valcnt++] = $15
    val[valcnt++] = $3
    val[valcnt++] = $4
    val[valcnt++] = $5
    val[valcnt++] = $6
    val[valcnt++] = $7
    val[valcnt++] = $8
    val[valcnt++] = $9
    val[valcnt++] = $10
    next
}
/_START /{
    # these are headers - save them to head, to be reused later
    for (i = 2; i <= NF; ++i) {
        # fun fact: it's indexed on NF
        head[i] = $i
    }
    next
}
# this function is redundant, but it's just easier for me to think about the code
function output(firstval, header, value,    cur, i) {
    cur = valcnt
    val[cur++] = firstval
    val[cur++] = header
    val[cur++] = value
    # output val as csv
    for (i = 1; i < cur; ++i) {
        printf "%s%s", val[i], i != cur - 1 ? "," : "\n"
    }
}
/[0-9]+/{
    for (i = 2; i <= NF; ++i) {
        # add these 3 to all the other values and output them
        # ie. add first column, the header from header and the value
        output($1, head[i], $i)
    }
}
'
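To run it, one option is to save the program (the part between the single quotes) as melt.awk (the filename is my choice here, not part of the answer) and invoke:
awk -f melt.awk Scrap.log > Scrap.csv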
MDF to VTK Converting using AWK
I am a beginner, so sorry if this has been covered before, but I can't seem to find exactly what I need to solve my problem. I am trying to write an AWK "script" that can convert an MDF (Mesh Definition File) as input into a (VALID) VTK file as output. I have a sample MDF file that looks like this:
TITLE "1"
NMESHPOINTS 4
NNODES 4
NELEMENTS_TRIANG1 2
TIMESTEP 0.00001
NINTERNAL_TIMESTEPS 1000
NEXTERNAL_TIMESTEPS 100
DAMPING_FACTOR 0.01

MESHPOINT_COORDINATES
1 0.0 0.0 0.0
2 1.0 0.0 0.0
3 1.0 1.0 0.0
4 0.0 1.0 0.0

NODES_TRIANG1
1 1 2 3
2 1 3 4
And I want to make a valid VTK file from this input. Here is how the output should look:
# vtk DataFile Version 1.0
2D Unstructured Grid
ASCII

DATASET UNSTRUCTURED_GRID
POINTS 4 float
0.0 0.0 0.0
1.0 0.0 0.0
1.0 1.0 0.0
0.0 1.0 0.0

CELLS 2 8
3 0 1 2
3 0 2 3

CELL_TYPES 2
5
5
I tried to make a picture showing how the mappings work; I hope it explains some of them. To make it a bit easier, for this specific example let's say we only want to work with triangles. Sadly I don't have a matching pair of MDF and VTK files, so I tried to manually write one. Is there any way to do this with AWK? Any help will be much appreciated!!
Excellent diagram showing the input -> output mapping! Made it extremely easy to write this:
$ cat tst.awk
$1 ~ /^[[:alpha:]]/ { f[$1] = $2 }
!NF { block = "" }
$1 == "MESHPOINT_COORDINATES" {
    block = $1
    print "# vtk DataFile Version 1.0"
    print "2D Unstructured Grid"
    print "ASCII"
    print ""
    print "DATASET UNSTRUCTURED_GRID"
    printf "POINTS %d float\n", f["NMESHPOINTS"]
    next
}
block == "MESHPOINT_COORDINATES" {
    $1 = ""
    sub(/^[[:space:]]+/,"")
    print
}
$1 == "NODES_TRIANG1" {
    block = $1
    printf "\nCELLS %d %d\n", f["NELEMENTS_TRIANG1"], f["NELEMENTS_TRIANG1"] * 4
    next
}
block == "NODES_TRIANG1" {
    printf "%s", 3
    for (i=2; i<=NF; i++) {
        printf " %s", $i - 1
    }
    print ""
    nlines++
}
END {
    printf "\nCELL_TYPES %d\n", nlines
    for (i=1; i<=nlines; i++) {
        print 5
    }
}

$ awk -f tst.awk file.mdf
# vtk DataFile Version 1.0
2D Unstructured Grid
ASCII

DATASET UNSTRUCTURED_GRID
POINTS 4 float
0.0 0.0 0.0
1.0 0.0 0.0
1.0 1.0 0.0
0.0 1.0 0.0

CELLS 2 8
3 0 1 2
3 0 2 3

CELL_TYPES 2
5
5
Normally we only answer questions where the poster has attempted to solve it themselves first, but you put enough effort into creating the example and describing the mapping that IMHO you deserve help with a solution. So: see the above, try to figure out how it's working yourself (add "prints", check the man page, etc.) and then post a new question if you have any specific questions about it.
awk script to sum numbers in a column over a loop not working for some iterations in the loop
Sample input
12.0000 0.6000000 0.05
13.0000 1.6000000 0.05
14.0000 2.6000000 0.05
15.0000 3.0000000 0.05
15.0000 3.2000000 0.05
15.0000 3.4000000 0.05
15.0000 3.6000000 0.10
15.0000 3.8000000 0.10
15.0000 4.0000000 0.10
15.0000 4.2000000 0.11
15.0000 4.4000000 0.12
15.0000 4.6000000 0.13
15.0000 4.8000000 0.14
15.0000 5.0000000 0.15
15.0000 5.2000000 0.14
15.0000 5.4000000 0.13
15.0000 5.6000000 0.12
15.0000 5.8000000 0.11
15.0000 6.0000000 0.10
15.0000 6.2000000 0.10
15.0000 6.4000000 0.10
15.0000 6.6000000 0.05
15.0000 6.8000000 0.05
15.0000 7.0000000 0.05
Goal
Print line 1 in output as 0 0.
For $2 = 5.000000, $3 = 0.15. Print line 2 in output as 1 0.15.
For $2 = 4.800000 through $2 = 5.200000, sum+=$3 for each line (i.e. 0.14 + 0.15 + 0.14 = 0.43). Print line 3 in output as 2 0.43.
For $2 = 4.600000 through $2 = 5.400000, sum+=$3 for each line (i.e. 0.13 + 0.14 + 0.15 + 0.14 + 0.13 = 0.69). Print line 4 in output as 3 0.69.
Continue this pattern until $2 = 5.000000 +- 1.6 (9 lines total, plus line 1 as 0 0 = 10 total lines in output).
Desired Output
0 0
1 0.15
2 0.43
3 0.69
4 0.93
5 1.15
6 1.35
7 1.55
8 1.75
9 1.85
Attempt
Script 1
#!/bin/bash
for (( i=0; i<=8; i++ )); do
  awk '$2 >= 5.0000000-'$i'*0.2 {sum+=$3}
       $2 == 5.0000000+'$i'*0.2 {print '$i', sum; exit }' test.dat
done > test.out
produces
0 0.15
1 0.43
2 0.69
3 0.93
4 1.15
5 1.35
6 1.55
7 1.75
8 1.85
This is very close. However, the output is missing 0 0 for line 1, and because of this, lines 2 through 10 have $1 and $2 mismatched by 1 line.
Script 2
#!/bin/bash
for (( i=0; i<=8; i++ )); do
  awk ''$i'==0 {sum=0}
       '$i'>0 && $2 > 5.0000000-'$i'*0.2 {sum+=$3}
       $2 == 5.0000000+'$i'*0.2 - ('$i' ? 0.2 : 0) {print '$i', sum; exit }' test.dat
done > test.out
which produces
0 0
1 0.15
2 0.43
4 0.93
5 1.15
6 1.35
7 1.55
$1 and $2 are now correctly matched. However, I am missing the lines with $1=3, $1=8, and $1=9 completely. Adding the ternary operator causes my code to skip these iterations in the loop somehow.
Question
Can anyone explain what's wrong with script 2, or how to achieve the desired output in one line of code? Thank you.
Solution
I used Ed Morton's solution to solve this. Both of them work for different goals. Instead of using the modulus to save array space, I constrained the array to $1 = 15.0000. I did this instead of the modulus in order to include two other "key" variables that I had wanted to also sum over at different parts of the input, into separate output files. Furthermore, as far as I understood it, the script summed only for lines with $2 >= 5.0000000, and then multiplied the summation by 2, in order to include the lines with $2 <= 5.0000000. This works for the sample input here because I made $3 symmetric around 0.15. I modified it to sum them separately, though.
awk 'BEGIN { key=5; range=9 }
$1 == 15.0000 { a[NR] = $3 }
$2 == key { keyIdx = NR }
END {
  print (0, 0) > "test.out"
  sum = a[keyIdx]
  for (delta=1; delta<=range; delta++) {
    print (delta, sum) > "test.out"
    plusIdx = (keyIdx + delta)
    minusIdx = (keyIdx - delta)
    sum += a[plusIdx] + a[minusIdx]
  }
  exit
}' test.dat
Is this what you're trying to do?
$ cat tst.awk
$2 == 5 { keyNr = NR }
{ nr2val[NR] = $3 }
END {
    print 0, 0
    sum = nr2val[keyNr]
    for (delta=1; delta<=9; delta++) {
        print delta, sum
        sum += nr2val[keyNr+delta] + nr2val[keyNr-delta]
    }
}
$ awk -f tst.awk file
0 0
1 0.15
2 0.43
3 0.69
4 0.93
5 1.15
6 1.35
7 1.55
8 1.75
9 1.85
We could optimize it to only store 2*(range=9) values in nr2val[] (using a modulus operator NR%(2*range) for the index) and do the calculation when we hit an NR that's range lines past the line where $2 == key, rather than doing it after we've read the whole of the input, if it's either too slow or your input file is too big to store all in memory, e.g.:
$ cat tst.awk
BEGIN { key=5; range=9 }
{
    idx = NR % (2*range)
    nr2val[idx] = $3
}
$2 == key { keyIdx = idx; endNr = NR+range }
NR == endNr { exit }
END {
    print 0, 0
    sum = nr2val[keyIdx]
    for (delta=1; delta<=range; delta++) {
        print delta, sum
        plusIdx  = (keyIdx + delta) % (2*range)
        minusIdx = (keyIdx - delta + 2*range) % (2*range)  # wrap below-zero offsets around the ring buffer
        sum += nr2val[plusIdx] + nr2val[minusIdx]
    }
    exit
}
$ awk -f tst.awk file
0 0
1 0.15
2 0.43
3 0.69
4 0.93
5 1.15
6 1.35
7 1.55
8 1.75
9 1.85
I like your problem. It is an adequate challenge. My approach is to put as much as possible into the awk script and scan the input file only once, because I/O manipulation is slower than computation (these days). Do all the computations (actually 9) on the relevant input line.
The required inputs are the variable F1 and the text file input.txt. The execution command is:
awk -v F1=95 -f script.awk input.txt
So the logic is:
1. Initialize: compute the 9 range markers and store their values in an array.
2. Store the 3rd input value in an ordered array field3. We use this array to compute the sum.
3. On each line whose 1st field equals 15.0000:
3.1 If we found a begin marker, mark it.
3.2 If we found an end marker, compute the sum, and mark it.
4. Finalize: output all the computed results.
script.awk, including a few debug printouts to assist in debugging:
BEGIN {
    itrtns = 8; # iterations count consistent all over the program.
    for (i = 0; i <= itrtns; i++) { # compute range markers per iteration
        F1start[i] = (F1 - 2 - i)/5 - 14; # print "F1start["i"]="F1start[i];
        F1stop[i] = (F1 - 2 + i)/5 - 14; # print "F1stop["i"]="F1stop[i];
        b[i] = F1start[i] + (i ? 0.2 : 0); # print "b["i"]="b[i];
    }
}
{ field3[NR] = $3; } # store 3rd input field in ordered array.
$1==15.0000 { # for each input line that has 1st input field 15.0000
    currVal = $2 + 0; # convert 2nd input field to numeric value
    for (i = 0; i <= itrtns; i++) { # on each line scan for range markers
        # print "i="i, "currVal="currVal, "b["i"]="b[i], "F1stop["i"]="F1stop[i], isZero(currVal-b[i]), isZero(currVal-F1stop[i]);
        if (isZero(currVal - b[i])) { # if there is a begin marker
            F1idx[i] = NR; # store the marker index position
            # print "F1idx["i"] =", F1idx[i];
        }
        if (isZero(currVal - F1stop[i])) { # if there is an end marker
            for (s = F1idx[i]; s <= NR; s++) {sum[i] += field3[s];} # calculate its sum
            F2idx[i] = NR; # store its end marker position (for debug report)
            # print "field3["NR"]=", field3[NR];
        }
    }
}
END { # output the computed results
    for (i = 0; i <= itrtns; i++) {print i, sum[i], "rows("F1idx[i]"-"F2idx[i]")"}
}
function isZero(floatArg) { # floating point number precision comparison
    tolerance = 0.00000000001;
    if (floatArg < tolerance && floatArg > -1 * tolerance ) return 1;
    return 0;
}
The provided input.txt is the sample input from the question.
The output for awk -v F1=95 -f script.awk input.txt:
0 0.13 rows(12-12)
1 0.27 rows(12-13)
2 0.54 rows(11-14)
3 0.79 rows(10-15)
4 1.02 rows(9-16)
5 1.24 rows(8-17)
6 1.45 rows(7-18)
7 1.6 rows(6-19)
8 1.75 rows(5-20)
The output for awk -v F1=97 -f script.awk input.txt:
0 0.15 rows(14-14)
1 0.29 rows(14-15)
2 0.56 rows(13-16)
3 0.81 rows(12-17)
4 1.04 rows(11-18)
5 1.25 rows(10-19)
6 1.45 rows(9-20)
7 1.65 rows(8-21)
8 1.8 rows(7-22)
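As an aside, the isZero comparison above also points at what likely went wrong in Script 2: testing computed floating-point values with == (as in $2 == 5.0000000+'$i'*0.2 - 0.2) can miss, because 0.2 has no exact binary representation. A minimal demonstration (not from the question or answers, just the classic example):
awk 'BEGIN { if (0.1 + 0.2 == 0.3) print "equal"; else print "not equal" }'
With IEEE double arithmetic this prints "not equal", which is why comparing against a small tolerance, as isZero does, is the safer test.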
What's wrong with my pdf writer?
I'm writing code to produce pdfs (from postscript of course), and I've tried to follow the spec as best I could. But imagemagick's identify says there's something wrong with my xref table. Can anyone see where/what my problem is?
$ echo quit | gsnd -q pw.ps dancingmen.ps | identify -
**** Warning: An error occurred while reading an XREF table.
**** The file has been damaged. This may have been caused
**** by a problem while converting or transfering the file.
**** Ghostscript will attempt to recover the data.
**** This file had errors that were repaired or ignored.
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
-=>/tmp/magick-16940kBciKvHuOrD3 PBM 612x792 612x792+0+0 16-bit Bilevel Gray 61KB 0.000u 0:00.000
My pdf (made with ghostscript on Linux, single LF eols):
%PDF-1.3
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Kids [ 3 0 R ] /Type /Pages /Count 1 >>
endobj
3 0 obj
<< /Contents [ 4 0 R ] /MediaBox [ 0.0 0.0 612.0 792.0 ] /Type /Page /Parent 2 0 R >>
endobj
4 0 obj
<< /Length 1287 >>
stream
2.0 4.0 m 2.0 3.9 l 2.05516 3.9 2.1 3.94484 2.1 4.0 c 2.1 4.05516 2.05516 4.1 2.0 4.1 c 1.94484 4.1 1.9 4.05516 1.9 4.0 c 1.9 3.94484 1.94484 3.9 2.0 3.9 c f
2.0 3.6 m 2.5 3.1 l S
-2.0 3.6 m -1.5 3.1 l S
2.0 3.1 m 2.4 2.8 l 2.1 2.4 l 2.2 2.35 l S
-2.0 3.1 m -1.7 2.6 l -1.5 2.8 l S
2.0 3.9 m 2.0 3.6 l 2.0 3.1 l S
3.0 4.0 m 3.0 3.9 l 3.05516 3.9 3.1 3.94484 3.1 4.0 c 3.1 4.05516 3.05516 4.1 3.0 4.1 c 2.94484 4.1 2.9 4.05516 2.9 4.0 c 2.9 3.94484 2.94484 3.9 3.0 3.9 c f
3.0 3.6 m 3.5 3.1 l S
-3.0 3.6 m -2.5 4.1 l S
3.0 3.1 m 3.0 2.3 l 3.15 2.3 l S
-3.0 3.1 m -3.0 2.3 l -2.85 2.3 l S
3.0 3.9 m 3.0 3.6 l 3.0 3.1 l S
4.0 4.0 m 4.0 3.9 l 4.05516 3.9 4.1 3.94484 4.1 4.0 c 4.1 4.05516 4.05516 4.1 4.0 4.1 c 3.94484 4.1 3.9 4.05516 3.9 4.0 c 3.9 3.94484 3.94484 3.9 4.0 3.9 c f
4.0 3.6 m 4.5 4.1 l S
-4.0 3.6 m -3.5 4.1 l S
4.0 3.1 m 4.3 2.6 l 4.5 2.8 l S
-4.0 3.1 m -3.7 2.6 l -3.5 2.8 l S
4.0 3.9 m 4.0 3.6 l 4.0 3.1 l S
5.0 4.0 m 5.0 3.9 l 5.05516 3.9 5.1 3.94484 5.1 4.0 c 5.1 4.05516 5.05516 4.1 5.0 4.1 c 4.94484 4.1 4.9 4.05516 4.9 4.0 c 4.9 3.94484 4.94484 3.9 5.0 3.9 c f
5.0 3.6 m 5.5 4.1 l 5.5 4.3 l 5.6 4.3 l 5.6 4.2 l 5.5 4.2 l S
-5.0 3.6 m -4.5 3.1 l S
5.0 3.1 m 5.4 2.8 l 5.1 2.4 l 5.2 2.35 l S
-5.0 3.1 m -4.6 2.8 l -4.9 2.4 l -4.8 2.35 l S
5.0 3.9 m 5.0 3.6 l 5.0 3.1 l S
endstream
endobj
xref
0 4
0000000000 65535 f 
0000000010 00000 n 
0000000063 00000 n 
0000000127 00000 n 
0000000234 00000 n 
trailer
<< /Root 1 0 R /Size 4 >>
startxref
1581
%%EOF
For reference, this is the postscript drawing which is being converted.
Update: I've fixed several of the issues mentioned: missing xref keyword, %%EOF instead of $$EOF. Same error from identify, but chrome's viewer actually shows me an image (really small, in the lower left corner, because I haven't dealt with graphics state yet). link to file, link to newer file with single content stream.
Output from ghostscript:
$ echo pstack quit | gsnd -q data/pw.ps data/dancingmen.ps | gsnd -sDEVICE=ps2write -dPDFDEBUG -dPDFSTOPONERROR -
GPL Ghostscript 9.18 (2015-10-05)
Copyright (C) 2015 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
**** Warning: An error occurred while reading an XREF table.
**** The file has been damaged. This may have been caused
**** by a problem while converting or transfering the file.
**** Ghostscript will attempt to recover the data.
<< /Root 1 0 R /Size 4 >>
%Resolving: [1 0]
<< /Type /Catalog /Pages 2 0 R >> endobj
%Resolving: [2 0]
<< /Kids [ 3 0 R ] /Type /Pages /Count 1 >> endobj
%Resolving: [3 0]
<< /Contents [ 4 0 R ] /MediaBox [ 0.0 0.0 612.0 792.0 ] /Type /Page /Parent 2 0 R >> endobj
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [1 0]
%Resolving: [1 0]
%Resolving: [1 0]
%Resolving: [1 0]
%Resolving: [2 0]
Processing pages 1 through 1.
Page 1
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [3 0]
%Resolving: [3 0]
%Resolving: [3 0]
%Resolving: [3 0]
%Resolving: [3 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [4 0]
<< /Length 1288 >>
stream
%FilePosition: 270
endobj
[stream contents omitted here; identical to the stream data in the file above]
**** This file had errors that were repaired or ignored.
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
%Resolving: [2 0]
%Resolving: [1 0]
Update: Sigh. I suppose it's best if I show the code. This program is intended to hook into certain drawing operators of postscript and capture paths and produce a pdf file of the contents. I'm ignoring the quality of the output, in particular transformation matrices, for now.
/prompt {} def
<<
/.create-pdf-data { % called at start
    install-operator-overrides
}
/.create-pdf-page { % called at showpage
    1 /PageNumber +=
    << /Type /Page
       /Parent pdf-object-names /Pages get create-ref
       /MediaBox [gsave newpath clippath pathbbox grestore]
       /Contents []
    >> current-page-name dup 3 1 roll create-object
    pdf-object-names exch get create-ref add-to-pages-kids
    [ display-list { exch pop create-content-stream } for-each ]
    { ( ) exch strcat strcat } reduce
    add-content-to-page
}
/current-page-name { (Page) PageNumber as-string strcat }
/current-page { pdf-objects pdf-object-names current-page-name get get }
/.output-pdf { % called at quit
    /OutputFileName where { pop OutputFileName }{ (%stdout) } ifelse
    (w) file write-pdf pstack
}
/operator-overrides <<
    %/start .create-pdf-data
    /stroke ({ mark-path /S cvx ] display //super//call })
    /fill ({ mark-path /f cvx ] display //super//call })
    /showpage ({ .create-pdf-page //super//call })
    /quit ({ .output-pdf //super//call })
>>
/install-operator-overrides {
    operator-overrides {
        1 index load dup /super exch def
        type /arraytype eq { /exec load }{ /dummyproc cvx } ifelse
        /call exch def
        cvx exec userdict 3 1 roll put
    } forall
    userdict /dummyproc {} put
}
/PageNumber 0
/+= { dup load 3 2 roll add store }
/write-pdf {
    /f exch def
    (1.3) write-header
    write-body
    write-xref-table
    write-trailer
}
/pdf-output-file-position 0
/write-header { /pdf-output-file-position 0 store (%PDF-) .w .w \n \n }
/write-body { write-objects-and-save-positions }
/write-objects-and-save-positions {
    pdf-objects { 1 index save-position write-object } for-each
}
/write-xref-table {
    (xref) .w \n
    pdf-output-file-position /xref-position exch def
    (0 ) .w pdf-object-positions length 1 sub .n \n
    0 format-10 .w ( 65535 f ) .w \n
    pdf-object-positions { write-xref-table-row } for-each
}
/write-xref-table-row { exch pop format-10 .w ( 00000 n ) .w \n }
/format-10-string 20 string
/format-10 {
    format-10-string cvs
    (0000000000) 0 10 3 index length sub getinterval
    exch strcat
}
/write-trailer {
    (trailer) .w \n
    (<<) .w \n
    ( /Root 1 0 R) .w \n
    ( /Size ) .w pdf-objects length 1 sub .n \n
    (>>) .w \n
    (startxref) .w \n
    xref-position .n \n
    (%%EOF) .w \n
}
/create-content-stream {
    to-string-with-spaces
    %dup length ==only ( ) print ==
}
/write-object {
    exch .n ( 0 obj) .w \n
    dup write-dict
    pdf-streams exch 2 copy known { write-stream }{ pop pop } ifelse
    (endobj) .w \n \n
}
/write-stream { (stream) .w \n get .w \n (endstream) .w \n }
/write-dict {
    (<< ) .w
    { exch write-thing write-thing \n } forall
    (>> ) .w \n
}
/write-thing {
    +is-ref { write-ref }{
    +is-name { write-name }{
    +is-array { write-array }{
    +is-null { pop (null ) .w }{
    .n ( ) .w } ifelse } ifelse } ifelse } ifelse
}
/write-ref { ref .n ( 0 R ) .w }
/write-name { dup xcheck not { (/) .w } if .n ( ) .w }
/write-array { ([ ) .w { write-thing } forall (] ) .w }
/+is-ref { dup is-ref }
/+is-name { dup is-name }
/+is-array { dup is-array }
/+is-null { dup is-null }
/is-string { type /stringtype eq }
/is-array { type /arraytype eq }
/is-name { type /nametype eq }
/is-null { type /nulltype eq }
/is-ref { +is-name { is-ref-format }{ pop false } ifelse }
/is-ref-format { ref-check-string cvs 0 1 getinterval (&) eq }
/ref-check-string 20 string
/ref { 10 string cvs rest cvi }
/create-ref { (&) exch 10 string cvs strcat cvn }
/mark-path { [ { /m } { /l } { /c } { /h } pathforall }
/display { add-to-display-list }
/display-list << 0 null >>
/add-to-display-list { display-list dup 3 1 roll length exch put }
/clear-display-list { /display-list << 0 null >> store }
/pdf-objects << % integer keys
    0 null
    1 << /Type /Catalog /Pages /&2 >>
    2 << /Type /Pages /Kids [] /Count 0 >>
>>
/pdf-object-names << % integer values
    /Catalog 1
    /Pages 2
>>
/pdf-object-positions << % integer keys
    0 null
>>
/pdf-streams << >>
/create-object { % dict name
    exch pdf-objects dup length 3 2 roll put
    pdf-object-names exch pdf-objects length 1 sub put
}
/object { % name -> dict
    pdf-object-names exch get pdf-objects exch get
}
/save-position { pdf-object-positions exch pdf-output-file-position put }
/Pages { pdf-objects pdf-object-names /Pages get get }
/add-content-to-page {
    << /Length 2 index length 1 add >>
    dup 3 2 roll pdf-streams 3 1 roll put
    /current-content create-object
    pdf-object-names /current-content get create-ref
    current-page /Contents 2 copy get
    [ exch {}forall counttomark 4 add -1 roll ] put
}
/add-to-pages-kids { % ref
    Pages /Kids 2 copy get
    [ exch {}forall counttomark 4 add -1 roll ] put
    Pages /Count 2 copy get 1 add put
}
/.w { f exch dup length /pdf-output-file-position += writestring }
/.n { dup is-string not { .n-string cvs } if .w }
/.n-string 100 string
/\n { (\n) .w }
/to-string-with-spaces { {as-string} map {( ) exch strcat strcat} reduce }
/map { 1 index xcheck 3 1 roll [ 3 1 roll forall ] exch { cvx } if }
/reduce { exch dup first exch rest 3 -1 roll forall }
/first { 0 get }
/rest { 1 1 index length 1 sub getinterval }
/as-string { 20 string cvs dup length 13 gt { 0 7 getinterval } if }
/strcat {
    2 copy length exch length add string
    dup 4 2 roll 3 copy pop
    0 exch putinterval
    exch length exch putinterval
}
/for-each { % dict proc  key(int) value *proc*
    1 1 3 index length 1 sub    % d p 1 1 lim
    [ 6 5 roll                  % p 1 1 lim [ d
    1 /index cvx /get cvx       % p 1 1 lim [ d 1 index get
    9 8 roll /exec cvx ] cvx    % 1 1 lim { d 1 index get p exec }
    for
}
>>
{ dup {
    dup type /arraytype ne { def }{ % Dict name proc
        [ 3 index /begin cvx 3 -1 roll {} forall /end cvx ] cvx def
    } ifelse
} forall pop }
pop begin
.create-pdf-data
Sigh, ran out of space in the comments again.... It would help to put the file somewhere, rather than pasting it. PDF files are binary, and length calculations depend on things like CR/LF pairs, meaning that each /Length could potentially be incorrect, and it's not possible to tell from looking at the pasted file. Similarly the xref table offsets could be incorrect. In fact the offset for entry 1 looks incorrect, even assuming LF EOLs, but it's not possible to be certain from the pasted file.
Note the error message is from Ghostscript (which IM uses to deal with PDF files). You would probably get more information if you just fed the PDF file to Ghostscript in the first place. You could also try setting -dPDFDEBUG and -dPDFSTOPONERROR; the combination will print out which object GS is dealing with and what it thinks is the problem (if there's a PostScript error). Other PDF problems usually send some kind of back-channel output.
Notice that the Ghostscript message references the xref table as the problem:
**** Warning: An error occurred while reading an XREF table.
So I suspect your xref table is incorrect (also see below, object 0).
Non-breaking, but not best practice:
xref entry 0, the head of the linked list of free objects, has an offset of 0000000028; it should be 0.
Your file seems to end $$EOF instead of %%EOF.
It's normal practice to place binary in a comment on line 2 in order to force applications to treat the file as binary when transmitting.
Better to elide the Resources dictionary than use a null object; it's smaller. Similarly, better to have a single Contents stream (despite recent Adobe engines producing multiple streams), again because it's smaller.
Obviously this is an early work in progress; I'm sure you will deal with these in time. If you'll post the actual PDF file somewhere I can take a look.
[edit]
So the first problem is that the xref table subsection is incorrect. The subsection starts with 2 numbers, the initial index and the number of entries in the table. The xref table has 5 entries, starting from index 0 and going up to index 4. The subsection says:
0 4
Correcting that to 0 5 leads us to the next problem: the Size entry in the trailer dictionary is 4, and should be 5. But Ghostscript still complains. The final problem is that the startxref offset is incorrect. Currently this is:
startxref
1581
But the actual byte offset of the 'xref' keyword is byte 1576. If I correct all 3 of these problems then Ghostscript opens the file without complaint. It already did render the content of course (very tiny, because there's no CTM operations) but now it doesn't have to fix the file.
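For concreteness, here are the three corrections above applied to the pasted file, with the per-object offsets left untouched (they still have to match the real byte positions in the final file):
xref
0 5
0000000000 65535 f 
0000000010 00000 n 
0000000063 00000 n 
0000000127 00000 n 
0000000234 00000 n 
trailer
<< /Root 1 0 R /Size 5 >>
startxref
1576
%%EOF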
find first 5 maximum values in each line using awk
I am trying to read a text file like the following:
word 1 2 3 4 5 6 7 8 9 10
hello 0.2 0.3 0.5 0.1 0.7 0.8 0.6 0.1 0.9
I would like to print the word, "hello", and the maximum 5 values along with the number of the column where they are, like this, using awk:
hello 10 0.9 7 0.8 6 0.7 8 0.6 3 0.5
I have thought of something like this:
awk '{ for (i=1; i <= 10; i++) a[$i]=$i};END{c=asort(a)?? for(i in a)print i,a[i]??}'
but I would like to print for each line read.
With GNU awk 4.* for sorted_in:
$ cat tst.awk
BEGIN { PROCINFO["sorted_in"] = "@val_num_desc" }
NR>1 {
    split($0,a)
    printf "%s", a[1]
    delete a[1]
    for (i in a) {
        printf " %d %s", i, a[i]
        if (++c == 5) {
            c=0
            break
        }
    }
    print ""
}
$ awk -f tst.awk file
hello 10 0.9 7 0.8 6 0.7 8 0.6 4 0.5
Here is an awk-assisted Unix toolset solution:
$ awk -v RS=" " 'NR==1; NR>1{print NR, $0 | "sort -k2nr"}' file | head -6 | xargs
hello 10 0.9 7 0.8 6 0.7 8 0.6 4 0.5
I think your expected output has some typos.
You stated you would like to print for each line read, so there is no limit on the number of records read:
$ awk '{delete a; for(i=2; i<=NF; i++) {a[$i]=$i; b[$i]=i}; n=asort(a); printf "%s: ",$1; for(i=n; i>n-(n>=5?5:n); i--) printf "%s %s ", b[a[i]], a[i]; printf "\n"}' test.in
word: 11 10 10 9 9 8 8 7 7 6
hello: 10 0.9 7 0.8 6 0.7 8 0.6 4 0.5
Walk-thru version:
{
    delete a                          # delete the array before each record
    for(i=2; i<=NF; i++) {            # from the second field to the last
        a[$i]=$i                      # set field to array index and value
        b[$i]=i                       # remember the field number
    }
    n=asort(a)                        # sort the a array
    printf "%s: ",$1                  # print the record identifier ie. the first field
    for(i=n; i>n-(n>=5?5:n); i--)     # for the 5 (or value count) biggest values
        printf "%s %s", b[a[i]], a[i] # print them out
    printf "\n"                       # enter after each record
}
If a value repeats, it's only printed once.
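If repeated values need to keep their own column numbers, a variation that avoids indexing the array by value is sketched below (my own sketch, not part of the answer above; plain awk, no gawk needed). It re-scans for the largest not-yet-used field up to five times per record:
awk '{
    printf "%s:", $1
    for (n = 1; n <= 5 && n < NF; n++) {    # up to 5 picks, or fewer if the record is short
        max = 0; idx = 0
        for (i = 2; i <= NF; i++)           # find the largest field not picked yet
            if (!(i in used) && (idx == 0 || $i + 0 > max)) { max = $i + 0; idx = i }
        used[idx] = 1
        printf " %d %s", idx, $idx
    }
    print ""
    split("", used)                         # portable array reset for the next record
}' test.in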
Using Perl:
$ cat cloudy.txt
word 1 2 3 4 5 6 7 8 9 10
hello 0.2 0.3 0.5 0.1 0.7 0.8 0.6 0.1 0.9
$ perl -lane '
    %kv=();
    %kv=map{ $_=>$F[$_] } 1..$#F;
    printf("$F[0] ");
    $i=0;
    for $x (reverse sort {$a <=> $b} values %kv) {
        @y=grep $x eq $kv{$_}, (keys %kv);
        printf("%d %.1f ",$y[0]+1,$x) if $i++ < 5
    }
    print ""
' cloudy.txt
word 11 10.0 10 9.0 9 8.0 8 7.0 7 6.0
hello 10 0.9 7 0.8 6 0.7 8 0.6 4 0.5
$