'grep' or 'awk' for extracting numeric data from a file

I have a CFD output file containing alphanumeric data. My goal is to extract certain rows of numeric data so I can plot them. I was able to extract lines that start with a numeric value using grep. However, some of the extracted rows start with a number but also contain letters, which I do not want. Here is a sample:
3185 interface metric data, zone 1444, binary.
33268 interface metric data, zone 1440, binary.
3d, double precision, pressure-based, SST k-omega solver.
1 1.0000e+00 1.0163e-01 4.9782e-06 1.2250e-05 6.5126e-06 3.8876e+01 4.1845e+03 7.8685e+02 7.9475e+02 7.8234e+02 3.0537e+00 4.4427e+02 106:48:28 4999
2 1.0000e+00 6.5455e-02 1.4961e-04 2.2052e-04 1.3530e-02 6.8334e-01 4.5948e-01 7.9448e+02 8.0249e+02 7.9007e+02 1.3742e+00 5.7040e+02 92:12:06 4998
4587 interface metric data, zone 2541, binary.
6584 interface metric data, zone 1254, binary.
3 1.0000e+00 4.2029e-02 1.5227e-04 2.1588e-04 3.0255e-03 6.4570e-01 1.2661e-01 7.8044e+02 7.9048e+02 7.7804e+02 -2.3999e+05 6.4085e+02 80:35:24 4997
4 9.9121e-01 3.0808e-02 1.1390e-04 1.7132e-04 1.6542e-03 6.0594e-01 3.4626e-02 7.8613e+02 7.9673e+02 7.8422e+02 -1.9033e+05 7.0184e+02 70:56:41 4996
This is the command I used: grep -P '^\s*\d+' file. How can I modify the grep command to give me only the rows that are purely numeric, i.e.
1 1.0000e+00 1.0163e-01 4.9782e-06 1.2250e-05 6.5126e-06 3.8876e+01 4.1845e+03 7.8685e+02 7.9475e+02 7.8234e+02 3.0537e+00 4.4427e+02 106:48:28 4999
2 1.0000e+00 6.5455e-02 1.4961e-04 2.2052e-04 1.3530e-02 6.8334e-01 4.5948e-01 7.9448e+02 8.0249e+02 7.9007e+02 1.3742e+00 5.7040e+02 92:12:06 4998
3 1.0000e+00 4.2029e-02 1.5227e-04 2.1588e-04 3.0255e-03 6.4570e-01 1.2661e-01 7.8044e+02 7.9048e+02 7.7804e+02 -2.3999e+05 6.4085e+02 80:35:24 4997
4 9.9121e-01 3.0808e-02 1.1390e-04 1.7132e-04 1.6542e-03 6.0594e-01 3.4626e-02 7.8613e+02 7.9673e+02 7.8422e+02 -1.9033e+05 7.0184e+02 70:56:41 4996

How can I modify the grep command to give me only these last 4 rows?
Pipe the grep output to tail.
grep -P '^\s*\d+' file | tail -n 4

Given that the text in the question is the only thing we have to go on, I see a few patterns we might use to extract the last four lines.
The following matches lines whose first field is a number and contain no commas:
egrep '^[[:space:]]*[0-9][^,]+$'
This one matches lines containing numbers in scientific notation:
grep '[0-9]e[+-][0-9]'
And this one matches lines containing what looks like a time followed by an integer at the end of the line:
egrep '[0-9]+(:[0-9]{2}){2} [0-9]+$'
Or if you want an explicit match for the whole line -- that is, an integer, a series of numbers in scientific notation, a time, and then an integer -- you can bundle it all together:
egrep '^[[:space:]]*[0-9]+([[:space:]]+-?[0-9]+\.[0-9]+e[+-][0-9]+)+[[:space:]]+[0-9]+(:[0-9]{2}){2} [0-9]+$'
Note that I'm using explicit class names and ERE rather than shortcuts and PCRE to maintain compatibility with non-Linux environments.
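For reference, the PCRE shortcuts from the question translate to portable classes like this (both commands match essentially the same lines, but -P is a GNU grep extension):
grep -P '^\s*\d+' file
grep -E '^[[:space:]]*[0-9]+' file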

If your desired section of data can be identified by a certain header before it, e.g., the "3d," line, you can look for the header and only start printing matching lines afterwards, e.g.,
awk '/^\s*3d,/ { in_data=1; next } in_data && /^\s*[0-9]/' file
Here /^\s*3d,/ is the pattern for the header which indicates the beginning of the "data section" (from the next line, i.e., not including the header itself). And /^\s*[0-9]/ is the pattern for lines to print within the data section.
In case there is no such header, you could try to identify the first line of data itself with a more complicated pattern, e.g., the number of fields combined with a regular expression:
awk 'NF == 15 && /^\s*[0-9]+\s/ { in_data=1 } in_data && /^\s*[0-9]/' file
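Alternatively, a sketch that keys on the shape of the data rows themselves, assuming each row has exactly 15 fields and ends with a time stamp followed by an integer (as in the sample above):
awk 'NF == 15 && $(NF-1) ~ /^[0-9]+(:[0-9]{2}){2}$/ && $NF ~ /^[0-9]+$/' file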

Related

How to import a CSV file, which is not really a CSV, into a SQL database using PowerShell

I have a txt file that looks something like this:
Number Name #about 4 spaces between
89273428 John #about 7 spaces between
59273423 Hannah
95693424 David
I'm trying to upload it into my SQL Server database using PowerShell, but I'm not sure how to do it, so any suggestion or help would be really appreciated.
I tried to convert it to a CSV file, but all the content gets merged into one single column, so I can't do it like this.
$CSVImport = Import-CSV $Global:TxtConvertCSV
ForEach ($CSVLine in $CSVImport) {
    $CSVNumber = $CSVLine.Number.ToUpper()
    $CSVName = $CSVLine.Name.ToUpper()
    $Date = $CurDate
    $query = "INSERT INTO Table (Number, Name, Added_Date) VALUES('$CSVNumber', '$CSVName','$Date');"
    Invoke-Sqlcmd -Query $query
}
In order to successfully use the Import-Csv cmdlet, your file must have a reliable delimiter. For example, if your file is actually tab-delimited, then you can use:
Import-Csv -Delimiter "`t"
If the file has no delimiter, but uses fixed positions of a known length for each "field" then you can do the following:
Please note this will only work if the file uses a fixed layout.
As an example, assume we have a file which contains a number and a name on each row. On each line, positions 0 through 7 contain the number; even if not all numbers are 8 characters long, those positions are still reserved for them. Positions 15 through 22 contain the name; as with the numbers, even if a name does not fill all of those positions, they are still reserved for the "name" field.
Example file contents:
Number         Name
12345          John
333            Brittany
2222           Jeff
12345678       Johannes
Since there are 7 unused spaces between the end of the number field and the start of the name field, we will ignore those.
You could process this file as follows:
$fileContents = Get-Content Path_to_my_DataFile.txt
# Use -Skip 1 to ignore the first line, since that line just contains the column headings
$recordsFromFile = $fileContents | Select-Object -Skip 1 -Property @{name = 'number'; expression = {$_.substring(0,8).trim()}},
    @{name = 'name'; expression = {$_.substring(15,8).trim()}}
When this completes you will have an array of objects, where each item is a PSCustomObject containing the properties "number" and "name".
You can confirm that the fields look correct by using out-gridview like this:
$recordsFromFile | out-gridview
You can also convert this into a CSV like this:
$recordsFromFile | convertto-csv -notypeinformation
If the file is not actually fixed-width, then substring(start,length) will likely not work. In that case you can potentially leave out the "length" argument, so substring runs from the start position to the end of the line, but that will really only work for the last field of each line. Failing that, you would have to resort to pattern matching to identify where each "field" begins and ends, processing each line individually, as sketched below.
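As a rough sketch of that pattern-matching fallback, assuming each line is a run of digits, some whitespace, and then a name (the regular expression is an illustration, not something tested against your real data):
$recordsFromFile = Get-Content Path_to_my_DataFile.txt | Select-Object -Skip 1 | ForEach-Object {
    if ($_ -match '^\s*(\d+)\s+(\S.*?)\s*$') {
        # $Matches is populated automatically by the -match operator
        [pscustomobject]@{ number = $Matches[1]; name = $Matches[2] }
    }
}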

Split one file into multiple files using a Pig script

I have a pipe-delimited text file, say abc.txt, which has a different number of columns in different records. The number of columns in a record can be 100, 80, 70, or 60. I need to split abc.txt based on the 3rd column's value: if the third column has value "A", the record should go to A.txt; if "B", to B.txt. I need to write a Pig script for this.
abc = LOAD 'abc.txt' using PigStorage('|');
Assuming the 3rd column is present in all the records, SPLIT using positional notation. Positions start from 0, so the third column will be $2.
SPLIT abc into a_records if $2 == 'A', b_records if $2 == 'B';
Then store the results. Note that STORE expects a directory path, not a filename.
STORE a_records into 'A_DIR' using PigStorage('|');
STORE b_records into 'B_DIR' using PigStorage('|');
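Put together, the whole script (the output directory names are placeholders) would be:
abc = LOAD 'abc.txt' using PigStorage('|');
SPLIT abc into a_records if $2 == 'A', b_records if $2 == 'B';
STORE a_records into 'A_DIR' using PigStorage('|');
STORE b_records into 'B_DIR' using PigStorage('|');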

How to split a column containing two records into separate columns

I have millions of observations in different columns, and one of the columns contains records of two factors joined together, for instance 136789. I want to split the first character (1) and the rest (36789) into separate columns for all observations.
The field looks like this:
#136789
I want to see it like this:
#1 36789
You can make use of sub() function.
For example:
kent$ awk 'BEGIN{x="123456";sub(/^./,"& ",x);print x}'
1 23456
In your code, you need to apply sub() to the relevant column ($x).
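For instance, if the combined value sat in the third whitespace-separated column of your file (a hypothetical layout), it would look like this:
awk '{sub(/^./,"& ",$3); print}' file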

AWK: Ignore lines grouped by a unique value conditioned on occurrences of a specific field value

Please help revise the title and the post if needed, thanks.
In short, I would like to first group lines by a unique value in the first field, and accumulate the occurrences of a specific value in another field within each group. If the total for a group doesn't meet a self-defined threshold, that group's lines should be ignored.
Specifically, with input
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
222,1,M,1
222,1,M,0
333,1,P,0
333,1,P,1
444,1,M,1
444,1,M,1
444,0,M,0
555,1,P,1
666,1,P,0
the desired output should be
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
333,1,P,0
333,1,P,1
555,1,P,1
666,1,P,0
meaning that "because the unique values in the first field 222 and 444 don't have at least one (which can be any desired threshold) P in the third field, lines corresponding to 222 and 444 are ignored."
Furthermore, this should be done without editing the original file, and it has to be combined with the solved issue Split CSV to Multiple Files Containing a Set Number of Unique Field Values. By doing this, a few lines will not be included in the resulting split files.
I believe this one-liner does what you want:
$ awk -F, '{a[$1,++c[$1]]=$0}$3=="P"{p[$1]}END{for(i in c)if(i in p)for(j=1;j<=c[i];++j)print a[i,j]}' file
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
333,1,P,0
333,1,P,1
555,1,P,1
666,1,P,0
Array a keeps track of all the lines in the file, grouping them by the first field together with a running count c that we use later. If the third field contains a P, we set a key in the p array.
After processing the entire file, we loop through all the values of the first field. If a key has been set in p for a value, we print that group's lines from a.
You mention a threshold number of entries in your question. If by that you mean that there must be N occurrences of "P" in order for the lines to be printed, you could change {p[$1]} to {++p[$1]}, and change if(i in p) to if(p[i]>=N) in the END block, as sketched below.
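For example, with the threshold expressed as a variable N (here N=1, which reproduces the output above; raise N to demand more occurrences of "P" per group):
$ awk -F, -v N=1 '{a[$1,++c[$1]]=$0}$3=="P"{++p[$1]}END{for(i in c)if(p[i]>=N)for(j=1;j<=c[i];++j)print a[i,j]}' file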

gnuplot store one number from data file into variable

OSX v10.6.8 and Gnuplot v4.4
I have a data file with 8 columns. I would like to take the first value from the 6th column and make it the title. Here's what I have so far:
#m1 m2 q taua taue K avgPeriodRatio time
#1 2 3 4 5 6 7 8
K = #read in data here
graph(n) = sprintf("K=%.2e",n)
set term aqua enhanced font "Times-Roman,18"
plot file using 1:3 title graph(K)
And here is what the first few rows of my data file look like:
1.00e-07 1.00e-07 1.00e+00 1.00e+05 1.00e+04 1.00e+01 1.310 12070.00
1.11e-06 1.00e-07 9.02e-02 1.00e+05 1.00e+04 1.00e+01 1.310 12070.00
2.12e-06 1.00e-07 4.72e-02 1.00e+05 1.00e+04 1.00e+01 1.310 12070.00
3.13e-06 1.00e-07 3.20e-02 1.00e+05 1.00e+04 1.00e+01 1.310 12090.00
I don't know how to correctly read in the data or if this is even the right way to go about this.
EDIT #1
Ok, thanks to mgilson I now have
#m1 m2 q taua taue K avgPeriodRatio time
#1 2 3 4 5 6 7 8
set term aqua enhanced font "Times-Roman,18"
K = "`head -1 datafile | awk '{print $6}'`"
print K+0
graph(n) = sprintf("K=%.2e",n)
plot file using 1:3 title graph(K)
but I get the error: Non-numeric string found where a numeric expression was expected
EDIT #2
file = "testPlot.txt"
K = "`head -1 file | awk '{print $6}'`"
K=K+0 #Cast K to a floating point number #this is line 9
graph(n) = sprintf("K=%.2e",n)
plot file using 1:3 title graph(K)
This gives the error--> head: file: No such file or directory
"testPlot.gnu", line 9: Non-numeric string found where a numeric expression was expected
You have a few options...
FIRST OPTION:
use columnheader
plot file using 1:3 title columnheader(6)
I haven't tested it, but this may prevent the first row from actually being plotted.
SECOND OPTION:
use an external utility to get the title:
TITLE="`head -1 datafile | awk '{print $6}'`"
plot 'datafile' using 1:3 title TITLE
If the variable is numeric and you want to reformat it, note that in gnuplot you can cast strings to a numeric type (integer/float) by adding 0 to them, e.g.:
print "36.5"+0
Then you can format it with sprintf or gprintf as you're already doing.
It's weird that there is no float function. (int will work if you want to cast to an integer).
EDIT
The script below worked for me (when I pasted your example data into a file called "datafile"):
K = "`head -1 datafile | awk '{print $6}'`"
K=K+0 #Cast K to a floating point number
graph(n) = sprintf("K=%.2e",n)
plot "datafile" using 1:3 title graph(K)
EDIT 2 (addresses comments below)
To expand a variable in backticks, you'll need macros:
set macro
file="mydatafile.txt"
#THE ORDER OF QUOTES (' and ") IS CRUCIAL HERE.
cmd='"`head -1 ' . file . ' | awk ''{print $6}''`"'
# . is string concatenation. (this string has 3 pieces)
# to get a single quote inside a single quoted string
# you need to double. e.g. 'a''b' yields the string a'b
data=@cmd
To address your question 2, it is a good idea to familiarize yourself with shell utilities -- sed and awk can both do it. I'll show a combination of head/tail:
cmd='"`head -2 ' . file . ' | tail -1 | awk ''{print $6}''`"'
should work.
EDIT 3
I recently learned that in gnuplot, system is a function as well as a command. To do the above without all the backtick gymnastics:
data=system("head -1 " . file . " | awk '{print $6}'")
Wow, much better.
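Putting the pieces together with the casting trick from above, a minimal sketch of the whole task (assuming the data file is testPlot.txt, as in the question):
file = "testPlot.txt"
K = system("head -1 " . file . " | awk '{print $6}'") + 0   # adding 0 casts the string to a number
graph(n) = sprintf("K=%.2e", n)
plot file using 1:3 title graph(K)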
This is a very old question, but here's a nice way to get access to a single value anywhere in your data file and save it as a gnuplot-accessible variable:
set term unknown #This terminal will not attempt to plot anything
plot 'myfile.dat' index 0 every 1:1:0:0:0:0 u (var=$1):1
The index number allows you to address a particular dataset (datasets are separated by pairs of blank lines), while every allows you to specify particular lines.
The colon-separated numbers after every should be of the form 1:1:<line_number>:<block_number>:<line_number>:<block_number>, where the line number is the line within the block (starting from 0), and the block number is the number of the block (blocks are separated by a single blank line, again starting from 0). The first and second numbers say plot every 1 lines and every one data block, the third and fourth say start from line <line_number> and block <block_number>, and the fifth and sixth say where to stop. This allows you to select a single line anywhere in your data file.
The last part of the plot command assigns the value in a particular column (in this case, column 1) to your variable (var). A plot command needs two values, so I chose column 1 to plot against my variable-assignment statement.
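Applied to the question above, the same trick can grab K from the first row's 6th column (a sketch, again assuming the file is testPlot.txt):
set term unknown                                   # parse the file without drawing anything
plot 'testPlot.txt' index 0 every 1:1:0:0:0:0 u (K=$6):6
print K                                            # K now holds the value from row 1, column 6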
Here is a less 'awk'-ward solution which assigns the value from the first row and 6th column of the file 'Data.txt' to the variable x16.
set table
# Syntax: u 0:($0==RowIndex?(VariableName=$ColumnIndex):$ColumnIndex)
# RowIndex starts with 0, ColumnIndex starts with 1
# 'u' is an abbreviation for the 'using' modifier
plot 'Data.txt' u 0:($0==0?(x16=$6):$6)
unset table
A more general example for storing several values is given below:
# Load data from file to variable
# Gnuplot can only access the data via the "plot" command
set table
# Syntax: u 0:($0==RowIndex?(VariableName=$ColumnIndex):$ColumnIndex)
# RowIndex starts with 0, ColumnIndex starts with 1
# 'u' is an abbreviation for the 'using' modifier
# Example: Assign all values according to: xij = Data33[i,j]; i,j = 1,2,3
plot 'Data33.txt' u 0:($0==0?(x11=$1):$1),\
'' u 0:($0==0?(x12=$2):$2),\
'' u 0:($0==0?(x13=$3):$3),\
'' u 0:($0==1?(x21=$1):$1),\
'' u 0:($0==1?(x22=$2):$2),\
'' u 0:($0==1?(x23=$3):$3),\
'' u 0:($0==2?(x31=$1):$1),\
'' u 0:($0==2?(x32=$2):$2),\
'' u 0:($0==2?(x33=$3):$3)
unset table
print x11, x12, x13 # Data from first row
print x21, x22, x23 # Data from second row
print x31, x32, x33 # Data from third row