Distance between two lines - awk

I have a set of points for which I need to calculate the distance between lines, especially for the range 70:80. Is this possible with awk, or by any other method?
Sample data:
70.9247 24
73.6148 24
70.9231 25
73.6144 25
70.9216 26
73.6141 26
70.9201 27
73.6138 27
70.9187 28
73.6136 28
A few points:
1) The data are sorted on y, so each value of y has 2 points.
2) I want the distance between the x values for every y, i.e. d = x(n+1) - x(n)
Expected output:
2.6901 24
2.6912 25
...........
2.6949 28
Thanks.

What you are after is something like:
awk 'NR%2{t=$1;next}{print $1-t,$2}'
This works as follows:
If the record/line number NR is odd, store the value of the first field in t and skip to the next record/line.
Otherwise, print the difference $1-t followed by the second field, which is the expected output.
A similar way of writing this is:
awk '{if(NR%2){t=$1}else{print $1-t,$2}}'
but this is less awk-ish!
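For comparison, here is a rough Python equivalent of the same pairing logic (a minimal sketch; the file name points.txt is only illustrative):
# Pair consecutive lines and print the x-difference together with y.
with open('points.txt') as f:
    rows = [line.split() for line in f if line.strip()]
for first, second in zip(rows[::2], rows[1::2]):
    x1, x2, y = float(first[0]), float(second[0]), first[1]
    print(round(x2 - x1, 4), y)   # e.g. 2.6901 24 for the first pair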

Related

Collapse pandas DataFrame based on daily column value

I have a pandas DataFrame with multiple measurements per day (for example hourly measurements, but that is not necessarily the case), but I want to keep only the hour for which a certain column is the daily minimum.
One day in my data frame looks somewhat like this:
DATE Value Distance
17 1979-1-2T00:00:00.0 15.5669870447436 34.87
18 1979-1-2T01:00:00.0 81.6306803714536 31.342
19 1979-1-2T02:00:00.0 83.1854759740486 33.264
20 1979-1-2T03:00:00.0 23.8659679630303 32.34
21 1979-1-2T04:00:00.0 63.2755504429306 31.973
22 1979-1-2T05:00:00.0 91.2129044773733 34.091
23 1979-1-2T06:00:00.0 76.493130052689 36.837
24 1979-1-2T07:00:00.0 63.5443183375785 34.383
25 1979-1-2T08:00:00.0 40.9255407683688 35.275
26 1979-1-2T09:00:00.0 54.5583051827551 32.152
27 1979-1-2T10:00:00.0 26.2690011881422 35.104
28 1979-1-2T11:00:00.0 71.3059740399097 37.28
29 1979-1-2T12:00:00.0 54.0111262724049 38.963
30 1979-1-2T13:00:00.0 91.3518048568241 36.696
31 1979-1-2T14:00:00.0 81.7651763485069 34.832
32 1979-1-2T15:00:00.0 90.5695814525067 35.473
33 1979-1-2T16:00:00.0 88.4550315358515 30.998
34 1979-1-2T17:00:00.0 41.6276969038137 32.353
35 1979-1-2T18:00:00.0 79.3818377264749 30.15
36 1979-1-2T19:00:00.0 79.1672568582629 37.07
37 1979-1-2T20:00:00.0 1.48337999844262 28.525
38 1979-1-2T21:00:00.0 87.9110385474789 38.323
39 1979-1-2T22:00:00.0 38.6646421460678 23.251
40 1979-1-2T23:00:00.0 88.4920153764757 31.236
I would like to keep all rows that have the minimum "distance" per day, so for the one day shown above, one would have only one row left (the one with index value 39). I know how to collapse the data frame so that I only have the Distance column left. I can do that - if I first set the DATE as index - with
df_short = df.groupby(df.index.floor('D'))["Distance"].min()
But I also want the Value column in my final result. How do I keep all columns?
It doesn't seem to work if I do
df_short = df.groupby(df.index.floor('D')).min(["Distance"])
This does keep all the columns in the final result, but it seems like the outcome is wrong, so I'm not sure what this does.
Maybe this is already posted somewhere, but I have trouble finding it.
You can use aggregate:
df_short = df.groupby(df.index.floor('D')).agg({'Distance': 'min', 'Value': 'max'})
If you want the kept Value to come from the same row as the minimum of the Distance column:
df_short = df.loc[df.groupby(df.index.floor('D'))['Distance'].idxmin(), :]
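A self-contained sketch of the idxmin approach, with a tiny made-up frame just to show the mechanics (column names follow the question, the values are invented):
import pandas as pd

df = pd.DataFrame(
    {'Value': [15.57, 81.63, 38.66, 88.49],
     'Distance': [34.87, 31.34, 23.25, 31.24]},
    index=pd.to_datetime(['1979-01-02 00:00', '1979-01-02 01:00',
                          '1979-01-02 22:00', '1979-01-02 23:00']))

# idxmin gives the index label of the smallest Distance per day;
# .loc then pulls back the full rows, so every column is kept.
df_short = df.loc[df.groupby(df.index.floor('D'))['Distance'].idxmin()]
print(df_short)   # one row per day, here the 22:00 measurement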
Make a datetime Index:
df.DATE = pd.to_datetime(df.DATE) # If not already datetime.
df.set_index('DATE', inplace=True)
Resample and find the min Distance's location:
df.loc[df.resample('D')['Distance'].idxmin()]
Output:
Value Distance
DATE
1979-01-02 22:00:00 38.664642 23.251

How to plot a chart so each value is added to the previous value instead of plotting over a zero line

In this code I have plotted pct_day. Since the value does not accumulate the way a stock price would, is it possible to plot the data so that each value is added to the previous value before plotting? That way the line graph would increase over time, instead of oscillating around a zero line as in the image below.
High Low Open Close Volume Adj Close year pct_day
month day
1 2 794.913004 779.509998 788.783002 789.163007 6.372860e+08 789.163007 1997.400000 0.002211
3 833.470005 818.124662 823.937345 828.889339 9.985193e+08 828.889339 1997.866667 0.004160
4 863.153573 849.154299 858.737861 853.571429 1.042729e+09 853.571429 1997.714286 -0.003345
5 900.455715 888.571429 895.716426 894.472137 1.022023e+09 894.472137 1998.357143 -0.001216
6 847.453076 837.161537 840.123847 844.383843 8.889831e+08 844.383843 1998.076923 0.003679
... ... ... ... ... ... ... ... ... ...
12 27 909.735997 900.942000 905.528664 904.734009 7.485793e+08 904.734009 1998.133333 -0.000308
28 946.635010 940.440016 942.995721 944.127147 7.552150e+08 944.127147 1998.071429 0.001251
29 950.723837 941.625390 944.760775 947.200773 6.830400e+08 947.200773 1998.076923 0.002899
30 891.501671 883.954989 887.031665 887.819181 6.010675e+08 887.819181 1997.833333 0.001844
31 918.943857 910.320763 916.251549 913.786154 6.879523e+08 913.786154 1997.923077 -0.002772
363 rows × 8 columns
The chart is plotted in a Jupyter notebook as shown below:
You need the cumulative sum of the column pct_day. First, create a new column where you compute that value by means of numpy's cumsum:
import numpy as np

pct_day_list = df['pct_day'].tolist()
pct_day_cumsum = list(np.cumsum(pct_day_list))
df['pct_day_cumsum'] = pct_day_cumsum
After that you can plot it with df.plot(y='pct_day_cumsum').
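If you prefer to stay within pandas, Series.cumsum does the same thing, so an equivalent sketch (assuming df and the pct_day column from the question) would be:
# Cumulative sum with the pandas built-in, then plot it.
df['pct_day_cumsum'] = df['pct_day'].cumsum()
df.plot(y='pct_day_cumsum')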

Find average of numbers from a specific line

I have a text file with 2 columns of numbers.
10 2
20 3
30 4
40 5
50 6
60 7
70 8
80 9
90 10
100 11
110 12
120 13
130 14
I would like to find the average of the 2nd-column data from the 6th line onward, that is (7+8+9+10+11+12+13+14)/8 = 10.5.
I found this post, Scripts for computing the average of a list of numbers in a data file,
and used the following:
awk '{s+=$2}END{print "ave:",s/NR}' fileName
but that gives the average of the entire second column.
Any hints?
This one-liner should do:
awk -v s=6 'NR<s{next} {c++; t+=$2} END{printf "%.2f (%d samples)\n", t/c, c}' file
This awk script has three pattern/action pairs. The first is responsible for skipping the first s lines. The second executes on every line (from s onwards); it increments a counter and adds column 2 to a running total. The third runs after all data have been processed, and prints your results.
The script below should do the job:
awk 'NR>=6{avg+=$2}END{printf "Average of field 2 starting from 6th line %.1f\n",avg/(NR-5)}' file
Output
Average of field 2 starting from 6th line 10.5
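For comparison, the same calculation as a small Python sketch (the file name file is just illustrative, matching the awk commands above):
# Average the second column, starting from the 6th line.
with open('file') as f:
    values = [float(line.split()[1]) for line in list(f)[5:] if line.strip()]
print(sum(values) / len(values))   # 10.5 for the sample data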

Circle Summation (30 Points) Interviewstreet Puzzle

The following problem is from Interviewstreet. I am not getting any help from their site, so I am asking here. I am not interested in an algorithm or solution; I just did not understand the example they give for the second input. Can anybody please help me understand the second input and output as specified in the problem statement?
Circle Summation (30 Points)
There are N children sitting along a circle, numbered 1,2,...,N clockwise. The ith child has a piece of paper with number ai written on it. They play the following game:
In the first round, the child numbered x adds to his number the sum of the numbers of his neighbors.
In the second round, the child next in clockwise order adds to his number the sum of the numbers of his neighbors, and so on.
The game ends after M rounds have been played.
Input:
The first line contains T, the number of test cases. T cases follow. The first line of a test case contains two space-separated integers N and M. The next line contains N integers, the ith number being ai.
Output:
For each test case, output N lines each having N integers. The jth integer on the ith line contains the number that the jth child ends up with if the game starts with child i playing the first round. Output a blank line after each test case except the last one. Since the numbers can be really huge, output them modulo 1000000007.
Constraints:
1 <= T <= 15
3 <= N <= 50
1 <= M <= 10^9
1 <= ai <= 10^9
Sample Input:
2
5 1
10 20 30 40 50
3 4
1 2 1
Sample Output:
80 20 30 40 50
10 60 30 40 50
10 20 90 40 50
10 20 30 120 50
10 20 30 40 100
23 7 12
11 21 6
7 13 24
Here is an explanation of the second test case. I will use a notation (a, b, c) meaning that child one has number a, child two has number b and child three has number c. In the beginning, the position is always (1,2,1).
If the first child is the first to sum its neighbours, the table goes through the following situations (I will put an asterisk in front of the child that just added its two neighbouring numbers):
(1,2,1)->(*4,2,1)->(4,*7,1)->(4,7,*12)->(*23,7,12)
If the second child is the first to move:
(1,2,1)->(1,*4,1)->(1,4,*6)->(*11,4,6)->(11,*21,6)
And last if the third child is first to move:
(1,2,1)->(1,2,*4)->(*7,2,4)->(7,*13,4)->(7,13,*24)
And as you can see, the output for the second test case consists of exactly the three final triples computed this way.
Hope that helps.
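If it helps to check the walkthrough, here is a small Python sketch that brute-forces the rules for the second sample (this is only a verification aid for tiny M, not a solution that meets the M <= 10^9 constraint):
MOD = 1000000007

def simulate(a, m, start):
    # Child `start` moves first, then play continues clockwise for m rounds.
    a = list(a)
    n = len(a)
    for r in range(m):
        i = (start + r) % n
        a[i] = (a[i] + a[(i - 1) % n] + a[(i + 1) % n]) % MOD
    return a

for first in range(3):
    print(*simulate([1, 2, 1], 4, first))
# Prints the three sample lines: 23 7 12, 11 21 6, 7 13 24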

Formula in gawk

I have a problem that I’m trying to work out in gawk. This should be so simple, but my attempts ended up with a divide by zero error.
What I'm trying to accomplish is as follows:
maxlines = 22 (fixed value)
maxnumber > maxlines (unknown value)
Example:
maxlines=22
maxnumber=60
My output should look like the following:
print lines:
1
2
...
22
print lines:
23
24
...
45
print lines:
46 (remainder of 60 (maxnumber))
47
...
60
It's not clear what you're asking, but I assume you want to loop through input lines and print a new header (page header?) after every 22 lines. Use a simple counter and check for
count % 22 == 1
which tells you it's time to print the next page heading.
Or you could keep two counters, one for the absolute line number and another for the line number within the current page. When the second counter exceeds 22, reset it to zero and print the next page heading.
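A rough Python sketch of that two-counter idea (maxlines and maxnumber taken from the question; this only illustrates the logic, it is not gawk):
maxlines, maxnumber = 22, 60
page_line = 0                      # lines printed on the current page
for i in range(1, maxnumber + 1):  # i is the absolute line number
    if page_line == 0:             # start of a new page
        print("\nprint lines:")
    print(i)
    page_line += 1
    if page_line == maxlines:      # page full, reset for the next heading
        page_line = 0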
Worked out gawk precedence with some help, and this works:
gawk 'BEGIN{
  maxlines = 22
  maxnumber = 60
  for (i = 1; i <= maxnumber; i++) {
    if (!((i - 1) % maxlines)) {
      print "\nprint lines:"
    }
    print i
  }
}'