Gnuplot conditional plotting

I need to use gnuplot to plot wind direction values (y) against time (x) in a 2D plot using lines and points. This works fine when successive values are close together, but if two values are separated by, say, 250 degrees, I need a condition that checks the previous y value and does not draw the line joining the two points. This happens when the wind direction is in the 280-to-20-degree sector (e.g. a north wind) and the plots get messy. As the data is time dependent I cannot use polar plots except at a specific point in time; I need to show the change in direction over time.
Basically the problem is:
plot y against x ; when (y2-y1)>= 180 then break/erase line joining successive points
Can anyone give me an example of how to do this?
A sample from the data file is:
2014-06-16 16:00:00 0.000 990.081 0.001 0.001 0.001 0.001 0.002 0.001 11.868 308 002.54 292 004.46 00
2014-06-16 16:10:00 0.000 990.047 0.001 0.001 0.001 0.001 0.002 0.001 11.870 303 001.57 300 002.48 00
2014-06-16 16:20:00 0.000 990.014 0.001 0.001 0.001 0.001 0.002 0.001 11.961 334 001.04 314 002.07 00
2014-06-16 16:30:00 0.000 990.014 0.001 0.001 0.001 0.001 0.002 0.001 11.818 005 001.18 020 002.14 00
2014-06-16 16:40:00 0.000 990.014 0.001 0.001 0.001 0.001 0.002 0.001 11.725 332 001.14 337 002.26 00
and I want to plot column 12 vs time.

You can insert a filtering condition in the using statement and use a value of 1/0 if the condition is not fulfilled. In that case the point is not connected to the others:
set timefmt '%Y-%m-%d %H:%M:%S'
set xdata time
unset key
y1 = y2 = 0
plot 'data.dat' using 1:(y1 = y2, y2 = $12, ($0 == 0 || y2 - y1 < 180) ? $12 : 1/0) with lines,\
'data.dat' using 1:12 with points
With your data sample and gnuplot version 4.6.5 this produces the intended plot.
Unfortunately, with this approach you cannot use linespoints directly but must plot the lines and points separately, and the line segment following a 1/0 point isn't drawn either.
A better approach would be to use awk to insert an empty line when a jump occurs. In a 2D-plot points from different data blocks (separated by a single new line) aren't connected:
set timefmt '%Y-%m-%d %H:%M:%S'
set xdata time
unset key
plot '< awk ''{y1 = y2; y2 = $12; if (NR > 1 && y2 - y1 >= 180) printf("\n"); print}'' data.dat' using 1:12 with linespoints

In order to break the joining lines, two conditional statements are needed, and BOTH must include the newline statement printf("\n"):
plot '< awk ''{y1 = y2; y2 = $12; if (NR > 1 && y2 - y1 >= 180) printf("\n") ; if (NR > 1 && y2 -y1 <= 0) printf("\n"); print}'' /Desktop/plotdata.txt' using 1:12 with linespoints

There is absolutely no need for awk. You can simply "interrupt" the lines by using a variable color for the line. For gnuplot<5.0.0 you can use 0xffffff (white, or whatever your background color is) as linecolor, and the line will hardly be visible. For gnuplot>=5.0.0 you can use any transparent color, e.g. 0xff123456, so the line is really invisible.
Data: SO24425910.dat
2014-06-16 16:00:00 330
2014-06-16 16:10:00 320
2014-06-16 16:20:00 310
2014-06-16 16:30:00 325
2014-06-16 16:40:00 090
2014-06-16 16:50:00 060
2014-06-16 17:00:00 070
2014-06-16 17:10:00 280
2014-06-16 17:20:00 290
2014-06-16 17:30:00 300
Script: (works for gnuplot>=4.4.0, March 2010)
### conditional interruption of line
reset
FILE = "SO24425910.dat"
set key noautotitle
set yrange[0:360]
set ytics 90
set grid x
set grid y
set xdata time
set timefmt "%Y-%m-%d %H:%M:%S"
set format x "%H:%M"
set multiplot layout 2,1
plot y1=NaN FILE u 1:(y0=y1,y1=$3):(abs(y1-y0)>=180?0xffffff:0xff0000) w l lc rgb var
plot y1=NaN FILE u 1:(y0=y1,y1=$3):(abs(y1-y0)>=180?0xffffff:0xff0000) w l lc rgb var, \
'' u 1:3 w p pt 7 lc rgb "red"
unset multiplot
### end of script
Result:

Related

Negative binomial, Poisson-gamma mixture in WinBUGS

Winbugs trap error
model
{
  for (i in 1:5323) {
    Y[i] ~ dpois(mu[i])                   # NB model as a Poisson-gamma mixture
    mu[i] ~ dgamma(b[i], a[i])            # NB model as a Poisson-gamma mixture
    a[i] <- b[i] / Emu[i]
    b[i] <- B * X[i]
    Emu[i] <- beta0 * pow(X[i], beta1)    # model equation
  }
  # Priors
  beta0 ~ dunif(0, 10)   # parameter
  beta1 ~ dunif(0, 10)   # parameter
  B ~ dunif(0, 10)       # over-dispersion parameter
}
X[] Y[]
1.5 0
2.9 0
1.49 0
0.39 0
3.89 0
2.03 0
0.91 0
0.89 0
0.97 0
2.16 0
0.04 0
1.12 1
2.26 0
3.6 1
1.94 0
0.41 1
2 0
0.9 0
0.9 0
0.9 0
0.1 0
0.88 1
0.91 0
6.84 2
3.14 3
End
This is just a sample of the data. The model question comes from Ezra Hauer, section 8.3.2 of The Art of Regression Modeling in Road Safety, and the model produces the error **undefined real result**.
The aim is a fully Bayesian, one-step model that does not use empirical Bayes.
The results should be similar to the MLE, where beta0 is 1.65, beta1 is 0.871, and the overdispersion is 0.531.
X is the only variable and Y is the actual collision count.
X cannot be zero or negative, and Y cannot be lower than zero; if the model is solved as a Poisson-gamma mixture using maximum likelihood, it can be estimated.
How can I make this model work?
Solving an error in WinBUGS?
The data is in Excel; the model worked fine when I selected only the biggest 1000 observations.
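For a rough cross-check of the quoted MLE values outside WinBUGS, the Poisson-gamma mixture above integrates to a negative binomial with mean Emu[i] = beta0 * pow(X[i], beta1) and shape b[i] = B * X[i]. A minimal Python sketch of that marginal likelihood (assuming NumPy/SciPy are available and that X and Y have been read from the data) might look like:
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def neg_log_lik(params, X, Y):
    # Negative log-likelihood of the marginal negative binomial implied by
    # Y ~ Poisson(mu), mu ~ Gamma(shape=B*X, rate=B*X/Emu).
    beta0, beta1, B = params
    Emu = beta0 * X**beta1      # model equation: expected collision count
    r = B * X                   # gamma shape b[i]
    p = r / (r + Emu)           # NB probability after integrating mu out
    ll = (gammaln(Y + r) - gammaln(r) - gammaln(Y + 1)
          + r * np.log(p) + Y * np.log1p(-p))
    return -np.sum(ll)

# X and Y would be loaded from the real data (X > 0, Y >= 0); starting values are arbitrary.
# res = minimize(neg_log_lik, x0=[1.0, 1.0, 1.0], args=(X, Y), method="Nelder-Mead")
# print(res.x)   # expected to land near beta0 ~ 1.65, beta1 ~ 0.871, B ~ 0.531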

Apply custom function to Rolling Dataframe

I have a function, let's call it RBSA(df) (I'm currently treating it as a black box), that takes a dataframe
DATE RETURN STYLE1 STYLE2 STYLE3 STYLE4
2020-09-01 0.01 100 251 300 211
2020-09-02 0.04 106 248 310 210
2020-09-03 0.03 104 251 308 211
2020-09-03 0.02 110 258 306 212
...
and returns a dataframe like this
DATE STYLE1 STYLE2 STYLE3 STYLE4 R2
2020 0.01 85 10 4.99 68
Now I want to apply that function on a rolling basis with a window of 30 to the initial dataframe, so that the result looks something like this.
DATE STYLE1 STYLE2 STYLE3 STYLE4 R2
2020-09 0.01 85 10 4.99 68 #applied date range would be 09-01 to 09-30
2020-09 0.99 80 15 4.01 77 #applied date range would be 09-02 to 10-01
2020-09 3.93 80 10 6.07 89 #applied date range would be 09-03 to 10-02
So far I've tried df.rolling(30).apply(RBSA); however, from what I can tell, rolling.apply passes each window to the function as a numpy.ndarray. Since I'm treating RBSA() as a black box, I would rather not change it to take a numpy.ndarray as its input.
My second idea was to create a loop that append()s each result dataframe to an initially empty dataframe. However, I'm not really sure how to emulate a rolling window using a while loop.
def rolling30(df):
    count = len(df) - 30
    ret = []
    while (count > 0):
        count = count - 1
        df2 = df[count:count + 30]
        df2 = RBSA(df2)
        ret.append(df2)
    return ret
However, unlike when I manually append the dataframes together, this approach seems to create an output that looks like this (notice the comma):
DATE STYLE1 STYLE2 STYLE3 STYLE4 R2
2020-09 0.01 85 10 4.99 68, DATE STYLE1 STYLE2 STYLE3 STYLE4 R2
2020-09 0.99 80 15 4.01 77, DATE STYLE1 STYLE2 STYLE3 STYLE4 R2
2020-09 3.93 80 10 6.07 89, DATE STYLE1 STYLE2 STYLE3 STYLE4 R2
Right now it feels like the while loop gets me closest to a solution, though it doesn't feel as elegant as using rolling.apply.
UPDATE: I just did isinstance(rolling30(df), pd.DataFrame) and it returned False, so I assume the problem is that somewhere the result is being turned into something that's not a dataframe.
So I figured out the solution to my while loop problem. I realized the initial ret was a list, so I changed it to ret = pd.DataFrame() and made the append assign back to ret:
def rolling30(df):
    count = len(df) - 30
    ret = pd.DataFrame()
    while (count > 0):
        count = count - 1
        df2 = df[count:count + 30]
        df2 = RBSA(df2)
        ret = ret.append(df2)
    return ret
I would still like to see what methods other people have for this problem, since the while loop doesn't feel like an elegant solution here.
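One way to keep RBSA as a black box while avoiding both rolling.apply's ndarray conversion and the explicit while loop is to slice the frame positionally and concatenate the results. A minimal sketch (assuming RBSA accepts a 30-row DataFrame and returns a one-row DataFrame, as described in the question):
import pandas as pd

def rolling_rbsa(df, window=30):
    # Feed RBSA (the black-box function from the question) each overlapping
    # window as a real DataFrame, then stack the one-row results.
    pieces = [RBSA(df.iloc[start:start + window])
              for start in range(len(df) - window + 1)]
    return pd.concat(pieces, ignore_index=True)

# result = rolling_rbsa(df, window=30)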

pandas add a '*' for values less than .05

I have a pandas dataframe of p values.
disorder p value(group) p value(cluster) p value(interaction)
0 Specific phobia 0.108 0.022 0.075
1 Social phobia 0.848 0.001 0.690
2 Depression 0.923 0.034 0.016
3 PTSD 0.519 0.039 0.004
4 ODD 0.013 0.053 0.003
5 ADHD 0.876 0.062 0.012
How can I add '*' to those values that are less than .05?
Let us do
df.iloc[:,1:]=df.iloc[:,1:].mask(df.iloc[:,1:].le(0.05),df.astype(str).apply(lambda x : x.str[:5]).add('*'))
Try something like -
df = df.astype(str)
specific_column = 'Column Name you want to check on'
df[specific_column] = df[specific_column].apply(lambda x: x + "*" if float(x) < 0.05 else x)
I have just figured out a way to add * and ** to values less than .05 and .01, respectively.
report2 = report.copy()
report[report2.iloc[:,1:].le(0.05)] = report[report2.iloc[:,1:].le(0.05)].astype(str).apply(lambda x : x.str[:5]).add('*')
report[report2.iloc[:,1:].le(0.01)] = report[report2.iloc[:,1:].le(0.01)].astype(str).apply(lambda x : x.str[:5]).add('**')
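As a variant, starting from the original numeric report frame, both thresholds can be handled in one pass with applymap on the p-value columns (the helper below and the three-decimal formatting are illustrative choices, not part of the original answer):
def add_stars(p):
    # Append ** for p <= 0.01, * for p <= 0.05, otherwise just format the value.
    if p <= 0.01:
        return f"{p:.3f}**"
    if p <= 0.05:
        return f"{p:.3f}*"
    return f"{p:.3f}"

report.iloc[:, 1:] = report.iloc[:, 1:].applymap(add_stars)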

Select every nth row as a Pandas DataFrame without reading the entire file

I am reading a large file that contains ~9.5 million rows x 16 cols.
I am interested in retrieving a representative sample, and since the data is organized by time, I want to do this by selecting every 500th element.
I am able to load the data, and then select every 500th row.
My question: Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?
Question 2: How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
Here is a snippet of what the data looks like (first five rows). The first 4 rows are out of order, but the remaining dataset looks ordered (by time):
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 1 2017-01-09 11:13:28 2017-01-09 11:25:45 1 3.30 1 N 263 161 1 12.5 0.0 0.5 2.00 0.00 0.3 15.30
1 1 2017-01-09 11:32:27 2017-01-09 11:36:01 1 0.90 1 N 186 234 1 5.0 0.0 0.5 1.45 0.00 0.3 7.25
2 1 2017-01-09 11:38:20 2017-01-09 11:42:05 1 1.10 1 N 164 161 1 5.5 0.0 0.5 1.00 0.00 0.3 7.30
3 1 2017-01-09 11:52:13 2017-01-09 11:57:36 1 1.10 1 N 236 75 1 6.0 0.0 0.5 1.70 0.00 0.3 8.50
4 2 2017-01-01 00:00:00 2017-01-01 00:00:00 1 0.02 2 N 249 234 2 52.0 0.0 0.5 0.00 0.00 0.3 52.80
Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?
Something you could do is to use the skiprows parameter in read_csv, which accepts a list-like argument with the row numbers to discard (and thus, by exclusion, the rows to keep). So you could create an np.arange with a length equal to the number of rows in the file and remove every 500th element from it with np.delete, so that only every 500th row is read:
import numpy as np
import pandas as pd

n_rows = int(9.5e6)
skip = np.arange(n_rows)
skip = np.delete(skip, np.arange(0, n_rows, 500))
df = pd.read_csv('my_file.csv', skiprows=skip)
Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?
First get the length of the file with a custom function, remove every 500th row index with numpy.setdiff1d, and pass the result to the skiprows parameter of read_csv:
import numpy as np
import pandas as pd

#https://stackoverflow.com/q/845058
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

len_of_file = file_len('test.csv')
print (len_of_file)
skipped = np.setdiff1d(np.arange(len_of_file), np.arange(0, len_of_file, 500))
print (skipped)
df = pd.read_csv('test.csv', skiprows=skipped)
How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
The idea is to read only the datetime column via the usecols parameter, sort it, select every 500th index value, take the set difference, and pass it again to the skiprows parameter:
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

len_of_file = file_len('test.csv')

df1 = pd.read_csv('test.csv',
                  usecols=['tpep_pickup_datetime'],
                  parse_dates=['tpep_pickup_datetime'])

sorted_idx = (df1['tpep_pickup_datetime'].sort_values()
                 .iloc[np.arange(0, len_of_file, 500)].index)

skipped = np.setdiff1d(np.arange(len_of_file), sorted_idx)
print (skipped)
df = pd.read_csv('test.csv', skiprows=skipped).sort_values(by=['tpep_pickup_datetime'])
Use a lambda with skiprows:
pd.read_csv(path, skiprows=lambda i: i % N)
This skips every row whose index is not divisible by N, so only every Nth row (plus the header at index 0) is read.
source: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
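For this question's file, a usage sketch of the callable form might look like the following (N = 500 is assumed, the filename is the same placeholder used above, and the comparison is spelled out for clarity):
import pandas as pd

N = 500
# skiprows is called with each row index; a truthy return value skips that row,
# so only the header (index 0) and every 500th row are actually parsed.
df = pd.read_csv('my_file.csv', skiprows=lambda i: i % N != 0)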
You can use the csv module, which returns an iterator, and itertools.cycle to select every nth row.
import csv
from itertools import cycle

source_file = 'D:/a.txt'
cycle_size = 500

chooser = (x == 0 for x in cycle(range(cycle_size)))
with open(source_file) as f1:
    rdr = csv.reader(f1)
    data = [row for pick, row in zip(chooser, rdr) if pick]

How to customize headers and column widths of DataFrame display?

As a rule, I like to use long, descriptive column names (e.g. estimated_background_signal rather than just bg) for DataFrame objects. The one downside of this preference is that the DataFrame's display form has several columns that are much wider than their values require. For example:
In [10]: data.head()
barcode estimated_background_signal inhibitor_code inhibitor_concentration
0 R00577279 133 IRB 0.001
1 R00577279 189 SNZ 0.001
2 R00577279 101 CMY 0.001
3 R00577279 112 BRC 0.001
4 R00577279 244 ISB 0.001
It would be nice if the display were narrower. Disregarding the headers, the narrowest display would be:
0 R00577279 113 IRB 0.001
1 R00577279 189 SNZ 0.001
2 R00577279 101 CMY 0.001
3 R00577279 112 BRC 0.001
4 R00577279 244 ISB 0.001
...but eliminating the headers altogether is not an entirely satisfactory solution. A better one would be to make the display wide enough to allow for some headers, possibly taking up several lines:
barcode estim inhib inhib
ated_ itor_ itor_
backg code conce
0 R00577279 113 IRB 0.001
1 R00577279 189 SNZ 0.001
2 R00577279 101 CMY 0.001
3 R00577279 112 BRC 0.001
4 R00577279 244 ISB 0.001
It's probably obvious that no single convention would be suitable for all situations, but, in any case, does pandas offer any way to customize the headers and column widths of a DataFrame's display form?
This is a bit of a hack that uses the multi-index feature of pandas in a non-standard way, although I don't see any significant problems with doing that. Of course, there is some increased complexity from using a multi-index rather than a simple index.
cols = df.columns
lencols = [ int(len(c)/2) for c in cols ]
df.columns = pd.MultiIndex.from_tuples(
tuple( ( c[:ln], c[ln:] ) for c, ln in zip(cols, lencols) ) )
Results:
bar estimated_bac inhibit inhibitor_c
code kground_signal or_code oncentration
0 R00577279 133 IRB 0.001
1 R00577279 189 SNZ 0.001
2 R00577279 101 CMY 0.001
3 R00577279 112 BRC 0.001
4 R00577279 244 ISB 0.001
You could also consider creating a dictionary to convert between long & short names as needed:
Display column name different from dictionary key name in Pandas?
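A minimal sketch of that dictionary idea, using the columns from the question (the short names here are made up for illustration):
# Hypothetical long-to-short mapping, used only when printing.
short_names = {
    'estimated_background_signal': 'bg',
    'inhibitor_code': 'inh_code',
    'inhibitor_concentration': 'inh_conc',
}

# rename() returns a copy, so `data` keeps its long, descriptive names.
print(data.head().rename(columns=short_names))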
There are obviously pd.set_option display settings you can utilize. If you're looking for a pandas-specific answer that doesn't involve changing notebook display settings, consider the below.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2),
                  columns=['Very Long Column Title ' + str(i) for i in range(2)])
df.style.set_table_styles([dict(selector="th", props=[('max-width', '50px')])])
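As for the pd.set_option route mentioned above, these are the display options that usually matter here. This is only a sketch: max_colwidth truncates long cell values, it does not wrap the headers themselves.
import pandas as pd

pd.set_option('display.width', 120)         # total width of the printed repr
pd.set_option('display.max_columns', None)  # show all columns instead of eliding
pd.set_option('display.max_colwidth', 20)   # truncate long cell values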