Reading array data into different-sized Fortran arrays - file-io

Let's say I have a 5 x 5 array of floating points in a file array.txt:
1.0 1.1 0.0 0.0 0.0
1.2 1.3 1.4 0.0 0.0
0.0 1.5 1.6 1.7 0.0
0.0 0.0 1.8 1.9 1.0
0.0 0.0 0.0 1.1 1.2
I know this is probably a strange thing to do, but I'm just trying to learn the read statements better. I want to create two 3x3 arrays in Fortran, i.e. real, dimension(3,3) :: array1, array2, and try reading the first 9 values by row into array1 and the following 9 values into array2. That is, I would like the arrays to have the form
array1 = 1.0 1.1 0.0
         0.0 0.0 1.2
         1.3 1.4 0.0

array2 = 0.0 0.0 1.5
         1.6 1.7 0.0
         0.0 0.0 1.8
Next I want to try to do the same by columns:
array1 = 1.0 1.2 0.0
         0.0 0.0 1.1
         1.3 1.5 0.0

array2 = 0.0 0.0 1.4
         1.6 1.8 0.0
         0.0 0.0 1.7
My "closest" attempt for row-wise:
program scratch
   implicit none
   real, dimension(3,3) :: array1, array2
   integer :: i
   open(12, file="array.txt")
   ! read in values
   do i = 1, 3
      read(12, '(3F4.1)', advance="no") array1(i,:)
   end do
end program scratch
My questions:
A. How do I advance to the next record when at the end of one?
B. How do I do the same thing reading column-wise?
C. Why is '(3F4.1)' needed, as opposed to '(3F3.1)'?

Reading by line is easy:
READ(12,*) ((array1(i,j),j=1,3),i=1,3), ((array2(i,j),j=1,3),i=1,3)
advance='no' is necessary only if you use two READ statements instead of one (and only on the first READ). But this works only with an explicit format, not with list-directed (*) input.
Reading a file by column is not so obvious, especially because reading a file is usually an expensive task. I suggest you read the file into a larger table and then distribute the values into your two arrays. For instance:
real :: table(5,5)
integer :: i, j, ii, jj, k
..
read(12,*) ((table(i,j), j=1,5), i=1,5)   ! table holds the file's values row by row
k = 0
do i = 1, 3          ! fill array1 row by row, as displayed in the question
   do j = 1, 3
      k = k + 1
      jj = (k-1)/5 + 1   ! position of the k-th value when walking table column by column
      ii = k - (jj-1)*5
      array1(i,j) = table(ii,jj)
   end do
end do
do i = 1, 3
   do j = 1, 3
      k = k + 1
      jj = (k-1)/5 + 1
      ii = k - (jj-1)*5
      array2(i,j) = table(ii,jj)
   end do
end do
(3F4.1) is better than (3F3.1) because each number in the file in fact occupies 4 characters (3 for the number itself and 1 for the space between numbers). But as you can see, I have used *, which avoids having to think about such details.
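On question A: after a non-advancing READ, the next normal (advancing) READ simply consumes the rest of the current record and leaves the file positioned at the start of the next one. A minimal sketch of this (the file name comes from the question; the newunit= specifier assumes an F2008 compiler):

program skip_rest
   implicit none
   real :: row3(3)
   integer :: i, u

   open(newunit=u, file="array.txt", status="old", action="read")
   do i = 1, 3
      read(u, '(3F4.1)', advance="no") row3   ! stay within the current record
      read(u, *)                              ! skip the remainder of the record
      print '(3F6.1)', row3
   end do
   close(u)
end program skip_rest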

Because of the requirement to "assign by columns" I would advise reading the whole works into a 5x5 array:
real tmp(5,5)
read(unit,*)tmp
(note that no format specification is required)
Then do the assignments you need using array operations.
For this small array, the simplest thing to do seems to be:
real tmp(5,5),flat(25),array1(3,3),array2(3,3)
read(unit,*)tmp
flat=reshape(tmp,shape(flat))
array1=reshape(flat(:9),shape(array1))
array2=reshape(flat(10:18),shape(array2))
Then the transposed version is simply:
flat=reshape(transpose(tmp),shape(flat))
array1=reshape(flat(:9),shape(array1))
array2=reshape(flat(10:18),shape(array2))
If it were a really big array I'd think of a way to avoid making an extra copy of the data.
Note you can wrap each of those assignments in transpose if needed, depending on how you really want the data represented, e.g.
array1 = transpose(reshape(flat(:9), shape(array1)))
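For reference, a complete minimal program combining those pieces (again assuming the file name from the question and an F2008 compiler for newunit=); the transposes give the row-wise layout shown in the question:

program read_reshape
   implicit none
   real :: tmp(5,5), flat(25)
   real :: array1(3,3), array2(3,3)
   integer :: i, u

   open(newunit=u, file="array.txt", status="old", action="read")
   read(u,*) tmp   ! fills tmp in array element order, i.e. with the
                   ! file's 25 values in the order they appear
   close(u)

   flat = reshape(tmp, shape(flat))
   ! reshape fills column-major, so transpose to get the row-wise layout
   array1 = transpose(reshape(flat(:9),    shape(array1)))
   array2 = transpose(reshape(flat(10:18), shape(array2)))

   do i = 1, 3
      print '(3F6.1)', array1(i,:)
   end do
end program read_reshape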

Related

How to compare a value against a column value containing CSV in Postgres?

I have a table called device_info that looks like the one below (only a sample is provided):
device_ip     cpu       memory
100.33.1.0    10.0      29.33
110.35.58.2   3.0, 2.0  20.47
220.17.58.3   4.0, 3.0  23.17
30.13.18.8    -1        26.47
70.65.18.10   -1        20.47
10.25.98.11   5.0, 7.0  19.88
12.15.38.10   7.0       22.45
Now I need to compare a number, say 3, against the cpu column values and get the rows where a cpu value is greater than that. Since the cpu column values are stored as CSV, I am not sure how to do the comparison.
I found that Postgres has a function called string_to_array, which converts CSV to an array, and accordingly tried the query below, which didn't work out:
select device_ip, cpu, memory
from device_info
where 3 > any(string_to_array(cpu, ',')::float[]);
What am I doing wrong?
Expected output
device_ip     cpu       memory
100.33.1.0    10.0      29.33
220.17.58.3   4.0, 3.0  23.17
10.25.98.11   5.0, 7.0  19.88
12.15.38.10   7.0       22.45
The statement as-is is saying "3 is greater than my array value". What I think you want is "3 is less than my array value".
Switch > to <.
select device_ip, cpu, memory
from device_info
where 3 < any(string_to_array(cpu, ',')::float[]);
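A self-contained sketch to try this out (the column types are assumptions inferred from the sample data; note that the float cast tolerates the space after each comma):

CREATE TABLE device_info (device_ip text, cpu text, memory numeric);

INSERT INTO device_info VALUES
  ('100.33.1.0',  '10.0',     29.33),
  ('110.35.58.2', '3.0, 2.0', 20.47),
  ('220.17.58.3', '4.0, 3.0', 23.17),
  ('30.13.18.8',  '-1',       26.47),
  ('70.65.18.10', '-1',       20.47),
  ('10.25.98.11', '5.0, 7.0', 19.88),
  ('12.15.38.10', '7.0',      22.45);

Running the corrected query above then returns the four rows from the expected output.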

How to get the NaN value in BigQuery?

Is there a way to get a NaN value directly in BigQuery? For example, I'm trying to get all the 'special' float values BQ supports, but what's the way to generate the NaN?
SELECT 1e500, -1e500, 1e-500, -1e-500, ???
-- Infinity -Infinity 0.0 0.0 NaN
Is the only way to do CAST('NaN' AS float64)? Or is there a way to represent it with scientific notation?
Other options (a few out of many similar ones; you get the idea):
SELECT IEEE_DIVIDE(0, 0), LOG(1, 1e500)
with both expressions returning NaN as output.
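For reference, a small sketch combining the cast from the question with IS_NAN to verify the result:

SELECT
  CAST('NaN' AS FLOAT64)    AS nan_from_cast,
  IEEE_DIVIDE(0, 0)         AS nan_from_divide,
  IS_NAN(IEEE_DIVIDE(0, 0)) AS is_it_nan;  -- true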

Which is the correct tree in XGBClassifier?

I have trained the following XGBClassifier on a pandas DataFrame:
model = XGBClassifier(
    objective='binary:logistic',
    base_score=0.5,
    booster='gbtree',
    colsample_bylevel=1,
    colsample_bynode=1,
    colsample_bytree=1,
    enable_categorical=False,
    gamma=2,
    gpu_id=-1,
    importance_type=None,
    interaction_constraints='',
    learning_rate=0.1,
    max_delta_step=0,
    max_depth=3,
    min_child_weight=7,
    monotone_constraints='(1,1,1,1,1)',
    n_estimators=3,
    n_jobs=1,
    nthread=1,
    num_parallel_tree=1,
    predictor='auto',
    random_state=0,
    reg_alpha=0,
    reg_lambda=1,
    scale_pos_weight=1,
    silent=True,
    subsample=0.8,
    tree_method='exact',
    validate_parameters=1,
    pred_contribs=True,
    verbosity=None)
model.fit(X, Y)
The resulting tree looks like this (the tree plot is not reproduced here). As the plot shows, there are 4 leaves:
Leaf3 -> Log odds = -0.13381
Leaf4 -> Log odds = -0.05526
Leaf5 -> Log odds = -0.04303
Leaf6 -> Log odds = 0.00275
My assumption is that the numbers we see in the picture are log odds.
Then I use the apply method to check the predicted leaf for every tree for each sample:
model.fit(X, Y)
x = pd.DataFrame((model.apply(X)))
x.to_csv('x.csv')
print(x)
The printout looks like this:
0 1 2
0 6.0 6.0 6.0
1 3.0 3.0 6.0
2 3.0 4.0 3.0
3 6.0 6.0 6.0
4 5.0 5.0 4.0
.. ... ... ...
457 4.0 4.0 6.0
458 6.0 6.0 6.0
459 5.0 5.0 4.0
460 6.0 6.0 5.0
461 3.0 4.0 5.0
The total number of trees is 3 (from 0 to 2) because I set n_estimators=3. Is my understanding correct?
Question: which one of these three trees above corresponds to the tree (plot) displayed in the picture above?
I have dumped the tree:
df = model.get_booster().trees_to_dataframe()
print(df)
In the resulting output (not reproduced here) I have highlighted in orange the leaves and gains that correspond to the plot shown above.
Therefore I assume that Tree 0 is the one chosen by the algo to segment the dataset.
Now I merge the dataframe that I used to train the XGBClassifier with the dataframe that contains the apply results, choosing only tree 0. I obtain a dataframe that contains the probabilities predicted with the model.predict_proba(X) method and a column called Leaf that contains leaves 3, 4, 5, 6, as expected. The problem is that the probabilities field contains only TWO values: I was expecting FOUR (one for each leaf).
Why is that? I expected one and only one probability to be assigned to each leaf.
How can I figure out what segmentation the algo has chosen to assign a leaf to each record? Where can I find the actual segmentation? And how can I create a column in the train dataframe that contains the correct leaf?
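No answer is recorded here, but one way to inspect the mapping is to join the apply() output against the dumped trees: in trees_to_dataframe(), leaf rows have Feature == 'Leaf' and carry their leaf value (the log odds) in the Gain column. A sketch, assuming the model and X from the question:

import pandas as pd

# Leaf index of every sample in every tree (one column per booster round)
leaf_idx = pd.DataFrame(model.apply(X),
                        columns=['tree0', 'tree1', 'tree2']).astype(int)

# Map tree 0's leaf (node) indices to their leaf values
dump = model.get_booster().trees_to_dataframe()
tree0_leaves = dump[(dump['Tree'] == 0) & (dump['Feature'] == 'Leaf')]
leaf_value = tree0_leaves.set_index('Node')['Gain']

leaf_idx['leaf_value_tree0'] = leaf_idx['tree0'].map(leaf_value)
print(leaf_idx.head())

Note that predict_proba passes the sum of the leaf values of all three trees (plus the base_score margin) through the sigmoid, so there is no one-to-one mapping between tree-0 leaves and predicted probabilities.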

Finding significant values from a series

I have a series with an index, where the counts can range from 0 to 1000.
I can select all the entries where the value is greater than 3:
s[s > 3].dropna()
-PB-[variable][variable] 8.0
-[variable] 15.0
-[variable][variable] 6.0
A-[variable][variable] 5.0
B 5.0
B-[variable][variable] 5.0
Book 4.0
Bus 8.0
Date 5.0
Dear 1609.0
MR 4.0
Man[variable] 4.0
Number[variable] 5.0
PM[variable] 4.0
Pickup 12.0
Pump[variable] 5.0
RJ 9.0
RJ-[variable]-PB-[variable][variable] 6.0
Time[variable] 6.0
[variable] 103.0
[variable][variable] 15.0
But after looking at the data, I decided to select all the entries where the value is more than 10, because some values are significantly higher than others. So I refined my query to something like this:
s[s > 10].dropna()
-[variable] 15.0
Dear 1609.0
Pickup 12.0
[variable] 103.0
[variable][variable] 15.0
Is there any function in pandas to return the significant entries? I can sort in descending order and select the first 5 or 10, but there is no guarantee that those entries will be much higher than the average; in that case I would prefer to select all the entries.
In other words, I decided on the threshold of 10 in this case after looking at the data. Is there any method to select that value programmatically?
Selecting a threshold value with the quantile method might be a better solution, but still not the exact answer.
You can use the .head function, which selects the top 5 rows by default, together with .sort_values to sort; if you want the top 10, pass 10 to head. Since s is a Series, simply call:
s[s > 10].sort_values(ascending=False).head(10)
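If you want the threshold itself chosen programmatically, pandas has no single built-in notion of "significant"; a common sketch is a dispersion-based cutoff (the 0.95 quantile and the 3-sigma rule below are arbitrary choices, not pandas defaults):

import pandas as pd

def significant(s: pd.Series) -> pd.Series:
    """Keep entries well above the bulk of the data."""
    # Two candidate cutoffs: a high quantile and mean + 3 standard
    # deviations; take whichever is stricter.
    threshold = max(s.quantile(0.95), s.mean() + 3 * s.std())
    return s[s > threshold].sort_values(ascending=False)

# On the sample above this would likely keep only the extreme
# entries such as 'Dear' (1609.0), which dwarf the rest.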

AMPL Non-Linear least Square

Could anyone help me find the error in this AMPL code for a simple least-squares fit based on the function
F(x) = 1/(1 + e^(-x))
param N >= 1;   # N: number of simulations
param M >= 1;   # number of inputs
param simulations {1..N};
param training {1..N, 1..M};
var W {1..10};
minimize obj: sum{i in simulations, j in 1..4}
    (1/(1+exp-(W[9]/(1+exp(-W[j]/(1+exp(-training[i][j])))) + W[10]/(1+exp(-W[2*j]/(1+exp(-training[i][j])))))) - training[i][5])^2;
'###### DATA
param N:=6;
param M:=4;
param training:
1 2 3 4 5 :=
1 0.209 0.555 0.644 0.355 0.0
2 0.707 0.450 0.587 0.305 1.0
3 0.579 0.521 0.745 0.394 1.0
4 0.574 0.883 0.211 0.550 1.0
5 0.797 0.055 0.430 0.937 1.0
6 0.782 0.865 0.114 0.317 1.0 ;
Thank you!
A couple of things:
Is that quote mark before ###### DATA meant to be there?
You have specified that training has dimension N x M, and your data specifies that N=6, M=4, but you then define training as 6 x 5 and your objective function also refers to column 5.
If that doesn't answer your question, you might want to give more information about what error messages you're getting.
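To make the second point concrete, here is one possible corrected model block (a sketch only: it assumes the last column of training holds the target, keeps the posted network structure, and also repairs smaller slips the model will not compile with, namely exp-( for exp(-(, training[i][j] for AMPL's training[i,j], and summing over the param simulations instead of the set 1..N):

param N >= 1;                    # number of simulations
param M >= 1;                    # number of inputs
param training {1..N, 1..M+1};   # last column is the target

var W {1..10};

minimize obj:
    sum {i in 1..N, j in 1..M}
        ( 1/(1 + exp(-(
              W[9]/(1 + exp(-W[j]/(1 + exp(-training[i,j]))))
            + W[10]/(1 + exp(-W[2*j]/(1 + exp(-training[i,j]))))
          )))
          - training[i,M+1] )^2;

With M := 4 from the data block, training is then declared as 6 x 5, so the reference to the 5th column becomes legal.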