Fortran read file into array - transposed dimensions - file-io

I'm trying to read a file into memory in a Fortran program. The file has N rows with two values in each row. This is what I currently do (it compiles and runs, but gives me incorrect output):
program readfromfile
    implicit none
    integer :: N, i, lines_in_file
    real*8, allocatable :: cs(:,:)

    N = lines_in_file('datafile.txt') ! a function I wrote, which works correctly
    allocate(cs(N,2))
    open(15, file='datafile.txt', status='old')
    read(15,*) cs
    do i=1,N
        print *, cs(i,1), cs(i,2)
    enddo
end
What I hoped to get was the data loaded into the variable cs, with rows as the first index and columns as the second. Instead, when the above code runs, it first prints a line with two "left column" values, then a line with two "right column" values, then a line with the next two "left column" values, and so on.
Here's a more visual description of the situation:
In my data file:    Desired output:    Actual output:
A1 B1               A1 B1              A1 A2
A2 B2               A2 B2              B1 B2
A3 B3               A3 B3              A3 A4
A4 B4               A4 B4              B3 B4
I've tried switching the indices when allocating cs, but with the same results (or a segfault, depending on whether I also switch the indices in the print statement). I've also tried reading the values row by row, but because of the irregular format of the data file (comma-delimited, not column-aligned) I couldn't get this working at all.
What is the best way to read the data into memory so that I get the result I want?

I do not see any commas in your data file. It should not make any difference with list-directed input anyway. Just read it the same way you would write it:
do i=1,N
    read (15,*) cs(i,1), cs(i,2)
enddo
Otherwise, if you read the whole array in one statement, it is read in column-major order, i.e., cs(1,1), cs(2,1), ..., cs(N,1), cs(1,2), cs(2,2), ... This is the order in which the array is stored in memory.

How to make pandas code faster, use a dask dataframe, or use vectorization for this type of problem?

import pandas as pd
# lists of labels and the corresponding metric values
label1 = ["a1", "a1", "a1","a1", "a2","a2","a2","a2", "b1","b1","b1","b1", "b2","b2","b2","b2"]
label2 = ["a1", "a2", "b1","b2", "a1","a2","b1","b2", "a1","a2","b1","b2", "a1","a2","b1","b2"]
m1 = [ 0, 3, 2, 7, 3, 0, 5, 8, 2, 5, 0, 9, 7, 8, 9, 0]
# dictionary of lists
data = {'label1': label1, 'label2': label2, 'm1': m1}
df = pd.DataFrame(data)
df
output of this dataframe:
label1 label2 m1
0 a1 a1 0
1 a1 a2 3
2 a1 b1 2
3 a1 b2 7
4 a2 a1 3
5 a2 a2 0
6 a2 b1 5
7 a2 b2 8
8 b1 a1 2
9 b1 a2 5
10 b1 b1 0
11 b1 b2 9
12 b2 a1 7
13 b2 a2 8
14 b2 b1 9
15 b2 b2 0
I want to write a function that takes two strings, samp1 (e.g. a) and samp2 (e.g. b), and a data frame df as input. We first have to preprocess the two input strings so that we get the label strings that actually appear in the data frame. Then we need the indices of particular rows (like (a1,b1) or (a2,b2)) to get their corresponding 'm1' values. Next, we add those m1 values up, store the sums in two variables, and return the minimum of the two. [Looking at the code snippet may make this easier to understand.]
The following is the code for this function:
def min_4line(samp1, samp2, df):
    k = ['1', '2']
    # k and samp help generate the labels with a number appended;
    # for example it will take a, b and create a1, a2, b1, b2, ...
    samp1_1 = samp1 + k[0]
    samp1_2 = samp1 + k[1]
    samp2_1 = samp2 + k[0]
    samp2_2 = samp2 + k[1]
    #print(samp1_1)  # a1
    #print(samp1_2)  # a2
    #print(samp2_1)  # b1
    #print(samp2_2)  # b2
    """
    As we are interested in particular rows to get the comb1 variable, we need
    those rows' indexes.
    For comb1 we want to sum (a1,b1) [located at ind1] and (a2,b2) [located at ind2];
    the same kind of thing for comb2.
    """
    ind1 = df.index[(df['label1'] == samp1_1) & (df['label2'] == samp2_1)].tolist()
    ind2 = df.index[(df['label1'] == samp1_2) & (df['label2'] == samp2_2)].tolist()
    #print(ind1)  # [2]
    #print(ind2)  # [7]
    comb1 = int(df.loc[ind1, 'm1']) + int(df.loc[ind2, 'm1'])
    #print('comb1: ', comb1)  # comb1: 10
    ind3 = df.index[(df['label1'] == samp1_2) & (df['label2'] == samp2_1)].tolist()
    ind4 = df.index[(df['label1'] == samp1_1) & (df['label2'] == samp2_2)].tolist()
    #print(ind3)  # [6]
    #print(ind4)  # [3]
    comb2 = int(df.loc[ind3, 'm1']) + int(df.loc[ind4, 'm1'])
    #print('comb2: ', comb2)  # comb2: 12
    return min(comb1, comb2)  # 10
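For reference, calling the function on the sample frame above reproduces the value from the inline comments:
min_4line('a', 'b', df)   # 10, since comb1 = 2 + 8 = 10 and comb2 = 5 + 7 = 12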
To collect the unique prefix characters like a, b from the dataframe, we need a small list operation:
# this list is needed so that I can see how many unique values there are;
# it could get a,b,c,d.... and make comparisons
# like (a,b), (a,c), (a,d), (b,c), (b,d), (c,d) for the function
list_line = list(df['label1'].unique())
string_test = [a[:-1] for a in list_line]
# string_test excludes the number portion of each label
list_img = sorted(list(set(string_test)))
#print(list_img)       # ['a', 'b']
#print(len(list_img))  # 2
Now we need to create a data frame that goes over list_img and calls the min_4line function for each pair, producing rows like (a,b), (a,c) together with the corresponding output of the function. A nested loop is necessary: if the list were [a,b,c,d], it should go (a,b), (a,c), (a,d), (b,c), (b,d), (c,d), so that we only get unique pairs. The code for this is:
%%time
d = []
for i in range(len(list_img)):
    for j in range(i + 1, len(list_img)):
        a = min_4line(list_img[i], list_img[j], df)
        print(a)
        d.append({'label1': str(list_img[i]), 'label2': str(list_img[j]), 'metric': str(a)})
dataf = pd.DataFrame(d)
dataf.head(5)
output is:
  label1 label2 metric
0      a      b     10
Is there any way to make the code faster? I have broken the problem down into small parts. This operation is needed for 16 million rows. I am interested in using dask for this, but when I asked this type of question previously many people did not understand it, because I was not able to state the problem clearly. I hope this breakdown is easier to follow. You can copy these code cells, run them in a Jupyter notebook to check the output, and suggest a good way to make the program faster.
[updated]
Can anyone suggest how I can get the indices of those particular rows using numpy or some kind of vectorized operation?
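Not a definitive answer, but one possible direction, sketched under the assumption that every label follows the "<prefix><1|2>" pattern from the example: build a (label1, label2) -> m1 lookup once with set_index, so each pair lookup becomes an index lookup instead of a boolean-mask scan over all rows. The name min_4line_fast and the use of itertools.combinations are illustrative, not part of the original code.
import pandas as pd
from itertools import combinations

# build the lookup once: a Series indexed by (label1, label2)
m = df.set_index(['label1', 'label2'])['m1']

def min_4line_fast(samp1, samp2, m):
    a1, a2 = samp1 + '1', samp1 + '2'
    b1, b2 = samp2 + '1', samp2 + '2'
    comb1 = m.loc[(a1, b1)] + m.loc[(a2, b2)]   # e.g. 2 + 8 = 10
    comb2 = m.loc[(a2, b1)] + m.loc[(a1, b2)]   # e.g. 5 + 7 = 12
    return min(comb1, comb2)

prefixes = sorted({s[:-1] for s in df['label1'].unique()})
d = [{'label1': p, 'label2': q, 'metric': min_4line_fast(p, q, m)}
     for p, q in combinations(prefixes, 2)]
dataf = pd.DataFrame(d)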

Oracle SQL - Combine results from two columns

I am seeking to combine the results of two columns and view them in a single column:
select description1, description2 from daclog where description2 is not null;
It returns two records:
1st row:
DESCRIPTION1
Initialization scans sent to RTU 1, 32 bit mask: 0x00000048. Initialization mask bits are as follows: B0 - status dump, B1 - analog dump B2 - accumulator dump, B3 - Group Data Dump, B4 - accumulat
(here begins DESCRIPTION2)
,or freeze, B5 - power fail reset, B6 - time sync.
2nd row:
DESCRIPTION1
Initialization scans sent to RTU 1, 32 bit mask: 0x00000048. Initialization mask bits are as follows: B0 - status dump, B1 - analog dump B2 - accumulator dump, B3 - Group Data Dump, B4 - accumulat
(here begins DESCRIPTION2)
,or freeze, B5 - power fail reset, B6 - time sync.
Then I need the values of description1 and description2 in the same column.
Is it possible?
Thank you!
You can combine two columns into one by using the || concatenation operator.
select description1 || description2 as description from daclog where description2 is not null;
If you would like to use only substrings from each of the descriptions, you can apply string functions and then combine the results: FNC(description1) || FNC(description2), where FNC is a function that returns the desired substring of your columns.

Database transactions theory

No book seems to be able to answer this.
Suppose I have two transactions:
T1: Lock A, Lock B, Unlock A
T2: Lock B, Unlock B, Lock A, Unlock A
Q1. How many ways are there to plan these transactions? (Is it just a simple graph and the result is 3! * 4! ?)
Q2. How many of these ways are serializable?
I would really like to know the thinking process: how do you get to the answer?
The answer to Q1 is 7.
Proof: First of all, we have to merge the sequence 'Lock A', 'Lock B', 'Unlock A' (I denote the items A1, A2, A3) into the sequence 'Lock B', ..., 'Unlock A' (I denote them B1..B4). That means placing 3 items into 5 places (the gaps between and around the B's) with repetition allowed, which is the binomial coefficient C(5-1+3, 3) = C(7,3) = 7!/(3!*4!) = 35.
Next, we have to drop the 'bad' solutions (the ones prevented by the locking conditions). These are the ones where A1 stands between B3 and B4 (3 solutions) and where A2 stands between B1 and B2 (2*4 = 8). We also have to exclude the solutions with B3 between A1 and A3: there are 3*3 = 9 with B3 between A1 and A2, and 6*2 = 12 with B3 between A2 and A3. Thus we have 35 - 3 - 8 - 9 - 12 = 3. But by the inclusion-exclusion principle we must add back the solutions that violate two rules simultaneously. These can only look like B1 A2 B2 B3 B4, with A1 in either of the two positions to the left of A2 and A3 in either of the two positions to the right of B3: 4 in total. So the final answer is 35 - 3 - 8 - 9 - 12 + 4 = 7.

Redis and linked hashes

Hi everyone,
I would like to ask the community for help in finding a way to cache our huge flat table by splitting it into multiple hashes, or otherwise.
A sample of the table, to illustrate the structure:
A1 B1 C1 D1 E1 X1
A1 B1 C1 D1 E1 X2
A7 B5 C2 D1 E2 X3
A8 B1 C1 D1 E2 X4
A1 B6 C3 D2 E2 X5
A1 B1 C1 D2 E1 X6
This is our denormalized data; we have no ability to normalize it.
So currently we must perform a 'group by' to get the required items; for instance, to get all D* we perform 'data.GroupBy(A1).GroupBy(B1).GroupBy(C1)', and it takes a lot of time.
As a temporary workaround, we create composite string keys:
A1 -> 'list of lines begin A1'
A1:B1 -> 'list of lines begin A1:B1'
A1:B1:C1 -> 'list of lines begin A1:B1:C1'
...
as a cache of results of grouping operations.
The question is: how can this be stored efficiently?
The estimated number of lines in the denormalized data is around 10M records, and since there are 6 columns as in my example, that makes about 60M entries in the hash. So I'm looking for an approach to look up values in O(N), if that's possible.
Thanks.
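For what it's worth, here is a minimal sketch of the composite-key idea with redis-py, assuming the rows are available as tuples; the "idx:" key prefix and the choice to store each full line in a Redis set are illustrative only, not part of the original setup.
import redis

r = redis.Redis()

rows = [
    ("A1", "B1", "C1", "D1", "E1", "X1"),
    ("A1", "B1", "C1", "D1", "E1", "X2"),
    ("A7", "B5", "C2", "D1", "E2", "X3"),
]

# index every prefix A, A:B, A:B:C, ... so that a grouping query
# becomes a single SMEMBERS call instead of repeated group-by passes
pipe = r.pipeline()
for row in rows:
    for depth in range(1, len(row)):
        pipe.sadd("idx:" + ":".join(row[:depth]), ":".join(row))
pipe.execute()

# all lines beginning with A1:B1:C1
lines = r.smembers("idx:A1:B1:C1")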

Using : operator to index numpy.ndarray of numpy.void (as output by numpy.genfromtxt)

I generate data using numpy.genfromtxt like this:
from datetime import datetime
import numpy

ConvertToDate = lambda s: datetime.strptime(s, "%d/%m/%Y")
data = numpy.genfromtxt(open("PSECSkew.csv", "rb"),
                        delimiter=',',
                        dtype=[('CalibrationDate', datetime), ('Expiry', datetime),
                               ('B0', float), ('B1', float), ('B2', float),
                               ('ATMAdjustment', float)],
                        converters={0: ConvertToDate, 1: ConvertToDate})
I now want to extract the last 4 columns of each row (in a loop, so let's just consider a single row) into separate variables. So I do this:
B0 = data[0][2]
B1 = data[0][3]
B2 = data[0][4]
ATM = data[0][5]
But I would prefer to do this (like I could with a normal 2D ndarray, for example):
B0, B1, B2, ATM = data[0][2:]
But this gives me an 'invalid index' error. Is there a way to do this nicely, or should I stick with the 4-line approach?
As output of np.genfromtxt, you have a structured array, that is, a 1D array where each row has different fields.
If you want to access some fields, just access them by name:
data["B0"], data["B1"], ...
You can also group them:
data[["B0", "B1"]]
which gives you a 'new' structured array with only the fields you wanted (quotes around 'new' because the data is not copied, it's still the same as your initial array).
Should you want some specific 'rows', just do:
data[["B0","B1"]][0]
which outputs the first row. Slicing and fancy indexing work too.
So, for your example:
B0, B1, B2, ATM = data[["B0","B1","B2","ATMAdjustment"]][0]
If you want to access only those fields row after row, I would suggest storing the sub-array of the fields you want first, then iterating:
filtered_data = data[["B0", "B1", "B2", "ATMAdjustment"]]
for row in filtered_data:
    (B0, B1, B2, ATM) = row
    do_something
or even:
for (B0, B1, B2, ATM) in filtered_data:
    do_something
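As a quick self-contained check of the field-selection approach, using a small made-up structured array rather than the real PSECSkew.csv:
import numpy as np

data = np.array([(1.0, 2.0, 3.0, 0.5), (4.0, 5.0, 6.0, 0.7)],
                dtype=[('B0', float), ('B1', float), ('B2', float),
                       ('ATMAdjustment', float)])

B0, B1, B2, ATM = data[['B0', 'B1', 'B2', 'ATMAdjustment']][0]
print(B0, B1, B2, ATM)   # 1.0 2.0 3.0 0.5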