Using : operator to index numpy.ndarray of numpy.void (as output by numpy.genfromtxt) - numpy

I generate data using numpy.genfromtxt like this:
from datetime import datetime
import numpy

ConvertToDate = lambda s: datetime.strptime(s, "%d/%m/%Y")
data = numpy.genfromtxt(open("PSECSkew.csv", "rb"),
                        delimiter=',',
                        dtype=[('CalibrationDate', datetime), ('Expiry', datetime),
                               ('B0', float), ('B1', float), ('B2', float),
                               ('ATMAdjustment', float)],
                        converters={0: ConvertToDate, 1: ConvertToDate})
I now want to extract the last 4 columns (of each row, but in a loop, so let's just consider a single row) into separate variables. So I do this:
B0 = data[0][2]
B1 = data[0][3]
B2 = data[0][4]
ATM = data[0][5]
But if I can do this (like I could with a normal 2D ndarray for example) I would prefer it:
B0, B1, B2, ATM = data[0][2:]
But this gives me an 'invalid index' error. Is there a way to do this nicely or should I stick with the 4 line approach?

As output of np.genfromtxt, you have a structured array, that is, a 1D array where each row has different fields.
If you want to access some fields, just access them by name:
data["B0"], data["B1"], ...
You can also group them:
data[["B0", "B1]]
which gives you a 'new' structured array with only the fields you wanted (quotes around 'new' because the data is not copied, it's still the same as your initial array).
Should you want some specific 'rows', just do:
data[["B0","B1"]][0]
which outputs the first row. Slicing and fancy indexing work too.
So, for your example:
B0, B1, B2, ATM = data[["B0","B1","B2","ATMAdjustment"]][0]
If you want to access only those fields row after row, I would suggest storing the whole array of the fields you want first, then iterating:
filtered_data = data[["B0", "B1", "B2", "ATMAdjustment"]]
for row in filtered_data:
    (B0, B1, B2, ATM) = row
    do_something
or even:
for (B0, B1, B2, ATM) in filtered_data:
    do_something
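For reference, here is a minimal self-contained sketch of the whole pattern (not from the original answer; the array is built in memory with made-up values and the datetime fields are omitted for brevity):

import numpy as np

# a small structured array with the same numeric fields as the dtype above
data = np.array([(0.5, 1.0, 2.0, 0.1), (0.6, 1.1, 2.1, 0.2)],
                dtype=[('B0', float), ('B1', float), ('B2', float),
                       ('ATMAdjustment', float)])

print(data['B0'])  # the whole 'B0' field across rows: [0.5 0.6]

# several fields at once, then unpack the first row
B0, B1, B2, ATM = data[['B0', 'B1', 'B2', 'ATMAdjustment']][0]

# or unpack row by row
for (B0, B1, B2, ATM) in data[['B0', 'B1', 'B2', 'ATMAdjustment']]:
    print(B0, B1, B2, ATM)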

Related

How to make pandas code faster, or use a dask dataframe, or use vectorization for this type of problem?

import pandas as pd

# lists of label pairs and their metric values
label1 = ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2", "b1", "b1", "b1", "b1", "b2", "b2", "b2", "b2"]
label2 = ["a1", "a2", "b1", "b2", "a1", "a2", "b1", "b2", "a1", "a2", "b1", "b2", "a1", "a2", "b1", "b2"]
m1 = [0, 3, 2, 7, 3, 0, 5, 8, 2, 5, 0, 9, 7, 8, 9, 0]

# dictionary of lists (renamed from `dict` to avoid shadowing the builtin)
data_dict = {'label1': label1, 'label2': label2, 'm1': m1}
df = pd.DataFrame(data_dict)
df
Output of this dataframe:
label1 label2 m1
0 a1 a1 0
1 a1 a2 3
2 a1 b1 2
3 a1 b2 7
4 a2 a1 3
5 a2 a2 0
6 a2 b1 5
7 a2 b2 8
8 b1 a1 2
9 b1 a2 5
10 b1 b1 0
11 b1 b2 9
12 b2 a1 7
13 b2 a2 8
14 b2 b1 9
15 b2 b2 0
I want to write a function that takes two strings, a (samp1) and b (samp2), and a data frame (df) as input. We have to preprocess the two input strings so that we get the label strings that appear in our data frame. Then we need to access the indices of particular rows (like (a1,b1) or (a2,b2)) of the data frame to get their corresponding 'm1' values. Next, we perform some addition operations on those m1 values, store the results in two variables, and return the minimum of the two variables. [Looking at the code snippet may be easier to understand.]
The following is the code for this function:
def min_4line(samp1, samp2, df):
    k = ['1', '2']
    # k and samp are helping to generate the variable names along with a number
    # for example it will take a,b and can create a1,a2,b1,b2.....
    samp1_1 = samp1 + k[0]
    samp1_2 = samp1 + k[1]
    samp2_1 = samp2 + k[0]
    samp2_2 = samp2 + k[1]
    #print(samp1_1)  # a1
    #print(samp1_2)  # a2
    #print(samp2_1)  # b1
    #print(samp2_2)  # b2

    # As we are interested in particular rows to get the comb1 variable,
    # we need those rows' indices.
    # For comb1 we want to sum (a1,b1) [located at ind1] and (a2,b2) [located at ind2];
    # the same kind of thing for comb2.
    ind1 = df.index[(df['label1'] == samp1_1) & (df['label2'] == samp2_1)].tolist()
    ind2 = df.index[(df['label1'] == samp1_2) & (df['label2'] == samp2_2)].tolist()
    #print(ind1)  # [2]
    #print(ind2)  # [7]
    comb1 = int(df.loc[ind1, 'm1']) + int(df.loc[ind2, 'm1'])
    #print('comb1: ', comb1)  # comb1: 10
    ind3 = df.index[(df['label1'] == samp1_2) & (df['label2'] == samp2_1)].tolist()
    ind4 = df.index[(df['label1'] == samp1_1) & (df['label2'] == samp2_2)].tolist()
    #print(ind3)  # [6]
    #print(ind4)  # [3]
    comb2 = int(df.loc[ind3, 'm1']) + int(df.loc[ind4, 'm1'])
    #print('comb2: ', comb2)  # comb2: 12
    return min(comb1, comb2)  # 10
To extract the unique characters like a, b from the dataframe, we need a list operation:
# this list is needed so that I can compare how many unique values there are...
# it could get a,b,c,d.... and make comparisons
# like (a,b), (a,c), (a,d), (b,c), (b,d), (c,d) for the function
list_line = list(df['label1'].unique())
string_test = [a[:-1] for a in list_line]
# string_test excludes the number portion of each label
list_img = sorted(list(set(string_test)))
#print(list_img)  # ['a', 'b']
#print(len(list_img))  # 2
Now we need to create a data frame that goes over 'list_img' and calls the min_4line function to get pairs like (a,b), (a,c) and the corresponding output of the function. A nested loop is necessary here: suppose the list consists of [a,b,c,d]; it will go (a,b), (a,c), (a,d), (b,c), (b,d), (c,d), so that we only get unique pairs. The code for this is:
%%time
d = []
for i in range(len(list_img)):
    for j in range(i+1, len(list_img)):
        a = min_4line(list_img[i], list_img[j], df)
        print(a)
        d.append({'label1': str(list_img[i]), 'label2': str(list_img[j]), 'metric': str(a)})
dataf = pd.DataFrame(d)
dataf.head(5)
output is:
  label1 label2 metric
0      a      b     10
Is there any way to make the code faster? I broke the problem down into small parts; this operation is needed for 16 million rows. I am interested in using dask for this, but when I asked this type of question previously, many people failed to understand it because I was not able to state the problem clearly. I hope this time I have broken it down into an easier format. You can copy those code cells, run them in a Jupyter notebook to check the output, and suggest any good way to make the program faster.
[updated]
Can anyone suggest how I can get the indices of those particular rows using numpy or any kind of vectorized operation?
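One possible direction (a sketch, not from the original thread): pivot the frame once so each (label1, label2) lookup becomes a single label-based access instead of four boolean-mask scans per pair. The helper name min_4line_fast is made up for illustration:

import pandas as pd
from itertools import combinations

# pivot once: rows are label1, columns are label2, cells hold m1
mat = df.pivot(index='label1', columns='label2', values='m1')

def min_4line_fast(samp1, samp2, mat):
    a1, a2 = samp1 + '1', samp1 + '2'
    b1, b2 = samp2 + '1', samp2 + '2'
    comb1 = mat.at[a1, b1] + mat.at[a2, b2]  # (a1,b1) + (a2,b2)
    comb2 = mat.at[a2, b1] + mat.at[a1, b2]  # (a2,b1) + (a1,b2)
    return min(comb1, comb2)

prefixes = sorted({s[:-1] for s in df['label1'].unique()})
d = [{'label1': p, 'label2': q, 'metric': min_4line_fast(p, q, mat)}
     for p, q in combinations(prefixes, 2)]
dataf = pd.DataFrame(d)

For very large inputs the same lookups can be done on mat.to_numpy() with precomputed row/column positions, which is also a shape that partitions naturally for dask.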

Turning one-hot encoded table into 2D table of counts

I think I can solve this problem without too much difficulty but suspect that any solution I come up with will be sub-optimal, so am interested in how the real pandas experts would do it; I'm sure I could learn something from that.
I have a table of data that is one-hot encoded, something like:
Index  A1     A2     A3     B1    B2    C1     C2     C3     C4
0      True   False  True   True  True  False  True   False  False
...
So every entry is a Boolean and my columns consist of several groups of categories (the A's, B's and C's).
What I want to create is new DataFrames where I pick any two categories and get a table of counts of how many people are in the pair of categories corresponding to that row/column. So, if I was looking at categories A and B, I would generate a table:
Index  A1   A2   A3   None  Total
B1     x11  x12  x13  x1N   x1T
B2     x21  x22  x23  x2N   x2T
None   xN1  xN2  xN3  xNN   xNT
Total  xT1  xT2  xT3  xTN   xTT
where x11 is the count of rows in the original table that have both A1 and B1 True, x12 is the count of those rows that have A1 and B2 True, and so on.
I'm also interested in the counts of those entries where all the A values were False and/or all the B values were False, which are accounted for in the None columns.
Finally, I would also like the totals of rows where any of the columns in the corresponding category were True. So x1T would be the number of rows where B1 was True and any of A1, A2 or A3 were True, and so on (note that this is not just the sum of x11, x12 and x13 as the categories are not always mutually exclusive; a row could have both A1 True and A2 True for example). xNN is the number of rows that have all false values for A1, A2, A3, B1, B2, and xTT is the number of rows that have at least one true value for any of A1, A2, A3, B1 and B2, so xNN + xTT would equal the total number of rows in the original table.
Thanks
Graham
This is my approach:
import numpy as np
import pandas as pd

def get_table(data, prefix):
    '''
    get the columns in the respective category
    and assign `none` and `Total` columns
    '''
    return (data.filter(like=prefix)
                .assign(none=lambda x: (1-x).prod(1),
                        Total=lambda x: x.any(1))
            )

# sample data
np.random.seed(1)
df = pd.DataFrame(np.random.choice([True, False], size=(5,9), p=(0.6, 0.4)),
                  columns=[f'{x}{y}' for x in 'ABC' for y in '123'])

# the output
df = df.astype(int)
get_table(df, 'B').T @ get_table(df, 'A')
Output:
A1 A2 A3 none Total
B1 3 2 1 0 3
B2 3 2 1 0 3
B3 2 1 1 0 2
none 0 0 1 0 1
Total 4 3 2 0 5
Here I don't understand why (none, Total) must be zero, since none corresponds to all False in B and Total corresponds to some True in A.
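A note that is not part of the original answer: each entry of that table is a dot product of two indicator columns, i.e. a row-overlap count, so no entry is zero by construction. For instance, the (Total, none) cell is 0 here simply because no row in this random sample has all of A1, A2, A3 False. A sketch of how one entry arises from the frames above:

B = get_table(df, 'B')  # columns B1..B3 plus the assigned `none` and `Total`
A = get_table(df, 'A')
# the (Total, none) cell of `get_table(df, 'B').T @ get_table(df, 'A')`
entry = int((B['Total'] * A['none']).sum())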

Octave: is there a way of searching items in a cell array containing scalar structures with fields, and how to define the conditions?

I have this array in Octave:
dwnSuccess(1,1)
ans =
{
[1,1] =
scalar structure containing the fields:
site = FRED
interval = d
aard = logDir log/
dwnGrootte = log/
time = 737861.64028
}
and I would like to formulate conditions to find cells containing e.g. logDir in the field 'aard'.
I can't find the correct syntax. Does someone know where to find it, or have an example with combinations of conditions? Thanks
Assuming that you need to keep a cell array of scalar structs (instead of a struct array, which makes more sense if each struct has a defined set of fieldnames), you need to iterate the cell array to get that field and then use logical indexing to create a new cell array with the structs of interest. Like so:
aards = cellfun (@getfield, cs, {"aard"}, "UniformOutput", false);
m = strcmp(aards, "logDir"); # this must match the whole string
filter_cs2 = cs(m);
If you are interested in finding whether a string is somewhere in that field, then it's just a bit more complex:
m = ! cellfun ("isempty", strfind (aards, "logDir"));
If I understood your question correctly, then suppose you have the following cell array:
a = cell();
a{1} = struct('a', 1, 'b', 'dwn', 'c', 2);
a{2} = struct('a', 2, 'b', 'notdwn', 'c', 3);
a{3} = struct('a', 3, 'b', 'dwn', 'c', 4);
a{4} = struct('a', 4, 'b', 'dwn', 'c', 5);
I think the easiest thing to do would be to first convert it to a struct array. You can do so easily via 'sequence generator' syntax, i.e.
s = [a{:}]; % collect all cell elements as a sequence, then wrap into an array
If you are in charge of this code, then I would just create a struct array instead of a cell array from the very beginning.
Once you have that, you can again use a 'sequence generator' syntax on the struct array, with an appropriate function that tests for equality. In your case, you could do something like this:
strcmp( {s.b}, 'dwn' )
% ans = 1 0 1 1
s.b accesses the field 'b' in each element of the struct array, returning it as a comma-separated list. Wrapping this in braces causes this sequence to become a cell array. You then pass this resulting cell array of strings into strcmp, to compare each element with the string 'dwn'.
Depending on what you want to do next, you can use that logical array as an index to your struct array to isolate only the structs that contain that value etc.
Obviously this is a quick way of doing it, if you're comfortable with generating sequences in this way. If not, the general idea stands and you're welcome to iterate using traditional for loops etc.

Convert cell text into progressive numbers

I have written this SQL in a PostgreSQL environment:
SELECT
ST_X("Position4326") AS lon,
ST_Y("Position4326") AS lat,
"Values"[4] AS ppe,
"Values"[5] AS speed,
"Date" AS "timestamp",
"SourceId" AS smartphone,
"Track" as session
FROM
"SingleData"
WHERE
"OsmLineId" = 44792088
AND
array_length("Values", 1) > 4
AND
"Values"[5] > 0
ORDER BY smartphone, session;
Now I have imported the result into Matlab and I have six vectors and one cell array (because the text of the UUIDs was converted to a cell array), all of size 5710x1.
Now I would like to convert the text in the cell array into a progressive number, like 1, 2, 3, ... for each different session code.
In Excel it is easy with FIND.VERT(obj, matrix, col), but I do not know how to do it in Matlab.
Now I have a big cell array with a lot of codes like:
ff95465f-0593-43cb-b400-7d32942023e1
I would like to convert this cell array into an array of numbers, where the first occurrence of
ff95465f-0593-43cb-b400-7d32942023e1 -> 1
and so on, putting 2 when a different code appears, and so on.
OK, I have solved it.
I put the unique session codes in a second cell array C.
At this point, with a for loop, I obtain:
%% Converting the UUIDs into integers
C = unique(session);
N = length(session);
session2 = zeros(N, 1);
for i = 1:N
    session2(i) = find(strcmp(C, session(i)));
end
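As an aside (not in the original post): the third output of unique gives this mapping directly and avoids the loop, e.g. [C, ~, session2] = unique(session);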
Thanks to all!

Fortran read file into array - transposed dimensions

I'm trying to read a file into memory in a Fortran program. The file has N rows with two values in each row. This is what I currently do (it compiles and runs, but gives me incorrect output):
program readfromfile
  implicit none
  integer :: N, i, lines_in_file
  real*8, allocatable :: cs(:,:)
  N = lines_in_file('datafile.txt') ! a function I wrote, which works correctly
  allocate(cs(N,2))
  open(15, file='datafile.txt', status='old')
  read(15,*) cs
  do i=1,N
    print *, cs(i,1), cs(i,2)
  enddo
end
What I hoped to get was the data loaded into the variable cs, with lines as the first index and columns as the second, but when the above code runs, it first prints a line with two "left column" values, then a line with two "right column" values, then a line with the next two "left column" values, and so on.
Here's a more visual description of the situation:
In my data file:    Desired output:    Actual output:
A1 B1               A1 B1              A1 A2
A2 B2               A2 B2              B1 B2
A3 B3               A3 B3              A3 A4
A4 B4               A4 B4              B3 B4
I've tried switching the indices when allocating cs, but with the same results (or a segfault, depending on whether I also switch the indices in the print statement). I've also tried reading the values row by row, but because of the irregular format of the data file (comma-delimited, not column-aligned) I couldn't get this working at all.
How do I read the data into memory the best way to achieve the results I want?
I do not see any comma in your data file. It should not make any difference with list-directed input anyway. Just read it the way you write it:
do i=1,N
  read (15,*) cs(i,1), cs(i,2)
enddo
Otherwise, if you read the whole array in one statement, it is read in column-major order, i.e., cs(1,1), cs(2,1), ..., cs(N,1), cs(1,2), cs(2,2), ... This is the order in which the array is stored in memory. So the one-statement read would match your file if the array were allocated the other way round, as cs(2,N), printing cs(1,i) and cs(2,i) instead.