Read text file content into matrix based on condition - file-io

I have a text file in this format:
Food
Fruits1 [heading]
Apple [value]
Mango [value]
Orange [value]
Veg1 [heading]
Potato [value]
Lettuce [value]
I want to load this into Octave as a matrix of this format:
----------------------------------------------------------------------
Item | Fruits1 | Apple | Mango | Orange | Veg1 | Potato | Lettuce
----------------------------------------------------------------------
| | | | | | |
----------------------------------------------------------------------
Hence, I need a matrix of size = 2x(n+m+1); where n = number of [heading], m = number of [value].
How can I use fgetl to read each line from the text file and store into a matrix satisfying the above condition? Any better ideas?
Thanks!
Edit: Code:-
fid = fopen('food.txt','r');
num = 1;
if (fid < 0)
printf('Error:could not open file\n')
else
while ~feof(fid),
line = fgetl(fid);
arr=[line;];
num=num+1;
end;
fclose(fid)
end;

A matrix is not possible because your strings don't all have the same length. The only array type that can hold them is a cell array, but I would recommend a nested struct instead.
fid = fopen('list.txt');
while 1
tmp = fgetl(fid);
if ~ischar(tmp)
% end of file
break
end
if strcmp(deblank(tmp), 'Food')
# this can't be empty
listObj.('item') = 'Food';
else
tmp = cell2mat(regexp(deblank(tmp),'(\w+)','tokens'));
if strcmp(tmp{1,2}, 'heading')
head = tmp{1,1};
else
listObj.(head).(tmp{1,1}) = str2double(tmp{1,2});
end
end
end
fclose(fid);
A struct like this is much easier to access:
>> stackoverflow
>> listObj
listObj =
scalar structure containing the fields:
item = Food
Fruits1 =
scalar structure containing the fields:
Apple = 3
Mango = 4
Orange = 1
Veg1 =
scalar structure containing the fields:
Potato = 8
Lettuce = 0
>> listObj.Fruits1.Apple
ans = 3
>> listObj.Veg1.Lettuce
ans = 0
>>
But take care with your input file: it has to be strictly formatted. My example file looks like this:
Food
Fruits1 [heading]
Apple 3
Mango 4
Orange 1
Veg1 [heading]
Potato 8
Lettuce 0
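For comparison, the same parsing logic can be sketched in Python. This is only an illustration under the assumptions of the answer above: the first non-empty line is the item name, heading lines end in `[heading]`, and every other line is a `name value` pair. The function name `parse_food_list` is hypothetical and not part of the original answer.

```python
import io

def parse_food_list(lines):
    """Parse 'Food' / '<name> [heading]' / '<name> <number>' lines into a nested dict."""
    result = {}
    head = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if "item" not in result:
            result["item"] = line  # first non-empty line is the item name
            continue
        # Every remaining line must have exactly two parts (strict format)
        name, value = line.split(None, 1)
        if value == "[heading]":
            head = name
            result[head] = {}
        else:
            if head is None:
                raise ValueError("value line found before any heading")
            result[head][name] = float(value)
    return result

# An in-memory stand-in for the example file shown above
sample = io.StringIO(
    "Food\n"
    "Fruits1 [heading]\n"
    "Apple 3\n"
    "Mango 4\n"
    "Orange 1\n"
    "Veg1 [heading]\n"
    "Potato 8\n"
    "Lettuce 0\n"
)
food = parse_food_list(sample)
print(food["Fruits1"]["Apple"])
```

As in the Octave version, a line that does not match the expected format raises an error rather than being silently skipped.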

In Dask, how would I remove data that is not repeated across all values of another column?

I'm trying to find a set of data that exists across multiple instances of a column's value.
As an example, let's say I have a DataFrame with the following values:
+-------------+------------+----------+
| hardware_id | model_name | data_v |
+-------------+------------+----------+
| a | 1 | 0.595150 |
+-------------+------------+----------+
| b | 1 | 0.285757 |
+-------------+------------+----------+
| c | 1 | 0.278061 |
+-------------+------------+----------+
| d | 1 | 0.578061 |
+-------------+------------+----------+
| a | 2 | 0.246565 |
+-------------+------------+----------+
| b | 2 | 0.942299 |
+-------------+------------+----------+
| c | 2 | 0.658126 |
+-------------+------------+----------+
| a | 3 | 0.160283 |
+-------------+------------+----------+
| b | 3 | 0.180021 |
+-------------+------------+----------+
| c | 3 | 0.093628 |
+-------------+------------+----------+
| d | 3 | 0.033813 |
+-------------+------------+----------+
What I'm trying to get would be a DataFrame with all elements except the rows that contain a hardware_id of d, since they do not occur at least once per model_name.
I'm using Dask as my original data size is on the order of 7 GB, but if I need to drop down to Pandas that is also feasible. I'm very happy to hear any suggestions.
I have tried splitting the dataframe into individual dataframes based on the model_name attribute, then running a loop:
import numpy as np
import pandas as pd
import dask.dataframe as dd
models = ['1','1','1','2','2','2','3','3','3','3']
frame_1 = dd.from_pandas(pd.DataFrame( {'hardware_id':['a','b','c','a','b','c','a','b','c','d'], 'model_name':models,'data_v':np.random.rand(len(models))} ), npartitions=1)
model_splits = []
for i in range(1,4):
    model_splits.append(frame_1[frame_1['model_name'].eq(str(i))])
aggregate_list = []
while len(model_splits) > 0:
    data = model_splits.pop()
    for other_models in model_splits:
        data = data[data.hardware_id.isin(other_models.hardware_id.to_bag())]
    aggregate_list.append(data)
final_data = dd.concat(aggregate_list)
However, this is beyond inefficient, and I'm not entirely sure that my logic is sound.
Any suggestions on how to achieve this?
Thanks!
One way to accomplish this is to treat it as a groupby-aggregation problem.
Pandas
First, we set up the data:
import pandas as pd
import numpy as np
np.random.seed(12)
models = ['1','1','1','2','2','2','3','3','3','3']
df = pd.DataFrame(
{'hardware_id':['a','b','c','a','b','c','a','b','c','d'],
'model_name': models,
'data_v': np.random.rand(len(models))
}
)
Then, collect the unique values of your model_name column.
unique_model_names = df.model_name.unique()
unique_model_names
array(['1', '2', '3'], dtype=object)
Next, we'll do several related steps at once. Our goal is to figure out which hardware_ids co-occur with the entire unique set of model_names. First, we do a groupby aggregation to get the unique model_names per hardware_id. This returns an array per group, which we convert to a tuple so that it is hashable and works with isin in the next step. At this point, every hardware ID is associated with a tuple of its unique models. Next, we check whether that tuple exactly matches our unique model names, using isin. If it doesn't, the condition is False, which is exactly what we get for hardware_id d.
agged = df.groupby("hardware_id", as_index=False).agg({"model_name": "unique"})
agged["model_name"] = agged["model_name"].map(tuple)
agged["all_present_mask"] = agged["model_name"].isin([tuple(unique_model_names)])
agged
hardware_id model_name all_present_mask
0 a (1, 2, 3) True
1 b (1, 2, 3) True
2 c (1, 2, 3) True
3 d (3,) False
Finally, we can use this to get our list of "valid" hardware IDs, and then filter our initial dataframe.
relevant_ids = agged.loc[
    agged.all_present_mask
].hardware_id

result = df.loc[
    df.hardware_id.isin(relevant_ids)
]
result
hardware_id model_name data_v
0 a 1 0.154163
1 b 1 0.740050
2 c 1 0.263315
3 a 2 0.533739
4 b 2 0.014575
5 c 2 0.918747
6 a 3 0.900715
7 b 3 0.033421
8 c 3 0.956949
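The tuple comparison above relies on every hardware_id listing its model names in the same order as the overall unique list. A minimal sketch of an order-independent variant, which is not part of the original answer, counts the distinct model names per hardware_id and keeps the rows whose count equals the overall number of distinct model names. The variable names `n_models` and `models_per_id` are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(12)
models = ['1', '1', '1', '2', '2', '2', '3', '3', '3', '3']
df = pd.DataFrame({
    'hardware_id': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'd'],
    'model_name': models,
    'data_v': np.random.rand(len(models)),
})

# Total number of distinct model names in the whole frame
n_models = df['model_name'].nunique()

# For every row, the number of distinct model names seen for that row's hardware_id
models_per_id = df.groupby('hardware_id')['model_name'].transform('nunique')

# Keep only hardware_ids that appear under every model name
result = df[models_per_id == n_models]
print(result)
```

This avoids the intermediate tuple column, at the cost of not recording which model names each hardware_id actually has.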
Dask
We can do essentially the same thing, but we need to be a little clever with our calls to compute.
import dask.dataframe as dd

ddf = dd.from_pandas(df, 2)
unique_model_names = ddf.model_name.unique()

agged = ddf.groupby("hardware_id").model_name.unique().reset_index()
agged["model_name"] = agged["model_name"].map(tuple)
agged["all_present_mask"] = agged["model_name"].isin([tuple(unique_model_names)])

relevant_ids = agged.loc[
    agged.all_present_mask
].hardware_id

result = ddf.loc[
    ddf.hardware_id.isin(relevant_ids.compute())  # can't pass a dask Series to `ddf.isin`
]
result.compute()
hardware_id model_name data_v
0 a 1 0.154163
1 b 1 0.740050
2 c 1 0.263315
3 a 2 0.533739
4 b 2 0.014575
5 c 2 0.918747
6 a 3 0.900715
7 b 3 0.033421
8 c 3 0.956949
Note that you would probably want to persist agged and relevant_ids if you have the memory available, to avoid some redundant calculation.

Pandas - how to get the minimum value for each row from values across several columns

I have a pandas dataframe in the following structure:
|index | a | b | c | d | e |
| ---- | -- | -- | -- | -- | -- |
|0 | -1 | -2| 5 | 3 | 1 |
How can I get the minimum value for each row using only the positive values in columns a-e?
For the example row above, the minimum of (5,3,1) should be 1 and not (-2).
You can loop over all rows and apply your condition to each one.
For example:
import pandas as pd

df = pd.DataFrame([{"a":-2,"b":2,"c":5},{"a":3,"b":0,"c":-1}])
# a b c
#0 -2 2 5
#1 3 0 -1
def my_condition(li):
    # keep only the non-negative values (use i > 0 to exclude zero as well)
    li = [i for i in li if i >= 0]
    return min(li)
min_cel = []
for k, r in df.iterrows():
    li = r.to_dict().values()
    min_cel.append(my_condition(li))
df["min"] = min_cel
# a b c min
#0 -2 2 5 2
#1 3 0 -1 0
You can also write the same code on one line:
df['min'] = df.apply(lambda row: min([i for i in row.to_dict().values() if i>=0]), axis=1)
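A vectorized alternative, sketched here under the assumption that all the columns involved are numeric, masks the non-positive values with NaN and lets pandas take the row-wise minimum, which skips NaN by default. The helper name `min_positive` is hypothetical, and the second row of the example frame is made up for illustration.

```python
import pandas as pd

def min_positive(frame, columns):
    """Row-wise minimum over `columns`, ignoring values that are not strictly positive."""
    values = frame[columns]
    # where() keeps values satisfying the condition and replaces the rest with NaN;
    # min(axis=1) then skips the NaN entries
    return values.where(values > 0).min(axis=1)

df = pd.DataFrame({
    "a": [-1, -2],
    "b": [-2, 2],
    "c": [5, 5],
    "d": [3, 7],
    "e": [1, 4],
})
df["min"] = min_positive(df, ["a", "b", "c", "d", "e"])
print(df)
```

A row with no positive values at all ends up with NaN in the new column instead of raising an error, which differs from the `min()` of an empty list in the loop version.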

How to set manually split long rows size on Octave's Terminal Output?

How can I manually set the width at which Octave splits long rows in its terminal output?
I am using Octave through the Sublime Text build output panel, and Octave cannot correctly detect how many columns it should use to fill the screen before splitting.
For example, it currently fills the screen like this:
octave:13> rand (2,10)
ans =
Columns 1 through 6:
0.75883 0.93290 0.40064 0.43818 0.94958 0.16467
0.75697 0.51942 0.40031 0.61784 0.92309 0.40201
Columns 7 through 10:
0.90174 0.11854 0.72313 0.73326
0.44672 0.94303 0.56564 0.82150
But I want it to show 10 columns at once (Columns 1 through 10) instead of Columns 1 through 6.
If I disable split_long_rows, the rows are never split at all. From the documentation:
Query or set the internal variable that controls whether rows of a
matrix may be split when displayed to a terminal window.
If the rows are split, Octave will display the matrix in a series of
smaller pieces, each of which can fit within the limits of your
terminal width and each set of rows is labeled so that you can easily
see which columns are currently being displayed.
https://www.gnu.org/software/octave/doc/v4.0.1/Matrices.html#XREFsplit_005flong_005frows
You cannot split them like that. Octave's default output is just a simple and fast way to debug your program. To print things the way you want, write a function that formats and prints them.
This is a similar example, where a table is printed:
...
for i = 2 : 7
...
% https://www.gnu.org/software/octave/doc/v4.0.1/Basic-Usage-of-Cell-Arrays.html
results(end+1).vector = { m, gaussLegendreIntegral________, gaussLegendreIntegralErroExato___ };
end
printf( "%20s | %30s | %30s\n", "m", "Gm", "Erro Exato Gm = |Gm - Ie |" )
printf( "%20s | %30s | %30s\n", "--------------------", "------------------------------", "------------------------------" )
numberToStringPrecision = 15;
for i = 1 : numel( results )
# https://www.gnu.org/software/octave/doc/v4.0.0/Processing-Data-in-Cell-Arrays.html
# https://www.gnu.org/software/octave/doc/v4.0.1/Converting-Numerical-Data-to-Strings.html#XREFnum2str
printf( "%20s | ", num2str( cell2mat( results(i).vector(1) ), numberToStringPrecision ) )
printf( "%30s | ", num2str( cell2mat( results(i).vector(2) ), numberToStringPrecision ) )
printf( "%30s\n" , num2str( cell2mat( results(i).vector(3) ), numberToStringPrecision ) )
end
It would generate output like this:
m | Gm | Erro Exato Gm = |Gm - Ie |
-------------------- | ------------------------------ | ------------------------------
2 | -0.895879734614027 | 0.104120265385973
3 | -0.947672383858322 | 0.0523276161416784
4 | -0.968535977854582 | 0.0314640221454183
5 | -0.979000992287376 | 0.0209990077126242
6 | -0.984991210262343 | 0.0150087897376568
7 | -0.988738923004894 | 0.0112610769951058

FormulaArray not averaging out all the specified entries

Table 1:
G H I J K
| Lane | Bowler | Score | Score | Score | 1
|:-----------|------------:|:------------:|:------------:|:------------:|
| Lane 1 | Thomas| 100 | 100 | 100 | 2
| Lane 2 | column | 200 | 200 | 100 | 3
| Lane 3 | Mary | 300 | 300 | 100 | 4
| Lane 1 | Cool | 150 | 400 | 100 | 5
| Lane 2 | right | 160 | 500 | 100 | 6
| Lane 9 | Susan | 170 | 600 | 100 | 7
Say I want to find the average for each lane that appears in Table 2 and put it in column O:
Table 2:
N O
| Lane | Average | 1
|:-----------|------------:|
| Lane 1 | | 2
| Lane 2 | | 3
| Lane 3 | | 4
I would put
=AVERAGE(IF(N2=$G$2:$G$7, $I$2:$K$7 )) for lane 1 (put this formula on cell "O2")
=AVERAGE(IF(N3=$G$2:$G$7, $I$2:$K$7 )) for Lane 2 ("O3")
=AVERAGE(IF(N4=$G$2:$G$7, $I$2:$K$7 )) for Lane 3 ("O4")
My first question is:
What if I want to find the average of ALL the lanes that appear in Table 2 together? That is, the average of Lane 1, Lane 2 and Lane 3 combined (but not other lanes, such as Lane 9).
My attempt:
= Average(IF(G2:G7 = N2:N4, I2:K7)) Why doesn't this work?
My second question is
I have done the "average of each individual Lane" using vba:
.
Dim i As Integer
For i = 2 To 4
Cells(i, 15).FormulaArray = "=AVERAGE(IF(RC[-1]=R2C7:R7C7,R2C9:R7C12))"
Next i
.
What if I did it in VBA without the .Formula method?
For Lane 1 only:
pseudo code:
Loop from G2 to G7
If cell (N1) = Gx then //x: 2 to 7
Sum = Sum + Ix + Jx + Kx
}
Average = Sum/totalEntries
Would this be slower than using the built-in .Formula? Is there an advantage to doing it this way instead?
The answer to the first question, why this FormulaArray
= Average(IF(G2:G7 = N2:N4, I2:K7))
does not work, follows from how this other FormulaArray works:
= AVERAGE( IF( $G$7:$G$12 = $N7, $I$7:$K$12 ) )
Let’s see how each part of this “single-cell formula array” works:
1st part: $G$7:$G$12 = $N7
The first part of the formula generates an array with the records from range $G$7:$G$12 complying with the condition = $N7. Fig. 1 shows the first part of the FormulaArray as a “multi-cell formula array”.
2nd Part: $I$7:$K$12
The result of the first part is applied to the second part to obtain the range of scores complying with the condition = $N7 (see Fig. 2)
3rd part: AVERAGE
Finally the last part of the formula calculates the average of the scores complying with the condition = $N7
Now let’s try to apply the same analysis to the formula:
= AVERAGE( IF( G2:G7 = N2:N4, I2:K7 ) )
Unfortunately, we cannot go beyond the first part, G2:G7 = N2:N4, as it fails when trying to compare two arrays of different dimensions, resulting in #N/A (see Fig. 3).
However, even if the arrays had the same dimensions, the result would not have matched the duplicated values, as the members are compared one to one (see Fig. 4).
To obtain the average for Lanes 1 to 3, use this FormulaArray:
=AVERAGE( IF(
( $G$7:$G$12 = $N7 ) + ( $G$7:$G$12 = $N8 ) + ( $G$7:$G$12 = $N9 ),
$I$7:$K$12 ) )
It generates an array with the records that satisfy any of the conditions = $N7, = $N8 or = $N9 (adding the comparisons with + acts as a logical OR).
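The element-wise logic of that array formula can be mimicked outside Excel. The following is a rough NumPy sketch, not Excel code, using the data from Table 1: each lane comparison yields a boolean array with one entry per row, the arrays are combined with a logical OR (the role played by `+` in the formula), and the combined mask selects the score rows to average. The names `lanes`, `scores`, `wanted` and `mask` are illustrative.

```python
import numpy as np

# Column G (lanes) and columns I:K (scores) from Table 1
lanes = np.array(["Lane 1", "Lane 2", "Lane 3", "Lane 1", "Lane 2", "Lane 9"])
scores = np.array([
    [100, 100, 100],
    [200, 200, 100],
    [300, 300, 100],
    [150, 400, 100],
    [160, 500, 100],
    [170, 600, 100],
])

# Lanes listed in Table 2
wanted = ["Lane 1", "Lane 2", "Lane 3"]

# Compare the whole lane column against one value at a time and OR the results
mask = np.zeros(len(lanes), dtype=bool)
for lane in wanted:
    mask = mask | (lanes == lane)

# Average every score in the selected rows
average = scores[mask].mean()
print(average)
```

Comparing the column against one scalar at a time is what makes the shapes compatible, which is the same reason the single-cell formula compares against $N7, $N8 and $N9 separately rather than against the range N2:N4.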
As regards the second question:
Performance is closely tied to maintenance and efficiency.
The sample procedure just enters a hard-coded formula that only works for this particular case. For example:
If the ranges need to be expanded, the macro has to be updated. With formulas entered directly in the worksheet, the formula still has to be changed, but there is no need to open the VBA editor.
If any column before column G is deleted because it has become obsolete, the macro needs to be updated, whereas worksheet formulas require no maintenance because they adjust automatically.
In reference to the macro without the .Formula method:
I find this redundant, as it amounts to writing an algorithm for something that an existing function already does efficiently and accurately; such a macro adds nothing that is not already available.
I would consider writing such a procedure only when the workbook is very large and so resource-hungry that its performance suffers significantly. Even then, the benefit would not come from merely writing the formulas: the procedure must calculate the results and enter the resulting values instead of the formulas, making the workbook light, fast and smooth for the end user.
To get the average of every score in the table (note that this includes Lane 9), just use
=AVERAGE(I2:K7)
As to the VBA, as it is all done on the same lines, could you just use
For i = 2 To 7
Cells(i,"O").Value = Application.Sum(Range(Cells(i,"I"),Cells(i,"K")))
Next i

Using Linq for Object Dataset Processing

I have a collection (IList(Of Sample)) of the following class:
Public Class Sample
Public sampleNum As String
Public volume As Integer
Public initial As Single
Public final As Single
End Class
This collection is filled from a regex that gets passed over a file.
What I want to do is use Linq to generate a collection of these for each unique samplenum using the following conditions.
For each samplenum:
Have the highest volume where the final is greater than one
If the sample has multiple records for this volume then pick the one with the highest final
If the previous step leaves us with no records pick the record with the highest final ignoring volume
I am extremely new to Linq and just can't get my head around this. For now I have solved this using For Each loops and temporary lists, but I am interested in how this would be handled using pure Linq.
Sample Data:
samplenum | volume | initial | final
1 | 50 | 8.47 | 6.87
1 | 300 | 8.93 | 3.15
2 | 5 | 8.28 | 6.48
2 | 10 | 8.18 | 5.63
2 | 5 | 8.33 | 6.63
2 | 10 | 8.26 | 5.58
3 | 1 | 8.31 | 0.75
3 | 5 | 8.19 | 0.03
4 | 50 | 8.28 | 6.55
4 | 300 | 7.19 | 0.03
This should hopefully solve your problems:
Dim source As IEnumerable(Of Sample)
' Get the data...
Dim processed = source _
.GroupBy(Function(s) s.sampleNum) _
.Select(Function(s) Process(s))
Dim array = processed.ToArray()
Console.ReadLine()
The Process function:
Private Function Process(ByVal sequence As IEnumerable(Of Sample)) As Sample
Dim filtered = (
From s In sequence
Where s.final > 1
Order By
s.volume Descending,
s.final Descending
)
' If we don't have any elements after the filtering,
' return the one with the highest final.
' Otherwise, return the first element.
If Not filtered.Any() Then
Return (From s In sequence Order By s.final Descending).FirstOrDefault()
Else
Return filtered.First()
End If
End Function
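To check the selection rules against the sample data from the question, here is the same logic sketched in Python. It is an illustrative translation, not the VB.NET answer itself; the `Sample` tuple and the `pick` and `process` helpers are hypothetical names.

```python
from collections import namedtuple
from itertools import groupby

Sample = namedtuple("Sample", "sample_num volume initial final")

def pick(records):
    """Choose one record for a single sample number."""
    filtered = [r for r in records if r.final > 1]
    if filtered:
        # Highest volume first; ties on volume broken by the highest final
        return max(filtered, key=lambda r: (r.volume, r.final))
    # Nothing passed the filter: highest final, ignoring volume
    return max(records, key=lambda r: r.final)

def process(samples):
    """Return one chosen record per sample number."""
    by_num = sorted(samples, key=lambda r: r.sample_num)
    return [pick(list(group))
            for _, group in groupby(by_num, key=lambda r: r.sample_num)]

data = [
    Sample("1", 50, 8.47, 6.87),
    Sample("1", 300, 8.93, 3.15),
    Sample("2", 5, 8.28, 6.48),
    Sample("2", 10, 8.18, 5.63),
    Sample("2", 5, 8.33, 6.63),
    Sample("2", 10, 8.26, 5.58),
    Sample("3", 1, 8.31, 0.75),
    Sample("3", 5, 8.19, 0.03),
    Sample("4", 50, 8.28, 6.55),
    Sample("4", 300, 7.19, 0.03),
]

chosen = process(data)
for record in chosen:
    print(record)
```

Sorting by the composite key `(volume, final)` plays the same role as the two `Order By ... Descending` clauses in the query above.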
Try this. I haven't tested it, but it should do what you want. There is probably a better way of doing this:
' For each sample number in the list
For Each el In (From p In lst Select p.sampleNum).Distinct()
' capturing the loop variable in a lambda can cause odd results, so copy it into another variable first
Dim var As String = el
' all records for this sample number whose final is greater than one
Dim res As IEnumerable(Of Sample) = lst.Where(Function(p) p.final > 1 AndAlso p.sampleNum = var)
Dim sam As Sample ' the result
If res.Any() Then
' we have candidates, so take the highest volume, breaking ties with the highest final
sam = res.OrderByDescending(Function(p) p.volume).ThenByDescending(Function(p) p.final).First()
Else
' no candidates, so fall back to the record with the highest final for this sample number, ignoring volume
sam = lst.Where(Function(p) p.sampleNum = var).OrderByDescending(Function(p) p.final).First()
End If
'
' do whatever you need with the sample here
'
Next