Training an object detector on dataset layers of increasing ambiguity - tensorflow

I have a number of Raspberry Pi cameras focused on bird feeders,
continually running a TensorFlow Object Detection graph (SSD MNet2) to detect birds.
Over time I've built a dataset of 10k+ images over 11 species, retraining the graph frequently.
I intend to cap the dataset at 10k items (perhaps arbitrarily).
There is a flow of data through the dataset so that it continually improves.
New candidate detections are triaged by a judge (me) as follows:
- Add as new training/evaluation item.
  - The detection is judged representative of a category.
  - After adjustment, the image and detection can be added to the ground truth.
- Add as counter-example item.
  - The detection is false, but can be converted to an unclassified counter example.
  - After adjustment, the image and detection can be added to the ground truth.
- Discard item.
  - Not useful for training.
Also note that some existing data is retired when enough better data becomes available.
To date, all the items in the ground truth are delivered to training with a weight of 1.0.
See: https://github.com/tensorflow/models/blob/master/research/object_detection/data_decoders/tf_example_decoder.py
    def default_groundtruth_weights():
      return tf.ones(
          [tf.shape(tensor_dict[fields.InputDataFields.groundtruth_boxes])[0]],
          dtype=tf.float32)
But this is clearly not accurate.
I know by inspection that some of the items are not so good, but at any one time they're the best examples available.
Over time, bad items are eventually replaced with better ones.
Ranked training records
I have wondered about the impact on training, and whether the situation would be improved by ranking the dataset by some measure of ideality,
and then training in successive ranks, so that the model initialises on the most ideal data and subsequently learns from less and less ideal data.
What I'm trying to avoid is the model paying too much attention to bad data and not enough to good data, especially during the initial epochs of training.
Here, bad and good refer to how well the data items contribute to the accuracy and visualization (via Lucid) of the trained models.
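To make this concrete, here is a minimal sketch of the staged delivery I have in mind, assuming each item already carries the ideality score described in the next section as a 'weight' field (the helper name is mine):

    def ranked_stages(items, stage_fractions=(0.25, 0.5, 0.75, 1.0)):
        """Yield cumulative slices of the dataset, most ideal items first."""
        # Rank once by the ideality score, best first.
        ranked = sorted(items, key=lambda item: item['weight'], reverse=True)
        for fraction in stage_fractions:
            # Each stage re-delivers the better data plus the next, less ideal band.
            yield ranked[:int(len(ranked) * fraction)]

Each stage would drive one round of (re)training before the next, less ideal slice is added.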
Weighted Dataset
Setting a weight (between 0 and 1) on an item means the loss calculated for that item is reduced (by the weight factor);
I assume it means "pay less attention to this item by this much".
See: Class weights for balancing data in TensorFlow Object Detection API
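If I read tf_example_decoder.py correctly, a per-box weight can be supplied in each tf.train.Example under the 'image/object/weight' feature, which the decoder maps to groundtruth_weights. A minimal sketch of how such a record could be built (the helper is my own, not part of the API):

    import tensorflow as tf

    def floats(values):
        return tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))

    def build_example(encoded_jpeg, boxes, class_ids, class_texts, weights):
        """boxes: list of (ymin, xmin, ymax, xmax), normalised to [0, 1]."""
        ymins, xmins, ymaxs, xmaxs = zip(*boxes)
        feature = {
            'image/encoded': tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[encoded_jpeg])),
            'image/format': tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[b'jpeg'])),
            'image/object/bbox/ymin': floats(ymins),
            'image/object/bbox/xmin': floats(xmins),
            'image/object/bbox/ymax': floats(ymaxs),
            'image/object/bbox/xmax': floats(xmaxs),
            'image/object/class/label': tf.train.Feature(
                int64_list=tf.train.Int64List(value=class_ids)),
            'image/object/class/text': tf.train.Feature(
                bytes_list=tf.train.BytesList(value=class_texts)),
            # One weight per box; when this feature is absent the decoder
            # falls back to the default weight of 1.0 shown above.
            'image/object/weight': floats(weights),
        }
        return tf.train.Example(features=tf.train.Features(feature=feature))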
I've visited every item in my dataset to retrospectively set a weight.
I did this by running some recent models over the dataset images (admittedly, the same images those models were trained on) and then matching the detections to the annotations.
The weight given to each item was calculated by averaging scores from the model detections (and rounding to one decimal place to make bands).
The entire dataset was then reviewed to increase or reduce weights as judged necessary.
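For reference, the scoring pass was roughly the following sketch (matching detections to annotations is not shown, and the clamping range mirrors the bands in the table below rather than any formal rule):

    from statistics import mean

    def item_weight(matched_scores, floor=0.3, ceiling=1.0):
        """Average the matched detection scores, rounded to one decimal to form bands."""
        if not matched_scores:
            return floor  # no model agreed with the annotation, so weight it low
        weight = round(mean(matched_scores), 1)
        return min(max(weight, floor), ceiling)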
The results are shown in the following table:
| Class \ Weight bin | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | Total |
|---|---|---|---|---|---|---|---|---|
| blackbird | | | 34 | 84 | 212 | 305 | 115 | 750 |
| blue tit | | | 47 | 94 | 211 | 435 | 241 | 1028 |
| collared dove | | | 17 | 52 | 236 | 302 | 101 | 708 |
| dunnock | | | 50 | 140 | 260 | 236 | 228 | 914 |
| goldfinch | | | 60 | 103 | 220 | 392 | 164 | 939 |
| great tit | | 35 | 42 | 71 | 234 | 384 | 201 | 967 |
| mouse | 40 | 29 | 35 | 50 | 87 | 142 | | 383 |
| robin | | 43 | 44 | 97 | 175 | 207 | 52 | 618 |
| sparrow | | 31 | 51 | 75 | 278 | 475 | 220 | 1130 |
| starling | | 19 | 28 | 39 | 97 | 227 | 73 | 483 |
| wood pigeon | | | 10 | 34 | 82 | 265 | 560 | 951 |
| Total | 40 | 157 | 418 | 839 | 2092 | 3370 | 1955 | 8871 |
The first training results look promising, in that the model is training well.
But I haven't reviewed the visualizations yet.
Is setting an appropriate weight on each dataset item equivalent to layering the delivery of ranked training records?

First, try removing the ambiguous data from the dataset, retrain the model, and compare its results with the previous model.
If that does not help, then go with class weights for balancing the data.
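A minimal sketch of that first experiment, assuming each item carries the retrospective weight described in the question (the threshold and field name are placeholders):

    AMBIGUITY_THRESHOLD = 0.6  # e.g. drop the 0.3-0.5 bands

    def filter_ambiguous(items, threshold=AMBIGUITY_THRESHOLD):
        """Keep only items whose assigned weight meets the threshold."""
        return [item for item in items if item['weight'] >= threshold]

Train on filter_ambiguous(dataset), evaluate against the model trained on the full set, and only move on to per-item weights if the filtered model is not better.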

Related

How to manage relationships between a main table and a variable number of secondary tables in Postgresql

I am trying to create a postgresql database to store the performance specifications of wind turbines and their characteristics.
The way I have structured this in my head is the following:
A main table with a unique id for each turbine model as well as basic information about them (rotor size, max power, height, manufacturer, model id, design date, etc.)
example structure of the "main" table holding all of the main turbine characteristics:
+---------------+------------+--------+-----------+------+
| turbine_model | rotor_size | height | max_power | etc. |
+---------------+------------+--------+-----------+------+
| model_x1      | 200        | 120    | 15        | etc. |
| model_b7      | 250        | 145    | 18        | etc. |
+---------------+------------+--------+-----------+------+
A lookup table for each turbine model storing how much it produces at a given wind speed, with one column for wind speeds and another column for power output. There will be as many of these tables as there are rows in the main table.
example table "model_x1":
+------------+--------------+
| wind_speed | power_output |
+------------+--------------+
| 1          | 0.5          |
| 2          | 1.5          |
| 3          | 2.0          |
| 4          | 2.7          |
| 5          | 3.2          |
| 6          | 3.9          |
| 7          | 4.9          |
| 8          | 7.0          |
| 9          | 10.0         |
+------------+--------------+
However, I am struggling to find a way to implement this as I cannot find a way to build relationships between each row of the "main" table and the lookup tables. I am starting to think this approach is not suited for a relational database.
How would you design a database to solve this problem?
A relational database is perfect for this, but you will want to learn a little bit about normalization to design the layout of the tables.
Basically, you'll want to add a 3rd column to your poweroutput reference table so that each model is just more rows (grow long, not wide).
Here is an example of what I mean; I have taken it a step further to show that you might also want to reference other metrics in addition to wind speed (rpm in this case).
PowerOutput Reference Table
+----------+--------+------------+-------------+
| model_id | metric | metric_val | poweroutput |
+----------+--------+------------+-------------+
| model_x1 | wind | 1 | 0.5 |
| model_x1 | wind | 2 | 1.5 |
| model_x1 | wind | 3 | 3 |
| ... | ... | ... | ... |
| model_x1 | rpm | 1250 | 1.5 |
| model_x1 | rpm | 1350 | 2.5 |
| model_x1 | rpm | 1450 | 3.5 |
| ... | ... | ... | ... |
| model_bg | wind | 1 | 0.7 |
| model_bg | wind | 2 | 0.9 |
| model_bg | wind | 3 | 1.2 |
| ... | ... | ... | ... |
| model_bg | rpm | 1250 | 1 |
| model_bg | rpm | 1350 | 1.5 |
| model_bg | rpm | 1450 | 2 |
+----------+--------+------------+-------------+
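A sketch of the corresponding DDL (column names follow the examples above; the types are assumptions, so adjust them to your units):

    CREATE TABLE turbine_model (
        model_id     text PRIMARY KEY,
        rotor_size   numeric,
        height       numeric,
        max_power    numeric,
        manufacturer text,
        design_date  date
    );

    CREATE TABLE power_output_reference (
        model_id    text    NOT NULL REFERENCES turbine_model (model_id),
        metric      text    NOT NULL,   -- e.g. 'wind' or 'rpm'
        metric_val  numeric NOT NULL,
        poweroutput numeric NOT NULL,
        PRIMARY KEY (model_id, metric, metric_val)
    );

The power curve for one model is then just a filtered query, for example:

    SELECT metric_val AS wind_speed, poweroutput
    FROM power_output_reference
    WHERE model_id = 'model_x1' AND metric = 'wind'
    ORDER BY metric_val;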

How do I make this calculated measure axis independent and portable?

So I am a beginner at MDX and I have an MDX query that works the way I want it to so long as I put the set on either the columns or the rows. If I put the same set on the filter axis it doesn't work. I'd like to make this calculated measure independent of where the set lives. I'm guaranteed to always have some form of a set included, but I'm not guaranteed which axis the user will place it on (e.g. rows, columns, filter).
Here is the query that works:
WITH MEMBER Measures.avgApplicants as
Avg([applicationDate].[yearMonth].[month].Members, [Measures].[applicants])
SELECT
{[Measures].[applicants],[Measures].[avgApplicants]} ON 0,
{[applicationDate].[yearMonth].[year].[2015]:[applicationDate].[yearMonth].[year].[2016]} ON 1
FROM [applicants]
And results:
| | applicants | avgMonthlyApplicants |
+------+------------+----------------------+
| 2015 | 367 | 33 |
| 2016 | 160 | 33 |
However, if I shift this query around to move the set onto the filter axis I get nothing:
WITH MEMBER Measures.avgApplicants as
Avg([applicationDate].[yearMonth].[month].Members, [Measures].[applicants])
SELECT
{[Measures].[applicants],[Measures].[avgApplicants]} ON 0,
{[Gender].Members} ON 1
FROM [applicants]
WHERE ([applicationDate].[yearMonth].[year].[2015]:[applicationDate].[yearMonth].[year].[2016])
I get this:
|             | applicants | avgApplicants |
+-------------+------------+---------------+
| All Genders |        478 |               |
| Female      |        172 |               |
| Male        |        183 |               |
| Not Known   |         61 |               |
| Unspecified |         62 |               |
So how do I create this calculated measure so that it isn't dependent on which axis the set is placed on?

Detecting rising and falling edge via SQL (loading cycles)

I need to detect rising and falling edges of a loading state in my logs and list all loading cycles.
Let's say I have a table LOG:
UTS | VALUE | STATE
1438392102 | 1000 | 0
1438392104 | 1001 | 1
1438392106 | 1002 | 1
1438392107 | 1003 | 0
1438392201 | 1007 | 1
1438392220 | 1045 | 1
1438392289 | 1073 | 0
1438392305 | 1085 | 1
1438392310 | 1090 | 1
1438392315 | 1095 | 1
I need all cycles where STATE = 1: when each cycle started, how long it lasted,
and how much VALUE changed during it.
I also might have a situation where the last cycle isn't finished yet.
Do you have an idea how I can do this in SQL in a well-performing way? I might run into situations where my logs return several hundred thousand rows.
Thanks for any help.
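One common approach is a window-function sketch along these lines (assuming a dialect with LAG and windowed SUM, e.g. PostgreSQL or SQL Server 2012+; table and column names as above):

    WITH flagged AS (
        SELECT uts, value, state,
               -- a rising edge is a 1 whose previous row was 0 (or has no previous row)
               CASE WHEN state = 1
                     AND COALESCE(LAG(state) OVER (ORDER BY uts), 0) = 0
                    THEN 1 ELSE 0 END AS rising_edge
        FROM log
    ),
    numbered AS (
        SELECT uts, value, state,
               -- running count of rising edges = cycle number
               SUM(rising_edge) OVER (ORDER BY uts) AS cycle_no
        FROM flagged
    )
    SELECT cycle_no,
           MIN(uts)                AS cycle_start,
           MAX(uts) - MIN(uts)     AS duration_seconds,  -- an unfinished cycle ends at the last row logged so far
           MAX(value) - MIN(value) AS value_change       -- assumes VALUE is non-decreasing within a cycle
    FROM numbered
    WHERE state = 1
    GROUP BY cycle_no
    ORDER BY cycle_start;

Duration here is measured between the first and last STATE = 1 row of each cycle; adjust if the falling-edge row should count as the end.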

Find a subset of numbers that matches a target weighted average and target sum

There is a SQL Server table containing 1 million rows. Sample data is shown below.
The Percentage column is computed as (Y / X) * 100.
+----+--------+-------------+-----+-----+-------------+
| ID | Amount | Percentage | X | Y | Z |
+----+--------+-------------+-----+-----+-------------+
| 1 | 10 | 9.5 | 100 | 9.5 | 95 |
| 2 | 20 | 9.5 | 100 | 9.5 | 190 |
| 3 | 40 | 5 | 100 | 5 | 200 |
| 4 | 50 | 5.555555556 | 90 | 5 | 277.7777778 |
| 5 | 70 | 8.571428571 | 70 | 6 | 600 |
| 6 | 100 | 9.230769231 | 65 | 6 | 923.0769231 |
| 7 | 120 | 7.058823529 | 85 | 6 | 847.0588235 |
| 8 | 60 | 10.52631579 | 95 | 10 | 631.5789474 |
| 9 | 80 | 10 | 100 | 10 | 800 |
| 10 | 95 | 10 | 100 | 10 | 950 |
+----+--------+-------------+-----+-----+-------------+
Now I need to find rows whose Amount values add up to a given Amount and whose weighted average matches the given Percentage.
For example, if the target Amount =365 and target Percentage=9.84, then from the given dataset, we can say that rows with ID=1,2,6,8,9,10 form the subset which will match the given targets.
Amount = 10+20+100+60+80+95
= 365
Percentage = Sum of (product of Amount and Percentage)/Sum of (Amount)
(I am using Z column to store the products of Amount and Percentage to make the calculations easier)
= ((10*9.5)+(20*9.5)+(100*9.23077)+(60*10.5264)+(80*10)+(95*10))/ (10+20+100+60+80+95)
= 9.834673618
So rows 1, 2, 6, 8, 9, 10 match the given target sum and target weighted average.
The proposed algorithm should work on the 1 million rows, and the main objective is to match the weighted average (Percentage), with the total Amount as close as possible to the target Amount.
I found a few questions on Stack Overflow related to matching a target sum, but my problem is to match two target attributes: the sum and the weighted average.
Which algorithm can be used to achieve this?
Since the target "Percentage" is only approximate (therefore not an actual constraint), let's try removing it and find a solution for Amount. This can only make the problem easier.
What's left is the Subset Sum Problem, which is NP-complete. There are simple exponential-time solutions, and sneaky pseudo-polynomial-time solutions, but I don't think any of them will be practical for a table with 10^6 rows.
If this is an academic exercise, I suggest you write up the cleverest pseudo-polynomial-time solution you can come up with. If it's a task in the real world, I suggest you go back to the person who gave it to you, explain that an exact solution is impractical, and negotiate for an approximate solution.
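If it helps, here is a minimal sketch of the classic pseudo-polynomial dynamic program over integer Amounts (it drops the weighted-average target, as suggested above, and is illustrative rather than something I would expect to scale to 10^6 rows):

    def subset_sum(amounts, target):
        """Return 0-based indices of a subset of `amounts` summing to `target`, or None."""
        prev = {0: None}                  # prev[s] = (previous_sum, item_index) used to reach s
        for i, a in enumerate(amounts):
            for s in list(prev):          # snapshot, so each item is used at most once
                t = s + a
                if t <= target and t not in prev:
                    prev[t] = (s, i)
        if target not in prev:
            return None
        chosen, s = [], target
        while prev[s] is not None:        # walk predecessors back to the empty subset
            s, i = prev[s]
            chosen.append(i)
        return sorted(chosen)

    # Amount column from the table above; prints one subset of indices summing to 365 (or None).
    print(subset_sum([10, 20, 40, 50, 70, 100, 120, 60, 80, 95], 365))

The state table grows with the target value, which is exactly why this stops being practical at the scale in the question.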

How to assign event counts to relative date values in SQL?

I want to line up multiple series so that all milestone dates are set to month zero, allowing me to measure the before-and-after effect of the milestone. I'm hoping to be able to do this using SQL server.
You can see an approximation of what I'm starting with at this data.stackexchange.com query. This sample query returns a table that basically looks like this:
+------------+-------------+---------+---------+---------+---------+---------+
| UserID | BadgeDate | 2014-01 | 2014-02 | 2014-03 | 2014-04 | 2014-05 |
+------------+-------------+---------+---------+---------+---------+---------+
| 7 | 2014-01-02 | 232 | 22 | 19 | 77 | 11 |
+------------+-------------+---------+---------+---------+---------+---------+
| 89 | 2014-04-02 | 345 | 45 | 564 | 13 | 122 |
+------------+-------------+---------+---------+---------+---------+---------+
| 678 | 2014-03-11 | 55 | 14 | 17 | 222 | 109 |
+------------+-------------+---------+---------+---------+---------+---------+
| 897 | 2014-03-07 | 234 | 56 | 201 | 19 | 55 |
+------------+-------------+---------+---------+---------+---------+---------+
| 789 | 2014-02-22 | 331 | 33 | 67 | 108 | 111 |
+------------+-------------+---------+---------+---------+---------+---------+
| 989 | 2014-01-09 | 12 | 89 | 97 | 125 | 323 |
+------------+-------------+---------+---------+---------+---------+---------+
This is not what I'm ultimately looking for. Values in month columns are counts of answers per month. What I want is a table with counts under relative month numbers as defined by BadgeDate (with BadgeDate month set to month 0 for each user, earlier months set to negative relative month #s, and later months set to positive relative month #s).
Is this possible in SQL? Or is there a way to do it in Excel with the above table?
After generating this table I plan on averaging relative month totals to plot a line graph that will hopefully show a noticeable inflection point at relative month zero. If there's no apparent bend, I can probably assume the milestone has a negligible effect on the Y-axis metric. (I'm not even quite sure what this kind of chart is called. I think Google might have been more helpful if I knew the proper terms for what I'm talking about.)
Any ideas?
This is precisely what the aggregate functions and case when ... then ... else ... end construct are for:
select
    UserID
  , BadgeDate
    -- assumes AnswerDate has already been truncated to the year-month, e.g. '2014-01'
  , sum(case when AnswerDate = '2014-01' then 1 else 0 end) as [2014-01]
  -- etc., one column per calendar month
from YourAnswerData  -- placeholder: the table or query behind the sample above
group by
    UserID
  , BadgeDate
The PIVOT clause is also available in some flavours and versions of SQL, but it is less flexible in general, so the traditional mechanism is worth understanding.
Likewise, a PivotTable in Excel can produce the same report, but there is value in aggregating the data as far as possible on the server when bandwidth is at a premium.
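To get the relative month numbers the question actually asks for (rather than calendar months), one variant is to pivot on the month difference from BadgeDate instead. A T-SQL sketch, with the source table name as a placeholder and AnswerDate assumed to be a real date/datetime column:

    select
        UserID
      , BadgeDate
        -- datediff(month, ...) is 0 in the badge month, negative before it, positive after it
      , sum(case when datediff(month, BadgeDate, AnswerDate) = -2 then 1 else 0 end) as [m-2]
      , sum(case when datediff(month, BadgeDate, AnswerDate) = -1 then 1 else 0 end) as [m-1]
      , sum(case when datediff(month, BadgeDate, AnswerDate) =  0 then 1 else 0 end) as [m0]
      , sum(case when datediff(month, BadgeDate, AnswerDate) =  1 then 1 else 0 end) as [m+1]
      , sum(case when datediff(month, BadgeDate, AnswerDate) =  2 then 1 else 0 end) as [m+2]
      -- etc., as many relative months as you want to chart
    from YourAnswerData  -- placeholder: one row per answer, with UserID, BadgeDate, AnswerDate
    group by
        UserID
      , BadgeDate;

Averaging the resulting relative-month columns across users then gives the before-and-after curve described in the question.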