Search within an array with a condition - VB.NET

I have two arrays I'm trying to compare at many levels. Both have the same structure with three "columns".
The first column contains the polygon's ID, the second an area type, and the third the percentage of each area type for a polygon.
So, for many rows, it will compare, for example: ID: 1, Type: aaa, %: 100.
But for some elements, I have many rows for the same ID. For example, I'll have ID 2, Type aaa, 25% --- ID 2, Type bbb, 25% --- ID 2, Type ccc, 50%. And in the second array, I'll have ID 2, Type aaa, 25% --- ID 2, Type bbb, 10% --- ID 2, Type eee, 38% --- ID 2, Type fff, 27%.
So, my function has to compare these two arrays and send me an email if there are differences.
(I won't show you the real code because it is 811 lines.) The first "If" condition is:
If array1.id = array2.id Then
    If array1.type = array2.type Then
        If array1.percent = array2.percent Then
            zone_verification = True
        Else
            zone_verification = False
        End If
    End If
End If
The problem is that there are more than 50,000 rows in each array. So when I run the function, for each array1.id, it searches through all 50,000 rows in array2. 50,000 searches for 50,000 rows... it takes a long time to run!
I'm looking for a way to make the search more specific so it runs faster. Example: I have many rows with ID 2 in array1. If there are rows with ID 2 in array2, find them and push them into a "sub array" or something like that, and search only in those specific rows. Then I'd have just X rows in array1 to compare with X rows in array2, not with 50,000. And when each "ID 2" in array1 is done, do the same thing for "ID 4"... and for "ID 5"...
Hope it's clear. It's almost the first time I've used VB.NET, and I have this big function to get running.
Thanks
EDIT
Here's what I want to do.
I have two different layers in a geospatial database. Both layers have the same structure. They are a "spatial join" of the land parcels (55,000) and the land use layer. The first layer is the current one; the second layer is the one we'll use after 2015.
So I have, for each land parcel, the percentage of each land use. For a land parcel (ID 7580-80-2532), I can have 50% farming use (TYPE FAR-23) and 50% residential use (RES-112). In the first array, I'll have two rows with the same ID (7580-80-2532), but each one will have a different type (FAR-23, RES-112) and a different %.
In the second layer, the municipal zoning (land use) has changed. So the same land parcel will now be 40% residential use (RES-112), 20% commercial (COM-54) and 40% a new farming use (FAR-33).
So, I want to know if there are differences. Some land parcels will be exactly the same. Some parcels will keep the same land use types, but not the same percentage of each. And for some land parcels, there will be more or fewer land use types, with different percentages of each.
I want this script to compare these two layers and send me an email when there are differences between them for the same land parcel ID.
The script is already working, but it takes too much time.
The problem is, I think, that the script goes through all of array2 for each row in array1.
What I want is: when there is more than one row with the same ID in array1, take only that ID in both arrays.
Maybe if I order them by ID, I could write a condition like: "when you've found what you're looking for, stop searching as soon as you hit a different value"?
It's hard to explain clearly because I've only been using VB since last week... and English isn't my first language! ;)

If you just want to find out if there are any differences between the first and second array, you could do:
Dim diff = New HashSet(Of Polygon)(array1)
diff.SymmetricExceptWith(array2)
diff will contain any Polygon which is unique to array1 or array2. Note that this requires Polygon to override Equals and GetHashCode (or implement IEquatable(Of Polygon)) so that two polygons with the same ID, type and percentage compare as equal. If you want to do other types of comparisons, maybe you should explain what you're trying to do exactly.
UPDATE:
You could use grouping and lookups like this:
'Create lookup with first array, for fast access by ID
Dim lookupByID = array1.ToLookup(Function(p) p.id)

'Loop through each group of items with same ID in array2
For Each secondArrayValues In array2.GroupBy(Function(p) p.id)
    Dim currentID As Integer = secondArrayValues.Key 'Current ID is the grouping key

    'Retrieve values with same ID in array1
    'Use a hashset to easily compare for equality
    Dim firstArrayValues As New HashSet(Of Polygon)(lookupByID(currentID))

    'Check for differences between the two sets of data, for this ID
    If Not firstArrayValues.SetEquals(secondArrayValues) Then
        'Data has changed, do something
        Console.WriteLine("Differences for ID " & currentID)
    End If
Next

I am answering this question based on the first part that you wrote (that is, without the EDIT section). The correct answer should explain a good algorithm, but I suggest using DB capabilities, because databases are heavily optimized for exactly this kind of query.
Put all the records into two DB tables - O(n) time... If the records are static, you don't need to perform this step every time.
Table 1
id type percent
Table 2
id type percent
Then use a DB query, something like this:
select count(*)
from table1 t1
where not exists (select 1
                  from table2 t2
                  where t2.id = t1.id
                    and t2.type = t1.type
                    and t2.percent = t1.percent)
(you can use better queries; what I am trying to say is: give control to the DB to perform this operation)
Retrieve the result in your code and perform the necessary operations.
EDIT
1) You can sort them in O(n log n) time based on ID + type + percent and then perform a binary search.
2) Store the first array's records in a hash map with an appropriate key - it could be the ID only, or ID + type.
Building the map takes O(n) time, and searching, if the key is right, takes constant time.
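As a rough illustration of the second suggestion, here is a minimal C# sketch (the Row type and its fields are assumptions for the example, not the asker's real structure): key the first array by ID + type, then probe it once per row of the second array.

using System;
using System.Collections.Generic;

class Row
{
    public int Id;
    public string Type;
    public int Percent;
}

class Program
{
    static void Main()
    {
        Row[] array1 = { new Row { Id = 2, Type = "aaa", Percent = 25 } };
        Row[] array2 = { new Row { Id = 2, Type = "aaa", Percent = 30 } };

        // O(n): index the first array by (ID, type)
        var byKey = new Dictionary<(int, string), Row>();
        foreach (var r in array1)
            byKey[(r.Id, r.Type)] = r;

        // O(n): one constant-time lookup per row of the second array
        // (rows present only in array1 would need a second pass)
        foreach (var r in array2)
        {
            if (!byKey.TryGetValue((r.Id, r.Type), out var match))
                Console.WriteLine($"ID {r.Id}, type {r.Type}: only in array2");
            else if (match.Percent != r.Percent)
                Console.WriteLine($"ID {r.Id}, type {r.Type}: {match.Percent}% vs {r.Percent}%");
        }
    }
}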

You need to define a structure to store this data. We'll store all the data in a LandParcel class, which will have a HashSet<ParcelData>:
public class ParcelData
{
    public ParcelType Type { get; set; } // This can be an enum, string, etc.
    public int Percent { get; set; }
    // Redefine Equals and GetHashCode conveniently
}

public class LandParcel
{
    public ID Id { get; set; } // Whatever the type of the ID is...
    public HashSet<ParcelData> Data { get; set; }
}
Now you have to build your data structure, with something like this:
Dictionary<ID, LandParcel> data1 = new Dictionary<ID, LandParcel>();
foreach (var item in array1)
{
    LandParcel p;
    if (!data1.TryGetValue(item.id, out p))
        data1[item.id] = p = new LandParcel { Id = item.id, Data = new HashSet<ParcelData>() };
    // Can this data be repeated?
    p.Data.Add(new ParcelData { Type = item.type, Percent = item.percent });
}
You do the same with a data2 dictionary for the second array. Now you iterate over all the items in data2 and compare them with the item with the same ID in data1:
foreach (var parcel2 in data2.Values)
{
    var parcel1 = data1[parcel2.Id]; // Beware: throws if the ID is missing from data1!
    if (!parcel1.Data.SetEquals(parcel2.Data))
    {
        // You have different parcels
    }
}
(Now that I look at it, we are practically doing a small database query here, kind of smelly code ...)
Sorry for the C# code, since I don't really feel comfortable with VB, but it should be fairly straightforward to translate.

Related

Maintaining auto ranking as a column in MongoDB

I am using MongoDB as my database.
I have data which contains rank and name as columns. A new row can be inserted with a rank that already exists, or with a new one.
If the rank is the same as an existing one, the ranks of the other rows must be adjusted: rows ranked at or below the inserted rank must have their rank incremented by one, while rows ranked above it remain as they are.
The feature is something like a numbered bullet list in MS Word-type applications, where inserting a row in between adjusts the numbering of the rows below it.
Rank 1 is the highest rank.
For example, there are 3 rows:
Name Rank
A    1
B    2
C    3
Now I want to insert a row with D as the name and 2 as the rank. After the insert, the DB should look like this:
Name Rank
A    1
B    3
C    4
D    2
Probably I could achieve this with database triggers, by updating the other rows.
I have a couple of questions:
(a) Is there any better way than using a database trigger to achieve this kind of scenario? Updating all the rows might be a time-consuming job.
(b) Does MongoDB support database triggers natively?
Best Regards,
Saurav
No, MongoDB does not provide triggers (yet). Also, I don't think a trigger is really a great way to achieve this.
So I would just like to throw out some ideas; see if they make sense.
Approach 1
Maybe instead of disturbing that many documents, you can create a collection with only one document (let's call the collection ranking). In that document, have an array field called ranks. Since it's an array, it already maintains a sequence.
{
    _id: "RANK",
    "ranks": ["A", "B", "C"]
}
Now if you want to add D to this ranking at the 2nd position:
db.ranking.update({_id:"RANK"},{$push : {"ranks":{$each : ["D"],$position:1}}});
it would add D at index 1, which is the 2nd position, considering the index starts at 0.
{
    _id: "RANK",
    "ranks": ["A", "D", "B", "C"]
}
But there is a catch: what if you want to move C from the 4th position to the 1st? You need to remove it from the end and put it at the beginning. I am fairly sure both operations can't be achieved in a single update (I didn't dig into the options much), so we can run two queries:
db.ranking.update({_id:"RANK"},{$pull : {"ranks": "C"}});
db.ranking.update({_id:"RANK"},{$push : {"ranks":{$each : ["C"],$position:0}}});
Then it would be like
{
    _id: "RANK",
    "ranks": ["C", "A", "D", "B"]
}
maintaining the rest of the sequence.
Now you would probably want to store ids instead of A, B, C etc. One document can be 16MB, so this ranks array can store more than 1.3 million id entries, if each id is a 12-byte MongoDB ObjectId. If that is not enough, we still have the option of follow-up document(s) with further rankings.
Approach 2
You can also, instead of having the rank as a number, just have two fields like followedBy and precededBy.
So your user documents would look like:
{
    _id: "A",
    "followedBy": "B"
}
{
    _id: "B",
    "followedBy": "C",
    "precededBy": "A"
}
{
    _id: "C",
    "precededBy": "B"
}
If you want to add D at the second position, you need to update D's two new neighbours and insert the new document, so it is a change to only two existing documents:
{
    _id: "A",
    "followedBy": "D" // changed from B to D
}
{
    _id: "B",
    "followedBy": "C",
    "precededBy": "D" // changed from A to D
}
{
    _id: "C",
    "precededBy": "B"
}
{
    _id: "D",
    "followedBy": "B",
    "precededBy": "A"
}
The downside of this approach is that you cannot sort by ranking in a query; you have to fetch all the documents into the application and build a linked-list sort of structure.
This approach just preserves the ranking with a minimum of DB changes.
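To illustrate that last point, here is a minimal C# sketch of rebuilding the order in the application from the followedBy/precededBy pairs (the RankDoc type and field names are assumptions for the example):

using System;
using System.Collections.Generic;
using System.Linq;

record RankDoc(string Id, string FollowedBy, string PrecededBy);

class Program
{
    static List<string> ToOrderedList(IEnumerable<RankDoc> docs)
    {
        var byId = docs.ToDictionary(d => d.Id);
        // The head is the only document without a precededBy value.
        var current = byId.Values.Single(d => d.PrecededBy == null);
        var ordered = new List<string>();
        while (current != null)
        {
            ordered.Add(current.Id);
            current = current.FollowedBy != null ? byId[current.FollowedBy] : null;
        }
        return ordered;
    }

    static void Main()
    {
        // The four documents from the example above, after D was inserted.
        var docs = new[]
        {
            new RankDoc("A", "D", null),
            new RankDoc("B", "C", "D"),
            new RankDoc("C", null, "B"),
            new RankDoc("D", "B", "A"),
        };
        Console.WriteLine(string.Join(" -> ", ToOrderedList(docs))); // A -> D -> B -> C
    }
}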

SQL Schema - car model with modifiers as unique

I need to build a DB for the following scenario:
I will have an input stream of auctions, and I want to build a price histogram for the items in those auctions (i.e. what they usually go for, etc.).
The input stream looks something like:
[{'item_id': 1, ... 'price': 123, ...},
 {'item_id': 1, ... 'price': 124, ... 'modifiers': [1, 2, 3]},
 {'item_id': 1, ... 'price': 125, ... 'modifiers': [100, 150, 500, ...]},
 {'item_id': 2, ... 'price': 200, ...},
 ...]
As you might have noticed, an item doesn't consist only of an id, but also of modifiers. Think of it as a car that can be modified with extra stuff (e.g. AC, electric windows, etc.).
What would be the most efficient way to store this information? Basically, what I want is a unique id for each combination that can occur. It's not necessary to store every combination up front; if an auction arrives for a combination that doesn't exist yet, create it then.
I thought of something like:
base_item:
    id
modifier:
    id
item:
    id (autonumber)
    base_item_id
item_modifications:
    item_id (FK item.id)
    modification_id (FK modifier.id)
item_price_history:
    item_id (FK item.id)
    price
    time
This setup might work. The problem is, imagine I have hundreds of millions of such auctions every day (i.e. the auction information is updated every 20 minutes, and it consists of about 2 million auctions on average).
I want to be able to quickly do something like: INSERT INTO item_price_history VALUES (some_item_id, some_price, now()), but in order to do that, I need to find some_item_id. I know base_item_id and the modifiers (from the auction itself), but making such a lookup hundreds of millions of times is quite costly, I think?
I.e., pseudocode:
for a in auctions:
    base_item_id = a['item_id']
    modifiers = a['modifiers']
    price = a['price']
    actual_item_id = some_query(base_item_id, modifiers)  # expensive. Can be avoided?
    insert_into_histogram(actual_item_id, price)  # expensive but necessary, I think
Is there some obvious mistake I'm making in this design?
The schema you describe is the textbook solution.
But wow, that would be a beast to work with. As I understand it, every time you added a price record, you would have to find the item record with that exact set of modifiers: no more, no less. And if no such item record existed, you would then have to create it. Only then could you add the price record.
While I think one should be very careful about denormalizing, I'd be sorely tempted to denormalize in this case. Namely, it seems to me that in practice, the key to an item record is the combination of the base item id plus the modifiers. I'd be tempted to create a "modifier string" formed by stringing together codes or IDs for all the modifiers. Of course, to be workable, they'd have to be strung together in a defined sequence; you can't have both "1,2" and "2,1". But then you could easily find the desired item record: just have a function that builds the concatenated modifier string, and select item where base_item_id=#base and modifiers=#modifiers. If not found, create the record and all the associated modifier records.
I'd be strongly inclined to keep this modifier string redundant with the individual modifier records, because data that is strung together like this is very difficult to process. I mean, if you have a textbook schema like you describe and someone wants to know prices for cars with air conditioning, it's very easy to select * from price where price.item in (select id from item join modifier on modifier.item_id=item.id where modifier.name='AC'). But try to do that on the concatenated string; say the ID for AC is "17". select blah blah where modifier_string like '%17%' doesn't work: it will find 117 and 171 and so on. like '%,17,%' doesn't work because it won't match if 17 is first or last. Etc. That's why I routinely tell people NOT to string data together like this in general: create separate records. But if the most common use case is that you want the record with a specific combination of modifiers, creating a redundant modifier string is a plausible denormalization. (And the first time I typed that I accidentally typed 'demoralization', which may have been a Freudian slip.)
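For what it's worth, here is a small C# sketch of that canonical "modifier string" idea (the function name and delimiter convention are illustrative assumptions):

using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    // Build a canonical modifier string: sort the IDs so that "1,2" and "2,1"
    // map to the same key, and add leading/trailing delimiters so that even a
    // LIKE '%,17,%' scan would match at the ends if you ever needed one.
    static string CanonicalModifierString(IEnumerable<int> modifierIds) =>
        "," + string.Join(",", modifierIds.OrderBy(id => id)) + ",";

    static void Main()
    {
        Console.WriteLine(CanonicalModifierString(new[] { 2, 1, 3 })); // ",1,2,3,"
        // Then look the item up with something like:
        //   SELECT id FROM item WHERE base_item_id = @base AND modifiers = @key
        // and create the item row (plus its item_modifications rows) if not found.
    }
}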

Solution for allowing user sorting in SQLite

By user sorting I mean that, as a user on the site, you see a bunch of items and you are supposed to be able to reorder them (I'm using jQuery UI).
The user only sees 20 items on each page, but the total number of items can be in the thousands.
I assume I need to add another column in the table for the custom ordering.
If the user sees items 41-60 and sorts them like:
41 = 2nd
42 = 1st
43 = 5th
etc.
I can't just set the ordering column to 2, 1, 5.
I would need to go through the entire table and change each record.
Is there any way to avoid this and somehow sort only the current selection?
Add another column to store the custom order, just as you suggested yourself. You can avoid the problem of having to reassign all rows' values by using a REAL-typed column: for new rows, you still use an increasing integer sequence for the column's value. But if a user reorders a row, the decimal data type allows you to use the formula (previous row's value + next row's value) / 2 to update the column of the single row that was moved. There are two special cases to take care of, namely when a user moves a row to the very beginning or end of the list. In that case, just use min - 1 or max + 1, respectively.
This approach is the simplest I can think of, but it also has some downsides. First, it has a theoretical limitation due to the data type having only double precision: after a finite number of reorderings, two values can become so close together that their average is no longer a distinct number. But that's really only a theoretical limit you should never reach in practical applications. Also, the column will use 8 bytes of memory per row, which is probably much more than you actually need.
If your application might scale to the point where those 8 bytes matter, or where you might have users who overeagerly reorder rows, you should instead stick to an INTEGER column and use multiples of a constant number as the default values (e.g. 100, 200, 300, ...). You still use the update formula from above, but whenever two values become too close together, you reassign all values. By tweaking the constant multiplier to the average table size / user behaviour, you can control how often this expensive operation has to be done.
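A tiny C# sketch of that update rule (purely illustrative; prev and next would come from the rows adjacent to the drop position):

using System;

class Program
{
    // prev/next are the order values of the rows surrounding the new position;
    // null means the row was dropped at the very beginning or end of the list.
    static double NewOrderValue(double? prev, double? next)
    {
        if (prev == null && next == null) return 1.0;  // list was empty
        if (prev == null) return next.Value - 1.0;     // moved to the front: min - 1
        if (next == null) return prev.Value + 1.0;     // moved to the end: max + 1
        return (prev.Value + next.Value) / 2.0;        // dropped between two rows
    }

    static void Main()
    {
        Console.WriteLine(NewOrderValue(3.0, 4.0));  // 3.5
        Console.WriteLine(NewOrderValue(null, 1.0)); // 0
    }
}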
There are a couple of ways I can think of to do this. One would be to use a SELECT-from-SELECT style statement, as in something like this:
SELECT *
FROM (
    SELECT col1, col2, col3...
    FROM ...
    WHERE ...
    LIMIT n, m
) AS Table_A
ORDER BY ...
The second option would be to use temp tables such as:
INSERT INTO temp_table_A SELECT ... FROM ... WHERE ... LIMIT n,m;
SELECT * FROM temp_table_A ORDER BY ...
Another option to look at would be a jQuery plugin like DataTables.
One way I can think of is:
Add a new column (if feasible) or create a new table to hold the order of the items.
On any page you will show around 20 items based on the initial ordering.
Using jQuery's Draggable you can send updates to this table.
I think you can do this with an extra column.
First, you could prepopulate this new column with a default sort order and then allow the user to modify it interactively with the drag and drop of jQuery UI.
Let's say this user has 100 items in the table. You set the values in the order column to [1, 2, 3, ..., 99, 100]. I suggest that you run a script on the original table to set all items to a default sort order.
Now, going back to your example where the user is presented with items 41-60: the initial presentation in their browser would rank those at orders [41, 42, 43, ..., 59, 60]. You might also need to save the lowest order that appears in this subset, in this case 41. Or better yet, save the entire array of rankings and restore the exact same numbers in the new order. This covers the case where they select a set of records that are not already consecutively ordered, perhaps because they belong to someone else.
To demonstrate what I mean: when they reorder items in the page, your JavaScript reassigns those same numbers back to the subset in the new order. Like this:
item A : 41
item B : 45
item C : 46
item D : 47
item E : 51
item F : 54
item G : 57
then the user changes them to this order, but you reassign the numbers like this:
item D : 41
item F : 45
item E : 46
item A : 47
item C : 51
item B : 54
item G : 57
This should also work if the subset is consecutive.
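A short C# sketch of that reassignment step (names are illustrative; in practice this would live in the page's JavaScript or a server-side handler):

using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    // Give the reordered items the same rank values the subset already had,
    // so rows outside the visible page never need to be touched.
    static Dictionary<string, int> Reassign(IList<string> newOrder, IList<int> savedRanks) =>
        newOrder.Select((id, i) => (id, rank: savedRanks[i]))
                .ToDictionary(x => x.id, x => x.rank);

    static void Main()
    {
        var savedRanks = new[] { 41, 45, 46, 47, 51, 54, 57 };
        var newOrder = new[] { "D", "F", "E", "A", "C", "B", "G" };
        foreach (var kv in Reassign(newOrder, savedRanks))
            Console.WriteLine($"item {kv.Key} : {kv.Value}"); // D : 41, F : 45, ...
    }
}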

SQL "group by" like - grouping algorithm

I have a table with more than 2 columns (let's say A, B and C). One column holds numbers (C), and I want to do a "group by"-like grouping, summing the numbers in C, but I don't know an algorithm for doing so.
I tried sorting the table by each column, from last to first, leaving the numbers column (C) aside; so in this case: sort(B) and then sort(A). Then, wherever the nth row holds the same values in A and B as the (n-1)th row, I add the number from the nth row to the (n-1)th row (in column C) and delete the nth row. Otherwise, if the A or B value in row n differs from the one in row n-1, I just move to the next row. I repeat this until the last row of the table. But somehow it isn't working all the time, especially when there are a lot more columns (some rows remain ungrouped, maybe because of the sorting method).
I want to know whether this is a good grouping algorithm and I need to look for the problem in the sorting method, or whether I need another (sorting and/or grouping) algorithm, and which one. Thank you.
LE: Apparently the algorithm that I used works well, after a thorough check of the code and the fixing of some minor mistakes that junior programmers like me often make :)
I think a good way to do this would be to wrap your row in a class, implement the equals method, and then use a Map to add the values up:
import java.util.Objects;

public class MyRow {
    private Long columnA;
    private String columnB;
    private int columnC;

    public int getColumnC() { return columnC; }

    @Override
    public boolean equals(final Object other) {
        if (!(other instanceof MyRow)) {
            return false;
        }
        final MyRow otherRow = (MyRow) other;
        return this.columnA.equals(otherRow.columnA) && this.columnB.equals(otherRow.columnB);
    }

    // equals and hashCode must agree, or the HashMap lookups below will fail
    @Override
    public int hashCode() {
        return Objects.hash(columnA, columnB);
    }
}
Then you can iterate over all the rows, and create a Map for holding the sums of C.
final Map<MyRow, Integer> computedCSums = new HashMap<MyRow, Integer>();
for (final MyRow myRow : myRows) {
    if (computedCSums.get(myRow) == null) {
        computedCSums.put(myRow, myRow.getColumnC());
    } else {
        computedCSums.put(myRow, computedCSums.get(myRow) + myRow.getColumnC());
    }
}
Then, to get the sum of the grouped Cs for any row, you just do:
computedCSums.get(mySelectedRow);
I think there are three things to consider about group by:
Less-or-equal is abstract
Comparing two rows A and B according to their columns (C1..Cn) works like this: compare each column from C1 to Cn; if one is less, return the result; if the two values are equal, go on to compare the next column; repeat until a result is returned.
Which algorithm to choose
1) Build a binary search tree or a hash table to store the tuples. When we get a tuple, search for an equal tuple; if there is one, merge the tuples that have the same group value, else put the new tuple into our search structure.
2) Read some tuples, then sort them, walk the buffer and merge the same groups.
I prefer 1) over 2).
Memory size
If the input is huge, we must consider the memory limit.
We can use a merge algorithm to deal with this:
if memory exceeds our limit, write the in-memory tuples to tape, ordered by their group columns;
when we finish reading the input, merge the result sets from tape.
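As a minimal illustration of option 1) with a hash table (the types and sample data are made up), summing column C per (A, B) key:

using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        var rows = new (long A, string B, int C)[]
        {
            (1, "x", 10), (1, "x", 5), (2, "y", 7),
        };

        // Hash-based grouping: one dictionary lookup per row, summing C per (A, B) key.
        var sums = new Dictionary<(long, string), int>();
        foreach (var r in rows)
        {
            var key = (r.A, r.B);
            sums[key] = sums.TryGetValue(key, out var s) ? s + r.C : r.C;
        }

        foreach (var kv in sums)
            Console.WriteLine($"{kv.Key} -> {kv.Value}"); // (1, x) -> 15, (2, y) -> 7
    }
}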

SQL Select statement to find a unique entry based on many attributes

To put this work in context... I'm trying to filter a database of objects and build descriptions which can be verbalized for a speech UI. To minimise the descriptions, I want to find the shortest way to describe an object, based on the idea of Grice's Maxims.
It's possible in code by iterating through the records and running through all the permutations, but I keep thinking there ought to be a way to do this in SQL... so far I haven't found it. (I'm using Postgres.)
So I have a table that looks something like this:
id    colour  position  height
(int) (text)  (text)    (int)
0     "red"   "left"    9
1     "red"   "middle"  8
2     "blue"  "middle"  8
3     "blue"  "middle"  9
4     "red"   "left"    7
There are two things I wish to find, based on the attributes (excluding the ID):
a) are any of the records unique, based on the minimum number of attributes?
=> e.g. record 0 is unique based on colour and height
=> e.g. record 1 is the only red item in the middle
=> e.g. record 4 is unique as it's the only one which has a height of 7
b) how is a particular record unique?
=> e.g. how is record 0 unique? Because it is the only item with the colour red and a height of 9
=> e.g. record 4 is unique because it is the only item with a height of 7
It may of course be that no objects are unique based on the attributes, which is fine.
+++++++++++++++++++++++++
Answer for (a)
So the only way I can think to do this in SQL is to start off by testing a single attribute to see if there is a single match across all records. If not, then add attribute 2 and test again. Then try attributes 1 and 3. Finally try attributes 1, 2 and 3.
Something like this:
single column test:
select * from griceanmaxims
where height in (select height
                 from griceanmaxims
                 group by height
                 having count(height) = 1)
   or relpos in (select relpos
                 from griceanmaxims
                 group by relpos
                 having count(relpos) = 1)
   or colour in (select colour
                 from griceanmaxims
                 group by colour
                 having count(colour) = 1)
double column tests:
select colour, relpos
from griceanmaxims
group by colour, relpos
having count(*) = 1

select colour, height
from griceanmaxims
group by colour, height
having count(*) = 1

etc.
++++++++
I'm not sure if there's a better way, or how to join up the results from the double column tests.
Also, if anyone has any suggestions on how to determine the distinguishing factors for a record (as in question b), that would be great. My guess is that (b) would require (a) to be run for all of the field combinations, but I'm not sure if there's a better way.
Thanks in advance for any help on this one....
I like the idea of attacking the problem using a general-purpose language, e.g. C#:
1) Iterate through and see if any record has 1 attribute which is unique, e.g. ID = 4, which is unique because its height is 7. Take ID 4 out of the 'doing' collection and put it into the 'done' collection with the appropriate attribute.
Use a unit testing tool, e.g. MSUnit, to prove the above works.
2) Try to extend it to n attributes.
Unit test.
3) See if any can be unique with 2 attributes. Take those IDs out of 'doing' and into 'done' with the pairs of attributes.
Unit test.
4) Extend to m attributes.
Unit test.
5) Refactor, maybe using recursion.
Hope this helps.
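In that spirit, here is a rough C# sketch of the shortest-description search over the sample table (the Item type, attribute list and brute-force subset enumeration are illustrative assumptions, not a definitive implementation):

using System;
using System.Collections.Generic;
using System.Linq;

record Item(int Id, string Colour, string Position, int Height);

class Program
{
    static readonly (string Name, Func<Item, object> Get)[] Attrs =
    {
        ("colour", i => i.Colour),
        ("position", i => i.Position),
        ("height", i => i.Height),
    };

    static void Main()
    {
        var items = new[]
        {
            new Item(0, "red", "left", 9),
            new Item(1, "red", "middle", 8),
            new Item(2, "blue", "middle", 8),
            new Item(3, "blue", "middle", 9),
            new Item(4, "red", "left", 7),
        };

        foreach (var item in items)
        {
            // Try attribute subsets in order of increasing size: the first subset
            // under which exactly one record matches is the shortest description.
            var found = Subsets(Attrs.Length)
                .OrderBy(s => s.Count)
                .FirstOrDefault(s => items.Count(o =>
                    s.All(ix => Equals(Attrs[ix].Get(o), Attrs[ix].Get(item)))) == 1);
            Console.WriteLine(found == null
                ? $"{item.Id}: not unique"
                : $"{item.Id}: unique by {string.Join("+", found.Select(ix => Attrs[ix].Name))}");
        }
        // e.g. "0: unique by colour+height", "4: unique by height"
    }

    // All non-empty subsets of {0..n-1}, as index lists (bitmask enumeration).
    static IEnumerable<List<int>> Subsets(int n) =>
        Enumerable.Range(1, (1 << n) - 1)
            .Select(mask => Enumerable.Range(0, n)
                .Where(b => (mask & (1 << b)) != 0).ToList());
}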