SQL "group by" like - grouping algorithm - sql

I have a table with more than 2 columns (let's say A, B and C). One column holds some numbers (C) and I want to do a "group by" like grouping, summing the numbers in C, but I don't know the algorithm for doing so.
I tried sorting the table by each column (from last to first, aside from the numbers column (C), so in this case: sort(B) and then sort(A)) and then, wherever nth row holds same values in A and B as in n-1th row, I add the number from nth row to n-1th row (in the C column), and then delete the nth row. Else, if A or B value in row n differs from A or B value in n-1th row, I'll just move to the next row. Then I repeat the algorithm till the last row in table. But somehow this isn't working all the time, especially when there're a lot more columns (some rows remain ungrouped, maybe because of the sorting method).
I want to know whether this is a good grouping algorithm and I need to look for the problem into the sorting method, or I need to use another (sorting and/or grouping) algorithm and which one. Thank you.
LE: Apparently the algorithm that I used works well after a thorough check of the code and fixing some minor mistakes that junior programmers like me often make :)

I think a good way to do this would be to wrap your row into a class, implement the equals method, and then use a Map to add the values up:
public class MyRow {
private Long columnA;
private String columnB;
private int columnC;
#Override
public boolean equals(final Object other) {
if (!other instanceof MyRow) {
return false;
}
final MyRow otherRow = (MyRow) other;
return this.columnA.equals(otherRow.getColumnA()) && this.columnB.equals(otherRow.getColumnB);
}
}
Then you can iterate over all the rows, and create a Map for holding the sums of C.
final Map<MyRow, Integer> computedCSums = new HashMap<MyRow, Integer>();
for (final MyRow myRow : myRows) {
if (computedCSums.get(myRow) == null) {
computedCSums.put(myRow, myRow.getColumnC());
} else {
computedCSums.put(myRow, computedSums.get(myRow) + myRow.getColumnC());
}
}
Then, to get the sum of grouped Cs of any row, you just do:
computedCSum.get(mySelectedRow);

I think there is three things should be considered about group by
less or equal is abstract
comparing two rows A, B according it columns (C1..Cn) are like this : compare each column from C1 to Cn , if we can get which is less, then return ,or if the two values are equal, then we go to compare next, repeat this until return.
which algorithm we choose
1)build a binary search tree or a hash table to store tuples , when we get a tuple, search the equal tuple , if we have , then merge the tuple which have the same group value, else put it to our search structure
2) read some tuples, then sort , walk the buffer and merge the same group
I prefer 1 rather than 2.
memory size
if out input is huge, we must consider memory limit.
we can use merge algorithm to deal this.
if memory exceed our limit , then write the tuples in memory to the tape order by their group columns
when we finish reading the input, then merge the result set in tape.

Related

How to omit columns for specific rows in SQL?

As an example:
select
pre, sze, fdm_pre, val
from
form_data_stage
where
fdm_pre in (1,2,3,4)
order by
pre;
this will return values for all the columns pre, sze, fdm_pre, val for any of the fdm_pre values listed, i.e. (1,2,3,4). However, I only care about the pre and size values when fdm_pre is 1.
I could write a query such as
select
case when fdm_pre = 1 then pre else null end as pre,
case when fdm_pre = 1 then sze else null end as sze,
fdm_pre,
val
from
form_data_stage
where
fdm_pre in (1,2,3,4)
order by
pre;
But, is there some standard way of dealing with this situation? Is it generally more efficient to return all the columns, even if they aren't used? Or, would it be better to do some conditional checking as in the second query? The pre and sze columns are integer values.
It's not efficient to return all the columns when they are not used, specially if the unused columns are massive in size (BLOB, CLOB, TEXT, ARRAY, etc.).
In your particular example the columns "not returned" are small ones (measured in bytes), so it won't really matter if you produce nulls instead.
You can use execute string instead of this kind of script. In that way your desire columns should put in string and with if statement you can decide about which column is shown and which one is not necessary.
I can make a sample for you if you need.
I think keeping the query simple and clear to read is important. I would suggest you return all the relevant columns (that might be useful) and on the app logic side deal with these 2 options of column usage.

Maintaining auto ranking as a column in MongoDB

I am using MongoDB as my database.
I have data which contains rank and name as columns. Now a new row can be updated with a rank different from ranks already existing or can be same.
If same then the ranks of other rows must be adjusted .
Rows having lesser rank than the to be inserted one must be incremented by one and the rows which are having ranks can remain as it it.
Feature is something like number bulleted list in MS Word type of applications. Where inserting a row in between adjust the numbering of other rows below it.
Rank 1 is the highest rank.
For e.g. there are 3 rows
Name Rank
A 1
B 2
C 3
Now i want to update a row with D as name and 2 as rank. So now after the row insert, the DB should like below
Name Rank
A 1
B 3
C 4
D 2
Probably using Database triggers i can achieve this by updating the other rows.
I have couple of questions
(a) Is there any other better way than using database trigger for achieving this kind of scenario ? Updating all the rows might be a time consuming job.
(b) Does MongoDB support database trigger natively ?
Best Regards,
Saurav
No, MongoDB, does not provide triggers (yet). Also I don't think trigger is really a great way to achieve this.
So I would just like to throw some ideas, see if it makes sense.
Approach 1
Maybe instead of disturbing those many documents, you can create a collection with only one document (Let's call the collection ranking). In that document, have an array field call ranks. Since it's an array it's already maintaining a sequence.
{
_id : "RANK",
"ranks" : ["A","B","C"]
}
Now if you want to add D to this rank at 2nd position
db.ranking.update({_id:"RANK"},{$push : {"ranks":{$each : ["D"],$position:1}}});
it would add D to index 1 which is 2nd position considering index starts at 0.
{
_id : "RANK",
"ranks" : ["A","D","B","C"]
}
But there is a catch, what if you want to change C position to 1st from 4th, you need to remove it from end and put it in the beginning, I am sure both operation can't be achieved in single update (didn't dig in the options much), so we can run two queries
db.ranking.update({_id:"RANK"},{$pull : {"ranks": "C"}});
db.ranking.update({_id:"RANK"},{$push : {"ranks":{$each : ["C"],$position:0}}});
Then it would be like
{
_id : "RANK",
"ranks" : ["C","A","D","B"]
}
maintaining the rest of sequence.
Now you would probably want to store id instead of A,B,C etc. one document can be 16MB so basically this ranks array can store more than 1.3 million entries of id, if id is MongoDB ObjectId of 12 bytes each. if that is not enough, we still have option to have followup document(s) with further ranking.
Approach 2
you can also, instead of having rank as number, just have two field like followedBy and precededBy.
so your user document would look
{
_id:"A"
"followedBy":"B",
}
{
_id:"B"
"followedBy":"C",
"precededBy":"A"
}
{
_id:"c"
"precededBy":"B",
}
if you want to add D at second position, then you need to change the current 2nd position and you need to insert the new One, so it would be change in only two document
{
_id:"A"
"followedBy":"B",
}
{
_id:"B"
"followedBy":"C",
"precededBy":"D" //changed from A to D
}
{
_id:"c"
"precededBy":"B",
}
{
_id:"D"
"followedBy":"B",
"precededBy":"A"
}
The downside of this approach is that you cannot sort in query based on the ranking until and unless you get all these in application and create a linkedlist sort of structure.
This approach just preserve the ranking with minimum DB changes.

What's the effective way to count rows in Pig?

In Pig, what is the effective way to get count? We can do a GROUP ALL, but this is given only 1 reducer. When the data size is very large,say n Terabytes, can we try multiple reducers somehow?
dataCount = FOREACH (GROUP data ALL) GENERATE
'count' as metric,
COUNT(dataCount) as value;
Instead of using directly a GROUP ALL, you could divide it into two steps. First, group by some field and count the number of rows. And then, perform a GROUP ALL to sum all of these counts. This way, you would be able to count the number of rows in parallel.
Note, however, that if the field you use in the first GROUP BY does not have duplicates, the resulting counts will all be of 1 so there wont be any difference. Try using a field that has many duplicates to improve its performance.
See this example:
a;1
a;2
b;3
b;4
b;5
If we first group by the first field, which has duplicates, the final COUNT will deal with 2 rows instead of 5:
A = load 'data' using PigStorage(';');
B = group A by $0;
C = foreach B generate COUNT(A);
dump C;
(2)
(3)
D = group C all;
E = foreach D generate SUM(C.$0);
dump E;
(5)
However, if we group by the second one, which is unique, it will deal with 5 rows:
A = load 'data' using PigStorage(';');
B = group A by $1;
C = foreach B generate COUNT(A);
dump C;
(1)
(1)
(1)
(1)
(1)
D = group C all;
E = foreach D generate SUM(C.$0);
dump E;
(5)
I just dig a bit more in this topic, and it seems you don't have to afraid that a single reducer will have to process enormous amount of data if you're using an up-to-date pig version.
The algebraic UDF-s will handle the COUNT smart, and it's calculated on the mapper. So the reducer just have to deal with the aggregated data (counts/mapper).
I think it's introduced in 0.9.1, but 0.14.0 definitely has it
Algebraic Interface
An aggregate function is an eval function that takes a bag and returns
a scalar value. One interesting and useful property of many aggregate
functions is that they can be computed incrementally in a distributed
fashion. We call these functions algebraic. COUNT is an example of an
algebraic function because we can count the number of elements in a
subset of the data and then sum the counts to produce a final output.
In the Hadoop world, this means that the partial computations can be
done by the map and combiner, and the final result can be computed by
the reducer.
But my previous answer was definitely wrong:
In the grouping you can use the PARALLEL n keyword this set the
number of reducers.
Increase the parallelism of a job by specifying the number of reduce
tasks, n. The default value for n is 1 (one reduce task).

Solution for allowing user sorting in SQlite

By user sorting I mean that as a user on the site you see a bunch of items, and you are supposed to be able to reorder them (I'm using jQuery UI).
The user only sees 20 items on each page, but the total number of items can be thousands.
I assume I need to add another column in the table for custom ordering.
If the user sees items from 41-60, and and he sorts them like:
41 = 2nd
42 = 1st
43 = fifth
etc.
I can't just set the ordering column to 2,1,5.
I would need to go through the entire table and change each record.
Is there any way to avoid this and somehow sort only the current selection?
Add another column to store the custom order, just as you suggested yourself. You can avoid the problem of having to reassign all rows' values by using a REAL-typed column: For new rows, you still use an increasing integer sequence for the column's value. But if a user reorders a row, the decimal data type will allow you to use the formula ½ (previous row's value + next row's value) to update the column of the single row that was moved. You
have got two special cases to take care of, namely if a user moves a row to the very beginning or end of the list. In that case, just use min - 1 rsp. max + 1.
This approach is the simplest I can think of, but it also has some downsides. First, it has a theoretical limitation due to the datatype having only double-precision. After a finite number of reorderings, the values are too close together for their average to be a different number. But that's really only a theoretical limit you should never reach in practical applications. Also, the column will use 8 bytes of memory per row, which probably is much more than you actually need.
If your application might scale to the point where those 8 bytes matter or where you might have users that overeagerly reorder rows, you should instead stick to the INTEGER column and use multiples of a constant number as the default values (e.g. 100, 200, 300, ..). You still use the update formula from above, but whenever two values become too close together, you reassign all values. By tweaking the constant multiplier to the average table size / user behaviour, you can control how often this expensive operation has to be done.
There are a couple ways I can think of to do this. One would be to use a SELECT FROM SELECT style statement. As in something like this.
SELECT *
FROM (
SELECT col1, col2, col3...
FROM ...
WHERE ...
LIMIT n,m
) as Table_A
ORDER BY ...
The second option would be to use temp tables such as:
INSERT INTO temp_table_A SELECT ... FROM ... WHERE ... LIMIT n,m;
SELECT * FROM temp_table_A ORDER BY ...
Another option to look at would be jQuery plugin like DataTables
one way i can think of is:
Add a new column (if feasible) or create a new table for holding the order of the items.
On any page you will show around 20 items based on the initial ordering.
Using the jquery's Draggable you can send updates to this table
I think you can do this with an extra column.
First, you could prepopulate this new column with a default sort order and then allow the user to interactively modify it with the drag and drop of jquery-ui.
Lets say this user has 100 items in the table. You set the values in the order column to [1,2,3,...,99,100]. I suggest that you run a script on the original table to set all items to a default sort order.
Now going back to your example where the user is presented with items 41-60: the initial presentation in their browser would rank those at orders [41,42,43,...,59,60]. You might also need to save the lowest order that appears in this subset, in this case 41. Or better yet, save the entire array of rankings and restore the exact same numbers in the new order. This covers the case where they select a set of records that are not already consecutively ordered, perhaps because they belong to someone else.
To demonstrate what I mean: when they reorder them in the page, your javascript reassigns those same numbers back to the subset in the new order. Like this:
item A : 41
item B : 45
item C : 46
item D : 47
item E : 51
item F : 54
item G : 57
then the user changes them to this order, but you reassign the numbers like this:
item D : 41
item F : 45
item E : 46
item A : 47
item C : 51
item B : 54
item G : 57
This should also work if the subset is consecutive.

search within an array with a condition

I have two array I'm trying to compare at many levels. Both have the same structure with 3 "columns.
The first column contains the polygon's ID, the second a area type, and the third, the percentage of each area type for a polygone.
So, for many rows, it will compare, for example, ID : 1 Type : aaa % : 100
But for some elements, I have many rows for the same ID. For example, I'll have ID 2, Type aaa, 25% --- ID 2, type bbb, 25% --- ID 2, type ccc, 50%. And in the second array, I'll have ID 2, Type aaa, 25% --- ID 2, type bbb, 10% --- ID 2, type eee, 38% --- ID 2, type fff, 27%.
here's a visual example..
So, my function has to compare these two array and send me an email if there are differences.
(I wont show you the real code because there are 811 lines). The first "if" condition is
if array1.id = array2.id Then
if array1.type = array2.type Then
if array1.percent = array2.percent Then
zone_verification = True
Else
zone_verification = False
The probleme is because there are more than 50 000 rows in each array. So when I run the function, for each "array1.id", the function search through 50 000 rows in array2. 50 000 searchs for 50 000 rows.. it's pretty long to run!
I'm looking for something to get it running faster. How could I get my search more specific. Example : I have many id "2" in the array1. If there are many id "2" in the array2, find it, and push all the array2.id = 3 in a "sub array" or something like that, and search in these specific rows. So I'll have just X rows in array1 to compare with X rows in array 2, not with 50 000. and when each "id 2" in array1 is done, do the same thing for "id 4".. and for "id 5"...
Hope it's clear. it's almost the first time I use VB.net, and I have this big function to get running.
Thanks
EDIT
Here's what I wanna do.
I have two different layers in a geospatial database. Both layers have the same structure. They are a "spatial join" of the land parcels (55 000), and the land use layer. The first layer is the current one, and the second layer is the next one we'll use after 2015.
So I have, for each "land parcel" the percentage of each land use. So, for a "land parcel" (ID 7580-80-2532, I can have 50% of farming use (TYPE FAR-23), and 50% of residantial use (RES-112). In the first array, I'll have 2 rows with the same ID (7580-80-2532), but each one will have a different type (FAR-23, RES-112) and a different %.
In the second layer, the same the municipal zoning (land use) has changed. So the same "land parcel" will now be 40% of residential use (RES-112), 20% of commercial (COM-54) and 40% of a new farming use (FAR-33).
So, I wanna know if there are some differences. Some land parcels will be exactly the same. Some parcels will keep the same land use, but not the same percentage of each. But for some land parcel, there will be more or less land use types with different percentage of each.
I want this script to compare these two layers and send me an email when there are differences between these two layers for the same land parcel ID.
The script is already working, but it takes too much time.
The probleme is, I think, the script go through all array2 for each row in array 1.
What I want is when there are more than 1 rows with the same ID in array1, take only this ID in both arrays.
Maybe if I order them by IDs, I could write a condition. kind of "when you find what you're looking for, stop searching when you'll find a different value?
It's hard to explain it clearly because I've been using VB since last week.. And english isn't my first language! ;)
If you just want to find out if there are any differences between the first and second array, you could do:
Dim diff = New HashSet(of Polygon)(array1)
diff.SymmetricExceptWith(array2)
diff will contain any Polygon which is unique to array1 or array2. If you want to do other types of comparisons, maybe you should explain what you're trying to do exactly.
UPDATE:
You could use grouping and lookups like this:
'Create lookup with first array, for fast access by ID
Dim lookupByID = array1.ToLookup(Function(p) p.id)
'Loop through each group of items with same ID in array2
For Each secondArrayValues in array2.GroupBy(Function(p) p.id)
Dim currentID As Integer = secondArrayValues.Key 'Current ID is the grouping key
'Retrieve values with same ID in array1
'Use a hashset to easily compare for equality
Dim firstArrayValues As New HashSet(of Polygon)(lookupByID(currentID))
'Check for differences between the two sets of data, for this ID
If Not firstArrayValues.SetEquals(secondArrayValues) Then
'Data has changed, do something
Console.WriteLine("Differences for ID " & currentID)
End If
Next
I am answering this question based on the first part that you wrote (that is without the EDIT section). The correct answer should explain a good algorithm but I am suggesting you to use DB capabilities because they have optimized many queries for these purpose.
Put all the records in DB two tables - O(n) time ... If the records are static you dont need to perform this step every time.
Table 1
id type percent
Table 2
id type percent
Then use the DB query, some thing like this
select count(*) from table1 t1, table2 t2 where t1.id!=t2.id and t1.type!=t2.type
(you can use some better queries, what I am trying to say is give the control to DB to perform this operation)
retrieve the result in your code and perform the necessary operation.
EDIT
1) You can sort them in O(n logn) time based on ID + type + Percent and then perform binary search.
2) Store the first record in hash map with appropriate key - could be ID only or ID+type
this will take O(n) time and searching ,if key is correct, will take constant time.
You need to define a structure to store this data. We'll store all the data in a LandParcel class, which will have a HashSet<ParcelData>
public class ParcelData
{
public ParcelType Type { get; set; } // This can be an enum, string, etc.
public int Percent { get; set; }
// Redefine Equals and GetHashCode conveniently
}
public class LandParcel
{
public ID Id { get; set; } // Whatever the type of the ID is...
public HashSet<ParcelData> Data { get; set; }
}
Now you have to build your data structure, with something like this:
Dictionary<ID, LandParcel> data1 = new ....
foreach (var item in array1)
{
LandParcel p;
if (!data1.TryGetValue(item.id, out p)
data1[item.id] = p = new LandParcel(id);
// Can this data be repeated?
p.Data.Add(new ParcelData(item.type, item.percent));
}
You do the same with a data2 dictionary for the second array. Now you iterate for all items in data1 and compare them with the item with the same id for data2.
foreach (var parcel2 in data2.Values)
{
var parcel1 = data1[parcel2.ID]; // Beware with exceptions here !!!
if (!parcel1.Data.SetEquals(parcel2.Data))
// You have different parcels
}
(Now that I look at it, we are practically doing a small database query here, kind of smelly code ...)
Sorry for the C# code since I don't really feel so comfortable with VB, but it should be fairly straightforward.