GroupBy in Deedle

I want to group my dataframe by certain columns and then iterate over the groups like I can easily do using the LINQ .GroupBy().
var groupedByTags = dataFrame.GroupRowsUsing((i, series) => GetTags(series));
// GetTags extracts my groupKey
does do the grouping, but I couldn't figure out how to iterate over the groups. I think the GetByLevel function might help, but it seems I need to know the group keys before I can use it.
Also, I'm using the C# version of Deedle.

Related

Pyspark: Filter DF based on columns, then run every subset DF through a function

I am new to PySpark and am a bit confused about how to approach the problem.
I have a large dataframe, and for every combination of two columns I would like to filter out the matching subset and run it through the same algorithm.
Here is an example of how I run it (extremely inefficiently) now:
for letter in ['a', 'b', 'c']:
    for number in [1, 2, 3]:
        filtered_DF_1, filtered_DF_2 = filter_func(DF_1, DF_2, letter, number)
        process_function(filtered_DF_1, filtered_DF_2)
Basic filter function:
def filter_func(DF_1, DF_2, letter, number):
    DF_1 = DF_1.filter(
        (F.col("Letter") == letter) &
        (F.col("Number") == number)
    )
    DF_2 = DF_2.filter(
        (F.col("Letter") == letter) &
        (F.col("Number") == number)
    )
    return DF_1, DF_2
Since this is PySpark, I would like to parallelize this, as each iteration of the function can run independently.
Do I need to do some sort of mapping to get all my data subsets?
And then do I need to do anything to the process_function to make it available to all nodes as well to run and return an answer?
What is the best way to do this?
EDIT:
The process_function takes the filtered dataset and runs it through about 7 different functions that are already written in roughly 300 lines of PySpark; the end goal is to return a list of timestamps that are overbooked, based on a bunch of complicated logic.
I think my plan is to build a dictionary of letter --> [number], then explode that list to get every combination and create a dataset from that. Then map over that, and hopefully I will be able to create a UDF for my process_function.
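One possible way to get that fan-out, sketched below under the question's own names (DF_1, DF_2, filter_func and process_function are assumed to exist exactly as written above; the worker count is an arbitrary choice): Spark can schedule jobs submitted from separate driver threads concurrently, so each (Letter, Number) subset can be filtered and processed in its own thread.
from concurrent.futures import ThreadPoolExecutor

# Only the (Letter, Number) pairs that actually occur in the data.
pairs = [tuple(r) for r in DF_1.select("Letter", "Number").distinct().collect()]

def run_subset(pair):
    letter, number = pair
    filtered_1, filtered_2 = filter_func(DF_1, DF_2, letter, number)
    return process_function(filtered_1, filtered_2)

# Jobs submitted from separate driver threads can run concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:  # worker count is a guess
    results = list(pool.map(run_subset, pairs))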
I don't think you need to worry a lot about parallelizing or the execution plan, because the Spark Catalyst optimizer handles that in the background for you. It is also better to avoid UDFs; you can do most of this with built-in functions.
Are you running a transformation function or an aggregate function inside your process_function?
Please provide some test data and a suitable example of the expected output; that would help in answering.

Processing pandas data in declarative style

I have a pandas dataframe of vehicle co-ordinates (from multiple vehicles on multiple days). For each vehicle and each day, I do one of two things: apply an algorithm to the data, or filter it out of the dataset completely if it doesn't satisfy certain criteria.
To achieve this I use df.groupby(['vehicle_id', 'day']) and then .apply(algorithm) or .filter(condition), where algorithm and condition are functions which take in a dataframe.
I would like the full processing of my dataset (which involves multiple .apply and .filter steps) to be written out in a declarative style, as opposed to imperatively looping through the groups, with the goal that the whole thing looks something like:
df.group_by('vehicle_id', 'day').apply(algorithm1).filter(condition1).apply(algorithm2).filter(condition2)
Of course, the above code is incorrect, since .apply() and .filter() return new dataframes, and this is exactly my problem. They return all the data back in a single dataframe, and I find that I have to apply .groupby('vehicle_id', 'day') repeatedly.
Is there a nice way that I can write this out without having to group by the same columns over and over?
Since apply uses a for loop anyway (meaning there are no sophisticated optimizations in the background), I suggest using an actual for loop:
arr = []
for key, dfg in df.groupby(['vehicle_id', 'day']):
    dfg = dfg.do_stuff1()  # perform all needed operations on the group
    dfg = do_stuff2(dfg)
    arr.append(dfg)
result = pd.concat(arr)
An alternative is to create a function which runs all of the applies and filters sequentially on a given dataframe, and then run it over the groups with a single groupby/apply:
def all_operations(dfg):
    # Run all of the apply and filter steps on this single group.
    result_df = dfg  # placeholder for the actual processing
    return result_df
result = df.groupby(['vehicle_id', 'day']).apply(all_operations)
In both options you will have to deal with cases in which an empty dataframe is returned from the filters, if such cases exist.
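If the chained, declarative look from the question is the priority, one more option is a small pair of wrappers chained with DataFrame.pipe. This is only a sketch, reusing the column names and the hypothetical algorithm1/condition1 names from the question:
import pandas as pd

def grouped_apply(df, func, by=('vehicle_id', 'day')):
    # Re-group and apply func to each group; group_keys=False keeps
    # the original index instead of prepending the group keys.
    return df.groupby(list(by), group_keys=False).apply(func)

def grouped_filter(df, cond, by=('vehicle_id', 'day')):
    # Re-group and keep only the groups for which cond(group) is True.
    return df.groupby(list(by)).filter(cond)

# result = (df.pipe(grouped_apply, algorithm1)
#             .pipe(grouped_filter, condition1)
#             .pipe(grouped_apply, algorithm2)
#             .pipe(grouped_filter, condition2))
Each step still groups by the same columns internally, but the repetition now lives in one place instead of being spelled out at every step.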

Grouping ActiveRecord objects in batches

I'm trying to return a group of products in my Rails app so that I can separate them when I iterate in my view.
For example, if I have 13 products, I want the block in the view to put the first 7 on one row, break, and put the next 6 on the next row (I'm using CSS to put a shelf under the products).
I've been experimenting with find_in_batches, but can't seem to get this to work (not even sure it's the appropriate method).
@shelves = Product.find_in_batches(:batch_size => 7) { |products| products }
I usually use group_by when I want to group based on a date, for example -- is there a way to use group_by to group by counts, instead of model attributes?
find_in_batches.map will give you a no block error. What you actually want is:
@shelves = Product.all.in_groups_of(7)
And if you'd like the last group to not have extra nil objects padding it out, try:
@shelves = Product.all.in_groups_of(7, false)
Of course, you'll want to replace all with a more sensible scope so you're not loading your entire list of database objects into memory :)
You want an array of batches. Just mapping over the batches should do it.
@shelves = Product.find_in_batches(:batch_size => 7).map { |batch| batch }

Get the last element of the list in Django

I have a model:
class List(models.Model):
    data = ...
    previous = models.ForeignKey('List', related_name='r1')
    obj = models.ForeignKey('Obj', related_name='nodes')
This is a singly linked list whose nodes hold a reference to some obj of the Obj class. I can use the reverse relation to get all of a list's elements referring to obj with:
obj.nodes
But how can I get the very last node, without using raw SQL and with Django generating as few SQL queries as possible?
obj.nodes is a RelatedManager, not a list. As with any manager, you can get the last queried element by
obj.nodes.all().reverse()[0]
This only makes sense if there is a default ordering defined on the model's Meta class, because otherwise 'reverse' has no meaning. If you don't have a specified ordering, set one explicitly:
obj.nodes.order_by('-pk')[0]
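For completeness, a rough sketch of what such a default ordering could look like on the question's model (the choice of 'id' as the ordering field is an assumption):
from django.db import models

class List(models.Model):
    # ... fields exactly as in the question ...
    class Meta:
        ordering = ['id']  # gives reverse(), first() and last() a defined order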
len(obj.nodes.all()) - 1
should give you the index of the last element (counting from 0) of your list,
so something like
obj.nodes.all()[len(obj.nodes.all()) - 1]
should give the last element of the list.
I'm not sure it's good for your case; just give it a try :)
I see this question is quite old, but in newer versions of Django there are now first() and last() methods on querysets.
Well, on a plain Python list you can just use the [-1] index and it will return the last element. Maybe this question is close to yours:
Getting the last element of a list in Python
For further reading: Django does not support negative indexing, so using something like
obj.nodes.all()[-1]
will raise an error.
In newer versions of Django you can use the last() method on a queryset to get the last item of your list:
obj.nodes.last()
Another approach is to use len() to get the index of the last item of a list:
obj.nodes.all()[len(obj.nodes.all()) - 1]
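As a rough comparison of the query cost of these two approaches (a sketch reusing obj from the question):
# last() reverses the queryset's ordering and issues a single LIMIT 1
# query, so only one row is fetched from the database.
last_node = obj.nodes.order_by('pk').last()

# Indexing via len(...) evaluates the whole queryset first, pulling
# every row into memory before picking the final one.
all_nodes = list(obj.nodes.all())
last_node = all_nodes[len(all_nodes) - 1]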

Sorting and getting uniques

I have a string that looks like this:
"apples,fish,oranges,bananas,fish"
I want to be able to sort this list and get only the unique values. How do I do it in VB.NET? Please provide code.
A lot of your questions are quite basic, so rather than providing the code I'm going to provide the thought process and let you learn from implementing it.
Firstly, you have a string that contains multiple items separated by commas, so you're going to need to split the string at the commas to get a list. You can use String.Split for that.
You can then use some of the extension methods for IEnumerable<T> to filter and order the list. The ones to look at are Enumerable.Distinct and Enumerable.OrderBy. You can either write these as normal methods, or use Linq syntax.
If you need to get it back into a comma-separated string, then you'll need to re-join the strings using the String.Join method. Note that this needs an array, so Enumerable.ToArray will be useful in conjunction.
You can do it using LINQ, like this:
Dim input = "apples,fish,oranges,bananas,fish"
Dim strings = input.Split(","c).Distinct().OrderBy(Function(s) s)
I'm not a VB.NET programmer, but I can give you a suggestion (see the sketch after this list):
Split the string into an array.
Create a second array.
Cycle through the first array, adding any value that is not already in the second.
Upon completion, your second array will have only the unique values.
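Since that suggestion is language-agnostic, here is a minimal sketch of the same idea in Python, purely as an illustration of the algorithm rather than VB.NET code:
# Sketch of the manual split / deduplicate / sort approach described above.
text = "apples,fish,oranges,bananas,fish"

items = text.split(",")          # split the string at the commas
uniques = []                     # the "second array"
for item in items:
    if item not in uniques:      # add any value not already present
        uniques.append(item)
uniques.sort()                   # the question also asks for sorting

print(",".join(uniques))         # apples,bananas,fish,oranges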