Associating arbitrary properties with cells in Pandas

I’m fairly new to pandas. I want to know if there’s a way to associate my own arbitrary key/value pairs with a given cell.
For example, maybe row 5 of column ‘customerCount’ contains the int value 563. I want to use the pandas functions normally for computing sums, averages, etc. However, I also want to remember that this particular cell also has my own properties such as verified=True or flavor=‘salty’ or whatever.
One approach would be to make additional columns for those values. That would work if you just have a property or two on one column, but it would result in an explosion of columns if you have many properties, and you want to mark them on many columns.
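For illustration, a minimal sketch of that extra-columns approach (the values are made up, following the example above):

import pandas as pd

df = pd.DataFrame({"customerCount": [500, 563, 212]})
# One extra column per (column, property) pair -- fine for a couple of
# properties, but the column count grows multiplicatively.
df["customerCount_verified"] = [False, True, False]
df["customerCount_flavor"] = [None, "salty", None]
df["customerCount"].sum()  # numeric operations still work unchanged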
I’ve spent a while researching this, and it looks like I could accomplish it by subclassing pd.Series. However, it looks like a considerable amount of deep hacking would be involved. For example, if a column’s underlying data would normally be an ndarray of dtype ‘int’, I could internally store my own list of objects in my subclass, but override __getattribute__() so that my_series.dtype still yields ‘int’. The intent would be for pandas to keep treating the series as having dtype ‘int’, without knowing or caring what’s going on under the covers. However, it looks like I might have to override a bunch of things in my pd.Series subclass to make pandas behave normally.
Is there some easier way to accomplish this? I don’t necessarily need a detailed answer; I’m looking for a pointer in the right direction.
Thanks.

Related

Natural way of indexing elements in Flink

Is there a built-in way to index and access indices of individual elements of DataStream/DataSet collection?
Like in typical Java collections, where you know that e.g. the 3rd element of an ArrayList can be obtained by ArrayList.get(2) and, vice versa, ArrayList.indexOf(elem) gives us the index of (the first occurrence of) the specified element. (I'm not asking about extracting elements out of the stream.)
More specifically, when joining DataStreams/DataSets, is there a "natural"/easy way to join elements that came (were created) first, second, etc.?
I know there is a zipWithIndex transformation that assigns sequential indices to elements. I suspect the indices always start at 0? But I also suspect that they aren't necessarily assigned in the order the elements were created (i.e. by their event time). (It also exists only for DataSets.)
This is what I currently tried:
DataSet<Tuple2<Long, Double>> tempsJoIndexed = DataSetUtils.zipWithIndex(tempsJo);
DataSet<Tuple2<Long, Double>> predsLinJoIndexed = DataSetUtils.zipWithIndex(predsLinJo);
DataSet<Tuple3<Double, Double, Double>> joinedTempsJo = tempsJoIndexed
.join(predsLinJoIndexed).where(0).equalTo(0)...
And it seems to create the wrong pairs.
I see some possible approaches, but they're either non-Flink or not very nice:
1) I could of course assign an index to each element upon the stream's creation and have e.g. a stream of Tuples.
2) Work with event-time timestamps. (I suspect there isn't a way to key by timestamps, and even if there was, it wouldn't be useful for joining multiple streams like this unless the timestamps are actually assigned as indices.)
3) We could try "collecting" the stream first, but then we wouldn't be using Flink anymore.
The first approach seems like the most viable one, but it also seems redundant given that a stream should by definition be a sequential collection and, as such, its elements should have a sense of order (e.g. "I'm the 36th element because 35 elements already came before me").
I think you're going to have to assign index values to elements, so that you can partition the data sets by this index, and thus ensure that two records which need to be joined are being processed by the same sub-task. Once you've done that, a simple groupBy(index) and reduce() would work.
But assigning increasing ids without gaps isn't trivial, if you want to be reading your source data with parallelism > 1. In that case I'd create a RichMapFunction that uses the runtimeContext sub-task id and number of sub-tasks to calculate non-overlapping and monotonic indexes.
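A minimal sketch of that idea, here in Python/PyFlink rather than the Java DataSet API (class and field names are illustrative):

from pyflink.datastream.functions import RichMapFunction, RuntimeContext

class AssignIndex(RichMapFunction):
    # Gives each sub-task its own arithmetic progression of ids:
    # sub-task k of n emits k, k+n, k+2n, ... -- non-overlapping and
    # monotonically increasing within each sub-task.
    def open(self, runtime_context: RuntimeContext):
        self._stride = runtime_context.get_number_of_parallel_subtasks()
        self._next = runtime_context.get_index_of_this_subtask()

    def map(self, value):
        index = self._next
        self._next += self._stride
        return (index, value)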

Why does pandas.apply() work differently for Series and DataFrame columns

Apologies if this is a silly question, but I am not quite sure why this behavior is the case, and/or whether I am misunderstanding it. I was trying to create a function for the apply method, and noticed that if you run apply on a Series, the Series is passed to the (u)func as an np.array, whereas if you pass the same Series within a DataFrame of one column, it is passed as a Series.
This affects the way a simpleton like me writes the function (I prefer iloc indexing to integer-based indexing on the array), so I was wondering: is this on purpose, or a historical accident?
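One quick way to check what apply actually hands to the function is to print the argument's type (this is how I observed the behavior; exact behavior may vary across pandas versions):

import pandas as pd

def show_type(x):
    print(type(x))
    return x

s = pd.Series([1, 2, 3])
s.apply(show_type)                  # Series.apply: called once per element
s.to_frame("col").apply(show_type)  # DataFrame.apply: called once per column, as a Series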
Thanks,

Fast, efficient method of assigning large array of data to array of clusters?

I'm looking for a faster, more efficient method of assigning data gathered from a DAQ to its proper location in a large cluster containing arrays of subclusters.
My current method relies heavily on the OpenG cluster manipulation tools, but with a large data set the performance is far too slow.
The array and cluster location of each element of data from the DAQ is determined during an initialization phase and doesn't change during acquisition.
Because the data element origin and end points are the same throughout acquisition, I would think an array of memory locations could be created and the data directly assigned to its proper place. I'm just not sure how to implement such a thing.
The following code does what you want (the block diagram image is not reproduced here):
For each of your cluster elements (AMC, ANLG_PM and PA) you should add a case in the string case structure; for the elements AMC and PA you will need to place a second, nested case structure.
This is really more of a comment, but I do not have the reputation to leave those yet, so here it is:
Regarding adding cases for every possible value of Array name, is there any reason why you cannot use an enum here? Since you are placing it into a cluster anyway, I would suggest making a type-defined enum of your possible array names. That way, when you want to add or remove one, you only have to do it in one place.
You will still need to right-click on your case structures that use this enum and select Add item for every value if you are adding a value, or manually delete the obsolete value if you are removing one. I suppose some maintenance is required either way...

Why do we use multidimensional arrays?

I have an understanding of how multidimensional arrays work and how to use them, except for one thing: in what situations would we need to use them, and why?
Basically, multidimensional arrays are used when you want to put arrays inside an array.
Say you've got 10 students and each writes 3 tests. You can create an array like: arr_name[10][3]
So, calling arr_name[0][0] gives you the result of student 1 on test 1.
Calling arr_name[5][2] gives you the result of student 6 on test 3.
You could do this with a 30-position array, but the multidimensional version is:
1) easier to understand
2) easier to debug.
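A quick sketch of that example (the scores are made up):

# 10 students x 3 tests
scores = [[0] * 3 for _ in range(10)]
scores[0][0] = 87   # student 1, test 1
scores[5][2] = 91   # student 6, test 3
print(scores[5][2])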
Here are a couple examples of arrays in familiar situations.
You might imagine a two-dimensional array as a grid. So naturally it is useful when you're dealing with graphics. You might get a pixel from the screen by saying
pixel = screen[20][5] // get the pixel at row 20, column 5
That could also be done with a 3 dimensional array to represent 3d space.
An array could act like a spreadsheet. Here the rows are customers, and the columns are name, email, and date of birth.
name = customers[0][0]
email = customers[0][1]
dateofbirth = customers[0][2]
Really there is a more fundamental pattern underlying this. Things have things have things... and so on. And in a sense you're right to wonder whether you need multidimensional arrays, because there are other ways to represent that same pattern. It's just there for convenience. You could alternatively
Have a single-dimensional array and do some math to make it act multidimensional. If you indexed pixels one by one, left to right, top to bottom, you would end up with a million or so elements. Divide an index by the width of the screen to get the row; the remainder is the column (see the sketch after this list).
Use objects. Instead of using a multidimensional array in example 2 you could have a single dimensional array of Customer objects. Each Customer object would have the attributes name, email and dob.
So there's rarely only one way to do something. Just choose the clearest way. With arrays you're accessing by number; with objects, by name.
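A quick sketch of the index arithmetic behind the first alternative (the screen dimensions are illustrative):

WIDTH = 1920
flat = [0] * (WIDTH * 1080)   # one element per pixel, row after row

def pixel_at(index):
    row = index // WIDTH      # divide by the width to get the row
    col = index % WIDTH       # the remainder is the column
    return row, col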
Such a solution is intuitive when you are faced with accessing a data element identified by a multidimensional vector, i.e. when "which element" is defined by more than one "dimension".
Good uses for 2D arrays might be:
Matrix math, i.e. rotating things in space or on a plane, and more.
Maps, like game maps: top or side views, for either actual graphics or descriptive data.
Spreadsheet-like storage.
Multiple columns of display-table data.
Various kinds of graphics work.
I know there could be much more, so maybe someone else can add to this list in their answers.

Keeping an array sorted - at setting, getting or later?

As an aid to learning Objective-C/OOP, I'm designing an iOS app to store and display periodic body-weight measurements. I've got a singleton which returns an NSMutableArray, the shared store of measurement objects. Each measurement will have at least a date and a body weight, and I want to be able to add historic measurements.
I'd like to display the measurements in date order. What's the best way to do this? As far as I can see, the options are:
1) when adding a measurement, override addObject: to sort the shared store every time a measurement is added;
2) sort the mutable array when retrieving it;
3) retrieve the mutable array in whatever order it happens to be in the shared store, then sort it when displaying the table/chart.
It's likely that the data will be retrieved more frequently than a new datum is added, so option 1 would reduce redundant sorting of the shared store. Is that the best way, then?
You can use a modified version of (1). Instead of sorting the complete array each time a new object is inserted, you use the method described here: https://stackoverflow.com/a/8180369/1187415 to insert the new object into the array at the correct place.
Then for each insert you have only a binary search to find the correct index for the new object, and the array is always in correct order.
Since you said that the data is more frequently retrieved than new data is added, this seems to be more efficient.
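The same pattern in Python's standard library, just to illustrate the idea (the linked answer shows how to obtain the insertion index in Objective-C):

import bisect

weights = []                       # kept sorted at all times
def insert_measurement(value):
    bisect.insort(weights, value)  # binary search for the index, then insert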
Setting your special case aside for a moment, this question is not so easy to answer. There are two basic solutions:
Keep the array unsorted, and when you try to access an element while the array is not sorted, sort it then. Let's call it "lazy sorting".
Keep the array sorted when inserting elements. Note this is not about appending the new element at the end and then sorting the whole array; it is about finding where the element should go (binary search) and placing it there. Let's call it "sorted insert".
Both techniques are correct and useful and deciding which one is better depends on your use cases.
Example:
You want to insert hundreds of elements into the array, then access the elements, then again insert hundreds of elements, then access. In summary, you will be inserting values in big chunks. In this case, lazy sorting will be better.
You will often insert individual elements and you will access the elements often. Then sorted insert will have better performance.
Something in the middle (between inserting 1 and inserting tens of elements). You probably don't care which one of the methods will be used.
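A minimal sketch of the two techniques (the class names are illustrative):

import bisect

class LazySorted:
    # "lazy sorting": append freely, sort only when reading
    def __init__(self):
        self._items, self._dirty = [], False
    def add(self, x):
        self._items.append(x)
        self._dirty = True
    def items(self):
        if self._dirty:
            self._items.sort()
            self._dirty = False
        return self._items

class SortedInsert:
    # "sorted insert": binary-search the position on every add
    def __init__(self):
        self._items = []
    def add(self, x):
        bisect.insort(self._items, x)
    def items(self):
        return self._items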
(Note that you can also use specialized structures to keep a collection sorted that are not based on NSArray, e.g. structures based on a balanced tree that keeps the number of elements in each subtree.)