Why does pandas.apply() work differently for Series and DataFrame columns

Apologies if this is a silly question, but I am not quite sure why this behavior is the case, or whether I am misunderstanding it. I was writing a function for the 'apply' method and noticed that if you run apply on a Series, the data is passed to the (u)func as an np.array, whereas if you pass the same Series inside a DataFrame of one column, it is passed as a Series.
This affects the way a simpleton like me writes the function (I prefer iloc indexing to integer-based indexing on the array), so I was wondering whether this is on purpose or a historical accident?
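For illustration, here is roughly how I have been checking what the callable receives (a minimal sketch with made-up data; the exact behavior may differ between pandas versions):

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], name="x")
df = s.to_frame()

def show(obj):
    # Print the type the callable actually receives, return a dummy value.
    print(type(obj))
    return 0

s.apply(show)             # apply on the bare Series
df.apply(show)            # apply on the same data as a one-column DataFrame
df.apply(show, raw=True)  # raw=True passes each column as a numpy ndarray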
Thanks,

Related

Pandas dataframe sharing between functions isn't working

I have a script that modifies a pandas dataframe from several concurrent functions (asyncio coroutines). Each function adds rows to the dataframe, and it's important that the functions all share the same dataframe object. However, when I add a row with pd.concat, a new copy of the dataframe is created; I can tell because each dataframe now has a different memory location as given by id().
As a result, the functions no longer share the same object. How can I keep all functions pointed at a common dataframe object?
Note that this issue doesn't arise when I use the append method, but that is being deprecated.
pandas dataframes are efficient because they use contiguous memory blocks, frequently of fundamental types like int and float. You can't just add a row, because the dataframe doesn't own the next bit of memory it would have to expand into. Concatenation usually requires that new memory is allocated and data is copied; once that happens, anything still referring to the original dataframe is pointing at a stale object.
If you know the final size you want, you can preallocate and fill. Otherwise, you are better off keeping a list of new dataframes and concatenating them all at once. Since these are parallel procedures that don't depend on each other's output, this should be a feasible option.
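A minimal sketch of the keep-a-list approach (made-up names and data): each coroutine appends its own small DataFrames to a shared list, and the final DataFrame is built with a single concat once everything has finished.

import asyncio
import pandas as pd

# Shared list of pieces; only the list is mutated while the coroutines run.
pieces = []

async def worker(name, n):
    for i in range(n):
        await asyncio.sleep(0)  # stand-in for real async work
        pieces.append(pd.DataFrame({"worker": [name], "value": [i]}))

async def main():
    await asyncio.gather(worker("a", 3), worker("b", 3))
    # Build the final dataframe once, after all coroutines are done.
    return pd.concat(pieces, ignore_index=True)

df = asyncio.run(main())
print(df)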

Does pandas categorical data speed up indexing?

Somebody told me it is a good idea to convert identifying columns (e.g. person numbers) from strings to categorical. This would speed up some operations like searching, filtering and grouping.
I understand that a 40-character string costs much more RAM and time to compare than a simple integer.
But I would expect some overhead from a string-to-integer lookup table for translating between the two types and knowing which integer belongs to which string "number".
Maybe .astype('category') can help me here? Isn't this an integer internally? Does this speed up some operations?
The user guide has the following about categorical data use cases:
The categorical data type is useful in the following cases:
A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.
The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see here.
As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
See also the API docs on categoricals.
The book, Python for Data Analysis by Wes McKinney, has the following on this topic:
The categorical representation can yield significant performance improvements when you are doing analytics. You can also perform transformations on the categories while leaving the codes unmodified.
Some example transformations that can be made at relatively low cost are:
Renaming categories
Appending a new category without changing the order or position of the existing categories
GroupBy operations can be significantly faster with categoricals because the underlying algorithms use the integer-based codes array instead of an array of strings.
Series containing categorical data have several special methods similar to the Series.str specialized string methods. This also provides convenient access to the categories and codes.
In large datasets, categoricals are often used as a convenient tool for memory savings and better performance.
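As a rough illustration of the memory and codes points above (a minimal sketch with synthetic data; the exact numbers will vary):

import numpy as np
import pandas as pd

# Synthetic identifier column: many rows, only a few distinct values.
ids = np.random.choice([f"person-{i:04d}" for i in range(100)], size=1_000_000)
s_str = pd.Series(ids)
s_cat = s_str.astype("category")

# Memory: one Python string object per row vs. integer codes plus 100 categories.
print(s_str.memory_usage(deep=True))
print(s_cat.memory_usage(deep=True))

# The integer codes and the string categories are both directly accessible.
print(s_cat.cat.codes.head())
print(s_cat.cat.categories[:5])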

Associating arbitrary properties with cells in Pandas

I’m fairly new to pandas. I want to know if there’s a way to associate my own arbitrary key/value pairs with a given cell.
For example, maybe row 5 of column ‘customerCount’ contains the int value 563. I want to use the pandas functions normally for computing sums, averages, etc. However, I also want to remember that this particular cell also has my own properties such as verified=True or flavor=‘salty’ or whatever.
One approach would be to make additional columns for those values. That would work if you just have a property or two on one column, but it would result in an explosion of columns if you have many properties, and you want to mark them on many columns.
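Concretely, the extra-columns workaround I mean would look something like this (hypothetical column names):

import pandas as pd

# Every property on 'customerCount' becomes its own parallel column.
df = pd.DataFrame({"customerCount": [563, 120, 987]})
df["customerCount_verified"] = [True, False, True]
df["customerCount_flavor"] = ["salty", None, "sweet"]

# Normal numeric operations on the data column still work.
print(df["customerCount"].sum())
print(df["customerCount"].mean())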
I’ve spent a while researching it, and it looks like I could accomplish this by subclassing pd.Series. However, it looks like there would be a considerable amount of deep hacking involved. For example, if a column’s underlying data would normally be an ndarray of dtype ‘int’, I could internally store my own list of objects in my subclass, but hack __getattribute__() so that my_series.dtype yields ‘int’. The intent would be that pandas still treats the series as having dtype ‘int’, without knowing or caring what’s going on under the covers. However, it looks like I might have to override a bunch of things in my pd.Series subclass to make pandas behave normally.
Is there some easier way to accomplish this? I don’t necessarily need a detailed answer; I’m looking for a pointer in the right direction.
Thanks.

Natural way of indexing elements in Flink

Is there a built-in way to index and access indices of individual elements of DataStream/DataSet collection?
Like in typical Java collections, where you know that e.g. the 3rd element of an ArrayList can be obtained by ArrayList.get(2), and vice versa ArrayList.indexOf(elem) gives us the index of (the first occurrence of) the specified element. (I'm not asking about extracting elements out of the stream.)
More specifically, when joining DataStreams/DataSets, is there a "natural"/easy way to join elements that came (were created) first, second, etc.?
I know there is a zipWithIndex transformation that assigns sequential indices to elements. I suspect the indices always start with 0? But I also suspect that they aren't necessarily assigned in the order the elements were created in (i.e. by their Event Time). (It also exists only for DataSets.)
This is what I have tried so far:
DataSet<Tuple2<Long, Double>> tempsJoIndexed = DataSetUtils.zipWithIndex(tempsJo);
DataSet<Tuple2<Long, Double>> predsLinJoIndexed = DataSetUtils.zipWithIndex(predsLinJo);
DataSet<Tuple3<Double, Double, Double>> joinedTempsJo = tempsJoIndexed
.join(predsLinJoIndexed).where(0).equalTo(0)...
And it seems to create wrong pairs.
I see some possible approaches, but they're either non-Flink or not very nice:
I could of course assign an index to each element upon the stream's creation and have e.g. a stream of Tuples.
Work with event-time timestamps. (I suspect there isn't a way to key by timestamps, and even if there was, it wouldn't be useful for joining multiple streams like this unless the timestamps are actually assigned as indices.)
We could try "collecting" the stream first, but then we wouldn't be using Flink anymore.
The first approach seems like the most viable one, but it also seems redundant, given that a stream should by definition be a sequential collection and, as such, its elements should have a sense of order (e.g. "I'm the 36th element because 35 elements already came before me").
I think you're going to have to assign index values to elements, so that you can partition the data sets by this index, and thus ensure that two records which need to be joined are being processed by the same sub-task. Once you've done that, a simple groupBy(index) and reduce() would work.
But assigning increasing ids without gaps isn't trivial if you want to read your source data with parallelism > 1. In that case I'd create a RichMapFunction that uses the RuntimeContext sub-task id and the number of sub-tasks to calculate non-overlapping and monotonic indexes.
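To sketch the indexing scheme being described (plain Python for brevity, not Flink code; the sub-task numbers are made up): sub-task k out of n emits the indexes k, k + n, k + 2n, ..., which are monotonic within each sub-task and never collide across sub-tasks.

def make_indexer(subtask_id, num_subtasks):
    # Each parallel sub-task assigns: subtask_id, subtask_id + n, subtask_id + 2n, ...
    counter = 0
    def next_index():
        nonlocal counter
        idx = subtask_id + counter * num_subtasks
        counter += 1
        return idx
    return next_index

idx0 = make_indexer(0, 2)  # sub-task 0 of 2 -> 0, 2, 4, ...
idx1 = make_indexer(1, 2)  # sub-task 1 of 2 -> 1, 3, 5, ...
print([idx0() for _ in range(3)], [idx1() for _ in range(3)])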

Extracting data as a list from a Pandas dataframe while preserving order

Suppose I have some Pandas dataframe df that has a column called "HEIGHT", among many other columns.
If I issue list(df["HEIGHT"]), then this will give me a list of the items in that column in the exact order in which they were in the dataframe, i.e. ordered by the index of the dataframe.
Is that always the case? The df["HEIGHT"] command will return a Series and list() will convert it to a list. But are these operations always order-preserving? Interestingly, in the book by the Pandas author (!), from my reading so far it is unclear to me when these elementary operations preserve order; is order perhaps always preserved, or is there some simple rule to know when order should be preserved?
The order of elements in a pandas Series (i.e., a column in a pandas DataFrame) will not change unless you do something that makes it change. And the order of a python list is guaranteed to reflect insertion order (SO thread).
So yes, df[0].tolist() (slightly faster than list(df[0])) should always yield a Python list of elements in the same order as the elements in df[0].
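A quick check of that claim, using a made-up column and an index whose labels are deliberately not in sorted order:

import pandas as pd

# Positional (row) order is kept regardless of the index labels.
df = pd.DataFrame({"HEIGHT": [180, 165, 172]}, index=["c", "a", "b"])

print(df["HEIGHT"].tolist())   # [180, 165, 172]
print(list(df["HEIGHT"]))      # [180, 165, 172]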
Order will always be preserved. When you use the list function, you provide it an iterable, and the list is constructed by iterating over it. For more information, you might want to read PEP 234 on iterators.
The iteration order is determined by the iterator the object provides. Iterators for a Series are provided by pd.Series.__iter__() (the standard way to expose an iterator for an object, and what list() and similar constructors look up). For more information on iteration and indexing in Pandas, consider reading the relevant API reference section and the much more in-depth indexing documentation.
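For example (a minimal sketch):

import pandas as pd

s = pd.Series([10, 20, 30], index=["x", "y", "z"])

# list() consumes the iterator returned by Series.__iter__, which walks the
# values in positional order, independent of the index labels.
it = iter(s)        # equivalent to s.__iter__()
print(next(it))     # 10
print(list(s))      # [10, 20, 30]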