Extracting data as a list from a Pandas dataframe while preserving order

Suppose I have some Pandas dataframe df that has a column called "HEIGHT", among many other columns.
If I issue list(df["HEIGHT"]), then this will give me a list of the items in that column in the exact order in which they appear in the dataframe, i.e. in the dataframe's row order.
Is that always the case? The df["HEIGHT"] expression will return a Series and list() will convert it to a list. But are these operations always order-preserving? Interestingly, in the book by the Pandas author (!), from my reading so far it is unclear to me when these elementary operations preserve order; is order perhaps always preserved, or is there some simple rule to know when order should be preserved?

The order of elements in a pandas Series (i.e., a column in a pandas DataFrame) will not change unless you do something that explicitly reorders it (sorting, reindexing, and the like). And the order of a Python list is guaranteed to reflect insertion order.
So yes, df[0].tolist() (slightly faster than list(df[0])) should always yield a Python list of elements in the same order as the elements in df[0].
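For illustration, here is a minimal check with a toy frame (the column name and values are invented; the deliberately unsorted index shows that it is the positional row order that carries over):
import pandas as pd
# Small example frame; the index is intentionally not sorted.
df = pd.DataFrame({"HEIGHT": [180, 165, 172]}, index=[10, 2, 7])
as_list = list(df["HEIGHT"])       # [180, 165, 172] -- row order, not sorted by index
as_tolist = df["HEIGHT"].tolist()  # same values, same order, usually a bit faster
assert as_list == as_tolist == [180, 165, 172]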

Order will always be preserved. When you use the list function, you provide it an iterable, and a list is constructed by iterating over it. For more information on iterators, you might want to read PEP 234 on iterators.
The iteration order is determined by the iterator the object provides. For a Series, the iterator comes from pd.Series.__iter__() (the standard way to obtain an iterator for an object, which is what the list() constructor and similar look up). For more information on iteration and indexing in Pandas, consider reading the relevant API reference section and the much more in-depth indexing documentation.
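As a rough sketch of what list() does under the hood (illustrative only; a hand-rolled loop over the Series' iterator):
import pandas as pd
s = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])
# list(s) is roughly equivalent to driving the iterator by hand:
it = iter(s)  # calls pd.Series.__iter__, which yields the values in row order
values = []
for v in it:
    values.append(v)
assert values == list(s) == [1.5, 2.5, 3.5]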

Related

Pandas dataframe sharing between functions isn't working

I have a script that modifies a pandas dataframe from several concurrent functions (asyncio coroutines). Each function adds rows to the dataframe, and it's important that the functions all share the same dataframe. However, when I add a row with pd.concat, a new copy of the dataframe is created. I can tell because each dataframe now has a different memory location as given by id().
As a result the functions no longer share the same object. How can I keep all functions pointed at a common dataframe object?
Note that this issue doesn't arise when I use the append method, but that is being deprecated.
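A minimal reproduction of the behaviour described above (values invented): rebinding the name with pd.concat leaves other references pointing at the old object.
import pandas as pd
df = pd.DataFrame({"value": [1]})
alias = df                  # a second reference to the same object
df = pd.concat([df, pd.DataFrame({"value": [2]})], ignore_index=True)
print(id(df) == id(alias))  # False: df now names a brand-new DataFrame
print(len(alias))           # 1: the old object never saw the new row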
pandas dataframes are efficient because they use contiguous memory blocks, frequently of fundamental types like int and float. You can't just add a row, because the dataframe doesn't own the next bit of memory it would have to expand into. Concatenation usually requires that new memory is allocated and data is copied. Once that happens, anything that referred to the original dataframe is still pointing at the old object, not the new one.
If you know the final size you want, you can preallocate and fill. Otherwise, you are better off keeping a list of new dataframes and concatenating them all at once. Since these are parallel procedures that aren't dependent on each other's output, this may be a feasible option.
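A rough sketch of the collect-then-concat pattern (illustrative only; the producer functions and row contents are made up, and plain functions stand in for the coroutines):
import pandas as pd
# Each producer appends its piece to a shared list instead of growing a shared DataFrame.
pieces = []
def producer_a():
    pieces.append(pd.DataFrame({"value": [1, 2]}))
def producer_b():
    pieces.append(pd.DataFrame({"value": [3]}))
producer_a()
producer_b()
# One allocation/copy at the end instead of one per added row.
result = pd.concat(pieces, ignore_index=True)
print(result)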

How to efficiently flatten JSON structure returned in elasticsearch_dsl queries?

I'm using elasticsearch_dsl to make queries against and searches of an elasticsearch DB.
One of the fields I'm querying is an address, which has a structure like so:
address.first_line
address.second_line
address.city
address.code
The returned documents hold this in JSON structures, such that the address is held in a dict with a field for each sub-field of address.
I would like to put this into a (pandas) dataframe, such that there is one column per sub-field of the address.
Directly putting address into the dataframe gives me a column of address dicts, and iterating the rows to manually unpack (json_normalize()) each address dict takes a long time (4 days, ~200,000 rows).
From the docs I can't figure out how to get elasticsearch_dsl to return flattened results. Is there a faster way of doing this?
Searching for a way to solve this problem, I've come across my own answer and found it lacking, so I will update it with a better way.
Specifically: pd.json_normalize(df['json_column'])
In context: pd.concat([df, pd.json_normalize(df['json_column'])], axis=1)
Then drop the original column if required.
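A small end-to-end sketch of that approach (the "address" column and field names are taken from the question; the sample values are invented, and a reasonably recent pandas with pd.json_normalize is assumed):
import pandas as pd
df = pd.DataFrame({
    "id": [1, 2],
    "address": [
        {"first_line": "1 Main St", "second_line": "", "city": "Leeds", "code": "LS1"},
        {"first_line": "2 High St", "second_line": "Flat 3", "city": "York", "code": "YO1"},
    ],
})
# Flatten the dicts into their own columns, then drop the original column.
flat = pd.json_normalize(df["address"])
df = pd.concat([df.drop(columns="address"), flat], axis=1)
print(df.columns.tolist())  # ['id', 'first_line', 'second_line', 'city', 'code']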
Original answer from last year that does the same thing much more slowly
df.column_of_dicts.apply(pd.Series) returns a DataFrame with those dicts flattened.
pd.concat([df, new_df], axis=1) gets the new columns onto the old dataframe.
Then delete the original column_of_dicts.
pd.concat([df, df.address.apply(pd.Series)], axis=1) is the actual code I used.

Natural way of indexing elements in Flink

Is there a built-in way to index and access indices of individual elements of DataStream/DataSet collection?
Like in typical Java collections, where you know that e.g. the 3rd element of an ArrayList can be obtained by ArrayList.get(2), and vice versa ArrayList.indexOf(elem) gives us the index of (the first occurrence of) the specified element. (I'm not asking about extracting elements out of the stream.)
More specifically, when joining DataStreams/DataSets, is there a "natural"/easy way to join elements that came (were created) first, second, etc.?
I know there is a zipWithIndex transformation that assigns sequential indices to elements. I suspect the indices always start with 0? But I also suspect that they aren't necessarily assigned in the order the elements were created in (i.e. by their Event Time). (It also exists only for DataSets.)
This is what I currently tried:
DataSet<Tuple2<Long, Double>> tempsJoIndexed = DataSetUtils.zipWithIndex(tempsJo);
DataSet<Tuple2<Long, Double>> predsLinJoIndexed = DataSetUtils.zipWithIndex(predsLinJo);
DataSet<Tuple3<Double, Double, Double>> joinedTempsJo = tempsJoIndexed
.join(predsLinJoIndexed).where(0).equalTo(0)...
And it seems to create wrong pairs.
I see some possible approaches, but they're either non-Flink or not very nice:
1. I could of course assign an index to each element upon the stream's creation and have e.g. a stream of Tuples.
2. Work with event-time timestamps. (I suspect there isn't a way to key by timestamps, and even if there was, it wouldn't be useful for joining multiple streams like this unless the timestamps are actually assigned as indices.)
3. We could try "collecting" the stream first, but then we wouldn't be using Flink anymore.
The first approach seems like the most viable one, but it also seems redundant, given that the stream should by definition be a sequential collection and as such the elements should have a sense of order (e.g. "I'm the 36th element because 35 elements already came before me").
I think you're going to have to assign index values to elements, so that you can partition the data sets by this index, and thus ensure that two records which need to be joined are being processed by the same sub-task. Once you've done that, a simple groupBy(index) and reduce() would work.
But assigning increasing ids without gaps isn't trivial if you want to read your source data with parallelism > 1. In that case I'd create a RichMapFunction that uses the RuntimeContext's sub-task id and number of sub-tasks to calculate non-overlapping and monotonic indexes.
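A plain-Python sketch of the index arithmetic that answer describes (no Flink APIs here; subtask_id and num_subtasks stand in for the values a RichMapFunction would read from its runtime context):
def make_index_generator(subtask_id, num_subtasks):
    # Yield indexes that never collide across sub-tasks and increase monotonically
    # within each sub-task: subtask_id, subtask_id + n, subtask_id + 2n, ...
    k = 0
    while True:
        yield subtask_id + k * num_subtasks
        k += 1
# With 3 sub-tasks: sub-task 0 assigns 0, 3, 6, ...; sub-task 1 assigns 1, 4, 7, ...; sub-task 2 assigns 2, 5, 8, ...
gen = make_index_generator(1, 3)
print([next(gen) for _ in range(4)])  # [1, 4, 7, 10]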

no method matching size(::DataFrames.GroupedDataFrame)

It's the first time I post a question, so I will try to give some examples, but I might not be totally aware of the best way to do it.
I am using the groupby() function to divide a DataFrame according to a pooled variable. My intent is to create from the SubDataFrames a new one in which the rows split with groupby() become 2 separate columns. For instance, in DataFrame A I have :meanX and :Treatment; in DataFrame B I want to have :meanX_Treatment1 and :meanX_Treatment2.
Now I found a way to use join() for this purpose, but having many other variables to block, I need to repeat the operation several times, and I need to know how many SubDataFrames the initial call of groupby() created. The number is variable, so I can't simply read it; I need to store it in a variable. That's why I tried size(::DataFrames.GroupedDataFrame).
Is there a solution?
To get the number of groups in a GroupedDataFrame use the length method. For example:
using DataFrames
df = DataFrame(x=repeat(1:4,inner=2,outer=2),y='a':'p')
grouped = groupby(df,:x)
num_of_groups = length(grouped) # returns 4
# to do something with each group `for g in grouped ... end` is useful
As noted in comments, you might also consider using Query.jl (see documentation at http://www.david-anthoff.com/Query.jl/stable) for data processing along the question's lines.

Why does pandas.apply() work differently for Series and DataFrame columns

Apologies if this is a silly question, but I am not quite sure why this behavior is the case, and/or whether I am misunderstanding it. I was trying to write a function for the 'apply' method, and noticed that if you run apply on a Series, the data is passed to the (u)func as a np.array, whereas if you pass the same Series within a dataframe of one column, it is passed as a Series.
This affects the way a simpleton like me writes the function (I prefer iloc indexing to integer-based indexing on the array), so I was wondering whether this is on purpose, or a historical accident?
Thanks,
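One quick way to see what object apply hands to the function in each case (a diagnostic sketch, not part of the original thread; the exact behaviour has shifted across pandas versions):
import pandas as pd
s = pd.Series([1.0, 2.0, 3.0], name="x")
df = s.to_frame()
# Series.apply calls the function once per element.
print(s.apply(lambda v: type(v).__name__).unique())  # e.g. ['float64']
# DataFrame.apply calls the function once per column (a Series) by default;
# axis=1 passes rows, and raw=True passes plain ndarrays instead.
print(df.apply(lambda col: type(col).__name__))      # e.g. x    Series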