no method matching size(::DataFrames.GroupedDataFrame)

This is the first time I've posted a question, so I will try to give some examples, though I might not be fully aware of the best way to do it.
I am using the groupby() function to divide a DataFrame according to a pooled variable. My intent is to create from the SubDataFrames a new one in which the rows split by groupby() become 2 separate columns. For instance, in DataFrame A I have :meanX and :Treatment; in DataFrame B I want to have :meanX_Treatment1 and :meanX_Treatment2.
Now I found a way to use join() for this purpose, but having many other variables to block, I need to repeat the operation several times, and I need to know how many SubDataFrames the initial call of groupby() created. The result varies, so I can't simply read it off; I need to store it in a variable. That's why I tried size(::DataFrames.GroupedDataFrame).
Is there a solution?

To get the number of groups in a GroupedDataFrame use the length method. For example:
using DataFrames
df = DataFrame(x=repeat(1:4, inner=2, outer=2), y='a':'p')
grouped = groupby(df, :x)
num_of_groups = length(grouped)  # returns 4
# To do something with each group, `for g in grouped ... end` is useful.
As noted in the comments, you might also consider using Query.jl (see the documentation at http://www.david-anthoff.com/Query.jl/stable) for data processing along the question's lines.

Related

Pandas dataframe sharing between functions isn't working

I have a script that modifies a pandas dataframe with several concurrent functions (asyncio coroutines). Each function adds rows to the dataframe, and it's important that the functions all share the same data. However, when I add a row with pd.concat, a new copy of the dataframe is created. I can tell because each dataframe now has a different memory location, as given by id().
As a result the functions no longer share the same object. How can I keep all functions pointed at a common dataframe object?
Note that this issue doesn't arise when I use the append method, but that is being deprecated.
pandas dataframes are efficient because they use contiguous memory blocks, frequently of fundamental types like int and float. You can't just add a row, because the dataframe doesn't own the next bit of memory it would have to expand into. Concatenation usually requires that new memory be allocated and the data copied. Once that happens, referrers to the original dataframe are left pointing at the old, unmodified object.
If you know the final size you want, you can preallocate and fill. Otherwise, you are better off keeping a list of new dataframes and concatenating them all at once. Since these are parallel procedures that aren't dependent on each other's output, this may be a feasible option.
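A minimal sketch of that list-then-concat pattern (the add_rows function and column name are illustrative, not from the question):

import pandas as pd

frames = []  # shared mutable list: append mutates it in place,
             # so every function keeps seeing the same object

def add_rows(values):
    frames.append(pd.DataFrame({"value": values}))

add_rows([1, 2])
add_rows([3, 4])

result = pd.concat(frames, ignore_index=True)  # single copy, done once at the end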

Custom distance function between every row of two dataframes

I have two dataframes, and I want to calculate the "distance" between every row in one dataframe and every row in the other, using a custom distance measure (for example, Euclidean for the first column, taxicab for the second, etc.). Is there a way to do this quickly using broadcasting?
You can simply create a custom function and use it with the apply function. For example:
def custom_func(row):
    # example logic: combine two columns (names "a" and "b" assumed); replace with your own
    return row["a"] + row["b"]

df["result"] = df.apply(custom_func, axis=1)
If you want an answer that can be tested on some dataset, please add the data. But the idea is that you can create any custom function and pass it to pandas' apply function.
More about the apply function, with examples: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
You may also want to check pandas' applymap function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html
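For the pairwise, per-column metrics the question actually asks about, NumPy broadcasting avoids row-by-row apply entirely. A minimal sketch, assuming two numeric columns (the column names and the choice of metrics are illustrative):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"x": [0.0, 1.0], "y": [0.0, 2.0]})
df2 = pd.DataFrame({"x": [1.0, 3.0, 5.0], "y": [1.0, 0.0, 2.0]})

a = df1.to_numpy()[:, None, :]  # shape (n1, 1, k)
b = df2.to_numpy()[None, :, :]  # shape (1, n2, k)

# Per-column metrics, broadcast over every row pair:
# squared difference on the first column, taxicab on the second.
dist = (a[..., 0] - b[..., 0]) ** 2 + np.abs(a[..., 1] - b[..., 1])
# dist[i, j] is the combined distance between df1 row i and df2 row j;
# the result has shape (n1, n2).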

How to efficently flatten JSON structure returned in elasticsearch_dsl queries?

I'm using elasticsearch_dsl to make queries against and searches of an Elasticsearch DB.
One of the fields I'm querying is an address, which has a structure like so:
address.first_line
address.second_line
address.city
address.code
The returned documents hold this in JSON structures, such that the address is held in a dict with a field for each sub-field of address.
I would like to put this into a (pandas) dataframe, such that there is one column per sub-field of the address.
Directly putting address into the dataframe gives me a column of address dicts, and iterating the rows to manually unpack (json_normalize()) each address dict takes a long time (4 days, ~200,000 rows).
From the docs I can't figure out how to get elasticsearch_dsl to return flattened results. Is there a faster way of doing this?
Searching for a way to solve this problem, I've come across my own answer and found it lacking, so I will update it with a better way.
Specifically: pd.json_normalize(df['json_column'])
In context: pd.concat([df, pd.json_normalize(df['json_column'])], axis=1)
Then drop the original column if required.
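A self-contained sketch of that pattern (the example data mimics the address dicts described in the question):

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "address": [
        {"first_line": "1 Main St", "second_line": "", "city": "Leeds", "code": "LS1"},
        {"first_line": "2 High St", "second_line": "Flat 3", "city": "York", "code": "YO1"},
    ],
})

# Flatten the dict column into one column per sub-field, dropping the original.
flat = pd.concat([df.drop(columns="address"),
                  pd.json_normalize(df["address"])], axis=1)
# flat now has columns: id, first_line, second_line, city, code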
Original answer from last year that does the same thing much more slowly
df.column_of_dicts.apply(pd.Series) returns a DataFrame with those dicts flattened.
pd.concat([df, new_df], axis=1) gets the new columns onto the old dataframe.
Then delete the original column_of_dicts.
pd.concat([df, df.address.apply(pd.Series)], axis=1) is the actual code I used.

Extracting data as a list from a Pandas dataframe while preserving order

Suppose I have some Pandas dataframe df that has a column called "HEIGHT", among many other columns.
If I issue list(df["HEIGHT"]), then this will give me a list of the items in that column in the exact order in which they were in the dataframe, i.e. ordered by the index of the dataframe.
Is that always the case? The df["HEIGHT"] command will return a Series, and list() will convert it to a list. But are these operations always order-preserving? Interestingly, in the book by the Pandas author (!), from my reading so far, it is unclear to me when these elementary operations preserve order. Is order perhaps always preserved, or is there some simple rule to know when order should be preserved?
The order of elements in a pandas Series (i.e., a column in a pandas DataFrame) will not change unless you do something that makes it change. And the order of a python list is guaranteed to reflect insertion order (SO thread).
So yes, df[0].tolist() (slightly faster than list(df[0])) should always yield a Python list of elements in the same order as the elements in df[0].
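A quick check illustrating this (the values are made up; the non-monotonic index shows that row order, not index order, is what's preserved):

import pandas as pd

df = pd.DataFrame({"HEIGHT": [180, 165, 172]}, index=[10, 5, 7])

assert df["HEIGHT"].tolist() == [180, 165, 172]
assert list(df["HEIGHT"]) == [180, 165, 172]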
Order will always be preserved. When you use the list function, you provide it an iterable and construct a list by iterating over it. For more information on iterators, you might want to read PEP 234 on iterators.
The iteration order is determined by the iterator you provide. Iterators for a Series are provided by pd.Series.__iter__() (the standard way to expose an iterator for an object, which is what list() and similar constructors look up). For more information on iteration and indexing in pandas, consider reading the relevant API reference section and the much more in-depth indexing documentation.

Why does pandas.apply() work differently for Series and DataFrame columns

Apologies if this is a silly question, but I am not quite sure why this behavior is the case, and/or whether I am misunderstanding it. I was trying to create a function for the apply method, and noticed that if you run apply on a Series, the series is passed as an np.array, whereas if you pass the same Series within a dataframe of 1 column, it is passed as a Series to the (u)func.
This affects the way a simpleton like me writes the function (I prefer iloc indexing to integer-based indexing on the array), so I was wondering whether this is on purpose, or a historical accident?
Thanks,
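A small probe shows what each apply variant hands to the function; the raw= keyword is a documented DataFrame.apply parameter, though exact behavior has varied across pandas versions, which may explain the discrepancy described above:

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# DataFrame.apply hands each column to the function as a Series by default,
# and as a raw numpy array when raw=True:
print(df.apply(lambda col: type(col).__name__))            # x: Series
print(df.apply(lambda col: type(col).__name__, raw=True))  # x: ndarray

# Series.apply, by contrast, calls the function once per element:
print(df["x"].apply(lambda v: type(v).__name__))           # scalar per element (e.g. 'int64')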