Pandas MultiIndex manipulation

I'm not very adept at Python, but I have a "bandaid" solution to a problem and am trying to find out if there is a better way to do things. I have a dataframe of stocks I download from pandas_datareader. This gives me a MultiIndex df, and I'm trying to extract just the attributes that I want.
The initial df from pandas_datareader results in the following structure:
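Roughly, the columns are a MultiIndex of (attribute, symbol) pairs. Here is a minimal stand-in built with made-up numbers for the two tickers in question:
import numpy as np
import pandas as pd

# Illustrative stand-in for the pandas_datareader download (values are random)
attributes = ['High', 'Low', 'Open', 'Close']
symbols = ['BHP.AX', 'S32.AX']
columns = pd.MultiIndex.from_product([attributes, symbols],
                                     names=['Attributes', 'Symbols'])
dates = pd.date_range('2020-03-31', periods=5, name='Date')
df = pd.DataFrame(np.random.rand(len(dates), len(columns)) * 100,
                  index=dates, columns=columns)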
I'm interested in getting just the "High" and "Close" prices in this structure. To achieve this, I have done the following:
df.loc[:, ['High', 'Close']]
Which gives me:
This is close to what I want, but it is grouped by attribute rather than by stock. To group the attributes by stock, I tried swapping the levels and then specifying the columns I want:
newdf = df.swaplevel(axis='columns')
newdf.loc[:, [('BHP.AX','High'),('BHP.AX','Close'),('S32.AX','Close'),('S32.AX','High')]]
This gives me the desired result, but it seems a very "hardcoded" and inefficient way of doing it.
Is there a more generalized way I could go about doing this? I want to be able to just specify the attributes (e.g. Close, High, etc.) and have the result cover all the stocks, grouped by stock rather than by attribute. This MultiIndex is not making it easy for me, so any help you can offer is appreciated.

You can use pd.IndexSlice to get this easily. Replace 'ACN' and 'IT' with your own symbols, as I tested this on different stocks.
Reference: MultiIndex / advanced indexing
idx = pd.IndexSlice
data = data.loc[:, idx[['Close', 'High'], ['ACN', 'IT']]]  # edit in your own symbols
data = data.swaplevel(axis='columns')
data.sort_index(level=0, axis=1, inplace=True)
data.head()
                   ACN                     IT
                 Close        High      Close        High
Date
2020-03-31  163.259995  169.880005  99.570000  109.160004
2020-04-01  154.679993  160.820007  93.290001   96.209999
2020-04-02  156.270004  160.500000  94.099998   94.919998
2020-04-03  152.149994  158.720001  91.820000   94.290001
2020-04-06  166.050003  166.750000  99.860001  100.940002

Found a rather simple solution.
newdf = rawout.loc[:, ['Close', 'High', 'Open']].swaplevel(axis='columns')
Using this, there is no need to specify all the stocks. I swap the levels in the code above, but this may not be required by someone else.
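If the swapped columns should also end up grouped together by stock, a sort on the column axis finishes the job (a small addition, not in the original):
newdf = newdf.sort_index(axis='columns', level=0)  # group the columns by stock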

Related

Pandas dataframe being treated as a series object after using groupby

I am conducting an analysis of a dataset. To find my results, I use this line of code:
new_df = df_ncis.groupby(['state', 'year'])['totals'].mean()
The object returned by this statement is a Series, when it should be a dataframe. I don't understand why this happened or how to solve it. Also, one of the columns of the new object is missing its name. Here is the GitHub link for the project: https://github.com/louishrm/gundataUS.
Any help would be great.
You are selecting a single column with ['totals'], which returns a Series.
Try this instead:
new_df = df_ncis[['state', 'year', 'totals']].groupby(['state', 'year'], as_index=False).mean()
which will give you a dataframe with your three columns.
Or, if you want it as a dataframe of one column (note the double brackets):
new_df = df_ncis.groupby(['state', 'year'])[['totals']].mean()
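A quick illustration of the single- versus double-bracket difference, using a made-up frame rather than the original dataset:
import pandas as pd

df_ncis = pd.DataFrame({'state': ['NY', 'NY', 'CA'],
                        'year': [2019, 2020, 2019],
                        'totals': [10, 20, 30]})
s = df_ncis.groupby(['state', 'year'])['totals'].mean()    # single brackets -> Series
d = df_ncis.groupby(['state', 'year'])[['totals']].mean()  # double brackets -> DataFrame
print(type(s).__name__, type(d).__name__)  # prints: Series DataFrame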

Koalas GroupBy > Apply > Lambda > Series

I am trying to port some code from Pandas to Koalas to take advantage of Spark's distributed processing. I am taking a dataframe and grouping it on A and B and then applying a series of functions to populate the columns of the new dataframe. Here is the code that I was using in Pandas:
new = old.groupby(['A', 'B']) \
    .apply(lambda x: pd.Series({
        'v1': x['v1'].sum(),
        'v2': x['v2'].sum(),
        'v3': (x['v1'].sum() / x['v2'].sum()),
        'v4': x['v4'].min()
    }))
I believe that it is working well and the resulting dataframe appears to be correct value-wise.
I just have a few questions:
Does this warning mean that my method will be deprecated in a future release?
/databricks/spark/python/pyspark/sql/pandas/group_ops.py:76: UserWarning: It is preferred to use 'applyInPandas' over this API. This API will be deprecated in the future releases. See SPARK-28264 for more details.
How can I rename the group-by columns to 'A' and 'B' instead of "__groupkey_0__ __groupkey_1__"?
As you can see, I had to call pd.Series. Is there a way to do this in Koalas? Calling ks.Series gives me the following error, which I am unsure how to work around:
PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
Thanks for any help that you can provide!
I'm not sure about the warning. I am using koalas==1.2.0 and pandas==1.0.5 and I don't see it, so I wouldn't worry about it.
The groupby columns are already called A and B when I run the code. This again may have been a bug which has since been patched.
For this you have 3 options:
Keep using pd.Series. As long as your original dataframe is a Koalas dataframe, your output will also be a Koalas dataframe (with the pd.Series automatically converted to ks.Series).
Keep the function and the data exactly the same and just convert the final dataframe to Koalas using the from_pandas function (see the sketch after the code below).
Do the whole thing in Koalas. This is slightly more tricky because you are computing an aggregate column from two other columns, and Koalas doesn't support lambda functions as a valid aggregation. One way to get around this is to compute the other aggregations first and add the multi-column aggregation afterwards:
import databricks.koalas as ks
ks.set_option('compute.ops_on_diff_frames', True)
# Dummy data
old = ks.DataFrame({"A":[1,2,3,1,2,3], "B":[1,2,3,3,2,3], "v1":[10,20,30,40,50,60], "v2":[4,5,6,7,8,9], "v4":[0,0,1,1,2,2]})
new = old.groupby(['A', 'B']).agg({'v1':'sum', 'v2':'sum', 'v4': 'min'})
new['v3'] = old.groupby(['A', 'B']).apply(lambda x: x['v1'].sum() / x['v2'].sum())
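For the second option, a minimal sketch, assuming new is the pandas dataframe produced by the original groupby/apply snippet:
import databricks.koalas as ks

# `new` is the pandas result from the original snippet
new_kdf = ks.from_pandas(new)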

Processing pandas data in declarative style

I have a pandas dataframe of vehicle co-ordinates (from multiple vehicles on multiple days). For each vehicle and for each day, I do two things: either apply an algorithm to it, or filter it out of the dataset completely if it doesn't satisfy certain criteria.
To achieve this I use df.groupby(['vehicle_id', 'day']) and then .apply(algorithm) or .filter(condition), where algorithm and condition are functions which take in a dataframe.
I would like the full processing of my dataset (which involves multiple .apply and .filter steps) to be written in a declarative style, as opposed to imperatively looping through the groups, with the goal of having the whole thing look something like:
df.groupby(['vehicle_id', 'day']).apply(algorithm1).filter(condition1).apply(algorithm2).filter(condition2)
Of course, the above code is incorrect since .apply() and .filter() return new dataframes, and this is exactly my problem: they return all the data in a single dataframe, and I find that I have to apply .groupby(['vehicle_id', 'day']) over and over.
Is there a nice way that I can write this out without having to group by the same columns over and over?
Since apply uses a for loop anyway (meaning there are no sophisticated optimizations in the background), I suggest using an actual for loop:
arr = []
for key, dfg in df.groupby(['vehicle_id', 'day']):
    dfg = dfg.do_stuff1()   # perform all needed operations
    dfg = do_stuff2(dfg)
    arr.append(dfg)
result = pd.concat(arr)
An alternative is to create a function which runs all of the applies and filters sequentially on a given dataframe, and then run it through a single groupby/apply:
def all_operations(dfg):
    # Do stuff
    return result_df

result = df.groupby(['vehicle_id', 'day']).apply(all_operations)
In both options you will have to deal with cases in which an empty dataframe is returned from the filters, if such cases exist.
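As a sketch, all_operations could chain the steps from the question like this, with the empty-frame case handled explicitly (algorithm1, algorithm2, condition1 and condition2 stand in for your own functions):
import pandas as pd

def all_operations(dfg):
    # Return an empty frame (rather than None) so the final concat stays well-behaved
    dfg = algorithm1(dfg)
    if not condition1(dfg):
        return pd.DataFrame(columns=dfg.columns)
    dfg = algorithm2(dfg)
    if not condition2(dfg):
        return pd.DataFrame(columns=dfg.columns)
    return dfg

result = df.groupby(['vehicle_id', 'day']).apply(all_operations)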

Python - Using a List, Dict Comprehension, and Mapping to Change Plot Order

I am relatively new to Python, Pandas, and plotting. I am looking to make a custom sort order in a pandas plot using a list, mapping, and sending them through to the plot function.
I am not "solid" on mapping or dict comprehensions. I've looked around a bit on Google and haven't found anything really clear - so any direction to helpful references would be much appreciated.
I have a dataframe that is the result of a groupby:
Exchange
AMEX 267
NYSE 2517
Nasdaq 2747
Name: Symbol, dtype: int64
The numerical column is 'Symbol' and the exchange listing is the index.
When I do a straightforward pandas plot
my_plot = Exchange['Symbol'].plot(kind='bar')
I get this:
The columns are in the order of the rows in the dataframe (AMEX, NYSE, Nasdaq), but I would like to present them, left to right, as NYSE, Nasdaq, and AMEX. So a simple sort won't work.
There is another post, Sorting the Order of Bars, that gets at this, but I just couldn't figure it out.
I feel like the solution is one step out of my reach. I think this is a very important concept to get down, as it would help me considerably in visualizing data for the not-infrequent case where a chart needs a custom row order. I'm also hoping the discussion here can help me better understand mapping, which seems useful in many instances, but I just can't seem to find the right online resource that explains it clearly.
Thank you in advance.
The solution to your problem is putting your output dataframe into the desired order:
order = [1, 2, 0]  # the desired order, by position
Exchange['Symbol'].iloc[order]
NYSE 2517
Nasdaq 2747
AMEX 267
Name: Symbol, dtype: int64
Once you have the data in the right order, you can plot it:
Exchange['Symbol'].iloc[order].plot(kind='bar');
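A label-based alternative, assuming the index values shown above, is reindex, which spells out the order by name instead of by position:
order = ['NYSE', 'Nasdaq', 'AMEX']  # desired left-to-right order, by label
Exchange['Symbol'].reindex(order).plot(kind='bar');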

Pass pandas sub dataframe to master dataframe

I have a dataframe which I am doing some work on:
import pandas as pd

d = {'x': [2, 8, 4, -5, 4, 5, -3, 5], 'y': [-.12, .35, .3, .15, .4, -.5, .6, .57]}
df = pd.DataFrame(d)
df['x_even'] = df['x'] % 2 == 0
For subdf, I get all rows where x is negative, then square x and multiply y by 100:
subdf = df[df.x < 0]
subdf['x'] = subdf.x ** 2
subdf['y'] = subdf.y * 100
subdf's work is completed. I am not sure how I can incorporate these changes back into the master dataframe (df).
It looks like your current code will give you a SettingWithCopyWarning.
To avoid this you could do the following:
df.loc[df.x<0, 'y'] = df.loc[df.x<0, 'y']*100
df.loc[df.x<0, 'x'] = df.loc[df.x<0, 'x']**2
This changes df in place, without raising a warning, and there is no need to merge anything back.
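If you prefer not to repeat the df.x < 0 condition, the mask can also be computed once up front (same effect as the two lines above):
neg = df.x < 0
df.loc[neg, 'y'] *= 100
df.loc[neg, 'x'] **= 2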
pd.merge(subdf, df, how='outer')
This does what I was asking for. Thanks for the tip, Primer.