Convert Pandas DataFrame into Series with multiIndex - pandas

Let us consider a pandas DataFrame (df) like the one shown above.
How do I convert it to a pandas Series?

Just select the single column of your frame
df['Count']

result = pd.Series(df['Count'])
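Since the original frame is not reproduced here, the following is only a minimal sketch on made-up data. The 'Count' column name comes from the answers above; the 'Group' and 'Item' key columns are hypothetical. It shows that moving the key columns into the index before selecting the column gives a Series with a MultiIndex.

```python
import pandas as pd

# Hypothetical data: 'Group' and 'Item' are assumed key columns,
# only 'Count' is taken from the answers above.
df = pd.DataFrame({
    "Group": ["a", "a", "b"],
    "Item": ["x", "y", "x"],
    "Count": [3, 1, 4],
})

# Move the key columns into the index, then select the single value column.
# Selecting one column of a DataFrame already returns a Series.
result = df.set_index(["Group", "Item"])["Count"]

print(type(result).__name__)   # Series
print(result.index.nlevels)    # 2
```

If the frame already has a MultiIndex, the set_index step is unnecessary and df['Count'] alone keeps that index.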

Flatten and rename multi-index agg columns

I have some Pandas / cudf code that aggregates a particular column using two aggregate methods, and then renames the multi-index columns to flattened columns.
df = (
    some_df
    .groupby(["some_dimension"])
    .agg({"some_metric": ["sum", "max"]})
    .reset_index()
    .rename(columns={
        "some_dimension": "some_dimension__id",
        ("some_metric", "sum"): "some_metric_sum",
        ("some_metric", "max"): "some_metric_max",
    })
)
This works great in cudf, but does not work in Pandas 0.25 -- the hierarchy is not flattened out.
Is there a similar approach using Pandas? I like the cudf tuple syntax and how they just implicitly flatten the columns. Hoping to find a similarly easy way to do it in Pandas.
Thanks.
Pandas 0.25.0 and later support named aggregation (groupby aggregation with relabeling): each keyword names an output column, and its value is an (input column, function) tuple.
Here is a stab at your code:
df = (
    some_df
    .groupby(["some_dimension"])
    .agg(
        some_metric_sum=("some_metric", "sum"),
        some_metric_max=("some_metric", "max"),
    )
    .reset_index()
    .rename(columns={"some_dimension": "some_dimension_id"})
)
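The relabeling syntax above can be tried on a small made-up frame. This sketch uses the column names from the question; the data values are invented. It also shows one alternative that keeps the dict-style agg and joins the column tuples by hand, which is a common workaround rather than anything the answer itself proposes.

```python
import pandas as pd

# Made-up data using the column names from the question.
some_df = pd.DataFrame({
    "some_dimension": ["a", "a", "b"],
    "some_metric": [1, 5, 2],
})

# Relabeling form: output_name=(input_column, function).
named = (
    some_df
    .groupby("some_dimension")
    .agg(
        some_metric_sum=("some_metric", "sum"),
        some_metric_max=("some_metric", "max"),
    )
    .reset_index()
)

# Alternative: keep the dict-style agg and flatten the column tuples manually.
flat = some_df.groupby("some_dimension").agg({"some_metric": ["sum", "max"]})
flat.columns = ["_".join(col) for col in flat.columns]
flat = flat.reset_index()

print(list(named.columns))
print(list(flat.columns))
```

Both frames end up with the same three flat column names; the relabeling form avoids creating the MultiIndex in the first place.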

How to convert the result of DataFrame groupby().agg() to a new Dataframe

Sounds basic, but...
I have a dataframe df with (yy, mm, dd, value1, value2,...)
df1 = df.groupby(['yy','dd'], as_index = False).agg({'value1':['count'],'value2':['sum']})
This works and returns df1 with multi-index columns, which I can inspect with e.g. df1.info().
Q: How do I convert this df1 into a 'basic' 2D DataFrame?
You need to drop the top level of the multi-level columns, and then reset the index. You can try this:
df1 = df.groupby(['yy','dd'], as_index = True).agg({'value1':['count'],'value2':['sum']})
df1.columns = df1.columns.droplevel()
df1.reset_index(inplace=True)
Hope this solves your problem.
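To see what the droplevel step does, here is a minimal sketch on invented data, using the column names from the question. One thing to be aware of: dropping the top level keeps only the function names ('count', 'sum'), so the source column names are lost.

```python
import pandas as pd

# Invented data with the columns named in the question.
df = pd.DataFrame({
    "yy": [2020, 2020, 2021],
    "mm": [1, 2, 1],
    "dd": [1, 1, 2],
    "value1": [10, 20, 30],
    "value2": [1.0, 2.0, 3.0],
})

df1 = df.groupby(["yy", "dd"]).agg({"value1": ["count"], "value2": ["sum"]})

# Columns are now tuples: ('value1', 'count') and ('value2', 'sum').
# Dropping level 0 leaves only 'count' and 'sum'.
df1.columns = df1.columns.droplevel(0)

# Turn the group keys back into ordinary columns.
df1 = df1.reset_index()

print(list(df1.columns))  # ['yy', 'dd', 'count', 'sum']
```

If two aggregations shared a function name (say, 'sum' on both columns), dropping the top level would produce duplicate column names, so joining the levels would be the safer choice in that case.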

How to convert datatype of all the columns of a pandas dataframe to string

I have tried multiple ways to achieve this, for example:
inputpd = pd.DataFrame(inputpd.columns,dtype=str)
But it does not work. Sorry for asking this question, as I am a beginner with Spark.
If it's a Pandas DataFrame:
df = df.astype(str)
I think the easiest way is:
df = df.applymap(str)
Here df is your dataframe. Note that applymap is deprecated as of pandas 2.1 in favor of DataFrame.map.
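A quick way to check the astype(str) approach is to convert a small mixed-type frame and inspect the values. The frame below is invented for illustration.

```python
import pandas as pd

# Invented frame with integer, float and boolean columns.
df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5], "c": [True, False]})

# Convert every column to strings in one call.
as_text = df.astype(str)

# Every cell is now a Python str, so '+' concatenates instead of adding.
print(as_text.loc[0, "a"] + as_text.loc[0, "b"])  # 11.5
```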

When reading an HTML page (pandas.read_html), how to select a dataframe and set_index in one line

I'm reading an html which brings back a list of dataframes. I want to be able to choose the dataframe from the list and set my index (index_col) in the least amount of lines.
Here is what I have right now:
import pandas as pd
df = pd.read_html('http://finviz.com/insidertrading.ashx?or=-10&tv=100000&tc=1&o=-transactionvalue', header=0)
df2 = df[4]  # here I'm assigning df2 to dataframe #4 from the list of dataframes I read
df2.set_index('Date', inplace=True)
Is it possible to do all this in one line? Do I need to create another dataframe (df2) to hold one dataframe from the list, or can I pick the dataframe as soon as I read the list of dataframes (df)?
Thanks.
You can index the returned list and chain set_index in the same expression:
import pandas as pd
df = pd.read_html('http://finviz.com/insidertrading.ashx?or=-10&tv=100000&tc=1&o=-transactionvalue', header = 0)[4].set_index('Date')
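The chaining works because read_html returns an ordinary Python list of DataFrames. The sketch below avoids the network by using a hand-built list as a stand-in for that return value; the table contents and the position of the 'Date' table are made up.

```python
import pandas as pd

# Stand-in for the list returned by pd.read_html; the data is invented.
tables = [
    pd.DataFrame({"x": [1]}),
    pd.DataFrame({"Date": ["Jan 02", "Jan 03"], "Value": [100, 200]}),
]

# Pick one frame from the list and set its index in a single expression.
# Without inplace=True, set_index returns a new frame and leaves the
# original list entry untouched.
df = tables[1].set_index("Date")

print(df.index.name)  # Date
```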

concat series onto dataframe with column name

I want to add a Series (s) to a Pandas DataFrame (df) as a new column. The series has more values than there are rows in the dataframe, so I am using the concat method along axis 1.
df = pd.concat((df, s), axis=1)
This works, but the new column of the dataframe representing the series is given an arbitrary numerical column name, and I would like this column to have a specific name instead.
Is there a way to add a series to a dataframe, when the series is longer than the rows of the dataframe, and with a specified column name in the resulting dataframe?
You can try Series.rename:
df = pd.concat((df, s.rename('col')), axis=1)
One option is simply to specify the name when creating the series:
example_scores = pd.Series([1,2,3,4], index=['t1', 't2', 't3', 't4'], name='example_scores')
Using the name attribute when creating the series is all I needed.
Try:
df = pd.concat((df, s.rename('CoolColumnName')), axis=1)
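As a small check of the rename approach, this sketch concatenates a three-value Series onto a two-row frame. The data and the column name 'col' are placeholders.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})   # 2 rows
s = pd.Series([10, 20, 30])        # 3 values, no name yet

# rename() returns a renamed copy; concat uses the Series name as the column label.
out = pd.concat((df, s.rename("col")), axis=1)

print(list(out.columns))  # ['a', 'col']
print(len(out))           # 3
```

Because concat aligns on the index, the extra row gets NaN in the original columns of df.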