Pandas: how do I group a DataFrame by a set of ordinal values?

I'm starting to learn pandas and want to generate a graph with the sum of arbitrary groupings of an ordinal value. It's best explained with a simple example.
Suppose I have a table of food consumption data with columns food, year and amount.
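A hypothetical sample with those columns, consistent with the totals in the answer below (which refers to this frame as df):
import pandas as pd

df = pd.DataFrame({
    'food':   ['apple', 'brocolli', 'cheetos', 'coke'] * 3,
    'year':   [2010] * 4 + [2011] * 4 + [2012] * 4,
    'amount': [5, 5, 6, 5, 8, 9, 3, 7, 9, 4, 20, 4],  # hypothetical values
})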
And I have two groups of foods defined as two lists:
healthy = ['apple', 'brocolli']
junk = ['cheetos', 'coke']
Now I want to plot a graph with the evolution of consumption of junk and healthy food. I believe I must first process my data to get a DataFrame like:
food  healthy  junk
year
2010       10    11
2011       17    10
2012       13    24
Supposing the first table is already in a DataFrame (df above), how do I transform it to get the second one?
I also welcome suggestions to reword my question to make it clearer, or for different approaches to generate the plot.

First create a dictionary from the two lists, then swap keys with values.
Then group by the food column mapped through the dict and by year, aggregate with sum, and finally reshape with unstack:
healthy = ['apple', 'brocolli']
junk = ['cheetos', 'coke']
d1 = {'healthy':healthy, 'junk':junk}
##http://stackoverflow.com/a/31674731/2901002
d = {k: oldk for oldk, oldv in d1.items() for k in oldv}
print (d)
{'brocolli': 'healthy', 'cheetos': 'junk', 'apple': 'healthy', 'coke': 'junk'}
df1 = df.groupby([df.food.map(d), 'year'])['amount'].sum().unstack(0)
print (df1)
food healthy junk
year
2010 10 11
2011 17 10
2012 13 24
Another solution with pivot_table:
df1 = df.pivot_table(index='year', columns=df.food.map(d), values='amount', aggfunc='sum')
print (df1)
food healthy junk
year
2010 10 11
2011 17 10
2012 13 24
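Since the goal is a plot of the evolution over time, a minimal sketch of the final step (assuming matplotlib is installed):
import matplotlib.pyplot as plt

df1.plot()            # one line per food group, years on the x-axis
plt.ylabel('amount')
plt.show()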

Related

Multiplying two data frames in pandas

I have two data frames, df1 and df2, as shown below. I want to create a third dataframe, df, also shown below. What would be the appropriate way?
import pandas as pd

df1={'id':['a','b','c'],
     'val':[1,2,3]}
df1=pd.DataFrame(df1)
df1
id val
0 a 1
1 b 2
2 c 3
df2={'yr':['2010','2011','2012'],
     'val':[4,5,6]}
df2=pd.DataFrame(df2)
df2
yr val
0 2010 4
1 2011 5
2 2012 6
df={'id':['a','b','c'],
    'val':[1,2,3],
    '2010':[4,8,12],
    '2011':[5,10,15],
    '2012':[6,12,18]}
df=pd.DataFrame(df)
df
id val 2010 2011 2012
0 a 1 4 5 6
1 b 2 8 10 12
2 c 3 12 15 18
I could basically convert df1 and df2 to 1-by-n matrices, compute the n-by-n result, and assign it back to df1. But is there an easier pandas way?
TL;DR
We can do it in one line like this:
df1.join(df1.val.apply(lambda x: x * df2.set_index('yr').val))
or like this:
df1.join(df1.set_index('id') @ df2.set_index('yr').T, on='id')
Done.
The long story
Let's see what's going on here.
To multiply each value of df1.val by the values in df2.val, we use apply:
df1['val'].apply(lambda x: x * df2.val)
The function inside receives the values of df1.val one by one and multiplies each by df2.val element-wise (see broadcasting for details if needed). Since df2.val is a pandas Series, the output is a data frame with index df1.val.index and columns df2.val.index. With df2.set_index('yr') we force the years to be the index before multiplication, so they become column names in the output.
DataFrame.join joins frames index-on-index by default. So, since df1 and the multiplication output have identical indexes, we can apply df1.join( <the output of multiplication> ) as is.
In the end we get the desired matrix with index df1.index and columns id, val, *df2['yr'].
The second variant with the @ operator is essentially the same. The main difference is that we multiply 2-dimensional frames instead of Series: a vertical and a horizontal vector, respectively. The matrix multiplication therefore produces a frame with index df1.id and columns df2.yr, whose values are the element-wise products. In the end we join df1 with the output on the identical id column and index, respectively.
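For illustration, the intermediate frame produced by the @ multiplication on the question's data, before the join:
print(df1.set_index('id') @ df2.set_index('yr').T)
yr  2010  2011  2012
id
a      4     5     6
b      8    10    12
c     12    15    18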
This works for me:
import numpy as np

df2 = df2.T
new_df = pd.DataFrame(np.outer(df1['val'], df2.iloc[1:]))  # df2.iloc[1:] is the 'val' row after the transpose
df = pd.concat([df1, new_df], axis=1)
df.columns = ['id', 'val', '2010', '2011', '2012']
df
df
The output I get:
id val 2010 2011 2012
0 a 1 4 5 6
1 b 2 8 10 12
2 c 3 12 15 18
Your question is a bit vague, but I suppose you want to do something like this:
df = pd.concat([df1, df2], axis=1)

Iterating through pandas dataframe and appending a dictionary?

I am trying to transition from Excel to Python, and for practice I would like to analyze sports data from the NFL season. I have created a pandas dataframe with the data I would like to track, but I was wondering how I can go through the data and create a dictionary with each team's wins and losses. I thought I could iterate through the dataframe, check whether each team has already been entered into the dictionary, and if not, append its name to it.
Any advice?
closing_lines dataframe sample:
   Year Week side       type  line   odds  outcome
0  2006   01  PIT  MONEYLINE   NaN -125.0      1.0
1  2006   01  MIA  MONEYLINE   NaN  105.0      0.0
2  2006   01  MIA     SPREAD   1.5    NaN      0.0
3  2006   01  PIT     SPREAD  -1.5    NaN      1.0
results = {'Team': [], 'Wins': [], 'Losses': []}
# iterate through the data
# check to see if the dictionary has the team we are looking at
# if it doesn't, add it to the dictionary
# if it does, add a unit to either the wins or the losses
closing_lines = closing_lines.reset_index()  # make sure that the index matches the number of rows
for index, row in closing_lines.iterrows():
    for key, Team in results.items():
        if Team == closing_lines[row, 'side']:
            pass
        else:
            results['Team'].append(closing_lines[row, 'side'])
The more pandas-like way to do this is to create a new data frame indexed by team, with columns for wins and losses. The groupby method can help: you can group the rows of your dataframe by team and then run some kind of summary over the result, e.g.:
closing_lines.groupby('side')['outcome'].sum()
creates a new Series indexed by 'side' with the sum of the 'outcome' column for each 'side' (which I think is Wins for this data).
Check out this answer to see how to count zeros and non-zeros in a groupby column.
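For a full win/loss table in one go, a sketch using named aggregation, assuming 'outcome' is 1.0 for a win and 0.0 for a loss:
wins_losses = closing_lines.groupby('side')['outcome'].agg(
    Wins='sum',                       # number of 1.0 outcomes
    Losses=lambda s: (s == 0).sum(),  # number of 0.0 outcomes
)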

How can I combine same-named columns into one in a pandas dataframe so all the columns are unique?

I have a dataframe that looks like this:
In [268]: dft.head()
Out[268]:
ticker BYND UBER UBER UBER ... ZM ZM BYND ZM
0 analyst worlds uber revenue ... company owning pet things
1 moskow apac note uber ... try things humanization users
2 growth anheuserbusch growth target ... postipo unicorn products revenue
3 stock kong analysts raised ... software revenue things million
4 target uberbeating stock rising ... earnings million pets direct
[5 rows x 500 columns]
In [269]: dft.columns.unique()
Out[269]: Index(['BYND', 'UBER', 'LYFT', 'SPY', 'WORK', 'CRWD', 'ZM'], dtype='object', name='ticker')
How do I combine the columns so there is only a single unique column name for each ticker?
Maybe you should try making a copy of the column you wish to join, then extending the first column with that copy.
Code:
First convert all column names to the same case (either lower or upper) so that there is no mismatch in header case.
import pandas as pd

def merge_(df):
    '''Return the data-frame with same-named (lowercase) columns merged'''
    # Get the set of unique column names in lowercase
    columns = set(map(str.lower, df.columns))
    # Start from empty strings so that '+' concatenates the word columns
    df1 = pd.DataFrame(data='', index=df.index, columns=sorted(columns))
    # Merge the matching columns
    for col in df.columns:
        df1[col.lower()] += df[col]  # words are str, so '+' concatenates
    return df1
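A small hypothetical demonstration. Note this assumes the duplicated names differ only in case; with truly identical column names (as in the question's dft), df[col] returns a DataFrame rather than a Series, so the loop body would need adjusting:
demo = pd.DataFrame({'BYND': ['analyst', 'moskow'],
                     'bynd': ['worlds', 'apac']})
print(merge_(demo))
            bynd
0  analystworlds
1     moskowapac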

How to select data from multiple dataframes

I'm a beginner with pandas. I have two dataframes. The first is called DATA_DF; it contains many fields, and I'm interested in DATA_DF['Date effet'], which has dtype datetime.
The other dataframe is called TAUX_DF; it contains years, and every year has a value:
TAUX_DF =
Année    <10 ans    >10 ans
1987     2,8168%    3,4664%
1988     2,8168%    3,4664%
1989     2,8168%    3,4664%
1990     2,8168%    3,4664%
I want to create a new column in DATA_DF called DATA_DF['Taux technique'].
It should take the year from DATA_DF['Date effet'].dt.year, compare it with the year in TAUX_DF['Année'], and fill in the value like this Excel formula:
=SI(G5>120;RECHERCHEV(ANNEE(C5);Taux!$A$2:$C$29;3;FAUX);RECHERCHEV(ANNEE(C5);Taux!$A$2:$C$29;2;FAUX))
DATA_DF['Année'] = DATA_DF['Date effet'].dt.year  # add a year column so we can merge with TAUX_DF
DATA_DF = pd.merge(DATA_DF, TAUX_DF, on='Année', how='left')
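The merge above only attaches both rate columns. For the conditional part of the Excel formula (the G5>120 test), a sketch with np.where, assuming a hypothetical 'Durée' column plays the role of Excel's G5:
import numpy as np

DATA_DF['Taux technique'] = np.where(DATA_DF['Durée'] > 120,
                                     DATA_DF['>10 ans'],   # like RECHERCHEV(...;3;FAUX)
                                     DATA_DF['<10 ans'])   # like RECHERCHEV(...;2;FAUX)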

Histogram of a pandas dataframe

I couldn't find a similar question anywhere on the site.
I have a fairly large file, with over 100,000 lines, which I read using pandas:
df = pd.read_excel("somefile.xls",index_col='Offense Type')
I ended up with a dataframe consisting of the index column ('Offense Type') and one other column ('Hour').
'Offense Type' consists of a series of categories, say cat1, cat2, cat3, etc...
'Hour' consists of integer numbers between 1 and 24.
What I would like is a histogram of the occurrences of each number in the dataframe, per category (there aren't that many categories; at most 10 of them).
Here's an ASCII representation of what I want to get (the x's represent the bars in the histogram; they will surely be at a much higher value than 1, 2 or 3):
x x # And so on
x x x x x x #
x x x x x x x #
1 2 11 20 5 8 18 #
Cat1 Cat2 #
But I'm getting a single barplot for every line in df using:
df.plot(kind='bar')
which is basically unreadable.
I've also tried the hist() and Histogram() functions, with no luck.
After a long night, I found the answer. Since every event occurs only once, I added an extra column to the file containing the number one, and then indexed the dataframe by it:
df = pd.read_excel("somefile.xls",index_col='Numberone')
And then simply tried this:
df.hist(by=df['Offense Type'])
finally getting exactly what I wanted.
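An alternative sketch that avoids the helper column, passing the category column to Series.hist directly:
df = pd.read_excel("somefile.xls")
df['Hour'].hist(by=df['Offense Type'])  # one histogram panel per category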