I am trying to transition from Excel to Python, and for practice I would like to analyze sports data from the NFL season. I have created a pandas dataframe with the data I would like to track, but I was wondering how I can go through the data and create a dictionary with each team's wins and losses. I thought that I could iterate through the dataframe and check whether or not each team has already been entered into the dictionary, and if not, append their name to it.
Any advice?
closing_lines dataframe sample:
   Year Week side       type  line   odds  outcome
0  2006   01  PIT  MONEYLINE   NaN -125.0      1.0
1  2006   01  MIA  MONEYLINE   NaN  105.0      0.0
2  2006   01  MIA     SPREAD   1.5    NaN      0.0
3  2006   01  PIT     SPREAD  -1.5    NaN      1.0
results = {'Team': [], 'Wins': [], 'Losses': []}
# iterate through the data
# check to see if the dictionary has the team we are looking at
# if it doesn't, add it to the dictionary
# if it does, add a unit to either the wins or the losses
closing_lines = closing_lines.reset_index(drop=True)  # make sure that the index matches the number of rows
for index, row in closing_lines.iterrows():
    team = row['side']
    if team not in results['Team']:
        # first time we see this team: add it with zero wins and losses
        results['Team'].append(team)
        results['Wins'].append(0)
        results['Losses'].append(0)
    # position of this team in the parallel lists
    position = results['Team'].index(team)
    if row['outcome'] == 1:
        results['Wins'][position] += 1
    else:
        results['Losses'][position] += 1
The more pandas way of doing this is to create a new data frame indexed by team with columns for wins and losses. The groupby method can help with this. You can group the rows of your dataframe by team and then run some kind of summary over the results, e.g.:
closing_lines.groupby('side')['outcome'].sum()
creates a new Series indexed by 'side' with the sum of the 'outcome' column for each 'side' (which I think is Wins for this data).
Check out this answer to see how to count zeros and non-zeros in a groupby column.
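To make the groupby idea concrete, here is a minimal sketch that counts wins and losses per team in one pass. It assumes that outcome is 1.0 for a win and 0.0 for a loss, as in the sample above; the small frame is rebuilt by hand from that sample (only the two columns that matter), and the names Wins, Losses and record are just illustrative.

```python
import pandas as pd

# Hand-built copy of the sample rows (only the columns needed here)
closing_lines = pd.DataFrame({
    'side': ['PIT', 'MIA', 'MIA', 'PIT'],
    'outcome': [1.0, 0.0, 0.0, 1.0],
})

# Assumption: outcome == 1 means a win, outcome == 0 means a loss
flags = pd.DataFrame({
    'side': closing_lines['side'],
    'Wins': closing_lines['outcome'].eq(1),
    'Losses': closing_lines['outcome'].eq(0),
})

# Summing the boolean flags per team gives the counts
record = flags.groupby('side')[['Wins', 'Losses']].sum().astype(int)
print(record)
```

If you really want a dictionary at the end, record.to_dict('index') gives one keyed by team.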
I have separate pandas DataFrame objects containing two columns each. One column is a team name and the other is a numerical value by which I sorted the entries in the DataFrame. Here is an example of two of these dataframes with generic values:
df1:
Name PPG
1 TOR 105.0
11 SAC 102.5
17 LAL 100.0
15 PHX 98.5
df2:
Name TRB
17 LAL 45.0
11 SAC 44.0
15 PHX 42.5
1 TOR 42.0
What I want to do is get the difference in the ranking of each team ('Name') in these separate dataframes. For example, LAL is ranked 3rd in PPG and 1st in TRB. Is there a way I could get the difference in ranking position (in this example with LAL, it would be 2, and for SAC it would be 0, and so on)?
So far, I have used df1['PPG'].rank() and df2['TRB'].rank() to create a new rank column in each dataframe. Once I have that, I tried using df1['Rank'].compare(df2['Rank']) but I get the following error:
ValueError: Can only compare identically-labeled Series objects
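One possible way around this, sketched under the assumption that a higher PPG or TRB means a better rank (hence ascending=False): index both rankings by team name and subtract them. Subtraction aligns on the index labels, so the row order no longer matters, whereas compare() insists on identically labeled Series. The variable names ppg_rank, trb_rank and rank_diff are made up for this example.

```python
import pandas as pd

# The two example frames from the question
df1 = pd.DataFrame({'Name': ['TOR', 'SAC', 'LAL', 'PHX'],
                    'PPG': [105.0, 102.5, 100.0, 98.5]},
                   index=[1, 11, 17, 15])
df2 = pd.DataFrame({'Name': ['LAL', 'SAC', 'PHX', 'TOR'],
                    'TRB': [45.0, 44.0, 42.5, 42.0]},
                   index=[17, 11, 15, 1])

# Rank within each frame, labeled by team name (1 = highest value)
ppg_rank = df1.set_index('Name')['PPG'].rank(ascending=False)
trb_rank = df2.set_index('Name')['TRB'].rank(ascending=False)

# Index-aligned subtraction: PPG rank minus TRB rank for each team
rank_diff = (ppg_rank - trb_rank).astype(int)
print(rank_diff)
```

With this sign convention LAL comes out as 2 and SAC as 0, matching the numbers in the question; a negative value means the team ranks better in PPG than in TRB.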
The documentation for pandas.read_excel mentions something called 'roundtripping', under the description of the index_col parameter.
Missing values will be forward filled to allow roundtripping with to_excel for merged_cells=True.
I have never heard of this term before, and if I search for a definition, I can find one only in the context of finance. I have seen it referred to in the context of merging dataframes in Pandas, but I have not found a definition.
For context, this is the complete description of the index_col parameter:
index_col : int, list of int, default None
Column (0-indexed) to use as the row labels of the DataFrame. Pass
None if there is no such column. If a list is passed, those columns
will be combined into a MultiIndex. If a subset of data is selected
with usecols, index_col is based on the subset.
Missing values will be forward filled to allow roundtripping with
to_excel for merged_cells=True. To avoid forward filling the
missing values use set_index after reading the data instead of
index_col.
For a general idea of the meaning of roundtripping, have a look at the answers to this post on SE. Applied to your example, "allow roundtripping" is used to mean something like this:
facilitate an easy back-and-forth between the data in an Excel file
and the same data in a df. I.e. while maintaining the intended
structure throughout.
Example round trip
The usefulness of this idea is perhaps best seen if we start with a somewhat complex df with both index and columns as named MultiIndices (for the constructor, see pd.MultiIndex.from_product):
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.rand(4,4),
                  columns=pd.MultiIndex.from_product([['A','B'],[1,2]],
                                                     names=['col_0','col_1']),
                  index=pd.MultiIndex.from_product([[0,1],[1,2]],
                                                   names=['idx_0','idx_1']))
print(df)
col_0               A                   B
col_1               1         2         1         2
idx_0 idx_1
0     1      0.952749  0.447125  0.846409  0.699479
      2      0.297437  0.813798  0.396506  0.881103
1     1      0.581273  0.881735  0.692532  0.725254
      2      0.501324  0.956084  0.643990  0.423855
If we now use df.to_excel with the default for merge_cells (i.e. True) to write this data to an Excel file, we will end up with data as follows:
df.to_excel('file.xlsx')
Result: [screenshot of file.xlsx, showing merged cells in the header and index]
Aesthetics aside, the structure here is very clear, and indeed, the same as the structure in our df. Take notice of the merged cells especially.
Now, let's suppose we want to retrieve this data again from the Excel file at some later point, and we use pd.read_excel with default parameters. Problematically, we will end up with a complete mess:
# use a new name so that the original df survives for the comparison further down
df_raw = pd.read_excel('file.xlsx')
print(df_raw)
  Unnamed: 0  col_0         A  Unnamed: 3         B  Unnamed: 5
0        NaN  col_1  1.000000    2.000000  1.000000    2.000000
1      idx_0  idx_1       NaN         NaN       NaN         NaN
2          0      1  0.952749    0.447125  0.846409    0.699479
3        NaN      2  0.297437    0.813798  0.396506    0.881103
4          1      1  0.581273    0.881735  0.692532    0.725254
5        NaN      2  0.501324    0.956084  0.643990    0.423855
Getting this data "back into shape" would be quite time-consuming. To avoid such a hassle, we can rely on the parameters index_col and header inside pd.read_excel:
df2 = pd.read_excel('file.xlsx', index_col=[0,1], header=[0,1])
print(df2)
col_0               A                   B
col_1               1         2         1         2
idx_0 idx_1
0     1      0.952749  0.447125  0.846409  0.699479
      2      0.297437  0.813798  0.396506  0.881103
1     1      0.581273  0.881735  0.692532  0.725254
      2      0.501324  0.956084  0.643990  0.423855
# check for equality
df.equals(df2)
# True
As you can see, we have made a "round trip" here, and index_col and header allow for it to have been smooth sailing!
Two final notes:
(minor) The docs for pd.read_excel contain a typo in the index_col section: it should read merge_cells=True, not merged_cells=True.
The header section is missing a similar comment (or a reference to the comment at index_col). This is somewhat confusing. As we saw above, the two behave exactly the same (for present purposes, at least).
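To make the "forward filled" part of the docs concrete, here is a small sketch that does not touch Excel at all. It only imitates what a naive read of merged index cells looks like (the label sits in the first row of each merged block, the rest is NaN) and then fills the gaps by hand. This is my illustration of the idea, not the actual code path inside read_excel; the frame raw and its values are made up.

```python
import numpy as np
import pandas as pd

# Imitation of a naive read: merged cells in idx_0 leave NaN behind
raw = pd.DataFrame({
    'idx_0': [0, np.nan, 1, np.nan],
    'idx_1': [1, 2, 1, 2],
    'value': [0.95, 0.29, 0.58, 0.50],   # illustrative numbers
})

# Forward fill the outer level so every row carries its label again
filled = raw.copy()
filled['idx_0'] = filled['idx_0'].ffill().astype(int)

# Now the two columns can be turned back into a MultiIndex
restored = filled.set_index(['idx_0', 'idx_1'])
print(restored)
```

Without the forward fill, the rows under a merged cell would end up with NaN in the outer index level, and the frame would no longer match what was written out.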
I am trying to do an exercise in pandas.
I have two dataframes. I need to compare a few columns between both dataframes and change the value of one column in the first dataframe if the comparison is successful.
Dataframe 1:
Article Country Colour Buy
Pants Germany Red 0
Pull Poland Blue 0
Initially all my articles have the flag 'Buy' set to zero.
I have dataframe 2 that looks as:
Article Origin Colour
Pull Poland Blue
Dress Italy Red
I want to check if the article, country/origin and colour columns match (that is, check whether I can find each article from dataframe 1 in dataframe 2) and, if so, I want to set the flag 'Buy' to 1.
I tried iterating through both dataframes with pyspark, but pyspark dataframes are not iterable.
I thought about doing it in pandas, but apparently it is bad practice to change values during iteration.
Which code in pyspark or pandas would work to do what I need to do?
Thanks!
merge with an indicator then map the values. Make sure to drop_duplicates on the merge keys in the right frame so the merge result is always the same length as the original, and rename so we don't repeat the same information after the merge. No need to have a pre-defined column of 0s.
df1 = df1.drop(columns='Buy')
df1 = df1.merge(df2.drop_duplicates().rename(columns={'Origin': 'Country'}),
                indicator='Buy', how='left')
df1['Buy'] = df1['Buy'].map({'left_only': 0, 'both': 1}).astype(int)
Article Country Colour Buy
0 Pants Germany Red 0
1 Pull Poland Blue 1
I'm starting to learn about Python Pandas and want to generate a graph with the sum of arbitrary groupings of an ordinal value. It can be better explained with a simple example.
Suppose I have the following table of food consumption data:
And I have two groups of foods defined as two lists:
healthy = ['apple', 'brocolli']
junk = ['cheetos', 'coke']
Now I want to plot a graph with the evolution of consumption of junk and healthy food. I believe I must then process my data to get a DataFrame like:
Suppose the first table is already in a Dataframe called food, how do I transform it to get the second one?
I also welcome suggestions to reword my question to make it clearer, or for different approaches to generate the plot.
First create a dictionary from the lists and then swap the keys with the values, so that each food maps to its group.
Then group by the food column mapped through that dict together with year, aggregate with sum, and finally reshape with unstack:
healthy = ['apple', 'brocolli']
junk = ['cheetos', 'coke']
d1 = {'healthy':healthy, 'junk':junk}
##http://stackoverflow.com/a/31674731/2901002
d = {k: oldk for oldk, oldv in d1.items() for k in oldv}
print (d)
{'brocolli': 'healthy', 'cheetos': 'junk', 'apple': 'healthy', 'coke': 'junk'}
df1 = df.groupby([df.food.map(d), 'year'])['amount'].sum().unstack(0)
print (df1)
food healthy junk
year
2010 10 11
2011 17 10
2012 13 24
Another solution with pivot_table:
df1 = df.pivot_table(index='year', columns=df.food.map(d), values='amount', aggfunc='sum')
print (df1)
food healthy junk
year
2010 10 11
2011 17 10
2012 13 24
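Since the original table is not shown above, here is a self-contained sketch of the groupby variant on invented data. The rows in food are made up for illustration (only the column names food, year and amount are taken from the code above), and groups, food_to_group and totals are names I picked.

```python
import pandas as pd

# Invented consumption data, only to make the example runnable
food = pd.DataFrame({
    'year': [2010, 2010, 2010, 2011, 2011],
    'food': ['apple', 'cheetos', 'coke', 'brocolli', 'coke'],
    'amount': [10, 5, 6, 17, 10],
})

groups = {'healthy': ['apple', 'brocolli'], 'junk': ['cheetos', 'coke']}

# Invert the dict: each food item points at its group name
food_to_group = {item: group for group, items in groups.items() for item in items}

# Group by (group name, year), sum the amounts, put groups in columns
group_key = food['food'].map(food_to_group).rename('group')
totals = food.groupby([group_key, 'year'])['amount'].sum().unstack(0)
print(totals)
```

From there, totals.plot() should draw one line per group over the years (it needs matplotlib installed).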
What is the most efficient way to forward fill information in a large dataframe?
I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now I have about 200,000 rows of unique data which would track any change that happens to one of the dimensions.
Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?
id start_date end_date is_current location dimensions...
xyz987 2016-03-11 2016-04-02 Expired CA lots_of_stuff
xyz987 2016-04-03 2016-04-21 Expired NaN lots_of_stuff
xyz987 2016-04-22 NaN Current CA lots_of_stuff
That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (this is an error in the raw data). An example is that for previous rows, the location is filled out for the row but it is blank in the next row. I know that the location has not changed but it is capturing it as a unique row because it is blank.
I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?
cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)
There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a
df.fillna(method='ffill', inplace=True)
but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).
It is likely efficient to execute the fillna directly on the groupby object:
df = df.groupby(['id']).fillna(method='ffill')
The method is referenced here in the documentation.
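A small sketch of the per-id forward fill, using the groupby ffill() method rather than fillna(method='ffill'); as far as I know the method= argument is deprecated in recent pandas versions. The sample rows are invented along the lines of the question, and value_cols is just a helper name. The fill is written back column-wise because the result of a groupby ffill() may not include the grouping column.

```python
import numpy as np
import pandas as pd

# Invented rows in the shape of the question's data
df = pd.DataFrame({
    'id': ['xyz987', 'xyz987', 'xyz987', 'abc123', 'abc123'],
    'start_date': ['2016-03-11', '2016-04-03', '2016-04-22',
                   '2016-01-01', '2016-02-01'],
    'location': ['CA', np.nan, 'CA', np.nan, 'NY'],
})

# Every column except the grouping key gets filled
value_cols = [col for col in df.columns if col != 'id']

# Forward fill within each id only; values never leak across ids
df[value_cols] = df.groupby('id')[value_cols].ffill()
print(df)
```

Note that the first abc123 row keeps its NaN: there is no earlier row with the same id to fill from, which is the behaviour you want here.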
How about forward filling each group?
df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
github/jreback: this is a dupe of #7895. .ffill is not implemented in cython on a groupby operation (though it certainly could be), and instead calls python space on each group. here's an easy way to do this.
Source: https://github.com/pandas-dev/pandas/issues/11296
According to jreback's comment, ffill() is not optimized when you do a groupby, but cumsum() is. Try this:
df = df.sort_values('id', kind='stable')
# per-id running count of non-null cells; 0 means there is nothing to fill from yet
seen = df.notnull().astype(int).groupby(df['id']).cumsum()
df = df.ffill().where(seen > 0)