Pandas, approximating a bar plot for large dataframes

I have a dataframe (around 10k rows) of the following form:
id    voted
123   1.0
12    0.0
215   1.0
362   0.0
...
And I want to bar plot this to see where the values are mostly 0.0 and where they are mostly 1.0 (the order of the ids in the first column matters, as the dataframe is sorted).
I tried doing a bar plot, but even when I restrict myself to a small subset of the dataframe, the plot is still not readable.
Is there a way to approximate areas that are mostly 1.0 with a single thicker bar, similar to what happens with a histogram when we lower the number of bins?

Since you are looking for an interval approximation of the density of the votes, you could add a moving average:
df['ma'] = df['voted'].rolling(5).mean()
This gives you a running average that you can plot over the index as a line graph; wherever the value is close to 1, you know you have a group of ids that voted 1.0.
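For instance, a minimal sketch of plotting it (the larger, centered window is an assumption you can tune to control how coarse the smoothing is):
import matplotlib.pyplot as plt

window = 50  # assumption: larger windows give a coarser, more readable curve
df['ma'] = df['voted'].rolling(window, center=True).mean()

plt.plot(df['ma'].values)  # x axis is the row position in the sorted dataframe
plt.xlabel('row position (sorted order)')
plt.ylabel('share of 1.0 votes in the window')
plt.show()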

Related

Bar plot from two different datasets with different data range

I have the following datasets:
df1 = {'lower':[3.99,4.99,5.99,1700], 'percentile':[1,2,5,10,50,100]}
df2 = {'lower':[2.99,4.50,5,1850], 'percentile':[2,4,7,15,55,100]}
The data:
The percentile refers to the percentage of the data that corresponds to a particular price, e.g. 3.99 would represent 1% of the data, while all values under 5.99 would represent 5% of the data.
Both datasets have a length of 100, since we are showing percentiles, but the price at each percentile varies between the two datasets.
What I have done so far:
What I need help with:
As you can see in the third graph, I can plot the two datasets overlaid, which is what I need, but I have not been able to change the legend or the odd x-tick values on that graph. It does not show the percentile, or any other metric I might want on the x axis.
Any help?
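One possible direction, sketched under the assumption that df1 and df2 are DataFrames with matching 'percentile' and 'lower' columns (the prices below are only illustrative, not the real data): place the bars at the percentile positions so the x ticks carry meaning, and let the label= arguments drive the legend.
import matplotlib.pyplot as plt
import pandas as pd

# illustrative data with consistent lengths; replace with the real frames
df1 = pd.DataFrame({'percentile': [1, 2, 5, 10, 50, 100],
                    'lower': [3.99, 4.99, 5.99, 7.50, 20.0, 1700]})
df2 = pd.DataFrame({'percentile': [2, 4, 7, 15, 55, 100],
                    'lower': [2.99, 4.50, 5.00, 8.00, 22.0, 1850]})

fig, ax = plt.subplots()
ax.bar(df1['percentile'], df1['lower'], width=1.5, alpha=0.6, label='dataset 1')
ax.bar(df2['percentile'], df2['lower'], width=1.5, alpha=0.6, label='dataset 2')
ax.set_xlabel('percentile')   # x ticks now show percentile values
ax.set_ylabel('price')
ax.legend()                   # legend text comes from the label= arguments
plt.show()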

Iterating through pandas dataframe and appending a dictionary?

I am trying to transition from Excel to Python, and for practice I would like to analyze sports data from the NFL season. I have created a pandas dataframe with the data I would like to track, but I am wondering how I can go through the data and build a dictionary of each team's wins and losses. I thought that I could iterate through the dataframe, check whether each team has already been entered into the dictionary, and if not, append its name to it.
Any advice?
closing_lines dataframe sample:
   Year Week side       type  line   odds  outcome
0  2006   01  PIT  MONEYLINE   NaN -125.0      1.0
1  2006   01  MIA  MONEYLINE   NaN  105.0      0.0
2  2006   01  MIA     SPREAD   1.5    NaN      0.0
3  2006   01  PIT     SPREAD  -1.5    NaN      1.0
results = {'Team': [], 'Wins': [], 'Losses': []}

# iterate through the data
# check to see if the dictionary has the team we are looking at
# if it doesn't, add it to the dictionary
# if it does, add a unit to either the wins or the losses
closing_lines = closing_lines.reset_index()  # make sure that the index matches the number of rows
for index, row in closing_lines.iterrows():
    for key, Team in results.items():
        if Team == closing_lines[row, 'side']:
            pass
        else:
            results['Team'].append(closing_lines[row, 'side'])
The more pandas way of doing this is to create a new data frame indexed by team with columns for wins and losses. The groupby method can help with this. You can group the rows of your dataframe by team and then run some kind of summary over the results, e.g.:
closing_lines.groupby('side')['outcome'].sum()
creates a new Series indexed by 'side' with the sum of the 'outcome' column for each 'side' (which I think is Wins for this data).
Check out this answer to see how to count zeros and non-zeros in a groupby column.
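For example, a minimal sketch that builds a wins/losses table directly, assuming outcome 1.0 marks a win and 0.0 a loss as in the sample:
# count wins (outcome == 1.0) and losses (outcome == 0.0) per team
record = closing_lines.groupby('side')['outcome'].agg(
    Wins=lambda s: (s == 1.0).sum(),
    Losses=lambda s: (s == 0.0).sum(),
)
print(record)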

Creating a 2D image by logarithmic binning

I have a DataFrame consisting of two columns as follows:
col1 col2
0.33 4.33
0.21 4.89
3.2 18.78
6.22 0.05
6.0 2.1
... ...
... ...
Now I would like to create a 200 x 200 numpy array by binning both columns. The x-axis should be col1 and the y-axis should be col2. col1 should be binned logarithmically from 0 to 68 and col2 logarithmically from 0 to 35. I would like to use logarithmic binning because there are more smaller values than larger ones (i.e. the bins should get larger as the values get larger). The 200 x 200 array should then store the number of samples in each bin (i.e. the count).
Is this possible to do in an efficient way?
Something like this might work for you... (note that you have to choose how close to zero the lower end is):
bins1 = np.logspace(np.log10(0.001), np.log10(68), num=201)
bins2 = np.logspace(np.log10(0.001), np.log10(35), num=201)
result = np.histogram2d(df['col1'], df['col2'], bins=[bins1, bins2])
...where result[0] is the array of counts per bin, and result[1] and result[2] are the bin edges (the same as bins1 and bins2).
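If you also want to look at the result, one way (a sketch, assuming the code above has run) is to draw the counts with pcolormesh using the returned bin edges and log-scaled axes:
import matplotlib.pyplot as plt

counts, xedges, yedges = result
fig, ax = plt.subplots()
# counts has shape (200, 200) with the first axis following col1, so transpose for plotting
mesh = ax.pcolormesh(xedges, yedges, counts.T)
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('col1')
ax.set_ylabel('col2')
fig.colorbar(mesh, label='samples per bin')
plt.show()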

How to filter a Dataframe based on an ID-Column which corresponds to a second Dataframe containing conditions for each ID efficiently?

I have a DataFrame with one ID column and two data columns, X and Y, containing numeric values. For each ID there are several rows of data.
I have a second DataFrame with the same ID column and two numeric columns specifying the lower and upper limit of the X values for each ID.
I want to use the second DataFrame to filter the first one so that it only keeps rows whose X value lies within the X_min-X_max range of the corresponding ID.
I can solve this by looping over the second DataFrame and filtering the groupby(ID) groups of the first one, but that is slow for a large number of IDs. Is there a more efficient way?
Example code with the data in df, the ranges in df_ranges and the expected result in df_result; the real DataFrames are obviously a lot bigger.
import pandas as pd
x=[2.1,2.2,2.6,2.4,2.8,3.5,2.8,3.2]
y=[3.1,3.5,3.4,2.7,2.1,2.7,4.1,4.3]
ID=[0]*4+[0.1]*4
x_min=[2.0,3.0]
x_max=[2.5,3.4]
IDs=[0,0.1]
df=pd.DataFrame({'ID':ID,'X':x,'Y':y})
df_ranges=pd.DataFrame({'ID':IDs,'X_min':x_min,'X_max':x_max})
df_result=df.iloc[[0,1,3,7],:]
Possible Solution:
def filter_ranges(grp, df_ranges):
    x_min = df_ranges.loc[df_ranges.ID == grp.name, 'X_min'].values[0]
    x_max = df_ranges.loc[df_ranges.ID == grp.name, 'X_max'].values[0]
    return grp.loc[(grp.X >= x_min) & (grp.X <= x_max), :]

target_df_grp = df.groupby('ID').apply(filter_ranges, df_ranges=df_ranges)
Try this:
merged = df.merge(df_ranges, on='ID')
target_df = merged[(merged.X >= merged.X_min) & (merged.X <= merged.X_max)][['ID', 'X', 'Y']]  # here, the desired filter is applied
print(target_df) will give:
    ID    X    Y
0  0.0  2.1  3.1
1  0.0  2.2  3.5
3  0.0  2.4  2.7
7  0.1  3.2  4.3
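As a design note, a hedged variant that avoids materializing the merged frame (which can matter when df is very large): map each row's limits onto df by ID and filter with a boolean mask. Column names follow the question; this also keeps df's original index.
# look up each row's limits by ID, then keep rows whose X falls inside them
limits = df_ranges.set_index('ID')
in_range = (df['X'] >= df['ID'].map(limits['X_min'])) & \
           (df['X'] <= df['ID'].map(limits['X_max']))
target_df = df[in_range]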

Scatter plot of Multiindex GroupBy()

I'm trying to make a scatter plot of a GroupBy() with Multiindex (http://pandas.pydata.org/pandas-docs/stable/groupby.html#groupby-with-multiindex). That is, I want to plot one of the labels on the x-axis, another label on the y-axis, and the mean() as the size of each point.
df['RMSD'].groupby([df['Sigma'],df['Epsilon']]).mean() returns:
Sigma_ang  Epsilon_K
3.4        30           0.647000
           40           0.602071
           50           0.619786
3.6        30           0.646538
           40           0.591833
           50           0.607769
3.8        30           0.616833
           40           0.590714
           50           0.578364
Name: RMSD, dtype: float64
And I'd like to plot something like: plt.scatter(x=Sigma, y=Epsilon, s=RMSD)
What's the best way to do this? I'm having trouble getting the proper Sigma and Epsilon values for each RMSD value.
+1 to Vaishali Garg. Based on his comment, the following works:
df_mean = df['RMSD'].groupby([df['Sigma'],df['Epsilon']]).mean().reset_index()
plt.scatter(df_mean['Sigma'], df_mean['Epsilon'], s=100.*df_mean['RMSD'])
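The key step here is reset_index(), which turns the Sigma and Epsilon index levels of the grouped result back into ordinary columns so they can be passed to plt.scatter directly; the factor of 100 only scales the marker sizes so the RMSD differences are visible.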