I'm new to data science & pandas. I'm just trying to visualize the distribution of a single series (one column), but the histogram I'm generating shows only a single bar (see below, where the data is sorted descending).
My data is over 11 million rows. The max value is 27,235 and the min value is 1. I'd like to see the "count" column grouped into bins, with a bar whose height is the total for each bin. But I'm only seeing a single bar and am not sure what to do.
Data
df = pd.DataFrame({'count':[27235,26000,25877]})
Solution
import matplotlib.pyplot as plt
df['count'].hist()
Alternatively
import seaborn as sns
sns.histplot(df['count'])  # sns.distplot is deprecated since seaborn 0.11; use histplot
Related
I have a DataFrame which I want to slice into many DataFrames by adding rows one by one until the sum of the Score column is greater than 50,000. Once that condition is met, I want a new slice to begin.
Sum Score cumulatively, floor divide it by 50,000, and shift the result down one cell, so that the row which pushes the running total past each 50,000 boundary stays with the group it completes (making each group > 50,000 rather than < 50,000).
import pandas as pd
import numpy as np
# Generating DataFrame with random data
df = pd.DataFrame(np.random.randint(1,60000,15))
# Creating new column that's a cumulative sum with each
# value floor divided by 50000
df['groups'] = df[0].cumsum() // 50000
# Labels shifted down one, with the first row filled with 0, so the row
# that crosses each 50,000 boundary is included in the group it completes
df['groups'] = df['groups'].shift(1, fill_value=0)
Then as per this answer you can use pandas.DataFrame.groupby in a list comprehension to return a list of split DataFrames.
df_list = [df_slice for _, df_slice in df.groupby('groups')]
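Note that the cumulative-sum trick is approximate: it cuts at multiples of 50,000 of the global running total, so a slice can come up short when individual Score values are large relative to the threshold. If the condition must hold exactly, a plain loop that resets a running total does it (a sketch; split_by_running_sum and the Score values are made up for illustration):

```python
import pandas as pd

def split_by_running_sum(frame, col, threshold=50000):
    """Start a new group right after the running sum within the
    current group exceeds the threshold."""
    labels = []
    group, running = 0, 0
    for value in frame[col]:
        labels.append(group)
        running += value
        if running > threshold:  # condition met; next row starts a new slice
            group += 1
            running = 0
    return labels

# Made-up Score values for illustration
df = pd.DataFrame({'Score': [20000, 20000, 20000, 20000, 60000, 10, 10]})
df['groups'] = split_by_running_sum(df, 'Score')
df_list = [df_slice for _, df_slice in df.groupby('groups')]
```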
Data
How can I split the values in the category_lvl2 column into bins for each different value, and find the average amount for all the values in each bin?
For example, finding the average amount spent on coffee.
I have already performed feature scaling on the amounts
You can use the groupby() method with the groups you get from pd.cut(). The example below bins the data into 10 categories by the sepal_length column, then uses those categories to group the iris DataFrame. You can also bin by one variable and take the mean of another.
import pandas as pd
import seaborn as sns
iris = sns.load_dataset('iris')
bins = pd.cut(iris.sepal_length, 10)
iris.groupby(bins).sepal_length.mean()
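To bin one variable and take the mean of another, as the last sentence mentions, pass the cut of the first column to groupby and aggregate the second; a sketch with made-up amount/rating numbers standing in for the real frame:

```python
import pandas as pd

# Made-up data standing in for the amount / category frame
df = pd.DataFrame({
    'amount': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    'rating': [10, 20, 30, 40, 50, 60, 70, 80],
})

# Cut `amount` into 4 equal-width bins, then average `rating` per bin
bins = pd.cut(df['amount'], 4)
means = df.groupby(bins, observed=False)['rating'].mean()
print(means)
```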
I'm looking to analyze the ABV and Style of beer, and then take an average for graphing. I have all the beer styles and their ABVs in a DataFrame. I'm looking to create separate DataFrames for each style, and then take the average of that style's ABV.
I've tried groupby and got nothing.
What I want to accomplish:
- Split the DataFrame into multiple DataFrames by style, each including all ABVs for that style (there are some duplicate ABV values: 90 styles, 71 unique ABVs)
- Take the average of each style
- Graph in a scatter plot
Data Frame:
I managed to find documentation on iterating through groups and came up with the loop below, which sorted the data by style:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
beers_csv = pd.read_csv("Resources/cleaned_beer.csv")
dropped_beers_csv = beers_csv.drop(columns=["Unnamed: 0", "Brewery ID", "Brewery", "City", "IBU", "State", "OZ.", "Beer"])
beer_data = dropped_beers_csv
grouped = beer_data.groupby('Style')
for name, group in grouped:
    print(name)
    print(group)

grouped_beer = grouped.mean()
grouped_beer
It returned all the styles and their ABVs (for example, it returned 2 Abbey Single Ales and their ABVs).
The last two lines apply the mean function and produce a DataFrame with 90 rows; a unique count on my original csv file confirms 90 unique styles, so the mean was taken for each group. Now I have a 90-row DataFrame containing each unique style and its average ABV.
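For the remaining scatter-plot step, the 90-row result can be plotted directly. A sketch with a few made-up styles and ABVs in place of the real csv:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up styles and ABVs standing in for the cleaned beer csv
beer_data = pd.DataFrame({
    'Style': ['IPA', 'IPA', 'Stout', 'Stout', 'Lager'],
    'ABV':   [0.065, 0.070, 0.080, 0.090, 0.050],
})

# Mean ABV per style, kept as a regular DataFrame for plotting
grouped_beer = beer_data.groupby('Style', as_index=False)['ABV'].mean()

plt.scatter(grouped_beer['Style'], grouped_beer['ABV'])
plt.xlabel('Style')
plt.ylabel('Mean ABV')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
```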
I am interested in how to interpolate/resample/extrapolate the columns of a pandas DataFrame for both purely numerical and datetime indices. I'd like to do this with either straightforward linear interpolation or spline interpolation.
Consider first a simple pandas data frame that has a numerical index (signifying time) and a couple of columns:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,2), index=np.arange(0,20,2))
print(df)
0 1
0 0.937961 0.943746
2 1.687854 0.866076
4 0.410656 -0.025926
6 -2.042386 0.956386
8 1.153727 -0.505902
10 -1.546215 0.081702
12 0.922419 0.614947
14 0.865873 -0.014047
16 0.225841 -0.831088
18 -0.048279 0.314828
I would like to resample the columns of this dataframe over some denser grid of time indices which possibly extend beyond the last time index (thus requiring extrapolation).
Denote the denser grid of indices as, for example:
t = np.arange(0,40,.6)
The interpolate method for a pandas DataFrame seems to fill only NaNs, and thus requires the new indices (which may or may not coincide with the original ones) to already be part of the DataFrame. I guess I could append a DataFrame of NaNs at the new indices to the original (excluding any indices appearing in both), call interpolate, and then remove the original time indices. Or I could do everything in scipy and build a new DataFrame at the desired time indices.
Is there a more direct way to do this?
In addition, I'd like to know how to do this same thing when the indices are, in fact, datetimes. That is, when, for example:
df.index = np.array('2015-07-04 02:12:40', dtype=np.datetime64) + np.arange(0,20,2)
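For what it's worth, the append-NaNs idea described in the question can be written compactly by reindexing onto the union of old and new indices; a sketch for the numeric-index case (note that beyond the original index range pandas only clamps to the edge values, so true linear or spline extrapolation still needs scipy.interpolate):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 2), index=np.arange(0, 20, 2))
t = np.arange(0, 40, .6)

# Union of old and new indices -> interpolate on index values -> keep new grid.
# limit_direction='both' also fills points outside the original range,
# but only by holding the edge values constant (no true extrapolation).
combined = df.index.union(pd.Index(t))
out = df.reindex(combined).interpolate(method='index', limit_direction='both').loc[t]
```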
In a csv file, how can I calculate the average of selected rows in a column?
Columns
I did this:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#Read the csv file:
df = pd.read_csv("D:\\xxxxx\\mmmmm.csv")
#Separate the columns and get the average:
# Skid:
S = df['Skid Number after milling'].mean()
But this just gave me the average for the entire column.
Thank you for the help!
For selecting rows in a pandas dataframe or series you can use the .iloc attribute.
For example, df['A'].iloc[3:5] selects the fourth and fifth rows of column "A" of a DataFrame. Indexing starts at 0, and the position after the colon is not included. This returns a pandas Series.
You can do the same using NumPy: df["A"].values[3:5]
This returns a NumPy array instead.
The mean can therefore be calculated with either:
df['A'].iloc[3:5].mean()
or
df["A"].values[3:5].mean()
Also see the documentation about indexing in pandas.
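As a self-contained sketch (the column name and numbers are made up):

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40, 50, 60]})

# Positions 3 and 4, i.e. the fourth and fifth rows; position 5 is excluded
selected_mean = df['A'].iloc[3:5].mean()  # (40 + 50) / 2
print(selected_mean)
```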