Histogram of a pandas dataframe - pandas

I couldn't find a similar question anywhere on the site.
I have a fairly large file, with over 100,000 lines, which I read using pandas:
df = pd.read_excel("somefile.xls",index_col='Offense Type')
I ended up with a dataframe consisting of the first column (the index column) and one other column, 'Offense Type' and 'Hour' respectively.
'Offense Type' consists of a series of "categories", say cat1, cat2, cat3, etc.
'Hour' consists of a series of integers between 1 and 24.
What I would like to do is obtain a histogram of the occurrences of each number per category (there aren't that many categories; there are at most 10 of them).
Here's an ASCII representation of what I want to get
(the x's represent the bars in the histogram; they will surely be at a much higher value than 1, 2 or 3):
x x # And so on
x x x x x x #
x x x x x x x #
1 2 11 20 5 8 18 #
Cat1 Cat2 #
But I'm getting a single bar for every line in df using:
df.plot(kind='bar')
which is basically unreadable.
I've also tried the hist() and Histogram() functions with no luck.
Here's some sample data:

After a long night, I got the answer. Since every event occurred only once, I added an extra column to the file containing the number one and then indexed the dataframe by it:
df = pd.read_excel("somefile.xls",index_col='Numberone')
And then simply tried this:
df.hist(by=df['Offense Type'])
finally getting exactly what I wanted.
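For reference, a minimal sketch of the same idea without the extra helper column, assuming the two columns are named 'Offense Type' and 'Hour' as in the question (the sample data below is made up):
import pandas as pd
import matplotlib.pyplot as plt
# hypothetical data standing in for the Excel file
df = pd.DataFrame({'Offense Type': ['cat1', 'cat1', 'cat2', 'cat2', 'cat2'],
                   'Hour': [1, 2, 11, 20, 5]})
# one histogram of 'Hour' per category, drawn as a grid of subplots
df['Hour'].hist(by=df['Offense Type'], bins=24)
plt.show()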

Related

create n dataframes in a for loop with an extra column with a specific number in it

Hi all, I have a dataframe like the one shown in the picture:
I am trying to create 2 different dataframes, each with the same "hour", "minute", and "value" (or "value.1", respectively) columns, plus an added column containing the number 0 or 1, respectively. I would like to do it in a for loop, as I want to create n dataframes (not just the 2 shown here).
I tried something like this, but it's not working (error: KeyError: "['value.i'] not in index"):
for i in range(1):
    series[i] = df_new[['hour', 'minute', 'value.i']]
    series[i].insert(0, 'number', 'i')
Can you help me?
Thanks
From what I have understood, you want value.i to resolve to value.1 or value.2:
for i in range(1):
    # f-string formatting so that i is interpreted as a variable
    series[i] = df_new[['hour', 'minute', f'value.{i}']]
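A self-contained sketch of the same pattern, assuming hypothetical column names 'value.0' and 'value.1' and collecting the n frames in a dict (the data and names are assumptions, not from the original post):
import pandas as pd
# hypothetical input frame with one value column per index
df_new = pd.DataFrame({'hour': [0, 1, 2],
                       'minute': [15, 30, 45],
                       'value.0': [1.0, 2.0, 3.0],
                       'value.1': [4.0, 5.0, 6.0]})
n = 2
series = {}
for i in range(n):
    # select the shared columns plus the i-th value column
    series[i] = df_new[['hour', 'minute', f'value.{i}']].copy()
    # add a constant 'number' column holding the integer i (not the string 'i')
    series[i].insert(0, 'number', i)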

Python: I need to find the average over x amount of rows in a specific column of a large csv file

I have a large CSV file with two columns in it, as shown below:
I have already filtered the data. I need to calculate the average pressure over every x rows.
I've looked for a while on here but was unable to find how to calculate the average over every x rows for a specific column. Thanks for any help you can provide.
numpy - average & reshape
n = 3
x = df['Pressure']
# average over consecutive groups of n rows
avgResult = np.average(x.to_numpy().reshape(-1, n), axis=1)
The result is an array in which the values are averaged in groups of n.
For example, with:
n = 3
x = np.array([1, 4, 5, 2, 8, 4])
the result is:
array([3.33333333, 4.66666667])
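A runnable sketch of the same reshape idea, plus a groupby variant that also copes with a row count that isn't a multiple of n (the column name 'Pressure' follows the question; everything else is an assumption):
import numpy as np
import pandas as pd
# hypothetical data standing in for the filtered CSV column
df = pd.DataFrame({'Pressure': [1, 4, 5, 2, 8, 4, 7]})
n = 3
# reshape approach: only complete groups of n rows are kept
trimmed = df['Pressure'].to_numpy()[:len(df) // n * n]
avg_reshape = np.average(trimmed.reshape(-1, n), axis=1)
# groupby approach: averages every n rows, including a shorter final group
avg_groupby = df['Pressure'].groupby(np.arange(len(df)) // n).mean()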

how to extract a character out of string in a column with 100k rows

I have a dataframe with a column x that has 100k rows, as follows:
x
DIV.CDN
DIV.XYN
VIM.NGN
VIM.AHY
I need to extract the 3rd character to the right of the dot (.), for example:
N
N
N
Y
How can I do that with a Pandas dataframe?
Use the str accessor:
>>> df
x
0 DIV.CDN
1 DIV.XYN
2 VIM.NGN
3 VIM.AHY
>>> df['x'].str[-1]
0 N
1 N
2 N
3 Y
Name: x, dtype: object
Please read the documentation: Working with text data
Assuming that each column value is a string, you can grab the character with a simple index:
third_character = string[-1]
This essentially grabs the last character of the string. To do it over all of the rows, one option is a for-loop that does this for each row, appending to a list each time.
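A small sketch of both approaches, plus a variant that takes the 3rd character after the dot explicitly rather than relying on it being the last character (the data follows the question; the split-based line is an assumption about the intended rule):
import pandas as pd
df = pd.DataFrame({'x': ['DIV.CDN', 'DIV.XYN', 'VIM.NGN', 'VIM.AHY']})
# last character of each string
last_char = df['x'].str[-1]
# explicitly: 3rd character of the part after the dot
third_after_dot = df['x'].str.split('.').str[-1].str[2]
# loop version collecting into a list, as described above
chars = [s[-1] for s in df['x']]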

Pandas function printing indices and value when it shouldn't

When I run df.loc[zone] on a dataframe with a multi-index, the value that's printed is preceded by the two indices (zone and run). When I run df.shape I get a shape of (1,). For example:
df
zone run
1 3 67.889616
2 3 167.131685
3 3 20.493902
zone=3
print(df.loc[zone])
displays:
zone run
3 3 20.493902
expect:
20.493902
.loc on a dataframe takes the full index qualifier and returns the value from the requested column.
So for a multi-index dataframe with index (x1, x2), the right syntax to read data from column y1 would be
data = df.loc[(x1, x2), y1]  # the full index should be passed to loc
data = df.loc[(3, 3)]  # zone=3 and run=3
The second line will store the value 20.493902 in data.
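A minimal sketch reproducing the setup with a hypothetical value column named 'value' (the original column name isn't shown in the question), to illustrate the difference between a partial and a full index lookup:
import pandas as pd
# hypothetical multi-index frame matching the printed layout
idx = pd.MultiIndex.from_tuples([(1, 3), (2, 3), (3, 3)], names=['zone', 'run'])
df = pd.DataFrame({'value': [67.889616, 167.131685, 20.493902]}, index=idx)
partial = df.loc[3]               # partial lookup on 'zone' only -> sub-frame still carrying the 'run' level
scalar = df.loc[(3, 3), 'value']  # full index plus column -> the plain number 20.493902
print(scalar)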

Pandas dividing filtered column from df 1 by filtered column of df 2 warning and weird behavior

I have a data frame which is conditionally broken up into two separate dataframes as follows:
df = pd.read_csv(file, names=names)
df = df.loc[df['name1'] == common_val]
df1 = df.loc[df['name2'] == target1]
df2 = df.loc[df['name2'] == target2]
# each df has a 'name3' I want to perform a division on after this filtering
The original df is filtered by a value shared by the two dataframes, and then each of the two new dataframes is further filtered by another shared column.
What I want to work:
df1['name3'] = df1['name3']/df2['name3']
However, as many questions have pointed out, this causes a SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I tried what was recommended in this question:
df1.loc[:,'name3'] = df1.loc[:,'name3'] / df2.loc[:,'name3']
# also tried:
df1.loc[:,'name3'] = df1.loc[:,'name3'] / df2['name3']
But in both cases I still get weird behavior and the SettingWithCopyWarning.
I then tried what was recommended in this answer:
df.loc[df['name2']==target1, 'name3'] = df.loc[df['name2']==target1, 'name3']/df.loc[df['name2'] == target2, 'name3']
which still results in the same copy warning.
If possible I would like to avoid copying the data frame to get around this because of the size of these dataframes (and I'm already somewhat wastefully making two almost identical dfs from the original).
If copying is the best way to go with this problem I'm interested to hear why that works over all the options I explored above.
Edit: here is a simple data frame along the lines of what df would look like after the line df.loc[df['name1'] == common_val]
name1 other1 other2 name2 name3
a x y 1 2
a x y 1 4
a x y 2 5
a x y 2 3
So if target1=1 and target2=2,
I would like df1 to contain only rows where name2=1 and df2 to contain only rows where name2=2, then divide the resulting df1['name3'] by the resulting df2['name3'].
If there is a less convoluted way to do this (without splitting the original df) I'm open to that as well!
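A hedged sketch of one common way around both issues, using the toy frame from the edit: .copy() makes df1 and df2 independent frames (so assigning to them no longer raises the SettingWithCopyWarning), and dividing by the raw values sidesteps index alignment between the two differently-indexed slices (this assumes the two slices have the same number of rows):
import pandas as pd
# toy frame matching the edit above
df = pd.DataFrame({'name1': ['a', 'a', 'a', 'a'],
                   'other1': ['x', 'x', 'x', 'x'],
                   'other2': ['y', 'y', 'y', 'y'],
                   'name2': [1, 1, 2, 2],
                   'name3': [2, 4, 5, 3]})
target1, target2 = 1, 2
# explicit copies: independent frames, so assigning to them is safe
df1 = df.loc[df['name2'] == target1].copy()
df2 = df.loc[df['name2'] == target2].copy()
# divide by the raw values so pandas does not align the two different indexes
df1['name3'] = df1['name3'] / df2['name3'].to_numpy()
print(df1['name3'])  # 2/5 = 0.4 and 4/3 = 1.333...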