Can't seem to shorten decimal numbers of my Pandas column - pandas

So I have a df column which I created by taking an average of three other columns:
df['Avg_Grade'] = df.loc[:, 'G1':'G3'].mean(axis=1)
The series looks like this (just a sample)
Avg_Grade
0 5.666667
1 5.333333
2 8.333333
3 14.666667
4 8.666667
I'm trying to truncate the output to show something like
0 5.67 (5.66 is also fine)
1 5.33
2 8.33
3 14.67
4 8.67
I've played around with the Decimal module with the following code, but I'm getting an error.
from decimal import *
getcontext().prec = 4
df['Avg_Grade'] = Decimal(df.loc[:,'G1':'G3'].mean(axis =1))
Traceback (most recent call last):
File "<pyshell#409>", line 1, in <module>
df['Avg_Grade'] = Decimal(df.loc[:,'G1':'G3'].mean(axis =1))
File "C:\Python27\lib\decimal.py", line 657, in __new__
raise TypeError("Cannot convert %r to Decimal" % value)
TypeError: Cannot convert 0 5.666667
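The TypeError occurs because Decimal() accepts a single scalar, not a whole Series. If Decimal objects are really wanted rather than just a rounded display, a minimal sketch (my own illustration, not from the original post) is to apply it element-wise and quantize to two decimal places:
from decimal import Decimal

# Decimal() cannot take a Series, so convert each element individually.
# quantize() rounds to the pattern's number of decimal places (here 2).
# Note: the column becomes object dtype after this assignment.
df['Avg_Grade'] = df.loc[:, 'G1':'G3'].mean(axis=1).apply(
    lambda v: Decimal(str(v)).quantize(Decimal('0.01')))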

There are a few ways you can do this, but they won't work in all situations.
Here's an example dataframe:
In [1]:
import numpy as np
import pandas as pd

df = pd.DataFrame(10 * np.random.rand(4, 3), columns=['G1', 'G2', 'G3'])
df['Avg_Grade'] = df.loc[:, 'G1':'G3'].mean(axis=1)
df
Out [1]:
G1 G2 G3 Avg_Grade
0 9.843159 4.155922 9.652694 7.883925
1 2.108822 9.347634 9.271351 6.909269
2 2.681108 3.071449 0.387151 2.046569
3 4.017461 9.464408 0.395482 4.625783
1. Use a global pandas setting
All floats will be displayed with two decimals. You can use either of the following:
pd.options.display.precision = 2
pd.set_option('display.precision', 2)
In [3]: df
Out[3]:
G1 G2 G3 Avg_Grade
0 9.84 4.16 9.65 7.88
1 2.11 9.35 9.27 6.91
2 2.68 3.07 0.39 2.05
3 4.02 9.46 0.40 4.63
2. Use a global setting within a with statement.
All floats displayed within the with block will show two decimals; afterwards the setting reverts to its regular value (default: 6).
In [4]: with pd.option_context('display.precision', 2):
   ...:     print(df)
Out[4]:
G1 G2 G3 Avg_Grade
0 9.84 4.16 9.65 7.88
1 2.11 9.35 9.27 6.91
2 2.68 3.07 0.39 2.05
3 4.02 9.46 0.40 4.63
Once you're outside of the with statement:
In [5]: print(df['Avg_Grade'])
0 7.883925
1 6.909269
2 2.046569
3 4.625783
Name: Avg_Grade, dtype: float64
3. Using an HTML styler.
This requires that you run your code in a Jupyter notebook.
df.style.set_precision(3)
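Note that in newer pandas versions (1.3+), Styler.set_precision is deprecated in favour of Styler.format; a minimal equivalent sketch:
df.style.format(precision=3)   # render numeric columns with 3 decimals
df.style.format("{:.3f}")      # or with an explicit format string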
4. Using round()
If you just want to display rounded values, you can also use:
df.round(2)
df['Avg_Grade'].round(2)
5. Creating another dataframe or modifying in place
This approach lets you customize the precision column by column, but it changes the underlying data, so you might want to work on a copy.
# Create a copy so we don't mess up the original df
df_print = df.copy()
# Round some columns to different precisions
df_print['Avg_Grade'] = df_print['Avg_Grade'].round(2)
df_print['G1'] = df_print['G1'].round(4)
# Add more decimals: need to switch that to a string representation
df_print['G3'] = df_print['G3'].map(lambda x: "{:,.10f}".format(x))
# display
df_print
G1 G2 G3 Avg_Grade
0 9.8432 4.155922 9.6526935480 7.88
1 2.1088 9.347634 9.2713506079 6.91
2 2.6811 3.071449 0.3871511232 2.05
3 4.0175 9.464408 0.3954815519 4.63

If you don't want to round the values inside the columns, you can just change the global display setting:
pd.set_option('display.precision', 2)
This also works for the styler.

Related

How to manipulate data in arrays using pandas

I have data in a dataframe and need to compare the current value of one column with the prior value of another column. The current time is row 5 in this dataframe, and here's the desired output:
The target data is streamed and captured into a DataFrame, then that column is multiplied by a constant to generate another column, prod. However, I'm unable to generate the third column, comp, which should compare the current value of prod with the prior value of comp.
df['temp'] = self.temp
df['prod'] = df['temp'].multiply(other=const1)
Another user suggested using this logic, but it generates errors because the routine's array doesn't match the size of the DataFrame:
for i in range(2, len(df['temp'])):
    df['comp'].append(max(df['prod'][i], df['comp'][i - 1]))
Let's try this, I think this will capture your intended logic:
df = pd.DataFrame({'col0': [1, 2, 3, 4, 5],
                   'col1': [5, 4.9, 5.5, 3.5, 6.3],
                   'col2': [2.5, 2.45, 2.75, 1.75, 3.15]})
df['col3'] = df['col2'].shift(-1).cummax().shift()
print(df)
Output:
col0 col1 col2 col3
0 1 5.0 2.50 NaN
1 2 4.9 2.45 2.45
2 3 5.5 2.75 2.75
3 4 3.5 1.75 2.75
4 5 6.3 3.15 3.15
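For readers who want the recurrence spelled out, here is a minimal sketch of the same logic as an explicit loop (my own illustration, not part of the original answer); it reproduces the col3 values above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col0': [1, 2, 3, 4, 5],
                   'col1': [5, 4.9, 5.5, 3.5, 6.3],
                   'col2': [2.5, 2.45, 2.75, 1.75, 3.15]})

# Recurrence: comp[i] = max(col2[i], comp[i - 1]), seeded from the second row,
# with the first row left as NaN (mirroring shift(-1).cummax().shift()).
comp = [np.nan, df['col2'].iloc[1]]
for i in range(2, len(df)):
    comp.append(max(df['col2'].iloc[i], comp[i - 1]))
df['col3_loop'] = comp
print(df)   # col3_loop matches col3 from the vectorized version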

Pandas - DataFrame aggregate behaving oddly

Related to Dataframe aggregate method passing list problem and Pandas fails to aggregate with a list of aggregation functions
Consider this dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame(index=range(10))
df['a'] = [3 * x for x in range(10)]
df['b'] = [1 - 2 * x for x in range(10)]
According to the documentation for aggregate you should be able to specify which columns to aggregate using a dict like this:
df.agg({'a' : 'mean'})
Which returns
a 13.5
But if you try to aggregate with a user-defined function like this one
def nok_mean(x):
    return np.mean(x)

df.agg({'a': nok_mean})
It returns the mean for each row rather than the column
a
0 0.0
1 3.0
2 6.0
3 9.0
4 12.0
5 15.0
6 18.0
7 21.0
8 24.0
9 27.0
Why does the user-defined function not return the same as aggregating with np.mean or 'mean'?
This is using pandas version 0.23.4, numpy version 1.15.4, python version 3.7.1
The issue has to do with applying np.mean to a series. Let's look at a few examples:
def nok_mean(x):
    return x.mean()

df.agg({'a': nok_mean})
a 13.5
dtype: float64
This works as expected because you are using the pandas version of mean, which can be applied to a series or a dataframe:
df['a'].agg(nok_mean)
df.apply(nok_mean)
Let's see what happens when np.mean is applied to a series:
def nok_mean1(x):
    return np.mean(x)
df['a'].agg(nok_mean1)
df.agg({'a':nok_mean1})
df['a'].apply(nok_mean1)
df['a'].apply(np.mean)
all return
0 0.0
1 3.0
2 6.0
3 9.0
4 12.0
5 15.0
6 18.0
7 21.0
8 24.0
9 27.0
Name: a, dtype: float64
When you apply np.mean to a dataframe, it works as expected:
df.agg(nok_mean1)
df.apply(nok_mean1)
a 13.5
b -8.0
dtype: float64
In order to get np.mean to work as expected inside a user-defined function, pass the underlying ndarray as x:
def nok_mean2(x):
    return np.mean(x.values)
df.agg({'a':nok_mean2})
a 13.5
dtype: float64
I am guessing all of this has to do with apply, which is why df['a'].apply(nok_mean2) raises an AttributeError. My guess is that it happens somewhere in the apply source code.
When you define your nok_mean function, your definition is basically saying that you want np.mean for each row. It finds the mean of each row and returns the result.
For example, if your dataframe looked like this:
a b
0 [0, 0] 1
1 [3, 4] -1
2 [6, 8] -3
3 [9, 12] -5
4 [12, 16] -7
5 [15, 20] -9
6 [18, 24] -11
7 [21, 28] -13
8 [24, 32] -15
9 [27, 36] -17
Then df.agg({'a': nok_mean}) would return this:
a
0 0.0
1 3.5
2 7.0
3 10.5
4 14.0
5 17.5
6 21.0
7 24.5
8 28.0
9 31.5
This is related to how the calculations are made on the pandas side.
When you pass a dict of functions, the input is treated as a DataFrame instead of a flattened array, and all calculations are then made over the index axis by default. That's why you're getting the means by row.
If you go to the docs page you'll see:
The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g., numpy.mean(arr_2d) as opposed to numpy.mean(arr_2d, axis=0).
I think the only way to emulate numpy's behavior and pass a dict of functions to agg at the same time is df.agg(nok_mean)['a'].
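To summarize the workarounds discussed above, here is a short sketch using the df defined in the question (behavior as described for pandas 0.23.x; newer pandas versions have changed how user-defined functions are dispatched in agg):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [3 * x for x in range(10)],
                   'b': [1 - 2 * x for x in range(10)]})

# 1. Built-in string alias reduces the column as expected.
print(df.agg({'a': 'mean'}))                        # a    13.5

# 2. User-defined function operating on the underlying ndarray.
print(df.agg({'a': lambda s: np.mean(s.values)}))   # a    13.5

# 3. Aggregate the whole frame, then select the column.
print(df.agg(np.mean)['a'])                         # 13.5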

Select every nth row as a Pandas DataFrame without reading the entire file

I am reading a large file that contains ~9.5 million rows x 16 cols.
I am interested in retrieving a representative sample, and since the data is organized by time, I want to do this by selecting every 500th element.
I am able to load the data, and then select every 500th row.
My question: Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?
Question 2: How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
Here is a snippet of what the data looks like (first five rows). The first 4 rows are out of order, but the remaining dataset looks ordered (by time):
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 1 2017-01-09 11:13:28 2017-01-09 11:25:45 1 3.30 1 N 263 161 1 12.5 0.0 0.5 2.00 0.00 0.3 15.30
1 1 2017-01-09 11:32:27 2017-01-09 11:36:01 1 0.90 1 N 186 234 1 5.0 0.0 0.5 1.45 0.00 0.3 7.25
2 1 2017-01-09 11:38:20 2017-01-09 11:42:05 1 1.10 1 N 164 161 1 5.5 0.0 0.5 1.00 0.00 0.3 7.30
3 1 2017-01-09 11:52:13 2017-01-09 11:57:36 1 1.10 1 N 236 75 1 6.0 0.0 0.5 1.70 0.00 0.3 8.50
4 2 2017-01-01 00:00:00 2017-01-01 00:00:00 1 0.02 2 N 249 234 2 52.0 0.0 0.5 0.00 0.00 0.3 52.80
Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?
Something you could do is use the skiprows parameter in read_csv, which accepts a list-like argument with the row indices to skip (and thus also lets you select which rows to keep). You could create an np.arange with a length equal to the number of rows to read and remove every 500th element from it with np.delete, so that only every 500th row is read:
import numpy as np
import pandas as pd

n_rows = 9_500_000   # approximate row count; use an int so np.arange yields integer row labels
skip = np.arange(n_rows)
skip = np.delete(skip, np.arange(0, n_rows, 500))
df = pd.read_csv('my_file.csv', skiprows=skip)
Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?
First get the length of the file with a custom function, remove every 500th row with numpy.setdiff1d, and pass the result to the skiprows parameter of read_csv:
# https://stackoverflow.com/q/845058
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1
len_of_file = file_len('test.csv')
print(len_of_file)

skipped = np.setdiff1d(np.arange(len_of_file), np.arange(0, len_of_file, 500))
print(skipped)

df = pd.read_csv('test.csv', skiprows=skipped)
How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
The idea is to read only the datetime column via the usecols parameter, then sort it, select every 500th index value, take the set difference, and again pass that to skiprows:
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

len_of_file = file_len('test.csv')

df1 = pd.read_csv('test.csv',
                  usecols=['tpep_pickup_datetime'],
                  parse_dates=['tpep_pickup_datetime'])

sorted_idx = (df1['tpep_pickup_datetime'].sort_values()
                  .iloc[np.arange(0, len_of_file, 500)].index)
skipped = np.setdiff1d(np.arange(len_of_file), sorted_idx)
print(skipped)

df = pd.read_csv('test.csv', skiprows=skipped).sort_values(by=['tpep_pickup_datetime'])
Use a lambda with skiprows:
pd.read_csv(path, skiprows=lambda i: i % N)
The callable is evaluated against each row index; rows for which it returns a truthy value are skipped, so only the header (row 0) and every Nth row are read.
Source: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
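Spelled out slightly more explicitly (a sketch, assuming a file name and N = 500 as in the question), an equivalent form that makes the header handling obvious:
import pandas as pd

N = 500
# The callable receives the file row index, counting the header line as row 0.
# Keep row 0 (the header) and every Nth row; skip everything else.
df = pd.read_csv('my_file.csv', skiprows=lambda i: i != 0 and i % N != 0)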
You can use the csv module, which gives you an iterator, together with itertools.cycle to select every nth row.
import csv
from itertools import cycle

source_file = 'D:/a.txt'
cycle_size = 500

# True for row 0, then False for the next cycle_size - 1 rows, repeating
chooser = (x == 0 for x in cycle(range(cycle_size)))
with open(source_file) as f1:
    rdr = csv.reader(f1)
    data = [row for pick, row in zip(chooser, rdr) if pick]
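Since the first picked row is the CSV header line (row 0), it can be reused as the column names when building a DataFrame from the result (a small sketch, assuming the data list produced above):
import pandas as pd

# All values are read as strings; convert columns to numeric/datetime as needed afterwards.
df = pd.DataFrame(data[1:], columns=data[0])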

How to rearrange pandas groupby dataframe output

ID Sequence float Freq Count
3631 D 1.31 1 1
P 1.45 1 1
R 1.44 1 1
3633 D 1.26 3 3
1.27 2 2
1.32 1 1
P 1.33 4 4
The above is the output of a pandas groupby:
final_df = small_df.groupby(['ID','Seq','float'])['ID','Seq'].count()
I would like to write this to a csv file as
3631,"D,P,R","1.31,1.45,1.44"
3633,"D,P","1.26,1.27,1.32,1.33"
I would like some help with this for my research work.
Thank you.
The essence of this problem is just a grouping operation on the ID, followed by an aggregation with str.join.
(df.reset_index(level=[1, 2])           # move the Sequence and float levels out of the index
   .iloc[:, :2]                          # keep only the Sequence and float columns
   .astype(str)                          # convert to string so the values can be joined
   .groupby(level=0)                     # group by the remaining ID index
   .agg(','.join)                        # join the elements with commas
   .to_csv('file.csv', quotechar='"'))   # save to CSV with a quoting character
file.csv
ID,Sequence,float
3631,"D,P,R","1.31,1.45,1.44"
3633,"D,D,D,P","1.26,1.27,1.32,1.33"

Pandas bar plot with continuous x axis

I'm trying to make a bar chart in pandas, with two data series coming from a groupby:
data.groupby(['popup','UID']).size().groupby(level=0).value_counts().unstack().transpose().plot(kind='bar', layout=(2,2))
The x axis is not continuous, and only shows values that are in the dataset. In this example, it jumps from 11 to 13.
How can I make it continuous?
**EDIT 2:**
I tried JohnE's data-centric approach, and it works. It creates a new index with no missing values:
temp = data.groupby(['popup','UID']).size().groupby(level=0).value_counts().unstack().transpose()
temp.reindex(np.arange(temp.index.min(), temp.index.max())).plot(kind='bar', layout=(2,2))
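One small caveat with this approach (my own note, not from the original post): np.arange excludes its stop value, so to keep the largest count in the reindexed axis you probably want max() + 1:
temp.reindex(np.arange(temp.index.min(), temp.index.max() + 1)).plot(kind='bar', layout=(2, 2))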
However, I assume there should be a better approach using a histogram instead of a bar plot. The best I could do with histograms is:
data.groupby(['popup','UID']).size().groupby(level=0).plot(kind='hist', bins=30, alpha=0.5, layout=(2,2), legend=True)
But I didn't find any option in the hist plot to get the same rendering as the bar plot, without the bars overlapping.
**EDIT:** Here is some information to answer the comments.
Data sample:
INSEE C1 popup C3 date \
0 75101.0 0.0 0 NaN 2017-05-17T13:20:16Z
0 75101.0 0.0 0 NaN 2017-05-17T14:23:51Z
1 31557.0 0.0 1 NaN 2017-05-17T14:58:27Z
UID
0 ba4bd353-f14d-4bc5-95ba-6a1f5134cc84
0 ba4bd353-f14d-4bc5-95ba-6a1f5134cc84
1 bafe9715-3a07-4d9b-b85c-0bbf658a9115
First groupby result (sample):
data.groupby(['popup','UID']).size().head(3)
popup UID
0 016d3e7e-1901-4f84-be0e-117988ec57a8 6
01c15455-29cc-4d1e-8743-638fd0f51602 6
03fc9eb0-c5fb-4205-91f0-4b74f78a8b96 3
dtype: int64
Second groupby result (sample):
data.groupby(['popup','UID']).size().groupby(level=0).value_counts().head(3)
popup
0 1 46
3 23
4 22
dtype: int64
After unstack and transpose:
data.groupby(['popup','UID']).size().groupby(level=0).value_counts().unstack().transpose().head(3)
popup 0 1
1 46.0 38.0
2 21.0 35.0
3 23.0 22.0
There is a solution using a histogram plot via matplotlib.axes.Axes.hist. It is better to use histograms than bar plots for this purpose, as we can choose the number of bins.
import matplotlib.pyplot as plt

# Separate groups by 'popup' and count the number of records for each 'UID'
popup_values = data['popup'].unique()
count_by_popup = [data[data['popup'] == popup_value].groupby(['UID']).size()
                  for popup_value in popup_values]

# Create a histogram with 20 bins, one bar group per 'popup' value
fig, ax = plt.subplots()
ax.hist(count_by_popup, 20, histtype='bar', label=[str(x) for x in popup_values])
ax.legend()
plt.show()
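If you want each bar to line up with an integer count (so the histogram mirrors the continuous bar-chart rendering), one option is to pass explicit integer-centred bin edges; a sketch building on the answer above, assuming the same data frame:
import numpy as np
import matplotlib.pyplot as plt

popup_values = data['popup'].unique()
count_by_popup = [data[data['popup'] == v].groupby('UID').size()
                  for v in popup_values]

# One bin per integer count value, centred on 1, 2, 3, ...
max_count = max(s.max() for s in count_by_popup)
bins = np.arange(0.5, max_count + 1.5)

fig, ax = plt.subplots()
ax.hist(count_by_popup, bins=bins, histtype='bar',
        label=[str(v) for v in popup_values])
ax.set_xlabel('records per UID')
ax.set_ylabel('number of UIDs')
ax.legend()
plt.show()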