Pandas rounding

I have the following sample dataframe:
Market Value
0 282024800.37
1 317460884.85
2 1260854026.24
3 320556927.27
4 42305412.79
I am trying to round the values in this dataframe to the nearest whole number. Desired output:
Market Value
282024800
317460885
1260854026
320556927
42305413
I tried:
df.values.round()
and the result was
Market Value
282025000.00
317461000.00
1260850000.00
320557000.00
42305400.00
What am I doing wrong?
Thanks

This might be more appropriate posted as a comment, but I've put it here for proper formatting.
I can't reproduce your result. With numpy 1.18.1 and pandas 1.1.0,
df.round().astype('int')
gives me:
Market Value
0 282024800
1 317460885
2 1260854026
3 320556927
4 42305413
The only thing I can think of is that you may have a 32-bit system, where
df.astype('float32').round().astype('int')
gives me
Market Value
0 282024800
1 317460896
2 1260854016
3 320556928
4 42305412
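You can verify the float32 explanation with numpy alone (a quick check of my own, not part of the original data):

import numpy as np

# float32 has roughly 7 significant decimal digits, so a value around 3.2e8
# lands on the nearest representable multiple of 32:
print(np.float32(317460884.85))  # 317460896.0 -- matches row 1 above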

The following will keep your data intact as floats but will have it display/print to the nearest int.
Big caveat: it is only possible to have this apply to ALL dataframes at once (it is a pandas-wide option) rather than just a single dataframe.
pd.set_option("display.precision", 0)
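For example (a small sketch of my own, reusing the asker's column name):

import pandas as pd

pd.set_option("display.precision", 0)
df = pd.DataFrame({"Market Value": [282024800.37, 317460884.85]})
print(df)         # should display the values without decimals
print(df.dtypes)  # still float64 -- only the display changed
pd.reset_option("display.precision")  # undo the global change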

If you like @noah's solution but don't want to have to change the options back after you output something, you can use the following helper function:
import pandas as pd
from contextlib import contextmanager
@contextmanager
def temp_pandas_options(options):
    seen_options = set()
    old_values = {}
    if isinstance(options, dict):
        options_pairs = list(options.items())
    else:
        options_pairs = options
    for option, value in options_pairs:
        assert option not in seen_options, f"Already saw option {option}"
        seen_options.add(option)  # record the option, or the assert can never fire
        old_values[option] = pd.get_option(option)
        pd.set_option(option, value)
    try:
        yield
    finally:
        # restore the previous values even if the with-block raises
        for option, old_value in old_values.items():
            pd.set_option(option, old_value)
Then you can run
with temp_pandas_options({'display.float_format': '{:.0f}'.format}):
    print(market_value_df)
and get
Market value
0 282024800
1 317460885
2 1260854026
3 320556927
4 42305413
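Worth noting (my addition, not in the original answer): pandas ships a built-in context manager that does the same job, pd.option_context, so the helper above is mainly useful if you need the dict interface or the duplicate-option check:

import pandas as pd

# market_value_df as in the question above
with pd.option_context('display.float_format', '{:.0f}'.format):
    print(market_value_df)  # options are restored when the block exits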

Outlier detection with z-score

I wrote some code for outlier detection in Python. I used the z-score method to do this. You can see my data and my code below.
import numpy as np
import pandas as pd
from scipy import stats

data = [5,10,15,20,25,30,36,22]
data.append(180)
data = pd.DataFrame(data, columns=["Data"])
z = np.abs(stats.zscore(data))
print(z)
print(np.where(z > 1.5))
I wrote this code to detect outliers. Actually, I wanted to get the indices of values with a z-score higher than 1.5, but I think something is wrong with the output.
Data
0 0.649600
1 0.551506
2 0.453412
3 0.355318
4 0.257224
5 0.159130
6 0.041417
7 0.316080
8 2.783688
(array([8], dtype=int64), array([0], dtype=int64))
The 8th element's z-score is higher than 1.5 and it shows up in the output, which I'm okay with, but the output also contains a 0, and the 0th element's z-score is only about 0.65. What am I doing wrong?
Because you computed the z-scores on a DataFrame, z is 2-D; np.where on a 2-D array returns a tuple of (row indices, column indices), so the array([0]) is just the column index of your single column, not a second outlier. If you compute the z-scores on the flat list instead, you get only the row indices:
import numpy as np
from scipy import stats
data =[5,10,15,20,25,30,36,22]
data.append(180)
z = stats.zscore(data)
np.where(z > 1.5)[0]
output:
array([8])
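If you'd rather keep the DataFrame, one option (a sketch of my own, same data as above) is to select the single column first, so the z-scores are 1-D and np.where returns only row indices:

import numpy as np
import pandas as pd
from scipy import stats

data = pd.DataFrame([5, 10, 15, 20, 25, 30, 36, 22, 180], columns=["Data"])
z = np.abs(stats.zscore(data["Data"]))  # 1-D, because we selected one column
print(np.where(z > 1.5)[0])             # [8]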

Find rows in dataframe column containing questions

I have a TSV file that I loaded into a pandas dataframe to do some preprocessing, and I want to find out which rows have a question in them, and output 1 or 0 in a new column. Since it is a TSV, this is how I'm loading it:
import pandas as pd
df = pd.read_csv('queries-10k-txt-backup', sep='\t')
Here's a sample of what it looks like:
QUERY FREQ
0 hindi movies for adults 595
1 are panda dogs real 383
2 asuedraw winning numbers 478
3 sentry replacement keys 608
4 rebuilding nicad battery packs 541
After dropping empty rows, duplicates, and the FREQ column (not needed for this), I wrote a simple function to check the QUERY column to see if it contains any words that make the string a question:
df_test = df.drop_duplicates()
df_test = df_test.dropna()
df_test = df_test.drop(['FREQ'], axis = 1)
def questions(row):
    questions_list = ["what","when","where","which","who","whom","whose","why",
                      "why don't","how","how far","how long","how many",
                      "how much","how old","how come","?"]
    if row['QUERY'] in questions_list:
        return 1
    else:
        return 0
df_test['QUESTIONS'] = df_test.apply(questions, axis=1)
But once I check the new dataframe, even though it creates the new column, all the values are 0. I'm not sure if my logic is wrong in the function, I've used something similar with dataframe columns which just have one word and if it matches, it'll output a 1 or 0. However, that same logic doesn't seem to be working when the column contains a phrase/sentence like this use case. Any input is really appreciated!
If you want to check whether a string from the dataframe contains any substring from questions_list, you should use the str.contains method:
import re

questions_list = ["what","when","where","which","who","whom","whose","why",
                  "why don't","how","how far","how long","how many",
                  "how much","how old","how come","?"]
# escape each entry so regex metacharacters like "?" don't break the pattern
pattern = "|".join(map(re.escape, questions_list))  # generate regex from your list
df_test['QUESTIONS'] = df_test['QUERY'].str.contains(pattern)
Simplified example:
df = pd.DataFrame({
    'QUERY': ['how do you like it', 'what\'s going on?', 'quick brown fox'],
    'ID': [0, 1, 2]})
Create a pattern:
pattern = '|'.join(['what', 'how'])
pattern
Out: 'what|how'
Use it:
df['QUERY'].str.contains(pattern)
Out[12]:
0 True
1 True
2 False
Name: QUERY, dtype: bool
If you're not familiar with regexes, there's a quick Python re reference. For the '|' symbol, the explanation is:
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way.
IIUC, you need to check if the first word of the string is in the question list: if yes, return 1, else 0. In your function, rather than checking if the entire string is in the question list, split the string and check if the first element is in the question list.
def questions(row):
    questions_list = ["are","what","when","where","which","who","whom","whose",
                      "why","why don't","how","how far","how long","how many",
                      "how much","how old","how come","?"]
    if row['QUERY'].split()[0] in questions_list:
        return 1
    else:
        return 0
df['QUESTIONS'] = df.apply(questions, axis=1)
You get
QUERY FREQ QUESTIONS
0 hindi movies for adults 595 0
1 are panda dogs real 383 1
2 asuedraw winning numbers 478 0
3 sentry replacement keys 608 0
4 rebuilding nicad battery packs 541 0
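A vectorized alternative (my sketch, assuming the same questions_list is defined outside the function) avoids apply entirely:

# Compare the first word of each query against the list, then cast the
# boolean mask to 0/1:
first_word = df['QUERY'].str.split().str[0]
df['QUESTIONS'] = first_word.isin(questions_list).astype(int)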

Pandas Boolean selection updated?

I was used to getting a single number that told me how many cases are TRUE for either of the conditions in the code below. However, since I ran conda update --all, I now get a list of values of 0s and 1s. I wonder what the simplest method in pandas is to get this task done now. I guess this is due to a pandas update, but a Google search didn't turn up any change to boolean indexing. What is the easiest way to get this sum of booleans? (I know how to get it, but I can't imagine that this extra step is required.)
import pandas as pd
import numpy as np
x = np.random.randint(10,size=10)
y = np.random.randint(10,size=10)
d ={}
d['x'] = x
d['y'] = y
df = pd.DataFrame(d)
sum([df['x']>=6] or [df['y']<=3])
You need to use the vectorized or operator |:
(df.x.ge(6) | df.y.le(3)).sum()
# 9
Or: ((df.y <= 3) | (df.x >= 6)).sum(), sum((df.y <= 3) | (df.x >= 6)).
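Why the sum works (a note of my own): a comparison on a Series yields a boolean Series, and summing it counts the True values, since True counts as 1. A self-contained sketch with a fixed seed:

import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the example is reproducible
df = pd.DataFrame({'x': np.random.randint(10, size=10),
                   'y': np.random.randint(10, size=10)})
mask = (df['x'] >= 6) | (df['y'] <= 3)  # element-wise OR, not Python's `or`
print(mask.sum())                       # counts rows where either condition holds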

pandas resample when cumulative function returns data frame

I would like to use the resampling function from pandas, but applying my own custom function. The problem I'm facing is that the custom function returns a pandas DataFrame instead of a single array.
The following example illustrate my problem:
>>> import pandas as pd
>>> import numpy as np
>>> def f(data):
... return ((1+data).cumprod(axis=0)-1)
...
>>> data = np.random.randn(1000,3)
>>> index = pd.date_range("20170101", periods = 1000, freq="B")
>>> df = pd.DataFrame(data=data, index=index)
Now suppose I want to resample the business days to business month-end frequency:
>>> resampler = df.resample("BM")
If I now apply my function f, I don't get the desired result. I would like to get the last row of the output from f.
>>> resampler.apply(f)
This is because the cumprod in my function f returns a pandas DataFrame. I could write f such that it returns just the last row. However, I would like to use this function in other places as well to return the whole DataFrame. This could be solved by introducing a flag like "last_row" in f that controls whether to return the complete frame or just the last row, but this solution seems rather nasty.
Just define your function f with a last_row parameter. You can default it to False so that it returns the entire dataframe. When True, it returns the last row:
def f(data, last_row=False):
    df = ((1 + data).cumprod(axis=0) - 1)
    if last_row:
        return df.iloc[-1]
    return df
Get the last row
df.resample('BM').apply(f, last_row=True)
0 1 2
2017-01-31 0.185662 -0.580058 -1.004879
2017-02-28 -1.004035 -0.999878 17.059846
2017-03-31 -0.995280 -1.000001 -1.000507
2017-04-28 -1.000656 -240.369487 -1.002645
2017-05-31 47.646827 -72.042190 -1.000016
....
Return all the rows as you already did.
df.resample('BM').apply(f)
I think you could refactor in the following way, which will be much faster for larger dataframes:
(1+df).resample('BM').prod() - 1
0 1 2
2017-01-31 -0.999436 -1.259078 -1.000215
2017-02-28 -1.221404 0.342863 9.841939
2017-03-31 -0.820196 -1.002598 -0.450662
2017-04-28 -1.000299 2.739184 -1.035557
2017-05-31 -0.999986 -0.920445 -2.103289
That gives the same answer as @TedPetrou's approach, although you can't tell from the output because we used different random seeds; you can easily test this yourself. The reason prod() works where you might expect cumprod() is that the last row of a cumulative product within each month is exactly the plain product over that month.
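A quick check with a fixed seed (my own sketch, reusing f with the last_row flag from the answer above):

import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randn(1000, 3),
                  index=pd.date_range("20170101", periods=1000, freq="B"))

a = df.resample("BM").apply(f, last_row=True)  # last row of the cumprod
b = (1 + df).resample("BM").prod() - 1         # plain product per month
print(np.allclose(a, b))                       # should print True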
For this relatively small dataframe with 1,000 rows, this way is only around twice as fast, but if you increase the rows you'll find this way scales much better (about 250x faster at 10,000 rows).
Alternative approaches: These give different answers from the above (and from each other) but I wonder if they might be closer to what you are looking for?
(1+df).resample('BM').mean().expanding().apply( lambda x: x.prod() - 1)
(1+df).expanding().apply( lambda x: x.prod() - 1).resample('BM').mean()

Log values by SFrame column

Please, can anybody tell me how I can take the logarithm of every value in an SFrame (graphlab) or DataFrame (pandas) column, without iterating through the whole length of the column?
I'm especially interested in functionality similar to the Groupby aggregators, but for the log function. I couldn't find it myself...
Important: I'm not interested in a for-loop iteration over the whole length of the column. I'm only interested in a specific function that transforms all the values in the column to their log values.
I'm also very sorry if this function is already in the manual. Please just give me a link...
numpy provides implementations for a wide range of basic mathematical transformations. You can use those on all data structures that build on numpy's ndarray.
import pandas as pd
import numpy as np
data = pd.Series([np.exp(1), np.exp(2), np.exp(3)])
np.log(data)
Outputs:
0 1
1 2
2 3
dtype: float64
This example is for pandas data types, but it works for all data structures that are based on numpy arrays.
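So for the asker's case the whole-column transform is a one-liner (a sketch assuming a DataFrame df with a numeric column 'c'):

import numpy as np
import pandas as pd

df = pd.DataFrame({'c': [1.0, 2.0, 3.0]})
df['log c'] = np.log(df['c'])  # vectorized; no explicit loop over the rows
print(df)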
The same "apply" pattern works for SFrames as well. You could do:
import graphlab
import math
sf = graphlab.SFrame({'a': [1, 2, 3]})
sf['b'] = sf['a'].apply(lambda x: math.log(x))
@cel
I think in my case it could also be possible to use the following pattern.
import numpy
import pandas
import graphlab
df
a b c
1 1 1
1 2 3
2 1 3
....
df['log c'] = df.groupby('a')['c'].apply(lambda x: numpy.log(x))
For an SFrame (an sf object instead of a df) it could look a little different:
logvals = numpy.log(sf['c'])
log_sf = graphlab.SFrame(logvals)
sf = sf.join(log_sf, how = 'outer')
With numpy the code fragment is probably a little bit too long, but it works...
The main problem is of course time performance. I had hoped I could find some specific function to minimise my time...