How can I round up an entire column to the next 10? - pandas

I'm really struggling to organise my data into 'bins' in Jupyter Notebook. I need the Length column to be rounded UP to the next 10 but I can only seem to round it up to the nearest whole number. I would really appreciate some guidance. Thanks in advance. :)
IN[58] df2['Length']
OUT[58]
0    541.56
1    541.73
2    482.22
3    345.45
...
Needs to look something like this:
IN[58] df2['Length']
OUT[58]
0    550
1    550
2    490
3    350
...

Sample
print (df2)
   Length
0  541.56
1  541.73
2  482.22
3  500.00   <- whole number for a better sample
You can use integer division, multiply by 10, convert to integers, and add 10 if the modulo is not 0:
s = (df2['Length'] // 10 * 10).astype(int) + (df2['Length'] % 10 > 0).astype(int) * 10
print (s)
0    550
1    550
2    490
3    500
Name: Length, dtype: int32
Another option uses the fact that // rounds down (floor division), so negating twice turns it into rounding up:
s = -(-df2['Length'] // 10 * 10).astype(int)
print (s)
0    550
1    550
2    490
3    500
Name: Length, dtype: int32
Or it is possible to use division with np.ceil:
s = (np.ceil(df2['Length'] / 10) * 10).astype(int)
print (s)
0    550
1    550
2    490
3    500
Name: Length, dtype: int32
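All three methods agree; here is a minimal self-contained check (sample data taken from the question, numpy/pandas imports added):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'Length': [541.56, 541.73, 482.22, 500.00]})

# 1) floor-divide to the lower multiple of 10, then add 10 when there is a remainder
s1 = (df2['Length'] // 10 * 10).astype(int) + (df2['Length'] % 10 > 0).astype(int) * 10

# 2) negate twice so floor division acts as a ceiling
s2 = -(-df2['Length'] // 10 * 10).astype(int)

# 3) divide, take the ceiling, scale back up
s3 = (np.ceil(df2['Length'] / 10) * 10).astype(int)

print(s1.tolist())  # [550, 550, 490, 500]
```

Note that exact multiples of 10 (like 500.00) stay as-is in all three variants; they round up, not strictly up to the *next* multiple.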

Related

Pandas cumsum only if positive else zero

I am making a table where I want to show that if there's no income, no expense can happen; it's a cumulative sum table. This is what I have:
Incoming  Outgoing  Total
       0       150   -150
      10        20   -160
     100        30    -90
      50        70   -110
Required output:

Incoming  Outgoing  Total
       0       150      0
      10        20      0
     100        30     70
      50        70     50
I've tried
df.clip(lower=0)
and
df['new_column'].apply(lambda x : df['outgoing']-df['incoming'] if df['incoming']>df['outgoing'])
but neither works. Is there any other way?
Update:
A more straightforward approach, inspired by your code, using clip and without numpy:
diff = df['Incoming'].sub(df['Outgoing'])
df['Total'] = diff.mul(diff.ge(0).cumsum().clip(0, 1)).cumsum()
print(df)
# Output:
   Incoming  Outgoing  Total
0         0       150      0
1        10        20      0
2       100        30     70
3        50        70     50
Old answer:
Find the row where the balance is positive for the first time then compute the cumulative sum from this point:
start = np.where(df['Incoming'] - df['Outgoing'] >= 0)[0][0]
df['Total'] = df.iloc[start:]['Incoming'].sub(df.iloc[start:]['Outgoing']) \
.cumsum().reindex(df.index, fill_value=0)
Output:
>>> df
   Incoming  Outgoing  Total
0         0       150      0
1        10        20      0
2       100        30     70
3        50        70     50
IIUC, you can check when Incoming is greater than Outgoing using np.where and assign a helper column. Then you can check when this new column is not null, using notnull(), calculate the difference, and use cumsum() on the result:
df['t'] = np.where(df['Incoming'].ge(df['Outgoing']),0,np.nan)
df['t'].ffill(axis=0,inplace=True)
df['Total'] = np.where(df['t'].notnull(),(df['Incoming'].sub(df['Outgoing'])),df['t'])
df['Total'] = df['Total'].cumsum()
df.drop('t',axis=1,inplace=True)
This will give back:
   Incoming  Outgoing  Total
0         0       150    NaN
1        10        20    NaN
2       100        30   70.0
3        50        70   50.0
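For completeness, the clip-based answer as a self-contained sketch (the data frame is rebuilt here with the column names from the question):

```python
import pandas as pd

df = pd.DataFrame({'Incoming': [0, 10, 100, 50],
                   'Outgoing': [150, 20, 30, 70]})

# per-row balance: negative until income covers expenses
diff = df['Incoming'].sub(df['Outgoing'])   # [-150, -10, 70, -20]

# 0 before the first non-negative balance, 1 from then on
started = diff.ge(0).cumsum().clip(0, 1)

# zero out rows before the cumsum "starts", then accumulate
df['Total'] = diff.mul(started).cumsum()
print(df['Total'].tolist())  # [0, 0, 70, 50]
```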

Replacing -999 with a number but I want all replaced number to be different

I have a Pandas DataFrame named df and in df['salary'] column, there are 400 values represented by same number -999. I want to replace that -999 value with any number in between 200 and 500. I want to replace all 400 values with a different number from 200 to 500. So far I have written this code:
df['salary'] = df['salary'].replace(-999, random.randint(200, 500))
but this code replaces all -999 values with the same number. I want all replaced values to be different from each other. How can I do this?
You can use Series.mask with np.random.randint:
df = pd.DataFrame({"salary":[0,1,2,3,4,5,-999,-999,-999,1,3,5,-999]})
df['salary'] = df["salary"].mask(df["salary"].eq(-999), np.random.randint(200, 500, size=len(df)))
print (df)
    salary
0        0
1        1
2        2
3        3
4        4
5        5
6      413
7      497
8      234
9        1
10       3
11       5
12     341
If you want non-repeating numbers instead:
s = pd.Series(range(200, 500)).sample(frac=1).reset_index(drop=True)
df['salary'] = df["salary"].mask(df["salary"].eq(-999), s)
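Putting the non-repeating variant together (note that range(200, 500) yields only 300 distinct values, so for the 400 replacements in the question the range would need widening; this sketch uses the small sample frame from above, with a fixed random_state added for reproducibility):

```python
import pandas as pd

df = pd.DataFrame({"salary": [0, 1, 2, 3, 4, 5, -999, -999, -999, 1, 3, 5, -999]})

# shuffle 200..499 so each masked position receives a distinct value
s = pd.Series(range(200, 500)).sample(frac=1, random_state=0).reset_index(drop=True)
df['salary'] = df["salary"].mask(df["salary"].eq(-999), s)

# every -999 is gone, and the replacements are all distinct
replaced = df['salary'][df['salary'].ge(200)]
print(len(replaced), replaced.is_unique)  # 4 True
```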

iterrows() of 2 columns and save results in one column

In my data frame I want to use iterrows() over two columns but save the result in one column. For example, df is:
 x    y
 5   10
30  445
70   32
and the expected output is:
points  sequence
     5         1
    10         2
    30         1
   445         2
I know about iterrows(), but it saves the output in two different columns. How can I get the expected output, and is there any way to generate the sequence number according to a condition? Any help will be appreciated.
First, never use iterrows, because it is really slow.
If you want a 1, 2 sequence by the number of columns, convert the values to a numpy array with DataFrame.to_numpy and numpy.ravel, then build the sequence with numpy.tile:
df = pd.DataFrame({'points': df.to_numpy().ravel(),
                   'sequence': np.tile([1, 2], len(df))})
print (df)
   points  sequence
0       5         1
1      10         2
2      30         1
3     445         2
4      70         1
5      32         2
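As a self-contained sketch of the ravel/tile approach (assuming exactly the two-column frame from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [5, 30, 70], 'y': [10, 445, 32]})

# ravel flattens row by row: x0, y0, x1, y1, ...
# tile repeats the [1, 2] sequence once per original row
out = pd.DataFrame({'points': df.to_numpy().ravel(),
                    'sequence': np.tile([1, 2], len(df))})

print(out['points'].tolist())    # [5, 10, 30, 445, 70, 32]
print(out['sequence'].tolist())  # [1, 2, 1, 2, 1, 2]
```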
You can also do it this way:
>>> pd.DataFrame([i[1] for i in df.iterrows()])
points sequence
0 5 1
1 10 2
2 30 1
3 445 2

Apply function with arguments across Multiindex levels

I would like to apply a custom function to each level within a multiindex.
For example, I have the dataframe
df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  columns=pd.MultiIndex.from_product([['OP','PK'],['PRICE','QTY']]))
to which I want to add, under each level-0 column, a column called "Value" that is the result of the following function:
def my_func(df, scale):
    return df['QTY'] * df['PRICE'] * scale
where the user supplies the "scale" value.
Even in setting up this example, I am not sure how to show the result I want. But I know I want the final dataframe's multiindex column to be
pd.DataFrame(columns=pd.MultiIndex.from_product([['OP','PK'],['PRICE','QTY','Value']]))
Even if that weren't hard enough, I want to apply one "scale" value to the "OP" level-0 column and a different "scale" value to the "PK" column.
Use:
def my_func(df, scale):
    # select the second level of columns
    df1 = df.xs('QTY', axis=1, level=1).values * df.xs('PRICE', axis=1, level=1) * scale
    # create a MultiIndex in the columns
    df1.columns = pd.MultiIndex.from_product([df1.columns, ['val']])
    # join to the original
    return pd.concat([df, df1], axis=1).sort_index(axis=1)
print (my_func(df, 10))
     OP                 PK
  PRICE QTY   val    PRICE QTY   val
0     0   1     0        2   3    60
1     4   5   200        6   7   420
2     8   9   720       10  11  1100
3    12  13  1560       14  15  2100
EDIT:
To multiply by a different scale value for each level, you can pass a list of values:
print (my_func(df, [10, 20]))
     OP                 PK
  PRICE QTY   val    PRICE QTY   val
0     0   1     0        2   3   120
1     4   5   200        6   7   840
2     8   9   720       10  11  2200
3    12  13  1560       14  15  4200
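To verify the per-level scaling, here is the same function run end to end (data built as in the question; the [10, 20] list applies 10 to "OP" and 20 to "PK", since a list broadcasts across the two level-0 columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  columns=pd.MultiIndex.from_product([['OP', 'PK'], ['PRICE', 'QTY']]))

def my_func(df, scale):
    # QTY * PRICE per level-0 group; a list scale broadcasts per column
    df1 = df.xs('QTY', axis=1, level=1).values * df.xs('PRICE', axis=1, level=1) * scale
    df1.columns = pd.MultiIndex.from_product([df1.columns, ['val']])
    return pd.concat([df, df1], axis=1).sort_index(axis=1)

res = my_func(df, [10, 20])
print(res[('OP', 'val')].tolist())  # [0, 200, 720, 1560]
print(res[('PK', 'val')].tolist())  # [120, 840, 2200, 4200]
```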
Use groupby + agg, and then concatenate the pieces together with pd.concat.
scale = 10
v = df.groupby(level=0, axis=1).agg(lambda x: x.values.prod(1) * scale)
v.columns = pd.MultiIndex.from_product([v.columns, ['value']])
pd.concat([df, v], axis=1).sort_index(axis=1, level=0)
     OP                   PK
  PRICE QTY value    PRICE QTY value
0     0   1     0        2   3    60
1     4   5   200        6   7   420
2     8   9   720       10  11  1100
3    12  13  1560       14  15  2100

Sort Pandas dataframe and print highest n values

I have a pandas data frame and I want to sort the 'Bytes' column in descending order and print the highest 10 values along with the related "Client IP" column values. Suppose the following is a part of my dataframe. I have tried many different methods and failed.
   Bytes      Client Ip
0   1000   192.168.10.2
1   2000  192.168.10.12
2    500   192.168.10.4
3    159  192.168.10.56
The following prints only the row which has the highest value:
print df['Bytes'].argmax()
I think you can use nlargest (New in pandas version 0.17.0):
print df
   Bytes      Client Ip
0   1000   192.168.10.2
1   2000  192.168.10.12
2    500   192.168.10.4
3    159  192.168.10.56
print df.nlargest(3, 'Bytes')
   Bytes      Client Ip
1   2000  192.168.10.12
0   1000   192.168.10.2
2    500   192.168.10.4
Note: sort is deprecated - use sort_values instead
To sort descending use ascending=False:
In [6]: df.sort('Bytes', ascending=False)
Out[6]:
   Bytes      Client Ip
1   2000  192.168.10.12
0   1000   192.168.10.2
2    500   192.168.10.4
3    159  192.168.10.56
To take the first 10 values use .head(10).
df['Bytes'] = df['Bytes'].astype('int')
print df.sort('Bytes', ascending=False).head(10)[['Bytes', 'Client-IP']]
I could solve it using the above code, with the help of Andy Hayden. :D
df[['Bytes', 'Client Ip']].sort_values('Bytes', ascending=False).nlargest(10, 'Bytes')
This should get you everything you need:
1) sorting by Bytes
2) returning the largest 10 Bytes values
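A minimal end-to-end version with current pandas (nlargest alone already does the sort-and-take-top-n, so the extra sort_values above is redundant; a small two-row top-n is used here so the sample frame suffices):

```python
import pandas as pd

df = pd.DataFrame({'Bytes': [1000, 2000, 500, 159],
                   'Client Ip': ['192.168.10.2', '192.168.10.12',
                                 '192.168.10.4', '192.168.10.56']})

# keep the n rows with the largest 'Bytes' (use 10 for the real data)
top = df.nlargest(2, 'Bytes')
print(top['Client Ip'].tolist())  # ['192.168.10.12', '192.168.10.2']
```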