Why is my df.sort_values() not correctly sorting the data points? - pandas

I have a dataframe with returns from various investments in %. sort_values does not order my returns correctly. For example, I just want to see the TEST column returns sorted from lowest to highest, or vice versa. Please look at the test output; it is not correct.
df.sort_values('TEST')
gives me an output of returns that are NOT sorted correctly.
Also I am having an issue where it sorts the positive numbers lowest to highest, then halfway down starts again with the negative numbers lowest to highest.
I just want it to look like the following:
-3%
-1%
-0.5%
1%
2%
5%

Go for numpy.lexsort and positional indexing with iloc:
import numpy as np

# strip the '%' sign so the comparison is numeric, then compute the sorted order
arr = np.array([float(x.rstrip("%")) for x in df["TEST"]])
idx = np.lexsort((arr,))

# reorder the rows by position
df = df.iloc[idx]
Output:
print(df)
TEST
1 -3%
3 -1%
2 -0.5%
0 1%
5 2%
4 5%
Input used:
df = pd.DataFrame({"TEST": ["1%", "-3%","-0.5%", "-1%", "5%", "2%"]})
TEST
0 1%
1 -3%
2 -0.5%
3 -1%
4 5%
5 2%

The issue is that the lexicographic order of strings is different from the natural numeric order (1 -> 10 -> 2 vs 1 -> 2 -> 10).
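As a quick illustration in plain Python (a minimal sketch using the sample TEST values shown in the answer above):
vals = ["1%", "-3%", "-0.5%", "-1%", "5%", "2%"]
# lexicographic (string) order: '-' sorts before the digits, and "-0.5%" < "-1%" < "-3%"
print(sorted(vals))  # ['-0.5%', '-1%', '-3%', '1%', '2%', '5%']
# numeric order, after stripping the '%' sign
print(sorted(vals, key=lambda v: float(v.rstrip('%'))))  # ['-3%', '-1%', '-0.5%', '1%', '2%', '5%']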
One option using the key parameter of sort_values:
df.sort_values('TEST', key=lambda s: pd.to_numeric(s.str.extract(r'(-?\d+\.?\d*)', expand=False)))
Or:
df.sort_values('TEST', key=lambda s: pd.to_numeric(s.str.rstrip('%')))
Output:
TEST
1 -3%
3 -1%
2 -0.5%
0 1%
5 2%
4 5%

Related

Changing column name and its values at the same time

Pandas help!
I have a specific column like this,
Mpg
0 18
1 17
2 19
3 21
4 16
5 15
Mpg is miles per gallon.
Now I need to replace that 'Mpg' column with 'litre per 100 km' and convert those values to litres per 100 km at the same time. Any help? Thanks beforehand.
-Tom
I changed the name of the column, but I could not do both simultaneously.
Use pop to return and delete the column at the same time and rdiv to perform the conversion (litres per 100 km = 235.15 / mpg):
df['litre per 100 km'] = df.pop('Mpg').rdiv(235.15)
If you want to insert the column in the same position:
df.insert(df.columns.get_loc('Mpg'), 'litre per 100 km',
          df.pop('Mpg').rdiv(235.15))
Output:
litre per 100 km
0 13.063889
1 13.832353
2 12.376316
3 11.197619
4 14.696875
5 15.676667
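For clarity, rdiv divides from the right, so the one-liner above is equivalent to an ordinary division with the column on the right-hand side (a minimal sketch reproducing the sample column):
import pandas as pd

df = pd.DataFrame({'Mpg': [18, 17, 19, 21, 16, 15]})
# s.rdiv(235.15) computes 235.15 / s element-wise
df['litre per 100 km'] = 235.15 / df.pop('Mpg')
print(df)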
An alternative to pop would be to store the result in another dataframe. This way you can perform the two steps at the same time. In my code below, I first reproduce your dataframe, then define the constant for the conversion and apply it to all entries using the apply method.
import pandas as pd

df = pd.DataFrame({'Mpg': [18, 17, 19, 21, 16, 15]})
cc = 235.214583  # constant for conversion from mpg to L/100km
df2 = pd.DataFrame()
df2['litre per 100 km'] = df['Mpg'].apply(lambda x: cc / x)
print(df2)
The output of this code is:
litre per 100 km
0 13.067477
1 13.836152
2 12.379715
3 11.200694
4 14.700911
5 15.680972
as expected.

Pandas cumulative sum over 1 index but not the other 3

I have a dataframe with 4 variables, DIVISION, QTR, MODEL_SCORE, and MONTH, with the sum of variable X aggregated by those 4.
I would like to effectively partition the data by DIVISION, QTR, and MODEL_SCORE and keep a running total ordered by the MONTH field, smallest to largest. The idea is that it would reset whenever it reaches a new combination of the other 3 columns.
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I'm trying
df['cumsum'] = df.groupby(level=3)['X'].cumsum()
having tried every number I can think of in the level argument. It seems to work every way except the way I want.
EDIT: I know the below isn't formatted ideally, but basically, as long as the only variable changing is MONTH the cumulative sum should continue, while a change in any other variable should cause it to reset.
DIVISION QTR MODEL MONTHS X CUMSUM
A 1 1 1 10 10
A 1 1 2 20 30
A 1 2 1 5 5
I'm sorry for all the trouble; I believe the answer was way simpler than I was making it out to be.
After
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I was supposed to reset the index, since I did not want a MultiIndex, and this appears to have worked.
df = df.reset_index()
df['cumsum'] = df.groupby(['DIVISION','MODEL','QTR'])['X'].cumsum()
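As a quick check against the small example table above, a minimal sketch reproducing those three rows (column names taken from the grouping code):
import pandas as pd

# reproduce the three example rows from the question
df = pd.DataFrame({'DIVISION': ['A', 'A', 'A'],
                   'QTR': [1, 1, 1],
                   'MODEL': [1, 1, 2],
                   'MONTHS': [1, 2, 1],
                   'X': [10, 20, 5]})

df['cumsum'] = df.groupby(['DIVISION', 'MODEL', 'QTR'])['X'].cumsum()
print(df)
# the running total restarts at the third row because MODEL changes: 10, 30, 5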

Is there a way to use cumsum with a threshold to create bins?

Is there a way to use numpy to add up numbers in a series until a threshold is reached, then restart the counter? The intention is to form groups (for a groupby) based on the categories created.
amount price
0 27 22.372505
1 17 126.562276
2 33 101.061767
3 78 152.076373
4 15 103.482099
5 96 41.662766
6 108 98.460743
7 143 126.125865
8 82 87.749286
9 70 56.065133
The only solutions I found iterate with .loc, which is slow. I tried building a solution based on this answer https://stackoverflow.com/a/56904899:
sumvals = np.frompyfunc(lambda a, b: a + b if a <= 100 else b, 2, 1)
df['cumvals'] = sumvals.accumulate(df['amount'], dtype=object)
The use-case is to find the average price of every 75 sold amounts of the thing.
Solution #1: Interpreting the following one way will get my solution below: "The use-case is to find the average price of every 75 sold amounts of the thing." If you are trying to do this calculation the "hard way" instead of with pd.cut, then here is a solution that works well, but its speed / memory will depend on the cumsum() of the amount column, which you can check with df['amount'].cumsum(). It takes about 1 second per 10 million of the cumsum, as that is how many rows np.repeat creates. So this solution is not horrible if you have less than ~10 million in cumsum (1 second) or even 100 million in cumsum (~10 seconds):
i = 75
# expand the data so that every unit sold becomes one row carrying its price
df = np.repeat(df['price'], df['amount']).to_frame().reset_index(drop=True)
# group every i consecutive units
g = df.index // i
df = df.groupby(g)['price'].mean()
# label each group with the range of units it covers
df.index = (df.index * i).astype(str) + '-' + (df.index * i + 75).astype(str)
df
Out[1]:
0-75 78.513748
75-150 150.715984
150-225 61.387540
225-300 67.411182
300-375 98.829611
375-450 126.125865
450-525 122.032363
525-600 87.326831
600-675 56.065133
Name: price, dtype: float64
Solution #2 (I believe this is wrong but keeping just in case)
I do not believe you are trying to do it this way, which was my initial solution, but I will keep it here just in case, as you haven't included your expected output. You can create a new series with cumsum and then use pd.cut, passing bins=np.arange(0, df['Group'].max(), 75) to create groups of cumulative 75. Then, groupby the groups of cumulative 75 and take the mean. Finally, use pd.IntervalIndex to clean up the format and change the intervals to strings:
df['Group'] = df['amount'].cumsum()
s = pd.cut(df['Group'], bins=np.arange(0, df['Group'].max(), 75))
df = df.groupby(s)['price'].mean().reset_index()
df['Group'] = pd.IntervalIndex(df['Group']).left.astype(str) + '-' + pd.IntervalIndex(df['Group']).right.astype(str)
df
Out[1]:
Group price
0 0-75 74.467390
1 75-150 101.061767
2 150-225 127.779236
3 225-300 41.662766
4 300-375 98.460743
5 375-450 NaN
6 450-525 126.125865
7 525-600 87.749286

How to sort a series with positive values in ascending order and negative values in descending order

I have a series tt = pd.Series([-1, 5, 4, 0, -7, -9]). Now I want to sort tt so that
the positive values are sorted in ascending order and the negative values in descending order, with the positive values in front of the negative values.
I want to get the following result:
4, 5, 0, -1, -7, -9
Is there a good way to get this result?
You want to sort on tt <= 0 first. Notice this is True for negatives and zero and False for positives, so sorting on it puts the positives first. Then sort on tt.abs(), which puts the smallest-magnitude numbers first within each group.
# column 0: original values, column 1: absolute values, column 2: True for non-positives
df = pd.concat([tt, tt.abs(), tt.le(0)], axis=1)
df.sort_values([2, 1])[0]
2 4
1 5
3 0
0 -1
4 -7
5 -9
Name: 0, dtype: int64
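For reference, the same two-key sort can be written with numpy.lexsort instead of building a helper DataFrame (a minimal sketch, not part of the original answer):
import numpy as np
import pandas as pd

tt = pd.Series([-1, 5, 4, 0, -7, -9])
# lexsort treats the last key as the primary key: positives (False) first,
# then smallest absolute value first within each group
order = np.lexsort((tt.abs(), tt.le(0)))
print(tt.iloc[order].tolist())  # [4, 5, 0, -1, -7, -9]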
This is a bit more long-winded, but it gets you your desired output:
import pandas as pd
tt=pd.Series([-1,5,4,0,-7,-9])
pd.concat((tt[tt > 0].sort_values(ascending=True), tt[tt <= 0].sort_values(ascending=False)))
Out[1]:
2 4
1 5
3 0
0 -1
4 -7
5 -9
Hope this helps.

Calculating probability from long series data in Python pandas

I have data ranging from 19 to 49. How can I calculate the probability of the data occurring between 25 and 40?
46.58762816
30.50477684
27.4195249
47.98157313
44.55425608
30.21066503
34.27381019
48.19934524
46.82233375
46.05077036
42.63647302
40.11270346
48.04909583
24.18660332
24.47549276
44.45442651
19.24542913
37.44141763
28.41079638
21.69325455
31.32887617
26.26988582
18.19898804
19.01329026
28.33846808
The simplest thing you can do is use the percentage of values that fall between 25 and 40.
If s is the pandas.Series you gave us:
In [1]: s.head()
Out[1]:
0 46.587628
1 30.504777
2 27.419525
3 47.981573
4 44.554256
Name: 0, dtype: float64
In [2]: # calculate number of values between 25 and 40 and divide by total count
s.between(25,40).sum()/float(s.count())
Out[2]: 0.3599
Otherwise it would require working out what distribution your data might follow (from the data you gave, which might be just a small sample of your data, it doesn't appear to follow any distribution I know...), testing whether it actually follows the distribution you think it follows (using the Kolmogorov-Smirnov test or another like it), and then using that distribution to calculate the probability, etc.
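If you do go the distribution route, a minimal sketch of that workflow with scipy.stats could look like the following (a normal distribution is assumed here purely for illustration; s is the Series from above):
import scipy.stats as stats

# fit a candidate distribution to the data (normal chosen only as an example)
mu, sigma = stats.norm.fit(s)

# Kolmogorov-Smirnov test: a small p-value suggests the normal fit is poor
ks_stat, p_value = stats.kstest(s, 'norm', args=(mu, sigma))
print(ks_stat, p_value)

# if the fit is acceptable, P(25 < X < 40) is the difference of the CDF values
print(stats.norm.cdf(40, mu, sigma) - stats.norm.cdf(25, mu, sigma))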