Why does Pandas df.mode() return a zero before the actual modal value? - pandas

When I run df.mode() on the dataframe below, I get a leading zero before the expected output. Why is that?
df
sample     1   2   3   4   5   6   7   8   9  10
zone run
2    5    14  12  22  23  24  22  23  22  23  23
print(df.iloc[:,3:10].mode(axis=1))
gives
           0
zone run
2    5    23
expecting
zone run
2 5 23

pd.Series.mode
Return the mode(s) of the dataset. Always returns Series even if only one value is returned.
So that's how it is by design. A Series must have an index, and by default it starts counting from 0. This ensures that the return type is stable regardless of whether there is a single mode or several values tied for the mode. DataFrame.mode(axis=1) follows the same logic per row, so the result is a DataFrame whose columns are labeled 0, 1, ...
So if you take a slice where N values are tied for the mode, the labels 0, ..., N-1 mark those N tied values (the modal values in sorted order).
df.iloc[:, 4:7]
#sample 5 6 7
#zone run
#2 5 24 22 23
df.iloc[:,4:7].mode(axis=1)
# 0 1 2 # <- 3 values tied for mode so 3 labels
#zone run
#2 5 22 23 24
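To make the behavior concrete, here is a minimal sketch that rebuilds the one-row frame from the question (the construction itself is an assumption, since the original code that created df is not shown) and then selects column 0 to get the first modal value without the extra header row.

```python
import pandas as pd

# Assumed reconstruction of the question's frame: one row, MultiIndex (zone, run).
index = pd.MultiIndex.from_tuples([(2, 5)], names=["zone", "run"])
columns = pd.Index(range(1, 11), name="sample")
df = pd.DataFrame([[14, 12, 22, 23, 24, 22, 23, 22, 23, 23]],
                  index=index, columns=columns)

# A single mode still comes back as a DataFrame with one column labeled 0.
modes = df.iloc[:, 3:10].mode(axis=1)

# Three values tied for the mode give three columns labeled 0, 1, 2.
tied = df.iloc[:, 4:7].mode(axis=1)

# Selecting column 0 yields a Series holding the first mode of each row.
first_mode = modes[0]
print(first_mode)
```

Selecting modes[0] keeps the (zone, run) index and drops the column header, which is close to the output the question was expecting.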

My thinking is that df.mode returns a DataFrame. When no column names are given, pandas labels the columns with integer positions, and 0 is used here because that is where pandas/Python starts counting.
Because the result is a DataFrame, the way to change that column label is the .rename(columns=...) method. So, to get what you need, you can do:
df.iloc[:,3:10].agg('mode', axis=1).reset_index().rename(columns={0:''})
zone run
0 2 5 23

Related

iterrows() of 2 columns and save results in one column

In my data frame I want to iterate with iterrows() over two columns but save the result in one column. For example, df is
x y
5 10
30 445
70 32
expected output is
points sequence
5 1
10 2
30 1
445 2
I know about iterrows(), but it saves the output in two different columns. How can I get the expected output, and is there any way to generate the sequence number according to a condition? Any help will be appreciated.
First, never use iterrows, because it is really slow.
If you want a 1, 2 sequence based on the number of columns, convert the values to a NumPy array with DataFrame.to_numpy, flatten it with numpy.ravel, and then build the sequence with numpy.tile:
df = pd.DataFrame({'points': df.to_numpy().ravel(),
'sequence': np.tile([1,2], len(df))})
print (df)
points sequence
0 5 1
1 10 2
2 30 1
3 445 2
4 70 1
5 32 2
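The same idea can be sketched so that the sequence length follows the number of columns instead of being hard-coded as [1, 2]. The names n_rows, n_cols and long_df are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [5, 30, 70], "y": [10, 445, 32]})

n_rows, n_cols = df.shape

long_df = pd.DataFrame({
    # ravel() flattens row by row: x0, y0, x1, y1, ...
    "points": df.to_numpy().ravel(),
    # 1..n_cols repeated once per original row
    "sequence": np.tile(np.arange(1, n_cols + 1), n_rows),
})
print(long_df)
```

Because ravel() walks the array in row-major order, each original row contributes its values in column order, which is what lines up with the repeating 1, 2 sequence.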
Do it this way:
>>> pd.DataFrame([i[1] for i in df.iterrows()])
points sequence
0 5 1
1 10 2
2 30 1
3 445 2

How to get the mode of a column in pandas where there are few of the same mode values pandas

I have a data frame and I'd like to get the mode of a specific column.
I'm using:
freq_mode = df.mode()['my_col'][0]
However I get the error:
ValueError: ('The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()', 'occurred at index my_col')
I'm guessing it's because several values are tied for the mode.
I need any one of the modes; it doesn't matter which. How can I use any() to get any of the modes that exist?
Your code works fine for me with sample data.
If necessary, select the first value of the Series returned by mode:
freq_mode = df['my_col'].mode().iat[0]
We can see this with a single column:
df=pd.DataFrame({"A":[14,4,5,4,1,5],
"B":[5,2,54,3,2,7],
"C":[20,20,7,3,8,7],
"train_label":[7,7,6,6,6,7]})
X=df['train_label'].mode()
print(X)
DataFrame
A B C train_label
0 14 5 20 7
1 4 2 20 7
2 5 54 7 6
3 4 3 3 6
4 1 2 8 6
5 5 7 7 7
Output
0 6
1 7
dtype: int64
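A small sketch of "give me any one mode" built on the .mode().iat[0] idea above. The helper name any_mode and its default parameter are hypothetical; the guard is there because mode() returns an empty Series when every value is NaN, in which case .iat[0] would raise an IndexError.

```python
import pandas as pd

def any_mode(series, default=None):
    """Return one modal value of the Series, or default if there is none."""
    modes = series.mode()  # sorted modal values, NaN dropped by default
    return modes.iat[0] if not modes.empty else default

labels = pd.Series([7, 7, 6, 6, 6, 7])
picked = any_mode(labels)

# An all-NaN column has no mode, so the default is returned.
missing = any_mode(pd.Series([float("nan"), float("nan")]))
print(picked, missing)
```

With a tie, this picks the smallest modal value, since mode() returns the tied values in sorted order.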

pandas applying function to columns array is very slow

os hour day
0 13 14 0
1 19 14 0
2 13 14 0
3 13 14 0
4 13 14 0
Here is my dataframe. I just want to add a new column computed as str(os)+'_'+str(hour)+'_'+str(day). I used apply to process the dataframe, but it is very slow.
Is there a high-performance way to do this?
I also tried converting the df to an array and processing every row, but that seems slow too.
The dataframe has nearly two hundred million rows.
I'm not sure what code you are using, but you can try:
df.astype(str).apply('_'.join, axis = 1)
0 13_14_0
1 19_14_0
2 13_14_0
3 13_14_0
4 13_14_0
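As an alternative to the row-wise apply above, the columns can be concatenated as whole Series. This is a different technique from the answer's, and it is likely (though worth measuring on the real data) to be faster, because it avoids calling a Python function once per row. The column name key is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"os": [13, 19, 13], "hour": [14, 14, 14], "day": [0, 0, 0]})

# Convert once, then concatenate column-wise instead of row by row.
as_text = df[["os", "hour", "day"]].astype(str)
df["key"] = as_text["os"] + "_" + as_text["hour"] + "_" + as_text["day"]
print(df)
```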

Why are some of my ranges insane?

I tried parsing a common string depiction of ranges (e.g. 1-9) into actual ranges (e.g. 1 .. 9), but often got weird results when including two digit numbers. For example, 1-10 results in the single value 1 instead of a list of ten values and 11-20 gave me four values (11 10 21 20), half of which aren't even in the expected numerical range:
put get_range_for('1-9');
put get_range_for('1-10');
put get_range_for('11-20');
sub get_range_for ( $string ) {
    my ($start, $stop) = $string.split('-');
    my @values = ($start .. $stop).flat;
    return @values;
}
This prints:
1 2 3 4 5 6 7 8 9
1
11 10 21 20
Instead of the expected:
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
(I figured this out before posting this question, so I have answered below. Feel free to add your own answer if you'd like to elaborate).
The problem is indeed that .split returns Str rather than Int, which the original answer solves. However, I would rather implement my "get_range_for" like this:
sub get_range_for($string) {
Range.new( |$string.split("-")>>.Int )
}
This would return a Range object rather than an Array. But for iteration (which is what you most likely would use this for), this wouldn't make any difference. Also, for larger ranges the other implementation of "get_range_for" could potentially eat a lot of memory because it vivifies the Range into an Array. This doesn't matter much for "3-10", but it would for "1-10000000".
Note that this implementation uses >>.Int to call the Int method on all values returned from the .split, and then slips them as separate parameters with | to Range.new. This will then also bomb should the .split return 1 value (if it couldn't split) or more than 2 values (if multiple hyphens occurred in the string).
The result of split is a Str, so you are accidentally creating a range of strings instead of a range of integers. Try converting $start and $stop to Int before creating the range:
put get_range_for('1-9');
put get_range_for('1-10');
put get_range_for('11-20');
sub get_range_for ( $string ) {
    my ($start, $stop) = $string.split('-');
    my @values = ($start.Int .. $stop.Int).flat; # Simply added .Int here
    return @values;
}
Giving you what you expect:
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20

PowerPivot formula for row wise weighted average

I have a table in PowerPivot that contains the logged data of a traffic control camera mounted on a road. The table holds the velocity and the number of vehicles that pass the camera during a specific time window (e.g. 14:10 - 15:25). How can I get the average velocity of cars for a specific hour and list the results in a separate table with 24 rows (hours 0 - 23), where the second column of each row is the weighted average velocity for that hour? A sample of my stat_table data is given below:
count vel hour
----- --- ----
133 96.00237 15
117 91.45705 21
81 81.90521 6
2 84.29946 21
4 77.7841 18
1 140.8766 17
2 56.14951 14
6 71.72839 13
4 64.14309 9
1 60.949 17
1 77.00728 21
133 100.3956 6
109 100.8567 15
54 86.6369 9
1 83.96901 17
10 114.6556 21
6 85.39127 18
1 76.77993 15
3 113.3561 2
3 94.48055 2
In a separate PowerPivot table I have 24 rows and 2 columns, but when I enter my formula, every row is filled with the same number. My formula is:
=sumX(FILTER(stat_table, stat_table[hour]=[hour]), stat_table[count] * stat_table[vel])/sumX(FILTER(stat_table, stat_table[hour]=[hour]), stat_table[count])
Create a new calculated column named "WeightedVelocity" as follows
WeightedVelocity = [count]*[vel]
Create a measure "WeightedAverage" as follows
WeightedAverage = sum(stat_table[WeightedVelocity]) / sum(stat_table[count])
Use measure "WeightedAverage" in VALUES area of pivot Table and use "hour" column in ROWS to get desired result.
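The arithmetic behind the measure is sum(count * vel) / sum(count), evaluated separately for each hour. As a cross-check outside PowerPivot, here is a minimal pandas sketch (not DAX) using four rows from the sample data. The variable names are chosen for illustration.

```python
import pandas as pd

# Four rows taken from the sample stat_table above.
stat_table = pd.DataFrame({
    "count": [3, 3, 2, 6],
    "vel": [113.3561, 94.48055, 56.14951, 71.72839],
    "hour": [2, 2, 14, 13],
})

# Equivalent of the WeightedVelocity calculated column.
weighted_velocity = stat_table["count"] * stat_table["vel"]

# Equivalent of the WeightedAverage measure, evaluated per hour.
weighted_average = (
    weighted_velocity.groupby(stat_table["hour"]).sum()
    / stat_table.groupby("hour")["count"].sum()
)
print(weighted_average)
```

For hour 2 both rows have the same count, so the weighted average reduces to the plain mean of the two velocities, which makes it an easy value to verify by hand.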