Newbie question: Why does this indexing work? [duplicate] - pandas

This question already has an answer here: Why does .loc have inclusive behavior for slices? (closed 3 years ago)
I am trying to extract just the versicolor petal length entries from the iris dataset in scikit-learn. These correspond to rows 50 to 99. I have always been told that Python indexing excludes the final entry, i.e. 1:10 is all the numbers from 1 to 9.
So why does the following command include row 99? Is this inclusive indexing (where the final value is included) just a pandas thing with loc? My code is below and it works, but I don't know why; my intuition would have been to set the index in loc to [50:100].
from sklearn import datasets
import pandas as pd
import numpy as np
iris = datasets.load_iris()  # load iris
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])  # convert iris to a DataFrame
versicolor_petal_length = iris_df.loc[50:99, ['petal length (cm)']]  # extract rows 50-99 of the petal length column
print(versicolor_petal_length)
The output includes row 99:
petal length (cm)
50 4.7
51 4.5
52 4.9
53 4.0
54 4.6
55 4.5
56 4.7
57 3.3
58 4.6
59 3.9
60 3.5
61 4.2
62 4.0
63 4.7
64 3.6
65 4.4
66 4.5
67 4.1
68 4.5
69 3.9
70 4.8
71 4.0
72 4.9
73 4.7
74 4.3
75 4.4
76 4.8
77 5.0
78 4.5
79 3.5
80 3.8
81 3.7
82 3.9
83 5.1
84 4.5
85 4.5
86 4.7
87 4.4
88 4.1
89 4.0
90 4.4
91 4.6
92 4.0
93 3.3
94 4.2
95 4.2
96 4.2
97 4.3
98 3.0
99 4.1
Given this, can someone explain when indexing will include the last element and when it will exclude it? I am having some trouble with this.
Thanks

I believe you're thinking of np.arange, which belongs to the NumPy library and excludes the last index, whereas df.loc is from the pandas library and is inclusive with respect to indexing, as the examples in the pandas docs show.
EDIT to add: you might also be thinking of how for loops work in Python with the range function. When it comes to indexing and playing with new libraries, it never hurts to double-check the documentation :)
If you have any further questions, feel free to ask.
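A minimal sketch that puts the two behaviors side by side (using a toy DataFrame, not the iris data):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(10)})

print(np.arange(5, 9))        # [5 6 7 8] -- the stop value 9 is excluded
print(df.iloc[5:9]['x'])      # positions 5..8 -- stop excluded, like plain Python
print(df.loc[5:9, 'x'])       # labels 5..9 -- stop INCLUDED

The rule of thumb: .iloc slices by integer position and follows Python's half-open convention, while .loc slices by label and includes both endpoints, since labels need not be consecutive integers and a label-based slice has no sensible notion of "the label just before the stop".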

What you are experiencing here is a property of DataFrame.loc[].
As mentioned in the documentation as a warning, and I quote:
Warning: Note that contrary to usual python slices, both the start and the stop are included
Here is a link to the pandas docs, with examples:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

Should I trim outliers from input features?

Almost half of my input feature columns have extreme outliers; for example, one column's mean is 19.6 but its max is 2908.0. Is that OK, or should I clip those values to mean + std?
       msg_cnt_in_x  msg_cnt_in_other  msg_cnt_in_y  \
count      330096.0          330096.0      330096.0
mean           19.6               2.6          38.3
std            41.1               8.2          70.7
min             0.0               0.0           0.0
25%             0.0               0.0           0.0
50%             3.0               1.0           8.0
75%            21.0               2.0          48.0
max          2908.0            1296.0        4271.0
There is no general answer to that. It depends very much on your problem and data set.
You should look into your data set and check whether these outlier data points are actually valid and important. If they are caused by some errors during data collection you should delete them. If they are valid, then you can expect similar values in your test data and thus the data points should stay in the data set.
If you are unsure, just test both and pick the one that works better.
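If you do decide to cap them, here is a minimal sketch with pandas (assuming the data live in a DataFrame df; the cutoff of mean + 3 standard deviations is an arbitrary choice you should tune):

import pandas as pd

def clip_upper_outliers(df, cols, n_std=3):
    # Cap each column at mean + n_std * std; the lower tail is left alone
    out = df.copy()
    for col in cols:
        upper = out[col].mean() + n_std * out[col].std()
        out[col] = out[col].clip(upper=upper)
    return out

clipped = clip_upper_outliers(df, ['msg_cnt_in_x', 'msg_cnt_in_other', 'msg_cnt_in_y'])

Note that the mean and std are themselves distorted by the outliers, so a quantile-based cutoff (e.g. out[col].quantile(0.99)) is often more robust.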

How would you plot a pandas series of floats which really stand for a categorical variable?

I am learning pandas by exploring a Google Play installs dataset on Kaggle:
https://www.kaggle.com/lava18/google-play-store-apps
One of the columns is "Installs". I converted its values from the original object dtype to float to perform basic descriptive statistics, but when I look at the content:
0.000000e+00 15
1.000000e+00 67
5.000000e+00 82
1.000000e+01 386
5.000000e+01 205
1.000000e+02 719
5.000000e+02 330
1.000000e+03 907
5.000000e+03 477
1.000000e+04 1054
5.000000e+04 479
1.000000e+05 1169
5.000000e+05 539
1.000000e+06 1579
5.000000e+06 752
1.000000e+07 1252
5.000000e+07 289
1.000000e+08 409
5.000000e+08 72
1.000000e+09 58
Name: Installs, dtype: int64
It is clear that Google does not give an exact number but rather a "bin".
Plotting it with this basic command:
apps['Installs'].plot.bar()
yields an almost unintelligible image.
Suggestions for a more readable presentation?
Suggestions to graphically show the different distribution of a subset of the data (e.g. only the "Medical" app category data)?
Thank you very much.
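In case it helps, one approach that usually reads better is to treat the values as ordered categories and relabel the ticks. A sketch, assuming the column is named 'Installs' as above and the DataFrame is named apps:

import matplotlib.pyplot as plt

# One bar per install "bin", kept in numeric order
counts = apps['Installs'].value_counts().sort_index()

ax = counts.plot.bar()
# Replace the float labels (1.000000e+04, ...) with readable integers
ax.set_xticklabels([f'{int(v):,}' for v in counts.index], rotation=45, ha='right')
ax.set_ylabel('number of apps')
plt.tight_layout()
plt.show()

For a subset, the same recipe applied to something like apps[apps['Category'] == 'MEDICAL'] (the column and value names are assumptions about the dataset) lets you compare its distribution against the full data.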

Which model fits best for semi-sinusoidal data?

I have a record containing the maximum and minimum monthly temperatures at a particular station. The record covers every month from January 1908 to March 2012. However, some of the temperature values have been blanked out.
Sample Data
yyyy month tmax tmin
1908 January 5.0 -1.4
1908 February 7.3 1.9
1908 March 6.2 0.3
1908 April Missing_1 2.1
1908 May Missing_2 7.7
1908 June 17.7 8.7
1908 July Missing_3 11.0
1908 August 17.5 9.7
1908 September 16.3 8.4
1908 October 14.6 8.0
1908 November 9.6 3.4
1908 December 5.8 Missing_4
1909 January 5.0 0.1
1909 February 5.5 -0.3
1909 March 5.6 -0.3
1909 April 12.2 3.3
1909 May 14.7 4.8
1909 June 15.0 7.5
1909 July 17.3 10.8
1909 August 18.8 10.7
I want to fill in the missing values. Which model suits this kind of problem best? I am trying multivariate linear regression here. Is that the right approach?
This is an empirical question. Linear regression is a good starting point; if the data have a non-linear shape, you may find that transforming the features or outputs lets you fit a linear model.
I'd suggest you come up with something and use cross-validation on the entries with present values to improve your method. If it is reasonable to assume that the missing values have the same distribution as the present ones (i.e. there isn't some systematic bias in the missing values, like equipment malfunction in extreme temperatures), then cross-validation error ought to be a reasonable way to judge the quality of your missing data imputation.
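As a concrete starting point, here is a sketch of that idea, assuming the data are in a DataFrame df with numeric columns year, month_num, tmin, and tmax (with NaN for the blanked-out values): encode the annual cycle with sin/cos features so a linear model can follow the seasonal shape.

import numpy as np
from sklearn.linear_model import LinearRegression

# Map the 12-month cycle onto the unit circle so it is linear-model friendly
df['month_sin'] = np.sin(2 * np.pi * df['month_num'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month_num'] / 12)

features = ['year', 'month_sin', 'month_cos', 'tmin']  # tmin helps when present
known = df['tmax'].notna() & df[features].notna().all(axis=1)
missing = df['tmax'].isna() & df[features].notna().all(axis=1)

model = LinearRegression().fit(df.loc[known, features], df.loc[known, 'tmax'])
df.loc[missing, 'tmax'] = model.predict(df.loc[missing, features])

The same setup makes the cross-validation check easy: hold out a slice of the known tmax values, impute them, and compare against the truth.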

iOS, LaunchImage Apple Watch size and name

I am using the Xcode 6.2 beta to create an app for Apple Watch.
I want to add two LaunchImages, one for the 38 mm Apple Watch and one for the 42 mm.
When I add the images, Xcode gives me an error about the LaunchImage name, or an error about the size (449 x 449 or 136 x 170).
I want the exact names and sizes for the 38 mm and 42 mm Apple Watch LaunchImages.
The icons have a circular mask automatically applied to them. You will need icons with diameters of 29, 80, and 172 pixels for 38 mm and 36, 88, and 196 pixels for 42 mm.
So create images 29x29, 80x80, 172x172, etc.
See https://developer.apple.com/library/prerelease/ios/documentation/UserExperience/Conceptual/WatchHumanInterfaceGuidelines/IconandImageSizes.html#//apple_ref/doc/uid/TP40014992-CH16-SW1

x86 binary bloat - 32-bit offsets when 8-bits would do

I'm using clang+LLVM 2.9 to compile various workloads for x86 with the -Os option. Small binary size is important and I must use static linking. All binaries are 32-bit.
I notice that many instructions use addressing modes with 32-bit displacements when only 8 bits are actually used. For example:
89 84 24 d4 00 00 00 mov %eax,0xd4(%esp)
Why didn't the compiler/assembler choose the compact 8-bit displacement?
89 44 24 d4 mov %eax,0xd4(%esp)
In fact, these wasted addressing bytes are over 2% of my entire binary!
I looked at LLVM's link-time optimization and tried --emit-llvm, but it didn't help with this issue.
Is there some link-time optimization that can use knowledge of the actual displacements to choose the smaller instruction form?
Thanks for any help!
In x86, displacements are signed, which lets you address data on both sides of the base address. The range of an 8-bit displacement is therefore -128 to 127. Your instruction references data 212 bytes forward (0xD4 is 212 in decimal). If that value were encoded as an 8-bit displacement, it would be read as -44, which is not what you want.
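You can check that arithmetic directly; a small Python sketch of how the same byte pattern is read under each encoding:

import struct

# 0xd4 as the 4-byte displacement from the long form of the instruction
print(struct.unpack('<i', b'\xd4\x00\x00\x00')[0])  # signed 32-bit: 212
# 0xd4 as a 1-byte displacement
print(struct.unpack('<b', b'\xd4')[0])              # signed 8-bit: -44

Since 212 lies outside the signed 8-bit range of -128 to 127, the assembler has no choice but to emit the 4-byte displacement here; it picks the 1-byte form automatically whenever the value fits.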