How would you plot a pandas series of floats which really stand for a categorical variable? - pandas

I am learning pandas by exploring the Google Play Store apps dataset on Kaggle:
https://www.kaggle.com/lava18/google-play-store-apps
One of the columns is "Installs". I converted its values from the original object type to float so that I could compute basic descriptive statistics, but when I look at the values and their counts I see:
0.000000e+00 15
1.000000e+00 67
5.000000e+00 82
1.000000e+01 386
5.000000e+01 205
1.000000e+02 719
5.000000e+02 330
1.000000e+03 907
5.000000e+03 477
1.000000e+04 1054
5.000000e+04 479
1.000000e+05 1169
5.000000e+05 539
1.000000e+06 1579
5.000000e+06 752
1.000000e+07 1252
5.000000e+07 289
1.000000e+08 409
5.000000e+08 72
1.000000e+09 58
Name: Installs, dtype: int64
It is clear that Google does not give an exact number but rather a "bin".
Plotting it with this basic command:
apps['Installs'].plot.bar()
yields an almost unintelligible image.
Suggestions for a more readable presentation?
Suggestions to graphically show the different distribution of a subset of the data (e.g. only the "Medical" app category data)?
Thank you very much.
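One possible direction, sketched under assumptions: since each distinct value is really a bin, count the rows per bin, sort the bins in ascending order, and give them readable text labels before drawing a bar chart. The helper name install_counts, the tiny sample frame, the "Category" column name, and the "+" suffix on the labels are all illustrative assumptions, not taken from the dataset itself.

```python
import pandas as pd


def install_counts(installs: pd.Series) -> pd.Series:
    """Count rows per install bin, in ascending bin order, with text labels."""
    counts = installs.value_counts().sort_index()
    # Turn 1000000.0 into "1,000,000+" so the axis shows bins rather than floats.
    counts.index = [f"{int(v):,}+" for v in counts.index]
    return counts


# Made-up sample standing in for the Kaggle data (assumption).
apps = pd.DataFrame({
    "Category": ["MEDICAL", "GAME", "MEDICAL", "GAME", "GAME"],
    "Installs": [1000.0, 1000000.0, 1000.0, 5000.0, 1000000.0],
})

all_counts = install_counts(apps["Installs"])
medical_counts = install_counts(
    apps.loc[apps["Category"] == "MEDICAL", "Installs"]
)
print(all_counts)
print(medical_counts)
```

Calling all_counts.plot.bar() should then draw one bar per bin. To compare a subset against the whole, it may help to divide each series by its own sum first, so that both are shown as proportions.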

Related

Newbie question: Why does this indexing work? [duplicate]

This question already has an answer here:
Why does .loc have inclusive behavior for slices?
(1 answer)
Closed 3 years ago.
I am trying to extract just the versicolor petal length entries from the iris dataset in scikit-learn. These correspond to rows 50 to 99. I have always been told that Python indexing excludes the final entry, i.e. 1:10 is all the numbers from 1 to 9.
So why does the following command include row 99? Is this inclusive indexing (where the final value is included) just a pandas thing with loc? My code is below and it works, but I don't know why; my intuition would have been to set the index in loc to [50:100].
from sklearn import datasets
import pandas as pd
import numpy as np
iris = datasets.load_iris()  # load iris
# convert iris to a DataFrame
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
# extract rows 50-99 of the petal length (cm) column
versicolor_petal_length = iris_df.loc[50:99, ['petal length (cm)']]
print(versicolor_petal_length)
The output includes row 99:
petal length (cm)
50 4.7
51 4.5
52 4.9
53 4.0
54 4.6
55 4.5
56 4.7
57 3.3
58 4.6
59 3.9
60 3.5
61 4.2
62 4.0
63 4.7
64 3.6
65 4.4
66 4.5
67 4.1
68 4.5
69 3.9
70 4.8
71 4.0
72 4.9
73 4.7
74 4.3
75 4.4
76 4.8
77 5.0
78 4.5
79 3.5
80 3.8
81 3.7
82 3.9
83 5.1
84 4.5
85 4.5
86 4.7
87 4.4
88 4.1
89 4.0
90 4.4
91 4.6
92 4.0
93 3.3
94 4.2
95 4.2
96 4.2
97 4.3
98 3.0
99 4.1
Given this, can someone explain to me when indexing will include the last element and when it will exclude it? I am having some trouble with this.
Thanks
I believe you're thinking of np.arange, which belongs to the NumPy library and excludes the stop value, whereas df.loc is from the pandas library and includes both endpoints when slicing by label.
EDIT to add: you might also be thinking of how for loops work in Python with range, which likewise excludes the stop value. When it comes to indexing and trying out new libraries, it never hurts to double-check the documentation :)
If you have any further questions, feel free to ask.
What you are seeing here is the documented behavior of DataFrame.loc[].
The documentation states this as a warning, and I quote:
Warning : Note that contrary to usual python slices, both the start and the stop are included
Here is a link with an example provided from the pandas docs:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
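To make the difference concrete, here is a minimal sketch on a made-up DataFrame (the column name and values are arbitrary). It compares a label slice with .loc, a positional slice with .iloc, and a plain Python list slice.

```python
import pandas as pd

df = pd.DataFrame({"value": range(100, 110)})  # default integer labels 0..9

by_label = df.loc[2:5]             # label-based: both 2 and 5 are included
by_position = df.iloc[2:5]         # position-based: the stop is excluded
plain_list = list(range(10))[2:5]  # ordinary Python slice: stop excluded

print(list(by_label.index))     # [2, 3, 4, 5]
print(list(by_position.index))  # [2, 3, 4]
print(plain_list)               # [2, 3, 4]
```

So iris_df.loc[50:99] and iris_df.iloc[50:100] select the same rows when the index is the default 0..n-1 range.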

Cannot use fit method in seaborn.distplot

I have a dataframe with many columns, one of which counts the duration of a process in months. Sample data is available below:
ID Unit Duration
231 TS 2
427 SP 4
291 EI 1
312 SP 3
So, I am trying to plot the histogram, filtered by units and fitting it (mostly for visualization purposes) to stats.expon, which is the best fit for most units. Seems simple enough:
graph = sns.distplot(df[df['Unit'] == 'SP']['Duration'], kde = False, fit = stats.expon)
But it raises "TypeError: No loop matching the specified signature and casting was found for ufunc add". What am I doing wrong? I'm kind of new to matplotlib and seaborn, so excuse me if this is trivial.

fall detection in Labview with quaternions

I have a wearable device that outputs quaternions, which I can read over a serial connection in LabVIEW. My task is to develop a threshold-based fall detection system from these values, which I am not familiar with. The platform is LabVIEW.
Could someone guide me as to where I should start? FYI, I don't have access to accelerometer values.
Any help is appreciated.
Here is some sample data I read from the device:
id: 4 distance: 1048 q0: 646 q1: -232 q2: -119 q3: 717
id: 4 distance: 1067 q0: 645 q1: -232 q2: -80 q3: 722
id: 4 distance: 1109 q0: 645 q1: -232 q2: -81 q3: 722
id: 4 distance: 1036 q0: 645 q1: -232 q2: -80 q3: 722
Actually, it has become more of a mathematical question now. I was able to compute the Euler angles from the quaternions. I'm using the left hand or North East Down coordinates. The device is fixed on the shoe. I'm assuming forward and backward falls could be determined from the yaw angle, and lateral falls from pitch. Is there a combination of roll and pitch that could be used to detect a fall?
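One candidate for such a combination is the total tilt of the sensor's z-axis away from the reference z-axis. For ZYX Euler angles its cosine equals cos(roll) * cos(pitch), and it can be computed straight from the quaternion as 1 - 2(q1^2 + q2^2). The sketch below is in Python rather than LabVIEW and rests on several assumptions: q0 is the scalar part, the raw integers are a scaled unit quaternion (the sample rows have a norm of roughly 1000), the sensor z-axis is vertical when the wearer is upright, and the 60 degree threshold is a placeholder to be tuned on real data.

```python
import math


def tilt_from_vertical_deg(q0, q1, q2, q3):
    """Angle in degrees between the sensor z-axis and the reference z-axis."""
    norm = math.sqrt(q0 * q0 + q1 * q1 + q2 * q2 + q3 * q3)
    if norm == 0:
        raise ValueError("zero quaternion")
    # Normalize so that scaled integer readings behave like a unit quaternion.
    q1, q2 = q1 / norm, q2 / norm
    # Element R[2][2] of the rotation matrix; equals cos(roll) * cos(pitch).
    cos_tilt = 1.0 - 2.0 * (q1 * q1 + q2 * q2)
    cos_tilt = max(-1.0, min(1.0, cos_tilt))  # guard against rounding error
    return math.degrees(math.acos(cos_tilt))


TILT_THRESHOLD_DEG = 60.0  # placeholder value, not derived from any data


def looks_like_fall(q0, q1, q2, q3, threshold_deg=TILT_THRESHOLD_DEG):
    """Flag a reading whose tilt exceeds the (assumed) threshold."""
    return tilt_from_vertical_deg(q0, q1, q2, q3) > threshold_deg


# First sample row from the question.
print(tilt_from_vertical_deg(646, -232, -119, 717))
print(looks_like_fall(646, -232, -119, 717))
```

This looks at orientation only. A practical detector would probably also check how quickly the tilt changes and how long it persists, which is not covered here.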

Change the spacing of a pandas plot?

I am plotting a time-series graph using pandas. My data looks like this:
1986-87 334
1987-88 331
1988-89 352
1989-90 380
1990-91 386
1991-92 386
1992-93 390
1993-94 403
1994-95 406
My code looks like this:
playercount = pd.DataFrame(t.groupby('season').size())
playercount.plot()
plt.show()
I want a more zoomed-in version than this. Currently each tick interval on the x-axis covers 10 years, and I want to make it more fine-grained, i.e. 5 or fewer years. What parameters can I change to achieve this?
To adjust the grid spacing, try adjusting the parameter n below:
n = 5
playercount.plot(xticks=playercount.index[::n], grid=True)
This uses every n-th index value as a tick mark on the x-axis.
If your index holds strings rather than timestamps, then this should work instead:
playercount.plot(xticks=[i for i in range(len(playercount.index)) if not i % n], grid=True)
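As a self-contained sketch of the string-index case, the snippet below rebuilds the sample data from the question, picks every n-th position, and looks up the matching season labels. The helper plot_with_sparse_ticks is a hypothetical name; it assumes matplotlib is installed and that pandas plots a string-indexed Series against positions 0, 1, 2, and so on.

```python
import pandas as pd

seasons = ["1986-87", "1987-88", "1988-89", "1989-90", "1990-91",
           "1991-92", "1992-93", "1993-94", "1994-95"]
counts = [334, 331, 352, 380, 386, 386, 390, 403, 406]
playercount = pd.Series(counts, index=seasons, name="players")

n = 2  # keep every n-th season as a tick
positions = list(range(0, len(playercount), n))
labels = [playercount.index[i] for i in positions]
print(positions)  # [0, 2, 4, 6, 8]
print(labels)


def plot_with_sparse_ticks(series, positions, labels):
    """Line plot with ticks only at the chosen positions (needs matplotlib)."""
    ax = series.plot(grid=True)
    ax.set_xticks(positions)
    ax.set_xticklabels(labels, rotation=45)
    return ax
```

A smaller n gives denser ticks; n = 1 labels every season.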

PDF decode and find useful data in it

I'm trying to decode a PDF file to useful data but I've got some coordinate system problems.
First, the data I'm using: http://pastebin.com/h4MFiSbd (I've already decoded it)
I'm trying to get the coordinates of the gray squares.
My problem is I've found the coordinates of the text:
0 1.00057 -1 0 65.1595 353.15 Tm
[(2)5.81146(.)2.90771(4)5.81146( )2.90771(t)2.90771(i)222]TJ
65.1595 = y
353.15 = x
But the problem is the coordinate of the squares. I've found the color of the squares plus coordinates:
0.753906 0.753906 0.753906 rg
3039 200.914 817.996 1329 re
In the PDF reference it says re uses x,y,width,height, but as you can see, 3039 is far bigger than 353.15. I've also seen Tm uses a matrix thing [[a,b,0],[c,d,0],[e,f,1]]
The other problem is that those rectangles are wrong somehow:
470.996 2934.91 1674 1329 re ---> beveilig.tech.pr
1327 1567.91 2102 1329 re ---> beveilig.tech.th
1327 4301.91 817.996 1329 re ---> bbc ti
2183 4301.91 817.996 1329 re ---> b&o practicum
3039 200.914 817.996 1329 re ---> b&o theorie
I've collected all the coordinates of the squares colored 0.753906 0.753906 0.753906, with the name of the text beneath each one. As you can see, these coordinates suggest all the blocks have an equal height.
Can someone please help me?
The reason is in the first line: "0.12 0 0 0.12 0 0 cm". This operator modifies the current transformation matrix and (simplified) scales the x,y coordinates in all following operations by 0.12. So 3039 is really 3039 * 0.12 = 364.68.
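As a worked example of that scaling, here is a small sketch that applies a six-number PDF matrix (a b c d e f) to the "b&o theorie" rectangle from the question. The function names are illustrative, and the bounding-box shortcut in transform_rect is only meaningful when the matrix has no rotation or skew, as is the case for a pure 0.12 scale.

```python
def apply_matrix(m, x, y):
    """Map (x, y) through a PDF matrix given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def transform_rect(m, x, y, w, h):
    """Transform an 're' rectangle; assumes the matrix only scales and translates."""
    x0, y0 = apply_matrix(m, x, y)
    x1, y1 = apply_matrix(m, x + w, y + h)
    return (min(x0, x1), min(y0, y1), abs(x1 - x0), abs(y1 - y0))


ctm = (0.12, 0, 0, 0.12, 0, 0)  # from "0.12 0 0 0.12 0 0 cm"
rect = transform_rect(ctm, 3039, 200.914, 817.996, 1329)
print(rect)  # approximately (364.68, 24.11, 98.16, 159.48)
```

After scaling, the rectangle's x of about 364.68 is in the same range as the 353.15 found in the text matrix.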
You really need to track these values as you parse, because you can also get relative moves (Td), and you need to factor in many other values to get the correct outline rectangle for the text.
The graphics state can also be saved to and restored from a stack with the q and Q operators.
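A rough sketch of that bookkeeping, assuming a content stream that has already been decoded to whitespace-separated tokens: it tracks only the current transformation matrix, handles just the q, Q, cm and re operators, and ignores everything else (text state, strings, arrays). The function names and the sample stream are made up for illustration; only the first cm and the last re are taken from the question.

```python
IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def concat(m, ctm):
    """Return m x ctm, i.e. the new CTM after a 'cm' with matrix m."""
    am, bm, cm_, dm, em, fm = m
    ac, bc, cc, dc, ec, fc = ctm
    return (am * ac + bm * cc, am * bc + bm * dc,
            cm_ * ac + dm * cc, cm_ * bc + dm * dc,
            em * ac + fm * cc + ec, em * bc + fm * dc + fc)


def rectangles(stream):
    """Collect 're' rectangles in page space from a simplified token stream."""
    ctm, stack, operands, found = IDENTITY, [], [], []
    for token in stream.split():
        try:
            operands.append(float(token))
            continue  # numbers are operands for the next operator
        except ValueError:
            pass
        if token == "q":
            stack.append(ctm)  # save the current matrix
        elif token == "Q" and stack:
            ctm = stack.pop()  # restore the last saved matrix
        elif token == "cm" and len(operands) >= 6:
            ctm = concat(tuple(operands[-6:]), ctm)
        elif token == "re" and len(operands) >= 4:
            x, y, w, h = operands[-4:]
            a, b, c, d, e, f = ctm
            # Width and height scaling is valid only without rotation or skew.
            found.append((a * x + c * y + e, b * x + d * y + f, a * w, d * h))
        operands = []
    return found


sample = ("0.12 0 0 0.12 0 0 cm "
          "q 2 0 0 2 10 20 cm 100 100 50 50 re Q "
          "3039 200.914 817.996 1329 re")
rects = rectangles(sample)
print(rects)
```

The rectangle inside the q ... Q pair is drawn with the extra scale and offset, while the one after Q goes back to the plain 0.12 scale.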