How to plot coordinates from a single pandas Series

I have a pandas series called df1['geometry.coordinates'] of coordinate values in the following format:
geometry.coordinates
0 [150.792711, -34.210868]
1 [151.551228, -33.023339]
2 [148.92149870748742, -34.767207772932835]
3 [151.033742, -33.919998]
4 [150.953963043732, -32.3935017885229]
... ...
432 [114.8927165, -28.902492300000002]
433 [115.34601918477634, -30.041742290803096]
434 [115.4632611, -30.8581035]
435 [121.42151909999998, -30.7804027]
436 [115.69424934340425, -30.680970908597665]
I want to plot each point on a graph, probably through using a scatter plot.
I tried df1['geometry.coordinates'].plot.scatter(), but each entry is read as a single list value rather than as two numbers, so I always get the following error:
TypeError: scatter() missing 2 required positional arguments: 'x' and 'y'
Anyone know how I can solve this?

You need to separate the column containing the list so that you can specify x and y in the plot call.
You can split a column containing a list by constructing a data frame from a list.
pd.DataFrame(df1["geometry.coordinates"].to_list(), columns=['x', 'y']).plot.scatter(x="x", y="y")

Step 1: Split array into multiple columns
df1[['x','y']] = pd.DataFrame(df1['geometry.coordinates'].tolist(), index= df1.index)
Step 2: Plot
df1.plot.scatter(x = 'x', y = 'y', s = 30) #s is size of dots
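Step 1 passes index=df1.index for a reason: column assignment in pandas aligns on index labels, not on position. A minimal sketch of the difference, using a made-up three-row frame with a non-default index (the frame and its index values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical data: three of the coordinate pairs, with a non-default index
df1 = pd.DataFrame(
    {"geometry.coordinates": [[150.792711, -34.210868],
                              [151.551228, -33.023339],
                              [151.033742, -33.919998]]},
    index=[10, 20, 30],
)

# Reusing df1's index keeps every x/y pair on its original row
aligned = df1.copy()
aligned[["x", "y"]] = pd.DataFrame(
    aligned["geometry.coordinates"].tolist(), index=aligned.index
)

# Without index=..., the new frame is labelled 0, 1, 2; none of those labels
# exist in df1, so nothing lines up and the new columns are filled with NaN
misaligned = df1.copy()
misaligned[["x", "y"]] = pd.DataFrame(misaligned["geometry.coordinates"].tolist())

print(aligned[["x", "y"]])
print(misaligned[["x", "y"]])
```

With the default 0..n-1 index, as in the question, both forms give the same result.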

You are not passing any parameters to scatter(), so the error is to be expected. Once the lists are split into two columns, something along the lines of df.plot.scatter(x=0, y=1) should work.
Also, if the two coordinates end up as rows rather than columns, you need to transpose the data first: df.T.plot.scatter(x=0, y=1)

I did it this way.
import pandas as pd
import matplotlib.pyplot as plt
geometry = pd.Series([
    [150.792711, -34.210868],
    [151.551228, -33.023339],
    [148.92149870748742, -34.767207772932835],
    [151.033742, -33.919998],
    [150.953963043732, -32.3935017885229]])
# split each [x, y] list into two named columns
df = pd.DataFrame(geometry.to_list(), columns = ['x','y'])
plt.scatter(x = df['x'], y = df['y'],
            edgecolor ='black')
plt.grid(alpha=.15)
plt.show()

You can try:
import pandas as pd
geometry_coordinates=[[150.792711, -34.210868],
[151.551228, -33.023339],
[148.92149870748742, -34.767207772932835],
[151.033742, -33.919998],
[150.953963043732, -32.3935017885229],
[114.8927165, -28.902492300000002],
[115.34601918477634, -30.041742290803096],
[115.4632611, -30.8581035],
[121.42151909999998, -30.7804027],
[115.69424934340425, -30.680970908597665]]
# the first value of each pair is the longitude (114 to 152), the second the latitude
geometry_coordinates=pd.DataFrame(geometry_coordinates,columns=['long','lat'])
geometry_coordinates.plot.scatter(x='long',y='lat')

Related

I need to convert columns from a dataframe with pandas into floats but I cannot seem to do it

I am trying to plot data from a big file I have. I am only using 2 columns of it, and I need to delete all 'nan' values, rows with no data in those columns and rows with 0 in those columns. I cannot manage to plot it correctly. It shows as a disorganised plot (values not in numerical order), tries to print all ticks on the axes (that is why I have limited them in the code) and just does not make sense.
So, shorter, the issue is that I need to plot the values from the columns 'Gmag' and 'Teff' properly and I cannot.
So, this is my code:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
#gaia_sky = pd.read_csv ('Gaia_download_reduced_0_90_1deg.tsv' , sep='\t')
gaia_f_alerts = pd.read_csv ('Gaia_download_1.tsv' , sep='\t')
gaia_sky = pd.read_csv ('asu (2).tsv' , sep='|')
no_nan_sky = gaia_sky.dropna (axis = 0 , how = 'any' , thresh = None , subset = ['Teff' , 'Gmag'])
#mod_sky = no_nan_sky.astype({'Gmag':'float'})
#print (mod_sky.dtypes)
print (no_nan_sky['Gmag'])
print (gaia_sky['Gmag'])
#print (no_nan_sky ['Teff'])
#print(gaia_sky.columns)
#for i in range(0,len(gaia_sky['Gmag'])):
# print(gaia_sky['Teff'][i],gaia_sky['Gmag'][i])
#plt.scatter (gaia_f_alerts['Teff'] , gaia_f_alerts['Gmag'] , color='red',)
plt.scatter (no_nan_sky['Teff'] , no_nan_sky['Gmag'])
#plt.xticks([])
#plt.yticks([])
plt.xticks(no_nan_sky['Teff'][::300])
plt.yticks(no_nan_sky['Gmag'][::1000])
#plt.gca().axes.yaxis.set_ticklabels([])
#plt.locator_params(axis='y', nbins=6)
#plt.locator_params(axis='x', nbins=10)
plt.show()
These are the first 5 rows of Teff
0
1 5648.8
2
3 3837.9
4
These are the first 5 rows of Gmag
0 19.281116
1 16.308022
2 20.075556
3 18.978070
4 20.452116
I think that the issue is that the values in the rows I need are not floats, so I tried to change that but it would not let me do that either. How can I fix this? I know there are a lot of issues, but I just cannot figure it out.
When I try to convert the strings to float this comes up: ValueError: could not convert string to float: ''. And when I try to delete the empty values nothing changes: still the same number of rows and empty values all around.
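One likely cause, assuming the blanks in Teff are read in as empty or whitespace-only strings: the column then has object dtype, dropna() finds nothing to drop, and matplotlib treats the strings as categories, which would explain the unordered axis with a tick for every value. A minimal sketch of one way around this, with made-up rows standing in for the real file:

```python
import pandas as pd

# Hypothetical stand-in for the TSV: blanks arrive as strings, not as NaN
gaia_sky = pd.DataFrame({
    "Teff": [" ", "5648.8", "", "3837.9", "0"],
    "Gmag": ["19.281116", "16.308022", "20.075556", "18.978070", "20.452116"],
})

cols = ["Teff", "Gmag"]

# Anything that cannot be parsed as a number becomes NaN instead of raising
numeric = gaia_sky[cols].apply(pd.to_numeric, errors="coerce")

# Now dropna() has real NaN values to remove; then drop rows containing a 0
clean = numeric.dropna()
clean = clean[(clean != 0).all(axis=1)]

print(clean.dtypes)
print(clean)
```

Passing clean['Teff'] and clean['Gmag'] to plt.scatter should then let matplotlib choose its own numeric ticks, so the manual xticks/yticks calls are no longer needed.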

Extracting column from Array in python

I am a beginner in Python and I am stuck with data that is an array of 32763 numbers, separated by commas. Please find the data here.
I want to convert this into two columns, the first from (0:16382) and the second from (2:32763). In the end I want to plot column 1 on the x axis and column 2 on the y axis. I tried the following code but I am not able to extract the columns:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = np.genfromtxt('oscilloscope.txt',delimiter=',')
df = pd.DataFrame(data.flatten())
print(df)
and then I want to write the data in some file let us say data1 in the format as shown in attached pic
It is hard to answer without seeing the format of your data, but you can try:
import numpy as np
import pandas as pd
data = np.genfromtxt('oscilloscope.txt',delimiter=',').flatten()
print(data.shape) # here we check we got something useful
# split data into two halves of equal length; with an odd number of
# values the last one is dropped so that x and y line up
half = len(data) // 2
x = data[:half]
y = data[half:2*half]
# now you can create a dataframe and print to file
df = pd.DataFrame({'x':x, 'y':y})
df.to_csv('data1.csv', index=False)
Try this.
# Input: a dataframe df and the chunk size you want; output: a list of chunks.
def split_dataframe(df, chunk_size = 16382):
    chunks = list()
    num_chunks = -(-len(df) // chunk_size) # ceiling division, so no empty trailing chunk
    for i in range(num_chunks):
        chunks.append(df[i*chunk_size:(i+1)*chunk_size])
    return chunks
or
np.array_split
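A minimal sketch of the np.array_split route, using a made-up array of 7 values in place of the real file (an odd count, like the 32763 values in the question). The trimming step is an assumption about what is wanted when the two halves differ in length:

```python
import numpy as np
import pandas as pd

# Stand-in for the oscilloscope data: 7 values, an odd count like the real 32763
data = np.arange(7, dtype=float)

# np.array_split tolerates uneven splits: the first piece gets the extra element
x, y = np.array_split(data, 2)

# Trim both pieces to the shorter length so the two columns line up
n = min(len(x), len(y))
df = pd.DataFrame({"x": x[:n], "y": y[:n]})

print(len(x), len(y))
print(df)
```

From there, df.to_csv('data1.csv', index=False) writes the two columns to a file, as in the answer above.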

Infer Series Labels and Data from pandas dataframe column for plotting

Consider a simple 2x2 dataset with Series labels prepended as the first column ("Repo"):
Repo AllTests Restricted
0 Galactian 1860.0 410.0
1 Forecast-MLlib 140.0 47.0
Here are the DataFrame columns:
p(df.columns)
([u'Repo', u'AllTests', u'Restricted']
So we have the first column is the string/label and the second and third columns are data values. We want one series per row corresponding to the Galactian and the Forecast-MLlib repos.
It would seem this is a common task and that there would be a straightforward way to simply plot the DataFrame. However, the following related question does not provide any simple way: it essentially throws away the DataFrame's structural knowledge and plots manually:
Set matplotlib plot axis to be the dataframe column name
So is there a more natural way to plot these Series - that does not involve deconstructing the already-useful DataFrame but instead infers the first column as labels and the remaining as series data points?
Update Here is a self contained snippet
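import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# npa, p and ps are the asker's own shorthands; they are assumed here to be np.array and print
npa = np.array
p = ps = print
runtimes = npa([1860.,410.,140.,47.])
runtimes.shape = (2,2)
labels = npa(['Galactian','Forecast-MLlib'])
labels.shape=(2,1)
rtlabels = np.concatenate((labels,runtimes),axis=1)
rtlabels.shape = (2,3)
colnames = ['Repo','AllTests','Restricted']
df = pd.DataFrame(rtlabels, columns=colnames)
ps(df)
df.set_index('Repo').astype(float).plot()
plt.show()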
And here is output
Repo AllTests Restricted
0 Galactian 1860.0 410.0
1 Forecast-MLlib 140.0 47.0
And with piRSquared help it looks like this
So the data is showing now .. but the Series and Labels are swapped. Will look further to try to line them up properly.
Another update
By flipping the columns/labels the series are coming out as desired.
The change was to :
labels = npa(['AllTests','Restricted'])
..
colnames = ['Repo','Galactian','Forecast-MLlib']
So the updated code is
runtimes = npa([1860.,410.,140.,47.])
runtimes.shape = (2,2)
labels = npa(['AllTests','Restricted'])
labels.shape=(2,1)
rtlabels = np.concatenate((labels,runtimes),axis=1)
rtlabels.shape = (2,3)
colnames = ['Repo','Galactian','Forecast-MLlib']
df = pd.DataFrame(rtlabels, columns=colnames)
ps(df)
df.set_index('Repo').astype(float).plot()
plt.title("Restricting Long-Running Tests\nin Galactus and Forecast-ML")
plt.show()
p('df columns', df.columns)
ps(df)
Pandas assumes your label information is in the index and columns. Set the index first:
df.set_index('Repo').astype(float).plot()
Or
df.set_index('Repo').T.astype(float).plot()
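The difference between the two calls is which labels become the x axis and which become the plotted series. A minimal sketch, building the frame from the values in the question directly with numeric columns (so no astype(float) is needed here):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "Repo": ["Galactian", "Forecast-MLlib"],
    "AllTests": [1860.0, 140.0],
    "Restricted": [410.0, 47.0],
})

# Index = repos, columns = tests: one line per test, repos along the x axis
by_repo = df.set_index("Repo")

# Transposed: index = tests, columns = repos: one line per repo
by_test = by_repo.T

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))
by_repo.plot(ax=left, marker="o", title="one series per column")
by_test.plot(ax=right, marker="o", title="one series per repo")
```

In a script, plt.show() displays the figure. The right-hand panel corresponds to the "one series per row" layout the question asks for.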

Pandas fill cells in a column with NaN values, derive the value from other cells in the row

I have a dataframe:
a b c
0 1 2 3
1 1 1 1
2 3 7 NaN
3 2 3 5
...
I want to fill the third column ("c") in place (update the values) where the values are NaN, using a machine learning algorithm.
I don't know how to do it inplace. Sample code:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
df=pd.DataFrame([range(3), [1, 5, np.nan], [2, 2, np.nan], [4,5,9], [2,5,7]],columns=['a','b','c'])
x=[]
y=[]
for row in df.iterrows():
    index,data = row
    if(not pd.isnull(data['c'])):
        x.append(data[['a','b']].tolist())
        y.append(data['c'])
model = LinearRegression()
model.fit(x,y)
#this line does not do it in place.
df[~df.c.notnull()].assign(c = lambda x:model.predict(x[['a','b']]))
But this gives me a copy of the dataframe. Only option I have left is using a for loop however, I don't want to do that. I think there should be more pythonic way of doing it using pandas. Can someone please help? Or is there any other way of doing this?
You'll have to do something like:
df.loc[pd.isnull(df['c']), 'c'] = _result of model_
This modifies the dataframe df directly.
This way you first filter the dataframe to keep the rows you want to modify (pd.isnull(df['c'])), then from that slice you select the column you want to modify (c).
The right-hand side of the assignment is expected to be an array / list / series with the same number of rows as the filtered dataframe (in your example, one row).
You may have to adjust depending on what your model returns exactly.
EDIT
You probably need to do something like this:
df['pred'] = model.predict(df[['a', 'b']])
df.loc[pd.isnull(df['c']), 'c'] = df.loc[pd.isnull(df['c']), 'pred']
Note that a significant part of the issue comes from the way you are using scikit learn in your example. You need to pass the whole dataset to the model when you predict.
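A minimal sketch of the same idea without the helper column, using the sample frame from the question. Here the model is fitted on the rows where c is known and asked to predict only the rows where it is missing, which is one possible variation and not necessarily what the answer above intends:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame(
    [[0, 1, 2], [1, 5, np.nan], [2, 2, np.nan], [4, 5, 9], [2, 5, 7]],
    columns=["a", "b", "c"],
)

# Boolean mask of the rows that need filling
missing = df["c"].isna()

# Fit on the rows where c is known
model = LinearRegression()
model.fit(df.loc[~missing, ["a", "b"]], df.loc[~missing, "c"])

# Predict only the missing rows and write the result straight back into df
df.loc[missing, "c"] = model.predict(df.loc[missing, ["a", "b"]])

print(df)
```

Because the assignment goes through df.loc with the mask, df itself is updated and no explicit loop is needed.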
The simplest way is to transpose first, then forward fill/backward fill at your convenience.
df.T.ffill().bfill().T

Overlaying actual data on a boxplot from a pandas dataframe

I am using Seaborn to make boxplots from pandas dataframes. Seaborn boxplots seem to essentially read the dataframes the same way as the pandas boxplot functionality (so I hope the solution is the same for both -- but I can just use the dataframe.boxplot function as well). My dataframe has 12 columns and the following code generates a single plot with one boxplot for each column (just like the dataframe.boxplot() function would).
fig, ax = plt.subplots()
sns.set_style("darkgrid", {"axes.facecolor":"darkgrey"})
pal = sns.color_palette("husl",12)
sns.boxplot(dataframe, color = pal)
Can anyone suggest a simple way of overlaying all the values (by columns) while making a boxplot from dataframes?
I will appreciate any help with this.
This hasn't been added to the seaborn.boxplot function yet, but there's something similar in the seaborn.violinplot function, which has other advantages:
x = np.random.randn(30, 6)
sns.violinplot(data=x, inner="point") # recent seaborn releases spell this option "point"
sns.despine(trim=True)
A general solution for boxplotting the entire dataframe, which should work for both seaborn and pandas as they are both matplotlib-based under the hood. I will use the pandas plot as the example, assuming import matplotlib.pyplot as plt is already in place. As you already have the ax, it would make better sense to use ax.text(...) instead of plt.text(...).
In [35]:
print(df)
V1 V2 V3 V4 V5
0 0.895739 0.850580 0.307908 0.917853 0.047017
1 0.931968 0.284934 0.335696 0.153758 0.898149
2 0.405657 0.472525 0.958116 0.859716 0.067340
3 0.843003 0.224331 0.301219 0.000170 0.229840
4 0.634489 0.905062 0.857495 0.246697 0.983037
5 0.573692 0.951600 0.023633 0.292816 0.243963
[6 rows x 5 columns]
In [34]:
df.boxplot()
xs = np.repeat(np.arange(df.shape[1])+1, df.shape[0])
ys = df.values.ravel(order='F') # column by column, to match the order of xs
for x, y in zip(xs, ys):
    plt.text(x,y,'%.3f' % y,ha='center',va='center')
For a single series in the dataframe, a few small changes are necessary:
In [35]:
sub_df=df.V1
pd.DataFrame(sub_df).boxplot()
for y in sub_df.values:
    plt.text(1,y,'%.3f' % y,ha='center',va='center')
Making scatter plots is also similar:
#for the whole thing
df.boxplot()
plt.scatter(np.repeat(np.arange(df.shape[1])+1, df.shape[0]), df.values.ravel(order='F'), marker='+', alpha=0.5)
#for just one column
sub_df=df.V1
pd.DataFrame(sub_df).boxplot()
plt.scatter(np.repeat(1, df.shape[0]), sub_df.values, marker='+', alpha=0.5)
To overlay anything on a boxplot, we first need to work out where each box is drawn along the x axis. The boxes appear to sit at 1, 2, 3, 4, ... Therefore we want the values in the first column plotted at x=1, those in the second column at x=2, and so on.
An efficient way of doing this is to use np.repeat to repeat 1, 2, 3, 4, ... n times each, where n is the number of observations, and then use those numbers as the x coordinates. Since that array is one-dimensional, the y coordinates need to be a flattened view of the data in the same column-by-column order, which df.values.ravel(order='F') provides.
For overlaying the text strings we need another step (a loop), as we can only place one x value, one y value and one text string at a time.
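The pairing of x positions and values is the part that is easy to get wrong, so here is a minimal self-contained sketch of the scatter overlay. The random data and the V1..V5 column names are placeholders, and the box positions 1, 2, 3, ... are the pandas boxplot defaults assumed above:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder data: 6 observations in each of 5 columns
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((6, 5)), columns=["V1", "V2", "V3", "V4", "V5"])
n_rows, n_cols = df.shape

# df.boxplot() returns the matplotlib Axes it drew on
ax = df.boxplot()

# x positions: 1 repeated n_rows times, then 2 repeated n_rows times, ...
xs = np.repeat(np.arange(1, n_cols + 1), n_rows)

# y values flattened column by column so that they pair up with xs
ys = df.to_numpy().ravel(order="F")

ax.scatter(xs, ys, marker="+", alpha=0.5)
```

In a script, plt.show() displays the result. A default ravel() would flatten row by row instead and put most points over the wrong box.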
I have the following trick:
data = np.random.randn(6,5)
df = pd.DataFrame(data,columns = list('ABCDE'))
Now assign a dummy column to df:
df['Group'] = 'A'
print(df)
A B C D E Group
0 0.590600 0.226287 1.552091 -1.722084 0.459262 A
1 0.369391 -0.037151 0.136172 -0.772484 1.143328 A
2 1.147314 -0.883715 -0.444182 -1.294227 1.503786 A
3 -0.721351 0.358747 0.323395 0.165267 -1.412939 A
4 -1.757362 -0.271141 0.881554 1.229962 2.526487 A
5 -0.006882 1.503691 0.587047 0.142334 0.516781 A
Then use df.groupby(...).boxplot() and you get it done.
df.groupby('Group').boxplot()