Convert date/time index of external dataset so that pandas would plot clearly - pandas

When a time series is already indexed with pandas' own date/time dtype, the index seems to plot cleanly, as here.
But I already have data files whose date/time column is in its own string format, such as [2009-01-01T00:00]. Is there a way to convert this into an object that the plot can read? Currently my plot looks like the following.
Code:
dir = sorted(glob.glob("bsrn_txt_0100/*.txt"))
gen_raw = (pd.read_csv(file, sep='\t', encoding = "utf-8") for file in dir)
gen = pd.concat(gen_raw, ignore_index=True)
gen.drop(gen.columns[[1,2]], axis=1, inplace=True)
#gen['Date/Time'] = gen['Date/Time'][11:] -> causes an error, didn't work
filter = gen[gen['Date/Time'].str.endswith('00') | gen['Date/Time'].str.endswith('30')]
filter['rad_tot'] = filter['Direct radiation [W/m**2]'] + filter['Diffuse radiation [W/m**2]']
lis = np.arange(35040) # used the number of rows, checked by printing. This is for 2009-2010.
plt.xticks(lis, filter['Date/Time'])
plt.plot(lis, filter['rad_tot'], '.')
plt.title('test of generation 2009')
plt.xlabel('Date/Time')
plt.ylabel('radiation total [W/m**2]')
plt.show()
My other idea was to use plotly, but its main purpose seems to be publishing data on the internet. It would be best if I were familiar with all the modules and could try them myself, but I am learning pandas and matplotlib as I go.
So I would like to ask whether anyone has run into a similar issue.

I think you need to hide some of the tick labels in a loop:
ax = df.plot(...)
spacing = 10
visible = ax.xaxis.get_ticklabels()[::spacing]
for label in ax.xaxis.get_ticklabels():
    if label not in visible:
        label.set_visible(False)
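The conversion the question asks about can also be done directly. This is a minimal sketch, assuming the Date/Time column holds strings like 2009-01-01T00:00: pd.to_datetime parses them and set_index turns them into a DatetimeIndex. The small frame is made up to stand in for the concatenated files; only the column names come from the question.

```python
import pandas as pd

# Hypothetical stand-in for the concatenated BSRN files; the column names
# follow the question, the values are made up.
gen = pd.DataFrame({
    "Date/Time": ["2009-01-01T00:00", "2009-01-01T00:01",
                  "2009-01-01T00:30", "2009-01-01T01:00"],
    "Direct radiation [W/m**2]": [0.0, 1.0, 2.0, 3.0],
    "Diffuse radiation [W/m**2]": [10.0, 11.0, 12.0, 13.0],
})

# Parse the ISO-like strings into real timestamps and use them as the index.
gen["Date/Time"] = pd.to_datetime(gen["Date/Time"], format="%Y-%m-%dT%H:%M")
gen = gen.set_index("Date/Time")

# Keep only the samples on the hour and on the half hour.
half_hourly = gen[gen.index.minute.isin([0, 30])].copy()
half_hourly["rad_tot"] = (half_hourly["Direct radiation [W/m**2]"]
                          + half_hourly["Diffuse radiation [W/m**2]"])
```

With a DatetimeIndex in place, half_hourly['rad_tot'].plot(style='.') lets pandas and matplotlib choose date-aware ticks, so passing 35040 labels to plt.xticks should no longer be needed.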


How to save the model comparison dataframe from compare_models() in pycaret?

I want to save the model comparison data frame from compare_models() in pycaret.
# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')
# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')
# compare models
best = compare_models()
i.e. this data frame as shown above.
Does anyone know how to do that?
The solution is to call pull() right after compare_models() (credit: Goosang Yu from the pycaret Slack community):
df = pull()
compare_models() returns the best trained model (or a list of models when n_select is greater than 1), not the comparison table, so best.to_csv(path) does not work on its return value. The table is the pandas dataframe returned by pull(), so you only need to save a dataframe, for example with df.to_csv(path). If you want to save it in a different format (pickle, xml, ...), you can refer to the pandas I/O documentation.

image from [3,M,N] to [M,N,3]

I have a ndarray representing an image with different channels like this:
image = (8,100,100) where 8=channels, 100x100 the actual image per channel
I am interested in extracting the RGB components of that image:
imageRGB = np.take(image, [4,2,1], axis = 0)
in this way I have an array of (3,100,100) with the RGB components.
However, I need to visualize it, so I need an array of (100,100,3). I think it should be quite straightforward, but none of the methods I have tried work.
numpy einsum is a good tool to be used.
Official document: https://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html
import numpy as np
imageRGB = np.random.randint(0,5,size=(3,100,101))
# set the last dim to 101 just to make stuff more clear
imageRGB.shape
# (3,100,101)
imageRGB_reshape = np.einsum('kij->ijk',imageRGB)
imageRGB_reshape.shape
# (100,101,3)
In my opinion it's the most clear way to write and read.
Wow thank you! I have never thought to use Einstein summation, actually it works very well.
Just for curiosity is it possible to build it manually?
For example:
R = image[4,:,:]
G = image[2,:,:]
B = image[1,:,:]
imageRGB = ???
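One way to build it manually is to stack the three planes along a new last axis. This is a sketch on a made-up 8-channel array; only the channel indices 4, 2, 1 come from the question. np.transpose and np.moveaxis are shown as equivalent routes from the (3, M, N) array.

```python
import numpy as np

# Made-up 8-channel image; the last dim is 101 so the axes are easy to tell apart.
image = np.arange(8 * 100 * 101).reshape(8, 100, 101)

R = image[4, :, :]
G = image[2, :, :]
B = image[1, :, :]

# Stack the three 2-D planes along a new last axis -> (100, 101, 3).
imageRGB = np.stack([R, G, B], axis=-1)

# Equivalent results starting from the (3, M, N) array.
chw = np.take(image, [4, 2, 1], axis=0)
via_transpose = chw.transpose(1, 2, 0)
via_moveaxis = np.moveaxis(chw, 0, -1)
```

np.stack copies the data, while transpose and moveaxis return views of chw, which can matter for large images.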

Heatmap colorbars accumulating in Matplotlib/Seaborn figures

I have a list of data frames, and I want to make heatmaps of every data frame in the list. The first heatmap comes out perfectly, but the second one has two colorbars, one much larger than the other, which distorts the figure. The third has THREE colorbars, the last one being even larger, and this continues for as many heatmaps as I make.
This seems like a bug to me, as I have no idea why it's happening. Each heatmap should be stored as a separate element in the list of heatmaps, and even if I plot them individually, instead of using a loop or list comprehension, I get the same problem.
Here is my code:
# Set the seaborn font size.
sns.set(font_scale=0.5)
# Ensure that labels are not cut off.
plt.gcf().subplots_adjust(bottom=0.18)
plt.gcf().subplots_adjust(right=.3)
black_yellow = sns.dark_palette("yellow",10)
heatmap_list = [sns.heatmap(df, cmap=black_yellow, xticklabels=True, yticklabels=True) for df in df_list]
[heatmap_list[x].figure.savefig(file_names_list[x]+'.pdf', format='pdf') for x in range(0,len(heatmap_list))]
sns.heatmap() draws on the current axes, so calling it repeatedly in a loop keeps adding a colorbar to the same figure. To work around this, draw the first heatmap on its own and keep the rest of the loop the same, but pass cbar=False there so that no further colorbars are added.
# Set the seaborn font size.
sns.set(font_scale=0.5)
# Ensure that labels are not cut off.
plt.gcf().subplots_adjust(bottom=0.18)
plt.gcf().subplots_adjust(right=.3)
black_yellow = sns.dark_palette("yellow", 10)
hm = sns.heatmap(df_list[0], cmap=black_yellow, xticklabels=True, yticklabels=True)
hm.figure.savefig(file_names_list[0]+'.pdf', format='pdf')
heatmap_list = [sns.heatmap(df_list[i], cmap=black_yellow, xticklabels=True, yticklabels=True, cbar=False) for i in range(1, len(df_list))]
[heatmap_list[x].figure.savefig(file_names_list[x+1]+'.pdf', format='pdf') for x in range(0, len(heatmap_list))]
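An alternative that avoids sharing one figure at all is to give every heatmap its own figure and axes. This is a sketch, not the answerer's method: the frames, the temporary output directory and the file names are made up to stand in for df_list and file_names_list, and the Agg backend is chosen only so it runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up frames and file names standing in for df_list and file_names_list.
rng = np.random.default_rng(0)
df_list = [pd.DataFrame(rng.random((4, 5))) for _ in range(3)]
out_dir = tempfile.mkdtemp()
file_names_list = [os.path.join(out_dir, "heatmap_%d" % i) for i in range(3)]

black_yellow = sns.dark_palette("yellow", 10)
axes_per_figure = []
for df, name in zip(df_list, file_names_list):
    fig, ax = plt.subplots()  # fresh figure for every heatmap
    sns.heatmap(df, cmap=black_yellow, xticklabels=True, yticklabels=True, ax=ax)
    fig.savefig(name + ".pdf", format="pdf")
    axes_per_figure.append(len(fig.axes))  # heatmap axes plus one colorbar axes
    plt.close(fig)  # release the figure before drawing the next one
```

Each saved PDF then contains exactly one heatmap with its own colorbar, and every heatmap keeps a colorbar, unlike with cbar=False.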

Merging geodataframes in geopandas (CRS do not match)

I am trying to merge two geodataframes (want to see which polygon each point is in).
The following code gets me a warning first ("CRS does not match!")
and then an error ("RTreeError: Coordinates must not have minimums more than maximums").
What exactly is wrong in there? Are CRSs coordinate systems? If so, why are they not loaded the same way?
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, mapping,shape
from geopandas import GeoDataFrame, read_file
#from geopandas.tools import overlay
from geopandas.tools import sjoin
print('Reading points...')
points=pd.read_csv(points_csv)
points['geometry'] = points.apply(lambda z: Point(z.Latitude, z.Longitude), axis=1)
PointsGeodataframe = gpd.GeoDataFrame(points)
print PointsGeodataframe.head()
print('Reading polygons...')
PolygonsGeodataframe = gpd.GeoDataFrame.from_file(china_shapefile+".shp")
print PolygonsGeodataframe.head()
print('Merging GeoDataframes...')
merged=sjoin(PointsGeodataframe, PolygonsGeodataframe, how='left', op='intersects')
#merged = PointsGeodataframe.merge(PolygonsGeodataframe, left_on='iso_alpha2', right_on='ISO2', how='left')
print(merged.head(5))
Link to data for reproduction:
Shapefile,
GPS points
As noted in the comments on the question, you can eliminate the CRS does not match! warning by manually setting PointsGeodataframe.crs = PolygonsGeodataframe.crs (assuming the CRSs are indeed the same for both datasets).
However, that doesn't address the RTreeError. It's possible that you have missing lat/lon data in points_csv - in that case you would end up creating Point objects containing NaN values (i.e. Point(nan nan)), which go on to cause issues in rtree. I had a similar problem and the fix was just to filter out rows with missing coordinate data when loading in the CSV:
points=pd.read_csv(points_csv).dropna(subset=["Latitude", "Longitude"])
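The effect of that filter can be sketched on a small in-memory CSV. The contents below are made up to stand in for points_csv; only the Latitude and Longitude column names come from the question. The coordinate order in the last step is a side observation rather than part of this fix: shapely's Point takes (x, y), which for geographic data means (longitude, latitude).

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for points_csv; row "b" has no longitude.
csv_text = """name,Latitude,Longitude
a,39.9,116.4
b,31.2,
c,23.1,113.3
"""

points = pd.read_csv(io.StringIO(csv_text))

# Drop rows where either coordinate is missing, so no Point(nan nan) is created.
clean = points.dropna(subset=["Latitude", "Longitude"])

# Coordinate pairs in (x, y) = (longitude, latitude) order, ready for Point(*pair).
xy = list(zip(clean["Longitude"], clean["Latitude"]))
```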
I'll add an answer here since I was recently struggling with this and couldn't find a great answer here. The Geopandas documentation has a good example for how to solve the "CRS does not match" issue.
I copied the entire code chunk from the documentation below, but the most relevant line is this one, where the to_crs() method is used to reproject a geodataframe. You can call mygeodataframe.crs to find the CRS of each dataframe, and then to_crs() to reproject one to match the other, like so:
world = world.to_crs({'init': 'epsg:3395'})
Simply setting PointsGeodataframe.crs = PolygonsGeodataframe.crs will silence the warning, but it only relabels the CRS and does not reproject the geometry.
Full documentation code for reference:
# load example data
In [1]: world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
# Check original projection
# (it's Plate Carrée! x-y are long and lat)
In [2]: world.crs
Out[2]: {'init': 'epsg:4326'}
# Visualize
In [3]: ax = world.plot()
In [4]: ax.set_title("WGS84 (lat/lon)");
# Reproject to Mercator (after dropping Antarctica)
In [5]: world = world[(world.name != "Antarctica") & (world.name != "Fr. S. Antarctic Lands")]
In [6]: world = world.to_crs({'init': 'epsg:3395'}) # world.to_crs(epsg=3395) would also work
In [7]: ax = world.plot()
In [8]: ax.set_title("Mercator");

Complementary Filter Code Not functioning

I've been scratching my head too long.
The data is coming from a 3D accelerometer and a 3D gyro. I am using a complementary filter to control drift.
I have it working in Excel but can't seem to get this Python code to do the same thing:
r1_angle_cfx = np.zeros(len(r1_angle_ax))
r1_angle_cfx[0] = r1_angle_ax[0]
for i in xrange(len(r1_angle_ax)-1):
    j = i + 1
    r1_angle_cfx[j] = 0.98 *(r1_angle_cfx[i] + r1_alpha_x[j]*fs) + (0.02 * r1_angle_ax[j]) #complementary filter
In excel (correct) I get:
In python (incorrect) I get:
What is going wrong? and is there a better way to do this in python?
Thanks,
Scott
EDIT: Link to data files -
sample data
1. The csv file contains accelerometer, gyro data that is entered into the filter formula as well as those values that were calculated in excel.
2. The excel file contains all raw data (steps not mentioned above but I have triple checked and are equivalent up to the point of being entered in the filter formula).
EDIT 2: update - Turns out my code works. It was sloppy debugging. fs should be fs = 0.01. In my code I had fs = 1/100, which evaluates to 0 in the script because Python 2 divides two integers as integers.
Your Python code looks pretty reasonable. Without example data, I can't do much more than say that.
But I can guess. I looked up "complementary filters" and found a link explaining them:
https://sites.google.com/site/myimuestimationexperience/filters/complementary-filter
This link gives an example equation that is very similar to yours:
angle = (1-alpha)*(angle + gyro * dt) + (alpha)*(acc)
You have fs where this has dt, and dt is computed as 1/sampling_frequency. If fs is the sampling frequency, maybe you should try inverting it?
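The role of dt can be sketched in plain Python. The function below follows the example equation above, with alpha as the weight on the accelerometer angle; the input values are made up, and only the 100 Hz rate and the 0.98/0.02 weights come from the question. It also shows what happens when the time step collapses to 0, which is what 1/100 gives under Python 2 integer division (written here as 1 // fs so that it behaves the same under Python 3).

```python
def complementary_filter(acc_angles, gyro_rates, dt, alpha=0.02):
    """Blend the integrated gyro rate with the accelerometer angle.

    alpha is the weight given to the accelerometer angle.
    """
    if not acc_angles:
        return []
    fused = [acc_angles[0]]  # start from the first accelerometer angle
    for acc, rate in zip(acc_angles[1:], gyro_rates[1:]):
        fused.append((1.0 - alpha) * (fused[-1] + rate * dt) + alpha * acc)
    return fused


fs = 100              # sampling frequency in Hz
dt_float = 1.0 / fs   # 0.01, the intended time step
dt_floor = 1 // fs    # 0, what `1/100` evaluates to under Python 2

acc = [0.0, 0.0, 0.0, 0.0]      # made-up accelerometer angles
gyro = [0.0, 10.0, 10.0, 10.0]  # made-up gyro rates

good = complementary_filter(acc, gyro, dt_float)
bad = complementary_filter(acc, gyro, dt_floor)  # gyro term vanishes entirely
```

With dt equal to 0 the gyro contributes nothing, so the output only follows the accelerometer angle scaled by the filter weights.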
EDIT: Okay, now that you posted the data, I played around with this. Here is my program that gets a correct result.
Your code looks basically correct, so I think you must have made a mistake in your code that collected the values. I'm not quite sure because your variable names confuse me.
I used a namedtuple and for the names, I used the column headers from the CSV file (with spaces and periods removed to make a valid Python identifier).
import collections as coll
import csv
import matplotlib.pyplot as plt
import numpy as np
import sys
fs = 100.0
dt = 1.0/fs
alpha = 0.02
Sample = coll.namedtuple("Sample",
    "accZ accY accX rotZ rotY rotX r acc_angZ acc_angY acc_angX cfZ cfY cfX")
def samples_from_file(fname):
    with open(fname) as f:
        next(f) # discard header row
        csv_reader = csv.reader(f, dialect='excel')
        for i, row in enumerate(csv_reader, 1):
            try:
                values = [float(x) for x in row]
                yield Sample(*values)
            except Exception:
                lst = list(row)
                print("Bad line %d: len %d '%s'" % (i, len(lst), str(lst)))
samples = list(samples_from_file("data.csv"))
cfx = np.zeros(len(samples))
# Excel formula: =R12
cfx[0] = samples[0].acc_angX
# Excel formula: =0.98*(U12+N13*0.01)+0.02*R13
# Excel: U is cfX N is rotX R is acc_angX
for i, s in enumerate(samples[1:], 1):
    cfx[i] = (1.0 - alpha) * (cfx[i-1] + s.rotX*dt) + (alpha * s.acc_angX)
check_line = [s.cfX - cf for s, cf in zip(samples, cfx)]
plt.figure(1)
plt.plot(check_line)
plt.plot(cfx)
plt.show()
check_line is the difference between the saved cfX value from the CSV file, and the new computed cfx value. As you can see in the plot, this is a straight line at 0, so my calculation is agreeing quite well with yours.
So I guess the mapping of names is:
your_name       my_name
________________________
r1_angle_cfx    cfx
r1_alpha_x      rotX
r1_angle_ax     acc_angX