MATLAB .mat in Pandas DataFrame to be used in Tensorflow - pandas

I have gone days trying to figure this out, hopefully someone can help.
I am uploading a .mat file into python using scipy.io, placing the struct into a dataframe, which will then be used in Tensorflow.
from scipy.io import loadmat
import pandas as pd
import numpy as p
import matplotlib.pyplot as plt
#import TF
path = '/home/anthony/PycharmProjects/Deep_Learning_MATLAB/circuit-data/for tinghao/template1-lib5-eqns-CR-RESULTS-SET1-FINAL.mat'
raw_data = loadmat(path, squeeze_me=True)
data = raw_data['Graphs']
df = pd.DataFrame(data, dtype=int)
df.pop('transferFunc')
print(df.dtypes)
The out put is:
A object
Ln object
types object
nz int64
np int64
dtype: object
Process finished with exit code 0
The struct is (43249x6). Each cell in the 'A' column is a different sized matrix, i.e. 18x18, or 16x16 etc. Each cell in "Ln" is a row of letters each in their own separate cell. Each cell in 'Types' contains 12 columns of numbers, and 'nz' and 'np' i have no issues with.
I want to put all columns into a dataframe, and use column A or LN or Types as the 'Labels' and nz and np as 'features', again i do not have issues with the latter. Can anyone help with this or have some kind of work around.
The end goal is to have tensorflow train on nz and np and give me either a matrix, Ln, or Type.

What type of data is your .mat file of ? Is your application very time critical?
If you can collect all your data in a struct you could give jsonencode a try, make the struct a json file and load it back into python via json (see json documentation on loading data).
Then you can create a pandas dataframe via
pd.df.from_dict()
Of course this would only be a workaround. Still you would have to ensure your data in the MATLAB struct is correctly orderer to be then imported and transferred to a df.

raw_data = loadmat(path, squeeze_me=True)
data = raw_data['Graphs']
graph_labels = pd.DataFrame()
graph_labels['perf'] = raw_data['Objective'][0:1000]
graph_labels['np'] = data['np'][0:1000]
The code above helped out. Its very simple and drawn out, but it got the job done. But, it does not work in tensorflow because tensorflow does not accept this format, and that was my main issue. I have to convert adjacency matrices to networkx graphs, then upload them into stellargraph.

Related

Numpy broadcasting comparison report "'bool' object has no attribute 'sum'" error when dealing with large dataframe

I use numpy broadcasting to get the differences matrix from a pandas dataframe. I find when dealing with large dataframe, it reports "'bool' object has no attribute 'sum'" error. While dealing with small dataframe, it runs fine.
I post the two csv files in the following links:
large file
small file
import numpy as np
import pandas as pd
df_small = pd.read_csv(r'test_small.csv',index_col='Key')
df_small.fillna(0,inplace=True)
a_small = df_small.to_numpy()
matrix = pd.DataFrame((a_small != a_small[:, None]).sum(2), index=df_small.index, columns=df_small.index)
print(matirx)
when running this, I could get the difference matrix.
when switch to large file, It reports the following error. Does anybody know why this happens?
EDIT:The numpy version is 1.19.5
np.__version__
'1.19.5'

What is the difference between doing a regression with a dataframe and ndarray?

I would like to know why would I need to convert my dataframe to ndarray when doing a regression, since I get the same result for intercept and coef when I do not convert it?
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
from sklearn import linear_model
%matplotlib inline
# import data and create dataframe
!wget -O FuelConsumption.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/FuelConsumptionCo2.csv
df = pd.read_csv("FuelConsumption.csv")
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
# Split train/ test data
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]
# Modeling
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
**# if I use the dataframe, train[['ENGINESIZE']] for 'x', and train[['CO2EMISSIONS']] for 'y'
below, I get the same result**
regr.fit (train_x, train_y)
# The coefficients
print ('Coefficients: ', regr.coef_)
print ('Intercept: ',regr.intercept_)
Thank you very much!
So df is the loaded dataframe, cdf is another frame with selected columns, and train is selected rows.
train[['ENGINESIZE']] is a 1 column dataframe (I believe train['ENGINESIZE'] would be a pandas Series).
I believe the preferred syntax for getting an array from the dataframe is:
train[['ENGINESIZE']].values # or
train[['ENGINESIZE']].to_numpy()
though
np.asanyarray(train[['ENGINESIZE']])
is supposed to do the same thing.
Digging down through the regr.fit code I see that it calls sklearn.utils.check_X_y which in turn calls sklearn.tils.check_array. That takes care of converting the inputs to numpy arrays, with some awareness of pandas dataframe peculiarities (such as multiple dtypes).
So it appears that if fit accepts your dataframes, you don't need to convert them ahead of time. But if you can get a nice array from the dataframe, there's no harm in do that either. Either way the fit is done with arrays, derived from the dataframe.

reading arrays from netCDF, why I get a size of (1,1,n)

I am trying to read and later on to plot data from a netcdf file. Some of the arrays contained at the .nc file that I am trying to store as variables, are created as a (1,1,n) size variable. When printing them i see [[[ numbers, numbers,....]]]. Why are these three [[[ are created? How can I read these variables as a simple (n,1) array?
Here is my code
import pandas as pd
import netCDF4 as nc
import matplotlib.pyplot as plt
from tkinter import filedialog
import numpy as np
file_path=filedialog.askopenfilename(title = "Select files", filetypes = (("all files","*.*"),("txt files","*.txt")))
file=nc.Dataset(file_path)
print(file.variables.keys()) # get all variable names
read_alt=file.variables['altitude'][:]
alt=np.array(read_alt)
read_b355=file.variables['backscatter'][:]
read_error_b355=file.variables['error_backscatter'][:]
b355=np.array(read_b355)
error_b355=np.array(read_error_b355)
the variable alt is fine, for the other two I have the aforementioned problem.
Is it possible that your variables - altitude, backscatter and error_backscatter - have more than one dimensions? Whenever you load that kind of data, the number of dimensions is kept by the netCDF library.
Nevertheless, what I usually do, is that I remove the dimensions that I do not need from the arrays by squeezing them:
read_alt = np.squeeze(file.variables['altitude'][:])
read_b355 = np.squeeze(file.variables['backscatter'][:]);
read_error_b355 = np.squeeze(file.variables['error_backscatter'][:]);

Change data type in Numpy and Nibabel

I'm trying to convert numpy arrays into Nifti file format using Nibabel. Some of my Numpy arrays have dtype('<i8') when it should be dtype('uint8') when Nibabel calls for the data type.
arr.get_data_dtype()
Does anyone know how to convert and save Numpy arrays' data type?
The question of the title is slightly different than the question in the text. So...
If you want to change the data-type of a numpy array arr to np.int8, you are looking for arr.astype(np.int8).
Mind that you may lose precision due to data casting (see astype documentation)
To save it afterwards you may want to see ?np.save and ?np.savetxt (or to check the library pickle, to save more general objects than numpy array).
If you want to change the data-type of a nifti image saved in my_image.nii.gz
you have to go for:
import nibabel as nib
import numpy as np
image = nib.load('my_image.nii.gz')
# to be extra sure of not overwriting data:
new_data = np.copy(image.get_data())
hd = image.header
# in case you want to remove nan:
new_data = np.nan_to_num(new_data)
# update data type:
new_dtype = np.int8 # for example to cast to int8.
new_data = new_data.astype(new_dtype)
image.set_data_dtype(new_dtype)
# if nifty1
if hd['sizeof_hdr'] == 348:
new_image = nib.Nifti1Image(new_data, image.affine, header=hd)
# if nifty2
elif hd['sizeof_hdr'] == 540:
new_image = nib.Nifti2Image(new_data, image.affine, header=hd)
else:
raise IOError('Input image header problem')
nib.save(new_image, 'my_image_new_datatype.nii.gz')
Finally if you have a numpy array my_arr and you want to save it into a nifti image with a given data-type np.my_dtype, you can do:
import nibabel as nib
import numpy as np
new_image = nib.Nifti1Image(my_arr, np.eye(4))
new_image.set_data_dtype(np.my_dtype)
nib.save(new_image, 'my_arr.nii.gz')
Hope it helps!
NOTE: If you are using ITKsnap you may want to use np.float32, np.float64, np.uint16, np.uint8, np.int16, np.int8. Other choices may not produce images that can be open with this software.
Seems like you could also do
import nibabel
img = nibabel.load(filename)
img.set_data_dtype(dtype)
img.to_filename(new_filename)
You can use nilearn for a tidy solution. Here is an example if you want to change the data type of nifti image to int16:
from nilearn import image
import numpy as np
vol = image.load_img(input_file)
vol = image.new_img_like(vol, np.int16(vol.get_fdata()))
vol.to_filename(output_file)
Datatypes for .nii files can also be specified in the .to_filename() function:
import nibabel as nib
new_image = nib.Nifti2Image(my_arr, affine)
new_image.to_filename(fn, dtype=np.uint8)

Stuck importing NetCDF file into Pandas DataFrame

I've been working on this as a beginner for a while. Overall, I want to read in a NetCDF file and import multiple (~50) columns (and 17520 cases) into a Pandas DataFrame. At the moment I have set it up for a list of 4 variables but I want to be able to expand that somehow. I made a start, but any help on how to loop through to make this happen with 50 variables would be great. It does work using the code below for 4 variables. I know its not pretty - still learning!
Another question I have it that when I try to read the numpy arrays directly into Pandas DataFrame it doesn't work and instead creates a DataFrame that is 17520 columns large. It should be the other way (transposed). If I create a series, it works fine. So I have had to use the following lines to get around this. Not even sure why it works. Any suggestions of a better way (especially when it comes to 50 variables)?
d={vnames[0] :vartemp[0], vnames[1] :vartemp[1], vnames[2] :vartemp[2], vnames[3] :vartemp[3]}
hs = pd.DataFrame(d,index=times)
The whole code is pasted below:
import pandas as pd
import datetime as dt
import xlrd
import numpy as np
import netCDF4
def excel_to_pydate(exceldate):
datemode=0 # datemode: 0 for 1900-based, 1 for 1904-based
pyear, pmonth, pday, phour, pminute, psecond = xlrd.xldate_as_tuple(exceldate, datemode)
py_date = dt.datetime(pyear, pmonth, pday, phour, pminute, psecond)
return(py_date)
def main():
filename='HowardSprings_2010_L4.nc'
#Define a list of variables names we want from the netcdf file
vnames = ['xlDateTime', 'Fa', 'Fh' ,'Fg']
# Open the NetCDF file
nc = netCDF4.Dataset(filename)
#Create some lists of size equal to length of vnames list.
temp=list(xrange(len(vnames)))
vartemp=list(xrange(len(vnames)))
#Enumerate the list and assign each NetCDF variable to an element in the lists.
# First get the netcdf variable object assign to temp
# Then strip the data from that and add to temporary variable (vartemp)
for index, variable in enumerate(vnames):
temp[index]= nc.variables[variable]
vartemp[index] = temp[index][:]
# Now call the function to convert to datetime from excel. Assume datemode: 0
times = [excel_to_pydate(elem) for elem in vartemp[0]]
#Dont know why I cant just pass a list of variables i.e. [vartemp[0], vartemp[1], vartemp[2]]
#But this is only thing that worked
#Create Pandas dataframe using times as index
d={vnames[0] :vartemp[0], vnames[1] :vartemp[1], vnames[2] :vartemp[2], vnames[3] :vartemp[3]}
theDataFrame = pd.DataFrame(d,index=times)
#Define missing data value and apply to DataFrame
missing=-9999
theDataFrame1=theDataFrame.replace({vnames[0] :missing, vnames[1] :missing, vnames[2] :missing, vnames[3] :missing},'NaN')
main()
You could replace:
d = {vnames[0] :vartemp[0], ..., vnames[3]: vartemp[3]}
hs = pd.DataFrame(d, index=times)
with
hs = pd.DataFrame(vartemp[0:4], columns=vnames[0:4], index=times)
.
Saying that, pandas can read HDF5 directly, so perhaps the same is true for netCDF (which is based on HDF5)...