Accessing carray of pointcloud using pytables - pytables

I am having a hard time understanding how to access the data in a carray.
http://carray.pytables.org/docs/manual/index.html
I have a carray that I can view in a group structure using vitables, but how to open it and retrieve the data is beyond me.
The data are a point cloud, 3 levels down in the hierarchy, that I want to make a scatter plot of and extract as a .obj file.
I then have to loop through (many) clouds and do the same thing.
Is there anyone that can give me a simple example of how to do this?
This was my attempt:
import carray as ca
fileName = 'hdf5_example_db.h5'
a = ca.open(rootdir=fileName)
print a

I managed to solve my issue. I wasn't treating the carray differently to the rest of the hierarchy. I needed to first load the entire db, then refer to the data I needed. I ended up not having to use carray, and just stuck to h5py:
from __future__ import print_function
import h5py
import numpy as np
# read the hdf5 format file
fileName = 'hdf5_example_db.h5'
f = h5py.File(fileName, 'r')
# full path of the carray-type data (which is in ply format)
dataspace = '/objects/object_000/object_model'
# view the data
print(f[dataspace])
# print to ply file
with open('object_000.ply', 'w') as fo:
    for line in f[dataspace]:
        fo.write(line+'\n')
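The question also asks about looping over many clouds and writing .obj files. A minimal sketch of that loop with h5py follows. It assumes the /objects/object_NNN/object_model layout from the answer above, and it assumes each dataset is a numeric (N, 3) array of points, which is not what the answer's file holds (that one stores PLY text lines). The helper names points_to_obj_lines and export_all_clouds are hypothetical, and the demo file is generated only so the sketch runs end to end.

```python
import os
import tempfile

import h5py
import numpy as np


def points_to_obj_lines(points):
    """Format an (N, 3) array as Wavefront OBJ vertex lines."""
    return ["v {:.6f} {:.6f} {:.6f}".format(x, y, z) for x, y, z in points]


def export_all_clouds(h5_path, out_dir, group_path="/objects",
                      dataset_name="object_model"):
    """Write one .obj file per child group of group_path; return the paths."""
    written = []
    with h5py.File(h5_path, "r") as f:
        for name, obj in f[group_path].items():
            # Skip anything that is not a group holding the expected dataset.
            if not isinstance(obj, h5py.Group) or dataset_name not in obj:
                continue
            points = obj[dataset_name][...]  # read the dataset into a NumPy array
            out_path = os.path.join(out_dir, name + ".obj")
            with open(out_path, "w") as fo:
                fo.write("\n".join(points_to_obj_lines(points)) + "\n")
            written.append(out_path)
    return written


# Build a tiny throwaway file (assumed layout, random points) for the demo.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.h5")
with h5py.File(demo_path, "w") as f:
    for i in range(3):
        f.create_dataset("/objects/object_{:03d}/object_model".format(i),
                         data=np.random.rand(5, 3))

obj_files = export_all_clouds(demo_path, tmp_dir)
```

The same points array could be passed to a matplotlib 3D scatter call for the plot; only the export half is sketched here.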

MATLAB .mat in Pandas DataFrame to be used in Tensorflow

I have gone days trying to figure this out, hopefully someone can help.
I am uploading a .mat file into python using scipy.io, placing the struct into a dataframe, which will then be used in Tensorflow.
from scipy.io import loadmat
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import TF
path = '/home/anthony/PycharmProjects/Deep_Learning_MATLAB/circuit-data/for tinghao/template1-lib5-eqns-CR-RESULTS-SET1-FINAL.mat'
raw_data = loadmat(path, squeeze_me=True)
data = raw_data['Graphs']
df = pd.DataFrame(data, dtype=int)
df.pop('transferFunc')
print(df.dtypes)
The output is:
A object
Ln object
types object
nz int64
np int64
dtype: object
Process finished with exit code 0
The struct is 43249x6. Each cell in the 'A' column is a matrix of a different size (18x18, 16x16, etc.). Each cell in 'Ln' is a row of letters, each in its own separate cell. Each cell in 'types' contains 12 columns of numbers. I have no issues with 'nz' and 'np'.
I want to put all columns into a dataframe and use column 'A', 'Ln' or 'types' as the labels and 'nz' and 'np' as the features; again, I have no issues with the latter. Can anyone help with this or suggest some kind of workaround?
The end goal is to have tensorflow train on nz and np and give me either a matrix, Ln, or Type.
What type of data is in your .mat file? Is your application very time critical?
If you can collect all your data in a struct, you could give jsonencode a try: write the struct to a JSON file in MATLAB and load it back into Python via json (see the json documentation on loading data).
Then you can create a pandas dataframe via
pd.DataFrame.from_dict()
Of course this would only be a workaround. You would still have to ensure that the data in the MATLAB struct is correctly ordered before it is imported and transferred to a df.
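A rough sketch of the Python side of that workaround is below. It assumes the MATLAB side has already written the struct array with jsonencode and that it comes out as a JSON array of objects; here a literal string stands in for the file contents. The field names follow the question, and the values are made up for illustration.

```python
import json

import pandas as pd

# Stand-in for the text that MATLAB's jsonencode would write to disk;
# the field names follow the question, the values are made up.
json_text = '[{"nz": 2, "np": 3, "Ln": ["R", "C"]}, {"nz": 1, "np": 4, "Ln": ["L"]}]'

records = json.loads(json_text)  # list of dicts, one per struct element

# A list of records can go straight into the DataFrame constructor.
df = pd.DataFrame(records)

# DataFrame.from_dict expects a dict, for example one keyed by column name.
by_column = {key: [rec[key] for rec in records] for key in records[0]}
df_from_dict = pd.DataFrame.from_dict(by_column)
```

Columns holding nested lists (such as 'Ln' here) end up with object dtype, so they would still need converting before being fed to a model.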
raw_data = loadmat(path, squeeze_me=True)
data = raw_data['Graphs']
graph_labels = pd.DataFrame()
graph_labels['perf'] = raw_data['Objective'][0:1000]
graph_labels['np'] = data['np'][0:1000]
The code above helped out. It's very simple and drawn out, but it got the job done. However, it does not work in TensorFlow because TensorFlow does not accept this format, and that was my main issue. I have to convert the adjacency matrices to networkx graphs, then load them into stellargraph.

GeoViews saving inline HTML file is very large

I have created a geo-dataframe using a combination of geopandas and geoviews. The libraries I'm using are below:
import pandas as pd
import numpy as np
import geopandas as gpd
import holoviews as hv
import geoviews as gv
import matplotlib.pyplot as plt
import matplotlib
import panel as pn
from cartopy import crs
gv.extension('bokeh')
I have concatenated 3 shapefiles to build a polygon picture of UK healthcare boundaries (links to files provided if needed). Unfortunately, from what I have found, the UK doesn't publish one file that combines all of them, so I have had to merge the shapefiles from the 3 individual countries I'm interested in. The 3 shapefiles have the following sizes:
shape file 1 = 5mb (https://www.opendatani.gov.uk/dataset/department-of-health-trust-boundaries)
shape file 2 = 204kb (https://geoportal.statistics.gov.uk/datasets/5252644ec26e4bffadf9d3661eef4826_4)
shape file 3 = 22kb (https://data.gov.uk/dataset/31ab16a2-22da-40d5-b5f0-625bafd76389/local-health-boards-december-2016-ultra-generalised-clipped-boundaries-in-wales)
I have merged them all successfully to build the picture I am looking for using:
Test = gv.Polygons(Merged_Shapes, vdims=[('Data'), ('CCG_Name')], crs=crs.OSGB()).options(tools=['hover'], width=550, height=700)
Test_2 = gv.Polygons(Merged_Shapes, vdims=[('Data'), ('CCG_Name')], crs=crs.OSGB()).options(tools=['hover'], width=550, height=700)
However, I would like to include these charts in a shareable html file. The issue I'm running into, is that when I save the HTML using:
from bokeh.resources import INLINE
layout = hv.Layout(Test + Test_2)
Final_report = pn.Tabs(('Test',layout)).save('Map_test.html', resources=INLINE)
I generate an HTML file that displays the charts, but its size is 80 MB, which is far too large, especially if I want to include more polygon charts and other charts in the same HTML file.
Does anyone know of a more efficient way, from a memory perspective, I can store my polygon charts within a HTML file for sharing?
You can make the file smaller by rasterizing or by decimating the shapes. For rasterizing you can call hv.operation.datashader.rasterize(obj), and I think there is something in Shapely or GeoPandas for simplifying the shapes.
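The shape simplification mentioned above can be sketched with Shapely's simplify method (GeoPandas exposes the same operation as GeoSeries.simplify). The circle and the tolerance value below are illustrative assumptions, not taken from the question's data. The tolerance is in the units of the geometry's coordinates, which would be metres for OSGB.

```python
from shapely.geometry import Point

# Illustrative geometry: a circle approximated by many vertices.
detailed = Point(0, 0).buffer(1.0, 64)

# Douglas-Peucker-based simplification: vertices that deviate from the
# simplified outline by less than the tolerance are dropped.
simplified = detailed.simplify(0.05, preserve_topology=True)

n_before = len(detailed.exterior.coords)
n_after = len(simplified.exterior.coords)
print(n_before, "->", n_after, "vertices")
```

On a GeoDataFrame the analogous step would be something like replacing the geometry column with Merged_Shapes.geometry.simplify(tolerance) before building gv.Polygons. Fewer vertices means less data embedded in the HTML, at the cost of coarser boundaries, so the tolerance needs tuning by eye.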

reading arrays from netCDF, why I get a size of (1,1,n)

I am trying to read, and later plot, data from a netCDF file. Some of the arrays in the .nc file that I am trying to store as variables come out with shape (1,1,n). When printing them I see [[[ numbers, numbers, ...]]]. Why are these three levels of brackets created? How can I read these variables as a simple (n,1) array?
Here is my code
import pandas as pd
import netCDF4 as nc
import matplotlib.pyplot as plt
from tkinter import filedialog
import numpy as np
file_path=filedialog.askopenfilename(title = "Select files", filetypes = (("all files","*.*"),("txt files","*.txt")))
file=nc.Dataset(file_path)
print(file.variables.keys()) # get all variable names
read_alt=file.variables['altitude'][:]
alt=np.array(read_alt)
read_b355=file.variables['backscatter'][:]
read_error_b355=file.variables['error_backscatter'][:]
b355=np.array(read_b355)
error_b355=np.array(read_error_b355)
The variable alt is fine; for the other two I have the aforementioned problem.
Is it possible that your variables (altitude, backscatter and error_backscatter) have more than one dimension? Whenever you load that kind of data, the netCDF library keeps the number of dimensions.
What I usually do is remove the dimensions I do not need by squeezing the arrays:
read_alt = np.squeeze(file.variables['altitude'][:])
read_b355 = np.squeeze(file.variables['backscatter'][:])
read_error_b355 = np.squeeze(file.variables['error_backscatter'][:])
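Note that squeezing a (1,1,n) array gives shape (n,), not the (n,1) the question asks for. A small NumPy-only sketch of the difference is below, using a synthetic array as a stand-in for the netCDF variable:

```python
import numpy as np

n = 5
raw = np.arange(n, dtype=float).reshape(1, 1, n)  # stand-in for a (1, 1, n) variable

flat = np.squeeze(raw)        # drops every length-1 axis -> shape (n,)
column = raw.reshape(-1, 1)   # explicit column vector -> shape (n, 1)

print(raw.shape, flat.shape, column.shape)
```

For plotting, the 1-D (n,) form is usually the more convenient of the two.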

Read multiple parquet files in a folder and write to single csv file using python

I am new to Python and I have a scenario where there are multiple parquet files with file names in order, e.g. par_file1, par_file2, par_file3 and so on, up to 100 files in a folder.
I need to read these parquet files in order, starting from file1, and write them to a single CSV file. After writing the contents of file1, the contents of file2 should be appended to the same CSV without a header. Note that all files have the same column names and only the data is split across multiple files.
I learnt to convert single parquet to csv file using pyarrow with the following code:
import pandas as pd
df = pd.read_parquet('par_file.parquet')
df.to_csv('csv_file.csv')
But I couldn't extend this to loop over multiple parquet files and append to a single CSV.
Is there a method in pandas to do this? or any other way to do this would be of great help. Thank you.
I ran into this question looking to see if pandas can natively read partitioned parquet datasets. I have to say that the current answer is unnecessarily verbose (making it difficult to parse). I also imagine that it's not particularly efficient to be constantly opening/closing file handles then scanning to the end of them depending on the size.
A better alternative would be to read all the parquet files into a single DataFrame, and write it once:
from pathlib import Path
import pandas as pd
data_dir = Path('dir/to/parquet/files')
full_df = pd.concat(
    pd.read_parquet(parquet_file)
    for parquet_file in data_dir.glob('*.parquet')
)
full_df.to_csv('csv_file.csv')
Alternatively, if you really want to just append to the file:
data_dir = Path('dir/to/parquet/files')
for i, parquet_path in enumerate(data_dir.glob('*.parquet')):
    df = pd.read_parquet(parquet_path)
    write_header = i == 0 # write header only on the 0th file
    write_mode = 'w' if i == 0 else 'a' # 'write' mode for 0th file, 'append' otherwise
    df.to_csv('csv_file.csv', mode=write_mode, header=write_header)
A final alternative for appending each file opens the target CSV file in "a+" mode at the outset and keeps the file handle positioned at the end of the file for each write/append (I believe this works, but haven't actually tested it):
data_dir = Path('dir/to/parquet/files')
with open('csv_file.csv', "a+") as csv_handle:
    for i, parquet_path in enumerate(data_dir.glob('*.parquet')):
        df = pd.read_parquet(parquet_path)
        write_header = i == 0 # write header only on the 0th file
        df.to_csv(csv_handle, header=write_header)
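One caveat for the glob-based loops above: glob does not promise any particular order, and a plain alphabetical sort would put par_file10 before par_file2. Since the question asks for the files to be processed in numeric order, a small sort key helps. The sketch below uses only the standard library; the helper name numeric_suffix is hypothetical, and the file names follow the pattern given in the question.

```python
import re
from pathlib import Path


def numeric_suffix(path):
    """Return the trailing integer in the file stem, e.g. 10 for par_file10.parquet."""
    match = re.search(r"(\d+)$", Path(path).stem)
    return int(match.group(1)) if match else -1


names = ["par_file10.parquet", "par_file2.parquet", "par_file1.parquet"]
ordered = sorted(names, key=numeric_suffix)
print(ordered)
```

In the loops above, this would mean iterating over sorted(data_dir.glob('*.parquet'), key=numeric_suffix) instead of the bare glob result.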
I have a similar need, and I read that current pandas versions accept a directory path as the argument to the read_parquet function. So you can read multiple parquet files like this:
import pandas as pd
df = pd.read_parquet('path/to/the/parquet/files/directory')
It concats everything into a single dataframe so you can convert it to a csv right after:
df.to_csv('csv_file.csv')
Make sure you have at least one of the following Parquet engines installed, according to the docs:
pyarrow
fastparquet
This helped me to load all parquet files into one data frame
import glob
files = glob.glob("*.snappy.parquet")
data = [pd.read_parquet(f,engine='fastparquet') for f in files]
merged_data = pd.concat(data,ignore_index=True)
If you are going to copy the files over to your local machine and run your code, you could do something like this. The code below assumes that you are running it in the same directory as the parquet files. It also assumes the file naming you described above: "par_file1, par_file2, par_file3 and so on up to 100 files in a folder." If you need to search for your files, you will need to get the file names using glob and explicitly provide the path where you want to save the CSV: open(r'this\is\your\path\to\csv_file.csv', 'a'). Hope this helps.
import pandas as pd
# Create an empty csv file and write the first parquet file with headers
with open('csv_file.csv','w') as csv_file:
    print('Reading par_file1.parquet')
    df = pd.read_parquet('par_file1.parquet')
    df.to_csv(csv_file, index=False)
    print('par_file1.parquet appended to csv_file.csv\n')
# create your file names and append to an empty list to look for in the current directory
files = []
for i in range(2,101):
    files.append(f'par_file{i}.parquet')
# open files and append to csv_file.csv
for f in files:
    print(f'Reading {f}')
    df = pd.read_parquet(f)
    with open('csv_file.csv','a') as file:
        df.to_csv(file, header=False, index=False)
    print(f'{f} appended to csv_file.csv\n')
You can remove the print statements if you want.
Tested in python 3.6 using pandas 0.23.3
A small change for those trying to read remote files, which helps to read them faster (calling read_parquet directly on remote files was much slower for me):
import io
merged = []
# remote_reader = ... <- init some remote reader, for example AzureDLFileSystem()
for f in files:
    with remote_reader.open(f, 'rb') as f_reader:
        # read the raw bytes from the opened file handle
        merged.append(f_reader.read())
merged = pd.concat((pd.read_parquet(io.BytesIO(file_bytes)) for file_bytes in merged))
This adds a little temporary memory overhead, though.
You can use Dask to read in the multiple Parquet files and write them to a single CSV.
Dask accepts an asterisk (*) as wildcard / glob character to match related filenames.
Make sure to set single_file to True and index to False when writing the CSV file.
import pandas as pd
import numpy as np
# create some dummy dataframes using np.random and write to separate parquet files
rng = np.random.default_rng()
for i in range(3):
    df = pd.DataFrame(rng.integers(0, 100, size=(10, 4)), columns=list('ABCD'))
    df.to_parquet(f"dummy_df_{i}.parquet")
# load multiple parquet files with Dask
import dask.dataframe as dd
ddf = dd.read_parquet('dummy_df_*.parquet', index=False)
# write to single csv
ddf.to_csv("dummy_df_all.csv",
           single_file=True,
           index=False
)
# test to verify
df_test = pd.read_csv("dummy_df_all.csv")
Using Dask for this means you are less likely to be limited by the size of the result: Dask processes the data in partitions instead of loading everything into memory at once, whereas pandas may raise a MemoryError if the resulting DataFrame is too large. Dask can also read from and write to cloud storage such as Amazon S3.

Stuck importing NetCDF file into Pandas DataFrame

I've been working on this as a beginner for a while. Overall, I want to read in a NetCDF file and import multiple (~50) columns (and 17520 cases) into a Pandas DataFrame. At the moment I have set it up for a list of 4 variables, but I want to be able to expand that somehow. I made a start, but any help on how to loop through to make this happen with 50 variables would be great. It does work using the code below for 4 variables. I know it's not pretty - still learning!
Another question I have is that when I try to read the numpy arrays directly into a Pandas DataFrame, it doesn't work and instead creates a DataFrame that is 17520 columns wide. It should be the other way round (transposed). If I create a series, it works fine. So I have had to use the following lines to get around this. I'm not even sure why it works. Any suggestions for a better way (especially when it comes to 50 variables)?
d={vnames[0] :vartemp[0], vnames[1] :vartemp[1], vnames[2] :vartemp[2], vnames[3] :vartemp[3]}
hs = pd.DataFrame(d,index=times)
The whole code is pasted below:
import pandas as pd
import datetime as dt
import xlrd
import numpy as np
import netCDF4
def excel_to_pydate(exceldate):
    datemode=0 # datemode: 0 for 1900-based, 1 for 1904-based
    pyear, pmonth, pday, phour, pminute, psecond = xlrd.xldate_as_tuple(exceldate, datemode)
    py_date = dt.datetime(pyear, pmonth, pday, phour, pminute, psecond)
    return(py_date)
def main():
    filename='HowardSprings_2010_L4.nc'
    #Define a list of variables names we want from the netcdf file
    vnames = ['xlDateTime', 'Fa', 'Fh' ,'Fg']
    # Open the NetCDF file
    nc = netCDF4.Dataset(filename)
    #Create some lists of size equal to length of vnames list.
    temp=list(xrange(len(vnames)))
    vartemp=list(xrange(len(vnames)))
    #Enumerate the list and assign each NetCDF variable to an element in the lists.
    # First get the netcdf variable object assign to temp
    # Then strip the data from that and add to temporary variable (vartemp)
    for index, variable in enumerate(vnames):
        temp[index]= nc.variables[variable]
        vartemp[index] = temp[index][:]
    # Now call the function to convert to datetime from excel. Assume datemode: 0
    times = [excel_to_pydate(elem) for elem in vartemp[0]]
    #Dont know why I cant just pass a list of variables i.e. [vartemp[0], vartemp[1], vartemp[2]]
    #But this is only thing that worked
    #Create Pandas dataframe using times as index
    d={vnames[0] :vartemp[0], vnames[1] :vartemp[1], vnames[2] :vartemp[2], vnames[3] :vartemp[3]}
    theDataFrame = pd.DataFrame(d,index=times)
    #Define missing data value and apply to DataFrame
    missing=-9999
    theDataFrame1=theDataFrame.replace({vnames[0] :missing, vnames[1] :missing, vnames[2] :missing, vnames[3] :missing},'NaN')
main()
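You could replace:
d = {vnames[0] :vartemp[0], ..., vnames[3]: vartemp[3]}
hs = pd.DataFrame(d, index=times)
with
hs = pd.DataFrame(dict(zip(vnames, vartemp)), index=times)
Passing the list of arrays directly makes each array a row rather than a column, which is why the frame comes out transposed; pairing each name with its array in a dict gives one column per variable, however many there are.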
Saying that, pandas can read HDF5 directly, so perhaps the same is true for netCDF (which is based on HDF5)...
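To show how the dict-based construction scales to any number of variables, here is a small sketch with synthetic arrays standing in for the netCDF data. The variable names come from the question; the values, the 30-minute index, and the -9999 marker placement are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the arrays read from the netCDF file.
vnames = ["Fa", "Fh", "Fg"]
n = 6
vartemp = [np.linspace(0.0, 1.0, n) + i for i in range(len(vnames))]
vartemp[1][2] = -9999.0  # pretend missing-value marker
times = pd.date_range("2010-01-01", periods=n, freq="30min")

# One column per variable, however many names are in the list.
hs = pd.DataFrame(dict(zip(vnames, vartemp)), index=times)

# Replace the marker with a real NaN rather than the string 'NaN'.
hs = hs.replace(-9999.0, np.nan)
print(hs)
```

Using np.nan instead of the string 'NaN' keeps the columns numeric, so methods such as mean() and dropna() treat the gaps as missing values.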