Save large numeric output to file natively in Julia 1.0.0 - file-io

I am trying to run a program on an HPC cluster. Unfortunately, I am unable to install external packages (e.g., JLD2) on the cluster. This is a temporary problem and should get fixed.
I don't want to wait around all that time, so I am wondering if there is any way to save large output (2-3 GB) in Julia without external dependencies. Most of the output is matrices of numbers. I was previously using JLD2, which stores data in HDF5 format.
Bonus question: Is there a workaround to this using shell commands, like piping the output and using awk/grep to save the data? (something like julia -p 12 main.jl | echo "file").

You could try the Serialization standard library.
To work with multiple variables, you can just store them sequentially:
x = rand(10)
y = "foo"
using Serialization
# write to file
open("data.out","w") do f
serialize(f, x)
serialize(f, y)
end
# load from file
open("data.out") do f
global x2, y2
x2 = deserialize(f)
y2 = deserialize(f)
end
Or you could put them in a Dict and just store that.

You could write the data as raw binary. Something along the lines of:
julia> x = rand(2,2);
julia> write("test.out", x)
julia> y = reshape(reinterpret(Float64, read("test.out")), 2,2)
julia> x == y
true
If it is just HDF5 that is missing, you could for example use NPZ.jl.

Related

using Dask to load many CSV files with different columns

I have many CSV files saved in AWS S3 with the same first set of columns and a lot of optional columns. I don't want to download them one by one and then use pd.concat to read them, since this takes a lot of time and the result has to fit into the computer's memory. Instead, I'm trying to use Dask to load and sum up all of these files, where the optional columns should be treated as zeros.
If all columns were the same I could use:
import dask.dataframe as dd
addr = "s3://SOME_BASE_ADDRESS/*.csv"
df = dd.read_csv(addr)
df.groupby(["index"]).sum().compute()
but it doesn't work with files that don't have the same number of columns, since Dask assumes it can use the first file's columns for all files:
File ".../lib/python3.7/site-packages/pandas/core/internals/managers.py", line 155, in set_axis
'values have {new} elements'.format(old=old_len, new=new_len))
ValueError: Length mismatch: Expected axis has 64 elements, new values have 62 elements
According to this thread I can either read all headers in advance (for example by writing them out as I produce and save all of the small CSVs) or use something like this:
df = dd.concat([dd.read_csv(f) for f in filelist])
I wonder if this solution is actually faster/better than just using pandas directly? In general, I'd like to know what the best (mainly fastest) way to tackle this issue is.
It might be a good idea to use delayed to standardize dataframes before converting them to a dask dataframe (whether this is optimal for your use case is difficult to judge).
import pandas as pd
import dask.dataframe as dd
from dask import delayed

list_files = [...]  # create a list of files inside the s3 bucket
list_cols_to_keep = ['col1', 'col2']

@delayed
def standard_csv(file_path):
    df = pd.read_csv(file_path)
    df = df[list_cols_to_keep]
    # add any other standardization routines, e.g. dtype conversion
    return df

ddf = dd.from_delayed([standard_csv(f) for f in list_files])
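The resulting dask dataframe can then be aggregated the same way as in the question (a short usage sketch; the "index" column name is taken from the question's own snippet and is assumed to be among list_cols_to_keep):
# group and sum across all standardized files, as in the original attempt
result = ddf.groupby(["index"]).sum().compute()
print(result.head())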
I ended up giving up on Dask since it was too slow, and instead used aws s3 sync to download the data and multiprocessing.Pool to read and concat the files:
import glob
import os

import pandas as pd
from tqdm import tqdm

# download:
def sync_outputs(out_path):
    local_dir_path = "/tmp/outputs/"
    safe_mkdir(os.path.dirname(local_dir_path))  # safe_mkdir: own helper that creates the dir if missing
    cmd = f'aws s3 sync {out_path} {local_dir_path} > /tmp/null'  # the last part is to avoid prints
    os.system(cmd)
    return local_dir_path

# concat:
def read_csv(path):
    return pd.read_csv(path, index_col=0)

def read_csvs_parallel(local_paths):
    from multiprocessing import Pool
    with Pool(os.cpu_count()) as p:
        csvs = list(tqdm(p.imap(read_csv, local_paths), desc='reading csvs', total=len(local_paths)))
    return csvs

# all together:
def concat_csvs_parallel(out_path):
    local_dir = sync_outputs(out_path)
    local_paths = glob.glob(os.path.join(local_dir, "*.csv"))  # list the synced files
    csvs = read_csvs_parallel(local_paths)
    df = pd.concat(csvs)
    return df
aws s3 sync downloaded about 1000 files (~1 KB each) in about 30 seconds, and reading them with multiprocessing (8 cores) took 3 seconds. This was much faster than also downloading the files with multiprocessing (almost 2 minutes for 1000 files).

Usage of spark.catalog.refreshTable(tablename) in S3

I want to write a CSV file after transforming my Spark data with a function. The Spark dataframe obtained after the transformation looks good, but when I want to write it to a CSV file, I get an error:
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
But I really don't understand how to use the spark.catalog.refreshTable(tablename) function. I tried to use it between the transformation and the file writing, but it said:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
So I don't know how to deal with it...
# Create the function to resize the images and extract the features with the MobileNetV2 model
def red_dim(width, height, nChannels, data):
    # Transform image data to a tensorflow-compatible format
    images = []
    for i in range(height.shape[0]):
        x = np.ndarray(
            shape=(height[i], width[i], nChannels[i]),
            dtype=np.uint8,
            buffer=data[i],
            strides=(width[i] * nChannels[i], nChannels[i], 1))
        images.append(preprocess_input(x))
    # Resize images to the size chosen for the model
    images = np.array(tf.image.resize(images, [IMAGE_SIZE, IMAGE_SIZE]))
    # Load the model
    model = load_model('models')
    # Predict features for the images
    preds = model.predict(images).reshape(len(width), 3 * 3 * 1280)
    # Return a pandas series with the list of features for all images
    return pd.Series(list(preds))

# Transform the function into a pandas udf
# This allows the function to be applied in multiple chunks
red_dim_udf = pandas_udf(red_dim, returnType=ArrayType(DoubleType()))

# 4 actions:
# apply the udf function defined just before
# cast the array of features to a string so it can be written in a csv
# select only the data that will be written in the csv
# write the data -> where the error occurs
results = df.withColumn("dim_red", red_dim_udf(col("image.width"), col("image.height"),
                                               col("image.nChannels"),
                                               col("image.data"))) \
            .withColumn("dim_red_string", lit(col("dim_red").cast("string"))) \
            .select("image.origin", 'dim_red_string') \
            .repartition(5).write.csv(S3dir + '/results' + today)
It's a well-known issue where the underlying source data gets updated while Spark is processing it.
I would suggest you checkpoint, i.e. move/copy the data to another directory before applying your transformations.
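A minimal sketch of what that could look like (the paths and the image-source read are assumptions, not from the original answer; Spark's built-in checkpointing is shown as an alternative to a manual copy):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Option 1: freeze the source by copying it to a staging location first,
# then build the dataframe from that snapshot instead of the live S3 prefix.
source_dir = "s3://my-bucket/images/"            # hypothetical input prefix
staging_dir = "s3://my-bucket/images_snapshot/"  # hypothetical staging prefix
spark.read.format("image").load(source_dir).write.mode("overwrite").parquet(staging_dir)
df = spark.read.parquet(staging_dir)

# Option 2: use Spark's built-in checkpointing to truncate the lineage
# back to the (possibly changing) source before further transformations.
spark.sparkContext.setCheckpointDir("s3://my-bucket/checkpoints/")  # hypothetical
df = df.checkpoint()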
I think I can close my question, as I found the answer.
If you have this type of error, it can also be because you have spaces in the S3 folder names used to build your DataFrame; Spark doesn't handle the space character in the folder name and thinks the folder doesn't exist anymore...
But thanks @Constantine for your help!

Convert date/time index of external dataset so that pandas would plot clearly

When a time series data set already uses an internal datetime dtype for its index, the index seems to plot cleanly, as here.
But when I already have data files with a date & time column in their own format, such as [2009-01-01T00:00], is there a way to convert this into an object that the plot can read? Currently my plot looks like the following.
Code:
dir = sorted(glob.glob("bsrn_txt_0100/*.txt"))
gen_raw = (pd.read_csv(file, sep='\t', encoding = "utf-8") for file in dir)
gen = pd.concat(gen_raw, ignore_index=True)
gen.drop(gen.columns[[1,2]], axis=1, inplace=True)
#gen['Date/Time'] = gen['Date/Time'][11:] -> caused an error, didn't work
filter = gen[gen['Date/Time'].str.endswith('00') | gen['Date/Time'].str.endswith('30')]
filter['rad_tot'] = filter['Direct radiation [W/m**2]'] + filter['Diffuse radiation [W/m**2]']
lis = np.arange(35040) # used the number of rows, checked by printing. This is for 2009-2010.
plt.xticks(lis, filter['Date/Time'])
plt.plot(lis, filter['rad_tot'], '.')
plt.title('test of generation 2009')
plt.xlabel('Date/Time')
plt.ylabel('radiation total [W/m**2]')
plt.show()
Another approach I had in mind was to use Plotly, but its main purpose seems to be feeding in data from the web. It would be best if I were familiar with all the modules and could try things for myself, but I am learning pandas and matplotlib as I go.
So I would like to ask whether anyone has experienced similar issues.
I think you need to set the labels to not visible in a loop:
ax = df.plot(...)
spacing = 10
visible = ax.xaxis.get_ticklabels()[::spacing]
for label in ax.xaxis.get_ticklabels():
    if label not in visible:
        label.set_visible(False)
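For the conversion the question itself asks about, here is a minimal sketch, assuming the column names from the question's code, an ISO-like timestamp format, and a placeholder file name:
import pandas as pd
import matplotlib.pyplot as plt

# Parse the text timestamps (e.g. "2009-01-01T00:00") into a real DatetimeIndex
# so pandas/matplotlib can place and format the x-axis ticks automatically.
gen = pd.read_csv("bsrn_txt_0100/some_file.txt", sep="\t", encoding="utf-8")
gen["Date/Time"] = pd.to_datetime(gen["Date/Time"], format="%Y-%m-%dT%H:%M")
gen = gen.set_index("Date/Time")

gen["rad_tot"] = gen["Direct radiation [W/m**2]"] + gen["Diffuse radiation [W/m**2]"]
ax = gen["rad_tot"].plot(style=".", title="test of generation 2009")
ax.set_ylabel("radiation total [W/m**2]")
plt.show()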

Writing a netcdf4 file is 6-times slower than writing a netcdf3_classic file and the file is 8-times as big?

I am using the netCDF4 library in Python and just came across the issue stated in the title. At first I was blaming groups for this, but it turns out that it is a difference between the NETCDF4 and NETCDF3_CLASSIC formats (edit: and it appears to be related to our Linux installation of the netcdf libraries).
In the program below, I am creating a simple time series netcdf file of the same data in two different ways: 1) as a NETCDF3_CLASSIC file, 2) as a flat NETCDF4 file (creating groups in the netcdf4 file doesn't make much of a difference). What I find with simple timing and the ls command is:
1) NETCDF3 1.3483 seconds 1922704 bytes
2) NETCDF4 flat 8.5920 seconds 15178689 bytes
It's exactly the same routine which creates 1) and 2), the only difference is the format argument in the netCDF4.Dataset method. Is this a bug or a feature?
Thanks, Martin
Edit: I have now found that this must have something to do with our local installation of the netcdf library on a Linux computer. When I use the program version below (trimmed down to the essentials) on my Windows laptop, I get similar file sizes, and NETCDF4 is actually almost twice as fast as NETCDF3! When I run the same program on our Linux system, I can reproduce the old results. Thus, this question is apparently not related to Python.
Sorry for the confusion.
New code:
import datetime as dt
import numpy as np
import netCDF4 as nc
def write_to_netcdf_single(filename, data, series_info, format='NETCDF4'):
    vname = 'testvar'
    t0 = dt.datetime.now()
    with nc.Dataset(filename, "w", format=format) as f:
        # define dimensions and variables
        dim = f.createDimension('time', None)
        time = f.createVariable('time', 'f8', ('time',))
        time.units = "days since 1900-01-01 00:00:00"
        time.calendar = "gregorian"
        param = f.createVariable(vname, 'f4', ('time',))
        param.units = "kg"
        # define global attributes
        for k, v in sorted(series_info.items()):
            setattr(f, k, v)
        # store data values
        time[:] = nc.date2num(data.time, units=time.units, calendar=time.calendar)
        param[:] = data.value
    t1 = dt.datetime.now()
    print "Writing file %s took %10.4f seconds." % (filename, (t1-t0).total_seconds())

if __name__ == "__main__":
    # create an array with 1 million values and datetime instances
    time = np.array([dt.datetime(2000,1,1) + dt.timedelta(hours=v) for v in range(1000000)])
    values = np.arange(0., 1000000.)
    data = np.array(zip(time, values), dtype=[('time', dt.datetime), ('value', 'f4')])
    data = data.view(np.recarray)
    series_info = {'attr1': 'dummy', 'attr2': 'dummy2'}
    filename = "testnc4.nc"
    write_to_netcdf_single(filename, data, series_info)
    filename = "testnc3.nc"
    write_to_netcdf_single(filename, data, series_info, format='NETCDF3_CLASSIC')
[old code deleted because it had too much unnecessary stuff]
The two file formats do have different characteristics. The classic file format was dead simple (well, simpler than the new format: http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/Classic-Format-Spec.html#Classic-Format-Spec ): a small header described all the data, and then (since you have 3 record variables) the 3 record variables get interleaved.
Nice and simple, but you only get one UNLIMITED dimension, there is no facility for parallel I/O, and no way to organize data into groups.
Enter the new HDF5-based back-end, introduced in NetCDF-4.
In exchange for new features, more flexibility, and fewer restrictions on file and variable size, you have to pay a bit of a price. For large datasets, the costs are amortized, but your variables are (relatively speaking) kind of small.
I think the file size discrepancy is exacerbated by your use of record variables. In order to support arrays that can grow in N dimensions, there is more metadata associated with each record entry in the NetCDF-4 format.
HDF5 also uses the "reader makes right" convention. Classic NetCDF says "all data will be big-endian", but HDF5 encodes a bit of information about how the data was stored. If the reader process runs on the same architecture as the writer process (which is common, as it would be on your laptop or when restarting from a simulation checkpoint), then no conversion needs to be performed.
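One common mitigation for the per-record overhead (not part of the original answer, just a sketch using the python netCDF4 API) is to give the record variables an explicit chunk size when creating the NETCDF4 file:
import netCDF4 as nc

# Sketch: the same variables as in the question's script, but with explicit
# chunking so the HDF5 layer stores large chunks instead of many tiny records.
with nc.Dataset("testnc4_chunked.nc", "w", format="NETCDF4") as f:
    f.createDimension("time", None)  # still unlimited
    time = f.createVariable("time", "f8", ("time",), chunksizes=(65536,))
    time.units = "days since 1900-01-01 00:00:00"
    time.calendar = "gregorian"
    param = f.createVariable("testvar", "f4", ("time",), chunksizes=(65536,))
    param.units = "kg"
    # time[:] = ...; param[:] = ...  (fill exactly as in the question's script)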
This question is unlikely to help others, as it appears to be a site-specific problem related to the interplay between the netcdf libraries and the Python netCDF4 module.

Complementary Filter Code Not functioning

I've been scratching my head too long.
The data is coming from a 3D accelerometer and a 3D gyro. I am using a complementary filter to control drift.
I have it working in Excel but can't seem to get this Python code to do the same thing:
r1_angle_cfx = np.zeros(len(r1_angle_ax))
r1_angle_cfx[0] = r1_angle_ax[0]
for i in xrange(len(r1_angle_ax)-1):
    j = i + 1
    r1_angle_cfx[j] = 0.98 * (r1_angle_cfx[i] + r1_alpha_x[j]*fs) + (0.02 * r1_angle_ax[j])  # complementary filter
In Excel (correct) I get:
In Python (incorrect) I get:
What is going wrong? And is there a better way to do this in Python?
Thanks,
Scott
EDIT: Link to data files -
sample data
1. The csv file contains the accelerometer and gyro data that is entered into the filter formula, as well as the values that were calculated in Excel.
2. The excel file contains all raw data (steps not mentioned above, but I have triple-checked that they are equivalent up to the point of being entered into the filter formula).
EDIT 2: Update - it turns out my code works; it was sloppy debugging. fs should be fs = 0.01. In my code I had fs = 1/100, which ends up being 0 in the script.
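To spell out the pitfall (a minimal illustration, assuming the Python 2 division semantics the original script runs under):
# With two integers, / performs integer division in Python 2, so the gyro
# term is silently multiplied by zero instead of the intended 0.01.
fs = 1 / 100      # == 0  (integer division)
fs = 1.0 / 100    # == 0.01, the value the filter actually needs
# Alternatively: from __future__ import division, or simply fs = 0.01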
Your Python code looks pretty reasonable. Without example data, I can't do much more than say that.
But I can guess. I looked up "complementary filters" and found a link explaining them:
https://sites.google.com/site/myimuestimationexperience/filters/complementary-filter
This link gives an example equation that is very similar to yours:
angle = (1-alpha)*(angle + gyro * dt) + (alpha)*(acc)
You have fs where this has dt, and dt is computed as 1/sampling_frequency. If fs is the sampling frequency, maybe you should try inverting it?
EDIT: Okay, now that you posted the data, I played around with this. Here is my program that gets a correct result.
Your code looks basically correct, so I think you must have made a mistake in your code that collected the values. I'm not quite sure because your variable names confuse me.
I used a namedtuple and for the names, I used the column headers from the CSV file (with spaces and periods removed to make a valid Python identifier).
import collections as coll
import csv
import matplotlib.pyplot as plt
import numpy as np
import sys
fs = 100.0
dt = 1.0/fs
alpha = 0.02
Sample = coll.namedtuple("Sample",
"accZ accY accX rotZ rotY rotX r acc_angZ acc_angY acc_angX cfZ cfY cfX")
def samples_from_file(fname):
with open(fname) as f:
next(f) # discard header row
csv_reader = csv.reader(f, dialect='excel')
for i, row in enumerate(csv_reader, 1):
try:
values = [float(x) for x in row]
yield Sample(*values)
except Exception:
lst = list(row)
print("Bad line %d: len %d '%s'" % (i, len(lst), str(lst)))
samples = list(samples_from_file("data.csv"))
cfx = np.zeros(len(samples))
# Excel formula: =R12
cfx[0] = samples[0].acc_angX
# Excel formula: =0.98*(U12+N13*0.01)+0.02*R13
# Excel: U is cfX N is rotX R is acc_angX
for i, s in enumerate(samples[1:], 1):
    cfx[i] = (1.0 - alpha) * (cfx[i-1] + s.rotX*dt) + (alpha * s.acc_angX)
check_line = [s.cfX - cf for s, cf in zip(samples, cfx)]
plt.figure(1)
plt.plot(check_line)
plt.plot(cfx)
plt.show()
check_line is the difference between the saved cfX value from the CSV file and the newly computed cfx value. As you can see in the plot, this is a straight line at 0, so my calculation agrees quite well with yours.
So I guess the mapping of names is:
your_name       my_name
________________________
r1_angle_cfx    cfx
r1_alpha_x      rotX
r1_angle_ax     acc_angX