Speed up the code in NumPy - numpy

I have a two-dimensional NumPy array (raster_data) with a raster size of 1 million * 1 million.
I want to classify that raster into two classes as follows:
class_A = np.where((raster_data >= 5.23) & (raster_data < 8.55),raster_data,np.nan)
class_B = np.where((raster_data >= 8.55) & (raster_data < 10.0),raster_data,np.nan)
However, due to the extremely large size of the data I get a MemoryError.
How can I still classify the raster as intended?
I have already tried with 16 GB RAM and 64-bit NumPy.

You can try boolean indexing and in-place operations to conserve memory:
>>> class_A = raster_data.copy()
>>> class_B = raster_data.copy()
>>> mask = raster_data < 5.23
>>> mask |= raster_data >= 8.55
>>> class_A[mask] = np.nan
>>> mask = raster_data < 8.55
>>> mask |= raster_data >= 10
>>> class_B[mask] = np.nan
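One more memory saver worth mentioning (an addition, not part of the answer above): if raster_data is float64, downcasting to float32 before making the copies halves every array involved, and the thresholds only need a few significant digits anyway. A minimal sketch:
import numpy as np

raster32 = raster_data.astype(np.float32, copy=False)  # no copy if it is already float32

class_A = raster32.copy()
mask = raster32 < 5.23
mask |= raster32 >= 8.55
class_A[mask] = np.nan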

This is how you could do it with PyTables, although I hope you're patient and have lots of disk space.
import tables as tb
import numpy as np
import time

f = tb.openFile('humongusFile.h5', 'w')
n = 100000
x = f.createCArray(f.root, 'x', tb.Float16Atom(), (n, n), filters=tb.Filters(5, 'blosc'))

t0 = time.time()
# fill the on-disk array row by row with test data
for i in range(n):
    x[i] = np.random.random_sample(n) * 10
x.flush()  # dump data to disk

t1 = time.time()
print(t1 - t0)
print("Done creating test data")

y1 = f.createCArray(f.root, 'y1', tb.Float16Atom(), (n, n), filters=tb.Filters(5, 'blosc'))
y2 = f.createCArray(f.root, 'y2', tb.Float16Atom(), (n, n), filters=tb.Filters(5, 'blosc'))
t2 = time.time()
print(t2 - t1)
print("Done creating output array")

# evaluate the two classifications out of core, writing straight to disk
expr = tb.Expr("where((x >= 5.23) & (x < 8.55), x, 0)")
expr.setOutput(y1)
expr2 = tb.Expr("where((x >= 8.55) & (x < 10.0), x, 0)")
expr2.setOutput(y2)

t3 = time.time()
print(t3 - t2)
print("Starting evaluating first output")
expr.eval()
print("Starting evaluating second output")
expr2.eval()
print("Done")
t4 = time.time()
print(t4 - t3)

If your dataset is indeed as insanely large as you specified, you will need some form of on-disk storage and out-of-core computation.
It all depends on what exactly you want to do with that mask, but take a look at PyTables; it allows for the efficient storage and manipulation of such large arrays.
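If you would rather stay within NumPy, a memory-mapped array is another way to get on-disk storage. Below is a rough sketch (the file names are made up, and at this size each file runs to terabytes) that classifies the raster block by block without ever holding it all in RAM:
import numpy as np

n = 1000000        # raster is n x n
rows = 1000        # rows processed per block

# 'raster.dat' is assumed to already contain the float32 raster on disk
raster = np.memmap('raster.dat', dtype=np.float32, mode='r', shape=(n, n))
class_A = np.memmap('class_A.dat', dtype=np.float32, mode='w+', shape=(n, n))

for start in range(0, n, rows):
    block = raster[start:start + rows]
    class_A[start:start + rows] = np.where((block >= 5.23) & (block < 8.55), block, np.nan)
class_A.flush()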

Related

Csv file search speedup

I need to build a relief profile graph from coordinates. I have a CSV file with 12,000,000 lines; searching it for a single height takes about 2-2.5 seconds. I rewrote the CSV to Parquet, which saved some time: finding one height now takes about 1-1.7 seconds. However, I need to build a profile from 500-2000 values, which makes the total time very long. The CSV file may also grow in the future, which will slow the process down even more. So my question is: can the lookup time be reduced somehow?
Code example:
import dask.dataframe as dk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time

filename = 'n46_e032_1arc_v3.csv'
df = dk.read_csv(filename)
df.to_parquet('n46_e032_1arc_v3_parquet')

Latitude1y, Longitude1x = 46.6276, 32.5942
Latitude2y, Longitude2x = 46.6451, 32.6781
sec, steps, k = 0.00027778, 1, 11.73
Latitude, Longitude = [Latitude1y], [Longitude1x]
sin, cos = Latitude2y - Latitude1y, Longitude2x - Longitude1x
y, x = Latitude1y, Longitude1x
# walk from point 1 towards point 2 in fixed steps
while Latitude[-1] < Latitude2y and Longitude[-1] < Longitude2x:
    y, x, steps = y + sec * k * sin, x + sec * k * cos, steps + 1
    Latitude.append(y)
    Longitude.append(x)

time_start = time.time()
long, elevation_data = [], []
df2 = dk.read_parquet('n46_e032_1arc_v3_parquet')
for i in range(steps + 1):
    # select the points falling into the current lat/lon cell
    elevation_line = df2[(Longitude[i] <= df2['x']) & (df2['x'] <= Longitude[i] + sec) &
                         (Latitude[i] <= df2['y']) & (df2['y'] <= Latitude[i] + sec)].compute()
    elevation = np.asarray(elevation_line.z.tolist())
    if elevation[-1] < 0:
        elevation_data.append(0)
    else:
        elevation_data.append(elevation[-1])
    long.append(30 * i)
plt.bar(long, elevation_data, width=30)
plt.show()
print(time.time() - time_start)
Here's one way to solve this problem using KD trees. A KD tree is a data structure for doing fast nearest-neighbor searches.
import scipy.spatial

# assumes df is an in-memory pandas DataFrame with columns x, y, z
# (e.g. df = dk.read_parquet('n46_e032_1arc_v3_parquet').compute())
tree = scipy.spatial.KDTree(df[['x', 'y']].values)
elevations = df['z'].values

long, elevation_data = [], []
for i in range(steps):
    lon, lat = Longitude[i], Latitude[i]
    # nearest grid point to the query coordinate
    dist, idx = tree.query([lon, lat])
    elevation = elevations[idx]
    if elevation < 0:
        elevation = 0
    elevation_data.append(elevation)
    long.append(30 * i)
Note: if you can make assumptions about the data, like "all of the points in the CSV are equally spaced," faster algorithms are possible.
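For example, if the heights really do lie on a regular grid with spacing sec, the row and column can be computed directly from the coordinates instead of searched for. A sketch, where the grid origin x0, y0 and the 2-D height array grid_z are hypothetical names you would build once from the file:
import numpy as np

x0, y0 = 32.0, 46.0   # grid origin (assumed known)
col = np.round((np.array(Longitude) - x0) / sec).astype(int)
row = np.round((np.array(Latitude) - y0) / sec).astype(int)
elevation_data = np.clip(grid_z[row, col], 0, None)   # clamp negative heights to 0, as in the question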
It looks like your data might be on a regular grid. If (and only if) every combination of x and y exists in your data, then it probably makes sense to turn this into a labeled 2D array of points, after which querying the correct position will be very fast.
For this, I'll use xarray, which is essentially pandas for N-dimensional data, and integrates well with dask:
import xarray as xr

# bring the dataframe into memory
df = dk.read_parquet('n46_e032_1arc_v3_parquet').compute()
da = df.set_index(["y", "x"]).z.to_xarray()

# now you can query the nearest points:
desired_lats = xr.DataArray([46.6276, 46.6451], dims=["point"])
desired_lons = xr.DataArray([32.5942, 32.6781], dims=["point"])
subset = da.sel(y=desired_lats, x=desired_lons, method="nearest")

# if you'd like, you can return to pandas:
subset_s = subset.to_series()

# you could do this only once, and save the reshaped array as a zarr store:
ds = da.to_dataset(name="elevation")
ds.to_zarr("n46_e032_1arc_v3.zarr")
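On later runs you can then skip the reshaping step and open the store directly (a small usage note, not part of the original snippet):
import xarray as xr

da = xr.open_zarr("n46_e032_1arc_v3.zarr")["elevation"]
subset = da.sel(y=desired_lats, x=desired_lons, method="nearest")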

Make numpy array which is hash of string array

I have a NumPy array:
A = np.array(['abcd', 'bcde', 'cdef'])
I need a hash array B of A, computed element-wise with the function
B[i] = ord(A[i][1]) * 256 + ord(A[i][2])
so that
B = np.array([ord('b') * 256 + ord('c'), ord('c') * 256 + ord('d'), ord('d') * 256 + ord('e')])
How can I do it?
Based on the question, I assume the strings are ASCII and that every string has at least 3 characters.
You can start by converting the strings to ASCII bytes for the sake of performance and simplicity (this creates one temporary array). Because NumPy strings are stored contiguously in memory, you can then view that buffer as one big array of integer character codes without any copy, and use strides to compute all the hashes in a vectorized way. Here is how:
ascii = A.astype('S')         # convert to fixed-width ASCII bytes
buff = ascii.view(np.uint8)   # view the whole buffer as character codes (no copy)
# pick the 2nd and 3rd character of every string via strides and combine them;
# cast up to uint16 first so the *256 cannot overflow the uint8 view
result = buff[1::ascii.itemsize].astype(np.uint16) * 256 + buff[2::ascii.itemsize]
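For the example array A above, this gives the same values as the element-wise formula:
print(result)   # [25187 25444 25701]  (== ord('b')*256 + ord('c'), etc.)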
Congratulations! Roughly a four-fold speed increase:
import time
import numpy as np

Iter = 1000000
A = np.array(['abcd', 'bcde', 'cdef', 'defg'] * Iter)

# baseline: Python-level loop
Ti = time.time()
B = np.zeros(A.size)
for i in range(A.size):
    B[i] = ord(A[i][1]) * 256 + ord(A[i][2])
DT1 = time.time() - Ti

# vectorized version
Ti = time.time()
ascii = A.astype('S')
buff = ascii.view(np.uint8)
result = buff[1::ascii.itemsize].astype(np.uint16) * 256 + buff[2::ascii.itemsize]
DT2 = time.time() - Ti

print("Equal = %s" % np.array_equal(B, result))
print("DT1=%7.2f Sec, DT2=%7.2f Sec, DT1/DT2=%6.2f" % (DT1, DT2, DT1/DT2))
Output:
Equal = True
DT1= 3.37 Sec, DT2= 0.82 Sec, DT1/DT2= 4.11

call functions on a specific level of a numpy ndarray without for loops

Suppose a NumPy ndarray arr has shape (100, 100, 5, 5).
The following code works:
result = np.zeros((arr.shape[0], arr.shape[1], 10))
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        v = arr[i, j].flatten()
        hist, bi = np.histogram(v, bins=10, range=(0, 3))
        result[i, j] = hist
but it's slow. Is there a more efficient way to write this, e.g. avoiding the for loops?
Hmm, I thought apply_along_axis would help, but it doesn't seem to make much of a difference, at least at the problem sizes of interest to you. Maybe there's overhead in myhist.
See the code below.
import numpy as np
import time

low = 0.0
high = 3.0
bins = 10
arrshp = (100, 100, 5, 5)

def myhist(xx):
    out = np.histogram(xx, bins=bins, range=(low, high))
    return out[0]

arr = np.random.uniform(low, high, arrshp)

# "fast" version: apply_along_axis over the flattened trailing axes
time1 = time.time()
arr2 = arr.reshape(arrshp[0], arrshp[1], -1)
out_fast = np.apply_along_axis(myhist, -1, arr2)
time2 = time.time()
print('time (secs) fast = ', time2 - time1)

# slow version: explicit double loop
time3 = time.time()
out_slow = np.zeros((arr.shape[0], arr.shape[1], bins), dtype='float64')
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        v = arr[i, j].flatten()
        _hh = np.histogram(v, bins=bins, range=(low, high))
        out_slow[i, j, :] = _hh[0]
time4 = time.time()

print('norm diff = ', np.linalg.norm(out_fast - out_slow))
print('time (secs) slow = ', time4 - time3)
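A genuinely vectorized alternative (not from the original answer, and not apply_along_axis, which still loops in Python) is to bin every value at once with searchsorted and count per cell with a single bincount. A sketch under the same low/high/bins settings; it assumes every value lies inside [low, high], which holds for the uniform test data above:
arr2 = arr.reshape(arrshp[0] * arrshp[1], -1)   # (10000, 25)
edges = np.linspace(low, high, bins + 1)
# bin index of every value; clip so the right edge lands in the last bin, as np.histogram does
idx = np.clip(np.searchsorted(edges, arr2, side='right') - 1, 0, bins - 1)
# offset each row so its counts land in its own slot of one flat bincount
offsets = np.arange(arr2.shape[0])[:, None] * bins
counts = np.bincount((idx + offsets).ravel(), minlength=arr2.shape[0] * bins)
out_vec = counts.reshape(arrshp[0], arrshp[1], bins).astype('float64')
# np.array_equal(out_vec, out_slow) should be True for data within the range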

Legacy code does not work with new large csv file

I have legacy code written for pandas.
The data has now become very large (CSV files of roughly 7-8 GB, and they will grow), and read_csv struggles with files of this size.
Could you suggest the best way to keep working with large CSV files without changing the legacy code? I thought about switching to Spark, but it seems I would have to change a lot of code.
Many thanks.
Have you tried reading the file in chunks? Defining the column dtypes beforehand can also help performance:
chunksize = 1000000
# dtypes is a mapping you define up front, e.g. {'id': 'int32', 'value': 'float32'}
chunks = pd.read_csv(filepath, dtype=dtypes, chunksize=chunksize)
df = pd.concat((chunk for chunk in chunks), ignore_index=True)
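If even the concatenated frame is too large for memory, a variant of the same idea is to process each chunk as it arrives and keep only the results you need. A rough sketch, where the aggregation and the column name 'value' are hypothetical stand-ins for your own logic:
import pandas as pd

totals = []
for chunk in pd.read_csv(filepath, dtype=dtypes, chunksize=chunksize):
    # 'value' is a hypothetical numeric column; replace with whatever your legacy code computes
    totals.append(chunk['value'].sum())
grand_total = sum(totals)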
EDIT: Another trick is to reduce the dataframe's memory usage after loading. This is from a Kaggle kernel
def reduce_mem_usage(props):
    start_mem_usg = props.memory_usage().sum() / 1024**2
    print("Memory usage of properties dataframe is :", start_mem_usg, " MB")
    NAlist = []  # Keeps track of columns that have missing values filled in.
    for col in props.columns:
        if props[col].dtype != object:  # Exclude strings
            # Print current column type
            print("******************************")
            print("Column: ", col)
            print("dtype before: ", props[col].dtype)

            # make variables for Int, max and min
            IsInt = False
            mx = props[col].max()
            mn = props[col].min()

            # Integer does not support NA, therefore, NA needs to be filled
            if not np.isfinite(props[col]).all():
                NAlist.append(col)
                props[col].fillna(mn - 1, inplace=True)

            # test if column can be converted to an integer
            asint = props[col].fillna(0).astype(np.int64)
            result = (props[col] - asint)
            result = result.sum()
            if result > -0.01 and result < 0.01:
                IsInt = True

            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        props[col] = props[col].astype(np.uint8)
                    elif mx < 65535:
                        props[col] = props[col].astype(np.uint16)
                    elif mx < 4294967295:
                        props[col] = props[col].astype(np.uint32)
                    else:
                        props[col] = props[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        props[col] = props[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        props[col] = props[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        props[col] = props[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        props[col] = props[col].astype(np.int64)
            # Make float datatypes 32 bit
            else:
                props[col] = props[col].astype(np.float32)

            # Print new column type
            print("dtype after: ", props[col].dtype)
            print("******************************")

    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = props.memory_usage().sum() / 1024**2
    print("Memory usage is: ", mem_usg, " MB")
    print("This is ", 100 * mem_usg / start_mem_usg, "% of the initial size")
    return props, NAlist
You can simply apply the above function to your dataframe, and it will help especially in cases where your data has many numeric columns.
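For example, right after loading the CSV:
df, na_columns = reduce_mem_usage(df)
print(na_columns)  # columns whose missing values were filled in during the downcast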

how to use conda accelerate / benchmarks?

I'm attempting to use Conda Accelerate to speed up some data preprocessing, but initial benchmarks indicate that either I'm not using it correctly or it has no effect on FFT and linear-algebra execution times in NumPy and librosa. Re-reading the literature: does this mean I'm supposed to decorate and recode every ndarray operation, as in the batch-matmul example for NumbaPro? I had assumed I could simply install it and NumPy would get faster, but this doesn't appear to be the case.
Benchmarks and code are below. I've installed Accelerate via conda install accelerate and also imported it for good measure.
Thanks!
Result - negligible difference before and after conda install accelerate
Total time was 25.356
Total load time was 1.6743
Total math time was 22.1599
Total save time was 1.5139
Total stft math time was 12.9219
Total other numpy math time was 9.1886
Relevant code:
loads, maths, saves = [], [], []
stfts, nps = [], []
# now we have a dict of all source files grouped by voice
for i in range(30):
    v0_fn = v0_list[i]
    v1_fn = v1_list[i]
    tl0 = time.time()
    # Process v0 & v1 file
    v0_fn = signal_dir + v0_fn
    v0, fs_s = librosa.load(v0_fn, sr=None)
    v1_fn = signal_dir + v1_fn
    v1, fs_s = librosa.load(v1_fn, sr=None)
    tl1 = time.time()
    loads.append((tl1 - tl0))
    mix = v0 + v1
    # Capture the magnitude and phase of signal and signal + noise
    tm0 = time.time()
    v0_stft = librosa.stft(v0, int(frame_size*fs), int(step_size*fs)).transpose()
    tm1 = time.time()
    v0_mag = (v0_stft.real**2 + v0_stft.imag**2)**0.5
    v0_pha = np.arctan2(v0_stft.imag, v0_stft.real)
    v0_rtheta = np.stack((v0_mag, v0_pha), axis=0)
    tm2 = time.time()
    v1_stft = librosa.stft(v1, int(frame_size*fs), int(step_size*fs)).transpose()
    tm3 = time.time()
    v1_mag = (v1_stft.real**2 + v1_stft.imag**2)**0.5
    v1_pha = np.arctan2(v1_stft.imag, v1_stft.real)
    v1_rtheta = np.stack((v1_mag, v1_pha), axis=0)
    tm4 = time.time()
    mix_stft = librosa.stft(mix, int(frame_size*fs), int(step_size*fs)).transpose()
    tm5 = time.time()
    mix_mag = (mix_stft.real**2 + mix_stft.imag**2)**0.5
    mix_pha = np.arctan2(mix_stft.imag, mix_stft.real)
    mix_rtheta = np.stack((mix_mag, mix_pha), axis=0)
    tm6 = time.time()
    stfts += [tm1-tm0, tm3-tm2, tm5-tm4]
    nps += [tm2-tm1, tm4-tm3, tm6-tm5]
    data['sig_rtheta'] = v0_rtheta
    data['noi_rtheta'] = v1_rtheta
    data['mix_rtheta'] = mix_rtheta
    tl2 = time.time()
    maths.append(tl2 - tl1)
    with open(write_name, 'w') as f:
        cPickle.dump(all_info, f, protocol=-1)
    tl3 = time.time()
    saves.append(tl3 - tl2)
t1 = time.time()
print 'Total time was %.3f' % (t1-t0)
print 'Total load time was %.4f' % np.sum(loads)
print 'Total math time was %.4f' % np.sum(maths)
print 'Total save time was %.4f' % np.sum(saves)
print 'Total stft math was %.4f' % np.sum(stfts)
print 'Total other numpy math time was %.4f' % np.sum(nps)