Related
I need to build a relief profile graph by coordinates, I have a csv file with 12,000,000 lines. searching through a csv file of the same height takes about 2 - 2.5 seconds. I rewrote the csv to parquet and it helped me save some time, it takes about 1.7 - 1 second to find one height. However, I need to build a profile for 500 - 2000 values, which makes the time very long. In the future, you may have to increase the base of the csv file, which will slow down this process even more. In this regard, my question is, is it possible to somehow reduce the processing time of values?
Code example:
import dask.dataframe as dk
import numpy as np
import pandas as pd
import time
filename = 'n46_e032_1arc_v3.csv'
df = dk.read_csv(filename)
df.to_parquet('n46_e032_1arc_v3_parquet')
Latitude1y, Longitude1x = 46.6276, 32.5942
Latitude2y, Longitude2x = 46.6451, 32.6781
sec, steps, k = 0.00027778, 1, 11.73
Latitude, Longitude = [Latitude1y], [Longitude1x]
sin, cos = Latitude2y - Latitude1y, Longitude2x - Longitude1x
y, x = Latitude1y, Longitude1x
while Latitude[-1] < Latitude2y and Longitude[-1] < Longitude2x:
y, x, steps = y + sec * k * sin, x + sec * k * cos, steps + 1
Latitude.append(y)
Longitude.append(x)
time_start = time.time()
long, elevation_data = [], []
df2 = dk.read_parquet('n46_e032_1arc_v3_parquet')
for i in range(steps + 1):
elevation_line = df2[(Longitude[i] <= df2['x']) & (df2['x'] <= Longitude[i] + sec) &
(Latitude[i] <= df2['y']) & (df2['y'] <= Latitude[i] + sec)].compute()
elevation = np.asarray(elevation_line.z.tolist())
if elevation[-1] < 0:
elevation_data.append(0)
else:
elevation_data.append(elevation[-1])
long.append(30 * i)
plt.bar(long, elevation_data, width = 30)
plt.show()
print(time.time() - time_start)
Here's one way to solve this problem using KD trees. A KD tree is a data structure for doing fast nearest-neighbor searches.
import scipy.spatial
tree = scipy.spatial.KDTree(df[['x', 'y']].values)
elevations = df['z'].values
long, elevation_data = [], []
for i in range(steps):
lon, lat = Longitude[i], Latitude[i]
dist, idx = tree.query([lon, lat])
elevation = elevations[idx]
if elevation < 0:
elevation = 0
elevation_data.append(elevation)
long.append(30 * i)
Note: if you can make assumptions about the data, like "all of the points in the CSV are equally spaced," faster algorithms are possible.
It looks like your data might be on a regular grid. If (and only if) every combination of x and y exist in your data, then it probably makes sense to turn this into a labeled 2D array of points, after which querying the correct position will be very fast.
For this, I'll use xarray, which is essentially pandas for N-dimensional data, and integrates well with dask:
# bring the dataframe into memory
df = dk.read('n46_e032_1arc_v3_parquet').compute()
da = df.set_index(["y", "x"]).z.to_xarray()
# now you can query the nearest points:
desired_lats = xr.DataArray([46.6276, 46.6451], dims=["point"])
desired_lons = xr.DataArray([32.5942, 32.6781], dims=["point"])
subset = da.sel(y=desired_lats, x=desired_lons, method="nearest")
# if you'd like, you can return to pandas:
subset_s = subset.to_series()
# you could do this only once, and save the reshaped array as a zarr store:
ds = da.to_dataset(name="elevation")
ds.to_zarr("n46_e032_1arc_v3.zarr")
I'm trying to implement a sliding window method for a genomics dataset that I have, over a fairly long range (upwards of 50k nucleotide's). My approach so far works fine, however is fairly slow (taking several seconds per range, and several minutes per range at intervals >150k bp). Here is my code so far:
import numpy as np
VectorizedRange = np.arange(Start, End)#Start, End genomic flags on the reference genome
SlidingWindow = np.lib.stride_tricks.sliding_window_view(VectorizedRange, 100)#100 = the window size
GroupedDictFrame = pd.DataFrame({"Bins":GenomeRange})
GroupedDictFrame["ReadCov"] = 0
GroupedDictFrame["ReadSeq"] = [list() for _ in range(len(GroupedDictFrame.index.values))]
GroupedDictFrame.set_index(keys=["Bins"], inplace=True, drop=True)
def Appender(Start, End, Width, Seq):
AvgCov = 0
SeqList = []
if End <= Window[-1]:
AvgCov += 1
SeqList.append(Seq)
elif End > Window[-1]:
AvgCov += (Window[-1] - Start)/Width
SeqList.append(Seq[0:(Window[-1] - Start)])
GroupedDictFrame.loc[Window[0], "ReadCov"] += AvgCov
GroupedDictFrame.loc[Window[0], "ReadSeq"] = SeqList
for Window in SlidingWindow:
SubsetBAM = BAMFrame[(
(BAMFrame["start_coord"]>=Window[0])&
(BAMFrame["start_coord"]<=Window[-1])
)].reset_index(drop=True)
SubsetBAM.apply(
lambda x: Appender(x.start_coord,
x.end_coord,
x.width_lis,
x.seq_lis), axis=1
)
I think my vectorization isn't the best, any suggestions for speeding this up?
So I think I figured it out on my own, I'll add my solution in case anyone else faces a similar problem.
Essentially, I stopped subsetting my dataframe containing the small DNA read fragments in the for loop, and did one subset before the loop and converted it to a numpy array.
I removed my function and used numpy.where to do all my logic.
import numpy as np
VectorizedRange = np.arange(Start, End)
SlidingWindow = np.lib.stride_tricks.sliding_window_view(VectorizedRange, 100)
GroupedDictFrame = pd.DataFrame({"Bins":GenomeRange})
GroupedDictFrame["ReadCov"] = 0
GroupedDictFrame["ReadSeq"] = [list() for _ in range(len(GroupedDictFrame.index.values))]
GroupedDictFrame.set_index(keys=["Bins"], inplace=True, drop=True)
CoordArray = BAMFrame.loc[:, "start_coord":"end_coord"].to_numpy()
for Window in SlidingWindow:
ReadCovIn = np.where(((CoordArray[:,1] <= Window[-1]) & (CoordArray[:,0] >= Window[0])), 1, 0)
ReadCovOut = np.where(((CoordArray[:,1] > Window[-1]) & ((CoordArray[:,0] >= Window[0]) & (CoordArray[:,0] < Window[-1]))),
(Window[-1] - CoordArray[:,0])/(CoordArray[:,1] - CoordArray[:,0]), 0)
GroupedDictFrame.loc[Window[0], "ReadCov"] += np.sum((np.sum(ReadCovIn), np.sum(ReadCovOut)))
I've gotten it down to ~1 second per gene region which is typically about 50kb (so that would mean the SlidingWindow has a shape of (49900,100)), which is pretty good I think!
Does anyone know how to split a dataset created by the dataset API (tf.data.Dataset) in Tensorflow into Test and Train?
Assuming you have all_dataset variable of tf.data.Dataset type:
test_dataset = all_dataset.take(1000)
train_dataset = all_dataset.skip(1000)
Test dataset now has first 1000 elements and the rest goes for training.
You may use Dataset.take() and Dataset.skip():
train_size = int(0.7 * DATASET_SIZE)
val_size = int(0.15 * DATASET_SIZE)
test_size = int(0.15 * DATASET_SIZE)
full_dataset = tf.data.TFRecordDataset(FLAGS.input_file)
full_dataset = full_dataset.shuffle()
train_dataset = full_dataset.take(train_size)
test_dataset = full_dataset.skip(train_size)
val_dataset = test_dataset.skip(val_size)
test_dataset = test_dataset.take(test_size)
For more generality, I gave an example using a 70/15/15 train/val/test split but if you don't need a test or a val set, just ignore the last 2 lines.
Take:
Creates a Dataset with at most count elements from this dataset.
Skip:
Creates a Dataset that skips count elements from this dataset.
You may also want to look into Dataset.shard():
Creates a Dataset that includes only 1/num_shards of this dataset.
Disclaimer I stumbled upon this question after answering this one so I thought I'd spread the love
Most of the answers here use take() and skip(), which requires knowing the size of your dataset before hand. This isn't always possible, or is difficult/intensive to ascertain.
Instead what you can do is to essentially slice the dataset up so that 1 every N records becomes a validation record.
To accomplish this, lets start with a simple dataset of 0-9:
dataset = tf.data.Dataset.range(10)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Now for our example, we're going to slice it so that we have a 3/1 train/validation split. Meaning 3 records will go to training, then 1 record to validation, then repeat.
split = 3
dataset_train = dataset.window(split, split + 1).flat_map(lambda ds: ds)
# [0, 1, 2, 4, 5, 6, 8, 9]
dataset_validation = dataset.skip(split).window(1, split + 1).flat_map(lambda ds: ds)
# [3, 7]
So the first dataset.window(split, split + 1) says to grab split number (3) of elements, then advance split + 1 elements, and repeat. That + 1 effectively skips the 1 element we're going to use in our validation dataset.
The flat_map(lambda ds: ds) is because window() returns the results in batches, which we don't want. So we flatten it back out.
Then for the validation data we first skip(split), which skips over the first split number (3) of elements that were grabbed in the first training window, so we start our iteration on the 4th element. The window(1, split + 1) then grabs 1 element, advances split + 1 (4), and repeats.
Note on nested datasets:
The above example works well for simple datasets, but flat_map() will generate an error if the dataset is nested. To address this, you can swap out the flat_map() with a more complicated version that can handle both simple and nested datasets:
.flat_map(lambda *ds: ds[0] if len(ds) == 1 else tf.data.Dataset.zip(ds))
#ted's answer will cause some overlap. Try this.
train_ds_size = int(0.64 * full_ds_size)
valid_ds_size = int(0.16 * full_ds_size)
train_ds = full_ds.take(train_ds_size)
remaining = full_ds.skip(train_ds_size)
valid_ds = remaining.take(valid_ds_size)
test_ds = remaining.skip(valid_ds_size)
use code below to test.
tf.enable_eager_execution()
dataset = tf.data.Dataset.range(100)
train_size = 20
valid_size = 30
test_size = 50
train = dataset.take(train_size)
remaining = dataset.skip(train_size)
valid = remaining.take(valid_size)
test = remaining.skip(valid_size)
for i in train:
print(i)
for i in valid:
print(i)
for i in test:
print(i)
Now Tensorflow doesn't contain any tools for that.
You could use sklearn.model_selection.train_test_split to generate train/eval/test dataset, then create tf.data.Dataset respectively.
You can use shard:
dataset = dataset.shuffle() # optional
trainset = dataset.shard(2, 0)
testset = dataset.shard(2, 1)
See:
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shard
The upcoming TensorFlow 2.10.0 will have a tf.keras.utils.split_dataset function, see the rc3 release notes:
Added tf.keras.utils.split_dataset utility to split a Dataset object or a list/tuple of arrays into two Dataset objects (e.g. train/test).
In case size of the dataset is known:
from typing import Tuple
import tensorflow as tf
def split_dataset(dataset: tf.data.Dataset,
dataset_size: int,
train_ratio: float,
validation_ratio: float) -> Tuple[tf.data.Dataset, tf.data.Dataset, tf.data.Dataset]:
assert (train_ratio + validation_ratio) < 1
train_count = int(dataset_size * train_ratio)
validation_count = int(dataset_size * validation_ratio)
test_count = dataset_size - (train_count + validation_count)
dataset = dataset.shuffle(dataset_size)
train_dataset = dataset.take(train_count)
validation_dataset = dataset.skip(train_count).take(validation_count)
test_dataset = dataset.skip(validation_count + train_count).take(test_count)
return train_dataset, validation_dataset, test_dataset
Example:
size_of_ds = 1001
train_ratio = 0.6
val_ratio = 0.2
ds = tf.data.Dataset.from_tensor_slices(list(range(size_of_ds)))
train_ds, val_ds, test_ds = split_dataset(ds, size_of_ds, train_ratio, val_ratio)
A robust way to split dataset into two parts is to first deterministically map every item in the dataset into a bucket with, for example, tf.strings.to_hash_bucket_fast. Then you can split the dataset into two by filtering by the bucket. If you split your data into five buckets, you get 80-20 split assuming that the split is even.
As an example, assume that your dataset contains dictionaries with key filename. We split the data into five buckets based on this key. With this add_fold function, we add the key "fold" in the dictionaries:
def add_fold(buckets: int):
def add_(sample, label):
fold = tf.strings.to_hash_bucket(sample["filename"], num_buckets=buckets)
return {**sample, "fold": fold}, label
return add_
dataset = dataset.map(add_fold(buckets=5))
Now we can split the dataset into two disjoint datasets with Dataset.filter:
def pick_fold(fold: int):
def filter_fn(sample, _):
return tf.math.equal(sample["fold"], fold)
return filter_fn
def skip_fold(fold: int):
def filter_fn(sample, _):
return tf.math.not_equal(sample["fold"], fold)
return filter_fn
train_dataset = dataset.filter(skip_fold(0))
val_dataset = dataset.filter(pick_fold(0))
The key that you use for hashing should be one that captures the correlations in the dataset. For example, if your samples collected by the same person are correlated and you want all samples with the same collector end up in the same bucket (and the same split), you should use the collector name or ID as the hashing column.
Of course, you can skip the part with dataset.map and do the hashing and filtering in one filter function. Here's a full example:
dataset = tf.data.Dataset.from_tensor_slices([f"value-{i}" for i in range(10000)])
def to_bucket(sample):
return tf.strings.to_hash_bucket_fast(sample, 5)
def filter_train_fn(sample):
return tf.math.not_equal(to_bucket(sample), 0)
def filter_val_fn(sample):
return tf.math.logical_not(filter_train_fn(sample))
train_ds = dataset.filter(filter_train_fn)
val_ds = dataset.filter(filter_val_fn)
print(f"Length of training set: {len(list(train_ds.as_numpy_iterator()))}")
print(f"Length of validation set: {len(list(val_ds.as_numpy_iterator()))}")
This prints:
Length of training set: 7995
Length of validation set: 2005
Can't comment, but above answer has overlap and is incorrect. Set BUFFER_SIZE to DATASET_SIZE for perfect shuffle. Try different sized val/test size to verify. Answer should be:
DATASET_SIZE = tf.data.experimental.cardinality(full_dataset).numpy()
train_size = int(0.7 * DATASET_SIZE)
val_size = int(0.15 * DATASET_SIZE)
test_size = int(0.15 * DATASET_SIZE)
full_dataset = full_dataset.shuffle(BUFFER_SIZE)
train_dataset = full_dataset.take(train_size)
test_dataset = full_dataset.skip(train_size)
val_dataset = test_dataset.take(val_size)
test_dataset = test_dataset.skip(val_size)
I know that this stackoverflow thread already gives some nice examples about conditionals in tensorflow, but I'm still struggling how to solve my issue of randomly selecting among several different masks in tensorflow.
Right now I can only select between two mask tensors a and b:
rand_num = tf.random_uniform([], minval=0, maxval=2.0, dtype=tf.float32, seed=None)
def if_true():
return b
def if_false():
return a
mask_sel = tf.cond(tf.less(rand_num , tf.constant(1.0)),if_true,if_false)
(I still find it weird that one needs to define these two helper functions, but not using them weirdly throws an error.)
Now the question: Lets say I have 4 mask tensors (a,b,c,d) or more to randomly select, what would be the best way to do that in tensorflow?
In python that would be
rand_num = np.random.uniform(low=0,high=4.0)
if (rand_num < 1.0):
mask_sel = a
elif(rand_num < 2.0):
mask_sel = b
elif(rand_num < 3.0):
mask_sel = c
else
mask_sel = d
About the helper functions, they are useful because they allow tensorflow to know which operations will run under each condition, this way it can optimize by running only the selected branch and ignoring the other. Operations outside the helper functions but used by any of them will always be run before tf.cond runs.
The other options is to use tf.select; you won't need the helper functions here but it will always evaluate both sides before running tf.select which can be inefficient if you don't need to.
Now for the main problem 'selecting from more than 2 tesnors', you can use multiple options:
1- Recursively nesting tf.cond operations:
def select_from_list(selector, tensor_list):
length = len(tensor_list)
if length == 0:
raise ValueError('List is empty')
elif length == 1:
return tensor_list[0]
else:
half = length // 2
return tf.cond(tf.less(selector, float(half)), lambda: select_from_list(selector, tensor_list[:half]), lambda: select_from_list(selector - half, tensor_list[half:]))
2- Using tf.case:
def select_from_list(selector, tensor_list):
length = len(tensor_list)
if length == 0:
raise ValueError('List is empty')
elif length == 1:
return tensor_list[0]
else:
def fn(tensor):
return lambda: tensor
pred_fn_pairs = [(tf.less(selector, float(i+1)), fn(tensor)) for i, tensor in enumerate(tensor_list)]
return tf.case(pred_fn_pairs, default=lambda:tensor_list[-1])
You can test any of them using:
def test(selector, value_list, sess):
return select_from_list(float(selector), [tf.constant(value) for value in value_list]).eval(session = sess)
sess = tf.Session()
test(3.5, [4,2,6,7,5], sess)
This should return 7
I am running a best subset regression analyses in R studio. I am using the following libraries:
library(foreign)
library(glmnet)
library(caTools)
library(leaps)
library(ISLR)
library(knitr)
library(ggvis)
I want to split my sample into three samples: training, cross-validation, and test (maybe 50%, 30%, 20%).
I have successfully ran best subset on our training and cross-validated those results with the following script:
k = 10
set.seed(1)
folds = sample(1:k,nrow(best_demo_train),replace=TRUE)
table(folds)
cv.errors=matrix(NA,k,5, dimnames=list(NULL, paste(1:5)))
predict.regsubsets = function(object, newdata, id, ...) {
form = as.formula(object$call[[2]])
mat = model.matrix(form, newdata)
coefi = coef(object, id = id)
mat[, names(coefi)] %*% coefi
}
for(j in 1:k){
best.fit = regsubsets(selfaware ~., data=best_demo_train[folds != j,])
for (i in 1:5){
pred = predict(best.fit, best_demo_train[folds == j, ], id = i)
cv.errors[j, i] = mean((best_demo_train$selfaware[folds == j] - pred)^2)
}
}
mean.cv.errors=apply(cv.errors,2,mean)
mean.cv.errors
which.min(mean.cv.errors)
par(mfrow=c(1,1))
plot(mean.cv.errors,type='b')
points(which.min(mean.cv.errors),mean.cv.errors[which.min(mean.cv.errors)],
col="red",cex=2,pch=20)
reg.best=regsubsets (selfaware~.,data=best_demo_train)
coef(reg.best ,3)
reg.summary=summary(reg.best ,3)
reg.summary$adjr2
So, once I have the "best" variables, I would like to "test" that model on 20% of the data. Can someone help me out with this? I do not know what the script would be to test this model and have been unsuccessful searching online.
Thank you, I appreciate it.
Sarah