I'm trying to extract geometric data from an IFC file using xbim.
The IFC file is pretty big (about 500 MB) and it's taking quite a while.
In particular, I'm using the XbimGeometryEngine.CreateSolid method.
Is there a way to make xbim run its calculations on the GPU?
Related
Is it possible for me to run numpy/scipy methods using np.float32 rather than np.float64? I'm currently having trouble managing an amount of data larger than my RAM and am looking for ways to reduce its size.
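As a rough illustration of the idea (the array name and size here are made up, not from the question), casting an existing array to np.float32 halves its in-memory footprint:

dataset = tf.data.TFRecordDataset(filenames)
import numpy as np

# Hypothetical array, roughly 80 MB at float64.
data = np.random.rand(10_000_000)         # dtype is float64 by default
print(data.nbytes / 1e6)                  # ~80 MB

data32 = data.astype(np.float32)          # casting halves the footprint
print(data32.nbytes / 1e6)                # ~40 MB

# Most numpy/scipy routines accept float32 input and typically keep that
# dtype through the computation, trading precision for memory.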
I have noticed that the method tf.data.experimental.save (added in r2.3) allows you to save a tf.data.Dataset to file in just one line of code, which seems extremely convenient. Are there still benefits to serializing a tf.data.Dataset and writing it into a TFRecord ourselves, or is this save function meant to replace that process?
TFRecords have several benefits, especially with large datasets. TFRecord - If you are working with large datasets, using a binary file format to store your data can have a significant impact on the performance of your import pipeline and, as a consequence, on the training time of your model. Binary data takes up less space on disk, takes less time to copy, and can be read much more efficiently from disk. This is especially true if your data is stored on spinning disks, due to their much lower read/write performance compared with SSDs.
tf.data.experimental.save and tf.data.experimental.load will be useful if you are not worried about the performance of your import pipeline.
tf.data.experimental.save - The saved dataset is saved in multiple file "shards". By default, the dataset output is divided into shards in a round-robin fashion. Datasets saved through tf.data.experimental.save should only be consumed through tf.data.experimental.load, which is guaranteed to be backwards compatible.
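For reference, a minimal sketch of the save/load round trip; the path and the pipeline below are placeholders, not anything from the question:

import tensorflow as tf

# Build a dataset (placeholder pipeline).
dataset = tf.data.Dataset.range(100).map(lambda x: x * 2)

# One-liner persistence added in TF 2.3; writes the dataset as file shards.
tf.data.experimental.save(dataset, "/tmp/my_dataset")

# Reload later; element_spec tells load() how to interpret the shards.
restored = tf.data.experimental.load(
    "/tmp/my_dataset", element_spec=dataset.element_spec)

for x in restored.take(3):
    print(x.numpy())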
Here is the situation: I am working with a large TensorFlow record file. It's about 50 GB. However, the machine I'm doing this training on has 128 GB of RAM. 50 is less than 128, so even though this is a large file, you would think it would be possible to keep it in memory and save on slow I/O operations. But I'm using the TFRecordDataset class, and it seems like the whole TFRecord system is designed specifically not to do that, and I don't see any way to force it to keep the records in memory. Since it reloads them every epoch, I am wasting an inordinate amount of time on slow I/O operations reading from that 50 GB file.
I suppose I could load the records into memory in python and then load them into my model one by one with a feed_dict, bypassing the whole Dataset class. But that seems like a less elegant way to handle things and would require some redesign. Everything would be much simpler if I could just force the TFRecordDataset to load everything into memory and keep it there between epochs...
You need tf.data.Dataset.cache() operation. To achieve the desired effect (keeping the file in memory), put it right after the TFRecordDataset and don't provide any arguments to it:
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.cache()
When the cache() operation is invoked without arguments, caching is done in memory.
Also, if you have some postprocessing of these records, for example with dataset.map(...), then it can be even more beneficial to put the cache() operation at the end of the input pipeline.
More information can be found in the "Input Pipeline Performance Guide" Map and Cache section.
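As an illustrative pipeline (the feature spec, buffer sizes, and batch size are placeholders), placing cache() after the parsing map keeps the already-decoded records in memory, so the parsing work is also done only once:

import tensorflow as tf

def parse_example(serialized):
    # Placeholder parsing step; real feature specs depend on your records.
    features = {"x": tf.io.FixedLenFeature([], tf.float32)}
    return tf.io.parse_single_example(serialized, features)

dataset = (
    tf.data.TFRecordDataset(filenames)   # filenames as in the snippet above
    .map(parse_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .cache()                             # keep parsed records in RAM after the first epoch
    .shuffle(10_000)
    .batch(32)
    .prefetch(tf.data.experimental.AUTOTUNE)
)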
I am using the CNTKTextReader to read in my training and test sets. The train file is getting large (2.7 GB now, and soon to get bigger).
I don't understand what "CNTKTextFormatDeserializer" is -- the documentation I found didn't explain the big picture (what it is and why to use it); it just went into its syntax.
So, is it a way to use a binary version of these files to make them more compact?
Readers in general are just a way to make certain aspects of training easier. These include:
randomization: SGD generalizes better when the data presented to it are coming in random order. The reader can randomize the data for you with shuffling happening on the fly.
distributed training: For distributed training the reader is aware of the multiple workers and can make sure they receive distinct chunks of data.
memory budget issues: The reader does not load the whole training file in memory.
language agnostic i/o: The reader provides a cross-platform way to read data. (if you want to always be in Python, you might not care about this but others do).
The CTF format is a little verbose and indeed there is a binary format deserializer that was recently added.
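A minimal Python sketch of wiring a CTF file through a MinibatchSource; the file name, field names, and dimensions here are made up for illustration:

from cntk.io import MinibatchSource, CTFDeserializer, StreamDef, StreamDefs

input_dim, num_classes = 784, 10   # assumed dimensions

# Each StreamDef names a field in the CTF text file (e.g. |features ... |labels ...).
reader = MinibatchSource(
    CTFDeserializer("train.ctf", StreamDefs(
        features=StreamDef(field="features", shape=input_dim, is_sparse=False),
        labels=StreamDef(field="labels", shape=num_classes, is_sparse=False))),
    randomize=True)                # on-the-fly shuffling handled by the reader

minibatch = reader.next_minibatch(64)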
I'm currently using a 10% sample of a very large dataset (10 vars, over 300m rows) which amounts to over 200 GB of data when stored in .dta format for the full dataset. Stata is able to handle operations like egen, collapse, merging, etc in a reasonable amount of time for the 10% sample when using Stata-MP on a UNIX server with ~50G of RAM and multiple cores.
However, now I want to move on to analyzing the whole sample. Even if I use a machine that has enough RAM to hold the dataset, simply generating a variable takes ages. (I think perhaps the background operations are causing Stata to run into virtual memory.)
The problem is also very amenable to parallelization, i.e., the rows in the dataset are independent of each other, so I can just as easily think about the one large dataset as 100 smaller datasets.
Does anybody have any suggestions for how to process/analyze this data or can give me feedback on some suggestions I currently have? I mostly use Stata/SAS/MATLAB so perhaps there are other approaches that I am simply unaware of.
Here are some of my current ideas:
Split the dataset up into smaller datasets and use informal parallel processing in Stata. I can run my cleaning/processing/analysis on each partition and then merge the results afterwards without having to store all the intermediate parts.
Use SQL to store the data and also perform some of the data manipulation such as aggregating over certain values. One concern here is that some tasks that Stata can handle fairly easily such as comparing values across time won't work so well in SQL. Also, I'm already running into performance issues when running some queries in SQL on a 30% sample of the data. But perhaps I'm not optimizing by indexing correctly, etc. Also, Shard-Query seems like it could help with this but I have not researched it too thoroughly yet.
R also looks promising, but I'm not sure if it would solve the problem of working with this enormous amount of data.
Thanks to those who have commented and replied. I realized that my problem is similar to this thread. I have re-written some of my data manipulation code in Stata into SQL and the response time is much quicker. I believe I can make large optimization gains by correctly utilizing indexes and using parallel processing via partitions/shards if necessary. After all the data manipulation has been done, I can import that data via ODBC in Stata.
Since you are familiar with Stata, there is a well-documented FAQ about large data sets in Stata, "Dealing with Large Datasets", which you might find helpful.
I would clean via columns: split them up, run any specific cleaning routines, and merge back in later.
Depending on your machine resources, you should be able to hold the individual columns in multiple temporary files using tempfile. Taking care to select only the variables or columns most relevant to your analysis should reduce the size of your set quite a lot.
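If part of the work ever moves outside Stata, the same column-wise approach can be sketched in Python/pandas terms; everything below (file name, key, variable names, cleaning rule) is a placeholder rather than anything from the question:

import pandas as pd

# Purely illustrative file, key, and variable names.
key, cols_to_clean = "id", ["income", "age"]

# Clean one column at a time, keeping only the key plus that column in
# memory, and park each result on disk (the analogue of Stata tempfiles).
for col in cols_to_clean:
    part = pd.read_csv("big_file.csv", usecols=[key, col])
    part[col] = part[col].fillna(part[col].median())   # placeholder cleaning rule
    part.to_csv(f"clean_{col}.csv", index=False)

# Merge the cleaned columns back together on the key when needed.
cleaned = pd.read_csv("clean_income.csv").merge(
    pd.read_csv("clean_age.csv"), on=key)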