Fast ascii loader to NumPy arrays - pandas

It is well known [1] [2] that numpy.loadtxt is not particularly fast at loading simple text files containing numbers.
I have been googling around for alternatives, and of course I stumbled across pandas.read_csv and astropy io.ascii. However, these readers don't appear to be easy to decouple from their libraries, and I'd like to avoid adding a 200 MB, 5-second-import-time gorilla just to read some ascii files.
The files I usually read are simple, no missing data, no malformed rows, no NaNs, floating point only, space or comma separated. But I need numpy arrays as output.
Does anyone know if any of the parsers above can be used standalone or about any other quick parser I could use?
Thank you in advance.
[1] Numpy loading csv TOO slow compared to Matlab
[2] http://wesmckinney.com/blog/a-new-high-performance-memory-efficient-file-parser-engine-for-pandas/
[Edit 1]
For the sake of clarity and to reduce background noise: as I stated at the beginning, my ascii files contain simple floats, no scientific notation, no Fortran-specific formats, no funny stuff, nothing but plain floats.
Sample:
import numpy as np

arr = np.random.rand(1000, 100)
np.savetxt('float.csv', arr)

Personally I just use pandas and astropy for this. Yes, they are big and slow to import, but they are very widely available and import in under a second on my machine, so they aren't so bad. I haven't tried, but I would assume that extracting the CSV reader from pandas or astropy and getting it to build and run standalone isn't easy; it's probably not a good way to go.
Is writing your own CSV-to-NumPy reader an option? If the CSV is simple, it should be doable in ~100 lines of, e.g., C or Cython, and if you know your CSV format you can get performance and package size that a generic solution can't match.
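If staying in pure Python/NumPy is acceptable, one common trick for clean float-only files is to read the whole file and let NumPy parse it as one flat stream of numbers. A minimal sketch, assuming a well-formed file like the sample above (load_simple_floats and the column count are made up for the example):

import numpy as np

def load_simple_floats(path, n_cols):
    # Only for well-formed files: floats only, no missing values,
    # a fixed number of columns, space- or comma-separated.
    with open(path) as fh:
        text = fh.read().replace(',', ' ')
    flat = np.fromstring(text, dtype=np.float64, sep=' ')  # newlines count as whitespace
    return flat.reshape(-1, n_cols)

arr = load_simple_floats('float.csv', n_cols=100)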
Another option you could look at is https://odo.readthedocs.io/ . I don't have experience with it, and from a quick look I didn't see a direct CSV -> NumPy path. But it does make fast CSV -> database loading simple, and I'm sure there are fast database -> NumPy array options. So it might be possible to get a fast CSV -> in-memory SQLite -> NumPy array pipeline via odo and possibly a second package.
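Just to illustrate the shape of that pipeline, here is a sketch that skips odo entirely and uses only the standard library plus NumPy; the three-column layout and file name are made up, and this is not the fast path itself:

import sqlite3
import numpy as np

# Load a comma-separated float file into an in-memory SQLite table,
# then pull it back out as a NumPy array. odo (or another fast loader)
# would replace the manual INSERT step in a real pipeline.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE data (c0 REAL, c1 REAL, c2 REAL)')
with open('floats.csv') as fh:
    rows = ([float(x) for x in line.split(',')] for line in fh)
    conn.executemany('INSERT INTO data VALUES (?, ?, ?)', rows)
arr = np.array(conn.execute('SELECT * FROM data').fetchall(), dtype=np.float64)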

Related

pandas writes NUL character (\0) when calling to_csv

In one of my scripts I call the following code on my dataframe to save the data on disc.
to_csv(input_folder / 'tmp' / section_fname, index=False, encoding='latin1', quoting=csv.QUOTE_NONNUMERIC)
When I opened the created file with Notepad++ in "show all characters" mode, it showed a lot of NUL characters (\0) inside one of the rows. In addition to this, some rows of the dataframe are not written at all.
However, if I scroll further along that line, some of my dataframe's data does appear after the NUL run.
This appears somewhere in the middle of my dataframe, so I called head and then tail to inspect the specific portion of the data where it occurs. As far as I can see, the data looks fine: there are integers and strings, as there should be.
I am using pandas 1.1.5
I have looked through the data to make sure nothing weird is being written that could end up read this way. I have also googled whether anyone has faced the same issue, but the reports I find are mostly about people reading data with pandas and getting NUL characters, not writing them.
I have spent a lot of time digging into the data and the code and have no explanation for this behavior. Maybe someone can help me?
By the way, every time I write my dataframe, the corruption appears in a different place and a different number of rows is dropped.
Kind regards,
Mike

How do I work with large, >30 GiB datasets that are formatted as SAS7BDAT files?

I have these 30 GiB SAS7BDAT files which correspond to a year's worth of data. When I try importing a file using pd.read_sas() I get a memory-related error. From my research, I see mentions of using Dask, splitting the files into smaller chunks, or SQL. These answers sound pretty broad, and since I'm new, I don't really know where to begin. Would appreciate if someone could share some details with me. Thanks.
I am not aware of a partitioned loader for this sort of data in dask. However, the pandas API apparently allows you to stream the data in chunks, so you could write those chunks to other files in any convenient format, and then process them either serially or with dask. The best value of chunksize will depend on your data and available memory.
The following should work, but I don't have any of this sort of data to try it on.
with pd.read_sas(..., chunksize=100000) as file_reader:
    for i, df in enumerate(file_reader):
        df.to_parquet(f"{i}.parq")
then you can load the parts (in parallel) with
import dask.dataframe as dd
ddf = dd.read_parquet("*.parq")
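Operations on the resulting dask dataframe are lazy until you call compute(). A small hedged example (the column name "value" is just an assumption about the data):

import dask.dataframe as dd

ddf = dd.read_parquet("*.parq")
# Nothing is read until a result is requested; compute() triggers the work
# across the parquet parts, potentially in parallel.
mean_value = ddf["value"].mean().compute()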

Best format for Pandas serialization on disk

For my workload, I need to serialize Pandas dataframes (text plus numeric data) to disk, at around 5 GB per dataframe.
I came across various solutions:
HDF5: issues with strings
Feather: not stable
CSV: OK, but large file size
pickle: OK, cross-platform; can we do better?
gzip: same as CSV (slow for read access)
SFrame: good, but no longer maintained
Just wondering: is there any alternative to pickle for storing dataframes containing strings on disk?
Parquet is the best format here; it is what the big tech companies use to store petabytes of data.
I suggest reading this article: https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d
The author concludes that feather is the most efficient for serialization. However, it is not suitable for long-term storage, for which CSV is likely the better choice.
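For reference, a minimal round trip with both formats (this assumes pyarrow is installed; the file names and the toy dataframe are arbitrary):

import pandas as pd

df = pd.DataFrame({"text": ["a", "b", "c"], "value": [1.0, 2.0, 3.0]})

# Parquet: columnar and compressed, a reasonable choice for long-term storage
# of mixed text/numeric data.
df.to_parquet("frame.parquet")
df_back = pd.read_parquet("frame.parquet")

# Feather: very fast to read and write, better suited to short-lived
# intermediate files than to archival storage.
df.to_feather("frame.feather")
df_back2 = pd.read_feather("frame.feather")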

out-of-core 'where' on pytables array

I have a big pytables carray mapped to an hdf5 file and I want to extract a very small subset based on a condition without having to pull the whole thing into memory at once. All I want is the equivalent of this numpy code:
b=a[np.where(a>3.0)]
where 'a' would be my pytables disk array. It seems trivial but I've been scratching my head for hours. I'd be very grateful if anyone can help.
David
You cannot do out-of-core queries on *Array objects in PyTables; Table objects are the ones that have received the largest share of love in PyTables. Your best bet here is to store the CArray contents in a Table with a single column.
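A rough sketch of that workaround with the PyTables API, assuming the CArray of floats already lives in data.h5 under /my_carray (both names are made up here):

import numpy as np
import tables

with tables.open_file('data.h5', mode='a') as h5:
    carray = h5.root.my_carray
    table = h5.create_table('/', 'values', {'x': tables.Float64Col()})

    # Copy the array into the one-column table in chunks, so the whole
    # CArray never has to sit in memory at once.
    chunk_size = 1_000_000
    for start in range(0, carray.shape[0], chunk_size):
        chunk = np.asarray(carray[start:start + chunk_size]).ravel()
        rows = np.empty(len(chunk), dtype=[('x', 'f8')])
        rows['x'] = chunk
        table.append(rows)
    table.flush()

    # Out-of-core equivalent of b = a[np.where(a > 3.0)]:
    b = table.read_where('x > 3.0')['x']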

Best approach for bringing 180K records into an app: core data: yes? csv vs xml?

I've built an app with a tiny amount of test data (clues & answers) that works fine. Now I need to think about bringing in a full set of clues & answers, which is roughly 180K records (it's a word game). I am worried about speed and memory usage, of course. Looking around the intertubes and my library, I have concluded that this is probably a job for Core Data. Within that approach, however, I guess I can bring it in as CSV or as XML (I can create either one from the raw data using a scripting language). I found some resources about how to handle each case. What I don't know is anything about the overall speed and other issues one might expect when using CSV vs XML. The CSV file is about 3.6 MB and the data type is strings.
I know this is dangerously close to a non-question, but I need some advice as either approach requires a large coding commitment. So here are the questions:
For a file of this size and characteristics, would one expect CSV or XML to be a better approach? Is there some other format/protocol/strategy that would make more sense?
Am I right to focus on Core Data?
Maybe I should throw some fake code here so the system doesn't keep warning me about asking a subjective question. But I have to try! Thanks for any guidance. Links to discussions appreciated.
As for file size, CSV will always be smaller than an XML file, since it contains only the raw data in ASCII format. Consider the following 3 rows and 3 columns:
Column1, Column2, Column3
1, 2, 3
4, 5, 6
7, 8, 9
Compare that to its XML counterpart, which doesn't even include schema information. It is also ASCII, but the rowX and ColumnX tags have to be repeated multiple times throughout the file. Compression could of course help, but I'm guessing that even with compression the CSV will still be smaller.
<root>
<row1>
<Column1>1</Column1>
<Column2>2</Column2>
<Column3>3</Column3>
</row1>
<row2>
<Column1>4</Column1>
<Column2>5</Column2>
<Column3>6</Column3>
</row2>
<row3>
<Column1>7</Column1>
<Column2>8</Column2>
<Column3>9</Column3>
</row3>
</root>
As for your other questions, sorry, I can't help there.
This is large enough that the I/O time difference will be noticeable, and with the CSV being, what, 10x smaller? the processing-time difference (whichever format parses faster) will be negligible compared to the difference in reading the file in. And CSV should be faster outside of I/O too.
Whether to use Core Data depends on which of its features you hope to exploit. I'm guessing the only one is querying, and it might be worth it for that, although if it's just a simple mapping from clue to answer, you might just want to read the whole thing from the CSV file into an NSMutableDictionary. Access will be faster.