out-of-core 'where' on pytables array - pytables

I have a large PyTables CArray mapped to an HDF5 file, and I want to extract a very small subset based on a condition without having to pull the whole thing into memory at once. All I want is the equivalent of this numpy code:
b=a[np.where(a>3.0)]
where 'a' would be my PyTables on-disk array. It seems trivial, but I've been scratching my head for hours. I'd be very grateful if anyone can help.
David

You cannot do 'out-of-core' queries on *Array objects in PyTables. The reason is that Table objects are the ones that have received the largest share of love in PyTables. Your best bet here is to store the CArray contents in a Table with just one column and query that.
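A minimal sketch of that route (the file, node and column names, the dtype and the chunk size are my assumptions, not the poster's code): copy the CArray into a one-column Table chunk by chunk, then run an in-kernel query so only the matching rows are pulled into memory.

import numpy as np
import tables as tb

with tb.open_file('data.h5', mode='a') as f:
    a = f.root.carr                                    # the existing CArray
    table = f.create_table('/', 'carr_table', {'value': tb.Float64Col()})

    chunk = 1_000_000
    for start in range(0, a.shape[0], chunk):
        block = np.asarray(a[start:start + chunk]).ravel()
        rec = np.empty(len(block), dtype=[('value', 'f8')])
        rec['value'] = block
        table.append(rec)                              # at most one chunk in RAM
    table.flush()

    # In-kernel query: the condition is evaluated on disk, chunk by chunk
    b = table.read_where('value > 3.0')['value']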

Related

How to efficiently flatten JSON structure returned in elasticsearch_dsl queries?

I'm using elasticsearch_dsl to make queries against and search an Elasticsearch DB.
One of the fields I'm querying is an address, which has a structure like so:
address.first_line
address.second_line
address.city
address.code
The returned documents hold this in JSON structures, such that the address is held in a dict with a field for each sub-field of address.
I would like to put this into a (pandas) dataframe, such that there is one column per sub-field of the address.
Directly putting address into the dataframe gives me a column of address dicts, and iterating the rows to manually unpack (json_normalize()) each address dict takes a long time (4 days, ~200,000 rows).
From the docs I can't figure out how to get elasticsearch_dsl to return flattened results. Is there a faster way of doing this?
Searching for a way to solve this problem, I came across my own answer and found it lacking, so I will update it with a better way.
Specifically: pd.json_normalize(df['json_column'])
In context: pd.concat([df, pd.json_normalize(df['json_column'])], axis=1)
Then drop the original column if required.
Original answer from last year that does the same thing much more slowly
df.column_of_dicts.apply(pd.Series) returns a DataFrame with those dicts flattened.
pd.concat([df, new_df], axis=1) gets the new columns onto the old dataframe.
Then delete the original column_of_dicts.
pd.concat([df, df.address.apply(pd.Series)], axis=1) is the actual code I used.
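For completeness, a small self-contained sketch of the faster route above, with made-up rows standing in for the elasticsearch_dsl hits (column and field names are just placeholders):

import pandas as pd

# two made-up documents standing in for the query results
df = pd.DataFrame({
    'id': [1, 2],
    'address': [
        {'first_line': '1 Main St', 'second_line': '', 'city': 'Leeds', 'code': 'LS1'},
        {'first_line': '2 High St', 'second_line': 'Flat 3', 'city': 'York', 'code': 'YO1'},
    ],
})

# flatten the dict column into one column per sub-field, then drop the original
flat = pd.concat([df, pd.json_normalize(df['address'])], axis=1).drop(columns='address')
print(list(flat.columns))  # ['id', 'first_line', 'second_line', 'city', 'code']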

Fast ascii loader to NumPy arrays

It is well known [1] [2] that numpy.loadtxt is not particularly fast in loading simple text files containing numbers.
I have been googling around for alternatives, and of course I stumbled across pandas.read_csv and astropy io.ascii. However, these readers don’t appear to be easy to decouple from their library, and I’d like to avoid adding a 200 MB, 5-seconds-import-time gorilla just for reading some ascii files.
The files I usually read are simple, no missing data, no malformed rows, no NaNs, floating point only, space or comma separated. But I need numpy arrays as output.
Does anyone know if any of the parsers above can be used standalone or about any other quick parser I could use?
Thank you in advance.
[1] Numpy loading csv TOO slow compared to Matlab
[2] http://wesmckinney.com/blog/a-new-high-performance-memory-efficient-file-parser-engine-for-pandas/
[Edit 1]
For the sake of clarity and to reduce background noise: as I stated at the beginning, my ASCII files contain simple floats, no scientific notation, no Fortran-specific data, no funny stuff, nothing but simple floats.
Sample:
import numpy as np

arr = np.random.rand(1000, 100)
np.savetxt('float.csv', arr)
Personally I just use pandas and astropy for this. Yes, they are big and slow to import, but they are very widely available and import in under a second on my machine, so they aren't so bad. I haven't tried, but I would assume that extracting the CSV reader from pandas or astropy and getting it to build and run standalone isn't easy, and probably not a good way to go.
Is writing your own CSV to Numpy array reader an option? If the CSV is simple, it should be possible to do with ~ 100 lines of e.g. C / Cython, and if you know your CSV format you can get performance and package size that can't be beaten by a generic solution.
Another option you could look at is https://odo.readthedocs.io/ . I don't have experience with it, but from a quick look I didn't see a direct CSV -> NumPy path. It does make fast CSV -> database conversion simple, though, and I'm sure there are fast database -> NumPy array options. So it might be possible to get a fast CSV -> in-memory SQLite -> NumPy array pipeline via odo and possibly a second package.
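If the files really are as simple as the np.savetxt sample above, one pragmatic sketch (assuming pandas is acceptable as a dependency after all) is to let read_csv's C engine do the parsing and hand back a plain NumPy array; the file name and delimiter below match that sample:

import numpy as np
import pandas as pd

def load_ascii(path, delimiter=' '):
    # header=None: the file contains only numbers, no column names
    # dtype=np.float64: skip type inference, everything is a plain float
    frame = pd.read_csv(path, sep=delimiter, header=None, dtype=np.float64)
    return frame.to_numpy()

arr = load_ascii('float.csv')
print(arr.shape)  # (1000, 100) for the sample above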

Why is pyspark so much slower in finding the max of a column?

Is there a general explanation for why Spark needs so much more time to calculate the maximum value of a column?
I imported the Kaggle Quora training set (over 400,000 rows) and I like what Spark is doing when it comes to row-wise feature extraction. But now I want to scale a column 'manually': find the maximum value of a column and divide by that value.
I tried the solutions from Best way to get the max value in a Spark dataframe column and https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
I also tried df.toPandas() and then calculated the max in pandas (you guessed it, df.toPandas() took a long time).
The only thing I did not try yet is the RDD way.
Before I provide some test code (I have to find out how to generate dummy data in Spark), I'd like to know:
Can you give me a pointer to an article discussing this difference?
Is Spark more sensitive to memory constraints on my computer than pandas?
As @MattR has already said in the comments, you should use pandas unless there's a specific reason to use Spark.
Usually you don't need Apache Spark unless you encounter a MemoryError with pandas. But if one server's RAM is not enough, then Apache Spark is the right tool for you. Apache Spark has overhead, because it needs to split your data set first, then process those distributed chunks, then join the "processed" data, collect it on one node, and return it back to you.
@MaxU, @MattR, I found an intermediate solution that also made me reassess Spark's laziness and understand the problem better.
sc.accumulator helps me define a global variable, and with a separate AccumulatorParam object I can calculate the maximum of the column on the fly.
In testing this I noticed that Spark is even lazier than expected, so the part of my original post 'I like what Spark is doing when it comes to row-wise feature extraction' boils down to 'I like that Spark is doing nothing quite fast'.
On the other hand, a lot of the time spent on calculating the maximum of the column was presumably spent computing the intermediate values.
Thanks for your input; this topic really got me much further in understanding Spark.
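For reference, a minimal sketch of the accumulator approach described above, with a made-up DataFrame and column name ('value'); it keeps a running maximum while the rows are streamed through foreach, then uses it to scale the column:

from pyspark import AccumulatorParam
from pyspark.sql import SparkSession

class MaxParam(AccumulatorParam):
    # accumulator that keeps the running maximum of the values added to it
    def zero(self, value):
        return float('-inf')
    def addInPlace(self, v1, v2):
        return max(v1, v2)

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

df = spark.createDataFrame([(1.0,), (7.5,), (3.2,)], ['value'])

max_acc = sc.accumulator(float('-inf'), MaxParam())
df.foreach(lambda row: max_acc.add(row['value']))    # foreach is an action, runs on the executors

scaled = df.withColumn('scaled', df['value'] / max_acc.value)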

select from hdf5 apply function (e.g. mean)

I'm loading a data frame stored on disk as an HDF5 file. I'm using the store.select statement to run conditions and return only the data I'm interested in. After that I'm getting the column-wise mean of the data. Is there a way to combine the two steps such that the mean is basically performed on disk and the whole data is not loaded into memory at the same time?
Thanks!
-Kaushik
In theory yes, see here. In practice, not at the moment. You would have to drop down to PyTables via store._handle to get at the data that is needed. You would also have to handle NaN values, for example.
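One workable compromise today is to stream the selection in chunks and accumulate the mean incrementally, so only one chunk is ever in memory at a time. A minimal sketch, assuming the frame was stored in table format under the key 'df' with a queryable data column 'A' and numeric columns throughout (the file, key and column names are assumptions):

import pandas as pd

total, count = None, 0
with pd.HDFStore('data.h5', mode='r') as store:
    # chunksize turns select() into an iterator over the matching rows
    for chunk in store.select('df', where='A > 3.0', chunksize=100_000):
        total = chunk.sum() if total is None else total + chunk.sum()
        count += len(chunk)

column_means = total / count   # column-wise mean of the selected rows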

Best approach for bringing 180K records into an app: core data: yes? csv vs xml?

I've built an app with a tiny amount of test data (clues & answers) that works fine. Now I need to think about bringing in a full set of clues & answers, which is roughly 180K records (it's a word game). I am worried about speed and memory usage, of course. Looking around the intertubes and my library, I have concluded that this is probably a job for Core Data. Within that approach, however, I guess I can bring it in as CSV or as XML (I can create either one from the raw data using a scripting language). I found some resources about how to handle each case. What I don't know is anything about overall speed and other issues that one might expect when using CSV vs XML. The CSV file is about 3.6 MB and the data type is strings.
I know this is dangerously close to a non-question, but I need some advice as either approach requires a large coding commitment. So here are the questions:
1. For a file of this size and characteristics, would one expect CSV or XML to be a better approach? Is there some other format/protocol/strategy that would make more sense?
2. Am I right to focus on Core Data?
Maybe I should throw some fake code here so the system doesn't keep warning me about asking a subjective question. But I have to try! Thanks for any guidance. Links to discussions appreciated.
As for file size, CSV will always be smaller than an XML file, since it contains only the raw data in ASCII format. Consider the following 3 rows and 3 columns.
Column1, Column2, Column3
1, 2, 3
4, 5, 6
7, 8, 9
Compare that to its XML counterpart below, which doesn't even include schema information. It is also in ASCII format, but the rowX and ColumnX tags have to be repeated multiple times throughout the file. Compression could of course help fix this, but I'm guessing that even with compression the CSV will still be smaller.
<root>
<row1>
<Column1>1</Column1>
<Column2>2</Column2>
<Column3>3</Column3>
</row1>
<row2>
<Column1>4</Column1>
<Column2>5</Column2>
<Column3>6</Column3>
</row2>
<row3>
<Column1>7</Column1>
<Column2>8</Column2>
<Column3>9</Column3>
</row3>
</root>
As for your other questions, sorry, I can't help there.
This is large enough that the I/O time difference will be noticeable, and with the CSV being, what, 10x smaller, the processing-time difference (whichever format parses faster) will be negligible compared to the difference in reading it in. And CSV should be faster outside of I/O too.
Whether to use Core Data depends on which of its features you hope to exploit. I'm guessing the only one is querying, and it might be worth it for that, although if it's just a simple mapping from clue to answer, you might just want to read the whole thing from the CSV file into an NSMutableDictionary. Access will be faster.