what is the difference between series/dataframe and ndarray? - pandas

Leaving that they are from two different binaries.
I know that series/dataframe can hold any data type, and ndarray is also heterogenous data.
And also all the slicing operations of numpy are applicable to series.
Is there any other difference between them?

After some research I found the answer to my question I asked above. For anyone who needs, here it is from pandas docs:
A key difference between Series and ndarray is that operations between
Series automatically align the data based on the label. Thus, you can
write computations without giving consideration to whether the Series
involved have the same labels.
An example:
s[1:] + s[:-1]
The result for above would produce NaN for both first and last index.
If a label is not found in one Series or the other, the result will be marked as missing NaN.

Related

Should I care about data types in pandas dataframe? If so, how?

I know that columns of pandas dataframe can be of different types. In practice, does it make any difference which type I use? If so, what is that difference and what are some best practices regarding pandas dtypes?
(For example in R one should use factor type to encode categorical vars and it impacts the way that plots are generated etc. - but I haven't heard of anything similar in pandas.)

Does the sklearn.ensemble.GradientBoostingRegressor support sparse input samples?

I’m using sklearn.ensemble.GradientBoostingRegressor on data that is sometimes lacking some values. I can’t easily impute these data because they have a great variance and the estimate is very sensitive to them. They are also almost never 0.
The documentation of the fit method says about the first parameter X:
The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.
This has lead me to think that the GradientBoostingRegressor can work with sparse input data.
But internally it calls check_array with implicit force_all_finite=True (the default), so that I get the following error if I put in a csr_matrix with NaN values:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32')
Does the GradientBoostingRegressor not actually support sparse data?
Update:
I’m lucky in that I don’t have any meaningful zeros. My calling code now looks like this:
predictors['foobar'] = predictors['foobar'].fillna(0) # for columns that contain NaNs
predictor_matrix = scipy.sparse.csr_matrix(
predictors.values.astype(np.float)
)
predictor_matrix.eliminate_zeros()
model.fit(predictor_matrix, regressands)
This avoids the exception above. Unfortunately there is no eliminate_nans() method. (When I print a sparse matrix with NaNs, it lists them explicitly, so spareness must be something other than containing NaNs.)
But the prediction performance hasn’t (noticeably) changed.
Perhaps you could try using LightGBM. Here is a discussion in Kaggle about how it handles missing values:
https://www.kaggle.com/c/home-credit-default-risk/discussion/57918
Good luck

Random projection in Python Pandas using a dataframe containing NaN values

I have a dataframe data containing real values and some NaN values. I'm trying to perform locality sensitive hashing using random projections to reduce the dimension to 25 components, specifically with thesklearn.random_projection.GaussianRandomProjection class. However, when I run:
tx = random_projection.GaussianRandomProjection(n_components = 25)
data25 = tx.fit_transform(data)
I get Input contains NaN, infinity or a value too large for dtype('float64'). Is there a work-around to this? I tried changing all the NaN values to a value that is never present in my dataset, such as -1. How valid would my output be in this case? I'm not an expert behind the theory of locality sensitive hashing/random projections so any insight would be helpful as well. Thanks.
NA / NaN values (not-available / not-a-number) are, I have found, just plain troublesome.
You don't want to just substitute a random value like -1. If you are inclined to do that, use one of the Imputer classes. Otherwise, you are likely to very substantially change the distances between points. You likely want to preserve distances as much as possible if you are using random projection:
The dimensions and distribution of random projections matrices are controlled so as to preserve the pairwise distances between any two samples of the dataset.
However, this may or may not result in reasonable values for learning. As far as I know, imputation is an open field of study, which (for instance) this gentlemen has specialized in studying.
If you have enough examples, consider dropping rows or columns that contain NaN values. Another possibility is training a generative model like a Restricted Boltzman Machine and use that to fill in missing values:
rbm = sklearn.neural_network.BernoulliRBM().fit( data_with_no_nans )
mean_imputed_data = sklearn.preprocessing.Imputer().fit_transform( all_data )
rbm_imputation = rbm.gibbs( mean_imputed_data )
nan_mask = np.isnan( all_data )
all_data[ nan_mask ] = rbm_imputation[ nan_mask ]
Finally, you might consider imputing using nearest neighbors. For a given column, train a nearest neighbors model on all the variables except that column using all complete rows. Then, for a row missing that column, find the k nearest neighbors and use the average value among them. (This gets very costly, especially if you have rows with more than one missing value, as you will have to train a model for every combination of missing columns).

Working with columns of NumPy matrices

I've been unable to figure out how to access, add, multiply, replace, etc. single columns of a NumPy matrix. I can do this via looping over individual elements of the column, but I'd like to treat the column as a unit, something that I can do with rows.
When I've tried to search I'm usually directed to answers handling NumPy arrays, but this is not the same thing.
Can you provide code that's giving trouble? The operations on columns that you list are among the most basic operations that are supported and optimized in NumPy. Consider looking over the tutorial on NumPy for MATLAB users, which has many examples of accessing rows or columns, performing vectorized operations on them, and modifying them with copies or in-place.
NumPy for MATLAB Users
Just to clarify, suppose you have a 2-dimensional NumPy ndarray or matrix called a. Then a[:, 0] would access the first column just the same as a[0] or a[0, :] would access the first row. Any operations that work for rows should work for columns as well, with some caveats for broadcasting rules and certain mathematical operations that depend upon array alignment. You can also use the numpy.transpose(a) function (which is also exposed with a.T) to transpose a making the columns become rows.

Which features of Pandas DataFrame could be used to model GPS Tracklog data (read from GPX file)

It's been months now since I started to use Pandas DataFrame to deserialize GPS data and perform some data processing and analyses.
Although I am very impressed with Pandas robustness, flexibility and power, I'm a bit lost about which features, and in which way, I should use to properly model the data, both for clarity, simplicity and computational speed.
Basically, each DataFrame is primarily indexed by a datetime object, having at least one column for a latitude-longitude tuple, and one column for elevation.
The first thing I do is to calculate a new column with the geodesic distance between coordinate pairs (first one being 0.0), using a function that takes two coordinate pairs as arguments, and from that new column I can calculate the cumulative distance along the track, which I use as a Linear Referencing System
The questions I need to address would be:
Is there a way in which I can use, in the same dataframe, two different monotonically increasing columns (cumulative distance and timestamp), choosing whatever is more convenient in each given context at runtime, and use these indexes to auto-align newly inserted rows?
In the specific case of applying a diff function that could be vectorized (applied like an array operation instead of an iterative pairwise loop), is there a way to do that idiomatically in pandas? Should I create a "coordinate" class which support the diff (__sub__) operation so I could use dataframe.latlng.diff directly?
I'm not sure these questions are well formulated, but that is due, at least a bit, by the overwhelming number of possibilities, and a somewhat fragmented documentation (yet).
Also, any tip about using Pandas for GPS data (tracklogs) or Geospatial data in general is very much welcome.
Thanks for any help!