(fixed acidity;"volatile acidity";"citric acid";"re...)<-- how do I separate something like this with commas? - pandas

I got this dataset (link below), I'm trying to do an exercise for learning TensorFlow and had problems loading the data. The instruction says:
"Read the data into a DataFrame with Pandas. pd.read_csv would be very useful here. Note, it has an option to specify the delimiter (and the wine csv files are not comma separated). One other note, you should specify a dtype to read_csv: the quality value in the dataset is given as an integer, but our model (a regression model) is expecting a float value."
So basically, I understood that I need to separate the data with commas, and that the values for each feature need to be changed to floats. Could you please help?
https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
cheers!

You don't need to convert it to commas; just use a semicolon as the separator and set the dtype to float:
pd.read_csv('winequality-white.csv', sep=';', dtype='float')
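For completeness, here's a minimal sketch reading the file straight from the URL in the question (a local copy works the same way); dtype='float' casts every column, including quality, to float for the regression model:

import pandas as pd

# The UCI wine-quality files are semicolon-delimited, so sep=';' is all that's
# needed; no conversion to commas is required.
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-white.csv")
df = pd.read_csv(url, sep=';', dtype='float')
print(df.dtypes)   # every column, including 'quality', is float64
print(df.head())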

Related

label encoding in dask_cudf dataframe

I am trying to use dask_cudf to preprocess a very large dataset (150,000,000+ records) for multi-class xgboost training and am having trouble encoding the class column (dtype is string). I tried using the 'replace' function, but the error message said the two dtypes must match. I tried using dask_ml.LabelEncoder, but it said string arrays aren't supported in cudf. I tried using compute() in various ways, but I kept running into out-of-memory errors (I'm assuming because operations on a cudf dataframe require a smaller dataset). I also tried pulling the class column out, encoding it, and then merging it back into the dataframe, but the partitions do not line up. I tried manually lining them up, but dask_cudf seemingly does not support repartitioning using the 'divisions' parameter (I got an error saying something like 'old and new partitions do not match'). Any help on how to do this would be much appreciated.
Strings aren't supported by xgboost. Not having seen your data, here are a few quick and dirty ways I've modified string columns in order to train, since the strings themselves generally may not matter:
If the strings are actually numeric (like dates), convert them to int (int8, int16, int32).
I did this by hash-mapping the strings and then running xgboost (basically creating a reversible conversion between string and integer, as long as you don't change the integers) and training on the current, now hashed-as-an-integer, column.
If the strings are classes, manually assign class numbers (0, 1, 2, ..., n) in a new column and train on that one.
There are definitely other, better ways (a minimal sketch of the class-numbering approach is below). As for the second part of your question, I left a comment.
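To illustrate the class-numbering idea, here's a minimal pandas sketch (toy data; the column name "label" is made up). cudf/dask_cudf expose a similar interface, but check what your version supports before relying on it:

import pandas as pd

# Toy example: map string class labels to integer codes for xgboost training.
df = pd.DataFrame({"label": ["cat", "dog", "bird", "dog", "cat"]})

# Build an explicit, reversible mapping from class name to integer code.
classes = sorted(df["label"].unique())
class_to_code = {name: code for code, name in enumerate(classes)}
code_to_class = {code: name for name, code in class_to_code.items()}

df["label_code"] = df["label"].map(class_to_code).astype("int8")
print(df)
print(code_to_class)  # keep this around to decode predictions later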
Now, your XGBoost model and your dask-cudf dataframe's per-GPU allocation must fit on a single GPU, or you will get memory errors. If your model will be considering a large amount of data, please train on the cluster with the largest GPU memory you can get. A100s can have 40 GB or 80 GB. Some older compute GPUs, V100 and GV100, have 32 GB. A6000 and RTX 8000 have 48 GB. Then it goes to 24 GB, 16 GB, and lower from there. Please size your GPUs accordingly.

how to predict winner based on teammates

I am trying to create a machine learning model to predict the position of each team, but I am having trouble organizing the data in a way the model can train off of it.
I want the pandas dataframe to look something like this
Where each tournament has team members constantly shifting teams.
And based on the inputted teammates, the model makes a prediction of the team's position. Does anyone have suggestions on how I can make a pandas dataframe like this that a model can use as training data? I'm completely stumped. Thanks in advance!
Coming to the question of how to create this sheet: you can easily get the data and store it in the format you described above. The trick is in how to use it as training data for your model; we need to convert it into numerical form.

Since the max team size is 3 in most cases, we can split the three names into three columns (leaving a column blank if a team has fewer than 3 members). Then we can use either label encoding or one-hot encoding to convert the names to numbers. You should fit a LabelEncoder on a combined list of all three columns and then call transform on each column individually (since the same names may appear in any of the 3 columns); a minimal sketch is below. With label encoding we can easily use tree-based models. One-hot encoding might lead to the curse of dimensionality, since there will be many names, so I would avoid it for an initial simple model.
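A minimal sketch of the combined-fit LabelEncoder approach (the column names and the "NONE" placeholder for missing teammates are assumptions, not part of your data):

from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Hypothetical tournament data: up to 3 teammates per team plus the finishing position.
df = pd.DataFrame({
    "member1": ["alice", "bob", "carol"],
    "member2": ["dave", "erin", "alice"],
    "member3": ["frank", None, "bob"],  # blank slot when a team has fewer than 3 members
    "position": [1, 3, 2],
})

member_cols = ["member1", "member2", "member3"]
df[member_cols] = df[member_cols].fillna("NONE")  # placeholder for the empty slot

# Fit one LabelEncoder on the combined pool of names so the same player gets
# the same code no matter which column they appear in.
le = LabelEncoder()
le.fit(pd.concat([df[c] for c in member_cols]))
for c in member_cols:
    df[c] = le.transform(df[c])

X, y = df[member_cols], df["position"]
print(X)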

Does the sklearn.ensemble.GradientBoostingRegressor support sparse input samples?

I’m using sklearn.ensemble.GradientBoostingRegressor on data that is sometimes lacking some values. I can’t easily impute these data because they have a great variance and the estimate is very sensitive to them. They are also almost never 0.
The documentation of the fit method says about the first parameter X:
The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.
This has led me to think that the GradientBoostingRegressor can work with sparse input data.
But internally it calls check_array with implicit force_all_finite=True (the default), so that I get the following error if I put in a csr_matrix with NaN values:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32')
Does the GradientBoostingRegressor not actually support sparse data?
Update:
I’m lucky in that I don’t have any meaningful zeros. My calling code now looks like this:
import numpy as np
import scipy.sparse

predictors['foobar'] = predictors['foobar'].fillna(0)  # for columns that contain NaNs
predictor_matrix = scipy.sparse.csr_matrix(
    predictors.values.astype(np.float64)
)
predictor_matrix.eliminate_zeros()
model.fit(predictor_matrix, regressands)
This avoids the exception above. Unfortunately there is no eliminate_nans() method. (When I print a sparse matrix with NaNs, it lists them explicitly, so sparseness must be something other than containing NaNs.)
But the prediction performance hasn’t (noticeably) changed.
Perhaps you could try using LightGBM. Here is a discussion on Kaggle about how it handles missing values:
https://www.kaggle.com/c/home-credit-default-risk/discussion/57918
Good luck
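For illustration, a minimal sketch (toy synthetic data) showing that LightGBM accepts NaN in the feature matrix directly, with no imputation step:

import numpy as np
import lightgbm as lgb

# LightGBM's tree learner treats NaN as "missing" and learns which branch to
# send missing values down, so the NaNs can stay in the matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.2] = np.nan  # knock out ~20% of entries
y = np.nansum(X, axis=1) + rng.normal(scale=0.1, size=200)

model = lgb.LGBMRegressor(n_estimators=50)
model.fit(X, y)
print(model.predict(X[:5]))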

Confusion about how bucketized feature columns work

I have some confusion about how bucketized feature columns represent input to the model. According to the blog post on feature columns, bucketizing a feature like year puts each value into a bucket based on the defined boundaries and creates a binary vector, turning on the bucket the input value falls into. But the example in the documentation shows the output as a single integer. How is the input actually represented to the model when using a bucketized column? Can anyone clarify this for me, please?
From the dimensions of the first hidden layer of the estimator, it seems that for each feature column that is a tf.feature_column.bucketized_column, a one-hot encoded vector is created based on the boundaries.
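You can check this directly. A minimal sketch (assuming TF 2.x and the now-legacy tf.feature_column API; the year boundaries here are made up):

import tensorflow as tf

# A bucketized column with 3 boundaries produces 4 buckets.
year = tf.feature_column.numeric_column("year")
bucketized_year = tf.feature_column.bucketized_column(
    year, boundaries=[1990, 2000, 2010])

# DenseFeatures applies the column to raw input, so we can see exactly what
# the model receives: a one-hot vector per example, not a single integer.
layer = tf.keras.layers.DenseFeatures([bucketized_year])
print(layer({"year": tf.constant([[1985.0], [1995.0], [2015.0]])}))
# -> [[1,0,0,0], [0,1,0,0], [0,0,0,1]]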

A similar approach for LabelEncoder in sklearn.preprocessing?

For encoding categorical data like sex, we normally use LabelEncoder() in scikit-learn. But if I'm going to use TensorFlow instead of scikit-learn, what is the equivalent function or methodology for this task? I know that we can do one-hot encoding easily with TensorFlow, but then it will create labels as [1, 0] and [0, 1] instead of 1 and 0.
There is a module in TensorFlow called tf.feature_column that contains 4 methods for creating categorical columns from your input data:
categorical_column_with_hash_bucket(...): Hash the input value to a fixed number of categories
categorical_column_with_identity(...): If you have numeric input and you want the value itself to be treated as a categorical column
categorical_column_with_vocabulary_list(...): Outputs a category based on a fixed, in-memory list of words
categorical_column_with_vocabulary_file(...): Same as _list, but reads the vocabulary from a file
The module also provides many more ways of getting your input data to the model. For an overview, see this blog post written by the developers of the package. A minimal sketch of the vocabulary-list option is below.
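The sketch assumes TF 2.x; since tf.feature_column is now a legacy API, the Keras StringLookup layer is also shown as a newer alternative that yields plain integer IDs like sklearn's LabelEncoder:

import tensorflow as tf

# categorical_column_with_vocabulary_list maps each string to an integer
# category ID; wrapping it in an indicator_column yields one-hot vectors
# when fed through DenseFeatures.
sex = tf.feature_column.categorical_column_with_vocabulary_list(
    "sex", vocabulary_list=["male", "female"])
layer = tf.keras.layers.DenseFeatures([tf.feature_column.indicator_column(sex)])
print(layer({"sex": tf.constant([["male"], ["female"]])}))  # [[1,0],[0,1]]

# If you specifically want plain integer labels (0, 1, ...) like sklearn's
# LabelEncoder, the Keras StringLookup layer is a newer alternative:
lookup = tf.keras.layers.StringLookup(vocabulary=["male", "female"], num_oov_indices=0)
print(lookup(tf.constant(["male", "female", "male"])))  # [0, 1, 0]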