Quanteda merging unigrams and bigrams - text-mining

I want to test whether having both unigrams and bigrams in one DFM improves my document classification, so I would like to create both unigrams and bigrams in a single DFM. From there, I can get my TF-IDF-weighted DFM over both unigrams and bigrams. I could create the unigram and bigram DFMs separately and then merge them, but I would like to know if quanteda has a more efficient way of doing this. I appreciate your responses.

Found it on the quanteda documentation page. It works with something like this:
toks_1_2 <- tokens_ngrams(toks, n = 1:2)   # unigrams and bigrams in one tokens object
dfmat <- dfm_tfidf(dfm(toks_1_2))          # TF-IDF-weighted DFM over both

Related

How to predict a winner based on teammates

I am trying to create a machine learning model to predict the position of each team, but I am having trouble organizing the data in a way the model can train on.
I want the pandas dataframe to look something like this, where each tournament has team members constantly shifting teams.
Based on the inputted teammates, the model then makes a prediction of the team's position. Does anyone have suggestions on how I can build a pandas dataframe like this that a model can use as training data? I'm completely stumped. Thanks in advance!
Coming to the question of how to create this sheet: you can easily get the data and store it in the format you described above. The trick is in how to use it as training data for your model. We need to convert it to numerical form before any model can train on it.
Since the maximum team size is 3 in most cases, we can split the three names into three columns (leaving a column blank if a team has fewer than 3 members). Then we can use either label encoding or one-hot encoding to convert the names to numbers. You should fit a LabelEncoder on a combined list of all three columns and then call its transform function on each column individually (since the same names can appear in any of the 3 columns).
With label encoding we can easily use tree-based models. One-hot encoding could lead to a curse of dimensionality since there will be many names, so I would not use it for an initial simple model.
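A minimal sketch of that encoding step, on made-up data (the column names member1/member2/member3 and position are assumptions, not from the question):
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data in the layout described above
df = pd.DataFrame({
    "member1": ["alice", "bob", "carol"],
    "member2": ["dave", "erin", None],
    "member3": [None, "frank", None],
    "position": [1, 2, 3],
})

member_cols = ["member1", "member2", "member3"]
df[member_cols] = df[member_cols].fillna("")   # blank entry if the team has fewer than 3 members

# Fit one encoder on the combined list of names so a given name maps to the
# same code no matter which column it appears in, then transform each column.
le = LabelEncoder()
le.fit(pd.concat([df[c] for c in member_cols]))
for c in member_cols:
    df[c] = le.transform(df[c])

X, y = df[member_cols], df["position"]   # ready for a tree-based model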

Should I join features and targets dataframes for use with scikit-learn?

I am trying to create a regression model to predict deliverables (dataframe 2) from design parameters (dataframe 1). Both dataframes have an ID number that I use as an index.
Is it possible to use two dataframes to create a dataset for sklearn, or do I need to join them? If I need to join them, what would be the best way?
# import data
import pandas as pd

df1 = pd.read_excel('data.xlsx', sheet_name='Data1', index_col='Unnamed: 0')
df2 = pd.read_excel('data.xlsx', sheet_name='Data2', index_col='Unnamed: 0')
I have only used sklearn on a single dataframe that had all of the feature and target columns in it, so I'm not sure how to handle the case where one dataframe has the features and the other has the targets.
All estimators in scikit-learn have a signature like estimator.fit(X, y), where X holds the training features and y the training targets.
Prediction is then done by calling something like estimator.predict(X_test), with X_test being the test features.
Even train_test_split takes two arrays, X and y, as parameters.
This means that, as long as the rows stay in the same order, nothing requires you to merge features and targets.
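For example, assuming the deliverable to predict is one column of df2 (the column name here is made up), a sketch would be:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = df1                      # features dataframe
y = df2["deliverable_1"]     # one target column from the other dataframe (hypothetical name)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))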
Completely agree with Guillaume's answer.
Just be aware, as he said, of the row order. That's the key to your problem: if both dataframes are in the same order, you don't need to merge them and you can fit the model directly.
But if they are not in the same order, you have to combine both dataframes (similar to a join in SQL) in order to relate the features and targets of each ID. You can do it like this (more information here):
df_final= pd.concat([df1, df2], axis=1)
Since you used the ID as the index, it should align properly. Be aware that NaN values may appear if an ID is present in one dataframe but not the other; you will have to handle them.
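If you prefer to keep only the IDs present in both sheets, a sketch building on the concat above:
# Keep only IDs that appear in both dataframes, so no NaN rows are introduced
df_final = pd.concat([df1, df2], axis=1, join="inner")
# or keep the outer join and drop the incomplete rows afterwards:
# df_final = pd.concat([df1, df2], axis=1).dropna()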

How can I study the properties of outliers in high-dimensional data?

I have a bundle of high-dimensional data and the instances are labeled as outliers or not. I am looking to get some insights around where these outliers reside within the data. I seek to answer questions like:
Are the outliers spread far apart from each other? Or are they clustered together?
Are the outliers lying 'in-between' clusters of good data? Or are they on the 'edge' boundaries of the data?
If outliers are clustered together, how do these cluster densities compare with clusters of good data?
'Where' are the outliers?
What kind of techniques will let me find these insights? If the data were 2- or 3-dimensional, I could simply plot it and look, but I can't do that with high-dimensional data.
Analyzing the Statistical Properties of Outliers
First of all, you can choose to focus on specific features. For example, if you know a feature is subject to high variation, you can draw a box plot of it. You can also draw a 2D scatter plot if you want to focus on 2 features. This shows how much the labelled outliers vary.
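For instance, a rough sketch with matplotlib on made-up data (the feature names and the is_outlier label are assumptions):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data: two features plus a 0/1 outlier label
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["feature_a", "feature_b"])
df["is_outlier"] = (df["feature_a"].abs() > 2).astype(int)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
df.boxplot(column="feature_a", by="is_outlier", ax=ax1)            # spread of one feature per label
ax2.scatter(df["feature_a"], df["feature_b"], c=df["is_outlier"],  # 2D view coloured by label
            cmap="coolwarm", s=10)
ax2.set_xlabel("feature_a")
ax2.set_ylabel("feature_b")
plt.show()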
Next, there's a metric called the Z-score, which says how many standard deviations a point is from the mean. The Z-score is signed, so a point below the mean has a negative Z-score. This can be computed for every feature of the dataset, and you can look for the threshold in your labelled data above which points tend to be labelled as outliers.
Lastly, we can compute the interquartile range (IQR) and filter based on it in the same way. The IQR is simply the difference between the 75th and 25th percentiles, and you can use it similarly to the Z-score.
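As a sketch, with X standing in for a numeric dataframe of features and is_outlier for the given labels (both made up here):
import numpy as np
import pandas as pd

# Made-up stand-ins for the real data
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 5)))
is_outlier = pd.Series(rng.random(200) < 0.05)

# Signed Z-score of every value relative to its column mean
z = (X - X.mean()) / X.std()
print(z.abs().max(axis=1).groupby(is_outlier).describe())     # compare labelled outliers vs. the rest

# IQR per feature; flag values outside the usual 1.5 * IQR fences
q1, q3 = X.quantile(0.25), X.quantile(0.75)
iqr = q3 - q1
outside = (X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)
print(outside.any(axis=1).groupby(is_outlier).mean())         # share of flagged rows per label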
Using these techniques, we can analyze some of the statistical properties of the outliers.
If you also want to analyze the clusters, you can apply the DBSCAN algorithm to your problem. It clusters data based on density, which makes it easy to see whether the labelled outliers form their own clusters or sit in low-density regions relative to the good data.
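Continuing with the hypothetical X and is_outlier from the sketch above, one way to do that (eps and min_samples would need tuning for real data):
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Cluster the scaled features, then see where the labelled outliers ended up;
# cluster label -1 means DBSCAN itself treated the point as noise (low density)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(StandardScaler().fit_transform(X))
print(pd.crosstab(labels, is_outlier))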

Update parameters of Bayesian Network with new data

I have a Bayesian network, and I learned the CPTs (the probabilities) from existing data.
Suppose I receive a new data instance. Ideally I don't want to use all the data again to update the probabilities.
Is there a way to incrementally update the CPTs of the existing network each time new data comes in?
I think there should be, and I feel like I'm missing something :)
It's easiest to maintain the joint probability table (JPT) and rebuild the CPTs from it as needed. Along with the JPT, keep a count of how many examples were used to produce it. When adding the n-th example, multiply all probabilities by 1 - 1/n, and then add 1/n to the probability of the entry matching the new example.
If you're going to do this a lot, maintain a count of examples for each row in the JPT instead of a probability. That will cut down on numerical drift.
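A minimal sketch of the count-based version (the class and method names are made up):
from collections import Counter

class JointTable:
    """Joint table kept as raw counts; probabilities are derived on demand."""
    def __init__(self):
        self.counts = Counter()   # key: a tuple of variable values, e.g. (a, b, c)
        self.n = 0

    def add_example(self, example):
        # Absorb one new instance without revisiting the old data
        self.counts[tuple(example)] += 1
        self.n += 1

    def prob(self, assignment):
        return self.counts[tuple(assignment)] / self.n if self.n else 0.0

# This is equivalent to the probability update above: after the n-th example,
# count/n equals the old probability scaled by (1 - 1/n), plus 1/n on the
# entry matching the new example.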

How to plot a Pearson correlation given a time series?

I am using the code from this website http://blog.chrislowis.co.uk/2008/11/24/ruby-gsl-pearson.html to compute a Pearson correlation between two time series, like so:
require 'gsl'

pearson_correlation = GSL::Stats::correlation(
  GSL::Vector.alloc(first_metrics),
  GSL::Vector.alloc(second_metrics)
)
This returns a number such as -0.2352461593569471.
I'm currently using the highcharts library and am feeding it two sets of timeseries data. Given that I have a finite time series for both sets, can I do something with this number (-0.2352461593569471) to create a third time series showing the slope of this curve? If anyone can point me in the right direction I'd really appreciate it!
No, the correlation coefficient doesn't tell you anything about the slope of the line of best fit. It tells you the strength and direction of the linear relationship between the two variables (or two time series, in this case); its square is roughly the proportion of variability in one explained by the other. There is a reasonably good description here: http://www.graphpad.com/support/faqid/1141/.
How you deal with the data in your specific case depends heavily on what you're trying to achieve. Are you trying to show that variable X causes variable Y? If so, you could start by dropping the time-series aspect, treating the data as paired values, and using linear regression. If you're trying to find a model of how X and Y vary together over time, you could look at multivariate linear regression (I'm not very familiar with this, though).
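If what you actually want is the slope of a least-squares line through the paired values, a fit gives you that alongside the same Pearson r. Here is a sketch in Python/scipy on made-up data (the question's code is Ruby, so treat this only as an illustration of the idea, not as the GSL API):
import numpy as np
from scipy.stats import linregress

# Made-up paired samples standing in for the two time series
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 1.9, 2.6, 1.5, 1.8])

result = linregress(x, y)
print(result.slope, result.intercept)   # slope/intercept of the least-squares line
print(result.rvalue)                    # this is the Pearson correlation coefficient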