I am working on a multilabel dataset that is quite unbalanced, with almost 100 labels. Each row can have one to several labels, like this:
text          labels
some text     ["earth"]
another text  ["earth","car"]
text again    ["sun","earth","truck"]
From here I can get a dataframe with all possible labels and their frequencies:
import ast
import pandas as pd

labels_frequency = df.labels.map(ast.literal_eval).explode().value_counts()
out_labels = pd.DataFrame(labels_frequency).reset_index()
out_labels
And I can see that the label with the highest count has 10k records and the label with the lowest has 1k records.
I am creating my dataset using sklearn's MultiLabelBinarizer to get this:
text          label1  label2  ...  label100
some text     0       0       ...  1
another text  1       1       ...  0
text again    0       1       ...  0
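For reference, a minimal sketch of that binarization step (assuming the parsed label lists live in the labels column, continuing from the imports above):

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
binarized = pd.DataFrame(mlb.fit_transform(df.labels.map(ast.literal_eval)),
                         columns=mlb.classes_, index=df.index)
df_bin = pd.concat([df[["text"]], binarized], axis=1)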
What I need from here:
I want to undersample this dataset in such a way that each label ends up with the count of the rarest label; in this example, that would be 1k records of each label. But as I said above, a row can have more than one label.
So, what's the best way to tackle this problem?
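To make the goal concrete, here is a naive greedy sketch of the kind of thing I have in mind (hypothetical code; Y is the binarized 0/1 matrix from above, and target would be 1k):

import numpy as np

def greedy_undersample(Y, target, seed=0):
    # Visit the rarest labels first; keep sampling rows that carry a label
    # until that label has about `target` positives among the kept rows.
    rng = np.random.default_rng(seed)
    kept = np.zeros(len(Y), dtype=bool)
    for lbl in np.argsort(Y.sum(axis=0)):
        need = target - int(Y[kept, lbl].sum())
        pool = np.flatnonzero((Y[:, lbl] == 1) & ~kept)
        if need > 0 and len(pool):
            kept[rng.choice(pool, size=min(need, len(pool)), replace=False)] = True
    return np.flatnonzero(kept)  # indices of rows to keep

Because each row carries several labels, the frequent labels still end up well over the 1k target, and that is exactly the part I don't know how to handle.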
I have a list and a data frame. I want to find the number of occurrences of each word in the list (some entries in the list are pairs of words) for each "emotions" value in the data frame.
Here is my list:
[(frozenset({'know'}), 16528),
(frozenset({'im'}), 39047),
(frozenset({'feeling'}), 99455),
(frozenset({'like'}), 49332),
(frozenset({'feel', 'im'}), 16602),
(frozenset({'feeling', 'im'}), 23488),
(frozenset({'feel'}), 202985),
(frozenset({'feel', 'like'}), 42162),
(frozenset({'time'}), 17203),
(frozenset({'really'}), 17247)]
and this is my data frame:
Unnamed: 0 id text emotions
0 0 27383 [feel, awful, job, get, position, succeed, hap... sadness
1 1 110083 [im, alone, feel, awful] sadness
2 2 140764 [ive, probably, mentioned, really, feel, proud... joy
3 3 100071 [feeling, little, low, day, back] sadness
4 4 2837 [beleive, much, sensitive, people, feeling, te... love
Here is the expected output:
6 columns for the six existing emotions, and a last column for the total count.
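For concreteness, a minimal sketch of one way I picture building that table (assuming the list above is named word_counts and the frame is df; frozenset.issubset checks that every word of an entry occurs in a row's token list):

import pandas as pd

rows = {}
for words, _ in word_counts:
    mask = df["text"].apply(lambda toks: words.issubset(toks))
    per_emotion = df.loc[mask, "emotions"].value_counts()
    row = per_emotion.reindex(df["emotions"].unique(), fill_value=0)
    row["total"] = int(mask.sum())
    rows[" ".join(sorted(words))] = row
result = pd.DataFrame(rows).T  # one row per word/pair, one column per emotion + total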
I am currently trying to find optimal portfolio weights by optimizing a utility function that depends on those weights. I have a dataframe containing the time series of returns, named rets_optns. rets_optns has 100 groups of 8 assets (800 columns: the 1st group is columns 1 to 8, the 2nd group columns 9 to 16, and so on). I also have a dataframe named rf_optns with 100 columns that holds the corresponding risk-free rate for each group of returns. I want to create a new dataframe of portfolio returns, using this formula: p_returns = rf_optns + sum(weights * rets_optns). It should have 100 columns, and each column should represent the returns of a portfolio composed of the 8 assets belonging to the same group. I currently have:
import numpy as np

def pret(rf, weights, rets):
    return rf + np.sum(weights * (rets - rf))
It does not work.
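For reference, a hedged sketch of the group-wise version I think the formula implies (assuming the same 8 weights are reused for every group and the columns are ordered group by group):

import numpy as np
import pandas as pd

def pret(rf, weights, rets):
    # rf: (T, 100) risk-free rates; rets: (T, 800) asset returns;
    # weights: length-8 vector applied within each group of 8 columns.
    w = np.asarray(weights)
    out = {}
    for g in range(rf.shape[1]):
        block = rets.iloc[:, 8 * g:8 * (g + 1)]    # group g's 8 assets
        excess = block.sub(rf.iloc[:, g], axis=0)  # rets - rf
        out[g] = rf.iloc[:, g] + excess.to_numpy() @ w
    return pd.DataFrame(out, index=rets.index)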
I have a .csv file that has hundreds of thousands of lines. The information was collected in order, user by user.
For example, one user's inputs may span 20-400 rows, and the corresponding target is a single row where that user's first input row started.
inputs | Targets
0      | 7
1      |
2      |
3      |
4      |
So there is one set of targets per x input rows.
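To make that structure concrete, a sketch of how the target could be propagated down each block (hypothetical file and column names; ffill is a standard pandas method):

import pandas as pd

df = pd.read_csv("data.csv")
# The target sits on the first row of each user's block; copy it down
# over that user's following input rows so every row has a label.
df["Targets"] = df["Targets"].ffill()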
Some of my columns contain '-'. I feel like this will mess up my model when trying to train, considering it isn't a float or an int. What should I do?
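One option I have seen is to tell pandas to treat '-' as missing at read time (na_values is a standard read_csv argument) and then impute or drop:

df = pd.read_csv("data.csv", na_values=["-"])  # '-' cells become NaN
df = df.fillna(0)                              # or dropna(), or an imputer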
Also, should I shuffle my data if it is chunked like this?
I have a dataset that I shaped according to my needs; the dataframe is as follows:
Index      A     B  C  D     ...  Z
Date/Time  1     0  0  0,35  ...  1
Date/Time  0,75  1  1  1     ...  1
The total number of rows is 8878
What I am trying to do is create a time-series dendrogram (example: the whole A column is compared to the whole B column over the whole time range).
I am expecting an output like this:
[example dendrogram image; source: rsc.org]
I tried to construct the linkage matrix with Z = hierarchy.linkage(X, 'ward')
However, when I plot the dendrogram, it just shows an empty picture.
There is no problem if I compare every time point with each other and plot, but that way the dendrogram becomes far too complicated to read, even in truncated form.
Is there a way to handle the data as a whole time series and compare within columns in SciPy?
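For reference, a minimal sketch of the direction I have in mind (assuming the dataframe is named df, with the 8878 time points as rows and the columns A..Z as the series to be clustered):

from scipy.cluster import hierarchy
import matplotlib.pyplot as plt

X = df.to_numpy().T               # one row per series: shape (26, 8878)
Z = hierarchy.linkage(X, 'ward')  # each whole column is one observation
hierarchy.dendrogram(Z, labels=list(df.columns))
plt.show()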