Combine duplicate rows with different values into one - dataframe

I'm trying to find a solution to rearrange my data frame. Currently more than half of my rows are duplicates for a single object, and I would like to combine them into one. A fraction of my dataset is shown below:
#NAME Sample1 Sample2 Sample3 sample4 Sample5
AAC(6')-Ib7 5 0 0 0 0
AAC(6')-Ib7 0 3 0 0 25
AAC(6')-Ib7 0 0 0 0 0
AAC(6')-Ib7 0 0 0 10 0
AAC(6')-Ib7 0 0 0 0 0
And I would like to have the output:
#NAME Sample1 Sample2 Sample3 sample4 Sample5
AAC(6')-Ib7 5 3 0 10 25
Can you give me any tips on how to rearrange it?
My original dataset has more than 7000 rows, but most of them are duplicates (there should be around 800 unique rows). Do I have to do it for each value separately?
I would appreciate your help!
Thank you.
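A minimal sketch of one way to do this (assuming, as in the sample, that the duplicate rows are zero everywhere except their own sample column): group by the name column and sum, so the single non-zero value per column survives.

import pandas as pd

# rebuild the fraction of the dataset shown above
df = pd.DataFrame({
    '#NAME': ["AAC(6')-Ib7"] * 5,
    'Sample1': [5, 0, 0, 0, 0],
    'Sample2': [0, 3, 0, 0, 0],
    'Sample3': [0, 0, 0, 0, 0],
    'sample4': [0, 0, 0, 10, 0],
    'Sample5': [0, 25, 0, 0, 0],
})

# collapse all rows that share a name into one; sum (or max) keeps
# the single non-zero value each column contains
out = df.groupby('#NAME', as_index=False).sum()
print(out)
#          #NAME  Sample1  Sample2  Sample3  sample4  Sample5
# 0  AAC(6')-Ib7        5        3        0       10       25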

Related

Handling features with multiple values per instance in Python for Machine Learning model

I am trying to handle my data set, which contains some features that have multiple values per instance, as shown in the image:
https://i.stack.imgur.com/D78el.png
I am trying to separate the values on the '|' symbol in order to apply the one-hot encoding technique, but I can't find a suitable solution to my problem.
My idea is to keep the multiple values in one row, or in other words, to convert each cell to a list of integers.
Maybe this is what you want:
import pandas as pd

df = pd.DataFrame(['465','444','465','864|857|850|843'], columns=['genre_ids'])
df
genre_ids
0 465
1 444
2 465
3 864|857|850|843
df['genre_ids'].str.get_dummies(sep='|')
444 465 843 850 857 864
0 0 1 0 0 0 0
1 1 0 0 0 0 0
2 0 1 0 0 0 0
3 0 0 1 1 1 1
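If you want the indicator columns alongside the original column, a small follow-up (not part of the original answer) is to join them back:

df = df.join(df['genre_ids'].str.get_dummies(sep='|'))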

Pandas dataframe aggregating data into counts per group

I'm new to pandas and was looking for some advice on how to reshape my pandas dataframe:
Currently, I have a dataframe like this.
panelist_id type refer_sm refer_se refer_non_n
1 HP 1 0 0
1 HP 1 0 0
1 HP 0 0 1
1 PB 0 1 0
2 PB 0 1 0
2 PB 1 0 0
2 HP 1 0 0
Ideally, I want to group by panelist_id, and aggregate the other columns by count:
panelist_id type type_count refer_sm_count refer_se_count refer_non_n_count
1 HP 2 2 1 1
1 PB 1 0 1 0
2 HP 1 1 0 0
2 PB 2 1 1 0
I've tried using groupby to group by panelist_id, which works; however, I'm a little stuck on the aggregation part. Any help would be much appreciated.
Something like this?

result = df.groupby(['panelist_id', 'type']).agg(
    type_count=('type', 'size'),
    refer_sm_count=('refer_sm', 'sum'),
    refer_se_count=('refer_se', 'sum'),
    refer_non_n_count=('refer_non_n', 'sum'),
)
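A small follow-up: agg keeps panelist_id and type as a MultiIndex on the result, so to get them back as ordinary columns, as in the expected output, chain reset_index():

result = result.reset_index()  # the group keys become regular columns again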

One hot encoding a multi-valued categorical column where not all categories are represented

I have a column in a dataset called 'Crop', that represents the crops grown in a field over a period of time. The column might have a single string, like Cotton, or it may have multiple strings, like Cotton, Soy. And, depending on the dataset, there may be crops that are categories, but not represented in the particular dataset I'm training with at the time.
I've tried this:
possible_categories = list(['Corn', 'Sorghum', 'Hemp', 'Cotton', 'Soy'])
#df = (X.Crop).str.split(', ', expand=True)
#ohe_crop = pd.get_dummies(df, columns=possible_categories, sparse=True)
#print(ohe_crop)
X.Crop = (X.Crop).astype(pd.CategoricalDtype(categories=possible_categories))
ohe_crop = pd.get_dummies(X.Crop, columns=possible_categories, sparse=True)
which yields this:
Corn Sorghum Hemp Cotton Soy
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
... ... ... ... ...
35512 0 0 0 0 1
35513 0 0 0 0 1
35514 0 0 0 0 1
35515 0 0 0 0 1
35516 0 0 0 0 1
[35517 rows x 5 columns]
In reality, the 1st row of Crop was Cotton, Soy, Sorghum so I expected:
Corn Sorghum Hemp Cotton Soy
0 0 1 0 1 1
I think what happened here is that get_dummies() created dummy columns for the distinct combination strings found in the crop data, e.g.:
Corn, Cotton, Soy
Corn, Soy
Cotton
Hemp, Cotton, Soy
Hemp, Soy
Soy
so unless a field's Crop value exactly matched one of these strings, the row got a 0.
I'd like to specify the possible categories, split Crop into multiple columns delimited by the commas that are in the rows, and then be able to populate multiple columns if there were multiple crops grown, but I can't figure out how to make it happen. Any advice?
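One approach that might do what's described here (a sketch, not from the thread; it assumes the delimiter is exactly ', ' and uses a made-up stand-in for X.Crop): split with str.get_dummies, then reindex the columns against the full category list so crops that never occur still get all-zero columns.

import pandas as pd

possible_categories = ['Corn', 'Sorghum', 'Hemp', 'Cotton', 'Soy']

# hypothetical stand-in for X.Crop
crop = pd.Series(['Cotton, Soy, Sorghum', 'Cotton', 'Soy'])

# one indicator column per individual crop, then add all-zero
# columns for categories absent from this particular dataset
ohe_crop = (crop.str.get_dummies(sep=', ')
                .reindex(columns=possible_categories, fill_value=0))
print(ohe_crop)
#    Corn  Sorghum  Hemp  Cotton  Soy
# 0     0        1     0       1    1
# 1     0        0     0       1    0
# 2     0        0     0       0    1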

How to split a column in a data frame containing only numbers into multiple columns in pandas

I have a .dat file containing the following data:
0001100000101010100
110101000001111
101100011001110111
0111111010100
1010111111100011
I need to count the number of zeros and ones in each row.
I have tried with pandas:
Step 1: Read the data file.
Step 2: Give the column a name.
Step 3: Try to split the values into multiple columns, but I could not succeed.
df1 = pd.read_csv('data.dat', header=None)
df1.head()
0 1100000101010100
1 110101000001111
2 101100011001110111
3 111111010100
4 1010111111100011
df1.columns = ['kirti']
df1.head()
kirti
0 1100000101010100
1 110101000001111
2 101100011001110111
3 111111010100
4 1010111111100011
I need to split the data frame into multiple columns depending on the 0s and 1s in each row.
The maximum number of columns should equal the maximum number of digits in any row of the data frame.
First create a one-column DataFrame, using the names parameter and dtype=str so the column is read as strings:
import pandas as pd
from io import StringIO

temp = """0001100000101010100
110101000001111
101100011001110111
0111111010100
1010111111100011"""
# after testing, replace 'StringIO(temp)' with the filename 'data.dat'
df = pd.read_csv(StringIO(temp), header=None, names=['kirti'], dtype=str)
print(df)
print (df)
kirti
0 0001100000101010100
1 110101000001111
2 101100011001110111
3 0111111010100
4 1010111111100011
Then create a new DataFrame by converting the values to lists:
df = pd.DataFrame([list(x) for x in df['kirti']])
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0
1 1 1 0 1 0 1 0 0 0 0 0 1 1 1 1 None None None None
2 1 0 1 1 0 0 0 1 1 0 0 1 1 1 0 1 1 1 None
3 0 1 1 1 1 1 1 0 1 0 1 0 0 None None None None None None
4 1 0 1 0 1 1 1 1 1 1 1 0 0 0 1 1 None None None
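From here the per-row counts the question asks for can be read off directly (a small follow-up, not in the original answer; the None cells padding the shorter rows match neither digit):

zeros = (df == '0').sum(axis=1)
ones = (df == '1').sum(axis=1)
print(zeros.tolist())  # [13, 7, 7, 5, 5]
print(ones.tolist())   # [6, 8, 11, 8, 11]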
If your data is in a list of strings, then use the count method:
>>> data = ["0001100000101010100", "110101000001111", "101100011001110111", "0111111010100", "1010111111100011"]
>>> for i in data:
...     print(i.count("0"))
13
7
7
5
5
If your data is in a .dat file with whitespace separation as you described, then I would recommend loading your data as follows:
data = pd.read_csv("data.dat", lineterminator=" ",dtype="str", header=None, names=["Kirti"])
Kirti
0 0001100000101010100
1 110101000001111
2 101100011001110111
3 0111111010100
4 1010111111100011
The lineterminator argument ensures that every entry goes into its own row. The dtype argument ensures that the values are read as strings; otherwise you will lose the leading zeros.
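A quick illustration of that last point (a minimal sketch with a single value): parsed with the default dtype, the digits become an integer and the leading zeros silently disappear.

from io import StringIO
import pandas as pd

raw = "0001100000101010100"
print(pd.read_csv(StringIO(raw), header=None).iloc[0, 0])             # 1100000101010100
print(pd.read_csv(StringIO(raw), header=None, dtype=str).iloc[0, 0])  # 0001100000101010100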
If your data is in a DataFrame, you can use the count method (inspired from here):
>> data["Kirti"].str.count("0")
0 13
1 7
2 7
3 5
4 5
Name: Kirti, dtype: int64
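Counting the ones works the same way, so both counts can be attached as new columns (a small extension of the answer above):

data["zeros"] = data["Kirti"].str.count("0")
data["ones"] = data["Kirti"].str.count("1")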

How to use a list of categories that an example belongs to as a feature when solving a classification problem?

One of the features looks like this:
1 170,169,205,174,173,246,247,249,380,377,383,38...
2 448,104,239,277,276,99,154,155,76,412,139,333,...
3 268,422,419,124,1,17,431,343,341,435,130,331,5...
4 50,53,449,106,279,420,161,74,123,364,231,18,23...
5 170,169,205,174,173,246,247,249,380,377,383,38...
It tells us what categories the example belongs to.
How should I use it while solving classification problem?
I've tried to use dummy variables,
df=df.join(features['cat'].str.get_dummies(',').add_prefix('contains_'))
but we don't know whether there are other categories that were not mentioned in the training set, so I do not know how to preprocess all the objects.
That's interesting. I didn't know str.get_dummies, but maybe I can help you with the rest.
You basically have two problems:
The set of categories you get later contains categories that were unknown while training the model. You have to get rid of these later.
The set of categories you get later does not contain all categories. You have to make sure you generate dummies for them as well.
Problem 1: filtering out unknown/unwanted categories
The first problem is easy to solve:
# create a set of all categories you want to allow;
# either define it as a fixed set, or extract it from your
# column like this (the output of the map is actually irrelevant,
# the result is collected in valid_categories)
valid_categories = set()
df['categories'].str.split(',').map(valid_categories.update)
# now if you want to normalize your data before you do the
# dummy encoding, you can cleanse the data by
# splitting it, creating an intersection and then joining
# it back again to get a string on which you can work with
# str.get_dummies
df['categories'].str.split(',').map(lambda l: valid_categories.intersection(l)).str.join(',')
Problem 2: generating dummies for all known categories
The second problem can be solved by adding a dummy row that contains all categories (e.g. with df.loc, as below) just before you call get_dummies, and removing it right after get_dummies.
# e.g. you can do it like this:
# get a new index value so the row can be removed again later
# (this only works if you have a numeric index)
dummy_index = df.index.max() + 1

# assign all valid categories to the dummy row
df.loc[dummy_index] = {'id': 999, 'categories': ','.join(valid_categories)}

# now do the processing steps mentioned in the section above and
# create the dummies; after that, remove the dummy line again
df.drop(labels=[dummy_index], inplace=True)
Example:
import io
import pandas as pd
raw= """id categories
1 170,169,205,174,173,246,247
2 448,104,239,277,276,99,154
3 268,422,419,124,1,17,431,343
4 50,53,449,106,279,420,161,74
5 170,169,205,174,173,246,247"""
df= pd.read_fwf(io.StringIO(raw))
valid_categories= set()
df['categories'].str.split(',').map(valid_categories.update)
# remove 154 and 170 for demonstration purposes
valid_categories.remove('170')
valid_categories.remove('154')
df['categories'].str.split(',').map(lambda l: valid_categories.intersection(l)).str.join(',').str.get_dummies(',')
Out[622]:
1 104 106 124 161 169 17 173 174 205 239 246 247 268 276 277 279 343 419 420 422 431 448 449 50 53 74 99
0 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1
2 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 1 0 1 1 0 0 0 0 0 0
3 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 1 1 0
4 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
You can see that there are no columns for 154 and 170.