How to replace values with binary (0/1) in Pandas for network data?

I have a CSV file of captured network traffic with 75 columns and 300k rows.
I am playing with the data to apply ML. I need to convert IP addresses to 1 and 0 according to whether they are internal or external.
So:
10.0.2.* -> 0
others -> 1
Is there an easy way to do this?
I was doing it manually with the replace method:
df['SrcAddr'] = df['SrcAddr'].replace(['10.0.2.15', '10.0.2.2'], [0, 0])

IIUC, you can use:
df['SrcAddr'] = df['SrcAddr'].str.startswith('10.0.2.').rsub(1)
or with a regex:
df['SrcAddr'] = df['SrcAddr'].str.fullmatch(r'10\.0\.2\.\d+').rsub(1)
How it works: the string method returns True for each match; rsub(1) then computes 1 - True -> 0, and for each non-match 1 - False -> 1.
Alternative with np.where if you want to map to any two values:
df['SrcAddr'] = np.where(df['SrcAddr'].str.startswith('10.0.2.'), 0, 1)
Example (as a new column):
     SrcAddr  SrcAddr2
0  10.0.2.42         0
1    8.8.8.8         1
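Putting it together, a minimal runnable sketch of both approaches (the column name and sample addresses are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'SrcAddr': ['10.0.2.42', '8.8.8.8', '10.0.2.15']})

# boolean mask: True for internal (10.0.2.*) addresses
internal = df['SrcAddr'].str.startswith('10.0.2.')

df['SrcAddr2'] = internal.rsub(1)          # 1 - True -> 0, 1 - False -> 1
df['SrcAddr3'] = np.where(internal, 0, 1)  # same result, any values possible

print(df)
#      SrcAddr  SrcAddr2  SrcAddr3
# 0  10.0.2.42         0         0
# 1    8.8.8.8         1         1
# 2  10.0.2.15         0         0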

Related

Multilabel Encoder takes whole value instead of array

I'm working on a dataset with a Tags column extracted from a Stack Overflow dataset.
I need to encode these tags to perform tag prediction using a title and body.
I'm stuck with this encoding and can't get what I need.
Here's a preview of my column:
Tags
['python', 'authentication', 'login', 'flask', 'python-2.x']
['c++', 'vector', 'c++11', 'move', 'deque']
...
And what I'm doing so far:
y_classes = pd.get_dummies(df.Tags)
y_classes
   ['.net', 'asp.net-mvc', 'visual-studio', 'asp.net-mvc-4', 'intellisense']  \
0                                                                          0
1                                                                          0
2                                                                          0

   ['.net', 'asp.net-mvc-3', 'linq', 'entity-framework', 'entity-framework-5']  ...
0                                                                            0  ...
1                                                                            0  ...
2                                                                            0  ...
As you can see, I get one column for each unique array of tags, when I need one column for each tag.
I tried multiple solutions found on Stack Overflow but none worked.
EDIT: I also tried MultiLabelBinarizer from sklearn.preprocessing, and I got a column for each unique character of the Tags column.
How can I make this work?
OK, so I figured out how to fix this problem myself, so here is my solution:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

tags_array = df['Tags'].to_numpy()
df2 = pd.DataFrame(tags_array, columns=['Tags'])
# CountVectorizer tokenizes each Tags entry and builds one column per token
coun_vect = CountVectorizer()
count_matrix = coun_vect.fit_transform(df2["Tags"])
count_array = count_matrix.toarray()
df2 = pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names())
print(df2)
output:
   ajax  algorithm  amazon  android  angular  ...
0     0          0       0        1        0  ...
1     1          1       0        0        0  ...
2     0          0       1        0        1  ...
..  ...        ...     ...      ...      ...  ...
Edit:
As @OllieStanley said in a comment, it could have worked with MultiLabelBinarizer; the problem was that each Tags entry was treated as a single string rather than a list, and it could be solved by using sets or nested lists instead.
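For reference, a minimal sketch of the MultiLabelBinarizer route, assuming the Tags column holds real lists (if the tags were saved as their string representation, ast.literal_eval can parse them back into lists first):

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'Tags': [['python', 'flask', 'login'],
                            ['c++', 'vector', 'c++11']]})

mlb = MultiLabelBinarizer()
# fit on lists of tags; fitting on plain strings is what produces
# one column per character instead of one per tag
y_classes = pd.DataFrame(mlb.fit_transform(df['Tags']),
                         columns=mlb.classes_, index=df.index)
print(y_classes)
#    c++  c++11  flask  login  python  vector
# 0    0      0      1      1       1       0
# 1    1      1      0      0       0       1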

Count search terms in strings, mapped to another value

I have a pandas dataframe which includes columns like this (amongst others), with RATING being integers 0 to 5 and COMMENT being a string:
RATING  COMMENT
1       some text
2       more text
3       other text
...     ...
I would now like to mine (for lack of a better word) the comments for the key words in a list of strings:
keywords = ['like', 'trust', 'etc etc etc']
and would like to iterate through COMMENT, counting the number of key words by rating, to get a df out like so:
KEYWORD  RATING  COUNT
like     1       202
like     2       325
like     3       0
like     4       967
like     5       534
...
trust    1       126
...
How can I achieve this?
I am a beginner, so I would really appreciate your help (and the simpler and more understandable, the better).
Thank you.
Hi, at the moment I have been iterating through manually, i.e.:
# DATA_df is the original data
word_list = ['word', 'words', 'words', 'more']
# one row per rating (1-5), one column per word, initialised to zero
rating_table = pd.DataFrame(0, index=range(5), columns=word_list)
for w in word_list:
    for g in range(len(DATA_df['COMMENT'])):
        if w in DATA_df['COMMENT'][g]:
            rating_table[w][DATA_df['RATING'][g] - 1] += 1
This gives a DF like so:
   word  words  words  more
0     0      0      0     0
1     0      0      0     0
2     0      0      0     0
3     0      0      0     0
4     0      0      0     0
which I am then trying to add to... it appears really clunky.
I managed to solve it. Key points learnt: use groupby to pre-select the data based on the rating; this slices the data and makes it possible to iterate through the groups. The use of str.lower() in combination with str.count() also worked well.
I would be thankful if more experienced programmers could show me a better solution, but at least this works.
rating = [1, 2, 3, 4, 5]
rategroup = tp_clean.groupby('Rating')
#print(rategroup.groups)
results_list = []
for w in word_list:
    current = [w]
    for r in rating:
        stargroup = rategroup.get_group(str(r))
        found = stargroup['Content'].str.lower().str.count(w)
        c = found.sum()
        current.append(c)
    results_list.append(current)
results_df = pd.DataFrame(results_list, columns=['Keyword', '1 Star', '2 Star', '3 Star', '4 Star', '5 Star'])
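For comparison, a sketch of a more compact version of the same idea (not from the original answer; it assumes the same tp_clean frame with 'Rating' and 'Content' columns): str.count is vectorized over the whole column, so only the loop over keywords remains, and groupby handles the per-rating sums:

results_df = pd.DataFrame({
    w: tp_clean['Content'].str.lower().str.count(w)
         .groupby(tp_clean['Rating']).sum()
    for w in word_list
}).T  # keywords as rows, ratings as columns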
The one thing I am still struggling with is how to use regex to make it look for full words. I believe \b is the right token, but how do I put it into the str.count function?
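Series.str.count interprets its argument as a regular expression, so the word-boundary token can simply be concatenated around each keyword. A sketch of how the counting line inside the loop above could be adapted (re.escape guards against keywords containing regex metacharacters):

import re

# count whole-word occurrences of w instead of substring matches
found = stargroup['Content'].str.lower().str.count(r'\b' + re.escape(w) + r'\b')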

How can I rename a column that contains special (Greek) characters

I have a dataframe, and early in my script I name my columns using:
beta = 1.17
names =np.arange((beta-0.05),(beta+0.05),.01)
dfs.columns = [r'$\beta$'+str(i) for i in names]
Later in the script I want to replace r'$\beta$' with ats.
I have tried the following:
dfs.columns = dfs.columns.str.replace("[(r'$\beta$')]", "ats")
But it isn't working as expected. Any suggestions are appreciated.
Thanks.
You need to escape the special regex characters ($ and \):
beta = 1.17
names =np.arange((beta-0.05),(beta+0.05),.01)
dfs = pd.DataFrame(0, columns=names, index=[0])
dfs.columns = [r'$\beta$'+str(i) for i in names]
dfs.columns = dfs.columns.str.replace(r'\$\\beta\$', "ats", regex=True)
print(dfs)
   ats1.1199999999999999  ats1.13  ats1.14  ats1.15  ats1.16  ats1.17  \
0                      0        0        0        0        0        0

   ats1.18  ats1.19  ats1.2  ats1.21  ats1.22
0        0        0       0        0        0
Or, using str.replace twice:
df.columns.str.replace('[^0-9a-zA-Z.]+', "", regex=True).str.replace('beta', 'ats')
Index(['ats1.1199999999999999', 'ats1.13', 'ats1.14', 'ats1.15', 'ats1.16',
'ats1.17', 'ats1.18', 'ats1.19', 'ats1.2', 'ats1.21', 'ats1.22'],
dtype='object')
The problem seems to be that pandas.Series.str.replace treats the pattern as a regex, so you need backslashes to match the special characters literally rather than as metacharacters:
df.columns.str.replace(r"\$\\beta\$", "ats", regex=True)
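A self-contained sketch tying the fix together; regex=True is spelled out because the default for str.replace changed to regex=False in pandas 2.0, and np.round keeps float artifacts like 1.1199999999999999 out of the column names:

import numpy as np
import pandas as pd

beta = 1.17
names = np.round(np.arange(beta - 0.05, beta + 0.05, .01), 2)
dfs = pd.DataFrame(0, columns=[r'$\beta$' + str(i) for i in names], index=[0])

# escape $ and \ so they are matched literally
dfs.columns = dfs.columns.str.replace(r'\$\\beta\$', 'ats', regex=True)
print(dfs.columns.tolist())
# e.g. ['ats1.12', 'ats1.13', 'ats1.14', ...]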

Reverse MS-Access Format Function

I have a field within an Access 2007 database which contains either a 0 or a 1.
When displaying a view, I need to format the field as Yes/No.
My issue is that I can't use FORMAT(Field, "Yes/No"), as the 1 and 0 are the wrong way round, i.e.:
0 = No, 1 = Yes is how the FORMAT function works.
1 = No, 0 = Yes is how my data is stored.
Is there any way to reverse or manipulate the FORMAT function so that when a query is run, it will display my Yes/No the correct way round?
FORMAT(ABS(Field-1), "Yes/No")
This works because ABS(1-1) = 0 and ABS(0-1) = 1. In other words, your 0 -> 1 and 1 -> 0, so it changes the number to have the "right" value (as far as MS Access is concerned) before using the FORMAT function.

What's the R equivalent of SQL's LIKE 'description%' statement?

Not sure how else to ask this, but I want to search for a term within several string elements. Here's what my code looks like (but it's wrong):
inplay = vector(length=nrow(des))
for (ii in 1:nrow(des)) {
  if (des[ii] = 'In play%')
    inplay[ii] = 1
  else inplay[ii] = 0
}
des is a vector that stores strings such as "Swinging Strike", "In play (run(s))", "In play (out(s) recorded)" and etc. What I want inplay to store is a 1s and 0s vector corresponding with the des vector, with the 1s in inplay indicating that the des value had "In play%" in it and 0s otherwise.
I believe the 3rd line is incorrect, because all this does is return a vector of 0s with a 1 in the last element.
Thanks in advance!
The data.table package has syntax that is often similar to SQL. The package includes %like%, which is a "convenience function for calling regexpr". Here is an example taken from its help file:
## Create the data.table:
DT = data.table(Name=c("Mary","George","Martha"), Salary=c(2,3,4))
## Subset the DT table where the Name column is like "Mar%":
DT[Name %like% "^Mar"]
## Name Salary
## 1: Mary 2
## 2: Martha 4
The R analog to SQL's LIKE is just R's ordinary indexing syntax.
The LIKE operator selects data rows from a table by matching string values in a specified column against a user-supplied pattern:
> # create a data frame having a character column
> clrs = c("blue", "black", "brown", "beige", "berry", "bronze", "blue-green", "blueberry")
> dfx = data.frame(Velocity=sample(100, 8), Colors=clrs)
> dfx
Velocity Colors
1 90 blue
2 94 black
3 71 brown
4 36 beige
5 75 berry
6 2 bronze
7 89 blue-green
8 93 blueberry
> # create a pattern to use (the same as you would do when using the LIKE operator)
> ptn = '^be.*?' # gets beige and berry but not blueberry
> # execute a pattern-matching function on your data to create an index vector
> ndx = grep(ptn, dfx$Colors, perl=T)
> # use this index vector to extract the rows you want from the data frame:
> selected_rows = dfx[ndx,]
> selected_rows
Velocity Colors
4 36 beige
5 75 berry
In SQL, that would be:
SELECT * FROM dfx WHERE Colors LIKE 'be%'
Something like regexpr?
> d <- c("Swinging Strike", "In play (run(s))", "In play (out(s) recorded)")
> regexpr('In play', d)
[1] -1 1 1
attr(,"match.length")
[1] -1 7 7
>
or grep
> grep('In play', d)
[1] 2 3
>
Since stringr 1.5.0, you can use str_like, which follows the structure of SQL's LIKE:
library(stringr)
fruit <- c("apple", "banana", "pear", "pineapple")
str_like(fruit, "app%")
#[1] TRUE FALSE FALSE FALSE
Not only does it include %, but also several other operators (see ?str_like):
Must match the entire string
_ matches a single character (like .)
% matches any number of characters (like .*)
\% and \_ match literal % and _
The match is case insensitive by default