I'm working on cleaning up an Excel document. One of the columns (df_i['Email']) contains email addresses, and I need to mark Gmail and Yahoo emails by adding a comment to a comments column. I created the exclusion list, but for some reason it only works if I specify the index of the email to be excluded.
input
emails_to_exclude = ('@gmail', '@yahoo')
df_i['Comments'] = np.where(df_i['Email'].str.contains(emails_to_exclude[0], case=False), 'to be deleted', '')
print(df_i['Comments'])
output
0
1
2
3
4
5
6
7
8
9
10 to be deleted
11
12
13
This is because str.contains cannot work with a list or tuple; you need to make use of a regular expression, joining the values with an OR, represented by a pipe |.
For your example (and please provide a sample of your data next time):
df_i = pd.DataFrame({'Email': ['john@yahoo.com', 'john@outlook.com', 'john@gmail.com']})
emails_to_exclude = ('@gmail', '@yahoo')
df_i.loc[df_i['Email'].str.contains('|'.join(emails_to_exclude)), 'comments'] = 'to be deleted'
print(df_i)
Email comments
0 john@yahoo.com to be deleted
1 john@outlook.com NaN
2 john@gmail.com to be deleted
You can fill the NaN values with an empty string like so:
df_i['comments'] = df_i['comments'].fillna('')
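If you would rather keep the np.where approach from the question, the same pipe-joined pattern works there too; here is a minimal sketch using the sample frame above:
import numpy as np
import pandas as pd

df_i = pd.DataFrame({'Email': ['john@yahoo.com', 'john@outlook.com', 'john@gmail.com']})
emails_to_exclude = ('@gmail', '@yahoo')

# str.contains accepts a regex, so '@gmail|@yahoo' matches either domain
pattern = '|'.join(emails_to_exclude)
df_i['Comments'] = np.where(df_i['Email'].str.contains(pattern, case=False),
                            'to be deleted', '')
print(df_i)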
INPUT
A B C
0 1 2 3
1 4 ? 6
2 7 8 ?
... ... ... ...
551 4 4 6
552 3 7 9
There might be a '?' somewhere in between that is hard to detect. I tried doing it with
pd.to_numeric, errors='coerce'
but printing only shows the first 5 and last 5 rows, and I can't check all rows/columns for special characters by hand.
So how do I actually deal with this problem and make the dataset clean?
Once detected, I know how to remove those values and fill them with their respective column means, so that's not an issue.
I'm new to Stack Overflow and switching from a non-IT field.
The below is an easy way, using str.count with a regex character class of the special characters:
special = r'[@_!#$%^&*()<>?/\|}{~:]'
df['B'].str.count(special)
Please refer to the linked regex reference to do it using a fuller regex approach.
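If you need to scan every column at once rather than eyeballing them one at a time, here is a minimal sketch (the sample frame is hypothetical, mirroring the INPUT above):
import pandas as pd

df = pd.DataFrame({'A': ['1', '4', '7'],
                   'B': ['2', '?', '8'],
                   'C': ['3', '6', '?']})

# Coerce everything to numeric: any cell that is not a number (like '?')
# becomes NaN, so the NaN mask shows exactly where the stray values are.
mask = df.apply(pd.to_numeric, errors='coerce').isna()
print(mask.sum())            # count of bad cells per column
print(df[mask.any(axis=1)])  # the offending rows in full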
I have a dataframe with words and entities and would like to create a third column which keeps a sentence count for every new sentence found, as shown in the linked example of the desired output.
The condition based on which I would recognize the start of a new sentence is when both the word and entity columns have null values, as at index 4.
0 word entity
1 It O
2 was O
3 fun O
4 NaN NaN
5 from O
6 vodka B-product
So far I have managed to fill the null values with a new_sentence string and have figured out how to make a new column where I can enter a value whenever a new sentence is found, using:
df.fillna("new_sentence", inplace=True)
df['Sentence #'] = np.where(df['word']=='new_sentence', 'S', False)
In the above code, instead of S I would like to fill Sentence: {count} as in the example. What would be the easiest/quickest way to do this? Also, is there a better way to keep a count of sentences in a separate column, like in the example, instead of the method I am trying?
So far I am able to get an output like this
0 word entity Sentence #
1 It O False
2 was O False
3 fun O False
4 new_sentence new_sentence S
5 from O False
6 vodka B-product False
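For the running count itself, one possible sketch (not necessarily the quickest way; exactly how the marker rows should be labelled depends on the linked example) is to take a cumulative sum of the marker mask:
import numpy as np
import pandas as pd

df = pd.DataFrame({'word': ['It', 'was', 'fun', np.nan, 'from', 'vodka'],
                   'entity': ['O', 'O', 'O', np.nan, 'O', 'B-product']})
df.fillna("new_sentence", inplace=True)

# Each marker row bumps a running counter; rows before the first marker
# belong to sentence 1, hence the + 1.
counter = (df['word'] == 'new_sentence').cumsum() + 1
df['Sentence #'] = 'Sentence: ' + counter.astype(str)
print(df)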
I have two dataframes, call them A and B, which were created after reading the sheets of an Excel file and performing some basic functions. I need to merge the two dataframes with a right join on a column named ID, which has first been converted via astype(str) for both dataframes.
The ID column of the left Dataframe (A) is:
0 5815518813016
1 5835503994014
2 5835504934023
3 5845535359006
4 5865520960012
5 5865532845006
6 5875531550008
7 5885498289039
8 5885498289039_A2
9 5885498289039_A3
10 5885498289039_X2
11 5885498289039_X3
12 5885509768698
13 5885522349999
14 5895507791025
Name: ID, dtype: object
The ID column of the right Dataframe (B) is:
0 5835503994014
1 5845535359006
2 5835504934023
3 5815518813016
4 5885498289039_A1
5 5885498289039_A2
6 5885498289039_A3
7 5885498289039_X1
8 5885498289039_X2
9 5885498289039_X3
10 5885498289039
11 5865532845006
12 5875531550008
13 5865520960012
14 5885522349998
15 5895507791025
16 5885509768698
Name: ID, dtype: object
However, when I merge the two, the rest of the columns of the left (A) dataframe become "empty" (np.nan), except for the rows where the ID contains not only numbers but letters too. This is the pd.merge() I do:
A_B = A.merge(B[['ID', 'col_B']], left_on='ID', right_on='ID', how='right')
Do you have any ideas about what might be wrong? Your input is valuable.
Try turning all values in both columns into strings:
A['ID'] = A['ID'].astype(str)
B['ID'] = B['ID'].astype(str)
Generally, when a merge like this doesn't work, I would try to debug by printing out the unique values in each column to check if anything pops out (usually dtype issues).
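A plausible reconstruction of this particular failure, as a sketch (the data here is made up to mirror the question; if Excel parsed the numeric IDs as floats on one side, astype(str) yields '5815518813016.0', which never matches the plain '5815518813016' on the other side, while the letter-bearing IDs stay intact):
import pandas as pd

A = pd.DataFrame({'ID': [5815518813016.0, 5835503994014.0], 'col_A': ['x', 'y']})
B = pd.DataFrame({'ID': ['5815518813016', '5835503994014'], 'col_B': ['p', 'q']})

A['ID'] = A['ID'].astype(str)
B['ID'] = B['ID'].astype(str)
print(sorted(set(A['ID']) - set(B['ID'])))  # unmatched A-side IDs pop out here

# If that is the culprit, normalise the float-looking strings before merging:
A['ID'] = A['ID'].str.replace(r'\.0$', '', regex=True)
print(A.merge(B[['ID', 'col_B']], on='ID', how='right'))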
We have a process that reads an XML file into our database and inserts any rows that aren't currently in another table to that table.
This process also has a trigger to write to an audit table and a nightly snapshot is also held in another table.
In the XML holding table a field looks like 1234567890123456 but it exists on our live table as 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6. Those spaces will not be removed by any combination of REPLACE functions. We have tried all CHAR values and it does not recognise the character. The audit table and nightly snapshot, however, contain the correct values.
Similarly, if we run a comparison such as SELECT CASE WHEN '1234567890123456' = '1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 ' THEN 1 ELSE 0 END, this returns 1, so they match. However, LEN('1234567890123456') is 16 and LEN('1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 ') is 32.
We have run some queries to loop through the characters in the field and output the ASCII and Unicode values for the characters. The digits return the correct ASCII/Unicode values, but this stray whitespace character does not return a value.
An example of the incorrectly displayed one is 0x35000000320000003800000036000000380000003300000039000000370000003800000037000000330000003000000035000000340000003000000033000000 and a correct one is 0x3500320038003600380033003200300030003000360033003600380036003000. Both were added by the same means on the same day. One has the extra bytes, the other is fine.
How can we identify this character and get rid of it? Is there a reason this would have been inserted originally? How can we avoid this in future?
Data entry
It looks like some null (i.e. Char(0)) characters have got into the data.
If the data was supposed to be ASCII when it was entered, but UTF-16-encoded data got in, then it could be:
Entered character codes: 48 00
Sent to database: 48 00 00 00
To avoid that, remove disallowed characters as the first step in processing the input, say by using a regex to replace [\x00-\x1F] with an empty string.
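For example, in Python the scrub step could look like this (a sketch; your actual input pipeline may be in another language, but the idea is the same):
import re

# Strip ASCII control characters (0x00-0x1F), including Char(0),
# before the value is sent to the database.
CONTROL_CHARS = re.compile(r'[\x00-\x1f]')

def clean_field(value: str) -> str:
    return CONTROL_CHARS.sub('', value)

print(clean_field('5\x002\x008\x006'))  # -> '5286'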
Data repair
Search for entries which have a Char(0) in them to confirm that they can be found that way.
If so, replace the Char(0) with an empty string.
If that doesn't work, you could convert the data to the format '0x35000000320000003800000036000000380000003300000039000000370000003800000037000000330000003000000035000000340000003000000033000000', replace '000000' with '00', and then convert back.
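A rough Python illustration of that last byte-level trick, using the start of the example hex string from the question (the real repair would run inside the database, but the byte pattern is the same):
# Each damaged character occupies 4 bytes (e.g. '35 00 00 00'); collapsing
# '000000' to '00' in the hex text restores normal UTF-16LE ('35 00').
hex_str = '35000000320000003800000036000000'
fixed = hex_str.replace('000000', '00')
print(bytes.fromhex(fixed).decode('utf-16-le'))  # -> '5286'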
I have a text file "celldata.txt" containing a very simple table of data.
1 2 3 4
5 6 7 8
9 10 11 12
1 2 3 4
2 3 4 5
The problem is when it comes to accessing the data at a certain column and row.
My approach has been to load it using loadTable:
Table table;
int numCols;
int numRows;

void setup() {
  size(200, 200);
  table = loadTable("celldata.txt", "tsv");
  numRows = table.getRowCount();
  numCols = table.getColumnCount();
}

void draw() {
  background(255);
  fill(0);
  text(numRows + " " + numCols, 100, 100); // Check num of cols and rows
  println(table.getFloat(0, 0));
}
Question 1: When I do this, it says the number of rows is 5 and the number of columns is just 1. Why is it not 5 x 4?
Question 2: Why is table.getFloat(0,0) "NaN" instead of the first element of the data?
I want to use a much bigger matrix later and access certain elements (of type double) with something like getFloat(i,j), and be able to loop through all elements.
Using the same example data as mine, can someone please help me understand what is wrong with my code and how to access the text file's data? Should I be using another method than loadTable?
You've told Processing that the file contains tab separated values (by using the "tsv" option), but your file contains space separated values.
Since your file does not contain any tabs, it reads the entire row as a single value. So the 0,0 position of your table is 1 2 3 4, which isn't a number, hence the NaN. This is also why it thinks your table only has one column.
You should modify your celldata.txt file to actually be separated by tabs instead of spaces:
1 2 3 4
5 6 7 8
9 10 11 12
1 2 3 4
2 3 4 5
You could also separate them by commas and then use the "csv" option.
If you're still having trouble, you can see what Processing is reading in by adding saveTable(table, "data/new.csv"); to the end of your setup() function and then looking at that file. It will be a list of values separated by commas, so you can see exactly where Processing thinks the cells of the table are.
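If retyping the file by hand is a pain, a one-off script can do the conversion (Python here purely as a convenience; filenames are hypothetical):
# Rewrite the space-separated file as tab-separated so that
# loadTable(..., "tsv") sees four columns instead of one.
with open('celldata.txt') as src:
    rows = [line.split() for line in src if line.strip()]

with open('celldata_tabs.txt', 'w') as dst:
    for row in rows:
        dst.write('\t'.join(row) + '\n')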