Removing Non-English Words from CSV - NLTK - pandas

I am relatively new to Python and NLTK. I have a set of Flickr data stored in a CSV file and want to remove non-English words from the tags column. I keep getting the error "expected string or bytes-like object". I have a feeling it is because the tags column is currently a pandas Series rather than a string, but none of the related solutions I've seen on Stack Overflow have worked when it comes to converting it to a string.
I have this code:
import nltk
from nltk.tokenize import word_tokenize

#converting pandas df to string
filtered_new = df_filtered_english_only.applymap(str)
#check it's converted to string
from pandas.api.types import is_string_dtype
is_string_dtype(filtered_new['tags'])
filtered_new['tags'].dropna(inplace=True)
tokens = filtered_new['tags'].apply(word_tokenize)
#print(tokens)
#remove non-English tags
#initialise corpus of English words from nltk
words = set(nltk.corpus.words.words())
" ".join(w for w in nltk.word_tokenize(df_filtered_english_only["tags"]) \
    if w.lower() in words or not w.isalpha())
Any ideas how to resolve this?

Generally: You should give an example of your dataset.
What is the original content of the column "tags"? How are tags separated? How is "no tags" expressed, and is there a difference between an empty list and NaN?
I assume tags can contain multiple words, which matters, especially when it comes to removing non-English words.
But for simplicity's sake let's assume there are only one-word tags and they are separated by whitespace, so that each row's content is a string. Also let's assume that empty rows (no tags) have the default NA value for pandas (numpy.NaN). And since you probably read the file with pandas, some values might have been auto-converted to numbers.
Setup:
import numpy
import pandas
import nltk
df = pandas.DataFrame({"tags": ["bird dog cat xxxyyy", numpy.NaN, "Vogel Hund Katze xxxyyy", 123]})
> tags
0 bird dog cat xxxyyy
1 NaN
2 Vogel Hund Katze xxxyyy
3 123
Drop NA rows and tokenize:
df.dropna(inplace=True)
tokens = df["tags"].astype(str).apply(nltk.word_tokenize)
> 0 [bird, dog, cat, xxxyyy]
2 [Vogel, Hund, Katze, xxxyyy]
3 [123]
Name: tags, dtype: object
Filter by known words, always allow non-alpha:
words = set(nltk.corpus.words.words())
filtered = [" ".join(w for w in row if w.lower() in words or not w.isalpha()) for row in tokens]
> ['bird dog cat', '', '123']
The main problem in your code probably results from doing a flat iteration over a nested structure: you have already tokenized, so each row of the pandas Series is now a list of tokens. If you make the iteration nested as well, as in the example above, the code should run.
Also, you should never do the string conversion (be it .astype(str) or any other way) BEFORE removing NAs, because the NAs would then become something like 'nan' and would no longer be removed. First drop NA to handle the empty cells, then convert to string to handle other things such as numbers.
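Applied to your own variable names, that would look roughly like the sketch below (df_filtered_english_only and the 'tags' column are taken from your snippet; everything else follows the example above):
import nltk

# requires nltk.download('words') and nltk.download('punkt') the first time
words = set(nltk.corpus.words.words())

# 1) drop empty rows first, 2) then convert to string, 3) then tokenize
df = df_filtered_english_only.dropna(subset=["tags"])
tokens = df["tags"].astype(str).apply(nltk.word_tokenize)

# nested iteration: outer loop over the rows, inner loop over the tokens of one row
df["tags"] = [
    " ".join(w for w in row if w.lower() in words or not w.isalpha())
    for row in tokens
]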

Related

Extracting portions of the entries of Pandas dataframe

I have a Pandas dataframe with several columns, where the entries of each column are a combination of numbers, upper- and lower-case letters and some special characters, i.e. "=A-Za-z0-9_|". Each entry of the column is of the form:
'x=ABCDefgh_5|123|'
I want to retain only the digits 0-9 appearing between | | and strip out all other characters. Here is my code for one column of the dataframe:
list(map(lambda x: x.lstrip(r'\[=A-Za-z_|,]+'), df[1]))
However, the code returns the full entry 'x=ABCDefgh_5|123|' without stripping out anything. Is there an error in my code?
Instead of working with these hard-to-read regex patterns, you might want to consider a simple split. For example:
import pandas as pd
d = {'col': ["x=ABCDefgh_5|123|", "x=ABCDefgh_5|123|"]}
df = pd.DataFrame(data=d)
output = df["col"].str.split("|").str[1]
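For the sample rows this yields '123' in each case. If a regex is preferred after all, pandas' str.extract can capture the digits between the pipes directly (a sketch reusing the df defined above):
# capture the digits enclosed by the two pipes
output = df["col"].str.extract(r"\|(\d+)\|", expand=False)
print(output)  # 0    123
               # 1    123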

Losing rows when renaming columns in pyspark (Azure databricks)

I have a line of PySpark that I am running in Databricks:
df = df.toDF(*[format_column(c) for c in df.columns])
where format_column is a Python function that upper-cases and strips the column names and replaces the full stop (.) and backtick (`) characters with underscores.
Around this line of code, the dataframe randomly loses a bunch of rows: if I do a count before and after the line, the number of rows drops.
I did some more digging with this and found the same behaviour if I tried the following:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns])
although the following is ok without the aliasing:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name) for column_name in df.columns])
and it is also ok if I don't rename all columns such as:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns[:-1]])
And finally, there were some pipe (|) characters in the column names; when I removed them manually beforehand, there was no issue.
As far as I know, pipe is not actually a special character in spark sql column names (unlike full stop and backtick).
Has anyone seen this kind of behaviour before and know of a solution aside from removing the pipe character manually beforehand?
Running on Databricks Runtime 10.4LTS.
Edit
format_column is defined as follows:
import re

def format_column(column: str) -> str:
    column = column.strip().upper()        # case and leading / trailing white space
    column = re.sub(r"\s+", " ", column)   # collapse multiple white spaces
    column = re.sub(r"\.|`", "_", column)  # replace full stops and backticks with underscores
    return column
I reproduced this in my environment and there was no loss of rows in my dataframe.
Using the same format_column function, the row count of the dataframe was identical before and after renaming the columns.
Please recheck whether something other than this function is changing your dataframe.
If you still get the same behaviour, you can try the following and check whether it loses any rows:
print("before replacing : "+str(df.count()))
df1=df.toDF(*[re.sub('[^\w]', '_', c) for c in df.columns])
df1.printSchema()
print("before replacing : "+str(df1.count()))
If this also results losing rows, then the issue is with something else in your dataframe or code. please recheck on that.

Pandas / Dask Reading a semi-tabular text

I have a text file that looks like this:
Version:23
Developer: Ali
NAME AGE IN
- Carol 22 no
- Kyle 31 yes
...
I am reading it using a Dask dataframe (which should be similar to Pandas). The resulting dataframe should look like this:
NAME AGE IN
Carol 22 no
Kyle 31 yes
I am having trouble getting rid of the dash ('-') at the start of each data row below the column names. I tried
dd.read_csv(filepath, header = 3, sep="\s+")
which results in a dataframe with a different row size and brings more problems,
and I also tried using multiple delimiters, but that still gives errors.
dd.read_csv(filepath, header = 3, sep="\s-\s+")
dask.dataframe assumes your data is already in a tabular format. If you insist on using dask, then you will get further with dask.bag, which will load the file line by line. You can then filter out the lines that do not start with a dash, process the ones that do by encoding them as a JSON object/dict, and later convert the result to a dataframe with .to_dataframe().
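A minimal sketch of that approach (the file name, the exact column layout and the int conversion for AGE are assumptions based on the sample above):
import dask.bag as db

bag = db.read_text("data.txt")  # hypothetical path to the file shown above

def parse_line(line):
    # "- Carol 22 no"  ->  {"NAME": "Carol", "AGE": 22, "IN": "no"}
    _, name, age, in_flag = line.split()
    return {"NAME": name, "AGE": int(age), "IN": in_flag}

# keep only the data lines (they start with a dash) and parse them into dicts
records = bag.filter(lambda line: line.lstrip().startswith("-")).map(parse_line)

df = records.to_dataframe()
print(df.compute())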

Extracting a value from a pd dataframe

I have a dataframe column such as below.
{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/film%20&%20video/narrative%20film"}},"color":16734574,"parent_id":11,"name":"Narrative Film","id":31,"position":13,"slug":"film & video/narrative film"}
I want to extract the value stored against the key 'slug' (in this instance it is 'film & video/narrative film') and store it as a new dataframe column.
How can I do this?
Many thanks
This is a (nested) dictionary with different kinds of entries, so it does not make much sense to treat it as a DataFrame column. You could treat it as a DataFrame row, with the dictionary keys giving the column names:
import pandas as pd

record = {"urls": {"web": {"discover": "http://www.kickstarter.com/discover/categories/film%20&%20video/narrative%20film"}},
          "color": 16734574, "parent_id": 11, "name": "Narrative Film", "id": 31, "position": 13,
          "slug": "film & video/narrative film"}
df = pd.DataFrame(record, index=[0])
display(df)
Output:
urls color parent_id name id position slug
0 NaN 16734574 11 Narrative Film 31 13 film & video/narrative film
Note that the urls entry is not recognized, due to the sub-dictionary.
In any case, this does yield slug as a column, so please let me know if this answers your question.
Of course you could also extract the slug entry directly from your dictionary:
record['slug']
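And if, as your question describes, each row of a DataFrame column holds one such dictionary (or its JSON string), a per-row apply is one way to pull out slug. A sketch, assuming a hypothetical column named category containing JSON strings:
import json
import pandas as pd

df = pd.DataFrame({"category": [
    '{"color": 16734574, "parent_id": 11, "name": "Narrative Film",'
    ' "id": 31, "position": 13, "slug": "film & video/narrative film"}'
]})

# parse each JSON string and keep only the 'slug' entry
df["slug"] = df["category"].apply(lambda s: json.loads(s).get("slug"))
print(df["slug"])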

Count frequency of multiple words

I used this code
unclassified_df['COUNT'] = unclassified_df.tweet.str.count('mulcair')
to count the number of times mulcair appeared in each row in my pandas dataframe. I am trying to repeat the same but for a set of words such as
Liberal = ['lpc','ptlib','justin','trudeau','realchange','liberal', 'liberals', "liberal2015",'lib2015','justin2015', 'trudeau2015', 'lpc2015']
I saw somewhere that I could use collections.Counter(data) and its .most_common(k) method for this; can anyone please help me out?
import pandas as pd

# words to check; count each at most once per row
Liberal = ['lpc', 'justin', 'trudeau', 'realchange', 'liberal', 'liberals', "liberal2015", 'lib2015', 'justin2015', 'trudeau2015', 'lpc2015']

# sample data
data = {'tweet': ['lpc living dream camerama', "jsutingnasndsa dnsadnsadnsa dsalpcdnsa", "but", 'mulcair suggests thereslcp bad lpc blood']}

# the data frame with one column, tweet
df = pd.DataFrame(data, columns=['tweet'])

# no duplicates per row: number of rows containing each word
print([(df.tweet.str.contains(word).sum(), word) for word in Liberal])

# captures all duplicates located in each row: total occurrences of each word
print(pd.Series({w: df.tweet.str.count(w).sum() for w in Liberal}))
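If you want to keep a per-row COUNT column like in your original snippet, one option (a sketch reusing the df above) is to count matches against a single alternation pattern built from the list:
# one regex alternation over all the words; add word boundaries (\b) if
# substring matches such as 'liberal' inside 'liberals' are a concern
pattern = "|".join(Liberal)
df["COUNT"] = df.tweet.str.count(pattern)
print(df)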
References:
pandas str.contains & str.match