Removing unwanted characters from a text column of a dataframe - pandas

I am using the dataframe below.
In this dataframe the "TITLE" and "ABSTRACT" columns contain a lot of unwanted characters along with words.
I want only the letters, and no other unwanted characters, in these two columns.
Please help me remove the unwanted characters from both columns of the dataframe.
Please use any method (functions preferable).

df['TITLE'] = df.TITLE.str.replace('[^a-zA-Z]', '', regex=True)
df['ABSTRACT'] = df.ABSTRACT.str.replace('[^a-zA-Z]', '', regex=True)
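Note that in current pandas (2.0+), str.replace treats the pattern as a literal string unless regex=True is passed, hence the flag above. A minimal runnable sketch with made-up sample rows (the column names follow the question):

import pandas as pd

df = pd.DataFrame({
    'TITLE': ['Deep learning!! (2020)', 'Graph-based NLP #42'],
    'ABSTRACT': ['We propose a model...', 'Accuracy: 95.2%'],
})

# Drop everything except letters; use '[^a-zA-Z ]' to also keep spaces.
for col in ['TITLE', 'ABSTRACT']:
    df[col] = df[col].str.replace('[^a-zA-Z]', '', regex=True)

print(df)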

Related

How to convert multiple columns with European numbers (comma as decimal separator) to float?

I have multiple columns that contain numbers in European format, e.g.
1.630,78
They have different characters in front or at the end (€, %), so I can't use the pandas decimal conversion directly.
pd.read_csv("file.csv", decimal=',', dtype={"col1": float, "col": float})
won't work because I have to remove the signs first, which I can only do after reading in the whole file.
Search and replace dots and commas in pandas dataframe
did not work; I get a
ValueError: could not convert string to float: ''
but every row has an entry.
How can I then convert those strings in specific columns to floats?
Read the columns in as strings and then use translate:
tt = str.maketrans(',', '.', '.€%')
df.col1 = df.col1.str.translate(tt).astype(float)
PS: you may need to adapt the third argument, the set of characters to remove, to your data.
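For illustration, a minimal sketch with made-up values (the € and % suffixes, and the space added to the delete set, are assumptions based on the formats described in the question):

import pandas as pd

df = pd.DataFrame({'col1': ['1.630,78 €', '2.500,00 €', '12,5 %']})

# Map ',' -> '.' and delete '.', '€', '%' and spaces.
tt = str.maketrans(',', '.', '.€% ')
df.col1 = df.col1.str.translate(tt).astype(float)
print(df.col1.tolist())  # [1630.78, 2500.0, 12.5]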

How to concatenate numerous column names in pandas?

I would like to concatenate all the columns, comma-delimited, in pandas.
But as you can see, it is a very laborious task since I manually typed all the column indices.
de = data[3]+","+data[4]+","+data[5]+....+","+data[1511]
Do you have any idea how to avoid the above procedure in pandas in python3?
First convert all columns to strings with DataFrame.astype and then join per row:
df = data.astype(str).apply(','.join, axis=1)
Or, after converting to strings, append ',' to each value, sum per row, and finally remove the trailing ',' with Series.str.rstrip:
df = data.astype(str).add(',').sum(axis=1).str.rstrip(',')
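A quick check of both variants on a toy frame (integer column labels, as in the question):

import pandas as pd

data = pd.DataFrame({3: ['a', 'd'], 4: ['b', 'e'], 5: ['c', 'f']})

joined = data.astype(str).apply(','.join, axis=1)
print(joined.tolist())   # ['a,b,c', 'd,e,f']

summed = data.astype(str).add(',').sum(axis=1).str.rstrip(',')
print(summed.tolist())   # ['a,b,c', 'd,e,f']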

Pandas - Removing rows with nan or None values

I have some data that was pre-populated from another system; as a DataFrame it looks like this:
id;value
101;Product_1,,,,,,,,,,,,,,,,,,,,,,,Product_2,,,,,,,,,,,,,,,,,,,,,,, Product_3,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan, Product_4,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None
102;,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None
I am trying to clean this up by removing the nan/None placeholders and collapsing every run of two or more consecutive commas.
Expected Output:
id; value
101; Product_1, Product_2, Product_3, Product_4
102;
Using the semicolon (;) as the separator
First, import the data while specifying the separator as a semicolon. Then you can run str.replace() to collapse the commas. There are three steps to perform:
1. Replace the placeholder values ('nan', 'None') and blank spaces with ''
2. Replace each sequence of commas with a single ', '
3. Strip any leftover leading/trailing ', ', so that cells which contained only placeholders end up as an empty string '' (for many purposes it would be more useful to then replace '' with numpy.nan instead)
import pandas as pd
df = pd.read_csv(path, sep=';')
df['value'] = (df['value']
               .str.replace(r'nan|None| ', '', regex=True)
               .str.replace(r',+', ', ', regex=True)
               .str.strip(', '))
You might find it useful to have lists instead of strings, in which case you can use:
df['value'].str.split(', ')
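Putting it together on a shortened version of the sample rows (read from an in-memory string here instead of a file):

import io
import pandas as pd

raw = """id;value
101;Product_1,,,,Product_2,,,, Product_3,nan,nan, Product_4,None,None
102;,,,,nan,nan,None,None
"""

df = pd.read_csv(io.StringIO(raw), sep=';')
df['value'] = (df['value']
               .str.replace(r'nan|None| ', '', regex=True)  # drop placeholders and spaces
               .str.replace(r',+', ', ', regex=True)        # collapse comma runs
               .str.strip(', '))                            # trim leftover ', ' at the ends
print(df['value'].tolist())
# ['Product_1, Product_2, Product_3, Product_4', '']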

Iterate over df columns to find rows with text

I have a dataframe with columns. The columns are mostly blank, but a few rows contain strings, and those are the only rows I want to see. I have tried the code below, but I don't know how to select only the rows with strings and build a new dataframe from them.
columns = list(df)
for i in columns:
    df1 = df[df[i] == '']
can someone please help?
df[df['column_name'].notna()]
should do the trick for any one column whose blanks are NaN.
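To filter on strings across all columns at once, and to treat empty strings as blanks too, a sketch along these lines may be closer to what's wanted (the frame and its column names are made up):

import pandas as pd

df = pd.DataFrame({'a': ['', 'hello', ''], 'b': ['', '', 'world']})

# Treat empty strings as missing, then keep rows where any column has text.
mask = df.replace('', pd.NA).notna().any(axis=1)
df1 = df[mask]
print(df1)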

Pandas read_sql_query -> to_string. remove spaces between columns (FWF)

I am trying to create a fixed-width file output in pandas. When using the DataFrame's to_string, all the data has whitespace separating the values. How do I remove the whitespace between the data columns?
sql = """SELECT
FIELD_1,
FIELD_2,
.........
FROM
VIEW"""
db_connection_string = "your connection string"
df = pd.read_sql_query(sql=sql, con=db_connection_string)
df['field_1'] = df['field_1'].str.pad(width=10, side='right', fillchar='-')
df['field_2'] = df['field_2'].str.pad(width=10, side='right', fillchar='-')
print(df.to_string(header=False, index=False))
I expected the following:
field1----field2----
What I got was:
field1---- field2----
Please note the spaces between the columns. This is what I am trying to remove. The fields should be flush against one another and not have a whitespace separator.
I think the problem is that to_string adds a default column separator. A possible solution is to join all the columns together:
print(df.astype(str).apply(''.join, axis=1).to_string(header=False, index=False))
field1----field2----
Or only some columns:
print((df['field_1'] + df['field_2']).to_string(header=False, index=False))
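End to end, with a small in-memory frame standing in for the SQL result (the padding widths and fill character follow the question):

import pandas as pd

df = pd.DataFrame({'field_1': ['abc', 'de'], 'field_2': ['x', 'yz']})

df['field_1'] = df['field_1'].str.pad(width=10, side='right', fillchar='-')
df['field_2'] = df['field_2'].str.pad(width=10, side='right', fillchar='-')

# Joining the padded columns leaves to_string nothing to separate.
print(df.astype(str).apply(''.join, axis=1).to_string(header=False, index=False))
# abc-------x---------
# de--------yz--------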