Cleaning DataFrame columns: removing words starting with a defined character - pandas

I have a Pandas DataFrame that I would like to clean a little bit.
import pandas as pd
data = ['This is awesome', '\$BTC $USD Short the market', 'Dont miss the dip on $ETH']
df = pd.DataFrame(data)
print(df)
I am trying to delete all words starting with "$", such as "$BTC", "$USD", etc., but I can't figure out how. Should I convert the column to a list? I would like to use the function startswith() but don't know exactly how. Thanks for your help!

Code-
import pandas as pd
data = ['This is awesome', '\$BTC $USD Short the market', 'Dont miss the dip on $ETH']
df = pd.DataFrame(data,columns=['data'])
# raw string so that \$ and \w reach the regex engine unchanged
df['data'] = df['data'].replace(r'\$\w+', '', regex=True)
df
Output-
data
0 This is awesome
1 \ Short the market
2 Dont miss the dip on
Ref link- remove words starting with "#" in a column from a dataframe
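Since the question asks about startswith(), here is a minimal sketch of that route, offered as an alternative to the regex above rather than as the answerer's method. It assumes the words are separated by whitespace and that the data contains plain "$BTC" without a leading backslash; the helper name drop_dollar_words and the column name clean are made up for illustration.

```python
import pandas as pd

# Assumed sample data: same sentences as the question, without the stray backslash
data = ['This is awesome', '$BTC $USD Short the market', 'Dont miss the dip on $ETH']
df = pd.DataFrame(data, columns=['data'])


def drop_dollar_words(text):
    """Keep only the whitespace-separated words that do not start with '$'."""
    return ' '.join(word for word in text.split() if not word.startswith('$'))


# Apply the helper to every row of the column
df['clean'] = df['data'].apply(drop_dollar_words)
print(df['clean'].tolist())
```

Splitting and re-joining also collapses the leftover spaces that a plain regex replacement leaves behind.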

Related

Data cleaning in Pandas

I have an age column with values such as 10+, <9, or >45. I have to clean this data and make it ready for EDA. What sort of logic can I use to clean the data?
Hopefully this works for you: use str.extract to pull only the integers out of a string.
import pandas as pd
df = pd.DataFrame(
    data=[
        {'emp_length': '10+years'},
        {'emp_length': '3 years'},
        {'emp_length': '<1 year'},
    ]
)
# expand=False returns a Series holding the first run of digits in each string
df['emp_length'] = df['emp_length'].str.extract(r'(\d+)', expand=False)
df
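Applied to the age values from the question, the same idea might look like the sketch below. The column names age, age_num, and age_qualifier are assumptions, and keeping the "+", "<", or ">" marker in its own column is only one possible choice, made here so that the bound information is not silently lost.

```python
import pandas as pd

# Assumed sample data modelled on the values mentioned in the question
df = pd.DataFrame({'age': ['10+', '<9', '>45']})

# First run of digits in each string, converted from text to numbers
df['age_num'] = pd.to_numeric(df['age'].str.extract(r'(\d+)', expand=False))

# Keep the comparison marker so "<9" and "9" can still be told apart later
df['age_qualifier'] = df['age'].str.extract(r'([<>+])', expand=False)

print(df)
```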

Having trouble with an Excel spreadsheet in Google Colab: a column is missing

Yes, this is homework, and no, I don't want an answer to the question. For some reason the column I would like to move using pandas is missing, yet I can still see it in my end result. Why is this happening? This is what I have done:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import seaborn as sns
#read xlsx file
df = pd.read_excel("https://docs.google.com/spreadsheets/d/e/2PACX-1vTd9TqybCunAe9HPPdb5mOW5uFn5m5fXO-mecfsn0TEk10_l8Bz1Kc7k13AFWoyvC1t3A7A27zozfTd/pub?output=xlsx")
df
#removes last 2 rows
df.iloc[0:, 0:21]
#columns grouped by type float
df.iloc[0:, [0,2,4,9,10,11,12,13,14,15,16,17,18,19,20]]
#columns grouped by type object
df.iloc[0:, [1,3,5,6,7]]
#gets dummies and stores them in variables
type_float = df.iloc[0:, [0,2,4,9,10,11,12,13,14,15,16,17,18,19,20]]
type_object = df.iloc[0:, [1,3,5,6,7]]
#concatenates the dummies to original dataframe
df = pd.concat([type_float, type_object], axis='columns')
df
#rename
df.rename(columns = {'Attrition_Flag':'Target'}, inplace = True)
df
#Replacing target with 0/1
df['Target'].replace(['Existing Customer', 'Attrited Customer'],[0, 1], inplace=True)
df
This is where I'm having trouble.
When I try to move the column "Target", I can't. I've tried to pop it and then move it to the back,
and when I try using "df.iloc[0:, [15]]", which is its column, it just goes to the next column. Why does this column no longer exist?
I'm not sure I understand correctly what you need to do, but if you want to change the order of the columns (make 'Target' the last column), you can use:
all_columns_in_new_order = list(df.columns.drop('Target')) + ['Target']
and then:
df = df.reindex(all_columns_in_new_order, axis=1)
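A small self-contained sketch of the reordering above, using a made-up three-column frame (the columns a and b are placeholders, not columns from the spreadsheet). It also shows the pop-and-reassign route the question mentions, which should work as well because assigning a new column appends it at the end.

```python
import pandas as pd

# Hypothetical stand-in for the real spreadsheet
df = pd.DataFrame({'Target': [0, 1], 'a': [1.0, 2.0], 'b': ['x', 'y']})

# Route 1: build the new column order and reindex along the column axis
all_columns_in_new_order = list(df.columns.drop('Target')) + ['Target']
reordered = df.reindex(all_columns_in_new_order, axis=1)

# Route 2: pop the column and assign it back, which appends it as the last column
moved = df.copy()
moved['Target'] = moved.pop('Target')

print(list(reordered.columns))
print(list(moved.columns))
```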

python - if-else in a for loop processing one column

I would like to loop through a column and convert it into a processed Series.
Below is an example data frame with two rows and four columns:
import pandas as pd
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils
data = [['r/o ac. nephritis. /. nephrotic syndrome', ' ac. nephritis. /. nephrotic syndrome',1,'ac nephritis nephrotic syndrome'], [ 'sternocleidomastoid contracture','sternocleidomastoid contracture',0,"NA"]]
# Create the pandas DataFrame
df_diagnosis = pd.DataFrame(data, columns = ['diagnosis_name', 'diagnosis_name_edited','is_spell_corrected','spell_corrected_value'])
I want to use the spell_corrected_value column if the is_spell_corrected column is more than 1; otherwise, use diagnosis_name_edited.
At the moment, I have the following code, which uses the diagnosis_name_edited column directly. How do I turn this into an if-else/lambda check on the is_spell_corrected column?
unmapped_diag_series = (rapid_utils.default_process(d) for d in df_diagnosis['diagnosis_name_edited'].astype(str)) # characters (generator)
unmapped_processed_diagnosis = pd.Series(unmapped_diag_series) #
Thank you.
If I understand you correctly, try this fast solution using numpy.where:
import numpy as np
df_diagnosis['new_column'] = np.where(df_diagnosis['is_spell_corrected'] > 1, df_diagnosis['spell_corrected_value'], df_diagnosis['diagnosis_name_edited'])
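One thing to check before relying on the condition: the sample rows only contain 0 and 1 in is_spell_corrected, so a "> 1" test would never pick the corrected value for them. The sketch below assumes the intended meaning is "the flag equals 1"; that reading is a guess, and the column name chosen is made up for illustration.

```python
import numpy as np
import pandas as pd

data = [
    ['r/o ac. nephritis. /. nephrotic syndrome', ' ac. nephritis. /. nephrotic syndrome',
     1, 'ac nephritis nephrotic syndrome'],
    ['sternocleidomastoid contracture', 'sternocleidomastoid contracture', 0, 'NA'],
]
df_diagnosis = pd.DataFrame(
    data,
    columns=['diagnosis_name', 'diagnosis_name_edited',
             'is_spell_corrected', 'spell_corrected_value'],
)

# Assumption: a flag of 1 means "use the spell-corrected text"
use_corrected = df_diagnosis['is_spell_corrected'] == 1

# Row-wise choice between the two text columns, without an explicit loop
df_diagnosis['chosen'] = np.where(
    use_corrected,
    df_diagnosis['spell_corrected_value'],
    df_diagnosis['diagnosis_name_edited'],
)
print(df_diagnosis['chosen'].tolist())
```

The resulting column can then be passed to the existing generator in place of diagnosis_name_edited.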

Dictionary type data sorting

I have this type of data:
{"id":"colvera","gg_unique_id":"colvera","gg_unique_prospect_account_id":"cobra-hq-enterprises","completeness_score":100,"full_name":"chris olvera","first_name":"chris","last_name":"olvera","linkedin_url":"linkedin.com/in/colvera","linkedin_username":"colvera","facebook_url":null,"twitter_url":null,"email":"colvera#cobrahq.com","mobile_phone":null,"industry":"information technology and services","title":"independent business owner","company_name":"cobra hq enterprises","domain":"cobrahq.com","website":"cobrahq.com","employee_count":"1-10","company_linkedin_url":"linkedin.com/company/cobra-hq-enterprises","company_linkedin_username":"cobra-hq-enterprises","company_location":"raymore, missouri, united states","company_city":"raymore","company_state":"missouri","company_country":"united states"
I want to use "id", "gg_unique_id", etc. as column names and the values as rows. How can I do that?
I'm trying the following code, but nothing happens:
import pandas as pd
import numpy as np
data = pd.read_csv("1k_sample_data.txt")
data.info()
df = pd.DataFrame.from_dict(data)
df
I am new to this type of data; any help would be appreciated.
It looks like your data is in JSON format. Try:
df = pd.read_json("1k_sample_data.txt", lines=True)
print(df)
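To see what lines=True does without the original file, here is a small sketch that reads two JSON records from an in-memory buffer. The first record is a shortened version of the one in the question; the second record ("jdoe") is invented purely so the frame has more than one row.

```python
import io

import pandas as pd

# One JSON object per line; the second record is a made-up example
sample = io.StringIO(
    '{"id": "colvera", "first_name": "chris", "employee_count": "1-10"}\n'
    '{"id": "jdoe", "first_name": "jane", "employee_count": "11-50"}\n'
)

# lines=True tells pandas to parse each line as a separate JSON object (one row each)
df = pd.read_json(sample, lines=True)

print(df.columns.tolist())
print(df)
```

Each key becomes a column name and each line becomes a row, which is the layout the question asks for.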

When reading HTML (pandas.read_html), how to select a dataframe and set_index in one line

I'm reading an HTML page, which returns a list of dataframes. I want to be able to choose a dataframe from the list and set my index (index_col) in the fewest lines possible.
Here is what I have right now:
import pandas as pd
df =pd.read_html('http://finviz.com/insidertrading.ashx?or=-10&tv=100000&tc=1&o=-transactionvalue', header = 0)
df2 =df[4] #here I'm assigning df2 to dataframe#4 from the list of dataframes I read
df2.set_index('Date', inplace =True)
Is it possible to do all this in one line? Do I need to create another dataframe (df2) to hold one dataframe from the list, or can I pick the dataframe as soon as I read the list of dataframes (df)?
Thanks.
Anyway:
import pandas as pd
df = pd.read_html('http://finviz.com/insidertrading.ashx?or=-10&tv=100000&tc=1&o=-transactionvalue', header = 0)[4].set_index('Date')