Data cleaning in Pandas - pandas

I have an age column which has values such as 10+, <9, or >45. I have to clean this data and make it ready for EDA. What sort of logic can I use to clean the data?

Hope this works for you: use str.extract to keep only the integer part of each string,
import pandas as pd

df = pd.DataFrame(
    data=[
        {'emp_length': '10+years'},
        {'emp_length': '3 years'},
        {'emp_length': '<1 year'},
    ]
)
# Keep only the first run of digits in each value
df['emp_length'] = df['emp_length'].str.extract(r'(\d+)', expand=False)
df
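The extracted values are still strings; if your next step is numeric EDA (an assumption about your workflow), a small follow-up could convert the column:
# Coerce the extracted digit strings to a nullable integer dtype
df['emp_length'] = pd.to_numeric(df['emp_length'], errors='coerce').astype('Int64')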

Related

Cleaning DataFrame column: words starting with a defined character

I have a Pandas DataFrame that I would like to clean a little bit.
import pandas as pd
data = ['This is awesome', '$BTC $USD Short the market', 'Dont miss the dip on $ETH']
df = pd.DataFrame(data)
print(df)
I am trying to delete all words starting with "$", such as "$BTC", "$USD", etc. I can't figure out what to do. Convert the column to a list? I would like to use the startswith() function but don't know exactly how. Thanks for your help!
Code:
import pandas as pd

data = ['This is awesome', '$BTC $USD Short the market', 'Dont miss the dip on $ETH']
df = pd.DataFrame(data, columns=['data'])
# Remove every token that starts with '$' followed by word characters
df['data'] = df['data'].replace(r'\$\w+', '', regex=True)
df
Output:
data
0 This is awesome
1  Short the market
2 Dont miss the dip on
Ref link: remove words starting with "#" in a column from a dataframe
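A possible alternative closer to the startswith() idea from the question (a sketch, not the only way): split each sentence into words, drop the ones that start with '$', and join the rest back. As a side effect, this also tidies up the leftover double spaces that the plain regex replacement leaves behind.
# Keep only the words that do not start with '$'
df['data'] = df['data'].apply(
    lambda s: ' '.join(w for w in s.split() if not w.startswith('$'))
)
df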

Dask .loc only the first result (iloc[0])

Sample dask dataframe:
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6, 7], 'col_2': list('abcdefg')},
                  index=pd.Index([0, 0, 1, 2, 3, 4, 5]))
df = dd.from_pandas(df, npartitions=2)
Now I would like to get back only the first result (based on the index), like this in pandas:
df.loc[df.col_1 >3].iloc[0]
col_1 col_2
2 4 d
I know there is no positional row indexing in dask via iloc, but I wonder if it is possible to limit the query to one result, like LIMIT 1 in SQL?
Got it, but I am not sure about the efficiency here:
tmp = df.loc[df.col_1 >3]
tmp.loc[tmp.index == tmp.index.min().compute()].compute()
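Another option, as a rough sketch rather than a definitive answer: dask's head() limits the result to n rows, roughly like LIMIT 1 in SQL. This assumes the index is sorted so that partition order matches index order; npartitions=-1 lets head() look beyond the first partition if it does not contain enough rows.
# Take the first matching row across partitions (returns a pandas DataFrame)
first_row = df.loc[df.col_1 > 3].head(1, npartitions=-1)
first_row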

How can I get an interpolated value from a Pandas data frame?

I have a simple Pandas data frame with two columns, 'Angle' and 'rff'. I want to get an interpolated 'rff' value based on entering an Angle that falls between two Angle values (i.e. between two index values) in the data frame. For example, I'd like to enter 3.4 for the Angle and then get an interpolated 'rff'. What would be the best way to accomplish that?
import pandas as pd

data = [[1.0, 45.0], [2, 56], [3, 58], [4, 62], [5, 70]]  # Sample data
s = pd.DataFrame(data, columns=['Angle', 'rff'])
print(s)
s = s.set_index('Angle')  # Set 'Angle' as index
print(s)
result = s.at[3.0, "rff"]  # Works only for an exact match on the index
print(result)
You may use numpy:
import numpy as np
np.interp(3.4, s.index, s.rff)
#59.6
You could use numpy for this:
import numpy as np
import pandas as pd
data = [[1.0, 45.0], [2, 56], [3, 58], [4, 62], [5, 70]]  # Sample data
s = pd.DataFrame(data, columns=['Angle', 'rff'])
print(s)
print(np.interp(3.4, s.Angle, s.rff))
# 59.6
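If you prefer to stay within pandas, one possible sketch (setting 'Angle' as the index first, as in the question) is to insert the query angle into the index and interpolate on the index values:
s = s.set_index('Angle')
result = (
    s.reindex(s.index.union([3.4]))   # add the query angle as a new (NaN) row
     .interpolate(method='index')     # linear interpolation using index values
     .loc[3.4, 'rff']
)
print(result)  # 59.6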

Parse list and create DataFrame

I have been given a list called data with the following content:
data=[b'Name,Age,Occupation,Salary\r\nRam,37,Plumber,1769\r\nMohan,49,Elecrician,3974\r\nRahim,39,Teacher,4559\r\n']
I want a pandas DataFrame that looks like the linked image:
Expected Dataframe
How can I achieve this?
You can try this:
import pandas as pd

data = [b'Name,Age,Occupation,Salary\r\nRam,37,Plumber,1769\r\nMohan,49,Elecrician,3974\r\nRahim,39,Teacher,4559\r\n']
# Decode the bytes, drop '\r', split into rows, then split each row on ','
processed_data = [x.split(',') for x in data[0].decode().replace('\r', '').strip().split('\n')]
df = pd.DataFrame(columns=processed_data[0], data=processed_data[1:])
Hope it helps.
I would recommend converting the list to a string first, since it only contains a single (bytes) element:
str1 = data[0].decode()
Then use the StringIO solution:
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd

TESTDATA = StringIO(str1)
df = pd.read_csv(TESTDATA, sep=",")
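Since data[0] is already a bytes object, another possible sketch is to pass it to read_csv directly through a BytesIO buffer:
import io
import pandas as pd

# read_csv accepts a binary file-like object, so no manual decoding is needed
df = pd.read_csv(io.BytesIO(data[0]))
print(df)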

How to access dask dataframe index value in map_partitions?

I am trying to use dask dataframe map_partitions to apply a function that accesses the value in the dataframe index row-wise and creates a new column.
Below is the code I tried.
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(index=["row0", "row1", "row2", "row3", "row4"])
df
ddf = dd.from_pandas(df, npartitions=2)
res = ddf.map_partitions(lambda df: df.assign(index_copy=str(df.index)), meta={'index_copy': 'U'})
res.compute()
I am expecting df.index to be the value of the row index, not the entire partition index, which is what it seems to refer to. According to the docs, this works well for columns but not for the index.
What you want to do is this:
df.index = ['row' + str(x) for x in df.index]
For that, first create your pandas DataFrame, then run this line; after that you will have your expected result.
Let me know if this works for you.
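If the goal is a per-row copy of the index inside map_partitions, a minimal sketch using the same ddf as above (the meta dtype here is an assumption) could be:
res = ddf.map_partitions(
    # to_series() turns the index into a Series aligned row by row,
    # so assign attaches one value per row instead of the stringified partition index
    lambda pdf: pdf.assign(index_copy=pdf.index.to_series().astype(str)),
    meta={'index_copy': 'object'},
)
res.compute()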