Remove trailing * or # from data in pandas

I'm reading a CSV using pandas and getting all datatypes as object.
NO is a column of numeric values, with a trailing * or # in some observations.
I tried
import numpy as np
tai[np.isfinite(tai['NO'])]
TypeError: ufunc 'isfinite' not supported for the input types, and the
inputs could not be safely coerced to any supported types according to
the casting rule ''safe''
How can I remove all rows which have a trailing * or # in the NO column?

Consider this dataframe,
No
0 1
1 2#
2 3
3 4*
4 #5
You can use this to remove ONLY the trailing characters (inside a character class the | is literal, so [#*] is all you need):
df['No'] = df['No'].str.replace(r'[#*]$', '', regex=True)
You get
No
0 1
1 2
2 3
3 4
4 #5
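If you would rather drop the offending rows entirely, as the question asks, a minimal sketch assuming the column is still of string (object) dtype:
df = df[~df['No'].str.contains(r'[#*]$', regex=True, na=False)]
This keeps only the rows whose No value does not end in # or *.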
A more generalized solution, in case you want to remove these characters from anywhere in the column and keep only the digits:
df['No'] = df['No'].str.extract(r'(\d+)', expand=False)
You get
No
0 1
1 2
2 3
3 4
4 5
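Either way the cleaned column is still of object dtype; a small follow-up sketch to make it numeric (so that checks like np.isfinite work), using pd.to_numeric with errors='coerce' to turn anything that still isn't a number into NaN:
df['No'] = pd.to_numeric(df['No'], errors='coerce')
df = df.dropna(subset=['No'])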

Related

Pandas dataframe not throwing an error for rows containing fewer fields

I am facing an issue when reading rows that contain fewer fields. My dataset looks like this:
"Source1"~"schema1"~"table1"~"modifiedon"~"timestamp"~"STAGE"~15~NULL~NULL~FALSE~FALSE~TRUE
"Source1"~"schema2"~"table2"
and I am running the command below to read the dataset:
tabdf = pd.read_csv('test_table.csv',sep='~',header = None)
But it's not throwing any error even though it's supposed to.
The version we are using
pip show pandas
Name: pandas
Version: 1.0.1
My question is: how do I make the process fail if we get an incorrectly structured file?
You could inspect the data first using Pandas and then either fail the process if bad data exists or just read the known-good rows.
Read full rows into a dataframe
df = pd.read_fwf('test_table.csv', widths=[999999], header=None)
print(df)
0
0 "Source1"~"schema1"~"table1"~"modifiedon"~"tim...
1 "Source1"~"schema2"~"table2"
Count number of separators
sep_count = df[0].str.count('~')
sep_count
0 11
1 2
Maybe just terminate the process if there are bad (short) rows, i.e. if the number of unique separator counts is not 1:
sep_count.nunique()
2
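A minimal sketch of that check (the exception type and message are just placeholder choices):
if sep_count.nunique() != 1:
    raise ValueError('test_table.csv contains rows with an inconsistent number of fields')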
Or just read the good rows
good_rows = sep_count.eq(11) # if you know what separator count should be. Or ...
good_rows = sep_count.eq(sep_count.max()) # if you know you have at least 1 good row
df = pd.read_csv('test_table.csv', sep='~', header=None).loc[good_rows]
print(df)
Result
0 1 2 3 4 5 6 7 8 9 10 11
0 Source1 schema1 table1 modifiedon timestamp STAGE 15.0 NaN NaN False False True

ValueError: could not convert string to float: '1.598.248'

How do I convert a dataframe column of numbers in the millions, formatted like the ones below, to double or float?
0 1.598.248
1 1.323.373
2 1.628.781
3 1.551.707
4 1.790.930
5 1.877.985
6 1.484.103
I want to get this as output:
0 1598248.0
1 1323373.0
2 1628781.0
3 1551707.0
4 1790930.0
5 1877985.0
6 1484103.0
You will need to remove the full stops. You can use the pandas replace method and then convert to float:
df['col'] = df['col'].replace(r'\.', '', regex=True).astype('float')
Example
>>> df = pd.DataFrame({'A': ['1.1.1', '2.1.2', '3.1.3', '4.1.4']})
>>> df
A
0 1.1.1
1 2.1.2
2 3.1.3
3 4.1.4
>>> df['A'] = df['A'].replace(r'\.', '', regex=True).astype('float')
>>> df['A']
0 111.0
1 212.0
2 313.0
3 414.0
Name: A, dtype: float64
>>> df['A'].dtype
float64
I'm assuming that, because the values contain two full stops, the data is stored as strings. However, this should work even if some entries in that column are already integers or floats.
my_col_name
0 1.598.248
1 1.323.373
2 1.628.781
3 1.551.707
4 1.790.930
5 1.877.985
6 1.484.103
With the df above, you can try the code below, which works in 3 steps: (1) change the column type to string, (2) replace the full stop characters, (3) change the column type to float.
col = 'my_col_name'
df[col] = df[col].astype('str')
df[col] = df[col].str.replace('.','')
df[col] = df[col].astype('float')
print(df)
Please note the above will produce a warning: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
So you should pass regex explicitly. Since the full stop is meant as a literal character here, use regex=False (with regex=True an unescaped '.' would match every character and empty the strings). Combined into 1 line:
df[col] = df[col].astype('str').str.replace('.', '', regex=False).astype('float')
print(df)
Output
my_col_name
0 1598248.0
1 1323373.0
2 1628781.0
3 1551707.0
4 1790930.0
5 1877985.0
6 1484103.0
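An alternative sketch, assuming the dots really are thousands separators: strip them and let pd.to_numeric turn anything that still isn't numeric into NaN instead of raising:
df[col] = pd.to_numeric(df[col].astype('str').str.replace('.', '', regex=False), errors='coerce')
If the data comes straight from a file, read_csv's thousands parameter can also handle this at load time, e.g. pd.read_csv('data.csv', thousands='.') (the filename here is only an example).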

How to create new pandas column by vlookup-like procedure on another data-frame

I have a dataframe that looks like this. It will be used to map values using two categorical variables. Maybe converting this to a dictionary would be better.
The 2nd data-frame is very large with a screenshot shown below. I want to take the values from the categorical variables to create a new attribute (column) based on the 1st data-frame.
For example...
A row with FICO_cat of (700,720] and OrigLTV_cat of (75,80] would receive a value of 5.
A row with FICO_cat of (700,720] and OrigLTV_cat of (85,90] would receive a value of 6.
Is there an efficient way to do this?
If your column labels are the FICO_cat values, and your Index is OrigLTV_cat, this should work:
Given a dataframe df:
780+ (740,780) (720,740)
(60,70) 3 3 3
(70,75) 4 5 4
(75,80) 3 1 2
Do:
df = df.unstack().reset_index()
df.rename(columns = {'level_0' : 'FICOCat', 'level_1' : 'OrigLTV', 0 : 'value'}, inplace = True)
Output:
FICOCat OrigLTV value
0 780+ (60,70) 3
1 780+ (70,75) 4
2 780+ (75,80) 3
3 (740,780) (60,70) 3
4 (740,780) (70,75) 5
5 (740,780) (75,80) 1
6 (720,740) (60,70) 3
7 (720,740) (70,75) 4
8 (720,740) (75,80) 2
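From here, the vlookup-like step is just a merge on the two categorical columns. A minimal sketch, assuming the large dataframe is called df2 and its category columns are named FICO_cat and OrigLTV_cat (adjust the names to match your actual columns):
lookup = df.rename(columns={'FICOCat': 'FICO_cat', 'OrigLTV': 'OrigLTV_cat'})
df2 = df2.merge(lookup, on=['FICO_cat', 'OrigLTV_cat'], how='left')
Each row of df2 then gets a value column filled in from the lookup table.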

how to convert a column of pandas series without the header

It is quite odd, as I hadn't experienced this issue with converting a data series until now.
So I have wind speed data by date & hour at different heights, retrieved from NREL.
file09 = 'wind/wind_yr2009.txt'
wind09 = pd.read_csv(file09, encoding = "utf-8", names = ['DATE (MM/DD/YYYY)', 'HOUR-MST', 'AWS#20m [m/s]', 'AWS#50m [m/s]', 'AWS#80m [m/s]', 'AMPLC(2-80m)'])
file10 = 'wind/wind_yr2010.txt'
wind10 = pd.read_csv(file10, encoding = "utf-8", names = ['DATE (MM/DD/YYYY)', 'HOUR-MST', 'AWS#20m [m/s]', 'AWS#50m [m/s]', 'AWS#80m [m/s]', 'AMPLC(2-80m)'])
I merge the two .txt file readings below:
wind = pd.concat([wind09, wind10], join='inner')
Then drop the duplicate headings:
wind = wind.reset_index().drop_duplicates(keep='first').set_index('index')
print(wind['HOUR-MST'])
Printing returns something like the following:
index
0 HOUR-MST
1 1
2 2
I wasn't sure at first, but apparently index 0 holds 'HOUR-MST', which is the column heading. pandas does recognize the column, as I can retrieve its data using that header. Yet when I try converting it to int,
temp = hcodebook.iloc[wind['HOUR-MST'].astype(int) - 1]
both of the errors below were returned (the second when I later tried converting to float instead):
ValueError: invalid literal for int() with base 10: 'HOUR-MST'
ValueError: could not convert string to float: 'HOUR-MST'
I verified that only index 0 contains a string by using try/except in a for loop.
I think the reason is that I didn't use the sep parameter when reading these files; that is the only difference between these files, where the data conversion is troubling me, and the other files I have read previously.
Yet that doesn't tell me how to address it.
Kindly advise.
MCVE:
from io import StringIO
import pandas as pd
cfile = StringIO("""A B C D
1 2 3 4
5 6 7 8""")
pd.read_csv(cfile, names=['a','b','c','d'], sep=r'\s\s+', engine='python')
Header included in data:
a b c d
0 A B C D
1 1 2 3 4
2 5 6 7 8
Use skiprows to avoid getting headers:
from io import StringIO
import pandas as pd

cfile = StringIO("""A B C D
1 2 3 4
5 6 7 8""")
pd.read_csv(cfile, names=['a','b','c','d'], sep=r'\s\s+', skiprows=1, engine='python')
No headers:
a b c d
0 1 2 3 4
1 5 6 7 8
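Alternatively, a sketch assuming the first line of each file really is a header row: passing header=0 together with names tells read_csv to discard the file's own header line and use your names instead, so no stray 'HOUR-MST' row ends up in the data:
wind09 = pd.read_csv(file09, encoding="utf-8", header=0, names=['DATE (MM/DD/YYYY)', 'HOUR-MST', 'AWS#20m [m/s]', 'AWS#50m [m/s]', 'AWS#80m [m/s]', 'AMPLC(2-80m)'])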

Slicing rows of pandas dataframe between markers

I have a pandas dataframe with a column that marks interesting points of data in another column (e.g. the locations of peaks and troughs). I often need to do some computation on the values between each marker. Is there a neat way to slice the dataframe using the markers as end points so that I can run a function on each slice? The dataframe would look like this, with the desired slices marked:
numbers markers
0 0.632009 None
1 0.733576 None # Slice 1 (0,1,2)
2 0.585944 x _________
3 0.212374 None
4 0.491948 None
5 0.324899 None # Slice 2 (3,4,5,6)
6 0.389103 y _________
7 0.638451 None
8 0.123557 None # Slice 3 (7,8,9)
9 0.588472 x _________
My current approach is to create an array of the indices where the markers occur, iterate over this array using the values to slice the dataframe, and append these slices to a list. I end up with a list of numpy arrays that I can then apply a function to:
import numpy as np
import pandas as pd

df = pd.DataFrame({'numbers': np.random.rand(10),
                   'markers': [None, None, 'x', None, None, None, 'y', None, None, 'x']})
index_array = df[df.markers.isin(['x', 'y'])].index  # returns an array of x/y indices
slice_list = []
prev_i = 0  # first slice of the dataframe needs to start from index 0
for i in index_array:
    new_slice = df.numbers[prev_i:i+1].values  # i+1 to include the end marker in the slice
    slice_list.append(new_slice)
    prev_i = i + 1  # excludes the start marker in the next slice
for j in slice_list:
    myfunction(j)  # myfunction is whatever per-slice computation is needed
This works, but I was wondering if there is a more idiomatic approach using fancy indexing/grouping/pivoting or something that I am missing?
I've looked at using groupby, but that doesn't work because grouping on the markers column only returns the rows where the markers are, and multi-indexes and pivot tables require unique labels. I wouldn't bother asking, except pandas has a tool for just about everything so my expectations are probably unreasonably high.
I am not tied to ending up with a list of arrays, that was just the solution I found. I am very open to suggestions on changing the way that I structure my data from the very start if that makes things easier.
You can do this using a variant of the compare-cumsum-groupby pattern. Starting from
>>> df["markers"].isin(["x","y"])
0 False
1 False
2 True
3 False
4 False
5 False
6 True
7 False
8 False
9 True
Name: markers, dtype: bool
We can shift and take the cumulative sum to get:
>>> df["markers"].isin(["x","y"]).shift().fillna(False).cumsum()
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 2
8 2
9 2
Name: markers, dtype: int64
After which groupby works as you want:
>>> group_id = df["markers"].isin(["x","y"]).shift().fillna(False).cumsum()
>>> for k,g in df.groupby(group_id):
... print(k)
... print(g)
...
0
numbers markers
0 0.632009 None
1 0.733576 None
2 0.585944 x
1
numbers markers
3 0.212374 None
4 0.491948 None
5 0.324899 None
6 0.389103 y
2
numbers markers
7 0.638451 None
8 0.123557 None
9 0.588472 x
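If the end goal is just to run a function on each slice of numbers, the explicit loop isn't needed either; a minimal sketch, reusing myfunction from the question as the placeholder computation:
group_id = df["markers"].isin(["x", "y"]).shift().fillna(False).cumsum()
result = df.groupby(group_id)["numbers"].apply(myfunction)
apply hands each slice to myfunction as a Series; use .agg instead if myfunction reduces each slice to a single scalar.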