Pandas integer columns remove last three digits - pandas

An example pandas df where the columns are all integers, but some contain NaN.
raw
capitalSurplus 188883000
totalLiab 2589242000
totalStockholderEquity 6740732000
minorityInterest 27549000
otherCurrentLiab 40412000
totalAssets 9357523000
endDate 1483142400
commonStock 5818867000
retainedEarnings 732982000
otherLiab 746117000
otherAssets 6034000
totalCurrentLiabilities 436539000
propertyPlantEquipment 9135741000
totalCurrentAssets 212758000
longTermInvestments 2990000
netTangibleAssets 6740732000
netReceivables 201288000
longTermDebt 1406586000
accountsPayable 396127000
otherCurrentAssets NAN
P.S. The df is transposed.
The expected result is that the last three digits ('000') are removed from all columns, even the ones containing NaN,
and that endDate is kept unchanged:
endDate 1483142400

If the NAN is not an np.nan, you can replace it first using df.replace.
After that, I renamed the columns to A and B using df.columns = ['A','B'].
Then you can just do the below using floordiv(), which is a pandas Series method:
df.B.update(df[df.A!='endDate']['B'].floordiv(1000))
This removes the last 3 zeros from every row except the endDate row and updates column B at the respective indices.
Alternatively, you can use // to remove the last 3 zeros, as shown below:
df.B.update(df[df.A!='endDate']['B'] // 1000)
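For reference, here is a minimal runnable sketch of that approach, assuming the transposed frame has the labels in column A and the raw values in column B (the three sample rows are taken from the data above, everything else is made up for illustration):

import numpy as np
import pandas as pd

# Stand-in for the transposed frame: labels in A, raw values in B.
df = pd.DataFrame({
    'A': ['capitalSurplus', 'endDate', 'otherCurrentAssets'],
    'B': [188883000, 1483142400, np.nan],
})

# Floor-divide every row except endDate by 1000; NaN simply stays NaN.
df.B.update(df[df.A != 'endDate']['B'] // 1000)
print(df)

Because column B contains a NaN it is stored as float, so the untouched endDate value stays numerically intact but prints as a float.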

Related

Removing values of a certain object type from a dataframe column in Pandas

I have a pandas dataframe where some values are integers and other values are arrays. I simply want to drop all of the rows that contain an array (object dtype, I believe) in my "ORIGIN_AIRPORT_ID" column, but I have not been able to figure out how to do so after trying many methods.
Here is what the first 20 rows of my dataframe look like. The values that show up like a list are the ones I want to remove. The dataset is a couple million rows, so I just need code that removes all of the array-like values in that specific dataframe column, if that makes sense.
df = df[df.origin_airport_ID.str.contains(',') == False]
Next time, you should consider giving us a data sample as text instead of a figure; it's easier for us to test your example.
Original data:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
3 20194149 [10397, 10398, 10399, 10400]
4 20194150 10397
In your case, you can use the pandas pd.to_numeric function:
df['ORIGIN_AIRPORT_ID'] = pd.to_numeric(df['ORIGIN_AIRPORT_ID'], errors='coerce')
It replaces every cell that cannot be converted into a number with NaN (Not a Number), so we get:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397.0
1 20194147 10397.0
2 20194148 10397.0
3 20194149 NaN
4 20194150 10397.0
To remove these rows, now just use .dropna:
df = df.dropna().astype('int')
which results in your desired DataFrame:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
4 20194150 10397
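Putting the two steps together, a minimal self-contained sketch looks like the following; here the list-like cell is written as a string, which is how such values typically appear when the file is read in, and the sample rows are taken from the excerpt above:

import pandas as pd

df = pd.DataFrame({
    'ITIN_ID': [20194146, 20194147, 20194148, 20194149, 20194150],
    'ORIGIN_AIRPORT_ID': ['10397', '10397', '10397',
                          '[10397, 10398, 10399, 10400]', '10397'],
})

# Coerce anything that is not a plain number to NaN, then drop those rows
# and cast the remaining float column back to int.
df['ORIGIN_AIRPORT_ID'] = pd.to_numeric(df['ORIGIN_AIRPORT_ID'], errors='coerce')
df = df.dropna().astype('int')
print(df)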

Pandas get_dummies for a column of lists where a cell may have no value in that column

I have a column in a dataframe where all the values are lists (usually a list of one item for each row), so I would like to use get_dummies to one-hot encode all the values. However, there may be a few rows with no value for the column. Originally I saw it as a nan, and I have since replaced that nan with an empty list, but in either case I do not see 0s and 1s in the result of get_dummies; rather, each generated column is blank (I would expect each generated column to be 0).
How do I get get_dummies to work with an empty list?
# create column from dict where value will be a list
X['sponsor_list'] = X['bill_id'].map(sponsor_non_plaw_dict)
# line to replace nan in sponsor_list column with empty list
X.loc[X['sponsor_list'].isnull(),['sponsor_list']] = X.loc[X['sponsor_list'].isnull(),'sponsor_list'].apply(lambda x: [])
# use of get_dummies to encode the sponsor_list column
X = pd.concat([X, pd.get_dummies(X.sponsor_list.apply(pd.Series).stack()).sum(level=0)], axis=1)
Example:
111th-congress_senate-bill_3695.txt False ['Menendez,_Robert_[D-NJ].txt']
112th-congress_house-bill_3630.txt False []
111th-congress_senate-bill_852.txt False ['Vitter,_David_[R-LA].txt']
114th-congress_senate-bill_2832.txt False ['Isakson,_Johnny_[R-GA].txt']
107th-congress_senate-bill_535.txt False ['Bingaman,_Jeff_[D-NM].txt']
I want to one-hot encode on the third column. The data item in the 2nd row has no person associated with it, so I need that row to be encoded with all 0s. The reason I need the third column to be a list is that I need to do this to a related column as well, where I need to have [0,n] values where n can be 5 or 10 or even 20.
X['sponsor_list'] = X['bill_id'].map(sponsor_non_plaw_dict)
X.loc[X['sponsor_list'].isnull(),['sponsor_list']] = X.loc[X['sponsor_list'].isnull(),'sponsor_list'].apply(lambda x: [])
mlb = MultiLabelBinarizer()
X = X.join(pd.DataFrame(mlb.fit_transform(X.pop('sponsor_list')),
                        columns=mlb.classes_,
                        index=X.index))
I used a MultiLabelBinarizer to capture what I was trying to do. I still replace nan with an empty list before applying it, but then I use fit_transform to create the 0/1 values, which can result in no 1s in a row, or many 1s in a row.
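A minimal self-contained version of that (with made-up bill IDs standing in for the real ones) shows the behaviour: the row whose list is empty comes out as all zeros.

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

X = pd.DataFrame({
    'bill_id': ['s3695', 'hb3630', 's852'],
    'sponsor_list': [['Menendez,_Robert_[D-NJ].txt'],
                     [],
                     ['Vitter,_David_[R-LA].txt']],
})

mlb = MultiLabelBinarizer()
# pop removes the list column; fit_transform yields one 0/1 column per label,
# and the empty-list row is encoded as all zeros.
X = X.join(pd.DataFrame(mlb.fit_transform(X.pop('sponsor_list')),
                        columns=mlb.classes_,
                        index=X.index))
print(X)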

Pandas DataFrame: sort_values by an index with empty strings

I have a pandas DataFrame with a multi-level index. I want to sort by one of the index levels. It has float values, but occasionally a few empty strings too, which I want to be considered as nan.
df = pd.DataFrame(dict(x=[1,2,3,4]), index=[1,2,3,''])
df.index.name = 'i'
df.sort_values('i')
TypeError: '<' not supported between instances of 'str' and 'int'
One way to solve the problem is to replace the empty strings with nan, do the sort, and then replace nan with empty strings again.
I am wondering if there is any way we could tweak sort_values to consider empty strings as nan.
Why are there empty strings in the first place?
In my application, the data actually has missing values which are read as np.nan. But np.nan values cause problems with groupby, so they are replaced with empty strings. I wish we had a constant like nan which is treated like an empty string by groupby and like nan for numeric operations.
I am wondering if there is any way we could tweak sort_values to consider empty strings as nan.
In pandas, missing values are not empty strings; only if you save a DataFrame with missing values are they replaced by empty strings.
Btw, the main problem is the mixed values - numeric with strings (empty values); it is best to convert all strings to numeric to avoid it.
You can replace the empty values with missing values by rename:
df = pd.DataFrame(dict(x=[1,2,3,4]), index=[1,2,3,''])
df.index.name = 'i'
df = df.rename({'':np.nan})
df = df.sort_values('i')
print (df)
x
i
1.0 1
2.0 2
3.0 3
NaN 4
A possible solution, if the original data cannot be changed, is to get the positions of the sorted values by Index.argsort and change the order by DataFrame.iloc:
df = df.iloc[df.rename({'':np.nan}).index.argsort()]
print (df)
x
i
1 1
2 2
3 3
4
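For completeness, a self-contained version of that second snippet, using the same toy frame as above; the original '' label stays in the index, and NaN is only borrowed to decide the sort order:

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(x=[1, 2, 3, 4]), index=[1, 2, 3, ''])
df.index.name = 'i'

# argsort over a NaN-renamed copy of the index gives the positions;
# iloc then reorders the original, unmodified frame.
order = df.rename({'': np.nan}).index.argsort()
df = df.iloc[order]
print(df)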

How to filter a Dataframe based on an ID-Column which corresponds to a second Dataframe containing conditions for each ID efficiently?

I have a Dataframe with one ID column and two data columns X, Y containing numeric values. For each ID there are several rows of data.
I have a second Dataframe with the same ID column and two numeric columns specifying the lower and upper limits of the X values for each ID.
I want to use the second Dataframe to filter the first Dataframe to only the rows whose X values lie within the X_min-X_max range of the specific ID.
I can solve this by looping over the second dataframe and filtering the groupby(ID) elements of the first DF, but that is slow for a large number of IDs. Is there an efficient way to solve this?
Example code with the data in df, the ranges in df_ranges and the expected result in df_result. The real DataFrame is obviously a lot bigger.
import pandas as pd
x=[2.1,2.2,2.6,2.4,2.8,3.5,2.8,3.2]
y=[3.1,3.5,3.4,2.7,2.1,2.7,4.1,4.3]
ID=[0]*4+[0.1]*4
x_min=[2.0,3.0]
x_max=[2.5,3.4]
IDs=[0,0.1]
df=pd.DataFrame({'ID':ID,'X':x,'Y':y})
df_ranges=pd.DataFrame({'ID':IDs,'X_min':x_min,'X_max':x_max})
df_result=df.iloc[[0,1,3,7],:]
Possible Solution:
def filter_ranges(grp, df_ranges):
    x_min = df_ranges.loc[df_ranges.ID==grp.name, 'X_min'].values[0]
    x_max = df_ranges.loc[df_ranges.ID==grp.name, 'X_max'].values[0]
    return grp.loc[(grp.X>=x_min)&(grp.X<=x_max), :]

target_df_grp = df.groupby('ID').apply(filter_ranges, df_ranges=df_ranges)
Try this:
merged = df.merge(df_ranges, on='ID')
target_df = merged[(merged.X>=merged.X_min)&(merged.X<=merged.X_max)][['ID', 'X', 'Y']] # Here, desired filter is applied.
print(target_df) will give:
ID X Y
0 0.0 2.1 3.1
1 0.0 2.2 3.5
3 0.0 2.4 2.7
7 0.1 3.2 4.3
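For reference, a minimal end-to-end version of the merge approach using the example data from the question; the merge is fully vectorized, so it avoids the per-ID Python loop and scales much better when there are many IDs:

import pandas as pd

df = pd.DataFrame({'ID': [0]*4 + [0.1]*4,
                   'X': [2.1, 2.2, 2.6, 2.4, 2.8, 3.5, 2.8, 3.2],
                   'Y': [3.1, 3.5, 3.4, 2.7, 2.1, 2.7, 4.1, 4.3]})
df_ranges = pd.DataFrame({'ID': [0, 0.1],
                          'X_min': [2.0, 3.0],
                          'X_max': [2.5, 3.4]})

# Attach each row's limits via a merge on ID, then keep only rows inside the range.
merged = df.merge(df_ranges, on='ID')
mask = (merged.X >= merged.X_min) & (merged.X <= merged.X_max)
target_df = merged.loc[mask, ['ID', 'X', 'Y']]
print(target_df)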

While pre-processing I have a huge count of columns with NaN values! Is there any possible way to replace the NaN in all columns with 'Zero' or 'N'?

For example, I have converted all the columns into a list sample: ['ib_home_market_value', 'ib_comm_involve_don_cultural', 'ib_comm_involve_political', 'ib_home_furnishings', 'ib_magazines', 'ib_womens_apparel']
Similarly, I have 200+ columns.
Total rows - 10L.
Sample counts for [ib_comm_involve_don_cultural]: Y - 309639, NAN - 690361.
Similarly, I need to work through all the columns and change the NaN values to either 'Zero' or 'N'. I need a function that changes the NaN values in all the columns.
I am doing preprocessing for a clustering model.
Sorry for the unreadable code; this is what I tried, applying fillna:
for i in list1:
    df1[i].fillna('N', inplace=True)
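As a rough sketch of one way to do this (assuming df1 is the frame and list1 holds the 200+ column names, as in the snippet above), the per-column loop can also be written as a single fillna call, and a dict allows a different fill value per column:

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real 200+ column frame.
df1 = pd.DataFrame({'ib_home_market_value': [1.0, np.nan, 3.0],
                    'ib_comm_involve_don_cultural': ['Y', np.nan, np.nan]})
list1 = df1.columns.tolist()

# Fill NaN in every listed column with 'N' in one call.
df1[list1] = df1[list1].fillna('N')

# Or give each column its own fill value via a dict:
# df1 = df1.fillna({'ib_home_market_value': 0, 'ib_comm_involve_don_cultural': 'N'})
print(df1)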