Delete all rows from pandas dataframe containing 0 as string and integer

Using read_csv to load data from Yahoo Finance leads to rows containing 0, sometimes as a string and at other times as an integer. Trying to drop these rows with Boolean masking:
df[(df != '0') & (df != 0)]
leads to errors:
TypeError: Could not compare ['0'] with block values
(in case the dataframe does not have any row with the string value '0') and
TypeError: Could not compare [0] with block values
(in case the frame does not have any integer value 0).

With the following dataframe:
df = pd.DataFrame({'int': [0,0,2,3,0,0,1,2,3],
'string': ['0','1','2','3','0','0','1','2','0']})
int string
0 0 0
1 0 1
2 2 2
3 3 3
4 0 0
5 0 0
6 1 1
7 2 2
8 3 0
The following code should work:
df = df[df.string != '0']
df = df[df.int != 0]
This gives the following output:
int string
2 2 2
3 3 3
6 1 1
7 2 2
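If the zeros can appear in any column and as either type, one option is to build the mask from a string view of the whole frame. This is only a minimal sketch, assuming that casting every column with astype(str) is acceptable for the comparison (a float 0.0 would become '0.0' and not match). The names has_zero and cleaned are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'int': [0, 0, 2, 3, 0, 0, 1, 2, 3],
                   'string': ['0', '1', '2', '3', '0', '0', '1', '2', '0']})

# Cast every cell to str so that 0 and '0' compare the same way,
# then flag rows in which any cell equals '0'.
has_zero = df.astype(str).eq('0').any(axis=1)

# Keep only the rows without a zero. The string cast is used for the
# mask only, so the columns of `cleaned` keep their original dtypes.
cleaned = df[~has_zero]
print(cleaned)
```

Because the mask is computed in one pass over a uniform string view, it does not matter whether a given load produced '0', 0, or both.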

Related

Restructuring a Pandas series

I have the following series:
r = [1,2,3,4,'None']
ser = pd.Series(r, copy=False)
The output of which is -
ser
Out[406]:
0 1
1 2
2 3
3 4
4 None
I want to set ser[1] to 'NULL' and shift the values [2, 3, 4] down by one position.
Therefore the desired output would be:
ser
Out[406]:
0 1
1 NULL
2 2
3 3
4 4
I did the following which is not working:
slice_ser = ser[1:-1]
ser[2] = 'NULL'
ser[3:-1] = slice_ser
I am getting an error 'ValueError: cannot set using a slice indexer with a different length than the value'. How do I fix the issue?
I'd use shift for this:
>>> ser[1:] = ser[1:].shift(1).fillna('NULL')
>>> ser
0 1
1 NULL
2 2
3 3
4 4
dtype: object
Alternatively, shift the values from position 1 onward and assign them back. Note that position 1 is left as NaN rather than 'NULL':
ser.iloc[1:] = ser.iloc[1:].shift()
ser
0 1
1 NaN
2 2
3 3
4 4
dtype: object
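The two answers above can be combined into one self-contained sketch. It assumes the series has object dtype (as it does in the question, because of the 'None' string) and uses the fill_value parameter of shift to write 'NULL' directly instead of filling NaN afterwards.

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 'None'], dtype=object)

# Shift everything from position 1 onward down by one slot.
# The last element ('None') falls off the end, and the freed
# slot at position 1 is filled with the string 'NULL'.
shifted = ser.iloc[1:].shift(1, fill_value='NULL')

# Assign the raw values back positionally.
ser.iloc[1:] = shifted.to_numpy()
print(ser)
```

Slicing both sides with iloc[1:] keeps the lengths equal, which avoids the ValueError from the original attempt.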

Pandas Dataframe: split column into multiple columns

I need to split a column in a DataFrame that currently holds multiple values per cell (someone else's Excel sheet, unfortunately) for a categorical field that allows multiple selections.
As the screenshot below shows, the column has 15 category codes listed in its header.
[Image: original DataFrame]
I want to split the column by the category codes in the ['Pamphlet'] header, then map the values recorded for each row to their respective new columns as 1 for checked and 0 for unchecked, instead of the raw values [1,2,4,5].
This is the code that splits on the commas between values, but I still need to put the results into the new columns created by splitting ['Pamphlet'] by the values in its header [15: 1) OSA\n2) Nutrition\n3) Activity\n4) etc.].
df_old['Pamphlets'].str.split(pat = ',', n = -1, expand = True)
[Image: shape of desired DataFrame]
If I could just get an outline of the best approach, and whether this is even possible within pandas, that would be great. Thanks.
You need to go through your columns one by one and divide the headers, then create a new dataframe for each column made up of split columns, then join all that back to the original dataframe. It's a bit messy but doable.
You need to use a function and some loops to go through the columns.
First let's define the dataframe. (It would be much appreciated if future questions supplied a reproducible dataframe and any other data.)
data = {
"1) Mail\n2) Email \n3) At PAC/TPAC": [2, 1, 3, 2, 3, 1, 3, 2, 3, 1],
"1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation": [5, 1, 4, 4, 2, 5, 1, 4, 3, 2],
}
df_full = pd.DataFrame(data)
print(df_full)
1) Mail\n2) Email \n3) At PAC/TPAC 1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation
0 2 5
1 1 1
2 3 4
3 2 4
4 3 2
5 1 5
6 3 1
7 2 4
8 3 3
9 1 2
We will go through the dataframe column by column using a function. For now, let's build the result manually for the first column; afterwards we'll turn these steps into a function.
First, let's grab the first column.
s_col = df_full.iloc[:, 0]
print(s_col)
0 2
1 1
2 3
3 2
4 3
5 1
6 3
7 2
8 3
9 1
Name: 1) Mail\n2) Email \n3) At PAC/TPAC, dtype: int64
Split the header into individual pieces.
col = s_col.name.split("\n")
print(col)
['1) Mail', '2) Email ', '3) At PAC/TPAC']
Clean up any leading or trailing white space.
col = [x.strip() for x in col]
print(col)
['1) Mail', '2) Email', '3) At PAC/TPAC']
Create a new dataframe from series and column heads.
data = {col[x]: s_col.to_list() for x in range(len(col))}
df = pd.DataFrame(data)
print(df)
1) Mail 2) Email 3) At PAC/TPAC
0 2 2 2
1 1 1 1
2 3 3 3
3 2 2 2
4 3 3 3
5 1 1 1
6 3 3 3
7 2 2 2
8 3 3 3
9 1 1 1
Create a copy to make changes to the values.
df_res = df.copy()
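Go through the column headers, get the leading code number, then set matching cells to 1 and all others to 0.
for col in df.columns:
    # Parse everything before ")" so multi-digit codes such as "10)" also work.
    value = int(col.split(")")[0])
    df_res.loc[df[col] == value, col] = 1
    df_res.loc[df[col] != value, col] = 0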
print(df_res)
1) Mail 2) Email 3) At PAC/TPAC
0 0 1 0
1 1 0 0
2 0 0 1
3 0 1 0
4 0 0 1
5 1 0 0
6 0 0 1
7 0 1 0
8 0 0 1
9 1 0 0
Now we have split a column into its components and assigned a bool value.
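The manual steps above can also be written without the intermediate copy. The following is a minimal sketch, not the answer's own code: it compares the series against each code directly and assumes every header piece has the form "<number>) <label>". The names labels and code are introduced here for illustration.

```python
import pandas as pd

s_col = pd.Series([2, 1, 3, 2, 3, 1, 3, 2, 3, 1],
                  name="1) Mail\n2) Email \n3) At PAC/TPAC")

# Split the header into its pieces and strip stray whitespace.
labels = [part.strip() for part in s_col.name.split("\n")]

# For each piece, parse the code before ")" and mark the rows whose
# value equals that code with 1, all other rows with 0.
df_res = pd.DataFrame({
    label: (s_col == int(label.split(")")[0])).astype(int)
    for label in labels
})
print(df_res)
```

Each row ends up with exactly one 1, because every cell in this sample holds a single code.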
Let's step back and make the above a function so we can use it for each column in the original dataframe.
def split_column(s_col):
    # Split the header into individual pieces.
    labels = s_col.name.split("\n")
    # Clean up any leading or trailing white space.
    labels = [x.strip() for x in labels]
    # Create a new dataframe with one copy of the series per header piece.
    data = {label: s_col.to_list() for label in labels}
    df = pd.DataFrame(data)
    # Create a copy to make changes to the values.
    df_res = df.copy()
    # Go through the column headers, get the leading code number,
    # then set matching cells to 1 and all others to 0.
    for col in df.columns:
        # Parse everything before ")" so multi-digit codes also work.
        value = int(col.split(")")[0])
        df_res.loc[df[col] == value, col] = 1
        df_res.loc[df[col] != value, col] = 0
    return df_res
Now for the last step. Let's create a loop to go through the columns in the original dataframe, call the function to split each column, and then concat it to the original dataframe less the columns that were split.
for c in df_full.columns:
    # Call the function to get the split columns in a new dataframe.
    df_split = split_column(df_full[c])
    # Join it with the original full dataframe but drop the current column.
    df_full = pd.concat([df_full.loc[:, ~df_full.columns.isin([c])], df_split], axis=1)
print(df_full)
1) Mail 2) Email 3) At PAC/TPAC 1) ACC 2) IM 3) PT 4) Smoking, 5) Cessation
0 0 1 0 0 0 0 0 1
1 1 0 0 1 0 0 0 0
2 0 0 1 0 0 0 1 0
3 0 1 0 0 0 0 1 0
4 0 0 1 0 1 0 0 0
5 1 0 0 0 0 0 0 1
6 0 0 1 1 0 0 0 0
7 0 1 0 0 0 0 1 0
8 0 0 1 0 0 1 0 0
9 1 0 0 0 1 0 0 0
Here is the full code...
import pandas as pd

data = {
    "1) Mail\n2) Email \n3) At PAC/TPAC": [2, 1, 3, 2, 3, 1, 3, 2, 3, 1],
    "1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation": [5, 1, 4, 4, 2, 5, 1, 4, 3, 2],
}
df_full = pd.DataFrame(data)

def split_column(s_col):
    # Split the header into individual pieces.
    labels = s_col.name.split("\n")
    # Clean up any leading or trailing white space.
    labels = [x.strip() for x in labels]
    # Create a new dataframe with one copy of the series per header piece.
    data = {label: s_col.to_list() for label in labels}
    df = pd.DataFrame(data)
    # Create a copy to make changes to the values.
    df_res = df.copy()
    # Go through the column headers, get the leading code number,
    # then set matching cells to 1 and all others to 0.
    for col in df.columns:
        # Parse everything before ")" so multi-digit codes also work.
        value = int(col.split(")")[0])
        df_res.loc[df[col] == value, col] = 1
        df_res.loc[df[col] != value, col] = 0
    return df_res

for c in df_full.columns:
    # Call the function to get the split columns in a new dataframe.
    df_split = split_column(df_full[c])
    # Join it with the original full dataframe but drop the current column.
    df_full = pd.concat([df_full.loc[:, ~df_full.columns.isin([c])], df_split], axis=1)
print(df_full)
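The walkthrough above assumes one code per cell. The question, however, mentions cells holding several comma-separated codes such as 1,2,4,5. For that case, Series.str.get_dummies may be a shorter route. The sketch below uses made-up sample data, and it assumes the codes are stored as strings with no spaces around the commas.

```python
import pandas as pd

# Hypothetical sample: each cell lists the checked category codes.
df_old = pd.DataFrame({"Pamphlet": ["1,2,4", "5", "2,3", "1"]})

# One 0/1 indicator column per distinct code found in the data.
# The new column labels are the code strings themselves.
indicators = df_old["Pamphlet"].str.get_dummies(sep=",")

# Order the columns numerically so that "10" would sort after "9".
indicators = indicators[sorted(indicators.columns, key=int)]
print(indicators)
```

Only codes that actually occur in the data get a column, so a code that nobody checked would have to be added separately, for example with reindex(columns=..., fill_value=0). The columns can then be renamed to the labels parsed from the header.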

How to split a column in a data frame containing only numbers into multiple columns in pandas

I have a .dat file containing the following data:
0001100000101010100
110101000001111
101100011001110111
0111111010100
1010111111100011
I need to count the number of zeros and ones in each row.
I have tried with pandas:
Step-1: Read the data file
Step-2: Give the column a name
Step-3: Tried to split the values into multiple columns, but could not succeed
df1=pd.read_csv('data.dat',header=None)
df1.head()
0 1100000101010100
1 110101000001111
2 101100011001110111
3 111111010100
4 1010111111100011
df1.columns=['kirti']
df1.head()
Kirti
_______________________
0 1100000101010100
1 110101000001111
2 101100011001110111
3 111111010100
4 1010111111100011
I need to split the data frame into multiple columns, one per 0 or 1 in each row.
The maximum number of columns will equal the length of the longest row in the data frame.
First create a one-column DataFrame, using the names parameter to label the column and dtype=str so the values are read as strings:
import pandas as pd
from io import StringIO
temp="""0001100000101010100
110101000001111
101100011001110111
0111111010100
1010111111100011"""
# after testing, replace 'StringIO(temp)' with the file name
df = pd.read_csv(StringIO(temp), header=None, names=['kirti'], dtype=str)
print (df)
kirti
0 0001100000101010100
1 110101000001111
2 101100011001110111
3 0111111010100
4 1010111111100011
Then create a new DataFrame by converting each value to a list of characters:
df = pd.DataFrame([list(x) for x in df['kirti']])
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0
1 1 1 0 1 0 1 0 0 0 0 0 1 1 1 1 None None None None
2 1 0 1 1 0 0 0 1 1 0 0 1 1 1 0 1 1 1 None
3 0 1 1 1 1 1 1 0 1 0 1 0 0 None None None None None None
4 1 0 1 0 1 1 1 1 1 1 1 0 0 0 1 1 None None None
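Once the digits sit in separate columns, the per-row counts asked for in the question can be read off by comparing against '0' and '1'. A minimal sketch, assuming the values were read as strings as above; the names wide, zeros and ones are introduced here for illustration.

```python
import pandas as pd

rows = ["0001100000101010100", "110101000001111", "101100011001110111",
        "0111111010100", "1010111111100011"]

# One column per character; shorter rows are padded with None.
wide = pd.DataFrame([list(x) for x in rows])

# The None padding never equals '0' or '1', so it is not counted.
zeros = wide.eq("0").sum(axis=1)
ones = wide.eq("1").sum(axis=1)
print(pd.DataFrame({"zeros": zeros, "ones": ones}))
```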
If your data is in a list of strings, then use the count method:
>> data = ["0001100000101010100", "110101000001111", "101100011001110111", "0111111010100", "1010111111100011"]
>> for i in data:
       print(i.count("0"))
13
7
7
5
5
If your data is in a .dat file with whitespace separation as you described, then I would recommend loading your data as follows:
data = pd.read_csv("data.dat", lineterminator=" ",dtype="str", header=None, names=["Kirti"])
Kirti
0 0001100000101010100
1 110101000001111
2 101100011001110111
3 0111111010100
4 1010111111100011
The lineterminator argument ensures that every entry is in a new row. The dtype argument ensures that it's read as a string; otherwise you will lose leading zeros.
If your data is in a DataFrame, you can use the str.count method:
>> data["Kirti"].str.count("0")
0 13
1 7
2 7
3 5
4 5
Name: Kirti, dtype: int64
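To get both counts side by side, the same str.count call can be applied once per digit. This is a small sketch under the assumption that the column was read with dtype str. The in-memory text stands in for the data file, and the name counts is introduced here for illustration.

```python
from io import StringIO

import pandas as pd

raw = """0001100000101010100
110101000001111
101100011001110111
0111111010100
1010111111100011"""

# dtype=str keeps the leading zeros intact.
data = pd.read_csv(StringIO(raw), header=None, names=["Kirti"], dtype=str)

# Count each digit per row and collect the results in one frame.
counts = pd.DataFrame({
    "zeros": data["Kirti"].str.count("0"),
    "ones": data["Kirti"].str.count("1"),
})
print(counts)
```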

Pandas truth value of series ambiguous

I am trying to set one column in a dataframe in pandas based on whether another column value is in a list.
I try:
df['IND']=pd.Series(np.where(df['VALUE'] == 1 or df['VALUE'] == 4, 1,0))
But I get: Truth value of a Series is ambiguous.
What is the best way to achieve the functionality:
If VALUE is in (1,4), then IND=1, else IND=0
Assign the else value first, then overwrite it where a mask built with isin is true:
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
For multiple conditions, you can do as follows:
mask1 = df['VALUE'].isin([1,4])
mask2 = df['SUBVALUE'].isin([10,40])
df['IND'] = 0
df.loc[mask1 & mask2, 'IND'] = 1
Consider the example below:
df = pd.DataFrame({
'VALUE': [1,1,2,2,3,3,4,4]
})
Output:
VALUE
0 1
1 1
2 2
3 2
4 3
5 3
6 4
7 4
Then,
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
Output:
VALUE IND
0 1 1
1 1 1
2 2 0
3 2 0
4 3 0
5 3 0
6 4 1
7 4 1
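The original np.where attempt fails because Python's or asks each Series for a single True/False value. It can be kept in one line by passing an element-wise mask instead. The sketch below shows two equivalent masks, one built with isin and one with the | operator; the column name IND_OR is introduced here only to compare the two.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"VALUE": [1, 1, 2, 2, 3, 3, 4, 4]})

# isin builds an element-wise boolean mask, so np.where can use it directly.
df["IND"] = np.where(df["VALUE"].isin([1, 4]), 1, 0)

# The same condition with explicit comparisons: use | (element-wise OR)
# and parenthesise each comparison, since | binds tighter than ==.
df["IND_OR"] = np.where((df["VALUE"] == 1) | (df["VALUE"] == 4), 1, 0)
print(df)
```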

Pandas iterate max value of a variable length slice in a series

Let's assume I have a Pandas DataFrame as follows:
import pandas as pd
idx = ['2003-01-02', '2003-01-03', '2003-01-06', '2003-01-07',
'2003-01-08', '2003-01-09', '2003-01-10', '2003-01-13',
'2003-01-14', '2003-01-15', '2003-01-16', '2003-01-17',
'2003-01-21', '2003-01-22', '2003-01-23', '2003-01-24',
'2003-01-27']
a = pd.DataFrame([1,2,0,0,1,2,3,0,0,0,1,2,3,4,5,0,1],
columns = ['original'], index = pd.to_datetime(idx))
I am trying to get the max of each slice of that DataFrame between two zeros.
In this example I would get:
a['result'] = [0,2,0,0,0,0,3,0,0,0,0,0,0,0,5,0,1]
that is:
original result
2003-01-02 1 0
2003-01-03 2 2
2003-01-06 0 0
2003-01-07 0 0
2003-01-08 1 0
2003-01-09 2 0
2003-01-10 3 3
2003-01-13 0 0
2003-01-14 0 0
2003-01-15 0 0
2003-01-16 1 0
2003-01-17 2 0
2003-01-21 3 0
2003-01-22 4 0
2003-01-23 5 5
2003-01-24 0 0
2003-01-27 1 1
1. Find the zeros.
2. Use cumsum to make groups.
3. Mask the zeros into their own group, -1.
4. Find the location of the max in each group with idxmax.
5. Drop the entry for group -1, which only held the zeros anyway.
6. Take a.original at the found max locations, reindex, and fill with zeros.
m = a.original.eq(0)
g = a.original.groupby(m.cumsum().mask(m, -1))
i = g.idxmax().drop(-1)
a.assign(result=a.loc[i, 'original'].reindex(a.index, fill_value=0))
original result
2003-01-02 1 0
2003-01-03 2 2
2003-01-06 0 0
2003-01-07 0 0
2003-01-08 1 0
2003-01-09 2 0
2003-01-10 3 3
2003-01-13 0 0
2003-01-14 0 0
2003-01-15 0 0
2003-01-16 1 0
2003-01-17 2 0
2003-01-21 3 0
2003-01-22 4 0
2003-01-23 5 5
2003-01-24 0 0
2003-01-27 1 1
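A variation on the same grouping idea uses transform instead of idxmax. This is only a sketch under two assumptions: the values are non-negative, and a tie within a run may be marked more than once (idxmax marks only the first occurrence). A plain integer index replaces the dates for brevity, and the names is_zero, run_id and run_max are introduced here for illustration.

```python
import pandas as pd

values = [1, 2, 0, 0, 1, 2, 3, 0, 0, 0, 1, 2, 3, 4, 5, 0, 1]
a = pd.DataFrame({"original": values})

# Each zero starts a new run; cumsum turns the zero flags into run ids.
is_zero = a["original"].eq(0)
run_id = is_zero.cumsum()

# Broadcast the max of each run back to every row of that run.
run_max = a["original"].groupby(run_id).transform("max")

# Keep a value only where it is non-zero and equals the max of its run.
keep = a["original"].eq(run_max) & ~is_zero
a["result"] = a["original"].where(keep, 0)
print(a)
```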