Restructuring a Pandas series - pandas

I have the following series:
r = [1,2,3,4,'None']
ser = pd.Series(r, copy=False)
The output of which is -
ser
Out[406]:
0 1
1 2
2 3
3 4
4 None
At ser[1], I want to set the value to be 'NULL' and copy the [2,3,4] to be shifted by one index.
Therefore the desired output would be:
ser
Out[406]:
0 1
1 NULL
2 2
3 3
4 4
I did the following which is not working:
slice_ser = ser[1:-1]
ser[2] = 'NULL'
ser[3:-1] = slice_ser
I am getting an error 'ValueError: cannot set using a slice indexer with a different length than the value'. How do I fix the issue?

I'd use shift for this:
>>> ser[1:] = ser[1:].shift(1).fillna('NULL')
>>> ser
0 1
1 NULL
2 2
3 3
4 4
dtype: object

You can shift values after position 1 and assign it back:
ser.iloc[1:] = ser.iloc[1:].shift()
ser
0 1
1 NaN
2 2
3 3
4 4
dtype: object

Related

Append new column to DF after sum?

I have a sample dataframe below:
sn C1-1 C1-2 C1-3 H2-1 H2-2 K3-1 K3-2
1 4 3 5 4 1 4 2
2 2 2 0 2 0 1 2
3 1 2 0 0 2 1 2
I will like to sum based on the prefix of C1, H2, K3 and output three new columns with the total sum. The final result is this:
sn total_c1 total_h2 total_k3
1 12 5 6
2 4 2 3
3 3 2 3
What I have tried on my original df:
lst = ["C1", "H2", "K3"]
lst2 = ["total_c1", "total_h2", "total_k3"]
for k in lst:
idx = df.columns.str.startswith(i)
for j in lst2:
df[j] = df.iloc[:,idx].sum(axis=1)
df1 = df.append(df, sort=False)
But I kept getting error
IndexError: Item wrong length 35 instead of 36.
I can't figure out how to append the new total column to produce my end result in the loop.
Any help will be appreciated (or better suggestion as oppose to loop). Thank you.
You can use groupby:
# columns of interest
cols = df.columns[1:]
col_groups = cols.str.split('-').str[0]
out_df = df[['sn']].join(df[cols].groupby(col_groups, axis=1)
.sum()
.add_prefix('total_')
)
Output:
sn total_C1 total_H2 total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
Let us try ,split then groupby with it with axis=1
out = df.groupby(df.columns.str.split('-').str[0],axis=1).sum().set_index('sn').add_prefix('Total_').reset_index()
Out[84]:
sn Total_C1 Total_H2 Total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
Another option, where we create a dictionary to groupby the columns:
mapping = {entry: f"total_{entry[:2]}" for entry in df.columns[1:]}
result = df.groupby(mapping, axis=1).sum()
result.insert(0, "sn", df.sn)
result
sn total_C1 total_H2 total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3

Setting groupby nunique to dataframe column

I have a groupby that I am trying to set as a new column within my dataframe, but when I set a column name to the result of the groupby its returns NaN as the value of each row. If the groupby is set to a new value and then printed the value returns the gropby values and the nunique for each. Is the issue I am facing due to some indexing that needs to be resolved on the dataframe?
When set to column:
merged_df['noramlized_values'] = merged_df.groupby(['be_hash'])['id'].nunique()
// normalized_values
// NaN
When set to a new value:
test = merged_df.groupby(['be_hash'])['id'].nunique()
// ij32ndshufho23nd 1
Data example
id date be_hash unique_call_rank normalized_calls What I want
1 10/20/20 10171 1 3 1
1 10/20/20 10171 1 3 0
2 10/20/20 10171 2 3 1
3 10/23/20 10171 3 3 1
Use DataFrame.duplicated with both columns and numpy.where:
merged_df['noramlized_values'] = np.where(merged_df.duplicated(['be_hash','id']), 0, 1)
print (merged_df)
id date be_hash unique_call_rank normalized_calls What I want \
0 1 10/20/20 10171 1 3 1
1 1 10/20/20 10171 1 3 0
2 2 10/20/20 10171 2 3 1
3 3 10/23/20 10171 3 3 1
noramlized_values
0 1
1 0
2 1
3 1

pandas: idxmax for k-th largest

Having df of probabilities distribution, I get max probability for rows with df.idxmax(axis=1) like this:
df['1k-th'] = df.idxmax(axis=1)
and get the following result:
(scroll the tables to the right if you can not see all the columns)
0 1 2 3 4 5 6 1k-th
0 0.114869 0.020708 0.025587 0.028741 0.031257 0.031619 0.747219 6
1 0.020206 0.012710 0.010341 0.012196 0.812495 0.113863 0.018190 4
2 0.023585 0.735475 0.091795 0.021683 0.027581 0.054217 0.045664 1
3 0.009834 0.009175 0.013165 0.016014 0.015507 0.899115 0.037190 5
4 0.023357 0.736059 0.088721 0.021626 0.027341 0.056289 0.046607 1
the question is how to get the 2-th, 3th, etc probabilities, so that I get the following result?:
0 1 2 3 4 5 6 1k-th 2-th
0 0.114869 0.020708 0.025587 0.028741 0.031257 0.031619 0.747219 6 0
1 0.020206 0.012710 0.010341 0.012196 0.812495 0.113863 0.018190 4 3
2 0.023585 0.735475 0.091795 0.021683 0.027581 0.054217 0.045664 1 4
3 0.009834 0.009175 0.013165 0.016014 0.015507 0.899115 0.037190 5 4
4 0.023357 0.736059 0.088721 0.021626 0.027341 0.056289 0.046607 1 2
Thank you!
My own solution is not the prettiest, but does it's job and works fast:
for i in range(7):
p[f'{i}k'] = p[[0,1,2,3,4,5,6]].idxmax(axis=1)
p[f'{i}k_v'] = p[[0,1,2,3,4,5,6]].max(axis=1)
for x in range(7):
p[x] = np.where(p[x]==p[f'{i}k_v'], np.nan, p[x])
The loop does:
finds the largest value and it's column index
drops the found value (sets to nan)
again
finds the 2nd largest value
drops the found value
etc ...

Pandas truth value of series ambiguous

I am trying to set one column in a dataframe in pandas based on whether another column value is in a list.
I try:
df['IND']=pd.Series(np.where(df['VALUE'] == 1 or df['VALUE'] == 4, 1,0))
But I get: Truth value of a Series is ambiguous.
What is the best way to achieve the functionality:
If VALUE is in (1,4), then IND=1, else IND=0
You need to assign the else value and then modify it with a mask using isin
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
For multiple conditions, you can do as follow:
mask1 = df['VALUE'].isin([1,4])
mask2 = df['SUBVALUE'].isin([10,40])
df['IND'] = 0
df.loc[mask1 & mask2, 'IND'] = 1
Consider below example:
df = pd.DataFrame({
'VALUE': [1,1,2,2,3,3,4,4]
})
Output:
VALUE
0 1
1 1
2 2
3 2
4 3
5 3
6 4
7 4
Then,
df['IND'] = 0
df.loc[df['VALUE'].isin([1,4]), 'IND'] = 1
Output:
VALUE IND
0 1 1
1 1 1
2 2 0
3 2 0
4 3 0
5 3 0
6 4 1
7 4 1

Delete all rows from pandas dataframe containing 0 as string and integer

I am using read_csv to load data from Yahoo Finance leads to rows containing 0 sometimes as string and at other times as integer. Trying to drop / delete these rows per Boolean masking:
df[(df != '0') & (df != 0)]
leads to errors:
TypeError: Could not compare ['0'] with block values
(in case the dataframe does not have any row with the string value '0') and
TypeError: Could not compare [0] with block values
(in case the frame does not have any integer value 0).
With the following dataframe:
df = pd.DataFrame({'int': [0,0,2,3,0,0,1,2,3],
'string': ['0','1','2','3','0','0','1','2','0']})
int string
0 0 0
1 0 1
2 2 2
3 3 3
4 0 0
5 0 0
6 1 1
7 2 2
8 3 0
The following code should work:
df = df[df.string != '0']
df = df[df.int != 0]
This gives the following output:
int string
2 2 2
3 3 3
6 1 1
7 2 2