Get column index label based on values - pandas

I have the following:
   C1  C2  C3
0   0   0   1
1   0   0   1
2   0   0   1
I would like to get the label of the column that contains the 1s, so the result should be "C3".
I know how to do this by transposing the dataframe and then getting the index values, but that is not ideal for the dataframes I have, and I wonder whether there is a more efficient solution.

I would save the result in a list, because more than one column could contain values equal to 1. You can use DataFrame.loc.
If every value in the column must be 1, you can use:
df.loc[:,df.eq(1).all()].columns.tolist()
Output:
['C3']
If it is enough for at least one value in the column to be 1, use:
df.loc[:,df.eq(1).any()].columns.tolist()
Or, as suggested by @piRSquared, you can select directly from df.columns:
[*df.columns[df.eq(1).all()]]
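To make the difference between all() and any() concrete, here is a small self-contained sketch. The first frame reproduces the one from the question; the second frame (df2) is a made-up variation, added only to show a case where the two results differ.

```python
import pandas as pd

# Frame from the question: only C3 holds 1s.
df = pd.DataFrame({"C1": [0, 0, 0], "C2": [0, 0, 0], "C3": [1, 1, 1]})

# Columns where every value equals 1.
all_ones = df.columns[df.eq(1).all()].tolist()
print(all_ones)  # ['C3']

# Hypothetical variation: C2 now contains a single 1.
df2 = pd.DataFrame({"C1": [0, 0, 0], "C2": [0, 1, 0], "C3": [1, 1, 1]})

# Columns where every value equals 1 vs. columns with at least one 1.
all_ones_2 = df2.columns[df2.eq(1).all()].tolist()
any_ones_2 = df2.columns[df2.eq(1).any()].tolist()
print(all_ones_2)  # ['C3']
print(any_ones_2)  # ['C2', 'C3']
```

Neither form transposes the frame: df.eq(1) builds a boolean frame of the same shape, and all() or any() reduces it down each column.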

Related

How to change rows in pandas based on an attribute of the other rows

I have a dataframe with columns: A(continuous variable) and B(discrete 1 or 0). The df is initially sorted by A variable.
I need to order the dataframe so that, for each set of X rows, there are Y rows with value 1 in column B and (X-Y) rows with value 0 in column B (when possible!). Within these sets, variable A should still be in descending order. X and Y are input by the user.
Example:
X=4, Y=3
Rows 0-11 are OK, since the sets (0-3), (4-7) and (8-11) each have 3 rows with 1 in column B and only one row with 0, AND variable A is descending. However, rows 12-15 are not OK, since there are two rows with 1 in column B and two with 0. Row 17 would replace row 15 to make this set valid. It is not a problem if the last rows all have 0 in column B, since no rows with value 1 are left.
The code should be general enough to run on dataframes with different number of rows.
Any ideas?

Single cell string to list to multiple rows

I have a pandas data frame,
Currently the list column is a string. I want to split it on spaces and replicate the rows so that each primary key is associated with each item in its list. Can you please advise me on how I can achieve this?
Edit:
I need to copy down the value column after splitting and stacking the list column
If your data frame is df you can do:
df.List.str.split(' ').apply(pd.Series).stack()
and you will get
Primary Key
0            0    a
             1    b
             2    c
1            0    d
             1    e
             2    f
dtype: object
You are splitting the variable List on spaces, turning the resulting list into a series, and then stacking it to turn it into long format, indexed on the primary key, along with a sequence for each item obtained from the split.
My version:
df['List'].str.split().explode()
produces
0 a
0 b
0 c
1 d
1 e
1 f
With regard to the Edit of the question, I think the following tweak will give you what you need:
df['List'] = df['List'].str.split()
df.explode('List')
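Since the question's table is not shown here, the following sketch uses a made-up frame to illustrate how explode copies the other columns down. The column name Value and its contents are assumptions, chosen only for the illustration.

```python
import pandas as pd

# Hypothetical data: the real frame from the question is not reproduced here.
df = pd.DataFrame(
    {"List": ["a b c", "d e f"], "Value": [10, 20]},
    index=pd.Index([0, 1], name="Primary Key"),
)

# Split each string on whitespace, then give every list item its own row.
# The index label and the other columns are repeated for each item.
long_df = df.assign(List=df["List"].str.split()).explode("List")
print(long_df)
```

Each primary key now appears once per list item, with its Value repeated on every one of those rows.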
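Here is another solution, which also adds a counter (cc) for the position of each item within its original row and appends it to the index:
df = df.assign(**{'list':df['list'].str.split()}).explode('list')
df['cc'] = df.groupby(level=0)['list'].cumcount()
df = df.set_index(['cc'],append=True)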

collapse pandas dataframe rows based on index column

I have a dataframe that contains information that is linked by an ID column. The rows are sequential with the odd rows containing a "start-point" and the even rows containing an "end" point. My goal is to collapse the data from these into a single row with columns for "start" and "end" following each other. The rows do have a "packet ID" that would link them if the sequential nature of the dataframe is not consistent.
example:
df:
0 1 2 3 4 5
0 hs6 106956570 106956648 ID_A1 60 -
1 hs1 153649721 153649769 ID_A1 60 -
2 hs1 865130744 865130819 ID_A2 0 -
3 hs7 21882206 21882237 ID_A2 0 -
4 hs1 74230744 74230819 ID_A3 0 +
5 hs8 92041314 92041508 ID_A3 0 +
The resulting dataframe that I am trying to achieve is:
new_df
0 1 2 3 4 5
0 hs6 106956570 106956648 hs1 153649721 153649769
1 hs1 865130744 865130819 hs7 21882206 21882237
2 hs1 74230744 74230819 hs8 92041314 92041508
with each row containing the information on both the start and the end-point.
I have tried to pass the IDs into an array and use a for loop to pull the information out of the original dataframe into a new dataframe, but this has not worked. I was looking at the melt documentation, which suggests that pd.melt(df, id_vars=[3], value_vars=[0,1,2]) may work, but I cannot see how to get the corresponding row into positions new_df[3,4,5].
I think that it may be something really simple that I am missing but any suggestions would be appreciated.
You can try this:
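import numpy as np
# Index by (position within pair, pair number), then move the position into the columns.
df_out = df.set_index([df.index%2, df.index//2])[df.columns[:3]]\
.unstack(0).sort_index(level=1, axis=1)
# Relabel the resulting MultiIndex columns as 0..5.
df_out.columns = np.arange(len(df_out.columns))
df_out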
Output:
0 1 2 3 4 5
0 hs6 106956570 106956648 hs1 153649721 153649769
1 hs1 865130744 865130819 hs7 21882206 21882237
2 hs1 74230744 74230819 hs8 92041314 92041508
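For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds the example frame from the question and applies the same reshape. It assumes, as the answer does, that the rows strictly alternate start, end, start, end with a default 0..n-1 index.

```python
import numpy as np
import pandas as pd

# Example rows copied from the question (columns are labelled 0..5).
rows = [
    ["hs6", 106956570, 106956648, "ID_A1", 60, "-"],
    ["hs1", 153649721, 153649769, "ID_A1", 60, "-"],
    ["hs1", 865130744, 865130819, "ID_A2", 0, "-"],
    ["hs7", 21882206, 21882237, "ID_A2", 0, "-"],
    ["hs1", 74230744, 74230819, "ID_A3", 0, "+"],
    ["hs8", 92041314, 92041508, "ID_A3", 0, "+"],
]
df = pd.DataFrame(rows)

df_out = (
    # Level 0: 0 for a start row, 1 for an end row. Level 1: pair number.
    df.set_index([df.index % 2, df.index // 2])[df.columns[:3]]
    # Move the start/end level into the columns: one row per pair.
    .unstack(0)
    # Put all "start" columns first, then all "end" columns.
    .sort_index(level=1, axis=1)
)
# Flatten the MultiIndex columns to 0..5.
df_out.columns = np.arange(len(df_out.columns))
print(df_out)
```

If the rows are not reliably in start/end order, the parity trick no longer applies, and the pairing would have to come from the ID column instead.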

change value (string manipulation) in Pandas DataFrame

I am reading a CSV file into a Pandas DataFrame, but the data needs to be cleaned up before it can be used. I need to do two things:
use regex to filter values
apply string functions such as trim, left, right, ...
For instance, the DataFrame may look like:
0 city_some_string_45
1 city_Other_string_56
2 city_another_string_77
So I need to filter (using regex) for all rows whose value starts with "city" and get the last two characters.
The end result should look like:
0 45
1 56
2 77
In other words, the logic I want to apply is: read the value of the cell and, if it starts with "city" (filtering with a regex, i.e. ^city), replace the value with its last two characters (e.g. using a "right" string function).
For a dataframe like this:
No city
0 0 city_some_string_45
1 1 city_Other_string_56
2 2 city_another_string_77
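Filter the dataframe to keep the rows whose city column starts with "city":
df = df[df.city.str.startswith('city')]
You can then use str.extract to pull out only the number (a raw string avoids an invalid-escape warning for \d, and expand=False returns a Series):
df['city'] = df.city.str.extract(r'(\d+)', expand=False).astype(int)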
The resulting df
No city
0 0 45
1 1 56
2 2 77
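The question asked specifically for a regex filter and a "right"-style string function, so here is a sketch of that variant. It swaps in str.contains with the ^city pattern and the slice .str[-2:] in place of startswith and str.extract. The fourth row (town_string_99) is made up, added only to show that non-matching rows are dropped.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "No": [0, 1, 2, 3],
        "city": [
            "city_some_string_45",
            "city_Other_string_56",
            "city_another_string_77",
            "town_string_99",  # hypothetical non-matching row
        ],
    }
)

# Keep only the rows whose value matches the regex ^city.
mask = df["city"].str.contains(r"^city")
cleaned = df.loc[mask].copy()  # copy so the assignment below targets a new frame

# Equivalent of a "right" function: take the last two characters.
cleaned["city"] = cleaned["city"].str[-2:].astype(int)
print(cleaned)
```

The slice only works if the number is always exactly two characters long; str.extract with a digit pattern is the safer choice when the length varies.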

How do I delete rows I don't need in dataframe pandas?

I want to delete certain rows entirely, so that both their ZIPCODE and AV_LAND values are removed. For instance, I want to delete rows 1 and 2. How would I do that? In addition, I want to reset the index once I have deleted all the rows I don't need.
ZIPCODE AV_LAND
0 02108 2653506
1 02109 5559661
2 02110 11804931
3 02134 4333212
You can use drop:
df.drop([1, 2]).reset_index(drop=True)
Out:
ZIPCODE AV_LAND
0 02108 2653506
1 02134 4333212
This is not an in-place operation, so if you want to change the original DataFrame you need to assign the result back: df = df.drop([1, 2]).reset_index(drop=True)
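A runnable sketch of the above, using the data from the question. ZIPCODE is stored as strings here so the leading zeros survive, which is an assumption about how the original data was loaded. The second variant, which selects rows by value rather than by index label, is an extra option and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ZIPCODE": ["02108", "02109", "02110", "02134"],
        "AV_LAND": [2653506, 5559661, 11804931, 4333212],
    }
)

# Drop rows by index label, then renumber the index from 0.
trimmed = df.drop(index=[1, 2]).reset_index(drop=True)
print(trimmed)

# Alternative: drop rows by value instead of by label.
unwanted = ["02109", "02110"]
by_value = df[~df["ZIPCODE"].isin(unwanted)].reset_index(drop=True)
```

Passing drop=True to reset_index discards the old index instead of keeping it as a new column.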