Iterate over an entire DataFrame in pandas

I'm trying to do something like the following.
My intention is to write a for loop in pandas that iterates over the whole DataFrame, filtering the rows whose values are greater than four. Whenever the condition is satisfied, it should add an entry to a new column containing the column name and the ID, like the OUTPUT column shown below.
I'm trying with this code but it doesn't work...
list = []
for col in df.columns:
    for row in df[col]:
        if row > 4:
            list.append(df(row).index, col)
Could somebody help me? Thanks so much in advance.

Here is a proposition using pandas.DataFrame.loc and pandas.Series.ge:
collected_vals = []
for col in df.filter(like="X").columns:
    collected_vals.append(df.loc[df[col].ge(4), "ID"].astype(str).radd(f"{col}, "))

# if a list is needed
from itertools import chain
l = list(chain(*[ser.tolist() for ser in collected_vals]))
# if a Series is needed
ser = pd.concat(collected_vals, ignore_index=True)
# if a DataFrame is needed
out_df = pd.concat(collected_vals, ignore_index=True).to_frame("OUTPUT")
# Output
print(out_df)
OUTPUT
0 X40, 1100
1 X40, 1200
2 X50, 700
3 X50, 800
4 X50, 900
Input used :
print(df)
X40 X50 ID
0 1 5 700
1 2 6 800
2 1 8 900
3 3 2 1000
4 4 3 1100
5 6 1 1200
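A vectorized alternative to looping over the columns is to reshape with DataFrame.melt and filter once. This is a sketch assuming the same input df as above; the >= 4 threshold mirrors the answer's ge(4):

```python
import pandas as pd

df = pd.DataFrame({
    "X40": [1, 2, 1, 3, 4, 6],
    "X50": [5, 6, 8, 2, 3, 1],
    "ID": [700, 800, 900, 1000, 1100, 1200],
})

# Long format: one row per (ID, column, value) triple
long = df.melt(id_vars="ID", var_name="col", value_name="val")

# Keep rows meeting the threshold, then build the "column, ID" strings
out = long.loc[long["val"] >= 4].sort_values(["col", "ID"]).copy()
out["OUTPUT"] = out["col"] + ", " + out["ID"].astype(str)
print(out["OUTPUT"].tolist())
```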

Related

pandas finding duplicate rows with different label

I have a case where I want to sanity-check labeled data. I have hundreds of features and want to find points which have the same features but a different label. Each such cluster of disagreeing labels should then be numbered and put into a new dataframe.
This isn't hard, but I am wondering what the most elegant solution for this is.
Here an example:
import pandas as pd
df = pd.DataFrame({
    "feature_1": [0, 0, 0, 4, 4, 2],
    "feature_2": [0, 5, 5, 1, 1, 3],
    "label": ["A", "A", "B", "B", "D", "A"]
})
result_df = pd.DataFrame({
    "cluster_index": [0, 0, 1, 1],
    "feature_1": [0, 0, 4, 4],
    "feature_2": [5, 5, 1, 1],
    "label": ["A", "B", "B", "D"]
})
In order to get the output you want (both de-duplication and cluster_index), you can use a groupby approach:
g = df.groupby(['feature_1', 'feature_2'])['label']
(df.assign(cluster_index=g.ngroup())       # get group id
   .loc[g.transform('size').gt(1)]         # filter out the non-duplicates
   # line below only to have a nice cluster_index range (0, 1, …)
   .assign(cluster_index=lambda d: d['cluster_index'].factorize()[0])
)
output:
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1
First get all rows duplicated by the feature columns; then, if necessary, remove rows duplicated across all columns (not necessary for this sample data); last, add GroupBy.ngroup for the group indices:
df = df[df.duplicated(['feature_1','feature_2'],keep=False)].drop_duplicates()
df['cluster_index'] = df.groupby(['feature_1', 'feature_2'])['label'].ngroup()
print (df)
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1
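Both answers filter on group size, which also keeps duplicated rows whose labels agree. Since the goal is rows with the same features but different labels, a stricter variant filters on actual label disagreement via nunique; a self-contained sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "feature_1": [0, 0, 0, 4, 4, 2],
    "feature_2": [0, 5, 5, 1, 1, 3],
    "label": ["A", "A", "B", "B", "D", "A"],
})

# Keep only feature combinations whose labels actually disagree
g = df.groupby(["feature_1", "feature_2"])["label"]
result = (df.loc[g.transform("nunique").gt(1)]
            .assign(cluster_index=lambda d: d.groupby(
                ["feature_1", "feature_2"]).ngroup()))
print(result)
```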

List of lists into a pandas DataFrame, including column names

I would like to transfer a list of lists into a dataframe with columns based on the lists in the list.
This is still easy.
list = [[....],[....],[...]]
df = pd.DataFrame(list)
df = df.transpose()
The problem is: I would like to give the columns names based on entries I have in another list:
list_two = [A, B, C, ...]
This is the issue I'm still struggling with.
Is there any approach to solve this problem?
Thanks a lot in advance for your help.
Best regards
Sascha
Use zip with dict for a dictionary of lists, and pass it to DataFrame:
L= [[1,2,3,5],[4,8,9,8],[1,2,5,3]]
list_two = list('ABC')
df = pd.DataFrame(dict(zip(list_two, L)))
print (df)
A B C
0 1 4 1
1 2 8 2
2 3 9 5
3 5 8 3
Or, if you pass the index parameter and then transpose, the column names are taken from this list:
df = pd.DataFrame(L, index=list_two).T
print (df)
A B C
0 1 4 1
1 2 8 2
2 3 9 5
3 5 8 3
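A third variant, equivalent to the two above, transposes the list of lists with zip before construction; a sketch using the same sample data:

```python
import pandas as pd

L = [[1, 2, 3, 5], [4, 8, 9, 8], [1, 2, 5, 3]]
list_two = list("ABC")

# zip(*L) transposes the list of lists, so each inner list becomes a column
df = pd.DataFrame(list(zip(*L)), columns=list_two)
print(df)
```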

Replacing -999 with a number but I want all replaced number to be different

I have a pandas DataFrame named df, and in the df['salary'] column, 400 values are represented by the same number, -999. I want to replace each -999 with a number between 200 and 500, and I want all 400 replacements to be different numbers. So far I have written this code:
df['salary'] = df['salary'].replace(-999, random.randint(200, 500))
but this code replaces every -999 with the same value. I want the replaced values to differ from each other. How can I do this?
You can use Series.mask with np.random.randint:
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [0, 1, 2, 3, 4, 5, -999, -999, -999, 1, 3, 5, -999]})
df['salary'] = df["salary"].mask(df["salary"].eq(-999), np.random.randint(200, 500, size=len(df)))
print (df)
salary
0 0
1 1
2 2
3 3
4 4
5 5
6 413
7 497
8 234
9 1
10 3
11 5
12 341
If you want non-repeating numbers instead:
s = pd.Series(range(200, 500)).sample(frac=1).reset_index(drop=True)
df['salary'] = df["salary"].mask(df["salary"].eq(-999), s)
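An equivalent sketch using boolean indexing draws exactly one random value per sentinel instead of a full-length array; it assumes a small illustrative df and uses NumPy's newer Generator API (integers with an exclusive upper bound, so 501 makes 500 reachable):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng()  # pass a seed here for reproducibility
df = pd.DataFrame({"salary": [0, 1, -999, 3, -999, 5]})

# Draw one random value per -999 and assign in place
mask = df["salary"].eq(-999)
df.loc[mask, "salary"] = rng.integers(200, 501, size=mask.sum())
print(df)
```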

I want to remove specific rows and restart the values from 1

I have a dataframe that looks like this:
Time Value
1 5
2 3
3 3
4 2
5 1
I want to remove the first two rows and then restart time from 1. The dataframe should then look like:
Time Value
1 3
2 2
3 1
I attach the code:
file0 = pd.read_excel(r'C:......xlsx')
df = file0.loc[(file0['Time']>2) & (file0['Time']<11)]
df = df.reset_index()
Now what I get is:
index Time Value
0 3 3
1 4 2
2 5 1
Thank you!
You can use the .loc[] accessor and the reset_index() method:
df = df.loc[2:].reset_index(drop=True)
Finally, use a list comprehension:
df['Time'] = [x for x in range(1, len(df) + 1)]
Now, if you print df, you will get your desired output:
Time Value
0 1 3
1 2 2
2 3 1
You can use df.loc to extract the subset of the dataframe, reset the index, and then change the values of the Time column.
df = df.loc[2:].reset_index(drop=True)
df['Time'] = df.index + 1
print(df)
You have two ways to do that.
First:
df[2:].assign(Time=df.Time.values[:-2])
which returns your desired output:
   Time  Value
2     1      3
3     2      2
4     3      1
Second:
df = df.set_index('Time')
df['Value'] = df['Value'].shift(-2)
df = df.dropna()
This returns your output too, but turns the numbers into float64:
      Value
Time
1       3.0
2       2.0
3       1.0
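Putting the drop-and-renumber approach together as one runnable sketch, using the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({"Time": [1, 2, 3, 4, 5], "Value": [5, 3, 3, 2, 1]})

# Drop the first two rows, then renumber Time starting from 1
out = df.iloc[2:].reset_index(drop=True)
out["Time"] = out.index + 1
print(out)
```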

Use row values as data frame headers

I am dealing with several data frames (DataFrames = [DataFrame_a, b, c, ... z]) with long descriptions as their headers. For example, a = pd.DataFrame(data=[[1, 2, 7], ["ABC", "BCD", "CDE"], [5, 6, 0]], columns=['SuperSuperlong name columnA', 'SuperSuperlong name columnB', 'SuperSuperlong name columnC'])
SuperSuperlong_name_columnA SuperSuperlong_name_columnB SuperSuperlong_name_columnC
0 1 2 7
1 ABC BCD CDE
2 5 6 0
I'd like it to be transformed to
ABC BCD CDE
0 SuperSuperlong_name_columnA SuperSuperlong_name_columnB SuperSuperlong_name_columnC
1 1 2 7
2 5 6 0
What's the easiest way?
I also like to apply the method to all data frame I have. How should I do it?
Hope this helps.
# Move the column names into the data as a new first row
df.loc[-1] = df.columns
df = df.sort_index().reset_index(drop=True)
# Promote the row of short names (now row 2) to be the header, and drop it from the data
df.columns = df.iloc[2].tolist()
df = df.drop(2).reset_index(drop=True)
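To apply the same transformation to every frame you have, the steps can be wrapped in a small helper. `promote_row_to_header` is a hypothetical name, and the sketch assumes the short names live in row 1 of each frame:

```python
import pandas as pd

def promote_row_to_header(df, row=1):
    # Old headers become the first data row; `row`'s values become the header
    out = df.copy()
    out.loc[-1] = out.columns
    out = out.sort_index().reset_index(drop=True)
    header = out.iloc[row + 1].tolist()   # original `row`, shifted down by one
    out = out.drop(row + 1).reset_index(drop=True)
    out.columns = header
    return out

a = pd.DataFrame(
    [[1, 2, 7], ["ABC", "BCD", "CDE"], [5, 6, 0]],
    columns=["SuperSuperlong name columnA",
             "SuperSuperlong name columnB",
             "SuperSuperlong name columnC"],
)

# Apply to every frame in a list
frames = [promote_row_to_header(f) for f in [a]]
print(frames[0])
```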