Pandas: replacing part of a string from elements in different columns

I have a dataframe where numbers contained in some cells (in several columns) look like this: '$$10'
I want to replace/remove the '$$'. So far I tried this, but it does not work:
replace_char={'$$':''}
df.replace(replace_char, inplace=True)

An example close to the approach you are taking would be:
df[col_name].str.replace(r'\$\$', '', regex=True)
Notice that this has to be done on a Series, so you have to select the column you would like to apply the replacement to.
amt
0 $$12
1 $$34
df['amt'] = df['amt'].str.replace(r'\$\$', '', regex=True)
df
gives:
amt
0 12
1 34
or you could apply it to the full df with:
df.replace({r'\$\$': ''}, regex=True)
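Note that in newer pandas versions Series.str.replace treats the pattern as a literal string by default (regex=False), so you can skip the escaping entirely; a minimal sketch, assuming the same 'amt' column:
import pandas as pd

df = pd.DataFrame({'amt': ['$$12', '$$34']})
# literal replacement, no regex escaping needed; regex=False is spelled
# out so the behaviour is the same across pandas versions
df['amt'] = df['amt'].str.replace('$$', '', regex=False)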

Your code is (almost) right.
It would work if the cells contained exactly 'AA':
replace_char = {'AA': ''}
df.replace(replace_char, inplace=True)
The problem is that replace without regex=True only matches whole cell values; to replace a substring you need regex=True, and in a regular expression $ is a special character that must be escaped:
df['your_column'].replace({r'\$': ''}, regex=True)
Example:
df = pd.DataFrame({"A":[1,2,3,4,5,'$$6'],"B":[9,9,'$$70',9,9, np.nan]})
A B
0 1 9
1 2 9
2 3 $$70
3 4 9
4 5 9
5 $$6 NaN
Do:
df['A'].replace({r'\$': ''}, regex=True)
Desired result for column A:
0 1
1 2
2 3
3 4
4 5
5 6
You can iterate over any columns you need from this point.
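For instance, a minimal sketch that loops over every object-dtype column of the example frame above:
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, 2, 3, 4, 5, '$$6'], "B": [9, 9, '$$70', 9, 9, np.nan]})

# apply the same replacement to every column that can hold strings;
# numeric values and NaN pass through untouched
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].replace({r'\$': ''}, regex=True)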

You just need to specify the regex argument, and escape the $ characters since the pattern is then treated as a regular expression. Like:
replace_char = {r'\$\$': ''}
df.replace(replace_char, inplace=True, regex=True)
df.replace then replaces it for all entries in the data frame.

Related

List of Lists into pandas dataframe including name of columns

I would like to turn a list of lists into a dataframe, with one column per inner list.
This is still easy.
list = [[....],[....],[...]]
df = pd.DataFrame(list)
df = df.transpose()
The problem is: I would like to give the columns a column-name based on entries I have in another list:
list_two = [A,B,C,...]
This is the issue I'm still struggling with.
Is there any approach to solve this problem?
Thanks a lot in advance for your help.
Best regards
Sascha
Use zip with dict to build a dictionary of lists and pass it to DataFrame:
L = [[1,2,3,5],[4,8,9,8],[1,2,5,3]]
list_two = list('ABC')
df = pd.DataFrame(dict(zip(list_two, L)))
print(df)
A B C
0 1 4 1
1 2 8 2
2 3 9 5
3 5 8 3
Or pass the list as the index parameter and transpose, so the column names come from the list:
df = pd.DataFrame(L, index=list_two).T
print(df)
A B C
0 1 4 1
1 2 8 2
2 3 9 5
3 5 8 3
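Sticking closer to the transpose approach from the question, you can also just assign the name list to df.columns after transposing; a minimal sketch:
import pandas as pd

L = [[1, 2, 3, 5], [4, 8, 9, 8], [1, 2, 5, 3]]
list_two = ['A', 'B', 'C']

df = pd.DataFrame(L).T   # one column per inner list
df.columns = list_two    # the names come from the second list
print(df)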

I want to remove specific rows and restart the values from 1

I have a dataframe that looks like this:
Time Value
1 5
2 3
3 3
4 2
5 1
I want to remove the first two rows and then restart time from 1. The dataframe should then look like:
Time Value
1 3
2 2
3 1
I attach the code:
file = pd.read_excel(r'C:......xlsx')
df = file.loc[(file['Time'] > 2) & (file['Time'] < 11)]
df = df.reset_index()
Now what I get is:
index Time Value
0 3 3
1 4 2
2 5 1
Thank you!
You can use the .loc[] accessor and the reset_index() method:
df = df.loc[2:].reset_index(drop=True)
Finally, rebuild the Time column with a list comprehension:
df['Time'] = [x for x in range(1, len(df) + 1)]
Now if you print df you will get your desired output:
Time Value
0 1 3
1 2 2
2 3 1
You can use df.loc to extract the subset of the dataframe, reset the index, and then change the values of the Time column.
df = df.loc[2:].reset_index(drop=True)
df['Time'] = df.index + 1
print(df)
You have two ways to do that.
First:
df[2:].assign(Time=df.Time.values[:-2])
which returns your desired output:
Time  Value
1     3
2     2
3     1
Second:
df = df.set_index('Time')
df['Value'] = df['Value'].shift(-2)
df.dropna()
This returns your output too, but turns the numbers into float64:
Time  Value
1     3.0
2     2.0
3     1.0
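If the float64 dtype bothers you, you can cast back to integers once the NaN rows are gone; a minimal sketch of the second approach:
import pandas as pd

df = pd.DataFrame({'Time': [1, 2, 3, 4, 5], 'Value': [5, 3, 3, 2, 1]})
df = df.set_index('Time')
df['Value'] = df['Value'].shift(-2)    # shifting introduces NaN, so the column becomes float64
df = df.dropna()
df['Value'] = df['Value'].astype(int)  # safe to cast once the NaN rows are dropped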

pandas split-apply-combine creates undesired MultiIndex

I am using the split-apply-combine pattern in pandas to group my df by a custom aggregation function.
But this returns an undesired DataFrame with the grouped column existing twice: In an MultiIndex and the columns.
The following is a simplified example of my problem.
Say, I have this df
df = pd.DataFrame([[1,2],[3,4],[1,5]], columns=['A','B'])
A B
0 1 2
1 3 4
2 1 5
I want to group by column A and keep only those rows where B has an even value. Thus the desired df is this:
B
A
1 2
3 4
The custom function my_combine_func should do the filtering. But applying it after a groupby leads to a MultiIndex with the former index in the second level, and thus column A exists twice:
my_combine_func = lambda group: group[group['B'] % 2 == 0]
df.groupby(['A']).apply(my_combine_func)
A B
A
1 0 1 2
3 1 3 4
How to apply a custom group function and have the desired df?
It's easier to build a boolean mask here; using transform (rather than apply, which can nest the group key into the index) keeps the mask aligned with the original index:
df[df.groupby('A')['B'].transform(lambda x: x % 2 == 0)]
A B
0 1 2
1 3 4
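To match the desired frame exactly, with A as the index, a minimal sketch:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [1, 5]], columns=['A', 'B'])

# transform returns a boolean Series aligned with df's index,
# so it can be used directly as a filter mask
mask = df.groupby('A')['B'].transform(lambda x: x % 2 == 0)
print(df[mask].set_index('A'))
#    B
# A
# 1  2
# 3  4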

Pandas drop_duplicates. Keep first AND last. Is it possible?

I have this dataframe and I need to drop all duplicates, but I need to keep the first AND last values.
For example:
1 0
2 0
3 0
4 0
output:
1 0
4 0
I tried df.column.drop_duplicates(keep=("first","last")) but it doesn't work; it returns
ValueError: keep must be either "first", "last" or False
Does anyone know any turn around for this?
Thanks
You could use pandas' concat function to create a dataframe with both the first and last values (note that values occurring only once will appear twice this way):
pd.concat([
df['X'].drop_duplicates(keep='first'),
df['X'].drop_duplicates(keep='last'),
])
You can't keep both first and last in a single drop_duplicates call, so the trick is to concat data frames of the first and last occurrences.
When you concat, you have to avoid duplicating the non-duplicates, so only concat the indexes from the second DataFrame that are not already in the first. (Not sure if merge/join would work better?)
import pandas as pd
d = {1:0,2:0,10:1, 3:0,4:0}
df = pd.DataFrame.from_dict(d, orient='index', columns=['cnt'])
print(df)
cnt
1 0
2 0
10 1
3 0
4 0
Then do this:
d1 = df.drop_duplicates(keep='first')
d2 = df.drop_duplicates(keep='last')
d3 = pd.concat([d1, d2.loc[d2.index.difference(d1.index)]])
d3
d3
Out[60]:
cnt
1 0
10 1
4 0
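An equivalent way to get the same rows without the concat is to combine boolean masks from duplicated(); a sketch with the same example frame:
import pandas as pd

d = {1: 0, 2: 0, 10: 1, 3: 0, 4: 0}
df = pd.DataFrame.from_dict(d, orient='index', columns=['cnt'])

# keep rows that are the first or the last occurrence of their value;
# values occurring only once satisfy both and are kept exactly once
first = ~df.duplicated('cnt', keep='first')
last = ~df.duplicated('cnt', keep='last')
print(df[first | last])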
Use a groupby on your column (here literally named 'column'), take the first and last row of each group, then reset the index. If you ever want to check for duplicate values in more than one column, you can extend the columns you include in the groupby.
df = pd.DataFrame({'column':[0,0,0,0]})
Input:
column
0 0
1 0
2 0
3 0
df.groupby('column', as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[0, -1]]).reset_index(level=0, drop=True)
Output:
column
0 0
3 0

Use row values as data frame headers

I am dealing with several data frames (DataFrames = [DataFrame_a,b,c...z]) with long descriptions as their headers. For example:
a = pd.DataFrame(data=[[1, 2, 7], ["ABC", "BCD", "CDE"], [5, 6, 0]], columns=['SuperSuperlong_name_columnA', 'SuperSuperlong_name_columnB', 'SuperSuperlong_name_columnC'])
SuperSuperlong_name_columnA SuperSuperlong_name_columnB SuperSuperlong_name_columnC
0 1 2 7
1 ABC BCD CDE
2 5 6 0
I'd like it to be transformed to
ABC BCD CDE
0 SuperSuperlong_name_columnA SuperSuperlong_name_columnB SuperSuperlong_name_columnC
1 1 2 7
2 5 6 0
What's the easiest way?
I would also like to apply the method to all data frames I have. How should I do it?
Hope this helps.
# Pass the column names in as a new row and reset the index
df.loc['new'] = df.columns
df.reset_index(inplace=True, drop=True)
# Pass the row you want as the column names
df.columns = df.iloc[1]
# Drop that row now that it is the header
df = df.drop(1).reset_index(drop=True)
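A fuller sketch that also moves the old header to the first row, matching the desired output, and applies the transformation to every frame in the list (promote_row_to_header is a hypothetical helper name):
import pandas as pd

a = pd.DataFrame(data=[[1, 2, 7], ['ABC', 'BCD', 'CDE'], [5, 6, 0]],
                 columns=['SuperSuperlong_name_columnA',
                          'SuperSuperlong_name_columnB',
                          'SuperSuperlong_name_columnC'])

def promote_row_to_header(df, row=1):
    # use the values of `row` as the new header and keep the
    # old header as the first data row (assumes a default RangeIndex)
    new_cols = df.iloc[row].tolist()
    out = df.drop(index=row)
    out.loc[-1] = df.columns        # old header becomes a row
    out = out.sort_index().reset_index(drop=True)
    out.columns = new_cols
    return out

# apply it to every frame; [a] stands in for the real DataFrames list
DataFrames = [promote_row_to_header(f) for f in [a]]
print(DataFrames[0])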