Pandas DataFrame: Divide column entries by number of occurrences - pandas

My problem:
I have this DataFrame:
df_problem = pd.DataFrame({"Share":['5%','6%','9%','9%', '9%'],"level_1":[0,0,1,2,3], 'BO':['Nestle', 'Procter', 'Nestle', 'Tesla', 'Jeff']})
The problem is that the 9% is actually split among three shareholders. So I want to give each of them their share of 3% and put it next to their names. It should then look like this:
df_solution = pd.DataFrame({"Share":['5%','6%','3%','3%', '3%'],"level_1":[0,0,0,1,2], 'BO': ['Nestle', 'Procter', 'Nestle', 'Tesla', 'Jeff']})
How do I do this in a simple way?

You could try something like this:
df_problem['Share'] = (df_problem['Share'].str.replace('%', '').astype(float) /
                       df_problem.groupby('Share')['BO'].
                       transform('count')).astype(str) + '%'
>>> df_problem
Share level_1 BO
0 5.0% 0 Nestle
1 6.0% 0 Procter
2 3.0% 1 Nestle
3 3.0% 2 Tesla
4 3.0% 3 Jeff
Note that the 'Share' values are converted to float for the division, which is why the results print as 5.0%, 3.0%, and so on.
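If the trailing ".0" is unwanted, a variant of the same idea could format the result differently. This is only a sketch: it assumes, as the answer above does, that rows with an identical 'Share' string belong to one shared holding, and the names share_num and holders are made up for illustration.

```python
import pandas as pd

df_problem = pd.DataFrame({
    "Share": ["5%", "6%", "9%", "9%", "9%"],
    "level_1": [0, 0, 1, 2, 3],
    "BO": ["Nestle", "Procter", "Nestle", "Tesla", "Jeff"],
})

# Parse the percentage strings into floats once.
share_num = df_problem["Share"].str.rstrip("%").astype(float)

# Number of rows that carry the same original share string (assumed co-holders).
holders = df_problem.groupby("Share")["BO"].transform("count")

# Divide, then format with "g" so whole numbers print without a trailing ".0".
df_problem["Share"] = (share_num / holders).map(lambda v: f"{v:g}%")
print(df_problem)
```

Keeping the numeric values in their own variable also makes it easy to store them as a float column instead of strings, if later calculations need them.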

Related

Why is my df.sort_values() not correctly sorting the data points?

I have a dataframe with returns from various investments in %. sort.values does not correctly order my returns. For example I just want to simply see the TEST column returns sorted by lowest to highest or vice versa. Please look at the test output, it is not correct.
df.sort_values('TEST')
gives me an output of returns that are NOT sorted correctly.
Also I am having an issue where it sorts positive numbers lowest to highest, then half way down starts again for negative numbers lowest to highest.
I just want it to look like following:
-3%
-1%
-0.5%
1%
2%
5%
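Go for numpy.lexsort and positional indexing with iloc:
import numpy as np
# Strip the trailing "%" and convert to float so the values compare numerically
arr = np.array([float(x.rstrip("%")) for x in df["TEST"]])
# With a single key, lexsort returns the same order as a stable argsort
idx = np.lexsort((arr,))
df = df.iloc[idx]
Output :
print(df)
TEST
1 -3%
3 -1%
2 -0.5%
0 1%
5 2%
4 5%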
Input used :
df = pd.DataFrame({"TEST": ["1%", "-3%","-0.5%", "-1%", "5%", "2%"]})
TEST
0 1%
1 -3%
2 -0.5%
3 -1%
4 5%
5 2%
The issue is that the lexicographic order of strings is different from the natural order of numbers (1->10->2 vs 1->2->10).
One option using the key parameter of sort_values:
df.sort_values('TEST', key=lambda s: pd.to_numeric(s.str.extract(r'(-?\d+\.?\d*)', expand=False)))
Or:
df.sort_values('TEST', key=lambda s: pd.to_numeric(s.str.rstrip('%')))
Output:
TEST
1 -3%
3 -1%
2 -0.5%
0 1%
5 2%
4 5%
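To make the difference between the two orderings concrete, here is a small sketch using the same input as above. The helper column name TEST_num is an assumption made for this example; it is not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({"TEST": ["1%", "-3%", "-0.5%", "-1%", "5%", "2%"]})

# Plain string sort: compares characters one by one, not magnitudes.
as_text = df.sort_values("TEST")["TEST"].tolist()

# Numeric helper column (hypothetical name): strip "%" and convert to numbers.
df["TEST_num"] = pd.to_numeric(df["TEST"].str.rstrip("%"))
as_numbers = df.sort_values("TEST_num")["TEST"].tolist()

print(as_text)     # negatives come out in the wrong order
print(as_numbers)  # true numeric order
```

A helper column like this is worth keeping when the numbers are needed again later; the key= variants above are more convenient for a one-off sort.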

Changing column name and its values at the same time

Pandas help!
I have a specific column like this,
Mpg
0 18
1 17
2 19
3 21
4 16
5 15
Mpg is miles per gallon.
Now I need to rename that 'Mpg' column to 'litre per 100 km' and convert its values to litres per 100 km at the same time. Any help? Thanks in advance.
-Tom
I managed to rename the column, but I could not do both at once.
Use pop to return and delete the column at the same time and rdiv to perform the conversion (litres per 100 km = 235.15 / mpg; 235.15 is the approximate constant used in this answer, the value for US gallons is closer to 235.21):
df['litre per 100 km'] = df.pop('Mpg').rdiv(235.15)
If you want to insert the column in the same position:
df.insert(df.columns.get_loc('Mpg'), 'litre per 100 km',
          df.pop('Mpg').rdiv(235.15))
Output:
litre per 100 km
0 13.063889
1 13.832353
2 12.376316
3 11.197619
4 14.696875
5 15.676667
An alternative to pop would be to store the result in another dataframe. This way you can perform the two steps at the same time. In my code below, I first reproduce your dataframe, then store the constant for conversion and perform it on all entries using the apply method.
df = pd.DataFrame({'Mpg':[18,17,19,21,16,15]})
cc = 235.214583 # constant for conversion from mpg to L/100km
df2 = pd.DataFrame()
df2['litre per 100 km'] = df['Mpg'].apply(lambda x: cc/x)
print(df2)
The output of this code is:
litre per 100 km
0 13.067477
1 13.836152
2 12.379715
3 11.200694
4 14.700911
5 15.680972
as expected.
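Another way to do both steps in one expression, without a second dataframe, might be to chain assign and rename. This is a sketch only; the constant name MPG_TO_L_PER_100KM is made up here, and its value is the US-gallon figure quoted in the answer above.

```python
import pandas as pd

# Conversion constant for US gallons (value taken from the answer above).
MPG_TO_L_PER_100KM = 235.214583

df = pd.DataFrame({"Mpg": [18, 17, 19, 21, 16, 15]})

# Overwrite the values, then rename; the column keeps its original position.
df = (
    df.assign(Mpg=MPG_TO_L_PER_100KM / df["Mpg"])
      .rename(columns={"Mpg": "litre per 100 km"})
)
print(df)
```

Dividing the Series directly is vectorized, so it avoids calling a Python function per row as apply with a lambda does.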

How to replace values of a column based on another data frame?

I have a column containing symbols of chemical elements and other substances. Something like this:
Commodities
sn
sulfuric acid
cu
sodium chloride
au
df1 = pd.DataFrame(['sn', 'sulfuric acid', 'cu', 'sodium chloride', 'au'], columns=['Commodities'])
And I have another data frame containing the symbols of the chemical elements and their respective names. Like this:
Symbol Name
sn tin
cu copper
au gold
df2 = pd.DataFrame({'Name': ['tin', 'copper', 'gold'], 'Symbol': ['sn', 'cu', 'au']})
I need to replace the symbols in the first dataframe (df1['Commodities']) with the names from the second one (df2['Name']), so that the output looks like this:
Commodities
tin
sulfuric acid
copper
sodium chloride
gold
I tried using for loops and lambda but got different results than expected. I have tried many things and googled, I think it's something basic, but I just can't find an answer.
Thank you in advance!
First, convert df2 to a dictionary:
replace_dict = dict(df2[['Symbol', 'Name']].to_dict('split')['data'])
# {'sn': 'tin', 'cu': 'copper', 'au': 'gold'}
Then use the replace function:
df1['Commodities'] = df1['Commodities'].replace(replace_dict)
print(df1)
'''
Commodities
0 tin
1 sulfuric acid
2 copper
3 sodium chloride
4 gold
'''
Try:
for i, row in df2.iterrows():
    # str.replace substitutes every occurrence of the symbol, even inside longer strings
    df1.Commodities = df1.Commodities.str.replace(row.Symbol, row.Name)
which gives df1 as:
Commodities
0 tin
1 sulfuric acid
2 copper
3 sodium chloride
4 gold
EDIT: Note that it's very likely to be far more efficient to skip defining df2 at all and just zip your lists of names and symbols together and iterate over that.
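The zip idea from the note above could look roughly like this. It is a sketch that assumes the symbols and names are available as two plain lists (called symbols and names here for illustration) and that only whole-cell matches should be replaced.

```python
import pandas as pd

symbols = ["sn", "cu", "au"]
names = ["tin", "copper", "gold"]
df1 = pd.DataFrame(
    ["sn", "sulfuric acid", "cu", "sodium chloride", "au"],
    columns=["Commodities"],
)

# Build the lookup directly from the two lists, no second dataframe needed.
lookup = dict(zip(symbols, names))

# map() matches whole cell values only; cells without a match become NaN,
# so fillna() restores the original text for those rows.
df1["Commodities"] = df1["Commodities"].map(lookup).fillna(df1["Commodities"])
print(df1)
```

Because map works on whole values, a symbol such as "cu" cannot accidentally be replaced inside a longer substance name, which substring replacement would do.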

counting unique values in column using sub-id

I have a df containing sub-trajectories (segments) of users, with mode of travel indicated by 0,1,2... which looks like this:
df = pd.read_csv('sample.csv')
df
id lat lon mode
0 5138001 41.144540 -8.562926 0
1 5138001 41.144538 -8.562917 0
2 5138001 41.143689 -8.563012 0
3 5138003 43.131562 -8.601273 1
4 5138003 43.132107 -8.598124 1
5 5145001 37.092095 -8.205070 0
6 5145001 37.092180 -8.204872 0
7 5145015 39.289341 -8.023454 2
8 5145015 39.197432 -8.532761 2
9 5145015 39.198361 -8.375641 2
In the above sample, id identifies the segments, but a full trajectory may be covered by different modes (i.e. contain multiple segments).
So the first 4 digits of id identify the trajectory, and the last 3 digits identify the segment within that trajectory.
I know that I can count the number of unique modes per segment in the df using:
df.groupby('id')['mode'].nunique()
How do I then count the number of unique trajectories 5138, 5145, ...?
Use str indexing to get the first 4 characters; if necessary, first convert the values to strings with Series.astype:
df = df.groupby(df['id'].astype(str).str[:4])['mode'].nunique().reset_index(name='count')
print (df)
id count
0 5138 2
1 5145 2
If you need to process the values after the first 4 digits:
s = df['id'].astype(str)
df = s.str[4:].groupby(s.str[:4]).nunique().reset_index(name='count')
print (df)
id count
0 5138 2
1 5145 2
Another idea is to use a lambda function:
df.groupby(df['id'].apply(lambda x: str(x)[:4]))['mode'].nunique()
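If id is stored as an integer, the trajectory part can also be obtained with integer division instead of string slicing. The sketch below assumes exactly that layout (the last three digits are always the segment number) and rebuilds only the id and mode columns of the sample; the name trajectory is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [5138001, 5138001, 5138001, 5138003, 5138003,
           5145001, 5145001, 5145015, 5145015, 5145015],
    "mode": [0, 0, 0, 1, 1, 0, 0, 2, 2, 2],
})

# Drop the last three digits (assumed segment number) to get the trajectory.
trajectory = (df["id"] // 1000).rename("trajectory")

# Total number of distinct trajectories.
n_trajectories = trajectory.nunique()

# Number of distinct segments inside each trajectory.
segments_per_trajectory = df.groupby(trajectory)["id"].nunique()

print(n_trajectories)
print(segments_per_trajectory)
```

This answers the "how many trajectories" question directly with a single number, while the groupby result gives the per-trajectory breakdown.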

collapse pandas dataframe rows based on index column

I have a dataframe that contains information that is linked by an ID column. The rows are sequential with the odd rows containing a "start-point" and the even rows containing an "end" point. My goal is to collapse the data from these into a single row with columns for "start" and "end" following each other. The rows do have a "packet ID" that would link them if the sequential nature of the dataframe is not consistent.
example:
df:
0 1 2 3 4 5
0 hs6 106956570 106956648 ID_A1 60 -
1 hs1 153649721 153649769 ID_A1 60 -
2 hs1 865130744 865130819 ID_A2 0 -
3 hs7 21882206 21882237 ID_A2 0 -
4 hs1 74230744 74230819 ID_A3 0 +
5 hs8 92041314 92041508 ID_A3 0 +
The resulting dataframe that I am trying to achieve is:
new_df
0 1 2 3 4 5
0 hs6 106956570 106956648 hs1 153649721 153649769
1 hs1 865130744 865130819 hs7 21882206 21882237
2 hs1 74230744 74230819 hs8 92041314 92041508
with each row containing the information on both the start and the end-point.
I have tried to pass the IDs in to an array and use a for loop to pull the information out of the original dataframe into a new dataframe but this has not worked. I was looking at the melt documentation which would suggest that pd.melt(df, id_vars=[3], value_vars=[0,1,2]) may work but I cannot see how to get the corresponding row in to positions new_df[3,4,5].
I think that it may be something really simple that I am missing but any suggestions would be appreciated.
You can try this:
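import numpy as np
# Index rows by (position within pair, pair number), keep the first three columns,
# then move the within-pair position into the columns
df_out = df.set_index([df.index%2, df.index//2])[df.columns[:3]]\
           .unstack(0).sort_index(level=1, axis=1)
df_out.columns = np.arange(len(df_out.columns))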
df_out
Output:
0 1 2 3 4 5
0 hs6 106956570 106956648 hs1 153649721 153649769
1 hs1 865130744 865130819 hs7 21882206 21882237
2 hs1 74230744 74230819 hs8 92041314 92041508
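The approach above relies on the rows strictly alternating between start and end. Since the question mentions that the packet ID could link the rows when the order is not reliable, a variant keyed on that ID might look like the sketch below. It assumes every packet ID (column 3) appears exactly twice and that the first occurrence is the start point; the names pos, start and end are made up for this example.

```python
import pandas as pd

df = pd.DataFrame([
    ["hs6", 106956570, 106956648, "ID_A1", 60, "-"],
    ["hs1", 153649721, 153649769, "ID_A1", 60, "-"],
    ["hs1", 865130744, 865130819, "ID_A2", 0, "-"],
    ["hs7", 21882206, 21882237, "ID_A2", 0, "-"],
    ["hs1", 74230744, 74230819, "ID_A3", 0, "+"],
    ["hs8", 92041314, 92041508, "ID_A3", 0, "+"],
])

# Position of each row within its packet: 0 = first (start), 1 = second (end).
pos = df.groupby(3).cumcount()

# Split into start and end rows, indexed by packet ID so they align by ID.
start = df[pos == 0].set_index(3)[[0, 1, 2]]
end = df[pos == 1].set_index(3)[[0, 1, 2]]

# Join side by side on the packet ID, then renumber rows and columns.
new_df = pd.concat([start, end], axis=1).reset_index(drop=True)
new_df.columns = range(new_df.shape[1])
print(new_df)
```

Because the two halves are aligned on the packet ID rather than on row position, the pairing still holds if the start and end rows of different packets are interleaved.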