Pop the first element in a pandas column - pandas

I have a pandas column like below:
import pandas as pd
data = {'id': ['001', '002', '003'],
'address': [['William J. Clare', '290 Valley Dr.', 'Casper, WY 82604','USA, United States'],
['1180 Shelard Tower', 'Minneapolis, MN 55426', 'USA, United States'],
['William N. Barnard', '145 S. Durbin', 'Casper, WY 82601', 'USA, United States']]
}
df = pd.DataFrame(data)
I want to pop the first element of each list in the address column if it is a name, that is, if it doesn't contain any number.
output:
[['290 Valley Dr.', 'Casper, WY 82604','USA, United States'], ['1180 Shelard Tower', 'Minneapolis, MN 55426', 'USA, United States'], ['145 S. Durbin', 'Casper, WY 82601', 'USA, United States']]
This is a continuation of my previous post. I am learning Python, this is my second project, and I have been struggling with this since this morning. Please help.

Assuming you define an address as a string starting with a number (you can change the logic):
for l in df['address']:
    if not l[0][0].isdigit():
        l.pop(0)
print(df)
updated df:
id address
0 001 [290 Valley Dr., Casper, WY 82604, USA, United...
1 002 [1180 Shelard Tower, Minneapolis, MN 55426, US...
2 003 [145 S. Durbin, Casper, WY 82601, USA, United ...
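The loop above tests only the first character of the first item and changes the lists in place. As a hedged alternative, the sketch below treats the first item as a name when it contains no digit anywhere, which is closer to the wording of the question, and builds new lists through Series.apply. The helper name drop_leading_name is made up for this example.

```python
import pandas as pd

def drop_leading_name(parts):
    """Return a copy of parts without its first item when that item has no digit."""
    if parts and not any(ch.isdigit() for ch in parts[0]):
        return parts[1:]
    return list(parts)

df = pd.DataFrame({
    'id': ['001', '002', '003'],
    'address': [
        ['William J. Clare', '290 Valley Dr.', 'Casper, WY 82604', 'USA, United States'],
        ['1180 Shelard Tower', 'Minneapolis, MN 55426', 'USA, United States'],
        ['William N. Barnard', '145 S. Durbin', 'Casper, WY 82601', 'USA, United States'],
    ],
})

# apply() calls the helper once per row and collects the returned lists
df['address'] = df['address'].apply(drop_leading_name)
print(df['address'].tolist())
```

Because the helper returns a new list, the original lists are left untouched, and an empty list does not raise an IndexError.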


How to change index name in multiindex groupby object with condition?

I need to change the level-0 index ('Product Group') of a pandas groupby object based on a condition (the sum of the related values in the 'Sales' column).
Since the code is very long and depends on some files, I'll copy only the output.
The last line of code is:
tdk_regions = tdk[['Region', 'Sales', 'Product Group']].groupby(['Product Group', 'Region']).sum()
The output looks like this:
Product Group Region Sales
ALUMINUM & FILM CAPACITORS BG America 7.425599e+07
China 2.249969e+08
Europe 2.404613e+08
India 6.034134e+07
Japan 7.667371e+06
... ... ...
TEMPERATURE&PRESSURE SENSORS BG Europe 1.308471e+08
India 3.077273e+06
Japan 2.851744e+07
Korea 1.309189e+06
OSEAN 1.258075e+07
Try MultiIndex.rename:
df.index.rename("New Name", level=0, inplace=True)
print(df)
Prints:
Sales
New Name Region
ALUMINUM & FILM CAPACITORS BG America 74255990.0
China 224996900.0
Europe 240461300.0
India 60341340.0
Japan 7667371.0
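For reference, a small self-contained sketch of renaming one index level without inplace=True. It rebuilds a three-row slice of the output shown in the question (the frame is an illustration, not the asker's real data) and uses DataFrame.rename_axis with a dict that maps the old level name to the new one.

```python
import pandas as pd

# Three rows copied from the question's output, just enough to form a MultiIndex
idx = pd.MultiIndex.from_tuples(
    [
        ("ALUMINUM & FILM CAPACITORS BG", "America"),
        ("ALUMINUM & FILM CAPACITORS BG", "China"),
        ("TEMPERATURE&PRESSURE SENSORS BG", "Europe"),
    ],
    names=["Product Group", "Region"],
)
tdk_regions = pd.DataFrame(
    {"Sales": [7.425599e+07, 2.249969e+08, 1.308471e+08]}, index=idx
)

# rename_axis returns a new frame; only the name of the first level changes
renamed = tdk_regions.rename_axis(index={"Product Group": "New Name"})
print(renamed)
```

This changes only the name of the level, as in the answer above; the labels inside the level stay the same.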

How to remove the space and dots and convert into lowercase

I have a pyspark dataframe with names like
N. Plainfield
North Plainfield
West Home Land
NEWYORK
newyork
So. Plainfield
S. Plaindield
Some of them contain dots and spaces between initials, and some do not. How can they be converted to:
n Plainfield
north plainfield
west homeland
newyork
newyork
so plainfield
s plainfield
(with no dots and spaces between initials and 1 space between initials and name)
I tried using the following, but it only replaces dots and doesn't remove spaces between initials:
names_modified = names.withColumn("name_clean", regexp_replace("name", r"\.",""))
After removing the whitespace and dots, is there any way to get the distinct values, like this?
north plainfield
west homeland
newyork
so plainfield
I think you should split this into two steps: first convert from uppercase to lowercase, then remove the dots using the regexp_replace function.
from pyspark.sql.functions import lower, regexp_replace
# from uppercase to lowercase
names_modified = names_modified.withColumn('name', lower('name'))
# remove the dots; the dot is escaped so it matches a literal '.' and not any character
names_modified = names_modified.withColumn('name_clean', regexp_replace('name', r'\.', ''))
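To see why the pattern matters, here is a plain-Python illustration using the standard re module rather than PySpark, so it runs without a Spark session. It assumes the same three steps (lowercase, remove literal dots, collapse repeated spaces) and then takes the distinct values with a set. The helper name clean is made up for this sketch, and it does not attempt to merge "West Home Land" into "west homeland", which the thread leaves open.

```python
import re

names = ["N. Plainfield", "North Plainfield", "NEWYORK", "newyork", "So. Plainfield"]

def clean(name):
    """Lowercase, drop literal dots, and collapse runs of whitespace."""
    lowered = name.lower()
    no_dots = re.sub(r"\.", "", lowered)       # escaped dot: matches only '.'
    return re.sub(r"\s+", " ", no_dots).strip()

cleaned = [clean(n) for n in names]
distinct = sorted(set(cleaned))

# An unescaped dot matches every character, so the whole string is replaced
wiped = re.sub(".", " ", "N. Plainfield")

print(cleaned)
print(distinct)
```

In PySpark the corresponding building blocks would be lower, regexp_replace, and DataFrame.distinct.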

Aggregate data based on values appearing in two columns interchangeably?

home_team_name away_team_name home_ppg_per_odds_pre_game away_ppg_per_odds_pre_game
0 Manchester United Tottenham Hotspur 3.310000 4.840000
1 AFC Bournemouth Aston Villa 0.666667 3.230000
2 Norwich City Crystal Palace 0.666667 13.820000
3 Leicester City Sunderland 4.733333 3.330000
4 Everton Watford 0.583333 2.386667
5 Chelsea Manchester United 1.890000 3.330000
The home_ppg_per_odds_pre_game and away_ppg_per_odds_pre_game columns are basically the same metric. The former represents the value of this metric for the home team, while the latter represents it for the away team. I want the mean of this metric for each team, regardless of whether the team is playing home or away. In the example df, Manchester United appears as home_team_name in row 0 and as away_team_name in row 5. I want the mean for Manchester United to include both of these rows.
df.groupby("home_team_name")["home_ppg_per_odds_pre_game"].mean()
This only gives me the mean for the matches where the team is playing at home, but I want both home and away.
Since the two metrics are the same, you can append the home and away team metrics, like this:
home = df[['home_team_name', 'home_ppg_per_odds_pre_game']]
away = df[['away_team_name', 'away_ppg_per_odds_pre_game']].rename(
    columns={'away_team_name': 'home_team_name',
             'away_ppg_per_odds_pre_game': 'home_ppg_per_odds_pre_game'})
# stack the away rows under the home rows so every team ends up in one column
data_df = pd.concat([home, away])
Then you can use groupby to get the means:
data_df.groupby('home_team_name')['home_ppg_per_odds_pre_game'].mean().reset_index()
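As a check on the approach, here is a runnable sketch on three of the rows from the question. It follows the same append-then-group idea, but renames both halves to the neutral column names team and ppg_per_odds, which are assumptions made for readability and do not appear in the original post.

```python
import pandas as pd

df = pd.DataFrame({
    "home_team_name": ["Manchester United", "AFC Bournemouth", "Chelsea"],
    "away_team_name": ["Tottenham Hotspur", "Aston Villa", "Manchester United"],
    "home_ppg_per_odds_pre_game": [3.31, 0.666667, 1.89],
    "away_ppg_per_odds_pre_game": [4.84, 3.23, 3.33],
})

# Give the home and away halves the same column names so they can be stacked
home = df[["home_team_name", "home_ppg_per_odds_pre_game"]].rename(
    columns={"home_team_name": "team", "home_ppg_per_odds_pre_game": "ppg_per_odds"}
)
away = df[["away_team_name", "away_ppg_per_odds_pre_game"]].rename(
    columns={"away_team_name": "team", "away_ppg_per_odds_pre_game": "ppg_per_odds"}
)

long_df = pd.concat([home, away], ignore_index=True)
team_means = long_df.groupby("team")["ppg_per_odds"].mean()
print(team_means)
```

Manchester United appears once at home (3.31) and once away (3.33), so its mean covers both rows.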

Pandas: Unable to change value of a cell using while loop

I am trying to use a while loop to read through all the rows of my file and edit the value of a particular cell when a condition is met.
My logic works fine when I read the data from an Excel file, but the same logic does not work when I read from a CSV file.
Here is my logic to read from Excel file:
df = pd.read_excel('Energy Indicators.xls', 'Energy', index_col=None, na_values=['NA'], skiprows = 15, skipfooter = 38, header = 1, parse_cols ='C:F')
df = df.rename(columns = {'Unnamed: 0' : 'Country', 'Renewable Electricity Production': '% Renewable'})
df = df.drop(0, axis=0)
i = 0
while (i !=len(df)):
    if df.iloc[i]['Country'] == "Ukraine18":
        print(df.iloc[i]['Country'])
        df.iloc[i]['Country'] = 'Ukraine'
        print(df.iloc[i]['Country'])
    i += 1
df
The result I get is:
Ukraine18
Ukraine
But when I read a CSV file:
df = pd.read_csv('world_bank.csv', skiprows = 4)
df = df.rename(columns = {'Country Name' : 'Country'})
i = 0
while (i !=len(df)):
    if df.iloc[i]['Country'] == "Aruba":
        print(df.iloc[i]['Country'])
        df.iloc[i]['Country'] = "Arb"
        print(df.iloc[i]['Country'])
    i += 1
df
The result I get is:
Aruba
Aruba
Can someone please help? What am I doing wrong with my CSV file?
@Anna Iliukovich-Strakovskaia, @msr_003, you are right! I changed my code to df['ColumnName'][i], and it worked with the CSV file, but now it does not work with the Excel file.
So it seems that with data read from the CSV file, df['ColumnName'][i] works correctly,
but with data read from the Excel file, df.iloc[i]['ColumnName'] works correctly.
At this point, I have no clue why there should be a difference, because I am not working with the data inside the files but with data that was read from those files into a DataFrame. Once the data is in the DataFrame, the source shouldn't have any influence, I think.
Anyway, thank you for your help!
Generally, I modify values as below.
testdf = pd.read_csv("sam.csv")
testdf
ExportIndex AgentName Country
0 1 Prince United Kingdom
1 2 Nit United Kingdom
2 3 Akhil United Kingdom
3 4 Ruiva United Kingdom
4 5 Niraj United Kingdom
5 6 Nitin United States
i = 0
while(i != len(testdf)):
    if(testdf['AgentName'][i] == 'Nit'):
        testdf['AgentName'][i] = 'Nitesh'
    i += 1
testdf
ExportIndex AgentName Country
0 1 Prince United Kingdom
1 2 Nitesh United Kingdom
2 3 Akhil United Kingdom
3 4 Ruiva United Kingdom
4 5 Niraj United Kingdom
5 6 Nitin United States
But I'm not sure what's wrong with your approach.
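Neither poster pins down the cause. One plausible explanation, offered here as an assumption and not confirmed anywhere in the thread, is chained indexing: df.iloc[i]['Country'] = ... first builds a row object and then assigns into it, and whether that row is a view or a copy depends on the frame's column dtypes, so the write may or may not reach df. A single .loc call with a boolean mask avoids the ambiguity and needs no loop. The frame below is made-up sample data.

```python
import pandas as pd

# Made-up sample data with mixed dtypes, standing in for the files in the question
df = pd.DataFrame({
    "Country": ["Aruba", "Ukraine18", "Andorra"],
    "Value": [1.0, 2.0, 3.0],
})

# One indexing call selects the rows and the column together, so the
# assignment is applied to df itself and not to a temporary object
df.loc[df["Country"] == "Ukraine18", "Country"] = "Ukraine"
df.loc[df["Country"] == "Aruba", "Country"] = "Arb"

print(df)
```

The same pattern should behave identically whether the frame came from read_excel or read_csv, since it does not depend on how pandas stores the rows internally.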

Creating a dataframe in pandas with one index column and the second column as a list of different sizes creating boxplot problems

I am a beginner in Python. I am analyzing bus headways for each stop along a bus route. For each stop, I have a list of headways, and the number of headways can differ from stop to stop. To visualize the data, I want to plot boxplots on the same page so that you can observe how bus bunching develops along the route.
For this, I wrote code that reads bus data from a .csv file into a stop dictionary, with the stop name as the key and an object as the value (I track some other aspects of each stop, but they are not included here for brevity). The trouble I am having is with the boxplot. I thought pandas would make this easier, but I had a lot of trouble setting up a dataframe because my dictionary contains objects. You may have other ideas.
I simplified my code to a minimum so that you can still follow what I did. As a side note, I was trying to learn how to use classes while working on this analysis, which is why you see a bunch of classes in my code. In my full code, I deal with duplicate vehicles and outliers in their own methods.
import datetime

stops={}
stopNamesA=[]
headwaysA=[]
class Data:
    def __init__(self):
        self.depart = 0
        self.vehicle = 0
class Stop:
    def __init__(self):
        self.vehicles = []
        self.departs = []
        self.headways=[]
        self.stopName =""
    def AddData(self, line):
        fields = line.split(",")
        self.stopName = fields[3]
        self.vehicles.append(fields[0])
        x = fields[4]
        # x[:-1] drops the trailing newline before parsing the timestamp
        self.departs.append(datetime.datetime.strptime(x[:-1], "%m/%d/%y %I:%M:%S %p"))
    def CalcHeadway(self):
        # headway = seconds between consecutive departures at this stop
        for i in range(len(self.departs)-1):
            dt = self.departs[i]
            dt2 = self.departs[i+1]
            self.headways.append((dt2 - dt).total_seconds())
with open('data.csv','r') as f:
    for line in f:
        fields = line.split(",")
        sid = str(fields[3])
        if (fields[1] == 'X2' and fields[2] == 'WEST'):
            if sid in stops.keys():
                s = stops[sid]
            else:
                s = Stop()
                stops[sid] = s
            s.AddData(line)
for key, value in stops.items():
    value.CalcHeadway()
The data looks like the following (I truncated other parts again):
5401 X2 WEST H ST NW + 7TH ST NW 10/3/16 7:58:48 AM
2835 X2 WEST H ST NW + 7TH ST NW 10/3/16 8:16:49 AM
2460 X2 WEST H ST NW + 7TH ST NW 10/3/16 8:20:12 AM
2460 X2 WEST H ST NW + 7TH ST NW 10/3/16 8:20:38 AM
2460 X2 WEST H ST NW + 7TH ST NW 10/3/16 8:20:57 AM
5404 X2 WEST I ST + 14TH ST 10/3/16 8:01:55 AM
2835 X2 WEST I ST + 14TH ST 10/3/16 8:24:01 AM
2853 X2 WEST I ST + 14TH ST 10/3/16 9:27:07 AM
5404 X2 WEST I ST + 14TH ST 10/3/16 9:45:43 AM
2835 X2 WEST I ST + 14TH ST 10/3/16 9:57:31 AM
2831 X2 WEST MINNESOTA AVE NE + BENNING RD NE 10/3/16 8:02:41 AM
2821 X2 WEST MINNESOTA AVE NE + BENNING RD NE 10/3/16 8:17:42 AM
5420 X2 WEST MINNESOTA AVE NE + BENNING RD NE 10/3/16 8:34:43 AM
2853 X2 WEST MINNESOTA AVE NE + BENNING RD NE 10/3/16 8:44:14 AM
5401 X2 WEST MINNESOTA AVE NE + BENNING RD NE 10/3/16 9:02:20 AM
First, as the Error suggests, 'Series' object has no attribute 'boxplot'. You can draw a boxplot from a Series by Series.plot.box().
However since you want multiple boxes to appear, it makes sense to use a dataframe. So what you need is a DataFrame to plot your boxplot from.
If I understand your needs correctly, you'd need a DataFrame with 26 columns, one column per bus stop.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame()
df["I ST + 14TH ST"] = [1107.0, 1359.0, 1859.0, 1190.0, 1071.0, 904.0]
df["BENNING RD NE + 19TH ST NE"] = [1132.0, 1503.0, 1448.0, 1344.0, 958.0, 771.0]
#......
df["H ST NW + 5TH ST NW"] = [1182.0, 1315.0, 1691.0, 1193.0, 956.0, 729.0]
df.boxplot(rot=45)
plt.tight_layout()
plt.show()
It seems that in order to get a working dataframe out of the stops dictionary, one can do the following.
stops_for_drawing = {}
for key, val in stops.items():
    # wrap each list in a Series so stops with fewer headways are padded with NaN
    stops_for_drawing[key] = pd.Series(val.headways)
df = pd.DataFrame(stops_for_drawing)
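Since the number of headways differs between stops, it is worth noting that pandas raises a ValueError when a DataFrame is built from a dict of plain lists with different lengths. A minimal sketch of one way around this, assuming it is acceptable to pad the shorter columns with NaN, is to wrap each list in a pd.Series first. The stop names and headway values below are illustrative only.

```python
import pandas as pd

# Illustrative headways in seconds; the two stops have different counts
headways_by_stop = {
    "stop A": [1081.0, 203.0, 26.0, 19.0],
    "stop B": [1326.0, 3786.0],
}

# Each Series gets a default 0..n-1 index; pandas aligns the columns on the
# union of those indexes and fills the missing positions with NaN
df = pd.DataFrame({name: pd.Series(values) for name, values in headways_by_stop.items()})

print(df)
```

Calling df.boxplot(rot=45) on such a frame then draws one box per stop, and the NaN padding is left out of each box's statistics.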