I’m looking at this but I have no idea how to formulate it:
Change Value of a Dataframe Column Based on a Filter
I need to change the values in medianIncome: anything 0.4999 or lower becomes 0.4999, and anything 15.0001 or higher becomes 15.0001.
Here's sample data:
id longitude_x latitude ocean_proximity longitude_y state medianHouseValue housingMedianAge totalBedrooms totalRooms households population medianIncome
0 1 -122.23 37.88 NEAR BAY -122.23 CA 452.603 45.0 131.0 884.0 130.0 323.0 83252.0
1 396 -122.34 37.88 NEAR BAY -122.23 CA 350.004 41.0 930.0 3063.0 926.0 2560.0 17375.0
2 398 -122.29 37.88 NEAR BAY -122.23 CA 216.703 54.0 263.0 1211.0 230.0 525.0 38672.0
3 401 -122.28 37.88 NEAR BAY -122.23 CA 261.303 55.0 333.0 1845.0 335.0 772.0 42614.0
4 424 -122.26 37.88 NEAR BAY -122.23 CA 391.803 53.0 418.0 2553.0 404.0 898.0 62425.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
929044 9476 -123.38 39.37 INLAND -121.24 CA 124.601 20.0 813.0 3947.0 732.0 1902.0 26424.0
929045 9494 -123.75 39.37 INLAND -121.24 CA 151.403 20.0 299.0 1377.0 282.0 830.0 32500.0
929046 10065 -121.03 39.37 INLAND -121.24 CA 85.000 15.0 327.0 1338.0 310.0 1174.0 26341.0
929047 10074 -120.10 39.37 INLAND -121.24 CA 117.301 34.0 411.0 2328.0 373.0 1016.0 45208.0
929048 21558 -121.24 39.37 INLAND -121.24 CA 89.401 18.0 616.0 2787.0 532.0 1387.0 23886.0
It shows:
np.where((df['x'] > 0) & (df['y'] < 10), 1, 0)
So I'm at:
np.where(housing['medianIncome'] > 15.0001
And I'm stuck as to the rest. I can only use pandas and numpy; I'm not able to use lambda.
I'm expecting an outcome that doesn't raise an error; as of yet, I don't have one.
Use Series.clip:
housing = pd.DataFrame({'medianIncome':[20,5,0.07]})
housing['medianIncome'] = housing['medianIncome'].clip(upper=15.0001, lower=0.4999)
print (housing)
medianIncome
0 15.0001
1 5.0000
2 0.4999
An alternative is numpy.select, if you need to set other values based on the conditions:
housing['medianIncome'] = np.select([housing['medianIncome'].lt(0.4999),
                                     housing['medianIncome'].gt(15.0001)],
                                    [0, 1],
                                    default=housing['medianIncome'])
print (housing)
medianIncome
0 1.0
1 5.0
2 0.0
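For completeness, the np.where attempt from the question can be finished by nesting two calls. A minimal sketch that reproduces the clip behavior, using the same toy frame as above:
import numpy as np
import pandas as pd

housing = pd.DataFrame({'medianIncome': [20, 5, 0.07]})
# Cap values above 15.0001, floor values below 0.4999, keep the rest unchanged
housing['medianIncome'] = np.where(housing['medianIncome'] > 15.0001, 15.0001,
                          np.where(housing['medianIncome'] < 0.4999, 0.4999,
                                   housing['medianIncome']))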
I am trying to save the data in the dataframe as a csv file, but I always get an error. I need one last piece of code to write the data, which is already readable in python, to a saved csv file.
Code is here:
from selenium import webdriver
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
browser=webdriver.Chrome()
browser.get("https://archive.doingbusiness.org/en/scores")
countries = WebDriverWait(browser, 20).until(EC.visibility_of_element_located((By.XPATH, "//*[@id='dftFrontiner']/div[3]/table"))).get_attribute("outerHTML")
df = pd.read_html(countries)
**df.to_csv('output'.csv, index=False)**
print(df)
time.sleep(2)
browser.quit()
Without the line written in bold, I can get the following output:
[ 0 1 2 3
0 Region Region Region Region
1 NaN East Asia & Pacific 62.7 63.3
2 NaN Europe & Central Asia 71.8 73.1
3 NaN Latin America & Caribbean 58.8 59.1
4 NaN Middle East & North Africa 58.4 60.2
.. ... ... ... ...
217 NaN Vietnam 68.6 69.8
218 NaN West Bank and Gaza 59.7 60
219 NaN Yemen, Rep. 30.7 31.8
220 NaN Zambia 65.7 66.9
221 NaN Zimbabwe 50.5 54.5
When I add the bold line (df.to_csv('output'.csv, index=False)), I cannot save the file. However, I need this data in csv format. Please advise me on how to fix the code.
Thanks.
That's because pandas.read_html returns a list of DataFrames, so you need to select one by index before saving the .csv.
Replace:
df.to_csv('output'.csv, index=False)
With this:
df[0].to_csv('output.csv', index=False)
# Output :
print(df[0])
0 1 2 3
0 Region Region Region Region
1 NaN East Asia & Pacific 62.7 63.3
2 NaN Europe & Central Asia 71.8 73.1
3 NaN Latin America & Caribbean 58.8 59.1
4 NaN Middle East & North Africa 58.4 60.2
.. ... ... ... ...
217 NaN Vietnam 68.6 69.8
218 NaN West Bank and Gaza 59.7 60
219 NaN Yemen, Rep. 30.7 31.8
220 NaN Zambia 65.7 66.9
221 NaN Zimbabwe 50.5 54.5
[222 rows x 4 columns]
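If you are not sure which element of the list holds the table you want, a quick sanity check (a hedged sketch, not re-run against the live site) is:
import pandas as pd

tables = pd.read_html(countries)  # one DataFrame per <table> in the HTML
print(len(tables))                # how many tables were parsed
print(tables[0].head())           # inspect the first one before saving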
If you need a separated .csv for each group (Region & Economy), use this :
# groupby yields (group_name, group_frame) tuples: g[0] is the
# Region/Economy label, g[1] the matching rows
for g in df[0].ffill().groupby(0, sort=False):
    sub_df = g[1].reset_index(drop=True).iloc[1:, 1:]
    sub_df.to_csv(f"{g[0]}.csv", index=False)
In a pandas DataFrame with 4 columns, I need to remove the digits from the end of the names in the Country column that have them:
Country Energy
56 Central African Republic 23
57 Chad 77
58 Chile 1613
59 China2 127191
60 Hong Kong 585
75 Denmark5 725
I'd write a function to remove digits at the end of the string and apply it to that specific column:
import pandas as pd
import string

def remove_digits(country):
    return country.rstrip(string.digits)

df = pd.DataFrame({'country': ['China2', 'Hong Kong'], 'energy': [127191, 585]})
print(df)
df['country'] = df['country'].apply(remove_digits)
print('\n', df)
This will return:
country energy
0 China2 127191
1 Hong Kong 585
country energy
0 China 127191
1 Hong Kong 585
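As a side note, the same result is available without apply via the vectorized string accessor, which strips any of the given characters from the right end. A small sketch on the same toy frame:
import string
import pandas as pd

df = pd.DataFrame({'country': ['China2', 'Hong Kong'], 'energy': [127191, 585]})
# Series.str.rstrip takes a set of characters to strip, just like str.rstrip
df['country'] = df['country'].str.rstrip(string.digits)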
Most simply, I'd use a regex to replace digits at the end with an empty string; the $ anchor restricts the match to the end of the string:
import pandas as pd
df = pd.DataFrame({'Country': ['China2', 'Hong Kong', 'Ireland43'], 'energy': [127191, 585, 999]})
print(df)
df['Country'] = df['Country'].str.replace(r'\d+$', '', regex=True)
print('\n', df)
This returns:
Country energy
0 China2 127191
1 Hong Kong 585
2 Ireland43 999
Country energy
0 China 127191
1 Hong Kong 585
2 Ireland 999
I'm getting KeyError: 'BasePay' for the BasePay column; it is there in the DataFrame but goes missing when using the mean() function.
My pandas version is 0.23.3 on Python 3.6.3.
>>> import pandas as pd
>>> import numpy as np
>>> salDataF = pd.read_csv('Salaries.csv', low_memory=False)
>>> salDataF.head()
Id EmployeeName JobTitle BasePay OvertimePay OtherPay ... TotalPay TotalPayBenefits Year Notes Agency Status
0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 167411.18 0.0 400184.25 ... 567595.43 567595.43 2011 NaN San Francisco NaN
1 2 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 155966.02 245131.88 137811.38 ... 538909.28 538909.28 2011 NaN San Francisco NaN
2 3 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 212739.13 106088.18 16452.6 ... 335279.91 335279.91 2011 NaN San Francisco NaN
3 4 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 77916.0 56120.71 198306.9 ... 332343.61 332343.61 2011 NaN San Francisco NaN
4 5 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 134401.6 9737.0 182234.59 ... 326373.19 326373.19 2011 NaN San Francisco NaN
[5 rows x 13 columns]
>>> salDataF.groupby('Year').mean()
Id TotalPay TotalPayBenefits Notes
Year
2011 18080.0 71744.103871 71744.103871 NaN
2012 54542.5 74113.262265 100553.229232 NaN
2013 91728.5 77611.443142 101440.519714 NaN
2014 129593.0 75463.918140 100250.918884 NaN
>>> EmpSal = salDataF.groupby('Year').mean()['BasePay']
Error: KeyError: 'BasePay'
Here is the problem: BasePay is not numeric, so salDataF.groupby('Year').mean() excludes all non-numeric columns by design.
The solution is to first try astype:
salDataF['BasePay'] = salDataF['BasePay'].astype(float)
...and if that fails because of some non-numeric data, use to_numeric with errors='coerce' to convert those values to NaNs:
salDataF['BasePay'] = pd.to_numeric(salDataF['BasePay'], errors='coerce')
Then it is better to select the column before calling mean:
EmpSal = salDataF.groupby('Year')['BasePay'].mean()
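Putting it together, a minimal sketch (assuming Salaries.csv is the file from the question):
import pandas as pd

salDataF = pd.read_csv('Salaries.csv', low_memory=False)
# Coerce the text column to numbers; bad values become NaN and are ignored by mean()
salDataF['BasePay'] = pd.to_numeric(salDataF['BasePay'], errors='coerce')
EmpSal = salDataF.groupby('Year')['BasePay'].mean()
print(EmpSal)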
I have a dataframe like this. I have regular fields up to "state"; after that come the trailers (each group of 3 tr1* columns represents 1 trailer). I want to convert those trailers to rows. I tried the melt function, but I am only able to use 1 trailer column. Kindly look at the example below so you can understand.
Name number city state tr1num tr1acct tr1ct tr2num tr2acct tr2ct tr3num tr3acct tr3ct
DJ 10 Edison nj 1001 20345 Dew 1002 20346 Newca. 1003. 20347. pen
ND 20 Newark DE 2001 1985 flor 2002 1986 rodge
I am expecting the output like this.
Name number city state trnum tracct trct
DJ 10 Edison nj 1001 20345 Dew
DJ 10 Edison nj 1002 20346 Newca
DJ 10 Edison nj 1003 20347 pen
ND 20 Newark DE 2001 1985 flor
ND 20 Newark DE 2002 1986 rodge
You need to look at using pd.wide_to_long. However, you will need to do some column renaming first.
df = df.set_index(['Name','number','city','state'])
df.columns = df.columns.str.replace(r'(\D+)(\d+)(\D+)', r'\1\3_\2', regex=True)
df = df.reset_index()
pd.wide_to_long(df, ['trnum', 'trct', 'tracct'],
                ['Name', 'number', 'city', 'state'], 'Code',
                sep='_', suffix=r'\d+')\
  .reset_index()\
  .drop('Code', axis=1)
Output:
Name number city state trnum trct tracct
0 DJ 10 Edison nj 1001.0 Dew 20345.0
1 DJ 10 Edison nj 1002.0 Newca. 20346.0
2 DJ 10 Edison nj 1003.0 pen 20347.0
3 ND 20 Newark DE 2001.0 flor 1985.0
4 ND 20 Newark DE 2002.0 rodge 1986.0
5 ND 20 Newark DE NaN NaN NaN
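To see what the renaming step does, here is a small demonstration: the regex moves the trailer number behind an underscore, which is the shape wide_to_long expects.
import pandas as pd

cols = pd.Index(['tr1num', 'tr1acct', 'tr1ct', 'tr2num'])
# (\D+)(\d+)(\D+) captures prefix, trailer number, and field name,
# then reorders them as prefix+field_number
print(cols.str.replace(r'(\D+)(\d+)(\D+)', r'\1\3_\2', regex=True))
# Index(['trnum_1', 'tracct_1', 'trct_1', 'trnum_2'], dtype='object')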
You could achieve this by renaming your columns a bit and applying the pandas wide_to_long method. Below is the code which produces your desired output.
df = pd.DataFrame({"Name": ["DJ", "ND"], "number": [10, 20],
                   "city": ["Edison", "Newark"], "state": ["nj", "DE"],
                   "trnum_1": [1001, 2001], "tracct_1": [20345, 1985], "trct_1": ["Dew", "flor"],
                   "trnum_2": [1002, 2002], "trct_2": ["Newca", "rodge"],
                   "trnum_3": [1003, None], "tracct_3": [20347, None], "trct_3": ["pen", None]})

pd.wide_to_long(df, stubnames=['trnum', 'tracct', 'trct'],
                i='Name', j='dropme', sep='_')\
  .reset_index().drop('dropme', axis=1)\
  .sort_values('trnum')
This outputs:
Name state city number trnum tracct trct
0 DJ nj Edison 10 1001.0 20345.0 Dew
1 DJ nj Edison 10 1002.0 NaN Newca
2 DJ nj Edison 10 1003.0 20347.0 pen
3 ND DE Newark 20 2001.0 1985.0 flor
4 ND DE Newark 20 2002.0 NaN rodge
5 ND DE Newark 20 NaN NaN None
Another option:
df = pd.DataFrame({'col1': [1,2,3], 'col2':[3,4,5], 'col3':[5,6,7], 'tr1':[0,9,8], 'tr2':[0,9,8]})
The df:
col1 col2 col3 tr1 tr2
0 1 3 5 0 0
1 2 4 6 9 9
2 3 5 7 8 8
Subsetting to create 2 DataFrames:
tr1_df = df[['col1', 'col2', 'col3', 'tr1']].rename(index=str, columns={"tr1":"tr"})
tr2_df = df[['col1', 'col2', 'col3', 'tr2']].rename(index=str, columns={"tr2":"tr"})
res = pd.concat([tr1_df, tr2_df])
result:
col1 col2 col3 tr
0 1 3 5 0
1 2 4 6 9
2 3 5 7 8
0 1 3 5 0
1 2 4 6 9
2 3 5 7 8
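The same idea generalizes to any number of trailer columns with a loop; a hedged sketch using the toy frame above:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [3, 4, 5], 'col3': [5, 6, 7],
                   'tr1': [0, 9, 8], 'tr2': [0, 9, 8]})
id_cols = ['col1', 'col2', 'col3']
# one sub-frame per trailer column, renamed to a common name, then stacked
parts = [df[id_cols + [c]].rename(columns={c: 'tr'}) for c in ['tr1', 'tr2']]
res = pd.concat(parts, ignore_index=True)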
One option is the pivot_longer function from pyjanitor, using the .value placeholder:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd
(df
 .pivot_longer(
     index=slice("Name", "state"),
     names_to=(".value", ".value"),
     names_pattern=r"(.+)\d(.+)",
     sort_by_appearance=True)
 .dropna()
)
Name number city state trnum tracct trct
0 DJ 10 Edison nj 1001.0 20345.0 Dew
1 DJ 10 Edison nj 1002.0 20346.0 Newca.
2 DJ 10 Edison nj 1003.0 20347.0 pen
3 ND 20 Newark DE 2001.0 1985.0 flor
4 ND 20 Newark DE 2002.0 1986.0 rodge
The .value keeps the part of the column label associated with it as the header, and since we have multiple .value groups, they are combined into a single word. The .value pieces are determined by the capture groups in names_pattern, which is a regular expression.
Note that currently the multiple .value option is only available in the dev version.
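As an illustration of how names_pattern splits a header (plain regex only, a hedged sketch):
import re

# the two capture groups become the two .value pieces: 'tr1num' -> 'tr' + 'num'
print(re.match(r"(.+)\d(.+)", "tr1num").groups())  # ('tr', 'num')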