Modifying a data frame by adding an additional column with if statement.
I created six lists: East_Asia, Central_Asia, Central_America, South_America, Europe_East and Europe_West. I want to add a conditional column based on an existing column: for example, if Japan is in East_Asia, then the Japan row of the new column should contain the matching region label, and so on.
df['native_region'] =df["native_country"].apply(lambda x: "Asia-East" if x in 'Asia_East'
"Central-Asia" elif x in "Central_Asia"
"South-America" elif x in "South_America"
"Europe-West" elif x in "Europe_West"
"Europe-East" elif x in "Europe_East"
"United-States" elif x in "
else "Outlying-US"
File "", line 2
"Central-Asia" elif x in "Central_Asia"
SyntaxError: invalid syntax
I might be wrong, but I think you're taking the problem the wrong way around.
What you seem to be doing there is just to replace '_' by '-', which you can do with the following line:
df['native_region'] = df.native_country.str.replace('_', '-')
And then, in my experience, it's more understandable to work like this:
known_countries = ['Asia-East', 'Central-Asia', 'South-America', ...]
is_known = df['native_country'].isin(known_countries)
df.loc[~is_known, 'native_region'] = 'Outlying-US'
This could also work if you have lists of countries such as:
east_asia_countries = ['Japan', 'China', 'Korea']
isin_east_asia = df['native_country'].isin(east_asia_countries)
df.loc[isin_east_asia, 'native_region'] = 'East-Asia'
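If the region lists really do hold country names, one alternative to a long chain of conditions is to invert the lists into a single country-to-region lookup and map it over the column. The sketch below is only an illustration under that assumption: the contents of `region_lists` and the sample rows are made up, and 'Outlying-US' is used as the fallback label as in the question.

```python
import pandas as pd

# Hypothetical region lists; the real lists from the question are not shown.
region_lists = {
    "East-Asia": ["Japan", "China", "Korea"],
    "Europe-West": ["France", "Germany"],
}

# Invert to a flat {country: region} dictionary.
country_to_region = {
    country: region
    for region, countries in region_lists.items()
    for country in countries
}

# Made-up sample data.
df = pd.DataFrame({"native_country": ["Japan", "France", "Guam"]})

# Countries missing from every list become NaN, then get the fallback label.
df["native_region"] = df["native_country"].map(country_to_region).fillna("Outlying-US")
print(df)
```

Adding a region then only means adding an entry to the dictionary, with no further branching.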
I have the following two lists:
lines = [[[0, 98]], [[64], [1,65,69]]]
stations = [[0,1], [0,3,1]]
lines describes the line combinations for getting from station 0 to station 1, and stations describes the stops visited when choosing those line combinations. The possible combinations for getting from 0 to 1 are:
Take line 0 or 98 (direct connection)
Take line 64 and then line 1 (1 transfer at station 3)
Take line 64 and then line 65 (1 transfer at station 3)
Take line 64 and then line 69 (1 transfer at station 3)
The length of stations always equals the length of lines. I used the following code to expand the lists as described above and store the options in a dataframe df.
import itertools
import pandas as pd

rows = []
for line, station in zip(lines, stations):
    if len(line) == 1:  # the entry stores direct connections
        for i in line[0]:
            rows.append({'lines': [i], 'stations': station, 'transfers': 0})
    else:  # one list of lines per leg: take every combination across the legs
        for combo in itertools.product(*line):
            rows.append({'lines': list(combo), 'stations': station, 'transfers': len(combo) - 1})
# DataFrame.append was removed in pandas 2.0, so the rows are collected first
df = pd.DataFrame(rows, columns=["lines", "stations", "transfers"])
For the sake of the example I add the following 2 columns score1, score2:
df['score1'] = [5,5,5,2,2]
df['score2'] = [2,6,4,3,3]
I want to update the values of lines and stations based on a condition: when score2 > score1, that line/station combination should be removed from the lists.
In this example the direct line 98, as well as the line combinations 64,65 and 64,69, should be removed from lines. Therefore, the expected output is the following:
lines = [[[0]], [[64], [1]]]
stations = [[0,1], [0,3,1]]
Note that stations is not affected, since at least one combination of lines remains in each sublist. If line 0 had also been removed, the expected output would be:
lines = [[[64], [1]]]
stations = [[0,3,1]]
As a start, I tried a manual solution that works for removing a single line (e.g. line 98):
lines = [[y for y in x if y != 98] if isinstance(x, list) else x for x in [y for x in lines for y in x]]
I am having difficulty with line combinations (e.g. 64,65 and 64,69). Do you have any suggestions?
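One possible approach, sketched below, is to filter the dataframe rows first and then rebuild the nested lists from the rows that survive. It assumes each stations sequence appears only once in the original lists, so it can serve as a grouping key. The dataframe is typed out by hand here to keep the example self-contained.

```python
import pandas as pd

# The exploded options from the question, with the two example score columns.
df = pd.DataFrame({
    "lines": [[0], [98], [64, 1], [64, 65], [64, 69]],
    "stations": [[0, 1], [0, 1], [0, 3, 1], [0, 3, 1], [0, 3, 1]],
    "score1": [5, 5, 5, 2, 2],
    "score2": [2, 6, 4, 3, 3],
})

# Keep only the combinations that should NOT be removed.
kept = df[df["score2"] <= df["score1"]]

# Rebuild: one entry per stations sequence, one list of lines per leg.
grouped = {}
for combo, station in zip(kept["lines"], kept["stations"]):
    legs = grouped.setdefault(tuple(station), [[] for _ in combo])
    for leg, line in zip(legs, combo):
        if line not in leg:
            leg.append(line)

lines = list(grouped.values())
stations = [list(s) for s in grouped]
print(lines)
print(stations)
```

A caveat: the nested format stores lines per leg, so it cannot represent every subset of combinations. If, say, 64,1 and 65,69 were the only survivors, the rebuilt legs would also imply 64,69 and 65,1. When that matters, keeping the filtered dataframe as the source of truth is probably safer.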
def locate(code):
    string1 = str(code)
    floor = string1[3]
    if floor == '1':
        return 'Ground Floor'
    if int(string1[5]) < 1:
        lobby = 'G'
    elif int(string1[5]) < 2:
        lobby = 'F'
    else:
        lobby = 'E'
    return floor + lobby
This function works fine when called with a single code, giving the output:
Ground Floor
But when I use it to map a series in a data frame, it raises an error.
error_data1['location'] = error_data1['status'].map(locate)
Error message: string index out of range.
How can I fix this?
Your problem is with your series values:
se = pd.Series(['S191009', 'rt'])
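Mapping locate over this series produces the same error you reported, because 'rt' is too short to be indexed at positions 3 and 5. You can skip such rows with try...except inside the function, if that is acceptable for your data.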
The problem is that you are indexing a position in a string that doesn't exist (i.e. the string is shorter than you expect). As the other answer mentioned, if you call the function on a value that is too short, you will get the same error. To solve this, add a try/except statement or, for simplicity, an initial if statement that returns "NotValid" or something similar. Your data probably contains strings that do not follow the standard form you expect.
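As a rough illustration of the guard suggested above, the variant below checks the length before indexing. The name `locate_safe` and the 'NotValid' label are placeholders, and the extra `isdigit` check is an addition that the original function does not have.

```python
import pandas as pd

def locate_safe(code):
    """Like locate, but returns 'NotValid' for codes that cannot be parsed."""
    s = str(code)
    # Positions 3 and 5 must exist, and position 5 must be a digit.
    if len(s) < 6 or not s[5].isdigit():
        return 'NotValid'
    floor = s[3]
    if floor == '1':
        return 'Ground Floor'
    digit = int(s[5])
    if digit < 1:
        lobby = 'G'
    elif digit < 2:
        lobby = 'F'
    else:
        lobby = 'E'
    return floor + lobby

se = pd.Series(['S191009', 'rt'])
result = se.map(locate_safe)
print(result)
```

Filtering afterwards on `result == 'NotValid'` is one way to inspect which rows in the real data break the expected format.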
I was not able to figure out why my code didn't work. It doesn't seem to have any problem as far as I can tell. Can anyone help point out the issue in my code?
What I tried:
true_avengers['Deaths'] = 0
for index, row in true_avengers.iterrows():
    for i in range(1, 6):
        col = 'Death{}'.format(i)
        if row[col] == 'YES':
            row['Deaths'] += 1

def clean_deaths(row):
    num_deaths = 0
    columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']
    for c in columns:
        death = row[c]
        if pd.isnull(death) or death == 'NO':
            continue
        elif death == 'YES':
            num_deaths += 1
    return num_deaths

true_avengers['Deaths'] = true_avengers.apply(clean_deaths, axis=1)
Much appreciated if you can enlighten me!
You are not using pandas the way it is meant to be used. It is usually not necessary to loop through the rows explicitly. Here's a clean vectorized solution. First, identify the columns of interest. Their names consist of "Death" followed by a number:
death_columns = true_avengers.columns.str.match(r"Death\d+")
Find out which of them are "YES":
changes = true_avengers.iloc[:, death_columns]=='YES'
Sum the occurrences in each row and add them to the Deaths column (this assumes Deaths already exists, e.g. initialised to 0 as in your code):
true_avengers['Deaths'] += changes.sum(axis=1)
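For context on why the explicit loop has no effect: the pandas documentation warns that the row yielded by `iterrows` may be a copy rather than a view, so writing to it is not guaranteed to change the dataframe. The vectorized steps above can be tried on a small made-up frame (the names and values below are invented for illustration):

```python
import pandas as pd

# Made-up sample with two death columns instead of five.
true_avengers = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Death1": ["YES", "NO", "YES"],
    "Death2": ["YES", None, "NO"],
})

# Column labels that look like "Death" followed by digits.
mask = true_avengers.columns.str.match(r"Death\d+")
death_columns = true_avengers.columns[mask]

# True where a cell equals 'YES'; missing values compare as False.
changes = true_avengers[death_columns] == "YES"

# Row-wise count of True values.
true_avengers["Deaths"] = changes.sum(axis=1)
print(true_avengers)
```

Assigning the sum directly, rather than using `+=`, avoids having to create the Deaths column beforehand.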
I have the 'Field_Type' column filled with strings and I want to derive the values in the 'Units' column using an if statement.
So Units shows the desired result. Essentially I want to call out what type of activity is occurring.
I tried to do this using the code below, but it won't run (see the error underneath). Any help is greatly appreciated!
create_table['Units'] = np.where(create_table['Field_Name'].str.startswith("W"), "MW", np.where(create_table['Field_Name'].str.contains("R"), "MVar", np.where(create_table['Field_Name'].str.contains("V"), "Per Unit")))
ValueError: either both or neither of x and y should be given
You can write a function that encodes your conditions, then use apply on the dataframe and pass it the function:
def unit_mapper(row):
    if row['Field_Type'].startswith('W'):
        return 'MW'
    elif 'R' in row['Field_Type']:
        return 'MVar'
    elif 'V' in row['Field_Type']:
        return 'Per Unit'
    return 'N/A'
And then
create_table['Units'] = create_table.apply(unit_mapper, axis=1)
In your text you talk about Field_Type, but you are using Field_Name in your example. Which one is correct?
You want to do something like:
create_table.loc[create_table['Field_Type'].str.startswith('W'), 'Units'] = 'MW'
create_table.loc[create_table['Field_Type'].str.startswith('R'), 'Units'] = 'MVar'
create_table.loc[create_table['Field_Type'].str.startswith('V'), 'Units'] = 'Per Unit'
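Another option, in case it helps, is `numpy.select`, which takes a list of conditions and a list of values and uses the first condition that matches, much like an if/elif chain. The reported ValueError is the message `numpy.where` raises when it receives only one of its two value arguments. The sample Field_Type values below are invented, and the sketch assumes the column has no missing values.

```python
import numpy as np
import pandas as pd

# Made-up sample data; the real Field_Type values are not shown in the thread.
create_table = pd.DataFrame({"Field_Type": ["Watts", "VAR", "Voltage", "Other"]})

field = create_table["Field_Type"]

# Conditions are checked in order; the first True one wins.
conditions = [
    field.str.startswith("W"),
    field.str.contains("R"),
    field.str.contains("V"),
]
choices = ["MW", "MVar", "Per Unit"]

# Rows matching none of the conditions get the default.
create_table["Units"] = np.select(conditions, choices, default="N/A")
print(create_table)
```

Because the order of `conditions` decides ties, "VAR" ends up as MVar here even though it also contains a "V".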
I keep getting the following error.
I read a file that contains time series data of 3 columns: [meter ID] [daycode(explain later)] [meter reading in kWh]
consum = pd.read_csv("data/File1.txt", delim_whitespace=True, encoding = "utf-8", names =['meter', 'daycode', 'val'], engine='python')
consum.set_index('meter', inplace=True)
test = consum.loc[[1048]]
I will look at meter readings over the full length of the data in this file, but first I filter by meter ID.
test['day'] = test['daycode'].astype(str).str[:3]
test['hm'] = test['daycode'].astype(str).str[-2:]
For readability, I convert daycode according to its encoding rule. The first 3 digits are the day number, ranging from 1 to 730 (365 x 2), and the last 2 digits are the half-hour slot, ranging from 1 to 48. These are 30-minute interval readings over 2 years (though not every meter has the full range).
So I created two files, one containing the dates and the other the times. I use the index to convert the digits of daycode into the corresponding date and time from these files.
#dcodebook index starts from 0. So minus 1 from the daycode before match
dcodebook = pd.read_csv("data/dcode.txt", encoding = "utf-8", sep = '\r', names =['match'])
#hcodebook starts from 1
hcodebook = pd.read_csv("data/hcode.txt", encoding = "utf-8", sep ='\t', lineterminator='\r', names =['code', 'print'])
hcodebook = hcodebook.drop(['code'], axis= 1)
For some reason, as far as I understand, dcodebook had to be indexed with .iloc, whereas hcodebook needed .loc.
#iloc: by int-position
#loc: by label value
#ix: by both
day_df = dcodebook.iloc[test['day'].astype(int) - 1].reset_index(drop=True)
#to avoid duplicate index Valueerror, create separate dataframes..
hm_df = hcodebook.loc[test['hm'].astype(int) - 1]
#.to_frame error / do I need .reset_index(drop=True)?
The following line is where the code crashes.
datcode_df = day_df(['match']) + ' ' + hm_df(['print'])
print datcode_df
print test
What I don't understand:
I tested earlier that columns from different dataframes can be combined using simple addition.
I initially assigned this to the existing column ['daycode'] in test dataframe, so that previous values will be replaced. And the same error msg was returned.
Please advise.
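Both DataFrames need to be the same size, so each lookup on day and hm must return exactly one row per reading.
Then call reset_index with drop=True on both so that their indices match, and finally remove the () in the concatenation, since columns are selected with square brackets and a DataFrame cannot be called like a function: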
day_df = dcodebook.iloc[test['day'].astype(int) - 1].reset_index(drop=True)
hm_df = hcodebook.loc[test['hm'].astype(int) - 1].reset_index(drop=True)
datcode_df = day_df['match'] + ' ' + hm_df['print']
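To see the alignment issue in isolation, here is a small self-contained sketch. The codebook contents and the daycodes are invented, since the real files are not shown. It also swaps the string slicing for integer arithmetic, an assumption-driven change: `str[:3]` only gives the day number when the code has exactly five characters, while `// 100` and `% 100` work for shorter day numbers too.

```python
import pandas as pd

# Hypothetical stand-ins for the two codebook files.
dcodebook = pd.DataFrame({"match": ["2009-01-01", "2009-01-02", "2009-01-03"]})
hcodebook = pd.DataFrame({"print": ["00:00", "00:30", "01:00"]})

# Hypothetical readings for one meter; note the repeated index label.
test = pd.DataFrame({"daycode": [101, 102, 303]}, index=[1048, 1048, 1048])
test.index.name = "meter"

day = test["daycode"] // 100  # day number
hm = test["daycode"] % 100    # half-hour slot

# Positional lookups (codes are 1-based), then a fresh 0..n-1 index on both.
day_df = dcodebook.iloc[(day - 1).to_numpy()].reset_index(drop=True)
hm_df = hcodebook.iloc[(hm - 1).to_numpy()].reset_index(drop=True)

# Square brackets select the columns; addition aligns on the shared index.
datcode = day_df["match"] + " " + hm_df["print"]

# Assign by position, because test's own index consists of repeated labels.
test["datetime"] = datcode.to_numpy()
print(test)
```

Without the two `reset_index(drop=True)` calls the Series would carry different index labels, and the addition would align on those labels instead of row by row.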