a=0
for i in range (0,len(df)):
    if df['column name'][i][7]!='1' or df['column name'][i][7]='6':
        a=a+1
When I run this piece of code, I get the error "string index out of range". Can someone help me solve this problem?
P.S. df has about 10 million rows.
This error occurs when the index is greater than or equal to the length of the string.
You can first check that the string has at least 8 characters, since [7] reads the eighth character.
a = 0
for i in range(0, len(df)):
    data = df['column name'][i]
    if len(data) > 7 and (data[7] != '1' or data[7] == '6'):
        a = a + 1
You can also do this with a generator expression:
can_count = lambda row: len(row['column name']) > 7 and (row['column name'][7] != '1' or row['column name'][7] == '6')
a = sum(1 for _, row in df.iterrows() if can_count(row))
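Given the roughly 10 million rows, iterating row by row will be slow; a vectorized sketch using pandas' .str accessor (assuming the same condition as in the question) could look like this:

col = df['column name']
# The length guard mirrors the explicit check above; without it,
# .str[7] yields NaN for short strings and NaN != '1' evaluates True.
mask = (col.str.len() > 7) & ((col.str[7] != '1') | (col.str[7] == '6'))
a = int(mask.sum())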
One thing to note is that df['column name'][i][7]='6' should use ==, not =.
I see you are using the assignment operator '=' in your code instead of '=='. I have copy-pasted the line below to point this out. Can you retry and report the error message you get then? Also, please comment a little more on what you would like to achieve with this operation.
if df['column name'][i][7]!='1' or df['column name'][i][7]='6':
Can you please add an example of your string? Your data is probably too short: if you use df['column name'][i][7], your string should be at least 8 characters long.
Good luck!
def locate(code):
    string1 = str(code)
    floor = string1[3]
    if floor == '1':
        return 'Ground Floor'
    else:
        if int(string1[5]) < 1:
            lobby = 'G'
        elif int(string1[5]) < 2:
            lobby = 'F'
        else:
            lobby = 'E'
        return floor + lobby

print(locate('S191009'))
print(locate('S087525'))
This function works fine on individual input codes, as above, with output:
Ground Floor
7E
But when I use it to map a series in a data frame, it throws an error:
error_data1['location'] = error_data1['status'].map(locate)
Error message: string index out of range.
How can I fix this?
Your problem is with your series values:
se = pd.Series(['S191009', 'rt'])
se.map(locate)
produces the same error you reported. You can skip such rows by using try...except inside the function, if losing them does not hurt you.
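For example, a minimal sketch of that idea (the safe_locate name and the None fallback are just one choice):

def safe_locate(code):
    # Wrap the existing locate(); malformed or too-short codes
    # become None instead of raising.
    try:
        return locate(code)
    except (IndexError, ValueError):
        return None

error_data1['location'] = error_data1['status'].map(safe_locate)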
The problem is that you are indexing a position that doesn't exist in the string (i.e. the string is shorter than you expect). As the other answer mentioned, if you try
my_string = "foo"
print(my_string[5])
you will get the same error. To solve this you should add a try...except statement, or, for simplicity, an initial if statement that returns "NotValid" or something like that. Your data probably contains strings that do not follow the standard form you expect.
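A sketch of that guard clause (the minimum length of 6 follows from locate() reading indices 3 and 5; the "NotValid" sentinel is an assumption):

def locate_checked(code):
    string1 = str(code)
    # Note: this guards only the length, not non-digit characters
    # at position 5, which would still raise ValueError in int().
    if len(string1) < 6:
        return 'NotValid'
    return locate(string1)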
I would like to match all cells that begin with the number 978, but the following code matches 397854 or NaN too.
an_transaction_product["kniha"] = np.where(an_transaction_product["zbozi_ean"].str.contains('^978', regex=True) , 1, 0)
What am I doing wrong?
This doesn't work because .str.contains will check whether the regex occurs anywhere in the string, and it returns NaN for NaN cells, which np.where then treats as true.
If you insist on using regex, .str.match does what you want.
But for this simple case, .str.startswith("978") is clearer.
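A sketch of that startswith fix applied to the original line; na=False makes the NaN cells count as non-matches:

an_transaction_product["kniha"] = np.where(
    an_transaction_product["zbozi_ean"].str.startswith("978", na=False), 1, 0)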
Apart from regex, you can use .loc to find cells that start with '978'. The code below assigns 1 to such cells in column 'A', just as an example:
df.loc[df['A'].astype(str).str[:3] == '978', 'A'] = 1
Note: astype(str) converts the number to a string, str[:3] takes the first three characters, and the result is compared to '978'.
I was not able to figure out why my code didn't work; it seemingly doesn't have any problem as far as I can tell. Can anyone help point out the issue in my code?
What I tried:
true_avengers['Deaths'] = 0
for index, row in true_avengers.iterrows():
    for i in range(1, 6):
        col = 'Death{}'.format(i)
        if row[col] == 'YES':
            row['Deaths'] += 1
Answer:
def clean_deaths(row):
    num_deaths = 0
    columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']
    for c in columns:
        death = row[c]
        if pd.isnull(death) or death == 'NO':
            continue
        elif death == 'YES':
            num_deaths += 1
    return num_deaths

true_avengers['Deaths'] = true_avengers.apply(clean_deaths, axis=1)
Much appreciated if you can enlighten me!
You are not using pandas correctly. Your loop does nothing because iterrows yields copies of the rows, so row['Deaths'] += 1 never writes back to the DataFrame. Beyond that, it is usually not necessary to loop through the rows explicitly; here's a clean vectorized solution. First, identify the columns of interest. Their names consist of "Death" followed by a number:
death_columns = true_avengers.columns.str.match(r"Death\d+")
Find out which of them are "YES":
changes = true_avengers.iloc[:, death_columns]=='YES'
Calculate the sum of the occurrences and add them to the last column:
true_avengers['Deaths'] += changes.sum(axis=1)
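A quick check on toy data (the values below are hypothetical, assuming the YES/NO/NaN encoding described above):

import pandas as pd

true_avengers = pd.DataFrame({
    'Death1': ['YES', 'YES'],
    'Death2': ['YES', None],
})
true_avengers['Deaths'] = 0
death_columns = true_avengers.columns.str.match(r"Death\d+")
changes = true_avengers.iloc[:, death_columns] == 'YES'
true_avengers['Deaths'] += changes.sum(axis=1)
print(true_avengers['Deaths'].tolist())  # [2, 1]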
I have the 'Field_Type' column filled with strings, and I want to derive the values in the 'Units' column using an if statement.
So Units shows the desired result. Essentially I want to call out what type of activity is occurring.
I tried to do this using my code below, but it won't run (the error message is shown below). Any help is greatly appreciated!
create_table['Units'] = pd.np.where(create_table['Field_Name'].str.startswith("W"), "MW",
                        pd.np.where(create_table['Field_Name'].str.contains("R"), "MVar",
                        pd.np.where(create_table['Field_Name'].str.contains("V"), "Per Unit")))
ValueError: either both or neither of x and y should be given
You can write a function to define your conditionals, then use apply on the dataframe and pass the function:
def unit_mapper(row):
    if row['Field_Type'].startswith('W'):
        return 'MW'
    elif 'R' in row['Field_Type']:
        return 'MVar'
    elif 'V' in row['Field_Type']:
        return 'Per Unit'
    else:
        return 'N/A'
And then
create_table['Units'] = create_table.apply(unit_mapper, axis=1)
In your text you talk about Field_Type, but you are using Field_Name in your example. Which one is correct?
You want to do something like:
create_table.loc[create_table['Field_Type'].str.startswith('W'), 'Units'] = 'MW'
create_table.loc[create_table['Field_Type'].str.startswith('R'), 'Units'] = 'MVar'
create_table.loc[create_table['Field_Type'].str.startswith('V'), 'Units'] = 'Per Unit'
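Another option is numpy's np.select, which expresses the same cascade without nesting np.where calls (a sketch assuming Field_Type holds plain strings; the 'N/A' default mirrors the apply answer above):

import numpy as np

conditions = [
    create_table['Field_Type'].str.startswith('W'),
    create_table['Field_Type'].str.contains('R'),
    create_table['Field_Type'].str.contains('V'),
]
choices = ['MW', 'MVar', 'Per Unit']
create_table['Units'] = np.select(conditions, choices, default='N/A')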
I am given a set of substrings. I need to find the count of occurrences of each of those substrings in a particular column of a dataframe. The relevant dataframe looks like this:
training['concat']
0 svAxu$paxArWAn
1 xvAxaSa$varRANi
2 AxAna$xurbale
3 go$BakwAH
4 viXi$Bexena
5 nIwi$kuSalaM
6 lafkA$upamam
7 yaSas$lipsoH
8 kaSa$AGAwam
9 hewumaw$uwwaram
10 varRa$pUgAn
My set of substrings is a dictionary, where the keys are the substrings and the values are the probabilities with which they occur:
reg = {'anuBavAn':0.35, 'a$piwra':0.2 ...... 'piwra':0.7, 'pa':0.03, 'a':0.0005}
# The length of the dictionary is 2000
In particular, I need to find those substrings which occur more than twice.
I have written the following code that performs the task. Is there a more elegant, pythonic or pandas-specific way to achieve the same? The current implementation is taking quite some time to execute.
elites = dict()
for reg_pat in reg:
    eliter = len(training[training['concat'].str.contains(reg_pat)]['concat'])
    if eliter >= 3:
        elites[reg_pat] = reg[reg_pat]
You can use apply with the in operator instead of str.contains; it is faster, and it also treats the '$' in your patterns as a literal character rather than a regex anchor:
reg_ = {'anuBavAn':0.35, 'a$piwra':0.2, 'piwra':0.7, 'pa':0.03, 'a':0.0005}
elites = dict()
for reg_pat in reg_:
    if training['concat'].apply(lambda x: reg_pat in x).sum() >= 3:
        elites[reg_pat] = reg_[reg_pat]
print(elites)
{'a': 0.0005}
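If you would rather keep str.contains, passing regex=False makes it treat the '$' in patterns like 'a$piwra' literally (a minimal sketch with the same variable names as above):

elites = {pat: prob for pat, prob in reg_.items()
          if training['concat'].str.contains(pat, regex=False).sum() >= 3}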
Hopefully I have interpreted your question correctly. I'm inclined to stay away from regex here (in fact, I've never used it in conjunction with pandas), but it's not wrong, strictly speaking. In any case, I find it hard to believe that any regex operations are faster than a simple in check, but I could be wrong on that.
for substr in reg:
    totalStringAppearances = training['concat'].apply(lambda string: substr in string)
    totalStringAppearances = totalStringAppearances.sum()
    if totalStringAppearances > 2:
        reg[substr] = totalStringAppearances / len(training)
    else:
        # do what you want to with the very rare substrings
        pass
Some gotchas:
If you want something like the substring 'a' in 'abcdefa' to count as 2, then this will not work; it merely checks for the existence of the substring in each string.
Inside the apply(), I am using a potentially unreliable exploitation of booleans. See this question for more details.
Post-edit: Jezrael's answer is more complete, as it uses the same variable names. But for a simple case like this, regarding regex vs. apply with in, I validated his claim, and my presumption.
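For reference, a sketch of how such a timing comparison could be run (hypothetical, using Python's timeit on the small example frame):

import timeit

time_in = timeit.timeit(
    lambda: training['concat'].apply(lambda x: 'pa' in x).sum(), number=1000)
time_regex = timeit.timeit(
    lambda: training['concat'].str.contains('pa').sum(), number=1000)
print(time_in, time_regex)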