Got TypeError: string indices must be integers with .apply [duplicate] - pandas

I have a dataframe; one column is a URL, the other is a name. I'm simply trying to add a third column that takes the URL and creates an HTML link.
The column newsSource has the link name, and url has the URL. For each row in the dataframe, I want to create a column that has:
<a href="[url]">[newsSource name]</a>
Trying the below throws the error:
File "C:\Users\AwesomeMan\Documents\Python\MISC\News Alerts\simple_news.py", line 254, in <module>
    df['sourceURL'] = df['url'].apply(lambda x: '<a href="{0}">{1}</a>'.format(x, x[0]['newsSource']))
TypeError: string indices must be integers
df['sourceURL'] = df['url'].apply(lambda x: '<a href="{0}">{1}</a>'.format(x, x['source']))
But I've used x[colName] before? The line below works fine; it simply creates a column of the source's name:
df['newsSource'] = df['source'].apply(lambda x: x['name'])
Why suddenly ("suddenly" to me) is it saying I can't access the indices?

pd.Series.apply has access only to a single series, i.e. the series on which you are calling the method. In other words, the function you supply, irrespective of whether it is named or an anonymous lambda, will only have access to df['source'].
To access multiple series by row, you need pd.DataFrame.apply along axis=1:
def return_link(x):
    return '<a href="{0}">{1}</a>'.format(x['url'], x['source'])

df['sourceURL'] = df.apply(return_link, axis=1)
Note there is an overhead associated with passing an entire series in this way; pd.DataFrame.apply is just a thinly veiled, inefficient loop.
You may find a list comprehension more efficient:
df['sourceURL'] = ['<a href="{0}">{1}</a>'.format(i, j)
                   for i, j in zip(df['url'], df['source'])]
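As a rough illustration of that overhead, here is a minimal timing sketch (the exact numbers will vary with your data, machine, and pandas version; the 10,000-row frame is made up for the comparison):

import timeit
import pandas as pd

df = pd.DataFrame({'source': ['BBC'] * 10000,
                   'url': ['http://www.bbc.co.uk'] * 10000})

t_apply = timeit.timeit(
    lambda: df.apply(lambda x: '<a href="{0}">{1}</a>'.format(x['url'], x['source']), axis=1),
    number=10)
t_listcomp = timeit.timeit(
    lambda: ['<a href="{0}">{1}</a>'.format(i, j) for i, j in zip(df['url'], df['source'])],
    number=10)
print(t_apply, t_listcomp)  # the list comprehension is usually several times faster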
Here's a working demo:
df = pd.DataFrame([['BBC', 'http://www.bbc.co.uk']],
                  columns=['source', 'url'])

def return_link(x):
    return '<a href="{0}">{1}</a>'.format(x['url'], x['source'])

df['sourceURL'] = df.apply(return_link, axis=1)
print(df)

  source                   url                                sourceURL
0    BBC  http://www.bbc.co.uk  <a href="http://www.bbc.co.uk">BBC</a>

With zip and old-school %-style string formatting:
df['sourceURL'] = ['<a href="%s">%s</a>' % (x, y) for x, y in zip(df['url'], df['source'])]
And with an f-string (Python 3.6+):
df['sourceURL'] = [f'<a href="{x}">{y}</a>' for x, y in zip(df['url'], df['source'])]

Related

Pandas - Setting column value, based on a function that runs on another column

I have been all over the place trying to get this to work (new to data science). It's obviously because I don't fully understand how pandas data structures work.
I have this code:
def getSearchedValue(identifier):
    full_str = anedf["Diskret data"].astype(str)
    value = ""
    if full_str.str.find(identifier) <= -1:
        start_index = full_str.str.find(identifier) + len(identifier) + 1
        end_index = full_str[start_index:].find("|") + start_index
        value = full_str[start_index:end_index].astype(str)
    return value

for col in anedf.columns:
    if col.count("#") > 0:
        anedf[col] = getSearchedValue(col)
What I'm trying to do is iterate over my columns. I have around 260 in my dataframe. If a column name contains the character #, the code should try to fill that column's values based on what's in my "Diskret data" column.
Data in the "Diskret data" column is quite messy, but it has the form CCC#111~VALUE|DDD#222~VALUE| and so on, until there are no more identifier+value pairs. Not all identifiers are present in each row, and they come in no specific order.
The function works if I run it with hard-coded strings in a regular Python script. But with the dataframe I get various errors like:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Input In [119], in <cell line: 12>()
12 for col in anedf.columns:
13 if col.count("#") > 0:
---> 14 anedf[col] = getSearchedValue(col)
Input In [119], in getSearchedValue(identifier)
4 full_str = anedf["Diskret data"].astype(str)
5 value=""
----> 6 if full_str.str.find(identifier) <= -1:
7 start_index = full_str.str.find(identifier)+len(identifier)+1
8 end_index = full_str[start_index:].find("|")+start_index
I guess this is because the comparison evaluates against all rows at once (a Series), which obviously yields a mix of True and False. But how can I make the evaluation and assignment work row by row, like this:
Diskret data              CCC#111                                                          JJSDJ#1234
CCC#111~1|BBB#2323~2234   1 (copied from "Diskret data")                                   0
JJSDJ#1234~Heart attack   0 (or skipped, since the row has no value for this identifier)   Heart attack
The plan is to drop the "Diskret data" column when the assignment is done, so I have the data in a more structured way.
--- Update ---
By request, I have included a picture of how I visualize the problem, and what I seemingly can't make it do.
[Image: Problem visualisation]
With regex you could do something like:
def map_(list_) -> pd.Series:
    if list_:
        idx, values = zip(*list_)
        return pd.Series(values, idx)
    else:
        return pd.Series(dtype=object)

series = pd.Series(
    ['CCC#111~1|BBB#2323~2234', 'JJSDJ#1234~Heart attack']
)
reg_series = series.str.findall(r'([^~|]+)~([^~|]+)')
reg_series.apply(map_)
Breaking this down:
Create a new series by running a map on each row that turns your long string into a list of tuples.
reg_series = series.str.findall(r'([^~|]+)~([^~|]+)')
reg_series
# output:
# 0 [(CCC#111, 1), (BBB#2323, 2234)]
# 1 [(JJSDJ#1234, Heart attack)]
Then we create a map_ function. This function takes each row of reg_series and unpacks it into two tuples: the "keys" and the "values". We then build a Series with the keys as the index and the values as the values.
Edit: We added an if/else statement that checks whether the list is non-empty. If it is empty, we return an empty Series of dtype object.
def map_(list_) -> pd.Series:
    if list_:
        idx, values = zip(*list_)
        return pd.Series(values, idx)
    else:
        return pd.Series(dtype=object)

...

print(idx, values)  # first row
# output:
# ('CCC#111', 'BBB#2323') ('1', '2234')
Finally we run apply on the series to create a dataframe that takes the outputs from map_ for each row and zips them together in columnar format.
reg_series.apply(map_)
# output:
#   CCC#111 BBB#2323    JJSDJ#1234
# 0       1     2234           NaN
# 1     NaN      NaN  Heart attack
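If you then want to attach these new columns to your real frame and drop "Diskret data" as planned, here is a minimal sketch (assuming the frame is called anedf as in the question, and map_ is defined as above):

parsed = anedf['Diskret data'].astype(str).str.findall(r'([^~|]+)~([^~|]+)').apply(map_)
anedf = pd.concat([anedf.drop(columns=['Diskret data']), parsed], axis=1)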

Use DataFrame row.index as input to lambda function result

I have a large dataframe, df_vol.
It has about 20 columns and 500k rows.
In the column named "FTID" three of the values are "###". Other than those three instances every other value in the "FTID" column is unique.
I want to search for and change each instance of "###" to be unique.
Either of these two options would be acceptable:
"###1", "###2", "###3", or
"###" + str(row_index) for each i.e. concatenate "###" with the row index
The code I have tried is:
df_vol["FTID"] = df_vol["FTID"].apply(lambda x: "###" if x == "###" else None)
I know the above code doesn't actually change anything, but I don't know how to make it pull only the row index or use an incremental number. I have tried so many different things but I'm a noob, and I'm stabbing in the dark.
It seems to me it should look like:
df_vol["FTID"] = df_vol["FTID"].apply(lambda x: "###" + df_vol.index.astype(str) if x == "###" else None)
but what little success I have had just returns something like this for the new values:
Int64Index([ 423, 424, 425, 426, 427, 428, 429, 430,
Going to go collect up all my hair now and see if I can glue it back to my head ;)
You can access the index with x.name. I think you need something like:
df_vol["FTID"] = df_vol["FTID"].apply(lambda x: f"###{x.name}" if x == "###" else x)
(I didn't get why you would otherwise set the value to None since other values are unique... I think it should be unchanged when not equal to ###)
Edit: apply works slightly differently when used on a Series vs. a DataFrame.
In your case it would be best to create a function and apply it on your whole dataframe:
def myfunc(row):
    if row['FTID'] == "###":
        row['FTID'] = f"###{row.name}"
    return row

df_vol = df_vol.apply(myfunc, axis=1)
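To see the difference directly, a small sketch: Series.apply passes each scalar value, which has no .name attribute, while DataFrame.apply with axis=1 passes each row as a Series whose .name is that row's index label.

import pandas as pd

df = pd.DataFrame({'FTID': ['abc', '###']})
print(df['FTID'].apply(lambda x: type(x).__name__))  # str, str: plain values, no .name
print(df.apply(lambda row: row.name, axis=1))        # 0, 1: the row index labels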

Selecting two sets of columns from a dataFrame with all rows

I have a DataFrame with 28 columns (features) and 600 rows (instances). I want to select all rows, but only columns 0-12 and 16-27, meaning that I don't want to select columns 13-15.
I wrote the following code, but it throws a syntax error at the : in 0:12 and 16:. Can someone help me understand why?
X = df.iloc[:,[0:12,16:]]
I know there are other ways for selecting these rows, but I am curious to learn why this one does not work, and how I should write it to work (if there is a way).
For now, I have written it as:
X = df.iloc[:,0:12]
X = X + df.iloc[:,16:]
This seems to return an incorrect result: I had already treated the NaN values of df, but when I use this code, X contains lots of NaNs!
Thanks for your feedback in advance.
You can use np.r_ to concatenate the slices:
x = df.iloc[:, np.r_[0:13, 16:28]]
iloc has these allowed inputs (from the docs; a short sketch of a few of these forms follows the list):
An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.
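For example, a quick sketch of a few of these forms, using the 28-column df from the question:

df.iloc[5]                    # an integer: the sixth row
df.iloc[:, [4, 3, 0]]         # a list of integers: three columns, in that order
df.iloc[:, 1:7]               # a slice with ints: columns 1-6
df.iloc[:, lambda d: [0, 2]]  # a callable returning a valid indexer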
What you're passing to iloc in X = df.iloc[:,[0:12,16:]] isn't a valid expression at all: slice syntax like 0:12 is only allowed directly inside the square brackets of an indexing operation, not inside a list literal, which is why Python raises a SyntaxError. np.r_ accepts slices inside its own square brackets and translates them into a single array of integers, which is a valid input for iloc.
X = df.iloc[:, np.r_[0:13, 16:28]]
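To see what np.r_ builds from those slices, a quick check:

import numpy as np

print(np.r_[0:13, 16:28])
# [ 0  1  2  3  4  5  6  7  8  9 10 11 12 16 17 18 19 20 21 22 23 24 25 26 27]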

Python3.4 Pandas DataFrame from function

I wrote a function that outputs selected data from a parsing function. I am trying to put this information into a DataFrame using pandas.DataFrame, but I am having trouble.
The headers are listed below, as well as the head of the function's data output.
QUESTION
How will I be able to place the function output within the pandas DataFrame so that the headers are linked to the output?
--TICK---------NI----------CAPEXP----------GW---------------OE---------------RE-------
OUTPUT
['MMM', ['4,956,000'], ['(1,493,000)'], ['7,050,000'], ['13,109,000'], ['34,317,000']]
['ABT', ['2,284,000'], ['(1,077,000)'], ['10,067,000'], ['21,526,000'], ['22,874,000']]
['ABBV', ['1,774,000'], ['(612,000)'], ['5,862,000'], ['1,742,000'], ['535,000']]
- Loop through each item (I'm assuming data is a list with each element being one of the lists shown above)
- Take the first element as the ticker and convert the rest into numbers, using translate to undo the string formatting
- Make a DataFrame per row, then concat them all at the end and transpose
- Set the columns by parsing the header string (I've called it headers)
dflist = list()
for x in data:
    h = x[0]
    rest = [float(z[0].translate(str.maketrans('(', '-', '),'))) for z in x[1:]]
    dflist.append(pd.DataFrame([h] + rest))
df = pd.concat(dflist, axis=1).T
df.columns = [x for x in headers.split('-') if len(x) > 0]
But this might be a bit slow - would be easier if you could get your input into a more consistent format.
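To illustrate what the translate step does to one of the formatted strings, a minimal sketch:

s = '(1,493,000)'
cleaned = s.translate(str.maketrans('(', '-', '),'))  # map '(' to '-', delete ')' and ','
print(cleaned)         # -1493000
print(float(cleaned))  # -1493000.0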

Extracting value and creating new column out of it

I would like to extract a certain section of a URL residing in a column of a Pandas DataFrame and make it a new column. This
ref = df['REFERRERURL']
ref.str.findall("\\d\\d\\/(.*?)(;|\\?)",flags=re.IGNORECASE)
returns me a Series with tuples in it. How can I take out only one part of that tuple before the Series is created, so I can simply turn that into a column? Sample data for referrerurl is
http://wap.blah.com/xxx/id/11/someproduct_step2;jsessionid=....
In this example I am interested in creating a column that only has 'someproduct_step2' in it.
Thanks,
In [25]: df = DataFrame([['http://wap.blah.com/xxx/id/11/someproduct_step2;jsessionid=....']], columns=['A'])

In [26]: df['A'].str.findall("\\d\\d\\/(.*?)(;|\\?)", flags=re.IGNORECASE).apply(lambda x: Series(x[0][0], index=['first']))
Out[26]:
               first
0  someproduct_step2
In 0.11.1, here is a neat way of doing this as well:
In [34]: df.replace({'A': "http:.+\d\d\/(.*?)(;|\\?).*$"}, {'A': r'\1'}, regex=True)
Out[34]:
                   A
0  someproduct_step2
This also worked:
def extract(x):
    res = re.findall("\\d\\d\\/(.*?)(;|\\?)", x)
    if res:
        return res[0][0]

session['RU_2'] = session['REFERRERURL'].apply(extract)
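If you only need the first capture group, str.extract returns it directly as a column, with no tuples to unpack. A sketch, assuming a reasonably modern pandas (expand was added in 0.18); the second group is made non-capturing so only the wanted part is kept:

import re

session['RU_2'] = session['REFERRERURL'].str.extract(r'\d\d/(.*?)(?:;|\?)', flags=re.IGNORECASE, expand=False)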