I have just started working with pandas. Currently I'm working on a Netflix dataset.
In this dataset I want to add a new column that contains the total number of cast members for each particular movie or TV show. I can count the cast individually, but I want to calculate it for all of them. Can someone help me write this code?
Here is what I'm trying to do:
Link: https://www.kaggle.com/datasets/shivamb/netflix-shows?
def set_cast(val):
    if val is None:
        return 0
    if val == 'None':
        return 0
    return len(val.split(', '))

data['num_of_cast'] = data['cast'].apply(set_cast)
I'm getting this error:
return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
AttributeError: 'float' object has no attribute 'split'
In your dataset the cast column contains missing values, which pandas stores as NaN. NaN is a float, which is why you get 'float' object has no attribute 'split'. Handle the missing values inside the function instead of checking for None:

def set_cast(val):
    if pd.isna(val) or val == 'None':  # missing cast lists are NaN, a float
        return 0
    return len(val.split(', '))

data['num_of_cast'] = data['cast'].apply(set_cast)
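A vectorized alternative (a sketch: str.split leaves NaN rows as NaN, str.len then yields NaN for them, and fillna turns those into 0):

data['num_of_cast'] = data['cast'].str.split(', ').str.len().fillna(0).astype(int)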
I have an org.apache.spark.sql.DataFrame and I would like to convert it into a column: org.apache.spark.sql.Column.
So basically, this is my dataframe:

val filled_column2 = x.select(first("col1", ignoreNulls = true).over(window))

and that is what I want to convert into a Spark SQL column. Could anyone help with that?
Thank you,
@Jaime Caffarel: this is exactly what I am trying to do; this should give you more visibility. You may also check the error message in the second screenshot.
From the documentation of the class org.apache.spark.sql.Column:

A column that will be computed based on the data in a DataFrame. A new column is constructed based on the input columns present in a dataframe:

df("columnName")          // On a specific DataFrame.
col("columnName")         // A generic column not yet associated with a DataFrame.
col("columnName.field")   // Extracting a struct field.
col("a.column.with.dots") // Escape . in column names.
$"columnName"             // Scala short hand for a named column.
expr("a + 1")             // A column that is constructed from a parsed SQL Expression.
lit("abc")                // A column that produces a literal (constant) value.
If filled_column2 is a DataFrame, you could do:
filled_column2("col1")
******** EDITED AFTER CLARIFICATION ************
Ok, it seems to me that what you are trying to do is a JOIN operation. Assuming that product_id is a unique key for each row, I would do something like this:
val filled_column = df.select(df("product_id"), last("last_prev_week_nopromo", ignoreNulls = true).over(window))
This way, you are also selecting the product_id that you will use as the key. Then you can do the following:
val promo_txn_cnt_seas_df2 = promo_txn_cnt_seas_df1
  .join(filled_column, promo_txn_cnt_seas_df1("product_id") === filled_column("product_id"), "inner")
// orderBy("product_id", "week")... (the rest of the operations)
Is this what you are trying to achieve?
def locate(code):
    string1 = str(code)
    floor = string1[3]
    if floor == '1':
        return 'Ground Floor'
    else:
        if int(string1[5]) < 1:
            lobby = 'G'
        elif int(string1[5]) < 2:
            lobby = 'F'
        else:
            lobby = 'E'
        return floor + lobby

print(locate('S191009'))
print(locate('S087525'))
This function works fine with individual input codes, as above, with the output:
Ground Floor
7E
But when I use it to map a series in a data frame, it raises an error.
error_data1['location'] = error_data1['status'].map(locate)
Error message: string index out of range.
How can I fix this?
Your problem is with your series values:

se = pd.Series(['S191009', 'rt'])
se.map(locate)

produces the same error you reported. You can ignore these rows by using try...except in the function, if that does not hurt you.
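For instance, a minimal sketch of that approach (locate_safe is a hypothetical wrapper name; it returns None for codes the function cannot parse):

def locate_safe(code):
    try:
        return locate(code)
    except (IndexError, ValueError):  # string too short, or a non-digit where a digit is expected
        return None

error_data1['location'] = error_data1['status'].map(locate_safe)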
The problem is you are indexing a position in the string that doesn't exist (i.e. the string is shorter than you expect). As the other answer mentioned, if you try

my_string = "foo"
print(my_string[5])

you will get the same error. To solve this you should add a try...except statement or, for simplicity, an initial if statement that returns 'NotValid' or something like that. Your data probably has strings that do not follow the standard form you expect.
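A minimal sketch of that guard, assuming a valid code needs at least six characters (index 5 is the highest position the function reads):

def locate(code):
    string1 = str(code)
    if len(string1) < 6:  # too short to read positions 3 and 5
        return 'NotValid'
    floor = string1[3]
    if floor == '1':
        return 'Ground Floor'
    if int(string1[5]) < 1:
        lobby = 'G'
    elif int(string1[5]) < 2:
        lobby = 'F'
    else:
        lobby = 'E'
    return floor + lobby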
I am working with several CSVs whose first N columns are general information and whose next M columns (M is large) each refer to a date.
I need to convert just the names of columns N+1 through N+M-1 to date format.
I tried this, where N+1 = 5 (no matter what M is, I assume I can use -1 so the last column name is not affected):
ContDiarios.columns[5:-1] = pd.to_datetime(ContDiarios.columns[5:-1])
but I get the following error:
TypeError: Index does not support mutable operations
The way you are doing it is not feasible: a pandas Index is immutable, so you cannot assign to a slice of it. Try this instead:

def convert(x):
    try:
        return pd.to_datetime(x)
    except (ValueError, TypeError):  # leave non-date column names unchanged
        return x

ContDiarios.columns = [convert(c) for c in ContDiarios.columns]
Or you can use the df.rename method to convert them.
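A sketch of the rename route, converting only the date columns (assuming, as in the question, that they start at position 5 and the last column is not a date):

date_cols = ContDiarios.columns[5:-1]
ContDiarios = ContDiarios.rename(columns={c: pd.to_datetime(c) for c in date_cols})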
I have the 'Field_Type' column filled with strings and I want to derive the values in the 'Units' column using an if statement.
So Units shows the desired result. Essentially I want to call out what type of activity is occurring.
I tried to do this using the code below, but it won't run (see the error below). Any help is greatly appreciated!
create_table['Units'] = pd.np.where(create_table['Field_Name'].str.startswith("W"), "MW",
                        pd.np.where(create_table['Field_Name'].str.contains("R"), "MVar",
                        pd.np.where(create_table['Field_Name'].str.contains("V"), "Per Unit")))
ValueError: either both or neither of x and y should be given
You can write a function to define your conditionals, then use apply on the dataframe and pass the function:
def unit_mapper(row):
    if row['Field_Type'].startswith('W'):
        return 'MW'
    elif 'R' in row['Field_Type']:
        return 'MVar'
    elif 'V' in row['Field_Type']:
        return 'Per Unit'
    else:
        return 'N/A'
And then
create_table['Units'] = create_table.apply(unit_mapper, axis=1)
In your text you talk about Field_Type, but you are using Field_Name in your example. Which one is correct?
You want to do something like:

create_table.loc[create_table['Field_Type'].str.startswith('W'), 'Units'] = 'MW'
create_table.loc[create_table['Field_Type'].str.startswith('R'), 'Units'] = 'MVar'
create_table.loc[create_table['Field_Type'].str.startswith('V'), 'Units'] = 'Per Unit'
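For reference, the ValueError appears because the innermost pd.np.where call passes a condition and an x value but no y; np.where needs either both branches or neither. A sketch of the same logic with np.select, which takes a default instead (assuming the conditions should test Field_Type, as the question's text says):

import numpy as np

conditions = [
    create_table['Field_Type'].str.startswith('W'),
    create_table['Field_Type'].str.contains('R'),
    create_table['Field_Type'].str.contains('V'),
]
choices = ['MW', 'MVar', 'Per Unit']

# the default covers rows that match no condition
create_table['Units'] = np.select(conditions, choices, default='N/A')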
I'm very confused about the apply function in pandas. I have a big dataframe where one column is a column of strings. I'm then using a function to count part-of-speech occurrences. I'm just not sure how to set up my apply statement or my function.
def noun_count(row):
    x = tagger(df['string'][row].split())
    # array flattening and filtering out all but nouns, then summing them
    return num
So basically I have a function similar to the above, where I use a POS tagger on a column and it outputs a single number (the number of nouns). I may possibly rewrite it to output multiple numbers for different parts of speech, but I can't wrap my head around apply.
I'm pretty sure I don't really have either part arranged correctly. For instance, I can run noun_count(row) and get the correct value for any index, but I can't figure out how to make it work with apply as I have it set up. Basically I don't know how to pass the row value to the function within the apply statement.

df['num_nouns'] = df.apply(noun_count(??), 1)
Sorry this question is all over the place. So what can I do to get a simple result like
      string  num_nouns
0      'cat'          1
1 'two cats'          1
EDIT:
So I've managed to get something working (someone posted an answer, but they've deleted it):

df['num_nouns'] = df['string'].apply(tagger_nouns)
which required an adjustment to my function:

def tagger_nouns(x):
    list_of_lists = st.tag(x.split())
    flattened = [y for z in list_of_lists for y in z]
    parts_of_speech = [row[1] for row in flattened]
    c = Counter(parts_of_speech)
    nouns = c['NN'] + c['NNS'] + c['NNP'] + c['NNPS']
    return nouns
I'm using the Stanford tagger with the left 3 words model, but I have a big problem with computation time. I'm noticing that it's calling the .jar file again and again (Java keeps opening and closing in the task manager); maybe that's unavoidable, but it's really taking far too long to run. Is there any way I can speed it up?
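If st is NLTK's StanfordPOSTagger, the repeated Java launches likely come from calling .tag once per row. A sketch that batches every row through a single tag_sents call instead (the noun counting mirrors the function above):

tagged = st.tag_sents(df['string'].str.split().tolist())  # one JVM launch for all rows
df['num_nouns'] = [
    sum(1 for _, tag in sent if tag.startswith('NN'))  # counts NN, NNS, NNP, NNPS
    for sent in tagged
]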
I don't know what 'tagger' is but here's a simple example with a word count that ought to work more or less the same way:
f = lambda x: len(x.split())
df['num_words'] = df['string'].apply(f)
      string  num_words
0      'cat'          1
1 'two cats'          2