Function giving error when run on the same dataframe more than once - pandas

The function works fine the first time, but when run again on the same df it gives me this error:
IndexError: single positional indexer is out-of-bounds
def update_data(df):
    df.drop(df.columns[[-1, -2, -3]], axis=1, inplace=True)
    df.loc['Total'] = df.sum()
    df.iloc[-1, 0] = 'Group'
    df = df.set_index(list(df)[0])
    for i in range(1, 21):
        df.iloc[-1, i] = 100 + (100 * (
            (df.iloc[-1, i] - df.iloc[-1, 0]) / abs(df.iloc[-1, 0])))
    df.iloc[-1, 0] = 100
    xax = list(df.columns.values)
    yax = df.values[-1].tolist()
    d = {'period': xax, 'level': yax}
    index_level = pd.DataFrame(d)
    index_level['level'] = index_level['level'].round(3)
    return index_level

Using inplace=True in a function mutates the input data frame. Of course it doesn't work the second time: your function presumes the data is in some format at the start, and the first call breaks that assumption.
df = pd.DataFrame([{'x': 0}])

def change(df):
    df.drop(columns=['x'], inplace=True)
    return len(df)

change(df)
Out[346]: 1

df
Out[347]:
Empty DataFrame
Columns: []
Index: [0]
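One way to make such a function safe to call repeatedly is to copy its input up front and avoid inplace operations, so every call starts from the same data. A minimal sketch of that pattern, applied to the toy change function above (the same fix applies to update_data):

def change(df):
    df = df.copy()               # work on a copy; the caller's frame is untouched
    df = df.drop(columns=['x'])  # no inplace=True, so no hidden mutation
    return len(df)

df = pd.DataFrame([{'x': 0}])
change(df)  # 1
change(df)  # still 1 -- df keeps its 'x' column, so the second call works too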

How do I drop columns in a pandas dataframe that exist in another dataframe?

How do I drop columns in raw_clin if the same columns already exist in raw_clinical_sample? Using isin raised a "cannot compute isin with a duplicate axis" error.
Explanation of the code:
I want to merge raw_clinical_patient and raw_clinical_sample dataframes. However, the SAMPLE_ID column in raw_clinical_sample should be relabeled as PATIENT_ID before the merge (because it was wrongly labelled). I want the new PATIENT_ID to be the index of raw_clin.
import pandas as pd
# Clinical patient info
raw_clinical_patient = pd.read_csv("./gbm_tcga/data_clinical_patient.txt", sep="\t", header=4)
raw_clinical_patient["PATIENT_ID"] = raw_clinical_patient["PATIENT_ID"].replace()
raw_clinical_patient.set_index("PATIENT_ID", inplace=True)
raw_clinical_patient = raw_clinical_patient.sort_index()
# Clinical sample info
raw_clinical_sample = pd.read_csv("./gbm_tcga/data_clinical_sample.txt", sep="\t", header=4)
raw_clinical_sample.set_index("PATIENT_ID", inplace=True)
raw_clinical_sample = raw_clinical_sample[raw_clinical_sample.index.isin(raw_clinical_patient.index)]
# Get the actual patient ID from the `raw_clinical_sample` dataframe
# Drop "PATIENT_ID" and rename "SAMPLE_ID" as "PATIENT_ID" and set as index
raw_clin = raw_clinical_patient.merge(raw_clinical_sample, on="PATIENT_ID", how="left").reset_index().drop(["PATIENT_ID"], axis=1)
raw_clin.rename(columns={'SAMPLE_ID':'PATIENT_ID'}, inplace=True)
raw_clin.set_index('PATIENT_ID', inplace=True)
Now I want to drop all the columns that came from raw_clinical_sample, since the only columns needed from it were PATIENT_ID and SAMPLE_ID.
# Drop columns that exist in `raw_clinical_sample`
raw_clin = raw_clin[~raw_clin.isin(raw_clinical_sample)]
Traceback:
ValueError                                Traceback (most recent call last)
<ipython-input-60-45e2e83ddc00> in <module>()
     18
     19 # Drop columns that exist in `raw_clinical_sample`
---> 20 raw_clin = raw_clin[~raw_clin.isin(raw_clinical_sample)]

/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in isin(self, values)
  10514         elif isinstance(values, DataFrame):
  10515             if not (values.columns.is_unique and values.index.is_unique):
> 10516                 raise ValueError("cannot compute isin with a duplicate axis.")
  10517             return self.eq(values.reindex_like(self))
  10518         else:

ValueError: cannot compute isin with a duplicate axis.
There are several ways to do this.
For example, using isin on the columns:
new_df1 = df1.loc[:, ~df1.columns.isin(df2.columns)]
or with drop:
new_df1 = df1.drop(columns=df1.columns.intersection(df2.columns))
Example input:
df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(columns=['B', 'E'])
Output:
pd.DataFrame(columns=['A', 'C', 'D'])
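Applied to the frames in the question, the drop variant would look like this (assuming raw_clin and raw_clinical_sample as built above):

# drop every column of raw_clin that also appears in raw_clinical_sample
overlap = raw_clin.columns.intersection(raw_clinical_sample.columns)
raw_clin = raw_clin.drop(columns=overlap)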
You can use set operations for your application like this:
df1 = pd.DataFrame()
df1['string'] = ['Hello', 'Hi', 'Hola']
df1['number'] = [1, 2, 3]
df2 = pd.DataFrame()
df2['string'] = ['Hello', 'Hola']
df2['number'] = [1, 5]
ds1 = set(map(tuple, df1.values))
ds2 = set(map(tuple, df2.values))
df_out = pd.DataFrame(list(ds1.difference(ds2)))
df_out.columns = df1.columns
print(df_out)
Output:
string number
0 Hola 3
1 Hi 2
Inspired by: https://stackoverflow.com/a/18184990/7509907
Edit:
Sorry, I didn't notice you need to drop the columns. For that, you can use the following (using mozway's dummy example):
df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(columns=['B', 'E'])
ds1 = set(df1.columns)
ds2 = set(df2.columns)
cols = ds1.difference(ds2)
df = df1[list(cols)]  # index with a list; pandas does not accept a set here
print(df)
Output:
Empty DataFrame
Columns: [C, A, D]
Index: []
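Note that Python sets do not preserve order (hence the [C, A, D] above). If the original column order matters, a plain comprehension keeps it:

keep = [c for c in df1.columns if c not in set(df2.columns)]
df = df1[keep]  # columns stay in df1's order: A, C, D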

groupby with transform minmax

For every city, I want to create a new column which is a min-max scaling of another column (age).
I tried this and get: Input contains infinity or a value too large for dtype('float64').
from sklearn import preprocessing

cols = ['age']

def f(x):
    scaler1 = preprocessing.MinMaxScaler()
    x[['age_minmax']] = scaler1.fit_transform(x[cols])
    return x

df = df.groupby(['city']).apply(f)
From the comments:
df['age'].replace([np.inf, -np.inf], np.nan, inplace=True)
Or
df['age'] = df['age'].replace([np.inf, -np.inf], np.nan)
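Putting it together as a runnable sketch (with made-up data), cleaning the infinities first and using an equivalent groupby/transform min-max that avoids calling sklearn inside apply:

import numpy as np
import pandas as pd

# made-up example data
df = pd.DataFrame({
    'city': ['a', 'a', 'a', 'b', 'b'],
    'age':  [20.0, 30.0, np.inf, 25.0, 45.0],
})

# MinMaxScaler cannot handle inf, so convert it to NaN first
df['age'] = df['age'].replace([np.inf, -np.inf], np.nan)

# per-city min-max via transform: (x - min) / (max - min)
grp = df.groupby('city')['age']
df['age_minmax'] = (df['age'] - grp.transform('min')) / (grp.transform('max') - grp.transform('min'))
print(df)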

What is the most common way to convert a Timestamp to an int?

I tried multiple ways, but they always give me an error. The most common error I get is:
AttributeError: 'Timestamp' object has no attribute 'astype'
Here is the line where I try to convert my element:
df.index.map(lambda x: x - oneSec if (pandas.to_datetime(x).astype(int) / 10**9) % 1 else x)
I tried x.astype(int) and x.to_datetime().astype(int).
I think it is necessary to use Index.where here:
df = pd.DataFrame(index=(['2019-01-10 15:00:00','2019-01-10 15:00:01']))
df.index = pd.to_datetime(df.index)
mask = df.index.second == 1
print (mask)
[False True]
df.index = df.index.where(~mask, df.index - pd.Timedelta(1, unit='s'))
print (df)
Empty DataFrame
Columns: []
Index: [2019-01-10 15:00:00, 2019-01-10 15:00:00]
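As for the original goal of turning timestamps into ints: the scalar Timestamp has no astype, but it exposes integer views directly, and a whole DatetimeIndex can be cast at once. A small sketch of the options (hedged; exact behavior varies a little across pandas versions):

import pandas as pd

ts = pd.Timestamp('2019-01-10 15:00:01')
print(ts.value)             # nanoseconds since the epoch, as a plain int
print(int(ts.timestamp()))  # seconds since the epoch

idx = pd.to_datetime(['2019-01-10 15:00:00', '2019-01-10 15:00:01'])
print(idx.astype('int64'))  # the whole index as int64 nanoseconds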

Finding top 3 dominant topics for LDA topic model

I am creating a data table via this LDA modeling tutorial (https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/), and instead of just finding the single most dominant topic, I want to expand it to find the top 3 most dominant topics, along with each of their percent contributions and topic keywords.
To do that, is it best to create two additional functions to build three separate dataframes and append each of the results? Or is there a simpler way to modify the format_topics_sentences function to pull the top 3 topics from the enumerated bag-of-words corpus?
def format_topics_sentences(ldamodel=None, corpus=corpus, texts=data):
    # Init output
    sent_topics_df = pd.DataFrame()

    # Get main topic in each document
    for i, row_list in enumerate(ldamodel[corpus]):
        row = row_list[0] if ldamodel.per_word_topics else row_list
        # print(row)
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic, 4), topic_keywords]), ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df
df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data_ready)
# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
df_dominant_topic.head(10)
[table output]
I had a similar requirement in a recent project; hopefully this helps you out. You will need to add topic keywords to the code below:
topics_df1 = pd.DataFrame()
topics_df2 = pd.DataFrame()
topics_df3 = pd.DataFrame()

for i, row_list in enumerate(lda_model[corpus]):
    row = row_list[0] if lda_model.per_word_topics else row_list
    row = sorted(row, key=lambda x: (x[1]), reverse=True)
    for j, (topic_num, prop_topic) in enumerate(row):
        if len(row) >= 3:
            if j == 0:
                topics_df1 = topics_df1.append(pd.Series([int(topic_num), prop_topic]), ignore_index=True)
            elif j == 1:
                topics_df2 = topics_df2.append(pd.Series([int(topic_num), prop_topic]), ignore_index=True)
            elif j == 2:
                topics_df3 = topics_df3.append(pd.Series([int(topic_num), prop_topic]), ignore_index=True)
            else:
                break
        elif len(row) == 2:
            if j == 0:
                topics_df1 = topics_df1.append(pd.Series([int(topic_num), prop_topic]), ignore_index=True)
            elif j == 1:
                topics_df2 = topics_df2.append(pd.Series([int(topic_num), prop_topic]), ignore_index=True)
                topics_df3 = topics_df3.append(pd.Series(['-', '-']), ignore_index=True)
        elif len(row) == 1:
            topics_df1 = topics_df1.append(pd.Series([int(topic_num), prop_topic]), ignore_index=True)
            topics_df2 = topics_df2.append(pd.Series(['-', '-']), ignore_index=True)
            topics_df3 = topics_df3.append(pd.Series(['-', '-']), ignore_index=True)

topics_df1.rename(columns={0: '1st Topic', 1: '1st Topic Contribution'}, inplace=True)
topics_df2.rename(columns={0: '2nd Topic', 1: '2nd Topic Contribution'}, inplace=True)
topics_df3.rename(columns={0: '3rd Topic', 1: '3rd Topic Contribution'}, inplace=True)

topics_comb = pd.concat([topics_df1, topics_df2, topics_df3], axis=1, sort=False)
# Join topics dataframe to original data
new_df = pd.concat([data_ready, topics_comb], axis=1, sort=False)
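One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the growing-frame pattern above fails. A sketch of the same idea that collects plain tuples and builds each frame once (shown for topics_df1 only):

rows1 = []  # (topic_num, contribution) tuples instead of repeated DataFrame.append
for i, row_list in enumerate(lda_model[corpus]):
    row = row_list[0] if lda_model.per_word_topics else row_list
    row = sorted(row, key=lambda x: x[1], reverse=True)
    topic_num, prop_topic = row[0]  # dominant topic for this document
    rows1.append((int(topic_num), prop_topic))

topics_df1 = pd.DataFrame(rows1, columns=['1st Topic', '1st Topic Contribution'])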

Why am I returned an object when using std() in Pandas?

The print for the average of the spreads comes out grouped and calculated right. Why do I get the following returned as the result for the std_deviation column instead of the standard deviation of the spread grouped by ticker?
pandas.core.groupby.SeriesGroupBy object at 0x000000000484A588
df = pd.read_csv('C:\\Users\\William\\Desktop\\tickdata.csv',
                 dtype={'ticker': str, 'bidPrice': np.float64, 'askPrice': np.float64, 'afterHours': str},
                 usecols=['ticker', 'bidPrice', 'askPrice', 'afterHours'],
                 nrows=3000000)
df = df[df.afterHours == "False"]
df = df[df.bidPrice != 0]
df = df[df.askPrice != 0]
df['spread'] = (df.askPrice - df.bidPrice)
df['std_deviation'] = df['spread'].std(ddof=0)
df = df.groupby(['ticker'])
print(df['std_deviation'])
print(df['spread'].mean())
UPDATE: I am no longer being returned an object, but I am now trying to figure out how to display the standard deviation by ticker:
df['spread'] = (df.askPrice - df.bidPrice)
df2 = df.groupby(['ticker'])
print(df2['spread'].mean())
df = df.set_index('ticker')
print(df['spread'].std(ddof=0))
UPDATE 2: I got the output I needed using:
df = df[df.afterHours == "False"]
df = df[df.bidPrice != 0]
df = df[df.askPrice != 0]
df['spread'] = (df.askPrice - df.bidPrice)
print(df.groupby(['ticker'])['spread'].mean())
print(df.groupby(['ticker'])['spread'].std(ddof=0))
This line:
df = df.groupby(['ticker'])
assigns df to a DataFrameGroupBy object, and
df['std_deviation']
is a SeriesGroupBy object (of the column).
It's a good idea not to "shadow" / re-assign one variable to a completely different datatype. Try to use a different variable name for the groupby!
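With distinct names, both statistics come out of a single groupby. A short sketch using the column names from the question (ddof=0 to match the population std used above):

grouped = df.groupby('ticker')['spread']
stats = grouped.agg(
    mean_spread='mean',
    std_spread=lambda s: s.std(ddof=0),  # population standard deviation per ticker
)
print(stats)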