Plotting a graph of the top 15 highest values - pandas

I am working on a dataset that shows the budget spent on movies. I want to make a plot of the 15 highest-budget movies.
# Sort the 'budget' column in descending order and store it in a new dataframe.
info = pd.DataFrame(dp['budget'].sort_values(ascending = False))
info['original_title'] = dp['original_title']
data = list(map(str,(info['original_title'])))
# Extract the top 10 budget movies from the list and dataframe.
x = list(data[:10])
y = list(info['budget'][:10])
This was the output I got:
C:\Users\Phillip\AppData\Local\Temp\ipykernel_7692\1681814737.py:2: FutureWarning: The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.
y = list(info['budget'][:5])
I'm new to data analysis, so I'm not sure how else to approach the problem.

A simple example using a movie dataset I found online:
import pandas as pd
url = "https://raw.githubusercontent.com/erajabi/Python_examples/master/movie_sample_dataset.csv"
df = pd.read_csv(url)
# Bar plot of 15 highest budgets:
df.nlargest(n=15, columns="budget").plot.bar(x="movie_title", y="budget")
You can customize your plot in various ways by adding arguments to the .bar(...) call.
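The FutureWarning in the question comes from slicing a Series that has an integer index with `[:10]`. A minimal sketch of the positional alternative, assuming a toy `dp` frame with the question's column names (the values are made up, and 3 rows stand in for 15): sort whole rows so titles stay aligned with budgets, then slice with `.iloc`.

```python
import pandas as pd

# Hypothetical stand-in for the question's `dp` dataframe (made-up values).
dp = pd.DataFrame({
    "original_title": ["A", "B", "C", "D", "E"],
    "budget": [50, 300, 120, 10, 200],
})

# Sort complete rows by budget, then take the first rows by position.
# .iloc slices by position, so no label-vs-position ambiguity arises.
top = dp.sort_values("budget", ascending=False).iloc[:3]

x = top["original_title"].tolist()
y = top["budget"].tolist()
print(x, y)
```

With the real data, `.iloc[:15]` would give the 15 rows the title asks for.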

Related

How to work with multiple `item_id`'s of varying length for timeseries prediction in autogluon?

I'm currently building a model to predict daily stock prices based on daily data for thousands of stocks. I have daily data for all stocks, but the histories have different lengths. For example, for some stocks I have daily data from 2000 to 2022, and for others from 2010 to 2022.
Many dates are also obviously repeated for all stocks.
While I was learning autogluon, I used the following function to format timeseries data so it can work with .fit():
def forward_fill_missing(ts_dataframe: TimeSeriesDataFrame, freq="D") -> TimeSeriesDataFrame:
    original_index = ts_dataframe.index.get_level_values("timestamp")
    start = original_index[0]
    end = original_index[-1]
    filled_index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
    return ts_dataframe.droplevel("item_id").reindex(filled_index, method="ffill")

ts_dataframe = ts_dataframe.groupby("item_id").apply(forward_fill_missing)
This worked, but I had only tried it on data for a single item_id, and now I have thousands.
When I use it now, I get the following error: ValueError: cannot reindex from a duplicate axis
It's important to note that I have already forward-filled my data with pandas, so the ts_dataframe shouldn't have any missing dates or values. But when I try to use it with .fit() I get the following error:
ValueError: Frequency not provided and cannot be inferred. This is often due to the time index of the data being irregularly sampled. Please ensure that the data set used has a uniform time index, or create the TimeSeriesPredictor setting ignore_time_index=True.
I assume this is because I have only filled in missing data and dates, without taking into account the varying number of days available for each individual stock.
For reference, here's how I have formatted the data with pandas:
df = pd.read_csv(
"/content/drive/MyDrive/stock_data/training_data.csv",
parse_dates=["Date"],
)
df["Date"] = pd.to_datetime(df["Date"], errors="coerce", dayfirst=True)
df.fillna(method='ffill', inplace=True)
df = df.drop("Unnamed: 0", axis=1)
df[:11]
How can I format the data so I can use it with .fit()?
Thanks!
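One possible approach, sketched in plain pandas rather than with any AutoGluon API: give each item a regular time grid spanning only its own first and last date, and forward-fill within that item. The column names `item_id`, `timestamp` and `close` and the toy values are assumptions, not taken from the question's CSV. Dropping duplicate timestamps per item is included because `reindex` raises when the index has duplicate labels, which is one plausible source of the "duplicate axis" error.

```python
import pandas as pd

# Toy long-format data; column names and values are assumptions.
df = pd.DataFrame({
    "item_id": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "timestamp": pd.to_datetime(
        ["2022-01-03", "2022-01-04", "2022-01-06", "2022-01-05", "2022-01-07"]
    ),
    "close": [10.0, 11.0, 12.0, 20.0, 21.0],
})


def fill_one_item(group: pd.DataFrame, freq: str = "D") -> pd.DataFrame:
    """Reindex one item's rows onto a regular grid over its own date range."""
    # reindex raises on duplicate index labels, so keep one row per timestamp.
    series = (
        group.drop_duplicates("timestamp", keep="last")
        .set_index("timestamp")
        .sort_index()
    )
    grid = pd.date_range(series.index.min(), series.index.max(), freq=freq)
    # New grid rows start as NaN and are then forward-filled.
    return series.reindex(grid).ffill().rename_axis("timestamp")


pieces = []
for item_id, group in df.groupby("item_id"):
    filled = fill_one_item(group[["timestamp", "close"]])
    filled.insert(0, "item_id", item_id)
    pieces.append(filled.reset_index())

regular = pd.concat(pieces, ignore_index=True)
print(regular)
```

Each item keeps its own start and end date, so histories of different lengths are not padded to a common range. For stock data, a business-day frequency such as `freq="B"` may be a better fit than `"D"`; that choice is also an assumption.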

Plotting Tweet Sentiments against dates

I have a program using tweepy that pulls tweets and their created_at date.
So far it creates a bar chart that shows the number of tweets plotted against the sentiment like so:
I am not good with matplotlib and was wondering whether it is possible to plot these three sentiments against the dates the tweets were created, so that I can see whether specific date ranges have more positive or negative tweets.
This is my pandas data frame:
This is my tweepy query:
def process_tweet(tweet, default_value=None):
    val = {}
    val['Tweet'] = tweet.full_text if tweet.full_text else default_value
    val['Date'] = tweet.created_at if tweet.created_at else default_value
    return val

q = "Tesla OR Elon Musk"
tweets = [process_tweet(tweet) for tweet in
          tweepy.Cursor(api.search, q=(q), lang="en", since="2020-04-13", tweet_mode="extended").items(100)]
df = pd.DataFrame(tweets)
I'm wondering whether I will need to edit the query to spread out the dates of the tweets being pulled, as they all seem to fall within the same day or hour.
Ideally I would like to see 3 lines plotting the Neutral, Negative and Positive value counts on the y axis against their corresponding dates on the x axis.
Advice on how to accomplish this would be appreciated!
Thank you
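A minimal sketch of the counting step, assuming the dataframe has a `Sentiment` column holding the three labels (that column name and the toy rows are assumptions; the question's dataframe is not shown). `pd.crosstab` builds a table with one row per day and one column per sentiment.

```python
import pandas as pd

# Toy data; the "Sentiment" column and all values are assumptions.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2020-04-13 09:00", "2020-04-13 15:30", "2020-04-14 08:10",
        "2020-04-14 11:45", "2020-04-14 20:00", "2020-04-15 07:05",
    ]),
    "Sentiment": ["Positive", "Negative", "Positive", "Positive", "Neutral", "Negative"],
})

# Rows: calendar day (time of day stripped); columns: sentiment; values: tweet counts.
counts = pd.crosstab(df["Date"].dt.normalize(), df["Sentiment"])
print(counts)
```

Calling `counts.plot()` on this table draws one line per sentiment with the dates on the x axis (matplotlib is needed for that step). If all tweets fall on the same day, the table has a single row and the lines collapse to points, so a finer bucket such as `df["Date"].dt.floor("h")` may be more informative.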

pandas : stacked bar plot from customers orders

I am trying to make a stacked bar plot from customers' orders across many companies. I want to show each order as one segment of its company's bar. The issue is that the number of orders per company varies, and that rendering the plot may crash my Jupyter notebook.
Conceptually I reach my goal with the following:
company1 = pd.Series([10,10,10])
company2 = pd.Series([20,20])
df = pd.DataFrame([company1, company2]).T
df.columns = ["company1", "company2"]
df.T.plot.bar(stacked=True)
which gives me a plot:
Now how can I apply that to my dataset?
I tried the following on a subset of my data (only 3 companies in p2):
p3 = p2[["COMPANY", "TOTAL_PAID"]]
companies = [company for company, group in p3.groupby("COMPANY")]
series = [group["TOTAL_PAID"] for company, group in p3.groupby("COMPANY")]
df = pd.DataFrame(series).T
df.columns = companies
df.T.plot.bar(stacked=True, legend=False)
and it works:
but when I apply it to the whole file (which is still small: 15k lines), I wait a very long time without getting any result (indeed, I wrote this whole question after launching the plot creation, and it is still not displayed), so my questions are:
Is this idea of two list comprehensions a good strategy? I thought it was a bit suboptimal...
Is it normal that displaying the plot takes so long?
Is it normal that Jupyter may crash?
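As a possible alternative to the two list comprehensions, the wide table can be built with `groupby(...).cumcount()` and `pivot`. This is a sketch on toy data that mirrors the conceptual example; the `order_no` helper column is a name introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for p2, mirroring the conceptual example above.
p2 = pd.DataFrame({
    "COMPANY": ["company1", "company1", "company1", "company2", "company2"],
    "TOTAL_PAID": [10, 10, 10, 20, 20],
})

# Number each order within its company: 0, 1, 2, ...
order_no = p2.groupby("COMPANY").cumcount()

# One row per company, one column per order position.
# Companies with fewer orders get NaN in the trailing columns.
wide = p2.assign(order_no=order_no).pivot(
    index="COMPANY", columns="order_no", values="TOTAL_PAID"
)
print(wide)
```

`wide.plot.bar(stacked=True, legend=False)` then draws one stacked bar per company. Every non-empty cell becomes its own rectangle, so a 15k-row file means on the order of 15k drawn segments, which may explain the slow rendering. If the individual orders do not need to be visible, `p2.groupby("COMPANY")["TOTAL_PAID"].sum()` gives one bar per company and is far cheaper to draw.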

Infer Series Labels and Data from pandas dataframe column for plotting

Consider a simple 2x2 dataset with Series labels prepended as the first column ("Repo"):
Repo AllTests Restricted
0 Galactian 1860.0 410.0
1 Forecast-MLlib 140.0 47.0
Here are the DataFrame columns:
p(df.columns)
([u'Repo', u'AllTests', u'Restricted']
So the first column is the string label, and the second and third columns are data values. We want one series per row, corresponding to the Galactian and the Forecast-MLlib repos.
This seems like a common task, so there should be a straightforward way to simply plot the DataFrame. However, the following related question does not provide any simple way: it essentially throws away the DataFrame's structural knowledge and plots manually:
Set matplotlib plot axis to be the dataframe column name
So is there a more natural way to plot these Series, one that does not involve deconstructing the already-useful DataFrame but instead infers the first column as labels and the remaining columns as series data points?
Update: Here is a self-contained snippet
runtimes = npa([1860.,410.,140.,47.])
runtimes.shape = (2,2)
labels = npa(['Galactian','Forecast-MLlib'])
labels.shape=(2,1)
rtlabels = np.concatenate((labels,runtimes),axis=1)
rtlabels.shape = (2,3)
colnames = ['Repo','AllTests','Restricted']
df = pd.DataFrame(rtlabels, columns=colnames)
ps(df)
df.set_index('Repo').astype(float).plot()
plt.show()
And here is output
Repo AllTests Restricted
0 Galactian 1860.0 410.0
1 Forecast-MLlib 140.0 47.0
And with piRSquared's help it looks like this
So the data is showing now, but the Series and Labels are swapped. I will look further to try to line them up properly.
Another update
By flipping the columns/labels the series are coming out as desired.
The change was to :
labels = npa(['AllTests','Restricted'])
..
colnames = ['Repo','Galactian','Forecast-MLlib']
So the updated code is
runtimes = npa([1860.,410.,140.,47.])
runtimes.shape = (2,2)
labels = npa(['AllTests','Restricted'])
labels.shape=(2,1)
rtlabels = np.concatenate((labels,runtimes),axis=1)
rtlabels.shape = (2,3)
colnames = ['Repo','Galactian','Forecast-MLlib']
df = pd.DataFrame(rtlabels, columns=colnames)
ps(df)
df.set_index('Repo').astype(float).plot()
plt.title("Restricting Long-Running Tests\nin Galactus and Forecast-ML")
plt.show()
p('df columns', df.columns)
ps(df)
Pandas assumes your label information is in the index and columns. Set the index first:
df.set_index('Repo').astype(float).plot()
Or
df.set_index('Repo').T.astype(float).plot()
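To make the difference between the two calls concrete, here is a small sketch that builds the question's table directly as a typed DataFrame (so no `astype(float)` is needed) and shows which axis ends up holding the series in each orientation.

```python
import pandas as pd

# The question's 2x2 table, built directly with numeric columns.
df = pd.DataFrame({
    "Repo": ["Galactian", "Forecast-MLlib"],
    "AllTests": [1860.0, 140.0],
    "Restricted": [410.0, 47.0],
})

# Columns are plotted as series; the index supplies the x axis.
by_repo = df.set_index("Repo")  # series: AllTests, Restricted; x axis: repos
by_test = by_repo.T             # series: Galactian, Forecast-MLlib; x axis: test sets
print(by_test)
```

Calling `by_test.plot()` gives one line per repo, which is the layout the question asks for, while `by_repo.plot()` gives one line per test set.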

How to find ngram frequency of a column in a pandas dataframe?

Below is the input pandas dataframe I have.
I want to find the frequency of unigrams & bigrams. A sample of what I am expecting is shown below
How to do this using nltk or scikit learn?
I wrote the below code which takes a string as input. How to extend it to series/dataframe?
import nltk
from nltk.collocations import *
desc = 'john is a guy person you him guy person you him'
tokens = nltk.word_tokenize(desc)
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.ngram_fd.items()  # viewitems() exists only in Python 2
If your data is like
import pandas as pd
df = pd.DataFrame([
'must watch. Good acting',
'average movie. Bad acting',
'good movie. Good acting',
'pathetic. Avoid',
'avoid'], columns=['description'])
You could use the CountVectorizer of the package sklearn:
from sklearn.feature_extraction.text import CountVectorizer
word_vectorizer = CountVectorizer(ngram_range=(1,2), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['description'])
frequencies = sum(sparse_matrix).toarray()[0]
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])
Which gives you:
frequency
acting 3
average 1
average movie 1
avoid 2
bad 1
bad acting 1
good 3
good acting 2
good movie 1
movie 2
movie bad 1
movie good 1
must 1
must watch 1
pathetic 1
pathetic avoid 1
watch 1
watch good 1
EDIT
fit will just "train" your vectorizer: it splits the words of your corpus and builds a vocabulary from them. Then transform can take a new document and create a frequency vector based on the vectorizer's vocabulary.
Here your training set is also your output set, so you can do both at the same time (fit_transform). Because you have 5 documents, it will create 5 vectors as a matrix. You want a global vector, so you have to sum them.
EDIT 2
For big dataframes, you can speed up the frequencies computation by using:
frequencies = sum(sparse_matrix).data
or
frequencies = sparse_matrix.sum(axis=0).T