Plotting Tweet Sentiments against dates - pandas

I have a program using tweepy that pulls tweets and their created_at date.
So far it creates a bar chart that shows the number of tweets plotted against the sentiment like so:
I am rubbish with matplotlib and was wondering if it is possible to plot these three sentiments against the dates they were created. So that i can see if specific date ranges have more positive or negative tweets etc.
This is my pandas data frame:
This is my tweepy query:
def process_tweet(tweet, default_value=None):
val = {}
val['Tweet'] = tweet.full_text if tweet.full_text else default_value
val['Date'] = tweet.created_at if tweet.created_at else default_value
return val
q = "Tesla OR Elon Musk"
tweets = [process_tweet(tweet) for tweet in
tweepy.Cursor(api.search, q=(q), lang="en", since="2020-04-13", tweet_mode="extended").items(100)]
df = pd.DataFrame(tweets)
Im wondering if i will need to edit the query to make a gap in between the date of tweets being pulled, as they all seem to be within the same day/hour.
Ideally i would like to see 3 lines plotting the Neutral, Negative and Positive value counts on the y axis against their corresponding dates on the x axis.
Advice on how to accomplish this would be appreciated!
Thank you

Related

How to work with multiple `item_id`'s of varying length for timeseries prediction in autogluon?

I'm currently building a model to predict daily stock price based on daily data for thousands of stocks. In the data, I've got the daily data for all stocks, however they are for different lengths. Eg: for some stocks I have daily data from 2000 to 2022, and for others I have data from 2010 to 2022.
Many dates are also obviously repeated for all stocks.
While I was learning autogluon, I used the following function to format timeseries data so it can work with .fit():
def forward_fill_missing(ts_dataframe: TimeSeriesDataFrame, freq="D") -> TimeSeriesDataFrame:
original_index = ts_dataframe.index.get_level_values("timestamp")
start = original_index[0]
end = original_index[-1]
filled_index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
return ts_dataframe.droplevel("item_id").reindex(filled_index, method="ffill")
ts_dataframe = ts_dataframe.groupby("item_id").apply(forward_fill_missing)
This worked, however I was trying this for data for just one item_id and now I have thousands.
When I use this now, I get the following error: ValueError: cannot reindex from a duplicate axis
It's important to note that I have already foreward filled my data with pandas, and the ts_dataframe shouldn't have any missing dates or values, but when I try to use it with .fit() I get the following error:
ValueError: Frequency not provided and cannot be inferred. This is often due to the time index of the data being irregularly sampled. Please ensure that the data set used has a uniform time index, or create the TimeSeriesPredictorsettingignore_time_index=True.
I assume that this is because I have only filled in missing data and dates, but not taken into account the varying number of days available for every stock individually.
For reference, here's how I have formatted the data with pandas:
df = pd.read_csv(
"/content/drive/MyDrive/stock_data/training_data.csv",
parse_dates=["Date"],
)
df["Date"] = pd.to_datetime(df["Date"], errors="coerce", dayfirst=True)
df.fillna(method='ffill', inplace=True)
df = df.drop("Unnamed: 0", axis=1)
df[:11]
How can I format the data so I can use it with .fit()?
Thanks!

Plotting a graph of the top 15 highest values

I am working on a dataset which shows the budget spent on movies. I want make a plot which contains the top 15 highest budget movies.
#sort the 'budget' column in decending order and store it in the new dataframe.
info = pd.DataFrame(dp['budget'].sort_values(ascending = False))
info['original_title'] = dp['original_title']
data = list(map(str,(info['original_title'])))
#extract the top 10 budget movies data from the list and dataframe.
x = list(data[:10])
y = list(info['budget'][:10])
This was the ouput i got
C:\Users\Phillip\AppData\Local\Temp\ipykernel_7692\1681814737.py:2: FutureWarning: The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.
y = list(info['budget'][:5])
I'm new to the data analysis scene so i'm confused on how else to go about the problem
A simple example using a movie dataset I found online:
import pandas as pd
url = "https://raw.githubusercontent.com/erajabi/Python_examples/master/movie_sample_dataset.csv"
df = pd.read_csv(url)
# Bar plot of 15 highest budgets:
df.nlargest(n=15, columns="budget").plot.bar(x="movie_title", y="budget")
You can customize your plot in various ways by adding arguments to the .bar(...) call.

How can I draw Yearly series using monthly data from a DateTimeIndex in Matplotlib?

I have monthly data of 6 variables from 2014 until 2018 in one dataset.
I'm trying to draw 6 subplots (one for each variable) with monthly X axis (Jan, Feb....) and 5 series (one for each year) with their legend.
This is part of the data:
I created 5 series (one for each year) per variable (30 in total) and I'm getting the expected output but using MANY lines of code.
What is the best way to achieve this using less lines of code?
This is an example how I created the series:
CL2014 = data_total['Charity Lottery'].where(data_total['Date'].dt.year == 2014)[0:12]
CL2015 = data_total['Charity Lottery'].where(data_total['Date'].dt.year == 2015)[12:24]
This is an example of how I'm plotting the series:
axCL.plot(xvals, CL2014)
axCL.plot(xvals, CL2015)
axCL.plot(xvals, CL2016)
axCL.plot(xvals, CL2017)
axCL.plot(xvals, CL2018)
There's no need to litter your namespace with 30 variables. Seaborn makes the job very easy but you need to normalize your dataframe first. This is what "normalized" or "unpivoted" looks like (Seaborn calls this "long form"):
Date variable value
2014-01-01 Charity Lottery ...
2014-01-01 Racecourse ...
2014-04-01 Bingo Halls ...
2014-04-01 Casino ...
Your screenshot is a "pivoted" or "wide form" dataframe.
df_plot = pd.melt(df, id_vars='Date')
df_plot['Year'] = df_plot['Date'].dt.year
df_plot['Month'] = df_plot['Date'].dt.strftime('%b')
import seaborn as sns
plot = sns.catplot(data=df_plot, x='Month', y='value',
row='Year', col='variable', kind='bar',
sharex=False)
plot.savefig('figure.png', dpi=300)
Result (all numbers are randomly generated):
I would try using .groupby(), it is really powerful for parsing down things like this:
for _, group in data_total.groupby([year, month])[[x_variable, y_variable]]:
plt.plot(group[x_variables], group[y_variables])
So here the groupby will separate your data_total DataFrame into year/month subsets, with the [[]] on the end to parse it down to the x_variable (assuming it is in your data_total DataFrame) and your y_variable, which you can make any of those features you are interested in.
I would decompose your datetime column into separate year and month columns, then use those new columns inside that groupby as the [year, month]. You might be able to pass in the dt.year and dt.month like you had before... not sure, try it both ways!

Group DataFrame by binning a column::Float64, in Julia

Say I have a DataFrame with a column of Float64s, I'd like to group the dataframe by binning that column. I hear the cut function might help, but it's not defined over dataframes. Some work has been done (https://gist.github.com/tautologico/3925372), but I'd rather use a library function rather than copy-pasting code from the Internet. Pointers?
EDIT Bonus karma for finding a way of doing this by month over UNIX timestamps :)
You could bin dataframes based on a column of Float64s like this. Here my bins are increments of 0.1 from 0.0 to 1.0, binning the dataframe based on a column of 100 random numbers between 0.0 and 1.0.
using DataFrames #load DataFrames
df = DataFrame(index = rand(Float64,100)) #Make a DataFrame with some random Float64 numbers
df_array = map(x->df[(df[:index] .>= x[1]) .& (df[:index] .<x[2]),:],zip(0.0:0.1:0.9,0.1:0.1:1.0)) #Map an anonymous function that gets every row between two numbers specified by a tuple called x, and map that anonymous function to an array of tuples generated using the zip function.
This will produce an array of 10 dataframes, each one with a different 0.1-sized bin.
As for the UNIX timestamp question, I'm not as familiar with that side of things, but after playing around a bit maybe something like this could work:
using Dates
df = DataFrame(unixtime = rand(1E9:1:1.1E9,100)) #Make a dataframe with floats containing pretend unix time stamps
df[:date] = Dates.unix2datetime.(df[:unixtime]) #convert those timestamps to DateTime types
df[:year_month] = map(date->string(Dates.Year.(date))*" "*string(Dates.Month.(date)),df[:date]) #Make a string for every month in your time range
df_array = map(ym->df[df[:year_month] .== ym,:],unique(df[:year_month])) #Bin based on each unique year_month string

how to group pandas timestamps plot several plots in one figure and stack them together in matplotlib?

I have a data frame with perfectly organised timestamps, like below:
It's a web log, and the timestamps go though the whole year. I want to cut them into each day and show the visits within each hour and plot them into the same figure and stack them all together. Just like the pic shown below:
I am doing well on cutting them into days and plot the visits of a day individually, but I am having trouble plotting them and stacking them together. The primary tool I am using is Pandas and Matplotlib.
Any advices and suggestions? Much Appreciated!
Edited:
My Code is as below:
The timestamps are: https://gist.github.com/adamleo/04e4147cc6614820466f7bc05e088ac5
And the dataframe looks like this:
I plotted the timestamp density through the whole period used the code below:
timestamps_series_all = pd.DatetimeIndex(pd.Series(unique_visitors_df.time_stamp))
timestamps_series_all_toBePlotted = pd.Series(1, index=timestamps_series_all)
timestamps_series_all_toBePlotted.resample('D').sum().plot()
and got the result:
I plotted timestamps within one day using the code:
timestamps_series_oneDay = pd.DatetimeIndex(pd.Series(unique_visitors_df.time_stamp.loc[unique_visitors_df["date"] == "2014-08-01"]))
timestamps_series_oneDay_toBePlotted = pd.Series(1, index=timestamps_series_oneDay)
timestamps_series_oneDay_toBePlotted.resample('H').sum().plot()
and the result:
And now I am stuck.
I'd really appreciate all of your help!
I think you need pivot:
#https://gist.github.com/adamleo/04e4147cc6614820466f7bc05e088ac5 to L
df = pd.DataFrame({'date':L})
print (df.head())
date
0 2014-08-01 00:05:46
1 2014-08-01 00:14:47
2 2014-08-01 00:16:05
3 2014-08-01 00:20:46
4 2014-08-01 00:23:22
#convert to datetime if necessary
df['date'] = pd.to_datetime(df['date'] )
#resample by Hours, get count and create df
df = df.resample('H', on='date').size().to_frame('count')
#extract date and hour
df['days'] = df.index.date
df['hours'] = df.index.hour
#pivot and plot
#maybe check parameter kind='density' from http://stackoverflow.com/a/33474410/2901002
#df.pivot(index='days', columns='hours', values='count').plot(rot='90')
#edit: last line change to below:
df.pivot(index='hours', columns='days', values='count').plot(rot='90')