Problems plotting price data against my datetime which is indexed - pandas

Here is my code:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
data = df.filter(['avgLowPrice'])
plt.plot(data['avgLowPrice'])
plt.show()
the graph looks like this:
[plot image omitted]
I have no idea why it's doing this...

I suppose that your DataFrame is not sorted by its index, i.e.
consecutive rows have "intermixed" (instead of ordered) index values,
so matplotlib draws lines back and forth across the x-axis as it
connects consecutive points.
Sort your DataFrame, in place if you like:
df.sort_index(inplace=True)
and then generate your plot.
Another (unrelated) hint to make your code more concise:
to read your input file, convert the datetime column to datetime type, and
set it as the index, all in one go, run:
df = pd.read_csv('data.csv', parse_dates=['datetime'], index_col='datetime')
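Putting both pieces together, a minimal end-to-end sketch (assuming, as in the question, the CSV has datetime and avgLowPrice columns):
import pandas as pd
import matplotlib.pyplot as plt

# Parse and index the datetime column while reading, then sort chronologically
df = pd.read_csv('data.csv', parse_dates=['datetime'], index_col='datetime')
df = df.sort_index()

plt.plot(df['avgLowPrice'])
plt.show()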

Related

How to compare one row in Pandas Dataframe to all other rows in the same Dataframe

I have a csv file in which I want to compare each row with all the other rows. For each pair I want to fit a linear regression, get the r^2 value for the regression line, and put it into a new matrix. I'm having trouble finding a way to iterate over all the other rows (it's fine to compare the primary row to itself).
I've tried using .iterrows, but I can't think of a way to define the other rows once I have my primary row with this function.
UPDATE: Here is a solution I came up with. Please let me know if there is a more efficient way of doing this.
from itertools import combinations
from sklearn.metrics import r2_score

def bad_pairs(df, limit):
    # Every unordered pair of row labels
    list_fluor = list(combinations(df.index.values, 2))
    final = {}
    for fluor in list_fluor:
        # r^2 between the two rows' detector values
        final[fluor] = r2_score(df.xs(fluor[0]), df.xs(fluor[1]))
    # Keep only the pairs whose score exceeds the limit
    bad_final = {}
    for i in final:
        if final[i] > limit:
            bad_final[i] = final[i]
    return bad_final
My data is a pandas DataFrame where the index is the name of the color and there is a number between 0 and 1 for each detector (220 columns).
I'm still working on a way to make a new pandas DataFrame from a dictionary with all the values (final in the code above), not just those over the limit.
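For that last part, a minimal sketch (assuming final maps (color_a, color_b) tuples to r^2 floats, as built above; the column names here are just illustrative):
import pandas as pd

# Flatten the {(color_a, color_b): r2} dict into one row per pair
pairs = pd.DataFrame(
    [(a, b, r2) for (a, b), r2 in final.items()],
    columns=['color_a', 'color_b', 'r2'],
)
# Optionally pivot into a square color-by-color matrix
matrix = pairs.pivot(index='color_a', columns='color_b', values='r2')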

How do I append a column from a numpy array to a pd dataframe?

I have a numpy array of 100 predicted values called first_100. If I convert these to a dataframe they are indexed as 0,1,2 etc. However, the predictions are for values that are in random indexed order, 66,201,32 etc. I want to be able to put the actual values and the predictions in the same dataframe, but I'm really struggling.
The real values are in a dataframe called first_100_train.
I've tried the following:
pd.concat([first_100, first_100_train], axis=1)
This doesn't work: for some reason it returns the entire DataFrame, indexed from 0, so there are lots of NaNs...
first_100_train['Prediction'] = first_100[0]
This is almost what I want, but again because the indexes are different the data doesn't match up. I'd really appreciate any suggestions.
EDIT: After managing to join the dataframes I now have this (screenshot omitted):
I'd like to be able to drop the final column...
Here are first_100.head() and first_100_train.head() (screenshots omitted).
The problem is that index 2 from first_100 actually corresponds to index 480 of first_100_train.
Set default index values with DataFrame.reset_index and drop=True for correct alignment:
pd.concat([first_100.reset_index(drop=True),
           first_100_train.reset_index(drop=True)], axis=1)
Or, if the first DataFrame already has a default RangeIndex, the solution simplifies to:
pd.concat([first_100,
           first_100_train.reset_index(drop=True)], axis=1)
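To see why this helps, here is a minimal sketch with toy data (made-up values standing in for the question's frames):
import pandas as pd

preds = pd.DataFrame({'Prediction': [0.1, 0.2, 0.3]})                 # default index 0, 1, 2
truth = pd.DataFrame({'Actual': [10, 20, 30]}, index=[66, 201, 32])   # "random" index

# concat aligns on the index, so mismatched labels produce NaNs;
# resetting both indexes pairs the rows up by position instead
joined = pd.concat([preds.reset_index(drop=True),
                    truth.reset_index(drop=True)], axis=1)
print(joined)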

pd.datetime not indexing correctly

I have a dataset with the date of every transaction in a restaurant. I tried to set the date as the index, after first converting it with pd.to_datetime:
df['dateTransaction'] = pd.to_datetime(df['dateTransaction'])
df.info()
And I really do get 'dateTransaction' with type datetime64[ns]. But then I tried to set the index with
df = df.set_index('dateTransaction')
but my dataset wasn't sorted by date correctly.
Please advise: how do I index the DataFrame by date in sorted order?
When you call set_index, it never rearranges rows; it just "moves" a column into the index. So you have to sort explicitly (either before or after set_index):
df = df.set_index('dateTransaction').sort_index()
# or
df = df.sort_values("dateTransaction").set_index('dateTransaction')
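A quick illustration of that behavior with a toy frame (made-up dates, not the question's data):
import pandas as pd

df = pd.DataFrame({'dateTransaction': pd.to_datetime(['2021-03-01', '2021-01-15', '2021-02-10']),
                   'amount': [5, 7, 3]})

# set_index keeps the original row order: 2021-03-01 is still first
print(df.set_index('dateTransaction'))

# sort_index puts the rows into chronological order
print(df.set_index('dateTransaction').sort_index())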
If you are reading the data from a CSV, you can also do the parsing and indexing in one step:
data = pd.read_csv('SomeData.csv', index_col=['Date'], parse_dates=['Date'], dayfirst=True)
Without dayfirst=True, ambiguous dates are parsed month-first, so the days are read in as months and vice versa.
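For example, a minimal sketch of what dayfirst changes:
import pandas as pd

print(pd.to_datetime('01/02/2020'))                 # 2020-01-02 (month-first by default)
print(pd.to_datetime('01/02/2020', dayfirst=True))  # 2020-02-01 (day-first)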

Group DataFrame by binning a column::Float64, in Julia

Say I have a DataFrame with a column of Float64s, I'd like to group the dataframe by binning that column. I hear the cut function might help, but it's not defined over dataframes. Some work has been done (https://gist.github.com/tautologico/3925372), but I'd rather use a library function rather than copy-pasting code from the Internet. Pointers?
EDIT Bonus karma for finding a way of doing this by month over UNIX timestamps :)
You could bin dataframes based on a column of Float64s like this. Here my bins are increments of 0.1 from 0.0 to 1.0, binning the dataframe based on a column of 100 random numbers between 0.0 and 1.0.
using DataFrames  # load DataFrames
df = DataFrame(index = rand(Float64, 100))  # a DataFrame with 100 random Float64 numbers in [0, 1)
# For each pair of bin edges (lo, hi) produced by zip, keep the rows whose value falls in [lo, hi)
df_array = map(x -> df[(df.index .>= x[1]) .& (df.index .< x[2]), :],
               zip(0.0:0.1:0.9, 0.1:0.1:1.0))
This will produce an array of 10 dataframes, each one with a different 0.1-sized bin.
As for the UNIX timestamp question, I'm not as familiar with that side of things, but after playing around a bit maybe something like this could work:
using Dates
df = DataFrame(unixtime = rand(1.0e9:1.0:1.1e9, 100))  # pretend Unix timestamps
df.date = Dates.unix2datetime.(df.unixtime)  # convert the timestamps to DateTime values
df.year_month = map(d -> string(Dates.year(d), " ", Dates.month(d)), df.date)  # a "year month" label for every row
df_array = map(ym -> df[df.year_month .== ym, :], unique(df.year_month))  # bin on each unique year_month label

Converting row of DataFrame to a Series (Pandas)

So I could not find how to do this in the documentation, but I am reading a row from a dataframe as such:
self.data = df[n:n+1]
But this results in self.data being a DataFrame with 1 row and 7 columns, instead of just a Series. However, the test cases for my course depend on it being a Series. Is there an easy way to make that conversion?
Just use .iloc:
df.iloc[n]
This selects row n by position and returns it as a Series, regardless of how the index is labelled. (The old .ix accessor has since been removed from pandas.)
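A minimal sketch with toy data (not the course's DataFrame):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
row = df.iloc[0]
print(type(row))  # <class 'pandas.core.series.Series'>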