pd.datetime not indexing correctly - pandas

I have a dataset with the date of every transaction in a restaurant. I wanted to set the date as the index, so I first converted it with pd.to_datetime:
df['dateTransaction'] = pd.to_datetime(df['dateTransaction'])
df.info()
And I do get the 'dateTransaction' dtype as datetime64[ns]. But then I tried to set the index with
df = df.set_index('dateTransaction')
but my dataset is not sorted by date correctly.
Please advise: how do I index the dataframe by date so that it is sorted?

set_index never rearranges rows; it just "moves" a column into the index. So you have to sort explicitly (either before or after set_index):
df = df.set_index('dateTransaction').sort_index()
# or
df = df.sort_values("dateTransaction").set_index('dateTransaction')
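A minimal sketch (with made-up transaction dates) shows that set_index keeps the original row order and that sort_index is what puts the rows in date order:
import pandas as pd

# hypothetical, deliberately out-of-order transaction dates
df = pd.DataFrame({
    'dateTransaction': pd.to_datetime(['2021-03-02', '2021-01-15', '2021-02-07']),
    'amount': [12.5, 8.0, 20.0],
})

df = df.set_index('dateTransaction')
print(df.index.is_monotonic_increasing)  # False: set_index kept the row order

df = df.sort_index()
print(df.index.is_monotonic_increasing)  # True: rows are now in date order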

If you are reading the data from a CSV, you can also parse the dates and set the index in one step:
data = pd.read_csv('SomeData.csv', index_col=['Date'], parse_dates=['Date'], dayfirst=True)
If your dates are written day-first, then without dayfirst=True the days are read in as months and vice versa.
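To see what dayfirst changes, a small sketch with an ambiguous date string:
import pandas as pd

s = pd.Series(['01/02/2020'])
print(pd.to_datetime(s))                 # 2020-01-02 (month first, the default)
print(pd.to_datetime(s, dayfirst=True))  # 2020-02-01 (day first)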

Related

Pandas Sort Values By Date Doesn't Sort By Year

I have a large data set that is in this format
I'd like to order this data set by the "created_at" column, so I converted the "created_at" column to type datetime following this guide:
https://www.geeksforgeeks.org/how-to-sort-a-pandas-dataframe-by-date/
data = pd.read_csv(PATH_TO_CSV)
data['created_at'] = data['created_at'].str.split("+").str[0]
data['created_at'] = pd.to_datetime(data['created_at'],format="%Y-%m-%dT%H:%M:%S")
data.sort_values(by='created_at')
But it's not sorting by year as expected. The values starting with 2012 should be at the top, but they aren't:
print(data)
print(type(data['created_at'][0]))
What am I missing?
With a datetime type, this should sort directly; make sure to assign the output, as sorting is not in place:
# no need for an intermediate column nor to pass the full format
data['created_at'] = pd.to_datetime(data['created_at'].str.split("+").str[0])
# assign output
data = data.sort_values(by='created_at')
As already stated in the comments, the sorted df needs to be assigned again; sort_values doesn't work in place by default.
data = data.sort_values(by='created_at')
# OR
data.sort_values(by='created_at', inplace=True)
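A quick sketch with made-up timestamps illustrates the difference between merely calling sort_values and assigning its result:
import pandas as pd

# hypothetical data with the oldest year out of order
data = pd.DataFrame({'created_at': pd.to_datetime(
    ['2015-06-01T10:00:00', '2012-03-15T08:30:00', '2013-11-20T17:45:00'])})

data.sort_values(by='created_at')         # returns a sorted copy; data is unchanged
print(data['created_at'].iloc[0])         # still 2015-06-01

data = data.sort_values(by='created_at')  # assign the result
print(data['created_at'].iloc[0])         # now 2012-03-15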

Problems plotting price data against my datetime which is indexed

Here is my code:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
data = df.filter(['avgLowPrice'])
plt.plot(data['avgLowPrice'])
plt.show()
The graph does not come out as expected, and I have no idea why it's doing this...
I suppose that your DataFrame is not sorted by the index, i.e.
consecutive rows have "intermixed" (instead of ordered) index values.
Sort your DataFrame, which can even be done in place:
df.sort_index(inplace=True)
and then generate your plot.
Another (unrelated) hint to make your code more concise: to read your input file, convert the datetime column, and set it as the index all in one go, run:
df = pd.read_csv('data.csv', parse_dates=['datetime'], index_col='datetime')
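For illustration, a minimal sketch (with invented prices) of how an unsorted datetime index scrambles the line and how sort_index fixes it:
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical frame whose datetime index is not sorted
df = pd.DataFrame(
    {'avgLowPrice': [10, 30, 20]},
    index=pd.to_datetime(['2022-01-03', '2022-01-01', '2022-01-02']),
)

df['avgLowPrice'].plot()   # points are connected in row order, so the line doubles back

df.sort_index(inplace=True)
df['avgLowPrice'].plot()   # now a clean left-to-right line
plt.show()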

How to index a column with two values pandas

I have two dataframes:
Dataframe #1
Reads the values (I will only be interested in NodeID and GSE):
sta = pd.read_csv(filename)
Dataframe #2
Reads the file; I then use pivot and get the following result:
sim = pd.read_csv(headout,index_col=0)
sim['Layer'] = sim.groupby('date').cumcount() + 1
sim['Layer'] = 'L' + sim['Layer'].astype(str)
sim = sim.pivot(index = None , columns = 'Layer').T
This leaves me with an index that has two levels (the header is blank for the first one and 'Layer' for the second), e.g. 1, L1.
What I need help on is:
I cannot find a way to rename that first blank level in the index to 'NodeID'.
I want to name it so that I can do a lookup on NodeID in both dataframes and bring the 'GSE' values from the first dataframe into the second.
I have been googling ways to rename that first column in the second dataframe and cannot seem to find a solution. Any ideas help at this point; I think my pivot call might be wrong...
This is a picture of dataframe #2 before the pivot; the numbers 1-4 are the NodeID.
When I export it to CSV to see what the dataframe looks like, I get this...
Try (note that rename returns a copy, so assign the result):
df = df.rename(columns={"Index": "your preferred name"})
If it is your index, then do:
df = df.reset_index()
df = df.rename(columns={"index": "your preferred name"})
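Since the pivot in the question produces a two-level row index, an alternative worth mentioning is rename_axis, which names index levels directly without resetting them. A sketch with an invented frame of the same shape:
import pandas as pd

# hypothetical frame shaped like `sim` after the pivot: a two-level
# row index whose first level has no name
sim = pd.DataFrame(
    {1: [10.0, 20.0], 2: [30.0, 40.0]},
    index=pd.MultiIndex.from_tuples([(1, 'L1'), (1, 'L2')], names=[None, 'Layer']),
)

sim = sim.rename_axis(['NodeID', 'Layer'])  # name both index levels
print(sim.index.names)                      # ['NodeID', 'Layer']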

Datetime column coerced to int when setting with .loc and slice

I have a column of datetimes and need to change several of these values to new datetimes. When I set the values using df.loc[indices, 'col'] = new_datetimes, the unaffected values are coerced to int while the newly set values remain datetimes. If I set the values one at a time, no type coercion occurs.
For illustration I created a sample df with just one column.
import datetime as dt
import pandas as pd

df = pd.DataFrame([dt.datetime(2019,1,1)]*5)
df.loc[[1,3,4]] = [dt.datetime(2019,1,2)]*3
df
This produces output where the unaffected values are coerced to int, while the newly set values are datetimes.
If I change indices 1,3,4 individually:
df = pd.DataFrame([dt.datetime(2019,1,1)]*5)
df.loc[1] = dt.datetime(2019,1,2)
df.loc[3] = dt.datetime(2019,1,2)
df.loc[4] = dt.datetime(2019,1,2)
df
I get the correct output: all five values remain datetimes.
A suggestion was to turn the list into a numpy array before setting, which does resolve the issue. However, if you try to set multiple columns (some of which are not datetime) using a numpy array, the issue arises again.
In this example the dataframe has two columns and I try to set both columns.
df = pd.DataFrame({'dt':[dt.datetime(2019,1,1)]*5, 'value':[1,1,1,1,1]})
df.loc[[1,3,4]] = np.array([[dt.datetime(2019,1,2)]*3, [2,2,2]]).T
df
This again gives output with the coerced values.
Can someone please explain what is causing the coercion and how to prevent it? The code that uses this was written over a month ago and used to work just fine; could it be one of those warnings about a future version of pandas deprecating certain functionality?
An explanation of what is going on would be greatly appreciated, because I wrote other code that likely employs similar functionality and I want to make sure everything works as intended.
The solution proposed by w-m has one "awkward detail": the result column also contains a time part (it didn't have one before).
I also have a remark: DataFrames are tables, not Series, so they have columns, each with its own name, and it is a bad habit to rely on default column names (consecutive numbers).
So I propose another solution, addressing both above issues:
To create the source DataFrame I executed:
df = pd.DataFrame([dt.datetime(2019, 1, 1)]*5, columns=['c1'])
Note that I provided a name for the only column.
Then I created another DataFrame:
df2 = pd.DataFrame([dt.datetime(2019,1,2)]*3, columns=['c1'], index=[1,3,4])
It contains your "new" dates, and the numbers which you used in loc are set as the index (again with the same column name).
Then, to update df, use (not surprisingly) df.update:
df.update(df2)
This function performs in-place update, so if you print(df), you will get:
c1
0 2019-01-01
1 2019-01-02
2 2019-01-01
3 2019-01-02
4 2019-01-02
As you can see, under indices 1, 3 and 4 you have new dates
and there is no time part, just like before.
[dt.datetime(2019,1,2)]*3 is a Python list of objects. This particular list happens to contain only datetimes, but pandas does not seem to recognize that, and treats it as what it is: a list of arbitrary objects.
If you convert it into a typed array, then Pandas will keep the original dtype of the column intact:
df.loc[[1,3,4]] = np.asarray([dt.datetime(2019,1,2)]*3)
I hope this workaround helps you, but you may still want to file a bug with Pandas. I don't have an explanation as to why the datetime objects should be coerced to ints in the first output example.
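For the two-column case, one workaround (a sketch, not the only way) is to assign each column separately, so pandas never has to guess a common dtype for a mixed 2-D array:
import datetime as dt
import pandas as pd

df = pd.DataFrame({'dt': [dt.datetime(2019, 1, 1)]*5, 'value': [1, 1, 1, 1, 1]})

# assigning per column keeps each column's own dtype intact
df.loc[[1, 3, 4], 'dt'] = pd.to_datetime([dt.datetime(2019, 1, 2)]*3)
df.loc[[1, 3, 4], 'value'] = [2, 2, 2]
print(df.dtypes)  # dt stays datetime64[ns], value stays int64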

matplotlib: value error x and y have different dimensions

Having some difficulty plotting out values grouped by a text/name field and along a range of dates. The issue is that while I can group by the name and generate plots for some of the date ranges, there are instances where the grouping contains missing date values (just the nature of the overall dataset).
That is to say, I may very well be able to plot a date_range('10/1/2013', '10/31/2013') for SOME of the grouped values, but there are instances where there is no '10/15/2013' within that range, and that will throw the error mentioned in the title of this post.
Thanks for any input!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import date_range

plt.rcParams['legend.loc'] = 'best'
dtable = pd.io.parsers.read_table(str(datasource), sep=',')
unique_keys = np.unique(dtable['KEY'])
index = date_range(d1frmt, d2frmt)

for key in unique_keys:
    values = dtable[dtable['KEY'] == key]
    plt.figure()
    plt.plot(index, values['VAL'])  # <-- can fail if the subset is missing a date
    plt.xlim(xmin=d1frmt, xmax=d2frmt)
    plt.xticks(rotation=270)
    plt.xticks(size='small')
    plt.legend(('H20',))
    plt.ylabel('Head (ft)')
    plt.title('Well {0}'.format(key))
    fig = str('{0}.png'.format(key))
    out = str(outputloc) + "\\" + str(fig)
    plt.savefig(out)
    plt.close()
You must have a date column, or index, in your dtable; otherwise you don't know which values in values['VAL'] belong to which date.
If you do, there are two ways.
Since you make a subset based on a key, you can either use the index (if it's a datetime!) of that subset:
plt.plot(values.index.to_pydatetime(), values['VAL'])
or reindex the subset to your 'target' range:
values = values.reindex(index)
plt.plot(index.to_pydatetime(), values['VAL'])
By default, reindex inserts NaN values as missing data.
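To make that concrete, a sketch with an invented subset that is missing one date; after reindexing, the missing dates become NaN rows, which matplotlib renders as gaps instead of raising a dimension error:
import pandas as pd
import matplotlib.pyplot as plt

index = pd.date_range('2013-10-01', '2013-10-31')

# invented subset with no row for 2013-10-15
values = pd.DataFrame(
    {'VAL': [1.0, 2.0, 3.0]},
    index=pd.to_datetime(['2013-10-14', '2013-10-16', '2013-10-17']),
)

values = values.reindex(index)           # missing dates become NaN rows
print(values.loc['2013-10-15', 'VAL'])   # nan

plt.plot(index.to_pydatetime(), values['VAL'])  # NaN shows as a gap in the line
plt.show()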
It would be easier if you gave a working example; it's a bit hard to answer without knowing what your DataFrame looks like.