Python pandas - change last hour of a day from 00 to 24

I'm currently working on setting up a pandas dataframe. I have a datetime column, which should be split into date and time columns.
I have managed to split this column into Date and Hour columns like this:
    Date      Hour
22  20200205  23
23  20200205  00
What I want is to replace the 00 value with 24. I know Python does not recognize a 24th hour as the last hour, but for my purposes I need to include hour "24".
I'm pretty new to Python and pandas and I'm confused about replacing this value.
I tried to use the code line below, but with no luck:
frame= OMPdata
OMPdata['Hour'] = OMPdata['Hour'].str.replace(00,24, case= False)
or this function:
def h24(row):
    if row['Hour'] == "00":
        return "24"
    else:
        return row['Hour']
Please help me with this issue.
Thanks a lot!

Try converting the column to string first, then replacing. The case argument is unnecessary here (digits have no case) and only applies to regex replacement, so drop it:
df['Hour'] = df['Hour'].astype(str)
df['Hour'] = df['Hour'].str.replace("00", "24")
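As a quick sanity check, here is a minimal self-contained sketch of both routes on toy data shaped like the question's frame (the exact-match Series.replace variant is a suggested alternative, not the asker's code):
import pandas as pd

# Toy frame shaped like the question's Date/Hour columns.
df = pd.DataFrame({"Date": ["20200205", "20200205"], "Hour": ["23", "00"]})

# Exact-match replace: Series.replace compares whole cell values,
# so only cells equal to "00" change.
df["Hour"] = df["Hour"].astype(str).replace("00", "24")

# Equivalent route using the h24 function from the question:
# df["Hour"] = df.apply(h24, axis=1)
print(df)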

Related

how to add a character to every value in a dataframe without losing the 2d structure

Today my problem is this: I have a dataframe of 300 x 41. It's encoded with numbers. I want to append an 'a' to each value in the dataframe so that another downstream program will not fuss about these being 'continuous variables', which they aren't; they are factors. Simple, right?
Every way I can think of to do this, though, returns a dataframe or object that is not 300 x 41, but just one long list of altered values:
Please end this headache for me. How can I do this in a way that returns a 300 x 41 altered output?
> dim(x)
[1] 300 41
> x2 <- sub("^", "a", x)
> dim(x2)
[1] 12300 1
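For comparison in pandas (this page's topic), the same operation preserves the 2-D shape automatically; a minimal sketch with assumed toy data:
import pandas as pd

# Prefix every cell with 'a' while keeping the 2-D frame intact.
x = pd.DataFrame([[1, 2], [3, 4]])
x2 = "a" + x.astype(str)  # elementwise string concatenation
print(x2.shape)  # (2, 2), not a flattened vector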

Collapse pandas DataFrame based on daily column value

I have a pandas DataFrame with multiple measurements per day (for example hourly measurements, but that is not necessarily the case), but I want to keep only the hour for which a certain column is the daily minimum.
One day in my data frame looks somewhat like this:
DATE Value Distance
17 1979-1-2T00:00:00.0 15.5669870447436 34.87
18 1979-1-2T01:00:00.0 81.6306803714536 31.342
19 1979-1-2T02:00:00.0 83.1854759740486 33.264
20 1979-1-2T03:00:00.0 23.8659679630303 32.34
21 1979-1-2T04:00:00.0 63.2755504429306 31.973
22 1979-1-2T05:00:00.0 91.2129044773733 34.091
23 1979-1-2T06:00:00.0 76.493130052689 36.837
24 1979-1-2T07:00:00.0 63.5443183375785 34.383
25 1979-1-2T08:00:00.0 40.9255407683688 35.275
26 1979-1-2T09:00:00.0 54.5583051827551 32.152
27 1979-1-2T10:00:00.0 26.2690011881422 35.104
28 1979-1-2T11:00:00.0 71.3059740399097 37.28
29 1979-1-2T12:00:00.0 54.0111262724049 38.963
30 1979-1-2T13:00:00.0 91.3518048568241 36.696
31 1979-1-2T14:00:00.0 81.7651763485069 34.832
32 1979-1-2T15:00:00.0 90.5695814525067 35.473
33 1979-1-2T16:00:00.0 88.4550315358515 30.998
34 1979-1-2T17:00:00.0 41.6276969038137 32.353
35 1979-1-2T18:00:00.0 79.3818377264749 30.15
36 1979-1-2T19:00:00.0 79.1672568582629 37.07
37 1979-1-2T20:00:00.0 1.48337999844262 28.525
38 1979-1-2T21:00:00.0 87.9110385474789 38.323
39 1979-1-2T22:00:00.0 38.6646421460678 23.251
40 1979-1-2T23:00:00.0 88.4920153764757 31.236
I would like to keep all rows that have the minimum "distance" per day, so for the one day shown above, one would have only one row left (the one with index value 39). I know how to collapse the data frame so that I only have the Distance column left. I can do that - if I first set the DATE as index - with
df_short = df.groupby(df.index.floor('D'))["Distance"].min()
But I also want the Value column in my final result. How do I keep all columns?
It doesn't seem to work if I do
df_short = df.groupby(df.index.floor('D')).min(["Distance"])
This does keep all the columns in the final result, but it seems like the outcome is wrong, so I'm not sure what this does.
Maybe this is already posted somewhere, but I have trouble finding it.
You can use agg; note that this aggregates each column independently:
df_short = df.groupby(df.index.floor('D')).agg({'Distance': 'min', 'Value': 'max'})
If you want the kept Value to come from the same row as the minimum Distance:
df_short = df.loc[df.groupby(df.index.floor('D'))['Distance'].idxmin(), :]
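A minimal runnable sketch of the idxmin route on a few rows of assumed toy data; idxmin returns the index label of each day's minimum and .loc pulls back the full rows:
import pandas as pd

df = pd.DataFrame(
    {"Value": [15.57, 81.63, 38.66, 88.49],
     "Distance": [34.87, 31.342, 23.251, 31.236]},
    index=pd.to_datetime(["1979-01-02 00:00", "1979-01-02 01:00",
                          "1979-01-02 22:00", "1979-01-02 23:00"]),
)
# One row per day: the row holding that day's minimum Distance.
df_short = df.loc[df.groupby(df.index.floor("D"))["Distance"].idxmin()]
print(df_short)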
Make a datetime Index:
df.DATE = pd.to_datetime(df.DATE) # If not already datetime.
df.set_index('DATE', inplace=True)
Resample and find the min Distance's location:
df.loc[df.resample('D')['Distance'].idxmin()]
Output:
Value Distance
DATE
1979-01-02 22:00:00 38.664642 23.251

Pandas how to get row number from datetime index and back again?

I'm having great difficulty. I have read a csv file and set the index to the "Timestamp" column like this:
df = pd.read_csv(csv_file, quotechar="'", decimal=".", delimiter=";", parse_dates=True, index_col="Timestamp")
df
XYZ PRICE position nrLots posText
Timestamp
2014-10-14 10:00:29 30 140 -1.0 -1.0 buy
2014-10-14 10:00:30 21 90 -1.0 -5.0 buy
2014-10-14 10:00:31 3 110 1.0 2.0 sell
2014-10-14 10:00:32 31 120 1.0 1.0 sell
2014-10-14 10:00:33 4 70 -1.0 -5.0 buy
So if I want to get the price in the 2nd row, I would like to write:
df.loc[2, "PRICE"]
But that does not work. If I want to use the df.loc[] operator, I need to pass a Timestamp, like this:
df.loc["2014-10-14 10:00:31", "PRICE"]
If I want to use row numbers, I need to do it like this instead:
df["PRICE"].iloc[2]
which sucks. The syntax is ugly. However, it works. I can get the value, and I can set the value - which is what I want.
If I want to find the Timestamp from a row, I can do like this:
df.index[row]
Question) Is there a more elegant syntax to get and set the value, when you always work with a row number? I always iterate over the row numbers, never iterate over Timestamps. I never use the Timestamp to access values, I always use row numbers.
Bonusquestion) If I have a Timestamp, how can I find the corresponding row number?
There is a way to do this.
First use df = df.reset_index().
"Timestamp" becomes a regular column of df, and you get a new integer index.
You can then access any row element with df.loc[] or df.iat[], and you can find any row with a specific element.
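Alternatively, without resetting the index, a hedged sketch of positional get/set plus the bonus-question lookup (toy data assumed):
import pandas as pd

idx = pd.to_datetime(["2014-10-14 10:00:29", "2014-10-14 10:00:30",
                      "2014-10-14 10:00:31"])
df = pd.DataFrame({"PRICE": [140, 90, 110]}, index=idx)

col = df.columns.get_loc("PRICE")  # column position once, up front
price = df.iat[2, col]             # get by row number
df.iat[2, col] = 115               # set by row number

# Bonus question: row number from a Timestamp.
row = df.index.get_loc(pd.Timestamp("2014-10-14 10:00:31"))  # -> 2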

Creating pandas series with all 1 values

I'm trying to generate a pandas timeseries where all values are 1.
start=str(timeseries.index[0].round('S'))
end=str(timeseries.index[-1].round('S'))
empty_series_index = pd.date_range(start=start, end=end, freq='2m')
empty_series_values = 1
empty_series = pd.Series(data=empty_series_values, index=empty_series_index)
print(start,end)
print(empty_series)
The printout reads
2019-09-20 00:30:51+00:00 2019-10-30 23:57:35+00:00
2019-09-30 00:30:51+00:00 1
Why is there only one value, even though it's a 2 min frequency and the range is more than 10 days long?
In the line:
empty_series_index = pd.date_range(start=start, end=end, freq='2m')
you are using the frequency string '2m', which actually means 2 months.
If you want minutes, use '2min' or '2T' (see the offset aliases in the documentation).
Hope this helps. Let me know if you have any more questions.
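A quick illustrative check (timestamps from the question's printout, simplified to tz-naive):
import pandas as pd

idx = pd.date_range("2019-09-20 00:30:51", "2019-10-30 23:57:35", freq="2min")
empty_series = pd.Series(1, index=idx)  # all-ones series at 2-minute steps
print(len(empty_series))  # tens of thousands of stamps; freq="2M" yields just one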

Mapping column values to a combination of another csv file's information

I have a dataset that indicates date & time in a 5-digit format: ddd + hm.
The ddd part counts days from 2009 Jan 1. Since the data was collected over a 2-year period from then, its [min, max] is [1, 365 x 2 = 730].
Data is observed at 30-minute intervals, so each 24-hour day has up to 48 slots, giving hm a [min, max] of [1, 48].
The daycode.csv file maps the ddd part of the daycode to a date and the hm part to a time (its excerpt is not reproduced here).
I think I agreed not to show the dataset, which is from ISSDA, so I will just describe that a daycode in the File1.txt file reads like '63317'.
A linked answer gave me a glimpse of how to approach this problem, and I was in the middle of putting together this code, which of course won't work at this point:
consume = pd.read_csv("data/File1.txt", sep= ' ', encoding = "utf-8", names =['meter', 'daycode', 'val'])
df1= pd.read_csv("data/daycode.csv", encoding = "cp1252", names =['code', 'print'])
test = consume[consume['meter']==1048]
test['daycode'] = test['daycode'].map(df1.set_index('code')['print'])
plt.plot(test['daycode'], test['val'], '.')
plt.title('test of meter 1048')
plt.xlabel('daycode')
plt.ylabel('energy consumption [kWh]')
plt.show()
Not all units (there are thousands) were observed for the full length, but 730 x 48 is a large combination to lay out in Excel by hand. Tbh, not an elegant approach, but I tried it by dragging, and it doesn't quite get there.
If I could read the first 3 digits of the column values and match them against a column of the other file, and the last 2 digits against another column, then combine.. is there a way?
For the digit-splitting described in your last two sentences, you can just do something like this:
df['first_3_digits'] = df['col1'].map(lambda x: str(x)[:3])
df['last_2_digits'] = df['col1'].map(lambda x: str(x)[-2:])
and for joining the two dataframes:
df3 = df.merge(df2, left_on=['first_3_digits', 'last_2_digits'], right_on=['col1_df2', 'col2_df2'], how='left')
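An end-to-end hedged sketch; the lookup-table column names and values here (day, date, slot, time) are hypothetical stand-ins for whatever daycode.csv actually contains:
import pandas as pd

consume = pd.DataFrame({"daycode": [63317, 63318]})  # toy stand-in for File1.txt
daymap = pd.DataFrame({"day": ["633"], "date": ["2010-09-25"]})                # hypothetical
slotmap = pd.DataFrame({"slot": ["17", "18"], "time": ["08:00", "08:30"]})     # hypothetical

s = consume["daycode"].astype(str)
consume["day"] = s.str[:3]    # ddd part
consume["slot"] = s.str[-2:]  # hm part

out = consume.merge(daymap, on="day", how="left").merge(slotmap, on="slot", how="left")
print(out)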