Nested for loops to calculate max. temperature from a csv - numpy

I'm starting a class on advanced data structures and I'm struggling with the problem shown below. NYC_temperature.csv has hourly temperatures; I have to aggregate them by day and then find the warmest 30-day period.

for i in range(len(data) - 24 * 30 - 1):
    temp = 0
    for j in range(i, i + 30):
        temp += data[j]
    maxtemp = max(temp, maxtemp)
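For reference, here is a minimal sketch of the kind of computation the assignment seems to ask for, assuming `data` holds one reading per hour. The synthetic ramp data and the daily-max aggregation are my own assumptions, not part of the assignment:

```python
import numpy as np

# Synthetic stand-in for NYC_temperature.csv: 90 days of hourly readings.
# (A simple ramp, so the warmest 30-day stretch is known to be the last one.)
hourly = np.arange(90 * 24, dtype=float)

# Step 1: collapse the 24 hourly readings of each day into one daily value.
daily = hourly.reshape(-1, 24).max(axis=1)

# Step 2: slide a 30-day window over the daily values, reusing the previous
# sum instead of re-adding 30 values on every step.
best_sum = daily[:30].sum()
best_start = 0
window = best_sum
for i in range(1, len(daily) - 29):
    window += daily[i + 29] - daily[i - 1]  # add new day, drop oldest day
    if window > best_sum:
        best_sum, best_start = window, i

print(best_start)  # 60: the ramp peaks in the final 30 days
```

The sliding update keeps the scan linear in the number of days instead of quadratic, which is usually the point of this kind of exercise.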

Related

Pandas rolling window on an offset between 4 and 2 weeks in the past

I have a datafile with quality scores from different suppliers over a time range of 3 years. The end goal is to use machine learning to predict the quality label (good or bad) of a shipment based on supplier information.
I want to use the mean historic quality score over a specific period of time as an input feature in this model, using a pandas rolling window. The problem with this method is that pandas only lets you create a window running from t = -x up to t = 0, as shown below:
df['average_score t-2w'] = df['score'].rolling(window='14d',closed='left').mean()
And this is where the problem comes in. For my feature I want to use quality data from a period of 2 weeks, but not the 2 weeks immediately before the corresponding shipment: the window should start at t = -4 weeks and end at t = -2 weeks.
You would imagine that this could be solved by using the same string of code but changing the window as presented below:
df['average_score t-2w'] = df['score'].rolling(window='28d' - '14d',closed='left').mean()
This, and every other notation of this specific window that I tried, does not seem to work. Pandas does not seem to offer a solution to this problem, so we built the following workaround:
def time_shift_week(df):
    def _avg_score_interval_func(series):
        current_time = series.index[-1]
        result = series[(series.index > (current_time - pd.Timedelta(value=4, unit='w')))
                        & (series.index < (current_time - pd.Timedelta(value=2, unit='w')))]
        return result.mean() if len(result) > 0 else 0.0

    temp_df = df.groupby(by=["supplier", "timestamp"], as_index=False).aggregate({"score": np.mean}).set_index('timestamp')
    temp_df["w-42"] = (
        temp_df
        .groupby(["supplier"])["score"]
        .apply(lambda x:
            x
            .rolling(window='30D', closed='both')
            .apply(_avg_score_interval_func)
        ))
    return temp_df.reset_index()
This results in a new df containing the average score per supplier per timestamp, which we can subsequently merge with the original data frame to obtain the new feature.
Doing it this way seems really cumbersome and overly complicated for the task I am trying to perform. Even though we have found a workaround, I am wondering if there is an easier method of doing this.
Is anyone aware of a less complicated way of performing this rolling window feature extraction?
While pandas does not have the custom date offset you need, calculating the mean is pretty simple: it's just sum divided by count. You can subtract the 14-day rolling window from the 28-day rolling window:
import numpy as np
import pandas as pd

# Some sample data. All scores are sequential for easy verification
idx = pd.MultiIndex.from_product(
    [list("ABC"), pd.date_range("2020-01-01", "2022-12-31")],
    names=["supplier", "timestamp"],
)
df = pd.DataFrame({"score": np.arange(len(idx))}, index=idx).reset_index()

# Rolling average of the score with the custom window.
# closed='left' means the current row is excluded from the window.
score = df.set_index("timestamp").groupby("supplier")["score"]
r28 = score.rolling("28d", closed="left")
r14 = score.rolling("14d", closed="left")
avg_score = (r28.sum() - r14.sum()) / (r28.count() - r14.count())
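As a quick sanity check on the subtraction trick, a reduced example with one supplier and one row per day (score equals the day number) makes the expected window mean easy to verify by hand; the data here is made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy data: one supplier, one row per day, score == day number.
ts = pd.date_range("2020-01-01", periods=60)
df = pd.DataFrame({"supplier": "A", "timestamp": ts, "score": np.arange(60.0)})

score = df.set_index("timestamp").groupby("supplier")["score"]
r28 = score.rolling("28d", closed="left")
r14 = score.rolling("14d", closed="left")
avg = (r28.sum() - r14.sum()) / (r28.count() - r14.count())

# At 2020-01-31 (day 30) the t-28d..t-14d window covers scores 2..15,
# whose mean is 8.5.
print(avg.loc[("A", pd.Timestamp("2020-01-31"))])
```

Rows whose lookback window is still empty divide 0 by 0 and come out as NaN, which is usually the right behaviour for a missing feature.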

Pandas create graph from Date and Time while them being in different columns

My data looks like this:
Creation Day Time St1 Time St2
0 28.01.2022 14:18:00 15:12:00
1 28.01.2022 14:35:00 16:01:00
2 29.01.2022 00:07:00 03:04:00
3 30.01.2022 17:03:00 22:12:00
It represents parts being at a given station. What I now need is something that counts how many rows have the same day and hour, e.g. how many parts were at the same station during a given hour.
Here, 2 were at station 1 on the 28th in the 14:00-15:00 timespan.
In the end I want a bar graph that shows production speed. Additionally, later in the project I want to highlight parts that haven't moved for >2 hrs.
Is it practical to create a datetime object for every Station (I have 5 in total)? Or is there a much simpler way to do this?
FYI I import this data from an excel sheet
I found the solution. As they are just strings I can concatenate them and parse the result with pd.to_datetime().
Example:
df["Time St1"] = pd.to_datetime(
    df["Creation Day"] + ' ' + df["Time St1"],
    infer_datetime_format=False, format='%d.%m.%Y %H:%M:%S'
)
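Building on that, counting parts per calendar hour (the stated end goal) can be sketched with a groupby on the floored timestamp; the sample frame below is assumed from the question's table:

```python
import pandas as pd

# Hypothetical sample in the same shape as the question's table.
df = pd.DataFrame({
    "Creation Day": ["28.01.2022", "28.01.2022", "29.01.2022", "30.01.2022"],
    "Time St1": ["14:18:00", "14:35:00", "00:07:00", "17:03:00"],
})
df["Time St1"] = pd.to_datetime(
    df["Creation Day"] + " " + df["Time St1"], format="%d.%m.%Y %H:%M:%S"
)

# Count how many parts were at station 1 in each calendar hour.
per_hour = df.groupby(df["Time St1"].dt.floor("h")).size()
print(per_hour)
```

The same pattern repeated for the other four station columns gives one count series per station, which can then be plotted as bars.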

Is there a way to count the individual instances of an event per year?

I am working on Apache Pig to get an understanding of working with large databases. The specific problem is, I need to count the number of days per year for all years listed in the dataset when the temperature in the recorded area was recorded to be above 80 degrees.
The data is set up in the following manner.
Date Max Temp
1919-06-03, 36
1919-11-26, 91
1927-09-23, 61
This repeats every day for about 200 years.
Currently, I know that to make this more manageable I will be using the split function, to split the data set based on the temp being above 80 degrees.
SPLIT data INTO max_above_80 IF max_t > 80;
I also figured that if you can get the year out of the date, you can group by it after splitting to get the intended results and count.
However, I could not find a method to extract the year part of the date.
In the end I need the output to give each year and the number of occurrences for that year, such as the following:
(1993, 21)
(1994, 7)
(1995, 13)
Use FILTER, then extract the year, group by year, and count the occurrences.
B = FILTER A BY max_t > 80;
C = FOREACH B GENERATE Date, GetYear(ToDate(Date, 'yyyy-MM-dd')) AS Year, max_t;
D = GROUP C BY Year;
E = FOREACH D GENERATE FLATTEN(group) AS Year, COUNT(C.max_t);
DUMP E;
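For comparison, the same filter / extract-year / group / count pipeline can be sketched in pandas; the inline CSV is toy data in the question's layout:

```python
import io
import pandas as pd

# Toy data in the question's (Date, Max Temp) layout.
csv = io.StringIO(
    "Date,Max Temp\n"
    "1993-06-03,86\n"
    "1993-11-26,91\n"
    "1994-09-23,81\n"
    "1994-01-02,61\n"
)
df = pd.read_csv(csv, parse_dates=["Date"])

# FILTER by temperature, extract the year, group, and count.
hot = df[df["Max Temp"] > 80]
counts = hot.groupby(hot["Date"].dt.year).size()
print(counts)  # 1993 -> 2, 1994 -> 1
```

Each pandas step maps one-to-one onto a Pig statement, which can help when checking the Pig script's output on a small sample.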

Make a plot of occurrences of a col by half hour of a second col

I have this df, and I would like a graph of how many rows there are per half hour, without including the day:
3272 8711600410367 2019-03-11T20:23:45.415Z d7ec8e9c5b5df11df8ec7ee130552944 home 2019-03-11T20:23:45.415Z DISPLAY None
3273 8711600410367 2019-03-11T20:23:51.072Z d7ec8e9c5b5df11df8ec7ee130552944 home 2019-03-11T20:23:51.072Z DISPLAY None
Here is my try:
df["Created"] = pd.to_datetime(df["Created"])
df.groupby(df.Created.dt.hour).size().plot()
But it's not by half hour; I would like to show all half hours on my graph.
One way you could do this is split up coding for hours and half-hours, and then bring them together. To illustrate, I extended your data example a bit:
import pandas as pd
df = pd.DataFrame({'Created':['2019-03-11T20:23:45.415Z', '2019-03-11T20:23:51.072Z', '2019-03-11T20:33:03.072Z', '2019-03-11T21:10:10.072Z']})
df["Created"] = pd.to_datetime(df["Created"])
First create a 'Hours column':
df['Hours'] = df.Created.dt.hour
Then create a column that codes half hours. That is, if the minutes are 30 or more, count it as the half hour.
df['HalfHours'] = [0.5 if x >= 30 else 0 for x in df.Created.dt.minute]
Then bring them together again:
df['Hours_and_HalfHours'] = df['Hours']+df['HalfHours']
Finally, count the number of rows by groupby, and plot:
df.groupby(df['Hours_and_HalfHours']).size().plot()
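An alternative sketch that avoids the separate hour and half-hour columns is to compute the half-hour slot in one step with integer division on the minutes; the sample timestamps are the same made-up ones as above:

```python
import pandas as pd

df = pd.DataFrame({"Created": [
    "2019-03-11T20:23:45.415Z", "2019-03-11T20:23:51.072Z",
    "2019-03-11T20:33:03.072Z", "2019-03-11T21:10:10.072Z",
]})
df["Created"] = pd.to_datetime(df["Created"])

# Half-hour slot of the day: the hour plus 0.5 once the minutes reach 30.
slot = df["Created"].dt.hour + (df["Created"].dt.minute // 30) * 0.5
counts = df.groupby(slot).size()
print(counts)  # 20.0 -> 2, 20.5 -> 1, 21.0 -> 1
```

`counts.plot(kind="bar")` then gives the half-hour bar graph; `minute // 30` also handles the boundary at exactly 30 minutes.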

How can I plot a graph in VB to show a large amount of data

I want to plot a graph to show the variation in net connectivity over one hour. I found the average speed over one hour and the differences from that speed, and I added the difference percentages to a ListBox. I have to show the numbers for one hour in a graph. How can I plot the graph? Any suggestions please.
Dim Per As Double
For x As Integer = 0 To ListBox2.Items.Count - 1
    Per = Math.Abs((avg - Val(ListBox2.Items.Item(x).ToString)) / avg) * 100
    ListBox3.Items.Add(Per)
Next
I have to plot all the numbers ListBox3 contains, more than 3000.
I found it:
Dim s As New Series() ' from System.Windows.Forms.DataVisualization.Charting
For y As Integer = 0 To ListBox3.Items.Count - 1
    s.Points.AddXY(y, Val(ListBox3.Items.Item(y).ToString))
Next
Chart1.Series.Add(s)