Pandas sum values between two dates in the most efficient way?

I have a dataset with production reported every week, and another reporting production every hour over several subproductions. I would now like to compare the sum of all this hourly subproduction with the value reported every week, in the most efficient way. How could I achieve this? I would like to avoid a for loop at all costs, as my dataset is really large.
So my dataset looks like this:
Weekly reported data:
Datetime_text | Total_Production_A
--------------------------|--------------------
2014-12-08 00:00:00.000 | 8277000
2014-12-15 00:00:00.000 | 8055000
2014-12-22 00:00:00.000 | 7774000
Hourly data:
Datetime_text | A_Prod_1 | A_Prod_2 | A_Prod_3 | ...... | A_Prod_N |
--------------------------|-----------|-----------|-----------|-----------|-----------|
2014-12-06 23:00:00.000 | 454 | 9 | 54 | 104 | 4 |
2014-12-07 00:00:00.000 | 0 | NaN | 0 | 23 | 3 |
2014-12-07 01:00:00.000 | 54 | 0 | 4 | NaN | 20 |
and so on. I would like to create a new table where the difference between the weekly reported data and the hourly reported data is calculated for all dates of the weekly reported data. So something like this:
Datetime_text | Diff_Production_A
--------------------------|------------------
2014-12-08 00:00:00.000 | 10
2014-12-15 00:00:00.000 | -100
2014-12-22 00:00:00.000 | 1350
where Diff_Production_A = Total_Production_A - sum(A_Prod_1, A_Prod_2, A_Prod_3, ..., A_Prod_N; over all datetimes of a week). How can I best achieve this?
Any help in this regard would be greatly appreciated :D
Best
fidu13

Store datetime as pd.Timestamp, then you can do all kinds of manipulation on the dates.
For your problem, the key is to group the hourly data by week (freq="W-MON" bins into weeks ending on Monday, matching your weekly timestamps), then merge it with the weekly data and calculate the differences:
weekly["Datetime"] = pd.to_datetime(weekly["Datetime_Text"])
hourly["Datetime"] = pd.to_datetime(hourly["Datetime_Text"])
hourly["HourlyTotal"] = hourly.loc[:, "A_Prod_1":"A_Prod_N"].sum(axis=1)
result = (
hourly.groupby(pd.Grouper(key="Datetime", freq="W-MON"))["HourlyTotal"]
.sum()
.to_frame()
.merge(
weekly[["Datetime", "Total_Production_A"]],
how="outer",
left_index=True,
right_on="Datetime",
)
.assign(Diff=lambda x: x["Total_Production_A"] - x["HourlyTotal"])
)
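Equivalently, if you set Datetime as the index you can use resample, which bins the same way as pd.Grouper; a minimal sketch under the same column-name assumptions as above:

import pandas as pd

# Assumes `hourly` and `weekly` already carry the parsed "Datetime"
# column from the snippet above; resample("W-MON") also bins by weeks
# ending on Monday.
hourly_totals = (
    hourly.set_index("Datetime")
    .loc[:, "A_Prod_1":"A_Prod_N"]
    .sum(axis=1)
    .resample("W-MON")
    .sum()
)

# Align on the weekly labels and subtract.
diff = weekly.set_index("Datetime")["Total_Production_A"].sub(hourly_totals)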

Related

Merging some columns from two postgres tables into a new table based on row value

Hello PostgreSQL experts (and maybe this is also a task for Perl's DBI, since I also happen to be working with it, but...). I might also have some terminology misused here, so bear with me.
I have a set of 32 tables, each one exactly like the others. The first column of every table always contains a date, while the second column contains values (integers) that can change once every 24 hours; some samples get back-dated. In many cases, a table may never contain data for a particular date. So here's an example of two such tables:
sums_1:                      sums_2:
date_list  | sum             date_list  | sum
-----------|-----            -----------|-----
2020-03-12 | 4               2020-03-09 | 1
2020-03-14 | 5               2020-03-11 | 3
                             2020-03-12 | 5
                             2020-03-13 | 9
                             2020-03-14 | 12
The idea is to merge the separate tables into one, sort of like a grid: each table's values go in their own column, placed in the correct date row, while ensuring that the date column (always the first column) is not missing any dates. Looking like this:
date_list | sum1 | sum2 | sum3 .... | sum32
---------------------------------------------------------
2020-03-08 | | |
2020-03-09 | | 1 |
2020-03-10 | | | 5
2020-03-11 | | 3 | 25
2020-03-12 | 4 | 5 | 35
2020-03-13 | | 9 | 37
2020-03-14 | 5 | 12 | 40
And so on: 33 columns, with rows from 2020-01-01 to date.
Now, I have tried doing a FULL OUTER JOIN and it succeeds. It's the subsequent attempts that get me into trouble, creating a long, cascading table with the values in the wrong place, or accidentally clobbering data. I know this works if I use a table of one column with a date sequence and join the first data table, just as a test of my theory using baby steps:
SELECT date_table.date_list, sums_1.sum
FROM date_table
FULL OUTER JOIN sums_1 ON date_table.date_list = sums_1.date_list;
2020-03-07 | 1
2020-03-08 |
2020-03-09 |
2020-03-10 | 2
2020-03-11 |
2020-03-12 | 4
Encouraged, I thought I'd get a little more ambitious with my testing, but that places some rows out of sequence at the bottom of the table, and I'm not sure whether I'm losing data or not. This time I tried USING as an alternative:
SELECT * FROM sums_1 FULL OUTER JOIN sums_2 USING (date_list);
Result:
fecha_sintomas | sum | sum
----------------+-------+-------
2020-03-09 | | 1
2020-03-11 | | 3
2020-03-12 | 4 | 5
2020-03-13 | | 9
2020-03-14 | 5 | 12
2020-03-15 | 6 | 15
2020-03-16 | 8 | 20
: : :
2020-10-29 | 10053 | 22403
2020-10-30 | 10066 | 22407
2020-10-31 | 10074 | 22416
2020-11-01 | 10076 | 22432
2020-11-02 | 10077 | 22434
2020-03-07 | 1 |
2020-03-10 | 2 |
(240 rows)
I think I'm getting close. In any case, how do I get to what I want, which is my grid of data described above? Maybe this is an iterative process that could benefit from using DBI?
Thanks,
You can full join like so:
select date_list, s1.sum as sum1, s2.sum as sum2, s3.sum as sum3
from sums_1 s1
full join sums_2 s2 using (date_list)
full join sums_3 s3 using (date_list)
order by date_list;
The USING syntax makes the unqualified column date_list unambiguous in the SELECT and ORDER BY clauses. We then need to enumerate the sum columns, providing an alias for each of them.
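If you also need the date column to be gapless, as in your desired grid, you can drive the joins from a generated date series instead; a sketch, assuming the same table names and a range starting at 2020-01-01:

-- Build a gapless date spine with generate_series, then left join each
-- sums table onto it.
SELECT gs.d::date AS date_list,
       s1.sum AS sum1,
       s2.sum AS sum2
       -- etc. for sums_3 .. sums_32
FROM generate_series(DATE '2020-01-01', CURRENT_DATE, INTERVAL '1 day') AS gs(d)
LEFT JOIN sums_1 s1 ON s1.date_list = gs.d::date
LEFT JOIN sums_2 s2 ON s2.date_list = gs.d::date
ORDER BY date_list;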

df.replace not having any effect when trying to replace dates in pandas dataframe

I've been through the various comments on here about df.replace but I'm still not able to get it working.
Here is a snippet of my code:
# Name columns
df_yearly.columns = ['symbol', 'date', ' annuual % price change']
# Change date format to D/M/Y
df_yearly['date'] = pd.to_datetime(df_yearly['date'], format='%d/%m/%Y')
The df_yearly dataframe looks like this:
| symbol | date | annuual % price change
---|--------|------------|-------------------------
0 | APX | 12/31/2017 |
1 | APX | 12/31/2018 | -0.502554278
2 | AURA | 12/31/2018 | -0.974450706
3 | BASH | 12/31/2016 | -0.998110828
4 | BASH | 12/31/2017 | 8.989361702
5 | BASH | 12/31/2018 | -0.083599574
6 | BCC | 12/31/2017 | 121718.9303
7 | BCC | 12/31/2018 | -0.998018734
I want to replace all dates of 12/31/2018 with 06/30/2018. The next section of my code is:
# Replace 31-12-2018 with 30-06-2018 as this is final date in monthly DF
df_yearly_1 = df_yearly.date.replace('31-12-2018', '30-06-2018')
print(df_yearly_1)
But the output is still coming as:
| 0 | 2017-12-31
| 1 | 2018-12-31
| 2 | 2018-12-31
| 3 | 2016-12-31
| 4 | 2017-12-31
| 5 | 2018-12-31
Is anyone able to help me with this? I thought this might be related to me having the date format incorrect in my df.replace statement but I've tried to search and replace 12-31-2018 and it's still not doing anything.
Thanks in advance!!
Try converting the column to string first with .astype(str).replace, and remember that replace returns a new Series, so assign the result back:
df_yearly['date'] = df_yearly['date'].astype(str).replace('2018-12-31', '2018-06-30')
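Alternatively, a sketch that keeps the column as datetime64 by replacing the value as a Timestamp rather than a string (assuming the column was parsed with pd.to_datetime as in your snippet):

import pandas as pd

# Series.replace matches the actual Timestamp values in a datetime64
# column; it returns a new Series, so assign the result back.
df_yearly['date'] = df_yearly['date'].replace(
    pd.Timestamp('2018-12-31'), pd.Timestamp('2018-06-30')
)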

I think I need a loop in an MS Access Query

I have a table of login and logout times for users, table looks something like below:
| ID | User | WorkDate | Start | Finish |
| 1 | Bill | 07/12/2017 | 09:00:00 | 17:00:00 |
| 2 | John | 07/12/2017 | 09:00:00 | 12:00:00 |
| 3 | John | 07/12/2017 | 12:30:00 | 17:00:00 |
| 4 | Mary | 07/12/2017 | 09:00:00 | 10:00:00 |
| 5 | Mary | 07/12/2017 | 10:10:00 | 12:00:00 |
| 6 | Mary | 07/12/2017 | 12:10:00 | 17:00:00 |
I'm running a query to find out the length of the breaks that each user took, by running a date diff between the Min of Finish and the Max of Start, then doing some other sums/queries to find out their break length.
This works where I have a maximum of two rows per User per WorkDate, so rows 1, 2, 3 give me workable data.
Rows 4, 5, 6 do not.
So, long story short: how can I calculate the break times based on the above data in an MS Access query? I'm assuming I'm going to need some looping statement, but I have no idea where to begin.
Here is a solution that comes to mind first.
First query: get the Min(Start) and Max(Finish) times per user per work date.
Second query: use that Min(Start) and Max(Finish) to calculate the total elapsed time for each user's day.
Third query: calculate the time worked for each shift (the difference between its start and finish times), then sum those per day.
Fourth query: take the difference between the elapsed time from the second query and the worked time from the third query. The difference gives you the amount of break time they took.
If you need additional help, I can provide some screenshots of example queries.
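For illustration, those four steps can also collapse into a single grouped query; a sketch in Access SQL, with tblLogins standing in as a placeholder for your table name:

SELECT [User], WorkDate,
       DateDiff("n", Min([Start]), Max([Finish]))
         - Sum(DateDiff("n", [Start], [Finish])) AS BreakMinutes
FROM tblLogins
GROUP BY [User], WorkDate;

Elapsed minutes from first login to last logout, minus the minutes worked across all shifts, leaves the break minutes: 0 for Bill, 30 for John, and 20 for Mary in the sample data. (User is a reserved word in Access, hence the brackets.)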

How to assign event counts to relative date values in SQL?

I want to line up multiple series so that all milestone dates are set to month zero, allowing me to measure the before-and-after effect of the milestone. I'm hoping to be able to do this using SQL server.
You can see an approximation of what I'm starting with at this data.stackexchange.com query. This sample query returns a table that basically looks like this:
+------------+-------------+---------+---------+---------+---------+---------+
| UserID | BadgeDate | 2014-01 | 2014-02 | 2014-03 | 2014-04 | 2014-05 |
+------------+-------------+---------+---------+---------+---------+---------+
| 7 | 2014-01-02 | 232 | 22 | 19 | 77 | 11 |
+------------+-------------+---------+---------+---------+---------+---------+
| 89 | 2014-04-02 | 345 | 45 | 564 | 13 | 122 |
+------------+-------------+---------+---------+---------+---------+---------+
| 678 | 2014-03-11 | 55 | 14 | 17 | 222 | 109 |
+------------+-------------+---------+---------+---------+---------+---------+
| 897 | 2014-03-07 | 234 | 56 | 201 | 19 | 55 |
+------------+-------------+---------+---------+---------+---------+---------+
| 789 | 2014-02-22 | 331 | 33 | 67 | 108 | 111 |
+------------+-------------+---------+---------+---------+---------+---------+
| 989 | 2014-01-09 | 12 | 89 | 97 | 125 | 323 |
+------------+-------------+---------+---------+---------+---------+---------+
This is not what I'm ultimately looking for. Values in month columns are counts of answers per month. What I want is a table with counts under relative month numbers as defined by BadgeDate (with BadgeDate month set to month 0 for each user, earlier months set to negative relative month #s, and later months set to positive relative month #s).
Is this possible in SQL? Or is there a way to do it in Excel with the above table?
After generating this table I plan on averaging relative month totals to plot a line graph that will hopefully show a noticeable inflection point at relative month zero. If there's no apparent bend, I can probably assume the milestone has a negligible effect on the Y-axis metric. (I'm not even quite sure what this kind of chart is called. I think Google might have been more helpful if I knew the proper terms for what I'm talking about.)
Any ideas?
This is precisely what the aggregate functions and case when ... then ... else ... end construct are for:
select
    UserID
    ,BadgeDate
    ,sum(case when AnswerDate = '2014-01' then 1 else 0 end) as '2014-01'
    -- etc. for the other month columns
from Answers  -- placeholder name for your source table of answers
group by
    UserID
    ,BadgeDate
The PIVOT clause is also available in some flavours and versions of SQL, but it is less flexible in general, so the traditional mechanism is worth understanding.
Likewise, the pivot-table feature in Excel can produce the same report, but there is value in doing as much of the aggregation as possible on the server when bandwidth is at a premium.
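To get the relative month numbering you describe, the same pattern can key off a month difference instead of a literal month; a sketch for SQL Server, again with Answers as a placeholder table holding one row per answer and its CreationDate:

select
    UserID
    ,BadgeDate
    ,sum(case when datediff(month, BadgeDate, CreationDate) = -1 then 1 else 0 end) as [month -1]
    ,sum(case when datediff(month, BadgeDate, CreationDate) = 0 then 1 else 0 end) as [month 0]
    ,sum(case when datediff(month, BadgeDate, CreationDate) = 1 then 1 else 0 end) as [month +1]
    -- etc. for the range of relative months you need
from Answers
group by
    UserID
    ,BadgeDate;

Month 0 is the badge month for every user, so averaging each relative-month column across users gives the series for your line graph.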

Can I convert between timezones in SQL Server?

Right now I'm storing a number of records in SQL Server with a DATETIME column that stores the current timestamp using GETUTCDATE(). This ensures that we're always storing the exact date without having to worry about questions like "well is this 2:00 my time or 2:00 your time?" Using UTC ensures that we know exactly when it happened regardless of timezone.
However, I have a query that essentially groups these records by date. A simplified version of this query looks something like this:
SELECT [created], SUM([amount]) AS [amount]
FROM (
    SELECT [amount], LEFT(CONVERT(VARCHAR, [created], 120), 10) AS [created]
    FROM (
        SELECT [amount], DATEADD(HOUR, -5, [created]) AS [created]
        FROM [sales]
        WHERE [organization] = 1
    ) AS s
) AS s
GROUP BY [created]
ORDER BY [created] ASC
Obviously this query is far from ideal--the whole reason I'm here is to ask how to improve it. First of all, it does (for the most part) accomplish the goal of what I'm looking for here--it has things grouped by dates and the other values aggregated accordingly. But what it doesn't accomplish is handling Daylight Savings Time correctly.
I live in Madison, WI, and we're on Central Time, so between March and November we're UTC-5; otherwise we're UTC-6. That's why you see the -5 in the code there as a quick hack to get it working.
The problem is that if I run this query, and there are records that fall on both sides of the daylight savings time changeover, it could potentially group things incorrectly. So for instance, if the table looks something like this:
+----+--------+---------------------+
| id | amount | created |
+----+--------+---------------------+
| 1 | 100.00 | 2010-04-02 06:00:00 |
| 2 | 50.00 | 2010-04-02 04:30:00 |
| 3 | 75.00 | 2010-04-02 03:00:00 |
| 4 | 150.00 | 2010-03-02 07:00:00 |
| 5 | 25.00 | 2010-03-02 05:30:00 |
| 6 | 50.00 | 2010-03-02 04:00:00 |
+----+--------+---------------------+
My query will return this:
+------------+--------+
| created | amount |
+------------+--------+
| 2010-03-01 | 50.00 |
| 2010-03-02 | 175.00 |
| 2010-04-01 | 125.00 |
| 2010-04-02 | 100.00 |
+------------+--------+
However, ideally it SHOULD return this:
+------------+--------+
| created | amount |
+------------+--------+
| 2010-03-01 | 75.00 |
| 2010-03-02 | 150.00 |
| 2010-04-01 | 125.00 |
| 2010-04-02 | 100.00 |
+------------+--------+
The trouble is that if I just subtract a fixed -5, then April is correct but March is not, but if I instead subtract a fixed -6 then March is correct but April is not. What I really need to do is convert to the appropriate time zone in a way that is aware of Daylight Savings Time and can adjust accordingly. Can I do this with SQL query? How do I write this query?
None of the current date/time functions are DST aware.
Using an auxiliary calendar table may be your best bet:
http://web.archive.org/web/20070611150639/http://sqlserver2000.databases.aspfaq.com/why-should-i-consider-using-an-auxiliary-calendar-table.html
You can store UTC offsets by date and reference them in your SELECT statement.
If you were able to store your data in a datetimeoffset field instead of datetime, this might help:
http://msdn.microsoft.com/en-us/library/bb630289.aspx
This data type and the corresponding functions are a new feature of SQL Server 2008.
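As a sketch of the calendar-table idea (table and column names here are placeholders, not taken from the linked article): keep one row per date with that date's UTC offset for your zone, then join on it before grouping.

-- calendar(date, utc_offset) is a placeholder: one row per date, with
-- that date's UTC offset for your zone (-6 in winter, -5 during DST).
SELECT CONVERT(CHAR(10), DATEADD(HOUR, c.utc_offset, s.[created]), 120) AS [created],
       SUM(s.[amount]) AS [amount]
FROM [sales] s
JOIN [calendar] c ON c.[date] = CONVERT(CHAR(10), s.[created], 120)
WHERE s.[organization] = 1
GROUP BY CONVERT(CHAR(10), DATEADD(HOUR, c.utc_offset, s.[created]), 120)
ORDER BY [created];

One offset per UTC date is a simplification: only rows within an hour of local midnight on the two transition days can still land on the wrong side, since US DST changes happen in the early morning.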