Duplicity incremental backup taking too long - amazon-s3

I have duplicity running an incremental daily backup to S3, about 37 GiB.
For the first month or so it went fine, finishing in about an hour. Then it started taking far too long to complete. Right now, as I type, the daily backup that started 7 hours ago is still running.
I'm running two commands, first the backup and then the cleanup:
duplicity --full-if-older-than 1M LOCAL.SOURCE S3.DEST --volsize 666 --verbosity 8
duplicity remove-older-than 2M S3.DEST
The logs
Temp has 54774476800 available, backup will use approx 907857100.
So the temp has enough space, good. Then it starts with this...
Copying duplicity-full-signatures.20161107T090303Z.sigtar.gpg to local cache.
Deleting /tmp/duplicity-ipmrKr-tempdir/mktemp-13tylb-2
Deleting /tmp/duplicity-ipmrKr-tempdir/mktemp-NanCxQ-3
[...]
Copying duplicity-inc.20161110T095624Z.to.20161111T103456Z.manifest.gpg to local cache.
Deleting /tmp/duplicity-ipmrKr-tempdir/mktemp-VQU2zx-30
Deleting /tmp/duplicity-ipmrKr-tempdir/mktemp-4Idklo-31
[...]
This continues for each day up to today, taking several minutes per file. Then it continues with this...
Ignoring incremental Backupset (start_time: Thu Nov 10 09:56:24 2016; needed: Mon Nov 7 09:03:03 2016)
Ignoring incremental Backupset (start_time: Thu Nov 10 09:56:24 2016; needed: Wed Nov 9 18:09:07 2016)
Added incremental Backupset (start_time: Thu Nov 10 09:56:24 2016 / end_time: Fri Nov 11 10:34:56 2016)
After a long time...
Warning, found incomplete backup sets, probably left from aborted session
Last full backup date: Sun Mar 12 09:54:00 2017
Collection Status
-----------------
Connecting with backend: BackendWrapper
Archive dir: /home/user/.cache/duplicity/700b5f90ee4a620e649334f96747bd08
Found 6 secondary backup chains.
Secondary chain 1 of 6:
-------------------------
Chain start time: Mon Nov 7 09:03:03 2016
Chain end time: Mon Nov 7 09:03:03 2016
Number of contained backup sets: 1
Total number of contained volumes: 2
Type of backup set: Time: Num volumes:
Full Mon Nov 7 09:03:03 2016 2
-------------------------
Secondary chain 2 of 6:
-------------------------
Chain start time: Wed Nov 9 18:09:07 2016
Chain end time: Wed Nov 9 18:09:07 2016
Number of contained backup sets: 1
Total number of contained volumes: 11
Type of backup set: Time: Num volumes:
Full Wed Nov 9 18:09:07 2016 11
-------------------------
Secondary chain 3 of 6:
-------------------------
Chain start time: Thu Nov 10 09:56:24 2016
Chain end time: Sat Dec 10 09:44:31 2016
Number of contained backup sets: 31
Total number of contained volumes: 41
Type of backup set: Time: Num volumes:
Full Thu Nov 10 09:56:24 2016 11
Incremental Fri Nov 11 10:34:56 2016 1
Incremental Sat Nov 12 09:59:47 2016 1
Incremental Sun Nov 13 09:57:15 2016 1
Incremental Mon Nov 14 09:48:31 2016 1
[...]
After listing all chains:
Also found 0 backup sets not part of any chain, and 1 incomplete backup set.
These may be deleted by running duplicity with the "cleanup" command.
This was only the backup part. It spends hours doing all of this, and only about 10 minutes uploading 37 GiB to S3.
ElapsedTime 639.59 (10 minutes 39.59 seconds)
SourceFiles 288
SourceFileSize 40370795351 (37.6 GB)
Then comes the cleanup, which gives me this:
Cleaning up
Local and Remote metadata are synchronized, no sync needed.
Warning, found incomplete backup sets, probably left from aborted session
Last full backup date: Sun Mar 12 09:54:00 2017
There are backup set(s) at time(s):
Tue Jan 10 09:58:05 2017
Wed Jan 11 09:54:03 2017
Thu Jan 12 09:56:42 2017
Fri Jan 13 10:05:05 2017
Sat Jan 14 10:24:54 2017
Sun Jan 15 09:49:31 2017
Mon Jan 16 09:39:41 2017
Tue Jan 17 09:59:05 2017
Wed Jan 18 09:59:56 2017
Thu Jan 19 10:01:51 2017
Fri Jan 20 09:35:30 2017
Sat Jan 21 09:53:26 2017
Sun Jan 22 09:48:57 2017
Mon Jan 23 09:38:45 2017
Tue Jan 24 09:54:29 2017
Which can't be deleted because newer sets depend on them.
Found old backup chains at the following times:
Mon Nov 7 09:03:03 2016
Wed Nov 9 18:09:07 2016
Sat Dec 10 09:44:31 2016
Mon Jan 9 10:04:51 2017
Rerun command with --force option to actually delete.

I found the problem. Because of an issue, I followed this answer, and added this code to my script:
rm -rf ~/.cache/deja-dup/*
rm -rf ~/.cache/duplicity/*
This was supposed to be a one-time thing because of a random bug duplicity had, but the answer didn't mention that. So every day the script was removing the local cache right after syncing it, and on the next day duplicity had to download the whole cache (all those signature and manifest files) from S3 again, which is what was taking hours.

Get the number of unique days with overlapping dates (in SAS)

I couldn't explain the problem briefly, so I'll try to explain it this way. Let's say I have a table similar to the one below.
How do I get, for each student, the total number of days in October on which that student has at least 1 book checked out?
Please note that a single student can check out more than 1 book at a time, which causes the overlapping dates.
Student   Book                                     Date_Borrowed   Date_Returned
David     A Thousand Splendid Suns                 01 Oct 2021     05 Oct 2021
David     Jane Eyre                                09 Oct 2021     13 Oct 2021
David     Please Look After Mom                    21 Oct 2021     29 Oct 2021
Fiona     Sense and Sensibility                    05 Oct 2021     14 Oct 2021
Fiona     The Girl Who Saved the King of Sweden    05 Oct 2021     14 Oct 2021
Fiona     A Fort of Nine Towers                    02 Oct 2021     17 Oct 2021
Fiona     One Hundred Years of Solitude            20 Oct 2021     30 Oct 2021
Fiona     The Unbearable Lightness of Being        20 Oct 2021     30 Oct 2021
Greg      Fahrenheit 451                           06 Oct 2021     11 Oct 2021
Greg      One Hundred Years of Solitude            10 Oct 2021     17 Oct 2021
Greg      Please Look After Mom                    15 Oct 2021     21 Oct 2021
Greg      4 3 2 1                                  20 Oct 2021     27 Oct 2021
Greg      The Girl Who Saved the King of Sweden    27 Oct 2021     03 Nov 2021
Marcus    Fahrenheit 451                           01 Oct 2021     04 Oct 2021
Marcus    Nectar in a Sieve                        15 Oct 2021     15 Oct 2021
Marcus    Please Look After Mom                    30 Oct 2021     31 Oct 2021
Priya     Like Water for Chocolate                 02 Oct 2021     21 Oct 2021
Priya     Fahrenheit 451                           21 Oct 2021     22 Oct 2021
Sasha     Baudolino                                03 Oct 2021     29 Oct 2021
Sasha     A Thousand Splendid Suns                 07 Oct 2021     16 Oct 2021
Sasha     A Fort of Nine Towers                    26 Oct 2021     01 Nov 2021
Thanks in advance!
Using a data step, you can expand each borrowing period into long format, one row per day. From there, you can use SQL to do a simple count by student after removing overlapping dates.
data foo;
set have;
do date = date_borrowed to date_returned;
output;
end;
keep student date;
format date date9.;
run;
This gets us a long table of all the dates with at least one book checked out for each student.
student date
David 01OCT2021
David 02OCT2021
David 03OCT2021
David 04OCT2021
David 05OCT2021
David 09OCT2021
...
Now we need to remove the overlapping dates.
proc sort data=foo nodupkey;
by student date;
run;
From here, we can do a simple SQL count per student.
proc sql noprint;
create table want as
select student
, intnx('month', date, 0, 'B') as month format=monyy7.
, count(*) as days_checked_out
from foo
where calculated month = '01OCT2021'd
group by student, calculated month
;
quit;
Output:
student month days_checked_out
David OCT2021 19
Fiona OCT2021 27
Greg OCT2021 26
Marcus OCT2021 7
Priya OCT2021 21
Sasha OCT2021 29
An easy way is to make a temporary array with one element for each day in the period you want to count. Then use a DO loop to set the elements representing the covered days to 1. When you reach the last record for a student, take the sum to find the number of days covered.
First let's convert your posted table into a dataset.
data have;
infile cards dsd dlm='|' truncover;
input Student :$20. Book :$100. (Date_Borrowed Date_Returned) (:date.);
format Date_Borrowed Date_Returned date11.;
cards;
David|A Thousand Splendid Suns|01 Oct 2021|05 Oct 2021
David|Jane Eyre|09 Oct 2021|13 Oct 2021
David|Please Look After Mom|21 Oct 2021|29 Oct 2021
Fiona|Sense and Sensibility|05 Oct 2021|14 Oct 2021
Fiona|The Girl Who Saved the King of Sweden|05 Oct 2021|14 Oct 2021
Fiona|A Fort of Nine Towers|02 Oct 2021|17 Oct 2021
Fiona|One Hundred Years of Solitude|20 Oct 2021|30 Oct 2021
Fiona|The Unbearable Lightness of Being|20 Oct 2021|30 Oct 2021
Greg|Fahrenheit 451|06 Oct 2021|11 Oct 2021
Greg|One Hundred Years of Solitude|10 Oct 2021|17 Oct 2021
Greg|Please Look After Mom|15 Oct 2021|21 Oct 2021
Greg|4 3 2 1|20 Oct 2021|27 Oct 2021
Greg|The Girl Who Saved the King of Sweden|27 Oct 2021|03 Nov 2021
Marcus|Fahrenheit 451|01 Oct 2021|04 Oct 2021
Marcus|Nectar in a Sieve|15 Oct 2021|15 Oct 2021
Marcus|Please Look After Mom|30 Oct 2021|31 Oct 2021
Priya|Like Water for Chocolate|02 Oct 2021|21 Oct 2021
Priya|Fahrenheit 451|21 Oct 2021|22 Oct 2021
Sasha|Baudolino|03 Oct 2021|29 Oct 2021
Sasha|A Thousand Splendid Suns|07 Oct 2021|16 Oct 2021
Sasha|A Fort of Nine Towers|26 Oct 2021|01 Nov 2021
;
Now we can use BY-group processing in a data step to aggregate per student. We can set the lower and upper bounds of the array to the integer values SAS uses to represent those days. Temporary arrays are automatically retained across observations, so we just need to clear the array when we start a new student.
The SAS compiler does not expect to see a date literal as an array index boundary, so we can use %SYSEVALF() to convert the date literal to the integer it represents.
data want;
set have;
by student ;
array october [%sysevalf('01oct2021'd):%sysevalf('31oct2021'd)] _temporary_ ;
if first.student then call missing(of october[*]);
do date=max(date_borrowed,'01oct2021'd) to min(date_returned,'31oct2021'd);
october[date]=1;
end;
if last.student;
days = sum(0, of october[*]);
keep student days;
run;
Results:
Obs Student days
1 David 19
2 Fiona 27
3 Greg 26
4 Marcus 7
5 Priya 21
6 Sasha 29
You could also modify it slightly to not only count the number of "covered" (or unique) days, but also the total number of "book" days.
data want;
set have;
by student ;
array october [%sysevalf('01oct2021'd):%sysevalf('31oct2021'd)] _temporary_ ;
if first.student then call missing(of october[*]);
do date=max(date_borrowed,'01oct2021'd) to min(date_returned,'31oct2021'd);
october[date]=sum(october[date],1);
end;
if last.student;
unique_days = n(of october[*]);
book_days = sum(0,of october[*]);
keep student unique_days book_days;
run;
Results:
Obs  Student   unique_days   book_days
1    David          19           19
2    Fiona          27           58
3    Greg           26           34
4    Marcus          7            7
5    Priya          21           22
6    Sasha          29           43

Showing data order from Monday-Sunday full week only and hide non-full week data

Sorry if I'm asking newbie questions here.
I want to create a weekly report, and for this report I want full data from Monday to Sunday.
Condition:
Last 4 weeks only
Showing full week (Monday - Sunday)
Hide the result if it's not full week
If I use getdate() - 14 and I access the data on a Wednesday, it will start counting from the Wednesday two weeks ago instead of from last Monday. Meanwhile, I want the report to show full weeks only.
Can anyone share how to do that in SQL?
Here I provide sample data:
DATE -- TOTAL_PERSON
- Fri, 1 Jun 2018 -- 10
- Sat, 2 Jun 2018 -- 4
- Sun, 3 Jun 2018 -- 12
- Mon, 4 Jun 2018 -- 15
- Tue, 5 Jun 2018 -- 10
- Wed, 6 Jun 2018 -- 3
- Thu, 7 Jun 2018 -- 1
- Fri, 8 Jun 2018 -- 13
- Sat, 9 Jun 2018 -- 9
- Sun, 10 Jun 2018 -- 23
- Mon, 11 Jun 2018 -- 5
- Tue, 12 Jun 2018 -- 3
- Wed, 13 Jun 2018 -- 1
- Thu, 14 Jun 2018 -- (TODAY)
In this case, if I am accessing the data on Thu, 14 Jun 2018, I want to get the TOTAL_PERSON data from Mon, 4 Jun 2018 to Sun, 10 Jun 2018 only, and not show data from the rest, since those weeks are not full.
Can anyone help me do that?
Thanks a lot!
I think you want:
where datediff(week, date, getdate()) <= 2
This counts the number of week boundaries crossed between the two dates, so it keeps whole weeks rather than an exact 14-day window.
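Note that DATEDIFF(week, ...) in SQL Server counts Sunday-based week boundaries regardless of SET DATEFIRST, so if the weeks must run strictly Monday through Sunday you may need to anchor the window on a Monday explicitly. The following is only a sketch of that variation, assuming a hypothetical table named daily_totals with the sample columns [DATE] and TOTAL_PERSON:
SET DATEFIRST 1;  -- treat Monday as day 1 of the week
DECLARE @this_monday date =
    DATEADD(DAY, 1 - DATEPART(WEEKDAY, GETDATE()), CAST(GETDATE() AS date));
SELECT [DATE], TOTAL_PERSON
FROM daily_totals                                 -- hypothetical table name
WHERE [DATE] >= DATEADD(WEEK, -4, @this_monday)   -- last 4 complete calendar weeks
  AND [DATE] <  @this_monday;                     -- drop the current, incomplete week
This drops the current, incomplete week and limits the result to the last four calendar weeks; if weeks that are only partially present in the table (like Fri 1 - Sun 3 Jun in the sample) must be hidden as well, an extra per-week check such as grouping by week and requiring COUNT(*) = 7 would be needed.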
For MySQL, you can use such a select:
SELECT * FROM `myDB` WHERE `Date`
BETWEEN DATE_SUB(NOW()-INTERVAL DATE_FORMAT(CURRENT_DATE, '%w') DAY, INTERVAL 28 DAY)
AND NOW()- INTERVAL DATE_FORMAT(CURRENT_DATE, '%w') DAY
This uses the ability to turn the current day of the week into a number and subtract it to get the last Sunday. From there, we select an interval of 28 days.
(Only tested with 14 days and a very limited test dataset, but it should work.)

What does the "+"-sign in the duration column of the last command in OS X terminal mean?

I am running last to check my kids' computer usage and I am getting output like this:
Ruben console Wed Mar 22 09:17 - crash (02:17)
Ruben console Tue Mar 14 15:56 - crash (6+06:52)
Ruben console Sat Mar 4 12:08 - crash (1+04:48)
Ruben console Tue Feb 28 16:24 - crash (3+01:48)
Ruben console Mon Feb 13 06:47 - crash (10:39)
Ruben console Tue Jan 31 16:27 - crash (01:16)
If I understand things correctly, the first line describes a 2 minutes, 17 seconds session. How do I read the following, though? What does (6+06:52) mean?

Trying to pull the required rows from a single table by applying conditional statements on columns in SQL Server

I have tried a number of ways to solve this, but unfortunately I got stuck every time.
source table
id year jan feb mar apr may jun jul aug sep oct nov dec
1234 2014 05 06 12 15 16 17 18 19 20 21 22 23
1234 2013 05 06 12 15 16 17 18 19 20 21 22 23
Task: Assume that we are currently in March 2014 and we need the previous 12 months of data (i.e., Mar 2013 through Feb 2014); the remaining month values need to be zero, leaving id and year as they are.
Solution:
id year jan feb mar apr may jun jul aug sep oct nov dec
1234 2014 05 06 0 0 0 0 0 0 0 0 0 0
1234 2013 0 0 12 15 16 17 18 19 20 21 22 23
This needs a code solution for SQL Server 2008. I would be very happy if anybody can solve this.
Note:
I got stuck trying to pull the column names dynamically.
You can try this:
select id, [year],
       case when datediff(month, convert(datetime, cast([year] as varchar(4)) + '-01-01'), getdate()) between 1 and 12 then jan else 0 end as jan,
       case when datediff(month, convert(datetime, cast([year] as varchar(4)) + '-02-01'), getdate()) between 1 and 12 then feb else 0 end as feb
       -- ...repeat the same CASE expression for mar through dec
from source_table   -- replace source_table with your table name

Best way to store aggregated values

We need to store aggregated values for different accounts, which summarise various numbers on a month/year basis. These numbers would be updated each time the data is updated (usually once or twice every 24 hours).
I'm expecting the data to be the results of PIVOT functions e.g.:
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2011 0 0 0 0 0 0 95 33 34 24 36 52
Each account will need different aggregates e.g. "Count Of Customers", "Count Of Orders" and "Value Of Sales" and I'm not sure whether it would be best to add a key to the data or use separate tables e.g.:
Year Key Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2011 CntOrders 0 0 0 0 0 0 95 33 34 24 36 52
2011 CntCust 0 0 0 0 0 0 95 33 34 24 36 52
2011 ValOrders 0 0 0 0 0 0 95 33 34 24 36 52
Or
dbo.CountOfOrders
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2011 0 0 0 0 0 0 95 33 34 24 36 52
dbo.ValueOfOrders
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2011 0 0 0 0 0 0 95 33 34 24 36 52
I've read a number of posts suggesting both NoSQL and SQL Server so I'm not sure which way we should go or how to decide.
We can't justify a dedicated cube at the moment but I'm wondering if it would be better to store the values in a NoSQL database or whether we should stick with SQL Server?
I'd stick with SQL. However, if you are worried about the time it takes to rebuild such a PIVOT table, don't be, because you don't necessarily have to build a table with a unique key.
Build it with the key plus a process datetime and just append to the main pivot table. During creation of the incrementals, each load is bounded by your transaction timestamp (begin and end). There shouldn't be much bloat; if there is, you can collapse the process dates in a weekend job.
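A minimal sketch of that append-only layout, assuming hypothetical names throughout (a monthly_pivot table keyed by account, metric and year, plus a process_dt load timestamp); reporting goes through a view that keeps only the most recent load per key:
create table monthly_pivot (
    account_id  int           not null,
    metric_key  varchar(20)   not null,   -- e.g. 'CntOrders', 'CntCust', 'ValOrders'
    [year]      smallint      not null,
    jan decimal(18,2), feb decimal(18,2), mar decimal(18,2), apr decimal(18,2),
    may decimal(18,2), jun decimal(18,2), jul decimal(18,2), aug decimal(18,2),
    sep decimal(18,2), oct decimal(18,2), nov decimal(18,2), [dec] decimal(18,2),
    process_dt  datetime      not null default getdate()   -- load timestamp
);
go
-- Each load appends rows; the view exposes only the latest row per key,
-- so a weekend job can purge older process_dt values without touching reports.
create view monthly_pivot_current
as
select account_id, metric_key, [year],
       jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, [dec]
from (
    select p.*,
           row_number() over (partition by account_id, metric_key, [year]
                              order by process_dt desc) as rn
    from monthly_pivot p
) latest
where rn = 1;
go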
Set up a job to run stored procedures that insert data into tables.
Store the data like Account,Year,Month,Value
Use views of these tables for reporting multiple aggregates.
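Under the same caveat that every name here (account_metrics, account_metrics_by_year, amount) is made up for illustration, a rough sketch of that approach: the scheduled procedures write narrow Account/Year/Month/Value rows, and a pivoted view reproduces the Year/Jan..Dec shape for reporting.
create table account_metrics (
    account_id  int           not null,
    metric      varchar(20)   not null,   -- 'CntOrders', 'CntCust', 'ValOrders'
    [year]      smallint      not null,
    [month]     tinyint       not null,   -- 1 to 12
    amount      decimal(18,2) not null,
    constraint pk_account_metrics primary key (account_id, metric, [year], [month])
);
go
-- Pivot the narrow rows back into one column per month for reports.
create view account_metrics_by_year
as
select account_id, metric, [year],
       sum(case when [month] = 1  then amount else 0 end) as Jan,
       sum(case when [month] = 2  then amount else 0 end) as Feb,
       sum(case when [month] = 3  then amount else 0 end) as Mar,
       sum(case when [month] = 4  then amount else 0 end) as Apr,
       sum(case when [month] = 5  then amount else 0 end) as May,
       sum(case when [month] = 6  then amount else 0 end) as Jun,
       sum(case when [month] = 7  then amount else 0 end) as Jul,
       sum(case when [month] = 8  then amount else 0 end) as Aug,
       sum(case when [month] = 9  then amount else 0 end) as Sep,
       sum(case when [month] = 10 then amount else 0 end) as Oct,
       sum(case when [month] = 11 then amount else 0 end) as Nov,
       sum(case when [month] = 12 then amount else 0 end) as [Dec]
from account_metrics
group by account_id, metric, [year];
go
Adding a new aggregate type then only means adding new metric values, not new tables or columns.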
Definitely stick with SQL. There is no reason to add technical overhead for such a simple task.