Rounding off a list of numbers to a user-defined step while preserving their sum - objective-c

I've been reading a lot of posts about rounding numbers, but I couldn't manage to do what I want:
I have a list of positive floats.
The unsigned integer roundOffStep to use is user-defined; I have no control over it.
I want to round the numbers as accurately as possible while preserving their sum, or at least while keeping the new sum below the original sum.
How would I do that? I am terrible with algorithms, so this is way too tricky for me.
Thanks.
EDIT: Adding a test case:
FLOATS
29.20
18.25
14.60
8.76
2.19
sum = 73;
Let's say roundOffStep = 5;
ROUNDED FLOATS
30
15
15
10
0
sum = 70 < 73 OK

1. Round all numbers to the nearest multiple of roundOffStep normally.
2. If the new sum is lower than the original sum, you're done.
3. For each number, calculate rounded_number - original_number. Sort this list of differences in decreasing order so that you can find the numbers with the largest difference.
4. Pick the number that gives the largest difference rounded_number - original_number, and subtract roundOffStep from that number.
5. Repeat step 4 (picking the next largest difference each time) until the new sum is less than the original.
This process should ensure that the rounded numbers are as close as possible to the originals, without going over the original sum.
Example, with roundOffStep = 5:
Original Numbers | Rounded | Difference
-----------------+---------+-----------
           29.20 |      30 |       0.80
           18.25 |      20 |       1.75
           14.60 |      15 |       0.40
            8.76 |      10 |       1.24
            2.19 |       0 |      -2.19
-----------------+---------+-----------
         Sum: 73 |      75 |
The sum is too large, so we pick the number giving the largest difference (18.25 which was rounded to 20) and subtract 5 to give 15. Now the sum is 70, so we're done.
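In code, the approach above might look like the following Python sketch (the question is tagged objective-c, but the idea carries over directly; the function name and the greedy loop are just one way to express the steps above):

def round_preserving_sum(values, step):
    """Round each value to the nearest multiple of `step`, then keep
    subtracting `step` from the entries that were rounded up the most
    until the rounded sum no longer exceeds the original sum."""
    original_sum = sum(values)
    rounded = [round(v / step) * step for v in values]
    # Indices sorted by how much rounding increased each value, largest first.
    by_overshoot = sorted(range(len(values)),
                          key=lambda i: rounded[i] - values[i],
                          reverse=True)
    k = 0
    while sum(rounded) > original_sum and k < len(by_overshoot):
        rounded[by_overshoot[k]] -= step
        k += 1
    return rounded

print(round_preserving_sum([29.20, 18.25, 14.60, 8.76, 2.19], 5))
# [30, 15, 15, 10, 0], sum 70 <= 73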

Related

How to redistribute outliers over the previous time period?

Imagine a dataframe that looks like this:
1
2
3
4
5
6
7
50
16
17
Normally we would apply an algorithm from "Detect and exclude outliers in a pandas DataFrame" to remove the 50 entirely; however, my particular dataset instead requires me to distribute the value of the 50 over the previous 7 days:
8
9
10
11
12
13
14
15
16
17
How can I make this work in Pandas? I can detect the outliers pretty easily, but I'm not sure how to spread their values out into previous days. Note that a simple moving average doesn't work well for this type of data, as there would still be a jump in the average value when the 50 shows up. What I need is to smooth the 50 out into the previous days so that no jump is visible.
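One way this could be sketched in pandas (this is only an illustration, not a reference answer: it assumes a simple z-score test for outlier detection, uses the mean of the preceding window as the "expected" value, and spreads the excess evenly over the preceding rows, so the exact output differs from the hand-made numbers above; spread_outliers is a hypothetical helper name):

import numpy as np
import pandas as pd

def spread_outliers(s: pd.Series, window: int = 7, z_thresh: float = 2.0) -> pd.Series:
    """Replace each detected outlier with a local estimate and spread the
    removed excess evenly over the preceding `window` rows."""
    out = s.astype(float).reset_index(drop=True)
    z = (out - out.mean()) / out.std()
    for i in np.flatnonzero(z.to_numpy() > z_thresh):
        lo = max(0, i - window)
        if lo == i:
            continue                          # no previous rows to spread over
        estimate = out.iloc[lo:i].mean()      # simple choice for what the point "should" have been
        excess = out.iloc[i] - estimate
        out.iloc[i] = estimate
        out.iloc[lo:i] += excess / (i - lo)   # distribute the excess evenly over previous rows
    return out

# Data from the question: one jump (50) in an otherwise smooth series.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 50, 16, 17])
print(spread_outliers(s))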

What is the size of metadata in a postgres table?

There is a table in Postgres 9.4 with the following columns:
NAME | TYPE | TYPE SIZE
id | integer | 4 bytes
timestamp | timestamp with time zone | 8 bytes
num_seconds | double precision | 8 bytes
count | integer | 4 bytes
total | double precision | 8 bytes
min | double precision | 8 bytes
max | double precision | 8 bytes
local_counter | integer | 4 bytes
global_counter | integer | 4 bytes
discrete_value | integer | 4 bytes
Giving in total: 60 bytes per row
The size of the table (with TOAST) returned by pg_table_size(table) is 49 152 bytes.
Number of rows in the table: 97
Taking into account that a table is split into pages of 8 kB, this table occupies 49 152 / 8 192 = 6 pages.
Each page and each row has some meta-data...
Looking at the pure datatype sizes, we should expect something around 97 * 60 = 5 820 bytes of row data; even adding roughly the same amount again for metadata, we are not landing anywhere close to the 49 152 bytes returned by pg_table_size.
Does metadata really take ~9x space compared to the pure data in postgres?
A factor of 9 is clearly more wasted space ("bloat") than there should be:
Each page has a 24-byte header.
Each row has a 23-byte "tuple header".
There will be four bytes of padding between id and timestamp and between count and total for alignment reasons (you can avoid that by reordering the columns).
Moreover, each tuple has a "line pointer" of four bytes in the data page.
See this answer for some details.
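As a rough back-of-the-envelope check (assuming the usual 8-byte alignment on a 64-bit build): each row needs the 23-byte tuple header padded to 24 bytes, plus 68 bytes of data (60 bytes of columns and 8 bytes of alignment padding), giving 92 bytes, which rounds up to 96, plus a 4-byte line pointer, so roughly 100 bytes per row. 97 rows are therefore only about 9.7 kB, or two 8 kB pages; six pages means most of the table is free space or dead tuples, which pgstattuple will show.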
To see exactly how the space in your table is used, install the pgstattuple extension:
CREATE EXTENSION pgstattuple;
and use the pgstattuple function on the table:
SELECT * FROM pgstattuple('tablename');

Why does two's-complement multiplication need to do sign extension?

In the book Computer Systems: A Programmer's Perspective (2.3.5), the method for two's-complement multiplication is described as follows:
Signed multiplication in C generally is performed by truncating the 2w-bit product to w bits.
Truncating a two's-complement number to w bits is equivalent to first computing its value modulo 2^w and then converting from unsigned to two's-complement.
So, for identical bit-level operands, why is unsigned multiplication different from two's-complement multiplication? Why does two's-complement multiplication need sign extension?
To get the same bit-level result for unsigned and two's-complement addition, we can convert the two's-complement arguments to unsigned, perform unsigned addition, and finally convert back to two's-complement.
Since multiplication consists of multiple additions, why do the full (untruncated) representations of unsigned and two's-complement multiplication differ?
Figure 2.27 demonstrates the example below:
+------------------+----------+---------+-------------+-----------------+
| Mode | x | y | x · y | Truncated x · y |
+------------------+----------+---------+-------------+-----------------+
| Unsigned | 5 [101] | 3 [011] | 15 [001111] | 7 [111] |
| Two's complement | –3 [101] | 3 [011] | –9 [110111] | –1 [111] |
+------------------+----------+---------+-------------+-----------------+
If you multiply 101 by 011, you will get 1111 (which is equal to 001111). How did they get 110111 for two's complement case then?
The catch here is that to get a correct 6-bit two's-complement product you need to multiply 6-bit two's-complement numbers. So you first need to convert -3 and 3 to their 6-bit two's-complement representations (-3 = 111101, 3 = 000011) and only then multiply them: 111101 * 000011 = 10110111. Finally, truncate the result to 6 bits to get the 110111 from the table above.
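A small Python sketch of the same calculation may make this concrete (Python integers are unbounded, so masking with 2^w - 1 plays the role of truncation; the helper name bits is just for illustration):

W = 3

def bits(x, w):
    """Two's-complement bit pattern of x, w bits wide."""
    return x & ((1 << w) - 1)

x, y = -3, 3

# Unsigned multiplication of the 3-bit patterns 101 and 011:
unsigned_prod = bits(x, W) * bits(y, W)                      # 5 * 3 = 15 -> 001111

# Two's-complement multiplication: sign-extend the operands to 6 bits
# first (-3 -> 111101), multiply, then keep the low 6 bits:
signed_prod = bits(bits(x, 2 * W) * bits(y, 2 * W), 2 * W)   # -> 110111 (= -9)

print(format(unsigned_prod, "06b"), format(signed_prod, "06b"))
# Truncating both to 3 bits gives the same pattern 111:
print(format(unsigned_prod & 0b111, "03b"), format(signed_prod & 0b111, "03b"))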

CodeChef C_HOLIC2 Solution Find the smallest N whose factorial produces P Trailing Zeroes

For CodeChef problem C_HOLIC2, I tried iterating over the candidates 5, 10, 15, 20, 25, ... and, for each number, counting the trailing zeros in its factorial using the efficient technique specified over here, but I got TLE.
What is the fastest way to solve this using the formula method?
Here is the Problem Link
As we know, the trick used for counting the number of trailing zeros in the factorial of a number (take 500! as an example) is:
The number of multiples of 5 that are less than or equal to 500 is 500÷5=100
Then, the number of multiples of 25 is 500÷25=20
Then, the number of multiples of 125 is 500÷125=4
The next power of 5 is 625, which is > than 500.
Therefore, the number of trailing zeros of 500! is 100+20+4=124.
For a detailed explanation, check this page.
Thus, this count can be represented as:
count = floor(N/5) + floor(N/25) + floor(N/125) + ... (summing floor(N/5^i) until 5^i > N)
Using this trick, given a number N you can determine count, the number of trailing zeros in its factorial. Codechef Problem Link
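In Python, the counting direction looks roughly like this (trailing_zeros is just an illustrative name):

def trailing_zeros(n):
    """Number of trailing zeros in n!, i.e. the sum of n // 5**i for i >= 1."""
    count, p = 0, 5
    while p <= n:
        count += n // p
        p *= 5
    return count

print(trailing_zeros(500))   # 124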
Now, suppose we are given the number of trailing zeros, count, and are asked for the smallest number N whose factorial has count trailing zeros Codechef Problem Link
Here the question is: how can we split count into this representation?
This is not straightforward, as the following examples show.
The count jumps even though N keeps increasing by the same amount.
As you can see from the following table, count jumps at values of N whose factorials contain higher powers of 5 as factors, e.g. 25, 50, ..., 125, ...
+-------+-----+
| count | N |
+-------+-----+
| 1 | 5 |
+-------+-----+
| 2 | 10 |
+-------+-----+
| 3 | 15 |
+-------+-----+
| 4 | 20 |
+-------+-----+
| 6 | 25 |
+-------+-----+
| 7 | 30 |
+-------+-----+
| 8 | 35 |
+-------+-----+
| 9 | 40 |
+-------+-----+
| 10 | 45 |
+-------+-----+
| 12 | 50 |
+-------+-----+
| ... | ... |
+-------+-----+
| 28 | 120 |
+-------+-----+
| 31 | 125 |
+-------+-----+
| 32 | 130 |
+-------+-----+
| ... | ... |
+-------+-----+
You can see from any brute-force program for this task that these jumps occur regularly: the jumps caused by factors of 25 occur at count = 6, 12, 18, 24, ... (interval = 6 = 1×5+1).
After count = 31 (i.e. N = 125), factorials also contain a factor of 125. The jumps corresponding to 25 still occur with the same frequency, i.e. at count = 31, 37, 43, ...
The next jump corresponding to 125 will be at 31+31 = 62. Thus, jumps corresponding to 125 occur at count = 31, 62, 93, 124 (interval = 31 = 6×5+1).
The jump corresponding to 625 will occur at count = 31×5+1 = 156.
Thus you can see there exists a pattern. We need to find the formula for this pattern to proceed.
The series formed is 1, 6, 31, 156, ...
which is 1, 1+5, 1+5+5^2, 1+5+5^2+5^3, ...
Thus, the nth term is the sum of the first n terms of a G.P. with a = 1 and r = 5, i.e. t_n = (5^n - 1)/4.
Thus, the count can be something like 31+31+6+1+1, etc.
We need to find the t_n that is less than (or equal to) count but closest to it, i.e. the largest n with t_n = (5^n - 1)/4 <= count, which gives n = floor(log5(4*count + 1)).
Say the number is count = 35; using this we identify that t_n = 31 is the closest. For count = 63 we again get t_n = 31 as the closest, but note that here 31 can be subtracted twice from count = 63. So we keep finding this n and subtracting t_n from count until count becomes 0.
The algorithm used is:
from math import floor, log

count = int(input())                    # the required number of trailing zeros
N = 0
while count != 0:
    n = floor(log(4 * count + 1, 5))    # largest n with (5**n - 1)/4 <= count
                                        # (floating-point log may need care near exact powers of 5)
    baseSum = (5 ** n - 1) // 4         # t_n = 1 + 5 + ... + 5**(n-1)
    baseOffset = (5 ** n) * (count // baseSum)   # integer division
    count = count % baseSum
    N += baseOffset
print(N)
Here, 5**n denotes 5^n.
Let's try working this out for an example. Say count = 70:
Iteration 1: n = floor(log5(281)) = 3, baseSum = 31, baseOffset = 125 * (70 // 31) = 250, count = 70 mod 31 = 8, N = 250
Iteration 2: n = floor(log5(33)) = 2, baseSum = 6, baseOffset = 25 * (8 // 6) = 25, count = 8 mod 6 = 2, N = 275
Iteration 3: n = floor(log5(9)) = 1, baseSum = 1, baseOffset = 5 * (2 // 1) = 10, count = 2 mod 1 = 0, N = 285
So the smallest N whose factorial has 70 trailing zeros is 285 (285! has 57 + 11 + 2 = 70 of them).
Take another example. Say count = 124, which is the one discussed at the beginning of this page:
Iteration 1: n = floor(log5(497)) = 3, baseSum = 31, baseOffset = 125 * (124 // 31) = 500, count = 124 mod 31 = 0, N = 500
So N = 500, matching the 500! example above.
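As a quick cross-check, reusing the trailing_zeros sketch from earlier:

print(trailing_zeros(285), trailing_zeros(284))   # 70 69  -> 285 is the smallest such N
print(trailing_zeros(500))                        # 124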

Tabulate Command Stata

I don't know if Stata can do this, but I use the tabulate command a lot in order to find frequencies. For instance, I have a success variable which takes on values 0 or 1, and I would like to know the success rate for a certain group of observations, i.e. tab success if group==1. I was wondering if I can do sort of the inverse of this operation. That is, I would like to know if I can find a value of "group" for which the frequency is greater than or equal to 15%, for example.
Is there a command that does this?
Thanks
As an example
sysuse auto
gen success=mpg<29
Now I want to find the value of price such that the frequency of the success variable is greater than 75% for example.
According to @Nick:
ssc install groups
sysuse auto
count
74
* return list                              // optional
local nobs = r(N)                          // r(N) gives the total number of observations
groups rep78, sel(f > (0.15*`r(N)'))       // gives the groups for which freq > 15%
+---------------------------------+
| rep78 Freq. Percent % <= |
|---------------------------------|
| 3 30 43.48 57.97 |
| 4 18 26.09 84.06 |
+---------------------------------+
groups rep78, sel(f > (0.10*`nobs'))       // more than 10%
+----------------------------------+
| rep78 Freq. Percent % <= |
|----------------------------------|
| 2 8 11.59 14.49 |
| 3 30 43.48 57.97 |
| 4 18 26.09 84.06 |
| 5 11 15.94 100.00 |
+----------------------------------+
I'm not sure if I fully understand your question/situation, but I believe this might be useful. You can egen a variable that is equal to the mean of success, by group, and then see which observations have the value for mean(success) that you're looking for.
egen avgsuccess = mean(success), by(group)
tab group if avgsuccess >= 0.15
list group if avgsuccess >= 0.15
Does that accomplish what you want?