Ruby on Rails iterate through column efficiently - sql

created_at iteration group_hits_per_iteration
--------------------------------------------------------------------
2019-11-08 08:14:05.170492 300 34
2019-11-08 08:14:05.183277 300 24
2019-11-08 08:14:05.196785 300 63
2019-11-08 08:14:05.333424 300 22
2019-11-08 08:14:05.549140 300 1
2019-11-08 08:14:05.576509 300 15
2019-11-08 08:44:05.832730 301 69
2019-11-08 08:44:05.850111 301 56
2019-11-08 08:44:05.866771 301 18
2019-11-08 08:44:06.310749 301 14
Hello
My goal is to sum the values in 'group_hits_per_iteration' for each unique value in the 'iteration' column; the totals will then be graphed using chartkick.
For example, for iteration 300 I would sum together 34,24,63,22,1,15 for a total of 159, then repeat for each unique entry.
The code I've included below does work and generates the required output but it's slow and gets slower the more data is read into the database.
It creates a hash that is fed into chartkick.
hsh = {}
Group.pluck(:iteration).uniq.each do |x|
  date = Group.where("iteration = #{x}").pluck(:created_at).first.localtime
  itsum = Group.where("iteration = #{x}").pluck('SUM(group_hits_per_iteration)' )
  hsh[date] = itsum
end
<%= line_chart [
  {name: "#{@groupdata1.first.networkid}", data: hsh}
] %>
I'm looking for other ways to approach this. I was thinking of having SQL do the heavy lifting instead of doing the calculations in Rails, but I'm not really sure how to approach that.
Thanks for the help.

If you just want the sums for every iteration, the following code should work:
# new lines only for readability
group_totals =
  Group
    .select('iteration, min(created_at) AS created_at, sum(group_hits_per_iteration) AS hits')
    .group('iteration')
    .order('iteration') # I suppose you want the results in some order
group_totals.each do |group|
  group.iteration # => 300
  group.hits # => 159
  group.created_at # => 2019-11-08 08:14:05.170492
end
In this case all the hard work is done by the database; you just read the results in your Ruby code.
Note: your code takes the first created_at returned for each iteration, whereas I took the earliest one.
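To see the same aggregation outside Rails, here is a minimal sketch using Python's built-in sqlite3 module. The table name group_records and the in-memory database are assumptions made for illustration; the rows are the ten sample rows from the question. The point is only that one GROUP BY query replaces the per-iteration loop.

```python
import sqlite3

# Sample rows from the question: (created_at, iteration, group_hits_per_iteration).
rows = [
    ("2019-11-08 08:14:05.170492", 300, 34),
    ("2019-11-08 08:14:05.183277", 300, 24),
    ("2019-11-08 08:14:05.196785", 300, 63),
    ("2019-11-08 08:14:05.333424", 300, 22),
    ("2019-11-08 08:14:05.549140", 300, 1),
    ("2019-11-08 08:14:05.576509", 300, 15),
    ("2019-11-08 08:44:05.832730", 301, 69),
    ("2019-11-08 08:44:05.850111", 301, 56),
    ("2019-11-08 08:44:05.866771", 301, 18),
    ("2019-11-08 08:44:06.310749", 301, 14),
]

# Hypothetical table name; an in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE group_records ("
    "created_at TEXT, iteration INTEGER, group_hits_per_iteration INTEGER)"
)
conn.executemany("INSERT INTO group_records VALUES (?, ?, ?)", rows)

# One query: the database groups by iteration and sums the hits.
totals = conn.execute(
    """
    SELECT iteration,
           MIN(created_at) AS created_at,
           SUM(group_hits_per_iteration) AS hits
    FROM group_records
    GROUP BY iteration
    ORDER BY iteration
    """
).fetchall()

# Shape the result like the hash fed to the chart: earliest timestamp -> total hits.
chart_data = {created_at: hits for _iteration, created_at, hits in totals}
```

The number of queries stays at one no matter how many iterations the table holds, which is where the original loop loses time.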

Related

translate Dataframe using crosswalk in julia

I have a very large dataframe (original_df) with columns of codes
14 15
21 22
18 16
And a second dataframe (crosswalk) which maps 'old_codes' to 'new_codes'
14 104
15 105
16 106
18 108
21 201
22 202
Of course, the resultant df (resultant_df) that I would like would have values:
104 105
201 202
108 106
I am aware of two ways to accomplish this. First, I could iterate through each code in original_df, find the code in crosswalk, then rewrite the corresponding cell in original_df with the translated code from crosswalk. The faster and more natural option would be to leftjoin() each column of original_df on 'old_codes'. Unfortunately, it seems I have to do this separately for each column, and then delete each column after its conversion column has been created -- this feels unnecessarily complicated. Is there a simpler way to convert all of original_df at once using the crosswalk?
You can do the following (I am using column numbers as you have not provided column names):
d = Dict(crosswalk[!, 1] .=> crosswalk[!, 2])
resultant_df = select(original_df, [i => ByRow(x -> d[x]) for i in 1:ncol(original_df)], renamecols=false)
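The idea behind the snippet above is language-independent: turn the crosswalk into a dictionary once, then translate every cell with a lookup. A minimal Python sketch of that idea follows; plain tuples stand in for the data frame rows, and the variable names are illustrative only.

```python
# Stand-ins for the question's data: rows of old codes and the crosswalk mapping.
original_rows = [(14, 15), (21, 22), (18, 16)]
crosswalk = {14: 104, 15: 105, 16: 106, 18: 108, 21: 201, 22: 202}

# Translate every cell with one dictionary lookup. A code missing from the
# crosswalk raises KeyError, much as d[x] would fail in the Julia version.
resultant_rows = [tuple(crosswalk[code] for code in row) for row in original_rows]
```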

What is the difference of "dom_content_loaded.histogram.bin.start/end" in Google's BigQuery?

I need to build a histogram, concerning DOMContentLoaded of a webpage. When I used BigQuery, I noticed that apart from density, there are 2 more attributes (start, end). In my head there should only be 1 attribute, the DOMContentLoaded event is only fired when the DOM has loaded.
Can anyone help clarify the difference between .start and .end? These attributes always differ by 100 milliseconds (if start = X ms, then end = X+100 ms). See the query example posted below.
I can not understand what these attributes represent exactly:
dom_content_loaded.histogram.bin.START
AND
dom_content_loaded.histogram.bin.END
Q: Which one of them represents the time that the DOMContentLoaded event
is fired in a user's browser?
SELECT
bin.START AS start,
bin.END AS endd
FROM
`chrome-ux-report.all.201809`,
UNNEST(dom_content_loaded.histogram.bin) AS bin
WHERE
origin = 'https://www.google.com'
Output:
Row |start | end
1 0 100
2 100 200
3 200 300
4 300 400
[...]
The following explains the meaning of bin.start, bin.end and bin.density.
Run the SELECT statement below:
SELECT
origin,
effective_connection_type.name type_name,
form_factor.name factor_name,
bin.start AS bin_start,
bin.end AS bin_end,
bin.density AS bin_density
FROM `chrome-ux-report.all.201809`,
UNNEST(dom_content_loaded.histogram.bin) AS bin
WHERE origin = 'https://www.google.com'
You will get 1550 rows in the result.
Below are the first 5 rows:
Row origin type_name factor_name bin_start bin_end bin_density
1 https://www.google.com 4G phone 0 100 0.01065
2 https://www.google.com 4G phone 100 200 0.01065
3 https://www.google.com 4G phone 200 300 0.02705
4 https://www.google.com 4G phone 300 400 0.02705
5 https://www.google.com 4G phone 400 500 0.0225
You can read them as:
for phones on 4G, the DOMContentLoaded event fired within 100 milliseconds for 1.065% of loads; between 100 and 200 milliseconds for 1.065%; between 200 and 300 milliseconds for 2.705%; and so on.
To summarize: for each origin, connection type and form factor you get a histogram represented as a repeated record, with the start and end of each bin along with a density that gives the fraction of user experiences falling in that bin.
Note: if you add up the dom_content_loaded densities across all dimensions for a single origin, you will get 1 (or a value very close to 1 due to approximations).
For example
SELECT SUM(bin.density) AS total_density
FROM `chrome-ux-report.all.201809`,
UNNEST(dom_content_loaded.histogram.bin) AS bin
WHERE origin = 'https://www.google.com'
returns
Row total_density
1 0.9995999999999978
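As a rough illustration of how the bins are meant to be used, the sketch below adds up densities to answer "what share of loads fired DOMContentLoaded within N milliseconds". It uses only the five 4G/phone rows shown above, so the result covers that one combination; the function name is made up for this example.

```python
import math

# First five bins from the result table: (start_ms, end_ms, density), 4G / phone only.
bins = [
    (0, 100, 0.01065),
    (100, 200, 0.01065),
    (200, 300, 0.02705),
    (300, 400, 0.02705),
    (400, 500, 0.0225),
]

def share_loaded_within(bins, limit_ms):
    """Sum the densities of all bins that end at or before limit_ms."""
    return sum(density for _start, end, density in bins if end <= limit_ms)

# Share of the origin's loads that were on a 4G phone and fired within 300 ms.
share_300 = share_loaded_within(bins, 300)
```

So neither start nor end is "the" event time: each row says how often the event fired somewhere between the two.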
Hope this helped

Composite indexing using Redis in a hierarchical data model

I have a data model like this:
Fields:
counter number (e.g. 00888, 00777, 00123 etc)
counter code (e.g. XA, XD, ZA, SI etc)
start date (e.g. 2017-12-31 ...)
end date (e.g. 2017-12-31 ...)
Other counter date (e.g. xxxxx)
Current Datastructure organization is like this (root and multiple child format):
counter_num + counter_code
---> start_date + end_date --> xxxxxxxx
---> start_date + end_date --> xxxxxxxx
---> start_date + end_date --> xxxxxxxx
Example:
00888 + XA
---> Jan 10 + Jan 20 --> xxxxxxxx
---> Jan 21 + Jan 31 --> xxxxxxxx
---> Feb 01 + Dec 31 --> xxxxxxxx
00888 + ZI
---> Jan 09 + Feb 24 --> xxxxxxxx
---> Feb 25 + Dec 31 --> xxxxxxxx
00777 + XA
---> Jan 09 + Feb 24 --> xxxxxxxx
---> Feb 25 + Dec 31 --> xxxxxxxx
Today the retrieval happens in 2 ways:
//Fetch unique counter data using all the composite keys
counter_number + counter_code + date (start_date <= date <= end_date)
//Fetch all the counter codes and corresponding data matching the below conditions
counter_number + date (start_date <= date <= end_date)
What's the best way to model this in Redis? I need to cache some of the frequently hit data. I feel sorted sets should do this somehow, but I am unable to model it.
UPDATE:
Just to remove the confusion, the ask here is not for an SQL "BETWEEN" like query. 'Coz I don't know what the start_date and end_date values are. Think they are just column names.
What I don't want is
SELECT * FROM redis_db
WHERE counter_num AND
date_value BETWEEN start_date AND end_date
What I want is
SELECT * FROM redis_db
WHERE counter_num AND
start_date <= specific_date AND end_date >= specific_date
NOTE: The requirement is pretty much close to 2D indexing of what is proposed in Redis multi-dimensional indexing document
https://redis.io/topics/indexes#multi-dimensional-indexes
I understood the concept but am unable to digest the implementation details given there.
I'm unlikely to get this done in time for the bounty, but what the hell...
This sounds like a job for geohashing. Geohashing is what you do when you want to index a 2-dimensional (or higher) dataset. For example, if you have a database of cities and you want to be able to quickly respond to queries like "find all the cities within 50km of X", you use geohashing.
For the purposes of this question, you can think of start_date and end_date as x and y coordinates. Normally in geohashing you're searching for points in your dataset near a particular point in space, or in a certain bounded region of space. In this case you just have a lower bound on one of the coordinates and an upper bound on the other one. But I suppose in practice the whole dataset is bounded anyway, so that's not a problem.
It would be nice if there was a library for doing this in Redis. There probably is, if you look hard enough. The newer versions of Redis have built-in geohashing functionality. See the commands starting with GEO. But it doesn't claim to be very accurate, and it's designed for the surface of a sphere rather than a flat surface.
So as far as I can see you have 3 options:
1. Map your search space to a small part of the sphere, preferably near the equator, and use the Redis GEO commands. To search, use GEORADIUS on a circle covering the triangle you're trying to search, taking into account the inbuilt inaccuracy and the distortion you get by mapping onto the sphere, then filter the results to keep the ones that are actually inside the triangle.
2. Find some 3rd-party geohashing client for Redis which works on flat space and is more accurate than GEO.
3. Read the rest of this answer, or some other primer on geohashing, then implement it yourself on top of Redis. This is the hardest (but most educational) option.
If you have a database that indexes data using a numerical ordering, such that you can do queries like "find all the rows/records for which z is between a and b", you can build a geohash index on top of it. Suppose the coordinates are (non-negative) integers x and y. Then you add an integer-valued column z, and index by z. To calculate z, write x and y in binary, then take alternate digits from each. Example:
x = 969 = 0 1 1 1 1 0 0 1 0 0 1
y = 1130 = 1 0 0 0 1 1 0 1 0 1 0
z = 1750214 = 0110101011010011000110
Note that the index allows you to find, for example, all records positioned with z between 0101100000000000000000 and 0101101111111111111111 inclusive. In other words, all records for which z starts with 010110. Or to put it another way, you can find all records for which x starts with 001 and y starts with 110. This set of records corresponds to a square in the 2-dimensional space we are trying to search.
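The bit interleaving described above can be sketched in a few lines of Python. The function name and the fixed width of 11 bits are assumptions chosen to match the worked example; a real index would pick a width that covers the whole coordinate range.

```python
def interleave_bits(x: int, y: int, width: int) -> int:
    """Build z by alternating the bits of x and y, most significant first, x leading."""
    z = 0
    for shift in range(width - 1, -1, -1):
        z = (z << 1) | ((x >> shift) & 1)  # next bit of x
        z = (z << 1) | ((y >> shift) & 1)  # next bit of y
    return z

# The worked example from the text: x = 969, y = 1130, each written with 11 bits.
z = interleave_bits(969, 1130, width=11)
z_bits = format(z, "022b")  # 22-character binary string, zero-padded on the left
```

Two points whose x and y share leading bits also share a prefix of z, which is why a prefix range on z corresponds to a square.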
Not all squares can be searched in this way. We'll call these ones searchable squares. Suppose the client sends a request for all records for which (x,y) is inside a particular rectangle. (Or a circle, or some other reasonable geometric shape.) Then you need to find a set of searchable squares which cover the rectangle. Then, for each of these squares you've chosen, query the database for records inside that square and send the results to the client. (But you'll have to filter the results, because not all the records in the square are actually in the original rectangle.)
There's a balance to be struck. If you choose a small number of large searchable squares, you'll probably end up covering a much larger area of the map than you need; the query to the database will return lots of extra results that you'll have to filter out. Alternatively, if you use lots of little searchable squares, you'll be doing lots of queries to the database, many of which will return no results.
I said above that x and y could be start_date and end_date. But actually the distribution of your dataset won't be as symmetrical as in most uses of geohashing. So the performance might be better (or worse) if you use x = end_date + start_date and y = end_date - start_date.
Because your question remains a bit vague on how you desire to query your data, it remains unclear on how to solve your question. With that in mind, however, here are my thoughts on how I might model your data:
Updated answer, detailing how to use SORTED SET
I have edited this answer to be able to store your values in a way that you can query by dynamic date ranges. This edit assumes that your database values are timestamps, as in the value is for a single time, not 2, as in your current setup.
Yes, you are correct that using Sorted Sets will be able to accomplish this. I suggest that you always use a Unix timestamp value for the score component in these sorted sets.
In case you are not already familiar with Redis, let me explain its indexing limitations. Redis is a simple key-value store designed to retrieve values quickly by key. Because of this design, it lacks many features of a traditional DBMS, such as indexing a column.
In Redis, you accomplish indexing by using a key. The most nested key-like structures are available in HASH and SORTED SET, but you only get 2 key-like levels. In a HASH, you have the key (same as any data type) and an inner hash key, which can take the form of any string.
In a SORTED SET, you have the key (same as any data type), and a numeric value.
A HASH is nice to use to keep a grouped data organized.
A SORTED SET is nice if you want to query by a range of values. This could be a good fit for your data.
Your SORTED SET would look like the following:
key
00888:XA =>
score (date value) value
1452427200 (2016-01-10) xxxxxxxx
1452859200 (2016-01-15) yyyyxxxx
1453291200 (2016-01-20) zzzzxxxx
Let's use a more intuitive example, the 2017 Juventus roster:
To produce the SORTED SET in the table below, issue this command in your redis client:
ZADD JUVENTUS 32 "Emil Audero" 1 "Gianluigi Buffon" 42 "Mattia Del Favero" 36 "Leonardo Loria" 25 "Neto" 15 "Andrea Barzagli" 4 "Medhi Benatia" 19 "Leonardo Bonucci" 3 "Giorgio Chiellini" 40 "Luca Coccolo" 29 "Paolo De Ceglie" 26 "Stephan Lichtsteiner" 12 "Alex Sandro" 24 "Daniele Rugani" 43 "Alessandro Semprini" 23 "Dani Alves" 22 "Kwadwo Asamoah" 7 "Juan Cuadrado" 6 "Sami Khedira" 18 "Mario Lemina" 46 "Mehdi Leris" 38 "Rolando Mandragora" 8 "Claudio Marchisio" 14 "Federico Mattiello" 45 "Simone Muratore" 20 "Marko Pjaca" 5 "Miralem Pjanic" 28 "Tomás Rincón" 27 "Stefano Sturaro" 21 "Paulo Dybala" 9 "Gonzalo Higuaín" 34 "Moise Kean" 17 "Mario Mandzukic"
Jersey Name Jersey Name
32 Emil Audero 23 Dani Alves
1 Gianluigi Buffon 42 Mattia Del Favero
36 Leonardo Loria 25 Neto
15 Andrea Barzagli 4 Medhi Benatia
19 Leonardo Bonucci 3 Giorgio Chiellini
40 Luca Coccolo 29 Paolo De Ceglie
26 Stephan Lichtsteiner 12 Alex Sandro
24 Daniele Rugani 43 Alessandro Semprini
22 Kwadwo Asamoah 7 Juan Cuadrado
6 Sami Khedira 18 Mario Lemina
46 Mehdi Leris 38 Rolando Mandragora
8 Claudio Marchisio 14 Federico Mattiello
45 Simone Muratore 20 Marko Pjaca
5 Miralem Pjanic 28 Tomás Rincón
27 Stefano Sturaro 21 Paulo Dybala
9 Gonzalo Higuaín 34 Moise Kean
17 Mario Mandzukic
To query the roster by a range of jersey numbers:
ZRANGEBYSCORE JUVENTUS 1 5
Output:
1) "Gianluigi Buffon"
2) "Giorgio Chiellini"
3) "Medhi Benatia"
4) "Miralem Pjanic"
Note that the scores are not returned; however, the ZRANGEBYSCORE command returns the results in ascending order by score.
To add the scores, append "WITHSCORES" to the command, like so: ZRANGEBYSCORE JUVENTUS 1 5 WITHSCORES
By using ZRANGEBYSCORE, you should be able to query any key (counter number + counter code) with a date range, producing the values in that range.
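For readers without a Redis instance at hand, the range lookup can be imitated in plain Python. This is only a stand-in that mirrors the inclusive score-range behaviour shown above, not Redis itself; the helper name range_by_score is made up for the sketch, and the data is the example sorted set for key 00888:XA.

```python
import bisect

# (score, member) pairs kept sorted by score, as a sorted set would keep them.
entries = [
    (1452427200, "xxxxxxxx"),
    (1452859200, "yyyyxxxx"),
    (1453291200, "zzzzxxxx"),
]
scores = [score for score, _member in entries]

def range_by_score(min_score, max_score):
    """Return members whose score lies in [min_score, max_score], ascending by score."""
    lo = bisect.bisect_left(scores, min_score)   # first entry with score >= min_score
    hi = bisect.bisect_right(scores, max_score)  # one past the last entry with score <= max_score
    return [member for _score, member in entries[lo:hi]]

# Members stamped between the first and second timestamps, both ends included.
first_two = range_by_score(1452427200, 1452859200)
```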
Original: Below is my original answer, recommending HASH
Based on your examples, I recommend you use a HASH.
With a hash, you would have a main key to find the hash (Ex. 00888:XA). Then within the hash, you have key -> value pairs (Ex. 2017-01-10:2017-01-20 -> xxxxxxxx). I prefer to delimit or tokenize my keys' components with the colon char :, but you can use any delimiter.
HASH follows your example data structure very well:
key
00888:XA =>
hashkey value
2017-01-10:2017-01-20 xxxxxxxx
2017-01-21:2017-01-31 yyyyxxxx
2016-02-01:2016-12-31 zzzzxxxx
key
00888:ZI =>
hashkey value
2017-01-10:2017-01-20 xxxxxxxx
2017-01-21:2017-01-31 xxxxyyyy
2016-02-01:2016-12-31 xxxxzzzz
When querying for data, instead of GET key, you would query with HGET key hashkey. Same for setting values, instead of SET key value, use HSET key hashkey value.
Example commands
HSET 00777:XA 2017-01-10:2017-01-20 xxxxxxxx
HSET 00777:XA 2017-01-21:2017-01-31 yyyyyyyy
HSET 00777:XA 2016-02-01:2016-12-31 zzzzzzzz
(Note: there is also a HMSET to simplify this into a single command)
Then:
HGET 00777:XA 2017-01-21:2017-01-31
Would return yyyyyyyy
Unless there is some specific performance consideration, or other goal for your data, I think Hashes will work great for your system.
It's also very convenient if you want to get all hashkeys or all values for a given hash, using commands like HKEYS, HVALS, or HGETALL.

How to get fitted values from clogit model

I am interested in getting the fitted values at set locations from a clogit model. This includes the population level response and the confidence intervals around it. For example, I have data that looks approximately like this:
set.seed(1)
data <- data.frame(Used = rep(c(1,0,0,0),1250),
Open = round(runif(5000,0,50),0),
Activity = rep(sample(runif(24,.5,1.75),1250, replace=T), each=4),
Strata = rep(1:1250,each=4))
Within the clogit model, Activity does not vary within a stratum, so there is no Activity main effect.
library(survival)  # provides clogit() and strata()
mod <- clogit(Used ~ Open + I(Open*Activity) + strata(Strata),data=data)
What I want to do is build a newdata frame at which I can eventually plot marginal fitted values at specified locations of Open similar to a newdata design in a traditional glm model: e.g.,
newdata <- data.frame(Open = seq(0,50,1),
Activity = rep(max(data$Activity),51))
However, when I try to run a predict function on the clogit, I get the following error:
fit<-predict(mod,newdata=newdata,type = "expected")
Error in Surv(rep(1, 5000L), Used) : object 'Used' not found
I realize this is because clogit in R is run through coxph, and thus the predict function is trying to predict relative risks between pairs of subjects within the same stratum (in this case, Used).
My question, however is if there is a way around this. This is easily done in Stata (using the Margins Command), and manually in Excel, however I would like to automate in R since everything else is programmed there. I have also built this manually in R (example code below), however I keep ending up with what appear to be incorrect CIs in my real data, as a result I would like to rely on the predict function if possible. My code for manual prediction is:
coef<-data.frame(coef = summary(mod)$coefficients[,1],
se= summary(mod)$coefficients[,3])
coef$se <-summary(mod)$coefficients[,4]
coef$UpCI <- coef[,1] + (coef[,2]*2) ### this could be *1.96 but using 2 for simplicity
coef$LowCI <-coef[,1] - (coef[,2]*2) ### this could be *1.96 but using 2 for simplicity
fitted<-data.frame(Open= seq(0,50,2),
Activity=rep(max(data$Activity),26))
fitted$Marginal <- exp(coef[1,1]*fitted$Open +
coef[2,1]*fitted$Open*fitted$Activity)/
(1+exp(coef[1,1]*fitted$Open +
coef[2,1]*fitted$Open*fitted$Activity))
fitted$UpCI <- exp(coef[1,3]*fitted$Open +
coef[2,3]*fitted$Open*fitted$Activity)/
(1+exp(coef[1,3]*fitted$Open +
coef[2,3]*fitted$Open*fitted$Activity))
fitted$LowCI <- exp(coef[1,4]*fitted$Open +
coef[2,4]*fitted$Open*fitted$Activity)/
(1+exp(coef[1,4]*fitted$Open +
coef[2,4]*fitted$Open*fitted$Activity))
My end product would ideally look something like this but a product of the predict function....
[Figure: example output of fitted values]
Evidently Terry Therneau is less a purist on the matter of predictions from clogit models: http://markmail.org/search/?q=list%3Aorg.r-project.r-help+predict+clogit#query:list%3Aorg.r-project.r-help%20predict%20clogit%20from%3A%22Therneau%2C%20Terry%20M.%2C%20Ph.D.%22+page:1+mid:tsbl3cbnxywkafv6+state:results
Here's a modification to your code that does generate the 51 predictions. I did need to add a dummy Strata column.
newdata <- data.frame(Open = seq(0,50,1),
Activity = rep(max(data$Activity),51), Strata=1)
risk <- predict(mod,newdata=newdata,type = "risk")
> risk/(risk+1)
1 2 3 4 5 6 7
0.5194350 0.5190029 0.5185707 0.5181385 0.5177063 0.5172741 0.5168418
8 9 10 11 12 13 14
0.5164096 0.5159773 0.5155449 0.5151126 0.5146802 0.5142478 0.5138154
15 16 17 18 19 20 21
0.5133829 0.5129505 0.5125180 0.5120855 0.5116530 0.5112205 0.5107879
22 23 24 25 26 27 28
0.5103553 0.5099228 0.5094902 0.5090575 0.5086249 0.5081923 0.5077596
29 30 31 32 33 34 35
0.5073270 0.5068943 0.5064616 0.5060289 0.5055962 0.5051635 0.5047308
36 37 38 39 40 41 42
0.5042981 0.5038653 0.5034326 0.5029999 0.5025671 0.5021344 0.5017016
43 44 45 46 47 48 49
0.5012689 0.5008361 0.5004033 0.4999706 0.4995378 0.4991051 0.4986723
50 51
0.4982396 0.4978068
{Warning} : It's actually rather difficult for mere mortals to determine which of the R-gods to believe on this one. I've learned so much R and statistics from each of those experts. I suspect there are matters of statistical concern or interpretation that I don't really understand.

Pandas shifting uneven timeseries data

I have some irregularly stamped time series data, with timestamps and the observations at every timestamp, in pandas. Irregular basically means that the timestamps are uneven, for instance the gap between two successive timestamps is not even.
For instance the data may look like
Timestamp Property
0 100
1 200
4 300
6 400
6 401
7 500
14 506
24 550
.....
59 700
61 750
64 800
Here the timestamp is, say, seconds elapsed since a chosen origin time. As you can see, we could have data at the same timestamp, 6 secs in this case. The underlying timestamps are strictly different; it is just that second resolution cannot capture the difference.
Now I need to shift the timeseries data ahead, say I want to shift the entire data by 60 secs, or a minute. So the target output is
Timestamp Property
0 750
1 800
So the 0 point got matched to the 61 point and the 1 point got matched to the 64 point.
Now I can do this by writing something dirty, but I am looking to use as much as possible any inbuilt pandas feature. If the timeseries were regular, or evenly gapped, I could've just used the shift() function. But the fact that the series is uneven makes it a bit tricky. Any ideas from Pandas experts would be welcome. I feel that this would be a commonly encountered problem. Many thanks!
Edit: added a second, more elegant, way to do it. I don't know what will happen if you had a timestamp at 1 and two timestamps of 61. I think it will choose the first 61 timestamp but not sure.
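import pandas as pd

# df is the question's DataFrame with 'Timestamp' and 'Property' columns
new_stamps = pd.Series(range(df['Timestamp'].max()+1))
shifted = pd.DataFrame(new_stamps)
shifted.columns = ['Timestamp']
merged = pd.merge(df,shifted,on='Timestamp',how='outer')
merged['Timestamp'] = merged['Timestamp'] - 60
# DataFrame.sort was removed from pandas; sort_values is the current equivalent
merged = merged.sort_values(by='Timestamp').bfill()
results = pd.merge(df,merged, on = 'Timestamp')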
[Original Post]
I can't think of an inbuilt or elegant way to do this. Posting this in case it's more elegant than your "something dirty", which is I guess unlikely. How about:
lookup_dict = {}
def assigner(row):
    lookup_dict[row['Timestamp']] = row['Property']
df.apply(assigner, axis=1)
sorted_keys = sorted(lookup_dict.keys())
df['Property_Shifted'] = None
def get_shifted_property(row, shift_amt):
    for i in sorted_keys:
        if i >= row['Timestamp'] + shift_amt:
            row['Property_Shifted'] = lookup_dict[i]
            break  # keep the first timestamp at or after the target
    return row
df = df.apply(get_shifted_property, shift_amt=60, axis=1)
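Since the timestamps are already sorted, the lookup can also be done with a binary search instead of scanning every key. The sketch below uses only the standard library and plain lists rather than a DataFrame; the function name is made up, and the rows elided by "....." in the question are left out. It matches each point to the first observation strictly after t + 60, which is what the target output in the question implies (the point at 1 maps to 64, not 61). Swapping bisect_right for bisect_left would give "at or after" instead. Within pandas itself, merge_asof with direction='forward' may be worth a look for the same job.

```python
import bisect

# Sample from the question, already sorted by timestamp.
timestamps = [0, 1, 4, 6, 6, 7, 14, 24, 59, 61, 64]
properties = [100, 200, 300, 400, 401, 500, 506, 550, 700, 750, 800]

def shift_forward(timestamps, properties, shift_amt):
    """Pair each timestamp t with the first observation strictly after t + shift_amt."""
    shifted = []
    for t in timestamps:
        idx = bisect.bisect_right(timestamps, t + shift_amt)
        if idx < len(timestamps):  # skip points with nothing that far ahead
            shifted.append((t, properties[idx]))
    return shifted

result = shift_forward(timestamps, properties, 60)
```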