Group by Week and Count Column Values in Rails

Group by Week and Count Column Values in Rails - sql

Looking to do a simple-ish SQL query in rails (using Active Record) and I'm running into some trouble.
I want to return a JSON response so the client can consume the following data structure:
{"2014-12-01-2014-12-07": {
"foo": 100,
"bar": 50,
"baz": 20,
"blah": 10,
},
"2014-12-08-2014-12-14": {
"foo": 40,
"bar": 550,
"baz": 210,
"blah": 10,
}
}
Where foo, bar, baz, blah etc. are possible values available in an array on the Foo model (from Foo::PossibleStates). I want to return summed counts of each type per week. I know roughly how I'd go about this in Mongo (the world I'm more familiar with), but am running into trouble with the nuances in SQL and Rails/Active Record. Any direction would be greatly appreciated! Here's what I've tried so far:
class FooController < ApplicationController
def index
start_date = params[:start_date]
end_date = params[:end_date]
#jsonFooData = Foo.group('created_at').group("workflow_state").sum('workflow_state')
end
end

I think it is not an easy task.
If you are using postgres, it could be achieved by one sql and a loop:
# models/foo.rb
# call this Foo.count_by_week in your controller
def self.count_by_week
raw_result = group_by_week_and_state.count
# {['2014-12-01-2014-12-07', 'foo'] => 100, ['2014-12-01-2014-12-07', 'bar'] => 100, '...' => '...'}
raw_result.each_with_object({}) do |(k, v), result|
result[k[0]] ||= {}
result[k[0]][k[1]] = v
end
end
def self.group_by_week_and_state
group("#{weekday_query(0)} || \'-\' || #{weekday_query(6)}").group('workflow_state')
end
# build sql part for day offset of week (0 => mon, 6 => sun)
def self.weekday_query(offset)
"to_char(cast(date_trunc(\'week\', created_at) as date) + #{offset}, \'YYYY-MM-DD\')"
end

results = Hash.new 0
counts = Hash.new 0
# Get the results 'foos' with all the foo records grouped by beginning of the week. I'm using 'Monday' here but can be changed to any other weekday.
foos = Foo.all.group_by{|f| f.created_at.beginning_of_week(:monday)}
# loop through each grouped item
foos.collect do |foo|
# collect the workflow states and loop through them and count the unique workflow states.
foo[1].collect{|f| f.workflow_state}.each do |ws|
counts[ws] += 1
end
# assign the workflow states counts to the beginning of the week
results[foo[0]] = counts
# reset the counter to new hash
counts = Hash.new 0
end
# output the results
puts results

Related

Unable to iterate a list to a thread

I am trying to pass a json return between functions but I get errors. So I convert the json to a list. However, I cannot iterate the list from a while loop unless I specify an actual number.
Full code is
class BackendThread(QThread):
update_date = pyqtSignal(list)
def run(self):
device_mode = []
while True:
#do stuff and get json_return
for xx in json_return["result"]["devices"]:
for y in xx["nodes"]:
if y['type'] == "FRAME_BUFFER":
data = xx["device_id"] + "\n" + y['configuration']['display_mode']
device_mode.append(data)
self.update_date.emit(str(device_mode))
device_mode = []
time.sleep(1)
class Window(QDialog):
def __init__(self):
QDialog.__init__(self)
self.resize(400,400)
self.input=QTextEdit(self)
self.input.resize(400,400)
self.initUI()
def initUI(self):
self.backend=BackendThread()
self.backend.update_date.connect(self.handleDisplay)
self.backend.start()
def handleDisplay(self,data):
count = 0
while count < 11:
self.input.setText(data[count])
count += 1
if __name__ == '__main__':
app=QApplication(sys.argv)
win =Window()
win.show()
sys.exit(app.exec_())
So this part does not work. I only get the last item in the list
count = 0
while count < 11:
self.input.setText(data[count])
count += 1
When I do this, it works but I cannot hard code the item number becuase the list will never have the same amount of items
self.input.setText(data[0])
self.input.setText(data[1])
self.input.setText(data[2])
etc
Any ideas as to how to get that while loop working?

Create a dictionary using "if... else" in Julia?

I'm working with a sample CSV file that lists nursing home residents' DOBs and DODs. I used those fields to calculate their age at death, and now I'm trying to create a dictionary that "bins" their age at death into groups. I'd like the bins to be 1-25, 26-50, 51-75, and 76-100.
Is there a concise way to make a Dict(subject_id, age, age_bin) using "if... else" syntax?
For example: (John, 76, "76-100"), (Moira, 58, "51-75").
So far I have:
#import modules
using CSV
using DataFrames
using Dates
# Open, read, write desired files
input_file = open("../data/FILE.csv", "r")
output_file = open("FILE_output.txt", "w")
# Use to later skip header line
file_flag = 0
for line in readlines(input_file)
if file_flag==0
global file_flag = 1
continue
end
# Define what each field in FILE corresponds to
line_array = split(line, ",")
subject_id = line_array[2]
gender = line_array[3]
date_of_birth = line_array[4]
date_of_death = line_array[5]
# Get yyyy-mm-dd only (first ten characters) from fields 4 and 5:
date_birth = date_of_birth[1:10]
date_death = date_of_death[1:10]
# Create DateFormat; use to calculate age
date_format = DateFormat("y-m-d")
age_days = Date(date_death, date_format) - Date(date_birth, date_format)
age_years = round(Dates.value(age_days)/365.25, digits=0)
# Use "if else" statement to determine values
keys = age_years
function values()
if age_years <= 25
return "0-25"
elseif age_years <= 50
return "26-50"
elseif age_years <= 75
return "51-75"
else age_years < 100
return "76-100"
end
end
values()
# Create desired dictionary
age_death_dict = Dict(zip(keys, values()))
end
Edit: or is there a better way to approach this using DataFrames?

To answer your question, " is there a concise way using if/else" -- probably not, given that you have 5 cases (age ranges) you have to account for. Suppose you have names and ages in two separate lists (which I assume you generate from your example code, although I can't see the input CSVs):
julia> name = ["John", "Mary", "Robert", "Cindy", "Beatrice"];
julia> ages = [24, 73, 75, 69, 90];
julia> function bin_age_ifelse(a)
if a<1
return "Invalid age"
elseif 1<=a<=25
return "1-25"
elseif 25<a<=50
return "26-50"
elseif 50<a<=75
return "51-75"
else
return "76-100"
end
end
bin_age_ifelse (generic function with 1 method)
julia> binned_ifelse = Dict([n=>[a, bin_age_ifelse(a)] for (n,a) in zip(name, ages)])
Dict{String, Vector{Any}} with 5 entries:
"John" => [24, "1-25"]
"Mary" => [73, "51-75"]
"Beatrice" => [90, "76-100"]
"Robert" => [75, "51-75"]
"Cindy" => [69, "51-75"]
Here's an option for the binning function to avoid if/else syntax, although there are probably yet more elegant ways to do it:
julia> function bin_age(a)
bins = [1:25, 26:50, 51:75, 76:100]
for b in bins
if a in b
return "$(b[1])-$(b[end])"
end
end
end
bin_age (generic function with 1 method)
julia> bin_age(84)
"76-100"
I've taken some liberties with the format of the answer, using the name as the key, since your original question describes a dict format that doesn't really make sense in Julia. If you'd like to have the keys be the age ranges, you could construct the dictionary above and then invert it as described here (with some modification since the values above have two entries).
If you don't care about name, age, or age range being a key, then I would suggest using DataFrames.jl:
julia> using DataFrames
julia> d = DataFrame(name=name, age=ages, age_range=[bin_age(a) for a in ages])
5×3 DataFrame
Row │ name age age_range
│ String Int64 String
─────┼────────────────────────────
1 │ John 24 1-25
2 │ Mary 73 51-75
3 │ Robert 75 51-75
4 │ Cindy 69 51-75
5 │ Beatrice 90 76-100

Efficient way to expand a DataFrame in Julia

I have a dataframe with exposure episodes per case:
using DataFrames
using Dates
df = DataFrame(id = [1,1,2,3], startdate = [Date(2018,3,1),Date(2019,4,2),Date(2018,6,4),Date(2018,5,1)], enddate = [Date(2019,4,4),Date(2019,8,5),Date(2019,3,1),Date(2019,4,15)])
I want to expand each episode to its constituent days, eliminating any duplicate days per case resulting from overlapping episodes (case 1 in the example dataframe):
s = similar(df, 0)
for row in eachrow(df)
tf = DataFrame(row)
ttf = repeat(tf, Dates.value.(row.enddate - row.startdate) + 1)
ttf.daydate = ttf.startdate .+ Dates.Day.(0:nrow(ttf) - 1) #a record for each day between start and end days (inclusive)
ttf.start = ttf.daydate .== ttf.startdate #a flag to indicate this record was at the start of an episode
ttf.end = ttf.daydate .== ttf.enddate #a flag to indicate this record was at the end of an episode
append!(s, ttf, cols=:union)
end
sort!(s, [:id,:daydate,:startdate, order(:enddate, rev=true)])
unique!(s,[:id,:daydate]) #to eliminate duplicate dates in the case of episode overlaps (e.g. case 1)
I have a strong suspicion that there is a more efficient way of doing this than the brute force method I came up with and any help will be appreciated.
Implementation note: In the actual implementation there are several hundred thousand cases, each with relatively few episodes (median = 1, 75 percentile 3, maximum 20), but spanning 20 years or more of exposure resulting in a very large dataset (several 100 million records). To fit into available memory I have partitioned the dataset on id and used the Threads.#threads macro to loop through the partitions in parallel. The primary purpose of this decomposition into days is not just to eliminate overlaps, but to join the data with other exposure data that is available on a per day basis.

Below is a more complete solution that takes into account some essential details. Each episode is associated with additional attributes, as an example I used locationid (place where the exposure took place) and the need to indicate whether there was a gap between subsequent episodes. The original solution also did not cater for the special case where an episode is fully contained within another episode - such episodes should not be expanded.
using Dates
using DataFrames
function process(startdate, enddate, locationid)
start = startdate[1]
stop = enddate[1]
location = locationid[1]
res_daydate = collect(start:Day(1):stop)
res_startdate = fill(start, length(res_daydate))
res_enddate = fill(stop, length(res_daydate))
res_location = fill(location, length(res_daydate))
gap = 0
res_gap = fill(0, length(res_daydate))
for i in 2:length(startdate)
if startdate[i] > res_daydate[end]
start = startdate[i]
elseif enddate[i] > res_daydate[end]
start = res_daydate[end] + Day(1)
else
continue #this episode is contained within the previous episode
end
if start - res_daydate[end] > Day(1)
gap = gap==0 ? 1 : 0
end
stop = enddate[i]
location = locationid[i]
new_daydate = start:Day(1):stop
append!(res_daydate, new_daydate)
append!(res_startdate, fill(startdate[i], length(new_daydate)))
append!(res_enddate, fill(stop, length(new_daydate)))
append!(res_location, fill(location, length(new_daydate)))
append!(res_gap, fill(gap, length(new_daydate)))
end
return (daydate=res_daydate, startdate=res_startdate, enddate=res_enddate, locationid=res_location, gap = res_gap)
end
function eliminateoverlap()
df = DataFrame(id = [1,1,2,3,3,4,4], startdate = [Date(2018,3,1),Date(2019,4,2),Date(2018,6,4),Date(2018,5,1), Date(2019,5,1), Date(2012,1,1), Date(2012,2,2)],
enddate = [Date(2019,4,4),Date(2019,8,5),Date(2019,3,1),Date(2019,4,15),Date(2019,6,15),Date(2012,6,30), Date(2012,2,10)], locationid=[10,11,21,30,30,40,41])
dfs = sort(df, [:startdate, order(:enddate, rev=true)])
gdf = groupby(dfs, :id, sort=true)
r = combine(gdf, [:startdate, :enddate, :locationid] => process => AsTable)
df = combine(groupby(r, [:id,:gap,:locationid]), :daydate => minimum => :StartDate, :daydate => maximum => :EndDate)
return df
end
df = eliminateoverlap()

Here is something that should be efficient:
dfs = sort(df, [:startdate, order(:enddate, rev=true)])
gdf = groupby(dfs, :id, sort=true)
function process(startdate, enddate)
start = startdate[1]
stop = enddate[1]
res_daydate = collect(start:Day(1):stop)
res_startdate = fill(start, length(res_daydate))
res_enddate = fill(stop, length(res_daydate))
for i in 2:length(startdate)
if startdate[i] > res_daydate[end]
start = startdate[i]
stop = enddate[i]
elseif enddate[i] > res_daydate[end]
start = res_daydate[end] + Day(1)
stop = enddate[i]
end
new_daydate = start:Day(1):stop
append!(res_daydate, new_daydate)
append!(res_startdate, fill(startdate[i], length(new_daydate)))
append!(res_enddate, fill(stop, length(new_daydate)))
end
return (startdate=res_startdate, enddate=res_enddate, daydate=res_daydate)
end
combine(gdf, [:startdate, :enddate] => process => AsTable)
(but please check it with larger data against your implementation if it is correct as I have just written it quickly to show you how to do performant implementations with DataFrames.jl)

Date is not working even when date column is set to index

I have a multiple dataframe dictionary where the index is set to 'Date' but am having a trouble to capture the specific day of a search.
Dictionary created as per link:
Call a report from a dictionary of dataframes
Then I tried to add the following column to create specific days for each row:
df_dict[k]['Day'] = pd.DatetimeIndex(df['Date']).day
It´s not working. The idea is to separate the day of the month only (from 1 to 31) for each row. When I call the report, it will give me the day of month of that occurrence.
More details if needed.
Regards and thanks!

In the case of your code, there is no 'Date' column, because it's set as the index.
df_dict = {f.stem: pd.read_csv(f, parse_dates=['Date'], index_col='Date') for f in files}
To extract the day from the index use the following code.
df_dict[k]['Day'] = df.index.day
Pulling the code from this question
# here you can see the Date column is set as the index
df_dict = {f.stem: pd.read_csv(f, parse_dates=['Date'], index_col='Date') for f in files}
data_dict = dict() # create an empty dict here
for k, df in df_dict.items():
df_dict[k]['Return %'] = df.iloc[:, 0].pct_change(-1)*100
# create a day column; this may not be needed
df_dict[k]['Day'] = df.index.day
# aggregate the max and min of Return
mm = df_dict[k]['Return %'].agg(['max', 'min'])
# get the min and max day of the month
date_max = df.Day[df['Return %'] == mm.max()].values[0]
date_min = df.Day[df['Return %'] == mm.min()].values[0]
# add it to the dict, with ticker as the key
data_dict[k] = {'max': mm.max(), 'min': mm.min(), 'max_day': date_max, 'min_day': date_min}
# print(data_dict)
[out]:
{'aapl': {'max': 8.702843218147871,
'max_day': 2,
'min': -4.900700398891522,
'min_day': 20},
'msft': {'max': 6.603769278967109,
'max_day': 2,
'min': -4.084428935702855,
'min_day': 8}}

how insert mongo data into redis list

Now, my mongodb have a table, in it is the keywords and the number of grabs by keywords are stored, now how to insert the keywords into the redis list and by the number of grabs priority level?
thks! very much
this is my code:
def init_mongo_to_redis(mongo_db, redis_pool):
r = redis.Redis(connection_pool = redis_pool)
mongo_handle = mongo_db.keywords_tbl.find({}, {'keyword': 1, 'keyword_type': 1, \
'ignore_station': 1}, no_cursor_timeout=True)
redis_len = r.llen('fetch_keywords')
if redis_len != 0:
logging.info('redis fetch_keywords size is %d', redis_len)
return
logging.info('init redis fetch_keywords start')
r_pipe = r.pipeline()
pbar = tqdm(mongo_handle)
for keyword in pbar:
item = {
'keyword_type': keyword['keyword_type'],
'ignore_station': keyword['ignore_station'],
'keyword': keyword['keyword']
}
r_pipe.lpush('fetch_keywords', json.dumps(item))
pbar.set_description('Processing %s' % keyword)
r_pipe.execute()
logging.info('init redis fetch_keywords end :%d', r.llen('fetch_keywords'))

Anytime you have to order by a numerical (integer or float) value in Redis, it is better to consider first using Sorted Sets. You can order values by a score, that in your case I suppose is
number of grabs priority level

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Group by Week and Count Column Values in Rails - sql

Related

Unable to iterate a list to a thread

Create a dictionary using "if... else" in Julia?

Efficient way to expand a DataFrame in Julia

Date is not working even when date column is set to index

how insert mongo data into redis list

Categories

Resources