I've got this code that I'm using to insert data into Highcharts:
def self.amount_on(date)
  where("date(created_at) = ?", date).sum(:amount)
end
I would like to only sum an :amount if it's negative. Is this doable?
This should do it:
def self.amount_on(date)
  where("amount < 0 AND date(created_at) = ?", date).sum(:amount)
end
I am trying to define the trigger point where wt1 (moving average 1) crosses over wt2 (moving average 2), and add it to the ['side'] column.
So basically, add 1 to side at the moment wt1 crosses above wt2.
This is the current code I am using, but it doesn't seem to be working:
for i in range(len(df)):
    if df.wt1.iloc[i] > df.wt2.iloc[i] and df.wt1.iloc[i-1] < df.wt2.iloc[i-1]:
        df.side.iloc[1]
If I do the following:
long_signals = (df.wt1 > df.wt2)
df.loc[long_signals, 'side'] = 1
it returns the value 1 the entire time wt1 is above wt2, which is not what I am trying to do.
The expected outcome is that when wt1 crosses above wt2, side should be labeled as 1.
Help would be appreciated!
Use shift in your condition:
long_signals = (df.wt1 > df.wt2) & (df.wt1.shift() <= df.wt2.shift())
df.loc[long_signals, 'side'] = 1
df
If you do not like NaNs in 'side', use df.fillna(0) at the end.
Your first piece of code also works with the following small modification:
for i in range(len(df)):
    if df.wt1.iloc[i] > df.wt2.iloc[i] and df.wt1.iloc[i-1] <= df.wt2.iloc[i-1]:
        df.loc[i, 'side'] = 1
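For reference, here is a minimal, self-contained sketch of the shift-based approach on toy data (the wt1/wt2/side names are taken from the question; the values are made up):
import pandas as pd

# Toy data: wt1 crosses above wt2 between rows 2 and 3
df = pd.DataFrame({
    "wt1": [1.0, 1.5, 1.8, 2.2, 2.5, 2.1],
    "wt2": [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],
})

# A crossover row is above now but was at or below on the previous row
long_signals = (df.wt1 > df.wt2) & (df.wt1.shift() <= df.wt2.shift())

df["side"] = 0  # pre-filling with 0 avoids the NaNs mentioned above
df.loc[long_signals, "side"] = 1
print(df)  # side is 1 only on row 3, where the cross happens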
I think I have a problem with computation time.
I want to run this code on a DataFrame of 320,000 rows and 6 columns:
index_data = data["clubid"].index.tolist()
for i in index_data:
    for j in index_data:
        if data["clubid"][i] == data["clubid"][j]:
            if data["win_bool"][i] == 1:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 1
                ):
                    NW_tot[i] += 1
            else:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 0
                ):
                    NL_tot[i] += 1
The objective is to determine, for every clubid, the number of wins and the number of losses up to a given match, taking the previous matches into account.
The problem is, I don't get an error, but I never obtain any results either.
When I tried with a smaller DataFrame (data[0:1000]) I got a result in 13 seconds. This is why I think it's a computation-time problem.
I also tried to first use a groupby("clubid") and then run my for loop inside every group, but I got lost.
Something else that bothers me: I have at least 2 rows with the exact same date/hour, because there are at least two identical dates for one match. Because of this I can't use the date as the index.
Could you help me with these issues, please?
As I pointed out in the comment above, I think you can simply sum the vector of win_bool by group. If the dates are sorted this should be equivalent to your loop, correct?
import pandas as pd

dat = pd.DataFrame({
    "win_bool": [0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0],
    "clubid":   [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
    "date":     [1, 2, 1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 4, 5, 6],
    "othercol": ["a", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b"]
})
temp = dat[["clubid", "win_bool"]].groupby("clubid")
NW_tot = temp.sum()
NL_tot = temp.count()
NL_tot = NL_tot["win_bool"] - NW_tot["win_bool"]
If you have duplicate dates that inflate the counts, you could first drop duplicates by dates (within groups):
# drop duplicate dates
temp = dat.drop_duplicates(["clubid", "date"])[["clubid", "win_bool"]].groupby("clubid")
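If what you ultimately need is a running count per match rather than one total per club (as in your loop), a hedged sketch along these lines could work; it reuses the toy dat frame above and assumes duplicates are dropped and rows are sorted by date within each club:
# Running totals per club: wins and losses up to and including each match.
# Shift the cumulative columns by one row per group if only strictly
# earlier matches should be counted.
dat2 = (dat.drop_duplicates(["clubid", "date"])
           .sort_values(["clubid", "date"])
           .copy())
grouped = dat2.groupby("clubid")["win_bool"]
dat2["NW_tot"] = grouped.cumsum()                         # wins so far
dat2["NL_tot"] = grouped.cumcount() + 1 - dat2["NW_tot"]  # losses so far
print(dat2)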
I have performed a stratified sample on a multi-label dataset before training a classifier and want to check how balanced it is now. The columns in the dataset are:
|_Body|label_0|label_1|label_10|label_100|label_101|label_102|label_103|label_104|label_11|label_12|label_13|label_14|label_15|label_16|label_17|label_18|label_19|label_2|label_20|label_21|label_22|label_23|label_24|label_25|label_26|label_27|label_28|label_29|label_3|label_30|label_31|label_32|label_33|label_34|label_35|label_36|label_37|label_38|label_39|label_4|label_40|label_41|label_42|label_43|label_44|label_45|label_46|label_47|label_48|label_49|label_5|label_50|label_51|label_52|label_53|label_54|label_55|label_56|label_57|label_58|label_59|label_6|label_60|label_61|label_62|label_63|label_64|label_65|label_66|label_67|label_68|label_69|label_7|label_70|label_71|label_72|label_73|label_74|label_75|label_76|label_77|label_78|label_79|label_8|label_80|label_81|label_82|label_83|label_84|label_85|label_86|label_87|label_88|label_89|label_9|label_90|label_91|label_92|label_93|label_94|label_95|label_96|label_97|label_98|label_99|
I want to group by every label_* column once, and create a dictionary of the results with positive/negative counts. At the moment I am accomplishing this in PySpark SQL like this:
# Evaluate how skewed the sample is after balancing it by resampling
stratified_sample = spark.read.json('s3://stackoverflow-events/1901/Sample.Stratified.{}.*.jsonl'.format(limit))
stratified_sample.registerTempTable('stratified_sample')
label_counts = {}
for i in range(0, 100):
    count_df = spark.sql('SELECT label_{}, COUNT(*) as total FROM stratified_sample GROUP BY label_{}'.format(i, i))
    rows = count_df.rdd.take(2)
    neg_count = getattr(rows[0], 'total')
    pos_count = getattr(rows[1], 'total')
    label_counts[i] = [neg_count, pos_count]
The output is thus:
{0: [1034673, 14491],
1: [1023250, 25914],
2: [1030462, 18702],
3: [1035645, 13519],
4: [1037445, 11719],
5: [1010664, 38500],
6: [1031699, 17465],
...}
This feels like it should be possible in one SQL statement, but I can't figure out how to do it or find an existing solution. Obviously I don't want to write out all the column names, and generating the SQL seems worse than this solution.
Can SQL do this? Thanks!
You can indeed do that in one statement, but I am not sure the performance will be good.
from pyspark.sql import functions as F
from functools import reduce

dataframes_list = [
    stratified_sample.groupBy(
        "label_{}".format(i)
    ).count().select(
        F.lit("label_{}".format(i)).alias("col"),
        "count"
    )
    for i in range(0, 100)
]

count_df = reduce(
    lambda a, b: a.union(b),
    dataframes_list
)
This will create a dataframe with 2 columns: col, which contains the name of the column you are counting, and count, which holds the count value.
To change it to a dict, I'll let you read another post.
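As a hedged, minimal sketch of that conversion (note that the select above keeps only the column name and the count, so if you need to know which of the two counts belongs to the positive class you would also keep the grouped label value in the select):
from collections import defaultdict

# Collect the small (~200-row) result locally and group the counts by the
# 'col' field. Without the label value itself in the select, the order of
# the two counts per column is not guaranteed.
label_counts = defaultdict(list)
for row in count_df.collect():
    label_counts[row["col"]].append(row["count"])

# e.g. {'label_0': [1034673, 14491], 'label_1': [1023250, 25914], ...}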
Here is a solution with a single SQL statement to get all the pos and neg counts:
sql = 'select '
for i in range(0, 100):
    sql = sql + ' sum(CASE WHEN label_{} > 0 THEN 1 ELSE 0 END) as label{}_pos_count, '.format(i, i)
    sql = sql + ' sum(CASE WHEN label_{} <= 0 THEN 1 ELSE 0 END) as label{}_neg_count'.format(i, i)
    if i < 99:
        sql = sql + ', '
sql = sql + ' from stratified_sample '
df = spark.sql(sql)
rows = df.rdd.take(1)
label_counts = {}
for i in range(0, 100):
    label_counts[i] = [rows[0][2*i], rows[0][2*i+1]]
print(label_counts)
You can generate the SQL without a GROUP BY.
Something like:
SELECT COUNT(*) AS total, SUM(label_k) AS positive_k, ... FROM table
And then use the result to produce your dict {k: [total - positive_k, positive_k]}.
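A hedged sketch of that suggestion, assuming the labels are numeric 0/1 so that summing a column gives its positive count (it reuses the spark session and the stratified_sample temp table registered above):
# Build one aggregate query: the total row count plus the sum of every label column
select_parts = ["COUNT(*) AS total"] + [
    "SUM(label_{0}) AS positive_{0}".format(i) for i in range(0, 100)
]
sql = "SELECT {} FROM stratified_sample".format(", ".join(select_parts))

row = spark.sql(sql).collect()[0]
total = row["total"]
label_counts = {
    i: [total - row["positive_{}".format(i)], row["positive_{}".format(i)]]
    for i in range(0, 100)
}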
I have a requirement as below:
poem belongs to poet
poet has many poems
If a user searches for the word "ruby",
it should give:
The total number of times the word ruby is used in all poems.
All the poems which contain the word ruby.
The number of times the word ruby is used in each poem.
The total number of poets who used the word ruby.
The total number of times each poet used the word ruby.
So my query in the Poem model is:
poems = where("poem_column like ?", "%#{word}%")
@results = {}
poems.each do |poem|
  words = poem.poem_column.split
  count = 0
  words.each do |w|
    count += 1 if w.upcase.include?(word.upcase)
  end
  @results[poem] = count # to get each poem using word ruby
end
And to get the poets count, in the Poem model:
@poets = poems.select("distinct(poet_id)")
@poets.each do |poet|
  @poets_word_count << poems.where("poet_id = #{poet.poet_id}").count
end
There are around 50k poems, and it's taking more than a minute.
I know I am doing this the wrong way, but I couldn't optimize it any other way.
I think the lines below are taking too much time, as they loop over every word of every poem:
words.each do |w|
  count += 1 if w.upcase.include?(word.upcase)
end
Can anyone show me a way to optimize this? With my lack of knowledge of queries I couldn't do it any other way.
Thanks in advance.
Not an answer, just a test.
First, reduce the data by extracting keywords for each poem as they are saved:
rails g resource Keyword word occurrences poem_id:integer
rails db:migrate
Then in your Poem model:
# add more words
EXCLUDED_WORDS = %w( the a an so that this these those )

has_many :keywords

before_save :set_keywords

# { :some => 3, :word => 2, :another => 1 }
def keywords_hash(how_many = 5)
  words = Hash.new 0
  poem_column.split.each do |word|
    words[word] += 1 unless word.in? EXCLUDED_WORDS
  end
  Hash[words.sort_by { |_word, occurrences| -occurrences }.take(how_many)]
end

def set_keywords
  keywords_hash.each do |word, occurrences|
    keywords.create :word => word, :occurrences => occurrences
  end
end
In the Keyword model:
belongs_to :poem

def self.poem_ids
  includes(:poem).map(&:poem_id)
end

def self.poems
  Poem.where(id: poem_ids)
end
Then, when you have a word to search for:
keywords = Keyword.where(word: word)
poems = keywords.poems
poets = poems.poets
To use this last part, you would need this in the Poem model:
def self.poet_ids
  includes(:poet).map(&:poet_id)
end

def self.poets
  Poet.where(id: poet_ids)
end
As far as I see this way would require just 3 queries, no joins, so it seems to make sense.
I will think about how to extend this approach to search the entire content.
In my opinion, you can change the following code quoted from your post:
poems.each do |poem|
  words = poem.poem_column.split
  count = 0
  words.each do |w|
    count += 1 if w.upcase.include?(word.upcase)
  end
  @results[poem] = count # to get each poem using word ruby
end
to:
poems.each { |poem| @results[poem] = poem.poem_column.scan(/ruby/i).size }
I am trying to write a migration that will increase the value of an integer field by 1. I've tried several variations and searched around. What I am looking for is something like:
def self.up
  User.update_all(:category => [:category + 1])
end
Thanks.
Maybe this will do it:
User.update_all("category = (category + 1)")