Am I misunderstanding String#hash in Ruby? - sql

I am processing a bunch of data and I haven't coded a duplicate checker into the data processor yet, so I expected duplicates to occur. I ran the following SQL query:
SELECT body, COUNT(body) AS dup_count
FROM comments
GROUP BY body
HAVING (COUNT(body) > 1)
And get back a list of duplicates. Looking into this I find that these duplicates have multiple hashes. The shortest string of a comment is "[deleted]". So let's use that as an example. In my database there are nine instances of a comment being "[deleted]" and in my database this produces a hash of both 1169143752200809218 and 1738115474508091027. The 116 is found 6 times and 173 is found 3 times. But, when I run it in IRB, I get the following:
a = '[deleted]'.hash # => 811866697208321010
Here is the code I'm using to produce the hash:
def comment_and_hash(chunk)
comment = chunk.at_xpath('*/span[#class="comment"]').text ##Get Comment##
hash = comment.hash
return comment,hash
end
I've confirmed that I don't touch comment anywhere else in my code. Here is my datamapper class.
class Comment
include DataMapper::Resource
property :uid , Serial
property :author , String
property :date , Date
property :body , Text
property :arank , Float
property :srank , Float
property :parent , Integer #Should Be UID of another comment or blank if parent
property :value , Integer #Hash to prevent duplicates from occurring
end
Am I correct in assuming that .hash on a string will return the same value each time it is called on the same string?
Which value is the correct value assuming my string consists of "[deleted]"?
Is there a way I could have different strings inside ruby, but SQL would see them as the same string? That seems to be the most plausible explanation for why this is occurring, but I'm really shooting in the dark.

If you run
ruby -e "puts '[deleted]'.hash"
several times, you will notice that the value is different. In fact, the hash value stays only constant as long as your Ruby process is alive. The reason for this is that String#hash is seeded with a random value. rb_str_hash (the C implementing function) uses rb_hash_start which uses this random seed which gets initialized every time Ruby is spawned.
You could use a CRC such as Zlib#crc32 for your purposes or you may want to use one of the message digests of OpenSSL::Digest, although the latter is overkill since for detection of duplicates you probably won't need the security properties.

I use the following to create String#hash alternatives that are consistant across time and processes
require 'zlib'
def generate_id(label)
Zlib.crc32(label.to_s) % (2 ** 30 - 1)
end

Ruby intentionally makes String.hash produce different values in different sessions: Why is Ruby String.hash inconsistent across machines?

Related

Find records where length of array equal to - Rails 4

In my Room model, I have an attribute named available_days, which is being stored as an array.
For example:
Room.first.available_days
=> ["wed", "thurs", "fri"]
What is the best way to find all Rooms where the size of the array is equal to 3?
I've tried something like
Room.where('LENGTH(available_days) = ?', 3)
with no success.
Update: the data type for available_days is a string, but in order to store an array, I am serializing the attribute from my model:
app/models/room.rb
serialize :available_days
Can't think of a purely sql way of doing it for sqlite since available_days is a string.
But here's one way of doing it without loading all records at once.
rooms = []
Room.in_batches(of: 10).each_record do |r|
rooms << r if r.available_days.length == 3
end
p rooms
If you're using postgres you can parse the serialized string to an array type, then query on the length of the array. I expect other databases may have similar approaches. How to do this depends on how the text is being serialized, but by default for Rails 4 should be YAML, so I expect you data is encoded like this:
---
- first
- second
The following SQL will remove the leading ---\n- as well as the final newline, then split the remaining string on - into an array. It's not strictly necessary to cleanup the extra characters to find the length, but if you want to do other operations you may find it useful to have a cleaned up array (no leading characters or trailing newline). This will only work for simple YAML arrays and simple strings.
Room.where("ARRAY_LENGTH(STRING_TO_ARRAY(RTRIM(REPLACE(available_days,'---\n- ',''),'\n'), '\n- '), 1) = ?", 3)
As you can see, this approach is rather complex. If possible you may want to add a new structured column (array or jsonb) and migrate the serialized string into the a typed column to make this easier and more performant. Rails supports jsonb serialization for postgres.

Passing a block to .try()

This question is about using the .try() method in rails 3 (& a bit about best practice in testing for nil also)
If I have a db object that may or may not be nil, and it has a field which is eg a delimited string, separated by commas and an optional semi-colon, and I want to see if this string contains a particular sub-string, then I might do this:
user_page = UserPage.find(id)
if user_page != nil
result = user_page.pagelist.gsub('\;', ',').split(",").include?(sub)
end
So, eg user_page.pagelist == "item_a,item_b;item_c"
I want to re-write this using .try, but cannot find a way of applying the stuff after pagelist
So I have tried passing a block (from this link http://apidock.com/rails/Object/try)
user_page.try(:pagelist) {|t| t.gsub('\;', ',').split(",").include?(sub)}
However, the block is simply ignored and the returned result is simply pagelist
I have tried other combinations of trying to apply the gsub etc part, but to no avail
How can I apply other methods to the method being 'tried'?
(I dont even know how to phrase the question in an intelligible way!)
I think you're probably coming at this wrong - the quick solution should be to add a method to your model:
def formatted_pagelist
(pagelist || "").split(/[;,]/)
end
Then you can use it where ever:
user_page.formatted_pagelist.include? sub
Strictly speaking, I'd say this goes into a decorator. If you're not using a decorator pattern in your app (look into draper!), then a model method should work fine.

Constructing a recursive compare with SQL

This is an ugly one. I wish I wasn't having to ask this question, but the project is already built such that we are handling heavy loads of validations in the database. Essentially, I'm trying to build a function that will take two stacks of data, weave them together with an unknown batch of operations or comparators, and produce a long string.
Yes, that was phrased very poorly, so I'm going to give an example. I have a form that can have multiple iterations of itself. For some reason, the system wants to know if the entered start date on any of these forms is equal to the entered end date on any of these forms. Unfortunately, due to the way the system is designed, everything is stored as a string, so I have to format it as a date first, before I can compare. Below is pseudo code, so please don't correct me on my syntax
Input data:
'logFormValidation("to_date(#) == to_date(^)"
, formname.control1name, formname.control2name)'
Now, as I mentioned, there are multiple iterations of this form, and I need to loop build a fully recursive comparison (note: it may not always be typical boolean comparisons, it could be internally called functions as well, so .In or anything like that won't work.) In the end, I need to get it into a format like below so the validation parser can read it.
OR(to_date(formname.control1name.1) == to_date(formname.control2name.1)
,to_date(formname.control1name.2) == to_date(formname.control2name.1)
,to_date(formname.control1name.3) == to_date(formname.control2name.1)
,to_date(formname.control1name.1) == to_date(formname.control2name.2)
:
:
,to_date(formname.control1name.n) == to_date(formname.control2name.n))
Yeah, it's ugly...but given the way our validation parser works, I don't have much of a choice. Any input on how this might be accomplished? I'm hoping for something more efficient than a double recursive loop, but don't have any ideas beyond that
Okay, seeing as my question is apparently terribly unclear, I'm going to add some more info. I don't know what comparison I will be performing on the items, I'm just trying to reformat the data into something useable for ANY given function. If I were to do this outside the database, it'd look something like this. Note: Pseudocode. '#' is the place marker in a function for vals1, '^' is a place marker for vals2.
function dynamicRecursiveValidation(string functionStr, strArray vals1, strArray vals2){
string finalFunction = "OR("
foreach(i in vals1){
foreach(j in vals2){
finalFunction += functionStr.replace('#', i).replace('^', j) + ",";
}
}
finalFunction.substring(0, finalFunction.length - 1); //to remove last comma
finalFunction += ")";
return finalFunction;
}
That is all I'm trying to accomplish. Take any given comparator and two arrays, and create a string that contains every possible combination. Given the substitution characters I listed above, below is a list of possible added operations
# > ^
to_date(#) == to_date(^)
someFunction(#, ^)
# * 2 - 3 <= ^ / 4
All I'm trying to do is produce the string that I will later execute, and I'm trying to do it without having to kill the server in a recursive loop
I don't have a solution code for this but you can algorithmically do the following
Create a temp table (start_date, end_date, formid) and populate it with every date from any existing form
Get the start_date from the form and simply:
SELECT end_date, form_id FROM temp_table WHERE end_date = <start date to check>
For the reverse
SELECT start_date, form_id FROM temp_table WHERE start_date = <end date to check>
If the database is available why not let it do all the heavy lifting.
I ended up performing a cross product of the data, and looping through the results. It wasn't the sort of solution I really wanted, but it worked.

ROR: Zero-padding removed upon database query

Is there anyway to make .find not remove zeropadding on the values it pulls back from the database?
ie. I have zipcodes in a database and some of them are shorter than 5 characters. I am zeropadding thme back to 5 characters in the database, so I end up with "00210" for example. However, this value just becomes "210" in my array.
I know I can use "%05d" % value to zeropad it when it's going back into the views... but I'd rather not have to zeropad it on the way out like that.
A Fixnum (the Ruby type you're dealing with) only cares about value. my_var = 00000001 in ruby sets my_var to 1, and outputting it as a string results in "1". If you want to format it differently, you'll have to rely on string functionality.

How to comment on MATLAB variables

When I´m using MATLAB, sometimes I feel the need to make comments on some variables. I would like to save these comments inside these variables. So when I have to work with many variables in the workspace, and I forget the context of some of these variables I could read the comments I put in every one of them. So I would like to comment variables and keep the comments inside of them.
While I'm of the opinion that the best (and easiest) approach would be to make your variables self-documenting by giving them descriptive names, there is actually a way for you to do what you want using the object-oriented aspects of MATLAB. Specifically, you can create a new class which subclasses a built-in class so that it has an additional property describing the variable.
In fact, there is an example in the documentation that does exactly what you want. It creates a new class ExtendDouble that behaves just like a double except that it has a DataString property attached to it which describes the data in the variable. Using this subclass, you can do things like the following:
N = ExtendDouble(10,'The number of data points')
N =
The number of data points
10
and N could be used in expressions just as any double value would. Using this example subclass as a template, you could create "commented" versions of other built-in numeric classes, with the exception of those you are not allowed to subclass (char, cell, struct, and function_handle).
Of course, it should be noted that instead of using the ExtendDouble class like I did in the above example, I could instead define my variable like so:
nDataPoints = 10;
which makes the variable self-documenting, albeit with a little more typing needed. ;)
How about declaring another variable for your comments?
example:
\>> num = 5;
\>> numc = 'This is a number that contains 5';
\>> whos
...
This is my first post in StackOverflow. Thanks.
A convenient way to solve this is to have a function that does the storing and displaying of comments for you, i.e. something like the function below that will pop open a dialog box if you call it with comments('myVar') to allow you to enter new (or read/update previous) comments to variable (or function, or co-worker) labeled myVar.
Note that the comments will not be available in your next Matlab session. To make this happen, you have to add save/load functionality to comments (i.e. every time you change anything, you write to a file, and any time you start the function and database is empty, you load the file if possible).
function comments(name)
%COMMENTS stores comments for a matlab session
%
% comments(name) adds or updates a comment stored with the label "name"
%
% comments prints all the current comments
%# database is a n-by-2 cell array with {label, comment}
persistent database
%# check input and decide what to do
if nargin < 1 || isempty(name)
printDatabase;
else
updateDatabase;
end
function printDatabase
%# prints the database
if isempty(database)
fprintf('no comments stored yet\n')
else
for i=1:size(database,1)
fprintf('%20s : %s\n',database{i,1},database{i,2});
end
end
end
function updateDatabase
%# updates the database
%# check whether there is already a comment
if size(database,1) > 0 && any(strcmp(name,database(:,1)))
idx = strcmp(name,database(:,1));
comment = database(idx,2);
else
idx = size(database,1)+1;
comment = {''};
end
%# ask for new/updated comment
comment = inputdlg(sprintf('please enter comment for %s',name),'add comment',...
5,comment);
if ~isempty(comment)
database{idx,1} = name;
database(idx,2) = comment;
end
end
end
Always always always keep the Matlab editor open with a script documenting what you do. That is, variable assignments and calculations.
Only exceptions are very short sessions where you want to experiment. Once you have something -- add it to the file (It's also easier to cut and paste when you can see your entire history).
This way you can always start over. Just clear all and rerun the script. You never have random temporaries floating around in your workspace.
Eventually, when you are finished, you will also have something that is close to 'deliverable'.
Have you thought of using structures (or cells, although structures would require extra memory use)?
'>> dataset1.numerical=5;
'>> dataset1.comment='This is the dataset that contains 5';
dataset1 =
numerical: 5
comment: 'This is the dataset that contains 5'