Rails: show a different object every day - SQL

I want to match my user to a different user in his/her community every day. Currently, I use code like this:
@matched_user = User.near(@user).order("RANDOM()").first
But I want to have a different @matched_user on a daily basis. I haven't been able to find anything on Stack Overflow or in the API docs that has given me insight on how to do it. I feel it should be simpler than having to resort to a rake task with cron. (I'm on Postgres.)

Whenever I find myself hankering for shared 'memory' or transient state, I think to myself "this is what (distributed) caches were invented for".
@matched_user = Rails.cache.fetch(@user.cache_key + '/daily_match', expires_in: 1.day) do
  User.near(@user).order("RANDOM()").first
end
NOTE: While specifying a TTL for a cache entry tells Rails/the cache system to try to keep that value for the given timeframe, there's NO guarantee that it will. In particular, a cache that aggressively reclaims memory may evict an entry well before its desired expires_in time. (Also note that expires_in: 1.day expires 24 hours after the entry is written, not at midnight.)
For this particular use case that shouldn't be a big deal, but where the business/domain logic demands durable, periodically generated values, you really have to persist them in your database.
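For example, here is a minimal sketch of the durable variant, assuming a hypothetical DailyMatch model with user_id, matched_user_id and matched_on columns:
# Hypothetical model: DailyMatch(user_id, matched_user_id, matched_on)
# Keying on the calendar date survives cache eviction and restarts.
match = DailyMatch.find_or_create_by!(user_id: @user.id, matched_on: Date.current) do |m|
  m.matched_user_id = User.near(@user).order("RANDOM()").first.id
end
@matched_user = User.find(match.matched_user_id)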

How about using PostgreSQL's SETSEED function? I used the date as the seed so that the seed changes every day but stays consistent within a day:
User.connection.execute "SELECT SETSEED(#{Date.today.strftime("%y%d%m").to_i / 1000000.0})"
@matched_user = User.near(@user).order("RANDOM()").first
You may want to re-seed with a random value afterwards so that future calls to RANDOM() aren't biased:
random = User.connection.execute("SELECT RANDOM()").to_a.first["random"]
# Same code as above:
User.connection.execute "SELECT SETSEED(#{Date.today.strftime("%y%d%m").to_i/1000000.0})"
@matched_user = User.near(@user).order("RANDOM()").first
# Use random value before seed to make new seed:
User.connection.execute "SELECT SETSEED(#{random})"

I have split these steps into separate sections just for readability; you can optimise the query later.
1) Find all user records created before the start of today, so that the count stays frozen for the day.
users_till_today_morning = User.where("created_at < ?", Time.zone.now.beginning_of_day)
2) Pluck all IDs
user_ids = users_till_today_morning.pluck(:id)
3) Today's day of the month is in the range (1..31) but remains constant throughout the day.
day_today = Time.zone.now.day
4) Select the same ID for the whole day
todays_user_id = user_ids[day_today % user_ids.count]
@matched_user = User.find(todays_user_id)
So it will give you a pseudo-random user record while keeping the same record throughout the day!

Related

How can I show the most recent events per user with Keen IO?

Suppose you have a Keen IO collection called "survey-completed" that contains events matching the following pattern:
keen.id: <unique autogenerated id>
keen.timestamp: <autogenerated overridable timestamp>
userId: <hex string for user>
surveyScore: <integer from 1 to 10>
...
How would you create a report of only the most up-to-date satisfaction score for each user that responded to one or more surveys within a given amount of time (such as one week)?
There isn't a really elegant way to make it happen, but for a given userId you can return the most up-to-date event by creating a count query with a group_by on [surveyScore, keen.timestamp] and an order_by on the keen.timestamp property. Set limit=1 to select only the most recent surveyScore.
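As an illustration, a sketch of that count query using the keen-analysis JavaScript SDK (the project ID, read key, user ID and timeframe are placeholders):
// Hypothetical setup; substitute your own project credentials.
const KeenAnalysis = require('keen-analysis');
const client = new KeenAnalysis({ projectId: 'PROJECT_ID', readKey: 'READ_KEY' });

client.query('count', {
  event_collection: 'survey-completed',
  timeframe: 'this_7_days',
  filters: [{ property_name: 'userId', operator: 'eq', property_value: 'SOME_USER_ID' }],
  group_by: ['surveyScore', 'keen.timestamp'],
  order_by: [{ property_name: 'keen.timestamp', direction: 'DESC' }],
  limit: 1
}).then(res => console.log(res.result));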
If you'd like to use an extraction, the most straightforward way would be to run an extraction with property_names set to ["userId","keen.timestamp","surveyScore"]. Once you receive the results you can do some client-side post-processing. This is probably the best way if you want to look at all of your userIds.
If you're interested in a given userId and want to use an extraction, you can run an extraction with a filter on userId eq X and set the optional parameter latest to latest=1. The latest property is an integer giving the number of most recent events to extract. Note: using latest sorts on the keen.created_at timestamp instead of keen.timestamp (https://keen.io/docs/api/#the-keen-object).
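And a sketch of the extraction-plus-post-processing route for all userIds, using the same hypothetical client as above:
// Extract only the needed properties, then keep each user's newest event.
client.query('extraction', {
  event_collection: 'survey-completed',
  timeframe: 'this_7_days',
  property_names: ['userId', 'keen.timestamp', 'surveyScore']
}).then(res => {
  const latestByUser = {};
  res.result.forEach(ev => {
    const prev = latestByUser[ev.userId];
    // ISO-8601 timestamps compare correctly as strings
    if (!prev || ev.keen.timestamp > prev.keen.timestamp) {
      latestByUser[ev.userId] = ev;
    }
  });
  console.log(latestByUser);
});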

pushgateway or node exporter - how to add string data?

I have a cron job that runs an SQL query every day and gives me an important integer.
And I have to expose that integer to the Prometheus server.
As I've seen, I have two options: use the pushgateway or the node exporter.
But the metric (integer) that I get from the SQL query also needs some accompanying information (like the company name, and the database that I got it from).
What would be a better way?
For instance this is what I made for my metric:
count = some_number
registry = CollectorRegistry()
g = Gauge('machine_number', 'machfoobarine_stat', registry=registry)
g.set(count)
push_to_gateway('localhost:9091', job='batchA', registry=registry)
So how do I add key-value pairs to my metric above?
Do I have to change the job name ('batchA') for every single SQL count that I expose to the pushgateway? Because currently I can only see the last one.
Tnx,
Tom
The best way is to give your metric a general name, for example animal_count, and then specialize it with labels. Here is some pseudocode (in the Java client's style):
Gauge g = Gauge.build("animal_count", "Number of animals in the zoo")
    .labelNames("sex", "classes")
    .create();
g.labels("male", "mammals")
    .set(count);
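In the Python client the question uses, the same idea looks roughly like this ('company' and 'database' are example label names; adjust them to your own data):
# Sketch using prometheus_client: labels carry the extra key-value pairs,
# so the job name can stay constant across pushes.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

count = 42  # the integer from your SQL query
registry = CollectorRegistry()
g = Gauge('machine_number', 'machine stat',
          ['company', 'database'], registry=registry)
g.labels(company='acme', database='sales_db').set(count)
# Note: a push with the same job (and grouping key) replaces the previous one.
push_to_gateway('localhost:9091', job='batchA', registry=registry)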

Use API to gather statistics on my followers

I am very new to this and would like to know how to start gathering statistics on my followers as I am currently growing my follower base. I am subscribed to several statistics-tracking apps, but none of them are really good.
I wish to track things such as:
Follower count by Location
Frequency distribution of followers and tags
Follower growth rate by hour, day, week, etc.
Follower Loss
Is this at all possible using APIs? Can anyone tell me how to get started?
There is no direct API call to get follower growth by hour or week; you have to fetch all followers every hour, store them in a database, and compare each hour against the previous one to compute growth or loss.
You cannot get followers' locations from the API. You could perhaps estimate locations by checking for a location in each follower's bio, or by analyzing their posts and finding the most frequently tagged location (this is expensive on the API side and requires a lot of calls).
Yes, all of this is possible using the API, but it is a lot of backend work, so if some service does this it will cost you money, because they cannot do it for free. My guess is that you have checked all the free or cheap services and they cannot do this analysis cheaply.
You can get a broad breakdown of follower count in Google Sheets. This doesn't require API access, so you won't get all of the data you are looking for (such as geo data), but if you would like to see your follower count increase by the hour, do this -
Open up a new Google Sheet
Go to Tools > Script Editor
Name your script 'IGFollowers'
In the code box, copy and paste the code below, making sure to replace 'AccountName' with your username
var sheetName = "IGFollowers";
var instagramAccountName = "AccountName";

function insertFollowerCount() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheetByName(sheetName);
  sheet.appendRow([Utilities.formatDate(new Date(), "PST", "yyyy-MM-dd"),
                   Utilities.formatDate(new Date(), "PST", "hh:mm"),
                   getInstagramFollowerCount(instagramAccountName)]);
}

function getInstagramFollowerCount(username) {
  var url = "https://www.instagram.com/" + username + "/?__a=1";
  var response = UrlFetchApp.fetch(url).getContentText();
  return JSON.parse(response).user.followed_by.count;
}
Go to Run > insertFollowerCount
NOTE: You may need to do a bit of formatting with the main Google Sheet, but this will get you some very long columns showing an increase in followers by the hour.

Using deepstream List for tens of thousands unique values

I wonder whether it's a good or bad idea to use deepstream record.getList for storing a lot of unique values, for example emails or any other unique identifiers. The main purpose is to be able to answer quickly whether we already have, say, a user with a given email (email in use), or to find a record by some other unique field.
I ran a few experiments today and hit two problems:
1) When I tried to populate the list with a few thousand values I got
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
and my deepstream server went down. I was able to fix it by giving the server's Node process more memory with this flag:
--max-old-space-size=5120
It doesn't look right, but it allowed me to make a list with more than 5000 items.
2) That wasn't enough for my tests, so I pre-created a list with 50000 items by putting the data directly into the RethinkDB table, and got another issue when getting or modifying the list:
RangeError: Maximum call stack size exceeded
I was able to fix that with another flag:
--stack-size=20000
It helps, but I believe it's only a matter of time before one of those errors shows up in production once the list grows big enough. I don't really know whether it's a Node.js, JavaScript, deepstream, or RethinkDB issue. All of this made me think that I'm using deepstream lists the wrong way. Please let me know. Thank you in advance!
Whilst you can use lists to store arrays of strings, they are actually intended as collections of record names - the actual data would be stored in the record itself, and the list would only manage the order of the records.
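For example, a sketch of the intended pattern (the record and list names here are made up):
// Store the data itself in a record...
var user = ds.record.getRecord( 'user/a1b2' );
user.set({ email: 'someone@example.com' });
// ...and let the list only track which record names exist, in order.
var userList = ds.record.getList( 'users' );
userList.addEntry( 'user/a1b2' );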
Having said that, there are two open GitHub issues to improve performance for very long lists, by sending more efficient deltas and by introducing a pagination option.
Interesting results in regard to memory, though; definitely something that needs to be handled more gracefully. In the meantime you could drastically improve performance by combining updates into one:
var myList = ds.record.getList( 'super-long-list' );

// Sends 10,000 messages
for( var i = 0; i < 10000; i++ ) {
  myList.addEntry( 'something-' + i );
}

// Sends 1 message
var entries = [];
for( var i = 0; i < 10000; i++ ) {
  entries.push( 'something-' + i );
}
myList.setEntries( entries );

DQL query to return all files in a Cabinet in Documentum?

I want to retrieve all the files from a cabinet (called 'Wombat Insurance Co'). Currently I am using this DQL query:
select r_object_id, object_name from dm_document(all)
where folder('/Wombat Insurance Co', descend);
This is ok except it only returns a maximum of 100 results. If there are 5000 files in the cabinet I want to get all 5000 results. Is there a way to use pagination to get all the results?
I have tried this query:
select r_object_id, object_name from dm_document(all)
where folder('/Wombat Insurance Co', descend)
ENABLE (RETURN_RANGE 0 100 'r_object_id DESC');
with the intention of getting results in 100 file increments, but this query gives me an error when I try to execute it. The error says this:
com.emc.documentum.fs.services.core.CoreServiceException: "QUERY" action failed.
java.lang.Exception: [DM_QUERY2_E_UNRECOGNIZED_HINT]error:
"RETURN_RANGE is an unknown hint or is being used incorrectly."
I think I am using the RETURN_RANGE hint correctly, but maybe I'm not. Any help would be appreciated!
I have also tried using the hint ENABLE(FETCH_ALL_RESULTS 0) but this still only returns a maximum of 100 results.
To clarify, my question is: how can I get all the files from a cabinet?
You have already accepted an answer which is using DFS.
Since you are playing with DFC, this information might help you.
DFS:
If you are using DFS, you have to be aware of the number of concurrent sessions you can consume with DFS.
I think it is 100 or 150.
DFC:
Actually, there is a limit on how many results you can fetch via DFC (I'm not sure about DFS).
Go to your DFC application (Webtop, DA, or anything else) and check the dfc.properties file.
# Maximum number of results to retrieve by a query search.
# min value: 1, max value: 10000000
#
dfc.search.max_results = 100
# Maximum number of results to retrieve per source by a query search.
# min value: 1, max value: 10000000
#
dfc.search.max_results_per_source = 400
There is a dfc.properties.full (or similar) file where you can verify these values for your system.
And I'm talking about the Content Server side, not the client-side dfc.properties file.
If you use the ENABLE (RETURN_TOP) hint with DFC, there are two ways to fetch the results from the Content Server:
Object based
Row based
You have to configure this with the return_top_results_row_based parameter in the server.ini file.
All of these changes are on the Documentum server side, not in your DFC/DQL client.
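For illustration, that would be an entry in the Content Server's server.ini, along these lines (the section and default value may differ by version):
[SERVER_STARTUP]
return_top_results_row_based = T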
Aha, I've figured it out. Using DFS with Java (an abstraction layer on top of DFC) you can set the starting index for query results:
String queryStr = "select r_object_id, object_name from dm_document(all) "
    + "where folder('/Wombat Insurance Co', descend)";
PassthroughQuery query = new PassthroughQuery();
query.setQueryString(queryStr);
query.addRepository(repositoryStr);
QueryExecution queryEx = new QueryExecution();
queryEx.setCacheStrategyType(CacheStrategyType.DEFAULT_CACHE_STRATEGY);
queryEx.setStartingIndex(currentIndex); // set start index here
OperationOptions operationOptions = null;
// will return 100 results starting from currentIndex
QueryResult queryResult = queryService.execute(query, queryEx, operationOptions);
You can just increment the currentIndex variable to get all results.
Well, the hint is being used incorrectly. Start with 1, not 0.
There is no built-in limit in DQL itself. All results are returned by default. The reason you get only 100 results must have something to do with the way you're using DFC (or whichever other client you are using). Using IDfCollection in the following way will surely return everything:
IDfQuery query = new DfQuery("SELECT r_object_id, object_name "
+ "FROM dm_document(all) WHERE FOLDER('/System', DESCEND)");
IDfCollection coll = query.execute(session, IDfQuery.DF_READ_QUERY);
int i = 0;
while (coll.next()) i++;
System.out.println("Number of results: " + i);
In a test environment (CS 6.7 SP1 x64, MS SQL), this outputs:
Number of results: 37162
Now, there's proof. Using paging is however a good idea if you want to improve the overall performance in your application. As mentioned, start counting with the number 1:
ENABLE(RETURN_RANGE 1 100 'r_object_id DESC')
This way of paging requires that sorting be specified in the hint rather than as a DQL statement. If all you want is the first 100 records, try this hint instead:
ENABLE(RETURN_TOP 100)
In this case sorting with ORDER BY will work as you'd expect.
Lastly, note that adding (all) will not only find all documents matching the specified qualification, but all versions of every document. If this was your intention, that's fine.
I've worked with the DFC API (with Java) for a while, but I don't remember any default limit on queries; IIRC we always got all of the documents, without any limit. Actually (according to my notes) we had to set a limit explicitly with, for example, enable (return_top 2000). (As far as I know, the syntax might depend on the DBMS behind EMC Documentum.)
Just a guess: check your dfc.properties file.