Hi Mahout community at SO!
I have a couple of questions about speeding up recommendation calculations. On my server I have Mahout installed without Hadoop, and the recommendation script runs under jRuby. In the database I have 3k users and 100k items (270k rows in the join table). When a user requests recommendations, a simple script starts working:
First it establishes a DB connection using PGPoolingDataSource, like this:
connection = org.postgresql.ds.PGPoolingDataSource.new()
connection.setDataSourceName("db_name")
connection.setServerName("localhost")
connection.setPortNumber(5432)
connection.setDatabaseName("db_name")
connection.setUser("mahout")
connection.setPassword("password")
connection.setMaxConnections(100)
connection
I get this warning:
WARNING: You are not using ConnectionPoolDataSource. Make sure your DataSource pools connections to the database itself, or database performance will be severely reduced.
Any ideas how to fix that?
After that I create the recommendations:
model = PostgreSQLJDBCDataModel.new(
connection,
'stars',
'user_id',
'repo_id',
'preference',
'created_at'
)
similarity = TanimotoCoefficientSimilarity.new(model)
neighborhood = NearestNUserNeighborhood.new(5, similarity, model)
recommender = GenericBooleanPrefUserBasedRecommender.new(model, neighborhood, similarity)
recommendations = recommender.recommend user_id, 30
Right now it takes about 5-10 seconds to generate recommendations for one user. The question is: how can I make recommendations faster (200ms would be nice)?
If you know you are using a pooling data source, you can ignore the warning. It means the implementation does not implement the usual interface for pooling implementations, ConnectionPoolDataSource.
You're never going to make this run fast if you try to run directly off a database. There is just too much data access. Wrap the JDBCDataModel in ReloadFromJDBCDataModel and it will be cached in memory, which should work, literally, 100x faster.
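As a minimal sketch in the same jRuby style as your snippet (assuming ReloadFromJDBCDataModel is imported from org.apache.mahout.cf.taste.impl.model.jdbc the same way as the other classes):
# Wrap the JDBC-backed model so all preference data is loaded and cached in memory,
# instead of hitting PostgreSQL for every similarity computation.
cached_model = ReloadFromJDBCDataModel.new(model)
similarity = TanimotoCoefficientSimilarity.new(cached_model)
neighborhood = NearestNUserNeighborhood.new(5, similarity, cached_model)
recommender = GenericBooleanPrefUserBasedRecommender.new(cached_model, neighborhood, similarity)
recommendations = recommender.recommend(user_id, 30)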
Related
Let's say I want to query all Orchard user IDs, including users that have been removed (a.k.a. soft deleted). The DB contains around 1,000 users.
Option A - takes around 2 minutes
Orchard.ContentManagement.IContentManager lContentManager = ...;
lContentManager
.Query<Orchard.Users.Models.UserPart, Orchard.Users.Models.UserPartRecord>(Orchard.ContentManagement.VersionOptions.AllVersions)
.List()
.Select(u => u.Id)
.ToList();
Option B - executes with almost unnoticeable delay
Orchard.Data.IRepository<Orchard.Users.Models.UserPartRecord> UserRepository = ...;
UserRepository.Fetch(u => true).Select(u => u.Id).ToList();
I don't see any SQL queries being executed in SQL Profiler when using Option A. I guess it has something to do with NHibernate or caching.
Is there any way to optimize Option A?
Could it be because the IContentManager version is accessing the data via the InfoSet (basically an XML representation of the data), whereas the IRepository version uses the actual DB table itself?
I seem to remember reading that though the InfoSet is great in many cases, when you're dealing with larger datasets with sorting/filtering it is more efficient to go direct to the table, as using the InfoSet requires each XML fragment to be parsed and its elements extracted before you get to the data.
Since 'the shift', Orchard uses both, so you can use whichever method best suits your needs. I can't find the article that explained it now, but this explains the shift & infosets quite nicely:
http://weblogs.asp.net/bleroy/the-shift-how-orchard-painlessly-shifted-to-document-storage-and-how-it-ll-affect-you
Hope that helps.
I have 15 files of data, each around 4.5GB. Each file is a month's worth of data for around 17,000 customers. All together, the data represents information on 17,000 customers over the course of 15 months. I want to reformat this data so that, instead of 15 files each denoting a month, I have 17,000 files, one per customer, containing all of that customer's data. I wrote a script to do this:
# The variable 'files' is a vector of locations of the 15 month files
exists = NULL  # keeps track of customers who already have a file created for them
for (w in 1:15) {  # for each of the 15 month files
  month = fread(files[w], select = c(2, 3, 6, 16))  # read in the columns I want
  custlist = unique(month$CustomerID)  # all customers in this month file
  for (i in 1:length(custlist)) {  # for each customer in this month file
    curcust = custlist[i]  # the current customer
    newchunk = subset(month, CustomerID == curcust)  # all the data for this customer
    filename = sprintf("cust%s", curcust)  # the filename for this customer
    if (curcust %in% exists) {  # if a file already exists for this customer, read it, add to it, and write it back
      custfile = fread(strwrap(sprintf("C:/custFiles/%s.csv", filename)))  # read in the existing file
      custfile$V1 = NULL  # remove the extra column that fread adds
      custfile = rbind(custfile, newchunk)  # combine the existing data with the new data
      write.csv(custfile, file = strwrap(sprintf("C:/custFiles/%s.csv", filename)))
    } else {  # if it has not been created, write newchunk to a csv
      write.csv(newchunk, file = strwrap(sprintf("C:/custFiles/%s.csv", filename)))
      exists = rbind(exists, curcust, deparse.level = 0)  # add customer to the list of existing files
    }
  }
}
The script works (at least, I'm pretty sure it does). The problem is that it is incredibly slow. At the rate I'm going, it's going to take a week or more to finish, and I don't have that time. Do any of you know a better, faster way to do this in R? Should I try to do this in something like SQL? I've never really used SQL before; could any of you show me how something like this would be done? Any input is greatly appreciated.
Like @Dominic Comtois, I would also recommend using SQL.
R can handle quite big data - there is a nice benchmark of 2 billion rows which beats Python - but because R runs mostly in memory you need a good machine to make it work. Still, your case doesn't need to load more than a 4.5GB file at once, so it should be well doable on a personal computer; see the second approach below for a fast non-database solution.
You can use R to load the data into a SQL database and later query it from the database.
If you don't know SQL, you may want to use some simple database. The simplest way from R is to use RSQLite (unfortunately, since v1.1 it is not so lite any more). You don't need to install or manage any external dependency; the RSQLite package contains the database engine embedded.
library(RSQLite)
library(data.table)
conn <- dbConnect(dbDriver("SQLite"), dbname="mydbfile.db")
monthfiles <- c("month1","month2") # ...
# write data
for (monthfile in monthfiles) {
  dbWriteTable(conn, "mytablename", fread(monthfile), append=TRUE)
  cat("data for", monthfile, "loaded to db\n")
}
# query data
df <- dbGetQuery(conn, "select * from mytablename where customerid = 1")
# when working with bigger sets of data, I would recommend converting to data.table as below
setDT(df)
dbDisconnect(conn)
That's all. You get SQL without much of the overhead usually related to databases.
If you prefer to go with the approach from your post, I think you can dramatically speed it up by doing the write.csv by group while aggregating in data.table.
library(data.table)
monthfiles <- c("month1","month2") # ...
# write data
for (monthfile in monthfiles) {
  # note: write.csv ignores 'append', so for true appending across months use write.table or fwrite (see the update below)
  fread(monthfile)[, write.csv(.SD, file=paste0(CustomerID, ".csv"), append=TRUE), by=CustomerID]
  cat("data for", monthfile, "written to csv\n")
}
So you utilize the fast unique from data.table and perform the subsetting while grouping, which is also ultra fast. Below is a working example of the approach.
library(data.table)
data.table(a=1:4,b=5:6)[,write.csv(.SD,file=paste0(b,".csv")),b]
Update 2016-12-05:
Starting from data.table 1.9.8+, you can replace write.csv with fwrite; see the example in this answer.
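As a minimal sketch (assuming the same monthfiles vector as above), the grouped write with fwrite would look something like this:
library(data.table)
for (monthfile in monthfiles) {
  # fwrite supports append = TRUE, so data for the same customer accumulates across month files
  fread(monthfile)[, fwrite(.SD, file = paste0(CustomerID, ".csv"), append = TRUE), by = CustomerID]
}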
I think you already have your answer. But to reinforce it, see the official documentation,
R Data Import/Export,
which states:
In general, statistical systems like R are not particularly well suited to manipulations of large-scale data. Some other systems are better than R at this, and part of the thrust of this manual is to suggest that rather than duplicating functionality in R we can make another system do the work! (For example Therneau & Grambsch (2000) commented that they preferred to do data manipulation in SAS and then use package survival in S for the analysis.) Database manipulation systems are often very suitable for manipulating and extracting data: several packages to interact with DBMSs are discussed here.
So clearly storage of massive data is not R's primary strength, yet it provides interfaces to several tools specialized for this. In my own work, the lightweight SQLite solution is enough, even if it's a matter of preference, to some extent. Search for "drawbacks of using SQLite" and you probably won't find much to dissuade you.
You should find SQLite's documentation pretty smooth to follow. If you have enough programming experience, doing a tutorial or two should get you going pretty quickly on the SQL front. I don't see anything overly complicated going on in your code, so the most common & basic queries such as CREATE TABLE, SELECT ... WHERE will likely meet all your needs.
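As a rough sketch of the kind of SQL involved (the table name matches the RSQLite example above; the columns other than customerid are placeholders for whichever four columns you actually select):
CREATE TABLE mytablename (
    customerid INTEGER,
    col_a      TEXT,
    col_b      REAL,
    col_c      REAL
);
SELECT customerid, col_a, col_b, col_c
FROM mytablename
WHERE customerid = 1;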
Edit
Another advantage of using a DBMS that I didn't mention is that you can have views that make other organizations of the data easily accessible, so to speak. By creating views, you can go back to the "by month" organization without having to rewrite any table or duplicate any data.
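For instance, a hypothetical SQLite view (the purchase_date column name is a placeholder for whatever date column your data actually contains):
-- Re-creates the "by month" organization without duplicating any data
CREATE VIEW by_month AS
SELECT *, strftime('%Y-%m', purchase_date) AS month
FROM mytablename;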
I have the following script:
SELECT
DEPT.F03 AS F03, DEPT.F238 AS F238, SDP.F04 AS F04, SDP.F1022 AS F1022,
CAT.F17 AS F17, CAT.F1023 AS F1023, CAT.F1946 AS F1946
FROM
DEPT_TAB DEPT
LEFT OUTER JOIN
SDP_TAB SDP ON SDP.F03 = DEPT.F03,
CAT_TAB CAT
ORDER BY
DEPT.F03
The tables are huge. When I execute the script in SQL Server directly it takes around 4 minutes to execute, but when I run it in the third-party program (SMS LOC, based on Delphi) it gives me the error
<msg> out of memory</msg> <sql> the code </sql>
Is there any way I can lighten the script so it can be executed? Or has anyone had the same problem and solved it somehow?
I remember having had to resort to the ROBUST PLAN query hint once on a query where the query-optimizer kind of lost track and tried to work it out in a way that the hardware couldn't handle.
=> http://technet.microsoft.com/en-us/library/ms181714.aspx
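For reference, the hint goes at the end of the statement; applied to the query from the question it would look like this:
SELECT
    DEPT.F03 AS F03, DEPT.F238 AS F238, SDP.F04 AS F04, SDP.F1022 AS F1022,
    CAT.F17 AS F17, CAT.F1023 AS F1023, CAT.F1946 AS F1946
FROM
    DEPT_TAB DEPT
LEFT OUTER JOIN
    SDP_TAB SDP ON SDP.F03 = DEPT.F03,
    CAT_TAB CAT
ORDER BY
    DEPT.F03
OPTION (ROBUST PLAN)  -- tells the optimizer to favor plans that work for the maximum potential row size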
But I'm not sure I understand why it would work for one 'technology' and not another.
Then again, the error message might not be from SQL but rather from the 3rd-party program that gathers the output and does so in a 'less than ideal' way.
Consider adding paging to the user-facing screen and the underlying data call. The point being that you don't need to see all the rows at one time, but they are available to the user upon request.
This will alleviate much of your performance problem.
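As a rough sketch of what the paged data call could look like in T-SQL (this assumes SQL Server 2012+ for OFFSET/FETCH; the page size, the @PageNumber parameter, and the reduced column list are illustrative only):
DECLARE @PageNumber INT = 1, @PageSize INT = 100;

SELECT DEPT.F03, DEPT.F238
FROM DEPT_TAB DEPT
ORDER BY DEPT.F03
OFFSET (@PageNumber - 1) * @PageSize ROWS   -- skip the previous pages
FETCH NEXT @PageSize ROWS ONLY;             -- return only one page of rows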
I had a project where I had to add over 7 million individual lines of T-SQL code via batch (I couldn't figure out how to programmatically leverage the new SEQUENCE command). The problem was that there was a limited amount of memory available on my VM (I was allocated the max amount of memory for this VM). Because of the large number of lines of T-SQL code, I had to first test how many lines it could take before the server crashed. For whatever reason, SQL Server (2012) doesn't release the memory it uses for large batch jobs such as mine (we're talking around 12 GB of memory), so I had to reboot the server every million or so lines. This is what you may have to do if resources are limited for your project.
My Analysis Services database is responding really slowly after processing. The problem can also be reproduced by clearing the cache with the ClearCache XMLA command. I understand that after clearing the cache, query performance is slower, but I'm seeing slow performance also when using the Microsoft.AnalysisServices.AdomdClient library.
I made a small timing test.
DateTime start = DateTime.Now;
int dc = cube.Dimensions.Count; // cube = Microsoft.AnalysisServices.AdomdClient.CubeDef
DateTime end = DateTime.Now;
Debug.WriteLine("Start: " + start.ToLongTimeString());
Debug.WriteLine("Dimensions count: " + dc.ToString());
Debug.WriteLine("End: " + end.ToLongTimeString());
For example, this gives the following result:
Start: 8:41:53
Dimensions count: 18
End: 8:43:15
So it takes almost 1.5 minutes to get the count of the dimensions. I see the same performance if I get the measures (of which there are only a few).
After the first operation, all the following operations and queries are fast. My question is: how can I work around this issue? It's a real problem when the database becomes almost non-responsive after every processing run. I could do something to automatically "fire up" the database after processing, but wouldn't that just move the waiting time from one place to another?
Update:
I've found the problem. The reason why the performance was different between Management Studio and the AdomdClient was that with the AdomdClient I had a different connection string to the Analysis Services database. I have some custom stuff in the database which fired with that connection string. Anyway, the problem is now solved and wasn't directly related to Analysis Services itself.
Lesson learned: make sure you're testing with the correct connection string :)
The answer is in the question update - the culprit is an incorrect connection string.
I am currently trying to wring more performance out of my reporting website, which uses LINQ to SQL and a SQL Server Express 2008 database.
I am finding that, as I now approach a million rows in one of my more 'ugly' tables, performance is becoming a real issue, with one report in particular taking 3 minutes to generate.
Essentially, I have a loop that, for each user, hits the database and grabs a collection of data on them. This data is then queried in various ways (and more rows loaded as needed) until I have a nice little summary object that I can fire off to a set of Silverlight charts. Lazy loading is used, and the reporting pulls in data from around 8 linked tables.
The problem is I don't know where the bottleneck now is or how to improve performance. Due to certain constraints I was forced to use uniqueidentifiers for a number of primary keys in the tables involved - could this be an issue?
Basically, I need to put time into increasing performance but don't have enough to do that for both the database and the LINQ to SQL. Is there any way I can see where the bottlenecks are?
As I'm running Express, I don't have access to the Profiler. I am considering rewriting my queries into compiled LINQ to SQL, but fear the database may be the culprit.
I understand this question is a bit open ended and it's hard to answer without knowing much more about my setup (database schema etc.), but any advice on how to find out where the bottlenecks are is much appreciated!
Thanks
UPDATE:
Thanks for all the great advice guys, and some links to some great tools.
UPDATE for those interested
I have been unable to make my queries any quicker through tweaking the LINQ. The problem seems to be that the majority of my database access code takes place in a loop. I can't see a way around it. Basically I am building up a report by looking through a number of users' data - hence the loop. Pulling all the records up front seems a bit crazy - 800,000+ rows. My gut feeling is that there is a much better way, but it's a technological leap too far for me!
However, adding another index to one of the foreign keys in one of the tables boosted performance, so now the report takes 20 seconds to generate as opposed to 3 minutes!
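For reference, the index itself is a one-liner along these lines (the table and column names here are placeholders, not my actual schema):
-- Index the foreign key column that the report joins/filters on
CREATE NONCLUSTERED INDEX IX_Orders_UserId
ON dbo.Orders (UserId);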
I used this excellent tool: Linq2Sql profiler. It works on the application side, so there is no need for database-server profiling functionality.
You have to add one line of initialization code to your application, and then, in a separate desktop application, the profiler shows you the SQL query for each LINQ query with the exact line of code where it was executed (.cs or .aspx), the database time and application time of execution, and it even detects some common performance problems like N+1 queries (a query executed per iteration) or unbounded result sets. You have to pay for it, but a trial version is also available.
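If I remember correctly, the initialization call is something along these lines (the exact namespace is from memory, so treat it as an assumption and check the profiler's documentation):
// Assumed API: call once at application startup (e.g. in Global.asax Application_Start)
HibernatingRhinos.Profiler.Appender.LinqToSql.LinqToSqlProfiler.Initialize();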
As you're using SQL Express, which doesn't have Profiler, there is a free third-party profiler you can download here. I've used it when running SQL Express. That will allow you to trace what's going on in the database.
Also, you can query the Dynamic Management Views to see what the costly queries are:
e.g. the TOP 10 queries that have taken the most time:
SELECT TOP 10 t.text, q.*, p.query_plan
FROM sys.dm_exec_query_stats q
CROSS APPLY sys.dm_exec_sql_text(q.sql_handle) t
CROSS APPLY sys.dm_exec_query_plan (q.plan_handle) AS p
ORDER BY q.total_worker_time DESC
There are two tools I use for this: LINQPad and the Visual Studio debugger. First, check out LINQPad; even the free version is very powerful, showing you execution time and the SQL generated, and you can use it to run any code snippet... it's tremendously useful.
Second, you can use the Visual Studio debugger. This is something we use on our DataContext (note: only use this in debug builds; it's a performance hit and completely unnecessary outside of debugging):
#if DEBUG
private readonly Stopwatch Watch = new Stopwatch();

private static void Connection_StateChange(object sender, StateChangeEventArgs e)
{
    if (e.OriginalState == ConnectionState.Closed && e.CurrentState == ConnectionState.Open)
    {
        Current.Watch.Start();
    }
    else if (e.OriginalState == ConnectionState.Open && e.CurrentState == ConnectionState.Closed)
    {
        Current.Watch.Stop();
        string msg = string.Format("SQL took {0}ms", Current.Watch.ElapsedMilliseconds);
        Trace.WriteLine(msg);
    }
}
#endif

private static DataContext New
{
    get
    {
        var dc = new DataContext(ConnectionString);
#if DEBUG
        if (Debugger.IsAttached)
        {
            dc.Connection.StateChange += Connection_StateChange;
            dc.Log = new DebugWriter();
        }
#endif
        return dc;
    }
}
In a debug build, as an operation completes with each context, we see the timing in the debug window along with the SQL it ran. The DebugWriter class you see can be found here (credit: Kris Vandermotten). We can quickly see if a query is taking a while. To use it, we just instantiate a DataContext via:
var DB = DataContext.New;
(The profiler is not an option for me since we don't use SQL Server; this answer is simply to give you some alternatives that have been very useful for me.)