I have 15 files of data, each around 4.5GB. Each file is a month's worth of data for around 17,000 customers. All together, the data represents information on 17,000 customers over the course of 15 months. I want to reformat this data so that, instead of 15 files each denoting a month, I have 17,000 files, one per customer, each containing all of that customer's data. I wrote a script to do this:
library(data.table) #needed for fread
#the variable 'files' is a vector of locations of the 15 month files
exists = NULL #this vector keeps track of customers who already have a file created for them
for (w in 1:15){ #for each of the 15 month files
  month = fread(files[w], select = c(2,3,6,16)) #read in the columns I want
  custlist = unique(month$CustomerID) #a list of all customers in this month file
  for (i in 1:length(custlist)){ #for each customer in this month file
    curcust = custlist[i] #the current customer
    newchunk = subset(month, CustomerID == curcust) #all the data for this customer
    filename = sprintf("cust%s", curcust) #what the filename for this customer will be, or already is
    if (curcust %in% exists){ #if a file has already been created for this customer, read it in, add to it, and write it back
      custfile = fread(strwrap(sprintf("C:/custFiles/%s.csv", filename))) #read in the existing file
      custfile$V1 = NULL #remove the extra row-name column that write.csv added
      custfile = rbind(custfile, newchunk) #combine the existing data with the new data
      write.csv(custfile, file = strwrap(sprintf("C:/custFiles/%s.csv", filename)))
    } else { #if it has not been created, write newchunk to a csv
      write.csv(newchunk, file = strwrap(sprintf("C:/custFiles/%s.csv", filename)))
      exists = rbind(exists, curcust, deparse.level = 0) #add customer to the list of existing files
    }
  }
}
The script works (at least, I'm pretty sure). The problem is that it is incredibly slow. At the rate I'm going, it's going to take a week or more to finish, and I don't have that time. Do any of you know a better, faster way to do this in R? Should I try to do this in something like SQL? I've never really used SQL before; could any of you show me how something like this would be done? Any input is greatly appreciated.
Like @Dominic Comtois, I would also recommend using SQL.
R can handle fairly big data - there is a nice benchmark of 2 billion rows that beats Python - but because R runs mostly in memory you need a good machine to make it work. Still, your case doesn't need more than one 4.5GB file loaded at a time, so it should be well doable on a personal computer; see the second approach below for a fast non-database solution.
You can use R to load the data into an SQL database and later query it from the database.
If you don't know SQL, you may want to use some simple database. The simplest way from R is to use RSQLite (unfortunately, since v1.1 it is not so lite any more). You don't need to install or manage any external dependency; the RSQLite package contains the database engine embedded.
library(RSQLite)
library(data.table)
conn <- dbConnect(dbDriver("SQLite"), dbname="mydbfile.db")
monthfiles <- c("month1","month2") # ...
# write data
for(monthfile in monthfiles){
  dbWriteTable(conn, "mytablename", fread(monthfile), append=TRUE)
  cat("data for",monthfile,"loaded to db\n")
}
# query data
df <- dbGetQuery(conn, "select * from mytablename where customerid = 1")
# when working with bigger result sets I would recommend converting to data.table as below
setDT(df)
dbDisconnect(conn)
That's all. You use SQL without really having to deal with much of the overhead usually related to databases.
If you prefer to stick with the approach from your post, I think you can speed it up dramatically by doing the write.csv by groups while aggregating in data.table.
library(data.table)
monthfiles <- c("month1","month2") # ...
# write data
for(monthfile in monthfiles){
  fread(monthfile)[, write.csv(.SD, file=paste0(CustomerID,".csv"), append=TRUE), by=CustomerID]
  cat("data for",monthfile,"written to csv\n")
}
So you utilize the fast unique from data.table and perform the subsetting while grouping, which is also ultra fast. Below is a working example of the approach.
library(data.table)
data.table(a=1:4,b=5:6)[,write.csv(.SD,file=paste0(b,".csv")),b]
Update 2016-12-05:
Starting from data.table 1.9.8 you can replace write.csv with fwrite; see the example in this answer.
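A minimal sketch of the fwrite variant (assumes data.table 1.9.8 or newer and the same CustomerID column as in the original script):
library(data.table)
monthfiles <- c("month1","month2") # ...
# write data, one csv per customer, appending as later months are processed
for(monthfile in monthfiles){
  fread(monthfile)[, fwrite(.SD, file=paste0(CustomerID,".csv"), append=TRUE), by=CustomerID]
  cat("data for",monthfile,"written to csv\n")
}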
I think you already have your answer. But to reinforce it, see the official documentation,
R Data Import/Export,
which states:
In general, statistical systems like R are not particularly well
suited to manipulations of large-scale data. Some other systems are
better than R at this, and part of the thrust of this manual is to
suggest that rather than duplicating functionality in R we can make
another system do the work! (For example Therneau & Grambsch (2000)
commented that they preferred to do data manipulation in SAS and then
use package survival in S for the analysis.) Database manipulation
systems are often very suitable for manipulating and extracting data:
several packages to interact with DBMSs are discussed here.
So clearly storage of massive data is not R's primary strength, yet it provides interfaces to several tools specialized for this. In my own work, the lightweight SQLite solution is enough, even if it's a matter of preference, to some extent. Search for "drawbacks of using SQLite" and you probably won't find much to dissuade you.
You should find SQLite's documentation pretty smooth to follow. If you have enough programming experience, doing a tutorial or two should get you going pretty quickly on the SQL front. I don't see anything overly complicated going on in your code, so the most common & basic queries such as CREATE TABLE, SELECT ... WHERE will likely meet all your needs.
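For example, the whole per-customer extraction your script performs boils down to statements like these (table and column names here are made up for illustration):
CREATE TABLE monthly_data (CustomerID INTEGER, purchase_date TEXT, amount REAL);
SELECT * FROM monthly_data WHERE CustomerID = 12345;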
Edit
Another advantage of using a DBMS that I didn't mention is that you can have views that expose other ways of organizing the data, so to speak. By creating views, you can go back to the "view by month" without having to rewrite any table or duplicate any data.
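For example, a sketch of such a view (the table and month column are hypothetical, depending on how the monthly files were loaded):
CREATE VIEW data_2014_07 AS
SELECT * FROM monthly_data WHERE month = '2014-07';
Querying data_2014_07 then behaves like the original month file, without storing a second copy of the rows.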
Related
Looking for a shortcut here. I am pretty adept with SQL database engines and ERPs. I should clarify... I mean databases like MS SQL, MySQL, PostgreSQL, etc.
One of the things that I like to do when I am working on a new project is to get a feel for what is being utilized and what isn't. In T-SQL this is pretty easy. I just query the information schema and get a row count of all the tables and filter out the ones having rowcount = 0. I know this isn't truly a precise row count, but it does give me an idea of what is in use.
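Something along these lines is what I mean (a rough sketch; it uses sys.partitions for approximate counts, since INFORMATION_SCHEMA doesn't expose row counts directly):
SELECT t.name AS table_name, SUM(p.rows) AS approx_rows
FROM sys.tables t
JOIN sys.partitions p ON p.object_id = t.object_id AND p.index_id IN (0, 1)
GROUP BY t.name
HAVING SUM(p.rows) > 0
ORDER BY approx_rows DESC;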
So I recently started at a new company and one of their systems is running on UniData. This is a pretty radical shift from mainstream databases and there isn't a lot of help out there. I was wondering if anybody knew of a command to do the same thing listed above in UniBasic/UniQuery/whatever else.
Which tables, files, are heavily populated and which ones are not?
You can start with a special "table" (or file in Unidata terminology) named VOC - it will have a list of all the other files that are in your current "database" (aka account), as well as a bunch of other things.
To get a list of files in (or pointed to) the current account:
:SORT VOC WITH F1 = "F]" "L]" "DIR" F1 F2
Try HELP CREATE.FILE if you're curious about the difference between F and LF and DIR.
Once you have a list of files, weed out the ones named *TEMP* or *WORK* and start digging into the ones that seem important. There are other ways to get at what's important (e.g using triggers or timestamps), but browsing isn't a bad idea to see what conventions are used.
Once you have a file that looks interesting (let's say CUSTOMERS), you can look at the dictionary of that file to see what fields it defines:
:SORT DICT CUSTOMERS F1 F2 BY F1 BY F2 USING DICT VOC
It can help to create something like F2.LONG in DICT VOC to increase the display size up from 15 characters.
Now that you have a list of "columns" (aka fields or attributes), you're looking for the D-type attributes, which are the fields actually stored in the file; V- or I-types are calculated (virtual) fields.
https://github.com/ianmcgowan/SCI.BP/blob/master/PIVOT is helpful with profiling when you see an attribute that looks interesting and you want to see what the data looks like.
http://docs.rocketsoftware.com/nxt/gateway.dll/RKBnew20/unidata/previous%20versions/v8.1.0/unidata_userguide_v810.pdf has some generally good information on the concepts and there are many other online manuals available there. It can take a lot of reading to get to the right thing if you don't know the terminology.
I used to work with SQL databases like MySQL, Postgres or MSSQL.
Now I want to play with Redis. I'm working on a little home project that I think is a good fit for starting to use Redis.
I have a machine that reads temperature (indoor and outdoor) and humidity. I need to store the readings in Redis. Can you help me understand the best data structure to use?
Besides the readings themselves, I need to store the time of each reading (e.g. a Unix timestamp) so I can plot a graph.
I have installed Redis and read the documentation, so I understand the commands and data types.
Since this is your first Redis project and it's a home project, I'd be careful about being too careful. Here are a couple of ways to consider designing it (NOTE: I only dug deep into Redis this past weekend, so hopefully others will weigh in).
IDEA 1:
Four sorted sets
KEYS for the sets are "indoor_temps", "outdoor_temps", "indoor_humidity", "outdoor_humidity"
VALUES (the set members) are the temperature / humidity readings
SCORE is the reading time stored as an epoch timestamp (see the redis-cli sketch after IDEA 2)
IDEA 2:
Four types of keys (best shown by example)
datetime_key = /year:2014/month:07/day:12/hour:07/minute:32/second:54
type_keys = [indoor_temps, outdoor_temps, indoor_humidity, outdoor_humidity]
keys are of form type + "/" + datetime_key
values are the temp and humidity itself
You probably want to implement some initial design and then work with the data immediately - graph it, do stats, etc. Whatever you plan to do with it. That will expose flaws and if they are major, flush the database and try again. These designs should really only take ~1 hour to implement since the only thing you're really changing is a few Redis commands and some string manipulation to convert the data to keys.
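For IDEA 1, a minimal redis-cli sketch (the values and timestamps are made up):
ZADD indoor_temps 1403197981 "21.5"
ZADD outdoor_temps 1403197981 "27.2"
ZRANGEBYSCORE indoor_temps 1403100000 1403200000 WITHSCORES
The ZRANGEBYSCORE call pulls all readings between two epoch timestamps, oldest first. One caveat: members of a sorted set are unique, so two readings with exactly the same value would collide; encoding the timestamp into the member (as in the list-based answer below) avoids that.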
I like Tony's suggestions, but I'll also throw out another possibility.
4 lists
keys are "indoor_temps", "outdoor_temps", "indoor_humidity", "outdoor_humidity"
values are of the form <timestamp>_<reading>, e.g. "1403197981_27.2"
Push items onto the front of the list using LPUSH. Get a set of readings using LRANGE. The list will always be ordered by the time of the reading. Obviously split the value on "_" to get your time and reading...
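A rough redis-cli sketch of that (the reading and timestamp are made up):
LPUSH indoor_temps "1403197981_27.2"
LRANGE indoor_temps 0 99
Since LPUSH puts the newest reading at the head of the list, LRANGE 0 99 returns the 100 most recent readings.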
In all honesty, this will give you the same properties as Tony's first example, with slightly worse lookup performance but better memory usage. I'm guessing that for this project you'll be neither memory- nor CPU-constrained, so the choice probably isn't an issue. That said, if you expect to be saving hundreds of thousands of readings or more, I would suggest the list unless you want to consume a large portion of your system's memory.
Also, it's a good idea to call EXPIRE on your entries with some reasonable TTL that encompasses the length of time you want to save the readings for. If your plan is to have them live in perpetuity then you may want to look at backing them up to a disk DB over time, and just use Redis as a quick lookup cache for recent readings.
Thanks to all the answers. I chose this structure:
4 lists: tempIN, tempOUT, humidIN and humidOUT
values are [value]:[timestamp], for example "25.4:1403615247"
As suggested by wallacer, I want to back up old entries out of Redis.
For the main frontend I only need the last two days of samples.
For example, I could take a Redis RDB snapshot and then "trim" the live lists, but this solution is not convenient if, in the future, you want to recover old values.
Do you have any tips on what kind of procedure to adopt to store the data? Maybe an SQLite DB?
I'm optimizing the memory load (~2GB, offline accounting and analysis routine) of this line:
l2 = Photograph.objects.filter(**(movie.get_selectors())).values()
Is there a way to convince Django to skip certain columns when fetching values()?
Specifically, the routine obtains all rows of the table matching certain criteria (the db is optimized and performs it very quickly), but it is a bit too much for Python to handle - there is a long string referenced in each row, storing the URLs for the thumbnails.
I only really need three fields from each row, but if all the fields are included, it suddenly consumes about 5kB/row, which sadly pushes the RAM to the limit.
The values(*fields) function allows you to specify which fields you want.
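A hedged sketch of that applied to the query above (the three field names are hypothetical; substitute the ones you actually need):
l2 = Photograph.objects.filter(**(movie.get_selectors())).values('id', 'width', 'height')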
Check out the QuerySet method only(). When you declare that you only want certain fields to be loaded immediately, the QuerySet manager will not pull in the other fields of your objects until you try to access them.
If you also have to deal with ForeignKeys that must be pre-fetched, check out select_related.
The two links above to the Django documentation have good examples that should clarify their use.
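A rough sketch along those lines (field and relation names are hypothetical):
qs = (Photograph.objects
      .filter(**(movie.get_selectors()))
      .only('id', 'width', 'height')
      .select_related('movie'))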
Take a look at Django Debug Toolbar; it comes with a debugsqlshell management command that lets you see the SQL queries being generated, along with the time taken, as you play around with your models in a Django/Python shell.
I have an sqlite3 table that tells when I gain/lose points in a game. Sample/query result:
SELECT time,p2 FROM events WHERE p1='barrycarter' AND action='points'
ORDER BY time;
1280622305|-22
1280625580|-9
1280627919|20
1280688964|21
1280694395|-11
1280698006|28
1280705461|-14
1280706788|-13
[etc]
I now want my running point total. Given that I start w/ 1000 points,
here's one way to do it.
SELECT DISTINCT(time),
       (SELECT 1000 + SUM(p2)
          FROM events e
         WHERE p1 = 'barrycarter'
           AND action = 'points'
           AND e.time <= e2.time) AS points
  FROM events e2
 WHERE p1 = 'barrycarter'
   AND action = 'points'
 ORDER BY time
but this is highly inefficient. What's a better way to write this?
MySQL has @variables so you can do things like:
SELECT time, @tot := @tot + points ...
but I'm using sqlite3, and the above isn't ANSI standard SQL anyway.
More info on the db if anyone needs it: http://ccgames.db.94y.info/
EDIT: Thanks for the answers! My dilemma: I let anyone run any
single SELECT query on "http://ccgames.db.94y.info/". I want to give
them useful access to my data, but not to the point of allowing
scripting or allowing multiple queries with state. So I need a single
SQL query that can do accumulation. See also:
Existing solution to share database data usefully but safely?
SQLite is meant to be a small embedded database. Given that definition, it is not unreasonable to find many limitations with it. The task at hand is not solvable using SQLite alone, or it will be terribly slow as you have found. The query you have written is a triangular cross join that will not scale, or rather, will scale badly.
The most efficient way to tackle the problem is through the program that is making use of SQLite, e.g. if you were using Web SQL in HTML5, you can easily accumulate in JavaScript.
There is a discussion about this problem in the sqlite mailing list.
Your 2 options are:
Iterate through all the rows with a cursor and calculate the running sum on the client.
Store sums instead of, or as well as storing points. (if you only store sums you can get the points by doing sum(n) - sum(n-1) which is fast).
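For completeness: if your SQLite build is 3.25 or newer, window functions make the single-query version straightforward. A sketch based on the columns in your sample:
SELECT time,
       1000 + SUM(p2) OVER (ORDER BY time) AS points
  FROM events
 WHERE p1 = 'barrycarter'
   AND action = 'points'
 ORDER BY time;
This computes the running total in a single pass instead of re-summing the prefix for every row.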
I am looking for the fastest way to check for the existence of an object.
The scenario is pretty simple: assume a directory tool which reads the current hard drive. When a directory is found, it should either be created or, if already present, updated.
First, let's focus only on the creation part:
public static DatabaseDirectory Get(DirectoryInfo dI)
{
    var result = DatabaseController.Session
        .CreateCriteria(typeof (DatabaseDirectory))
        .Add(Restrictions.Eq("FullName", dI.FullName))
        .List<DatabaseDirectory>().FirstOrDefault();

    if (result == null)
    {
        result = new DatabaseDirectory
        {
            CreationTime = dI.CreationTime,
            Existing = dI.Exists,
            Extension = dI.Extension,
            FullName = dI.FullName,
            LastAccessTime = dI.LastAccessTime,
            LastWriteTime = dI.LastWriteTime,
            Name = dI.Name
        };
    }

    return result;
}
Is this the way to go regarding:
Speed
Separation of Concern
What comes to mind is the following: a scan will always be performed "as a whole". Meaning, during a scan of drive C, I know that nothing new gets added to the database (from some other process). So it MAY be a good idea to "cache" all existing directories prior to the scan and look them up that way. On the other hand, this may not be suitable for large sets of data, like files (which will be 600,000 or more)...
Perhaps some performance gain can be achieved using "index columns" or something like this, but I am not so familiar with this topic. If anybody has some references, just point me in the right direction...
Thanks,
Chris
PS: I am using NHibernate, Fluent Interface, Automapping and SQL Express (could switch to full SQL)
Note:
In the given problem, the path is not the ID in the database. The ID is an auto-increment, and I can't change this requirement (other reasons). So the real question is: what is the fastest way to check for the existence of an object when the ID is not known, only a property of that object?
And batching might be possible, by selecting a big group with something like "starts with C:Testfiles\", but the problem then remains: how do I know in advance how big this set will be? I can't select "max 1000" and check against this buffered dictionary, because I might "hit next to the searched dir"... I hope this problem is clear. The most important part is: does buffering really affect performance this much? If so, does it make sense to load the whole DB into a dictionary containing only PATH and ID (which should be OK even with 1,000,000 objects, I think)?
First off, I highly recommend that you (anyone using NH, really) read Ayende's article about the differences between Get, Load, and query.
In your case, since you need to check for existence, I would use .Get(id) instead of a query for selecting a single object.
However, I wonder if you might improve performance by utilizing some knowledge of your problem domain. If you're going to scan the whole drive and check each directory for existence in the database, you might get better performance by doing bulk operations. Perhaps create a DTO object that only contains the PK of your DatabaseDirectory object to further minimize data transfer/processing. Something like:
Dictionary<string, DirectoryInfo> directories;

session.CreateQuery("select new DatabaseDirectoryDTO(dd.FullName) from DatabaseDirectory dd where dd.FullName in (:ids)")
    .SetParameterList("ids", directories.Keys)
    .List();
Then just remove those elements that match the returned ID values to get the directories that don't exist. You might have to break the process into smaller batches depending on how large your input set is (for the files, almost certainly).
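A rough sketch of the batching idea (batch size, HQL and variable names are assumptions building on the snippet above, not tested code; it also assumes System.Linq is available):

const int batchSize = 500;
var existing = new HashSet<string>();
var keys = directories.Keys.ToList();

for (int i = 0; i < keys.Count; i += batchSize)
{
    // pull back only the FullName values that are already stored
    var batch = keys.Skip(i).Take(batchSize).ToList();
    var found = session
        .CreateQuery("select dd.FullName from DatabaseDirectory dd where dd.FullName in (:ids)")
        .SetParameterList("ids", batch)
        .List<string>();
    foreach (var name in found)
        existing.Add(name);
}

// everything not found still needs to be inserted
var missing = directories.Keys.Where(k => !existing.Contains(k)).ToList();

Keeping each IN list to a few hundred values also stays clear of SQL Server's limit of roughly 2,100 parameters per query.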
As far as separation of concerns, just keep the operation at a repository level. Have a method like SyncDirectories that takes a collection (maybe a Dictionary if you follow something like the above) that handles the process for updating the database. That way your higher application logic doesn't have to worry about how it all works and won't be affected should you find an even faster way to do it in the future.