solutions for cleaning/manipulating big data (currently using Stata)

I'm currently using a 10% sample of a very large dataset (10 vars, over 300m rows) which amounts to over 200 GB of data when stored in .dta format for the full dataset. Stata is able to handle operations like egen, collapse, merging, etc in a reasonable amount of time for the 10% sample when using Stata-MP on a UNIX server with ~50G of RAM and multiple cores.
However, now I want to move on to analyzing the whole sample. Even if I use a machine that has enough RAM to hold the dataset, simply generating a variable takes ages. (I suspect the background operations are pushing Stata into virtual memory.)
The problem is also very amenable to parallelization, i.e., the rows in the dataset are independent of each other, so I can just as easily think about the one large dataset as 100 smaller datasets.
Does anybody have any suggestions for how to process/analyze this data or can give me feedback on some suggestions I currently have? I mostly use Stata/SAS/MATLAB so perhaps there are other approaches that I am simply unaware of.
Here are some of my current ideas:
Split the dataset up into smaller datasets and utilize informal parallel processing in Stata. I can run my cleaning/processing/analysis on each partition and then merge the results afterwards without having to store all the intermediate parts.
Use SQL to store the data and also perform some of the data manipulation such as aggregating over certain values. One concern here is that some tasks that Stata can handle fairly easily such as comparing values across time won't work so well in SQL. Also, I'm already running into performance issues when running some queries in SQL on a 30% sample of the data. But perhaps I'm not optimizing by indexing correctly, etc. Also, Shard-Query seems like it could help with this but I have not researched it too thoroughly yet.
R also looks promising, but I'm not sure if it would solve the problem of working with this enormous amount of data.
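The first idea (process independent partitions, then merge the results) can be sketched in a few lines. This is only an illustration in Python rather than Stata; the row layout `(group, value)`, the cleaning rule, and the helper names `partitions`, `clean_and_collapse`, and `merge` are all assumptions made for the example.

```python
from collections import Counter
from itertools import islice


def partitions(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def clean_and_collapse(chunk):
    """Per-partition work: drop invalid rows, then sum the value per group."""
    totals = Counter()
    for group, value in chunk:
        if value is None or value < 0:  # hypothetical cleaning rule
            continue
        totals[group] += value
    return totals


def merge(partials):
    """Combine per-partition totals; only the small summaries are kept."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged


# Toy data standing in for the full dataset.
rows = [("a", 1.0), ("b", 2.0), ("a", -9.0), ("b", None), ("a", 3.0), ("c", 4.0)]

# The built-in map runs sequentially; because partitions are independent,
# a process pool's map could be substituted for it.
totals = merge(map(clean_and_collapse, partitions(rows, 2)))
print(dict(totals))
```

Only the per-partition summaries are held at merge time, which is what makes it unnecessary to store the intermediate parts.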

Thanks to those who have commented and replied. I realized that my problem is similar to this thread. I have re-written some of my data manipulation code in Stata into SQL and the response time is much quicker. I believe I can make large optimization gains by correctly utilizing indexes and using parallel processing via partitions/shards if necessary. After all the data manipulation has been done, I can import that data via ODBC in Stata.

Since you are familiar with Stata, there is a well-documented FAQ about large datasets in Stata, "Dealing with Large Datasets", which you might find helpful.
I would clean column by column: split the columns up, run any specific cleaning routines on each, and merge them back in later.
Depending on your machine resources, you should be able to hold the individual columns in multiple temporary files using tempfile. Taking care to select only the variables or columns most relevant to your analysis should reduce the size of your set quite a lot.

Related

Improving our algorithm with SQLite vs storing everything in memory

The problem...I’m trying to figure out a way to make our algorithm faster.
Our algorithm...is written in C and runs on an embedded Linux system with little memory and a lackluster CPU. The entire algorithm makes heavy use of 2d arrays and stores them all in memory. At a high level, the algorithm’s input data, which is a single array of 250 doubles (0.01234, 0.02532….0.1286), is compared to a larger 2d array, which is 20k+ rows x 250 doubles. The input data is compared against the 20k+ rows using a for loop. For each iteration, the algorithm performs computations and stores those results in memory.
I’m not an embedded software developer, I am a cloud developer that uses databases (Postgres, mainly). Our embedded software doesn’t make use of any databases and, since that is what I know, I thought I’d look into SQLite.
My approach...applying what I know about databases, I'd go about it this way: I would have a single table with 6 columns: id, array, computation_1, computation_2, computation_3, and computation_4. I’d store all 20k+ rows in this table with the computation_* columns initially defaulted to null. Then I’d have the algorithm loop through each entry and update the values for each computation_* column accordingly. For graphical purposes, the table would look like this:
Storing arrays in a database doesn't seem like a good fit so I don't immediately understand if there is a benefit to doing this. But, it seems like it would replace the extensive use of malloc()/calloc() we have baked into the algorithm.
My question is...can SQLite help speed up our algorithm if I use it in the way I've described? Since I don’t know how much benefit this would provide, if any, I thought I’d ask the experts here on SO before going down this path. If it will (or won't) provide an improvement, I'd like to know why from a technical standpoint so that I can learn.
Thanks in advance.

As you have described it so far, SQLite won't help you.
A relational database stores data into tables with various indexes and so on. When it receives SQL, it compiles it into a bytecode program, and then it runs that bytecode program in an interpreter against those tables. You can learn more about SQLite's bytecode from https://www.sqlite.org/opcode.html.
This has a lot of overhead compared to native data structures in a low-level language. In my experience the difference is up to several orders of magnitude.
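To make the bytecode step concrete, here is a small sketch that asks SQLite for the program it compiles a query into. It uses Python's standard `sqlite3` module instead of the C API purely for brevity; the table `samples` and the query are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, value REAL)")

# Prefixing a statement with EXPLAIN returns the bytecode program
# instead of running the statement. Column 0 is the instruction
# address and column 1 is the opcode name.
program = conn.execute(
    "EXPLAIN SELECT value * 2 FROM samples WHERE value > 0.01"
).fetchall()

opcodes = [row[1] for row in program]
for row in program:
    print(row[0], row[1])
```

Even this one-line query expands into a sequence of interpreted instructions, which is the overhead a plain loop over a native array avoids.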
Why, then, would anyone use a database? It is because you'd have to write a lot of potentially buggy code to match it. Doubly so if you've got multiple users at the same time. Furthermore the database query optimizer is able to find efficient plans for computing complex joins that are orders of magnitude more efficient than what most programmers produce on their own.
So a database is not a recipe for doing arbitrary calculations more efficiently. But if you can describe what you are doing in SQL (particularly if it involves joins), the database may be able to find a much more efficient calculation than the one you're currently performing.
Even in that case, squeezing performance out of a low-end embedded system is a case where it may be worth figuring out what a database would do, and then writing code to do that directly.

Which is faster: doing a SQL join from inside R (DBI), or doing the join inside PGAdmin or DataGrip, saving a CSV, and importing to R?

Summary
I have a data analysis project that requires the use of a locally stored PostgreSQL database of fairly decent size for a desktop machine (~10 tables with up to ~90m rows and ~20 columns, summing to around 20gb worth of data).
I have some statistical models I want to execute on this data using R. First, though, I need to manipulate the data a little bit to get it into the form I want. The basic manipulations are fairly straightforward SELECT and JOIN operations, but they take a few minutes to do on my machine because of the amount of data. I'll need to refer to the manipulated tables again and again in analysis, so I'd like to be able to save the results of these SELECT and JOIN operations for later use.
Question
Is it faster or computationally more efficient to
(a) execute the joins from R's DBI package, using, say, dbGetQuery, and saving the resulting dataframes on disk for later analysis
or
(b) do the joins and selects inside PGAdmin or DataGrip, save the result to a .csv file, and bring that into R?
What I've tried
I've tried three of the operations I need to do in both RStudio (as in a) and DataGrip (as in b) and timed them with a stopwatch. In 2 instances, the code seems to go faster inside the SQL environment in DataGrip, and in the third, it's marginally faster in RStudio. I'm not sure why this is, besides the third operation working on smaller tables than the first two. No, I don't know how to benchmark code in either platform, and yes, this may be part of my issue. Nor do I know much about big-O notation, but that may not be relevant here.
I can include some more concrete code if it's helpful, but my question (it seems to me) is a little more theoretical. I'm basically asking if connecting to a SQL server on my local machine should be any different if I'm doing it in R versus doing it in some "proper" database environment. Are there bottlenecks in one and not the other?
Thanks in advance for any insight!

Are SQLite result sets in-memory data structures?

I currently find myself needing to do fairly simple computations on several million datapoints. (Constructing a large list of strings from a well-defined multi-gigabyte file, sorting that list, and then comparing it to another list, a superset.) This is the sort of simple work most of us normally do with the data entirely in memory, but the size and quantity of the units of data I need to work with could make RAM an issue if I try to keep everything in memory. I quickly realized I probably need to write the data to a file, at a few points, to avoid exhausting my system's resources. I decided to use SQLite3 for this. (This is probably a bit much for a CSV.) It is fairly lightweight, while its storage limits seem to safely exceed my requirements.
The problem I am having is understanding exactly how the result set works. The documentation I have come across seems a little vague on this. Obviously, SQLite is not writing a whole new table to the database every time a SELECT statement is executed. Does this mean it is duplicating all the selected fields in a complete in-memory table, or does it only keep some sort of pointers in memory (rather than the actual data)? Something else altogether?
I need to be able to sort the data in question. If the result set is really just an in-memory data structure, then simply creating a new table and populating it with the help of ORDER BY could be a bad idea.

SQLite does not really have result sets. It has cursors, which allow access to only the current row, and which cannot go backwards.
SQLite computes results on the fly, so only one row needs to be in memory at a time.
When a computation needs to access multiple rows (e.g., aggregate functions, or sorting without a usable index), as much data as possible is kept in the cache, and then spilled to disk in a temporary database.
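A minimal sketch of what this looks like from client code, using Python's standard `sqlite3` module: the loop pulls one row at a time from a forward-only cursor, so the sorted list never has to exist in the program's memory as a whole. The table name `candidates` and the sample strings are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE candidates (name TEXT)")
conn.executemany(
    "INSERT INTO candidates VALUES (?)",
    [("pear",), ("apple",), ("kiwi",), ("fig",)],
)

# Stand-in for the larger list the sorted data is compared against.
superset = {"apple", "fig", "pear", "plum"}

seen_in_order = []
missing = []

# Iterating the cursor steps the statement forward row by row;
# there is no way to move back to an earlier row.
cursor = conn.execute("SELECT name FROM candidates ORDER BY name")
for (name,) in cursor:
    seen_in_order.append(name)  # kept only to show the order in this demo
    if name not in superset:
        missing.append(name)

print(seen_in_order, missing)
```

In a real run the `seen_in_order` list would be dropped; it exists here only to show that the rows arrive already sorted.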

What Database for extensive logfile analysis?

The task is to filter and analyze a huge amount of logfiles (around 8TB) from a finished research project. The idea is to fill a database with the data to be able to run different analysis tasks later.
The values are stored comma separated. In principle the values are tuples of up to 5 values:
id, timestamp, type, v1, v2, v3, v4, v5
In a first try using MySQL, I used one table with one log entry per row. So there is no direct relation between the log values. The downside here is slow querying of subsets.
Because there is no relation, I looked into alternatives like NoSQL databases, and column-based stores like HBase or Cassandra seemed to be a perfect fit for this kind of data. But these systems are made for huge distributed systems, which we do not have. In our case the analysis will run on a single machine or perhaps some VMs.
Which kind of database would fit this task? Is it worth setting up a single-machine instance with Hadoop+HBase... or is this all a bit oversized?
What database would you choose to do high-performance logfile analysis?
EDIT: Maybe it is not clear from my question that we cannot spend money on cloud services or new hardware. The question is whether there are benefits in using NoSQL approaches instead of MySQL (especially for this data). If there are none, or if they are so small that the effort of setting up a NoSQL system is not worth the benefit, we can use our ESXi infrastructure and MySQL.
EDIT2: I'm still having the problem here. I did further experiments with MySQL and just inserted a quarter of all available data. The insert has now been running for over 2 days and is not yet finished. Currently there are 2,147,483,647 rows in my single-table db. With indexes this takes 211.2 GiB of disk space. And this is just a quarter of all logging data...
A query of the form
SELECT * FROM `table` WHERE `timestamp`>=1342105200000 AND `timestamp`<=1342126800000 AND `logid`=123456 AND `unit`="UNIT40";
takes 761 seconds to complete, in this case returning one row.
There is a combined index on timestamp, logid, unit.
So I think this is not the way to go, because later in analysis I will have to get all entries in a time range and compare the datapoints.
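One thing that may be worth checking for a query of this shape is the column order of the combined index. The sketch below is an assumption-laden illustration, not a diagnosis: it uses SQLite (via Python's standard `sqlite3` module) as a stand-in for MySQL, an empty table with made-up column types, and a hypothetical index name `idx_logs`. It compares the query plan for an index that leads with the range column against one that leads with the equality columns.

```python
import sqlite3

QUERY = """
SELECT * FROM logs
WHERE timestamp >= 1342105200000 AND timestamp <= 1342126800000
  AND logid = 123456 AND unit = 'UNIT40'
"""


def plan_for(index_columns):
    """Return the planner's description of QUERY for a given index order."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE logs (id INTEGER PRIMARY KEY, timestamp INTEGER, "
        "logid INTEGER, unit TEXT, v1 REAL)"
    )
    conn.execute(f"CREATE INDEX idx_logs ON logs ({index_columns})")
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    conn.close()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


range_first = plan_for("timestamp, logid, unit")
equality_first = plan_for("logid, unit, timestamp")

print("timestamp first:", range_first)
print("equality first: ", equality_first)
```

In SQLite, the index that starts with `timestamp` can only be used for the time range, so every row in that range still has to be checked for `logid` and `unit`; with the equality columns first, all three conditions narrow the index lookup. Whether MySQL behaves the same way on the real table would need to be confirmed with its own `EXPLAIN`.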
I read about MongoDB and Redis, but the problem with them is that they are in-memory databases.
In the later analysis process there will be a very small amount of concurrent database access. In fact, the analysis will be run from one single machine.
I do not need redundancy. I would be able to regenerate the database in case of a failure.
Once the database is completely written, there will also be no need to update or add further rows.
What do you think about alternatives like Redis, MongoDB and so on? If I understand this correctly, I would need RAM on the order of the size of my data...
Is this task even somehow possible with a single node system or with maybe two nodes?

Well, I personally would prefer the faster solution, since you said you need high-performance analysis. The problem is that if you have to set up a whole new system to get it, and the performance improvement would be minor relative to the additional effort, then you should stay with SQL.
In our company, we have a quite small database containing not even half a GB of data on a VM. The problem is that as soon as you use a VM, you will have major performance issues; when opening the database on the VM you can go for a coffee in the meantime ;)
But if the time until the database is loaded into cache is not so important, it doesn't matter. It all depends on how much faster you think the new system will be and how much effort you will have to put into it, but as I said, I'd prefer the faster solution if you have to go for "high-performance analysis".

Would this method work to scale out SQL queries?

I have a database containing a single huge table. At the moment a query can take anything from 10 to 20 minutes and I need that to go down to 10 seconds. I have spent months trying different products like GridSQL. GridSQL works fine, but is using its own parser which does not have all the needed features. I have also optimized my database in various ways without getting the speedup I need.
I have a theory on how one could scale out queries, meaning that I utilize several nodes to run a single query in parallel. A precondition is that the data is partitioned (vertically), with one partition placed on each node. The idea is to take an incoming SQL query and simply run it exactly as it is on all the nodes. When the results are returned to a coordinator node, the same query is run on the union of the result sets. I realize that an aggregate function like average needs to be rewritten into a count and a sum on the nodes, and that the coordinator divides the sum of the sums by the sum of the counts to get the average.
What kinds of problems could not easily be solved using this model? I believe one issue would be the count distinct function.
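The average rewrite and the count distinct issue can both be seen on a toy example. The sketch below uses three in-memory SQLite databases (Python's standard `sqlite3` module) as stand-ins for the nodes; the table `t`, the column `x`, and the sample values are assumptions for the illustration.

```python
import sqlite3


def make_shard(values):
    """Create one in-memory database playing the role of a node."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x REAL)")
    conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in values])
    return conn


shards = [make_shard([1, 2, 3]), make_shard([10]), make_shard([2, 10])]

# AVG is rewritten: each node returns SUM and COUNT, and the
# coordinator divides the sum of sums by the sum of counts.
partials = [s.execute("SELECT SUM(x), COUNT(x) FROM t").fetchone() for s in shards]
global_avg = sum(p[0] for p in partials) / sum(p[1] for p in partials)

# Running AVG unchanged on every node and averaging the results
# weights each node equally, which is wrong for uneven partitions.
node_avgs = [s.execute("SELECT AVG(x) FROM t").fetchone()[0] for s in shards]
naive_avg = sum(node_avgs) / len(node_avgs)

# COUNT(DISTINCT x) is not additive: values shared between nodes
# are counted once per node.
summed_distinct = sum(
    s.execute("SELECT COUNT(DISTINCT x) FROM t").fetchone()[0] for s in shards
)
# One workaround: nodes return the distinct values themselves and the
# coordinator de-duplicates them, at the cost of shipping more rows.
true_distinct = len(
    {row[0] for s in shards for row in s.execute("SELECT DISTINCT x FROM t")}
)

print(global_avg, naive_avg, summed_distinct, true_distinct)
```

The distinct-value workaround shows why this aggregate is harder to scale out: the amount of data sent to the coordinator grows with the number of distinct values rather than staying at one row per node.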
Edit: I am getting so many nice suggestions, but none have addressed the method.

It's a data volume problem, not necessarily an architecture problem.
Whether on 1 machine or 1000 machines, if you end up summarizing 1,000,000 rows, you're going to have problems.
Rather than normalizing your data, you need to de-normalize it.
You mention in a comment that your data base is "perfect for your purpose", when, obviously, it's not. It's too slow.
So, something has to give. Your perfect model isn't working, as you need to process too much data in too short of a time. Sounds like you need some higher level data sets than your raw data. Perhaps a data warehousing solution. Who knows, not enough information to really say.
But there are a lot of things you can do to satisfy a specific subset of queries with a good response time, while still allowing ad hoc queries that respond in "10-20 minutes".
Edit regarding comment:
I am not familiar with "GridSQL", or what it does.
If you send several, identical SQL queries to individual "shard" databases, each containing a subset, then the simple selection query will scale to the network (i.e. you will eventually become network bound to the controller), as this is a truly, parallel, stateless process.
The problem becomes, as you mentioned, the secondary processing, notably sorting and aggregates, as this can only be done on the final, "raw" result set.
That means that your controller inevitably ends up becoming your bottleneck and, in the end, regardless of how "scaled out" you are, you still have to contend with a data volume issue. If you send your query out to 1000 nodes and then have to summarize or sort the 1000-row result set from each node, resulting in 1M rows, you still have a long response time and a large data processing demand on a single machine.
I don't know what database you are using, and I don't know the specifics about individual databases, but you can see how if you actually partition your data across several disk spindles, and have a decent, modern, multi-core processor, the database implementation itself can handle much of this scaling in terms of parallel disk spindle requests for you. Which implementations actually DO do this, I can't say. I'm just suggesting that it's possible for them to (and some may well do this).
But my general point is that if you are running aggregates specifically, then you are likely processing too much data if you're hitting the raw sources each time. If you analyze your queries, you may well be able to "pre-summarize" your data at various levels of granularity to help avoid the data saturation problem.
For example, if you are storing individual web hits, but are more interested in activity based on each hour of the day (rather than the subsecond data you may be logging), summarizing to the hour of the day alone can reduce your data demand dramatically.
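A minimal sketch of that kind of pre-summarization, using SQLite through Python's standard `sqlite3` module. The table names `hits` and `hits_by_hour`, the column names, and the sample timestamps are all made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (ts TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?)",
    [
        ("2024-01-01 09:00:01.120", "/a"),
        ("2024-01-01 09:15:42.007", "/b"),
        ("2024-01-01 09:59:59.999", "/a"),
        ("2024-01-01 10:00:00.000", "/a"),
        ("2024-01-02 09:30:00.500", "/b"),
    ],
)

# Build the summary once; later queries read it instead of the raw hits.
conn.execute(
    """
    CREATE TABLE hits_by_hour AS
    SELECT CAST(strftime('%H', ts) AS INTEGER) AS hour_of_day,
           COUNT(*) AS hits
    FROM hits
    GROUP BY hour_of_day
    """
)

summary = conn.execute(
    "SELECT hour_of_day, hits FROM hits_by_hour ORDER BY hour_of_day"
).fetchall()
print(summary)  # [(9, 4), (10, 1)]
```

However many raw hits are logged, the hour-of-day summary never has more than 24 rows, which is where the reduction in data demand comes from.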
So, scaling out can certainly help, but it may well not be the only solution to the problem, rather it would be a component. Data warehousing is designed to address these kinds of problems, but does not work well with "ad hoc" queries. Rather you need to have a reasonable idea of what kinds of queries you want to support and design it accordingly.

One huge table - can this be normalised at all?
If you are doing mostly select queries, have you considered either normalising to a data warehouse that you then query, or running analysis services and a cube to do your pre-processing for you?
From your question, what you are doing sounds like the sort of thing a cube is optimised for, and could be done without you having to write all the plumbing.

By trying a custom solution (grid), you introduce a lot of complexity. Maybe it's your only solution, but first, did you try partitioning the table (the native solution)?

I'd seriously be looking into an OLAP solution. The trick with the cube is that once built, it can be queried in lots of ways that you may not have considered. And as @HLGEM mentioned, have you addressed indexing?
Even at millions of rows, a good search should be logarithmic, not linear. If you have even one query which results in a scan, then your performance will be destroyed. We might need an example of your structure to see if we can help more.
I also agree fully with @Mason: have you profiled your query and investigated the query plan to see where your bottlenecks are? The fact that adding nodes improves speed makes me think that your query might be CPU bound.
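As a small illustration of checking for a scan, the sketch below asks SQLite (through Python's standard `sqlite3` module) for the plan of the same query before and after an index is added. The table `facts`, its columns, and the index name are hypothetical, and other databases expose the same information through their own `EXPLAIN` variants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (id INTEGER PRIMARY KEY, customer INTEGER, amount REAL)"
)
query = "SELECT amount FROM facts WHERE customer = 42"


def plan(connection, sql):
    """Return the planner's one-line descriptions for a statement."""
    return [row[-1] for row in connection.execute("EXPLAIN QUERY PLAN " + sql)]


# Without an index, the only option is to read every row (linear).
before = plan(conn, query)

# With an index on the filter column, the planner can search the
# B-tree instead (logarithmic in the number of rows).
conn.execute("CREATE INDEX idx_facts_customer ON facts (customer)")
after = plan(conn, query)

print(before)
print(after)
```

A plan line that begins with SCAN on the large table is the signal to look for; after the index is created, the same query is reported as a SEARCH using the index.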

David,
Are you using all of the features of GridSQL? You can also use constraint exclusion partitioning, effectively breaking out your big table into several smaller tables. Depending on your WHERE clause, when the query is processed it may look at a lot less data and return results much faster.
Also, are you using multiple logical nodes per physical server? Configuring it that way can take advantage of otherwise idle cores.
If you monitor the servers during execution, is the bottleneck IO or CPU?
Also alluded to here is that you may want to roll up rows in your fact table into summary tables/cubes. I do not know enough about Tableau: will it automatically use the appropriate cube and drill down only when necessary? If so, it seems like you would get big gains doing something like this.

My guess (based on nothing but my gut) is that any gains you might see from parallelization will be eaten up by reaggregation and subsequent queries of the results. Further, I would think that writing might get more complicated with pk/fk/constraints. If this were my world, I would probably create many indexed views on top of my table (and other views), optimized for the particular queries I need to execute. I have used that approach successfully on tables with 10 million+ rows.

If you run the incoming query, unpartitioned, on each node, why will any node finish before a single node running the same query would finish? Am I misunderstanding your execution plan?
I think this is, in part, going to depend on the nature of the queries you're executing and, in particular, how many rows contribute to the final result set. But surely you'll need to partition the query somehow among the nodes.

Your method to scale out queries works fine.
In fact, I've implemented such a method in:
http://code.google.com/p/shard-query
It uses a parser, but it supports most SQL constructs.
It doesn't yet support count(distinct expr) but this is doable and I plan to add support in the future.
I also have a tool called Flexviews (google for flexviews materialized views)
This tool lets you create materialized views (summary tables) which include various aggregate functions and joins.
Those tools combined together can yield massive scalability improvements for OLAP type queries.
