Handling missing data for Exploratory Factor Analysis

I am trying to conduct an exploratory factor analysis in RStudio using the psych package. I have a lot of missing data, so I was wondering what the most appropriate way to handle it is. Should I remove any cases with missing data (listwise deletion), or use pairwise deletion?
Thanks
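
For what it's worth, the difference between the two options in the question can be sketched in a few lines. This is only an illustration in plain Python rather than R: the data and the helper names (`pearson`, `correlation`) are made up for the example, and `None` stands in for `NA`.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists without missing values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation(rows, i, j, method):
    """Correlate columns i and j; None marks a missing value."""
    if method == "listwise":
        # Keep only the cases that are complete on every variable.
        kept = [r for r in rows if None not in r]
    elif method == "pairwise":
        # Keep the cases that are complete on the two variables in question.
        kept = [r for r in rows if r[i] is not None and r[j] is not None]
    else:
        raise ValueError(f"unknown method: {method}")
    return len(kept), pearson([r[i] for r in kept], [r[j] for r in kept])

# Hypothetical responses to three items (None = missing).
rows = [
    (1, 2, 5),
    (2, 1, None),
    (3, 4, 3),
    (4, 3, None),
    (5, 5, 1),
    (None, 4, 2),
]

n_listwise, r_listwise = correlation(rows, 0, 1, "listwise")
n_pairwise, r_pairwise = correlation(rows, 0, 1, "pairwise")
print(f"listwise: n={n_listwise}, r={r_listwise:.3f}")
print(f"pairwise: n={n_pairwise}, r={r_pairwise:.3f}")
```

Pairwise deletion keeps more cases for each correlation, but every entry of the correlation matrix can then be based on a different subset of cases, so the resulting matrix is not guaranteed to be positive definite. Listwise deletion avoids that at the cost of throwing away every case with any missing value.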

Using PCA on Part of Dataframe

I want to apply a clustering algorithm to a dataframe that contains a lot of features (32 columns).
Some of the features are one-hot encoded.
I want to use PCA (principal component analysis) to reduce the dimensionality and make the machine learning process easier.
Is it possible to apply PCA to just some columns of the dataframe, keep the other columns as they are, and then use a machine learning model?
Or is it necessary to apply PCA to the whole dataframe before clustering?
I guess there should be no issue with doing what you describe.
What this does, effectively, is merge some of the objects' features into fewer ones, while still using the other, unmerged features alongside the merged ones. I don't know what effect that would have on the outcome; it might be good to run a correlation to see whether the unmerged features add anything to the PCA-merged ones. You might find that they basically duplicate what is there already.
Since clustering is an exploratory method, you can basically do whatever you want. It is of course advisable to have a reason for doing so, as it otherwise ends up as simply trial-and-error, and if you find a result, you won't be able to describe why you got there. It is possible (or even likely for some data sets) that there are multiple ways to cluster them, so you should make decisions based on what you know about the data already, so they can be justified in those terms.
Running random trial-and-error clustering until you find a structure makes it a bit difficult to come up with a good explanation why that structure is valid.
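
A minimal sketch of the idea, assuming just two columns are merged and one is kept: the first principal component of two columns can be computed in closed form, so no library is needed here. The column names and values are hypothetical, and `first_component_2d` is a helper made up for this example; with 32 real columns you would use a proper PCA implementation and probably standardize the columns first.

```python
from math import sqrt

def center(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def first_component_2d(xs, ys):
    """Scores on the first principal component of two columns.

    Uses the closed-form eigen-decomposition of the 2x2 covariance matrix.
    """
    cx, cy = center(xs), center(ys)
    n = len(xs)
    a = sum(x * x for x in cx) / (n - 1)               # var(x)
    c = sum(y * y for y in cy) / (n - 1)               # var(y)
    b = sum(x * y for x, y in zip(cx, cy)) / (n - 1)   # cov(x, y)
    lam = (a + c) / 2 + sqrt(((a - c) / 2) ** 2 + b ** 2)  # largest eigenvalue
    if b != 0:
        vx, vy = b, lam - a                            # matching eigenvector
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)  # columns uncorrelated
    norm = sqrt(vx * vx + vy * vy)
    vx, vy = vx / norm, vy / norm
    return [x * vx + y * vy for x, y in zip(cx, cy)]

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    cx, cy = center(xs), center(ys)
    return sum(x * y for x, y in zip(cx, cy)) / sqrt(
        sum(x * x for x in cx) * sum(y * y for y in cy)
    )

# Hypothetical data: two columns to merge with PCA, one column kept as-is.
height = [150.0, 160.0, 170.0, 180.0, 190.0]
weight = [52.0, 58.0, 69.0, 77.0, 90.0]
age = [31.0, 25.0, 47.0, 38.0, 52.0]

pc1 = first_component_2d(height, weight)
reduced = list(zip(pc1, age))   # PCA-merged feature plus the untouched column
overlap = pearson(pc1, age)     # does the kept column duplicate PC1?
print(f"correlation between PC1 and the kept column: {overlap:.3f}")
```

A correlation close to 1 or -1 would suggest the kept column adds little beyond what the merged component already carries. The `reduced` rows are what would be passed on to the clustering step.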

Why would you use industry standard ETL?

I've just started work at a new company that has a data warehouse built on some bizarre proprietary ETL written in PHP.
I'm looking for arguments as to why it's worth the investment to move to a standard system such as SSIS, Informatica, or something similar. The primary reasons I have at the moment are:
A wider and more diverse community of developers available for contract work, replacements etc.
A large online knowledge base/support networks
Ongoing updates and support will be better
What are other good high-level arguments for bringing in a little standardisation? :)
The only real disadvantage is that a lot of the data sources are web APIs returning individual row-by-row records, which are more easily looped through with PHP than with standard ETL.
Here are some more:
Simplifies the development and deployment process.
Easy to debug and to incorporate changes, which reduces maintenance and enhancement costs.
Industry-standard ETL tools perform better on large volumes of data because they use techniques such as grid computing, parallel processing, and partitioning.
Can support many types of data stores as source or target, so there is less impact if source or target systems are migrated to a different data store.
Code is reusable: the same component can be used in multiple processes.

How to use pure SQL for Exploratory Data Analysis?

I'm an ETL developer who uses different tools for ETL tasks. The same issue comes up in all our projects: the importance of data profiling before the data warehouse and the ETL for data movement are built. I have usually done data profiling (finding bad data, anomalies, counts, distinct values, etc.) in pure SQL, because ETL tools do not provide a good alternative (our tools have some data quality components, but they are not very sophisticated). One option is to use R, SPSS Modeler, or a similar tool for this kind of exploratory data analysis. But these tools are usually not available, or are not suitable when there are millions of rows of data.
How can this kind of profiling be done using SQL? Are there any helper scripts available? How do you do this kind of exploratory data analysis before data cleaning and ETL?
Load the data into a staging system and use the Data Profiling Task in SSIS. See http://gowdhamand.wordpress.com/2012/07/27/data-profiling-task-in-ssis/ for how to run the analysis. Hope this helps.
I found a good tool for this purpose: DataCleaner. It seems to do most of the things I want to do with data in the EDA process.
Use edaSQL, an exploratory data analysis tool for SQL, which can help with data profiling and analysis:
https://pypi.org/project/edaSQL/
Source code:
https://github.com/selva221724/edaSQL
https://github.com/selva221724/edaSQL
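If you'd rather stay in plain SQL, the basic profile counts mentioned in the question (row counts, NULLs, distinct values, min/max, most frequent values) are one aggregate query per column. Below is a rough sketch that runs such queries from Python against an in-memory SQLite database as a stand-in for a staging database. The `customers` table, its rows, and the helpers `profile_column` and `top_values` are all made up for illustration; the SQL itself should carry over to most engines with little change.

```python
import sqlite3

def profile_column(conn, table, column):
    """Return basic profile counts for one column using plain SQL.

    Table and column names are interpolated into the SQL text, so they must
    come from trusted code, never from user input.
    """
    sql = f"""
        SELECT COUNT(*) AS total_rows,
               SUM(CASE WHEN "{column}" IS NULL THEN 1 ELSE 0 END) AS null_rows,
               COUNT(DISTINCT "{column}") AS distinct_values,
               MIN("{column}") AS min_value,
               MAX("{column}") AS max_value
        FROM "{table}"
    """
    row = conn.execute(sql).fetchone()
    keys = ("total_rows", "null_rows", "distinct_values", "min_value", "max_value")
    return dict(zip(keys, row))

def top_values(conn, table, column, limit=5):
    """Return the most frequent values of a column with their counts."""
    sql = f"""
        SELECT "{column}", COUNT(*) AS n
        FROM "{table}"
        GROUP BY "{column}"
        ORDER BY n DESC, "{column}"
        LIMIT ?
    """
    return conn.execute(sql, (limit,)).fetchall()

# Hypothetical staging table with a few typical data problems:
# NULLs, inconsistent casing ("fi" vs "FI"), and an implausible age.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, country TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "FI", 34), (2, "FI", None), (3, "SE", 41), (4, None, 29), (5, "fi", 230)],
)

country_profile = profile_column(conn, "customers", "country")
age_profile = profile_column(conn, "customers", "age")
country_counts = top_values(conn, "customers", "country")
print(country_profile)
print(age_profile)
print(country_counts)
```

Looping `profile_column` over every column of every staging table gives a crude profiling report. Since all the work happens inside the database, the number of rows matters much less than it would in a client-side tool.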

Database Optimization: Need a really big database to test some of the features of SQL Server

I have done database optimization for databases up to 3 GB in size. I need a really large database to test optimization.
Simply generating a lot of data and throwing it into a table proves nothing about the DBMS, the database itself, the queries being issued against it, or the applications interacting with them, all of which factor into the performance of a database-dependent system.
The phrase "I have done database optimization for [databases] up to 3 GB" is highly suspect. What databases? On what platform? Using what hardware? For what purposes? For what scale? What was the model? What were you optimizing? What was your budget?
These same questions apply to any database, regardless of size. I can tell you first-hand that "optimizing" a 250 GB database is not the same as optimizing a 25 GB database, which is certainly not the same as optimizing a 3 GB database. But that is not merely on account of the database size, it is because databases that contain 250 GB of data invariably deal with requirements that are vastly different from those addressed by a 3 GB database.
There is no magic size barrier at which you need to change your optimization strategy; every optimization requires in-depth knowledge of the specific data model and its usage requirements. Maybe you just need to add a few indexes. Maybe you need to remove a few indexes. Maybe you need to normalize, denormalize, rewrite a couple of bad queries, change locking semantics, create a data warehouse, implement caching at the application layer, or look into the various kinds of vertical scaling available for your particular database platform.
I submit that you are wasting your time attempting to create a "really big" database for the purposes of trying to "optimize" it with no specific requirements in mind. Various data-generation tools are available for when you need to generate data fitting specific patterns for testing against a specific set of scenarios, but until you have that information on hand, you won't accomplish very much with a database full of unorganized test data.
The best way to do this is to create your schema and write a script to populate it with lots of random(ish) dummy data. "Random" means that your text fields don't necessarily have to make sense; "ish" means that the data distribution and patterns should generally reflect your real-world DB usage.
Edit: a quick Google search reveals a number of commercial tools that will do this for you if you don't want to write your own populate scripts: DB Data Generator, DTM Data Generator. Disclaimer: I've never used either of these and can't really speak to their quality or usefulness.
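A minimal sketch of such a populate script, assuming a single hypothetical `orders` table and using an in-memory SQLite database as a stand-in for the real server. The schema, the distributions, and the function name `populate_orders` are all invented for illustration; the point is only that the text is random while the distributions are skewed the way real usage tends to be.

```python
import random
import sqlite3

def populate_orders(conn, n_rows, n_customers, seed=0):
    """Fill a hypothetical orders table with random(ish) dummy data."""
    rng = random.Random(seed)  # fixed seed so test runs are repeatable
    statuses = ["shipped", "pending", "cancelled"]
    status_weights = [0.85, 0.10, 0.05]  # assumed mix, adjust to your system
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, "
        "status TEXT, amount REAL, note TEXT)"
    )
    rows = []
    for order_id in range(1, n_rows + 1):
        # Pareto-distributed draw: a few customers place most of the orders.
        customer_id = min(n_customers, int(rng.paretovariate(1.2)))
        status = rng.choices(statuses, weights=status_weights)[0]
        # Log-normal amounts: many small orders, a few very large ones.
        amount = round(rng.lognormvariate(3.0, 1.0), 2)
        # The text does not have to make sense, it only has to take up space.
        note = "".join(rng.choices("abcdefghijklmnopqrstuvwxyz ", k=20))
        rows.append((order_id, customer_id, status, amount, note))
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
populate_orders(conn, n_rows=10_000, n_customers=500)

total = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
lowest, highest = conn.execute(
    "SELECT MIN(customer_id), MAX(customer_id) FROM orders"
).fetchone()
seen_statuses = {s for (s,) in conn.execute("SELECT DISTINCT status FROM orders")}
print(total, lowest, highest, seen_statuses)
```

For a real test you would swap in your own schema, scale `n_rows` up, and pick distributions that match what you see in production.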
Here is a free procedure I wrote to generate random person names. Quick and dirty, but it works and might help.
http://www.joebooth-consulting.com/products/genRandNames.sql
I use Red-Gate's Data Generator regularly to test out problems as well as loads on real systems, and it works quite well. That said, I would agree with Aaronnaught's sentiment that the overall size of the database isn't nearly as important as the usage patterns and the business model. For example, generating 10 GB of data in a table that will eventually get no traffic will not provide any insight into optimization. The goal is to replicate the transaction and storage loads you expect, in order to identify bottlenecks before they occur.

Commercial uses for grid computing?

I keep hearing from associates about grid computing which, from what I can gather, is highly distributed stuff along the lines of SETI@home.
Is anyone working on these sorts of systems for business use? My interest is in figuring out if there's a commercial reason for starting software development in this field.
Rendering Farms such as Pixar
Model Evaluation e.g. weather, financials, military
Architectural Engineering e.g. earthquakes.
To list a few.
Grid computing is really only needed if you have a lot of WORK that needs to be done, like folding proteins, otherwise a simple server farm will likely be plenty.
Obviously Google are major users of grid computing; their entire search service relies on it, as do many of their other services.
Engines such as BigTable are based on using lots of nodes for storage and computation. These are commercially very useful because they're a good alternative to a small number of big servers, providing better redundancy and cost effective scaling.
The downside is that the software is fiendishly difficult to write, but Google seem to manage that one ok :)
So anything which requires big storage and/or lots of computation.
I used to work for these guys. Grid computing is used all over. Anyone who makes computer chips uses them to test designs before getting physical silicon cut. Financial websites use grids to calculate if you qualify for that loan. These days they are starting to replace big iron in a lot of places, as they tend to be cheaper to maintain over the long term.