I'm working on anomaly detection with timeseries using R. The data is stored in Hadoop. There is a sequence of queries to execute during the script runs and they are quite different.
I wanted to know what is the best way to stock these queries to easily maintain in case of changes in tables structure or access path? For example, I've seen that using Impala or Hive I can save queries, but could I then call them from R with RJDBC package?
Thanks in advance.
Related
Summary
I have a data analysis project that requires the use of a locally stored PostgreSQL database of fairly decent size for a desktop machine (~10 tables with up to ~90m rows and ~20 columns, summing to around 20gb worth of data).
I have some statistical models I want to execute on this data using R. First, though, I need to manipulate the data a little bit to get it into the form I want. The basic manipulations are fairly straightforward SELECT and JOIN operations, but they take a a few minutes to do on my machine because of the amount of data. I'll need to refer to the manipulated tables again and again in analysis, so I'd like to be able to save the results of these SELECT and JOIN operations for later use.
Question
Is it faster or computationally more efficient to
(a) execute the joins from R's DBI package, using, say, dbGetQuery, and saving the resulting dataframes on disk for later analysis
or
(b) do the joins and selects inside PGAdmin or DataGrip, save the result to a .csv file, and bring that into R?
What I've tried
I've tried three of the operations I need to do in both RStudio (as in a) and DataGrip (as in b) and timed them with a stopwatch. In 2 instances, the code seems to go faster inside the SQL environment in DataGrip, and in the third, it's marginally faster in RStudio. I'm not sure why this is, besides the third operation working on smaller tables than the first two. No, I don't know how to benchmark code in either platform, and yes, this may be part of my issue. Nor do I know much about big-O notation, but that may not be relevant here.
I can include some more concrete code if it's helpful, but my question (it seems to me) is a little more theoretical. I'm basically asking if connecting to a SQL server on my local machine should be any different if I'm doing it in R versus doing it in some "proper" database environment. Are there bottlenecks in one and not the other?
Thanks in advance for any insight!
It seems like all queries expressed in SQL can be converted into MapReduce jobs. This is in essence what Spark SQL does. SparkSQL takes in SQL, converts it to a MapReduce job then executes the MapReduce job on Spark's runtime.
All questions which can be answered by SQL can be answered by MapReduce jobs. Can all MapReduce jobs also be written as SQL (maybe with custom user defined functions)? When does it make sense to use MapReduce over SQL or vice versa?
SQL is useful when you have structured data (e.g. tables, with clearly defined columns and, usually, data types). Using SQL with that structure you can select columns, join them, etc.
With MapReduce you can do that (Spark SQL will help you do that) but you can also do much more. A typical example is a word count app that counts the words in text files. Text files do not have any predefined structure that you can use to query them using SQL. Take into account that kind of applications are usually coded using Spark core (i.e. RDD) instead of Spark SQL, since Spark SQL needs also a structure.
Another maybe more real use case is processing large amounts of log files using MapReduce (again, log files does not have a relational structure such as the one required by SQL).
SQL and MapReduce also have their own advantages. To data analysist , they don't need to learn how to write the MapReduce Program. And from the perspective of developer, writing MapReduce program leave enough room to tuning the program ,like add random prefix to skewing data.
And In the long run, with the development of SQL interpreters, use SQL over MapReduce/Spark RDD.
I am new to Snowflake and Python. I am trying to figure out which would faster and more efficient:
Read data from snowflake using fetch_pandas_all() or fetch_pandas_batches() OR
Unload data from Snowflake into Parquet files and then read them into a dataframe.
CONTEXT
I am working on a data layer regression testing tool, that has to verify and validate datasets produced by different versions of the system.
Typically a simulation run produces around 40-50 million rows, each having 18 columns.
I have very less idea about pandas or python, but I am learning (I used to be a front-end developer).
Any help appreciated.
LATEST UPDATE (09/11/2020)
I used fetch_pandas_batches() to pull down data into manageable dataframes and then wrote them to the SQLite database. Thanks.
Based on your use-case, you are likely better off just running a fetch_pandas_all() command to get the data into a df. The performance is likely better as it's one hop of the data, and it's easier to code, as well. I'm also a fan of leveraging the SQLAlchemy library and using the read_sql command. That looks something like this:
resultSet = pd.read_sql(text(sqlQuery), SnowEngine)
once you've established an engine connection. Same concept, but leverages the SQLAlchemy library instead.
We have large volumes (10 to 400 billion) of raw data in BigQuery tables. We have a requirement to process this data to convert and create the data in the form of star schema tables (probably a different dataset in bigquery) which can then be accessed by atscale.
Need pros and cons between two options below:
1. Write complex SQL within BigQuery which reads data form source dataset and then loads to target dataset (used by Atscale).
2. Use PySpark or MapReduce with BigQuery connectors from Dataproc and then load the data to BigQuery target dataset.
The complexity of our transformations involve joining multiple tables at different granularity, using analytics functions to get the required information, etc.
Presently this logic is implemented in vertica using multiple temp tables for faster processing and we want to re-write this processing logic in GCP (Big Query or Data Proc)
I went successfully with option 1: Big Query is very capable to run the very complex transformation with SQL, on top of that you can also run them incrementally with time range decorators. Note that it takes a lot of time and resources to take data back and forth to BigQuery. When running BigQuery SQL data never leaves BigQuery in the first place and you already have all raw logs there. So as long your problem can be solved by a series of SQL I believe this is the best way to go.
We moved out Vertica reporting cluster, rewriting successfully ETL last year, with option 1.
Around a year ago, I've written POC comparing DataFlow and series of BigQuery SQL jobs orchestrated by potens.io workflow allowing SQL parallelization at scale.
I took a good month to write DataFlow in Java with 200+ data points and complex transformation with terrible debugging capability at a time.
And a week to do the same using a series of SQL with potens.io utilizing
Cloud Function for Windowed Tables and parallelization with clustering transient tables.
I know there's been bunch improvement in CloudDataFlow since then, but at a time
the DataFlow did fine only at a million scale and never-completed at billions record input (main reason shuffle cardinality went little under billions of records, with each records having 200+ columns). And the SQL approach produced all required aggregation under 2 hours for a dozen billion. Debugging and easiest of troubleshooting with potens.io helped a lot too.
Both BigQuery and DataProc can handle huge amounts of complex data.
I think that you should consider two points:
Which transformation would you like to do in your data?
Both tools can make complex transformations but you have to consider that PySpark will provide you a full programming language processing capability while BigQuery will provide you SQL transformations and some scripting structures. If only SQL and simple scripting structures can handle your problem, BigQuery is an option. If you need some complex scripts to transform your data or if you think you'll need to build some extra features involving transformations in the future, PySpark may be a better option. You can find the BigQuery scripting reference here
Pricing
BigQuery and DataProc have different pricing systems. While in BigQuery you'd need to concern about how much data you would process in your queries, in DataProc you have to concern about your cluster's size and VM's configuration, how much time your cluster would be running and some other configurations. You can find the pricing reference for BigQuery here and for DataProc here. Also, you can simulate the pricing in the Google Cloud Platform Pricing Calculator
I suggest that you create a simple POC for your project in both tools to see which one has the best cost benefit for you.
I hope these information help you.
I have been using Spark for some time now with success in Python however we have a product written in C# that would greatly benefit from distributed and parallel execution. I did some research and tried out the new C# API for Spark but this is a little restrictive at the moment.
In regards to Ignite, on the surface it seems like a decent alternative. Its got good .NET support, it has clustering ability and the ability to distribute compute across the grid.
However, I was wondering if it really can be used to replace Spark in our use case - what we need is a distributed way in which to perform data frame type operations. In particular a lot of our code in Python was implemented using Pandas UDF and we let Spark worry about the data transfer and merging of results.
If i wanted to use Ignite, where our data is really more like a table (typically CSV sourced) rather than key/value based, is there an efficient way to represent that data across the grid and send computations to the cluster that execute on an arbitrary subset of the data in the same way Spark does, especially in the sense that the result of the calculations just become 1..n more columns in the dataframe without having to collect all the results back to the main program?
You can load your structured data (CSV) to Ignite using its SQL implementation:
https://apacheignite-sql.readme.io/docs/overview
it will provide the possibility to do distributed SQL queries over this data and indexes support. Spark also provides the possibility to work with structured data using SQL but there are no indexes. Indexes will help you to significantly increase the performance of your SQL operations.
In case if you have already had some solution worked using Spark data frames then you also can save the same logic but use Ignite integration with Spark instead:
https://apacheignite-fs.readme.io/docs/ignite-data-frame
In this case, you can have all data stored in Ignite SQL tables and do SQL requests and other operations using Spark.
Here you can see an example how to load CSV data to Ignite using Spark DF and how it can be configured:
https://www.gridgain.com/resources/blog/how-debug-data-loading-spark-ignite