How to represent Hive data for Mahout's recommendations?

I am a newbie in Hadoop / big data analysis. I am reading the book "Mahout in Action".
There I saw a topic which explains how to represent a recommender's data from a database. The book shows a programmatic approach to connecting MySQL with Mahout.
My question is: is it possible to connect Hive with Mahout the way we connect MySQL? If yes, then how?

These are two quite different things -- you're talking about using a non-distributed recommender with MySQL, versus a Hadoop-based distributed recommender with the output of Hive on HDFS. If you're using the recommenders, the simplest thing is to get your Hive output as plain CSV data on HDFS and use that as input.
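For the non-distributed case, one hedged way to produce that CSV from Python is sketched below; PyHive, the host, and the table/column names are assumptions, not something from the book. Mahout's file-based data model just wants one userID,itemID,preference triple per line.

import csv

from pyhive import hive

# Placeholders: adjust host, port, database, table and column names to your setup
conn = hive.Connection(host="hive-server", port=10000, database="default")
cursor = conn.cursor()
cursor.execute("SELECT user_id, item_id, rating FROM ratings")

with open("ratings.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for user_id, item_id, rating in cursor:
        # One preference per line: userID,itemID,preference
        writer.writerow([user_id, item_id, rating])

For the Hadoop-based jobs you would normally skip the client round trip entirely and have Hive write delimited output straight to an HDFS directory (INSERT OVERWRITE DIRECTORY), then point the recommender job at that path.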

Related

Spark SQL vs Hive vs Presto SQL for analytics on top of Parquet files

I have terabytes of data stored in Parquet format for an analytics use case. There are multiple big tables which need joins as well, and there are heavy queries. The system is expected to be highly scalable. Currently, I am evaluating Spark SQL, Hive and Presto SQL. On paper, all seem to meet the requirements. Could you please shed some light on the differences and on what should be considered for the above-mentioned use case? Tableau will be used for visualization on top of this.

Fetch_pandas vs Unload as Parquet - Snowflake data unloading using Python connector

I am new to Snowflake and Python. I am trying to figure out which would be faster and more efficient:
Read data from Snowflake using fetch_pandas_all() or fetch_pandas_batches(), OR
Unload data from Snowflake into Parquet files and then read them into a DataFrame.
CONTEXT
I am working on a data-layer regression testing tool that has to verify and validate datasets produced by different versions of the system.
Typically a simulation run produces around 40-50 million rows, each having 18 columns.
I have very little experience with pandas or Python, but I am learning (I used to be a front-end developer).
Any help appreciated.
LATEST UPDATE (09/11/2020)
I used fetch_pandas_batches() to pull down data into manageable dataframes and then wrote them to the SQLite database. Thanks.
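For reference, a rough sketch of that batch-to-SQLite approach (the connection parameters, query, and table names below are placeholders, not from the original post):

import sqlite3

import snowflake.connector

# Placeholder credentials and object names
ctx = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="my_schema",
)
cur = ctx.cursor()
cur.execute("SELECT * FROM simulation_results")

lite = sqlite3.connect("regression_baseline.db")
for batch_df in cur.fetch_pandas_batches():
    # Append each manageable chunk instead of holding 40-50M rows in memory
    batch_df.to_sql("simulation_results", lite, if_exists="append", index=False)
lite.close()
ctx.close()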
Based on your use-case, you are likely better off just running a fetch_pandas_all() command to get the data into a df. The performance is likely better as it's one hop of the data, and it's easier to code, as well. I'm also a fan of leveraging the SQLAlchemy library and using the read_sql command. That looks something like this:
resultSet = pd.read_sql(text(sqlQuery), SnowEngine)
once you've established an engine connection. Same concept, but leverages the SQLAlchemy library instead.
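Fleshed out a little so the snippet above is self-contained (the account and credential values are placeholders), the SQLAlchemy route might look roughly like this:

import pandas as pd
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine, text

# Placeholder account and credentials
SnowEngine = create_engine(URL(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="my_schema",
))

sqlQuery = "SELECT * FROM simulation_results"
with SnowEngine.connect() as conn:
    resultSet = pd.read_sql(text(sqlQuery), conn)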

BigQuery replaced most of my Spark jobs, am I missing something?

I've been developing Spark jobs for some years using on-premises clusters, and our team recently moved to the Google Cloud Platform, allowing us to leverage the power of BigQuery and such.
The thing is, I now often find myself writing processing steps in SQL more than in PySpark, since it is:
easier to reason about (less verbose)
easier to maintain (SQL vs scala/python code)
you can run it easily on the GUI if needed
fast without having to really reason about partitioning, caching and so on...
In the end, I only use Spark when I've got something to do that I can't express using SQL.
To be clear, my workflow is often like:
preprocessing (previously in Spark, now in SQL)
feature engineering (previously in Spark, now mainly in SQL)
machine learning model and predictions (Spark ML)
Am I missing something?
Is there any con to using BigQuery this way instead of Spark?
Thanks
One con I can see is the additional time the Hadoop cluster needs to spin up and finish a job. Making a direct request to BigQuery avoids that extra time.
If your tasks need parallel processing, I would recommend using Spark, but if your app is mainly used to access BQ, you might want to use the BQ client libraries and separate your current tasks:
BigQuery Client Libraries. They are optimized to connect to BQ. Here is a QuickStart, and you can use different programming languages like Python or Java, among others.
Spark jobs. If you still need to perform transformations in Spark and need to read the data from BQ, you can use the Dataproc-BQ connector. While this connector is installed in Dataproc by default, you can install it on-premises so that you can continue running your SparkML jobs with BQ data. Just in case it helps, you might also want to consider some GCP services like AutoML, BQ ML, AI Platform Notebooks, etc.; they are specialized services for Machine Learning and AI.
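As a rough illustration of both paths (the project, dataset, and table names below are made up):

# Path 1: query BigQuery directly with the Python client library
from google.cloud import bigquery

bq_client = bigquery.Client()
df = bq_client.query(
    "SELECT user_id, COUNT(*) AS n FROM `my-project.my_dataset.events` GROUP BY user_id"
).to_dataframe()

# Path 2: read the same table into Spark on Dataproc via the BigQuery connector
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-example").getOrCreate()
events = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.events")
    .load()
)
events.groupBy("user_id").count().show()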
I'm using PySpark (on GCP Dataproc) and BigQuery, and we have jobs in both. I will summarize my view of the pros and cons of one system against the other. I do admit that your environment could be different, so something I count as a pro might not be one for you.
Pros of Spark:
better testing of the code; it is simpler to build unit tests and run them with mocked data and classes than to try to do this with BigQuery
it's possible to use SQL (SparkSQL) for operations and even combine operations over different data sources (DB, files, BQ); there is a sketch of this at the end of this answer
we have JSON files in a format that BigQuery cannot parse (even though the files are valid JSON)
it's more natural to implement complicated logic for some cases, for example traversing arrays in nested fields and other involved calculations
better custom monitoring is possible: when we need to check specific metrics in the pipeline, it is easier to send the related metrics (StatsD, etc.)
more natural for CI/CD processes
Pros of BigQuery (all with a note: if all data is available):
simplicity of SQL, when all data is available in a convenient format
DBAs who are not familiar with Python/Scala can still contribute (because they know SQL)
awesome infrastructure behind the scene, very performant
With both approaches it's possible to check results quickly in a GUI. For example, a Jupyter notebook lets you run PySpark instantly. I cannot add notes about the ML-related traits, though.
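To make the SparkSQL point concrete, here is a toy sketch of the kind of mixed-source job that is awkward in BigQuery alone; the bucket, table, and column names are invented:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mixed-sources").getOrCreate()

# JSON files that BigQuery refuses to load can still be parsed by Spark
raw_events = spark.read.json("gs://my-bucket/raw/events/*.json")
raw_events.createOrReplaceTempView("raw_events")

# Pull a reference table from BigQuery via the Spark-BigQuery connector
users = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.users")
    .load()
)
users.createOrReplaceTempView("users")

result = spark.sql("""
    SELECT u.country, COUNT(*) AS n_events
    FROM raw_events e
    JOIN users u ON u.user_id = e.user_id
    GROUP BY u.country
""")
result.show()  # quick check straight from a notebook cell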

Reasonable usage and storage of queries with R and Hadoop

I'm working on anomaly detection on time series using R. The data is stored in Hadoop. There is a sequence of queries to execute while the script runs, and they are quite different from one another.
I wanted to know the best way to store these queries so they are easy to maintain if the table structure or access paths change. For example, I've seen that with Impala or Hive I can save queries, but could I then call them from R with the RJDBC package?
Thanks in advance.

Hadoop and MS SQL Server Best Practices

I've been following Hadoop for a while; it seems like a great technology. Map/Reduce and clustering are just good stuff. But I haven't found any article regarding the use of Hadoop with SQL Server.
Let's say I have a huge claims table (600 million rows) and I want to take advantage of Hadoop. I was thinking (correct me if I'm wrong) that I can query my table, extract all of my data, and insert it into Hadoop in chunks of any format (XML, JSON, CSV). Then I can take advantage of Map/Reduce and clustering with at least 6 machines, and leave my SQL Server free for other tasks. I'm just throwing the idea out here; I just want to know if anybody has done such a thing.
Importing and exporting data to and from a relational database is a very common use-case for Hadoop. Take a look at Cloudera's Sqoop utility, which will aid you in this process:
http://incubator.apache.org/projects/sqoop.html
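If you did want to hand-roll the chunked export described in the question rather than use Sqoop (which handles the parallelism and HDFS writes for you), it could look roughly like this; the connection string, table, and chunk size are placeholders:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder SQL Server connection string
engine = create_engine(
    "mssql+pyodbc://user:password@my-sql-server/claims_db"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# Stream the claims table out in chunks and write each chunk as CSV
chunks = pd.read_sql("SELECT * FROM claims", engine, chunksize=500000)
for i, chunk in enumerate(chunks):
    # Each file can then be copied to HDFS (e.g. with `hdfs dfs -put`)
    chunk.to_csv(f"claims_part_{i:05d}.csv", index=False)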