Setting up an environment for Hadoop data warehousing (Hive)

I am new to Hadoop and am trying to learn it from the data warehousing and analytics side.
Can someone advise me on how to set up my practice machines, especially with regard to:
1. Number of machines/nodes required to start learning
2. Is it advisable to set up on Windows?
3. What software needs to be installed
4. Availability of test/sample data
I would also like advice on the best way to perform BI tasks with Hive.
Thank you.

I would suggest downloading the Cloudera VM if you are more interested in the Hadoop machinery. Another way to get started immediately is to use Amazon EMR (Elastic MapReduce). It has an option to create an interactive Hive cluster, so you can start playing with datasets stored in S3.
Regarding the number of nodes: it depends on your goals. If you want to get a "feel" for Hadoop performance, try at least 4-6 nodes.
Both options above are good if you do not have access to your organization's internal Hadoop/Hive cluster. Even if you do, I would suggest trying them first to gain some hands-on experience before using a shared environment.

Related

TiDB as metastore for hive

Is anybody using TiDB as the metastore for a Hive installation?
I'm looking for experiences and testimonials, as we are evaluating it to solve our current metastore bottleneck with MySQL.
This was published a while ago: https://www.pingcap.com/blog/create-scale-out-hive-cluster-with-distributed-mysql-compatible-database/
A search also found this: https://dzone.com/articles/cluster-diagnostics-troubleshoot-cluster-issues-us-2
I think TiDB can provide high availability, online upgrades and scalability which would all be good for Hive.
Disclaimer: I'm working for PingCAP, the company behind TiDB.

Using DynamoDB instead of Cassandra in cadence

Is it possible to configure DynamoDB instead of Cassandra? If so, could you please share instructions? I couldn't find any in https://github.com/uber/cadence/blob/master/docker/README.md
It is not possible, as DynamoDB is not fully compatible with Cassandra semantics for conditional batch operations.
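For readers unfamiliar with the term: a conditional batch is a group of writes that is applied as a whole only if every attached condition holds. The toy model below illustrates that all-or-nothing behaviour on an in-memory dict. It is only a sketch of the semantics; the types, function name and key names are made up for illustration and are not Cadence, Cassandra or DynamoDB APIs.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types for illustration only; not a real database API.
Condition = Callable[[Optional[str]], bool]
Write = Tuple[str, str, Condition]  # (key, new_value, condition on current value)


def apply_conditional_batch(store: Dict[str, str], batch: List[Write]) -> bool:
    """Apply every write in the batch only if every condition holds."""
    # Phase 1: check all conditions against the current state.
    for key, _, condition in batch:
        if not condition(store.get(key)):
            return False  # one failed condition rejects the whole batch
    # Phase 2: all conditions passed, so apply all writes.
    for key, value, _ in batch:
        store[key] = value
    return True


store = {"workflow:1:version": "5"}

# Both conditions hold, so both writes are applied.
ok = apply_conditional_batch(store, [
    ("workflow:1:version", "6", lambda cur: cur == "5"),
    ("workflow:1:state", "running", lambda cur: cur is None),
])

# The first condition is stale, so neither write is applied.
rejected = apply_conditional_batch(store, [
    ("workflow:1:version", "7", lambda cur: cur == "5"),
    ("workflow:1:state", "closed", lambda cur: True),
])
```

A storage backend that cannot offer this guarantee across the rows involved would need the application to handle partially applied batches, which is the kind of mismatch the answer refers to.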

BigQuery replaced most of my Spark jobs, am I missing something?

I've been developing Spark jobs for some years using on-premise clusters and our team recently moved to the Google Cloud Platform allowing us to leverage the power of BigQuery and such.
The thing is, I now often find myself writing processing steps in SQL rather than in PySpark, since it is:
easier to reason about (less verbose)
easier to maintain (SQL vs scala/python code)
you can run it easily on the GUI if needed
fast without having to really reason about partitioning, caching and so on...
In the end, I only use Spark when I've got something to do that I can't express using SQL.
To be clear, my workflow often looks like this:
preprocessing (previously in Spark, now in SQL)
feature engineering (previously in Spark, now mainly in SQL)
machine learning model and predictions (Spark ML)
Am I missing something?
Are there any downsides to using BigQuery this way instead of Spark?
Thanks
A con I can see is the additional time required by the Hadoop cluster to create and finish the job. By making a direct request to BigQuery, this extra time can be decreased.
If your tasks need parallel processing, I would recommend using Spark, but if your app is mainly used to access BQ, you might want to use the BQ Client Libraries and separate your current tasks:
BigQuery Client Libraries. They are optimized to connect to BQ. There is a QuickStart, and you can use different programming languages such as Python or Java, among others.
Spark jobs. If you still need to perform transformations in Spark and need to read the data from BQ, you can use the Dataproc-BQ connector. While this connector is installed in Dataproc by default, you can also install it on-premises so that you can continue running your SparkML jobs with BQ data. In case it helps, you might also want to consider GCP services such as AutoML, BQ ML and AI Platform Notebooks, which are specialized services for machine learning and AI.
I'm using PySpark (on GCP Dataproc) and BigQuery, and we have jobs in both. I will summarize my view of the pros and cons of each system. I do admit that your environment could be different, so something I consider a pro might not be one for you.
Pros of Spark:
better testing of the code: it is simpler to build unit tests and run them with mocked data and classes than to try to do this with BigQuery
it's possible to use SQL (SparkSQL) for operations and even combine operations over different data sources (DB, files, BQ)
we have JSON files in a format that BigQuery cannot parse, even though the files are valid JSON
more complicated logic can be implemented naturally in some cases, for example traversing arrays in nested fields and other complicated calculations
better custom monitoring is possible: when we need to check specific metrics in the pipeline, we can send them (StatsD, etc.) more easily
more natural for CI/CD processes
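To illustrate the testing point with a minimal sketch: if the per-record logic lives in a plain Python function, it can be checked against mocked records without a cluster or a warehouse. The record layout (`items`, `quantity`) and the function names are invented for this example, and wiring the function into an actual PySpark job is assumed and not shown.

```python
from typing import Any, Dict, List


def total_item_quantity(order: Dict[str, Any]) -> int:
    """Sum `quantity` over the nested `items` array of one order record."""
    items = order.get("items") or []
    return sum(int(item.get("quantity", 0)) for item in items)


def enrich(orders: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the records with a derived `total_quantity` field."""
    return [{**order, "total_quantity": total_item_quantity(order)} for order in orders]


# Mocked data: no cluster or warehouse is needed to exercise the logic.
mock_orders = [
    {"id": 1, "items": [{"sku": "a", "quantity": 2}, {"sku": "b", "quantity": 3}]},
    {"id": 2, "items": []},
    {"id": 3},  # record with the nested field missing
]
result = enrich(mock_orders)
```

The same function covers the "traversing arrays in nested fields" case, and edge cases such as the missing field are easy to pin down in a unit test.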
Pros of BigQuery (all with a note: if all data is available):
simplicity of SQL, when all data is available in a convenient format
DBAs who are not familiar with Python/Scala can still contribute (because they know SQL)
awesome infrastructure behind the scenes, very performant
With both approaches it's possible to quickly check the result in a GUI. For example, Jupyter Notebook lets you run PySpark instantly. I cannot comment on ML-related traits, though.

Is there a way to shard and replicate neo4j data?

I am considering neo4j for some of the new projects I am working on. For the given data needs (inherently graph based) neo4j fits well, and a quick prototype is giving me good response times. What I want to understand is how to scale a neo4j deployment. Specifically:
How do I shard my data across neo4j deployments? Since neo4j is deployed on a single machine, there is a limit to how much data I can store on that machine, and hence I would like to know how to distribute it. Clearly, if I split it by user, then relationships between users on different shards cannot be maintained.
How do I replicate the neo4j data? I am thinking of a SQL-like setup with masters used for writes and slaves used for reads, so that we can scale up both our readers and writers and also have a real-time backup of our data. I understand that all the neo4j data is stored in the filesystem, which is inherently not replicable. Is there a way I can do it here? Perhaps something akin to a MySQL binlog?
As of now, sharding is not handled by Neo4j itself but by the domain, much as you describe. Neo4j 2.0 is going to target that problem.
For replication, Online Backup is working and real High Availability with Master failover is in the works, using ZooKeeper to track the cluster nodes and elect new masters, etc.
Any more details on your app sharding requirements? What domain etc?
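To make the domain-level sharding trade-off concrete, here is a small sketch that assigns users to shards by a domain attribute and then separates relationships that stay inside one shard from those that span two. The user names, shard labels and function name are invented for illustration; nothing here is a Neo4j API.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def split_edges(
    edges: List[Edge], shard_of: Dict[str, str]
) -> Tuple[Dict[str, List[Edge]], List[Edge]]:
    """Group edges by shard; collect edges whose endpoints live on different shards."""
    local: Dict[str, List[Edge]] = {}
    cross: List[Edge] = []
    for a, b in edges:
        if shard_of[a] == shard_of[b]:
            local.setdefault(shard_of[a], []).append((a, b))
        else:
            cross.append((a, b))
    return local, cross


# Hypothetical domain rule: users are sharded by region.
shard_of = {"alice": "eu", "bob": "eu", "carol": "us", "dave": "us"}
edges = [("alice", "bob"), ("carol", "dave"), ("bob", "carol")]

local, cross = split_edges(edges, shard_of)
# `cross` holds the relationships that no single shard can store or traverse
# on its own; the application has to handle them separately.
```

Running this kind of count over a sample of real data is a cheap way to judge whether a proposed partitioning keeps most traversals inside one shard.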

Katta in production environment

According to the website, Katta is a scalable, failure-tolerant, distributed, indexed data storage.
I would like to know if it is ready to be deployed into a production environment. Is anyone already using it, and do you have any advice? Any pitfalls? Recommendations? Testimonials? Please share.
Any answer would be greatly appreciated.
We have tried using Katta and, for what it's worth, found it very stable and relatively easy to manage (compared to managing plain vanilla Lucene).
The only pitfall I can think of is the lack of real-time updates. When we tested it (about 9-10 months back), an update meant rebuilding the index in a separate process (a Hadoop job or what have you) and replacing the live index. This was a deal-breaker for us.
If you are looking into distributed Lucene, you should really try out ElasticSearch or Solandra.
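The update model described above (rebuild the index offline, then replace the live one) can be sketched in a few lines. This is a toy in-memory inverted index, not Katta's or Lucene's API, and all class and function names are hypothetical. It shows why new documents stay invisible until the next swap.

```python
import threading
from typing import Dict, List, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Build a tiny inverted index (term -> document ids) from scratch."""
    index: Dict[str, Set[int]] = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


class SearchService:
    """Serves queries from an immutable index that is replaced wholesale."""

    def __init__(self, docs: Dict[int, str]) -> None:
        self._lock = threading.Lock()
        self._index = build_index(docs)

    def search(self, term: str) -> List[int]:
        with self._lock:
            index = self._index  # grab the current index reference
        return sorted(index.get(term.lower(), set()))

    def replace_index(self, new_index: Dict[str, Set[int]]) -> None:
        with self._lock:
            self._index = new_index  # swap in the freshly built index


docs = {1: "hadoop hive", 2: "lucene index"}
service = SearchService(docs)

docs[3] = "hive metastore"        # a new document arrives
before = service.search("hive")   # the live index does not know about it yet

service.replace_index(build_index(docs))  # offline rebuild, then swap
after = service.search("hive")
```

The gap between `before` and `after` is the update latency the answer calls a deal-breaker: it is bounded below by how long a full rebuild takes.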