Example data for Hive tutorial

Example data for Hive tutorial - hive

The original Hive tutorial available online refers to a dataset called "pv_2008-06-08.txt":
https://cwiki.apache.org/confluence/display/Hive/Tutorial
And of course, it is referenced in dozens of tutorials all over the Internet. However, there is no way I can find the original data anywhere. Does anybody have a clue where is it?

After reading through the given site, found that examples given in that site is outdated. Please use new link for more examples.
https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-UsageandExamples
NOTE: Many of the following examples are out of date.  More up to date information can be found in the LanguageManual.
If you are still interested in that dataset, suggest you to mail to the community to provide (please refer below link)
http://hive.apache.org/mailing_lists.html
Hortonworks datasets:
Recently I come across this Hortonworks datasets which can be used for creating database and queries in Hive and Pig.
https://app.box.com/v/hadoopcrashcoursedata
If you want to try with this dataset, here is the link for creating table using the above dataset
http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/#section_4

Related

BigQuery: include procedures when copying a dataset

I need to copy a BigQuery dataset from one project to another. However, when I follow the documentation
https://cloud.google.com/bigquery/docs/copying-datasets#required_permissions
I am only able to transfer the tables and not the procedures that are also stored in the same dataset. Is there a way to accomplish that?

In order to further contribute to the community I am posting #MikhalBerlyant`s answer as Community Wiki.
Currently there is an open Feature request for copying procedures between datasets. Although, there is no ETA for evaluation and implementation, you can follow any updates by following this thread, link.

Where is the query tutorial in Redis official site?

I am new to redis and just start to learn it today. The official website does a good job about what the data types are and how to set them. That part is not hard to understand. But the problem is without queries, data becomes meaningless. I really failed to find any good documentation on how to do queries/searches in the official site.
When googling, I found this question Redis strings vs Redis hashes to represent JSON and people are all ignoring queries. I just don't get it at all. Many people suggest to store JSON as a string value to the key. This looks very crazy to me. How can they query JSON keys later? For example, for a user object to store in either key-value data type or hashes, how to query users whose age is greater than 30? That should be a very basic and simplest query for a database.
Thank you very much for your help. I am very confused.
EDITED:
After long time googling, I figured out a basic concept: redis can only query keys, and value are not searchable. Thus to search values, I have to create keys which contain the value. This answers my second question.
But the first and my primary question is where to find query tutorial in redis official site. Since redis is very different from sql db, the question might be changed to where to find data modeling and query tutorial in redis official site. It seems to do queries, I have to create some kind of special keys first. That makes query tutorial become modeling tutorial in the end.
Btw, for those who are new to redis and confused like me, you can read this article Storing and Querying Objects. Even if it has a little bug in it, it clarifies many thing about how to use redis for query. These kind of information really should go into redis official doc.

I really failed to find any good documentation on how to do queries/searches in the official site.
Check the command manual for how to query different data structures.
How can they query JSON keys later
JSON is NOT a built-in data structure for Redis. If you want to query JSON data you need to build the index with Redis' built-in data structures by yourself, or you might want to try RedisJSON, which is a Redis module for processing JSON data.

apache storm/spark and data visualisation tool(s)

I have been searching for hours bu i did not find a clear answer. I would like to know what it is the most suitable data visualization tool(s) to use with apache storm/spark. I know there is tableau and jaspersoft but they are not free. Furthermore, there is the possibility of elasticsearch and kibana but I would like to find/try something else. So, do you have an idea please ?!
Thanks a lot for your attention.

You are not giving much info here. Storm is stream processing engine, Spark can do a lot more but in both cases you need to deposit information somewhere. If it is text based data, you may do Solr+Graphana or Elastic+Kibana. If it is SQL or NoSQL DB there are many tools mostly around data base type. There are BIs for time series with InfluxDB, etc. With Spark, you have Zepplin that can do some level of BI. The last is to have your own visualization but I would be careful with D3 as it is not very good for dynamic charts. You may be better with pure JS charts like HighCharts, etc.
Best of luck.

Apache Zeppelin is a great web based front end for Spark
Highcharts is an excellent chart library.
spark-highcharts add easy modeling feature from Spark DataFrame to highcharts. It can be used in Zeppelin, spark-shell, or other spark application.
spark-highcharts can generate self contain HTML page with full interaction feature. It can share to other users.
Using following docker command try out
docker run -p 8080:8080 -d knockdata/zeppelin-highcharts

Have a look at D3 Javascript library.It provides a very good Visualization library
https://d3js.org/

How to turn MongoDB collection into a 'Table'

I've been given access to a cloud MongoDB (MongoLab) and need to extract some data into Excel so I can analyse it. The data isn't particularly complicated or large and is well suited to a 'normal' relational structure.
My research suggests things are trickier because the data has 'nested' aspects although conceptually its pretty clear how this would become a table. Here is what a document in the collection looks like, essntinaly the stuff highlighted blue would be columns in the table while the yellow would create a row for each "marketing_event" with the specifics of each event also being in a column:
Ideally I would use Power Query to get the data into Power Pivot but at this point anything will do!
I've tried a bunch of things all of which haven't got me much closer to end result that I'm looking for:
I downloaded MongoVue which I used to successfully connect to the database and while it enabled me to see the data in a basic table form, it does nothing with the nested stuff and the documentation is minimal in terms of how it could be of more use.
I also tried Pentaho PDI based on this article:http://sqlmag.com/blog/integrating-mongodb-and-open-source-data-stores-power-pivot but the steps aren't detailed and although I can see the collection, trying to replicate some sample queries I found on the web were totally unsuccesful.
I've tried to get a trial of Simba's ODBC connector but as yet the download doesn't seem to be working. I have contacted them but without response just yet.
I've even installed Mongo locally and tried to use the command prompt to connect which I was unable to do. Even if I pursued this I wouldn't be confident about knowing where to start in terms of creating the end product.
Happy to hear any suggestions or recommendations.
TIA
Jacob

Here's a solid ODBC driver that helps maintain the fidelity of your mongoDB data by exposing the nested MongoDB data model as a set of relational tables to Excel and other ODBC apps. in the sample document above, this driver will do exactly what you're looking for. The embedded documents and arrays can be extracted as separate related tables from the fields at the root level of the document.
https://www.progress.com/odbc/mongodb

I don't know if you already found the solution - but Simba ODBC is providing support for nested arrays.
Have a look here:
https://www.simba.com/resources/webinars/connect-tableau-big-data-source. This is an example how to connect Tableau BI to MongoDB. You might find it helpful.
And some more information on handling no-sql data in BI tools is provided in this whitepaper: http://info.mongodb.com/rs/mongodb/images/MongoDB_BI_Analytics.pdf

Wiki Database, is there one?

I was searching the net for something like a wiki database, just like wikipedia but instead stores structured content, editable by users. What I was looking for was an online database accessible by everyone where people can design the schema and data with proper versioning of both schema and data. I couldn't find any such site. I am not sure if it is my search skills or if there really is no wiki database as of now. Does anyone out there know anything like this?
I think there is a great potential for something like this. A possible example will be a website with a GUI for querying a MySQL DB where any website visitor can create DB objects and populate data.
UPDATE: I had registered the domain wikidatabase.org to get started on a tool but I didn't find enough time yet. If anyone is interested in spending some time and coding on this, please let me know at wikidatabase.org

It's not quite what you're looking for, but Semantic Mediawiki adds database-like features to MediaWiki:
http://semantic-mediawiki.org/wiki/Semantic_MediaWiki
It's still fundamentally a Wiki, but you can add semantic tags to pages ([[foo::bar]] [[baz::1000]]) and then do database-type queries across them: SELECT baz FROM pages WHERE foo=bar would be {{#ask: [[foo::bar]] | ?baz}}. There is even an embryonic SPARQL implementation for pseudo-SQL queries.

OK this question is old, but Google led me here, so for anyone else out there looking for a wiki for structured data: Take a look at Foswiki.

This might be like what you're looking for: dbpedia.org. They're working on extracting data from Wikipedia, and encoding it in a structured format using RDF, so that it can be queried using SPARQL.
Linkeddata.org has a big list of RDF data sets.

Do you mean something like http://www.freebase.com?

You should check out https://www.wikidata.org/wiki/Wikidata:Main_Page which is a bit different but still may be of interest.

Something that might come close to your requirements is Google Docs.
What's offered is document editing roughly similar to MS Word, and spreadsheets roughly similar to Excel. I'm thinking of the latter, of course.
In Google Docs, You can create spreadsheets for free; being spreadsheets, they naturally have a row-and-column structure similar to a database, and which you can define flexibly. You can also share these sheets with other people. This seems to be a by-invite-only process rather than open-to-all, but there may be other possibilities I'm not aware of, or that level of sharing might be enough for you in any case.

mindtouch should be able to do it. It's rather easy to get data in / out. (for example: it's trivial to aggregate all the IP's for servers into one table).
I pretty much use it as a DB in the wiki itself (pages have tables, key/value..inheritance, templates, etc...) but you can also interface with the API, write dekiscript, grab the XML...

I like this idea. I have heard of some sites that are trying to pull together large datasets for various things for open consumption, but none that would allow a wiki feel.
You could start with something as simple as an installation of phpMyAdmin with a known password that would allow people to log in, create a database, edit data and query from any other site on the web.
It might suffer from more accuracy problems than wikipedia though.

OpenRecord, development of which seems to have halted in 2008, seems to approach this. It is a structured wiki in which pages are views on the data. Unlike RDBMSes it is loosely typed - the system tries to make a best guess about what data you entered, but defaults to text when it cannot guess. Schemas appear to have been implied.
http://openrecord.org
An example of the typing that is given is that of a date. If you enter '2008' in a record, the system interprets this as a date. If you enter 'unknown' however, the system allows that as well.

Perhaps you might be interested in Couch DB:
Apache CouchDB is a document-oriented
database that can be queried and
indexed in a MapReduce fashion using
JavaScript. CouchDB also offers
incremental replication with
bi-directional conflict detection and
resolution.

I'm working on an Open Source PHP / Symfony / PostgreSQL app that does this.
It allows multiple projects, each project can have multiple directories, each directory has a defined field structure. Admins set all this up.
Then members of the public can suggest new records, edit or report existing ones. All this is moderated and versioned.
It's early days yet but it basically works and is already in real world use in several projects.
Future plans already in progress include tools to help keep the data up to date, better searching/querying and field types that allow translations of content between languages.
There is more at http://www.directoki.org/

I'm surprised that nobody has mentioned Wikibase yet, which is the software that powers Wikidata.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas