Recently there has been development going on for the Apache Tajo project. The project's objective is to be an "advanced open source data warehouse system in Hadoop for processing web-scale data sets".
Since we already have Apache Hive as a data warehouse for Hadoop, and it is now mature and widely used, how useful and different would this new project be for the Hadoop world?
If you already have a stable warehouse on Hive, I'm pretty sure you don't have to move away in the short term. A couple of areas that Tajo is trying to address are:
Low-latency (ad-hoc) queries: you might already be getting fast enough results using Impala or Tez, and Hive-on-Spark is coming with CDH 5.7. For even faster responses, a different database (not usually a DWH) can be used.
Full SQL support: as long as the people using Hive are comfortable with HQL, there's no pressing need for full SQL, although it's easy to see why full SQL support would be a benefit.
In my CS program, I was told I should learn SQL for my databases.
If I'm using PostgreSQL, do I also need a SQL server to go along with it? Is PostgreSQL a language, a server, or both? Is there even a SQL language or is it only servers?
Background: I downloaded Postgres because hey, that has SQL in the name, it works and I'm under the impression it's a pretty good choice anyway. But I couldn't figure out through their website if it needs a companion server, so I went looking for one and found AWS RDS.
The impression I have is that Postgres is the language and AWS RDS is the server, and they serve different functions. But I'm not sure about any of that.
Seems you're learning too many new topics at the same time.
Ok. I'll try to answer.
SQL stands for 'Structured Query Language' and serves as a 'standard' that many vendors respect in most of its fundamentals. Oracle, MySQL (now owned by Oracle), MariaDB and PostgreSQL are some of those vendors.
The main thing I would recommend you identify every time you look at SQL code is whether it belongs to DML or DDL. DML stands for 'Data Manipulation Language' and refers to SQL instructions that modify data. DDL stands for 'Data Definition Language' and refers to instructions that define or alter the structure in which data will be stored.
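For instance, here is a minimal sketch of each (the employee table and its columns are just hypothetical examples):

-- DDL: defines or alters the structure that will hold the data
CREATE TABLE employee (
    id    SERIAL PRIMARY KEY,
    name  TEXT NOT NULL,
    hired DATE
);

-- DML: manipulates the data itself
INSERT INTO employee (name, hired) VALUES ('Alice', '2020-01-15');
UPDATE employee SET hired = '2020-02-01' WHERE name = 'Alice';
DELETE FROM employee WHERE name = 'Alice';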
Another important concept is the atomicity of data manipulation: you can confirm a change or roll it back before it is persisted, which corresponds to issuing a 'commit' or a 'rollback'. It's a somewhat advanced concept, but it generally happens "automatically" with standard client configurations. Later, you will need to know about it when programming a system module that interacts with the database.
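A minimal sketch of the idea, reusing the hypothetical employee table from above:

BEGIN;                                        -- start an atomic unit of work
UPDATE employee SET name = 'Alice B.' WHERE id = 1;
ROLLBACK;                                     -- discard the change; nothing was persisted

BEGIN;
UPDATE employee SET name = 'Alice B.' WHERE id = 1;
COMMIT;                                       -- persist the change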
When you think of a SQL 'server', it refers to the installed/configured software which has the responsibility of managing the persistence of data within some kind of persistence 'instance', allocated on a system with data storage capabilities. AWS implements this service in the cloud, and RDS is the product that supports many SQL flavors to choose from (Oracle, PostgreSQL, etc.).
If you are comfortable with Docker, I recommend you learn the basics, which will help you set up and tear down databases many times; this is useful for developing and testing locally. The following command lets you start a PostgreSQL database listening on port 5432. You can watch the server log through Docker and use a SQL client to connect. When you press Ctrl+C, everything will be deleted. Of course there are other ways to keep the data persistent, but this command is an easy starting point.
$ docker run --rm -p 5432:5432 --name some-postgres-container-name -e POSTGRES_PASSWORD=mysecretpassword postgres:13.3
Side note: it's better to get used to always working with specific Docker image versions (not 'latest').
More details on its usage here: https://hub.docker.com/_/postgres/
If I'm using PostgreSQL, do I also need a SQL server to go along with it? Is PostgreSQL a language, a server, or both? Is there even a SQL language or is it only servers lol? I'm genuinely trying to figure this out myself, but basically everything I read is beyond my scope of competence and confuses me more. I'm learning the syntax of SQL well enough, but I'm so confused about everything on the most fundamental level.
By the way "SQL Server" is Microsoft's SQL flavor, just another one. Don't be confused with the concept of having some SQL server configured.
Yes, you can think of PostgreSQL as a language too, one which shares most of its syntax and semantics with other SQL vendors. Yes, there is a 'basic' SQL language shared and compatible between all vendors; some share more aspects than others. In terms of Venn diagrams, you can think of many circles, one each for Microsoft's SQL Server, Oracle SQL, PostgreSQL, MySQL, etc., sharing the vast majority of their elements, where each element is a SQL instruction.
When dealing with databases in general, keep in mind that they help to model 'real world' scenarios or software systems. SQL allows you to 'talk' to implementations of "relational databases", which is one kind of database model, but there are others too. ER diagrams help to represent the 'structure' of a database in a conceptual manner. I like DBeaver because it has an integrated ER diagram generator, which helps you understand the structure of a given database instance.
I have used Postgres and it is an excellent product (and free).
I would install it standalone first. It does come with its own client tools, which you use to communicate with the database server, which runs independently as a service. However, you might be better off installing something like SqlWorkbench as a client tool (which I use). In the config you specify the machine Postgres is running on (which can be your local computer for testing purposes) and the port to connect on. Essentially, the client sends your instructions to the Postgres server and the server returns the result sets associated with your instructions. The client also formats the result sets into a nice readable "spreadsheet" format with rows and columns.
First I'll try to answer the questions you asked. There is a SQL language, but in practice it is not strictly standardized. There are many offerings for databases and database servers. Many of these are discussed below.
Any database you pick will give you the chance to learn basics of SQL queries and this knowledge will serve you well even if you switch to a different database later.
Specifically, when it comes to PostgreSQL, it is a Relational Database Management System. It is software that operates as a server. You can install it on your personal computer running Windows, Linux, or macOS. You can also install it on a dedicated server computer, where you'll get better performance and uptime. Further, there are many companies that offer PostgreSQL hosting, including Amazon RDS and Google Cloud, but they're not free.
For a CS student, PostgreSQL installed on your personal computer might be a reasonable choice. But you have lots of options. Read on....
For a CS program, your choice of database will depend on:
what degree of portability you need
how much data you have
how many users will connect to the database
what kinds of jobs you might pursue after graduation
Portability
If you think you want to ship your database with your application, then your best bet is probably SQLite. By some accounts it can handle several million rows worth of data and still be performant. However, it's not great if you need multiple users to connect to the same database. Your data can get corrupted in many multi-user scenarios.
How Much Data and How Many Users
For large data volumes and many users, you'll want to consider the client/server heavy hitters:
PostgreSQL
MySQL/MariaDB
Oracle
SQL Server
These databases will support large quantities of data and many simultaneous connections. But they're not a good idea if you want to distribute the database with your application, and if you want to demonstrate your app, you need to ensure that a connection to a server will be available. All of these databases come with a free version, but the last two will have the most restrictions.
After Graduation
Now you're looking to the future and possibly what kind of skills you want to put on your resume. If you think you'll end up in a corporate environment that is already well established, they will likely already have a preferred database and it could be any of the ones listed here (SQLite or the "heavy hitters"). If you want to position yourself as developing apps with low overhead cost, you'll gravitate towards SQLite/PostgreSQL/MySQL. If you think you're going to be some kind of database administrator working in a buttoned-up corporate environment, those companies tend to favor SQL Server and Oracle.
Good luck. Any choice you make will probably be fine. Knowing some flavor of SQL is useful for your future endeavors.
SQL is a language like any other language, but it works on databases. It is called SQL because it works on structured data such as tables (i.e. rows and columns). After reading the PostgreSQL documentation, I think you do not need any separate server installation. You can download it from here. If you are facing any issues with it, I suggest using MySQL Workbench; although installation may take longer, it is easy to understand.
I need a solution for storing logs (which more or less follow one of, say, 10 standard formats), preferably in real time, in a database which is fast to query and can easily answer various weird queries, e.g. queries looking for keywords in text bodies, or queries involving multiple tables.
A solution that was recommended to me was MetaMarket, which seems to do real-time logging with a very good query system, in style. However, I'm unsure about the cost and whether or not such a complex solution is needed.
From what I understand, the "selling point" of MetaMarket is the Druid DB, and said DB is open source and can be deployed outside of their stack. So what I came here to ask is:
Have any of you had experience deploying a real-time logging system with Druid? How hard was it? How long did it take? What were the challenges? What other technologies besides Druid did you use? Do you have any recommended reading?
Have any of you had experience with MetaMarket? If so, again, how hard was it? How long did it take? What were the challenges? How were the costs once it hit production? Do you have any recommended reading on the subject?
Also, bonus question: are there actually any benchmarks done by "unbiased professionals" about Druid? The fact that a real-time-in, real-time-out database is written in Java seems a bit... ahm, hard to believe.
This is a quick answer.
It is true that Druid is open source, but the missing link here is a good UI that plays nicely with Druid. There is one UI that used to be called Caravel and is now called Superset; I guess it can do an OK job.
Concerning running a Druid cluster: it should not be that hard if you have enough resources (e.g. engineers) to put in place the whole pipeline, from packaging to deploying Druid on your machines/cloud.
Finally, the last piece is monitoring/updating the cluster, which needs a good amount of work as well.
And yes, it is written in Java, but that's the case for a lot of other real-time software - take Kafka, for example. In fact, Druid does a lot of things off-heap and uses memory-mapped files for serving data. Reading the white paper will provide a good basic understanding of the system and help you answer whether Druid is a good fit or not.
We are building new feature sets for one of our financial applications. We have our own SQL Server database, and we will be calling multiple RESTful APIs that return JSON responses. For example, some return news data, some return stock info, some return finance data, and our own SQL Server database has employee data. So they all come in their own different data formats. The new app we are building is going to aggregate all of that data and transform it into a meaningful display on the web, like mint.com does.
Web application will display analytical reports based on these data
There will be an option to download reports through various templates
We are completely open in terms of technology stack for our backend and middle tier. As a first thought, NoSQL options like MongoDB and Elasticsearch for search and reporting come to mind. There will be a web application built on top of this data (stored or retrieved from the APIs), most likely in ASP.NET MVC.
We need your input, especially if you have experience building a similar enterprise solution.
Can you please share your opinions on,
What is a good tech stack you would pick for this app?
How would it scale now and in the future when the APIs' data formats change?
Performance is also important, since the data will be displayed in a web UI.
We have a similar setup to what you are mentioning, using ASP.Net MVC with ElasticSearch (SQL server for relational data, periodically updating ES), aggregating data (XML/JSON) from multiple sources, although with the purpose of improving searching and filtering results instead of reporting. However, I would expect that the scenario you are looking at would also be a suitable match for ElasticSearch, depending on your specific requirements.
1) Since you are already using SQL Server (and I expect are familiar with that), I would suggest combining that with ElasticSearch - the additional mongodb layer seems unnecessary, in terms of maintenance of another technology and development to fit that integration. There is a very good C# library (two actually, ElasticSearch.Net and NEST, used together) that exposes most of the ES functionality.
2) We chose ElasticSearch for its scalability in combination with flexibility and ease of use. A challenge you may face could be mapping the documents from C# classes to ElasticSearch documents. In essence, it is incredibly easy to set up; however, you do need to do some planning to index data the way you want to search and retrieve it. So if choosing ES as a platform, spend some time on the structure of the documents - by default, dynamic mapping is enabled, so you can pretty much throw any JSON into a document. However, for a production environment, it's better to turn that off and have one or more mappings set up, so they can be queried in a standardized way.
3) Performance is a key factor for us as well, which is why we were looking at Lucene-based engines like Solr and ElasticSearch when doing research, along with NoSQL databases. In our experience, ElasticSearch outperforms SQL Server by 10 to 1 or better in most scenarios. Solr vs. ElasticSearch performance depends on the scenario; benchmarks and comparisons are around if you Google them. The exception may be if many documents should be retrieved in one query - ES (or Lucene) is not made for that use case; it's best for fast retrieval of a small number of results per page (similar to Google's per-page result count). If you need 1000 documents per page/result, a NoSQL database may be a better option.
ElasticSearch is fast to get up and running - install it on a local development box and try it out, you'll get a feel for if it fits.
From my experience, MongoDB is the worst choice for reporting, especially for aggregation. It lacks good aggregation functionality, has some data type conflicts (such as decimals being stored as strings, which you cannot use in its built-in aggregation framework API), and you'll probably have to maintain map-reduce functions in JavaScript for most scenarios.
If your application's true nature is only reports, and they do not have to be updated in real time, I would drop the on-demand RPC calls to external APIs. I would consider copying ahead as much data as possible, storing it under a schema that is the most convenient for you to work with, and synchronising it afterwards at scheduled, predictable intervals.
I wouldn't assume that the data will be available all the time, nor that it will always be in the format you expect. You also gain optimisation benefits from running your own copy of it, indexed the way you want, instead of trying to figure out which of the RPCs is your bottleneck.
As for your questions:
1) If you don't mind using Python, I would pick Django on top of a PostgreSQL database. Django is a fully featured, sturdy ORM + web framework which is excellent for this kind of work. If not, just stick to a relational SQL database. I've heard wonders about Cassandra but haven't tried it yet.
2 + 3) As I mentioned before, replicate the data as much as possible for your own good. After everything is "in house", you can cluster it and tweak it freely. Using a distributed cache (such as Redis) against heavy client requests is also a good idea, instead of generating those reports on demand each time.
I've been using Jasper Reports and the Jasper Reports Server to integrate into our web app. Jasper accepts many different datasource types, including JSON and SQL Server. The core version is free and allows you to produce HTML and PDF reports of high complexity. The paid version with the server allows you to easily integrate it into your web app. The core is Java Spring (partially open source) running on Tomcat/JBoss, and you can interact with it using REST web services or the visualize.js library for your web front end. It uses Highcharts, which can produce some beautiful results, and has options for ad-hoc reporting and dashboards built from many reports.
See demos here: http://www.jaspersoft.com/
This assumes a stack of your backend DBs and data sources, Tomcat with Java Spring, and an HTML/JavaScript web front end.
The tool is used by many large enterprises, including Amazon, so scalability shouldn't be an issue.
If the format of your data changes, you'll need to change the report. Reports are XML-formatted and edited via a WYSIWYG GUI.
Drill looks like an interesting tool for ad-hoc drill-down queries, as opposed to the high-latency Hive.
It seems that there should be decent integration between the two, but I couldn't find it.
Let's assume that today all of my work is done on Hive/Shark; how can I integrate it with Drill?
Do I have to switch back and forth to the Drill engine?
I'm looking for an integration similar to what Shark and Hive have.
Although there are provisions for Drill-Hive integration, your question seems to be a bit ahead of its time. Drill still has a long way to go, and folks have been trying really hard to get all this done as soon as possible.
As per their roadmap, Drill will first support Hadoop FileSystem implementations and HBase. Second, Hadoop-related data formats will be supported (e.g., Apache Avro, RCFile). Third, MapReduce-based tools will be provided to produce column-based formats. Fourth, Drill tables can be registered in HCatalog. Finally, Hive is being considered as the basis of the DrQL implementation.
See this for more details.
I'm setting up a web application with a FreeBSD PostgreSQL back-end. I'm looking for database performance optimization tools and techniques.
Database optimization is usually a combination of two things:
Reduce the number of queries to the database
Reduce the amount of data that needs to be looked at to answer queries
Reducing the amount of queries is usually done by caching non-volatile/less important data (e.g. "Which users are online" or "What are the latest posts by this user?") inside the application (if possible) or in an external - more efficient - datastore (memcached, redis, etc.). If you've got information which is very write-heavy (e.g. hit-counters) and doesn't need ACID-semantics you can also think about moving it out of the Postgres database to more efficient data stores.
Optimizing the query runtime is trickier - this can amount to creating special indexes (or indexes in the first place), changing (possibly denormalizing) the data model, or changing the fundamental approach the application takes when it comes to working with the database. See, for example, the "Pagination done the Postgres way" talk by Markus Winand on how to rethink the concept of pagination to make it more database-efficient.
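As an illustration of that kind of rethinking, here is a minimal sketch of keyset ("seek") pagination versus OFFSET, assuming a hypothetical posts table with an index on (created_at, id):

-- OFFSET pagination: the server still reads and throws away the first 10000 rows
SELECT id, title, created_at
FROM posts
ORDER BY created_at DESC, id DESC
LIMIT 20 OFFSET 10000;

-- Keyset pagination: continue right after the last row of the previous page,
-- which the index on (created_at, id) can locate directly
SELECT id, title, created_at
FROM posts
WHERE (created_at, id) < ('2013-05-01 12:00:00', 4711)
ORDER BY created_at DESC, id DESC
LIMIT 20;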
Measuring queries the slow way
But to understand which queries should be looked at first you need to know how often they are executed and how long they run on average.
One approach to this is logging all (or "slow") queries including their runtime and then parsing the query log. A good tool for this is pgfouine, which has already been mentioned earlier in this discussion; it has since been replaced by pgbadger, which is written in a friendlier language, is much faster, and is more actively maintained.
Both pgfouine and pgbadger suffer from the fact that they need query logging enabled, which can cause a noticeable performance hit on the database or get you into disk space trouble, on top of the fact that parsing the log with the tool can take quite some time and won't give you up-to-date insights into what is going on in the database.
Speeding it up with extensions
To address these shortcomings there are now two extensions which track query performance directly in the database - pg_stat_statements (which is only helpful in version 9.2 or newer) and pg_stat_plans. Both extensions offer the same basic functionality - tracking how often a given "normalized query" (the query string minus all expression literals) has been run and how long it took in total. Because this is done while the query is actually running, it is very efficient; the measurable overhead was less than 5% in synthetic benchmarks.
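For example, here is a minimal sketch of querying pg_stat_statements for the most expensive statements (the extension also has to be listed in shared_preload_libraries in postgresql.conf; note that the total_time column was renamed total_exec_time in PostgreSQL 13 and later):

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- The ten normalized queries that consumed the most total runtime
SELECT query,
       calls,
       total_time,                    -- total milliseconds spent in this statement
       total_time / calls AS avg_ms   -- average milliseconds per call
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;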
Making sense of the data
The list of queries itself is very "dry" from an information perspective. There has been work on a third extension, pg_statsinfo (along with pg_stats_reporter), that tries to address this and offer a nicer representation of the data, but it's a bit of an undertaking to get it up and running.
To offer a more convenient solution to this problem, I started working on a commercial project which is focused on pg_stat_statements and pg_stat_plans and augments the collected information with lots of other data pulled out of the database. It's called pganalyze and you can find it at https://pganalyze.com/.
To offer a concise overview of interesting tools and projects in the Postgres monitoring area, I also started compiling a list at the Postgres Wiki which is updated regularly.
pgfouine works fairly well for me. And it looks like there's a FreeBSD port for it.
I've used pgtop a little. It is quite crude, but at least I can see which query is running for each process ID.
I tried pgfouine, but if I remember correctly, it's an offline tool.
I also tail the psql.log file and set the logging criteria down to a level where I can see the problem queries.
#log_min_duration_statement = -1        # -1 is disabled, 0 logs all statements
                                        # and their durations, > 0 logs only
                                        # statements running at least this time.
I also use EMS Postgres Manager to do general admin work. It doesn't do anything for you, but it does make most tasks easier and makes reviewing and setting up your schema simpler. I find that when using a GUI, it is much easier for me to spot inconsistencies (like a missing index, field criteria, etc.). It's one of only two programs I'm willing to run VMware on my Mac to use.
Munin is quite simple yet effective for getting trends of how the database is evolving and performing over time. With the standard kit of Munin you can, among other things, monitor the size of the database, the number of locks, the number of connections, sequential scans, the size of the transaction log, and long-running queries.
It's easy to set up and get started with, and if needed you can write your own plugins quite easily.
Check out the latest postgresql plugins that are shipped with Munin here:
http://munin-monitoring.org/browser/branches/1.4-stable/plugins/node.d/
Well, the first thing to do is try all your queries from psql using "explain" and see if there are sequential scans that can be converted to index scans by adding indexes or rewriting the query.
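For example, a minimal sketch (the users table and its email column are hypothetical):

EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';
-- "Seq Scan on users ..."  means every row is being read

CREATE INDEX users_email_idx ON users (email);

EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';
-- "Index Scan using users_email_idx on users ..."  means the new index is being used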
Other than that, I'm as interested in the answers to this question as you are.
Check out Lightning Admin. It has a GUI for capturing log statements; it's not perfect, but it works great for most needs. http://www.amsoftwaredesign.com
DBTuna http://www.dbtuna.com/postgresql_monitor.php has recently started supporting PostgreSQL monitoring. We use it extensively for MySQL monitoring, so if it provides the same for Postgres then it should be a good fit for you too.