Basic benchmark for OLTP databases [closed] - sql

I'm looking for benchmarks that test different RDBMSs running in the same environment, to use as a reference for a project. I'm not looking for any test in particular; I just want a source of comparison for a few RDBMSs, something like TechEmpower's benchmarks for development frameworks. Does anyone know where I can find this? It would be really helpful for me. Thanks in advance.

TPC-C, published by the TPC at http://www.tpc.org, is a benchmark that simulates a complete computing environment in which multiple users execute transactions against a database.
The benchmark is centered around the transactions of an order-entry environment.
These transactions include entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses.
TPC-C involves a mix of five concurrent transaction types of varying complexity, either executed on-line or queued for deferred execution. Its workload is characterized by:
- The simultaneous execution of multiple transaction types that span a breadth of complexity
- On-line and deferred transaction execution modes
- Significant disk input/output
- Transaction integrity (ACID properties)
- Non-uniform distribution of data access through primary and secondary keys
- Databases consisting of many tables with a wide variety of sizes, attributes, and relationships
- Contention on data access and update
It's used for selecting the best hardware/database combination in terms of price/performance.
You can find TPC-C results for many database engines running in different environments at http://www.tpc.org (the TPC defines transaction processing and database benchmarks and delivers trusted results to the industry).
You can sort and filter the results to concentrate on the environment you need.
Concentrate on two numbers in the results table (for a given database and hardware/OS):
tpmC: an absolute throughput number; the higher, the better the performance.
Price/tpmC: the cost per unit of performance (meaningful only as a relative measure).
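A minimal worked example of how those two numbers relate, using made-up figures rather than real TPC results:

```python
# Hypothetical TPC-C style numbers, for illustration only (not real results).
results = {
    "System A": {"tpmC": 1_000_000, "total_price_usd": 500_000},
    "System B": {"tpmC": 1_500_000, "total_price_usd": 1_200_000},
}

for name, r in results.items():
    price_per_tpmc = r["total_price_usd"] / r["tpmC"]
    print(f'{name}: {r["tpmC"]:,} tpmC at ${price_per_tpmc:.2f}/tpmC')

# System A: 1,000,000 tpmC at $0.50/tpmC  (cheaper per transaction)
# System B: 1,500,000 tpmC at $0.80/tpmC  (faster in absolute terms)
```

So System B wins on raw throughput (tpmC), while System A wins on Price/tpmC; which matters more depends on your project.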

Related

Has any benchmarking been done on SAS data files (compared to a SQL database)? [closed]

I work for a company that's currently using a collection of SAS data files (sas7bdat) as their data warehouse. I'm trying to make the case that moving from SAS to a SQL database would result in large performance gains. Based on how long SAS is currently taking to perform queries I have a gut feeling that a data warehouse in, say, PostgreSQL running on the same hardware would be much faster.
The problem is that it's really difficult to compare performance apples-to-apples (e.g. on the same hardware). I would love to fire up a VM on my home server and run the same set of operations on SAS and compare to a SQL db, but I'm not willing to pay for SAS's expensive licensing.
Has anyone done benchmarking on how long it takes to perform a query on a SAS dataset as compared to a SQL table?
I have done that analysis before, as a consultant. I don't have the specifics in front of me, but the difference is enormous (SQL Server is something like 10-100x faster). Create the table with an index.
As a former SAS consultant (at SAS), I can say that we used to encourage clients to use an RDBMS over SAS datasets. The sas7bdat format is a proprietary, binary format designed a long time ago. It is nowhere near the speed or capability of an RDBMS.
Also, it is easy to convert from SAS datasets to a SQL table (see the sketch after this answer).
I am not sure how Postgres would perform, but I would imagine the numbers would be comparable to SQL Server (probably not as fast, but pretty close). I have used all of the major DBs, but my testing was on SQL Server.
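For the conversion itself, here is a minimal sketch using pandas; the file name and connection string are hypothetical placeholders, and it assumes pandas and SQLAlchemy are available:

```python
# Minimal sketch: convert a SAS dataset to a SQL table with pandas.
# The file path and connection string below are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_sas("warehouse_extract.sas7bdat")                 # read the SAS dataset
engine = create_engine("postgresql://user:pass@localhost/dw")  # target database

# Write it out as a SQL table; chunksize keeps memory bounded for large files.
df.to_sql("warehouse_extract", engine, if_exists="replace", index=False, chunksize=10_000)
```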

Searching for a free database of any company [closed]

I received a university project where I'm asked to build and analyze the database of a company. The company can be of any type, and it must have several tables; for example, an airline company that sells tickets (the tables would be: sales, customers, flights, airports, etc.).
I'm searching for a free and open database of such a company. Where can I find one?
Thanks a lot.
You're using one of those databases right now! Stack Overflow regularly publishes a data dump of their database, and Brent Ozar helpfully compiles it into a SQL Server database for people to practice query tuning and such. Here's a link to the most recent version I could find, but you can also search for something like "Stack Overflow Database" and I'm sure you'll be able to find other versions.
Additionally, if you want to run some queries of your own against the database without downloading the whole shebang and running SQL Server on your own machine, you can access a web-based service for querying the database directly at https://data.stackexchange.com/
Also note - if the goal of your project is to design a database, this might not be the way to go, since it's already done for you. But even if it doesn't give you something to design, it might still be helpful to study how it's set up to give you ideas for your own work.
You could fairly easily build a small database (items, orders) out of the chipotle dataset: https://github.com/TheUpshot/chipotle.
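A minimal sketch of what that could look like, assuming you have downloaded the repository's orders.tsv file locally (the column names below are taken from that dataset as I remember it, so treat them as assumptions):

```python
# Minimal sketch: split the chipotle order data into two related tables.
# Assumes orders.tsv from https://github.com/TheUpshot/chipotle is downloaded
# locally; the column names are assumptions based on that dataset.
import sqlite3
import pandas as pd

raw = pd.read_csv("orders.tsv", sep="\t")

# One row per order, one row per ordered item.
orders = raw[["order_id"]].drop_duplicates()
items = raw[["order_id", "quantity", "item_name", "choice_description", "item_price"]]

with sqlite3.connect("chipotle.db") as conn:
    orders.to_sql("orders", conn, if_exists="replace", index=False)
    items.to_sql("order_items", conn, if_exists="replace", index=False)
```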
In general, companies don't offer up their data to the public (there tends to be proprietary info in it). Luckily, you are more interested in the data model than the actual data. That said, the reality is that you want something simpler than a real company's DB. Real enterprise databases are unwieldy and complicated; think of all the tables they will have for things like sales tax rules for different localities.
I would start with something like what I mention above and expand it a little. Or just spend a few minutes thinking about the different things you would need to track for a business (like an airline) and build the data model from that; a small sketch follows below. You will get a much better experience and learn how things need to fit together.
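A minimal sketch of such a hand-built data model for the airline example, using SQLite; the table and column names are one possible design of my own, not taken from any real system:

```python
# Minimal sketch of a hand-designed airline schema in SQLite.
# Table and column names are illustrative assumptions, not a real company's model.
import sqlite3

schema = """
CREATE TABLE airports  (code TEXT PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
CREATE TABLE flights   (id INTEGER PRIMARY KEY,
                        origin TEXT REFERENCES airports(code),
                        destination TEXT REFERENCES airports(code),
                        departs_at TEXT);
CREATE TABLE sales     (id INTEGER PRIMARY KEY,
                        customer_id INTEGER REFERENCES customers(id),
                        flight_id INTEGER REFERENCES flights(id),
                        price REAL,
                        sold_at TEXT);
"""

with sqlite3.connect("airline.db") as conn:
    conn.executescript(schema)
```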

Testing pandas code [closed]

I have a script of about 50 lines that reads data from a database, loads it into a pandas DataFrame, and then performs numerous operations on the DataFrame.
I was wondering how people generally test this type of code? I'm not talking about tools like assert_frame_equal, but rather principles people follow.
For instance, should I create 50 separate tests to basically test each operation performed or should I try to break up the script in smaller parts?
If anybody knows of quality open source projects that I can use as inspiration, please let me know.
If you want to start writing Python unit tests, this question is recommended reading.
Since all 50 lines work together, you probably want a functional test.
Read up on the difference between unit, functional, acceptance, and integration testing.
If you know the SOLID principles of object-oriented design, refactoring the code accordingly will make it easier to test.
On how to design good tests, see "What are the properties of good unit tests?".
Specific to pandas: use less data so the tests run fast.
Make a small dummy copy for testing rather than using the original data, as sketched below.
And check mainly the key features you want to verify.
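For instance, a minimal sketch of such a dummy copy, written as a pytest fixture; the column names and edge-case values are made up to mirror whatever your real data looks like:

```python
# Minimal sketch: a small, hand-built dummy DataFrame used instead of real data.
# Column names and values are illustrative assumptions.
import pandas as pd
import pytest

@pytest.fixture
def small_df():
    return pd.DataFrame({
        "customer_id": [1, 1, 2, 3],
        "amount": [10.0, 0.0, -5.0, float("nan")],  # include edge cases on purpose
    })

def test_totals_per_customer(small_df):
    totals = small_df.groupby("customer_id")["amount"].sum()
    assert totals.loc[1] == 10.0
```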
I can suggest the following approach:
Split the script into a data-retrieval part and a data-processing part. It's better to test your data access/query code and your computations separately.
Prepare a fixed dataset to use for tests. It may be part of your production data or a special dataset that covers some boundary conditions (like NaNs, zeroes, negative values, etc.).
Write test cases that check the results of your computations. You may check values directly, or do some aggregations (COUNT, SUM) and compare them with expected values.
The number of checks depends on the data and the computations you do. For some cases it might be enough to check only the SUM() of all elements; for others, check every item.
I prefer to check only a few general conditions that would fail if something went wrong, rather than covering all possible cases.
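A minimal sketch of that kind of test; `add_order_totals` and its module are hypothetical stand-ins for whatever processing step your script performs:

```python
# Minimal sketch: test a processing step against a small fixed dataset.
# `add_order_totals` is a hypothetical function (assumed to add a `total`
# column equal to price * quantity) from the script under test.
import pandas as pd
from mypipeline import add_order_totals  # hypothetical module/function

def test_add_order_totals():
    fixed = pd.DataFrame({
        "order_id": [1, 1, 2],
        "price": [2.0, 3.0, 7.0],
        "quantity": [1, 2, 1],
    })
    result = add_order_totals(fixed)

    # Check an aggregate (SUM) rather than every cell.
    assert result["total"].sum() == 2.0 * 1 + 3.0 * 2 + 7.0 * 1
    # And one direct value check.
    assert result.loc[result["order_id"] == 2, "total"].iloc[0] == 7.0
```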

Scalable RDBMS alternative, NoSQL, NewSQL [closed]

I am looking for a scalable alternative to traditional DBMSs like PostgreSQL or MySQL.
In traditional databases I don't have the following features:
Auto sharding to ensure linear scalability.
Replication with automatic failover and recovery to ensure high availability.
No single point of failure.
MongoDB looks like a good candidate if I can sacrifice transactions.
Also, I've looked at several NewSQL databases. NewSQL seems suitable for my purposes: VoltDB, TiDB, CockroachDB. But I'm worried about whether they are production-ready.
Maybe there are extensions that allow running PostgreSQL or MySQL in clustered mode out of the box.
You should check out Vitess. It's used at YouTube and by a few other companies.
PS: I work on that project.
TiDB
Compatibility with MySQL
It supports the MySQL protocol, so you can run your existing MySQL scripts against TiDB without change (see the connection sketch below).
Use cases
It is used by many big-name companies such as Mobike, Uber, Pinterest, etc. At Mobike, the big data team uses TiDB as a replica for synchronizing data with the online DB, and queries consisting of analysis and data-gathering requests are then executed against that copy. Last but not least, Tencent's cloud computing platform recommends that customers use TiDB-based HTAP for both OLTP and OLAP.
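As a quick illustration of that MySQL-protocol compatibility, a minimal connection sketch using an ordinary MySQL driver (PyMySQL); the host, port 4000, and credentials are assumptions for a default local TiDB instance:

```python
# Minimal sketch: talk to TiDB through a standard MySQL driver (PyMySQL).
# Host/port/credentials are assumptions for a default local TiDB setup.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=4000, user="root", password="", database="test")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT VERSION()")   # the same SQL you would send to MySQL
        print(cur.fetchone())
finally:
    conn.close()
```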

Distributed log system [closed]

I need to store logs in a distributed file system.
Let's say that I have many types of logs. Each log type is recorded in its own file. But such a file can be huge, so it must be distributed across many nodes (with replication for data durability).
These files must support append/get operations.
Is there a distributed system that achieves my needs?
Thanks!
I would recommend Flume, a log pulling infrastructure from the folks at Cloudera:
http://github.com/cloudera/flume
You can also try out Scribe from Facebook:
http://github.com/facebook/scribe
Combine a NAS with a NoSQL database like MongoDB and you'll have something distributed, large, and fault-tolerant.
Of course, without more specific details like how much data, structure of the logs (or lack thereof), etc, it's really hard to recommend a real product.
For example, if by "huge" you really mean 2 TB or less, and the data is highly structured, then a regular SQL server in a two-machine clustered environment for failover will do just fine.
However, if by "huge" you mean exabyte level or more, and/or unstructured data, then several large (and very expensive) NAS devices are needed, on which you run a set of NoSQL databases clustered for failover and/or multi-master replication...
You can use Logstash to collect the logs and centralize them with an Elasticsearch cluster. The local logs could be rolling log files, so that they remain small.
Furthermore, you can use Graylog2 to analyze and view your logs.
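As a rough sketch of the centralization step, an application (or a shipper you write) could index a log event into Elasticsearch with the official Python client; the cluster URL and index name are assumptions, and in practice Logstash or a similar shipper would usually sit in between:

```python
# Minimal sketch: index a log event into an Elasticsearch cluster.
# Cluster URL and index name are assumptions; assumes the elasticsearch-py 8.x client.
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

event = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "level": "INFO",
    "service": "checkout",
    "message": "order 42 processed",
}
es.index(index="app-logs", document=event)   # appended to the distributed index
```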