How to improve the performance of SAS Enterprise Guide 4 - sql

I have a table with almost 200,000,000 records, and it takes a long, long time to query.
Any ideas for improving the performance?

Consider adding indexes on the ID-type columns in that table.
BTW, this has nothing to do with SAS EG performance, and everything to do with the underlying Base SAS engine.
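For example, an index can be created with PROC DATASETS; a minimal sketch, where the library mylib, the dataset bigtable, and the column cust_id are all hypothetical names:

/* Hypothetical library, dataset, and column names */
proc datasets library=mylib nolist;
    modify bigtable;
    index create cust_id;   /* simple index on the column you filter/join on */
quit;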

In addition to indexing, you should also consider:
Compression. When you build the dataset, make sure you use the compress=yes option if you're not already. This shrinks the size of the table on disk, resulting in less disk I/O (the slowest part of querying). See the sketch after this list.
Check column lengths - make sure you're not using a field length of $255 to store something that only needs a length of $20, etc.
Use the SAS SPDE (Scalable Performance Data Engine). It allows you to partition your SAS datasets into multiple files and optionally spread them across different disks. Once your SAS datasets reach a certain size, you can see performance improvements; I generally tend to use SPDE libnames any time a dataset grows beyond 10 GB. No additional SAS modules are required - this is enabled as part of Base SAS.
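A minimal sketch of compression plus an SPDE libname together; the paths and dataset names are hypothetical, and the PARTSIZE= value is just an illustration:

/* Hypothetical metadata and data paths */
libname spdelib spde '/data/meta'
        datapath=('/disk1/data' '/disk2/data')  /* spread partitions across disks */
        partsize=512m;                          /* size of each partition file */

data spdelib.bigtable(compress=yes);  /* compressed copy in the SPDE library */
    set work.bigtable;
run;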

Related

Hive performance improvement

I want to join a table with 1 TB of data to another table that also has 1 TB of data in Hive. Could you please suggest some best practices to follow?
I want to know how performance improves in Hive if both tables are partitioned - basically, how MapReduce works in this case.
Below are a few performance-improvement rules to follow when dealing with large data:
Tez Execution Engine in Hive, or Hive on Spark
Use the Tez execution engine (Hortonworks) to increase the performance of your Hive queries. Tez is an application framework built on Hadoop YARN that executes complex directed acyclic graphs of general data-processing tasks; you can consider it a much more flexible and powerful successor to the MapReduce framework.
Or use Hive on Spark (Cloudera).
In addition, Tez offers developers an API framework for writing native YARN applications that bridge the spectrum of interactive and batch workloads, allowing data-access applications to work with petabytes of data over thousands of nodes.
SET hive.execution.engine=tez;
SET hive.execution.engine=spark;
Use a Suitable File Format in Hive
Using an appropriate file format for your data will drastically increase query performance. For most workloads, the ORC (Optimized Row Columnar) file format is the best choice, because it stores data in a more optimized way than the other file formats.
More specifically, ORC can reduce the size of the original data by up to 75%, which also increases data-processing speed. Compared to the Text, Sequence, and RC file formats, ORC shows better performance: it stores row data in groups called stripes, along with a file footer, and this layout improves performance while Hive processes the data.
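For example, a table can be stored as ORC at creation time; a minimal sketch with hypothetical table and column names, and an optional compression property:

-- Hypothetical table and column names
CREATE TABLE sales_orc (
  id        BIGINT,
  amount    DOUBLE,
  sale_date STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');  -- optional codec choice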
Hive Partitioning
Without partitioning, Hive reads all the data in the table's directory and then applies the query filters to it. Because all the data has to be read, this is slow as well as expensive.
Users frequently need to filter the data on specific column values, so to apply partitioning well, they need to understand the domain of the data they are analyzing.
With partitioning, the entries for the different values of the partition column are segregated and stored in their respective partitions. When we write a query that fetches values from the table, only the required partitions are read, which reduces the time the query takes to yield its result.
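A minimal sketch, again with hypothetical names: the table is partitioned by date, so a filter on the partition column reads only the matching partitions:

-- Hypothetical table, partitioned by date
CREATE TABLE sales_part (
  id     BIGINT,
  amount DOUBLE
)
PARTITIONED BY (sale_date STRING)
STORED AS ORC;

-- Reads only the partition for that date, not the whole table
SELECT SUM(amount) FROM sales_part WHERE sale_date = '2024-01-15';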
Bucketing in Hive
Suppose there is a huge dataset, and even after partitioning on a particular field or fields, the partition sizes don't match expectations and remain huge, yet we still want to manage the partition results in smaller parts. To solve this, Hive offers the bucketing concept, which allows the user to divide table data sets into more manageable parts.
With bucketing we can keep those parts manageable, and the user can also set the number of buckets, and hence their size.
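A minimal sketch with hypothetical names: rows are hashed on user_id into a fixed number of buckets:

-- Hypothetical table, hashed into 32 buckets on user_id
CREATE TABLE events_bucketed (
  user_id BIGINT,
  event   STRING
)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;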
Vectorization in Hive
Vectorized query execution improves the performance of operations such as scans, aggregations, filters, and joins by performing them in batches of 1,024 rows at a time instead of a single row at a time.
This feature was introduced in Hive 0.13. It significantly improves query execution time, and it is easily enabled with two parameter settings:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Cost-Based Optimization in Hive (CBO)
Hive optimizes each query's logical and physical execution plan before submitting it for final execution; until recently, however, these optimizations were not based on the cost of the query.
CBO, a recent addition to Hive, performs further optimizations based on query cost, resulting in potentially different decisions: how to order joins, which type of join to perform, the degree of parallelism, and others.
To use CBO, set the following parameters at the beginning of your query:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
Then, prepare the data for CBO by running Hive's ANALYZE command to collect various statistics on the tables for which we want to use CBO; for example:
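A minimal sketch, reusing the hypothetical partitioned table from above:

-- Collect table/partition statistics, then column statistics
ANALYZE TABLE sales_part PARTITION (sale_date) COMPUTE STATISTICS;
ANALYZE TABLE sales_part PARTITION (sale_date) COMPUTE STATISTICS FOR COLUMNS;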
Hive Indexing
Indexing is one of the best ways to increase query performance. Creating an index builds a separate table, called an index table, which acts as a reference into the original table.
As we know, a Hive table can have a large number of rows and columns. Without indexing, queries that only touch some columns can take a long time, because they are executed against all the data in the table.
The major advantage of indexing is that a query on an indexed table does not need to scan all the rows: it checks the index first and then goes to the particular column and performs the operation.
Maintaining indexes therefore makes it easy for a Hive query to look at the indexes first and then perform the needed operations in less time. And time, eventually, is the factor everyone focuses on.
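A minimal sketch with hypothetical names (note that this applies to older Hive releases; the indexing feature was removed in Hive 3.0):

-- Hypothetical index on the partitioned table above (Hive 2.x and earlier)
CREATE INDEX idx_sales_date
ON TABLE sales_part (sale_date)
AS 'COMPACT'
WITH DEFERRED REBUILD;

ALTER INDEX idx_sales_date ON sales_part REBUILD;  -- populate the index table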

Many small data table I/O for pandas?

I have many tables (about 200K of them), each small (typically fewer than 1K rows and 10 columns), that I need to read as fast as possible in pandas. The use case is fairly typical: a function loads these tables one at a time, computes something on them, and stores the final result (without keeping the contents of the tables in memory).
This is done many times over, and I can choose the storage format for these tables for the best (speed) performance.
What natively supported storage format would be the quickest?
IMO there are a few options in this case:
use an HDF store (a.k.a. PyTables, H5), as @jezrael has already suggested. You can decide whether you want to group some or all of your tables and store them in the same .h5 file using different identifiers (or keys, in pandas terminology); see the sketch after this list
use the new and extremely fast Feather format (part of the Apache Arrow project). NOTE: it's still a fairly new format, so it might change in the future, which could lead to incompatibilities between different versions of the feather-format module. You also can't put multiple DataFrames in one Feather file, so you can't group them
use a database for storing/reading tables; PS, it might be slower for your use case
PS you may also want to check this comparison, especially if you want to store your data in a compressed format
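A minimal sketch of the HDF option; the table names and the per-table computation are hypothetical stand-ins, and pandas' HDFStore requires the PyTables package:

# A minimal sketch; names and the computation are hypothetical stand-ins
import numpy as np
import pandas as pd

# Build a few small example DataFrames (stand-ins for the ~200K real tables)
dataframes = {f"t{i}": pd.DataFrame(np.random.rand(1000, 10)) for i in range(3)}

# Write them all into one HDF5 store under different keys
with pd.HDFStore("tables.h5", mode="w") as store:
    for name, df in dataframes.items():
        store.put(name, df, format="fixed")  # "fixed" is fastest for whole-table reads

# Read them back one at a time and compute on each
with pd.HDFStore("tables.h5", mode="r") as store:
    for key in store.keys():
        df = store[key]
        result = df.to_numpy().sum()  # stand-in for the real computation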

SQL Server Time Series Modelling Huge datacollection

I have to implement data collection for replay of electrical parameters for 100s to 1000s of devices, with at least 20 parameters to monitor per device. This amounts to a huge data collection, as it will be very similar to a time series. I have to support a resolution of 1 second; over 1 year, that is 365*24*60*60 seconds * 1000 devices = 31,536,000,000 rows.
I did my research but still have a few questions:
As the data will be huge, is it good to keep the data in one table, or should the tables be split (the data structure is the same)? Or should I rely on indexes?
Data inserts will also be very frequent, but I can batch them; what is the best way? Is it writing directly to the same database, or using a temporary database for writes and syncing with it?
Does SQL Server have a specific schema recommendation for time-series optimization for selects, updates, and inserts? Any out-of-the-box help for day averages, or specific common aggregate functions? I can write my own, but since this is a standard problem, there might be best practices and samples available out of the box.
Please let me know; any help is appreciated. Thanks in advance.
You probably want to explore the use of partitions. This allows very efficient inserts (switching a partition in is a metadata operation if you do the partitioning correctly) and very fast reads. You may also want to explore columnstore indexes, because the data (once collected) will never change and you will have very large data sets. Partitioning and columnstore indexes involve a learning curve, but they are very doable. There is a lot of code on the internet describing the use of date functions in SQL Server.
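A minimal sketch of that combination, with hypothetical names and partition boundaries; a clustered columnstore index requires SQL Server 2014 or later:

-- Hypothetical monthly partitioning of a readings table
CREATE PARTITION FUNCTION pf_ReadingMonth (datetime2(0))
    AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');

CREATE PARTITION SCHEME ps_ReadingMonth
    AS PARTITION pf_ReadingMonth ALL TO ([PRIMARY]);

CREATE TABLE dbo.Readings (
    ReadingTime datetime2(0) NOT NULL,  -- 1-second resolution
    DeviceId    int          NOT NULL,
    ParamId     smallint     NOT NULL,
    Value       float        NOT NULL
) ON ps_ReadingMonth (ReadingTime);

-- Columnstore compresses well because collected data never changes
CREATE CLUSTERED COLUMNSTORE INDEX ccx_Readings
    ON dbo.Readings
    ON ps_ReadingMonth (ReadingTime);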
That is a big number, but I would start with one table and see if it holds up. If you split it into multiple tables, it is still the same amount of data.
Do you ever need to search across devices? If not, you can have a separate table for each device.
I have some audit tables that are not that big, but still big, and I have not had any problems. If the data is loaded in time order, then make the date the first (or only) column of the clustered index.
If the PK is (date, device), that's fine, but if you can get two readings in the same second, you cannot use it. If that is the PK, then load the data in that sort order if you can, even if you have to stage each second and then load. You simply cannot afford to fragment a table that big. If you cannot load in sorted order, then use a fill factor of 50%.
If you cannot have a PK, then just use the date as the clustered index (but not as a PK) and put a nonclustered index on the device column.
I have some tables of 3,000,000,000 rows, and I have the luxury of loading by PK with no other indexes. There is no measurable degradation in insert speed from row 1 to row 3,000,000,000.
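If you prefer a rowstore design instead of the columnstore sketch above, a minimal sketch of that clustered-index advice, again with hypothetical names (the fill factor is only needed when you cannot load in sort order):

-- Rowstore alternative: clustered index leading on the date column
CREATE CLUSTERED INDEX cx_Readings
    ON dbo.Readings (ReadingTime, DeviceId)
    WITH (FILLFACTOR = 50);  -- leave room if loads are not in sort order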

How long should a query that returns 5 million records take?

I realise the answer should probably be 'as little time as possible', but I'm trying to learn how to optimise databases and I have no idea what an acceptable time is for my hardware.
For a start I'm using my local machine with a copy of SQL Server 2008 Express. I have a dual-core processor, 2 GB of RAM and a 64-bit OS (if that makes a difference). I'm only using a simple table with about 6 varchar fields.
At first I queried the data without any indexing. This took a ridiculously long time, so I cancelled it and added a clustered index (using the PK) to the table. This cut the time down to 1 minute 14 sec. I have no idea if this is the best I can get or whether I'm still able to cut this down even further.
Am I limited by my hardware or is there anything else I can do to my table/database/queries to get results faster?
FYI I'm only using a standard SELECT * FROM <Table> to retrieve my results.
EDIT: Just to clarify, I'm only doing this for testing purposes. I don't NEED to pull out all the data, I'm just using that as a consistent test to see if I can cut down the query times.
I suppose what I'm asking is: Is there anything I can do to speed up the performance of my queries other than a) upgrading hardware and b) adding indexes (assuming the schema is already good)?
I think you are asking the wrong question.
First of all, why do you need so many records at one time on the local machine? What do you want to do with them? I'm asking because I think you want to transfer this data somewhere, so you should be measuring how long it takes to transfer the data.
Some advice:
Your application should not select 5 million records at a time. Try to split your query and get the data in smaller sets.
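A minimal sketch of one way to do that (keyset paging), assuming a hypothetical table dbo.Articles with an integer primary key Id:

-- Fetch one batch; repeat with @lastId set to the last Id returned
DECLARE @lastId int = 0;

SELECT TOP (10000) Id, Title
FROM dbo.Articles
WHERE Id > @lastId
ORDER BY Id;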
UPDATE:
Because you are doing this for testing, I suggest that you:
Remove * from your query - it takes SQL Server some time to resolve the full column list.
Put your data in temporary storage - try using a VIEW or a temporary table for this.
Use plan caching on your server to improve performance.
But even if you're just testing, I still don't understand why you would need such tests if your application will never use such a query. Testing just for the sake of testing is a bad use of time.
Look at the query execution plan. If your query is doing a table scan, it will obviously take a long time. The execution plan can help you decide what kind of indexing you need on the table. Creating table partitions can also sometimes help, in cases where the data is partitioned by a condition (usually date and time).
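While testing, it can also help to measure the I/O and timing behind the plan; a minimal sketch, where dbo.MyTable is a hypothetical stand-in for your table:

-- Show I/O and CPU/elapsed-time statistics for the query that follows
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT * FROM dbo.MyTable;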
I did 5.5 million in 20 seconds. That's taking over 100k schedules with different frequencies and forecasting them for the next 25 years. It was just max-scenario testing, but it proves the speed you can achieve in a scheduling system, as an example.
The best-optimised approach depends on the indexing strategy you choose. As many of the answers above say, partitioning the table can sometimes help. It is also not best practice to query billions of records in a single time frame; you will get much better results if you query partially, in iterations. You may check this link to clear up doubts about the minimum requirements: Minimum H/W and S/W Requirements for Sql server 2008
When fetching 5 million rows you are almost 100% going to spill to tempdb, so you should try to optimise tempdb by adding additional files. If you have multiple drives on separate disks, you should split the table data into different NDF files located on separate disks. Partitioning won't help when you are querying all the data on the disk.
You can also use a query hint to force parallelism, MAXDOP; this will increase CPU utilisation. Ensure that the columns contain as few NULLs as possible, and rebuild your indexes and stats.
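A minimal sketch of the MAXDOP hint, with a hypothetical table name (note that MAXDOP caps, rather than guarantees, the degree of parallelism):

-- Allow the optimizer to use up to 4 parallel threads for this query
SELECT *
FROM dbo.MyTable
OPTION (MAXDOP 4);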

indexed flat-files

We have a batch analytical SQL job, run once daily, that reads data from 2 source tables held in a powerful RDBMS. The source tables are huge (>100 TB) but have fewer than 10 fields combined.
The question I have is: can the 2 source tables be held in a compressed and indexed flat file, so the entire operation is much faster, saves on storage, and can be run on a low-spec server? Also, can we run SQL-like queries against these compressed and indexed flat files? Any pointers on how to go about doing this would be extremely helpful.
Most optimization strategies optimize either speed or size, and trade one off against the other. In general, RDBMS solutions optimize for speed, at the expense of size - for instance, by creating an index, you take up more space, and in return you get faster data access.
So your desire to optimize for both speed AND size is unlikely to be fulfilled - you almost certainly have to trade one against the other.
Secondly, if you want to execute "sql-like" queries, I'm pretty sure that an RDBMS is the best solution - especially with huge data sets.
It may be the case that the underlying data lends itself to a specific optimization - for instance, if you can create a custom indexing scheme that packs attributes into integer bitmasks, and then access the data using boolean operators on those integers, you may be able to beat the performance of an RDBMS index.
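A minimal sketch of the bitmask idea in SQL, where the table, column, and bit assignments are all hypothetical:

-- flags packs three attributes: bit 0 = active (1), bit 1 = premium (2),
-- bit 2 = eu_region (4); (flags & 5) = 5 selects rows with bits 0 AND 2 set
SELECT id
FROM customers
WHERE (flags & 5) = 5;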