I have S1 tier with about 600,000 files in the single container in the blob storage. I've restricted the index to include only (.doc,.docx,.xls,.xlsx,.ppt,.pptx,.pdf,.txt,.rtf,.htm,.html) and excluding (.png,.jpeg,.jpg,.gif,.psd,.mp3,.mp4,.wav,.exe,.zip,.dmg,.msi,.mkv,.flv,.ogg,.ogv,.avi,.mov,.wmv). I also tried to increase the partitions to the max allowed 12 with no considerable changes in the performance.
With the current indexing speed, I can estimate 30 days to finish the process.
I need this to be indexed quicker. How can I increase the speed of this?
Thanks.
You can speed up indexing by parallelizing it: split blobs in your container into several folders, and create several datasource / indexer pairs all writing into the same target search index. If your search service has N search units, N indexer can be running concurrently, resulting in a significant speed-up of indexing.
Related
I'm working on an application wherein I'll be loading data into Redshift.
I want to upload the files to S3 and use the COPY command to load the data into multiple tables.
For every such iteration, I need to load the data into around 20 tables.
I'm now creating 20 CSV files for loading data into 20 tables wherein for every iteration, the 20 created files will be loaded into 20 tables. And for next iteration, new 20 CSV files will be created and dumped into Redshift.
With the current system that I have, each CSV file may contain a maximum of 1000 rows which should be dumped into tables. Maximum of 20000 rows for every iteration for 20 tables.
I wanted to improve the performance even more. I've gone through https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html
At this point, I'm not sure how long it's gonna take for 1 file to load into 1 Redshift table. Is it really worthy to split every file into multiple files and load them parallelly?
Is there any source or calculator to give an approximate performance metrics of data loading into Redshift tables based on number of columns and rows so that I can decide whether to go ahead with splitting files even before moving to Redshift.
You should also read through the recommendations in the Load Data - Best Practices guide: https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
Regarding the number of files and loading data in parallel, the recommendations are:
Loading data from a single file forces Redshift to perform a
serialized load, which is much slower than a parallel load.
Load data files should be split so that the files are about equal size,
between 1 MB and 1 GB after compression. For optimum parallelism, the ideal size is between 1 MB and 125 MB after compression.
The number of files should be a multiple of the number of slices in your
cluster.
That last point is significant for achieving maximum throughput - if you have 8 nodes then you want n*8 files e.g. 16, 32, 64 ... this is so all nodes are doing maximum work in parallel.
That said, 20,000 rows is such a small amount of data in Redshift terms I'm not sure any further optimisations would make much significant difference to the speed of your process as it stands currently.
I have a large amount of data, but there is no particular column I would like to filter based on (that is, my 'where clause' can be any column). In this scenario, does partitioning provide any benefit (maybe helps with read-parallelism?) when the queries end up scanning all the data?
If there is no column all, or most, queries would filter on then partitions will only hurt performance. Instead aim for files around 100 MB, as few as possible, Parquet if possible, and put all files directly under the table's LOCATION.
The reason why partitions would hurt performance is that when Athena starts executing your query it will list all files, and the way it does it is as if S3 was a file system. It starts by listing the table's LOCATION, and if it finds anything that looks like a directory it will list it separately, and so on, recursively. If you have a deep directory structure this can end up taking a lot of time. You want to help Athena by having all your files in a flat structure, but also fewer than 1000 of them, because that's the page size for S3's list operation. With more than 1000 files you want to have directories so that Athena can parallelize the listing (but as few as possible still, because there's a limit to how many listings it will do in parallel).
You want to keep file sizes to around 100 MB because that's a good size that trades off how long it takes to process a file against the overhead of getting it from S3. The exact recommendation is 128 MB.
I need to spool over 20 million records in a flat file. A direct select query would be time utilizing. I feel the need to generate the output in parallel based on portions of the data - i.e having 10 select queries over 10% of the data each in parallel. Then sort and merge on UNIX.
I can utilize rownum to do this, however this would be tedious, static and needs to be updated every time my rownum changes.
Is there a better alternative available?
If the data in SQL is well spread out over multiple spindles and not all on one disk, and the IO and network channels are not saturated currently, splitting into separate streams may reduce your elapsed time. It may also introduce random access on one or more source hard drives which will cripple your throughput. Reading in anything other than cluster sequence will induce disk contention.
The optimal scenario here would be for your source table to be partitioned, that each partition is on separate storage (or very well striped), and each reader process is aligned with a partition boundary.
I have a table with almost 200,000,000 records. It takes a long long time to query.
Any idea about improving the performance?
Consider adding index on id-type columns in that table.
BTW, this has nothing to do with SAS EG performance, but everything to do with the underlying BASE SAS engine.
In addition to indexing you should also consider:
Compression. When you build the dataset make sure you use the compress=yes option if you're not already. This will shrink the size of the table on disk resulting in less disk I/O (the slowest part of querying).
Check column lengths - make sure you're not using a field length of $255 to store something that only needs a length of $20 etc...
Use the SAS SPDE (Scalable Performance Data Engine). It allows you to partition your SAS datasets into multiple files and optionally spread them across different disks. Once your SAS datasets reach a certain size you can see performance improvements. I generally tend to use SPD libnames any time a dataset grows > 10G. No additional SAS modules are requires - this is enabled as part of Base SAS.
our index is rising relatively fast, by adding 2000-3000 documents a day.
We are running an optimize every night.
The point is, that Solr needs double disc space while optimizing. Actually the index has an size of 44GB, which works on an 100GB partition - for the next few months.
The point is, that 50% of the disk space are unused for 90% of the day and only needed during optimize.
Next thing: we have to add more space on that partition periodical - which is always a painful discussion with the guys from the storage department (because we have more than one index...).
So the question is: is there a way to optimize an index without blocking additional 100% of the index size on disk?
I know, that multi-cores an distributed search is an option - but this is only an "fall back" solution, because for that we need to change the application basically.
Thank you!
There is continous merging going on under the hood in Lucene. Read up on the Merge Factor which can be set in the solrconfig.xml. If you tweak this setting you probably wont have to optimize at all.
You can try partial optimize by passing maxSegment parameter.
This will reduce the index to that specified number.
I suggest you do in batches (e.g if there are 50 segments first reduce to 30 then to 15 and so on).
Here's the url:
host:port/solr/CORE_NAME/update?optimize=true&maxSegments=(Enter the number of segments you want to reduce to. Ignore the parentheses)&waitFlush=false