How to remove shards in CrateDB?

I am new to crate.io and not very familiar with the term "shard", and I am trying to understand why running my local DB creates 4 different shards.
I need to reduce this to a single shard, because it causes problems when I try to export the data from Crate into JSON files (the export is split across the 4 shards!).

Most users run CrateDB on multiple servers. To distribute the records of a table between multiple servers, the table needs to be split; each piece of the table is called a shard.
To make sure the database does not lose records if a node fails, CrateDB by default creates one replica of each shard: a copy of the data that is located on a different server.
While the system does not have full copies of all shards, the cluster state is yellow / underreplicated.
CrateDB running on a single node will never be able to create a redundant copy (because there is only one server).
To change the number of replicas you can use the command ALTER TABLE my_table SET (number_of_replicas = ...).
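For example, on a single-node setup you can drop the replica requirement so the cluster state turns green; the shard count itself is fixed when a table is created, via the CLUSTERED INTO clause. A minimal sketch, assuming a table named my_table (the column names and types are hypothetical):

-- a single node can never host a replica, so turn replicas off
ALTER TABLE my_table SET (number_of_replicas = 0);

-- the number of shards can only be chosen at creation time, so to
-- get a single shard the table has to be (re)created like this:
CREATE TABLE my_table_single_shard (
    id integer,
    payload string
) CLUSTERED INTO 1 SHARDS
WITH (number_of_replicas = 0);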

Related

Database copy limit per database reached. The database X cannot have more than 10 concurrent database copies (Azure SQL)

In our application, we have a master database 'X'. For each new client, we will create a new database copy of master database 'X'.
I am using the following SQL command which will be executed against Azure SQL server.
CREATE DATABASE [NEW NAME] AS COPY OF [MASTER DB]
We are using a custom queue tier so that we can create more than one client at a time, in parallel.
I am facing issues in following scenario.
I am trying to create 70 clients. Once 25 clients have been created, I get the error below.
Database copy limit per database reached. The database 'BlankDBClient' cannot have more than 10 concurrent database copies
Can you please share your thoughts on this?
SQL Azure has logic to do various operations online/automatically for you (backups, upgrades, etc.). Each copy requires IO, so there are limits in place because the machine does not have infinite IOPS. (Those limits may change a bit over time as we work to improve the service, get newer hardware, etc.)
In terms of what options you have, you could:
Restore N databases from a database backup (which would still have IO limits but they may be higher for you depending on your reservation size)
Consider models to copy in parallel using a single source to hierarchically create what you need (copy 2 from one, then copy 2 from each of the ones you just copied, etc)
Stage out the copies over time based on the limits you get back from the system.
Try a larger reservation size for the source and target during the copy to get more IOPS and lower the time to perform the operations.
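As an illustration of the hierarchical-copy option, a hedged sketch (the database names are hypothetical, and in Azure SQL each CREATE DATABASE ... AS COPY OF must be the only statement in its batch):

-- tier 1: two copies taken directly from the master
CREATE DATABASE [Client01] AS COPY OF [BlankDBClient];
GO
CREATE DATABASE [Client02] AS COPY OF [BlankDBClient];
GO
-- tier 2: once the tier-1 copies finish, each becomes a source itself,
-- so no single database ever exceeds its concurrent-copy limit
CREATE DATABASE [Client03] AS COPY OF [Client01];
GO
CREATE DATABASE [Client04] AS COPY OF [Client02];
GO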
In addition to Connor's answer, you can consider having a dacpac or bacpac of the master database stored in Azure Storage, and once you have submitted 25 concurrent database copies you can start restoring the dacpac from Azure Storage.
You can also monitor how many database copies are showing COPYING in the state_desc column of the following queries. After sending the first batch of 25 copies, wait until those queries return fewer than 25 rows, then send more copies until reaching the 25 limit again. Keep doing this until the queue of required copies is finished.
SELECT
    [sys].[databases].[name],
    [sys].[databases].[state_desc],
    [sys].[dm_database_copies].[start_date],
    [sys].[dm_database_copies].[modify_date],
    [sys].[dm_database_copies].[percent_complete],
    [sys].[dm_database_copies].[error_code],
    [sys].[dm_database_copies].[error_desc],
    [sys].[dm_database_copies].[error_severity],
    [sys].[dm_database_copies].[error_state]
FROM [sys].[databases]
LEFT OUTER JOIN [sys].[dm_database_copies]
    ON [sys].[databases].[database_id] = [sys].[dm_database_copies].[database_id]
WHERE [sys].[databases].[state_desc] = 'COPYING'
SELECT state_desc, *
FROM sys.databases
WHERE [state_desc] = 'COPYING'
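If you want to automate the wait, here is a minimal T-SQL sketch of that polling loop (the one-minute delay and the 25-copy threshold are assumptions; tune them to your queue):

DECLARE @inflight int = 25;
WHILE @inflight >= 25
BEGIN
    WAITFOR DELAY '00:01:00';          -- poll once a minute
    SELECT @inflight = COUNT(*)
    FROM sys.databases
    WHERE state_desc = 'COPYING';
END;
-- fewer than 25 copies in flight: safe to submit the next
-- CREATE DATABASE ... AS COPY OF statement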

SSIS : Huge Data Transfer from Source (SQL Server) to Destination (SQL Server)

Requirement :
Transfer millions of records from source (SQL Server) to destination (SQL Server).
Structure of source tables is different from destination tables.
Refresh data once per week in destination server.
Minimum amount of time for the processing.
I am looking for an optimized approach using SSIS.
I was thinking of these options:
Create a SQL dump from the source server and import that dump on the destination server.
Directly copy the tables from the source server to the destination server.
There are lots of issues to consider here, such as whether the servers are in the same domain, on the same network, etc.
Most of the time you will not want to move the data as a single large chunk of millions of records, but in smaller batches. An SSIS package handles that batching logic for you, and you can always recreate it, which makes iterating on changes easier. Sometimes this is a reason to push changes more often rather than wait an entire week, as smaller syncs are easier to manage with less downtime.
Another consideration is to be sure you understand your deltas and to ensure that you have ALL of the changes. For this reason I would generally suggest using a staging table on the destination server. By moving changes to staging and then loading into the final table, you can more easily ensure that changes are applied correctly. Think of scenarios such as an increment being out of order (identity insert), datetimes ordered incorrectly, or one chunk failing. When using a staging table you don't have to rely solely on the id/date, and can actually join on primary keys to look for changes, as sketched below.
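A hedged sketch of that staging-to-final load (the table, schema, and column names are all hypothetical):

MERGE dbo.FinalTable AS tgt
USING staging.FinalTable AS src
    ON tgt.Id = src.Id                      -- join on the primary key, not id/date
WHEN MATCHED AND src.ModifiedDate > tgt.ModifiedDate THEN
    UPDATE SET tgt.Col1 = src.Col1,
               tgt.ModifiedDate = src.ModifiedDate
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Col1, ModifiedDate)
    VALUES (src.Id, src.Col1, src.ModifiedDate);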
Linked Servers, proposed by Alex K., can be a great fit, but you will need to pay close attention to a couple of things. Always run the transfer from the destination server so that it is a PULL, not a push. Linked servers are fast at querying data but horrible at updating/inserting in bulk. An XML column cannot be in the queried table at all. You may also need to set some specific properties for distributed transactions.
I have done this task both ways, and I would say that SSIS gives a bit of an advantage over a linked server because of its robust error handling, threading logic, and ability to use different adapters (OLEDB, ODBC, etc.; they have different performance characteristics, and a quick search will turn up comparisons). But the key to your #4 (minimum processing time) is to do it in smaller chunks from a staging table, and if you can do it more often, each run is less likely to have an impact. E.g. a daily run would already be roughly 1/7th the size of a weekly one, assuming an even daily distribution of changes.
Take 10,000,000 records changed per week:
Once weekly = 10 million records
Once daily = ~1.4 million records
Once hourly = ~59K records
Once every 5 minutes = fewer than 5K records
And if it has to be once a week, still think about doing it in small chunks so that each insert has a more minimal effect on your transaction logs, actual lock time on the production table, etc. Be sure that you never allow loading of partially staged/transferred data, otherwise identifying deltas could get messed up and you could end up missing changes.
One other thought, if this is a scenario like a reporting instance and you have enough server resources: you could bring your entire table over from production into a staging table (or update a copy of the table at the destination), then simply drop the current table and rename the staging table. This is an extreme scenario and not one I generally like, but it is possible, and the actual impact on users would be very nominal.
I think SSIS is good at transferring data; my approach here:
1. Create a package with one Data Flow Task to transfer the data. If the structure of the two tables is different, that's okay; just map the columns.
2. Create a SQL Server Agent job to run your package every weekend.
Also, the Track Data Changes (SQL Server) feature is worth a look. You can configure when you want to sync data, and it performs well too.
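For reference, a minimal sketch of enabling Change Tracking, one of the Track Data Changes options (the database and table names are hypothetical):

ALTER DATABASE SourceDb
SET CHANGE_TRACKING = ON
    (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.SourceTable
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = OFF);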
With SQL Server versions newer than 2005, it has been my experience that a dump to a file followed by an import is equal to or slower than transferring the data directly from table to table with SSIS.
That said, and in addition to the excellent points Matt makes, this is the usual pattern I follow for this sort of transfer.
Create a set of tables in your destination database that have the same table schemas as the tables in your source system.
I typically put these into their own database schema so their purpose is clear.
I also typically use the "New" button in the SSIS OLE DB Destination editor to create the tables.
Mind the square brackets on [Schema].[TableName] when editing the CREATE TABLE statement it provides.
Use SSIS Data Flow tasks to pull the data from the source to the replica tables in the destination.
This can be one package or many, depending on how many tables you're pulling over.
Create stored procedures in your destination database to transform the data into the shape it needs to be in the final tables.
Using SSIS data transformations is, almost without exception, less efficient than using server side SQL processing.
Use SSIS Execute SQL tasks to call the stored procedures.
Use parallel processing via Sequence Containers where possible to save time.
This can be one package or many, depending on how many tables you're transforming.
(Optional) If the transformations are complex, requiring intermediate data sets, you may want to create a separate Staging database schema for this step.
You will have to decide whether you want the stored procedures to land the data in your ultimate destination tables, or to have the procedures write to intermediate tables and then move the transformed data into the final tables (a sketch of this follows below). Using intermediate tables minimizes downtime on the final tables, but if your transformations are simple or very fast, this may not be an issue for you.
If you use intermediate tables, you will need a package or packages to manage the final data load into the destination tables.
Depending on the number of packages all of this takes, you may want to create a Master SSIS package that will call the extraction package(s), then the transformation package(s), and then, if you use intermediate processing tables, the final load package(s).
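As a minimal sketch of the transform-and-load steps (the replica/staging/dbo schemas, table names, and columns are all hypothetical):

-- reshape the source-shaped replica into the destination shape
CREATE PROCEDURE staging.TransformCustomers
AS
BEGIN
    SET NOCOUNT ON;
    TRUNCATE TABLE staging.Customers;
    INSERT INTO staging.Customers (CustomerId, FullName, LoadDate)
    SELECT c.Id,
           c.FirstName + ' ' + c.LastName,
           GETDATE()
    FROM replica.Customers AS c;
END;
GO
-- final load: a short transaction to swap the data into the destination,
-- minimizing downtime on the final table
CREATE PROCEDURE dbo.LoadCustomers
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;
        TRUNCATE TABLE dbo.Customers;
        INSERT INTO dbo.Customers (CustomerId, FullName, LoadDate)
        SELECT CustomerId, FullName, LoadDate
        FROM staging.Customers;
    COMMIT;
END;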

How to add or route PostgreSQL Data to New Hard Drive

I'm using Windows Server 2008 R2 Standard.
I'm running PostgreSQL 9.0.1, compiled by Visual C++ build 1500, 32-bit.
I have a C:/ drive and a D:/ drive:
C:/ --> 6.7GB free space (almost full, and my server performance is suffering)
D:/ --> 141GB free space
Currently my PostgreSQL data is stored on C:/. Now I want to route or add a path to D:/ without migrating the data from C:/ to D:/, because my PostgreSQL data is already around 148 GB.
If this succeeds, I should still be able to run a query like SELECT * FROM table_bla_bla and have it return results from both drives?
Please do not suggest that I change PostgreSQL to another DB or anything of the sort.
I am running 39,763 GPS meter devices that send data to my server.
I have to take care of this server because my expert has already passed away.
You need to use tablespaces.
Create the tablespace, for example CREATE TABLESPACE second_drive LOCATION 'D:/postgresdata/' (see this other answer if you get permission denied errors)
Move a table onto it with ALTER TABLE table_bla_bla SET TABLESPACE second_drive
Tablespaces allow you to decide which tables go on which drives, which can help performance by letting you control where reads and writes go, and it also helps with space.
Postgres places individual tables in TABLESPACEs (each of which lives on a single disk), which is enough if you have multiple tables: you can achieve what you need by moving some tables to the other disk.
On the other hand, if you have a large table that you need to split over multiple disks, you need to use Postgres's Horizontal Partitioning capability.
This builds on tablespaces by allowing you to create a master table table_bla_bla which is actually just a facade on top of two or more tables that actually hold the data. Those data tables can then be put on different tablespaces, effectively splitting your data over the disks.
For this you would:
Rename your current table_bla_bla to something like table_bla_bla_c
Create a new table_bla_bla master table.
Alter table_bla_bla_c to mark that it inherits from table_bla_bla
Create a new table_bla_bla_d table that inherits from table_bla_bla and specify the tablespace as the D drive.
Apply partitioning triggers and check constraints as per the partitioning documentation (see the sketch below).
Once this is in place, you can arrange it so that any inserts into table_bla_bla cause new records to be created on the D drive. Selects on table_bla_bla will read from both disks.
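A minimal sketch of those steps for PostgreSQL 9.0's inheritance-based partitioning (the table and tablespace names are the hypothetical ones used above):

-- 1. the existing data (on C:) keeps its table, under a new name
ALTER TABLE table_bla_bla RENAME TO table_bla_bla_c;

-- 2. new, empty master table with the same columns
CREATE TABLE table_bla_bla (LIKE table_bla_bla_c);

-- 3. the old table becomes a child of the master
ALTER TABLE table_bla_bla_c INHERIT table_bla_bla;

-- 4. new child on the D: drive tablespace
CREATE TABLE table_bla_bla_d () INHERITS (table_bla_bla) TABLESPACE second_drive;

-- 5. route inserts on the master to the D: child
CREATE FUNCTION table_bla_bla_insert() RETURNS trigger AS $$
BEGIN
    INSERT INTO table_bla_bla_d VALUES (NEW.*);
    RETURN NULL;  -- the row was redirected, so don't insert into the master
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER table_bla_bla_insert_trg
    BEFORE INSERT ON table_bla_bla
    FOR EACH ROW EXECUTE PROCEDURE table_bla_bla_insert();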

Multiple local temp tables with the same name

I am writing a set of stored procedures that aggregate data from large datasets.
The main stored procedure makes a call to another server(s) where the data is located. The data is calculated in steps and stored in multiple temp tables (currently global temp tables) and then pulled over to the server I'm sitting on (this is done because of the way the linked servers are set up).
Right now I'm trying to write dynamic SQL to create temp tables with a unique identifier, because multiple people may run the stored procedures at the same time. However, because of the number of sub-steps in this process it's getting complex, so I'm wondering if I'm overthinking it.
My question is: if I simplify and just use local temp tables, will I run into problems because the tables will have the same name? NOTE: users may have the same login user names.
Temp table names are per-session. When you call SqlConnection.Open you get a new session. Normally, applications do not share sessions between HTTP requests; that is neither a common thing nor a good thing.
I don't believe you have a problem. If you get name clashes, then you should fix the application so that it does not share sessions in the first place.
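A quick illustration of the per-session scoping (the table name is hypothetical; run the same batch from two separate connections):

-- two sessions can each run this concurrently without clashing, because
-- each connection gets its own #work object under the covers
CREATE TABLE #work (id int PRIMARY KEY, total money);
INSERT INTO #work (id, total) VALUES (1, 9.99);
SELECT * FROM #work;   -- only ever sees this session's rows
DROP TABLE #work;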

SSIS data import with resume

I need to push a large SQL table from my local instance to SQL Azure. The transfer is a simple, 'clean' upload - simply push the data into a new, empty table.
The table is extremely large (~100 million rows) and consists only of GUIDs and other simple types (no timestamps or anything).
I created an SSIS package using the Data Import/Export Wizard in SSMS. The package works great.
The problem is when the package is run over a slow or intermittent connection. If the internet connection goes down halfway through, then there is no way to 'resume' the transfer.
What is the best approach to engineering an SSIS package to upload this data in a resumable fashion? I.e. in case of connection failure, or to allow the job to run only during specific time windows.
Normally, in a situation like this, I'd design the package to enumerate through batches of size N (1K rows, 10M rows, whatever) and log the last successfully transmitted batch to a processing table. However, with GUIDs you can't quite partition them out into buckets.
In this particular case, I would modify your data flow to look like Source -> Lookup -> Destination. In your lookup transformation, query the Azure side and only retrieve the keys (SELECT myGuid FROM myTable). Here, we're only going to be interested in rows that don't have a match in the lookup recordset as those are the ones pending transmission.
A full cache is going to cost about 1.5GB (100M * 16 bytes) of memory, assuming the Azure side is fully populated, plus the associated data transfer costs. That cost will still be less than truncating and re-transferring all the data, but I just want to make sure I called it out.
Just order by your GUID when uploading, and make sure you use the MAX(guid) from Azure as your starting point when recovering from a failure or restart.
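A hedged sketch of that restart logic (the table and column names are hypothetical):

-- on the Azure side: find the highest GUID already transferred
SELECT MAX(myGuid) AS LastGuid FROM dbo.myTable;

-- on the source side: only send the rows that are still missing,
-- feeding LastGuid in as the parameter (? is the SSIS OLE DB marker)
SELECT *
FROM dbo.myTable
WHERE myGuid > ?
ORDER BY myGuid;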