Azure Site Recovery replicated VMs no longer bootable - Hyper-V

The last time we did a failover test in January, everything worked as expected. Today, when doing the same test, 5 of my 6 VMs will no longer boot.
2 of them simply display a black screen with a blinking cursor when I view them in Boot Diagnostics, and 3 of them display non-system disk or disk errors. I've attempted restarting, redeploying, and winding down and re-testing from an older recovery point (both app-consistent and crash-consistent ones). No success and no useful messages to work with.
It's a mix of OSes: 2012 R2 and 2016, all replicated from Windows Server Hyper-V 2016.
Again, this worked 3 months ago, there are no errors in replication health, and other than Windows updates, no changes to the servers or the applications on them.
Did a recent Windows update have a known issue with Azure replication or something? Anyone have any ideas?

TL;DR - rinse and repeat. Tried again the next week and it worked.
I came back a week later, repeated the process, and the same servers all came up this time... No changes. Presumably an update of some sort fixed it, as I had tried multiple restore points going back 12 hours before I posted my question...

Related

SQL Server response time increasing

I'm currently working on updating an ERP called Sage 1000 to a newer version. The main change that came with the upgrade is that the ERP solution and its database now sit on two different servers. Ever since, the response time has significantly increased.
The part related to SQL Server is the following: whenever the response time problem happens, I restart the SQL Server service and it starts working just fine, at least for a couple of days until I have to restart it again.
In Task Manager, before restarting the service, I can see that it consumes nearly 17 to 20 gigabytes of memory, which is an insane amount even for a server service; after the restart, the memory usage drops to a normal 2 to 3 gigabytes.
So my question is the following: in your experience, what could be the cause of such abnormal behavior? For now I have created a scheduled task to restart the service every day at 3:00 am, but that is a workaround rather than a proper solution.
Thanks in advance for your help,
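For what it's worth, high memory usage alone is not necessarily abnormal for SQL Server: by default the instance grows its buffer pool toward the configured 'max server memory' limit (which is effectively unlimited out of the box) and only releases memory under pressure. A minimal diagnostic sketch (standard DMVs and sp_configure, nothing specific to Sage 1000) to check how the instance is configured and what is actually holding the memory:

-- Show the configured memory cap (default 2147483647 MB = effectively unlimited).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)';

-- Memory currently used by the SQL Server process.
SELECT physical_memory_in_use_kb / 1024 AS physical_memory_in_use_mb
FROM sys.dm_os_process_memory;

-- Largest memory consumers inside the instance (the buffer pool shows up as MEMORYCLERK_SQLBUFFERPOOL).
SELECT TOP (5) [type], SUM(pages_kb) / 1024 AS used_mb
FROM sys.dm_os_memory_clerks
GROUP BY [type]
ORDER BY used_mb DESC;

If the memory turns out to be ordinary buffer pool usage, capping 'max server memory' is usually a better lever than a nightly service restart, though that is a separate question from the response time degradation itself.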

Transaction log filling the drive if mirroring fails

I have two SQL Server 2016 machines with database mirroring set up between them. This is supposed to be a hot-standby scheme.
The HDD size is 1 TB, and the database size is around 600 GB (just one DB). This is 90 days' worth of archived data; everything older than 90 days gets deleted every night (automatically, through the external application which is using/filling the database in the first place). So 600 GB is the peak DB size; it will not grow beyond that, as it is cleaned up regularly.
The problem is the transaction log if one server fails, or if mirroring gets suspended for any other reason. If I understand the principle correctly, the healthy server will retain the transaction log records as long as it doesn't get confirmation from its partner that they have been received. So if mirroring fails, the HDD will fill up within several hours.
Is there any suitable technique to prevent this? I take log backups every 15 minutes and everything works fine, but if mirroring gets suspended, the backups are not worth much, as the log will keep growing in spite of them. And the situation on site is a bit specific: there are no engineers, only operators who access this data once or twice per day, so it's impossible to react straight away. It can take more than 24 hours for someone to attend to the problem.
The only thing I could think of is some sort of trigger or scheduled job that would remove the mirroring completely once it has been suspended for some time (or maybe if it is suspended and HDD space is too low). This would prevent the healthy server from crashing completely, but someone would again have to come to the site and set the mirroring up from scratch. And due to bad design from the start, the DB size is bigger than half of the HDD size, so I can't even create a local backup/restore; I would have to do everything via the 100 Mbps NAS that belongs to the client, and that would take more time than it would take the transaction log to fill the drive again.
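A minimal sketch of that watchdog idea, intended to run from a SQL Server Agent job every few minutes. The database name YourDb and the 50 GB free-space threshold are placeholders/assumptions, not recommendations; the check only breaks the mirroring session when the partner is suspended or unreachable and the log drive is nearly full, so that the regular log backups can truncate the log again:

-- Placeholder names and thresholds; adjust before use.
DECLARE @state nvarchar(60), @free_gb decimal(10, 1);

-- Current mirroring session state for the database.
SELECT @state = mirroring_state_desc
FROM sys.database_mirroring
WHERE database_id = DB_ID(N'YourDb');

-- Free space remaining on the volume(s) holding the log file(s).
SELECT @free_gb = MIN(vs.available_bytes / 1024.0 / 1024 / 1024)
FROM sys.master_files AS mf
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) AS vs
WHERE mf.database_id = DB_ID(N'YourDb')
  AND mf.type_desc = N'LOG';

-- If the partner is gone and the drive is nearly full, drop mirroring so
-- the 15-minute log backups can truncate the log again.
IF @state IN (N'SUSPENDED', N'DISCONNECTED') AND @free_gb < 50
BEGIN
    ALTER DATABASE [YourDb] SET PARTNER OFF;
END

The obvious trade-off is the one already mentioned: once the session is broken, mirroring has to be re-established from a fresh backup, so this only protects the surviving server from running out of disk.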

Current session is no longer available due to structural changes in the database - Tabular

We are using a SQL Server Analysis Services Tabular model for self-service BI purposes. On a monthly basis we have some 90 distinct people using the model. Recently we encountered some issues/errors in the client tools (Excel and Power BI) that connect to the Tabular model. See screenshots. We did not make any significant changes to the model in the past period.
We noticed that the errors keep showing up after our incremental load, i.e. a full process of a number of partitions; we process these partitions every 15 minutes. The process is kicked off by an SSIS job which is scheduled every 15 minutes and processes 5 partitions in 3 tables.
Edit: After some research I figured out that the problem lies in the perspectives. Every time I do a full process on any object, the error appears. This does not happen on the default model view. Still haven't found a solution though.
The error occurs when you make a change to the Power BI report or the Excel file, for example when you do a refresh or when you click a filter. If you press refresh multiple times, the connection comes back and everything works as it is supposed to. It seems like the clients lose their connection to the model. After 15 minutes the problem occurs again.
This is very aggravating for the users, especially when they are in the middle of a presentation.
This is what we tried:
We tried searching Google for a solution
Checked that we have the latest SQL Server 2016 update (13.0.5149.0)
SSAS builds from Visual Studio (2015 and 2017)
No full process on tables, only on partitions
Upgrading the server from 4 to 8 CPU cores
I hope somebody can help us.
You shouldn't be getting the error that you are seeing with just a full process of a partition, or even of the full table. We do this every hour for a number of core tables and we do not see any issues like this (and we would).
I am starting from the hypothesis that either:
Your 15-minute process is doing more than just processing the partitions with a refresh command, or
Something else is happening in the environment (either scheduled or not). Who has permissions to change the schema? Could it be users or developers, deliberately or not, making changes?
The only things that should cause that kind of error are Alter, Delete or CreateOrReplace TMSL commands.
So unless that triggers your own ideas for a diagnostic process, I would take the following steps.
Note: I presume that your users also see this issue on your test environment when you run your 15-minute processing routine there. You should do the following on that test environment, where nothing else is running, to eliminate the possibility of someone else interfering with the experiment. If you don't have a representative test environment, then you will have to do this on live, but I would do it out of hours or under some kind of change-control process, with your 15-minute refresh turned off and admin permissions to the cube heavily locked down, to ensure that nothing can interfere with your experiment.
First, prove that you can reproduce the issue with the 15-minute routine:
Get your sample Power BI report that is known to present the error (I'd prefer Power BI for a repro, as it is slightly simpler than Excel)
Refresh your Power BI report and explore the data to prove that the error doesn't occur
Run your 15-minute process
You should now see the problem reported. If you do, great, you have a reproducible issue! If you don't, then it is not quite as you thought, and you need to find a way of reliably reproducing these errors (perhaps something else is happening that isn't the 15-minute process).
Now that you are sure how to reproduce the issue, you need to isolate whether it is really the processing that is causing the problem:
Refresh your Power BI report and explore the data to prove that the error doesn't occur
Execute (via SSMS) the XMLA/TMSL that does a full process of the entire database
It should look something like this:
{
  "refresh": {
    "type": "full",
    "objects": [
      {
        "database": "yourdbname"
      }
    ]
  }
}
Do the thing that your users do when they see the issue.
If you too see the issue, then I would raise it with Microsoft Support, as this shouldn't happen.
If you don't see the issue, then you can refine this processing to just the partitions for a single table. But as we have done a process of the entire database above, it shouldn't change the result.
If you still don't see the issue, then it isn't the processing that is causing this (which is what I suspect), and it is something else in the 15-minute routine that is causing it. Look deeper into that process and understand what else it is doing.
Alongside this, checking the logs should show whether there are any other processing tasks or types of XMLA commands happening.
I hope these ideas get you closer to finding the actual activity that is causing this experience for your users. It would be great if you could post back with how you got on and what you found.
I have the same problem here when I install the latest CU on my SQL Server 2017. My production environment is still running CU3 (Jan 2018) because of this problem.
Knowing that, I would suggest reverting your installation to a previous release, maybe 13.0.5026.0 (SP2) or even 13.0.4466.4 (Jan 2018).
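If it helps when comparing builds, the exact version and patch level of the database engine instance can be confirmed with standard SERVERPROPERTY calls (the SSAS service is patched by the same CU, but its build is easiest to check from SSMS Object Explorer); a quick sketch:

-- Confirm the build before/after applying or reverting a CU.
SELECT
    SERVERPROPERTY('ProductVersion')     AS product_version,      -- e.g. 13.0.5149.0
    SERVERPROPERTY('ProductLevel')       AS product_level,        -- e.g. SP2
    SERVERPROPERTY('ProductUpdateLevel') AS product_update_level, -- e.g. CU11
    SERVERPROPERTY('Edition')            AS edition;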
I am facing the same issue with SQL Server 2017 CU11 installed.
The issue indeed occurs in the case of a 'full refresh' in combination with the use of a 'perspective' in an existing connection. The workaround of using the default 'Model' view in the connection does indeed 'solve' the issue.

SQL Server 2008 replication without reinitializing

I have two databases on different servers: center_db on siglv01\sql2008 and center_db on sig\sql2008.
Can I restart replication without needing to reinitialize it? The connection dropped more than 3 days ago and is now too slow, so I want to restart replication without a reinitialize.
Based on the brief conversation above, I don't think you can do this without a re-init. Specifically, the distribution database only keeps so many commands before it starts trimming them; the default retention is 72 hours. If the last command delivered to all of your subscribers is older than that, the distribution database doesn't have what it needs to play forward all of the activity that has happened since then.
Your only hope would be if the distribution agent is still running (it knows when the above situation happens and will give you an error saying as much). If so, try to figure out why delivery is slow (troubleshoot this like any other "slow application"; replication isn't magic) and see if it can get caught up that way. Depending on how many commands remain undelivered, it may be faster to just re-init.
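If you want to see where things stand before deciding, here is a sketch of the checks, run on the distributor. The server and database names are taken from the question; the publication name and the distribution database name are placeholders/defaults:

-- Run in the distribution database on the distributor.
USE distribution;

-- How many commands are still waiting to be delivered to the subscriber,
-- plus a rough estimate of how long delivery would take.
EXEC sp_replmonitorsubscriptionpendingcmds
    @publisher         = N'siglv01\sql2008',
    @publisher_db      = N'center_db',
    @publication       = N'YourPublication',
    @subscriber        = N'sig\sql2008',
    @subscriber_db     = N'center_db',
    @subscription_type = 0;   -- 0 = push, 1 = pull

-- The distribution retention settings (max_distretention is in hours; 72 by default).
EXEC sp_helpdistributiondb;

If the pending command count is huge, or the retention window has already been exceeded, a re-init is almost certainly the faster path.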

Azure virtual machine disk errors

I've been using 3 identical VMs on Azure for a month or more without problem.
Today I couldn't Remote Desktop to one of them, and restarted it from the Azure Portal. That took a long time. It eventually came back up, and the Event log has numerous entries such as:
"The IO operation at logical block address 70 for Disk 0 ..... was retried"
"Windows cannot access the file C:\windows\Microsoft.Net\v4.0.30319\clrjit.dll for one of the following reasons, network, disk etc.
There are lots of errors like this. To me they seem symptomatic that the underlying disk system is having serious problems. Given the VHD is stored in a triple replicated Azure blob, I would have thought there was some immunity to this kind of thing?
Many hours later it's still doing the same thing. It works fine for a few hours, then slows to a crawl with the Event log containing lots of disk problems. I can upload screen shots of the event log if people are interested.
This is a pretty vanilla VM, I'm only using the one OS disk it came with.
The other two identical VMs in the same region are fine.
Just wondering if anybody has seen this before with Azure VMs and how to safeguard against it, or recover from it.
Thanks.
Thank you for providing all the details and we apologize for the inconvenience. We have investigated the failures and determined that they were caused by a platform issue. Your virtual machine’s disk does not have any problems and therefore you should be able to continue using it as is.