HyperV Gen2 VM not booting over PXE - hyper-v

I have two VMs in HyperV, both on the same internal virtual switch and on the same subnet. I am trying to set up one of them as a DHCP and TFTP server for PXE boot. With a Gen1 machine, it works fine with pxelinux. Unfortunately, a Gen2 machine with UEFI does not work.
DHCP & TFTP Server
IP 192.168.1.2
VLAN identification is disabled
DHCP - ISC DHCP Server running in a docker container with "host" network type with the following configuration:
set vendorclass = option vendor-class-identifier;
option pxe-system-type code 93 = unsigned integer 16;
set pxetype = option pxe-system-type;
authoritative;
default-lease-time 7200;
max-lease-time 7200;
option tftp-server-name "192.168.1.2";
option bootfile-name "efi/core.efi";
subnet 192.168.1.0 netmask 255.255.255.0 {
interface "eth0:0";
option routers 192.168.1.1;
option subnet-mask 255.255.255.0;
range 192.168.1.100 192.168.1.150;
option broadcast-address 192.168.1.255;
option domain-name-servers 8.8.8.8, 8.8.4.4;
option domain-name "ad.lholota.net";
option domain-search "ad.lholota.net";
if substring(vendorclass, 0, 9)="PXEClient" {
if pxetype=00:06 or pxetype=00:07 {
filename "efi/core.efi";
} else {
filename "pxelinux/pxelinux.0";
}
}
next-server 192.168.1.2;
}
TFTP - tftp-hpa running in a docker container on a "host" type network. I can download the efi files manually through a standard tftp client.
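The conditional around `filename` in the dhcpd configuration above can be restated as a short Python sketch. It only illustrates the selection logic and is not how ISC dhcpd evaluates it. The function name `select_boot_file` is made up for this example, and the BIOS vendor class string in the second call is an illustrative value, not taken from a capture.

```python
from typing import Optional

# Architecture codes (DHCP option 93) that the configuration treats as UEFI.
EFI_ARCH_CODES = {0x0006, 0x0007}


def select_boot_file(vendor_class: str, arch_code: int) -> Optional[str]:
    """Mirror the if/else around `filename` in the dhcpd.conf above."""
    # Equivalent of: substring(vendorclass, 0, 9) = "PXEClient"
    if vendor_class[:9] != "PXEClient":
        return None  # the config sets no `filename` for non-PXE clients
    if arch_code in EFI_ARCH_CODES:
        return "efi/core.efi"
    return "pxelinux/pxelinux.0"


# Values sent by the Gen2 VM according to the dhcpdump in this question.
efi_choice = select_boot_file("PXEClient:Arch:00007:UNDI:003000", 0x0007)
# Illustrative BIOS client reporting architecture 0 (assumed vendor class string).
bios_choice = select_boot_file("PXEClient:Arch:00000:UNDI:002001", 0x0000)
print(efi_choice)
print(bios_choice)
```

For the Gen2 client this selects efi/core.efi, which matches the FNAME field in the captured offer, so the file selection itself appears to behave as configured.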
Booting machine
HyperV Gen2
No virtual HDD or DVD
Firmware tab has only one item in the boot sequence - network
Secure boot is disabled
VLAN identification is disabled
Network adapter pointing into the same internal switch as the first VM
Enable virtual machine queue - checked
Enable IPsec task offloading - checked, maximum number: 512
MAC Address dynamic
Enable DHCP guard - NOT checked
Enable router advertisement guard - NOT checked
Protected network - NOT checked
Mirroring mode - None
Enable device naming - NOT checked
The trouble is that the machine doesn't even get to the TFTP server because it doesn't finish the DHCP Discover-Offer-Request-Ack flow. It gets stuck on offer as shown in the dhcpdump below. The booting machine never sends the request message. Funny enough, BIOS based Gen1 HyperV machine boots without any issue so the DHCP flow works there.
Can you please give me a hint of what might be wrong?
TIME: 2018-07-11 19:49:37.641
IP: 0.0.0.0 (0:15:5d:0:50:d0) > 255.255.255.255 (ff:ff:ff:ff:ff:ff)
OP: 1 (BOOTPREQUEST)
HTYPE: 1 (Ethernet)
HLEN: 6
HOPS: 0
XID: 8bf1c250
SECS: 0
FLAGS: 7f80
CIADDR: 0.0.0.0
YIADDR: 0.0.0.0
SIADDR: 0.0.0.0
GIADDR: 0.0.0.0
CHADDR: 00:15:5d:00:50:d0:00:00:00:00:00:00:00:00:00:00
SNAME: .
FNAME: .
OPTION: 53 ( 1) DHCP message type 1 (DHCPDISCOVER)
OPTION: 57 ( 2) Maximum DHCP message size 1472
OPTION: 55 ( 35) Parameter Request List 1 (Subnet mask)
2 (Time offset)
3 (Routers)
4 (Time server)
5 (Name server)
6 (DNS server)
12 (Host name)
13 (Boot file size)
15 (Domainname)
17 (Root path)
18 (Extensions path)
22 (Maximum datagram reassembly size)
23 (Default IP TTL)
28 (Broadcast address)
40 (NIS domain)
41 (NIS servers)
42 (NTP servers)
43 (Vendor specific info)
50 (Request IP address)
51 (IP address leasetime)
54 (Server identifier)
58 (T1)
59 (T2)
60 (Vendor class identifier)
66 (TFTP server name)
67 (Bootfile name)
97 (UUID/GUID)
128 (???)
129 (???)
130 (???)
131 (???)
132 (???)
133 (???)
134 (???)
135 (???)
OPTION: 97 ( 17) UUID/GUID 008c0c7ab81331a0 ...z..1.
4297445b2e41610e B.D[.Aa.
a8 .
OPTION: 94 ( 3) Client NDI 010300 ...
OPTION: 93 ( 2) Client System 0007 ..
OPTION: 60 ( 32) Vendor class identifier PXEClient:Arch:00007:UNDI:003000
---------------------------------------------------------------------------
TIME: 2018-07-11 19:49:37.641
IP: 0.0.0.0 (0:15:5d:0:50:12) > 255.255.255.255 (ff:ff:ff:ff:ff:ff)
OP: 2 (BOOTPREPLY)
HTYPE: 1 (Ethernet)
HLEN: 6
HOPS: 0
XID: 8bf1c250
SECS: 0
FLAGS: 7f80
CIADDR: 0.0.0.0
YIADDR: 192.168.1.105
SIADDR: 192.168.1.2
GIADDR: 0.0.0.0
CHADDR: 00:15:5d:00:50:d0:00:00:00:00:00:00:00:00:00:00
SNAME: .
FNAME: efi/core.efi.
OPTION: 53 ( 1) DHCP message type 2 (DHCPOFFER)
OPTION: 51 ( 4) IP address leasetime 7200 (2h)
OPTION: 1 ( 4) Subnet mask 255.255.255.0
OPTION: 3 ( 4) Routers 192.168.1.1
OPTION: 6 ( 8) DNS server 8.8.8.8,8.8.4.4
OPTION: 15 ( 14) Domainname ad.lholota.net
OPTION: 28 ( 4) Broadcast address 192.168.1.255
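One way to read the two packets above is to compare the option codes the client asked for with the ones the offer actually carries. The Python sketch below does that with the codes copied from the dump. It does not prove which missing option makes the firmware ignore the offer; it only narrows down the candidates, and the dump may not show every option. The `PXE_RELATED` selection is my own assumption about which options are worth looking at first.

```python
# Codes from the Parameter Request List (option 55) in the DHCPDISCOVER above.
REQUESTED = {
    1, 2, 3, 4, 5, 6, 12, 13, 15, 17, 18, 22, 23, 28, 40, 41, 42, 43,
    50, 51, 54, 58, 59, 60, 66, 67, 97, 128, 129, 130, 131, 132, 133, 134, 135,
}

# Option codes listed in the DHCPOFFER above.
OFFERED = {53, 51, 1, 3, 6, 15, 28}

# Assumption: a subset of requested options that relate to network boot.
PXE_RELATED = {
    43: "Vendor specific info",
    54: "Server identifier",
    60: "Vendor class identifier",
    66: "TFTP server name",
    67: "Bootfile name",
    97: "UUID/GUID",
}


def missing_options(requested, offered):
    """Return the requested option codes that the offer does not contain."""
    return sorted(requested - offered)


missing = missing_options(REQUESTED, OFFERED)
print(f"{len(missing)} of {len(REQUESTED)} requested options are not in the offer")
for code in missing:
    if code in PXE_RELATED:
        print(code, PXE_RELATED[code])
```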

I have had what I believe is the same issue when booting HyperV virtual machines on Win10 2004 (19041.685): Gen1 works, Gen2 times out without ever asking for the boot file.
I strongly suspect this is an issue with the Gen2 UEFI PXE implementation, because as soon as there are at least two entries to choose from in the PXE boot menu, it requests files and downloads them as expected.
I run dnsmasq for TFTP and DHCP, and my config file below works if and only if at least one of the last two rows is uncommented (pxe-service=x86-64_EFI and pxe-service=7 are equivalent).
config context: https://linuxconfig.org/how-to-configure-a-raspberry-pi-as-a-pxe-boot-server
# /etc/dnsmasq.d/03-tftpboot.conf
enable-tftp
tftp-lowercase
tftp-root=/mnt/data/netboot
pxe-prompt="Choose:"
pxe-service=x86PC,"PXELINUX (BIOS)",bios/pxelinux.0
pxe-service=x86PC,"WinPE (BIOS)",boot/pxeboot.n12
pxe-service=x86-64_EFI,"PXELINUX (EFI)",efi64/syslinux.efi
pxe-service=x86-64_EFI,"winpe (EFI)",boot/wdsmgfw.efi
#pxe-service=7,"PXELINUX (EFI-7)",efi64/syslinux.efi

I think I am experiencing the same problem when using Digital Rebar Provision. It works great on Gen 1 but not on Gen 2. I have followed the same configuration as well.
Looking at the digital rebar code it seems like it should work but does not: https://github.com/digitalrebar/provision/blob/8269e1c7ff12a82854c19eccd114d064e2278211/midlayer/pxe.go#L252
I think this could be related:
https://wiki.fogproject.org/wiki/index.php/BIOS_and_UEFI_Co-Existence
https://serverfault.com/questions/739138/hyper-v-2016-gen2-vm-pxe-dhcp-timeout-wireshark-dhcp-discover-offer

Related

OpenNebula - Bridge VM NIC with Host NIC - take IP from LAN DHCP

I hope you are doing well.
I have just started using OpenNebula. I deployed a basic setup: one OpenNebula frontend on CentOS 8 and another server as the OpenNebula node.
I downloaded a CentOS image from the marketplace, then created a network under Network >> Virtual Network and bridged it with ens33 (ens33 is the physical interface of my node) in order to give the VM access to the LAN.
Here is my node's network configuration:
[centos#host1 ~]$ ifconfig
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.0.60 netmask 255.255.255.0 broadcast 192.168.0.255
ether 00:0c:29:68:26:2b txqueuelen 1000 (Ethernet)
RX packets 679155 bytes 994474147 (948.4 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 41914 bytes 3220552 (3.0 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 6 bytes 672 (672.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 6 bytes 672 (672.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
virbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255
ether 52:54:00:89:84:b1 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Once I create a VM and attach it to the bridged network I created, it ends up in status Failed with the log below:
Sat May 1 03:50:25 2021 [Z0][VM][I]: New state is ACTIVE
Sat May 1 03:50:25 2021 [Z0][VM][I]: New LCM state is PROLOG
Sat May 1 03:50:38 2021 [Z0][VM][I]: New LCM state is BOOT
Sat May 1 03:50:38 2021 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/14/deployment.0
Sat May 1 03:50:39 2021 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_context.
Sat May 1 03:50:40 2021 [Z0][VMM][I]: Command execution fail: cat << EOT | /var/tmp/one/vnm/bridge/pre
Sat May 1 03:50:40 2021 [Z0][VMM][E]: pre: Command "sudo ip link add name ens33 type bridge " failed.
Sat May 1 03:50:40 2021 [Z0][VMM][E]: pre: RTNETLINK answers: File exists
Sat May 1 03:50:40 2021 [Z0][VMM][E]: RTNETLINK answers: File exists
Sat May 1 03:50:40 2021 [Z0][VMM][E]:
Sat May 1 03:50:40 2021 [Z0][VMM][I]: ExitCode: 2
Sat May 1 03:50:40 2021 [Z0][VMM][I]: Failed to execute network driver operation: pre.
Sat May 1 03:50:40 2021 [Z0][VMM][E]: Error deploying virtual machine: bridge: RTNETLINK answers: File exists
Sat May 1 03:50:40 2021 [Z0][VM][I]: New LCM state is BOOT_FAILURE
Can anyone please explain what is wrong here? I am familiar with vSphere ESXi/vCenter. I just want to create a VM network, attach it to the node's physical NIC, and then attach the VM to this network in order to give it LAN access. On the VMware side this is simple, but with OpenNebula I am not sure how it works.
Thank you
The problem here is that you are using a physical interface instead of using a bridge. If you would like to use bridge networking, you need to create a bridge or let OpenNebula create it for you.
Let me know if this answers your question; if not, feel free to post your query on the OpenNebula Forum - https://forum.opennebula.io/. :)
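The log line `ip link add name ens33 type bridge` failing with "File exists" fits this explanation: the driver tries to create a bridge with that name, but a physical interface called ens33 is already there. A rough pre-check could look like the Python sketch below. It relies on Linux exposing a `bridge` directory under /sys/class/net/<name> for bridge devices. The function names are made up for this example, and the demo runs against a throwaway directory tree that imitates sysfs (with hypothetical names br0 and onebr0) so that it works on any machine.

```python
import os
import tempfile
from typing import Optional


def is_bridge(ifname: str, sysfs_root: str = "/sys/class/net") -> bool:
    """True if `ifname` is a Linux bridge (it has a `bridge` sysfs directory)."""
    return os.path.isdir(os.path.join(sysfs_root, ifname, "bridge"))


def bridge_name_problem(ifname: str, sysfs_root: str = "/sys/class/net") -> Optional[str]:
    """Explain why `ifname` cannot be used as a bridge name, or return None."""
    exists = os.path.exists(os.path.join(sysfs_root, ifname))
    if exists and not is_bridge(ifname, sysfs_root):
        return f"{ifname} already exists and is not a bridge"
    return None  # either an existing bridge or a free name


# Demo on a fake sysfs tree: ens33 is a plain NIC, br0 is a bridge.
with tempfile.TemporaryDirectory() as fake_sysfs:
    os.makedirs(os.path.join(fake_sysfs, "ens33"))
    os.makedirs(os.path.join(fake_sysfs, "br0", "bridge"))
    nic_result = bridge_name_problem("ens33", fake_sysfs)
    bridge_result = bridge_name_problem("br0", fake_sysfs)
    new_result = bridge_name_problem("onebr0", fake_sysfs)

print(nic_result)
print(bridge_result)
print(new_result)
```

On the real node, calling bridge_name_problem("ens33") without the second argument would check the live /sys/class/net tree.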

IPVS (keepalived) doesn't balance UDP connections

I have two load balancers running Debian 8 and three Graylog servers running Debian 9.
Every server in my network sends logs via rsyslog to a virtual server configured on the LB. The connection is UDP.
The problem is that the packets are not balanced: all connections go to the first real server in the list.
In case of failover, the packets are correctly sent to the other real servers.
The only way I have found to rebalance the connections is to remove all real servers from the LB and then restart the keepalived service.
I already tried to set:
ipvsadm --set 0 0 1
Timeout (tcp tcpfin udp): 900 120 1
I already set these two variables:
echo 1 > /proc/sys/net/ipv4/vs/expire_nodest_conn
echo 1 > /proc/sys/net/ipv4/vs/expire_quiescent_template
IPVS is configured as follows:
vrrp_instance logserver {
state MASTER
interface eth0
virtual_router_id 195
priority 200
advert_int 1
authentication {
auth_type keepalived
auth_pass xxxxxx
}
virtual_ipaddress {
10.20.20.195/22
}
}
virtual_server 10.20.20.195 0 {
delay_loop 60
protocol UDP
lb_algo wrr
lb_kind DR
persistence_timeout 30
real_server 10.20.20.196 0 {
weight 100
MISC_CHECK {
connect_timeout 3
misc_path "/etc/keepalived/checkgraylog 10.20.20.196"
}
}
real_server 10.20.20.197 0 {
weight 100
MISC_CHECK {
connect_timeout 3
misc_path "/etc/keepalived/checkgraylog 10.20.20.197"
}
}
real_server 10.20.20.198 0 {
weight 100
MISC_CHECK {
connect_timeout 3
misc_path "/etc/keepalived/checkgraylog 10.20.20.198"
}
}
}
Is there a way to effectively balance UDP connections with Direct Routing?
Thank you
virtual_server 10.20.20.195 12333 {
delay_loop 60
protocol UDP
lb_algo wrr
lb_kind DR
ops # <<< - Try this. Works for me (Ubuntu 18.04, Keepalived v1.3.9, ipvsadm v1.28)
real_server 10.20.20.196 12333 {
For me, the ops option works only if either:
The virtual server port is explicitly defined.
fwmark is used in the virtual_server definition.
It does not work for the virtual_server <IP> 0 form; in that case ipvsadm -Ln shows that the persistent option is used as well.
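To illustrate why persistence and per-packet scheduling behave so differently for a single busy sender, here is a toy model in Python. It is not IPVS code and it ignores timeouts, weights and connection tracking; it only contrasts "remember the first choice per client IP" with "schedule every datagram on its own". The client address 10.20.20.10 is a hypothetical log sender.

```python
from collections import Counter
from itertools import cycle

REAL_SERVERS = ["10.20.20.196", "10.20.20.197", "10.20.20.198"]


def schedule(datagrams, persistent):
    """Toy round-robin scheduler over equally weighted real servers.

    persistent=True:  the first server picked for a client IP is reused.
    persistent=False: every datagram is scheduled independently.
    """
    rr = cycle(REAL_SERVERS)
    remembered = {}
    hits = Counter()
    for client_ip in datagrams:
        if persistent:
            if client_ip not in remembered:
                remembered[client_ip] = next(rr)
            server = remembered[client_ip]
        else:
            server = next(rr)
        hits[server] += 1
    return hits


# One hypothetical log sender emitting 300 datagrams.
traffic = ["10.20.20.10"] * 300
with_persistence = schedule(traffic, persistent=True)
per_packet = schedule(traffic, persistent=False)
print(dict(with_persistence))
print(dict(per_packet))
```

In this model the persistent case sends all 300 datagrams to one real server, while the per-packet case spreads them evenly over the three.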

ASP.NET Core SignalR websocket connection limit

I am load testing a SignalR (ASP.NET Core) application hosted on Windows Server 2016 Standard, using Microsoft.AspNetCore.SignalR.Client.
.NET Core Hosting 2.1.1 is installed.
I cannot create more than about 3000 (2950-3050) connections.
I have already tried the recommendations described here:
How to configure concurrency in .NET Core Web API?
Limiting performance factors of WebSocket in ASP.NET 4.5?
Set limit concurrent connections for websocket on iis 8
Added limits to UseKestrel (this seems to work if I set the values to 100 or 1000):
var host = new WebHostBuilder()
.UseKestrel(options =>
{
options.Limits.MaxConcurrentConnections = 50000;
options.Limits.MaxConcurrentUpgradedConnections = 50000;
})
Changed all aspnet.config files by adding this:
<system.web>
<applicationPool maxConcurrentRequestsPerCPU="50000" />
</system.web>
Executed this command:
cd %windir%\System32\inetsrv\
appcmd.exe set config /section:system.webserver/serverRuntime /appConcurrentRequestLimit:50000
I added a performance counter for Web Service\Current Connections - Maximum Connections. Maximum Connections increases to 3300 and stops.
There are no exceptions in the server logs, but I suspect there is some restriction in the system.
The server IIS logs contain only this:
GET /messageshub
id=A_3x1sH9kHM1Rc3oPSgP6w
80 - 172.20.192.11 - - 404 0 0 3
The client exception is basically the following:
System.Net.Http.HttpRequestException: Error while copying content to a
stream. ---> System.IO.IOException: Unable to read data from the
transport connection: An existing connection was forcibly closed by
the remote host.
On Windows you may have a dynamic port assignment issue.
By default, Windows has 5000 port numbers available for TCP connections, and 1024 of them are reserved for the OS itself, which leaves 3977 ports free to be assigned.
In your case the number is 3300, as you mentioned, but it is possible that 3300 of the connections are established and 677 of them are in TIME_WAIT.
In any case, I recommend running
netstat -an | find /c "ESTABLISHED"
netstat -an | find /c "TIME_WAIT"
netstat -an | find /c "CLOSE_WAIT"
to find the number of established, TIME_WAIT and CLOSE_WAIT connections at the time you received the IO exception. If the total is close to 5000, add this to your registry, reboot, and test again:
[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters]
MaxUserPort = 5000 (Default = 5000, Max = 65534)
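The arithmetic behind these figures can be sketched in Python. The sketch assumes the dynamic range runs from port 1024 up to and including MaxUserPort, which is the reading that reproduces the 3977 quoted above; the real range depends on the Windows version, so treat the numbers as illustrative.

```python
# Assumption: ports below 1024 are reserved and the dynamic range is
# 1024..MaxUserPort inclusive (chosen to match the 3977 figure above).
FIRST_DYNAMIC_PORT = 1024


def dynamic_port_count(max_user_port: int) -> int:
    """Number of ports available for outgoing connections."""
    return max_user_port - FIRST_DYNAMIC_PORT + 1


default_ports = dynamic_port_count(5000)
max_ports = dynamic_port_count(65534)
left_after_established = default_ports - 3300

print(default_ports)           # 3977
print(left_after_established)  # 677
print(max_ports)               # 64511
```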

resolv.conf (generated) wrong order? (2 routers)

I have 2 routers in my network.
A) The one issued by my ISP (limited settings; I even had to ask to get port forwarding set up), which is also my modem.
B) My own router (where I set my DHCP etc.)
Now the generated resolv.conf on Raspbian and Arch Linux lists:
domain local
nameserver <IP of A>
nameserver <IP of B>
As I understand it, this is the order in which the servers will be tried when resolving names, but here it should try my internal router B before trying to resolve using A.
PS: Both subnetmasks are 255.255.255.0
Router A has 192.168.0.1
Router B has 192.168.1.1
All devices are in the 192.168.1.### range.
PPS: Arch Linux is set up to use NetworkManager, not a manually configured dhcpcd
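To check the order on each machine without eyeballing the file, a small Python sketch like the one below can list the nameserver entries in the order they appear. The sample text uses the router addresses from this question in place of the placeholders; the function name is made up for the example.

```python
from typing import List


def nameservers_in_order(resolv_conf_text: str) -> List[str]:
    """Return the nameserver addresses in the order they are listed."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Only lines of the form "nameserver <address>" are of interest.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


sample = """\
domain local
nameserver 192.168.0.1
nameserver 192.168.1.1
"""
order = nameservers_in_order(sample)
print(order)
```

On a real system the same function could be fed the contents of /etc/resolv.conf.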
NetworkManager may use dnsmasq for DHCP and to handle DNS lookups.
I noticed that dnsmasq reverses the order of the nameservers. Look at your logs. This shows up more clearly in the log if dnsmasq is also set to query the DNS servers in parallel:
#/etc/dnsmasq.conf
#all-servers
#/etc/dnsmasq.d/laptop.conf
all-servers
log-queries=extra
log-async=100
log-dhcp
#/etc/dnsmasq.d/servers.conf
server=66.187.76.168
server=162.248.241.94
server=165.227.22.116
/var/log/dnsmasq.log--
Mar 14 02:14:20 dnsmasq[3216]: 71700 127.0.0.1/38951 cached firefox.settings.services.mozilla.com is <CNAME>
Mar 14 02:14:20 dnsmasq[3216]: 71700 127.0.0.1/38951 forwarded firefox.settings.services.mozilla.com to 165.227.22.116
Mar 14 02:14:20 dnsmasq[3216]: 71700 127.0.0.1/38951 forwarded firefox.settings.services.mozilla.com to 162.248.241.94
Mar 14 02:14:20 dnsmasq[3216]: 71700 127.0.0.1/38951 forwarded firefox.settings.services.mozilla.com to 66.187.76.168
...order of calls is reversed in log lines!
I got rid of systemd-resolved to rely on dnsmasq.

tftp retry timeout exceeded

My issue is that the retry count is exceeded when I download a kernel image to an Econa processor board (Econa is an ARM-based processor) via TFTP, as shown below:
CNS3000 # tftp 0x4000000 bootpImage.cns3420.uclibc
MAC PORT 0 : Initialize bcm53115M
MAC PORT 2 : Initialize RTL8211
TFTP from server 192.168.0.219; our IP address is 192.168.0.112
Filename 'bootpImage.cns3420.uclibc'.
Load address: 0x4000000
Loading: T T T T T T T T T T
Retry count exceeded; starting again
The following points may help in finding the cause of this error.
Ping response is OK
CNS3000 # ping 192.168.0.219
MAC PORT 0 : Initialize bcm53115M
MAC PORT 2 : Initialize RTL8211
host 192.168.0.219 is alive
To verify that TFTP is running, I did the following. The TFTP server seems to be working. I placed a small file in /tftpboot:
# echo "Hello, embedded world" > /tftpboot/hello.txt
Then I fetched it from localhost:
# tftp localhost
tftp> get hello.txt
Received 23 bytes in 0.1 seconds
tftp> quit
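The check above only goes through the loopback interface. A fetch from another machine on the same network as the board would exercise the same path the board uses. Below is a minimal Python sketch of a TFTP read (RFC 1350, 512-byte blocks, octet mode) that could be run from such a machine. It has no retransmission and no option negotiation, so it is a diagnostic aid rather than a full client, and the function names are made up for this example.

```python
import socket
import struct

OP_RRQ, OP_DATA, OP_ACK, OP_ERROR = 1, 3, 4, 5
BLOCK_SIZE = 512  # default TFTP block size


def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """Read request: opcode, filename, NUL, mode, NUL."""
    return struct.pack("!H", OP_RRQ) + filename.encode("ascii") + b"\0" + mode.encode("ascii") + b"\0"


def build_ack(block: int) -> bytes:
    """Acknowledgement for one data block."""
    return struct.pack("!HH", OP_ACK, block)


def fetch(server: str, filename: str, timeout: float = 5.0) -> bytes:
    """Download `filename` from `server`; raises socket.timeout if nothing arrives."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(build_rrq(filename), (server, 69))
        data = b""
        expected = 1
        while True:
            packet, peer = sock.recvfrom(4 + BLOCK_SIZE)
            opcode, block = struct.unpack("!HH", packet[:4])
            if opcode == OP_ERROR:
                raise OSError(packet[4:].rstrip(b"\0").decode("ascii", "replace"))
            if opcode != OP_DATA or block != expected:
                continue  # ignore anything that is not the next data block
            payload = packet[4:]
            data += payload
            # The server answers from a new port, so ACK to `peer`, not port 69.
            sock.sendto(build_ack(block), peer)
            if len(payload) < BLOCK_SIZE:
                return data  # a short block ends the transfer
            expected = (expected + 1) & 0xFFFF
    finally:
        sock.close()


# Example call from a second machine (server address taken from the question):
# print(fetch("192.168.0.219", "hello.txt"))
```

If this times out from another host while the localhost test succeeds, the problem is more likely in the network path or the xinetd setup than in the board.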
Please note that there is no firewall or SELinux on my machine.
Please verify that the locations of these files are OK. I have placed the kernel image file bootpImage.cns3420.uclibc in /tftpboot. The TFTP service file is located in /etc/xinetd.d/tftp.
My TFTP service file is:
service tftp
{
socket_type =dgram
protocol=udp
wait=yes
user=root
server=/usr/sbin/in.tftpd
server_args=-s /tftpboot -b 512
disable=no
per_source=11
cps=100 2
flags=ipv4
}
printenv response in U-boot is:
CNS3000 # printenv
bootargs=root=/dev/mtdblock0 mem=256M console=ttyS0
baudrate=38400
ethaddr=00:53:43:4F:54:54
netmask=255.255.0.0
tftp_bsize=512
udp_frag_size=512
mmc_init=mmcinit
loading=fatload mmc 0 0x4000000 bootpimage-82511
running=go 0x4000000
bootcmd=run mmc_init;run loading;run running
serverip=192.168.0.219
ipaddr=192.168.0.112
bootdelay=5
port=1
bootfile=/tftpboot/bootpImage.cns3420.uclibcl
stdin=serial
stdout=serial
stderr=serial
verify=n
Environment size: 437/4092 bytes
Regards
Waqas
Loading: T T T T T T T T T T
means there is no transfer at all. This can be caused by a wrong interface setting, e.g. u-boot is configured for 100 Mbit full duplex and you try to connect via half duplex or 10 Mbit (or some mix of these). Another point is the MTU size, which should be 1500 (u-boot cannot handle packet fragmentation).
Hint for Windows/VMware users:
TFTP timeouts from u-boot can be caused by Windows IP forwarding.
1) If you have a home network: switch it off.
2) If you are running the Routing and Remote Access service: shut down the service.
3) Check the registry for IP forwarding:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\IPEnableRouter
set value to 0 (and maybe reboot)