计算引擎虚拟机失去网络连接



我们在计算引擎中有一个运行CentOs 8的VM。此虚拟机已运行很长时间,并且从未重新启动。昨晚,我们突然失去了通过内部和外部IP与虚拟机的连接。SSH也不可能。

在串行端口上,我们观察到以下日志:

Apr 29 15:53:18 <vm-name> google_osconfig_agent[1215]: default
Apr 29 16:05:18 <vm-name> google_osconfig_agent[1215]: default
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3120] dhcp4 (eth0): option dhcp_lease_time      => '3600'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option domain_name          => 'us-central1-a.c.<project-name>.internal'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option domain_name_servers  => '169.254.169.254'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option domain_search        => 'us-central1-a.c.<project-name>.internal c.<project-name>.internal google.internal'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option expiry               => '1619716266'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option host_name            => '<vm-name>.us-central1-a.c.<project-name>.internal'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option interface_mtu        => '1460'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option ip_address           => '10.128.0.4'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option next_server          => '10.128.0.1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option ntp_servers          => '169.254.169.254'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option requested_broadcast_address => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3124] dhcp4 (eth0): option requested_domain_name => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_domain_name_servers => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_domain_search => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_host_name  => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_interface_mtu => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_ms_classless_static_routes => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_nis_domain => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_nis_servers => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_ntp_servers => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_rfc3442_classless_static_routes => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3125] dhcp4 (eth0): option requested_root_path  => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3126] dhcp4 (eth0): option requested_routers    => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3126] dhcp4 (eth0): option requested_static_routes => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3126] dhcp4 (eth0): option requested_subnet_mask => '1'
Apr 29 16:11:06 <vm-name> dbus-daemon[827]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.8' (uid=0 pid=907 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3126] dhcp4 (eth0): option requested_time_offset => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3126] dhcp4 (eth0): option requested_wpad       => '1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3126] dhcp4 (eth0): option rfc3442_classless_static_routes => '10.128.0.1/32 0.0.0.0 0.0.0.0/0 10.128.0.1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3126] dhcp4 (eth0): option routers              => '10.128.0.1'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3126] dhcp4 (eth0): option subnet_mask          => '255.255.255.255'
Apr 29 16:11:06 <vm-name> NetworkManager[907]: <info>  [1619712666.3126] dhcp4 (eth0): state changed extended -> extended
Apr 29 16:11:06 <vm-name> systemd[1]: Starting Network Manager Script Dispatcher Service...
Apr 29 16:11:06 <vm-name> dbus-daemon[827]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Apr 29 16:11:06 <vm-name> systemd[1]: Started Network Manager Script Dispatcher Service.
Apr 29 16:15:18 <vm-name> google_osconfig_agent[1215]: default
Apr 29 16:29:30 <vm-name> GCEGuestAgent[1269]: 2021-04-29T16:28:35.1302Z GCEGuestAgent Error main.go:181: Error watching metadata: Get http://metadata.google.internal/computeMetadata/v1//?recursive=true&alt=json&wait_for_change=true&timeout_sec=60&last_etag=4ac15b8179731d72: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Apr 29 16:29:31 <vm-name> OSConfigAgent[1215]: 2021-04-29T16:29:03.9464Z OSConfigAgent Error main.go:189: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: i/o timeout
Apr 29 16:29:44 <vm-name> google_osconfig_agent[1215]: default
[16281276.978048] google_guest_agent[1269]: 2021/04/29 16:41:10 logging client: context deadline exceeded
Apr 29 17:11:14 <vm-name> google_guest_agent[1269]: 2021/04/29 16:41:10 logging client: context deadline exceeded
Apr 29 17:23:19 <vm-name> google_osconfig_agent[1215]: 2021/04/29 16:42:42 logging client: context deadline exceeded
Apr 29 17:23:19 <vm-name> google_osconfig_agent[1215]: default
Apr 29 17:23:19 <vm-name> google_osconfig_agent[1215]: 2021/04/29 17:03:31 logging client: context deadline exceeded
Apr 29 17:23:19 <vm-name> google_osconfig_agent[1215]: default
Apr 29 17:27:20 <vm-name> OSConfigAgent[1215]: 2021-04-29T16:41:09.0306Z OSConfigAgent Error main.go:189: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: i/o timeout
Apr 29 17:27:20 <vm-name> NetworkManager[907]: <info>  [1619717130.6935] dhcp4 (eth0): option dhcp_lease_time      => '3600'
Apr 29 17:27:20 <vm-name> google_osconfig_agent[1215]: 2021/04/29 17:26:00 logging client: context deadline exceeded
Apr 29 17:27:20 <vm-name> OSConfigAgent[1215]: 2021-04-29T17:07:12.7039Z OSConfigAgent Error main.go:189: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: i/o timeout
Apr 29 17:27:20 <vm-name> OSConfigAgent[1215]: 2021-04-29T17:22:13.2740Z OSConfigAgent Error main.go:189: network error when requesting metadata, make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable
Apr 29 17:28:55 <vm-name> NetworkManager[907]: <info>  [1619717199.5681] dhcp4 (eth0): option domain_name          => 'us-central1-a.c.<project-name>.internal'
Apr 29 17:34:48 <vm-name> OSConfigAgent[1215]: 2021-04-29T17:27:54.8816Z OSConfigAgent Error main.go:189: network error when requesting metadata, make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable
Apr 29 17:39:16 <vm-name> google_osconfig_agent[1215]: default
Apr 29 17:39:16 <vm-name> google_osconfig_agent[1215]: 2021/04/29 17:36:13 logging client: context deadline exceeded
Apr 29 17:39:16 <vm-name> google_osconfig_agent[1215]: default
Apr 29 17:55:16 <vm-name> NetworkManager[907]: <info>  [1619717287.9321] dhcp4 (eth0): option domain_name_servers  => '169.254.169.254'
Apr 29 17:58:36 <vm-name> OSConfigAgent[1215]: 2021-04-29T17:30:47.0872Z OSConfigAgent Error main.go:189: network error when requesting metadata, make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable
Apr 29 18:03:20 <vm-name> google_osconfig_agent[1215]: default
Apr 29 18:03:20 <vm-name> google_osconfig_agent[1215]: 2021/04/29 17:46:22 logging client: context deadline exceeded

重新启动计算机恢复了网络。我们在日志中找不到任何其他内容,监控也没有显示任何可疑内容。是什么原因造成的?

此OSConfigAgent错误似乎是问题的原因:

Apr 29 17:27:20 <vm-name> OSConfigAgent[1215]: 2021-04-29T17:22:13.2740Z OSConfigAgent Error main.go:189: network error when requesting metadata, make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/?recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable

根据GCP文件:

谷歌云在169.254.169.254的每个实例旁边运行一个本地元数据服务器。该服务器对实例的操作至关重要,因此无论您配置了什么防火墙规则,实例都可以访问该服务器。

基于此,/var/log/messages中一定有一些有趣的内容。您可能发现了与网络适配器相关的内容。

根据以下日志摘录,我可以推断这很可能是因为网络连接中断

OSConfigAgent Error main.go:189: network error when requesting metadata, make sure your instance has an active network and can reach the metadata server: Get http://169.254.169.254/computeMetadata/v1/? recursive=true&alt=json&wait_for_change=true&last_etag=4ac15b8179731d72&timeout_sec=60: dial tcp 169.254.169.254:80: connect: network is unreachable

从日志中,我注意到:

16:11:06,DHCP租约续订,持续时间为3600秒。因此,DHCP租约的第二次续订时间应该是17:11:06左右。但从日志来看,更新延迟了15分钟(约(,发生在17:26:39。下一次更新发生在17:55:07,大约27分钟后。

包括DHCP续订在内的整个网络堆栈被延迟,很可能是由于CPU或内存过载。当主机经历高CPU负载时,会导致网络降级。

为了检查过载是否在CPU或内存中,打开云控制台---->计算引擎--->VM实例--->单击实例的三个点--->查看监控--->CPU和内存利用率图表(检查问题发生时的时间范围(

请确保您的实例具有足够的内存利用率来维持负载。

最新更新