Server hangs for some time after 16000 requests



I am new to boost::asio. I am trying to run

ab -n 20000 -c 5  -r http://127.0.0.1:9999/

The test gets stuck after 16000 requests every time, although it does eventually complete. I also get a lot of failed requests.

What the code is doing:

  • A. Create the io_service
  • B. Create an acceptor
  • C. Bind and listen
  • D. Create a socket
  • E. Do async_accept
  • F. In the handler of async_accept, close the socket, create a new one, and do async_accept again with the same handler.

Here is the code:

#include <iostream>
#include <functional>
#include <string>
#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/thread.hpp>
#include <boost/lexical_cast.hpp>
#include <memory>

// global io_service and acceptor shared by all handlers
boost::asio::io_service ioService;
boost::asio::ip::tcp::acceptor accp(ioService);

// callback for accept: reply, shut down the socket, then queue the next accept
void onAccept(const boost::system::error_code &ec, std::shared_ptr<boost::asio::ip::tcp::socket> soc) {
    using boost::asio::ip::tcp;
    soc->send(boost::asio::buffer("In Accept"));
    soc->shutdown(boost::asio::ip::tcp::socket::shutdown_send);
    soc.reset(new tcp::socket(ioService));
    accp.async_accept(*soc, [=](const boost::system::error_code &ec) {
        onAccept(ec, soc);
    });
}

int main(int argc, char *argv[]) {
    using boost::asio::ip::tcp;
    boost::asio::ip::tcp::resolver resolver(ioService);
    try {
        boost::asio::ip::tcp::resolver::query query("127.0.0.1", boost::lexical_cast<std::string>(9999));
        boost::asio::ip::tcp::endpoint endpoint = *resolver.resolve(query);
        accp.open(endpoint.protocol());
        accp.set_option(boost::asio::ip::tcp::acceptor::reuse_address(true));
        accp.bind(endpoint);
        std::cout << "Ready to accept @ 9999" << std::endl;
        auto t1 = boost::thread([&]() { ioService.run(); });
        accp.listen(boost::asio::socket_base::max_connections);
        std::shared_ptr<tcp::socket> soc = std::make_shared<tcp::socket>(ioService);
        accp.async_accept(*soc, [=](const boost::system::error_code &ec) { onAccept(ec, soc); });
        t1.join();
    } catch (std::exception &ex) {
        std::cout << "[" << boost::this_thread::get_id() << "] Exception: " << ex.what() << std::endl;
    }
}

For the sake of completeness:

  1. I changed my code as suggested by @Arunmu
  2. I used Docker on Linux, because of the socket issue pointed out by @david-schwartz
  3. The server now never hangs.
    • Single-threaded - 6045 requests per second
    • Threaded - 5849 requests per second
  4. Switched to async_write (a hedged sketch of what that change might look like follows this list)
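
Since the final async_write version is not posted here, below is a minimal sketch (an illustration only, not the exact final code) of how the accept handler from the answer below could use async_write instead of the blocking asio::write. It assumes the same globals (ioService, accp, response) and headers as that answer.

// Minimal sketch only: accept handler using async_write instead of asio::write.
// Assumes the globals ioService, accp and response from the answer below.
void onAccept(const std::error_code& ec, std::shared_ptr<asio::ip::tcp::socket> soc)
{
    using asio::ip::tcp;
    soc->set_option(tcp::no_delay(true));

    // a shared_ptr keeps the streambuf alive until the handlers have run
    auto buf = std::make_shared<asio::streambuf>();
    asio::async_read_until(*soc, *buf, "\r\n\r\n",
        [=](std::error_code, std::size_t) {
            asio::async_write(*soc, asio::buffer(response, std::strlen(response)),
                [=](std::error_code, std::size_t) {
                    // the whole response has been handed to the kernel; shut down and close
                    soc->shutdown(tcp::socket::shutdown_send);
                    soc->close();
                    (void)buf; // keep the buffer captured until the write completes
                });
        });

    // immediately post the next accept on a fresh socket
    auto nsoc = std::make_shared<tcp::socket>(ioService);
    accp.async_accept(*nsoc, [=](const std::error_code& ec) { onAccept(ec, nsoc); });
}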

You are running out of local sockets. You should not test by generating all of the load from a single IP address. (Also, your load generator should be smart enough to detect and work around this condition, but sadly many are not.)

First, let's do things a bit more correctly. I have changed the code to use standalone asio instead of Boost.Asio and to use C++14 features. With your original code there were many failures, which my changes reduce.

Code:

#include <iostream>
#include <functional>
#include <string>
#include <cstring>
#include <asio.hpp>
#include <thread>
#include <memory>
#include <system_error>
#include <chrono>

// global io_service and acceptor shared by all handlers
asio::io_service ioService;
asio::ip::tcp::acceptor accp(ioService);

const char* response = "HTTP/1.1 200 OK\r\n\r\n\r\n";

// callback for accept: read the request, write the response, close, then queue the next accept
void onAccept(const std::error_code& ec, std::shared_ptr<asio::ip::tcp::socket> soc)
{
    using asio::ip::tcp;
    soc->set_option(asio::ip::tcp::no_delay(true));

    auto buf = new asio::streambuf;
    asio::async_read_until(*soc, *buf, "\r\n\r\n",
        [=](auto ec, auto siz) {
            asio::write(*soc, asio::buffer(response, std::strlen(response)));
            soc->shutdown(asio::ip::tcp::socket::shutdown_send);
            delete buf;
            soc->close();
        });

    auto nsoc = std::make_shared<tcp::socket>(ioService);
    //soc.reset(new tcp::socket(ioService));
    accp.async_accept(*nsoc, [=](const std::error_code& ec) {
        onAccept(ec, nsoc);
    });
}

int main(int argc, char* argv[])
{
    using asio::ip::tcp;
    asio::ip::tcp::resolver resolver(ioService);
    try {
        asio::ip::tcp::resolver::query query(
            "127.0.0.1",
            std::to_string(9999)
        );
        asio::ip::tcp::endpoint endpoint = *resolver.resolve(query);
        accp.open(endpoint.protocol());
        accp.set_option(asio::ip::tcp::acceptor::reuse_address(true));
        accp.bind(endpoint);
        std::cout << "Ready to accept @ 9999" << std::endl;

        auto t1 = std::thread([&]() { ioService.run(); });
        auto t2 = std::thread([&]() { ioService.run(); });

        accp.listen(1000);

        std::shared_ptr<tcp::socket> soc = std::make_shared<tcp::socket>(ioService);
        accp.async_accept(*soc, [=](const std::error_code& ec) {
            onAccept(ec, soc);
        });

        t1.join();
        t2.join();
    } catch (const std::exception& ex) {
        std::cout << "[" << std::this_thread::get_id()
                  << "] Exception: " << ex.what() << std::endl;
    } catch (...) {
        std::cerr << "Caught unknown exception" << std::endl;
    }
}

The main changes are:

  1. Send a proper HTTP response.
  2. Read the request. Otherwise you are just filling up the socket receive buffer.
  3. Close the socket correctly.
  4. Use more than one thread. This was mainly required on Mac OS; Linux did not need it.

Test command used: ab -n 20000 -c 1 -r http://127.0.0.1:9999/

On Linux the test passed quickly without any errors, and without using an additional thread for the io_service.

On Mac, however, I was able to reproduce the issue: it got stuck after serving 16000 requests. A sample of the process at that point:

Call graph:
906 Thread_1887605   DispatchQueue_1: com.apple.main-thread  (serial)
+ 906 start  (in libdyld.dylib) + 1  [0x7fff868bc5c9]
+   906 main  (in server_hangs_so) + 2695  [0x10d3622b7]
+     906 std::__1::thread::join()  (in libc++.1.dylib) + 20  [0x7fff86ad6ba0]
+       906 __semwait_signal  (in libsystem_kernel.dylib) + 10  [0x7fff8f44c48a]
906 Thread_1887609
+ 906 thread_start  (in libsystem_pthread.dylib) + 13  [0x7fff8d0983ed]
+   906 _pthread_start  (in libsystem_pthread.dylib) + 176  [0x7fff8d09afd7]
+     906 _pthread_body  (in libsystem_pthread.dylib) + 131  [0x7fff8d09b05a]
+       906 void* std::__1::__thread_proxy<std::__1::tuple<main::$_2> >(void*)  (in server_hangs_so) + 124  [0x10d36317c]
+         906 asio::detail::scheduler::run(std::__1::error_code&)  (in server_hangs_so) + 181  [0x10d36bc25]
+           906 asio::detail::scheduler::do_run_one(asio::detail::scoped_lock<asio::detail::posix_mutex>&, asio::detail::scheduler_thread_info&, std::__1::error_code const&)  (in server_hangs_so) + 393  [0x10d36bfe9]
+             906 kevent  (in libsystem_kernel.dylib) + 10  [0x7fff8f44d21a]
906 Thread_1887610
  906 thread_start  (in libsystem_pthread.dylib) + 13  [0x7fff8d0983ed]
    906 _pthread_start  (in libsystem_pthread.dylib) + 176  [0x7fff8d09afd7]
      906 _pthread_body  (in libsystem_pthread.dylib) + 131  [0x7fff8d09b05a]
        906 void* std::__1::__thread_proxy<std::__1::tuple<main::$_3> >(void*)  (in server_hangs_so) + 124  [0x10d36324c]
          906 asio::detail::scheduler::run(std::__1::error_code&)  (in server_hangs_so) + 181  [0x10d36bc25]
            906 asio::detail::scheduler::do_run_one(asio::detail::scoped_lock<asio::detail::posix_mutex>&, asio::detail::scheduler_thread_info&, std::__1::error_code const&)  (in server_hangs_so) + 263  [0x10d36bf67]
              906 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff8f44c136]
Total number in stack (recursive counted multiple, when >=5):
Sort by top of stack, same collapsed (when >= 5):
__psynch_cvwait  (in libsystem_kernel.dylib)        906
__semwait_signal  (in libsystem_kernel.dylib)        906
kevent  (in libsystem_kernel.dylib)        906

Only after providing an additional thread was I able to complete the test, with the following result:

Benchmarking 127.0.0.1 (be patient)
Completed 2000 requests
Completed 4000 requests
Completed 6000 requests
Completed 8000 requests
Completed 10000 requests
Completed 12000 requests
Completed 14000 requests
Completed 16000 requests
Completed 18000 requests
Completed 20000 requests
Finished 20000 requests

Server Software:
Server Hostname:        127.0.0.1
Server Port:            9999
Document Path:          /
Document Length:        2 bytes
Concurrency Level:      1
Time taken for tests:   33.328 seconds
Complete requests:      20000
Failed requests:        3
(Connect: 1, Receive: 1, Length: 1, Exceptions: 0)
Total transferred:      419979 bytes
HTML transferred:       39998 bytes
Requests per second:    600.09 [#/sec] (mean)
Time per request:       1.666 [ms] (mean)
Time per request:       1.666 [ms] (mean, across all concurrent requests)
Transfer rate:          12.31 [Kbytes/sec] received
Connection Times (ms)
min  mean[+/-sd] median   max
Connect:        0    0  30.7      0    4346
Processing:     0    1 184.4      0   26075
Waiting:        0    0   0.0      0       1
Total:          0    2 186.9      0   26075
Percentage of the requests served within a certain time (ms)
50%      0
66%      0
75%      0
80%      0
90%      0
95%      0
98%      0
99%      0
100%  26075 (longest request)

Stack trace of the thread that is probably stuck:

* thread #3: tid = 0x0002, 0x00007fff8f44d21a libsystem_kernel.dylib`kevent + 10, stop reason = signal SIGSTOP
* frame #0: 0x00007fff8f44d21a libsystem_kernel.dylib`kevent + 10
frame #1: 0x0000000109c482ec server_hangs_so`asio::detail::kqueue_reactor::run(bool, asio::detail::op_queue<asio::detail::scheduler_operation>&) + 268
frame #2: 0x0000000109c48039 server_hangs_so`asio::detail::scheduler::do_run_one(asio::detail::scoped_lock<asio::detail::posix_mutex>&, asio::detail::scheduler_thread_info&, std::__1::error_code const&) + 393
frame #3: 0x0000000109c47c75 server_hangs_so`asio::detail::scheduler::run(std::__1::error_code&) + 181
frame #4: 0x0000000109c3f2fc server_hangs_so`void* std::__1::__thread_proxy<std::__1::tuple<main::$_3> >(void*) + 124
frame #5: 0x00007fff8d09b05a libsystem_pthread.dylib`_pthread_body + 131
frame #6: 0x00007fff8d09afd7 libsystem_pthread.dylib`_pthread_start + 176
frame #7: 0x00007fff8d0983ed libsystem_pthread.dylib`thread_start + 13

This is probably an issue with the kqueue_reactor implementation in asio, or (less likely) in the Mac system itself.

UPDATE: The same behaviour is also observed with libevent, so the asio implementation is not the problem. It must be some bug in the kqueue kernel implementation; the issue is not seen with epoll on Linux.

In my case this was due to Mac OS X's net.inet.tcp.msl=15000, as described in:

  • https://rocketeer.be/blog/2014/04/benchmarking-on-osx-http-timeouts/
  • https://kaiwern.com/posts/2022/08/11/benchmarking-http-server-stuck-at-16k-requests/

Closed connections linger in TIME_WAIT for a while, using up the available ephemeral ports, and the client gets stuck in SYN_SENT waiting for a port to become free.
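
To see this on the Mac, a rough illustration (the exact value to choose is discussed in the linked posts) of inspecting and temporarily lowering the setting:

sysctl net.inet.tcp.msl                  # current MSL in milliseconds (15000 by default here)
sudo sysctl -w net.inet.tcp.msl=1000     # temporarily recycle TIME_WAIT sockets sooner
netstat -an | grep TIME_WAIT | wc -l     # count connections parked in TIME_WAIT while ab runs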
