Inconsistent output from gcount()

I wrote the following simple MRE to reproduce the bug in my program:

#include <iostream>
#include <utility>
#include <sstream>
#include <string_view>
#include <array>
#include <string>
#include <vector>
#include <iterator>
// this function is working fine only if string_view contains all the user provided chars and nothing extra like null bytes
std::pair< bool, std::vector< std::string > > tokenize( const std::string_view inputStr, const std::size_t expectedTokenCount )
{
// unnecessary implementation details
std::stringstream ss;
ss << inputStr.data( ); // works for null-terminated strings, but not for the non-null terminated strings
// unnecessary implementation details
}
int main( )
{
constexpr std::size_t REQUIRED_TOKENS_COUNT { 3 };
std::array<char, 50> input_buffer { };
std::cin.getline( input_buffer.data( ), input_buffer.size( ) ); // user can enter at max 50 characters
const auto [ hasExpectedTokenCount, foundTokens ] { tokenize( { input_buffer.data( ), input_buffer.size( ) }, REQUIRED_TOKENS_COUNT ) };
for ( const auto& token : foundTokens ) // print the tokens
{
std::cout << '\'' << token << "' ";
}
std::cout << '\n';
}

This is a program for tokenization (for the full code, see the Compiler Explorer link below). Also, I'm using GCC v11.2.

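First, I want to avoid using data() because it's a bit inefficient.

I looked at the assembly in Compiler Explorer and, apparently, streaming data() ends up calling strlen(), so it stops when it reaches the first null byte. But what if the string_view object is not null-terminated? That's a bit concerning. So I switched to ss << inputStr;.

Second, when I do ss << inputStr;, the entire 50-character buffer, with all of its null bytes, is inserted into ss. Here are some erroneous sample outputs: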

Example #1:

1                  2    3
'1' '2' '3                                     ' // '1' and '2' are correct, '3' has lots of null bytes

Example #2 (in this one I typed a space character after the 3):

1                  2    3
'1' '2' '3' '                                     ' // an extra token consisting of 1 space char and lots of null bytes has been created!

Is there a way to fix this? What should I do to also support non-null-terminated strings? I came up with the idea of using gcount(), as follows:

const std::streamsize charCount { std::cin.gcount( ) };
                  // here I pass charCount instead of the size of buffer
const auto [ hasExpectedTokenCount, foundTokens ] { tokenize( { input_buffer.data( ), charCount },
REQUIRED_TOKENS_COUNT ) };

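But the problem is that when the user enters fewer characters than the buffer size, gcount() returns a value that is 1 more than the actual number of chars entered (e.g. the user enters 5 characters but gcount returns 6, apparently counting the '\0' as well).

This causes the last token to have a null byte at its end too: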

1   2     3
'1' '2' '3 ' // see the null byte in '3 ', it's NOT a space char

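How should I fix the inconsistent output of gcount?

Or maybe I should change the tokenize function so that it strips any '\0' from the end of the string_view before it starts tokenizing.

This sounds like an XY problem, though. But I really need help deciding what to do.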

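The fundamental problem you're running into is the choice of operator<< function. You've tried two of them:

operator<<(ostream &, const char *), which takes characters from the pointer up to (but not including) the next NUL. As you noted, this can be a problem if the pointer comes from a string_view that has no terminating NUL.
operator<<(ostream &, const string_view &), which takes all the characters from the string_view, including any NULs it may contain.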

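It seems that what you want is to take the characters from the string_view up to (but not including) the first NUL or the end of the string_view, whichever comes first. You can do that by using find and constructing a substr that runs up to the NUL or the end: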

ss << inputStr.substr(0, inputStr.find('\0'));
