Matching text by its start and end in R

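I'm doing some simple web scraping, but I've run into a problem I can't solve.

When I download the page source, I need to extract a series of locations (mostly country names).

I have this text: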

text <- "Â  Agaon fasciatum Waterston (Life: Kingdom: Metazoa (animals); Phylum: Arthropoda; Class: Hexapoda; Order: Hymenoptera;Â  Superfamily: Chalcidoidea; Family: Agaonidae; Genus: Agaon) Agaon fasciatum Waterston, 1914, Agaon tridentatum Joseph 1959. Holotype in The Natural History Museum, London. Type locality:Â Uganda. Distribution Â  Cameroon, Gabon, Guinea, Uganda, Zambia. Biology Host fig: Ficus cyathistipula cyathistipula Warb. References Waterston, J. 1914. Notes on African Chalcidoidea. I. Bulletin of Entomological Research. 5:249-258. Credits Photographs Â© Jean-Yves Rasplus (INRA) or Â© Simon van Noort (Iziko Museums of South Africa). NextÂ  genus: AlfonsiellaÂ Â Â Â Â Â Â  Next species: Agaon gabonense"

and I need to extract the distribution, which runs from the word Distribution to the period (.) that marks the end of the list of countries.

str_locate(string = text, pattern = "Distribution")

With this I can detect the word "Distribution". I know that with something like ".*\." I can detect the period. But when I try

str_locate(string = text, pattern = "Distribution.*\.")

I don't get any result.

Is there a solution? I know this should be easy, but I can't find the answer anywhere.

Thanks in advance,

Antonio

base R

gsub(".*Distribution Â? *([^.]+)\\..*", "\\1", text)
# [1] "Cameroon, Gabon, Guinea, Uganda, Zambia"
### or
gsub(".*(Distribution Â? *[^.]+)\\..*", "\\1", text)
# [1] "Distribution Â  Cameroon, Gabon, Guinea, Uganda, Zambia"

or

regmatches(text, gregexpr("Distribution Â? *[^.]+\\.", text))
# [[1]]
# [1] "Distribution Â  Cameroon, Gabon, Guinea, Uganda, Zambia."

If you use gsub, note that when the pattern is not found it returns the original text unchanged. (So just check that newtext != text to make sure you found something.)
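The check described above can be sketched as a small helper. This is only an illustration: `extract_distribution`, `sample_text`, and `no_match_text` are hypothetical names, and the shortened ASCII inputs stand in for the real `text` from the question (so the optional `Â?` part of the pattern is left out here).

```r
# Hypothetical, shortened inputs used only for illustration.
sample_text   <- "Type locality: Uganda. Distribution Cameroon, Gabon, Guinea. Biology Host fig."
no_match_text <- "Type locality: Uganda. Biology Host fig."

# Same idea as the gsub() call above: keep only the captured country list.
pattern <- ".*Distribution *([^.]+)\\..*"

extract_distribution <- function(x) {
  result <- gsub(pattern, "\\1", x)
  # gsub() returns its input unchanged when the pattern does not match,
  # so an unchanged string is treated as "nothing found".
  if (identical(result, x)) NA_character_ else result
}

extract_distribution(sample_text)    # the country list
extract_distribution(no_match_text)  # NA, because "Distribution" is absent
```

Returning `NA` instead of the untouched input makes a failed match easy to spot when the helper is applied to many scraped pages.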

stringr

stringr::str_extract(text, "Distribution Â? *[^.]+\\.")
# [1] "Distribution Â  Cameroon, Gabon, Guinea, Uganda, Zambia."

If you need the character positions of the match within the string,

stringr::str_locate(text, "Distribution Â? *[^.]+\\.")
#      start end
# [1,]   320 375
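Once the start and end positions are known, they can be used to cut the match out and split it into individual countries. The sketch below uses base R's `regexpr()` rather than `stringr::str_locate()` so that it has no package dependency; `sample_text` is a hypothetical shortened input, not the text from the question.

```r
# Hypothetical, shortened input used only for illustration.
sample_text <- "Type locality: Uganda. Distribution Cameroon, Gabon, Guinea. Biology Host fig."

# regexpr() returns the start position of the first match (or -1 if none),
# with the match length stored in the "match.length" attribute.
m <- regexpr("Distribution *[^.]+\\.", sample_text)

if (m != -1) {
  start <- as.integer(m)
  end   <- start + attr(m, "match.length") - 1L

  # Cut the matched span out of the string using the positions.
  matched <- substr(sample_text, start, end)

  # Drop the leading label and the trailing period, then split on commas.
  body      <- sub("\\.$", "", sub("^Distribution *", "", matched))
  countries <- strsplit(body, ", *")[[1]]
} else {
  countries <- character(0)
}

countries
```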
