我保存了一个.csv文件,它是UTF-8编码的。脚本是该文件中数据的天成文书。我能够在excel 中看到csv文件中的单词
में
लिए
किया
गया
हैं
नहीं
सिंह
पुलिस
दिया
करने
कहा
रहे
बाद
करें
साथ
रहा
但是当我在R中打开它时,单词并没有得到正确的编码。print((的输出如下:
word
सारे_खतरों_को
जानते_हà¥u0081à¤u008f_à¤à¥€
विवेक_ने
टीवी
我该如何解决此问题?我试过Sys.setlocale()
和read.delim(wordlist.csv, encoding = "UTF-8")
,但都不起作用。
评论太长(对不起,我是R
中的新手(:
print( sessionInfo())
library(stringi)
library(magrittr)
x <- read.delim("D:\bat\SO\64497248_devangari.csv", encoding = "UTF-8")
print('=== print(x)')
print(x)
for (line in x){
y <- line %>%
stri_replace_all_regex("<U\+([[:alnum:]]+)>", "\\u$1") %>%
stri_unescape_unicode() %>%
stri_enc_toutf8()
}
print('=== print(y)')
print(y)
print('=== for (i in y) {print(i)}')
for (i in y) {print(i)}
print('=== print(z)')
z <- x['word'] %>%
stri_replace_all_regex("<U\+([[:alnum:]]+)>", "\\u$1") %>%
stri_unescape_unicode() %>%
stri_enc_toutf8()
print(z)
输出(在Rgui.exe
控制台中(:
> source ( 'D:\bat\SO\64497248.r' )
R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=Czech_Czechia.1250 LC_CTYPE=Czech_Czechia.1250 LC_MONETARY=Czech_Czechia.1250
[4] LC_NUMERIC=C LC_TIME=Czech_Czechia.1250
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.0.1
[1] "=== print(x)"
word
1 <U+092E><U+0947><U+0902>
2 <U+0932><U+093F><U+090F>
3 <U+0915><U+093F><U+092F><U+093E>
4 <U+0917><U+092F><U+093E>
5 <U+0939><U+0948><U+0902>
6 <U+0928><U+0939><U+0940><U+0902>
7 <U+0938><U+093F><U+0902><U+0939>
8 <U+092A><U+0941><U+0932><U+093F><U+0938>
9 <U+0926><U+093F><U+092F><U+093E>
10 <U+0915><U+0930><U+0928><U+0947>
11 <U+0915><U+0939><U+093E>
12 <U+0930><U+0939><U+0947>
13 <U+092C><U+093E><U+0926>
14 <U+0915><U+0930><U+0947><U+0902>
15 <U+0938><U+093E><U+0925>
16 <U+0930><U+0939><U+093E>
[1] "=== print(y)"
[1] "में" "लिए" "किया" "गया" "हैं" "नहीं" "सिंह" "पुलिस" "दिया" "करने" "कहा" "रहे" "बाद" "करें" "साथ" "रहा"
[1] "=== for (i in y) {print(i)}"
[1] "में"
[1] "लिए"
[1] "किया"
[1] "गया"
[1] "हैं"
[1] "नहीं"
[1] "सिंह"
[1] "पुलिस"
[1] "दिया"
[1] "करने"
[1] "कहा"
[1] "रहे"
[1] "बाद"
[1] "करें"
[1] "साथ"
[1] "रहा"
[1] "=== print(z)"
[1] "c("में", "लिए", "किया", "गया", "हैं", "नहीं", "सिंह", "पुलिस", "दिया", "करने", "कहा", "रहे", "बाद", "करें", "साथ", "रहा"n)"
Warning messages:
1: package ‘magrittr’ was built under R version 4.0.2
2: In stri_replace_all_regex(., "<U\+([[:alnum:]]+)>", "\\u$1") :
argument is not an atomic vector; coercing
>