How to remove similar documents from a corpus



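I have a corpus of news articles on a given topic. Some of these articles are the exact same article, but have been given additional headers and footers that change the content very slightly. I am trying to remove all but one of each set of potential duplicates, so that the final corpus contains only unique articles.

I decided to use cosine similarity to identify potential duplicates: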

myDfm <- dfm(as.character(docs$text_main), verbose=FALSE)
cosinesim <- textstat_simil(x=myDfm, selection=docnames(myDfm), margin="documents", method="cosine")
cosinemat <- as.matrix(cosinesim)
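For readers unfamiliar with the measure, here is a minimal base R sketch of what a cosine similarity matrix contains. The `counts` matrix and the `cosine_sim()` helper are hypothetical illustrations, not part of quanteda, and the package may compute the same quantity differently internally.

```r
# Toy document-term counts (hypothetical data): rows are documents, columns are terms.
counts <- matrix(
  c(2, 1, 0,
    4, 2, 0,
    0, 1, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("a", "b", "c"), c("t1", "t2", "t3"))
)

# Cosine similarity: pairwise dot products divided by the product of the row norms.
cosine_sim <- function(m) {
  norms <- sqrt(rowSums(m^2))
  tcrossprod(m) / outer(norms, norms)
}

sim <- cosine_sim(counts)
print(round(sim, 3))
```

Documents `a` and `b` have proportional term counts, so their similarity is 1 even though `b` is twice as long. This is why added headers and footers only lower the score slightly rather than making two copies look unrelated.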

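After reviewing a subset of the data, I settled on a cosine similarity of 0.9 or higher as the cutoff for flagging a duplicate. (I can live with any errors.) Given that, I set the diagonal to 0 (i.e., not a duplicate) and recode the matrix to indicate which documents are duplicates and which are not: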

diag(cosinemat) <- 0
cosinemat[cosinemat >= .9] <- 1
cosinemat[cosinemat < .9] <- 0

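The problem I'm running into is figuring out how to remove all but one of the duplicate documents. Initially, I envisioned a for loop that walks through each column cell by cell and, for any cell with a value of 1 (i.e., a duplicate), removes the column with the same name as the current cell's row, rebuilds the matrix, and moves on to the next cell. The for loop does not seem to like the line of code that deletes the column with the current row's name when the cell equals 1. I'm also not sure whether it is possible to restructure the object you are looping over. Something like this: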

cosine_df <- as.data.frame(cosinemat)
for(col in 1:ncol(cosine_df)){
    for(row in 1:nrow(cosine_df)){
        if(cosine_df[col,row] == 0){
            next
        }
        if(cosine_df[col,row] == 1){
            cosine_df <- cosine_df[!rownames(cosine_df) %in% paste(rownames(cosine_df)[col,row]]
        }
    }
}

I'm not wedded to this approach; as long as I can identify similar documents and remove all but one of them, I'm open to creative solutions.
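One loop-free possibility, offered as a sketch rather than a definitive fix: because the 0/1 matrix is symmetric, its upper triangle alone tells you whether a document matches any earlier document. The small `dup` matrix and the names `earlier`, `keep`, and `kept_names` below are hypothetical stand-ins for illustration.

```r
# Hypothetical 0/1 duplicate matrix for four documents: d1 and d3 are
# duplicates of each other, d2 and d4 are unique. The diagonal is 0.
ids <- paste0("d", 1:4)
dup <- matrix(0, 4, 4, dimnames = list(ids, ids))
dup["d1", "d3"] <- dup["d3", "d1"] <- 1

# Zero out the lower triangle, so column j only records matches with
# earlier documents i < j.
earlier <- dup
earlier[lower.tri(earlier, diag = TRUE)] <- 0

# Keep a document only if it matches no earlier document.
keep <- colSums(earlier) == 0
kept_names <- names(keep)[keep]
print(kept_names)
```

Nothing is deleted while iterating, which sidesteps the question of modifying an object inside the loop that walks over it. One caveat: a document is dropped if it matches any earlier document, including one that was itself dropped, so this assumes the duplicate relation is roughly transitive.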

In case it helps, here is a subset of the documents:

docs <- structure(list(text_main = c("Congressional Documents and PublicationsMay 26, 2016Copyright 2016 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:287 wordsBody(Washington, DC) Reps. Ted Deutch (D-FL) and Gus Bilirakis (R-FL) joined with Reps. Steve Israel (D-NY), Mike Kelly (R-PA), Ted Lieu (D-CA), Adam Kinzinger (R-IL), Hakeem Jeffries (D-NY), Lee Zeldin (R-NY), and Susan Davis (D-CA) to introduce a resolution (H. Res. 750) urging the European Union (EU) to designate the entirety of Hizballah as a terrorist organization and increase pressure on the organizations and its members. Currently, the EU only designates Hizballah's military wing as a terrorist organization, while the United States makes no distinction between its military and political branches when listing the group on its Foreign Terrorist Organization list.Upon introduction, the Members of Congress released the following statement:"Hizballah is an Iranian-backed terrorist organization with a global reach that engages in significant illicit criminal activity to fund its terrorism. It doesn't matter what part of the organization you're associated with; if you are connected with Hizballah, you are contributing to the rocket attacks on innocent Israeli civilians, targeted bombings of Jews around the world, slaughter of civilians in Syria, and destabilization of the Middle East. There is no distinction between parts of Hizballah when every part contributes to terrorism. We urge our EU allies to help rein in Hizballah's dangerous worldwide activities."The resolution can be viewed here .Last year, Congress passed the Hizballah International Financing Prevention Act which tightened sanctions on Hizballah's criminal and financial networks.Read this original document at: ", 
"Congressional Documents and PublicationsApril 20, 2016Copyright 2016 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:499 wordsBodyToday, members of the House of Representatives Bipartisan Taskforce for Combating Anti-Semitism sounded the alarm about a troubling surge in anti-Semitism on American college campuses. In a letter to the Secretary of Education, the Taskforce asked the Secretary about the Department's planned response to the issue. Additionally, the co-chairs made the following statement:"An alarming rise of anti-Israel programs on American college campuses contribute to increasing harassment, intimidation, and discrimination against Jewish students. While we believe that students' freedoms of speech and assembly should be respected, there are increasing reports that activity advertised as anti-Israel or anti-Zionist is devolving into displays of subtle, but sometimes outright anti-Semitism. Attacks on students because of their actual or perceived religion, ancestry, or ethnicity are unacceptable. We believe strongly that no student should ever face discrimination and that school activities must be structured in a respectful manner to ensure academic integrity and a nondiscriminatory environment throughout the entire campus. For these reasons, we ask the Department of Education to assess its ability to monitor and respond to anti-Semitic incidents and to take additional steps to combat intimidation and harassment against minority students on college campuses."In 2004, the U.S. Department of Education Office for Civil Rights (OCR) clarified its interpretation of Title VI of the Civil Rights Act of 1964, including protections for groups of students on the basis of their actual or perceived shared ancestry or ethnic characteristics, regardless of whether they are members of a faith community, as in the case for Jewish, Sikh, and Muslim students. 
The Department reiterated this policy again in 2010 and 2015.However, as the number of reported Boycott, Divestment, and Sanctions (BDS) movement campaigns and other anti-Israel initiatives rise on college campuses, Members of Congress believe the Department must proactively implement its anti-discrimination policy to mitigate anti-Semitism on college campuses.The Bipartisan Taskforce for Combating Anti-Semitism is co-chaired by U.S. Reps. Nita Lowey (D-NY), Chris Smith (R-NJ), Eliot Engel (D-NY), Ileana Ros-Lehtinen (R-FL), Kay Granger (R-TX), Steve Israel (D-NY), Peter Roskam (R-IL), and Ted Deutch (D-FL).The following organizations expressed their support for the letter: the Anti-Defamation League, Jewish Federation of North America, B'nai Brith International, Jewish United Fund/Jewish Federation of Metropolitan Chicago, the Louis D. Brandeis Center for Human Rights Under Law, the World Jewish Congress, and the Zionist Organization of America.Text of the letter can be found here .Read this original document at: ", 
"Targeted News ServiceApril 20, 2016 Wednesday 7:41 AM  ESTCopyright 2016 Targeted News Service LLC All Rights ReservedLength:511 wordsByline:Targeted News ServiceDateline:WASHINGTON BodyRep. Ted Deutch, D-Fla. (21st CD), issued the following news release:Today, members of the House of Representatives Bipartisan Taskforce for Combating Anti-Semitism sounded the alarm about a troubling surge in anti-Semitism on American college campuses. In a letter to the Secretary of Education, the Taskforce asked the Secretary about the Department's planned response to the issue. Additionally, the co-chairs made the following statement:"An alarming rise of anti-Israel programs on American college campuses contribute to increasing harassment, intimidation, and discrimination against Jewish students. While we believe that students' freedoms of speech and assembly should be respected, there are increasing reports that activity advertised as anti-Israel or anti-Zionist is devolving into displays of subtle, but sometimes outright anti-Semitism. Attacks on students because of their actual or perceived religion, ancestry, or ethnicity are unacceptable. We believe strongly that no student should ever face discrimination and that school activities must be structured in a respectful manner to ensure academic integrity and a nondiscriminatory environment throughout the entire campus. For these reasons, we ask the Department of Education to assess its ability to monitor and respond to anti-Semitic incidents and to take additional steps to combat intimidation and harassment against minority students on college campuses."In 2004, the U.S. 
Department of Education Office for Civil Rights (OCR) clarified its interpretation of Title VI of the Civil Rights Act of 1964, including protections for groups of students on the basis of their actual or perceived shared ancestry or ethnic characteristics, regardless of whether they are members of a faith community, as in the case for Jewish, Sikh, and Muslim students. The Department reiterated this policy again in 2010 and 2015.However, as the number of reported Boycott, Divestment, and Sanctions (BDS) movement campaigns and other anti-Israel initiatives rise on college campuses, Members of Congress believe the Department must proactively implement its anti-discrimination policy to mitigate anti-Semitism on college campuses.The Bipartisan Taskforce for Combating Anti-Semitism is co-chaired by U.S. Reps. Nita Lowey (D-NY), Chris Smith (R-NJ), Eliot Engel (D-NY), Ileana Ros-Lehtinen (R-FL), Kay Granger (R-TX), Steve Israel (D-NY), Peter Roskam (R-IL), and Ted Deutch (D-FL).The following organizations expressed their support for the letter: the Anti-Defamation League, Jewish Federation of North America, B'nai Brith International, Jewish United Fund/Jewish Federation of Metropolitan Chicago, the Louis D. Brandeis Center for Human Rights Under Law, the World Jewish Congress, and the Zionist Organization of America.Text of the letter can be found here ().Contact: Jason Attermann, 202/225-3001Copyright Targeted News Services30FurigayJof-5501453 30FurigayJof", 
"US Official NewsFebruary 13, 2013 WednesdayCopyright 2013 Plus Media Solutions Private Limited All Rights ReservedLength:298 wordsDateline:Washington Body Office of the House of Representative Ted Deutch, U.S Government has issued the following news release: Rep. Ted Deutch (D-FL) and Rep. Gus Bilirakis (R-GL) issued the following statements regarding the Bulgarian governments report that two individuals responsible for the July 2012 terrorist attack on a bus in Burgas, Bulgaria, have ties to Hezbollah. Five Israeli tourists and the Bulgarian bus driver were killed in the attack.Congressman Bilirakis: The Bulgarian governments report is yet another example of Hezbollah's deliberate use of terror across the globe. Contrary to some European opinions, Hezbollah is not merely a political organization and is actively involved in terrorist activities. As I have requested many times, the European Union must finally recognize Hezbollah for what it is: a terrorist organization. I commend the Bulgarian government for their thorough investigation and call on the members of the European Union to examine these findings closely.Congressman Deutch: The results of the Bulgarian governments investigation into the deadly attack in Burgas confirms what we already knew - Hezbollah is a terrorist organization that is willing to perpetrate attacks on innocent civilians around the globe. I continue to urge our European partners to formally designate Hezbollah as a terrorist organization. Failure to do so only emboldens Hezbollah to continue its reign of terror in Europe and around the world.In September 2012, Congressmen Bilirakis and Deutch initiated a bi-partisan letter signed by 268 Members of Congress to the President and Ministers of the Commission of the European Union, urging them to include Hezbollah on the European Union's list of terrorist organizations. For further information please visit: ", 
"Congressional Documents and PublicationsMay 4, 2011Copyright 2011 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:204 wordsBodyWashington, May 4 -Rep. Ted Deutch released the following statement on the Florida legislature's passage of SB 444, which expands upon the Protecting Florida's Investments Act, legislation he authored in 2007 in the Florida State Senate:"I applaud the Florida Legislature's passage of SB 444, legislation that will help ensure national and international security by preventing Florida's taxpayer dollars from supporting companies who choose to violate federal law by bolstering the Iranian regime. I congratulate the bill's sponsors, Sen. Ellyn Bogdanoff and Rep. Mack Bernard. This bill prevents state and local governments from awarding contracts to companies found to be investing in the Iranian energy sector. It is consistent with federal policy and sends a clear message that Floridians will not support any company that puts profit over international security. The Iranian regime continues to pursue its illicit nuclear weapons program, continues to engage in the most egregious human rights violations, and continues to support terrorism across the globe. We must continue to utilize every economic tool at our disposal to bring this regime to its knees. I urge Governor Scott to act quickly to sign this bill into law."", 
"Congressional Documents and PublicationsMarch 23, 2011Copyright 2011 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:128 wordsBodyBoca Raton, Mar 23 -Congressman Ted Deutch (D-FL) released the following statement in reaction to the explosion of a bomb today in Jerusalem that killed a 59-year-old woman and injured dozens more:"Today's horrific bombing in Jerusalem is yet another attack in a surge of violence perpetuated by Palestinian terrorists against innocent Israeli citizens," said Congressman Ted Deutch. "The victims of this heinous attack and the Israeli people deserve the full support of the international community as they seek to defend themselves against this relentless violence. It is deplorable that as Israelis endure this latest bombing in Jerusalem, as well as ongoing rocket attacks by Hamas, some astonishingly still seek to blame Israel for the lack of peace in the region."", 
"States News ServiceMarch 26, 2015 ThursdayCopyright 2015 States News ServiceLength:218 wordsByline:States News ServiceDateline:WASHINGTON BodyThe following information was released by the office of Florida Rep. Ted Deutch:Congressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums.In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Florida's 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level."Today's deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform."", 
"Congressional Documents and PublicationsMarch 26, 2015Copyright 2015 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:250 wordsBodyCongressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums. In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Florida's 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level."Today's deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform."For a fact sheet on H.R. 2, please go to: .Read this original document at: ", 
"US Official NewsMarch 27, 2015 FridayCopyright 2015 Plus Media Solutions Private Limited All Rights ReservedLength:241 wordsDateline:Washington Body Office of the House of Representative Ted Deutch, U.S Government has issued the following news release:Congressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums.In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Floridas 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level."Todays deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform." In case of any query regarding this article or other content needs please contact: ", 
"US Official NewsMarch 27, 2015 FridayCopyright 2015 Plus Media Solutions Private Limited All Rights ReservedLength:241 wordsDateline:Washington Body Office of the House of Representative Ted Deutch, U.S Government has issued the following news release:Congressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums.In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Floridas 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level."Todays deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform." In case of any query regarding this article or other content needs please contact: "
)), row.names = c(NA, 10L), class = "data.frame", .Names = "text_main")

And here is the similarity matrix for the same subset of documents:

cosine_df <- structure(list(text1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), text2 = c(0, 
0, 1, 0, 0, 0, 0, 0, 0, 0), text3 = c(0, 1, 0, 0, 0, 0, 0, 0, 
0, 0), text4 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), text5 = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0), text6 = c(0, 0, 0, 0, 0, 0, 0, 0, 
0, 0), text7 = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1), text8 = c(0, 
0, 0, 0, 0, 0, 1, 0, 1, 1), text9 = c(0, 0, 0, 0, 0, 0, 1, 1, 
0, 1), text10 = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 0)), .Names = c("text1", 
"text2", "text3", "text4", "text5", "text6", "text7", "text8", 
"text9", "text10"), row.names = c("text1", "text2", "text3", 
"text4", "text5", "text6", "text7", "text8", "text9", "text10"
), class = "data.frame")
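The matrix above can also be read as a graph in which linked documents belong to the same duplicate group. As a sketch under that assumption, single-linkage clustering from base R's `stats` package recovers those groups. The matrix `m` below is rebuilt by hand to mirror the posted values, so the block runs on its own, and the names `groups` and `keep_ids` are illustrative.

```r
# Rebuild the posted 0/1 matrix: text2 and text3 match each other, and
# text7 through text10 all match one another.
ids <- paste0("text", 1:10)
m <- matrix(0, 10, 10, dimnames = list(ids, ids))
m[2, 3] <- m[3, 2] <- 1
m[7:10, 7:10] <- 1
diag(m) <- 0

# Treat duplicates as distance 0 and everything else as distance 1.
# Cutting a single-linkage tree at 0.5 yields the connected groups.
hc <- hclust(as.dist(1 - m), method = "single")
groups <- cutree(hc, h = 0.5)

# Keep the first document of each group.
keep_ids <- names(groups)[!duplicated(groups)]
print(keep_ids)
```

With the real data, the same idea can be applied to `1 - cosinemat` directly, cutting at `h = 0.1` instead of binarizing first. Single linkage chains matches together, so two documents can land in one group without being directly similar to each other.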

In case anyone else has a similar problem, this is the solution I ended up creating:

library(quanteda)
# In quanteda 3.0 and later, textstat_simil() is provided by the quanteda.textstats package.
myDfm <- dfm(as.character(docs$text_main), verbose=FALSE)
cosinesim <- textstat_simil(x=myDfm, selection=docnames(myDfm), margin="documents", method="cosine")
cosinemat <- as.matrix(cosinesim)  # matrix of pairwise document similarities
threshold <- .9
## for each document, the indices of all documents above the threshold (itself included)
similar_indices <- unique(apply(cosinemat, 1,
                                function(x) which(x > threshold)))
## keep only the first element of each set;
## is.list()/is.matrix() are used because class() can return more than one value
if (is.list(similar_indices)) {
    unique_indices <- unique(sapply(similar_indices, function(x) as.numeric(x[1])))
} else if (is.matrix(similar_indices)) {
    unique_indices <- unique(apply(similar_indices, 2, function(x) as.numeric(x[1])))
} else {
    unique_indices <- similar_indices
}
## get only the unique texts; drop = FALSE keeps the one-column data frame
docs_unique <- docs[unique_indices, , drop = FALSE]
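A possible variation on the solution above, sketched here as an assumption rather than a tested replacement: a greedy filter that compares each document only against the documents already kept. The function name `drop_near_duplicates()` and the toy matrix `sim` are hypothetical.

```r
# Greedy filter: keep a document only if it is not too similar to any
# document that has already been kept. Returns the row indices to keep.
drop_near_duplicates <- function(sim, threshold = 0.9) {
  kept <- integer(0)
  for (i in seq_len(nrow(sim))) {
    if (!any(sim[i, kept] >= threshold)) {
      kept <- c(kept, i)
    }
  }
  kept
}

# Toy similarity matrix (hypothetical values): A~B and B~C are near
# duplicates, but A and C are not similar to each other.
sim <- matrix(
  c(1.00, 0.95, 0.50,
    0.95, 1.00, 0.95,
    0.50, 0.95, 1.00),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "B", "C"), c("A", "B", "C"))
)

kept <- drop_near_duplicates(sim)
print(rownames(sim)[kept])
```

With the real data this would be called as `docs[drop_near_duplicates(cosinemat), , drop = FALSE]`. The design choice is that a document is never discarded merely for resembling another discarded document, so in the toy example both `A` and `C` survive.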
