Strip HTML to remove all JS/CSS/HTML tags and get the actual text (as displayed in the browser) for indexing and search



I tried strip_tags, but it still leaves inline JS such as (function(){..}) and inline CSS such as #button{}

I need to extract plain text from the HTML, without any JS functions, styles, or tags, so that I can index it and use it for my search feature.

html2text doesn't seem to solve the problem either!

EDIT

PHP code:

$url = "http://blog.everymansoftware.com/2011/11/development-setup-for-neo4j-and-php_05.html";
$fileHeaders = @get_headers($url);
if( $fileHeaders[0] == "HTTP/1.1 200 OK" || $fileHeaders[0] == "HTTP/1.0 200 OK")
        {
            $content = strip_tags(file_get_contents($url));
        }

Output:

$content=

(function() { var a=window,c="jstiming",d="tick";var e=function(b){this.t={};this.tick=function(b,o,f){f=void 0!=f?f:(new Date).getTime();this.t[b]=[f,o]};this[d]("start",null,b)},h=new e;a.jstiming={Timer:e,load:h};if(a.performance&&a.performance.timing){var i=a.performance.timing,j=a[c].load,k=i.navigationStart,l=i.responseStart;0=k&&(j[d]("_wtsrt",void 0,k),j[d]("wtsrt_","_wtsrt",l))}
try{var m=null;a.chrome&&a.chrome.csi&&(m=Math.floor(a.chrome.csi().pageT));null==m&&a.gtbExternal&&(m=a.gtbExternal.pageT());null==m&&a.external&&(m=a.external.pageT);m&&(a[c].pt=m)}catch(n){};a.tickAboveFold=function(b){var g=0;if(b.offsetParent){do g+=b.offsetTop;while(b=b.offsetParent)}b=g;750>=b&&a[c].load[d]("aft")};var p=!1;function q(){p||(p=!0,a[c].load[d]("firstScrollTime"))}a.addEventListener?a.addEventListener("scroll",q,!1):a.attachEvent("onscroll",q);
 })();





Everyman Software: Development Setup for Neo4j and PHP: Part 2

#navbar-iframe { display:block }



if(window.addEventListener) {
    window.addEventListener('load', prettyPrint, false);
  } else {
    window.attachEvent('onload', prettyPrint);
  }
var a=navigator,b="userAgent",c="indexOf",f="&m=1",g="(^|&)m=",h="?",i="?m=1";function j(){var d=window.location.href,e=d.split(h);switch(e.length){case 1:return d+i;case 2:return 0

2011-11-05



Development Setup for Neo4j and PHP: Part 2


This is Part 2 of a series on setting up a development environment for building projects using the graph database Neo4j and PHP. In Part 1 of this series, we set up unit test and development databases.  In this part, we'll build a skeleton project that includes unit tests, and a minimalistic user interface.
All the files will live under a directory on our web server. In a real project, you'll probably want only the user interface files under the web server directory and your testing and library files somewhere more protected.
Also, I won't be using any specific PHP framework.  The principles in t

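Here is a small snippet I always use to remove all hidden text from a web page, including everything between <script>, <style>, <head>, and similar tags. It also collapses every run of consecutive whitespace characters into a single whitespace character.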

<?php
$url = "http://blog.everymansoftware.com/2011/11/development-setup-for-neo4j-and-php_05.html";
    $fileHeaders = @get_headers($url);
    if( $fileHeaders[0] == "HTTP/1.1 200 OK" || $fileHeaders[0] == "HTTP/1.0 200 OK")
    {
            $content = strip_html_tags(file_url_contents($url));
    }
############################################
//To fetch the $url by using cURL
function file_url_contents($url){
    $crl = curl_init();
    $timeout = 30;
    curl_setopt ($crl, CURLOPT_URL,$url);
    curl_setopt ($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
} //file_url_contents ENDS
//To remove all the hidden text not displayed on a webpage
function strip_html_tags($str){
    // Drop runs of three identical angle brackets (<<< or >>>); \1 is a backreference
    $str = preg_replace('/(<|>)\1{2}/is', '', $str);
    $str = preg_replace(
        array(// Remove invisible content, including everything between the tags
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?>.*?</script>@siu',
            '@<noscript[^>]*?>.*?</noscript>@siu',
            ),
        "", //replace above with nothing
        $str );
    $str = replaceWhitespace($str);
    $str = strip_tags($str);
    return $str;
} //function strip_html_tags ENDS
//To replace all types of whitespace with a single space
function replaceWhitespace($str) {
    $result = $str;
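    // Every pair of adjacent whitespace characters is reduced to its first character
    foreach (array(
    "  ", " \t",  " \r",  " \n",
    "\t\t", "\t ", "\t\r", "\t\n",
    "\r\r", "\r ", "\r\t", "\r\n",
    "\n\n", "\n ", "\n\t", "\n\r",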
    ) as $replacement) {
    $result = str_replace($replacement, $replacement[0], $result);
    }
    return $str !== $result ? replaceWhitespace($result) : $result;
}
############################
?>

See it in action here: http://codepad.viper-7.com/txIxfE
and the output: http://pastebin.com/a86jd17s
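
A DOM-based variant of the same idea can be sketched as follows. It is not the snippet above, only an illustration under stated assumptions: the helper name html_to_visible_text is hypothetical, PHP's bundled DOM extension is assumed to be enabled, and the input is assumed to be ASCII or to declare its own charset (loadHTML otherwise treats the bytes as ISO-8859-1).

```php
<?php
// Hypothetical helper (not part of the original answer): parse the markup,
// drop the non-visible elements, then read the remaining text.
function html_to_visible_text(string $html): string
{
    if ($html === '') {
        return ''; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach (['script', 'style', 'noscript', 'head'] as $tag) {
        // The node list is live, so copy it before removing anything.
        $nodes = iterator_to_array($doc->getElementsByTagName($tag));
        foreach ($nodes as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    $root = $doc->documentElement;
    $text = $root !== null ? $root->textContent : '';

    // Collapse every run of whitespace into one space.
    return trim((string) preg_replace('/\s+/u', ' ', $text));
}

$html = '<html><head><title>Ignored title</title>'
      . '<style>#navbar-iframe { display:block }</style></head>'
      . '<body><script>(function () { var a = window; })();</script>'
      . "<p>Hello,   visible\ntext!</p></body></html>";

echo html_to_visible_text($html), "\n";
```

Working on the parsed tree avoids relying on regular expressions to find matching tags, at the cost of a full parse. Note that text from adjacent elements is simply concatenated, so words can run together when the markup has no whitespace between elements.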

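strip_tags() removes only the tags themselves, that is, the parts enclosed in < and >; the text between an opening and a closing tag is kept. So, for example, if you have something like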

<script type="text/javascript">alert('hello world');</script>

it will be reduced to

alert('hello world');

This will not be executed; it will simply be displayed on your site.

Alternatively, try htmlentities(), which converts "<" to "&lt;" and ">" to "&gt;", so the markup can be displayed safely without anything being executed.
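
As a rough illustration of that escaping, the script tag from the example above can be passed through htmlentities(). The ENT_QUOTES flag and the UTF-8 charset are assumptions made for this sketch, not something the answer specifies.

```php
<?php
$snippet = "<script type=\"text/javascript\">alert('hello world');</script>";

// ENT_QUOTES also escapes single and double quotes; UTF-8 input is assumed.
$escaped = htmlentities($snippet, ENT_QUOTES, 'UTF-8');

// The escaped string contains no raw angle brackets, so a browser
// renders it as text instead of running it.
echo $escaped, "\n";
```

This makes the markup safe to display, but it keeps the script source in the string, so on its own it does not help when the goal is clean text for indexing.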

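If, on the other hand, your problem is extracting data from the markup, you are better off using a regular expression. For example, if you have something like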

<a href="http://www.google.com">Google</a>

you can simply use preg_match() to get the word "Google" out of the whole link:

$content='<a href="http://www.google.com">Google</a>';
$regex='#<a href=".*?">(.+?)</a>#'; // single quotes, so the inner double quotes need no escaping
preg_match($regex,$content,$match);
echo $match[1];

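By the way, $match[1] contains only the captured text, free of the surrounding markup, whereas $match[0] contains the whole match, tags included. To get multiple matches, use preg_match_all().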
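
A minimal sketch of the preg_match_all() case, reusing the same pattern. The second link is made up for illustration.

```php
<?php
// Two links; the second one is a made-up example.
$content = '<a href="http://www.google.com">Google</a> '
         . '<a href="http://www.php.net">PHP</a>';
$regex = '#<a href=".*?">(.+?)</a>#';

// With the default PREG_PATTERN_ORDER, $matches[0] holds the full matches
// and $matches[1] holds the text captured by the first group.
preg_match_all($regex, $content, $matches);

print_r($matches[1]);
```

Here $matches[1] is an array with one entry per link, in document order.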
