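I want to extract any meta, title, script, and link tags present on an HTML page. Here is the program I wrote (it is not correct, but it should give experts the idea).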
<?php
function get_tag($tag_name, $url)
{
$content = file_get_contents($url);
// this is not correct : regular expression please //
preg_match_all($tag_name, $content, $matches);
return $matches;
}
print_r(get_tag('title', 'http://stackoverflow.com'));
?>
The output should look like this:
Array
(
[0] => title
[1] => Stack Overflow
)
Thanks!!
function get_tags($tag, $url) {
//allow for improperly formatted html
libxml_use_internal_errors(true);
// Instantiate DOMDocument Class to parse html DOM
$xml = new DOMDocument();
// Load the file into the DOMDocument object
$xml->loadHTMLFile($url);
// Empty array to hold the tags to return
$tags = array();
//Loop through all tags of the given type and store details in the array
foreach($xml->getElementsByTagName($tag) as $tag_found) {
if ($tag_found->tagName == "meta")
{
$tags[] = array("meta_name" => $tag_found->getAttribute("name"), "meta_value" => $tag_found->getAttribute("content"));
}
else {
$tags[] = array('tag' => $tag_found->tagName, 'text' => $tag_found->nodeValue);
}
}
// Return the collected tags
return $tags;
}
This answer actually puts the tag's name in the first array value, rather than "array", and it also suppresses the warnings.
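To illustrate the warning suppression mentioned above, here is a minimal sketch. The helper name `parse_html_quietly` and the sample markup are assumptions made for illustration; it parses an HTML string rather than a URL so that it runs without network access, and it returns whatever messages libxml buffered instead of letting PHP print them.

```php
<?php
// Hypothetical helper (not from the answer above): parses an HTML string
// without emitting PHP warnings and returns the document together with
// the messages libxml collected while parsing.
function parse_html_quietly(string $html): array
{
    // Buffer libxml errors internally instead of raising PHP warnings
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Copy the buffered messages so the caller can inspect them if needed
    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }

    libxml_clear_errors();                 // free the buffered errors
    libxml_use_internal_errors($previous); // restore the earlier setting

    return [$doc, $messages];
}

// Sample markup with a mis-nested <b> element
$sample = '<html><head><title>Demo</title></head>'
        . '<body><p>unclosed <b>bold</p></body></html>';

[$doc, $messages] = parse_html_quietly($sample);
echo $doc->getElementsByTagName('title')->item(0)->nodeValue, "\n";
print_r($messages);
```

Which messages end up in `$messages` depends on the installed libxml version, so it is safer to treat them as diagnostics than to rely on their exact text.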
Before parsing HTML with regex, you should read the first answer to this question.
Try using DOMDocument instead, like this:
<?php
function get_tags($tags, $url) {
// Create a new DOM Document to hold our webpage structure
$xml = new DOMDocument();
// Load the url's contents into the DOM
$xml->loadHTMLFile($url);
// Empty array to hold the tags to return
$tags_found = array();
//Loop through each <$tags> tag in the dom and add it to the $tags_found array
foreach($xml->getElementsByTagName($tags) as $tag) {
$tags_found[] = array('tag' => $tags, 'text' => $tag->nodeValue);
}
// Return the collected tags
return $tags_found;
}
print_r(get_tags('title', 'http://stackoverflow.com'));
?>
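Since the question asks for meta, title, script, and link tags together, the DOM approach can be extended to gather all four in one pass. The following is only a sketch under stated assumptions: the helper name `get_head_tags`, the sample markup, and the shape of the returned array are invented for illustration, and it reads from an HTML string instead of a URL so it can be run offline. It uses `DOMXPath` with a union expression.

```php
<?php
// Hypothetical helper: collects meta, title, script and link elements
// from an HTML string, with their attributes and text content.
function get_head_tags(string $html): array
{
    libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $found = [];

    // The union operator (|) selects all four element types in one query
    foreach ($xpath->query('//meta | //title | //script | //link') as $node) {
        $attributes = [];
        foreach ($node->attributes as $attr) {
            $attributes[$attr->name] = $attr->value;
        }
        $found[] = [
            'tag'        => $node->nodeName,
            'attributes' => $attributes,
            'text'       => trim($node->textContent),
        ];
    }
    return $found;
}

// Sample markup (an assumption, not the real Stack Overflow page)
$html = '<html><head><title>Stack Overflow</title>'
      . '<meta name="description" content="Q&amp;A site">'
      . '<link rel="stylesheet" href="all.css">'
      . '<script src="app.js"></script></head><body></body></html>';

$tags = get_head_tags($html);
print_r($tags);
```

Keeping the attributes as a name-to-value map avoids special-casing meta tags, at the cost of a slightly more nested result than the flat array the question asked for.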
Since these tags cannot be nested, there is no need for a parser.
#<(meta|title|script|link)(?: .*?)?(?:/>|>(.*?)<(?:/\1)>)#is
If you use this in a function, you will have to write $tag_name in place of "meta|title|script|link".
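A minimal sketch of that substitution follows. It assumes the closing-tag part of the pattern is meant to be the backreference `/\1`, takes an HTML string rather than a URL so it runs offline, and reuses the `get_tag` name from the question; the return shape (tag name first, inner text second) is an assumption modeled on the output the question asked for.

```php
<?php
// Hypothetical wrapper around the pattern above. $tag_name may be a single
// name ('title') or an alternation ('meta|title|script|link'). It is placed
// into the pattern unescaped, so it must come from trusted code.
function get_tag(string $tag_name, string $html): array
{
    // Single quotes keep \1 as a literal regex backreference
    $pattern = '#<(' . $tag_name . ')(?: .*?)?(?:/>|>(.*?)<(?:/\1)>)#is';

    // PREG_SET_ORDER yields one array per matched tag
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $result = [];
    foreach ($matches as $m) {
        // Group 2 is absent for self-closing tags, so default to ''
        $result[] = [$m[1], $m[2] ?? ''];
    }
    return $result;
}

// Sample markup (an assumption, not the real Stack Overflow page)
$html = '<head><title>Stack Overflow</title>'
      . '<link rel="stylesheet" href="all.css"/></head>';

print_r(get_tag('title', $html));
```

Note that void tags written without a trailing slash, such as `<meta name="x">`, may not match as intended with this pattern, which is one reason the DOM-based answers are the safer choice for real pages.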