March 27, 2019

PHP Web Crawler: Fetching Web Pages, Parsing HTML and HTML Page Elements in PHP, Getting Started with the DiDOM Parser (web crawler)

By Gideon Php / Pear / Mysql / Node.js

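I had been using Simple HTML DOM (see "PHP Web Crawler: Getting Started with the Simple HTML DOM Parser (web crawler)"), but I recently found a parser that is nicer to use: DiDOM.

Home page: https://github.com/Imangazaliev/DiDOM

Installation

To install DiDOM, run the following command:

composer require imangazaliev/didom

If you are not familiar with Composer, see: "PHP: Composer Dependency Management, Composer Cheat Sheet for Developers, Installation and Usage".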

Quick start

use DiDom\Document;

$document = new Document('http://www.news.com/', true);

$posts = $document->find('.post');

foreach($posts as $post) {
    echo $post->text(), "\n";
}

Creating a new document

DiDOM lets you load HTML in several ways:

Using the constructor

// the first parameter is a string with HTML
$document = new Document($html);

// file path
$document = new Document('page.html', true);

// or URL
$document = new Document('http://www.example.com/', true);

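The second parameter indicates whether the first argument should be loaded as a file (a path or URL). The default is false.

Signature:

__construct($string = null, $isFile = false, $encoding = 'UTF-8', $type = Document::TYPE_HTML)

$string – an HTML or XML string, or a file path.

$isFile – indicates that the first parameter is a path to a file.

$encoding – the document encoding.

$type – the document type (HTML – Document::TYPE_HTML, XML – Document::TYPE_XML).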
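To make the four constructor parameters concrete, here is a minimal sketch that parses an XML string instead of HTML. The sample XML and the `vendor/autoload.php` path are assumptions for illustration (a default Composer install in the current directory), not something taken from the DiDOM documentation.

```php
<?php
// Assumes DiDOM was installed with Composer in the current directory.
require 'vendor/autoload.php';

use DiDom\Document;

// Hypothetical sample data for illustration.
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . '<library><book>PHP Basics</book><book>XPath in Practice</book></library>';

// $isFile = false: the first argument is the markup itself, not a path.
// Document::TYPE_XML: parse with the XML parser instead of the HTML one.
$document = new Document($xml, false, 'UTF-8', Document::TYPE_XML);

$books = $document->find('book');

// Print the text of every <book> element, one per line.
foreach ($books as $book) {
    echo $book->text(), "\n";
}
```

Passing the type explicitly matters mainly for XML: with the default Document::TYPE_HTML the same string would go through the HTML parser.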

Using separate methods

$document = new Document();

$document->loadHtml($html);

$document->loadHtmlFile('page.html');

$document->loadHtmlFile('http://www.example.com/');

Two methods are available for loading XML: loadXml and loadXmlFile.

These methods accept additional libxml options:

$document->loadHtml($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$document->loadHtmlFile($url, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$document->loadXml($xml, LIBXML_PARSEHUGE);
$document->loadXmlFile($url, LIBXML_PARSEHUGE);

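Searching for elements

DiDOM accepts a CSS selector or an XPath expression as the search expression. Pass the expression as the first parameter and specify its type in the second (the default type is Query::TYPE_CSS):

Method `find()`: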

use DiDom\Document;
use DiDom\Query;

...

// CSS selector
$posts = $document->find('.post');

// XPath
$posts = $document->find("//div[contains(@class, 'post')]", Query::TYPE_XPATH);

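If elements matching the given expression are found, the method returns an array of DiDom\Element instances; otherwise it returns an empty array. You can also get an array of DOMElement objects instead: pass false as the third parameter.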
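The difference between the two return types can be sketched as follows. The HTML snippet and the `vendor/autoload.php` path are assumptions for illustration (a default Composer install in the current directory).

```php
<?php
// Assumes DiDOM was installed with Composer in the current directory.
require 'vendor/autoload.php';

use DiDom\Document;
use DiDom\Element;
use DiDom\Query;

// Hypothetical markup for illustration.
$document = new Document('<div class="post">First</div><div class="post">Second</div>');

// Default: each match is wrapped in a DiDom\Element.
$wrapped = $document->find('.post');

// Third argument false: matches come back as native DOMElement objects.
$raw = $document->find('.post', Query::TYPE_CSS, false);

var_dump($wrapped[0] instanceof Element);
var_dump($raw[0] instanceof DOMElement);

// A native node exposes the standard DOM API, e.g. textContent.
echo $raw[0]->textContent, "\n";
```

The unwrapped form is mostly useful when the nodes are handed to code that already works with PHP's DOM extension.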

Using the magic method `__invoke()`:

$posts = $document('.post');

Warning: using this method is discouraged, because it may be removed in the future.

Method `xpath()`:

$posts = $document->xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' post ')]");
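The XPath above is longer than the `contains(@class, 'post')` form used earlier because it matches `post` only as a whole class token. The sketch below shows the difference using PHP's built-in DOM extension alone, so it runs without DiDOM; the sample markup is an assumption for illustration.

```php
<?php
// Plain ext-dom, no DiDOM required: compare two XPath class tests.
// Hypothetical markup: one exact class, one look-alike, one multi-class element.
$html = '<div class="post">A</div>'
      . '<div class="post-footer">B</div>'
      . '<div class="featured post">C</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Substring test: also matches class="post-footer".
$loose = $xpath->query("//div[contains(@class, 'post')]");

// Token test: pads the class list with spaces and looks for " post ".
$exact = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' post ')]"
);

echo $loose->length, "\n"; // 3
echo $exact->length, "\n"; // 2
```

The token test is the usual XPath equivalent of the CSS selector `.post`.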

You can search within an element:

echo $document->find('nav')[0]->first('ul.menu')->xpath('//li')[0]->text();

Checking whether an element exists

Use the has() method to check whether an element exists:

if ($document->has('.post')) {
    // code
}

If you need to check whether an element exists and then get it:

if ($document->has('.post')) {
    $elements = $document->find('.post');
    // code
}

However, it is faster to do it this way:

if (count($elements = $document->find('.post')) > 0) {
    // code
}

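because the first version runs the query twice.

Searching within an element

The methods find(), first(), xpath(), has(), and count() are also available on Element.

Example: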

echo $document->find('nav')[0]->first('ul.menu')->xpath('//li')[0]->text();

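Method `findInDocument()`

If you change, replace, or remove an element that was found inside another element, the document is not changed. This is because the find() method of the Element class (and, likewise, the first() and xpath() methods) creates a new document to search in.

To search for elements in the source document, you must use the methods findInDocument() and firstInDocument():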

// nothing will happen
$document->first('head')->first('title')->remove();

// but this will do
$document->first('head')->firstInDocument('title')->remove();

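Warning: the methods findInDocument() and firstInDocument() work only for elements that belong to a document, and for elements created via new Element(...). If the element does not belong to a document, a LogicException is thrown.

Supported selectors

DiDOM supports searching by:

tag
class, ID, name and value of an attribute
pseudo-classes:
- first-, last-, and nth-child
- empty and not-empty
- contains
- has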

// all links
$document->find('a');

// any element with id = "foo" and "bar" class
$document->find('#foo.bar');

// any element with attribute "name"
$document->find('[name]');
// the same as
$document->find('*[name]');

// input field with the name "foo"
$document->find('input[name=foo]');
$document->find('input[name=\'bar\']');
$document->find('input[name="baz"]');

// any element that has an attribute starting with "data-" and the value "foo"
$document->find('*[^data-=foo]');

// all links starting with https
$document->find('a[href^=https]');

// all images with the extension png
$document->find('img[src$=png]');

// all links containing the string "example.com"
$document->find('a[href*=example.com]');

// text of the links with "foo" class
$document->find('a.foo::text');

// address and title of all the fields with "bar" class
$document->find('a.bar::attr(href|title)');

Output

Getting HTML

Method `html()`

$posts = $document->find('.post');

echo $posts[0]->html();

Casting to string:

$html = (string) $posts[0];

Formatting HTML output

$html = $document->format()->html();

An element has no format() method, so if you need the formatted HTML of an element, first convert it to a document:

$html = $element->toDocument()->format()->html();

Inner HTML

$innerHtml = $element->innerHtml();

Document has no innerHtml() method, so if you need the inner HTML of a document, first convert it to an element:

$innerHtml = $document->toElement()->innerHtml();

Getting XML

echo $document->xml();

echo $document->first('book')->xml();

Getting content

$posts = $document->find('.post');

echo $posts[0]->text();

Creating a new element

Creating an instance of the class

use DiDom\Element;

$element = new Element('span', 'Hello');

// Outputs "<span>Hello</span>"
echo $element->html();

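The first parameter is the name of the element (its tag), the second is its value (optional), and the third is the element's attributes (optional).

An example of creating an element with attributes: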

$attributes = ['name' => 'description', 'placeholder' => 'Enter description of item'];

$element = new Element('textarea', 'Text', $attributes);

An element can also be created from an instance of the DOMElement class:

use DiDom\Element;
use DOMElement;

$domElement = new DOMElement('span', 'Hello');

$element = new Element($domElement);

Using the `createElement` method

$document = new Document($html);

$element = $document->createElement('span', 'Hello');

Getting the name of an element

$element->tag;

Getting the parent element

$document = new Document($html);

$input = $document->find('input[name=email]')[0];

var_dump($input->parent());

Getting sibling elements

$document = new Document($html);

$item = $document->find('ul.menu > li')[1];

var_dump($item->previousSibling());

var_dump($item->nextSibling());

Getting child elements

$html = '<div>Foo<span>Bar</span><!--Baz--></div>';

$document = new Document($html);

$div = $document->first('div');

// element node (DOMElement)
// string(3) "Bar"
var_dump($div->child(1)->text());

// text node (DOMText)
// string(3) "Foo"
var_dump($div->firstChild()->text());

// comment node (DOMComment)
// string(3) "Baz"
var_dump($div->lastChild()->text());

// array(3) { ... }
var_dump($div->children());

Getting the owner document

$document = new Document($html);

$element = $document->find('input[name=email]')[0];

$document2 = $element->getDocument();

// bool(true)
var_dump($document->is($document2));

Working with element attributes

Creating/updating an attribute

Method `setAttribute`:

$element->setAttribute('name', 'username');

Method `attr`:

$element->attr('name', 'username');

Using the magic method `__set`:

$element->name = 'username';

Getting the value of an attribute

Method `getAttribute`:

$username = $element->getAttribute('value');

Method `attr`:

$username = $element->attr('value');

Using the magic method `__get`:

$username = $element->name;

Returns null if the attribute is not found.

Checking whether an attribute exists

Method `hasAttribute`:

if ($element->hasAttribute('name')) {
    // code
}

Using the magic method `__isset`:

if (isset($element->name)) {
    // code
}

Removing an attribute:

Method `removeAttribute`:

$element->removeAttribute('name');

Using the magic method `__unset`:

unset($element->name);

Comparing elements

$element  = new Element('span', 'hello');
$element2 = new Element('span', 'hello');

// bool(true)
var_dump($element->is($element));

// bool(false)
var_dump($element->is($element2));

Appending child elements

$list = new Element('ul');

$item = new Element('li', 'Item 1');

$list->appendChild($item);

$items = [
    new Element('li', 'Item 2'),
    new Element('li', 'Item 3'),
];

$list->appendChild($items);

Adding a child element

$list = new Element('ul');

$item = new Element('li', 'Item 1');
$items = [
    new Element('li', 'Item 2'),
    new Element('li', 'Item 3'),
];

$list->appendChild($item);
$list->appendChild($items);

Replacing an element

$element = new Element('span', 'hello');

$document->find('.post')[0]->replace($element);

Warning: you can replace only those elements that were found directly in the document:

// nothing will happen
$document->first('head')->first('title')->replace($title);

// but this will do
$document->first('head title')->replace($title);

More about this in section Search for elements.

Removing an element

$document->find('.post')[0]->remove();

Warning: you can remove only those elements that were found directly in the document:

// nothing will happen
$document->first('head')->first('title')->remove();

// but this will do
$document->first('head title')->remove();

More about this in section Search for elements.

Working with the cache

The cache is an array of XPath expressions that were converted from CSS selectors.

Getting from the cache

use DiDom\Query;

...

$xpath    = Query::compile('h2');
$compiled = Query::getCompiled();

// array('h2' => '//h2')
var_dump($compiled);

Setting the cache

Query::setCompiled(['h2' => '//h2']);

Miscellaneous

`preserveWhiteSpace`

By default, whitespace preserving is disabled.

You can enable the preserveWhiteSpace option before loading the document:

$document = new Document();

$document->preserveWhiteSpace();

$document->loadXml($xml);

`count`

The count() method counts children that match the selector:

// prints the number of links in the document
echo $document->count('a');

// prints the number of items in the list
echo $document->first('ul')->count('li');

`matches`

Returns true if the node matches the selector:

$element->matches('div#content');

// strict match
// returns true if the element is a div with id equals content and nothing else
// if the element has any other attributes the method returns false
$element->matches('div#content', true);

`isElementNode`

Checks whether an element is an element (DOMElement):

$element->isElementNode();

`isTextNode`

Checks whether an element is a text node (DOMText):

$element->isTextNode();

`isCommentNode`

Checks whether the element is a comment (DOMComment):

$element->isCommentNode();
