August 20, 2020

Parsing PDF files with PHP: parse PDF files and extract elements like text

By Gideon

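PdfParser is a standalone PHP library that provides various tools for extracting data from PDF files. Secured (encrypted) documents are currently not supported. The library is still under active development.

As a result, users should expect backward-compatibility (BC) breaks when tracking the master version. The project is supported by Actualys.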

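Prerequisites

The library requires PHP 5.3 or later.
PDFParser is built on top of the TCPDF parser.
That dependency is downloaded automatically through the Composer command line.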

Installation

Using Composer

Add PDFParser to your composer.json file:

{
    "require": {
        "smalot/pdfparser": "*"
    }
}

Now ask Composer to download the package by running the following command:

$ composer update smalot/pdfparser

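As a standalone library

First, download the library from GitHub, either by choosing a specific release or by taking it directly from master.

Once that is done, unzip it and run the following Composer command in the extracted folder.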

$ composer update

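This command downloads all dependencies (the Atoum library) and creates the "autoload.php" file.

Now create a new file in the same folder with the following content: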

<?php
 
// Include 'Composer' autoloader.
include 'vendor/autoload.php';
 
// Your code
// ...
 
?>

Unit tests with Atoum

Run the Atoum unit tests (with code coverage if xdebug is installed):

$ vendor/bin/atoum -d vendor/smalot/pdfparser/src/Smalot/PdfParser/Tests/

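Once the command finishes, the "coverage/" folder will contain HTML pages with the code coverage summary.

Usage

This example parses an entire PDF file and extracts the text from every page.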

<?php
 
// Include Composer autoloader if not already done.
include 'vendor/autoload.php';
 
// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile('document.pdf');
 
$text = $pdf->getText();
echo $text;
 
?>
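
Because secured documents are not supported, it can help to guard the parse call. The following is only a minimal sketch: it assumes the parser reports failure by throwing an exception, and extractTextOrNull() is a hypothetical helper name, not part of PdfParser.

```php
<?php

/**
 * Parse a PDF and return its text, or null when parsing fails.
 * Hypothetical helper, not part of PdfParser. $parser is expected to be
 * a \Smalot\PdfParser\Parser instance (anything with parseFile() works).
 */
function extractTextOrNull($parser, $path)
{
    try {
        return $parser->parseFile($path)->getText();
    } catch (\Exception $e) {
        // Assumption: failures such as a secured document surface as exceptions.
        error_log('Could not parse ' . $path . ': ' . $e->getMessage());
        return null;
    }
}

// Usage, assuming the Composer autoloader is already included:
// $text = extractTextOrNull(new \Smalot\PdfParser\Parser(), 'document.pdf');
// echo $text === null ? "No text extracted\n" : $text;
```

Returning null keeps the calling code simple when many files are processed in a loop; rethrowing the exception may be the better choice when a failed parse should stop the script.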

You can also extract the text page by page, or from one specific page only.

<?php
 
// Include Composer autoloader if not already done.
include 'vendor/autoload.php';
 
// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile('document.pdf');
 
// Retrieve all pages from the pdf file.
$pages  = $pdf->getPages();
 
// Loop over each page to extract text.
foreach ($pages as $page) {
    echo $page->getText();
}
 
?>
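
The loop above covers every page. For a single page, something like the sketch below may be enough. It assumes getPages() returns a zero-indexed array of page objects, and getPageText() is a hypothetical helper introduced here for illustration, not a PdfParser function.

```php
<?php

/**
 * Return the text of one page, addressed by a 1-based page number,
 * or null when the document has no such page.
 * Hypothetical helper, not part of PdfParser.
 */
function getPageText(array $pages, $pageNumber)
{
    // Assumption: $pages is zero-indexed, so page 1 lives at index 0.
    $index = $pageNumber - 1;
    if (!isset($pages[$index])) {
        return null;
    }
    return $pages[$index]->getText();
}

// Usage, assuming $pdf was built with $parser->parseFile() as shown earlier:
// $pages = $pdf->getPages();
// echo getPageText($pages, 2);
```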

Here is sample code that extracts metadata from the document (Author, Creator, CreationDate, etc.).

<?php
 
// Include Composer autoloader if not already done.
include 'vendor/autoload.php';
 
// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile('document.pdf');
 
// Retrieve all details from the pdf file.
$details  = $pdf->getDetails();
 
// Loop over each property to extract values (string or array).
foreach ($details as $property => $value) {
    if (is_array($value)) {
        $value = implode(', ', $value);
    }
    echo $property . ' => ' . $value . "\n";
}
 
?>
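
Not every PDF carries every metadata property, so reading a single value usually needs a fallback. A minimal sketch, assuming getDetails() returns an associative array whose values are strings or arrays as in the loop above; getDetail() is a hypothetical helper, not part of PdfParser.

```php
<?php

/**
 * Read one metadata property from the details array.
 * Returns $default when the property is missing and joins array values
 * into a comma-separated string.
 * Hypothetical helper, not part of PdfParser.
 */
function getDetail(array $details, $property, $default = '')
{
    if (!array_key_exists($property, $details)) {
        return $default;
    }
    $value = $details[$property];
    return is_array($value) ? implode(', ', $value) : $value;
}

// Usage, assuming $details comes from $pdf->getDetails():
// echo getDetail($details, 'Author', 'unknown author');
```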

Project page: https://github.com/smalot/pdfparser
