【js文件系列三】JS如何提取Pdf 中的图片和文字

13,889次阅读
没有评论

共计 2571 个字符,预计需要花费 7 分钟才能阅读完成。

前言

本文是 js 文件处理系列三,前两篇文章有介绍 js 文件处理,感兴趣的可以查看 导出 pdf 文件和 word/excel/pdf/ppt 在线预览,本文补充一下 js 提前 pdf 中的文字和图片的方法。

从 PDF 中提取文字 - 核心代码

其实核心代码还是利用了 pdf.js 这个库,之前上一篇文章也有提及这个库,主要可以做 pdfweb 端的预览。

文档地址:https://mozilla.github.io/pdf.js/api/draft/module-pdfjsLib-PDFPageProxy.html

/**
 * Retrieves the text of a specif page within a PDF Document obtained through pdf.js
 *
 * @param {Integer} pageNum Specifies the number of the page
 * @param {PDFDocument} PDFDocumentInstance The PDF document obtained
 **/
function getPageText(pageNum, PDFDocumentInstance) {
  // Return a Promise that is solved once the text of the page is retrieven
  return new Promise(function (resolve, reject) {PDFDocumentInstance.getPage(pageNum).then(function (pdfPage) {
      // The main trick to obtain the text of the PDF page, use the getTextContent method
      pdfPage.getTextContent().then(function (textContent) {
        var textItems = textContent.items;
        var finalString = '';

        // Concatenate the string of the item to the final string
        for (var i = 0; i 

从 PDF 中提取图片

核心代码如下:

// first here I open the document
pdf.getDocument('haorooms.pdf').promise.then(async function (pdfObj) {
  // because I am testing, I just wanted to get page 7
  const page = await pdfObj.getPage(7);

  // now I need to get the image information and for that I get the operator list
  const operators = await page.getOperatorList();

  // this is for the paintImageXObject one, there are other ones, like the paintJpegImage which I assume should work the same way, this gives me the whole list of indexes of where an img was inserted
  const rawImgOperator = operators.fnArray
    .map((f, index) => (f === pdf.OPS.paintImageXObject ? index : null))
    .filter((n) => n !== null);

  // now you need the filename, in this example I just picked the first one from my array, your array may be empty, but I knew for sure in page 7 there was an image... in your actual code you would use loops, such info is in the argsArray, the first arg is the filename, second arg is the width and height, but the filename will suffice here
  const filename = operators.argsArray[rawImgOperator[0]][0];

  // now we get the object itself from page.objs using the filename
  page.objs.get(filename, async (arg) => {
    // and here is where we need the canvas, the object contains information such as width and height
    const canvas = ccc.createCanvas(arg.width, arg.height);
    const ctx = canvas.getContext('2d');

    // now you need a new clamped array because the original one, may not contain rgba data, and when you insert you want to do so in rgba form, I think that a simple check of the size of the clamped array should work, if it's 3 times the size aka width*height*3 then it's rgb and shall be converted, if it's 4 times, then it's rgba and can be used as it is; in my case it had to be converted, and I think it will be the most common case
    const data = new Uint8ClampedArray(arg.width * arg.height * 4);
    let k = 0;
    let i = 0;
    while (i 

小结

本文主要介绍了 js 获取 pdf 中文本和图片的方法,其实 pdf 转 word 也是大致这个思路,主要获取文本和图片,放到 word 文档中。
本文主要是利用了 pdfjs 库,参考了 issue https://github.com/mozilla/pdf.js/issues/13541

    正文完
     0
    Yojack
    版权声明:本篇文章由 Yojack 于1970-01-01发表,共计2571字。
    转载说明:
    1 本网站名称:优杰开发笔记
    2 本站永久网址:https://yojack.cn
    3 本网站的文章部分内容可能来源于网络,仅供大家学习与参考,如有侵权,请联系站长进行删除处理。
    4 本站一切资源不代表本站立场,并不代表本站赞同其观点和对其真实性负责。
    5 本站所有内容均可转载及分享, 但请注明出处
    6 我们始终尊重原创作者的版权,所有文章在发布时,均尽可能注明出处与作者。
    7 站长邮箱:laylwenl@gmail.com
    评论(没有评论)