How can I convert a color PDF to black and white?

17

I want to convert a PDF containing some text and color images into another PDF that is only black and white, to reduce its size. I also want to keep the text as text, without converting the page elements to images. I tried the following command:

convert -density 150 -threshold 50% input.pdf output.pdf

which I found in another question, but it does exactly what I don't want: the text in the output is turned into a poor-quality image and is no longer selectable. I then tried Ghostscript:

gs      -sOutputFile=output.pdf \
        -q -dNOPAUSE -dBATCH -dSAFER \
        -sDEVICE=pdfwrite \
        -dCompatibilityLevel=1.3 \
        -dPDFSETTINGS=/screen \
        -dEmbedAllFonts=true \
        -dSubsetFonts=true \
        -sColorConversionStrategy=/Mono \
        -sColorConversionStrategyForImages=/Mono \
        -sProcessColorModel=/DeviceGray \
        $1

but it gives me the following error message:

./script.sh: 19: ./script.sh: output.pdf: not found

Is there another way to create the file?
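The requirement above (text must stay text) can be checked after any conversion attempt. Here is a minimal sketch, assuming `pdftotext` from poppler-utils is installed; `has_text_layer` is a hypothetical helper name, not part of any tool.

```shell
#!/bin/bash
# Sketch: succeed only if the PDF still contains extractable text.
# Assumption: pdftotext (poppler-utils) is available on PATH.
# has_text_layer is a hypothetical helper name.
has_text_layer() {
  # "-" sends the extracted text to stdout; any letter or digit counts as text.
  pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}
```

Called as `has_text_layer output.pdf && echo "text kept"`, it should fail for a file that `convert` has rasterized and succeed for a file that still carries its text.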

— BowPark

This looks very good: superuser.com/questions/200378/

— Mạnh

1

Related: unix.stackexchange.com/questions/84709/

— slm

Be careful with some of the Super User approaches: they convert the PDF to a rasterized version, so it is no longer vector graphics.

— slm

1

Is that the whole script you ran? It doesn't look like it. Can you post the whole script?

— terdon

22

gs example

The gs command you're running above has a trailing $1, which is typically meant for passing command-line arguments into a script. So I'm not sure what you actually tried, but I'm guessing that you put that command into a script, script.sh:

#!/bin/bash

gs      -sOutputFile=output.pdf \
        -q -dNOPAUSE -dBATCH -dSAFER \
        -sDEVICE=pdfwrite \
        -dCompatibilityLevel=1.3 \
        -dPDFSETTINGS=/screen \
        -dEmbedAllFonts=true \
        -dSubsetFonts=true \
        -sColorConversionStrategy=/Mono \
        -sColorConversionStrategyForImages=/Mono \
        -sProcessColorModel=/DeviceGray \
        $1

And ran it like this:

$ ./script.sh: 19: ./script.sh: output.pdf: not found

I'm not sure how you set this script up, but it needs to be executable.

$ chmod +x script.sh

Something is definitely not right with that script. When I tried it, I got this error:

Unrecoverable error: rangecheck in .putdeviceprops

An alternative

Instead of that script, I would use this one from the SU question.

#!/bin/bash

gs \
 -sOutputFile=output.pdf \
 -sDEVICE=pdfwrite \
 -sColorConversionStrategy=Gray \
 -dProcessColorModel=/DeviceGray \
 -dCompatibilityLevel=1.4 \
 -dNOPAUSE \
 -dBATCH \
 "$1"

Then run it like this:

$ ./script.bash LeaseContract.pdf 
GPL Ghostscript 8.71 (2010-02-10)
Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 2.
Page 1
Page 2
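If you would rather not hard-code output.pdf, the same switches can be wrapped in a function that takes the input and, optionally, the output name. This is only a sketch: `to_gray` is a hypothetical helper, and the default `-gray.pdf` suffix is an assumed naming convention. The gs switches are exactly the ones used in the script above.

```shell
#!/bin/bash
# Sketch: grayscale conversion with explicit input and output names.
# to_gray is a hypothetical helper; the gs switches are the ones shown above.
to_gray() {
  if [ "$#" -lt 1 ] || [ "$#" -gt 2 ]; then
    echo "usage: to_gray INPUT.pdf [OUTPUT.pdf]" >&2
    return 2
  fi
  local in=$1
  # Assumed default: input.pdf becomes input-gray.pdf.
  local out=${2:-"${in%.pdf}-gray.pdf"}
  gs \
    -sOutputFile="$out" \
    -sDEVICE=pdfwrite \
    -sColorConversionStrategy=Gray \
    -dProcessColorModel=/DeviceGray \
    -dCompatibilityLevel=1.4 \
    -dNOPAUSE \
    -dBATCH \
    "$in"
}
```

For example, `to_gray "Lease Contract.pdf"` would write `Lease Contract-gray.pdf`; quoting the arguments keeps file names with spaces in one piece.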

— SLM

You're right, something is wrong with the script: the "something" in this case is sProcessColorModel, which should be dProcessColorModel instead.

— Sora.

8

I found a script here that can do this. It requires gs, which you seem to have, but also pdftk. You haven't mentioned your distribution, but on Debian-based systems you should be able to install it with

sudo apt-get install pdftk

You can find an RPM for it here.

Once you have installed pdftk, save the script as graypdf.sh and run it like so:

./graypdf.sh input.pdf

It will create a file called input-gray.pdf. I am including the whole script here to avoid link rot:

#!/bin/bash
# convert pdf to grayscale, preserving metadata
# "AFAIK graphicx has no feature for manipulating colorspaces. " http://groups.google.com/group/latexusersgroup/browse_thread/thread/5ebbc3ff9978af05
# "> Is there an easy (or just standard) way with pdflatex to do a > conversion from color to grayscale when a PDF file is generated? No." ... "If you want to convert a multipage document then you better have pdftops from the xpdf suite installed because Ghostscript's pdf to ps doesn't produce nice Postscript." http://osdir.com/ml/tex.pdftex/2008-05/msg00006.html
# "Converting a color EPS to grayscale" - http://en.wikibooks.org/wiki/LaTeX/Importing_Graphics
# "\usepackage[monochrome]{color} .. I don't know of a neat automatic conversion to monochrome (there might be such a thing) although there was something in Tugboat a while back about mapping colors on the fly. I would probably make monochrome versions of the pictures, and name them consistently. Then conditionally load each one" http://newsgroups.derkeiler.com/Archive/Comp/comp.text.tex/2005-08/msg01864.html
# "Here comes optional.sty. By adding \usepackage{optional} ... \opt{color}{\includegraphics[width=0.4\textwidth]{intro/benzoCompounds_color}} \opt{grayscale}{\includegraphics[width=0.4\textwidth]{intro/benzoCompounds}} " - http://chem-bla-ics.blogspot.com/2008/01/my-phd-thesis-in-color-and-grayscale.html
# with gs:
# http://handyfloss.net/2008.09/making-a-pdf-grayscale-with-ghostscript/
# note - this strips metadata! so:
# http://etutorials.org/Linux+systems/pdf+hacks/Chapter+5.+Manipulating+PDF+Files/Hack+64+Get+and+Set+PDF+Metadata/
COLORFILENAME=$1
OVERWRITE=$2
FNAME=${COLORFILENAME%.pdf}
# NOTE: pdftk does not work with logical page numbers / pagination;
# gs kills it as well;
# so check for existence of 'pdfmarks' file in calling dir;
# if there, use it to correct gs logical pagination
# for example, see
# http://askubuntu.com/questions/32048/renumber-pages-of-a-pdf/65894#65894
PDFMARKS=
if [ -e pdfmarks ] ; then
  PDFMARKS="pdfmarks"
  echo "$PDFMARKS exists, using..."
  # convert to gray pdf - this strips metadata!
  gs -sOutputFile="$FNAME-gs-gray.pdf" -sDEVICE=pdfwrite \
    -sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray \
    -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH "$COLORFILENAME" "$PDFMARKS"
else # not really needed ?!
  gs -sOutputFile="$FNAME-gs-gray.pdf" -sDEVICE=pdfwrite \
    -sColorConversionStrategy=Gray -dProcessColorModel=/DeviceGray \
    -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH "$COLORFILENAME"
fi
# dump metadata from original color pdf
## pdftk $COLORFILENAME dump_data output $FNAME.data.txt
# also: pdfinfo -meta $COLORFILENAME
# grep to avoid BookmarkTitle/Level/PageNumber:
# (dump_data writes to stdout when no output file is given)
pdftk "$COLORFILENAME" dump_data | grep 'Info\|Pdf' > "$FNAME.data.txt"
# "pdftk can take a plain-text file of these same key/value pairs and update a PDF's Info dictionary to match. Currently, it does not update the PDF's XMP stream."
pdftk "$FNAME-gs-gray.pdf" update_info "$FNAME.data.txt" output "$FNAME-gray.pdf"
# (http://wiki.creativecommons.org/XMP_Implementations : Exempi ... allows reading/writing XMP metadata for various file formats, including PDF ... )
# clean up
rm "$FNAME-gs-gray.pdf"
rm "$FNAME.data.txt"
if [ "$OVERWRITE" = "y" ] ; then
  echo "Overwriting $COLORFILENAME..."
  mv "$FNAME-gray.pdf" "$COLORFILENAME"
fi
# BUT NOTE:
# Mixing TEX & PostScript : The GEX Model - http://www.tug.org/TUGboat/Articles/tb21-3/tb68kost.pdf
# VTEX is a (commercial) extended version of TEX, sold by MicroPress, Inc. Free versions of VTEX have recently been made available, that work under OS/2 and Linux. This paper describes GEX, a fast fully-integrated PostScript interpreter which functions as part of the VTEX code-generator. Unless specified otherwise, this article describes the functionality in the free- ware version of the VTEX compiler, as available on CTAN sites in systems/vtex.
# GEX is a graphics counterpart to TEX. .. Since GEX may exercise subtle influence on TEX (load fonts, or change TEX registers), GEX is op- tional in VTEX implementations: the default oper- ation of the program is with GEX off; it is enabled by a command-line switch.
# \includegraphics[width=1.3in, colorspace=grayscale 256]{macaw.jpg}
# http://mail.tug.org/texlive/Contents/live/texmf-dist/doc/generic/FAQ-en/html/FAQ-TeXsystems.html
# A free version of the commercial VTeX extended TeX system is available for use under Linux, which among other things specialises in direct production of PDF from (La)TeX input. Sadly, it's no longer supported, and the ready-built images are made for use with a rather ancient Linux kernel.
# NOTE: another way to capture metadata; if converting via ghostscript:
# http://compgroups.net/comp.text.pdf/How-to-specify-metadata-using-Ghostscript
# first:
# grep -a 'Keywo' orig.pdf
# /Author(xxx)/Title(ttt)/Subject()/Creator(LaTeX)/Producer(pdfTeX-1.40.12)/Keywords(kkkk)
# then - copy this data in a file prologue.ini:
#/pdfmark where {pop} {userdict /pdfmark /cleartomark load put} ifelse
#[/Author(xxx)
#/Title(ttt)
#/Subject()
#/Creator(LaTeX with hyperref package + gs w/ prologue)
#/Producer(pdfTeX-1.40.12)
#/Keywords(kkkk)
#/DOCINFO pdfmark
#
# finally, call gs on the orig file,
# asking to process pdfmarks in prologue.ini:
# gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
# -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -dDOPDFMARKS \
# -sOutputFile=out.pdf in.pdf prologue.ini
# then the metadata will be in output too (which is stripped otherwise;
# note bookmarks are preserved, however).
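Because the script above depends on both gs and pdftk, it can help to check for them before running it. A minimal sketch, where `missing_tools` is a hypothetical helper name and not part of the original script:

```shell
#!/bin/bash
# Sketch: print the name of every given command that is not found on PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Running `missing_tools gs pdftk` prints nothing when both tools are installed; any name it prints needs to be installed first.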

— terdon

3

I also had some scanned color PDFs and grayscale PDFs that I wanted to convert to B&W. I tried using gs with the code listed here, and the image quality is good with the PDF text still in place. However, that gs code only converts to grayscale (as asked in the question) and the file size is still large. convert yields very poor results when used directly.

I wanted B&W PDFs with good image quality and a small file size. I tried terdon's solution, but I could not install pdftk on CentOS 7 using yum (at the time of writing).

My solution uses gs to extract grayscale bmp files from the PDF, convert to threshold those bmps to B&W and save them as tiff files, and then img2pdf to compress the tiff images and merge them all into one PDF.

I tried going directly from PDF to tiff, but the quality was not the same, so I save each page to bmp. For a one-page PDF file, convert does a great job going from bmp to PDF. Example:

gs -sDEVICE=bmpgray -dNOPAUSE -dBATCH -r300x300 \
   -sOutputFile=./pdf_image.bmp ./input.pdf

convert ./pdf_image.bmp -threshold 40% -compress zip ./bw_out.pdf

For multiple pages, gs can merge several PDF files into one, but img2pdf yields a smaller file size than gs. The tiff files must be uncompressed as input to img2pdf. Keep in mind that for a large number of pages, the intermediate bmp and tiff files tend to be large. pdftk or joinpdf would be better if they can merge the compressed PDF files from convert.

I imagine there is a more elegant solution. However, my method produces results with very good image quality and a much smaller file size. To get the text back in the B&W PDF, run OCR again.

My shell script uses gs, convert, and img2pdf. Change the parameters (number of pages, scan dpi, threshold %, etc.) listed at the beginning as needed, and run chmod +x ./pdf2bw.sh. Here is the full script (pdf2bw.sh):

#!/bin/bash

num_pages=12
dpi_res=300
input_pdf_name=color_or_grayscale.pdf
bw_threshold=40%
output_pdf_name=out_bw.pdf
#-------------------------------------------------------------------------
gs -sDEVICE=bmpgray -dNOPAUSE -dBATCH -q -r$dpi_res \
   -sOutputFile=./%d.bmp ./$input_pdf_name
#-------------------------------------------------------------------------
for file_num in `seq 1 $num_pages`
do
  convert ./$file_num.bmp -threshold $bw_threshold \
          ./$file_num.tif
done
#-------------------------------------------------------------------------
input_files=""

for file_num in `seq 1 $num_pages`
do
  input_files+="./$file_num.tif "
done

img2pdf -o ./$output_pdf_name --dpi $dpi_res $input_files
#-------------------------------------------------------------------------
# clean up bmp and tif files used in conversion

for file_num in `seq 1 $num_pages`
do
  rm ./$file_num.bmp
  rm ./$file_num.tif
done
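The num_pages value in the script above has to be edited by hand for every document. One possible way around that, sketched here under the assumption that `pdfinfo` from poppler-utils is installed, is to read the page count from its output; `pages_from_info` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: extract the page count from pdfinfo output read on stdin.
# Assumption: pdfinfo (poppler-utils) is available to produce that output.
# pages_from_info is a hypothetical helper name.
pages_from_info() {
  # pdfinfo prints a line such as "Pages:          12"; keep the number.
  awk '/^Pages:/ { print $2; exit }'
}
```

In pdf2bw.sh this could replace the hard-coded value with `num_pages=$(pdfinfo "$input_pdf_name" | pages_from_info)`.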

— OccamsRazor

1

RHEL6 and RHEL5, which both baseline Ghostscript at 8.7, could not use the forms of the command given above. Assuming a script or function that expects the PDF file as its first argument "$1", the following should be more portable:

gs \
    -sOutputFile="grey_$1" \
    -sDEVICE=pdfwrite \
    -sColorConversionStrategy=Mono \
    -sColorConversionStrategyForImages=/Mono \
    -dProcessColorModel=/DeviceGray \
    -dCompatibilityLevel=1.3 \
    -dNOPAUSE -dBATCH \
    "$1"

Here the output file will be prefixed with "grey_".

RHEL6 and 5 can use CompatibilityLevel=1.4, which is much faster, but I was aiming for portability.
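One caveat with "grey_$1": if the argument contains a directory (for example docs/file.pdf), the prefix ends up on the directory part rather than on the file name. A small sketch of one way to handle that; `grey_name` is a hypothetical helper name, not part of the command above.

```shell
#!/bin/bash
# Sketch: put the "grey_" prefix on the file name, not on the whole path.
# grey_name is a hypothetical helper name.
grey_name() {
  local dir base
  dir=$(dirname "$1")    # "docs" for docs/file.pdf, "." for file.pdf
  base=$(basename "$1")  # "file.pdf" in both cases
  printf '%s/grey_%s\n' "$dir" "$base"
}
```

The gs call would then use `-sOutputFile="$(grey_name "$1")"` in place of `-sOutputFile="grey_$1"`.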

— Rich

The devs say (1, 2, 3, 4) that there is no sColorConversionStrategyForImages switch.

— Igor

Thanks, @Igor - I have no idea where I got that snippet from! I know for a fact that I tested it and that it worked at the time. (And that, folks, is why you should always provide references for your code.)

— Rich

1

That "bogus parameter" seems to be extremely popular on the web. GS ignores unknown switches (which is sad), so the command still works.

— Igor

1

I get reliable results cleaning up scanned PDFs for good contrast with this script:

#!/bin/bash
# 
# $ sudo apt install poppler-utils img2pdf pdftk imagemagick
#
# Output is still greyscale, but lots of scanner light tone fuzz removed.
#

pdfimages "$1" pages

ls ./pages*.ppm | xargs -L1 -I {} convert {}  -quality 100 -density 400 \
  -fill white -fuzz 80% -auto-level -depth 4 +opaque "#000000" {}.jpg

ls -1 ./pages*jpg | xargs -L1 -I {} img2pdf {} -o {}.pdf

pdftk pages*.pdf cat output "${1%.pdf}_bw.pdf"

rm pages*

— Bijou Smith