Word document format conversion tools on Linux
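I recently got a requirement to convert documents in several formats (.doc, .docx and .wps) into one uniform format: either all to .docx, or straight to .html or .txt. After some research, I found the following tools: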
win32com
python-docx
pydocx
…
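There may be others, which I won't go through here. After surveying them, I found that these tools share a problem: the Python tools either cannot handle .doc (only .docx) or must run on Windows (win32com, for example). Most production environments run Linux, and switching to a Windows server causes a chain of knock-on problems, such as whether the other libraries remain compatible, which is a lot of trouble. So finding a way to process .doc/.wps on Linux matters a great deal. Fortunately I found one in the end: LibreOffice.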
How to use LibreOffice
First, run libreoffice --version on the command line to check whether the tool is already installed. If it is not, see the section on installing LibreOffice below.
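If you would rather do the installation check from Python, here is a minimal sketch. The helper names find_libreoffice and libreoffice_version are my own, and I am assuming the binary is on PATH under the name libreoffice or soffice; adjust the candidates for your system.

```python
import shutil
import subprocess


def find_libreoffice(candidates=("libreoffice", "soffice")):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def libreoffice_version(executable, timeout=60):
    """Run `<executable> --version` and return whatever it prints, stripped."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return result.stdout.strip()
```

Call find_libreoffice() first; if it returns None, the tool is not installed (or not on PATH), otherwise pass the returned path to libreoffice_version().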
Once it is installed, use the following command to convert your documents. For example,
to convert a .doc file to txt:
libreoffice --headless --convert-to txt path-to-your-doc.doc
You can also specify an output directory for the converted files, and pass doc/docx/wps files to LibreOffice in batches:
libreoffice --headless --convert-to html --outdir /your/output/dir /your/doc_docx_wps/files/*.{docx,doc,wps}
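The same batch command can be assembled in Python without relying on shell brace expansion. This is only a sketch: collect_inputs and build_convert_command are hypothetical helper names, and the block builds the argument list without running anything.

```python
from pathlib import Path


def collect_inputs(src_dir, extensions=(".doc", ".docx", ".wps")):
    """Return the files in src_dir whose suffix matches, case-insensitively."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        p for p in Path(src_dir).iterdir()
        if p.is_file() and p.suffix.lower() in wanted
    )


def build_convert_command(files, target="html", outdir=None,
                          executable="libreoffice"):
    """Build the argument list for one headless batch conversion."""
    cmd = [executable, "--headless", "--convert-to", target]
    if outdir is not None:
        cmd += ["--outdir", str(outdir)]
    cmd += [str(f) for f in files]
    return cmd
```

The returned list can be handed directly to subprocess.run; because it is a list rather than a shell string, file names with spaces need no quoting.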
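Running the conversion from a Python script
There is nothing mysterious about this; it simply runs the command line from Python:
import os
# Run the same command as above through the shell
os.system("libreoffice --headless --convert-to txt path-to-your-doc.doc")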
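A slightly more defensive variant uses subprocess.run, so a failed or hung conversion does not pass unnoticed. Treat this as a sketch under my own assumptions: the function name convert, the 120-second timeout and the injectable runner parameter (there only so the function can be tried without LibreOffice) are not from the original workflow. The output-file check is an extra safeguard, since I would not rely on the exit code alone.

```python
import subprocess
from pathlib import Path


def convert(src, target="txt", outdir=".", executable="libreoffice",
            timeout=120, runner=subprocess.run):
    """Convert one file and return the path of the expected output file.

    target may carry a filter name after a colon (for example "txt:Text");
    only the part before the colon is used as the output extension.
    """
    src = Path(src)
    outdir = Path(outdir)
    cmd = [executable, "--headless", "--convert-to", target,
           "--outdir", str(outdir), str(src)]
    # check=True raises CalledProcessError on a non-zero exit code
    runner(cmd, check=True, timeout=timeout)
    expected = outdir / f"{src.stem}.{target.split(':')[0]}"
    if not expected.exists():
        raise FileNotFoundError(f"conversion produced no output: {expected}")
    return expected
```

For example, convert("path-to-your-doc.doc", target="txt") would return the path path-to-your-doc.txt in the current directory if the conversion succeeded.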
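Of course, if a single process is too slow for you, you can also launch several conversions in parallel from Python:
import glob
import os
import subprocess
from multiprocessing.dummy import Pool  # thread pool; each thread starts one LibreOffice process

def worker(fname, dstdir=os.path.expanduser("~")):
    # Convert one file to PDF; the output lands in dstdir, the working directory
    subprocess.call(["libreoffice", "--headless", "--convert-to", "pdf", fname], cwd=dstdir)

pool = Pool()
pool.map(worker, glob.iglob(os.path.join(os.path.expanduser("~"), "*.doc")))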
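One caveat with parallel runs: as far as I understand, LibreOffice instances that share the same user profile can get in each other's way. A common workaround is to give every run its own throwaway profile through the -env:UserInstallation option. The sketch below shows that idea; the function names, the default of 4 workers and the runner parameter are my own assumptions, not part of the original script.

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_isolated_command(src, target, outdir, profile_dir,
                           executable="libreoffice"):
    """Build a conversion command that uses profile_dir as its user profile."""
    profile_uri = Path(profile_dir).resolve().as_uri()  # file:///... form
    return [executable, f"-env:UserInstallation={profile_uri}",
            "--headless", "--convert-to", target,
            "--outdir", str(outdir), str(src)]


def convert_isolated(src, target="pdf", outdir=".",
                     executable="libreoffice", runner=subprocess.run):
    """Convert one file with a temporary profile; return the exit code."""
    with tempfile.TemporaryDirectory(prefix="lo_profile_") as profile_dir:
        cmd = build_isolated_command(src, target, outdir, profile_dir,
                                     executable)
        return runner(cmd).returncode


def convert_many(files, target="pdf", outdir=".", max_workers=4, **kwargs):
    """Convert files concurrently; return exit codes in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(
            lambda f: convert_isolated(f, target, outdir, **kwargs), files))
```

Threads are enough here because the heavy work happens in the LibreOffice child processes; the Python side only waits for them.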
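Other conversions LibreOffice supports
LibreOffice is in fact very capable: it can also convert to and from xhtml, pdf, jpeg, png and many other formats. The supported formats are listed below: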
The following list of document formats are currently available:
bib – BibTeX [.bib]
doc – Microsoft Word 97/2000/XP [.doc]
doc6 – Microsoft Word 6.0 [.doc]
doc95 – Microsoft Word 95 [.doc]
docbook – DocBook [.xml]
docx – Microsoft Office Open XML [.docx]
docx7 – Microsoft Office Open XML [.docx]
fodt – OpenDocument Text (Flat XML) [.fodt]
html – HTML Document (OpenOffice.org Writer) [.html]
latex – LaTeX 2e [.ltx]
mediawiki – MediaWiki [.txt]
odt – ODF Text Document [.odt]
ooxml – Microsoft Office Open XML [.xml]
ott – Open Document Text [.ott]
pdb – AportisDoc (Palm) [.pdb]
pdf – Portable Document Format [.pdf]
psw – Pocket Word [.psw]
rtf – Rich Text Format [.rtf]
sdw – StarWriter 5.0 [.sdw]
sdw4 – StarWriter 4.0 [.sdw]
sdw3 – StarWriter 3.0 [.sdw]
stw – Open Office.org 1.0 Text Document Template [.stw]
sxw – Open Office.org 1.0 Text Document [.sxw]
text – Text Encoded [.txt]
txt – Text [.txt]
uot – Unified Office Format text [.uot]
vor – StarWriter 5.0 Template [.vor]
vor4 – StarWriter 4.0 Template [.vor]
vor3 – StarWriter 3.0 Template [.vor]
wps – Microsoft Works [.wps]
xhtml – XHTML Document [.html]
The following list of graphics formats are currently available:
bmp – Windows Bitmap [.bmp]
emf – Enhanced Metafile [.emf]
eps – Encapsulated PostScript [.eps]
fodg – OpenDocument Drawing (Flat XML) [.fodg]
gif – Graphics Interchange Format [.gif]
html – HTML Document (OpenOffice.org Draw) [.html]
jpg – Joint Photographic Experts Group [.jpg]
met – OS/2 Metafile [.met]
odd – OpenDocument Drawing [.odd]
otg – OpenDocument Drawing Template [.otg]
pbm – Portable Bitmap [.pbm]
pct – Mac Pict [.pct]
pdf – Portable Document Format [.pdf]
pgm – Portable Graymap [.pgm]
png – Portable Network Graphic [.png]
ppm – Portable Pixelmap [.ppm]
ras – Sun Raster Image [.ras]
std – OpenOffice.org 1.0 Drawing Template [.std]
svg – Scalable Vector Graphics [.svg]
svm – StarView Metafile [.svm]
swf – Macromedia Flash (SWF) [.swf]
sxd – OpenOffice.org 1.0 Drawing [.sxd]
sxd3 – StarDraw 3.0 [.sxd]
sxd5 – StarDraw 5.0 [.sxd]
sxw – StarOffice XML (Draw) [.sxw]
tiff – Tagged Image File Format [.tiff]
vor – StarDraw 5.0 Template [.vor]
vor3 – StarDraw 3.0 Template [.vor]
wmf – Windows Metafile [.wmf]
xhtml – XHTML [.xhtml]
xpm – X PixMap [.xpm]
The following list of presentation formats are currently available:
bmp – Windows Bitmap [.bmp]
emf – Enhanced Metafile [.emf]
eps – Encapsulated PostScript [.eps]
fodp – OpenDocument Presentation (Flat XML) [.fodp]
gif – Graphics Interchange Format [.gif]
html – HTML Document (OpenOffice.org Impress) [.html]
jpg – Joint Photographic Experts Group [.jpg]
met – OS/2 Metafile [.met]
odg – ODF Drawing (Impress) [.odg]
odp – ODF Presentation [.odp]
otp – ODF Presentation Template [.otp]
pbm – Portable Bitmap [.pbm]
pct – Mac Pict [.pct]
pdf – Portable Document Format [.pdf]
pgm – Portable Graymap [.pgm]
png – Portable Network Graphic [.png]
potm – Microsoft PowerPoint 2007/2010 XML Template [.potm]
pot – Microsoft PowerPoint 97/2000/XP Template [.pot]
ppm – Portable Pixelmap [.ppm]
pptx – Microsoft PowerPoint 2007/2010 XML [.pptx]
pps – Microsoft PowerPoint 97/2000/XP (Autoplay) [.pps]
ppt – Microsoft PowerPoint 97/2000/XP [.ppt]
pwp – PlaceWare [.pwp]
ras – Sun Raster Image [.ras]
sda – StarDraw 5.0 (OpenOffice.org Impress) [.sda]
sdd – StarImpress 5.0 [.sdd]
sdd3 – StarDraw 3.0 (OpenOffice.org Impress) [.sdd]
sdd4 – StarImpress 4.0 [.sdd]
sxd – OpenOffice.org 1.0 Drawing (OpenOffice.org Impress) [.sxd]
sti – OpenOffice.org 1.0 Presentation Template [.sti]
svg – Scalable Vector Graphics [.svg]
svm – StarView Metafile [.svm]
swf – Macromedia Flash (SWF) [.swf]
sxi – OpenOffice.org 1.0 Presentation [.sxi]
tiff – Tagged Image File Format [.tiff]
uop – Unified Office Format presentation [.uop]
vor – StarImpress 5.0 Template [.vor]
vor3 – StarDraw 3.0 Template (OpenOffice.org Impress) [.vor]
vor4 – StarImpress 4.0 Template [.vor]
vor5 – StarDraw 5.0 Template (OpenOffice.org Impress) [.vor]
wmf – Windows Metafile [.wmf]
xhtml – XHTML [.xml]
xpm – X PixMap [.xpm]
The following list of spreadsheet formats are currently available:
csv – Text CSV [.csv]
dbf – dBASE [.dbf]
dif – Data Interchange Format [.dif]
fods – OpenDocument Spreadsheet (Flat XML) [.fods]
html – HTML Document (OpenOffice.org Calc) [.html]
ods – ODF Spreadsheet [.ods]
ooxml – Microsoft Excel 2003 XML [.xml]
ots – ODF Spreadsheet Template [.ots]
pdf – Portable Document Format [.pdf]
pxl – Pocket Excel [.pxl]
sdc – StarCalc 5.0 [.sdc]
sdc4 – StarCalc 4.0 [.sdc]
sdc3 – StarCalc 3.0 [.sdc]
slk – SYLK [.slk]
stc – OpenOffice.org 1.0 Spreadsheet Template [.stc]
sxc – OpenOffice.org 1.0 Spreadsheet [.sxc]
uos – Unified Office Format spreadsheet [.uos]
vor3 – StarCalc 3.0 Template [.vor]
vor4 – StarCalc 4.0 Template [.vor]
vor – StarCalc 5.0 Template [.vor]
xhtml – XHTML [.xhtml]
xls – Microsoft Excel 97/2000/XP [.xls]
xls5 – Microsoft Excel 5.0 [.xls]
xls95 – Microsoft Excel 95 [.xls]
xlt – Microsoft Excel 97/2000/XP Template [.xlt]
xlt5 – Microsoft Excel 5.0 Template [.xlt]
xlt95 – Microsoft Excel 95 Template [.xlt]
xlsx – Microsoft Excel 2007/2010 XML [.xlsx]
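If you save one of the lists above to a text file, a few lines of Python can turn it into a lookup table, for instance to validate a target format before starting a conversion. This is only a sketch that assumes the exact "name – description [.ext]" line shape shown above; parse_format_line and index_formats are hypothetical names.

```python
import re

# name, a dash (en dash or hyphen), description, then "[.ext]"
FORMAT_LINE = re.compile(r"^(\S+)\s+[\u2013-]\s+(.+?)\s+\[(\.[^\]]+)\]\s*$")


def parse_format_line(line):
    """Return (name, description, extension), or None if the line does not match."""
    match = FORMAT_LINE.match(line.strip())
    return match.groups() if match else None


def index_formats(lines):
    """Map each format name to its file extension, skipping header lines."""
    index = {}
    for line in lines:
        parsed = parse_format_line(line)
        if parsed:
            name, _description, extension = parsed
            index[name] = extension
    return index
```

Note that the same name (pdf or html, for example) appears in more than one family, so it is safest to index one family's lines at a time.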