Converting doc/docx/wps documents to html/txt/docx and other formats on Linux with LibreOffice and Python
admin
2023-07-30 20:15:20

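Word document conversion tools on Linux
I recently got a requirement to convert documents in several different formats (.doc, .docx, .wps) into one uniform format: either everything to .docx, or straight to .html or .txt. A first round of research turned up these tools: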

win32com
python-docx
pydocx

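There may be others, but I won't list them all. After looking into them, I found that these tools share a problem: the Python ones either cannot handle .doc (only .docx) or must run on Windows (win32com, for example). Production environments mostly run Linux, and switching to a Windows server causes a chain of follow-on problems, such as whether your other libraries are still compatible, which is a real hassle. So finding a way to process .doc and .wps on Linux matters a lot. Fortunately I found one in the end: LibreOffice.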
Using LibreOffice
First, run libreoffice --version on the command line to check whether the tool is already installed. If it is not, install LibreOffice before continuing.
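If you would rather do this check from Python, here is a minimal sketch. It assumes the executable is on PATH under the name libreoffice or soffice (some installs only provide the latter); the function name find_libreoffice is my own, not part of any library.

```python
import shutil


def find_libreoffice(candidates=("libreoffice", "soffice")):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)  # None when the name is not on PATH
        if path:
            return path
    return None


if __name__ == "__main__":
    exe = find_libreoffice()
    print(exe or "LibreOffice not found on PATH")
```

If this prints a path, the commands in the rest of this post should be runnable with that executable.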
Once it is installed, use commands like the following to convert a document. For example:
Convert a .doc document to txt:
libreoffice --headless --convert-to txt path-to-your-doc.doc
You can also specify the output directory for the converted files, and pass doc/docx/wps files to LibreOffice in batches:

libreoffice --headless --convert-to html --outdir /your/output/dir /your/doc_docx_wps/files/*.{docx,doc,wps}
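The *.{docx,doc,wps} pattern is expanded by the shell (bash brace expansion), so it is not available when you pass an argument list from Python. A rough sketch of doing the same selection in Python follows. The helper names collect_inputs and expected_output_path are hypothetical, and the output naming rule (input file name with the new extension, placed in the output directory) is what I have observed, not something this sketch verifies.

```python
from pathlib import Path


def collect_inputs(src_dir, extensions=(".doc", ".docx", ".wps")):
    """List files in src_dir whose extension matches, case-insensitively."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        p for p in Path(src_dir).iterdir()
        if p.is_file() and p.suffix.lower() in wanted
    )


def expected_output_path(src, target_format, outdir):
    """Guess where the converted file lands: <outdir>/<stem>.<ext>."""
    # A target such as "txt:Text" carries a filter name after the colon;
    # only the part before it is used as the extension.
    ext = target_format.split(":", 1)[0]
    return Path(outdir) / (Path(src).stem + "." + ext)
```

Knowing the expected output path in advance makes it easy to check afterwards whether a conversion actually produced a file.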
Running the conversion from a Python script
There is nothing fancy about this; Python simply runs the command line:
import os
os.system("libreoffice --headless --convert-to txt path-to-your-doc.doc")
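os.system passes the whole string through a shell, which gets awkward when file names contain spaces or quotes. A slightly more defensive sketch uses subprocess.run with an argument list. The function name convert, its parameters, and the 120-second timeout are my own choices for illustration.

```python
import subprocess


def convert(src, target_format="txt", outdir=".", timeout=120,
            executable="libreoffice"):
    """Run one headless conversion; return (success, combined output)."""
    cmd = [executable, "--headless", "--convert-to", target_format,
           "--outdir", outdir, src]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
    except FileNotFoundError:
        # The executable itself is missing from PATH.
        return False, f"{executable} not found on PATH"
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout}s"
    return result.returncode == 0, (result.stdout + result.stderr).strip()
```

A call such as convert("path-to-your-doc.doc") would mirror the os.system line above. I would not rely on the exit code alone; also checking that the output file exists is a safer signal.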

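Of course, if converting files one at a time is too slow for you, you can also start several conversions in parallel from Python. Note that multiprocessing.dummy actually provides a thread pool; the parallelism comes from each thread launching its own libreoffice process: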

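import glob
import os
import subprocess
from multiprocessing.dummy import Pool  # thread pool; each task spawns a libreoffice process


def worker(fname, dstdir=os.path.expanduser("~")):
    # Without --outdir, libreoffice writes the result into the working directory (dstdir).
    subprocess.call(["libreoffice", "--headless", "--convert-to", "pdf", fname], cwd=dstdir)


pool = Pool()
# Convert every .doc file in the home directory to pdf.
pool.map(worker, glob.iglob(os.path.join(os.path.expanduser("~"), "*.doc")))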
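One caveat with parallel runs: LibreOffice instances that share the same user profile are commonly reported to interfere with each other, so some conversions may be skipped without any error. A workaround I have seen is to give every process its own throwaway profile through the -env:UserInstallation option. The sketch below does that under this assumption; build_command, convert_one and convert_many are hypothetical helper names, and the worker count and timeout are arbitrary.

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(src, target_format, outdir, profile_dir,
                  executable="libreoffice"):
    """Build the argument list with an isolated profile directory."""
    profile_uri = Path(profile_dir).resolve().as_uri()  # file:///... URI
    return [executable, f"-env:UserInstallation={profile_uri}",
            "--headless", "--convert-to", target_format,
            "--outdir", str(outdir), str(src)]


def convert_one(src, target_format="pdf", outdir=".",
                executable="libreoffice"):
    """Convert one file; return (src, exit code), or (src, None) on failure to run."""
    with tempfile.TemporaryDirectory(prefix="lo_profile_") as profile_dir:
        cmd = build_command(src, target_format, outdir, profile_dir, executable)
        try:
            done = subprocess.run(cmd, capture_output=True, timeout=300)
            return src, done.returncode
        except (FileNotFoundError, subprocess.TimeoutExpired):
            return src, None


def convert_many(files, workers=4, **kwargs):
    """Run convert_one over files with a small thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: convert_one(f, **kwargs), files))
```

Creating a fresh profile for every file adds startup cost, so for large batches it may be cheaper to pass several files to each libreoffice invocation instead.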

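Other conversions LibreOffice supports
LibreOffice is actually very capable: it can also convert to and from xhtml, pdf, jpeg, png and many other formats. The supported formats are listed below.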

The following list of document formats are currently available:

bib – BibTeX [.bib]
doc – Microsoft Word 97/2000/XP [.doc]
doc6 – Microsoft Word 6.0 [.doc]
doc95 – Microsoft Word 95 [.doc]
docbook – DocBook [.xml]
docx – Microsoft Office Open XML [.docx]
docx7 – Microsoft Office Open XML [.docx]
fodt – OpenDocument Text (Flat XML) [.fodt]
html – HTML Document (OpenOffice.org Writer) [.html]
latex – LaTeX 2e [.ltx]
mediawiki – MediaWiki [.txt]
odt – ODF Text Document [.odt]
ooxml – Microsoft Office Open XML [.xml]
ott – Open Document Text [.ott]
pdb – AportisDoc (Palm) [.pdb]
pdf – Portable Document Format [.pdf]
psw – Pocket Word [.psw]
rtf – Rich Text Format [.rtf]
sdw – StarWriter 5.0 [.sdw]
sdw4 – StarWriter 4.0 [.sdw]
sdw3 – StarWriter 3.0 [.sdw]
stw – Open Office.org 1.0 Text Document Template [.stw]
sxw – Open Office.org 1.0 Text Document [.sxw]
text – Text Encoded [.txt]
txt – Text [.txt]
uot – Unified Office Format text [.uot]
vor – StarWriter 5.0 Template [.vor]
vor4 – StarWriter 4.0 Template [.vor]
vor3 – StarWriter 3.0 Template [.vor]
wps – Microsoft Works [.wps]
xhtml – XHTML Document [.html]

The following list of graphics formats are currently available:

bmp – Windows Bitmap [.bmp]
emf – Enhanced Metafile [.emf]
eps – Encapsulated PostScript [.eps]
fodg – OpenDocument Drawing (Flat XML) [.fodg]
gif – Graphics Interchange Format [.gif]
html – HTML Document (OpenOffice.org Draw) [.html]
jpg – Joint Photographic Experts Group [.jpg]
met – OS/2 Metafile [.met]
odd – OpenDocument Drawing [.odd]
otg – OpenDocument Drawing Template [.otg]
pbm – Portable Bitmap [.pbm]
pct – Mac Pict [.pct]
pdf – Portable Document Format [.pdf]
pgm – Portable Graymap [.pgm]
png – Portable Network Graphic [.png]
ppm – Portable Pixelmap [.ppm]
ras – Sun Raster Image [.ras]
std – OpenOffice.org 1.0 Drawing Template [.std]
svg – Scalable Vector Graphics [.svg]
svm – StarView Metafile [.svm]
swf – Macromedia Flash (SWF) [.swf]
sxd – OpenOffice.org 1.0 Drawing [.sxd]
sxd3 – StarDraw 3.0 [.sxd]
sxd5 – StarDraw 5.0 [.sxd]
sxw – StarOffice XML (Draw) [.sxw]
tiff – Tagged Image File Format [.tiff]
vor – StarDraw 5.0 Template [.vor]
vor3 – StarDraw 3.0 Template [.vor]
wmf – Windows Metafile [.wmf]
xhtml – XHTML [.xhtml]
xpm – X PixMap [.xpm]

The following list of presentation formats are currently available:

bmp – Windows Bitmap [.bmp]
emf – Enhanced Metafile [.emf]
eps – Encapsulated PostScript [.eps]
fodp – OpenDocument Presentation (Flat XML) [.fodp]
gif – Graphics Interchange Format [.gif]
html – HTML Document (OpenOffice.org Impress) [.html]
jpg – Joint Photographic Experts Group [.jpg]
met – OS/2 Metafile [.met]
odg – ODF Drawing (Impress) [.odg]
odp – ODF Presentation [.odp]
otp – ODF Presentation Template [.otp]
pbm – Portable Bitmap [.pbm]
pct – Mac Pict [.pct]
pdf – Portable Document Format [.pdf]
pgm – Portable Graymap [.pgm]
png – Portable Network Graphic [.png]
potm – Microsoft PowerPoint 2007/2010 XML Template [.potm]
pot – Microsoft PowerPoint 97/2000/XP Template [.pot]
ppm – Portable Pixelmap [.ppm]
pptx – Microsoft PowerPoint 2007/2010 XML [.pptx]
pps – Microsoft PowerPoint 97/2000/XP (Autoplay) [.pps]
ppt – Microsoft PowerPoint 97/2000/XP [.ppt]
pwp – PlaceWare [.pwp]
ras – Sun Raster Image [.ras]
sda – StarDraw 5.0 (OpenOffice.org Impress) [.sda]
sdd – StarImpress 5.0 [.sdd]
sdd3 – StarDraw 3.0 (OpenOffice.org Impress) [.sdd]
sdd4 – StarImpress 4.0 [.sdd]
sxd – OpenOffice.org 1.0 Drawing (OpenOffice.org Impress) [.sxd]
sti – OpenOffice.org 1.0 Presentation Template [.sti]
svg – Scalable Vector Graphics [.svg]
svm – StarView Metafile [.svm]
swf – Macromedia Flash (SWF) [.swf]
sxi – OpenOffice.org 1.0 Presentation [.sxi]
tiff – Tagged Image File Format [.tiff]
uop – Unified Office Format presentation [.uop]
vor – StarImpress 5.0 Template [.vor]
vor3 – StarDraw 3.0 Template (OpenOffice.org Impress) [.vor]
vor4 – StarImpress 4.0 Template [.vor]
vor5 – StarDraw 5.0 Template (OpenOffice.org Impress) [.vor]
wmf – Windows Metafile [.wmf]
xhtml – XHTML [.xml]
xpm – X PixMap [.xpm]

The following list of spreadsheet formats are currently available:

csv – Text CSV [.csv]
dbf – dBASE [.dbf]
dif – Data Interchange Format [.dif]
fods – OpenDocument Spreadsheet (Flat XML) [.fods]
html – HTML Document (OpenOffice.org Calc) [.html]
ods – ODF Spreadsheet [.ods]
ooxml – Microsoft Excel 2003 XML [.xml]
ots – ODF Spreadsheet Template [.ots]
pdf – Portable Document Format [.pdf]
pxl – Pocket Excel [.pxl]
sdc – StarCalc 5.0 [.sdc]
sdc4 – StarCalc 4.0 [.sdc]
sdc3 – StarCalc 3.0 [.sdc]
slk – SYLK [.slk]
stc – OpenOffice.org 1.0 Spreadsheet Template [.stc]
sxc – OpenOffice.org 1.0 Spreadsheet [.sxc]
uos – Unified Office Format spreadsheet [.uos]
vor3 – StarCalc 3.0 Template [.vor]
vor4 – StarCalc 4.0 Template [.vor]
vor – StarCalc 5.0 Template [.vor]
xhtml – XHTML [.xhtml]
xls – Microsoft Excel 97/2000/XP [.xls]
xls5 – Microsoft Excel 5.0 [.xls]
xls95 – Microsoft Excel 95 [.xls]
xlt – Microsoft Excel 97/2000/XP Template [.xlt]
xlt5 – Microsoft Excel 5.0 Template [.xlt]
xlt95 – Microsoft Excel 95 Template [.xlt]
xlsx – Microsoft Excel 2007/2010 XML [.xlsx]
