Python crawler tutorial_Python language programming


Batch-converting DOCX documents to TXT with python3 on Ubuntu 22.04
2023/5/8 16:27

On WIN10, refer to the article below; on Ubuntu 22.04 different packages are required!
https://blog.csdn.net/weixin_/article/details/
Batch docx-to-txt conversion in python

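Put the docx documents in the input directory.
The txt files produced from them go in the output directory.
This article has 3 steps:
1. Walk through all docx files in the input directory.
2. Convert each docx file to a txt file.
3. Save the TXT files in the output directory.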
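
The three steps above can be sketched as one function. This is only an outline under stated assumptions: `convert_all` and its `read_text` parameter are hypothetical names that do not appear in the original scripts, and `read_text` stands in for whatever extracts the text of a single docx file (python-docx in this article).

```python
import os

def convert_all(input_dir, output_dir, read_text):
    """Walk input_dir, convert each .docx via read_text, save .txt in output_dir."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    # Step 1: walk every file under input_dir.
    for root, _dirs, names in os.walk(input_dir):
        for name in sorted(names):
            stem, ext = os.path.splitext(name)
            if ext.lower() != ".docx":
                continue
            # Step 2: extract the text (read_text is supplied by the caller).
            text = read_text(os.path.join(root, name))
            # Step 3: save it under the same base name in output_dir.
            out_path = os.path.join(output_dir, stem + ".txt")
            with open(out_path, "w", encoding="utf-8") as f:
                f.write(text)
            written.append(out_path)
    return written
```

Passing the reader in as a parameter keeps the directory handling separate from the docx library, so the walk-and-save logic can be tried out with a dummy reader first.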

0. Installing the python3 packages:
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ ll *.py
-rwxr--r-- 1 rootroot rootroot 1245  5月  7 20:07  pdf2doc2.py*
-rwxr--r-- 1 rootroot rootroot 1245  5月  7 20:07 'pdf2doc2 - 副本.py'*
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ python pdf2doc2.py 
Traceback (most recent call last):
  File "pdf2doc2.py", line 3, in <module>
    from pdf2docx import Converter
ImportError: No module named pdf2docx
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ 
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ python3 pdf2doc2.py 
Traceback (most recent call last):
  File "/home/rootroot/pdf2doc2.py", line 3, in <module>
    from pdf2docx import Converter
ModuleNotFoundError: No module named 'pdf2docx'
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ 
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ 
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ pip install pdf2docx
Defaulting to user installation because normal site-packages is not writeable
Collecting pdf2docx
  Downloading pdf2docx-0.5.6-py3-none-any.whl (148 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 148.4/148.4 KB 475.7 kB/s eta 0:00:00
Collecting opencv-python>=4.5
  Downloading opencv_python-4.7.0.72-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (61.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.8/61.8 MB 7.8 MB/s eta 0:00:00
Collecting fonttools>=4.24.0
  Downloading fonttools-4.39.3-py3-none-any.whl (1.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 14.9 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17.2 in ./.local/lib/python3.10/site-packages (from pdf2docx) (1.23.5)
Collecting PyMuPDF>=1.19.0
  Downloading PyMuPDF-1.22.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 8.5 MB/s eta 0:00:00
Collecting python-docx>=0.8.10
  Downloading python-docx-0.8.11.tar.gz (5.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.6/5.6 MB 9.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting fire>=0.3.0
  Downloading fire-0.5.0.tar.gz (88 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.3/88.3 KB 3.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from fire>=0.3.0->pdf2docx) (1.16.0)
Collecting termcolor
  Downloading termcolor-2.3.0-py3-none-any.whl (6.9 kB)
Collecting lxml>=2.3.2
  Downloading lxml-4.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (7.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.1/7.1 MB 5.6 MB/s eta 0:00:00
Building wheels for collected packages: fire, python-docx
  Building wheel for fire (setup.py) ... done
  Created wheel for fire: filename=fire-0.5.0-py2.py3-none-any.whl size= sha256=a75c7c45708f1b1f670d3656b47aa32ecdc45d8c6442cdf8541ab
  Stored in directory: /home/rootroot/.cache/pip/wheels/90/d4/f7/9404e5db0116bd4d43e5666eaa3e70ab53723e1e3ea40c9a95
  Building wheel for python-docx (setup.py) ... done
  Created wheel for python-docx: filename=python_docx-0.8.11-py3-none-any.whl size= sha256=4bf244d8f5006e4c3bf7c9c5990a1731cfcee0f13997e02f44aa1d
  Stored in directory: /home/rootroot/.cache/pip/wheels/80/27/06/d4c3bd989b957af207bfd71d358d63a8194d
Successfully built fire python-docx
Installing collected packages: termcolor, PyMuPDF, opencv-python, lxml, fonttools, python-docx, fire, pdf2docx
Successfully installed PyMuPDF-1.22.2 fire-0.5.0 fonttools-4.39.3 lxml-4.9.2 opencv-python-4.7.0.72 pdf2docx-0.5.6 python-docx-0.8.11 termcolor-2.3.0
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ 
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ 
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ pip install win32com
Defaulting to user installation because normal site-packages is not writeable
ERROR: Could not find a version that satisfies the requirement win32com (from versions: none)
ERROR: No matching distribution found for win32com
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ 
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ 
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ pip install  pypiwin32
Defaulting to user installation because normal site-packages is not writeable
Collecting pypiwin32
  Downloading pypiwin32-223-py3-none-any.whl (1.7 kB)
  Downloading pypiwin32-219.zip (4.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.8/4.8 MB 4.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [7 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-rkpzj2x6/pypiwin32_8aa047f88d26fcc7255f3678/setup.py", line 121
          print "Building pywin32", pywin32_version
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)?
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ 
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ 
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ 
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ sudo pip install python-docx
[sudo] password for rootroot: 
Collecting python-docx
  Downloading python-docx-0.8.11.tar.gz (5.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.6/5.6 MB 3.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting lxml>=2.3.2
  Downloading lxml-4.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (7.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.1/7.1 MB 7.4 MB/s eta 0:00:00
Building wheels for collected packages: python-docx
  Building wheel for python-docx (setup.py) ... done
  Created wheel for python-docx: filename=python_docx-0.8.11-py3-none-any.whl size= sha256=ed4fbb03122bebc37307ccfa92dc1b5ea1ef5d3c
  Stored in directory: /root/.cache/pip/wheels/80/27/06/d4c3bd989b957af207bfd71d358d63a8194d
Successfully built python-docx
Installing collected packages: lxml, python-docx
Successfully installed lxml-4.9.2 python-docx-0.8.11
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ 
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ 
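
After the installs above, a quick way to see which modules are actually importable is to ask Python itself. The sketch below is not part of the original session; `missing_modules` is a hypothetical helper name. It relies on two facts: the import name of the python-docx package is `docx`, and `win32com` belongs to pywin32, which is only available on Windows, so it is expected to be reported as missing on Ubuntu.

```python
import importlib.util

def missing_modules(names):
    """Return the names from `names` that cannot be imported in this interpreter."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# On the Ubuntu machine from this article, "win32com" should be listed
# while "docx" and "pdf2docx" should not (once pip has installed them).
print(missing_modules(["docx", "pdf2docx", "win32com"]))
```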
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ ll *.docx
-rwxr--r-- 1 rootroot rootroot 80786  5月  4 20:56 MIDE-599.google.docx*
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ 
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ import docx
Command 'import' not found, but can be installed with:
sudo apt install graphicsmagick-imagemagick-compat  # version 1.4+really1.3.38-1ubuntu0.1, or
sudo apt install imagemagick-6.q16                  # version 8:6.9.11.60+dfsg-1.3ubuntu0.22.04.3
sudo apt install imagemagick-6.q16hdri              # version 8:6.9.11.60+dfsg-1.3ubuntu0.22.04.3
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ 
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ pyton3
Command 'pyton3' not found, did you mean:
  command 'python3' from deb python3 (3.10.6-1~22.04)
Try: sudo apt install <deb name>
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ 
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ python3
Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> import docx
>>> 
>>> doc = docx.Document('MIDE-599.google.docx')
>>> 
>>> docText = '\n'.join([paragraph.text for paragraph in doc.paragraphs])
>>> 
>>> print(docText) 
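
One limitation of the session above: `doc.paragraphs` only returns the body paragraphs, so text that sits inside tables is not included. A minimal sketch that appends table text as well is shown here. `document_text` is a hypothetical helper name, and it assumes `doc` is an object returned by `docx.Document(...)`; table rows are added after all paragraphs, not at their original position in the document.

```python
def document_text(doc):
    """Join paragraph text, then the text of every table row, one item per line."""
    lines = [p.text for p in doc.paragraphs]
    for table in doc.tables:
        for row in table.rows:
            # Cells of one row are joined with tabs.
            lines.append("\t".join(cell.text for cell in row.cells))
    return "\n".join(lines)

# Usage (requires python-docx and the file from this article):
# import docx
# print(document_text(docx.Document('MIDE-599.google.docx')))
```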


1. Walk through all docx files in the input directory.
input2.py

import os

input_dir = 'input'

# os.walk visits input_dir and every subdirectory below it
for root, dirs, files in os.walk(input_dir):
    for name in files:
        path = os.path.join(root, name)
        print(path)

rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ python input2.py 
input/SSNI-205.google.docx
input/TEK-072.google.docx
input/TEK-076.google.docx
input/OAE-101.google.docx
input/SIVR-001.google.docx
input/OAE-165.google.docx
input/SSNI-101.google.docx
input/SIVR-012 2.google.docx
input/SIVR-002.google.docx
input/SSNI-009.google.docx
input/SIVR-003.google.docx
input/SIVR-017 2.google.docx
input/SSNI-493.google.docx
input/SNIS-896.google.docx
input/SSNI-409.google.docx
input/SSNI-730.google.docx
input/SIVR-034 1.google.docx
input/SIVR-067 1.google.docx
input/OFJE-189.google.docx
input/SIVR-067 3.google.docx
input/SIVR-044 2.google.docx
input/SSNI-542.google.docx
input/SIVR-034 2.google.docx
input/SIVR-016 2.google.docx
input/SIVR-016 1.google.docx
input/SSNI-229.google.docx
input/SSNI-030.google.docx
input/SSNI-127.google.docx
input/SIVR-033 5.google.docx
input/SIVR-061 1.google.docx
input/SNIS-986.google.docx
input/SIVR-033 2.google.docx
input/SIVR-033 3.google.docx
input/SSNI-516.google.docx
input/SSNI-388.google.docx
input/SSNI-473.google.docx
input/SNIS-872.google.docx
input/SIVR-067 2.google.docx
input/OFJE-139 2.google.docx
input/SNIS-786.google.docx
input/SSNI-674.google.docx
input/SSNI-178.google.docx
input/TEK-083Ö»ÓÐÒôƵ.google.docx
input/SNIS-964.google2.docx
input/SSNI-644.google.docx
input/SSNI-301.google.docx
input/TEK-080.google.docx
input/SIVR-044 1.google.docx
input/SSNI-566.google.docx
input/TEK-071.google.docx
input/TEK-097.google.docx
input/SSNI-279.google.docx
input/SIVR-061 4.google.docx
input/SSNI-344.google.docx
input/SIVR-033 1.google.docx
input/SSNI-618.google.docx
input/SIVR-017 1.google.docx
input/MIDE-599.google.docx
input/SNIS-850 1.google.docx
input/SIVR-061 2.google.docx
input/SSNI-254.google.docx
input/pSSNI-473.google.docx
input/SSNI-589.google.docx
input/SIVR-015 1.google.docx
input/SSNI-432.google.docx
input/SSNI-152.google.docx
input/SIVR-061 3.google.docx
input/SNIS-800.google.docx
input/SSNI-322.google.docx
input/SSNI-077.google.docx
input/SNIS-919.google.docx
input/SSNI-452.google.docx
input/SIVR-033 6.google.docx
input/TEK-073.google.docx
input/TEK-081Ö»ÓÐÒôƵ.google.docx
input/OFJE-139 1.google.docx
input/SNIS-850 2.google.docx
input/SNIS-964.google.docx
input/SIVR-033 4.google.docx
input/SSNI-703.google.docx
input/SIVR-015 2.google.docx
input/TEK-067.google.docx
input/SSNI-054.google.docx
input/SIVR-012 1.google.docx
input/SIVR-017 3.google.docx
input/SIVR-034 3.google.docx
input/TEK-079Ö»ÓÐÒôƵ.google.docx
input/OFJE-236.google.docx
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ 
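
The listing above comes back in no particular order and would include any non-docx file that happens to be in the directory. As an alternative sketch (not the script used in this article), `pathlib` can collect only the `.docx` files and return them sorted. `find_docx` is a hypothetical helper name.

```python
from pathlib import Path

def find_docx(input_dir):
    """Return every .docx file under input_dir (any letter case), sorted by path."""
    return sorted(
        p for p in Path(input_dir).rglob("*")
        if p.is_file() and p.suffix.lower() == ".docx"
    )

# Usage: for path in find_docx("input"): print(path)
```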


2. Convert a docx file to a txt file.
docx3.py

import docx
doc = docx.Document('MIDE-599.google.docx')
docText = '\n'.join([paragraph.text for paragraph in doc.paragraphs])
#print(docText)

with open("MIDE-599.google.txt", "wb") as f:
    #f.write(response.content)
    #f.write(docText)           # TypeError: "wb" mode accepts bytes, not str
    #f.write(docText.decode())  # wrong direction: str has no decode()
    f.write(docText.encode())   # str -> bytes (UTF-8 by default)

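An earlier version of the script called f.write(docText) directly on line 8, which fails because the file is opened in binary mode:
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ python3 docx3.py 
Traceback (most recent call last):
  File "/home/rootroot/docx3.py", line 8, in <module>
    f.write(docText)
TypeError: a bytes-like object is required, not 'str'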
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ 
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ 
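
The TypeError above appears because a file opened with "wb" only accepts bytes. The sketch below is an illustration that is not taken from the original scripts: it writes the same string once in binary mode with an explicit `encode()` and once in text mode with an explicit encoding, then checks that both files contain identical bytes. The sample text and the temporary file names are made up for the example.

```python
import os
import tempfile

text = "第一行\nsecond line"

with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.txt")
    path_b = os.path.join(tmp, "b.txt")

    # Binary mode: the str must be encoded to bytes first.
    with open(path_a, "wb") as f:
        f.write(text.encode("utf-8"))

    # Text mode: pass the str directly and name the encoding explicitly.
    # newline="\n" keeps line endings identical on every platform.
    with open(path_b, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)

    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        same = fa.read() == fb.read()

print(same)  # True
```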


3. Save the TXT files in the output directory.
input4.py

import os
import docx

input_dir = 'input'
output_dir = 'output'

# create the output directory if it does not exist yet
os.makedirs(output_dir, exist_ok=True)

for root, dirs, files in os.walk(input_dir):
    for name in files:
        stem, ext = os.path.splitext(name)
        if ext == ".docx":
            # read the docx file and join its paragraphs into one string
            path_docx = os.path.join(root, name)
            doc = docx.Document(path_docx)
            docText = '\n'.join([paragraph.text for paragraph in doc.paragraphs])

            # same base name, .txt extension, saved in the output directory
            path = os.path.join(output_dir, stem + ".txt")

            # binary mode needs bytes, so encode the str (UTF-8 by default)
            with open(path, "wb") as f:
                f.write(docText.encode())
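
Because os.walk also descends into subdirectories while every txt file is written straight into output, two docx files with the same name in different subdirectories would overwrite each other. One possible way around this, sketched here as an assumption and not as part of the original script, is to mirror the subdirectory layout. `mirrored_txt_path` is a hypothetical helper name.

```python
from pathlib import Path

def mirrored_txt_path(docx_path, input_dir, output_dir):
    """Map input/sub/name.docx to output/sub/name.txt."""
    relative = Path(docx_path).relative_to(input_dir)
    return Path(output_dir) / relative.with_suffix(".txt")

print(mirrored_txt_path("input/sub/MIDE-599.google.docx", "input", "output"))
```

Before writing to the returned path, the parent directory has to exist, for example via `path.parent.mkdir(parents=True, exist_ok=True)`.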


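On Ubuntu 22.04 the txt files are written as UTF-8, while on WIN10 the default is ANSI.

As a result, the two sets of files cannot be compared directly with BeyondCompare 3.5!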
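
To compare the results from both systems, one side can be re-encoded first. The following is a rough sketch under an explicit assumption: that "ANSI" on a Chinese-language WIN10 installation means code page 936 (GBK). That should be verified on the actual machine. `reencode` and its parameters are hypothetical names.

```python
def reencode(src, dst, src_encoding="utf-8", dst_encoding="gbk"):
    """Rewrite the text file src as dst in another encoding."""
    # newline="" leaves the line endings exactly as they are in the source file.
    with open(src, "r", encoding=src_encoding, newline="") as f:
        text = f.read()
    # errors="replace" writes '?' for characters the target encoding lacks.
    with open(dst, "w", encoding=dst_encoding, errors="replace", newline="") as f:
        f.write(text)

# Usage: reencode("output/MIDE-599.google.txt", "MIDE-599.google.gbk.txt")
```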

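References:
ubuntu python docx txt
ubuntu python batch docx txt
python ubuntu walk directory
ubuntu python docx
ubuntu python walk directory
how to walk the files in a folder with python / python walk files in a folder
python change extension
Changing a file extension with Python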
https://blog.csdn.net/weixin_/article/details/
Batch-changing file extensions with python

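import os

folder = '/home/下载/'  # directory containing the files
files = os.listdir(folder)  # list every file name in the directory
files.sort()  # sort by file name
#print('files', files)
# loop over the files
for name in files:
    # split into base name and extension; unlike name.split('.'),
    # this keeps names that contain more than one dot intact
    stem, ext = os.path.splitext(name)
    print(stem, ext)
    if ext == '.txt':  # only touch .txt files
        newname = stem + '.tif'  # swap the extension
        print(newname)
        os.rename(os.path.join(folder, name), os.path.join(folder, newname))  # rename on disk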
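
The same renaming can also be written with `pathlib`. This is a sketch and not the referenced article's code: `change_extension` is a hypothetical helper name, and skipping a file whose target name already exists is a design choice made here to avoid overwriting anything.

```python
from pathlib import Path

def change_extension(folder, old=".txt", new=".tif"):
    """Rename every file in folder that ends in `old` so that it ends in `new`."""
    renamed = []
    # sorted() takes a snapshot first, so renaming does not disturb the loop.
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix == old:
            target = path.with_suffix(new)
            if target.exists():
                continue  # do not overwrite an existing file
            path.rename(target)
            renamed.append(target.name)
    return renamed
```

`with_suffix` replaces only the last extension, so a name such as a.b.txt becomes a.b.tif.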

http://bjst.net.cn/ask/show-392333.html
Selected answer. Answer date: November 27, 2022. The following content is for reference only!

https://wenku.baidu.com/view/710331a94593daef5ef7ba0d4a7302768f996f55.html
Changing a file extension with Python

https://blog.csdn.net/faihung/article/details/
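Successfully solved TypeError: a bytes-like object is required, not 'str'
Approach
The problem comes from a difference between python3.5 and Python2.7 in how socket return values are decoded:
In python, the bytes and str types can be converted into each other with encode() and decode():
str→bytes: the encode() method. A str is converted to bytes by calling encode().
bytes→str: the decode() method. Data read as a byte stream from the network or from disk is bytes; to turn bytes into str, call decode().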
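
A short round trip makes the two directions concrete. The sample string below is made up for this illustration.

```python
text = "docx转txt"

data = text.encode("utf-8")      # str -> bytes
restored = data.decode("utf-8")  # bytes -> str

# 8 characters become 10 bytes: the Chinese character takes 3 bytes in UTF-8.
print(type(data).__name__, len(text), len(data))  # bytes 8 10
print(restored == text)                           # True
```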

https://www.zhangshengrong.com/p/281oqB7DNw/
Reading the content of doc and docx documents with python on Ubuntu
sudo pip install python-docx

https://www.cnblogs.com/vulcat/p/12547027.html
Batch-replacing the content of .doc files with python

https://blog.csdn.net/wx/article/details/
Batch operations on Word documents with Python

 

If you repost this article, please keep the source: http://bianchenghao.cn/80332.html
