Recursive sub folder search and return files in a list python

118

I am working on a script to recursively go through subfolders in a main folder and build a list of files of a certain type. I am having an issue with the script. It is currently set up as follows

for root, subFolder, files in os.walk(PATH):
    for item in files:
        if item.endswith(".txt") :
            fileNamePath = str(os.path.join(root,subFolder,item))

The problem is that the subFolder variable is pulling in a list of subfolders rather than the folder that the ITEM file is located in. I was thinking of running a for loop over the subfolders first and joining the first part of the path, but I figured I'd double check to see if anyone has any suggestions before that. Thanks for your help!

— user2709514

156

You should be using the dirpath, which you call root. The dirnames are supplied so you can prune the list if there are folders that you don't want os.walk to recurse into.

import os
result = [os.path.join(dp, f) for dp, dn, filenames in os.walk(PATH) for f in filenames if os.path.splitext(f)[1] == '.txt']
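
The pruning mentioned above can be sketched as follows. This is a minimal illustration rather than part of the original answer: find_txt_files and its skip_dirs default are hypothetical names chosen for the example.

```python
import os


def find_txt_files(top, skip_dirs=(".git", "__pycache__")):
    """Collect .txt paths under top, skipping directories named in skip_dirs."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(top):
        # Assigning to the slice edits the list in place, which is what
        # tells os.walk (in its default top-down mode) not to descend.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            if os.path.splitext(name)[1] == ".txt":
                matches.append(os.path.join(dirpath, name))
    return matches
```

Rebinding the name (dirnames = [...]) would not prune anything, because os.walk keeps its own reference to the original list.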

Edit:

After the latest downvote, it occurred to me that glob is a better tool for selecting by extension.

import os
from glob import glob
result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]

Also a generator version

from itertools import chain
result = (chain.from_iterable(glob(os.path.join(x[0], '*.txt')) for x in os.walk('.')))

Edit2 for Python 3.4+

from pathlib import Path
result = list(Path(".").rglob("*.[tT][xX][tT]"))
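
Character classes such as [tT] become unwieldy for longer extensions. One possible alternative, sketched here under the assumption that filtering in Python is acceptable, is to match everything and compare the lowercased suffix. find_by_suffix is a hypothetical helper, not part of pathlib.

```python
from pathlib import Path


def find_by_suffix(top, suffix=".txt"):
    """Return files under top whose extension equals suffix, ignoring case."""
    wanted = suffix.lower()
    return [
        p for p in Path(top).rglob("*")
        if p.is_file() and p.suffix.lower() == wanted
    ]
```

This behaves the same on every platform because the comparison happens in Python rather than in the glob pattern.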

— John La Rooy

1

The glob pattern '*.[Tt][Xx][Tt]' will make the search case-insensitive.

— SergiyKolesnikov

@SergiyKolesnikov, thanks, I've used that in the edit at the bottom. Note that rglob is case-insensitive on Windows platforms, but it is not portably case-insensitive.

— John La Rooy

1

@JohnLaRooy It works with glob too (Python 3.6 here): glob.iglob(os.path.join(real_source_path, '**', '*.[xX][mM][lL]'))

— SergiyKolesnikov

@Sergiy: Your iglob does not work for files in sub-subfolders and below. You need to add recursive=True.

— user136036

1

@user136036, "better" does not always mean fastest. Sometimes readability and maintainability are also important.

— John La Rooy

111

Changed in Python 3.5: Support for recursive globs using “**”.

glob.glob() got a new recursive parameter.

If you want to get every .txt file under my_path (recursively including subdirs):

import glob

files = glob.glob(my_path + '/**/*.txt', recursive=True)

# my_path/     the dir
# **/       every file and dir under my_path
# *.txt     every file that ends with '.txt'

If you need an iterator, you can use iglob as an alternative:

for file in glob.iglob(my_path + '/**/*.txt', recursive=True):
    # ...
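
As a rough illustration of how "**" behaves, the sketch below wraps iglob in a generator. iter_txt_files is a hypothetical name, and os.path.join is used here only so the pattern gets the right separator on any platform.

```python
import glob
import os


def iter_txt_files(my_path):
    """Lazily yield .txt paths at any depth below my_path."""
    pattern = os.path.join(my_path, "**", "*.txt")
    # With recursive=True, "**" matches zero or more directory levels,
    # so files sitting directly in my_path are included as well.
    yield from glob.iglob(pattern, recursive=True)
```

Keep in mind that glob patterns skip names starting with a dot by default, so hidden files and folders are not visited by this sketch.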

— Rotareti

1

TypeError: glob() got an unexpected keyword argument 'recursive'

— CyberJacob

1

It should work. Make sure you use a version >= 3.5. I added a link to the documentation in my answer for more detail.

— Rotareti

That would be why, I'm on 2.7

— CyberJacob

1

Why the list comprehension and not just files = glob.glob(PATH + '/*/**/*.txt', recursive=True)?

— tobltobs

Whoops! :) It's totally redundant. No idea what made me write it like that. Thanks for mentioning it! I'll fix it.

— Rotareti

20

I will translate John La Rooy's list comprehension into nested for loops, just in case anyone else has trouble understanding it.

result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]

Should be equivalent to:

import glob
import os

result = []

for x in os.walk(PATH):
    for y in glob.glob(os.path.join(x[0], '*.txt')):
        result.append(y)

Here's the documentation for list comprehensions and for the functions os.walk and glob.glob.

— Jefferson Lima

1

This answer worked for me in Python 3.7.3. glob.glob(..., recursive=True) and list(Path(dir).glob(...)) did not.

— miguelmorin

11

This seems to be the fastest solution I could come up with. It is faster than os.walk and a lot faster than any glob solution.

It will also give you a list of all nested subfolders at basically no extra cost.
You can search for several different extensions.
You can also choose to return either full paths or just the names of the files by changing f.path to f.name (do not change it for subfolders!).

Args: dir: str, ext: list.
The function returns two lists: subfolders, files.

See below for a detailed speed analysis.

import os


def run_fast_scandir(dir, ext):    # dir: str, ext: list
    subfolders, files = [], []

    for f in os.scandir(dir):
        if f.is_dir():
            subfolders.append(f.path)
        if f.is_file():
            if os.path.splitext(f.name)[1].lower() in ext:
                files.append(f.path)


    for dir in list(subfolders):
        sf, f = run_fast_scandir(dir, ext)
        subfolders.extend(sf)
        files.extend(f)
    return subfolders, files


subfolders, files = run_fast_scandir(folder, [".jpg"])
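
For very deep trees, or when results should be consumed one at a time, the same os.scandir idea can be written without recursion. The following is only a sketch of that variation, not the benchmarked function: iter_files is a hypothetical name, and unlike the code above it deliberately does not descend into symlinked directories.

```python
import os


def iter_files(top, extensions):
    """Yield paths under top whose lowercased extension is in extensions."""
    stack = [top]
    while stack:
        current = stack.pop()
        # The context manager closes the directory handle promptly.
        with os.scandir(current) as entries:
            for entry in entries:
                # follow_symlinks=False keeps symlinked directories from
                # being walked, which avoids loops.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif (entry.is_file()
                      and os.path.splitext(entry.name)[1].lower() in extensions):
                    yield entry.path
```

A call such as list(iter_files(folder, {".jpg"})) collects everything at once; passing a set keeps the membership test cheap when many extensions are given.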

Speed analysis

for various methods to get all files with a specific file extension inside all subfolders and the main folder.

tl;dr:
- fast_scandir clearly wins and is twice as fast as all other solutions, except os.walk.
- os.walk takes second place and is slightly slower.
- Using glob slows the process down considerably.
- None of the results use natural sorting. This means results are sorted like this: 1, 10, 2. To get natural sorting (1, 2, 10), please have a look at https://stackoverflow.com/a/48030307/2441026
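
To illustrate the natural-sorting point, here is a minimal sketch of a sort key that compares digit runs as integers. It is one common approach and not necessarily the one used in the linked answer; natural_key is a hypothetical name.

```python
import re


def natural_key(path):
    """Split a string into text and integer chunks for human-friendly sorting."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so every odd index holds a run of digits.
    chunks = re.split(r"(\d+)", path)
    return [int(c) if i % 2 else c.lower() for i, c in enumerate(chunks)]


names = ["10.jpg", "2.jpg", "1.jpg"]
print(sorted(names))                    # plain string sort
print(sorted(names, key=natural_key))   # digit runs compared as integers
```

The key can be passed to sorted() or list.sort() on any of the file lists produced by the methods compared here.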

Results:

fast_scandir    took  499 ms. Found files: 16596. Found subfolders: 439
os.walk         took  589 ms. Found files: 16596
find_files      took  919 ms. Found files: 16596
glob.iglob      took  998 ms. Found files: 16596
glob.glob       took 1002 ms. Found files: 16596
pathlib.rglob   took 1041 ms. Found files: 16596
os.walk-glob    took 1043 ms. Found files: 16596

Tests were done with W7x64, Python 3.8.1, 20 runs. 16596 files in 439 (partially nested) subfolders.
find_files is from https://stackoverflow.com/a/45646357/2441026 and lets you search for several extensions.
fast_scandir was written by myself and will also return a list of subfolders. You can give it a list of extensions to search for (I tested a list with one entry against a simple if ... == ".jpg" and there was no significant difference).

# -*- coding: utf-8 -*-
# Python 3


import time
import os
from glob import glob, iglob
from pathlib import Path


directory = r"<folder>"
RUNS = 20


def run_os_walk():
    a = time.time_ns()
    for i in range(RUNS):
        fu = [os.path.join(dp, f) for dp, dn, filenames in os.walk(directory) for f in filenames if
                  os.path.splitext(f)[1].lower() == '.jpg']
    print(f"os.walk\t\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_os_walk_glob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = [y for x in os.walk(directory) for y in glob(os.path.join(x[0], '*.jpg'))]
    print(f"os.walk-glob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_glob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = glob(os.path.join(directory, '**', '*.jpg'), recursive=True)
    print(f"glob.glob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_iglob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = list(iglob(os.path.join(directory, '**', '*.jpg'), recursive=True))
    print(f"glob.iglob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_pathlib_rglob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = list(Path(directory).rglob("*.jpg"))
    print(f"pathlib.rglob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def find_files(files, dirs=[], extensions=[]):
    # https://stackoverflow.com/a/45646357/2441026

    new_dirs = []
    for d in dirs:
        try:
            new_dirs += [ os.path.join(d, f) for f in os.listdir(d) ]
        except OSError:
            if os.path.splitext(d)[1].lower() in extensions:
                files.append(d)

    if new_dirs:
        find_files(files, new_dirs, extensions )
    else:
        return


def run_fast_scandir(dir, ext):    # dir: str, ext: list
    # https://stackoverflow.com/a/59803793/2441026

    subfolders, files = [], []

    for f in os.scandir(dir):
        if f.is_dir():
            subfolders.append(f.path)
        if f.is_file():
            if os.path.splitext(f.name)[1].lower() in ext:
                files.append(f.path)


    for dir in list(subfolders):
        sf, f = run_fast_scandir(dir, ext)
        subfolders.extend(sf)
        files.extend(f)
    return subfolders, files



if __name__ == '__main__':
    run_os_walk()
    run_os_walk_glob()
    run_glob()
    run_iglob()
    run_pathlib_rglob()


    a = time.time_ns()
    for i in range(RUNS):
        files = []
        find_files(files, dirs=[directory], extensions=[".jpg"])
    print(f"find_files\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}")


    a = time.time_ns()
    for i in range(RUNS):
        subf, files = run_fast_scandir(directory, [".jpg"])
    print(f"fast_scandir\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}. Found subfolders: {len(subf)}")

— user136036

10

The new pathlib library simplifies this to one line:

from pathlib import Path
result = list(Path(PATH).glob('**/*.txt'))

You can also use the generator version:

from pathlib import Path
for file in Path(PATH).glob('**/*.txt'):
    pass

This returns Path objects, which you can use for pretty much anything, or you can get the file name as a string with file.name.
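
A few of the attributes a Path object offers, as a quick sketch. The path below is made up for illustration and does not need to exist on disk.

```python
from pathlib import Path

p = Path("data") / "sub_dir" / "notes.txt"  # hypothetical path, need not exist

print(p.name)    # file name with extension
print(p.stem)    # file name without the extension
print(p.suffix)  # extension, including the leading dot
print(p.parent)  # containing directory, itself a Path
print(str(p))    # plain string, for APIs that do not accept Path objects
```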

— Emre

6

It's not the most pythonic answer, but I'll put it here for fun because it's a neat lesson in recursion

import os


def find_files(files, dirs=[], extensions=[]):
    new_dirs = []
    for d in dirs:
        try:
            new_dirs += [ os.path.join(d, f) for f in os.listdir(d) ]
        except OSError:
            if os.path.splitext(d)[1] in extensions:
                files.append(d)

    if new_dirs:
        find_files(files, new_dirs, extensions )
    else:
        return

On my machine I have two folders, root and root2

mender@multivax ]ls -R root root2
root:
temp1 temp2

root/temp1:
temp1.1 temp1.2

root/temp1/temp1.1:
f1.mid

root/temp1/temp1.2:
f.mi  f.mid

root/temp2:
tmp.mid

root2:
dummie.txt temp3

root2/temp3:
song.mid

Let's say I want to find all .txt and all .mid files in either of these directories, then I can just do

files = []
find_files( files, dirs=['root','root2'], extensions=['.mid','.txt'] )
print(files)

#['root2/dummie.txt',
# 'root/temp2/tmp.mid',
# 'root2/temp3/song.mid',
# 'root/temp1/temp1.1/f1.mid',
# 'root/temp1/temp1.2/f.mid']

— dermen

4

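The recursive parameter is new in Python 3.5, so it won't work on Python 2.7. Here is an example that uses r strings, so you can provide the path as it is written on your system (Win, Lin, ...). Note that the backslash separators in the pattern below are Windows-specific.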

import glob

mypath=r"C:\Users\dj\Desktop\nba"

files = glob.glob(mypath + r'\**\*.py', recursive=True)
# print(files) # as list
for f in files:
    print(f) # nice looking single line per file

Note: It will list all files, no matter how deep it has to go.

— prosti

3

You can do it this way to get back a list of files with their absolute paths.

def list_files_recursive(path):
    """
    Function that receives as a parameter a directory path
    :return list_: File List and Its Absolute Paths
    """

    import os

    files = []

    # r = root, d = directories, f = files
    for r, d, f in os.walk(path):
        for file in f:
            files.append(os.path.join(r, file))

    lst = [file for file in files]
    return lst


if __name__ == '__main__':

    result = list_files_recursive('/tmp')
    print(result)
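
One detail worth keeping in mind: os.walk builds its paths from whatever starting path it is given, so the results are absolute only if the argument is. A minimal sketch that normalizes the starting point first (list_files_absolute is a hypothetical name, not part of the answer above):

```python
import os


def list_files_absolute(path):
    """Return absolute paths of all files below path."""
    # os.walk yields directory names based on whatever was passed in,
    # so normalise the starting point first.
    top = os.path.abspath(path)
    return [
        os.path.join(dirpath, name)
        for dirpath, _dirnames, filenames in os.walk(top)
        for name in filenames
    ]
```

With this variant, calling list_files_absolute('.') still returns absolute paths.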

— WilliamCanin

3

If you don't mind installing an additional lightweight library, you can do this:

pip install plazy

Usage:

import plazy

txt_filter = lambda x : True if x.endswith('.txt') else False
files = plazy.list_files(root='data', filter_func=txt_filter, is_include_root=True)

The result should look something like this:

['data/a.txt', 'data/b.txt', 'data/sub_dir/c.txt']

It works on both Python 2.7 and Python 3.

Github: https://github.com/kyzas/plazy#list-files

Disclaimer: I'm the author of plazy.

— Minh Nguyen

1

This function will recursively put only files into a list. Hope this helps you.

import os


def ls_files(dir):
    files = list()
    for item in os.listdir(dir):
        abspath = os.path.join(dir, item)
        try:
            if os.path.isdir(abspath):
                files = files + ls_files(abspath)
            else:
                files.append(abspath)
        except FileNotFoundError as err:
            print('invalid directory\n', 'Error: ', err)
    return files

— Yossarian42

0

Your original solution was very nearly correct, but the variable "root" is dynamically updated as os.walk recursively walks the paths. os.walk() is a recursive generator. Each tuple of (root, subFolder, files) is for one specific root, the way you have it set up.

i.e.

root = 'C:\\'
subFolder = ['Users', 'ProgramFiles', 'ProgramFiles (x86)', 'Windows', ...]
files = ['foo1.txt', 'foo2.txt', 'foo3.txt', ...]

root = 'C:\\Users\\'
subFolder = ['UserAccount1', 'UserAccount2', ...]
files = ['bar1.txt', 'bar2.txt', 'bar3.txt', ...]

...
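
The sequence of tuples can be observed directly on a small throwaway tree. This sketch builds one in a temporary directory (the file and folder names are made up for the example) and records what os.walk yields at each step.

```python
import os
import tempfile

# Build a tiny throwaway tree: top/a.txt and top/sub/b.txt
with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        open(os.path.join(top, rel), "w").close()

    walked = []
    for root, sub_folders, files in os.walk(top):
        # root is the directory currently being visited; sub_folders and
        # files are the names directly inside it.
        walked.append((os.path.relpath(root, top), sub_folders, files))
        print(root, sub_folders, files)
```

Each file name in files belongs to the current root, which is why joining root and the file name is enough to get the full path.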

I made a slight tweak to your code to print a full list.

import os
for root, subFolder, files in os.walk(PATH):
    for item in files:
        if item.endswith(".txt") :
            fileNamePath = str(os.path.join(root,item))
            print(fileNamePath)

Hope this helps!

— LastTigerEyes