How to remove an element in lxml

Question 1

I need to completely remove elements, based on the contents of an attribute, using Python's lxml. For example:

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
  pass  # remove this element from the tree

print(et.tostring(tree, pretty_print=True).decode())

I would like this to print:

<groceries>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>

Is there a way to do this without storing a temporary variable and printing it out by hand, like this:

newxml="<groceries>\n"
for elt in tree.xpath("//fruit[@state='fresh']"):
  newxml+=et.tostring(elt, encoding="unicode")

newxml+="</groceries>"

Question 2

Use the remove method of an xmlElement:

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
  bad.getparent().remove(bad)     # grab the parent of the element and call remove directly on it

print(et.tostring(tree, pretty_print=True, xml_declaration=True).decode())

Compared with @Acorn's version, mine works even if the elements to remove are not directly under the root node of your XML.

Question 3

You're looking for the remove function. Call remove on the element's parent and pass it the subelement to remove.

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="rotten">strawberry</fruit>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
    bad.getparent().remove(bad)

print(et.tostring(tree, pretty_print=True).decode())

Result:

<groceries>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>

Question 4

I ran into this situation:

<div>
    <script>
        some code
    </script>
    text here
</div>

div.remove(script) also removes the "text here" part, which I did not intend.

Following the answer here, I found that etree.strip_elements is a better solution for me: you can control whether the trailing text is removed with the with_tail=(bool) parameter.

But I still don't know whether this can take an XPath filter for the tag. I'm just posting this for information.

Here is the documentation:

strip_elements(tree_or_element, *tag_names, with_tail=True)

Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their entire subtree, including all their attributes, text content and descendants. It will also remove the tail text of the element unless you explicitly set the with_tail keyword argument option to False.

Tag names can contain wildcards as in _Element.iter.

Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants. If you want to include the root element, check its tag name directly before even calling this function.

Example usage::
   strip_elements(some_element,
       'simpletagname',             # non-namespaced tag
       '{http://some/ns}tagname',   # namespaced tag
       '{http://some/other/ns}*',   # any tag from a namespace
       lxml.etree.Comment           # comments
       )
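
To make the with_tail difference concrete, here is a minimal sketch that runs both approaches on the div/script snippet from the top of this answer. The variable names (root, copy_a, removed, stripped) are only illustrative. Note that strip_elements matches by tag name, not by XPath expression.

```python
import lxml.etree as et

root = et.fromstring("<div><script>some code</script>text here</div>")

# Work on a copy so both approaches start from the same input.
copy_a = et.fromstring(et.tostring(root))

# remove() takes the element out together with its tail text.
copy_a.remove(copy_a.find("script"))
removed = et.tostring(copy_a, encoding="unicode")

# strip_elements() with with_tail=False drops the element but keeps the tail.
et.strip_elements(root, "script", with_tail=False)
stripped = et.tostring(root, encoding="unicode")

print(removed)
print(stripped)
```

With this input the first print should show an empty div, while the second should still contain "text here".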

Question 5

As already mentioned, you can use the remove() method to delete (sub)elements from the tree:

for bad in tree.xpath("//fruit[@state=\'rotten\']"):
  bad.getparent().remove(bad)

But it removes the element together with its tail, which is a problem if you are processing mixed-content documents like HTML:

<div><fruit state="rotten">avocado</fruit> Hello!</div>

becomes

<div></div>

Which I suppose is not always what you want :) I wrote a helper function that removes just the element and keeps its tail text:

def remove_element(el):
    parent = el.getparent()
    # tail is None when no text follows the element
    if el.tail and el.tail.strip():
        prev = el.getprevious()
        # compare with None: an element without children is falsy
        if prev is not None:
            prev.tail = (prev.tail or '') + el.tail
        else:
            parent.text = (parent.text or '') + el.tail
    parent.remove(el)

for bad in tree.xpath("//fruit[@state=\'rotten\']"):
    remove_element(bad)

This way, the tail text is kept:

<div> Hello!</div>
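
As a self-contained check of the idea, the sketch below runs a tail-preserving removal on two inputs: one where the removed element is the first child (the tail moves into the parent's text) and one where it follows a sibling (the tail moves onto that sibling). The helper name remove_keep_tail and the second sample document are assumptions made for illustration; unlike the helper above, this variant also carries over whitespace-only tails.

```python
import lxml.etree as et

def remove_keep_tail(el):
    """Hypothetical helper: detach el but leave its tail text in place."""
    parent = el.getparent()
    if el.tail:
        prev = el.getprevious()
        if prev is not None:
            # Append the tail to the preceding sibling's tail.
            prev.tail = (prev.tail or "") + el.tail
        else:
            # No preceding sibling: the tail becomes part of the parent's text.
            parent.text = (parent.text or "") + el.tail
    parent.remove(el)

first = et.fromstring('<div><fruit state="rotten">avocado</fruit> Hello!</div>')
after_sibling = et.fromstring(
    '<div><b>ripe</b><fruit state="rotten">avocado</fruit> Hello!</div>'
)

for doc in (first, after_sibling):
    for bad in doc.xpath("//fruit[@state='rotten']"):
        remove_keep_tail(bad)

print(et.tostring(first, encoding="unicode"))
print(et.tostring(after_sibling, encoding="unicode"))
```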

Question 6

You can also use html from lxml to solve this:

from lxml import html

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree = html.fromstring(xml)

print("//BEFORE")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

for i in tree.xpath("//fruit[@state='rotten']"):
    i.drop_tree()

print("//AFTER")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

It will output this:

//BEFORE
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>


//AFTER
<groceries>

  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>

  <fruit state="fresh">peach</fruit>
</groceries>