Byte order mark screws up file reading in Java

107

I'm trying to read CSV files using Java. Some of the files may have a byte order mark at the beginning, but not all. When present, the byte order mark gets read along with the rest of the first line, which causes problems with string comparisons.

Is there an easy way to skip the byte order mark when it is present?

Thanks!

java utf-8 byte-order-mark

— Tom

maybe: rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html

— Chris

114

EDIT: I've made a proper release on GitHub: https://github.com/gpakosz/UnicodeBOMInputStream

Here is a class I coded a while ago; I just edited the package name before pasting. Nothing special, it is quite similar to the solutions posted in SUN's bug database. Incorporate it in your code and you're fine.

/* ____________________________________________________________________________
 * 
 * File:    UnicodeBOMInputStream.java
 * Author:  Gregory Pakosz.
 * Date:    02 - November - 2005    
 * ____________________________________________________________________________
 */
package com.stackoverflow.answer;

import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/**
 * The <code>UnicodeBOMInputStream</code> class wraps any
 * <code>InputStream</code> and detects the presence of any Unicode BOM
 * (Byte Order Mark) at its beginning, as defined by
 * <a href="http://www.faqs.org/rfcs/rfc3629.html">RFC 3629 - UTF-8, a transformation format of ISO 10646</a>
 * 
 * <p>The
 * <a href="http://www.unicode.org/unicode/faq/utf_bom.html">Unicode FAQ</a>
 * defines 5 types of BOMs:<ul>
 * <li><pre>00 00 FE FF  = UTF-32, big-endian</pre></li>
 * <li><pre>FF FE 00 00  = UTF-32, little-endian</pre></li>
 * <li><pre>FE FF        = UTF-16, big-endian</pre></li>
 * <li><pre>FF FE        = UTF-16, little-endian</pre></li>
 * <li><pre>EF BB BF     = UTF-8</pre></li>
 * </ul></p>
 * 
 * <p>Use the {@link #getBOM()} method to know whether a BOM has been detected
 * or not.
 * </p>
 * <p>Use the {@link #skipBOM()} method to remove the detected BOM from the
 * wrapped <code>InputStream</code> object.</p>
 */
public class UnicodeBOMInputStream extends InputStream
{
  /**
   * Type safe enumeration class that describes the different types of Unicode
   * BOMs.
   */
  public static final class BOM
  {
    /**
     * NONE.
     */
    public static final BOM NONE = new BOM(new byte[]{},"NONE");

    /**
     * UTF-8 BOM (EF BB BF).
     */
    public static final BOM UTF_8 = new BOM(new byte[]{(byte)0xEF,
                                                       (byte)0xBB,
                                                       (byte)0xBF},
                                            "UTF-8");

    /**
     * UTF-16, little-endian (FF FE).
     */
    public static final BOM UTF_16_LE = new BOM(new byte[]{ (byte)0xFF,
                                                            (byte)0xFE},
                                                "UTF-16 little-endian");

    /**
     * UTF-16, big-endian (FE FF).
     */
    public static final BOM UTF_16_BE = new BOM(new byte[]{ (byte)0xFE,
                                                            (byte)0xFF},
                                                "UTF-16 big-endian");

    /**
     * UTF-32, little-endian (FF FE 00 00).
     */
    public static final BOM UTF_32_LE = new BOM(new byte[]{ (byte)0xFF,
                                                            (byte)0xFE,
                                                            (byte)0x00,
                                                            (byte)0x00},
                                                "UTF-32 little-endian");

    /**
     * UTF-32, big-endian (00 00 FE FF).
     */
    public static final BOM UTF_32_BE = new BOM(new byte[]{ (byte)0x00,
                                                            (byte)0x00,
                                                            (byte)0xFE,
                                                            (byte)0xFF},
                                                "UTF-32 big-endian");

    /**
     * Returns a <code>String</code> representation of this <code>BOM</code>
     * value.
     */
    public final String toString()
    {
      return description;
    }

    /**
     * Returns the bytes corresponding to this <code>BOM</code> value.
     */
    public final byte[] getBytes()
    {
      final int     length = bytes.length;
      final byte[]  result = new byte[length];

      // Make a defensive copy
      System.arraycopy(bytes,0,result,0,length);

      return result;
    }

    private BOM(final byte bom[], final String description)
    {
      assert(bom != null)               : "invalid BOM: null is not allowed";
      assert(description != null)       : "invalid description: null is not allowed";
      assert(description.length() != 0) : "invalid description: empty string is not allowed";

      this.bytes          = bom;
      this.description  = description;
    }

            final byte    bytes[];
    private final String  description;

  } // BOM

  /**
   * Constructs a new <code>UnicodeBOMInputStream</code> that wraps the
   * specified <code>InputStream</code>.
   * 
   * @param inputStream an <code>InputStream</code>.
   * 
   * @throws NullPointerException when <code>inputStream</code> is
   * <code>null</code>.
   * @throws IOException on reading from the specified <code>InputStream</code>
   * when trying to detect the Unicode BOM.
   */
  public UnicodeBOMInputStream(final InputStream inputStream) throws  NullPointerException,
                                                                      IOException

  {
    if (inputStream == null)
      throw new NullPointerException("invalid input stream: null is not allowed");

    in = new PushbackInputStream(inputStream,4);

    final byte  bom[] = new byte[4];
    final int   read  = in.read(bom);

    switch(read)
    {
      case 4:
        if ((bom[0] == (byte)0xFF) &&
            (bom[1] == (byte)0xFE) &&
            (bom[2] == (byte)0x00) &&
            (bom[3] == (byte)0x00))
        {
          this.bom = BOM.UTF_32_LE;
          break;
        }
        else
        if ((bom[0] == (byte)0x00) &&
            (bom[1] == (byte)0x00) &&
            (bom[2] == (byte)0xFE) &&
            (bom[3] == (byte)0xFF))
        {
          this.bom = BOM.UTF_32_BE;
          break;
        }

      case 3:
        if ((bom[0] == (byte)0xEF) &&
            (bom[1] == (byte)0xBB) &&
            (bom[2] == (byte)0xBF))
        {
          this.bom = BOM.UTF_8;
          break;
        }

      case 2:
        if ((bom[0] == (byte)0xFF) &&
            (bom[1] == (byte)0xFE))
        {
          this.bom = BOM.UTF_16_LE;
          break;
        }
        else
        if ((bom[0] == (byte)0xFE) &&
            (bom[1] == (byte)0xFF))
        {
          this.bom = BOM.UTF_16_BE;
          break;
        }

      default:
        this.bom = BOM.NONE;
        break;
    }

    if (read > 0)
      in.unread(bom,0,read);
  }

  /**
   * Returns the <code>BOM</code> that was detected in the wrapped
   * <code>InputStream</code> object.
   * 
   * @return a <code>BOM</code> value.
   */
  public final BOM getBOM()
  {
    // BOM type is immutable.
    return bom;
  }

  /**
   * Skips the <code>BOM</code> that was found in the wrapped
   * <code>InputStream</code> object.
   * 
   * @return this <code>UnicodeBOMInputStream</code>.
   * 
   * @throws IOException when trying to skip the BOM from the wrapped
   * <code>InputStream</code> object.
   */
  public final synchronized UnicodeBOMInputStream skipBOM() throws IOException
  {
    if (!skipped)
    {
      in.skip(bom.bytes.length);
      skipped = true;
    }
    return this;
  }

  /**
   * {@inheritDoc}
   */
  public int read() throws IOException
  {
    return in.read();
  }

  /**
   * {@inheritDoc}
   */
  public int read(final byte b[]) throws  IOException,
                                          NullPointerException
  {
    return in.read(b,0,b.length);
  }

  /**
   * {@inheritDoc}
   */
  public int read(final byte b[],
                  final int off,
                  final int len) throws IOException,
                                        NullPointerException
  {
    return in.read(b,off,len);
  }

  /**
   * {@inheritDoc}
   */
  public long skip(final long n) throws IOException
  {
    return in.skip(n);
  }

  /**
   * {@inheritDoc}
   */
  public int available() throws IOException
  {
    return in.available();
  }

  /**
   * {@inheritDoc}
   */
  public void close() throws IOException
  {
    in.close();
  }

  /**
   * {@inheritDoc}
   */
  public synchronized void mark(final int readlimit)
  {
    in.mark(readlimit);
  }

  /**
   * {@inheritDoc}
   */
  public synchronized void reset() throws IOException
  {
    in.reset();
  }

  /**
   * {@inheritDoc}
   */
  public boolean markSupported() 
  {
    return in.markSupported();
  }

  private final PushbackInputStream in;
  private final BOM                 bom;
  private       boolean             skipped = false;

} // UnicodeBOMInputStream

And you use it this way:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public final class UnicodeBOMInputStreamUsage
{
  public static void main(final String[] args) throws Exception
  {
    FileInputStream fis = new FileInputStream("test/offending_bom.txt");
    UnicodeBOMInputStream ubis = new UnicodeBOMInputStream(fis);

    System.out.println("detected BOM: " + ubis.getBOM());

    System.out.print("Reading the content of the file without skipping the BOM: ");
    InputStreamReader isr = new InputStreamReader(ubis);
    BufferedReader br = new BufferedReader(isr);

    System.out.println(br.readLine());

    br.close();
    isr.close();
    ubis.close();
    fis.close();

    fis = new FileInputStream("test/offending_bom.txt");
    ubis = new UnicodeBOMInputStream(fis);
    isr = new InputStreamReader(ubis);
    br = new BufferedReader(isr);

    ubis.skipBOM();

    System.out.print("Reading the content of the file after skipping the BOM: ");
    System.out.println(br.readLine());

    br.close();
    isr.close();
    ubis.close();
    fis.close();
  }

} // UnicodeBOMInputStreamUsage

— Gregory Pakosz

2

Sorry for the long scrolling areas, too bad there is no attachment feature

— Gregory Pakosz

Thanks Gregory, that's just what I'm looking for.

— Tom

3

This should be in the core Java API

— Denis Kniazhev

7

10 years have passed and I'm still receiving karma for this :D I'm looking at you, Java!

— Gregory Pakosz

1

Upvoted because the answer provides history regarding why the file input stream doesn't provide the option to discard the BOM by default.

— MxLDevs

94

The Apache Commons IO library has an InputStream that can detect and discard BOMs: BOMInputStream (javadoc):

BOMInputStream bomIn = new BOMInputStream(in);
int firstNonBOMByte = bomIn.read(); // Skips BOM
if (bomIn.hasBOM()) {
    // has a UTF-8 BOM
}

If you also need to detect different encodings, it can also distinguish among various byte order marks, e.g. UTF-8 vs. UTF-16 big + little endian; details are at the doc link above. You can then use the detected ByteOrderMark to choose a Charset to decode the stream. (There's probably a more streamlined way to do this if you need all of this functionality, maybe the UnicodeReader in BalusC's answer?) Note that, in general, there is no very good way to detect what encoding some bytes are in, but if the stream starts with a BOM, this can apparently be helpful.

Edit: If you need to detect the BOM in UTF-16, UTF-32, etc., then the constructor should be:

new BOMInputStream(is, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE,
        ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)

Upvote @martin-charlesworth's comment :)

— rescdsk

Just skips the BOM. Should be the perfect solution for 99% of use cases.

— atamanroman

7

I've used this answer successfully. However, I would respectfully add the boolean argument to specify whether to include or exclude the BOM. Example: BOMInputStream bomIn = new BOMInputStream(in, false); // don't include the BOM

— Kevin Meredith.

19

I would also add that this only detects the UTF-8 BOM. If you want to detect all UTF-X BOMs, then you need to pass them in to the BOMInputStream constructor.

BOMInputStream bomIn = new BOMInputStream(is, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE);

— Martin Charlesworth

Regarding @KevinMeredith's comment, I want to point out that the constructor with the boolean is clearer, but the default constructor already excludes the UTF-8 BOM, as the JavaDoc suggests: BOMInputStream(InputStream delegate) Constructs a new BOM InputStream that excludes a ByteOrderMark.UTF_8 BOM.

— WesternGun

Skipping solves most of my problems. If my file starts with a UTF_16BE BOM, can I create an InputReader by skipping the BOM and reading the file as UTF_8? So far it works; I want to understand whether there are any edge cases. Thanks in advance.

— Bhaskar

31

Simpler solution:

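import java.io.IOException;
import java.io.Reader;

public class BOMSkipper
{
    // Consumes a leading U+FEFF if present; otherwise rewinds to the start.
    // The reader must support mark/reset (for example, a BufferedReader).
    public static void skip(Reader reader) throws IOException
    {
        reader.mark(1);
        char[] possibleBOM = new char[1];
        reader.read(possibleBOM);

        if (possibleBOM[0] != '\ufeff')
        {
            reader.reset();
        }
    }
}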

Usage sample:

BufferedReader input = new BufferedReader(new InputStreamReader(new FileInputStream(file), fileExpectedCharset));
BOMSkipper.skip(input);
//Now UTF prefix not present:
input.readLine();
...

It works with all 5 UTF encodings!

1

Very good, Andrei. But could you explain why it works? How does the pattern 0xFEFF successfully match UTF-8 files, which seem to have a different pattern and 3 bytes instead of 2? And how can that pattern match both endiannesses of UTF-16 and UTF-32?

— Vahid Pazirandeh

1

As you can see, I don't use a byte stream but a character stream opened with the expected charset. So if the first character from this stream is a BOM, I skip it. A BOM may have a different byte representation in each encoding, but it is a single character. Please read this article, it helped me: joelonsoftware.com/articles/Unicode.html
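
The point of the reply above, that the BOM is a single character whatever its byte representation, can be checked with a small sketch. It assumes the reader is opened with the charset the data is actually in; the names BomAsCharacterDemo and firstChar are made up for this illustration. Only UTF-8, UTF-16BE and UTF-16LE are exercised, since the JDK decoders for those charsets hand the BOM through as U+FEFF.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomAsCharacterDemo {

    /** Decodes the bytes with the given charset and returns the first character read. */
    public static int firstChar(byte[] bytes, Charset charset) throws IOException {
        try (Reader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            return reader.read();
        }
    }

    public static void main(String[] args) throws IOException {
        // A CSV header line that starts with the BOM character U+FEFF.
        String text = "\uFEFFid,name";
        Charset[] charsets = {
            StandardCharsets.UTF_8, StandardCharsets.UTF_16BE, StandardCharsets.UTF_16LE
        };
        for (Charset cs : charsets) {
            byte[] encoded = text.getBytes(cs);
            // The BOM occupies 3 bytes in UTF-8 and 2 bytes in UTF-16,
            // but the character stream yields one char in every case.
            System.out.printf("%s: %d bytes, first char U+%04X%n",
                    cs.name(), encoded.length, firstChar(encoded, cs));
        }
    }
}
```

Because the comparison happens after decoding, BOMSkipper never has to know the byte pattern; it does rely on the caller passing the right charset to InputStreamReader.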

Nice solution; just make sure to check that the file is not empty, to avoid an IOException in the skip method before reading. You can do that by calling if (reader.ready()) { reader.read(possibleBOM) ... }

— Snow

I see you've covered 0xFE 0xFF, which is the byte order mark for UTF-16BE. But what if the first 3 bytes are 0xEF 0xBB 0xBF (the byte order mark for UTF-8)? You claim that this works for all UTF-8 formats. That might be true (I haven't tested your code), but then how does it work?

— bvdb

1

See my answer to Vahid: I'm not opening a byte stream but a character stream, and I read one character from it. Never mind which UTF encoding is used for the file; the BOM prefix may be represented by a different number of bytes, but in terms of characters it's just one character.

24

The Google Data API has a UnicodeReader which automatically detects the encoding.

You can use it instead of InputStreamReader. Here's a slightly compacted extract of its source, which is pretty straightforward:

public class UnicodeReader extends Reader {
    private static final int BOM_SIZE = 4;
    private final InputStreamReader reader;

    /**
     * Construct UnicodeReader
     * @param in Input stream.
     * @param defaultEncoding Default encoding to be used if BOM is not found,
     * or <code>null</code> to use system default encoding.
     * @throws IOException If an I/O error occurs.
     */
    public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
        byte bom[] = new byte[BOM_SIZE];
        String encoding;
        int unread;
        PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
        int n = pushbackStream.read(bom, 0, bom.length);

        // Read ahead four bytes and check for BOM marks.
        if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else {
            encoding = defaultEncoding;
            unread = n;
        }

        // Unread bytes if necessary and skip BOM marks.
        if (unread > 0) {
            pushbackStream.unread(bom, (n - unread), unread);
        } else if (unread < -1) {
            pushbackStream.unread(bom, 0, 0);
        }

        // Use given encoding.
        if (encoding == null) {
            reader = new InputStreamReader(pushbackStream);
        } else {
            reader = new InputStreamReader(pushbackStream, encoding);
        }
    }

    public String getEncoding() {
        return reader.getEncoding();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    public void close() throws IOException {
        reader.close();
    }
}

— BalusC

It seems the link says the Google Data API is deprecated? Where should one look for the Google Data API now?

— SOUser

1

@XichenLi: The GData API has been deprecated for its intended purpose. I didn't intend to suggest using the GData API directly (the OP isn't using any GData service); I intended to use its source code as an example for your own implementation. That's also why I included it in my answer, ready for copy-paste.

— BalusC

There's a bug in this. The UTF-32LE case is unreachable. For (bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00) to be true, the UTF-16LE case ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) would already have matched.

— Joshua Taylor

Since this code is from the Google Data API, I posted issue 471 about it.

— Joshua Taylor,
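
The unreachable branch described in the comments above goes away if the four-byte UTF-32 patterns are tested before the two-byte UTF-16 ones. A minimal sketch of that ordering follows. BomDetector and detect are hypothetical names, not part of the GData source, and the sketch also checks how many bytes were actually read, which the extract above does not do.

```java
final class BomDetector {

    /**
     * Returns the charset name implied by a leading BOM, or null if no BOM is found.
     *
     * @param head the first bytes of the stream
     * @param n    how many bytes of head are valid (0 to head.length)
     */
    public static String detect(byte[] head, int n) {
        // Longest patterns first: FF FE 00 00 starts with the UTF-16LE BOM FF FE.
        if (n >= 4 && head[0] == (byte) 0x00 && head[1] == (byte) 0x00
                && head[2] == (byte) 0xFE && head[3] == (byte) 0xFF) {
            return "UTF-32BE";
        }
        if (n >= 4 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE
                && head[2] == (byte) 0x00 && head[3] == (byte) 0x00) {
            return "UTF-32LE";
        }
        if (n >= 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF) {
            return "UTF-8";
        }
        if (n >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) {
            return "UTF-16BE";
        }
        if (n >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) {
            return "UTF-16LE";
        }
        return null; // caller falls back to its default encoding
    }

    public static void main(String[] args) {
        byte[] utf32le = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x00};
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        System.out.println(detect(utf32le, utf32le.length));
        System.out.println(detect(utf16le, utf16le.length));
    }
}
```

This ordering is still a heuristic: FF FE 00 00 is also a legal start for a UTF-16LE stream whose first character is U+0000, so a caller that knows the expected encoding should prefer that knowledge.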

13

The Apache Commons IO library's BOMInputStream has already been mentioned by @rescdsk, but I did not see it mention how to get an InputStream without the BOM.

Here's how I did it in Scala.

 import java.io._
 import org.apache.commons.io.input.BOMInputStream
 val file = new File(path_to_xml_file_with_BOM)
 val fileInpStream = new FileInputStream(file)
 val bomIn = new BOMInputStream(fileInpStream,
         false) // false means don't include BOM

— Kevin Meredith

The single-arg constructor does it: public BOMInputStream(InputStream delegate) { this(delegate, false, ByteOrderMark.UTF_8); }. It excludes the UTF-8 BOM by default.

— Vladimir Vagaytsev

Good point, Vladimir. I see that in its docs - commons.apache.org/proper/commons-io/javadocs/api-2.2/org/… : Constructs a new BOM InputStream that excludes a ByteOrderMark.UTF_8 BOM.

— Kevin Meredith

4

To simply remove the BOM characters from your file, I recommend using Apache Commons IO

public BOMInputStream(InputStream delegate,
              boolean include)
Constructs a new BOM InputStream that detects a ByteOrderMark.UTF_8 and optionally includes it.
Parameters:
delegate - the InputStream to delegate to
include - true to include the UTF-8 BOM or false to exclude it

Set include to false and your BOM characters will be excluded.

— Andreas Baaserud

2

Sadly not. You'll have to identify and skip it yourself. This page details what you have to watch for. Also see this SO question for more details.

— Brian Agnew

1

I had the same problem, and because I wasn't reading in a bunch of files I went with a simpler solution. I think my encoding was UTF-8, because when I printed out the offending character with the help of this page: Get unicode value of a character, I found that it was \ufeff. I used the code System.out.println( "\\u" + Integer.toHexString(str.charAt(0) | 0x10000).substring(1) ); to print out the offending unicode value.

Once I had the offending unicode value, I replaced it in the first line of my file before I went on reading. The business logic of that section:

String str = reader.readLine().trim();
str = str.replace("\ufeff", "");

This fixed my problem. After that I was able to go on processing the file with no issue. I added trim() just in case of leading or trailing whitespace; you can do that or not, depending on your specific needs.

— Amy B Higgins
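
A slightly narrower variant of the idea above, as a sketch: remove U+FEFF only when it is the very first character of the line, instead of replacing every occurrence. FirstLineBom and stripLeadingBom are hypothetical names. Note that trim() alone would not remove the BOM, because it only strips characters up to U+0020.

```java
final class FirstLineBom {

    /** Returns the line without a leading U+FEFF; any other input is returned unchanged. */
    public static String stripLeadingBom(String line) {
        if (line != null && !line.isEmpty() && line.charAt(0) == '\uFEFF') {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        String first = "\uFEFFid,name";
        // trim() leaves the BOM in place, so the length is unchanged.
        System.out.println(first.trim().length() == first.length());
        // After stripping, the header compares equal to the expected text.
        System.out.println(stripLeadingBom(first).equals("id,name"));
    }
}
```

Only the first line read from the file needs this treatment; later lines can be passed through untouched.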

1

That didn't work for me, but I used .replaceFirst("\u00EF\u00BB\u00BF", "").

— StackUMan
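
One plausible reason the \ufeff replacement failed in that case, offered as an assumption rather than a diagnosis: if a UTF-8 file is decoded with a single-byte charset such as ISO-8859-1, the three BOM bytes EF BB BF arrive as three separate characters (U+00EF U+00BB U+00BF) rather than as one U+FEFF. The sketch below, with made-up names, decodes the same bytes both ways.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomMojibakeDemo {

    /** Decodes the three UTF-8 BOM bytes with the given charset. */
    public static String decodeUtf8Bom(Charset charset) {
        byte[] utf8Bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        return new String(utf8Bom, charset);
    }

    public static void main(String[] args) {
        String asUtf8 = decodeUtf8Bom(StandardCharsets.UTF_8);
        String asLatin1 = decodeUtf8Bom(StandardCharsets.ISO_8859_1);

        // Decoded as UTF-8, the bytes form the single character U+FEFF.
        System.out.println(asUtf8.equals("\uFEFF"));
        // Decoded as ISO-8859-1, each byte becomes its own character.
        System.out.println(asLatin1.equals("\u00EF\u00BB\u00BF"));
    }
}
```

If that is what happened, opening the reader with the file's real charset is probably a more robust fix than matching the three characters, since the rest of the non-ASCII content would otherwise be misdecoded as well.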