Fast extraction of a time range from a syslog logfile?

12

I've got a logfile in the standard syslog format. It looks like this, except with hundreds of lines per second:

Jan 11 07:48:46 blahblahblah...
Jan 11 07:49:00 blahblahblah...
Jan 11 07:50:13 blahblahblah...
Jan 11 07:51:22 blahblahblah...
Jan 11 07:58:04 blahblahblah...

It doesn't get rolled at exactly midnight, but it will never have more than two days in it.

I often have to extract a time slice from this file. I'd like to write a general-purpose script for this, which I can call like:

$ timegrep 22:30-02:00 /logs/something.log

... and have it pull out the lines from 22:30, onward across the midnight boundary, until 2 am the next day.
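As a rough illustration of that invocation (not an existing tool), the range argument could be resolved into concrete datetimes along these lines. The name `resolve_range` and the `end_day` parameter are made up for this sketch, and it assumes the range never covers more than 24 hours.

```python
from datetime import date, datetime, timedelta

def resolve_range(spec, end_day):
    """Turn 'HH:MM-HH:MM' into (start, end) datetimes.

    end_day is the date on which the range ends. If the start time is
    later in the day than the end time, the range is assumed to have
    started on the previous day.
    """
    start_text, end_text = spec.split("-")
    start_time = datetime.strptime(start_text, "%H:%M").time()
    end_time = datetime.strptime(end_text, "%H:%M").time()
    start = datetime.combine(end_day, start_time)
    end = datetime.combine(end_day, end_time)
    if start > end:                      # the range crosses midnight
        start -= timedelta(days=1)
    return start, end

start, end = resolve_range("22:30-02:00", date(2010, 1, 12))
print(start, end)    # 2010-01-11 22:30:00 2010-01-12 02:00:00
```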

There are a few caveats:

I don't want to have to bother typing the date(s) on the command line, just the times. The program should be smart enough to figure them out.
The log date format doesn't include the year, so it should guess based on the current year, but still do the right thing around New Year's Day.
I want it to be fast: it should use the fact that the lines are in order to seek around in the file and use a binary search.
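One possible way to handle the missing year, sketched under the assumption that log entries are never dated in the future: a date that would fall after "now" in the current year is taken to belong to the previous year. `guess_year` and `parse_syslog_stamp` are hypothetical names, and parsing `%b` assumes English month abbreviations in the current locale.

```python
from datetime import datetime

def guess_year(month, day, now):
    """Pick the year for a log date that carries no year.

    Assumption: entries are never in the future, so a month/day later
    than today's must come from last year.
    """
    year = now.year
    if (month, day) > (now.month, now.day):
        year -= 1
    return year

def parse_syslog_stamp(text, now):
    """Parse 'Jan 11 07:48:46' into a datetime with a guessed year."""
    # A placeholder leap year lets 'Feb 29' parse; the real year is set below.
    parsed = datetime.strptime("2000 " + text, "%Y %b %d %H:%M:%S")
    return parsed.replace(year=guess_year(parsed.month, parsed.day, now))

# Just after midnight on New Year's Day, a Dec 31 entry belongs to last year.
now = datetime(2010, 1, 1, 0, 5)
print(parse_syslog_stamp("Dec 31 23:59:58", now))   # 2009-12-31 23:59:58
```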

Before I spend a bunch of time writing this, does it already exist?

linux log-files grep

— mike

9

Update: I've replaced the original code with an updated version that has numerous improvements. Let's call this (actual?) alpha quality.

This version includes:

command-line option handling
command-line time format validation
some try blocks
line reading moved into a function

Original text:

Well, what do you know? "Seek" and ye shall find! Here is a Python program that seeks around in the file and uses a more-or-less binary search. It's considerably faster than the AWK script that other guy wrote.

It's (pre?) alpha quality. It should have try blocks and input validation and lots of testing, and could no doubt be more Pythonic. But here it is for your amusement. Oh, and it's written for Python 2.6.

New code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# timegrep.py by Dennis Williamson 20100113
# in response to http://serverfault.com/questions/101744/fast-extraction-of-a-time-range-from-syslog-logfile

# thanks to serverfault user http://serverfault.com/users/1545/mike
# for the inspiration

# Perform a binary search through a log file to find a range of times
# and print the corresponding lines

# tested with Python 2.6

# TODO: Make sure that it works if the seek falls in the middle of
#       the first or last line
# TODO: Make sure it's not blind to a line where the sync read falls
#       exactly at the beginning of the line being searched for and
#       then gets skipped by the second read
# TODO: accept arbitrary date

# done: add -l long and -s short options
# done: test time format

version = "0.01a"

import os, sys
from stat import ST_SIZE
from datetime import date, datetime
import re
from optparse import OptionParser

# Function to read lines from file and extract the date and time
def getdata():
    """Read a line from a file

    Return a tuple containing:
        the date/time in a format such as 'Jan 15 20:14:01'
        the line itself

    The last colon and seconds are optional and
    not handled specially

    """
    try:
        line = handle.readline(bufsize)
    except:
        print("File I/O Error")
        exit(1)
    if line == '':
        print("EOF reached")
        exit(1)
    if line[-1] == '\n':
        line = line.rstrip('\n')
    else:
        if len(line) >= bufsize:
            print("Line length exceeds buffer size")
        else:
            print("Missing newline")
        exit(1)
    words = line.split(' ')
    if len(words) >= 3:
        linedate = words[0] + " " + words[1] + " " + words[2]
    else:
        linedate = ''
    return (linedate, line)
# End function getdata()

# Set up option handling
parser = OptionParser(version = "%prog " + version)

parser.usage = "\n\t%prog [options] start-time end-time filename\n\n\
\twhere times are in the form hh:mm[:ss]"

parser.description = "Search a log file for a range of times occurring yesterday \
and/or today using the current time to intelligently select the start and end. \
A date may be specified instead. Seconds are optional in time arguments."

parser.add_option("-d", "--date", action = "store", dest = "date",
                default = "",
                help = "NOT YET IMPLEMENTED. Use the supplied date instead of today.")

parser.add_option("-l", "--long", action = "store_true", dest = "longout",
                default = False,
                help = "Span the longest possible time range.")

parser.add_option("-s", "--short", action = "store_true", dest = "shortout",
                default = False,
                help = "Span the shortest possible time range.")

parser.add_option("-D", "--debug", action = "store", dest = "debug",
                default = 0, type = "int",
                help = "Output debugging information.\t\t\t\t\tNone (default) = %default, Some = 1, More = 2")

(options, args) = parser.parse_args()

if not 0 <= options.debug <= 2:
    parser.error("debug level out of range")
else:
    debug = options.debug    # 1 = print some debug output, 2 = print a little more, 0 = none

if options.longout and options.shortout:
    parser.error("options -l and -s are mutually exclusive")

if options.date:
    parser.error("date option not yet implemented")

if len(args) != 3:
    parser.error("invalid number of arguments")

start = args[0]
end   = args[1]
file  = args[2]

# test for times to be properly formatted, allow hh:mm or hh:mm:ss
p = re.compile(r'^(2[0-3]|[01][0-9]):[0-5][0-9](:[0-5][0-9])?$')

if not p.match(start) or not p.match(end):
    print("Invalid time specification")
    exit(1)

# Determine Time Range
yesterday = date.fromordinal(date.today().toordinal()-1).strftime("%b %d")
today     = datetime.now().strftime("%b %d")
now       = datetime.now().strftime("%H:%M")   # portable equivalent of %R

if start > now or start > end or options.longout or options.shortout:
    searchstart = yesterday
else:
    searchstart = today

if (end > start > now and not options.longout) or options.shortout:
    searchend = yesterday
else:
    searchend = today

searchstart = searchstart + " " + start
searchend = searchend + " " + end

try:
    handle = open(file,'r')
except:
    print("File Open Error")
    exit(1)

# Set some initial values
bufsize = 4096  # handle long lines, but put a limit on them
rewind  =  100  # arbitrary, the optimal value is highly dependent on the structure of the file
limit   =   75  # arbitrary, allow for a VERY large file, but stop it if it runs away
count   =    0
size    =    os.stat(file)[ST_SIZE]
beginrange   = 0
midrange     = size // 2    # integer division keeps the seek offset an int
oldmidrange  = midrange
endrange     = size
linedate     = ''

pos1 = pos2  = 0

if debug > 0: print("File: '{0}' Size: {1} Today: '{2}' Now: {3} Start: '{4}' End: '{5}'".format(file, size, today, now, searchstart, searchend))

# Seek using binary search
while pos1 != endrange and oldmidrange != 0 and linedate != searchstart:
    handle.seek(midrange)
    linedate, line = getdata()    # sync to line ending
    pos1 = handle.tell()
    if midrange > 0:             # if not BOF, discard first read
        if debug > 1: print("...partial: (len: {0}) '{1}'".format((len(line)), line))
        linedate, line = getdata()

    pos2 = handle.tell()
    count += 1
    if debug > 0: print("#{0} Beg: {1} Mid: {2} End: {3} P1: {4} P2: {5} Timestamp: '{6}'".format(count, beginrange, midrange, endrange, pos1, pos2, linedate))
    if  searchstart > linedate:
        beginrange = midrange
    else:
        endrange = midrange
    oldmidrange = midrange
    midrange = (beginrange + endrange) // 2
    if count > limit:
        print("ERROR: ITERATION LIMIT EXCEEDED")
        exit(1)

if debug > 0: print("...stopping: '{0}'".format(line))

# Rewind a bit to make sure we didn't miss any
seek = oldmidrange
while linedate >= searchstart and seek > 0:
    if seek < rewind:
        seek = 0
    else:
        seek = seek - rewind
    if debug > 0: print("...rewinding")
    handle.seek(seek)

    linedate, line = getdata()    # sync to line ending
    if debug > 1: print("...junk: '{0}'".format(line))

    linedate, line = getdata()
    if debug > 0: print("...comparing: '{0}'".format(linedate))

# Scan forward
while linedate < searchstart:
    if debug > 0: print("...skipping: '{0}'".format(linedate))
    linedate, line = getdata()

if debug > 0: print("...found: '{0}'".format(line))

if debug > 0: print("Beg: {0} Mid: {1} End: {2} P1: {3} P2: {4} Timestamp: '{5}'".format(beginrange, midrange, endrange, pos1, pos2, linedate))

# Now that the preliminaries are out of the way, we just loop,
#     reading lines and printing them until they are
#     beyond the end of the range we want

while linedate <= searchend:
    print(line)
    linedate, line = getdata()

if debug > 0: print("Start: '{0}' End: '{1}'".format(searchstart, searchend))
handle.close()

— Tạm dừng cho đến khi có thông báo mới.
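A side note on timestamp matching, offered as a hedged sketch rather than a patch to the script above: traditional syslog (RFC 3164) pads single-digit days with a space ("Jan  1"), whereas strftime("%b %d") produces "Jan 01" and split(' ') yields an empty field for the extra space. Assuming the file follows that convention, the timestamp can be taken as a fixed 15-character prefix and the search key formatted the same way. Both function names are hypothetical, and `%b` assumes English month abbreviations.

```python
from datetime import datetime

def syslog_prefix(when):
    """Format a datetime as 'Jan  1 07:48:46' (day padded with a space)."""
    return "{0} {1:2d} {2}".format(
        when.strftime("%b"), when.day, when.strftime("%H:%M:%S"))

def split_stamp(line):
    """Split a line into its fixed 15-character timestamp and the rest."""
    return line[:15], line[16:]

print(syslog_prefix(datetime(2010, 1, 1, 7, 48, 46)))   # Jan  1 07:48:46
print(split_stamp("Jan 11 07:48:46 blahblahblah..."))
```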

Wow. I really need to learn Python...

— Stefan Lasiewski

@Dennis Williamson: I see a line containing

if debug > 0: print("File: '{0}' Size: {1} Today: '{2}' Now: {3} Start: '{4}' End: '{5}'".format(file, size, today, now, searchstar$

Is searchstar supposed to end with a $, or is that a typo? I get a syntax error on this line (Line 159).

— Stefan Lasiewski

@Stefan I'd replace that with )).

— Bill Weiss

@Stefan: Thanks. It was a typo, which I've fixed. For quick reference, in place of the $ there should be t, searchend)) so that the line ends with ... searchstart, searchend))

— Tạm dừng cho đến khi có thông báo mới.

@Stefan: Sorry about that. I think that's got it.

— Tạm dừng cho đến khi có thông báo mới.

0

From a quick search on the net, there are things that extract based on keywords (like FIRE or such :) but nothing that extracts a date range from a file.

It doesn't seem hard to do what you propose:

1. Search for the start time.
2. Print out that line.
3. If the end time is < the start time, and the time of a line is > end and < start, then stop.
4. If the end time is > the start time, and the time of a line is > end, then stop.
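The steps above can be sketched as a plain linear scan. This is only an illustration in Python: it reads every line, compares the time of day alone, and assumes each line starts with a 15-character syslog timestamp. `lines_in_range` is a hypothetical name.

```python
def lines_in_range(lines, start, end):
    """Yield the lines whose HH:MM:SS lies between start and end.

    start and end are zero-padded 'HH:MM:SS' strings, so plain string
    comparison orders them correctly. If end < start, the range is
    taken to wrap past midnight.
    """
    wraps = end < start
    started = False
    for line in lines:
        clock = line[7:15]            # 'HH:MM:SS' in 'Jan 11 07:48:46 ...'
        if not started:
            if clock < start:
                continue              # step 1: still looking for the start
            started = True
        if wraps and end < clock < start:
            break                     # step 3: past the end after wrapping
        if not wraps and clock > end:
            break                     # step 4: past the end on the same day
        yield line                    # step 2: inside the range

log = [
    "Jan 11 22:00:00 a",
    "Jan 11 22:30:00 b",
    "Jan 12 01:59:59 c",
    "Jan 12 02:00:01 d",
]
for line in lines_in_range(log, "22:30:00", "02:00:00"):
    print(line)                       # prints the 'b' and 'c' lines
```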

Seems straightforward, and I can write it for you if you don't mind Ruby :)

— Michael Graff

I don't mind Ruby, but #1 isn't straightforward if you want to do it efficiently in a large file: you need to seek() to the halfway point, find the nearest line, see how it starts, and repeat with a new midpoint. It's too inefficient to look at every line.

— chước
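The seek-to-midpoint idea described in the comment above could look roughly like this. It is a minimal sketch, not the code from any answer here: `find_first` and `line_start_at_or_after` are hypothetical names, the file must be opened in binary mode, and the lines must already be sorted by whatever `key` extracts. Comparing the raw 15-byte prefix as text, as in the demo, only orders lines correctly within a single month.

```python
import io
import os

def line_start_at_or_after(handle, pos):
    """Move handle to the first line start at or after pos; return that offset."""
    if pos == 0:
        handle.seek(0)
        return 0
    handle.seek(pos - 1)
    handle.readline()          # finish the line that contains byte pos - 1
    return handle.tell()

def find_first(handle, target, key):
    """Return the offset of the first line with key(line) >= target."""
    handle.seek(0, os.SEEK_END)
    low, high = 0, handle.tell()
    while low < high:
        mid = (low + high) // 2
        line_start_at_or_after(handle, mid)
        line = handle.readline()
        if line and key(line) < target:
            low = mid + 1      # the wanted line starts after mid
        else:
            high = mid         # the wanted line starts at or before this one
    return line_start_at_or_after(handle, low)

# Demo on an in-memory "file"; a real log would be opened with open(path, "rb").
sample = io.BytesIO(
    b"Jan 11 07:48:46 a\n"
    b"Jan 11 07:49:00 b\n"
    b"Jan 11 07:50:13 c\n"
)
offset = find_first(sample, b"Jan 11 07:49:30", key=lambda line: line[:15])
print(offset, sample.readline())   # 36 b'Jan 11 07:50:13 c\n'
```

Seeking to `pos - 1` before discarding a line means a line that begins exactly at `pos` is not skipped, and the loop needs about log2(file size) probes, roughly 35 for a 30 GB file.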

You said large, but didn't specify an actual size. Just how large is large? Worse, if multiple days are involved, it would be very easy to find the wrong one using only the time. After all, if you cross a day boundary, the day the script is running will always differ from the day of the start time. Will the files fit in memory via mmap()?

— Michael Graff
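On the mmap() question, here is a small sketch of how a memory-mapped log could be probed at an arbitrary offset. It assumes a 64-bit system, so that a file of tens of gigabytes can be mapped at all; `open_mapped` and `line_containing` are hypothetical helpers. The operating system pages data in on demand, so only the regions actually touched are read from disk.

```python
import mmap

def open_mapped(path):
    """Map a whole file read-only (fails on an empty file)."""
    with open(path, "rb") as handle:
        return mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)

def line_containing(mapped, pos):
    """Return the full line (without its newline) that contains offset pos."""
    start = mapped.rfind(b"\n", 0, pos) + 1    # 0 when there is no earlier newline
    end = mapped.find(b"\n", pos)
    if end == -1:                              # last line has no trailing newline
        end = len(mapped)
    return mapped[start:end]

# Works on any object with find/rfind and slicing, including plain bytes:
print(line_containing(b"aaa\nbbb\nccc", 5))    # b'bbb'
```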

About 30 GB, on a network-mounted disk.

— mike

0

This will print the range of entries between a start time and an end time based on how they relate to the current time ("now").

Usage:

timegrep [-l] start end filename

Example:

$ timegrep 18:47 03:22 /some/log/file

The -l (long) option causes the longest possible output. With -l, the start time will be interpreted as yesterday if the hours-and-minutes value of the start time is less than both the end time and now. The end time will be interpreted as today if the HH:MM values of both the start time and the end time are greater than "now".

Assuming that "now" is "Jan 11 19:00", this is how various example start and end times will be interpreted (without -l except as noted):

start   end     range begin   range end
19:01   23:59   Jan 10        Jan 10
19:01   00:00   Jan 10        Jan 11
00:00   18:59   Jan 11        Jan 11
18:59   18:58   Jan 10        Jan 11
19:01   23:59   Jan 10        Jan 11    # -l
00:00   18:59   Jan 10        Jan 11    # -l
18:59   19:01   Jan 10        Jan 11    # -l
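For readers more comfortable with Python, the day-selection rules behind these examples can be restated as below. This is a hedged restatement of the conditions in the AWK script's BEGIN block, not part of the script itself; `pick_days` is a made-up name, and all times are zero-padded 'HH:MM' strings compared as text.

```python
def pick_days(start, end, now, long_range=False):
    """Decide which day ('yesterday' or 'today') each time refers to."""
    if start > now or start > end or long_range:
        start_day = "yesterday"
    else:
        start_day = "today"
    if end > start > now and not long_range:
        end_day = "yesterday"
    else:
        end_day = "today"
    return start_day, end_day

# With "now" at 19:00, as in the examples:
print(pick_days("19:01", "23:59", "19:00"))         # ('yesterday', 'yesterday')
print(pick_days("19:01", "00:00", "19:00"))         # ('yesterday', 'today')
print(pick_days("00:00", "18:59", "19:00"))         # ('today', 'today')
print(pick_days("00:00", "18:59", "19:00", True))   # ('yesterday', 'today')
```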

Almost all of the script is setup. The last two lines do all the work.

Warning: no argument validation or error checking is done. Edge cases have not been thoroughly tested. This was written using gawk; other versions of AWK may squawk.

#!/usr/bin/awk -f
BEGIN {
    arg=1
    if ( ARGV[arg] == "-l" ) {
        long = 1
        ARGV[arg++] = ""
    }
    start = ARGV[arg]
    ARGV[arg++] = ""
    end = ARGV[arg]
    ARGV[arg++] = ""

    yesterday = strftime("%b %d", mktime(strftime("%Y %m %d -24 00 00")))
    today = strftime("%b %d")
    now = strftime("%R")

    if ( start > now || start > end || long )
        startdate = yesterday
    else
        startdate = today

    if ( end > now && end > start && start > now && ! long )
        enddate = yesterday
    else
        enddate = today

    startdate = startdate " " start
    enddate = enddate " " end
}

$1 " " $2 " " $3 > enddate {exit}
$1 " " $2 " " $3 >= startdate {print}

I think AWK is very efficient at searching through files. I don't think anything else is necessarily going to be any faster at searching an unindexed text file.

— Tạm dừng cho đến khi có thông báo mới.

It looks like you overlooked my third bullet point. The logs are on the order of 30 GB. If the first line of the file is 7:00 and the last line is 23:00, and I want the slice between 22:00 and 22:01, I don't want the script looking at every line from 7:00 to 22:00. I want it to estimate where that slice would be, seek to that point, and make a new estimate until it finds it.

— mike

I didn't overlook it. I expressed my opinion in the last paragraph.

— Tạm dừng cho đến khi có thông báo mới.

0

A C++ program applying binary search; it would need some simple modifications (i.e. calling strptime) to work with text dates.

http://gitorious.org/bs_grep/

I had a previous version with support for text dates, but it was still too slow for the scale of our log files. Profiling said that over 90% of the time was spent in strptime, so we simply modified the log format to include a numeric Unix timestamp as well.
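To illustrate why a numeric timestamp helps, here is a small Python sketch. The layout (epoch seconds as the first space-separated field) is an assumption, since the actual modified log format is not shown, and the function names are hypothetical. The conversion treats the logged time as UTC. Once the number is in the line, reading it back is a single int() call with no date parsing.

```python
import calendar
import time

def to_epoch_line(line, year):
    """Prefix a syslog line with its Unix timestamp (logged time taken as UTC)."""
    stamp = time.strptime("{0} {1}".format(year, line[:15]), "%Y %b %d %H:%M:%S")
    return "{0} {1}".format(calendar.timegm(stamp), line)

def epoch_of(line):
    """Read the numeric timestamp back without any date parsing."""
    return int(line.split(" ", 1)[0])

converted = to_epoch_line("Jan 11 07:48:46 blahblahblah...", 2010)
print(converted)             # 1263196126 Jan 11 07:48:46 blahblahblah...
print(epoch_of(converted))   # 1263196126
```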

0

Even though this answer is much too late, it might be useful to some.

I've converted the code from @Dennis Williamson into a Python class that can be used in other Python tools.

I've added support for multiple date formats.

import os
from stat import ST_SIZE
from datetime import date, datetime
import re

# @TODO Support for rotated log files - currently using the current year for 'Jan 01' dates.
class LogFileTimeParser(object):
    """
    Extracts parts of a log file based on a start and enddate
    Uses binary search logic to speed up searching

    Common usage: validate log files during testing

    Faster than awk parsing for big log files
    """
    version = "0.01a"

    # Set some initial values
    BUF_SIZE = 4096  # handle long lines, but put a limit on them
    REWIND = 100  # arbitrary, the optimal value is highly dependent on the structure of the file
    LIMIT = 75  # arbitrary, allow for a VERY large file, but stop it if it runs away

    line_date = ''
    line = None
    opened_file = None

    @staticmethod
    def parse_date(text, validate=True):
        # Supports 'Aug 16 14:59:01', '2016-08-16 09:23:09' and 'Jun 1 2005  1:33:06PM' (with or without seconds, microseconds)
        for fmt in ('%Y-%m-%d %H:%M:%S %f', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d %H:%M',
                    '%b %d %H:%M:%S %f', '%b %d %H:%M', '%b %d %H:%M:%S',
                    '%b %d %Y %H:%M:%S %f', '%b %d %Y %H:%M', '%b %d %Y %H:%M:%S',
                    '%b %d %Y %I:%M:%S%p', '%b %d %Y %I:%M%p', '%b %d %Y %I:%M:%S%p %f'):
            try:
                if fmt in ['%b %d %H:%M:%S %f', '%b %d %H:%M', '%b %d %H:%M:%S']:

                    return datetime.strptime(text, fmt).replace(datetime.now().year)
                return datetime.strptime(text, fmt)
            except ValueError:
                pass
        if validate:
            raise ValueError("No valid date format found for '{0}'".format(text))
        else:
            # Cannot use NoneType to compare datetimes. Using minimum instead
            return datetime.min

    # Function to read lines from file and extract the date and time
    def read_lines(self):
        """
        Read a line from a file
        Return a tuple containing:
            the date/time in a format supported by parse_date, and the line itself
        """
        try:
            self.line = self.opened_file.readline(self.BUF_SIZE)
        except:
            raise IOError("File I/O Error")
        if self.line == '':
            raise EOFError("EOF reached")
        # Remove \n from read lines.
        if self.line[-1] == '\n':
            self.line = self.line.rstrip('\n')
        else:
            if len(self.line) >= self.BUF_SIZE:
                raise ValueError("Line length exceeds buffer size")
            else:
                raise ValueError("Missing newline")
        words = self.line.split(' ')
        # This results into Jan 1 01:01:01 000000 or 1970-01-01 01:01:01 000000
        if len(words) >= 3:
            self.line_date = self.parse_date(words[0] + " " + words[1] + " " + words[2],False)
        else:
            self.line_date = self.parse_date('', False)
        return self.line_date, self.line

    def get_lines_between_timestamps(self, start, end, path_to_file, debug=False):
        # Set some initial values
        count = 0
        size = os.stat(path_to_file)[ST_SIZE]
        begin_range = 0
        mid_range = size // 2  # integer division keeps the seek offset an int
        old_mid_range = mid_range
        end_range = size
        pos1 = pos2 = 0

        # If only hours are supplied
        # test for times to be properly formatted, allow hh:mm or hh:mm:ss
        p = re.compile(r'^(2[0-3]|[01][0-9]):[0-5][0-9](:[0-5][0-9])?$')
        if p.match(start) and p.match(end):
            # Determine Time Range
            yesterday = date.fromordinal(date.today().toordinal() - 1).strftime("%Y-%m-%d")
            today = datetime.now().strftime("%Y-%m-%d")
            now = datetime.now().strftime("%H:%M")  # portable equivalent of %R
            if start > now or start > end:
                search_start = yesterday
            else:
                search_start = today
            if end > start > now:
                search_end = yesterday
            else:
                search_end = today
            search_start = self.parse_date(search_start + " " + start)
            search_end = self.parse_date(search_end + " " + end)
        else:
            # Set dates
            search_start = self.parse_date(start)
            search_end = self.parse_date(end)
        try:
            self.opened_file = open(path_to_file, 'r')
        except:
            raise IOError("File Open Error")
        if debug:
            print("File: '{0}' Size: {1} Start: '{2}' End: '{3}'"
                  .format(path_to_file, size, search_start, search_end))

        # Seek using binary search -- ONLY WORKS ON FILES THAT ARE SORTED BY DATE (should be true for log files)
        try:
            while pos1 != end_range and old_mid_range != 0 and self.line_date != search_start:
                self.opened_file.seek(mid_range)
                # sync to line ending
                self.line_date, self.line = self.read_lines()
                pos1 = self.opened_file.tell()
                # if not beginning of file, discard first read
                if mid_range > 0:
                    if debug:
                        print("...partial: (len: {0}) '{1}'".format((len(self.line)), self.line))
                    self.line_date, self.line = self.read_lines()
                pos2 = self.opened_file.tell()
                count += 1
                if debug:
                    print("#{0} Beginning: {1} Mid: {2} End: {3} P1: {4} P2: {5} Timestamp: '{6}'".
                          format(count, begin_range, mid_range, end_range, pos1, pos2, self.line_date))
                if search_start > self.line_date:
                    begin_range = mid_range
                else:
                    end_range = mid_range
                old_mid_range = mid_range
                mid_range = (begin_range + end_range) // 2
                if count > self.LIMIT:
                    raise IndexError("ERROR: ITERATION LIMIT EXCEEDED")
            if debug:
                print("...stopping: '{0}'".format(self.line))
            # Rewind a bit to make sure we didn't miss any
            seek = old_mid_range
            while self.line_date >= search_start and seek > 0:
                if seek < self.REWIND:
                    seek = 0
                else:
                    seek -= self.REWIND
                if debug:
                    print("...rewinding")
                self.opened_file.seek(seek)
                # sync to line ending
                self.line_date, self.line = self.read_lines()
                if debug:
                    print("...junk: '{0}'".format(self.line))
                self.line_date, self.line = self.read_lines()
                if debug:
                    print("...comparing: '{0}'".format(self.line_date))
            # Scan forward
            while self.line_date < search_start:
                if debug:
                    print("...skipping: '{0}'".format(self.line_date))
                self.line_date, self.line = self.read_lines()
            if debug:
                print("...found: '{0}'".format(self.line))
            if debug:
                print("Beginning: {0} Mid: {1} End: {2} P1: {3} P2: {4} Timestamp: '{5}'".
                      format(begin_range, mid_range, end_range, pos1, pos2, self.line_date))
            # Now that the preliminaries are out of the way, we just loop,
            # reading lines and printing them until they are beyond the end of the range we want
            while self.line_date <= search_end:
                # Exclude our 'Nonetype' values
                if not self.line_date == datetime.min:
                    print(self.line)
                self.line_date, self.line = self.read_lines()
            if debug:
                print("Start: '{0}' End: '{1}'".format(search_start, search_end))
        # Reaching end of file is expected here, so EOFError is not reported
        except EOFError:
            pass
        finally:
            # Close the file whether or not EOF was hit
            self.opened_file.close()

— Jeffrey Devloo