I'd like to parse about 6TB of logs from a verbose multi-threaded package, as the raw logs are a nightmare to decipher (and even then the results aren't stable).

In the process I'd like to apply some very simple logic to the parsing, but no matter how I try I can't seem to get the desired results.

I'm asking this as language-agnostic because I'm sure the problem isn't specific to the language I'm using.

Any best practice, methodology or the like? Any suggestions at all really. Stupid logs.

EDIT: The log files are now 9TB. I'm keen to find a suitable answer, so here are some example values and outputs that should help:

Logfile: yes
Output: No

Logfile: maybe
Output: No

Logfile: It's your decision
Output: No

Logfile: I'm not upset
Output: I'm leaving you

Logfile: Do you love me
Output: Bad times divided by good times... divide by zero error

I'm adding a bounty too: if you can help me figure out the algorithm, I'm made!


If you'll forgive the management speak, this needs a bit of "outside the box" thinking.

Grep the verbose output for certain keywords in real time. It's up to you what particular keywords are of significance, however don't be surprised if you decide precious little is of any importance. There's also a very high incidence of repetition, so if your parser is offline you are unlikely to have to worry about data loss.
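As a sketch of the idea, with the log name and keyword list invented purely for illustration (for a live stream you'd use `tail -f`; the bounded `tail -n +1` keeps the example runnable):

```shell
# Create a toy log, then keep only the lines matching the keyword shortlist.
# 'spouse.log' and the keywords are assumptions -- substitute your own.
printf 'hello\nwe need milk\nI am fine\n' > spouse.log
tail -n +1 spouse.log | grep -E --line-buffered 'need|want|fine|listening'
# prints: "we need milk" and "I am fine"
```

`--line-buffered` matters for the real-time case: without it, grep may batch its output and you'll miss the warning signs until it's far too late.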

If the output gets particularly high-bandwidth, it's actually safer to lower the parser's process priority so your CPU can get on with other, higher-priority tasks.

Just pipe the output to /dev/null, the logs aren't worth keeping.


An old trick (adapted from chess) is to find another source of these log files and make the two inputs process each other. It's like the illusion of a small child playing two chess grandmasters simultaneously: he is always able to end both games in a draw.


Well, I don't know if it will apply to your specific needs, and I have no idea how it will scale to a 6TB file... but Microsoft has a tool called LogParser which can be very useful, generically, for parsing log files. It lets you query them using an essentially SQL-like syntax.

Check this post: http://www.codinghorror.com/blog/archives/000369.html
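I can't try it against your format, but a typical LogParser query looks something like this (the `cs-uri-stem` field and `ex*.log` file mask are from the IIS log examples, not your logs, so treat it as a sketch):

```
SELECT TOP 10 cs-uri-stem, COUNT(*) AS Hits
FROM ex*.log
GROUP BY cs-uri-stem
ORDER BY Hits DESC
```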


At 6TB, it might be helpful to split the problem in two. First parse the raw text (something like LogParser might help, or you might need to roll your own), then use an OLAP-like tool to do your analysis.
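A minimal sketch of the two-phase idea in Python, assuming an invented "date level message" line format (phase 1 parses raw text into records; phase 2 is the simplest possible OLAP-style rollup, a group-by):

```python
# Hedged sketch: the log format below is an invented assumption.
from collections import Counter

def parse(lines):
    """Phase 1: turn raw text lines into (level, message) records."""
    records = []
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) == 3:
            _, level, message = parts
            records.append((level, message))
    return records

def aggregate(records):
    """Phase 2: a group-by over severity, the simplest OLAP-style rollup."""
    return Counter(level for level, _ in records)

lines = [
    "2009-02-14 WARN tone detected",
    "2009-02-14 ERROR wrong anniversary",
    "2009-02-15 WARN tone detected",
]
print(aggregate(parse(lines)))  # Counter({'WARN': 2, 'ERROR': 1})
```

Keeping the phases separate means the expensive single-pass parse only happens once, and all subsequent analysis runs against the much smaller structured output.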


Easy ;)

  1. Populate your array with this dictionary:


2. Tail your logs straight into this program:

Dim arrayPairs
Dim arrayPairs1()
Dim sSense

arrayPairs = Split("We need:I want|Do what you want:You will pay for this later|Is my butt fat?:Tell me I am beautiful", "|")

ReDim arrayPairs1(UBound(arrayPairs))
For i = 0 To UBound(arrayPairs)
    arrayPairs1(i) = Split(arrayPairs(i), ":")
Next

Function MakesSense(sInput)
    sSense = ""
    For i = 0 To UBound(arrayPairs)
        If arrayPairs1(i)(0) = sInput Then sSense = arrayPairs1(i)(1)
    Next
    MakesSense = sSense <> ""
End Function

' Take input from stdin and feed each line straight into the function
Do While Not WScript.StdIn.AtEndOfStream
    If MakesSense(WScript.StdIn.ReadLine) Then WScript.Echo sSense
Loop

In pseudocode I'm pretty sure the algorithm looks something like:

bool YouLose(string input)
    #pragma ignore input          // the input is irrelevant

    double d = random(double.Min, double.Max)

    return d != NaN               // always true: nothing compares equal to NaN

Loop over all the data in the log file until YouLose returns false or EOF.

EDIT: In order to meet the requirements of the specification...

MakesSense is true if and only if YouLose returns false.
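For the skeptical: the NaN comparison above really is airtight. A quick Python check of the IEEE 754 rule it relies on:

```python
# IEEE 754: NaN compares unequal to everything, including itself,
# so `d != NaN` is True no matter what d is.
import math
import random

d = random.uniform(-1e308, 1e308)  # any value at all
print(d != math.nan)         # True
print(math.nan != math.nan)  # True: even NaN can't win
```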


Can you give us a little more information? What's the average size of your log files? Are they flat txt/log files? Is this a one-off parse, or will you have to do it regularly? And can you give us a better example of the logic you wish to apply?