10/01/2008

Fuzzy uniq (or Parsing Connect:Direct stats part 3)

I was doing an analysis on the current Connect:Direct feeds of several servers so I could cleanly migrate the feeds to another set of servers.  I parsed the stats to get a list of feeds.  Some of the feeds appeared to be monthly, some were weekly, some were daily, etc.  I wanted a list of the unique transmissions, including the source and destination nodes and filenames.  Sorting the list and using uniq to eliminate duplicates did not reduce the list because the filenames had date/time stamps in them, so really each line was unique.  I ended up importing a very large list into Excel and went through it line by line to delete lines that represented the same feed over and over.  Since I had several more migrations to do, I want a smarter uniq. 

 

The UNIX utility uniq examines each line of text piped into it, and if the line is identical to the previous line, it skips displaying it.  So if you pipe sorted text into it, you end up with a text file where every line is unique.  No repeated lines.  I want a uniq that examines each line of text piped into it, and if the line is similar to the previous line, it skips displaying it.   I want to end up with a text file that gives one example line of each group of lines that are similar to each other.  Well, I searched high and low, and if there is such a utility, I sure can't find it.  I set out to make a script that I call funiq, for fuzzy uniq.

 

Requirement:  Compare lines.  Each line should be calculated to have a certain percentage of similarity compared to the line above.  If the line is sufficiently similar to the line already displayed, don't display it and move on.  I did a lot of searching on the Internet to try to find out how to calculate similarity.  I ended up reading an interesting article here:  http://www.catalysoft.com/articles/StrikeAMatch.html

 

The basic idea is to take two lines of input and split them up into character pairs.  Throw out the character pairs that have spaces in them.  Throw out empty lines.  Count the number of character pairs in a line that match character pairs in the line below it.  This is the intersection of the two character-pair sets.  Apply a formula that computes two times the intersection divided by the total number of character pairs in both lines.  

 

#!/bin/ksh

PATH=/usr/xpg4/bin:/usr/bin

# get percentage of similarity from -s command line option, or 85 for default

if [ "$1" = "-s" -a -n "$2" ]; then

  SIM=$2

  shift 2

else

  SIM=85

fi

awk 'BEGIN {CURRPAIRS=0;PAIRMATCHES=0;PREVPAIRS=0}

{

  # load array of character pairs for current comparison string

  for (i=1;i<length($0);i++) {

    CURR[i]=substr($0,i,2)

  }

  # remove character pairs that contain spaces

  for (SUBC in CURR) {

    if ( index(CURR[SUBC]," ") ) {

      delete CURR[SUBC]

    }

  }

  # count the number of character pairs in comparison string,

  # and count matches compared to previous comparison string

  CURRPAIRS=0

  PAIRMATCHES=0

  for (SUBC in CURR) {

    CURRPAIRS++

    for (SUBP in PREV) {

      if (CURR[SUBC]==PREV[SUBP]) {

        PAIRMATCHES++

        # only count matches once

        delete PREV[SUBP]

      }

    }

  }

  # remove empty lines from consideration by skipping to next line

  if (CURRPAIRS==0) next

  # compute similarity

  SIM=200*PAIRMATCHES/(CURRPAIRS+PREVPAIRS)

  # display output if not similar

  if (REQSIM >= SIM) print $0

  # move array of character pairs to store as previous string

  for (SUB in PREV) delete PREV[SUB]

  for (SUB in CURR) PREV[SUB]=CURR[SUB]

  for (SUB in CURR) delete CURR[SUB]

  PREVPAIRS=CURRPAIRS

}' REQSIM=$SIM $@

 

 

Command line option:  the percentage of similarity can be specified by -s followed by a number between 0 and 100.  The default is 85.

 

Sample data input for funiq (378 lines, not all shown):

 

 

 

Example output of funiq:

 

 

 

 

 

 

1 comment:

Christian said...

fuzzy uniq, exactly what I was looking for the other day. Thanks!