10/05/2008

How to grep Connect:Direct UNIX stats and format for easy reading

Going into the direct prompt and running select stat with various parameters works great most of the time, but sometimes I just want to grep for certain lines in the stat files in the work directory, for example when I am looking for all the records related to a particular file that was transmitted.  To make the output of such a grep more human-readable, I can parse it as discussed in my previous posts.  But what if I would like to see the output formatted the same way that select stat detail=yes shows it?  Here is a short script that does that.  I call it formatstats.
 
formatstats script:
#!/bin/ksh
PATH=/usr/xpg4/bin:/usr/bin
# add SUMM field and end of record marker on stat lines
awk '{print $0"|SUMM=N|EOR"}' |\
# format the STAT file, putting each field on a separate line
tr '|' '\012' |\
# separate times from dates and reformat source and destination file fields
# to have a space after the =
awk -F= '{
    if ($1=="DFIL" || $1=="SFIL") print $1 "= " $2
    else if ($1=="STAR" || $1=="SSTA" || $1=="STOP" ) {
      split($2,A," ")
      print $1 "=" A[1] "=" A[2]
    }
    else print
}' |\
# execute the ndmstat.awk that comes with Connect:Direct
awk -F= -f /export/home/ndm/cdunix/ndm/bin/ndmstat.awk |\
# additional formatting to remove the greater than sign arrows
sed 's/=>/=/g'
Note that the script is calling the ndmstat.awk that is in the bin directory of my Connect:Direct installation.  Your Connect:Direct may be installed somewhere else and the path to ndmstat.awk may need to be changed to find it. 
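To see what the early stages of the pipeline produce before ndmstat.awk takes over, the sketch below runs them on a shortened record built from three fields of the example that follows.  The function name split_stat_fields is only for illustration, and the Solaris-specific PATH line is left out on the assumption that any POSIX awk and tr will behave the same way.

```shell
#!/bin/sh
# Illustrative helper (the name is hypothetical): the first three stages of
# formatstats, without the final ndmstat.awk and sed steps.
split_stat_fields() {
  # append the SUMM field and an end-of-record marker to each stat line
  awk '{print $0 "|SUMM=N|EOR"}' |
  # put each field on its own line
  tr '|' '\012' |
  # split date and time, and add a space after = for the file fields
  awk -F= '{
    if ($1=="DFIL" || $1=="SFIL") print $1 "= " $2
    else if ($1=="STAR" || $1=="SSTA" || $1=="STOP") {
      split($2, A, " ")
      print $1 "=" A[1] "=" A[2]
    }
    else print
  }'
}

# shortened record built from fields that appear in the example output
echo 'STAR=20080305 18:39:01|PNAM=todbk|DFIL=todtrigger.200803051837.results' |
  split_stat_fields
```

This prints five lines: STAR=20080305=18:39:01, PNAM=todbk, DFIL= todtrigger.200803051837.results, SUMM=N and EOR.  That one-field-per-line layout is what the script then pipes into ndmstat.awk.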
 
Here is an example of what the text looks like before formatting:
$ cat S20080305.001 |grep todtrigger.200803051837.results
STAR=20080305 18:39:01|PNAM=todbk|PNUM=300|SSTA=20080305 18:39:01|STRT=20080305
18:39:01|SUBM=sysx@paul|SBID=sysx|SBND=paul|SNOD=john|RECI=RSST|RECC=CAPR|TZDI=-
21600|MSGI=XSMG201I|MSST=Remote Step started.|FROM=P|RSTR=N|SNAM=step1|SFIL=/app
/fmtprt/data/sysx/send/todtrigger.200803051837.results|DFIL=todtrigger.200803051
837.results|PNOD=paul
STAR=20080305 18:39:01|PNAM=todbk|PNUM=300|SSTA=20080305 18:39:01|STRT=20080305
18:39:01|STOP=20080305 18:39:01|STPT=20080305 18:39:01|SELA=00:00:00|SUBM=sysx@p
aul|SBID=sysx|SBND=paul|SNOD=john|CCOD=0|RECI=CTRC|RECC=CAPR|TZDI=-21600|MSGI=SC
PA000I|MSST=Copy step successful.|STDL=Wed Mar  5 18:39:01 2008|CSDS=Wed Mar  5
18:39:01 2008|LCCD=0|LMSG=SCPA000I|OCCD=0|OMSG=SCPA000I|PNAM=todbk|PNUM=300|SNAM
=step1|PNOD=paul|SNOD=john|LNOD=S|FROM=P|XLAT=N|SCMP=N|ECMP=N|OERR=N|CKPT=Y|LKFL
=N|RSTR=N|RUSZ=65536|PACC=|SACC=|PPMN=|SFIL=/app/fmtprt/data/sysx/send/todtrigge
r.200803051837.results|SDS1= |SDS2= |SDS3= |SBYR=223|SFSZ=223|SRCR=1|SBYX=225|SR
UX=1|SVSQ=0|SVCN=0|SVOL=|DFIL=todtrigger.200803051837.results|PPMN=|DDS1=R|DDS2=
 |DDS3= |DBYW=223|DRCW=1|DBYX=225|DRUX=1|DVSQ=0|DVCN=0|DVOL=|ICRC=N|PCRC=N|DLDR=
/appl/biller/udot/input|ETMC=9|ETMK=0|ETMU=10
$
And here is the text with formatting:
$ cat S20080305.001 |grep todtrigger.200803051837.results |formatstats
===============================================================================
                           SELECT  STATISTICS
===============================================================================
PROCESS RECORD   Record Id =  RSST
Process Name     = todbk          Stat Log Date  = 03/05/2008
Process Number   = 300            Stat Log Time  = 18:39:01
Submitter Class  =
Submitter Id     =
sysx@paul
 
Step Start Date  = 03/05/2008     Step Start Time  = 18:39:01
Src  File        = /app/fmtprt/data/sysx/send/todtrigger.200803051837.results
Dest File        = todtrigger.200803051837.results
 
Step Name        = step1
From node        = P
Rstr             = N
SNODE            = john
Completion Code  = 0
Message Id       = XSMG201I
Short Text       = Remote Step started.
-------------------------------------------------------------------------------
PROCESS RECORD   Record Id =  CTRC
Process Name     = todbk          Stat Log Date  = 03/05/2008
Process Number   = 300            Stat Log Time  = 18:39:01
Submitter Class  =
Submitter Id     =
sysx@paul
 
Step Start Date  = 03/05/2008     Step Start Time  = 18:39:01
Step Stop Date   = 03/05/2008     Step Stop Time   = 18:39:01
Step Elapsed Time= 00:00:00
 
Step Name        = step1
From node        = P
Rstr             = N
SNODE            = john
Completion Code  = 0
Message Id       = SCPA000I
Short Text       = Copy step successful.
Ckpt=Y  Lkfl=N  Rstr=N  Xlat=N  Scmp=N  Ecmp=N CRC=N
Local node       = S
From node        = P
Src  File        = /app/fmtprt/data/sysx/send/todtrigger.200803051837.results
Dest File        = todtrigger.200803051837.results
     Source                     Destination
 Ccode      =0              Ccode        =0
 Msgid      =SCPA000I       Msgid        =SCPA000I
 Bytes Read =223            Bytes Written=223
 Recs  Read =1              Recs  Written=1
 Bytes Sent =225            Bytes Recvd  =225
 Rus   Sent =1              Rus   Recvd  =1
 Ru    Size =65536
-------------------------------------------------------------------------------
===============================================================================
$
A note about greater-than signs:
I really dislike greater-than signs as part of cursor prompts or formatted output.  That is the reason for the last line of the script, where arrows like => are replaced with a plain equals sign.  I have had too many problems where I grab lines of text from my PuTTY window and then accidentally hit the right mouse button, which pastes the text onto the command line.  If the pasted output has greater-than signs in it, the shell treats them as redirections, so it overwrites files and creates garbage files on my system.
 
 
 

10/01/2008

Fuzzy uniq (or Parsing Connect:Direct stats part 3)

I was doing an analysis of the current Connect:Direct feeds on several servers so I could cleanly migrate the feeds to another set of servers.  I parsed the stats to get a list of feeds.  Some of the feeds appeared to be monthly, some were weekly, some were daily, etc.  I wanted a list of the unique transmissions, including the source and destination nodes and filenames.  Sorting the list and using uniq to eliminate duplicates did not reduce the list because the filenames had date/time stamps in them, so really each line was unique.  I ended up importing a very large list into Excel and going through it line by line to delete lines that represented the same feed over and over.  Since I had several more migrations to do, I wanted a smarter uniq.

 

The UNIX utility uniq examines each line of text piped into it, and if the line is identical to the previous line, it skips displaying it.  So if you pipe sorted text into it, you end up with a text file where every line is unique.  No repeated lines.  I want a uniq that examines each line of text piped into it, and if the line is similar to the previous line, it skips displaying it.   I want to end up with a text file that gives one example line of each group of lines that are similar to each other.  Well, I searched high and low, and if there is such a utility, I sure can't find it.  I set out to make a script that I call funiq, for fuzzy uniq.
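The limitation is easy to reproduce.  In this small sketch the three input lines are made up for illustration (they are not from my stats): two of them are the same feed with different date stamps, and one is an exact repeat.

```shell
#!/bin/sh
# Made-up feed list: the node names and filenames are illustrative only.
feeds() {
  printf '%s\n' \
    'paul john report.20080301.txt' \
    'paul john report.20080302.txt' \
    'paul john report.20080301.txt'
}

# sort | uniq removes only the exact repeat; both date stamps survive
feeds | sort | uniq
```

Two lines come out, one per date stamp, even though a person reading the list would call them the same feed.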

 

Requirement:  Compare lines.  For each line, calculate a percentage of similarity to the line above it.  If the line is sufficiently similar to the line above, don't display it and move on.  I did a lot of searching on the Internet to try to find out how to calculate similarity.  I ended up reading an interesting article here:  http://www.catalysoft.com/articles/StrikeAMatch.html

 

The basic idea is to take two lines of input and split them up into character pairs.  Throw out the character pairs that have spaces in them.  Throw out empty lines.  Count the number of character pairs in a line that match character pairs in the line below it.  This is the intersection of the two character-pair sets.  Apply a formula that computes two times the intersection divided by the total number of character pairs in both lines.  
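As a worked example of the formula, here is a minimal sketch that compares just two strings.  The function name pair_similarity is hypothetical, and it counts matches with a per-pair tally instead of nested loops, so treat it as an illustration of the arithmetic rather than as the funiq script itself.  FRANCE splits into FR, RA, AN, NC, CE and FRENCH into FR, RE, EN, NC, CH; they share FR and NC, so the similarity is 2 * 2 / (5 + 5) = 0.4, or 40 percent.

```shell
#!/bin/sh
# pair_similarity A B  (hypothetical helper name)
# Prints the character-pair similarity of two strings as a whole percentage.
pair_similarity() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = 0; nb = 0; matches = 0
    # tally the character pairs of the first string, skipping pairs with spaces
    for (i = 1; i < length(a); i++) {
      p = substr(a, i, 2)
      if (index(p, " ") == 0) { na++; count[p]++ }
    }
    # walk the pairs of the second string; each tallied pair can match only once
    for (i = 1; i < length(b); i++) {
      p = substr(b, i, 2)
      if (index(p, " ") == 0) {
        nb++
        if (count[p] > 0) { matches++; count[p]-- }
      }
    }
    # no usable pairs at all: report 0 instead of dividing by zero
    if (na + nb == 0) { print 0; exit }
    # two times the intersection over the total pair count, as a percentage
    printf "%d\n", 200 * matches / (na + nb)
  }'
}

pair_similarity "FRANCE" "FRENCH"
```

Run as is, this prints 40.  Two identical strings give 100, and two strings with no pairs in common give 0.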

 

#!/bin/ksh
PATH=/usr/xpg4/bin:/usr/bin
# get percentage of similarity from -s command line option, or 85 for default
if [ "$1" = "-s" -a -n "$2" ]; then
  SIM=$2
  shift 2
else
  SIM=85
fi
awk 'BEGIN {CURRPAIRS=0;PAIRMATCHES=0;PREVPAIRS=0}
{
  # load array of character pairs for current comparison string
  for (i=1;i<length($0);i++) {
    CURR[i]=substr($0,i,2)
  }
  # remove character pairs that contain spaces
  for (SUBC in CURR) {
    if ( index(CURR[SUBC]," ") ) {
      delete CURR[SUBC]
    }
  }
  # count the number of character pairs in comparison string,
  # and count matches compared to previous comparison string
  CURRPAIRS=0
  PAIRMATCHES=0
  for (SUBC in CURR) {
    CURRPAIRS++
    for (SUBP in PREV) {
      if (CURR[SUBC]==PREV[SUBP]) {
        PAIRMATCHES++
        # only count matches once: use up the matched pair and stop looking,
        # so one current pair cannot match several previous pairs
        delete PREV[SUBP]
        break
      }
    }
  }
  # remove empty lines from consideration by skipping to next line
  if (CURRPAIRS==0) next
  # compute similarity as a percentage
  SIM=200*PAIRMATCHES/(CURRPAIRS+PREVPAIRS)
  # display output if not similar
  if (REQSIM >= SIM) print $0
  # move array of character pairs to store as previous string
  for (SUB in PREV) delete PREV[SUB]
  for (SUB in CURR) PREV[SUB]=CURR[SUB]
  for (SUB in CURR) delete CURR[SUB]
  PREVPAIRS=CURRPAIRS
}' REQSIM=$SIM "$@"

 

 

Command line option:  the percentage of similarity can be specified by -s followed by a number between 0 and 100.  The default is 85.

 

Sample data input for funiq (378 lines, not all shown):

 

 

 

Example output of funiq: