10/05/2008

How to grep Connect:Direct UNIX stats and format for easy reading

Going into the direct prompt and doing select stat with various parameters works great most of the time, but sometimes I just want to grep for certain lines in the stat files in the work directory, for example when I am looking for all the records related to a particular file that was transmitted.  To make the output of such a grep more human-readable, I can parse the output as discussed in my previous posts.  But what if I would like to see the output formatted the same way select stat detail=yes shows it?  Here is a short script that does that.  I call it formatstats.
 
formatstats script:
#!/bin/ksh
PATH=/usr/xpg4/bin:/usr/bin
# add SUMM field and end of record marker on stat lines
awk '{print $0"|SUMM=N|EOR"}' |\
# format the STAT file, putting each field on a separate line
tr '|' '\012' |\
# separate times from dates and reformat source and destination file fields
# to have a space after the =
awk -F= '{
    if ($1=="DFIL" || $1=="SFIL") print $1 "= " $2
    else if ($1=="STAR" || $1=="SSTA" || $1=="STOP" ) {
      split($2,A," ")
      print $1 "=" A[1] "=" A[2]
    }
    else print
}' |\
# execute the ndmstat.awk that comes with Connect:Direct
awk -F= -f /export/home/ndm/cdunix/ndm/bin/ndmstat.awk |\
# additional formatting to remove the greater than sign arrows
sed 's/=>/=/g'
Note that the script is calling the ndmstat.awk that is in the bin directory of my Connect:Direct installation.  Your Connect:Direct may be installed somewhere else and the path to ndmstat.awk may need to be changed to find it. 
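
To see what the first three stages hand to ndmstat.awk, here is a small sketch that runs only those stages on a shortened record put together from fields that appear in the sample stat records in this post.  The function name prep_stat is made up for illustration, and ndmstat.awk itself is not called, so this shows the intermediate text rather than the final report.

```shell
#!/bin/sh
# prep_stat (hypothetical name): the first three stages of formatstats,
# without the final ndmstat.awk and sed steps.
#   stage 1: add the SUMM field and the end of record marker
#   stage 2: put each field on its own line
#   stage 3: split dates from times, add a space after = for file fields
prep_stat() {
  awk '{print $0 "|SUMM=N|EOR"}' |
  tr '|' '\012' |
  awk -F= '{
    if ($1=="DFIL" || $1=="SFIL") print $1 "= " $2
    else if ($1=="STAR" || $1=="SSTA" || $1=="STOP") {
      split($2,A," ")
      print $1 "=" A[1] "=" A[2]
    }
    else print
  }'
}

# shortened record for illustration only
echo 'STAR=20080305 18:39:01|PNUM=300|DFIL=todtrigger.200803051837.results' | prep_stat
```

Each field ends up on its own line, the STAR date and time are separated by an equals sign, and the record closes with the SUMM=N and EOR lines.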
 
Here is an example of what the text looks like before formatting:
$ cat S20080305.001 |grep todtrigger.200803051837.results
STAR=20080305 18:39:01|PNAM=todbk|PNUM=300|SSTA=20080305 18:39:01|STRT=20080305
18:39:01|SUBM=sysx@paul|SBID=sysx|SBND=paul|SNOD=john|RECI=RSST|RECC=CAPR|TZDI=-
21600|MSGI=XSMG201I|MSST=Remote Step started.|FROM=P|RSTR=N|SNAM=step1|SFIL=/app
/fmtprt/data/sysx/send/todtrigger.200803051837.results|DFIL=todtrigger.200803051
837.results|PNOD=paul
STAR=20080305 18:39:01|PNAM=todbk|PNUM=300|SSTA=20080305 18:39:01|STRT=20080305
18:39:01|STOP=20080305 18:39:01|STPT=20080305 18:39:01|SELA=00:00:00|SUBM=sysx@p
aul|SBID=sysx|SBND=paul|SNOD=john|CCOD=0|RECI=CTRC|RECC=CAPR|TZDI=-21600|MSGI=SC
PA000I|MSST=Copy step successful.|STDL=Wed Mar  5 18:39:01 2008|CSDS=Wed Mar  5
18:39:01 2008|LCCD=0|LMSG=SCPA000I|OCCD=0|OMSG=SCPA000I|PNAM=todbk|PNUM=300|SNAM
=step1|PNOD=paul|SNOD=john|LNOD=S|FROM=P|XLAT=N|SCMP=N|ECMP=N|OERR=N|CKPT=Y|LKFL
=N|RSTR=N|RUSZ=65536|PACC=|SACC=|PPMN=|SFIL=/app/fmtprt/data/sysx/send/todtrigge
r.200803051837.results|SDS1= |SDS2= |SDS3= |SBYR=223|SFSZ=223|SRCR=1|SBYX=225|SR
UX=1|SVSQ=0|SVCN=0|SVOL=|DFIL=todtrigger.200803051837.results|PPMN=|DDS1=R|DDS2=
 |DDS3= |DBYW=223|DRCW=1|DBYX=225|DRUX=1|DVSQ=0|DVCN=0|DVOL=|ICRC=N|PCRC=N|DLDR=
/appl/biller/udot/input|ETMC=9|ETMK=0|ETMU=10
$
And here is the text with formatting:
$ cat S20080305.001 |grep todtrigger.200803051837.results |formatstats
===============================================================================
                           SELECT  STATISTICS
===============================================================================
PROCESS RECORD   Record Id =  RSST
Process Name     = todbk          Stat Log Date  = 03/05/2008
Process Number   = 300            Stat Log Time  = 18:39:01
Submitter Class  =
Submitter Id     =
sysx@paul
 
Step Start Date  = 03/05/2008     Step Start Time  = 18:39:01
Src  File        = /app/fmtprt/data/sysx/send/todtrigger.200803051837.results
Dest File        = todtrigger.200803051837.results
 
Step Name        = step1
From node        = P
Rstr             = N
SNODE            = john
Completion Code  = 0
Message Id       = XSMG201I
Short Text       = Remote Step started.
-------------------------------------------------------------------------------
PROCESS RECORD   Record Id =  CTRC
Process Name     = todbk          Stat Log Date  = 03/05/2008
Process Number   = 300            Stat Log Time  = 18:39:01
Submitter Class  =
Submitter Id     =
sysx@paul
 
Step Start Date  = 03/05/2008     Step Start Time  = 18:39:01
Step Stop Date   = 03/05/2008     Step Stop Time   = 18:39:01
Step Elapsed Time= 00:00:00
 
Step Name        = step1
From node        = P
Rstr             = N
SNODE            = john
Completion Code  = 0
Message Id       = SCPA000I
Short Text       = Copy step successful.
Ckpt=Y  Lkfl=N  Rstr=N  Xlat=N  Scmp=N  Ecmp=N CRC=N
Local node       = S
From node        = P
Src  File        = /app/fmtprt/data/sysx/send/todtrigger.200803051837.results
Dest File        = todtrigger.200803051837.results
     Source                     Destination
 Ccode      =0              Ccode        =0
 Msgid      =SCPA000I       Msgid        =SCPA000I
 Bytes Read =223            Bytes Written=223
 Recs  Read =1              Recs  Written=1
 Bytes Sent =225            Bytes Recvd  =225
 Rus   Sent =1              Rus   Recvd  =1
 Ru    Size =65536
-------------------------------------------------------------------------------
===============================================================================
$
A note about greater-than signs:
I really dislike greater-than signs as part of cursor prompts or formatted output.  That is the reason for the last line of the script, where arrows like => are replaced with a plain equals sign.  I have had too many problems where I grab lines of text from my PuTTY window and then accidentally hit the right mouse button and paste the text onto the command line.  If the output has greater-than signs in it, the shell treats them as redirections, which overwrites files and creates garbage files on my system.
 
 
 

10/01/2008

Fuzzy uniq (or Parsing Connect:Direct stats part 3)

I was doing an analysis on the current Connect:Direct feeds of several servers so I could cleanly migrate the feeds to another set of servers.  I parsed the stats to get a list of feeds.  Some of the feeds appeared to be monthly, some were weekly, some were daily, etc.  I wanted a list of the unique transmissions, including the source and destination nodes and filenames.  Sorting the list and using uniq to eliminate duplicates did not reduce the list because the filenames had date/time stamps in them, so really each line was unique.  I ended up importing a very large list into Excel and going through it line by line to delete lines that represented the same feed over and over.  Since I had several more migrations to do, I wanted a smarter uniq.

 

The UNIX utility uniq examines each line of text piped into it, and if the line is identical to the previous line, it skips displaying it.  So if you pipe sorted text into it, you end up with a text file where every line is unique.  No repeated lines.  I want a uniq that examines each line of text piped into it, and if the line is similar to the previous line, it skips displaying it.   I want to end up with a text file that gives one example line of each group of lines that are similar to each other.  Well, I searched high and low, and if there is such a utility, I sure can't find it.  I set out to make a script that I call funiq, for fuzzy uniq.

 

Requirement:  Compare lines.  Each line should be calculated to have a certain percentage of similarity compared to the line above.  If the line is sufficiently similar to the line already displayed, don't display it and move on.  I did a lot of searching on the Internet to try to find out how to calculate similarity.  I ended up reading an interesting article here:  http://www.catalysoft.com/articles/StrikeAMatch.html

 

The basic idea is to take two lines of input and split them up into character pairs.  Throw out the character pairs that have spaces in them.  Throw out empty lines.  Count the number of character pairs in a line that match character pairs in the line below it.  This is the intersection of the two character-pair sets.  Apply a formula that computes two times the intersection divided by the total number of character pairs in both lines.  
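
As a worked example of that formula, here is a minimal sketch that scores just two strings.  The function name pair_sim is made up for illustration, and the result is expressed as a whole percentage (200 times the matches divided by the total pair count) so it can be compared with a threshold such as 85.

```shell
#!/bin/sh
# pair_sim (hypothetical name): percentage similarity of two strings using
# the character-pair formula 2 * matches / (pairs in A + pairs in B)
pair_sim() {
  awk -v a="$1" -v b="$2" '
  # store the character pairs of s that contain no space in array P,
  # and return how many were stored
  function pairs(s, P,    i, n, p) {
    n = 0
    for (i = 1; i < length(s); i++) {
      p = substr(s, i, 2)
      if (index(p, " ") == 0) P[++n] = p
    }
    return n
  }
  BEGIN {
    na = pairs(a, PA)
    nb = pairs(b, PB)
    if (na + nb == 0) { print 0; exit }
    matches = 0
    for (i = 1; i <= na; i++) {
      for (j = 1; j <= nb; j++) {
        if (PA[i] == PB[j]) {
          matches++
          PB[j] = ""      # a pair can only be matched once
          break
        }
      }
    }
    printf "%d\n", 200 * matches / (na + nb)
  }'
}

pair_sim "FRANCE" "FRENCH"
pair_sim "bill.udot.mktb.200809241130" "bill.udot.mkta.200809241137"
```

FRANCE and FRENCH have five pairs each and share two of them (FR and NC), so the score is 200*2/10 = 40.  The two file names have 26 pairs each and share 23, which truncates to 88, so with a threshold of 85 they would count as the same feed.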

 

#!/bin/ksh
PATH=/usr/xpg4/bin:/usr/bin
# get percentage of similarity from -s command line option, or 85 for default
if [ "$1" = "-s" -a -n "$2" ]; then
  SIM=$2
  shift 2
else
  SIM=85
fi
awk 'BEGIN {CURRPAIRS=0;PAIRMATCHES=0;PREVPAIRS=0}
{
  # load array of character pairs for current comparison string
  for (i=1;i<length($0);i++) {
    CURR[i]=substr($0,i,2)
  }
  # remove character pairs that contain spaces
  for (SUBC in CURR) {
    if ( index(CURR[SUBC]," ") ) {
      delete CURR[SUBC]
    }
  }
  # count the number of character pairs in comparison string,
  # and count matches compared to previous comparison string
  CURRPAIRS=0
  PAIRMATCHES=0
  for (SUBC in CURR) {
    CURRPAIRS++
    for (SUBP in PREV) {
      if (CURR[SUBC]==PREV[SUBP]) {
        PAIRMATCHES++
        # only count matches once, so stop at the first match
        delete PREV[SUBP]
        break
      }
    }
  }
  # remove empty lines from consideration by skipping to next line
  if (CURRPAIRS==0) next
  # compute similarity as a percentage
  SIM=200*PAIRMATCHES/(CURRPAIRS+PREVPAIRS)
  # display output if not similar
  if (REQSIM >= SIM) print $0
  # move array of character pairs to store as previous string
  for (SUB in PREV) delete PREV[SUB]
  for (SUB in CURR) PREV[SUB]=CURR[SUB]
  for (SUB in CURR) delete CURR[SUB]
  PREVPAIRS=CURRPAIRS
}' REQSIM="$SIM" "$@"

 

 

Command line option:  the percentage of similarity can be specified by -s followed by a number between 0 and 100.  The default is 85.

 

Sample data input for funiq (378 lines, not all shown):

 

 

 

Example output of funiq:

 

 

 

 

 

 

9/26/2008

Parsing Connect:Direct stats (part 2)


In part 1, we had some awk code that could parse the stat file and show the PNUMs with the source and destination servers and file names, tab delimited. 
 
To illustrate why anybody would be interested in doing this, here is what some stats look like without a parse script:
 
 
Not very convenient to gather information from, especially if you have thousands of transmissions and detailed analysis to do.  Let's add some features to the awk code from part 1.  First, we can wrap the awk into a shell script, and accept the fields we want on the command line.  Some fields, such as RMTP, may have equal signs in the values.  Since we are splitting on equal signs we need to put the values back together, so let's fix that, also.
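
To make the equal sign problem concrete, here is a small sketch that pulls one field out of a record and glues the value back together.  The function name field_value is made up, and the record is assembled for illustration from values that show up in the sample output later in this post, so the field order of a real stat record may differ.

```shell
#!/bin/sh
# field_value (hypothetical name): print the value of one NAME=value field,
# keeping any further = signs that belong to the value
field_value() {
  awk -v want="$1" -F'|' '{
    for (i = 1; i <= NF; i++) {
      n = split($i, A, "=")
      if (A[1] != want) continue
      val = A[2]
      # pieces 3..n were split off the value; glue them back with =
      for (j = 3; j <= n; j++) val = val "=" A[j]
      print val
    }
  }'
}

# record assembled for illustration only
echo 'PNUM=123|RMTP=192.168.11.132, PORT=1364|RECI=SSTR' | field_value RMTP
```

Splitting the RMTP field on = gives three pieces (RMTP, the address with the word PORT, and 1364).  Without the inner loop only the second piece would be printed and the port number would be lost.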

 
#!/bin/ksh
PATH=/usr/xpg4/bin:/usr/bin
# default field IDs
FIELDIDS=PNUM,PNOD,SNOD,SFIL,DFIL,CCOD
# get different field IDs from command line, if they are specified
if [ "$1" = "-f" -a -n "$2" ]; then
  FIELDIDS=$2
fi
 
awk -F"|" '{
  # populate array B with all values using field names as subscripts
  for (i=1;i<=NF;i++) {
    SS=split($i,A,"=");SUB=A[1]; B[SUB]=A[2];delete A[1];delete A[2]
    # if the field has a second = in it, that means $i was split
    # into more than 2 pieces, gather the pieces
    for (j=3;j<=SS;j++) {
      B[SUB]=B[SUB]"="A[j];delete A[j]
    }
  }
  # go through all the fields requested and show values
  NE=split(FIELDIDS,F,",")
  for (IX=1;IX<=NE;IX++) {
    FLD=F[IX]
    printf "%s\t",B[FLD]
    delete F[IX]
  }
  print ""
  # clear array B
  for (SUB in B) delete B[SUB]
}'  FIELDIDS=$FIELDIDS -

 
I name this script parse.  Now let's use it to show the stats.  This shows me a neat, tab-delimited list of what files I transmitted outbound yesterday, with destination node name and IP, filenames, and completion codes:
 
$ cat S20080924.001 |grep PNOD=john |egrep "CTRC|SSTR" |parse -f PNUM,RECI,SNOD,DFIL,RMTP,CCOD
123     SSTR    paul            192.168.11.132, PORT=1364
123     CTRC    paul    bill.udot.mktb.200809241130             0
124     SSTR    george          192.168.11.145, PORT=1364
124     CTRC    george  bill.udot.mktb.200809241130             0
125     SSTR    paul            192.168.11.132, PORT=1364
125     CTRC    paul    bill.udot.mkta.200809241137             0
126     SSTR    george          192.168.11.145, PORT=1364
126     CTRC    george  bill.udot.mkta.200809241137             0
127     SSTR    paul            192.168.11.132, PORT=1364
127     CTRC    paul    bill.udot.mkta.200809241153             0
128     SSTR    george          192.168.11.145, PORT=1364
128     CTRC    george  bill.udot.mkta.200809241153             0
129     SSTR    paul            192.168.11.132, PORT=1364
129     CTRC    paul    bill.udot.mktb.200809241156             0
130     SSTR    george          192.168.11.145, PORT=1364
130     CTRC    george  bill.udot.mktb.200809241156             0
131     SSTR    paul            192.168.11.132, PORT=1364
131     CTRC    paul    bill.udot.mktb.200809241230             0
132     SSTR    george          192.168.11.145, PORT=1364
132     CTRC    george  bill.udot.mktb.200809241230             0
133     SSTR    paul            192.168.11.132, PORT=1364
133     CTRC    paul    bill.udot.mkta.200809241237             0
134     SSTR    george          192.168.11.145, PORT=1364
134     CTRC    george  bill.udot.mkta.200809241237             0
135     SSTR    paul            192.168.11.132, PORT=1364
135     CTRC    paul    bill.udot.mkta.200809241253             0
136     SSTR    george          192.168.11.145, PORT=1364
136     CTRC    george  bill.udot.mkta.200809241253             0
137     SSTR    paul            192.168.11.132, PORT=1364
137     CTRC    paul    bill.udot.mktb.200809241256             0
138     SSTR    george          192.168.11.145, PORT=1364
138     CTRC    george  bill.udot.mktb.200809241256             0
139     SSTR    paul            192.168.11.132, PORT=1364
139     CTRC    paul    todtrigger.200809241722.input           0
140     SSTR    paul            192.168.11.132, PORT=1364
140     CTRC    paul    todtrigger.200809241723.input           0
141     SSTR    paul            192.168.11.132, PORT=1364
141     CTRC    paul    todtrigger.200809241724.input           0
142     SSTR    paul            192.168.11.132, PORT=1364
142     CTRC    paul    todtrigger.200809241725.input           0
143     SSTR    paul            192.168.11.132, PORT=1364
143     CTRC    paul    todtrigger.200809241726.input           0
144     SSTR    paul            192.168.11.132, PORT=1364
144     CTRC    paul    todtrigger.200809241727.input           0
 
 
What if I wanted to clean up this output a little bit more?  I really just wanted the destination IP address, but the RMTP field has the IP address and port number in it, and a comma.  This is where we can derive fields from existing information inside other fields.
 
#!/bin/ksh
PATH=/usr/xpg4/bin:/usr/bin
# default field IDs
FIELDIDS=PNUM,PNOD,SNOD,SFIL,DFIL,CCOD
# get different field IDs from command line, if they are specified
if [ "$1" = "-f" -a -n "$2" ]; then
  FIELDIDS=$2
fi
 
awk -F"|" '{
  # populate array B with all values using field names as subscripts
  for (i=1;i<=NF;i++) {
    SS=split($i,A,"=");SUB=A[1]; B[SUB]=A[2];delete A[1];delete A[2]
    # if the field has a second = in it, that means $i was split
    # into more than 2 pieces, gather the pieces
    for (j=3;j<=SS;j++) {
      B[SUB]=B[SUB]"="A[j];delete A[j]
    }
  }
  # go through all the fields requested and show values
  NE=split(FIELDIDS,F,",")
  for (IX=1;IX<=NE;IX++) {
    FLD=F[IX]
    if (FLD=="LOCIP") {
      split(B["LCLP"],A,",")
      B["LOCIP"]=A[1]
      delete A[1]; delete A[2]
    }
    if (FLD=="LOCPORT") {
      split(B["LCLP"],A,"=")
      B["LOCPORT"]=A[2]
      delete A[1]; delete A[2]
    }
    if (FLD=="RMTIP") {
      split(B["RMTP"],A,",")
      B["RMTIP"]=A[1]
      delete A[1]; delete A[2]
    }
    if (FLD=="RMTPORT") {
      split(B["RMTP"],A,"=")
      B["RMTPORT"]=A[2]
      delete A[1]; delete A[2]
    }
    printf "%s\t",B[FLD]
    delete F[IX]
  }
  print ""
  # clear array B
  for (SUB in B) delete B[SUB]
}'  FIELDIDS=$FIELDIDS -

 
Now we have additional derived fields to choose from, besides the regular fields inside the stat records.
 
$ cat S20080924.001 |grep PNOD=john |egrep "CTRC|SSTR" |parse -f PNUM,RECI,SNOD,DFIL,RMTIP,CCOD
123     SSTR    paul            192.168.11.132
123     CTRC    paul    bill.udot.mktb.200809241130             0
124     SSTR    george          192.168.11.145
124     CTRC    george  bill.udot.mktb.200809241130             0
125     SSTR    paul            192.168.11.132
125     CTRC    paul    bill.udot.mkta.200809241137             0
126     SSTR    george          192.168.11.145
126     CTRC    george  bill.udot.mkta.200809241137             0
127     SSTR    paul            192.168.11.132
127     CTRC    paul    bill.udot.mkta.200809241153             0
128     SSTR    george          192.168.11.145
128     CTRC    george  bill.udot.mkta.200809241153             0
129     SSTR    paul            192.168.11.132
129     CTRC    paul    bill.udot.mktb.200809241156             0
130     SSTR    george          192.168.11.145
130     CTRC    george  bill.udot.mktb.200809241156             0
131     SSTR    paul            192.168.11.132
131     CTRC    paul    bill.udot.mktb.200809241230             0
132     SSTR    george          192.168.11.145
132     CTRC    george  bill.udot.mktb.200809241230             0
133     SSTR    paul            192.168.11.132
133     CTRC    paul    bill.udot.mkta.200809241237             0
134     SSTR    george          192.168.11.145
134     CTRC    george  bill.udot.mkta.200809241237             0
135     SSTR    paul            192.168.11.132
135     CTRC    paul    bill.udot.mkta.200809241253             0
136     SSTR    george          192.168.11.145
136     CTRC    george  bill.udot.mkta.200809241253             0
137     SSTR    paul            192.168.11.132
137     CTRC    paul    bill.udot.mktb.200809241256             0
138     SSTR    george          192.168.11.145
138     CTRC    george  bill.udot.mktb.200809241256             0
139     SSTR    paul            192.168.11.132
139     CTRC    paul    todtrigger.200809241722.input           0
140     SSTR    paul            192.168.11.132
140     CTRC    paul    todtrigger.200809241723.input           0
141     SSTR    paul            192.168.11.132
141     CTRC    paul    todtrigger.200809241724.input           0
142     SSTR    paul            192.168.11.132
142     CTRC    paul    todtrigger.200809241725.input           0
143     SSTR    paul            192.168.11.132
143     CTRC    paul    todtrigger.200809241726.input           0
144     SSTR    paul            192.168.11.132
144     CTRC    paul    todtrigger.200809241727.input           0
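
The derivation itself is just two splits of the same value, one on the comma and one on the equals sign.  Here is that step on its own as a minimal sketch; the function name split_rmtp is made up, and it assumes the value always has the form shown in the output above (address, comma, PORT=number).

```shell
#!/bin/sh
# split_rmtp (hypothetical name): derive the IP address and the port from an
# RMTP-style value such as "192.168.11.132, PORT=1364"
split_rmtp() {
  echo "$1" | awk '{
    split($0, A, ",")     # text before the comma is the IP address
    ip = A[1]
    split($0, A, "=")     # text after the = is the port number
    port = A[2]
    printf "%s\t%s\n", ip, port
  }'
}

split_rmtp '192.168.11.132, PORT=1364'
```

This prints the address and the port separated by a tab, which is what the RMTIP and RMTPORT fields in the script hold.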

 
This makes a beautiful and almost effortless import into Excel for further analysis:
 
 
Note:  In the previous posting's comments I mentioned that you should use nawk instead of awk in Solaris.  In the script I put /usr/xpg4/bin in the PATH before /usr/bin.  So, if you run this in Solaris it will pick the newer, standards-compliant version of awk, which is like nawk.  On other systems such as HP-UX or Linux, the extra directory in the PATH is harmless, so the script stays portable.
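
If you would rather pick the awk explicitly than rely on PATH order, something along these lines might do.  This is only a sketch: the function name pick_awk is made up, and it assumes /usr/xpg4/bin/awk is where Solaris keeps its standards-compliant awk, as described in the note above.

```shell
#!/bin/sh
# pick_awk (hypothetical name): print the path of the awk to use,
# preferring the XPG4 awk, then nawk, then whatever awk is on the PATH
pick_awk() {
  if [ -x /usr/xpg4/bin/awk ]; then
    echo /usr/xpg4/bin/awk
  elif command -v nawk >/dev/null 2>&1; then
    command -v nawk
  else
    command -v awk
  fi
}

AWK=$(pick_awk)
echo 'PNUM=300|PNOD=paul' | "$AWK" -F'|' '{print $2}'
```

On a system without /usr/xpg4/bin and without nawk this simply falls back to the normal awk.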
 
 
 
 

9/13/2008

Parsing Connect:Direct stats (part 1)

 
When you look at the stat files in the work directory (S20080912.001, for example), they are not very human readable because you have to hunt through the text to find the fields you want by name.  Comparing one transmission to another is difficult as well.  If you go into the direct prompt and run a select stat detail=yes, the output is very readable but not very machine readable, meaning you can't feed it into a script or spreadsheet.  Wouldn't it be great to be able to parse information out of the stat files and get exactly what you want?
 
I started out wanting to look at stats for a specific set of files that were transmitted but I didn't know the process numbers.  The full file name may be something like procfeed.20080912_140814.input, but I want to see a list of PNUMs for all the files similar to that.

cat S200809* | grep RECI=CTRC | grep procfeed.200809 | \
awk -F"|" '{
  # find the PNUM field and show value
  for (i=1;i<=NF;i++) {
    split($i,A,"=")
    if (A[1]=="PNUM") {
      print A[2]
      break
    }
  }
}'

So I cat the stat file and grep for the first part of the file name and look for just the CTRC copy records.    Then pipe that through a simple awk script.  The -F"|" means consider a pipe character to be the field separator.  Count through all the fields, splitting each one at the equals sign.  If the part before the equals sign is PNUM, we've found the field.  Print the value and break out of the for loop to go to the next record.  The above gives an output like this:

22505
22675
22802
23216
23289
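
The two greps can also be folded into the awk program itself.  Here is a sketch of that variation; the function name ctrc_pnums and the three shortened records are made up for illustration.  Note that index() looks for the literal text, while grep treats a dot in the pattern as "any character".

```shell
#!/bin/sh
# ctrc_pnums (hypothetical name): print the PNUM of every CTRC record whose
# text contains the literal fragment given as $1
ctrc_pnums() {
  awk -F'|' -v frag="$1" '
  index($0, "RECI=CTRC") && index($0, frag) {
    for (i = 1; i <= NF; i++) {
      split($i, A, "=")
      if (A[1] == "PNUM") { print A[2]; break }
    }
  }'
}

# three shortened records; the third one is made up for contrast
printf '%s\n' \
  'PNUM=300|RECI=RSST|DFIL=todtrigger.200803051837.results' \
  'PNUM=300|RECI=CTRC|DFIL=todtrigger.200803051837.results' \
  'PNUM=301|RECI=CTRC|DFIL=other.file' |
ctrc_pnums todtrigger.2008
```

Only the second record is both a CTRC record and a match for the file name fragment, so a single 300 is printed.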

Then I went into the direct prompt and did a "select stat detail=yes pnum=(22505,22675,22802,23216,23289);" to get the details about the transmissions in a human-readable form.
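
Typing that list of process numbers by hand gets old quickly.  A small helper along these lines can build the command text from the awk output; the function name pnum_select is made up, and the command it prints is the same select stat shown above.

```shell
#!/bin/sh
# pnum_select (hypothetical name): turn a list of PNUMs, one per line on
# standard input, into a select stat command for the direct prompt
pnum_select() {
  # paste -s joins all input lines into one, -d, puts commas between them
  printf 'select stat detail=yes pnum=(%s);\n' "$(paste -s -d, -)"
}

printf '22505\n22675\n22802\n23216\n23289\n' | pnum_select
```

The printed line can then be pasted at the direct prompt.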
 
I soon found myself doing this sort of thing repeatedly, each time looking for just a couple of pieces of info from the stats.  I wanted to view more than just the PNUM field, so that I could skip the step of going into the direct prompt to run my select stat command.

cat S200809* | grep RECI=CTRC | grep procfeed.200809 | \
awk -F"|" '{
    # populate array B with all values using field names as subscripts
    for (i=1;i<=NF;i++) {
      SS=split($i,A,"=");SUB=A[1]; B[SUB]=A[2];delete A[1];delete A[2]
    }
    # go through all the fields we need to see and show values
    NE=split("PNUM,PNOD,SNOD,SFIL,DFIL,CCOD",F,",")
    for (IX=1;IX<=NE;IX++) {
      FLD=F[IX]
      printf "%s\t",B[FLD]
      delete F[IX]
    }
    print ""
    # clear array B
    for (SUB in B) delete B[SUB]
}'

Now this gives me a nice tab-delimited list of PNUMs with the source and destination servers and file names.  What the above code does is grep through the stat files for the copy records with the first part of the transmitted file name in them, and feed just those lines into the awk program.  In the awk program we go through all of the fields as separated by pipe characters and populate an array with the info, one array element for each field, with the field name used as the subscript.  Then we go through a list of just the field names we want and display those array elements.  We separate the output with tabs, print a trailing newline, and clear the array so that values from one record do not carry over into the next.
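
Clearing the array matters as soon as records do not all carry the same fields.  Here is a small sketch that shows the difference; the function names show_fields and sample are made up, and the two shortened records are for illustration only.

```shell
#!/bin/sh
# show_fields (hypothetical name): print PNUM, RECI and CCOD for each record.
# Pass 1 to empty array B after every record, 0 to skip that step.
show_fields() {
  awk -F'|' -v clear="$1" '{
    for (i = 1; i <= NF; i++) { split($i, A, "="); B[A[1]] = A[2] }
    printf "%s\t%s\t%s\n", B["PNUM"], B["RECI"], B["CCOD"]
    if (clear) for (SUB in B) delete B[SUB]
  }'
}

# two shortened records; the second one has no CCOD field
sample() {
  echo 'PNUM=123|RECI=CTRC|CCOD=0'
  echo 'PNUM=124|RECI=SSTR'
}

sample | show_fields 1
sample | show_fields 0
```

With the array cleared, the second record shows an empty completion code.  Without clearing, it picks up the 0 left over from the record before it, which would be misleading.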