Is this string a number?

Is this string a number? - 6 décembre 2010

A blog post (Double Parsing Regexp) caught my attention. The base problem is to check if a string is a valid number, without necessarily needing the real value. Or, perhaps, to skip the conversion if we are sure it will fail, as Double.valueOf throws an exception in this case, and the try/catch mechanism is known to be slow.

The numbers given were intriguing, and I wondered if the given regular expression could be optimized. Somehow, it was a great opportunity to use Caliper a library to do correctly micro-benchmarks.

Indeed, it is tempting, when we wonder which of two (or more) methods is the fastest, to write a simple loop around the code snippet, to run it lot of times, and to measure the run time to compare the results.
But there are a couple of caveats of such tests: you have to run them several times to ensure multitask OS isn't busy elsewhere, you can have some of the code optimized away by the compiler (detecting operations doing nothing), and in the case of Java, the Hotspot Jit compiler takes some time to kick in, so you must take in account the warm up period.

Caliper is supposed to take care of most of these issues, automating the warm up sequence, averaging several runs, etc.

So I have to get the library first. The project offers no files yet, but being lazy (and not wanting to waste time), I found a caliper.jar in the Vogar project's VCS, and used it. I don't know if it is cutting edge, but it seems to fit the job.

I give only parts of the code here, as it is a bit repetitive, but if you are curious, you can find the whole file at FloatingPointTesterBenchmark.java

I kept the idea from Scratch's blog to test separately the cases where we have real numbers as strings from the cases where the strings hold invalid numbers.

I spare you with the full regex, as taken straight from JavaDoc for Double.valueOf. It is used as "worst case" test.
The test code is quite simple, used in most of the tests against the regular expressions below, just changing the name of the pattern:

  private static final Pattern fpPattern = Pattern.compile(fpRegex);
  
  public static boolean checkFloatingPointNumber(String wouldBeFloatingPoint)
  {
    Matcher m = fpPattern.matcher(wouldBeFloatingPoint);
    return m.matches();
  }

I simplified the expression, removing the cases of NaN and Infinity (it is much preferable to test them up front before using a regex, anyway), the trimming, the hexa case, but keeping the case of numbers starting with a dot, as it is a quite commonly used notation, and it offers an interesting case to avoid alternation in the RE.

That makes my fpSimpleRegexA. I deliberately kept as many captures as there were in the original RE. So I made fpSimpleRegexB to see if I transform them to non-capturing grouping, there was a speed gain. Apparently, no.

  private static final String fpSimpleRegexA =
      "[+-]?(" +
      "((\\d+)(\\.)?((\\d+)?)([eE][+-]?(\\d+))?)|" +
      "(\\.((\\d+))([eE][+-]?(\\d+))?)" +
      ")[fFdD]?";
  private static final String fpSimpleRegexB =
      "[+-]?(?:" +
      "(?:(?:\\d+)(?:\\.)?(?:(?:\\d+)?)(?:[eE][+-]?(?:\\d+))?)|" +
      "(?:\\.(?:(?:\\d+))(?:[eE][+-]?(?:\\d+))?)" +
      ")[fFdD]?";

fpSimpleRegexC removes all the unneeded captures/grouping, and there we see a little speed gain. It seems that the non-capturing case isn't optimized by the Sun Java implementation of regexes.

  private static final String fpSimpleRegexC =
      "[+-]?(" +
      "\\d+\\.?\\d*([eE][+-]?\\d+)?|" +
      "\\.\\d+([eE][+-]?\\d+)?" +
      ")[fFdD]?";

I also tried with fpSimpleRegexD to see if a smaller alternative branch helps, but curiously it seems to be the reverse!

  private static final String fpSimpleRegexD =
      "[+-]?(" +
      "\\d+\\.?\\d*|" +
      "\\.\\d+" +
      ")([eE][+-]?\\d+)?[fFdD]?";

Lastly, I tried one of my pet peeve: I believe that some tests before applying a RE can go a long way to simplify them, and even to speed them up. So I made fpFastRegex1 and fpFastRegex2, and a little code to choose if I must apply the first one or the second one. And indeed, I gain a little speed, although less than I thought initially, as I first checked against the first REs above.

  private static final String fpFastRegex1 =
      "[+-]?" +
      "(\\d+\\.?\\d*([eE][+-]?\\d+)?)" +
      "[fFdD]?";
  private static final String fpFastRegex2 =
      "[+-]?" +
      "\\.\\d+([eE][+-]?\\d+)?" +
      "[fFdD]?";

  private static final Pattern fpFastPattern1 = Pattern.compile(fpFastRegex1);
  private static final Pattern fpFastPattern2 = Pattern.compile(fpFastRegex2);

  public static boolean checkFloatingPointNumberFast(String wouldBeFloatingPoint)
  {
    if (wouldBeFloatingPoint.length() >= 2)
    {
      char c0 = wouldBeFloatingPoint.charAt(0);
      char c1 = wouldBeFloatingPoint.charAt(1);
      if (c0 == '.' ||
          (c0 == '+' || c0 == '-') && c1 == '.')
      {
        Matcher m = fpFastPattern2.matcher(wouldBeFloatingPoint);
        return m.matches();
      }
    }
    Matcher m = fpFastPattern1.matcher(wouldBeFloatingPoint);
    return m.matches();
  }

The test code is slightly more complex but still simple and it allows to use simpler expressions.
I compared all these regex matching methods against the classical method of using Double.valueOf and catching exceptions if it isn't a good string:

  public static boolean checkFloatingPointNumberAccurate(String wouldBeFloatingPoint)
  {
    try
    {
      Double d = Double.valueOf(wouldBeFloatingPoint);
      return true;
    }
    catch (NumberFormatException e)
    {
      return false;
    }
  }

Indeed, this one is much faster on good cases (valid numbers) and much slower (the issue of try/catch mechanism) on bad cases.
My test cases are not exhaustive, they could be beefed up a bit (work in progress in the repository).

  public static final String[] testStringsOK =
  {
    "0",
    "0.",
    "0.1",
    ".1",
    "-.1",
    "3.14159265358979323d",
    "14140728F",
    "-0.001",
    "-898",
    "666",
    "42E7",
    "1e-7",
    "55.e-77",
    ".333",
    "-.333e-1",
    "-77.77e-77",
  };
  public static final String[] testStringsKO =
  {
    ".0.",
    "0..",
    "0.1-",
    "..1",
    "--.1",
    "3,14159265358979323D",
    "14 140 728f",
    "0.1%",
    "-0.001$",
    "-8/98",
    "#666",
    "42F7",
    "1f-7",
    "55.E.-77",
    ".33.3",
    "-.333.e-1",
    "-77.77e-77.",
  };

I made some code to verify my routines are working OK, without exceptions or errors:

  private static void testCheckers()
  {
    for (String testString : testStringsOK)
    {
      if (!checkFloatingPointNumber(testString)) System.out.println("RE: Error on " + testString);
      if (!checkFloatingPointNumberSimpleA(testString)) System.out.println("SimpleA: Error on " + testString);
      // etc;
    }
    for (String testString : testStringsKO)
    {
      if (checkFloatingPointNumber(testString)) System.out.println("RE: OK on " + testString);
      if (checkFloatingPointNumberSimpleA(testString)) System.out.println("SimpleA: OK on " + testString);
      // etc;
    }
  }

  public static void main(String args[])
  {
    if (args.length == 0)
    {
      testCheckers();
      testBenchmark();
      return;
    }
    System.out.println("Is '" + args[0] + "' a valid floating point number? " +
        checkFloatingPointNumber(args[0]));
  }
}

The testBenchmark() was made before the testCheckers, to verify my Caliper code was OK, not throwing exceptions.

  private static void testBenchmark()
  {
    FloatingPointTesterBenchmark fptb = new FloatingPointTesterBenchmark();
    fptb.timeFPCheckingOK(1);
    fptb.timeFPCheckingKO(1);
    fptb.timeFPCheckingSimpleAOK(1);
    fptb.timeFPCheckingSimpleAKO(1);
    // etc.
  }

The FloatingPointTesterBenchmark class is the Caliper dedicated class that is run to time to various routines. The timeFPXxxOK() routines are quite simple.

public class FloatingPointTesterBenchmark extends SimpleBenchmark
{
  @Override
  protected void setUp()
  {
  }

  public void timeFPCheckingSimpleAOK(int reps)
  {
    for (int i = 0; i < reps; i++)
    {
      for (String testString : FloatingPointTester.testStringsOK)
      {
        FloatingPointTester.checkFloatingPointNumberSimpleA(testString);
      }
    }
  }
  public void timeFPCheckingSimpleAKO(int reps)
  {
    for (int i = 0; i < reps; i++)
    {
      for (String testString : FloatingPointTester.testStringsKO)
      {
        FloatingPointTester.checkFloatingPointNumberSimpleA(testString);
      }
    }
  }
  // Etc.

  public static void main(String[] args) throws Exception
  {
    Runner.main(FloatingPointTesterBenchmark.class, args);
  }
}

I was feeling we could do better. Somehow, writing a simple finite state machine to match the RE isn't that hard and can fail fast if needed. So I drawn such FSM, implemented it and after some adjustments, I found out it was beating all the above methods hands down! It might still have some bugs: as I wrote, the test set should be expanded to test it more thoroughly.
The FSM (a quick drawing made with Inkscape):
The finite state machine to parse a double
This FSM works better than I expected, blowing out all the concurrents, particularly on bad cases: it can exit early, not throwing exception, so it is fast.
The coding of the FSM:

  private static boolean isDigit(char c)
  {
//~     return Character.isDigit(c); // If Unicode awareness if required
    return c >= '0' && c <= '9';
  }
  public static boolean checkFloatingPointNumberByHand(String pfp)
  {
    int len = pfp.length();
    if (len == 0)
      return false;
    if (len == 1)
      return Character.isDigit(pfp.charAt(0));

    int state = 0;
    int cursor = 0;
    boolean bCheckAgain = false;
    char c = ' ';
    while (cursor < len || bCheckAgain)
    {
      if (!bCheckAgain)
      {
        c = pfp.charAt(cursor++);
      }
      else
      {
        bCheckAgain = false;
      }
      switch (state)
      {
        case 0:
          state = 1;
          if (c == '+' || c == '-')
          {
            break;
          }
          bCheckAgain = true; // Not a sign, process current char
          break;
        case 1:
          if (c == '.')
          {
            state = 2;
          }
          else if (isDigit(c))
          {
            state = 4;
          }
          else
          {
            return false; // Unexpected char
          }
          break;
        case 2: // After prefixing dot, want digits
          if (isDigit(c))
          {
            state = 3;
          }
          else
          {
            return false;
          }
          break;
        case 3: // After dot, want more digits
          if (isDigit(c))
          {
            break; // Continue on this state
          }
          else
          {
            state = 5; // Before exponent, if any
            bCheckAgain = true; // Check for exponent or type symbol
          }
          break;
        case 4: // After initial digit, want more digits or dot
          if (c == '.')
          {
            state = 3;
          }
          else if (isDigit(c))
          {
            break; // Continue on this state
          }
          else
          {
            state = 5;
            bCheckAgain = true; // Check for exponent or type symbol
          }
          break;
        case 5: // Before exponent or type symbol
          if (c == 'e' || c == 'E')
          {
            state = 6;
          }
          else
          {
            state = 9; // Check for type symbol
            bCheckAgain = true; // Not an exponent, process current char
          }
          break;
        case 6:
          state = 7;
          if (c == '+' || c == '-')
          {
            break;
          }
          bCheckAgain = true; // Not a sign, process current char
          break;
        case 7: // Expect a digit
          if (isDigit(c))
          {
            state = 8;
          }
          else
          {
            return false;
          }
          break;
        case 8: // Loop on digits
          if (isDigit(c))
          {
            break;  // Continue on this state
          }
          else
          {
            state = 9;
            bCheckAgain = true; // Not a sign, process current char
          }
          break;
        case 9:
          state = 10;
          if (c == 'f' || c == 'F' || c == 'd' || c == 'D')
          {
            break;
          }
          return false; // We should have no more chars here!
        case 10:
          return false; // (at least) one char too much
      }
    }
    if (state == 3 || state == 4 || state == 8 || state == 10)
    {
      return cursor == len; // OK if there are no trailing chars
    }
    return false;
  }

The run of the benchmark is as follows (on an old computer):

> java -cp .;D:\Archives\_Recent\caliper.jar FloatingPointTesterBenchmark
 0% Scenario{vm=java, trial=0, benchmark=FPCheckingOK} 29170,40 ns; ?=486,58 ns @ 10 trials
 6% Scenario{vm=java, trial=0, benchmark=FPCheckingKO} 46397,68 ns; ?=846,52 ns @ 10 trials
13% Scenario{vm=java, trial=0, benchmark=FPCheckingSimpleAOK} 18295,58 ns; ?=141,39 ns @ 3 trials
19% Scenario{vm=java, trial=0, benchmark=FPCheckingSimpleAKO} 24767,98 ns; ?=95,57 ns @ 3 trials
25% Scenario{vm=java, trial=0, benchmark=FPCheckingSimpleBOK} 17565,72 ns; ?=113,22 ns @ 3 trials
31% Scenario{vm=java, trial=0, benchmark=FPCheckingSimpleBKO} 24203,24 ns; ?=147,63 ns @ 3 trials
38% Scenario{vm=java, trial=0, benchmark=FPCheckingSimpleCOK} 13368,73 ns; ?=120,19 ns @ 4 trials
44% Scenario{vm=java, trial=0, benchmark=FPCheckingSimpleCKO} 17449,91 ns; ?=234,34 ns @ 10 trials
50% Scenario{vm=java, trial=0, benchmark=FPCheckingSimpleDOK} 13567,15 ns; ?=159,90 ns @ 10 trials
56% Scenario{vm=java, trial=0, benchmark=FPCheckingSimpleDKO} 18279,78 ns; ?=162,53 ns @ 4 trials
63% Scenario{vm=java, trial=0, benchmark=FPCheckingFastOK} 13042,81 ns; ?=124,76 ns @ 8 trials
69% Scenario{vm=java, trial=0, benchmark=FPCheckingFastKO} 15943,08 ns; ?=132,36 ns @ 4 trials
75% Scenario{vm=java, trial=0, benchmark=FPCheckingAccurateOK} 7039,20 ns; ?=33,15 ns @ 3 trials
81% Scenario{vm=java, trial=0, benchmark=FPCheckingAccurateKO} 67122,62 ns; ?=437,61 ns @ 3 trials
88% Scenario{vm=java, trial=0, benchmark=FPCheckingByHandOK} 1394,36 ns; ?=23,77 ns @ 10 trials
94% Scenario{vm=java, trial=0, benchmark=FPCheckingByHandKO} 1241,15 ns; ?=14,37 ns @ 10 trials

           benchmark    us logarithmic runtime
        FPCheckingOK 29,17 =======================
        FPCheckingKO 46,40 ===========================
 FPCheckingSimpleAOK 18,30 ====================
 FPCheckingSimpleAKO 24,77 ======================
 FPCheckingSimpleBOK 17,57 ====================
 FPCheckingSimpleBKO 24,20 ======================
 FPCheckingSimpleCOK 13,37 ==================
 FPCheckingSimpleCKO 17,45 ====================
 FPCheckingSimpleDOK 13,57 ==================
 FPCheckingSimpleDKO 18,28 ====================
    FPCheckingFastOK 13,04 ==================
    FPCheckingFastKO 15,94 ===================
FPCheckingAccurateOK  7,04 =============
FPCheckingAccurateKO 67,12 ==============================
  FPCheckingByHandOK  1,39 =
  FPCheckingByHandKO  1,24 =

vm: java
trial: 0

Overall, it was an interesting experience, with test of a useful library (allowing to choose between alternative designs when speed is important) and a routine that can be used in many cases.
Note 1: all these methods, except, obviously the Double.valueOf method, check for syntax, but not for bounds: they will happily accept an exponent with a dozen of digits! I suppose that's the price of speed...
Note 2: I forgot to put my usual headers, but this code is thereby put under my usual zlib/libpng license, a kind of BSD license. Feel free to reuse if you want.

Philippe Lhostedans Java Lundi 6 décembre 2010 à 07:21
Ajouter un commentaire Liens vers le billet

Rétroliens

Rétrolien spécifique pour ce billet

Pas de rétroliens

Commentaires

Afficher les commentaires en (Vue non groupée | Vue groupée)

Pas de commentaires

Ajouter un commentaire

Nom
Email
Site
En réponse à
Commentaire: Les smilies standard comme :-) et ;-) sont convertis en images.
Les adresses Email ne sont pas affichées, et sont seulement utilisées pour la communication.

Pour éviter le spam par des robits automatisés (spambots), merci d'entrer les caractères que vous voyez dans l'image ci-dessous dans le champ de fomulaire prévu à cet effet. Assurez-vous que votre navigateur gère et accepte les cookies, sinon votre commentaire ne pourra pas être enregistré.

Entrez le texte de l'image anti-spam ci-dessus ici:
: Enregistrer ces informations ?
Souscrire à ce billet
Les commentaires postés doivent être approuvés avant d'être affichés dans le blog.

PhiLho tech web log

Is this string a number? - 6 décembre 2010

Catégories

Calendrier

Archives

Syndiquer ce Blog

«	Avril '24					»
Lu	Ma	Me	Je	Ve	Sa	Di
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30