I need to escape special characters which are sent to Apache Lucene.

Since the code will run on a production server, I want it to be as fast as possible.

I've seen multiple ways to do it:

  • Using Pattern
  • Using Replace
  • Using Library

See: http://www.javalobby.org/java/forums/t86124.html
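For comparison with the hand-written loop below, here is a minimal sketch of the Pattern approach. The class name `RegexEscaper` and the method `escape` are made up for illustration, and the character set is taken from the comment in the question's code (doubled `&&` and `||` are escaped as a unit, single `&` and `|` are left alone). The pattern is compiled once and reused, since recompiling it on every call would likely dominate the cost.

```java
import java.util.regex.Pattern;

// Hypothetical helper: escapes Lucene query syntax characters with one precompiled regex.
final class RegexEscaper {
    // Group 1 matches a single special character, or the doubled operators && and ||.
    private static final Pattern SPECIAL =
            Pattern.compile("([+\\-!(){}\\[\\]^\"~*?:\\\\]|&&|\\|\\|)");

    static String escape(String query) {
        // In the replacement string, "\\\\" is one literal backslash and $1 is the matched text.
        return SPECIAL.matcher(query).replaceAll("\\\\$1");
    }

    public static void main(String[] args) {
        System.out.println("TEST=" + escape("http://This+*is||a&&test(whatever!!!!!!)"));
    }
}
```

Whether this is fast enough depends on the workload; the regex engine does more work per character than a plain loop, so it is mainly attractive for its brevity.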

I'm wondering:

  • For a trivial case such as this, should I use a regex or custom code?
  • Can the below code be optimized further?

    /*
     * Lucene supports escaping special characters that are part of the
     * query syntax. The current list of special characters is + - && || !
     * ( ) { } [ ] ^ " ~ * ? : \
     * 
     * To escape one of these characters, put a \ before it.
     */
    String query = "http://This+*is||a&&test(whatever!!!!!!)";
    char[] queryCharArray = new char[query.length()*2];
    char c;
    int length = query.length();
    int currentIndex = 0;
    for (int i = 0; i < length; i++) 
    {
        c = query.charAt(i);
        switch (c) {                
        case ':':
        case '\\':
        case '?':
        case '+':
        case '-':
        case '!':
        case '(':
        case ')':
        case '{':
        case '}':
        case '[':
        case ']':
        case '^':
        case '"':
        case '~':
        case '*':
            queryCharArray[currentIndex++] = '\\'; 
            queryCharArray[currentIndex++] = c; 
        break;
    
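        case '&':
        case '|':   
            // Only the doubled operators && and || are escaped.
            if(i+1 < length && query.charAt(i+1) == c)
            {
                queryCharArray[currentIndex++] = '\\'; 
                queryCharArray[currentIndex++] = c; 
                queryCharArray[currentIndex++] = c; 
                i++;
            }
            else
            {
                // A single & or | is not special: copy it through instead of dropping it.
                queryCharArray[currentIndex++] = c; 
            }
        break;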
    
        default:
            queryCharArray[currentIndex++] = c;     
    
        }
    
    }
    
    query = new String(queryCharArray,0,currentIndex);
    
    System.out.println("TEST="+query);
    
Tot Zam
Menelaos
  • I think so. For some reason Lucene wants && and || escaped only when both characters appear together; at least that's what the comment seems to say. – Menelaos Sep 23 '13 at 10:36

1 Answer


I would use a boolean[65536] which flags if the character has to be escaped. I am quite confident that this is faster than the switch.

But only profiling can show if it is really faster.

    String query = "http://This+*is||a&&test(whatever!!!!!!)";
    char[] queryCharArray = new char[query.length()*2];
    char c;
    int length = query.length();
    int currentIndex = 0;
    for (int i = 0; i < length; i++) 
    {
        c = query.charAt(i);
        if(mustBeEscaped[c]){        
          if('&'==c || '|'==c){
            // Only the doubled operators && and || are escaped.
            if(i+1 < length && query.charAt(i+1) == c){
                queryCharArray[currentIndex++] = '\\'; 
                queryCharArray[currentIndex++] = c; 
                queryCharArray[currentIndex++] = c; 
                i++;
            }
            else{
                // A single & or | is copied through unchanged.
                queryCharArray[currentIndex++] = c; 
            }
          }    
          else{
            queryCharArray[currentIndex++] = '\\'; 
            queryCharArray[currentIndex++] = c; 
          }     
        }
        else{
            queryCharArray[currentIndex++] = c;     
        }
    }

    query = new String(queryCharArray,0,currentIndex);

    System.out.println("TEST="+query);

    // Lookup table, declared as a static field of the enclosing class.
    private static final boolean[] mustBeEscaped = new boolean[65536];
    static{
      mustBeEscaped[':']=true;
      for(char c: "\\?+-!(){}[]^\"~*&|".toCharArray()){
         mustBeEscaped[c]=true; 
      }
    }
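To follow up on "only profiling can show", here is a rough, self-contained sketch that wraps both variants as methods and times them with `System.nanoTime()`. The class name `EscapeComparison`, the method names, and the iteration counts are assumptions made for this example, and a loop like this is only a crude indicator; a proper harness such as JMH would give more trustworthy numbers. Here `&` and `|` are kept out of the table and handled in a separate branch, which is a small variation on the code above.

```java
import java.util.function.UnaryOperator;

// Hypothetical comparison of the lookup-table and switch variants.
final class EscapeComparison {
    // true at index c means "always prefix c with a backslash".
    private static final boolean[] MUST_ESCAPE = new boolean[65536];
    static {
        for (char c : ":\\?+-!(){}[]^\"~*".toCharArray()) {
            MUST_ESCAPE[c] = true;
        }
    }

    static String escapeWithTable(String query) {
        int length = query.length();
        StringBuilder out = new StringBuilder(length * 2);
        for (int i = 0; i < length; i++) {
            char c = query.charAt(i);
            if (MUST_ESCAPE[c]) {
                out.append('\\').append(c);
            } else if ((c == '&' || c == '|') && i + 1 < length && query.charAt(i + 1) == c) {
                out.append('\\').append(c).append(c); // doubled operator: escape once, skip its twin
                i++;
            } else {
                out.append(c); // ordinary character, or a single & or |
            }
        }
        return out.toString();
    }

    static String escapeWithSwitch(String query) {
        int length = query.length();
        StringBuilder out = new StringBuilder(length * 2);
        for (int i = 0; i < length; i++) {
            char c = query.charAt(i);
            switch (c) {
                case ':': case '\\': case '?': case '+': case '-': case '!':
                case '(': case ')': case '{': case '}': case '[': case ']':
                case '^': case '"': case '~': case '*':
                    out.append('\\').append(c);
                    break;
                case '&': case '|':
                    if (i + 1 < length && query.charAt(i + 1) == c) {
                        out.append('\\').append(c).append(c);
                        i++;
                    } else {
                        out.append(c);
                    }
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    // Runs f repeatedly and returns the elapsed nanoseconds.
    static long timeNanos(UnaryOperator<String> f, String input, int iterations) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += f.apply(input).length(); // use the result so the call is not optimized away
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) System.out.println(sink);
        return elapsed;
    }

    public static void main(String[] args) {
        String input = "http://This+*is||a&&test(whatever!!!!!!)";
        // Warm-up rounds so the JIT has a chance to compile both methods first.
        timeNanos(EscapeComparison::escapeWithTable, input, 200_000);
        timeNanos(EscapeComparison::escapeWithSwitch, input, 200_000);
        System.out.println("table  ns: " + timeNanos(EscapeComparison::escapeWithTable, input, 1_000_000));
        System.out.println("switch ns: " + timeNanos(EscapeComparison::escapeWithSwitch, input, 1_000_000));
    }
}
```

The timings will vary by JVM and hardware, so they are best read as a relative hint on the target machine rather than a general answer.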
MrSmith42