Class CollationElementIterator

java.lang.Object
com.ibm.icu.text.CollationElementIterator

public final class CollationElementIterator extends Object
CollationElementIterator is an iterator created by a RuleBasedCollator to walk through a string. Each iteration returns a 32-bit collation element (CE) that defines the ordering priority of the next character or sequence of characters in the source string.

For illustration, consider the following in Slovak and in traditional Spanish collation:

 "ca" -> the first collation element is CE('c') and the second
         collation element is CE('a').
 "cha" -> the first collation element is CE('ch') and the second
          collation element is CE('a').
 
And in German phonebook collation,
 since the character 'æ' is a composed character of 'a' and 'e', the
 iterator returns two collation elements for the single character 'æ':

 "æb" -> the first collation element is collation_element('a'), the
              second collation element is collation_element('e'), and the
              third collation element is collation_element('b').
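
The two cases above (a contraction such as "ch" mapping to one element, and an expansion such as 'æ' mapping to two) can be mimicked with a toy model. The sketch below is only an illustration: ToyCollationWalk, its contraction list and its expansion map are hypothetical names, and it produces text labels rather than real 32-bit collation elements. It does not use or reproduce the ICU implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model, not ICU code: shows that characters and collation elements
// do not map one-to-one.
final class ToyCollationWalk {
    static List<String> elements(String source,
                                 List<String> contractions,
                                 Map<Character, String> expansions) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            // A contraction consumes several characters and yields one element.
            String matched = null;
            for (String c : contractions) {
                if (source.startsWith(c, i)) {
                    matched = c;
                    break;
                }
            }
            if (matched != null) {
                out.add("CE(" + matched + ")");
                i += matched.length();
                continue;
            }
            // An expansion consumes one character and yields several elements.
            char ch = source.charAt(i);
            String expansion = expansions.get(ch);
            if (expansion != null) {
                for (char part : expansion.toCharArray()) {
                    out.add("CE(" + part + ")");
                }
            } else {
                out.add("CE(" + ch + ")");
            }
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        // Contraction: prints [CE(ch), CE(a)]
        System.out.println(elements("cha", List.of("ch"), Map.of()));
        // Expansion of U+00E6 (ae ligature): prints [CE(a), CE(e), CE(b)]
        System.out.println(elements("\u00e6b", List.of(), Map.of('\u00e6', "ae")));
    }
}
```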
 

For collation ordering comparison, the collation element results cannot be compared simply by using basic arithmetic operators such as <, == or >; further processing has to be done. Details can be found in the ICU User Guide. The class StringSearch is an example of using CollationElementIterator for collation ordering comparison.
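
One possible form of such further processing is to mask each collation element down to the levels that matter for the comparison strength before testing for equality. The sketch below is a minimal illustration, assuming the layout described for primaryOrder(), secondaryOrder() and tertiaryOrder() (16-bit primary, then 8-bit secondary, then 8-bit tertiary). CeMask and its constants are hypothetical names and are not part of ICU.

```java
// Hypothetical helper, not part of ICU: masks a 32-bit collation element (CE)
// down to the levels that matter for a given comparison strength.
final class CeMask {
    static final int PRIMARY = 0;
    static final int SECONDARY = 1;
    static final int TERTIARY = 2;

    static int mask(int ce, int strength) {
        switch (strength) {
            case PRIMARY:
                return ce & 0xffff0000;   // keep the primary weight only
            case SECONDARY:
                return ce & 0xffffff00;   // keep primary and secondary weights
            default:
                return ce;                // keep all three weights
        }
    }

    /** Compares two CEs for equality at the given strength. */
    static boolean sameAtStrength(int ce1, int ce2, int strength) {
        return mask(ce1, strength) == mask(ce2, strength);
    }
}
```

With this sketch, two made-up elements such as 0x12340501 and 0x12340502 are unequal under a raw == test but equal at SECONDARY strength, because they differ only in the tertiary weight.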

To construct a CollationElementIterator object, users call the method getCollationElementIterator() on a RuleBasedCollator that defines the desired sorting order.

Example:

  // Requires com.ibm.icu.text.CollationElementIterator and com.ibm.icu.text.RuleBasedCollator.
  // The RuleBasedCollator(String) constructor throws a checked Exception if the rules are invalid.
  String testString = "This is a test";
  RuleBasedCollator rbc = new RuleBasedCollator("&a<b");
  CollationElementIterator iterator = rbc.getCollationElementIterator(testString);
  int order;
  // NULLORDER marks the end of the iteration.
  while ((order = iterator.next()) != CollationElementIterator.NULLORDER) {
      if (order != CollationElementIterator.IGNORABLE) {
          // order is valid and not ignorable, so we do something with it
          int primaryOrder = CollationElementIterator.primaryOrder(order);
          System.out.println("Next primary order 0x" +
                             Integer.toHexString(primaryOrder));
      }
  }
 

The method next() returns the collation order of the next character based on the comparison level of the collator. The method previous() returns the collation order of the previous character based on the comparison level of the collator. The CollationElementIterator moves in only one direction between calls to reset(), setOffset(), or setText(); that is, calls to next() and previous() cannot be interleaved. Whenever previous() is to be called after next(), or vice versa, reset(), setOffset() or setText() has to be called first to reset the iterator's state. This shifts the current position to either the end or the start of the string (reset() or setText()), or to the specified position (setOffset()). The next call to next() or previous() then returns the first or last collation order, or the collation order at the specified position. If the direction is changed without one of these calls, the result is undefined.
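
The direction rule can be tracked on the caller's side with a small piece of bookkeeping. The sketch below is a hypothetical helper (DirectionGuard is not part of ICU) that never touches an iterator itself. It only records which calls were made and reports whether a call to reset(), setOffset() or setText() is needed before changing direction. Throwing IllegalStateException is a choice made for this sketch. The documentation above only states that the result is undefined.

```java
// Hypothetical bookkeeping helper, not part of ICU. It only tracks call order;
// the caller still invokes next(), previous(), reset() etc. on the real iterator.
final class DirectionGuard {
    private enum Direction { UNSET, FORWARD, BACKWARD }

    private Direction last = Direction.UNSET;

    /** Record a call to reset(), setOffset() or setText(). */
    void repositioned() {
        last = Direction.UNSET;
    }

    /** True if next() may be called without repositioning first. */
    boolean mayCallNext() {
        return last != Direction.BACKWARD;
    }

    /** True if previous() may be called without repositioning first. */
    boolean mayCallPrevious() {
        return last != Direction.FORWARD;
    }

    /** Record a call to next(); rejects an unguarded change of direction. */
    void beforeNext() {
        if (!mayCallNext()) {
            throw new IllegalStateException("reposition the iterator before calling next()");
        }
        last = Direction.FORWARD;
    }

    /** Record a call to previous(); rejects an unguarded change of direction. */
    void beforePrevious() {
        if (!mayCallPrevious()) {
            throw new IllegalStateException("reposition the iterator before calling previous()");
        }
        last = Direction.BACKWARD;
    }
}
```

A caller would invoke beforeNext() just before each next(), beforePrevious() just before each previous(), and repositioned() after each reset(), setOffset() or setText().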

This class is not subclassable.

  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    private static final class 
     
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    private byte
    dir_
    <0: backwards; 0: just after reset() (previous() begins from end); 1: just after setOffset(); >1: forward
    static final int
    IGNORABLE
    This constant is returned by the iterator in the methods next() and previous() when a collation element result is to be ignored.
    private CollationIterator
    iter_
     
    static final int
    NULLORDER
    This constant is returned by the iterator in the methods next() and previous() when the end or the beginning of the source string has been reached, and there are no more valid collation elements to return.
    private UVector32
    offsets_
    Stores offsets from expansions and from unsafe-backwards iteration, so that getOffset() returns intermediate offsets for the CEs that are consistent with forward iteration.
    private int
    otherHalf_
     
    private RuleBasedCollator
    rbc_
     
    private String
    string_
     
  • Constructor Summary

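    Constructors
    Modifier
    Constructor
    Description
    private
    CollationElementIterator(RuleBasedCollator collator)
     
    (package private)
    CollationElementIterator(String source, RuleBasedCollator collator)
    CollationElementIterator constructor.
    (package private)
    CollationElementIterator(CharacterIterator source, RuleBasedCollator collator)
    CollationElementIterator constructor.
    (package private)
    CollationElementIterator(UCharacterIterator source, RuleBasedCollator collator)
    CollationElementIterator constructor.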
  • Method Summary

    Modifier and Type
    Method
    Description
    private static final boolean
    ceNeedsTwoParts(long ce)
     
    (package private) static final Map<Integer,Integer>
    computeMaxExpansions(CollationData data)
     
    boolean
    equals(Object that)
    Tests whether the argument object is equal to this CollationElementIterator.
    private static final int
    getFirstHalf(long p, int lower32)
     
    int
    getMaxExpansion(int ce)
    Returns the maximum length of any expansion sequence that ends with the specified collation element.
    (package private) static int
    getMaxExpansion(Map<Integer,Integer> maxExpansions, int order)
     
    int
    getOffset()
    Returns the character offset in the source string corresponding to the next collation element.
    RuleBasedCollator
    getRuleBasedCollator()
    Deprecated.
    This API is ICU internal only.
    private static final int
    getSecondHalf(long p, int lower32)
     
    int
    hashCode()
    Mock implementation of hashCode().
    int
    next()
    Get the next collation element in the source string.
    private byte
    normalizeDir()
    Normalizes dir_=1 (just after setOffset()) to dir_=0 (just after reset()).
    int
    previous()
    Get the previous collation element in the source string.
    static final int
    primaryOrder(int ce)
    Return the primary order of the specified collation element, i.e. the first 16 bits.
    void
    reset()
    Resets the cursor to the beginning of the string.
    static final int
    secondaryOrder(int ce)
    Return the secondary order of the specified collation element, i.e. the 16th to 23rd bits, inclusive.
    void
    setOffset(int newOffset)
    Sets the iterator to point to the collation element corresponding to the character at the specified offset.
    void
    setText(UCharacterIterator source)
    Set a new source string iterator for iteration, and reset the offset to the beginning of the text.
    void
    setText(String source)
    Set a new source string for iteration, and reset the offset to the beginning of the text.
    void
    setText(CharacterIterator source)
    Set a new source string iterator for iteration, and reset the offset to the beginning of the text.
    static final int
    tertiaryOrder(int ce)
    Return the tertiary order of the specified collation element, i.e. the last 8 bits.

    Methods inherited from class java.lang.Object

    clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • iter_

      private CollationIterator iter_
    • rbc_

      private RuleBasedCollator rbc_
    • otherHalf_

      private int otherHalf_
    • dir_

      private byte dir_
      <0: backwards; 0: just after reset() (previous() begins from end); 1: just after setOffset(); >1: forward
    • offsets_

      private UVector32 offsets_
      Stores offsets from expansions and from unsafe-backwards iteration, so that getOffset() returns intermediate offsets for the CEs that are consistent with forward iteration.
    • string_

      private String string_
    • NULLORDER

      public static final int NULLORDER
      This constant is returned by the iterator in the methods next() and previous() when the end or the beginning of the source string has been reached, and there are no more valid collation elements to return.

      See class documentation for an example of use.

    • IGNORABLE

      public static final int IGNORABLE
      This constant is returned by the iterator in the methods next() and previous() when a collation element result is to be ignored.

      See class documentation for an example of use.

  • Constructor Details

    • CollationElementIterator

      private CollationElementIterator(RuleBasedCollator collator)
    • CollationElementIterator

      CollationElementIterator(String source, RuleBasedCollator collator)
      CollationElementIterator constructor. This takes a source string and a RuleBasedCollator. The iterator will walk through the source string based on the rules defined by the collator. If the source string is empty, NULLORDER will be returned on the first call to next().
      Parameters:
      source - the source string.
      collator - the RuleBasedCollator
    • CollationElementIterator

      CollationElementIterator(CharacterIterator source, RuleBasedCollator collator)
      CollationElementIterator constructor. This takes a source character iterator and a RuleBasedCollator. The iterator will walk through the source string based on the rules defined by the collator. If the source string is empty, NULLORDER will be returned on the first call to next().
      Parameters:
      source - the source string iterator.
      collator - the RuleBasedCollator
    • CollationElementIterator

      CollationElementIterator(UCharacterIterator source, RuleBasedCollator collator)
      CollationElementIterator constructor. This takes a source character iterator and a RuleBasedCollator. The iterator will walk through the source string based on the rules defined by the collator. If the source string is empty, NULLORDER will be returned on the first call to next().
      Parameters:
      source - the source string iterator.
      collator - the RuleBasedCollator
  • Method Details

    • primaryOrder

      public static final int primaryOrder(int ce)
      Return the primary order of the specified collation element, i.e. the first 16 bits. This value is unsigned.
      Parameters:
      ce - the collation element
      Returns:
      the element's 16-bit primary order.
    • secondaryOrder

      public static final int secondaryOrder(int ce)
      Return the secondary order of the specified collation element, i.e. the 16th to 23rd bits, inclusive. This value is unsigned.
      Parameters:
      ce - the collation element
      Returns:
      the element's 8-bit secondary order.
    • tertiaryOrder

      public static final int tertiaryOrder(int ce)
      Return the tertiary order of the specified collation element, i.e. the last 8 bits. This value is unsigned.
      Parameters:
      ce - the collation element
      Returns:
      the element's 8-bit tertiary order.
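
The bit positions described for primaryOrder(), secondaryOrder() and tertiaryOrder() can be written out as shifts and masks. The sketch below mirrors that documented layout only. CeFields is a hypothetical name, the sample value is made up, and the code is not taken from the ICU sources.

```java
// Sketch of the documented bit layout; CeFields is a hypothetical name, not ICU code.
final class CeFields {
    /** First (most significant) 16 bits, as an unsigned value. */
    static int primary(int ce) {
        return (ce >>> 16) & 0xffff;
    }

    /** Next 8 bits, as an unsigned value. */
    static int secondary(int ce) {
        return (ce >>> 8) & 0xff;
    }

    /** Last (least significant) 8 bits, as an unsigned value. */
    static int tertiary(int ce) {
        return ce & 0xff;
    }

    public static void main(String[] args) {
        int ce = 0x12345678; // made-up value for illustration
        // prints primary=0x1234 secondary=0x56 tertiary=0x78
        System.out.printf("primary=0x%04x secondary=0x%02x tertiary=0x%02x%n",
                primary(ce), secondary(ce), tertiary(ce));
    }
}
```

The unsigned shift (>>>) matters for elements whose top bit is set: primary(0xffff0000) is 0xffff, not a negative number.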
    • getFirstHalf

      private static final int getFirstHalf(long p, int lower32)
    • getSecondHalf

      private static final int getSecondHalf(long p, int lower32)
    • ceNeedsTwoParts

      private static final boolean ceNeedsTwoParts(long ce)
    • getOffset

      public int getOffset()
      Returns the character offset in the source string corresponding to the next collation element. I.e., getOffset() returns the position in the source string corresponding to the collation element that will be returned by the next call to next() or previous(). This value could be any of:
      • The index of the first character corresponding to the next collation element. (This means that if setOffset(offset) sets the index in the middle of a contraction, getOffset() returns the index of the first character in the contraction, which may not be equal to the original offset that was set. Hence calling getOffset() immediately after setOffset(offset) does not guarantee that the original offset set will be returned.)
      • If normalization is on, the index of the immediate subsequent character, or composite character with the first character, having a combining class of 0.
      • The length of the source string, if iteration has reached the end.
      Returns:
      The character offset in the source string corresponding to the collation element that will be returned by the next call to next() or previous().
    • next

      public int next()
      Get the next collation element in the source string.

      This iterator iterates over a sequence of collation elements that were built from the string. Because there isn't necessarily a one-to-one mapping from characters to collation elements, this doesn't mean the same thing as "return the collation element [or ordering priority] of the next character in the string".

      This function returns the collation element that the iterator is currently pointing to, and then updates the internal pointer to point to the next element.

      Returns:
      the next collation element or NULLORDER if the end of the iteration has been reached.
    • previous

      public int previous()
      Get the previous collation element in the source string.

      This iterator iterates over a sequence of collation elements that were built from the string. Because there isn't necessarily a one-to-one mapping from characters to collation elements, this doesn't mean the same thing as "return the collation element [or ordering priority] of the previous character in the string".

      This function updates the iterator's internal pointer to point to the collation element preceding the one it's currently pointing to and then returns that element, while next() returns the current element and then updates the pointer.

      Returns:
      the previous collation element, or NULLORDER when the start of the iteration has been reached.
    • reset

      public void reset()
      Resets the cursor to the beginning of the string. The next call to next() or previous() will return the first and last collation element in the string, respectively.

      If the RuleBasedCollator used by this iterator has had its attributes changed, calling reset() will reinitialize the iterator to use the new attributes.

    • setOffset

      public void setOffset(int newOffset)
      Sets the iterator to point to the collation element corresponding to the character at the specified offset. The value returned by the next call to next() will be the collation element corresponding to the characters at newOffset.

      If newOffset is in the middle of a contracting character sequence, the iterator is adjusted to the start of the contracting sequence. This means that getOffset() is not guaranteed to return the same value set by this method.

      If the decomposition mode is on, and newOffset is in the middle of a decomposable range of source text, the iterator may not return a correct result for the next forwards or backwards iteration. The user must ensure that the offset is not in the middle of a decomposable range.

      Parameters:
      newOffset - the character offset into the original source string to set. Note that this is not an offset into the corresponding sequence of collation elements.
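
The adjustment to the start of a contracting sequence can be pictured with a toy model. The sketch below is an assumption-laden illustration: ContractionOffsets and its explicit list of contraction strings are hypothetical, and the real iterator derives this information from the collator's data rather than from such a list.

```java
import java.util.List;

// Toy model, not ICU code: moves an offset that falls strictly inside a
// known contraction back to the first character of that contraction.
final class ContractionOffsets {
    static int snapToContractionStart(String source, int offset, List<String> contractions) {
        for (String c : contractions) {
            // Try every start position that would put `offset` strictly inside c.
            for (int start = Math.max(0, offset - c.length() + 1); start < offset; start++) {
                if (source.startsWith(c, start)) {
                    return start;
                }
            }
        }
        // Not inside any contraction: the offset is kept as is.
        return offset;
    }

    public static void main(String[] args) {
        List<String> contractions = List.of("ch");
        // Offset 1 is in the middle of "ch", so it snaps back: prints 0
        System.out.println(snapToContractionStart("cha", 1, contractions));
        // Offset 2 points at 'a', outside the contraction: prints 2
        System.out.println(snapToContractionStart("cha", 2, contractions));
    }
}
```

In this model, the value 0 returned for offset 1 corresponds to getOffset() reporting a different position than the one passed to setOffset().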
    • setText

      public void setText(String source)
      Set a new source string for iteration, and reset the offset to the beginning of the text.
      Parameters:
      source - the new source string for iteration.
    • setText

      public void setText(UCharacterIterator source)
      Set a new source string iterator for iteration, and reset the offset to the beginning of the text.

      The source iterator's integrity will be preserved since a new copy will be created for use.

      Parameters:
      source - the new source string iterator for iteration.
    • setText

      public void setText(CharacterIterator source)
      Set a new source string iterator for iteration, and reset the offset to the beginning of the text.
      Parameters:
      source - the new source string iterator for iteration.
    • computeMaxExpansions

      static final Map<Integer,Integer> computeMaxExpansions(CollationData data)
    • getMaxExpansion

      public int getMaxExpansion(int ce)
      Returns the maximum length of any expansion sequence that ends with the specified collation element. If there is no expansion with this collation element as the last element, returns 1.
      Parameters:
      ce - a collation element returned by previous() or next().
      Returns:
      the maximum length of any expansion sequence ending with the specified collation element.
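
The documented contract (a stored maximum if some expansion ends with the element, otherwise 1) can be sketched as a map lookup with a default. This is a simplified model: MaxExpansionLookup is a hypothetical name, the map keyed by the last collation element of each expansion is an assumption, and the real implementation may handle additional cases.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the documented contract, not ICU code.
final class MaxExpansionLookup {
    /**
     * Returns the recorded maximum expansion length for sequences ending in
     * lastCe, or 1 if no expansion ends with that element.
     */
    static int maxExpansion(Map<Integer, Integer> maxExpansions, int lastCe) {
        Integer max = maxExpansions.get(lastCe);
        return max == null ? 1 : max;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> table = new HashMap<>();
        table.put(0x100, 3); // made-up entry: some expansion of length 3 ends with 0x100
        System.out.println(maxExpansion(table, 0x100)); // prints 3
        System.out.println(maxExpansion(table, 0x200)); // prints 1
    }
}
```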
    • getMaxExpansion

      static int getMaxExpansion(Map<Integer,Integer> maxExpansions, int order)
    • normalizeDir

      private byte normalizeDir()
      Normalizes dir_=1 (just after setOffset()) to dir_=0 (just after reset()).
    • equals

      public boolean equals(Object that)
      Tests whether the argument object is equal to this CollationElementIterator. Iterators are equal if they use the same RuleBasedCollator and the same source text, and have the same current position in the iteration.
      Overrides:
      equals in class Object
      Parameters:
      that - the object to test for equality with this CollationElementIterator
    • hashCode

      public int hashCode()
      Mock implementation of hashCode(). This implementation always returns a constant value. When Java assertions are enabled, this method triggers an assertion failure.
      Overrides:
      hashCode in class Object
    • getRuleBasedCollator

      @Deprecated public RuleBasedCollator getRuleBasedCollator()
      Deprecated.
      This API is ICU internal only.