Ferret is a Lucene port to Ruby. Looks like a very good port. What I like:

  • it handles wildcards on both sides of a string (Lucene began to allow a leading * only recently, maybe with v2.0),
  • it has a good highlighting facility out of the box, and it was very easy to plug into a Rails application together with the (also great) acts_as_ferret plugin: index directory configuration, population and updating all turned out to be automatic, much simpler than using Lucene in Java,
  • acts_as_ferret makes it simple to use the index in a clustered environment with ferret_server: just start it and it works. Well, almost: I had to patch vendor/plugins/acts_as_ferret/lib/server_manager.rb:
@@ -35,8 +35,8 @@
begin
ENV['FERRET_USE_LOCAL_INDEX'] = 'true'
ENV['RAILS_ENV'] = $ferret_server_options['environment']
-  #require(File.join(File.dirname(__FILE__), '../../../../config/environment'))
-  require(File.join(File.dirname(ENV['_']), '../config/environment'))
+  require(File.join(File.dirname(__FILE__), '../../../../config/environment'))
+  #require(File.join(File.dirname(ENV['_']), '../config/environment'))
require 'acts_as_ferret'
ActsAsFerret::Remote::Server.new.send($ferret_server_action)
rescue Exception => e
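
Why the patch helps can be sketched in plain Ruby. As far as I understand, ENV['_'] is filled in by the shell with the path of the command that was launched, so it can point at the ruby binary rather than at a script inside the application, and it may not be set at all outside such shells. A path anchored on __FILE__ does not depend on how the server was started. The paths below are hypothetical and only illustrate the arithmetic:

```ruby
# Hypothetical install location, used only for illustration.
plugin_file = '/srv/app/vendor/plugins/acts_as_ferret/lib/server_manager.rb'

# Anchored on the plugin file itself: four levels up is the application root.
from_file = File.expand_path(
  File.join(File.dirname(plugin_file), '../../../../config/environment'))

# Anchored on the launching command (what a shell might export as ENV['_']).
# Assumption: the server was started as "ruby script/ferret_server".
launcher = '/usr/bin/ruby'
from_launcher = File.expand_path(
  File.join(File.dirname(launcher), '../config/environment'))

puts from_file      # /srv/app/config/environment
puts from_launcher  # /usr/config/environment
```

The first path lands in the application's config directory wherever the process was started from; the second only works when the launching command happens to live in a sibling directory of config.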

What I didn’t like is that, just when I thought search was working, it turned out to be broken for words with non-Latin characters, both on Windows, where I develop, and on Linux, where I deploy. For example, if the text contains ‘Antônio’ and I search for ‘Antônio’, I find nothing; and if I search for ‘ant’, I get something like ‘Ant??nio’ in the results (if highlighting is used). A stricter example:


require 'rubygems'
require 'ferret'

text = "Antônio"
include Ferret::Analysis
tokenizer = StandardAnalyzer.new.token_stream(:field, text)
while token = tokenizer.next
  puts token
end

The output was:

token["antÃ":0:4:1]
token["nio":5:8:1]
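
A likely explanation for this split, offered as a guess rather than a description of Ferret's internals: in UTF-8 the letter ô is two bytes, C3 and B4. Read as single-byte Latin-1 characters, C3 is the letter Ã while B4 is an acute accent, which is not a letter, so a letter-based tokenizer breaks the word there. The offsets 0:4 and 5:8 above are consistent with byte positions. The sketch below models that misreading in plain Ruby (it needs Ruby 1.9 or later for force_encoding, so it will not run on the 1.8.6 used in this article):

```ruby
# Illustration only: this is not Ferret's code, just a model of a tokenizer
# that reads UTF-8 text as if it were a single-byte (Latin-1) encoding.
text = "Ant\u00F4nio" # "Antônio"

# Show the raw UTF-8 bytes of the string.
hex = text.bytes.map { |b| format('%02X', b) }.join(' ')
puts hex # 41 6E 74 C3 B4 6E 69 6F

# Reinterpret the same bytes as ISO-8859-1, then split on runs of letters.
misread = text.dup.force_encoding('ISO-8859-1').encode('UTF-8')
tokens = misread.scan(/\p{L}+/)
p tokens
```

The scan yields two tokens, one ending in Ã and the other being "nio", which mirrors the broken output above.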

I have some experience with Lucene and know that it handles Unicode without problems; Rails is also fine with Unicode. The whole situation looked stupid (Ruby not being very Unicode-friendly also looks rather stupid, but I guess the Japanese had their reasons for this, and I hope 1.9 will be much better), so I decided to dig into the problem. Surprisingly, I didn’t find much information on it; it seems that few people run into it. OK, for those as unlucky as me, here’s the solution I found.

Basically, Ferret handles Unicode well out of the box if the operating system is not Windows and the default locale is Unicode-friendly. Neither of my environments satisfies these requirements, so I ran into poorly documented trouble. To resolve it, we have to tell Ferret some magic words:


require 'rubygems'
require 'ferret'

if RUBY_PLATFORM =~ /win32|mingw/
  # Strange hack for Windows from Ferret's author to make it Unicode-friendly.
  # http://ferret.davebalmain.com/trac/ticket/326#comment:3
  Ferret.locale = ''
else
  # Tell Ferret to be Unicode-friendly. Unfortunately this doesn't work on Windows.
  Ferret.locale = "en_US.UTF-8"
end
puts "failed to set locale" if Ferret.locale.nil?

text = "Antônio"
include Ferret::Analysis
tokenizer = StandardAnalyzer.new.token_stream(:field, text)
while token = tokenizer.next
  puts token
end

Output on Linux:

token["antônio":0:8:1]

Output on Windows:

token["antгґnio":0:8:1]

Well, on Windows that wasn’t so good, but at least it’s one word, so searching for ‘Antônio’ works, and in Rails the highlighter no longer corrupts characters. That’s it!
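
For a Rails application the same setting has to run once at boot. One possible place, sketched here as an assumption rather than as the way acts_as_ferret requires it, is a file under config/initializers. The file name and the helper ferret_locale_for are hypothetical; the locale values are the ones from the script above:

```ruby
# config/initializers/ferret_locale.rb (hypothetical file name)

begin
  require 'ferret'
rescue LoadError
  # The ferret gem is not installed here; there is nothing to configure.
end

# Hypothetical helper: pick the locale string for a given platform string.
def ferret_locale_for(platform)
  platform =~ /win32|mingw/ ? '' : 'en_US.UTF-8'
end

# Only touch Ferret when it was actually loaded.
if defined?(Ferret)
  Ferret.locale = ferret_locale_for(RUBY_PLATFORM)
  warn 'failed to set Ferret locale' if Ferret.locale.nil?
end
```

Keeping the platform check in a small helper makes it easy to try both branches from a console without switching machines.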

Versions I used for this article:

  • Ferret 0.11.5
  • acts_as_ferret 0.4.3 (wow, both start with zeroes)
  • rails 2.0.1
  • ruby 1.8.6

Update:

Well, this solution doesn’t quite work on Windows. The test is broken (it gives 2 tokens instead of 1) when I set the Windows locale (‘Standards and formats’) to English, but when it’s Russian it works fine. OK, at least it is a stable solution for Linux.

Update:

ferret_server doesn’t work on Windows, so I cannot use it in my development environment :(

Ruby world pushes me more and more out of Windows :)



5 Responses to “Adventures with Ferret and unicode strings”  

  1. Great solution :) Really solves cyrillic issues I had.
    Greets

  2. thirstydoh

    Thanks Jordan!

    It’s funny that you had issues with Cyrillic and I had with Portuguese – it’s me who’s Russian here )

  3. Virgo

    Somehow i missed the point. Probably lost in translation :) Anyway … nice blog to visit.

    cheers, Virgo.

  4. Great article! It really solved the Cyrillic problem until… I upgraded to Rails 2.2.
    Now the line
    Ferret.locale = ''
    throws a “sticking” error ‘unexpected $end expecting kEnd’ upon a first call and keeps throwing it on every single page (that have nothing to do with Ferret) until server is restarted.

    Has anyone dealt with this already?

  5. Fabulous! This sorted out a weird tokenization problem I was having.

