Ferret is a Ruby port of Lucene, and it looks like a very good one. What I like:

  • it handles wildcards on both sides of a string (Lucene started allowing a leading * only recently, maybe in v2.0),
  • it has a good highlighting facility out of the box, and it was very easy to plug into a Rails application together with the (also great) acts_as_ferret plugin: index directory configuration, population and updating all turned out to be automatic, much simpler than using Lucene in Java,
  • acts_as_ferret makes it simple to use the index in a clustered environment with ferret_server: just start it and it works. Well, almost: I had to patch vendor/plugins/acts_as_ferret/lib/server_manager.rb:

@@ -35,8 +35,8 @@
begin
ENV['FERRET_USE_LOCAL_INDEX'] = 'true'
ENV['RAILS_ENV'] = $ferret_server_options['environment']
-  #require(File.join(File.dirname(__FILE__), '../../../../config/environment'))
-  require(File.join(File.dirname(ENV['_']), '../config/environment'))
+  require(File.join(File.dirname(__FILE__), '../../../../config/environment'))
+  #require(File.join(File.dirname(ENV['_']), '../config/environment'))
require 'acts_as_ferret'
ActsAsFerret::Remote::Server.new.send($ferret_server_action)
rescue Exception => e
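
The post does not say why this patch was needed. One plausible reading is that ENV['_'] is filled in by the shell with the command that was launched, so the path derived from it depends on how ferret_server was started, while __FILE__ always points at the plugin file itself. A minimal sketch of the two resolutions, using pure path arithmetic; the /srv/app root and both helper names are hypothetical:

```ruby
# Hypothetical helpers that mirror the two require lines from the patch.
# They only compute paths and do not load anything.

# Variant based on __FILE__: four levels up from the plugin's lib directory.
def env_path_via_file(source_file)
  File.expand_path(File.join(File.dirname(source_file), "../../../../config/environment"))
end

# Variant based on ENV['_']: one level up from whatever command the shell launched.
def env_path_via_launcher(launcher, cwd)
  File.expand_path(File.join(File.dirname(launcher), "../config/environment"), cwd)
end

plugin_file = "/srv/app/vendor/plugins/acts_as_ferret/lib/server_manager.rb"

puts env_path_via_file(plugin_file)                             # /srv/app/config/environment
puts env_path_via_launcher("script/ferret_server", "/srv/app")  # /srv/app/config/environment
puts env_path_via_launcher("/usr/bin/ruby", "/srv/app")         # /usr/config/environment
```

The last line shows how the ENV['_'] variant could miss the application when the script is started through the ruby executable, assuming the shell sets that variable at all.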

What I didn’t like: just when I thought search was working, it turned out to be broken for words with non-Latin characters, both on Windows, where I develop, and on Linux, where I deploy. For example, if the text contains ‘Antônio’ and I search for ‘Antônio’, I find nothing; and if I search for ‘ant’, I get something like ‘Ant??nio’ in the results (when highlighting is used). A more precise example:


require 'rubygems'
require 'ferret'

text = "Antônio"
include Ferret::Analysis
tokenizer = StandardAnalyzer.new.token_stream(:field, text)
while token = tokenizer.next
  puts token
end

The output was:

token["antÃ":0:4:1]
token["nio":5:8:1]
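
A likely explanation for this output, offered as an assumption rather than something taken from Ferret's source: the tokenizer reads the UTF-8 string one byte at a time under a single-byte locale. The two bytes of ‘ô’ (0xC3 0xB4) then look like a letter followed by a non-letter, which splits the word. The sketch below reproduces the effect in plain Ruby 1.9+ (the article itself ran on 1.8.6, which has no string encodings), with ISO-8859-1 standing in for that locale:

```ruby
# Illustration only: this does not call Ferret.
text = "Antônio"

# In UTF-8 the letter "ô" takes two bytes: C3 B4.
utf8_bytes = text.bytes.map { |b| format("%02X", b) }

# Reinterpret the same bytes as single-byte ISO-8859-1 characters.
misread = text.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# Collect runs of letters with their start and end offsets.
# Character offsets in the misread string equal byte offsets in the original.
misread_tokens = []
misread.scan(/\p{L}+/) { misread_tokens << [$~[0], $~.begin(0), $~.end(0)] }

# Read as real UTF-8, the word stays in one piece.
utf8_tokens = text.scan(/\p{L}+/)

puts utf8_bytes.join(" ")
p misread_tokens
p utf8_tokens
```

The offsets of the two misread tokens come out as 0 to 4 and 5 to 8, the same as in the output above; byte 4 is the one that is not treated as a letter.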

I have some experience with Lucene and know that it handles Unicode without problems; Rails is also fine with Unicode. The whole situation looked stupid (Ruby not being very Unicode-friendly also looks rather stupid, but I guess the Japanese had their reasons for this, and I hope 1.9 will be much better), so I decided to dig into the problem. Surprisingly, I didn’t find much information on it; it seems that most people don’t run into it. OK, for those as unlucky as me, here’s the solution I found.

Basically, Ferret handles Unicode well out of the box if the operating system is not Windows and the default locale is Unicode-friendly. Neither of my environments satisfies these requirements, so I got into poorly documented trouble. To resolve it, we have to tell Ferret some magic words:


require 'rubygems'
require 'ferret'

# RUBY_PLATFORM replaces the deprecated PLATFORM constant (removed in Ruby 1.9).
if RUBY_PLATFORM =~ /mswin|mingw/
  # Strange hack for Windows from Ferret's author to make it Unicode-friendly:
  # http://ferret.davebalmain.com/trac/ticket/326#comment:3
  Ferret.locale = ''
else
  # Tell Ferret to be Unicode-friendly. Unfortunately this doesn't work on Windows.
  Ferret.locale = "en_US.UTF-8"
end
puts "failed to set locale" if Ferret.locale.nil?

text = "Antônio"
include Ferret::Analysis
tokenizer = StandardAnalyzer.new.token_stream(:field, text)
while token = tokenizer.next
  puts token
end

Output on Linux:

token["antônio":0:8:1]

Output on Windows:

token["antгґnio":0:8:1]

Well, on Windows the result wasn’t that good, but at least it’s a single token, so searching for ‘Antônio’ works, and highlighting in Rails no longer corrupts characters. That’s it!
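
For reuse outside the test script, the platform switch can be pulled into a small helper. This is only a sketch: ferret_locale_for is a hypothetical name, not part of Ferret or acts_as_ferret, and the only Ferret call it relies on is the Ferret.locale accessor used above.

```ruby
# Hypothetical helper: picks the value to assign to Ferret.locale.
# Matching on mswin/mingw rather than a bare "win" avoids treating "darwin" as Windows.
def ferret_locale_for(platform = RUBY_PLATFORM)
  platform =~ /mswin|mingw/ ? '' : 'en_US.UTF-8'
end

# Apply the setting only when Ferret has actually been loaded.
if defined?(Ferret)
  Ferret.locale = ferret_locale_for
  warn "failed to set locale" if Ferret.locale.nil?
end
```

In a Rails 2.x application this could live in a file under config/initializers; that placement is an assumption, since the post does not say where the snippet was put.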

Versions I used for this article:

  • Ferret 0.11.5
  • acts_as_ferret 0.4.3 (wow, both start with zeroes)
  • Rails 2.0.1
  • Ruby 1.8.6

Update:

Well, this solution doesn’t quite work on Windows. The test fails (it gives 2 tokens instead of 1) when I set the Windows locale (‘Standards and formats’) to English, but with Russian it works fine. OK, at least it is a stable solution for Linux.

Update:

ferret_server doesn’t work on Windows, so I cannot use it in my development environment :(

Ruby world pushes me more and more out of Windows :)


2 Responses to “Adventures with Ferret and unicode strings”  

  1. Jordan Dichev

    Great solution :) Really solves cyrillic issues I had.
    Greets

  2. thirstydoh

    Thanks Jordan!

    It’s funny that you had issues with Cyrillic and I had with Portuguese - it’s me who’s Russian here )
