Ferret is a port of Lucene to Ruby, and it looks like a very good one. What I like:
- it handles wildcards on both sides of a string (Lucene began to allow a leading * not long ago, maybe in v2.0),
- it has a good highlighting facility out of the box, and it was very easy to plug into a Rails application together with the (also great) acts_as_ferret plugin: index directory configuration, population and updating all turned out to be automatic, much simpler than using Lucene in Java,
- acts_as_ferret makes it simple to use the index in a cluster environment with ferret_server: just start it and it works. Well, almost: I had to patch vendor/plugins/acts_as_ferret/lib/server_manager.rb:
    @@ -35,8 +35,8 @@
     begin
       ENV['FERRET_USE_LOCAL_INDEX'] = 'true'
       ENV['RAILS_ENV'] = $ferret_server_options['environment']
    -  #require(File.join(File.dirname(__FILE__), '../../../../config/environment'))
    -  require(File.join(File.dirname(ENV['_']), '../config/environment'))
    +  require(File.join(File.dirname(__FILE__), '../../../../config/environment'))
    +  #require(File.join(File.dirname(ENV['_']), '../config/environment'))
       require 'acts_as_ferret'
       ActsAsFerret::Remote::Server.new.send($ferret_server_action)
     rescue Exception => e
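The patch switches the require from a path based on ENV['_'] to one relative to the plugin file itself. As far as I understand, ENV['_'] is set by some Unix shells to the path of the command being run, so it is not available everywhere, while a path built from __FILE__ depends only on where the plugin is installed. A minimal sketch of how the __FILE__-based expression resolves, assuming a hypothetical application rooted at /app (the path and the helper name are mine, not part of acts_as_ferret):

```ruby
# Hypothetical location of the patched file inside a Rails 2 application.
PLUGIN_FILE = '/app/vendor/plugins/acts_as_ferret/lib/server_manager.rb'

# Same expression as the patched require, with __FILE__ replaced by the sample path.
# Four "../" steps climb from lib/ up to the application root.
def environment_path(plugin_file)
  File.expand_path(File.join(File.dirname(plugin_file), '../../../../config/environment'))
end

puts environment_path(PLUGIN_FILE)
```

On a Unix-like system this resolves to config/environment directly under /app, which is the file Rails expects to load.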
What I didn’t like is that, after I thought search was working, it turned out to be broken for words with non-Latin characters, both on Windows, where I develop, and on Linux, where I deploy. For example, if the text contains ‘Antônio’ and I search for ‘Antônio’, I find nothing; and if I search for ‘ant’, I get something like ‘Ant??nio’ in the results (if highlighting is used). A more precise example:
    require 'rubygems'
    require 'ferret'

    text = "Antônio"
    include Ferret::Analysis
    tokenizer = StandardAnalyzer.new.token_stream(:field, text)
    while token = tokenizer.next
      puts token
    end
The output was:
token["antÃ":0:4:1]
token["nio":5:8:1]
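The offsets in this output look like byte positions rather than character positions. Here is a small sketch, assuming the text is stored as UTF-8 (the constant and helper names are mine), that lines the reported ranges 0:4 and 5:8 up against the raw bytes of the word:

```ruby
# "Antônio" written with explicit UTF-8 bytes so the example does not
# depend on the encoding of the source file.
TEXT = "Ant\xC3\xB4nio"
BYTES = TEXT.unpack('C*')

# Render a slice of the byte array as hex for display.
def hex(bytes)
  bytes.map { |b| format('%02X', b) }.join(' ')
end

puts "total bytes: #{BYTES.length}"
puts "token 1 (0...4): #{hex(BYTES[0...4])}"
puts "skipped byte 4:  #{hex(BYTES[4...5])}"
puts "token 2 (5...8): #{hex(BYTES[5...8])}"
```

The letter ô takes two bytes in UTF-8, C3 and B4. The first token ends with C3 and the second starts right after B4, which suggests the tokenizer read the text one byte at a time and treated B4 as a separator.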
I have some experience with Lucene and know that it handles Unicode without problems; Rails is also OK with Unicode. The whole situation looked stupid (Ruby not being very Unicode-friendly also looks rather stupid, but I guess its Japanese authors had their reasons, and I hope 1.9 will be much better), so I decided to dig into the problem. Surprisingly, I didn’t find much information on it; it seems most people don’t run into it. OK, for those as unlucky as me, here’s the solution I found.
Basically, Ferret handles Unicode well out of the box if the operating system is not Windows and the default locale is Unicode-friendly. Neither of my environments satisfies these requirements, so I ran into poorly documented trouble. To resolve it we have to say some magic words to Ferret:
    require 'rubygems'
    require 'ferret'

    if PLATFORM =~ /win32/
      # strange hack for Windows from Ferret's author to make it unicode-friendly.
      # http://ferret.davebalmain.com/trac/ticket/326#comment:3
      Ferret.locale = ''
    else
      # tell Ferret to be unicode-friendly. This unfortunately doesn't work on Windows.
      Ferret.locale = "en_US.UTF-8"
    end
    puts "failed to set locale" if Ferret.locale.nil?

    text = "Antônio"
    include Ferret::Analysis
    tokenizer = StandardAnalyzer.new.token_stream(:field, text)
    while token = tokenizer.next
      puts token
    end
Output on Linux:
token["antônio":0:8:1]
Output on Windows:
token["antгґnio":0:8:1]
Well, on Windows the result wasn’t as good, but at least it’s one word, so searching for ‘Antônio’ works, and in Rails highlighting no longer corrupts characters. That’s it!
Versions I used for this article:
- Ferret 0.11.5
- acts_as_ferret 0.4.3 (wow, both start with zeroes)
- rails 2.0.1
- ruby 1.8.6
Update:
Well, this solution doesn’t quite work on Windows. The test is broken (it gives 2 tokens instead of 1) when I set the Windows locale (‘Standards and formats’) to English, but when it’s Russian it works fine. OK, at least it is a stable solution for Linux.
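My guess at why the Windows locale matters: with Ferret.locale = '' the tokenizer follows the system code page and still reads the two UTF-8 bytes of ô as two separate single-byte characters. The sketch below shows how those bytes look in the English (Windows-1252) and Russian (Windows-1251) code pages. It uses String#force_encoding and String#encode, so it needs Ruby 1.9 or later, not the 1.8.6 used in this article. The helper names are mine, and the "all letters" check is only a stand-in for whatever test Ferret really applies.

```ruby
# The two UTF-8 bytes that encode "ô".
O_CIRCUMFLEX_BYTES = [0xC3, 0xB4]

# Reinterpret raw bytes as text in a single-byte code page,
# then convert the result to UTF-8 so it can be displayed and compared.
def misread_as(bytes, code_page)
  bytes.pack('C*').force_encoding(code_page).encode('UTF-8')
end

# True when every character is a letter, i.e. a letter-based
# tokenizer would have no reason to split the word at this point.
def all_letters?(text)
  !!(text =~ /\A\p{L}+\z/)
end

%w[Windows-1252 Windows-1251].each do |code_page|
  seen = misread_as(O_CIRCUMFLEX_BYTES, code_page)
  puts "#{code_page}: #{seen.inspect} all letters? #{all_letters?(seen)}"
end
```

In Windows-1251 both bytes map to Cyrillic letters (Г and ґ), which matches the ‘antгґnio’ token I got with the Russian locale. In Windows-1252 the second byte maps to an acute accent sign, which is not a letter, so the word would be cut in two there, just as it was in the original broken output.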
Update:
ferret_server doesn’t work on Windows, so I cannot use it in my development environment :(
The Ruby world is pushing me further and further away from Windows :)
Great solution :) Really solves cyrillic issues I had.
Greets
Thanks Jordan!
It’s funny that you had issues with Cyrillic and I had with Portuguese - it’s me who’s Russian here )