Getting the most out of Solr's copyField and dynamicField

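A long time ago I worked at a systems integrator (SIer), building business systems for companies.
Name-matching search (nayose) was one of those features that always came up: for example, find everyone whose family name or given name contains the character "海".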
　
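A typical use case, say in a company's sales department:
　A: What was the name of the person in charge of xx at xx Corp. again?
　B: 春野 (Haruno)-san, or something-or-other 春美 (Harumi)-san, something like that.
　A: Got it, I'll look it up. (types "春" into the search box and clicks)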
　
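Back then I wasn't familiar with open-source search engines, so I made do with LIKE searches on an RDBMS.
As you can guess, things got ugly as the number of records grew.
And even when I wanted to tune performance, I often couldn't change the database schema much.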
　
So in this post I'd like to introduce the Solr features that would have been so handy back then.
I'll work through it with the recently revised edition of Apache Solr入門 (below) in hand.

[改訂新版] Apache Solr入門 ~オープンソース全文検索エンジン (Software Design plus)

posted with amazlet at 13.12.22

大谷純, 阿部慎一朗, 大須賀稔, 北野太郎, 鈴木教嗣, 平賀一昭
技術評論社

　
　
■ Designing the Solr schema
　
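First, let's think about the schema definition.
For name-matching search, the fields that are likely to come up are:
・Family name
・Given name
・Sex
・Age
・Address
・Tags
or something along those lines.
When I was at the SIer, "tags" weren't really a common thing yet,
so we used a free-format text area instead. Today I figure it would be nice to just
slap on a tag like #未払い (unpaid) and be done with it.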
　
First, let's set up a sample Solr server.

$ mkdir nayose # create a directory for the name-matching (nayose) sample
$ cd nayose
$ cp ~/Download/solr-4.6.0.tgz . # place the binary downloaded from the Solr site
$ ls
solr-4.6.0.tgz
$ tar xvf solr-4.6.0.tgz # extract the archive
$ cd solr-4.6.0
$ ls
CHANGES.txt		NOTICE.txt		SYSTEM_REQUIREMENTS.txt	dist			example
LICENSE.txt		README.txt		contrib			docs			licenses
$ cd example # move into the sample directory for now
$ java -jar start.jar # try starting the server
0    [main] INFO  org.eclipse.jetty.server.Server  – jetty-8.1.10.v20130312

Accessing it from a browser confirmed that the server had started.

　
Now let's define the schema. To save effort, this time I'll just rewrite the sample definition. (lol)
　
In the example/solr directory there is a likely-looking file called solr.xml.
If you look inside you'll see it holds infrastructure-type settings, so I'll skip it this time.
What matters is collection1.

$ cd [nayose directory]/solr-4.6.0/example/solr
$ ls
README.txt	bin		collection1	solr.xml	zoo.cfg

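Inside collection1 it looks like the listing below.
＃I plan to write a separate post about this, but Solr has a multicore feature
＃that lets a single process (one memory space) host multiple cores.
As you might guess, the schema-related settings live in conf,
and the actual index data is stored in data.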

$ cd collection1/
$ ls
README.txt	conf		core.properties	data

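Now let's go into conf.
There's a lot in there, but the schema is defined in schema.xml.
Files for techniques that improve search quality also live here: synonyms, character mappings (such as turning ① into 1),
stopwords (words that carry little meaning, such as "は" and "の"), and so on.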

$ cd conf
$ ls
admin-extra.html		clustering			lang				protwords.txt			solrconfig.xml			synonyms.txt			xslt
admin-extra.menu-bottom.html	currency.xml			mapping-FoldToASCII.txt		schema.xml			spellings.txt			update-script.js
admin-extra.menu-top.html	elevate.xml			mapping-ISOLatin1Accent.txt	scripts.conf			stopwords.txt			velocity

　
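What we're after this time is the schema definition, so let's look at schema.xml.
The beginning is a big wall of comments, so I'll skip over it.
＃Come to think of it, translating those comments into Japanese might be worthwhile in itself.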

//Field definitions start at this tag
 <fields>

//These are Solr-internal fields
   <!-- If you remove this field, you must _also_ disable the update log in solrconfig.xml
      or Solr won't start. _version_ and update log are required for SolrCloud
   -->
   <field name="_version_" type="long" indexed="true" stored="true"/>

   <!-- points to the root document of a block of nested documents. Required for nested
      document support, may be removed otherwise
   -->
   <field name="_root_" type="string" indexed="true" stored="false"/>

//id is the unique key, so leave it alone.
   <!-- Only remove the "id" field if you have a very good reason to. While not strictly
     required, it is highly recommended. A <uniqueKey> is present in almost all Solr
     installations. See the <uniqueKey> declaration below where <uniqueKey> is set to "id".
   -->
   <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

//And now the fields that look like real data.
//SKU is an identifier along the lines of a JAN code; it's followed by category, in-stock flag, and the like.
   <field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
   <field name="name" type="text_general" indexed="true" stored="true"/>
   <field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
   <field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />

   <field name="weight" type="float" indexed="true" stored="true"/>
   <field name="price"  type="float" indexed="true" stored="true"/>
   <field name="popularity" type="int" indexed="true" stored="true" />
   <field name="inStock" type="boolean" indexed="true" stored="true" />

   <field name="store" type="location" indexed="true" stored="true"/>

//Title, description, author, URL, and so on
   <!-- Common metadata fields, named specifically to match up with
     SolrCell metadata when parsing rich documents such as Word, PDF.
     Some fields are multiValued only because Tika currently may return
     multiple values for them. Some metadata is parsed from the documents,
     but there are some which come from the client context:
       "content_type": From the HTTP headers of incoming stream
       "resourcename": From SolrCell request param resource.name
   -->
   <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="subject" type="text_general" indexed="true" stored="true"/>
   <field name="description" type="text_general" indexed="true" stored="true"/>
   <field name="comments" type="text_general" indexed="true" stored="true"/>
   <field name="author" type="text_general" indexed="true" stored="true"/>
   <field name="keywords" type="text_general" indexed="true" stored="true"/>
   <field name="category" type="text_general" indexed="true" stored="true"/>
   <field name="resourcename" type="text_general" indexed="true" stored="true"/>
   <field name="url" type="text_general" indexed="true" stored="true"/>
   <field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="last_modified" type="date" indexed="true" stored="true"/>
   <field name="links" type="string" indexed="true" stored="true" multiValued="true"/>

//Next come the dynamicFields. (Explained in detail later; they are very handy during development.)
   <!-- Dynamic field definitions allow using convention over configuration
       for fields via the specification of patterns to match field names.
       EXAMPLE:  name="*_i" will match any field ending in _i (like myid_i, z_i)
       RESTRICTION: the glob-like pattern in the name attribute must have
       a "*" only at the start or the end.  -->

   <dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
   <dynamicField name="*_is" type="int"    indexed="true"  stored="true"  multiValued="true"/>
   <dynamicField name="*_s"  type="string"  indexed="true"  stored="true" />
   <dynamicField name="*_ss" type="string"  indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="*_l"  type="long"   indexed="true"  stored="true"/>
   <dynamicField name="*_ls" type="long"   indexed="true"  stored="true"  multiValued="true"/>

//Definition of the unique key
 <!-- Field to use to determine and enforce document uniqueness.
      Unless this field is marked with required="false", it will be a required field
   -->
 <uniqueKey>id</uniqueKey>

//This is the default search field definition I used a lot in the old days, but apparently it's DEPRECATED.
 <!-- DEPRECATED: The defaultSearchField is consulted by various query parsers when
  parsing a query string that isn't explicit about the field.  Machine (non-user)
  generated queries are best made explicit, or they can use the "df" request parameter
  which takes precedence over this.
  Note: Un-commenting defaultSearchField will be insufficient if your request handler
  in solrconfig.xml defines "df", which takes precedence. That would need to be removed.
 <defaultSearchField>text</defaultSearchField> -->

//Likewise the AND-or-OR default operator I used a lot in the old days. The message here: specify it on the query side.
 <!-- DEPRECATED: The defaultOperator (AND|OR) is consulted by various query parsers
  when parsing a query string to determine if a clause of the query should be marked as
  required or optional, assuming the clause isn't already marked by some operator.
  The default is OR, which is generally assumed so it is not a good idea to change it
  globally here.  The "q.op" request parameter takes precedence over this.
 <solrQueryParser defaultOperator="OR"/> -->

//And finally copyField. Various fields are defined, but at search time they're all lumped together into a field called "text".
   <!-- copyField commands copy one field to another at the time a document
         is added to the index.  It's used either to index the same field differently,
         or to add multiple fields to the same field for easier/faster searching.  -->

   <copyField source="cat" dest="text"/>
   <copyField source="name" dest="text"/>
   <copyField source="manu" dest="text"/>
   <copyField source="features" dest="text"/>
   <copyField source="includes" dest="text"/>
   <copyField source="manu" dest="manu_exact"/>

   <!-- Copy the price into a currency enabled field (default USD) -->
   <copyField source="price" dest="price_c"/>

　
Now let's tidy up the above and put in our own definitions.

 <fields>

   <!-- Left untouched -->
   <field name="_version_" type="long" indexed="true" stored="true"/>
   <field name="_root_" type="string" indexed="true" stored="false"/>
   <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

   <!-- Names (text_ja is explained later), sex, age (the "t" stands for Trie, which makes range searches fast),
        address (multiValued, since there may be an address 1, address 2, and so on),
        tags (multiValued, since several can be attached; plain string) -->
   <field name="first_name" type="text_ja" indexed="true" stored="true"/>
   <field name="family_name" type="text_ja" indexed="true" stored="true"/>
   <field name="sex" type="boolean" indexed="true" stored="true"/>
   <field name="age" type="tint" indexed="true" stored="true"/>
   <field name="address" type="text_ja" indexed="true" stored="true" multiValued="true"/>
   <field name="tag" type="string" indexed="true" stored="true" multiValued="true"/>

   <!-- Updated and created dates. With tdate these are suited to range searches as well -->
   <field name="updated" type="tdate" indexed="true" stored="true"/>
   <field name="created" type="tdate" indexed="true" stored="true"/>

   <!-- For searches like "either first_name or family_name contains the character 海" -->
   <field name="combined_name" type="text_ja" indexed="true" stored="false" multiValued="true"/>

   <!-- Requests to add yet another free-text field come up all the time in operations:
        an entry box for department xx, a comment box for manager xx, and so on.
        Japanese text fields for those are provided dynamically -->
   <dynamicField name="*_txt" type="text_ja" indexed="true"  stored="true" multiValued="true"/>
 </fields>
 <!-- (omitted) -->
   <!-- Copy the first_name and family_name fields into combined_name above -->
   <copyField source="first_name" dest="combined_name"/>
   <copyField source="family_name" dest="combined_name"/>
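
As a rough mental model of what those two copyField lines do at index time, here is a small stand-alone Java sketch. It does not use Solr at all: the class name CopyFieldSketch and the method applyCopyFields are made up for this illustration, and the document is just a Map. In real Solr the copy happens inside the update pipeline, on the raw input values, before the destination field's own analysis is applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CopyFieldSketch {

    // Returns a copy of doc in which each source field's values have also been
    // appended to the destination field (hypothetical helper, not a Solr API).
    static Map<String, List<String>> applyCopyFields(Map<String, List<String>> doc,
                                                     String[][] rules) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : doc.entrySet()) {
            out.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        for (String[] rule : rules) {            // rule[0] = source, rule[1] = dest
            List<String> values = doc.get(rule[0]);
            if (values == null) {
                continue;                        // the document has no such source field
            }
            out.computeIfAbsent(rule[1], k -> new ArrayList<>()).addAll(values);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("first_name", Arrays.asList("義経"));
        doc.put("family_name", Arrays.asList("源"));

        // Mirrors the two <copyField> lines in the schema
        String[][] rules = {
            {"first_name", "combined_name"},
            {"family_name", "combined_name"},
        };
        System.out.println(applyCopyFields(doc, rules).get("combined_name"));
    }
}
```

This is also why combined_name is declared multiValued="true": it receives one value from each source field.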

　
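You may be wondering what text_ja is. Recent versions of Solr ship with Kuromoji,
a tokenizer for Japanese, and it is available out of the box as shown below.
Touches like JapaneseKatakanaStemFilterFactory, which stems katakana words (サーバー ⇒ サーバ), are a nice bit of attention to detail.
(It wasn't like this yet when I heard a talk about it some three years ago ⇒ http://shinodogg.com/?p=3346)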

    <!-- Japanese using morphological analysis (see text_cjk for a configuration using bigramming)

         NOTE: If you want to optimize search for precision, use default operator AND in your query
         parser config with <solrQueryParser defaultOperator="AND"/> further down in this file.  Use
         OR if you would like to optimize for recall (default).
    -->
    <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <analyzer>
      <!-- Kuromoji Japanese morphological analyzer/tokenizer (JapaneseTokenizer)

           Kuromoji has a search mode (default) that does segmentation useful for search.  A heuristic
           is used to segment compounds into its parts and the compound itself is kept as synonym.

           Valid values for attribute mode are:
              normal: regular segmentation
              search: segmentation useful for search with synonyms compounds (default)
            extended: same as search mode, but unigrams unknown words (experimental)

           For some applications it might be good to use search mode for indexing and normal mode for
           queries to reduce recall and prevent parts of compounds from being matched and highlighted.
           Use <analyzer type="index"> and <analyzer type="query"> for this and mode normal in query.

           Kuromoji also has a convenient user dictionary feature that allows overriding the statistical
           model with your own entries for segmentation, part-of-speech tags and readings without a need
           to specify weights.  Notice that user dictionaries have not been subject to extensive testing.

           User dictionary attributes are:
                     userDictionary: user dictionary filename
             userDictionaryEncoding: user dictionary encoding (default is UTF-8)

           See lang/userdict_ja.txt for a sample user dictionary file.

           Punctuation characters are discarded by default.  Use discardPunctuation="false" to keep them.

           See http://wiki.apache.org/solr/JapaneseLanguageSupport for more on Japanese language support.
        -->
        <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
        <!--<tokenizer class="solr.JapaneseTokenizerFactory" mode="search" userDictionary="lang/userdict_ja.txt"/>-->
        <!-- Reduces inflected verbs and adjectives to their base/dictionary forms (辞書形) -->
        <filter class="solr.JapaneseBaseFormFilterFactory"/>
        <!-- Removes tokens with certain part-of-speech tags -->
        <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
        <!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
        <filter class="solr.CJKWidthFilterFactory"/>
        <!-- Removes common tokens typically not useful for search, but have a negative effect on ranking -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt" />
        <!-- Normalizes common katakana spelling variations by removing any last long sound character (U+30FC) -->
        <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
        <!-- Lower-cases romaji characters -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
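
To make the katakana stemming in the config above a bit more concrete, here is a minimal Java sketch of the rule its comment describes: drop a trailing long sound mark (U+30FC) from all-katakana tokens that are at least minimumLength characters long. This is my own re-implementation for illustration (the class and method names are made up), not the code of JapaneseKatakanaStemFilterFactory itself.

```java
class KatakanaStemSketch {

    static final char PROLONGED_SOUND_MARK = '\u30FC'; // the "ー" character

    // True if every character of s is in the Unicode KATAKANA block.
    static boolean isKatakana(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.UnicodeBlock.of(s.charAt(i)) != Character.UnicodeBlock.KATAKANA) {
                return false;
            }
        }
        return true;
    }

    // Removes one trailing long sound mark from katakana tokens of at least
    // minimumLength characters; every other token is returned unchanged.
    static String stem(String token, int minimumLength) {
        if (token.length() >= minimumLength
                && isKatakana(token)
                && token.charAt(token.length() - 1) == PROLONGED_SOUND_MARK) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        // minimumLength="4" as in the text_ja definition
        for (String token : new String[] {"サーバー", "コピー", "server"}) {
            System.out.println(token + " -> " + stem(token, 4));
        }
    }
}
```

With minimumLength set to 4, a four-character word like サーバー loses its trailing ー, while a three-character word like コピー is left alone.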

　
Then, when I tried to restart the Solr server, it complained...

3273 [searcherExecutor-4-thread-1] ERROR org.apache.solr.core.SolrCore  – org.apache.solr.common.SolrException: undefined field text
	at org.apache.solr.schema.IndexSchema.getDynamicFieldType(IndexSchema.java:1235)
	at org.apache.solr.schema.IndexSchema$SolrQueryAnalyzer.getWrappedAnalyzer(IndexSchema.java:419)
	at org.apache.lucene.analysis.AnalyzerWrapper.initReader(AnalyzerWrapper.java:117)
	at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:178)

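I don't use it, but I added it back... (Is that how it's supposed to work? If anyone knows, please tell me.)
(A likely cause: the sample solrconfig.xml still refers to a field named text, for example as the df default field of its request handlers.)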

   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

　
　
■ Indexing data into Solr
　
Now let's put some data into Solr.
The other day I wrote a post about client programming with SolrJ ⇒ http://shinodogg.com/?p=5899
so I'll build on that.
　
The indexing code looks like this.
It registers Shinohara-sama, a VIP member with residences in Nishi-Azabu and Kamakura.

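import java.util.Date;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

        // Body of a method declared with "throws Exception"
        String url = "http://localhost:8983/solr/collection1";
        SolrServer server = new HttpSolrServer(url);

        SolrInputDocument document = new SolrInputDocument();
        document.addField("id", 1);
        document.addField("first_name", "英治");
        document.addField("family_name", "篠原");
        document.addField("sex", false);
        document.addField("age", 34);
        // Arrays become multiple values of a multiValued field
        String[] addresses = {"東京都港区西麻布", "神奈川県鎌倉市"};
        document.addField("address", addresses);
        String[] tags = {"#優良顧客", "#VIP会員"};
        document.addField("tag", tags);
        document.addField("updated", new Date());
        document.addField("created", new Date());
        server.add(document);   // send the document to Solr
        server.commit();        // make it visible to searches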

　
　
■ Searching from the Solr admin UI
　
Now let's search from the Solr admin UI. Looking good.

　
Next I put in several records, by defining documents like the one above in a List and adding them in a loop.
The silly data below is what came back. (lol)

  "response": {
    "numFound": 4,
    "start": 0,
    "docs": [
      {
        "id": "1",
        "first_name": "英治",
        "family_name": "篠原",
        "sex": false,
        "age": 34,
        "address": [
          "東京都港区西麻布",
          "神奈川県鎌倉市"
        ],
        "tag": [
          "#優良顧客",
          "#VIP会員"
        ],
        "updated": "2013-12-22T05:25:38.731Z",
        "created": "2013-12-22T05:25:38.731Z",
        "_version_": 1455098365345792000
      },
      {
        "id": "2",
        "first_name": "一郎",
        "family_name": "鈴木",
        "sex": false,
        "age": 40,
        "address": [
          "愛知県名古屋市"
        ],
        "tag": [
          "#一般会員"
        ],
        "updated": "2013-12-22T05:34:05.906Z",
        "created": "2013-12-22T05:34:05.906Z",
        "_version_": 1455098897158373400
      },
      {
        "id": "3",
        "first_name": "花子",
        "family_name": "田中",
        "sex": true,
        "age": 50,
        "address": [
          "沖縄県那覇市"
        ],
        "tag": [
          "#クレーマー"
        ],
        "updated": "2013-12-22T05:34:06.107Z",
        "created": "2013-12-22T05:34:06.107Z",
        "boss_txt": [
          "12月24日に個別対応"
        ],
        "_version_": 1455098897222336500
      },
      {
        "id": "4",
        "first_name": "義経",
        "family_name": "源",
        "sex": false,
        "age": 999,
        "address": [
          "奥州平泉"
        ],
        "tag": [
          "#要存在確認"
        ],
        "updated": "2013-12-22T05:34:06.137Z",
        "created": "2013-12-22T05:34:06.137Z",
        "soumu_txt": [
          "こちらのお客様は住民票の提示が必要です"
        ],
        "_version_": 1455098897252745200
      }
    ]
  }

　
Now let's try a few searches.
　
– With copyField I created combined_name, a field that lumps the family name and the given name together.
combined_name:源
Searching with the query above hit Yoshitsune-san (源 義経).
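
The same search can also be sent as a plain HTTP request instead of going through the admin UI. The sketch below only builds the request URL, assuming the sample server and the collection1 core used in this post; the class name SelectUrlSketch and the helper selectUrl are made up for illustration, while q, wt and indent are ordinary Solr request parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SelectUrlSketch {

    // Builds a /select URL for the given core URL and query string.
    // The query is URL-encoded as UTF-8 so that ":" and Japanese text survive.
    static String selectUrl(String coreUrl, String query) {
        try {
            return coreUrl + "/select?q=" + URLEncoder.encode(query, "UTF-8")
                    + "&wt=json&indent=true";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String core = "http://localhost:8983/solr/collection1";
        // Search the copyField destination
        System.out.println(selectUrl(core, "combined_name:源"));
        // Range search on the tint field
        System.out.println(selectUrl(core, "age:[30 TO 40]"));
    }
}
```

Pasting the printed URL into a browser or passing it to curl should return the same kind of JSON response as shown earlier, provided the server is running.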

　
　
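– With dynamicField I created fields matching *_txt.
Both of them were hit nicely.
In boss_txt, the record noting individual handling on December 24.

　
In soumu_txt, the record noting that a certificate of residence is required.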
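
How does a name like boss_txt or soumu_txt end up as a text_ja field without ever being declared? The following Java sketch shows the idea under simplifying assumptions: explicit fields are checked first, then the dynamic patterns in the order given. The class and method names are made up, and Solr has its own precedence rules when several patterns match, which this sketch does not try to reproduce.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DynamicFieldSketch {

    // True if name matches a glob with a single "*" at the start or the end,
    // the only form allowed according to the comment in schema.xml.
    static boolean matches(String pattern, String name) {
        if (pattern.startsWith("*")) {
            return name.endsWith(pattern.substring(1));
        }
        if (pattern.endsWith("*")) {
            return name.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return pattern.equals(name);
    }

    // Looks up the field type: explicit fields first, then dynamic patterns.
    // Returns null when nothing matches (Solr would report an undefined field).
    static String resolveType(String name, Map<String, String> fields,
                              Map<String, String> dynamicFields) {
        String type = fields.get(name);
        if (type != null) {
            return type;
        }
        for (Map.Entry<String, String> e : dynamicFields.entrySet()) {
            if (matches(e.getKey(), name)) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("tag", "string");
        Map<String, String> dynamicFields = new LinkedHashMap<>();
        dynamicFields.put("*_txt", "text_ja");

        for (String name : new String[] {"boss_txt", "soumu_txt", "tag", "memo"}) {
            System.out.println(name + " -> " + resolveType(name, fields, dynamicFields));
        }
    }
}
```

So any department can get its own xx_txt field just by sending it in a document, with no schema change.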

　
　
So, what do you think? I hope this got across how nicely Solr takes care of things for you ヽ(´▽`)ﾉ
I'd like to start looking into SolrCloud and such before long.
　
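There are also plenty of finer techniques for index schema design and for handling Japanese in particular,
but I feel that reading the book below will get you to the level of someone who holds their own in the field.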

[改訂新版] Apache Solr入門 ~オープンソース全文検索エンジン (Software Design plus)

posted with amazlet at 13.12.22

大谷純, 阿部慎一朗, 大須賀稔, 北野太郎, 鈴木教嗣, 平賀一昭
技術評論社