Reading Apache Solr入門 (3): Looking through the default schema.xml

I'm reading through the schema part of chapter 2 of Apache Solr入門.

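With schema.xml, reloading the core (or restarting Solr) is required after a change, but changes made to managed-schema through the Schema API are apparently reflected immediately. That sounds interesting, so I'll try it out.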

Creating cores for the experiment

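I create them from the command line. Running create_core without the -d option selects the default configset, and a warning appears saying it is not recommended for production use.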

solr@91ae230d58a3:/opt/solr-9.2.1$ /opt/solr/bin/solr create_core -c sandbox_schema
WARNING: Using _default configset with data driven schema functionality. NOT RECOMMENDED for production use.
         To turn off: bin/solr config -c sandbox_schema -p 8983 -action set-user-property -property update.autoCreateFields -value false

Created new core 'sandbox_schema'

I ignore the warning and create one more.

solr@91ae230d58a3:/opt/solr-9.2.1$ /opt/solr/bin/solr create_core -c sandbox_schema_legacy
WARNING: Using _default configset with data driven schema functionality. NOT RECOMMENDED for production use.
         To turn off: bin/solr config -c sandbox_schema_legacy -p 8983 -action set-user-property -property update.autoCreateFields -value false

Created new core 'sandbox_schema_legacy'

For sandbox_schema_legacy, I edit the configuration so that it reads schema.xml.

solr@91ae230d58a3:/opt/solr-9.2.1$ diff /var/solr/data/sandbox_schema/conf/solrconfig.xml /var/solr/data/sandbox_schema_legacy/conf/solrconfig.xml
23a24,25
>   <schemaFactory class="ClassicIndexSchemaFactory"/>
>

The book wrote ClassicSchemaFactory, but that fails with Caused by: java.lang.ClassNotFoundException: ClassicSchemaFactory, so I checked the guide and changed it to ClassicIndexSchemaFactory.

Schema Factory Configuration :: Apache Solr Reference Guide

Once solrconfig.xml is edited, I rename managed-schema to schema.xml (I did this thinking it would be needed for the switch, but I'm not sure whether it actually mattered).

solr@91ae230d58a3:/opt/solr-9.2.1$ ls /var/solr/data/sandbox_schema_legacy/conf
lang  protwords.txt  schema.xml  solrconfig.xml  stopwords.txt  synonyms.txt

Now I can experiment with managed-schema and schema.xml separately, one core for each.
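For the managed-schema side, a change through the Schema API could look roughly like the sketch below. It only builds the request and does not send it. The host and port are taken from the warning output above, while the field name title is a hypothetical example and not something from the book.

```python
import json
import urllib.request

# Assumption: Solr listens on localhost:8983, as in the create_core output above.
SOLR_BASE = "http://localhost:8983/solr"


def build_add_field_request(core, name, field_type, stored=True, indexed=True):
    """Build (but do not send) a Schema API request that adds one field."""
    payload = {
        "add-field": {
            "name": name,
            "type": field_type,
            "stored": stored,
            "indexed": indexed,
        }
    }
    return urllib.request.Request(
        f"{SOLR_BASE}/{core}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "title" is a made-up field name used only for illustration.
req = build_add_field_request("sandbox_schema", "title", "text_general")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To apply it against a running Solr: urllib.request.urlopen(req)
```

Keeping the request construction separate from the actual send makes it easy to check the payload before touching a real core.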

Starting up with a minimal configuration

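The default schema.xml has all sorts of things written in it, and to a beginner it's hard to tell what is what. To make it easier to follow, I strip out the comments and unused settings wholesale and make schema.xml as small as possible.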

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="default-config" version="1.6">
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="_version_" type="plong" indexed="false" stored="false"/>
    <field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
    <field name="_nest_path_" type="_nest_path_" /><fieldType name="_nest_path_" class="solr.NestPathField" />
    <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>

    <uniqueKey>id</uniqueKey>

    <fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true" />
    <fieldType name="plong" class="solr.LongPointField" docValues="true"/>

    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
    <fieldType name="booleans" class="solr.BoolField" sortMissingLast="true" multiValued="true"/>
    <fieldType name="pdates" class="solr.DatePointField" docValues="true" multiValued="true"/>
    <fieldType name="plongs" class="solr.LongPointField" docValues="true" multiValued="true"/>
    <fieldType name="pdoubles" class="solr.DoublePointField" docValues="true" multiValued="true"/>

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer type="index">
        <tokenizer name="standard"/>
        <filter name="stop" ignoreCase="true" words="stopwords.txt" />
        <filter name="lowercase"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer name="standard"/>
        <filter name="stop" ignoreCase="true" words="stopwords.txt" />
        <filter name="synonymGraph" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter name="lowercase"/>
      </analyzer>
    </fieldType>

</schema>

That comes to 34 lines. The original was 1034 lines, so I removed exactly 1000 lines.

After changing the schema, the core has to be reloaded. The reload can be run through the API, and on success it returns a response like this.

# curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=sandbox_schema_legacy"
{
  "responseHeader":{
    "status":0,
    "QTime":92}}

Incidentally, when the reload fails, a 500 comes back like this. The log says which part of the schema was wrong, which makes it very easy to understand.

# curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=sandbox_schema_legacy"
{
  "responseHeader":{
    "status":500,
    "QTime":89},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"Unable to reload core [sandbox_schema_legacy]",
    "trace": [the stack trace appears here],
    "code":500}}

For example, when a required definition is missing, it is reported like this.

Caused by: org.apache.solr.common.SolrException: fieldType 'pdoubles' not found in the schema
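The reload-and-check loop above can be scripted. The following is a minimal sketch, not an official client: it builds the same RELOAD URL as the curl command, reads the status from the response body, and picks the Caused by: lines out of a log or trace. It assumes the text to scan is a plain multi-line string.

```python
import json
from urllib.parse import urlencode


def reload_url(core, base="http://localhost:8983/solr"):
    """Build the CoreAdmin RELOAD URL used in the curl command above."""
    query = urlencode({"action": "RELOAD", "core": core})
    return f"{base}/admin/cores?{query}"


def summarize_reload(body):
    """Return (ok, message) from a RELOAD response body (JSON text)."""
    data = json.loads(body)
    if data["responseHeader"]["status"] == 0:
        return True, "reloaded"
    return False, data.get("error", {}).get("msg", "unknown error")


def caused_by_lines(text):
    """Pick the 'Caused by:' lines out of a log or stack trace string."""
    stripped = (line.strip() for line in text.splitlines())
    return [line for line in stripped if line.startswith("Caused by:")]


ok_body = '{"responseHeader":{"status":0,"QTime":92}}'
ng_body = (
    '{"responseHeader":{"status":500,"QTime":89},'
    '"error":{"msg":"Unable to reload core [sandbox_schema_legacy]","code":500}}'
)
print(reload_url("sandbox_schema_legacy"))
print(summarize_reload(ok_body))  # -> (True, 'reloaded')
print(summarize_reload(ng_body))
```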

I thought the following part of schema.xml was unused, so I deleted it once and then added it back exactly as the log told me to. Apparently it is required.

    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
    <fieldType name="booleans" class="solr.BoolField" sortMissingLast="true" multiValued="true"/>
    <fieldType name="pdates" class="solr.DatePointField" docValues="true" multiValued="true"/>
    <fieldType name="plongs" class="solr.LongPointField" docValues="true" multiValued="true"/>
    <fieldType name="pdoubles" class="solr.DoublePointField" docValues="true" multiValued="true"/>

(Addendum) Someone on SO was wondering about the same thing. When I actually checked, solrconfig.xml was indeed referencing these types.

stackoverflow.com
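This kind of dangling reference can be found before reloading. The sketch below compares the fieldType names defined in a schema with the ones referenced from solrconfig.xml. Both XML strings are heavily simplified stand-ins written for this example, not the real files, and the assumption that references appear as str elements named fieldType is based on the typeMapping entries in the default configuration.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for schema.xml (assumption, not the real file).
SCHEMA = """<schema name="default-config" version="1.6">
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="plongs" class="solr.LongPointField" multiValued="true"/>
</schema>"""

# Simplified stand-in for the part of solrconfig.xml that maps value
# classes to field types (assumption, not the real file).
SOLRCONFIG = """<config>
  <updateProcessor class="solr.AddSchemaFieldsUpdateProcessorFactory" name="add-schema-fields">
    <lst name="typeMapping">
      <str name="valueClass">java.lang.Long</str>
      <str name="fieldType">plongs</str>
    </lst>
    <lst name="typeMapping">
      <str name="valueClass">java.lang.Number</str>
      <str name="fieldType">pdoubles</str>
    </lst>
  </updateProcessor>
</config>"""


def missing_field_types(schema_xml, solrconfig_xml):
    """Return fieldType names referenced in solrconfig but not defined in the schema."""
    defined = {ft.get("name") for ft in ET.fromstring(schema_xml).iter("fieldType")}
    referenced = {
        el.text.strip()
        for el in ET.fromstring(solrconfig_xml).iter("str")
        if el.get("name") == "fieldType" and el.text
    }
    return sorted(referenced - defined)


print(missing_field_types(SCHEMA, SOLRCONFIG))  # -> ['pdoubles']
```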

Looking through schema.xml

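Looking at the slimmed-down configuration, most of it is type definitions, and there are only five fields. The ones wrapped in underscores look like values used by the system, so in practice the only real field may be id.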
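The field count can be checked mechanically. This small sketch parses only the field elements copied from the minimal schema above and splits them by the underscore naming pattern. Treating underscore-wrapped names as system fields follows the guess in the text and is not a rule confirmed by the Solr documentation.

```python
import xml.etree.ElementTree as ET

# Only the <field> elements from the minimal schema above.
SCHEMA_FIELDS = """<schema name="default-config" version="1.6">
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="_version_" type="plong" indexed="false" stored="false"/>
    <field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
    <field name="_nest_path_" type="_nest_path_" />
    <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
</schema>"""

names = [f.get("name") for f in ET.fromstring(SCHEMA_FIELDS).iter("field")]
# Guess from the text: names wrapped in underscores are used by the system.
system_like = [n for n in names if n.startswith("_") and n.endswith("_")]
others = [n for n in names if n not in system_like]

print(len(names))   # -> 5
print(system_like)  # -> ['_version_', '_root_', '_nest_path_', '_text_']
print(others)       # -> ['id']
```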

Also, the definition of text_general alone is quite long and stands out. Apparently this is what section 2.4.3 calls a text field type.

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer type="index">
        <tokenizer name="standard"/>
        <filter name="stop" ignoreCase="true" words="stopwords.txt" />
        <filter name="lowercase"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer name="standard"/>
        <filter name="stop" ignoreCase="true" words="stopwords.txt" />
        <filter name="synonymGraph" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter name="lowercase"/>
      </analyzer>
    </fieldType>

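Analyzers are defined for index time and for query time, and inside each analyzer a tokenizer and filters are defined. According to the book, there is also something called a charFilter (character filter).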

The documentation covers analyzers in detail here: https://solr.apache.org/guide/solr/9_2/indexing-guide/analyzers.html

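Personally, I found the query-time analyzer interesting. I could imagine that documents are tokenized when they are indexed (to build the inverted index) and that filters are applied to normalize spelling variants, upper and lower case, and so on. That the same processing is also needed at search time makes sense once it's pointed out.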
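Why the query side needs the same processing can be seen with a toy inverted index. The sketch below is not how Solr is implemented: whitespace splitting stands in for the standard tokenizer, and the stopword set is a hypothetical substitute for stopwords.txt. It only shows that an unanalyzed query misses terms that were normalized at index time.

```python
# Hypothetical stand-in for stopwords.txt.
STOPWORDS = {"the", "a", "of"}


def analyze(text):
    """Toy analysis chain: tokenize, drop stopwords (ignoring case), lowercase."""
    tokens = text.split()  # crude stand-in for the standard tokenizer
    tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return [t.lower() for t in tokens]


def build_index(docs):
    """Build an inverted index: term -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query, analyze_query=True):
    """Return ids of documents containing every query term."""
    terms = analyze(query) if analyze_query else query.split()
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))


docs = {1: "The Schema of Solr", 2: "Reloading a core"}
index = build_index(docs)
print(search(index, "SOLR schema"))                       # -> {1}
print(search(index, "SOLR schema", analyze_query=False))  # -> set()
```

The index only contains lowercased terms, so the raw query term SOLR finds nothing until it goes through the same chain.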

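I removed it for readability, but the default schema also defines a text field type for Japanese. It has more filters and is interesting in its own right (the book explains the role of each filter with examples). There are also definitions for many other languages, which shows how widely Solr is used around the world.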

https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/server/solr/configsets/_default/conf/managed-schema.xml#L837-L880

About stored

Section 2.5 of the book introduces stored as an important option when defining a field and says that "setting stored=true increases the index size", but I don't really understand why.

I thought the "index" was a table of contents for looking up articles, so more article data by itself shouldn't affect the index size...

Maybe it will become clear as I read further, so I'll look forward to that.