Verdict

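Weave is designed to make it easy to track and log every call made through the Verdict Python library. Debugging is critical when you work with AI evaluation pipelines: when a pipeline step fails, an output is unexpected, or nested operations become confusing, pinpointing the problem can be hard. Verdict applications often consist of multiple pipeline steps, judges, and transforms, so understanding the inner workings of your evaluation workflow is essential.

Weave simplifies this process by automatically capturing traces of your Verdict application. You can then monitor and analyze pipeline performance, which makes it easier to debug and optimize your AI evaluation workflow.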

Quickstart

To get started, call weave.init(project=...) at the top of your script. Use the project argument to log to a specific W&B team with team-name/project-name, or pass just project-name to log to your default team/entity.
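The two accepted forms of the project argument can be pictured with a small parser. This is only an illustrative sketch: split_project is a hypothetical helper, not part of Weave, and Weave's own handling of the string may differ.

```python
from typing import Optional, Tuple


def split_project(project: str) -> Tuple[Optional[str], str]:
    """Split 'team-name/project-name' into (team, project).

    Hypothetical helper for illustration only; not a Weave API.
    A bare 'project-name' yields team=None, meaning "use the default entity".
    """
    team, sep, name = project.partition("/")
    if not sep:
        # No slash present: the whole string is the project name.
        return None, project
    return team, name


print(split_project("my-team/verdict_demo"))
print(split_project("verdict_demo"))
```

In practice you pass the string straight to weave.init and never need to split it yourself.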

import weave
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.schema import Schema

# Initialize Weave with your project name
weave.init("verdict_demo")

# Create a simple evaluation pipeline
pipeline = Pipeline()
pipeline = pipeline >> JudgeUnit().prompt("Rate the quality of this text: {source.text}")

# Create sample data
data = Schema.of(text="This is a sample text for evaluation.")

# Run the pipeline - this is traced automatically
output = pipeline.run(data)

print(output)

Tracking Call Metadata

To track metadata from your Verdict pipeline calls, use the weave.attributes context manager. It lets you set custom metadata for a specific block of code, such as a pipeline run or an evaluation batch.

import weave
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.schema import Schema

# Initialize Weave with your project name
weave.init("verdict_demo")

pipeline = Pipeline()
pipeline = pipeline >> JudgeUnit().prompt("Evaluate sentiment: {source.text}")

data = Schema.of(text="I love this product!")

with weave.attributes({"evaluation_type": "sentiment", "batch_id": "batch_001"}):
    output = pipeline.run(data)

print(output)

Weave automatically records this metadata against the trace of the Verdict pipeline call. You can view the metadata in the Weave web interface.
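As a mental model for how a context manager can scope metadata to a block of code, here is a minimal standard-library sketch. The names attributes, record_call, and _current_attributes are hypothetical stand-ins, not Weave internals, and the merging of nested blocks shown here is an assumption rather than documented Weave behavior.

```python
import contextvars
from contextlib import contextmanager

# Hypothetical stand-in for the metadata a tracer would read; not Weave's internals.
_current_attributes = contextvars.ContextVar("current_attributes", default={})


@contextmanager
def attributes(extra: dict):
    """Merge `extra` into the active attributes for the duration of the block."""
    merged = {**_current_attributes.get(), **extra}
    token = _current_attributes.set(merged)
    try:
        yield
    finally:
        # Restore whatever was active before the block.
        _current_attributes.reset(token)


def record_call(name: str) -> dict:
    """Pretend to log a call, attaching a copy of whatever attributes are active."""
    return {"name": name, "attributes": dict(_current_attributes.get())}


with attributes({"evaluation_type": "sentiment"}):
    with attributes({"batch_id": "batch_001"}):
        inner = record_call("pipeline.run")
    outer = record_call("pipeline.run")
after = record_call("pipeline.run")

print(inner["attributes"], outer["attributes"], after["attributes"])
```

The point of the sketch is that only calls made inside the with block pick up the metadata; calls made after the block exits do not.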

Traces

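Storing traces of AI evaluation pipelines in a central database is important during both development and production. These traces are essential for debugging and improving your evaluation workflows, and they double as a valuable dataset.

Weave automatically captures traces for your Verdict applications. It tracks and logs all calls made through the Verdict library, including: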

Pipeline execution steps
Judge unit evaluations
Layer transformations
Pooling operations
Custom units and transformations

You can view the traces in the Weave web interface, which shows the hierarchical structure of your pipeline execution.

Pipeline Tracing Example

Here is a more complex example showing how Weave traces nested pipeline operations:

import weave
from verdict import Pipeline, Layer
from verdict.common.judge import JudgeUnit
from verdict.transform import MeanPoolUnit
from verdict.schema import Schema

# Initialize Weave with your project name
weave.init("verdict_demo")

# Create a more complex pipeline with multiple steps
pipeline = Pipeline()
pipeline = pipeline >> Layer([
    JudgeUnit().prompt("Rate coherence: {source.text}"),
    JudgeUnit().prompt("Rate relevance: {source.text}"),
    JudgeUnit().prompt("Rate accuracy: {source.text}")
], 3)
pipeline = pipeline >> MeanPoolUnit()

# Sample data
data = Schema.of(text="This is a comprehensive evaluation of text quality across multiple dimensions.")

# Run the pipeline - all operations are traced
result = pipeline.run(data)

print(f"Average score: {result}")

This creates a detailed trace showing:

The main Pipeline execution
Each JudgeUnit evaluation within the Layer
The MeanPoolUnit aggregation step
Timing information for each operation
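To make "hierarchical trace with timing" concrete, the following toy recorder builds a tree of spans shaped like the list above. It is a sketch under assumptions: Span, span, and outline are hypothetical names, the scores are made up, and the exact node names and nesting that Weave displays may differ (real judge units may also run concurrently rather than one after another).

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from statistics import mean
from typing import List


@dataclass
class Span:
    """One node in the toy trace tree."""
    name: str
    seconds: float = 0.0
    children: List["Span"] = field(default_factory=list)


root = Span("trace")
_stack = [root]  # the innermost open span is always last


@contextmanager
def span(name):
    """Open a child span under the current one and time the enclosed block."""
    node = Span(name)
    _stack[-1].children.append(node)
    _stack.append(node)
    start = time.perf_counter()
    try:
        yield node
    finally:
        node.seconds = time.perf_counter() - start
        _stack.pop()


def outline(node, depth=0):
    """Return the tree as indented lines, parents before children."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


fake_scores = [4, 5, 3]  # hypothetical judge outputs, not real model scores
with span("Pipeline"):
    with span("Layer"):
        for label in ("coherence", "relevance", "accuracy"):
            with span(f"JudgeUnit[{label}]"):
                pass  # a real judge call would happen here
    with span("MeanPoolUnit"):
        average = mean(fake_scores)

print("\n".join(outline(root.children[0])))
```

Each node also carries its own elapsed time in seconds, which is the kind of per-operation timing the trace view exposes.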

Configuration

Calling weave.init() automatically enables tracing for Verdict pipelines. The integration works by patching the Pipeline.__init__ method to inject a VerdictTracer, which forwards all trace data to Weave. No additional configuration is needed. Weave automatically:

Captures all pipeline operations
Tracks execution timing
Logs inputs and outputs
Maintains the trace hierarchy
Handles concurrent pipeline execution
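The patching approach described above can be sketched with toy classes. Everything here is a stand-in: this Pipeline is not verdict.Pipeline, RecordingTracer and patch_pipeline_init are hypothetical, and the real integration's internals are not shown in this guide. The sketch only illustrates the general technique of wrapping __init__ so that every new instance gains an extra tracer alongside any the caller supplied.

```python
import functools


class Pipeline:
    """Toy stand-in for a pipeline class; only the tracer handling is modeled."""

    def __init__(self, name="pipeline", tracer=None):
        self.name = name
        self.tracers = list(tracer or [])

    def run(self, data):
        # Notify every attached tracer, then return the data unchanged.
        for t in self.tracers:
            t.on_run(self.name, data)
        return data


class RecordingTracer:
    """Hypothetical tracer that records events in memory."""

    def __init__(self):
        self.events = []

    def on_run(self, name, data):
        self.events.append((name, data))


def patch_pipeline_init(cls, injected):
    """Wrap cls.__init__ so every new instance also gets `injected` as a tracer."""
    original_init = cls.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        if injected not in self.tracers:
            self.tracers.append(injected)

    cls.__init__ = patched_init


auto_tracer = RecordingTracer()
patch_pipeline_init(Pipeline, auto_tracer)

user_tracer = RecordingTracer()
pipeline = Pipeline(name="demo", tracer=[user_tracer])
pipeline.run({"text": "hello"})

print(len(pipeline.tracers), auto_tracer.events)
```

Because the wrapper calls the original constructor first, user-supplied tracers keep working and the injected one is simply appended.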

Custom Tracers and Weave

If your application uses custom Verdict tracers, Weave's VerdictTracer works alongside them:

import weave
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.util.tracing import ConsoleTracer
from verdict.schema import Schema

# Initialize Weave with your project name
weave.init("verdict_demo")

# You can still use Verdict's built-in tracers
console_tracer = ConsoleTracer()

# Create a pipeline that uses both Weave (automatic) and console tracing
pipeline = Pipeline(tracer=[console_tracer])  # The Weave tracer is added automatically
pipeline = pipeline >> JudgeUnit().prompt("Evaluate: {source.text}")

data = Schema.of(text="Sample evaluation text")

# This traces to both Weave and the console
result = pipeline.run(data)

Models and Evaluations

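Organizing and evaluating AI systems with multiple pipeline components can be challenging. weave.Model lets you capture and organize experimental details such as prompts, pipeline configurations, and evaluation parameters, which makes it easier to compare different iterations. The following example shows how to wrap a Verdict pipeline in a weave.Model: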

import asyncio
import weave
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.schema import Schema

# Initialize Weave with your project name
weave.init("verdict_demo")

class TextQualityEvaluator(weave.Model):
    judge_prompt: str
    pipeline_name: str

    @weave.op()
    async def predict(self, text: str) -> dict:
        pipeline = Pipeline(name=self.pipeline_name)
        pipeline = pipeline >> JudgeUnit().prompt(self.judge_prompt)
        
        data = Schema.of(text=text)
        result = pipeline.run(data)
        
        return {
            "text": text,
            "quality_score": result.score if hasattr(result, 'score') else result,
            "evaluation_prompt": self.judge_prompt
        }

model = TextQualityEvaluator(
    judge_prompt="Rate the quality of this text on a scale of 1-10: {source.text}",
    pipeline_name="text_quality_evaluator"
)

text = "This is a well-written and informative piece of content that provides clear value to readers."

prediction = asyncio.run(model.predict(text))

# In a Jupyter notebook, run this instead:
# prediction = await model.predict(text)

print(prediction)

This code creates a model that can be visualized in the Weave UI, showing both the pipeline structure and the evaluation results.

Evaluations

Evaluations help you measure the performance of the evaluation pipeline itself. With the weave.Evaluation class, you can capture how well your Verdict pipeline performs on specific tasks or datasets:

import asyncio
import weave
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.schema import Schema

# Initialize Weave
weave.init("verdict_demo")

# Create the evaluation model
class SentimentEvaluator(weave.Model):
    @weave.op()
    async def predict(self, text: str) -> dict:
        pipeline = Pipeline()
        pipeline = pipeline >> JudgeUnit().prompt(
            "Classify sentiment as positive, negative, or neutral: {source.text}"
        )
        
        data = Schema.of(text=text)
        result = pipeline.run(data)
        
        return {"sentiment": result}

# Test data
texts = [
    "I love this product, it's amazing!",
    "This is terrible, worst purchase ever.",
    "The weather is okay today."
]
labels = ["positive", "negative", "neutral"]

examples = [
    {"id": str(i), "text": texts[i], "target": labels[i]}
    for i in range(len(texts))
]

# Scoring function
@weave.op()
def sentiment_accuracy(target: str, output: dict) -> dict:
    # str() guards against a non-string pipeline result before lowercasing
    predicted = str(output.get("sentiment", "")).lower()
    return {"correct": target.lower() in predicted}

model = SentimentEvaluator()

evaluation = weave.Evaluation(
    dataset=examples,
    scorers=[sentiment_accuracy],
)

scores = asyncio.run(evaluation.evaluate(model))
# In a Jupyter notebook, run this instead:
# scores = await evaluation.evaluate(model)

print(scores)

This creates an evaluation trace that shows how your Verdict pipeline performs across the test cases.
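The sentiment_accuracy scorer above checks whether the target label appears anywhere in the judge's reply. That substring check is simple but can misfire when a reply mentions more than one label. The sketch below, which uses only plain functions and made-up reply text, contrasts it with a stricter alternative; exact_label is a hypothetical helper, not part of Weave or Verdict.

```python
def contains_match(target: str, predicted: str) -> bool:
    """Same idea as the scorer above: the label appears somewhere in the reply."""
    return target.lower() in predicted.lower()


def exact_label(predicted: str, labels=("positive", "negative", "neutral")):
    """Hypothetical stricter parser: return a label only if exactly one is mentioned."""
    text = predicted.lower()
    found = [label for label in labels if label in text]
    return found[0] if len(found) == 1 else None


reply = "Not negative at all, clearly positive."  # made-up judge reply
loose = contains_match("negative", reply)   # substring check accepts the wrong label
strict = exact_label(reply)                 # two labels mentioned, so no decision

print(loose, strict, exact_label("Positive."))
```

Which behavior you want depends on how constrained the judge's output is; asking the judge for a single-word answer makes the loose check much safer.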

Best Practices

Performance Monitoring

Weave automatically captures timing information for every pipeline operation. You can use it to identify performance bottlenecks:

import weave
from verdict import Pipeline, Layer
from verdict.common.judge import JudgeUnit
from verdict.schema import Schema

weave.init("verdict_demo")

# Create a pipeline whose steps may vary in performance
pipeline = Pipeline()
pipeline = pipeline >> Layer([
    JudgeUnit().prompt("Quick evaluation: {source.text}"),
    JudgeUnit().prompt("Detailed analysis: {source.text}"),  # this one may be slower
], 2)

data = Schema.of(text="Sample text for performance testing")

# Run several times to see timing patterns
for i in range(3):
    with weave.attributes({"run_number": i}):
        result = pipeline.run(data)
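If you want a quick local cross-check of the durations you see in the trace view, you can time repeated runs yourself. This is a generic standard-library sketch: time_runs is a hypothetical helper and fake_run is a placeholder for a call such as pipeline.run(data).

```python
import time
from statistics import mean


def time_runs(fn, repeats=3):
    """Call fn() `repeats` times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def fake_run():
    # Placeholder workload standing in for a real pipeline run.
    time.sleep(0.01)


durations = time_runs(fake_run)
print(f"mean={mean(durations):.4f}s max={max(durations):.4f}s")
```

Wall-clock timing of the whole run will not tell you which unit is slow; the per-operation timings in the trace are what localize a bottleneck.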

Error Handling

Weave automatically captures exceptions raised during pipeline execution:

import weave
from verdict import Pipeline
from verdict.common.judge import JudgeUnit
from verdict.schema import Schema

weave.init("verdict_demo")

pipeline = Pipeline()
pipeline = pipeline >> JudgeUnit().prompt("Process: {source.invalid_field}")  # this causes an error

data = Schema.of(text="Sample text")

try:
    result = pipeline.run(data)
except Exception as e:
    print(f"Pipeline failed: {e}")
    # Error details are captured in the Weave trace
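The failure in the example above comes from a prompt that references a field the input does not have. The same class of mistake can be reproduced with Python's built-in str.format and a plain object. This is only an analogy: Verdict's own prompt templating and the exception type it raises may differ.

```python
from types import SimpleNamespace

# Stand-in for the input record; it has `text` but no `invalid_field`.
source = SimpleNamespace(text="Sample text")

ok = "Process: {source.text}".format(source=source)

try:
    "Process: {source.invalid_field}".format(source=source)
except AttributeError as exc:
    # str.format resolves `source.invalid_field` with getattr, which fails here.
    error = exc

print(ok)
print(f"Template failed: {error}")
```

Checking that every placeholder in a prompt matches a field in your schema before running a large batch avoids this kind of error entirely.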

By integrating Weave with Verdict, you get comprehensive observability into your AI evaluation pipelines, making it easier to debug, optimize, and understand your evaluation workflows.
