{"library":"metaflow-checkpoint","title":"Metaflow Checkpoint","description":"Metaflow-checkpoint is an experimental extension for Metaflow that provides in-task checkpointing capabilities. It allows users to periodically save the progress of long-running Metaflow steps, such as machine learning model training, ensuring recovery from failures without losing significant work. The library, currently at version 0.2.10, is released as an independent extension to core Metaflow.","language":"python","status":"active","last_verified":"Thu Apr 16","install":{"commands":["pip install metaflow-checkpoint"],"cli":null},"imports":["from metaflow import FlowSpec, step, current, checkpoint","current.checkpoint.save()"],"auth":{"required":false,"env_vars":[]},"quickstart":{"code":"import os\nimport random\nfrom metaflow import FlowSpec, step, current, checkpoint, retry\n\nclass CheckpointCounterFlow(FlowSpec):\n    @retry(times=2, minutes_between_retries=1)\n    @checkpoint(load_policy='eager') # Use 'eager' for development across runs\n    @step\n    def start(self):\n        self.counter = 0\n        if current.checkpoint.is_loaded:\n            print(f\"Resuming from checkpoint. 
Reading the saved counter...\")\n            with open(os.path.join(current.checkpoint.directory, 'counter'), 'r') as f:\n                self.counter = int(f.read())\n            print(f\"Successfully loaded counter: {self.counter}\")\n        else:\n            print(\"Starting from scratch.\")\n\n        # Continue from the restored counter so completed iterations are not repeated\n        for i in range(self.counter, 5):\n            self.counter += 1\n            print(f\"Processing iteration {i+1}, counter is {self.counter}\")\n            # Write the counter to the checkpoint directory, then persist it\n            with open(os.path.join(current.checkpoint.directory, 'counter'), 'w') as f:\n                f.write(str(self.counter))\n            current.checkpoint.save()\n\n            # Simulate a flaky operation\n            if random.random() < 0.3:\n                raise Exception(\"Simulated failure!\")\n\n        self.next(self.end)\n\n    @step\n    def end(self):\n        print(f\"Flow finished. Final counter value: {self.counter}\")\n\nif __name__ == '__main__':\n    CheckpointCounterFlow()","lang":"python","description":"This quickstart demonstrates a Metaflow flow that uses the `@checkpoint` decorator. The `start` step simulates a long-running, flaky process that increments a counter up to 5. After each iteration it writes the counter to a file in `current.checkpoint.directory` and calls `current.checkpoint.save()`. When the step restarts (through `@retry` or the `resume` command), `current.checkpoint.is_loaded` indicates that a checkpoint was restored; the step then reads the last saved counter from `current.checkpoint.directory` and continues from that value. Setting `load_policy='eager'` allows checkpoints to be reused across different runs, which aids iterative development. Run with `python your_flow.py run` and try `python your_flow.py resume start` after an interruption.","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":null}