This PR makes it possible to benchmark inference speed per inference engine.
In my case, I was able to detect that for qwen2.5 0.5B, the GGUF model runs roughly twice as fast as the transformers-based config:
| path | date | average_duration (s) | max_duration (s) | min_duration (s) | median_duration (s) | median_frequency | average_speed (tokens/s) | max_speed (tokens/s) | min_speed (tokens/s) | median_speed (tokens/s) | total_tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|
| dora-llama-cpp-python | 2025-03-17 15:45:25 | 0.03 | 0.09 | 0.03 | 0.03 | 37.76 | 222.73 | 233.59 | 69.38 | 226.54 | 6 |
| dora-transformers | 2025-03-17 16:20:33 | 0.07 | 0.40 | 0.05 | 0.06 | 16.15 | 96.37 | 111.81 | 15.14 | 96.90 | 6 |
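The aggregate columns above can be derived from raw per-run measurements. A minimal sketch using Python's `statistics` module; the sample durations and token counts below are made up for illustration and are not the actual benchmark data:

```python
import statistics

# Hypothetical per-run measurements: (duration in seconds, tokens generated).
# Illustrative values only, not the numbers from the table above.
runs = [(0.03, 7), (0.09, 20), (0.03, 7), (0.04, 9), (0.03, 7), (0.03, 7)]

durations = [d for d, _ in runs]
speeds = [t / d for d, t in runs]  # tokens per second for each run

report = {
    "average_duration (s)": statistics.mean(durations),
    "max_duration (s)": max(durations),
    "min_duration (s)": min(durations),
    "median_duration (s)": statistics.median(durations),
    "average_speed (tokens/s)": statistics.mean(speeds),
    "max_speed (tokens/s)": max(speeds),
    "min_speed (tokens/s)": min(speeds),
    "median_speed (tokens/s)": statistics.median(speeds),
}
```

Computing both duration- and speed-based aggregates is useful because a single slow warm-up run skews the average but barely moves the median, as the table's gap between average and max duration shows.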
This PR also makes it possible to run a dataflow from Python using:
```python
from dora import run
# Make sure to build it first with the CLI.
# Build step is on the todo list.
run("qwen2.5.yaml", uv=True)
```
This makes it easier to run dataflows programmatically based on our needs.
This pull request introduces a new Dora node called `dora-transformer`
that allows access to Hugging Face transformer models for text
generation and chat completion. The changes include adding a
comprehensive README, initializing the main functionality of the node,
defining dependencies, and adding a test file.
Key changes:
### Core functionality:
* Implemented the main logic in `dora_transformer/main.py` for loading
models, generating responses, and handling memory management.
## Summary by Sourcery
Introduces a new `dora-transformer` node that provides access to Hugging
Face transformer models for text generation and chat completion. The
node supports multi-platform GPU acceleration, memory-efficient model
loading, configurable system prompts, and conversation history
management.
New Features:
- Introduces a new `dora-transformer` node for text generation and chat
completion using Hugging Face transformer models.
- Provides multi-platform GPU acceleration support (CUDA, CPU, MPS).
- Offers memory-efficient model loading with 8-bit quantization.
- Supports configurable system prompts and activation words.
- Implements conversation history management.
- Integrates seamlessly with speech-to-text and text-to-speech
pipelines.
- Optimizes performance for both small and large language models.
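As a sketch of what the conversation-history management could look like (the class name, message format, and max-turns cutoff are assumptions for illustration, not the node's actual implementation):

```python
from collections import deque

class ChatHistory:
    """Keeps the system prompt fixed while capping the number of stored turns."""

    def __init__(self, system_prompt: str, max_turns: int = 10):
        self.system_prompt = system_prompt
        # Each turn is a (user, assistant) pair; old turns drop off automatically.
        self.turns = deque(maxlen=max_turns)

    def add_turn(self, user: str, assistant: str) -> None:
        self.turns.append((user, assistant))

    def as_messages(self) -> list:
        """Render history in the role/content message format used by chat templates."""
        messages = [{"role": "system", "content": self.system_prompt}]
        for user, assistant in self.turns:
            messages.append({"role": "user", "content": user})
            messages.append({"role": "assistant", "content": assistant})
        return messages
```

Bounding the history keeps the prompt length, and therefore memory use and latency, predictable across long sessions.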
Tests:
- Adds a test file to verify the basic functionality of the
`dora-transformer` node.