* Added the ability to save and load individual conversations in a batched executor.
- New example
- Added `BatchedExecutor.Load(filepath)` method
- Added `Conversation.Save(filepath)` method
- Added new (currently internal) `SaveState`/`LoadState` methods in `LLamaContext` which can stash some extra binary data in the header
* Added ability to save/load a `Conversation` to an in-memory state, instead of to file.
* Moved the new save/load methods out to an extension class specifically for the batched executor.
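A minimal usage sketch of the new save/load surface, assuming the extension methods are in scope and that `Conversation.Save` and `BatchedExecutor.Load` take a file path as described above; any other names (model path, prompt text, method signatures not listed above) are placeholders rather than a confirmed API.

```csharp
using System.Threading.Tasks;
using LLama;
using LLama.Batched;
using LLama.Common;

// Sketch only: signatures other than Conversation.Save(filepath) and
// BatchedExecutor.Load(filepath) are assumptions for illustration.
public static class ConversationPersistenceExample
{
    public static async Task SaveAndReloadAsync(string modelPath, string statePath)
    {
        var parameters = new ModelParams(modelPath);
        using var weights = LLamaWeights.LoadFromFile(parameters);
        using var executor = new BatchedExecutor(weights, parameters);

        // Create a conversation, prompt it and run one inference step.
        var conversation = executor.Create();
        conversation.Prompt("Write a haiku about autumn.");
        await executor.Infer();

        // Persist just this conversation. The file header can carry extra
        // binary data via the (internal) LLamaContext SaveState/LoadState.
        conversation.Save(statePath);

        // Restore it later into the same (or a fresh) batched executor.
        var restored = executor.Load(statePath);
    }
}
```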
* Removed unnecessary spaces
* Updated binaries, using [this build](https://github.com/SciSharp/LLamaSharp/actions/runs/8654672719/job/23733195669) for llama.cpp commit `f7001ccc5aa359fcf41bba19d1c99c3d25c9bcc7`.
- Added all new functions.
- Moved some functions (e.g. `SafeLlamaModelHandle` specific functions) into `SafeLlamaModelHandle.cs`
- Exposed tokens on `SafeLlamaModelHandle` and `LLamaWeights` through a `Tokens` property. As new special tokens are added in the future, they can be exposed here.
- Changed all token properties to return nullable tokens, to handle models which do not define certain tokens (see the sketch below).
- Fixed `DefaultSamplingPipeline` to handle no newline token in some models.
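As an illustration, a hedged sketch of consuming the new nullable token properties; the `Tokens` property name comes from the notes above, while individual member names such as `Newline` are assumptions.

```csharp
using System;
using LLama;
using LLama.Common;

// Sketch: read special tokens through the Tokens property and cope with
// models that do not define some of them (the properties are nullable).
// The "Newline" member name is an assumption for illustration.
var parameters = new ModelParams("model.gguf");
using var weights = LLamaWeights.LoadFromFile(parameters);

var newline = weights.Tokens.Newline;   // nullable: may be null for some models
if (newline is null)
{
    // This is the case DefaultSamplingPipeline now has to handle.
    Console.WriteLine("Model defines no newline token.");
}
```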
* Moved native methods to more specific locations.
- Context-specific things have been moved into `SafeLLamaContextHandle.cs` and made private; they're already exposed through C# properties and methods.
- Checking that GPU layer count is zero if GPU offload is not supported.
- Moved methods for creating default structs (`llama_model_quantize_default_params` and `llama_context_default_params`) into relevant structs.
* Removed exception if `GpuLayerCount > 0` when GPU is not supported.
* Added low level wrapper methods for the new per-sequence state load/save in `SafeLLamaContextHandle`
- Added high level wrapper methods (save/load with `State` object or memory mapped file) in `LLamaContext`
- Moved native methods for per-sequence state load/save into `SafeLLamaContextHandle`
* Added update and defrag methods for KV cache in `SafeLLamaContextHandle`
* Updated submodule to `f7001ccc5aa359fcf41bba19d1c99c3d25c9bcc7`
* Passing the sequence ID when saving a single sequence state
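A hedged sketch of what calling the per-sequence save/load might look like from `LLamaContext`; the method names and overloads shown are assumptions mirroring the notes above, not a confirmed API.

```csharp
using LLama;
using LLama.Common;
using LLama.Native;

// Sketch only: SaveState/LoadState overloads taking a sequence id are
// assumptions based on the description above.
var parameters = new ModelParams("model.gguf");
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);

// The sequence whose state we want to persist (explicit conversion assumed).
var sequence = (LLamaSeqId)0;

// Save the state of this single sequence to a (memory mapped) file,
// passing the sequence id as described above.
context.SaveState("sequence0.state", sequence);

// ... later, restore it into a context created from the same model ...
context.LoadState("sequence0.state", sequence);
```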
* Added `NativeLogConfig` which allows overriding the llama.cpp log callback
- Delaying binding of this into llama.cpp until after `NativeLibraryConfig` has loaded
* Using the log callback to show log messages during library loading.
* Registering log callbacks before any calls to llama.cpp except `llama_empty_call`; this method is specifically chosen because it does nothing and exists only to trigger DLL loading.
* Removed much of the complexity of logging from `NativeApi.Load`. It now always calls whatever log callbacks you have registered.
- Removed the alternative `ILogger` path in `NativeLibraryConfig`; instead the `ILogger` is wrapped in a delegate.
* Saving a GC handle to keep the log callback alive
* Removed the prefix; the logger should already do that.
* Buffering messages until a newline is encountered before passing the log message to the `ILogger`.
* Added a trailing `\n` to log messages from loading.
- Using `ThreadLocal<StringBuilder>` to ensure messages from separate threads don't get mixed together.
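A small sketch of wiring a managed logger into the native log callback; `NativeLogConfig` is named above, but the exact registration method shown is an assumption.

```csharp
using LLama.Native;
using Microsoft.Extensions.Logging;

// Sketch: register a managed logger early, before any real llama.cpp call.
// Registration is deferred internally until NativeLibraryConfig has loaded
// the library, and messages are buffered per-thread until a newline arrives.
using var factory = LoggerFactory.Create(builder => builder.AddConsole());
var logger = factory.CreateLogger("llama.cpp");

// Method name assumed from the description above; it may differ.
NativeLogConfig.llama_log_set(logger);

// Subsequent calls (including model loading) now report progress
// through the registered callback instead of stderr.
```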
This commit was originally made by lcarrere in https://github.com/SciSharp/LLamaSharp/issues/180 .
I have confirmed this modification works on my Windows 11 laptop, and made this commit at the request of AsakusaRinne.
* Previously, when a conversation was forked, the parent and the child shared exactly the same logits. Since sampling is allowed to modify logits, this could cause problems (e.g. one conversation is sampled and overwrites the logits with zeros, then the second conversation is sampled and generates nonsense). Fixed by setting a "forked" flag: logits are copied whenever the flag is set. The flag is cleared the next time the conversation is prompted, so the extra copy only happens once after a fork.
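Roughly, the idea looks like the following simplified sketch; field and method names are illustrative, not the actual LLamaSharp implementation.

```csharp
using System;

// Simplified sketch of the copy-on-fork idea described above.
class ConversationSketch
{
    private bool _forked;
    private float[] _logits = Array.Empty<float>();

    public ConversationSketch Fork()
    {
        var child = (ConversationSketch)MemberwiseClone();
        // Parent and child now reference the same logits buffer, so mark
        // both as forked to force a copy before sampling can mutate them.
        _forked = true;
        child._forked = true;
        return child;
    }

    public float[] GetLogits()
    {
        if (_forked)
        {
            // Copy once, so a sampler mutating this buffer cannot corrupt
            // the sibling conversation's logits.
            _logits = (float[])_logits.Clone();
        }
        return _logits;
    }

    public void Prompt(string text)
    {
        // Prompting gives this conversation fresh logits of its own, so the
        // flag is cleared here and the extra copy happens at most once.
        _forked = false;
        // ... tokenize `text` and queue it for inference ...
    }
}
```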
* Removed finalizer from `BatchedExecutor`. This class does not directly own any unmanaged resources so it is not necessary.
Replaced the `BatchedExecutor.Prompt(string)` method with `BatchedExecutor.Create()`. This improves the API in two ways (see the sketch after this list):
- A conversation can be created, without immediately prompting it
- Other prompting overloads (e.g. prompt with token list) can be used without duplicating all the overloads onto `BatchedExecutor`
Added `BatchSize` property to `LLamaContext`
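A short sketch of the new flow under these changes; `Create()`, the string `Prompt` overload and the `BatchSize` property come from the notes above, while the model path, prompt text and the `executor.Context` access path are assumptions.

```csharp
using System;
using LLama;
using LLama.Batched;
using LLama.Common;

// Sketch: Create() returns an un-prompted conversation, and prompting goes
// through Conversation's own overloads (string shown here; a token-list
// overload is implied above).
var parameters = new ModelParams("model.gguf");
using var weights = LLamaWeights.LoadFromFile(parameters);
using var executor = new BatchedExecutor(weights, parameters);

// Previously executor.Prompt("...") created *and* prompted in one call.
var conversation = executor.Create();
conversation.Prompt("Tell me a story about a robot.");

// The new BatchSize property exposes how many tokens fit in a single batch.
Console.WriteLine($"Batch size: {executor.Context.BatchSize}");
```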
- Modified library loading to be based on `SetDllImportResolver`. This replaces the built-in loading system and ensures there can't be two libraries loaded at once (see the sketch after this list).
- llava and llama are loaded separately, as needed.
- All the previous loading logic is still used, within the `SetDllImportResolver`
- Split out CUDA, AVX and MacOS paths to separate helper methods.
- `Description` now specifies if it is for `llama` or `llava`
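For reference, a generic sketch of the `SetDllImportResolver` pattern this loading is built on; the library names and selection logic are illustrative, not the exact LLamaSharp resolver.

```csharp
using System;
using System.Reflection;
using System.Runtime.InteropServices;

// Generic sketch: every [DllImport] in this assembly is routed through one
// resolver, so llama and llava can be resolved lazily and loaded only once.
static class NativeLoadingSketch
{
    public static void Install()
    {
        NativeLibrary.SetDllImportResolver(typeof(NativeLoadingSketch).Assembly, Resolve);
    }

    private static IntPtr Resolve(string libraryName, Assembly assembly, DllImportSearchPath? searchPath)
    {
        // The real resolver picks CUDA / AVX / MacOS specific paths here.
        if (libraryName == "llama" || libraryName == "llava")
            return NativeLibrary.Load(SelectBinaryPath(libraryName));

        // IntPtr.Zero falls back to the default loading behaviour.
        return IntPtr.Zero;
    }

    private static string SelectBinaryPath(string libraryName)
    {
        // Placeholder for the CUDA/AVX/MacOS selection described above.
        return libraryName;
    }
}
```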
* Add llava binaries and update all binaries for the tests
* Llava API + LlavaTest (preliminary)
* First prototype of Load + Unit Test
* Temporarily run tests on the LlavaAPI branch
* Disable the Embed test to review the rest of the tests
* Restore Embedding test
* Use BatchThread to eval image embeddings
Tested the Threads default value to ensure it doesn't cause problems.
* Rename test file
* Update action versions
* Test only one method, no release embeddings
* Revert "Test only one method, no release embeddings"
This reverts commit 264e176dcc.
* Correct API call
* Only test llava related functionality
* CUDA and CLBlast binaries
* Restore build policy
* Changes related to code review
* Add SafeHandles
* Set overwrite to upload-artifact@v4
* Revert to upload-artifact@v3
* revert to upload-artifact@v3
* Added a lock object to `SafeLlamaModelHandle` which all calls to `llama_decode` (in `SafeLLamaContextHandle`) take first. This prevents two contexts from running inference on the same model at the same time, which seems to be unsafe in llama.cpp.
* Modified the lock to be global over _all_ inferences. This seems to be necessary (at least with the CUDA backend).
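Illustratively, the locking described in these two items amounts to something like the following sketch; the names are illustrative, and the real lock lives in `SafeLlamaModelHandle` and is taken around `llama_decode` in `SafeLLamaContextHandle`.

```csharp
using System;

// Sketch of the global-lock idea: all decode calls, across all contexts and
// all models, serialize on one shared lock object. Names are illustrative.
static class GlobalInferenceLockSketch
{
    // A single lock shared by every context; with the CUDA backend even
    // separate models apparently cannot run inference concurrently.
    private static readonly object GlobalInferenceLock = new();

    public static int Decode(Func<int> nativeDecode)
    {
        // e.g. nativeDecode = () => NativeApi.llama_decode(ctx, batch)
        lock (GlobalInferenceLock)
            return nativeDecode();
    }
}
```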