DLAI Local LLMs with Llamafile: Notes (Complete)

001:开始本地大型语言模型的 Llamafile|Beginning Llamafile for Local Large Language Models (LLMs) p01 0_与您的导师 Alfredo Deza 会面.zh_en -BV1e6421Z7sg_p1-

Hi, my name is Alfredo Deza, and in this course we will see a lot about open source models:

how to interact with them, and where to find them,

and some of the flavors of open source models you might want to interact with. We'll go through certain scenarios, like using the models in the browser

with JavaScript, which I think is pretty incredible, but also using them with Python libraries,

which is very common, as well as Rust. We'll see some of the differences between Python and Rust and why you would choose one over the other, while concentrating on making everything very accessible and simple, in some cases with off-the-shelf, ready-to-use local models. We'll be building interesting projects, and you will be able to apply a lot of these concepts in hands-on

labs with very practical examples, so you can actually go in and try out the concepts we cover throughout this course. I have several years of experience teaching machine learning and Python, and I've been a software engineer and a system administrator in the past. Through my approach to demonstrating these examples, the repositories, and all the lessons, you will be able to pick up a bit of that experience through very practical, straightforward examples. Hopefully you'll get a good understanding of how to work with open source models and get to apply it in your own projects with your own ideas.

002:开始本地大型语言模型的 Llamafile|Beginning Llamafile for Local Large Language Models (LLMs) p02 1_由 Mozilla 介绍 Llamafile 概述.zh_en -BV1e6421Z7sg_p2-

Hey there, my name is Stephen Hood. I work for Mozilla, and my role there is to oversee open source AI projects and initiatives. One of my projects is llamafile, which we're here to talk about today. llamafile is a project that makes it very easy to use open models on any consumer hardware, without much technical understanding or knowledge, and with very little installation or configuration. It takes what is really a stack of complexity for using open source AI and collapses it down to just a file that you download and run.

We were inspired to create llamafile by our own research into the state of open source AI technology.

We found that there's a tremendous amount of activity today in the open source AI space, but a lot of the tech is still too hard to use. There are a huge number of models to choose from to begin with, and once you have those models, they come in different formats, and the formats keep changing as the technology evolves. Then you have to decide what runtime you're going to use, that is, how you're actually going to get the model up and running in order to do inference against it. And depending on your model,

your platform, and your chosen runtime, you may have a lot of things to do: you may have to compile C or C++ code and make sure you have the right toolchain installed, you may have to install and manage Python dependencies, and in some cases you have to install native software.

And sometimes all these things come together, and it can be complicated.

Our concern has been that the harder it is to use open source AI,

the less likely developers are to adopt it, or the more slowly they will adopt it.

And that could hinder its evolution, which we think is incredibly important.

We think that having competitive, advanced open source AI technology in the hands of the public

is critical to ensuring that AI does not end up being dominated by a handful of commercial interests,

the way the Internet has arguably become. And that's a cause we care about a lot at Mozilla,

the makers of Firefox. We care about the web, and therefore we have to care about AI as well,

because it's already transforming the way we all use the web and the way we use computers.

So with that in mind, we created llamafile, and what we did was take two different projects and combine them in a unique way.

The first project is llama.cpp. This is a very popular open source project that makes it possible to run quantized open models on everyday consumer hardware.

These are models that are compressed down so they require less memory to run,

and the software itself is a highly efficient C++-based inference engine that can execute these models on everyday hardware, whether that be MacBooks, Windows machines, or Linux boxes.

So we took that project and combined it with another one called Cosmopolitan.

Cosmopolitan is a project created by Justine Tunney, and what it does is make it possible to compile C and C++ programs in a way that lets them run on almost any computer. I don't mean

you create different executables, one for Windows, one for Linux, one for Mac, and so forth.

The same compiled file runs unaltered on all these different platforms.

You can even run them on Raspberry Pis. It's a pretty remarkable bit of technology that she has created.

Working with Justine, we combined these two projects.

What that allows us to do is take any open weights in the GGUF format,

which is a popular open model format today (you can find these on Hugging Face and various other places),

and then wrap them in the llamafile executable code,

which adds only 40 or 50 megabytes to what is a multi-gigabyte file, basically a rounding error. Now that file can be distributed, and people can download it no matter what kind of computer they're running, and it will just work when they run it.

It brings up their web browser with a web-based, locally hosted chatbot interface.

The AI is running entirely on the user's local machine.

You could unplug it from the Internet at that point, and it would continue to work.

So it's 100% local, and it's 100% private to the user, so they can use it with confidence that no one else is listening in,

and no one else is doing anything with their data. And this works, like I said, on almost any computer.

If you have a GPU, a graphics processing unit, which can accelerate large language model inference,

then llamafile will detect it, whether it be NVIDIA or AMD. This is a big deal, because quite often AMD does not have the same support in the AI and machine learning space that NVIDIA has,

so we've been able to address that as well. On our website and our GitHub repo, we have

a number of example llamafiles that you can download and use,

and in future videos I can talk more about how to create your own llamafiles,

how to use llamafile yourself, and how to also use it as a developer,

as a drop-in replacement for OpenAI. Hope you enjoy using llamafile. Thanks a lot.

003:开始本地大型语言模型的 Llamafile|Beginning Llamafile for Local Large Language Models (LLMs) p03 2_使用 Llamafile API.zh_en -BV1e6421Z7sg_p3-

Hey there, I just want to talk to you about llamafile and its server mode. When you bring up any llamafile, by default

it also brings up a server in the background that provides an OpenAI-compatible API endpoint. It mimics the OpenAI API signature for completions and other functions,

but instead of using OpenAI, it uses the llamafile you're running on your local machine.

What this means is you can take code that's been written to work with OpenAI and use it with a llamafile.

This lets you switch from a commercial, centralized offering like OpenAI to an open source, freely available system that's under your control.

And you can run this on your local machine, or you could also containerize a llamafile and run it on a server if you want; people do both.

Here in Visual Studio Code I have a quick example. This is some very simple Python code that actually uses the OpenAI Python client library,

and all I've done is change the base URL to point to my localhost, where I currently have a llamafile running; it's Mistral 7B. Then I have some very simple code where I set a system prompt and ask it to tell me a short story about llamas.

If I run this over in my console, it's going to call that endpoint. It thinks it's talking to OpenAI, but it's really talking to a model running locally on my machine, and this is the output, which I'm just dumping raw to the console. You can do this in other ways too, for example in JavaScript. Here's some very simple JavaScript that uses straight-up HTTP requests to talk to that same API signature, but instead of pointing to OpenAI, it points to my localhost. It does the same thing, but runs inside a browser, so I can show you in my browser a very simple interface:

I ask it a question, I submit, this all happens through JavaScript, and it comes back with a response.

Again, I'm just dumping the raw content to the console.

So this is something we have out today; it works today in llamafile.

When you create your own llamafiles,

you can create them so they only run in server mode and don't bring up any of the web UI;

they're just purely an API server, so you can optimize for that if it's your use case. We really hope developers enjoy using this. We want to make it easier for people to switch from centralized commercial offerings to open source offerings, because we think that will help accelerate all the progress we want to see in the open source space.

Thanks a lot.

004:开始本地大型语言模型的 Llamafile|Beginning Llamafile for Local Large Language Models (LLMs) p04 3_创建 Llamafile.zh_en -BV1e6421Z7sg_p4-

Hey everybody, I just want to give you a quick demonstration of llamafile. To get a llamafile,

simply go to llamafile.ai, or search for llamafile, and you will find our GitHub repository. The readme in this repository has links to a number of llamafiles we've created for you as examples. These are all hosted on Hugging Face, where you can find others as well, and you simply download them. As you can see in my console window, I've downloaded several of these, and I'm going to run one right now: Mistral 7B, a recently released small model that has great performance and an open license. If I simply type it in and run it like an executable, that's what happens: it loads the model into memory and automatically brings up a chat interface I can engage with. I can ask it the canonical

model evaluation question: please tell me a short story about llamas. And it will start doing it.

So this is an AI running entirely locally on my machine; it is not connecting to the internet at all.

It is 100% open source, using an open model, and it's running in total privacy and under my control.

Now, it's running at a speed that's less than you would get from ChatGPT,

but it's a serviceable speed. And this is just using my CPU.

It's not using any other hardware I might happen to have, which could accelerate inference.

But even at this speed, we're still getting usable performance here: 11.34 tokens per second.

That's not too bad, but let's run it again, this time passing a command line parameter.

This tells it to load as many layers of the model as it can into my GPU's VRAM and then use the GPU for acceleration.

So if I run this again, the interface comes back up,

I ask the same question, and you will notice a dramatic difference in inference performance:

almost 120 tokens per second. This is the power of GPU acceleration, and llamafile makes it very easy to use out of the box.

It works with NVIDIA GPUs, but it also works with AMD GPUs, which is much less common.

NVIDIA really has a stranglehold on the market, and AMD cards are typically harder to get working. We're very happy to say that with llamafile they just work out of the box.

Let me show you another example. This is a model called LLaVA. LLaVA is a multimodal model, which means it can accept not just text as input, but also images. So if I load an image here (I've got an example) and just say, "What is in this image?",

and ask it, you will see that it analyzes the image and gives me a description, and it did a pretty good job.

It's not aware that this is the Terminator; in other runs when I've tried this,

it has actually figured that out. But as an open model,

I think this is a very impressive capability for the open source community to have, and it works just fine in llamafile.

Last, let me show you one more example, a model called Rocket 3B. This is a very,

very small model that's designed to work in constrained environments, and because it's smaller, it's not going to be as powerful or smart as larger models.

But it's pretty amazing that with no GPU acceleration, just running on CPU, I can ask it that same question and get a short but very quick answer, and it's not a bad answer. The beauty of this is that with llamafile you could run this model, and others of the same size, on something as modest as a Raspberry Pi, a $35 credit-card-sized computer. It doesn't get much simpler than that,

yet it can run an open source, pretty capable AI, and I think that says a lot about where we are in terms of progress in open source AI.

Lastly, I just want to show you how to make your own llamafile, if you'd like. It works with any GGUF file, which is a very popular current format for open model weights. You can go on Hugging Face and find thousands, if not tens of thousands, of GGUF files ready for use at different levels of quantization. Here I've downloaded a version of the Mistral model that's been fine-tuned and quantized at four bits, and with a single command I can turn this data file, which currently can't do anything on its own, into an executable. I just run llamafile-convert with

the name of the GGUF file, hit Enter, and in a few seconds it turns it into a .llamafile,

which I can now run

just like the other ones I've been showing you. That file is now an executable:

it brings up an interface, and I can do everything I want with it, just like the other models.

So it's very easy to get your own llamafiles running if you want to.

I hope you enjoyed the demonstration. Thanks a lot.

005:开始本地大型语言模型的 Llamafile|Beginning Llamafile for Local Large Language Models (LLMs) p05 4_使用 Cosmopolitan 构建便携式二进制文件.zh_en -BV1e6421Z7sg_p5-

This is a library called Cosmopolitan, which is at the core of llamafile,

and which allows it to be a portable large language model.

So let's go ahead and take a look at how this actually works.

We have here Cosmopolitan Libc, which makes C a build-once run-anywhere language, like Java. So that's one of the cool things about it.

So all you need to do is go to the releases right here, unzip it, and then use the cosmocc compiler, where we say cosmocc -o hello:

hello would be the output, and the C file would be hello.c. What's also kind of cool, for old-school Unix people,

is that there's an strace facility here as well, so you can actually see what's happening under the hood. This was a utility a lot of sysadmins would use back in the day, but it's also very useful in modern times to really dig into all the calls that are happening in a piece of code.

So let's take a look at how we would do this. First up, I have a lab here

that I'm going to set up, and in this lab what I'll first do is create a hello world file.

So let's say touch hello.c, then refresh this for a second, go to hello.c, and paste in a hello example.

First up, we can paste this in. You can see we just include the standard I/O

header file, and then we have a main function that goes through and prints hello world.

So how do we run this? Well, the compiler is inside the bin directory, so we'll run bin/cosmocc, then -o because we want to make a portable binary called hello,

and then we'll feed it that C file. There we go, pretty simple. Then we just run it, and there we go:

we got hello world. So it's really easy to get started building portable binaries with Cosmopolitan.

So let's do something a little bit more complex next,

which is to build a Marco Polo program. So let's go ahead and do that: we'll say touch, and we'll call this one marco_polo.c.

Perfect. Again we'll refresh right here, and then I'm going to open it up. We'll need to do a little bit more this time:

we'll include some other header files, in this case the standard I/O library and string.h as well, and I'll make a main function with a little bit of logic. Inside, the main function looks at the arguments: if the argument is Marco, it returns Polo; otherwise it just gives us, essentially, a help menu. So it's a really simple program, but that's what's nice about C: you can write really elegant, simple programs. And in the case of this cosmopolitan cross-platform binary portability tool, all we have to do is compile it.

So we say bin/cosmocc -o marco_polo and the C file. There we go,

we compile, pretty straightforward, and now we can just run it.

If we say Marco, we'll see we get back Polo,

and then if we go through and say something that's not Marco, we can see that, in fact, it doesn't work:

it gives us the help menu here that says, hey, you've got to run it this way.

So it's a very nice library to get started with. And again, why do I care about this thing?

Because this is the heart and soul of llamafile. If we go to the llamafile project,

which allows you to run and distribute large language models,

you can see that Cosmopolitan is actually a key component of it.

So it is nice to go down to first principles sometimes, play around with that core library to see exactly how it works, and build some things around it; then later you'll appreciate how elegant llamafile really is.

006:开始本地大型语言模型的 Llamafile|Beginning Llamafile for Local Large Language Models (LLMs) p06 5_使用 Cosmopolitan 构建短语生成器.zh_en -BV1e6421Z7sg_p6-

Here is an install of the cosmopolitan portable binary framework, and what I'm going to do with it is build a portable phrase generator command line tool.

Typically, command line tools are the things that benefit most from portability, and what's really nice about the cosmopolitan framework is that it helps you build out these portable binaries.

In this case it's something simple, just for teaching, but you could build something like a static website generator or some kind of AI tool with this approach.

Let's go ahead and start first with what it will build.

You can see here that it's going to build a binary that allows you to specify the count and then a phrase. You also have the option of longer parameters here,

so you can say --count, or you can do --phrase as well.

To start with, we include the standard library headers:

stdio.h for input/output, string.h for string functions like string comparison, and stdlib.h, which declares utility functions like atoi for converting strings to integers.

Next we have a function here that allows us to repeat a phrase:

you can see that it repeats a phrase a number of times; it accepts an integer and then the phrase, and it goes through a for loop. Then comes the main logic: we initialize count, parse through the command line arguments, and use string comparisons: if we have --count or -c,

we do one operation; otherwise, if we have --phrase or -p,

we do the other operation, and at the very end we generate the phrase. So how do we actually compile this thing? Well, the cosmopolitan framework has its own compiler,

so we just say bin/cosmocc, then -o

and the name of the binary we want to create (again, portable),

and then we feed it the phrase generator C file. There we go, pretty easy to create;

you can see it creates all these different artifacts here as well. Now, in order to run it,

I can just take my command right here and run it, and we can change it,

putting some different things in here, for example a count of 7, or whatever you want for the total number of repetitions. The idea here is that a phrase generator

is the kind of demo I like to build when I'm testing out a new command line utility framework,

but the advantage of the cosmopolitan framework is that I build this binary once and it can be deployed to many different platforms at the same time:

Windows, Linux, Unix, etc. So this is a very compelling technology,

especially for things like large language models, which is again why we're showing this: llamafile uses it to make portability much more straightforward.

007:开始本地大型语言模型的 Llamafile|Beginning Llamafile for Local Large Language Models (LLMs) p07 6_开始使用 Llamafile.zh_en -BV1e6421Z7sg_p7-

Here we have the llamafile project from Mozilla, probably the easiest possible way to run a local large language model.

Many people are interested in running large language models locally because of the privacy aspects, and also it's free:

if you're able to use your own machine, you can dive right into llamafile here.

So let's go ahead and take a look at the structure of this project first.

You can see that llamafile lets you distribute and run LLMs with a single file, and what's really fascinating is that, because it uses the Cosmopolitan Libc library,

it is able to collapse everything into a single executable file, the llamafile.

And even though it's called llamafile, it's not actually tied to any particular large language model.

There's an example here where you could download LLaVA, which is able to read images,

which is a pretty cool project, if you want to play around with that.

The one that I think is among the more interesting ones is Mixtral.

Mixtral is one of the better performing open source models here, and you can see it's Apache 2.0 licensed and it's a 30 GB file.

So it's a pretty big download, but all you have to do is download it and then run it with dot-slash. Also, what's pretty cool about this is there's a Python API that mimics the OpenAI API, so you can basically convert, or upgrade, or even graduate from closed commercial proprietary models to open models using this API,

and you can also use curl commands, and you can see those curl commands here.

So it's a very fascinating project; let's go ahead and take a look at how this works.

I'm going to bring my terminal over here; you can see I've already downloaded it and am running it.

It's that simple. Here we see that llama.cpp is running locally.

And if I wanted to get back to some kind of default state, I could reset this,

and it would restore all of these defaults. What I'm going to do is change this bot's name to AI.

Then I'm going to ask it to do something. Let's do a Python hello world.

We'll type:

show me a Python hello world function. All right, great,

let's go ahead and hit send, and it's going to use my Mac GPU here to get a very fast response.

Some of the metrics around Mixtral show that it's actually just as good as closed proprietary models, and you can see here that this output looks pretty good.

If I wanted to, I could keep doing more and more functions, like building a Python function that adds two numbers. So the real takeaway here is that it is actually very straightforward to run your own local model, and the big advantages when you're using llamafile are: first, privacy, because you're not sending your data to some company where you don't know what they're doing with it; second, performance is actually much better than calling an external API, because there's less latency;

and third, it's free. So there are some significant advantages to checking out

the llamafile project. Go ahead and take a look at it, and let me know what you think.

008:开始本地大型语言模型的 Llamafile|Beginning Llamafile for Local Large Language Models (LLMs) p08 7_Llamafile 本地系统指标.zh_en -BV1e6421Z7sg_p8-

Let's take a look here at how we can monitor system metrics with llamafile. First up,

I'm going to look at this model's standing in terms of an Elo rating. This shows how people perceive the strength of the model, and you can see here that this Mixtral model is right in the running, with a very small difference from commercial models, which suggests that a huge percentage of people should be interested in running it locally, as long as the performance is good enough. So, is the performance good enough?

Let's take a look again at this llamafile project. I've got it running inside this web browser window, and I also have all of these metrics right here letting us see what's happening. The things to pay attention to could be, for example,

the GPU time; another would be all the cores; and also this M2 Ultra

GPU. The M2 Ultra I have has 60 GPU cores, and llamafile is able to leverage all of those GPU cores on macOS because of the great work by the authors of llamafile and also llama.cpp. So let's try it: we'll ask it to show a Python function, and if I scroll down here, let's see what happens in terms of the GPU.

We'll see that, in fact, this spikes here, and we should see some GPU usage here as well,

so we can see that the GPU is actually being hit.

And we see this running again, and we could do another one. We could say:

give me an example of a recursive function in Python, and let's see what it does. And, again,

there we go: we see a recursive function in Python, and we can also see that, in fact,

it is using the GPU. So you can see the performance is actually incredible,

and we can see in the historical GPU metrics that it's able to leverage it.

And so for people who are using commercial models:

obviously there's some usefulness in them, but it definitely raises the question of why, long term, you would use commercial models if you can just download a model that's good enough, the performance is incredible, and it's free.

So something to think about as you're playing around with llamafile, or even going into commercial arrangements for large language models, is to first take a look at what the local performance really is:

look at the metrics, and you'll see that it's very impressive.
