GitHub Models

In this section of the workshop you will start to use actual LLMs, including those from OpenAI! You're going to use a service called GitHub Models that hosts LLMs. You can use GitHub Models, with limits, for free. All you need is the same free, personal GitHub account that you used to create a GitHub Codespace in the previous section.

Exploring the GitHub Models Catalog

To get started with GitHub Models go to https://gh.io/models.

This takes you to GitHub Models on the GitHub Marketplace, where you can see a few of the hosted models. Click the link to explore the full model catalog to see the entire list of models hosted on GitHub Models.

You can filter this list. Click the Publisher dropdown, and select Azure OpenAI Service. This shows all of the models from the Azure OpenAI Service. It includes well-known models like GPT-4o and GPT-4.1. Let's take a closer look at one of the models. Click on the card for OpenAI GPT-4.1-mini.

You will see the information page for the GPT-4.1-mini model. In the right sidebar, direct your attention to the Free rate limit tier. Since you are using GitHub Models for free, this applies to you. For GPT-4.1-mini, the rate limit tier is Low. Click on it to see what this means.

You'll see a table with the rate limits for different tiers, models and pricing plans. For GitHub Models, your usage will fall under the Copilot Free pricing plan. Notice that the Low tier allows you 15 requests per minute and 150 requests per day. Granted, compared to a production application this is not a lot, but it's actually quite generous for exploring the model's capabilities and more than enough for this workshop.
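
If you later call the models from code, it's worth staying under these limits on the client side. The sketch below is not part of GitHub Models; the class and method names are our own, and it assumes a sliding-window reading of the "per minute" and "per day" limits:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Client-side guard to stay under a requests-per-window limit."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.timestamps = deque()  # times of requests still inside the window

    def allow(self, now=None):
        """Return True (and record the request) if another request fits."""
        now = time.monotonic() if now is None else now
        # Drop timestamps that have fallen out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return True
        return False

# The Low tier allows 15 requests per minute and 150 per day.
per_minute = SlidingWindowLimiter(max_requests=15, window_seconds=60)
per_day = SlidingWindowLimiter(max_requests=150, window_seconds=86_400)
```

Before each request, check `per_minute.allow() and per_day.allow()` and wait or back off when either returns False.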

Also notice the tokens per request. For the Low rate limit tier this is 8000 tokens in and 4000 tokens out. The "tokens in" is the maximum number of tokens that may be submitted with a request. Recall that a token is, on average, about 3/4 of a word, so 8000 tokens is approximately 6000 words. This comes out to about 15-20 written pages. The "tokens out" is the maximum number of tokens that may be generated by the LLM. So 4000 tokens would be approximately 3000 words. And you can do this 150 times a day.
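
This sizing arithmetic can be sketched in a few lines. The 3/4-words-per-token figure is only an average; real tokenizers vary by model and language:

```python
# Back-of-the-envelope sizing for the Low tier, assuming ~0.75 words
# per token (an approximation, not an exact tokenizer property).
WORDS_PER_TOKEN = 0.75

def tokens_to_words(tokens, words_per_token=WORDS_PER_TOKEN):
    """Rough word-count estimate for a given token budget."""
    return int(tokens * words_per_token)

max_input_words = tokens_to_words(8000)   # tokens in  -> 6000 words
max_output_words = tokens_to_words(4000)  # tokens out -> 3000 words
```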

Notice that the High rate limit tier is more restrictive; it is applied to the larger models. There are also special limits for the embedding models, which we won't make use of in this workshop, and some models, such as Grok-3 and DeepSeek, have custom rate limits that depend on the model itself. GPT-5 and the OpenAI reasoning models are not available with the Copilot Free tier. But we will still have plenty of free models to use for the workshop.

Note: While GitHub Models does not offer free access to GPT-5 models, OpenAI itself has a free tier that includes GPT-5 models. There are three caveats:

1. To get the free tier, you must add a billing method.
2. If you exceed the free tier limit, you will automatically be billed for any overages.
3. Any data used with the free tier will be used for training models and will not be private.

See the OpenAI blog for more details.

Experimenting with Models Using a Playground

Go back to the model information page. Click on the Playground button in the upper right of the page. This brings up a ChatGPT-like interface where you can interact with the model. Enter a prompt in the text box at the bottom with the placeholder Type your prompt.... Something like:

What are three advantages of OpenAI GPT-4.1 over GPT-3.5?

Again, similar to ChatGPT, the playground will display the generated response and render any markdown. Also notice in the upper left of the playground, you can see the total number of input and output tokens, as well as the time it took to generate the response.

In the left sidebar you can again see information about the model. For example, the training cutoff date is May of 2024. The Context value is the number of input and output tokens allowed per request on a paid pricing plan. On the free plan, you can ignore this because, as we just saw, you are restricted to 8000 tokens in and 4000 tokens out.

Model Parameters

At the top of the sidebar, click the Parameters tab. Here you can set values to configure how the model behaves. The first of these is the system message. In an LLM-based chat application, the system message defines how the model behaves by configuring goals, rules and restrictions for the model to observe. The often-used default system message is something like "You are a helpful assistant." However, this is too generic for most chatbots.
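
When you move from the playground to code, the system message is conventionally the first entry in the list of messages sent with every request. A minimal sketch, using the OpenAI-style `messages` shape (the field names are that convention's, not something specific to the playground):

```python
# The system message rides along with every request and is read first,
# so its goals, rules and restrictions apply to the whole conversation.
system_prompt = "You are a helpful assistant."  # the generic default

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "What are three advantages of OpenAI GPT-4.1 over GPT-3.5?"},
]
```

To change the chatbot's behavior, you replace `system_prompt` with something more specific, which is exactly what the next step does.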

Look in the 03-github-models folder in the GitHub repository for this workshop. You'll see a file named customer_service_prompt.txt. Open it to see an example of a more specific system prompt. It outlines:

* Goals: "greet the customer", "acknowledge their concern", "provide clear, step-by-step help".
* Rules: "use a friendly and professional tone", "apologize when appropriate".
* Restrictions: "if you don't know the answer, offer to escalate or find more information".

Copy this prompt and paste it in the System prompt textarea in the Parameters tab.

Scroll down in the tab to see several more values that influence the behavior of the LLM. The first is self-explanatory: Max Completion Tokens limits the number of tokens produced in the response. Keep in mind that on the Copilot Free plan, you are restricted to rate limits that could be lower than the value set in the playground.

The next two parameters, Temperature and Top P, cooperate to determine the expressiveness of the response. When generating a response, the LLM collects a pool of candidate tokens that are closely related to the request, and the response is then composed of tokens selected from that pool. The Top P parameter determines the size of the pool of tokens. The Temperature parameter determines how randomly the tokens will be selected. Thus setting the Temperature and Top P to low values will yield more predictable and repeatable responses. On the other end of the scale, setting them to high values will yield more diverse and creative responses.
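
The interplay can be illustrated with a toy sampler. This is a simplified sketch, not the model's actual decoding code, and the token names and logit values are made up; it assumes temperature rescales the probabilities and Top P then trims the pool:

```python
import math
import random

def top_p_pool(token_probs, top_p):
    """Keep the smallest set of likeliest tokens whose cumulative probability reaches top_p."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    pool, cumulative = [], 0.0
    for token, prob in ranked:
        pool.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    return pool

def sample(token_logits, temperature, top_p, rng=random):
    """Temperature (> 0) reshapes the distribution, Top P trims it, then we sample."""
    # Softmax with temperature: low values sharpen the distribution,
    # high values flatten it toward uniform.
    scaled = {t: math.exp(l / temperature) for t, l in token_logits.items()}
    total = sum(scaled.values())
    probs = {t: v / total for t, v in scaled.items()}
    tokens, weights = zip(*top_p_pool(probs, top_p))
    return rng.choices(tokens, weights=weights, k=1)[0]

logits = {"refund": 2.0, "return": 1.5, "escalate": 0.5, "banana": -2.0}
# Low Temperature + low Top P: the pool collapses to the single likeliest token.
print(sample(logits, temperature=0.2, top_p=0.5))  # prints "refund"
```

Raising `temperature` and `top_p` lets lower-ranked tokens like "escalate" into the pool and gives them a real chance of being picked.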

Using the controls in the Parameters tab set:

* Max Completion Tokens to 2000
* Temperature to 0.3
* Top P to 0.7

These settings, a low Temperature and a high Top P, will yield a response that is professional (thanks to the low Temperature) while not as dry as a lower Top P value would produce.
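
For reference, here is how these same values would typically appear in an OpenAI-style request body if you called the model from code. The model identifier and the system-prompt placeholder are hypothetical, and the exact name of the output-cap field varies by API version:

```python
# The playground settings above, expressed as an OpenAI-style chat-completion
# request body. "max_tokens" is the common name for the output cap; some newer
# API versions call it "max_completion_tokens" instead.
request_body = {
    "model": "openai/gpt-4.1-mini",  # hypothetical identifier; check the model page
    "messages": [
        # System prompt first; in practice, paste in customer_service_prompt.txt.
        {"role": "system", "content": "<contents of customer_service_prompt.txt>"},
        {"role": "user", "content": "I was charged twice for the same item. What do I do?"},
    ],
    "max_tokens": 2000,   # Max Completion Tokens
    "temperature": 0.3,   # low: more predictable, professional wording
    "top_p": 0.7,         # moderately high: keeps some variety in the token pool
}
```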

Chatting with the Models

Using the text input at the bottom of the playground, send the prompt "I was charged twice for the same item. What do I do?" You'll see the response in the chat session window. Keep in mind that playground usage counts towards your free quota, so you just used one of your 150 daily requests for GPT-4.1-mini.

The playground also has a feature that allows you to compare the responses from two models. Next to the Model dropdown list in the upper left of the playground, click the Compare button. Select OpenAI GPT-4.1 from the list of models. Notice that the Parameters were copied over from the existing GPT-4.1-mini model. Type the prompt "I was charged twice for the same item. What do I do?" into one model's prompt box. It will be copied into the other. Press the send button.

The response for the full GPT-4.1 model might take a little longer to generate. It also has more detail and uses Markdown formatting, which is rendered in the chat session window.

Try another prompt: I bought the wrong item. How can I return it?. Again, the response from GPT-4.1 is more detailed, and it uses Markdown formatting.