azure.ai.inference package

class azure.ai.inference.ChatCompletionsClient(endpoint: str, credential: AzureKeyCredential | TokenCredential, *, frequency_penalty: float | None = None, presence_penalty: float | None = None, temperature: float | None = None, top_p: float | None = None, max_tokens: int | None = None, response_format: Literal['text', 'json_object'] | JsonSchemaFormat | None = None, stop: List[str] | None = None, tools: List[ChatCompletionsToolDefinition] | None = None, tool_choice: str | ChatCompletionsToolChoicePreset | ChatCompletionsNamedToolChoice | None = None, seed: int | None = None, model: str | None = None, model_extras: Dict[str, Any] | None = None, **kwargs: Any)[source]

ChatCompletionsClient. A synchronous client for getting chat completions from an AI model inference endpoint.

Parameters:
  • endpoint (str) – Service endpoint URL for AI model inference. Required.

  • credential (AzureKeyCredential or TokenCredential) – Credential used to authenticate requests to the service. Is either an AzureKeyCredential type or a TokenCredential type. Required.

Keyword Arguments:
  • frequency_penalty (float) – A value that influences the probability of generated tokens appearing based on their cumulative frequency in generated text. Positive values will make tokens less likely to appear as their frequency increases and decrease the likelihood of the model repeating the same statements verbatim. Supported range is [-2, 2]. Default value is None.

  • presence_penalty (float) – A value that influences the probability of generated tokens appearing based on their existing presence in generated text. Positive values will make tokens less likely to appear when they already exist and increase the model’s likelihood to output new topics. Supported range is [-2, 2]. Default value is None.

  • temperature (float) – The sampling temperature to use that controls the apparent creativity of generated completions. Higher values will make output more random while lower values will make results more focused and deterministic. It is not recommended to modify temperature and top_p for the same completions request as the interaction of these two settings is difficult to predict. Supported range is [0, 1]. Default value is None.

  • top_p (float) – An alternative to sampling with temperature called nucleus sampling. This value causes the model to consider the results of tokens with the provided probability mass. As an example, a value of 0.15 will cause only the tokens comprising the top 15% of probability mass to be considered. It is not recommended to modify temperature and top_p for the same completions request as the interaction of these two settings is difficult to predict. Supported range is [0, 1]. Default value is None.

  • max_tokens (int) – The maximum number of tokens to generate. Default value is None.

  • response_format (Literal['text', 'json_object'] or JsonSchemaFormat) – The format that the AI model must output. AI chat completions models typically output unformatted text by default. This is equivalent to setting “text” as the response_format. To output JSON format, without adhering to any schema, set to “json_object”. To output JSON format adhering to a provided schema, set this to an object of the class azure.ai.inference.models.JsonSchemaFormat. Default value is None.

  • stop (list[str]) – A collection of textual sequences that will end completions generation. Default value is None.

  • tools (list[ChatCompletionsToolDefinition]) – The available tool definitions that the chat completions request can use, including caller-defined functions. Default value is None.

  • tool_choice (str or ChatCompletionsToolChoicePreset or ChatCompletionsNamedToolChoice) – If specified, controls which (if any) of the provided tools the model may use for the chat completions response. Is either a str, a ChatCompletionsToolChoicePreset, or a ChatCompletionsNamedToolChoice type. Default value is None.

  • seed (int) – If specified, the system will make a best effort to sample deterministically such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed. Default value is None.

  • model (str) – ID of the specific AI model to use, if more than one model is available on the endpoint. Default value is None.

  • model_extras (dict[str, Any]) – Additional, model-specific parameters that are not in the standard request payload. They will be added as-is to the root of the JSON in the request body. How the service handles these extra parameters depends on the value of the extra-parameters request header. Default value is None.

  • api_version (str) – The API version to use for this operation. Default value is “2024-05-01-preview”. Note that overriding this default value may result in unsupported behavior.
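
The following is a minimal construction sketch using key authentication; the endpoint URL and key are placeholders for your own deployment values, and the optional keyword arguments shown are the chat completion settings documented above, set here as client-level defaults.

from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

# Placeholders: substitute your own endpoint and key.
client = ChatCompletionsClient(
    endpoint="https://<your-deployment>.inference.ai.azure.com",
    credential=AzureKeyCredential("<your-api-key>"),
    temperature=0.7,   # optional client-level setting (see keyword arguments above)
    max_tokens=512,    # optional client-level setting
)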

close() None[source]
complete(*, messages: List[ChatRequestMessage] | List[Dict[str, Any]], stream: Literal[False] = False, frequency_penalty: float | None = None, presence_penalty: float | None = None, temperature: float | None = None, top_p: float | None = None, max_tokens: int | None = None, response_format: Literal['text', 'json_object'] | JsonSchemaFormat | None = None, stop: List[str] | None = None, tools: List[ChatCompletionsToolDefinition] | None = None, tool_choice: str | ChatCompletionsToolChoicePreset | ChatCompletionsNamedToolChoice | None = None, seed: int | None = None, model: str | None = None, model_extras: Dict[str, Any] | None = None, **kwargs: Any) ChatCompletions[source]
complete(*, messages: List[ChatRequestMessage] | List[Dict[str, Any]], stream: Literal[True], frequency_penalty: float | None = None, presence_penalty: float | None = None, temperature: float | None = None, top_p: float | None = None, max_tokens: int | None = None, response_format: Literal['text', 'json_object'] | JsonSchemaFormat | None = None, stop: List[str] | None = None, tools: List[ChatCompletionsToolDefinition] | None = None, tool_choice: str | ChatCompletionsToolChoicePreset | ChatCompletionsNamedToolChoice | None = None, seed: int | None = None, model: str | None = None, model_extras: Dict[str, Any] | None = None, **kwargs: Any) Iterable[StreamingChatCompletionsUpdate]
complete(*, messages: List[ChatRequestMessage] | List[Dict[str, Any]], stream: bool | None = None, frequency_penalty: float | None = None, presence_penalty: float | None = None, temperature: float | None = None, top_p: float | None = None, max_tokens: int | None = None, response_format: Literal['text', 'json_object'] | JsonSchemaFormat | None = None, stop: List[str] | None = None, tools: List[ChatCompletionsToolDefinition] | None = None, tool_choice: str | ChatCompletionsToolChoicePreset | ChatCompletionsNamedToolChoice | None = None, seed: int | None = None, model: str | None = None, model_extras: Dict[str, Any] | None = None, **kwargs: Any) Iterable[StreamingChatCompletionsUpdate] | ChatCompletions
complete(body: MutableMapping[str, Any], *, content_type: str = 'application/json', **kwargs: Any) Iterable[StreamingChatCompletionsUpdate] | ChatCompletions
complete(body: IO[bytes], *, content_type: str = 'application/json', **kwargs: Any) Iterable[StreamingChatCompletionsUpdate] | ChatCompletions

Gets chat completions for the provided chat messages. Completions support a wide variety of tasks and generate text that continues from or “completes” provided prompt data. When using this method with stream=True, the response is streamed back to the client. Iterate over the resulting StreamingChatCompletions object to get content updates as they arrive.

Parameters:

body (JSON or IO[bytes]) – Is either a MutableMapping[str, Any] type (like a dictionary) or an IO[bytes] type that specifies the full request payload. Required.

Keyword Arguments:
  • messages (list[ChatRequestMessage] or list[dict[str, Any]]) – The collection of context messages associated with this chat completions request. Typical usage begins with a chat message for the System role that provides instructions for the behavior of the assistant, followed by alternating messages between the User and Assistant roles. Required.

  • stream (bool) – A value indicating whether chat completions should be streamed for this request. Default value is False. If streaming is enabled, the response will be a StreamingChatCompletions. Otherwise the response will be a ChatCompletions.

  • frequency_penalty (float) – A value that influences the probability of generated tokens appearing based on their cumulative frequency in generated text. Positive values will make tokens less likely to appear as their frequency increases and decrease the likelihood of the model repeating the same statements verbatim. Supported range is [-2, 2]. Default value is None.

  • presence_penalty (float) – A value that influences the probability of generated tokens appearing based on their existing presence in generated text. Positive values will make tokens less likely to appear when they already exist and increase the model’s likelihood to output new topics. Supported range is [-2, 2]. Default value is None.

  • temperature (float) – The sampling temperature to use that controls the apparent creativity of generated completions. Higher values will make output more random while lower values will make results more focused and deterministic. It is not recommended to modify temperature and top_p for the same completions request as the interaction of these two settings is difficult to predict. Supported range is [0, 1]. Default value is None.

  • top_p (float) – An alternative to sampling with temperature called nucleus sampling. This value causes the model to consider the results of tokens with the provided probability mass. As an example, a value of 0.15 will cause only the tokens comprising the top 15% of probability mass to be considered. It is not recommended to modify temperature and top_p for the same completions request as the interaction of these two settings is difficult to predict. Supported range is [0, 1]. Default value is None.

  • max_tokens (int) – The maximum number of tokens to generate. Default value is None.

  • response_format (Literal['text', 'json_object'] or JsonSchemaFormat) – The format that the AI model must output. AI chat completions models typically output unformatted text by default. This is equivalent to setting “text” as the response_format. To output JSON format, without adhering to any schema, set to “json_object”. To output JSON format adhering to a provided schema, set this to an object of the class azure.ai.inference.models.JsonSchemaFormat. Default value is None.

  • stop (list[str]) – A collection of textual sequences that will end completions generation. Default value is None.

  • tools (list[ChatCompletionsToolDefinition]) – The available tool definitions that the chat completions request can use, including caller-defined functions. Default value is None.

  • tool_choice (str or ChatCompletionsToolChoicePreset or ChatCompletionsNamedToolChoice) – If specified, controls which (if any) of the provided tools the model may use for the chat completions response. Is either a str, a ChatCompletionsToolChoicePreset, or a ChatCompletionsNamedToolChoice type. Default value is None.

  • seed (int) – If specified, the system will make a best effort to sample deterministically such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed. Default value is None.

  • model (str) – ID of the specific AI model to use, if more than one model is available on the endpoint. Default value is None.

  • model_extras (dict[str, Any]) – Additional, model-specific parameters that are not in the standard request payload. They will be added as-is to the root of the JSON in the request body. How the service handles these extra parameters depends on the value of the extra-parameters request header. Default value is None.

Returns:

ChatCompletions for non-streaming, or Iterable[StreamingChatCompletionsUpdate] for streaming.

Return type:

ChatCompletions or StreamingChatCompletions

Raises:

HttpResponseError
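
For illustration, assuming a client constructed as sketched above, a non-streaming call and a streaming call might look like the following. The message texts and the guard on empty streamed deltas are illustrative, and the SystemMessage and UserMessage helpers are assumed from the azure.ai.inference.models namespace (any list of ChatRequestMessage objects, or plain dicts, would work per the signature).

from azure.ai.inference.models import SystemMessage, UserMessage

# Non-streaming: returns a ChatCompletions object.
response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="How many feet are in a mile?"),
    ]
)
print(response.choices[0].message.content)

# Streaming: returns an iterable of StreamingChatCompletionsUpdate objects.
for update in client.complete(messages=[UserMessage(content="Tell me a joke.")], stream=True):
    if update.choices and update.choices[0].delta.content:
        print(update.choices[0].delta.content, end="")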

get_model_info(**kwargs: Any) ModelInfo[source]

Returns information about the AI model. The method makes a REST API call to the /info route on the given endpoint. This method will only work when using a Serverless API or Managed Compute endpoint. It will not work for a GitHub Models endpoint or an Azure OpenAI endpoint.

Returns:

ModelInfo. The ModelInfo is compatible with MutableMapping

Return type:

ModelInfo

Raises:

HttpResponseError
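
For example (illustrative; applies only to Serverless API or Managed Compute endpoints, as noted above). The ModelInfo fields shown are assumed for illustration.

info = client.get_model_info()
# Assumed ModelInfo fields, shown for illustration.
print(info.model_name, info.model_type, info.model_provider_name)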

send_request(request: HttpRequest, *, stream: bool = False, **kwargs: Any) HttpResponse[source]

Runs the network request through the client’s chained policies.

>>> from azure.core.rest import HttpRequest
>>> request = HttpRequest("GET", "https://www.example.org/")
>>> request
<HttpRequest [GET], url: 'https://www.example.org/'>
>>> response = client.send_request(request)
>>> response
<HttpResponse: 200 OK>

For more information on this code flow, see https://aka.ms/azsdk/dpcodegen/python/send_request

Parameters:

request (HttpRequest) – The network request you want to make. Required.

Keyword Arguments:

stream (bool) – Whether the response payload will be streamed. Defaults to False.

Returns:

The response of your network call. Does not do error handling on your response.

Return type:

HttpResponse

class azure.ai.inference.EmbeddingsClient(endpoint: str, credential: AzureKeyCredential | TokenCredential, *, dimensions: int | None = None, encoding_format: str | EmbeddingEncodingFormat | None = None, input_type: str | EmbeddingInputType | None = None, model: str | None = None, model_extras: Dict[str, Any] | None = None, **kwargs: Any)[source]

EmbeddingsClient. A synchronous client for getting text embeddings from an AI model inference endpoint.

Parameters:
  • endpoint (str) – Service endpoint URL for AI model inference. Required.

  • credential (AzureKeyCredential or TokenCredential) – Credential used to authenticate requests to the service. Is either an AzureKeyCredential type or a TokenCredential type. Required.

Keyword Arguments:
  • dimensions (int) – Optional. The number of dimensions the resulting output embeddings should have. Default value is None.

  • encoding_format (str or EmbeddingEncodingFormat) – Optional. The desired format for the returned embeddings. Known values are: “base64”, “binary”, “float”, “int8”, “ubinary”, and “uint8”. Default value is None.

  • input_type (str or EmbeddingInputType) – Optional. The type of the input. Known values are: “text”, “query”, and “document”. Default value is None.

  • model (str) – ID of the specific AI model to use, if more than one model is available on the endpoint. Default value is None.

  • model_extras (dict[str, Any]) – Additional, model-specific parameters that are not in the standard request payload. They will be added as-is to the root of the JSON in the request body. How the service handles these extra parameters depends on the value of the extra-parameters request header. Default value is None.

  • api_version (str) – The API version to use for this operation. Default value is “2024-05-01-preview”. Note that overriding this default value may result in unsupported behavior.

close() None[source]
embed(*, input: List[str], dimensions: int | None = None, encoding_format: str | EmbeddingEncodingFormat | None = None, input_type: str | EmbeddingInputType | None = None, model: str | None = None, model_extras: Dict[str, Any] | None = None, **kwargs: Any) EmbeddingsResult[source]
embed(body: JSON, *, content_type: str = 'application/json', **kwargs: Any) EmbeddingsResult
embed(body: IO[bytes], *, content_type: str = 'application/json', **kwargs: Any) EmbeddingsResult

Return the embedding vectors for given text prompts. The method makes a REST API call to the /embeddings route on the given endpoint.

Parameters:

body (JSON or IO[bytes]) – Is either a MutableMapping[str, Any] type (like a dictionary) or an IO[bytes] type that specifies the full request payload. Required.

Keyword Arguments:
  • input (list[str]) – Input text to embed, encoded as a string or array of tokens. To embed multiple inputs in a single request, pass an array of strings or array of token arrays. Required.

  • dimensions (int) – Optional. The number of dimensions the resulting output embeddings should have. Default value is None.

  • encoding_format (str or EmbeddingEncodingFormat) – Optional. The desired format for the returned embeddings. Known values are: “base64”, “binary”, “float”, “int8”, “ubinary”, and “uint8”. Default value is None.

  • input_type (str or EmbeddingInputType) – Optional. The type of the input. Known values are: “text”, “query”, and “document”. Default value is None.

  • model (str) – ID of the specific AI model to use, if more than one model is available on the endpoint. Default value is None.

  • model_extras (dict[str, Any]) – Additional, model-specific parameters that are not in the standard request payload. They will be added as-is to the root of the JSON in the request body. How the service handles these extra parameters depends on the value of the extra-parameters request header. Default value is None.

Returns:

EmbeddingsResult. The EmbeddingsResult is compatible with MutableMapping

Return type:

EmbeddingsResult

Raises:

HttpResponseError
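
A minimal sketch of a text embeddings call using key authentication; the endpoint, key, and input phrases are placeholders, and the result fields used (data, index, embedding) are assumed from the EmbeddingsResult model.

from azure.ai.inference import EmbeddingsClient
from azure.core.credentials import AzureKeyCredential

client = EmbeddingsClient(
    endpoint="https://<your-deployment>.inference.ai.azure.com",
    credential=AzureKeyCredential("<your-api-key>"),
)

result = client.embed(input=["first phrase", "second phrase"])
for item in result.data:
    # With the default float encoding, item.embedding is a list of floats.
    print(f"index {item.index}: {len(item.embedding)} dimensions")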

get_model_info(**kwargs: Any) ModelInfo[source]

Returns information about the AI model. The method makes a REST API call to the /info route on the given endpoint. This method will only work when using a Serverless API or Managed Compute endpoint. It will not work for a GitHub Models endpoint or an Azure OpenAI endpoint.

Returns:

ModelInfo. The ModelInfo is compatible with MutableMapping

Return type:

ModelInfo

Raises:

HttpResponseError

send_request(request: HttpRequest, *, stream: bool = False, **kwargs: Any) HttpResponse[source]

Runs the network request through the client’s chained policies.

>>> from azure.core.rest import HttpRequest
>>> request = HttpRequest("GET", "https://www.example.org/")
>>> request
<HttpRequest [GET], url: 'https://www.example.org/'>
>>> response = client.send_request(request)
>>> response
<HttpResponse: 200 OK>

For more information on this code flow, see https://aka.ms/azsdk/dpcodegen/python/send_request

Parameters:

request (HttpRequest) – The network request you want to make. Required.

Keyword Arguments:

stream (bool) – Whether the response payload will be streamed. Defaults to False.

Returns:

The response of your network call. Does not do error handling on your response.

Return type:

HttpResponse

class azure.ai.inference.ImageEmbeddingsClient(endpoint: str, credential: AzureKeyCredential | TokenCredential, *, dimensions: int | None = None, encoding_format: str | EmbeddingEncodingFormat | None = None, input_type: str | EmbeddingInputType | None = None, model: str | None = None, model_extras: Dict[str, Any] | None = None, **kwargs: Any)[source]

ImageEmbeddingsClient. A synchronous client for getting image embeddings from an AI model inference endpoint.

Parameters:
  • endpoint (str) – Service endpoint URL for AI model inference. Required.

  • credential (AzureKeyCredential or TokenCredential) – Credential used to authenticate requests to the service. Is either an AzureKeyCredential type or a TokenCredential type. Required.

Keyword Arguments:
  • dimensions (int) – Optional. The number of dimensions the resulting output embeddings should have. Default value is None.

  • encoding_format (str or EmbeddingEncodingFormat) – Optional. The desired format for the returned embeddings. Known values are: “base64”, “binary”, “float”, “int8”, “ubinary”, and “uint8”. Default value is None.

  • input_type (str or EmbeddingInputType) – Optional. The type of the input. Known values are: “text”, “query”, and “document”. Default value is None.

  • model (str) – ID of the specific AI model to use, if more than one model is available on the endpoint. Default value is None.

  • model_extras (dict[str, Any]) – Additional, model-specific parameters that are not in the standard request payload. They will be added as-is to the root of the JSON in the request body. How the service handles these extra parameters depends on the value of the extra-parameters request header. Default value is None.

  • api_version (str) – The API version to use for this operation. Default value is “2024-05-01-preview”. Note that overriding this default value may result in unsupported behavior.

close() None[source]
embed(*, input: List[ImageEmbeddingInput], dimensions: int | None = None, encoding_format: str | EmbeddingEncodingFormat | None = None, input_type: str | EmbeddingInputType | None = None, model: str | None = None, model_extras: Dict[str, Any] | None = None, **kwargs: Any) EmbeddingsResult[source]
embed(body: JSON, *, content_type: str = 'application/json', **kwargs: Any) EmbeddingsResult
embed(body: IO[bytes], *, content_type: str = 'application/json', **kwargs: Any) EmbeddingsResult

Return the embedding vectors for given images. The method makes a REST API call to the /images/embeddings route on the given endpoint.

Parameters:

body (JSON or IO[bytes]) – Is either a MutableMapping[str, Any] type (like a dictionary) or an IO[bytes] type that specifies the full request payload. Required.

Keyword Arguments:
  • input (list[ImageEmbeddingInput]) – Input image to embed. To embed multiple inputs in a single request, pass an array. The input must not exceed the max input tokens for the model. Required.

  • dimensions (int) – Optional. The number of dimensions the resulting output embeddings should have. Default value is None.

  • encoding_format (str or EmbeddingEncodingFormat) – Optional. The desired format for the returned embeddings. Known values are: “base64”, “binary”, “float”, “int8”, “ubinary”, and “uint8”. Default value is None.

  • input_type (str or EmbeddingInputType) – Optional. The type of the input. Known values are: “text”, “query”, and “document”. Default value is None.

  • model (str) – ID of the specific AI model to use, if more than one model is available on the endpoint. Default value is None.

  • model_extras (dict[str, Any]) – Additional, model-specific parameters that are not in the standard request payload. They will be added as-is to the root of the JSON in the request body. How the service handles these extra parameters depends on the value of the extra-parameters request header. Default value is None.

Returns:

EmbeddingsResult. The EmbeddingsResult is compatible with MutableMapping

Return type:

EmbeddingsResult

Raises:

HttpResponseError
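
A minimal sketch of an image embeddings call; the endpoint, key, and file name are placeholders, and the ImageEmbeddingInput.load helper is assumed from the azure.ai.inference.models namespace as a way to read a local image into the request payload.

from azure.ai.inference import ImageEmbeddingsClient
from azure.ai.inference.models import ImageEmbeddingInput
from azure.core.credentials import AzureKeyCredential

client = ImageEmbeddingsClient(
    endpoint="https://<your-deployment>.inference.ai.azure.com",
    credential=AzureKeyCredential("<your-api-key>"),
)

# ImageEmbeddingInput.load is assumed to read and encode the local image file;
# "sample.png" is a placeholder file name.
image = ImageEmbeddingInput.load(image_file="sample.png", image_format="png")
result = client.embed(input=[image])
for item in result.data:
    print(f"index {item.index}: {len(item.embedding)} dimensions")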

get_model_info(**kwargs: Any) ModelInfo[source]

Returns information about the AI model. The method makes a REST API call to the /info route on the given endpoint. This method will only work when using a Serverless API or Managed Compute endpoint. It will not work for a GitHub Models endpoint or an Azure OpenAI endpoint.

Returns:

ModelInfo. The ModelInfo is compatible with MutableMapping

Return type:

ModelInfo

Raises:

HttpResponseError

send_request(request: HttpRequest, *, stream: bool = False, **kwargs: Any) HttpResponse[source]

Runs the network request through the client’s chained policies.

>>> from azure.core.rest import HttpRequest
>>> request = HttpRequest("GET", "https://www.example.org/")
>>> request
<HttpRequest [GET], url: 'https://www.example.org/'>
>>> response = client.send_request(request)
>>> response
<HttpResponse: 200 OK>

For more information on this code flow, see https://aka.ms/azsdk/dpcodegen/python/send_request

Parameters:

request (HttpRequest) – The network request you want to make. Required.

Keyword Arguments:

stream (bool) – Whether the response payload will be streamed. Defaults to False.

Returns:

The response of your network call. Does not do error handling on your response.

Return type:

HttpResponse

azure.ai.inference.load_client(endpoint: str, credential: AzureKeyCredential | TokenCredential, **kwargs: Any) ChatCompletionsClient | EmbeddingsClient | ImageEmbeddingsClient[source]

Load a client from a given endpoint URL. The method makes a REST API call to the /info route on the given endpoint to determine the model type, and therefore which client to instantiate. Keyword arguments are passed through to the chosen client’s constructor, so options such as api_version, user_agent, or logging_enable can be set here. This method will only work when using a Serverless API or Managed Compute endpoint. It will not work for a GitHub Models endpoint or an Azure OpenAI endpoint.

Parameters:
  • endpoint (str) – Service endpoint URL for AI model inference. Required.

  • credential (AzureKeyCredential or TokenCredential) – Credential used to authenticate requests to the service. Is either an AzureKeyCredential type or a TokenCredential type. Required.

Returns:

The appropriate synchronous client associated with the given endpoint

Return type:

ChatCompletionsClient or EmbeddingsClient or ImageEmbeddingsClient

Raises:

HttpResponseError
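
For example, a sketch of loading a client and then dispatching on the returned type (the endpoint and key are placeholders):

from azure.ai.inference import load_client, ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

client = load_client(
    endpoint="https://<your-deployment>.inference.ai.azure.com",
    credential=AzureKeyCredential("<your-api-key>"),
)

# The concrete client type depends on the model behind the endpoint.
if isinstance(client, ChatCompletionsClient):
    ...  # call client.complete(...) as shown earlier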


Submodules

azure.ai.inference.tracing module

class azure.ai.inference.tracing.AIInferenceInstrumentor[source]

A class for managing the trace instrumentation of AI Inference.

This class allows enabling or disabling tracing for AI Inference, and provides functionality to check whether instrumentation is active.

instrument(enable_content_recording: bool | None = None) None[source]

Enable trace instrumentation for AI Inference.

Parameters:

enable_content_recording (bool, optional) – Whether content recording is enabled as part of the traces or not. Content in this context refers to chat message content and, for function call tools, the related function names, function parameter names, and values. True enables content recording, False disables it. If no value is provided, the value read from the environment variable AZURE_TRACING_GEN_AI_CONTENT_RECORDING_ENABLED is used; if the environment variable is not found, the value defaults to False. Please note that successive calls to instrument always apply the content recording value provided with the most recent call (including applying the environment variable if no value is provided, and defaulting to False if the environment variable is not found), even if instrument was already called previously without uninstrument being called in between.

is_content_recording_enabled() bool[source]

Gets the current content recording value.

Returns:

A bool value indicating whether content recording is enabled.

Return type:

bool

is_instrumented() bool[source]

Check if trace instrumentation for AI Inference is currently enabled.

Returns:

True if instrumentation is active, False otherwise.

Return type:

bool

uninstrument() None[source]

Disable trace instrumentation for AI Inference. This method removes any active instrumentation, stopping the tracing of AI Inference.

Raises:

RuntimeError – If instrumentation is not currently enabled.
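
The following sketch shows the instrumentation lifecycle using the methods documented above; it assumes an OpenTelemetry tracing setup is configured elsewhere for the emitted spans to be exported.

from azure.ai.inference.tracing import AIInferenceInstrumentor

instrumentor = AIInferenceInstrumentor()
instrumentor.instrument(enable_content_recording=True)  # omit the argument to fall back to the environment variable
print(instrumentor.is_instrumented())                   # True
print(instrumentor.is_content_recording_enabled())      # True

# ... make ChatCompletionsClient / EmbeddingsClient calls here; they are traced ...

instrumentor.uninstrument()  # raises RuntimeError if instrumentation is not enabled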