Setting Up a Local LLM Environment on Apple Silicon with MLX

Overview

Setting up a local LLM environment with MLX and Qwen 3.6 on a MacBook Pro M5 Pro and laying the groundwork for studying agent frameworks.

Steps

1. Learning Goals and Environment

The learning goals are below.

  • Implement an agent framework from scratch (ReAct, Reflection, Plan-and-Execute)
  • Understand the raw tool calling format and write a parser
  • Compare behavioral differences between Dense and MoE models
  • Observe reasoning patterns using Qwen’s <think> blocks

The environment is summarized below.

  • MacBook Pro M5 Pro
  • 48GB unified memory
  • macOS 26.4.1
  • mlx-lm v0.31.3 (isolated install via uv tool)

The key decisions are summarized below.

  • The main model is unsloth/Qwen3.6-27B-UD-MLX-6bit (Dense, about 22GB)
  • The sub model is unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit (MoE, about 19GB)
  • Start with the built-in mlx_lm.server, then verify with mlx-openai-server later in the learning cycle
  • 6bit is the default quantization. 8bit lacks memory margin and 4bit risks noise

2. Hardware and Unified Memory Analysis

The primary constraint for choosing a local LLM model is unified memory. You must accurately estimate available memory and verify that model weights plus KV cache fit within it.

2.1. System Specs

  • Model Identifier: Mac17,8 (MacBook Pro)
  • Chip: Apple M5 Pro
  • CPU: 18 cores (6 efficiency + 12 performance)
  • GPU: 20 cores
  • Unified memory: 48GB
  • Disk free: about 790GB
  • macOS: 26.4.1

2.2. What Unified Memory Means

Apple Silicon shares the same RAM between CPU and GPU. Unlike NVIDIA setups, you do not need to manage system RAM and VRAM separately. MLX leverages this structure to load quantized weights from disk straight into memory. As a result, the disk download size and the in-memory weight size are identical.

2.3. Memory Budget (48GB Allocation)

  • macOS + system background: about 10-12GB
  • Work environment such as IDE (Cursor / VSCode), browser, Claude Code: about 8-10GB
  • Available for model weights + KV cache: about 26-30GB

2.4. KV Cache Estimates

Context | 27B Dense | 35B-A3B (MoE)
8K      | +2-3GB    | +1-2GB
16K     | +4-6GB    | +3-4GB
32K     | +8-10GB   | +5-6GB
64K     | +15-18GB  | +10-12GB
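
These estimates follow from the standard KV cache formula: two tensors (K and V) per layer, each n_kv_heads × head_dim values per token. A rough estimator is below; the architecture numbers in the example call are illustrative placeholders, not the published Qwen 3.6 configuration.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: K and V tensors per layer, fp16 (2 bytes) by default."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_tokens / 2**30

# Illustrative architecture only -- not the actual Qwen3.6-27B configuration.
print(kv_cache_gb(n_layers=64, n_kv_heads=8, head_dim=128, context_tokens=32768))  # → 8.0
```

With these placeholder numbers, 32K context lands at 8GB, consistent with the +8-10GB row above; models using grouped-query attention (fewer KV heads) shrink the cache proportionally.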

2.5. Conclusion

Model weights should stay at or below 20-22GB so that multi-turn agents (8K-32K context) can hold their KV cache stably. 8bit quantization at 28GB or more risks triggering swap and is not recommended during the learning phase.

3. Installing MLX (mlx-lm + uv tool)

Isolate the MLX environment from the system Python and expose the CLI globally.

3.1. Difference Between MLX and mlx-lm

The relationship is similar to that of PyTorch and Hugging Face transformers.

MLX has the following characteristics.

  • A low-level ML framework built directly by Apple (released December 2023)
  • Provides tensor operations, automatic differentiation, and Metal GPU acceleration
  • Sits at the same layer as PyTorch / JAX
  • Used when training models or implementing new architectures

mlx-lm has the following characteristics.

  • An LLM-focused library built on top of MLX
  • Includes tools for model download, quantization, inference, and serving
  • Sits at the same layer as transformers / vLLM
  • Used to run already-trained LLMs

Running uv tool install mlx-lm automatically pulls in mlx as a dependency.

3.2. Installation

uv tool install is conceptually similar to pipx. It installs CLI tools into an isolated venv and exposes only the entry points on PATH.

uv tool install mlx-lm --python 3.12

The install locations are below.

  • Isolated venv: ~/.local/share/uv/tools/mlx-lm/
  • CLI symlinks: ~/.local/bin/mlx_lm.*

3.3. Python Version Handling

  • The system Python (3.9.6) does not support MLX
  • Specifying --python 3.12 makes uv automatically download cpython 3.12.13
  • Without specification, uv automatically picks the latest compatible version available

3.4. Install Verification

mlx_lm.server --help
uv tool list

Verify Metal GPU recognition with the following command.

uv tool run --from mlx-lm python -c \
  "import mlx.core as mx; print('Metal:', mx.metal.is_available()); print('Device:', mx.default_device())"

The output should be Metal: True and Device: Device(gpu, 0).

3.5. Available CLI (mlx-lm v0.31.3, 17 tools)

  • mlx_lm.generate: One-shot text generation
  • mlx_lm.server: OpenAI-compatible HTTP server
  • mlx_lm.chat: Interactive REPL
  • mlx_lm.convert: Quantization conversion
  • mlx_lm.lora: LoRA fine-tuning
  • mlx_lm.evaluate: Benchmark evaluation
  • Others: awq, dwq, gptq, fuse, manage, perplexity, share, upload, cache_prompt, benchmark, dynamic_quant

3.6. Benefits of the uv tool Approach

  • Avoids transformers version conflicts with other ML tools (vllm, llama-cpp-python, etc.)
  • Cleanly removes dependencies on uninstall (uv tool uninstall mlx-lm)
  • mlx-lm version upgrades do not affect other tools

4. Choosing the Qwen 3.6 Model (Dense vs MoE)

Run two model variants in parallel, Dense and MoE, to fit the goal of learning and building an agent framework.

4.1. Qwen 3.6 Lineup (as of April 2026)

  • Qwen3.6-Max-Preview (2026-04-20): Closed API, top of 6 coding benchmarks, 260K context
  • Qwen3.6-27B (2026-04-22): Open-weight Dense, Apache 2.0, 262K native / 1M extended
  • Qwen3.6-35B-A3B: Open-weight MoE (3B active), 262K native / 1M extended

4.2. Main Model: unsloth/Qwen3.6-27B-UD-MLX-6bit

  • Type: Dense 27B
  • Quantization: Unsloth Dynamic 6bit
  • Disk / memory weight: about 22GB
  • Use case: Serious behavior verification, ReAct decision tracing, quality baseline

4.3. Sub Model: unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit

  • Type: MoE (35B total, 3B active)
  • Quantization: Unsloth Dynamic 4bit
  • Disk / memory weight: about 19GB
  • Use case: Fast iteration debugging, prompt tuning, high-volume testing

4.4. Dense vs MoE Comparison

Item                   | 27B Dense 6bit       | 35B-A3B 4bit
Active parameters      | All 27B              | Only 3B
Token generation speed | 15-25 tok/s          | 50-80 tok/s
Behavioral consistency | High (deterministic) | Routing non-determinism
Debugging ease         | Excellent            | Hard to trace
Coding benchmarks      | Stronger             | Slightly lower

Dense was chosen as the main model because MoE's expert routing produces subtly different responses for the same input. When you trace why an agent made a particular decision, that routing non-determinism becomes noise.

The reason for keeping MoE as a sub model is that running the same agent code on both models alternately turns the Dense vs MoE behavioral difference into the most valuable learning point.

4.5. Learning Value of Qwen 3.6

  • Thinking Preservation: Preserves reasoning traces in <think> blocks across multi-turn dialogue
  • Tool calling stability: Trained on Hermes-style tool use, so OpenAI-compatible function calls are stable
  • Stronger agentic coding: Records 73.4% on SWE-bench Verified

4.6. Running Both Models

Loading both models simultaneously (22GB + 19GB = 41GB) does not fit in 48GB unified memory once KV cache and system overhead are added. Keep both on disk and compare them by restarting the server to swap which model occupies memory.

5. Quantization Comparison

Analyze the trade-offs along two axes: bit width and quantization scheme.

5.1. Memory by Bit Width (Qwen 3.6 27B Reference)

Quantization | Weight memory | With KV (8K) | With KV (32K) | 48GB Fit
4bit         | 15GB          | 18-22GB      | 23-27GB       | Plenty of room
6bit         | 22GB          | 24-25GB      | 30-32GB       | Best
8bit         | 28GB          | 32-38GB      | 40GB+         | Tight, swap risk

5.2. Bit Width Selection Guide

  • 8bit is effectively identical in quality to full precision but risks triggering macOS swap, which is fatal for the learning cycle
  • 4bit has measurable quantization noise. Using it as the main model causes you to mistake response quality drops for model limitations
  • 6bit is the sweet spot. The memory footprint is 22GB and Unsloth UD 6bit shows essentially negligible quality difference from 8bit
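
The weight figures above can be sanity-checked with the naive size formula (parameters × bits / 8). A sketch is below; real checkpoints run a few GB larger than this lower bound because embeddings, quantization scales, and the higher-bit layers in UD builds add overhead.

```python
def naive_weight_gb(n_params: float, bits: float) -> float:
    """Lower bound assuming every parameter is stored at exactly `bits` bits."""
    return n_params * bits / 8 / 2**30

for bits in (4, 6, 8):
    print(f"{bits}bit: {naive_weight_gb(27e9, bits):.1f}GB")
```

For 27B parameters this prints roughly 12.6 / 18.9 / 25.1GB, a few GB under the 15 / 22 / 28GB observed on disk, which is the expected direction of the gap.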

5.3. Uniform Quantization vs Dynamic Quantization

mlx-community (uniform quantization) has the following traits.

  • Converts every layer to the same bit width
  • Generated by the built-in mlx-lm tool (mlx_lm.convert)
  • The most standard and verified approach

unsloth UD (Dynamic quantization) has the following traits.

  • Applies different bit widths according to layer importance
  • Keeps key layers like attention and embeddings at higher bits (8bit)
  • Applies lower bits (4-5bit) to less important layers
  • Average bit count is the same but quality loss is smaller

Combinations where Dynamic quantization shines are below.

  • Dense models + lower bits (4-6bit) gain the most
  • MoE models + lower bits gain partially (only a subset of experts is active anyway)
  • The difference is negligible at 8bit and above
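
The "average bit count is the same" claim is just a weighted mean over the per-layer bit assignment. A minimal sketch is below; the 15/70/15 split in the example is a hypothetical allocation for illustration, not Unsloth's actual recipe.

```python
def average_bits(allocation: dict) -> float:
    """allocation maps bit width -> fraction of parameters stored at that width."""
    assert abs(sum(allocation.values()) - 1.0) < 1e-9, "fractions must sum to 1"
    return sum(bits * frac for bits, frac in allocation.items())

# Hypothetical UD-style split, not Unsloth's published numbers:
# 15% of parameters at 8bit, 70% at 6bit, 15% at 5bit.
print(average_bits({8: 0.15, 6: 0.70, 5: 0.15}))
```

Two builds with the same average (and therefore the same file size) can still differ in quality depending on which layers got the high-bit budget, which is the whole point of dynamic quantization.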

5.4. Comparing Two Providers (Qwen3.6-27B 6bit)

Build                           | Quantization | Quality  | Memory
mlx-community/Qwen3.6-27B-6bit  | Uniform 6bit | Standard | 22GB
unsloth/Qwen3.6-27B-UD-MLX-6bit | Dynamic 6bit | Better   | 22GB

5.5. Conclusion

For the goal of learning and building an agent framework, the most balanced choice is unsloth UD-MLX-6bit.

  • Secures a memory margin (22GB leaves room for KV cache)
  • Maintains 8bit-level quality (Dynamic quantization)
  • Lets you use the model’s actual capability as learning material

6. Inference Server Options

Spin up an OpenAI-compatible HTTP server and connect your own agent framework as a client.

6.1. Built-in Server (Included with mlx-lm v0.31.3)

The launch command is below.

mlx_lm.server \
  --model unsloth/Qwen3.6-27B-UD-MLX-6bit \
  --port 8080

The supported endpoints are below.

  • POST /v1/chat/completions (with streaming)
  • POST /v1/completions
  • GET /v1/models

The supported features are below.

  • OpenAI Chat Completions API compatibility
  • Streaming (SSE)
  • Function calling / tool use
  • Automatic chat template application (recognizes Qwen Hermes-style)

The call example is below.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="unsloth/Qwen3.6-27B-UD-MLX-6bit",
    messages=[{"role": "user", "content": "Hello"}]
)

6.2. Third-Party Server: mlx-openai-server (cubist38, FastAPI-based)

Additional features over the built-in server are below.

  • Multi-model concurrent serving (YAML config)
  • Standardized per-model tool call parsers (qwen3, qwen3_5, qwen3_coder, gemma4, etc.)
  • Reasoning parser separation (automatically separates Qwen <think> blocks into a reasoning_content field)
  • Vision + text multimodal support

The installation command is below.

uv tool install mlx-openai-server

6.3. Selection by Learning Stage

  • Early stage: Use the built-in server. The raw tool call output becomes learning material
  • Middle stage: Write your own parser using the built-in server’s raw tool calls. Learn to implement ReAct and Reflection patterns
  • Late stage: Use mlx-openai-server to verify that your parser matches the standard
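
As a starting point for the middle stage, a minimal parser for Hermes-style raw output might look like the sketch below. It assumes the model wraps each call in <tool_call>…</tool_call> tags with a JSON body, which is the shape Qwen's Hermes-style template produces; a production parser would also need to handle truncated or streamed blocks.

```python
import json
import re

# Hermes-style tool call: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text: str) -> list:
    """Extract tool call dicts from raw model output, skipping malformed blocks."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # malformed JSON: skip rather than crash the agent loop
    return calls

raw = ('Let me check.\n<tool_call>\n'
       '{"name": "get_weather", "arguments": {"city": "Seoul"}}\n'
       '</tool_call>')
print(parse_tool_calls(raw))
```

Comparing this hand-rolled parser's output against mlx-openai-server's standardized qwen parser is exactly the late-stage verification step.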

6.4. Other Options

Tool                     | Features                                           | Suitability for Learning
mlx_lm.server (built-in) | OpenAI compatible, single model, standard          | Starting point
mlx-openai-server        | Multi-model, standardized parsers, reasoning split | Verification
LM Studio                | GUI app, model manager                             | If you prefer GUI
FastMLX                  | Lightweight FastAPI wrapper                        | Simplicity-first
vllm-mlx                 | Continuous batching, 400+ tok/s, MCP support       | Production

The recommended progression is below.
  • Step 1: Start with the built-in mlx_lm.server
  • Step 2: Verify single tool calls with curl / OpenAI SDK
  • Step 3: Analyze the Hermes raw output and write your own tool call parser
  • Step 4: Implement agent patterns such as ReAct
  • Step 5: Verify your parser with mlx-openai-server

7. Execution Guide (From Install Verification to Troubleshooting)

This section organizes step-by-step commands from mlx-lm installation to API calls against both models.

7.1. Environment Verification

mlx_lm.server --help
uv tool run --from mlx-lm python -c "import mlx.core as mx; print(mx.metal.is_available(), mx.default_device())"
uv tool list

7.2. Download the Main Model (Qwen 3.6 27B Dense 6bit, About 22GB)

mlx_lm.generate --model unsloth/Qwen3.6-27B-UD-MLX-6bit --prompt "Hello, who are you?" --max-tokens 100

Monitor progress with the following command.

du -sh ~/.cache/huggingface/hub/models--unsloth--Qwen3.6-27B-UD-MLX-6bit

Confirm completion with the following command.

ls -lh ~/.cache/huggingface/hub/models--unsloth--Qwen3.6-27B-UD-MLX-6bit/snapshots/*/

7.3. Download the Sub Model (Qwen 3.6 35B-A3B MoE 4bit, About 19GB)

mlx_lm.generate --model unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit --prompt "Hello, who are you?" --max-tokens 100
du -sh ~/.cache/huggingface/hub/models--unsloth--Qwen3.6-35B-A3B-UD-MLX-4bit

7.4. Run the Server (One Model at a Time)

# Main model
mlx_lm.server --model unsloth/Qwen3.6-27B-UD-MLX-6bit --port 8080

# Sub model (after stopping the main)
mlx_lm.server --model unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit --port 8080

When you see Starting server on 127.0.0.1:8080, the server is up. To swap models, free memory with Ctrl+C and run the same command with the other model.
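
When scripting the swap, it helps to block until the new process is actually serving. A standard-library sketch is below; it polls GET /v1/models, which the built-in server exposes.

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str = "http://localhost:8080",
                    timeout: float = 120.0) -> bool:
    """Poll GET /v1/models until the freshly started server answers, or give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(1)  # server not up yet; retry
    return False
```

The generous default timeout matters here: the first request after a model swap also triggers weight paging-in, so readiness can lag the process start by a while.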

7.5. API Call Test (curl)

Call the server from another terminal while it is running. The examples below assume the main model. If you swapped to the sub model, change the model value to unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit. Paths are matched exactly, so a trailing slash or a query string returns 404.

A basic chat completion is below.

curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "unsloth/Qwen3.6-27B-UD-MLX-6bit",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 50
}'

The model list is below.

curl http://localhost:8080/v1/models

The streaming example is below.

curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "unsloth/Qwen3.6-27B-UD-MLX-6bit",
  "messages": [{"role": "user", "content": "Count from 1 to 10"}],
  "stream": true,
  "max_tokens": 100
}'

7.6. Tool Calling Test

A single tool call is below.

curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "unsloth/Qwen3.6-27B-UD-MLX-6bit",
  "messages": [{"role": "user", "content": "What is the weather in Seoul?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get weather for a city",
      "parameters": {
        "type": "object",
        "properties": {
          "city": {"type": "string", "description": "City name"}
        },
        "required": ["city"]
      }
    }
  }],
  "max_tokens": 200
}'

The response should include a tool_calls field.

A multi-turn tool call (returning the result, then producing the final response) is below.

curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "unsloth/Qwen3.6-27B-UD-MLX-6bit",
  "messages": [
    {"role": "user", "content": "What is the weather in Seoul?"},
    {"role": "assistant", "content": null, "tool_calls": [{
      "id": "call_001",
      "type": "function",
      "function": {"name": "get_weather", "arguments": "{\"city\": \"Seoul\"}"}
    }]},
    {"role": "tool", "tool_call_id": "call_001", "content": "{\"temp\": 18, \"condition\": \"clear\"}"}
  ],
  "max_tokens": 200
}'
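
The two curl calls above can be driven as one loop. The sketch below keeps the transport abstract: chat(messages) stands in for a POST to /v1/chat/completions and returns the assistant message as a dict in the OpenAI response shape, and get_weather is a stub matching the curl example's canned result.

```python
import json

def get_weather(city: str) -> str:
    """Stub tool returning the same canned result as the curl example."""
    return json.dumps({"temp": 18, "condition": "clear"})

TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_loop(chat, messages: list) -> str:
    """Drive request -> tool_calls -> tool results -> final answer to completion."""
    msg = chat(messages)
    while msg.get("tool_calls"):
        messages.append(msg)  # keep the assistant turn that requested the tools
        for call in msg["tool_calls"]:
            fn = TOOL_REGISTRY[call["function"]["name"]]
            args = json.loads(call["function"]["arguments"])
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": fn(**args)})
        msg = chat(messages)  # ask again with the tool results attached
    return msg["content"]
```

Plugging in a real chat() backed by the OpenAI SDK or urllib turns this into the core of a ReAct loop; the while condition also handles models that chain several rounds of tool calls.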

7.7. OpenAI SDK Call (Python)

uv tool run --from openai python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8080/v1', api_key='not-needed')
response = client.chat.completions.create(
    model='unsloth/Qwen3.6-27B-UD-MLX-6bit',
    messages=[{'role': 'user', 'content': 'Hi'}],
    max_tokens=50
)
print(response.choices[0].message.content)
"

7.8. Disk Usage / Removing Models

After downloading both models, total usage is about 41GB.

du -sh ~/.cache/huggingface/hub/models--unsloth--Qwen3.6-*
rm -rf ~/.cache/huggingface/hub/models--unsloth--Qwen3.6-27B-UD-MLX-6bit
rm -rf ~/.cache/huggingface/hub/models--unsloth--Qwen3.6-35B-A3B-UD-MLX-4bit

7.9. Troubleshooting

The error command not found: mlx_lm.server occurs when ~/.local/bin is not on PATH.

uv tool update-shell
source ~/.zshrc

To resume an interrupted download, run mlx_lm.generate or mlx_lm.server again. It automatically resumes via huggingface_hub.

For out-of-memory (OOM) issues, close other apps or shorten the context. Starting with an 8K context is recommended.

Resolve server port conflicts with the following commands.

lsof -i :8080
kill -9 <PID>

A 404 response on a POST request usually means the path is wrong. mlx_lm.server compares paths exactly, so a trailing slash (/v1/chat/completions/), a query string (?...), or a typo all return 404. The supported paths are only /v1/chat/completions, /chat/completions, and /v1/completions (text completion).

A 411 response on a POST request means the Content-Length header is missing. curl adds it automatically, but specify it explicitly when working with raw sockets.

8. Comprehensive Benchmark (Qwen3.6 27B 6bit)

The unsloth/Qwen3.6-27B-UD-MLX-6bit model is loaded into mlx_lm.server, and memory limits and performance are measured across seven scenarios: length stress, multi-turn, tool call, deep reasoning, general capability, GPU throughput, and streaming latency. The test machine is Apple Silicon with 48GB unified memory.

8.1. Environment

The test environment is below.

  • Model: unsloth/Qwen3.6-27B-UD-MLX-6bit (Dense 27B, 6bit quantization)
  • Weight size on disk: approximately 22GB
  • Server: mlx_lm.server v0.31.3, port 8080
  • System: Apple Silicon, 48GB unified memory, macOS Darwin 25.4

The measurement tools are below.

  • Process RSS: ps -o rss=
  • System memory: vm_stat (Pages free / inactive / wired / Swapouts)
  • HTTP calls: standard library urllib.request
  • Measurement scripts: /tmp/mlx_stress.py, /tmp/mlx_extra.py, /tmp/mlx_gpu.py

8.2. Difference Between RSS and Wired Memory

RSS showed only 2.8GB at the start of measurement, which was unexpected. The cause is macOS’s mmap and memory compression behavior.

  • mlx_lm loads weights via mmap. Pages are brought into physical memory only when actually accessed.
  • macOS compresses idle pages. RSS may not reflect compressed memory.
  • Pages wired down (wired) is more accurate for actual GPU occupancy. MLX holds Metal tensors as wired pages.
  • Measurement result: wired jumps from 3.23GB at idle to 32.71GB when the model is active, an increase of approximately 29.5GB. This is the actual GPU memory occupancy.

For this reason, wired memory is used as the primary memory metric in this benchmark.
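
Extracting the wired figure can be automated. The sketch below parses vm_stat output; the live sampler only works on macOS, so the parsing is kept as a separate, testable function.

```python
import re
import subprocess

PAGE_SIZE = 16384  # vm_stat on Apple Silicon reports a 16384-byte page size

def parse_wired_gb(vm_stat_output: str) -> float:
    """Convert the 'Pages wired down' line of vm_stat output to GiB."""
    pages = int(re.search(r"Pages wired down:\s+(\d+)\.", vm_stat_output).group(1))
    return pages * PAGE_SIZE / 2**30

def wired_gb() -> float:
    """Sample the live value by shelling out to vm_stat (macOS only)."""
    out = subprocess.run(["vm_stat"], capture_output=True, text=True, check=True).stdout
    return parse_wired_gb(out)
```

Sampling wired_gb() before and after the first request is how the 3.23GB → 32.71GB jump in section 8.8 was observed.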

8.3. Length Stress (Incremental Prompt Token Growth)

prompt and max_tokens are increased incrementally in a single call to measure the length limit.

target prompt | actual prompt | output | elapsed | RSS peak | swap Δ    | min free+inactive | finish
128           | 222           | 128    | 15.7s   | 2.80GB   | 0.0GB     | 7.87GB            | length
512           | 796           | 422    | 50.8s   | 2.84GB   | 0.0GB     | 7.51GB            | stop
2048          | 3089          | 420    | 55.9s   | 2.84GB   | 0.0GB     | 7.58GB            | stop
4096          | 6147          | 597    | 83.2s   | 2.86GB   | 0.0GB     | 5.65GB            | stop
8192          | 12264         | 339    | 69.5s   | 2.86GB   | 0.0GB     | 4.96GB            | stop
16384         | 24492         | 376    | 200.6s  | 1.94GB   | 2.34GB    | 3.11GB            | stop
24576         | -             | -      | -       | -        | thrashing | <1GB              | (aborted)

Key observations are below.

  • Up to 12K prompt, calls are stable at 50-90 seconds each. The first call after idle takes longer due to weight paging-in.
  • At an actual prompt of 24K (target 16384), elapsed spikes to 200 seconds and 2.3GB of swap occurs. This is a signal that the processing limit has been reached.
  • At target 24576, system free drops to 0.5GB, wired reaches 25GB, and swap accumulates at 1.5-2GB every 5 seconds, causing thrashing. Processing is effectively impossible.
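
The measurement loop behind this table can be sketched as below. send(prompt, max_tokens) stands in for one server call, and the repeated-sentence filler only approximates token counts (one reason the actual prompt tokens diverge from the targets above).

```python
import time

def length_stress(send, targets=(128, 512, 2048, 4096, 8192)):
    """Call the server with stepwise-larger prompts and record per-call timing."""
    sentence = "The quick brown fox jumps over the lazy dog. "
    results = []
    for target in targets:
        # ~10 tokens per sentence is a rough heuristic, not a tokenizer count
        prompt = "Summarize the following text. " + sentence * (target // 10)
        start = time.monotonic()
        send(prompt, max_tokens=600)
        results.append({"target": target, "elapsed_s": time.monotonic() - start})
    return results
```

In the actual runs, RSS, swap, and free+inactive were sampled around each send() using vm_stat; the loop was aborted by hand once thrashing set in at target 24576.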

8.4. Multi-Turn (5-Turn Accumulated Conversation)

Measurement uses a 5-turn conversation on the TCA pattern as the topic. Each turn is called with the accumulated context.

turn | prompt | completion | total | elapsed | RSS peak | swap Δ
1    | 28     | 400        | 428   | 46.1s   | 18.00GB  | 0.00GB
2    | 191    | 400        | 591   | 44.5s   | 18.02GB  | 0.03GB
3    | 615    | 400        | 1015  | 45.5s   | 18.02GB  | 0.00GB
4    | 1036   | 400        | 1436  | 46.2s   | 18.02GB  | 0.00GB
5    | 1457   | 400        | 1857  | 45.4s   | 18.02GB  | 0.00GB

Key observations are below.

  • RSS is very stable at 18GB for an accumulated 1857 tokens across 5 turns.
  • Elapsed time is nearly identical across turns (44-46s). At the 1500-token context level, the prefill overhead is negligible.
  • Total swap across 5 turns is 0.03GB, which is effectively zero.
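
The accumulated-context loop itself is simple: each turn appends the user question, calls the server with the full history, and appends the reply. A sketch with an injectable chat callable (standing in for the server call) is below.

```python
def multi_turn(chat, questions):
    """One chat() call per turn, always passing the full accumulated history."""
    messages = []
    for question in questions:
        messages.append({"role": "user", "content": question})
        # chat() sees the history ending with the new user turn
        messages.append({"role": "assistant", "content": chat(messages)})
    return messages
```

The prompt column in the table is simply the token count of this growing messages list at each turn, which is why it climbs from 28 to 1457 over five turns.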

8.5. Tool Call (Function Calling)

The get_weather function is registered and a request is made to compare the weather in Seoul and Tokyo. The full cycle of tool call request, result return, and final response is measured.

step | stage             | elapsed | RSS peak
1    | tool_call_request | 37.0s   | 18.02GB
2    | final_answer      | 17.9s   | 18.02GB

Key observations are below.

  • In step 1, the model issues parallel tool_calls for both cities in a single call (parallel function calling is supported). Two tool_calls come back together in a single response.
  • In step 2, two tool results are added to messages and a call is made, generating a natural comparison response.
  • Response: “Seoul 18 degrees clear, Tokyo 22 degrees clear, Tokyo is 4 degrees warmer”
  • Tool calling for Qwen3.6 works correctly in mlx_lm.server v0.31.3.

8.6. Deep Reasoning (Thinking Mode)

A seating inference puzzle (“Who is sitting across from the mother?”) is used for measurement. Thinking mode is controlled via chat_template_kwargs.

mode                  | elapsed | RSS peak | reasoning field
default               | 113.5s  | 18.10GB  | present
enable_thinking=True  | 112.5s  | 18.10GB  | present
enable_thinking=False | 26.3s   | 18.09GB  | absent

Key observations are below.

  • Qwen3.6 has thinking on by default (the reasoning field is populated automatically).
  • Switching to thinking_off makes responses 4.3x faster. This is recommended for simple responses or when an immediate answer is needed.
  • With thinking_on, the reasoning process goes into the reasoning field and the final answer is separated into the content field.
  • Memory usage is independent of mode.
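
Toggling the mode from a client is a matter of adding chat_template_kwargs to the request body. A standard-library sketch is below; the payload builder is kept separate from the HTTP call so the toggle can be checked without a running server.

```python
import json
import urllib.request

MODEL = "unsloth/Qwen3.6-27B-UD-MLX-6bit"

def build_payload(prompt: str, thinking: bool, max_tokens: int = 200) -> dict:
    """Chat completion body with thinking explicitly toggled via template kwargs."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "chat_template_kwargs": {"enable_thinking": thinking},
    }

def chat(prompt: str, thinking: bool, base_url: str = "http://localhost:8080/v1") -> dict:
    """POST to the exact /v1/chat/completions path (the server 404s on variants)."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt, thinking)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

With the OpenAI SDK, the same toggle can be passed through extra_body, since chat_template_kwargs is not a standard Chat Completions field.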

8.7. General Capability (Code / Math / Korean / Translation)

Four short cases are used to quickly verify the model’s general capability.

case                                                 | elapsed | RSS peak | Note
code (Python deep merge)                             | 67.1s   | 18.10GB  | Response includes thinking process
math (sum of multiples of 7 from 1 to 100)           | 67.4s   | 18.10GB  | Correct solution
korean_reasoning (ungrammatical sentence correction) | 68.4s   | 18.10GB  | Natural correction
translation (Korean to English/Japanese)             | 67.1s   | 18.10GB  | Natural translation

Key observations are below.

  • All 4 cases have a consistent elapsed time of 67-68 seconds, which matches the expected decode speed of approximately 9 tok/s at max_tokens 600.
  • Response quality passes the baseline bar. No separate quantitative evaluation is performed.
  • No obvious errors are found in Korean reasoning or translation.

8.8. GPU / Throughput

prefill and decode speeds are measured separately.

label       | prompt | completion | elapsed | total tok/s | decode tok/s | wired before | wired after
baseline    | 13     | 50         | 6.4s    | 9.9         | 7.9          | 3.23GB       | 32.71GB
medium      | 24     | 400        | 44.5s   | 9.5         | 9.0          | 32.71GB      | 32.44GB
long_decode | 25     | 1500       | 163.7s  | 9.3         | 9.2          | 32.44GB      | 32.72GB
long_prompt | 1222   | 800        | 90.4s   | 22.4        | 8.8          | 32.72GB      | 32.79GB

Key observations are below.

  • Decode throughput is consistently 8-9 tok/s, which is the expected range for a 27B 6bit model on the M series.
  • On the baseline first call, wired jumps from 3.23GB to 32.71GB. Model weights and Metal buffers are locked as wired pages.
  • Subsequent calls are stable at 32.4-32.8GB wired. The KV cache fluctuates by 0.2-0.3GB depending on context length.
  • The total 22.4 tok/s for long_prompt reflects the effect of processing 1222 prefill tokens very quickly. Prefill speed is estimated at 30-40 tok/s, which is 4-5x faster than decode.
  • The stable operating point for this machine is 8-9 tok/s in terms of user-perceived speed (first-token latency and decode speed).
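
The two rates in the table relate by simple arithmetic: total tok/s counts prompt plus completion tokens over elapsed time, while decode tok/s counts completion tokens only (elapsed still includes prefill, so this slightly understates pure decode speed). A sketch reproducing the long_prompt row is below.

```python
def rates(prompt_tokens: int, completion_tokens: int, elapsed_s: float):
    """total tok/s includes prefill tokens; decode tok/s counts output tokens only."""
    total = (prompt_tokens + completion_tokens) / elapsed_s
    decode = completion_tokens / elapsed_s
    return round(total, 1), round(decode, 1)

print(rates(1222, 800, 90.4))  # long_prompt row → (22.4, 8.8)
```

The gap between the two numbers is the prefill contribution: 1222 prompt tokens processed in well under the elapsed time is what lifts total tok/s to 22.4 while decode stays near 9.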

8.9. Streaming (TTFT, ITL)

The SSE streaming response is parsed directly to measure time to first token (TTFT) and inter-token latency (ITL).

label        | tokens | elapsed | TTFT  | ITL avg | ITL p50 | ITL p95 | ITL max
short(cold)  | 100    | 15.7s   | 4.77s | 110.6ms | 110.1ms | 114.2ms | 121.9ms
medium(warm) | 392    | 44.8s   | 0.68s | 112.8ms | 110.4ms | 114.0ms | 332.0ms
long(warm)   | 786    | 88.8s   | 0.71s | 112.2ms | 109.7ms | 115.4ms | 329.9ms

Key observations are below.

  • TTFT is about 4.77s cold (immediately after weight page-in) and about 0.7s warm, a significant difference.
  • ITL p50 is approximately 110ms and very consistent, which matches the 9 tok/s decode rate.
  • ITL p95 stays within 115ms, so the distribution is tight.
  • Occasional 330ms spikes occur, which translates to a brief stutter in the UI.
  • User-perceived responsiveness is good in the warm state. The near-5-second first-call latency requires pre-warming for chat UX.
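
Once per-chunk arrival times are recorded from the SSE stream, the table's statistics reduce to simple differences. A sketch of the reduction step is below (the stream parsing itself is omitted).

```python
import statistics

def stream_latency(chunk_times: list, request_start: float) -> dict:
    """TTFT and inter-token latency stats from the arrival time of each SSE chunk."""
    ttft = chunk_times[0] - request_start
    itl = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    return {
        "ttft_s": round(ttft, 2),
        "itl_p50_ms": round(statistics.median(itl) * 1000, 1),
        "itl_max_ms": round(max(itl) * 1000, 1),
    }
```

Feeding in timestamps captured with time.monotonic() as each data: line arrives yields the TTFT and ITL columns above; p95 can be added the same way with statistics.quantiles.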

8.10. Conclusion

8.10.1. Memory Limits

  • With the model active, wired memory occupies approximately 33GB as a baseline. Of the 48GB unified memory, 15GB remains free.
  • The safe operating context is approximately 12K actual prompt tokens. Beyond that, KV cache pressure causes elapsed time to spike.
  • The hard limit is around 24K actual prompt tokens. Beyond that, thrashing makes processing effectively impossible.
  • A 5-turn multi-turn conversation (approximately 1.8K accumulated tokens) places almost no memory burden.

8.10.2. Performance

  • Decode throughput is stable at 8-9 tok/s, with a tight ITL p50 of 110ms.
  • Prefill is estimated at 30-40 tok/s. For short prompts, the prefill contribution is small, so response time is proportional to decode length.
  • TTFT is about 4.8s cold (weight page-in) and about 0.7s warm.
  • The first call takes approximately 30 seconds longer due to weight paging-in. This can be ignored for workloads with frequent calls.

8.10.3. Features

  • Tool calling works correctly and parallel function calling is supported.
  • Reasoning defaults to thinking on. Explicitly setting enable_thinking=False accelerates responses by 4.3x.
  • Korean, English, and Japanese baseline quality all pass the bar.
  • General chatbot/QA (2-4K context) works without issues. Responses come back at approximately 9 tok/s decode speed.
  • Multi-turn conversations (up to approximately 10K accumulated context) are stable.
  • Long document summarization (8-12K prompt) is possible. Responses take 60-90 seconds.
  • Long document analysis (16K+ prompt) is possible but responses take 200 seconds or more and swap begins to occur.
  • Very long inputs (20K+) are not recommended due to the risk of thrashing.
  • For immediate-response workloads, setting chat_template_kwargs.enable_thinking=False provides a 4x speedup.
  • When running other heavy apps simultaneously (such as Xcode builds or virtual machines), it is recommended to operate with a narrower context.

8.11. Measurement Limitations and Future Items

  • Concurrent inference is not measured. Since mlx_lm.server processes requests in a single queue, this may not be particularly meaningful.
  • GPU utilization, power, and temperature via powermetrics are not measured because they require sudo. This can be done in a separate session.
  • Quantization comparison (4bit vs 6bit vs 8bit) will be covered in a separate quantization comparison note.
  • A comparison of the 35B-A3B MoE 4bit model under the same scenarios is planned as a follow-up.

9. Learning Roadmap

A step-by-step roadmap for studying and building an agent framework is below.

  1. Environment setup
    • Complete uv tool install mlx-lm
    • Confirm Metal GPU works
    • Download Qwen 3.6 27B Dense 6bit
  2. Single-model verification
    • Confirm the first response with mlx_lm.generate
    • Launch mlx_lm.server
    • Call the compatible API with curl or the OpenAI SDK
  3. Tool calling study
    • Verify single tool calls
    • Analyze raw responses in Hermes format
    • Write a tool call parser by hand
  4. Agent pattern implementation
    • Build a ReAct loop
    • Apply the Reflection pattern
    • Apply the Plan-and-Execute pattern
  5. Comparative learning
    • Run the same code on 35B-A3B and confirm behavior
    • Observe Dense vs MoE behavioral differences
    • Verify your own parser with mlx-openai-server

The learning highlights are below.

  • Reasoning visualization through Qwen’s <think> blocks
  • Feeling the determinism of Dense versus the routing variance of MoE
  • Hands-on parsing of the tool call format (Hermes style)
  • Understanding KV cache behavior and the cost of multi-turn

The next steps after completing the learning roadmap are below.

  • Compare your implementation with existing frameworks such as LangGraph
  • Write your own MCP (Model Context Protocol) server
  • Experiment with multi-agent collaboration patterns
  • Build domain-specific agents (code review, document writing, etc.)
