Available for ML internships · Summer 2026

Hong Zhuang.

Machine Learning Engineer · Multimodal LLM Training & Acceleration

MSc Informatics student at the Technical University of Munich (TUM). Previously led long-sequence parallel training optimization at Huawei’s Ascend Model Platform and built a real-time small-object detection system at Moii.AI.

  • 75% GPU saved (Wan2.1)
  • 50% memory cut
  • +30% CV accuracy
  • +200% inference speed
Munich, Germany
PyTorch · vLLM · TRL · VERL · DeepSpeed-Ulysses
/01

About

A quick snapshot of where I am, what I work on, and what I’m looking for next.

Hi, I’m Hong Zhuang, an MSc Informatics student at the Technical University of Munich (TUM). I work on training, scaling, and accelerating large multimodal and reinforcement-learning models.

Most recently at Huawei’s Ascend Model Platform, I led long-sequence parallel optimization for the Wan2.1 14B text-to-video model in a joint innovation project with JD, cutting GPU usage by 75% and memory by 50%. Earlier, I co-built a real-time firearm detection system at Moii.AI that won the Amazon Sigma Award.

I’m now looking for internships in model engineering, algorithm deployment, and model acceleration.

/02

Skills

A bento-style snapshot of my tech stack, grouped by everyday use.

A1 / Core

AI Frameworks & Training

My daily drivers for everything from quick research prototypes to full-parameter multi-NPU RL fine-tuning.

  • PyTorch
  • HuggingFace
  • LLaMA-Factory
  • DiffSynth-Studio
  • Pandas
A2 / Acceleration

Distributed & Inference

Long-sequence parallelism, PagedAttention, and the rest of the modern LLM serving stack.

  • DeepSpeed-Ulysses
  • vLLM
  • Sequence Parallel
A3 / RL

RL Fine-tuning

GRPO-style RL fine-tuning ported from GPU to Ascend NPU, covering training, inference, tuning, and deployment end to end.

  • TRL
  • VERL
  • GRPO
B1 / Lang

Languages

  • Python
  • C++
  • Java
  • Bash
B2 / Vision

Computer Vision

  • YOLOv8
  • SAHI
  • OpenCV
  • I3D
B3 / Infra

Infra & Tooling

  • Ascend NPU
  • GCP
  • FastAPI
  • Git
  • Ubuntu
  • SQLite
/03

Experience

Two chapters so far: training large multimodal models at scale, and shipping detection systems on real cameras.

  1. Software Engineer (OD) · Ascend Model Platform

    Huawei Technologies · Aug 2024 – Sep 2025

    Focused on training and inference acceleration for large multimodal models: cross-chip adaptation (GPU → Ascend NPU), sequence parallelism, and full-parameter RL training on multi-NPU clusters.

    • Designed and implemented a long-sequence parallel feature that cut the hardware needed for training by 75% and memory by 50%, enabling longer sequences on the same hardware.
    • Led the GPU → Ascend NPU port of GRPO RL stacks built on TRL and verl, owning troubleshooting and solution design across training, inference, tuning, and deployment.
    • Built and maintained efficient dev / test / deployment environments spanning NPU and GPU.
  2. Machine Learning Engineer Intern

    Moii.AI · Aug 2023 – Dec 2023

    Designed and built ML inference APIs for real-time threat detection on images, videos, and live video streams. Trained YOLOv8 and optimized SAHI’s adaptive tile-sizing algorithm for small-object detection, raising accuracy by 30% and inference speed by 200%.

    Held regular code reviews with the development team to settle system architecture, feature implementation, performance, and cost.

/04

Education

Two campuses, two countries, one through-line.

Technical University of Munich (TUM)

M.Sc. Informatics

2025 – 2027 · Munich, Germany

QS World University Rankings: Top 25 globally. Coursework and research focused on machine learning, distributed systems, and computer vision.

Michigan State University

B.Sc. Computer Science · Minor: Business

2019 – 2023 · East Lansing, MI, USA

GPA 3.892 / 4.00 (Top 10%) · Major GPA 3.924. Dean’s List × 6.

/05

Publications

Published first-author work, with more in preparation.

/06

Selected work

Three projects spanning model acceleration, applied CV, and multimodal research.

Distributed · Multimodal · Text-to-Video

Wan2.1 14B Long-Sequence Parallel Training

Joint innovation project with JD. Built on ModelScope’s DiffSynth-Studio: adapted and memory-profiled the Wan2.1 14B text-to-video model on Ascend clusters, then implemented a DeepSpeed-Ulysses sequence-parallel strategy. 32K-sequence training now runs on 8 NPUs instead of 32, a 75% reduction in device count, with memory down 50%.

  • DeepSpeed-Ulysses
  • Ascend NPU
  • DiffSynth-Studio
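
A minimal sketch of the shape bookkeeping behind Ulysses-style sequence parallelism, not the project code. Outside attention each device holds a slice of the sequence with every head; an all-to-all then gives each device the full sequence for a slice of the heads. The function names, the head count of 40, and the parallel degree of 8 are illustrative assumptions; only the 32K sequence length and the 32 → 8 device count come from the project description.

```python
def ulysses_shapes(seq_len, num_heads, world_size):
    """Per-device (tokens, heads) before and after the Ulysses all-to-all.

    Hypothetical helper for illustration only; it tracks shapes and does
    no communication.
    """
    if seq_len % world_size or num_heads % world_size:
        raise ValueError("seq_len and num_heads must be divisible by world_size")
    # Outside attention: a slice of the sequence, all heads.
    sharded_by_sequence = (seq_len // world_size, num_heads)
    # Inside attention: the full sequence, a slice of the heads.
    sharded_by_heads = (seq_len, num_heads // world_size)
    return sharded_by_sequence, sharded_by_heads


def relative_reduction(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return (before - after) / before


# Assumed values: 40 heads and a sequence-parallel degree of 8.
seq_shard, head_shard = ulysses_shapes(seq_len=32 * 1024, num_heads=40, world_size=8)
print(seq_shard)                  # (4096, 40)
print(head_shard)                 # (32768, 5)
print(relative_reduction(32, 8))  # 0.75
```

The per-device element count is the same in both layouts (4096 × 40 = 32768 × 5), which is why the all-to-all redistributes activations without growing them.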
Computer Vision · FastAPI · GCP

Real-time Firearm Detection and Alerting

Web system that connects to live camera streams and detects small objects such as firearms in real time on the backend. YOLOv8 + SAHI improved accuracy by 30% and inference speed by 200%. Threat detections trigger immediate email alerts. Won the Amazon Sigma Award (1st of 30 teams).

  • YOLOv8
  • SAHI
  • FastAPI
  • Google Cloud
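
A rough sketch of the idea behind sliced inference for small objects: cut each frame into overlapping tiles so small targets occupy more of the detector's input. This is neither SAHI's API nor the production code. The function names, the fixed 640-pixel tile, and the 20% overlap are assumptions for illustration, and the adaptive tile-size selection itself is not shown.

```python
def tile_starts(length, tile, overlap_ratio):
    """Start offsets of overlapping tiles covering [0, length)."""
    if tile >= length:
        return [0]
    # Step between neighbouring tiles after subtracting the overlap.
    stride = max(1, tile - int(tile * overlap_ratio))
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the border
    return starts


def slice_boxes(width, height, tile, overlap_ratio=0.2):
    """(x1, y1, x2, y2) boxes of the tiles covering a width x height frame."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_starts(height, tile, overlap_ratio)
        for x in tile_starts(width, tile, overlap_ratio)
    ]


# Assumed example frame: 1920 x 1080 with 640-pixel tiles.
boxes = slice_boxes(1920, 1080, tile=640)
print(len(boxes))  # 8
print(boxes[0])    # (0, 0, 640, 640)
print(boxes[-1])   # (1280, 440, 1920, 1080)
```

Each tile would be passed to the detector separately, with the resulting boxes shifted back by the tile offset and merged, for example with non-maximum suppression.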
Weakly-Supervised TAL · Multimodal · Research

Multimodal WTAL for Medical Exams

Undergraduate research. Extended a SOTA weakly-supervised temporal action localization (WTAL) multimodal model with text features to automatically score medical students’ practical exams. Extracted optical-flow and RGB features from ~9,000 videos (> 1.2 TB) using I3D and trained on 4× RTX A6000, reaching 92.75% accuracy and 94.88% ROC-AUC.

  • I3D
  • WTAL
  • 4× A6000
/07

Get in touch

Best for internship opportunities, research collaborations, or just saying hi.