About
Profile
I am a Master's student in Computer Science at the University of Macau, working under the supervision of Prof. Derek F. Wong at the NLP & Portuguese–Chinese Machine Translation Laboratory (NLP2CT). My research focuses on trustworthy large language models, with an emphasis on LLM-generated text detection, adversarial robustness, and safety alignment across multilingual and low-resource settings.
I am additionally interested in LLM agents and machine-translation evaluation, as well as the robustness and reliability of detectors across languages and model families. I am fortunate to be closely mentored by Junchao Wu, whose guidance has been instrumental in shaping my research trajectory.
I am actively looking for collaborations on trustworthy LLMs, multilingual NLP, and LLM-generated content detection; please feel free to reach out via email.
Research Overview
Thematic pillars
News
Latest updates
Education
Academic record
Experience
Research & Teaching
- Joined as an undergraduate exchange researcher in May 2024 and continued as a graduate researcher from August 2025, contributing continuously over two years to multilingual NLP and LLM evaluation projects.
- Conceived and led IndicDetect, a cross-lingual benchmark for LLM-generated text detection across Hindi, Telugu, and Tamil; drove dataset construction, evaluation framework design, and analysis of zero-shot and fine-tuned detector performance.
- Built CultMetric, a reference-free, gloss-grounded framework for evaluating cultural fidelity in machine translation, reproducing human rankings with a deterministic LLM-judge protocol.
- Developed and optimised Transformer-based models (BERT, RoBERTa, GPT-style) for multilingual machine translation, specialising in failure triage and configuration tuning for low-resource language pairs.
- Built automated evaluation pipelines (BLEU, ROUGE, BERTScore, WER) and containerised workflows with Git, Docker, MLflow, and Weights & Biases to streamline experimentation and ensure reproducibility.
- Co-authored manuscripts submitted to ACL Rolling Review (ARR) 2026, contributing experimental design, writing, and result analysis.
- Facilitated weekly tutorials for undergraduate courses in Discrete Structures and Software Project Management, mentoring students in algorithmic complexity and systematic quality-assurance practice.
- Engineered assessment frameworks and grading rubrics, applying test-coverage principles and edge-case analysis to evaluate student submissions.
- Collaborated with lead faculty to design supplementary instructional materials and coordinate grading across large-scale computer-science classes.
Selected Projects
Research output
- Curated a benchmark of 84K human-written and LLM-generated samples across Hindi, Telugu, and Tamil, spanning four domains (academic, news, creative writing, movie reviews), with the LLM-generated text produced by GPT-4.1, Qwen-Plus, and DeepSeek-v3.2.
- Engineered seven Brahmic-script-aware adversarial attacks (paraphrase via back-translation, character perturbation, whitespace, insert-paragraph, alternative spelling, misspelling, synonym swap) to stress-test detector robustness.
- Benchmarked eight detectors, including zero-shot statistical methods (Log-Likelihood, Log-Rank, LRR, FastDetectGPT, Binoculars) and supervised neural models (XLM-RoBERTa Base/Large, QLoRA fine-tuned Qwen 2.5-7B) across six evaluation settings.
- Showed that the QLoRA fine-tuned Qwen 2.5-7B achieved the top average scores of 87.16 (Telugu), 85.74 (Hindi), and 87.23 (Tamil), while zero-shot detectors degraded sharply under generator shift and adversarial perturbation.
- Designed a reference-free MT evaluation framework that grounds an LLM judge in a curated 872-entry glossary of culture-specific items, classifies failures into five typed categories, and produces a deterministic 0–100 score with bit-exact reproducibility.
- Curated the CSI glossary from Classical Chinese source texts via LLM-assisted extraction and expert validation against authoritative scholarly translations, covering religious, social, material, ecological, and linguistic categories.
- Evaluated four MT systems (GLM-5.1, DeepSeek-V4 Flash, Llama-3, Google Translate) on ~6,400 segments, reproducing the human cultural-fidelity ranking exactly, with a Spearman correlation 2.7× stronger than that of the best non-judge baseline.
- Ran ablation studies across two independent judge models (GPT-4o, Qwen-3.6 Flash) and culturally-flattened paraphrase conditions, demonstrating ranking robustness across all configurations.
- Built a CNN-LSTM pipeline for multi-class emotion recognition from disordered clinical speech, extracting MFCC, prosodic, and spectral features with Librosa and addressing class imbalance via pitch shifting, time stretching, and noise injection.
- Automated the end-to-end evaluation pipeline (feature extraction, inference, multi-class reporting) and conducted systematic failure analysis across emotion categories to guide targeted augmentation.
Publications
Papers
C = Conference · J = Journal · S = In Submission · T = Thesis · * = equal contribution
- S.1 · In Submission · ACL ARR 2026
  IndicDetect: Evaluating Cross-Lingual LLM-Generated Text Detection for Hindi, Telugu, and Tamil
  Manuscript submitted to ACL Rolling Review (ARR).
- S.2 · In Submission · ACL ARR 2026
  CultMetric: Gloss-Grounded Cultural Fidelity Evaluation for Machine Translation
  Manuscript submitted to ACL Rolling Review (ARR).
Technical Skills
Competencies
Relevant Coursework
Curriculum
Academic Service
Reviewing
References
Referees
Department of Computer and Information Science
Faculty of Science and Technology
University of Macau