Job Description
To apply for this job, you need to complete both steps below:
STEP 1:
Please click the link to submit your application directly to the company:
Your application will only be received by the recruiter if it is submitted via the link above.
STEP 2:
Kindly scroll to the bottom of this page and complete the short VinUni Tracking Form.
Filling out this form alone does not count as applying. Please note that this form is not part of the company’s application process. It only helps the Careers, Alumni, Industry and Development (CAID) Department discover more opportunities and follow up in case of system issues.
What you’ll be doing:
- Engage in improving and perfecting a sophisticated speech data normalization and alignment tool with the assistance of NVIDIA's NeMo framework.
- Apply techniques in text normalization and audio-text alignment to prepare large-scale datasets for advanced speech processing tasks.
- Ensure robust handling of input audio and textual data to deliver highly accurate, automated spoken text outputs.
- Collaborate with multidisciplinary teams to translate requirements into practical tool building and implementation.
- Conduct comprehensive testing and validation to meet exacting internal quality standards.
- Assist in the deployment of modern advancements geared towards improving NVIDIA’s speech and language AI technologies.
What we need to see:
- Current enrollment in a Bachelor’s, Master’s, or PhD program in Computer Science, Engineering, or related field.
- Strong programming experience in Python (or a comparable language).
- Familiarity with ML frameworks/libraries (e.g., PyTorch or TensorFlow).
- Foundational knowledge of NLP and speech recognition concepts.
- Collaborative, inclusive approach with strong problem-solving skills and curiosity.
Ways to stand out from the crowd:
- Hands-on experience with NVIDIA’s NeMo toolkit or similar platforms for speech and language processing.
- Previous internships or substantial project experience in data engineering or machine learning for speech applications.
- Direct knowledge of techniques for audio signal processing and text normalization.

