This study investigates Voice Onset Time (VOT) in Pakistani English (PE) speech and in Sindhi as a first language (L1). It hypothesizes that PE speakers transfer their L1 negative VOT to L2 English voiced stops, that they produce English plosives with shorter pre-voicing durations than their L1 voiced plosives, and that these characteristics vary with place of articulation and gender. The stimuli comprised the L2 English aspirated allophones labial [pʰ], coronal [tʰ], and velar [kʰ]; the Sindhi L1 phonemically distinct aspirated stops labial /pʰ/, retroflex /ʈʰ/, and velar /kʰ/; and the voiced consonants bilabial /b/, alveolar /d/, and velar /ɡ/. VOT is an important acoustic cue in the production of plosives and has been investigated extensively across numerous languages. Machine learning modelling of VOT in second language (L2) learning yields useful data for phonetics, speech processing, and linguistics. To analyze and interpret the data, the study applies computational techniques, namely speech recognition and machine learning modelling. The sample consisted of thirty Sindhi-speaking second language learners, who recorded voice samples, and the results confirmed the hypotheses. The study offers insights into VOT patterns and variation in the two languages, which can aid the development of better speech recognition algorithms and language teaching materials. © 2024 Elsevier B.V. All rights reserved.