Recent advances in Machine Learning (ML) and Deep Learning (DL) have led to many exciting challenges and opportunities for CS and AI researchers alike. Modern ML/DL and Data Science frameworks, including TensorFlow, PyTorch, Dask, and several others, offer the ease of use and flexibility to train and deploy various types of ML models and Deep Neural Networks (DNNs). In this tutorial, we provide an overview of interesting trends in ML/DL and how cutting-edge hardware architectures and high-performance interconnects are playing a key role in moving the field forward. We also present an overview of different DNN architectures and ML/DL frameworks. Most ML/DL frameworks started with a single-node design; however, approaches to parallelize model training are being actively explored. The AI community has adopted different distributed training designs that exploit communication runtimes such as gRPC, MPI, and NCCL. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU and GPU architectures and efficiently support large-scale distributed training. We also highlight some of our co-design efforts to utilize MPI for large-scale DNN training on the cutting-edge CPU and GPU architectures available on modern HPC clusters. The tutorial covers training traditional ML models, including K-Means, linear regression, and nearest neighbors, using the cuML framework accelerated with MVAPICH2-GDR. The tutorial also presents how to accelerate GPU-based data science applications using the MPI4Dask package, which provides an MPI-based backend for Dask. Throughout the tutorial, we include hands-on exercises that enable attendees to gain first-hand experience running distributed ML/DL training and Dask on a modern GPU cluster.
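
To make the distributed-training discussion concrete, the following is a minimal sketch of data-parallel DNN training over an MPI communication runtime using PyTorch's DistributedDataParallel. The model, data, and hyperparameters are placeholders chosen for illustration, and the sketch assumes a PyTorch build with MPI support launched via mpirun (with a CUDA-aware MPI such as MVAPICH2-GDR for GPU tensors); the backend string could equally be "nccl".

```python
# Minimal data-parallel training sketch; assumes launch via `mpirun -np <N> python train.py`
# and a PyTorch build with the "mpi" backend enabled. Model/data are dummies for illustration.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="mpi")            # rank/world size come from the MPI launcher
    rank = dist.get_rank()
    device = torch.device("cuda", rank % torch.cuda.device_count())

    model = torch.nn.Linear(128, 10).to(device)       # placeholder network
    ddp_model = DDP(model, device_ids=[device.index]) # gradients are allreduced across ranks
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(10):                            # dummy training loop on random data
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 10, (32,), device=device)
        optimizer.zero_grad()
        loss_fn(ddp_model(x), y).backward()           # backward pass triggers the allreduce
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```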
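
For the traditional ML portion, the sketch below shows single-GPU K-Means training with cuML's scikit-learn-like API; the synthetic data and parameter values are illustrative assumptions, not part of the tutorial material. The multi-GPU variants covered in the tutorial build on cuML's Dask-based estimators, with communication accelerated by MVAPICH2-GDR.

```python
# Single-GPU K-Means with cuML on synthetic data (illustrative values only).
import cupy as cp
from cuml.cluster import KMeans

X = cp.random.random((10_000, 16)).astype(cp.float32)  # 10k samples, 16 features on the GPU

km = KMeans(n_clusters=8, random_state=0)
labels = km.fit_predict(X)                              # cluster assignment per sample
print(km.cluster_centers_.shape)                        # (8, 16)
```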
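
Finally, a generic GPU Dask workflow is sketched below. The application code is standard Dask/Dask-CUDA; the point is that with MPI4Dask the inter-worker communication layer is MPI-based rather than TCP, while the user-facing computation remains unchanged. The cluster setup, array sizes, and the specific reduction are assumptions for illustration, and the exact MPI4Dask launch options are not shown here.

```python
# Illustrative Dask-CUDA workflow; under MPI4Dask the workers would exchange data
# over an MPI-based backend instead of TCP, with no change to this computation code.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask.array as da

cluster = LocalCUDACluster()      # one Dask worker per local GPU
client = Client(cluster)

x = da.random.random((50_000, 100), chunks=(5_000, 100))
result = (x.T @ x).sum().compute()  # distributed matrix product followed by a reduction
print(result)

client.close()
cluster.close()
```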