My name is Gaurav Pandey. I graduated from Columbia University (Dept. of Computer Science) and have been working as a Software Developer on a variety of technologies in the beautiful city of New York :)

Run PySpark on your Windows machine

1) Download the Spark distribution to your local machine and decompress the archive. Then set the SPARK_HOME and HADOOP_HOME environment variables to point to the decompressed folder, for example: C:\Users\some_user\PycharmProjects\spark-2.4.4-bin-hadoop2.7. Also find the winutils executable online and put it in the Spark bin folder.
2) Install Java JDK if you do not […]
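
As a rough sketch of step 1, the two variables can also be set from Python for the current process before any Spark session is created. The helper name `configure_spark_env` is hypothetical, and the check for `winutils.exe` in the `bin` subfolder simply follows the description above.

```python
import os
from pathlib import Path


def configure_spark_env(spark_dir):
    """Point SPARK_HOME and HADOOP_HOME at the decompressed Spark folder.

    Only affects the current process and its children.
    Returns True if winutils.exe is present in the bin subfolder.
    """
    spark_path = Path(spark_dir)
    os.environ["SPARK_HOME"] = str(spark_path)
    os.environ["HADOOP_HOME"] = str(spark_path)
    # The post places winutils in the Spark bin folder, so look for it there.
    return (spark_path / "bin" / "winutils.exe").is_file()
```

Calling `configure_spark_env(r"C:\Users\some_user\PycharmProjects\spark-2.4.4-bin-hadoop2.7")` would set both variables and report whether the winutils executable is in place.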


java-jdk in a PySpark project

A PySpark project that runs locally requires the JAVA_HOME environment variable to be set. If you’re using conda or anaconda-project to manage packages, you do not need to install the bloated Oracle Java JDK; just add the java-jdk package from the bioconda (Linux) or cyclus (Linux and Windows) channel and point the JAVA_HOME property to the bin […]


Handle bimodal dataset

If you have a bimodal (or multimodal) dataset, calculate the mean, subtract it from each value, and take the absolute value of the result: abs(row['ColumnX'] - m). Alternatively, apply the central limit theorem and classify points into the two distributions.
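
A minimal sketch of the first approach, using only the standard library. The function name `fold_around_mean` and the sample values are made up for illustration, and the example assumes the two modes sit roughly symmetrically around the overall mean.

```python
from statistics import mean


def fold_around_mean(values):
    """Return |v - m| for every value, where m is the mean of the data."""
    m = mean(values)
    return [abs(v - m) for v in values]


# Hypothetical sample with one cluster near 2 and another near 10.
sample = [1, 2, 3, 9, 10, 11]
folded = fold_around_mean(sample)
print(folded)
```

After folding, both clusters land on the same range of distances from the mean, so the data has a single mode.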


Handle numpy log of smaller values or zero

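Pass a condition in the where parameter of np.log to handle zero or very small values, whether the data is discrete or continuous. Without an explicit out array, entries where the condition is false are left uninitialized, so the 0. shown for the masked element is not guaranteed.

>>> import numpy as np
>>> x = np.array([0, 1, 2])
>>> np.log(x)
__main__:1: RuntimeWarning: divide by zero encountered in log
array([ -inf, 0. , 0.69314718])
>>> np.log(x, where=x > 0)
array([0. , 0. , 0.69314718])
>>> np.log(x, […]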
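
One way to make the masked result predictable is to pass a pre-filled `out` array together with `where`. The sketch below does that; the helper name `safe_log` and the fill value of 0.0 are illustrative choices, not part of NumPy.

```python
import numpy as np


def safe_log(values, fill=0.0):
    """Natural log of positive entries; non-positive entries get `fill`."""
    arr = np.asarray(values, dtype=float)
    # Pre-fill the output so masked positions hold a known value.
    out = np.full_like(arr, fill)
    np.log(arr, out=out, where=arr > 0)
    return out


result = safe_log([0, 1, 2])
print(result)
```

Because the zero entry is never passed to the logarithm, no divide-by-zero warning is raised for it.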


Change comma separator in Excel to handle Lakhs instead of Millions formatting convention

Use the custom number format below:
[>=10000000]##\,##\,##\,##0;[>=100000]##\,##\,##0;##,##0
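
To check what the grouping should look like, the same lakh/crore convention can be sketched in Python: the last three digits form one group and the remaining digits are grouped in pairs. The function name `format_indian` is hypothetical, and this only illustrates the grouping rule; it does not reproduce every detail of how Excel applies the format.

```python
def format_indian(n):
    """Group digits Indian-style: last three digits, then pairs."""
    sign = "-" if n < 0 else ""
    digits = str(abs(int(n)))
    if len(digits) <= 3:
        return sign + digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # Peel off pairs from the right of the remaining digits.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return sign + ",".join(groups + [tail])


print(format_indian(12345678))
```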