Run PySpark on your Windows machine

1) Download the Spark distribution to your local machine and decompress the archive.
Then set the SPARK_HOME and HADOOP_HOME env variables to point to the decompressed folder location – For example: C:\Users\some_user\PycharmProjects\spark-2.4.4-bin-hadoop2.7
Also download the winutils executable (available online) and place it in the Spark bin folder.

2) Install the Java JDK from Oracle’s website if you do not have it already.
Then set the JAVA_HOME env variable to point to the folder where the JDK is installed – For example: C:\Program Files\Java\jdk

3) Update the PATH env variable to include the bin folders under the above env variables – So for example: %SPARK_HOME%\bin, %HADOOP_HOME%\bin and %JAVA_HOME%\bin

4) Install the pyspark package via conda or anaconda-project – this will also pull in py4j as a dependency.

5) Define another env variable – PYSPARK_PYTHON – and point it to the python executable from the conda env that has pyspark installed.

6) Open a new command prompt and run the “pyspark” executable as a test – it should start a Spark shell without errors.
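As a quick sanity check for the steps above, you can verify from Python that the variables are visible before launching pyspark (a sketch; `check_env` is a hypothetical helper name):

```python
import os

def check_env(var):
    """Return the value of an environment variable, printing a note if unset."""
    value = os.environ.get(var)
    if value is None:
        print(f"{var} is not set")
    return value

# Variables from the setup steps above
for name in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME", "PYSPARK_PYTHON"):
    check_env(name)
```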


java-jdk in pyspark project

A pyspark project running locally requires the JAVA_HOME environment variable to be set. If you’re using conda or anaconda-project to manage packages, you do not need to install the bloated Oracle Java JDK; instead, add the java-jdk package from the bioconda (linux) or cyclus (linux and win) channel and point JAVA_HOME to the bin folder of your conda env, since that is where java.exe lives.
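The lookup above can be sketched in Python – `java_home_from_path` is a hypothetical helper that locates the env’s java executable via the PATH and derives the folder to use:

```python
import os
import shutil

def java_home_from_path():
    """Return the folder containing the java executable, or None if not found."""
    java_exe = shutil.which("java")
    return os.path.dirname(java_exe) if java_exe else None

# Point JAVA_HOME at the conda env's java, per the note above
home = java_home_from_path()
if home:
    os.environ["JAVA_HOME"] = home
```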


Handle bimodal dataset

If you have a bimodal (or higher-order multimodal) dataset, calculate the mean, subtract it from each value, and take the absolute value of the result:

abs(row['ColumnX'] - m)

Or apply the central limit theorem and classify points into the two distributions.
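A minimal numpy sketch of the fold-about-the-mean step above, assuming two roughly symmetric modes (`fold_about_mean` is a hypothetical helper name):

```python
import numpy as np

def fold_about_mean(x):
    """Return |x - mean(x)| for a 1-D array, folding the two modes together."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - x.mean())

# Two modes near 2 and 10; the overall mean is 6
sample = np.array([1.0, 2.0, 3.0, 9.0, 10.0, 11.0])
print(fold_about_mean(sample))  # distances from the mean: [5. 4. 3. 3. 4. 5.]
```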


Handle numpy log of zero or small values

Pass a condition in the where param of np.log to skip entries that would produce -inf. Note that positions where the condition is False are left uninitialized unless you also supply an out array holding the fallback values:

>>> import numpy as np
>>> x = np.array([0, 1, 2])
>>> np.log(x)
__main__:1: RuntimeWarning: divide by zero encountered in log
array([      -inf, 0.        , 0.69314718])
>>> np.log(x, where=x > 0, out=np.zeros_like(x, dtype=float))
array([0.        , 0.        , 0.69314718])


Change the comma separators in Excel to the Lakhs convention instead of the Millions convention

Use the custom number format below:

[>=10000000]##\,##\,##\,##0;[>=100000]##\,##\,##0;##,##0
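For reference, the grouping that format produces (last three digits form one group, then groups of two) can be sketched in Python – `lakh_format` is a hypothetical helper, not an Excel feature:

```python
def lakh_format(n):
    """Format an integer with Indian-style (lakh/crore) digit grouping."""
    s = str(abs(int(n)))
    if len(s) <= 3:
        grouped = s
    else:
        head, tail = s[:-3], s[-3:]  # rightmost group of three
        parts = []
        while len(head) > 2:         # remaining groups of two
            parts.insert(0, head[-2:])
            head = head[:-2]
        if head:
            parts.insert(0, head)
        grouped = ",".join(parts + [tail])
    return ("-" if n < 0 else "") + grouped

print(lakh_format(12345678))  # 1,23,45,678
```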
