Run PySpark on your Windows machine

1) Download the Spark distribution to your local machine and decompress the archive.
Then set the SPARK_HOME and HADOOP_HOME env variables to point to the decompressed folder location – For example: C:\Users\some_user\PycharmProjects\spark-2.4.4-bin-hadoop2.7
Also download the winutils executable (available online) and place it in the Spark bin folder.

2) Install the Java JDK from Oracle’s website if you do not have it already.
Then set the JAVA_HOME env variable to point to the folder where the JDK is installed – For example: C:\Program Files\Java\jdk

3) Update the PATH env variable to include the bin folders under the above env variables – So for example: %SPARK_HOME%\bin, %HADOOP_HOME%\bin and %JAVA_HOME%\bin

4) Install the pyspark package via conda or anaconda-project – this will also pull in py4j as a dependency.

5) Define another env variable – PYSPARK_PYTHON – and point it to the python executable from the conda env that has pyspark installed.

6) Open a new command prompt and run the “pyspark” executable as a test – it should start a Spark shell without errors.
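As a quick sanity check for the steps above, you can verify from Python that the variables are visible before launching pyspark (a sketch; `check_env` is a hypothetical helper name):

```python
import os

def check_env(var):
    """Return the value of an environment variable, printing a note if unset."""
    value = os.environ.get(var)
    if value is None:
        print(f"{var} is not set")
    return value

# Variables from the setup steps above
for name in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME", "PYSPARK_PYTHON"):
    check_env(name)
```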


java-jdk in pyspark project

A pyspark project running locally requires the JAVA_HOME environment variable to be set. If you’re using conda or anaconda-project to manage packages, you do not need to install the bloated Oracle Java JDK; instead, add the java-jdk package from the bioconda (linux) or cyclus (linux and win) channel and point JAVA_HOME to the bin folder of your conda env, since that is where java.exe lives.
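The lookup above can be sketched in Python – `java_home_from_path` is a hypothetical helper that locates the env’s java executable via the PATH and derives the folder to use:

```python
import os
import shutil

def java_home_from_path():
    """Return the folder containing the java executable, or None if not found."""
    java_exe = shutil.which("java")
    return os.path.dirname(java_exe) if java_exe else None

# Point JAVA_HOME at the conda env's java, per the note above
home = java_home_from_path()
if home:
    os.environ["JAVA_HOME"] = home
```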


Handle bimodal dataset

If you have a bimodal (or higher-order multimodal) dataset, calculate the mean, subtract it from each value, and take the absolute value of the result:

abs(row['ColumnX'] - m)

Or apply the central limit theorem and classify points into the two distributions.
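A minimal numpy sketch of the fold-about-the-mean step above, assuming two roughly symmetric modes (`fold_about_mean` is a hypothetical helper name):

```python
import numpy as np

def fold_about_mean(x):
    """Return |x - mean(x)| for a 1-D array, folding the two modes together."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - x.mean())

# Two modes near 2 and 10; the overall mean is 6
sample = np.array([1.0, 2.0, 3.0, 9.0, 10.0, 11.0])
print(fold_about_mean(sample))  # distances from the mean: [5. 4. 3. 3. 4. 5.]
```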


Handle numpy log of zero or small values

Pass a condition in the where param of np.log to skip entries that would produce -inf. Note that positions where the condition is False are left uninitialized unless you also supply an out array holding the fallback values:

>>> import numpy as np
>>> x = np.array([0, 1, 2])
>>> np.log(x)
__main__:1: RuntimeWarning: divide by zero encountered in log
array([      -inf, 0.        , 0.69314718])
>>> np.log(x, where=x > 0, out=np.zeros_like(x, dtype=float))
array([0.        , 0.        , 0.69314718])


Change the comma separators in Excel to the Lakhs convention instead of the Millions convention

Use the custom number format below:

[>=10000000]##\,##\,##\,##0;[>=100000]##\,##\,##0;##,##0
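For reference, the grouping that format produces (last three digits form one group, then groups of two) can be sketched in Python – `lakh_format` is a hypothetical helper, not an Excel feature:

```python
def lakh_format(n):
    """Format an integer with Indian-style (lakh/crore) digit grouping."""
    s = str(abs(int(n)))
    if len(s) <= 3:
        grouped = s
    else:
        head, tail = s[:-3], s[-3:]  # rightmost group of three
        parts = []
        while len(head) > 2:         # remaining groups of two
            parts.insert(0, head[-2:])
            head = head[:-2]
        if head:
            parts.insert(0, head)
        grouped = ",".join(parts + [tail])
    return ("-" if n < 0 else "") + grouped

print(lakh_format(12345678))  # 1,23,45,678
```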
