Run PySpark on your Windows machine

1) Download the Spark distribution to your local machine and decompress the archive.
Then set the SPARK_HOME and HADOOP_HOME env variables to point to the decompressed folder – for example: C:\Users\some_user\PycharmProjects\spark-2.4.4-bin-hadoop2.7
Also look up the winutils.exe executable online (matching your Hadoop version, 2.7 here) and put it in the bin folder of that same Spark directory – see the example below.
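If you prefer the command prompt over the Environment Variables dialog, setx persists a variable for your user account. A minimal sketch, reusing the example path from above (adjust it to wherever you extracted Spark):

    rem persist SPARK_HOME and HADOOP_HOME for the current user
    setx SPARK_HOME "C:\Users\some_user\PycharmProjects\spark-2.4.4-bin-hadoop2.7"
    setx HADOOP_HOME "C:\Users\some_user\PycharmProjects\spark-2.4.4-bin-hadoop2.7"

Variables written with setx only appear in command prompts opened afterwards, so close and reopen any open terminals.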

2) Install the Java JDK from Oracle’s website if you do not have it already.
Then set the JAVA_HOME env variable to point to the folder where the JDK is installed – for example: C:\Program Files\Java\jdk
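The same setx approach works here; the path below is only the example location from above, so point it at your actual JDK folder (note that Spark 2.4.x is built for Java 8):

    rem persist JAVA_HOME for the current user
    setx JAVA_HOME "C:\Program Files\Java\jdk"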

3) Update the PATH env variable to include the bin folders for the variables above – for example: %SPARK_HOME%\bin, %HADOOP_HOME%\bin and %JAVA_HOME%\bin
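For a quick, session-only check you can extend PATH in the current command prompt with set (this is lost when the window closes; for a permanent change use the Environment Variables dialog, since setx can truncate long PATH values):

    rem temporary PATH update for the current session only
    set PATH=%PATH%;%SPARK_HOME%\bin;%HADOOP_HOME%\bin;%JAVA_HOME%\bin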

4) Install the pyspark package via conda or anaconda-project – this will also pull in py4j as a dependency.
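For example, with a dedicated conda environment (the name spark_env is just a placeholder, and python=3.7 is chosen because Spark 2.4 predates Python 3.8 support):

    rem create an environment with pyspark; py4j is installed as a dependency
    conda create -n spark_env python=3.7 pyspark
    conda activate spark_env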

5) Define one more env variable – PYSPARK_PYTHON – and point it to the python executable of the conda env where pyspark is installed.
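Again with setx; the path below assumes an Anaconda install in your user profile and the spark_env environment from the previous step, so substitute the python.exe of whatever environment you actually use:

    rem tell Spark which Python interpreter to launch for the workers
    setx PYSPARK_PYTHON "C:\Users\some_user\Anaconda3\envs\spark_env\python.exe"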

6) Open a new command prompt and run the “pyspark” executable as a test – you should get an interactive PySpark shell, as in the smoke test below.
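A minimal smoke test, assuming the steps above are in place (spark is the SparkSession the pyspark shell creates for you automatically):

    pyspark

Then, at the >>> prompt inside the shell:

    spark.range(100).count()   # a trivial job; should return 100
    exit()

If the count comes back and the shell exits cleanly, the setup works; if pyspark complains about winutils, Java or Python instead, revisit the corresponding step above.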
