Category Archives: Technology

Run PySpark on your Windows machine

1) Download the Spark library to your local machine and decompress the archive. Then set the SPARK_HOME and HADOOP_HOME environment variables to point to the decompressed folder, for example: C:\Users\some_user\PycharmProjects\spark-2.4.4-bin-hadoop2.7. Also find the winutils executable online and put it in the Spark bin folder.
2) Install the Java JDK if you do not […]
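
The environment variables from step 1 can also be set from Python for the current process. The following is a minimal sketch, assuming the example path above; `configure_spark_env` is a hypothetical helper written for illustration and is not part of PySpark.

```python
import os


def configure_spark_env(spark_dir, env=None):
    """Point SPARK_HOME and HADOOP_HOME at the unpacked Spark folder.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    env["SPARK_HOME"] = spark_dir
    env["HADOOP_HOME"] = spark_dir
    return env


# Example path taken from the post; replace it with your own location.
spark_dir = r"C:\Users\some_user\PycharmProjects\spark-2.4.4-bin-hadoop2.7"

# A plain dict is used here so the demo does not touch the real environment.
demo_env = configure_spark_env(spark_dir, env={})
print(demo_env["SPARK_HOME"])
```

Calling the helper without the `env` argument changes `os.environ` for the running process only, so it would need to run before a Spark session is created.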

java-jdk in pyspark project

A PySpark project that runs locally requires the JAVA_HOME environment variable to be set. If you use conda or anaconda-project to manage packages, you do not need to install the bloated Oracle Java JDK; just add the java-jdk package from the bioconda (Linux) or cyclus (Linux and Windows) channel and point the JAVA_HOME property to the bin […]
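
Before starting PySpark, it can help to confirm that JAVA_HOME is actually visible to the Python process. This is a small sketch only: `java_home_status` is a hypothetical helper, and it checks just that the variable is set and names an existing directory, not that the directory holds a working JDK.

```python
import os


def java_home_status(env=None):
    """Return a short message describing whether JAVA_HOME looks usable.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return "JAVA_HOME is not set"
    if not os.path.isdir(java_home):
        return "JAVA_HOME does not point to an existing directory"
    return "JAVA_HOME is set"


print(java_home_status())
```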

Handle bimodal dataset

If you have a bimodal (or higher-modal) dataset, calculate the mean, subtract it from each value, and take the absolute value of the result: abs(row['ColumnX'] - m). Alternatively, apply the central limit theorem and classify the points into the two distributions.
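
The mean-and-absolute-value step can be sketched with the standard library alone. The sample values below are made up for illustration, and the column name `ColumnX` is the one used in the expression above.

```python
from statistics import mean

# Made-up sample with two clusters, one near 1.5 and one near 9.5.
rows = [{"ColumnX": v} for v in (1.0, 2.0, 9.0, 10.0)]

# Mean of the whole column, then the absolute distance of each row from it.
m = mean(row["ColumnX"] for row in rows)
folded = [abs(row["ColumnX"] - m) for row in rows]

print(m, folded)
```

In this sample both clusters end up at similar distances from the mean, which is the effect the transformation relies on; it works best when the two modes sit roughly symmetrically around the mean.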

Handle numpy log of smaller values or zero

Pass a condition to the where parameter of np.log to handle discrete or continuous values:
>>> import numpy as np
>>> x = np.array([0, 1, 2])
>>> np.log(x)
__main__:1: RuntimeWarning: divide by zero encountered in log
array([ -inf, 0. , 0.69314718])
>>> np.log(x, where=x > 0)
array([0. , 0. , 0.69314718])
>>> np.log(x, […]
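
One detail worth knowing about `where`: entries for which the condition is False are not computed, so they keep whatever the output array already contains. The sketch below passes an explicit `out` array so those entries have a defined value; using 0.0 as the fill value is an assumption made for illustration.

```python
import numpy as np

x = np.array([0, 1, 2])

# Entries where the condition is False keep whatever `out` already holds,
# so pre-fill `out` with the value wanted for x <= 0 (0.0 is an assumption).
result = np.log(x, out=np.zeros(x.shape, dtype=float), where=x > 0)

print(result)
```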

Set up Apache Spark in JetBrains IntelliJ (Scala Version)

Please refer to this post to see how to set up the JetBrains IntelliJ IDE and Scala on your machine. 1) Download the latest version of Apache Spark from http://spark.apache.org/downloads.html 2) Unpack the tarball to the /opt/spark folder 3) Launch IntelliJ, create a new Scala project, and choose SBT as the build tool. 4) Once the IntelliJ IDE is up and all dependencies […]
