Tag Archives: Python

Run pyspark on your Windows machine

1) Download the Spark distribution to your local machine and decompress the archive. Then set the SPARK_HOME and HADOOP_HOME environment variables to point to the decompressed folder, for example: C:\Users\some_user\PycharmProjects\spark-2.4.4-bin-hadoop2.7. Also find the winutils executable online and put it in the Spark bin folder.
2) Install Java JDK if you do not […]
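As a rough sketch of step 1, the two environment variables can also be set from Python for the current process before pyspark is imported. The helper name `configure_spark_env` is hypothetical, and the check assumes the winutils executable is named `winutils.exe` and sits in the `bin` subfolder; adjust both to your setup.

```python
import os
from pathlib import Path


def configure_spark_env(spark_dir):
    """Point SPARK_HOME and HADOOP_HOME at the decompressed Spark folder.

    Hypothetical helper: returns True if winutils.exe is present in the
    bin subfolder (assumed file name and location), False otherwise.
    """
    spark_dir = Path(spark_dir)
    # Only affects the current process and any child processes it starts.
    os.environ["SPARK_HOME"] = str(spark_dir)
    os.environ["HADOOP_HOME"] = str(spark_dir)
    return (spark_dir / "bin" / "winutils.exe").is_file()


if __name__ == "__main__":
    # Example path taken from the step above; replace with your own folder.
    found = configure_spark_env(
        r"C:\Users\some_user\PycharmProjects\spark-2.4.4-bin-hadoop2.7"
    )
    print("winutils.exe found:", found)
```

Setting the variables this way is not persistent; for a permanent setup they would still need to be defined in the Windows environment settings.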


Handle numpy log of smaller values or zero

Pass a condition to handle discrete or continuous values in the where param for np.log:

    >>> import numpy as np
    >>> x = np.array([0, 1, 2])
    >>> np.log(x)
    __main__:1: RuntimeWarning: divide by zero encountered in log
    array([ -inf, 0. , 0.69314718])
    >>> np.log(x, where=x > 0)
    array([0. , 0. , 0.69314718])
    >>> np.log(x, […]
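One caveat with the where argument: positions where the condition is false are left untouched, so without an explicit out array they hold uninitialized memory and are not guaranteed to be 0. A minimal sketch that makes the fallback value explicit is shown here; the helper name `safe_log` and its `fill` parameter are illustrative, not part of NumPy.

```python
import numpy as np


def safe_log(x, fill=0.0):
    """Natural log of x where x > 0; `fill` everywhere else.

    Hypothetical helper: pre-fills the output so that entries skipped
    by `where` have a defined value.
    """
    x = np.asarray(x, dtype=float)
    out = np.full_like(x, fill)          # defined value for skipped entries
    np.log(x, out=out, where=x > 0)      # log is only computed where x > 0
    return out


if __name__ == "__main__":
    print(safe_log([0, 1, 2]))
    print(safe_log([0, 1, 2], fill=-np.inf))
```

Choosing `fill` is a modelling decision: 0 is convenient for sums such as entropy terms, while `-np.inf` preserves the mathematical limit without triggering the warning.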


Tinkering with Apache Hadoop – Map Reduce Framework

I have used a Map Reduce based system at my present employer (Bank of America – Merrill Lynch) to process (read “crunch”) extremely large datasets in a matter of seconds. Sometimes I used it to price bonds in real time; otherwise it was used for data processing and reporting. It is an in-house product, known as the Hugs framework […]


Computing similarities between datasets

Similarity measures are used all over the web and are familiar to anyone who has run a query through a search engine. Assume the entire internet, comprising all websites, is a single database that can be divided into two classes: those which can answer your query and those which cannot. […]


Wrote my first strategy game – KALAH!

Kalah, or Mancala, is a two-player strategy game that heavily favors the player who moves first. Each participant starts with a certain number of seeds, which are randomly allocated. At every turn, the player tries to grab seeds from the opponent’s house. The goal is to finish the game with the […]