Category Archives: Python

java-jdk in pyspark project

A pyspark project running locally requires the JAVA_HOME environment variable to be set. If you’re using conda or anaconda-project to manage packages, you do not need to install the bloated Oracle Java JDK; just add the java-jdk package from the bioconda (Linux) or cyclus (Linux and Windows) channel and point JAVA_HOME to the bin […]
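A minimal sketch of the idea, assuming the java-jdk package was installed into a conda environment at ~/miniconda3/envs/spark (that path is hypothetical; use your own environment prefix). JAVA_HOME is set before the SparkSession is created:

import os
from pyspark.sql import SparkSession

# Hypothetical conda environment prefix where java-jdk placed the JDK;
# adjust to the output of `conda info --envs` on your machine.
os.environ["JAVA_HOME"] = os.path.expanduser("~/miniconda3/envs/spark")

spark = (
    SparkSession.builder
    .master("local[*]")           # run Spark locally on all available cores
    .appName("java-jdk-check")
    .getOrCreate()
)
print(spark.version)
spark.stop()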


Handle bimodal dataset

If you have a bimodal (or higher) dataset, calculate the mean, subtract it from the dataset, and finally take the absolute value: abs(row['ColumnX'] - m). Or apply the central limit theorem and classify points into the two distributions.
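A minimal pandas sketch of the mean-centering step; the DataFrame, the 'ColumnX' name, and the sample values are illustrative assumptions, not the post's data:

import pandas as pd

# Hypothetical bimodal data in a column named 'ColumnX'
df = pd.DataFrame({"ColumnX": [1.0, 1.2, 0.9, 5.1, 4.8, 5.3]})

m = df["ColumnX"].mean()                     # overall mean sits between the two modes
df["distance"] = (df["ColumnX"] - m).abs()   # absolute distance of each point from the mean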


Handle numpy log of smaller values or zero

Pass a condition to handle discrete or continuous values in the where param for np.log:

>>> import numpy as np
>>> x = np.array([0, 1, 2])
>>> np.log(x)
__main__:1: RuntimeWarning: divide by zero encountered in log
array([       -inf, 0.        , 0.69314718])
>>> np.log(x, where=x > 0)
array([0.        , 0.        , 0.69314718])
>>> np.log(x, […]
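A minimal sketch of the same pattern with an explicit out buffer, since where= alone leaves the masked positions of the result uninitialized; the zeros buffer here is an assumption on my part, not part of the original post:

import numpy as np

x = np.array([0, 1, 2], dtype=float)
# Compute log only where x > 0; masked positions keep the 0.0 from the out buffer.
result = np.log(x, out=np.zeros_like(x), where=x > 0)
print(result)  # [0.         0.         0.69314718]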


Tinkering with Apache Hadoop – Map Reduce Framework

I have used a MapReduce-based system at my present employer (Bank of America – Merrill Lynch) to process (read “crunch”) extremely large datasets in a matter of seconds. Sometimes I used it to price bonds in real time; otherwise it was used for data processing and reporting purposes. It is an in-house product, known as the Hugs framework […]


Computing similarities between datasets

Similarity measures are used all over the web and are pretty well known to anyone who has performed Internet searches using a search engine. Imagine the entire Internet, comprising all websites, as one single database that can be divided into two classes: those that can answer your query and those that cannot. […]
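A minimal sketch of one common similarity measure, cosine similarity between two term-count vectors; the vectors and the NumPy implementation are illustrative assumptions, not the post's own code:

import numpy as np

# Hypothetical term-count vectors for a query and a document
query = np.array([1, 0, 2, 1], dtype=float)
doc = np.array([2, 1, 1, 0], dtype=float)

# Cosine similarity: dot product normalised by the vector magnitudes
similarity = query @ doc / (np.linalg.norm(query) * np.linalg.norm(doc))
print(round(similarity, 3))  # ≈ 0.667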
