Category Archives: Python

java-jdk in pyspark project

A pyspark project running locally requires the JAVA_HOME environment variable to be set. If you’re using conda or anaconda-project to manage packages, you do not need to install the bloated Oracle Java JDK; just add the java-jdk package from the bioconda (Linux) or cyclus (Linux and Windows) channel and point JAVA_HOME to the bin […]
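A minimal sketch of the idea, assuming the java-jdk package was installed into a conda environment at ~/miniconda3/envs/spark (that path is hypothetical; use your own environment prefix). JAVA_HOME is set before the SparkSession is created:

import os
from pyspark.sql import SparkSession

# Hypothetical conda environment prefix where java-jdk placed the JDK;
# adjust to the output of `conda info --envs` on your machine.
os.environ["JAVA_HOME"] = os.path.expanduser("~/miniconda3/envs/spark")

spark = (
    SparkSession.builder
    .master("local[*]")           # run Spark locally on all available cores
    .appName("java-jdk-check")
    .getOrCreate()
)
print(spark.version)
spark.stop()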


Handle bimodal dataset

If you have a bimodal (or higher) dataset, calculate the mean, subtract it from the dataset, and finally take the absolute value: abs(row['ColumnX'] - m). Or apply the central limit theorem and classify points into the two distributions.
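A minimal pandas sketch of the mean-centering step; the DataFrame, the 'ColumnX' name, and the sample values are illustrative assumptions, not the post's data:

import pandas as pd

# Hypothetical bimodal data in a column named 'ColumnX'
df = pd.DataFrame({"ColumnX": [1.0, 1.2, 0.9, 5.1, 4.8, 5.3]})

m = df["ColumnX"].mean()                     # overall mean sits between the two modes
df["distance"] = (df["ColumnX"] - m).abs()   # absolute distance of each point from the mean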


Handle numpy log of smaller values or zero

Pass a condition to handle discrete or continuous values in the where param for np.log:

>>> import numpy as np
>>> x = np.array([0, 1, 2])
>>> np.log(x)
__main__:1: RuntimeWarning: divide by zero encountered in log
array([       -inf, 0.        , 0.69314718])
>>> np.log(x, where=x > 0)
array([0.        , 0.        , 0.69314718])
>>> np.log(x, […]
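A minimal sketch of the same pattern with an explicit out buffer, since where= alone leaves the masked positions of the result uninitialized; the zeros buffer here is an assumption on my part, not part of the original post:

import numpy as np

x = np.array([0, 1, 2], dtype=float)
# Compute log only where x > 0; masked positions keep the 0.0 from the out buffer.
result = np.log(x, out=np.zeros_like(x), where=x > 0)
print(result)  # [0.         0.         0.69314718]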


Tinkering with Apache Hadoop – Map Reduce Framework

I have used a MapReduce-based system at my present employer (Bank of America – Merrill Lynch) to process (read “crunch”) extremely large datasets in a matter of seconds. Sometimes I used it to price bonds in real time; otherwise it was used for data processing and reporting purposes. It is an in-house product, known as the Hugs framework […]


Computing similarities between datasets

Similarity measures are used all over the web and are pretty well known to anyone who has performed Internet searches using a search engine. Imagine the entire Internet, comprising all websites, as one single database that can be divided into two classes: those that can answer your query and those that cannot. […]
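A minimal sketch of one common similarity measure, cosine similarity between two term-count vectors; the vectors and the NumPy implementation are illustrative assumptions, not the post's own code:

import numpy as np

# Hypothetical term-count vectors for a query and a document
query = np.array([1, 0, 2, 1], dtype=float)
doc = np.array([2, 1, 1, 0], dtype=float)

# Cosine similarity: dot product normalised by the vector magnitudes
similarity = query @ doc / (np.linalg.norm(query) * np.linalg.norm(doc))
print(round(similarity, 3))  # ≈ 0.667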
