Ben Chuanlong Du's Blog

It is never too late to learn.

Tips on C++

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

IDE

  1. Eclipse CDT is a good IDE for C/C++ development on Unix/Linux systems. Configuring Eclipse CDT on Windows is not pleasant. NetBeans and Code::Blocks are good …

Algorithms and Tools for Encryption

  1. RSA is the most widely used public-key (asymmetric) algorithm, but it is computationally expensive.

  2. A good compromise is to use RSA to encrypt the symmetric key that is then used in AES encryption of the …
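
The hybrid scheme in item 2 can be sketched roughly as follows. This is only a minimal illustration, assuming the third-party Python package cryptography is installed. The choice of AES-GCM, OAEP padding, the key sizes, the sample message, and all variable names are assumptions made for the example, not something these notes prescribe.

```python
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Receiver's RSA key pair (generated here only so the sketch is self-contained).
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# Made-up payload for illustration.
message = b"a long message that would be slow to encrypt with RSA alone"

# Sender: encrypt the payload with a fresh symmetric AES key (fast).
aes_key = AESGCM.generate_key(bit_length=256)
nonce = os.urandom(12)  # must be unique per message for a given key
ciphertext = AESGCM(aes_key).encrypt(nonce, message, None)

# Sender: encrypt only the short AES key with RSA (slow, but the input is tiny).
oaep = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)
wrapped_key = public_key.encrypt(aes_key, oaep)

# Receiver: recover the AES key with the RSA private key, then decrypt the payload.
recovered_key = private_key.decrypt(wrapped_key, oaep)
plaintext = AESGCM(recovered_key).decrypt(nonce, ciphertext, None)
```

The sender would transmit wrapped_key, nonce, and ciphertext together; only the holder of the RSA private key can unwrap the AES key.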

Common Errors Encountered in Scala and Solutions

  1. Java Version Issue

    Unsupported major.minor version

    https://stackoverflow.com/questions/22489398/unsupported-major-minor-version-52-0

java.lang.NoSuchMethodError: scala.Product.$init$

Fixing the Scala error: java.lang.NoSuchMethodError: scala.Product.$init$

It probably …

Read/Write CSV in PySpark

Load Data in CSV Format

  1. .load is a general method for reading data in different formats. You specify the format of the data via the method .format; if you omit it, Spark falls back to its default data source (parquet, unless spark.sql.sources.default is changed). .csv (for both CSV and TSV), .json and .parquet are specializations of .load, so .format is not needed when you use one of these specific loading functions (csv, json, etc.).

Tips on Dataset in PyTorch

  1. If your data fits into CPU memory, it is good practice to save it into one pickle file (or another format that you know how to deserialize). This comes with several advantages. First, it is easier and faster to read from a single big file than from many small files. Second, it avoids the possible system error of opening too many files (even though avoiding lazy data loading is another way to fix that issue). Some example datasets (e.g., MNIST) have separate training and testing files (i.e., 2 pickle files), so that research work based on them can be easily reproduced. I personally suggest that you keep only 1 file containing all data when implementing your own Dataset class. You can always use the function torch.utils.data.random_split
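
The single-file approach above might look like the following sketch, assuming torch is installed. The class name PickleDataset, the file name all_data.pkl, the dictionary layout inside the pickle, and the 8/2 split are all hypothetical choices made for illustration.

```python
import os
import pickle
import tempfile

import torch
from torch.utils.data import Dataset, random_split


class PickleDataset(Dataset):
    """Hypothetical Dataset that loads all samples from one pickle file into memory."""

    def __init__(self, path):
        # The file is opened once and closed again, so no file handles stay open.
        with open(path, "rb") as f:
            data = pickle.load(f)
        self.features = torch.tensor(data["features"], dtype=torch.float32)
        self.labels = torch.tensor(data["labels"], dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]


# Create one small pickle file with made-up data so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "all_data.pkl")
with open(path, "wb") as f:
    pickle.dump(
        {
            "features": [[float(i), float(i) * 2] for i in range(10)],
            "labels": [i % 2 for i in range(10)],
        },
        f,
    )

dataset = PickleDataset(path)

# Split the single dataset into train/test parts; the seeded generator
# makes the split reproducible across runs.
generator = torch.Generator().manual_seed(0)
train_set, test_set = random_split(dataset, [8, 2], generator=generator)
```

Fixing the generator seed is one way to keep the split reproducible even though all the data lives in a single file.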