Ben Chuanlong Du's Blog

It is never too late to learn.

Device Management in PyTorch

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

  1. Modules can hold parameters of different types on different devices, so it's not always possible to unambiguously determine the device. The recommended workflow in PyTorch is to create the device object …
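A minimal sketch of one common pattern for this, assuming a single device chosen up front and passed around explicitly. The toy `nn.Linear` model and the tensor shapes are illustrative assumptions, not taken from the post.

```python
import torch
from torch import nn

# Create the device object once, falling back to the CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the (toy) model to that device and create inputs directly on it.
model = nn.Linear(4, 2).to(device)
batch = torch.randn(8, 4, device=device)
output = model(batch)

# A module has no single "device" attribute; inspecting its parameters shows
# which devices it actually spans (exactly one in this simple case).
devices = {p.device for p in model.parameters()}
print(devices)
```

Passing `device` around explicitly avoids having to guess it from a module whose parameters might live on several devices.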

Natural Language Processing Using NLTK

nltk.util.ngrams, nltk.bigrams, nltk.PorterStemmer

from nltk.util import ngrams
sentence = 'this is a foo bar sentences and i want to ngramize it'
n = 6
sixgrams = ngrams(sentence.split …
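To make concrete what an n-gram extraction over a tokenized sentence computes, here is a plain-Python sketch. It is not NLTK's implementation: the helper name `simple_ngrams` is hypothetical, and it assumes whitespace tokenization with no padding.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples (no padding)."""
    # zip over n shifted views of the token list; zip stops at the shortest view.
    return list(zip(*(tokens[i:] for i in range(n))))


sentence = 'this is a foo bar sentences and i want to ngramize it'
tokens = sentence.split()
sixgrams = simple_ngrams(tokens, 6)

for gram in sixgrams:
    print(gram)
```

A sequence of `k` tokens yields `k - n + 1` n-grams, so the 12 tokens above give 7 six-grams.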

Keyword Extraction from Text

Word Stemming

  1. Use an existing stemmer such as nltk.PorterStemmer.

  2. Expand contractions and abbreviations before stemming: didn't -> did not, there's -> there is, Mr. -> Mister, Mrs. -> ..., Ms. -> ..., etc.
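The expansion step in item 2 could be sketched as a small lookup table applied with a regular expression. The table below contains only the mappings spelled out above; the names `EXPANSIONS` and `expand` are hypothetical, and matching is case-sensitive for simplicity.

```python
import re

# Only the mappings given in the notes; extend as needed.
EXPANSIONS = {
    "didn't": "did not",
    "there's": "there is",
    "Mr.": "Mister",
}

# Try longer keys first, and require that a match does not start mid-word.
_PATTERN = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(key) for key in sorted(EXPANSIONS, key=len, reverse=True))
    + r")"
)


def expand(text):
    """Replace every known contraction or abbreviation with its expansion."""
    return _PATTERN.sub(lambda match: EXPANSIONS[match.group(0)], text)


print(expand("Mr. Smith didn't say there's a problem."))
```

Running the expansion before a stemmer keeps forms like "didn't" from being stemmed as a single opaque token.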

Other things

  1. it seems that it is hard to get …

Tips on Dataset in PyTorch

  1. If your data fits in CPU memory, it is good practice to save it in a single pickle file (or another format that you know how to deserialize). This has several advantages. First, reading one big file is easier and faster than reading many small files. Second, it avoids the system error caused by opening too many files (although avoiding lazy data loading is another way to fix that issue). Some example datasets (e.g., MNIST) ship separate training and testing files (i.e., 2 pickle files) so that research based on them can be easily reproduced. I personally suggest keeping only 1 file containing all the data when implementing your own Dataset class. You can always use the function torch.utils.data.random_split
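As an illustration of the single-file approach, here is a minimal sketch that stores a synthetic dataset in one pickle file, loads it into memory inside a `Dataset`, and splits it with `torch.utils.data.random_split`. The class name `PickleDataset`, the `(features, labels)` file layout, the 80/20 split, and the temporary path are all assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path

import torch
from torch.utils.data import Dataset, random_split


class PickleDataset(Dataset):
    """Hypothetical dataset that loads all samples from one pickle file into memory."""

    def __init__(self, path):
        # Assumed layout: the file holds a (features, labels) tuple.
        with open(path, "rb") as fin:
            self.features, self.labels = pickle.load(fin)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        return self.features[index], self.labels[index]


# Write a small synthetic dataset to a single pickle file (illustration only).
path = Path(tempfile.mkdtemp()) / "data.pkl"
with open(path, "wb") as fout:
    pickle.dump((torch.randn(100, 4), torch.randint(0, 2, (100,))), fout)

dataset = PickleDataset(path)

# A seeded generator makes the split reproducible across runs.
generator = torch.Generator().manual_seed(0)
train_set, test_set = random_split(dataset, [80, 20], generator=generator)
print(len(train_set), len(test_set))
```

Seeding the generator gives the same reproducibility benefit as shipping separate training and testing files, while keeping all data in one place.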