Data processing for Machine Learning






















======= ======================================
2) Missing Values








=================== ===========================

Scaling







Some Additional things in Jupyter Note Book

Pandas Very useful methods for data analysis



1) Importing Pandas library & looking at versions.


-------

2) Reading the data which resides in Excel file.




-------

3) DF.[Column].unique(), get unique records in the column

# the following code gets unique records of column, 'Region' , values are :['Central', 'East', 'West', 'South'],

type: Numpy array
code: type(Sample_Superstore_Updated['Region'].unique())


-------

3)

4) DF.info() function gets the information about the DataFrame, it includes number of records, Index, Data type of the input(DataFrame), number of not null records in each column, data type

Columns for each different kind of data type.




































# removing stop words
# import nltk
# from nltk.corpus import stopwords
# stop_words_1 = list(set(stopwords.words('english')))

# import spacy
# en = spacy.load('en_core_web_sm')
# stop_words_2 = list(set(en.Defaults.stop_words))
# stop_words_total = list(set(stop_words_2 + stop_words_1))
# # print( "len(stop_words):" , len(stop_words_total),
# # "\nlen(stop_words_2)-SPACY:" , len(stop_words_2),
# # "\nlen(stop_words_1)-NLTK:" , len(stop_words_1)
# # )

# removing stop words
#https://gist.github.com/sebleier/554280
# import nltk
# from nltk.corpus import stopwords
# stop_words_1 = list(set(stopwords.words('english')))

# import spacy # en = spacy.load('en_core_web_sm') # stop_words_2 = list(set(en.Defaults.stop_words)) # stop_words_total = list(set(stop_words_2 + stop_words_1)) # print( "len(stop_words):" , len(stop_words_total), # "\nlen(stop_words_2)-SPACY:" , len(stop_words_2), # "\nlen(stop_words_1)-NLTK:" , len(stop_words_1) # ) # import requests # stopwords_list = requests.get("https://gist.githubusercontent.com/rg089/35e00abf8941d72d419224cfd5b5925d/raw/12d899b70156fd0041fa9778d657330b024b959c/stopwords.txt").content # stopwords = set(stopwords_list.decode().splitlines())
# stopwords



Previous Post Next Post

Contact Form