[Machine Learning with Python] How to get your data?
Using the Pandas Library
The simplest way is to read data from .csv files and store it as a DataFrame object:
import pandas as pd
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
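Once loaded, you can inspect the DataFrame right away; a minimal sketch using the df created above:

print(df.head())    # first five rows
print(df.shape)     # (number of rows, number of columns)
print(df.columns)   # column labels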
You can also read .xls files and directly select the rows and columns you are interested in with the skiprows and usecols parameters. You can also set the index column with the index_col parameter.
energy = pd.read_excel('Energy Indicators.xls', sheet_name='Energy', skiprows=8, usecols='E,G', index_col=None, na_values=['NA'])
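Note that usecols also accepts Excel-style column letter ranges; a minimal variant of the call above (the range 'C:F' is only for illustration):

energy = pd.read_excel('Energy Indicators.xls', sheet_name='Energy',
                       skiprows=8, usecols='C:F', na_values=['NA'])  # 'C:F' is an illustrative range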
For .txt files, you can also use the read_csv function by specifying the separator:
university_towns = pd.read_csv('university_towns.txt', sep='\n', header=None)
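The same approach works for other delimiters; a minimal sketch, assuming a hypothetical tab-separated file data.tsv:

data = pd.read_csv('data.tsv', sep='\t', header=0)  # 'data.tsv' is a hypothetical file used for illustration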
See more about pandas IO operations at http://pandas.pydata.org/pandas-docs/stable/io.html.
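The same interface covers many other formats as well; a minimal sketch with hypothetical file names:

import pandas as pd

df_json = pd.read_json('records.json')    # 'records.json' is a hypothetical file
df_pickle = pd.read_pickle('frame.pkl')   # 'frame.pkl' is a hypothetical file
tables = pd.read_html('page.html')        # returns a list of DataFrames, one per HTML table in the page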
Using the os Module
Read .csv files:
import os
import csv
for file in os.listdir("objective_folder"):
    with open('objective_folder/' + file, newline='') as csvfile:
        rows = csv.reader(csvfile)  # read the csv file
        for row in rows:  # print each line in the file
            print(row)
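If the files have a header row, csv.DictReader maps each row to its column names for you; a minimal sketch under the same folder layout (objective_folder is the example directory above):

import os
import csv

for file in os.listdir("objective_folder"):
    with open(os.path.join("objective_folder", file), newline='') as csvfile:
        reader = csv.DictReader(csvfile)  # the first row is taken as the header
        for row in reader:                # each row is a dict: {column_name: value}
            print(row)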
Read .xls files:
import os
import xlrd
for file in os.listdir("objective_folder/"):
    data = xlrd.open_workbook('objective_folder/' + file)
    table = data.sheet_by_index(0)  # the first sheet in the workbook
    nrows = table.nrows  # number of rows
    for i in range(nrows):
        if i == 0:  # skip the first row if it defines variable names
            continue
        row_values = table.row_values(i)  # read each row's values
        print(row_values)
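Note that xlrd 2.0 and later only reads the old .xls format; for .xlsx files you can use the openpyxl package instead. A minimal sketch under the same folder layout:

import os
from openpyxl import load_workbook

for file in os.listdir("objective_folder/"):
    wb = load_workbook(os.path.join("objective_folder", file), read_only=True)
    ws = wb.active                                            # the first (active) sheet
    for i, row in enumerate(ws.iter_rows(values_only=True)):
        if i == 0:                                            # skip the header row
            continue
        print(row)                                            # each row is a tuple of cell values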
Download from a Website Automatically
We can also read data directly from a URL. This time, the .csv file is compressed as housing.tgz, so we need to download the file and then decompress it. You can write a small function like the one below to do this. It is a worthwhile effort, because you get the most recent data every time you run the function.
import os
import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
When you call fetch_housing_data(), it creates a datasets/housing directory in your workspace, downloads the housing.tgz file, and extracts housing.csv from it into that directory.
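The six.moves import above is only needed for Python 2 compatibility; on Python 3, the standard library's urllib.request does the same job. A minimal sketch of an equivalent function (the name fetch_housing_data_py3 is just for illustration):

import os
import tarfile
import urllib.request

def fetch_housing_data_py3(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)           # create the directory if it is missing
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)  # download the archive
    with tarfile.open(tgz_path) as housing_tgz:        # extract housing.csv
        housing_tgz.extractall(path=housing_path)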
Now let’s load the data using Pandas. Once again you should write a small function to load the data:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
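Calling it returns a regular DataFrame, so the usual inspection methods apply; a minimal usage sketch:

housing = load_housing_data()
print(housing.head())  # first five rows
housing.info()         # column types and non-null counts (prints directly)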
What’s more?
These are the methods I have encountered so far. In typical environments, your data would be available in a relational database (or some other common datastore) and spread across multiple tables/documents/files. To access it, you would first need to get your credentials and access authorizations and familiarize yourself with the data schema. I will add more methods as I encounter them in the future.
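For the database case mentioned above, pandas can read a SQL query straight into a DataFrame; a minimal sketch, assuming a hypothetical SQLite file mydata.db with a table named measurements:

import sqlite3
import pandas as pd

# 'mydata.db' and the 'measurements' table are hypothetical placeholders
conn = sqlite3.connect('mydata.db')
df = pd.read_sql('SELECT * FROM measurements', conn)
conn.close()
print(df.head())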
