Handle various file types in Python for Data Science | Pandas

Zahash Z
Published in The Startup · 5 min read · Jan 30, 2020

If you have been using pandas for a while, then you might already know how to import simple CSV files. But in reality, data is stored in many file formats. What would you do if your side project requires you to access, say, MATLAB, JSON, Stata or even HDF5 files in Python?

This article will show you how easy it is to import those files using pandas.

Let’s start off with

Pickle files

pickle is a file format native to Python, which means you generally cannot use the data from other programming languages.

The idea behind pickle is to serialize a given Python object into a byte stream and store it in a file, commonly saved with a ‘.pickle’ or ‘.pkl’ extension.

To retrieve the contents from the file, you have to unpickle (deserialize) the pickled file.

PS: Everything (lists, dictionaries, strings, …) in Python is an object.

First, import the library (you don’t have to install anything since it comes with the standard library):

import pickle

Then create some dummy data.

Both ‘l’ and ‘d’ contain the same data but with different datatypes.

# nested list
l = [['col_1', 'col_2', 'col_3'],
     [27, 34, 'zahash'],
     [28, 30, 'is'],
     [30, 33, 'an'],
     [12, 23, 'awesome'],
     [20, 27, 'programmer']]

# dictionary
d = {
    'col_1' : [27, 28, 30, 12, 20],
    'col_2' : [34, 30, 33, 23, 27],
    'col_3' : ['zahash', 'is', 'an', 'awesome', 'programmer']
}

Writing pickle file:

# 'wb' means 'write bytes'
# 'pickleFile_L.pickle' will be created if it doesn't exist
# 'dump' the data into the file
with open('pickleFile_L.pickle', 'wb') as pickFile_l:
    pickle.dump(l, pickFile_l)

with open('pickleFile_D.pickle', 'wb') as pickFile_d:
    pickle.dump(d, pickFile_d)

Reading pickle file:

# 'rb' means 'read bytes'
# 'load' the contents of the pickle file into a variable
with open('pickleFile_L.pickle', 'rb') as readFile_l:
    data_l = pickle.load(readFile_l)

with open('pickleFile_D.pickle', 'rb') as readFile_d:
    data_d = pickle.load(readFile_d)

You can store multiple Python objects in a single pickle file by calling dump more than once, but this is not recommended; it is cleaner to wrap everything in a single container object (a tuple or dictionary) and pickle that.
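Here is a minimal sketch of that cleaner alternative (the file name ‘bundle.pickle’ and the dictionary contents are just made-up examples): bundle the objects into one container, pickle it with a single dump, and get everything back with a single load.

```python
import pickle

# wrap several objects in ONE container, then pickle the container
objects = {'scores': [27, 34, 20], 'author': 'zahash'}

with open('bundle.pickle', 'wb') as f:
    pickle.dump(objects, f)

with open('bundle.pickle', 'rb') as f:
    restored = pickle.load(f)

print(restored['author'])  # zahash
```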

After fetching the data, you can make a DataFrame out of it using pandas. Note that the nested list’s first row holds the column names, so it is passed as columns and skipped in the data with l[1:]:

import pandas as pd

df1 = pd.DataFrame(d)
df2 = pd.DataFrame(l[1:], columns=l[0])
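As a quick sanity check, both constructions should produce the same table. A minimal sketch with a trimmed-down copy of the data above:

```python
import pandas as pd

# the first row of the nested list holds the column names
l = [['col_1', 'col_2', 'col_3'],
     [27, 34, 'zahash'],
     [28, 30, 'is']]

d = {'col_1': [27, 28],
     'col_2': [34, 30],
     'col_3': ['zahash', 'is']}

df1 = pd.DataFrame(d)
df2 = pd.DataFrame(l[1:], columns=l[0])

print(df1.equals(df2))  # True
```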

moving onto …

Excel files

You first have to install a few packages to read and write Excel files with pandas (openpyxl covers the modern .xlsx format; xlrd and xlwt are for the legacy .xls format):

pip3 install xlrd
pip3 install xlwt
pip3 install openpyxl

Reading excel files:

sales_data = pd.read_excel('SampleData.xlsx', sheet_name = 'sales')
churn_data = pd.read_excel('SampleData.xlsx', sheet_name = 'churns')

Writing DataFrames to excel files:

df.to_excel('new file.xlsx', sheet_name='d')

If you don’t want to include the index, you can specify the parameter index=False.
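If you want several DataFrames in one workbook, pd.ExcelWriter lets you write each one to its own sheet. A small sketch (the file name, sheet names and data here are made up):

```python
import pandas as pd

sales = pd.DataFrame({'month': ['Jan', 'Feb'], 'revenue': [100, 120]})
churns = pd.DataFrame({'month': ['Jan', 'Feb'], 'churned': [5, 3]})

# one ExcelWriter call puts each DataFrame on its own sheet
with pd.ExcelWriter('report.xlsx') as writer:
    sales.to_excel(writer, sheet_name='sales', index=False)
    churns.to_excel(writer, sheet_name='churns', index=False)
```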

SAS files

It’s easy to read SAS files, but unfortunately pandas cannot write them. You can instead save the DataFrame as a .csv file and use the SAS software to convert it.

You can import both (.sas7bdat) and (.xport) files

Reading SAS files:

sas_df = pd.read_sas('airline.sas7bdat')
sas_df.head()

Output:

YEAR      Y      W       R      L      K
0 1948.0 1.214 0.243 0.1454 1.415 0.612
1 1949.0 1.354 0.260 0.2181 1.384 0.559
2 1950.0 1.569 0.278 0.3157 1.388 0.573
3 1951.0 1.948 0.297 0.3940 1.550 0.564
4 1952.0 2.265 0.310 0.3559 1.802 0.574

Stata files

Pandas also has support for reading and writing Stata files

Reading Stata files:

stata_df = pd.read_stata('airline.dta')

Output:

YEAR      Y      W       R      L      K
0 1948.0 1.214 0.243 0.1454 1.415 0.612
1 1949.0 1.354 0.260 0.2181 1.384 0.559
2 1950.0 1.569 0.278 0.3157 1.388 0.573
3 1951.0 1.948 0.297 0.3940 1.550 0.564
4 1952.0 2.265 0.310 0.3559 1.802 0.574

Writing Stata files:

df.to_stata('some_file_name.dta')
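A quick round-trip sketch (the file name and data are made up); write_index=False stops pandas from saving the DataFrame index as an extra column in the .dta file:

```python
import pandas as pd

df = pd.DataFrame({'year': [1948, 1949], 'y': [1.214, 1.354]})

# write the DataFrame out, then read it straight back
df.to_stata('airline_copy.dta', write_index=False)
back = pd.read_stata('airline_copy.dta')

print(list(back.columns))  # ['year', 'y']
```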

HDF5 files

Hierarchical Data Format (HDF) is an open-source file format for storing huge amounts of numerical data typically used in research.

LIGO, for example, uses the HDF5 format to store terabytes of gravitational-wave data, which shows how robust the format is.

HDF5 files have a structure similar to nested dictionaries so working with them is quite easy.

You will need to install the h5py library first:

pip3 install h5py

Reading HDF5 file:

import h5py

with h5py.File('alloc.h5', 'r') as file_obj:
    print('keys : {}'.format(file_obj.keys()))

    data1 = file_obj['DS1']
    print('dataset(DS1) type : {}'.format(type(data1)))

    print('\n printing the first dataset... \n')
    for ele in data1:
        print(ele)

    print('\n')

    data2 = file_obj['DS2']
    print('dataset(DS2) type : {}'.format(type(data2)))

    print('\n printing the second dataset... \n')
    for ele in data2:
        print(ele)

Output:

keys : <KeysViewHDF5 ['DS1', 'DS2']>
dataset(DS1) type : <class 'h5py._hl.dataset.Dataset'>

 printing the first dataset...

[ 0 -1 -2 -3 -4 -5 -6]
[0 0 0 0 0 0 0]
[0 1 2 3 4 5 6]
[ 0  2  4  6  8 10 12]

dataset(DS2) type : <class 'h5py._hl.dataset.Dataset'>

 printing the second dataset...

[ 0 -1 -2 -3 -4 -5 -6]
[0 0 0 0 0 0 0]
[0 1 2 3 4 5 6]
[ 0  2  4  6  8 10 12]

This tells us that the file contains two datasets, DS1 and DS2, each a 4×7 array of integers.

This is a relatively simple file; most HDF5 files are deeply nested.

Writing an HDF5 file:

To keep everything simple and clear, let’s write the same data that we read above.

dataset1 = [[0, -1, -2, -3, -4, -5, -6],
            [0, 0, 0, 0, 0, 0, 0],
            [0, 1, 2, 3, 4, 5, 6],
            [0, 2, 4, 6, 8, 10, 12]]

dataset2 = [[0, -1, -2, -3, -4, -5, -6],
            [0, 0, 0, 0, 0, 0, 0],
            [0, 1, 2, 3, 4, 5, 6],
            [0, 2, 4, 6, 8, 10, 12]]

with h5py.File('some_file_name.h5', 'w') as file_obj:
    file_obj.create_dataset('DS1', data=dataset1)
    file_obj.create_dataset('DS2', data=dataset2)
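The “nested dictionary” structure mentioned earlier goes beyond flat datasets: groups can nest inside groups like directories, with datasets as the leaves. A minimal sketch (the file, group and dataset names are made up):

```python
import h5py

# groups nest like directories; datasets are the 'files' inside them
with h5py.File('nested.h5', 'w') as f:
    run = f.create_group('experiment/run1')  # creates both levels
    run.create_dataset('signal', data=[1, 2, 3])

with h5py.File('nested.h5', 'r') as f:
    print(list(f['experiment/run1'].keys()))  # ['signal']
```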

JSON files

JSON is one of the most common file formats you will encounter both as a Data Scientist and as a Web Developer.

Luckily, Python has a built-in json library that you can just import:

import json

Reading JSON file:

Let’s say you have a JSON file named ‘sample.json’.

with open('sample.json', 'r') as file_obj:
    data = json.load(file_obj)

The data will be stored in the form of a Python dictionary.

Writing JSON file:

Use the argument indent=4 to avoid writing the whole data on a single line in the file. A single giant line is hard to read, and some text editors struggle to open such a file because they have to load the entire line at once.

sort_keys=True will sort the keys in alphabetical order.

data = {'a': [1, 2, 3], 'b': ['lol', 'what', 'if']}

with open("some_file_name.json", "w") as file_obj:
    json.dump(data, file_obj, indent=4, sort_keys=True)

The output file will look like this:

{
    "a": [
        1,
        2,
        3
    ],
    "b": [
        "lol",
        "what",
        "if"
    ]
}

You can also make a DataFrame out of a JSON file by first reading it into a Python dictionary and then running pd.DataFrame(data).
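A minimal, self-contained sketch of that last point (the file name and data are made up); pandas also offers pd.read_json, which can often do this in one step:

```python
import json
import pandas as pd

# create a small JSON file so the example is self-contained
with open('sample2.json', 'w') as f:
    json.dump({'a': [1, 2, 3], 'b': ['x', 'y', 'z']}, f)

# read it back into a dict, then build the DataFrame
with open('sample2.json', 'r') as f:
    data = json.load(f)

df = pd.DataFrame(data)
print(df.shape)  # (3, 2)
```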

That’s it for this article. Thanks for reading :)
